In my pursuit of real-time (60 fps) ray tracing for a game, I have been doing a lot of profiling with perf. One way to quickly analyse the results of a perf record run is to make a FlameGraph. Here's the graph for my ray tracing system:
During my optimization effort, I've found that lining up all the data nicely for consumption by your algorithm works wonders. Have everything ready to go, and blast through it with your SIMD units. For ray tracing, this means having your intersection routines blast through the data, because ray tracing, at its core, is testing rays against shapes. In my game, these shapes are all AABBs, and my intersection code tests 8 AABBs against a single ray in one go. A big contribution to hitting 60 fps is the fact that my scenes use simple geometry: AABBs are almost as simple as spheres, but more practical for world building.
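To illustrate the idea, here is a minimal sketch of a one-ray-versus-eight-boxes slab test on a structure-of-arrays layout. It is not the actual ray_vs_box8 from the game: the names Box8, Ray and slab_test_box8 are made up for this example, and the loop over 8 lanes is plain scalar C++ standing in for what would be 8-wide AVX operations.

```cpp
#include <algorithm>
#include <cstdint>

// Eight boxes stored as structure-of-arrays: each field is one 8-wide lane group,
// which is the shape an 8-wide SIMD load wants. (Hypothetical layout.)
struct Box8 {
    float minx[8], miny[8], minz[8];
    float maxx[8], maxy[8], maxz[8];
};

struct Ray {
    float ox, oy, oz;  // origin
    float ix, iy, iz;  // reciprocal of the direction, computed once per ray
};

// Returns a bitmask: bit i is set when the ray overlaps box i for some t in [0, tmax].
// Assumption: no direction component is zero. A zero component gives an infinite
// reciprocal, and this sketch does not guard against the resulting 0 * inf case.
inline std::uint8_t slab_test_box8(const Ray& r, const Box8& b, float tmax) {
    std::uint8_t mask = 0;
    for (int i = 0; i < 8; ++i) {  // each iteration is one SIMD lane
        const float tx0 = (b.minx[i] - r.ox) * r.ix;
        const float tx1 = (b.maxx[i] - r.ox) * r.ix;
        const float ty0 = (b.miny[i] - r.oy) * r.iy;
        const float ty1 = (b.maxy[i] - r.oy) * r.iy;
        const float tz0 = (b.minz[i] - r.oz) * r.iz;
        const float tz1 = (b.maxz[i] - r.oz) * r.iz;
        // Latest entry and earliest exit over the three slabs.
        float tnear = std::max({std::min(tx0, tx1), std::min(ty0, ty1), std::min(tz0, tz1)});
        float tfar  = std::min({std::max(tx0, tx1), std::max(ty0, ty1), std::max(tz0, tz1)});
        tnear = std::max(tnear, 0.0f);  // clip to the start of the ray
        tfar  = std::min(tfar, tmax);   // clip to the end (a shadow ray stops at the light)
        if (tnear <= tfar) mask |= static_cast<std::uint8_t>(1u << i);
    }
    return mask;
}
```

Because every lane performs the same arithmetic with no branching until the final compare, the loop body maps directly onto multiplies, mins and maxes on 8 floats at a time.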
This is all fine and dandy, but it does expose a new problem: your CPU is busier wrangling the data than doing the actual computation. Even when I cache the paths that primary rays take (from the camera into the scene) for quick reuse, the administration around the intersection tests takes up more time than the tests themselves.
This is visible in the graph above, where the actual tests are in linesegment_vs_box8 (for shadow rays) and ray_vs_box8 (for primary rays). This seems to be a wall I am hitting, and I am having a hard time pushing through it for even more performance.
So my shadow rays are more costly than my primary rays. I have a fixed camera position, so the primary rays traverse the world grid in the same fashion each frame. This, I exploit. But shadow rays go all over the place, of course, and need to dynamically march through my grid.
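For reference, here is a minimal sketch of what marching a ray through a uniform grid can look like, using the well-known 3D DDA stepping scheme. Which traversal the game actually uses is not stated here, so this is an assumption; march_grid, Cell, and the unit-sized cells in a cubic grid of side `size` are all hypothetical.

```cpp
#include <array>
#include <cmath>
#include <limits>
#include <vector>

struct Cell { int x, y, z; };

// Returns the grid cells a ray passes through, in order, until it leaves a
// size^3 grid of unit cells. o is the ray origin, d its direction.
inline std::vector<Cell> march_grid(const std::array<float, 3>& o,
                                    const std::array<float, 3>& d,
                                    int size) {
    const float inf = std::numeric_limits<float>::infinity();
    int cell[3], step[3];
    float tmax[3], tdelta[3];
    for (int a = 0; a < 3; ++a) {
        cell[a] = static_cast<int>(std::floor(o[a]));
        step[a] = d[a] > 0.0f ? 1 : -1;
        if (d[a] != 0.0f) {
            // Distance along the ray to the next cell boundary on this axis.
            const float next = d[a] > 0.0f ? cell[a] + 1.0f : static_cast<float>(cell[a]);
            tmax[a] = (next - o[a]) / d[a];
            tdelta[a] = std::fabs(1.0f / d[a]);  // ray distance per whole cell
        } else {
            tmax[a] = inf;  // never crosses a boundary on this axis
            tdelta[a] = inf;
        }
    }
    auto inside = [&] {
        return cell[0] >= 0 && cell[0] < size &&
               cell[1] >= 0 && cell[1] < size &&
               cell[2] >= 0 && cell[2] < size;
    };
    std::vector<Cell> visited;
    while (inside()) {
        visited.push_back({cell[0], cell[1], cell[2]});
        // Step across whichever boundary the ray reaches first.
        int a = 2;
        if (tmax[0] <= tmax[1] && tmax[0] <= tmax[2]) a = 0;
        else if (tmax[1] <= tmax[2]) a = 1;
        if (tmax[a] == inf) break;  // zero-length direction: nowhere to go
        cell[a] += step[a];
        tmax[a] += tdelta[a];
    }
    return visited;
}
```

Every step involves a compare, a branch and an index update before any box test can start, which is the kind of administration that a cached primary-ray path avoids and a shadow ray cannot.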
In order to alleviate the strain on the CPU a bit, I cut the number of shadow rays in half by computing the shadow for each pixel only once every two frames. So half the shadow information lags by one frame.
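One way to arrange this is sketched below, assuming a checkerboard pattern that flips every frame. The pattern itself, ShadowCache, traces_shadow_this_frame and update_shadows are illustrative assumptions, not the game's code; `trace` stands in for whatever casts the actual shadow ray.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Per-pixel shadow terms that persist from frame to frame (hypothetical layout).
struct ShadowCache {
    int width, height;
    std::vector<float> shadow;  // 1.0 = fully lit, 0.0 = fully shadowed
    ShadowCache(int w, int h)
        : width(w), height(h), shadow(static_cast<std::size_t>(w) * h, 1.0f) {}
};

// Checkerboard that alternates with the frame number: each pixel is
// selected on exactly one of every two consecutive frames.
inline bool traces_shadow_this_frame(int x, int y, std::uint32_t frame) {
    return ((static_cast<std::uint32_t>(x + y) + frame) & 1u) == 0u;
}

// Refreshes the selected half of the pixels and leaves the other half holding
// the previous frame's value. Returns the number of shadow rays traced.
template <typename TraceFn>
int update_shadows(ShadowCache& cache, std::uint32_t frame, TraceFn trace) {
    int traced = 0;
    for (int y = 0; y < cache.height; ++y) {
        for (int x = 0; x < cache.width; ++x) {
            if (!traces_shadow_this_frame(x, y, frame)) continue;  // reuse stale value
            cache.shadow[static_cast<std::size_t>(y) * cache.width + x] = trace(x, y);
            ++traced;
        }
    }
    return traced;
}
```

A checkerboard spreads the stale pixels evenly over the image, so the one-frame lag is never concentrated in one region. Alternating whole rows or columns would halve the ray count just as well.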
So to conclude: if you line up all your geometry beforehand and pack it in sets of 8, then the actual intersection tests take almost no time at all. This makes it possible to do real-time ray tracing at an 800x400 resolution, at 60 frames per second, at 1.5 rays per pixel, on 4 cores equipped with AVX2. To go faster than that, I need to find a way to accelerate the data wrangling.
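To put those numbers in perspective, the ray budget can be worked out as below. The 1.5 rays per pixel presumably breaks down into one primary ray plus a shadow ray on every other frame. The per-core figure assumes the work is spread perfectly evenly over the 4 cores, which is an idealisation.

```cpp
// Figures taken from the text above.
constexpr long long kWidth = 800;
constexpr long long kHeight = 400;
constexpr long long kFps = 60;
constexpr long long kCores = 4;
constexpr double kRaysPerPixel = 1.5;

// 800 * 400 * 1.5 = 480,000 rays per frame.
constexpr double rays_per_frame() {
    return static_cast<double>(kWidth * kHeight) * kRaysPerPixel;
}

// 480,000 * 60 = 28.8 million rays per second.
constexpr double rays_per_second() {
    return rays_per_frame() * static_cast<double>(kFps);
}

// With 4 cores working in parallel, each core has about 139 ns per ray,
// and that has to cover traversal, bookkeeping and the box tests together.
constexpr double nanoseconds_per_ray_per_core() {
    return 1e9 * static_cast<double>(kCores) / rays_per_second();
}
```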