Thursday, September 30, 2010

Multithreaded Occlusion Culling

One of my personal projects was this optimized occlusion culling engine from Multicore Programming in Spring 2010. I worked on this project with Damon Rocco and Zachary Meister. Zach was the primary author of the paper, I wrote most of the rendering and parallelization code, and Damon took care of most of the geometric algorithms. Our results are detailed in a rather interesting paper found here: Occlusion Culling Paper

This project was interesting because it took a relatively small kernel of functionality, cell-portal culling and BVH culling, and focused intensely on optimizing out all bottlenecks. We attacked the problem on algorithmic, parallel, and micro-architectural fronts.

We used Intel TBB Tasks as our basic method of breaking off units of work. Combined with a recursive formulation of the problem this gave us a natural and efficient way to decompose our parallel work. Tuning of our inner "work" loops also brought insight into the performance considerations of small loops, memory alignment, and vector arithmetic. Interestingly, we were also able to make algorithmic improvements by exploiting temporal coherence between consecutive frames.

Besides the work discussed in the paper, I also took some effort to make the renderer efficient, which resulted in a roughly %50 speedup over a naive approach. Besides instancing of the models, we did an initial rendering pass that only drew depth. This enabled us to avoid unnecessary fill on pixels that would later be covered. In addition, to avoid unnecessary fill even earlier, we also performed a coarse depth-sort by dividing the world into octants and rendered front-to-back.

A demonstration video of our work is embedded below:

No comments:

Post a Comment