"A parallel ray tracing library is presented for rendering high detail images of three dimensional geometry and computational fields. The library has been developed for use on distributed memory and shared memory parallel computers and can also run on sequential computers. Parallelism is achieved through the use of message passing and threads. It is shown that the library achieves almost linear scalability when run on large distributed memory parallel computers as well as large shared memory parallel computers.
Several applications of parallel rendering are explored including rendering of CAD models, animation, magnetic resonance imaging, and visualization of volumetric flow fields. Ray tracing offers many advantages over polygon rendering techniques, in its innate parallelism, and quality of output"--Abstract, page iii.
VMD remote OSP RT performance for a VCA GPU cluster located 2,100 miles from the VMD workstation and HMD. Oculus Rift DK2 HMD head pose and display updates were maintained at a constant 75 Hz, matching its display refresh rate. All tests computed two subframes in parallel per update, thus accumulating two times the sample counts shown in the subframe samples column for each display update. Ray tracing workload increases as (pixel count × light count × sample count × subframe count).
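The scaling rule in the caption can be made concrete with a few lines of Python. This is only an illustrative sketch: the function name `ray_workload` and the settings in the usage lines are assumptions, not values taken from the measurements.

```python
def ray_workload(pixel_count, light_count, sample_count, subframe_count):
    """Relative ray tracing workload: the product of the four factors."""
    return pixel_count * light_count * sample_count * subframe_count


# Hypothetical settings, chosen only to show how the product scales.
base = ray_workload(1920 * 1080, 1, 4, 2)
doubled_lights = ray_workload(1920 * 1080, 2, 4, 2)
print("workload ratio after doubling the light count:", doubled_lights / base)
```

Because the workload is a plain product, doubling any single factor doubles the number of rays to trace.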
Photo-realistic rendering is the process of turning 3D models into images that are indistinguishable from a photograph. It requires the accurate simulation of light propagation according to the laws of physics. The best known method for solving this problem is Monte Carlo ray tracing, an algorithm that follows the paths of billions of light rays as they reflect off surfaces in a virtual scene. The two key challenges in Monte Carlo ray tracing are (a) carefully selecting a statistically representative set of light paths, and (b) determining the intersection points of the path segments with the scene as quickly as possible. The latter, known as the visibility problem, is solved by the ray tracing kernels and it is usually the most compute intensive part of a rendering system.
Embree provides a Monte Carlo ray tracer as an example. This renderer demonstrates how an efficient rendering system is designed and implemented using Embree's key technologies. The renderer is also an excellent framework for evaluating and comparing different ray tracing kernels in a realistic application scenario.
Figure 1: Progressive rendering of the imperial crown of Austria. A single machine with four Intel® Xeon® processors computes preview images of this 3D model at interactive frame rates (left). The image converges to a better solution within a few seconds (middle). A perfect image (right) only takes about a minute to compute. Model courtesy of Martin Lubich,
Figure 2: Ray tracing simulates the propagation of light in a scene. The figure shows three possible light paths that connect the light source with a pixel in the image plane. The intersection points of the path segments with the scene are computed by the ray tracing kernel.
The Monte Carlo ray tracing algorithm computes the final pixel color by averaging a large number of random samples. While the result is statistically correct for a large number of samples, too few samples result in visible noise artifacts. For a high quality result, hundreds or even thousands of light paths are required per pixel. In practice these paths are usually not chosen entirely at random. Instead, sophisticated algorithms have been developed to select the paths which provide the most information about the scene. The part of the rendering engine responsible for choosing the paths and combining their results is known as the integrator, because the most common mathematical formulation of the problem takes the form of an integral. Every path consists of multiple segments, each of which corresponds to the path of a single virtual photon.
There exist a large number of Monte Carlo ray tracing algorithms. They differ in how the light paths are chosen and what effects are efficiently supported. Path tracing [1] is one example. It traces rays backward from the camera towards the light sources. Another example is stochastic progressive photon mapping [2], where paths are traced both from the camera and from the light sources. The paths are then loosely connected at their end points.
Different applications require different rendering algorithms. This is why Embree only provides an example in this space. We have chosen the path tracing [1] algorithm, because it is simple and it works well in many applications. The architecture of the renderer was inspired by the design of PBRT [3].
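The averaging step described above can be illustrated with a toy estimator. This is a minimal sketch and not Embree's integrator: `radiance` is a hypothetical stand-in for the light carried by one random path, chosen so that the exact pixel value is known to be 1.0, and `estimate_pixel` is an assumed helper name.

```python
import random


def radiance(u):
    """Hypothetical stand-in for the light carried by one random path.

    3*u*u integrates to exactly 1.0 over [0, 1), so the true pixel value is 1.0.
    """
    return 3.0 * u * u


def estimate_pixel(num_samples, rng):
    """Average num_samples random samples, as a Monte Carlo integrator does."""
    total = 0.0
    for _ in range(num_samples):
        total += radiance(rng.random())
    return total / num_samples


rng = random.Random(1234)  # fixed seed so repeated runs give the same numbers
for n in (4, 64, 1024, 16384):
    value = estimate_pixel(n, rng)
    print(f"{n:6d} samples: estimate={value:.4f}  error={abs(value - 1.0):.4f}")
```

With few samples the estimate scatters widely around the true value, which is the visible noise mentioned above. The expected error shrinks roughly with the square root of the sample count, which is why high quality images need so many paths per pixel.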
Figure 3: Coherent rays (left) are used for real-time ray tracing. They are handled very efficiently by packet tracing algorithms. Incoherent rays (right) are more difficult to handle, but they are required for photo-realistic rendering.
The core of a ray tracer is its acceleration structure. Imagine a scene with tens of millions of triangles and billions of rays being traced. The brute force approach of testing every ray against every surface element (typically triangles) for intersection is clearly infeasible. Instead, the triangles are sorted into a spatial data structure that guides the rays to potential intersection candidates. A popular acceleration structure known as a bounding volume hierarchy (BVH) sorts triangles into a hierarchy of boxes, each level containing increasingly smaller subsets of the scene. At each level, the set of triangles is split into two or more sub-sets until the sets are considered small enough. During rendering, a ray only needs to be intersected with triangles that are contained in a box that the ray intersects. Due to the hierarchical nature of the data structure, the majority of boxes and triangles can be quickly discarded, reducing the work per ray to a few dozen ray-box intersection tests and a few ray-triangle intersections.
The acceleration structures are the core contribution of Embree. They take maximum advantage of the latest Intel® CPUs and they are designed for easy integration into other rendering engines. Embree implements a binary BVH as well as a four-wide multi bounding volume hierarchy [5], both with highly optimized single ray traversal kernels. The parallel acceleration structure builders support spatial splits to efficiently handle scenes with problematic geometry such as large diagonal triangles.
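The box hierarchy described above can be sketched in a few dozen lines. This is a simplified illustration, not Embree's builder or traversal kernel. It makes several assumptions: each primitive is represented only by its bounding box (a real leaf would run exact ray-triangle tests on the candidates), nodes are split at the median along the widest axis, and the ray direction has no zero components. All function names are hypothetical.

```python
import random


def union(a, b):
    """Smallest box enclosing boxes a and b; a box is (lo, hi), each a 3-tuple."""
    return (tuple(map(min, a[0], b[0])), tuple(map(max, a[1], b[1])))


def hit_box(origin, inv_dir, box):
    """Slab test: does the ray origin + t * dir, t >= 0, touch the box?"""
    t_near, t_far = 0.0, float("inf")
    for o, inv, lo, hi in zip(origin, inv_dir, box[0], box[1]):
        t0, t1 = (lo - o) * inv, (hi - o) * inv
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    return t_near <= t_far


def build(boxes, leaf_size=4):
    """Median split along the widest axis until a node holds <= leaf_size boxes."""
    bounds = boxes[0]
    for b in boxes[1:]:
        bounds = union(bounds, b)
    if len(boxes) <= leaf_size:
        return {"bounds": bounds, "leaf": boxes}
    axis = max(range(3), key=lambda k: bounds[1][k] - bounds[0][k])
    ordered = sorted(boxes, key=lambda b: b[0][axis] + b[1][axis])
    mid = len(ordered) // 2
    kids = (build(ordered[:mid], leaf_size), build(ordered[mid:], leaf_size))
    return {"bounds": bounds, "kids": kids}


def candidates(node, origin, inv_dir, counter):
    """Return primitives stored in leaves whose bounding box the ray intersects."""
    counter[0] += 1  # count ray-box tests against hierarchy nodes
    if not hit_box(origin, inv_dir, node["bounds"]):
        return []  # the whole subtree is discarded with one test
    if "leaf" in node:
        return list(node["leaf"])
    found = []
    for kid in node["kids"]:
        found += candidates(kid, origin, inv_dir, counter)
    return found


def random_box(rng):
    """Unit cube at a random position, standing in for one small triangle."""
    lo = tuple(rng.uniform(0.0, 100.0) for _ in range(3))
    return (lo, tuple(c + 1.0 for c in lo))


rng = random.Random(42)
scene = [random_box(rng) for _ in range(2000)]
root = build(scene)
origin, direction = (-5.0, 50.3, 49.7), (1.0, 0.01, 0.02)
inv_dir = tuple(1.0 / d for d in direction)
counter = [0]
found = candidates(root, origin, inv_dir, counter)
print(f"{counter[0]} node tests, {len(found)} candidates of {len(scene)} primitives")
```

The printed counts show the effect the text describes: most of the scene is rejected near the top of the hierarchy, so only a small fraction of the primitives ever reaches the exact intersection test.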
Monte Carlo ray tracing is very easy to parallelize with multiple execution threads because all light paths are mutually independent. The image plane is simply subdivided into a set of small tiles. Whenever a thread finishes rendering of its current tile, it picks the next one from the list of unfinished tiles. Scalability on multi-core processors and multi-socket servers is close to linear. On a four-socket server with a total of 40 physical cores, for example, our renderer achieves 95% parallel efficiency.
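The tile scheduling scheme above can be sketched with a shared queue of unfinished tiles. This is an illustration of the scheduling pattern only, not the renderer's actual code: `shade` is a hypothetical placeholder for tracing light paths, and the image and tile sizes are arbitrary. Note also that CPU-bound Python threads in CPython do not run in parallel, so this sketch will not reproduce the reported scaling; a native implementation would use the same structure with real threads.

```python
import queue
import threading

WIDTH, HEIGHT, TILE = 64, 48, 16  # arbitrary sizes for the illustration


def shade(x, y):
    """Hypothetical per-pixel work; a real renderer would trace light paths here."""
    return (x * 31 + y * 17) % 256


def make_tiles(width, height, tile):
    """Cut the image plane into tile x tile rectangles, clipped at the borders."""
    return [(x0, y0, min(x0 + tile, width), min(y0 + tile, height))
            for y0 in range(0, height, tile)
            for x0 in range(0, width, tile)]


def worker(tiles, image):
    """Render tiles until the shared list of unfinished tiles is empty."""
    while True:
        try:
            x0, y0, x1, y1 = tiles.get_nowait()
        except queue.Empty:
            return
        for y in range(y0, y1):
            for x in range(x0, x1):
                image[y][x] = shade(x, y)  # tiles are disjoint, so no lock is needed


def render(num_threads=4):
    """Fill the tile queue, start the workers, and wait for all of them."""
    tiles = queue.Queue()
    for tile in make_tiles(WIDTH, HEIGHT, TILE):
        tiles.put(tile)
    image = [[None] * WIDTH for _ in range(HEIGHT)]
    threads = [threading.Thread(target=worker, args=(tiles, image))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return image


image = render()
print(len(make_tiles(WIDTH, HEIGHT, TILE)), "tiles rendered")
```

Pulling tiles on demand, rather than assigning a fixed set to each thread up front, keeps all threads busy even when some tiles are much more expensive than others.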
In addition to thread parallelism that maps tasks to the cores of a processor, there also is data parallelism that maps computation within a thread to the SIMD (Single Instruction Multiple Data) units of a CPU. Data parallelism is more difficult to exploit than thread parallelism. It works best when multiple collocated data items are processed by the same instruction stream. For real-time ray tracing this can be achieved by treating a set of similar rays as a packet and tracing them together through the acceleration structure. Because the rays are coherent, they are likely to visit the same boxes and intersect the same triangles. This results in excellent performance, because neither the memory access nor the control flow diverges. This scheme, however, breaks when the rays become incoherent. Each ray may travel through a different part of the scene and they may also want to execute different code sections. One ray, for example, might already have found its closest intersection and now wants to proceed to material evaluation, while another ray is still searching for its hit point. Fortunately, there are other strategies to utilize data parallelism. Instead of grouping rays together, we can also group data elements of the acceleration structure together. Embree uses this approach. All rays are traced independently, which greatly simplifies the development of the renderer.
Embree supports two acceleration structures that use the four-wide data parallel instructions provided by Intel® Streaming SIMD Extensions 4 (Intel® SSE4). The first is a bounding volume hierarchy with a branching factor of four [5]. It packs four boxes together in an SSE friendly layout and computes the intersection of a ray with all four of them in parallel. Triangles are treated similarly. The second acceleration structure is a traditional binary bounding volume hierarchy. It also stores the boxes in a special layout in memory and intersects a ray with the near and far planes of two boxes in parallel. This is possible with the fast shuffling operations provided by Intel® SSE4 and a simple arithmetic trick: min(a,b) = -max(-a,-b). This allows us to execute minimum and maximum computations in the same four-wide register by flipping some of the sign bits before and after the computation.
The acceleration structures are carefully optimized to take maximum advantage of the latest Intel® processors. Optimal instruction scheduling, latency minimization, and cache-coherent memory access patterns were important considerations.
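The identity min(a,b) = -max(-a,-b) can be illustrated by emulating a four-wide register with a Python list. This sketch shows one way the trick could be applied to test a ray against two boxes at once. It is not Embree's kernel, and it makes simplifying assumptions: all ray direction components are positive, so the lower box corner is always the near plane, and `vmax` is a hypothetical stand-in for a single SIMD max instruction.

```python
def vmax(a, b):
    """Emulates one four-wide SIMD max instruction on 4-element lists."""
    return [max(x, y) for x, y in zip(a, b)]


def slab_two_boxes(origin, inv_dir, box_a, box_b):
    """Test one ray against two boxes using only max operations.

    Lanes 0-1 accumulate the entry distances of boxes A and B. Lanes 2-3
    accumulate the negated exit distances, so max(-far) yields -min(far).
    Assumes positive direction components (lower corner = near plane).
    """
    acc = [0.0, 0.0, float("-inf"), float("-inf")]
    for k in range(3):
        near_a = (box_a[0][k] - origin[k]) * inv_dir[k]
        near_b = (box_b[0][k] - origin[k]) * inv_dir[k]
        far_a = (box_a[1][k] - origin[k]) * inv_dir[k]
        far_b = (box_b[1][k] - origin[k]) * inv_dir[k]
        # Sign flip before the computation: store -far in the upper lanes.
        acc = vmax(acc, [near_a, near_b, -far_a, -far_b])
    # Sign flip after the computation recovers min(far) for each box.
    t_near_a, t_near_b, t_far_a, t_far_b = acc[0], acc[1], -acc[2], -acc[3]
    return t_near_a <= t_far_a, t_near_b <= t_far_b


origin, inv_dir = (0.0, 0.0, 0.0), (1.0, 1.0, 1.0)  # ray along the main diagonal
box_a = ((1.0, 1.0, 1.0), (2.0, 2.0, 2.0))          # lies on the diagonal
box_b = ((1.0, 5.0, 1.0), (2.0, 6.0, 2.0))          # shifted off the diagonal
print(slab_two_boxes(origin, inv_dir, box_a, box_b))  # prints (True, False)
```

Storing the far distances negated means every lane needs the same operation, which is what allows near and far planes of both boxes to share one register.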
The Task Parallel Library (TPL) is designed to make it much easier to write managed code that can automatically use multiple processors. Using the library, you can conveniently express potential parallelism in existing sequential code, where the exposed parallel tasks will be run concurrently on all available processors. Usually this results in significant speedups.