
Parallel Ray Tracing

Thomas Vesterløkke Christensen

Supervisor: Sven Karlsson

Kongens Lyngby 2009

IMM-MSc-2009-40

Technical University of Denmark

Informatics and Mathematical Modelling

Building 321, DK-2800 Kongens Lyngby, Denmark

Phone +45 45253351, Fax +45 45882673


www.imm.dtu.dk

IMM-MSc: ISSN 0909-3192

Summary

This Master's thesis presents a survey of the global illumination model ray tracing, along with the most commonly used data structures and algorithms for high-performance ray tracing. It then analyzes, implements and compares two different types of parallel ray tracing, finding that the most commonly used type of parallel ray tracing performs 55 to 64 percent better than its counterpart.

Resume

This thesis presents a study of the global illumination model ray tracing, together with the most widely used data structures and algorithms for high-performance ray tracing. The thesis then analyzes, implements and compares two types of parallel ray tracing, and finds that the most widely used type of parallel ray tracing performs 55 to 64 percent better than its counterpart.

Acknowledgements

Whether the problems were big or small, my supervisor, Sven Karlsson, always had the knowledge, skills and time to help me find a solution. During my thesis, he has made me revisit old habits and given me many tools to improve them, not only during my thesis but also in the future, thus making me what I believe to be a better engineer.

For all his encouragement, dedication, and sharing of knowledge and experience I am eternally grateful, and future students would be lucky to have Sven as supervisor.

I would also like to thank Andreas Bærentzen for introducing me to the problem of this thesis, my old friend Steffen Nielsen for supplying me with test models and, finally, Daniel Pohl and Jacco Bikker for taking the time to answer my mails.

Contents

Summary

Resume

Acknowledgements

1 Introduction

2 Schedule

3 Ray tracing
3.1 Ray tracing models
3.2 Ray tracing implementations
3.3 Chosen Model

4 Algorithms and Data structures
4.1 Acceleration Structure
4.2 Chosen Data Structure
4.3 Traversal
4.4 Chosen traversal
4.5 Intersection
4.6 Shading / Physical model
4.7 Summary

5 KD-tree Heuristic
5.1 Choosing split position
5.2 Termination Criteria
5.3 Implemented Heuristic

6 Parallel Ray Tracing
6.1 Pixel Distributed Parallel Ray Tracing
6.2 Balancing the work
6.3 Object Distributed Parallel Ray Tracing
6.4 Others

7 Implementation
7.1 KD-tree construction
7.2 ODPRT
7.3 PDPRT: Load Balancing
7.4 Optimizations

8 Numerical analysis
8.1 Ray creation
8.2 Ray intersection
8.3 Intersection
8.4 Triangle intersection
8.5 Sphere intersection
8.6 Secondary Rays
8.7 Object Distributed Parallel Ray Tracing

9 Testing
9.1 Introduction
9.2 PDPRT
9.3 KD-Tree
9.4 ODPRT
9.5 Memory leaks

10 Results
10.1 Scenes
10.2 Setup
10.3 Parameters
10.4 System
10.5 Maximum Tree depth
10.6 Ray prediction
10.7 Ray tracing comparison

11 Schedule evaluation

12 Conclusion
12.1 Future work
12.2 The future for ray tracing

A Extended results

Chapter 1

Introduction

Over the past decade and a half, the preferred method of real-time rendering has been to use local lighting models. Currently, local lighting models in real-time rendering are almost always achieved with the assistance of the rasterizer of a Graphics Processing Unit (GPU). Local lighting models have the advantage of being computationally cheap while still being able to capture the basic parts of the real world, such as shape, size and view-dependent illumination of objects. However, there are many phenomena they do not handle very well, including shadows, reflections, transparency and many more. While many of these phenomena can be and have been approximated with local lighting models using techniques of varying difficulty, hardware and game developers in particular have started to search for other models that achieve better photo-realism with less effort. One such model is ray tracing, which is a global illumination model [111].

Ray tracing works by shooting an infinitely thin line, formally known as a ray, from the eye through each pixel of the screen. The first object in the scene hit by the ray is the one displayed in the pixel, see Figure 1.1. In this report, object is used to denote the geometry in a scene. Objects are in most cases triangles or spheres.

Figure 1.1: This figure shows the basic principle behind ray tracing. Rays are sent from the eye through each pixel to determine what object should be viewed in the pixel.

Currently, there is much discussion in the industry [95, 55, 102, 96, 93, 94] about whether ray tracing should be pursued. Overall, two major factions exist. First, those who believe in a complete transition from the current rasterizer technology to ray tracing technology. Second, those who believe that a hybrid between rasterizer and ray tracing technology is the most realistic solution. The main argument from the latter group is that ray tracing is much more expensive than what can be achieved with today's rasterizer technology.

This thesis will not take part in this discussion. However, some reflections on the matter will be provided in the conclusion.

In the past couple of years, development of processors has focused on increasing the number of cores in a processor rather than its clock frequency. This development is one of the arguments from the supporters of non-hybrid ray tracing, as each ray is independent of all others. This makes ray tracing extremely easy to parallelize on many platforms.

The purpose of this thesis is to implement and analyze two different algorithms for performing parallel ray tracing.

The first and traditional algorithm, which in this thesis will be referred to as Pixel Distributed Parallel Ray Tracing (PDPRT), divides the work according to the pixels of the image, see Figure 1.2. In this example the screen has been divided into 4 rectangles, where cpu 1 calculates the color for all the pixels in the upper left corner, cpu 2 calculates the color for all the pixels in the upper right corner, etc. No communication or synchronization is needed between the processors, as each pixel is calculated independently of the others. However, the drawback is that each processor must have access to all objects.

Figure 1.2: Distributing work over pixels. The image is divided into four rectangles, one for each of cpu 1 to cpu 4.
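The pixel distribution described above can be sketched in a few lines of Python. This is only a minimal illustration and not the implementation used in this thesis: the helper names `split_into_tiles` and `pixels_of`, the 2 x 2 layout and the 640 x 480 image size are all assumptions.

```python
def split_into_tiles(width, height, cols=2, rows=2):
    """Return one (x0, y0, x1, y1) rectangle per worker, row by row.

    Hypothetical helper: the half-open rectangles cover the image exactly once.
    """
    tiles = []
    for r in range(rows):
        for c in range(cols):
            x0, x1 = c * width // cols, (c + 1) * width // cols
            y0, y1 = r * height // rows, (r + 1) * height // rows
            tiles.append((x0, y0, x1, y1))
    return tiles


def pixels_of(tile):
    """Enumerate the pixel coordinates one worker is responsible for."""
    x0, y0, x1, y1 = tile
    return [(x, y) for y in range(y0, y1) for x in range(x0, x1)]


# tiles[0] is the upper left rectangle (cpu 1), tiles[1] the upper right (cpu 2), ...
tiles = split_into_tiles(640, 480)
```

Because the rectangles do not overlap, each worker can write its pixels without synchronizing with the others, but every worker still needs the whole scene.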

The second algorithm distributes the objects over the processors, see Figure 1.3. In this example the circle has been given to cpu 1, an ellipse and a triangle to cpu 2, etc. This thesis will refer to this algorithm as Object Distributed Parallel Ray Tracing (ODPRT). While each processor can do its work completely independently of the others in PDPRT, more synchronization is required between the processors in ODPRT. In Figure 1.3, the ray is initially sent to cpu 1, which checks if it hits the circle. As this is not the case, the ray is transferred from cpu 1 to cpu 2, which repeats the process, and so on until the ray hits an object. Thus, processors need to be able to receive information from other processors.

ODPRT has two major benefits. First, distributing the objects allows handling larger scenes. Second, it should achieve better spatial locality, i.e. better utilization of the cache. Thus, it should be able to decrease the time spent on loading objects from main memory compared to PDPRT. On the other hand, ODPRT is likely to spend a lot of time sending rays between processors. The goal of this project is therefore to investigate which of the approaches yields the highest performance.
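The forwarding of a ray between processors can be illustrated with a small sequential simulation. The sketch below is not the thesis implementation: the function names, the use of spheres as the only object type and the assumption that the processors are visited front to back along the ray (so that the first hit found is also the nearest) are all mine.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Return the nearest positive ray parameter t, or None on a miss.

    `direction` is assumed to be normalized.
    """
    oc = [o - c for o, c in zip(origin, center)]
    b = sum(d * v for d, v in zip(direction, oc))
    c = sum(v * v for v in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):
        if t > 1e-9:
            return t
    return None


def trace_odprt(origin, direction, processors):
    """Forward the ray from processor to processor until one reports a hit.

    `processors` is a list of (name, spheres) pairs, assumed to be ordered
    front to back along the ray. Returns (name, t, hops), or
    (None, None, hops) if no processor owns an object hit by the ray.
    """
    hops = 0
    for name, spheres in processors:
        hops += 1  # one ray transfer to this processor
        nearest = None
        for center, radius in spheres:
            t = hit_sphere(origin, direction, center, radius)
            if t is not None and (nearest is None or t < nearest):
                nearest = t
        if nearest is not None:
            return name, nearest, hops
    return None, None, hops


processors = [
    ("cpu 1", [((0.0, 5.0, 3.0), 1.0)]),  # missed by the ray below
    ("cpu 2", [((0.0, 0.0, 6.0), 1.0)]),  # hit by the ray below
]
result = trace_odprt((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), processors)
```

The `hops` counter makes the cost mentioned above visible: every processor that does not own the hit object adds one ray transfer.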

Figure 1.3: Distributing work over objects. The objects in the scene are distributed over cpu 1 to cpu 4.

However, to truly test this, the algorithms will not just be implemented on a normal shared-memory multi-core architecture. They will also be implemented on an architecture where transferring data between processors (cores) is expensive. For this purpose, the Cell processor has been chosen. Another benefit of choosing the Cell processor is that it resembles Intel's upcoming Larrabee processor, which is intended as a competitor to the existing GPU technology by NVidia and AMD. Due to time constraints this implementation was not completed and will therefore not be discussed in this thesis.

In this thesis a survey of ray tracing has been performed in order to implement a high-performance ray tracer. Furthermore, two types of parallel ray tracers have been analyzed and implemented. These implementations have then been compared in order to conclude which of the parallel ray tracing approaches is most suitable when aiming at high performance.

Chapter 2 introduces the initial schedule laid out for this thesis. In chapter 3 the ray tracing algorithm is described, followed by chapter 4, which covers the most relevant algorithms and data structures used in today's state-of-the-art ray tracers. Chapter 5 presents common heuristics used to achieve the best performance from data structures in ray tracing. This is followed by chapter 6, which describes the two types of parallel ray tracing analyzed in this thesis. In chapter 7 the details surrounding the implementation are discussed. A numerical analysis of the ray tracing implementation is then presented in chapter 8, followed by a discussion on how to validate the correctness of the implementation in chapter 9. Chapter 10 then presents and discusses the results from the experiments performed in this thesis. In chapter 11 the initial schedule from chapter 2 is compared with and evaluated against the actual process of this thesis. Finally, a conclusion is presented in chapter 12.

Appendix A contains the results from chapter 10 in their entirety.


Chapter 2

Schedule

This section discusses the original schedule for this thesis. It includes analysis of expected risks and expected time frames.

Overall, the plan was to implement the ray tracers in iterations. Thus, the implementation was broken down into several ray tracing implementations:

Simple Sequential ray tracer. A normal ray tracer with only simple data structures and no parallel computations, which could work on both shared memory systems and Cell.

Advanced Sequential ray tracer. A normal ray tracer with no parallel computations using advanced data structures, which could work on both shared memory systems and Cell.

PDPRT on shared memory. A parallel ray tracer using the traditional approach as described in chapter 1.

ODPRT on shared memory. A parallel ray tracer using the less traditional approach as described in chapter 1.

PDPRT on Cell. A simple implementation of PDPRT on Cell.

ODPRT on Cell. A simple implementation of ODPRT on Cell.

PDPRT on Cell with cache. An implementation of PDPRT on Cell with software managed cache.

ODPRT on Cell with cache. An implementation of ODPRT on Cell with software managed cache.

The plan was then to analyze, implement and document each implementation before continuing to the next. I planned to do this in a manner such that the next implementation could benefit from the previous one. In general I expected one week of analysis and implementation to cause one week of documentation time.

At the time of writing this schedule, no literature on ODPRT was known. I therefore considered the implementation of ODPRT to pose the greatest risk, so to meet unexpected problems as early as possible, I found it best to prioritize those implementations. As I had only limited experience programming on the Cell, I found it best to do the ODPRT implementation on Cell before the PDPRT implementation on Cell.

Thus, I intended to start with a sequential ray tracing implementation, where I had previous experience to rely on. After finishing the sequential ray tracing implementation, I expected it to be straightforward to implement PDPRT on a shared memory system. I expected no great risks in implementing these, therefore only a week was put aside for them to be finished.

As I found it difficult to begin implementation of ODPRT before appropriate data structures were in place, I planned to continue with the advanced sequential ray tracing implementation. I had previously experienced numerical problems with data structures in computer graphics. I therefore expected some risks and allocated one week for this implementation.

Once the advanced sequential ray tracing implementation was done, I found it best to implement ODPRT for a shared memory system before continuing to Cell. The reason for this choice is that I believed I would be better equipped to deal with any unexpected algorithmic problems on a shared memory system. Overall, I expected any issues with ODPRT on shared memory systems to be limited to race conditions, which I had much experience handling. Therefore, one week of analysis and implementation was expected to be reasonable.

Afterwards I expected to need time to set up a proper workspace for programming on the Cell and to find relevant literature on the most common problems that needed to be solved on the Cell. Little risk was expected with this process, although it was expected to be time consuming to find literature on specific problems. Thus, a week was allocated for this task.

I then planned to begin analysis and implementation of ODPRT on Cell, which I expected to contain unforeseen problems. Two weeks were therefore allocated for this implementation.

When the ODPRT implementation was working on Cell, the plan was to extend it with a software managed cache to improve performance. I considered this to be a simple change and that the greatest risk was to make it work on Cell. Therefore, one week was expected to be enough.

What then remained was the PDPRT implementation on Cell with and without cache, which I expected to be somewhat easier to finish once the ODPRT implementations on Cell were done. Thus, a week was allocated for each implementation.

Given that all implementations were done, I expected it would take me approximately one week to compare the algorithms and gather results, and approximately one week to document the results.

In order to have time to cope with any major changes in the implementations and the report, I planned to have the first draft of the report finished one month before hand-in.

During my thesis work, I was to attend a project in cooperation with the National Academy of Digital Interactive Entertainment (DADIU¹) for a month in March. During this time I expected it to be unrealistic that work would be performed on this thesis.

Taking all these considerations into account led to the schedule shown in Table 2.1.

During this project the schedule has proven to fail in several ways, which is elaborated in detail in chapter 11.

¹ Det Danske Akademi for Digital Interaktiv Underholdning

January
Week 2: Implement sequential ray tracer and PDPRT for shared memory system
Week 3: Implement advanced sequential ray tracer
Week 4: Write report
Week 5: Analyze and implement ODPRT for shared memory system

February
Week 6: Document work with ODPRT on shared memory system
Week 7: Setup of Cell workspace and search for further literature on Cell programming
Week 8: Analyze and begin implementation of ODPRT on Cell
Week 9: DADIU

March
Week 10: DADIU
Week 11: DADIU
Week 12: DADIU
Week 13: DADIU

April
Week 14: Continue implementing ODPRT on Cell
Week 15: Extend ODPRT on Cell with cache
Week 16: Document work with ODPRT on Cell
Week 17: Implement PDPRT on Cell
Week 18: Extend PDPRT on Cell with cache

May
Week 19: Document work with PDPRT on Cell
Week 20: Compare implementations
Week 21: Document results
Week 22: Finish first draft of report

June
Week 23: Fixing
Week 24: Fixing
Week 25: Fixing
Week 26: Fixing
Week 27: Report hand-in

Table 2.1: This table shows the original schedule laid out for this thesis.

Chapter 3

Ray tracing

The term ray tracing is used in the literature to describe a global illumination model. However, it is also commonly used as a general term covering many different rendering models. This section will clarify what the basic ray tracing algorithm is and what rendering models it usually involves. In addition, the complexity in terms of running time and memory will be discussed. This section will also serve as a reference for ray tracing and how it can be extended to achieve more photo-realistic images. The section is then followed by a survey of the different approaches used to implement ray tracing. Finally, a short section will explain the rendering model chosen for this thesis.

3.1 Ray tracing models

Ray tracing is essentially based on the physical model stating that the color perceived by the human eye is determined by how light sources interact with the environment visible to the eye, e.g. reflection and transmission of light.


3.1.1 Photon Tracing

In ray tracing, this interaction is modeled with infinitely thin light rays emitted from the light source out into the scene. Whenever a ray hits a surface, it reflects and transmits new light rays according to the properties of the surface. This process is repeated recursively until a ray hits the eye. Often the eye is also known as the camera of the scene. When a ray hits the eye, the color at that pixel on the screen is determined by the path taken by the ray from the light source to the eye, see Figure 3.1. How color is usually calculated is covered in section 4.6.

This model is sometimes referred to as photon tracing. The model is very expensive and only used for special purposes such as rendering based on physically correct models [1].

Figure 3.1: This figure shows the principle behind photon tracing. Rays are sent from the light source in all directions. The direction of a ray changes every time it hits a surface. This continues until a ray hits the image.

3.1.2 Ray Casting

The first real ray tracing algorithm was described by Appel [15] in 1968 for performing hidden surface removal and shading of objects. Appel recognized that most of the rays emitted from the light sources would never hit the eye. So instead of sending rays from the light source to the eye, he proposed sending rays from the eye to the light source. Thus, one ray is sent from the eye through each pixel. The first object in the scene intersected by the ray is the one shown in the pixel. Let the point intersected by the ray be known as p. To determine if the object is in shadow at point p, a new ray is sent from p towards each light source. These rays are formally known as shadow rays. If a shadow ray hits an object before the light source, then the object hit prevents the light from reaching p. Thus, p would be in shadow, see Figure 3.2.

Figure 3.2: This figure shows the principle behind ray casting. Rays are sent from the eye through each pixel of the image. When a ray hits a surface, a shadow ray is generated from the hit point to each light source.

Note that ray casting uses $O(nlt)$ time and $O(n)$ memory, where $n$ is the number of pixels, $l$ is the number of light sources and $t$ is the amount of time it takes to calculate the first hit of a ray.
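As a minimal sketch of ray casting for a single pixel, the Python code below casts one primary ray and then one shadow ray per light source, which is where the factor $l$ in the running time comes from. It is not the implementation of this thesis: the scene consists of spheres only, the first hit is found by naively testing every sphere, and all function names are assumptions.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)


def first_hit(origin, direction, spheres, t_max=math.inf):
    """Naive search: test the ray against every sphere, keep the nearest hit.

    Returns (t, sphere_index) or None. `direction` must be normalized.
    """
    best = None
    for index, (center, radius) in enumerate(spheres):
        oc = sub(origin, center)
        b = dot(direction, oc)
        disc = b * b - (dot(oc, oc) - radius * radius)
        if disc < 0.0:
            continue
        root = math.sqrt(disc)
        for t in (-b - root, -b + root):
            if 1e-6 < t < t_max and (best is None or t < best[0]):
                best = (t, index)
                break
    return best


def lit_lights(eye, pixel_dir, spheres, lights):
    """Cast one primary ray; return how many lights reach the hit point p.

    Returns None if the primary ray hits nothing.
    """
    hit = first_hit(eye, pixel_dir, spheres)
    if hit is None:
        return None
    t, _ = hit
    p = tuple(e + t * d for e, d in zip(eye, pixel_dir))
    visible = 0
    for light in lights:
        to_light = sub(light, p)
        distance = math.sqrt(dot(to_light, to_light))
        # Shadow ray: p is lit unless an object lies between p and the light.
        if first_hit(p, normalize(to_light), spheres, t_max=distance) is None:
            visible += 1
    return visible
```

In this sketch a light placed behind the sphere is blocked by the sphere itself, so a point facing away from a light is correctly reported as being in shadow.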

3.1.3 Whitted ray tracing

Ray casting was improved by Whitted [113] in 1980 by introducing reflected and transmitted rays after the first (primary) rays had hit a surface. Together with shadow rays, reflected and transmitted rays are often referred to as secondary rays. This type of ray tracing is referred to as Whitted or recursive ray tracing, see Figure 3.3. Throughout this report, recursive ray tracing will be used to denote a ray tracing implementation using recursive function calls and not the Whitted ray tracing model. Whitted ray tracing improves photorealism by simulating reflection and transparency of materials; however, the number of rays created increases exponentially with the number of bounces a ray is allowed to make. Thus, the running time becomes $O(n l 2^{m} t)$, where $m$ is how many bounces a ray can make. The memory requirements for Whitted ray tracing depend on the order in which rays are followed and on whether or not pixels are processed sequentially. Thus, if rays are followed in a breadth-first manner the memory requirements will be $O(n + 2^{m})$, and if depth-first only $O(n + m)$, assuming the pixels are handled one at a time.

Figure 3.3: This figure shows how Whitted ray tracing generates reflected and transmitted rays when a ray hits an object. This process is repeated recursively such that new rays are generated if or when a reflected/transmitted ray hits an object.
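The exponential growth in the number of rays can be made concrete with a small worst-case count for a single pixel. The Python sketch below is my own illustration and rests on the assumption that every ray hits a surface and that every hit spawns one reflected ray, one transmitted ray and one shadow ray per light source; the function name is hypothetical.

```python
def whitted_ray_count(max_bounces, lights):
    """Worst-case number of rays traced for one pixel.

    Assumes every ray hits a surface and every hit spawns one reflected ray,
    one transmitted ray and one shadow ray per light source.
    """
    def rays_from_hit(depth):
        count = lights  # shadow rays sent from this hit point
        if depth < max_bounces:
            # One reflected and one transmitted ray, each of which hits again.
            count += 2 * (1 + rays_from_hit(depth + 1))
        return count

    return 1 + rays_from_hit(0)  # the primary ray plus everything it spawns
```

Under these assumptions the count equals $(1 + l)(2^{m+1} - 1)$ per pixel, which agrees with the $2^{m}$ factor in the running time above.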

3.1.4 Distribution/Distributed ray tracing

The reflection, transparency and shadows simulated with Whitted ray tracing have the problem that they are perfect by nature, as opposed to real life, where they are more blurred and fuzzy. Glossy reflections, translucency and soft shadows, as well as motion blur and depth of field, were all physical phenomena addressed in 1984 by Cook [29]. All of these problems are essentially handled in the same manner, namely by sampling more rays as shown in Figure 3.4. To simulate glossy reflections, for example, multiple reflected rays are created, perturbed from the perfect reflected direction. The method is applied in more or less the same manner for the other phenomena. This type of ray tracing is referred to as distribution or distributed ray tracing and is one of the first Monte Carlo ray tracing algorithms, as the rays are assumed to be perturbed randomly.

The time complexity now becomes $O(k_n n \, k_l l \, (k_r + k_t)^{m} \, t)$, where $k_n$ denotes the number of samples per pixel, $k_l$ the samples per light source, $k_r$ the samples per reflected ray and $k_t$ the samples per transmitted ray.

The overall memory requirements remain more or less the same compared to Whitted ray tracing, namely $O(n + (k_r + k_t)^{m})$ or $O(n + m)$. Again the memory used depends on the traversal order.

Figure 3.4: This figure shows how distribution ray tracing generates multiple rays instead of just one ray.

3.1.5 Path tracing

All the previous models except photon tracing have the limitation that they cannot simulate the indirect lighting obtained from light reflected off a surface, see Figure 3.5. The problem is that shadow rays only test if a point p has direct contact with a light source. Thus, they fail to notice that light may be reflected off other surfaces. Radiosity [111], another global illumination model, was built to address this problem.

Figure 3.5: This figure shows how shadow rays incorrectly report a point to be in shadow although light indirectly reaches the point in question.

This issue received attention in 1986 from Kajiya [58], when he formulated the rendering equation and proposed path tracing as a solution to it. Path tracing creates only one reflected/transmitted ray after a surface is hit, see Figure 3.6. This ray is sent in a random direction according to the properties of the surface. The process is repeated recursively until a light source is reached. For this to give images of reasonable quality, a very high number of rays must be generated for each pixel.

The running time is therefore $O(pnl)$, where $p$ is the number of rays created for each pixel. Since no ray tree is generated, the memory is now limited to $O(n)$.
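How a ray is "sent in a random direction according to the properties of the surface" depends on the material. As one possible illustration, the Python sketch below assumes a perfectly diffuse surface and picks the new direction with cosine-weighted hemisphere sampling around the surface normal. This choice of material and sampling scheme is an assumption made for the example and is not taken from the thesis; the function name is hypothetical.

```python
import math
import random


def cosine_sample_hemisphere(normal, rng=random):
    """Pick a random unit direction around `normal` (assumed normalized),
    with probability density proportional to the cosine of the angle to it."""
    u1, u2 = rng.random(), rng.random()
    r, phi = math.sqrt(u1), 2.0 * math.pi * u2
    # Direction in a local frame where the z axis is the normal.
    x, y, z = r * math.cos(phi), r * math.sin(phi), math.sqrt(1.0 - u1)

    # Build an orthonormal basis (tangent, bitangent, normal).
    nx, ny, nz = normal
    helper = (1.0, 0.0, 0.0) if abs(nx) < 0.9 else (0.0, 1.0, 0.0)
    # tangent = normalize(cross(helper, normal))
    tx = helper[1] * nz - helper[2] * ny
    ty = helper[2] * nx - helper[0] * nz
    tz = helper[0] * ny - helper[1] * nx
    length = math.sqrt(tx * tx + ty * ty + tz * tz)
    tx, ty, tz = tx / length, ty / length, tz / length
    # bitangent = cross(normal, tangent)
    bx = ny * tz - nz * ty
    by = nz * tx - nx * tz
    bz = nx * ty - ny * tx

    return (x * tx + y * bx + z * nx,
            x * ty + y * by + z * ny,
            x * tz + y * bz + z * nz)
```

A path tracer would call such a function once per surface hit and average the results of the $p$ paths traced through each pixel.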

Bi-directional path tracing [62] and Metropolis Light Transport [103] were introduced in 1993 and 1997 respectively, and both provide the same image quality as normal path tracing while reducing rendering time.

3.1.6 Photon mapping

Figure 3.6: This figure shows how path tracing generates only one ray besides the shadow ray. Note how the ray is generated randomly according to the properties of the object's material.

In 1996 another approach for rendering indirect lighting was suggested by Wann Jensen [57]. This approach is a two-pass algorithm, where the first pass emits photons from all the light sources, and when a photon hits a surface it is saved inside a data structure called the photon map. The second pass is just a normal ray tracer like those described above, with the addition that every time a ray hits a surface, a lookup is made in the photon map to see if there are photons stored nearby.

The running time and memory complexity depend on the kind of ray tracer used in the second pass, on the number of photons emitted and on the data structure used for the photon map. In the implementation by Wann Jensen [57], where the photon map was proposed, a kd-tree is used; however, others report success using hash tables [86].
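The two passes can be sketched with a photon map backed by a hash table, here a Python dictionary keyed by grid cell. This is only an illustration of the store and lookup steps: the class `PhotonMap`, its methods and the `cell_size` parameter are hypothetical, and neither the kd-tree of [57] nor the exact scheme of [86] is reproduced.

```python
import math
from collections import defaultdict


class PhotonMap:
    """Photons bucketed by grid cell in a hash table (a Python dict)."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, position):
        return tuple(math.floor(c / self.cell_size) for c in position)

    def store(self, position, power):
        """First pass: record a photon where it hit a surface."""
        self.cells[self._cell(position)].append((position, power))

    def nearby(self, point, radius):
        """Second pass: return the photons within `radius` of `point`."""
        lo = self._cell(tuple(c - radius for c in point))
        hi = self._cell(tuple(c + radius for c in point))
        found = []
        for i in range(lo[0], hi[0] + 1):
            for j in range(lo[1], hi[1] + 1):
                for k in range(lo[2], hi[2] + 1):
                    # .get avoids creating empty buckets during lookups.
                    for position, power in self.cells.get((i, j, k), ()):
                        d2 = sum((a - b) ** 2 for a, b in zip(position, point))
                        if d2 <= radius * radius:
                            found.append((position, power))
        return found
```

Only cells overlapping the search radius are visited, so the cost of a lookup depends on the local photon density rather than on the total number of photons.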

If the scene and light sources are static, then the photon map is view independent and can be precomputed. This is often done in computer games, where it is referred to as light maps.

3.2 Ray tracing implementations

At the time of this writing, four approaches are commonly considered for ray tracing.


3.2.1 CPU approach

The CPU approach refers to the range of ray tracers utilizing the CPU for all the ray tracing calculations. This approach also includes cluster solutions and is by far the most popular way of implementing ray tracing. Thus, this section limits itself to the most noteworthy implementations, though many more will be referred to throughout this thesis.

One of the ray tracers with the longest history is POV-Ray [7]. Although it was created in 1991, it has roots going back to the 1980s on both Unix and Amiga.

Other interesting implementations include the OpenRT API [5] from 2005, which aims to provide a rendering API similar to that of OpenGL. The video games Quake 3 and 4 [80] have both been refitted with the OpenRT API by Pohl [80], achieving convincing results.

At the time of this writing, the Mental Ray application by Mental Images [2] is among the fastest ray tracers around, and its usage includes several feature films.

Many interactive (real-time) ray tracers have been reported using the CPU approach; however, most of these are only ray casters, which means that the complexity, the photorealism and most of all the difference from rasterized images are kept at a minimum.

3.2.2 GPU approach

GPU ray tracing uses the Graphics Processing Unit for the calculations.

The first description of a simulated ray tracer using the GPU was given in 2002 by Purcell et al. [85]. In 2003 Purcell et al. [86] showed how the photon mapping model could also be applied on the GPU. In 2004 a path tracing implementation was introduced by Ernst et al. [37].

While Purcell et al. used simple data structures, Foley et al. [38] presented a GPU ray tracer in 2005 using the kd-tree data structure, which will be discussed in section 4.1.3. Similar approaches have later been used by various people [53, 84].

It is noticeable that most of the referenced GPU ray tracers claim to be the first real GPU ray tracer to outperform those on the CPU. However, many of the implementations are only ray casters, which lets them avoid the problem of exponential memory use, a problem GPUs have proven to have difficulties handling in general. This is also likely to be the reason that path tracing was attempted early in GPU ray tracing, as it also avoids the exponential memory use.

3.2.3 Custom Hardware approach

The third approach is to use custom hardware made for ray tracing.

One of the first occurrences was proposed by Green [43] in 1991, where he describes how ray tracing may be used on a distributed memory multiprocessor. Another approach, the AR350 chip [46], was presented in 2001 by Advanced Rendering Technologies. In 2002 Schmittler et al. [91] presented the SaarCOR prototype developed at Saarland University, which in 2005 was succeeded by the RPU chip [116].

The main problem with these approaches is that they are either experimental prototypes, specially made for industrial use or of limited programmability.

3.2.4 Hybrid approach

The hybrid approach uses a mixture of the above.

Before the first real GPU ray tracers saw the light of day, the ray engine [24] used the GPU for all the intersection calculations, while the remaining work was still carried out on the CPU.

Recently a new chip has been developed by Caustic Graphics [97], which instead of doing the entire ray tracing focuses on specific calculations, such that it complements both the CPU and the GPU. The exact details of this chip are currently kept secret from the public.

Various people [95, 96, 94] believe that hybrid solutions are likely to be seen in the future.

3.3 Chosen Model

Most of the recent ray tracing work aims at real-time performance, and in order to achieve that it has chosen the simplest of the mentioned rendering models, namely ray casting. However, I believe that ray casting can be compared to a very slow rasterizer with automatic shadows. Distribution ray tracing and its more complex counterparts are, however, presently so time consuming that they might limit the thesis a great deal. Thus, this thesis focuses on the Whitted ray tracing model.

Chapter 4

Algorithms and Data structures

This chapter describes the most popular algorithms and data structures used for ray tracing. Each section is followed by a short discussion explaining which algorithm or data structure I have chosen to pursue in my implementation. The first section surveys the different data structures for objects. This is followed by a section discussing traversal of the data structure. Next is a section examining the intersection of objects with rays. Afterwards, different ways of shading are covered. Finally, the chapter is concluded with a summary of the features chosen for the ray tracer developed in this thesis.

To find the first object hit by a ray one could use the naive approach and testthe ray against all objects, which would take O(n), where n is the number ofobjects. If there are many objects in a scene this will run unacceptably slowas Havran [48] puts it. Thus to minimize the time spent, acceleration structureshave been introduced for the objects. With the introduction of accelerationstructures the ray tracing is often broken into three stages when discussed inliterature:

Traversal of a ray, which can be thought of as the process of finding nearby objects.

Intersecting rays with objects, which tests if the ray actually hits any of the objects.

Shading of a ray with the object hit, which tells what color the object will have.

This chapter is therefore structured in the same fashion. Since the greatest impact on performance can be achieved through the choice of acceleration structure and its traversal, special emphasis is put on these sections.

4.1 Acceleration Structure

4.1.1 Grids

The simplest choice is a uniform grid [40], where the scene is divided into the cells of the grid, see Figure 4.1. Thus looking up a cell takes O(1), while finding the first intersection in a cell takes O(n), where n equals the number of objects inside the cell. This data structure will of course work well when the objects are distributed uniformly over the scene, and less well when most of the objects are located in a few cells. The grid data structure was used in an early interactive ray tracer from 1999 by Parker et al. [77] and in Purcell's GPU ray tracer [85].

Figure 4.1: This Figure shows in 2D how objects are stored in a uniform grid.
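The constant-time cell lookup can be illustrated with a minimal Python sketch. It is not the data structure of [40]; the names `UniformGrid`, `cell_index` and `insert`, and the choice of registering objects by their bounding box, are assumptions made for illustration only.

```python
from collections import defaultdict
from math import floor


class UniformGrid:
    """Uniform grid over an axis-aligned scene box (hypothetical names)."""

    def __init__(self, lo, hi, resolution):
        self.lo = lo                    # minimum corner of the scene (x, y, z)
        self.hi = hi                    # maximum corner of the scene
        self.res = resolution           # number of cells along each axis
        self.cells = defaultdict(list)  # cell index -> objects overlapping it

    def cell_index(self, point):
        """Map a point to its cell in O(1); border points are clamped."""
        index = []
        for axis in range(3):
            extent = self.hi[axis] - self.lo[axis]
            i = floor((point[axis] - self.lo[axis]) / extent * self.res[axis])
            index.append(min(max(i, 0), self.res[axis] - 1))
        return tuple(index)

    def insert(self, obj, box_lo, box_hi):
        """Register obj in every cell overlapped by its bounding box."""
        i0 = self.cell_index(box_lo)
        i1 = self.cell_index(box_hi)
        for x in range(i0[0], i1[0] + 1):
            for y in range(i0[1], i1[1] + 1):
                for z in range(i0[2], i1[2] + 1):
                    self.cells[(x, y, z)].append(obj)


grid = UniformGrid((0.0, 0.0, 0.0), (4.0, 4.0, 4.0), (4, 4, 4))
grid.insert("sphere", (0.5, 0.5, 0.5), (1.5, 0.9, 0.9))
```

In this sketch an object that spans several cells is stored in each of them, which is why the per-cell intersection cost depends on how evenly the objects are spread.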

This scheme has been modified over time, which has led to hashed grids by Lagae et al. [64], which deal well with the problem that some scenes contain a lot of empty cells, taking up unnecessary memory. Another variation is the proximity clouds by Cohen et al. [28], which aim at speeding up the traversal of rays through empty space. This approach has been used in GPU ray tracing by Karlsson et al. [61].

4.1.2 Octrees

The octree is another common data structure, which recursively divides the scene into 8 axis-aligned boxes of equal size until the number of objects inside a box falls below, or the depth exceeds, some threshold, see Figure 4.2. This generates a tree where each internal node has 8 children. Depending on the balance of this tree, looking up a part of the scene or a leaf of the octree takes O(log n). Finding the first intersection with the objects in the leaf takes O(1) when the termination criterion is put on the number of objects rather than the depth of the tree.

Figure 4.2: This Figure shows in 2D how Octrees are used to contain objects.

The octree sometimes has efficiency issues with objects lying on a division plane, which means that the object is placed in two boxes of the octree. An example of this can be seen in Figure 4.2, where the large circle touches the lower left cell. Thus, when a ray goes through that cell it has to check for intersection with the circle. This problem is referred to as fragmentation by MacDonald et al. [69]. One way to minimize the problem is a variation of the octree called the loose octree by Thatcher [32].


4.1.3 Binary Space Partition Tree

Another popular choice is the Binary Space Partition (BSP) tree described by Schumacher et al. [92] and refined by Fuchs et al. [39]. The BSP tree recursively inserts a plane dividing the scene into two until the number of objects or the depth goes beyond some threshold. The BSP tree uses O(log n) for lookup and O(1) for intersection checking. As with the octree, this running time depends heavily on the balance of the tree. In 1992, Haines [100] described octrees and BSP trees as equally good, although BSP trees are simpler.

A special case of the BSP tree is the kd-tree by Bentley [19]. The kd-tree will always make the split along one of the axes, see Figure 4.3. It is therefore often referred to as the axis-aligned BSP tree. As with the octree, the kd-tree also has decreased performance when objects touch a splitting plane, as this causes fragmentation.

Figure 4.3: Example of a scene (left) and its kd-tree (right).

A variation of the kd-tree is the spatial kd-tree by Ooi [76], which uses two splitting planes instead of one. This has two advantages. First, objects do not have to be copied to two subtrees when fragmentation occurs. Second, it allows skipping of empty space during traversal.

Havran [48] has in his PhD thesis compared most of the data structures described here for use in ray shooting algorithms and found that the best general purpose acceleration structure was the kd-tree combined with the Surface Area Heuristic by MacDonald et al. [69]. This result has been acknowledged and used by many people [17, 74, 53] in the following years.

4.1.4 Bounding Volume Hierarchies

Bounding Volume Hierarchies (BVH) by Rubin et al. [90] are another data structure used for ray tracing by many people [18, 83, 44, 23, 105]. BVHs are binary trees of bounding volumes, where each interior node is a bounding volume which completely encloses the bounding volumes of its children. Leaf nodes are bounding volumes that enclose part of the scene's objects. The bounding volume can be any shape as long as it complies with the described rules. The most commonly used bounding volume is, however, the axis-aligned box, see Figure 4.4. BVHs have the same asymptotic running time as octrees and BSP trees.

Figure 4.4: Example of a scene (left) and its bounding volume hierarchy (right).

The BVH has a variation very similar to the spatial kd-tree, called the Bounding Interval Hierarchy (BIH) and presented by Wachter et al. [112], in that it stores two clipping planes for each node. As with the spatial kd-tree, this allows empty space to be skipped during traversal.

Another, and at this time relatively new, modification to the BVH is the shallow bounding volume hierarchy by Dammertz et al. [31], which is essentially a normal BVH with more than two child nodes. Thus, it is no longer a binary tree.

4.1.5 Others

This final section describes some of the most recent data structures suggested for ray tracing.

The first one is called ray strips/ReduceM, presented by Lauterbach et al. in 2008 [65, 66]. The previous tree structures presented all contain triangles or other objects in the leaves, see Figure 4.5a and 4.5b. Ray strips use a different approach where the leaves of the tree contain triangle strips instead of just triangles, see Figure 4.5c. This not only reduces the memory used for storing the triangles, it may also produce more shallow hierarchies, thus saving even more memory. Another benefit of gathering the triangles in a strip is that SIMD can be used to make multiple triangle intersection checks at the same time. Ray tracing will, however, only benefit from the approach if the scene contains a large number of connected triangles.

Figure 4.5: This figure shows (a) a scene divided by axis-aligned splitting planes, (b) represented by an ordinary kd-tree and (c) by a kd-tree using ray strips.

The second data structure, called constrained tetrahedralization [63], moves away from using hierarchies. It can be thought of as the 3-dimensional version of triangulation, where the scene is divided into tetrahedrons rather than triangles. The benefit of this approach is that ray traversal is much faster and can be done without a stack, which makes it more suitable for GPU ray tracing. Another property of constrained tetrahedralization is that it can handle deformable and dynamic geometry to some degree without recalculating the data structure from scratch.

4.2 Chosen Data Structure

In this thesis I have chosen to focus on static scenes containing objects of finite size, meaning objects that can be contained in a bounding box. It is a general trick to limit the geometry of the scene to triangles only [107, 106, 78, 24], as it allows one to use techniques like triangle strips. It can also eliminate the need for polymorphism for objects, as only one type is available.

However, I believe that ray tracing's ability to handle almost arbitrary objects is what makes it interesting compared to rasterization. Thus, the ray tracer used in this thesis is developed to handle all objects where one can:

Enclose the object inside a bounding box. A plane, for example, is an infinite object and cannot be enclosed by a bounding box.

Calculate the intersection point of a ray and the object.

Calculate the normal of any point on the surface of the object.

Determine if object intersects a bounding box.
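The four requirements above can be read as an interface that every object type implements. The following Python sketch illustrates that reading; the class and method names are hypothetical and are not taken from the thesis implementation, and the sphere intersection is the generic quadratic solution rather than the specific method used later in this chapter.

```python
from abc import ABC, abstractmethod
from math import sqrt


class SceneObject(ABC):
    """The four operations an object must offer (illustrative names)."""

    @abstractmethod
    def bounding_box(self):
        """Return the (lo, hi) corners of an enclosing axis-aligned box."""

    @abstractmethod
    def intersect(self, origin, direction):
        """Return the ray parameter t of the nearest hit, or None."""

    @abstractmethod
    def normal_at(self, point):
        """Return the unit surface normal at a point on the surface."""

    @abstractmethod
    def overlaps_box(self, lo, hi):
        """Return True if the object may intersect the box [lo, hi]."""


class Sphere(SceneObject):
    def __init__(self, center, radius):
        self.center, self.radius = center, radius

    def bounding_box(self):
        lo = tuple(c - self.radius for c in self.center)
        hi = tuple(c + self.radius for c in self.center)
        return lo, hi

    def intersect(self, origin, direction):
        # Assumes a unit-length direction; solves |o + t*d - c|^2 = r^2 for t.
        oc = [o - c for o, c in zip(origin, self.center)]
        b = sum(x * d for x, d in zip(oc, direction))
        c = sum(x * x for x in oc) - self.radius ** 2
        disc = b * b - c
        if disc < 0.0:
            return None
        root = sqrt(disc)
        for t in (-b - root, -b + root):
            if t > 1e-9:
                return t
        return None

    def normal_at(self, point):
        return tuple((p - c) / self.radius for p, c in zip(point, self.center))

    def overlaps_box(self, lo, hi):
        # Conservative test: squared distance from the center to the box.
        d2 = 0.0
        for c, l, h in zip(self.center, lo, hi):
            nearest = min(max(c, l), h)
            d2 += (c - nearest) ** 2
        return d2 <= self.radius ** 2


sphere = Sphere((0.0, 0.0, 5.0), 1.0)
t = sphere.intersect((0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
```

The box test treats the sphere as solid, so it may report an overlap for a box lying entirely inside the sphere. For building an acceleration structure this only costs some extra references and does not cause missed intersections.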

I have, however, limited myself to implementing only triangles and spheres, but it has been made easy to extend the ray tracer with arbitrary objects satisfying the rules above. Spheres and triangles have been used successfully at the same time in a state-of-the-art ray tracer called Arauna [20, 21], providing interactive frame rates.

The Arauna engine uses the kd-tree with the Surface Area Heuristic for its static objects. It has, as mentioned, generally been accepted that the kd-tree provides the overall best performance compared to other data structures, if one disregards the newest data structures surveyed. I have therefore chosen to use the kd-tree with the Surface Area Heuristic. The Surface Area Heuristic will be discussed in detail in chapter 5. If I had used only triangles for the geometry, then ray strips might have been a more promising choice.

4.2.1 Dynamic scenes

In this thesis I have focused on static scenes. However, to allow dynamic objects one could simply rebuild the acceleration structure in each frame. This has been done by various people [17, 105, 51].

Another and more common way of handling dynamic scenes is to keep the static and dynamic objects in separate data structures. Finding a ray's first intersection is then done by traversing both structures. The advantage is that only the structure of the dynamic objects needs to be rebuilt for each frame, and only if it needs rebuilding at all. This method is used in the Arauna [20] engine, where the static objects are kept in a kd-tree with the Surface Area Heuristic, while the dynamic geometry uses a BIH.

Among other methods is the use of the Render Cache [41]. The Render Cache assumes that there is high coherency between frames. This is exploited by reusing information from the last frame to create the next frame using various techniques. DeMarle et al. [33] also try to reuse content from the last frame when rendering the next. However, they deal with rendering of large scenes rather than dynamic scenes.

4.3 Traversal

This section gives a review of the most common techniques and implementations used for traversing a ray through a data structure. Note that some emphasis is put on traversal of kd-trees, as that is the spatial data structure used for the objects in my ray tracer.

Traversal of a ray in a kd-tree can be split into two stages:

Finding the first leaf intersected by a ray.

Continuing to find the next leaf intersected by the ray, until it hits an object inside the leaf's bounding box.

The first stage should take O(log n), although this depends on the balance of the tree, as mentioned in section 4.1.

One common optimization suggested by Haines [45] is based on the fact that the eye is sometimes located inside a leaf, which means that all primary rays start in the same leaf of the kd-tree, see Figure 4.6. Thus, the first stage needs only be performed once, after which the result can be applied to all the rays.

Figure 4.6: This Figure shows how all primary rays are generated in the same voxel of the kd-tree.

Instead of shooting primary rays, another approach is to use the GPU's rasterizer [67, 53]. The rasterizer allows one to transfer triangles to the GPU, which can then be used to find the object first hit in each pixel. Secondary rays can then be traced as normal afterwards.

The problem with these approaches is that they only work for primary rays. As there are exponentially many secondary rays compared to primary rays, the optimizations do not address the bottleneck. Another issue with the second approach is that it does not scale logarithmically with the number of objects, as all triangles need to be sent to the rasterizer. Furthermore, it only handles triangles.

The first trick can, however, be used in a similar way for secondary rays, as they start in the leaf where the previous ray terminated. But this has little effect unless the secondary rays hit a surface inside the first leaf they traverse.

4.3.1 Ropes

If a ray has to traverse all leaf nodes then the worst case running time for both stages will be O(n log n). One proposed solution to this problem is to use neighbor links between leaves [47]. These links are sometimes called ropes. This saves one from traversing from the root down to the next leaf. The problem is that one leaf may have O(n) neighbors, which yields O(n) extra memory needed. This poses a great problem for large scenes.

4.3.2 Coherent traversal

Often rays with the same origin and similar direction hit the same objects. This has led to techniques like beam tracing [52] and cone tracing, where the primary rays are replaced by a pyramid (beam) or cone, respectively. So instead of traversing a ray for each pixel, only one beam is traversed through the scene, decreasing the work by a great magnitude. They are, however, rarely used as they are both rather complex.

Beam tracing did, however, lay the foundation for the multi-level ray tracing algorithm (MLRTA) presented in 2005 by Reshetov et al. [89]. It is essentially beam tracing, in the way that it sends out a beam as a container of rays. The beam is then tested against the kd-tree, trying to exclude subsets of the kd-tree from further traversal. This enables a ray to start at a deeper level of the tree. Thus, the main difference between beam tracing and MLRTA is that MLRTA uses beams to save traversal time, while beam tracing uses beams for the traversal, intersection and shading parts of ray tracing. MLRTA has been applied by Bikker [20] for the primary rays in the Arauna engine.

The problem with this method is that after the rays' first hit, the rays no longer start in the same position. Furthermore, the rays become very incoherent, since they are likely to continue in different directions. Although Wald et al. [110] have reported shadow rays to be highly coherent, this still leaves reflected and transmitted rays as a bottleneck. This makes it infeasible to gather rays in a large beam, as they traverse different parts of the kd-tree. This has been acknowledged by several people [74, 53, 112].

A similar and quite popular method is to use ray packets [108, 53, 107, 17], where a packet of 4 or more rays is traversed through a tree at the same time. One reason for its popularity, besides exploiting coherence, is that the traversal can be done using SIMD instructions. The Arauna engine [20] uses this technique for the secondary rays. Wald et al. show very promising results with this method. It should, however, be noted that many authors fail to mention that this technique has the greatest impact for primary rays. Thus, the technique enables them to achieve interactive frame rates in ray tracing when they limit themselves to the ray casting model, where only primary rays and shadow rays are created. However, there are exponentially many secondary rays compared to primary rays. This means that this speedup does not address the computational bottleneck in ray tracing.

Many different results have been measured using this technique. One should therefore be skeptical about any conclusion on the subject. Reshetov [88] finds that for primary and shadow rays, the use of ray packets gives a 2x speedup, while for the remaining secondary rays the speedup is limited to a factor of 1.2x.

Similar experiments with ray packets have been performed by Wald et al. on Whitted and distribution ray tracing [23], where they achieved a speedup of between 2x and 3x.

Mansson et al. have recognized that the most significant speedup can be gained by exploiting the coherency of secondary rays in a better fashion. They have thus tried different approaches [74] to address the problem. They were, however, not able to improve performance beyond the existing, naive techniques.

Daniel Pohl [82], on the other hand, discusses how the use of ray packets decreases performance on models that are frequently used in games.

Another approach has been used by Lauterbach et al. [65], where they vary the size of the ray packets according to the current depth of the kd-tree.


4.3.3 KD-Tree traversal

The classical way of traversing a kd-tree is to use a stack of tree nodes. Thus, whenever a node has been tested for intersection without finding an intersection, a node is popped off the stack.
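As an illustration of the stack-based scheme, the following Python sketch visits the leaves pierced by a ray in front-to-back order. It is a generic textbook formulation and not one of the implementations compared by Havran; the classes `Leaf` and `Inner` and the function `leaves_along_ray` are hypothetical names, and the ray interval [t_min, t_max] is assumed to have been clipped to the scene's bounding box beforehand.

```python
class Leaf:
    def __init__(self, name):
        self.name = name


class Inner:
    def __init__(self, axis, split, left, right):
        self.axis = axis      # 0, 1 or 2
        self.split = split    # position of the splitting plane on that axis
        self.left = left      # child covering coordinates below the plane
        self.right = right    # child covering coordinates above the plane


def leaves_along_ray(root, origin, direction, t_min, t_max):
    """Yield the leaves pierced by the ray segment, nearest leaf first."""
    stack = [(root, t_min, t_max)]
    while stack:
        node, t0, t1 = stack.pop()
        while isinstance(node, Inner):
            o, d = origin[node.axis], direction[node.axis]
            # The child on the origin's side of the plane is the near child.
            if o < node.split or (o == node.split and d <= 0):
                near, far = node.left, node.right
            else:
                near, far = node.right, node.left
            if d == 0:
                node = near               # parallel ray never crosses the plane
                continue
            t_split = (node.split - o) / d
            if t_split > t1 or t_split <= 0:
                node = near               # plane lies beyond the segment
            elif t_split < t0:
                node = far                # plane lies before the segment
            else:
                stack.append((far, t_split, t1))   # visit far child later
                node, t1 = near, t_split
        yield node


tree = Inner(0, 1.0, Leaf("A"), Inner(1, 1.0, Leaf("B"), Leaf("C")))
visited = [leaf.name for leaf in
           leaves_along_ray(tree, (0.0, 0.5, 0.0), (1.0, 1.0, 0.0), 0.0, 3.0)]
```

A caller would test the objects of each yielded leaf and stop iterating as soon as a hit inside that leaf is found; only when a leaf yields no hit is the next entry popped from the stack.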

Instead of keeping a stack, one can move the origin of the ray when it has traversed a leaf. The downside is that one will have to traverse the kd-tree again starting at the root, and one needs to store the axis-aligned bounding box of each leaf. The benefit is that the need for the stack has been eliminated, which has been recognized as a great advantage for GPU ray tracing by Popov et al. [84]. This approach is sometimes known as kd-restart [38, 67].

Foley and Sugerman propose a small adjustment to avoid starting at the root each time a leaf has been visited. They call this approach the kd-backtrack algorithm. As the name suggests, after intersecting a leaf one backtracks until the next node to visit is found. This requires that each node has a link to its parent node.

Havran has compared the most common implementations of kd-tree traversal algorithms in his thesis [48]:

A sequential implementation [59], which he denotes $TA_{seq}$. This is the stackless (kd-restart) approach described above.

A recursive implementation [56], which he denotes $TA^{A}_{rec}$.

Havran's modified recursive implementation, which he denotes $TA^{B}_{rec}$.

Several implementations using neighbor links [69, 47].

In his thesis, he finds that his own modification, $TA^{B}_{rec}$, gives the best performance. The main difference between $TA^{A}_{rec}$ and $TA^{B}_{rec}$ is that $TA^{B}_{rec}$ addresses some numerical problems with $TA^{A}_{rec}$ by introducing some if-statements in the algorithm.

4.4 Chosen traversal

I recognize the wide use of ray packets in ray tracing. However, due to the very different results achieved for secondary rays, it has been decided not to use the technique in this thesis. This choice also serves to limit the amount of work.

Efficient implementations for tree traversal using a stack already exist. I have therefore decided to rely on these, using a slightly modified version of the implementation $TA^{B}_{rec}$ presented and tested by Havran. The modifications I have introduced mainly aim at attaining higher numerical stability, which is discussed in section 8.


4.5 Intersection

This section gives a short review of methods used for intersection checking.

Since intersection checking often takes by far most of the time in ray tracing, various techniques have been applied to speed it up.

4.5.1 Mailboxes

When using a kd-tree, fragmentation is known to occur. This means that an object is contained in more than one leaf of the kd-tree. Therefore, when finding an intersection one must also check that the intersection lies inside the voxel of the current leaf, as the intersection found may lie in another leaf, see Figure 4.7. If the intersection point lies outside the voxel of the leaf, one must continue the traversal, as there might be a closer hit. Continuing the traversal will, however, often result in testing triangles which have already been tested for intersection. This is the situation in Figure 4.7, where a ray traverses node B, registering an intersection with triangle 1 lying outside of B. Traversal then continues to node A, where the ray tests for intersection against triangles 1 and 2. Thus, the ray performs the same intersection test against triangle 1 twice.

Figure 4.7: This Figure shows how a ray may detect an intersection outside the leaf it is currently traversing.

To increase performance various people [13, 27, 45] have tried to use mailboxes. Mailboxes keep track of the rays that have been tested against an object. Thus, before a ray is tested against an object, it checks if it has already performed this intersection test. This allows one to avoid performing the same ray/object intersection check twice. The drawback is that extra memory is needed. Wald et al. have tried using hashed mailboxes [107], with results inferior to those of normal mailboxes. Another problem with mailboxes occurs when using them in parallel ray tracing [98].

Havran [49] recommends using mailboxes only for complex objects.
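The mailbox idea can be sketched in a few lines of Python. This is a minimal illustration and not the scheme of any of the cited papers; the class `MailboxedObject`, the per-ray identifier and the stand-in function `fake_intersect` are assumptions made for the example.

```python
class MailboxedObject:
    """Wraps an object and remembers the id of the last ray tested against it."""

    def __init__(self, shape, intersect):
        self.shape = shape
        self.intersect = intersect   # function (shape, ray) -> hit or None
        self.last_ray_id = None      # the mailbox
        self.last_result = None

    def test(self, ray_id, ray):
        if self.last_ray_id == ray_id:
            return self.last_result  # already tested: reuse the stored result
        self.last_ray_id = ray_id
        self.last_result = self.intersect(self.shape, ray)
        return self.last_result


intersection_tests = []


def fake_intersect(shape, ray):
    """Stand-in intersection routine that only records that it was called."""
    intersection_tests.append(shape)
    return None


triangle_1 = MailboxedObject("triangle 1", fake_intersect)
triangle_2 = MailboxedObject("triangle 2", fake_intersect)

# The situation of Figure 4.7: leaf B references triangle 1,
# leaf A references triangles 1 and 2, and one ray visits both leaves.
ray_id, ray = 7, None
for leaf in ([triangle_1], [triangle_1, triangle_2]):
    for obj in leaf:
        obj.test(ray_id, ray)
```

After the loop, `intersection_tests` holds one entry per triangle, so the repeated test against triangle 1 was avoided. Note that in this sketch the mailbox is state stored on the object and overwritten for every ray, so threads tracing different rays concurrently would overwrite each other's entries unless each thread keeps its own mailboxes.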

4.5.2 GPU intersection check

Intersecting multiple rays with one triangle, or the same ray with multiple triangles, are computations that can be done independently of each other. Intersection checking is therefore very suitable for parallel computation. This led to the ray engine [24], which used the huge parallel processing power of the GPU for intersection checking. At that time the main problem with this method was that a significant amount of time was spent feeding the results back from the GPU to the CPU.

4.5.3 Calculating ray/object intersection

It is generally useful to think of the intersection calculation as two parts:

Does a ray intersect an object?

Where does it intersect the object?

The reason for this is, first of all, that if the ray does not intersect the object then the intersection point is not needed and should not be calculated. The second reason is that shadow rays do not need the intersection point. They just need to know if they hit an object before they reach the light source. This is exploited in ODPRT, as rays are sent from one subtree to another.

For algorithms calculating intersections between objects there is, at the time of this writing, no better reference than the web site [8] of the Real-Time Rendering text book [11].

Thus, for triangle intersections this thesis has used the Möller-Trumbore algorithm [75], which has been thoroughly examined for optimal performance [73] on multiple architectures. It seems, though, that Chirkov [26] presents a newer, more precise and generally faster algorithm for solving the yes/no problem. It was, however, made for line segments and not infinitely long rays. Thus, making the necessary adjustment may decrease its benefits considerably.
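For reference, the Möller-Trumbore test can be written compactly as below. This Python sketch follows the commonly published form of the algorithm and is not the optimized code used in the thesis; the function name `ray_triangle`, the tuple-based vectors and the value of `EPSILON` are assumptions for illustration.

```python
EPSILON = 1e-9


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, v0, v1, v2):
    """Return (t, u, v) for a hit in front of the origin, otherwise None."""
    edge1 = sub(v1, v0)
    edge2 = sub(v2, v0)
    pvec = cross(direction, edge2)
    det = dot(edge1, pvec)
    if abs(det) < EPSILON:
        return None                      # ray is parallel to the triangle plane
    inv_det = 1.0 / det
    tvec = sub(origin, v0)
    u = dot(tvec, pvec) * inv_det        # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    qvec = cross(tvec, edge1)
    v = dot(direction, qvec) * inv_det   # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge2, qvec) * inv_det       # distance along the ray
    if t <= EPSILON:
        return None                      # triangle is behind the ray origin
    return t, u, v


hit = ray_triangle((0.25, 0.25, 0.0), (0.0, 0.0, 1.0),
                   (0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0))
```

The early returns correspond to the yes/no part of the test, while `t` is only computed once the barycentric coordinates show that the ray passes through the triangle.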

For the sphere intersection the intersection method by Hultquist [54] has beenused in a slightly modified version.

Part of traversing the kd-tree is to establish at what distances the ray enters and leaves the bounding box of the scene. This requires intersection checking of a ray against an axis-aligned bounding box. The easiest solution is to use Woo's algorithm [114]. Smits [98] suggests a faster implementation, of which Amy et al. [14] gave an even more optimized version in 2005 by using precomputation. This implementation has been used in this thesis. However, further research suggests that Eisemann et al. [34] have found an even faster algorithm at the expense of more precomputation.
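The entry and exit distances can be found with the well-known slab method, sketched below in Python. This is a generic version for illustration and not the exact code of [114], [98] or [14]; the function name `ray_box` is hypothetical, and the reciprocal of the direction, which a real tracer would precompute once per ray, is computed inline here.

```python
from math import inf


def ray_box(origin, direction, box_lo, box_hi, t_min=0.0, t_max=inf):
    """Return (t_enter, t_exit) of the ray inside the box, or None on a miss."""
    for axis in range(3):
        o, d = origin[axis], direction[axis]
        if d == 0.0:
            # Ray is parallel to this pair of planes: it must start between them.
            if o < box_lo[axis] or o > box_hi[axis]:
                return None
            continue
        inv = 1.0 / d                    # candidate for per-ray precomputation
        t_near = (box_lo[axis] - o) * inv
        t_far = (box_hi[axis] - o) * inv
        if t_near > t_far:
            t_near, t_far = t_far, t_near
        t_min = max(t_min, t_near)       # latest entry over all slabs
        t_max = min(t_max, t_far)        # earliest exit over all slabs
        if t_min > t_max:
            return None
    return t_min, t_max


span = ray_box((0.0, 0.0, -5.0), (0.0, 0.0, 1.0),
               (-1.0, -1.0, -1.0), (1.0, 1.0, 1.0))
```

The returned pair can be used directly as the initial ray interval for the kd-tree traversal.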

The precise math behind the intersection methods used is analyzed in section 8.

4.6 Shading / Physical model

This section gives a short description of a select number of shading models, including common problems involved with shading.

Shading can be thought of as determining the color of a surface based on its ability to reflect light and on the light's ability to emit light. Ideally, all time is spent on shading, as it determines the quality of the rendered image.

4.6.1 Flat shading

One of the simplest shading models is flat shading. Flat shading assumes that the object is flat, in which case it applies the same color to the entire surface. This is of course a problem for spheres, as they are not flat. In APIs like OpenGL [115] the problem is handled by approximating all objects with triangles.

4.6.2 Phong shading

Flat shading is a very limited model, as it does not produce many of the phenomena seen in nature. Some of these phenomena are handled by Phong shading [79]. Phong shading splits the shading calculations into three parts:

Diffuse (Lambertian) reflection, which simulates the ability to scatter light in all directions.

Specular reflection, which simulates shiny surfaces that create mirror-like reflections.

Ambient light. This term is used to account for the light scattered throughout the scene.

The color of a surface is then calculated as the sum of these shading calculations:

$$\vec{C}_{phong} = k_{ambient}\vec{C}_{ambient} + \sum_{lights}\left(k_{diffuse}\vec{C}_{diffuse}(\vec{N}\cdot\vec{L}) + k_{specular}\vec{C}_{specular}(\vec{R}\cdot\vec{V})^{\alpha}\right)$$

The parameters of the Phong equation are:

$\vec{C}_{phong}$ is the final color calculated at the surface point of interest using Phong shading.

$\vec{C}_{ambient}$ is the ambient light color scattered in the scene.

$\vec{C}_{diffuse}$ is the diffuse color of the light source hitting the surface.

$\vec{C}_{specular}$ is the specular color of the light source hitting the surface.

$k_{ambient}$ is the amount of ambient light reflected by the surface.

$k_{diffuse}$ is the amount of diffuse light reflected by the surface.

$k_{specular}$ is the amount of specular light reflected by the surface. Note that $k_{ambient}$, $k_{diffuse}$ and $k_{specular}$ can be vectors instead of scalar values, which enables surfaces to reflect different amounts of red, green and blue light [115].

$\vec{N}$ is the normalized surface normal at the point of interest.

$\vec{L}$ is a normalized vector from the surface point to each light source.

$\vec{R}$ is the normalized direction a reflected ray of light would take from this point.

$\vec{V}$ is the normalized direction of the ray hitting the surface.

$\alpha$ is the Phong exponent. It controls the size of the highlights generated by specular reflections.

Note that the diffuse and specular reflections should only be included if the light in focus is not occluded.
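A direct evaluation of the Phong sum can be sketched as follows in Python. Several details are assumptions of the sketch and not taken from the thesis: the function name `phong`, the representation of a light as a tuple, the clamping of negative dot products to zero, and the convention that the view vector points from the surface toward the viewer (the sign must be flipped if the incoming ray direction is stored instead).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def phong(normal, to_viewer, lights, k_ambient, c_ambient,
          k_diffuse, k_specular, exponent):
    """Evaluate the Phong sum for one surface point.

    lights is a list of (to_light, c_diffuse, c_specular, occluded) tuples,
    where to_light is a unit vector from the surface point to the light.
    """
    color = [k_ambient * c for c in c_ambient]
    for to_light, c_diffuse, c_specular, occluded in lights:
        if occluded:
            continue                     # shadowed: only ambient light remains
        n_dot_l = dot(normal, to_light)
        if n_dot_l <= 0.0:
            continue                     # light is behind the surface
        # Mirror the light direction about the normal to obtain R.
        reflected = [2.0 * n_dot_l * n - l for n, l in zip(normal, to_light)]
        r_dot_v = max(dot(reflected, to_viewer), 0.0)
        for i in range(3):
            color[i] += (k_diffuse * c_diffuse[i] * n_dot_l
                         + k_specular * c_specular[i] * r_dot_v ** exponent)
    return tuple(color)


white = (1.0, 1.0, 1.0)
up = (0.0, 0.0, 1.0)
lit = phong(up, up, [(up, white, white, False)], 0.1, white, 0.5, 0.4, 10)
shadowed = phong(up, up, [(up, white, white, True)], 0.1, white, 0.5, 0.4, 10)
```

With the light and the viewer both directly above the surface, all three terms contribute fully, while the occluded case leaves only the ambient term.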

4.6.3 Blinn-Phong shading

Since Phong shading requires a lot of calculations, an approximate version was developed by Blinn [22]. Instead of calculating $(\vec{R}\cdot\vec{V})$, he approximates it with:

$$(\vec{N}\cdot\vec{H}) = \left(\vec{N}\cdot\frac{\vec{L}+\vec{V}}{|\vec{L}+\vec{V}|}\right)$$

Where $\vec{H}$ is sometimes referred to as the half-vector.

In OpenGL Blinn-Phong shading is used with great results. Its advantage is that if the light source and the viewer are located infinitely far away, then the half-vector remains constant for each light source during a frame.


4.6.4 Cook Torrance Reflection

The Phong model produces believable results. It is, however, a very simple model compared to the way nature behaves. Thus, improvements have been made, which include the Cook-Torrance [30] model. The Cook-Torrance model provides a more physically correct model for calculating the specular reflection:

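$$\frac{D F G}{\vec{V}\cdot\vec{N}}$$

In this term, D is the Beckmann distribution factor, F is the Fresnel term and G is the geometric attenuation term:

$$D = \frac{e^{-\left(\frac{\tan\beta}{m}\right)^{2}}}{4 m^{2} \cos^{4}\beta}$$

$$F = (1 + \vec{V}\cdot\vec{N})$$

$$G = \min\left(1, \frac{2(\vec{H}\cdot\vec{N})(\vec{V}\cdot\vec{N})}{\vec{V}\cdot\vec{H}}, \frac{2(\vec{H}\cdot\vec{N})(\vec{L}\cdot\vec{N})}{\vec{V}\cdot\vec{H}}\right)$$

The only new arguments are:

m is the average slope of the surface microfacets, which controls the smoothness of the surface.

$\beta$ is the angle between $\vec{H}$ and $\vec{N}$.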

4.6.5 Light distance attenuation

Another observed property of light is that its intensity decreases over distance. In OpenGL, this has been modeled with the attenuation factor:

$$\frac{1}{k_c + k_l d + k_q d^2}$$

Where d is the distance from a light source and $k_c$, $k_l$ and $k_q$ are the constant, linear and quadratic attenuation constants, respectively. It is worth noting that at some distance the contribution from a light source can be omitted from the shading calculation without any notable loss of quality. This observation has been used in the Arauna engine [20], where a separate data structure is kept for the light sources. This means that shadow rays are only generated for light sources which may give a contribution to the color.
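The attenuation factor and the idea of skipping distant lights can be sketched as below. The function names and the cut-off threshold of 1/256 are assumptions chosen for illustration; the thesis does not state which threshold, if any, Arauna uses.

```python
def attenuation(d, k_c, k_l, k_q):
    """Attenuation factor 1 / (k_c + k_l*d + k_q*d^2) at distance d."""
    return 1.0 / (k_c + k_l * d + k_q * d * d)


def light_contributes(d, k_c, k_l, k_q, threshold=1.0 / 256.0):
    """True if the light is still worth a shadow ray at distance d.

    The threshold is a hypothetical value: below it the light is assumed
    to change an 8-bit color channel by less than one step.
    """
    return attenuation(d, k_c, k_l, k_q) >= threshold


near = attenuation(2.0, 1.0, 0.5, 0.25)   # 1 / (1 + 1 + 1)
```

A shading loop would call `light_contributes` before generating a shadow ray, which saves both the ray traversal and the shading work for lights that are too far away to matter.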

4.6.6 Gamma correction

When a computer monitor outputs an RGB color value on the screen, it gets distorted as a consequence of the way color is depicted on the screen. This distortion will make a picture seem darker than intended.

To adjust for this distortion the color is often gamma corrected, which means that the original color is raised to the power of a gamma factor, $\gamma$:

$$C_{out} = C_{in}^{\gamma}$$

This operation alone can take up to 5 percent of the time on small scenes with few or no reflections and refractions.
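As a small illustration, gamma correction of one color can be written as follows. The function name and the default factor of 1/2.2 are assumptions; 2.2 is a commonly quoted display gamma, but the thesis does not state which value it uses.

```python
def gamma_correct(color, gamma=1.0 / 2.2):
    """Raise each channel (assumed to lie in [0, 1]) to the power gamma."""
    return tuple(channel ** gamma for channel in color)


corrected = gamma_correct((0.0, 1.0, 0.25), gamma=0.5)
```

Since a factor below 1 raises every value strictly between 0 and 1, the corrected image becomes brighter, which compensates for the darkening described above. The power function is evaluated once per channel per pixel, which is consistent with the operation being noticeable on scenes that are otherwise cheap to render.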

4.6.7 Chosen model

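In this thesis gamma correction is used along with a slightly modified version, M(s, r), of the Phong model:

$$\vec{C}_{phong2} = k_{ambient}\vec{C}_{ambient} + \sum_{lights}\vec{C}_{light}\left(k_{diffuse}(\vec{N}\cdot\vec{L}) + k_{specular}(\vec{R}\cdot\vec{V})^{\alpha}\right)$$

$$M(s, r) = C_{mat}\left(\vec{C}_{phong2} + k_{ref}\,M(k_{ref}\,s, r_{ref}) + k_{trans}\,M(k_{trans}\,s, r_{trans})\right)$$

Here $\alpha$ is the Phong exponent described in section 4.6.2.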

Note that I assume that light sources have the same specular and diffuse color, $\vec{C}_{light}$. Furthermore, I have added two terms to the equation to handle reflection and transmission. The constants, $k_{ref}$ and $k_{trans}$, determine how much color a reflected and a transmitted ray add to the shading. This means their sum should be below 1.

I have also added a surface color, $C_{mat}$, which mathematically is a diagonal matrix representing the red, green and blue color of the material. This matrix has been implemented as a 3-dimensional vector.


Finally, s is the total significance of a ray, r. For a primary ray the significanceis always 1.

Note how this model requires the primary rays to wait for the secondary raysto finish before the color can be calculated.

To solve this problem I have reformulated the model such that the rays need not depend on each other. This requires one to distinguish between shadow rays and light rays:

$$SR(s, r) = s\,C_{mat}\vec{C}_{light}\left(k_{diffuse}(\vec{N}\cdot\vec{L}) + k_{specular}(\vec{R}\cdot\vec{V})^{\alpha}\right)$$

$$LR(s, r) = s\,k_{ambient}\,C_{mat}\vec{C}_{ambient}$$

Where SR(s, r) and LR(s, r) are the models for shadow and light rays, respectively, with $\alpha$ again the Phong exponent. From a mathematical point of view this produces the exact same value. In reality, however, it involves a different number of multiplications, which in the end gives the final image a slightly different color due to the accumulation of rounding errors. The advantage is, however, that one can add the contribution of a ray directly into a pixel and then destroy the ray.


4.7 Summary

This section summarizes the choices I have made in the previous sections.

The ray tracer I have implemented uses Whitted ray tracing as described in section 3.1.3. Thus, the ray tracer is able to simulate hard shadows, reflection and transparency.

I have chosen to support only triangles and spheres in my ray tracer.

However, the design can in principle handle any objects, as long as they allow one to:

Enclose the object inside a bounding box.

Calculate the intersection point of a ray and the object.

Calculate the normal of any point on the surface of the object.

Determine if object intersects a bounding box.

I have chosen to organize the objects of the scene in a kd-tree with the Surface Area Heuristic. The Surface Area Heuristic will be discussed in detail in chapter 5.

Traversal of the kd-tree is performed using a slightly modified version of Havran's C implementation of $TA^{B}_{rec}$.

I calculate intersections between a ray and the objects of the scene using the Möller-Trumbore algorithm for the triangles and Hultquist's approach for the spheres.

Finally, I use my own slightly modified version of the Phong lighting model for the shading calculation.

Chapter 5

KD-tree Heuristic

This chapter covers different heuristics used to construct kd-trees, while providing detail on the Surface Area Heuristic used for the kd-tree construction.

Recall that kd-trees are binary trees which recursively insert a splitting plane along one of the axes until some criterion is met.

Overall, two problems need to be overcome when generating a kd-tree:

One needs to find the axis and position to split.

One needs to determine when to stop splitting. This is referred to as the termination criteria.

The chapter is therefore organized to address the two problems in turn.

5.1 Choosing split position

This section presents three different heuristics for finding a split position.


5.1.1 Spatial Median

The easiest heuristic to use is the spatial median [60], which chooses the longest axis and splits it in half. However, this approach does not ensure that the tree is balanced.

5.1.2 Object Median

Another easy method is to make a split such that there is the same number of primitives on each side. This approach is often referred to as the object median. It ensures that the tree is balanced, thus achieving O(log n) for lookup and O(1) for collision testing.
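
The two simple heuristics can be sketched in a few lines. This is an illustration only: the types `Interval` and `Split` are hypothetical, and representing each object by the centre of its extent on the chosen axis is an assumption made for the object median.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Interval { double lo, hi; };   // extent of a box along one axis
struct Split { int axis; double pos; };

// Spatial median: pick the longest axis of the node's box and halve it.
// box[0], box[1], box[2] are the extents along x, y and z.
Split spatialMedian(const Interval (&box)[3]) {
    int axis = 0;
    for (int a = 1; a < 3; ++a)
        if (box[a].hi - box[a].lo > box[axis].hi - box[axis].lo) axis = a;
    return {axis, 0.5 * (box[axis].lo + box[axis].hi)};
}

// Object median: pick the position that leaves half of the objects on each
// side. Each object is represented by its centre on the axis (an assumption).
double objectMedian(std::vector<double> centres) {
    assert(!centres.empty());
    auto mid = centres.begin() + centres.size() / 2;
    std::nth_element(centres.begin(), mid, centres.end());
    return *mid;
}
```

Neither function looks at the size of the objects, which is the shortcoming the Surface Area Heuristic addresses.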

5.1.3 Surface Area Heuristic

Some primitives are larger than others and may therefore have a higher probability of being hit by a ray. This led to the Surface Area Heuristic by MacDonald et al [69], which in general has proven to be better than the two previous heuristics.

Given a bounding box, $B$, with $N$ triangles (objects), the time necessary to find the closest intersection, if any, in the bounding box is approximated with:

$$N \, C_{Intersection} \tag{5.1}$$

Here $C_{Intersection}$ is the approximate cost of intersecting one triangle. If the same bounding box, $B$, is split into two, such that the two smaller bounding boxes are children of $B$ in the kd-tree, then one can estimate the time to find the closest intersection as the cost of intersecting the expected number of triangles plus the time to make a traversal step in the kd-tree:

$$C_{Traversal} + C_{Intersection} \left( P_L N_L + P_R N_R \right) \tag{5.2}$$

In this expression $C_{Traversal}$ denotes the time it takes to go down one level in the kd-tree, $P_L$ and $P_R$ are the probabilities for traversing the left and right subtree respectively, while $N_L$ and $N_R$ denote the number of triangles in the two subtrees.

In the Surface Area Heuristic the probability of traversing the left (and right) subtree is approximated with:

$$P_L = \frac{SA_L}{SA_B} \tag{5.3}$$

Where $SA_L$ is the surface area of the bounding box of the left subtree and $SA_B$ is the surface area of the bounding box $B$.
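
Expressions (5.1) to (5.3) translate directly into code. The sketch below is only an illustration under assumed names (`AABB`, `surfaceArea`, `sahSplitCost`, `leafCost`), with the cost constants passed in as parameters since their values are not fixed here.

```cpp
#include <cassert>

struct AABB { double min[3], max[3]; };

// Surface area of an axis-aligned box.
double surfaceArea(const AABB& b) {
    double dx = b.max[0] - b.min[0];
    double dy = b.max[1] - b.min[1];
    double dz = b.max[2] - b.min[2];
    return 2.0 * (dx * dy + dy * dz + dz * dx);
}

// Cost (5.1) of intersecting all n triangles in a box without splitting.
double leafCost(int n, double cIntersection) {
    return n * cIntersection;
}

// Cost (5.2) of splitting box b at `pos` on `axis`, with the traversal
// probabilities approximated by surface area ratios as in (5.3).
double sahSplitCost(const AABB& b, int axis, double pos, int nLeft, int nRight,
                    double cTraversal, double cIntersection) {
    assert(axis >= 0 && axis < 3);
    assert(pos >= b.min[axis] && pos <= b.max[axis]);
    AABB left = b, right = b;
    left.max[axis] = pos;
    right.min[axis] = pos;
    double pL = surfaceArea(left) / surfaceArea(b);
    double pR = surfaceArea(right) / surfaceArea(b);
    return cTraversal + cIntersection * (pL * nLeft + pR * nRight);
}
```

For a unit cube split in the middle with three triangles on each side and both constants set to 1, the split cost is 1 + (2/3)(3) + (2/3)(3) = 5, which is lower than the cost of 6 for leaving all six triangles in one box.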

5.1.3.1 Split candidates

The Surface Area Heuristic gives an estimate of how feasible it is to make a split, but it does not tell where to make a split. When it was initially proposed, MacDonald et al suggested different schemes for creating possible split positions, or split candidates. The suggested approaches included the spatial median on each axis or K equally spaced intervals within the bounding box of the scene. The problem with these split candidates is that they are chosen independently of the scene.

Later, a more elaborate approach, discussed by Wald et al [109], used the sides of each object's bounding box as split candidates, assuming that all objects in the scene can be contained in a bounding box. While there are infinitely many possible split positions, this approach is guaranteed to find a position which minimizes (5.2), as these candidates are the only positions where the number of triangles in the left and right subtrees changes.

The downside is that finding the best split will take at least O(N) in each step of the kd-tree construction.

5.2 Termination Criteria

This section describes heuristics commonly used as termination criteria.

Havran et al [50] present a nice overview of termination criteria for the kd-tree, which is reproduced here.


5.2.1 Ad Hoc Termination Criteria

The termination criteria used when kd-trees and octrees were first introduced follow the common-sense rule that recursion should stop when the depth or the number of objects inside a leaf reaches a certain threshold. Havran et al refer to this as Ad Hoc Termination Criteria (AHTC). They point out that the main problem with these criteria is that they do not consider the object distribution. Furthermore, the threshold values need to be determined empirically on a scene-by-scene basis.

Finally, they conclude that setting the values incorrectly can cause two problems. One, the memory space needed can be too high and the construction time too long. Two, the combined traversal and intersection time of the kd-tree becomes too long.

5.2.2 Automatic Termination Criteria

Another approach is the Automatic Termination Criteria (ATC) [50, 109]. This heuristic is based on the Surface Area Heuristic. When presented with the choice of whether or not to split a bounding box, ATC applies a greedy solution by checking whether it is cheaper to do a traversal and intersect an expected number of triangles, (5.2), than to intersect all the triangles, (5.1):

$$N \, C_{Intersection} > C_{Traversal} + \frac{C_{Intersection}}{SA_B} \left( SA_L N_L + SA_R N_R \right)$$

This is how the literature refers to the inequality; however, dividing both sides by the intersection cost shows that there is only one degree of freedom for tuning the heuristic:

$$N > \frac{C_{Traversal}}{C_{Intersection}} + \frac{SA_L N_L + SA_R N_R}{SA_B} = SAH_{cost} \tag{5.4}$$

Thus the only real parameter for the heuristic is the cost of traversal per intersection, $C_{Traversal} / C_{Intersection}$, which logically should be positive.

While this heuristic is much more intelligent, it is by no means perfect. The main problem with the heuristic is that it is greedy. So, even if (5.4) indicates that making one split is not feasible, it may be that making two splits is feasible:

$$N > \frac{2 \, C_{Traversal}}{C_{Intersection}} + \frac{SA_{LL} N_{LL} + SA_{LR} N_{LR} + SA_{RL} N_{RL} + SA_{RR} N_{RR}}{SA_B}$$

The Automatic Termination Criteria can therefore get stuck in a local minimum, leading to premature termination. This behavior can result in trees with very shallow depth. Thus, the traversal time may be short, but the intersection time may be long due to too many objects in the leaves. Wald and Havran [109] report this to be a problem particularly for architectural scenes, where flat cells need to be split.

5.2.3 Empty Space cut-off extension

Wald and Havran [109] also report that only few modifications to ATC are known to yield improvements in general. One that does seem to give such an improvement tries to provide further encouragement to cut off empty space. This is done by reducing the cost by a constant factor, $\lambda$, when one of the children contains no objects:

$$N > \lambda \cdot SAH_{cost} \tag{5.5}$$

Where Wald and Havran define $\lambda$ by:

$$\lambda = \begin{cases} 0.8 & \text{if } N_L = 0 \lor N_R = 0 \\ 1 & \text{otherwise} \end{cases} \tag{5.6}$$
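
The termination test of (5.4), combined with the empty space factor of (5.5) and (5.6), could be sketched as follows. The struct `SplitStats` and the function names are hypothetical, and the single tuning parameter is passed in as the traversal cost per intersection.

```cpp
#include <cassert>

// Assumed summary of the best split found for a node: the number of objects
// and the bounding box surface area of each child, plus the parent's area.
struct SplitStats {
    int nLeft, nRight;
    double saLeft, saRight, saParent;
};

// Right-hand side of (5.4), scaled by the constant factor of (5.6) when one
// of the children is empty.
double scaledSahCost(const SplitStats& s, double traversalPerIntersection) {
    assert(s.saParent > 0.0);
    double cost = traversalPerIntersection +
                  (s.saLeft * s.nLeft + s.saRight * s.nRight) / s.saParent;
    double emptyBonus = (s.nLeft == 0 || s.nRight == 0) ? 0.8 : 1.0;
    return emptyBonus * cost;
}

// Greedy decision (5.5): split only if it is estimated to be cheaper than
// intersecting all n objects of the node.
bool shouldSplit(int n, const SplitStats& s, double traversalPerIntersection) {
    return n > scaledSahCost(s, traversalPerIntersection);
}
```

As an example, with four objects, a cost ratio of 1 and a split that cuts off an empty child covering a tenth of the parent's surface area, the unscaled cost is 1 + 0.9 * 4 = 4.6 and the split would be rejected, whereas the scaled cost 0.8 * 4.6 = 3.68 lets the split through.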

5.2.4 Other modifications

Other modifications mentioned by Wald and Havran [109] are:

To continue recursion for a number of steps even if the termination criteria suggest not to. This technique is, as Wald and Havran put it, supposedly hard to master for general scenes [109].

To restrict the depth to reduce memory usage.


5.3 Implemented Heuristic

In this thesis I have implemented and used the Surface Area Heuristic for constructing kd-trees. For the termination criteria, the Automatic Termination Criteria has been used together with the empty space cut-off and maximum depth modifications discussed above. Section 10.5 discusses how the maximum depth has been chosen.

Chapter 6

Parallel Ray Tracing

This chapter describes the parallel ray tracers analyzed in this thesis.

6.1 Pixel Distributed Parallel Ray Tracing

In this thesis a Pixel Distributed Parallel Ray Tracer (PDPRT) is defined as a ray tracer where the screen, or the pixels of the screen, has been divided into sections, which are then distributed among the threads of the ray tracer, see Figure 6.1.

A ray tracer which runs sequentially is very easy to extend into one which utilizes the PDPRT paradigm. Essentially, each thread renders its own part of the image, and when all threads have finished their workload the partial images are pasted into one larger image.

PDPRT is the type of ray tracing all investigated parallel ray tracers [18, 20, 17, 81] are known to use. Furthermore, it has been reported that the performance of PDPRT scales almost linearly with the number of processors used [81].
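
A static pixel distribution can be sketched with standard C++ threads. This is a minimal illustration, not the thesis implementation: `shadePixel` is a hypothetical stand-in for tracing a pixel, and instead of pasting separate images together the threads write to disjoint rows of one shared buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for tracing the rays of pixel (x, y).
inline unsigned shadePixel(int x, int y) {
    return static_cast<unsigned>(x ^ y);
}

// Render with `threads` workers, each owning one contiguous band of rows.
std::vector<unsigned> renderBands(int width, int height, int threads) {
    assert(width > 0 && height > 0 && threads > 0);
    std::vector<unsigned> image(static_cast<std::size_t>(width) * height);
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        int rowBegin = height * t / threads;
        int rowEnd = height * (t + 1) / threads;
        pool.emplace_back([&image, width, rowBegin, rowEnd] {
            // No locking is needed: the bands never overlap.
            for (int y = rowBegin; y < rowEnd; ++y)
                for (int x = 0; x < width; ++x)
                    image[static_cast<std::size_t>(y) * width + x] = shadePixel(x, y);
        });
    }
    for (auto& th : pool) th.join();   // wait until every band is finished
    return image;
}
```

Because the threads share nothing but read-only scene data and their own band, no communication is needed until the final join.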

Figure 6.1: The pixels of the screen distributed to different threads. [Diagram: the screen divided into four sections, labelled Thread 1 to Thread 4.]

6.2 Balancing the work

Consider the case where the most complex part of the image lies in just one section of the image. This happens if the section contains a lot of reflections, refractions and expensive intersection checks. Such situations are likely to give one of the threads a workload considerably heavier than that of the remaining threads. This is a problem, because when a thread is done with its workload it will have to wait for the other threads to finish; thus it will be idle while waiting for the thread with the heavy workload.

This has been reported to be a problem and has been solved by performing load balancing [18, 20, 17], such that each thread has approximately the same amount of work to do.

6.2.1 Bag of Tasks

One way of balancing the workload, which in general works quite well, is to use the Bag of Tasks paradigm. This has been applied to PDPRT in the example in Figure 6.2, where the image is divided into m = 16 sections, where m in general should be somewhat larger than the number of threads, here t = 4. These m sections constitute the bag of tasks.

The threads are initially given one section (task) to ray trace, and when they complete their section, they look into the bag and take another section to ray trace, until the bag is empty. In Figure 6.2 thread 1 is working on task 7, thread 2 on task 6, etc.


Figure 6.2: Example of the Bag of Tasks paradigm used on PDPRT. In this example m = 16 sections, where thread 1 is currently executing task 7, thread 2 task 6 and so on, while there are 7 tasks waiting to be processed.

In Figure 6.3a and 6.3b an example shows how the bag of tasks solution decreases the time each thread waits, which in the example increases the overall performance. The larger m is, the more balanced the workload will be, thus decreasing the waiting time for each thread. However, Figure 6.3b shows that every time a thread finishes a section it has to acquire exclusive access to the bag in order to avoid race conditions, which is done using some type of synchronization. Thus, it is likely that other threads will stall while waiting to get access to the bag. One should therefore at the same time try to keep m as small as possible, if this overhead is to be minimized. A proper value for m is best found through experimentation.

This type of load balancing has been applied by various people [18, 20].
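
The bag of tasks can be sketched with a shared counter from which each thread takes the next section. This is an illustration under assumptions: the bag is reduced to an atomic index, which is only one possible form of synchronization, and the function `runBagOfTasks` and its placeholder workload are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Runs m tasks on t threads and returns how often each task was executed.
std::vector<int> runBagOfTasks(int m, int t) {
    assert(m >= 0 && t > 0);
    std::vector<int> done(m, 0);
    std::atomic<int> next{0};              // the bag: index of the next task
    auto worker = [&] {
        for (;;) {
            int task = next.fetch_add(1);  // take one task out of the bag
            if (task >= m) break;          // the bag is empty
            done[task] += 1;               // placeholder for ray tracing section `task`
        }
    };
    std::vector<std::thread> pool;
    for (int i = 0; i < t; ++i) pool.emplace_back(worker);
    for (auto& th : pool) th.join();
    return done;
}
```

Every task index is handed out exactly once, so the threads never write to the same element of `done`. The shared counter is still a point of contention, which is the overhead that grows with m.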

6.3 Object Distributed Parallel Ray Tracing

The other type of parallel ray tracing investigated in this thesis will be referred to as Object Distributed Parallel Ray Tracing (ODPRT), where the objects of the scene are distributed to different threads, see Figure 6.4. It works by sending rays through the threads containing the objects which the rays need to be checked for intersection against.

Figure 6.3: How the bag of tasks paradigm can be used to balance the workload and achieve greater performance. [Timelines for Thread 1 to Thread 4 until the work is done: (a) marks time spent working and waiting; (b) marks time spent working, waiting and synchronizing.]

Investigations show one of the first uses to have been made in 1997 by Reinhard et al [87], where it is denoted the data driven approach. Reinhard et al [87] use a uniform grid; however, as future work they suggest using a hierarchical data structure to increase performance. Pharr et al [78] also used a very similar approach in 1997, where rays were stored in voxels and these voxels were processed one at a time. Thus, this approach was not parallel.

Figure 6.4: How objects are distributed among processors in Object Distributed Parallel Ray Tracing. [Diagram labels: Eye, Image, cpu 1, cpu 2, cpu 3, cpu 4.]

The main benefit of ODPRT is that by distributing the scene to multiple processors one is able to handle larger scenes. Furthermore, it can improve spatial locality, which lowers the amount of time spent on transfers from main memory to the cache. The drawback is that a ray, or its refracted or reflected ray, is very likely to travel from one part of the scene to another; thus it is necessary for the threads to communicate so that rays can be transferred between them. This was not a problem with PDPRT, as the threads could work independently of each other. Reinhard et al [87] also report load balancing to be a problem.
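
The need to transfer rays between threads can be illustrated with one queue per thread. This is purely a sketch of the idea and not the design used in this thesis: the class `RayInbox`, the `Ray` layout and the rule in `ownerOf`, which cuts the scene into equal slabs along one axis, are all assumptions.

```cpp
#include <cassert>
#include <deque>
#include <mutex>

struct Ray { double origin[3], dir[3]; int pixel; };

// One inbox per thread; other threads push rays that enter its region.
class RayInbox {
public:
    void push(const Ray& r) {
        std::lock_guard<std::mutex> lock(m_);
        rays_.push_back(r);
    }
    // Returns false when no ray is waiting.
    bool tryPop(Ray& out) {
        std::lock_guard<std::mutex> lock(m_);
        if (rays_.empty()) return false;
        out = rays_.front();
        rays_.pop_front();
        return true;
    }
private:
    std::mutex m_;
    std::deque<Ray> rays_;
};

// Hypothetical ownership rule: the scene extent [xMin, xMax) is cut into
// `owners` equal slabs and each slab belongs to one thread.
int ownerOf(double x, double xMin, double xMax, int owners) {
    assert(owners > 0 && xMax > xMin);
    int i = static_cast<int>((x - xMin) / (xMax - xMin) * owners);
    return i < 0 ? 0 : (i >= owners ? owners - 1 : i);
}
```

A thread that finds a ray leaving its region would look up the next owner and push the ray into that thread's inbox. Each such hand-over costs a lock, which is exactly the communication that PDPRT avoids.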

6.4 Others

During my investigations I have found only limited variations of the two types of parallel ray tracing mentioned above. However, it might be that other approaches are able to benefit ray tracing when parallelized.

Chapter 7

Implementation

This chapter covers the general implementation details. The first section covers the construction of the kd-tree. The next section covers details of the ODPRT implementation, which is followed by a minor section discussing load balancing for PDPRT. Finally, a section covers some of the many optimizations made in the ray tracer during this thesis.

In this thesis all software has been developed in C++:

"This is the weapon of a Jedi Knight. Not as clumsy or random as a blaster; an elegant weapon for a more civilized age." - Obi-Wan Kenobi

7.1 KD-tree construction

The process of constructing a kd-tree using the Surface Area Heuristic can in general be described as in Algorithm 7.1.1 and 7.1.2 [109].

Algorithm 7.1.1 compileKDtree(objects o) returns root node

    let b be the axis aligned bounding box containing o
    return subdivide(o, b)

Algorithm 7.1.2 subdivide(objects o, AABB b) returns node

    if Termination criteria is satisfied then
        return leaf(o)
    (pos, cost) = findBestSAHSplit(o, b)
    if cost is not too high then
        Divide b at pos into b_left and b_right
        Place all objects from o belonging to b_left in o_left
        Place all objects from o belonging to b_right in o_right
        return node(pos, subdivide(o_left, b_left), subdivide(o_right, b_right))
    else
        return leaf(o)

7.1.1 Naive implementation

Initially, a naive implementation was made, running in $O(N^2)$. This implementation took about 2½ days to construct the kd-tree for the Stanford Buddha statue, which consists of approximately 1 million triangles. The main idea behind the implementation can be seen in Algorithm 7.1.3.

Algorithm 7.1.3 findBestSAHSplitNaive(objects o, AABB b) returns (split position, cost)

 1: cost_best := inf
 2: pos_best := NAN
 3: for each axis do
 4:     Place all split candidates from o in h
 5:     for all split candidates p in h do
 6:         TL := 0    # Children in left node
 7:         TR := 0    # Children in right node
 8:         for all objects q in o do
 9:             if q is on right side of p then
10:                 TR := TR + 1
11:             if q is on left side of p then
12:                 TL := TL + 1
13:         cost := SAH cost
14:         if cost < cost_best then
15:             cost_best := cost
16:             pos_best := p
17: return (pos_best, cost_best)

7.1.2 Fast implementation

To address this slow construction time, a new construction implementation was made.

The best known algorithm for constructing a kd-tree using the Surface Area Heuristic has the time complexity O(N log N) [109]. Wald et al [109] and Benthin [17] both describe how this running time can be achieved by precalculating a sorted list of split candidates for each axis, and then maintaining the sorted list for each recursive step of the kd-tree construction.

The main difference between their solutions is that Wald et al make a certain type of split, which demands that all geometry consists of triangles. Benthin just assumes that all geometry can be contained by an axis-aligned bounding box. Benthin's method is sometimes referred to as a boxed builder [83].

Initially their solutions behave the same, as they both calculate the exact same set of split candidates. The difference occurs when a split is made. Consider Figure 7.1a and 7.1b, where three triangles are split. In Figure 7.1a the split candidates from the triangles' bounding boxes are still used as if nothing has happened. This represents the solution from Benthin. However, in Figure 7.1b a perfect split is made, which means that new split candidates are computed for triangles overlapping the splitting plane. This is the solution used by Wald et al, which according to Havran et al [50] gives a 9 to 35% speedup when traversing the kd-tree.

Figure 7.1: Example of triangles being subdivided in a kd-tree (a) using a boxed builder and (b) when doing a perfect split after splitting. [Diagram labels in both panels: split plane, triangles t1 to t6, boxes b1 and b2.]

Essentially, both O(N log N) implementations use the same approach as in Algorithm 7.1.3. The difference, however, is that the split candidates are now ordered when doing the loop at line 5, and that lines 8-12 are replaced with a routine that calculates TL and TR in constant time. To achieve this running time, each split candidate needs to contain the following information:

Position of split, p: where the split is located on the given axis.

Object identifier, id: which object provided this split candidate.

Type of split candidate, t: if p is the position where the object with id begins on the given axis, then it has type opened. If it is where it ends, then it has type closed. Finally, if the object both begins and ends at p on the given axis, then it has type planar.
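
The three fields map naturally onto a small record. The sketch below generates and sorts the candidates of one axis from the extents of the objects' bounding boxes. The names are hypothetical, and breaking ties at equal positions in the order closed, planar, opened is an assumption, since the text above does not specify a tie order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

enum class CandidateType { Closed, Planar, Opened };

struct SplitCandidate {
    double p;          // position of the split on the axis
    int id;            // object that provided the candidate
    CandidateType t;   // whether the object begins, ends or lies flat at p
};

struct Extent { double lo, hi; };   // an object's bounding box on one axis

// Builds the sorted candidate list of one axis; boxes[id] belongs to object id.
std::vector<SplitCandidate> makeCandidates(const std::vector<Extent>& boxes) {
    std::vector<SplitCandidate> h;
    for (int id = 0; id < static_cast<int>(boxes.size()); ++id) {
        const Extent& e = boxes[id];
        assert(e.lo <= e.hi);
        if (e.lo == e.hi) {
            h.push_back({e.lo, id, CandidateType::Planar});
        } else {
            h.push_back({e.lo, id, CandidateType::Opened});
            h.push_back({e.hi, id, CandidateType::Closed});
        }
    }
    // Sort by position; ties are ordered closed, planar, opened (assumed).
    std::sort(h.begin(), h.end(),
              [](const SplitCandidate& a, const SplitCandidate& b) {
                  if (a.p != b.p) return a.p < b.p;
                  return static_cast<int>(a.t) < static_cast<int>(b.t);
              });
    return h;
}
```

Sorting happens once per axis before the recursion starts. The lists then only need to be partitioned, not re-sorted, at each split, which is what keeps the overall construction within O(N log N).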

Using this extra information one can find the best split position with SAH in O(N log N) as shown in Algorithm 7.1.4.


Algorithm 7.1.4 findBestSAHSplitFast(sorted split candidates h, AABB b, int objs) returns (split position, cost)

    cost_best := inf
    pos_best := NAN
    for each axis do
        opened := 0
        closed := 0
        planar := 0
        for all split candidates p in h do
            increment opened, closed or planar according to p's type
            TL := planar + opened           # Children in left node
            TR := objs - planar - closed    # Children in right node
            cost := SAH cost
            if cost < cost_best then
                cost_best := cost
                pos_best := p
    return (pos_best, cost_best)
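
The sweep of Algorithm 7.1.4 can be sketched for a single axis as below. This is not the thesis implementation. It is a variant that groups all candidates at the same position before evaluating the cost and counts planar objects on the left side, both of which are simplifying assumptions, and all names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

enum class CandidateType { Closed, Planar, Opened };
struct SplitCandidate { double p; int id; CandidateType t; };
struct AABB { double min[3], max[3]; };
struct BestSplit { double pos; double cost; };

static double area(const AABB& b) {
    double dx = b.max[0] - b.min[0], dy = b.max[1] - b.min[1], dz = b.max[2] - b.min[2];
    return 2.0 * (dx * dy + dy * dz + dz * dx);
}

// h: candidates of one axis, sorted by position. objs: objects in the node.
BestSplit sweepAxis(const std::vector<SplitCandidate>& h, const AABB& b, int axis,
                    int objs, double cTraversal, double cIntersection) {
    assert(axis >= 0 && axis < 3);
    BestSplit best{std::numeric_limits<double>::quiet_NaN(),
                   std::numeric_limits<double>::infinity()};
    int nLeft = 0, nRight = objs;
    std::size_t i = 0;
    while (i < h.size()) {
        double p = h[i].p;
        int closing = 0, planar = 0, opening = 0;
        for (; i < h.size() && h[i].p == p; ++i) {   // gather all events at p
            if (h[i].t == CandidateType::Closed) ++closing;
            else if (h[i].t == CandidateType::Planar) ++planar;
            else ++opening;
        }
        nRight -= closing + planar;   // these objects end at or lie in the plane
        AABB left = b, right = b;
        left.max[axis] = p;
        right.min[axis] = p;
        double cost = cTraversal + cIntersection *
            (area(left) * (nLeft + planar) + area(right) * nRight) / area(b);
        if (cost < best.cost) best = {p, cost};
        nLeft += opening + planar;    // these objects are left of later planes
    }
    return best;
}
```

Each candidate is visited once and the counts are updated in constant time, so one sweep is linear in the number of candidates. Calling the function for each of the three axes and keeping the cheapest result gives the counterpart of the outer loop in Algorithm 7.1.4.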

Benthin's and Wald et al's variations have a number of drawbacks and benefits. First, consider the situation in Figure 7.2, where two splitting planes have been inserted. In this situation, the bounding box of the object intersects the top-left area, which means the object is carried over to a node it does not intersect. Havran [48] proposes three solutions:

Use perfect splits.

Place the object in both nodes.

Make a box-object intersection check to determine where the object belongs, before continuing the recursive subdivision.

Placing the object in both nodes has the problem that it uses more memory and increases the number of intersection checks needed when finding the first ray hit. The last solution uses more time in the construction phase, and I have found it to have numerical issues, which are discussed in section 7.1.2.1. Note that both placing the object in both nodes and making the box-object intersection check have the problem that they calculate the SAH cost inaccurately. The reason lies in TL and TR, which are the number of objects on each side of the split: because a boxed builder is used, one cannot find the real values of TL and TR by looking only at the split candidates. Thus, this is the only reason that perfect splits can achieve kd-trees with 9 to 35% faster traversal time.

Figure 7.2: This situation shows how the use of bounding boxes can make an object end up in a leaf of the kd-tree which it does not intersect. [Diagram: an object and two splitting planes.]

To determine which objects are inside a node, one can in general find the information by searching through the split candidates for the node being processed. This is, however, not always the case with a boxed builder, as depicted in Figure 7.3, where an object has no split candidate in the node it should be in. The situation occurs when all the split candidates are cut away by the splitting planes, as shown in the figure. The occurrence of this problem has not been mentioned by Benthin, but it can be solved by keeping track of the contained objects of a cell without sacrificing the desired asymptotic running time. Notice how this problem is avoided when using perfect splits, since an object contained in a cell will always have at least one split candidate on one or more of the axes. Therefore, keeping track of the contained objects is not necessary, as the information can be found in the list of split candidates.

Figure 7.3: This situation shows how the use of bounding boxes can cause an object not to end up in a leaf of the kd-tree which it intersects.

Another issue, which Benthin does not mention, is the case where an object, o, does not have its opened split candidate inside a node, n, being processed. This occurs in both Figure 7.2 and 7.3. The problem arises when calculating TL and TR for a split, as o will not be registered as being in the left node as it ought to be. Thus, TL will be smaller and TR will be larger than they should be. This issue will not break the algorithm, but only degrade the tree according to the Surface Area Heuristic. I have solved this problem by initializing TL to the number of objects already opened in n. Thus, I keep track of which objects no longer have their opened split candidate inside a node.

The problem with perfect splits, on the other hand, is that they will increase the time it takes to construct the kd-tree, and Wald et al [109] also mention the problem of numerical instability without too much elaboration. Furthermore, Wald et al assume that the number of triangles overlapping a splitting plane is in $O(\sqrt{N})$ even though no proof for th
