practical parallel processing for today’s rendering challenges siggraph 2001 course 40


Practical Parallel Processing for Today’s Rendering Challenges

SIGGRAPH 2001 Course 40, Los Angeles, CA


Speakers

Alan Chalmers, University of Bristol

Tim Davis, Clemson University

Erik Reinhard, University of Utah

Toshi Kato, SquareUSA


Schedule

Introduction

Parallel / Distributed Rendering Issues

Classification of Parallel Rendering Systems

Practical Applications

Summary / Discussion


Schedule

Introduction (Davis)

Parallel / Distributed Rendering Issues

Classification of Parallel Rendering Systems



The Need for Speed

Graphics rendering is time-consuming

• large amount of data in a single image

• animations much worse

Demand continues to rise for high-quality graphics


Rendering and Parallel Processing

A holy union

Many graphics rendering tasks can be performed in parallel

Often “embarrassingly parallel”


3-D Graphics Boards

Getting better

Perform “tricks” with texture mapping

Steve Jobs’ remark on constant frame rendering time


Parallel / Distributed Rendering

Fundamental Issues

• Task Management

Task subdivision, Migration, Load balancing

• Data Management

Data distributed across system

• Communication



Schedule

Introduction

Parallel / Distributed Rendering Issues (Chalmers)

Classification of Parallel Rendering Systems



Introduction

“Parallel processing is like a dog’s walking on its hind legs. It is not done well, but you are surprised to find it done at all”

[Steve Fiddes (apologies to Samuel Johnson)]

• Co-operation

• Dependencies

• Scalability

• Control



Co-operation

Solution of a single problem

• One person takes a certain time to solve the problem

• Divide problem into a number of sub-problems

• Each sub-problem solved by a single worker

• Reduced problem solution time

BUT

• co-operation overheads



Working Together

Overheads

• access to pool

• collision avoidance


Dependencies

Divide a problem into a number of distinct stages

• Parallel solution of one stage before next can start

• May be too severe → no parallel solution

each sub-problem dependent on previous stage

• Dependency-free problems

order of task completion unimportant

BUT co-operation still required


Building with Blocks

Strictly sequential

Dependency-free


Scalability

Upper bound on the number of workers

• Additional workers will NOT improve solution time

• Shows how suitable a problem is for parallel processing

• Given problem → finite number of sub-problems

more workers than tasks

• Upper bound may be (a lot) less than number of tasks

bottlenecks


Bottleneck at Doorway

More workers may result in LONGER solution time


Control

Required by all parallel implementations

• What constitutes a task

• When has the problem been solved

• How to deal with multiple stages

• Forms of control

centralised

distributed


Control Required

Sequential

Parallel


Inherent Difficulties

Failure to successfully complete

• Sequential solution

deficiencies in algorithm or data

• Parallel solution

deficiencies in algorithm or data

deadlock

data consistency


Novel Difficulties

Factors arising from implementation

• Deadlock

processor waiting indefinitely for an event

• Data consistency

data is distributed amongst processors

• Communication overheads

latency in message transfer


Evaluating Parallel Implementations

Realisation penalties

• Algorithmic penalty

nature of the algorithm chosen

• Implementation penalty

need to communicate

concurrent computation & communication activities

idle time


Solution Times


Task Management

Providing tasks to the processors

• Problem decomposition

algorithmic decomposition

domain decomposition

• Definition of a task

• Computational Model


Problem Decomposition

Exploit parallelism

• Inherent in algorithm

algorithmic decomposition

parallelising compilers

• Applying same algorithm to different data items

domain decomposition

need for explicit system software support


Abstract Definition of a Task

• Principal Data Item (PDI) - application of algorithm

• Additional Data Items (ADIs) - needed to complete computation


Computational Models

Determines the manner in which tasks are allocated to PEs

• Maximise PE computation time

• Minimise idle time

load balancing

• Evenly allocate tasks amongst the processors


Data Driven Models

All PDIs allocated to specific PEs before computation starts

Each PE knows a priori which PDIs it is responsible for

Balanced (geometric decomposition)

• evenly allocate tasks amongst the processors

• if the number of PDIs is not an exact multiple of the number of PEs then some PEs do one extra task

$$\text{portion at each PE} = \frac{\text{number of PDIs}}{\text{number of PEs}}$$
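The balanced allocation described above can be sketched in a few lines of Python. This is only an illustration: the function name `balanced_allocation` and the choice to give the extra tasks to the lowest-numbered PEs are assumptions, not part of the course material.

```python
def balanced_allocation(num_pdis, num_pes):
    """Split PDI indices 0..num_pdis-1 into one contiguous range per PE.

    Returns a list of (start, end) half-open ranges. When num_pdis is not
    an exact multiple of num_pes, the first (num_pdis % num_pes) PEs each
    receive one extra PDI (an arbitrary choice made for this sketch).
    """
    base, extra = divmod(num_pdis, num_pes)
    ranges = []
    start = 0
    for pe in range(num_pes):
        size = base + (1 if pe < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


# 24 PDIs over 3 PEs: every PE gets exactly 8 PDIs.
print(balanced_allocation(24, 3))
# 10 PDIs over 4 PEs: two PEs get 3 PDIs, two get 2.
print(balanced_allocation(10, 4))
```

Because every PE can compute its own range from `num_pdis`, `num_pes` and its own index, no task supplier is needed once computation has started.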


Balanced Data Driven

$$\text{solution time} = \text{initial distribution} + \frac{24}{3} + \text{result collation}$$


Demand Driven Model

Task computation time unknown

• Work is allocated dynamically as PEs become idle

PEs no longer bound to particular PDIs

• PEs explicitly demand new tasks

• Task supplier process must satisfy these demands


Dynamic Allocation of Tasks

$$\text{solution time} = 2 \times \text{total comms time} + \frac{\text{total comp time for all PDIs}}{\text{number of PEs}}$$


Task Supplier Process

Simple demand driven task supplier

PROCESS Task_Supplier()
Begin
    remaining_tasks := total_number_of_tasks

    (* initialise all processors with one task *)
    FOR p = 1 TO number_of_PEs
        SEND task TO PE[p]
        remaining_tasks := remaining_tasks - 1

    WHILE results_outstanding DO
        RECEIVE result FROM PE[i]
        IF remaining_tasks > 0 THEN
            SEND task TO PE[i]
            remaining_tasks := remaining_tasks - 1
        ENDIF
End (* Task_Supplier *)
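To make the behaviour of the task supplier concrete, here is a small Python simulation of the same loop. It is a sketch under stated assumptions: message passing is replaced by a priority queue of completion times, communication cost is ignored, and the names `demand_driven_schedule` and `task_costs` are hypothetical.

```python
import heapq


def demand_driven_schedule(task_costs, num_pes):
    """Simulate a demand driven task supplier.

    task_costs[i] is the (normally unknown) computation time of task i.
    Returns (finish_time, assigned), where assigned[pe] lists the task
    indices that PE ended up computing.
    """
    next_task = 0
    busy = []  # min-heap of (completion_time, pe): results still outstanding
    assigned = [[] for _ in range(num_pes)]

    # Initialise each PE with one task (while tasks remain).
    for pe in range(num_pes):
        if next_task < len(task_costs):
            heapq.heappush(busy, (task_costs[next_task], pe))
            assigned[pe].append(next_task)
            next_task += 1

    finish_time = 0
    while busy:  # WHILE results_outstanding
        now, pe = heapq.heappop(busy)  # RECEIVE result FROM PE[i]
        finish_time = now
        if next_task < len(task_costs):  # IF remaining_tasks > 0
            heapq.heappush(busy, (now + task_costs[next_task], pe))
            assigned[pe].append(next_task)
            next_task += 1
    return finish_time, assigned


finish, assigned = demand_driven_schedule([5, 1, 1, 1, 1, 1], num_pes=2)
print(finish, assigned)
```

In this run one PE is occupied by the single expensive task while the other works through the five cheap ones, which is the load balancing effect the demand driven model is meant to provide.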


Load Balancing

All PEs should complete at the same time

• Some PEs busy with complex tasks

• Other PEs available for easier tasks

• Computation effort of each task unknown

hot spot at end of processing → unbalanced solution

• Any knowledge about hot spots should be used


Task Definition & Granularity

Computational elements

• Atomic element (ray-object intersection)

sequential problem’s lowest computational element

• Task (trace complete path of one ray)

parallel problem’s smallest computational element

• Task granularity

number of atomic units in one task


Task Packet

Unit of task distribution

• Informs a PE of which task(s) to perform

• Task packet may include

indication of which task(s) to compute

data items (the PDI and (possibly) ADIs)

• Task packet for ray tracer: one or more rays to be traced


Algorithmic Dependencies

Algorithm adopted for parallelisation:

• May specify order of task completion

• Dependencies MUST be preserved

• Algorithmic dependencies introduce:

synchronisation points → distinct problem stages

data dependencies → careful data management


Distributed Task Management

Centralised task supply

• All requests for new tasks to System Controller

bottleneck

• Significant delay in fetching new tasks

Distributed task supply

• task requests handled remotely from System Controller

• spread of communication load across system

• reduced time to satisfy task request


Preferred Bias Allocation

Combining Data driven & Demand driven

• Balanced data driven

tasks allocated in a predetermined manner

• Demand driven

tasks allocated dynamically on demand

• Preferred Bias: Regions are purely conceptual

enables the exploitation of any coherence


Conceptual Regions

• task allocation no longer arbitrary


Data Management

Providing data to the processors

• World model

• Virtual shared memory

• Data manager process

local data cache

requesting & locating data

• Consistency


Remote Data Fetches

Advanced data management

• Minimising communication latencies

Prefetching

Multi-threading

Profiling

• Multi-stage problems


Data Requirements

Requirements may be large

• Fit in the local memory of each processor

world model

• Too large for each local memory

distributed data

provide virtual world model/virtual shared memory


Virtual Shared Memory (VSM)

Providing a conceptual single memory space

• Memory is in fact distributed

• Request is the same for both local & remote data

• Speed of access may be (very) different

Level (higher to lower)    Examples
System software            Provided by DM process
Compiler                   HPF, ORCA
Operating system           Coherent paging
Hardware                   DDM, DASH, KSR-1


Consistency

Read/write can result in inconsistencies

• Distributed memory

multiple copies of the same data item

• Updating such a data item

update all copies of this data item

invalidate all other copies of this data item
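The two options for updating a replicated data item (update all copies, or invalidate all other copies) can be contrasted with a toy Python model. Everything here is hypothetical scaffolding for illustration: the class `SharedItemDirectory`, its dictionaries, and the idea of a single authoritative `home` store are assumptions rather than the design of any particular data manager.

```python
class SharedItemDirectory:
    """Toy model of data items cached at several PEs."""

    def __init__(self):
        self.home = {}    # item id -> authoritative value
        self.caches = {}  # pe id -> {item id: locally cached value}

    def read(self, pe, item):
        cache = self.caches.setdefault(pe, {})
        if item not in cache:  # local miss: fetch from the home copy
            cache[item] = self.home[item]
        return cache[item]

    def write_invalidate(self, pe, item, value):
        """Write, then invalidate every other PE's copy."""
        self.home[item] = value
        for other, cache in self.caches.items():
            if other != pe:
                cache.pop(item, None)  # stale copy dropped; refetched on next read
        self.caches.setdefault(pe, {})[item] = value

    def write_update(self, pe, item, value):
        """Write, then push the new value to every existing copy."""
        self.home[item] = value
        self.caches.setdefault(pe, {})[item] = value
        for cache in self.caches.values():
            if item in cache:
                cache[item] = value


d = SharedItemDirectory()
d.home["x"] = 1
d.read(0, "x")
d.read(1, "x")
d.write_invalidate(0, "x", 2)   # PE 1 loses its copy
print("x" in d.caches[1], d.read(1, "x"))
```

Invalidation sends less data at write time but costs a remote fetch on the next read; updating does the opposite.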


Minimising Impact of Remote Data

Failure to find a data item locally → remote fetch

• Time to find data item can be significant

• Processor idle during this time

• Latency difficult to predict

e.g. depends on current message densities

• Data management must minimise this idle time


Data Management Techniques

Hiding the Latency

• Overlapping the communication with computation

prefetching

multi-threading

Minimising the Latency

• Reducing the time of a remote fetch

profiling

caching


Prefetching

Exploiting knowledge of data requests

• A priori knowledge of data requirements

nature of the problem

choice of computational model

• DM can prefetch them (up to some specified horizon)

available locally when required

overlapping communication with computation


Multi-Threading

Keeping PE busy with useful computation

• Remote data fetch → current task stalled

• Start another task (Processor kept busy)

separate threads of computation (BSP)

• Disadvantages: Overheads

Context switches between threads

Increased message densities

Reduced local cache for each thread


Results for Multi-Threading

• More threads than the optimal number reduces performance

• “Cache 22” situation

less local cache → more data misses → more threads


Profiling

Reducing the remote fetch time

• At the end of computation all data requests are known

if known then can be prefetched

• Monitor data requests for each task

build up a “picture” of possible requirements

• Exploit spatial coherence (with preferred bias allocation)

prefetch those data items likely to be required


Spatial Coherence


Schedule

Classification of Parallel Rendering Systems (Davis)

Practical Applications

Summary / Discussion


Classification of Parallel Rendering Systems

Parallel rendering performed in many ways

Classification by

• task subdivision

polygon rendering

ray tracing

• hardware

parallel hardware

distributed computing


Classification by Task Subdivision

Original rendering task broken into smaller pieces to be processed in parallel

Depends on type of rendering

Goals

• maximize parallelism

• minimize overhead, including communication


Task Subdivision in Polygon Rendering

Rendering many primitives

Polygon rendering pipeline

• geometry processing (transformation, clipping, lighting)

• rasterization (scan conversion, visibility, shading)


Polygon Rendering Pipeline

[Figure: graphics database traversal → geometry processing (G … G) → rasterization (R … R) → display]


Primitive Processing and Sorting

View processing of primitives as sorting problem

• primitives can fall anywhere on or off the screen

Sorting can be done in either software or hardware, but mostly done in hardware


Primitive Processing and Sorting

Sorting can occur at various places in the rendering pipeline

• during geometry processing (sort-first)

• between geometry processing and rasterization (sort-middle)

• during rasterization (sort-last)


Sort-first

[Figure: graphics database (arbitrarily partitioned) → (pre-transform) → redistribute “raw” primitives → geometry processing (G … G) → rasterization (R … R) → display]


Sort-first Method

Each processor (renderer) assigned a portion of the screen

Primitives arbitrarily assigned to processors

Processors perform enough calculations to send primitives to correct renderers

Processors then perform geometry processing and rasterization for their primitives in parallel
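The redistribution step in sort-first ("enough calculations to send primitives to correct renderers") amounts to finding which screen regions a primitive's screen-space bounding box touches. The Python sketch below assumes a regular grid of rectangular screen tiles and bounding boxes that have already been projected to pixel coordinates; the function names and the tile layout are illustrative assumptions.

```python
def tiles_overlapped(bbox, screen_w, screen_h, tiles_x, tiles_y):
    """Return the set of (tx, ty) tiles touched by a screen-space bbox.

    bbox is (xmin, ymin, xmax, ymax) in pixels and may extend off screen.
    """
    xmin, ymin, xmax, ymax = bbox
    if xmax < 0 or ymax < 0 or xmin >= screen_w or ymin >= screen_h:
        return set()  # entirely off screen: no renderer needs it
    tile_w = screen_w / tiles_x
    tile_h = screen_h / tiles_y
    tx0 = max(0, int(xmin // tile_w))
    tx1 = min(tiles_x - 1, int(xmax // tile_w))
    ty0 = max(0, int(ymin // tile_h))
    ty1 = min(tiles_y - 1, int(ymax // tile_h))
    return {(tx, ty) for tx in range(tx0, tx1 + 1) for ty in range(ty0, ty1 + 1)}


def sort_first(primitives, screen_w, screen_h, tiles_x, tiles_y):
    """Bucket primitive ids by the screen tile (renderer) that must process them."""
    buckets = {}
    for prim_id, bbox in primitives.items():
        for tile in tiles_overlapped(bbox, screen_w, screen_h, tiles_x, tiles_y):
            buckets.setdefault(tile, []).append(prim_id)
    return buckets


prims = {"a": (10, 10, 50, 50), "b": (300, 10, 340, 50), "c": (700, 10, 800, 50)}
print(sort_first(prims, 640, 480, 2, 2))
```

Primitive `b` straddles a tile boundary and is sent to two renderers, which illustrates the duplication of effort that arises when primitives fall into more than one screen area.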


Screen Subdivision


Sort-first Discussion

+ Communication costs can be kept low

- Duplication of effort if primitives fall into more than one screen area

- Load imbalance if primitives concentrated

- Very few, if any, sort-first renderers built


Sort-middle

[Figure: geometry processing (G … G) → redistribute screen-space primitives → rasterization (R … R) → display]


Sort-middle Method

Primitives arbitrarily assigned to renderers

Each renderer performs geometry processing on its primitives

Primitives then redistributed to rasterizers according to screen region


Sort-middle Discussion

+ Natural breaking point in graphics pipeline

- Load imbalance if primitives concentrated in particular screen regions

+ Several successful hardware implementations

• PixelPlanes 5

• SGI Reality Engine


Sort-last

[Figure: geometry processing (G … G) → rasterization (R … R) → redistribute pixels, samples, or fragments (compositing) → display]


Sort-last Method

Primitives arbitrarily distributed to renderers

Each renderer computes pixel values for its primitives

Pixel values are then sent to processors according to screen location

Rasterizers perform visibility and compositing
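The final visibility and compositing step of sort-last can be illustrated with a depth comparison per pixel. In this Python sketch each renderer's output is modelled as a sparse dictionary of fragments; that representation, the function name `composite`, and the "smaller depth is nearer" convention are assumptions made for the example.

```python
def composite(partial_images):
    """Merge per-renderer images by keeping the nearest fragment per pixel.

    Each partial image maps (x, y) -> (depth, colour); smaller depth is
    nearer to the viewer.
    """
    final = {}
    for image in partial_images:
        for pixel, (depth, colour) in image.items():
            if pixel not in final or depth < final[pixel][0]:
                final[pixel] = (depth, colour)
    return final


renderer_a = {(0, 0): (5.0, "red"), (1, 0): (2.0, "red")}
renderer_b = {(0, 0): (3.0, "blue")}
print(composite([renderer_a, renderer_b]))
```

Every renderer may contribute a fragment for every pixel, so the volume of data reaching the compositing stage grows with both image size and the number of renderers.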


Sort-last Discussion

+ Less prone to load imbalance

- Pixel traffic can be high

+ Some working systems

• Denali


Task Subdivision in Ray Tracing

Ray tracing often prohibitively expensive on single processor

Prime candidate for parallelization

• each pixel can be rendered independently

Processing easily subdivided

• image space subdivision

• object space subdivision

• object subdivision


Image Space Subdivision


Image Space Subdivision Discussion

+ Straightforward

+ High parallelism possible

- Entire scene database must reside on each processor

• need adequate storage

+ Low processor communication


Image Space Subdivision Discussion

- Load imbalance possible

• screen space may be further subdivided

+ Used in many parallel ray tracers

• works better with MIMD machines

• distributed computing environments


Object Space Subdivision

3-D object space divided into voxels

Each voxel assigned to a processor

Rays are passed from processor to processor as voxel space is traversed


Object Space Subdivision Discussion


+ Each processor needs only scene information associated with its voxel(s)

- Rays must be tracked through voxel space

+ Load balance good

- Communication can be high

+ Some successful systems


Object Partitioning

Each object in the scene is assigned to a processor

Rays passed as messages between processors

Processors check for intersection


Object Partitioning Discussion

+ Load balancing good

- Communication high due to ray message traffic

- Fewer implementations


Schedule

Introduction

Parallel / Distributed Rendering Issues

Classification of Parallel Rendering Systems

Practical Applications

• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence (Davis)

• Interactive Ray Tracing

• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer

Summary / Discussion


Practical Experiences at Clemson

Problems with Rendering

Current Resources

Deciding on a Solution

A New Render Farm


A Demand for Rendering

Computer Animation course

3 SIGGRAPH animation submissions

• render over semester break


Current Resources

dedicated lab

• 8 SGI O2s (R12000, 384 MB)

general-purpose lab

• 4 SGI O2s

shared lab

• dual-pipe Onyx2 (8 R12000, 8 GB)

• 10 SGI O2s (R12000, 256 MB)

offices

• 5 SGI O2s


Resource Problems

Rendering prohibits interactive sessions

Little organized control over resources

• users must be self-monitoring

m renders on n machines → 1 render on n/m machines

Disk space

Cross-platform distributed rendering to PCs problematic

• security (rsh)

• distributed rendering software

• directory paths


Short-term Solutions

Distributed rendering restricted to late night

Resources partitioned


Problems with Maya

[Video]

Traditional distributed computing problems

• dropped frames

• incomplete frames

• tools developed


Problems with Maya

Tools (DropCheck)


Problems with Maya

Tools (Load Scan)


Problems with Maya

Animation inconsistencies

• next slide

Some frames would not render

Particle system inconsistencies


Problems with Maya


Rendering Tips

Layering


Deciding on a Solution - RenderDrive

RenderDrive by ART (Advanced Rendering Technology)

• network appliance for ray tracing

• 16-48 specialized processors

• claims speedups of 15-40 over Pentium III

• 768MB to 1.5GB memory

• 4GB hard disk cache


Deciding on a Solution - RenderDrive

• plug-in interface to Maya

• Renderman ray tracer

• $15K - $25K


Deciding on a Solution - PCs

Network of PCs as a render farm

10 PCs each with 1.4GHz, 1GB memory, and 40GB hard drive

Maya will run under Windows 2000 or Linux (Maya 4.0)

Distributed rendering software not included for Windows 2000


Deciding on a Solution - PCs Win

RenderDrive had some unusual anomalies

Interactive capabilities

Scan-line or ray tracing

Distributed rendering software may be included

Problems with security still exist

• shared file system


Schedule


• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence (Davis)

• Interactive Ray Tracing

• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer



Agenda

Background

Temporal Depth-Buffer

Frame Coherence Algorithm

Parallel Frame Coherence Algorithm


Background - Ray Tracing

Closest to physical model of light

High cost in terms of time / complexity


Background - Frame Coherence

Frame coherence

• those pixels that do not change from one frame to the next

• derived from object and temporal coherence

We should not have to re-compute those pixels whose values will not change

• writing pixels to frame files


Background - Test Animation

Glass Bounce (60 frames at 320x240; 5 obj)


Background - Frame Coherence


Previous Work

Frame coherence

• moving camera/static world [Hubschman and Zucker 81]

• estimated frames [Badt 88]

• stereoscopic pairs [Adelson and Hodges 93/95]

• 4D bounding volumes [Glassner 88]

• voxels and ray tracking [Jevans 92]

• incremental ray tracing [Murakami 90]


Previous Work (cont.)

Distributed computing

• Alias and 3D Studio

• most major productions starting with Toy Story [Henne 96]


Goals

Render exactly the same set of frames in much less time

Work in conjunction with other optimization techniques

Run on a variety of platforms

Extend a currently popular ray tracer (POV-Ray) to allow for general use


Temporal Depth-Buffer

Similar to traditional z-buffer

For each pixel, store a temporal depth in frame units

[Figure: a moving object shown at its positions in frames 1, 2 and 3, with the resulting temporal depth-buffer]

5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 3 3 3 5 5 5
5 5 5 5 5 5 5 3 3 3 5 5 5
5 5 5 5 5 2 2 2 3 3 5 5 5
5 5 5 5 5 2 2 2 3 3 5 5 5
5 5 5 5 2 2 2 2 5 5 5 5 5
5 5 5 5 2 2 2 2 5 5 5 5 5
5 5 5 5 2 2 2 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5


Frame Coherence Algorithm


Identify volume within 3D object space where movement occurs

Divide volume uniformly into voxels

For each voxel, create a list of frame numbers in which changing objects inhabit this voxel



In each frame, track rays through voxels for each pixel

From the voxels traversed, find the one with the lowest frame number

Record that number in the temporal depth-buffer




for each frame of the animation
    for each pixel that needs to be computed for this frame
        trace the rays for this pixel
        for each voxel that any of these rays intersect
            get the next frame number to compute
        set the t-buffer entry to the lowest frame number found
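The loop above can be sketched in Python with the ray tracer itself abstracted away. This is an illustration under several assumptions: `voxels_hit(pixel, frame)` and `shade(pixel, frame)` are hypothetical callbacks standing in for ray traversal and shading, frames are numbered from 1, and each voxel's frame list is assumed to already contain every frame in which a pixel seeing that voxel must be refreshed.

```python
import bisect


def next_change(frames_sorted, current_frame, last_frame):
    """First listed frame after current_frame, or last_frame + 1 if none."""
    i = bisect.bisect_right(frames_sorted, current_frame)
    return frames_sorted[i] if i < len(frames_sorted) else last_frame + 1


def render_with_tbuffer(num_frames, pixels, voxel_frames, voxels_hit, shade):
    """Return the number of pixels recomputed in each frame."""
    tbuffer = {p: 1 for p in pixels}  # every pixel is computed in frame 1
    image = {}
    recomputed = []
    for frame in range(1, num_frames + 1):
        count = 0
        for p in pixels:
            if tbuffer[p] > frame:
                continue  # temporal depth not reached: reuse the old value
            image[p] = shade(p, frame)
            # Lowest upcoming frame number over all voxels the rays crossed.
            tbuffer[p] = min(
                (next_change(voxel_frames.get(v, []), frame, num_frames)
                 for v in voxels_hit(p, frame)),
                default=num_frames + 1,
            )
            count += 1
        recomputed.append(count)
    return recomputed


# Pixel "a" looks through voxel "v1", which changes in frames 2 and 3;
# pixel "b" only ever sees the static voxel "v0".
counts = render_with_tbuffer(
    num_frames=5,
    pixels=["a", "b"],
    voxel_frames={"v1": [2, 3]},
    voxels_hit=lambda p, f: ["v1"] if p == "a" else ["v0"],
    shade=lambda p, f: f,
)
print(counts)
```

After the first frame only the pixel that looks through the changing voxel is recomputed, and nothing is recomputed once that voxel stops changing.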





Voxel Volume

Uniform voxel spatial subdivision

Voxel can be non-cubical

Ways to determine voxel volume

• user-supplied

• pre-processing phase

active voxel marking

in distributed environment, done by master or slave or both


Frame Coherence Example


Test Animation

Pool Shark (620 frames at 640x480; 174 obj)


Test Animations - Problem

Bounding box problem


Results

                              standard     frame coherence   ratio of frame coherence   speedup
                              algorithm    algorithm         to standard
total number of rays          47,841,269   13,259,380        0.27                       --
total parse time              0:48         1:30              1.88                       --
first frame rendering time    6:34         8:49              1.34                       0.75
average frame rendering time  7:15         3:05              0.43                       2.33
total frame rendering time    5:26:55      2:19:51           0.43                       2.33


Frame Coherence Discussion

Localized movement can have global effects

Performance depends on both the number and complexity of recomputed pixels

Issues

• overhead

• antialiasing

• motion blur


Temporal Depth-Buffer Discussion

Uses less memory than other methods

Simple

Can be used with other algorithms


Parallel Frame Coherence Algorithm

Distributed computing environment

1-8 Sun Sparc Ultra 5 processors running at 270 MHz

Coarse-grain parallelism

Load balancing

• divide work among processors

• keep data together for frame coherence



Image space subdivision

• each processor computes a subregion for the entire length of the run

Recursively subdivide subsequences to keep processors busy


Screen Subdivision



Coarse bin packing: find block with smallest number of computed frames

Keep statistics on average first frame time and average coherent frame time

Find a hole in the sequence

Leave some free frames before new start


Load Balancing Example

[Figure: frame sequence 10 to 20 with hole start, tmp start, start frame and hole end marked]

first frame time = 30, avg frame time = 15

current processor speed = 4, new processor speed = 3

$$\text{tmp start} = \text{hole start} + \frac{\text{first frame time}}{\text{avg frame time}} \cdot \frac{\text{current processor speed}}{\text{new processor speed}} = 11 + \frac{30}{15} \cdot \frac{4}{3} = 11 + \frac{8}{3} \approx 11 + 3 = 14$$

$$\text{start frame} = \text{tmp start} + \frac{\text{hole end} - \text{tmp start} + 1}{2} \cdot \frac{\text{current processor speed}}{\text{new processor speed}} = 14 + \frac{19 - 14 + 1}{2} \cdot \frac{4}{3} = 14 + \frac{6}{2} \cdot \frac{4}{3} = 14 + 4 = 18$$


Results - Parallel Frame Coherence

                              standard     parallel with   speedup   parallel frame coherence   speedup
                              algorithm    8 machines                with 8 machines
total number of rays          47,841,269   49,161,582      1.03      18,299,347                 0.38
total parse time              0:48         --              --        --                         --
first frame rendering time    6:34         --              --        --                         --
average frame rendering time  7:15         1:05            6.7       :34                        12.9
total frame rendering time    5:26:55      49:49           6.6       25:47                      12.9


Results

number of    total number   ratio to           average frame    total rendering   speedup
processors   of rays        single processor   rendering time   time
1            15,731,252     1.00               2:42             2:42:26           1.00
2            5,890,290      0.37               :38              38:25             4.23
4            5,913,926      0.38               :22              22:12             7.31
8            6,063,338      0.39               :16              16:28             9.86
12           6,086,781      0.39               :12              11:37             13.98
16           6,323,673      0.40               :11              10:50             14.99


Another Test Animation

Soda Worship (60 frames at 160x120; 839 obj)


Results

number of    total number   ratio to           average frame    total rendering   speedup
processors   of rays        single processor   rendering time   time
1            44,454,548     1.00               28:10            28:10:10          1.00
2            22,163,526     0.50               15:11            11:48:11          2.39
4            22,286,422     0.50               7:45             4:27:26           6.32
8            22,409,023     0.50               3:58             2:16:34           12.38
12           23,125,140     0.52               2:38             1:31:05           18.56
16           23,180,741     0.52               2:02             1:12:15           23.39


Results Discussion

Good speedup

Multiplicative speedup with both techniques (frame coherence and distributed computing)

Speedup limitations

• voxel approximation

• writing pixels to frame files (communication)


Conclusions

Frame coherence algorithm combined with distributed computing provides good speedup

Algorithm scales well

Techniques are useful and accessible to a wide variety of users

Benefits depend on inherent properties of the animation


Shameless Advertisement

Masters of Fine Arts in Computing (MFAC)

• special effects and animation courses

• two year program

Clemson Computer Animation Festival in Fall 2002


Schedule


• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence

• Interactive Ray Tracing (Reinhard)

• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer



Overview

Introduction

Interactive ray tracer

Animation and interactive ray tracing

Sample reuse techniques


Introduction


Interactive Ray Tracing

Renders effects not available using other rendering algorithms

Feasible on high-end supercomputers provided suitable hardware is chosen

Scales sub-linearly in scene complexity

Scales almost linearly in number of processors


Hardware Choices

Shared memory vs. distributed memory

Latency and throughput for pixel communication

Choice: shared memory

• This section of the course focuses on SGI Origin series supercomputers


Shared Memory

Shared address space

Physically distributed memory

ccNUMA architecture


SGI Origin 2000 Architecture


Implications

ccNUMA machines are easy to program, but it is more difficult to generate efficient code

Memory mapping and processor placement may be important for certain applications

Topic returns later in this course


Overview

Introduction

Interactive ray tracer

Animation and interactive ray tracing

Sample re-use techniques


Interactive Ray Tracing


Basic Algorithm

Master-slave configuration

Master (display thread) displays results and farms out ray tasks

Slaves produce new rays

Task size reduced towards end of each frame

• Load balancing

• Cache coherence
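One common way to reduce the task size towards the end of a frame is to hand out a fixed fraction of the remaining work each time. The Python sketch below uses that rule purely as an assumption: the slides do not say how the task size is reduced, and the name `task_sizes`, the `factor` parameter and the minimum size are all hypothetical.

```python
def task_sizes(total_pixels, num_workers, min_size=16, factor=2):
    """Return a list of task sizes (in pixels) that shrink as the frame completes.

    Each task takes remaining / (factor * num_workers) pixels, never fewer
    than min_size (except for the final leftover task).
    """
    sizes = []
    remaining = total_pixels
    while remaining > 0:
        size = max(min_size, remaining // (factor * num_workers))
        size = min(size, remaining)  # the last task takes whatever is left
        sizes.append(size)
        remaining -= size
    return sizes


sizes = task_sizes(1000, num_workers=4)
print(sizes[:5], "...", sizes[-3:])
```

Large early tasks keep neighbouring rays on the same processor, which helps cache behaviour, while small late tasks let all processors finish the frame at nearly the same time.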


Tracing a Single Ray

Use spatial subdivisions for ray acceleration (assumed familiar)

Use grid or bounding volume hierarchy

Could be optimized further, but good results have been obtained with these acceleration structures

Efficiency mainly due to low level optimization


Low Level Optimization

Ray tracing in general:

• Ray coherence: neighboring rays tend to intersect the same objects

• Cache coherence: objects previously intersected are likely to still reside in cache for current ray

• Memory access patterns are important (next slide)


Memory Access

On SGI Origin series computers:

• Memory allocated for a specific process may be located elsewhere in the machine → reading memory may be expensive

• Processes may migrate to other processors when executing a system call → whole cache becomes invalidated; previously local memory may now be remote and more expensive to access


Memory Access (2)

Pin down processes to processors

Allocate memory close to where the processes run that will use this memory

Use sysmp and sproc for processor placement

Use mmap or dplace for memory placement


Further Low Level Optimizations

Know the architecture you work on (Appendix III.A in the course notes)

Use profiling to find expensive bits of code and cache misses (Appendix III.B in the course notes)

Use padding to fit important data structures on a single cache line


Frameless Rendering

Display pixel as soon as it is computed

No concept of frames

• Perceptually preferable

• Equivalent of a full frame takes longer to compute

• Less efficient exploitation of cache coherence

• This alternative will return later in this course


Overview



Animation and Interactive Ray Tracing


Why Animation?

Once interactive rendering is feasible, walk-through is not enough

Desire to manipulate the scene interactively

Render preprogrammed animation paths


Issues to Be Addressed

What stops us from animating objects?

• Answer: spatial subdivisions

• Acceleration structures normally built during pre-processing

• They assume objects are stationary


Possible Solutions

Target applications that require a small number of objects to be manipulated/animated

• Render these objects separately

Traversal cost will be linear in the number of animated objects

Only feasible for extremely small number of objects


Possible Solutions (2)

Target small number of manipulated or animated objects

• Modify existing spatial subdivisions

For each frame delete object from data structure

Update object’s coordinates

Re-insert object into data structure

• This is our preferred approach


Spatial Subdivision

Should be able to deal with

• Basic operations such as insertion and deletion of objects should be rapid

• User manipulation can cause the extent of the scene to grow


Subdivisions Investigated

Regular grid

Hierarchical grid

• Borrows from octree spatial subdivision

• In our case this is a full tree: all leaf nodes are at the same depth

Both acceleration structures are investigated in the next few slides


Regular Grid Data Structure

We assume familiarity with spatial subdivisions!


Object Insertion Into Grid

Compute bounding box of object

Compute overlap of bounding box with grid voxels

Object is inserted into overlapping voxels

Object deletion works similarly
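The insertion and deletion steps just listed, together with the delete, update, re-insert cycle used for animated objects, can be sketched as follows. This is a plain regular grid for illustration only: the class name `RegularGrid`, the cubic voxels, and the clamping of out-of-range boxes to the boundary voxels are assumptions, and clamping is not how the interactive grid in this course handles expanding scenes.

```python
from collections import defaultdict
from itertools import product


class RegularGrid:
    """Uniform grid of cubic voxels holding sets of object ids."""

    def __init__(self, origin, cell_size, resolution):
        self.origin = origin          # (x, y, z) of the grid's minimum corner
        self.cell_size = cell_size    # edge length of one voxel
        self.resolution = resolution  # number of voxels along each axis
        self.cells = defaultdict(set)

    def _overlapped(self, bbox):
        """Yield (i, j, k) indices of all voxels the bounding box overlaps."""
        lo, hi = bbox
        axes = []
        for axis in range(3):
            first = int((lo[axis] - self.origin[axis]) // self.cell_size)
            last = int((hi[axis] - self.origin[axis]) // self.cell_size)
            first = max(0, min(self.resolution - 1, first))  # clamp (simplification)
            last = max(0, min(self.resolution - 1, last))
            axes.append(range(first, last + 1))
        return product(*axes)

    def insert(self, obj_id, bbox):
        for cell in self._overlapped(bbox):
            self.cells[cell].add(obj_id)

    def delete(self, obj_id, bbox):
        for cell in self._overlapped(bbox):
            self.cells[cell].discard(obj_id)

    def move(self, obj_id, old_bbox, new_bbox):
        """Per-frame update: delete, change coordinates, re-insert."""
        self.delete(obj_id, old_bbox)
        self.insert(obj_id, new_bbox)


grid = RegularGrid(origin=(0, 0, 0), cell_size=1.0, resolution=4)
old_box = ((0.5, 0.5, 0.5), (1.5, 0.5, 0.5))
new_box = ((2.2, 0.5, 0.5), (2.8, 0.5, 0.5))
grid.insert("sphere", old_box)
grid.move("sphere", old_box, new_box)
print(sorted(cell for cell, objs in grid.cells.items() if objs))
```

The cost of `insert` and `delete` is proportional to the number of voxels the bounding box overlaps, so it grows with the size of the object relative to the voxels.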


Extensions to Regular Grid

Dealing with expanding scenes requires

• Modifications to object insertion/deletion

• Ray traversal


Extensions to Regular Grid (2)


Features of New Grid Data Structure

We call this an ‘Interactive Grid’

• Straightforward object insertion/deletion

• Deals with expanding scenes

• Insertion cost depends on relative object size

• Traversal cost somewhat higher than for regular grid


Hierarchical Grid

Objectives

• Reduce insertion/deletion cost for larger objects

• Retain advantages of interactive grid


Hierarchical Grid (2)

Build full octree with all leaf nodes at the same level

• Allow objects to reside in leaf nodes as well as in nodes higher up in the hierarchy

• Each object can be inserted into one or more voxels of at most one level in the hierarchy

• Small objects reside in leaf nodes, large objects reside elsewhere in the hierarchy
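One way to pick the single level an object lives on is sketched below. The selection rule (deepest level whose cell is at least as large as the object's bounding-box extent) is an assumption made for illustration; the slides do not state the exact criterion, and `choose_level` is a hypothetical name.

```python
def choose_level(extent, scene_size, max_depth):
    """Return the octree level (0 = root, max_depth = leaves) for an object.

    Assumed rule: use the deepest level whose cell size is at least the
    object's largest bounding-box extent, so the object overlaps only a
    few cells of that level.
    """
    for level in range(max_depth, -1, -1):
        cell_size = scene_size / (2 ** level)
        if extent <= cell_size:
            return level
    return 0  # larger than the whole scene box: keep it at the root


# Scene of size 16 with a depth-4 full octree (leaf cells of size 1).
levels = [choose_level(e, 16.0, 4) for e in (0.5, 1.0, 3.0, 20.0)]
```

Under this rule a small object lands in the leaves, while a large object is stored once in a coarse cell instead of being copied into many leaf voxels.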



Features:

• Deals with expanding scenes similar to interactive grid

• Reduced insertion/deletion cost

• Traversal cost somewhat higher than interactive grid


Test Scenes


Video


Measurements

We measure

• Traversal cost of

Interactive grid

Hierarchical grid

Regular grid

• Object update rates of

Interactive grid

Hierarchical grid


Framerate vs. Grid Size (Sphereflake)


Framerate vs. Grid Size (Triangles)


Framerate Over Time (Sphereflake)


Framerate Over Time (Triangles)



Interactive manipulation of ray traced scenes is both desirable and feasible using these modifications to regular and hierarchical grids

Slight impact on traversal cost

(More results available in course notes)


Overview


Sample Re-use Techniques


Brute Force Ray Tracing

Enables interactive ray tracing

Does not allow large image sizes

Does not scale to scenes with high depth complexity


Solution

Exploit temporal coherence

Re-use results from previous frames


Practical Solutions

Tapestry (Simmons et al. 2000)

• Focuses on complex lighting simulation

Render cache (Walter et al. 1999)

• Addresses scene complexity issues

• Explained next

Parallel render cache (Reinhard et al. 2000)

• Builds on Walter’s render cache

• Explained next


Render Cache Algorithm

Basic setup

• One front-end for:

Displaying pixels

Managing previous results

• Parallel back-end for:

Producing new pixels


Render Cache Front-end

Frame based rendering

For each frame do:

• Project existing points

• Smooth image and display

• Select new rays using heuristics

• Request samples from back-end

• Insert new points into point cloud
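The projection step of the loop above can be sketched as a point reprojection with a depth test. This is a simplified illustration under stated assumptions, not the render cache source: the camera is a hypothetical pinhole at `cam_pos` looking along -z, `reproject` is an invented name, and the smoothing and sample-selection heuristics are omitted (empty pixels are simply reported as candidates for new rays).

```python
import math


def reproject(points, cam_pos, focal, width, height):
    """Project cached 3D points into the current view.

    points: list of ((x, y, z), color). Returns (image, holes) where image
    maps (px, py) to the color of the nearest point and holes lists pixels
    that received no point.
    """
    depth = {}
    image = {}
    for (x, y, z), color in points:
        dx, dy, dz = x - cam_pos[0], y - cam_pos[1], z - cam_pos[2]
        if dz >= 0:
            continue  # point is behind the camera
        d = -dz
        px = math.floor(focal * dx / d + width / 2)
        py = math.floor(focal * dy / d + height / 2)
        if 0 <= px < width and 0 <= py < height:
            if d < depth.get((px, py), float("inf")):
                depth[(px, py)] = d       # keep the nearest point per pixel
                image[(px, py)] = color
    holes = [(px, py) for py in range(height) for px in range(width)
             if (px, py) not in image]
    return image, holes


cloud = [((0.0, 0.0, -2.0), "red"), ((0.0, 0.0, -4.0), "blue"),
         ((0.0, 0.0, 1.0), "green")]
frame1, holes1 = reproject(cloud, (0.0, 0.0, 0.0), 2.0, 4, 4)
frame2, holes2 = reproject(cloud, (2.0, 0.0, 0.0), 2.0, 4, 4)
```

In the first frame the blue point is hidden behind the red one; after the camera moves sideways both become visible, which is the kind of previous-frame result the cache re-uses instead of tracing new rays.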


Render Cache


Render Cache (2)

Point reprojection is relatively cheap

Smooth camera movement for small images

Does not scale to large images or large numbers of renderers

• The front-end becomes the bottleneck


Parallel Render Cache

Aim: remove front-end bottleneck

• Distribute point reprojection functionality

• Integrate point reprojection with renderers

• Front-end only displays results


Parallel Render Cache (2)


Parallel Render Cache (3)

Features:

• Scalable behavior for scene complexity

• Scalable in number of processors

• Allows larger images to be rendered

• Retains artifacts from render cache

• Introduces new artifacts


Artifacts

Render cache artifacts at tile boundaries

Image deteriorates during camera movement

These artifacts are deemed more acceptable than loss of smooth camera movement!


Video


Test Scenes


Results

Sub-parts of algorithm measured individually

• Measure time per call to subroutine

• Sum over all processors and all invocations

• Afterwards divide by number of processors and number of invocations

• Results are measured in events per second per processor


Scalability (Teapot Model)


Scalability (Room Model)


Samples Per Second


Reprojections Per Second



Exploitation of temporal coherence gives significantly smoother results than available with brute force ray tracing alone

This is at the cost of some artifacts which require further investigation

(More results available in course notes)


Acknowledgements

Thanks to:

• Steven Parker for writing the interactive ray tracer in the first place

• Brian Smits, Peter Shirley and Charles Hansen for involvement in the animation and parallel point reprojection projects

• Bruce Walter and George Drettakis for the render cache source code


Schedule


• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence

• Interactive Ray Tracing

• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer (Kato)



Outline

What is Kilauea?

Parallel ray tracing & photon mapping

Kilauea architecture

Shading logic

Rendering results


Objective

Global illumination

Extremely complex scenes


Parallel Processing

Hardware

• Multi-CPU machine

• Linux PC cluster

Software

• Threading (Pthread)

• Message passing (MPI)


Our Render Farm


Global Illumination

Photon map


Ray Tracing Renderer

[Diagram, three builds: the stages Read Scene, Ray Tracing, Shading, and Output, shown first for machines A B C and then for machine groups A B C, D E F, and G H I]


Parallel Ray Tracing

Simple case

Complex case


Accel Grid

Hierarchical uniform grid

Scene data


Simple Case (scene distribution)

[Diagram: Scene Data is copied to Machine A and Machine B]


Simple Case (ray tracing)

[Diagram: Machine A, Machine B, Screen]


Complex Case (scene distribution)

[Diagram: Scene Data is distributed randomly between Machine A and Machine B]


Complex Case (accel grid construction)

Independent construction

Aligned by table

[Diagram: Machine A, Machine B]


Complex Case (ray tracing)

[Diagram: Machine A, Machine B, Screen; label: Compare Results]
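A plausible reading of the complex case is that each machine intersects the same ray with its own part of the scene and the per-machine results are then compared, keeping the nearest hit. The sketch below illustrates that reading only; it is an assumption, not Kilauea code, and the sphere scene and all function names are hypothetical.

```python
import math


def hit_sphere(origin, direction, center, radius):
    """Distance along a normalized ray to a sphere, or None if it misses."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = sum(a * d for a, d in zip(oc, direction))
    c = sum(a * a for a in oc) - radius * radius
    disc = b * b - c
    if disc < 0:
        return None
    t = -b - math.sqrt(disc)
    return t if t > 0 else None


def trace_local(origin, direction, spheres):
    """Nearest hit among the spheres stored on one machine."""
    best = None
    for name, center, radius in spheres:
        t = hit_sphere(origin, direction, center, radius)
        if t is not None and (best is None or t < best[0]):
            best = (t, name)
    return best


def compare_results(results):
    """Keep the nearest of the per-machine hits (None if every machine missed)."""
    hits = [r for r in results if r is not None]
    return min(hits) if hits else None


machine_a = [("near", (0.0, 0.0, -3.0), 1.0)]   # part of the scene on machine A
machine_b = [("far", (0.0, 0.0, -10.0), 1.0)]   # part of the scene on machine B
ray = ((0.0, 0.0, 0.0), (0.0, 0.0, -1.0))
final = compare_results([trace_local(*ray, machine_a),
                         trace_local(*ray, machine_b)])
```

Because the comparison only needs hit distances, neither machine has to hold the other's geometry.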


Parallel Photon Mapping

Photon trace

Photon lookup


Photon Tracing (simple case)

[Diagram: photons are stored in a single PhotonMap; labels: Store, Store]


Photon Tracing (complex case)

[Diagram: photons are stored randomly in PhotonMap A and PhotonMap B, on Machine A and Machine B]


Photon Lookup (simple case)

[Diagram: Machine A and Machine B each hold a PhotonMap; labels: Lookup request, Irradiance value]


Photon Lookup (complex case)

[Diagram: Machine A with PhotonMap A and Machine B with PhotonMap B; labels: Lookup request, Irradiance calculation, Irradiance value, copy]
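When photons are stored randomly across several photon maps, a fixed-radius density estimate can be evaluated per map and the partial values added, because the estimate is a sum of photon power divided by a constant area. The sketch below illustrates this property under that fixed-radius assumption; it is not taken from Kilauea, and `distribute` and `partial_irradiance` are hypothetical names.

```python
import math
import random


def partial_irradiance(photons, x, radius):
    """Fixed-radius estimate from one map: summed power inside the disc / disc area."""
    r2 = radius * radius
    flux = sum(power for pos, power in photons
               if sum((a - b) ** 2 for a, b in zip(pos, x)) <= r2)
    return flux / (math.pi * r2)


def distribute(photons, n_maps, rng):
    """Randomly store each photon in one of n_maps photon maps."""
    maps = [[] for _ in range(n_maps)]
    for photon in photons:
        maps[rng.randrange(n_maps)].append(photon)
    return maps


rng = random.Random(1)
photons = [((rng.random(), rng.random(), 0.0), 0.01) for _ in range(200)]
maps = distribute(photons, 2, rng)            # PhotonMap A and PhotonMap B
x = (0.5, 0.5, 0.0)
full = partial_irradiance(photons, x, 0.2)    # single-map reference
combined = sum(partial_irradiance(m, x, 0.2) for m in maps)
```

A k-nearest-neighbour estimate would not decompose this simply, since each map would pick a different search radius.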


Task

Mtask, Wtask, Btask, Stask, Rtask

Atask, Etask, Ltask, Ptask, Otask


Task Assignment

[Diagram: tasks assigned to Machine A, Machine B, and Machine C]


Roles of Tasks

[Diagram: tasks T, S, R, and A; labels: pixel, Compare]


Task Configuration

[Diagram, three builds: A, T, S, and R tasks placed on Machine A and Machine B, then extended to Machines A through F]


Task Interaction

[Diagram, three builds: A, T, S, and R tasks on Machine A and Machine B; labels: pixel, Compare]


Task Interaction (simple case)

[Diagram: A, T, S, and R tasks on Machine A and Machine B]


Roles of Tasks (photon map)

[Diagram: tasks T, S, R, A, L, and P with PhotonMap A and PhotonMap B; label: Lookup]


Task Configuration (photon map)

[Diagram, two builds: A, L, P, R, S, and T tasks placed on Machine A and Machine B]


Task Interaction (photon map)

[Diagram, three builds: T, S, L, and P tasks on Machine A and Machine B; labels: photon, Lookup]


Task Configuration (simple photon)

[Diagram: T, S, R, A, L, and P tasks on Machine A and Machine B]


Task Priority

[Diagram: tasks T, S, R, L, A, and P with the labels pixel, Compare, photon, and Lookup arranged along a priority axis from Low to High]


Parallel Shading Problem

[Diagram labels: N, I, P, Reflection]

Cp = Cs + Cr


Parallel Shading Problem

[Diagram labels: N, I, P, Reflection, Machine A, Machine B]

Cp = Cs + Cr


Parallel Shading Problem (solution)

[Diagram, three builds; labels: A, B, C, D, E]

A: C = Cs + Cr

B: C = Cs + Cr



Decomposing Shading Computation

[Diagram, three builds: a shading calculation is split into funcA and funcB around an outside task, and each function then becomes a SPOT]


SPOT

Method + Data

data slot


SPOT Condition


Parallel Shading Solution using SPOT

[Diagram: SPOT A and SPOT B on Machine A and Machine B with an Outside Task; labels: Cs, Cr, C = Cs + Cr, Reflection Ray]
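The idea of a method bundled with data slots, which runs once all of its inputs have arrived, can be sketched as below. This is only a loose illustration of the "Method + Data" description on the SPOT slide; the `Spot` class, its `fill` method, and the slot names are hypothetical and say nothing about how Kilauea actually implements SPOTs.

```python
class Spot:
    """Hypothetical deferred computation: a method plus named data slots."""

    def __init__(self, method, slot_names):
        self.method = method
        self.slots = {name: None for name in slot_names}
        self.filled = set()
        self.result = None

    def ready(self):
        return len(self.filled) == len(self.slots)

    def fill(self, name, value):
        """Store one input; run the method once every slot is filled."""
        if name not in self.slots:
            raise KeyError(name)
        self.slots[name] = value
        self.filled.add(name)
        if self.ready() and self.result is None:
            self.result = self.method(**self.slots)
        return self.result


def combine(cs, cr):
    # C = Cs + Cr, per color channel
    return tuple(a + b for a, b in zip(cs, cr))


spot = Spot(combine, ["cs", "cr"])
pending = spot.fill("cs", (0.25, 0.25, 0.25))  # local shading arrives first
color = spot.fill("cr", (0.5, 0.0, 0.25))      # reflection result arrives later
```

The surface color is produced only when the reflection contribution, possibly computed on another machine, has been delivered, so the machine that computed Cs does not need to block while waiting.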


Parallel Shading Solution using SPOT

[Diagram: a network of SPOTs on Machine A and Machine B with an Outside Task; labels: A, B, C]


Shader SPOT Network Example


Rendering Results

Test machine specification

• 1 GHz Dual Pentium III

• 512 MB memory

• 100BaseT Ethernet

• 18 machines connected via 100BaseT switch


Quatro

700,223 triangles, 1 area point & sky light, 1280 x 692

18 machines: 7 min 19 sec


Quatro: single Atask test

[Figure: Speedup vs. number of machines (1 to 17); series: raytrace, all, linear]

[Figure: Rendering time, execution time (h:m:s) vs. number of machines (1 to 17); series: all, raytrace]
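For reading the speedup plots, the usual definitions are assumed here, since the slides do not state them. With T(n) the rendering time on n machines:

```latex
S(n) = \frac{T(1)}{T(n)}, \qquad E(n) = \frac{S(n)}{n}
```

The "linear" series then corresponds to the ideal case S(n) = n, that is, an efficiency of E(n) = 1.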


Jeep

715,059 triangles, 1 directional & sky light, 1280 x 692

18 machines: 8 min 27 sec


Jeep4

2,859,636 triangles, 1 directional & sky light, 1280 x 692

18 machines: 12 min 38 sec, 2 Atasks x 1


Jeep4: 2 Atasks test

1 Atask group = 2 machines

[Figure: Speedup vs. number of Atask groups (1 to 9); series: raytrace, all, linear]

[Figure: Rendering time (h:m:s) vs. number of Atask groups (1 to 9); series: all, raytrace]


Jeep8

5,719,072 triangles, 1 directional & sky light, 1280 x 692

16 machines: 18 min 43 sec, 4 Atasks x 4


Escape POD

468,321 triangles, 1 directional & sky light, 1280 x 692

18 machines: 14 min 55 sec


ansGun

20,279 triangles, 1 spot & sky light, 1280 x 960

18 machines: 16 min 38 sec


SCN101

787,255 triangles, 1 area light, 1280 x 692

18 machines: 9 min 10 sec


Video


Conclusion / Future Work

We achieved:

• Close to linear parallel performance

• Highly extensible architecture

We will achieve even more:

• Speed

• Stability

• Usability (user interface)

• Etc.


Additional Information

Kilauea live rendering demo

• BOOTH #1927 SquareUSA

http://www.squareusa.com/kilauea/


Schedule

Classification of Parallel Rendering Systems

Practical Applications

Summary / Discussion (Chalmers)


Summary


Contact Information

Alan Chalmers [email protected]

Tim Davis [email protected]

Toshi Kato http://www.squareusa.com/kilauea/

Erik Reinhard [email protected]

Slides http://www.cs.clemson.edu/~tadavis
