Practical Parallel Processing for Today’s Rendering Challenges, SIGGRAPH 2001 Course 40
DESCRIPTION
Practical Parallel Processing for Today’s Rendering Challenges. SIGGRAPH 2001 Course 40, Los Angeles, CA. Speakers: Alan Chalmers, University of Bristol; Tim Davis, Clemson University; Erik Reinhard, University of Utah; Toshi Kato, SquareUSA.
TRANSCRIPT
Practical Parallel Processing for Today’s Rendering Challenges -- 1
Practical Parallel Processing for Today’s Rendering Challenges
SIGGRAPH 2001 Course 40
Los Angeles, CA
Practical Parallel Processing for Today’s Rendering Challenges -- 2
Speakers
• Alan Chalmers, University of Bristol
• Tim Davis, Clemson University
• Erik Reinhard, University of Utah
• Toshi Kato, SquareUSA
Practical Parallel Processing for Today’s Rendering Challenges -- 3
Schedule
• Introduction
• Parallel / Distributed Rendering Issues
• Classification of Parallel Rendering Systems
• Practical Applications
• Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 4
Schedule
• Introduction (Davis)
• Parallel / Distributed Rendering Issues
• Classification of Parallel Rendering Systems
• Practical Applications
• Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 5
The Need for Speed
Graphics rendering is time-consuming
• large amount of data in a single image
• animations much worse
Demand continues to rise for high-quality graphics
Practical Parallel Processing for Today’s Rendering Challenges -- 6
Rendering and Parallel Processing
• A holy union
• Many graphics rendering tasks can be performed in parallel
• Often “embarrassingly parallel”
Practical Parallel Processing for Today’s Rendering Challenges -- 7
3-D Graphics Boards
• Getting better
• Perform “tricks” with texture mapping
• Steve Jobs’ remark on constant frame rendering time
Practical Parallel Processing for Today’s Rendering Challenges -- 8
Parallel / Distributed Rendering
Fundamental Issues
• Task Management
Task subdivision, Migration, Load balancing
• Data Management
Data distributed across system
• Communication
Practical Parallel Processing for Today’s Rendering Challenges -- 9
Schedule
• Introduction
• Parallel / Distributed Rendering Issues (Chalmers)
• Classification of Parallel Rendering Systems
• Practical Applications
• Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 10
Introduction
“Parallel processing is like a dog’s walking on its hind legs. It is not done well, but you are surprised to find it done at all”
[Steve Fiddes (apologies to Samuel Johnson)]
• Co-operation
• Dependencies
• Scalability
• Control
Practical Parallel Processing for Today’s Rendering Challenges -- 11
Co-operation
Solution of a single problem
• One person takes a certain time to solve the problem
• Divide problem into a number of sub-problems
• Each sub-problem solved by a single worker
• Reduced problem solution time
BUT
• co-operation overheads
Practical Parallel Processing for Today’s Rendering Challenges -- 12
Working Together
Overheads
• access to pool
• collision avoidance
Practical Parallel Processing for Today’s Rendering Challenges -- 13
Dependencies
Divide a problem into a number of distinct stages
• Parallel solution of one stage before next can start
• May be too severe → no parallel solution
each sub-problem dependent on previous stage
• Dependency-free problems
order of task completion unimportant
BUT co-operation still required
Practical Parallel Processing for Today’s Rendering Challenges -- 14
Building with Blocks
Strictly sequential
Dependency-free
Practical Parallel Processing for Today’s Rendering Challenges -- 15
Scalability
Upper bound on the number of workers
• Additional workers will NOT improve solution time
• Shows how suitable a problem is for parallel processing
• Given problem → finite number of sub-problems
more workers than tasks
• Upper bound may be (a lot) less than number of tasks
bottlenecks
Practical Parallel Processing for Today’s Rendering Challenges -- 16
Bottleneck at Doorway
More workers may result in LONGER solution time
Practical Parallel Processing for Today’s Rendering Challenges -- 17
Control
Required by all parallel implementations
• What constitutes a task
• When has the problem been solved
• How to deal with multiple stages
• Forms of control
centralised
distributed
Practical Parallel Processing for Today’s Rendering Challenges -- 18
Control Required
Sequential
Parallel
Practical Parallel Processing for Today’s Rendering Challenges -- 19
Inherent Difficulties
Failure to successfully complete
• Sequential solution
deficiencies in algorithm or data
• Parallel solution
deficiencies in algorithm or data
deadlock
data consistency
Practical Parallel Processing for Today’s Rendering Challenges -- 20
Novel Difficulties
Factors arising from implementation
• Deadlock
processor waiting indefinitely for an event
• Data consistency
data is distributed amongst processors
• Communication overheads
latency in message transfer
Practical Parallel Processing for Today’s Rendering Challenges -- 21
Evaluating Parallel Implementations
Realisation penalties
• Algorithmic penalty
nature of the algorithm chosen
• Implementation penalty
need to communicate
concurrent computation & communication activities
idle time
Practical Parallel Processing for Today’s Rendering Challenges -- 22
Solution Times
Practical Parallel Processing for Today’s Rendering Challenges -- 23
Task Management
Providing tasks to the processors
• Problem decomposition
algorithmic decomposition
domain decomposition
• Definition of a task
• Computational Model
Practical Parallel Processing for Today’s Rendering Challenges -- 24
Problem Decomposition
Exploit parallelism
• Inherent in algorithm
algorithmic decomposition
parallelising compilers
• Applying same algorithm to different data items
domain decomposition
need for explicit system software support
Practical Parallel Processing for Today’s Rendering Challenges -- 25
Abstract Definition of a Task
• Principal Data Item (PDI) - application of algorithm
• Additional Data Items (ADIs) - needed to complete computation
Practical Parallel Processing for Today’s Rendering Challenges -- 26
Computational Models
Determines the manner in which tasks are allocated to PEs
• Maximise PE computation time
• Minimise idle time
load balancing
• Evenly allocate tasks amongst the processors
Practical Parallel Processing for Today’s Rendering Challenges -- 27
Data Driven Models
All PDIs allocated to specific PEs before computation starts
Each PE knows a priori which PDIs it is responsible for
Balanced (geometric decomposition)
• evenly allocate tasks amongst the processors
• if the number of PDIs is not an exact multiple of the number of PEs, then some PEs do one extra task
$$\text{portion at each PE} = \frac{\text{number of PDIs}}{\text{number of PEs}}$$
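The balanced allocation above can be sketched in a few lines. This is only an illustration: the function name `balanced_allocation` and the choice to give the extra tasks to the lowest-numbered PEs are assumptions, not something the slides specify.

```python
def balanced_allocation(num_pdis: int, num_pes: int) -> list[range]:
    """Split PDI indices 0..num_pdis-1 into one contiguous portion per PE."""
    if num_pes <= 0:
        raise ValueError("need at least one PE")
    # base = portion every PE gets; extra = PDIs left over after an even split
    base, extra = divmod(num_pdis, num_pes)
    portions = []
    start = 0
    for pe in range(num_pes):
        # Assumption: the first `extra` PEs each take one additional PDI.
        size = base + (1 if pe < extra else 0)
        portions.append(range(start, start + size))
        start += size
    return portions


portions = balanced_allocation(10, 4)
sizes = [len(p) for p in portions]
```

With 10 PDIs and 4 PEs, two PEs receive three PDIs and two receive two, so every PE knows its portion before computation starts.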
Practical Parallel Processing for Today’s Rendering Challenges -- 28
Balanced Data Driven
$$\text{solution time} = \text{initial distribution} + \text{longest portion time} + \text{result collation}$$
Practical Parallel Processing for Today’s Rendering Challenges -- 29
Demand Driven Model
Task computation time unknown
• Work is allocated dynamically as PEs become idle
PEs no longer bound to particular PDIs
• PEs explicitly demand new tasks
• Task supplier process must satisfy these demands
Practical Parallel Processing for Today’s Rendering Challenges -- 30
Dynamic Allocation of Tasks
$$\text{solution time} = \frac{\text{total comp time for all PDIs}}{\text{number of PEs}} + 2 \times \text{total comms time}$$
Practical Parallel Processing for Today’s Rendering Challenges -- 31
Task Supplier Process
Simple demand driven task supplier
PROCESS Task_Supplier()
Begin
  remaining_tasks := total_number_of_tasks
  (* initialise all processors with one task *)
  FOR p = 1 TO number_of_PEs
    SEND task TO PE[p]
    remaining_tasks := remaining_tasks - 1
  WHILE results_outstanding DO
    RECEIVE result FROM PE[i]
    IF remaining_tasks > 0 THEN
      SEND task TO PE[i]
      remaining_tasks := remaining_tasks - 1
    ENDIF
End (* Task_Supplier *)
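The task supplier pseudocode can be tried out on a single machine. The sketch below is a hedged illustration: threads stand in for PEs and `queue.Queue` objects stand in for message passing. The names `run_demand_driven`, `pe_worker` and `compute` are hypothetical, and unlike the pseudocode it checks that tasks remain before the initial hand-out.

```python
import queue
import threading


def run_demand_driven(tasks, num_pes, compute):
    """Hand out tasks one at a time as PEs become idle; return {task: result}."""
    task_queues = [queue.Queue() for _ in range(num_pes)]  # supplier -> each PE
    result_queue = queue.Queue()                           # all PEs -> supplier

    def pe_worker(pe_id):
        while True:
            task = task_queues[pe_id].get()
            if task is None:  # sentinel: no more work for this PE
                return
            result_queue.put((pe_id, task, compute(task)))

    workers = [threading.Thread(target=pe_worker, args=(p,)) for p in range(num_pes)]
    for w in workers:
        w.start()

    remaining = list(tasks)
    remaining.reverse()  # pop() from the end then yields the original order
    outstanding = 0
    results = {}

    # Initialise every PE with one task, as long as tasks are available.
    for p in range(num_pes):
        if remaining:
            task_queues[p].put(remaining.pop())
            outstanding += 1

    # A returned result doubles as the PE's demand for its next task.
    while outstanding > 0:
        pe_id, task, result = result_queue.get()
        outstanding -= 1
        results[task] = result
        if remaining:
            task_queues[pe_id].put(remaining.pop())
            outstanding += 1

    for q in task_queues:
        q.put(None)
    for w in workers:
        w.join()
    return results


squares = run_demand_driven(range(20), num_pes=4, compute=lambda t: t * t)
```

Because a PE only receives a new task after returning a result, PEs that draw cheap tasks simply come back for more, which is the load-balancing effect the demand driven model relies on.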
Practical Parallel Processing for Today’s Rendering Challenges -- 32
Load Balancing
All PEs should complete at the same time
• Some PEs busy with complex tasks
• Other PEs available for easier tasks
• Computation effort of each task unknown
hot spot at end of processing → unbalanced solution
• Any knowledge about hot spots should be used
Practical Parallel Processing for Today’s Rendering Challenges -- 33
Task Definition & Granularity
Computational elements
• Atomic element (ray-object intersection)
sequential problem’s lowest computational element
• Task (trace complete path of one ray)
parallel problem’s smallest computational element
• Task granularity
number of atomic units in one task
Practical Parallel Processing for Today’s Rendering Challenges -- 34
Task Packet
Unit of task distribution
• Informs a PE of which task(s) to perform
• Task packet may include
indication of which task(s) to compute
data items (the PDI and (possibly) ADIs)
• Task packet for ray tracer: one or more rays to be traced
Practical Parallel Processing for Today’s Rendering Challenges -- 35
Algorithmic Dependencies
Algorithm adopted for parallelisation:
• May specify order of task completion
• Dependencies MUST be preserved
• Algorithmic dependencies introduce:
synchronisation points → distinct problem stages
data dependencies → careful data management
Practical Parallel Processing for Today’s Rendering Challenges -- 36
Distributed Task Management
Centralised task supply
• All requests for new tasks to System Controller
bottleneck
• Significant delay in fetching new tasks
Distributed task supply
• task requests handled remotely from System Controller
• spread of communication load across system
• reduced time to satisfy task request
Practical Parallel Processing for Today’s Rendering Challenges -- 37
Preferred Bias Allocation
Combining Data driven & Demand driven
• Balanced data driven
tasks allocated in a predetermined manner
• Demand driven
tasks allocated dynamically on demand
• Preferred Bias: Regions are purely conceptual
enables the exploitation of any coherence
Practical Parallel Processing for Today’s Rendering Challenges -- 38
Conceptual Regions
• task allocation no longer arbitrary
Practical Parallel Processing for Today’s Rendering Challenges -- 39
Data Management
Providing data to the processors
• World model
• Virtual shared memory
• Data manager process
local data cache
requesting & locating data
• Consistency
Practical Parallel Processing for Today’s Rendering Challenges -- 40
Remote Data Fetches
Advanced data management
• Minimising communication latencies
Prefetching
Multi-threading
Profiling
• Multi-stage problems
Practical Parallel Processing for Today’s Rendering Challenges -- 41
Data Requirements
Requirements may be large
• Fit in the local memory of each processor
world model
• Too large for each local memory
distributed data
provide virtual world model/virtual shared memory
Practical Parallel Processing for Today’s Rendering Challenges -- 42
Virtual Shared Memory (VSM)
Providing a conceptual single memory space
• Memory is in fact distributed
• Request is the same for both local & remote data
• Speed of access may be (very) different
Levels at which VSM can be provided, from higher level to lower level:
System Software: provided by DM process
Compiler: HPF, ORCA
Operating System: Coherent Paging
Hardware: DDM, DASH, KSR-1
Practical Parallel Processing for Today’s Rendering Challenges -- 43
Consistency
Read/write can result in inconsistencies
• Distributed memory
multiple copies of the same data item
• Updating such a data item
update all copies of this data item
invalidate all other copies of this data item
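The invalidate option can be illustrated with a toy model. This is a minimal sketch under stated assumptions: a single authoritative "home" copy plus one local cache per PE, with the class name `InvalidatingStore` and its methods invented for illustration. A real data manager would do this with messages between processors.

```python
class InvalidatingStore:
    """Home copy plus per-PE cached copies, kept consistent by invalidation."""

    def __init__(self, num_pes):
        self.home = {}                                  # authoritative values
        self.caches = [dict() for _ in range(num_pes)]  # local copies per PE

    def read(self, pe, key):
        cache = self.caches[pe]
        if key not in cache:        # local miss: fetch from the home copy
            cache[key] = self.home[key]
        return cache[key]

    def write(self, pe, key, value):
        self.home[key] = value
        for other, cache in enumerate(self.caches):
            if other != pe:
                cache.pop(key, None)  # invalidate all other copies
        self.caches[pe][key] = value


store = InvalidatingStore(num_pes=2)
store.write(0, "x", 1)
first = store.read(1, "x")   # PE 1 caches the value
store.write(0, "x", 2)       # PE 1's copy is invalidated
second = store.read(1, "x")  # PE 1 misses locally and refetches
```

The update alternative would instead push the new value into every cache that holds the item, trading extra traffic on writes for fewer misses on later reads.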
Practical Parallel Processing for Today’s Rendering Challenges -- 44
Minimising Impact of Remote Data
Failure to find a data item locally → remote fetch
• Time to find data item can be significant
• Processor idle during this time
• Latency difficult to predict
eg depends on current message densities
• Data management must minimise this idle time
Practical Parallel Processing for Today’s Rendering Challenges -- 45
Data Management Techniques
Hiding the Latency
• Overlapping the communication with computation
prefetching
multi-threading
Minimising the Latency
• Reducing the time of a remote fetch
profiling
caching
Practical Parallel Processing for Today’s Rendering Challenges -- 46
Prefetching
Exploiting knowledge of data requests
• A priori knowledge of data requirements
nature of the problem
choice of computational model
• DM can prefetch them (up to some specified horizon)
available locally when required
overlapping communication with computation
Practical Parallel Processing for Today’s Rendering Challenges -- 47
Multi-Threading
Keeping PE busy with useful computation
• Remote data fetch → current task stalled
• Start another task (Processor kept busy)
separate threads of computation (BSP)
• Disadvantages: Overheads
Context switches between threads
Increased message densities
Reduced local cache for each thread
Practical Parallel Processing for Today’s Rendering Challenges -- 48
Results for Multi-Threading
• More than the optimal number of threads reduces performance
• “Cache 22” situation
less local cache → more data misses → more threads
Practical Parallel Processing for Today’s Rendering Challenges -- 49
Profiling
Reducing the remote fetch time
• At the end of computation all data requests are known
if known then can be prefetched
• Monitor data requests for each task
build up a “picture” of possible requirements
• Exploit spatial coherence (with preferred bias allocation)
prefetch those data items likely to be required
Practical Parallel Processing for Today’s Rendering Challenges -- 50
Spatial Coherence
Practical Parallel Processing for Today’s Rendering Challenges -- 51
Schedule
• Introduction
• Parallel / Distributed Rendering Issues
• Classification of Parallel Rendering Systems (Davis)
• Practical Applications
• Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 52
Classification of Parallel Rendering Systems
Parallel rendering performed in many ways
Classification by
• task subdivision
polygon rendering
ray tracing
• hardware
parallel hardware
distributed computing
Practical Parallel Processing for Today’s Rendering Challenges -- 53
Classification by Task Subdivision
Original rendering task broken into smaller pieces to be processed in parallel
Depends on type of rendering
Goals
• maximize parallelism
• minimize overhead, including communication
Practical Parallel Processing for Today’s Rendering Challenges -- 54
Task Subdivision in Polygon Rendering
Rendering many primitives
Polygon rendering pipeline
• geometry processing (transformation, clipping, lighting)
• rasterization (scan conversion, visibility, shading)
Practical Parallel Processing for Today’s Rendering Challenges -- 55
Polygon Rendering Pipeline
Graphics database traversal → Geometry Processing (G … G) → Rasterization (R … R) → Display
Practical Parallel Processing for Today’s Rendering Challenges -- 56
Primitive Processing and Sorting
View processing of primitives as sorting problem
• primitives can fall anywhere on or off the screen
Sorting can be done in either software or hardware, but mostly done in hardware
Practical Parallel Processing for Today’s Rendering Challenges -- 57
Primitive Processing and Sorting
Sorting can occur at various places in the rendering pipeline
• during geometry processing (sort-first)
• between geometry processing and rasterization (sort-middle)
• during rasterization (sort-last)
Practical Parallel Processing for Today’s Rendering Challenges -- 58
Sort-first
Graphics database (arbitrarily partitioned) → (Pre-transform) → Redistribute “raw” primitives → Geometry Processing (G … G) → Rasterization (R … R) → Display
Practical Parallel Processing for Today’s Rendering Challenges -- 59
Sort-first Method
Each processor (renderer) assigned a portion of the screen
Primitives arbitrarily assigned to processors
Processors perform enough calculations to send primitives to correct renderers
Processors then perform geometry processing and rasterization for their primitives in parallel
Practical Parallel Processing for Today’s Rendering Challenges -- 60
Screen Subdivision
Practical Parallel Processing for Today’s Rendering Challenges -- 61
Sort-first Discussion
+ Communication costs can be kept low
- Duplication of effort if primitives fall into more than one screen area
- Load imbalance if primitives concentrated
- Very few, if any, sort-first renderers built
Practical Parallel Processing for Today’s Rendering Challenges -- 62
Sort-middle
Graphics database (arbitrarily partitioned) → Geometry Processing (G … G) → Redistribute screen-space primitives → Rasterization (R … R) → Display
Practical Parallel Processing for Today’s Rendering Challenges -- 63
Sort-middle Method
Primitives arbitrarily assigned to renderers
Each renderer performs geometry processing on its primitives
Primitives then redistributed to rasterizers according to screen region
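The redistribution step can be sketched as a bucketing of screen-space primitives into screen tiles. This is a hedged illustration only: it assumes each rasterizer owns one rectangular tile and that a primitive is routed by its screen-space bounding box `(xmin, ymin, xmax, ymax)`. The function names are hypothetical.

```python
def tiles_overlapped(bbox, tile_w, tile_h, tiles_x, tiles_y):
    """Return (tx, ty) indices of the screen tiles touched by a bounding box."""
    xmin, ymin, xmax, ymax = bbox
    # Clamp to the screen; an off-screen box yields an empty index range.
    tx0 = max(0, int(xmin // tile_w))
    ty0 = max(0, int(ymin // tile_h))
    tx1 = min(tiles_x - 1, int(xmax // tile_w))
    ty1 = min(tiles_y - 1, int(ymax // tile_h))
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]


def redistribute(primitives, tile_w, tile_h, tiles_x, tiles_y):
    """Bucket (name, bbox) primitives by the tile(s) they overlap."""
    buckets = {(tx, ty): [] for ty in range(tiles_y) for tx in range(tiles_x)}
    for name, bbox in primitives:
        for tile in tiles_overlapped(bbox, tile_w, tile_h, tiles_x, tiles_y):
            buckets[tile].append(name)
    return buckets


prims = [
    ("a", (10, 10, 20, 20)),      # inside the top-left tile only
    ("b", (90, 90, 110, 110)),    # straddles all four tiles
    ("c", (-50, -50, -10, -10)),  # entirely off screen
]
buckets = redistribute(prims, tile_w=100, tile_h=100, tiles_x=2, tiles_y=2)
```

Primitive "b" lands in every bucket, which shows how primitives that cross region boundaries are sent to more than one rasterizer.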
Practical Parallel Processing for Today’s Rendering Challenges -- 64
Sort-middle Discussion
+ Natural breaking point in graphics pipeline
- Load imbalance if primitives concentrated in particular screen regions
+ Several successful hardware implementations
• Pixel-Planes 5
• SGI Reality Engine
Practical Parallel Processing for Today’s Rendering Challenges -- 65
Sort-last
Graphics database (arbitrarily partitioned) → Geometry Processing (G … G) → Rasterization (R … R) → Redistribute pixels, samples, or fragments → (Compositing) → Display
Practical Parallel Processing for Today’s Rendering Challenges -- 66
Sort-last Method
Primitives arbitrarily distributed to renderers
Each renderer computes pixel values for its primitives
Pixel values are then sent to processors according to screen location
Rasterizers perform visibility and compositing
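The final visibility and compositing step can be sketched as a depth comparison per pixel. This is a minimal illustration, assuming each renderer produces a sparse map from pixel coordinates to a `(depth, colour)` sample and that smaller depth means closer to the viewer. The name `composite` and the data layout are assumptions.

```python
def composite(partial_images):
    """Merge per-renderer {pixel: (depth, colour)} maps, keeping the nearest sample."""
    final = {}
    for image in partial_images:
        for pixel, (depth, colour) in image.items():
            # Keep the incoming sample only if the pixel is empty or it is closer.
            if pixel not in final or depth < final[pixel][0]:
                final[pixel] = (depth, colour)
    return final


renderer_a = {(0, 0): (5.0, "red"), (1, 0): (2.0, "red")}
renderer_b = {(0, 0): (3.0, "blue"), (2, 0): (9.0, "blue")}
frame = composite([renderer_a, renderer_b])
```

Every renderer may contribute a sample for every pixel, which is why pixel traffic is the main cost of sort-last.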
Practical Parallel Processing for Today’s Rendering Challenges -- 67
Sort-last Discussion
+ Less prone to load imbalance
- Pixel traffic can be high
+ Some working systems
• Denali
Practical Parallel Processing for Today’s Rendering Challenges -- 68
Task Subdivision in Ray Tracing
Ray tracing often prohibitively expensive on single processor
Prime candidate for parallelization
• each pixel can be rendered independently
Processing easily subdivided
• image space subdivision
• object space subdivision
• object subdivision
Practical Parallel Processing for Today’s Rendering Challenges -- 69
Image Space Subdivision
Practical Parallel Processing for Today’s Rendering Challenges -- 70
Image Space Subdivision Discussion
+ Straightforward
+ High parallelism possible
- Entire scene database must reside on each processor
• need adequate storage
+ Low processor communication
Practical Parallel Processing for Today’s Rendering Challenges -- 71
Image Space Subdivision Discussion
- Load imbalance possible
• screen space may be further subdivided
+ Used in many parallel ray tracers
• works better with MIMD machines
• distributed computing environments
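Image space subdivision amounts to cutting the image into rectangular regions that can be traced independently. The sketch below is an illustration under assumptions: square tiles of a fixed `tile_size`, with edge tiles clipped to the image. Choosing a smaller tile size corresponds to subdividing screen space further to even out the load.

```python
def make_tiles(width, height, tile_size):
    """Return (x0, y0, x1, y1) tiles covering the image; x1 and y1 are exclusive."""
    tiles = []
    for y0 in range(0, height, tile_size):
        for x0 in range(0, width, tile_size):
            # Clip the last row and column of tiles to the image bounds.
            tiles.append((x0, y0, min(x0 + tile_size, width), min(y0 + tile_size, height)))
    return tiles


tiles = make_tiles(320, 240, 64)
covered = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in tiles)
```

Each tile can then be handed to a processor as one task, either up front or on demand.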
Practical Parallel Processing for Today’s Rendering Challenges -- 72
Object Space Subdivision
• 3-D object space divided into voxels
• Each voxel assigned to a processor
• Rays are passed from processor to processor as voxel space is traversed
Practical Parallel Processing for Today’s Rendering Challenges -- 73
Object Space Subdivision Discussion
+ Each processor needs only scene information associated with its voxel(s)
- Rays must be tracked through voxel space
+ Load balance good
- Communication can be high
+ Some successful systems
Practical Parallel Processing for Today’s Rendering Challenges -- 74
Object Partitioning
Each object in the scene is assigned to a processor
Rays passed as messages between processors
Processors check for intersection
Practical Parallel Processing for Today’s Rendering Challenges -- 75
Object Partitioning Discussion
+ Load balancing good
- Communication high due to ray message traffic
- Fewer implementations
Practical Parallel Processing for Today’s Rendering Challenges -- 76
Schedule
• Introduction
• Parallel / Distributed Rendering Issues
• Classification of Parallel Rendering Systems
• Practical Applications
Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence (Davis)
Interactive Ray Tracing
Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer
• Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 77
Practical Experiences at Clemson
• Problems with Rendering
• Current Resources
• Deciding on a Solution
• A New Render Farm
Practical Parallel Processing for Today’s Rendering Challenges -- 78
A Demand for Rendering
Computer Animation course
3 SIGGRAPH animation submissions
• render over semester break
Practical Parallel Processing for Today’s Rendering Challenges -- 79
Current Resources
dedicated lab
• 8 SGI O2s (R12000, 384 MB)
general-purpose lab
• 4 SGI O2s
shared lab
• dual-pipe Onyx2 (8 R12000, 8 GB)
• 10 SGI O2s (R12000, 256 MB)
offices
• 5 SGI O2s
Practical Parallel Processing for Today’s Rendering Challenges -- 80
Resource Problems
Rendering prohibits interactive sessions
Little organized control over resources
• users must be self-monitoring
• m renders on n machines → 1 render on n/m machines
Disk space
Cross-platform distributed rendering to PCs problematic
• security (rsh)
• distributed rendering software
• directory paths
Practical Parallel Processing for Today’s Rendering Challenges -- 81
Short-term Solutions
Distributed rendering restricted to late night
Resources partitioned
Practical Parallel Processing for Today’s Rendering Challenges -- 82
Problems with Maya
video
Traditional distributed computing problems
• dropped frames
• incomplete frames
• tools developed
Practical Parallel Processing for Today’s Rendering Challenges -- 83
Problems with Maya
Tools (DropCheck)
Practical Parallel Processing for Today’s Rendering Challenges -- 84
Problems with Maya
Tools (Load Scan)
Practical Parallel Processing for Today’s Rendering Challenges -- 85
Problems with Maya
Animation inconsistencies
• next slide
Some frames would not render
Particle system inconsistencies
Practical Parallel Processing for Today’s Rendering Challenges -- 86
Problems with Maya
Practical Parallel Processing for Today’s Rendering Challenges -- 87
Rendering Tips
Layering
Practical Parallel Processing for Today’s Rendering Challenges -- 88
Rendering Tips
Layering
Practical Parallel Processing for Today’s Rendering Challenges -- 89
Deciding on a Solution - RenderDrive
RenderDrive by ART (Advanced Rendering Technology)
• network appliance for ray tracing
• 16-48 specialized processors
• claims speedups of 15-40 over Pentium III
• 768MB to 1.5GB memory
• 4GB hard disk cache
Practical Parallel Processing for Today’s Rendering Challenges -- 90
Deciding on a Solution - RenderDrive
• plug-in interface to Maya
• RenderMan ray tracer
• $15K - $25K
Practical Parallel Processing for Today’s Rendering Challenges -- 91
Deciding on a Solution - PCs
Network of PCs as a render farm
10 PCs, each with a 1.4GHz processor, 1GB memory, and 40GB hard drive
Maya will run under Windows 2000 or Linux (Maya 4.0)
Distributed rendering software not included for Windows 2000
Practical Parallel Processing for Today’s Rendering Challenges -- 92
Deciding on a Solution - PCs Win
RenderDrive had some unusual anomalies
Interactive capabilities
Scan-line or ray tracing
Distributed rendering software may be included
Problems with security still exist
• shared file system
Practical Parallel Processing for Today’s Rendering Challenges -- 93
Schedule
• Introduction
• Parallel / Distributed Rendering Issues
• Classification of Parallel Rendering Systems
• Practical Applications
• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence (Davis)
• Interactive Ray Tracing
• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer
Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 94
Agenda
• Background
• Temporal Depth-Buffer
• Frame Coherence Algorithm
• Parallel Frame Coherence Algorithm
Practical Parallel Processing for Today’s Rendering Challenges -- 95
Background - Ray Tracing
• Closest to physical model of light
• High cost in terms of time / complexity
Practical Parallel Processing for Today’s Rendering Challenges -- 96
Background - Frame Coherence
Frame coherence
• those pixels that do not change from one frame to the next
• derived from object and temporal coherence
We should not have to re-compute those pixels whose values will not change
• writing pixels to frame files
Practical Parallel Processing for Today’s Rendering Challenges -- 97
Background - Test Animation
Glass Bounce (60 frames at 320x240; 5 obj)
Practical Parallel Processing for Today’s Rendering Challenges -- 98
Background - Frame Coherence
Practical Parallel Processing for Today’s Rendering Challenges -- 99
Previous Work
Frame coherence
• moving camera/static world [Hubschman and Zucker 81]
• estimated frames [Badt 88]
• stereoscopic pairs [Adelson and Hodges 93/95]
• 4D bounding volumes [Glassner 88]
• voxels and ray tracking [Jevans 92]
• incremental ray tracing [Murakami 90]
Practical Parallel Processing for Today’s Rendering Challenges -- 100
Previous Work (cont.)
Distributed computing
• Alias and 3D Studio
• most major productions starting with Toy Story [Henne 96]
Practical Parallel Processing for Today’s Rendering Challenges -- 101
Goals
• Render exactly the same set of frames in much less time
• Work in conjunction with other optimization techniques
• Run on a variety of platforms
• Extend a currently popular ray tracer (POV-Ray) to allow for general use
Practical Parallel Processing for Today’s Rendering Challenges -- 102
Temporal Depth-Buffer
• Similar to traditional z-buffer
• For each pixel, store a temporal depth in frame units
Example temporal depth-buffer (object positions shown for frames 1, 2 and 3):
5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 3 3 3 5 5 5
5 5 5 5 5 5 5 3 3 3 5 5 5
5 5 5 5 5 2 2 2 3 3 5 5 5
5 5 5 5 5 2 2 2 3 3 5 5 5
5 5 5 5 2 2 2 2 5 5 5 5 5
5 5 5 5 2 2 2 2 5 5 5 5 5
5 5 5 5 2 2 2 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5
Practical Parallel Processing for Today’s Rendering Challenges -- 103
Frame Coherence Algorithm
Practical Parallel Processing for Today’s Rendering Challenges -- 104
Frame Coherence Algorithm
Practical Parallel Processing for Today’s Rendering Challenges -- 105
Frame Coherence Algorithm
• Identify volume within 3D object space where movement occurs
• Divide volume uniformly into voxels
• For each voxel, create a list of frame numbers in which changing objects inhabit this voxel
Practical Parallel Processing for Today’s Rendering Challenges -- 106
Frame Coherence Algorithm
• In each frame, track rays through voxels for each pixel
• From the voxels traversed, find the one with the lowest frame number
• Record that number in the temporal depth-buffer
Practical Parallel Processing for Today’s Rendering Challenges -- 107
Frame Coherence Algorithm
for each frame of the animation
  for each pixel that needs to be computed for this frame
    trace the rays for this pixel
    for each voxel that any of these rays intersect
      get the next frame number to compute
    set the t-buffer entry to the lowest frame number found
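The loop above can be sketched in Python to show how the temporal depth-buffer skips coherent pixels. This is a simplified illustration, not the course's POV-Ray implementation. It assumes a static camera, that the voxels traversed by each pixel's rays are known up front (`pixel_voxels`), and that each voxel's list (`voxel_changes`) holds the sorted frame numbers at which pixels seeing that voxel must be recomputed. All names, including the `trace` callback, are hypothetical.

```python
import bisect


def next_change(frames_with_change, current_frame, last_frame):
    """First listed frame after current_frame, else last_frame + 1 (never again)."""
    i = bisect.bisect_right(frames_with_change, current_frame)
    return frames_with_change[i] if i < len(frames_with_change) else last_frame + 1


def render_animation(num_frames, pixel_voxels, voxel_changes, trace):
    """Return (list of per-frame images, number of pixels actually traced)."""
    t_buffer = {pixel: 1 for pixel in pixel_voxels}  # every pixel is due at frame 1
    image = {}
    traced = 0
    frames = []
    for frame in range(1, num_frames + 1):
        for pixel, voxels in pixel_voxels.items():
            if t_buffer[pixel] > frame:
                continue  # coherent pixel: reuse the colour from an earlier frame
            image[pixel] = trace(pixel, frame)
            traced += 1
            # Lowest upcoming frame number over all voxels this pixel's rays cross.
            t_buffer[pixel] = min(
                (next_change(voxel_changes.get(v, []), frame, num_frames) for v in voxels),
                default=num_frames + 1,
            )
        frames.append(dict(image))
    return frames, traced


pixel_voxels = {"p0": ["v0"], "p1": ["v1"]}
voxel_changes = {"v1": [3]}  # only voxel v1 is affected, at frame 3
frames, traced = render_animation(5, pixel_voxels, voxel_changes, lambda p, f: (p, f))
```

In this toy run only three pixel computations are needed instead of ten: both pixels at frame 1, then `p1` again at frame 3.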
Practical Parallel Processing for Today’s Rendering Challenges -- 108
Frame Coherence Algorithm
Temporal depth-buffer (object positions shown for frames 1, 2 and 3):
5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5
5 5 5 5 5 5 5 3 3 3 5 5 5
5 5 5 5 5 5 5 3 3 3 5 5 5
5 5 5 5 5 2 2 2 3 3 5 5 5
5 5 5 5 5 2 2 2 3 3 5 5 5
5 5 5 5 2 2 2 2 5 5 5 5 5
5 5 5 5 2 2 2 2 5 5 5 5 5
5 5 5 5 2 2 2 5 5 5 5 5 5
5 5 5 5 5 5 5 5 5 5 5 5 5
Practical Parallel Processing for Today’s Rendering Challenges -- 109
Voxel Volume
Uniform voxel spatial subdivision
Voxel can be non-cubical
Ways to determine voxel volume
• user-supplied
• pre-processing phase
active voxel marking
in distributed environment, done by master or slave or both
Practical Parallel Processing for Today’s Rendering Challenges -- 110
Frame Coherence Example
Practical Parallel Processing for Today’s Rendering Challenges -- 111
Test Animation
Pool Shark (620 frames at 640x480; 174 obj)
Practical Parallel Processing for Today’s Rendering Challenges -- 112
Test Animations - Problem
Bounding box problem
Practical Parallel Processing for Today’s Rendering Challenges -- 113
Results
measure | standard algorithm | frame coherence algorithm | ratio of frame coherence to standard | speedup
total number of rays | 47,841,269 | 13,259,380 | 0.27 | --
total parse time | 0:48 | 1:30 | 1.88 | --
first frame rendering time | 6:34 | 8:49 | 1.34 | 0.75
average frame rendering time | 7:15 | 3:05 | 0.43 | 2.33
total frame rendering time | 5:26:55 | 2:19:51 | 0.43 | 2.33
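The ratio and speedup columns follow directly from the timings. The short sketch below reproduces the figures for total frame rendering time. The helper `to_seconds` is a hypothetical name, and the computed speedup (about 2.34) agrees with the 2.33 shown on the slide only up to rounding.

```python
def to_seconds(clock: str) -> int:
    """Convert 'h:mm:ss' or 'm:ss' to a number of seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


standard = to_seconds("5:26:55")  # standard algorithm, total frame rendering time
coherent = to_seconds("2:19:51")  # frame coherence algorithm, same measure
ratio = coherent / standard       # roughly 0.43
speedup = standard / coherent     # roughly 2.34
```

The same calculation applied to the first frame gives a speedup below 1, reflecting the one-off overhead of setting up the voxel lists and the temporal depth-buffer.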
Practical Parallel Processing for Today’s Rendering Challenges -- 114
Frame Coherence Discussion
Localized movement can have global effects
Performance depends on both the number and complexity of recomputed pixels
Issues
• overhead
• antialiasing
• motion blur
Practical Parallel Processing for Today’s Rendering Challenges -- 115
Temporal Depth-Buffer Discussion
• Uses less memory than other methods
• Simple
• Can be used with other algorithms
Practical Parallel Processing for Today’s Rendering Challenges -- 116
Parallel Frame Coherence Algorithm
Distributed computing environment
1-8 Sun Sparc Ultra 5 processors running at 270 MHz
Coarse-grain parallelism
Load balancing
• divide work among processors
• keep data together for frame coherence
Practical Parallel Processing for Today’s Rendering Challenges -- 117
Load Balancing
Image space subdivision
• each processor computes a subregion for the entire length of the run
Recursively subdivide subsequences to keep processors busy
Practical Parallel Processing for Today’s Rendering Challenges -- 118
Screen Subdivision
Practical Parallel Processing for Today’s Rendering Challenges -- 119
Load Balancing
Coarse bin packing: find block with smallest number of computed frames
Keep statistics on average first frame time and average coherent frame time
Find a hole in the sequence
Leave some free frames before new start
Practical Parallel Processing for Today’s Rendering Challenges -- 120
Load Balancing Example
[Figure: frame sequence 10 to 20 with hole start, hole end, tmp start and start frame marked]
first frame time = 30, avg frame time = 15
current processor speed = 4, new processor speed = 3

$$\text{tmp start} = \text{hole start} + \frac{\text{first frame time}}{\text{avg frame time}} \cdot \frac{\text{current processor speed}}{\text{new processor speed}} = 11 + \frac{30}{15} \cdot \frac{4}{3} = 11 + \frac{8}{3} \approx 14$$

$$\text{start frame} = \text{tmp start} + \frac{\text{hole end} - \text{tmp start} + 1}{2} \cdot \frac{\text{current processor speed}}{\text{new processor speed}} = 14 + \frac{19 - 14 + 1}{2} \cdot \frac{4}{3} = 14 + \frac{6}{2} \cdot \frac{4}{3} = 18$$
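The worked example on this slide can be sketched in code. This is a minimal sketch under one reading of the slide: the temporary start is the hole start plus the ratio of first frame time to average frame time, scaled by current processor speed over new processor speed, and the start frame then sits halfway into the remaining hole, scaled the same way. The function name `pick_start_frame` and the rounding to the nearest frame are assumptions, not part of the course material.

```python
def pick_start_frame(hole_start, hole_end, first_frame_time, avg_frame_time,
                     current_speed, new_speed):
    """Return (tmp_start, start_frame) for a processor that takes over part of a hole.

    Rounding to the nearest whole frame is an assumption; the slide only
    shows the rounded results.
    """
    speed_ratio = current_speed / new_speed
    # Leave room for the frames the current processor finishes while the
    # new processor is still computing its (expensive) first frame.
    tmp_start = round(hole_start + (first_frame_time / avg_frame_time) * speed_ratio)
    # Split the remaining hole in half, weighted by relative processor speed.
    start_frame = round(tmp_start + (hole_end - tmp_start + 1) / 2 * speed_ratio)
    return tmp_start, start_frame


# Values from the slide: hole 11..19, first frame time 30, average frame time 15,
# current processor speed 4, new processor speed 3.
tmp_start, start_frame = pick_start_frame(11, 19, 30, 15, 4, 3)
print(tmp_start, start_frame)  # 14 18
```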
Practical Parallel Processing for Today’s Rendering Challenges -- 121
Results - Parallel Frame Coherence
                               standard      parallel with                parallel frame coherence
                               algorithm     8 machines      speedup      with 8 machines            speedup
total number of rays           47,841,269    49,161,582      1.03         18,299,347                 0.38
total parse time               0:48          --              --           --                         --
first frame rendering time     6:34          --              --           --                         --
average frame rendering time   7:15          1:05            6.7          0:34                       12.9
total frame rendering time     5:26:55       49:49           6.6          25:47                      12.9
Practical Parallel Processing for Today’s Rendering Challenges -- 122
Results
                               standard      frame coherence   ratio of frame coherence
                               algorithm     algorithm         to standard                speedup
total number of rays           15,731,252    6,386,883         0.41                       --
total parse time               0:11          0:19              1.73                       --
first frame rendering time     2:39          3:19              1.25                       0.80
average frame rendering time   2:42          1:39              0.61                       1.64
total frame rendering time     2:42:26       1:39:02           0.61                       1.64

number of    total number   ratio to           average frame    total rendering
processors   of rays        single processor   rendering time   time              speedup
1            15,731,252     1.00               2:42             2:42:26           1.00
2            5,890,290      0.37               0:38             38:25             4.23
4            5,913,926      0.38               0:22             22:12             7.31
8            6,063,338      0.39               0:16             16:28             9.86
12           6,086,781      0.39               0:12             11:37             13.98
16           6,323,673      0.40               0:11             10:50             14.99
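The speedup column in the processor table is the single-processor total rendering time divided by the parallel total rendering time. The sketch below recomputes it from the times on this slide and also derives parallel efficiency (speedup divided by processor count), which is not on the slide and is added here only for illustration. The helper names are hypothetical.

```python
def to_seconds(clock):
    """Convert 'h:mm:ss' or 'mm:ss' to seconds."""
    seconds = 0
    for part in clock.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


# Total rendering times from the table, keyed by number of processors.
total_times = {1: "2:42:26", 2: "38:25", 4: "22:12",
               8: "16:28", 12: "11:37", 16: "10:50"}


def speedups(times):
    """Speedup relative to the single-processor standard algorithm."""
    base = to_seconds(times[1])
    return {n: base / to_seconds(t) for n, t in times.items()}


def efficiencies(times):
    """Speedup per processor; values above 1 reflect the extra gain from frame coherence."""
    return {n: s / n for n, s in speedups(times).items()}


for n, s in speedups(total_times).items():
    print(n, round(s, 2), round(s / n, 2))
```

Efficiency above 1 for small processor counts is consistent with the parallel runs also using frame coherence, while the baseline is the standard algorithm.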
Practical Parallel Processing for Today’s Rendering Challenges -- 123
Another Test Animation
Soda Worship (60 frames at 160x120; 839 objects)
Practical Parallel Processing for Today’s Rendering Challenges -- 124
Another Test Animation
Practical Parallel Processing for Today’s Rendering Challenges -- 125
Results
                               standard      frame coherence   ratio of frame coherence
                               algorithm     algorithm         to standard                speedup
total number of rays           44,454,548    19,944,939        0.45                       --
total parse time               3:06          3:47              1.04                       --
first frame rendering time     27:54         29:14             1.07                       0.94
average frame rendering time   28:07         15:07             0.54                       1.86
total frame rendering time     28:10:10      15:11:27          0.54                       1.85

number of    total number   ratio to           average frame    total rendering
processors   of rays        single processor   rendering time   time              speedup
1            44,454,548     1.00               28:10            28:10:10          1.00
2            22,163,526     0.50               15:11            11:48:11          2.39
4            22,286,422     0.50               7:45             4:27:26           6.32
8            22,409,023     0.50               3:58             2:16:34           12.38
12           23,125,140     0.52               2:38             1:31:05           18.56
16           23,180,741     0.52               2:02             1:12:15           23.39
Practical Parallel Processing for Today’s Rendering Challenges -- 126
Results Discussion
Good speedup
Multiplicative speedup when frame coherence and distributed computing are combined
Speedup limitations
• voxel approximation
• writing pixels to frame files (communication)
Practical Parallel Processing for Today’s Rendering Challenges -- 127
Conclusions
Frame coherence algorithm combined with distributed computing provides good speedup
Algorithm scales well
Techniques are useful and accessible to a wide variety of users
Benefits depend on inherent properties of the animation
Practical Parallel Processing for Today’s Rendering Challenges -- 128
Shameless Advertisement
Masters of Fine Arts in Computing (MFAC)
• special effects and animation courses
• two year program
Clemson Computer Animation Festival in Fall 2002
Practical Parallel Processing for Today’s Rendering Challenges -- 129
Schedule
Introduction
Parallel / Distributed Rendering Issues
Classification of Parallel Rendering Systems
Practical Applications
• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence
• Interactive Ray Tracing (Reinhard)
• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer
Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 130
Overview
Introduction
Interactive ray tracer
Animation and interactive ray tracing
Sample re-use techniques
Practical Parallel Processing for Today’s Rendering Challenges -- 131
Introduction
Practical Parallel Processing for Today’s Rendering Challenges -- 132
Interactive Ray Tracing
Renders effects not available using other rendering algorithms
Feasible on high-end supercomputers provided suitable hardware is chosen
Scales sub-linearly in scene complexity
Scales almost linearly in number of processors
Practical Parallel Processing for Today’s Rendering Challenges -- 133
Hardware Choices
Shared memory vs. distributed memory
Latency and throughput for pixel communication
Choice: shared memory
• This section of the course focuses on SGI Origin series supercomputers
Practical Parallel Processing for Today’s Rendering Challenges -- 134
Shared Memory
Shared address space
Physically distributed memory
ccNUMA architecture
Practical Parallel Processing for Today’s Rendering Challenges -- 135
SGI Origin 2000 Architecture
Practical Parallel Processing for Today’s Rendering Challenges -- 136
Implications
ccNUMA machines are easy to program, but it is more difficult to generate efficient code
Memory mapping and processor placement may be important for certain applications
Topic returns later in this course
Practical Parallel Processing for Today’s Rendering Challenges -- 137
Overview
Introduction
Interactive ray tracer
Animation and interactive ray tracing
Sample re-use techniques
Practical Parallel Processing for Today’s Rendering Challenges -- 138
Interactive Ray Tracing
Practical Parallel Processing for Today’s Rendering Challenges -- 139
Basic Algorithm
Master-slave configuration
Master (display thread) displays results and farms out ray tasks
Slaves produce new rays
Task size reduced towards end of each frame
• Load balancing
• Cache coherence
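The slide does not say how the task size is reduced towards the end of a frame. The sketch below assumes a simple guided scheme in which each task takes a fixed fraction of the remaining pixels, bounded below by a minimum size. The function name `task_sizes` and the parameters `min_size` and `factor` are hypothetical.

```python
def task_sizes(total_pixels, workers, min_size=32, factor=2):
    """Return a list of task sizes (in pixels) for one frame.

    Early tasks are large, which keeps neighboring rays on one processor
    (cache coherence); late tasks are small, so that all workers finish
    at about the same time (load balancing).
    """
    remaining = total_pixels
    sizes = []
    while remaining > 0:
        # Take a share of what is left, but never less than min_size
        # and never more than what actually remains.
        size = max(min_size, remaining // (factor * workers))
        size = min(size, remaining)
        sizes.append(size)
        remaining -= size
    return sizes


sizes = task_sizes(640 * 480, workers=8)
print(sizes[0], sizes[-1], len(sizes))
```

A larger `factor` gives smaller tasks throughout, trading scheduling overhead against balance.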
Practical Parallel Processing for Today’s Rendering Challenges -- 140
Tracing a Single Ray
Use spatial subdivisions for ray acceleration (assumed familiar)
Use grid or bounding volume hierarchy
Could be optimized further, but good results have been obtained with these acceleration structures
Efficiency mainly due to low level optimization
Practical Parallel Processing for Today’s Rendering Challenges -- 141
Low Level Optimization
Ray tracing in general:
• Ray coherence: neighboring rays tend to intersect the same objects
• Cache coherence: objects previously intersected are likely to still reside in cache for current ray
• Memory access patterns are important (next slide)
Practical Parallel Processing for Today’s Rendering Challenges -- 142
Memory Access
On SGI Origin series computers:
• Memory allocated for a specific process may be located elsewhere in the machine, so reading memory may be expensive
• Processes may migrate to other processors when executing a system call; the whole cache is then invalidated, and previously local memory may now be remote and more expensive to access
Practical Parallel Processing for Today’s Rendering Challenges -- 143
Memory Access (2)
Pin down processes to processors
Allocate memory close to where the processes that will use it run
Use sysmp and sproc for processor placement
Use mmap or dplace for memory placement
Practical Parallel Processing for Today’s Rendering Challenges -- 144
Further Low Level Optimizations
Know the architecture you work on (Appendix III.A in the course notes)
Use profiling to find expensive bits of code and cache misses (Appendix III.B in the course notes)
Use padding to fit important data structures on a single cache line
Practical Parallel Processing for Today’s Rendering Challenges -- 145
Frameless Rendering
Display pixel as soon as it is computed
No concept of frames
• Perceptually preferable
• Equivalent of a full frame takes longer to compute
• Less efficient exploitation of cache coherence
• This alternative will return later in this course
Practical Parallel Processing for Today’s Rendering Challenges -- 146
Overview
Introduction
Interactive ray tracer
Animation and interactive ray tracing
Sample re-use techniques
Practical Parallel Processing for Today’s Rendering Challenges -- 147
Animation and Interactive Ray Tracing
Practical Parallel Processing for Today’s Rendering Challenges -- 148
Why Animation?
Once interactive rendering is feasible, walk-through is not enough
Desire to manipulate the scene interactively
Render preprogrammed animation paths
Practical Parallel Processing for Today’s Rendering Challenges -- 149
Issues to Be Addressed
What stops us from animating objects?
• Answer: spatial subdivisions
• Acceleration structures normally built during pre-processing
• They assume objects are stationary
Practical Parallel Processing for Today’s Rendering Challenges -- 150
Possible Solutions
Target applications that require a small number of objects to be manipulated/animated
• Render these objects separately
Traversal cost will be linear in the number of animated objects
Only feasible for extremely small number of objects
Practical Parallel Processing for Today’s Rendering Challenges -- 151
Possible Solutions (2)
Target small number of manipulated or animated objects
• Modify existing spatial subdivisions
For each frame delete object from data structure
Update object’s coordinates
Re-insert object into data structure
• This is our preferred approach
Practical Parallel Processing for Today’s Rendering Challenges -- 152
Spatial Subdivision
Should be able to deal with
• Rapid insertion and deletion of objects
• User manipulation that causes the extent of the scene to grow
Practical Parallel Processing for Today’s Rendering Challenges -- 153
Subdivisions Investigated
Regular grid
Hierarchical grid
• Borrows from octree spatial subdivision
• In our case this is a full tree: all leaf nodes are at the same depth
Both acceleration structures are investigated in the next few slides
Practical Parallel Processing for Today’s Rendering Challenges -- 154
Regular Grid Data Structure
We assume familiarity with spatial subdivisions!
Practical Parallel Processing for Today’s Rendering Challenges -- 155
Object Insertion Into Grid
Compute bounding box of object
Compute overlap of bounding box with grid voxels
Object is inserted into overlapping voxels
Object deletion works similarly
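The insertion and deletion steps above, together with the delete, update, re-insert cycle from the earlier "Possible Solutions (2)" slide, can be sketched as follows. This is a minimal sketch of a plain regular grid with cubic voxels. The class name `Grid` and its layout are assumptions, and indices are clamped to the grid, so the expanding-scene extension discussed on the following slides is not covered.

```python
from collections import defaultdict
from itertools import product


class Grid:
    """Regular grid storing object ids per voxel (hypothetical layout)."""

    def __init__(self, origin, voxel_size, resolution):
        self.origin = origin          # (x, y, z) of the grid's minimum corner
        self.voxel_size = voxel_size  # edge length of one cubic voxel
        self.resolution = resolution  # voxels per axis
        self.voxels = defaultdict(set)

    def _overlapped(self, box_min, box_max):
        """Yield the (i, j, k) indices of all voxels the bounding box overlaps."""
        ranges = []
        for axis in range(3):
            lo = int((box_min[axis] - self.origin[axis]) // self.voxel_size)
            hi = int((box_max[axis] - self.origin[axis]) // self.voxel_size)
            lo = max(lo, 0)
            hi = min(hi, self.resolution - 1)
            ranges.append(range(lo, hi + 1))
        return product(*ranges)

    def insert(self, obj_id, box_min, box_max):
        for index in self._overlapped(box_min, box_max):
            self.voxels[index].add(obj_id)

    def remove(self, obj_id, box_min, box_max):
        for index in self._overlapped(box_min, box_max):
            self.voxels[index].discard(obj_id)

    def move(self, obj_id, old_box, new_box):
        """Per-frame update: delete, then re-insert with the new bounding box."""
        self.remove(obj_id, *old_box)
        self.insert(obj_id, *new_box)


grid = Grid(origin=(0.0, 0.0, 0.0), voxel_size=1.0, resolution=4)
grid.insert("sphere", (0.5, 0.5, 0.5), (1.5, 0.5, 0.5))
grid.move("sphere",
          ((0.5, 0.5, 0.5), (1.5, 0.5, 0.5)),
          ((2.5, 0.5, 0.5), (2.5, 0.5, 0.5)))
```

The cost of `insert` and `remove` grows with the number of overlapped voxels, which is why insertion cost depends on relative object size.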
Practical Parallel Processing for Today’s Rendering Challenges -- 156
Extensions to Regular Grid
Dealing with expanding scenes requires
• Modifications to object insertion/deletion
• Ray traversal
Practical Parallel Processing for Today’s Rendering Challenges -- 157
Extensions to Regular Grid (2)
Practical Parallel Processing for Today’s Rendering Challenges -- 158
Features of New Grid Data Structure
We call this an ‘Interactive Grid’
• Straightforward object insertion/deletion
• Deals with expanding scenes
• Insertion cost depends on relative object size
• Traversal cost somewhat higher than for regular grid
Practical Parallel Processing for Today’s Rendering Challenges -- 159
Hierarchical Grid
Objectives
• Reduce insertion/deletion cost for larger objects
• Retain advantages of interactive grid
Practical Parallel Processing for Today’s Rendering Challenges -- 160
Hierarchical Grid (2)
Practical Parallel Processing for Today’s Rendering Challenges -- 161
Hierarchical Grid (3)
Build full octree with all leaf nodes at the same level
• Allow objects to reside in leaf nodes as well as in nodes higher up in the hierarchy
• Each object can be inserted into one or more voxels of at most one level in the hierarchy
• Small objects reside in leaf nodes, large objects reside elsewhere in the hierarchy
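The slides do not state the rule that decides at which level an object is stored. One plausible heuristic, shown here purely as an assumption, is to descend the full octree as long as a voxel at the next level is still at least as large as the object's largest bounding-box extent, so that a large object touches only a few voxels. The function name `choose_level` is hypothetical.

```python
def choose_level(object_extent, scene_extent, depth):
    """Pick the octree level (0 = root, depth = leaves) in which to store an object.

    object_extent: largest edge of the object's bounding box
    scene_extent:  edge length of the root voxel
    depth:         depth of the full octree (all leaves at this level)
    """
    level = 0
    voxel = scene_extent
    # Descend while the child voxels are still big enough to hold the object.
    while level < depth and voxel / 2 >= object_extent:
        voxel /= 2
        level += 1
    return level


print(choose_level(1.0, 16.0, 4))   # small object: stored in the leaves
print(choose_level(5.0, 16.0, 4))   # large object: stored higher up
```

With this rule an object is inserted into voxels of exactly one level, matching the constraint on the slide.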
Practical Parallel Processing for Today’s Rendering Challenges -- 162
Hierarchical Grid (4)
Features:
• Deals with expanding scenes similar to interactive grid
• Reduced insertion/deletion cost
• Traversal cost somewhat higher than interactive grid
Practical Parallel Processing for Today’s Rendering Challenges -- 163
Test Scenes
Practical Parallel Processing for Today’s Rendering Challenges -- 164
Video
Practical Parallel Processing for Today’s Rendering Challenges -- 165
Measurements
We measure
• Traversal cost of
Interactive grid
Hierarchical grid
Regular grid
• Object update rates of
Interactive grid
Hierarchical grid
Practical Parallel Processing for Today’s Rendering Challenges -- 166
Framerate vs. Grid Size (Sphereflake)
Practical Parallel Processing for Today’s Rendering Challenges -- 167
Framerate vs. Grid Size (Triangles)
Practical Parallel Processing for Today’s Rendering Challenges -- 168
Framerate Over Time (Sphereflake)
Practical Parallel Processing for Today’s Rendering Challenges -- 169
Framerate Over Time (Triangles)
Practical Parallel Processing for Today’s Rendering Challenges -- 170
Conclusions
Interactive manipulation of ray traced scenes is both desirable and feasible using these modifications to regular and hierarchical grids
Slight impact on traversal cost
(More results available in course notes)
Practical Parallel Processing for Today’s Rendering Challenges -- 171
Overview
Introduction
Interactive ray tracer
Animation and interactive ray tracing
Sample re-use techniques
Practical Parallel Processing for Today’s Rendering Challenges -- 172
Sample Re-use Techniques
Practical Parallel Processing for Today’s Rendering Challenges -- 173
Brute Force Ray Tracing
Enables interactive ray tracing
Does not allow large image sizes
Does not scale to scenes with high depth complexity
Practical Parallel Processing for Today’s Rendering Challenges -- 174
Solution
Exploit temporal coherence
Re-use results from previous frames
Practical Parallel Processing for Today’s Rendering Challenges -- 175
Practical Solutions
Tapestry (Simmons et al. 2000)
• Focuses on complex lighting simulation
Render cache (Walter et al. 1999)
• Addresses scene complexity issues
• Explained next
Parallel render cache (Reinhard et al. 2000)
• Builds on Walter’s render cache
• Explained next
Practical Parallel Processing for Today’s Rendering Challenges -- 176
Render Cache Algorithm
Basic setup
• One front-end for:
Displaying pixels
Managing previous results
• Parallel back-end for:
Producing new pixels
Practical Parallel Processing for Today’s Rendering Challenges -- 177
Render Cache Front-end
Frame based rendering
For each frame do:
• Project existing points
• Smooth image and display
• Select new rays using heuristics
• Request samples from back-end
• Insert new points into point cloud
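The first and third steps of the front-end loop (project existing points, select new rays) can be sketched as below. This is a simplified illustration, not the render cache implementation: it assumes an unrotated pinhole camera looking down the negative z axis, keeps the nearest point per pixel, and requests new samples simply for pixels that received no point, whereas the slide only says that heuristics are used. The names `reproject` and `missing_pixels` are hypothetical.

```python
def reproject(points, camera_pos, focal, width, height):
    """Project cached 3D points into the current view with a depth test.

    points: list of ((x, y, z), color) in world space.
    Returns a height x width image; pixels with no point hold None.
    """
    depth = [[float("inf")] * width for _ in range(height)]
    image = [[None] * width for _ in range(height)]
    for (x, y, z), color in points:
        cx, cy, cz = x - camera_pos[0], y - camera_pos[1], z - camera_pos[2]
        if cz >= 0:
            continue  # point is behind the camera
        dist = -cz
        px = int(round(focal * cx / dist + width / 2))
        py = int(round(focal * cy / dist + height / 2))
        # Keep only the nearest point that lands on each pixel.
        if 0 <= px < width and 0 <= py < height and dist < depth[py][px]:
            depth[py][px] = dist
            image[py][px] = color
    return image


def missing_pixels(image):
    """Pixels with no reprojected point: candidates for new rays."""
    return [(x, y) for y, row in enumerate(image)
            for x, color in enumerate(row) if color is None]


cache = [((0.0, 0.0, -2.0), "red"), ((0.0, 0.0, -5.0), "blue")]
frame = reproject(cache, camera_pos=(0.0, 0.0, 0.0), focal=4, width=4, height=4)
requests = missing_pixels(frame)
```

Because every cached point is touched once per frame, the cost of this step grows with image size, which is the bottleneck the parallel render cache addresses.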
Practical Parallel Processing for Today’s Rendering Challenges -- 178
Render Cache
Practical Parallel Processing for Today’s Rendering Challenges -- 179
Render Cache (2)
Point reprojection is relatively cheap
Smooth camera movement for small images
Does not scale to large images or large numbers of renderers: front-end becomes bottleneck
Practical Parallel Processing for Today’s Rendering Challenges -- 180
Parallel Render Cache
Aim: remove front-end bottleneck
• Distribute point reprojection functionality
• Integrate point reprojection with renderers
• Front-end only displays results
Practical Parallel Processing for Today’s Rendering Challenges -- 181
Parallel Render Cache (2)
Practical Parallel Processing for Today’s Rendering Challenges -- 182
Parallel Render Cache (3)
Features:
• Scalable behavior for scene complexity
• Scalable in number of processors
• Allows larger images to be rendered
• Retains artifacts from render cache
• Introduces new artifacts
Practical Parallel Processing for Today’s Rendering Challenges -- 183
Artifacts
Render cache artifacts at tile boundaries
Image deteriorates during camera movement
These artifacts are deemed more acceptable than loss of smooth camera movement!
Practical Parallel Processing for Today’s Rendering Challenges -- 184
Video
Practical Parallel Processing for Today’s Rendering Challenges -- 185
Test Scenes
Practical Parallel Processing for Today’s Rendering Challenges -- 186
Results
Sub-parts of algorithm measured individually
• Measure time per call to subroutine
• Sum over all processors and all invocations
• Afterwards divide by number of processors and number of invocations
• Results are measured in events per second per processor
Practical Parallel Processing for Today’s Rendering Challenges -- 187
Scalability (Teapot Model)
Practical Parallel Processing for Today’s Rendering Challenges -- 188
Scalability (Room Model)
Practical Parallel Processing for Today’s Rendering Challenges -- 189
Samples Per Second
Practical Parallel Processing for Today’s Rendering Challenges -- 190
Reprojections Per Second
Practical Parallel Processing for Today’s Rendering Challenges -- 191
Conclusions
Exploitation of temporal coherence gives significantly smoother results than available with brute force ray tracing alone
This is at the cost of some artifacts which require further investigation
(More results available in course notes)
Practical Parallel Processing for Today’s Rendering Challenges -- 192
Acknowledgements
Thanks to:
• Steven Parker for writing the interactive ray tracer in the first place
• Brian Smits, Peter Shirley and Charles Hansen for involvement in the animation and parallel point reprojection projects
• Bruce Walter and George Drettakis for the render cache source code
Practical Parallel Processing for Today’s Rendering Challenges -- 193
Schedule
Introduction
Parallel / Distributed Rendering Issues
Classification of Parallel Rendering Systems
Practical Applications
• Rendering at Clemson / Distributed Computing and Spatial/Temporal Coherence
• Interactive Ray Tracing
• Parallel Rendering and the Quest for Realism: The Kilauea Massively Parallel Ray Tracer (Kato)
Summary / Discussion
Practical Parallel Processing for Today’s Rendering Challenges -- 194
Outline
What is Kilauea?
Parallel ray tracing & photon mapping
Kilauea architecture
Shading logic
Rendering results
Practical Parallel Processing for Today’s Rendering Challenges -- 196
Objective
Global illumination
Extremely complex scenes
Practical Parallel Processing for Today’s Rendering Challenges -- 197
Parallel Processing
Hardware
• Multi-CPU machine
• Linux PC cluster
Software
• Threading (Pthread)
• Message passing (MPI)
Practical Parallel Processing for Today’s Rendering Challenges -- 198
Our Render Farm
Practical Parallel Processing for Today’s Rendering Challenges -- 199
Global Illumination
Photon map
Practical Parallel Processing for Today’s Rendering Challenges -- 200
Ray Tracing Renderer
[Diagram: stages Read Scene, Ray Tracing, Shading and Output; three rows labelled Machine: A B C]
Practical Parallel Processing for Today’s Rendering Challenges -- 201
Ray Tracing Renderer
[Diagram: stages Read Scene, Ray Tracing, Shading and Output; rows labelled Machine: A B C, Machine: D E F and Machine: G H I]
Practical Parallel Processing for Today’s Rendering Challenges -- 202
Ray Tracing Renderer
[Diagram: stages Read Scene, Ray Tracing, Shading and Output; rows labelled Machine: A B C, Machine: D E F and Machine: G H I]
Practical Parallel Processing for Today’s Rendering Challenges -- 203
Outline
What is Kilauea?
Parallel ray tracing & photon mapping
Kilauea architecture
Shading logic
Rendering results
Practical Parallel Processing for Today’s Rendering Challenges -- 204
Parallel Ray Tracing
Simple case
Complex case
Practical Parallel Processing for Today’s Rendering Challenges -- 206
Accel Grid
Hierarchical uniform grid
Scene data
Practical Parallel Processing for Today’s Rendering Challenges -- 207
Simple Case (scene distribution)
[Diagram: Scene Data copied to Machine A and Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 208
Simple Case (ray tracing)
[Diagram: Machine A, Machine B, Screen]
Practical Parallel Processing for Today’s Rendering Challenges -- 209
Parallel Ray Tracing
Simple case
Complex case
Practical Parallel Processing for Today’s Rendering Challenges -- 210
Complex Case (scene distribution)
[Diagram: Scene Data distributed to Machine A and Machine B; label: Random]
Practical Parallel Processing for Today’s Rendering Challenges -- 211
Complex Case (accel grid construction)
Independent construction
Aligned by table
[Diagram: Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 212
Complex Case (ray tracing)
[Diagram: Machine A, Machine B, Screen; label: Compare Results]
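In the complex case the scene data is split across machines, so the same ray has to be tested on each machine and the results compared. The slide does not describe the comparison itself; the sketch below assumes the natural rule of keeping the closest intersection. The function name `nearest_hit` and the tuple layout are hypothetical.

```python
def nearest_hit(partial_hits):
    """Combine per-machine intersection results for one ray.

    partial_hits: one entry per machine, either None (no hit in that
    machine's share of the scene) or a tuple (distance, machine, primitive_id).
    Returns the closest hit, or None if no machine reported a hit.
    """
    hits = [hit for hit in partial_hits if hit is not None]
    if not hits:
        return None
    return min(hits, key=lambda hit: hit[0])


# Machine A finds a hit at distance 1.5, machine B at 3.0: A's hit is visible.
print(nearest_hit([(1.5, "A", 2), (3.0, "B", 7)]))
```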
Practical Parallel Processing for Today’s Rendering Challenges -- 213
Outline
What is Kilauea?
Parallel ray tracing & photon mapping
Kilauea architecture
Shading logic
Rendering results
Practical Parallel Processing for Today’s Rendering Challenges -- 214
Parallel Photon Mapping
Photon trace
Photon lookup
Practical Parallel Processing for Today’s Rendering Challenges -- 216
Photon Tracing (simple case)
[Diagram: Photon Map; labels: Store, Store]
Practical Parallel Processing for Today’s Rendering Challenges -- 217
Photon Tracing (complex case)
[Diagram: Machine A, Machine B, Photon Map A, Photon Map B; label: Randomly store]
Practical Parallel Processing for Today’s Rendering Challenges -- 218
Parallel Photon Mapping
Photon trace
Photon lookup
Practical Parallel Processing for Today’s Rendering Challenges -- 219
Photon Lookup (simple case)
[Diagram: Machine A and Machine B, each with a Photon Map, a Lookup request and an Irradiance value]
Practical Parallel Processing for Today’s Rendering Challenges -- 220
Photon Lookup (complex case)
[Diagram: Machine A, Machine B, Photon Map A, Photon Map B; labels: Lookup request, Irradiance calculation, Irradiance value copy]
Practical Parallel Processing for Today’s Rendering Challenges -- 221
Outline
What is Kilauea?
Parallel ray tracing & photon mapping
Kilauea architecture
Shading logic
Rendering results
Practical Parallel Processing for Today’s Rendering Challenges -- 222
Task
Mtask, Wtask, Btask, Stask, Rtask
Atask, Etask, Ltask, Ptask, Otask
Practical Parallel Processing for Today’s Rendering Challenges -- 223
Task Assignment
[Diagram: several tasks assigned to each of Machine A, Machine B and Machine C]
Practical Parallel Processing for Today’s Rendering Challenges -- 224
Roles of Tasks
[Diagram: tasks T, S, R and two A tasks; labels: pixel, Compare]
Practical Parallel Processing for Today’s Rendering Challenges -- 225
Task Configuration
[Diagram: tasks T, S, R and two A tasks; Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 226
Task Configuration
[Diagram: two A tasks and two T S R task sets; Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 227
Task Configuration
[Diagram: three machine pairs (A and B, C and D, E and F), each with two A tasks and two T S R task sets]
Practical Parallel Processing for Today’s Rendering Challenges -- 228
Task Interaction
[Diagram: two A tasks and two T S R task sets on Machine A and Machine B; labels: pixel, pixel]
Practical Parallel Processing for Today’s Rendering Challenges -- 232
Task Interaction
[Diagram: two A tasks and two T S R task sets on Machine A and Machine B; labels: pixel, Compare]
Practical Parallel Processing for Today’s Rendering Challenges -- 234
Task Interaction (simple case)
[Diagram: two A tasks and two T S R task sets on Machine A and Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 235
Roles of Tasks (photon map)
[Diagram: tasks T, S, R, L, two A tasks and two P tasks; Photon Map A, Photon Map B; label: Lookup]
Practical Parallel Processing for Today’s Rendering Challenges -- 236
Task Configuration (photon map)
[Diagram: tasks T, S, R, L, two A tasks and two P tasks; Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 237
Task Configuration (photon map)
[Diagram: two T S R task sets, two A tasks, two L tasks and two P tasks; Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 238
Task Interaction (photon map)
[Diagram: two T S task sets, two L tasks and two P tasks; Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 239
Task Interaction (photon map)
[Diagram: two T S task sets, two L tasks and two P tasks; Machine A, Machine B; labels: photon, photon]
Practical Parallel Processing for Today’s Rendering Challenges -- 241
Task Interaction (photon map)
[Diagram: two T S task sets, two L tasks and two P tasks; Machine A, Machine B; labels: photon, Lookup]
Practical Parallel Processing for Today’s Rendering Challenges -- 242
Task Configuration (simple photon)
[Diagram: two T S R task sets, two A tasks, two L tasks and two P tasks; Machine A, Machine B]
Practical Parallel Processing for Today’s Rendering Challenges -- 243
Task Priority
[Diagram: priority scale from Low to High; tasks T, S, R, L, A, P; labels: pixel, Compare, photon, Lookup]
Practical Parallel Processing for Today’s Rendering Challenges -- 244
Outline
What is Kilauea?
Parallel ray tracing & photon mapping
Kilauea architecture
Shading logic
Rendering results
Practical Parallel Processing for Today’s Rendering Challenges -- 245
Parallel Shading Problem
[Diagram: point P with normal N, ray I and Reflection ray]
Cp = Cs + Cr
Practical Parallel Processing for Today’s Rendering Challenges -- 246
Parallel Shading Problem
[Diagram: point P with normal N, ray I and Reflection ray; Machine A, Machine B]
Cp = Cs + Cr
Practical Parallel Processing for Today’s Rendering Challenges -- 247
Parallel Shading Problem (solution)
[Diagram: labels A, B, C, D, E]
Practical Parallel Processing for Today’s Rendering Challenges -- 248
Parallel Shading Problem (solution)
[Diagram: labels A, B, C, D, E]
A : C = Cs + Cr
B : C = Cs + Cr
Practical Parallel Processing for Today’s Rendering Challenges -- 255
Decomposing Shading Computation
[Diagram: shading calculation]
Practical Parallel Processing for Today’s Rendering Challenges -- 256
Decomposing Shading Computation
[Diagram: shading calculation; labels: funcA, funcB, outside task]
Practical Parallel Processing for Today’s Rendering Challenges -- 257
Decomposing Shading Computation
[Diagram: shading calculation with funcA, funcB and outside task; below it SPOT, SPOT and outside task]
Practical Parallel Processing for Today’s Rendering Challenges -- 258
SPOT
Method + Data
[Diagram label: data slot]
Practical Parallel Processing for Today’s Rendering Challenges -- 259
SPOT Condition
Practical Parallel Processing for Today’s Rendering Challenges -- 260
Parallel Shading Solution using SPOT
[Diagram: SPOT A and SPOT B on Machine A and Machine B, an Outside Task and a Reflection Ray; values Cs and Cr]
C = Cs + Cr
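The slides describe a SPOT only as a method plus data with data slots, and do not spell out when the method runs. The sketch below assumes that a SPOT fires once all of its slots have been filled, so that C = Cs + Cr can be evaluated even when Cs and Cr arrive at different times from different machines. The class `Spot`, its `fill` method and the slot names are hypothetical and are not Kilauea's API.

```python
class Spot:
    """A method bundled with named data slots (assumed semantics)."""

    def __init__(self, method, slot_names):
        self.method = method
        self.slots = {name: None for name in slot_names}
        self.filled = set()

    def fill(self, name, value):
        """Store a value; run the method once every slot has data.

        Returns the method's result when the last slot is filled,
        otherwise None.
        """
        self.slots[name] = value
        self.filled.add(name)
        if self.filled == set(self.slots):
            return self.method(**self.slots)
        return None


def add_colors(cs, cr):
    """C = Cs + Cr, per color channel."""
    return tuple(a + b for a, b in zip(cs, cr))


spot = Spot(add_colors, ["cs", "cr"])
first = spot.fill("cs", (0.25, 0.5, 0.0))    # surface color arrives first
color = spot.fill("cr", (0.25, 0.25, 0.5))   # reflection color arrives later
```

Because the computation waits on data rather than on a blocking call, the machine that holds the SPOT is free to process other rays while the reflection ray is traced elsewhere.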
Practical Parallel Processing for Today’s Rendering Challenges -- 261
Parallel Shading Solution using SPOT
[Diagram: three pairs of SPOTs across Machine A and Machine B with an Outside Task; labels A, B, C]
Practical Parallel Processing for Today’s Rendering Challenges -- 262
Shader SPOT Network Example
Practical Parallel Processing for Today’s Rendering Challenges -- 263
Outline
What is Kilauea?
Parallel ray tracing & photon mapping
Kilauea architecture
Shading logic
Rendering results
Practical Parallel Processing for Today’s Rendering Challenges -- 264
Rendering Results
Test machine specification
• 1GHz Dual Pentium III
• 512Mbyte memory
• 100BaseT Ethernet
• 18 machines connected via 100BaseT switch
Practical Parallel Processing for Today’s Rendering Challenges -- 265
Quatro
700,223 triangles, 1 area point & sky light, 1280 x 692
18 machines : 7min 19sec
Practical Parallel Processing for Today’s Rendering Challenges -- 266
Quatro : single Atask test
[Chart: Speedup vs. number of machines (1 to 17); series: raytrace, linear, all]
[Chart: Rendering time, execution time (h:m:s) vs. number of machines (1 to 17); series: all, raytrace]
Practical Parallel Processing for Today’s Rendering Challenges -- 267
Jeep
715,059 triangles, 1 directional & sky light, 1280 x 692
18 machines : 8min 27sec
Practical Parallel Processing for Today’s Rendering Challenges -- 268
Jeep4
2,859,636 triangles, 1 directional & sky light, 1280 x 692
18 machines : 12min 38sec
2 Atasks x 1
Jeep4: 2 Atasks test
1 Atask group = 2 machines
[Chart: Speedup (0 to 10) vs. number of Atask groups (1 to 9); series: raytrace, all, linear]
[Chart: Rendering time in h:m:s (0:00:00 to 1:40:48) vs. number of Atask groups (1 to 9); series: all, raytrace]
Jeep8
5,719,072 triangles, 1 directional & sky light, 1280 x 692
16 machines: 18 min 43 sec, 4 Atasks x 4
Escape POD
468,321 triangles, 1 directional & sky light, 1280 x 692
18 machines: 14 min 55 sec
ansGun
20,279 triangles, 1 spot & sky light, 1280 x 960
18 machines: 16 min 38 sec
SCN101
787,255 triangles, 1 area light, 1280 x 692
18 machines: 9 min 10 sec
Video
Conclusion / Future Work
We achieved:
• Close to linear parallel performance
• Highly extensible architecture
We will achieve even more:
• Speed
• Stability
• Usability (user interface)
• Etc.
Additional Information
Kilauea live rendering demo
• Booth #1927, SquareUSA
http://www.squareusa.com/kilauea/
Schedule
Introduction
Parallel / Distributed Rendering Issues
Classification of Parallel Rendering Systems
Practical Applications
Summary / Discussion (Chalmers)
Summary
Contact Information
Alan Chalmers [email protected]
Toshi Kato http://www.squareusa.com/kilauea/
Erik Reinhard [email protected]
Slides http://www.cs.clemson.edu/~tadavis