Parallel Graphics Rendering

Matthew Campbell
Senior, Computer Science

[email protected]

Overview

Motivation
Three categories of parallel rendering
Our approach
Results
Questions

Motivation
PC graphics cards are getting faster at an exponential rate.
PC graphics boards are much cheaper than proprietary SGI hardware.
  GeForce4 FX = $150.00 (130 Mtris/sec)
  SGI Onyx 300 = $145,000 (80 Mtris/sec)
Maintenance costs are lower.
  Replacement parts are easy to get.
  PCs are not as complicated as proprietary hardware.

Parallel Rendering
String together numerous PCs with good graphics boards and render the models in parallel.
  Increased performance
  Better technology tracking
Three groups of algorithms:
  Sort-First
  Sort-Middle
  Sort-Last

Rendering Pipeline
Transformation stage:
  Per-vertex operations
  Primitive assembly
  3D world space!
Rasterization stage:
  Per-fragment operations
  Texture mapping
  2D image space!

Parallel Rendering – Sort Last
Sort Last
  Distribute polygons
    Round-robin distribution, resulting in an equal load on each processor.
  Pass through the entire rendering pipeline.
    Transformation / Rasterization (see last slide)
  Each CPU now has a full-screen image of the scene
    But the individual images are incomplete: each contains only that CPU's polygons
    Polygons that should be hidden may be visible
    Solution: Image composition

Sort Last – Image Composition

Each CPU has a frame buffer with a color value for each pixel and a depth buffer with a Z value for each pixel.
Composition: given 2 images, compute the color of the pixel at each screen coordinate.
  Compare the depth buffer values at each pixel location.
  The resulting color is that of the pixel with the lower Z value.
Alpha blending is more complex. Why?
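
The depth comparison described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the project: images are flat lists of pixels, and the names `composite` and `FAR` are assumptions introduced here.

```python
FAR = float("inf")  # assumed depth for a pixel that no polygon covered


def composite(color_a, depth_a, color_b, depth_b):
    """Merge two partial images pixel by pixel, keeping the nearer fragment."""
    color_out, depth_out = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if za <= zb:  # lower Z value wins
            color_out.append(ca)
            depth_out.append(za)
        else:
            color_out.append(cb)
            depth_out.append(zb)
    return color_out, depth_out


# Two 4-pixel partial images from two hypothetical CPUs.
color_a = ["red", "red", None, None]
depth_a = [0.3, 0.8, FAR, FAR]
color_b = [None, "blue", "blue", None]
depth_b = [FAR, 0.5, 0.6, FAR]

color, depth = composite(color_a, depth_a, color_b, depth_b)
```

The merged depth buffer is returned along with the color so the result can be composited again with a third image. Taking the per-pixel minimum gives the same answer in any order, which is not true once semi-transparent fragments have to be blended.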

Sort Last – Image Composition

Time complexity of the previous algorithm is O(n) in the number of images to composite, which is pretty bad. Can we improve it?
Alternate algorithms:
  Tree composition
  Rotating rings
  Binary composition
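
As one illustration of the alternatives listed, tree composition can be sketched as merging images in pairs, round after round. This is a single-process sketch under assumptions: each image is a non-empty list of hypothetical `(depth, color)` pixels, and the function names are made up here. On a cluster, the merges within one round could run on different nodes at the same time, so the number of sequential steps is the number of rounds rather than the number of images.

```python
def composite(image_a, image_b):
    """Per-pixel merge of two images of (depth, color) pairs; nearer pixel wins."""
    return [min(pa, pb, key=lambda p: p[0]) for pa, pb in zip(image_a, image_b)]


def tree_composite(images):
    """Merge a non-empty list of images pairwise; return (final image, rounds)."""
    rounds = 0
    while len(images) > 1:
        merged = [composite(images[i], images[i + 1])
                  for i in range(0, len(images) - 1, 2)]
        if len(images) % 2:  # odd image out waits for the next round
            merged.append(images[-1])
        images = merged
        rounds += 1
    return images[0], rounds


# Eight hypothetical 2-pixel images, one per node.
images = [[(i, f"c{i}"), (8 - i, f"c{i}")] for i in range(8)]
final, rounds = tree_composite(images)
```

With 8 images the loop finishes in 3 rounds, compared with 7 merges when the images are folded in one after another.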

Sort-Last Performance
Sort-Last has a very high communication bandwidth requirement.
  Each processor needs to send and receive an entire frame.
  1280x1024 resolution, 24 bits for color, 16 bits for depth, 30 fps = (3.9 MB + 2.6 MB) * 30 = 196 MB/sec bidirectional!
Need a very fast network interconnecting the CPUs in the cluster.
In actuality, we need more bandwidth, because we haven’t taken into account the time it takes to render the scene!
But... no overhead for rendering the actual scene!
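
The bandwidth figure on this slide can be reproduced with a short calculation. The helper name `frame_bytes` is an assumption for illustration, and MB is taken to mean 10^6 bytes.

```python
def frame_bytes(width, height, color_bits, depth_bits):
    """Return (color bytes, depth bytes) for one full frame."""
    pixels = width * height
    return pixels * color_bits // 8, pixels * depth_bits // 8


color, depth = frame_bytes(1280, 1024, color_bits=24, depth_bits=16)
per_second = (color + depth) * 30  # 30 frames per second

print(f"color buffer: {color / 1e6:.1f} MB")   # about 3.9 MB
print(f"depth buffer: {depth / 1e6:.1f} MB")   # about 2.6 MB
print(f"per second:   {per_second / 1e6:.1f} MB/sec")
```

The exact product is 196,608,000 bytes per second in each direction, which the slide rounds to 196 MB/sec.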

Parallel Rendering – Sort Middle

Sort Middle
  Distribute polygons in a round-robin fashion.
  Trap polygons between the geometry and rasterization phases.
  Each CPU in the cluster is responsible for a specific region in screen coordinates.
  Calculate the bounding boxes (screen space) for the trapped polygons and redistribute them to the appropriate CPU responsible for the region.
  Collate images.
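
The redistribution step can be sketched as follows, assuming the screen is split into equal horizontal strips with one strip per CPU. The function names and the strip layout are assumptions for illustration; polygons are lists of screen-space `(x, y)` vertices that have already been transformed and clipped.

```python
def bounding_box(polygon):
    """Screen-space bounding box of a polygon given as (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def strips_for(polygon, screen_height, num_cpus):
    """Indices of the horizontal strips (one per CPU) the bounding box touches."""
    strip_height = screen_height / num_cpus
    _, y_min, _, y_max = bounding_box(polygon)
    # Clamp so a vertex exactly on the bottom edge maps to the last strip.
    first = min(num_cpus - 1, max(0, int(y_min // strip_height)))
    last = min(num_cpus - 1, max(0, int(y_max // strip_height)))
    return range(first, last + 1)


def redistribute(polygons, screen_height, num_cpus):
    """Send each polygon to every CPU whose strip its bounding box overlaps."""
    buckets = {cpu: [] for cpu in range(num_cpus)}
    for polygon in polygons:
        for cpu in strips_for(polygon, screen_height, num_cpus):
            buckets[cpu].append(polygon)
    return buckets


small = [(10, 10), (40, 10), (25, 100)]          # fits inside strip 0
crossing = [(500, 200), (600, 200), (550, 300)]  # spans strips 0 and 1
buckets = redistribute([small, crossing], screen_height=1024, num_cpus=4)
```

Here `crossing` lands in two buckets because its bounding box spans a strip boundary, so two CPUs rasterize it.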

Parallel Rendering – Sort Middle
How do you divide the screen into regions?
  Strips (either horizontal or vertical)
  Squares
What is the mapping ratio between CPUs and regions?
  One-to-One: Each CPU manages 1 region
  One-to-Many: Each CPU manages many regions
What about polygons that cross region boundaries?
  Multiple CPUs render the same polygon.
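
A one-to-many mapping can be sketched by cutting the screen into square tiles and handing them out in a diagonal pattern. The diagonal rule `(tx + ty) % num_cpus` and the name `assign_tiles` are assumptions chosen for illustration, not a scheme taken from the slides.

```python
def assign_tiles(tiles_x, tiles_y, num_cpus):
    """Map every tile (tx, ty) to a CPU so each CPU owns scattered tiles."""
    owned = {cpu: [] for cpu in range(num_cpus)}
    for ty in range(tiles_y):
        for tx in range(tiles_x):
            owned[(tx + ty) % num_cpus].append((tx, ty))
    return owned


# A 4x4 grid of tiles shared among 4 CPUs: each CPU manages 4 regions.
owned = assign_tiles(tiles_x=4, tiles_y=4, num_cpus=4)
for cpu, tiles in owned.items():
    print(cpu, tiles)
```

Because each CPU's tiles are spread across the screen, a cluster of polygons in one corner tends to be shared among several CPUs instead of overloading one. The cost is more region boundaries, and therefore more polygons rendered by multiple CPUs.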

Sort-Middle Performance
Load balancing can be poor. The slowest CPU will block the system from rendering the next scene.
Load balancing is highly scene and view dependent. Need adaptive load-balancing schemes.
In high polygon count scenes, the size of each polygon can be very small (~1-2 pixels).
  In this case, sort-middle requires more bandwidth than sort-last.
Communication bandwidth required is dependent on the scene complexity. (Bad)

Parallel Rendering – Sort First
Sort First
  Distribute polygons round-robin to all CPUs.
  Calculate bounding volumes for each polygon.
    Remember, we are still in the world coordinate system.
    Each CPU is responsible for 1 volume.
  Redistribute polygons based on bounding volumes.
  Pass through the complete rendering pipeline.
  In the end we have sub-images at each processor.
  Designate a coordinator node, which receives sub-images from all other processors.
  Coordinator collates sub-images into the final image.

Sort First - Performance
Communication bandwidth required is based only on screen space resolution.
Example:
  4 CPUs, 1024*1024 scene, 32 bits/color
  The coordinator node receives 1024*1024*24 bits/frame, ~3 MB.
  Bandwidth: 90 MB/sec for 30 fps.
Problem: Similar to sort-middle, load balancing is scene dependent.
Bigger issue: Can’t use a one-to-many CPU to region mapping.
  Or can you?

Parallel Rendering Issues
Cannot break the rendering pipeline
  Pipeline is implemented in hardware
  Therefore, very expensive. Could lead to excessive stalls, cache misses, etc.
Modern graphics cards have large amounts of memory on the board and much faster access times.
  8 GB/sec vs. 1 GB/sec for AGP 4x
Graphics driver source code is unavailable
Additional cost/overhead due to framebuffer accesses.

Our Approach
High-performance real-time rendering.
  High scene complexity and/or multiple displays as in a virtual environment (VE).
Target: 200-300 million triangles/sec.
  In comparison, the best SGI platform, Reality Monster, is capable of 80 million polygons/sec.
Approach: Distributed Sort-First.
Two-level sorting.
  Organize your model in a spatial tree data structure.
  At run-time compare bounding volumes for interior nodes of the tree.
    The bounding volume for an interior node is a superset of its children. This minimizes comparisons.
  Fine pruning based on viewing frustum.
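
The coarse level of this sorting can be sketched with a small tree of axis-aligned bounding boxes. Everything here is an assumption for illustration: the `Node` layout, the box-versus-box test, and the use of a query box as a stand-in for a CPU's volume or the viewing frustum. The engine's actual data structure is not described in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    lo: tuple                 # minimum corner of the bounding box
    hi: tuple                 # maximum corner; encloses all children
    children: list = field(default_factory=list)
    items: list = field(default_factory=list)  # geometry stored at this node


def overlaps(lo_a, hi_a, lo_b, hi_b):
    """True if two axis-aligned boxes intersect on every axis."""
    return all(la <= hb and lb <= ha
               for la, ha, lb, hb in zip(lo_a, hi_a, lo_b, hi_b))


def collect(node, lo, hi, out):
    """Gather items from every subtree whose bounding box meets the query box."""
    if not overlaps(node.lo, node.hi, lo, hi):
        return out  # one comparison rejects the whole subtree
    out.extend(node.items)
    for child in node.children:
        collect(child, lo, hi, out)
    return out


left = Node((0, 0, 0), (5, 10, 10), items=["a", "b"])
right = Node((5, 0, 0), (10, 10, 10), items=["c"])
root = Node((0, 0, 0), (10, 10, 10), children=[left, right])

visible = collect(root, (6, 0, 0), (9, 9, 9), [])
```

Because a parent's box is a superset of its children's, rejecting `left` skips both of its items without testing them individually. A finer per-item test against the frustum would follow as the second level.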

Hardware
32 Intel Xeon processor cluster (1.5 GHz processor)
256 MB RDRAM/node (3.2 GB/sec memory bandwidth)
Myrinet (4 Gbps) and Fast Ethernet (200 Mbps full-duplex) communication fabrics.
64 bit/66 MHz PCI bus (4 Gbps throughput)
4x AGP (1 GB/sec throughput)

Software
Extensible Parallel 3D Rendering Engine
  Supports large geometric databases, including standard formats such as 3D Studio
  Provides an extensible API.
  Underlying system is based on OpenGL.
  Based on dynamic shared object model.

Dynamic Load Balancing
Adaptively resizes volumes assigned to a processor for single display systems.
Adaptively changes the number of processors and rendering volumes for multi-display systems.
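
One simple way to resize volumes adaptively is to give each node a share proportional to how fast it rendered its previous share. This is a hypothetical heuristic, not the scheme used in the engine: it treats each volume as a one-dimensional width and assumes rendering cost is roughly uniform within a volume.

```python
def rebalance(widths, render_ms):
    """Resize each node's volume in proportion to its measured throughput.

    widths    -- current share of the scene per node (any unit)
    render_ms -- last frame's render time per node, in milliseconds
    """
    total = sum(widths)
    speed = [w / t for w, t in zip(widths, render_ms)]  # width rendered per ms
    scale = total / sum(speed)                          # keep the total fixed
    return [s * scale for s in speed]


# Two nodes with equal shares; node 0 took 30 ms, node 1 took 10 ms.
new_widths = rebalance([0.5, 0.5], [30.0, 10.0])
print(new_widths)  # node 0 shrinks to about 0.25, node 1 grows to about 0.75
```

If the cost assumption holds, both nodes would need about 15 ms for the next frame, instead of the whole frame waiting 30 ms on the slower node.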

Software Architecture

Master-Slave arrangement

Multi-threaded

Two stage parallel rendering pipeline.

[Diagram: a master process and slave processes (Slave0, Slave1 and Slave28 shown) connected through a 128-port Myrinet 2000 switch. Labels in the diagram: Spatial Culling, Rendering Pipeline, Load Balancing, Frame Collation/Display, Tx Frustum, Network I/O, Frame Capture, Tx Frame Statistics, Thread0, Thread1.]

Results – Rendering Rate

[Figure 1 plots (left and right): rendering rate (million polys/sec, 0 to 250) versus number of nodes (0 to 30); series: Actual, Effective.]

Figure 1: Scalability of our implementation. Actual depicts the performance taking into account triangle overlap among nodes; effective depicts what the system is capable of delivering. Left image uses a real-world dataset (LIDAR data). Right image uses a generated dataset to fully exploit the overlap issue.

Results – Load Balancing

[Figure 2 plots (left and right): render time (ms, 0 to 200) versus frame number (0 to 100).]

Figure 2: The effects of load balancing on 4 nodes (left) and 16 nodes (right). The graph depicts the individual frame times for the first 100 frames.
