

Fragment-Parallel Composite and Filter

Anjul Patney, Stanley Tzeng, and John D. OwensUniversity of California, Davis

Parallelism in Interactive Graphics

• Well expressed in hardware as well as in APIs
• Consistently growing in degree and expression
– More and more cores on upcoming GPUs
– From programmable shaders to programmable pipelines
• We should rethink algorithms to exploit this
• This paper provides one example
– Parallelization of the composite/filter stages

A Feed-Forward Rendering Pipeline

Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels

Composite & Filter

• Input
– Unordered list of fragments
• Output
– Pixel colors
• Assumption
– No fragments are discarded

[Figure: a pixel and its sample locations]

Basic Idea

• Pixel-parallel mapping onto processors
– Insufficient parallelism
– Irregularity
• Fragment-parallel mapping onto processors

Motivation

• Most applications have low depth complexity
– Pixel-level parallelism is sufficient
• We are interested in applications with
– Very high depth complexity
– High variation in depth complexity
• Further
– Future platforms will demand more parallelism
– High depth complexity can limit pixel-parallelism

Motivation

[Figure: distribution of depth complexity; number of subpixels (log scale, 10 to 1,000,000) vs. number of depth layers (10 to 730)]

Related Work

Order-Independent Transparency (OIT)

• Depth Peeling [Everitt 01]
– One pass per transparent layer
• Stencil-Routed A-buffer [Myers & Bavoil 07]
– One pass per 8 depth layers¹
• Bucket Depth Peeling [Liu et al. 09]
– One pass per up to 32 layers²

¹ Maximum MSAA samples per pixel
² Maximum render targets

Related Work

Order-Independent Transparency (OIT)

• OIT using Direct3D 11 [Gruen et al. 10]
– Uses per-fragment linked lists
– Per-pixel sort and composite
• Hair Self-Shadowing [Sintorn et al. 09]
– Each fragment computes its contribution
– Assumes constant opacity

Related Work

Programmable Rendering Pipelines

• RenderAnts [Zhou et al. 09]
– Sort fragments globally
– Per-pixel composite/filter
• FreePipe [Liu et al. 10]
– Sort fragments globally
– Per-pixel composite/filter

Pixel-Parallel Formulation

[Figure: pixels P(i), P(i+1), P(i+2), each covering subsamples S(j) … S(j+6); thread IDs j … (j+6), one thread per subsample]

P: pixel, S: subsample

Fragment-Parallel Formulation

[Figure: the same pixels and subsamples; thread IDs j … (j+23), one thread per fragment]

P: pixel, S: subsample

Fragment-Parallel Formulation

• How can this behavior be achieved?
• Revisit the composite equation:

Cs = α1·C1 + (1−α1)·[α2·C2 + (1−α2)·(… (αN·CN + (1−αN)·CB) …)]
     (fragment 1)    (fragment 2)         (fragment N)  (background)

Expanding the recursion:

Cs = 1·α1·C1
   + (1−α1)·α2·C2
   + (1−α1)(1−α2)·α3·C3 + …
   + (1−α1)(1−α2)…(1−α(k−1))·αk·Ck + …
   + (1−α1)(1−α2)…(1−αN)·CB

Each term is the product of a global contribution Gk = (1−α1)(1−α2)…(1−α(k−1)) and a local contribution Lk = αk·Ck.
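The expanded form above can be sanity-checked against the recursive form numerically; a minimal Python sketch, with arbitrary illustrative opacities and single-channel colors:

```python
# Check that the recursive "over" composite equals the expanded sum form.
alphas = [0.5, 0.25, 0.8]   # fragment opacities, front to back (illustrative)
colors = [1.0, 0.5, 0.2]    # fragment colors, single channel (illustrative)
c_bg = 0.1                  # background color CB

# Recursive form: composite back to front.
c_rec = c_bg
for a, c in zip(reversed(alphas), reversed(colors)):
    c_rec = a * c + (1 - a) * c_rec

# Expanded form: sum of Gk * Lk terms plus the background term.
c_exp = 0.0
g = 1.0                     # running product (1-α1)...(1-α(k-1))
for a, c in zip(alphas, colors):
    c_exp += g * (a * c)    # Gk * Lk
    g *= 1 - a
c_exp += g * c_bg           # (1-α1)...(1-αN) * CB

assert abs(c_rec - c_exp) < 1e-12   # both ≈ 0.63
```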

Fragment-Parallel Formulation

• Lk is trivially parallel (local computation)
• Gk is the result of a scan operation (product)
• For the list of input fragments
– Compute G[ ] and L[ ], multiply
– Perform a reduction to add subpixel contributions

Cs = G1·L1 + G2·L2 + G3·L3 + … + GN·LN

Gk = (1−α1)·(1−α2)…(1−α(k−1))
Lk = αk·Ck
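Because fragments from many subpixels sit in one flat list, the scan must be segmented: the running product restarts at each subpixel boundary. A serial Python sketch of that primitive (CUDPP supplies the data-parallel version; names here are illustrative):

```python
# Segmented exclusive product scan: restart the running product at each
# segment head (one segment per subpixel sample).
def segmented_exclusive_product(values, head_flags):
    out, acc = [], 1.0
    for v, head in zip(values, head_flags):
        if head:
            acc = 1.0            # new subpixel segment: G1 = 1
        out.append(acc)
        acc *= v
    return out

# Two subpixels' fragments, already depth-sorted within each segment.
one_minus_alpha = [0.5, 0.75, 0.2, 0.6, 0.9]
heads           = [1,   0,    0,   1,   0]
G = segmented_exclusive_product(one_minus_alpha, heads)
# G == [1.0, 0.5, 0.375, 1.0, 0.6]
```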

Fragment-Parallel Formulation

• Filter, for every pixel:

Cp = Cs1·κ1 + Cs2·κ2 + … + CsM·κM

• This can be expressed as another reduction
– After multiplying with the subpixel weights κm
– Can be merged with the previous reduction

Fragment-Parallel Composite & Filter

Final Algorithm

1. Two-key sort on (subpixel ID, depth)
2. Segmented scan (obtain Gk)
3. Premultiply with weights (Lk, κm)
4. Segmented reduction
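A serial Python sketch of the four steps for a single pixel (function and parameter names are my own; the paper's implementation runs these steps data-parallel via CUDPP's sort, segmented scan, and segmented reduce):

```python
def composite_filter(fragments, kappa, bg_color):
    # fragments: unordered (subpixel_id, depth, alpha, color) tuples.
    # kappa[s]: filter weight of subpixel s; bg_color: background color CB.
    n = len(kappa)
    # 1. Two-key sort on (subpixel ID, depth), front to back.
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))
    c_s = [0.0] * n          # per-subpixel composited color Cs
    g = [1.0] * n            # running global contribution Gk per segment
    # 2 + 3. Segmented scan of (1 - alpha), fused with the premultiply
    # by the local contribution Lk = alpha * color.
    for sid, _, alpha, color in frags:
        c_s[sid] += g[sid] * alpha * color
        g[sid] *= 1.0 - alpha
    # Background term: (1-α1)...(1-αN) * CB for each subpixel.
    c_s = [c + gk * bg_color for c, gk in zip(c_s, g)]
    # 4. Segmented reduction, fused with the filter weights κ.
    return sum(k * c for k, c in zip(kappa, c_s))

# One pixel with two subpixels, three fragments total (illustrative values):
frags = [(0, 0.2, 0.5, 1.0), (0, 0.7, 0.8, 0.2), (1, 0.3, 0.25, 0.5)]
pixel = composite_filter(frags, kappa=[0.5, 0.5], bg_color=0.1)  # ≈ 0.395
```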

Fragment-Parallel Formulation

[Figure: pixels P(i), P(i+1), P(i+2) and their subsamples; a segmented scan (product) runs over each subsample's fragments, followed by a segmented reduction (sum)]

P: pixel, S: subsample

Implementation

• Hardware used: NVIDIA GeForce GTX 280
• We require fast segmented scan and reduce
– The CUDPP library provides both
– Restricts the implementation to NVIDIA CUDA
• No direct access to the hardware rasterizer
– We wrote our own

Example System – Polygons

• Applications
– Games
• Depth complexity
– 1 to a few tens of layers
– Suited to pixel-parallel
• Fragment-parallel software rasterizer

Example System – Particles

• Applications
– Simulations, games
• Depth complexity
– Hundreds of layers
– High depth variance
• Particle-parallel sprite rasterizer

Example System – Volumes

• Applications
– Scientific visualization
• Depth complexity
– Tens to hundreds of layers
– Low depth variance
• Major-axis-slice rasterizer

Example System – Reyes

• Applications
– Offline rendering
• Depth complexity
– Tens of layers
– Moderate depth variance
• Data-parallel micropolygon rasterizer

Performance Results

[Figure: rendering time (ms, 0 to 600) for the Particles, Volume, Reyes (grass), and Polygon scenes, broken into fragment generation, pixel-parallel composite/filter, and fragment-parallel composite/filter]

Performance Variation

[Figure: fragments per second (1.00E+05 to 1.00E+08, log scale) vs. depth complexity (0 to 1600), for the fragment-parallel and pixel-parallel formulations]

Limitations

• Increased memory traffic
– Several passes through CUDPP primitives
• Unclear how to optimize for special cases
– Threshold opacity
– Threshold depth complexity

Summary and Conclusion

• Parallel formulation of the composite equation
– Maps well to known primitives
– Can be integrated with the filter
– Consistent performance across varying workloads
• FPC is applicable to future rendering pipelines
– Exploits a higher degree of parallelism
– Better matched to the size of the rendering workload
• A tool for building programmable pipelines

Future Work

• Performance
– Reduction in memory traffic
– Extension to special-case scenes
– Hybrid PPC/FPC formulations
• Applications
– Integration with a hardware rasterizer
– Cinematic rendering, Photoshop

Acknowledgments

• NSF Award 0541448
• SciDAC Institute for Ultrascale Visualization
• NVIDIA Research Fellowship
• Equipment donated by NVIDIA
• Discussions and feedback
– Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD)
– Anonymous reviewers
• Implementation assistance
– Jeff Stuart, Shubho Sengupta

Thanks!
