

Fragment-Parallel Composite and Filter

Anjul Patney, Stanley Tzeng, and John D. OwensUniversity of California, Davis

Parallelism in Interactive Graphics

• Well expressed in hardware as well as in APIs
• Consistently growing in degree and expression
– More and more cores on upcoming GPUs
– From programmable shaders to programmable pipelines
• We should rethink algorithms to exploit this
• This paper provides one example
– Parallelization of the composite/filter stages

A Feed-Forward Rendering Pipeline

Primitives → Geometry Processing → Rasterization → Composite → Filter → Pixels

Composite & Filter

• Input
– Unordered list of fragments
• Output
– Pixel colors
• Assumption
– No fragments are discarded

[Figure: a pixel and its sample locations]

Basic Idea

• Pixel-parallel mapping onto processors
– Insufficient parallelism
– Irregularity
• Fragment-parallel mapping onto processors

Motivation

• Most applications have low depth complexity
– Pixel-level parallelism is sufficient
• We are interested in applications with
– Very high depth complexity
– High variation in depth complexity
• Further
– Future platforms will demand more parallelism
– High depth complexity can limit pixel-parallelism

Motivation

[Figure: distribution of depth complexity; number of subpixels (log scale, 10 to 1,000,000) vs. number of depth layers (10 to 730)]

Related Work

Order-Independent Transparency (OIT)

• Depth Peeling [Everitt 01]
– One pass per transparent layer
• Stencil-Routed A-buffer [Myers & Bavoil 07]
– One pass per 8 depth layers¹
• Bucket Depth Peeling [Liu et al. 09]
– One pass per up to 32 layers²

¹ Maximum MSAA samples per pixel
² Maximum render targets

Related Work

Order-Independent Transparency (OIT)

• OIT using Direct3D 11 [Gruen et al. 10]
– Uses per-fragment linked lists
– Per-pixel sort and composite
• Hair Self-Shadowing [Sintorn et al. 09]
– Each fragment computes its contribution
– Assumes constant opacity

Related Work

Programmable Rendering Pipelines

• RenderAnts [Zhou et al. 09]
– Sort fragments globally
– Per-pixel composite/filter
• FreePipe [Liu et al. 10]
– Sort fragments globally
– Per-pixel composite/filter

Pixel-Parallel Formulation

[Figure: pixels P(i), P(i+1), P(i+2), each covering subsamples S(j) … S(j+6); thread IDs j … (j+6), one thread per subsample]

P: pixel, S: subsample

Fragment-Parallel Formulation

[Figure: the same pixels and subsamples; thread IDs j … (j+23), one thread per fragment]

P: pixel, S: subsample

Fragment-Parallel Formulation

• How can this behavior be achieved?
• Revisit the composite equation:

Cs = α1·C1 + (1−α1)·[α2·C2 + (1−α2)·(… (αN·CN + (1−αN)·CB) …)]
     (fragment 1)    (fragment 2)         (fragment N)  (background)

Expanding the recursion:

Cs = 1·α1·C1
   + (1−α1)·α2·C2
   + (1−α1)(1−α2)·α3·C3 + …
   + (1−α1)(1−α2)…(1−α(k−1))·αk·Ck + …
   + (1−α1)(1−α2)…(1−αN)·CB

Each term is the product of a global contribution Gk = (1−α1)(1−α2)…(1−α(k−1)) and a local contribution Lk = αk·Ck.
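The expanded form above can be sanity-checked against the recursive form numerically; a minimal Python sketch, with arbitrary illustrative opacities and single-channel colors:

```python
# Check that the recursive "over" composite equals the expanded sum form.
alphas = [0.5, 0.25, 0.8]   # fragment opacities, front to back (illustrative)
colors = [1.0, 0.5, 0.2]    # fragment colors, single channel (illustrative)
c_bg = 0.1                  # background color CB

# Recursive form: composite back to front.
c_rec = c_bg
for a, c in zip(reversed(alphas), reversed(colors)):
    c_rec = a * c + (1 - a) * c_rec

# Expanded form: sum of Gk * Lk terms plus the background term.
c_exp = 0.0
g = 1.0                     # running product (1-α1)...(1-α(k-1))
for a, c in zip(alphas, colors):
    c_exp += g * (a * c)    # Gk * Lk
    g *= 1 - a
c_exp += g * c_bg           # (1-α1)...(1-αN) * CB

assert abs(c_rec - c_exp) < 1e-12   # both ≈ 0.63
```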

Fragment-Parallel Formulation

• Lk is trivially parallel (local computation)
• Gk is the result of a scan operation (product)
• For the list of input fragments
– Compute G[ ] and L[ ], multiply
– Perform a reduction to add subpixel contributions

Cs = G1·L1 + G2·L2 + G3·L3 + … + GN·LN

Gk = (1−α1)·(1−α2)…(1−α(k−1))
Lk = αk·Ck
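Because fragments from many subpixels sit in one flat list, the scan must be segmented: the running product restarts at each subpixel boundary. A serial Python sketch of that primitive (CUDPP supplies the data-parallel version; names here are illustrative):

```python
# Segmented exclusive product scan: restart the running product at each
# segment head (one segment per subpixel sample).
def segmented_exclusive_product(values, head_flags):
    out, acc = [], 1.0
    for v, head in zip(values, head_flags):
        if head:
            acc = 1.0            # new subpixel segment: G1 = 1
        out.append(acc)
        acc *= v
    return out

# Two subpixels' fragments, already depth-sorted within each segment.
one_minus_alpha = [0.5, 0.75, 0.2, 0.6, 0.9]
heads           = [1,   0,    0,   1,   0]
G = segmented_exclusive_product(one_minus_alpha, heads)
# G == [1.0, 0.5, 0.375, 1.0, 0.6]
```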

Fragment-Parallel Formulation

• Filter, for every pixel:

Cp = Cs1·κ1 + Cs2·κ2 + … + CsM·κM

• This can be expressed as another reduction
– After multiplying with the subpixel weights κm
– Can be merged with the previous reduction

Fragment-Parallel Composite & Filter

Final Algorithm

1. Two-key sort on (subpixel ID, depth)
2. Segmented scan (obtain Gk)
3. Premultiply with weights (Lk, κm)
4. Segmented reduction
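A serial Python sketch of the four steps for a single pixel (function and parameter names are my own; the paper's implementation runs these steps data-parallel via CUDPP's sort, segmented scan, and segmented reduce):

```python
def composite_filter(fragments, kappa, bg_color):
    # fragments: unordered (subpixel_id, depth, alpha, color) tuples.
    # kappa[s]: filter weight of subpixel s; bg_color: background color CB.
    n = len(kappa)
    # 1. Two-key sort on (subpixel ID, depth), front to back.
    frags = sorted(fragments, key=lambda f: (f[0], f[1]))
    c_s = [0.0] * n          # per-subpixel composited color Cs
    g = [1.0] * n            # running global contribution Gk per segment
    # 2 + 3. Segmented scan of (1 - alpha), fused with the premultiply
    # by the local contribution Lk = alpha * color.
    for sid, _, alpha, color in frags:
        c_s[sid] += g[sid] * alpha * color
        g[sid] *= 1.0 - alpha
    # Background term: (1-α1)...(1-αN) * CB for each subpixel.
    c_s = [c + gk * bg_color for c, gk in zip(c_s, g)]
    # 4. Segmented reduction, fused with the filter weights κ.
    return sum(k * c for k, c in zip(kappa, c_s))

# One pixel with two subpixels, three fragments total (illustrative values):
frags = [(0, 0.2, 0.5, 1.0), (0, 0.7, 0.8, 0.2), (1, 0.3, 0.25, 0.5)]
pixel = composite_filter(frags, kappa=[0.5, 0.5], bg_color=0.1)  # ≈ 0.395
```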

Fragment-Parallel Formulation

[Figure: pixels P(i), P(i+1), P(i+2) and their subsamples; a segmented scan (product) runs over each subsample's fragments, followed by a segmented reduction (sum)]

P: pixel, S: subsample

Implementation

• Hardware used: NVIDIA GeForce GTX 280
• We require fast segmented scan and reduce
– The CUDPP library provides both
– Restricts the implementation to NVIDIA CUDA
• No direct access to the hardware rasterizer
– We wrote our own

Example System – Polygons

• Applications
– Games
• Depth complexity
– 1 to a few tens of layers
– Suited to pixel-parallel
• Fragment-parallel software rasterizer

Example System – Particles

• Applications
– Simulations, games
• Depth complexity
– Hundreds of layers
– High depth variance
• Particle-parallel sprite rasterizer

Example System – Volumes

• Applications
– Scientific visualization
• Depth complexity
– Tens to hundreds of layers
– Low depth variance
• Major-axis-slice rasterizer

Example System – Reyes

• Applications
– Offline rendering
• Depth complexity
– Tens of layers
– Moderate depth variance
• Data-parallel micropolygon rasterizer

Performance Results

[Figure: rendering time (ms, 0 to 600) for the Particles, Volume, Reyes (grass), and Polygon scenes, broken into fragment generation, pixel-parallel composite/filter, and fragment-parallel composite/filter]

Performance Variation

[Figure: fragments per second (1.00E+05 to 1.00E+08, log scale) vs. depth complexity (0 to 1600), for the fragment-parallel and pixel-parallel formulations]

Limitations

• Increased memory traffic
– Several passes through CUDPP primitives
• Unclear how to optimize for special cases
– Threshold opacity
– Threshold depth complexity

Summary and Conclusion

• Parallel formulation of the composite equation
– Maps well to known primitives
– Can be integrated with the filter
– Consistent performance across varying workloads
• FPC is applicable to future rendering pipelines
– Exploits a higher degree of parallelism
– Better matched to the size of the rendering workload
• A tool for building programmable pipelines

Future Work

• Performance
– Reduction in memory traffic
– Extension to special-case scenes
– Hybrid PPC/FPC formulations
• Applications
– Integration with a hardware rasterizer
– Cinematic rendering, Photoshop

Acknowledgments

• NSF Award 0541448
• SciDAC Institute for Ultrascale Visualization
• NVIDIA Research Fellowship
• Equipment donated by NVIDIA
• Discussions and feedback
– Shubho Sengupta (UC Davis), Matt Pharr (Intel), Aaron Lefohn (Intel), Mike Houston (AMD)
– Anonymous reviewers
• Implementation assistance
– Jeff Stuart, Shubho Sengupta

Thanks!
