
Height field ambient occlusion using CUDA

3.6.2009

Outline

1 Introduction

2 Method: Idea, Implementation

3 Data management: Naive solutions, Performance improvements, Preblend kernel

4 Occlusion computation: Theory, Kernel

5 Results

Height fields

Self occlusion

Current methods

Marching in several directions from each fragment

Sampling several times along a direction

Optimizations based on this approach

Sweeps

We process several (e.g. 64) directions (sweeps)

For a height field of n² texels (e.g. 1024x1024), a sweep consists of between n and √2·n lines

One direction = an infinitely high but thin light source
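
The line count per sweep can be checked with a short calculation. The sketch below assumes that neighbouring lines are one texel apart, measured perpendicular to the sweep direction, so a direction at angle θ needs about n(|cos θ| + |sin θ|) lines. The function name and the spacing convention are illustrative assumptions, not taken from the slides.

```python
import math

def lines_per_sweep(n, theta):
    """Approximate number of parallel lines needed to cover an n x n
    height field when sweeping in direction theta (radians).

    Assumption: neighbouring lines are one texel apart, measured
    perpendicular to the sweep direction.
    """
    # Width of the square's projection onto the axis perpendicular to theta.
    return n * (abs(math.cos(theta)) + abs(math.sin(theta)))

n = 1024
directions = 64
counts = [lines_per_sweep(n, 2 * math.pi * k / directions)
          for k in range(directions)]

# Axis-aligned sweeps need the fewest lines, diagonal sweeps the most.
print(round(min(counts)), round(max(counts)))
```

Under this assumption the axis-aligned sweeps need n lines and the diagonal sweeps need √2·n lines, which matches the range given above.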

Spatial coherence

Each line steps one texel at a time

Keeps a record of previous occluders

Line-wise coherence

Convex hull subset of occluders is enough

3% of occluders required in practice

Output: occlusion vector/value on each step

Results for each direction blended together

Implications

Previously bandwidth limited

New method samples significantly less

Bound to discrete directions, but ideally takes each pixel on them into account

Does not map to shaders well ⇒ GPGPU

Environment

OpenGL 3.0 (Aug 2008)

CUDA 2.2 (May 2009)

Linux (primary), Windows

Software (CPU) and hardware (GPU) implementations

Tesla architecture

Global (device) memory on-board, various access modes

Shared memory on-chip, for each MP

16 kB, 16 banks, interleaved 32b

30 MPs on GTX 280, 8 ALUs per MP

OpenGL interoperability

Too much data copying

Not enough threads

We can transfer more sweeps at once to remedy the latter

Blending in CUDA

Write into multiple destinations as before

Blend using CUDA

⇒ still requires memcpys, and sampling is slower in CUDA than in OpenGL

Sample src PBO with CUDA

Reduces data passed between OpenGL and CUDA

Need to copy src PBO to CudaArray

CUDA provides 2D bilinear filtering (read-only)

Preblending in CUDA

Cannot write directly; atomic adds are slow

180° rotation is trivial

90°/270° rotation needs shared memory tricks for memory coalescing
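
Why the rotations differ can be seen from the index mapping. In a 180° rotation, a run of consecutive source texels maps to a run of consecutive destination texels (in reverse order), whereas in a 90° rotation a source row becomes a destination column. A common remedy is to stage a square tile, rotate it locally, and write it out row by row. The Python sketch below emulates that idea on the CPU. The tile size of 16 and all function names are assumptions for illustration, and the list `tile` merely stands in for shared memory; it is not the kernel from the presentation.

```python
TILE = 16  # assumed tile edge; stands in for a 16x16 thread block

def rotate180(src):
    """Rotate a square row-major image by 180 degrees.

    Linear index i maps to len(src) - 1 - i, so consecutive reads
    give consecutive (reversed) writes.
    """
    return src[::-1]

def rotate90_tiled(src, n):
    """Rotate an n x n row-major image 90 degrees clockwise, tile by tile.

    Each tile is read row by row into a small buffer (the stand-in for
    shared memory) and written out row by row at its rotated position,
    so both reads and writes touch contiguous runs of texels.
    """
    assert n % TILE == 0
    dst = [0] * (n * n)
    for ty in range(0, n, TILE):
        for tx in range(0, n, TILE):
            # Contiguous row reads from the source tile.
            tile = [src[(ty + y) * n + tx:(ty + y) * n + tx + TILE]
                    for y in range(TILE)]
            # Clockwise: source (y, x) lands at destination (x, n - 1 - y),
            # so the tile's first destination column is n - TILE - ty.
            dx0 = n - TILE - ty
            for r in range(TILE):  # destination row within the tile
                row = [tile[TILE - 1 - c][r] for c in range(TILE)]
                base = (tx + r) * n + dx0
                dst[base:base + TILE] = row  # contiguous row write
    return dst
```

Applying `rotate90_tiled` twice gives the same result as `rotate180`, which is a convenient sanity check for the index arithmetic.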

Packing texels

CUDA memory coalescing only works for thread-consecutive 32b elements

But we have only a single luminance value per texel

This calls for bit operations

Heaviest on the preblend kernels, which luckily have low arithmetic density
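
A minimal sketch of the packing idea, assuming 8-bit luminance and four texels per 32-bit word with the first texel in the lowest byte. The slides do not state the bit depth or the layout, so both are assumptions, as are the function names.

```python
def pack4(t0, t1, t2, t3):
    """Pack four 8-bit luminance values into one 32-bit word (t0 lowest)."""
    for t in (t0, t1, t2, t3):
        assert 0 <= t <= 0xFF
    return t0 | (t1 << 8) | (t2 << 16) | (t3 << 24)

def unpack4(word):
    """Split a 32-bit word back into its four 8-bit texels."""
    return tuple((word >> shift) & 0xFF for shift in (0, 8, 16, 24))

def pack_row(texels):
    """Pack a row of 8-bit texels (length divisible by 4) into 32-bit words."""
    assert len(texels) % 4 == 0
    return [pack4(*texels[i:i + 4]) for i in range(0, len(texels), 4)]
```

With such a layout each thread would read or write one 32-bit word and handle four texels at a time, at the cost of the extra shifts and masks.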

Function of the kernel

Processes one line, reads height values as input

Writes aligned packed values

Keeps a representation of the convex hull

Outputs an occlusion vector

Coalescing vs. length uniformity

Thread block size and shared memory resources

Occluder processing

Keep a vector of occluders in shared mem

Search it backwards (linear/binary)

Use heuristic comparison operator

Insert new occluder

Remove remaining (change length indicator)

Return last − current
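
The steps above can be sketched on the CPU as a single pass over one line of height samples. This is a minimal illustration, not the kernel from the presentation: it uses an exact cross-product test where the slides mention a heuristic comparison operator (which is not specified), it searches linearly rather than with a binary search, and a preallocated Python list stands in for the shared-memory vector. The function name `sweep_line` is hypothetical.

```python
def sweep_line(heights):
    """Walk one line of height samples and return, for every texel, the
    vector (dx, dh) from the texel to its horizon occluder behind it.

    The first texel has no occluder and gets (0, 0.0).
    """
    hull = [None] * len(heights)  # fixed-size occluder vector
    length = 0                    # "length indicator": valid hull entries
    result = []
    for x, h in enumerate(heights):
        # Search backwards: drop the last occluder while it lies on or
        # below the segment from the one before it to the current texel.
        while length >= 2:
            (x1, h1), (x2, h2) = hull[length - 2], hull[length - 1]
            cross = (x2 - x1) * (h - h1) - (h2 - h1) * (x - x1)
            if cross < 0:         # strict right turn: last occluder stays
                break
            length -= 1           # remove by shrinking the length only
        if length == 0:
            result.append((0, 0.0))
        else:
            lx, lh = hull[length - 1]
            result.append((lx - x, lh - h))  # last - current
        hull[length] = (x, h)     # insert the new occluder
        length += 1
    return result

# Example line: a tall peak at x = 1 dominates most later texels.
print(sweep_line([0, 5, 1, 1, 3, 0]))
```

The occluder left at the end of the vector after the backward search is the tangent point of the upper convex hull as seen from the current texel, which is why no other previous texel can rise above it on the horizon.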

Performance figures

Profilers are immature

But we know that we get 50-80 GB/s device memory rates

And 200-400 Gops/s

Comparison

Microsoft published the previous state-of-the-art method in Eurographics 2008

It scales badly: 1024x1024 at 2.5 fps with 32 directions

Our method achieves 36-42 fps with 64 directions

Our method is texel-precise on sharp edges

Screen captures
