height field ambient occlusion using cudawili.cc/research/presentations/utu2009.pdf · introduction...

IntroductionMethod

Data managementOcclusion computation

Results

Height field ambient occlusion using CUDA

3.6.2009

IntroductionMethod

Results

Outline

1 Introduction

2 MethodIdeaImplementation

3 Data managementNaive solutionsPerformance improvementsPreblend kernel

4 Occlusion computationTheoryKernel

5 Results

IntroductionMethod

Results

Height fields

IntroductionMethod

Results

Self occlusion

IntroductionMethod

Results

Current methods

Marching severaldirections from eachfragment

Sampling several timesalong a direction

Optimizations based onthis approach

IntroductionMethod

Results

IdeaImplementation

Outline

1 Introduction

5 Results

IntroductionMethod

Results

IdeaImplementation

Sweeps

We process several (e.g.64) directions (sweeps)

For a height field of n2

(e.g. 1024x1024) a sweepconsists of n · · ·

√2n lines

One direction = infinitelyhigh but thin light source

IntroductionMethod

Results

IdeaImplementation

Spatial coherence

Each line steps one texelat a time

Keeps a record of previousoccluders

Line-wise coherence

Convex hull subset ofoccluders is enough

3% of occluders requiredin practice

Output: occlusionvector/value on each step

Results for each directionblended together

IntroductionMethod

Results

IdeaImplementation

Implications

Previously bandwidth limited

New method samples significantly less

Bound to discrete directions, but ideally takes each pixel onthem into account

Does not map to shaders well ⇒ GPGPU

IntroductionMethod

Results

IdeaImplementation

Outline

1 Introduction

5 Results

IntroductionMethod

Results

IdeaImplementation

Environment

OpenGL 3.0 (Aug 2008)

CUDA 2.2 (May 2009)

Linux (primary), Windows

Software (CPU) and hardware (GPU) implementations

IntroductionMethod

Results

IdeaImplementation

Tesla architecture

Global (device) memoryon-board, various accessmodes

Shared memory on-chip,for each MP

64kB, 16 banks,interleaved 32b

30 MPs on GTX280, 8ALUs on a MP

IntroductionMethod

Results

Naive solutionsPerformance improvementsPreblend kernel

Outline

1 Introduction

5 Results

IntroductionMethod

Results

OpenGL interoperability

IntroductionMethod

Results

OpenGL interoperability

Too much data copying

Not enough threads

We can transfer more sweeps at once to remedy the latter

IntroductionMethod

Results

Blending in CUDA

Write into multiple dests as before

Blend using CUDA

⇒ still memcpies, sampling slower in CUDA than inOpenGL

IntroductionMethod

Results

Outline

1 Introduction

5 Results

IntroductionMethod

Results

Sample src PBO with CUDA

Reduces data passed between OpenGL and CUDA

Need to copy src PBO to CudaArray

CUDA provides 2D bilinear filtering (read-only)

IntroductionMethod

Results

Preblending in CUDA

Cannot write directly,atomic adds are slow

180◦ rotation is trivial

90◦/270◦ needs sharedmem tricks for memcoalescing

IntroductionMethod

Results

Packing texels

CUDA memory coalescingonly works forthread-consecutive 32belements

But we have only singleluminance value for a texel

This calls for bit operations

Heaviest on preblendkernels, which luckily havelow arithmetic density

IntroductionMethod

Results

Outline

1 Introduction

5 Results

IntroductionMethod

Results

TheoryKernel

Outline

1 Introduction

5 Results

IntroductionMethod

Results

TheoryKernel

Function of the kernel

Processes one line, reads height values as input

Writes aligned packed values

Keeps a representation of the convex hull

Outputs an occlusion vector

Coalescing vs. length uniformity

Thread block size and shared memory resources

IntroductionMethod

Results

TheoryKernel

Occluder processing

Keep a vector of occludersin shared mem

Search it backwards(linear/binary)

Use heuristic comparisonoperator

Insert new occluder

Remove remaining(change length indicator)

Return last − current

IntroductionMethod

Results

TheoryKernel

Outline

1 Introduction

5 Results

IntroductionMethod

Results

Performance figures

Profilers are immature

But we know that we get 50-80 GB/s device memory rates

And 200-400 Gops/s

IntroductionMethod

Results

Comparison

M$ published previous state-of-the-art method inEurographics 2008

It scales badly, 1024x1024 @ 2.5fps, 32 directions

Our method achieves 36-42fps, 64 directions

Is texel-precise on sharp edges

IntroductionMethod

Results

Screen captures

height field ambient occlusion using cudawili.cc/research/presentations/utu2009.pdf · introduction...

Documents

wilf lalonde ©2012 comp 4501 95.4501 ambient occlusion

dynamic ambient occlusion and indirect...

the perceptual influence of approximating ambient occlusion...

high-quality screen-space ambient occlusion using temporal

gmax ambient occlusion for gmax using...

multi-resolution screen-space ambient...

fast precomputed ambient occlusion for proximity shadows

ambient occlusion for particles

screen space ambient occlusion for virtual and mixed reality...

hybrid ambient occlusion - telecom paris

realistic ambient occlusion in real-timeuffe/xjobb/joakim...

gaze-dependent ambient occlusion

program manager - devblogs.microsoft.com · ambient...

multi-resolution screen-space ambient...

neural network ambient occlusion - daniel...

voxel-space ambient occlusion - mathematical sciences home

dynamic ambient occlusion and indirect...

state of the art report on ambient occlusion...state of the...

volumetric ambient occlusion

screen-space ambient occlusion by unsharp masking of the...