Rendering the breeze - Khronos Group · 2014-04-08


© Copyright Khronos Group, 2010 - Page 1

Rendering the breeze

Yaki Tebeka (Graphic Remedy), Benedict R. Gaster (AMD)


Acknowledgements and contact info

• Rendering

- Min-Yu Huang (UC Davis)

• Cloth Simulation

- Lee Howes

© 2004 – 2010 Graphic Remedy. All Rights Reserved

www.amd.com

[email protected]

www.gremedy.com

[email protected]


Bullet physics

• An open source physics SDK for games (http://bulletphysics.org)

- Popular physics SDK

- Zlib license for copy-and-use openness

- Development led by Erwin Coumans of Sony

• Includes

- Rigid body dynamics, e.g. ragdolls, destruction, and vehicles

- Soft body dynamics, e.g. cloth, rope, and deformable volumes

• AMD collaborating on GPU acceleration

- Cloth/soft body and fluids in OpenCL and DirectCompute

- Fully open-source contributions


OpenCL physics is…

Fluids

Cloth

Rigid bodies


Introducing cloth simulation

• A subset of the possible set of soft bodies

• For real-time use, generally based on a mass/spring system

- Large collection of masses (particles)

- Connected using spring constraints

- The layout and properties of the springs determine the properties of the cloth


Springs and masses

• Three main types of springs

- Structural

- Shearing

- Bending
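The three spring types can be illustrated on a regular grid of particles. The sketch below is a minimal C++ example under stated assumptions: the slides only name the spring types, so the definitions used here (structural springs join direct neighbours, shearing springs join diagonal neighbours, bending springs skip one particle) are the conventional ones rather than something the deck spells out, and all names (`Spring`, `SpringKind`, `buildGridSprings`, `countKind`) are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper types; none of these names come from the slides.
enum class SpringKind { Structural, Shearing, Bending };

struct Spring {
    int a;            // index of the first particle
    int b;            // index of the second particle
    SpringKind kind;
};

// Build the springs for a width x height grid of particles stored row-major.
std::vector<Spring> buildGridSprings(int width, int height)
{
    assert(width >= 1 && height >= 1);
    std::vector<Spring> springs;
    auto index = [width](int x, int y) { return y * width + x; };
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Structural: direct horizontal and vertical neighbours.
            if (x + 1 < width)
                springs.push_back({index(x, y), index(x + 1, y), SpringKind::Structural});
            if (y + 1 < height)
                springs.push_back({index(x, y), index(x, y + 1), SpringKind::Structural});
            // Shearing: both diagonals of each grid cell.
            if (x + 1 < width && y + 1 < height) {
                springs.push_back({index(x, y), index(x + 1, y + 1), SpringKind::Shearing});
                springs.push_back({index(x + 1, y), index(x, y + 1), SpringKind::Shearing});
            }
            // Bending: skip one particle horizontally and vertically.
            if (x + 2 < width)
                springs.push_back({index(x, y), index(x + 2, y), SpringKind::Bending});
            if (y + 2 < height)
                springs.push_back({index(x, y), index(x, y + 2), SpringKind::Bending});
        }
    }
    return springs;
}

// Count the springs of one kind, e.g. to sanity-check a generated mesh.
int countKind(const std::vector<Spring>& springs, SpringKind kind)
{
    int n = 0;
    for (const Spring& s : springs)
        if (s.kind == kind) ++n;
    return n;
}
```

Changing which of these spring sets is generated, or how stiff each set is, is one way the "layout and properties" mentioned earlier alter how the cloth behaves.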



Parallelism

• Large number of particles

- Appropriate for parallel processing

- Force from each spring constraint applied to both connected particles

[Figure: original layout and current layout of a spring between two masses, marking the rest length of the spring. Labels: compute forces as stretch from rest length; apply position corrections to masses; compute new positions.]


Parallelism

• For each simulation iteration:

- Compute forces in each link based on its length

- Correct positions of masses/vertices from forces

- Compute new vertex positions


CPU approach to simulation

• Iterative integration over vertex positions

- For each spring computes a force.

- Updates both vertices with a new position.

- Repeat n times where n is configurable.

• Note that the computation is serial

- Propagation of values through the solver is immediate.


The CPU approach

for each iteration
{
    for(int linkIndex = 0; linkIndex < numLinks; ++linkIndex)
    {
        // del is the vector from vertex 0 to vertex 1 of this link
        float3 del = vertexPosition1 - vertexPosition0;
        float lengthSquared = dot(del, del);
        float massLSC =
            (inverseMass0 + inverseMass1)/linearStiffnessCoefficient;
        float k = ((restLengthSquared - lengthSquared) /
            (massLSC * (restLengthSquared + lengthSquared)));
        vertexPosition0 -= del * (k*inverseMass0);
        vertexPosition1 += del * (k*inverseMass1);
    }

}
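The loop above is pseudocode, so here is a self-contained C++ sketch of the same serial update that can be compiled and experimented with. It assumes the per-link quantities are looked up from arrays; the `Vec3` and `Link` types and the function name `solveLinksSerial` are hypothetical, and the sketch assumes every link has a non-zero rest length.

```cpp
#include <cassert>
#include <vector>

// Hypothetical minimal vector type standing in for float3.
struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

struct Link { int v0; int v1; float restLengthSquared; };

// Serial solver: every update is visible to the very next link that is processed.
void solveLinksSerial(std::vector<Vec3>& positions,
                      const std::vector<float>& inverseMass,
                      const std::vector<Link>& links,
                      float linearStiffnessCoefficient,
                      int iterations)
{
    assert(positions.size() == inverseMass.size());
    assert(linearStiffnessCoefficient > 0.0f);
    for (int it = 0; it < iterations; ++it) {
        for (const Link& link : links) {
            assert(link.restLengthSquared > 0.0f);
            const float inverseMass0 = inverseMass[link.v0];
            const float inverseMass1 = inverseMass[link.v1];
            if (inverseMass0 + inverseMass1 <= 0.0f) continue; // both ends pinned
            const float massLSC =
                (inverseMass0 + inverseMass1) / linearStiffnessCoefficient;
            const Vec3 del = positions[link.v1] - positions[link.v0];
            const float lengthSquared = dot(del, del);
            const float k = (link.restLengthSquared - lengthSquared) /
                            (massLSC * (link.restLengthSquared + lengthSquared));
            positions[link.v0] = positions[link.v0] - del * (k * inverseMass0);
            positions[link.v1] = positions[link.v1] + del * (k * inverseMass1);
        }
    }
}
```

For a single link between two unit-mass particles placed 2 units apart with a rest length of 1, a few iterations pull the pair back to a separation of about 1 while leaving their midpoint where it was.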


GPU parallelism

• The CPU implementation was serial.

- No atomicity issues.

- Value propagation immediate from a given update.

• The GPU implementation is parallel within a cloth.

- Multiple updates to the same node create races.


Vertex solver: a single batch

• Single batch

- Highly efficient per solver iteration

- Can re-use position data for central node

- Need to cleverly arrange data to allow efficient loop unrolling

• Double buffer the vertex data

- Updates will then not be seen until the next iteration

• Write to same buffer

- Updates can be seen quickly, but the computation will be non-deterministic
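The double-buffered variant of the vertex solver can be sketched on the CPU as follows. This is an illustrative C++ reading of the slide, not the presenters' code: each vertex gathers the corrections from all of its links, reading only from an input buffer and writing only to an output buffer, so the result is deterministic but new values are not seen until the next iteration. The `Neighbor` adjacency layout and all names are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical minimal vector type standing in for float3.
struct Vec3 { float x, y, z; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
inline Vec3 operator-(Vec3 a, Vec3 b) { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
inline Vec3 operator*(Vec3 a, float s) { return {a.x * s, a.y * s, a.z * s}; }
inline float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

// One entry per link attached to a vertex (assumed layout).
struct Neighbor { int other; float restLengthSquared; };

// One solver pass: reads `in`, writes `out`. Each loop body is independent,
// so on a GPU every vertex could be handled by its own work item.
void solveVerticesDoubleBuffered(const std::vector<Vec3>& in,
                                 std::vector<Vec3>& out,
                                 const std::vector<float>& inverseMass,
                                 const std::vector<std::vector<Neighbor>>& adjacency,
                                 float linearStiffnessCoefficient)
{
    assert(in.size() == out.size() && in.size() == adjacency.size());
    assert(linearStiffnessCoefficient > 0.0f);
    for (std::size_t v = 0; v < in.size(); ++v) {
        const Vec3 p = in[v];            // position of the central node, re-used
        Vec3 correction = {0.0f, 0.0f, 0.0f};
        for (const Neighbor& n : adjacency[v]) {
            const float massSum = inverseMass[v] + inverseMass[n.other];
            if (massSum <= 0.0f) continue; // both ends pinned
            const float massLSC = massSum / linearStiffnessCoefficient;
            const Vec3 del = in[n.other] - p;
            const float lengthSquared = dot(del, del);
            const float k = (n.restLengthSquared - lengthSquared) /
                            (massLSC * (n.restLengthSquared + lengthSquared));
            correction = correction - del * (k * inverseMass[v]);
        }
        out[v] = p + correction;
    }
}
```

After each pass the caller would swap the two buffers. Writing to the same buffer instead removes the swap and lets updates propagate sooner, at the cost of the non-determinism noted above.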


A branch divergence optimization

• The GPU is a vector machine: a collection of wide SIMD cores

- Divergent branches across a vector hurt performance

- Nodes have different degrees

- Regular mesh: low overhead, similar degree throughout

- Complicated mesh: arbitrary degree, numerous peaks

- Pack vertices by degree
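Packing vertices by degree could look like the following C++ sketch, which computes each vertex's degree from the link list and then produces a processing order with the highest-degree vertices first, so that neighbouring SIMD lanes run loops of similar length. The function names and the choice of a stable descending sort are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <utility>
#include <vector>

// Degree of a vertex = number of links attached to it.
std::vector<int> vertexDegrees(int numVertices,
                               const std::vector<std::pair<int, int>>& links)
{
    std::vector<int> degree(numVertices, 0);
    for (const auto& link : links) {
        assert(link.first >= 0 && link.first < numVertices);
        assert(link.second >= 0 && link.second < numVertices);
        ++degree[link.first];
        ++degree[link.second];
    }
    return degree;
}

// Returns vertex indices ordered by descending degree. Vertices of equal
// degree keep their original relative order (stable sort).
std::vector<int> orderVerticesByDegree(const std::vector<int>& degree)
{
    std::vector<int> order(degree.size());
    std::iota(order.begin(), order.end(), 0);
    std::stable_sort(order.begin(), order.end(),
                     [&degree](int a, int b) { return degree[a] > degree[b]; });
    return order;
}
```

Consecutive entries of the returned order would then be assigned to consecutive work items, so a vector of work items mostly sees the same loop trip count.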


Unfortunately…

• Our experiments have found that solver convergence is too slow

- Could we rectify this by changing the core solver algorithm?

• Slow convergence is due to:

- Slow propagation of new values

- Errors introduced because the solver does not conserve internal momentum

• Alternatively, let’s look at another approach


Batching the simulation

• Create independent subsets of links through graph coloring.

• Synchronize between batches
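One simple way to build such independent subsets is a greedy edge coloring, sketched below in C++. The slides do not say which coloring algorithm was used, so the greedy strategy is an assumption, as are the names `Link` and `colorLinks`. Each link receives the lowest batch number that neither of its two vertices already uses, which guarantees that no two links in a batch share a vertex.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Link { int v0; int v1; };

// Returns, for every link, the index of the batch (color) it was assigned to.
std::vector<int> colorLinks(int numVertices, const std::vector<Link>& links)
{
    // usedBatches[v][b] is true when some link in batch b already touches vertex v.
    std::vector<std::vector<bool>> usedBatches(numVertices);
    auto isUsed = [&](int v, std::size_t b) {
        return b < usedBatches[v].size() && usedBatches[v][b];
    };
    auto markUsed = [&](int v, std::size_t b) {
        if (usedBatches[v].size() <= b) usedBatches[v].resize(b + 1, false);
        usedBatches[v][b] = true;
    };

    std::vector<int> batchOfLink(links.size());
    for (std::size_t i = 0; i < links.size(); ++i) {
        const Link& link = links[i];
        assert(link.v0 >= 0 && link.v0 < numVertices);
        assert(link.v1 >= 0 && link.v1 < numVertices);
        // Pick the first batch that neither endpoint participates in yet.
        std::size_t batch = 0;
        while (isUsed(link.v0, batch) || isUsed(link.v1, batch)) ++batch;
        markUsed(link.v0, batch);
        markUsed(link.v1, batch);
        batchOfLink[i] = static_cast<int>(batch);
    }
    return batchOfLink;
}
```

A chain of three links needs only two batches, while a triangle needs three, because every pair of its links shares a vertex.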

[Figure sequence: the links of the cloth mesh colored into batches, shown for 1, 2, 3, and 10 batches]

Driving batches

for each iteration
{
    // Statically generated batches and pre-sorted buffers:
    // each batch is a contiguous (start, length) range of links.
    for( int i = 0; i < m_linkData.m_batchStartLengths.size(); ++i )
    {
        int start = m_linkData.m_batchStartLengths[i].first;
        int num = m_linkData.m_batchStartLengths[i].second;

        // Links within one batch share no vertices.
        for(int linkIndex = start; linkIndex < start + num; ++linkIndex)
        {
            // per-link solve goes here
        }
    }
}

This loop is now fully parallel.
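The (start, length) pairs read by the loop above have to be produced once, ahead of time. The C++ sketch below shows one way that preprocessing might be done, given a batch index per link: sort the link order by batch and record where each batch begins and how long it is. The `BatchedLinks` struct and `sortLinksIntoBatches` function are hypothetical; only the idea of a start/length pair per batch comes from the slide.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

struct BatchedLinks {
    std::vector<int> order;                              // link indices, sorted by batch
    std::vector<std::pair<int, int>> batchStartLengths;  // (start, length) per batch
};

BatchedLinks sortLinksIntoBatches(const std::vector<int>& batchOfLink)
{
    BatchedLinks result;
    result.order.resize(batchOfLink.size());
    std::iota(result.order.begin(), result.order.end(), 0);
    // Stable sort keeps the original link order inside each batch.
    std::stable_sort(result.order.begin(), result.order.end(),
                     [&batchOfLink](int a, int b) {
                         return batchOfLink[a] < batchOfLink[b];
                     });

    const int n = static_cast<int>(result.order.size());
    int i = 0;
    while (i < n) {
        const int batch = batchOfLink[result.order[i]];
        assert(batch >= 0);
        const int start = i;
        while (i < n && batchOfLink[result.order[i]] == batch) ++i;
        result.batchStartLengths.push_back({start, i - start});
    }
    return result;
}
```

The per-link buffers (vertex indices, rest lengths, and so on) would then be permuted into `order` before upload, so that each dispatch reads one contiguous range.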


Dispatching a batch

solvePositionsFromLinksKernel.kernel.setArg(0, startLink);
solvePositionsFromLinksKernel.kernel.setArg(1, numLinks);
solvePositionsFromLinksKernel.kernel.setArg(2, kst);
solvePositionsFromLinksKernel.kernel.setArg(3, ti);
solvePositionsFromLinksKernel.kernel.setArg(4, m_linkData.m_clLinks.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(5, m_linkData.m_clLinksMassLSC.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(6, m_linkData.m_clLinksRestLengthSquared.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(7, m_vertexData.m_clVertexInverseMass.getBuffer());
solvePositionsFromLinksKernel.kernel.setArg(8, m_vertexData.m_clVertexPosition.getBuffer());

int numWorkItems = workGroupSize*((numLinks + (workGroupSize-1)) / workGroupSize);

cl_int err = m_queue.enqueueNDRangeKernel(
    solvePositionsFromLinksKernel.kernel,
    cl::NullRange, cl::NDRange(numWorkItems), cl::NDRange(workGroupSize));

Note that the number of work items is rounded to a multiple of the group size
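The rounding expression can be pulled out into a small helper to make its behaviour easier to check. This is a plain C++ sketch; the function name `roundUpToMultiple` is made up for illustration.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of groupSize that is greater than or equal to count.
// Integer division truncates, so adding (groupSize - 1) first rounds up.
std::size_t roundUpToMultiple(std::size_t count, std::size_t groupSize)
{
    assert(groupSize > 0);
    return groupSize * ((count + groupSize - 1) / groupSize);
}
```

With a work-group size of 64, a batch of 100 links is dispatched as 128 work items, so the last 28 work items have no link to process and the kernel has to skip them.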

The OpenCL link solver kernel header

__kernel void
SolvePositionsFromLinksKernel(
    const int startLink,
    const int numLinks,
    const float kst,
    const float ti,
    __global int2 *g_linksVertexIndices,
    __global float *g_linksMassLSC,
    __global float *g_linksRestLengthSquared,
    __global float *g_verticesInverseMass,
    __global float4 *g_vertexPositions)
{
}


The OpenCL link solver kernel body

int linkID = get_global_id(0) + startLink;
if( get_global_id(0) < numLinks ) {
    float massLSC = g_linksMassLSC[linkID];
    float restLengthSquared = g_linksRestLengthSquared[linkID];
    if( massLSC > 0.0f ) {
        int2 nodeIndices = g_linksVertexIndices[linkID];
        float3 position0 = g_vertexPositions[nodeIndices.x].xyz;
        float3 position1 = g_vertexPositions[nodeIndices.y].xyz;
        float inverseMass0 = g_verticesInverseMass[nodeIndices.x];
        float inverseMass1 = g_verticesInverseMass[nodeIndices.y];

The number of links might not be a multiple of the block size.

Changing the memory layout might be a better solution.

The OpenCL link solver kernel body (continued)

        float3 del = position1 - position0;
        float lengthSquared = dot( del, del );
        float k = ((restLengthSquared - lengthSquared)/(massLSC*(restLengthSquared + lengthSquared)))*kst;
        position0 = position0 - del*(k*inverseMass0);
        position1 = position1 + del*(k*inverseMass1);
        g_vertexPositions[nodeIndices.x] = (float4)(position0, 0.f);
        g_vertexPositions[nodeIndices.y] = (float4)(position1, 0.f);
    }
}


Returning to our batching

• 10 batches: 10 OpenCL kernel enqueues/dispatches

• Each batch holds about 1/10 of the links

• Low compute density per thread

[Figure: cloth links divided into 10 batches]

Higher efficiency

• Can create larger groups

- The cloth is fixed-structure

- Can be preprocessed

• Fewer batches/dispatches

[Figure: cloth divided into 9 batches]

Larger groups still

• We can move to much larger groups of links

- Number of parallel instances reduced

- On arbitrary meshes, batches are hard to create

• Larger batches

- Smaller dispatches

- Less parallelism

[Figure: cloth divided into 4 batches]

Solving cloths together

• Solve multiple cloths together in n batches

• Grouping

- Larger dispatches and reduced number of dispatches

- Regain the parallelism that increased work-per-thread removed
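As a rough illustration of solving several cloths together, the C++ sketch below merges the per-cloth batches so that batch b of the combined set contains batch b of every cloth. Links from different cloths never share a vertex, so the merged batches stay independent. The data layout (`Cloth`, `Batches`) and the vertex-offset scheme are assumptions, not the presenters' implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Link { int v0; int v1; };

// batches[b] holds the links of batch b; links in one batch share no vertices.
using Batches = std::vector<std::vector<Link>>;

struct Cloth {
    int numVertices;
    Batches batches;   // link indices are local to this cloth
};

// Merge all cloths into one set of batches over a single shared vertex buffer.
// Vertices of cloth i are assumed to be stored after those of cloths 0..i-1.
Batches mergeClothBatches(const std::vector<Cloth>& cloths)
{
    Batches merged;
    int vertexOffset = 0;
    for (const Cloth& cloth : cloths) {
        if (cloth.batches.size() > merged.size())
            merged.resize(cloth.batches.size());
        for (std::size_t b = 0; b < cloth.batches.size(); ++b) {
            for (const Link& link : cloth.batches[b]) {
                assert(link.v0 >= 0 && link.v0 < cloth.numVertices);
                assert(link.v1 >= 0 && link.v1 < cloth.numVertices);
                merged[b].push_back({link.v0 + vertexOffset,
                                     link.v1 + vertexOffset});
            }
        }
        vertexOffset += cloth.numVertices;
    }
    return merged;
}
```

The number of dispatches per iteration is then the largest batch count of any single cloth rather than the sum over all cloths, and each dispatch covers more links.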


What if we consider SIMD packing

• The previous batching would still require serial solving of each group

• However, the GPU is not really a collection of thousands of cores

- Tens of SIMD cores

- Each SIMD core runs a single instruction over a vector of work items

- Essentially a wide, flexible version of SSE

• So what does that tell us?

- Computations within a vector happen simultaneously

- Maybe we should be batching on the SIMD level, not on the work item level?

• The disadvantage of this is that we lose platform independence


How can this work?

• Take one group from the previous batching

• This group can be mapped into a single SIMD core

A tighter loop

• We still need to batch, of course

• However, now we batch only WITHIN the SIMD

- These batches are processed in a single kernel call

- The loop to deal with them can be unrolled

Three Eras of Processor Performance

• Single-Core Era

- Chart: single-thread performance over time, with a "we are here" marker

- Enabled by: Moore’s Law, voltage scaling, microarchitecture

- Constrained by: power, complexity

• Multi-Core Era

- Chart: throughput performance over time (# of processors), with a "we are here" marker

- Enabled by: Moore’s Law, desire for throughput, 20 years of SMP architecture

- Constrained by: power, parallel SW availability, scalability

• Heterogeneous Systems Era

- Chart: targeted application performance over time (data-parallel exploitation), with a "we are here" marker

- Enabled by: Moore’s Law, abundant data parallelism, power-efficient GPUs

- Constrained by: programming models, communication overheads

OpenCL Development Challenges

• Debugging and profiling parallel computing applications are hard and time-consuming tasks

• Delivering a robust, bug-free parallel computing application on time is hard

• It is almost impossible to optimize a parallel computing application to fully utilize the available system resources

gDEBugger

• gDEBugger is an OpenGL, OpenGL ES and OpenCL Debugger, Profiler and Memory Analyzer.

• It provides the information a developer needs to find bugs and optimize application performance

Debugging Demo


OpenCL Calls History

• OpenCL calls are displayed in a Calls History view

• Properties view displays each call's details

- Arguments

- Links to data viewer

- And more

Automatic error detection

gDEBugger can automatically break on:

• Errors detected by gDEBugger

• OpenCL errors

• OpenCL memory leaks

• OpenCL function calls

• And more

Call Stack and Source Code views

• View the call stack and source code that led to the error / OpenCL function call

Statistical viewer

• Displays a statistical overview of the debugged application’s OpenCL API usage

• Offers best-practice suggestions based on the application’s activities

Images and Buffers viewer

• Displays image and buffer data

• Image and Data panes

Kernels Source Code Editor

• Displays allocated OpenCL programs and kernels

• Enables “Edit and Continue” for programs and kernels

OpenGL interoperability

• Debug applications that use both OpenGL and OpenCL in one consistent framework

• View interoperability properties

- OpenGL contexts association

- OpenGL buffers and render-buffers association

- OpenGL textures association

Profiling Demo


Real Time Statistics View

• Time-based overview of OpenCL activities

• Data is displayed “per queue”

- % Kernel, % Write, % Copy, % Read, % Idle

- Work items executed per second

- Read, Write, Copy MB/sec

Command Queues Viewer

• Displays a detailed, per-queue, time-based view of OpenCL activities

Performance graph

• Displays performance counters from different sources

- gDEBugger’s OpenGL and OpenCL Servers

- GPUs and drivers: NVIDIA, AMD, S3 Graphics

- Operating systems: Windows, Mac OS X, iOS and Linux

Performance Analysis Toolbar

• Turn off different types of OpenCL operations to view the effect on parallel computing performance

- Disable Kernel Operations

- Disable Read Operations

- Disable Write Operations

- Disable Copy Operations

Optimizing OpenCL memory usage


Memory Analysis viewer

• Displays information about the memory usage of OpenCL objects

• Tracks OpenCL-related memory leaks

• Displays each object’s creation call stack

gDEBugger’s customer benefits

• Improves application quality

• Optimizes application performance

• Reduces debugging and profiling time

• Shortens "time to market"

• Helps with deployment on multiple platforms

Available on

• Windows - OpenGL, OpenGL ES and OpenCL

• Mac OS X - OpenGL and OpenCL

• iOS - iPhone & iPad on-device and Simulator, OpenGL ES 1.1 and 2.0

• Linux - OpenGL and OpenCL


Available now

• gDEBugger CL is in final beta testing

• Expected to be released in August 2010

• To join the free beta program: www.gremedy.com/gDEBuggerCL.php