[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


DESCRIPTION

http://cs264.org
http://goo.gl/1K2fI

TRANSCRIPT

Page 1: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
Page 2: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Data-independent tasks

–  Tasks with statically-known data dependences

–  SIMD divergence

–  Lacking fine-grained synchronization

–  Lacking writeable, coherent caches

Page 3: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 4: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
Page 5: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Device            Key-value sorting (10^6 pairs/sec)    Keys-only sorting (10^6 keys/sec)
NVIDIA GTX 280    449 (3.8x speedup*)                    534 (2.9x speedup*)

* Satish et al.,"Designing efficient sorting algorithms for manycore GPUs," in IPDPS '09

Page 6: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Device            Key-value sorting (10^6 pairs/sec)    Keys-only sorting (10^6 keys/sec)
NVIDIA GTX 480    775                                    1005
NVIDIA GTX 280    449                                    534
NVIDIA 8800 GT    129                                    171

Page 7: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Device                              Key-value sorting (10^6 pairs/sec)    Keys-only sorting (10^6 keys/sec)
NVIDIA GTX 480                      775                                    1005
NVIDIA GTX 280                      449                                    534
NVIDIA 8800 GT                      129                                    171
Intel Knights Ferry MIC 32-core*    -                                      560
Intel Core i7 quad-core*            -                                      240
Intel Core 2 quad-core*             -                                      138

* Satish et al., "Fast Sort on CPUs, GPUs and Intel MIC Architectures," Intel Tech Report 2010.

Page 8: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

 

Page 9: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Each output is dependent upon a finite subset of the input

•  Threads are decomposed by output element

•  The output (and at least one input) index is a static function of thread-id

[Figure: each thread reads a fixed input element and writes a fixed output element.]
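As a minimal sketch of this regular decomposition (illustrative names, not from the slides): a CUDA kernel in which each thread's output index, and the input element it reads, is a static function of its thread id.

// Each thread's input/output index is a static function of its thread id.
__global__ void MapKernel(const float *in, float *out, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;   // static index function
    if (idx < n)
        out[idx] = 2.0f * in[idx];                     // output depends on one fixed input
}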

Page 10: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Each output element has dependences upon any / all input elements

–  E.g., sorting, reduction, compaction, duplicate removal, histogram generation, map-reduce, etc.

[Figure: output elements whose input dependences ("?") are not known statically.]

Page 11: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Threads are decomposed by output element

–  Repeatedly iterate over recycled input streams

–  Output stream size is statically known before each pass

[Figure: threads repeatedly consuming and producing recycled input/output streams.]

Page 12: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  O(n) global work from passes of pairwise-neighbor-reduction

–  Static dependences, uniform output

[Figure: a tree of pairwise-neighbor additions.]
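A sketch of one such pass (illustrative kernel, assuming the input length is even): each thread produces one output from two fixed neighbors, and repeating the pass until one element remains gives O(n) total work.

__global__ void PairwiseReduceKernel(const int *in, int *out, int n_out)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_out)
        out[idx] = in[2 * idx] + in[2 * idx + 1];   // statically-known dependences
}

// Host loop (sketch): launch repeatedly, swapping in/out and halving n each pass.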

Page 13: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Repeated pairwise swapping

•  Bubble sort is O(n²)

•  Bitonic sort is O(n log² n)

–  Need partitioning: dynamic, cooperative allocation

–  Repeatedly check each vertex or edge

•  Breadth-first search becomes O(V²)

•  O(V+E) is work-optimal

–  Need queue: dynamic, cooperative allocation

Page 14: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 15: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Variable output per thread

–  Need dynamic, cooperative allocation

[Figure: threads emitting variable numbers of output elements.]

Page 16: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

•  Where do I put something in a list?

–  Duplicate removal

–  Sorting

–  Histogram compilation

•  Where do I enqueue something?

–  Search space exploration

–  Graph traversal

–  General work queues

[Figure: many threads producing results into an output whose positions ("?") depend on the data.]

Page 17: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

• For 30,000 producers and consumers?

–  Locks serialize everything

Page 18: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  O(n) work

–  For allocation: use scan results as a scattering vector

–  Popularized by Blelloch et al. in the ‘90s

–  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

Input:       2   1   0   3   2
Prefix sum:  0   2   3   3   6

Page 19: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  O(n) work

–  For allocation: use scan results as a scattering vector

–  Popularized by Blelloch et al. in the ‘90s

–  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

Input (& allocation requirement):  2   1   0   3   2
Result of prefix scan (sum):       0   2   3   3   6
(one thread per input element)

Page 20: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  O(n) work

–  For allocation: use scan results as a scattering vector

–  Popularized by Blelloch et al. in the ‘90s

–  Merrill et al. Parallel Scan for Stream Architectures. Technical Report CS2009-14, University of Virginia. 2009

Input (& allocation requirement):  2   1   0   3   2
Result of prefix scan (sum):       0   2   3   3   6
Output: 8 slots (indices 0-7); each thread writes its items starting at its scanned offset
(one thread per input element)
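A serial C++ sketch of the idiom (illustrative only, using the slide's example data): exclusive-scan the per-thread allocation requirements, then use the scan results as scatter offsets into a single output buffer.

#include <cstdio>
#include <vector>

int main()
{
    std::vector<int> need = {2, 1, 0, 3, 2};          // per-thread allocation requirement
    std::vector<int> offset(need.size());

    int total = 0;                                    // exclusive prefix sum (scan)
    for (size_t i = 0; i < need.size(); ++i) {
        offset[i] = total;                            // offset = {0, 2, 3, 3, 6}
        total += need[i];                             // total  = 8
    }

    std::vector<int> out(total);
    for (size_t i = 0; i < need.size(); ++i)          // each "thread" scatters its items
        for (int j = 0; j < need[i]; ++j)
            out[offset[i] + j] = static_cast<int>(i); // record which producer owns the slot

    for (int v : out) std::printf("%d ", v);          // prints: 0 0 1 3 3 3 4 4
    std::printf("\n");
    return 0;
}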

Page 21: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Key sequence:        1110 1010 1100 1000 0011 0111 0101 0001
Output key sequence: 1110 1100 0011 0111 1010 1000 0101 0001
(keys partitioned into a 0s bin and a 1s bin for the current radix digit)

Page 22: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Key sequence: 1110 1010 1100 1000 0011 0111 0101 0001
Allocation requirements: one flag vector per bin (0s, 1s), one flag per key
Scanned allocations (relocation offsets): prefix scan of each bin's flag vector
[Figure: the worked flag and offset vectors for this key sequence.]

Page 23: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Allocation requirements: one flag vector per bin (0s, 1s), one flag per key
Scanned allocations (bin relocation offsets): prefix scan of each bin's flag vector
Adjusted allocations (global relocation offsets): bin offsets plus each bin's starting position in the output
Key sequence:        1110 1010 1100 1000 0011 0111 0101 0001
Output key sequence: 1110 1100 0011 0111 1010 1000 0101 0001
[Figure: the worked offset vectors and final scatter for this key sequence.]
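A serial sketch of one split pass (illustrative; the GPU version replaces the loops with parallel flag vectors and prefix scans, as on these slides): keys are flagged by the current bit, the flags are scanned into relocation offsets, and each key is scattered stably into its 0s or 1s bin.

#include <cstdint>
#include <vector>

// Stable single-bit split: 0s bin first, then 1s bin (serial reference sketch).
std::vector<uint32_t> SplitOnBit(const std::vector<uint32_t> &keys, int bit)
{
    std::vector<uint32_t> out(keys.size());

    size_t zeros = 0;                                  // total of the 0s flag vector
    for (uint32_t k : keys)
        if (((k >> bit) & 1u) == 0) ++zeros;

    size_t next0 = 0, next1 = zeros;                   // running (scanned) offsets per bin
    for (uint32_t k : keys) {
        if (((k >> bit) & 1u) == 0) out[next0++] = k;  // global relocation offset, 0s bin
        else                        out[next1++] = k;  // global relocation offset, 1s bin
    }
    return out;
}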

Page 24: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

 

Page 25: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Un-fused

[Diagram: the host program launches "Determine allocation size", a series of CUDPP scan kernels, and "Distribute output" as separate kernels on the GPU, with every intermediate result passing through global device memory.]

Page 26: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Un-fused vs. Fused

[Diagram, left (un-fused): the host launches "Determine allocation size", a series of CUDPP scan kernels, and "Distribute output" as separate kernels, with intermediates in global device memory.]

[Diagram, right (fused): the same steps regrouped into fewer kernels (scan steps fused with "Determine allocation" and "Distribute output"), so much less intermediate data touches global device memory.]

Page 27: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

1.  Heavy SMT (over-threading) yields usable “bubbles” of free computation

2.  Propagate live data between steps in fast registers / smem

3.  Use scan (or variant) as a “runtime” for everything

Fused

[Diagram: the fused pipeline from the previous slide, with scan kernels combined with "Determine allocation" and "Distribute output".]

Page 28: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 29: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Device            Memory bandwidth     Compute throughput        Memory wall      Memory wall
                  (10^9 bytes/s)       (10^9 thread-cycles/s)    (bytes/cycle)    (instrs/word)
GTX 480           169.0                672.0                     0.251            15.9
GTX 285           159.0                354.2                     0.449            8.9
GTX 280           141.7                311.0                     0.456            8.8
Tesla C1060       102.0                312.0                     0.327            12.2
9800 GTX+         70.4                 235.0                     0.300            13.4
8800 GT           57.6                 168.0                     0.343            11.7
9800 GT           57.6                 168.0                     0.343            11.7
8800 GTX          86.4                 172.8                     0.500            8.0
Quadro FX 5600    76.8                 152.3                     0.504            7.9
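The two memory-wall columns follow directly from the first two (a worked check, not from the slides): bytes per thread-cycle is bandwidth divided by compute throughput, and thread-instructions per 32-bit word is 4 bytes divided by that ratio. For the GTX 480:

\[
\frac{169.0\times10^{9}\ \text{bytes/s}}{672.0\times10^{9}\ \text{thread-cycles/s}} \approx 0.251\ \text{bytes/cycle},
\qquad
\frac{4\ \text{bytes/word}}{0.251\ \text{bytes/cycle}} \approx 15.9\ \text{instructions/word}.
\]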

Page 30: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 31: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions), with the GTX 285 read+write memory wall at 17.8 instructions per input word; the headroom below the wall is labeled "Insert work here".]

Page 32: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: as above, adding a "Data Movement Skeleton" curve beneath the GTX 285 r+w memory wall (17.8).]

Page 33: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: as above, adding "Our Scan Kernel" alongside the Data Movement Skeleton, both under the GTX 285 r+w memory wall (17.8).]

Page 34: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Increase granularity / redundant computation

• ghost cells

• radix bits

–  Orthogonal kernel fusion

[Plot: GTX 285 r+w memory wall (17.8), Our Scan Kernel, and the Data Movement Skeleton; the remaining headroom is labeled "Insert work here".]

Page 35: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions), comparing the CUDPP scan kernel with Our Scan Kernel.]

Page 36: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Partially-coalesced writes

–  2x write overhead

–  4 total concurrent scan operations (radix 16)

[Plot: thread-instructions per 32-bit scan element vs. problem size (millions), showing the GTX 285 scan kernel wall, the GTX 285 radix scatter kernel wall, and Our Scan Kernel; headroom labeled "Insert work here".]

Page 37: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Need kernels with tunable local (or redundant) work

• ghost cells

• radix bits

[Plot: thread-instructions per 32-bit word vs. problem size (millions), showing the GTX 285 and GTX 480 radix scatter kernel walls.]

Page 38: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

 

Page 39: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Virtual processors abstract a diversity of hardware configurations

–  Leads to a host of inefficiencies

–  E.g., only several hundred CTAs

Page 40: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 41: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Grid A: grid-size = (N / tilesize) CTAs
Grid B: grid-size = 150 CTAs (or other small constant)

[Diagram: both grids drawn as rows of threadblocks over the same input.]
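A sketch of the Grid B strategy (illustrative kernel, not the talk's source; assumes the input is padded to a whole number of tiles and the tile size equals the CTA width): launch a small, fixed number of CTAs and let each one loop over many tiles.

__global__ void TileLoopKernel(const float *in, float *out, int num_tiles)
{
    // Each CTA processes tiles blockIdx.x, blockIdx.x + gridDim.x, ...
    for (int tile = blockIdx.x; tile < num_tiles; tile += gridDim.x) {
        int idx = tile * blockDim.x + threadIdx.x;   // tile size == blockDim.x here
        out[idx] = in[idx];                          // stand-in for real tile processing
    }
}

// Launched with a small constant grid, independent of problem size, e.g.:
//   TileLoopKernel<<<150, 256>>>(d_in, d_out, num_tiles);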

Page 42: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Thread-dependent predicates

–  Setup and initialization code (notably for smem)

–  Offset calculations (notably for smem)

–  Common values are hoisted and kept live


Page 43: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 44: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Thread-dependent predicates

–  Setup and initialization code (notably for smem)

–  Offset calculations (notably for smem)

–  Common values are hoisted and kept live

–  Spills are really bad


Page 45: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  O(N / tilesize) gmem accesses

–  2-4 instructions per access (offset calcs, load, store)

–  GPU is least efficient here: get it over with as quickly as possible

log_tilesize(N)-level tree vs. two-level tree

Page 46: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 47: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: thread-instructions per element vs. grid size (# of threadblocks), showing the compute load relative to the GTX 285 scan kernel wall.]

Page 48: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  16.1M / 150 CTAs / 1024 = 109.91 tiles per CTA

–  conditional evaluation

–  singleton loads

C = number of CTAs

N = problem size

T = tile size

B = tiles per CTA

Page 49: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 50: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  floor(16.1M / (1024 * 150) ) = 109 tiles per CTA

–  16.1M % (1024 * 150) = 136.4 extra tiles

C = number of CTAs

N = problem size

T = tile size

B = tiles per CTA

Page 51: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  floor(16.1M / (1024 * 150) ) = 109 tiles per CTA (14 CTAs)

–  109 + 1 = 110 tiles per CTA (136 CTAs)

–  16.1M % (1024 * 150) = 0.4 extra tiles

C = number of CTAs

N = problem size

T = tile size

B = tiles per CTA
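A host-side sketch of the even-share schedule the slides work through (using the slide's variable names; 16.1M is taken as 16.1 x 2^20 elements):

#include <cstdio>

int main()
{
    const long long N = (long long)(16.1 * 1024 * 1024);  // problem size (~16.1M)
    const long long T = 1024;                              // tile size
    const long long C = 150;                                // number of CTAs

    long long tiles = N / T;          // whole tiles (a partial tile may remain)
    long long B     = tiles / C;      // base tiles per CTA           -> 109
    long long extra = tiles % C;      // CTAs that take one more tile -> 136

    std::printf("%lld CTAs x %lld tiles, %lld CTAs x %lld tiles",
                C - extra, B, extra, B + 1);
    if (N % T)
        std::printf(", plus a partial tile of %lld elements", N % T);
    std::printf("\n");
    return 0;
}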

Page 52: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

 

Page 53: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  If you breathe on your code, run it through the Visual Profiler

•  Kernel runtimes

•  Instruction counts

–  Indispensable for tuning

•  Host-side timing requires too many iterations

•  Only 1-2 cudaprof iterations needed for consistent counter-based perf data

–  Write tools to parse the output

•  "Dummy" kernels useful for demarcation

Page 54: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: sorting rate (10^6 keys/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, C2050 (ECC), GTX 280, C1060, and 9800 GTX+.]

Page 55: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: sorting rate (millions of pairs/sec) vs. problem size (millions) for GTX 480, C2050 (no ECC), GTX 285, GTX 280, C2050 (ECC), C1060, and 9800 GTX+.]

Page 56: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: kernel bandwidth (GiBytes/sec) vs. problem size (millions) for the merrill_tree Reduce and merrill_rts Scan kernels.]

Page 57: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plot: kernel bandwidth (bytes x10^9 / sec) vs. problem size (millions) for the merrill_linear Reduce and merrill_linear Scan kernels.]

Page 58: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Implement a device "memcpy" for tile-processing

•  Optimize for "full tiles"

–  Specialize for different SM versions, input types, etc.
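A hedged sketch of such a tile-processing "memcpy" (illustrative, not the Back40 source): the kernel is templated on the element type and the number of loads per thread, and the full-tile path skips per-element bounds checks.

template <typename T, int LOADS_PER_THREAD>
__global__ void TileCopyKernel(const T *in, T *out, int num_elements)
{
    const int TILE = blockDim.x * LOADS_PER_THREAD;
    for (int base = blockIdx.x * TILE; base < num_elements; base += gridDim.x * TILE) {
        if (base + TILE <= num_elements) {
            // Full tile: unrolled, unguarded loads and stores.
            #pragma unroll
            for (int i = 0; i < LOADS_PER_THREAD; ++i) {
                int idx = base + i * blockDim.x + threadIdx.x;
                out[idx] = in[idx];
            }
        } else {
            // Partial last tile: guarded singleton loads.
            for (int i = 0; i < LOADS_PER_THREAD; ++i) {
                int idx = base + i * blockDim.x + threadIdx.x;
                if (idx < num_elements) out[idx] = in[idx];
            }
        }
    }
}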

Page 59: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
Page 60: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Use templated code to generate various instances

–  Run with cudaprof env vars to collect data

Page 61: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Plots: copy bandwidth (GiBytes/sec) vs. words copied (millions) for 128-thread CTAs (one panel for 64B loads, one for 128B loads/stores), comparing single-, double-, and quad-word variants with and without overlap, the intrinsic copy, and one-way / two-way cudaMemcpy() rates.]

Page 62: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

 

Page 63: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  SIMD lanes are wasted on the O(n)-work Brent-Kung scan (left), but it does less work when n > warp size

–  Kogge-Stone (right) is O(n log n)-work, but faster when n ≤ warp size

[Diagram: left, the Brent-Kung scan network over x0..x7 (upsweep/downsweep, computing partial sums such as ⊕(x0..x3)); right, the Kogge-Stone network computing every prefix ⊕(x0..xi) in log n SIMD steps.]
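For reference, a minimal warp-synchronous Kogge-Stone inclusive scan over one 32-thread warp in shared memory (a sketch in the idiom of that era's SDK scan kernels, not the talk's code; it relies on implicit warp-synchronous execution, which on later GPUs needs __syncwarp() or shuffle intrinsics to be safe).

__device__ int WarpScanInclusive(volatile int *s, int lane, int value)
{
    // s points to this warp's 32-entry shared-memory segment; lane is 0..31.
    s[lane] = value;
    if (lane >= 1)  s[lane] = value = value + s[lane - 1];
    if (lane >= 2)  s[lane] = value = value + s[lane - 2];
    if (lane >= 4)  s[lane] = value = value + s[lane - 4];
    if (lane >= 8)  s[lane] = value = value + s[lane - 8];
    if (lane >= 16) s[lane] = value = value + s[lane - 16];
    return value;   // inclusive prefix sum of the warp's values in log2(32) = 5 steps
}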

Page 64: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 65: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

Tree-based vs. raking-based:

[Diagram: the tree-based scan keeps all T threads active with a barrier between levels; the raking-based scan has a few threads serially rake contiguous shared-memory segments between barriers.]

Page 66: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)


Page 67: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Barriers make O(n) code O(n log n)

–  The rest are “DMA engine” threads

–  Use threadblocks to cover pipeline latencies, e.g., for Fermi:

•  2 worker warps per CTA

•  6-7 CTAs

[Diagram: the raking layout: a small set of "Worker" threads rake over shared memory while the remaining "DMA" threads only stage tiles between global and shared memory.]

Page 68: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

 

Page 69: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Different SMs (varied local storage: registers/smem)

–  Different input types (e.g., sorting chars vs. ulongs)

–  # of steps for each algorithm phase is configuration-driven

–  Template expansion + Constant-propagation + Static loop unrolling + Preprocessor Macros

–  The compiler produces target assembly that is well-tuned for the specific hardware and problem
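A toy illustration of the style (hypothetical names, not Back40 code; assumes blockDim.x matches the policy's thread count and the input is padded to whole tiles): tuning parameters enter as template arguments, so constant propagation and static unrolling give each specialization straight-line code.

// Hypothetical tuning policy: compile-time constants select the granularity.
template <int LOG_THREADS, int ELEMENTS_PER_THREAD>
struct TuningPolicy
{
    enum {
        THREADS       = 1 << LOG_THREADS,
        ELEMENTS      = ELEMENTS_PER_THREAD,
        TILE_ELEMENTS = THREADS * ELEMENTS
    };
};

template <typename Policy, typename T>
__global__ void ProcessTileKernel(const T *in, T *out)
{
    T items[Policy::ELEMENTS];

    // Loop bounds are compile-time constants, so each specialization unrolls
    // into straight-line code for its hardware / problem configuration.
    #pragma unroll
    for (int i = 0; i < Policy::ELEMENTS; ++i)
        items[i] = in[blockIdx.x * Policy::TILE_ELEMENTS + i * Policy::THREADS + threadIdx.x];

    #pragma unroll
    for (int i = 0; i < Policy::ELEMENTS; ++i)
        out[blockIdx.x * Policy::TILE_ELEMENTS + i * Policy::THREADS + threadIdx.x] = items[i];
}

// e.g. a different specialization per SM version / key type:
//   ProcessTileKernel<TuningPolicy<7, 4>, unsigned int><<<grid, 128>>>(d_in, d_out);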

Page 70: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

[Diagram: the fused local multi-scan for one radix-4 digit place. A subsequence of keys is decoded into per-digit flag vectors (0s, 1s, 2s, 3s); these are reduced and scanned via serial reduction in registers and shared memory, a SIMD Kogge-Stone scan in shared memory, and serial scans back down, yielding local exchange offsets. Keys are then exchanged in shared memory and scattered using global scatter offsets formed from the per-digit carry-ins (0s: 9, 1s: 25, 2s: 33, 3s: 49) plus the tile's digit totals (9, 7, 9, 7), producing the carry-outs (0s: 18, 1s: 32, 2s: 42, 3s: 56) for the next tile.]

Page 71: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

•  Resource-allocation as runtime

1.  Kernel memory-wall analysis (kernel fusion)

2.  Algorithm serialization

3.  Tune for data-movement

4.  Warp-synchronous programming

5.  Flexible granularity via meta-programming

Page 72: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Back40Computing (a Google Code project)

•  http://code.google.com/p/back40computing/

–  Default sorting method for Thrust

•  http://code.google.com/p/thrust/

Page 73: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)
Page 74: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  A single host-side procedure call launches a kernel that performs orthogonal program steps

MyUberKernel<<<grid_size, num_threads>>>(d_device_storage);

–  No existing public repositories of kernel “subroutines” for scavenging

Page 75: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Callbacks, iterators, visitors, functors, etc.

– ReduceKernel<<<grid_size, num_threads>>>(CountingIterator(100));

–  E.g., the fused kernel below can't be composed using a callback-based pattern

Fused radix sorting kernel (digit extraction, local prefix scan, scatter accordingly):

[Diagram: the kernel's steps: GATHER (key), extract radix digit, encode flag bit (into flag vectors), LOCAL MULTI-SCAN (flag vectors), decode local rank (from flag vectors), extract radix digit (again), update global radix digit partition offsets, EXCHANGE (key), SCATTER (key); and GATHER (value), EXCHANGE (value), SCATTER (value).]

Page 76: [Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of High Performance Sorting (Duane Merrill, University of Virginia)

–  Compiled libraries suffer from code bloat

•  The CUDPP primitives library is 100s of MBs, yet still doesn't support all built-in numeric types

•  Specializing for device configurations makes it even worse

–  The alternative is to ship source for #include'ing

•  Have to be willing to share source

–  Need a way to fit meta-programming in at the JIT / bytecode level to help avoid expansion / mismatch-by-omission

–  Can leverage fundamentally different algorithms for different phases

•  How to teach the compiler to do this?