an experimental study on performance portability of opencl...

24
An Experimental Study on Performance Portability of OpenCL Kernels SAAHPC’10 Knoxville, Tennessee July 13-15, 2010 Sean Rul Hans Vandierendonck Joris D’Haene Koen De Bosschere

Upload: others

Post on 05-Jun-2020

13 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

An Experimental Study on Performance Portability of OpenCL Kernels

SAAHPC’10

Knoxville, Tennessee July 13-15, 2010

Sean Rul

Hans Vandierendonck

Joris D’Haene

Koen De Bosschere

Page 2: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Programming Accelerators: a Tower of Babel?

2

SSE

Cell SDK

Brook+

CUDA

Page 3: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

OpenCL: One ring to bind them all?

‘The open standard for parallel programming of heterogeneous systems’

3

Functional portability

What about performance portability?

What is the specificity of code optimizations to the underlying architecture?

What is the performance impact?

Page 4: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Method

4

Parboil

Benchmark

suite

Architectures Code optimizations

Page 5: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Method: Hardware Platforms

5

Platform Clock(MHz) Cores

(Threads) GFLOPs Bandwidth(GB/s) TDP (W)

Core i7-720QM 1600 4 (8) 25 21 45

FirePro V8700 750 10(160) 816 108 114

Tesla c1060 1300 30 (240) 933 102 187

Cell SPE 3200 6 (6) 150 25 45

Page 6: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Method: Benchmarks

6

Benchmark CUDA→ OpenCL

Parameterized

nVidia GPU

x86 CPU Ati GPU

Cell

Cp

Mri-fhd

Mri-q

Rpes

Tpacf

Pns

Sad

Too small

Page 7: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Method: Code Optimizations

7

Loop Unrolling Vectorization Thread Block Size

Loop

A B C D

A’ B’ C’ D’

R1

R2

Single Vector instruction

ai R1, R2, 1

1 1 1 1

Body

Body

Body

Page 8: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Translating the Parboil benchmark suite to OpenCL

8

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

cp mri-fhd mri-q rpes tpacf

Exec

uti

on

tim

e (s

ec)

Other Time

Compile Time

Copy Time

Device time

Page 9: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Translating the Parboil benchmark suite to OpenCL

9

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

cp mri-fhd mri-q rpes tpacf

Exec

uti

on

tim

e (s

ec)

Other Time

Compile Time

Copy Time

Device time

OpenCL kernel compiled at execution time

Page 10: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Translating the Parboil benchmark suite to OpenCL

10

0.0

0.5

1.0

1.5

2.0

2.5

3.0

3.5

4.0

4.5

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

CU

DA

Op

enC

L

cp mri-fhd mri-q rpes tpacf

Exec

uti

on

tim

e (s

ec)

Other Time

Compile Time

Copy Time

Device time

Specific CUDA functions allow faster gonio exec

Page 11: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Loop Unrolling and Vectorization

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

1 2 4 8 16 32 1 2 4 8 16 32

scalar vector

Executio

n t

ime (

rela

tive)

CPU Tesla FirePro Cell

11

Benchmark: cp

Block size Cell: 1

Block size others: 128

Page 12: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Loop Unrolling and Vectorization

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

1 2 4 8 16 32 1 2 4 8 16 32

scalar vector

Executio

n t

ime (

rela

tive)

CPU Tesla FirePro Cell

12

CPU benefits from vectorization

CPU is indifferent to loop unrolling

Benchmark: cp

Block size Cell: 1

Block size others: 128

Page 13: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Loop Unrolling and Vectorization

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

1 2 4 8 16 32 1 2 4 8 16 32

scalar vector

Executio

n t

ime (

rela

tive)

CPU Tesla FirePro Cell

13

Benchmark: cp

Block size Cell: 1

Block size others: 128

Vectorization is critical to Cell

Cell benefits from loop unrolling

Page 14: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Loop Unrolling and Vectorization

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

1 2 4 8 16 32 1 2 4 8 16 32

scalar vector

Executio

n t

ime (

rela

tive)

CPU Tesla FirePro Cell

14

Benchmark: cp

Block size Cell: 1

Block size others: 128

Optimizations interact with each other

FirePro more sensitive

Too much unrolling degrades performance

Page 15: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

1 2 4 8 16 32 1 2 4 8 16 32

scalar vector

CPU Tesla FirePro Cell

Performance Impact of Loop Unrolling and Vectorization

15 Benchmark: cp

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

1.6

1.8

2.0

1 2 4 8 16 32 1 2 4 8 16 32

scalar vector

Exe

cu

tio

n tim

e (

rela

tive

)

CPU Tesla FirePro Cell

Vectorization is critical to Cell

Cell benefits from loop unrolling

Benchmark: mri-fhd

Vectorization is critical to Cell

Page 16: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Thread Block Size

16

0

2

4

6

8

10

12

14

16

18

20

1 4 16 128 256 512

Execution t

ime (

secs)

Block size

CPU Tesla FirePro Cell

Benchmark: mri-fhd

Loop unrolling: 2

Vectorization: yes

Page 17: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Thread Block Size

17

0

2

4

6

8

10

12

14

16

18

20

1 4 16 128 256 512

Execution t

ime (

secs)

Block size

CPU Tesla FirePro Cell

Benchmark: mri-fhd

Loop unrolling: 2

Vectorization: yes

Thread block size has little to no effect on CPU

Page 18: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Thread Block Size

18

0

2

4

6

8

10

12

14

16

18

20

1 4 16 128 256 512

Execution t

ime (

secs)

Block size

CPU Tesla FirePro Cell

Benchmark: mri-fhd

Loop unrolling: 2

Vectorization: yes

Thread block size not adjustable

for the Cell

Page 19: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Performance Impact of Thread Block Size

19

0

2

4

6

8

10

12

14

16

18

20

1 4 16 128 256 512

Execution t

ime (

secs)

Block size

CPU Tesla FirePro Cell

Benchmark: mri-fhd

Loop unrolling: 2

Vectorization: yes

Thread block size is the most important parameter for Tesla

Page 20: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Differences in Execution Time

20

Bench-mark Platform CPU i7 Tesla FirePro Cell SPE

cp

CPU i7 12.62 50.50 12.62 3,11

Tesla 2.49 0.48 0.49 2.49

FirePro 164.60 1.83 1.71 164.60

Cell SPE 8.80 39.75 49.49 8.80

mri

-fh

d

CPU i7 2.90 2.93 4.66 3.11

Tesla 0.38 0.31 0.49 0.81

FirePro 1.84 1.30 1.10 n/a

Cell SPE 1.20 1.26 10.22 1.14

mri

-q

CPU i7 5.78 5.99 8.74 6.22

Tesla 0.80 0.77 1.15 1.87

FirePro 1.49 1.98 1.25 n/a

Cell SPE 2.28 2.08 24.00 1.91

O p t i m i z e d f o r …

x96 slower

x4 slower

Page 21: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

How much performance is exploited?

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

cp mri-fhd mri-q

Perc

enta

ge o

f Pe

ak P

erfo

rman

ce

CPU i7

Tesla

FirePro

Cell

21

Page 22: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

What have we learned from this?

22

• Thread block size compatibility

• Use of __constant pointer

• Endianness of data

Functional portability: almost a free lunch

• Architecture specific optimizations

• Program specific sensitivity

• Interaction between optimizations

Performance portability: no free lunch

Page 23: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

Conclusion

• OpenCL can deliver functional portability, but no performance portability

• Developer should consider optimizations relevant to all target architectures

• Auto-tuning is faced with a large search space

23

Page 24: An Experimental Study on Performance Portability of OpenCL ...saahpc.ncsa.illinois.edu/10/presentations/day1/... · An Experimental Study on Performance Portability of OpenCL Kernels

OpenCL Appendix C

Likewise, the heterogeneity of computing architectures might mean that a particular loop construct might execute at an acceptable speed on the CPU but very poorly on a GPU, for example. CPUs are designed in general to work well on latency sensitive algorithms on single threaded tasks, whereas common GPUs may encounter extremely long latencies, potentially orders of magnitude worse. A developer interested in writing portable code may find that it is necessary to test his design on a diversity of hardware designs to make sure that key algorithms are structured in a way that works well on a diversity of hardware. We suggest favoring more work-items over fewer. It is anticipated that over the coming months and years experience will produce a set of best practices that will help foster a uniformly favorable experience on a diversity of computing devices.

24