Exploiting auto-tuning to analyze and improve performance portability on many-core architectures

James Price & Simon McIntosh-Smith

University of Bristol - High Performance Computing Group http://uob-hpc.github.io

Funded in part by Imagination Technologies

Overview

• Performance portability is one of the key concerns for developers targeting many different architectures

• Current work in this area has provided mixed results

• Here we propose a method of rigorously analyzing performance portability by exploiting the black-box nature of auto-tuning

• We also present a simple technique that can improve performance portability when using auto-tuning

Devices

Vendor  Product              Architecture     Compute units
NVIDIA  GeForce GTX 580      Fermi (GF110)    16
NVIDIA  GeForce GTX 680      Kepler (GK104)   8
NVIDIA  GeForce GTX 780 Ti   Kepler (GK110)   15
NVIDIA  GeForce GTX 980 Ti   Maxwell (GM200)  22
NVIDIA  GeForce GTX 1080 Ti  Pascal (GP102)   28
AMD     Radeon HD 7970       Tahiti           32
AMD     Radeon R9 290X       Hawaii           44
AMD     Radeon R9 Fury X     Fiji             64
AMD     Radeon RX 480        Ellesmere        36
Intel   Core i5-3550         Ivy Bridge       4
Intel   Core i5-4590         Haswell          4
Intel   Core i5-6600         Skylake          4

Benchmarks

• Using three benchmarks from different domains

• Implemented with OpenCL (using PyOpenCL)

‣ Need portability but also explicit control over kernel implementation

• Expose large set of possible implementation decisions as tuning options

• Dynamically generate kernels to provide greater flexibility
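As an illustration of exposing implementation decisions as tuning options and generating kernels dynamically, the sketch below builds OpenCL C source for a toy `y = a*x + y` kernel from a dictionary of options. The kernel itself, the option names (`madd`, `unroll`, `embed_n`) and the function `generate_kernel` are hypothetical and are not taken from the benchmarks in this talk. In a PyOpenCL program the resulting string would typically be compiled with `pyopencl.Program(ctx, source).build()`.

```python
def generate_kernel(options, n):
    """Return OpenCL C source for y[i] = a*x[i] + y[i], specialised by `options`.

    Hypothetical tuning options (illustration only):
      madd    -- "muladd", "mad" or "fma": how the multiply-add is expressed
      unroll  -- number of elements handled per work-item
      embed_n -- bake the problem size into the source as a constant
    """
    madd = {
        "muladd": "a * x[{i}] + y[{i}]",
        "mad": "mad(a, x[{i}], y[{i}])",
        "fma": "fma(a, x[{i}], y[{i}])",
    }[options["madd"]]

    unroll = options["unroll"]
    if options["embed_n"]:
        size_arg, size_expr = "", str(n)
    else:
        size_arg, size_expr = ", const uint n", "n"

    # One guarded statement per unrolled element.
    body = []
    for u in range(unroll):
        idx = f"(base + {u})"
        expr = madd.format(i=idx)
        body.append(f"    if ({idx} < {size_expr}) y[{idx}] = {expr};")

    return "\n".join([
        "__kernel void axpy(const float a, __global const float *x,",
        f"                   __global float *y{size_arg})",
        "{",
        f"    const uint base = get_global_id(0) * {unroll};",
        *body,
        "}",
    ])


source = generate_kernel({"madd": "fma", "unroll": 2, "embed_n": True}, 1024)
print(source)
```

Generating the source as text lets one search space cover choices, such as unroll factors or embedded constants, that would otherwise need many hand-written variants.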

Benchmark #1: Jacobi method

• Work-group size / parallel decomposition

• Memory layout / memory access pattern

• Loop unrolling

• *+ vs mad vs fma

• Branching vs masks

• Division by diagonal (inline or precompute)

• Address spaces of input vectors

• Embedding values into kernel as constants

Solve Ax = b

Split matrix A into diagonal D and remainder R

$x_{i+1} = D^{-1}(b - R x_i)$
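For reference, the iteration can be written in a few lines of plain Python. This is a sketch of the mathematical method only, run on a small made-up diagonally dominant system; it does not reflect the tuned OpenCL kernels, and the function name `jacobi` and the fixed iteration count are illustrative choices.

```python
def jacobi(A, b, iterations=50):
    """Approximate the solution of A x = b with a fixed number of Jacobi sweeps."""
    n = len(b)
    x = [0.0] * n
    for _ in range(iterations):
        x_new = [0.0] * n
        for row in range(n):
            # (R x)[row]: contribution of the off-diagonal entries of A.
            rx = sum(A[row][col] * x[col] for col in range(n) if col != row)
            # Division by the diagonal entry applies D^-1.
            x_new[row] = (b[row] - rx) / A[row][row]
        x = x_new
    return x


# Made-up 2x2 system whose exact solution is x = (1, 2).
A = [[4.0, 1.0],
     [2.0, 5.0]]
b = [6.0, 12.0]
solution = jacobi(A, b)
print(solution)
```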

Benchmark #2: Bilateral filter

• Work-group size

• Tile size / tile layout

• Prefetch pixels into private/local memory

• Buffers vs images

• *+ vs mad vs fma

• Loop interchange

• Native math functions

• Embedding values into kernel as constants

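$$O(x, y) = \frac{\sum_{i=-r}^{r} \sum_{j=-r}^{r} I(x+i, y+j) \cdot W(x, y, i, j)}{\sum_{i=-r}^{r} \sum_{j=-r}^{r} W(x, y, i, j)}$$

$$W(x, y, i, j) = \exp\left( -\frac{\sqrt{i^2 + j^2}}{2\sigma_d} - \frac{\lVert I(x, y) - I(x+i, y+j) \rVert}{2\sigma_r} \right)$$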

Benchmark #3: BUDE

• Parallel decomposition / work-group size

• Data layout / memory access patterns

• Loop interchange

• Address space of molecules / forcefield

• Precomputing forcefield coefficients

• Native math functions

• Embedding values into kernel as constants

Analysis approach

• Run auto-tuner to optimize the kernel for each device individually

‣ Genetic algorithm used for search

‣ Tuned for 24 hours on each device (examines ~10K kernels)

• This produces a set of architecture-specific kernels

• We then run each architecture-specific kernel on every other device and measure efficiency

• Efficiency is defined as a kernel's performance on a device as a fraction of the best kernel observed on that device
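A minimal sketch of the efficiency calculation, assuming performance is measured as runtime (lower is better), so that efficiency is the best runtime observed on a device divided by the kernel's runtime on that device. The device names and timings are made up for illustration, `None` marks a kernel that failed to run, and `efficiency_matrix` is a hypothetical helper rather than part of the authors' tooling.

```python
def efficiency_matrix(runtimes):
    """runtimes[tuned_for][running_on] -> seconds, or None if the kernel failed.

    Returns eff[running_on][tuned_for] as a fraction of the best runtime
    observed on that device (None where the kernel failed).
    """
    devices = list(runtimes)
    eff = {}
    for dev in devices:
        observed = [runtimes[k][dev] for k in devices
                    if runtimes[k][dev] is not None]
        best = min(observed)
        eff[dev] = {
            k: (best / runtimes[k][dev]
                if runtimes[k][dev] is not None else None)
            for k in devices
        }
    return eff


# Made-up timings in seconds: runtimes[tuned_for][running_on].
runtimes = {
    "gpu": {"gpu": 1.0, "cpu": 8.0},
    "cpu": {"gpu": 4.0, "cpu": 2.0},
}
eff = efficiency_matrix(runtimes)
for running_on, row in eff.items():
    print(running_on, row)
```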

Jacobi efficiencies

Rows: device the kernel is running on. Columns: device the kernel was tuned for. X marks entries with no reported efficiency.

running on          580     680  780 Ti  980 Ti 1080 Ti HD 7970 R9 290X  Fury X  RX 480 Ivy Br. Haswell Skylake
GTX 580            100%     92%     93%     71%     91%     23%     91%     24%     87%     21%     12%     90%
GTX 680             87%    100%     66%     69%     97%     19%     79%     20%     74%     14%     16%     69%
GTX 780 Ti          74%     71%    100%     51%     58%     20%     84%     20%     78%     18%     16%     93%
GTX 980 Ti          53%     98%     95%    100%     98%     21%     97%     23%     95%     41%     39%     90%
GTX 1080 Ti         56%     98%     90%     92%    100%     33%     97%     37%     93%     40%     43%     90%
HD 7970               X     55%     68%     53%     68%    100%     99%    100%     99%     19%     37%     45%
R9 290X               X     38%     33%     33%     38%    100%    100%     99%     99%     14%     20%     28%
R9 Fury X             X     26%     32%     49%     28%     99%     93%    100%     97%      6%     11%     14%
RX 480                X     57%     62%     36%     62%    100%    100%     99%    100%     13%     17%     21%
Ivy Bridge CPU       1%      1%      5%      5%      1%     73%      2%     62%      2%    100%      2%      2%
Haswell CPU          5%      5%      5%      5%      5%     85%     70%     80%     93%     53%    100%     99%
Skylake CPU          4%      4%      4%      5%      4%     71%     57%     65%     81%     44%     97%    100%

Jacobi - NVIDIA

Rows: device the kernel is running on. Columns: device the kernel was tuned for.

running on          580     680  780 Ti  980 Ti 1080 Ti
GTX 580            100%     92%     93%     71%     91%
GTX 680             87%    100%     66%     69%     97%
GTX 780 Ti          74%     71%    100%     51%     58%
GTX 980 Ti          53%     98%     95%    100%     98%
GTX 1080 Ti         56%     98%     90%     92%    100%

• Work-group size differs between devices

• Other drops due to address spaces and memory access pattern

Jacobi - AMD

Rows: device the kernel is running on. Columns: device the kernel was tuned for.

running on      HD 7970 R9 290X  Fury X  RX 480
HD 7970            100%     99%    100%     99%
R9 290X            100%    100%     99%     99%
R9 Fury X           99%     93%    100%     97%
RX 480             100%    100%     99%    100%

• Most parameters uniform between all AMD devices

• Fury X suffers a little with expensive division

Jacobi - Intel

Rows: device the kernel is running on. Columns: device the kernel was tuned for.

running on      Ivy Br. Haswell Skylake
Ivy Bridge CPU     100%      2%      2%
Haswell CPU         53%    100%     99%
Skylake CPU         44%     97%    100%

• Ivy Bridge performs poorly with fma builtin

• Vectorisation width differs

Jacobi efficiencies

Rows: device the kernel is running on. Columns: device the kernel was tuned for. X marks entries with no reported efficiency. WCE is the worst-case (minimum) efficiency of each tuned kernel across the devices.

running on          580     680  780 Ti  980 Ti 1080 Ti HD 7970 R9 290X  Fury X  RX 480 Ivy Br. Haswell Skylake
GTX 580            100%     92%     93%     71%     91%     23%     91%     24%     87%     21%     12%     90%
GTX 680             87%    100%     66%     69%     97%     19%     79%     20%     74%     14%     16%     69%
GTX 780 Ti          74%     71%    100%     51%     58%     20%     84%     20%     78%     18%     16%     93%
GTX 980 Ti          53%     98%     95%    100%     98%     21%     97%     23%     95%     41%     39%     90%
GTX 1080 Ti         56%     98%     90%     92%    100%     33%     97%     37%     93%     40%     43%     90%
HD 7970               X     55%     68%     53%     68%    100%     99%    100%     99%     19%     37%     45%
R9 290X               X     38%     33%     33%     38%    100%    100%     99%     99%     14%     20%     28%
R9 Fury X             X     26%     32%     49%     28%     99%     93%    100%     97%      6%     11%     14%
RX 480                X     57%     62%     36%     62%    100%    100%     99%    100%     13%     17%     21%
Ivy Bridge CPU       1%      1%      5%      5%      1%     73%      2%     62%      2%    100%      2%      2%
Haswell CPU          5%      5%      5%      5%      5%     85%     70%     80%     93%     53%    100%     99%
Skylake CPU          4%      4%      4%      5%      4%     71%     57%     65%     81%     44%     97%    100%
WCE                    -      1%      4%      5%      1%     19%      2%     20%      2%      6%      2%      2%

Bilateral efficiencies

Rows: device the kernel is running on. Columns: device the kernel was tuned for. X marks entries with no reported efficiency. WCE is the worst-case (minimum) efficiency of each tuned kernel across the devices.

running on          580     680  780 Ti  980 Ti 1080 Ti R9 290X  Fury X  RX 480 Ivy Br. Haswell Skylake
GTX 580            100%     96%     95%    100%     93%     70%     70%     70%      8%     19%     19%
GTX 680             98%    100%     96%     97%     90%     68%     71%     70%      6%     13%     13%
GTX 780 Ti          99%     99%    100%     99%     97%     73%     76%     69%      7%     18%     18%
GTX 980 Ti         100%     98%     98%    100%     97%     85%     85%     84%      7%     17%     17%
GTX 1080 Ti         99%     97%     98%     98%    100%     86%     87%     87%      7%     12%     12%
R9 290X             35%     37%     37%     35%       X    100%    100%    100%      6%     11%     11%
R9 Fury X           31%     38%     37%     31%       X    100%    100%    100%      5%      8%      8%
RX 480              29%     45%     45%     37%       X    100%    100%    100%      5%      9%      9%
Ivy Bridge CPU      14%     74%     74%     14%     14%     28%     28%     29%    100%     98%     98%
Haswell CPU         73%     72%     71%     72%     72%     27%     27%     27%     99%    100%    100%
Skylake CPU         74%     74%     73%     71%     73%     25%     26%     26%     99%    100%    100%
WCE                 14%     37%     37%     14%       -     25%     26%     26%      5%      8%      8%

BUDE efficiencies

Rows: device the kernel is running on. Columns: device the kernel was tuned for. X marks entries with no reported efficiency. WCE is the worst-case (minimum) efficiency of each tuned kernel across the devices.

running on          580     680  780 Ti  980 Ti 1080 Ti HD 7970 R9 290X  Fury X  RX 480 Ivy Br. Haswell Skylake
GTX 580            100%    100%       X     91%     91%     90%     84%     91%     68%     15%     81%     74%
GTX 680             98%    100%     81%     66%     48%     78%     50%     52%     30%      8%     81%     64%
GTX 780 Ti            X       X    100%     83%     35%     33%     46%     32%     32%     13%     37%     39%
GTX 980 Ti            X       X     89%    100%     77%     66%     89%     79%     90%     21%     90%     85%
GTX 1080 Ti           X       X     83%    100%    100%     74%     97%     86%     96%     29%     93%     87%
HD 7970               X       X       X       X     71%    100%     75%     72%     72%     10%     78%     84%
R9 290X               X       X       X       X     61%     73%    100%     66%     67%     10%     64%     61%
R9 Fury X             X       X       X       X     71%     86%     86%    100%     50%     12%     48%     47%
RX 480                X       X       X       X     75%     95%     98%    100%    100%     12%     92%     92%
Ivy Bridge CPU      57%     57%       X       X     62%     15%     93%     16%     63%    100%     98%     98%
Haswell CPU         51%     51%       X       X     61%     15%     93%     14%     61%     98%    100%     98%
Skylake CPU         43%     43%       X       X     57%     12%     93%     13%     57%     99%     98%    100%
WCE                    -       -       -       -     35%     12%     46%     13%     30%      8%     37%     39%

Multi-objective auto-tuning

• Extend tuning process to consider multiple devices at once

• Each time a kernel is generated, the auto-tuner evaluates it on every target device

• The performance values are then reduced into a single number representing the overall ‘fitness’

• We use worst-case efficiency for this fitness function
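The reduction to a single fitness value could look like the sketch below. It assumes runtimes as the performance measure, per-device best runtimes already known (for example from earlier single-device tuning), and a fitness of zero for any kernel that fails on a device. The candidate names and numbers are invented for illustration, and this is not the authors' tuner.

```python
def worst_case_efficiency(times, best_times):
    """Fitness of one kernel: its lowest efficiency over all target devices.

    times[device]      -- runtime of this kernel, or None if it failed
    best_times[device] -- best runtime known for that device
    """
    efficiencies = []
    for device, best in best_times.items():
        t = times.get(device)
        if t is None:
            # Assumption: a kernel that fails anywhere gets zero fitness.
            return 0.0
        efficiencies.append(best / t)
    return min(efficiencies)


# Made-up runtimes in seconds for three hypothetical devices.
best_times = {"gpu_a": 1.0, "gpu_b": 2.0, "cpu": 4.0}
candidates = {
    "specialised": {"gpu_a": 1.0, "gpu_b": 2.5, "cpu": 40.0},
    "balanced": {"gpu_a": 1.25, "gpu_b": 2.5, "cpu": 5.0},
    "broken": {"gpu_a": 1.1, "gpu_b": None, "cpu": 4.0},
}
fitness = {name: worst_case_efficiency(t, best_times)
           for name, t in candidates.items()}
winner = max(fitness, key=fitness.get)
print(fitness, winner)
```

Taking the minimum rather than the mean means a kernel cannot buy a high score with excellent results on some devices while collapsing on another, which is the behaviour the single-device tables expose.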

Multi-objective tuning results

Summary

• Over-optimisation hurts performance portability

• Auto-tuning can be a great way to expose these issues

• It can also help generate performance portable kernels

• Future work will look at tuning across different input/problem configurations