How SIMD Width Affects Energy Efficiency: A Case Study on Sorting

Hiroshi Inoue
IBM Research – Tokyo

COOLChips XIX @ Yokohama, Japan | April 20-22, 2016
© 2016 IBM Corporation



Goal & Approach

Goal:
- To understand how SIMD width affects execution time and energy consumption
  - Not to propose a new energy-efficient algorithm or system

Approach:
- To take SIMD mergesort as an example
- To measure execution time, power, and energy (= execution time × power) with various hardware configurations on a commodity PC
  - SIMD width (8-way AVX, 4-way SSE, or 1-way scalar)
  - Memory bandwidth
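The energy definition above (execution time × power) can be illustrated with a short sketch. This is not the measurement setup used in the study: it assumes a hypothetical log of (timestamp, watts) samples such as an external power meter might produce, and integrates them with the trapezoidal rule. The function name and the sample values are invented for illustration.

```python
def energy_joules(samples):
    """Integrate (time_sec, power_watt) samples with the trapezoidal rule."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total


# Hypothetical readings (not from the talk): a constant 50 W for 4 seconds.
samples = [(0.0, 50.0), (1.0, 50.0), (2.0, 50.0), (4.0, 50.0)]

duration = samples[-1][0] - samples[0][0]   # execution time in seconds
energy = energy_joules(samples)             # energy in joules
average_power = energy / duration           # so energy == duration * average_power

print(energy, average_power)
```

With a constant power draw the integral reduces to execution time × power, which is the form used on the slide.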


SIMD mergesort

Combines the advantages of sorting networks (SIMD friendly) and the usual mergesort (lower computational complexity):

- Usual comparison-based mergesort in memory
  - computational complexity of O(N log(N))
  - mostly sequential memory accesses
- Vector-register-level bitonic merge operation implemented with SIMD min/max instructions
  - data parallelism
  - fewer conditional branches

→ A wider vector gives a sub-linear reduction in the number of instructions
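One common way to combine the two ideas is sketched below. It is an assumption for illustration, not necessarily the implementation evaluated in the talk. The loop walks two sorted arrays in memory as an ordinary merge would, but moves `W` values at a time through a register-level merge kernel. In this Python model the kernel is stood in by `sorted()`; on real hardware it would be the branch-free min/max network. `W`, `merge_registers`, and `simd_style_merge` are hypothetical names, and both inputs are assumed to be non-empty with lengths that are multiples of `W`.

```python
W = 4  # modeled vector width (4-way, e.g. four 32-bit integers per register)


def merge_registers(a, b):
    """Stand-in for the register-level kernel: returns (lower W, upper W)."""
    merged = sorted(a + b)
    return merged[:W], merged[W:]


def simd_style_merge(xs, ys):
    """Merge two non-empty sorted lists whose lengths are multiples of W."""
    out = []
    i, j = W, W            # next unread positions in xs and ys
    held = xs[:W]          # values carried over between kernel calls
    incoming = ys[:W]
    while True:
        low, held = merge_registers(held, incoming)
        out.extend(low)    # the W smallest loaded values are final
        if i >= len(xs) and j >= len(ys):
            break
        # Refill from the input whose next element is smaller.
        if j >= len(ys) or (i < len(xs) and xs[i] <= ys[j]):
            incoming, i = xs[i:i + W], i + W
        else:
            incoming, j = ys[j:j + W], j + W
    out.extend(held)       # flush the last carried-over values
    return out


print(simd_style_merge([1, 4, 7, 8, 10, 12, 14, 16],
                       [2, 3, 5, 6, 9, 11, 13, 15]))
```

In this structure the only data-dependent branch is the refill decision, taken once per `W` output values rather than once per value.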


SIMD-based merge for values in two vector registers (example of bitonic merge)

[Figure: a three-stage bitonic merge network (stage 1, stage 2, stage 3) between the input and output registers.]

- Input: two vector registers, each containing four presorted values (the example uses 1 4 7 8 and 2 3 5 6; the second register appears reversed as 6 5 3 2)
- SIMD merging: one SIMD comparison plus "shuffle" operations for each stage, without conditional branches
- Output: the eight values in the two vector registers are now sorted
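The bitonic merge described above can be modeled in a few lines. This is a minimal sketch in plain Python, not SIMD code and not the talk's implementation: lists stand in for vector registers, the element-wise `min`/`max` comprehensions stand in for SIMD min/max instructions, and the index gathering stands in for shuffles. The function name is hypothetical.

```python
def bitonic_merge(a, b):
    """Merge two sorted lists of equal power-of-two length w.

    Returns the 2*w values as two sorted "registers" of w values each.
    """
    w = len(a)
    v = list(a) + list(reversed(b))   # reversing b makes the sequence bitonic
    n = 2 * w
    distance = w
    while distance >= 1:              # log2(2*w) stages in total
        lo_idx = [i for i in range(n) if not (i & distance)]
        hi_idx = [i + distance for i in lo_idx]
        left = [v[i] for i in lo_idx]     # models a shuffle
        right = [v[i] for i in hi_idx]    # models a shuffle
        mins = [min(x, y) for x, y in zip(left, right)]   # models SIMD min
        maxs = [max(x, y) for x, y in zip(left, right)]   # models SIMD max
        for k, i in enumerate(lo_idx):
            v[i] = mins[k]
        for k, i in enumerate(hi_idx):
            v[i] = maxs[k]
        distance //= 2
    return v[:w], v[w:]


print(bitonic_merge([1, 4, 7, 8], [2, 3, 5, 6]))
# → ([1, 2, 3, 4], [5, 6, 7, 8])
```

No stage depends on a comparison outcome for control flow, which is why the real kernel needs no conditional branches. For `w = 4` the loop runs three stages, matching the three stages in the example.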


Evaluation

Hardware: a commodity PC + external power meter
- Core i7 4770 (Haswell), 3.4 GHz, 4 cores, 8 threads
- One or two 4-GB DDR3-1333 DIMMs (single or dual channel)
- Power meter: Yokogawa WT-210 (for system-level power)
- Red Hat Enterprise Linux 6.5, gcc-5.2

Tested algorithms (for sorting 256 M random 32-bit integers)
- SIMD mergesort with scalar (1-way), SSE (4-way), or AVX (8-way)
- Radix sort (scalar)
- Quicksort (std::sort, scalar)


Summary of observations

1. Execution time
   - Wider SIMD gives larger speedup (up to 10x)
2. Power
   - SIMD increases power only up to 15%
3. Energy (= execution time × power)
   - Lower energy consumption with wider SIMD
4. Power and execution time with lower bandwidth-to-compute ratios
   - Wider SIMD may yield better performance with lower power!

Refer to the paper (not covered today):
- Energy consumption with various bandwidth-to-compute ratios (achieved using DVFS)
  - Need to balance core compute performance and memory bandwidth to minimize energy consumption


Execution time (scalar vs. SIMD with 1 thread)

[Chart: execution time (sec, axis 0 to 30; lower is faster) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD.]

SIMD mergesort with 1 thread:
- 9.7x speedup by 8-way SIMD
- 6.8x speedup by 4-way SIMD

✓ Wider SIMD gave larger speedup, as expected


Execution time (scalar vs. SIMD with 8 threads)

[Chart: execution time (sec, axis 0 to 30; lower is faster) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD.]

With 8 threads:
- 5.0x speedup by 8-way SIMD
- 4.4x speedup by 4-way SIMD

✓ Smaller gains from SIMD due to the memory bandwidth bottleneck
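Why a memory bandwidth bottleneck shrinks the gain from SIMD can be shown with a toy model. This is a sketch under loudly stated assumptions, not a model from the talk: run time is taken as the larger of compute time and a fixed memory-transfer time, threads are assumed to scale compute ideally, and every number below is hypothetical.

```python
def modeled_time(scalar_compute_sec, threads, simd_gain, memory_floor_sec):
    """Toy model: run time is the larger of compute time and memory time."""
    compute = scalar_compute_sec / (threads * simd_gain)
    return max(compute, memory_floor_sec)


def simd_speedup(threads, simd_gain,
                 scalar_compute_sec=32.0,   # hypothetical scalar compute time
                 memory_floor_sec=1.0):     # hypothetical time to move the data
    scalar = modeled_time(scalar_compute_sec, threads, 1.0, memory_floor_sec)
    simd = modeled_time(scalar_compute_sec, threads, simd_gain, memory_floor_sec)
    return scalar / simd


print(simd_speedup(threads=1, simd_gain=8.0))  # compute-bound: full gain (8.0)
print(simd_speedup(threads=8, simd_gain=8.0))  # memory-bound: reduced gain (4.0)
```

Once the SIMD version reaches the memory floor, further reductions in compute time no longer shorten the run, so the observed speedup over scalar drops as threads are added.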


Execution time (8-way vs. 4-way)

[Chart: execution time (sec, axis 0.0 to 4.5; lower is faster) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD. Annotations: 42% speedup; 14% speedup.]

✓ 8-way SIMD (AVX) gave additional speedups over 4-way SIMD (SSE)


Power

[Chart: power (watt, axis 0 to 120; lower is better) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD. Annotation: only up to 15% increase in power.]

✓ The increase in power from using SIMD was not significant


Energy (= execution time × power) with 1 thread

[Chart: energy (joule, axis 0 to 1200; lower is better) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD.]

With 1 thread:
- 8.8x reduction in energy by 8-way SIMD
- 6.4x reduction in energy by 4-way SIMD

✓ Energy consumption was significantly reduced due to shorter execution time


Energy (= execution time × power) with 8 threads

[Chart: energy (joule, axis 0 to 1200; lower is better) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD.]

With 8 threads:
- 4.6x reduction in energy by 8-way SIMD
- 3.9x reduction in energy by 4-way SIMD

✓ Energy consumption was significantly reduced due to shorter execution time
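Because energy is execution time × power, the reported speedups and energy reductions imply a power ratio: power(SIMD) / power(scalar) = speedup / energy reduction. The following sketch applies that identity to the rounded figures quoted in the slides (9.7x and 6.8x speedup with 1 thread, 5.0x and 4.4x with 8 threads). The results are therefore approximate, and the helper name is hypothetical.

```python
def implied_power_ratio(speedup, energy_reduction):
    """Power(SIMD) / Power(scalar), derived from energy = time * power."""
    return speedup / energy_reduction


# (speedup, energy reduction) pairs as quoted in the slides (rounded values).
reported = {
    ("1 thread", "8-way"): (9.7, 8.8),
    ("1 thread", "4-way"): (6.8, 6.4),
    ("8 threads", "8-way"): (5.0, 4.6),
    ("8 threads", "4-way"): (4.4, 3.9),
}

ratios = {key: implied_power_ratio(s, e) for key, (s, e) in reported.items()}
for key, ratio in ratios.items():
    print(key, round(ratio, 2))
```

The implied ratios come out at roughly 1.06 to 1.13, which is in line with the stated power increase of at most 15%.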


Energy (= execution time × power), 8-way vs. 4-way

[Chart: energy (joule, axis 0 to 200; lower is better) at 1, 2, 4, and 8 threads (4 cores with 2-way SMT) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, mergesort without SIMD, quicksort (std::sort), and radix sort without SIMD.]

- 38% less energy: 42% less execution time with 3% higher power
- 16% less energy: 14% less execution time with 2% lower power

✓ Wider SIMD yielded better performance with lower power when using 8 threads
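Since energy is the product of execution time and power, relative changes in the two multiply. The sketch below applies this to the quoted "14% less execution time with 2% lower power" case; pairing that case with the 16% energy figure is an assumption based on the order of the annotations, and the function name is hypothetical.

```python
def energy_change(time_change, power_change):
    """Relative energy change from relative time and power changes.

    Inputs and output are fractions, e.g. -0.14 means 14% less.
    """
    return (1.0 + time_change) * (1.0 + power_change) - 1.0


change = energy_change(time_change=-0.14, power_change=-0.02)
print(round(change * 100, 1))  # about -15.7, i.e. roughly 16% less energy
```

The inputs are rounded percentages, so the result should only be expected to match the quoted energy figure approximately.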


Power and execution time

[Chart: throughput (1/sec, axis 0.0 to 1.2; higher is faster) versus power (watt, axis 10 to 110; lower is better) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, and mergesort without SIMD, with points for 1 thread, 2 threads, 4 threads, and 8 threads (4 cores with SMT), plus a marker for idle power.]

Annotations:
- Wider SIMD yields higher performance and power
- Wider SIMD yields shorter time and lower power

✓ Wider SIMD yielded better performance with lower power when using 8 threads


Power and execution time with reduced bandwidth

[Chart: throughput (1/sec, axis 0.0 to 1.2; higher is faster) versus power (watt, axis 10 to 110; lower is better) for mergesort with 8-way SIMD, mergesort with 4-way SIMD, and mergesort without SIMD, shown with 2 memory channels (full bandwidth) and with 1 memory channel (half bandwidth). Annotation: wider SIMD yields shorter time and lower power.]

✓ With lower memory bandwidth, the power reduction by SIMD was more significant


Summary & Future work

Summary of this study
- Wider SIMD gives larger speedup and less energy consumption
- Also, it potentially yields lower power by reducing the number of instructions when the bandwidth-to-compute ratio is low
- (It is important to balance core performance and memory bandwidth to achieve the best energy efficiency)

→ Increasing SIMD width will be important for future low-power processors, even with limited bandwidth-to-compute ratios

Future work
- To evaluate with other workloads, especially floating-point-intensive applications
