future vector microprocessor xtensions...

35
Timothy Hayes , Oscar Palomar, Osman Unsal, Adrian Cristal, Mateo Valero FUTURE VECTOR MICROPROCESSOR EXTENSIONS FOR DATA AGGREGATIONS ISCA-43 (2016) [email protected]

Upload: others

Post on 09-Jun-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Timothy Hayes, Oscar Palomar,

Osman Unsal, Adrian Cristal, Mateo Valero

FUTURE VECTOR MICROPROCESSOR

EXTENSIONS FOR DATA AGGREGATIONS

ISCA-43 (2016)

[email protected]

Page 2: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Motivation

Data generation growing at an exponential rate

Increasing demand to summarise/aggregate data quickly

Since ~2005, frequency scaling no longer viable

Explicit forms of parallelism must be used

Data-level parallelism (DLP) is excellent when available

Vector SIMD ISAs are highly efficient

Compact representation – Implicit parallelism – Scalable

Energy-efficient hardware implementations

Vector SIMD ISA perfect when DLP is regular

Many algorithms need transformations to be regular

Transformations often hurt performance

Future Vector Microprocessor Extensions for Data Aggregations2

Page 3: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Contributions

We examine applicability of vector SIMD to aggregations

Propose and evaluate different algorithms

1. With transformations to use typical vector instructions

2. Vectorise directly using our novel vector instructions

Evaluate with many datasets

Five unique distributions

Twenty-two cardinalities

Speedups between 2.7x and 7.6x over scalar baseline

Future Vector Microprocessor Extensions for Data Aggregations

This work is an extension to our HPCA-21 article– VSR Sort: A Novel Vectorised Sorting Algorithm. (2015) Hayes et al.

I will skip many things due to time constraints

3

Page 4: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations4

Page 5: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

What is a Data Aggregation?

Frequently occurring operation found in

SQL GROUP BY queries

MapReduce

Statistical Languages

OLAP Cubes

Reduction of key-value pairs

Aggregation function, e.g.

SUM

MINIMUM

MAXIMUM

AVERAGE

Future Vector Microprocessor Extensions for Data Aggregations5

Page 6: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

7 5 13 25 85 33 9 44

valu

es

0 1 3 2 2 0 3 1keys

40+

What is a Data Aggregation?

Future Vector Microprocessor Extensions for Data Aggregations6

49+ 110+ 22+

Page 7: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations7

Page 8: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Query and Algorithms

Scalar baseline – no vector instructions

Regular DLP – Typical vector instructions

A. Polytable

B. Standard Sorted Reduce [not in presentation]

Irregular DLP – Novel vector instructions

A. Advanced Sorted Reduce

B. Monotable

C. Partially Sorted Monotable

Future Vector Microprocessor Extensions for Data Aggregations

SELECT key, COUNT(*), SUM(value)

FROM table GROUP BY key

8

Page 9: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Datasets: Five Distributions

1. Uniform

2. Sorted

3. Sequential

4. Heavy Hitter

5. Zipfian

Future Vector Microprocessor Extensions for Data Aggregations9

4 8 2 3 6 7 4 4 1 6 6 7 1 5 2 4 8 1 3 1 2 2 3 3 7 8 5 5 7 6 5 8

1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8

1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8 1 2 3 4 5 6 7 8

3 8 2 3 6 7 3 4 1 3 6 3 3 5 3 4 8 3 3 1 3 2 3 3 7 3 3 5 7 3 5 8

6 8 4 3 2 1 2 2 3 4 8 7 6 1 2 7 3 1 2 5 4 1 5 1 3 1 1 1 1 2 1 5

Page 10: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Datasets: Cardinalities

Number of unique keys within dataset, e.g.

N = 10,000,000

C = 4 10,000,000

Grouped into four cardinality divisions1. Low cardinalities – many repeated keys

2. Low-normal cardinalities

3. High-normal cardinalities

4. High cardinalities – many unique keys

Future Vector Microprocessor Extensions for Data Aggregations10

N=8, C=11 1 1 1 1 1 1 1

N=8, C=21 1 2 1 2 2 2 1

N=8, C=42 1 1 3 4 2 4 1

N=8, C=83 2 8 5 4 6 1 7

Page 11: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Simulation Framework

Custom Simulation Framework

PTLsim – 32 nm Westmere microarchitecture

DRAMSim2 – DDR3-1333

Extended vector SIMD support

Heavily influenced from classical vector machines, e.g. CRAY-1

Emphasis on integer operations

16x vector registers with 64x 64bit elements

Pipelined functional units with 4x lockstepped parallel lanes

Masked operations

Indexed memory operations, i.e. gather/scatter

Integrated in out-of-order superscalar pipeline

Future Vector Microprocessor Extensions for Data Aggregations11

Page 12: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations12

Page 13: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Scalar Baseline

Future Vector Microprocessor Extensions for Data Aggregations

1 5 5 3

27 19 43 31

0

0

0

0

0

keys

values

table

1

2

3

4

5

13

31

Page 14: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Scalar Baseline – Results

0

15

30

45

60

75

90

105

120

135

4 9

19 38 76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

Future Vector Microprocessor Extensions for Data Aggregations14

Page 15: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations15

Page 16: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Polytable

Future Vector Microprocessor Extensions for Data Aggregations

1 5 5 3

27 19 43 31

0

0

0

0

0

keys

values

table

1

2

3

4

5

16

process m key-values

gather-modify-scatter conflict

Page 17: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Polytable

Future Vector Microprocessor Extensions for Data Aggregations

0

0

0

0

0

m local tables

1

2

3

4

5

0

0

0

0

0

0

0

0

0

0

0

0

0

0

0

17

1 5 5 3

27 19 43 31

keys

values

✓31

43 19

27

Page 18: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Polytable – Results

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

scalar

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

polytable

Future Vector Microprocessor Extensions for Data Aggregations18

Page 19: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations19

Page 20: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Sorted Reduce

Future Vector Microprocessor Extensions for Data Aggregations

5 1 5 3

27 19 43 31

keys

values

Our new sorting algorithm from HPCA-21

Based on vectorised radix sort

Uses novel vector SIMD instructions

Avoids gather-modify-scatter conflicts

Vector Prior Instances (VPI)

20

Page 21: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Sorted Reduce

Future Vector Microprocessor Extensions for Data Aggregations21

inp

ut

ou

tpu

t

2 2 3 1

0 1 0 0

least significant element

most significant element

5 1 5 3

27 19 43 31

keys

values

VPI

Page 22: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Sorted Reduce

Future Vector Microprocessor Extensions for Data Aggregations

sort

22

output

5 1 5 3

27 19 43 31

keys

values

1 + reduce

3 + reduce

5 + reduce

5 5 3 1

27 43 31 19

19

31

70

Page 23: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Sorted Reduce – Results

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

scalar advanced sorted reduce

Future Vector Microprocessor Extensions for Data Aggregations23

Page 24: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations24

Page 25: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Monotable

The Polytable algorithm needs to replicate tables

Avoids gather-modify-scatter conflicts

Hurts performance

The Sorted Reduce algorithm uses VSR Sort

VSR Sort uses VPI to resolve gather-modify-scatter conflicts

Could VPI also be used to optimise Polytable?

VPI is not sufficient, but…

Hardware could be reused

Create similar-style but different instruction

Vector Group Aggregate: SUM (VGAsum)

Similar to VPI but uses second vector of values

Vectorise scalar baseline without transformations

Future Vector Microprocessor Extensions for Data Aggregations25

Page 26: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Monotable

Future Vector Microprocessor Extensions for Data Aggregations26

va

lue

ou

tpu

t

6 12 3 5ke

y

2 0 2 2

VGAsummost significant element

least significant element

6 12 9 14

Page 27: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Monotable – Results

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

scalar monotable

Future Vector Microprocessor Extensions for Data Aggregations27

Page 28: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations28

Page 29: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Partially Sorted Monotable

Losing locality hurts performance

Fully sorting can have a high overhead

VSR Sort has O(k.n) complexity

If we reduce the ‘k’, we reduce the overhead

92,345

0001,0110,1000,1011,1001

Future Vector Microprocessor Extensions for Data Aggregations

Ignore LSBssort on MSBs (partition)

k bits

key

29

Page 30: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Partially Sorted Monotable – Results

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

0

15

30

45

60

75

90

105

120

135

4 9

19

38

76

152

305

610

1,220

2,441

4,882

9,765

19,531

39,062

78,125

156,250

312,500

625,000

1,250,000

2,500,000

5,000,000

10,000,000

low low-normal high-normal high

cycl

es

pe

r tu

ple

uniform sorted sequential hhitter zipf

scalar partially sorted monotable

Future Vector Microprocessor Extensions for Data Aggregations30

Page 31: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations31

Page 32: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Summary – Best Speedups Overall

1

2

3

4

5

6

7

8

low low-normal high-normal high

aver

age

spee

du

p o

ver

scal

ar

cardinality

uniform sorted sequential hhitter zipf

Future Vector Microprocessor Extensions for Data Aggregations

2.7x

7.6x

32

Page 33: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Summary – Best Speedups Overall

1

2

3

4

5

6

7

8

low low-normal high-normal high

aver

age

spee

du

p o

ver

scal

ar

cardinality

uniform sorted sequential hhitter zipfm

on

o

po

ly

mo

no

mo

no

mo

no

mo

no

po

ly

mo

no

mo

no

mo

no

ps-

mo

no

sorte

d r

ed

uce

mo

no

ps-

mo

no

ps-

mo

no

ps-

mo

no

mo

no

mo

no

ps-

mo

no

ps-

mo

no

Future Vector Microprocessor Extensions for Data Aggregations

2.7x

7.6x

33

Page 34: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Presentation Contents

I. Motivation

II. What is Data Aggregation?

III. Experimental Setup

IV. Algorithms

1. Scalar Baseline

2. Polytable

3. Sorted Reduce

4. Monotable

5. Partially Sorted Monotable

6. Summary

V. Conclusions

Future Vector Microprocessor Extensions for Data Aggregations34

Page 35: FUTURE VECTOR MICROPROCESSOR XTENSIONS …isca2016.eecs.umich.edu/wp-content/uploads/2016/07/7A-2.pdf2016/07/07  · Future Vector Microprocessor Extensions for Data Aggregations This

Conclusions

Aggregating data quickly is important

DLP & SIMD is an attractive way to accelerate it

Aggregation algorithms are simple but DLP is irregular

We proposed various algorithms

A. Use transformations and typical vector SIMD instructions

B. Avoid transformations using our novel vector instructions

Evaluated using many data distributions and cardinalities

Speedups between 2.7x and 7.6x over scalar baseline

Best solution is dependent on input characteristics

Future Vector Microprocessor Extensions for Data Aggregations35