pslp: padded slp automatic vectorization€¦ · slp vectorization algorithm • input is scalar ir...

90
PSLP: Padded SLP Automatic Vectorization Vasileios Porpodas , Alberto Magni and Timothy M. Jones University of Cambridge University of Edinburgh EuroLLVM APR 2015 slide 1 of 17 www.cl.cam.ac.uk/ vp331/

Upload: others

Post on 08-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP: Padded SLP

Automatic Vectorization

Vasileios Porpodas†, Alberto Magni‡

and Timothy M. Jones†

University of Cambridge†

University of Edinburgh‡

EuroLLVM APR 2015

slide 1 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 2: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

Scalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 3: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

Scalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 4: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

0 1 2 3

Vector Reg. File

b. Vector ParallelismScalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

Vector Unit

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 5: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

0 1 2 3

Vector Reg. File

b. Vector ParallelismScalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

Vector Unit

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 6: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

• High Performance0 1 2 3

Vector Reg. File

b. Vector ParallelismScalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

Vector Unit

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 7: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

• High Performance

• Energy efficiency0 1 2 3

Vector Reg. File

b. Vector ParallelismScalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

Vector Unit

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 8: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

• High Performance

• Energy efficiency

• Supported since mid 90’s

• Frequent updates of vectorISAs

0 1 2 3

Vector Reg. File

b. Vector ParallelismScalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

Vector Unit

AVX2

SSE4

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 9: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Why SIMD Vectorization?

• Scalable parallelism

• High Performance

• Energy efficiency

• Supported since mid 90’s

• Frequent updates of vectorISAs

• Vector generation notdone in hardware

• Low-level programming orcapable compiler

0 1 2 3

Vector Reg. File

b. Vector ParallelismScalar Func. Units

Scalar Reg. File

a. ILP

FUFUFUFU

Vector Unit

AVX2

SSE4

slide 2 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 10: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI’00]

slide 3 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 11: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI’00]

• State-of-the-art straight-line code vectorizer

slide 3 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 12: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI’00]

• State-of-the-art straight-line code vectorizer

• Implemented in most compilers (including GCC andLLVM)

slide 3 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 13: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI’00]

• State-of-the-art straight-line code vectorizer

• Implemented in most compilers (including GCC andLLVM)

• In theory it should be a superset of loop-vectorizer

slide 3 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 14: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI’00]

• State-of-the-art straight-line code vectorizer

• Implemented in most compilers (including GCC andLLVM)

• In theory it should be a superset of loop-vectorizer• Unroll loop and vectorize with SLP• Even if loop-vectorizer fails, SLP could partly succeed

slide 3 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 15: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Straight-Line Code Vectorizer

• Superword Level Parallelism [Larsen PLDI’00]

• State-of-the-art straight-line code vectorizer

• Implemented in most compilers (including GCC andLLVM)

• In theory it should be a superset of loop-vectorizer• Unroll loop and vectorize with SLP• Even if loop-vectorizer fails, SLP could partly succeed

• In practice it is missing features present in the Loopvectorizer (Interleaved Loads, Predication)

slide 3 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 16: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR

Scalar Code

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 17: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR• Seed instructions are:

1 Consecutive Stores2 Reductions

Find vectorizationseed instructions1.

Scalar Code

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 18: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR• Seed instructions are:

1 Consecutive Stores2 Reductions

• Graph contains vectorizableisomorphic instructions

Find vectorizationseed instructions1.

Scalar Code

2.Generate graph of

isomorphic scalar groups

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 19: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR• Seed instructions are:

1 Consecutive Stores2 Reductions

• Graph contains vectorizableisomorphic instructions

• Cost: weighted instr. count

Find vectorizationseed instructions1.

CalculateVector Cost

CalculateScalar Cost3.

Scalar Code

2.Generate graph of

isomorphic scalar groups

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 20: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR• Seed instructions are:

1 Consecutive Stores2 Reductions

• Graph contains vectorizableisomorphic instructions

• Cost: weighted instr. count

• Check vectorization profitability

Find vectorizationseed instructions1.

CalculateVector Cost

CalculateScalar Cost3.

4.If<Vector Cost

Scalar Cost

Scalar Code

2.Generate graph of

isomorphic scalar groups

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 21: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR• Seed instructions are:

1 Consecutive Stores2 Reductions

• Graph contains vectorizableisomorphic instructions

• Cost: weighted instr. count

• Check vectorization profitability

• Emit vectors only if profitable

Find vectorizationseed instructions1.

CalculateVector Cost

CalculateScalar Cost3.

4.If<Vector Cost

Scalar Cost

Vectorize groups& emit vectors

YES

5.

DONE

Scalar Code

2.Generate graph of

isomorphic scalar groups

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 22: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Vectorization Algorithm

• Input is scalar IR• Seed instructions are:

1 Consecutive Stores2 Reductions

• Graph contains vectorizableisomorphic instructions

• Cost: weighted instr. count

• Check vectorization profitability

• Emit vectors only if profitable

Find vectorizationseed instructions1.

CalculateVector Cost

CalculateScalar Cost3.

4.If<Vector Cost

Scalar Cost

Vectorize groups& emit vectors

YES

5.

NO

DONE

Scalar Code

2.Generate graph of

isomorphic scalar groups

slide 4 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 23: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

When SLP Fails

1 Data DependenciesADD3ADD1 ADD2

ADD4

slide 5 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 24: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

When SLP Fails

1 Data Dependencies

2 Too manygather/scatterinstructions. Costsoutweigh benefits.

ADD3ADD1 ADD2ADD4

ADD1ADD2ADD3ADD4

Original Vectorized

ADD1 ADD2 ADD3 ADD4

Insert1Insert2Insert3Insert4

Extract1Extract2Extract3Extract4

slide 5 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 25: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

When SLP Fails

1 Data Dependencies

2 Too manygather/scatterinstructions. Costsoutweigh benefits.

3 Non-isomorphism

ADD3ADD1 ADD2ADD4

ADD1ADD2ADD3ADD4

Original Vectorized

ADD1 ADD2 ADD3 ADD4

Insert1Insert2Insert3Insert4

Extract1Extract2Extract3Extract4

ADD1 ADD2 ADD4MUL

slide 5 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 26: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

a. Input C code

...

...

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 27: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

S S

L

L

+

*

+

a. Input C code

...

...

7.

1. 5.

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

b. DFG

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 28: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

S S

L

L

SS

+

*

+

a. Input C code

...

...

7.

1. 5.

S S 0

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

b. DFG c. SLP internal graph d. SLP vectorized groups

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 29: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

S S

L

L

SS

+

*

+

a. Input C code

...

...

7.

1. 5.

S S

+ +

0

1

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

b. DFG c. SLP internal graph d. SLP vectorized groups

++

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 30: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

S S

L

L

SS

+

*

+

a. Input C code

STOP !NON−ISOMORPHIC

* L

...

...

L

1. 5.

7.

1. 5.

7.

S S

+ +

0

1

2 L*

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

b. DFG c. SLP internal graph d. SLP vectorized groups

++

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 31: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

S S

L

L

SS

+

S

+

*

*

+

a. Input C code

STOP !NON−ISOMORPHIC

* L

...

...

L

1. 5.

7.

1. 5.

7.

S S

+ +

0

1

2 L*

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

Scalar Cost

b. DFG c. SLP internal graph d. SLP vectorized groups

LL

S

+

++

7

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 32: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

SLP Fails due to non-isomorphism

X Instruction Node or Constant Data Flow Edge

S S

L

L

SS

SS

LL

+

S

+

*

*

+

a. Input C code

STOP !NON−ISOMORPHIC

* L

...

...

L

1. 5.

7.

1. 5.

7.

S S

+ +

0

1

2 L*

B[i] = A[i] * 7.0 + 1.0;B[i+1]= A[i+1] + 5.0;

Vector CostScalar Cost

b. DFG c. SLP internal graph d. SLP vectorized groups

NoBenefit

LL

S

+

++

7 7*

ii++

slide 6 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 33: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

+

S

*

7.

1.

a. PSLP graphs

+

L 5.

Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 34: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

S

L

S

L

+

S

+ +

*

*7.

1.

b. PSLP padded graphsa. PSLP graphs

7.

1. 5.

+

L 5.

Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 35: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

S

L

S

L

+

S

+ +

*

* *7.

1.

b. PSLP padded graphsa. PSLP graphs

7.

1.

7.

5.

+

L 5.

Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 36: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

S

L

S

L

+

S

+ +

*

* *7.

1.

b. PSLP padded graphsa. PSLP graphs

7.

1.

7.

5.Left Right

+

L 5.

Select Instruction Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 37: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

S

L

S

L

+

S

+ +

*

* *7.

1.

b. PSLP padded graphsa. PSLP graphs

7.

1.

7.

5.Left Right

+

L 5.

Select Instruction Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 38: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

S

L

S

L

+

S

+ +

*

* *7.

1.

c. PSLP groupsb. PSLP padded graphsa. PSLP graphs

7.

1.

7.

5.

1

2

0 S S

++

3 **

4 L L

Left Right+

L 5.

Select Instruction Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 39: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP fixes Non-Isomorphism

S

L

S

L

S

L

+

S

+ +

*

* *7.

1.

c. PSLP groupsb. PSLP padded graphsa. PSLP graphs

7.

1.

7.

5.

1

2

0 S S

++

3 **

4 L L

Left Right 5+

L 5.

Select Instruction Data Flow EdgeInstruction or ConstantX

slide 7 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 40: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 41: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 42: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

Generate a graph for each seed2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 43: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

3. Perform minimal Padding of graphs

Generate a graph for each seed2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 44: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

• Cost estimation

CalculateScalar Cost

CalculateVector Cost

3. Perform minimal Padding of graphs

4.Calculate PaddedVector Cost

Generate a graph for each seed2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 45: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

• Cost estimation

CalculateScalar Cost

CalculateVector Cost

IfPadded Cost

is best5.

3. Perform minimal Padding of graphs

4.Calculate PaddedVector Cost

Generate a graph for each seed2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 46: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

• Cost estimation

• Emit redundant codeto createisomorphism

CalculateScalar Cost

CalculateVector Cost

IfPadded Cost

is best5.

3. Perform minimal Padding of graphs

4.Calculate PaddedVector Cost

Emit Padded Scalars

YES

6.

Generate a graph for each seed2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 47: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

• Cost estimation

• Emit redundant codeto createisomorphism

7.If<Vector Cost

Scalar Cost

NO

CalculateScalar Cost

CalculateVector Cost

IfPadded Cost

is best5.

3. Perform minimal Padding of graphs

4.Calculate PaddedVector Cost

Emit Padded Scalars

YES

6.

Generate a graph for each seed2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 48: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

• Cost estimation

• Emit redundant codeto createisomorphism

• Code vectorized byoriginal SLP

YES

7.If<Vector Cost

Scalar Cost

NO

CalculateScalar Cost

CalculateVector Cost

IfPadded Cost

is best5.

8.

3. Perform minimal Padding of graphs

4.Calculate PaddedVector Cost

Emit Padded Scalars

YES

6.

} Vanilla SLP

Generate a graph for each seed

9.

Generate SLP graph containinggroups of isomorphic scalars

Vectorize groups & emit vectors

2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 49: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP Algorithm

• Extension to SLP

• Generate multiplegraphs (unlike SLP)

• Minimal Padding

• Cost estimation

• Emit redundant codeto createisomorphism

• Code vectorized byoriginal SLP

YES

7.If<Vector Cost

Scalar Cost

NO

CalculateScalar Cost

CalculateVector Cost

IfPadded Cost

is best5.

8.

3. Perform minimal Padding of graphs

4.Calculate PaddedVector Cost

Emit Padded Scalars

YES

6.

} Vanilla SLP

Generate a graph for each seed

9.

Generate SLP graph containinggroups of isomorphic scalars

Vectorize groups & emit vectors

NO

DONE

2.

1. Find vectorization seed instructions

slide 8 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 50: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

S

+*

7L

1

+

L

S

5

g1g2

Non−Isomorphic

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 51: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

S

+*

7L

1

+

L

S

5

g1g2

Non−Isomorphic

MCS1MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 52: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

S

+*

7L

1

+

L

S

5

+

L

S

1

+

L

S

5

g1g2

Non−Isomorphic

g1

g2

MCS1 MCS2MCS1MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 53: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

diff1

diff2

S

+*

7L

1

+

L

S

5

+

L

S

1

+

L

S

5

*

7

g1g2

Non−Isomorphic

g1

g2

L

+

L

+

MCS1 MCS2MCS1MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 54: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

diff1

diff2

S

+

L

1

S

5

+

L

S

+*

7L

1

+

L

S

5

+

L

S

1

+

L

S

5

*

7

g1g2

MinCS2MinCS1

Non−Isomorphic

g1

g2

L

+

L

+

MCS1 MCS2MCS1 MCS1 MCS2MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 55: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

diff1

diff2

S

+

L 7

*

1

S

5

+

L 7

*

S

+*

7L

1

+

L

S

5

+

L

S

1

+

L

S

5

*

7

g1g2

MinCS2MinCS1

Non−Isomorphic

g1

g2diff1diff1

L

+

L

+

MCS1 MCS2MCS1 MCS1 MCS2MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 56: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

diff1

diff2

S

+

L 7

*

1

S

5

+

L 7

*

S

+*

7L

1

+

L

S

5

+

L

S

1

+

L

S

5

*

7

g1g2

MinCS2MinCS1

Isomorphic !Non−Isomorphic

g1

g2diff1diff1

diff2diff2

L

+

L

+

MCS1 MCS2MCS1 MCS1 MCS2MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 57: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Minimal Padding Algorithm

diff1

diff2 SELECTSELECT

S

+

L 7

*

1

S

5

+

L 7

*

S

+*

7L

1

+

L

S

5

+

L

S

1

+

L

S

5

*

7

g1g2

MinCS2MinCS1

Isomorphic !Non−Isomorphic

g1

g2diff1diff1

diff2diff2

L

+

L

+

LeftRight

MCS1 MCS2MCS1 MCS1 MCS2MCS2

slide 9 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 58: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

S

*

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

slide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 59: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+

S

*

S

L

+

*

Left

7

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

slide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 60: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

slide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 61: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

slide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 62: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

1

+

C

A

a. Instruction acting as Selectslide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 63: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

A

+

0C

1

+

C

A

a. Instruction acting as Selectslide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 64: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

A

+

0C

A

72

1

+

C

A

a. Instruction acting as Select b. Select constantsslide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 65: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

A

+

0C

A

2

A

72

1

+

C

A

a. Instruction acting as Select b. Select constantsslide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 66: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

A

+

0C

A

2

A

72

A

B1

+

C

A

a. Instruction acting as Select b. Select constants c. Select same nodeslide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 67: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

We can do better: Remove redundant Selects

S

L

+

*

S

L

+*

S

L

+

S

*

S

L

+

*

S

L

+*Left

7

17

1

EXAMPLE: Instruction acting as Select

7.

1.

+

L 5.

7

5

Right

1

1

A

+

0C

A

2

A

72

A

B

A

B

1

+

C

A

a. Instruction acting as Select b. Select constants c. Select same nodeslide 10 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 68: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computingconjugates in 433.milc)

a[0].reala[0].imaga[1].reala[1].imag

...

...

b[0].imag = − a[0].imag

b[1].imag = − a[1].imag

b[0].real = a[0].real

b[1].real = a[1].real

Memory

slide 12 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 69: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computingconjugates in 433.milc)

a[0].reala[0].imaga[1].reala[1].imag

...

...

b[0].imag = − a[0].imag

b[1].imag = − a[1].imag

b[0].real = a[0].real

b[1].real = a[1].real

Memory

slide 12 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 70: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computingconjugates in 433.milc)

a[0].reala[0].imaga[1].reala[1].imag

...

...

b[0].imag = − a[0].imag

b[1].imag = − a[1].imag

b[0].real = a[0].real

b[1].real = a[1].real

Memory

2 Isomorphic source code but non-isomorphic IR dueto high-level optimizations (jdct of cjpeg)

tmp1 = quantval[0]*16384tmp2 = quantval[1]*22725tmp3 = quantval[2]*21407tmp4 = quantval[3]*19266

slide 12 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 71: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computingconjugates in 433.milc)

a[0].reala[0].imaga[1].reala[1].imag

...

...

b[0].imag = − a[0].imag

b[1].imag = − a[1].imag

b[0].real = a[0].real

b[1].real = a[1].real

Memory

2 Isomorphic source code but non-isomorphic IR dueto high-level optimizations (jdct of cjpeg)

tmp1 = quantval[0]<<14tmp2 = quantval[1]*22725tmp3 = quantval[2]*21407tmp4 = quantval[3]*19266

tmp1 = quantval[0]*16384tmp2 = quantval[1]*22725tmp3 = quantval[2]*21407tmp4 = quantval[3]*19266

opt

slide 12 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 72: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Opportunities for PSLP in real-life applications

1 Non-isomorphic source code (e.g. computingconjugates in 433.milc)

a[0].reala[0].imaga[1].reala[1].imag

...

...

b[0].imag = − a[0].imag

b[1].imag = − a[1].imag

b[0].real = a[0].real

b[1].real = a[1].real

Memory

2 Isomorphic source code but non-isomorphic IR dueto high-level optimizations (jdct of cjpeg)

tmp1 = quantval[0]<<14tmp2 = quantval[1]*22725tmp3 = quantval[2]*21407tmp4 = quantval[3]*19266

tmp1 = quantval[0]*16384tmp2 = quantval[1]*22725tmp3 = quantval[2]*21407tmp4 = quantval[3]*19266

opt

slide 12 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 73: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 74: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

• Target: Intel Core i5-4570 @ 3.2Ghz

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 75: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

• Target: Intel Core i5-4570 @ 3.2Ghz

• Compiler flags: -O3 -allow-partial-unroll-march=core-avx2 -mtune-core-i7 -ffast-math

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 76: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

• Target: Intel Core i5-4570 @ 3.2Ghz

• Compiler flags: -O3 -allow-partial-unroll-march=core-avx2 -mtune-core-i7 -ffast-math

• Kernels, SPEC 2006 and Mediabench II• We evaluated the following cases:

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 77: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

• Target: Intel Core i5-4570 @ 3.2Ghz

• Compiler flags: -O3 -allow-partial-unroll-march=core-avx2 -mtune-core-i7 -ffast-math

• Kernels, SPEC 2006 and Mediabench II• We evaluated the following cases:

1 All loop, SLP and PSLP vectorizers disabled (O3)

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 78: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

• Target: Intel Core i5-4570 @ 3.2Ghz

• Compiler flags: -O3 -allow-partial-unroll-march=core-avx2 -mtune-core-i7 -ffast-math

• Kernels, SPEC 2006 and Mediabench II• We evaluated the following cases:

1 All loop, SLP and PSLP vectorizers disabled (O3)2 O3 + SLP enabled (SLP)

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 79: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Experimental Setup

• Implemented PSLP in the trunk version of theLLVM 3.6 compiler.

• Target: Intel Core i5-4570 @ 3.2Ghz

• Compiler flags: -O3 -allow-partial-unroll-march=core-avx2 -mtune-core-i7 -ffast-math

• Kernels, SPEC 2006 and Mediabench II• We evaluated the following cases:

1 All loop, SLP and PSLP vectorizers disabled (O3)2 O3 + SLP enabled (SLP)3 O3 + PSLP enabled (PSLP)

slide 13 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 80: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP increases performance

0.00

0.20

0.40

0.60

0.80

1.00

1.20

1.40

conjugates

su3-adjoint

make-ahmat-slow

jdct-ifastfloyd-warshall

GMean

Nor

mal

ized

Tim

e

Performance of Kernels (Execution Time)

O3 SLP PSLP

0.97

0.98

0.99

1.00

1.01

cjpegmpeg2dec

433.milc473.astar

GMean

Whole Benchmarks (Execution Time)

O3 SLP PSLP

slide 14 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 81: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP enables or extends vectorization

0

10

20

30

40

50

conjugates

su3-adjoint

make-ahmat-slow

jdct-ifastfloyd-warshall

cjpegmpeg2dec

433.milc473.astar

Tim

es T

echn

ique

Suc

ceed

s

Vectorization Coverage Breakdown163

SLP-onlyPSLP-extends-SLP

PSLP-only

SLPonly

• SLP is adequate

slide 15 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 82: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP enables or extends vectorization

0

10

20

30

40

50

conjugates

su3-adjoint

make-ahmat-slow

jdct-ifastfloyd-warshall

cjpegmpeg2dec

433.milc473.astar

Tim

es T

echn

ique

Suc

ceed

s

Vectorization Coverage Breakdown163

SLP-onlyPSLP-extends-SLP

PSLP-only

SLPonly

PSLP extends SLP • SLP is adequate

• SLP stops at non-isomorphiccode. PSLP extends it.

slide 15 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 83: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

PSLP enables or extends vectorization

0

10

20

30

40

50

conjugates

su3-adjoint

make-ahmat-slow

jdct-ifastfloyd-warshall

cjpegmpeg2dec

433.milc473.astar

Tim

es T

echn

ique

Suc

ceed

s

Vectorization Coverage Breakdown163

SLP-onlyPSLP-extends-SLP

PSLP-only

PSLPonly

SLPonly

PSLP extends SLP • SLP is adequate

• SLP stops at non-isomorphiccode. PSLP extends it.

• SLP fails completely. PSLP

succeeds.slide 15 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 84: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Optimizing away redundant Selects

• Select-removal

optimizations

remove about 21%

of the Selects

0%

5%

10%

15%

20%

25%

30%

35%

conjugates

su3-adjoint

make-ahmat-slow

jdct-ifastfloyd-warshall

cjpegmpeg2dec

433.milc473.astar

GMean

Per

cent

age

of S

elec

ts

Percentage of Selects per region before and after Optimizations

Original-Selects Optimized-Selects

slide 16 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 85: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Conclusion

• PSLP improves vectorization coverage compared tothe state-of-the-art

slide 17 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 86: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Conclusion

• PSLP improves vectorization coverage compared tothe state-of-the-art

• Converts non-isomorphic code into isomorphic by:

slide 17 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 87: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Conclusion

• PSLP improves vectorization coverage compared tothe state-of-the-art

• Converts non-isomorphic code into isomorphic by:• Relying on the Min Common Supergraph for minimal

injection of redundant code

slide 17 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 88: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Conclusion

• PSLP improves vectorization coverage compared tothe state-of-the-art

• Converts non-isomorphic code into isomorphic by:• Relying on the Min Common Supergraph for minimal

injection of redundant code• Emitting Select instructions to guarantee correctness

slide 17 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 89: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Conclusion

• PSLP improves vectorization coverage compared tothe state-of-the-art

• Converts non-isomorphic code into isomorphic by:• Relying on the Min Common Supergraph for minimal

injection of redundant code• Emitting Select instructions to guarantee correctness• Optimizing away redundant Selects

slide 17 of 17 www.cl.cam.ac.uk/ ∼vp331/

Page 90: PSLP: Padded SLP Automatic Vectorization€¦ · SLP Vectorization Algorithm • Input is scalar IR • Seed instructions are: 1 Consecutive Stores 2 Reductions • Graph contains

Conclusion

• PSLP improves vectorization coverage compared tothe state-of-the-art

• Converts non-isomorphic code into isomorphic by:• Relying on the Min Common Supergraph for minimal

injection of redundant code• Emitting Select instructions to guarantee correctness• Optimizing away redundant Selects

• PSLP performs better compared to SLP oncommodity SIMD-capable hardware

slide 17 of 17 www.cl.cam.ac.uk/ ∼vp331/