
University of Michigan, Electrical Engineering and Computer Science

MacroSS: Macro-SIMDization of Streaming Applications

Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke*

* Advanced Computer Arch. Lab., University of Michigan
† Nvidia Corp.
‡ IBM T.J. Watson Research Center


Importance of SIMD

• Energy and area efficient way to exploit data-level parallelism

• Performance in multimedia and communication apps

• Ubiquitous in modern processors
– Intel: SSE, Larrabee
– IBM: Altivec, Cell SPE
– ARM: Neon

[Figure: three processor block diagrams, each with a control unit, functional units, and a cache]


Stream Computing

• Prevalent in embedded, desktop and server systems

• Many optimizations for mapping and scheduling applications to parallel architectures

• Retargetability is a big plus in streaming languages

• Task, pipeline, and data-level parallelism are mapped onto core-level parallelism

• Data-level parallelism on SIMD engines is not utilized


Traditional Vectorization on Streaming Applications

[Figure: bar chart of speedup (x), axis 0 to 3.5, for ICC + Auto Vectorize. Benchmarks: AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, Average.]


Why are SIMD engines under-utilized?

• Finding data-level parallelism suitable for SIMD engines

• Proper data-alignment

• Complicated compiler optimizations and transformations

• Wide variety of SIMD standards


In this work…

• Macro-level SIMDization techniques for streaming languages.

• MacroSS compiler for StreamIt language

• Hardware-based buffer optimizations for packing/unpacking operations

• Evaluation of MacroSS on Intel Core i7


StreamIt

• Main constructs:
– Filter: encapsulates computation
   • Stateful
   • Stateless
– Pipeline: expresses pipeline parallelism
– Splitjoin: expresses task/data-level parallelism

• Exposes different types of parallelism

• Scheduling and rate-matching are needed
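Rate matching can be illustrated with a small sketch. It assumes a producer D that pushes 2 items per execution and a consumer E that pops 3 items per execution (rates taken from the example used later in these slides). The helper name `steady_state_reps` is hypothetical and not part of StreamIt; it only shows the arithmetic behind a balanced schedule.

```python
from math import gcd

def steady_state_reps(push_rate, pop_rate):
    """Smallest (producer_reps, consumer_reps) that balance one stream.

    Balance means producer_reps * push_rate == consumer_reps * pop_rate.
    """
    items = push_rate * pop_rate // gcd(push_rate, pop_rate)  # least common multiple
    return items // push_rate, items // pop_rate

# D pushes 2 items, E pops 3 items.
d_reps, e_reps = steady_state_reps(push_rate=2, pop_rate=3)
print(d_reps, e_reps)  # 3 2

# Scaling the schedule by an assumed SIMD width of 4 gives the counts 12 and 8.
simd_width = 4
print(d_reps * simd_width, e_reps * simd_width)  # 12 8
```

Scaling the minimal schedule by the SIMD width is one way to guarantee that every actor has a multiple of four executions available to pack into vector lanes.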

[Figure: StreamIt stream graph with a pipeline, a filter, and a splitjoin labeled]


Macro SIMDization

• SIMDization at graph level

• Tunes the graph based on the target system
– SIMD standards
– Wide/Narrow SIMD

• Actor SIMDization:
– Single-Actor
– Vertical
– Horizontal


Single-Actor SIMDization Overview

[Figure: executions of actor E under four schedules: serial execution, execution reordering, ideal vectorization, and realistic vectorization. The scalar executions E(8) are grouped into vectorized executions E v.]


Single Actor SIMDization

Scalar actor E (8):

    x0 = pop();
    x1 = pop();
    x2 = pop();
    result[0] = x1 * cos(x0) + x2;
    result[1] = x0 * cos(x1) + x2;
    result[2] = x1 * sin(x0) + x2;
    result[3] = x0 * sin(x1) + x2;
    for (i : 0 to 3)
        push(result[i]);

Vectorized actor EV (1):

    // Gather x0 of four consecutive executions (input stride 3)
    x0_v.{3} = peek(9);
    x0_v.{2} = peek(6);
    x0_v.{1} = peek(3);
    x0_v.{0} = pop();

    // Same pattern for x1 and x2; each pop() shifts the window by one
    x1_v.{3} = peek(9);
    x1_v.{2} = peek(6);
    x1_v.{1} = peek(3);
    x1_v.{0} = pop();

    x2_v.{3} = peek(9);
    x2_v.{2} = peek(6);
    x2_v.{1} = peek(3);
    x2_v.{0} = pop();

    // Vector arithmetic: one operation covers four executions
    result_v[0] = x1_v * cos(x0_v) + x2_v;
    result_v[1] = x0_v * cos(x1_v) + x2_v;
    result_v[2] = x1_v * sin(x0_v) + x2_v;
    result_v[3] = x0_v * sin(x1_v) + x2_v;

    // Scatter the lanes back to the scalar buffer (output stride 4)
    for (i : 0 to 3) {
        rpush(result_v[i].{3}, 12);
        rpush(result_v[i].{2}, 8);
        rpush(result_v[i].{1}, 4);
        push(result_v[i].{0});
    }

• Only stateless actors

• Scalar buffer accesses

• Strided pushes and pops
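The strided access pattern can be checked with a small simulation. The sketch below is not MacroSS output: it models the scalar buffers as Python lists, models each vector as a list of four lanes, and assumes that after each vectorized execution the read and write positions skip over the items handled by lanes 1 to 3 (a step the slide code does not show). The function names are hypothetical.

```python
import math

SIMD_WIDTH = 4
POP, PUSH = 3, 4  # rates of actor E

def work(x0, x1, x2):
    """Scalar body of E: consumes 3 values, produces 4."""
    return [x1 * math.cos(x0) + x2,
            x0 * math.cos(x1) + x2,
            x1 * math.sin(x0) + x2,
            x0 * math.sin(x1) + x2]

def run_scalar(inp):
    """Execute E once per group of POP inputs, in stream order."""
    out = []
    for base in range(0, len(inp), POP):
        out.extend(work(*inp[base:base + POP]))
    return out

def run_vectorized(inp):
    """Execute the vectorized actor; lane k plays the role of scalar execution k.

    Assumes len(inp) is a multiple of POP * SIMD_WIDTH.
    """
    out = [None] * (len(inp) // POP * PUSH)
    rd = wr = 0  # read and write positions in the scalar buffers
    while rd < len(inp):
        # Strided reads: pop() feeds lane 0; peek(3), peek(6), peek(9) feed lanes 1..3.
        xs = []
        for _ in range(POP):
            xs.append([inp[rd + k * POP] for k in range(SIMD_WIDTH)])
            rd += 1
        # Lane-wise computation stands in for the SIMD arithmetic.
        lanes = [work(xs[0][k], xs[1][k], xs[2][k]) for k in range(SIMD_WIDTH)]
        # Strided writes: push() stores lane 0; rpush with offsets 4, 8, 12 stores lanes 1..3.
        for i in range(PUSH):
            for k in range(SIMD_WIDTH):
                out[wr + k * PUSH] = lanes[k][i]
            wr += 1
        # Assumption: skip the items that lanes 1..3 already consumed and produced.
        rd += (SIMD_WIDTH - 1) * POP
        wr += (SIMD_WIDTH - 1) * PUSH
    return out

data = [0.1 * i for i in range(24)]  # 8 scalar executions = 2 vectorized executions
assert run_vectorized(data) == run_scalar(data)
```

Because the buffers stay scalar, neighbouring actors see the same element order as before, at the cost of one scalar access per lane for every read and write.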

[Figure: the scalar code of E shown again beside graph nodes labeled E (8) and E (4)]


Why Scalar Buffers?

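D: pop=2, push=2, 12 executions. E: pop=3, push=4, 8 executions. Vector width: 128 bits.

[Figure: the 24 elements that flow from D to E, shown in three layouts.
Written by vectorized D (three executions): (0 2 4 6), (1 3 5 7), (8 10 12 14), (9 11 13 15), (16 18 20 22), (17 19 21 23).
Expected by vectorized E (two executions): (0 3 6 9), (1 4 7 10), (2 5 8 11), (12 15 18 21), (13 16 19 22), (14 17 20 23).
Scalar buffer: (0 1 2 3), (4 5 6 7), (8 9 10 11), (12 13 14 15), (16 17 18 19), (20 21 22 23).]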
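The mismatch in the figure can be reproduced with a few lines of arithmetic. This is a sketch under the assumption that lane k of a vectorized execution handles scalar execution k, so that consecutive lanes are separated by the actor's push or pop rate; `vector_layout` is a hypothetical helper name.

```python
SIMD_WIDTH = 4

def vector_layout(rate, total):
    """Order in which a vectorized actor touches stream elements.

    Lane k of one vectorized execution handles scalar execution k, so the
    j-th vector access covers elements j, j + rate, j + 2 * rate, ...
    """
    block = rate * SIMD_WIDTH  # elements covered by one vectorized execution
    rows = []
    for base in range(0, total, block):
        for j in range(rate):
            rows.append([base + j + k * rate for k in range(SIMD_WIDTH)])
    return rows

written_by_d = vector_layout(rate=2, total=24)  # D pushes 2 items per execution
read_by_e = vector_layout(rate=3, total=24)     # E pops 3 items per execution

print(written_by_d[:2])  # [[0, 2, 4, 6], [1, 3, 5, 7]]
print(read_by_e[:3])     # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
print(written_by_d == read_by_e)  # False
```

Since the vectors D would write are not the vectors E wants to read, a buffer of whole vectors would need repacking between the two actors; a scalar buffer in stream order avoids that.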


Vertical SIMDization

D: pop=2, push=2, 12 executions. E: pop=3, push=4, 8 executions.
Fused actor 3D 2E: pop=6, push=8, 4 executions.

[Figure: executions D0 to D11 and E0 to E7 regrouped so that each fused execution runs three D executions followed by two E executions: (D0 D1 D2, E0 E1), (D3 D4 D5, E2 E3), (D6 D7 D8, E4 E5), (D9 D10 D11, E6 E7).]
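The rates of the fused actor follow directly from the rates of D and E. The sketch below is only the bookkeeping, with a hypothetical `fuse` helper and actors represented as plain dictionaries; it does not model the code generation itself.

```python
def fuse(producer, consumer, producer_reps, consumer_reps):
    """Rates of a coarse actor that runs producer_reps producers, then consumer_reps consumers."""
    # The intermediate stream must balance so that it stays inside the fused actor.
    assert producer_reps * producer["push"] == consumer_reps * consumer["pop"]
    return {
        "pop": producer_reps * producer["pop"],
        "push": consumer_reps * consumer["push"],
        "internal": producer_reps * producer["push"],  # items exchanged inside the fused actor
    }

D = {"pop": 2, "push": 2}
E = {"pop": 3, "push": 4}
fused = fuse(D, E, producer_reps=3, consumer_reps=2)
print(fused)  # {'pop': 6, 'push': 8, 'internal': 6}
```

One plausible reading of the slides is that the six internal items per execution no longer cross an actor boundary, so when the fused actor is vectorized each lane produces and consumes its own intermediate values without the strided scalar accesses needed between separate actors.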


Horizontal SIMDization

• Find isomorphic actors in split/join structures

• The isomorphic actors are merged into one vectorized actor

• Actors can be either stateful or stateless
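A minimal sketch of the idea, assuming four split/join branches that run the same code with different constants and their own state. The actor below is invented for illustration (it is not one of the benchmark actors), and Python lists stand in for SIMD registers, with lane k holding the data of branch k.

```python
SIMD_WIDTH = 4

class ScalarActor:
    """Hypothetical stateful actor (pop=1, push=1): scales the input and adds a running sum."""
    def __init__(self, gain):
        self.gain = gain
        self.total = 0

    def work(self, x):
        self.total += x
        return self.gain * x + self.total

class MergedActor:
    """Horizontally merged version: lane k holds the constant and state of branch k."""
    def __init__(self, gains):
        self.gains = list(gains)
        self.totals = [0] * len(self.gains)

    def work(self, xs):
        # Each list operation stands in for one SIMD instruction across the lanes.
        self.totals = [t + x for t, x in zip(self.totals, xs)]
        return [g * x + t for g, x, t in zip(self.gains, xs, self.totals)]

gains = [1, 2, 3, 4]
branches = [ScalarActor(g) for g in gains]
merged = MergedActor(gains)

for step in range(3):
    inputs = [step + k for k in range(SIMD_WIDTH)]  # one item per branch from the splitter
    expected = [b.work(x) for b, x in zip(branches, inputs)]
    assert merged.work(inputs) == expected
```

State is not an obstacle here because each lane carries the state of exactly one original actor, which is why this form of SIMDization also applies to stateful actors.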

[Figure: Source, then Splitter, then n parallel branches (A1 B1 C1 through An Bn Cn), then Joiner, then Sink]


[Figure: running example, shown before and after macro-SIMDization.
Original graph actors: A (pop=n, push=8), Splitter (4, 4, 4, 4), B0 to B3 (pop=12, push=3), C0 to C3 (pop=1, push=1), Joiner (1, 1, 1, 1), D (pop=2, push=2), E (pop=3, push=4), F (pop=4, push=1), G (pop=2, push=8), H (pop=8, push=n).
SIMDized graph: the B and C branches are merged between an HSplitter (4) and an HJoiner (1) (horizontal SIMDization); D and E are fused into 3D 2E with pop=6, push=8 (vertical SIMDization); single-actor SIMDization is applied to individual actors. Later panels also label G with peek=4 and H with peek=8.]


Streaming Address Generation

D: push=3, 8 executions. E: pop=2, 12 executions.

[Figure: the 24 elements between D and E in two layouts.
Scalar buffer: (0 1 2 3), (4 5 6 7), (8 9 10 11), (12 13 14 15), (16 17 18 19), (20 21 22 23).
Vector buffer: (0 3 6 9), (1 4 7 10), (2 5 8 11), (12 15 18 21), (13 16 19 22), (14 17 20 23).]

• Area overhead less than 1% on Core i7.

• Critical path: two 16-bit adds and one 64-bit add.
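The slides do not spell out the hardware, so the following is only a software sketch of the kind of address translation involved. It assumes a buffer filled by vector pushes from an actor with push=3 and a SIMD width of 4, and maps a logical stream index to its slot in that buffer. The function name `vector_buffer_position` is hypothetical.

```python
SIMD_WIDTH = 4

def vector_buffer_position(i, push_rate, width=SIMD_WIDTH):
    """Physical slot of logical stream element i in a buffer filled by vector pushes."""
    block = push_rate * width          # elements written per vectorized execution
    base = (i // block) * block        # start of the block holding element i
    lane, row = divmod(i % block, push_rate)
    return base + row * width + lane

# Buffer written by a vectorized D with push=3: rows (0 3 6 9), (1 4 7 10), ...
physical = [None] * 24
for i in range(24):
    physical[vector_buffer_position(i, push_rate=3)] = i
print(physical[:12])  # [0, 3, 6, 9, 1, 4, 7, 10, 2, 5, 8, 11]

# A scalar consumer reads elements 0, 1, 2, 3 in stream order from these slots.
print([vector_buffer_position(i, 3) for i in range(4)])  # [0, 4, 8, 1]
```

Computing this mapping in software costs several integer operations per access, which suggests why a small dedicated address generator is attractive when vector buffers replace scalar ones.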


Traditional vs. Macro SIMDization

                                                  Traditional SIMDization   Macro-SIMDization
Applicability                                     Any                       Streaming
Adjust the schedule                               -                         x
Tune streaming graph                              -                         x
Identify isomorphic actors                        -                         x
Easily retargetable                               -                         x
Complexity of optimizations and transformations   High                      Low


Experimental Setup

[Figure: compilation flow: Streaming Program, Frontend Compiler, Backend Compiler, C Code, Host Compiler, Intel Core i7]

• Frontend: MIT StreamIt compiler

• Backend: MacroSS

• Host compiler: ICC 11.1, which compiles the generated C/C++ code

• Target: Intel Core i7 with SSE4


Macro-SIMDization vs. Traditional

[Figure: bar chart of speedup (x), axis 0 to 3.5. Series: ICC + Auto Vectorize, ICC + Macro SIMD, ICC + Macro SIMD + Autovectorize. Benchmarks: AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, Average.]


Benefits of SAGU

[Figure: bar chart of % improvement, axis 0 to 25. Benchmarks: AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, Average.]


Conclusion

• Streaming is prevalent in all computing domains.

• Applying traditional SIMDization on streaming applications fails to utilize SIMD engines.

• Macro-SIMDization is done at a higher level, on the stream graph.

• MacroSS outperforms traditional SIMDization techniques by 54%.


Questions and Comments


Macro-SIMDization vs. Traditional

[Figure: bar chart of speedup (x), axis 0 to 4. Series: GCC + Auto Vectorize, GCC + Macro SIMD, GCC + Macro SIMD + Autovectorize. Benchmarks: AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, Average.]


SAGU Implementation

• Area overhead less than 1% on Core i7.

• Critical path: two 16-bit adds and one 64-bit add.

• Minor ISA modifications are needed.


SIMD + Multi-core Scheduling

• How to schedule for a heterogeneous SIMD system?

• SIMDization reduces memory/bus traffic

• Exploit SIMD parallelism before Core-level parallelism.

• Is this the best we can do?


Multicore + Macro-SIMDization

[Figure: bar chart of speedup (x), axis 0 to 4.5. Series: 2 Cores, 4 Cores, 2 Cores + Macro SIMD, 4 Cores + Macro SIMD. Benchmarks: AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, Average.]
