University of Michigan, Electrical Engineering and Computer Science
MacroSS: Macro-SIMDization of Streaming Applications
Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke*
* Advanced Computer Arch. Lab., University of Michigan
† Nvidia Corp.
‡ IBM T.J. Watson Research Center
Importance of SIMD
• Energy and area efficient way to exploit data-level parallelism
• Performance in multimedia and communication apps
• Ubiquitous in modern processors
– Intel: SSE, Larrabee
– IBM: Altivec, Cell SPE
– ARM: Neon
[Figure: three blocks, each labeled Control Unit, Functional Units, and Cache]
Stream Computing
• Prevalent in embedded, desktop and server systems
• Many optimizations for mapping and scheduling applications to parallel architectures
• Retargetability is a big plus in streaming languages
• Task, pipeline, and data-level parallelism are mapped onto core-level parallelism
• Data-level parallelism on SIMD engines is not utilized
Traditional Vectorization on Streaming Applications
[Figure: bar chart of speedup (x), y-axis 0 to 3.5, for ICC + Auto Vectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
Why are SIMD engines under-utilized?
• Finding data-level parallelism suitable for SIMD engines
• Proper data-alignment
• Complicated compiler optimizations and transformations
• Wide variety of SIMD standards
In this work…
• Macro-level SIMDization techniques for streaming languages
• MacroSS compiler for the StreamIt language
• Hardware-based buffer optimizations for packing/unpacking operations
• Evaluation of MacroSS on Intel Core i7
StreamIt
• Main constructs:
– Filter: encapsulates computation (stateful or stateless)
– Pipeline: expresses pipeline parallelism
– Splitjoin: expresses task/data-level parallelism
• Exposes different types of parallelism
• Scheduling and rate-matching are needed
[Figure: example stream graph with labeled pipeline, filter, and splitjoin]
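As a rough illustration of rate-matching, the sketch below computes a steady-state schedule for a two-actor pipeline in which a producer feeds a consumer. The rates are those of actors D (push=2) and E (pop=3) used on later slides. The function name `steady_state_reps` and the optional scaling by SIMD width are assumptions made for illustration; this is not MacroSS or StreamIt compiler code.

```python
from math import gcd

def steady_state_reps(push_rate, pop_rate, simd_width=1):
    """Return (producer_reps, consumer_reps) such that the number of
    items pushed per steady state equals the number of items popped."""
    items = push_rate * pop_rate // gcd(push_rate, pop_rate)  # least common multiple
    return (items // push_rate) * simd_width, (items // pop_rate) * simd_width

print(steady_state_reps(2, 3))     # minimal schedule: (3, 2)
print(steady_state_reps(2, 3, 4))  # scaled for 4-wide SIMD: (12, 8)
```

Scaling the minimal schedule by the SIMD width gives the repetition counts 12 and 8 that appear next to D and E on later slides.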
Macro SIMDization
• SIMDization at graph level
• Tunes the graph based on the target system
– SIMD standards
– Wide/narrow SIMD
• Actor SIMDization:
– Single-Actor
– Vertical
– Horizontal
Single-Actor SIMDization Overview
[Figure: four execution timelines of actor E, scheduled as E(8): Serial Execution, Execution Reordering, Ideal Vectorization, and Realistic Vectorization, with vectorized executions labeled Ev]
Single-Actor SIMDization
Scalar actor E (8):
    x0 = pop();
    x1 = pop();
    x2 = pop();
    result[0] = x1 * cos(x0) + x2;
    result[1] = x0 * cos(x1) + x2;
    result[2] = x1 * sin(x0) + x2;
    result[3] = x0 * sin(x1) + x2;
    for (i : 0 to 3)
        push(result[i]);
Vectorized actor EV (1):
    // Gather inputs with a stride of 3 (the pop rate). Each pop() advances
    // the tape by one, so the same peek offsets serve x0_v, x1_v, and x2_v.
    x0_v.{3} = peek(9);
    x0_v.{2} = peek(6);
    x0_v.{1} = peek(3);
    x0_v.{0} = pop();
    x1_v.{3} = peek(9);
    x1_v.{2} = peek(6);
    x1_v.{1} = peek(3);
    x1_v.{0} = pop();
    x2_v.{3} = peek(9);
    x2_v.{2} = peek(6);
    x2_v.{1} = peek(3);
    x2_v.{0} = pop();
    result_v[0] = x1_v * cos(x0_v) + x2_v;
    result_v[1] = x0_v * cos(x1_v) + x2_v;
    result_v[2] = x1_v * sin(x0_v) + x2_v;
    result_v[3] = x0_v * sin(x1_v) + x2_v;
    // Scatter each lane with a stride of 4 (the push rate).
    for (i : 0 to 3) {
        rpush(result_v[i].{3}, 12);
        rpush(result_v[i].{2}, 8);
        rpush(result_v[i].{1}, 4);
        push(result_v[i].{0});
    }
[The slide also shows the scalar code labeled E (4).]
• Only stateless actors
• Scalar buffer accesses
• Strided pushes and pops
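The strided gather and scatter on this slide can be checked with a small simulation. The sketch below models the tapes as plain Python lists and assumes a SIMD width of 4; the names `fire_scalar`, `run_scalar`, and `run_vectorized` are hypothetical, and the list indexing stands in for the slide's `peek`, `pop`, `push`, and `rpush` operations rather than reproducing them.

```python
import math

SIMD_WIDTH = 4             # assumed vector width
POP_RATE, PUSH_RATE = 3, 4  # rates of actor E on the slide

def fire_scalar(x0, x1, x2):
    """One firing of scalar actor E: consumes 3 items, produces 4."""
    return [x1 * math.cos(x0) + x2,
            x0 * math.cos(x1) + x2,
            x1 * math.sin(x0) + x2,
            x0 * math.sin(x1) + x2]

def run_scalar(tape, firings):
    """Run E back to back, reading and writing the tapes in order."""
    out = []
    for k in range(firings):
        out.extend(fire_scalar(*tape[POP_RATE * k: POP_RATE * (k + 1)]))
    return out

def run_vectorized(tape):
    """One firing of the vectorized actor, covering SIMD_WIDTH scalar firings."""
    # Strided gather: lane l of input vector j reads tape[l * POP_RATE + j].
    x_v = [[tape[lane * POP_RATE + j] for lane in range(SIMD_WIDTH)]
           for j in range(POP_RATE)]
    # Lane-wise computation (each lane performs one scalar firing).
    lanes = [fire_scalar(x_v[0][l], x_v[1][l], x_v[2][l])
             for l in range(SIMD_WIDTH)]
    # Strided scatter: lane l of result i lands at out[l * PUSH_RATE + i].
    out = [0.0] * (SIMD_WIDTH * PUSH_RATE)
    for i in range(PUSH_RATE):
        for lane in range(SIMD_WIDTH):
            out[lane * PUSH_RATE + i] = lanes[lane][i]
    return out

tape = [0.1 * k for k in range(SIMD_WIDTH * POP_RATE)]
print(run_vectorized(tape) == run_scalar(tape, SIMD_WIDTH))
```

Because the buffers stay in scalar order, the vectorized firing produces the same output tape as four consecutive scalar firings.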
Why Scalar Buffers?
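[Figure: actor D (pop=2, push=2, 12 executions) feeds actor E (pop=3, push=4, 8 executions) through a buffer with 128-bit rows. Rows written by vectorized D: 1st execution 0 2 4 6 and 1 3 5 7; 2nd execution 8 10 12 14 and 9 11 13 15; 3rd execution 16 18 20 22 and 17 19 21 23. Rows expected by vectorized E: 1st execution 0 3 6 9, 1 4 7 10, and 2 5 8 11; 2nd execution 12 15 18 21, 13 16 19 22, and 14 17 20 23. The two layouts do not match (marked "?"). The scalar buffer holds elements 0 to 23 in order, four per row.]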
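The layout mismatch on this slide can be reproduced with a few lines of index arithmetic. The sketch below assumes a SIMD width of 4 (128-bit rows of 32-bit elements) and a hypothetical helper `vector_rows`; it only enumerates which element indices a vector push or pop at a given rate would place in each row.

```python
SIMD_WIDTH = 4  # assumed: 128-bit rows of four 32-bit elements

def vector_rows(rate, vector_firings, width=SIMD_WIDTH):
    """Rows of element indices touched by vector pushes (or pops) of an
    actor with the given push (or pop) rate."""
    rows = []
    for firing in range(vector_firings):
        base = firing * rate * width
        for j in range(rate):
            # Lane l handles scalar firing l, whose j-th item is base + l*rate + j.
            rows.append([base + lane * rate + j for lane in range(width)])
    return rows

written_by_d = vector_rows(rate=2, vector_firings=3)   # D: push=2, 12 scalar firings
expected_by_e = vector_rows(rate=3, vector_firings=2)  # E: pop=3, 8 scalar firings

print(written_by_d[0], expected_by_e[0])  # [0, 2, 4, 6] [0, 3, 6, 9]
print(written_by_d == expected_by_e)      # False
```

Both actors cover the same 24 elements, but in different row layouts, which is why the buffers between single-actor SIMDized actors are kept scalar.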
Vertical SIMDization
[Figure: actors D (pop=2, push=2, 12 executions) and E (pop=3, push=4, 8 executions) are fused into one actor 3D 2E (pop=6, push=8, 4 executions). Before fusion, D0 to D11 run in three executions and E0 to E7 in two. After fusion, the work is grouped as D0 D1 D2 E0 E1, D3 D4 D5 E2 E3, D6 D7 D8 E4 E5, and D9 D10 D11 E6 E7.]
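The rates of the fused actor follow from the rates of D and E. The sketch below is a minimal model, assuming a SIMD width of 4 and that each SIMD lane runs one group of the fused actor; the function `fuse` and its return shape are hypothetical, not part of MacroSS.

```python
from math import gcd

def fuse(prod, cons, simd_width=4):
    """prod and cons are (pop, push) pairs of a producer and its consumer.
    Returns the repetitions inside the fused actor, its (pop, push) rates,
    and the firings assigned to each SIMD lane."""
    prod_pop, prod_push = prod
    cons_pop, cons_push = cons
    items = prod_push * cons_pop // gcd(prod_push, cons_pop)  # items exchanged per fused firing
    p_reps, c_reps = items // prod_push, items // cons_pop
    fused_rates = (p_reps * prod_pop, c_reps * cons_push)
    lanes = [
        [f"D{p_reps * lane + k}" for k in range(p_reps)]
        + [f"E{c_reps * lane + k}" for k in range(c_reps)]
        for lane in range(simd_width)
    ]
    return (p_reps, c_reps), fused_rates, lanes

reps, rates, lanes = fuse((2, 2), (3, 4))
print(reps, rates)  # (3, 2) (6, 8)
for lane in lanes:
    print(" ".join(lane))
```

With D (pop=2, push=2) and E (pop=3, push=4), this gives the 3D 2E actor with pop=6 and push=8, and the four groups match the grouping in the figure.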
Horizontal SIMDization
• Find isomorphic actors in split/join structures
• The isomorphic actors are merged into one vectorized actor
• Actors can be either stateful or stateless
[Figure: Source -> Splitter -> n parallel branches (A1 B1 C1 through An Bn Cn) -> Joiner -> Sink]
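A minimal sketch of the idea, assuming four isomorphic stateful actors that differ only in a constant: each keeps a running weighted sum, and the merged actor keeps one state slot per SIMD lane. The actor body, the `gain` parameter, and the class name `MergedActor` are all made up for illustration.

```python
def make_scalar_actor(gain):
    """One branch of the splitjoin: a stateful actor with a running sum."""
    state = {"acc": 0.0}
    def fire(x):
        state["acc"] += gain * x
        return state["acc"]
    return fire

class MergedActor:
    """The isomorphic branches merged into one actor; lane l holds the
    state and the constant of branch l."""
    def __init__(self, gains):
        self.gains = list(gains)
        self.acc = [0.0] * len(self.gains)

    def fire(self, xs):
        # One vector operation per original scalar operation.
        self.acc = [a + g * x for a, g, x in zip(self.acc, self.gains, xs)]
        return list(self.acc)

gains = [1.0, 2.0, 3.0, 4.0]
branches = [make_scalar_actor(g) for g in gains]
merged = MergedActor(gains)

for xs in ([1.0, 1.0, 1.0, 1.0], [0.5, 0.25, 2.0, 3.0]):
    print([f(x) for f, x in zip(branches, xs)] == merged.fire(xs))
```

Since every lane carries its own state, the merge does not require the actors to be stateless, unlike single-actor SIMDization.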
[Figure: example stream graph transformed in three steps. The original graph contains A (pop=n, push=8), Splitter (4, 4, 4, 4), four isomorphic branches B0 to B3 (pop=12, push=3) and C0 to C3 (pop=1, push=1), Joiner (1, 1, 1, 1), D (pop=2, push=2), E (pop=3, push=4), F (pop=4, push=1), G (pop=2, push=8; shown with peek=4 in later panels), and H (pop=8, push=n; shown with peek=8 in a later panel). Panels: Horizontal SIMDization merges the B and C branches between HSplitter (4) and HJoiner (1); Vertical SIMDization fuses D and E into 3D 2E (pop=6, push=8); Single-Actor SIMDization is applied last.]
Streaming Address Generation
[Figure: actor D (push=3, 8 executions) feeds actor E (pop=2, 12 executions). Scalar Buffer: elements 0 to 23 in order, four per row (0 1 2 3, 4 5 6 7, through 20 21 22 23). Vector Buffer: rows 0 3 6 9, 1 4 7 10, 2 5 8 11, 12 15 18 21, 13 16 19 22, and 14 17 20 23.]
• Area overhead less than 1% on Core i7.
• Critical path: two 16-bit adds and one 64-bit add.
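The mapping between the two buffer layouts in the figure is plain index arithmetic. The sketch below is a software model only, assuming a producer push rate of 3 and a SIMD width of 4 as in the figure; the function `vector_buffer_position` is hypothetical and does not describe the actual hardware datapath.

```python
def vector_buffer_position(i, rate=3, width=4):
    """Position in the vector buffer of the element that a scalar
    consumer would read as its i-th item."""
    block, r = divmod(i, rate * width)  # one block per vector firing of the producer
    lane, row = divmod(r, rate)         # lane = scalar firing within the block
    return block * rate * width + row * width + lane

# Rebuild the vector buffer from the figure by placing each element.
vector_buffer = [None] * 24
for i in range(24):
    vector_buffer[vector_buffer_position(i)] = i

for row in range(6):
    print(vector_buffer[4 * row: 4 * row + 4])
```

Reading the buffer through this mapping returns elements in scalar order even though they were written by vector pushes.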
Traditional vs. Macro SIMDization
                                                  Traditional SIMDization   Macro-SIMDization
Applicability                                     Any                       Streaming
Adjust the schedule                                                         x
Tune streaming graph                                                        x
Identify isomorphic actors                                                  x
Easily retargetable                                                         x
Complexity of optimizations and transformations   High                      Low
Experimental Setup
[Figure: compilation flow: Streaming Program -> Frontend Compiler -> Backend Compiler -> C Code -> Host Compiler -> Intel Core i7]
• Frontend: MIT StreamIt compiler
• Backend: MacroSS
• Host compiler: ICC 11.1, compiling the generated C/C++ code
• Target: Core i7 with SSE4
Macro-SIMDization vs. Traditional
[Figure: bar chart of speedup (x), y-axis 0 to 3.5, comparing ICC + Auto Vectorize, ICC + Macro SIMD, and ICC + Macro SIMD + Autovectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
Benefits of SAGU
[Figure: bar chart of % improvement, y-axis 0 to 25, on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
Conclusion
• Streaming is prevalent in all computing domains.
• Applying traditional SIMDization to streaming applications fails to utilize SIMD engines.
• Macro-SIMDization is done at a higher level.
• MacroSS outperforms traditional SIMDization techniques by 54%.
Questions and Comments
Macro-SIMDization vs. Traditional
[Figure: bar chart of speedup (x), y-axis 0 to 4, comparing GCC + Auto Vectorize, GCC + Macro SIMD, and GCC + Macro SIMD + Autovectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]
SAGU Implementation
• Area overhead less than 1% on Core i7.
• Critical path: two 16-bit adds and one 64-bit add.
• Minor ISA modifications are needed.
SIMD + Multi-core Scheduling
• How to schedule for a heterogeneous SIMD system?
• SIMDization reduces memory/bus traffic
• Exploit SIMD parallelism before core-level parallelism.
• Is this the best we can do?
Multicore + Macro-SIMDization
[Figure: bar chart of speedup (x), y-axis 0 to 4.5, comparing 2 Cores, 4 Cores, 2 Cores + Macro SIMD, and 4 Cores + Macro SIMD on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and Average]