anysp: anytime anywhere anyway signal...
TRANSCRIPT
1 1
1
AnySP: Anytime Anywhere Anyway Signal Processing
Mark Woh1, Sangwon Seo1, Scott Mahlke1,Trevor Mudge1, Chaitali Chakrabarti2, Krisztian Flautner3
University of Michigan – ACAL1
Arizona State University2
ARM, Ltd.3
2 2
2
University of Michigan - ACAL
The Old Mobile Phone The Modern Mobile Phone
2
Video Recording
Video Editing
Higher Data Rates
Photos From - http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ http://www.apple.com/iphone
3D Rendering
Advanced Image Processing
• Future phones are becoming more complex • Richer applications require much more requirements
• How do phones handle this now?
3 3
3
University of Michigan - ACAL
Inside Today’s Smart Phones
3
Inside the OMAP3430 Application Processor
Inside the X-Gold 608 (Representation of QCOM)
4 4
4
University of Michigan - ACAL
Power/Performance Requirements for Multiple Systems
4
Different applications have different power/performance characteristics!
We need to design keeping each application in mind!
(Not GPP but Domain Specific Processor)
1
10
100
1000
10000
0 . 1 1 10 100
SODA ( 65 nm )
SODA ( 90 nm )
TI C 6 X
Imagine
VIRAM Pentium M
IBM Cell
P e r f o
r m a n
c e ( G
o p s )
Power ( Watts )
3 G Wireless
4 G Wireless
Mobile HD Video
5 5
5
The Applications
Is there anything we can learn from the applications themselves?
5
6 6
6
University of Michigan - ACAL
H.264 Basics
6
Inverse Transform
Standard Basis Patterns
Residual Data
5
3 2
Basis Coefficients
“Past” Frames CurrentFrame
Future Frames
Motion Compensation
Form Prediction
ReconstructedMacroblock
Current Uncoded Block
Dec
oded
Blo
ck Decoded Block
Prediction Based on Neighbors
Intra-Prediction
ReconstructedMacroblock
Deblocking FilterModules
Entropy decoder
Inverse Transform
MotionCompensation
Intra Prediction
Deblocking Filter
Algorithms
Context-adaptive VLD
4x4 integer kernel
Power Profiling*
Half: 6-tap filterQuarter: 6-tap / bilinear filterEighth: 4-tap filterDirectional spatial prediction
In-loop adaptive filter
4%
29%
10%
34%
Misc Kernels 23%
T.-A. Liu, T.-M. Lin, S. -Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp.119-126, Aug. 2006.
7 7
7
University of Michigan - ACAL
4G Wireless Basics
§ Three kernels make up the majority of the work § FFT – Extract Data from Signals § STBC – Combine Data into More Reliable Stream § LDPC – Error Correction on Data Stream
7
Ant A/D
FFTSymbol Timing / Guard Detect
Channel Estimator
MIMODecoder
(STBC)
Channel Decoder(LDPC)
Recovered Data
Ant D/A IFFTGuard Insertion
Channel Encoder(LDPC)
S/PTransmitted
Data
ModulatorDemodulator
8 8
8
University of Michigan - ACAL
Mobile Signal Processing Algorithm Characteristics
8
§ Algorithms have different SIMD widths § From very large to very small
§ Though SIMD width varies all algorithms can exploit it § Large percentage of work can be SIMDized
§ Larger SIMD width tend to have less TLP
Algorithm SIMD Scalar Overhead SIMD Width Amount Workload (%) Workload (%) Workload (%) (Elements) of TLP
4G FFT 75 5 20 1024 Low
STBC 81 5 14 4 High LDPC 49 18 33 96 Low
H.26
4 Deblocking Filter 72 13 15 8 Medium Intra-‐PredicMon 85 5 10 16 Medium Inverse Transform 80 5 15 8 High MoMon CompensaMon 75 5 10 8 High
SIMD comes at a cost! • Register File Power
• Data Movement/Alignment Cost SIMD architectures have to deal with this!
9 9
9
University of Michigan - ACAL Feb 15, 2004
Only the instructions shown in red are MMX computations. All other instructions are simply supporting these computations.
Pentium III – SIMD code for Discrete Cosine Transform (DCT) lea ebx, DWORD PTR [ebp+128] load/address overhead mov DWORD PTR [esp+28], ebx load/address overhead $B1$2: xor eax, eax address overhead move dx, ecx address overhead lea edi, DWORD PTR [ecx+16] load/address overhead mov DWORD PTR [esp+24], ecx load/address overhead $B1$3: movq mm1, MMWORD PTR [ebp] load overhead pxor mm0, mm0 initialization overhead pmaddwd mm1, MMWORD PTR [eax+esi] True Computation movq mm2, MMWORD PTR [ebp+8] load overhead pmaddwd mm2, MMWORD PTR [eax+esi+8] True Computation add eax, 16address overhead paddw mm1, mm0 True Computation paddw mm2, mm1 True Computation movq mm0, mm2 load related overhead psrlq mm2, 32 SIMD reduction overhead povd ecx, mm0 SIMD load overhead movd ebx, mm2 SIMD load overhead add ecx, ebx SIMD conversion Overhead mov WORD PTR [edx], cx store overhead add edx, 2 address overhead cmp edi, edx branch related overhead jg $B1$3 loop branch overhead $B1$4: move cx, DWORD PTR [esp+24] load/address overhead add ebp, 16 address overhead add ecx, 16 address overhead move ax, DWORD PTR [esp+28] load/address overhead cmp eax, ebp branch related overhead jg $B1$2 loop branch overhead
10 10
10
University of Michigan - ACAL 10
11 11
11
University of Michigan - ACAL 11
12 12
12
University of Michigan - ACAL 12
13 13
13
University of Michigan - ACAL
Traditional SIMD Power Breakdown
§ Register File Power consumes a lot of power in traditional 32-wide SIMD architecture
SODA WCDMA MEM 3%
REG 37%
ALU+MULT 11%
CONTROL 38%
INTERCONNECT 2%
SCALAR PIPE 9%
MEM REG ALU+MULT CONTROL INTERCONNECT SCALAR PIPE
14 14
14
University of Michigan - ACAL
Register File Access
14
§ Many of the register file access do not have to go back to the main register file
0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
100%
FFT STBC LDPC Deblocking Filter
Intra-‐PredicMon Inverse Transform
Register Access Bypass Read Bypass Write
Lots of power wasted on unneeded
register file access!
15 15
15
University of Michigan - ACAL
Instruction Pair Frequency
15
Like the Multiply-Accumulate (MAC) instruction there is opportunity to fuse other instructions
A few instruction pairs (3-5) make up the
majority of all instruction pairs!
a ) Intra - prediction and Deblocking Filter Combined
InstrucMon Pair Frequency 1 mul9ply-‐add 26.71% 2 add-‐add 13.74% 3 shuffle-‐add 8.54% 4 shiB right-‐add 6.90% 5 subtract-‐add 6.94% 6 add-‐shiB right 5.76% 7 mul9ply-‐subtract 4.00% 8 shiB right-‐subtract 3.75% 9 add-‐subtract 3.07% 10 Others 20.45%
b ) LDPC
InstrucMon Pair Frequency 1 shuffle-‐move 32.07% 2 abs-‐subtract 8.54% 3 move-‐subtract 8.54% 4 shuffle-‐subtract 3.54% 5 add-‐shuffle 3.54% 6 Others 43.77%
c ) FFT
InstrucMon Pair Frequency 1 shuffle-‐shuffle 16.67% 2 add-‐mul9ply 16.67% 3 mul9ply-‐subtract 16.67% 4 mul9ply-‐add 16.67% 5 subtract-‐mult 16.67% 6 shuffle-‐add 16.67%
16 16
16
University of Michigan - ACAL
Data Alignment Problem!
§ H.264 Intra-prediction has 9 different prediction modes § Each prediction mode requires a specific permutation
16
B C D
A B C D B C D A B C D A B C D
A B C D
A A A A B B B B C C C C D D D D
A B C D
A B C D B C D C
E
E D E
F G
GF D E F
A
A B C D E F G H I J
A B C D G H I JG H I J A B C D
A B C D E F G
A
A B C D E F G H I J
A B C D E F G H I J
A B C D E F G H I J
K L
K L
A B B C C D DD D D DE F F G G
CDE EF FG GHI J JK KLI
CE F G H I J KI J K L D E FD
F F FEG G HH I D E G C D E F0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
16
1644
16x16 luma MBprediction modes for each 4x4 block
a b c d
e f g h
i j k l
m n o p
I
J
K
L
A B C DX
Traditional SIMD machines take too long or cost too much to do this
Good news – small fixed number
patterns per kernel
Intra-‐PredicMon
17 17
17
University of Michigan - ACAL
Summary § Conclusion about 4G and H.264
§ Lots of different sized parallelism § From 4 wide to 96 wide to 1024 wide SIMD
§ Which means many different SIMD widths need to be supported § Very short lived values § Lots of potential for instruction fusings § Limited set of shuffle patterns required for each kernel
18 18
18
AnySP Design
18
19 19
19
University of Michigan - ACAL
Traditional SIMD Architectures
19
32-laneSIMDALU
SIMDRF
32-laneSSN
SIMDto
scalar
EX
WB
STV
VTS
scalarRF
16-bitALU
EX
WB
SIMDDataMEM
ScalarDataMEM
SIMD
Scalar
AGU
32-Wide SIMD with Simple Shuffle Network
20 20
20
University of Michigan - ACAL
CROSSBAR
Scalar Pipeline
Bank 0
Bank 1
Bank 2
Bank 15
L1ProgramMemory
Controller
Bank 3
Bank 4
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Scalar Memory Buffer
Multi-SIMD Datapath
AGU Group 0 Pipeline
AGU Group 7 Pipeline
AGU Group 1 Pipeline
Multi-Bank Local Memory
DMA
To Inter-PE
Bus
8 Groups of 8-wide
SIMD(64 Total Lanes)
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
AnySP PE
AnySP Architecture – High Level
20
8 Groups of 8-Wide Flexible Function Units
128x128 16bit Swizzle Network
16 Banked Memory with SRAM-based Crossbar
Multiple Output Adder Tree
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
21 21
21
University of Michigan - ACAL
Multi-Width Support
21
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Multi-SIMD Datapath
AGU Group 0
AGU Group 7
AGU Group 1
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
CROSSBAR
Bank 0
Bank 1
Bank 2
Bank 15
Bank 3
Bank 4
Multi-Bank Local Memory
16-bit 8-wide 16 entry
SIMD RF-38-wide SIMD
FFU-3
16-bit 8-wide 4-entry
Buffer-3
AGU Group 2
AGU Group 3
Normal 64-Wide SIMD mode – all lanes share one AGU
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Multi-SIMD Datapath
AGU Group 0
AGU Group 7
AGU Group 1
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
CROSSBAR
Bank 0
Bank 1
Bank 2
Bank 15
Bank 3
Bank 4
Multi-Bank Local Memory
16-bit 8-wide 16 entry
SIMD RF-38-wide SIMD
FFU-3
16-bit 8-wide 4-entry
Buffer-3
AGU Group 2
AGU Group 3
16-bit 8-wide 16 entry
SIMD RF-0
16-bit 8-wide 16 entry
SIMD RF-1
16-bit 8-wide 16 entry
SIMD RF-2
16-bit 8-wide 16 entry
SIMD RF-7
Multi-SIMD Datapath
AGU Group 0
AGU Group 7
AGU Group 1
8-wide SIMD FFU-0
8-wide SIMDFFU-1
8-wide SIMDFFU-2
8-wide SIMD FFU-7
SwizzleNetwork
16-bit 8-wide 4-entry
Buffer-0
16-bit 8-wide 4-entry
Buffer-1
16-bit 8-wide 4-entry
Buffer-2
16-bit 8-wide 4-entry
Buffer-7
Multi-Output Adder Tree
CROSSBAR
Bank 0
Bank 1
Bank 2
Bank 15
Bank 3
Bank 4
Multi-Bank Local Memory
16-bit 8-wide 16 entry
SIMD RF-38-wide SIMD
FFU-3
16-bit 8-wide 4-entry
Buffer-3
AGU Group 2
AGU Group 3
Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
22 22
22
University of Michigan - ACAL
AnySP FFU Datapath
22
Flexible Functional Unit allows us to
1. Exploit Pipeline-parallelism by joining two lanes together 2. Handle register bypass and the temporary buffer 3. Join multiple pipelines to process deeper subgraphs 4. Fuse Instruction Pairs
23 23
23
AnySP Results
23
24 24
24
University of Michigan - ACAL
Simulation Environment § Traditional SIMD architecture comparison
§ SODA at 90nm technology
§ AnySP § Synthesized at 90nm TSMC § Power, timing, area numbers were extracted
§ Performance and Power for each kernel was generated using synthesized data on in-house simulator
§ 4G – based on a NTT DoCoMo 4G test setup § H.264 – 4CIF@30fps
24
25 25
25
University of Michigan - ACAL
AnySP Speedup vs SIMD-based Architecture
§ For all benchmarks we perform more than 2x better than a SIMD-based architecture
25
0 . 0 0 . 5 1 . 0 1 . 5 2 . 0 2 . 5
Baseline 64 - Wide Multi - SIMD Swizzle Network Flexible Functional Unit Buffer + Bypass
N o r m
a l i z e
d S p e
e d u p
FFT 1024 pt Radix - 2
FFT 1024 pt Radix - 4
STBC LDPC H . 264 Intra
Prediction H . 264
Deblocking Filter
H . 264 Inverse
Transform H . 264
Motion Compensation
26 26
26
University of Michigan - ACAL
AnySP Energy-Delay vs SIMD-based Architecture
§ More importantly energy efficiency is much better!
26
0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0
N o r
m a l i z
e d E
n e r g
y - D e
l a y SIMD - based Architecture AnySP
FFT 1024 pt Radix - 2
FFT 1024 pt Radix - 4
STBC LDPC H . 264 Intra
Prediction H . 264
Deblocking Filter
H . 264 Inverse
Transform H . 264 Motion
Compenstation
27 27
27
University of Michigan - ACAL
PE
System
Total
Components Units SIMD Data Mem ( 32 KB )
SIMD Register File ( 16 x 1024 bit ) SIMD ALUs , Multipliers , and SSN
SIMD Pipeline + Clock + Routing
Intra - processor Interconnect Scalar / AGU Pipeline & Misc .
ARM ( Cortex - M 3 ) Global Scratchpad Memory ( 128 KB )
Inter - processor Bus with DMA 90 nm ( 1 V @ 300 MHz )
4 4 4 4
4 4 1 1 1
Area Area mm 2
Area %
9 . 76 38 . 78 % 3 . 17 12 . 59 % 4 . 50 17 . 88 % 1 . 18 4 . 69 %
0 . 94 3 . 73 % 1 . 22 4 . 85 %
0 . 6 2 . 38 % 1 . 8 7 . 15 % 1 . 0 3 . 97 %
25 . 17 100 % Est . 65 nm ( 0 . 9 V @ 300 MHz ) 13 . 14
45 nm ( 0 . 8 V @ 300 MHz ) 6 . 86
4 G + H . 264 Decoder Power
mW Power
% 102 . 88 7 . 24 % 299 . 00 21 . 05 % 448 . 51 31 . 58 % 233 . 60 16 . 45 %
93 . 44 6 . 58 % 134 . 32 9 . 46 %
2 . 5 < 1 % 10 < 1 %
1 . 5 < 1 % 1347 . 03 100 % 1091 . 09 862 . 09
SIMD Buffer ( 128 B ) SIMD Adder Tree
4 4
0 . 82 3 . 25 % 0 . 18 < 1 %
84 . 09 5 . 92 % 10 . 43 < 1 %
AnySP Power Breakdown
§ We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm
27
28 28
28
University of Michigan - ACAL
Conclusion & Future Work § Conclusion
§ We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform § Under the power budget and meeting the performance at 45nm
§ Future and Ongoing Work § Application-specific language § Larger class of algorithms for AnySP § Better utilization of resources for non-parallel kernels
§ Speedup sequential parts
28