july 1 – 3, 2009, braunschweig...diet-soda soda- - taking the sugar out of soda 1 2 2 low power...
TRANSCRIPT
1
1 1
1
University of Michigan – ARM June 09
Overview AnySP
SODA++ Increase the Application Domain of the Wireless
baseband architecture
Diet-SODA SODA- -
Taking the sugar out of SODA
1
2 2
2
Low Power DSP Architectures
Trevor Mudge, Bredt Professor of Engineering, The University of Michigan, Ann Arbor
1st tubs.CITY Symposium July 1 – 3, 2009, Braunschweig
Mark Woh1, Sangwon Seo1, Ron Dreslinski, Geoff Blake, Scott Mahlke1,
Chaitali Chakrabarti2, Krisztian Flautner3
University of Michigan – ACAL1
Arizona State University2
ARM, Ltd.3
2
3 3
3
University of Michigan – ARM June 09
The Old Mobile Phone The Modern Mobile Phone
3
Video Recording
Video Editing
Higher Data Rates
Photos From - http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ http://www.apple.com/iphone
3D Rendering
Advanced Image Processing
• Future phones are becoming more complex • Richer applications require much more requirements
• How do phones handle this now?
4 4
4
University of Michigan – ARM June 09
Inside Today’s Smart Phones
4
Inside the OMAP3430 Application Processor
Inside the X-Gold 608 (Representation of QCOM)
• Modern phones are looking like Frankenchips!
• Some cores unused and functionality duplicated
3
5 5
5
University of Michigan – ARM June 09
Cost for Multi-System Support
Supporting multiple systems is reserved for the most expensive phones Cost is in supporting all the systems that may or may not be used at once
5
Data gathered from - Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007) - Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.
Programmable Unified Architectures Provide
Lower Cost Faster Time to Market Support for Multiple Applications (Current and Future) Bug Fixes After Manufacturing
So where do we start?
6 6
6
University of Michigan – ARM June 09
View of the Unified Architecture World
6
3G
ISP
Video
WiFi
GSM
2D/3D
3G
ISP
Video
WiFi
GSM
2D/3D GSM
ISP
WiFi 3G
2D/3D
V i d e o
4
7 7
7
University of Michigan – ARM June 09
Power/Performance Requirements for Multiple Systems
7
Different applications have different power/performance characteristics!
We need to design keeping each application in mind! (Not GPP but Domain Specific Processor)
8 8
8
The Applications
Is there anything we can learn from the applications themselves?
8
5
9 9
9
University of Michigan – ARM June 09
H.264 Basics
9
T.-A. Liu, T.-M. Lin, S. -Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp.119-126, Aug. 2006.
10 10
10
University of Michigan – ARM June 09
4G Wireless Basics
Three kernels make up the majority of the work FFT – Extract Data from Signals STBC – Combine Data into More Reliable Stream LDPC – Error Correction on Data Stream
10
6
11 11
11
University of Michigan – ARM June 09
Mobile Signal Processing Algorithm Characteristics
11
Algorithms have different SIMD widths From very large to very small
Though SIMD width varies all algorithms can exploit it Large percentage of work can be SIMDized
Larger SIMD width tend to have less TLP
Algorithm SIMD Scalar Overhead SIMDWidth AmountWorkload(%) Workload(%) Workload(%) (Elements) ofTLP
4G FFT 75 5 20 1024 Low
STBC 81 5 14 4 HighLDPC 49 18 33 96 Low
H.264
DeblockingFilter 72 13 15 8 MediumIntra‐PredicMon 85 5 10 16 MediumInverseTransform 80 5 15 8 HighMoMonCompensaMon 75 5 10 8 High
SIMD comes at a cost! • Register File Power
• Data Movement/Alignment Cost SIMD architectures have to deal with this!
12 12
12
University of Michigan – ARM June 09
Traditional SIMD Power Breakdown
Register File Power consumes a lot of power in traditional 32-wide SIMD architecture
7
13 13
13
University of Michigan – ARM June 09
Register File Access
13
Many of the register file access do not have to go back to the main register file
0%10%20%30%40%50%60%70%80%90%
100%
FFT STBC LDPC DeblockingFilter
Intra‐PredicMon InverseTransform
RegisterAccess BypassRead BypassWrite
Lots of power wasted on unneeded register file access!
14 14
14
University of Michigan – ARM June 09
Instruction Pair Frequency
14
Like the Multiply-Accumulate (MAC) instruction there is opportunity to fuse other instructions
A few instruction pairs (3-5) make up the majority of all instruction pairs!
8
15 15
15
University of Michigan – ARM June 09
Data Alignment Problem!
H.264 Intra-prediction has 9 different prediction modes Each prediction mode requires a specific permutation
15
Traditional SIMD machines take too long or cost too much to do this
Good news – small fixed number patterns per kernel
Intra‐PredicMon
16 16
16
University of Michigan – ARM June 09
More Data Alignment Problems!
16
Adder tree can accelerate not only matrix operations!
Many different video kernels can be accelerated too!
InverseTransform
9
17 17
17
University of Michigan – ARM June 09
Even More Data Alignment!
Techniques like 2D-Wave and 3D-Wave decoding for H.264 helps increase amount of parallelism but we have to be able to access different macroblocks for each parallel computation
17
C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, Journal of Signal Processing Systems, August 2008
Block based decoding requires us to access different locations of memory for each task
We cannot just rely of fetching contiguous sets of data
18 18
18
University of Michigan – ARM June 09
Summary Conclusion about 4G and H.264
Lots of different sized parallelism From 4 wide to 96 wide to 1024 wide SIMD
Which means many different SIMD widths need to be supported Very short lived values Lots of potential for instruction fusings Limited set of shuffle patterns required for each kernel
10
19 19
19
AnySP Design
19
20 20
20
University of Michigan – ARM June 09
SODA SIMD Architecture
20
32-Wide SIMD with Simple Shuffle Network
11
21 21
21
University of Michigan – ARM June 09
AnySP Architecture – High Level
21
8 Groups of 8-Wide Flexible Function Units
128x128 16bit Swizzle Network
16 Banked Memory with SRAM-based Crossbar
Multiple Output Adder Tree
Temporary Buffer and Bypass Network
Datapath AGU and Scalar Pipeline
22 22
22
University of Michigan – ARM June 09
Multi-Width Support
22
Normal 64-Wide SIMD mode – all lanes share one AGU Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets
12
23 23
23
University of Michigan – ARM June 09
AnySP FFU Datapath
23
Flexible Functional Unit allows us to
1. Exploit Pipeline-parallelism by joining two lanes together 2. Handle register bypass and the temporary buffer 3. Join multiple pipelines to process deeper subgraphs 4. Fuse Instruction Pairs
24 24
24
AnySP Results
24
13
25 25
25
University of Michigan – ARM June 09
Simulation Environment Traditional SIMD architecture comparison
SODA at 90nm technology
AnySP Synthesized at 90nm TSMC Power, timing, area numbers were extracted
Kernels were hand written and optimized
4G – based on a NTT DoCoMo 4G test setup H.264 – 4CIF@30fps
25
26 26
26
University of Michigan – ARM June 09
AnySP Speedup vs SIMD-based Architecture
For all benchmarks we perform more than 2x better than a SIMD-based architecture
26
14
27 27
27
University of Michigan – ARM June 09
AnySP Energy-Delay vs SIMD-based Architecture
More importantly energy efficiency is much better!
27
28 28
28
University of Michigan – ARM June 09
AnySP Power Breakdown
We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm
28
15
29 29
29
University of Michigan – ARM June 09
Conclusion & Future Work Conclusion
We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform Under the power budget and meeting the performance at 45nm
Future and Ongoing Work Application-specific language Larger class of algorithms for AnySP Better utilization of resources for non-parallel kernels
Speedup sequential parts
29
30 30
30
Diet-SODA
30
16
31 31
31
University of Michigan – ARM June 09
Diet SODA SODA, Ardbeg, AnySP may be too powerful for the
application Simple Imaging processing for cameras Audio processing for voice
Lose flexibility and generality of Ardbeg, AnySP for performance at less # of gates
Build a modular design which people can add SIMD groups and special function blocks to increase performance, at cost of area but allow voltage scaling
32 32
32
University of Michigan – ARM June 09
Histogram Equalization Spreads out an unevenly distributed histogram Increases contrast
17
33 33
33
University of Michigan – ARM June 09
Histogram Equalization for(i=0; i<length; i++) { for(j=0; j<width; j++){ k = image[i][j]; histogram[k] = histogram[k] + 1; } }
for(i=0; i<length; i++){ for(j=0; j<width; j++){ k = image[i][j]; image[i][j] = sum_of_h[k] * constant;
} }
Hard to parallelize because of histogram[k]
We can parallelize the load but we may suffer on the SIMD to scalar transfer
34 34
34
University of Michigan – ARM June 09
Basic Edge Detection detect_edges : 8 directional masks ( Sobel ) + thresholding
18
35 35
35
University of Michigan – ARM June 09
Basic Edge Detection
Can be easily parallelized across multiple masks or across multiple images
35
for(a=-1; a<2; a++){ for(b=-1; b<2; b++){ sum = sum + image[i+a][j+b]
* mask_0[a+1][b+1]; } }
for(i=0; i<rows; i++){ for(j=0; j<cols; j++){ if(out_image[i][j] > high) { out_image[i][j] = new_hi; } else { out_image[i][j] = new_low; } } }
36 36
36
University of Michigan – ARM June 09
Basic Edge Detection Operations are applied on 3x3 matrices
1) Reduction tree (need a function of sum of 9 values) 2) Select max function 3) Threshold operation if (out_image > threshold) out_image = hi else out_image = low
19
37 37
37
University of Michigan – ARM June 09
Performance Estimate
Estimate for how many cycle we need on 2048x1536 images 3.1 Megapixel
Unoptimized SIMDized kernels assuming 16-wide SIMD Most work is done on 9-wide (3x3) data elements
Adding multiple SIMD groups can increase performance in most kernels
SequenMal(Mcycle) SIMDized(Mcycle)
HistogramEqualizaMon 75 75
EdgeDetecMon 3267 530
HomogeneityEdgeDetecMon 556 50
Filter 411 81
38 38
38
University of Michigan – ARM June 09
Modular Design Consideration Build a highly application specific modular core
Depending on which application supported user can add or remove specialized hardware
Design requirement tradeoff Smallest Area
Use minimum number of SIMD groups or run SIMD groups faster
Faster performance Add more SIMD Groups
Lower Power Add more SIMD Groups and lower frequency and
lower VDD for same throughput 38
20
39 39
39
The End
39