july 1 – 3, 2009, braunschweig...diet-soda soda- - taking the sugar out of soda 1 2 2 low power...

1

1 1

1

University of Michigan – ARM June 09

Overview  AnySP

 SODA++  Increase the Application Domain of the Wireless

baseband architecture

 Diet-SODA  SODA- -

 Taking the sugar out of SODA

1

2 2

2

Low Power DSP Architectures

Trevor Mudge, Bredt Professor of Engineering, The University of Michigan, Ann Arbor

1st tubs.CITY Symposium July 1 – 3, 2009, Braunschweig

Mark Woh1, Sangwon Seo1, Ron Dreslinski, Geoff Blake, Scott Mahlke1,

Chaitali Chakrabarti2, Krisztian Flautner3

University of Michigan – ACAL1

Arizona State University2

ARM, Ltd.3

2

3 3

3


The Old Mobile Phone The Modern Mobile Phone

3

Video Recording

Video Editing

Higher Data Rates

Photos From - http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ http://www.apple.com/iphone

3D Rendering

Advanced Image Processing

•  Future phones are becoming more complex •  Richer applications require much more requirements

•  How do phones handle this now?

4 4

4


Inside Today’s Smart Phones

4

Inside the OMAP3430 Application Processor

Inside the X-Gold 608 (Representation of QCOM)

•  Modern phones are looking like Frankenchips!

•  Some cores unused and functionality duplicated

3

5 5

5


Cost for Multi-System Support

  Supporting multiple systems is reserved for the most expensive phones   Cost is in supporting all the systems that may or may not be used at once

5

Data gathered from - Ramacher, U. 2007. Software-Defined Radio Prospects for Multistandard Mobile Phones. Computer 40, 10 (Oct. 2007) - Finchelstein, D.F.; Sze, V.; Sinangil, M.E.; Koken, Y.; Chandrakasan, A.P., "A low-power 0.7-V H.264 720p video decoder," Solid-State Circuits Conference, 2008. A-SSCC '08.

Programmable Unified Architectures Provide

  Lower Cost   Faster Time to Market   Support for Multiple Applications (Current and Future)   Bug Fixes After Manufacturing

So where do we start?

6 6

6


View of the Unified Architecture World

6

3G

ISP

Video

WiFi

GSM

2D/3D

3G

ISP

Video

WiFi

GSM

2D/3D GSM

ISP

WiFi 3G

2D/3D

V i d e o

4

7 7

7


Power/Performance Requirements for Multiple Systems

7

Different applications have different power/performance characteristics!

We need to design keeping each application in mind! (Not GPP but Domain Specific Processor)

8 8

8

The Applications

Is there anything we can learn from the applications themselves?

8

5

9 9

9


H.264 Basics

9

T.-A. Liu, T.-M. Lin, S. -Z. Wang, et al. “A low-power dual-mode video decoder for mobile applications,” IEEE Communications Magazine, volume 44, issue 8, pp.119-126, Aug. 2006.

10 10

10


4G Wireless Basics

  Three kernels make up the majority of the work   FFT – Extract Data from Signals   STBC – Combine Data into More Reliable Stream   LDPC – Error Correction on Data Stream

10

6

11 11

11


Mobile Signal Processing Algorithm Characteristics

11

  Algorithms have different SIMD widths   From very large to very small

  Though SIMD width varies all algorithms can exploit it   Large percentage of work can be SIMDized

  Larger SIMD width tend to have less TLP

Algorithm SIMD Scalar Overhead SIMDWidth AmountWorkload(%) Workload(%) Workload(%) (Elements) ofTLP

4G FFT 75 5 20 1024 Low

STBC 81 5 14 4 HighLDPC 49 18 33 96 Low

H.264

DeblockingFilter 72 13 15 8 MediumIntra‐PredicMon 85 5 10 16 MediumInverseTransform 80 5 15 8 HighMoMonCompensaMon 75 5 10 8 High

SIMD comes at a cost! • Register File Power

• Data Movement/Alignment Cost SIMD architectures have to deal with this!

12 12

12


Traditional SIMD Power Breakdown

  Register File Power consumes a lot of power in traditional 32-wide SIMD architecture

7

13 13

13


Register File Access

13

 Many of the register file access do not have to go back to the main register file

0%10%20%30%40%50%60%70%80%90%

100%

FFT STBC LDPC DeblockingFilter

Intra‐PredicMon InverseTransform

RegisterAccess BypassRead BypassWrite

Lots of power wasted on unneeded register file access!

14 14

14


Instruction Pair Frequency

14

Like the Multiply-Accumulate (MAC) instruction there is opportunity to fuse other instructions

A few instruction pairs (3-5) make up the majority of all instruction pairs!

8

15 15

15


Data Alignment Problem!

  H.264 Intra-prediction has 9 different prediction modes   Each prediction mode requires a specific permutation

15

Traditional SIMD machines take too long or cost too much to do this

Good news – small fixed number patterns per kernel

Intra‐PredicMon

16 16

16


More Data Alignment Problems!

16

Adder tree can accelerate not only matrix operations!

Many different video kernels can be accelerated too!

InverseTransform

9

17 17

17


Even More Data Alignment!

  Techniques like 2D-Wave and 3D-Wave decoding for H.264 helps increase amount of parallelism but we have to be able to access different macroblocks for each parallel computation

17

C.H. Meenderinck, A. Azevedo, B.H.H. Juurlink, M. Alvarez, A. Ramirez, Parallel Scalability of Video Decoders, Journal of Signal Processing Systems, August 2008

Block based decoding requires us to access different locations of memory for each task

We cannot just rely of fetching contiguous sets of data

18 18

18


Summary   Conclusion about 4G and H.264

  Lots of different sized parallelism  From 4 wide to 96 wide to 1024 wide SIMD

 Which means many different SIMD widths need to be supported   Very short lived values   Lots of potential for instruction fusings   Limited set of shuffle patterns required for each kernel

10

19 19

19

AnySP Design

19

20 20

20


SODA SIMD Architecture

20

32-Wide SIMD with Simple Shuffle Network

11

21 21

21


AnySP Architecture – High Level

21

8 Groups of 8-Wide Flexible Function Units

128x128 16bit Swizzle Network

16 Banked Memory with SRAM-based Crossbar

Multiple Output Adder Tree

Temporary Buffer and Bypass Network

Datapath AGU and Scalar Pipeline

22 22

22


Multi-Width Support

22

Normal 64-Wide SIMD mode – all lanes share one AGU Each 8-wide SIMD Group works on different memory locations of the same 8-wide code – AGU Offsets

12

23 23

23


AnySP FFU Datapath

23

Flexible Functional Unit allows us to

1.  Exploit Pipeline-parallelism by joining two lanes together 2.  Handle register bypass and the temporary buffer 3.  Join multiple pipelines to process deeper subgraphs 4.  Fuse Instruction Pairs

24 24

24

AnySP Results

24

13

25 25

25


Simulation Environment   Traditional SIMD architecture comparison

  SODA at 90nm technology

  AnySP   Synthesized at 90nm TSMC   Power, timing, area numbers were extracted

  Kernels were hand written and optimized

  4G – based on a NTT DoCoMo 4G test setup   H.264 – 4CIF@30fps

25

26 26

26


AnySP Speedup vs SIMD-based Architecture

  For all benchmarks we perform more than 2x better than a SIMD-based architecture

26

14

27 27

27


AnySP Energy-Delay vs SIMD-based Architecture

 More importantly energy efficiency is much better!

27

28 28

28


AnySP Power Breakdown

 We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

28

15

29 29

29


Conclusion & Future Work  Conclusion

 We have presented an example architecture that could possibly meet the requirements of 100Mbps 4G and HD video on the same platform  Under the power budget and meeting the performance at 45nm

 Future and Ongoing Work  Application-specific language  Larger class of algorithms for AnySP  Better utilization of resources for non-parallel kernels

 Speedup sequential parts

29

30 30

30

Diet-SODA

30

16

31 31

31


Diet SODA  SODA, Ardbeg, AnySP may be too powerful for the

application  Simple Imaging processing for cameras  Audio processing for voice

 Lose flexibility and generality of Ardbeg, AnySP for performance at less # of gates

 Build a modular design which people can add SIMD groups and special function blocks to increase performance, at cost of area but allow voltage scaling

32 32

32


Histogram Equalization   Spreads out an unevenly distributed histogram   Increases contrast

17

33 33

33


Histogram Equalization for(i=0; i<length; i++) { for(j=0; j<width; j++){ k = image[i][j]; histogram[k] = histogram[k] + 1; } }

for(i=0; i<length; i++){ for(j=0; j<width; j++){ k = image[i][j]; image[i][j] = sum_of_h[k] * constant;

} }

Hard to parallelize because of histogram[k]

We can parallelize the load but we may suffer on the SIMD to scalar transfer

34 34

34


Basic Edge Detection   detect_edges : 8 directional masks ( Sobel ) + thresholding

18

35 35

35


Basic Edge Detection

Can be easily parallelized across multiple masks or across multiple images

35

for(a=-1; a<2; a++){ for(b=-1; b<2; b++){ sum = sum + image[i+a][j+b]

* mask_0[a+1][b+1]; } }

for(i=0; i<rows; i++){ for(j=0; j<cols; j++){ if(out_image[i][j] > high) { out_image[i][j] = new_hi; } else { out_image[i][j] = new_low; } } }

36 36

36


Basic Edge Detection Operations are applied on 3x3 matrices

1) Reduction tree (need a function of sum of 9 values) 2) Select max function 3) Threshold operation if (out_image > threshold) out_image = hi else out_image = low

19

37 37

37


Performance Estimate

  Estimate for how many cycle we need on 2048x1536 images 3.1 Megapixel

  Unoptimized SIMDized kernels assuming 16-wide SIMD  Most work is done on 9-wide (3x3) data elements

  Adding multiple SIMD groups can increase performance in most kernels

SequenMal(Mcycle) SIMDized(Mcycle)

HistogramEqualizaMon 75 75

EdgeDetecMon 3267 530

HomogeneityEdgeDetecMon 556 50

Filter 411 81

38 38

38


Modular Design Consideration  Build a highly application specific modular core

 Depending on which application supported user can add or remove specialized hardware

 Design requirement tradeoff  Smallest Area

 Use minimum number of SIMD groups or run SIMD groups faster

 Faster performance  Add more SIMD Groups

 Lower Power  Add more SIMD Groups and lower frequency and

lower VDD for same throughput 38

20

39 39

39

The End

39

july 1 – 3, 2009, braunschweig...diet-soda soda- - taking the sugar out of soda 1 2 2 low power...

Documents