AnySP: Anytime Anywhere Anyway Signal Processing

Mark Woh¹, Sangwon Seo¹, Scott Mahlke¹, Trevor Mudge¹, Chaitali Chakrabarti², Krisztian Flautner³

¹ University of Michigan – ACAL

² Arizona State University

³ ARM, Ltd.

The Old Mobile Phone vs. The Modern Mobile Phone

[Figure: photos of an older mobile phone and a modern smartphone. Labeled capabilities: Video Recording, Video Editing, Higher Data Rates, 3D Rendering, Advanced Image Processing.]

Photos from: http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ and http://www.apple.com/iphone

•  Future phones are becoming more complex
•  Richer applications impose far more demanding requirements

•  How do phones handle this now?

Inside Today’s Smart Phones

Inside the OMAP3430 Application Processor

Inside the X-Gold 608 (Representation of QCOM)

Power/Performance Requirements for Multiple Systems

Different applications have different power/performance characteristics!

We need to design keeping each application in mind!

(Not a general-purpose processor (GPP), but a domain-specific processor)

[Figure: performance (Gops, log scale from 1 to 10000) versus power (Watts, log scale from 0.1 to 100). Processors plotted: SODA (65 nm), SODA (90 nm), TI C6X, Imagine, VIRAM, Pentium M, IBM Cell. Application requirement regions: 3G Wireless, 4G Wireless, Mobile HD Video.]

The Applications

Is there anything we can learn from the applications themselves?

H.264 Basics

[Figure: H.264 decoder kernels.
 Inverse Transform: basis coefficients, standard basis patterns, residual data.
 Motion Compensation: "past" frames, current frame, future frames; form prediction; reconstructed macroblock.
 Intra-Prediction: current uncoded block, decoded blocks, prediction based on neighbors; reconstructed macroblock.
 Deblocking Filter.]

Modules               Algorithms
Entropy decoder       Context-adaptive VLD
Inverse Transform     4x4 integer kernel
Motion Compensation   Half: 6-tap filter; Quarter: 6-tap / bilinear filter; Eighth: 4-tap filter
Intra Prediction      Directional spatial prediction
Deblocking Filter     In-loop adaptive filter

Power Profiling*: 4%, 29%, 10%, 34%; Misc Kernels 23%

* T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al., "A low-power dual-mode video decoder for mobile applications," IEEE Communications Magazine, vol. 44, no. 8, pp. 119-126, Aug. 2006.

4G Wireless Basics

§  Three kernels make up the majority of the work
   §  FFT – Extract Data from Signals
   §  STBC – Combine Data into More Reliable Stream
   §  LDPC – Error Correction on Data Stream

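[Figure: 4G wireless transceiver block diagram.
 Receive path blocks: Ant, A/D, Symbol Timing / Guard Detect, FFT, Channel Estimator, MIMO Decoder (STBC), Demodulator, Channel Decoder (LDPC), Recovered Data.
 Transmit path blocks: Transmitted Data, Channel Encoder (LDPC), Modulator, S/P, IFFT, Guard Insertion, D/A, Ant.]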

Mobile Signal Processing Algorithm Characteristics

§  Algorithms have different SIMD widths
   §  From very large to very small
§  Though SIMD width varies, all algorithms can exploit it
   §  A large percentage of the work can be SIMDized
§  Algorithms with larger SIMD widths tend to have less thread-level parallelism (TLP)

Algorithm                SIMD          Scalar        Overhead      SIMD Width   Amount
                         Workload (%)  Workload (%)  Workload (%)  (Elements)   of TLP
4G
  FFT                    75            5             20            1024         Low
  STBC                   81            5             14            4            High
  LDPC                   49            18            33            96           Low
H.264
  Deblocking Filter      72            13            15            8            Medium
  Intra-Prediction       85            5             10            16           Medium
  Inverse Transform      80            5             15            8            High
  Motion Compensation    75            5             10            8            High

SIMD comes at a cost!
•  Register File Power
•  Data Movement/Alignment Cost
SIMD architectures have to deal with this!

University of Michigan - ACAL Feb 15, 2004

Only the instructions marked "True Computation" (shown in red on the original slide) are MMX computations. All other instructions simply support these computations.

Pentium III – SIMD code for Discrete Cosine Transform (DCT)

        lea      ebx, DWORD PTR [ebp+128]          ; load/address overhead
        mov      DWORD PTR [esp+28], ebx           ; load/address overhead
$B1$2:
        xor      eax, eax                          ; address overhead
        mov      edx, ecx                          ; address overhead
        lea      edi, DWORD PTR [ecx+16]           ; load/address overhead
        mov      DWORD PTR [esp+24], ecx           ; load/address overhead
$B1$3:
        movq     mm1, MMWORD PTR [ebp]             ; load overhead
        pxor     mm0, mm0                          ; initialization overhead
        pmaddwd  mm1, MMWORD PTR [eax+esi]         ; True Computation
        movq     mm2, MMWORD PTR [ebp+8]           ; load overhead
        pmaddwd  mm2, MMWORD PTR [eax+esi+8]       ; True Computation
        add      eax, 16                           ; address overhead
        paddw    mm1, mm0                          ; True Computation
        paddw    mm2, mm1                          ; True Computation
        movq     mm0, mm2                          ; load related overhead
        psrlq    mm2, 32                           ; SIMD reduction overhead
        movd     ecx, mm0                          ; SIMD load overhead
        movd     ebx, mm2                          ; SIMD load overhead
        add      ecx, ebx                          ; SIMD conversion overhead
        mov      WORD PTR [edx], cx                ; store overhead
        add      edx, 2                            ; address overhead
        cmp      edi, edx                          ; branch related overhead
        jg       $B1$3                             ; loop branch overhead
$B1$4:
        mov      ecx, DWORD PTR [esp+24]           ; load/address overhead
        add      ebp, 16                           ; address overhead
        add      ecx, 16                           ; address overhead
        mov      eax, DWORD PTR [esp+28]           ; load/address overhead
        cmp      eax, ebp                          ; branch related overhead
        jg       $B1$2                             ; loop branch overhead
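To put a number on the overhead, the annotations of the inner loop ($B1$3) in the DCT listing can be tallied. The sketch below simply transcribes the slide's per-instruction labels into a list; the helper name `useful_fraction` is introduced here for illustration and is not part of the original material.

```python
# Annotations of the inner loop ($B1$3) from the DCT listing, in program order.
# Each entry is (mnemonic, label as given on the slide).
INNER_LOOP = [
    ("movq", "load overhead"),
    ("pxor", "initialization overhead"),
    ("pmaddwd", "true computation"),
    ("movq", "load overhead"),
    ("pmaddwd", "true computation"),
    ("add", "address overhead"),
    ("paddw", "true computation"),
    ("paddw", "true computation"),
    ("movq", "load related overhead"),
    ("psrlq", "SIMD reduction overhead"),
    ("movd", "SIMD load overhead"),
    ("movd", "SIMD load overhead"),
    ("add", "SIMD conversion overhead"),
    ("mov", "store overhead"),
    ("add", "address overhead"),
    ("cmp", "branch related overhead"),
    ("jg", "loop branch overhead"),
]


def useful_fraction(instrs):
    """Return (useful, total, useful/total) for an annotated instruction list."""
    useful = sum(1 for _, kind in instrs if kind == "true computation")
    return useful, len(instrs), useful / len(instrs)


useful, total, fraction = useful_fraction(INNER_LOOP)
print(f"{useful} of {total} inner-loop instructions are true computation "
      f"({fraction:.0%})")
```

By this count, fewer than one in four instructions in the inner loop does arithmetic on the data; the rest is load, address, reduction, and branch support.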

Traditional SIMD Power Breakdown

§  The register file consumes a large share of total power in a traditional 32-wide SIMD architecture

[Figure: SODA WCDMA power breakdown. MEM 3%, REG 37%, ALU+MULT 11%, CONTROL 38%, INTERCONNECT 2%, SCALAR PIPE 9%]

Register File Access

§  Many register file accesses do not have to go back to the main register file

[Figure: share of register accesses (0% to 100%) per kernel, split into Register Access, Bypass Read, and Bypass Write. Kernels: FFT, STBC, LDPC, Deblocking Filter, Intra-Prediction, Inverse Transform.]

Lots of power wasted on unneeded register file access!
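One way such a breakdown could be measured is sketched below. The classification rules are assumptions made for illustration, not the authors' methodology: a read counts as a bypass read if its producer issued at most `window` instructions earlier, and a write counts as a bypass write if every consumer is that close and the register is later overwritten. The function name and trace format are hypothetical.

```python
def classify_accesses(trace, window=2):
    """Classify register accesses in a trace of (dest, [sources]) tuples.

    Assumed rules (illustrative only):
      - bypass read: the producer issued at most `window` instructions ago
      - bypass write: all consumers are within `window` of the producer
        and the register is redefined later, so the value is dead
    """
    last_write = {}      # register -> index of its most recent producer
    consumers = {}       # producer index -> indices of consuming instructions
    counts = {"bypass_read": 0, "rf_read": 0, "bypass_write": 0, "rf_write": 0}

    for i, (dest, srcs) in enumerate(trace):
        for src in srcs:
            producer = last_write.get(src)
            if producer is not None and i - producer <= window:
                counts["bypass_read"] += 1
            else:
                counts["rf_read"] += 1
            if producer is not None:
                consumers.setdefault(producer, []).append(i)
        last_write[dest] = i

    for p, (dest, _) in enumerate(trace):
        uses = consumers.get(p, [])
        redefined = any(d == dest for d, _ in trace[p + 1:])
        short_lived = bool(uses) and all(u - p <= window for u in uses)
        if short_lived and redefined:
            counts["bypass_write"] += 1
        else:
            counts["rf_write"] += 1
    return counts


# Tiny hypothetical trace: t is produced, consumed once, then overwritten.
trace = [
    ("t", ["a", "b"]),
    ("u", ["t", "c"]),
    ("t", ["u", "u"]),
    ("v", ["t", "a"]),
]
counts = classify_accesses(trace)
print(counts)
```

In this toy trace half of the reads could be served without touching the main register file, which is the kind of short-lived value the slide refers to.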

Instruction Pair Frequency

Like the Multiply-Accumulate (MAC) instruction, there is opportunity to fuse other instructions

A few instruction pairs (3-5) make up the majority of all instruction pairs!

a) Intra-prediction and Deblocking Filter Combined

      Instruction Pair         Frequency
  1   multiply-add             26.71%
  2   add-add                  13.74%
  3   shuffle-add              8.54%
  4   shift right-add          6.90%
  5   subtract-add             6.94%
  6   add-shift right          5.76%
  7   multiply-subtract        4.00%
  8   shift right-subtract     3.75%
  9   add-subtract             3.07%
 10   Others                   20.45%

b) LDPC

      Instruction Pair         Frequency
  1   shuffle-move             32.07%
  2   abs-subtract             8.54%
  3   move-subtract            8.54%
  4   shuffle-subtract         3.54%
  5   add-shuffle              3.54%
  6   Others                   43.77%

c) FFT

      Instruction Pair         Frequency
  1   shuffle-shuffle          16.67%
  2   add-multiply             16.67%
  3   multiply-subtract        16.67%
  4   multiply-add             16.67%
  5   subtract-mult            16.67%
  6   shuffle-add              16.67%
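A frequency table of this kind can be produced by counting producer-consumer opcode pairs in an instruction trace. The sketch below assumes that a "pair" means an instruction together with the instruction that consumes its result; the slides do not spell out the definition, and the trace format and function name are hypothetical.

```python
from collections import Counter


def pair_frequencies(trace):
    """Return the relative frequency of producer-consumer opcode pairs.

    trace: list of (opcode, dest, [sources]) tuples in program order.
    A pair "p-c" is counted whenever an instruction with opcode c reads a
    register last written by an instruction with opcode p (assumed definition).
    """
    producer = {}        # register -> opcode of its most recent writer
    pairs = Counter()
    for opcode, dest, srcs in trace:
        for src in set(srcs):            # count each source register once
            if src in producer:
                pairs[f"{producer[src]}-{opcode}"] += 1
        producer[dest] = opcode
    total = sum(pairs.values())
    return {pair: n / total for pair, n in pairs.items()} if total else {}


# Small hypothetical trace of a multiply-accumulate style computation.
trace = [
    ("multiply", "t0", ["a", "b"]),
    ("add", "t1", ["t0", "c"]),
    ("multiply", "t2", ["d", "e"]),
    ("add", "t3", ["t2", "t1"]),
    ("shift right", "t4", ["t3"]),
]
freqs = pair_frequencies(trace)
for pair, share in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{pair:20s} {share:.2%}")
```

Pairs that dominate such a count are the natural candidates for fused operations, in the same way that multiply-add became the MAC instruction.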

Data Alignment Problem!

§  H.264 Intra-prediction has 9 different prediction modes
§  Each prediction mode requires a specific permutation

[Figure: H.264 Intra-Prediction. 16x16 luma MB: prediction modes for each 4x4 block (blocks numbered 0 to 15). A 4x4 block of pixels a to p is shown with its neighboring pixels A to L and X, together with the permutation of neighbor values each mode needs, for example A B C D repeated on every row, or A A A A / B B B B / C C C C / D D D D.]

Traditional SIMD machines take too long or cost too much to do this

Good news: a small, fixed number of patterns per kernel
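The two simplest permutations visible in the figure can be written as index patterns applied by a generic shuffle. This is a minimal sketch: the pattern names `repeat_row` and `replicate_each` are hypothetical labels, and only these two patterns are taken from the slide; the remaining prediction modes are not reconstructed here.

```python
def swizzle(src, pattern):
    """Generic permutation: output lane i takes input lane pattern[i]."""
    return [src[i] for i in pattern]


# A small fixed table of patterns, as a swizzle network could store per kernel.
PATTERNS = {
    # A B C D on every row of the 4x4 block
    "repeat_row": [0, 1, 2, 3] * 4,
    # A A A A / B B B B / C C C C / D D D D
    "replicate_each": [i for i in range(4) for _ in range(4)],
}

neighbors = ["A", "B", "C", "D"]
rows = "".join(swizzle(neighbors, PATTERNS["repeat_row"]))
cols = "".join(swizzle(neighbors, PATTERNS["replicate_each"]))
print(rows)
print(cols)
```

Because each kernel needs only a handful of such patterns, they can be stored as fixed configurations instead of being rebuilt from sequences of general shuffle instructions.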

Summary

§  Conclusions about 4G and H.264
   §  Parallelism comes in many different sizes
      §  From 4-wide to 96-wide to 1024-wide SIMD
      §  So many different SIMD widths need to be supported
   §  Very short-lived values
   §  Lots of potential for instruction fusion
   §  Limited set of shuffle patterns required for each kernel

AnySP Design

Traditional SIMD Architectures

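[Figure: traditional SIMD processing element. SIMD pipeline: SIMD Data MEM, SIMD RF, 32-lane SIMD ALU, 32-lane SSN, EX, WB. Scalar pipeline: Scalar Data MEM, scalar RF, 16-bit ALU, EX, WB. Also shown: AGU, STV, and VTS (SIMD to scalar) units.]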

32-Wide SIMD with Simple Shuffle Network

AnySP Architecture – High Level

[Figure: AnySP PE block diagram.
 Multi-Bank Local Memory: Bank 0 to Bank 15, connected through a crossbar; DMA to the inter-PE bus.
 Multi-SIMD Datapath: 8 groups of 8-wide SIMD (64 total lanes). Each group has a 16-bit 8-wide 16-entry SIMD RF (RF-0 to RF-7), an 8-wide SIMD FFU (FFU-0 to FFU-7), and a 16-bit 8-wide 4-entry buffer (Buffer-0 to Buffer-7).
 Also shown: Swizzle Network, Multi-Output Adder Tree, AGU group pipelines, Scalar Pipeline, Scalar Memory Buffer, L1 Program Memory, Controller.]

8 Groups of 8-Wide Flexible Function Units

128x128 16-bit Swizzle Network

16-Bank Memory with SRAM-based Crossbar

Multiple Output Adder Tree

Temporary Buffer and Bypass Network

Datapath AGU and Scalar Pipeline

Multi-Width Support

[Figure: multi-SIMD datapath (SIMD RF-0 to RF-7, 8-wide SIMD FFU-0 to FFU-7, Buffer-0 to Buffer-7, swizzle network) with AGU groups 0 to 7 and a crossbar to memory banks 0 to 15.]

Normal 64-Wide SIMD mode – all lanes share one AGU


[Figure: the same multi-SIMD datapath, with each AGU group addressing the memory banks separately.]

Each 8-wide SIMD group runs the same 8-wide code on different memory locations, selected by AGU offsets
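The two addressing modes can be illustrated with a small functional model. This is a sketch under assumptions: memory is a flat list, the normal 64-wide mode is modeled as eight contiguous 8-element slices from one base address, and the function and variable names are hypothetical.

```python
def run_groups(memory, kernel, group_offsets, width=8):
    """Apply the same `width`-wide kernel once per SIMD group.

    Each group reads `width` consecutive elements starting at its own
    AGU offset, so the groups share code but not data.
    """
    return [kernel(memory[base:base + width]) for base in group_offsets]


def double(lanes):
    """Stand-in for an 8-wide kernel."""
    return [2 * x for x in lanes]


memory = list(range(64))

# Normal 64-wide mode (modeled): one base address, groups take adjacent slices.
contiguous = [8 * g for g in range(8)]
wide = run_groups(memory, double, contiguous)

# Multi-width mode: each group gets its own offset and works on
# unrelated locations while executing the same 8-wide code.
scattered = [0, 16, 32, 48]
narrow = run_groups(memory, double, scattered)
print(narrow[1])
```

The kernel itself never changes between the two calls; only the list of offsets does, which is how narrow-width algorithms can still keep many lanes busy.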

AnySP FFU Datapath

The Flexible Functional Unit (FFU) allows us to:

1.  Exploit pipeline parallelism by joining two lanes together
2.  Handle register bypass and the temporary buffer
3.  Join multiple pipelines to process deeper subgraphs
4.  Fuse instruction pairs

AnySP Results

Simulation Environment

§  Traditional SIMD architecture comparison
   §  SODA at 90nm technology
§  AnySP
   §  Synthesized at 90nm TSMC
   §  Power, timing, and area numbers were extracted
§  Performance and power for each kernel were generated using the synthesized data in an in-house simulator
§  4G – based on an NTT DoCoMo 4G test setup
§  H.264 – 4CIF@30fps

AnySP Speedup vs SIMD-based Architecture

§  On every benchmark, AnySP performs more than 2x better than a SIMD-based architecture

[Figure: normalized speedup (axis 0.0 to 2.5) per kernel: FFT 1024 pt Radix-2, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, H.264 Motion Compensation. Legend: Baseline, 64-Wide Multi-SIMD, Swizzle Network, Flexible Functional Unit, Buffer + Bypass.]

AnySP Energy-Delay vs SIMD-based Architecture

§  More importantly, energy efficiency is much better!

[Figure: normalized energy-delay (axis 0.0 to 1.0), SIMD-based Architecture versus AnySP, per kernel: STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, H.264 Motion Compensation.]

AnySP Power Breakdown

4G + H.264 Decoder

         Components                            Units   Area (mm2)   Area %    Power (mW)   Power %
PE       SIMD Data Mem (32 KB)                 4       9.76         38.78%    102.88       7.24%
         SIMD Register File (16 x 1024 bit)    4       3.17         12.59%    299.00       21.05%
         SIMD ALUs, Multipliers, and SSN       4       4.50         17.88%    448.51       31.58%
         SIMD Pipeline + Clock + Routing       4       1.18         4.69%     233.60       16.45%
         SIMD Buffer (128 B)                   4       0.82         3.25%     84.09        5.92%
         SIMD Adder Tree                       4       0.18         < 1%      10.43        < 1%
         Intra-processor Interconnect          4       0.94         3.73%     93.44        6.58%
         Scalar / AGU Pipeline & Misc.         4       1.22         4.85%     134.32       9.46%
System   ARM (Cortex-M3)                       1       0.6          2.38%     2.5          < 1%
         Global Scratchpad Memory (128 KB)     1       1.8          7.15%     10           < 1%
         Inter-processor Bus with DMA          1       1.0          3.97%     1.5          < 1%
Total    90 nm (1 V @ 300 MHz)                         25.17        100%      1347.03      100%
         Est. 65 nm (0.9 V @ 300 MHz)                  13.14                  1091.09
         45 nm (0.8 V @ 300 MHz)                       6.86                   862.09

§ We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm

Conclusion & Future Work

§  Conclusion
   §  We have presented an example architecture that could potentially meet the requirements of 100 Mbps 4G and HD video on the same platform
      §  Within the power budget while meeting the performance target at 45nm
§  Future and Ongoing Work
   §  Application-specific language
   §  Larger class of algorithms for AnySP
   §  Better utilization of resources for non-parallel kernels
      §  Speed up sequential parts
