
AnySP: Anytime Anywhere Anyway Signal Processing

Mark Woh (1), Sangwon Seo (1), Scott Mahlke (1), Trevor Mudge (1), Chaitali Chakrabarti (2), Krisztian Flautner (3)

(1) University of Michigan – ACAL
(2) Arizona State University
(3) ARM, Ltd.


The Old Mobile Phone vs. The Modern Mobile Phone

Modern phone features: Video Recording, Video Editing, Higher Data Rates, 3D Rendering, Advanced Image Processing

Photos from http://www.engadget.com/2009/06/10/iphone-3g-s-supports-opengl-es-2-0-but-3g-only-supports-1-1/ and http://www.apple.com/iphone

•  Future phones are becoming more complex
•  Richer applications impose far more demanding requirements
•  How do phones handle this now?


Inside Today’s Smart Phones


Inside the OMAP3430 Application Processor

Inside the X-Gold 608 (Representation of QCOM)


Power/Performance Requirements for Multiple Systems

Different applications have different power/performance characteristics!

We need to design with each application in mind!

(Not a general-purpose processor (GPP) but a domain-specific processor)

[Figure: Performance (Gops, log scale from 1 to 10000) versus Power (Watts, log scale from 0.1 to 100). Plotted processors: SODA (65 nm), SODA (90 nm), TI C6X, Imagine, VIRAM, Pentium M, IBM Cell. Requirement regions: 3G Wireless, 4G Wireless, Mobile HD Video.]


The Applications

Is there anything we can learn from the applications themselves?



H.264 Basics

[Figure: H.264 decoding stages, with the labels recovered from each panel]
Inverse Transform: Standard Basis Patterns, Basis Coefficients, Residual Data
Motion Compensation: "Past" Frames, Current Frame, Future Frames, Form Prediction, Reconstructed Macroblock
Intra-Prediction: Current Uncoded Block, Decoded Block, Prediction Based on Neighbors, Reconstructed Macroblock
Deblocking Filter

Modules               Algorithms
Entropy decoder       Context-adaptive VLD
Inverse Transform     4x4 integer kernel
Motion Compensation   Half: 6-tap filter; Quarter: 6-tap / bilinear filter; Eighth: 4-tap filter
Intra Prediction      Directional spatial prediction
Deblocking Filter     In-loop adaptive filter

Power Profiling*: 4%, 29%, 10%, 34%; Misc Kernels 23%

* T.-A. Liu, T.-M. Lin, S.-Z. Wang, et al., "A low-power dual-mode video decoder for mobile applications," IEEE Communications Magazine, vol. 44, no. 8, pp. 119-126, Aug. 2006.
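To make the "4x4 integer kernel" entry concrete, here is a minimal Python sketch of the standard H.264 4x4 inverse core transform, which uses only adds, subtracts, and shifts. It assumes the input coefficients are already dequantized and scaled; the function names are illustrative and do not come from the slides.

```python
def _butterfly(d0, d1, d2, d3):
    """One 4-point inverse-transform pass using only add, subtract, and shift."""
    e0 = d0 + d2
    e1 = d0 - d2
    e2 = (d1 >> 1) - d3
    e3 = d1 + (d3 >> 1)
    return (e0 + e3, e1 + e2, e1 - e2, e0 - e3)


def inverse_transform_4x4(coeffs):
    """Apply the row pass, then the column pass, then round with (x + 32) >> 6.

    coeffs is a 4x4 list of already-scaled integer coefficients (assumption).
    """
    rows = [_butterfly(*row) for row in coeffs]
    # cols[c][r] is the value at row r, column c after the column pass.
    cols = [_butterfly(*(rows[r][c] for r in range(4))) for c in range(4)]
    return [[(cols[c][r] + 32) >> 6 for c in range(4)] for r in range(4)]


# A block with only a DC coefficient reconstructs to a flat residual.
dc_only = [[64, 0, 0, 0]] + [[0, 0, 0, 0] for _ in range(3)]
flat = inverse_transform_4x4(dc_only)
```

Each row and column pass works on four elements at a time, which is one way to see why the slides list this kernel with a small SIMD width.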


4G Wireless Basics

§  Three kernels make up the majority of the work
   §  FFT – Extract Data from Signals
   §  STBC – Combine Data into More Reliable Stream
   §  LDPC – Error Correction on Data Stream

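[Figure: 4G transceiver block diagram. Receive path blocks: antenna, A/D, symbol timing / guard detect, FFT, channel estimator, demodulator, MIMO decoder (STBC), channel decoder (LDPC), recovered data. Transmit path blocks: transmitted data, channel encoder (LDPC), modulator, S/P, IFFT, guard insertion, D/A, antenna.]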


Mobile Signal Processing Algorithm Characteristics

§  Algorithms have different SIMD widths
   §  From very large to very small
§  Though SIMD width varies, all algorithms can exploit it
   §  A large percentage of the work can be SIMDized
§  Algorithms with larger SIMD widths tend to have less thread-level parallelism (TLP)

Application  Algorithm            SIMD Workload (%)  Scalar Workload (%)  Overhead Workload (%)  SIMD Width (Elements)  Amount of TLP
4G           FFT                  75                 5                    20                     1024                   Low
4G           STBC                 81                 5                    14                     4                      High
4G           LDPC                 49                 18                   33                     96                     Low
H.264        Deblocking Filter    72                 13                   15                     8                      Medium
H.264        Intra-Prediction     85                 5                    10                     16                     Medium
H.264        Inverse Transform    80                 5                    15                     8                      High
H.264        Motion Compensation  75                 5                    10                     8                      High

SIMD comes at a cost!
•  Register File Power
•  Data Movement/Alignment Cost
SIMD architectures have to deal with this!
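The spread of SIMD widths in the table can be turned into a rough lane-utilization estimate. The Python sketch below assumes a simplified model that is not from the slides: one vector operation of the algorithm's native width is mapped onto a fixed-width machine, with no packing of independent threads into idle lanes. The function name and the model are illustrative.

```python
import math

# Native SIMD widths (in elements), taken from the table above.
NATIVE_WIDTH = {
    "FFT": 1024,
    "STBC": 4,
    "LDPC": 96,
    "Deblocking Filter": 8,
    "Intra-Prediction": 16,
    "Inverse Transform": 8,
    "Motion Compensation": 8,
}


def lane_utilization(native_width, machine_width):
    """Fraction of lanes doing useful work under the simplified model."""
    passes = math.ceil(native_width / machine_width)
    return native_width / (passes * machine_width)


# Compare a 32-wide machine with a single 8-wide group.
for name, width in NATIVE_WIDTH.items():
    print(f"{name:20s} 32-wide: {lane_utilization(width, 32):6.1%}   "
          f"8-wide: {lane_utilization(width, 8):6.1%}")
```

Under this model the narrow kernels (widths 4, 8, and 16) leave most of a 32-wide datapath idle, while the wide kernels (96 and 1024) fill either machine.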


University of Michigan - ACAL Feb 15, 2004

Only the instructions marked "True Computation" (shown in red on the slide) are MMX computations. All other instructions are simply supporting these computations.

Pentium III – SIMD code for Discrete Cosine Transform (DCT)

        lea     ebx, DWORD PTR [ebp+128]        ; load/address overhead
        mov     DWORD PTR [esp+28], ebx         ; load/address overhead
$B1$2:
        xor     eax, eax                        ; address overhead
        mov     edx, ecx                        ; address overhead
        lea     edi, DWORD PTR [ecx+16]         ; load/address overhead
        mov     DWORD PTR [esp+24], ecx         ; load/address overhead
$B1$3:
        movq    mm1, MMWORD PTR [ebp]           ; load overhead
        pxor    mm0, mm0                        ; initialization overhead
        pmaddwd mm1, MMWORD PTR [eax+esi]       ; True Computation
        movq    mm2, MMWORD PTR [ebp+8]         ; load overhead
        pmaddwd mm2, MMWORD PTR [eax+esi+8]     ; True Computation
        add     eax, 16                         ; address overhead
        paddw   mm1, mm0                        ; True Computation
        paddw   mm2, mm1                        ; True Computation
        movq    mm0, mm2                        ; load related overhead
        psrlq   mm2, 32                         ; SIMD reduction overhead
        movd    ecx, mm0                        ; SIMD load overhead
        movd    ebx, mm2                        ; SIMD load overhead
        add     ecx, ebx                        ; SIMD conversion overhead
        mov     WORD PTR [edx], cx              ; store overhead
        add     edx, 2                          ; address overhead
        cmp     edi, edx                        ; branch related overhead
        jg      $B1$3                           ; loop branch overhead
$B1$4:
        mov     ecx, DWORD PTR [esp+24]         ; load/address overhead
        add     ebp, 16                         ; address overhead
        add     ecx, 16                         ; address overhead
        mov     eax, DWORD PTR [esp+28]         ; load/address overhead
        cmp     eax, ebp                        ; branch related overhead
        jg      $B1$2                           ; loop branch overhead
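The annotations in the DCT listing can be tallied to see how little of the code is real computation. The Python sketch below is a static count only: it counts instructions as written and ignores how often each loop executes. The list names are illustrative; each flag copies the slide's "True Computation" annotation.

```python
# (mnemonic, is_true_computation) for each instruction in the listing.
PROLOGUE = [("lea", False), ("mov", False)]
OUTER_HEAD = [("xor", False), ("mov", False), ("lea", False), ("mov", False)]
INNER_LOOP = [
    ("movq", False), ("pxor", False), ("pmaddwd", True), ("movq", False),
    ("pmaddwd", True), ("add", False), ("paddw", True), ("paddw", True),
    ("movq", False), ("psrlq", False), ("movd", False), ("movd", False),
    ("add", False), ("mov", False), ("add", False), ("cmp", False),
    ("jg", False),
]
OUTER_TAIL = [("mov", False), ("add", False), ("add", False),
              ("mov", False), ("cmp", False), ("jg", False)]


def useful_fraction(instructions):
    """Share of instructions annotated as true computation."""
    useful = sum(1 for _, is_true in instructions if is_true)
    return useful / len(instructions)


whole_listing = PROLOGUE + OUTER_HEAD + INNER_LOOP + OUTER_TAIL
print(f"inner loop:    {useful_fraction(INNER_LOOP):.1%}")
print(f"whole listing: {useful_fraction(whole_listing):.1%}")
```

By this count, 4 of the 17 inner-loop instructions are true computation; the rest are loads, address updates, reductions, and branches.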


Traditional SIMD Power Breakdown

§  The register file consumes a large share of the power in a traditional 32-wide SIMD architecture

SODA power breakdown running WCDMA:
MEM            3%
REG           37%
ALU+MULT      11%
CONTROL       38%
INTERCONNECT   2%
SCALAR PIPE    9%


Register File Access

§  Many register file accesses do not have to go back to the main register file

[Figure: percentage of register accesses (0% to 100%) split into Register Access, Bypass Read, and Bypass Write for FFT, STBC, LDPC, Deblocking Filter, Intra-Prediction, and Inverse Transform]

Lots of power is wasted on unneeded register file accesses!
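One way to estimate how many reads could skip the main register file is to scan an instruction trace and check how recently each source operand was produced. The Python sketch below is a hypothetical illustration, not the authors' methodology: the trace format, the `bypass_depth` parameter, and the example trace are all assumptions, and it classifies reads only (not bypass writes).

```python
def classify_reads(trace, bypass_depth=1):
    """Count source reads served by a bypass versus the register file.

    trace is a list of (dest, [sources]) tuples in program order.
    A read counts as a bypass read if its producer issued at most
    bypass_depth instructions earlier (a simplifying assumption).
    """
    last_writer = {}
    bypass_reads = 0
    register_file_reads = 0
    for index, (dest, sources) in enumerate(trace):
        for src in sources:
            writer = last_writer.get(src)
            if writer is not None and index - writer <= bypass_depth:
                bypass_reads += 1
            else:
                register_file_reads += 1
        last_writer[dest] = index
    return bypass_reads, register_file_reads


# Hypothetical three-instruction dependence chain.
example_trace = [
    ("v1", ["v0", "v0"]),
    ("v2", ["v1", "v0"]),
    ("v3", ["v2", "v1"]),
]
print(classify_reads(example_trace))                   # (2, 4)
print(classify_reads(example_trace, bypass_depth=2))   # (3, 3)
```

A deeper bypass window or a small temporary buffer catches more short-lived values, which is the effect the chart on this slide points at.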


Instruction Pair Frequency

As with the Multiply-Accumulate (MAC) instruction, there is opportunity to fuse other instruction pairs

A few instruction pairs (3-5) make up the majority of all instruction pairs!

a) Intra-prediction and Deblocking Filter Combined

     Instruction Pair        Frequency
 1   multiply-add            26.71%
 2   add-add                 13.74%
 3   shuffle-add              8.54%
 4   shift right-add          6.90%
 5   subtract-add             6.94%
 6   add-shift right          5.76%
 7   multiply-subtract        4.00%
 8   shift right-subtract     3.75%
 9   add-subtract             3.07%
10   Others                  20.45%

b) LDPC

     Instruction Pair        Frequency
 1   shuffle-move            32.07%
 2   abs-subtract             8.54%
 3   move-subtract            8.54%
 4   shuffle-subtract         3.54%
 5   add-shuffle              3.54%
 6   Others                  43.77%

c) FFT

     Instruction Pair        Frequency
 1   shuffle-shuffle         16.67%
 2   add-multiply            16.67%
 3   multiply-subtract       16.67%
 4   multiply-add            16.67%
 5   subtract-multiply       16.67%
 6   shuffle-add             16.67%
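The claim that a few pairs cover most of the work can be checked directly against the tables. The Python sketch below sums the most frequent named pairs for tables (a) and (b); the dictionary and function names are illustrative, and the "Others" rows are left out because they are not single pairs.

```python
# Frequencies (%) copied from tables (a) and (b); "Others" is excluded.
PAIRS_H264 = {
    "multiply-add": 26.71, "add-add": 13.74, "shuffle-add": 8.54,
    "shift right-add": 6.90, "subtract-add": 6.94, "add-shift right": 5.76,
    "multiply-subtract": 4.00, "shift right-subtract": 3.75,
    "add-subtract": 3.07,
}
PAIRS_LDPC = {
    "shuffle-move": 32.07, "abs-subtract": 8.54, "move-subtract": 8.54,
    "shuffle-subtract": 3.54, "add-shuffle": 3.54,
}


def top_k_coverage(frequencies, k):
    """Total share (%) covered by the k most frequent instruction pairs."""
    return sum(sorted(frequencies.values(), reverse=True)[:k])


print(f"H.264 top 5: {top_k_coverage(PAIRS_H264, 5):.2f}%")
print(f"LDPC  top 3: {top_k_coverage(PAIRS_LDPC, 3):.2f}%")
```

For table (a) the five most frequent pairs account for about 62.8% of all pairs, and for LDPC the top three account for about 49.2%.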


Data Alignment Problem!

§  H.264 Intra-prediction has 9 different prediction modes
   §  Each prediction mode requires a specific permutation

[Figure: a 16x16 luma macroblock divided into sixteen 4x4 blocks (numbered 0 to 15), and the prediction modes for each 4x4 block. Pixels a to p of a block are predicted from the neighboring pixels (labeled A through L and X), and each mode rearranges the neighbors in a different pattern.]

Traditional SIMD machines take too long or cost too much to perform these permutations

Good news: each kernel needs only a small, fixed number of patterns
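A prediction mode can be viewed as a fixed index pattern applied to the vector of neighboring pixels, which is what a swizzle network does in a single step. The Python sketch below shows the vertical and horizontal 4x4 modes as two such patterns. The neighbor layout (top pixels A to D at indices 0 to 3, left pixels I to L at indices 4 to 7) and all names are assumptions made for illustration; the other modes also filter neighbors before placing them, which this sketch leaves out.

```python
def swizzle(source, pattern):
    """Gather: output element i takes source[pattern[i]]."""
    return [source[index] for index in pattern]


# Assumed neighbor layout: indices 0..3 = top A B C D, 4..7 = left I J K L.
NEIGHBORS = list("ABCD") + list("IJKL")

# Vertical mode: every row of the 4x4 block copies the top neighbors.
VERTICAL = [col for row in range(4) for col in range(4)]
# Horizontal mode: every pixel in a row copies that row's left neighbor.
HORIZONTAL = [4 + row for row in range(4) for col in range(4)]

vertical_block = swizzle(NEIGHBORS, VERTICAL)
horizontal_block = swizzle(NEIGHBORS, HORIZONTAL)
print("".join(vertical_block))    # ABCDABCDABCDABCD
print("".join(horizontal_block))  # IIIIJJJJKKKKLLLL
```

Because each kernel uses only a handful of such patterns, they can be stored ahead of time and selected by index rather than computed with a sequence of shuffle instructions.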


Summary
§  Conclusion about 4G and H.264
   §  Lots of different sized parallelism
      §  From 4 wide to 96 wide to 1024 wide SIMD
      §  Which means many different SIMD widths need to be supported
   §  Very short lived values
   §  Lots of potential for instruction fusing
   §  Limited set of shuffle patterns required for each kernel


AnySP Design



Traditional SIMD Architectures

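[Figure: 32-Wide SIMD with Simple Shuffle Network. SIMD pipeline blocks: SIMD data memory, SIMD register file, 32-lane SIMD ALU, 32-lane SSN, SIMD-to-scalar unit, EX and WB stages. Scalar pipeline blocks: scalar data memory, scalar register file, 16-bit ALU, EX and WB stages. Also shown: STV and VTS transfer paths between the two pipelines, and the AGU.]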

AnySP Architecture – High Level

[Figure: AnySP PE block diagram. A multi-bank local memory (Bank 0 to Bank 15) connects through a crossbar to a multi-SIMD datapath with 8 groups of 8-wide SIMD (64 total lanes). Each group has a 16-bit 8-wide 16-entry SIMD register file (RF-0 to RF-7), an 8-wide SIMD FFU (FFU-0 to FFU-7), and a 16-bit 8-wide 4-entry buffer (Buffer-0 to Buffer-7), with AGU group pipelines (Group 0 to Group 7). Other blocks: swizzle network, multi-output adder tree, scalar pipeline, scalar memory buffer, L1 program memory, controller, and DMA to the inter-PE bus.]

8 Groups of 8-Wide Flexible Function Units

128x128 16bit Swizzle Network

16 Banked Memory with SRAM-based Crossbar

Multiple Output Adder Tree

Temporary Buffer and Bypass Network

Datapath AGU and Scalar Pipeline


Multi-Width Support

[Figure: the AnySP multi-SIMD datapath (SIMD register files, FFUs, buffers, AGU groups, swizzle network, multi-output adder tree, crossbar, and multi-bank local memory), repeated to show the supported modes]

Normal 64-wide SIMD mode: all lanes share one AGU

Each 8-wide SIMD group works on different memory locations of the same 8-wide code, using AGU offsets
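The AGU-offset idea can be sketched in software as eight groups running the same 8-wide kernel, each starting at its own memory offset. The Python sketch below is an assumption-laden illustration rather than a model of the actual hardware: the function name, the `group_stride` parameter, and the flat-list memory are all hypothetical.

```python
GROUPS = 8  # 8 SIMD groups, as on the slide
LANES = 8   # each group is 8 lanes wide


def run_multi_simd(memory, base, group_stride, kernel):
    """Run the same 8-wide kernel in every group at a per-group offset.

    group_stride stands in for the per-group AGU offset (hypothetical).
    """
    results = []
    for group in range(GROUPS):
        start = base + group * group_stride
        vector = memory[start:start + LANES]
        results.append(kernel(vector))
    return results


def double(vector):
    """Example 8-wide kernel: scale every element by two."""
    return [2 * x for x in vector]


memory = list(range(64))
# With a stride of 8 the groups together cover one contiguous 64-element vector.
outputs = run_multi_simd(memory, base=0, group_stride=LANES, kernel=double)
```

With a stride equal to the group width the eight groups behave like one 64-wide vector operation; other strides let each group work on an independent block, such as a separate 4x4 or 8-wide tile.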


AnySP FFU Datapath

Flexible Functional Unit allows us to

1.  Exploit pipeline parallelism by joining two lanes together
2.  Handle register bypass and the temporary buffer
3.  Join multiple pipelines to process deeper subgraphs
4.  Fuse instruction pairs


AnySP Results



Simulation Environment
§  Traditional SIMD architecture comparison
   §  SODA at 90nm technology
§  AnySP
   §  Synthesized at 90nm TSMC
   §  Power, timing, area numbers were extracted
§  Performance and power for each kernel were generated using synthesized data on an in-house simulator
   §  4G: based on an NTT DoCoMo 4G test setup
   §  H.264: 4CIF@30fps


AnySP Speedup vs SIMD-based Architecture

§  For all benchmarks, AnySP performs more than 2x better than a SIMD-based architecture

[Figure: normalized speedup (axis from 0.0 to 2.5) for FFT 1024-pt Radix-2, FFT 1024-pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, and H.264 Motion Compensation. Legend: Baseline, 64-Wide Multi-SIMD, Swizzle Network, Flexible Functional Unit, Buffer + Bypass.]


AnySP Energy-Delay vs SIMD-based Architecture

§  More importantly, energy efficiency is much better!

[Figure: normalized energy-delay (axis from 0.0 to 1.0) of the SIMD-based architecture versus AnySP for FFT 1024-pt Radix-2, FFT 1024-pt Radix-4, STBC, LDPC, H.264 Intra Prediction, H.264 Deblocking Filter, H.264 Inverse Transform, and H.264 Motion Compensation]

AnySP Power Breakdown

Power columns are for the 4G + H.264 decoder workload.

        Components                           Units  Area (mm2)  Area %   Power (mW)  Power %
PE      SIMD Data Mem (32 KB)                4      9.76        38.78%   102.88      7.24%
        SIMD Register File (16 x 1024 bit)   4      3.17        12.59%   299.00      21.05%
        SIMD ALUs, Multipliers, and SSN      4      4.50        17.88%   448.51      31.58%
        SIMD Pipeline + Clock + Routing      4      1.18        4.69%    233.60      16.45%
        SIMD Buffer (128 B)                  4      0.82        3.25%    84.09       5.92%
        SIMD Adder Tree                      4      0.18        <1%      10.43       <1%
        Intra-processor Interconnect         4      0.94        3.73%    93.44       6.58%
        Scalar/AGU Pipeline & Misc.          4      1.22        4.85%    134.32      9.46%
System  ARM (Cortex-M3)                      1      0.6         2.38%    2.5         <1%
        Global Scratchpad Memory (128 KB)    1      1.8         7.15%    10          <1%
        Inter-processor Bus with DMA         1      1.0         3.97%    1.5         <1%
Total   90 nm (1 V @ 300 MHz)                       25.17       100%     1347.03     100%
Est.    65 nm (0.9 V @ 300 MHz)                     13.14                1091.09
        45 nm (0.8 V @ 300 MHz)                     6.86                 862.09

§ We estimate that both H.264 and 4G wireless can be done in under 1 Watt at 45nm
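The 65 nm and 45 nm power estimates in the table are numerically consistent with scaling the 90 nm total by the square of the supply voltage at a fixed 300 MHz clock. The Python sketch below checks that arithmetic. This is an observation about the numbers, not a statement of how the authors derived them; the function name is illustrative, and the rule used is the usual dynamic-power proportionality P ∝ V²f.

```python
def scale_dynamic_power(power_mw, v_old, v_new, f_old_hz=300e6, f_new_hz=300e6):
    """Scale power with the square of voltage and linearly with frequency."""
    return power_mw * (v_new / v_old) ** 2 * (f_new_hz / f_old_hz)


TOTAL_90NM_MW = 1347.03  # total from the table, at 1 V and 300 MHz

est_65nm = scale_dynamic_power(TOTAL_90NM_MW, v_old=1.0, v_new=0.9)
est_45nm = scale_dynamic_power(TOTAL_90NM_MW, v_old=1.0, v_new=0.8)
print(f"65 nm at 0.9 V: {est_65nm:.2f} mW")
print(f"45 nm at 0.8 V: {est_45nm:.2f} mW")
```

Both results land within a hundredth of a milliwatt of the table's 1091.09 mW and 862.09 mW, and the 45 nm figure is below the 1 Watt mark quoted on the slide.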


Conclusion & Future Work
§  Conclusion
   §  We have presented an example architecture that could meet the requirements of 100 Mbps 4G and HD video on the same platform
      §  Under the power budget and meeting the performance at 45nm
§  Future and Ongoing Work
   §  Application-specific language
   §  Larger class of algorithms for AnySP
   §  Better utilization of resources for non-parallel kernels
      §  Speed up sequential parts