sridhar rajagopal

44
Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation Sridhar Rajagopal

Upload: gavan

Post on 20-Jan-2016

47 views

Category:

Documents


0 download

DESCRIPTION

Data-Parallel Digital Signal Processors: Algorithm mapping, Architecture scaling, and Workload adaptation. Sridhar Rajagopal. Digital Signal Processors (DSPs). Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Sridhar Rajagopal

Data-Parallel Digital Signal Processors:Algorithm mapping, Architecture scaling,

and Workload adaptation

Sridhar Rajagopal

Page 2: Sridhar Rajagopal

Digital Signal Processors (DSPs)

Audio, automobile, broadband, military, networking, security, video and imaging, wireless communications

A 5 billion $ (and growing) market today

Page 3: Sridhar Rajagopal

We always want something faster!

New high performance applications drive need for faster DSPs

• Physical-layer signal processing in high speed wireless communications to support multimedia

• Application-layer signal processing for video and imaging

Page 4: Sridhar Rajagopal

Example : wireless systems

Data ratesAlgorithmsEstimationDetection

Decoding

Theoretical min ALUs @ 1

GHz

32-user system

1 Mbps/userMIMO

Chip equalizerMatched filter

LDPC

> 200

128 Kbps/userMulti-user

Max. likelihoodInterference cancellation

Viterbi

> 20

16 Kbps /user

Single-user Correlator

Matched filter

Viterbi

> 2

4G3G2G

Time1996 2003 ?

Page 5: Sridhar Rajagopal

Data-Parallel DSPs: state-of-the-art

Clusters of ALUs provide billions of computations per second

Exploit data parallelism in signal processing applications

Imagine stream processor – Stanford (1998 - 2004)

Internal memory

+++***

+++***

+++***

+++***

Clusterof ALUs

Page 6: Sridhar Rajagopal

Proposal:Research questions for DP-DSPs

• Will DP-DSPs work well for wireless systems?

• How do I design DP-DSPs to meet real-time at lowest power?

• Can I improve power efficiency further by adapting DSPs to the application?

Page 7: Sridhar Rajagopal

Contributions: Algorithm mapping

• Efficient mapping of (wireless) algorithms

– parallelization, structure, memory access patterns

– tradeoffs between ALU utilization, inter-cluster

communication, memory stalls, packing

• A reduced inter-cluster network proposed

– exploits inter-cluster communication patterns

– allows greater scalability of the architecture by reducing

wires

Page 8: Sridhar Rajagopal

Contributions: Architecture scaling

• Design methodology and tool to explore architectures for low power

• Provides candidate architectures for low power

• Provides insights into ALU utilization and performance

• Compile-time exploration is orders-of-magnitude faster than run-time exploration

Page 9: Sridhar Rajagopal

Contributions: Workload adaptation

• Adapt the number of clusters and ALUs to

changes in workload during run-time

• Multiplexer network designed

– adapts clusters to DP at run-time

– turns off unused clusters using power gating

• Significant power savings at run-time (up to 60%)

Page 10: Sridhar Rajagopal

Thesis contributions

Data-Parallel DSPs

+++***

+++***

+++***

Algorithmmapping:

Design of algorithms for

efficient mapping and performance

Architecturescaling:

Having designed the algorithms, find a low

power processor

Workloadadaptation:

Having designed the processor, improve power

at run-time

Page 11: Sridhar Rajagopal

Outline

• DP-DSPs : Parallelism and architecture

• Power-aware design exploration

• Power-aware resource utilization at run-time

• Conclusions

Page 12: Sridhar Rajagopal

Parallelism levels in DP-DSPs

Instruction Level Parallelism (ILP) - DSP

Subword Parallelism (SubP) - DSP

Data Parallelism (DP) – vector processor

Not independentDP can decrease by increasing ILP and SubP

– loop unrolling

Page 13: Sridhar Rajagopal

Code snippet for ILP, SubP, DP

int i,a[N],b[N],sum[N];

short int c[N],d[N],diff[N];

for (i = 0; i< 64; ++i)

{

sum[i] = a[i] + b[i];

diff[i] = c[i] - d[i];

}

ILP

DP

SubP

Page 14: Sridhar Rajagopal

Data-Parallel DSPs

• ILP, SubP within cluster, DP across clusters• Communication within clusters using inter-cluster comm.

network• Microcontroller issues same instruction to all clusters

Internal memory

+++***

+++***

+++***

+++***

…ILPSubP

DP

mic

roco

ntr

oll

er

Page 15: Sridhar Rajagopal

ILP is resource-bound

• ILP dependent on resources such as ALUs, read/write ports, inter-cluster communication, registers

• Any one resource bottleneck can affect ILP

Adders Multipliers Inter-cluster communication

Tim

e

Schedule for matrix-matrix multiplication as ALUs increase

Page 16: Sridhar Rajagopal

Signal processing algorithms have DP in plenty

Observations: 1. More DP available after exploiting ILP and SubP

to the point of diminishing returns

2. Used to set number of clusters

3. As clusters are added and exploit this ‘extra’ DP, ILP and SubP are not affected significantly

This ‘extra’ DP is defined as Cluster DP (CDP)

Page 17: Sridhar Rajagopal

Observing CDP in Viterbi decoding

1 10 1001

10

100

1000

Number of clustersFre

qu

en

cy n

eed

ed

to a

ttain

real-

tim

e (

in M

Hz)

K = 9K = 7 K = 5DSP

Max CDP

Page 18: Sridhar Rajagopal

Designing low power DP-DSPs

‘1’ cluster

100 GHz

+

++

*

*

*

‘a’

+

‘m’

*

+

++

*

*

*

‘a’

+

‘m’

*

+

++

*

*

*

‘a’

+

‘m’*

‘c’ clusters

‘f’ MHz

+

++

*

*

*

‘1’

+

‘1’

*

+

++

*

*

*

‘10’+

‘10’

*

+

++

*

*

*

‘10’

+

‘10’

*

+

++

*

*

*

‘10’

+

‘10’

*

‘100’ clusters

10 MHz

Find the right (a,m,c,f) to minimize power

a – #adders/cluster, m – #multipliers/cluster, c – #clusters

Page 19: Sridhar Rajagopal

Detailed simulation using the Imagine processor simulator

• Cycle accurate, parameterized simulator

– Insights into operations every cycle

• High-level C++-based programming

• GUI interface shows dependencies and schedule

• Power and VLSI scaling model available

• Open source allows modifications in architecture,

tools

Page 20: Sridhar Rajagopal

Need for design exploration tool

• Random choice may be way off

– 100x power variation possible

• Exhaustive simulation not possible

– large parameter space (hours for each simulation)

– DSP compilers need hand optimizations for performance

– evolving algorithms -- architecture exploration needed

Page 21: Sridhar Rajagopal

Design exploration framework

Base Data-ParallelDSP

Designworkload

(worst-case)

Applicationworkload

Explore (a,m,c,f)combination thatminimizes power

Dynamic adaptationto turn down (a,m,c,f)

to save power

Hardwareimplementation

+++***

+++***

+++***

Designphase

Utilizationphase

Page 22: Sridhar Rajagopal

DSPs are compute-bound with predictable performance

Computations

Hiddenmemory stalls

Exposedmemory stalls

Totalexecution

time(cycles)

Microcontrollerstalls

tcompute

tstall

Page 23: Sridhar Rajagopal

Minimization for power

C(a,m,c) – capacitance from simulator model f(a,m,c) – real-time clock frequency

– obtained by running application on (a,m,c) architecture

2

, , , , , ,

, , , , , ,

( , , )

3

( , , )

min min ( , , )

min min ( , , )

a m c f a m c f

a m c f a m c f

a m c

a m c

P C a m c V f

P C a m c f

V f

Page 24: Sridhar Rajagopal

Sensitivity to technology and modeling

• Sensitivity to technology ‘p’

• Sensitivity to adder-multiplier power ratio ‘’– 0.01 0.1 for 32-bit adders and 32x32

multipliers

• Sensitivity to memory stalls ‘’– difficult to predict at compile time (5-20 %)– assume q = 25% of execution time as worst case

– fstall = q* (1-) * fmin 0 1

, , , , , , ( , , )2 min min ( , , ) where p 3

a m c f a m c f

p

a m cP C a m c f

Page 25: Sridhar Rajagopal

Design exploration: big picture

1. (a,m,c) = (, , )

2. Find (a,m,c) where ILP, SubP, DP are fully exploited

3. Find c that minimizes P for (max(a), max(m))

4. Find (a,m) that minimizes P using c

5. Explore sensitivity to , , p

, , , , , , ( , , )min min ( , , )a m c f a m c f

p

a m cP C a m c f

Page 26: Sridhar Rajagopal

Running algorithms at (amax,mmax,

cCDP)

Algorithm Kernel CDP MHz

Estimation

Correlation 32 1

Matrix mul 32 43

Iteration 32 1

Transpose 512 < 1

Matrix mul L 32 22

Matrix mul C 32 22

Detection Matched filter 32 71

Interference cancellation 32 83

Decoding

Packing 256 <1

Re-packing 64 <1

Initialization 64 17

Add-Compare-Select (ACS)

64 254

Decoding output 64 23

Min. real-time frequency (a,m,c) =(5,3,512)

538 MHz

Page 27: Sridhar Rajagopal

Real-time frequency with clusters for (a,m) = (5,3)

100

101

102

10310

2

103

104

Clusters

Fre

qu

ency

(M

Hz)

= 0 = 0.5 = 1

538 MHz

541 MHz

( ) ( )c cdp

cdpf f

c

Page 28: Sridhar Rajagopal

Choosing clusters c = 64, 541 MHz

100

101

102

103

10-3

10-2

10-1

100

Clusters

Nor

mal

ized

Pow

er

Power f2

Power f2.5

Power f3

Page 29: Sridhar Rajagopal

ALU utilization (+,*)

1

3

5 1

3

400

800

1200

(51,42)

(55,62)

(65,46)

#Adders

(67,62)

(78,45)

Rea

l-T

ime

Fre

qu

ency

(in

MH

z)

Initial (5,3,64)(541 MHz)

Final (3,1,64)(567 MHz)

c = 64, = 0.01, = 1, p = 3

Page 30: Sridhar Rajagopal

Choosing ALUs (a,m) for c = 64

p = 2 p = 2.5

p = 3

= 0, = 0.01 (2,1,64)

(2,1,64)

(3,1,64)

= 0.5, = 0.01

(2,1,64)

(3,1,64)

(3,1,64)

= 1, = 0.01 (2,1,64)

(3,1,64)

(3,1,64)

= 1, = 0.1 (2,1,64)

(3,1,64)

(3,1,64)

Page 31: Sridhar Rajagopal

Insights from analysis

• Sensitivity importance: p, ,

• Design gives candidates for low power solutions Design I : (a,m,c): (, , ) (5,3,512) (5,3,64)

(2,1,64)Design II : (a,m,c): (, , ) (5,3,512) (5,3,64)

(3,1,64)

• Power minimization related to ALU efficiency– same as maximizing a scaled version of ALU utilization

Page 32: Sridhar Rajagopal

Advantages of design exploration tool

• Simulator (S)– cycle-accurate (execution time at run-time)– explore 100 machine configurations in 100 hours

(conservative)– modification of parameters and code for different runs

• Tool (T)– cycle-approximate (execution time at compile time)– explore millions of configurations in 100 hours– automated process all the way – generate plots for defense the day before

• Rapid evaluation of candidate algorithms for future systems

Page 33: Sridhar Rajagopal

Verification of design tool

Human (3,3,32) @ 1.2V, 0.13 , 1 GHz = 18.2 W

Exploration tool choice : (2,1,64) at 887 MHz

Estimated base power @ 1.2V, 0.13 = 13.2 W

200

400

600

800

1000

(Execu

tion

tim

e)

Real-

tim

e c

lock f

req

uen

cy (

MH

z)

ComputationsStalls

T S T S T S Design I Design II Human

T- ToolS - Simulator

Page 34: Sridhar Rajagopal

Cluster utilization

• 64 cluster inefficient in terms of cluster utilization (54% for 33:64)

• But, still lower power than 32 clusters due to the difference in f– can see difference reduces as p 2

20 40 60

20

40

60

80

100

Cluster index number

Clu

ste

r u

tiliza

tion

(%

)

32 clusters

64 clusters

Page 35: Sridhar Rajagopal

Improving power efficiency

• Clusters significant source of power consumption (50-75%)

• When CDP < c, unutilized clusters waste power

• Dynamically turn off clusters using power gating to improve power efficiency

Page 36: Sridhar Rajagopal

Data access difficult after adaptation

Clusters off – then how to get data from other banks?

4 2 clusters• Data not in the correct memory banks• Overhead in bringing data : external memory, inter-

cluster network

+++***

+++***

+++***

+++***

4 2 clusters

Page 37: Sridhar Rajagopal

Multiplexer network design

Multiplexernetwork adapts clusters to DP

No reconfiguration 4: 2 reconfiguration 4:1 reconfiguration All clusters off

Turned off using power gating to

eliminate static anddynamic power dissipation

Page 38: Sridhar Rajagopal

Run-time variations in workload

20 40 60

20

40

60

80

100

Cluster index number

Clu

ster

uti

lizat

ion

(%

)

K = 9

K = 7

K = 5

Page 39: Sridhar Rajagopal

Benefits of multiplexer network

Power efficiency at design time:

Human choice : (3,3,32) Base power @ 1.2V, 0.13 , 1 GHz = 18.2 W

Exploration tool choice : (2,1,64)Base power @ 1.2V, 0.13 , 887 MHz = 13.2 W

Power efficiency at run-time:With mux network ( K = 9) = 9.9 W

( K = 7) = 7.4 W (K = 5) = 6.8 W

Page 40: Sridhar Rajagopal

Design exploration for 2G-3G-4G systems

A “power”ful tool for algorithm-architecture exploration

101

102

103

101

102

103

104

105

Data ratesReal-

tim

e c

lock f

req

uen

cy (

MH

z)

4G*3G2G

(2,1,64) and (3,1,64)

(1,1,32) and (2,1,32)

Page 41: Sridhar Rajagopal

Broader impact

• Power-aware design exploration with improved run-time power efficiency

• Techniques can be applied to all high performance, power efficient DSP designs– Handsets, cameras, video

Page 42: Sridhar Rajagopal

Future extensions

• Fabrication needed to verify concepts

• Higher performance– Multi-threading (ILP, SubP, DP, MT)– Pipelining (ILP, SubP, DP, MT, PP)

• LDPC decoding– Sparse matrix requires permutations over large data– Indexed SRF in stream processors [Jayasena, HPCA

2004]

Page 43: Sridhar Rajagopal

Conclusions

• Providing high performance with 100-1000’s of ALUs and providing low power designs – a challenge for DSP designers

• Algorithm design for efficient mapping on DP-DSPs

• Design exploration tool for low power DP-DSPs – Provides candidate DSPs for low power – Allows algorithm-architecture evaluation for new systems

• Power efficiency provided during both design and use of DP-DSPs

Page 44: Sridhar Rajagopal

Acknowledgements

• Dr. Joseph R. Cavallaro, Dr. Scott Rixner

• Imagine stream processor group at Stanford– Abhishek, Ujval, Brucek, Dr. Dally

• Marjan, Predrag, Alex– 4G MIMO + LDPC

• Thesis committee

• Nokia, Texas Instruments, TATP, NSF