A New Generation of DSP Architectures
Bryan Ackland and Paul D’Arcy
Lucent Technologies
Paper Review
Babak Noory
Professor Maitham Shams
97.575
March 18, 2002
Agenda
1. Look at the evolution of Digital Signal Processors
2. Review the emerging system requirements
3. Summarize recent advances in low power DSP techniques
4. Look at a number of new high performance architectures
5. Describe a bus based multi-core architecture for task level parallelism
Introduction
General Purpose Digital Signal Processors
Introduced in 1980
- High performance engines
- MAC speed advantage of 50:1 over the best micro-processors
Today
- Modest performance improvements
- Outperformed by micro-processors
DSP Evolution
[Figure: Performance of DSPs vs. Microprocessors. Performance in peak MACs (log scale, 1 to 1K) against year (1980 to 2000). Microprocessors: M68000, 80286, 80386, Pentium, Pentium MMX. DSPs: DSP-1, DSP-16, DSP-32C, DSP-1600, DSP-16210.]
And yet, DSPs generate over $3 billion for the semiconductor industry every year.
DSP Evolution
[Figure: Power and Cost of DSPs vs. Microprocessors. Power in mW/MIP (log scale, 1 to 10K) against year (1980 to 2000), annotated with unit prices. Microprocessors: M68000 ($200), 80286 ($200), 80386 ($300), Pentium ($500). DSPs: DSP-1 ($150), DSP-32C ($250), DSP-16A ($15), DSP-1600 (<$10).]
Lower cost
Higher MOP/mm2 and MOP/mW
Emerging Applications
Very Low Power Applications
Portable Applications: functionalities such as video and web browsing added to cellular phones, PDAs, and Multimedia Laptops
Average power becomes the main design constraint
High Performance Applications
Embedded Applications: digital audio broadcast and smart phones
PC based Applications: 3-D graphics and real-time video communications
Infrastructure Applications: modem head-end and wireless basestations
Low Power Techniques
1. Full Custom Datapath Layout
Circuit Topology
Transistor Sizing
Layout Parasitics
Layout Topology    Drain Capacitance
Simple             45.6 fF
Finger             18.7 fF
Ring               10.8 fF
[Figure: transistor layout topologies. a) Simple, width W; b) Finger, width W/2; c) Ring, width W/4. S and D mark source and drain.]
Courtesy [1]
Low Power Techniques
2. Clock Gating
System Level Clock Gating: Limit data transition and clock dissipation to active sub-systems
Local Clock Gating: Deactivate non-active elements in a sequential circuit
[Figure: gated clock distribution. Labels: Crystal Oscillator, System Clock, Gate CPU, CPU Sections 1 to 3, To boards 1-3, To boards 4-6, To boards 7-9.]
Courtesy [4]
Operation Mode          Power
Normal Mode (80 MHz)    120 mW
Standby (Halt)          21 mW
Slow Clock (16 kHz)     2.3 mW
StopClk                 30 µW
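
The payoff of these modes depends on how long the chip spends in each one. The sketch below computes a time-weighted average from the mode powers listed above. The duty cycle (5% normal, 95% StopClk) is a hypothetical example, not a figure from the paper, and the model ignores any energy spent switching between modes.

```python
# Mode powers in milliwatts, copied from the operation-mode table.
MODE_POWER_MW = {
    "normal": 120.0,      # Normal Mode (80 MHz)
    "standby": 21.0,      # Standby (Halt)
    "slow_clock": 2.3,    # Slow Clock (16 kHz)
    "stop_clk": 0.030,    # StopClk (30 uW)
}


def average_power_mw(time_fractions):
    """Time-weighted average power for a dict of mode -> fraction of time."""
    total = sum(time_fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("time fractions must sum to 1")
    return sum(MODE_POWER_MW[mode] * frac
               for mode, frac in time_fractions.items())


# Hypothetical duty cycle (an assumption for illustration only).
profile = {"normal": 0.05, "stop_clk": 0.95}
avg = average_power_mw(profile)
print(f"average power: {avg:.4f} mW")
```

Under this assumed profile the average is dominated by the short time spent in normal mode, which is why average power rather than peak power is the figure of interest for portable devices.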
Low Power Techniques
3. Minimizing Data Transitions
Applicable to circuits where data transitions are well understood
Difficult to estimate internal node activity for complex circuits
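P(A=1) = 0.5
P(B=1) = 0.2
P(C=1) = 0.1
[Figure: two orderings of a two-stage gate chain producing Z from inputs A, B, and C, with internal node x. Activity at node x = 0.09 in the first ordering and 0.0196 in the second.]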
Courtesy [3]
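
The two activity figures can be reproduced with a short calculation. The sketch assumes two-input AND gates, statistically independent inputs, and an activity measure of p(1 - p), where p is the probability that the node is 1. These are assumptions chosen because they match the numbers on the slide; the slide itself does not state the gate type or the formula.

```python
# Signal probabilities from the slide.
P = {"A": 0.5, "B": 0.2, "C": 0.1}


def and_prob(p_a, p_b):
    """Probability that a 2-input AND output is 1 (independent inputs)."""
    return p_a * p_b


def activity(p_one):
    """Probability of a 0 -> 1 transition between two independent samples."""
    return p_one * (1.0 - p_one)


# Ordering 1: x = A AND B, then Z = x AND C.
act_ab = activity(and_prob(P["A"], P["B"]))

# Ordering 2: x = B AND C, then Z = x AND A.
act_bc = activity(and_prob(P["B"], P["C"]))

print(f"x = A.B: activity {act_ab:.4f}")
print(f"x = B.C: activity {act_bc:.4f}")
```

Combining the two least active inputs first keeps the internal node quiet, while the output Z has the same probability in both orderings.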
Low Power Techniques
4. Partitioned Memory Architecture
Memories occupy a great deal of silicon area, but activity factors in these individual circuits are very low.
Adopt hierarchical sub-banking
Replace large memory blocks with several smaller blocks
Make use of gated clocks to limit switching activity to active blocks
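
A minimal behavioural sketch of sub-banking follows. The class name `BankedMemory`, its methods, and the sizes used are all hypothetical; the point is only that the upper address bits select one small block and every other block stays idle, which is where a gated clock would save switching activity.

```python
class BankedMemory:
    """One large memory split into equal banks; one bank active per access."""

    def __init__(self, total_words, num_banks):
        if total_words % num_banks != 0:
            raise ValueError("total_words must be a multiple of num_banks")
        self.words_per_bank = total_words // num_banks
        self.banks = [[0] * self.words_per_bank for _ in range(num_banks)]
        # Counts how often each bank's (notional) clock was enabled.
        self.bank_activations = [0] * num_banks

    def _select(self, address):
        bank, offset = divmod(address, self.words_per_bank)
        if not 0 <= bank < len(self.banks):
            raise IndexError("address out of range")
        self.bank_activations[bank] += 1  # only this bank is clocked
        return bank, offset

    def read(self, address):
        bank, offset = self._select(address)
        return self.banks[bank][offset]

    def write(self, address, value):
        bank, offset = self._select(address)
        self.banks[bank][offset] = value


mem = BankedMemory(total_words=1024, num_banks=8)
mem.write(130, 7)        # address 130 falls in bank 1 (words 128..255)
value = mem.read(130)
print(value, mem.bank_activations)
```

Each access touches 128 words' worth of array rather than 1024. How much power that saves depends on the circuit, so the sketch counts activations and does not attempt an energy estimate.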
Low Power Techniques
5. Technology & Voltage Scaling
Adjusting supply voltages to meet performance requirements
Mixed voltage & mixed threshold logic families
Dynamic voltage scaling: Supply voltage and clock speed vary continuously according to processor load
Supply “cut off”: High-threshold transistors are used to cut off the power when the chip goes into sleep mode
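
The reason voltage scaling is so effective can be seen from the usual first-order model of CMOS dynamic power, P = a * C * V^2 * f. The sketch below uses that model. The capacitance, voltage, and frequency values are hypothetical, and the model leaves out leakage and short-circuit current.

```python
def dynamic_power(c_eff, vdd, freq, activity=1.0):
    """First-order CMOS dynamic power in watts: activity * C * Vdd^2 * f."""
    return activity * c_eff * vdd ** 2 * freq


# Hypothetical operating points (assumed values, not from the paper).
C_EFF = 1e-9                                       # effective switched capacitance, farads
full = dynamic_power(C_EFF, vdd=3.3, freq=80e6)    # full speed
half = dynamic_power(C_EFF, vdd=1.65, freq=40e6)   # half voltage, half clock

power_ratio = half / full
# Energy per clock cycle is power / frequency, so it scales with Vdd^2 only.
energy_ratio = (half / 40e6) / (full / 80e6)

print(f"power ratio:  {power_ratio:.3f}")
print(f"energy ratio: {energy_ratio:.3f}")
```

Halving the clock alone would halve power but leave the energy per operation unchanged. Lowering the supply together with the clock is what reduces the energy spent per operation, which is the idea behind dynamic voltage scaling.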
Emerging Applications (Revisited)
Very Low Power Applications
Portable Applications: functionalities such as video and web browsing added to cellular phones, PDAs, and Multimedia Laptops
Average power becomes the main design constraint
High Performance Applications
Embedded Applications: digital audio broadcast and smart phones
PC based Applications: 3-D graphics and real-time video communications
Infrastructure Applications: modem head-end and wireless basestations
Minor enhancements in combination with process improvement will not
meet the requirements of emerging applications. The new architectures
must provide:
Performance ranging from hundreds of MOPS to tens of GOPS
Parallel architectures, many operations/clock
Large memory and I/O bandwidth
Cache hierarchies
Compiler driven programming environment
High-level programming languages
Scalability
Range of cost/performance targets
New Class of Architectures
Media Processors
Processor           Architecture                  Clock    Performance  Memory            Programming
TI C80              4 64b DSP + 32b RISC          40 MHz   1.2 GOPS     DRAM, 400 MB/s    Compiler + Assembler
Chromatics MPACT    VLIW/SIMD, 4 ALUs             62 MHz   2.0 GOPS     RAMBUS, 500 MB/s  In-house
Philips Tri-Media   VLIW, 25 exec. units          100 MHz  4.0 GOPS     SDRAM, 400 MB/s   VLIW Compiler
IBM MFAST           VLIW/SIMD, 4by4 folded array  50 MHz   20 GOPS      SDRAM, 800 MB/s   Compiler + Assembler
Samsung MSP-1       32-way SIMD + 32b RISC        100 MHz  6.4 GOPS     SDRAM, 800 MB/s   Compiler + Assembler
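
Dividing each peak performance figure by its clock rate gives a rough measure of how many operations per clock each design relies on. The sketch assumes the table columns read in order (C80, MPACT, Tri-Media, MFAST, MSP-1) and that the GOPS figures are peak rates. The derived values are simple arithmetic, not numbers quoted in the paper.

```python
# (clock in Hz, peak performance in operations per second), from the table.
PROCESSORS = {
    "TI C80": (40e6, 1.2e9),
    "Chromatics MPACT": (62e6, 2.0e9),
    "Philips Tri-Media": (100e6, 4.0e9),
    "IBM MFAST": (50e6, 20e9),
    "Samsung MSP-1": (100e6, 6.4e9),
}

# Peak operations issued per clock cycle for each processor.
ops_per_clock = {
    name: ops / clock for name, (clock, ops) in PROCESSORS.items()
}

for name, value in ops_per_clock.items():
    print(f"{name:18s} {value:6.1f} ops/clock")
```

Every entry works out to tens or hundreds of operations per cycle at modest clock rates, which is why these designs depend so heavily on explicitly managed parallelism.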
Very high performance
Very fast memories
Yet all programs (save Tri-Media) have been cancelled
Reasons:
1. Programmability Issues
- Required large quantities of assembly code
- Explicit management of task level and instruction level parallelism
2. Lack of Scalability
- Single price/performance (except for C80)
3. Difficult Market
- Multimedia applications on PC
- Caught between high-performance ASICs and software solutions
Daytona MIMD Architecture
Task Level Parallelism
Code and data
Scalability
Bus support for N DSP cores
Cache memory
[Figure: N DSP cores, each with its own cache, attached to the STBus together with a Memory & I/O Controller that connects to external memory, I/O, and the host.]
Simulation has shown that N can be in the range of 8 to 10 processors!
Daytona DSP Core Architecture
LIW machine: 32b SPARC + 64b SIMD
Instruction level parallelism:
- 64b instructions
- 2 x 32b RISC operations
- 32b RISC + 32b coprocessor extension
DSP core programming in C
[Figure: DSP core block diagram. Bus Interface to the STBus, 8kB Instruction and Data Cache, 32b SPARC RISC µP, 64b 8-way SIMD Vector Coprocessor.]
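
To illustrate what a 64b 8-way SIMD operation does, the sketch below packs eight 8-bit lanes into one 64-bit word and adds two such words lane by lane. The lane width and wraparound behaviour are assumptions made for the example, and the function names are hypothetical. This is not the Daytona instruction set.

```python
LANES = 8
LANE_BITS = 8
LANE_MASK = (1 << LANE_BITS) - 1


def pack(values):
    """Pack eight 8-bit values into one 64-bit word, lane 0 in the low byte."""
    if len(values) != LANES:
        raise ValueError("expected exactly 8 lane values")
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (i * LANE_BITS)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight 8-bit lanes."""
    return [(word >> (i * LANE_BITS)) & LANE_MASK for i in range(LANES)]


def simd_add(a, b):
    """Lane-wise add with wraparound; carries never cross lane boundaries."""
    result = 0
    for i in range(LANES):
        shift = i * LANE_BITS
        lane = ((a >> shift) + (b >> shift)) & LANE_MASK
        result |= lane << shift
    return result


a = pack([1, 2, 3, 4, 5, 6, 7, 250])
b = pack([10] * 8)
total = unpack(simd_add(a, b))
print(total)
```

Hardware performs all eight lane additions in a single instruction. The loop here only models the result, including the last lane wrapping from 260 to 4.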
Conclusions (1)
The DSP world is changing
Emerging applications, combined with few backward compatibility issues, call for new architectures that maximize:
Parallelism
Scalability
Programmability
Generality
While other measures must be taken to minimize:
Cost
Time to Market
Conclusions (2)
The DSP world is changing
In the future, what separates DSPs from general-purpose microprocessors will simply be cost. Advances in the programmable hardware field are also very promising and could further change the DSP landscape.
References
[1] A. P. Chandrakasan and R. W. Brodersen, “Low Power Digital CMOS Design,” Kluwer Academic Publishers: Norwell, 1995.
[2] K. D. Wagner, “Clock System Design,” IEEE Design & Test of Computers, pp. 9-27, October 1988.
[3] L. Wanhammar, “DSP Integrated Circuits,” Academic Press: London, 1999.
[4] K. Hwang, “Advanced Computer Architecture: Parallelism, Scalability, Programmability,” McGraw-Hill: New York, 1993.
[5] T. Kuroda and T. Sakurai, “Overview of Low-Power ULSI Circuit Techniques,” IEICE Transactions on Electronics, Vol. E78-C, No. 4, pp. 334-344, April 1995.
[6] C. Hamacher, Z. Vranesic and S. Zaky, “Computer Organization,” fifth edition, McGraw-Hill: New York, 2002.
[7] M. M. Mano, “Computer System Architecture,” McGraw-Hill: New York, 1993.