1
The Future of Microprocessors Embedded in Memory
David A. Patterson
[email protected]
http://cs.berkeley.edu/~patterson/talks
EECS, University of California
Berkeley, CA 94720-1776
2
Outline
• Desktop/Server Microprocessor State of the Art
• A New Technology: Embedded DRAM
• Mobile Multimedia Computing as New Direction
• A New Architecture for Mobile Multimedia Computing
• Berkeley’s Mobile Multimedia Microprocessor
• Historical Perspective
• Challenges & Potential Industrial Impact
3
Processor-DRAM Gap (latency)
[Figure: Performance (log scale, 1 to 1000) vs. Time (1980 to 2000). CPU (“Moore’s Law”): µProc 60%/yr. DRAM: 7%/yr. Processor-Memory Performance Gap: (grows 50% / year)]
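The three rates on this slide are mutually consistent: a processor improving 60%/yr. against DRAM improving 7%/yr. widens the ratio between them by 1.60/1.07, or roughly 50% per year. A minimal Python sketch of that arithmetic follows; the function names are illustrative and not from the talk, and the only inputs are the two rates quoted above.

```python
# Annual improvement rates quoted on the slide.
CPU_RATE = 0.60   # µProc: 60%/yr.
DRAM_RATE = 0.07  # DRAM: 7%/yr.


def gap_growth_per_year(cpu_rate, dram_rate):
    """Factor by which the CPU/DRAM performance ratio grows each year."""
    return (1.0 + cpu_rate) / (1.0 + dram_rate)


def gap_after(years, cpu_rate=CPU_RATE, dram_rate=DRAM_RATE):
    """Cumulative gap after `years`, normalized to 1.0 at year 0."""
    return gap_growth_per_year(cpu_rate, dram_rate) ** years


if __name__ == "__main__":
    growth = gap_growth_per_year(CPU_RATE, DRAM_RATE)
    print(f"gap grows by a factor of {growth:.3f} per year")
    for years in (5, 10, 20):
        print(f"after {years:2d} years the gap is {gap_after(years):.1f}x")
```

The compounding is what makes the gap dramatic: a 50% annual widening is modest in any one year but spans orders of magnitude over the two decades on the chart.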
4
Processor-Memory Performance Gap “Tax”
Processor            % Area (≈cost)   % Transistors (≈power)
• Alpha 21164        37%              77%
• StrongArm SA110    61%              94%
• Pentium Pro        64%              88%
  – 2 dies per package: Proc/I$/D$ + L2$
• Caches have no inherent value, only try to close performance gap
5
Today’s Situation: Microprocessor
MIPS MPUs              R5000      R10000         10k/5k
• Clock Rate           200 MHz    195 MHz        1.0x
• On-Chip Caches       32K/32K    32K/32K        1.0x
• Instructions/Cycle   1 (+ FP)   4              4.0x
• Pipe stages          5          5-7            1.2x
• Model                In-order   Out-of-order   ---
6
Desktop/Server State of the Art
• Processor performance doubling / 18 months
• Microprocessor-DRAM performance gap & tax
• Cost fixed at ≈$500/chip, power whatever can cool
  – 10X cost, 10X power => 2X performance?
• Desktop apps slow at rate processors speedup?
  – Word struggles to keep up with typing?
• Consolidation of industry? PowerPC, PA-RISC, MIPS, Alpha, IA-64, SPARC
7
Outline
• Desktop/Server Microprocessor State of the Art
• A New Technology: Embedded DRAM
• Mobile Multimedia Computing as New Direction
• A New Architecture for Mobile Multimedia Computing
• Berkeley’s Mobile Multimedia Microprocessor
• Historical Perspective
• Challenges & Potential Industrial Impact
8
A More Revolutionary Approach: Processor Embedded in DRAM
• Faster logic in DRAM process
  – DRAM vendors offer faster transistors + same number metal layers as good logic process? @ ≈ 20% higher cost per wafer?
  – As die cost ≈ f(die area^4), 4% die shrink ⇒ equal cost
• Called Intelligent RAM (“IRAM”) since most of transistors will be DRAM
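The equal-cost claim can be checked with a short calculation. The sketch below assumes die cost scales as wafer cost times (die area)^4 and that the 4% shrink refers to die area; both are readings of the slide rather than statements from it, and the function names are hypothetical.

```python
def relative_die_cost(wafer_cost_factor, area_factor, exponent=4.0):
    """Die cost relative to the baseline process.

    Assumed model: cost = wafer_cost_factor * area_factor ** exponent,
    with exponent 4 taken from the slide's "die cost ≈ f(die area^4)".
    """
    return wafer_cost_factor * area_factor ** exponent


def break_even_area_factor(wafer_cost_factor, exponent=4.0):
    """Area factor at which the costlier wafer yields the same die cost."""
    return (1.0 / wafer_cost_factor) ** (1.0 / exponent)


if __name__ == "__main__":
    # 20% higher wafer cost, 4% smaller die area (both from the slide).
    print(f"relative cost: {relative_die_cost(1.20, 0.96):.3f}")
    shrink = 1.0 - break_even_area_factor(1.20)
    print(f"break-even shrink: {shrink * 100:.1f}%")
```

Under these assumptions a 4% smaller die on a 20% costlier wafer lands within about 2% of the original cost, and the exact break-even shrink is about 4.5%, which agrees with the slide's rounded figure.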
9
IRAM Vision Statement
Microprocessor & DRAM on a single chip:
  – on-chip memory latency 5-10X, bandwidth 50-100X
  – improve energy efficiency 2X-4X (no off-chip bus)
  – serial I/O 5-10X v. buses
  – smaller board area/volume
  – adjustable memory size/width
[Figure: two block diagrams. Logic fab: Proc with $ and L2$, connected by buses to DRAM and I/O. DRAM fab: Proc and DRAM on one chip, with bus and I/O]
10
Outline
• Desktop/Server Microprocessor State of the Art
• A New Technology: Embedded DRAM
• Mobile Multimedia Computing as New Direction
• A New Architecture for Mobile Multimedia Computing
• Berkeley’s Mobile Multimedia Microprocessor
• Historical Perspective
• Challenges & Potential Industrial Impact
11
Intelligent PDA (2003?)
Pilot PDA (todo, calendar, calculator, addresses, ...)
+ Gameboy (Tetris, ...)
+ Nikon Coolpix (camera)
+ Cell Phone, Pager, GPS, tape recorder, TV remote, am/fm radio, garage door opener, ...
+ Wireless data (WWW)
+ Speech, vision recog.
  – Speech control of all devices
  – Vision to see surroundings, scan documents, read bar codes, measure room
+ Voice output for conversations
12
New Architecture Directions
• “…media processing will become the dominant force in computer arch. & µprocessor design.”
• “... new media-rich applications... involve significant real-time processing of continuous media streams, and make heavy use of vectors of packed 8-, 16-, and 32-bit integer and Fl. Pt.”
• Needs include high memory BW, high network BW, continuous media data types, real-time response, fine grain parallelism
  – “How Multimedia Workloads Will Change
13
Outline
• Desktop/Server Microprocessor State of the Art
• A New Technology: Embedded DRAM
• Mobile Multimedia Computing as New Direction
• A New Architecture for Mobile Multimedia Computing
• Berkeley’s Mobile Multimedia Microprocessor
• Historical Perspective
• Challenges & Potential Industrial Impact
14
Potential IRAM Architecture
• “New” model: VSIW = Very Short Instruction Word!
  – Compact: Describe N operations with 1 short instruct.
  – Predictable (real-time) perf. vs. statistical perf. (cache)
  – Multimedia ready: choose N*64b, 2N*32b, 4N*16b
  – Easy to get high performance; N operations:
    » are independent
    » use same functional unit
    » access disjoint registers
    » access registers in same order as previous instructions
    » access contiguous memory words or known pattern
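The "describe N operations with 1 short instruction" idea can be illustrated with a small Python model. This is a sketch under stated assumptions, not the proposed instruction set: `vector_add`, `elements_per_instruction`, and `instruction_counts` are hypothetical names, and the count covers only the arithmetic instruction, ignoring loads, stores, and loop overhead.

```python
LANE_BITS = 64  # width of one element slot in the slide's N*64b notation


def elements_per_instruction(n, element_bits):
    """Elements one instruction covers: N*64b, 2N*32b, or 4N*16b."""
    if element_bits not in (16, 32, 64):
        raise ValueError("element_bits must be 16, 32, or 64")
    return n * (LANE_BITS // element_bits)


def vector_add(a, b):
    """Model of one vector instruction: N independent element-wise adds."""
    if len(a) != len(b):
        raise ValueError("operands must have the same vector length")
    return [x + y for x, y in zip(a, b)]


def instruction_counts(total_elements, vector_length):
    """(scalar, vector) add-instruction counts for the same work."""
    scalar = total_elements                       # one add per element
    vector = -(-total_elements // vector_length)  # ceiling division
    return scalar, vector


if __name__ == "__main__":
    print(vector_add([1, 2, 3], [10, 20, 30]))
    for bits in (64, 32, 16):
        print(bits, "bit elements:", elements_per_instruction(4, bits))
    print(instruction_counts(1024, 64))
```

Because every element operation inside `vector_add` is independent and uses the same functional unit, hardware is free to execute them in parallel lanes or pipeline them, which is the property the bullet list above relies on.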
15
Operation & Instruction Count: RISC v. “VSIW” Processor
(from F. Quintana, U. Barcelona.)

Spec92fp    Operations (M)          Instructions (M)
Program     RISC   VSIW   R / V     RISC   VSIW   R / V
swim256     115    95     1.1x      115    0.8    142x
hydro2d     58     40     1.4x      58     0.8    71x
nasa7       69     41     1.7x      69     2.2    31x
su2cor      51     35     1.4x      51     1.8    29x
tomcatv     15     10     1.4x      15     1.3    11x

VSIW reduces ops by 1.2X, instructions by 20X!
16
Revive Vector (= VSIW) Architecture!
• Cost: ≈ $1M each?                      • Single-chip CMOS MPU/IRAM
• Low latency, high BW memory system?    • IRAM = low latency, high bandwidth memory
• Code density?                          • Much smaller than VLIW/EPIC
• Compilers?                             • For sale, mature (>20 years)
• Vector Performance?                    • Easy scale speed with technology
• Power/Energy?                          • Parallel to save energy, keep perf
• Scalar performance?                    • Include modern, modest CPU ⇒ OK scalar (MIPS 5K v. 10k)
• Real-time?                             • No caches, no speculation ⇒ repeatable speed as vary input
• Limited to scientific
17
Software Technology Trends Affecting New Direction?
• any CPU + vector coprocessor/memory
  – scalar/vector interactions are limited, simple
  – Example architecture based on ARM 9, MIPS IV
• Vectorizing compilers built for 25 years
  – can buy one for new machine from The Portland Group
• Can change HW without changing programs
  – More arithmetic units, registers OK
• Lowers software effort
  – Media Processors/DSP, > 50% staff are programmers
18
Software Technology Trends Affecting New Direction?
• IRAM has limited memory => Desktop OS wrong
• Real-Time Operating Systems
  – “Microkernel” - exclude modules not used by application
  – Small footprint even with all features (0.5 to 2 MB)
  – Examples: WRS VxWorks, QNX, Windows CE
• Applications not distributed in binary
  – WWW software distribution (Java, Javascript, TCL)
  – Open Source movement (Linux, Netscape, PERL)
19
Outline
• Desktop/Server Microprocessor State of the Art
• A New Technology: Embedded DRAM
• Mobile Multimedia Computing as New Direction
• A New Architecture for Mobile Multimedia Computing
• Berkeley’s Mobile Multimedia Microprocessor
• Historical Perspective
• Challenges & Potential Industrial Impact
20
… MHz, 1.6 GFLOPS(64b)/6.4 GOPS(16b)/32MB
[Figure: block diagram. 2-way Superscalar Processor with 8K I cache and 8K D cache; Vector Processor with Instruction Queue, Vector Registers, +, x, ÷ and Load/Store units (4 x 64 or 8 x 32 or 16 x 16); Memory Crossbar Switch over memory banks (M); Serial I/O]
21
Tentative VIRAM-1 Floorplan
• 0.18 µm DRAM: 32 MB in 16 banks x 256b, 128 subbanks
• 0.18 µm, 5 Metal Logic
• ≈ 200 MHz MIPS, 16K I$, 16K D$
• ≈ 4 200 MHz FP/int. vector units
• die: ≈ 16 x 16 mm
• xtors: ≈ 270M
• power: ≈ 2 Watts
[Figure: floorplan with two Memory blocks (128 Mbits / 16 MBytes each), 4 Vector Pipes/Lanes, Ring-based Switch, CPU+$, and I/O]
22
Tentative VIRAM-“0.25” Floorplan
• Demonstrate scalability via 2nd layout (automatic from 1st)
• 8 MB in 4 banks x 256b, 32 subbanks
• ≈ 200 MHz CPU, 16K I$, 16K D$
• 1 ≈ 200 MHz FP/int. vector units
• die: ≈ 5 x 16 mm
• xtors: ≈ 70M
• power: ≈ 0.5 Watts
[Figure: floorplan with two Memory blocks (32 Mb / 4 MB each), 1 VU, and CPU+$]
23
V-IRAM-1 Tentative Plan
• Phase 1: Feasibility stage (≈H1’98)
  – Test chip, CAD agreement, architecture defined
• Phase 2: Design & Layout Stage (≈H2’98)
  – Test chip, Simulated design and layout
• Phase 3: Verification (≈H2’99)
  – Tape-out
• Phase 4: Fabrication, Testing, and Demonstration (≈H1’00)
  – Functional integrated circuit
• First microprocessor ≥ 0.15-.25B transistors?
24
Outline
• Desktop/Server Microprocessor State of the Art
• A New Technology: Embedded DRAM
• Mobile Multimedia Computing as New Direction
• A New Architecture for Mobile Multimedia Computing
• Berkeley’s Mobile Multimedia Microprocessor
• Historical Perspective
• Challenges & Potential Industrial Impact
25
IRAM not a new idea
• Stone, ‘70 “Logic-in memory”
• Barron, ‘78 “Transputer”
• Kogge, ‘94 “Execube”
• Shimizu, ‘96 “M32R/D”
• Murakami, ‘97 “PPRAM”
[Figure: scatter plot of Mbits of Memory (0.1 to 1000, log scale) vs. Bits of Arithmetic Unit (10 to 10000, log scale). Legend: SIMD on chip (DRAM), Uniprocessor (SRAM), MIMD on chip (DRAM), Uniprocessor (DRAM), MIMD component (SRAM). Points: Computational RAM, PIP-RAM, Mitsubishi M32R/D, Execube, Pentium Pro, Alpha 21164, Transputer T9, IRAM UNI?, PPRAM, Terasys]
26
Why IRAM now? Lower risk than before
• DRAM manufacturers now willing to listen
  – Before not interested, so early IRAM = SRAM
• Faster Logic + DRAM available soon?
• Past efforts memory limited ⇒ multiple chips ⇒ 1st solve the unsolved (parallel processing)
  – Gigabit DRAM ⇒ ≈100 MB; OK for many apps?
• Systems headed to 2 chips: CPU + memory
• Embedded apps leverage energy efficiency, adjustable mem. capacity, smaller board area ⇒ “Post-PC Era”: PDA >> desktop
27
IRAM Challenges
• Chip
  – Good performance and reasonable power?
  – Cost for Embedded DRAM process?
    » Cost of 16 Mbit DRAM $1.78 in June 1998: How compete?
  – Speed in Embedded DRAM? (time behind logic fab?)
  – Testing time of IRAM vs DRAM vs microprocessor?
  – BW/Latency oriented DRAM tradeoffs?
• Architecture
  – How to turn high memory bandwidth into performance for real applications?
28
Commercial IRAM highway is governed by memory per IRAM?
[Figure: highway diagram labeled with memory sizes 2 MB, 8 MB, 32 MB and applications Graphics Acc., Super PDA/Phone, Video Games, Network Computer, Laptop]
29
Words to Remember
“...a strategic inflection point is a time in the life of a business when its fundamentals are about to change. ... Let's not mince words: A strategic inflection point can be deadly when unattended to. Companies that begin a decline as a result of its changes rarely recover their previous greatness.”
  – Only the Paranoid Survive, Andrew S. Grove, 1996
30
Conclusion
• Apps/metrics of future to design computer of future
• Mobile Multimedia >> “Stationary” Computers?
• IRAM potential in mem/IO BW, energy, board area; challenges in cost, power/performance, testing
• 10X-100X improvements based on technology shipping for 20 years (not JJ, photons, MEMS, ...)
• Potential: shift semiconductor balance of power? Who ships the most memory? Most
31
Interested in Participating?
• Looking for ideas of IRAM enabled apps, partners to transfer technology
• Contact us if you’re interested: http://iram.cs.berkeley.edu/ email: [email protected]
• Thanks for advice/support: DARPA, California MICRO, Hitachi, Intel, LG Semicon, Microsoft, Neomagic, Sandcraft, SGI/Cray, Sun Microsystems, TI, TSMC
32
Backup Slides
(The following slides are used to help answer questions)
33
VIRAM-1 Specs/Goals
Technology            0.18-0.20 micron, 5-6 metal layers, fast xtor
Memory                16-32 MB
Die size              ≈ 200-300 mm²
Vector pipes/lanes    4 64-bit (or 8 32-bit or 16 16-bit)
Serial I/O            4 lines @ 1 Gbit/s
Power (university)    ≈ 2 w @ 1-1.5 volt logic
Clock (university)    200 scalar / 200 vector MHz
Perf (university)     1.6 GFLOPS64 – 6.4 GOPS16
Power (industry)      ≈ 1 w @ 1-1.5 volt logic
Clock (industry)      400 scalar / 400 vector MHz
Perf (industry)       3.2 GFLOPS64 – 12.8 GOPS16
(slide annotation: 2X)
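The performance goals follow from the lane count and clock rate if each 64-bit lane completes two operations per cycle. That factor of two is an assumption chosen because it reproduces the quoted figures; it is not stated on the slide. `peak_gops` is a hypothetical helper written for this check.

```python
def peak_gops(clock_mhz, element_bits, lanes_64b=4, ops_per_lane_per_cycle=2):
    """Peak giga-operations per second for a given element width.

    lanes_64b: number of 64-bit vector lanes (4 on the slide).
    ops_per_lane_per_cycle: ASSUMED to be 2; not given on the slide.
    Narrower elements pack more subwords into each 64-bit lane.
    """
    if element_bits not in (16, 32, 64):
        raise ValueError("element_bits must be 16, 32, or 64")
    subwords = lanes_64b * (64 // element_bits)
    return subwords * ops_per_lane_per_cycle * clock_mhz / 1000.0


if __name__ == "__main__":
    for label, mhz in (("university", 200), ("industry", 400)):
        print(label,
              f"{peak_gops(mhz, 64):.1f} GFLOPS (64b)",
              f"{peak_gops(mhz, 16):.1f} GOPS (16b)")
```

With these inputs the 200 MHz design gives 1.6 and 6.4, and the 400 MHz design gives 3.2 and 12.8, matching the Perf rows above; the doubling between the two comes entirely from the clock rate.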
34
Mediaprocessing Functions (Dubey)
Kernel                               Vector length
• Matrix transpose/multiply          # vertices at once
• DCT (video, comm.)                 image width
• FFT (audio)                        256-1024
• Motion estimation (video)          image width, i.w./16
• Gamma correction (video)           image width
• Haar transform (media mining)      image width
• Median filter (image process.)     image width
• Separable convolution (““)         image width
(from http://www.research.ibm.com/people/p/pradeep/tutor.html)
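To make "vector length = image width" concrete, the sketch below applies gamma correction to one image row, split into strips no longer than a hardware vector length. Everything here is illustrative: the maximum vector length of 64, the 8-bit pixel range, and the function names are assumptions, and the inner comprehension stands in for what would be a single vector operation.

```python
def build_gamma_table(gamma, levels=256):
    """Lookup table mapping each pixel level to its gamma-corrected level."""
    top = levels - 1
    return [round(top * (i / top) ** gamma) for i in range(levels)]


def strips(width, max_vector_length):
    """Yield (start, length) chunks covering a row of the given width."""
    for start in range(0, width, max_vector_length):
        yield start, min(max_vector_length, width - start)


def gamma_correct_row(row, table, max_vector_length=64):
    """Correct one row; each strip models one vector operation."""
    out = []
    for start, length in strips(len(row), max_vector_length):
        chunk = row[start:start + length]
        out.extend(table[pixel] for pixel in chunk)
    return out


if __name__ == "__main__":
    table = build_gamma_table(2.2)
    row = [0, 64, 128, 192, 255] * 20   # a 100-pixel row
    corrected = gamma_correct_row(row, table)
    print(len(corrected), list(strips(len(row), 64)))
```

The natural vector length is the full row because every pixel is processed independently; strip-mining only splits that row into pieces the hardware can hold, so wider images simply mean more strips per row.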
35
A More Revolutionary Approach: New Architecture Directions
• “...wires are not keeping pace with scaling of other features. … In fact, for CMOS processes below 0.25 micron ... an unacceptably small percentage of the die will be reachable during a single clock cycle.”
• “Architectures that require long-distance, rapid interaction will not scale well ...”
  – “Will Physical Scalability Sabotage Performance Gains?” Matzke, IEEE Computer (9/97)
36
“Vanilla” IRAM - Performance Conclusions
• IRAM systems with existing architectures provide moderate performance benefits
• High bandwidth / low latency used to speed up memory accesses, not computation
• Reason: existing architectures developed under assumption of low bandwidth memory system
  – Need something better than “build a bigger cache”
  – Important to investigate alternative architectures that better utilize high bandwidth and low latency of IRAM