TRANSCRIPT
© Sudhakar Yalamanchili, Georgia Institute of Technology
Multicore Computing - Evolution
ECE 4100/6100 (2)
Performance Scaling
[Figure: MIPS vs. year, 1970-2020, log scale from 0.01 to 10,000,000, tracking the 8086, 286, 386, 486, and the Pentium®, Pentium® Pro, and Pentium® 4 architectures.]
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (3)
Intel
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared memory
• Traditional I/O
Classic OOO: reservation stations, issue ports, schedulers, etc.
Large, shared, set associative, prefetch, etc.
Source: Intel Corp.
ECE 4100/6100 (4)
IBM Cell Processor
Co-processor accelerator
Heterogeneous MultiCore
High bandwidth, multiple buses
High speed I/O
Classic (stripped down) core
Source: IBM
ECE 4100/6100 (5)
AMD Au1200 System on Chip
Custom cores
Embedded processor
On-Chip I/O
On-Chip Buses
Source: AMD
ECE 4100/6100 (6)
PlayStation 2 Die Photo (SoC)
Source: IEEE Micro, March/April 2000
Floating point MACs
ECE 4100/6100 (7)
Multi-* is Happening
Source: Intel Corp.
ECE 4100/6100 (8)
Intel’s Roadmap for Multicore
Source: Adapted from Tom’s Hardware
[Roadmap figure, 2006-2008, one track each for desktop, mobile, and enterprise processors. Parts progress from single-core (SC, 512KB-2MB cache) and dual-core (DC, 2-16MB, increasingly shared) designs through quad-core (QC, 4MB and 8/16MB shared) to 8-core (8C) 12MB-shared designs at 45nm.]
• Drivers are
– Market segments
– More cache
– More cores
ECE 4100/6100 (9)
Distillation Into Trends
• Technology Trends
– What can we expect/project?
• Architecture Trends
– What are the feasible outcomes?
• Application Trends
– What are the driving deployment scenarios?
– Where are the volumes?
ECE 4100/6100 (10)
Technology Scaling
• 30% scaling down in dimensions doubles transistor density
• Power per transistor
– Vdd scaling → lower power
• Transistor delay = Cgate·Vdd/Isat
– Cgate, Vdd scaling → lower delay
[Figure: MOSFET cross-sections labeling gate, source, drain, body, oxide thickness tox, and channel length L.]
P = α·C·Vdd²·f + Vdd·Ist + Vdd·Ileak
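The scaling rules above can be checked with a few lines of arithmetic; a minimal sketch in Python (the 0.7 scale factor is from the slide, everything else is illustrative):

```python
# Classic (Dennard) scaling: all dimensions shrink by k = 0.7 per generation.
k = 0.7

# Transistor density goes as 1/area, so a 30% shrink roughly doubles it.
density_gain = 1 / k ** 2            # ~2.04x

# Delay = Cgate * Vdd / Isat: Cgate and Vdd scale by k, Isat by k,
# so delay scales by k (gates get ~30% faster).
delay_scale = (k * k) / k            # = k = 0.7

# Dynamic power per transistor ~ alpha * C * Vdd^2 * f:
# C scales by k, Vdd^2 by k^2, f by 1/k, so power per transistor scales by k^2.
power_scale = k * k ** 2 * (1 / k)   # = k^2 = 0.49

print(f"density: {density_gain:.2f}x  delay: {delay_scale:.2f}x  "
      f"power/transistor: {power_scale:.2f}x")
```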
ECE 4100/6100 (11)
Fundamental Trends
High Volume Manufacturing   2004   2006   2008   2010   2012   2014   2016   2018
Technology Node (nm)          90     65     45     32     22     16     11      8
Integration Capacity (BT)      2      4      8     16     32     64    128    256
Delay = CV/I scaling         0.7   ~0.7   >0.7   (delay scaling will slow down)
Energy/Logic Op scaling    >0.35   >0.5   >0.5   (energy scaling will slow down)
Bulk Planar CMOS            High Probability → Low Probability
Alternate, 3G etc.          Low Probability → High Probability
Variability                 Medium → High → Very High
ILD (K)                     ~3 → <3 → reduce slowly towards 2-2.5
RC Delay                    1 (unchanged across generations)
Metal Layers                6-7 → 7-8 → 8-9 (0.5 to 1 layer per generation)
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (12)
Moore’s Law
• How do we use the increasing number of transistors?
• What are the challenges that must be addressed?
Source: Intel Corp.
ECE 4100/6100 (13)
Impact of Moore’s Law To Date
• Push the Memory Wall → larger caches
• Increase Frequency → deeper pipelines
• Increase ILP → concurrent threads, branch prediction, and SMT
• Manage Power → clock gating, activity minimization
IBM Power5
Source: IBM
ECE 4100/6100 (14)
Shaping Future Multicore Architectures
• The ILP Wall
– Limited ILP in applications
• The Frequency Wall
– Not much headroom
• The Power Wall
– Dynamic and static power dissipation
• The Memory Wall
– Gap between compute bandwidth and memory bandwidth
• Manufacturing
– Non-recurring engineering costs
– Time to market
ECE 4100/6100 (15)
The Frequency Wall
• Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays)
• Increasing frequency leads to the power wall
Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger, "Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures," ISCA 2000.
ECE 4100/6100 (16)
Options
• Increase performance via parallelism
– On chip this has been largely at the instruction/data level
• The 1990s through 2005 was the era of instruction level parallelism
– Single instruction multiple data/vector parallelism
– MMX, SSE/SIMD extensions, vector co-processors
– Out-of-Order (OOO) execution cores
– Explicitly Parallel Instruction Computing (EPIC)
• Have we exhausted options in a thread?
ECE 4100/6100 (17)
The ILP Wall - Past the Knee of the Curve?
[Figure: performance vs. design "effort". The curve climbs from scalar in-order through moderate-pipe superscalar/OOO, then flattens for very-deep-pipe aggressive superscalar/OOO: going superscalar/OOO made sense (good ROI); beyond that, very little gain for substantial effort.]
Source: G. Loh
ECE 4100/6100 (18)
The ILP Wall
• Limiting phenomena for ILP extraction:
– Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
– Instruction fetch and decode: at the wall, more instructions cannot be fetched and decoded per clock cycle
– Cache hit rate: poor locality can limit ILP and adversely affects memory bandwidth
– ILP in applications: serial fraction of applications
• Reality:
– Limit studies cap IPC at 100-400 (using an ideal processor)
– Current processors achieve an IPC of only 1-2
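The serial-fraction limit above is Amdahl's law applied to ILP; a small sketch (the 10% serial fraction is an illustrative number, not from the slide):

```python
def amdahl_speedup(serial_fraction, parallel_units):
    """Speedup when only the parallelizable fraction benefits from more units."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_units)

# With a 10% serial fraction, even unlimited issue width caps speedup near 10x,
# illustrating the gap between ideal-machine limit studies and real IPCs of 1-2.
for width in (2, 4, 16, 10 ** 6):
    print(f"{width:>7} units -> {amdahl_speedup(0.10, width):.2f}x")
```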
ECE 4100/6100 (19)
The ILP Wall: Options
• Increase granularity of parallelism
– Simultaneous multithreading (SMT) to exploit TLP
– TLP has to exist, otherwise poor utilization results
– Coarse grain multithreading
– Throughput computing
• New languages/applications
– Data intensive computing in the enterprise
– Media rich applications
ECE 4100/6100 (20)
The Memory Wall
[Figure ("Moore's Law" plot): processor vs. DRAM performance over time, log scale 1-1000. CPU performance grows ~60%/yr while DRAM grows ~7%/yr, so the processor-memory performance gap grows ~50% per year.]
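The 60%/yr vs. 7%/yr rates in the figure compound into the ~50%/yr gap; a quick projection using the slide's rates:

```python
# Per-year performance growth rates from the slide.
cpu_rate, dram_rate = 1.60, 1.07   # processor: 60%/yr, DRAM: 7%/yr

# The gap between them grows by cpu_rate/dram_rate each year (~1.5x, i.e. ~50%/yr).
years = range(0, 21, 5)
gap = [(cpu_rate / dram_rate) ** yr for yr in years]
for yr, g in zip(years, gap):
    print(f"year {yr:2d}: processor-memory gap = {g:8.1f}x")
```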
ECE 4100/6100 (21)
The Memory Wall
• Increasing the number of cores increases the demanded memory bandwidth
• What architectural techniques can meet this demand?
[Figure: projected average memory access time vs. year.]
ECE 4100/6100 (22)
The Memory Wall
[Die photo: AMD Dual-Core Athlon FX, showing CPU0 and CPU1.]
• On-die caches are both area intensive and power intensive
– StrongARM dissipates more than 43% of its power in caches
– Caches incur huge area costs
• Larger caches never deliver the near-universal performance boost offered by frequency ramping (Source: Intel)
IBM Power5
ECE 4100/6100 (23)
The Power Wall
• Power per transistor scales with frequency but also with Vdd (quadratically)
– Lower Vdd can be compensated for with increased pipelining to keep throughput constant
– Power per transistor is not the same as power per area → power density is the problem!
– Multiple units can be run at lower frequencies to keep throughput constant, while saving power
P = α·C·Vdd²·f + Vdd·Ist + Vdd·Ileak
ECE 4100/6100 (24)
Leakage Power Basics
• Sub-threshold leakage
– Increases with lower Vth, higher T, larger W
• Gate-oxide leakage
– Increases with lower Tox, higher W
– High-K dielectrics offer a potential solution
• Reverse-biased pn junction leakage
– Very sensitive to T and V (in addition to diffusion area)
Sub-threshold: I_sub = K·W·e^(−Vth/(n·kT))·(1 − e^(−V/kT))
Gate oxide:    I_ox = K·W·(V/Tox)²·e^(−α·Tox/V)
pn junction:   I_pn,leakage = Js·(e^(qV/kT) − 1)·A
ECE 4100/6100 (25)
The Current Power Trend
Source: Intel Corp.
[Figure: power density (W/cm²) vs. year, 1970-2010, log scale 1-10,000, tracking the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference levels for a hot plate, nuclear reactor, rocket nozzle, and the Sun's surface.]
ECE 4100/6100 (26)
Improving Power/Performance
• Consider constant die size and decreasing core area each generation = more cores/chip
– Lowering voltage and frequency → power reduction
– Increasing cores/chip → performance increase
→ Better power/performance!
P = α·C·Vdd²·f + Vdd·Ist + Vdd·Ileak
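The argument can be sketched with the dynamic-power term P ≈ α·C·Vdd²·f; the voltage/frequency operating points below are illustrative, not from the slide:

```python
def dyn_power(vdd, f, alpha=1.0, c=1.0):
    """Dynamic power P = alpha * C * Vdd^2 * f (arbitrary units)."""
    return alpha * c * vdd ** 2 * f

# Baseline: one core at nominal voltage and frequency.
p_one = dyn_power(vdd=1.0, f=1.0)

# Same die, two cores, each run at 70% frequency and 85% Vdd:
# aggregate throughput rises to 2 * 0.7 = 1.4x while power stays near 1x.
p_two = 2 * dyn_power(vdd=0.85, f=0.7)
print(f"throughput: {2 * 0.7:.1f}x  power: {p_two / p_one:.2f}x")
```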
ECE 4100/6100 (27)
Accelerators
[Die photos: two TCP/IP offload engine prototypes with blocks labeled TCB, exec core, PLL, OOO, ROB, ROM, CAM, LB, input sequencer, and send buffer; 2.23 mm × 3.54 mm, 260K transistors.]
Opportunities: network processing engines, MPEG encode/decode engines, speech engines
[Figure: MIPS vs. year, 1995-2015, log scale: general-purpose (GP) MIPS at 75W vs. TCP/IP Offload Engine (TOE) MIPS at ~2W.]
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (28)
Low-Power Design Techniques
• Circuit and gate level methods
– Voltage scaling
– Transistor sizing
– Glitch suppression
– Pass-transistor logic
– Pseudo-nMOS logic
– Multi-threshold gates
• Functional and architectural methods
– Clock gating
– Clock frequency reduction
– Supply voltage reduction
– Power down/off
– Algorithmic and software techniques
Two decades worth of research and development!
ECE 4100/6100 (29)
The Economics of Manufacturing
• Where are the costs of developing the next generation of processors?
– Design costs
– Manufacturing costs
• What types of chip-level solutions do the economics imply?
• Assessing the implications of Moore’s Law is an exercise in mass production
ECE 4100/6100 (30)
The Cost of An ASIC
Example: design with 80M transistors in 100nm technology
Estimated cost: $85M-$90M
[Figure: ASIC development flow of design, prototype, and implementation phases, each followed by verification, leading to production over 12-18 months.]
• Cost and risk rising to unacceptable levels
• Top cost drivers
– Verification (40%)
– Architecture design (23%)
– Embedded software design
– 1400 man-months (SW)
– 1150 man-months (HW)
– HW/SW integration
*Handel H. Jones, “How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com
ECE 4100/6100 (31)
The Spectrum of Architectures
[Figure: spectrum of architectures, from customization fully in hardware to customization fully in software, with design NRE, effort, and time to market increasing toward the custom-hardware end: Custom ASIC → Structured ASIC (LSI Logic, Leopard Logic) → FPGA (Xilinx, Altera) → Polymorphic Computing Architectures (MONARCH, RAW, TRIPS) → Fixed + Variable ISA (Tensilica, Stretch Inc.) → Tiled architectures (PACT, PICOChip) → Microprocessor. The hardware end is programmed by synthesis, the software end by compilation.]
ECE 4100/6100 (32)
Interlocking Trade-offs
[Figure: interlocking trade-offs among Power, Frequency, ILP, and Memory, with edges labeled speculation, bandwidth, dynamic power, dynamic penalties, miss penalty, and leakage power.]
• Improving one property comes at the expense of the others
• We need new approaches to co-optimization!
ECE 4100/6100 (33)
Multi-core Architecture Drivers
• Addressing ILP limits
– Multiple threads
– Coarse grain parallelism → raise the level of abstraction
• Addressing frequency and power limits
– Multiple slower cores across technology generations
– Scaling via increasing the number of cores rather than frequency
– Heterogeneous cores for improved power/performance
• Addressing memory system limits
– Deep, distributed cache hierarchies
– OS replication → shared memory remains dominant
• Addressing manufacturing issues
– Design and verification costs
Replication → the network becomes more important!
Revisiting Parallelism