
Slides 1-2

Thousand Core Chips: A Technology Perspective
Shekhar Borkar
Intel Corp.
June 7, 2007

Slide 3: Outline

Technology outlook
Evolution of Multi: thousands of cores?
How do you feed thousands of cores?
Future challenges: variations and reliability
Resiliency
Summary

Slide 4: Technology Outlook

High Volume Manufacturing:   2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology Node (nm):        90, 65, 45, 32, 22, 16, 11, 8
Integration Capacity (BT):   2, 4, 8, 16, 32, 64, 128, 256
Delay = CV/I scaling:        0.7, ~0.7, >0.7 (delay scaling will slow down)
Energy/Logic Op scaling:     >0.35, >0.5, >0.5 (energy scaling will slow down)
Bulk Planar CMOS:            high probability, trending to low probability
Alternate devices (3G etc.): low probability, trending to high probability
Variability:                 medium, high, very high
ILD (K):                     ~3, <3, reducing slowly towards 2-2.5
RC Delay:                    1 (flat across generations)
Metal Layers:                6-7, 7-8, 8-9, adding 0.5 to 1 layer per generation

Slide 5: Terascale Integration Capacity

[Chart: transistors (millions, log scale) vs. year, 2001-2017, for a 300mm2 die: total transistors, ~1.5B logic transistors, ~100MB cache]

100+B transistor integration capacity

Slide 6: Scaling Projections

[Charts, 2001-2017: frequency (GHz), 1.5X ideal vs. 1.25X realistic scaling; Vdd (volts), 0.7X ideal vs. realistic scaling; power (watts) for a 300mm2 die]

Frequency scaling will slow down
Vdd scaling will slow down
Power will be too high

Slide 7: Why Multi-core? Performance

[Chart: performance (X) vs. area (X) or power (X), slope ~0.5. Pollack's Rule: 2X power = 1.4X performance]
[Chart: relative performance, 2001-2017: single core vs. multi-core (potential), a gap of >10X]

Ever-larger single cores yield diminishing performance within a power envelope
Multi-core provides the potential for near-linear performance speedup
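The trade-off behind these two charts can be worked through numerically. Below is a minimal sketch in Python of Pollack's Rule (performance roughly proportional to the square root of core area or power); the 4x/1x area split is illustrative only, not from the slides.

    # Pollack's Rule sketch: single-core performance ~ sqrt(area or power).

    def pollack_perf(area):
        """Relative single-core performance for a given relative core area."""
        return area ** 0.5

    big_core = pollack_perf(4)        # one core with 4x area: ~2x performance
    four_small = 4 * pollack_perf(1)  # four 1x cores: up to 4x throughput in parallel

    print(f"1 big core (4x area):    {big_core:.1f}x performance")
    print(f"4 small cores (1x each): {four_small:.1f}x ideal parallel throughput")
    # Doubling a single core's power buys only ~1.4x (sqrt 2) performance,
    # while the same budget spent on more cores can scale nearly linearly.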

Slide 8: Why Dual-core? Power

Rule of thumb:
  Voltage   Frequency   Power   Performance
    1%         1%        3%       0.66%

[Diagram: one core + cache vs. two cores + cache]

In the same process technology:
  Single core: Voltage = 1,    Freq = 1,    Area = 1, Power = 1, Perf = 1
  Dual core:   Voltage = -15%, Freq = -15%, Area = 2, Power = 1, Perf = ~1.8
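A back-of-the-envelope check of these numbers, applying the slide's rule of thumb linearly; this is a sketch only, since the slide's figures are themselves approximate.

    # Slide-8 rule of thumb: 1% voltage + 1% frequency ~ 3% power, ~0.66% performance.
    # Applied (linearly) to a dual core running at -15% voltage and frequency.

    scale = 15                            # percent reduction in voltage and frequency

    power_per_core = 1 - 0.03 * scale     # ~0.55x power per core
    perf_per_core = 1 - 0.0066 * scale    # ~0.90x performance per core

    dual_power = 2 * power_per_core       # ~1.1x: roughly the original power budget
    dual_perf = 2 * perf_per_core         # ~1.8x on a well-parallelized workload

    print(f"Dual-core power ~{dual_power:.2f}x, performance ~{dual_perf:.2f}x")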

Slide 9: From Dual to Multi

[Diagram: one large core + cache vs. four small cores (C1-C4) + cache]

Relative to the large core, each small core: Power = 1/4, Performance = 1/2

Multi-core: power efficient
Better power and thermal management

Slide 10: Future Multi-core Platform

[Diagram: general-purpose (GP) cores and special-purpose (SP) hardware blocks connected by an interconnect fabric (C blocks)]

Heterogeneous multi-core platform: SOC

Slide 11: Fine Grain Power Management

[Diagram: a grid of cores, each set independently to one of three states]

Cores with critical tasks: Freq = f at Vdd.        TPT = 1,   Power = 1
Non-critical cores:        Freq = f/2 at 0.7xVdd.  TPT = 0.5, Power = 0.25
Cores shut down:           TPT = 0,   Power = 0
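The 0.25x power figure for the non-critical cores follows from the standard dynamic-power approximation P ~ C * V^2 * f; the formula itself is not stated on the slide, so the sketch below is just a consistency check.

    # Dynamic power approximation P ~ C * V^2 * f (standard CMOS rule, not from the slide).

    def relative_power(v_scale, f_scale):
        return (v_scale ** 2) * f_scale

    print(relative_power(1.0, 1.0))   # 1.0    core running a critical task at f, Vdd
    print(relative_power(0.7, 0.5))   # ~0.245 non-critical core at f/2, 0.7x Vdd
    print(relative_power(0.0, 0.0))   # 0.0    core shut down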

Slide 12: Performance Scaling

Amdahl's Law: Parallel Speedup = 1 / (Serial% + (1 - Serial%) / N)

[Chart: performance vs. number of cores]

Serial% = 6.7%: 16 cores give Perf = 8 (N = 16, N1/2 = 8)
Serial% = 20%:   6 cores give Perf = 3 (N = 6, N1/2 = 3)

Parallel software is key to multi-core success
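The two data points on the slide can be reproduced directly from the quoted formula; a short check in Python:

    # Amdahl's Law as quoted on slide 12: speedup = 1 / (s + (1 - s) / N),
    # where s is the serial fraction and N the number of cores.

    def amdahl_speedup(serial_fraction, n_cores):
        return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

    print(f"{amdahl_speedup(0.067, 16):.1f}x")   # ~8x with 16 cores at 6.7% serial code
    print(f"{amdahl_speedup(0.20, 6):.1f}x")     # ~3x with 6 cores at 20% serial code

    # Even 6.7% serial code caps the speedup at 1/0.067 ~ 15x, however many cores are added.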

Slide 13: From Multi to Many...

13mm, 100W, 48MB Cache, 4B Transistors, in 22nm:
  Large cores:  12 cores,  relative single-core performance = 1
  Medium cores: 48 cores,  relative single-core performance = 0.5
  Small cores:  144 cores, relative single-core performance = 0.3

[Charts: system throughput (TPT) running one, two, four, and eight apps, and relative single-core performance, for the large/medium/small core options]

Slide 14: From Many to Too Many...

13mm, 100W, 96MB Cache, 8B Transistors, in 16nm:
  Large cores:  24 cores,  relative single-core performance = 1
  Medium cores: 96 cores,  relative single-core performance = 0.5
  Small cores:  288 cores, relative single-core performance = 0.3

[Charts: system throughput (TPT) running one, two, four, and eight apps, and relative single-core performance, for the large/medium/small core options]

Slide 15: On-Die Network Power

[Chart: relative throughput, 2001-2017: small 1.5MT core (~1000 cores) vs. large 15MT core (~100 cores)]
[Chart: on-die network power (W), 2001-2017, for a 300mm2 die with 4B-wide links and 4 links/core: ~15W vs. ~150W]

A careful balance of:
1. Throughput performance
2. Single-thread performance (core size)
3. Core and network power

Slide 16: Observations

Scaling multi-core demands more parallelism every generation
• Thread level, task level, application level

Many (or too many) cores does not always mean...
• The highest performance
• The highest MIPS/Watt
• The lowest power

If on-die network power is significant, then power is even worse

Now software, too, must follow Moore's Law

Slide 17: Memory BW Gap

Busses have become wider to deliver the necessary memory BW (10 to 30 GB/sec)
Yet, memory BW is not enough
A many-core system will demand 100 GB/sec of memory BW

[Chart: core clock vs. bus clock (MHz), 1985-2010, showing a widening gap]

How do you feed the beast?

Slide 18: IO Pins and Power

[Chart: IO power (mW/Gbps) vs. signaling rate (Gbit/sec): state of the art vs. research]

State of the art:
  100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec
  At 25 mW/Gb/sec, that is 25 Watts
  Bus width = 1,000 / 5 = 200 lanes, about 400 pins (differential)

Too many signal pins, too much power

Slide 19: Solution

[Diagram: two chips linked by a high-speed bus over > 5mm vs. two chips < 2mm apart]

High-speed busses:
  Busses are transmission lines (L-R-C effects)
  Need signal termination
  Signal processing consumes power

Solutions:
  Reduce distance to << 5mm
  R-C bus
  Reduce signaling speed (~1 Gb/sec)
  Increase pins to deliver BW
  1-2 mW/Gbps

With the short, slower bus:
  100 GB/sec ~ 1 Tb/sec = 1,000 Gb/sec
  At 2 mW/Gb/sec, that is 2 Watts
  Bus width = 1,000 / 1 = 1,000 pins
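The pin and power arithmetic on slides 18 and 19 can be reproduced in a few lines; the helper function below is hypothetical, but the link rates and per-bit energies are taken from the slides (the short R-C bus is assumed single-ended, consistent with the 1,000-pin count).

    # Pins and power needed to move ~100 GB/sec (~1,000 Gb/sec) off chip.

    def io_budget(bandwidth_gbps, lane_rate_gbps, mw_per_gbps, differential=True):
        """Return (lanes, pins, watts) for a given link speed and energy per bit."""
        lanes = bandwidth_gbps / lane_rate_gbps
        pins = lanes * (2 if differential else 1)
        watts = bandwidth_gbps * mw_per_gbps / 1000.0
        return lanes, pins, watts

    # State of the art: 5 Gb/sec differential links at 25 mW/Gbps.
    print(io_budget(1000, lane_rate_gbps=5, mw_per_gbps=25))
    # -> 200 lanes, ~400 pins, 25 W

    # Short R-C bus: 1 Gb/sec links at 2 mW/Gbps.
    print(io_budget(1000, lane_rate_gbps=1, mw_per_gbps=2, differential=False))
    # -> 1,000 pins, 2 W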

Slide 20: Anatomy of a Silicon Chip

[Diagram: silicon chip in a package under a heat-sink; heat is removed through the heat-sink, power and signals come through the package]

Slide 21: System in a Package

[Diagram: two silicon chips side by side in one package]

Limited pins: 10mm / 50 micron = 200 pins
Signal distance is large (~10mm), so power is higher
Complex package

Package

DRAM on TopDRAM on Top

CPU

Temp = 85°C

Junction Temp = 100+°C

High temp, hot spotsNot good for DRAM

DRAM

Heat-sink

23

Package

DRAM at the BottomDRAM at the Bottom

DRAM

CPU

Heat-sink

Power and IO signals go through DRAM to CPU

Thin DRAM die

Through DRAM vias

The most promising solution to feed the beastThe most promising solution to feed the beast

Slide 24: Reliability

[Charts: soft error FIT/chip (logic & memory) rising; time-dependent device degradation (relative Ion over time), so burn-in may phase out; relative Jox across the 180/90/45/22nm nodes, with Hi-K an open question; extreme device variations, with the Vt distribution (100-200 mV) getting wider]

Slide 25: Implications to Reliability

Extreme variations (static & dynamic) will result in unreliable components

Impossible to design a reliable system as we know it today:
• Transient errors (soft errors)
• Gradual errors (variations)
• Time-dependent errors (degradation)

Reliable systems with unreliable components: Resilient Architectures

Slide 26: Implications to Test

One-time factory testing will be out
Burn-in to catch chip infant mortality will not be practical
Test HW will be part of the design
Dynamically self-test, detect errors, reconfigure, & adapt

Slide 27: In a Nutshell...

100 BT integration capacity
Billions of transistors unusable (variations)
Some will fail over time
Intermittent failures

Yet, deliver high performance in the power & cost envelope

Slide 28: Resiliency with Many-Core

Dynamic on-chip testing
Performance profiling
Cores in reserve (spares)
Binning strategy
Dynamic, fine-grain performance and power management
Coarse-grain redundancy checking
Dynamic error detection & reconfiguration
Decommission aging cores, swap with spares

Dynamically...
1. Self test & detect
2. Isolate errors
3. Confine
4. Reconfigure, and
5. Adapt

[Diagram: grid of cores (C)]
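A minimal, purely hypothetical sketch in Python of the dynamic test/isolate/reconfigure loop described on this slide; the class, method names, and test logic are illustrative, not from the presentation.

    # Hypothetical resiliency loop: periodically self-test each core, retire
    # failing or aged-out cores, and reconfigure by swapping in spares.

    class CorePool:
        def __init__(self, active, spares):
            self.active = set(active)
            self.spares = list(spares)
            self.retired = set()

        def self_test(self, core):
            """Placeholder for on-die self-test / performance profiling of one core."""
            return True   # a real implementation would exercise the core here

        def service(self):
            for core in list(self.active):
                if not self.self_test(core):                # 1. self test & detect
                    self.active.discard(core)               # 2-3. isolate and confine
                    self.retired.add(core)                  #      decommission the core
                    if self.spares:
                        self.active.add(self.spares.pop())  # 4. reconfigure with a spare
            # 5. adapt: surviving cores can be re-binned for voltage/frequency here

    pool = CorePool(active=range(8), spares=[8, 9])
    pool.service()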

Slide 29: Summary

Moore's Law, with terascale integration capacity, will allow integration of thousands of cores
Power continues to be the challenge
On-die network power could be significant
Optimize for power with the size of the core and the number of cores
3D memory technology is needed to feed the beast
Many-cores will deliver the highest performance in the power envelope, with resiliency
