TRANSCRIPT
ECE 4100/6100 (1)
Multicore Computing - Evolution
ECE 4100/6100 (2)
Performance Scaling
[Figure: MIPS (log scale, 0.01 to 10,000,000) vs. year (1970-2020), tracing the 8086, 286, 386, 486, Pentium, Pentium Pro, and Pentium 4 architectures.]
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (3)
Intel
• Homogeneous cores
• Bus-based on-chip interconnect
• Shared memory
• Traditional I/O
• Classic OOO: reservation stations, issue ports, schedulers, etc.
• Large, shared, set-associative caches with prefetch, etc.
Source: Intel Corp.
ECE 4100/6100 (4)
IBM Cell Processor
Co-processor accelerator
Heterogeneous multicore
High bandwidth, multiple buses
High-speed I/O
Classic (stripped-down) core
Source: IBM
ECE 4100/6100 (5)
AMD Au1200 System on Chip
Custom cores
Embedded processor
On-chip I/O
On-chip buses
Source: AMD
ECE 4100/6100 (6)
PlayStation 2 Die Photo (SoC)
Source: IEEE Micro, March/April 2000
Floating point MACs
ECE 4100/6100 (7)
Multi-* is Happening
Source: Intel Corp.
ECE 4100/6100 (8)
Intel’s Roadmap for Multicore
Source: Adapted from Tom’s Hardware
2006-2008 roadmap (SC = single-core, DC = dual-core, QC = quad-core, 8C = eight-core):
• Desktop processors: SC 1MB; DC 2MB; DC 2/4MB shared; DC 3MB/6MB shared (45nm)
• Mobile processors: DC 2/4MB; DC 2/4MB shared; DC 4MB; DC 3MB/6MB shared (45nm)
• Enterprise processors: SC 512KB/1/2MB; DC 2MB; DC 4MB; DC 16MB; QC 4MB; QC 8/16MB shared; 8C 12MB shared (45nm)
• Drivers are
– Market segments
– More cache
– More cores
ECE 4100/6100 (9)
Distillation Into Trends
• Technology trends
– What can we expect/project?
• Architecture trends
– What are the feasible outcomes?
• Application trends
– What are the driving deployment scenarios?
– Where are the volumes?
ECE 4100/6100 (10)
Technology Scaling
• 30% scaling down in dimensions doubles transistor density
• Power per transistor
– Vdd scaling → lower power
• Transistor delay = Cgate·Vdd / ISAT
– Cgate, Vdd scaling → lower delay
[Figure: MOSFET cross-section showing gate, source, drain, body, oxide thickness tox, and channel length L.]
$P = C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak}$
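To make the scaling arithmetic concrete, here is a minimal worked example assuming an ideal (Dennard-style) 0.7x shrink in which Cgate, Vdd, and ISAT all scale by roughly 0.7; the numbers are illustrative, not taken from the slides.

```latex
% 0.7x linear shrink: area per transistor scales by 0.7 * 0.7 = 0.49,
% so transistor density roughly doubles.
\[
  \frac{A'}{A} = 0.7 \times 0.7 = 0.49
  \quad\Rightarrow\quad
  \text{density} \approx 2\times
\]
% Gate delay, with C_gate, V_dd, and I_SAT each scaling by ~0.7:
\[
  \tau' = \frac{C'_{gate}\,V'_{dd}}{I'_{SAT}}
        = \frac{(0.7\,C_{gate})(0.7\,V_{dd})}{0.7\,I_{SAT}}
        = 0.7\,\tau
\]
```

This is consistent with the "Delay = CV/I scaling ~0.7" row in the table on the next slide.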
ECE 4100/6100 (11)
Fundamental Trends
High-volume manufacturing year: 2004, 2006, 2008, 2010, 2012, 2014, 2016, 2018
Technology node (nm): 90, 65, 45, 32, 22, 16, 11, 8
Integration capacity (BT): 2, 4, 8, 16, 32, 64, 128, 256
Delay = CV/I scaling: 0.7, ~0.7, then >0.7 (delay scaling will slow down)
Energy/logic-op scaling: >0.35, then >0.5, >0.5 (energy scaling will slow down)
Bulk planar CMOS: high probability early, low probability in later nodes
Alternate devices (3G etc.): low probability early, high probability in later nodes
Variability: medium, then high, then very high
ILD (K): ~3, then <3, reducing slowly towards 2-2.5
RC delay: 1 in every generation
Metal layers: 6-7, then 7-8, then 8-9; 0.5 to 1 layer per generation
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (12)
Moore’s Law
• How do we use the increasing number of transistors?
• What are the challenges that must be addressed?
Source: Intel Corp.
ECE 4100/6100 (13)
Impact of Moore’s Law To Date
• Push the memory wall: larger caches
• Increase frequency: deeper pipelines
• Increase ILP: concurrent threads, branch prediction, and SMT
• Manage power: clock gating, activity minimization
[Die photo: IBM Power5. Source: IBM]
ECE 4100/6100 (14)
Shaping Future Multicore Architectures
• The ILP wall
– Limited ILP in applications
• The frequency wall
– Not much headroom
• The power wall
– Dynamic and static power dissipation
• The memory wall
– Gap between compute bandwidth and memory bandwidth
• Manufacturing
– Non-recurring engineering costs
– Time to market
ECE 4100/6100 (15)
The Frequency Wall
• Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays)
• Increasing frequency leads to the power wall
Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, Doug Burger. Clock rate versus IPC: the end of the road for conventional microarchitectures. In ISCA 2000
ECE 4100/6100 (16)
Options
• Increase performance via parallelism
– On chip this has been largely at the instruction/data level
• The 1990s through 2005 was the era of instruction-level parallelism
– Single instruction multiple data / vector parallelism
  • MMX, SSE, vector co-processors
– Out-of-order (OOO) execution cores
– Explicitly Parallel Instruction Computing (EPIC)
• Have we exhausted options within a thread?
ECE 4100/6100 (17)
The ILP Wall - Past the Knee of the Curve?
[Figure: performance vs. "effort" curve with three design points: scalar in-order, moderate-pipe superscalar/OOO, and very-deep-pipe aggressive superscalar/OOO. Annotations: "Made sense to go superscalar/OOO: good ROI" and "Very little gain for substantial effort."]
Source: G. Loh
ECE 4100/6100 (18)
The ILP Wall
• Limiting phenomena for ILP extraction:
– Clock rate: at the wall, each increase in clock rate has a corresponding CPI increase (branches, other hazards)
– Instruction fetch and decode: at the wall, more instructions cannot be fetched and decoded per clock cycle
– Cache hit rate: poor locality can limit ILP and adversely affects memory bandwidth
– ILP in applications: the serial fraction of applications
• Reality:
– Limit studies cap IPC at 100-400 (using an ideal processor)
– Current processors achieve an IPC of only 1-2
ECE 4100/6100 (19)
The ILP Wall: Options
• Increase granularity of parallelism
– Simultaneous multithreading to exploit TLP
  • TLP has to exist, otherwise poor utilization results
– Coarse-grain multithreading
– Throughput computing
• New languages/applications
– Data-intensive computing in the enterprise
– Media-rich applications
ECE 4100/6100 (20)
The Memory Wall
[Figure: processor vs. DRAM performance over time (log scale, 1 to 1000). Processor performance ("Moore's Law") improves ~60%/year while DRAM improves ~7%/year; the processor-memory performance gap grows about 50% per year.]
ECE 4100/6100 (21)
The Memory Wall
• Increasing the number of cores increases the demanded memory bandwidth
• What architectural techniques can meet this demand?
[Figure: average memory access time vs. year, with the future trend marked by a question mark.]
ECE 4100/6100 (22)
The Memory Wall
[Die photos: AMD dual-core Athlon FX (CPU0, CPU1) and IBM Power5.]
• On-die caches are both area intensive and power intensive
– StrongARM dissipates more than 43% of its power in caches
– Caches incur huge area costs
• Larger caches never deliver the near-universal performance boost offered by frequency ramping (Source: Intel)
ECE 4100/6100 (23)
The Power Wall
• Power per transistor scales with frequency but also scales with Vdd
– Lower Vdd can be compensated for with increased pipelining to keep throughput constant
– Power per transistor is not the same as power per area; power density is the problem!
– Multiple units can be run at lower frequencies to keep throughput constant while saving power
$P = C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak}$
ECE 4100/6100 (24)
Leakage Power Basics
• Sub-threshold leakage
– Increases with lower Vth, higher T, higher W
• Gate-oxide leakage
– Increases with lower Tox, higher W
– High-K dielectrics offer a potential solution
• Reverse-biased pn junction leakage
– Very sensitive to T and V (in addition to diffusion area)
$I_{sub} = K_{1}\,W\,e^{-V_{th}/nkT}\left(1 - e^{-V/kT}\right)$

$I_{ox} = K_{2}\,W\left(\frac{V}{T_{ox}}\right)^{2} e^{-\alpha\,T_{ox}/V}$

$I_{pn,leakage} = J\left(e^{qV/kT} - 1\right)A_{pn}$
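As a rough numerical illustration of how sensitive the sub-threshold term is to Vth, the expression above can be evaluated directly; all constants and operating points below are made-up placeholders, not process data.

```python
import math

# Illustrative constants only; real values depend on the process and device.
K1 = 1e-6          # assumed pre-exponential constant (A per um of width)
W = 1.0            # transistor width in um (assumed)
n = 1.5            # assumed sub-threshold slope factor
kT_q = 0.0259      # thermal voltage kT/q at 300 K, in volts
V = 1.0            # drain-source voltage in volts (assumed)

def i_sub(vth):
    """Sub-threshold leakage: I_sub = K1*W*exp(-Vth/(n*kT))*(1 - exp(-V/kT))."""
    return K1 * W * math.exp(-vth / (n * kT_q)) * (1.0 - math.exp(-V / kT_q))

# Lowering Vth from 0.40 V to 0.30 V raises leakage by exp(0.10/(n*kT)) ~ 13x.
for vth in (0.40, 0.35, 0.30):
    print(f"Vth = {vth:.2f} V  ->  I_sub ~ {i_sub(vth):.3e} A")
```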
ECE 4100/6100 (25)
The Current Power Trend
Source: Intel Corp.
[Figure: power density (W/cm², log scale from 1 to 10,000) vs. year (1970-2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium, and P6; the extrapolated trend passes a hot plate and approaches a nuclear reactor, a rocket nozzle, and the Sun's surface.]
ECE 4100/6100 (26)
Improving Power/Performance
• Consider constant die size and decreasing core area each generation = more cores per chip
– Effect of lowering voltage and frequency → power reduction
– Increasing cores per chip → performance increase
Better power/performance!
$P = C V_{dd}^{2} f + V_{dd} I_{st} + V_{dd} I_{leak}$
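A back-of-the-envelope sketch of this argument, modeling only the dynamic term of the equation above and assuming frequency scales roughly linearly with Vdd; the 0.8x factors and the perfectly parallel workload are illustrative assumptions, not slide data.

```python
# Back-of-the-envelope comparison: one fast core vs. two slower cores.
# Only dynamic power P_dyn = C * Vdd^2 * f is modeled; C is normalized to 1.

def dynamic_power(vdd, freq, c=1.0):
    """Dynamic switching power, normalized units."""
    return c * vdd**2 * freq

# Baseline: one core at nominal voltage and frequency.
baseline_power = dynamic_power(vdd=1.0, freq=1.0)

# Scaled: two cores, each at 0.8x Vdd and 0.8x frequency.
scaled_power = 2 * dynamic_power(vdd=0.8, freq=0.8)
scaled_perf = 2 * 0.8          # assumes perfectly parallel work

print(f"power:       {scaled_power / baseline_power:.2f}x of baseline")  # ~1.02x
print(f"performance: {scaled_perf:.2f}x of baseline")                    # 1.60x
```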
ECE 4100/6100 (27)
Accelerators
[Die floorplans: TCP/IP offload engine prototypes showing TCB, exec core, PLL, OOO, ROM, CAM1, ROB, CLB, input sequencer, and send buffer blocks; 2.23 mm x 3.54 mm, 260K transistors.]
Opportunities: network processing engines, MPEG encode/decode engines, speech engines
[Figure: MIPS (log scale, 1E+02 to 1E+06) vs. year (1995-2015), comparing general-purpose (GP) MIPS at 75 W against TCP/IP offload engine (TOE) MIPS at ~2 W.]
TCP/IP Offload Engine
Source: Shekhar Borkar, Intel Corp.
ECE 4100/6100 (28)
Low-Power Design Techniques
• Circuit- and gate-level methods
– Voltage scaling
– Transistor sizing
– Glitch suppression
– Pass-transistor logic
– Pseudo-nMOS logic
– Multi-threshold gates
• Functional and architectural methods
– Clock gating
– Clock frequency reduction
– Supply voltage reduction
– Power down/off
– Algorithmic and software techniques
Two decades' worth of research and development!
ECE 4100/6100 (29)
The Economics of Manufacturing
• Where are the costs of developing the next-generation processors?
– Design costs
– Manufacturing costs
• What type of chip-level solutions do the economics imply?
• Assessing the implications of Moore's Law is an exercise in mass production
ECE 4100/6100 (30)
The Cost of An ASIC
Example: a design with 80M transistors in 100 nm technology
Estimated cost: $85M-$90M
[Figure: design flow of design, implementation, and prototype stages, each followed by verification, leading to production over 12-18 months.]
• Cost and risk rising to unacceptable levels
• Top cost drivers
– Verification (40%)
– Architecture design (23%)
– Embedded software design
  • 1400 man-months (SW)
  • 1150 man-months (HW)
– HW/SW integration
*Handel H. Jones, “How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com
ECE 4100/6100 (31)
The Spectrum of Architectures
[Figure: a spectrum of architectures ranging from customization fully in hardware to customization fully in software. From the hardware end to the software end: custom ASIC (hardware development, synthesis), structured ASIC (LSI Logic, Leopard Logic), FPGA (Xilinx, Altera), polymorphic computing architectures and tiled architectures (MONARCH, RAW, TRIPS, PACT, PICOChip), fixed + variable ISA (Tensilica, Stretch Inc.), and microprocessor (software development, compilation). Design NRE effort, customization, and time to market vary across the spectrum.]
ECE 4100/6100 (32)
Interlocking Trade-offs
[Figure: interlocking trade-offs among power, memory, frequency, and ILP; the connections are labeled speculation, bandwidth, dynamic power, dynamic penalties, miss penalty, and leakage power.]
ECE 4100/6100 (33)
Multi-core Architecture Drivers
• Addressing ILP limits
– Multiple threads
– Coarse-grain parallelism → raise the level of abstraction
• Addressing frequency and power limits
– Multiple slower cores across technology generations
– Scaling via increasing the number of cores rather than frequency
– Heterogeneous cores for improved power/performance
• Addressing memory system limits
– Deep, distributed cache hierarchies
– OS replication → shared memory remains dominant
• Addressing manufacturing issues
– Design and verification costs → replication; the network becomes more important!
ECE 4100/6100 (34)
Parallelism
ECE 4100/6100 (35)
Beyond ILP
• Performance is limited by the serial fraction
[Figure: execution time with 1, 2, 3, and 4 CPUs; only the parallelizable fraction shrinks with more CPUs while the serial fraction remains.]
• Coarse-grain parallelism in the post-ILP era
– Thread, process, and data parallelism
• Learn from the lessons of the parallel processing community
– Revisit the classifications and architectural techniques
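The limit imposed by the serial fraction is usually quantified with Amdahl's law; a short statement with illustrative numbers (the slide itself gives no figures):

```latex
% Amdahl's law: p = parallelizable fraction, N = number of CPUs.
\[
  \text{Speedup}(N) = \frac{1}{(1 - p) + \dfrac{p}{N}}
\]
% Example with p = 0.9: N = 4 gives 1/(0.1 + 0.225) \approx 3.1x,
% and even N -> infinity is capped at 1/(1 - p) = 10x.
```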
ECE 4100/6100 (36)
Flynn’s Model
• Flynn's classification
– Single instruction stream, single data stream (SISD)
  • The conventional, word-sequential architecture, including pipelined computers
– Single instruction stream, multiple data stream (SIMD)
  • The multiple-ALU-type architectures (e.g., array processors)
– Multiple instruction stream, single data stream (MISD)
  • Not very common
– Multiple instruction stream, multiple data stream (MIMD)
  • The traditional multiprocessor system
M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.
ECE 4100/6100 (37)
SIMD/Vector Computation
• SIMD and Vector models are spatial and temporal analogs of each other
• A rich architectural history dating back to 1953!
[Figures: IBM Cell SPE organization and SPE pipeline diagram. Sources: Cray, IBM.]
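As a loose software analogy (not from the slides), the same element-wise operation can be organized "spatially", with all lanes computing in one step, or "temporally", with elements streaming through a single pipeline; the sketch below only illustrates the data-parallel idea, not real hardware.

```python
# Illustrative analogy: one operation applied across many data elements.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0]

# "Spatial" SIMD view: conceptually, all lanes compute in the same step.
simd_result = [x + y for x, y in zip(a, b)]

# "Temporal" vector view: elements stream through one pipeline over time.
vector_result = []
for x, y in zip(a, b):          # one element per "cycle"
    vector_result.append(x + y)

assert simd_result == vector_result
print(simd_result)
```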
ECE 4100/6100 (38)
SIMD/Vector Architectures
• VIRAM (Vector IRAM)
– Logic is slow in a DRAM process
– Put a vector unit in a DRAM and provide a port between a traditional processor and the vector IRAM, instead of a whole processor in DRAM
Source: Berkeley Vector IRAM
ECE 4100/6100 (39)
MIMD Machines
• Parallel processing has catalyzed the development of several generations of parallel processing machines
• Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers
[Figure: four nodes, each containing a processor + cache, a directory, and memory, connected by an interconnection network.]
ECE 4100/6100 (40)
Basic Models for Parallel Programs
• Shared memory
– Coherency/consistency are driving concerns
– The programming model is simplified at the expense of system complexity
• Message passing
– Typically implemented on distributed memory machines
– System complexity is simplified at the expense of increased effort by the programmer
ECE 4100/6100 (41)
Shared Memory Model
• That's basically it...
– Need to fork/join threads and synchronize (typically with locks)
[Figure: CPU0 writes X and CPU1 reads X through a shared main memory.]
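A minimal sketch of this model in software, using a shared counter protected by a lock; the thread count and workload are illustrative only.

```python
import threading

# Shared state: all threads read and write the same memory location.
counter = 0
lock = threading.Lock()

def worker(increments):
    global counter
    for _ in range(increments):
        with lock:              # synchronize: only one writer at a time
            counter += 1

# "Fork" four threads, then "join" them before using the result.
threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: correct only because updates were locked
```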
ECE 4100/6100 (42)
Message Passing Protocols
• Explicitly send data from one thread to another
– Need to track the IDs of other CPUs
– A broadcast may need multiple sends
– Each CPU has its own memory space
• Hardware: send/recv queues between CPUs
[Figure: CPU0 sends to CPU1 through send/recv queues.]
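A minimal sketch of the message-passing model, using separate processes with private address spaces that communicate only through explicit send/receive queues; the process count and payloads are assumptions for illustration.

```python
import multiprocessing as mp

def worker(rank, inbox, outbox):
    """Each process has its own address space; data moves only via messages."""
    msg = inbox.get()               # blocking receive
    outbox.put((rank, msg * 2))     # explicit send back to the parent

if __name__ == "__main__":
    n = 2
    inboxes = [mp.Queue() for _ in range(n)]   # per-CPU receive queues
    results = mp.Queue()

    procs = [mp.Process(target=worker, args=(i, inboxes[i], results))
             for i in range(n)]
    for p in procs:
        p.start()

    for i, q in enumerate(inboxes):            # send one message to each worker
        q.put(10 + i)

    for _ in range(n):
        print(results.get())                   # e.g. (0, 20) and (1, 22)
    for p in procs:
        p.join()
```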
ECE 4100/6100 (43)
Shared Memory Vs. Message Passing
• Shared memory doesn't scale as well to a larger number of nodes
– Communications are broadcast based
– The bus becomes a severe bottleneck
• Message passing doesn't need a centralized bus
– Can arrange a multiprocessor like a graph: nodes = CPUs, edges = independent links/routes
– Can have multiple communications/messages in transit at the same time
ECE 4100/6100 (44)
Two Emerging Challenges
• Programming models and compilers?
• Interconnection networks
Source: IBM; Source: Intel Corp.