Post on 15-Jan-2016
1
Pipelining for Multi-Core Architectures
2
Multi-Core Technology
[Timeline diagram: 2004, single core (one core + cache); 2005, dual core (2 or more cores, each with cache); 2007, multi-core (4 or more cores, with 2x more cores over time, each with cache).]
3
Why multi-core?
• It is difficult to push single-core clock frequencies even higher
• Deeply pipelined circuits suffer from:
– heat problems
– clock problems
– efficiency (stall) problems
• Doubling issue rates above today's 3-6 instructions per clock, say to 6-12 instructions, is extremely difficult; it would require the processor to:
– issue 3 or 4 data memory accesses per cycle,
– rename and access more than 20 registers per cycle, and
– fetch 12 to 24 instructions per cycle
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
4
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
5
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and sound in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
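The server scenario above can be sketched in code. The following is an illustrative sketch only (the request strings and `handle_client` logic are hypothetical placeholders): each client request is handled by its own thread, and on an SMT or multi-core processor the OS can schedule those threads onto separate hardware threads or cores.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_client(request):
    # Hypothetical per-client work (parse, look up, respond).
    return "response to " + request

def serve(requests, workers=4):
    # One thread per in-flight request, as on a web or database
    # server; each thread is an independent unit of TLP.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_client, requests))
```

Note that this expresses thread-level, not instruction-level, parallelism: the hardware needs no knowledge of one thread to make progress on another.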
6
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism), where each thread can run on its own core
7
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications are difficult to parallelize
8
Core 2 Duo Microarchitecture
9
[Pipeline diagram: BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, BTB, L2 cache and control, bus. Thread 1 (floating point) occupies the pipeline alone.]
Without SMT, only a single thread can run at any given time
10
Without SMT, only a single thread can run at any given time
[Same pipeline diagram: Thread 2 (integer operations) occupies the pipeline alone.]
11
SMT processor: both threads can run concurrently
[Same pipeline diagram: Thread 1 (floating point) and Thread 2 (integer operations) share the pipeline, each using a different functional unit.]
12
But: Can’t simultaneously use the same functional unit
[Same pipeline diagram: Thread 1 and Thread 2 both need the integer unit at the same time. This scenario is impossible with SMT on a single core (assuming a single integer unit).]
13
Multi-core: threads can run on separate cores
[Two copies of the pipeline diagram, one per core: Thread 1 runs on the first core, Thread 2 on the second.]
14
[Two copies of the pipeline diagram, one per core: Thread 3 runs on the first core, Thread 4 on the second.]
Multi-core: threads can run on separate cores
15
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
16
SMT Dual-core: all four threads can run concurrently
[Two copies of the pipeline diagram, one per core: two of the four threads share each SMT core, so all four threads make progress at once.]
17
Multi-Core and Cache Coherence
[Diagram: two cores (core 0 and core 1), each with a private L1 and a private L2 cache, backed by main memory.]
Both L1 and L2 are private
Examples: AMD Opteron, AMD Athlon, Intel Pentium D
[Diagram: the same design, with L3 caches added below the L2s.]
A design with L3 caches
Example: Intel Itanium 2
18
The cache coherence problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
19
The cache coherence problem
Suppose variable x initially contains 15213
[Diagram: four cores on a multi-core chip, each with one or more levels of cache, all sharing main memory; main memory holds x = 15213.]
20
The cache coherence problem
Core 1 reads x
[Diagram: Core 1's cache now holds x = 15213; main memory still holds x = 15213.]
21
The cache coherence problem
Core 2 reads x
[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory holds x = 15213.]
22
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Diagram: Core 1's cache and main memory now hold x = 21660 (assuming write-through caches), but Core 2's cache still holds x = 15213.]
23
The cache coherence problem
Core 2 attempts to read x… and gets a stale copy
[Diagram: Core 2's cache still holds x = 15213, while Core 1's cache and main memory hold x = 21660.]
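The four-step scenario above can be reproduced with a toy simulation (a sketch, not a real coherence model): each core gets a private cache modeled as a dictionary, writes go through to memory, but nothing invalidates the other caches, so Core 2 observes the stale value.

```python
memory = {"x": 15213}            # main memory
caches = [{}, {}, {}, {}]        # one private cache per core (cores 1-4)

def read(core, var):
    cache = caches[core - 1]
    if var not in cache:         # miss: fetch the value from main memory
        cache[var] = memory[var]
    return cache[var]            # hit: may return a stale copy

def write(core, var, value):
    # Write-through, but with no invalidation of other caches;
    # this omission is exactly the coherence bug in the walkthrough.
    caches[core - 1][var] = value
    memory[var] = value

read(1, "x")                     # Core 1 reads x
read(2, "x")                     # Core 2 reads x
write(1, "x", 21660)             # Core 1 writes x = 21660
stale = read(2, "x")             # Core 2 still sees 15213: stale!
```

A real coherence protocol (for example, MESI-style invalidation) would remove or update Core 2's copy when Core 1 writes.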
24
The Memory Wall Problem
25
Memory Wall
[Plot: relative performance (log scale, 1 to 1000) vs. year, 1980-2000. CPU performance ("Moore's Law") improves ~60% per year (2x every 1.5 years); DRAM improves ~9% per year (2x every 10 years). The processor-memory performance gap grows about 50% per year.]
26
Latency in a Single PC
[Plot: for PCs from 1997 to 2009, CPU clock period (ns) and memory system access time (ns) on a log scale (0.1 to 1000 ns), with the memory-to-CPU ratio on a second axis (0 to 500). The ever-growing ratio of memory access time to CPU cycle time is labeled "THE WALL".]
27
Pentium 4 Cache hierarchy
Processor
L1 I (12Ki), L1 D (8 KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Main memory: 206 cycles
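Latencies like those above combine into an average memory access time (AMAT) via the standard recurrence AMAT = t1 + m1·(t2 + m2·(t3 + … + mn·t_mem)). The sketch below uses the latencies from this slide, but the 90%/80%/50% hit rates are made-up assumptions for illustration only:

```python
def amat(hit_times, miss_rates, memory_time):
    # AMAT = t1 + m1*(t2 + m2*(t3 + ... + mn*memory_time)),
    # folded from the innermost (memory) level outward.
    t = memory_time
    for hit_time, miss_rate in zip(reversed(hit_times), reversed(miss_rates)):
        t = hit_time + miss_rate * t
    return t

# Latencies (cycles) from the slide; hit rates are assumed, not measured.
average = amat([2, 19, 43], [0.10, 0.20, 0.50], 206)   # roughly 6.8 cycles
```

Even with a 90% L1 hit rate, the 206-cycle memory latency more than triples the 2-cycle L1 hit time, which is why each extra cache level pays off.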
28
Technology Trends
         Capacity        Speed (latency)
Logic:   2x in 3 years   2x in 3 years
DRAM:    4x in 3 years   2x in 10 years
Disk:    4x in 3 years   2x in 10 years

DRAM Generations
Year   Size      Cycle Time
1980   64 Kb     250 ns
1983   256 Kb    220 ns
1986   1 Mb      190 ns
1989   4 Mb      165 ns
1992   16 Mb     120 ns
1996   64 Mb     110 ns
1998   128 Mb    100 ns
2000   256 Mb    90 ns
2002   512 Mb    80 ns
2006   1024 Mb   60 ns
(Capacity improved 16000:1 over this period; latency only 4:1)
29
Processor-DRAM Performance Gap Impact: Example
• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
• The minimum cost of a full memory access, in wasted CPU cycles (or instructions):

Year   CPU speed   CPU cycle   Memory access   Minimum CPU cycles wasted
       (MHz)       (ns)        (ns)
1986   8           125         190             190/125 - 1 = 0.5
1989   33          30          165             165/30 - 1 = 4.5
1992   60          16.6        120             120/16.6 - 1 = 6.2
1996   200         5           110             110/5 - 1 = 21
1998   300         3.33        100             100/3.33 - 1 = 29
2000   1000        1           90              90/1 - 1 = 89
2003   2000        0.5         80              80/0.5 - 1 = 159
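Every entry in the wasted-cycles column is the same arithmetic, (memory access time / CPU cycle time) - 1, so a small helper (a sketch) reproduces it:

```python
def wasted_cycles(cpu_cycle_ns, mem_access_ns):
    # Minimum CPU cycles lost waiting on a full memory access,
    # for a single-issue CPU with CPI = 1: one cycle does useful
    # work issuing the access; the rest are stalls.
    return mem_access_ns / cpu_cycle_ns - 1

# Reproduce a few rows of the table above.
for year, cycle, access in [(1986, 125, 190), (2000, 1, 90), (2003, 0.5, 80)]:
    print(year, round(wasted_cycles(cycle, access), 1))
```

The trend is the point: as the CPU cycle shrinks faster than memory access time, the stall count grows from under 1 cycle in 1986 to 159 cycles in 2003.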
30
Main Memory
• Main memory generally uses dynamic RAM (DRAM), which stores a bit with a single transistor but requires a periodic data refresh (~every 8 ms).
• Cache uses SRAM (static random access memory):
– No refresh needed (but 6 transistors/bit vs. 1 transistor/bit for DRAM)
• Size ratio (DRAM/SRAM): 4-8; cost and cycle-time ratio (SRAM/DRAM): 8-16
• Main memory performance:
– Memory latency:
• Access time: the time between a memory access request and the moment the requested data is available to the cache/CPU
• Cycle time: the minimum time between requests to memory (greater than the access time in DRAM, to allow the address lines to stabilize)
– Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU
31
Architects Use Transistors to Tolerate Slow Memory
• Cache
– Small, fast memory
– Holds information expected to be used soon
– Mostly successful
• Apply recursively
– Level-one cache(s)
– Level-two cache
• Most of the microprocessor die area is cache!
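Caches are "mostly successful" because programs tend to exhibit spatial and temporal locality. A common illustration (a sketch; the effect is dramatic for C-like row-major arrays, and muted in pure Python, which stores object references) is traversing a matrix in memory order versus striding across it:

```python
def sum_row_major(matrix):
    # Visits elements in memory order: consecutive accesses fall in
    # the same cache line (good spatial locality).
    total = 0
    for row in matrix:
        for value in row:
            total += value
    return total

def sum_col_major(matrix):
    # Strides across rows: on a row-major layout each access touches
    # a different cache line, so the cache helps far less.
    total = 0
    for col in range(len(matrix[0])):
        for row in matrix:
            total += row[col]
    return total
```

Both functions compute the same sum; only the access pattern, and hence the cache behavior on large arrays, differs.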