Post on 15-Jan-2016
1
Pipelining for Multi-Core Architectures
2
Multi-Core Technology
[Timeline diagram: 2004, single core (one core + cache); 2005, dual core (2 or more cores, each with cache); 2007, multi-core (4 or more cores, with 2x more cores over time, each with cache).]
3
Why multi-core?
• It is difficult to push single-core clock frequencies even higher
• Deeply pipelined circuits suffer from:
– heat problems
– clock problems
– efficiency (stall) problems
• Doubling issue rates above today's 3-6 instructions per clock, say to 6-12 instructions, is extremely difficult; it would require the processor to:
– issue 3 or 4 data memory accesses per cycle,
– rename and access more than 20 registers per cycle, and
– fetch 12 to 24 instructions per cycle
• Many new applications are multithreaded
• General trend in computer architecture (shift towards more parallelism)
4
Instruction-level parallelism
• Parallelism at the machine-instruction level
• The processor can re-order and pipeline instructions, split them into microinstructions, do aggressive branch prediction, etc.
• Instruction-level parallelism enabled rapid increases in processor speeds over the last 15 years
5
Thread-level parallelism (TLP)
• This is parallelism on a coarser scale
• Server can serve each client in a separate thread (Web server, database server)
• A computer game can do AI, graphics, and sound in three separate threads
• Single-core superscalar processors cannot fully exploit TLP
• Multi-core architectures are the next step in processor evolution: explicitly exploiting TLP
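The server scenario above can be sketched in code. The following is an illustrative sketch only (the request strings and `handle_client` logic are hypothetical placeholders): each client request is handled by its own thread, and on an SMT or multi-core processor the OS can schedule those threads onto separate hardware threads or cores.

```python
from concurrent.futures import ThreadPoolExecutor

def handle_client(request):
    # Hypothetical per-client work (parse, look up, respond).
    return "response to " + request

def serve(requests, workers=4):
    # One thread per in-flight request, as on a web or database
    # server; each thread is an independent unit of TLP.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(handle_client, requests))
```

Note that this expresses thread-level, not instruction-level, parallelism: the hardware needs no knowledge of one thread to make progress on another.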
6
What applications benefit from multi-core?
• Database servers
• Web servers (Web commerce)
• Multimedia applications
• Scientific applications, CAD/CAM
• In general, applications with thread-level parallelism (as opposed to instruction-level parallelism), where each thread can run on its own core
7
More examples
• Editing a photo while recording a TV show through a digital video recorder
• Downloading software while running an anti-virus program
• “Anything that can be threaded today will map efficiently to multi-core”
• BUT: some applications are difficult to parallelize
8
Core 2 Duo Microarchitecture
9
[Pipeline diagram: BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, BTB, L2 cache and control, bus. Thread 1 (floating point) occupies the pipeline alone.]
Without SMT, only a single thread can run at any given time
10
Without SMT, only a single thread can run at any given time
[Same pipeline diagram: Thread 2 (integer operations) occupies the pipeline alone.]
11
SMT processor: both threads can run concurrently
[Same pipeline diagram: Thread 1 (floating point) and Thread 2 (integer operations) share the pipeline, each using a different functional unit.]
12
But: Can’t simultaneously use the same functional unit
[Same pipeline diagram: Thread 1 and Thread 2 both need the integer unit at the same time. This scenario is impossible with SMT on a single core (assuming a single integer unit).]
13
Multi-core: threads can run on separate cores
[Two copies of the pipeline diagram, one per core: Thread 1 runs on the first core, Thread 2 on the second.]
14
[Two copies of the pipeline diagram, one per core: Thread 3 runs on the first core, Thread 4 on the second.]
Multi-core: threads can run on separate cores
15
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
– Single-core, non-SMT: standard uniprocessor
– Single-core, with SMT
– Multi-core, non-SMT
– Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
16
SMT Dual-core: all four threads can run concurrently
[Two copies of the pipeline diagram, one per core: two of the four threads share each SMT core, so all four threads make progress at once.]
17
Multi-Core and Cache Coherence
[Diagram: two cores (core 0 and core 1), each with a private L1 and a private L2 cache, backed by main memory.]
Both L1 and L2 are private
Examples: AMD Opteron, AMD Athlon, Intel Pentium D
[Diagram: the same design, with L3 caches added below the L2s.]
A design with L3 caches
Example: Intel Itanium 2
18
The cache coherence problem
• Since we have private caches: how do we keep the data consistent across caches?
• Each core should perceive the memory as a monolithic array, shared by all the cores
19
The cache coherence problem
Suppose variable x initially contains 15213
[Diagram: four cores on a multi-core chip, each with one or more levels of cache, all sharing main memory; main memory holds x = 15213.]
20
The cache coherence problem
Core 1 reads x
[Diagram: Core 1's cache now holds x = 15213; main memory still holds x = 15213.]
21
The cache coherence problem
Core 2 reads x
[Diagram: Core 1's and Core 2's caches each hold x = 15213; main memory holds x = 15213.]
22
The cache coherence problem
Core 1 writes to x, setting it to 21660
[Diagram: Core 1's cache and main memory now hold x = 21660 (assuming write-through caches), but Core 2's cache still holds x = 15213.]
23
The cache coherence problem
Core 2 attempts to read x… and gets a stale copy
[Diagram: Core 2's cache still holds x = 15213, while Core 1's cache and main memory hold x = 21660.]
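The four-step scenario above can be reproduced with a toy simulation (a sketch, not a real coherence model): each core gets a private cache modeled as a dictionary, writes go through to memory, but nothing invalidates the other caches, so Core 2 observes the stale value.

```python
memory = {"x": 15213}            # main memory
caches = [{}, {}, {}, {}]        # one private cache per core (cores 1-4)

def read(core, var):
    cache = caches[core - 1]
    if var not in cache:         # miss: fetch the value from main memory
        cache[var] = memory[var]
    return cache[var]            # hit: may return a stale copy

def write(core, var, value):
    # Write-through, but with no invalidation of other caches;
    # this omission is exactly the coherence bug in the walkthrough.
    caches[core - 1][var] = value
    memory[var] = value

read(1, "x")                     # Core 1 reads x
read(2, "x")                     # Core 2 reads x
write(1, "x", 21660)             # Core 1 writes x = 21660
stale = read(2, "x")             # Core 2 still sees 15213: stale!
```

A real coherence protocol (for example, MESI-style invalidation) would remove or update Core 2's copy when Core 1 writes.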
24
The Memory Wall Problem
25
Memory Wall
[Plot: relative performance (log scale, 1 to 1000) vs. year, 1980-2000. CPU performance ("Moore's Law") improves ~60% per year (2x every 1.5 years); DRAM improves ~9% per year (2x every 10 years). The processor-memory performance gap grows about 50% per year.]
26
Latency in a Single PC
[Plot: for PCs from 1997 to 2009, CPU clock period (ns) and memory system access time (ns) on a log scale (0.1 to 1000 ns), with the memory-to-CPU ratio on a second axis (0 to 500). The ever-growing ratio of memory access time to CPU cycle time is labeled "THE WALL".]
27
Pentium 4 Cache hierarchy
Processor
L1 I (12Ki), L1 D (8 KiB): 2 cycles
L2 cache (512 KiB): 19 cycles
L3 cache (2 MiB): 43 cycles
Main memory: 206 cycles
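Latencies like those above combine into an average memory access time (AMAT) via the standard recurrence AMAT = t1 + m1·(t2 + m2·(t3 + … + mn·t_mem)). The sketch below uses the latencies from this slide, but the 90%/80%/50% hit rates are made-up assumptions for illustration only:

```python
def amat(hit_times, miss_rates, memory_time):
    # AMAT = t1 + m1*(t2 + m2*(t3 + ... + mn*memory_time)),
    # folded from the innermost (memory) level outward.
    t = memory_time
    for hit_time, miss_rate in zip(reversed(hit_times), reversed(miss_rates)):
        t = hit_time + miss_rate * t
    return t

# Latencies (cycles) from the slide; hit rates are assumed, not measured.
average = amat([2, 19, 43], [0.10, 0.20, 0.50], 206)   # roughly 6.8 cycles
```

Even with a 90% L1 hit rate, the 206-cycle memory latency more than triples the 2-cycle L1 hit time, which is why each extra cache level pays off.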
28
Technology Trends
         Capacity        Speed (latency)
Logic:   2x in 3 years   2x in 3 years
DRAM:    4x in 3 years   2x in 10 years
Disk:    4x in 3 years   2x in 10 years

DRAM Generations
Year   Size      Cycle Time
1980   64 Kb     250 ns
1983   256 Kb    220 ns
1986   1 Mb      190 ns
1989   4 Mb      165 ns
1992   16 Mb     120 ns
1996   64 Mb     110 ns
1998   128 Mb    100 ns
2000   256 Mb    90 ns
2002   512 Mb    80 ns
2006   1024 Mb   60 ns
(Capacity improved 16000:1 over this period; latency only 4:1)
29
Processor-DRAM Performance Gap Impact: Example
• To illustrate the performance impact, assume a single-issue pipelined CPU with CPI = 1 using non-ideal memory.
• The minimum cost of a full memory access, in wasted CPU cycles (or instructions):

Year   CPU speed   CPU cycle   Memory access   Minimum CPU cycles wasted
       (MHz)       (ns)        (ns)
1986   8           125         190             190/125 - 1 = 0.5
1989   33          30          165             165/30 - 1 = 4.5
1992   60          16.6        120             120/16.6 - 1 = 6.2
1996   200         5           110             110/5 - 1 = 21
1998   300         3.33        100             100/3.33 - 1 = 29
2000   1000        1           90              90/1 - 1 = 89
2003   2000        0.5         80              80/0.5 - 1 = 159
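Every entry in the wasted-cycles column is the same arithmetic, (memory access time / CPU cycle time) - 1, so a small helper (a sketch) reproduces it:

```python
def wasted_cycles(cpu_cycle_ns, mem_access_ns):
    # Minimum CPU cycles lost waiting on a full memory access,
    # for a single-issue CPU with CPI = 1: one cycle does useful
    # work issuing the access; the rest are stalls.
    return mem_access_ns / cpu_cycle_ns - 1

# Reproduce a few rows of the table above.
for year, cycle, access in [(1986, 125, 190), (2000, 1, 90), (2003, 0.5, 80)]:
    print(year, round(wasted_cycles(cycle, access), 1))
```

The trend is the point: as the CPU cycle shrinks faster than memory access time, the stall count grows from under 1 cycle in 1986 to 159 cycles in 2003.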
30
Main Memory
• Main memory generally uses dynamic RAM (DRAM), which stores a bit with a single transistor but requires a periodic data refresh (~every 8 ms).
• Cache uses SRAM (static random access memory):
– No refresh needed (but 6 transistors/bit vs. 1 transistor/bit for DRAM)
• Size ratio (DRAM/SRAM): 4-8; cost and cycle-time ratio (SRAM/DRAM): 8-16
• Main memory performance:
– Memory latency:
• Access time: the time between a memory access request and the moment the requested data is available to the cache/CPU
• Cycle time: the minimum time between requests to memory (greater than the access time in DRAM, to allow the address lines to stabilize)
– Memory bandwidth: the maximum sustained data transfer rate between main memory and the cache/CPU
31
Architects Use Transistors to Tolerate Slow Memory
• Cache
– Small, fast memory
– Holds information expected to be used soon
– Mostly successful
• Apply recursively
– Level-one cache(s)
– Level-two cache
• Most of the microprocessor die area is cache!
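Caches are "mostly successful" because programs tend to exhibit spatial and temporal locality. A common illustration (a sketch; the effect is dramatic for C-like row-major arrays, and muted in pure Python, which stores object references) is traversing a matrix in memory order versus striding across it:

```python
def sum_row_major(matrix):
    # Visits elements in memory order: consecutive accesses fall in
    # the same cache line (good spatial locality).
    total = 0
    for row in matrix:
        for value in row:
            total += value
    return total

def sum_col_major(matrix):
    # Strides across rows: on a row-major layout each access touches
    # a different cache line, so the cache helps far less.
    total = 0
    for col in range(len(matrix[0])):
        for row in matrix:
            total += row[col]
    return total
```

Both functions compute the same sum; only the access pattern, and hence the cache behavior on large arrays, differs.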