TRANSCRIPT
6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE
SPRING 2013
LECTURE 2
ILP, DLP AND TLP IN MODERN MULTICORES
DANIEL SANCHEZ AND JOEL EMER
Review: ILP Challenges
Clock frequency: getting close to pipelining limits
Clocking overheads, CPI degradation
Branch prediction & memory latency limit the practical benefits of out-of-order execution
Power grows superlinearly with higher clock & more OOO logic
Design complexity grows exponentially with issue width
Limited ILP → must exploit TLP and DLP
Thread-Level Parallelism: Multithreading and multicore
Data-Level Parallelism: SIMD
Review: Memory Hierarchy
Caching: Reduce latency, energy, BW of memory accesses
Why multilevel?
Why not just on-chip memories?
How does parallelism impact latency/BW constraints?
Prefetching: Trade off latency against bandwidth, energy, and capacity (pollution); see the sketch after the hierarchy table below
Example 4-core memory hierarchy (size, latency, bandwidth per level):

  Level                  Size           Latency         Bandwidth
  L1I / L1D (per core)   64KB + 64KB    4 cycles        72GB/s
  L2 (per core)          256KB          8 cycles        48GB/s
  L3 (shared)            6MB            26-31 cycles    96GB/s (24GB/s/core)
  Main memory            16GB           150 cycles      21GB/s
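A minimal sketch of that latency-vs-bandwidth/capacity trade from the software side, assuming GCC/Clang's __builtin_prefetch; the function name, loop, and prefetch distance are made-up examples (a hardware prefetcher would already cover this simple stride; software prefetch matters more for irregular accesses):

    // Sum an array while prefetching ahead: each prefetch spends bandwidth and
    // cache capacity now to hide memory latency later, and can pollute the
    // cache if the prefetched lines are evicted before they are used.
    double sum_with_prefetch(const double* a, long n) {
        const long dist = 8;  // prefetch distance in iterations (a tuning knob)
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }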
Flynn’s Taxonomy

                   Single instruction    Multiple instruction
  Single data      SISD                  MISD (?)
  Multiple data    SIMD                  MIMD
SIMD Processing
Same instruction sequence applies to multiple elements
Vector processing → amortize instruction costs (fetch, decode, …) across multiple operations
Requires regular data parallelism (no or minimal divergence)
Exploiting SIMD:
Explicit & low-level, using vector intrinsics (see the sketch below)
Explicit & high-level, convey parallel semantics (e.g., foreach)
Implicitly: Parallelizing compiler infers loop dependencies
How easy is this in C++? Java?
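To make the explicit, low-level route concrete, here is a minimal sketch; the function names are ours, and it assumes an x86 compiler with SSE (immintrin.h), n a multiple of 4, and 16-byte-aligned pointers:

    #include <immintrin.h>  // x86 SSE/AVX intrinsics

    // Scalar loop: one floating-point add per instruction.
    void add_scalar(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Explicit SIMD with SSE: four adds per instruction.
    // Assumes n is a multiple of 4 and pointers are 16-byte aligned.
    void add_sse(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);            // load 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));   // 4 adds in 1 instruction
        }
    }

The scalar loop is also what an auto-vectorizing compiler must prove safe: in C++ the pointers may alias, so it needs runtime checks or restrict hints, while Java (at the time) exposes no such intrinsics and relies entirely on the JIT.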
SIMD Implementations
Modern CPUs: SIMD extensions & wider regs
SSE: 128-bit operands (4x32-bit or 2x64-bit)
AVX (2011): 256-bit operands (8x32-bit or 4x64-bit)
LRB (Larrabee/MIC, upcoming): 512-bit operands
Explicit SIMD: Parallelization performed at compile time
GPUs: Architected for SIMD from the ground up
32 to 64 32-bit floats per operation (warp/wavefront)
Implicit SIMD: Scalar binary, multiple instances always run in lockstep
How to handle divergence? (see the masking sketch below)
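One common answer, sketched here with a mask-and-blend (predication) approach on AVX rather than any particular GPU's mechanism; the function name, the n-multiple-of-8 assumption, and the example computation are ours:

    #include <immintrin.h>

    // Divergent scalar code: c[i] = (a[i] > 0.0f) ? a[i] * 2.0f : b[i];
    // SIMD version: execute both sides for all 8 lanes, then select per lane.
    void select_avx(const float* a, const float* b, float* c, int n) {
        const __m256 zero = _mm256_setzero_ps();
        const __m256 two  = _mm256_set1_ps(2.0f);
        for (int i = 0; i < n; i += 8) {               // assumes n % 8 == 0
            __m256 va    = _mm256_loadu_ps(a + i);
            __m256 vb    = _mm256_loadu_ps(b + i);
            __m256 mask  = _mm256_cmp_ps(va, zero, _CMP_GT_OQ);  // per-lane a[i] > 0
            __m256 taken = _mm256_mul_ps(va, two);               // "then" path, all lanes
            // Keep 'taken' where the mask is set, vb (the "else" path) elsewhere.
            _mm256_storeu_ps(c + i, _mm256_blendv_ps(vb, taken, mask));
        }
    }

GPUs do the hardware analogue: when lanes of a warp diverge, both paths execute serially with inactive lanes masked off, so divergence costs throughput rather than correctness.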
Multithreading: Options
Motivation: Hardware is underutilized on stalls → use TLP to increase utilization
CGMT and SMT typically increase throughput at moderate cost while maintaining single-thread performance
FGMT typically gains throughput and simplicity at the expense of single-thread performance
Example 1: SMT (Nehalem)
SMT design choices: For each component,
Replicate, partition statically, or share
Tradeoffs? Complexity, utilization, interference & fairness
Example: Intel Nehalem
4-wide superscalar, 2-way SMT
Replicated: Register file, RAS predictor, large-page ITLB
Partitioned: Load buffer, store buffer, ROB, small-page ITLB
Shared: Instruction window, execution units, predictors, caches, DTLBs
SMT policies:
Fetch policies: Utilization vs fairness
Long-latency stall tolerance: Flushing vs stalling
[See: “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, Tullsen et al., ISCA 1996]
Example 2: FGMT (Niagara)
4 threads/core, round-robin scheduling
No branch prediction, minimal bypasses → more stalls
Small L1 caches (can tolerate higher L1 miss rates)
But L2 is still large… performance with long-latency stalls?
Example 3: Extreme FGMT (Tera MTA)
Use FGMT to hide all instruction latencies
Worst-case instruction latency is 128 cycles → 128 threads
Benefits: no interlocks, no bypass, and no cache
Problem: single-thread performance
GPUs also exploit high FGMT for latency tolerance (e.g., Fermi, 48-way MT)
Throughput-oriented functional units: Longer latency, deeply pipelined
Throughput-oriented memory system: Small caches, aggressive memory scheduler
[Figure: fine-grained multithreading pipeline diagram; instructions from threads T1-T8 issue round-robin through pipeline stages A-H.]
Multithreading: How Many Threads?
With more HW threads:
Larger/multiple register files
Replicated & partitioned resources → lower utilization, lower single-thread performance
Shared resources → utilization vs. interference and thrashing
Impact of MT/MC on memory hierarchy?
[“Many-Core vs. Many-Thread Machines: Stay Away From the Valley”, Guz et al., CAL 2009]
Amdahl’s Law
Amdahl’s Law: If a change improves a fraction f of the workload by a factor K, the total speedup is:

  Speedup = Time_before / Time_after = 1 / (f/K + (1 - f))

Not only valid for performance!
Energy, complexity, …
I/D/TLP techniques make different tradeoffs between K and f
SIMD vs. MIMD: how do their f and K compare?
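A minimal numeric sketch of the formula above; the helper name and the f/K values are made-up examples, just to show the K-versus-f tension:

    #include <cstdio>

    // Amdahl's Law: improve a fraction f of the workload by a factor K.
    double amdahl(double f, double K) {
        return 1.0 / (f / K + (1.0 - f));
    }

    int main() {
        // SIMD-flavored change: large K, but only on the vectorizable fraction f.
        std::printf("f=0.60, K=8 -> speedup %.2f\n", amdahl(0.60, 8.0));  // ~2.11
        // MIMD-flavored change: smaller K per change, but applied to a larger f.
        std::printf("f=0.95, K=4 -> speedup %.2f\n", amdahl(0.95, 4.0));  // ~3.48
        return 0;
    }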
Amdahl’s Law in the Multicore Era
[Hill & Marty, CACM 08]
Should we focus on a single approach to extract parallelism?
At what point should we trade ILP for TLP?
Assume a resource-limited multi-core:
N base core equivalents (BCEs) due to area or power constraints
A 1-BCE core yields performance of 1
An R-BCE core yields performance of perf(R)
Assume perf(R) = sqrt(R) in the following plots (Pollack’s rule)
How should we design the multi-core?
Select type & number of cores
Assume caches & interconnect are rather constant
Assume no application scaling (or equal scaling for seq/par portions)
Three Multicore Approaches

  Approach         Large cores (R BCEs each)                      Simple cores (1 BCE each)
  Symmetric CMP    N/R cores; Seq: Perf(R), Par: (N/R)*Perf(R)    -
  Asymmetric CMP   1 core;    Seq: Perf(R), Par: Perf(R)          N-R cores; Seq: -, Par: N-R
  Dynamic CMP      1 core;    Seq: Perf(R), Par: -                N cores;   Seq: -, Par: N
Example (N = 16 BCEs):
Symmetric: 16 1-BCE cores, or 4 4-BCE cores
Asymmetric: 1 4-BCE core & 12 1-BCE cores
Dynamic: adapt between 16 1-BCE cores and 1 16-BCE core
Amdahl’s Law x3
Symmetric CMP
Asymmetric CMP
Dynamic CMP
Symmetric Speedup  = 1 / ( (1 - F)/Perf(R) + F*R / (Perf(R)*N) )

Asymmetric Speedup = 1 / ( (1 - F)/Perf(R) + F / (Perf(R) + N - R) )

Dynamic Speedup    = 1 / ( (1 - F)/Perf(R) + F / N )
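A small sketch that ties the three formulas together under the Perf(R) = sqrt(R) assumption from the model slide; the function names are ours, and the F = 0.99, N = 256 point is chosen to match the annotated results on the plots that follow:

    #include <cmath>
    #include <cstdio>

    double perf(double r) { return std::sqrt(r); }  // Pollack's rule assumption

    double symmetric_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }
    double asymmetric_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }
    double dynamic_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / n);
    }

    int main() {
        // F = 0.99, N = 256 BCEs (matches the annotated points on the next slides).
        std::printf("symmetric  R=3:   %.0f\n", symmetric_speedup(0.99, 256, 3));    // ~80
        std::printf("asymmetric R=41:  %.0f\n", asymmetric_speedup(0.99, 256, 41));  // ~166
        std::printf("dynamic    R=256: %.0f\n", dynamic_speedup(0.99, 256, 256));    // ~223
        return 0;
    }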
Symmetric Multicore Chip
N = 256 BCEs
[Plot: Symmetric Speedup vs. R (BCEs/core), curves for F = 0.5, 0.9, 0.975, 0.99, 0.999; results in parentheses are for N = 16.]
F=0.9:   R=28 (vs. 2),  Cores=9 (vs. 8),    Speedup=26.7 (vs. 6.7) → CORE ENHANCEMENTS!
F=0.999: R=1 (vs. 1),   Cores=256 (vs. 16), Speedup=204 (vs. 16) → MORE CORES!
F=0.99:  R=3 (vs. 1),   Cores=85 (vs. 16),  Speedup=80 (vs. 13.9) → CORE ENHANCEMENTS & MORE CORES!
Higher N → higher optimal R (more ILP) for a fixed F
Asymmetric Multicore Chip
N = 256 BCEs
[Plot: Asymmetric Speedup vs. R (BCEs in the single large core), curves for F = 0.5, 0.9, 0.975, 0.99, 0.999.]
F=0.9:  R=118 (vs. 28), Cores=139 (vs. 9),  Speedup=65.6 (vs. 26.7)
F=0.99: R=41 (vs. 3),   Cores=216 (vs. 85), Speedup=166 (vs. 80)
Results in parentheses: symmetric multicore (previous slide), for comparison
Better speedups than symmetric
Software complexity?
Dynamic Multicore Chip
N = 256 BCEs
Results in parentheses: asymmetric multicore (previous slide), for comparison
Dynamic offers even higher speedups than asymmetric
SW and HW complexity?
[Plot: Dynamic Speedup vs. R (BCEs of the sequential core), curves for F = 0.5, 0.9, 0.975, 0.99, 0.999.]
F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)
Note: #Cores is always N = 256
Readings for Wednesday
1. Is Dark Silicon Useful?
2. Dark Silicon and the End of Multicore Scaling
3. Single-Chip Heterogeneous Computing