TRANSCRIPT
6.888 PARALLEL AND HETEROGENEOUS COMPUTER ARCHITECTURE
SPRING 2013
LECTURE 2
ILP, DLP AND TLP IN MODERN MULTICORES
DANIEL SANCHEZ AND JOEL EMER
Review: ILP Challenges
Clock frequency: getting close to pipelining limits
Clocking overheads, CPI degradation
Branch prediction & memory latency limit the practical benefits of out-of-order execution
Power grows superlinearly with higher clock & more OOO logic
Design complexity grows exponentially with issue width
Limited ILP → must exploit TLP and DLP
Thread-Level Parallelism: Multithreading and multicore
Data-Level Parallelism: SIMD
Review: Memory Hierarchy
Caching: Reduce latency, energy, BW of memory accesses
Why multilevel?
Why not just on-chip memories?
How does parallelism impact latency/BW constraints?
Prefetching: Trade off latency against bandwidth, energy, and capacity (pollution); see the sketch after the hierarchy table below
Example 4-core memory hierarchy (size, latency, bandwidth per level):

  Level                  Size           Latency         Bandwidth
  L1I / L1D (per core)   64KB + 64KB    4 cycles        72GB/s
  L2 (per core)          256KB          8 cycles        48GB/s
  L3 (shared)            6MB            26-31 cycles    96GB/s (24GB/s/core)
  Main memory            16GB           150 cycles      21GB/s
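A minimal sketch of that latency-vs-bandwidth/capacity trade from the software side, assuming GCC/Clang's __builtin_prefetch; the function name, loop, and prefetch distance are made-up examples (a hardware prefetcher would already cover this simple stride; software prefetch matters more for irregular accesses):

    // Sum an array while prefetching ahead: each prefetch spends bandwidth and
    // cache capacity now to hide memory latency later, and can pollute the
    // cache if the prefetched lines are evicted before they are used.
    double sum_with_prefetch(const double* a, long n) {
        const long dist = 8;  // prefetch distance in iterations (a tuning knob)
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            if (i + dist < n)
                __builtin_prefetch(&a[i + dist], /*rw=*/0, /*locality=*/3);
            s += a[i];
        }
        return s;
    }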
Flynn’s Taxonomy

                   Single instruction    Multiple instruction
  Single data      SISD                  MISD (?)
  Multiple data    SIMD                  MIMD
SIMD Processing
Same instruction sequence applies to multiple elements
Vector processing → amortize instruction costs (fetch, decode, …) across multiple operations
Requires regular data parallelism (no or minimal divergence)
Exploiting SIMD:
Explicit & low-level, using vector intrinsics (see the sketch below)
Explicit & high-level, convey parallel semantics (e.g., foreach)
Implicitly: Parallelizing compiler infers loop dependencies
How easy is this in C++? Java?
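To make the explicit, low-level route concrete, here is a minimal sketch; the function names are ours, and it assumes an x86 compiler with SSE (immintrin.h), n a multiple of 4, and 16-byte-aligned pointers:

    #include <immintrin.h>  // x86 SSE/AVX intrinsics

    // Scalar loop: one floating-point add per instruction.
    void add_scalar(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    // Explicit SIMD with SSE: four adds per instruction.
    // Assumes n is a multiple of 4 and pointers are 16-byte aligned.
    void add_sse(const float* a, const float* b, float* c, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);            // load 4 floats
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(c + i, _mm_add_ps(va, vb));   // 4 adds in 1 instruction
        }
    }

The scalar loop is also what an auto-vectorizing compiler must prove safe: in C++ the pointers may alias, so it needs runtime checks or restrict hints, while Java (at the time) exposes no such intrinsics and relies entirely on the JIT.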
SIMD Implementations
Modern CPUs: SIMD extensions & wider regs
SSE: 128-bit operands (4x32-bit or 2x64-bit)
AVX (2011): 256-bit operands (8x32-bit or 4x64-bit)
LRB (Larrabee/MIC, upcoming): 512-bit operands
Explicit SIMD: Parallelization performed at compile time
GPUs: Architected for SIMD from the ground up
32 to 64 32-bit floats per operation (warp/wavefront)
Implicit SIMD: Scalar binary, multiple instances always run in lockstep
How to handle divergence? (see the masking sketch below)
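One common answer, sketched here with a mask-and-blend (predication) approach on AVX rather than any particular GPU's mechanism; the function name, the n-multiple-of-8 assumption, and the example computation are ours:

    #include <immintrin.h>

    // Divergent scalar code: c[i] = (a[i] > 0.0f) ? a[i] * 2.0f : b[i];
    // SIMD version: execute both sides for all 8 lanes, then select per lane.
    void select_avx(const float* a, const float* b, float* c, int n) {
        const __m256 zero = _mm256_setzero_ps();
        const __m256 two  = _mm256_set1_ps(2.0f);
        for (int i = 0; i < n; i += 8) {               // assumes n % 8 == 0
            __m256 va    = _mm256_loadu_ps(a + i);
            __m256 vb    = _mm256_loadu_ps(b + i);
            __m256 mask  = _mm256_cmp_ps(va, zero, _CMP_GT_OQ);  // per-lane a[i] > 0
            __m256 taken = _mm256_mul_ps(va, two);               // "then" path, all lanes
            // Keep 'taken' where the mask is set, vb (the "else" path) elsewhere.
            _mm256_storeu_ps(c + i, _mm256_blendv_ps(vb, taken, mask));
        }
    }

GPUs do the hardware analogue: when lanes of a warp diverge, both paths execute serially with inactive lanes masked off, so divergence costs throughput rather than correctness.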
Multithreading: Options
Motivation: Hardware is underutilized on stalls → use TLP to increase utilization
CGMT and SMT typically increase throughput at moderate cost while maintaining single-thread performance
FGMT typically gains throughput and simplicity at the expense of single-thread performance
Example 1: SMT (Nehalem)
SMT design choices: For each component,
Replicate, partition statically, or share
Tradeoffs? Complexity, utilization, interference & fairness
Example: Intel Nehalem
4-wide superscalar, 2-way SMT
Replicated: Register file, RAS predictor, large-page ITLB
Partitioned: Load buffer, store buffer, ROB, small-page ITLB
Shared: Instruction window, execution units, predictors, caches, DTLBs
SMT policies:
Fetch policies: Utilization vs fairness
Long-latency stall tolerance: Flushing vs stalling
[See: “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, Tullsen et al., ISCA 1996]
Example 2: FGMT (Niagara)
4 threads/core, round-robin scheduling
No branch prediction, minimal bypasses → more stalls
Small L1 caches (can tolerate higher L1 miss rates)
But L2 is still large… performance with long-latency stalls?
Example 3: Extreme FGMT (Tera MTA)
Use FGMT to hide all instruction latencies
Worst-case instruction latency is 128 cycles → 128 threads
Benefits: no interlocks, no bypass, and no cache
Problem: single-thread performance
GPUs also exploit high FGMT for latency tolerance (e.g., Fermi, 48-way MT)
Throughput-oriented functional units: Longer latency, deeply pipelined
Throughput-oriented memory system: Small caches, aggressive memory scheduler
[Figure: fine-grained multithreading pipeline diagram; instructions from threads T1-T8 issue round-robin through pipeline stages A-H.]
Multithreading: How Many Threads?
With more HW threads:
Larger/multiple register files
Replicated & partitioned resources → lower utilization, lower single-thread performance
Shared resources → utilization vs. interference and thrashing
Impact of MT/MC on memory hierarchy?
[“Many-Core vs. Many-Thread Machines: Stay Away From the Valley”, Guz et al., CAL 2009]
Amdahl’s Law
Amdahl’s Law: If a change improves a fraction f of the workload by a factor K, the total speedup is:

  Speedup = Time_before / Time_after = 1 / (f/K + (1 - f))

Not only valid for performance!
Energy, complexity, …
I/D/TLP techniques make different tradeoffs between K and f
SIMD vs. MIMD: how do their f and K compare?
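A minimal numeric sketch of the formula above; the helper name and the f/K values are made-up examples, just to show the K-versus-f tension:

    #include <cstdio>

    // Amdahl's Law: improve a fraction f of the workload by a factor K.
    double amdahl(double f, double K) {
        return 1.0 / (f / K + (1.0 - f));
    }

    int main() {
        // SIMD-flavored change: large K, but only on the vectorizable fraction f.
        std::printf("f=0.60, K=8 -> speedup %.2f\n", amdahl(0.60, 8.0));  // ~2.11
        // MIMD-flavored change: smaller K per change, but applied to a larger f.
        std::printf("f=0.95, K=4 -> speedup %.2f\n", amdahl(0.95, 4.0));  // ~3.48
        return 0;
    }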
Amdahl’s Law in the Multicore Era
[Hill & Marty, CACM 08]
Should we focus on a single approach to extract parallelism?
At what point should we trade ILP for TLP?
Assume a resource-limited multi-core:
N base core equivalents (BCEs) due to area or power constraints
A 1-BCE core yields performance of 1
An R-BCE core yields performance of perf(R)
Assume perf(R) = sqrt(R) in the following plots (Pollack’s rule)
How should we design the multi-core?
Select type & number of cores
Assume caches & interconnect are rather constant
Assume no application scaling (or equal scaling for seq/par portions)
Three Multicore Approaches

  Approach         Large cores (R BCEs each)                      Simple cores (1 BCE each)
  Symmetric CMP    N/R cores; Seq: Perf(R), Par: (N/R)*Perf(R)    -
  Asymmetric CMP   1 core;    Seq: Perf(R), Par: Perf(R)          N-R cores; Seq: -, Par: N-R
  Dynamic CMP      1 core;    Seq: Perf(R), Par: -                N cores;   Seq: -, Par: N
Example (N = 16 BCEs):
Symmetric: 16 1-BCE cores, or 4 4-BCE cores
Asymmetric: 1 4-BCE core & 12 1-BCE cores
Dynamic: adapt between 16 1-BCE cores and 1 16-BCE core
Amdahl’s Law x3
Symmetric CMP
Asymmetric CMP
Dynamic CMP
Symmetric Speedup  = 1 / ( (1 - F)/Perf(R) + F*R / (Perf(R)*N) )

Asymmetric Speedup = 1 / ( (1 - F)/Perf(R) + F / (Perf(R) + N - R) )

Dynamic Speedup    = 1 / ( (1 - F)/Perf(R) + F / N )
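A small sketch that ties the three formulas together under the Perf(R) = sqrt(R) assumption from the model slide; the function names are ours, and the F = 0.99, N = 256 point is chosen to match the annotated results on the plots that follow:

    #include <cmath>
    #include <cstdio>

    double perf(double r) { return std::sqrt(r); }  // Pollack's rule assumption

    double symmetric_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f * r / (perf(r) * n));
    }
    double asymmetric_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / (perf(r) + n - r));
    }
    double dynamic_speedup(double f, double n, double r) {
        return 1.0 / ((1.0 - f) / perf(r) + f / n);
    }

    int main() {
        // F = 0.99, N = 256 BCEs (matches the annotated points on the next slides).
        std::printf("symmetric  R=3:   %.0f\n", symmetric_speedup(0.99, 256, 3));    // ~80
        std::printf("asymmetric R=41:  %.0f\n", asymmetric_speedup(0.99, 256, 41));  // ~166
        std::printf("dynamic    R=256: %.0f\n", dynamic_speedup(0.99, 256, 256));    // ~223
        return 0;
    }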
Symmetric Multicore Chip
N = 256 BCEs
[Plot: Symmetric Speedup vs. R (BCEs/core), curves for F = 0.5, 0.9, 0.975, 0.99, 0.999; results in parentheses are for N = 16.]
F=0.9:   R=28 (vs. 2),  Cores=9 (vs. 8),    Speedup=26.7 (vs. 6.7) → CORE ENHANCEMENTS!
F=0.999: R=1 (vs. 1),   Cores=256 (vs. 16), Speedup=204 (vs. 16) → MORE CORES!
F=0.99:  R=3 (vs. 1),   Cores=85 (vs. 16),  Speedup=80 (vs. 13.9) → CORE ENHANCEMENTS & MORE CORES!
Higher N → higher optimal R (more ILP) for a fixed F
Asymmetric Multicore Chip
N = 256 BCEs
[Plot: Asymmetric Speedup vs. R (BCEs in the single large core), curves for F = 0.5, 0.9, 0.975, 0.99, 0.999.]
F=0.9:  R=118 (vs. 28), Cores=139 (vs. 9),  Speedup=65.6 (vs. 26.7)
F=0.99: R=41 (vs. 3),   Cores=216 (vs. 85), Speedup=166 (vs. 80)
Results in parentheses: symmetric multicore (previous slide), for comparison
Better speedups than symmetric
Software complexity?
Dynamic Multicore Chip
N = 256 BCEs
Results in parentheses: asymmetric multicore (previous slide), for comparison
Dynamic offers even higher speedups than asymmetric
SW and HW complexity?
[Plot: Dynamic Speedup vs. R (BCEs of the sequential core), curves for F = 0.5, 0.9, 0.975, 0.99, 0.999.]
F=0.99: R=256 (vs. 41), Cores=256 (vs. 216), Speedup=223 (vs. 166)
Note: #Cores is always N = 256
Readings for Wednesday
1. Is Dark Silicon Useful?
2. Dark Silicon and the End of Multicore Scaling
3. Single-Chip Heterogeneous Computing