Multi-threading, Hyperthreading & Chip Multiprocessing (CMP): Beyond ILP with Thread-Level Parallelism (TLP) and Multithreaded Microarchitectures

Upload: victoria-kelley

Post on 29-Jan-2016


Page 1:

Multi-threading, Hyperthreading & Chip Multiprocessing (CMP)

Beyond ILP: thread level parallelism (TLP)

Multithreaded microarchitectures

Page 2:

Locality and Parallelism Review

• Large memories are slow; fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have a large, fast cache

– We call the slow accesses to “remote” data “communication”

• Algorithms should do most of their work on local data

[Figure: conventional storage hierarchy (Proc + Cache, L2 Cache, L3 Cache, Memory), shown replicated for several processors with potential interconnects between them]

Page 3:

Static ILP Hitting a Limit

In-order scheduling microarchitecture with perfect memory

GCC benchmark: issue width vs. IPC

Memory not keeping pace with processors

• Chip density ~2x every 2 years

• Clock speed: no increase

• Number of processor cores doubling

• Power kept under control, no longer growing

Page 4:

Memory Not Keeping Pace

• Memory density doubling every three years; processor logic every two

• Storage costs dropping more slowly than logic costs

Source: David Turek, IBM

Cost of Computation vs. Memory

Source: IBM

Page 5:

Power Density Limiting Serial Performance: HEAT

[Figure: power density (W/cm²) vs. year, 1970–2010, on a log scale from 1 to 10,000, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6; the trend passes the power density of a hot plate and heads toward that of a nuclear reactor, a rocket nozzle, and the Sun's surface. Source: Patrick Gelsinger, Shekhar Borkar, Intel]

Scaling clock speed (business as usual) will not work

• High-performance serial processors waste power
  – Speculation, dynamic dependence checking, etc. burn power
  – So does implicit parallelism discovery

• More transistors, but not faster serial processors

• Concurrent systems are more power efficient
  – Dynamic power is proportional to V²fC
  – Increasing the number of cores increases capacitance C, but lowering the clock frequency f (and with it the voltage V) saves power
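The V²fC relation can be made concrete with a small numeric sketch. The figures below are illustrative assumptions, not measurements: two cores at half the clock, with the supply voltage scaled down to 0.6 of nominal, deliver the same nominal throughput as one full-speed core at a fraction of the dynamic power.

```python
# Dynamic power scales as P ~ C * V^2 * f.
# Hypothetical numbers: one core at full clock vs. two cores at half
# clock, where the lower frequency permits a lower supply voltage.

def dynamic_power(c, v, f):
    """Dynamic switching power, proportional to C * V^2 * f."""
    return c * v**2 * f

# Baseline: one core, normalized capacitance, voltage, and frequency.
single = dynamic_power(1.0, 1.0, 1.0)

# Two cores: capacitance doubles, frequency halves, voltage drops to 0.6.
dual = dynamic_power(2.0, 0.6, 0.5)

print(single, dual)  # dual is roughly a third of the baseline power
```

Same nominal work per second, but the quadratic voltage term dominates, which is why adding cores at lower clocks is more power efficient than scaling one core's frequency.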

Page 6:

Parallelism Today: Multicore
• All processor vendors ship multicore chips
  – Every machine is a parallel machine
  – To double performance, double parallelism
  – Can commercial applications use parallelism? Must they be rewritten from scratch?
• Will programmers become parallel programmers?
  – New software models are needed
  – They should hide complexity from most programmers
  – In the meantime, we need to understand it
• The computer industry is betting on parallelism, but does not have all the answers
  – Berkeley ParLab & the Stanford parallelism lab are working on it

Page 7:

Finding Enough Parallelism
• Only part of an application is parallel; the rest is sequential
• Amdahl's law
  – If s is the fraction of work that is sequential, (1 − s) is the fraction parallelizable
  – P = number of processors

Speedup(P) = Time(1)/Time(P)
           ≤ 1/(s + (1 − s)/P)   ; the serial part limits speedup
           ≤ 1/s                 (limit as P → ∞)

• Performance is limited by the sequential work, even if the parallel part speeds up perfectly
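Amdahl's law is easy to evaluate directly. The sketch below plugs an assumed 10% sequential fraction into the formula to show how quickly the 1/s ceiling bites:

```python
# Amdahl's law: speedup(P) = 1 / (s + (1 - s) / P),
# where s is the sequential fraction of the work.

def speedup(s, p):
    """Predicted speedup on p processors with sequential fraction s."""
    return 1.0 / (s + (1.0 - s) / p)

# With 10% sequential work, even huge machines cannot exceed 10x.
for p in (2, 8, 64, 1024):
    print(p, round(speedup(0.10, p), 2))
```

Even at 1024 processors the speedup is still below the 1/s = 10x limit, which is the "serial part limits speedup" point above.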

• Top500 list (Nov 2014): the fastest machine is China's Tianhe-2; other top machines come from the US and Japan, with Europe a distant follower

Page 8:

TOP500: China's Tianhe-2 is 1st (Nov 2014)

Page 9:

TOP500 – China Tianhe-2 is 1st

Page 10:

Parallelism Has Overhead
• Parallelism overheads:
  – Starting a thread / process
  – Communicating shared data
  – Synchronizing (e.g., at barriers)
• Each can cost milliseconds (millions of flops of lost work)
• Tradeoff: the algorithm needs large units of work to run fast in parallel (i.e., large granularity), but if the units are too large there is not enough parallel work
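The granularity tradeoff can be sketched with a toy cost model. All the constants here are assumptions for illustration: a fixed per-chunk overhead (thread start plus synchronization) charged against the work each chunk carries.

```python
# Toy granularity model (assumed costs, not measured): total time for
# `total_work` split into `chunks`, run on `processors`, paying a fixed
# `overhead` (start + sync) once per chunk.

def parallel_time(total_work, chunks, processors, overhead):
    per_chunk = total_work / chunks + overhead
    rounds = -(-chunks // processors)  # ceil(chunks / processors)
    return rounds * per_chunk

work, p, ovh = 1_000_000, 8, 5_000
for k in (1, 8, 64, 4096):
    print(k, parallel_time(work, k, p, ovh))
```

One chunk gets no parallelism; thousands of chunks drown in overhead; a handful of large chunks (here, one per processor) is the sweet spot, which is exactly the tradeoff stated above.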

Page 11:

Performance Beyond a Single Thread: TLP

• Natural parallelism exists in applications (e.g., database / scientific)
• Explicit thread-level parallelism or data-level parallelism
• Thread: an instruction stream with its own PC and data
  – E.g., online transaction processing, scientific modeling, ...
  – Each thread has everything (instructions, data, PC, register state, and so on) necessary to execute
• Data-level parallelism: e.g., multimedia; identical operations on data; vector processing was the predecessor

Page 12:

Multithreaded Categories Overview

[Figure: issue-slot occupancy over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; colors distinguish Threads 1–5 and idle slots]

Page 13:

Multithreaded Execution

• Multiple threads share the processor's functional units
  – The processor duplicates the independent state of each thread, e.g. a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
  – Memory is shared through virtual memory mechanisms
  – HW supports fast thread switch; much faster than a full process switch, which takes 100s to 1000s of clocks
• When to switch?
  – Fine grain: alternate instructions per thread
  – Coarse grain: when a thread stalls, e.g. on a cache miss
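The two switching policies can be sketched with a toy issue trace. This is an assumed model, not any real pipeline: each thread is a list of instructions, and 'MISS' marks a stalling memory access.

```python
# Toy issue-trace sketch: fine-grained switching rotates threads every
# cycle; coarse-grained switching runs one thread until it hits 'MISS'.

def fine_grained(threads):
    """Round-robin one instruction per cycle across the threads."""
    trace, i = [], 0
    queues = [list(t) for t in threads]
    while any(queues):
        q = queues[i % len(queues)]
        if q:
            trace.append(q.pop(0))
        i += 1
    return trace

def coarse_grained(threads):
    """Run one thread until it stalls on 'MISS', then switch."""
    trace, i = [], 0
    queues = [list(t) for t in threads]
    while any(queues):
        q = queues[i % len(queues)]
        while q:
            op = q.pop(0)
            trace.append(op)
            if op == 'MISS':       # costly stall: switch threads
                break
        i += 1
    return trace

t0 = ['a1', 'MISS', 'a2']
t1 = ['b1', 'b2', 'b3']
print(fine_grained([t0, t1]))    # interleaves every cycle
print(coarse_grained([t0, t1]))  # switches only at the MISS
```

The fine-grained trace alternates threads each slot; the coarse-grained trace stays with thread 0 until its miss, then drains thread 1, matching the two policies above.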

Page 14:

Coarse-Grained Multithreading
• Switch on a costly stall, e.g. an L2 cache miss
• Advantages
  – Simple
  – Doesn't slow down the individual thread
• Disadvantage: throughput loss from short stalls and pipeline start-up costs
  – The CPU issues instructions from one thread, and the pipeline is emptied on a stall
  – The new thread must then fill the pipeline
• Coarse-grained multithreading is therefore better at reducing the penalty of high-cost stalls (pipeline refill time << stall time)
• Used in the IBM eServer pSeries 680

Page 15:

Fine-Grained Multithreading

• Switch threads on each instruction, every clock
• Done round-robin, skipping stalled threads
• Advantage: can hide both short and long stalls, since instructions from other threads execute when one thread stalls
• Disadvantage: slows down individual threads; a ready thread is delayed by the other threads
• Used in Sun's Niagara

Page 16:

Observation: most execution units in a superscalar are idle

For an 8-way superscalar. Source: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: Maximizing On-Chip Parallelism”

Page 17:

Chip Multiprocessing (CMP): i7, Power4, without SMT

• Sending threads / processes to multiple processors
  – Reduces horizontal waste (unused issue slots within a cycle)
  – But leaves vertical waste (entirely idle cycles)
  – POWER5 uses SMT to attack both

[Figure: issue width vs. time, in processor cycles]

Page 18:

IBM Power4, 1st CMP (2000)
• 2 64-bit cores
• Single-threaded predecessor to Power5
• 8 execution units in an out-of-order engine
• Each unit may issue an instruction every cycle

(IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data cache, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, CP = group commit)

Page 19:

Power4 Core

Page 20:

Power4 Pipeline: Instruction Fetch, Group, Crack

• Group up to 5 instructions
  – Up to 8 instructions fetched from the cache
  – Instructions cracked into groups of 1 to 5 internal instructions
  – Complex instructions are broken into simpler ones
  – Cracked instruction: broken into 2 internal instructions, e.g. load multiple word
  – Millicoded instruction: broken into more than 2 internal instructions

Page 21:

Power4 Pipeline (group dispatch, GD)

• Dispatch: send the instruction group to the issue queues, in order
  – Instruction dependencies are determined
  – Internal resources are assigned: issue queue slot, rename registers, load/store reorder queues (GD and MP stages)
  – Group control information is kept in the GCT, the global completion table (20 groups) [the ROB equivalent]

Page 22:

Power4 Pipeline (group dispatch, one group per cycle)

• Groups go to separate issue queues for the floating-point, branch execution, fixed-point, and load/store units
• The fixed-point (integer) and load/store units share common issue queues
• Issue stage (ISS): ready-to-execute instructions are pulled out of the issue queues

Page 23:

Power4 Pipeline

• Instruction execution (EX), speculation, rename resources (GPRs: from 32 architected to 80 physical)
• Branch prediction (BP)
  – Conditional branches are predicted; instructions are fetched and speculatively executed
  – 3 history tables are used
  – If the prediction is correct, processing continues; else the speculative instructions are flushed and instruction fetching is redirected

Page 24:

Power 5 = SMT + Power 4

Page 25:

[Figure: Power4 vs. Power5 structures; Power5 adds 2 instruction fetch paths (2 PCs), 2 initial decodes, and 2 commits (two architected register sets)]

Page 26:

Power 5 data flow ...

Why only 2 threads? With 4, shared resources (physical registers, cache, memory bandwidth) would become the bottleneck

Page 27:

Simultaneous Multi-threading ...

[Figure: issue-slot occupancy over 9 cycles across 8 units (M M FX FX FP FP BR CC), first with one thread, then with two threads. M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes]

Page 28:

Simultaneous Multithreading (SMT)
• SMT: use a dynamically scheduled processor
  – The large physical register set can hold independent thread contexts
  – Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
  – Out-of-order completion lets the threads execute out of order, for better HW utilization
• Add a per-thread renaming table and separate PCs
  – Independent commit: logically keep a separate reorder buffer for each thread
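The per-thread renaming idea can be sketched in a few lines. This is an assumed toy model (register counts borrowed from the Power4/Power5 discussion, class and method names invented for illustration): each thread maps its architected registers into one shared physical register file, so the same architected name in two threads never collides.

```python
# Toy per-thread register renaming for SMT (illustrative model only):
# one shared physical register file, one rename map per thread.

class RenameTables:
    def __init__(self, num_threads, phys_regs):
        self.free = list(range(phys_regs))           # shared free list
        self.maps = [dict() for _ in range(num_threads)]

    def rename_dest(self, tid, arch_reg):
        """Allocate a fresh physical register for a destination write."""
        phys = self.free.pop(0)
        self.maps[tid][arch_reg] = phys
        return phys

    def lookup_src(self, tid, arch_reg):
        """Resolve a source operand through this thread's own map."""
        return self.maps[tid][arch_reg]

rt = RenameTables(num_threads=2, phys_regs=80)
p0 = rt.rename_dest(0, 'r1')   # thread 0 writes r1
p1 = rt.rename_dest(1, 'r1')   # thread 1 writes r1
print(p0, p1, p0 != p1)        # distinct physical registers
```

Because each thread resolves sources through its own map, no thread-ID tagging of operands is needed in the shared datapath, which is the point made above.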

Page 29:

Changes from Single Thread to SMT
• A second program counter (PC) is added to fetch the 2nd thread
• The GPR/FPR rename mapper is expanded to map a second set of registers (a bit indicates the thread)
• Completion logic is replicated to track two threads
• A thread bit is added to most address/tag buses

Page 30:

Changes in Power5 to Support SMT

• Increased associativity of the L1 instruction cache and the instruction address translation buffers (ITLB)
• Added load/store queues per thread
• Increased L2 and L3 sizes (1.92 vs. 1.44 MB)
• Separate instruction prefetch and buffering per thread
• Increased the number of virtual (rename) registers from 152 to 240
• Increased the size of the issue queues
• The Power5 core is 24% larger than the Power4 core to support SMT

Page 31:

SMT Design Issues
• What is SMT's impact on single-thread performance?
• A larger register file is needed to hold multiple contexts
• Clock cycle time may suffer, especially in:
  – Instruction issue: more candidate instructions must be considered
  – Instruction completion: choosing which instructions to commit is challenging
• Cache and TLB conflicts generated by SMT can degrade performance

Page 32:

Resource Sharing Effects
• Threads share many resources
  – GCT, BHT, TLB, ...
• Resources must be balanced across threads for higher performance
• Drifting to extremes reduces performance

Solution: dynamically adjust resource utilization

Page 33:

Power5 Thread Performance / Priority

• The relative priority of each thread is hardware controlled

• For balanced operation, both threads run slower than if each “owned” the machine

Page 34:

Thread Priority Control (cont'd)
• Unbalanced execution is desirable if
  – There is no work for the opposite thread
  – A thread is spin-waiting on a lock
  – Software has determined a non-uniform balance
  – Power management requires it
• Solution: control the instruction decode rate
  – Software/hardware controls 8 priority levels for each thread

Page 35:

Dynamic Thread Switching
• Used if no task is ready for the second thread to run
• All machine resources are allocated to one thread
• Software initiated
• The dormant thread awakens on
  – An external interrupt
  – A decrementer interrupt
  – A special instruction from the active thread

Page 36:

Single-Thread Operation
• For execution-unit-limited applications
  – Floating-point or fixed-point intensive workloads
• Execution-unit-limited applications provide minimal performance leverage for SMT
  – There is a higher performance benefit when resources are dedicated to a single thread
• Determined dynamically on a per-processor basis

Page 37:

Initial Performance of SMT
• Pentium 4 Extreme SMT yields a 1.01 speedup on SPECint_rate and 1.07 on SPECfp_rate
  – The Pentium 4 is a dual-threaded SMT
  – SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running each of the 26 SPEC benchmarks on a Pentium 4 paired with every other benchmark (26² runs) gives speedups from 0.90 to 1.58; the average is 1.20
• A Power5 8-processor server is 1.23x faster on SPECint_rate with SMT, 1.16x faster on SPECfp_rate
• Power5 running 2 copies of each app gives speedups between 0.89 and 1.41
  – Most apps gained something
  – Floating-point apps had the most cache conflicts and the least gains

Page 38:

Limits to ILP
• Doubling issue rates above today's 3-6 instructions per clock, say to 6-12 instructions, probably requires a processor to
  – issue 3 or 4 data memory accesses per cycle,
  – resolve 2 or 3 branches per cycle,
  – rename and access more than 20 registers per cycle, and
  – fetch 12 to 24 instructions per cycle
• The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
  – E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite consuming the most power!

Page 39:

Limits to ILP
• Most techniques for increasing performance also increase power consumption
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
• Multiple-issue processor techniques are all energy inefficient:
  1. Issuing multiple instructions incurs overhead in logic that grows faster than the issue rate grows
  2. There is a growing gap between peak issue rates and sustained performance
• The number of transistors switching is a function of the peak issue rate, while performance is a function of the sustained rate; the growing gap between peak and sustained performance means increasing energy per unit of performance

Page 40:

Commentary
• The Itanium architecture does not represent a significant breakthrough in scaling ILP or in avoiding the power / complexity problems
• Instead of more ILP, architects are focusing on TLP implemented with CMP
• IBM announced the Power4, the 1st commercial CMP: 2 Power3 processors + L2 cache
  – Sun Microsystems and Intel have also switched to CMP rather than more aggressive uniprocessors
• The right balance of ILP and TLP is not clear
  – Servers do well by exploiting more TLP
  – On the desktop, single-thread performance remains the primary requirement

Page 41:

And in conclusion ...

• Limits to ILP (power efficiency, compilers, dependencies, ...) seem to cap practical designs at 3 to 6 issue
• Explicit parallelism (data-level parallelism or thread-level parallelism) is the next step in performance
• Coarse-grained vs. fine-grained multithreading
  – Switch only on big stalls vs. switch every clock cycle
• Simultaneous multithreading is fine-grained multithreading built on a superscalar microarchitecture
  – Instead of replicating registers, reuse the rename registers

Page 42:

Power Storage Hierarchy

Page 43:

Power Storage Hierarchy
• Hardware data prefetch
  – Hardware prefetches data from L2, L3 & memory: it hides memory latency and transparently loads the L1 data cache
  – Triggered by data cache line misses
    • L1 prefetches 1 cache line ahead
    • L2 prefetches 5 cache lines ahead
    • L3 prefetches 17 to 20 lines ahead

Page 44:

Moore's Law Reinterpreted
• The number of cores per chip will double every two years

• Clock speed will not increase (and may decrease)

• Need to deal with systems with millions of concurrent threads

• Need to deal with inter-chip parallelism as well as intra-chip parallelism

Page 45:

Intel's Hyper-Threading technology is SMT: Pentium 4 (Xeon)

• Executes two tasks simultaneously
  – Two different applications
  – Two threads of the same application
• The CPU maintains architecture state for two processors
  – Two logical processors per physical processor
• Implemented on the Intel® Xeon™ and most Pentium 4 parts
  – Two logical processors for < 5% additional die area
  – Power-efficient performance gain

Page 46:

Resources are shared not replicated

Page 47:

Multithreaded Microarchitecture
• Dedicated local context per running thread
• Efficient resource sharing
  – Time sharing
  – Space sharing
• Fast thread synchronization / communication
  – Explicit instructions
  – Implicit via shared registers / cache / buffers

Page 48:

Changes Needed for Hyper-Threading: Pentium 4

• Replicate
  – All per-CPU architectural state
  – Instruction pointers, renaming logic
  – Other: ITLB, return stack predictor, ...
• Partition resources (share by splitting in half per thread)
  – Several buffers: reorder buffer, load/store buffers, queues
• Share
  – Out-of-order execution engine
  – Caches

Page 49:

P4 Out-of-order Execution pipeline

Page 50:

P4 Hyper-threaded pipeline

Page 51:

Pentium 4 Hyper-Threading Front End

• Some resources are divided between the logical CPUs

• Other resources are shared between the logical CPUs

Page 52:

Thread selection points

Page 53:

Icount Choosing Policy

• Fetch from the thread with the fewest instructions in flight
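The ICOUNT heuristic is a one-liner once the per-thread in-flight counts are available. A minimal sketch (function name invented for illustration):

```python
# ICOUNT fetch policy sketch: each cycle, fetch from the thread that
# currently has the fewest instructions in flight, keeping the
# instruction mix in the pipeline balanced across threads.

def icount_pick(in_flight):
    """Return the index of the thread with the fewest in-flight
    instructions; ties go to the lowest-numbered thread."""
    return min(range(len(in_flight)), key=lambda t: in_flight[t])

# Thread 2 has only 3 instructions in flight, so it fetches next.
print(icount_pick([8, 5, 3, 6]))  # 2
```

Favoring the least-represented thread keeps fast-moving threads supplied with instructions and stops a stalled thread from clogging the shared front end.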

Page 54:

All Caches Are Shared
• Execution trace cache
• L1 data
• L2 unified
• L3 unified

Page 55:

Data in Caches Can Be Shared
• L1 data
• L2 unified
• L3 unified

Page 56:

The Operating System Manages Tasks

• Schedules tasks on the logical processors

• Executes HALT if a logical processor is idle

Page 57:

Initial Performance of SMT
• Pentium 4 Extreme SMT yields a 1.01 speedup on SPECint_rate and 1.07 on SPECfp_rate
  – The Pentium 4 is a dual-threaded SMT
• Running each of the 26 SPEC benchmarks on a Pentium 4 paired with every other benchmark (26² runs) gives speedups from 0.90 to 1.58; the average is 1.20
• A Power5 8-processor server is 1.23x faster on SPECint_rate with SMT, 1.16x faster on SPECfp_rate
• Power5 running 2 copies of each app gives speedups between 0.89 and 1.41
  – Most apps gained something
  – Floating-point apps had the most cache conflicts and the least gains

Page 58:

Hyper-Threading Technology

• A significant new technology direction for Intel's future CPUs

• Exploits the parallelism in today's applications and usage
  – Two logical processors on one physical processor

• Accelerates performance at low silicon and power cost

• Implemented in Xeon MP, Pentium 4, Itanium 2

Page 59:

Multicore & Manycore
• A revolution is needed
• Software or architecture alone can't fix the parallel programming problem; we need innovations in both
• “Multicore”: 2x cores per generation: 2, 4, 8, ...
• “Manycore”: 100s of cores gives the highest performance per unit area and per Watt, then 2x per generation: 64, 128, 256, 512, 1024, ...
• Multicore architectures & programming models good for 2 to 32 cores won't evolve into manycore systems of 1000s of processors; we desperately need HW/SW models that work for manycore, or we will run out of steam (as ILP ran out of steam at 4 instructions)

Page 60:

Summary: Multithreaded Categories

[Figure: issue-slot occupancy over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading; colors distinguish Threads 1–5 and idle slots]

Page 61:

Cell Processor

Page 62:

Cell Processor Features
• 64-bit Power core & its L2 cache
• 8 SPEs: processing elements with local memory
• High-bandwidth interconnect bus
• Memory interface controller
• 10 simultaneous threads: 8 on the SPEs + 2 on the Power core
• 234M transistors, 90 nm, SOI, 8-level copper
• On-chip temperature is monitored and cooling adjusted

Page 63:


Page 64:

SPE
• The SPE is optimized for compute-intensive applications
• Both types of processor cores share access to a common address space: main memory, and the address ranges corresponding to each SPE's local store, control registers, and I/O devices
• Simple high-speed pipeline
• Pervasive parallel computing: SIMD data-level parallelism
• 128 x 128-bit register file (scalar and vector)
• Optimized scalar: uses the same h/w path as vector instructions
• 256 KB local store (similar to, but not, a cache: no tags, etc.)

Page 65:

Cell Processor Die Photo

Page 66:

Synergistic Processor SPE

Page 67:

SPE Pipeline