12/14Multi-Hyper thread.1
Multi-threading, Hyperthreading & Chip Multiprocessing (CMP)
Beyond ILP: thread level parallelism (TLP)
Multithreaded microarchitectures
12/14Multi-Hyper thread.2
Locality and Parallelism Review
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast cache
– the slow accesses to “remote” data we call “communication”
• Algorithm should do most work on local data
[Figure: Conventional Storage Hierarchy (Proc/Cache, L2 Cache, L3 Cache, Memory), drawn once per processor, with potential interconnects between the hierarchies]
12/14Multi-Hyper thread.3
Static ILP hitting limit
In-order scheduling microarchitecture with perfect memory
GCC Benchmark: Issue width vs. IPC
Memory not keeping pace with processors
• Chip density ~2x every 2 years
• Clock speed: no increase
• Number of processor cores doubling
• Power kept under control, no longer growing
12/14Multi-Hyper thread.4
Memory Not Keeping Pace
• Memory density doubling every three years; processor logic every two
• Storage costs dropping more slowly than logic costs
Source: David Turek, IBM
Cost of Computation vs. Memory
Source: IBM
12/14Multi-Hyper thread.5
Power Density Limiting Serial Performance
[Figure: power density (W/cm²) on a log scale from 1 to 10000 versus year, 1970 to 2010, for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® and P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface. Source: Patrick Gelsinger, Shekhar Borkar, Intel]
Scaling clock speed (business as usual) will not work
• High-performance serial processors waste power
– Speculation, dynamic dependence checking, etc. burn power
– Implicit parallelism discovery
• More transistors, but not faster serial processors
• Concurrent systems are more power efficient
– Dynamic power is proportional to V²fC
– Increasing cores increases capacitance C, but only linearly
– Lowering clock speed saves power
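A minimal sketch of the V²fC relation from this slide, comparing one fast core with two slower cores. The voltages, frequencies and capacitances are hypothetical values picked only for illustration, and the assumption that a half-speed core can run at a lower supply voltage is ours, not a measurement.

```python
def dynamic_power(voltage, frequency_hz, capacitance_f):
    """Dynamic power model from the slide: P is proportional to V^2 * f * C."""
    return voltage ** 2 * frequency_hz * capacitance_f


# Hypothetical operating point for a single fast core.
one_fast_core = dynamic_power(voltage=1.2, frequency_hz=3.0e9, capacitance_f=1.0e-9)

# Two cores double the switched capacitance. Each runs at half the clock,
# which is assumed here to permit a lower supply voltage.
two_slow_cores = dynamic_power(voltage=0.9, frequency_hz=1.5e9, capacitance_f=2.0e-9)

power_ratio = two_slow_cores / one_fast_core
print(f"one fast core:  {one_fast_core:.2f} W")
print(f"two slow cores: {two_slow_cores:.2f} W")
print(f"ratio:          {power_ratio:.4f}")
```

Under these assumed numbers the two-core design offers the same aggregate clock cycles per second for less dynamic power, because the voltage term is squared.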
12/14Multi-Hyper thread.6
Parallelism Today: Multicore
• All processor vendors ship multicore chips
– Every machine is a parallel machine
– To double performance, double parallelism
– Can commercial applications use parallelism?
– Do they need to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
– New software models needed
– Try to hide complexity from most programmers
– In the meantime, need to understand it
• Computer industry is betting on parallelism, but does not have all the answers
– Berkeley ParLab & Stanford are working on it
12/14Multi-Hyper thread.7
Finding Enough Parallelism
• Only part of an application is parallel; the rest is sequential
• Amdahl’s law
– If s is the fraction of sequential work, (1-s) is the fraction parallelizable
– P = number of processors
Speedup(P) = Time(1)/Time(P)
<= 1/(s + (1-s)/P); serial part limits speedup
<= 1/s (limit as P grows)
• Even if the parallel part speeds up perfectly, performance is limited by the sequential part
• Top500 list: in Nov 2014 the fastest machine is Tianhe-2 (China); others came from the US and Japan, with Europe distant
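The Amdahl bound on this slide can be tabulated with a short sketch. The function names and the 10% serial fraction are illustrative choices, not part of the original slides.

```python
def amdahl_speedup(serial_fraction, processors):
    """Upper bound on speedup: 1 / (s + (1 - s) / P)."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)


def amdahl_limit(serial_fraction):
    """Limit of the bound as P grows without bound: 1 / s."""
    if serial_fraction == 0:
        return float("inf")
    return 1.0 / serial_fraction


# Illustrative case: 10% of the work is sequential.
for p in (1, 2, 8, 64, 1024):
    print(f"P = {p:5d}  speedup <= {amdahl_speedup(0.1, p):.2f}")
print(f"limit as P grows: {amdahl_limit(0.1):.1f}")
```

With s = 0.1 the bound approaches but never reaches 10, no matter how many processors are added.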
12/14Multi-Hyper thread.8
TOP500 – China Tianhe-2 1st nov 2014
12/14Multi-Hyper thread.9
TOP500 – China Tianhe-2 is 1st
12/14Multi-Hyper thread.10
Parallelism has Overhead
• Overhead is a barrier to parallel speedup. Parallelism overheads:
– Starting a thread / process
– Communicating shared data
– Synchronizing
• Each can take milliseconds (= millions of flops)
• Tradeoff: algorithm needs large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
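The granularity tradeoff can be illustrated with a toy cost model. The model itself is an assumption made for this sketch: work divides evenly into tasks, tasks run in rounds of P at a time, and every task pays a fixed overhead (1 ms here, a hypothetical figure).

```python
import math


def parallel_time(total_work, tasks, processors, overhead_per_task):
    """Toy model: tasks run in rounds of `processors`; each task pays a fixed overhead."""
    rounds = math.ceil(tasks / processors)
    return rounds * (total_work / tasks + overhead_per_task)


# Hypothetical workload: 1 second of work, 8 processors, 1 ms overhead per task.
for tasks in (1, 8, 64, 8000):
    t = parallel_time(total_work=1.0, tasks=tasks, processors=8, overhead_per_task=0.001)
    print(f"{tasks:5d} tasks -> {t:.3f} s")
```

In this model one task leaves seven processors idle, while 8000 tiny tasks spend most of their time on overhead and end up slower than the single-task run.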
12/14Multi-Hyper thread.11
Performance beyond single thread: TLP
• Natural parallelism in applications (e.g., database / scientific)
• Explicit Thread Level Parallelism or Data Level Parallelism
• Thread: instruction stream with its own PC and data
– E.g., online transaction processing, scientific modeling of nature, ...
– Each thread has all the state (instructions, data, PC, register state, and so on) necessary to execute
• Data Level Parallelism: e.g., multimedia; identical operations on many data items; vector machines were the predecessor
12/14Multi-Hyper thread.12
Multithreaded Categories Overview
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]
12/14Multi-Hyper thread.13
Multithreaded Execution
• Multiple threads share processor functional units
– Processor duplicates the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
– Memory shared through virtual memory mechanisms
– HW for fast thread switch; much faster than a full process switch, which takes 100s to 1000s of clocks
• When to switch?
– Fine grain: alternate instructions per thread
– Coarse grain: when a thread stalls, e.g., on a cache miss
12/14Multi-Hyper thread.14
Coarse-Grained Multithreading
• Switch on costly stall, e.g., L2 cache misses
• Advantages
– Simple
– Doesn’t slow down the running thread
• Disadvantage: throughput loss from short stalls, because of pipeline start-up costs
– CPU issues instructions from 1 thread; pipeline is emptied on a stall
– New thread must fill the pipeline
• Coarse-grained multithreading is better for reducing the penalty of high-cost stalls (pipeline refill << stall time)
• Used in IBM eServer pSeries 680
12/14Multi-Hyper thread.15
Fine-Grained Multithreading
• Switch thread on each instruction, every clock
• Done in round-robin order, skipping stalled threads
• Advantage: can hide both short and long stalls, instructions from other threads execute when thread stalls
• Disadvantage: slows down individual threads; thread delayed by other threads
• Used on Sun’s Niagara
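A minimal sketch of the round-robin, skip-if-stalled policy described on this slide, for a single-issue pipeline. The thread encoding is an assumption made for the example: each thread is a list whose entries give the stall, in cycles, that follows that instruction. It is not a model of Niagara.

```python
def fine_grained_schedule(threads, max_cycles=1000):
    """Return, per cycle, the id of the thread that issued, or None for an idle slot."""
    n = len(threads)
    next_index = [0] * n   # next instruction to issue, per thread
    ready_at = [0] * n     # first cycle at which each thread may issue again
    trace = []
    last = -1              # thread that issued most recently
    cycle = 0
    while cycle < max_cycles and any(next_index[t] < len(threads[t]) for t in range(n)):
        issued = None
        # Round-robin: start after the last issuer and skip stalled or finished threads.
        for offset in range(1, n + 1):
            t = (last + offset) % n
            if next_index[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][next_index[t]]
                next_index[t] += 1
                ready_at[t] = cycle + 1 + stall
                issued = last = t
                break
        trace.append(issued)
        cycle += 1
    return trace


# Thread 0 has a 3-cycle stall after its second instruction; thread 1 never stalls.
print(fine_grained_schedule([[0, 3, 0], [0, 0, 0]]))
# The same stalling thread running alone leaves three idle slots.
print(fine_grained_schedule([[0, 3, 0]]))
```

In the two-thread run most of the stall is covered by instructions from the other thread, but each thread also waits its turn when both are ready, which is the slowdown the slide lists as the disadvantage.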
12/14Multi-Hyper thread.16
Most execution units in superscalar are idle
Observation for an 8-way superscalar, from Tullsen, Eggers, and Levy, “Simultaneous Multithreading: ...”
12/14Multi-Hyper thread.17
Chip Multiprocessing (CMP): i7, Power4 without SMT
• Sends threads / processes to multiple processors
– Reduces horizontal waste
– But leaves vertical waste
– POWER 5 uses SMT
[Figure: issue slots by issue width (horizontal axis) and time in processor cycles (vertical axis)]
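Horizontal and vertical waste can be counted from a per-cycle issue trace. In this sketch a cycle with no issued instructions wastes the full issue width (vertical waste), and a cycle with some but not all slots filled wastes the remainder (horizontal waste). The trace values and the function name are hypothetical.

```python
def issue_slot_waste(issued_per_cycle, issue_width):
    """Split unused issue slots into vertical waste (empty cycles) and horizontal waste."""
    if any(n < 0 or n > issue_width for n in issued_per_cycle):
        raise ValueError("each cycle must issue between 0 and issue_width instructions")
    vertical = sum(issue_width for n in issued_per_cycle if n == 0)
    horizontal = sum(issue_width - n for n in issued_per_cycle if n > 0)
    used = sum(issued_per_cycle)
    return {"used": used, "horizontal": horizontal, "vertical": vertical}


# Hypothetical 4-wide machine observed over six cycles.
waste = issue_slot_waste([2, 0, 1, 4, 0, 3], issue_width=4)
print(waste)
```

The three counts always add up to issue width times the number of cycles, which gives a quick sanity check on a trace.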
12/14Multi-Hyper thread.18
IBM Power 4: 1st CMP, 2000
• Two 64-bit cores
• Single-threaded predecessor to Power 5
• 8 execution units in out-of-order engine
• Each may issue an instruction each cycle
Pipeline stage abbreviations: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit.
12/14Multi-Hyper thread.19
Power4 Core
12/14Multi-Hyper thread.20
Power4 Pipeline: instruction fetch, group, crack
• Group up to 5 instructions
– Up to 8 instructions fetched from cache
– Instructions cracked and formed into groups of 1 to 5 instructions
– Complex instructions are broken into simpler ones
– Cracked instruction: broken into 2 internal instructions, e.g., load multiple word
– Millicoded instruction: broken into more than 2 internal instructions
12/14Multi-Hyper thread.21
Power4 Pipeline (group dispatch, GD)
• Dispatch: send an instruction group to the issue queues, in order
– Instruction dependencies determined
– Internal resources assigned: issue queue slot, rename registers, load / store reorder queues (GD and MP stages)
– Group control information kept in the GCT, the Global Completion Table (20 groups), which plays the role of a ROB
12/14Multi-Hyper thread.22
Power4 Pipeline (group dispatch: one group / cycle)
• Group is split across separate issue queues: floating-point, branch execution, fixed-point and load/store units
• Fixed-point (integer) & load/store units share common issue queues
• Issue stage (ISS): ready-to-execute instructions are pulled out of the issue queues
12/14Multi-Hyper thread.23
Power4 Pipeline
• Instruction execution (EX), speculation, rename resources (GPRs from 32 to 80)
• Branch prediction (BP)
– Conditional branches are predicted; instructions are fetched and speculatively executed
– 3 history tables used
– If the prediction is correct, processing continues
– Else instructions are flushed and instruction fetching is redirected
12/14Multi-Hyper thread.24
Power 5 = SMT + Power 4
12/14Multi-Hyper thread.25
[Figure: Power 4 and Power 5 pipelines compared. Power 5 has 2 fetch (PC), 2 initial decodes, and 2 commits (architected register sets)]
12/14Multi-Hyper thread.26
Power 5 data flow ...
Why only 2 threads? With 4, shared resources (physical registers, cache, memory bandwidth) would be bottleneck
12/14Multi-Hyper thread.27
Simultaneous Multi-threading ...
[Figure: two issue diagrams over cycles 1 to 9 across eight execution units (M, M, FX, FX, FP, FP, BR, CC), one titled "One thread, 8 units" and one titled "Two threads, 8 units"]
M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes
12/14Multi-Hyper thread.28
Simultaneous Multithreading (SMT)
• SMT: builds on a dynamically scheduled processor
– Large register set can hold independent threads
– Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
– Out-of-order completion allows threads to execute out of order and get better utilization of the HW
• Add a per-thread renaming table and separate PCs
– Independent commit: logically keep a separate reorder buffer for each thread
12/14Multi-Hyper thread.29
Changes from Single thread to SMT
• Second Program Counter (PC) added to fetch 2nd thread
• GPR/FPR rename mapper expanded to map second set of registers (a bit indicates the thread)
• Completion logic replicated to track two threads
• Thread bit added to most address/tag buses
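The role of the thread bit in the rename mapper can be sketched as a lookup keyed by (thread, architected register) into one shared pool of physical registers. The class and method names are hypothetical, and the sketch never frees registers at commit, so it is a simplification rather than a description of the Power 5 mapper.

```python
class SmtRenamer:
    """Toy rename mapper: (thread bit, architected register) -> shared physical register."""

    def __init__(self, physical_registers):
        self.free = list(range(physical_registers))  # shared free list
        self.table = {}                              # (thread, arch_reg) -> phys_reg

    def rename_dest(self, thread, arch_reg):
        """Allocate a fresh physical register for a destination operand."""
        if not self.free:
            raise RuntimeError("no free physical registers")
        phys = self.free.pop(0)
        self.table[(thread, arch_reg)] = phys
        return phys

    def lookup_src(self, thread, arch_reg):
        """Find the physical register holding a thread's latest value of arch_reg."""
        return self.table[(thread, arch_reg)]


renamer = SmtRenamer(physical_registers=8)
t0_r3 = renamer.rename_dest(thread=0, arch_reg=3)
t1_r3 = renamer.rename_dest(thread=1, arch_reg=3)
print(t0_r3, t1_r3)  # same architected register, different physical registers
```

Because the thread bit is part of the key, both threads can write "r3" and their instructions can share the datapath without reading each other's values.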
12/14Multi-Hyper thread.30
Changes in Power 5 to support SMT
• Increased associativity of L1 I cache and instruction address translation buffers (ITLB)
• Added load / store queues per thread
• Increased L2, L3 size (1.92 vs. 1.44 MB)
• Separate instruction prefetch and buffering per thread
• Increased number of virtual registers from 152 to 240 (rename registers)
• Increased the size of issue queues
• Power5 core is 24% larger than the Power4 core to support SMT
12/14Multi-Hyper thread.31
SMT Design Issues
• SMT: impact on single-thread performance?
• Larger register file needed to hold multiple contexts
• Clock cycle time, especially in:
– Instruction issue: more candidate instructions need to be considered
– Instruction completion: choosing which instructions to commit is challenging
• Cache and TLB conflicts generated by SMT degrade performance
12/14Multi-Hyper thread.32
Resource Sharing: effects
• Threads share many resources
– GCT, BHT, TLB, ...
• Resources balanced across threads give higher performance
• Drifting to extremes reduces performance
Solution: dynamically adjust resource utilization
12/14Multi-Hyper thread.33
Power 5 thread performance / priority..
Relative priority of each thread is hardware controlled
For balanced operation, both run slower than if threads “owned” the machine.
12/14Multi-Hyper thread.34
Thread priority control, cont’d
• Unbalanced execution desirable if
– No work for opposite thread
– Thread spin-waiting on lock
– Software-determined non-uniform balance
– Power management
• Solution: control instruction decode rate
– Software/hardware controls 8 priority levels for each thread
12/14Multi-Hyper thread.35
Dynamic Thread Switching
• Used if no task ready for second thread to run
• All machine resources allocated to one thread
• Software initiated
• Dormant thread awakens on
–External interrupt
–Decrementer Interrupt
–Special Instruction from active thread
12/14Multi-Hyper thread.36
Single Thread Operation
• For execution-unit-limited applications
– Floating-point or fixed-point intensive workloads
• Execution-unit-limited applications provide minimal performance leverage for SMT
– Higher performance benefit when resources dedicated to single thread
• Determined dynamically on a per-processor basis
12/14Multi-Hyper thread.37
Initial Performance of SMT
• Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium 4 is dual-threaded SMT
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26² runs): speedups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
– Most gained some
– Fl.Pt. apps had most cache conflicts and least gains
12/14Multi-Hyper thread.38
Limits to ILP
• Doubling issue rates above today’s 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
– issue 3 or 4 data memory accesses per cycle,
– resolve 2 or 3 branches per cycle,
– rename and access more than 20 registers per cycle, and
– fetch 12 to 24 instructions per cycle
• The complexity of implementing these capabilities is likely to mean sacrifices in the maximum clock rate
– E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!
12/14Multi-Hyper thread.39
Limits to ILP
• Most techniques for increasing performance increase power consumption
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
• Multiple-issue processor techniques are all energy inefficient:
1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
2. Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and performance = f(sustained rate), so the growing gap between peak and sustained performance increases energy per unit of performance
12/14Multi-Hyper thread.40
Commentary
• Itanium architecture does not represent a significant breakthrough in scaling ILP or in avoiding power / complexity problems
• Instead of more ILP, architects are focusing on TLP implemented with CMP
• IBM announced Power4, 1st commercial CMP, = 2 Power3 processors + L2 cache
– Sun Microsystems and Intel have switched to CMP rather than aggressive uniprocessors
• Right balance of ILP and TLP is not clear
– Servers can exploit more TLP
– On the desktop, single-thread performance is a primary requirement
12/14Multi-Hyper thread.41
And in conclusion …
• Limits to ILP (power efficiency, compilers, dependencies …) seem to limit practical designs to 3 to 6 issue
• Explicit parallelism (Data level parallelism or Thread level parallelism) is the next step to performance
• Coarse-grained vs. fine-grained multithreading
– Switch only on big stall vs. every clock cycle
• Simultaneous Multithreading is fine-grained multithreading based on a superscalar microarchitecture
– Instead of replicating registers, reuse rename registers
12/14Multi-Hyper thread.42
Power Storage Hierarchy
12/14Multi-Hyper thread.43
Power Storage Hierarchy
• Hardware data prefetch
– Hardware prefetches data from L2, L3 & memory: hides memory latency by transparently loading the L1 data cache
– Triggered by data cache line misses
• L1 prefetches 1 cache line ahead
• L2 prefetches 5 cache lines ahead
• L3 prefetches 17 to 20 lines
12/14Multi-Hyper thread.44
Moore’s Law reinterpreted
• Number of cores per chip will double every two years
• Clock speed will not increase (possibly decrease)
• Need to deal with systems with millions of concurrent threads
• Need to deal with inter-chip parallelism as well as intra-chip parallelism
12/14Multi-Hyper thread.45
Intel’s Hyper-threading technology is SMT: Pentium 4 (Xeon)
• Executes two tasks simultaneously
– Two different applications
– Two threads of same application
• CPU maintains architecture state for two processors
– Two logical processors per physical processor
• Implemented on Intel® Xeon™ and most Pentium 4
– Two logical processors for < 5% additional die area
– Power efficient performance gain
12/14Multi-Hyper thread.46
Resources are shared not replicated
12/14Multi-Hyper thread.47
Multithreaded Microarchitecture
• Dedicated local context per running thread
• Efficient resource sharing
– Time sharing
– Space sharing
• Fast thread synchronization / communication
– Explicit instructions
– Implicit via shared registers / cache / buffer
12/14Multi-Hyper thread.48
Changes needed for Hyper-threading: Pentium 4
• Replicate
– All per-CPU architectural state
– Instruction pointers, renaming logic
– Other: ITLB, return stack predictor, ...
• Partition resources (share by splitting in half per thread)
– Several buffers: re-order buffer, load/store buffers, queues
• Share
– Out-of-order execution engine
– Caches
12/14Multi-Hyper thread.49
P4 Out-of-order Execution pipeline
12/14Multi-Hyper thread.50
P4 Hyper-threaded pipeline
12/14Multi-Hyper thread.51
Pentium-4 Hyperthreading Front End
Resource divided between logical CPUs
Resource shared between logical CPUs
12/14Multi-Hyper thread.52
Thread selection points
12/14Multi-Hyper thread.53
Icount Choosing Policy
Fetch from thread with the least instructions in flight.
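The ICOUNT rule on this slide amounts to picking the thread with the smallest in-flight instruction count each fetch cycle. A minimal sketch follows; the function name, the dictionary input and the tie-break toward the lower thread id are assumptions made for the example.

```python
def icount_pick(in_flight):
    """Return the id of the thread with the fewest instructions in flight.

    in_flight maps thread id -> number of instructions currently in flight.
    Ties go to the lowest thread id (an arbitrary choice for this sketch).
    """
    if not in_flight:
        raise ValueError("no threads to choose from")
    return min(sorted(in_flight), key=lambda thread: in_flight[thread])


print(icount_pick({0: 12, 1: 5}))  # thread 1 has fewer instructions in flight
print(icount_pick({0: 4, 1: 4}))   # tie, so thread 0 is chosen
```

Favoring the thread with fewer instructions in flight keeps a stalled thread from filling the shared queues with instructions that cannot make progress.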
12/14Multi-Hyper thread.54
All caches are shared
• Execution trace cache
• L1 Data
• L2 Unified
• L3 Unified
12/14Multi-Hyper thread.55
Data in Caches can be shared• L1 Data
• L2 unified
• L3 unified
12/14Multi-Hyper thread.56
Operating system manages tasks
• Schedule tasks on logical processors
• Executes HALT if a logical processor is idle
12/14Multi-Hyper thread.57
Initial Performance of SMT
• Pentium 4 Extreme SMT yields 1.01 speedup for SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium 4 is dual-threaded SMT
• Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26² runs): speedups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
– Most gained some
– Fl.Pt. apps had most cache conflicts and least gains
12/14Multi-Hyper thread.58
Hyper-threading technology
• Significant new technology direction for Intel’s future CPUs
• Exploits parallelism in today’s applications and usage– Two logical processors on one physical processor
• Accelerates performance for low silicon and power costs
• Implemented in Xeon MP, Pentium 4, Itanium 2
12/14Multi-Hyper thread.59
Multicore & Manycore
• Revolution needed
• Software or architecture alone can’t fix the parallel programming problem; need innovations in both
• “Multicore”: 2X cores per generation: 2, 4, 8, …
• “Manycore”: 100s of cores gives the highest performance per unit area and per Watt, then 2X per generation: 64, 128, 256, 512, 1024 …
• Multicore architectures & programming models good for 2 to 32 cores won’t evolve to Manycore systems of 1000s of processors
• Desperately need HW/SW models that work for Manycore, or will run out of steam (as ILP ran out of steam at 4 instructions)
12/14Multi-Hyper thread.60
Summary: Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot]
12/14Multi-Hyper thread.61
Cell Processor
12/14Multi-Hyper thread.62
Cell Processor Features
• 64b Power core & its L2 cache
• 8 SPEs: processing elements with local memory
• High bandwidth interconnect bus
• Memory interface controller
• 10 simultaneous threads: 8 on SPEs + 2 on Power core
• 234M transistors, 90 nm, SOI, 8-level copper
• On-chip temperature monitored; cooling adjusted
12/14Multi-Hyper thread.63
12/14Multi-Hyper thread.64
SPE
• SPE optimized for compute-intensive applications
• Both types of processor cores share access to a common address space, which includes main memory and address ranges corresponding to each SPE’s local store, control registers, and I/O devices
• Simple high-speed pipeline
• Pervasive parallel computing: SIMD data-level parallelism
• 128 x 128 register file (scalar and vector)
• Optimized scalar: uses same h/w path as vector instructions
• 256k local store (similar to but not a cache: no tags, etc.)
12/14Multi-Hyper thread.65
Cell Processor Die Photo
12/14Multi-Hyper thread.66
Synergistic Processor SPE
12/14Multi-Hyper thread.67
SPE Pipeline