
12/14Multi-Hyper thread.1

Multi-threading, Hyperthreading & Chip Multiprocessing (CMP)

Beyond ILP: thread level parallelism (TLP)

Multithreaded microarchitectures


Locality and Parallelism Review

• Large memories are slow; fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches

– the slow accesses to “remote” data we call “communication”

• Algorithm should do most work on local data

[Figure: Conventional storage hierarchy. Three processors, each with its own Proc/Cache, L2 cache, L3 cache, and memory; potential interconnects link the hierarchies.]


Static ILP hitting limit

In-order scheduling microarchitecture with perfect memory

GCC Benchmark: Issue width VS IPC

Memory not keeping pace with processors

• Chip density ~2x every 2 years

• Clock speed: no increase

• Number of processor cores doubling

• Power kept under control, no longer growing


Memory Not Keeping Pace

• Memory density doubling every three years; processor logic every two
• Storage costs dropping more slowly than logic costs

Source: David Turek, IBM

Cost of Computation vs. Memory

Source: IBM


Power Density Limiting Serial Performance

[Figure: Power density (W/cm2, log scale from 1 to 10000) vs. year (1970 to 2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, and P6, with reference levels for a hot plate, a nuclear reactor, a rocket nozzle, and the Sun's surface.]

Source: Patrick Gelsinger, Shekhar Borkar, Intel

Scaling clock speed (business as usual) will not work

• High-performance serial processors waste power
– Speculation, dynamic dependence checking, etc. burn power
– Implicit parallelism discovery
• More transistors, but not faster serial processors
• Concurrent systems are more power efficient
– Dynamic power is proportional to V²fC
– Increasing the number of cores increases capacitance (C)
– Lowering clock speed (f) saves power
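The power relation above can be made concrete with a small calculation. This is only a sketch in normalized units: the function name `dynamic_power` is made up for illustration, and the assumption that halving the clock allows the supply voltage to drop to 0.8 of nominal is not from the slides.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power from the relation P ~ C * V^2 * f (arbitrary units)."""
    return capacitance * voltage ** 2 * frequency

# Baseline: one core at nominal voltage and clock (everything normalized to 1.0).
one_fast_core = dynamic_power(capacitance=1.0, voltage=1.0, frequency=1.0)

# Two cores double the switched capacitance; each runs at half the clock.
# ASSUMPTION: the lower clock lets the supply voltage drop to 0.8 of nominal.
two_slow_cores = dynamic_power(capacitance=2.0, voltage=0.8, frequency=0.5)

print("one fast core :", one_fast_core)
print("two slow cores:", two_slow_cores)
```

Under these assumed numbers the two slower cores offer the same aggregate clock cycles per second as the single fast core while drawing less dynamic power, because voltage enters quadratically.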


Parallelism Today: Multicore

• All processor vendors produce multicore chips
– Every machine is a parallel machine
– To double performance, double parallelism
– Can commercial applications use parallelism?
– Must they be rewritten from scratch?
• Will all programmers have to be parallel programmers?
– New software models needed
– Try to hide complexity from most programmers
– In the meantime, need to understand it
• Computer industry is betting on parallelism, but does not have all the answers
– Berkeley ParLab & Stanford parallelism groups are working on it


Finding Enough Parallelism

• Only part of an application is parallel; the rest is sequential
• Amdahl's law
– If s is the fraction of sequential work, (1-s) is the fraction parallelizable
– P = number of processors

Speedup(P) = Time(1)/Time(P)
           <= 1/(s + (1-s)/P)    (serial part limits speedup)
           <= 1/s                (limit as P grows)

• Even if the parallel part speeds up perfectly, performance is limited by the sequential work

• Top500 list: Nov 2014 fastest machine is Tianhe-2 - China, others came from US, Japan – Europe distant
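As a worked example of the Amdahl's law bound above, the short sketch below evaluates 1/(s + (1-s)/P) for a few processor counts. The function name `amdahl_speedup` and the choice s = 0.1 are illustrative assumptions.

```python
def amdahl_speedup(s, p):
    """Upper bound on speedup with sequential fraction s on p processors."""
    return 1.0 / (s + (1.0 - s) / p)

# ASSUMPTION: 10% of the work is sequential (s = 0.1), chosen only for illustration.
for p in (1, 2, 8, 64, 1024):
    print(p, round(amdahl_speedup(0.1, p), 2))
```

However large P gets, the bound never reaches 1/s, which is 10 for s = 0.1.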


TOP500 – China Tianhe-2 1st nov 2014


TOP500 – China Tianhe-2 is 1st


Parallelism Has an Overhead Barrier

• Parallelism overheads:
– Starting a thread / process
– Communicating shared data
– Synchronizing
• Each can cost milliseconds (millions of flops)
• Tradeoff: an algorithm needs large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work
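The granularity tradeoff above can be illustrated with a toy cost model. Everything here is an assumption made for illustration: work is split into equal tasks, each task pays one fixed overhead, and tasks run in rounds of `n_procs` at a time. The name `parallel_time` is hypothetical.

```python
import math

def parallel_time(total_work, n_tasks, n_procs, overhead):
    """Toy model: equal-size tasks, a fixed overhead per task, tasks run in rounds."""
    task_time = total_work / n_tasks + overhead   # compute time plus overhead of one task
    rounds = math.ceil(n_tasks / n_procs)         # each round runs up to n_procs tasks
    return rounds * task_time

# ASSUMED numbers: 1000 units of work, 8 processors, 1 unit of overhead per task.
for n_tasks in (1, 8, 64, 8000):
    print(n_tasks, parallel_time(1000, n_tasks, 8, 1))
```

With one huge task there is no parallel work (1001 time units). With 8000 tiny tasks the overhead dominates (1125 units). One task per processor gives 126 units in this model.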


Performance beyond single thread TLP

• Natural parallelism in applications (e.g., database / scientific)

• Explicit Thread Level Parallelism or Data Level Parallelism

• Thread: instruction stream with its own PC and data
– E.g., online transaction processing, scientific modeling of nature, ...
– Each thread has all the state (instructions, data, PC, register state, and so on) necessary to execute

• Data Level Parallelism: e.g., multimedia; identical operations on many data items; vector processing was the predecessor


Multithreaded Categories Overview

[Figure: Issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]


Multithreaded Execution

• Multiple threads share the processor's functional units
– Processor duplicates the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
– Memory is shared through virtual memory mechanisms
– HW for fast thread switch; much faster than a full process switch, which takes 100s to 1000s of clocks

• When to switch?
– Fine grain: alternate instructions per thread
– Coarse grain: when a thread stalls, e.g., on a cache miss


Coarse-Grained Multithreading

• Switch on costly stalls, e.g., L2 cache misses
• Advantages
– Simple
– Doesn't slow down the individual thread
• Disadvantage: throughput loss from short stalls, because of pipeline start-up costs
– CPU issues instructions from 1 thread; pipeline is emptied on a stall
– New thread must fill the pipeline
• Coarse-grained multithreading is better for reducing the penalty of high-cost stalls (pipeline refill << stall time)
• Used in IBM eServer pSeries 680


Fine-Grained Multithreading

• Switch threads on each instruction, every clock
• Done in round-robin order, skipping stalled threads

• Advantage: can hide both short and long stalls, since instructions from other threads execute when one thread stalls

• Disadvantage: slows down individual threads; a ready thread is delayed by instructions from other threads

• Used on Sun’s Niagara
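The round-robin policy described above can be sketched as a tiny cycle-level model. This is a simplified illustration, not a model of any real pipeline: it assumes a single-issue processor, and each thread is given as a list where entry i is the number of stall cycles that follow instruction i. The function name and the two example threads are made up.

```python
def fine_grained_cycles(threads):
    """Cycles to issue every instruction, switching threads each cycle
    in round-robin order and skipping threads that are stalled."""
    n = len(threads)
    next_instr = [0] * n   # index of the next instruction of each thread
    ready_at = [0] * n     # first cycle at which each thread may issue again
    cycle, last = 0, -1
    while any(next_instr[t] < len(threads[t]) for t in range(n)):
        # Start the search just after the thread that issued most recently.
        for k in range(1, n + 1):
            t = (last + k) % n
            if next_instr[t] < len(threads[t]) and ready_at[t] <= cycle:
                stall = threads[t][next_instr[t]]
                ready_at[t] = cycle + 1 + stall
                next_instr[t] += 1
                last = t
                break
        cycle += 1         # if no thread was ready, this cycle is an idle slot
    return cycle

# ASSUMED workloads: 0 means no stall, 3 means a 3-cycle stall after the instruction.
thread_a = [0, 3, 0, 3, 0]
thread_b = [0, 0, 3, 0, 0]

interleaved = fine_grained_cycles([thread_a, thread_b])
back_to_back = fine_grained_cycles([thread_a]) + fine_grained_cycles([thread_b])
print(interleaved, back_to_back)
```

In this example the interleaved run takes 12 cycles against 19 for running the threads one after the other. Thread A alone would finish in 11 cycles, so it is slightly delayed when sharing the processor, which is the disadvantage noted above.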


Most execution units in superscalar are idle

[Figure: observation for an 8-way superscalar. Source: Tullsen, Eggers, and Levy, “Simultaneous Multithreading: ...”]


Chip Multiprocessing (CMP): i7, Power4, without SMT

• Sends threads / processes to multiple processors
– Reduces horizontal waste
– But leaves vertical waste
– POWER 5 uses SMT

[Figure: Issue slots, with issue width on the horizontal axis and time (processor cycles) on the vertical axis.]
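The two kinds of waste named above can be counted directly from an issue trace. The sketch below uses the usual definitions: vertical waste is every slot of a cycle in which nothing issues, and horizontal waste is the unused slots of a cycle in which at least one instruction issues. The function name and the example trace are assumptions.

```python
def issue_waste(slots_used_per_cycle, issue_width):
    """Return (vertical_waste, horizontal_waste) in issue slots."""
    vertical = sum(issue_width for used in slots_used_per_cycle if used == 0)
    horizontal = sum(issue_width - used for used in slots_used_per_cycle if used > 0)
    return vertical, horizontal

# ASSUMED trace: instructions issued in each of 6 cycles on a 4-wide machine.
trace = [3, 0, 1, 0, 4, 2]
vertical, horizontal = issue_waste(trace, issue_width=4)
print("vertical:", vertical, "horizontal:", horizontal)
```

Here 8 slots are lost to fully idle cycles and 6 to partly filled cycles, out of 24 total slots.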


IBM Power 4, 1st CMP (2000)

• 2 64-bit cores
• Single-threaded predecessor to Power 5
• 8 execution units in out-of-order engine
• Each may issue an instruction each cycle

Pipeline stage abbreviations: IF = instruction fetch, IC = instruction cache, BP = branch predict, D0 = decode stage 0, Xfer = transfer, GD = group dispatch, MP = mapping, ISS = instruction issue, RF = register file read, EX = execute, EA = compute address, DC = data caches, F6 = six-cycle floating-point execution pipe, Fmt = data format, WB = write back, and CP = group commit.


Power4 Core


Power4 Pipeline: Instruction fetch, group, crack

• Group up to 5 instructions
– Up to 8 instructions fetched from cache
– Instructions cracked and formed into groups of 1 to 5 instructions
– Complex instructions are broken into simpler ones
– Cracked instruction: broken into 2 internal instructions, e.g., load multiple word
– Millicoded instruction: broken into more than 2 internal instructions
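Group formation as described above can be sketched as greedy packing of internal instructions into groups of at most 5. This is a deliberately simplified illustration, not the Power4 rules: it assumes the internal instructions of one cracked instruction must stay in the same group, and it ignores any slot restrictions. The function name and the instruction names are hypothetical.

```python
def form_groups(instructions, max_group_size=5):
    """instructions: list of (name, n_internal_ops). Returns a list of groups."""
    groups, current, used = [], [], 0
    for name, n_ops in instructions:
        if n_ops > max_group_size:
            # Very long millicoded sequences are outside this sketch.
            raise ValueError("instruction does not fit in one group")
        if used + n_ops > max_group_size:
            groups.append(current)      # close the current group, start a new one
            current, used = [], 0
        current.append(name)
        used += n_ops
    if current:
        groups.append(current)
    return groups

# ASSUMED stream: simple instructions use 1 internal op, the cracked one uses 2.
stream = [("add", 1), ("load", 1), ("cracked_op", 2),
          ("sub", 1), ("mul", 1), ("cmp", 1)]
print(form_groups(stream))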


Power4 Pipeline (group dispatch, GD)

• Dispatch: send instruction groups to the issue queues in order
– Instruction dependencies determined
– Internal resources assigned: issue queue slots, rename registers, load / store reorder queues (GD and MP stages)
– Group control information kept in the GCT, the Global Completion Table (20 groups) [ROB]


Power4 Pipeline ( group dispatch – one group / cycle)

• Instructions in a group go to separate issue queues: floating-point, branch execution, fixed-point, and load/store units.

• Fixed-point (integer) and load/store units share common issue queues.

• Issue stage (ISS): instructions that are ready to execute are pulled out of the issue queues.


Power4 Pipeline

• Instruction execution (EX), speculation, rename resources (GPRs from 32 to 80)

• Branch prediction (BP)
– Conditional branches are predicted; instructions are fetched and speculatively executed
– 3 history tables used
– If the prediction is correct, processing continues
– Else, instructions are flushed and instruction fetching is redirected


Power 5 = SMT + Power 4


3/1/2010

[Figure: Power 4 and Power 5 pipelines compared. Power 5 has 2 fetch (PC), 2 initial decodes, and 2 commits (architected register sets).]


Power 5 data flow ...

Why only 2 threads? With 4, the shared resources (physical registers, cache, memory bandwidth) would become the bottleneck.


Simultaneous Multi-threading ...

[Figure: Two issue diagrams over cycles 1 to 9 for the 8 execution units (M, M, FX, FX, FP, FP, BR, CC): one thread on 8 units, and two threads on 8 units.]

M = Load/Store, FX = Fixed Point, FP = Floating Point, BR = Branch, CC = Condition Codes


Simultaneous Multithreading (SMT)

• SMT: builds on a dynamically scheduled processor
– Large register set can hold the registers of independent threads
– Register renaming provides unique register identifiers, so instructions from multiple threads can be mixed in the datapath without confusing sources and destinations across threads
– Out-of-order completion allows the threads to execute out of order and get better utilization of the HW
• Add a per-thread renaming table and separate PCs
– Independent commit; logically keep a separate reorder buffer for each thread
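The per-thread renaming idea above can be sketched with one rename table per thread drawing from a single shared pool of physical registers. This is a minimal illustration under stated assumptions: the class and method names are hypothetical, registers are never freed, and running out of physical registers (which would stall a real machine) is not modeled.

```python
class SmtRenamer:
    """Per-thread rename tables over one shared physical register pool."""

    def __init__(self, n_threads, n_physical):
        self.free = list(range(n_physical))              # shared free list
        self.tables = [dict() for _ in range(n_threads)]  # one table per thread

    def rename_dest(self, thread, arch_reg):
        """Allocate a fresh physical register for a destination register."""
        phys = self.free.pop(0)   # raises IndexError if the pool is exhausted
        self.tables[thread][arch_reg] = phys
        return phys

    def lookup_src(self, thread, arch_reg):
        """Find the physical register currently holding a source register."""
        return self.tables[thread][arch_reg]

renamer = SmtRenamer(n_threads=2, n_physical=8)
p0 = renamer.rename_dest(0, "r1")   # thread 0 writes its r1
p1 = renamer.rename_dest(1, "r1")   # thread 1 writes its own r1
print(p0, p1, renamer.lookup_src(0, "r1"), renamer.lookup_src(1, "r1"))
```

Both threads use the architectural name r1, yet each maps to a different physical register, so their instructions can share the datapath without confusing sources and destinations.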


Changes from Single thread to SMT

• Second Program Counter (PC) added to fetch the 2nd thread
• GPR/FPR rename mapper expanded to map a second set of registers (a bit indicates the thread)
• Completion logic replicated to track two threads
• Thread bit added to most address/tag buses


Changes in Power 5 to support SMT

• Increased associativity of the L1 instruction cache and the instruction address translation buffers (ITLB)
• Added load and store queues per thread
• Increased L2 and L3 size (1.92 vs. 1.44 MB)
• Separate instruction prefetch and buffering per thread
• Increased number of virtual registers (rename registers) from 152 to 240
• Increased the size of issue queues
• Power5 core is 24% larger than the Power4 core to support SMT


SMT Design Issues

• SMT: what is the impact on single-thread performance?
• Larger register file needed to hold multiple contexts
• Clock cycle time, especially in:
– Instruction issue: more candidate instructions need to be considered
– Instruction completion: choosing which instructions to commit is challenging
• Cache and TLB conflicts generated by SMT degrade performance


Resource Sharing: effects

• Threads share many resources
– GCT, BHT, TLB, ...
• Higher performance when resources are balanced across threads
• Drifting to extremes reduces performance

Solution: dynamically adjust resource utilization


Power 5 thread performance / priority..

Relative priority of each thread is hardware controlled

For balanced operation, both run slower than if threads “owned” the machine.


Thread priority control (cont'd)

• Unbalanced execution is desirable if
– No work for the opposite thread
– Thread is spin-waiting on a lock
– Software determines a non-uniform balance
– Power management
• Solution: control the instruction decode rate
– Software/hardware controls 8 priority levels for each thread
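One way to picture how a priority difference could translate into decode rate is the sketch below. The policy is an assumption made purely for illustration and is not taken from the slides: with priorities X and Y, let R = 2^(|X-Y|+1), and give the lower-priority thread 1 of every R decode cycles. The function name `decode_share` is hypothetical.

```python
def decode_share(prio_a, prio_b):
    """Fraction of decode cycles for threads A and B under the ASSUMED policy
    R = 2 ** (|prio_a - prio_b| + 1), with the lower-priority thread
    receiving 1 of every R decode cycles."""
    r = 2 ** (abs(prio_a - prio_b) + 1)
    if prio_a == prio_b:
        return 0.5, 0.5                 # R = 2: the threads simply alternate
    low, high = 1 / r, 1 - 1 / r
    return (high, low) if prio_a > prio_b else (low, high)

for pair in ((4, 4), (6, 4), (1, 7)):
    print(pair, decode_share(*pair))
```

Under this assumed policy, equal priorities split decode cycles evenly, and each extra level of priority difference halves the share of the lower-priority thread.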


Dynamic Thread Switching

• Used if no task is ready for the second thread to run

• All machine resources allocated to one thread

• Software initiated

• Dormant thread awakens on

–External interrupt

–Decrementer Interrupt

–Special Instruction from active thread


Single Thread Operation

• For execution-unit-limited applications
– Floating-point or fixed-point intensive workloads
• Execution-unit-limited applications provide minimal performance leverage for SMT
– Higher performance benefit when resources are dedicated to a single thread
• Determined dynamically on a per-processor basis


Initial Performance of SMT

• Pentium 4 Extreme SMT yields a 1.01 speedup for the SPECint_rate benchmark and 1.07 for SPECfp_rate
– Pentium 4 is dual-threaded SMT
– SPECRate requires that each SPEC benchmark be run against a vendor-selected number of copies of the same benchmark
• Running on Pentium 4, each of 26 SPEC benchmarks paired with every other (26² runs): speedups from 0.90 to 1.58; average was 1.20
• Power 5, 8-processor server: 1.23 faster for SPECint_rate with SMT, 1.16 faster for SPECfp_rate
• Power 5 running 2 copies of each app: speedup between 0.89 and 1.41
– Most gained some
– Fl.Pt. apps had most cache conflicts and least gains


Limits to ILP

• Doubling issue rates above today's 3-6 instructions per clock, say to 6 to 12 instructions, probably requires a processor to
– issue 3 or 4 data memory accesses per cycle,
– resolve 2 or 3 branches per cycle,
– rename and access more than 20 registers per cycle, and
– fetch 12 to 24 instructions per cycle.
• The complexities of implementing these capabilities are likely to mean sacrifices in the maximum clock rate
– E.g., the widest-issue processor is the Itanium 2, but it also has the slowest clock rate, despite the fact that it consumes the most power!


Limits to ILP

• Most techniques for increasing performance increase power consumption
• The key question is whether a technique is energy efficient: does it increase power consumption faster than it increases performance?
• Multiple-issue processor techniques are all energy inefficient:
1. Issuing multiple instructions incurs some overhead in logic that grows faster than the issue rate grows
2. Growing gap between peak issue rates and sustained performance
• Number of transistors switching = f(peak issue rate), and performance = f(sustained rate); the growing gap between peak and sustained performance increases energy per unit of performance


Commentary

• Itanium architecture does not represent a significant breakthrough in scaling ILP or in avoiding power / complexity consumption problems
• Instead of more ILP, architects are focusing on TLP implemented with CMP
• IBM announced Power4, the 1st commercial CMP, = 2 Power3 processors + L2 cache
– Sun Microsystems and Intel have switched to CMP rather than aggressive uniprocessors.
• Right balance of ILP and TLP is not clear
– Good for servers, which can exploit more TLP
– On the desktop, single-thread performance is a primary requirement


And in conclusion …

• Limits to ILP (power efficiency, compilers, dependencies …) seem to limit to 3 to 6 issue for practical options

• Explicitly parallel (Data level parallelism or Thread level parallelism) is next step to performance

• Coarse-grained vs. fine-grained multithreading
– Switch only on a big stall vs. every clock cycle

• Simultaneous Multithreading is fine-grained multithreading based on a superscalar microarchitecture
– Instead of replicating registers, reuse rename registers


Power Storage Hierarchy


Power Storage Hierarchy

• Hardware data prefetch
– Hardware prefetches data from L2, L3 & memory: hides memory latency by transparently loading the L1 data cache
– Triggered by data cache line misses
• L1 prefetches 1 cache line ahead
• L2 prefetches 5 cache lines ahead
• L3 prefetches 17 to 20 lines ahead


Moore’s Law reinterpreted

• Number of cores per chip will double every two years

• Clock speed will not increase (possibly decrease)

• Need to deal with systems with millions of concurrent threads

• Need to deal with inter-chip parallelism as well as intra-chip parallelism


Intel’s Hyper-threading technology is SMT: Pentium 4 (Xeon)

• Executes two tasks simultaneously
– Two different applications
– Two threads of the same application
• CPU maintains architecture state for two processors
– Two logical processors per physical processor
• Implemented on Intel® Xeon™ and most Pentium 4
– Two logical processors for < 5% additional die area
– Power-efficient performance gain


Resources are shared not replicated


Multithreaded Microarchitecture

• Dedicated local context per running thread
• Efficient resource sharing
– Time sharing
– Space sharing
• Fast thread synchronization / communication
– Explicit instructions
– Implicit via shared registers / cache / buffer


Changes needed for Hyper-threading (Pentium 4)

• Replicate
– All per-CPU architectural state
– Instruction pointers, renaming logic
– Other: ITLB, return stack predictor, ...
• Partition resources (share by splitting in half per thread)
– Several buffers: re-order buffer, load/store buffers, queues
• Share
– Out-of-order execution engine
– Caches


P4 Out-of-order Execution pipeline


P4 Hyper-threaded pipeline


Pentium-4 Hyperthreading Front End

Resource divided between logical CPUs

Resource shared between logical CPUs


Thread selection points


Icount Choosing Policy

Fetch from thread with the least instructions in flight.
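The policy stated above fits in a few lines. The sketch assumes the fetch logic can see a per-thread count of in-flight instructions; the function name `icount_pick` and the tie-break toward the lowest thread id are illustrative assumptions.

```python
def icount_pick(in_flight):
    """in_flight: dict mapping thread id -> instructions currently in flight.
    Returns the thread to fetch from next (fewest in flight wins).
    ASSUMPTION: ties go to the lowest thread id."""
    return min(in_flight, key=lambda t: (in_flight[t], t))

# Thread 1 has fewer instructions in flight, so it is chosen for fetch.
print(icount_pick({0: 12, 1: 5}))
```

Favoring the thread with the fewest instructions in flight tends to keep a stalled thread from filling the shared queues.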


All caches are shared

• Execution trace cache
• L1 Data
• L2 Unified
• L3 Unified


Data in Caches can be shared

• L1 Data

• L2 unified

• L3 unified


Operating system manages tasks

• Schedule tasks on logical processors

• Executes HALT if a logical processor is idle




Hyper-threading technology

• Significant new technology direction for Intel’s future CPUs

• Exploits parallelism in today’s applications and usage
– Two logical processors on one physical processor

• Accelerates performance for low silicon and power costs

• Implemented in Xeon MP, Pentium 4, Itanium 2


Multicore & Manycore

• Revolution needed
• Software or architecture alone can’t fix the parallel programming problem; need innovations in both
• “Multicore”: 2X cores per generation: 2, 4, 8, ...
• “Manycore”: 100s of cores is the highest performance per unit area and per Watt, then 2X per generation: 64, 128, 256, 512, 1024, ...
• Multicore architectures & programming models good for 2 to 32 cores won’t evolve to Manycore systems of 1000s of processors. Desperately need HW/SW models that work for Manycore, or will run out of steam (as ILP ran out of steam at 4 instructions)


Summary: Multithreaded Categories

[Figure: Issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]


Cell Processor


Cell Processor Features

• 64b Power core & its L2 cache
• 8 SPEs: processing elements with local memory
• High-bandwidth interconnect bus
• Memory interface controller
• 10 simultaneous threads: 8 on SPEs + 2 on Power core
• 234M transistors, 90 nm, SOI, 8-level copper
• On-chip temperature monitored; cooling adjusted


12/10


SPE

• SPE optimized for compute-intensive applications
• Both types of processor cores share access to a common address space: main memory, and address ranges corresponding to each SPE’s local store, control registers, and I/O devices
• Simple high-speed pipeline
• Pervasive parallel computing: SIMD data-level parallelism
• 128 x 128 register file (scalar and vector)
• Optimized scalar: uses the same h/w path as vector instructions
• 256k local store (similar to but not a cache: no tags, etc.)


Cell Processor Die Photo


Synergistic Processor SPE


SPE Pipeline
