review: multiprocessor basics
CMP&SMT.1
Review: Multiprocessor Basics
Category              Type                    # of Proc
Communication model   Message passing         8 to 2048
                      Shared address, NUMA    8 to 256
                      Shared address, UMA     2 to 64
Physical connection   Network                 8 to 256
                      Bus                     2 to 36
Q1: How do they share data?
Q2: How do they coordinate?
Q3: How scalable is the architecture? How many processors?
CMP&SMT.2
CMP: Multiprocessors On One Chip
By placing multiple processors, their memories, and the interconnection network (IN) all on one chip, the latencies of chip-to-chip communication are drastically reduced
[Figure: "ARM multi-chip core" block diagram. Labels: configurable between 1 & 4 symmetric CPUs, each with L1$s and a CPU interface; Snoop Control Unit; Interrupt Distributor with a configurable # of hardware interrupts; private IRQ; per-CPU aliased peripherals; private peripheral bus; CCB; I & D 64-b bus; primary AXI R/W 64-b bus; optional AXI R/W 64-b bus.]
CMP&SMT.3
Multithreading on A Chip
Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
Multithreading: increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
- Processor must duplicate the state hardware for each thread: a separate register file, PC, instruction buffer, and store buffer for each thread
- The caches and buffers can be shared (although the miss rates may increase if they are not sized accordingly)
- The memory can be shared through virtual memory mechanisms
- Hardware must support efficient thread context switching
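The split between duplicated and shared state can be sketched as a small data structure. This is a minimal illustrative model, not a description of any real design: the class names, the 32-entry register file, and the dictionary standing in for a cache are all assumptions made for the example. The point it shows is that a thread switch only changes which context is selected, with no state copied.

```python
from dataclasses import dataclass, field


@dataclass
class ThreadContext:
    """Per-thread state that the hardware duplicates (hypothetical model)."""
    pc: int = 0
    regs: list = field(default_factory=lambda: [0] * 32)  # assumed 32 registers
    inst_buffer: list = field(default_factory=list)
    store_buffer: list = field(default_factory=list)


class MultithreadedCore:
    """One core with several thread contexts and one shared cache."""

    def __init__(self, n_threads):
        self.contexts = [ThreadContext() for _ in range(n_threads)]
        self.shared_cache = {}  # shared by every thread on the core
        self.active = 0         # index of the currently selected context

    def switch_to(self, tid):
        # A thread switch only changes the selected index; nothing is copied.
        self.active = tid

    def write_reg(self, reg, value):
        self.contexts[self.active].regs[reg] = value

    def read_reg(self, reg):
        return self.contexts[self.active].regs[reg]


core = MultithreadedCore(n_threads=4)
core.write_reg(1, 111)   # thread 0 writes its own r1
core.switch_to(2)
core.write_reg(1, 222)   # thread 2 writes its own r1
core.switch_to(0)        # thread 0's r1 is untouched by thread 2
```

Because each context owns its registers, switching back to thread 0 finds its value intact, which is what makes the switch cheap.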
CMP&SMT.4
Types of Multithreading
Fine-grain: switch threads on every instruction issue
- Round-robin thread interleaving (skipping stalled threads)
- Processor must be able to switch threads on every clock cycle
- Advantage: can hide throughput losses that come from both short and long stalls
- Disadvantage: slows down the execution of an individual thread, since a thread that is ready to execute without stalls is delayed by instructions from other threads
Coarse-grain: switch threads only on costly stalls (e.g., L2 cache misses)
- Advantages: thread switching doesn’t have to be essentially free, and it is much less likely to slow down the execution of an individual thread
- Disadvantage: limited, due to pipeline start-up costs, in its ability to overcome throughput loss
  - Pipeline must be flushed and refilled on thread switches
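The two policies can be compared with a toy cycle-count model. This is a rough sketch under stated assumptions, not a model of any real pipeline: the core is single-issue, each thread is a list whose entries give the stall (in cycles) that follows each instruction, and the `long_stall` threshold and `flush_penalty` values are made up for illustration.

```python
def fine_grain(threads):
    """Round-robin issue every cycle, skipping stalled threads.
    Returns the number of cycles needed to issue every instruction."""
    n = len(threads)
    pos = [0] * n        # next instruction index per thread
    ready_at = [0] * n   # first cycle at which each thread may issue again
    cycle, last = 0, -1
    while any(pos[t] < len(threads[t]) for t in range(n)):
        for k in range(1, n + 1):
            tid = (last + k) % n
            if pos[tid] < len(threads[tid]) and ready_at[tid] <= cycle:
                stall = threads[tid][pos[tid]]
                pos[tid] += 1
                ready_at[tid] = cycle + 1 + stall
                last = tid
                break
        cycle += 1       # one issue slot per cycle, whether used or not
    return cycle


def coarse_grain(threads, long_stall=2, flush_penalty=2):
    """Stay on one thread until it hits a stall >= long_stall (or finishes),
    then switch and pay flush_penalty cycles to refill the pipeline."""
    n = len(threads)
    pos = [0] * n
    ready_at = [0] * n
    cycle, cur = 0, 0

    def has_work(t):
        return pos[t] < len(threads[t])

    while any(has_work(t) for t in range(n)):
        switch = not has_work(cur)
        if not switch:
            if ready_at[cur] > cycle:
                cycle += 1           # idle through a stall on the current thread
                continue
            stall = threads[cur][pos[cur]]
            pos[cur] += 1
            ready_at[cur] = cycle + 1 + stall
            cycle += 1
            switch = stall >= long_stall
        if switch:
            others = [(cur + k) % n for k in range(1, n)]
            nxt = next((t for t in others if has_work(t)), None)
            if nxt is not None:
                cur = nxt
                cycle += flush_penalty   # pipeline flushed and refilled
    return cycle


# Thread 0 stalls for 3 cycles after its first instruction; thread 1 never stalls.
workload = [[3, 0], [0, 0, 0]]
fine_cycles = fine_grain(workload)
coarse_cycles = coarse_grain(workload)
```

With this workload the fine-grain model finishes in fewer cycles than the coarse-grain one, because the stall is covered by the other thread at no switching cost. With a single thread the two models behave identically, since there is nothing to switch to.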
CMP&SMT.5
Multithreaded Example: Sun’s Niagara (UltraSparc T1)
Eight fine-grain multithreaded, single-issue, in-order cores (no speculation, no dynamic branch prediction)
                 Ultra III                Niagara
Data width       64-b                     64-b
Clock rate       1.2 GHz                  1.0 GHz
Cache (I/D/L2)   32K/64K/(8M external)    16K/8K/3M
Issue rate       4 issue                  1 issue
Pipe stages      14 stages                6 stages
BHT entries      16K x 2-b                None
TLB entries      128I/512D                64I/64D
Memory BW        2.4 GB/s                 ~20 GB/s
Transistors      29 million               200 million
Power (max)      53 W                     <60 W
[Figure: Niagara block diagram. Eight 4-way MT SPARC pipes, a crossbar, a 4-way banked L2$, memory controllers, and I/O shared functions.]
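A quick calculation using only the figures from the Ultra III / Niagara comparison gives a feel for the trade-off. It computes hardware thread count and peak issue bandwidth, which is an upper bound and not a measured performance figure. The dictionary layout and function name are illustrative, and treating the Ultra III as a single core is an assumption not stated in the table.

```python
# Figures taken from the Ultra III vs. Niagara comparison.
# "cores": 1 for the Ultra III is an assumption (single-core chip).
ultra3 = {"cores": 1, "issue_width": 4, "clock_ghz": 1.2}
niagara = {"cores": 8, "issue_width": 1, "clock_ghz": 1.0,
           "threads_per_core": 4}


def peak_ginstr_per_s(chip):
    """Peak issue bandwidth in billions of instructions per second.
    This is an upper bound: it ignores every kind of stall."""
    return chip["cores"] * chip["issue_width"] * chip["clock_ghz"]


# Eight cores, each 4-way multithreaded.
hw_threads = niagara["cores"] * niagara["threads_per_core"]

ultra3_peak = peak_ginstr_per_s(ultra3)
niagara_peak = peak_ginstr_per_s(niagara)
```

Under these assumptions, Niagara trades single-thread issue width for many simple cores and hardware threads, aiming to keep each single-issue pipeline busy while other threads stall.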
CMP&SMT.6
Niagara Integer Pipeline
Cores are simple (single-issue, 6 stage, no branch prediction), small, and power-efficient
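Pipeline stages: Fetch, Thrd Sel, Decode, Execute, Memory, WB
[Figure: Niagara integer pipeline. Labels: I$, ITLB, Inst buf x4, PC logic x4, Decode, RegFile x4, Thread Select Logic, ALU, Mul, Shft, Div, D$, DTLB, Stbuf x4, two Thrd Sel Muxes, Crossbar Interface. Listed alongside the thread select logic: instr type, cache misses, traps & interrupts, resource conflicts.]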
CMP&SMT.7
Simultaneous Multithreading (SMT)
A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
- Most superscalar processors have more machine-level parallelism than most programs can effectively use (i.e., more than the programs have ILP)
- With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
  - Need separate rename tables (ROBs) for each thread
  - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
Intel’s Pentium 4 SMT is called hyperthreading
- Supports just two threads (doubles the architecture state)
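The benefit of drawing instructions from several threads in the same cycle can be illustrated with a toy issue-slot model. This is a sketch under simplifying assumptions: the issue width of 4, the per-cycle counts of ready instructions, and the in-order slot-filling policy are all invented for the example, and fetch limits, renaming, and structural hazards are ignored.

```python
def issued_single(ready_per_cycle, width=4):
    """Slots filled each cycle when only one thread may issue."""
    return [min(width, r) for r in ready_per_cycle]


def issued_smt(ready_by_thread, width=4):
    """Slots filled each cycle when all threads may issue together.
    ready_by_thread holds one list per thread of ready-instruction counts."""
    result = []
    for ready in zip(*ready_by_thread):
        free = width
        for r in ready:              # fill slots thread by thread
            free -= min(free, r)
        result.append(width - free)
    return result


# Hypothetical ready-instruction counts per cycle for two threads.
thread_a = [1, 2, 0, 3]
thread_b = [2, 0, 1, 4]

alone = issued_single(thread_a)
together = issued_smt([thread_a, thread_b])
utilization_alone = sum(alone) / (4 * len(alone))
utilization_smt = sum(together) / (4 * len(together))
```

Slots that thread A leaves empty for lack of ILP are filled by thread B in the same cycle, so utilization rises without either thread needing more ILP of its own.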
CMP&SMT.8
Threading on a 4-way SS Processor Example
[Figure: issue-slot usage for Thread A, Thread B, Thread C, and Thread D. Axes: time and issue slots. Panels: SMT, Fine MT, Coarse MT.]
CMP&SMT.9
Multicore Xbox360 – “Xenon” processor
To provide game developers with a balanced and powerful platform
Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
- 165M transistors total
- 3.2 GHz Near-POWER ISA
- 2-issue, 21-stage pipeline, with 128 128-bit registers
- Weak branch prediction, supported by software hinting
- In-order instruction execution
- Narrow cores: 2 INT units, 2 128-bit VMX units, 1 of everything else
An ATI-designed 500 MHz GPU w/ 512MB of DDR3 DRAM
- 337M transistors, 10MB framebuffer
- 48 pixel shader cores, each with 4 ALUs
CMP&SMT.10
Xenon Diagram
[Figure: Xenon system block diagram. Labels: Core 0, Core 1, and Core 2, each with L1D and L1I; 1MB UL2; 512MB DRAM; GPU with BIU/IO Intf, 3D Core, 10MB EDRAM, Video Out, MC0, and MC1; Analog Chip; XMA Dec; SMC; Video Out. I/O: DVD, HDD Port, Front USBs (2), Wireless, MU ports (2 USBs), Rear USB (1), Ethernet, IR, Audio Out, Flash, Systems Control.]
CMP&SMT.11
The PS3 “Cell” Processor Architecture
Composed of a Non-SMP Architecture
- 234M transistors @ 4 GHz
- 1 Power Processing Element (PPE), 8 “Synergistic” (SIMD) Processing Elements (SPEs)
- 512KB L2 $
- Massively high bandwidth (200GB/s) bus connects it to everything else
- The PPE is strangely similar to one of the Xenon cores
  - Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
- The real differences lie in the SPEs (21M transistors each)
  - An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256KB “scratchpad” (14M transistors)
    - Direct mapped for low latency
  - 4 vector units per SPE, 1 of everything else (7M transistors)
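The transistor figures quoted for the Cell can be cross-checked with a few lines of arithmetic. The constants below come from the slide. The variable names are illustrative, and the "remaining" figure is only what is left by subtraction, not a number the slide states.

```python
# Transistor counts in millions, as quoted on the slide.
SPE_SCRATCHPAD_M = 14   # 256KB scratchpad per SPE
SPE_LOGIC_M = 7         # 4 vector units and 1 of everything else
N_SPES = 8
CHIP_TOTAL_M = 234

spe_total_m = SPE_SCRATCHPAD_M + SPE_LOGIC_M   # matches the quoted 21M per SPE
all_spes_m = N_SPES * spe_total_m

# Left over for the PPE, L2, bus, and everything else (derived, not quoted).
remaining_m = CHIP_TOTAL_M - all_spes_m
spe_share = all_spes_m / CHIP_TOTAL_M
```

By these numbers the SPEs account for well over half of the chip's transistors, and two thirds of each SPE is its scratchpad memory.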
CMP&SMT.12
How to make use of the SPEs
CMP&SMT.13
What about the Software?
Makes use of special IBM “Hypervisor”
- Like an OS for OS’s
- Runs both a real-time OS (for sound) and a non-real-time OS (for things like AI)
Software must be specially coded to run well
- The single PPE will be quickly bogged down
- Must make use of SPEs wherever possible
- This isn’t easy, by any standard
What about Microsoft?
- Development suite identifies which 6 threads you’re expected to run
- Four of them are DirectX based, and handled by the OS
- Only need to write two threads, functionally