Scalable Distributed Memory Machines
EECC756 - Shaaban, lec #12, Spring 2000, 4-25-2000
Source: meseec.ce.rit.edu/eecc756-spring2000/756-4-25-2000.pdf
Goal: Parallel machines that can be scaled to hundreds or thousands of processors.
• Design Choices:
  – Custom-designed or commodity nodes?
  – Network scalability.
  – Capability of node-to-network interface (critical).
  – Supported programming models?
• What does hardware scalability mean?
  – Avoids inherent design limits on resources.
  – Bandwidth increases with machine size P.
  – Latency should not increase with machine size P.
  – Cost should increase slowly with P.
MPPs Scalability Issues
• Problems:
  – Memory-access latency.
  – Interprocess communication complexity or synchronization overhead.
  – Multi-cache inconsistency.
  – Message-passing and message-processing overheads.
• Possible Solutions:
  – Fast, dedicated, proprietary, scalable networks and protocols.
  – Low-latency fast synchronization techniques, possibly hardware-assisted.
  – Hardware-assisted message processing in communication assists (node-to-network interfaces).
  – Weaker memory consistency models.
  – Scalable directory-based cache coherence protocols.
  – Shared virtual memory.
  – Improved software portability; standard parallel and distributed operating system support.
  – Software latency-hiding techniques.
One Extreme:
Limited Scaling of a Bus
• Bus: Each level of the system design is grounded in the scaling limits at the layers below and assumptions of close coupling between components.

  Characteristic               Bus
  Physical length              ~ 1 ft
  Number of connections        fixed
  Maximum bandwidth            fixed
  Interface to comm. medium    memory interface
  Global order                 arbitration
  Protection                   virtual -> physical
  Trust                        total
  OS                           single
  Comm. abstraction            HW
Poor Scalability
Another Extreme:
Scaling of Workstations in a LAN?
• No clear limit to physical scaling, no global order; consensus difficult to achieve.

  Characteristic               Bus                  LAN
  Physical length              ~ 1 ft               KM
  Number of connections        fixed                many
  Maximum bandwidth            fixed                ???
  Interface to comm. medium    memory interface     peripheral
  Global order                 arbitration          ???
  Protection                   virtual -> physical  OS
  Trust                        total                none
  OS                           single               independent
  Comm. abstraction            HW                   SW
Bandwidth Scalability
• Depends largely on network characteristics:
  – Channel bandwidth.
  – Static topology: node degree, bisection width, etc.
  – Multistage: switch size and connection pattern properties.
  – Node-to-network interface capabilities.
[Figure: processor–memory (P, M) nodes attached through switches S; typical switches include a bus, multiplexers, or a crossbar.]
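As a concrete illustration of the static-topology metrics named above (node degree, bisection width), the standard textbook formulas for two common networks can be encoded directly; these formulas are not taken from the slide, just from the usual definitions:

```python
import math

def hypercube_props(n):
    """Static properties of an n-node hypercube (n must be a power of 2):
    each node has log2(n) links, the longest route crosses log2(n)
    dimensions, and halving the machine cuts n/2 links."""
    d = int(math.log2(n))
    assert 2 ** d == n, "hypercube size must be a power of 2"
    return {"degree": d, "diameter": d, "bisection_width": n // 2}

def ring_props(n):
    """Static properties of an n-node ring: degree 2, diameter n//2,
    and only 2 links cross any halving cut."""
    return {"degree": 2, "diameter": n // 2, "bisection_width": 2}
```

For a 1024-node machine the hypercube's bisection width is 512 links while a ring's stays at 2 — which is why bisection width, not just per-channel bandwidth, governs whether aggregate bandwidth scales with P.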
Dancehall MP Organization
• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency?
[Figure: processors with caches (P, $) on one side of a scalable switch network; all memory modules M on the other side.]
Extremely high demands on the network in terms of bandwidth and latency, even for independent processes.
Generic Distributed Memory Organization
• Network bandwidth?
• Bandwidth demand?
  – Independent processes?
  – Communicating processes?
• Latency? O(log2 P) increase?
• Cost scalability of system?
[Figure: nodes — each an O(10)-processor bus-based SMP with processor P, cache $, memory M, and communication assist CA — connected through switches in a scalable network.]
Design questions annotated on the figure:
  – Multi-stage interconnection network (MIN)? Custom-designed?
  – Custom-designed CPU? Node/system integration level? How far: Cray-on-a-Chip? SMP-on-a-Chip?
  – OS supported? Network protocols?
  – Communication assist: extent of functionality?
  – Message transaction: DMA?
  – Global virtual shared address space?
Key System Scaling Property
• Large number of independent communication paths between nodes
  => allows a large number of concurrent transactions using different channels.
• Transactions are initiated independently.
• No global arbitration.
• The effect of a transaction is visible only to the nodes involved.
  – Effects are propagated through additional transactions.
Latency Scaling
• T(n) = Overhead + Channel Time + Routing Delay
• Scaling of overhead?
• Channel Time(n) = n/B — n-byte message over the bandwidth B at the bottleneck
• Routing Delay(h, n) — grows with the number of hops h
Network Latency Scaling Example
O(log2 n)-stage MIN built from switches; 128-byte message:
• Max distance: log2 n hops
• Number of switches: ∝ n log n
• Overhead = 1 µs, BW = 64 MB/s, 200 ns per hop
• Using pipelined (cut-through) routing:
  T64(128)   = 1.0 µs + 2.0 µs + 6 hops × 0.2 µs/hop  = 4.2 µs
  T1024(128) = 1.0 µs + 2.0 µs + 10 hops × 0.2 µs/hop = 5.0 µs
• Store-and-forward:
  T64sf(128)   = 1.0 µs + 6 hops × (2.0 + 0.2) µs/hop  = 14.2 µs
  T1024sf(128) = 1.0 µs + 10 hops × (2.0 + 0.2) µs/hop = 23.0 µs
With cut-through routing, only a 20% increase in latency for a 16× increase in machine size.
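The numbers above follow mechanically from T(n) = Overhead + Channel Time + Routing Delay; this sketch just encodes the two variants with the slide's parameters (1 µs overhead, 64 MB/s channels, 0.2 µs per hop — note that bytes divided by MB/s conveniently yields µs):

```python
def cut_through_us(msg_bytes, hops, overhead_us=1.0, bw_mb_s=64, hop_us=0.2):
    """Pipelined (cut-through) routing: the channel time n/B is paid
    once, plus a per-hop routing delay."""
    return overhead_us + msg_bytes / bw_mb_s + hops * hop_us

def store_forward_us(msg_bytes, hops, overhead_us=1.0, bw_mb_s=64, hop_us=0.2):
    """Store-and-forward: the full channel time is paid at every hop."""
    return overhead_us + hops * (msg_bytes / bw_mb_s + hop_us)

# 128-byte message; 6 hops on a 64-node MIN, 10 hops on a 1024-node MIN
for hops in (6, 10):
    print(cut_through_us(128, hops), store_forward_us(128, hops))
```

This reproduces the slide's four figures (4.2 and 5.0 µs cut-through; 14.2 and 23.0 µs store-and-forward) and makes the scaling argument visible: hops multiply only the 0.2 µs term under cut-through, but the full 2.2 µs term under store-and-forward.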
Cost Scaling
• cost(p, m) = fixed cost + incremental cost(p, m)
• Bus-based SMP?
• Ratio of processors : memory : network : I/O?
• Parallel efficiency(p) = Speedup(p) / p
• Similar to speedup, one can define:
  Costup(p) = Cost(p) / Cost(1)
• Cost-effective: Speedup(p) > Costup(p)
Cost Effective?
[Plot: Speedup and Costup vs. number of processors P, for P from 0 to 2048.]
Speedup = P / (1 + log P)
Costup = 1 + 0.1 P
2048 processors: 475-fold speedup at 206× the cost.
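The slide's claim can be reproduced directly from the two formulas. One assumption is needed: the 475-fold figure only comes out if the log is taken base-10, which the slide does not state but the arithmetic implies:

```python
import math

def speedup(p):
    # Speedup = P / (1 + log10 P) -- base 10 inferred from the slide's numbers
    return p / (1 + math.log10(p))

def costup(p):
    # Costup = 1 + 0.1 P: fixed cost of 1 plus incremental cost per processor
    return 1 + 0.1 * p

p = 2048
print(round(speedup(p)), round(costup(p)))  # 475 206
print(speedup(p) > costup(p))               # True: cost-effective by the slide's test
```

Even though efficiency at 2048 processors is only about 23%, the machine is still cost-effective because cost also grows sublinearly relative to the single-processor baseline: speedup(p) > costup(p).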
Physical Scaling
• Chip-level integration:
  – Integrate network interface, message router, I/O links.
  – Memory/bus controller/chip set.
  – IRAM-style Cray-on-a-Chip.
  – Future: SMP on a chip?
• Board-level integration:
  – Replicating standard microprocessor cores.
    • CM-5 replicated the core of a Sun SPARCstation 1 workstation.
    • Cray T3D and T3E replicated the core of a DEC Alpha workstation.
• System-level integration:
  • IBM SP-2 uses 8–16 almost-complete RS6000 workstations placed in racks.
Chip-Level Integration Example:
nCUBE/2 Machine Organization
• Entire machine synchronous at 40 MHz.
• Single-chip node: router with 13 links and DMA channels, DRAM interface, MMU, I-fetch & decode, operand cache, and a 64-bit integer / IEEE floating-point execution unit — about 500,000 transistors (large at the time).
• Basic module: 64 nodes socketed on a board.
• Hypercube network configuration; 13 links per node make up to 8192 nodes possible.
Board-Level Integration Example:
CM-5 Machine Organization
[Figure: processing partitions, control processors, and an I/O partition joined by three networks — a data network, a control network, and a diagnostics network — arranged as a fat tree.]
Node: SPARC processor with $ ctrl / SRAM cache, FPU, and network interface (NI) on the MBUS; DRAM controllers and vector units attached to the DRAM banks.
The design replicated the core of a Sun SPARCstation 1 workstation.
System-Level Integration Example:
IBM SP-2
[Figure: SP-2 node — POWER2 CPU with L2 $, memory controller, and 4-way interleaved DRAM on the memory bus; a NIC with i860 NI and DMA on the MicroChannel I/O bus; a general interconnection network formed from 8-port switches.]
8–16 almost-complete RS6000 workstations placed in racks.
Realizing Programming Models:
Realized by Protocols
[Layer diagram, top to bottom:]
  Parallel applications: CAD, database, scientific modeling
  Programming models: multiprogramming, shared address, message passing, data parallel
  Communication abstraction — user/system boundary: compilation or library, operating systems support
  — hardware/software boundary —
  Communication hardware
  Physical communication medium
Network transactions carry the communication abstraction across the boundary.
Challenges in Realizing Programming Models in Large-Scale Machines
• No global knowledge and no global control.
  – Barriers, scans, reduce, global-OR give only a fuzzy global state.
• Very large number of concurrent transactions.
• Management of input buffer resources:
  – Many sources can issue a request and over-commit the destination before any see the effect.
• Latency is large enough that one is tempted to "take risks":
  – Optimistic protocols.
  – Large transfers.
  – Dynamic allocation.
• Many more degrees of freedom in the design and engineering of these systems.
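The global operations named above (reduce, global-OR) sidestep the lack of global arbitration by combining values pairwise up a tree, one level per round; a minimal software sketch of the idea — the pairing scheme here is illustrative, not any particular machine's:

```python
def tree_combine(values, op):
    """Combine n values in ceil(log2 n) pairwise rounds, the way a
    hardware combining tree (e.g. a control network) would."""
    vals = list(values)
    while len(vals) > 1:
        # each round pairs neighbors; an odd element passes through
        vals = [op(vals[i], vals[i + 1]) if i + 1 < len(vals) else vals[i]
                for i in range(0, len(vals), 2)]
    return vals[0]

# global-OR: "is any node still busy?"
print(tree_combine([False, False, True, False], lambda a, b: a or b))  # True
```

Even so, the result is only a snapshot: by the time it reaches every node, any node may already have changed its local state — which is exactly the "fuzzy global state" the slide refers to.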
Network Transaction Processing
Key design issues:
• How much interpretation of the message by the CA without involving the CPU?
• How much dedicated processing in the CA?
[Figure: two nodes — each with processor P, memory M, and communication assist CA — on a scalable network. A message undergoes output processing (checks, translation, formatting, scheduling) at the source CA and input processing (checks, translation, buffering, action) at the destination CA.]
CA = Communication Assist
Spectrum of Designs
Ordered by increasing HW support, specialization, intrusiveness, and performance (???):
• None: physical bit stream
  – Blind, physical DMA: nCUBE, iPSC, ...
• User/system
  – User-level port: CM-5, *T
  – User-level handler: J-Machine, Monsoon, ...
• Remote virtual address
  – Processing, translation: Paragon, Meiko CS-2
• Global physical address
  – Processor + memory controller: RP3, BBN, T3D
• Cache-to-cache
  – Cache controller: Dash, KSR, Flash
No CA Net Transaction Interpretation:
Physical DMA
• DMA controlled by registers; generates interrupts.
• Physical addresses => OS initiates transfers.
• Send side:
  – Construct a system "envelope" (sender auth, dest addr) around user data in the kernel area.
• Receive side:
  – Must receive into a system buffer, since there is no message interpretation in the CA.
[Figure: sender and receiver nodes (P, Memory) with DMA channels described by Cmd, Dest, Data, Addr, Length, Rdy registers; the receive side raises status/interrupt.]
nCUBE/2 Network Interface
• Independent DMA channel per link direction.
  – Leave input buffers always open.
  – Segmented messages.
• Routing determines whether a message is intended for the local node or a remote node.
  – Dimension-order routing on the hypercube.
  – Bit-serial with 36-bit cut-through.
[Figure: processor and memory on the memory bus; a switch with input and output ports, each backed by an Addr/Length DMA channel descriptor.]
Measured overheads (including interrupt):
  Os (send)    16 instructions, 260 cycles, 13 µs
  Or (receive) 18 instructions, 200 cycles, 15 µs
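Dimension-order (e-cube) routing, mentioned above, lets each hop decide where to forward with nothing more than an XOR of the local and destination addresses; a minimal sketch of the standard algorithm (my own formulation, not the nCUBE's actual logic):

```python
def next_hop(node, dest):
    """One e-cube routing step on a hypercube: flip the lowest-order
    address bit in which the current node and destination differ."""
    diff = node ^ dest
    if diff == 0:
        return node            # arrived: deliver locally
    lowest = diff & -diff      # isolate the lowest differing dimension
    return node ^ lowest       # forward along that dimension's link

def route(src, dest):
    """Full path taken by a message; its length is the Hamming
    distance between src and dest, at most log2(n) hops."""
    path = [src]
    while path[-1] != dest:
        path.append(next_hop(path[-1], dest))
    return path

print(route(0b000, 0b101))  # [0, 1, 5]: correct dimension 0, then dimension 2
```

Because dimensions are always corrected in a fixed order, the routes are deterministic and deadlock-free — a natural fit for a per-link DMA-channel interface like the one on this slide.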
DMA in Conventional LAN Network Interfaces
[Figure: host memory holds linked descriptor chains (Addr, Len, Status, Next) and data buffers for TX and RX; the NIC controller, with DMA addr/len/trncv registers, sits on the I/O bus, bridged from the memory bus and processor.]
User-Level Ports
• Initiate transaction at user level.
• The CA interprets and delivers the message to the user without OS intervention.
• Network port in user space.
• User/system flag in envelope:
  – Protection check, translation, routing, media access in the source CA.
  – User/system check in the destination CA; interrupt on system messages.
[Figure: user-level send port (dest, data, user/system flag) and receive port (status, interrupt) mapped into each node's processor-memory system.]
User-Level Network Example: CM-5
• Two data networks and one control network.
• Input and output FIFO for each network.
• Tag per message:
  – Indexes a Network Interface (NI) mapping table.
• *T integrated the NI on chip.
• Also used in iWARP.
[Figure: CM-5 node and network organization as shown on the board-level integration slide.]
Measured overheads:
  Os (send)    50 cycles, 1.5 µs
  Or (receive) 53 cycles, 1.6 µs
  Interrupt    10 µs
User-Level Handlers
• Tighter integration of the user-level network port with the processor, at the register level.
• Hardware support to vector to the address specified in the message.
  – Message ports in registers.
[Figure: as in the user-level port case, but the destination, data, and handler address move through processor registers (user/system flag in envelope).]
iWARP
• Nodes integrate communication with computation on a systolic basis.
• Message data goes directly to a register.
• Streams into memory.
[Figure: host connected through an interface unit.]
Dedicated Message Processing Without Specialized Hardware Design
• A general-purpose processor performs arbitrary output processing (at system level).
• A general-purpose processor interprets incoming network transactions (at system level).
• User processor <–> message processor: share memory.
• Message processor <–> message processor: via system network transactions.
[Figure: each node is a bus-based SMP with a user processor P, a message processor MP, memory, and an NI on the network.]
Levels of Network Transaction
• The user processor stores cmd / msg / data into a shared output queue.
  – Must still check for output queue full (or make it elastic).
• Communication assists make the transaction happen.
  – Checking, translation, scheduling, transport, interpretation.
• Effect observed on the destination address space and/or through events.
• Protocol divided between the two layers.
[Figure: same bus-based SMP node organization as the previous slide.]
Example: Intel Paragon
[Figure: each node holds two i860 XP processors (50 MHz, 16 KB 4-way $, 32 B block, MESI) — a compute processor and a message processor running the MP handler — sharing memory over a 64-bit, 400 MB/s bus with send/receive DMA (sDMA/rDMA) and an NI. Nodes, service nodes, and I/O nodes with devices connect through a network with 175 MB/s duplex, 16-bit links; messages carry a route (rte), variable data, and an EOP marker, up to 2048 B.]
Message Processor Events
[Figure: the message processor's dispatcher reacts to events — compute-processor kernel/system events, user output queues, send FIFO ~empty, receive FIFO ~full, send DMA, receive DMA, DMA done.]
Message Processor Assessment
• Concurrency-intensive:
  – Need to keep inbound flows moving while outbound flows are stalled.
  – Large transfers are segmented.
• Reduces overhead but adds latency.
[Figure: dispatcher with user output and input queues (in the virtual address space, VAS), send/receive FIFOs, send/receive DMA, DMA-done and compute-processor kernel/system events.]