TRANSCRIPT
Distributed-Memory Multicomputers
Prof. Sivarama Dandamudi
School of Computer Science
Carleton University
Roadmap
  Cray T3E
    Architecture details on the video
  NCUBE
    Communication primitives
    Binary collapsing
  Job scheduling
    Space-sharing
    Time-sharing
    Hybrid
  Hierarchical scheduling
    Performance
Cray T3E
Distributed-memory MIMD system
  Predecessor model: T3D
  Named after the interconnect used: a 3-D torus (the video gives details)
The T3E followed the T3D
  Systems with up to 126 processors: air cooled
  Larger systems (up to 2048 processors): liquid cooled
  Uses the DEC Alpha 21164A RISC processor
Cray T3E (cont’d)
Each node consists of a processing element (PE): a processor, memory, and a communication engine
  The communication engine takes care of the communication between PEs
Memory
  256 or 512 MB local memory (50 ns DRAM) per node
  Total memory: 10 GB to 1 TB
  Cache coherent, physically distributed, and globally addressable
  SECDED data error protection
  Peak memory bandwidth: 1200 MB/s per PE
All I/O channels are accessible and controllable from all PEs
Cray T3E (cont’d)
I/O is done through GigaRing channels
  Each I/O channel uses a dual-ring technique: two rings with data moving in opposite directions
  Maximum bandwidth of 500 MB/s
Processor: DEC Alpha 21164A (EV5.6) at 675 MHz
  Superscalar RISC processor, 2 floating-point operations/cycle
  32- and 64-bit IEEE arithmetic
  1350 MFLOPS per PE (peak, i.e., 675 MHz x 2 flops/cycle) and 2700 MIPS per PE
Cray T3E (cont’d)
Interconnection
  Uses a 3-D torus interconnect (like the T3D)
  Peak bisection bandwidth: 42 GB/s (64 PEs), 166 GB/s (512 PEs)
Module
  8 PEs per module, one I/O interface per module
System size
  40 to 2176 PEs per system, in increments of 8 PEs
  Peak performance: 54 GFLOPS to 3 TFLOPS
Cray T3E (cont’d)
Supports explicit as well as implicit parallelism
Explicit methods
  CF90 and C/C++, PVM, MPI, SHMEM (a minimal MPI sketch follows below)
Implicit methods
  HPF
  Cray CRAFT work-sharing features
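To make the explicit message-passing style concrete, the following is a minimal sketch using MPI's standard Fortran bindings. It is not taken from the slides and is not T3E-specific; the program and variable names are invented for the example.

      PROGRAM PSUM
C     Minimal explicit-parallelism sketch using MPI (illustrative only).
C     Each process contributes one value; MPI_REDUCE sums the values
C     across all processes and leaves the result on rank 0.
      INCLUDE 'mpif.h'
      INTEGER IERR, RANK, NPROC
      REAL LOCAL, GLOBAL
      CALL MPI_INIT(IERR)
      CALL MPI_COMM_RANK(MPI_COMM_WORLD, RANK, IERR)
      CALL MPI_COMM_SIZE(MPI_COMM_WORLD, NPROC, IERR)
      LOCAL = REAL(RANK)**2
      CALL MPI_REDUCE(LOCAL, GLOBAL, 1, MPI_REAL, MPI_SUM, 0,
     +                MPI_COMM_WORLD, IERR)
      IF (RANK .EQ. 0) PRINT *, 'Sum over', NPROC, 'ranks =', GLOBAL
      CALL MPI_FINALIZE(IERR)
      END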
NCUBE
Uses a binary hypercube as the interconnect
We look at the NCUBE/ten
  Uses a 10-D hypercube: 1024 32-bit processors
  Custom-made processors
  128 KB memory per node (small by current standards)
NCUBE (cont’d)
Each printed circuit board (16" x 22") holds
  64 processors
  Their memories
  The interconnect
The total system is composed of
  16 processor boards
  Up to 8 I/O boards
The entire system is housed in a small air-cooled enclosure
NCUBE (cont’d)
Inter-PCB communication requires 640 connections per board
  Each node has 4 off-board bi-directional channels: 64 * 4 * 2 = 512 wires
  Each node has one bi-directional I/O channel: 64 * 2 = 128 wires
  Total wires: 512 + 128 = 640
Links are bit-serial to conserve pins/connections
NCUBE (cont’d)
Communication
  Channels operate at 10 MHz with a parity check
  Data transfer rate = 1 MB/s in each direction (on a 10 MHz bit-serial link, roughly 10 bit times per byte once parity and framing are counted)
  Each channel has two 32-bit write-only registers
    One for the message buffer address
    One for the byte count, which indicates the number of bytes left to send/receive
  A ready flag and an interrupt enable flag for each channel
NCUBE (cont’d)
Communication primitives
nwrite
  To send a message

      nwrite (message, length, dest, type, status, error)

  status = indicates when the message has left the buffer (the buffer is then reusable)
  error = error code
NCUBE (cont’d)
nread
  To receive a message (a blocking receive)

      nread (message, length, source, type, status, error)

  source = 1 (wildcard)
  type = 1 (wildcard)
  When both are given the wildcard value, the call receives the next message, from any source and of any type
NCUBE (cont’d)
NCUBE example
  Computes the sum of squares Sum(V[i]^2) over K elements
  K = N * 2^M, where the number of nodes in the cube is 2^M
  Each node receives N elements from the host (e.g., with M = 3 there are 8 nodes, so K = 80 elements means N = 10 per node)
  The final sum is returned to the host
  Uses binary collapsing, working on one dimension at a time
NCUBE (cont’d)
NCUBE example

      call whoami(PN, PROC, HOST, M)

  PN = logical processor #
  PROC = process # in the node
  HOST = host for cube communication
  M = dimension of the allocated cube

      SR = nread(V, N*4, HOST, TYPEH, FLAG1)

  Receive vector V of length N (N*4 bytes) from HOST
NCUBE (cont’d)
      S = 0
      DO 1 I = 1, N
    1 S = S + V(I)**2

  Local computation: the above loop is executed by each node
  Once done, we use binary collapsing to compute the final sum
NCUBE (cont’d)
Binary collapsing (.NEQV. is used here as an XOR operator on the node numbers)

      DO 2 I = M, 1, -1
        IF (PN .LT. 2**I) THEN
          NPN = PN .NEQV. (2**(I-1))
          IF (NPN .LT. PN) THEN
            SW = nwrite (S, 4, NPN, TYPEN, FLAG2)
          ELSE
            SR = nread (A, 4, NPN, TYPEN, FLAG3)
            S = S + A
          ENDIF
        ENDIF
    2 CONTINUE
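As a worked trace (not on the slides, but derived directly from the loop above): with M = 3 and nodes numbered 0-7, the first pass (I = 3) has nodes 4-7 send their partial sums to nodes 0-3; the second pass (I = 2) has nodes 2 and 3 send to nodes 0 and 1; the final pass (I = 1) has node 1 send to node 0, which then holds the complete sum. Each pass halves the number of active nodes, so the reduction takes M communication steps.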
NCUBE (cont’d)
Send the final result back to the host

      IF (PN .EQ. 0) THEN
        SW = nwrite (S, 4, HOST, TYPEH, FLAG4)
      ENDIF

  This code is executed by node 0 only
Scheduling in Multicomputers
Principle (in the absence of priority)
  Share processing power equally among the jobs
Uniprocessors
  Round-robin/processor sharing
Multicomputers
  Equal sharing can be done
    Spatially: space-sharing policies, or
    Temporally: time-sharing policies
Space-Sharing Policies
Space-sharing policies
  The system is divided into several partitions
  Each partition is assigned to a parallel job
  The assigned job keeps the partition until completion (a run-to-completion strategy)
Three types of policies
  Fixed
  Static
  Dynamic
Space-Sharing Policies (cont’d)
Fixed space-sharing
  Partitioning is a system configuration parameter (a long-term decision)
  Job characteristics can be used to choose the partition sizes
    Maximum job parallelism
    Average job parallelism
  A partition is kept by the job until completion
Advantage
  Simple implementation
But it is not the best way; it has several problems
Space-Sharing Policies (cont’d)
Problems with fixed space-sharing
  Difficult to partition the system: what is the best partition?
  Does not adapt to system load conditions and the resource requirements of jobs
  Internal fragmentation (leaving some allocated processors idle)
    Example: allocating 50 processors to a job that requires only 40
    Leads to under-utilization of resources; in this example, 10 processors sit idle
Space-Sharing Policies (cont’d)
Static space-sharing
  Partitions are allocated on a job-by-job basis at schedule time
  No pre-set partitions as in fixed space-sharing
  Eliminates the mismatch between the partition size a job needs and the partition size it is allocated
  As in fixed policies, the partition is kept until the job completes
Advantages
  Internal fragmentation is avoided
  Better than fixed space-sharing
Space-Sharing Policies (cont’d)
Problems with static space-sharing
  External fragmentation is possible
    We can reduce it by using first-fit or best-fit allocation (a first-fit sketch follows below)
    A related problem: fairness
  Another solution [Tucker and Gupta 1989]: adjust the software structure to fit the partition size
    Suitable for some applications
    Not suitable for applications that require the partition size at compile time in order to optimize the code
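As a rough illustration of the first-fit idea mentioned above (not from the slides; the array FREE, its length NB, and the request size REQ are invented for the example), written in the same Fortran style as the NCUBE code later in this deck:

      INTEGER FUNCTION FFIT(FREE, NB, REQ)
C     First-fit allocation sketch: FREE(I) is the size of the I-th
C     free block of contiguous processors, NB is the number of free
C     blocks, REQ is the requested partition size.  Returns the index
C     of the first block large enough, or 0 if none fits.
      INTEGER NB, REQ, FREE(NB), I
      FFIT = 0
      DO 10 I = 1, NB
        IF (FREE(I) .GE. REQ) THEN
          FFIT = I
          RETURN
        ENDIF
   10 CONTINUE
      RETURN
      END

Best-fit would instead examine every block and choose the smallest one that still satisfies REQ, trading a longer search for less leftover space.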
Space-Sharing Policies (cont’d)
Fragmentation can also occur due to
  System-imposed constraints
    Example: in hypercube machines, a contiguous set of nodes may not be available to form a sub-cube
  All-or-nothing allocation
    Partial allocation may be acceptable to many applications
A central allocator may create performance problems
  It can become a bottleneck
  Fault-tolerance/reliability
Space-Sharing Policies (cont’d)
Example policy
Original policy
  Partition size = MAX(1, Total processors / (Q + 1))
    Q = job queue length
  Problem: does not take already-scheduled jobs into account
Modified policy (a small sketch follows below)
  Partition size = MAX(1, Total processors / (Q + f*S + 1))
    Q = job queue length
    S = number of scheduled jobs
    f = weight of scheduled jobs (between 0 and 1)
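A minimal sketch of the modified rule (the function and variable names PSIZE, NPROC, Q, S, F are invented here; the slide gives only the formula). With integer truncation, 64 processors, Q = 3, S = 4, and f = 0.5 give MAX(1, 64/(3 + 2 + 1)) = 10 processors per partition.

      INTEGER FUNCTION PSIZE(NPROC, Q, S, F)
C     Modified partition-size rule from the slide:
C       partition size = MAX(1, total processors / (Q + f*S + 1))
C     NPROC = total processors, Q = job queue length,
C     S = number of scheduled jobs, F = weight of scheduled jobs (0..1)
      INTEGER NPROC, Q, S
      REAL F
      PSIZE = MAX(1, INT(REAL(NPROC)/(REAL(Q) + F*REAL(S) + 1.0)))
      RETURN
      END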
Space-Sharing Policies (cont’d)
Dynamic space-sharing
  Processors are not allocated on a lifetime basis
  Processors are taken away from jobs if they cannot use them
  Particularly useful for jobs that exhibit a varying degree of parallelism
Advantage
  Eliminates some forms of external fragmentation by not allocating partitions for the lifetime of jobs
Space-Sharing Policies (cont’d)
Problems with dynamic space-sharing
  Difficult to implement on distributed-memory multicomputers
    Taking processors away is expensive on distributed-memory machines
    Processors may be taken only when the computation reaches a desired "yielding point"
  The central allocator may become a bottleneck
  As a result, not used with multicomputer systems
Time-Sharing Policies
Space-sharing
  Fixed policies: long-term commitment
  Static policies: commitments at the job level
  Dynamic policies: commitments at the task or sub-task level
Time-sharing
  Changes the focus from jobs to processors
  Uses preemption to rotate processors amongst a number of jobs
  Usually specified by the multiprogramming level (MPL)
Time-Sharing Policies (cont’d)
Two policies
  Task-based round-robin (RRTask)
    Quantum size is fixed per task
    Violates our "equal allocation of processing power" principle: larger jobs tend to dominate (a job with 8 tasks receives 8 quanta per round, while a job with 2 tasks receives only 2)
  Job-based round-robin (RRJob)
    Quantum size is fixed per job
    Equal allocation is possible
Preemption can be
  Coordinated (gang scheduling)
  Uncoordinated
Time-Sharing Policies (cont’d)
Problems with time-sharing
  Requires a central coordinator
    The coordinator can become a bottleneck for large systems
  A central task queue can create bottleneck problems
  Could use local RRJob instead: apply round-robin at the processor level
    Not as effective
A hybrid version is effective
  Combined space- and time-sharing: partition as in space-sharing, but time-share each partition
Hierarchical Scheduling
Motivation: the desired policy
  Should be self-scheduling to avoid bottlenecks
  Should not cause bottleneck problems for the global task queue and coordinator
  Should minimize internal fragmentation, as in time-sharing
  Should minimize external fragmentation
    Implies partial allocation and handling of system-imposed constraints
  Should be a hybrid policy: space-sharing at low system loads, time-sharing at moderate to high loads
Hierarchical Scheduling (cont’d)
  [Two figure-only slides; no text extracted]
Hierarchical Scheduling (cont’d)
Performance
Ideal workload
  Example: job service time = 16 minutes, divided into four equal tasks of 4, 4, 4, 4 minutes
50-50 workload
  Half of the tasks keep only 50% of their even-share service time; what is removed is added to the other half
  Example: 2, 2, 6, 6 minutes
50-25 workload
  Example: 1, 1, 7, 7 minutes
50-75 workload
  Example: 3, 3, 5, 5 minutes
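Checking the examples against the ideal workload: each task's even share is 16/4 = 4 minutes. In the 50-50 case two tasks keep 0.50 x 4 = 2 minutes each and the other two grow to 6; in the 50-25 case the small tasks keep 0.25 x 4 = 1 minute and the large ones grow to 7; in the 50-75 case they keep 0.75 x 4 = 3 minutes and the large ones grow to 5. The total service time stays at 16 minutes in every case.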
Hierarchical Scheduling (cont’d)
[Figure: mean response time vs. utilization (%) for the ideal workload, comparing space-sharing, hierarchical, and time-sharing policies]
Hierarchical Scheduling (cont’d)
[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing policies]
Hierarchical Scheduling (cont’d)
[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 1), comparing space-sharing, hierarchical, and time-sharing policies]
Hierarchical Scheduling (cont’d)
[Figure: mean response time vs. utilization (%) for the 50-50 workload (service CV = 15), comparing space-sharing, hierarchical, and time-sharing policies]
Hierarchical Scheduling (cont’d)
[Figure: mean response time vs. utilization (%) for the 50-25 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing policies]
Hierarchical Scheduling (cont’d)
[Figure: mean response time vs. utilization (%) for the 50-75 workload (service CV = 10), comparing space-sharing, hierarchical, and time-sharing policies]