gacop jacca meeting - february 27, 2004 p al a new approach in the system software design for...
TRANSCRIPT
GACOP
JACCA Meeting - February 27, 2004
PAL
A New Approach in the System Software Designfor Large-Scale Parallel Computers
A New Approach in the System Software Designfor Large-Scale Parallel Computers
Juan Fernández1,2, Eitan Frachtenberg1, Fabrizio Petrini1,
Salvador Coll1 and José C. Sancho1
1Performance and Architecture Lab 2Grupo de Arquitectura y Computación Paralela (GACOP)
CCS-3 Modeling, Algorithms and Informatics Dpto. Ingeniería y Tecnología de Computadores
Los Alamos National Laboratory, NM 87545, USA Universidad de Murcia, 30071 Murcia, SPAIN
URL: http://www.c3.lanl.gov URL: http://www.ditec.um.es
email:{juanf,eitanf,fabrizio,scoll,jcsancho}@lanl.gov
GACOP
JACCA Meeting - February 27, 2004
PAL
MotivationMotivation
System software is a key factor to maximize usability, performance and scalability on large-scale systems!!!
Hardware / OSsHardware / OSsare glued together byare glued together by
System Software:System Software:Resource Management
CommunicationsParallel Developmentand Debugging ToolsParallel File System
Fault Tolerance
OSOS OSOSOSOS
OSOS
OSOS OSOSOSOS
OSOS
GACOP
JACCA Meeting - February 27, 2004
PAL
MotivationMotivation
System software complexity due to multiple factors: Extremely complex global state Non-deterministic behavior inherent to computing
systems and parallel apps Local OSs lack global awareness of parallel apps Independent design of different components User-level applications rely on system software
GACOP
JACCA Meeting - February 27, 2004
PAL
OutlineOutline
Motivation
Goals
Core Primitives
Resource Management
Communication Libraries
Ongoing and future work
GACOP
JACCA Meeting - February 27, 2004
PAL
Target Simplifying design and implementation of the system
software for large-scale parallel computers Simplicity, performance, scalability, determinism
Approach Built atop a basic set of three primitives Global synchronization/scheduling
Vision SIMD system running MIMD applications
(variable granularity in the order of hundreds of s)
GoalsGoals
GACOP
JACCA Meeting - February 27, 2004
PAL
OutlineOutline
Motivation
Goals
Core Primitives
Resource Management
Communication Libraries
Ongoing and future work
GACOP
JACCA Meeting - February 27, 2004
PAL
Core PrimitivesCore Primitives
System software built atop three primitives Xfer-And-Signal
– Transfer block of data to a set of nodes– Optionally signal local/remote event upon completion
Compare-And-Write– Compare global variable on a set of nodes– Optionally write global variable on the same set of nodes
Test-Event– Poll local event
GACOP
JACCA Meeting - February 27, 2004
PAL
Core PrimitivesCore Primitives
Characteristic Requirement Solution
Job Launching
Data dissemination
Flow Control
Termination Detection
Xfer-And-Signal
Compare-And-Write
Compare-And-Write
Job SchedulingHeartbeat
Context switch
responsiveness
Xfer-And-Signal
Prioritized messages /
Multiple Rails
Communication
PUT
GET
Barrier
Broadcast
Reduce
Xfer-And-Signal
Xfer-And-Signal
Compare-And-Write
Compare-And-Write+Xfer-And-Signal
Xfer-And-Signal / “Smart” NIC
The proposed mechanisms simplify design and implementation!!!
GACOP
JACCA Meeting - February 27, 2004
PAL
Core PrimitivesCore Primitives
Implementation Global, virtually addressable shared memory Remote Direct Memory Access (RDMA) Hardware-supported multicast Hardware-supported global query Computing capability in the NIC
Portability Infiniband, BlueGene/L, QsNET
GACOP
JACCA Meeting - February 27, 2004
PAL
OutlineOutline
Motivation
Goals
Core Primitives
Resource Management
Communication Libraries
Ongoing and future work
GACOP
JACCA Meeting - February 27, 2004
PAL
Resource ManagementResource Management STORM: Scalable TOol for Resource Management [1,2]
Job launching– binary and data dissemination– actual launching of a parallel job– reporting of job termination
Job scheduling– FCFS, gang scheduling, ... [3]– new scheduling algorithms can be “plugged”
Heartbeat/strobe at regular intervals (time slices) Monitoring Built atop the three core primitives
[1] “Scalable Resource Management in High Performance Computers.”
E. Frachtenberg, J. Fernández, F. Petrini, and S. Coll. Cluster´02.
[2] “STORM: Lightning-Fast Resource Management.” E. Frachtenberg, J. Fernández, F. Petrini, S. Pakin and S. Coll. SC´02.
[3] “Flexible CoScheduling: Mitigating Load Imbalance and Improving Utilization of Heterogeneous Resources.” E. Frachtenberg, D. G. Feitelson, F. Petrini and J. Fernández. IPDPS´03.
GACOP
JACCA Meeting - February 27, 2004
PAL
OutlineOutline
Motivation
Goals
Core Primitives
Resource Management
Communication Libraries
Ongoing and future work
GACOP
JACCA Meeting - February 27, 2004
PAL
Communication LibrariesCommunication Libraries BCS-MPI: Buffered Coscheduled MPI [4]
Global synchronization [5]– Heartbeat/strobe sent at regular intervals (time slices)– All system activities are tightly coupled
Global Scheduling– Exchange of communication requirements– Communication scheduling– Perform real transmission and reduce computations [6]
Implementation on the NIC (Elan3 - QsNet) Built atop the three core primitives
[4] “BCS-MPI: A New Approach in the System Software Design for Large-Scale Parallel Computers.” J. Fernández, E. Frachtenberg, and F. Petrini. SC´03.
[5] “Scalable Collective Communication on the ASCI Q Machine”J. Fernández, E. Frachtenberg, and F. Pettrini. HOTi´11.
[6] “Scalable NIC-based Reduction on Large-scale Clusters.” A. Moody, J. Fernández, F. Petrini and D. K. Panda. SC´03.
GACOP
JACCA Meeting - February 27, 2004
PAL
Communication LibrariesCommunication Libraries
•Global Strobe•(time slice starts)
•Global Strobe•(time slice ends)
Exchange of comm requirements
Communication scheduling
Real transmission
•Global•Synchronization
•Global•Synchronization
Tim
e S
lice
(h
un
dre
ds
of s
) BCS-MPI: real-time commication scheduling
GACOP
JACCA Meeting - February 27, 2004
PAL
Ongoing and future workOngoing and future work
Improved system utilization Scheduling multiple jobs
QoS for different types of traffic Scheduling messages may provide traffic segregation
Transparent fault tolerance [7] BCS MPI simplifies the state of the machine
Kernel-level implementation of BCS-MPI User-level solution is already working
Deterministic replay of MPI programs Ordered resource scheduling may enforce reproducibility
[7] “On the Feasibility of Incremental Checkpointing for Scientific Computing.” J. C. Sancho, F. Petrini, G. Johnson, J. Fernández and E. Frachtenberg. IPDPS´04.
GACOP
JACCA Meeting - February 27, 2004
PAL
A New Approach in the System Software Designfor Large-Scale Parallel Computers
A New Approach in the System Software Designfor Large-Scale Parallel Computers
Juan Fernández1,2, Eitan Frachtenberg1, Fabrizio Petrini1
1 Performance and Architecture Lab 2 Grupo de Arquitectura y Computación Paralelas (GACOP)
CCS-3 Modeling, Algorithms and Informatics Dpto. Ingeniería y Tecnología de Computadores
Los Alamos National Laboratory, NM 87545, USA Universidad de Murcia, 30071 Murcia, SPAIN
URL: http://www.c3.lanl.gov URL: http://www.ditec.um.es
email:{juanf,eitanf,fabrizio}@lanl.gov
GACOP
JACCA Meeting - February 27, 2004
PAL
MotivationMotivation
Characteristic Workstation Cluster
Job Launching Operating System Scripts/Middleware on top of the OS
Job Scheduling Timeshared by OSBatch queued or gang scheduled
(with large quanta) using middleware
CommunicationIPC/Shared
Memory
Message Passing Library (e.g. MPI) /
Data-Parallel Programming (e.g. HPF)
Fault Tolerance Little or noneApplication/application-assisted
checkpointing
StorageStandard file
systemCustom parallel file system
DebuggabilityStandard tools:
Reproducibility!!!
Parallel debugging tools:
Non-determinism!!!
Growing gap between workstation and cluster usability!!!
GACOP
JACCA Meeting - February 27, 2004
PAL
MotivationMotivation System software complexity due to multiple factors:
Extremely complex global stateThousands of processes, threads, open files, pending
messages, etc. Non-deterministic behavior
Inherent to computing systems OS process schedulingInduced by parallel applications MPI_ANY_SOURCE
Local OSs lack global awareness of parallel applications Interferences with fine-grain synchronization operations non-
scalable collective communication primitives Independent design of different components
Redundancy of functionality Communication protocolsMissing functionality QoS user-level traffic / system-level
traffic User-level applications rely on system software
System software performance/scalability impacts user-application performance/scalability
GACOP
JACCA Meeting - February 27, 2004
PAL
Resource ManagementResource Management
Job Launching
STORM is 40 times faster than the best reported result!!!
GACOP
JACCA Meeting - February 27, 2004
PAL
Resource ManagementResource Management
Job Scheduling
STORM is able to use very small time slices: RESPONSIVENESS !!!
GACOP
JACCA Meeting - February 27, 2004
PAL
Communication LibrariesCommunication Libraries
Non-blocking primitives: MPI_Isend/Irecv
GACOP
JACCA Meeting - February 27, 2004
PAL
Communication LibrariesCommunication Libraries
Blocking primitives: MPI_Send/Recv
GACOP
JACCA Meeting - February 27, 2004
PAL
Communication LibrariesCommunication Libraries
Global Synchronization Protocol Global Message Scheduling Phase
– Microphases: Descriptor Exchange + Message Scheduling Message Transmission Phase:
– Microphases: Point-to-point, Barrier and Broadcast, Reduce