Euro-Par - August 31- September 3, 2004 - Pisa (Italy)
Designing Parallel Operating Systems via Parallel Programming
email:[email protected]
Eitan Frachtenberg (1), Kei Davis (1), Fabrizio Petrini (1), Juan Fernández (1,2) and José Carlos Sancho (1)
(1) Performance and Architecture Lab (PAL), CCS-3 Modeling, Algorithms and Informatics, Los Alamos National Laboratory, NM 87545, USA. URL: http://www.c3.lanl.gov
(2) Grupo de Arquitectura y Computación Paralelas (GACOP), Dpto. Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, SPAIN. URL: http://www.ditec.um.es
Motivation
Clusters have been the most successful player in high-performance computing in the last decade
HARDWARE = Independent Nodes + High-speed Network
SOFTWARE = Commodity OS + Parallel Apps + System Software
[Diagram: independent nodes, each labelled OS]
Motivation
Ever-increasing demand for computing capability is driving the construction of ever-larger clusters
– Earth Simulator: 5120 processors
– Thunder (LLNL): 4096 processors
– ASCI Q (LANL): 8192 processors
Systems are becoming more complex, less efficient and less reliable
Motivation
Clusters are loosely-coupled systems used for solving inherently tightly-coupled problems
Parallel software keeps all the pieces together
Development of parallel software is a time- and resource-consuming task due to its complexity
PROBLEM: parallel software has neither evolved nor scaled in step with cluster sizes
SOLUTION: a new approach to the design of parallel software for large-scale clusters
Goals
Target
– New methodology for the design of parallel software
– Simplicity, performance, scalability, reliability
– Backbone to integrate all nodes into a parallel OS
Vision
– BSP-like system running MIMD applications (variable granularity on the order of hundreds of µs)
Approach
– BSP-like global control and coordination of all system activities
– Small set of collective communication primitives for global coordination
Outline
Motivation and Goals
Toward a Parallel Operating System
Core Primitives
Parallel Software Design
Case Studies
Concluding remarks
Toward a Parallel OS
Designing a Parallel OS:
– Lack of global coordination (loose coupling)
– Redundant/missing functionality (complexity)
[Diagram boxes: Resource Management, Parallel Application, ..., Parallel File System; Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; Hardware]
Toward a Parallel OS
Scientific applications are tightly coupled …
– Data dependencies between nodes
– They exchange messages very often
… but the processing nodes are “bolted together” in a loosely coupled fashion
Need for global control and coordination of all the system activities, enforced by global collective communication primitives
Toward a Parallel OS
Designing a Parallel OS:
– System-level, global control and coordination of all application and system software activities
[Diagram boxes: Resource Management, Parallel Application, ..., Parallel File System; Global control and coordination; Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; Hardware]
Toward a Parallel OS
Parallel applications use point-to-point and collective communication
System software tasks are either collective operations or can be cast in terms of them
Parallel applications and system software can be built atop the same communication primitives
Toward a Parallel OS
Designing a Parallel OS:
– Least common denominator of system and application software: Core Primitives
[Diagram boxes: Resource Management, Parallel Application, ..., Parallel File System; Global control and coordination; Core Primitives; Comm Protocol 1, Comm Protocol 2, ..., Comm Protocol N; Hardware]
Core Primitives
Parallel software built atop three primitives
Xfer-And-Signal
– Transfer block of data to a set of nodes
– Optionally signal local/remote event upon completion
Test-Event
– Poll local event
Compare-And-Write
– Compare global variable on a set of nodes
– Optionally write global variable on the same set of nodes
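The three primitives can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Node` class, the function names and their parameters are hypothetical and are not the QsNet interface, and the choice to write only when the comparison holds on every node is an assumption made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One simulated cluster node (hypothetical model, not the QsNet API)."""
    memory: dict = field(default_factory=dict)     # named data blocks
    variables: dict = field(default_factory=dict)  # global variables
    events: set = field(default_factory=set)       # signalled event names


def xfer_and_signal(src, dests, key, local_event=None, remote_event=None):
    """Copy block `key` from src to every destination, then optionally signal."""
    for node in dests:
        node.memory[key] = src.memory[key]
        if remote_event is not None:
            node.events.add(remote_event)
    if local_event is not None:
        src.events.add(local_event)


def poll_event(node, event):
    """Test-Event: poll a local event; report and clear it if signalled."""
    if event in node.events:
        node.events.discard(event)
        return True
    return False


def compare_and_write(nodes, var, predicate, write_value=None):
    """Evaluate `predicate` on `var` at every node; optionally write.

    Assumption: the write happens only if the comparison holds everywhere.
    """
    ok = all(predicate(n.variables.get(var)) for n in nodes)
    if ok and write_value is not None:
        for n in nodes:
            n.variables[var] = write_value
    return ok
```

In this model a source node multicasts a block with `xfer_and_signal`, each destination discovers the arrival by calling `poll_event`, and any node can ask a global question about a variable with `compare_and_write`.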
Core Primitives
Parallel software built atop three primitives
Xfer-And-Signal (QsNet):
– Node S transfers block of data to nodes D1, D2, D3 and D4
– Events triggered at source and destinations
[Diagram: node S and nodes D1, D2, D3 and D4, with a Source Event and Destination Events]
Core Primitives
Parallel software built atop three primitives
Compare-And-Write (QsNet):
– Node S compares variable V on nodes D1, D2, D3 and D4
  • Is V {, , >} to Value?
– Partial results are combined in the switches
[Diagram: node S and nodes D1, D2, D3 and D4]
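Combining partial results in the switches can be pictured as a tree reduction: each node compares its own copy of V, and each switch stage ANDs the answers it receives. The sketch below is an illustration only; the function names, the fan-out of 4 and the use of Python's `operator` functions as the comparison are assumptions, not details taken from the slides.

```python
import operator


def combine_in_switches(partials, fanout=4):
    """AND-reduce per-node results stage by stage, like a tree of switches.

    Returns the global result and the number of combining stages used.
    """
    level = list(partials)
    if not level:
        raise ValueError("need at least one node")
    stages = 0
    while len(level) > 1:
        # Each (hypothetical) switch combines up to `fanout` incoming answers.
        level = [all(level[i:i + fanout]) for i in range(0, len(level), fanout)]
        stages += 1
    return level[0], stages


def global_compare(values, op, operand, fanout=4):
    """Each node compares its copy of V locally; the switches combine."""
    partials = [op(v, operand) for v in values]
    return combine_in_switches(partials, fanout)
```

With 16 nodes and a fan-out of 4, `global_compare([5] * 16, operator.ge, 5)` needs two combining stages, so the number of stages grows logarithmically with the number of nodes rather than linearly.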
Toward a Parallel OS
Global control/coordination of all system activities
[Diagram: one time slice (hundreds of µs) runs from a Global Strobe (time slice starts) to a Global Strobe (time slice ends); within it, Task 1, Task 2 and Task 3 are shown together with two Global Synchronization points]
Parallel Software Design
Using the core primitives…
Global control and coordination
– Strobe sent at regular intervals (time slices)
  • Compare-And-Write + Xfer-And-Signal (Master)
  • Test-Event (Slaves)
– All system activities are tightly coupled
– Global information is required to schedule resources; global synchronization facilitates the task but it is not enough
Global resource scheduling
– Exchange of requirements/restrictions
  • Xfer-And-Signal + Test-Event
– Resource scheduling
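The strobe mechanism can be sketched as follows, assuming the master first checks globally that every slave finished the previous time slice (the Compare-And-Write role) and only then delivers the next strobe (the Xfer-And-Signal role), while slaves poll for it (the Test-Event role). The `Slave` class and `master_strobe` function are hypothetical names for this illustration, and the simulation is sequential rather than truly parallel.

```python
class Slave:
    """A simulated slave node (hypothetical model)."""

    def __init__(self):
        self.done_slice = 0   # last time slice this node completed
        self.strobe = None    # pending strobe number, if any
        self.log = []         # slices this node has executed

    def poll(self):
        """Test-Event role: consume a pending strobe and run that slice."""
        if self.strobe is None:
            return False
        slice_no, self.strobe = self.strobe, None
        self.log.append(slice_no)
        self.done_slice = slice_no
        return True


def master_strobe(slaves, slice_no):
    """Issue strobe `slice_no` only if all slaves finished the previous one."""
    # Compare-And-Write role: one global query over all slaves.
    if not all(s.done_slice >= slice_no - 1 for s in slaves):
        return False
    # Xfer-And-Signal role: deliver the strobe and signal it remotely.
    for s in slaves:
        s.strobe = slice_no
    return True
```

A master loop would call `master_strobe` once per time slice; a `False` return indicates that some node is lagging, which is the kind of global information the slide says scheduling needs.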
Parallel Software Design
SYSTEM SOFTWARE
Characteristic | Workstation | Cluster
Job Launching | OS | Scripts atop the OS
Job Scheduling | Timeshared by OS | Batch queued or gang scheduled by middleware
Communication | Standard IPC and shared memory | MPI
Storage | Standard file system | Custom parallel file system
Debuggability | Standard tools | Custom parallel debugging tools
Fault Tolerance | Little or none | Application-specified checkpointing
Applications | System calls | Rely on System Software
Parallel Software Design
Using the core primitives…
Characteristic | Requirement | Solution
Job Launching | Data Dissemination | Xfer-And-Signal
Job Launching | Flow Control | Compare-And-Write
Job Launching | Termination Detection | Compare-And-Write
Job Scheduling | Heartbeat | Xfer-And-Signal
Job Scheduling | Context Switch | Prioritized messages/multiple rails
Communication | PUT | Xfer-And-Signal
Communication | GET | Xfer-And-Signal
Communication | Barrier | Compare-And-Write
Communication | Broadcast | Compare-And-Write + Xfer-And-Signal
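Two rows of the table, Barrier and Broadcast, can be sketched on top of simplified stand-ins for the primitives. Everything here is an assumption for illustration: nodes are plain dictionaries, the comparison is fixed to "greater than or equal", and the variable names `barrier` and `ready` are hypothetical.

```python
def compare_and_write(nodes, var, minimum):
    """Global query: does every node hold var >= minimum?"""
    return all(n[var] >= minimum for n in nodes)


def xfer_and_signal(src, dests, key):
    """Copy src[key] to all destinations and flag its arrival."""
    for d in dests:
        d[key] = src[key]
        d["arrived"] = True


def barrier_done(nodes, epoch):
    """Barrier: each node bumps its own counter on entry; one global
    query tells whether everyone has entered barrier number `epoch`."""
    return compare_and_write(nodes, "barrier", epoch)


def broadcast(root, dests, key):
    """Broadcast: check that all receivers are ready, then multicast."""
    if not compare_and_write(dests, "ready", 1):
        return False  # a receiver is not ready; the caller retries later
    xfer_and_signal(root, dests, key)
    return True
```

The barrier needs no data transfer at all, only a repeated global query, which is why the table maps it to Compare-And-Write alone, while the broadcast pairs the readiness query with the actual transfer.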
Parallel Software Design
Can we really build system software using this new approach?
Case Studies
Experimental Setup
Characteristic | Crescendo Cluster | Wolverine Cluster
Nodes | 32 x Dell 1550 | 64 AlphaServer ES40
CPUs/Node | 2 x 1GHz Pentium-III | 4 x 833MHz EV68
Memory/Node | 1 GB | 8 GB
Network Cards | QM-400 Elan3 | QM-400 Elan3
OS | RH 7.3 + QsNet kernel | RH 7.1 + QsNet kernel
Software | Qsnetlibs v1.5.0-0 + Intel C/Fortran Compiler 5.0.1 | Qsnetlibs v1.5.0-0 + Compaq's C Compiler
Case Studies
STORM (Scalable TOol for Resource Management)
– Architecture:
  • Set of dæmons running on the management/compute nodes
  • Built atop the three core primitives
  • BSP-like behavior: management activities are synchronized and scheduled every few hundreds of microseconds
– Functionality:
  • Job Launching
  • Job Scheduling (FCFS, gang scheduling and others); new scheduling algorithms can be “plugged in”
  • Resource Accounting
Case Studies
Job Launching: send/execute/check for completion
40 times faster than the best reported result!
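The send/execute/check-for-completion sequence can be sketched as below, following the mapping of job launching onto the core primitives given earlier in the talk (data dissemination, flow control, termination detection). This is a toy model and not STORM's implementation: nodes are dictionaries, the chunk size is arbitrary, and "execution" is a stand-in flag.

```python
def launch_job(binary, nodes, chunk_size=4):
    """Send a binary in chunks, start it everywhere, then check completion."""
    chunks = [binary[i:i + chunk_size] for i in range(0, len(binary), chunk_size)]
    for seq, chunk in enumerate(chunks):
        # Flow control (Compare-And-Write role): every node must already
        # hold exactly the chunks 0..seq-1 before the next one is sent.
        if not all(len(n["chunks"]) == seq for n in nodes):
            raise RuntimeError("a receiver is not ready for the next chunk")
        # Data dissemination (Xfer-And-Signal role): multicast the chunk.
        for n in nodes:
            n["chunks"].append(chunk)
    # Execute: each node reassembles the image and "runs" it.
    for n in nodes:
        n["image"] = b"".join(n["chunks"])
        n["exited"] = True  # stand-in for the process running to completion
    # Termination detection (Compare-And-Write role): has every node exited?
    return all(n["exited"] for n in nodes)
```

Because each step is a collective operation over all nodes rather than a loop of point-to-point messages, the cost of a launch in a design like this depends on the speed of the collectives and not directly on the node count.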
Case Studies
BCS-MPI (Buffered CoScheduled MPI)
– Architecture:
  • Set of cooperative threads running in the NIC
  • Built atop the three core primitives
  • BSP-like behavior: communications are synchronized and scheduled every few hundreds of microseconds
– Functionality:
  • Subset of the MPI standard
– Paves the way to provide:
  • Traffic segregation
  • Deterministic replay of user applications
  • System-level fault tolerance
Case Studies
SWEEP3D and SAGE Performance (IA32)
– Production-level MPI versus BCS-MPI
[Charts omitted; annotations: 0.5% SPEEDUP, 2% SPEEDUP]
Concluding Remarks
Methodology for designing parallel software
– Coordination of all system and application software activities in a BSP-like fashion
– Parallel applications and system software built atop a basic set of collective primitives for global coordination
– Backbone to integrate all nodes into a parallel OS
Promising preliminary results demonstrate that this approach is indeed feasible
Future Work
Kernel-level implementation
– User-level solution is already working
Deterministic replay of MPI programs
– Ordered resource scheduling may enforce reproducibility
Transparent fault tolerance
– Global coordination simplifies the state of the machine
Parallel Software Design
Using the core primitives…
Characteristic | Requirement | Solution
Storage | Metadata Transfer, File Data Transfer | Xfer-And-Signal
Debuggability | Debug Data Transfer | Xfer-And-Signal
Debuggability | Debug Synchronization | Compare-And-Write
Fault Tolerance | Fault Detection | Compare-And-Write
Fault Tolerance | Checkpointing Synchronization | Compare-And-Write
Fault Tolerance | Checkpointing Data Transfer | Xfer-And-Signal
Case Studies
Job Scheduling: gang scheduling
Very small time slices: RESPONSIVENESS!
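Gang scheduling with a global strobe can be pictured with a scheduling matrix in which each row is a time slot and each column is a node; on every strobe all nodes switch to the next row together. The sketch below is an assumption-laden illustration (the matrix layout, job names and the 500 µs slice are invented for the example) and says nothing about how STORM stores its schedule.

```python
def gang_schedule(matrix, num_strobes):
    """Return, for each strobe, the job every node runs (None means idle)."""
    timeline = []
    for strobe in range(num_strobes):
        # All nodes switch rows together on the global strobe.
        row = matrix[strobe % len(matrix)]
        timeline.append(list(row))
    return timeline


def max_wait_us(num_rows, slice_us):
    """Longest time a job waits for its next turn under round-robin rows."""
    return (num_rows - 1) * slice_us
```

For a matrix `[["A", "A", "B", "B"], ["C", "C", "C", None]]`, jobs A and B share the first slot and job C gets the second. The wait between turns scales with the slice length, which is why very small time slices translate into responsiveness.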
Toward a Parallel OS
BCS-MPI: real-time communication scheduling
[Diagram: one time slice (hundreds of µs) runs from a Global Strobe (time slice starts) to a Global Strobe (time slice ends); it shows three phases, Exchange of comm requirements, Communication scheduling and Real transmission, together with two Global Synchronization points]
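The three phases of a time slice can be sketched as a single function. This is a minimal illustration, not BCS-MPI's algorithm: the descriptor format (tuples of source, destination and payload) and the matching rule (a send is scheduled only if the matching receive has been posted) are assumptions made for the example.

```python
def run_time_slice(posted_sends, posted_recvs, buffers):
    """One slice: exchange requirements, schedule matches, then transmit.

    posted_sends: list of (src, dst, payload) descriptors
    posted_recvs: list of (src, dst) descriptors
    buffers:      dict mapping dst -> list of (src, payload) delivered
    """
    # Phase 1: exchange of communication requirements (descriptors only).
    wanted = {(src, dst) for (src, dst) in posted_recvs}
    # Phase 2: communication scheduling: keep sends whose receive is posted.
    scheduled = [s for s in posted_sends if (s[0], s[1]) in wanted]
    # Phase 3: real transmission of the scheduled messages.
    for src, dst, payload in scheduled:
        buffers.setdefault(dst, []).append((src, payload))
    # Unmatched sends wait for a later time slice.
    deferred = [s for s in posted_sends if s not in scheduled]
    return scheduled, deferred
```

Because every node performs the same three phases between the same two strobes, the set of messages in flight during a slice is decided before any data moves, which is the property that would make traffic segregation and deterministic replay plausible goals.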