Fault-Tolerant Communication Runtime Support for Data-Centric Programming Models
Abhinav Vishnu 1, Huub Van Dam 1, Bert De Jong 1, Pavan Balaji 2, Shuaiwen Song 3
1 Pacific Northwest National Laboratory, 2 Argonne National Laboratory, 3 Virginia Tech
Copyright: Abhinav Vishnu
Faults Are Becoming Increasingly Prevalent
Many scientific domains, such as chemistry and astrophysics, have insatiable computational requirements.
Many hundreds of thousands of processing elements are being combined.
Hardware faults are becoming increasingly commonplace.
Designing fault-resilient systems and system stacks is imperative.
There is a large body of work on checkpoint/restart, both application-driven and transparent (BLCR), and on message logging.
These approaches match MPI semantics and its fixed process model very well.

| System | #Cores | MTBF |
| --- | --- | --- |
| ASCI/Q | 8192 | 6.5 hrs |
| ASCI White | 8192 | 8 hrs |
| Jaguar 4 Core | 150K+ | 37.5 hrs |

Source: Mueller et al., ICPADS 2010
Quick Overview of Data-Centric / Partitioned Global Address Space (PGAS) Models
Abstractions for arbitrary distributed data structures: arrays, trees, etc.
Fit for irregular global address space accesses.
Natural mechanism for load balancing.
Decoupling of data and computation.
We focus on Global Arrays, a fundamental building block for other structures, which uses a one-sided communication runtime system (ARMCI) for data transfer.
What are the requirements for fault tolerance from the runtime?
[Figure: a global address space view over physically distributed data; software stack labels: Applications, Global Arrays Layer, Fault Tolerant Runtime]
Task-Based Execution using PGAS: Characteristic of Various Applications
Each task can be executed by any process and reads/updates arbitrary regions of the global address space.
Requirements for fault tolerance:
Continue with the available nodes.
Recover data on a fault.
Recovery cost proportional to the degree of failure.
[Figure: processes P0 to PN-1 take tasks from a task collection; labels: Get, Compute, Put/Accumulate; Global Address Space (RO) and Global Address Space (RW)]
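The task model on this slide can be sketched as a small single-process simulation. Everything in it is a hypothetical illustration and not the Global Arrays API: `run_tasks`, the list-based read-only and read-write spaces, the squaring "compute" step, and the way a failed process is modeled are all assumptions. It only shows the first requirement, continuing with the available nodes.

```python
from collections import deque

def run_tasks(num_procs, tasks, ro_space, failed=frozenset()):
    """Simulate task-based execution over a global address space.

    Hypothetical sketch: ro_space is a read-only list and the result is a
    read-write list updated with accumulate (+=). Processes listed in
    `failed` never pick up work, so the survivors drain the whole collection.
    """
    rw_space = [0] * len(ro_space)
    queue = deque(tasks)                      # shared task collection
    alive = [p for p in range(num_procs) if p not in failed]
    if not alive:
        raise RuntimeError("no surviving processes")
    executed_by = {}
    turn = 0
    while queue:
        proc = alive[turn % len(alive)]       # any live process may take a task
        turn += 1
        task = queue.popleft()
        value = ro_space[task]                # Get from the read-only space
        result = value * value                # Compute (placeholder work)
        rw_space[task] += result              # Accumulate into the read-write space
        executed_by[task] = proc
    return rw_space, executed_by
```

Calling `run_tasks(4, range(4), [1, 2, 3, 4], failed={2})` completes every task even though process 2 never participates, because no task is bound to a particular process.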
Other Questions for Fault Tolerance
Is Node 2 dead?
What should the process manager do?
What about the collective operations and their semantics?
What about the one-sided operations to the failed node?
What about the lost data and computation?
We answer these questions in this work.
[Figure: Node 1 through Node 4, with labels for the interconnection network and the power supply]
Provided Solutions

| # | Problem | Solution |
| --- | --- | --- |
| 1 | Is Node 2 dead? | Design a Fault Tolerance Management Infrastructure |
| 2 | What should the process manager do? | Continue execution despite the lost node |
| 3 | What about the lost data and computation? | Application-based redundancy and recovery |
| 4 | One-sided operations to the failed node? | Fault resilient Global Arrays and ARMCI |
| 5 | Collective communication operations and their semantics? | Design a fault resilient barrier |
Design of Fault Tolerant Communication Runtime System
[Figure: layered design with labels as listed on the slide: Application; Data Redundancy/Fault Recovery Layer; Global Arrays Layer; Fault Resilient ARMCI; Fault Resilient Process Manager; Fault Tolerance Management Infrastructure; Fault Tolerant Barrier; Network; annotation: Handling Data Redundancy]
ARMCI: Underlying Communication Runtime for Global Arrays
Aggregate Remote Memory Copy Interface (ARMCI)
Provides one-sided communication primitives: Put, Get, Accumulate, Atomic Memory Operations.
Abstracts network interfaces.
Established and available on all leading platforms: Cray XT and XE; IBM Blue Gene/L and Blue Gene/P; commodity interconnects (InfiniBand, Ethernet, etc.).
Further upcoming platforms: IBM Blue Waters and Blue Gene/Q; Cray Cascade.
FTMI Protocols (Example with InfiniBand)
Ping message: requires a response from the remote process; less reliable.
RDMA Read: reliable notification; most reliable.
[Figure: probes sent through a node's network adapter to Node 2 get no response, so Node 2 is declared dead]
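One plausible reading of this slide is that a ping alone is not trusted, because a busy remote process may simply fail to answer, whereas an RDMA read is served by the remote network adapter without involving the remote process. The sketch below simulates that two-step check. `SimNode`, `ping`, `rdma_read`, and `node_is_dead` are hypothetical names for illustration; they are not the FTMI interface, and the real protocol ordering is an assumption.

```python
from dataclasses import dataclass

@dataclass
class SimNode:
    """Hypothetical stand-in for a remote node."""
    process_responsive: bool = True   # can the remote process answer a ping?
    adapter_up: bool = True           # can the network adapter serve an RDMA read?

def ping(node):
    # A ping needs the remote process to reply, so a busy process looks silent.
    return node.adapter_up and node.process_responsive

def rdma_read(node):
    # An RDMA read is served by the adapter without involving the remote process.
    return node.adapter_up

def node_is_dead(node):
    """Declare a node dead only when both probes get no response."""
    if ping(node):
        return False
    return not rdma_read(node)
```

With this model, `node_is_dead(SimNode(process_responsive=False))` is `False` (the process is merely slow), while `node_is_dead(SimNode(adapter_up=False))` is `True`.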
Fault Resilient Process Manager
Adapted from the OSU MVAPICH process manager.
Provides MPI-style (not fault tolerant) collectives.
Based on TCP/IP for bootstrapping.
Generic enough for any machine that has at least an Ethernet control network.
Ignores any TCP/IP errors.
Layers rely on FTMI for higher-accuracy fault information.
[Figure label: Interconnection Network]
Expected Data Redundancy Model: Impact on ARMCI
Expected data redundancy model: a staggered data model.
Simultaneous updates may leave both copies in an inconsistent state, so each copy should be updated one by one.
Every write-based primitive (Put/Acc) should be fenced.
WaitProc: wait for all non-blocking operations to complete.
Fence: ensure all writes to a process have finished.
[Figure: primary copies of the data on nodes N1, N2, N3, N4; shadow copies staggered across the same nodes in the order N4, N1, N2, N3]
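The update rule on this slide (write one copy, fence, then write the other) can be illustrated with a small in-memory model. This is a sketch under stated assumptions, not ARMCI code: `ReplicatedArray`, `shadow_node`, and the log are hypothetical, the placement rule "shadow of node i lives on node i+1, wrapping around" is one reading of the figure, and the fence is only recorded, not enforced over a network.

```python
def shadow_node(primary, num_nodes):
    # Assumed staggering: the shadow of node i's block lives on node i+1 (wrapping).
    return (primary + 1) % num_nodes

class ReplicatedArray:
    """Hypothetical model of one data block per node with a staggered shadow."""

    def __init__(self, num_nodes):
        self.num_nodes = num_nodes
        self.primary = [0.0] * num_nodes
        self.shadow = [0.0] * num_nodes   # indexed by the block it mirrors
        self.log = []                     # order of operations, for inspection

    def _fence(self, node):
        # Stand-in for a fence: the write to `node` is complete before we go on.
        self.log.append(("fence", node))

    def accumulate(self, block, value):
        # Update the primary copy and fence it before touching the shadow.
        self.primary[block] += value
        self.log.append(("acc", block))
        self._fence(block)
        # Only now update the shadow copy on the neighbouring node.
        target = shadow_node(block, self.num_nodes)
        self.shadow[block] += value
        self.log.append(("acc", target))
        self._fence(target)

    def recover(self, failed_block):
        # After the node holding the primary fails, read the shadow copy.
        return self.shadow[failed_block]
```

Because the two writes never overlap, a failure in the middle of an update can damage at most one copy. The extra fence per write is also the kind of synchronous cost that a latency benchmark would expose.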
Fault Resilient ARMCI – Communication Protocols
ARMCI communication proceeds in multiple phases; Put/Get/Acc are implemented as combinations of these phases.
On failures:
Either process/thread may be waiting for data while the other process is dead.
Use timeout-based FTMI to detect failures.
If FTMI detects a failure, return an error if necessary.
[Figure: communication phases between a process and an asynchronous agent]
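The failure handling described here amounts to bounding every wait with a timeout and consulting the failure detector when the timeout expires. A minimal sketch follows, assuming two caller-supplied callbacks: `is_complete` for the pending phase and `peer_is_dead` standing in for an FTMI query. The names, the polling loop, the default timeout, and the use of an exception in place of an error code are all hypothetical.

```python
import time

class RemoteFailure(Exception):
    """Hypothetical error raised when the peer of a pending operation is dead."""

def wait_for_completion(is_complete, peer_is_dead, timeout_s=0.05, poll_s=0.001):
    """Wait for one protocol phase; consult the failure detector on timeout."""
    while True:
        deadline = time.monotonic() + timeout_s
        while time.monotonic() < deadline:
            if is_complete():
                return True
            time.sleep(poll_s)
        # Timeout expired: ask the detector before waiting any longer.
        if peer_is_dead():
            raise RemoteFailure("peer failed while the operation was pending")
        # Peer is slow but alive: keep waiting for another timeout period.
```

The timeout value is the main cost when a fault does occur, since the waiting side cannot react before it expires.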
Fault Resilient Collective Communication Primitives
Barrier is the fundamental non-data-moving collective communication operation.
Ga_sync = Fence to all processes + Barrier.
Used at various execution points in ARMCI.
We have implemented multiple fault tolerant barrier algorithms:
A fault tolerant version based on a high-concurrency all-to-all personalized exchange.
A fault tolerant version of the hypercube-based implementation.
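The all-to-all variant can be pictured as follows: every live rank notifies every other rank, then waits only for the peers that the failure detector still reports as alive. The simulation below is a sketch of that idea and not the authors' algorithm; `all_to_all_barrier` is a hypothetical name, the `dead` set stands in for FTMI, and message delivery is modeled as a dictionary of inboxes.

```python
def all_to_all_barrier(num_procs, dead):
    """Simulate one all-to-all barrier round among the surviving ranks.

    Returns, for each live rank, the set of ranks whose arrival it saw.
    """
    live = [r for r in range(num_procs) if r not in dead]
    inbox = {r: set() for r in live}
    # Phase 1: every live rank notifies every other rank. Sends to dead ranks
    # are dropped here; a real runtime would see an error or a timeout.
    for src in live:
        for dst in range(num_procs):
            if dst != src and dst in inbox:
                inbox[dst].add(src)
    # Phase 2: each live rank waits only for peers the detector reports alive.
    for r in live:
        expected = {p for p in live if p != r}
        if not expected <= inbox[r]:
            raise RuntimeError(f"rank {r} would hang waiting for a peer")
    return inbox
```

For example, `all_to_all_barrier(4, {2})` lets ranks 0, 1, and 3 leave the barrier without ever waiting on rank 2.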
Usage of FTMI
General functionality for fault detection.
Potential use as a component of a fault tolerance backplane.
ARMCI layer: different phases of the one-sided communication protocols (Put, Get, Accumulate).
Fault resilient barrier: different steps of the fault tolerant algorithm.
Potential use at the application layer for designing recovery algorithms.
Performance Evaluation Methodology
Comparison of the following approaches: FT-ARMCI with no fault; FT-ARMCI with one node fault (actual); Original.
Testbed: InfiniBand DDR with AMD Barcelona CPUs; 128 nodes used for the performance evaluation.
Overhead of FT-ARMCI (No Faults)
Latency comparisons for the Original and FT-ARMCI implementations.
Objective is to understand the overhead on pure communication benchmarks.
Put and Accumulate primitives.
Overhead is observed due to synchronous writes.
Possible future performance optimization: piggyback the acknowledgement with data.
Performance Evaluation with Faults
Pure communication benchmark.
A real scientific application would have a combination of computation and communication.
Objective is to understand the overhead when a fault occurs.
The primary overhead is due to the hardware timeout for fault detection.
The test continues to execute when actual faults occur.
Conclusions
We presented the execution models of data-centric programming models for fault tolerance.
We presented the design of a fault tolerant communication runtime system: fault tolerance management infrastructure; process manager and barrier collective operation; fault tolerant communication protocols.
Our evaluation shows:
We can continue execution in the presence of node faults.
Overhead is observed in the absence of faults, with room for improvement.
Degradation in the presence of node faults is acceptable.
Future Work
This is a part of active R&D:
Leverage the infrastructure for Global Arrays.
Significant application impact.
Handling simultaneous and consecutive failures.
Actively pursuing designs for upcoming high-end systems: Blue Waters, Gemini, etc.
Our thanks to: eXtreme Scale Computing Initiative (XSCI) @ PNNL; US Department of Energy; Molecular Science Computing Facility @ PNNL.