@spcl eth fault tolerance for remote memory access ... · logging memory accesses vs. messages...

397
spcl.inf.ethz.ch @spcl_eth M ACIEJ BESTA, TORSTEN HOEFLER Fault Tolerance for Remote Memory Access Programming Models

Upload: others

Post on 08-Aug-2020

9 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

MACIEJ BESTA, TORSTEN HOEFLER

Fault Tolerance for Remote Memory Access

Programming Models

Page 2: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Page 3: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory

Process p

A

Page 4: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Process p Process q

A

BB

Page 5: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

Page 6: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

Page 7: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

Page 8: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

Process p Process q

A

BB

Page 9: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

BB

AA

Page 10: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

A

B

Page 11: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

Page 12: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

Page 13: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

One-sided communication

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

Page 14: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

One-sided communication

2

REMOTE MEMORY ACCESS (RMA) PROGRAMMING

Memory Memory

Cray

BlueWaters

put

Process p Process q

A

Bget

B

A

B

flush

A

B

no active

participation

Page 15: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

Page 16: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

Page 17: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

Page 18: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

Page 19: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

3

REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC

networks support RDMA

Page 20: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

Page 21: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

Page 22: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

Page 23: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

4

REMOTE MEMORY ACCESS PROGRAMMING

Supported by many HPC libraries and languages

Page 24: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

5

REMOTE MEMORY ACCESS PROGRAMMING

Enables significant speedups over message passing in

many types of applications, e.g.:

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12

Page 25: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

5

REMOTE MEMORY ACCESS PROGRAMMING

Enables significant speedups over message passing in

many types of applications, e.g.: Speedup of ~1.5 for communication patterns in graph analytics

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12

Page 26: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

5

REMOTE MEMORY ACCESS PROGRAMMING

Enables significant speedups over message passing in

many types of applications, e.g.: Speedup of ~1.5 for communication patterns in graph analytics

Speedup of ~1.4-2 in physics computations

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

[2] D. Petrovic et al., High-performance RMA-based broadcast on the Intel SCC. SPAA’12

Page 27: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

6

FAULT TOLERANCE + RMA

Page 28: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

6

15.8h of MTBF (for nodes) for the TSUBAME2

FAULT TOLERANCE + RMA

Page 29: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

7

FAULT TOLERANCE + RMA

Page 30: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Page 31: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Page 32: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

Page 33: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

Page 34: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

Page 35: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

Message Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

Page 36: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

Page 37: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

Page 38: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

Page 39: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

checkpointing in RMA-

based applications

Page 40: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

checkpointing in RMA-

based applications

fault tolerance

mechanisms and

schemes

Page 41: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Fault tolerance is well studied for message passing

Scarce research exists for fault tolerance for RMA

7

FAULT TOLERANCE + RMA

RMAMessage Passing

Coordinated

Checkpointing (CC)

uncoordinated

checkpointing

and message

logging (UC)

logging memory

accesses vs. messages

checkpointing in RMA-

based applications

performance

fault tolerance

mechanisms and

schemes

Page 42: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

8

OVERVIEW OF OUR RESEARCH

Page 43: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

Page 44: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

Page 45: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

Page 46: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Generic model

8

OVERVIEW OF OUR RESEARCH

Page 47: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

8

OVERVIEW OF OUR RESEARCH

Page 48: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

Page 49: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA Schemes

Page 50: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA Schemes

Page 51: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Schemes

Page 52: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Page 53: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Logging accesses

Page 54: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Logging accesses

Page 55: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Basic scheme

Logging accesses

Page 56: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

Page 57: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

Page 58: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

Page 59: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processes

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Logging accesses

Page 60: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processes

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 61: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processes

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 62: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemesSchemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 63: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

Distribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 64: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMA

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 65: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 66: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 67: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

8

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 68: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

9

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 69: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

10

COORDINATED CHECKPOINTING (MP)

Page 70: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 71: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

Page 72: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

Page 73: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

Page 74: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

Page 75: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

com

pute

barrier

Page 76: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

Page 77: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

global

rollback

barrier

Page 78: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

Page 79: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

Page 80: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

Page 81: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

Page 82: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

barrier com

pute

com

pute

com

pute

com

pute

barrier

Page 83: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

10

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 84: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 85: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

Page 86: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

Page 87: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

send

coordinated

checkpoint

Page 88: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

send

coordinated

checkpoint

recv

Page 89: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

send

coordinated

checkpoint

recv

Page 90: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

11

COORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

Page 91: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

Page 92: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

coordinated

checkpoint

Page 93: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

Page 94: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

Page 95: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

flush

Page 96: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

12

COORDINATED CHECKPOINTING (RMA)

Proc k Proc 1Proc 1 ... ... Proc k...

put

coordinated

checkpoint

flush

Page 97: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

13

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 98: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

13

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 99: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

13

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 100: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

Page 101: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

Page 102: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

Page 103: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

actions are

non-blocking

Page 104: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 105: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A

B

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 106: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 107: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

C

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 108: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

C

D

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 109: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B

C

D

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 110: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 111: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD

E

F

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 112: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD

E

F

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 113: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blocking

data will be valid

upon synchronizing

memories

Page 114: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blockingEpoch 0

data will be valid

upon synchronizing

memories

Page 115: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blockingEpoch 0

Epoch 1

data will be valid

upon synchronizing

memories

Page 116: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

14

RMA: EPOCHS

Proc p Proc q

memorymemory

A B C D E F

A B CD E F

actions are

non-blockingEpoch 0

Epoch 1

Epoch 2

data will be valid

upon synchronizing

memories

Page 117: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

Page 118: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

X

Y

Z

Page 119: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

Epoch 0

X

Y

Z

Page 120: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

Epoch 0

Epoch 1

X

Y

Z

Page 121: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

15

RMA: EPOCHS

Proc p Proc q

Page 122: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

Page 123: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

Page 124: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

Page 125: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

B

C

E

F

Page 126: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

B

C

B Cput put

E

F

Page 127: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

B

C

B Cput put

E

F

Page 128: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

B

C

B Cput put

E

F

Page 129: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

B

C

B Cput put

E Fget getE

F

Page 130: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

B

C

B Cput put

E Fget getE

F

Page 131: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

B

C

B Cput put

E Fget getE

F

Page 132: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC

B

C

B Cput put

E Fget getE

F

Page 133: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC 0

B

C

B Cput put

E Fget getE

F

Page 134: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC 1

B

C

B Cput put

E Fget getE

F

Page 135: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

16

RMA: THE CONSISTENCY ORDER

Proc p Proc q

A

D

Epoch 0

Epoch 1

Epoch 2

co

co

co

For recovery, a

process has to

replay actions

in the correct

order! co

For this,

we use epoch

counters (EC)

memory

EC 2

B

C

B Cput put

E Fget getE

F

Page 136: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

17

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 137: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

17

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 138: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

17

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 139: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 140: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

Page 141: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

Page 142: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

Page 143: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

rollback

Page 144: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

rollback

Page 145: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

rollback

rollback

Page 146: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

dependency

rollback

rollback

Page 147: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

dependency

dependency

rollback

rollback rollback

Page 148: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

Page 149: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

uncoordinated

checkpoint

logging a

message

Page 150: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 151: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

local

rollback

Page 152: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 153: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

get the log and

replay the message

Page 154: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Node 1 Node N

18

UNCOORDINATED CHECKPOINTING (MP)

Proc k Proc 1Proc 1 ... ... Proc k...

Page 155: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

Page 156: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memoryCC

Page 157: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memoryCC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

Page 158: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memoryCC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 159: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 160: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 161: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 162: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 163: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 164: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

Page 165: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

Page 166: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

1

Page 167: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 1

1

Page 168: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

19

RMA: LOGGING PUTS

Proc p Proc q

C

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

1

Page 169: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

20

RMA: LOGGING GETS

Proc p Proc q

C

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 170: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

20

RMA: LOGGING GETS

Proc p Proc q

A

B

C

D

E

F

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 171: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

20

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 172: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 173: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 174: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory memoryD

Page 175: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Page 176: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Logs of gets

Page 177: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Logs of gets

Page 178: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

memoryD

Logs of gets

Page 179: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

memory

Logs of #gets

Data is not

yet valid

memoryD

Logs of gets

Page 180: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

Data is not

yet valid

memoryD

Logs of gets

Page 181: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get

Data is not

yet valid

memoryD

Logs of gets

Page 182: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D1

Logs of gets

Page 183: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D

1

Logs of gets

Page 184: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D

1...

Logs of gets

Page 185: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

D

1...

Logs of gets

Page 186: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

Data is not

yet valid

EC 1memory

DD

1...

Logs of gets

Page 187: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

DD

Data is valid

1...

Logs of gets

Page 188: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

D

Now this data

may be invalid

Data is valid

1...

Logs of gets

Page 189: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

D

Now this data

may be invalid

Data is valid

1...

Logs of gets

Page 190: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

+ #get

+ EC( )

D

Now this data

may be invalid

Data is valid

1...

Logs of gets

D

1

Page 191: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

#get + EC( )

EC 1memory

+ #get

+ EC( )

D

Now this data

may be invalid

Data is valid

delete #get

and EC

1...

Logs of gets

D

1

Page 192: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

D

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

log(#get + EC)

memory

Logs of #gets

EC 1memory

+ #get

+ EC( )

D

Now this data

may be invalid

Data is valid

delete #get

and EC

...Logs of gets

D

1

Page 193: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

21

RMA: LOGGING GETS

Proc p Proc q

Page 194: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

22

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 195: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

22

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 196: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

22

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 197: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

App. dataChpt. data

DataP

Page 198: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

App. data

DataP

Page 199: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

App. data

DataP

Page 200: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

Z

App. data

DataP

Page 201: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

Z

App. data

DataP

Page 202: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataP

Page 203: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataP

Page 204: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataPX + #put + EC( )0

Y + #put + EC( )0

Page 205: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

App. data

DataP

Epoch 0

X + #put + EC( )0

Y + #put + EC( )0

Page 206: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

X + #put + EC( )0

Y + #put + EC( )0

App. data

DataP

Epoch 0

X + #put + EC( )0

Y + #put + EC( )0

Page 207: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X

Y

X + #put + EC( )0

Y + #put + EC( )0

App. data

DataP

Epoch 0

X + #put + EC( )0

Y + #put + EC( )0

Page 208: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memorymemory

23

RMA: RECOVERY

Proc p Proc q

memory

Logs of puts

Stage 2: replay actions beyond the checkpoint

X + #put + EC( )0

Y + #put + EC( )0

App. data

DataPX + #put + EC( )0

Y + #put + EC( )0

Page 209: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

Stage 2: replay actions beyond the checkpoint

DataP

Page 210: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

A

B

C

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

Page 211: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

A

B

C

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

Page 212: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

Page 213: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Stage 2: replay actions beyond the checkpoint

DataP

Page 214: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )2

F + #get + EC( )2

Stage 2: replay actions beyond the checkpoint

DataP

Page 215: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )2

F + #get + EC( )2

Stage 2: replay actions beyond the checkpoint

DataP

Page 216: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )2

F + #get + EC( )2

Stage 2: replay actions beyond the checkpoint

DataP

Page 217: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

E + #get + EC( )2

F + #get + EC( )2

D + #get + EC( )1

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

Page 218: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

E + #get + EC( )2

F + #get + EC( )2

D + #get + EC( )1

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

Page 219: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 220: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 221: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

X + #put + EC( )0

Y + #put + EC( )0

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 222: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 223: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

D + #get + EC( )1

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 224: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 225: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

E + #get + EC( )

F + #get + EC( )2

2

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 226: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

updating

state...

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 227: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory memory

24

RMA: RECOVERY

Proc p Proc q

Logs of puts

X + #put + EC( )0

Y + #put + EC( )0

App. data

D

E

F

Logs of gets

state

restored!

Stage 2: replay actions beyond the checkpoint

DataP

D + #get + EC( )1D

E + #get + EC( )2

F + #get + EC( )2

Page 228: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

25

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 229: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

25

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 230: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

25

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 231: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Page 232: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

Page 233: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A Cray XE/XT

supercomputer

Page 234: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

A Cray XE/XT

supercomputer

Page 235: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis: ...

A Cray XE/XT

supercomputer

Page 236: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis: ...

A Cray XE/XT

supercomputer

...8 blades:

Page 237: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis:

4 nodes:

...

...

A Cray XE/XT

supercomputer

...8 blades:

Page 238: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

Page 239: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A single hardware crash may kill multiple processes

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

Page 240: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A single hardware crash may kill multiple processes

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

Up to 128 process

failures (assuming

1 process per core)

Page 241: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

26

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Today’s supercomputers have a hierarchical layout

A single hardware crash may kill multiple processes

Introduced protocols usually cannot handle > 1 process crash

...4 cabinets:

3 chassis:

4 nodes:

...

...

...

A Cray XE/XT

supercomputer

...8 blades:

32 cores:

Up to 128 process

failures (assuming

1 process per core)

Page 242: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Page 243: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Page 244: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

Page 245: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

A

B

C

Application data

Page 246: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Application data

Page 247: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 248: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 249: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 250: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 251: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 252: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 253: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 254: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Parity data

Page 255: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

G

H

I

J

K

L

M

N

O

P

Q

R

Page 256: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 257: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 258: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 259: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 260: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 261: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 262: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 263: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 264: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

27

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 1: groups of processes: Divide processes into groups of size G each

Add m parity processes to each group to store the parity data

m

G

A

B

C

X

Y

D

E

F

S

T

G

H

I

U

W

J

K

L

Z

V

M

N

O

1

2

P

Q

R

3

4

Page 265: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 266: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups:

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 267: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 268: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 269: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 270: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 271: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 272: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 273: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 274: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 275: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 276: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 277: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 278: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

28

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Step 2: topology-aware distribution of groups: For example, apply topology-awareness at the level of blades...

... and nodes

Proc 5 Proc 6

Proc 7 Proc 8 Proc 11 Proc 12

Proc 13 Proc 14 Proc 17 Proc 18

Proc 19 Proc 20 Proc 23 Proc 24

Proc 19 Proc 20 Proc 21 Proc 22 Proc 23 Proc 24

Proc 1 Proc 2 Proc 3 Proc 4

Proc 9 Proc 10

Proc 15 Proc 16

Proc 21 Proc 22

A D G J M P

Y T W V 2 4

Q

R

3

B

X

E

F

S

H

I

U

K

L

Z

N

O

1

C

Page 279: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

Page 280: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine

Page 281: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

Page 282: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

...

...

...

...

...

Page 283: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

...

...

...

...

...

Page 284: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failure

...

...

...

...

...

Page 285: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failure

...

...

...

...

...

Page 286: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failure

...

...

...

...

...

Page 287: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

...

...

...

...

...

Page 288: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

...

...

...

...

...

Page 289: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

Page 290: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

Page 291: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

Page 292: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

Page 293: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

Page 294: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...

...

...

...

...

Page 295: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...at every

hierarchy level

...

...

...

...

...

Page 296: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

29

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓 =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗(𝑥𝑗 ∩ 𝑥𝑗,𝑐𝑓) =

𝑗=1

𝑥𝑗=1

𝐻𝑗

𝑃𝑗 𝑥𝑗 𝑃𝑗(𝑥𝑗,𝑐𝑓|𝑥𝑗)

Probability that xj elements

of level j will fail and cause

a catastrophic failureProbability

that xj elements

of level j will fail

Probability that xj

given failures at level j

are catastrophic

Every number

of xj elements

is considered...

...at every

hierarchy level

...

...

...

...

...

Page 297: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

30

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

Page 298: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

30

EXTENDING THE PROTOCOLS FOR MORE RESILIENCE

The probability of a catastrophic failure in a multi-level

computing machine A catastrophic failure: a failure that takes place when more than m processes in

the same group die

𝑃𝑐𝑓/𝑑𝑎𝑦:

𝑁𝑟 𝑜𝑓 𝑝𝑎𝑟𝑖𝑡𝑦 𝑝𝑟𝑜𝑐𝑒𝑠𝑠𝑒𝑠

Page 299: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

31

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 300: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

31

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 301: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

31

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 302: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Page 303: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 304: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 305: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 306: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 307: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

coordinated checkpointing

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 308: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

coordinated checkpointing

topology-awareness

uses

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 309: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

[2] J.T.Daly, A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems,

Volume 22 Issue 3, February 2006, Pages 303-312

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

usescoordinated checkpointing

Daly’s [2]

formula

topology-awareness

uses

≈ 2𝑀𝑇

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

Page 310: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

32

HOLISTIC RESILIENCE PROTOCOL FOR RMATHE OVERVIEW

[2] J.T.Daly, A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems,

Volume 22 Issue 3, February 2006, Pages 303-312

Implemented as FTRMA: a portable fault-tolerance library Based on C and foMPI, an available MPI-3 RMA implementation [1]

The layered protocol:

transparent logging

uncoordinated checkpointing

usescoordinated checkpointing

Daly’s [2]

formula

topology-awareness

uses

≈ 2𝑀𝑇

[1] R. Gerstenberger, M. Besta, T. Hoefler, Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One-Sided.

ACM/IEEE Supercomputing 2013, SC13, Best Paper Award

usesdemand

checkpoints

Page 311: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

33

Evaluation on CSCS Monte Rosa

1,496 computing Cray XE6 nodes

47,872 schedulable cores

46TB memory

4 protocols

2 applications

PERFORMANCE

Page 312: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

34

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

Page 313: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

Page 314: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

Page 315: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

f-daly: using Daly’s interval

1-5% slower than no-FT

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

Page 316: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

f-daly: using Daly’s interval

1-5% slower than no-FT

f-no-daly: no Daly’s interval

1-15% slower than no-FT

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

Page 317: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

34

NAS 3D FFT [1] Performance

SCR: a popular checkpoint

/ restart library

21-67% slower than no-FT

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

no-FT: no fault tolerance

f-daly: using Daly’s interval

1-5% slower than no-FT

f-no-daly: no Daly’s interval

1-15% slower than no-FT

PERFORMANCE: COORDINATED CHECKPOINTING

NAS 3D FFT

Page 318: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

35

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

Page 319: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

35

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

Page 320: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

35

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

How the log size impacts the

number of uncoordinated

checkpoints and the

performance of the code?

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

Page 321: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

35

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

How the log size impacts the

number of uncoordinated

checkpoints and the

performance of the code?

PERFORMANCE: UNCOORDINATED CHECKPOINTING

NAS 3D FFT

Page 322: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

36

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

Page 323: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

Page 324: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

no-FT: no fault tolerance

Page 325: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

no-FT: no fault tolerance

FTRMA: logging puts (FFT

code does not use gets)

8-9% slower than no-FT

Page 326: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

36

NAS 3D FFT [1] Performance

[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and

overlap.IPDPS’09

PERFORMANCE: ACCESS LOGGING

NAS 3D FFT

no-FT: no fault tolerance

FTRMA: logging puts (FFT

code does not use gets)

8-9% slower than no-FT

ML: a simple protocol

based on message logging

18% slower than no-FT

Page 327: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

Page 328: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

Page 329: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

Page 330: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

Page 331: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

f-puts: logging puts

12% slower than no-FT

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

Page 332: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

f-puts: logging puts

12% slower than no-FT

f-puts-gets: logging puts

and gets

33% slower than no-FT

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

Page 333: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

37

Distributed Hashtable Performance

PERFORMANCE: ACCESS LOGGING

DISTRIBUTED HASHTABLE

no-FT: no fault tolerance

f-puts: logging puts

12% slower than no-FT

ML: logging puts and gets

40% slower than no-FT

f-puts-gets: logging puts

and gets

33% slower than no-FT

An insert: 1 get and 1 put are logged

A collision: 4 gets and 6 puts are logged

Page 334: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 335: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 336: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 337: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 338: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 339: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 340: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 341: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Holistic fault-tolerance library

Topology-awareness

CC in RMAGeneric model

UC in RMA

Recovery in RMAProofs

38

OVERVIEW OF OUR RESEARCH

PerformanceDistribution of

processesDesign

MP vs. RMA

MP vs. RMA

Model extensions

Deadlock freedom

Correct recovery

Checkpointing

schemes

Optimizations

Checkpoints

on demand

Schemes

Basic scheme

Extended RMA

semantics

Decreasing

failure prob.

Logging accesses

Page 342: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Thanks to:

Paul Hargrove (and the whole UPC team)

and the MPI Forum RMA WG …

… and the institutions:

39

ACKNOWLEDGMENTS

Page 343: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

Thank you

for your attention

40

Page 344: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

Page 345: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

memory

Logs of puts

CC

Page 346: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 347: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 348: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

CC

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 349: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 350: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

record the put

memory

Logs of puts

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 351: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

Page 352: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

Page 353: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

Page 354: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

2

Page 355: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

To replay the actions

preserving ,

we also record

epoch counters

record the put

memory

Logs of putsco

C + #put

+ EC( )

C

Data is valid

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

EC 2

2

Page 356: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

41

RMA: LOGGING PUTS

Proc p Proc q

C

𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔,… , 𝑑𝑎𝑡𝑎

#𝑎 = 𝑠𝑟𝑐, 𝑡𝑟𝑔, …

2

Page 357: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

Page 358: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory memory

Page 359: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

memory

App. data

...

DataPDataP

Page 360: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

memory

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

Page 361: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

memory

Application

state before

the checkpoint

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

Page 362: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

memorymemory

Application

state before

the checkpoint

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

Page 363: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memorymemory

Application

state before

the checkpoint

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

Page 364: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

App. data

Chpt. data

...

...

DataPDataP

Page 365: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

Page 366: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

Page 367: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

Page 368: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataPDataP

Page 369: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

Page 370: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

Page 371: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

Page 372: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

DataP

Page 373: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

Page 374: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

Page 375: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

Page 376: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

Page 377: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

Page 378: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

Page 379: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

Application

state before

the checkpoint

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...

DataP

can clear

the logs

Page 380: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...Application

state some time

later, after some

communication...

DataP

can clear

the logs

Page 381: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...Application

state some time

later, after some

communication...

DataP

can clear

the logs

Page 382: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

DataP take a local

checkpointApp. data

Chpt. data

...

...State lost!

DataP

can clear

the logs

Page 383: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

Integrate

the data

with the

checksum

The Epoch Condition

DataC

Chpt. data

take a local

checkpointApp. data

Chpt. data

...

...State lost!

can clear

the logs

Page 384: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

RMA: CHECKPOINTING

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

The Epoch Condition

DataC

Chpt. data

App. data

Chpt. data

...

...State lost!

Page 385: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

Page 386: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

Page 387: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

Stage 1: restore the state upon checkpointing

Page 388: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

Fetch the data

Stage 1: restore the state upon checkpointing

Page 389: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

Fetch the data

Stage 1: restore the state upon checkpointing

Page 390: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

Page 391: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

Page 392: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!

...

DataP

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

Page 393: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!DataP

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

Page 394: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!DataP

Fetch the data

Decode

the data

Stage 1: restore the state upon checkpointing

Page 395: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

State lost!DataP

Fetch the data

Decode

the data

DataP

Stage 1: restore the state upon checkpointing

Page 396: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

Proc C

a process

that stores

parity data

memory

Parity data

memory

DataC

Chpt. data

App. data

Chpt. data

...

...

RMA: RECOVERY

DataP

Fetch the data

Decode

the data

DataP

Stage 1: restore the state upon checkpointing

Page 397: @spcl eth Fault Tolerance for Remote Memory Access ... · logging memory accesses vs. messages checkpointing in RMA-based applications fault tolerance mechanisms and schemes. spcl.inf.ethz.ch

spcl.inf.ethz.ch

@spcl_eth

memory

42

Proc p Proc q

memory

App. data

memory

Chpt. data

RMA: RECOVERY

DataP