Checksum Strategies for Data in Volatile Memory
Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)
Motivation
• In exascale systems, failures will further increase due to the increasing number of processors
• The typical current approach to fault tolerance is to checkpoint to stable storage
• Soft errors can affect individual data blocks
• Multiple data blocks might be corrupted before the errors can be detected
• We focus on developing an approach that tolerates multiple hard errors and soft errors
Fault-Tolerant Data in Volatile Memory
• Efficient checksum-based approach to fault tolerance for data in volatile memory systems
• The developed scheme is applicable in multiple scenarios:
  – Online recovery of large read-only data structures with low storage overhead
  – Online recovery from soft errors in blocked data
  – Online recovery of read/write data via in-memory checkpointing
• The approach uses a logical multi-dimensional view of the data to be protected
Design
• Recover exact data
• Inspiration from Algorithm-Based Fault Tolerance (ABFT)
• Low overhead
Checksum Design
• Checksum operator
  – XOR
• Multi-dimensional checksums
  – Increase fault tolerance
• Checksum co-located with data
  – Reduce space overhead
• Distributed checksum
  – Reduce overhead and increase tolerance
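The XOR checksum operator described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the block contents and the helper name `xor_checksum` are invented for the example.

```python
# Minimal sketch of the XOR checksum operator over fixed-size data blocks.
# Block contents below are illustrative, not taken from the slides.

def xor_checksum(blocks):
    """XOR equal-sized blocks together; the result is one block."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data = [b"\x01\x02", b"\x04\x08", b"\x10\x20"]
checksum = xor_checksum(data)
assert checksum == b"\x15\x2a"
# Because XOR is its own inverse, the checksum block plus any
# N-1 surviving data blocks determine the remaining block.
```

XOR is attractive here because it recovers the *exact* original bits (unlike lossy floating-point checksum sums in some ABFT schemes) and is cheap to compute.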
One Dimensional Checksum
(Figures: one-dimensional checksum layout; recovering a lost checksum; recovering a lost data block)
Two Dimensional Checksum
Checksum and Data Distribution
Two Dimensional Checksum
(Figures: checksum calculation and recovery)
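A two-dimensional layout can be sketched as follows: data blocks form an n x n grid with one XOR checksum per row and per column, so more simultaneous block losses become recoverable than in the 1D case. The 3x3 grid and helper names are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of a two-dimensional checksum: data blocks in an n x n
# grid, one XOR checksum per row and per column. Two lost blocks that
# fall in different rows can each be rebuilt from their own row.
n = 3
grid = [[bytes([3 * r + c + 1]) for c in range(n)] for r in range(n)]

def xor_all(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

row_sum = [xor_all(grid[r]) for r in range(n)]
col_sum = [xor_all([grid[r][c] for r in range(n)]) for c in range(n)]

# Lose grid[0][1] and grid[2][2] (different rows and columns):
lost = {(0, 1), (2, 2)}
for r, c in lost:
    survivors = [grid[r][j] for j in range(n) if (r, j) not in lost]
    rebuilt = xor_all([row_sum[r]] + survivors)
    assert rebuilt == grid[r][c]
```

When two losses share a row, the column checksums can be used instead, which is the source of the increased tolerance.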
Three Dimensional Checksum
Three Dimensional Checksum Distribution
Checksum Overhead
– One dimension
– Two dimensions
– Three dimensions
– d dimensions
(The overhead expressions shown on the slide are not preserved in this transcript.)
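Since the slide's overhead expressions are not preserved, here is a hedged estimate under one plausible layout assumption: n data blocks per dimension plus one checksum "slice" per dimension, so an (n+1)^d grid holds n^d data blocks. The paper's exact counts may differ; this is only the standard parity-grid estimate.

```python
# Hedged sketch of checksum space overhead for a d-dimensional layout,
# ASSUMING (n+1)^d total blocks hold n^d data blocks (one checksum
# slice per dimension). Not necessarily the paper's exact formula.
def overhead(n, d):
    return ((n + 1) ** d - n ** d) / n ** d

for d in (1, 2, 3):
    print(d, overhead(16, d))  # roughly d/n = d/16 for large n
```

Under this assumption the overhead grows roughly linearly in the dimension d while shrinking as 1/n, which matches the slide's theme of low space overhead at scale.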
Experiments
• Cray XE6 system (NERSC Hopper)
• 6384 nodes with Gemini interconnect
• Peak bandwidth 8.3 GB/s per direction
• Two twelve-core 2.1 GHz AMD 'MagnyCours' processors per node (24 cores per node) and 32 GB DDR3 memory
• Intel C++ compiler 13 and Cray MPI 6.0.1
Checksum Calculation Time: 1D, 2D, and 3D
(Figure: checksum calculation time for the 1D, 2D, and 3D schemes)
Fault Recovery
Soft Error
• Soft errors can silently change data in memory
• The unit of failure is a block of data inside a process, not the entire process
• Lower overhead compared to handling an entire process failure
• Fewer failures can be tolerated
Soft Error
(Figure)
Soft Error Equations
(Equations for the 1D-block and 2D-block cases; not preserved in this transcript)
2D Soft Error Checksum
2D Soft Error Recovery
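The 2D soft-error checksum and recovery slides can be sketched as below. Unlike an erasure, a soft error corrupts a block silently, so it must first be *located*: exactly one row checksum and one column checksum mismatch, and their intersection identifies the corrupted block. The grid contents and helper names are illustrative assumptions.

```python
# Sketch of soft-error detection and recovery with 2D checksums: a
# silently corrupted block makes one row checksum and one column
# checksum mismatch; their intersection locates the block, which is
# then rebuilt by XOR. Contents are illustrative, not from the paper.
n = 3
grid = [[bytes([3 * r + c + 1]) for c in range(n)] for r in range(n)]

def xor_all(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, x in enumerate(b):
            out[i] ^= x
    return bytes(out)

row_sum = [xor_all(grid[r]) for r in range(n)]
col_sum = [xor_all([grid[r][c] for r in range(n)]) for c in range(n)]

grid[1][2] = b"\xff"  # inject a soft error into one block

# Locate: the single mismatching row and column intersect at the error.
bad_row = next(r for r in range(n) if xor_all(grid[r]) != row_sum[r])
bad_col = next(c for c in range(n)
               if xor_all([grid[r][c] for r in range(n)]) != col_sum[c])
assert (bad_row, bad_col) == (1, 2)

# Recover: rebuild the located block from its row checksum and neighbours.
grid[bad_row][bad_col] = xor_all(
    [row_sum[bad_row]] + [grid[bad_row][j] for j in range(n) if j != bad_col])
assert grid[bad_row][bad_col] == bytes([3 * 1 + 2 + 1])
```

Because only one block inside the process is rebuilt, this is much cheaper than restarting or re-checkpointing the whole process, as the preceding slide notes.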
Summary
• In-memory checkpointing; low-overhead protection for read-only data; recovery from soft errors
• XOR-based checksum to recover exact data
• Multi-dimensional checksum calculation to increase fault tolerance
• Co-location of the checksums with the data
• Scalable design to ensure low space overhead
Thank You! Questions?