Checksum Strategies for Data in Volatile Memory

Authors: Humayun Arafat (Ohio State), Sriram Krishnamoorthy (PNNL), P. Sadayappan (Ohio State)

Motivation

• In exascale systems, failures will increase further due to the increasing number of processors

• The typical current approach to fault tolerance is to checkpoint to stable storage

• Soft errors can affect individual data blocks

• Multiple data blocks might be corrupted before the corruption can be efficiently detected

• We focus on developing an approach that can tolerate multiple hard errors and soft errors

Fault Tolerant Data in Volatile Memory

• Efficient checksum-based approach to fault tolerance for data in volatile memory systems

• The developed scheme is applicable in multiple scenarios
– Online recovery of large read-only data structures with low storage overhead
– Online recovery from soft errors in blocked data
– Online recovery of read/write data via in-memory checkpointing

• The approach uses a logical multi-dimensional view of the data to be protected

Design

• Recover exact data

• Inspiration from Algorithm-Based Fault Tolerance (ABFT)

• Low overhead

Checksum Design

• Checksum operator
– XOR

• Multi-dimensional checksums
– Increase tolerance

• Checksums co-located with data
– Reduce space overhead

• Distributed checksums
– Reduce overhead and increase tolerance
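The choice of XOR as the checksum operator ties in with the goal of recovering exact data. A minimal sketch of why, assuming blocks of 64-bit floating-point values (the helper names `to_words`, `to_floats`, and `xor_checksum` are hypothetical and not taken from the slides): XOR over the raw bit patterns restores a lost block bit for bit, whereas a floating-point sum used the same way can lose the value to rounding.

```python
import struct
from functools import reduce
from operator import xor


def to_words(values):
    """Reinterpret each float64 as its raw 64-bit integer pattern."""
    return [struct.unpack("<Q", struct.pack("<d", v))[0] for v in values]


def to_floats(words):
    """Inverse of to_words: 64-bit patterns back to float64 values."""
    return [struct.unpack("<d", struct.pack("<Q", w))[0] for w in words]


def xor_checksum(blocks):
    """Element-wise XOR across equally sized blocks of 64-bit words."""
    return [reduce(xor, column) for column in zip(*blocks)]


# Three illustrative data blocks (values chosen for this example only).
values = [[0.1, 1e300], [0.2, -3.5], [0.3, 1e-300]]
blocks = [to_words(v) for v in values]
checksum = xor_checksum(blocks)

# Suppose block 1 is lost: XOR the survivors with the checksum.
recovered = to_floats(xor_checksum([blocks[0], blocks[2], checksum]))

# The same idea with a floating-point sum as the checksum operator:
# the small value is absorbed by rounding and cannot be recovered.
sum_checksum = values[0][1] + values[1][1] + values[2][1]
sum_recovered = sum_checksum - values[0][1] - values[2][1]

print(recovered)       # exact copy of the lost block
print(sum_recovered)   # not -3.5
```

XOR is its own inverse, so the same routine serves for both checksum calculation and recovery.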

One Dimensional Checksum

Recover checksum

Recover data
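The two recovery cases named here can be sketched as follows. This is an illustrative single-process model, not the authors' distributed implementation: each list entry stands in for the block held by one process, `None` marks a lost block, and `xor_blocks` and `recover_1d` are hypothetical helper names.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def recover_1d(data, checksum):
    """Repair a single lost block (marked None) in a 1D checksum group.

    Returns the repaired (data, checksum) pair.
    """
    lost = [i for i, block in enumerate(data) if block is None]
    if checksum is None:
        if lost:
            raise ValueError("a 1D checksum tolerates only one lost block")
        # Recover checksum: recompute it from the intact data blocks.
        return data, xor_blocks(data)
    if len(lost) > 1:
        raise ValueError("a 1D checksum tolerates only one lost block")
    if lost:
        # Recover data: XOR the surviving blocks with the checksum.
        survivors = [block for block in data if block is not None]
        data = list(data)
        data[lost[0]] = xor_blocks(survivors + [checksum])
    return data, checksum


original = [b"aaaa", b"bbbb", b"cccc"]
checksum = xor_blocks(original)

# Case 1: the checksum block is lost.
data_a, checksum_a = recover_1d(original, None)

# Case 2: one data block is lost.
data_b, checksum_b = recover_1d([original[0], None, original[2]], checksum)
```

With a single checksum, a second simultaneous loss in the same group cannot be repaired, which is the limitation the multi-dimensional schemes address.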

Two Dimensional Checksum

Checksum and Data Distribution

Two Dimensional Checksum

Recovery

Checksum calculation
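A rough sketch of checksum calculation and recovery over a two-dimensional logical view, under simplifying assumptions that are not from the slides: each block is a single integer word, the checksum blocks themselves stay intact, and lost blocks are marked `None`. The names `checksums_2d` and `recover_2d` are hypothetical. Recovery repeatedly repairs any row or column that has exactly one lost block, so a loss pattern that defeats the row checksums can still be resolved through the columns.

```python
from functools import reduce
from operator import xor


def checksums_2d(grid):
    """One XOR checksum per row and one per column of the block grid."""
    rows = [reduce(xor, row) for row in grid]
    cols = [reduce(xor, col) for col in zip(*grid)]
    return rows, cols


def recover_2d(grid, rows, cols):
    """Fill None entries in place; return True if every block was restored."""
    progress = True
    while progress:
        progress = False
        for r, row in enumerate(grid):
            missing = [c for c, v in enumerate(row) if v is None]
            if len(missing) == 1:
                # XOR of the row checksum with the surviving row entries.
                row[missing[0]] = reduce(
                    xor, (v for v in row if v is not None), rows[r])
                progress = True
        for c in range(len(cols)):
            column = [grid[r][c] for r in range(len(grid))]
            missing = [r for r, v in enumerate(column) if v is None]
            if len(missing) == 1:
                # XOR of the column checksum with the surviving column entries.
                grid[missing[0]][c] = reduce(
                    xor, (v for v in column if v is not None), cols[c])
                progress = True
    return all(v is not None for row in grid for v in row)


original = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
rows, cols = checksums_2d(original)

# Two losses in row 0 plus one in row 1: rows alone cannot fix row 0.
damaged = [[None, None, 3], [4, None, 6], [7, 8, 9]]
ok = recover_2d(damaged, rows, cols)

# A 2x2 square of losses leaves every affected row and column with two gaps.
stuck = [[None, None, 3], [None, None, 6], [7, 8, 9]]
ok_stuck = recover_2d(stuck, rows, cols)
```

In this model the first pattern is fully restored while the second is not, which illustrates that the tolerable failure count depends on where the failures fall, not only on how many there are.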

Three Dimensional Checksum

Three Dimensional Checksum Distribution

Checksum Overhead

– One Dimension

– Two Dimension

– Three Dimension

– d Dimension
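As a rough illustration only, and not the authors' expressions: assume the data is viewed as a d-dimensional grid with n blocks along each side, and that one checksum block is kept for every axis-aligned line of n blocks. Under that assumption there are n^d data blocks and d·n^(d-1) checksum blocks, so the space overhead is d/n. The function name `checksum_overhead` and the block counts below are hypothetical.

```python
from fractions import Fraction


def checksum_overhead(n, d):
    """Checksum blocks divided by data blocks for an n^d block grid.

    Assumes one checksum block per axis-aligned line of n data blocks.
    """
    data_blocks = n ** d
    checksum_blocks = d * n ** (d - 1)
    return Fraction(checksum_blocks, data_blocks)


# The same 4096 data blocks viewed as a 1D, 2D, or 3D grid.
for d, n in [(1, 4096), (2, 64), (3, 16)]:
    print(f"{d}D, n={n}: overhead = {checksum_overhead(n, d)}")
```

Under this counting, adding dimensions raises the overhead for a fixed number of blocks, while growing the block count lowers it for a fixed dimension.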

Experiments

• Cray XE6 system (NERSC Hopper)

• 6384 nodes with Gemini interconnect

• Peak bandwidth 8.3 GB/s per direction

• Two twelve-core 2.1 GHz AMD ‘MagnyCours’ processors per node (24 cores per node) and 32 GB DDR3 memory

• Intel C++ compiler 13 and Cray MPI 6.0.1

Checksum Calculation Time 1D, 2D and 3D

Fault Recovery

Soft Error

• A soft error can change the data in memory

• The unit of failure is a block of data inside the process, not the entire process

• Low overhead compared to entire process failure

• Fewer tolerable failures

Soft Error Equations

1D block

2D block

2D Soft Error Checksum

2D Soft Error Recovery
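Unlike a process failure, a soft error does not announce which block was affected, so recovery has to locate the corrupted block first. A minimal sketch of one way to do that with row and column checksums over the blocks inside a process, assuming a single corrupted block, intact stored checksums, and one integer word per block (all simplifications for illustration; `locate_and_repair` is a hypothetical name): the corrupted block sits where the mismatching row and the mismatching column intersect.

```python
from functools import reduce
from operator import xor


def locate_and_repair(grid, row_cs, col_cs):
    """Find and fix a single corrupted block using stored 2D checksums.

    Returns the (row, column) of the repaired block, or None if clean.
    """
    bad_rows = [r for r, row in enumerate(grid)
                if reduce(xor, row) != row_cs[r]]
    bad_cols = [c for c, col in enumerate(zip(*grid))
                if reduce(xor, col) != col_cs[c]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one block appears to be corrupted")
    r, c = bad_rows[0], bad_cols[0]
    # Rebuild the block from the rest of its row and the row checksum.
    others = (v for j, v in enumerate(grid[r]) if j != c)
    grid[r][c] = reduce(xor, others, row_cs[r])
    return r, c


clean = [[11, 22, 33], [44, 55, 66], [77, 88, 99]]
row_cs = [reduce(xor, row) for row in clean]
col_cs = [reduce(xor, col) for col in zip(*clean)]

# Simulate a soft error: flip one bit in the block at row 2, column 1.
blocks = [row[:] for row in clean]
blocks[2][1] ^= 0x10

location = locate_and_repair(blocks, row_cs, col_cs)
```

The sketch raises an error rather than guessing when several rows or columns mismatch, since the intersection no longer identifies a unique block.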

Summary

• In-memory checkpointing, low-overhead protection for read-only data, recovery from soft errors

• XOR-based checksum to recover exact data

• Multidimensional checksum calculation to increase fault tolerance

• Co-location of the checksums with the data

• Scalable design to ensure low space overhead

Thank you

Questions?
