Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
R. Agarwal, P. Garg, J. Torrellas. Rebound: Scalable Checkpointing
Checkpointing in Shared-Memory MPs
• HW-based schemes for small CMPs use global checkpointing
  – All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…
[Figure: timeline of P1–P4 under global checkpointing. Checkpoints are saved periodically by all processors; on a fault, all processors roll back to the last checkpoint]
Alternative: Coordinated Local Checkpointing
• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant
+ Scalable: checkpoint and rollback in processor groups
– Complexity: must record inter-thread dependences dynamically
[Figure: P1–P5 timelines contrasting a single global checkpoint with independent local checkpoints taken by small processor groups]
Contributions
Rebound: the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying the write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%, compared to 15% for global checkpointing
Background: In-Memory Checkpt with ReVive
[Prvulovic-02]
[Figure: ReVive operation for P1–P3. During execution, dirty lines displaced from the caches are written back, with their old memory values logged first. At a checkpoint (CHK), the application stalls, registers are dumped, and all dirty cache lines are written back to memory]
Background: In-Memory Checkpt with ReVive [Prvulovic-02]
[Figure: ReVive recovery. On a fault, the caches are invalidated, the old registers are restored, and the logged memory lines are reverted. ReVive uses a global broadcast protocol; Rebound targets a local, coordinated, scalable protocol]
Coordinated Local Checkpointing Rules
• Banatre et al. used Coordinated Local checkpointing for bus-based machines [Banatre96]
[Figure: P1 writes x, then P2 reads x. If the producer P1 rolls back, the consumer P2 must also roll back; if the consumer P2 checkpoints, the producer P1 must also checkpoint]
• If P checkpoints, P’s producers checkpoint
• If P rolls back, P’s consumers roll back
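The two rules compose transitively: a small sketch (illustrative names, not the paper’s code) that computes which processors must join a checkpoint or a rollback from per-processor dependence sets:

```python
# Hypothetical sketch of the coordinated-local rules: when P checkpoints,
# its (transitive) producers must checkpoint; when P rolls back, its
# (transitive) consumers must roll back.

def closure(start, edges):
    """Transitive closure over a dependence map {proc: set of procs}."""
    result, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in edges.get(p, set()):
            if q not in result:
                result.add(q)
                frontier.append(q)
    return result

# Example: P2 consumed data produced by P1; P3 consumed data from P2.
producers = {"P2": {"P1"}, "P3": {"P2"}}
consumers = {"P1": {"P2"}, "P2": {"P3"}}

assert closure("P3", producers) == {"P1", "P2", "P3"}  # P3 checkpoints -> P1, P2 too
assert closure("P1", consumers) == {"P1", "P2", "P3"}  # P1 rolls back -> P2, P3 too
```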
Rebound Fault Model
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope; its latency has an upper bound of L cycles
[Figure: chip multiprocessor with off-chip main memory and a SW-managed log]
Rebound Architecture
[Figure: Rebound architecture. Each node of the chip multiprocessor has a processor + L1, an L2 cache with directory entries that include an LW-ID field, and Dep registers (MyProducers, MyConsumers) in the L2 cache controller]
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  • LW-ID: the last writer to the line in the current checkpoint interval
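To make the roles of LW-ID and the Dep registers concrete, here is a toy software model (not the actual hardware) of how a dependence could be recorded when a read hits a line last written by another processor in the current interval:

```python
# Illustrative model of Rebound's dependence recording. The directory's
# LW-ID names the last writer of a line in the current interval; the Dep
# registers are per-processor bitmaps (modeled as sets here).

class Proc:
    def __init__(self):
        self.my_producers = set()   # procs whose data I consumed
        self.my_consumers = set()   # procs that consumed my data

def on_write(directory, line, writer):
    directory[line] = writer        # record last writer (LW-ID)

def on_read(directory, procs, line, reader):
    lw = directory.get(line)
    if lw is not None and lw != reader:
        procs[reader].my_producers.add(lw)  # reader consumed lw's data
        procs[lw].my_consumers.add(reader)  # lw produced data for reader

procs = {"P1": Proc(), "P2": Proc()}
directory = {}
on_write(directory, 0x40, "P1")     # P1 writes the line
on_read(directory, procs, 0x40, "P2")  # P2 reads it -> dependence recorded
assert procs["P2"].my_producers == {"P1"}
assert procs["P1"].my_consumers == {"P2"}
```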
Recording Inter-Thread Dependences
Assume MESI protocol
[Figure: P1 writes a line. The line becomes Dirty in P1’s cache and the directory sets LW-ID = P1; the Dep registers are unchanged]
[Figure: P2 reads the line. The dirty data is written back and logged, and the line becomes Shared in P1 and P2. Since LW-ID = P1, the hardware sets P2’s MyProducers to include P1 and P1’s MyConsumers to include P2]
[Figure: P1 writes the line again, invalidating P2’s Shared copy; the line returns to Dirty in P1 and LW-ID remains P1]
[Figure: P1 checkpoints. Its dirty lines are written back and logged, and its Dep registers are cleared. The LW-IDs naming P1 would also have to be cleared, since an LW-ID should remain set only until its line is checkpointed]
Lazily clearing Last Writers
• Clearing all LW-IDs at a checkpoint is an expensive process!
• Instead, a Write Signature encodes all line addresses that the processor has written (or read exclusively) in the current interval
• At a checkpoint, processors simply clear their Write Signature, which may leave stale LW-IDs in the directory
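A minimal sketch of the idea, assuming a Bloom-filter-style signature (the hash layout here is made up): the signature has no false negatives, so if an address misses in it, any LW-ID naming this processor for that line is known to be stale.

```python
# Hedged sketch of lazy LW-ID clearing with a per-processor write signature.
# The point is the membership test: a miss proves the LW-ID is stale.

class WriteSignature:
    def __init__(self, bits=256):
        self.bits = bits
        self.sig = 0
    def insert(self, addr):
        self.sig |= 1 << (hash(addr) % self.bits)
    def may_contain(self, addr):
        # No false negatives; rare false positives are harmless here.
        return bool(self.sig & (1 << (hash(addr) % self.bits)))
    def clear(self):
        # Done once per checkpoint: O(1), no directory walk.
        self.sig = 0

ws = WriteSignature()
ws.insert(0x80)
assert ws.may_contain(0x80)       # written this interval: LW-ID is live
ws.clear()                        # checkpoint: one-shot clear
assert not ws.may_contain(0x80)   # an LW-ID for this line is now known stale
```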
[Figure: P2 reads a line whose directory entry carries a stale LW-ID = P1. The address is checked against P1’s Write Signature; it is not present, so the LW-ID is stale: it is cleared and no dependence is recorded]
Distributed Checkpointing Protocol in SW
• Interaction Set[Pi]: the set of producer processors (transitively) for Pi
  – Built using MyProducers
[Figure: P1 initiates a checkpoint; its interaction set starts as {P1}]
[Figure: P1 sends “Ck?” requests to its producers P2 and P3; the interaction set is still {P1}]
[Figure: P2 and P3 accept, and the “Ck?” requests propagate transitively to their own producers (P3 queries P4); the interaction set grows to {P1, P2, P3}]
[Figure: P4, which has no dependence with the set, declines; accepts and acks flow back to the initiator, and the interaction set remains {P1, P2, P3}]
• Checkpointing is a 2-phase commit protocol.
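The two phases can be sketched as follows (message passing is modeled as a trace of tuples and all names are illustrative): phase 1 floods “Ck?” transitively through MyProducers and gathers accepts; phase 2 broadcasts the commit to the interaction set.

```python
# Toy two-phase sketch of the SW checkpoint protocol. The real protocol
# runs over the network; here "messages" are appended to a trace list.

def run_checkpoint(initiator, my_producers):
    trace = []
    iset, pending = {initiator}, [initiator]
    while pending:                              # phase 1: grow interaction set
        p = pending.pop(0)
        for q in sorted(my_producers.get(p, set()) - iset):
            trace.append((p, q, "Ck?"))         # query producer q
            trace.append((q, p, "Accept"))      # q joins and is queried in turn
            iset.add(q)
            pending.append(q)
    for p in sorted(iset):                      # phase 2: commit to all members
        trace.append((initiator, p, "commit"))
    return iset, trace

iset, trace = run_checkpoint("P1", {"P1": {"P2", "P3"}, "P2": set(), "P3": set()})
assert iset == {"P1", "P2", "P3"}
assert ("P1", "P2", "Ck?") in trace
```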
Distributed Rollback Protocol in SW
• Rollback is handled similarly to the checkpointing protocol:
  – The interaction set is built transitively using MyConsumers
• Rollback involves:
  – Clearing the Dep registers and the Write Signature
  – Invalidating the processor caches
  – Restoring the data and register context from the logs, up to the latest checkpoint
• No domino effect
Optimization 1: Delayed Writebacks
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution immediately
  • Hardware automatically writes back dirty lines in the background
  • The checkpoint completes only when all delayed data has been written back
  • Inter-thread dependences on delayed data must still be recorded
[Figure: two timelines. Without the optimization, execution stalls at each checkpoint while dirty lines are written back before interval I2 starts. With delayed writebacks, processors sync briefly, interval I2 starts immediately, and I1’s dirty lines are written back in the background]
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
– Additional support: each processor needs two sets of Dep registers and Write Signatures, and each cache line needs a delayed bit
– Increased vulnerability: a rollback event forces both intervals to roll back
Delayed Writeback Protocol
[Figure: P2 reads a line that P1 wrote in interval I1 but whose delayed writeback has not completed. The line’s address is checked against P1’s two write signatures (one misses, the other hits), which selects the interval whose Dep registers (MyProducers0/MyConsumers0 vs. MyProducers1/MyConsumers1) record the dependence]
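A toy model of why two Dep register sets are needed (indexing by interval parity is an illustrative assumption, echoing the WSig0/WSig1 split above): a dependence through a still-delayed line belongs to the previous interval, while ordinary communication belongs to the current one.

```python
# Illustrative sketch, not the hardware: two Dep register sets selected by
# interval parity. A read of a still-delayed line was produced in the
# previous interval, so its dependence is charged there.

dep_regs = {0: {"producers": set()}, 1: {"producers": set()}}

def record_dep(reader_interval, line_is_delayed, producer):
    # Delayed lines carry data from the previous interval.
    idx = (reader_interval - 1) % 2 if line_is_delayed else reader_interval % 2
    dep_regs[idx]["producers"].add(producer)

# During interval I2: one read hits a delayed line from I1, one hits fresh data.
record_dep(reader_interval=2, line_is_delayed=True, producer="P1")   # charged to I1
record_dep(reader_interval=2, line_is_delayed=False, producer="P3")  # charged to I2
assert dep_regs[1]["producers"] == {"P1"}
assert dep_regs[0]["producers"] == {"P3"}
```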
Optimization 2: Multiple Checkpoints
• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the maximum fault-detection latency (L)
• Solution: keep multiple checkpoints
  – On a fault, roll back the interacting processors to safe checkpoints
  – No domino effect
[Figure: a fault at time tf is detected up to L cycles later; rollback targets checkpoint 1, which predates the fault, and each live checkpoint keeps its own Dep registers (sets 1 and 2)]
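A back-of-the-envelope calculation (not stated on the slides, so treat the formula as an illustration): if checkpoints are taken every T cycles and detection takes at most L cycles, a checkpoint becomes safe only once it is L cycles old, so roughly ceil(L/T) + 1 checkpoints, each with its own Dep registers, must stay live.

```python
# Rough estimate of how many checkpoints (and Dep register sets) must be
# retained, given checkpoint interval T and max fault-detection latency L.
import math

def live_checkpoints(L, T):
    # The newest checkpoint plus every checkpoint younger than L cycles.
    return math.ceil(L / T) + 1

assert live_checkpoints(L=10_000_000, T=5_000_000) == 3
assert live_checkpoints(L=1_000_000, T=5_000_000) == 2
```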
Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection
– Additional support: each checkpoint has its own Dep registers, which can be recycled only after the fault-detection latency
– Need to track communication across checkpoints
– Combined with delayed writebacks: one more Dep register set
Optimization 3: Hiding Chkpt behind Global Barrier
• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide the checkpoint overhead behind barrier imbalance spins
Hiding Checkpoint behind Global Barrier
Lock
  count++
  if (count == numProc) I_am_last = TRUE  /* local var */
Unlock
if (I_am_last) {
  count = 0
  flag = TRUE
  ...
} else
  while (!flag) {}
Hiding Checkpoint behind Global Barrier
• The first arriving processor initiates the checkpoint
• Others: hardware writes back their data as execution proceeds to the barrier
• The checkpoint commits as the last processor arrives
• After the barrier: few interacting processors
[Figure: processors P1, P2, and P3 execute the barrier code. The first arriver sends “BarCK?” to the others; spinning processors notify as their updates are written back; the last arriver sets flag = TRUE, and the checkpoint interaction set ICHK grows to {P1, P2, P3}]
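A runnable sketch of the slide’s barrier with the optimization’s hook points stubbed out (the checkpoint actions are placeholders of my own naming; only the synchronization mirrors the pseudocode):

```python
# Sense-style barrier matching the slide's pseudocode, with hypothetical
# hooks for the optimization: the first arriver initiates the checkpoint,
# the last arriver commits it, and the spin on `flag` is where the
# background writeback of dirty lines would be hidden.
import threading

num_proc = 4
count, flag = 0, False
lock = threading.Lock()
events = []

def barrier(my_id):
    global count, flag
    i_am_last = False
    with lock:
        if count == 0:
            events.append(("initiate_chkpt", my_id))  # first arriver
        count += 1
        if count == num_proc:
            i_am_last = True
    if i_am_last:
        events.append(("commit_chkpt", my_id))        # last arriver commits
        count = 0
        flag = True
    else:
        while not flag:                               # spin hides the writebacks
            pass

threads = [threading.Thread(target=barrier, args=(i,)) for i in range(num_proc)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert events[0][0] == "initiate_chkpt"
assert events[-1][0] == "commit_chkpt"
```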
Evaluation Setup
• Analysis tool using Pin + the SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5 to 8 ms
• Modeled environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without delayed writebacks
Avg. Interaction Set: Set of Producer Processors
• For most apps, the interaction set is a small subset of the 64 processors
  – Justifies coordinated local checkpointing
  – Averages are pulled up by global barriers
[Figure: average interaction-set size per application, out of 64 processors]
Checkpoint Execution Overhead
• Rebound’s avg. checkpoint execution overhead is 2%, compared to 15% for Global
[Figure: % checkpoint overhead (0 to 40%) of Global, Rebound_NoDWB, and Rebound for Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and SP; averages 15% for Global vs. 2% for Rebound]
• Delayed Writebacks complement local checkpointing
Rebound Scalability
• Rebound is scalable in checkpoint overhead (constant problem size)
• Delayed writebacks help scalability
Also in the Paper
• Delayed writebacks are also useful in Global
• The barrier optimization is effective, but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound increases coherence traffic by only 4%
Conclusions
Rebound: the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol
• Boosts checkpointing efficiency with:
  • Delayed write-backs
  • Multiple checkpoints
  • The barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories