Rebound: Scalable Checkpointing for Coherent Shared Memory. Rishi Agarwal, Pranav Garg, and Josep Torrellas, Department of Computer Science, University of Illinois at Urbana-Champaign, http://iacoma.cs.uiuc.edu


Post on 23-Feb-2016

Page 1: Rebound: Scalable Checkpointing for Coherent Shared Memory

Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas

Department of Computer Science

University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu

Page 2: Rebound: Scalable Checkpointing for Coherent Shared Memory

R. Agarwal, P. Garg, J. Torrellas. Rebound: Scalable Checkpointing

Checkpointing in Shared-Memory MPs

• HW-based schemes for small CMPs use global checkpointing
  – All processors participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…

[Figure: timeline of processors P1 to P4 taking global checkpoints ("save chkpt"); a fault triggers a rollback to the last checkpoint.]

Page 3: Rebound: Scalable Checkpointing for Coherent Shared Memory

Alternative: Coordinated Local Checkpointing

• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant

+ Scalable: checkpoint and rollback in processor groups
- Complexity: inter-thread dependences must be recorded dynamically

[Figure: a global checkpoint spanning P1 to P5, compared with local checkpoints taken by groups of processors among P1 to P5.]

Page 4: Rebound: Scalable Checkpointing for Coherent Shared Memory

Contributions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  – Delaying write-back of data to safe memory at checkpoints
  – Supporting multiple checkpoints
  – Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%
  – Compared to 15% for global checkpointing

Page 5: Rebound: Scalable Checkpointing for Coherent Shared Memory

Background: In-Memory Checkpt with ReVive

[Prvulovic-02]

[Figure: processors P1 to P3 with caches, main memory, and a log. During execution, writes (W) dirty cache lines; writebacks (WB) caused by displacement are logged, saving the old memory values. At a checkpoint (CHK) the application stalls while registers are dumped and dirty cache lines are written back.]

Page 6: Rebound: Scalable Checkpointing for Coherent Shared Memory

Background: In-Memory Checkpt with ReVive

[Prvulovic-02]

[Figure: processors P1 to P3 with caches, main memory, and a log. On a fault, the old register state is restored, the caches are invalidated, and memory lines are reverted from the log. The slide contrasts a global broadcast protocol with a local, coordinated, scalable protocol.]

Page 7: Rebound: Scalable Checkpointing for Coherent Shared Memory

Coordinated Local Checkpointing Rules

• Banatre et al. used coordinated local checkpointing for bus-based machines [Banatre96]

[Figure: P1 writes x (wr x) and P2 then reads x (rd x), so P1 is the producer and P2 the consumer. A producer rollback forces a consumer rollback; a consumer checkpoint forces a producer checkpoint.]

• P checkpoints ⇒ P’s producers checkpoint
• P rolls back ⇒ P’s consumers roll back

Page 8: Rebound: Scalable Checkpointing for Coherent Shared Memory

Rebound Fault Model

• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no fault on their own (e.g., NVM)
• Fault detection is outside our scope:
  – Fault detection latency has an upper bound of L cycles

[Figure: a chip multiprocessor attached to main memory, which holds the log (in SW).]

Page 9: Rebound: Scalable Checkpointing for Coherent Shared Memory

Rebound Architecture

[Figure: chip multiprocessor and main memory. Each node has a processor with its L1 (P+L1), an L2 cache with a Dep register pair (MyProducer, MyConsumer), and a directory whose entries carry an LW-ID field.]

Page 11: Rebound: Scalable Checkpointing for Coherent Shared Memory

Rebound Architecture

• Dependence (Dep) registers in the L2 cache controller:
  – MyProducers: bitmap of processors that produced data consumed by the local processor
  – MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  – LW-ID: last writer to the line in the current checkpoint interval

Page 12: Rebound: Scalable Checkpointing for Coherent Shared Memory

Recording Inter-Thread Dependences

Assume MESI protocol

[Figure: P1 writes a line. The line is dirty (D) in P1's cache and the directory entry's LW-ID is set to P1. The MyProducers and MyConsumers registers of P1 and P2 are still empty.]

Page 13: Rebound: Scalable Checkpointing for Coherent Shared Memory

Recording Inter-Thread Dependences

Assume MESI protocol

[Figure: P2 reads the line. P1's dirty copy is written back, the old memory value is logged, and the line becomes shared (S). Because LW-ID = P1, P2 is added to P1's MyConsumers and P1 is added to P2's MyProducers.]

Page 14: Rebound: Scalable Checkpointing for Coherent Shared Memory

Recording Inter-Thread Dependences

Assume MESI protocol

[Figure: P1 writes the line again. The line is dirty (D) in P1 with LW-ID = P1; the Dep registers still hold the dependence recorded when P2 read the line (P2 in P1's MyConsumers, P1 in P2's MyProducers).]

Page 15: Rebound: Scalable Checkpointing for Coherent Shared Memory

Recording Inter-Thread Dependences

Assume MESI protocol

[Figure: P1 checkpoints. Its dirty lines are written back and the old values logged; P1 then clears its Dep registers and its LW-IDs.]

• LW-ID should remain set until the line is checkpointed
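The four steps on the preceding slides (write, remote read, second write, checkpoint) can be sketched in C as follows. This is a minimal software model under stated assumptions, not the hardware design: the type and function names (`DepRegs`, `DirEntry`, `on_write`, `on_read`, `on_checkpoint`) are hypothetical, the processor count is fixed at 64 so each bitmap fits in one word, and the checkpoint clears LW-IDs eagerly by scanning the directory.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PROCS 64        /* assumed system size: one bit per processor */
#define NO_WRITER (-1)      /* hypothetical marker for "no LW-ID set" */

/* Per-processor Dep registers, modeled as bitmaps. */
typedef struct {
    uint64_t my_producers;  /* processors whose data this processor consumed */
    uint64_t my_consumers;  /* processors that consumed this processor's data */
} DepRegs;

/* Directory entry for one line; only the LW-ID field is modeled. */
typedef struct {
    int lw_id;
} DirEntry;

static uint64_t proc_bit(int p) {
    assert(p >= 0 && p < NUM_PROCS);
    return (uint64_t)1 << p;
}

/* A write makes the writer the line's last writer for this interval. */
static void on_write(DirEntry *e, int writer) {
    e->lw_id = writer;
}

/* A read of a line last written by another processor records the
 * producer/consumer dependence on both sides. */
static void on_read(const DirEntry *e, DepRegs regs[], int reader) {
    int w = e->lw_id;
    if (w == NO_WRITER || w == reader) return;
    regs[w].my_consumers |= proc_bit(reader);
    regs[reader].my_producers |= proc_bit(w);
}

/* Eager variant of a checkpoint by processor p: clear p's Dep registers
 * and every LW-ID that names p. */
static void on_checkpoint(DirEntry dir[], int n_lines, DepRegs regs[], int p) {
    regs[p].my_producers = 0;
    regs[p].my_consumers = 0;
    for (int i = 0; i < n_lines; i++)
        if (dir[i].lw_id == p) dir[i].lw_id = NO_WRITER;
}
```

The directory scan in `on_checkpoint` is the cost that the next slide calls expensive; it is kept here only to make the baseline behavior explicit.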

Page 16: Rebound: Scalable Checkpointing for Coherent Shared Memory

Lazily clearing Last Writers

• Clearing LW-IDs at every checkpoint is an expensive process
• Write Signature: encodes all line addresses that the processor has written to (or read exclusively) in the current interval
• At a checkpoint, the processors clear their Write Signature
  – LW-IDs in the directory are left in place and may be stale

Page 17: Rebound: Scalable Checkpointing for Coherent Shared Memory

Lazily clearing Last Writers

[Figure: P2 reads a line whose directory entry still names P1 as LW-ID. The address is checked against P1's Write Signature (WSig); the answer is NO, so the LW-ID is stale and is cleared, and the Dep registers are left unchanged.]
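The stale-LW-ID check shown above can be sketched as a small C model. The slides do not say how the Write Signature is encoded, so this sketch assumes a Bloom-filter-style bit vector with two illustrative hash functions; the names (`WriteSig`, `wsig_insert`, `wsig_may_contain`, `wsig_clear`, `resolve_last_writer`) and the 1024-bit size are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define WSIG_BITS 1024      /* assumed signature size (2^10 bits) */
#define NO_WRITER (-1)      /* hypothetical marker for "no valid last writer" */

typedef struct {
    uint64_t bits[WSIG_BITS / 64];
} WriteSig;

/* Two simple hashes of the line address, each in [0, WSIG_BITS). */
static unsigned wsig_h1(uint64_t line_addr) {
    return (unsigned)(line_addr % WSIG_BITS);
}
static unsigned wsig_h2(uint64_t line_addr) {
    /* multiplicative hash, keeping the top 10 bits */
    return (unsigned)((line_addr * 0x9E3779B97F4A7C15ULL) >> 54);
}

static void wsig_set(WriteSig *s, unsigned i) {
    s->bits[i / 64] |= (uint64_t)1 << (i % 64);
}
static bool wsig_test(const WriteSig *s, unsigned i) {
    return (s->bits[i / 64] >> (i % 64)) & 1;
}

/* Record that the processor wrote (or read exclusively) this line. */
void wsig_insert(WriteSig *s, uint64_t line_addr) {
    wsig_set(s, wsig_h1(line_addr));
    wsig_set(s, wsig_h2(line_addr));
}

/* May return true for an address never inserted, never false for one that was. */
bool wsig_may_contain(const WriteSig *s, uint64_t line_addr) {
    return wsig_test(s, wsig_h1(line_addr)) && wsig_test(s, wsig_h2(line_addr));
}

/* At a checkpoint the processor clears its own signature only. */
void wsig_clear(WriteSig *s) {
    memset(s, 0, sizeof *s);
}

/* Returns lw_id if the named writer's signature confirms the address,
 * or NO_WRITER when the LW-ID is stale and should be cleared. */
int resolve_last_writer(int lw_id, const WriteSig sigs[], uint64_t line_addr) {
    if (lw_id == NO_WRITER) return NO_WRITER;
    return wsig_may_contain(&sigs[lw_id], line_addr) ? lw_id : NO_WRITER;
}
```

With this encoding a false positive can only record a dependence that did not exist, which is conservative; it cannot hide a real one.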

Page 18: Rebound: Scalable Checkpointing for Coherent Shared Memory

Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers

[Figure: among processors P1 to P4, P1 initiates a checkpoint. InteractionSet: P1.]

Page 22: Rebound: Scalable Checkpointing for Coherent Shared Memory

Distributed Checkpointing Protocol in SW

• Interaction Set [Pi]: set of producer processors (transitively) for Pi
  – Built using MyProducers
• Checkpointing is a 2-phase commit protocol

[Figure: P1 initiates a checkpoint and Ck? requests are sent to P2, P3, and P4. P2 and P3 reply Accept; P4 replies Decline, which is acknowledged (Ack). Final InteractionSet: P1, P2, P3.]
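The transitive construction of the interaction set can be illustrated with a short C sketch. It is a centralized simplification under stated assumptions: the real protocol is distributed and message-based, and a contacted processor may decline, which this sketch ignores. The function name `interaction_set` and the array-of-bitmaps input are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_PROCS 64    /* assumed system size: one bit per processor */

/* my_producers[p] has bit q set when q produced data that p consumed.
 * Returns the bitmap of processors reachable from the initiator by
 * repeatedly following MyProducers, including the initiator itself. */
uint64_t interaction_set(const uint64_t my_producers[NUM_PROCS], int initiator) {
    assert(initiator >= 0 && initiator < NUM_PROCS);
    uint64_t set = (uint64_t)1 << initiator;
    uint64_t frontier = set;            /* members whose producers are not yet added */
    while (frontier != 0) {
        uint64_t next = 0;
        for (int p = 0; p < NUM_PROCS; p++)
            if (frontier & ((uint64_t)1 << p))
                next |= my_producers[p];
        frontier = next & ~set;         /* keep only newly discovered processors */
        set |= next;
    }
    return set;
}
```

Passing an array of MyConsumers bitmaps instead yields the set of processors that must roll back together, since the rollback slide builds its interaction set the same way over MyConsumers.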

Page 23: Rebound: Scalable Checkpointing for Coherent Shared Memory

Distributed Rollback Protocol in SW

• Rollback is handled similarly to the checkpointing protocol:
  – Interaction set is built transitively using MyConsumers
• Rollback involves
  – Clearing the Dep registers and Write Signature
  – Invalidating the processor caches
  – Restoring the data and register context from the logs up to the latest checkpoint
• No domino effect

Page 24: Rebound: Scalable Checkpointing for Coherent Shared Memory

Optimization 1: Delayed Writebacks

• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization
  – Processors synchronize and resume execution
  – Hardware automatically writes back dirty lines in the background
  – Checkpoint is complete only when all delayed data has been written back
  – Still need to record inter-thread dependences on delayed data

[Figure: two timelines, each showing Interval I1, a checkpoint bracketed by two syncs, and Interval I2. In one, processors stall while dirty lines are written back (WB dirty lines); in the other, the stall is shorter and the dirty lines are written back after execution resumes.]

Page 25: Rebound: Scalable Checkpointing for Coherent Shared Memory

Delayed Writeback Pros/Cons

+ Significant reduction in checkpoint overhead
- Additional support:
    Each processor has two sets of Dep registers and Write Signatures
    Each cache line has a Delayed bit
- Increased vulnerability:
    A rollback event forces both intervals to roll back

Page 26: Rebound: Scalable Checkpointing for Coherent Shared Memory

Delayed Writeback protocol

[Figure: P2 reads a line with LW-ID = P1 while two register sets are live (MyProducers0/MyConsumers0 with WSig0, and MyProducers1/MyConsumers1 with WSig1). The address is looked up in both Write Signatures; one answers NO and the other YES, and the dependence is recorded in the corresponding sets (labels shown: MyConsumers0 = P2, MyProducers1 = P1). The line is written back and the old value logged.]

Page 27: Rebound: Scalable Checkpointing for Coherent Shared Memory

Optimization 2: Multiple Checkpoints

• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the maximum fault-detection latency (L)
• Solution: keep multiple checkpoints
  – On a fault, roll back interacting processors to safe checkpoints
• No domino effect

[Figure: timeline with Ckpt 1 and Ckpt 2, each with its own Dep registers (Dep registers 1, Dep registers 2). A fault at time tf is detected after the fault-detection latency and triggers a rollback.]
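The safety rule above can be turned into a small worked example in C. This is a sketch under stated assumptions: times are in cycles, checkpoint times are sorted in ascending order, and a checkpoint taken at time c is treated as safe once c + L has passed without a detected fault. The function name `latest_safe_checkpoint` and the boundary choice (`<=`) are illustrative, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns the index of the most recent checkpoint that is safe to roll
 * back to when a fault is detected at detect_time, or -1 if none is.
 * A fault detected at detect_time may have occurred as early as
 * detect_time - max_latency, so newer checkpoints may be corrupted. */
int latest_safe_checkpoint(const uint64_t ckpt_time[], size_t n,
                           uint64_t detect_time, uint64_t max_latency) {
    assert(ckpt_time != NULL || n == 0);
    int best = -1;
    for (size_t i = 0; i < n; i++) {
        if (ckpt_time[i] + max_latency <= detect_time)
            best = (int)i;      /* input is ascending, so the last hit is the newest */
    }
    return best;
}
```

For checkpoints at cycles 100, 200, and 300 with L = 50, a fault detected at cycle 320 rolls back to the checkpoint at 200, because the one at 300 has not yet aged past the detection latency.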

Page 28: Rebound: Scalable Checkpointing for Coherent Shared Memory

Multiple Checkpoints: Pros/Cons

+ Realistic system: supports non-instantaneous fault detection
- Additional support:
    Each checkpoint has its own Dep registers
    Dep registers can be recycled only after the fault-detection latency
- Need to track communication across checkpoints
- Combination with Delayed Writebacks: one more Dep register set

Page 29: Rebound: Scalable Checkpointing for Coherent Shared Memory

Optimization 3: Hiding Chkpt behind Global Barrier

• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide checkpoint overhead behind barrier imbalance spins

Page 30: Rebound: Scalable Checkpointing for Coherent Shared Memory

Hiding Checkpoint behind Global Barrier

    Lock
      count++
      if (count == numProc)
        Iam_last = TRUE      /* local var */
    Unlock
    if (Iam_last) {
      count = 0
      flag = TRUE            /* releases the spinning processors */
      …
    } else
      while (!flag) {}       /* spin until the last processor arrives */

Page 31: Rebound: Scalable Checkpointing for Coherent Shared Memory

Hiding Checkpoint behind Global Barrier

• First arriving processor initiates the checkpoint
• Others: HW writes back data as execution proceeds to the barrier
• Commit checkpoint as the last processor arrives
• After the barrier: few interacting processors

[Figure: the barrier code from the previous slide executed by processors P1, P2, and P3. Updates to count trigger BarCK? messages that are answered with Notify; the interaction sets shown are ICHK = {P3}, ICHK = {P2, P3}, and ICHK = {P1, P3}. The last processor sets flag = TRUE while the others spin on while(!flag).]
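The arrival rules in the bullets above can be sketched as a single-threaded C model that decides what each processor does when it reaches the barrier. It is illustrative only: it models the arrival order, not the locking, the spinning on flag, or the BarCK?/Notify messages, and the names (`BarrierCkpt`, `BarrierAction`, `barrier_arrive`) are hypothetical.

```c
#include <assert.h>

/* What a processor does on reaching the barrier. */
typedef enum {
    CK_INITIATE,    /* first arrival: start the global checkpoint */
    CK_JOIN,        /* middle arrival: join the checkpoint in progress */
    CK_COMMIT       /* last arrival: commit the checkpoint, release the barrier */
} BarrierAction;

typedef struct {
    int num_procs;  /* processors participating in the barrier */
    int count;      /* arrivals so far in this barrier episode */
} BarrierCkpt;

/* Called once per processor per barrier episode, in arrival order. */
BarrierAction barrier_arrive(BarrierCkpt *b) {
    assert(b->num_procs > 0 && b->count >= 0 && b->count < b->num_procs);
    b->count++;
    if (b->count == b->num_procs) {
        b->count = 0;           /* reset for the next episode, as in the slide's code */
        return CK_COMMIT;
    }
    return (b->count == 1) ? CK_INITIATE : CK_JOIN;
}
```

In this model a single-processor barrier commits immediately, since the only arrival is both first and last.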

Page 32: Rebound: Scalable Checkpointing for Coherent Shared Memory

Evaluation Setup

• Analysis tool using Pin + SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5–8 ms
• Modeled several environments:
  – Global: baseline global checkpointing
  – Rebound: local checkpointing scheme with delayed writebacks
  – Rebound_NoDWB: Rebound without the delayed writebacks

Page 33: Rebound: Scalable Checkpointing for Coherent Shared Memory

Avg. Interaction Set: Set of Producer Processors

• Most apps: the interaction set is small
  – Justifies coordinated local checkpointing
  – Averages are brought up by global barriers

[Figure: chart of average interaction set size; the values 64 and 38 appear as labels.]

Page 34: Rebound: Scalable Checkpointing for Coherent Shared Memory

Checkpoint Execution Overhead

• Rebound’s avg. checkpoint execution overhead is 2%
  – Compared to 15% for Global

[Figure: bar chart of % checkpoint overhead (axis 0 to 40) for Global, Rebound_NoDWB, and Rebound across Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and SP2; the values 2 and 15 are called out.]

Page 35: Rebound: Scalable Checkpointing for Coherent Shared Memory

Checkpoint Execution Overhead

• Rebound’s avg. checkpoint execution overhead is 2%
  – Compared to 15% for Global
• Delayed Writebacks complement local checkpointing

[Figure: bar chart of % checkpoint overhead (axis 0 to 40) for Global, Rebound_NoDWB, and Rebound across the same applications, Barnes through Raytrace, and SP2.]

Page 36: Rebound: Scalable Checkpointing for Coherent Shared Memory

Rebound Scalability

• Rebound is scalable in checkpoint overhead
• Delayed Writebacks help scalability

[Figure: scalability plot; label: constant problem size.]

Page 37: Rebound: Scalable Checkpointing for Coherent Shared Memory

Also in the Paper

• Delayed writebacks are also useful in Global
• Barrier optimization is effective but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound leads to only a 4% increase in coherence traffic

Page 38: Rebound: Scalable Checkpointing for Coherent Shared Memory

Conclusions

Rebound: first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory

• Leverages directory protocol
• Boosts checkpointing efficiency:
  – Delayed write-backs
  – Multiple checkpoints
  – Barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  – Apply Rebound to non-hardware-coherent machines
  – Scalability to hierarchical directories

Page 39: Rebound: Scalable Checkpointing for Coherent Shared Memory

Rebound: Scalable Checkpointing for Coherent Shared Memory

Rishi Agarwal, Pranav Garg, and Josep Torrellas

Department of Computer Science

University of Illinois at Urbana-Champaign

http://iacoma.cs.uiuc.edu