Rebound: Scalable Checkpointing for Coherent Shared Memory
Rishi Agarwal, Pranav Garg, and Josep Torrellas
Department of Computer Science
University of Illinois at Urbana-Champaign
http://iacoma.cs.uiuc.edu
R. Agarwal, P. Garg, J. Torrellas. Rebound: Scalable Checkpointing
Checkpointing in Shared-Memory MPs
• HW-based schemes for small CMPs use global checkpointing
  – All procs participate in system-wide checkpoints
• Global checkpointing is not scalable
  – Synchronization, bursty movement of data, loss in rollback…
[Figure: timeline of P1–P4 under global checkpointing. Checkpoints are saved periodically by all processors; on a fault, all processors roll back to the last checkpoint]
Alternative: Coordinated Local Checkpointing
• Idea: threads coordinate their checkpointing in groups
• Rationale:
  – Faults propagate only through communication
  – Interleaving between non-communicating threads is irrelevant
+ Scalable: checkpoint and rollback in processor groups
– Complexity: must record inter-thread dependences dynamically
[Figure: P1–P5 timelines contrasting a single global checkpoint with independent local checkpoints taken by small processor groups]
Contributions
Rebound: the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol to track inter-thread dependences
• Optimizations to boost checkpointing efficiency:
  • Delaying the write-back of data to safe memory at checkpoints
  • Supporting multiple checkpoints
  • Optimizing checkpointing at barrier synchronization
• Avg. performance overhead for 64 procs: 2%, compared to 15% for global checkpointing
Background: In-Memory Checkpt with ReVive
[Prvulovic-02]
[Figure: ReVive operation for P1–P3. During execution, dirty lines displaced from the caches are written back, with their old memory values logged first. At a checkpoint (CHK), the application stalls, registers are dumped, and all dirty cache lines are written back to memory]
Background: In-Memory Checkpt with ReVive [Prvulovic-02]
[Figure: ReVive recovery. On a fault, the caches are invalidated, the old registers are restored, and the logged memory lines are reverted. ReVive uses a global broadcast protocol; Rebound targets a local, coordinated, scalable protocol]
Coordinated Local Checkpointing Rules
• Banatre et al. used Coordinated Local checkpointing for bus-based machines [Banatre96]
[Figure: P1 writes x, then P2 reads x. If the producer P1 rolls back, the consumer P2 must also roll back; if the consumer P2 checkpoints, the producer P1 must also checkpoint]
• If P checkpoints, P’s producers checkpoint
• If P rolls back, P’s consumers roll back
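The two rules compose transitively: a small sketch (illustrative names, not the paper’s code) that computes which processors must join a checkpoint or a rollback from per-processor dependence sets:

```python
# Hypothetical sketch of the coordinated-local rules: when P checkpoints,
# its (transitive) producers must checkpoint; when P rolls back, its
# (transitive) consumers must roll back.

def closure(start, edges):
    """Transitive closure over a dependence map {proc: set of procs}."""
    result, frontier = {start}, [start]
    while frontier:
        p = frontier.pop()
        for q in edges.get(p, set()):
            if q not in result:
                result.add(q)
                frontier.append(q)
    return result

# Example: P2 consumed data produced by P1; P3 consumed data from P2.
producers = {"P2": {"P1"}, "P3": {"P2"}}
consumers = {"P1": {"P2"}, "P2": {"P3"}}

assert closure("P3", producers) == {"P1", "P2", "P3"}  # P3 checkpoints -> P1, P2 too
assert closure("P1", consumers) == {"P1", "P2", "P3"}  # P1 rolls back -> P2, P3 too
```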
Rebound Fault Model
• Any part of the chip can suffer transient or permanent faults
• A fault can occur even during checkpointing
• Off-chip memory and logs suffer no faults on their own (e.g., NVM)
• Fault detection is outside our scope; its latency has an upper bound of L cycles
[Figure: chip multiprocessor with off-chip main memory and a SW-managed log]
Rebound Architecture
[Figure: Rebound architecture. Each node of the chip multiprocessor has a processor + L1, an L2 cache with directory entries that include an LW-ID field, and Dep registers (MyProducers, MyConsumers) in the L2 cache controller]
• Dependence (Dep) registers in the L2 cache controller:
  • MyProducers: bitmap of processors that produced data consumed by the local processor
  • MyConsumers: bitmap of processors that consumed data produced by the local processor
• Processor ID in each directory entry:
  • LW-ID: the last writer to the line in the current checkpoint interval
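To make the roles of LW-ID and the Dep registers concrete, here is a toy software model (not the actual hardware) of how a dependence could be recorded when a read hits a line last written by another processor in the current interval:

```python
# Illustrative model of Rebound's dependence recording. The directory's
# LW-ID names the last writer of a line in the current interval; the Dep
# registers are per-processor bitmaps (modeled as sets here).

class Proc:
    def __init__(self):
        self.my_producers = set()   # procs whose data I consumed
        self.my_consumers = set()   # procs that consumed my data

def on_write(directory, line, writer):
    directory[line] = writer        # record last writer (LW-ID)

def on_read(directory, procs, line, reader):
    lw = directory.get(line)
    if lw is not None and lw != reader:
        procs[reader].my_producers.add(lw)  # reader consumed lw's data
        procs[lw].my_consumers.add(reader)  # lw produced data for reader

procs = {"P1": Proc(), "P2": Proc()}
directory = {}
on_write(directory, 0x40, "P1")     # P1 writes the line
on_read(directory, procs, 0x40, "P2")  # P2 reads it -> dependence recorded
assert procs["P2"].my_producers == {"P1"}
assert procs["P1"].my_consumers == {"P2"}
```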
Recording Inter-Thread Dependences
Assume MESI protocol
[Figure: P1 writes a line. The line becomes Dirty in P1’s cache and the directory sets LW-ID = P1; the Dep registers are unchanged]
[Figure: P2 reads the line. The dirty data is written back and logged, and the line becomes Shared in P1 and P2. Since LW-ID = P1, the hardware sets P2’s MyProducers to include P1 and P1’s MyConsumers to include P2]
[Figure: P1 writes the line again, invalidating P2’s Shared copy; the line returns to Dirty in P1 and LW-ID remains P1]
[Figure: P1 checkpoints. Its dirty lines are written back and logged, and its Dep registers are cleared. The LW-IDs naming P1 would also have to be cleared, since an LW-ID should remain set only until its line is checkpointed]
Lazily clearing Last Writers
• Clearing all LW-IDs at a checkpoint is an expensive process!
• Instead, a Write Signature encodes all line addresses that the processor has written (or read exclusively) in the current interval
• At a checkpoint, processors simply clear their Write Signature, which may leave stale LW-IDs in the directory
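A minimal sketch of the idea, assuming a Bloom-filter-style signature (the hash layout here is made up): the signature has no false negatives, so if an address misses in it, any LW-ID naming this processor for that line is known to be stale.

```python
# Hedged sketch of lazy LW-ID clearing with a per-processor write signature.
# The point is the membership test: a miss proves the LW-ID is stale.

class WriteSignature:
    def __init__(self, bits=256):
        self.bits = bits
        self.sig = 0
    def insert(self, addr):
        self.sig |= 1 << (hash(addr) % self.bits)
    def may_contain(self, addr):
        # No false negatives; rare false positives are harmless here.
        return bool(self.sig & (1 << (hash(addr) % self.bits)))
    def clear(self):
        # Done once per checkpoint: O(1), no directory walk.
        self.sig = 0

ws = WriteSignature()
ws.insert(0x80)
assert ws.may_contain(0x80)       # written this interval: LW-ID is live
ws.clear()                        # checkpoint: one-shot clear
assert not ws.may_contain(0x80)   # an LW-ID for this line is now known stale
```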
[Figure: P2 reads a line whose directory entry carries a stale LW-ID = P1. The address is checked against P1’s Write Signature; it is not present, so the LW-ID is stale: it is cleared and no dependence is recorded]
Distributed Checkpointing Protocol in SW
• Interaction Set[Pi]: the set of producer processors (transitively) for Pi
  – Built using MyProducers
[Figure: P1 initiates a checkpoint; its interaction set starts as {P1}]
[Figure: P1 sends “Ck?” requests to its producers P2 and P3; the interaction set is still {P1}]
[Figure: P2 and P3 accept, and the “Ck?” requests propagate transitively to their own producers (P3 queries P4); the interaction set grows to {P1, P2, P3}]
[Figure: P4, which has no dependence with the set, declines; accepts and acks flow back to the initiator, and the interaction set remains {P1, P2, P3}]
• Checkpointing is a 2-phase commit protocol.
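The two phases can be sketched as follows (message passing is modeled as a trace of tuples and all names are illustrative): phase 1 floods “Ck?” transitively through MyProducers and gathers accepts; phase 2 broadcasts the commit to the interaction set.

```python
# Toy two-phase sketch of the SW checkpoint protocol. The real protocol
# runs over the network; here "messages" are appended to a trace list.

def run_checkpoint(initiator, my_producers):
    trace = []
    iset, pending = {initiator}, [initiator]
    while pending:                              # phase 1: grow interaction set
        p = pending.pop(0)
        for q in sorted(my_producers.get(p, set()) - iset):
            trace.append((p, q, "Ck?"))         # query producer q
            trace.append((q, p, "Accept"))      # q joins and is queried in turn
            iset.add(q)
            pending.append(q)
    for p in sorted(iset):                      # phase 2: commit to all members
        trace.append((initiator, p, "commit"))
    return iset, trace

iset, trace = run_checkpoint("P1", {"P1": {"P2", "P3"}, "P2": set(), "P3": set()})
assert iset == {"P1", "P2", "P3"}
assert ("P1", "P2", "Ck?") in trace
```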
Distributed Rollback Protocol in SW
• Rollback is handled similarly to the checkpointing protocol:
  – The interaction set is built transitively using MyConsumers
• Rollback involves:
  – Clearing the Dep registers and the Write Signature
  – Invalidating the processor caches
  – Restoring the data and register context from the logs, up to the latest checkpoint
• No domino effect
Optimization 1: Delayed Writebacks
• Checkpointing overhead is dominated by data writebacks
• Delayed Writeback optimization:
  • Processors synchronize and resume execution immediately
  • Hardware automatically writes back dirty lines in the background
  • The checkpoint completes only when all delayed data has been written back
  • Inter-thread dependences on delayed data must still be recorded
[Figure: two timelines. Without the optimization, execution stalls at each checkpoint while dirty lines are written back before interval I2 starts. With delayed writebacks, processors sync briefly, interval I2 starts immediately, and I1’s dirty lines are written back in the background]
Delayed Writeback Pros/Cons
+ Significant reduction in checkpoint overhead
– Additional support: each processor needs two sets of Dep registers and Write Signatures, and each cache line needs a delayed bit
– Increased vulnerability: a rollback event forces both intervals to roll back
Delayed Writeback Protocol
[Figure: P2 reads a line that P1 wrote in interval I1 but whose delayed writeback has not completed. The line’s address is checked against P1’s two write signatures (one misses, the other hits), which selects the interval whose Dep registers (MyProducers0/MyConsumers0 vs. MyProducers1/MyConsumers1) record the dependence]
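A toy model of why two Dep register sets are needed (indexing by interval parity is an illustrative assumption, echoing the WSig0/WSig1 split above): a dependence through a still-delayed line belongs to the previous interval, while ordinary communication belongs to the current one.

```python
# Illustrative sketch, not the hardware: two Dep register sets selected by
# interval parity. A read of a still-delayed line was produced in the
# previous interval, so its dependence is charged there.

dep_regs = {0: {"producers": set()}, 1: {"producers": set()}}

def record_dep(reader_interval, line_is_delayed, producer):
    # Delayed lines carry data from the previous interval.
    idx = (reader_interval - 1) % 2 if line_is_delayed else reader_interval % 2
    dep_regs[idx]["producers"].add(producer)

# During interval I2: one read hits a delayed line from I1, one hits fresh data.
record_dep(reader_interval=2, line_is_delayed=True, producer="P1")   # charged to I1
record_dep(reader_interval=2, line_is_delayed=False, producer="P3")  # charged to I2
assert dep_regs[1]["producers"] == {"P1"}
assert dep_regs[0]["producers"] == {"P3"}
```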
Optimization 2: Multiple Checkpoints
• Problem: fault detection is not instantaneous
  – A checkpoint is safe only after the maximum fault-detection latency (L)
• Solution: keep multiple checkpoints
  – On a fault, roll back the interacting processors to safe checkpoints
  – No domino effect
[Figure: a fault at time tf is detected up to L cycles later; rollback targets checkpoint 1, which predates the fault, and each live checkpoint keeps its own Dep registers (sets 1 and 2)]
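A back-of-the-envelope calculation (not stated on the slides, so treat the formula as an illustration): if checkpoints are taken every T cycles and detection takes at most L cycles, a checkpoint becomes safe only once it is L cycles old, so roughly ceil(L/T) + 1 checkpoints, each with its own Dep registers, must stay live.

```python
# Rough estimate of how many checkpoints (and Dep register sets) must be
# retained, given checkpoint interval T and max fault-detection latency L.
import math

def live_checkpoints(L, T):
    # The newest checkpoint plus every checkpoint younger than L cycles.
    return math.ceil(L / T) + 1

assert live_checkpoints(L=10_000_000, T=5_000_000) == 3
assert live_checkpoints(L=1_000_000, T=5_000_000) == 2
```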
Multiple Checkpoints: Pros/Cons
+ Realistic system: supports non-instantaneous fault detection
– Additional support: each checkpoint has its own Dep registers, which can be recycled only after the fault-detection latency
– Need to track communication across checkpoints
– Combined with delayed writebacks: one more Dep register set
Optimization 3: Hiding Chkpt behind Global Barrier
• Global barriers require that all processors communicate
  – Leads to global checkpoints
• Optimization:
  – Proactively trigger a global checkpoint at a global barrier
  – Hide the checkpoint overhead behind barrier imbalance spins
Hiding Checkpoint behind Global Barrier
Lock
  count++
  if (count == numProc) I_am_last = TRUE  /* local var */
Unlock
if (I_am_last) {
  count = 0
  flag = TRUE
  ...
} else
  while (!flag) {}
Hiding Checkpoint behind Global Barrier
• The first arriving processor initiates the checkpoint
• Others: hardware writes back their data as execution proceeds to the barrier
• The checkpoint commits as the last processor arrives
• After the barrier: few interacting processors
[Figure: processors P1, P2, and P3 execute the barrier code. The first arriver sends “BarCK?” to the others; spinning processors notify as their updates are written back; the last arriver sets flag = TRUE, and the checkpoint interaction set ICHK grows to {P1, P2, P3}]
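A runnable sketch of the slide’s barrier with the optimization’s hook points stubbed out (the checkpoint actions are placeholders of my own naming; only the synchronization mirrors the pseudocode):

```python
# Sense-style barrier matching the slide's pseudocode, with hypothetical
# hooks for the optimization: the first arriver initiates the checkpoint,
# the last arriver commits it, and the spin on `flag` is where the
# background writeback of dirty lines would be hidden.
import threading

num_proc = 4
count, flag = 0, False
lock = threading.Lock()
events = []

def barrier(my_id):
    global count, flag
    i_am_last = False
    with lock:
        if count == 0:
            events.append(("initiate_chkpt", my_id))  # first arriver
        count += 1
        if count == num_proc:
            i_am_last = True
    if i_am_last:
        events.append(("commit_chkpt", my_id))        # last arriver commits
        count = 0
        flag = True
    else:
        while not flag:                               # spin hides the writebacks
            pass

threads = [threading.Thread(target=barrier, args=(i,)) for i in range(num_proc)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert events[0][0] == "initiate_chkpt"
assert events[-1][0] == "commit_chkpt"
```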
Evaluation Setup
• Analysis tool using Pin + the SESC cycle-accurate simulator + DRAMsim
• Applications: SPLASH-2, some PARSEC, Apache
• Simulated CMP architecture with up to 64 threads
• Checkpoint interval: 5 to 8 ms
• Modeled environments:
  • Global: baseline global checkpointing
  • Rebound: local checkpointing scheme with delayed writebacks
  • Rebound_NoDWB: Rebound without delayed writebacks
Avg. Interaction Set: Set of Producer Processors
• For most apps, the interaction set is a small subset of the 64 processors
  – Justifies coordinated local checkpointing
  – Averages are pulled up by global barriers
[Figure: average interaction-set size per application, out of 64 processors]
Checkpoint Execution Overhead
• Rebound’s avg. checkpoint execution overhead is 2%, compared to 15% for Global
[Figure: % checkpoint overhead (0 to 40%) of Global, Rebound_NoDWB, and Rebound for Barnes, Cholesky, Fft, Fmm, Radix, Lu-C, Lu-NC, Volrend, Water-Sp, Water-Nsq, Radiosity, Ocean, Raytrace, and SP; averages 15% for Global vs. 2% for Rebound]
• Delayed Writebacks complement local checkpointing
Rebound Scalability
• Rebound is scalable in checkpoint overhead (constant problem size)
• Delayed writebacks help scalability
Also in the Paper
• Delayed writebacks are also useful in Global
• The barrier optimization is effective, but not universally applicable
• Power increase due to hardware additions: < 2%
• Rebound increases coherence traffic by only 4%
Conclusions
Rebound: the first HW-based scheme for scalable, coordinated local checkpointing in coherent shared memory
• Leverages the directory protocol
• Boosts checkpointing efficiency with:
  • Delayed write-backs
  • Multiple checkpoints
  • The barrier optimization
• Avg. execution overhead for 64 procs: 2%
• Future work:
  • Apply Rebound to non-hardware-coherent machines
  • Scalability to hierarchical directories