UW-Madison Computer Sciences Multifacet Group © 2011
Karma: Scalable Deterministic Record-Replay
Arkaprava Basu, Jayaram Bobba
Mark D. Hill
Work done at University of Wisconsin-Madison
Executive summary
• Applications of deterministic record-replay
  – Debugging
  – Fault tolerance
  – Security
• Existing hardware record-replayers
  – Fast record, but
  – Slow replay, or
  – Require major hardware changes
• Karma: Faster replay with nearly-conventional h/w
  – Extends Rerun
  – Records more parallelism
Outline
• Background & Motivation
• Rerun Overview
• Karma Insights
• Karma Implementation
• Evaluation
• Conclusion
Deterministic Record-Replay
• Multi-threaded execution is non-deterministic
• Deterministic record-replay reincarnates a past execution
• Record:
  – Record selective events in a log
• Replay:
  – Use the log to reincarnate the past execution
• Key Challenge: Memory races
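The record and replay bullets above can be illustrated with a toy model. This is a minimal sketch, not how a hardware recorder works: it assumes a made-up two-thread program (`THREADS`) whose updates do not commute, and it logs the complete scheduling order, which is far more than a real recorder would store. All names here are hypothetical.

```python
import random

# Each thread is a list of updates applied to one shared variable.
# The updates do not commute, so the final value depends on the interleaving.
THREADS = {
    0: [lambda x: x + 1, lambda x: x + 1],
    1: [lambda x: x * 2, lambda x: x * 2],
}

def execute(pick_next):
    """Run all threads to completion; pick_next(ready) chooses who runs next."""
    pc = {tid: 0 for tid in THREADS}   # per-thread program counter
    shared, order = 1, []
    while True:
        ready = [t for t in THREADS if pc[t] < len(THREADS[t])]
        if not ready:
            return shared, order
        tid = pick_next(ready)
        shared = THREADS[tid][pc[tid]](shared)
        pc[tid] += 1
        order.append(tid)

def record(seed):
    """Record: run under a random scheduler and log the scheduling order."""
    rng = random.Random(seed)
    return execute(lambda ready: rng.choice(ready))

def replay(log):
    """Replay: force the logged order instead of asking the scheduler."""
    it = iter(log)
    return execute(lambda ready: next(it))[0]
```

Calling `record(seed)` returns the final value together with the log, and `replay(log)` reproduces that value regardless of which interleaving the random scheduler happened to pick.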
Record-Replay Motivation
• Debugging
  – Ensures bugs faithfully reappear (no heisenbugs)
• Fault tolerance
  – Enables a hot backup to shadow the primary server & take over on failure
• Security
  – Real-time intrusion detection & attack analysis
Replay speed matters
Previous work
• Record Dependence
  – Wisconsin Flight Data Recorder [ISCA’03, etc.]: Too much state
  – UCSD Strata [ASPLOS’06]: Log size grows rapidly with #cores
• Record Independence
  – UIUC DeLorean [ISCA’08]: Non-conventional BulkSC H/W
  – Wisconsin Rerun [ISCA’08]: Sequential replay
  – Intel MRR [MICRO’09]: Only for snoop-based systems
  – Timetraveler [ISCA’10]: Extends Rerun to lower log size
• Our Goal
  – Retain Rerun’s near-conventional hardware
  – Enable faster replay
Rerun’s Recording
• Most code executes without races
  – Use race-free regions for ordering
• Episodes: independent execution regions
  – Defined per thread
[Figure: load (LD) and store (ST) sequences of threads T0, T1 and T2, partitioned into per-thread episodes]
Partially adopted from ISCA’08 talk
Rerun’s Recording (Contd.)
• Capturing causality:
  – Timestamp via Lamport scalar clock [Lamport ’78]
• Replay in timestamp order
  – Episodes with the same timestamp can be replayed in parallel
[Figure: episodes of T0, T1 and T2 labeled with timestamps 22, 23, 43, 44, 45, 60, 61 and 62]
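The Lamport scalar clock used for timestamping can be sketched as follows. This is a simplified model under stated assumptions, not Rerun's hardware: the `Core` class, `conflict` helper and `demo` scenario are hypothetical, and a conflict is modeled as one call that ends the earlier core's episode and carries its timestamp to the later core.

```python
class Core:
    def __init__(self, name):
        self.name = name
        self.ts = 0      # Lamport scalar clock of this core
        self.log = []    # timestamps of finished episodes, in program order

    def end_episode(self):
        """Close the current episode, log its timestamp and advance the clock."""
        finished = self.ts
        self.log.append(finished)
        self.ts = finished + 1
        return finished

def conflict(earlier, later):
    """`later` touches data that `earlier`'s current episode accessed first."""
    sent = earlier.end_episode()         # the timestamp travels with the reply
    later.ts = max(later.ts, sent + 1)   # Lamport rule: move past the sender

def demo():
    t0, t1, t2 = Core("T0"), Core("T1"), Core("T2")
    conflict(t0, t1)   # T0's episode must replay before T1's
    conflict(t1, t2)   # T1's episode must replay before T2's
    t0.end_episode()
    t2.end_episode()
    return t0.log, t1.log, t2.log
```

In `demo()`, the causally ordered episodes receive increasing timestamps, while T0's second episode and T1's first episode share a timestamp and so could be replayed in parallel.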
Rerun’s Replay
[Figure: replay timeline for T0, T1 and T2; episodes run in timestamp order TS=22, TS=43, TS=44 (two episodes), TS=45, TS=60, TS=61]
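Replay in timestamp order can be sketched as grouping episodes into "waves". The timestamps below are the ones shown on the slide, but the assignment of each episode to a thread is a guess made for illustration, and `timestamp_waves` is a hypothetical helper.

```python
from itertools import groupby

def timestamp_waves(episodes):
    """Group (timestamp, thread) episodes into waves replayed one after another.

    Episodes inside one wave share a timestamp and may run in parallel;
    the waves themselves run strictly in timestamp order.
    """
    ordered = sorted(episodes)
    return [[thread for _, thread in group]
            for _, group in groupby(ordered, key=lambda e: e[0])]

# Timestamps from the slide; thread assignment is hypothetical.
EPISODES = [(22, "T1"), (43, "T0"), (44, "T1"), (44, "T2"),
            (45, "T1"), (60, "T2"), (61, "T0")]
WAVES = timestamp_waves(EPISODES)
```

With this input, seven episodes fall into six waves and only the two episodes with timestamp 44 overlap, which is why scalar-clock replay ends up close to sequential.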
Karma’s Insight 1:
• Capture order with a DAG (not a scalar clock)
Recording: DAG captured with episode predecessor & successor sets
[Figure: episodes of T0, T1 and T2 labeled 22, 23, 43, 44, 45, 60, 61 and 62]
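Ordering by a DAG means an episode may start as soon as its own predecessors have finished, rather than waiting for every episode with a smaller timestamp. A minimal sketch, assuming a made-up six-episode graph (`PREDS`) in which program order within a thread is also expressed as a predecessor; `dag_rounds` is a hypothetical helper, not Karma's replayer.

```python
def dag_rounds(preds):
    """Return replay rounds: an episode runs once all its predecessors have run."""
    done, rounds = set(), []
    remaining = dict(preds)
    while remaining:
        ready = sorted(e for e, p in remaining.items() if p <= done)
        if not ready:
            raise ValueError("cycle in episode graph")
        rounds.append(ready)
        done.update(ready)
        for e in ready:
            del remaining[e]
    return rounds

# Hypothetical episode graph: episode -> set of predecessor episodes.
PREDS = {
    "T1.a": set(),
    "T2.a": set(),
    "T0.a": {"T1.a"},
    "T1.b": {"T1.a"},
    "T2.b": {"T2.a"},
    "T0.b": {"T0.a", "T2.a"},
}
ROUNDS = dag_rounds(PREDS)
```

Here six episodes finish in three rounds because unrelated episodes are never serialized against each other.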
Karma’s Insight 1:
[Figure: side-by-side replay timelines for T0, T1 and T2, one labeled "Rerun’s Replay" and one labeled "Karma’s Replay"]
Karma’s Insight 1: (Contd.)
• Naïve approach: DAG arcs point to episodes
  – Episodes represented by integers
  – Too much log size overhead!
• Our approach: DAG arcs point to cores
  – Recording: Only one “active” episode per core
  – Replay: Send wakeup message(s) to core(s) of successor episode(s)
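The core-pointing scheme can be sketched as a small simulation. This is an illustration under assumptions, not Karma's protocol: each per-core log entry is modeled as a pair `(pred_cores, succ_cores)`, wakeups from a given core are assumed to be consumed in the order they were sent, and `replay_with_wakeups` and `LOGS` are hypothetical names.

```python
from collections import Counter

def replay_with_wakeups(logs):
    """logs[c] lists core c's episodes; each entry is (pred_cores, succ_cores)."""
    next_ep = {c: 0 for c in logs}           # index of each core's next episode
    wakeups = {c: Counter() for c in logs}   # wakeups[c][p]: pending wakeups from p
    order = []
    progress = True
    while progress:
        progress = False
        for c in logs:
            if next_ep[c] == len(logs[c]):
                continue                     # this core has replayed everything
            preds, succs = logs[c][next_ep[c]]
            if all(wakeups[c][p] > 0 for p in preds):
                for p in preds:
                    wakeups[c][p] -= 1       # consume one wakeup per predecessor core
                order.append((c, next_ep[c]))
                next_ep[c] += 1
                for s in succs:
                    wakeups[s][c] += 1       # wake the core of the successor episode
                progress = True
    return order

# Hypothetical three-core log: core 0's first episode precedes core 1's,
# and core 2's episode precedes core 0's second episode.
LOGS = {
    0: [(set(), {1}), ({2}, set())],
    1: [({0}, set())],
    2: [(set(), {0})],
}
ORDER = replay_with_wakeups(LOGS)
```

Because only one episode is active per core while recording, naming the core is enough to identify the episode on the other end of an arc, so no episode identifiers need to be logged.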
Karma’s Insight 1:
[Figure: episodes of T0, T1 and T2 (labeled 22, 43, 44, 44, 60, 61 and 62) with one log entry expanded]
Anatomy of a log entry: 84 0|0|1 0|0|1
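One way to read the example entry is as a reference count followed by two per-core bit vectors. The field meanings below are assumptions inferred from the REFS, PRED and SUCC names used elsewhere in the deck; the slide does not label the fields, and the bit order (leftmost bit is core 0) is also a guess.

```python
from typing import NamedTuple, Tuple

class LogEntry(NamedTuple):
    refs: int               # assumed: memory references in the episode (REFS)
    pred: Tuple[int, ...]   # assumed: one bit per core, predecessor cores (PRED)
    succ: Tuple[int, ...]   # assumed: one bit per core, successor cores (SUCC)

def parse_entry(text):
    """Parse an entry written as 'refs p|p|p s|s|s'."""
    refs, pred, succ = text.split()

    def bits(field):
        return tuple(int(b) for b in field.split("|"))

    return LogEntry(int(refs), bits(pred), bits(succ))

def cores(bit_vector):
    """Indices of the cores whose bit is set."""
    return [i for i, b in enumerate(bit_vector) if b]

ENTRY = parse_entry("84 0|0|1 0|0|1")
```

Under these assumptions the entry describes an 84-reference episode with one predecessor core and one successor core, and its size grows with the number of cores rather than with the number of episodes.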
Karma’s Insight 2:
• Not necessary to end the episode on every conflict
  – As long as the episodes can be ordered during replay
[Figure: load (LD) and store (ST) sequences of threads T0, T1 and T2 grouped into episodes]
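The idea of letting episodes grow past conflicts can be sketched with an explicit cycle check. This is an illustration only: the slides do not state the rule Karma's hardware uses to decide when an episode must end, and a global reachability search like `reaches` is a software stand-in rather than something the hardware would do. `Recorder` and its methods are hypothetical.

```python
def reaches(arcs, src, dst):
    """True if dst is reachable from src by following recorded arcs."""
    stack, seen = [src], set()
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(arcs.get(node, ()))
    return False

class Recorder:
    def __init__(self, threads):
        self.current = {t: (t, 0) for t in threads}  # active episode per thread
        self.arcs = {}                               # episode -> successor episodes
        self.ended = 0                               # episodes ended by conflicts

    def conflict(self, earlier, later):
        """`later` is about to touch data `earlier`'s active episode touched first."""
        src, dst = self.current[earlier], self.current[later]
        if reaches(self.arcs, dst, src):
            # dst already precedes src, so an arc src -> dst would close a cycle:
            # end the later thread's episode and order its next episode after src.
            thread, index = dst
            new = (thread, index + 1)
            self.arcs.setdefault(dst, set()).add(new)  # program order
            self.current[later] = dst = new
            self.ended += 1
        self.arcs.setdefault(src, set()).add(dst)
        return dst

rec = Recorder(["T0", "T1"])
rec.conflict("T0", "T1")   # arc T0.0 -> T1.0, nobody ends
rec.conflict("T0", "T1")   # same direction again, still orderable
rec.conflict("T1", "T0")   # opposite direction: T0 must start a new episode
```

In this run three conflicts end only one episode, whereas ending an episode on every conflict would have ended three.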
Karma Hardware
[Figure: Base System with Core 0 … Core 15, L1 caches, a coherence controller, L2 banks (L2 0 … L2 15), an interconnect and DRAM]
Karma state (Total State: 148 bytes/core):
• Address Filter (FLT)
• Reference (REFS)
• Predecessor (PRED)
• Successor (SUCC)
• Timestamp (TS)
Evaluation:
• Were we able to speed up the replay?
[Figure: four bar charts, one each for Apache, Jbb, Oltp and Zeus, comparing Base, Rerun Replay and Karma Replay. Y-axis: speedup normalized to "Base" of the corresponding configuration. X-axis: number of cores-L2 cache size (4core-4MB, 8core-8MB, 16core-16MB).]
On average, ~4X improvement in replay speed over Rerun
Evaluation
• Did we blow up the log size?
[Figure: Karma log size normalized to Rerun’s log size (y-axis) versus maximum allowable episode size (x-axis: 128, 256, 512, 1024, 2048, 4096, 8192, Unbounded) for Apache, Zeus, Oltp and Jbb]
On average, Karma does not increase the log size; it reduces it by as much as 40% as larger episodes are allowed
Conclusion
• Applications of deterministic replay
  – Debugging
  – Fault tolerance
  – Security
• Existing hardware record-replayers
  – Slow replay, or
  – Require major hardware changes
• Karma: Faster replay with nearly-conventional h/w
  – Extends Rerun
  – Uses a DAG instead of a scalar clock
  – Extends episodes past conflicts
• Widen Application + Lower Cost → More Attractive
Questions?