NVthreads: Practical Persistence for Multi-threaded Applications
Terry Hsu*, Purdue University
Helge Brügner*, TU München
Indrajit Roy*, Google Inc.
Kimberly Keeton, Hewlett Packard Labs
Patrick Eugster, TU Darmstadt and Purdue University
* Work was done at Hewlett Packard Labs.
NVMW 2018
❖ NVthreads was published in EuroSys 2017
❖ This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.
-
What is non-volatile memory (NVM)?
• Key features: persistence, good performance, byte addressability
• Persistence
- Retain data without power
• Good performance
- Outperform traditional filesystem interface
• Byte addressability
- Allow for pure memory operations
-
Programming interfaces for NVM
• NVM-aware filesystems: BPFS, PMFS, PMEM
- Pro: provide good performance
- Con: require applications to use file-system interfaces and may need hardware modifications
• Durable transactions and heaps: NV-Heaps, Mnemosyne
- Pro: allow fine-grained NVM access
- Con: force programs to use transactions and require non-trivial effort to retrofit transactions into lock-based programs
☞ Problem: Can we provide a simpler programming interface?
-
NVM-aware apps programming
// Add element to the tail of the list
pthread_mutex_lock(&m);
e = malloc(sizeof(*e));
e->value = 5;
e->next = NULL;
head->next = e; // crash
tail = e;
pthread_mutex_unlock(&m);
[Figure: linked list in NVM with nodes 1 (head) and 5 (e), NULL, and the tail pointer]
Challenges: 1. data consistency
-
NVM-aware apps programming
// Add element to the tail of the list
pthread_mutex_lock(&m);
e = malloc(sizeof(*e));
// …value>
e->value = 5;
// …next>
e->next = NULL;
// …next>
head->next = e;
tail = e;
pthread_mutex_unlock(&m);
[Figure: linked list in NVM with nodes 1 (head) and 5 (e), NULL, and the tail pointer; a cache sits in front of NVM, flushing…]
Challenges: 1. data consistency 2. programmability 3. volatile caches 4. performance
-
Challenges of using NVM
• Data consistency
- Ensure data consistency even after a crash
• Volatile caches
- Manage data movement from volatile caches to NVM
• Programmability
- Avoid extensive program modifications
• Performance
- Minimize runtime overhead
Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs
-
Goals of NVthreads
• Make existing lock-based C/C++ applications crash tolerant
• Minimize porting effort
- Drop-in replacement for pthreads library
- No need for transactions
• Advantages of NVthreads
- Good performance
- Easier to develop NVM-aware applications
-
Key ideas
• Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA’14])
- Does not require applications to use transactions
• Execute multithreaded program as multi-process program (cf. DThreads [SOSP’11])
- Process memory buffers uncommitted writes
• Track data modifications at page granularity
- Amortizes logging overhead vs fine-grained tracking
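The page-granularity bullet can be illustrated with a small sketch. This is not NVthreads code: the names `dirty_tracker`, `tracker_reset`, `mark_write`, and `count_dirty` are hypothetical, and the 4 KB page size and 64-page region are assumptions chosen for the example. It only shows how many writes collapse into one dirty-page record.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE    4096u                     /* assumed page size (4 KB) */
#define NUM_PAGES    64u                       /* assumed region size, in pages */
#define REGION_BYTES ((size_t)NUM_PAGES * PAGE_SIZE)

/* One flag per page: 1 if any byte of that page was written. */
typedef struct {
    uint8_t dirty[NUM_PAGES];
} dirty_tracker;

static void tracker_reset(dirty_tracker *t) {
    memset(t->dirty, 0, sizeof t->dirty);
}

/* Record a write of `len` bytes at byte offset `off` within the region.
   Writes that start outside the region are ignored; writes that run past
   the end are clamped to the last page. */
static void mark_write(dirty_tracker *t, size_t off, size_t len) {
    if (len == 0 || off >= REGION_BYTES) return;
    size_t last = off + len - 1;
    if (last >= REGION_BYTES) last = REGION_BYTES - 1;
    for (size_t p = off / PAGE_SIZE; p <= last / PAGE_SIZE; p++)
        t->dirty[p] = 1;
}

/* Number of pages that would have to be logged. */
static size_t count_dirty(const dirty_tracker *t) {
    size_t n = 0;
    for (size_t p = 0; p < NUM_PAGES; p++) n += t->dirty[p];
    return n;
}
```

In this model, a hundred 8-byte writes that land in the same page yield a single dirty-page entry, whereas a word-level tracker would record each write separately. The trade-off is that a page touched by one byte is still logged whole.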
-
Using NVthreads
• Ease of use: link to NVthreads library
bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl
• Modifications: add recovery code, specify persistent allocations
- Allocate data in NVM: nvmalloc()
- Recover data in NVM: nvrecover()
[Figure: software stack, top to bottom]
- User space: unmodified C/C++ application
- User space: NVthreads library (multi-process, intercepting synchronization, tracking data, maintaining log)
- Kernel space: operating system (memory allocation and file system interface for both DRAM and NVM)
- Hardware: DRAM (volatile main memory, e.g., stacks)
- Hardware: NVM (persistent regions, e.g., linked list on heap)
-
NVthreads: programming model
int main(){
  if( crashed() ){
    int *c = (int*) nvmalloc(sizeof(int), "c");
    *c = nvrecover(c, sizeof(int), "c");
  }
  else{ // normal execution
    int *c = (int*) nvmalloc(sizeof(int), "c");
    ... // thread creation
    m.lock();
    *c = *c+1;
    ...
    m.unlock();
  }
}
Locks mark the boundaries of a durable code section (m.lock() to m.unlock() in the normal-execution branch).
-
NVthreads: programming model
if( crashed() ){
  int *c = (int*) nvmalloc(sizeof(int), "c");
  *c = nvrecover(c, sizeof(int), "c");
}
Application-specific recovery code, which the programmer needs to add.
-
Example: linked list
• NVthreads guarantees that the linked list is atomically appended w.r.t. failures
// L is a persistent list
Start threads {T1, T2, T3}
…
// Add element to the tail of the list
pthread_mutex_lock(&m);
nvmalloc(&e, sizeof(*e));
e->val = localVal;
tail->next = e;
e->prev = tail; // crash!
tail = e;
pthread_mutex_unlock(&m);
[Figure: timeline of threads T1, T2, T3 running critical sections (add e1), (add e2), (add e3); state of the list data structure “L” in NVM: L={}, L={e1}, L={e1, e2}; recovery phase (execute redo ops)]
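The atomic-append guarantee can be modeled with a toy redo log. This is a deliberately simplified sketch and not the NVthreads implementation: it logs individual integer slots rather than dirty pages, and `redo_log`, `log_write`, `log_commit`, and `recover` are hypothetical names. The point is only that entries from a critical section that never reached its unlock are not replayed.

```c
#include <stddef.h>

#define NVM_WORDS 16   /* assumed size of the toy persistent region */
#define LOG_CAP   32   /* assumed log capacity */

typedef struct { size_t idx; int val; } redo_entry;

typedef struct {
    redo_entry entries[LOG_CAP];
    size_t len;        /* entries appended so far */
    size_t committed;  /* entries covered by the last commit (unlock) */
} redo_log;

/* Append a pending update made inside a critical section. */
static int log_write(redo_log *l, size_t idx, int val) {
    if (l->len == LOG_CAP || idx >= NVM_WORDS) return -1;
    l->entries[l->len++] = (redo_entry){ idx, val };
    return 0;
}

/* Called at the end of a critical section: everything logged so far
   becomes durable as one unit. */
static void log_commit(redo_log *l) {
    l->committed = l->len;
}

/* After a crash: replay committed entries only, in order. */
static void recover(const redo_log *l, int nvm[NVM_WORDS]) {
    for (size_t i = 0; i < l->committed; i++)
        nvm[l->entries[i].idx] = l->entries[i].val;
}
```

If two appends commit and a third is interrupted before its commit, recovery restores the state with two elements and none of the third's partial updates, which matches the L={e1, e2} state in the slide.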
-
Implementing atomic durability
• Convert threads to processes (cf. DThreads [SOSP’11])
- Each process works on private memory, no undo log
• At synchronization points, propagate private updates, execute processes sequentially
• Track dirty pages and log them to NVM for recovery
- Apply redo log in the event of crash
[Figure: threads in a shared address space become processes in disjoint address spaces]
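Why private process memory removes the need for an undo log can be seen in a minimal POSIX sketch. It is an illustration only, not NVthreads code, and `parent_view_after_child_write` is a hypothetical helper: after `fork()`, a write in the child lands in the child's own copy of the page and is invisible to the parent until something explicitly propagates it.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int counter = 0;  /* ordinary memory: private to each process after fork() */

/* Returns the parent's view of `counter` after a child process has
   modified its own copy, or -1 if fork/waitpid fails. */
static int parent_view_after_child_write(void) {
    counter = 1;
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        counter = 42;    /* stays in the child's private copy */
        _exit(0);
    }
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return counter;      /* the parent never saw the child's write */
}
```

Because the child's write never reached the parent's memory, there is nothing to roll back if the child dies mid-update; only updates that were deliberately propagated at a synchronization point need to be made durable.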
-
From threads to processes
[Figure: timeline for T1 and T2 with labels Parallel phase, Wait, Execute, Critical section, Pass token; around each critical section: Start, Track dirty pages, Stop, NVM log write, Merge shared state]
-
Redo logging
[Figure: T1 runs a parallel phase and a critical section; at sync(): log dirty pages to the redo log, merge updated bytes into the shared state, write back to NVM. Legend: clean page, dirtied page]
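The "merge updated bytes" step can be sketched with twin-page diffing, a technique associated with Dthreads. Whether NVthreads merges in exactly this way is not stated on the slide, so treat this as an assumption; `merge_page` and the 4 KB page size are hypothetical. A twin is a snapshot of the page taken before the process started writing, and only bytes that differ from the twin are copied into the shared state.

```c
#include <stddef.h>

#define PAGE_SIZE 4096  /* assumed page size (4 KB) */

/* Copy into `shared` only the bytes of `work` (the process's private,
   modified copy) that differ from `twin` (the snapshot taken before the
   process began writing). Returns the number of bytes merged. */
static size_t merge_page(unsigned char *shared,
                         const unsigned char *twin,
                         const unsigned char *work) {
    size_t merged = 0;
    for (size_t i = 0; i < PAGE_SIZE; i++) {
        if (work[i] != twin[i]) {
            shared[i] = work[i];
            merged++;
        }
    }
    return merged;
}
```

Merging at byte granularity lets two processes that dirtied different bytes of the same page both commit without overwriting each other, even though tracking and logging happen per page.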
-
Tracking data dependencies
• NVthreads maintains metadata for memory pages per lockset to track data dependencies.
[Figure: threads T1 and T2, initial state X=Y=0; operations X=1, Y=X, cond_wait(), cond_signal(); regions A and B linked by a dependence; NVM logs Log1, Log2, Log3]
-
Evaluation
• Environment
- Ubuntu 14.04 (Linux 3.16.7)
- Two Intel Xeon X5650 processors
- 198GB RAM and 600GB SSD
• Applications
- PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means
• NVM emulator
- Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)
- Injected a 1000ns delay into each 4KB page write via the RDTSCP instruction
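The emulator's cost model (1000 ns per 4 KB page write) reduces to simple arithmetic, sketched below. The function name `emulated_delay_ns` is hypothetical, and rounding a partial page up to a whole page is an assumption; the slide does not say how the emulator treats partial pages, and this sketch computes the total delay rather than spinning on RDTSCP.

```c
#include <stdint.h>

#define PAGE_SIZE         4096u  /* 4 KB page, as on the slide */
#define DELAY_NS_PER_PAGE 1000u  /* injected delay per page write, as on the slide */

/* Total injected delay for writing `bytes_written` bytes, assuming that
   partial pages are charged as whole pages. */
static uint64_t emulated_delay_ns(uint64_t bytes_written) {
    uint64_t pages = (bytes_written + PAGE_SIZE - 1) / PAGE_SIZE;
    return pages * DELAY_NS_PER_PAGE;
}
```

Under these assumptions, logging 1000 dirty pages adds 1000 × 1000 ns = 1 ms of emulated NVM latency.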
-
Performance vs pthreads
• Phoenix and PARSEC benchmarks
• No recovery protocol
[Figure: slowdown (x) over pthreads, y-axis 0 to 16, for histogram, kmeans, linear regression, matrix multiply, pca, reverse index, string match, word count, blackscholes, canneal, dedup, ferret, streamcluster, swaptions; series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas]
-
Performance vs pthreads
• 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads
• Remaining 5 applications: 4x to 7x slowdown vs pthreads
[Figure: slowdown (x) over pthreads per benchmark; series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas]
-
Performance vs Atlas [OOPSLA’14]
• 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas
[Figure: slowdown (x) over pthreads per benchmark, y-axis 0 to 16; two off-scale bars are labeled 101.96 and 46.92; two entries are marked "x"; series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas]
-
Performance vs Atlas [OOPSLA’14]
• 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas
• Remaining 2 applications: 7% to 2x slower vs Atlas
[Figure: slowdown (x) over pthreads per benchmark; two entries are marked "x"; series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas]
-
Is coarse-grained tracking a good fit?
• 9 out of 14 applications touch more than 55% of each page
• It is worthwhile to track data at page granularity in these apps
[Figure: % of each page modified, y-axis 0 to 100, per benchmark, with the number shown in parentheses on each axis label: linear regression (25), string match (37), histogram (44), blackscholes (89), swaptions (483), matrix multiply (4K), kmeans (10K), pca (11K), word count (12K), ferret (150K), streamcluster (180K), dedup (2.3M), reverse index (2.7M), canneal (7.4M)]
-
NVthreads is faster than fine-grained tracking
• Microbenchmark: 4 threads randomly modify parts of 1000 memory pages
• Mnemosyne [ASPLOS’11] and Atlas [OOPSLA’14] use word-level tracking
• NVthreads is 3x to 30x faster than fine-grained tracking
[Figure: slowdown over pthreads (x), y-axis 0 to 250, vs percentage of page modified (5%, 10%, 25%, 50%, 75%, 100%); series: NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, Atlas]
-
Benefits of recovery (K-means)
• We made K-means crash at synthetic program points, then recover and continue until convergence at the ~160th iteration
• NVthreads’ K-means provides up to 1.9x speedup vs pthreads
• NVthreads requires only 4 SLOC changes to make K-means crash tolerant
[Figure: speedup over pthreads (x), y-axis 0 to 2, for input sizes 1M, 10M, 20M, 30M, grouped by the iteration when the crash occurred (10, 50, 75, 150); series: Pthreads, NVthreads (nvm=1000ns)]
-
Summary
• NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes
• Recovery requires only a redo log because multi-process execution buffers private updates
• Coarse-grained page-level tracking amortizes logging overheads
• NVthreads prototype is publicly available at:
https://github.com/HewlettPackard/nvthreads