NVthreads: Practical Persistence for Multi-threaded Applications

Terry Hsu*, Purdue University
Helge Brügner*, TU München
Indrajit Roy*, Google Inc.
Kimberly Keeton, Hewlett Packard Labs
Patrick Eugster, TU Darmstadt and Purdue University
* Work was done at Hewlett Packard Labs.

NVMW 2018

❖ NVthreads was published in EuroSys 2017
❖ This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.

What is non-volatile memory (NVM)?


• Key features: persistence, good performance, byte addressability

• Persistence

- Retain data without power

• Good performance

- Outperform traditional filesystem interface

• Byte addressability

- Allow for pure memory operations


Programming interfaces for NVM

• NVM-aware file systems: BPFS, PMFS, PMEM

- Pro: provide good performance

- Con: require applications to use file-system interfaces and may need hardware modifications

• Durable transactions and heaps: NV-Heaps, Mnemosyne

- Pro: allow fine-grained NVM access

- Con: force programs to use transactions and require non-trivial effort to retrofit transactions in lock-based programs

☞ Problem: Can we provide a simpler programming interface?

NVM-aware apps programming

[Diagram: a linked list stored in NVM, with node 1 (head), new node 5 (e), NULL, and the tail pointer. Later animation frames add a volatile cache in front of NVM and a "flushing…" step.]

// Add element to the tail of the list
pthread_mutex_lock(&m);
e = malloc(sizeof(*e));
e->value = 5;
e->next = NULL;
head->next = e;   // crash
tail = e;
pthread_mutex_unlock(&m);

[Animation: later frames repeat the listing with an extra annotated step before each of the three stores (e->value, e->next, head->next). Only the fragments "value>" and "next>" of those annotations survive.]

Challenges: 1. data consistency  2. programmability  3. volatile caches  4. performance

Challenges of using NVM

• Data consistency

- Ensure data consistency even after crash

• Volatile caches

- Manage data movement from volatile caches to NVM

• Programmability

- Avoid extensive program modifications

• Performance

- Minimize runtime overhead

Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs

Goals of NVthreads

• Make existing lock-based C/C++ applications crash tolerant

• Minimize porting effort

- Drop-in replacement for pthreads library

- No need for transactions

• Advantages of NVthreads

- Good performance

- Easier to develop NVM-aware applications

Key ideas

• Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA’14])

- Does not require applications to use transactions

• Execute multithreaded program as multi-process program (cf. DThreads [SOSP’11])

- Process memory buffers uncommitted writes

• Track data modifications at page granularity

- Amortizes logging overhead vs fine-grained tracking
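The page-granularity idea can be illustrated with a minimal sketch. It assumes a 4 KB page and a small fixed region; `region_t`, `record_write`, and `dirty_page_count` are hypothetical names for illustration and are not NVthreads APIs. The explicit `record_write` call is a simplification: the point is only that many small writes to one page cost a single page-sized log entry.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096u  /* assumed page size */
#define NUM_PAGES  8u     /* size of the illustrative region, in pages */

/* Hypothetical tracked region: data plus one dirty flag per page. */
typedef struct {
    uint8_t data[NUM_PAGES * PAGE_BYTES];
    uint8_t dirty[NUM_PAGES];
} region_t;

/* Write `len` bytes at `offset` and mark every page the write touches. */
void record_write(region_t *r, size_t offset, const void *src, size_t len) {
    assert(len > 0 && offset + len <= sizeof r->data);
    memcpy(r->data + offset, src, len);
    size_t first = offset / PAGE_BYTES;
    size_t last = (offset + len - 1) / PAGE_BYTES;
    for (size_t p = first; p <= last; p++)
        r->dirty[p] = 1;
}

/* Number of pages that would have to be logged at the next sync point. */
size_t dirty_page_count(const region_t *r) {
    size_t n = 0;
    for (size_t p = 0; p < NUM_PAGES; p++)
        n += r->dirty[p];
    return n;
}

/* Demo: 100 small writes dirty one page; a final write that straddles
 * a page boundary dirties a second page. */
size_t demo_dirty_pages(void) {
    static region_t r;
    memset(&r, 0, sizeof r);
    uint32_t v = 42;
    for (size_t i = 0; i < 100; i++)
        record_write(&r, i * sizeof v, &v, sizeof v);
    record_write(&r, PAGE_BYTES - 2, &v, sizeof v);
    return dirty_page_count(&r);
}
```

With word-level tracking the same workload would produce 101 log records; here it produces two page entries.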

Using NVthreads

• Ease of use:

bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl

• Modifications

- Allocate data in NVM: nvmalloc()

- Recover data in NVM: nvrecover()

[Diagram: software stack]

User space

- Unmodified C/C++ application: add recovery code, specify persistent allocations, link to NVthreads library

- NVthreads library: multi-process execution, intercepting synchronization, tracking data, maintaining log

Kernel space

- Operating system: memory allocation and file system interface for both DRAM and NVM

Hardware

- DRAM: volatile main memory, e.g., stacks

- NVM: persistent regions, e.g., linked list on heap

NVthreads: programming model

void main() {
    if (crashed()) {
        int *c = (int *) nvmalloc(sizeof(int), "c");
        *c = nvrecover(c, sizeof(int), "c");
    }
    else { // normal execution
        int *c = (int *) nvmalloc(sizeof(int), "c");
        ... // thread creation
        m.lock()
        *c = *c + 1;
        ...
        m.unlock()
    }
}

Locks mark the boundary of a durable code section (the m.lock() to m.unlock() region in the normal-execution branch).

The crashed() branch is application-specific recovery code. The programmer needs to add it.
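The recover-or-initialize pattern on this slide can be tried out with a small mock. Everything prefixed `mock_` below is a hypothetical stand-in written for this sketch: it emulates NVM with an ordinary file and does not reflect the real NVthreads signatures or internals. Unlike the slide's `crashed()`, the mock takes a variable name and cannot tell a crash from a clean exit.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Builds the file name that stands in for a named NVM variable. */
static void mock_path(char *buf, size_t n, const char *name) {
    snprintf(buf, n, "mock_nvm_%s.bin", name);
}

/* Reports whether an earlier run left a persisted copy of `name`. */
int mock_crashed(const char *name) {
    char path[128];
    mock_path(path, sizeof path, name);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return 0;
    fclose(f);
    return 1;
}

/* Allocates zero-initialized memory for a named persistent variable. */
void *mock_nvmalloc(size_t size, const char *name) {
    (void)name;
    assert(size > 0);
    return calloc(1, size);
}

/* Copies the persisted bytes of `name` into dest. Returns 0 on success. */
int mock_nvrecover(void *dest, size_t size, const char *name) {
    char path[128];
    mock_path(path, sizeof path, name);
    FILE *f = fopen(path, "rb");
    if (f == NULL) return -1;
    size_t got = fread(dest, 1, size, f);
    fclose(f);
    return got == size ? 0 : -1;
}

/* Stand-in for the durable commit that happens at the end of a locked region. */
int mock_persist(const void *src, size_t size, const char *name) {
    char path[128];
    mock_path(path, sizeof path, name);
    FILE *f = fopen(path, "wb");
    if (f == NULL) return -1;
    size_t put = fwrite(src, 1, size, f);
    int closed = fclose(f);
    return (put == size && closed == 0) ? 0 : -1;
}

/* One program run: recover the counter if a prior run exists, then increment it. */
int run_once(void) {
    int *c = mock_nvmalloc(sizeof *c, "c");
    if (c == NULL) return -1;
    if (mock_crashed("c") && mock_nvrecover(c, sizeof *c, "c") != 0) {
        free(c);
        return -1;
    }
    *c = *c + 1;                      /* body of the critical section */
    int ok = mock_persist(c, sizeof *c, "c");
    int value = *c;
    free(c);
    return ok == 0 ? value : -1;
}
```

Calling `run_once()` twice in a row models two program runs: the second run picks up the counter value left by the first.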

Example: linked list

• NVthreads guarantees that the linked list is atomically appended w.r.t. failures

# L is a persistent list
Start threads {T1, T2, T3}
…
nvmalloc(&e, sizeof(*e));
e->val = localVal;
tail->next = e;
e->prev = tail; // crash!
tail = e;
pthread_unlock(&m)

[Diagram: threads T1, T2, and T3 each run a critical section that appends an element (e.g., "add e1"). The state of the list data structure "L" in NVM advances from L={} to L={e1} to L={e1, e2}. After the crash, a recovery phase executes redo ops.]

Implementing atomic durability

• Convert threads to processes (cf. DThreads [SOSP’11])

- Each process works on private memory, so no undo log is needed

• At synchronization points, propagate private updates and execute processes sequentially

• Track dirty pages and log them to NVM for recovery

- Apply the redo log in the event of a crash
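A minimal sketch of why a redo log alone suffices: uncommitted writes stay out of the durable image, so recovery only needs to replay completed sections. The types and functions below (`redo_log_t`, `log_page`, `log_commit`, `recover`) and the explicit commit mark are assumptions made for illustration, not the NVthreads log format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096u  /* assumed page size */
#define NUM_PAGES  4u     /* pages in the illustrative durable image */
#define LOG_CAP    8u     /* capacity of the illustrative log */

typedef struct {
    size_t  page;                /* index of the logged page */
    uint8_t bytes[PAGE_BYTES];   /* full page contents at log time */
} log_entry_t;

typedef struct {
    log_entry_t entries[LOG_CAP];
    size_t      count;           /* entries appended so far */
    size_t      committed;       /* entries covered by a commit mark */
} redo_log_t;

/* Append one dirty page to the log. Returns 0 on success, -1 if full. */
int log_page(redo_log_t *log, size_t page, const uint8_t *bytes) {
    assert(page < NUM_PAGES);
    if (log->count == LOG_CAP) return -1;
    log->entries[log->count].page = page;
    memcpy(log->entries[log->count].bytes, bytes, PAGE_BYTES);
    log->count++;
    return 0;
}

/* Mark everything appended so far as belonging to completed sections. */
void log_commit(redo_log_t *log) { log->committed = log->count; }

/* Recovery: replay committed entries, in order, onto the durable image. */
void recover(const redo_log_t *log, uint8_t image[][PAGE_BYTES]) {
    for (size_t i = 0; i < log->committed; i++)
        memcpy(image[log->entries[i].page], log->entries[i].bytes, PAGE_BYTES);
}

/* Demo: one section commits, a second one "crashes" before its commit mark. */
int demo_redo(void) {
    static redo_log_t log;
    static uint8_t image[NUM_PAGES][PAGE_BYTES];
    static uint8_t page[PAGE_BYTES];
    memset(&log, 0, sizeof log);
    memset(image, 0, sizeof image);

    memset(page, 0xAA, sizeof page);
    log_page(&log, 1, page);
    log_commit(&log);             /* first critical section completes */

    memset(page, 0xBB, sizeof page);
    log_page(&log, 2, page);      /* crash before the commit mark */

    recover(&log, image);
    return image[1][0] == 0xAA && image[2][0] == 0x00;
}
```

After recovery, page 1 holds the committed update and page 2 is untouched, so nothing has to be undone.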

From threads to processes

[Diagram: threads in a shared address space become processes with disjoint address spaces. T1 and T2 alternate between parallel phases and critical sections. While one process executes its critical section the other waits, and the token is then passed. Each critical section is annotated with the steps Start, Track dirty pages, NVM log write, Merge shared state, and Stop.]

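Redo logging

[Diagram: T1 dirties pages during a critical section. At sync(), the dirty pages are logged to the redo log in NVM, the updated bytes are merged into the shared state, and the shared state is written back to NVM. Legend: clean page, dirtied page. Phases: parallel phase, critical section.]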
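The "merge updated bytes" step can be sketched as a byte-wise diff against a snapshot of the page taken before the critical section. The snapshot ("twin") approach and the name `merge_page` are assumptions for illustration, in the style of twin-page diffing. The slides do not say how NVthreads computes the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_BYTES 4096u  /* assumed page size */

/* Copy into `shared` only the bytes where `work` differs from `twin`,
 * the snapshot taken before the critical section.
 * Returns the number of bytes merged. */
size_t merge_page(uint8_t *shared, const uint8_t *work, const uint8_t *twin) {
    assert(shared != NULL && work != NULL && twin != NULL);
    size_t merged = 0;
    for (size_t i = 0; i < PAGE_BYTES; i++) {
        if (work[i] != twin[i]) {
            shared[i] = work[i];
            merged++;
        }
    }
    return merged;
}

/* Demo: two processes change different bytes of the same page;
 * both updates survive the merge. */
int demo_merge(void) {
    static uint8_t shared[PAGE_BYTES], twin[PAGE_BYTES];
    static uint8_t w1[PAGE_BYTES], w2[PAGE_BYTES];
    memset(shared, 0, PAGE_BYTES);
    memcpy(twin, shared, PAGE_BYTES);   /* snapshot before either write */
    memcpy(w1, twin, PAGE_BYTES);
    memcpy(w2, twin, PAGE_BYTES);
    w1[10] = 7;                         /* process 1 writes byte 10 */
    w2[20] = 9;                         /* process 2 writes byte 20 */
    size_t n = merge_page(shared, w1, twin) + merge_page(shared, w2, twin);
    return n == 2 && shared[10] == 7 && shared[20] == 9;
}
```

Copying whole pages instead would let the second merge overwrite the first process's byte. Merging only changed bytes avoids that.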

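Tracking data dependencies

[Diagram: threads T1 and T2 start from X=Y=0. The regions labeled A and B involve the writes X=1 and Y=X, ordered through cond_signal() and cond_wait(), which creates a dependence between them. Log entries: Log1, Log2, Log3.]

NVthreads maintains metadata for memory pages per lockset to track data dependencies.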

Evaluation

• Environment

- Ubuntu 14.04 (Linux 3.16.7)

- Two Intel Xeon X5650 processors

- 198GB RAM and 600GB SSD

• Applications

- PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means

• NVM emulator

- Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)

- Injected a 1000ns delay into each 4KB page write via the RDTSCP instruction
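The delay-injection technique can be sketched as a busy-wait after each page copy. The slide times the delay with RDTSCP. This sketch swaps in POSIX `clock_gettime(CLOCK_MONOTONIC)` as a portable stand-in, and the names `spin_ns` and `emulated_page_write` are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define PAGE_BYTES 4096u  /* 4KB page, as on the slide */

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void) {
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least delay_ns nanoseconds. */
void spin_ns(uint64_t delay_ns) {
    uint64_t start = now_ns();
    while (now_ns() - start < delay_ns) {
        /* spin */
    }
}

/* Copy one page into the emulated NVM, then charge the extra write latency.
 * Returns the total elapsed time in nanoseconds. */
uint64_t emulated_page_write(uint8_t *nvm_page, const uint8_t *src,
                             uint64_t delay_ns) {
    uint64_t start = now_ns();
    memcpy(nvm_page, src, PAGE_BYTES);
    spin_ns(delay_ns);
    return now_ns() - start;
}

/* Demo: one page write with the 1000ns delay from the slide. */
int demo_delay(void) {
    static uint8_t src[PAGE_BYTES], dst[PAGE_BYTES];
    src[0] = 0x5A;
    uint64_t ns = emulated_page_write(dst, src, 1000);
    return ns >= 1000 && dst[0] == 0x5A;
}
```

A busy-wait is used rather than a sleep because a 1000ns delay is far below typical scheduler granularity.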


Performance vs pthreads

• Phoenix and PARSEC benchmarks

• No recovery protocol

[Figure: slowdown (x) relative to pthreads, y-axis 0 to 16, for histogram, kmeans, linear regression, matrix multiply, pca, reverse index, string match, word count, blackscholes, canneal, dedup, ferret, streamcluster, and swaptions. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

• 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads

• Remaining 5 applications: 4x to 7x slowdown vs pthreads

Performance vs Atlas [OOPSLA’14]

• 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas

[Figure: slowdown (x), y-axis 0 to 16, across the Phoenix and PARSEC benchmarks. Two bars exceed the axis and are labeled 101.96 and 46.92, and two entries are marked "x".]

• Remaining 2 applications: 7% to 2x slower vs Atlas

Is coarse-grained tracking a good fit?

• 9 out of 14 applications touch more than 55% of each page

• It is worthwhile to track data at page granularity in these apps

[Figure: % of each page modified (y-axis 0 to 100) per application, each labeled with a count in parentheses: linear regression (25), string match (37), histogram (44), blackscholes (89), swaptions (483), matrix multiply (4K), kmeans (10K), pca (11K), word count (12K), ferret (150K), streamcluster (180K), dedup (2.3M), reverse index (2.7M), canneal (7.4M).]

NVthreads is faster than fine-grained tracking

• Microbenchmark: 4 threads randomly modify parts of 1000 memory pages

• Mnemosyne [ASPLOS’11] and Atlas [OOPSLA’14] use word-level tracking

• NVthreads is 3x to 30x faster than fine-grained tracking

[Figure: slowdown over pthreads (x), y-axis 0 to 250, vs percentage of page modified (5%, 10%, 25%, 50%, 75%, 100%). Series: NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, Atlas.]

Benefits of recovery (K-means)

• We made K-means crash at synthetic program points, recover, and continue until convergence at about the 160th iteration

• NVthreads’ K-means provides up to 1.9x speedup vs pthreads

• NVthreads requires only 4 SLOC changes to make K-means crash tolerant

[Figure: speedup over pthreads (x), y-axis 0 to 2, for input sizes 1M, 10M, 20M, and 30M, grouped by the iteration when the crash occurred (10, 50, 75, 150). Series: Pthreads, NVthreads (nvm=1000ns).]

Summary

• NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes

• Recovery requires only redo log because multi-process execution buffers private updates

• Coarse-grained page-level tracking amortizes logging overheads

• NVthreads prototype is publicly available at:

https://github.com/HewlettPackard/nvthreads
