NVthreads: Practical Persistence for Multi-threaded Applications Terry Hsu*, Purdue University Helge Brügner*, TU München Indrajit Roy*, Google Inc. Kimberly Keeton, Hewlett Packard Labs Patrick Eugster, TU Darmstadt and Purdue University * Work was done at Hewlett Packard Labs. NVMW 2018 NVthreads was published in EuroSys 2017 This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.


  • NVthreads: Practical Persistence for Multi-threaded Applications

    Terry Hsu*, Purdue University
    Helge Brügner*, TU München
    Indrajit Roy*, Google Inc.
    Kimberly Keeton, Hewlett Packard Labs
    Patrick Eugster, TU Darmstadt and Purdue University
    * Work was done at Hewlett Packard Labs.

    NVMW 2018

    ❖ NVthreads was published in EuroSys 2017
    ❖ This work was supported by Hewlett Packard Labs, NSF TC-1117065, NSF TWC-1421910, and ERC FP7-617805.

  • What is non-volatile memory (NVM)?


    • Key features: persistence, good performance, byte addressability

    • Persistence

    - Retain data without power

    • Good performance

    - Outperform traditional filesystem interface

    • Byte addressability

    - Allow for pure memory operations

  • Programming interfaces for NVM

    • NVM-aware filesystems: BPFS, PMFS, PMEM

    - Pro: provide good performance

    - Con: require applications to use file-system interfaces and may need hardware modifications

    • Durable transactions and heaps: NV-Heaps, Mnemosyne

    - Pro: allow fine-grained NVM access

    - Con: force programs to use transactions and require non-trivial effort to retrofit transactions into lock-based programs

    ☞ Problem: Can we provide a simpler programming interface?

  • NVM-aware apps programming

    // Add element to the tail of the list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    e->value = 5;
    e->next = NULL;
    head->next = e;   // crash
    tail = e;
    pthread_mutex_unlock(&m);

    [Diagram: list in NVM, head (value 1) → e (value 5) → NULL, with tail]

    Challenges: 1. data consistency (programmability, volatile caches, and performance follow)

  • NVM-aware apps programming

    // Add element to the tail of the list
    pthread_mutex_lock(&m);
    e = malloc(sizeof(*e));
    // <… value>
    e->value = 5;
    // <… next>
    e->next = NULL;
    // <… next>
    head->next = e;
    tail = e;
    pthread_mutex_unlock(&m);

    [Diagram: list in NVM, head (value 1) → e (value 5) → NULL, with tail]

    Challenges: 1. data consistency 2. programmability (volatile caches and performance follow)

  • NVM-aware apps programming

    [Diagram: the same list in NVM, now with a volatile cache in front of it; flushing…]

    Challenges: 1. data consistency 2. programmability 3. volatile caches (performance follows)

  • NVM-aware apps programming

    [Diagram: the same list in NVM, with a volatile cache in front of it; flushing…]

    Challenges: 1. data consistency 2. programmability 3. volatile caches 4. performance

  • Challenges of using NVM

    • Data consistency

    - Ensure data consistency even after a crash

    • Volatile caches

    - Manage data movement from volatile caches to NVM

    • Programmability

    - Avoid extensive program modifications

    • Performance

    - Minimize runtime overhead

    ☞ Proposal: NVthreads, a programming model and runtime that adds persistence to multi-threaded C/C++ programs

  • Goals of NVthreads

    • Make existing lock-based C/C++ applications crash tolerant

    • Minimize porting effort

    - Drop-in replacement for the pthreads library

    - No need for transactions

    • Advantages of NVthreads

    - Good performance

    - Easier to develop NVM-aware applications

  • Key ideas

    • Use synchronization points to infer consistent regions (cf. Atlas [OOPSLA’14])

    - Does not require applications to use transactions

    • Execute multithreaded program as multi-process program (cf. DThreads [SOSP’11])

    - Process memory buffers uncommitted writes

    • Track data modifications at page granularity

    - Amortizes logging overhead vs fine-grained tracking
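    The amortization argument in the last bullet can be made concrete by counting tracking records. The sketch below is not NVthreads code: it assumes 4 KB pages and 8-byte words, and the names `page_records`, `word_records`, and `units_touched` are hypothetical. It only counts how many records one contiguous write would generate under page-level and word-level tracking.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u  /* assumed page size */
#define WORD_SIZE 8u     /* assumed word size for fine-grained tracking */

/* Number of tracking units (each `unit` bytes) touched by a write of
 * `len` bytes starting at byte offset `off`. */
static size_t units_touched(size_t off, size_t len, size_t unit)
{
    assert(unit > 0);
    if (len == 0)
        return 0;
    size_t first = off / unit;
    size_t last = (off + len - 1) / unit;
    return last - first + 1;
}

/* Records needed under page-level tracking: one per dirty page. */
size_t page_records(size_t off, size_t len)
{
    return units_touched(off, len, PAGE_SIZE);
}

/* Records needed under word-level tracking: one per modified word. */
size_t word_records(size_t off, size_t len)
{
    return units_touched(off, len, WORD_SIZE);
}
```

    Under these assumptions, overwriting a whole page costs one page record but 512 word records. A single 8-byte store costs one record either way, although the page record is far larger, so page granularity pays off when a large share of each dirty page is modified.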

  • Using NVthreads

    • Ease of use: link the unmodified C/C++ application to the NVthreads library

    bash$ gcc foo.c -o foo.out -rdynamic libnvthread.so -ldl

    • Modifications: add recovery code, specify persistent allocations

    - Allocate data in NVM: nvmalloc()

    - Recover data in NVM: nvrecover()

    [Diagram: software stack]

    User space: unmodified C/C++ application; NVthreads library (multi-process, intercepting synchronization, tracking data, maintaining log)

    Kernel space: operating system (memory allocation and file system interface for both DRAM and NVM)

    Hardware: DRAM (volatile main memory, e.g., stacks); NVM (persistent regions, e.g., linked list on heap)

  • NVthreads: programming model

    void main(){
      if( crashed() ){
        int *c = (int*) nvmalloc(sizeof(int), "c");
        *c = nvrecover(c, sizeof(int), "c");
      }
      else{ // normal execution
        int *c = (int*) nvmalloc(sizeof(int), "c");
        ... // thread creation
        m.lock();
        *c = *c+1;
        ...
        m.unlock();
      }
    }

    Locks mark the boundary of the durable code section (m.lock() through m.unlock() in the else branch).

  • NVthreads: programming model

    void main(){
      if( crashed() ){
        int *c = (int*) nvmalloc(sizeof(int), "c");
        *c = nvrecover(c, sizeof(int), "c");
      }
      else{ // normal execution
        int *c = (int*) nvmalloc(sizeof(int), "c");
        ... // thread creation
        m.lock();
        *c = *c+1;
        ...
        m.unlock();
      }
    }

    The if( crashed() ) branch is application-specific recovery code. The programmer needs to add it.
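    The crash/recover pattern on these two slides can be tried without NVM hardware by backing the counter with a memory-mapped file. The sketch below is an emulation under stated assumptions: `demo_region`, `demo_open`, and `demo_close` are hypothetical stand-ins, not the NVthreads `crashed()`, `nvmalloc()`, or `nvrecover()` calls, and a plain POSIX `mmap` offers none of the atomicity that NVthreads provides around critical sections.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a named persistent allocation. */
typedef struct {
    int *value;     /* mapped persistent integer */
    int recovered;  /* 1 if the backing file already held a value */
} demo_region;

/* Map one int from the file at `path`. Returns 0 on success, -1 on failure. */
int demo_open(const char *path, demo_region *r)
{
    assert(path != NULL && r != NULL);
    struct stat st;
    r->recovered = (stat(path, &st) == 0 && st.st_size >= (off_t)sizeof(int));

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, (off_t)sizeof(int)) != 0) {
        close(fd);
        return -1;
    }
    void *p = mmap(NULL, sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping stays valid after the descriptor is closed */
    if (p == MAP_FAILED)
        return -1;
    r->value = (int *)p;
    if (!r->recovered)
        *r->value = 0;  /* fresh region: start the counter at zero */
    return 0;
}

/* Flush the value to the backing file and unmap it. */
int demo_close(demo_region *r)
{
    if (msync(r->value, sizeof(int), MS_SYNC) != 0)
        return -1;
    return munmap(r->value, sizeof(int));
}
```

    A program would call `demo_open()` at startup, branch on `recovered` the way the slide branches on `crashed()`, and increment `*value` while holding its lock. The emulation only shows where the recovery branch fits; it does not make the increment failure-atomic.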

  • Example: linked list

    • NVthreads guarantees that the linked list is atomically appended w.r.t. failures

    // L is a persistent list
    Start threads {T1, T2, T3}
    …
    // Add element to the tail of list
    pthread_mutex_lock(&m);
    nvmalloc(&e, sizeof(*e));
    e->val = localVal;
    tail->next = e;
    e->prev = tail;   // crash!
    tail = e;
    pthread_mutex_unlock(&m);

    [Diagram: T1, T2, and T3 each run a critical section (add e1, add e2, add e3). The state of the list data structure "L" in NVM is shown as L={}, L={e1}, L={e1, e2}. A recovery phase (execute redo ops) follows the crash.]

  • Implementing atomic durability

    • Convert threads to processes (cf. DThreads [SOSP’11])

    - Each process works on private memory, no undo log

    • At synchronization points, propagate private updates, execute processes sequentially

    • Track dirty pages and log them to NVM for recovery

    - Apply redo log in the event of crash

    [Diagram: threads in a shared address space become processes in disjoint address spaces]
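    The redo-only recovery described above can be sketched as an in-memory log of dirty pages with a commit point. This is an illustration under assumptions, not the NVthreads log format: `redo_log`, `redo_append`, `redo_commit`, and `redo_recover` are hypothetical names, the page size is shrunk to 64 bytes, and the log lives in ordinary memory rather than NVM.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 64    /* tiny page for illustration; real pages are 4 KB */
#define MAX_RECORDS 16

/* One logged dirty page. */
typedef struct {
    size_t page_index;
    unsigned char data[PAGE_SIZE];
} redo_record;

/* Hypothetical redo log: recovery only sees records covered by `committed`. */
typedef struct {
    redo_record records[MAX_RECORDS];
    size_t appended;   /* records written so far */
    size_t committed;  /* records covered by the last commit */
} redo_log;

/* Append the new contents of a dirty page. Returns 0 on success, -1 if full. */
int redo_append(redo_log *rlog, size_t page_index, const unsigned char *page)
{
    assert(rlog != NULL && page != NULL);
    if (rlog->appended == MAX_RECORDS)
        return -1;
    redo_record *rec = &rlog->records[rlog->appended++];
    rec->page_index = page_index;
    memcpy(rec->data, page, PAGE_SIZE);
    return 0;
}

/* Mark everything appended so far as durable (end of a critical section). */
void redo_commit(redo_log *rlog)
{
    rlog->committed = rlog->appended;
}

/* Recovery: replay committed records in order and ignore uncommitted ones. */
void redo_recover(const redo_log *rlog, unsigned char *memory, size_t num_pages)
{
    for (size_t i = 0; i < rlog->committed; i++) {
        const redo_record *rec = &rlog->records[i];
        if (rec->page_index < num_pages)
            memcpy(memory + rec->page_index * PAGE_SIZE, rec->data, PAGE_SIZE);
    }
}
```

    Because uncommitted writes stay in a process's private memory and never reach the shared state, recovery has nothing to roll back. It only replays the pages of critical sections that committed, which is why no undo log is needed in this scheme.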

  • From threads to processes

    [Diagram: timeline of T1 and T2. Each alternates between a parallel phase and a critical section. While one executes its critical section the other waits, then the token is passed. Steps labeled on each critical section: Start, Track dirty pages, NVM log write, Merge shared state, Stop.]

  • Redo logging

    [Diagram: T1 dirties pages during the parallel phase and critical section. At sync(), T1 logs dirty pages to the redo log, merges updated bytes into the shared state, and the result is written back to NVM. Legend: clean page, dirtied page.]
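    The "merge updated bytes" step can be sketched as a byte-wise diff against a snapshot of the page. The sketch assumes a twin-page scheme in the style of DThreads, where a copy of the page is kept from before the process first wrote to it; the slide does not say how NVthreads identifies updated bytes, and `merge_updated_bytes` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Copy into `shared` only the bytes this process changed, found by comparing
 * its private page against `twin`, the snapshot taken before its first write.
 * Returns the number of bytes merged. */
size_t merge_updated_bytes(unsigned char *shared,
                           const unsigned char *twin,
                           const unsigned char *private_page,
                           size_t page_size)
{
    assert(shared != NULL && twin != NULL && private_page != NULL);
    size_t merged = 0;
    for (size_t i = 0; i < page_size; i++) {
        if (private_page[i] != twin[i]) {
            shared[i] = private_page[i];
            merged++;
        }
    }
    return merged;
}
```

    Merging only the changed bytes lets two processes that wrote to different parts of the same page both have their updates appear in the shared state, instead of the later page copy overwriting the earlier one.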

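  • Tracking data dependencies

    [Diagram: T1 and T2 start from X=Y=0. One thread sets X=1 and calls cond_signal(); the other calls cond_wait() and then sets Y=X. The two regions are labeled A and B, with a dependence arrow between them. NVM holds Log1, Log2, Log3.]

    NVthreads maintains metadata for memory pages per lockset to track data dependencies.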

  • Evaluation

    • Environment

    - Ubuntu 14.04 (Linux 3.16.7)

    - Two Intel Xeon X5650 processors

    - 198 GB RAM and 600 GB SSD

    • Applications

    - PARSEC benchmarks, Phoenix benchmarks, PageRank, K-means

    • NVM emulator

    - Linux tmpfs on DRAM emulating nvmfs (provided by Hewlett Packard Labs)

    - Injected a 1000 ns delay into each 4 KB page write via the RDTSCP instruction
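    The delay injection used by the emulator can be approximated with a busy-wait loop. The slide says the delay was timed with the RDTSCP instruction; the sketch below substitutes the POSIX `clock_gettime` monotonic clock so that it stays portable, and the function names `inject_delay_ns` and `emulate_page_writes` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
static uint64_t now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least `delay_ns` nanoseconds; returns the time spent. */
uint64_t inject_delay_ns(uint64_t delay_ns)
{
    uint64_t start = now_ns();
    uint64_t elapsed = 0;
    while (elapsed < delay_ns)
        elapsed = now_ns() - start;
    return elapsed;
}

/* Emulated cost of writing `pages` pages at `per_page_ns` nanoseconds each. */
uint64_t emulate_page_writes(uint64_t pages, uint64_t per_page_ns)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i < pages; i++)
        total += inject_delay_ns(per_page_ns);
    return total;
}
```

    Calling `emulate_page_writes(n, 1000)` after copying `n` dirty pages to the tmpfs-backed log would charge at least 1000 ns per page, matching the figure on the slide. A busy-wait is used instead of sleeping because sleep calls are typically far coarser than a microsecond.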


  • Performance vs pthreads

    • Phoenix and PARSEC benchmarks

    • No recovery protocol

    [Figure: bar chart of slowdown (x), axis 0 to 16, for histogram, kmeans, linear regression, matrix multiply, pca, reverse index, string match, word count, blackscholes, canneal, dedup, ferret, streamcluster, swaptions. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

  • Performance vs pthreads

    • 9 out of 14 applications: NVthreads incurs less than 20% overhead vs pthreads

    • Remaining 5 applications: 4x to 7x slowdown vs pthreads

    [Figure: bar chart of slowdown (x), axis 0 to 16, for the 14 Phoenix and PARSEC applications. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas.]

  • Performance vs Atlas [OOPSLA’14]

    • 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas

    [Figure: bar chart of slowdown (x), axis 0 to 16, for the 14 Phoenix and PARSEC applications. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas. Two off-scale bars are labeled 101.96 and 46.92; two entries are marked x.]

  • Performance vs Atlas [OOPSLA’14]

    • 10 out of 12 applications: NVthreads is 7% to 100x faster vs Atlas

    • Remaining 2 applications: 7% to 2x slower vs Atlas

    [Figure: bar chart of slowdown (x), axis 0 to 16, for the 14 Phoenix and PARSEC applications. Series: Pthreads, Dthreads, NVthreads (nvmfs 1000ns), Atlas. Two entries are marked x.]

  • Is coarse-grained tracking a good fit?

    • 9 out of 14 applications touch more than 55% of each page

    • It is worthwhile to track data at page granularity in these apps

    [Figure: bar chart of % of each page modified, axis 0 to 100. Bars, labeled as on the slide: linear regression (25), string match (37), histogram (44), blackscholes (89), swaptions (483), matrix multiply (4K), kmeans (10K), pca (11K), word count (12K), ferret (150K), streamcluster (180K), dedup (2.3M), reverse index (2.7M), canneal (7.4M).]

  • NVthreads is faster than fine-grained tracking

    • Microbenchmark: 4 threads randomly modify parts of 1000 memory pages

    • Mnemosyne [ASPLOS’11] and Atlas [OOPSLA’14] use word-level tracking

    • NVthreads is 3x to 30x faster than fine-grained tracking

    [Figure: slowdown over pthreads (x), axis 0 to 250, against percentage of page modified (5%, 10%, 25%, 50%, 75%, 100%). Series: NVthreads (nvm-1000ns), Atlas (no-clflush), Mnemosyne, Atlas.]

  • Benefits of recovery (K-means)

    • We made K-means crash at synthetic program points, recover, and continue until convergence at about the 160th iteration

    • NVthreads’ K-means provides up to 1.9x speedup vs pthreads

    • NVthreads requires only 4 SLOC changes to make K-means crash tolerant

    [Figure: speedup over pthreads (x), axis 0 to 2, for input sizes 1M, 10M, 20M, 30M, grouped by the iteration when the crash occurred (10, 50, 75, 150). Series: Pthreads, NVthreads (nvm=1000ns).]

  • Summary

    • NVthreads allows programmers to easily leverage NVM with just a few lines of source code changes

    • Recovery requires only a redo log because multi-process execution buffers private updates

    • Coarse-grained page-level tracking amortizes logging overheads

    • NVthreads prototype is publicly available at: https://github.com/HewlettPackard/nvthreads