Proteus: A Flexible and Fast Software supported Hardware Logging approach for NVM. Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin, North Carolina State University. The 2018 Non-Volatile Memories Workshop (NVMW 2018).


  • Proteus: A Flexible and Fast Software supported Hardware Logging approach for NVM

    Seunghee Shin, Satish Tirukkovalluri, James Tuck, and Yan Solihin

    North Carolina State University

    The 2018 Non-Volatile Memories Workshop (NVMW 2018)

  • Background

    • Use NVM as storage or main memory?
    • We assume NV main memory (NVMM)
      – Keep important data in memory instead of in files
      – Need to ensure failure safety

    DRAM: + Fast, + Byte-addressable, - Volatile
    Disk / Flash: + Non-volatile, - Slow, - Block-addressable
    NVM: + Fast, + Byte-addressable, + Non-volatile

  • Failure Safety through Durable Transactions

    • Durable transaction: needed to ensure failure safety
      - All updates in a transaction are atomically durable
      - Atomicity can be achieved through HW or SW undo logging

    [Figure: linked list A, B, C, D with node X being inserted; a system failure during the insert is handled by undo-logging]

  • Transaction with Software Undo-Logging

    • Step 1 - Create undo log and make it durable

    • Step 2 - Set log-flag and make it durable, indicating transaction start

    • Step 3 - Perform data updates and make them durable

    • Step 4 - Unset log-flag and make it durable, indicating transaction end
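
    The four steps above can be sketched as a small simulation. This is a toy model, not the authors' implementation: `PersistentMemory`, `store_durably`, `run_transaction`, `recover`, and the `crash_point` names are all hypothetical, and "make it durable" is modeled as copying one value into a dictionary that survives a simulated crash.

```python
import copy


class PersistentMemory:
    """Toy NVMM model: `volatile` is the program's view, `durable` survives a crash."""

    def __init__(self):
        self.volatile = {}
        self.durable = {}

    def store(self, key, value):
        self.volatile[key] = value

    def persist(self, key):
        # Stand-in for "write back and fence" on one location (assumption).
        self.durable[key] = copy.deepcopy(self.volatile[key])

    def store_durably(self, key, value):
        self.store(key, value)
        self.persist(key)

    def crash(self):
        # Anything that was not persisted is lost.
        self.volatile = copy.deepcopy(self.durable)


def run_transaction(mem, updates, crash_point=None):
    """Apply `updates` with undo logging; optionally crash at a named point."""
    # Step 1: create the undo log (old values) and make it durable.
    mem.store_durably("log", {k: mem.volatile[k] for k in updates})
    if crash_point == "after_log":
        mem.crash()
        return
    # Step 2: set the log flag durably (transaction start).
    mem.store_durably("log_flag", True)
    # Step 3: perform the data updates and make them durable.
    for i, (key, value) in enumerate(updates.items()):
        mem.store_durably(key, value)
        if crash_point == "mid_update" and i == 0:
            mem.crash()
            return
    # Step 4: unset the log flag durably (transaction end).
    mem.store_durably("log_flag", False)


def recover(mem):
    """If a transaction was in flight, roll its updates back from the undo log."""
    if mem.durable.get("log_flag"):
        for key, old_value in mem.durable["log"].items():
            mem.store_durably(key, old_value)
        mem.store_durably("log_flag", False)


mem = PersistentMemory()
mem.store_durably("A", 1)
mem.store_durably("B", 2)
run_transaction(mem, {"A": 10, "B": 20}, crash_point="mid_update")
recover(mem)
print(mem.durable["A"], mem.durable["B"])  # 1 2
```

    The crash lands after A has been updated but before B, so the durable log flag is still set and recovery restores both old values. Each step in the sketch persists before the next begins, which is where the fences of software logging come from.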


  • Memory Persistency

    • Unpredictable persist ordering
      - Persist: an operation that makes NVMM writes durable
      - NVMM persist order is determined by LLC writebacks instead of program order

    • Persistency model
      - Defines when stores become durable (i.e. placed in the persistence domain)
      - E.g. Intel PMEM persistency model, strict persistency model, epoch persistency model, buffered epoch persistency model, strand persistency model, etc.

    [Figure: Private Cache, Shared Cache (LLC), MC, and NVMM; writebacks reach NVMM in an unpredictable order]

  • Intel PMEM Instruction and ADR

    • Asynchronous DRAM Refresh (ADR)
      - Adds the write pending queue (WPQ) in the MC to the persistence domain
      - Data in the WPQ is flushed to NVMM automatically on system failure

    • CLWB
      - Writes back a dirty block from the caches to the WPQ
      - A fence is needed for ordering

    [Figure: L1/L2 caches, shared cache, MC, and NVMM, drawn twice with the persistence domain marked; clwb writes a block back from the caches toward the MC]

    [Figure: instruction sequences "st A; st B" and "st A; clwb A; sfence; st B"]
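
    To see why the clwb plus fence sequence matters, the following sketch enumerates which durable states a crash could leave behind. It is a deliberately simplified model and not a description of real hardware: `possible_durable_states` and `b_without_a` are hypothetical names, every address is assumed to be stored at most once, a dirty cache line is assumed to be written back at any time (or not at all), and a line is treated as guaranteed durable only once a `clwb` on it has been followed by an `sfence`.

```python
from itertools import combinations


def possible_durable_states(program):
    """Durable states reachable if a crash hits after any prefix of `program`.

    program: list of ("st", addr, value), ("clwb", addr) or ("sfence",) tuples.
    """
    states = set()
    for cut in range(len(program) + 1):
        guaranteed = {}  # values known to be in the persistence domain
        cache = {}       # dirty values that may or may not have been written back
        pending = {}     # clwb issued but not yet ordered by a fence
        for op in program[:cut]:
            if op[0] == "st":
                cache[op[1]] = op[2]
            elif op[0] == "clwb" and op[1] in cache:
                pending[op[1]] = cache[op[1]]
            elif op[0] == "sfence":
                guaranteed.update(pending)
                for addr in pending:
                    cache.pop(addr, None)
                pending = {}
        # Any subset of the remaining dirty lines may already be durable.
        dirty = list(cache.items())
        for r in range(len(dirty) + 1):
            for subset in combinations(dirty, r):
                state = dict(guaranteed)
                state.update(subset)
                states.add(frozenset(state.items()))
    return states


def b_without_a(states):
    """True if some crash state has B durable while A is not."""
    return any("B" in dict(s) and "A" not in dict(s) for s in states)


unordered = [("st", "A", 1), ("st", "B", 2)]
ordered = [("st", "A", 1), ("clwb", "A"), ("sfence",), ("st", "B", 2)]

print(b_without_a(possible_durable_states(unordered)))  # True
print(b_without_a(possible_durable_states(ordered)))    # False
```

    In the unordered program, B can become durable before A because writeback order is not program order. With clwb and sfence between the two stores, no crash state in this model has B without A.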

  • Let’s Revisit Software Logging

    • Software logging (SL)
      - Software performs log creation, maintenance, and truncation
      - (+) Flexible (e.g. no OS support needed)
      - (−) High performance overheads (~50% slowdown)

    • Hardware logging (HL)
      - Hardware creates and manages logs automatically (e.g. ATOM [HPCA’17])
      - (+) Low performance overheads
      - (−) Not flexible

    [Figure: program order over time for log A, log B, log C, log D, st A, st B, st C, st D. The software logging timeline includes a FENCE; the hardware logging timeline does not.]

    a. Memory fence is not required between logging and data modification
    b. New logging optimizations possible

  • Software Supported Hardware Logging

    • Software Supported Hardware Logging (SSHL)
      - Hardware provides logging instructions
      - Software performs logging operations using logging instructions
      - Hardware applies optimizations

    SL: Flexible, but not fast
    HL: Fast, but not flexible
    SSHL: Fast and flexible

  • Proteus: SSHL Design

    • Flexibility: software involvement in logging
      - Add instructions that start logging operations in hardware
      - Two instructions are required: log-load and log-flush

    • Performance optimizations
      - Parallel logging: process multiple logging operations concurrently
      - Redundant logging detection and removal

    • Endurance optimization (log write removal)
      - With the introduction of ADR, the WPQ is considered non-volatile
      - Key insight: logs are no longer needed when a transaction commits
      - Remove logs without flushing to NVMM
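
    The endurance optimization can be illustrated with a toy queue that holds log entries inside the persistence domain and drops them at commit. This is a sketch under stated assumptions, not the Proteus hardware: `LogPendingQueue`, its methods, and the oldest-first overflow policy are hypothetical, and the only thing counted is how many log entries would have to be written to NVMM.

```python
class LogPendingQueue:
    """Toy queue of log entries that sit in the persistence domain until commit."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = []          # (tx_id, log_addr, data), oldest first
        self.nvmm_log_writes = 0   # log entries that had to be written to NVMM

    def add(self, tx_id, log_addr, data):
        if len(self.entries) == self.capacity:
            # No free entry: the oldest log is written to NVMM to make room
            # (oldest-first is an assumption of this sketch).
            self.entries.pop(0)
            self.nvmm_log_writes += 1
        self.entries.append((tx_id, log_addr, data))

    def commit(self, tx_id):
        # Logs of a committed transaction are dropped without reaching NVMM.
        self.entries = [e for e in self.entries if e[0] != tx_id]


small_tx = LogPendingQueue(capacity=4)
for i in range(3):
    small_tx.add(tx_id=1, log_addr=0x200 + 64 * i, data=f"old{i}")
small_tx.commit(tx_id=1)
print(small_tx.nvmm_log_writes)  # 0

large_tx = LogPendingQueue(capacity=4)
for i in range(6):
    large_tx.add(tx_id=2, log_addr=0x200 + 64 * i, data=f"old{i}")
large_tx.commit(tx_id=2)
print(large_tx.nvmm_log_writes)  # 2
```

    A transaction whose logs fit in the queue commits without any log write reaching NVMM; only overflow forces log writes.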

  • Proteus: New Logging Instructions

    - log-from address (M1): address of original data
    - log-to address (M2): address of log entry
    - Log data register (LR#): register holding logging data

    log-load $LR1 M1     LR1 = Mem[M1]
    log-flush $LR1 M2    Mem[M2] = LR1

    [Figure: L1/L2 caches, shared cache, MC, and NVM, with M1 and M2 in memory and $LR1 in the core; log-load $LR1 M1 is followed by log-flush $LR1 M2]

    Code generation

    Source:
      tx_begin
      A = …
      B = …
      tx_end

    Generated code:
      i1: tx_begin
      i2: log-load LR1, A
      i3: log-flush LR1, (LTA)+
      i4: st A
      i5: log-load LR2, B
      i6: log-flush LR2, (LTA)+
      i7: st B
      i8: tx_end
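
    The code generation step on this slide can be sketched as a function that wraps every store in a transaction with a log-load and log-flush pair. The function name `generate` and the round-robin choice of log registers are assumptions made for illustration; the slide only shows the resulting instruction sequence.

```python
def generate(stores, num_log_regs=8):
    """Emit the logging sequence for the stores of one transaction.

    stores: names of the locations stored to, in program order.
    num_log_regs: number of log data registers to rotate through (assumption).
    """
    code = ["tx_begin"]
    for n, target in enumerate(stores):
        reg = f"LR{n % num_log_regs + 1}"
        code.append(f"log-load {reg}, {target}")   # read the old value into the log register
        code.append(f"log-flush {reg}, (LTA)+")    # write it to the next log entry
        code.append(f"st {target}")                # then perform the actual store
    code.append("tx_end")
    return [f"i{i}: {ins}" for i, ins in enumerate(code, start=1)]


for line in generate(["A", "B"]):
    print(line)
```

    For the stores A and B this prints the same eight instructions, i1 through i8, that the slide lists.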

  • Proteus Hardware Design

    [Figure: block diagram with the pipeline, register file (Int, fp, LDR, plus txID, log-start, log-end, cur-log), LoadQ, StoreQ, LogQ (from, to, data) with dependency checks, LLT, cache (tag, LRU, txID), router, and a memory controller with ADR containing the WPQ, LPQ (txID, coreID, log info, data), and arbiter in front of NVMM]

    • Log data register (LDR)
      - Keeps log data while logging instructions are in the pipeline

    • txID: holds the ID of the transaction currently executing on the core
    • log-start: the start address of the log area
    • log-end: the end address of the log area
    • cur-log: tracks the current free log entry

    • Arbiter
      - Prioritizes writes from the WPQ unless the LPQ is running out of free entries (fewer than a threshold)

    • Log Queue (LogQ)
      - Maintains log-to-store dependencies
      - Keeps track of logging executions (parallel logging)

    • Log Look-up Table (LLT)
      - Prevents redundant logging in a transaction

    • Log Pending Queue (LPQ)
      - Holds logs until the transaction ends or there are no free entries
      - Keeps logs separate from the WPQ so that incoming read requests do not have to check log entries
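
    The role of the LLT can be sketched as a lookup that answers whether a location has already been logged in the current transaction. This is an illustrative model only: `LogLookupTable`, its methods, and the 64-byte tracking granularity are assumptions, and the sketch ignores the table's finite capacity and associativity.

```python
class LogLookupTable:
    """Toy model of redundant-logging detection within a transaction."""

    def __init__(self, granularity=64):
        self.granularity = granularity  # bytes tracked per entry (assumption)
        self.logged = set()             # (tx_id, block number) pairs already logged

    def needs_logging(self, tx_id, addr):
        """Return True the first time a block is touched in a transaction."""
        key = (tx_id, addr // self.granularity)
        if key in self.logged:
            return False  # old value is already in the log; skip the redundant log
        self.logged.add(key)
        return True

    def end_transaction(self, tx_id):
        # Entries of a finished transaction are no longer needed.
        self.logged = {key for key in self.logged if key[0] != tx_id}


llt = LogLookupTable()
print(llt.needs_logging(1, 0x800))  # True: first write to this block
print(llt.needs_logging(1, 0x808))  # False: same block, already logged
print(llt.needs_logging(1, 0x840))  # True: a different block
llt.end_transaction(1)
print(llt.needs_logging(2, 0x800))  # True: a new transaction logs again
```

    An undo log only needs the value a location had before the transaction started, so repeated writes to an already logged block add nothing to it.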

  • Proteus Hardware Design

    [Figure: animated walk-through of the code below on the hardware diagram (LDR with LR1 and LR2, register file, LogQ with from/to/data, StoreQ, cache, router, WPQ, LPQ, arbiter, NVMM). Values shown in the diagram include a LogQ entry from 0x800 to 0x200 with data A and an LPQ entry 02, 1, 0x200, A.]

    tx_begin
    log-load LR1, (0x800)
    log-flush LR1, (LTA)+
    store B, (0x800)
    clwb (0x800)
    sfence
    tx_end

    0x800: A before the store, 0x800: B after it

  • Methodology

    System Configuration

    Processor      OOO, 3.4GHz, 4 cores
    L1 I/D Cache   32KB, 8-way, 64B block, 4 cycles, private per core
    L2 Cache       256KB, 8-way, 64B block, 12 cycles, private per core
    L3 Cache       8MB, 16-way, 64B block, 42 cycles, shared by all cores
    NVM            DDR3-like interface, 800MHz, 8GB, 1 channel, 16 banks per rank, 2KB row-buffer
    NVM timing     tCAS-tRCD-tRP-tRAS-tRC-tWR-tWTR-tRTP-tRRD-tFAW = 11-29(109)-11-28-39-12-6-6-5-24 (tRCD 29 for read, 109 for write)
    Proteus        LDR: 8 registers, LogQ: 8 entries, LLT: 64 entries (8-way), LPQ: 256 entries

    - MarssX86 + DRAMsim2 simulator is used
    - NVM has 50ns read latency and 150ns write latency
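
    The stated latencies line up with the timing parameters if one assumes that an access costs tRCD + tCAS cycles at the 800MHz interface clock. That reading is an assumption, not something the slide states; the short calculation below, with the hypothetical helper `cycles_to_ns`, only checks that the numbers are consistent with it.

```python
def cycles_to_ns(cycles, clock_mhz):
    """Convert a cycle count at the given clock frequency to nanoseconds."""
    return cycles * 1000.0 / clock_mhz


CLOCK_MHZ = 800     # NVM interface clock from the configuration table
T_CAS = 11          # cycles
T_RCD_READ = 29     # cycles, tRCD for reads
T_RCD_WRITE = 109   # cycles, tRCD for writes

# Assumption: access latency is modeled as tRCD + tCAS.
read_ns = cycles_to_ns(T_RCD_READ + T_CAS, CLOCK_MHZ)
write_ns = cycles_to_ns(T_RCD_WRITE + T_CAS, CLOCK_MHZ)

print(read_ns, write_ns)  # 50.0 150.0
```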

  • Evaluation (1) - Speedup

    - Baseline: software logging using Intel PMEM instructions
    - Proteus performs 46% better than baseline and 10% better than ATOM

    [Figure: speedup over baseline for Queue, Btree, AvlTree, Hashmap, RB tree, and StringSwap]

  • Evaluation (2) - Number of writes

    - Baseline: no logging (not failure safe)
    - ATOM incurs 350% more writes than baseline
    - Proteus has similar writes to baseline (only 2% higher)
    - ATOM introduces 3.4x more writes than Proteus
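
    The 3.4x figure follows from the two percentages above when "more writes" is read as the extra amount on top of Proteus. The calculation below makes that arithmetic explicit; the helper name `extra_writes_factor` is hypothetical, and normalizing the baseline to 1.0 is only a convenience.

```python
def extra_writes_factor(a_extra_pct, b_extra_pct):
    """How many times more writes A incurs than B, given each one's extra % over baseline."""
    a_writes = 1.0 + a_extra_pct / 100.0  # baseline normalized to 1.0
    b_writes = 1.0 + b_extra_pct / 100.0
    return a_writes / b_writes - 1.0


atom_vs_proteus = extra_writes_factor(350, 2)
print(round(atom_vs_proteus, 1))  # 3.4
```

    ATOM writes 4.5 times the baseline and Proteus 1.02 times, so ATOM writes about 4.4 times as much as Proteus, which is 3.4 times more.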

  • Conclusions

    • Software logging is expensive but flexible
    • Hardware logging is fast but inflexible
    • Proteus: Software Supported Hardware Logging (SSHL)
      - Fast and flexible
      - New logging instructions allow software to manage logging
      - Performance optimizations: parallel logging, redundant logging removal
      - Endurance optimization: remove logs before flushing to NVMM

    • Results
      - Performance: 46% better vs. SW logging (10% better vs. ATOM)
      - Endurance: 2% more writes to NVMM vs. 350% with ATOM

  • Thank you
