From Hyper Converged Infrastructure to Hybrid Cloud Infrastructure
Karan Gupta, Principal Architect, Nutanix


  • From Hyper Converged Infrastructure to Hybrid Cloud Infrastructure

    Karan Gupta, Principal Architect, Nutanix

  • Founders of Hyper Converged Infrastructure

    Dheeraj Pandey

    2009: VP of Engineering at Aster Data (Teradata), a multi-genre advanced analytics solution

    2007: Managed the storage engine group for Oracle Database: Exadata

    Mohit Aron

    2009: Lead architect at Aster Data

    2007: Lead developer of the Google File System (GFS)

  • 2009-10: Changing technology landscape

    • Multiplexing of compute

    • Dynamically migrate workloads

    • VM high availability

    • > 25,000 IOPS

    • < 100 µs latency

    • Operational simplicity

    • Scale on demand

    AWS

  • Hyper Converged Infrastructure

  • Foundation technologies

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • Virtual disk abstraction

    Diagram: VMs (VM 0, VM 1 ... VM n) expose virtual disks (vDisk 0 ... vDisk n) backed by a cluster of servers (Server 0 ... Server N), each combining compute (CPU) and storage (SSD, HDD): a cluster block storage system.

  • Metadata index for virtual disks

    Diagram: the same cluster of compute + storage servers, with a metadata index mapping virtual disk block -> physical disk block.

  • Medusa: A Consistent System under Partitions

    Diagram: a distributed hash table (DHT) shards the metadata index (Shard 0 ... Shard n) across nodes (node 0 ... node m); each shard is replicated, kept consistent with Multi-Paxos, and stored durably in a log-structured merge-tree on SSD/HDD.

    • Use DHTs to shard the metadata index across the cluster

    • Use LSM trees for durability

    • Replicate shards and use Paxos for consistency

    Protocol (fRSM) · Performance (TRIAD) · Failures (IASO)

    A toy sharding sketch follows.
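    The slide's building blocks (DHT sharding of keys, Paxos-replicated shards) can be illustrated with a toy consistent-hashing ring. This is a minimal sketch with made-up names (Ring, replicas_for, node0 ... node3), not Medusa's actual placement logic.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: key -> replica set."""
    def __init__(self, nodes, vnodes=64, replication=3):
        self.replication = replication
        self.tokens = sorted(
            (int(hashlib.md5(f"{n}:{v}".encode()).hexdigest(), 16), n)
            for n in nodes for v in range(vnodes)
        )

    def replicas_for(self, key):
        # Walk the ring clockwise from the key's token, collecting the first
        # `replication` distinct nodes; each shard is then Paxos-replicated.
        h = int(hashlib.md5(key.encode()).hexdigest(), 16)
        i = bisect.bisect(self.tokens, (h,))
        out = []
        while len(out) < self.replication:
            node = self.tokens[i % len(self.tokens)][1]
            if node not in out:
                out.append(node)
            i += 1
        return out

ring = Ring(["node0", "node1", "node2", "node3"])
print(ring.replicas_for("vdisk42:block17"))  # e.g. ['node1', 'node3', 'node0']
```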

  • Failure Profiles of this Module

    Team starts to grow; performance and scale features:

    • Leader-only reads

    • Compaction changes

    • Memory management

    • DirectIO

    Still discovering Day 1 issues:

    • DirectIO/ext4

    • Leader-only reads

    • Cassandra skip row

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • Protocol: Fine-Grained Replicated State Machines for a Cluster Storage System (fRSM), NSDI'20

  • Key idea: Fine-grained Replicated State Machine (fRSM)

    Key

    • Key Id (partition identifier)

    Paxos instance number

    • Epoch: generation id

    • Timestamp: advanced by 1 every time the value is updated

    Paxos consensus state

    • Promised proposal number

    • Accepted proposal number

    • Chosen bit

    • No operation logs: next_RSM_state = function(curr_RSM_state, operation)

    • CAS/Read can support speculative execution

    • Stable leader: failure characteristics of the clusters

    • Value is not required for Paxos consensus

    A minimal per-key state sketch follows.
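    To make the per-key state concrete, here is a minimal sketch assuming hypothetical names (Cell, apply_cas). It mirrors the fields listed above and the "no operation logs" idea of computing the next state purely from the current state and the operation; it is not the fRSM implementation.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Cell:
    value: Optional[bytes]   # the key's value (not needed for consensus itself)
    epoch: int               # generation id
    timestamp: int           # advanced by 1 on every update
    promised: int = 0        # Paxos: highest proposal number promised
    accepted: int = 0        # Paxos: proposal number of the accepted value
    chosen: bool = False     # Paxos: value known to be chosen on a quorum

def apply_cas(cell: Cell, expected: Optional[bytes],
              new_value: bytes) -> Tuple[Cell, bool]:
    """Deterministic transition with no operation log:
    next_RSM_state = function(curr_RSM_state, operation)."""
    if cell.value != expected:
        return cell, False                     # CAS fails, state unchanged
    return Cell(new_value, cell.epoch, cell.timestamp + 1,
                cell.promised, cell.accepted, chosen=False), True
```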

  • APIs required by metadata maps [1]

    • Compare-and-Swap (key, old_val, new_val)

    • Create (key, val)

    • Delete (key)

    • Read (key): quorum reads, leader-only reads, mutating reads

    • Scan (key range)

    A hedged interface sketch follows.

    [1] Maurice Herlihy, "Wait-Free Synchronization", ACM Transactions on Programming Languages and Systems, 1991.
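    An interface sketch of the APIs listed above; the method names, types, and the Python Protocol form are assumptions for illustration, with the read variants mirroring the quorum / leader-only / mutating distinction.

```python
from typing import Iterator, Optional, Protocol, Tuple

class MetadataMap(Protocol):
    def compare_and_swap(self, key: str, old_val: Optional[bytes],
                         new_val: bytes) -> bool: ...
    def create(self, key: str, val: bytes) -> None: ...
    def delete(self, key: str) -> None: ...
    def read_quorum(self, key: str) -> Optional[bytes]: ...    # quorum read
    def read_leader(self, key: str) -> Optional[bytes]: ...    # leader-only read
    def read_mutating(self, key: str) -> Optional[bytes]: ...  # mutating read
    def scan(self, start: str, end: str) -> Iterator[Tuple[str, bytes]]: ...
```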

  • Delete handling under fRSM

    Message sequence between the Client, the Leader and Followers 0 ... n (consolidated from the per-step animation):

    • The client sends the delete request to the leader.

    • The leader performs an owner check, then a CAS update (t+1) on the key.

    • The leader sends DeleteCell to the followers; DeleteCell acks come back and the delete is acknowledged to the client.

    • On a failure to send the delete to a follower, the cell is tomb-stoned; once tomb-stone acks arrive, the value space is reclaimed.

    • Periodic delete retries cover followers that missed the delete.

    • After Cell Remove and its ack, the key is removed after 24 hours.

    A minimal sketch of this flow follows.
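    A minimal sketch of the delete flow above, assuming hypothetical helper names (Replica, delete_cell, TOMBSTONE) rather than the real message types; it captures the CAS-to-tomb-stone step, the retry of followers that missed the delete, and the delayed key removal.

```python
import time

TOMBSTONE = object()
REMOVE_AFTER_S = 24 * 3600        # "key removed after 24 hours"

class Replica:
    def __init__(self):
        self.cells = {}           # key -> (value, timestamp)

    def delete_cell(self, key, ts):
        self.cells[key] = (TOMBSTONE, ts)        # follower applies the tomb-stone

def delete(key, leader, followers):
    value, ts = leader.cells[key]
    leader.cells[key] = (TOMBSTONE, ts + 1)      # owner check + CAS update (t+1)
    missed = []
    for f in followers:
        try:
            f.delete_cell(key, ts + 1)           # DeleteCell -> DeleteCell ack
        except ConnectionError:
            missed.append(f)                     # failure to send delete
    return missed                                # acked to client; retried periodically

def remove_key(key, leader, followers, tombstoned_at):
    # Cell Remove: only after every replica holds the tomb-stone and 24h passed.
    if time.time() - tombstoned_at >= REMOVE_AFTER_S:
        for r in [leader, *followers]:
            r.cells.pop(key, None)
```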

  • Ghost Writes: Read-after-Read Inconsistency

    Diagram (timeline t1-t4): each node stores the key's state as (proposal, epoch, timestamp). Node X has accepted (Px, E, T+1) while Nodes Y and Z still hold (Px, E, T); only later does (Px, E, T+1) reach a second node. Until then, a read that contacts X can observe T+1 while a later read that reaches only Y and Z observes T: a read-after-read inconsistency, the "ghost write".

  • Mutating Reads: Stronger than Linearizability

    Diagram (timeline t1-t4): a read finds Node X at (Px, E, T+1) ahead of Nodes Y and Z at (Px, E, T). Rather than simply returning, the mutating read re-proposes the observed value with a new proposal number Py and drives (Py, E, T+1) to a quorum before the result is returned, so a later read cannot regress to T. A hedged sketch of such a read follows.
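    An illustrative sketch of a mutating read under the assumptions above; the helpers (read_local, accept, next_proposal_number) are hypothetical, not the fRSM API.

```python
def mutating_read(key, replicas, quorum, next_proposal_number):
    # Read from a quorum and pick the newest (epoch, timestamp) state.
    states = [r.read_local(key) for r in replicas[:quorum]]   # (epoch, ts, value)
    newest = max(states, key=lambda s: (s[0], s[1]))
    # Re-propose the observed value at a higher proposal number (Py > Px)
    # so it is chosen on a quorum before the read returns.
    proposal = next_proposal_number()
    acks = sum(r.accept(key, proposal, newest) for r in replicas)
    if acks >= quorum:
        return newest[2]          # later reads can no longer see an older value
    raise RuntimeError("could not commit the observed value; retry the read")
```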

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • Scale and Performance

    TRIAD: Creating Synergies Between Memory, Disk and Log in Log-Structured Key-Value Stores (ATC'17)

  • TRIAD Goal

    Decrease background-operation overhead to increase user throughput.

    Reduce write amplification (WA).

  • Background I/O Overhead

    Chart: throughput (K operations/s, 0-300) for RocksDB vs. RocksDB with no background I/O, on uniform and skewed 50% read / 50% write workloads.

    Long and slow background ops slow down user ops: up to a 3x throughput gap.

  • TRIAD

    Three techniques that work together and are complementary:

    • TRIAD-MEM: skewed workloads; improves WA in flushing and compaction

    • TRIAD-DISK: in-between workloads; improves WA in compaction

    • TRIAD-LOG: uniform workloads; improves WA in flushing

  • TRIAD-MEM: Hot-Cold Key Separation

    Diagram: the memory component Cm holds K1 V1n, K2 V2, K3 V3 ... Kn Vn; the commit log on disk holds every update (K1 V11, K2 V2, K1 V12, K1 V13, K1 V14 ... K1 V1n); flushing writes Cm to L0.

    Idea: keep hot keys in memory; flush only the cold keys; hot keys stay durable in the commit log (CL).

    After the flush, the hot key K1 V1n remains in Cm while the cold keys K2 V2, K3 V3 ... Kn Vn are written to L0.

  • TRIAD-MEM Summary

    ✓ Good for skewed workloads.

    ✓ Reduce flushing WA: less data written from memory to disk.

    ✓ Reduce compaction WA: avoid repeatedly compacting hot keys.

    A minimal flush sketch follows.
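    A minimal sketch of hot-cold separation at flush time, assuming hypothetical names (flush, write_l0_sstable, hot_threshold); it is not TRIAD's code, only the idea of flushing cold keys while hot keys stay in memory and in the new commit log.

```python
from collections import Counter

def flush(memtable: dict, update_counts: Counter, hot_threshold: int = 4):
    cold = {k: v for k, v in memtable.items() if update_counts[k] < hot_threshold}
    hot = {k: v for k, v in memtable.items() if update_counts[k] >= hot_threshold}
    write_l0_sstable(cold)                 # only cold keys hit disk as an SSTable
    new_commit_log = list(hot.items())     # hot keys stay durable via the CL
    return hot, new_commit_log             # hot entries seed the next memtable

def write_l0_sstable(entries: dict):
    # Placeholder for the real SSTable writer.
    print(f"flushing {len(entries)} cold keys to L0")
```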

  • TRIAD-LOG

    Uniform workloads: improves WA in flushing.

  • Problem: Flushing with Uniform Workloads

    Diagram: the memory component Cm (K1 V1', K2 V2 ... Kn Vn) is flushed to an L0 SSTable, even though every entry (K1 V1, K2 V2, K1 V1', K3 V3, K3 V3', Kn Vn) has already been written to the commit log on disk.

    Insight: flushed data has already been written to the commit log.

    Idea: use commit logs as SSTables and avoid the background I/O due to flushing.

  • TRIAD-LOG

    Diagram: for each key, the memory component keeps a CL Index pointing to that key's most recent entry in the commit log (K1: 3, K2: 2, Kn: n).

    Only the CL Index is flushed from memory and coupled with the current commit log to form a CL-SSTable; the index is also kept in memory for further reads.

  • TRIAD-LOG Summary

    ✓ Good for uniform workloads.

    ✓ Reuse the commit log as the L0 SSTable.

    ✓ No more flushing of the memory component to disk.

    A minimal CL-index sketch follows.
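    A minimal sketch of the CL-index idea, with hypothetical structures (TriadLogMemtable, cl_index); it is not TRIAD's code, only the point that at flush time only the small index is written and paired with the existing commit log as a "CL-SSTable".

```python
class TriadLogMemtable:
    def __init__(self, commit_log_path: str):
        self.commit_log_path = commit_log_path
        self.values = {}     # key -> latest value (served from memory)
        self.cl_index = {}   # key -> offset of the latest entry in the commit log

    def put(self, key: str, value: bytes, cl_offset: int):
        self.values[key] = value
        self.cl_index[key] = cl_offset   # point to the most recent CL entry

    def flush(self):
        # No value pages are rewritten: the commit log already holds the data.
        cl_sstable = {"commit_log": self.commit_log_path,
                      "index": dict(self.cl_index)}
        self.values.clear()
        return cl_sstable                # also kept in memory for further reads
```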

  • Production Workloads: Throughput

    Chart: throughput (KOPS, 0-350) for RocksDB vs. TRIAD on two production workloads (Prod Wkld 1 is roughly uniform, Prod Wkld 2 is skewed); higher is better.

    TRIAD: stable throughput across workloads, up to 2x higher.

  • Production Workloads: Write Amplification

    Chart: write amplification (0-10) for RocksDB vs. TRIAD on the same two production workloads; lower is better.

    TRIAD: low and uniform WA, up to 4x lower.

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • Fail-Slow Errors

    IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services (ATC'19)

  • Fail-slow frequency

    ● Frequent - 232 incidents seen across 39,000 nodes over 7 months

    ● Almost 1 case per day

    ● Can take days to be fully resolved

  • Fail-slow problem space

  • IASO: Peer-Based Failure Detection

    Diagram: a score analyzer collects the set of scores reported for each peer, e.g. n1 {98, 97}, n2 {1, 1}, n3 {1, 1}; n1 is the outlier.

    • Detect a fail-slow node/peer

    • Quarantine the faulty node

    • Resolve the root cause

    A minimal outlier-scoring sketch follows.
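    An illustrative sketch of peer-based outlier detection, assuming a made-up scoring rule (find_fail_slow, median-based threshold); IASO's actual scoring and quarantine logic differ.

```python
from statistics import median

def find_fail_slow(peer_scores: dict, factor: float = 10.0):
    """peer_scores: node -> scores reported by its peers (higher = slower)."""
    agg = {node: median(scores) for node, scores in peer_scores.items()}
    baseline = median(agg.values())
    # Flag nodes whose aggregated score is an extreme outlier vs. the cluster.
    return [node for node, score in agg.items()
            if score > factor * max(baseline, 1.0)]   # quarantine candidates

print(find_fail_slow({"n1": [98, 97], "n2": [1, 1], "n3": [1, 1]}))  # ['n1']
```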

  • IASO in Production

    Mitigates cluster outages in 10 minutes.

    Catches fail-slow faults with 96.3% accuracy.

  • Us and Publications from Industry and Academia, 2010s

    • Raft (ATC'14)

    • RIFL (SOSP'15)

    • WPaxos (2017)

    • EPaxos (SOSP'13)

    • Spanner (OSDI'12)

    • Hermes (ASPLOS'20)

    • Physalia (NSDI'20)

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • Stargate: Data IO Path

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • Lines of Code over Time: Cache

    • In-memory caching, scan resistance

    • SSD spillover

    • Unification of caches

    • Priority-based caching / auto-disable

    • Touch-pool adjustment

    • New use cases (OSS), new types/watches

    • Timestamp- and tag-based O(1) subset invalidations

    • TTL-based eviction, O(1)

    • Intelligent compressed cache

    • Accurate warmup of hot data

  • TCMalloc Issues

    • TCMalloc is designed for performance over garbage collection

      ○ Thread-local caches

      ○ Unordered fast reuse from freelists and new pages

    • A single process with multiple, possibly independent modules sharing memory

      ○ Shared memory domain/arena for memory and CPU efficiency

      ○ Bursty writes and reads mean bursty allocations and deallocations in TCMalloc

      ○ Modules can expand and reduce memory usage over time in a common memory space

    Issues: GBs of fragmented memory in the central cache; performance impact and less efficient CPU and memory usage; caches have to be pruned to regain memory.

  • Nutanix Controller VM Services

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

  • New Media Shifts the Bottleneck to Software

    > 550,000 IO/s

    ~10 µs

  • Fast Drives: Random Writes ≈ Sequential Writes

    Chart: drive bandwidth (MB/s, 0-3000) for 2010, 2013, 2016 and 2018; higher is better. On new hardware the existing assumption no longer holds: random 4K writes ≈ sequential writes.

  • Accelerating the Data Path with Blockstore + SPDK

    Diagram: three generations of the data path inside the Controller VM.

    • Current (Nutanix data path): Stargate -> file system (extent store) -> block subsystem (SCSI), via system calls, to SSD.

    • 2HCY20 (user-space filesystem): Stargate -> Blockstore, with efficient filesystem metadata management -> block subsystem (SCSI), via system calls, to SSD.

    • Future: Stargate -> Blockstore -> SPDK -> NVMe; purpose-built for device access through SPDK for NVMe media.

    Fully utilizes new media performance; purpose-built for NVMe but also benefits SSDs and HDDs.

  • Accelerating Further for the Full Stack (AOS and AHV)

    Current: Stargate -> Blockstore -> SPDK -> NVMe inside the Controller VM; system calls between storage and the hypervisor.

    Future (AHV iSER): shortest data path from app to storage.

    • Use of iSCSI over RDMA between AHV (initiator) and Stargate (target)

    • Zero-copy DMA operations

    • Eliminates system calls

  • Nutanix Hybrid Cloud

    Medusa (Metadata Service): Protocol, Performance/Scale Enhancements, Failure profile

    Stargate (Data Path): Cache Layer, NextGen Data path

    Hybrid Cloud

    Diagram: the "Hybrid" data center spans the Private cloud, Hosted (traditional hosting) and Public clouds (SaaS apps).

    Integration · Visibility · Cross-cloud security · License portability · Workload portability · Data locality · Latency and Direct Connect

  • Hybrid Multicloud Architecture

    Nutanix Hybrid Multicloud Platform

    AWS: Nutanix Private Cloud on EC2 Bare Metal inside a VPC, connected over Direct Connect to AWS services (S3, RDS, EC2, Machine Learning, Elasticsearch, ...).

    Azure: Nutanix on Azure Dedicated Hosts inside a VNET, connected over ExpressRoute to Azure services (Virtual Machines, Blob Storage, SQL DB, Databricks, Cognitive Services, ...).

    1. Click to Cloud with existing VPCs, subnets and accounts

    2. Govern and manage costs across all clouds

    3. App mobility with programmable infrastructure and portability

  • Active Research and Development Areas

    Medusa

    • New media like Optane drives and non-LSM databases (KVell, SOSP'19)

    • S3-based time-series databases

    • SmartNICs: background process offloading, even the protocol

    Stargate (Data Path)

    • NextGen architecture to support GPUs

    • SmartNICs: disaggregated storage

    • SmartStorage

    • WAN-optimized storage and data mobility

    • Better memory management

    Hybrid Cloud

  • 2009-10: Changing technology landscape

    • Multiplexing of compute

    • Dynamically migrate workloads

    • VM high availability

    • > 25,000 IOPS

    • < 100 µs latency

    • Operational simplicity

    • Scale on demand

    AWS

  • 2019-20: Changing technology landscape

    • Multiplexing of compute: VMs, containers, functions

    • > 10 GBps

    • < 1 µs latency

    • Cloud specializations: geos, functionality, features and government regulations

    NVM

  • Thank You