TRANSCRIPT
-
From Hyper Converged Infrastructure to Hybrid Cloud Infrastructure
Karan Gupta, Principal Architect, Nutanix
-
Dheeraj Pandey
2009: VP of Engineering at Aster Data (Teradata) - multi-genre advanced analytics solution
2007: Managed storage engine group for Oracle Database: Exadata
Mohit Aron
2009: Lead architect at Aster Data
2007: Lead developer of Google File System (GFS)
Founders of Hyper Converged Infrastructure
-
• Multiplexing of compute
• Dynamically migrate workloads
• VM high availability
• > 25,000 IOPS
• < 100us latency
• Operational simplicity
• Scale on demand
2009-10: Changing technology landscape
AWS
-
Hyper Converged Infrastructure
-
Foundation technologies
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
3
VM 0 VM 1 VM n
Virtual disk abstraction
vDisk 0 vDisk n … vDisk 0 vDisk n … vDisk 0 vDisk n …
Server 0Compute Storage
Cluster
CPU HDDCPU SSD
Server NCompute Storage
CPU HDDCPU SSD
Cluster block storage system
-
4
VM 0 VM 1 VM n
Metadata index for virtual disks
vDisk 0 vDisk n … vDisk 0 vDisk n … vDisk 0 vDisk n …
Server 0Compute Storage
Cluster
CPU HDDCPU SSD
Server NCompute Storage
CPU HDDCPU SSD
Metadata index: virtual disk block -> physical disk block
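The mapping the index maintains can be pictured with a small sketch (all names and the schema here are illustrative, not Nutanix's actual layout):

```python
class MetadataIndex:
    """Maps a virtual disk block to the physical block that backs it."""

    def __init__(self):
        # (vdisk_id, vdisk_block) -> (server, device, physical_block)
        self._index = {}

    def map_block(self, vdisk_id, vdisk_block, server, device, physical_block):
        self._index[(vdisk_id, vdisk_block)] = (server, device, physical_block)

    def resolve(self, vdisk_id, vdisk_block):
        # Returns None for never-written (sparse) blocks.
        return self._index.get((vdisk_id, vdisk_block))

idx = MetadataIndex()
idx.map_block("vdisk0", 7, server=2, device="ssd0", physical_block=1042)
assert idx.resolve("vdisk0", 7) == (2, "ssd0", 1042)
assert idx.resolve("vdisk0", 8) is None   # unallocated block
```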
-
8
Medusa: A Consistent system under Partitions
Distributed hash table (DHT)
Shard 0
Multi-Paxos Multi-Paxos
Log-structured Merge-tree
Log-structured Merge-tree
Shard n
node 0 node m
CPU
SSD
CPU
SSD
CPU
HDD
CPU
SSD
Use DHTs to shard metadata index across the cluster
Use LSM for durability
Replicate shards and use Paxos for consistency
• Protocol (fRSM)
• Performance (TRIAD)
• Failures (IASO)
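A minimal sketch of the sharding and placement step, assuming a simple hash-mod scheme with ring-order replica placement (the function and node names are illustrative):

```python
import hashlib

def shard_of(key: str, num_shards: int) -> int:
    # Stable hash so every node agrees which shard owns a key.
    digest = hashlib.sha1(key.encode()).hexdigest()
    return int(digest, 16) % num_shards

def replicas_of(shard: int, nodes: list, rf: int = 3) -> list:
    # Place a shard's rf replicas on consecutive nodes in ring order;
    # each replica group then runs Multi-Paxos for consistency.
    start = shard % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]

nodes = ["node0", "node1", "node2", "node3"]
s = shard_of("vdisk0:block7", 16)
assert 0 <= s < 16
assert len(set(replicas_of(s, nodes))) == 3   # three distinct replicas
```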
-
Failure Profiles of this Module
Team starts to grow; performance and scale features:
• Leader-only reads
• Compaction changes
• Memory management
• DirectIO
Still discovering Day-1 issues:
• DirectIO/ext4
• Leader-only reads
• Cassandra skip-row
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
Protocol: Fine-Grained Replicated State Machines for a Cluster Storage System (fRSM), NSDI '20
-
Per-key state at each replica:
• Key
  • Key id (partition identifier)
• Paxos instance number
  • Epoch: generation id
  • Timestamp: advanced by 1 every time the value is updated
• Paxos consensus state
  • Promised proposal number
  • Accepted proposal number
  • Chosen bit
Properties:
• No operation logs: next_RSM_state = function(curr_RSM_state, operation)
• CAS/Read can support speculative execution
• Stable leader: matches the failure characteristics of the clusters
• The value is not required for Paxos consensus
Key idea: fine-grained replicated state machines (fRSM), one per key
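The "no operation logs" property, where the next state is computed directly from the current state plus the operation, can be sketched as follows (field and function names are illustrative):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class CellState:
    epoch: int       # generation id
    timestamp: int   # advanced by 1 on every update
    value: object

def apply_op(curr: CellState, op: str, arg=None):
    # next_RSM_state = function(curr_RSM_state, operation):
    # the per-key state machine needs no separate operation log.
    if op == "read":
        return curr, curr.value
    if op == "cas":
        old, new = arg
        if curr.value != old:
            return curr, False   # CAS lost the race; state unchanged
        return replace(curr, timestamp=curr.timestamp + 1, value=new), True
    raise ValueError(f"unknown op {op}")

s = CellState(epoch=1, timestamp=0, value="a")
s, ok = apply_op(s, "cas", ("a", "b"))
assert ok and s.timestamp == 1 and s.value == "b"
```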
-
Required APIs for the metadata maps [1]
• Compare-and-Swap (key, old_val, new_val)
• Create (key, val)
• Delete (key)
• Read (key): quorum reads, leader-only reads, mutating reads
• Scan (key range)
[1] Maurice Herlihy, Wait-Free Synchronization, ACM Transactions on Programming Languages and Systems, 1991
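Compare-and-Swap is universal [1], so richer operations can be layered on top of it; a hypothetical read-modify-write retry loop built over a stand-in for the CAS API above:

```python
def cas(store: dict, key, old_val, new_val) -> bool:
    # In-memory stand-in for the replicated Compare-and-Swap API.
    if store.get(key) != old_val:
        return False
    store[key] = new_val
    return True

def atomic_increment(store: dict, key) -> int:
    # Classic lock-free update: read, compute, CAS, retry on conflict.
    while True:
        cur = store.get(key)
        new = (cur or 0) + 1
        if cas(store, key, cur, new):
            return new

counters = {}
assert atomic_increment(counters, "k") == 1
assert atomic_increment(counters, "k") == 2
```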
-
Delete handling under fRSM
[Sequence diagram: Client → Leader → Followers 0 … n]
1. The client sends a delete request to the leader.
2. The leader performs an owner check, then a CAS update at timestamp t+1.
3. The leader sends DeleteCell to the followers; once acks arrive, the delete is acknowledged to the client.
4. The value is replaced by a tombstone; once tombstone acks arrive, the value space is reclaimed.
5. If a DeleteCell fails to reach a follower, periodic delete retries re-send it.
6. After CellRemove and its ack, the key itself is removed after 24 hours.
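The tombstone-then-reclaim sequence can be sketched as follows (the class, the grace constant, and the single-replica view are illustrative simplifications):

```python
import time

TOMBSTONE = "__tombstone__"
GRACE_SECONDS = 24 * 3600   # key removed 24 hours after the tombstone

class Cell:
    def __init__(self, value):
        self.value = value
        self.deleted_at = None

    def delete(self, now=None):
        # The delete is itself a CAS update: the value becomes a
        # tombstone that replicates to followers like any other write.
        self.value = TOMBSTONE
        self.deleted_at = time.time() if now is None else now

    def reclaimable(self, now) -> bool:
        # The key is physically removed only after the grace period,
        # giving lagging followers time to learn the tombstone.
        return self.value == TOMBSTONE and now - self.deleted_at >= GRACE_SECONDS

c = Cell("v1")
c.delete(now=0)
assert not c.reclaimable(now=60)
assert c.reclaimable(now=GRACE_SECONDS)
```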
-
Ghost Writes: Read-after-Read inconsistency
[Timeline t1 … t4; nodes X, Y, Z; per-node state shown as (proposal, epoch, timestamp)]
• t1: node X holds (Px, E, T+1) while Y and Z still hold (Px, E, T): a write has reached only one replica.
• t2–t3: a quorum read that includes X observes timestamp T+1, but a later quorum read served by {Y, Z} observes only T, so reads can appear to go back in time.
• t4: the update finally propagates and a second replica reaches (Px, E, T+1).
-
Mutating Reads: Stronger than Linearizability
[Timeline t1 … t4, same scenario as above]
• t1–t2: node X holds (Px, E, T+1) while Y and Z hold (Px, E, T).
• t3: a mutating read re-proposes the highest observed version under a new proposal number Py, driving (Py, E, T+1) to a quorum.
• t4: the replicas converge on (Py, E, T+1); no later read can observe the older value.
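A hypothetical sketch of a mutating read: read any majority, and if the replicas disagree, write the newest version back (via a new consensus round, stubbed here as `propose`) before returning it, so the older value can never reappear:

```python
class Replica:
    # Toy replica exposing the (timestamp, value) pair for one key.
    def __init__(self, ts, val):
        self.ts, self.val = ts, val

def mutating_read(replicas, propose):
    quorum = replicas[: len(replicas) // 2 + 1]   # any majority
    versions = [(r.ts, r.val) for r in quorum]
    ts, val = max(versions)                        # newest observed version
    if len(set(versions)) > 1:
        propose(ts, val)   # commit the newest version before replying
    return val

fixed = []
reps = [Replica(2, "new"), Replica(1, "old"), Replica(1, "old")]
assert mutating_read(reps, lambda ts, val: fixed.append((ts, val))) == "new"
assert fixed == [(2, "new")]   # the read repaired the divergence
```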
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
Scale and Performance. TRIAD: Creating Synergies Between Memory, Disk and Log in Log-Structured Key-Value Stores, ATC '17
-
TRIAD Goal
• Decrease background-ops overhead to increase user throughput.
• Reduce write amplification.
-
Background I/O Overhead
[Chart: throughput (K operations/s) on uniform and skewed 50r-50w workloads, RocksDB vs. RocksDB with background I/O disabled; up to a 3x throughput gap]
• Long and slow background ops slow down user ops.
-
TRIAD: three complementary techniques that work together

Technique  | Workload          | Improves WA in
TRIAD-MEM  | Skewed workloads  | Flushing and compaction
TRIAD-DISK | In-between        | Compaction
TRIAD-LOG  | Uniform workloads | Flushing
-
TRIAD-MEM: Hot-cold key separation
[Diagram: the memory component Cm holds K1 … Kn; the commit log holds repeated updates to the hot key K1 (V11, V12, …, V1n) alongside the cold keys]
Idea: keep hot keys in memory.
• Flush only the cold keys (K2 … Kn) to L0.
• Keep the hot keys' latest entries in the commit log (CL).
-
TRIAD-MEM Summary
✓ Good for skewed workloads.
✓ Reduces flushing WA: less data written from memory to disk.
✓ Reduces compaction WA: avoids repeatedly compacting hot keys.
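The hot-cold split at flush time can be sketched as below; the threshold-based hot-key heuristic is illustrative, standing in for TRIAD's frequency-based classification:

```python
def flush_cold_only(memtable: dict, update_counts: dict, hot_threshold: int = 4):
    # TRIAD-MEM: only cold keys are written to the L0 SSTable; hot
    # keys stay in the new memory component (their latest values are
    # still recoverable from the commit log after a crash).
    hot = {k for k in memtable if update_counts.get(k, 0) >= hot_threshold}
    l0_sstable = {k: v for k, v in memtable.items() if k not in hot}
    new_memtable = {k: memtable[k] for k in hot}
    return l0_sstable, new_memtable

mem = {"K1": "V1n", "K2": "V2", "K3": "V3"}
counts = {"K1": 14, "K2": 1, "K3": 1}
l0, kept = flush_cold_only(mem, counts)
assert kept == {"K1": "V1n"}              # hot key stays in memory
assert l0 == {"K2": "V2", "K3": "V3"}     # cold keys flushed
```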
-
TRIAD-LOG

Technique | Workload          | Improves WA in
TRIAD-LOG | Uniform workloads | Flushing
-
Problem: Flushing with Uniform Workloads
[Diagram: the memory component Cm (K1 V1', K2 V2, …, Kn Vn) is flushed to an L0 SSTable, yet the commit log already contains every one of those entries (K1 V1, K2 V2, K1 V1', K3 V3, K3 V3', …, Kn Vn)]
Insight: flushed data has already been written to the commit log.
Idea: use commit logs as SSTables; avoid background I/O due to flushing.
-
TRIAD-LOG
[Diagram: instead of values, the memory component keeps a CL index that points each key to its most recent entry in the commit log, e.g. K1 -> 3, K2 -> 2, …, Kn -> n]
• Point to the most recent entry in the CL.
• Only flush the CL index from memory and couple it with the current commit log to form a CL-SSTable.
• Keep the index in memory for further reads.
-
TRIAD-LOG Summary
✓ Good for uniform workloads.
✓ Reuses the commit log as the L0 SSTable.
✓ No more flushing of the memory component to disk.
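The CL-SSTable construction can be sketched as: flush only a key-to-offset index and reuse the commit log itself as the table's data. Offsets here are 0-based list positions, a simplification of real file offsets:

```python
def build_cl_index(commit_log):
    # TRIAD-LOG: the flushed artifact is just an index mapping each key
    # to the offset of its most recent commit-log entry; the log file
    # plus this index together form the "CL-SSTable".
    index = {}
    for offset, (key, _value) in enumerate(commit_log):
        index[key] = offset   # later entries win
    return index

def read_from_cl_sstable(commit_log, index, key):
    # A read follows the index straight into the log.
    return commit_log[index[key]][1]

log = [("K1", "V1"), ("K2", "V2"), ("K1", "V1'")]
idx = build_cl_index(log)
assert idx == {"K1": 2, "K2": 1}
assert read_from_cl_sstable(log, idx, "K1") == "V1'"
```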
-
Production Workloads: Throughput
[Chart: throughput (KOPS), RocksDB vs. TRIAD, on two production workloads (Prod Wkld 1 roughly uniform, Prod Wkld 2 skewed); higher is better. TRIAD is up to 2x higher, with stable throughput across workloads.]
-
Production Workloads: Write Amplification
[Chart: write amplification, RocksDB vs. TRIAD, on the same two workloads; lower is better. TRIAD's WA is up to 4x lower, and low and uniform across workloads.]
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
Fail-Slow Errors
IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services, ATC '19
-
Fail-slow frequency
• Frequent: 232 incidents seen across 39,000 nodes over 7 months
• Almost one case per day
• Can take days to be fully resolved
-
Fail-slow problem space
-
IASO: Peer-based failure detection
[Diagram: a score analyzer receives a set of scores for each peer: n1 {98, 97} is an outlier; n2 {1, 1} and n3 {1, 1} are normal]
• Detect a fail-slow node/peer
• Quarantine the faulty node
• Resolve the root cause
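The outlier step can be sketched as below; the scoring model and threshold are illustrative, not IASO's actual statistics:

```python
from statistics import median

def fail_slow_outliers(peer_scores: dict, factor: int = 10):
    # Each node accumulates scores reported by its peers; a node whose
    # average score dwarfs the cluster median is flagged as fail-slow
    # and becomes a candidate for quarantine.
    avg = {n: sum(s) / len(s) for n, s in peer_scores.items()}
    med = max(median(avg.values()), 1)
    return [n for n, a in avg.items() if a > factor * med]

scores = {"n1": [98, 97], "n2": [1, 1], "n3": [1, 1]}
assert fail_slow_outliers(scores) == ["n1"]   # n1 gets quarantined
```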
-
IASO in Production
• Mitigates cluster outages within 10 minutes
• Catches fail-slow faults with 96.3% accuracy
-
Us and Publications from Industry and Academia, 2010s
• Raft (ATC '14)
• RIFL (SOSP '15)
• WPaxos (2017)
• EPaxos (SOSP '13)
• Spanner (OSDI '12)
• Hermes (ASPLOS '20)
• Physalia (NSDI '20)
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
Stargate: Data IO Path
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
Lines of Code over Time: Cache
• In-memory caching, scan resistance
• SSD spillover
• Unification of caches
• Priority-based caching / auto-disable
• Touch-pool adjustment
• New use cases (OSS), new types/watches
• Timestamp- and tag-based O(1) subset invalidations
• TTL-based eviction in O(1)
• Intelligent compressed cache
• Accurate warmup of hot data
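O(1) TTL eviction, for instance, can be approximated by bucketing entries by expiry second, so eviction drops whole buckets rather than scanning every entry (a sketch under stated assumptions, not the production design):

```python
from collections import defaultdict

class TTLBuckets:
    def __init__(self):
        self.buckets = defaultdict(set)   # expiry second -> keys
        self.expiry = {}                  # key -> expiry second

    def put(self, key, ttl_seconds, now):
        exp = int(now + ttl_seconds)
        self.expiry[key] = exp
        self.buckets[exp].add(key)

    def evict_expired(self, now):
        # Amortized O(1) per entry: expired buckets are popped whole,
        # with no scan over live entries.
        expired = []
        for sec in [s for s in self.buckets if s <= now]:
            for key in self.buckets.pop(sec):
                expired.append(key)
                self.expiry.pop(key, None)
        return expired

cache = TTLBuckets()
cache.put("a", ttl_seconds=1, now=0)
cache.put("b", ttl_seconds=60, now=0)
assert cache.evict_expired(now=2) == ["a"]
assert "b" in cache.expiry
```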
-
TCMalloc Issues
• TCMalloc is designed for performance, not garbage collection
  • Thread-local caches
  • Unordered fast reuse from freelists and new pages
• A single process with multiple, possibly independent, modules sharing memory
  • Shared memory domain/arena for memory and CPU efficiency
  • Bursty writes and reads mean bursty allocations and deallocations in TCMalloc
  • Modules can expand and shrink their memory usage over time in a common memory space
Resulting issues:
• GBs of fragmented memory in the central cache
• Performance impact and less efficient CPU and memory usage
• Caches have to be pruned to regain memory
-
Nutanix Controller VM Services
• Medusa (Metadata Service)
  • Protocol
  • Performance/Scale Enhancements
  • Failure profile
• Stargate (Data Path)
  • Cache Layer
  • NextGen Data path
• Hybrid Cloud
-
New Media Shifts the Bottleneck to Software
• > 550,000 IO/s
• ~10 us latency
~10us
-
Fast Drives: Random Writes ≈ Sequential Writes
[Chart: bandwidth (MB/s), 2010–2018; higher is better. Under the existing assumption, random 4K writes trail sequential writes; on new hardware, random 4K writes ≈ sequential writes.]
-
Accelerating the Data Path with Blockstore + SPDK
[Diagram: three generations of the Nutanix data path inside the Controller VM]
• Current: Stargate -> system calls -> file system (extent store) -> block subsystem (SCSI) -> SSD.
• 2HCY20: Stargate -> Blockstore (user-space filesystem with efficient filesystem metadata management) -> system calls -> block subsystem (SCSI) -> SSD. Purpose-built for NVMe, but also benefits SSDs and HDDs.
• Future: Stargate -> Blockstore -> SPDK -> NVMe. Purpose-built for device access through SPDK for NVMe media; fully utilizes new-media performance.
-
Accelerating Further for the Full Stack (AOS and AHV)
[Diagram: Current vs. Future data path between the hypervisor and the Controller VM]
• Current: system calls between storage and the hypervisor; Stargate -> Blockstore -> SPDK -> NVMe inside the Controller VM.
• Future (AHV + iSER): iSCSI over RDMA between AHV (initiator) and Stargate (target); zero-copy DMA operations; eliminates system calls. The shortest data path from app to storage.
-
Nutanix Hybrid Cloud
[Diagram: the "hybrid" data center spans the private cloud, hosted private clouds, traditional hosting, public clouds, and SaaS apps]
Requirements:
• Integration
• Visibility
• Cross-cloud security
• License portability
• Workload portability
• Data locality
• Latency and Direct Connect
-
Hybrid Multicloud Architecture
[Diagram: Nutanix Private Cloud on EC2 bare metal inside a VPC, connected via Direct Connect to AWS services: S3, RDS, EC2, Machine Learning, Elasticsearch, …]
1. Click-to-cloud with existing VPCs, subnets, and accounts
2. Govern and manage costs across all clouds
3. App mobility with programmable infrastructure and portability

Nutanix Hybrid Multicloud Platform
[Diagram: the same pattern on Azure: Azure Dedicated Hosts inside a VNET, connected via ExpressRoute to Azure services: Virtual Machines, Blob Storage, SQL DB, Databricks, Cognitive Services, …]
-
Active Research and Development Areas
• Medusa
  • New media such as Optane drives, and non-LSM databases (KVell, SOSP '19)
  • S3-based time-series databases
  • SmartNICs: offloading background processes, and even the protocol
• Stargate (Data Path)
  • NextGen architecture to support GPUs
  • SmartNICs: disaggregated storage
  • SmartStorage
  • WAN-optimized storage and data mobility
  • Better memory management
• Hybrid Cloud
-
• Multiplexing of compute
• Dynamically migrate workloads
• VM high availability
• > 25,000 IOPS
• < 100us latency
• Operational simplicity
• Scale on demand
2009-10: Changing technology landscape
AWS
-
• Multiplexing of compute: VMs, containers, functions
• > 10 GBps
• < 1us latency
• Cloud specializations: geos, functionality, features, and government regulations
2019-20: Changing technology landscape
NVM
-
Thank You