Breaking the Barrier of Monolithic SDS Architecture using NVMe-oF to Enable Next-gen Storage Technology Innovation
Yi Zou (Research Scientist), Arun Raghunath (Research Scientist), Intel Corp.
Monolithic SDS Architecture Challenges

- Cannot handle the demands of modern elastic applications (e.g., serverless):
  - Cannot scale IOPS up/down, increase throughput, or reduce latency for subsets of cluster data
  - Cannot scale out without cluster-wide data rebalancing
- Storage disaggregation "tax" [SDC2018 talk]: relayed data placement adds latency and bandwidth overheads
- Deep coupling of block-layer storage functions with purpose-built distributed storage capabilities makes it hard to integrate next-gen storage media and protocols

SDS = Software Defined Storage
Proposed Architecture Change
Decouple the SDS architecture into stateless and stateful components.

Stateless component: performs cluster-wide operations
- Chooses the data placement destination
- Manages replicas and erasure-coded chunks
- Monitors data integrity
- Performs failure recovery

Stateful component: actually stores data and metadata
- Responsible for durability; provides persistence
- Manages the block layout
- Provides object semantics
- Supports transactions

Standards-based NVMe-oF is used to communicate between the stateful and stateless components. The split applies to hyper-converged as well as disaggregated deployments.

[Diagram: pairs of stateless and stateful components in both disaggregated and hyper-converged SDS deployments, connected over NVMe-oF. A minimal interface sketch follows.]
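To make the division of labor concrete, here is a minimal C sketch, entirely our illustration: the names below (stateful_backend, stateful_ops, place_and_write) are hypothetical, not from the talk. The stateless layer decides where data goes and drives recovery, while durability, layout, object semantics, and transactions live behind an operations table reachable over the fabric.

```c
#include <stdint.h>
#include <stddef.h>

struct stateful_backend;  /* one stateful component (an NVMe-oF target) */

struct stateful_ops {
    int (*obj_read)(struct stateful_backend *b, uint64_t obj_id,
                    uint64_t off, void *buf, size_t len);
    int (*obj_write)(struct stateful_backend *b, uint64_t obj_id,
                     uint64_t off, const void *buf, size_t len);
    int (*txn_submit)(struct stateful_backend *b, const void *txn,
                      size_t txn_len);  /* remote asynchronous transaction */
};

/* The stateless side only decides *where* the object goes; it never
 * touches media-specific layout or persistence details. */
int place_and_write(const struct stateful_ops *ops,
                    struct stateful_backend **replicas, int n_replicas,
                    uint64_t obj_id, const void *buf, size_t len)
{
    for (int i = 0; i < n_replicas; i++) {
        int rc = ops->obj_write(replicas[i], obj_id, 0, buf, len);
        if (rc != 0)
            return rc;  /* stateless layer initiates failure recovery */
    }
    return 0;
}
```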
Unique Scale-out Vectors
[Diagram: one stateful storage target serving multiple stateless instances, which are added on scale-up and removed on scale-down]

- Spawn stateless components of the SDS server on new machines to add more CPU resources as needed, or to increase the physical cache size for an SDS server under memory pressure
- Only the bottleneck SDS server needs to be scaled out
- Scaling is possible with no data rebalancing
- As load reduces, the extra stateless instances can be shut down to reduce cost

Elastic and fine-grained scale-out capability (a toy autoscaler sketch follows).
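The sketch below uses hypothetical helper functions (no such API appears in the talk), and the 80%/20% thresholds are arbitrary. What it illustrates is that only the bottleneck server's stateless side grows or shrinks, and no data moves because every instance attaches to the same stateful target.

```c
/* Hypothetical helpers -- illustration only, not an API from the talk. */
struct sds_server;
double stateless_cpu_load(struct sds_server *s);
int    stateless_count(struct sds_server *s);
void   spawn_stateless_instance(struct sds_server *s);    /* attaches to the same target */
void   shutdown_stateless_instance(struct sds_server *s); /* detaches; no rebalancing    */

/* Scale only the bottleneck server; thresholds are arbitrary. */
void autoscale_tick(struct sds_server *bottleneck)
{
    double load = stateless_cpu_load(bottleneck);

    if (load > 0.80) {
        /* Under CPU or memory pressure: add CPU and cache by spawning
         * another stateless instance on a new machine. */
        spawn_stateless_instance(bottleneck);
    } else if (load < 0.20 && stateless_count(bottleneck) > 1) {
        /* Load dropped: shut down extra instances to reduce cost. */
        shutdown_stateless_instance(bottleneck);
    }
}
```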
Simplifies Integrating Next-gen Storage
Monolithic stack:
- Logic to leverage storage media is entwined with the remaining code
- Integrating new storage media and protocols gets complicated
- Ripple effects on unrelated code
- Media/protocol-specific optimizations are repeated per storage framework

Decoupled stack:
- New media can be integrated with no modifications to the stateless component
- The industry-standard communication interface between components simplifies integration
- Media-specific optimizations can be done once and then re-used by multiple storage frameworks

[Diagram: media-specific logic (media 1, media 2) is confined to the stateful component; the stateless component needs no modifications. A sketch of this boundary follows.]
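A minimal sketch of why this holds, with hypothetical names (media_ops, nand_ops, pmem_ops are our illustration): each medium implements one small interface inside the stateful component, so the stateless component and the wire protocol never change when the media does.

```c
#include <stdint.h>
#include <stddef.h>

/* One table per medium, implemented inside the stateful component. */
struct media_ops {
    int (*write)(uint64_t lba, const void *buf, size_t len);
    int (*read)(uint64_t lba, void *buf, size_t len);
    int (*flush)(void);
};

/* Hypothetical per-media implementations (e.g., NAND vs. persistent
 * memory); media-specific optimizations are written once, here. */
extern const struct media_ops nand_ops;
extern const struct media_ops pmem_ops;

/* Selected at target initialization; adding "media 3" means adding one
 * more ops table -- nothing on the stateless side changes. */
static const struct media_ops *active_media = &nand_ops;

static int stateful_write(uint64_t lba, const void *buf, size_t len)
{
    return active_media->write(lba, buf, len);  /* same path for any media */
}
```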
Benefits to Scaling out More Heterogeneous Services
- Services focus on their unique values
- Offload stateful tasks to the remote target side
- Let the target manage what it is good at: blocks
- Drive services to be stateless and container-friendly

[Diagram: today, Ceph, Cassandra, Kafka, and Swift each own a dedicated storage-target LUN; in the proposed model their stateless variants (Ceph', Swift', Cassandra', Kafka') share one stateful storage target]

Our vision is to be able to scale out heterogeneous services simultaneously.
Benefits to Recovery

In the monolithic architecture, the stateless and stateful functions share one failure domain. Decoupling them, over NVMe-oF (disaggregated) or PCIe (hyper-converged), splits that domain so each side can be recovered independently; a control-flow sketch follows the two procedures below.

Stateless recovery:
(1) Create a new stateless instance
(2) Connect the new stateless instance to the stateful component

Recovery of the stateful component:
(1) Temporarily route client requests to a replica's stateful component
(2) Create a new stateful component
(3) Connect the original stateless component to the new stateful component
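The two procedures map naturally onto code. In the sketch below the helpers (new_stateless, new_stateful, replica_stateful, attach, route_clients_to) are our hypothetical names; the point is that each path touches only the failed component.

```c
/* Hypothetical helpers -- our names, not from the talk. */
struct stateless_inst;
struct stateful_comp;
struct stateless_inst *new_stateless(void);
struct stateful_comp  *new_stateful(void);
struct stateful_comp  *replica_stateful(struct stateful_comp *failed);
void attach(struct stateless_inst *sl, struct stateful_comp *sf); /* NVMe-oF connect */
void route_clients_to(struct stateful_comp *sf);

void recover_stateless(struct stateful_comp *sf)
{
    struct stateless_inst *sl = new_stateless(); /* (1) spawn a replacement */
    attach(sl, sf);                              /* (2) reconnect; the stored
                                                        data is never touched */
}

void recover_stateful(struct stateless_inst *sl, struct stateful_comp *failed)
{
    route_clients_to(replica_stateful(failed));  /* (1) serve from a replica  */
    struct stateful_comp *sf = new_stateful();   /* (2) rebuild stateful side */
    attach(sl, sf);                              /* (3) reattach the original
                                                        stateless component   */
}
```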
Benefits to Disaggregation
[Diagram: the client sends data to the stateless component once; the stateless component writes (ID, data) directly to both Storage Target 1 and Storage Target 2, while the peer stateless component receives the ID and only metadata]

- No "relayed" data placement: latency reduction
- Minimized data transfers: bandwidth savings
- Improved data parallelism
- Net result: reduced bandwidth consumption, improved latency, TCO reduction

A sketch of this write path follows.
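This is the write path as we read the figure; send_object and send_metadata are hypothetical helpers. The bulk payload crosses the fabric once per replica, written directly by the stateless component, and the peer stateless component sees only metadata.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical transport helpers -- illustration only. */
struct target;
struct peer_stateless;
int send_object(struct target *t, uint64_t id, const void *data, size_t len);
int send_metadata(struct peer_stateless *p, uint64_t id,
                  const void *meta, size_t mlen);

int disaggregated_put(struct target *primary, struct target *replica,
                      struct peer_stateless *peer, uint64_t id,
                      const void *data, size_t len,
                      const void *meta, size_t mlen)
{
    int rc;

    /* Data is written directly to each stateful target: fabric traffic
     * is client bytes x replication factor, with no relayed OSD hop. */
    if ((rc = send_object(primary, id, data, len)) != 0)
        return rc;
    if ((rc = send_object(replica, id, data, len)) != 0)
        return rc;

    /* Peer-to-peer (OSD-to-OSD) traffic carries the ID and metadata
     * only, which is why it stays nearly constant in the results. */
    return send_metadata(peer, id, meta, mlen);
}
```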
Ceph PoC Details

- Based on Ceph Luminous + SPDK v19.07
- Created a new Ceph ObjectStore backend that acts as an SPDK NVMe-oF initiator and carries the ObjectStore APIs over NVMe-oF
- Uses the SPDK RDMA transport for NVMe
- Uses the SPDK NVMe-oF target, with a new SPDK bdev that runs a standalone Ceph BlueStore; the bdev maps incoming requests to the remote Ceph BlueStore

[Figures: PoC Ceph architecture change; PoC setup]

Metric: Ceph cluster network rx/tx bytes. A minimal initiator-connection sketch follows.
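For concreteness, here is a hedged sketch of how an ObjectStore backend could attach to the remote target as an NVMe-oF/RDMA initiator using public SPDK APIs (spdk_env_init, spdk_nvme_connect, spdk_nvme_ctrlr_alloc_io_qpair). The address and NQN are placeholders; the actual PoC backend is not published with this talk.

```c
#include <string.h>
#include <stdio.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

static struct spdk_nvme_ctrlr *connect_target(void)
{
    struct spdk_nvme_transport_id trid;

    memset(&trid, 0, sizeof(trid));
    trid.trtype = SPDK_NVME_TRANSPORT_RDMA;  /* SPDK RDMA transport */
    trid.adrfam = SPDK_NVMF_ADRFAM_IPV4;
    snprintf(trid.traddr, sizeof(trid.traddr), "%s", "192.168.1.10"); /* placeholder */
    snprintf(trid.trsvcid, sizeof(trid.trsvcid), "%s", "4420");
    snprintf(trid.subnqn, sizeof(trid.subnqn), "%s",
             "nqn.2019-07.example:bluestore");  /* placeholder NQN */

    /* Synchronously attach to the remote subsystem (the stateful side,
     * which in the PoC is a bdev backed by a standalone BlueStore). */
    return spdk_nvme_connect(&trid, NULL, 0);
}

int main(void)
{
    struct spdk_env_opts opts;

    spdk_env_opts_init(&opts);
    opts.name = "objectstore_initiator";
    if (spdk_env_init(&opts) < 0)
        return 1;

    struct spdk_nvme_ctrlr *ctrlr = connect_target();
    if (ctrlr == NULL)
        return 1;

    /* I/O qpair over which ObjectStore operations are issued. */
    struct spdk_nvme_qpair *qpair =
        spdk_nvme_ctrlr_alloc_io_qpair(ctrlr, NULL, 0);
    (void)qpair;
    return 0;
}
```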
NVMe-oF Protocol Modifications

The NVMe-oF protocol today:
- Is a block transport protocol designed for block I/O queueing/execution
- Expects accesses to be block-aligned
- Has no transaction support

The PoC modifies the NVMe-oF protocol in SPDK:
- Adds object awareness
- Enables a minimal set of object operations (the native Ceph ObjectStore APIs)
- Enables support for remote asynchronous transactions
- Adds new READ/WRITE opcodes, distinct from the block-level READ/WRITE, that wrap around the existing SPDK spdk_nvme_ns_cmd_writev()/spdk_nvme_ns_cmd_readv() calls
- On the target side, the new READ/WRITE opcodes are remapped: the target decodes the payload header and passes the decoded object information to BlueStore (see the sketch below)

The PoC shows that NVMe-oF can be extended to be more powerful and flexible for next-gen storage architectures. We solicit feedback from the industry on our approach and on the extensions required to generalize these NVMe-oF protocol modifications.
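A hedged reconstruction of the payload encoding: the obj_hdr layout and obj_write wrapper below are our guesses, as the talk does not show the PoC's exact format. For brevity it issues the standard contiguous-buffer spdk_nvme_ns_cmd_write() rather than the vectored writev/readv calls the PoC wraps, and it omits the new opcode itself.

```c
#include <stdint.h>
#include <string.h>
#include "spdk/env.h"
#include "spdk/nvme.h"

/* Hypothetical on-wire header prepended to the object data; the target
 * decodes it and hands the object off to BlueStore. */
struct obj_hdr {
    uint64_t obj_id;  /* object being written           */
    uint64_t offset;  /* byte offset within the object  */
    uint64_t length;  /* bytes of object data to follow */
};

static int obj_write(struct spdk_nvme_ns *ns, struct spdk_nvme_qpair *qp,
                     uint64_t obj_id, uint64_t off,
                     const void *data, uint64_t len,
                     spdk_nvme_cmd_cb cb_fn, void *cb_arg)
{
    uint32_t sector = spdk_nvme_ns_get_sector_size(ns);
    uint64_t bytes = sizeof(struct obj_hdr) + len;
    uint32_t lba_count = (uint32_t)((bytes + sector - 1) / sector);

    /* DMA-able buffer holding header + data (registered for RDMA);
     * the completion callback is expected to spdk_free() it. */
    uint8_t *buf = spdk_zmalloc((uint64_t)lba_count * sector, 0x1000, NULL,
                                SPDK_ENV_SOCKET_ID_ANY, SPDK_MALLOC_DMA);
    if (buf == NULL)
        return -1;

    struct obj_hdr hdr = { .obj_id = obj_id, .offset = off, .length = len };
    memcpy(buf, &hdr, sizeof(hdr));
    memcpy(buf + sizeof(hdr), data, len);

    /* The PoC adds a distinct opcode so the target can tell object I/O
     * from block I/O; here the standard write path stands in for it. */
    return spdk_nvme_ns_cmd_write(ns, qp, buf, /*lba=*/0, lba_count,
                                  cb_fn, cb_arg, 0);
}
```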
Results
Setup: optimized Ceph (modified stock Ceph and stock SPDK) with a Chelsio RDMA NIC
Test: rados put of objects from small to big, 3 KB to 20 MB, 100 iterations each
Measurement: Ceph network rx/tx bytes

Observations:
- Fabric traffic = client traffic x replication factor
- Cluster traffic is greatly reduced
- OSD-to-OSD traffic is close to constant: it carries metadata only, versus metadata plus data in stock Ceph

The results validate the PoC.

* Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Intel is a trademark of Intel Corporation in the U.S. and/or other countries. Other names and brands may be claimed as the property of others. See Trademarks on intel.com for a full list of Intel trademarks, or the Trademarks & Brands Names Database.
Bandwidth Reduction Results
- Derive the reduction in bandwidth consumption; the 3-way replication case is estimated from the 2-way replication measurements
- In stock Ceph, the overhead comes from extra hops and grows with object size and replication factor
- Overall reduction is ~33% (2-way replication) and ~40% (3-way replication)
- Cost = client traffic x replication factor, as expected: the disaggregation "tax" is reduced (a worked reading of these numbers follows)
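The slides report the percentages without the intermediate arithmetic. One reading that reproduces them, offered as our inference rather than the authors' derivation: the decoupled design moves rep x (client bytes) over the fabric (cost = client traffic x replication factor), while stock Ceph additionally relays one copy per replica hop, giving roughly (2 x rep - 1) x (client bytes). Then:

    reduction(rep) = 1 - rep / (2 x rep - 1)
    2-way: 1 - 2/3 ~ 33%        3-way: 1 - 3/5 = 40%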
Summary

- SDS stacks are monolithic, with deep coupling of block-layer storage functions and purpose-built distributed storage capabilities
- We propose decoupling SDS architectures into stateless and stateful components to enable independent scalability and create new scaling vectors
- We presented initial results from a hardware-RDMA-based Ceph PoC

Next steps:
- Illustrate the containerization of SDS components
- Illustrate the CPU cost of the stateless and stateful components
- Illustrate the latency reduction in various scenarios