Ceph Day New York 2014: Ceph over High Performance Networks

DESCRIPTION
Presented by Asaf Wachtel, Mellanox

TRANSCRIPT
Building World Class Data Centers
Mellanox High Performance Networks for Ceph
Ceph Day, October 8th, 2014
Agenda
Who are we, and what do we have to do with Ceph?
Benefits of High Performance Interconnect for Ceph
Typical deployment scenarios
Advanced performance work in progress
Examples of Commercial Solutions
Leading Supplier of End-to-End Interconnect Solutions
[Diagram: end-to-end portfolio spanning Server/Compute, Switch/Gateway, Storage (front- and back-end), and Metro/WAN, with Virtual Protocol Interconnect on both ends: 56G InfiniBand and FCoIB, 10/40/56GbE and FCoE. Product families: ICs, adapter cards, switches/gateways, cables/modules, host/fabric software.]
Comprehensive End-to-End InfiniBand and Ethernet Portfolio
The Future Depends on Fastest Interconnects
[Figure: interconnect generations, from 1Gb/s to 10Gb/s to 40/56Gb/s]
From Scale-Up to Scale-Out Architecture
• The only way to support storage capacity growth in a cost-effective manner
• We have seen this transition on the compute side in HPC in the early 2000s
• Scaling performance linearly requires "seamless connectivity" (i.e., lossless, high bandwidth, low latency, CPU offloads)
Interconnect Capabilities Determine Scale Out Performance
Ceph and Networks
High performance networks enable maximum cluster availability
• Clients, OSDs, monitors, and metadata servers communicate over multiple network layers
• Real-time requirements for heartbeat, replication, recovery, and re-balancing
Cluster ("backend") network performance dictates the cluster's performance and scalability
• "Network load between Ceph OSD Daemons easily dwarfs the network load between Ceph Clients and the Ceph Storage Cluster" (Ceph documentation)
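This front-end/back-end split maps directly onto Ceph's standard configuration options. A minimal ceph.conf sketch, assuming illustrative subnets (the addresses are examples, not from the talk):

    # ceph.conf -- separate public (client-facing) and cluster (replication,
    # recovery, heartbeat) traffic onto different networks
    [global]
        public network  = 192.168.10.0/24   # clients <-> OSDs/monitors
        cluster network = 192.168.20.0/24   # OSD <-> OSD back-end traffic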
How Customers Deploy Ceph with Mellanox Interconnect
Building scalable, performant storage solutions
• Cluster network @ 40Gb Ethernet
• Clients @ 10Gb/40Gb Ethernet
Directly connect over 500 client nodes
• Target retail cost: US$350/1TB
Scale-out customers use SSDs
• For OSDs and journals
8.5PB system currently being deployed
Ceph Deployment Using 10GbE and 40GbE
Cluster (private) network @ 40GbE
• Smooth HA, unblocked heartbeats, efficient data balancing
Throughput clients @ 40GbE
• Guarantees line rate for high ingress/egress clients
IOPs clients @ 10GbE / 40GbE
• 100K+ IOPs/client @ 4K blocks
20x higher throughput, 4x higher IOPs with 40Gb Ethernet clients!
(http://www.mellanox.com/related-docs/whitepapers/WP_Deploying_Ceph_over_High_Performance_Networks.pdf)
Throughput testing: fio benchmark, 8M block, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. IOPs testing: fio benchmark, 4K block, 20GB file, 128 parallel jobs, RBD kernel driver, Linux kernel 3.13.3, RHEL 6.3, Ceph 0.72.2. (An approximate fio invocation is sketched below.)
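The two test shapes in the footnote can be approximated with fio against a kernel-mapped RBD device. A sketch, assuming /dev/rbd0 is the mapped image and a sequential-read pattern for the throughput case (both are assumptions, not stated on the slide):

    # Throughput test: 8M blocks, 20GB per job, 128 parallel jobs
    fio --name=rbd-throughput --filename=/dev/rbd0 --direct=1 \
        --ioengine=libaio --rw=read --bs=8m --size=20g \
        --numjobs=128 --group_reporting

    # IOPs test: same shape with 4K blocks and random reads
    fio --name=rbd-iops --filename=/dev/rbd0 --direct=1 \
        --ioengine=libaio --rw=randread --bs=4k --size=20g \
        --numjobs=128 --group_reporting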
[Diagram: Admin node and Ceph nodes (monitors, OSDs, MDS) connected over a 40GbE cluster network; client nodes reach the Ceph nodes over a 10GbE/40GbE public network.]
I/O Offload Frees Up CPU for Application Processing
[Chart: CPU time split between user space and system space. Without RDMA: ~53% CPU efficiency, ~47% CPU overhead/idle. With RDMA and offload: ~88% CPU efficiency, ~12% CPU overhead/idle.]
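A comparison like this can be reproduced by sampling CPU usage while the benchmark runs; a minimal sketch using the standard sysstat tool (interval and duration are arbitrary choices):

    # Sample CPU usage once per second for 60 seconds during a fio run;
    # %user + %system approximates the "CPU efficiency" share,
    # %idle (plus %iowait) the overhead/idle share.
    sar -u 1 60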
Ceph+RDMA: Coming Soon
Phase 1: RDMA (multi-transport) integration of XioMessenger, based on Accelio
• Basic integration complete, in upstream contribution process
• http://wiki.ceph.com/Planning/Blueprints/Hammer/Accelio_RDMA_Messenger
Eliminating the XioMessenger bottleneck uncovered several others…
• Increased focus on performance, with weekly community calls involving Red Hat, Cisco, Fujitsu, and others: http://pad.ceph.com/p/performance_weekly

                 Accelio  Native simple  Xio        Ceph  Ceph  Expected (ramdisk
                          messenger      messenger  TCP   RDMA  local perf)
BW Read (MB/s)   3200*    980**          3200*      950   1300  2534
BW Write (MB/s)  3200*    980**          3200*      435   500   2361

* Measured on a gen2 PCIe system; 5000 expected on gen3 systems
** Increased pipe depth to 10000 and 1000000
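For experimentation, the blueprint-era development branches exposed the new messenger through ceph.conf. A sketch, assuming the xio branch's "ms type" option (the option name is taken from the Accelio/XioMessenger blueprint work, not from this slide):

    # ceph.conf -- switch the messenger from the default TCP simple
    # messenger to the Accelio-based XioMessenger over RDMA
    # ('ms type = xio' is an assumption from the Hammer blueprint)
    [global]
        ms type = xio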
Commercial Solutions in the Market Today
ISS: Storage Super Core
• 82,000 IOPs on 512B reads
• 74,000 IOPs on 4KB reads
• 1.1 GB/sec on 256KB reads
http://www.iss-integration.com/supercore.html
OnyxCCS: ElectraStack
• Turnkey IaaS multi-tenant computing system
• 5x faster node/data restoration
https://www.onyxccs.com/products/8-series
Scalable Informatics: Unison
• Archive-like storage density with tier-1 performance
• 60 drives in a 4U chassis, all driven at full bandwidth
https://www.scalableinformatics.com/unison.html
Daystrom/StorageFoundry: Nautilus
• Massive-scale storage platform
• Single client reaching up to 2.2 GB/sec read throughput
• Easy integration with existing FC storage
http://storagefoundry.net/
Target verticals: healthcare, high performance storage, cloud, media/entertainment
Generally Available, Pre-configured, Pre-tuned Solutions leveraging 40/56Gb/s Cluster Networks
Summary
Ceph cluster scalability and availability rely on high performance networks
End-to-end 40/56Gb/s transport with full CPU offloads is available and being deployed
• 100Gb/s is around the corner
RDMA and more performance improvements coming soon
Thank You