
Accelerating Machine Learning with NVMe and NVMe-over-Fabrics

About Me

© 2019 E8 Storage, Proprietary and Confidential

Zivan Ori

CEO & Co-Founder, E8 Storage

Mr. Zivan Ori is the co-founder and CEO of E8 Storage. Before founding E8 Storage, Mr. Ori was IBM XIV R&D Manager, responsible for developing the IBM XIV high-end, grid-scale storage system, and served as Chief Architect at Stratoscale, a provider of hyper-converged infrastructure. Prior to IBM XIV, Mr. Ori headed Software Development at Envara (acquired by Intel) and served as VP R&D at Onigma (acquired by McAfee).

About E8 Storage

• Founded in November 2014 by storage industry veterans from IBM-XIV

• Leading NVMe over Fabrics certified solution in the market

• Backed by Tier-1 VCs Accel Partners, Magma Ventures & Vertex Ventures

• World-wide Team:
  • R&D in Tel-Aviv
  • Sales & marketing in Santa Clara, NY and London

• In production with customers in U.S. and Europe

• Awarded 10 patents (granted) + 4 pending for E8 architecture

• Flash Memory Summit 2016 & 2017 Most Innovative Product Award


The Problem (Part 1): Why not use local SSDs in servers?


• Local SSDs today achieve latency 10x faster than all-flash arrays

• “The DevOps Problem”
  • Things that work on laptops become 10x slower on the production infrastructure

• “The islands of storage problem”
  • Local SSDs in servers mean inefficient capacity utilization, no sharing of SSD data

• Local SSDs couple storage and compute
  • Server purchasing requires upfront investment in SSDs

[Figure: latency scale from 0.1ms (Local SSD) to 1ms (AFA), with "???" marking the gap in between]

The Problem (Part 2): Why not use SSDs in SAN/NAS?

• Traditional all-flash arrays (SAN/NAS) get 10%-20% of the potential performance of NVMe SSDs
  • Classic “scale-up” bottleneck

• Dual controller bottleneck
  • All I/O gated by controller CPU
  • Switching the SSDs from SAS to NVMe cannot alleviate the controller bottleneck


First gen architectures cannot unlock the full performance of NVMe

E8 Storage Unlocks the Performance of NVMe


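Read Latency (us) (@4K)
• AFA with 24 SSDs: 1000
• Single NVMe SSD: 100
• E8 24 NVMe SSDs: 120

IOPS (@4K read)
• AFA with 24 SSDs: 300K
• Single NVMe SSD: 750K
• E8 24 NVMe SSDs: 10M

Read/Write Bandwidth (GB/s)
• AFA with 24 SSDs: 2.4
• Single NVMe SSD: 3.1
• E8 24 NVMe SSDs: 40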
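As a rough cross-check of these three charts, the ratios between the E8 appliance and the 24-SSD AFA can be worked out directly. The sketch below assumes the chart values map to the legend in order (AFA with 24 SSDs, single NVMe SSD, E8 with 24 NVMe SSDs); the dictionary and function names are illustrative only and not part of the original slides.

```python
# Chart values as read from the slide, assuming legend order:
# (AFA with 24 SSDs, single NVMe SSD, E8 with 24 NVMe SSDs).
CHART = {
    "read_latency_us": (1000, 100, 120),
    "iops_4k_read": (300_000, 750_000, 10_000_000),
    "bandwidth_gb_s": (2.4, 3.1, 40),
}


def e8_vs_afa(metric: str) -> float:
    """Return how many times better E8 is than the AFA for one metric."""
    afa, _single_ssd, e8 = CHART[metric]
    # Lower is better for latency; higher is better for IOPS and bandwidth.
    if metric == "read_latency_us":
        return afa / e8
    return e8 / afa


for name in CHART:
    print(f"{name}: E8 is about {e8_vs_afa(name):.1f}x better than the AFA")
```

Under that reading, the shared appliance comes out roughly 8x lower in read latency, 33x higher in IOPS and 17x higher in bandwidth than the AFA, while staying within 20us of a single local NVMe SSD.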

See the Demo!

https://www.youtube.com/watch?v=JU6aDwMtd6c


Fastest Shared Block Storage in the World

• E8 holds the record in 2 audited storage benchmarks
  • 17x faster in STAC-M3 benchmark
  • 8x lower latency on average in SPECsfs benchmark

• The power of NVMe SSDs + RDMA networks
  • Previous submissions used tons of RAM

• More performance, less hardware
  • Shared NVMe allows hardware to be consolidated into a small footprint
  • E8's 2U appliance beat 10U and 18U appliances


• As of SPEC SFS®2014_swbuild results published August 2018. See all published results at https://www.spec.org/sfs2014/results/

• Of the published, audited results on https://stacresearch.com/ as of May 2018. Graphs show the 2 closest competitors for overall results.

[Figure: Best STAC-M3 Response Times (ms) for 100T.VWAB-12D-NO, 10T.VOLCURV, 1T.NBBO and 1T.WRITE.LAT2; E8 Storage vs. Competitor A and Competitor B; axis from 0 to 20000. 17x Faster!]

[Figure: SPECsfs Record Holder (IOPs + Latency). 8x lower latency!]


Designed for Availability and Reliability


• Host agents operate independently
  • Failure of one agent (or more) does not affect other agents
  • Access to shared storage is not impacted

• RAID data protection with virtual spare capacity

• Network multi-pathing with fast fail-over

• Enclosure high availability
  • Option 1: HA enclosure + dual-ported SSDs
  • Option 2: Cross-enclosure HA + single-ported SSDs

No single point of failure anywhere in the architecture

[Figure: Host Servers with E8 Host Agents]

Cost Comparison (*based on typical rack)


Save >40% of the Cost of SSDs, 20% of the Cost of the Rack

Before:
• 64 servers with 16TB NVMe = 1PB of SSDs
• $0.2/GB = $200K

After:
• RAID-10
• Over-provision 4:1
• 16*16TB SSDs in a dual-controller
• 0.5PB of SSDs = $100K + $8K enclosure

Local SSD utilization: 20%

Central SSD utilization: 80%

[Figure: SSD Cost and Rack Cost for Local NVMe vs. Dis-aggregated NVMe-oF; cost axis from $0 to $600,000]
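The SSD side of this comparison can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted on the slide; it assumes decimal units (1TB = 1000GB) and that RAID-10 mirroring halves raw capacity, it does not model the 4:1 over-provisioning or the rack cost, and all variable names are illustrative.

```python
USD_PER_TB = 200          # $0.2/GB, expressed per TB (1 TB = 1000 GB)
ENCLOSURE_USD = 8_000     # enclosure price quoted on the slide

# Before: local NVMe in every server.
local_raw_tb = 64 * 16                    # 1024 TB, rounded to 1PB on the slide
local_cost = local_raw_tb * USD_PER_TB    # 204,800 USD, rounded to $200K
local_used_tb = local_raw_tb * 0.20       # data actually stored at 20% utilization

# After: one shared pool of 0.5PB, mirrored with RAID-10.
shared_raw_tb = 500
shared_cost = shared_raw_tb * USD_PER_TB + ENCLOSURE_USD
shared_usable_tb = shared_raw_tb / 2      # assumption: RAID-10 halves raw capacity
shared_used_tb = shared_usable_tb * 0.80  # data stored at 80% utilization

saving = 1 - shared_cost / local_cost
print(f"local:  ${local_cost:,} for about {local_used_tb:.0f} TB of data")
print(f"shared: ${shared_cost:,} for about {shared_used_tb:.0f} TB of data")
print(f"SSD cost saving: {saving:.0%}")
```

Both configurations end up holding roughly 200TB of data, which is why halving the raw capacity is possible once utilization rises from 20% to 80%.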


E8 Storage Customers and Use-Cases


E8 Storage Customers: When Performance Matters


• Web-scale/IaaS
• Financials: 2 of the world’s Top-10 Largest Hedge Funds
• BioIT/HPC

Customer Use-Case: Market Data for Financials

Before
• 1152 local SSDs in 72 servers
• Market data copied nightly to all servers
• Restricted to 10TB-20TB

After
• 48 SSDs in 2 E8-D24 appliances
• Market data shared from E8 to all 72 servers
• Easily scalable to 300TB


In production with 2 of the world’s Top-10 largest hedge funds

SVP at Tier-1 Hedge Fund: “We have been using E8 for a year and have more than 10 boxes. A single box achieves 40GB/s reads and large block writes. For an all-flash tier, it is just a beast.”

Shared NVMe reduced the number of replicas needed by 72X

70% Cost reduction!
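The consolidation figures in this use-case follow from the before and after numbers. A small sketch, assuming that "copied nightly to all servers" means one full copy of the market data per server and that the shared setup keeps a single copy (variable names are illustrative):

```python
servers = 72
local_ssds = 1152
shared_ssds = 48
appliances = 2

ssds_per_server = local_ssds // servers          # SSDs installed in each server before
ssds_per_appliance = shared_ssds // appliances   # SSDs in each E8-D24 appliance after
ssd_reduction = local_ssds // shared_ssds        # how many times fewer SSDs overall

copies_before = servers   # nightly copy to all servers: one replica each
copies_after = 1          # assumption: a single shared copy served to all servers
replica_reduction = copies_before // copies_after

print(f"{ssds_per_server} SSDs per server before, {ssds_per_appliance} per appliance after")
print(f"{ssd_reduction}x fewer SSDs, {replica_reduction}x fewer replicas")
```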

Genomic Acceleration with E8 Storage

"We were keen to test E8 by trying to integrate it with our Univa Grid Engine cluster as a consumable resource of ultra-performance scratch space. Following some simple tuning and using a single EDR link we were able to achieve about 5GB/s from one node and 1.5M 4k IOPS from one node. Using the E8 API we were quickly able to write a simple Grid Engine prolog/epilog that allowed for a user-requestable scratch volume to be automatically created and destroyed by a job. The E8 box behaved flawlessly and the integration with InfiniBand was simpler than we could have possibly expected for such a new product."

- Dr. Robert Esnouf, Director of Research Computing, Oxford Big Data Institute + Wellcome Center for Human Genetics
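The prolog/epilog pattern described in the quote can be sketched as a pair of hooks that create a per-job scratch volume before the job starts and destroy it afterwards. The E8 API itself is not shown in this deck, so the sketch below uses a purely hypothetical in-memory backend (`InMemoryVolumeBackend`, `create_volume`, `delete_volume`) as a stand-in; none of these names come from E8 or from Grid Engine.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryVolumeBackend:
    """Hypothetical stand-in for a storage management API (not the E8 API)."""

    volumes: dict[str, int] = field(default_factory=dict)

    def create_volume(self, name: str, size_gb: int) -> None:
        if name in self.volumes:
            raise ValueError(f"volume {name!r} already exists")
        self.volumes[name] = size_gb

    def delete_volume(self, name: str) -> None:
        # Deleting a missing volume is a no-op so the epilog can be re-run.
        self.volumes.pop(name, None)


def scratch_name(job_id: str) -> str:
    """Derive a per-job volume name so concurrent jobs never collide."""
    return f"scratch-{job_id}"


def prolog(backend: InMemoryVolumeBackend, job_id: str, size_gb: int) -> str:
    """Runs before the job: create the scratch volume the user requested."""
    name = scratch_name(job_id)
    backend.create_volume(name, size_gb)
    return name


def epilog(backend: InMemoryVolumeBackend, job_id: str) -> None:
    """Runs after the job: destroy the job's scratch volume."""
    backend.delete_volume(scratch_name(job_id))


backend = InMemoryVolumeBackend()
volume = prolog(backend, job_id="12345", size_gb=500)
print("created", volume)
epilog(backend, job_id="12345")
print("remaining volumes:", backend.volumes)
```

In a real deployment the job ID and requested size would come from the scheduler, and the backend calls would be replaced by whatever the storage vendor's API provides.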


Shared NVMe as a fast tier for parallelizing genomic processing

From 10 hours per genome to 1 hour for 10 genomes!

E8 for AI/ML with IBM GPFS and Nvidia

• A GPU cluster requires 0.5PB-1PB of shared fast storage

• But GPU servers have no real estate for local SSDs…

• E8 provides concurrent access for 1000 (!) GPUs in cluster

• 10x Performance of Pure Storage FlashBlade

• 4x Performance of IBM ESS SSD Appliances, for half the cost


Shared NVMe Accelerates Training for Image Recognition

[Figure: Cost ($/GBu) for Pure Storage FlashBlade, IBM GPFS + ESS, and E8 + GPFS]

GPU Farm: Nvidia DGX-1
• Up to 8 GPUs per node
• GPFS Client + E8 Agent run on x86 within GPU Server
• Up to 126 GPU nodes in cluster

Mellanox 100G IB

[Figure: Images per second, per GPU node (ResNet-50 Image Recognition Training), at 1 GPU node, 10 GPU nodes and 100 GPU nodes, for Pure Storage, IBM GPFS + ESS, and E8+GPFS; axis from 0 to 3000]

Shared NVMe Storage
• E8 D24 2U24-HA
• Dual-port 2.5” NVMe Drives
• Up to 307TB NAND per 2U
• Up to 36TB Optane per 2U
• E8 Patented Distributed RAID6
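The cluster figures on this slide can be tied together with a quick sizing sketch. It assumes decimal units (1PB = 1000TB) and counts raw NAND capacity only, ignoring RAID6 overhead and spare capacity, so real deployments would need more headroom; the function name is illustrative.

```python
import math

GPUS_PER_NODE = 8          # up to 8 GPUs per DGX-1 node
MAX_GPU_NODES = 126        # up to 126 GPU nodes in the cluster
RAW_TB_PER_APPLIANCE = 307 # up to 307TB NAND per 2U appliance


def appliances_for(capacity_tb: int) -> int:
    """Smallest number of 2U appliances whose raw NAND covers capacity_tb."""
    return math.ceil(capacity_tb / RAW_TB_PER_APPLIANCE)


total_gpus = GPUS_PER_NODE * MAX_GPU_NODES
print(f"GPUs in a full cluster: {total_gpus}")
for need_tb in (500, 1000):  # the 0.5PB-1PB range quoted for a GPU cluster
    print(f"{need_tb} TB needs at least {appliances_for(need_tb)} appliances (raw)")
```

A full cluster works out to 1008 GPUs, which matches the "1000 (!) GPUs" claim, and the 0.5PB-1PB requirement fits in two to four 2U appliances before data protection overhead.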


Centralized Storage Reliability

Hyper-scalability

Affordable 100% COTS

PCIe SSD Performance
