
Accelerating Machine Learning with NVMe and NVMe-over-Fabrics

About Me


Zivan Ori

CEO & Co-Founder, E8 Storage

Mr. Zivan Ori is the co-founder and CEO of E8 Storage. Before founding E8 Storage, Mr. Ori held the position of IBM XIV R&D Manager, being responsible for developing the IBM XIV high-end, grid-scale storage system, and served as Chief Architect at Stratoscale, a provider of hyper-converged infrastructure. Prior to IBM XIV, Mr. Ori headed Software Development at Envara (acquired by Intel) and served as VP R&D at Onigma (acquired by McAfee).

About E8 Storage

• Founded in November 2014 by storage industry veterans from IBM-XIV

• Leading NVMe over Fabrics certified solution in the market

• Backed by Tier-1 VCs Accel Partners, Magma Ventures & Vertex Ventures

• World-wide team:
  • R&D in Tel-Aviv
  • Sales & marketing in Santa Clara, NY and London

• In production with customers in U.S. and Europe

• Awarded 10 patents (granted) + 4 pending for E8 architecture

• Flash Memory Summit 2016 & 2017 Most Innovative Product Award


The Problem (Part 1): Why not use local SSDs in servers?


• Local SSDs today deliver roughly 10x lower latency than all-flash arrays

• “The DevOps problem”: things that work on laptops become 10x slower on the production infrastructure

• “The islands of storage problem”: local SSDs in servers mean inefficient capacity utilization and no sharing of SSD data

• Local SSDs couple storage and compute: server purchasing requires upfront investment in SSDs

[Chart: latency comparison, local SSD ~0.1ms vs. all-flash array (AFA) ~1ms]

The Problem (Part 2): Why not use SSDs in SAN/NAS?

• Traditional all-flash arrays (SAN/NAS) get 10%-20% of the potential performance of NVMe SSDs
  • Classic “scale-up” bottleneck

• Dual-controller bottleneck
  • All I/O gated by the controller CPU
  • Switching the SSDs from SAS to NVMe cannot alleviate the controller bottleneck


First gen architectures cannot unlock the full performance of NVMe

E8 Storage Unlocks the Performance of NVMe


• Read latency (@4K): AFA with 24 SSDs 1000us, single NVMe SSD 100us, E8 with 24 NVMe SSDs 120us

• IOPS (@4K read): AFA with 24 SSDs 300K, single NVMe SSD 750K, E8 with 24 NVMe SSDs 10M

• Read/write bandwidth: AFA with 24 SSDs 2.4GB/s, single NVMe SSD 3.1GB/s, E8 with 24 NVMe SSDs 40GB/s

See the Demo!

Fastest Shared Block Storage in the World

• E8 holds the record in 2 audited storage benchmarks
  • 17x faster in the STAC-M3 benchmark
  • 8x lower latency on average in the SPECsfs benchmark

• The power of NVMe SSDs + RDMA networks
  • Previous submissions used tons of RAM

• More performance, less hardware
  • Shared NVMe allows hardware to be consolidated into a small footprint
  • E8 with a 2U appliance beat 10U and 18U appliances


• As of SPEC SFS®2014_swbuild results published August 2018. See all published results at https://www.spec.org/sfs2014/results/

• Of the published, audited results on https://stacresearch.com/ as of May 2018. Graphs show the 2 closest competitors for overall results.

[Chart: Best STAC-M3 response times (ms) for the 100T.VWAB-12D-NO, 10T.VOLCURV, 1T.NBBO and 1T.WRITE.LAT2 benchmarks, E8 Storage vs. Competitor A and Competitor B; E8 is 17x faster]

[Chart: SPECsfs record holder (IOPS + latency); 8x lower latency]

Designed for Availability and Reliability


• Host agents operate independently
  • Failure of one (or more) agents does not affect other agents
  • Access to shared storage is not impacted

• RAID data protection with virtual spare capacity

• Network multi-pathing with fast fail-over

• Enclosure high availability
  • Option 1: HA enclosure + dual-ported SSDs
  • Option 2: Cross-enclosure HA + single-ported SSDs

No single point of failure anywhere in the architecture.

[Diagram: host servers with E8 host agents]

Cost Comparison (based on a typical rack)


Save >40% of the Cost of SSDs, 20% of the Cost of the Rack

Before (local NVMe):
• 64 servers with 16TB NVMe = 1PB of SSDs
• $0.2/GB = $200K
• Local SSD utilization: 20%

After (dis-aggregated NVMe-oF):
• RAID-10, over-provision 4:1
• 16 x 16TB SSDs in a dual-controller enclosure
• 0.5PB of SSDs = $100K + $8K enclosure
• Central SSD utilization: 80%

[Chart: SSD cost and rack cost, $0-$600,000, local NVMe vs. dis-aggregated NVMe-oF; a rough recalculation of these figures is sketched below]
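The slide does not spell out the arithmetic behind the ">40%" SSD saving, so here is a minimal back-of-the-envelope check using only the figures quoted above ($0.2/GB pricing, 64 x 16TB local drives before, 0.5PB of shared SSDs plus an $8K enclosure after). The breakdown of rack cost is not recomputed here.

```python
# Back-of-the-envelope check of the cost comparison above.
# All inputs are taken from the slide's bullets; nothing else is assumed.

TB = 1000  # GB per TB (decimal, as SSD capacities are usually quoted)

# Before: local NVMe in every server
local_capacity_gb = 64 * 16 * TB          # 64 servers x 16TB each = ~1PB
local_ssd_cost = local_capacity_gb * 0.2  # $0.2/GB -> ~$200K

# After: dis-aggregated NVMe-oF behind a shared enclosure
shared_capacity_gb = 500 * TB                     # 0.5PB of shared SSDs
shared_cost = shared_capacity_gb * 0.2 + 8_000    # ~$100K SSDs + $8K enclosure

print(f"Local NVMe SSD cost: ${local_ssd_cost:,.0f}")   # ~$204,800
print(f"Dis-aggregated cost: ${shared_cost:,.0f}")      # ~$108,000
print(f"SSD cost saving:     {1 - shared_cost / local_ssd_cost:.0%}")
```

Running this reproduces the roughly $200K vs. ~$108K split and a saving of about 47%, consistent with the ">40% of the cost of SSDs" claim.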


E8 Storage Customers and Use-Cases


E8 Storage Customers: When Performance Matters


Web-scale/IaaS Financials BioIT/HPC

2 of the world’s Top-10 Largest Hedge Funds

Customer Use-Case: Market Data for Financials

Before:
• 1152 local SSDs in 72 servers
• Market data copied nightly to all servers
• Restricted to 10TB-20TB

After:
• 48 SSDs in 2 E8-D24 appliances
• Market data shared from E8 to all 72 servers
• Easily scalable to 300TB


In production with 2 of the world’s Top-10 largest hedge funds

SVP at a Tier-1 hedge fund: “We have been using E8 for a year and have more than 10 boxes. A single box achieves 40GB/s reads and large block writes. For an all-flash tier, it is just a beast.”

Shared NVMe reduced the number of replicas needed by 72x, a 70% cost reduction.


Genomic Acceleration with E8 Storage

"We were keen to test E8 by trying to integrate it with our Univa Grid Engine cluster as

a consumable resource of ultra-performance scratch space. Following some simple

tuning and using a single EDR link we were able to achieve about 5GB/s from one

node and 1.5M 4k IOPS from one node. Using the E8 API we were quickly able to write

a simple Grid Engine prolog/epilog that allowed for a user-requestable scratch volume

to be automatically created and destroyed by a job. The E8 box behaved flawlessly and

the integration with InfiniBand was simpler than we could have possibly expected for

such a new product."

- Dr. Robert Esnouf, Director of Research Computing

Oxford Big Data Institute +

Wellcome Center for Human Genetics


Shared NVMe as a fast tier for parallelizing genomic processing

From 10 hours per genome to 1 hour for 10 genomes!
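To make the quoted prolog/epilog integration concrete, here is a minimal sketch of what a Grid Engine prolog creating per-job scratch space might look like. It is an illustration only: the REST endpoint, payload fields, returned device name and mount path (e8-controller.example.com, /volumes, /e8_scratch/<job_id>) are hypothetical placeholders, not the actual E8 Storage API, and a production prolog would add error handling and site-specific policy.

```python
#!/usr/bin/env python3
"""Hypothetical Grid Engine prolog: create a per-job scratch volume.

Sketch only -- the REST endpoint and fields below are assumptions, not the
real E8 API. A matching epilog would unmount and delete the volume so the
capacity returns to the shared pool when the job exits.
"""
import os
import subprocess
import requests  # assumes the storage controller exposes a REST API

E8_API = "https://e8-controller.example.com/api/v1"   # hypothetical endpoint
JOB_ID = os.environ["JOB_ID"]                         # set by Grid Engine
SIZE_GB = int(os.environ.get("SCRATCH_GB", "500"))    # user-requested size

def main() -> None:
    # 1. Ask the storage controller for a new volume sized for this job.
    resp = requests.post(
        f"{E8_API}/volumes",
        json={"name": f"scratch-{JOB_ID}", "size_gb": SIZE_GB},
        timeout=30,
    )
    resp.raise_for_status()
    device = resp.json()["device"]       # e.g. a block device path (assumed field)

    # 2. Make a filesystem and mount it where the job expects scratch space.
    mountpoint = f"/e8_scratch/{JOB_ID}"
    os.makedirs(mountpoint, exist_ok=True)
    subprocess.run(["mkfs.xfs", "-q", device], check=True)
    subprocess.run(["mount", device, mountpoint], check=True)

if __name__ == "__main__":
    main()
```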

E8 for AI/ML with IBM GPFS and Nvidia

• A GPU cluster requires 0.5PB-1PB of shared fast storage

• But GPU servers have no real estate for local SSDs…

• E8 provides concurrent access for 1000 (!) GPUs in a cluster

• 10x Performance of Pure Storage FlashBlade

• 4x Performance of IBM ESS SSD Appliances, for half the cost


Shared NVMe Accelerates Training for Image Recognition

[Chart: cost ($/GBu), Pure Storage FlashBlade vs. IBM GPFS + ESS vs. E8 + GPFS]

GPU farm: Nvidia DGX-1
• Up to 8 GPUs per node
• GPFS client + E8 agent run on x86 within the GPU server
• Up to 126 GPU nodes in a cluster
• Mellanox 100G IB interconnect

[Chart: images per second, per GPU node (ResNet-50 image recognition training) at 1, 10 and 100 GPU nodes, comparing Pure Storage, IBM GPFS + ESS and E8 + GPFS]
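The chart reports per-node training throughput, which in this setup is gated by how fast each node can pull images off shared storage. As a rough illustration of the storage side of such a measurement, the sketch below times reads from a shared mount; the path /gpfs/imagenet/train is a hypothetical placeholder for the E8 + GPFS volume, and a real benchmark runs the full ResNet-50 training loop on the GPUs rather than just the read path.

```python
"""Minimal read-throughput probe for a shared training-data mount.

Illustration only: /gpfs/imagenet/train is an assumed mount point for the
shared E8/GPFS volume, not a path from the presentation.
"""
import time
from pathlib import Path

DATASET = Path("/gpfs/imagenet/train")   # hypothetical shared GPFS mount
SAMPLE = 10_000                          # number of image files to read

def main() -> None:
    files = list(DATASET.rglob("*.JPEG"))[:SAMPLE]
    start = time.perf_counter()
    total_bytes = 0
    for path in files:
        total_bytes += len(path.read_bytes())   # pull each image off storage
    elapsed = time.perf_counter() - start
    print(f"{len(files) / elapsed:,.0f} images/s, "
          f"{total_bytes / elapsed / 1e9:.2f} GB/s from {DATASET}")

if __name__ == "__main__":
    main()
```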

Shared NVMe storage: E8 D24 2U24-HA
• Dual-port 2.5” NVMe drives
• Up to 307TB NAND per 2U
• Up to 36TB Optane per 2U
• E8 patented distributed RAID-6


• Centralized storage reliability
• Hyper-scalability
• Affordable: 100% COTS
• PCIe SSD performance
