IME Storage System
MSST – 2019, May 20th
Paul Nowoczynski <pnowoczynski@ddn.com>
Jean-Yves Vet <jyvet@ddn.com>
Courtesy of Jean-Thomas Acquaviva, Kenneth de Mello, Pharthiphan Asokan, Jean-François Le Fillâtre, James Coomer, John Bent, Lokesh Jaliminche
Outline

Introduction
How IME Works
Cost Efficiency
Manageability
Performance
Use Cases
Fault Tolerance
Q&A
Introduction: What's IME (Infinite Memory Engine)? (1/3)

Traditional architecture:
► Compute nodes run diverse, high-concurrency applications.
► Applications issue IO directly to the parallel scratch file system (persistent data on disk).
► Application IO can be random and unaligned, which is not ideal for peak efficiency: the parallel file system does not operate at peak efficiency.
Introduction: What's IME (Infinite Memory Engine)? (2/3)

IME's Active I/O Tier is inserted right between compute and the parallel file system. IME software intelligently virtualizes disparate NVMe SSDs into a single pool of shared memory that accelerates I/O, the PFS and applications.
► Scale-out flash cache layer using NVMe SSDs, inserted between the compute cluster and the Parallel File System (PFS):
• IME is configured as a cluster of multiple NVMe servers
• All compute nodes can access cached data on IME
► Accelerates difficult IO patterns (small/random/shared-file/high-concurrency) thanks to a thin software IO management layer
► Configured as a scale-out massive cache layer with huge IO bandwidth and IOPS
Introduction: What's IME (Infinite Memory Engine)? (3/3)

A fast data tier (NVM & SSD) sits in front of the persistent data (disk); compute nodes run diverse, high-concurrency applications. The write path:
1. The application issues IO to the IME client.
2. Erasure coding is applied; the IME client sends fragments to the IME servers.
3. The IME servers write buffers to NVM and manage internal metadata.
4. The IME servers write aligned, sequential I/O to the SFA backend.
5. The parallel Backing File System (BFS) operates at maximum efficiency.

► New system: no need to oversize the PFS to deliver large bandwidth
► Accelerates an existing PFS (Lustre, Spectrum Scale, NFS)
Introduction: IO500 - Largest B/W (1/2)

Largest bandwidth; relies on the PFS for metadata.
► POSIX results
► Largest IOR scores (bulk data bandwidth)
► Metadata are forwarded to the BFS
Introduction: IO500 - Largest B/W (2/2)

► Example: KISTI
− Theoretical peak: 96x EDR links ≈ 1 TB/s
− Almost no gap between:
■ Easy read / easy write (large IOs, file per process)
■ Easy / hard write (47008-byte IOs, single shared file)
− Hard read remains challenging
Introduction: Bottom line

IME is a fully distributed, resilient, transparent and flash-native IO accelerator. It uses an automated tiering system to write data (8MB chunks) back to the BFS (Lustre, Spectrum Scale or NFS).
How IME works
How IME works: Namespace - Delegated to the Backing File System

The IME servers are clients of the BFS.
► Stand-alone; no modification of the BFS is required
► Can be a subdirectory of a mount; it doesn't need to be at the root
• The subdirectory name contains a UUID identifying the BFS
► Metadata operations (create, delete, stat, …) are forwarded to the BFS
• Adds a hop, but in some cases yields higher metadata rates (fewer client nodes than IME servers)
• IME servers use the open_by_handle_at() and name_to_handle_at() system calls (see the sketch below)
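These are the Linux handle-based syscalls; below is a minimal sketch (ours, not IME source code) of how a path on the BFS can be resolved to a persistent handle once and reopened later without another path lookup. Note that open_by_handle_at() requires the CAP_DAC_READ_SEARCH capability, and mount_fd must refer to a file descriptor on the BFS mount.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Resolve a BFS path to a persistent handle once, then reopen it
     * later without another path lookup (sketch; error handling trimmed). */
    int reopen_via_handle(const char *path, int mount_fd)
    {
        struct file_handle *fh = malloc(sizeof(*fh));
        int mount_id;

        fh->handle_bytes = 0;
        /* First call fails with EOVERFLOW but fills in the required size. */
        name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0);
        fh = realloc(fh, sizeof(*fh) + fh->handle_bytes);

        if (name_to_handle_at(AT_FDCWD, path, fh, &mount_id, 0) < 0) {
            perror("name_to_handle_at");
            free(fh);
            return -1;
        }

        /* Later, possibly in another process: reopen directly by handle. */
        int fd = open_by_handle_at(mount_fd, fh, O_RDONLY);
        free(fh);
        return fd;
    }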
How IME works: IME is Log Structured (1/2)

► Writes changes only (byte addressable)
► Consecutive changes are split into chunks (max 128KB) called fragments
► Fragments remove implicit locking: performance is agnostic to file sharing
► Flash native (reduced write amplification on NAND devices)

[Figure: file offset vs. time; each write is recorded as a fragment, and fragments accumulate over time. A sketch of a fragment record follows.]
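As an illustration of what a per-fragment log record must carry, here is a hypothetical C struct; the field names are our assumptions, not IME's actual on-disk format.

    #include <stdint.h>

    /* Illustrative only: one log record per write fragment (max 128KB). */
    struct ime_fragment {
        uint64_t file_id;      /* which file the bytes belong to          */
        uint64_t file_offset;  /* byte offset within the file             */
        uint32_t length;       /* <= 128 * 1024                           */
        uint64_t seqno;        /* logical timestamp: later fragments win  */
        uint64_t nvme_block;   /* where the payload sits on the device    */
    };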
How IME works: IME is Log Structured (2/2)

► During reads, fragments are virtually rendered to determine how to reconstruct the data from the fragments (a rendering sketch follows).

[Figure: the accumulation of fragments (file offset vs. time) is flattened into a rendered view; ranges not covered by any fragment are holes, which are read from the BFS]
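A minimal sketch of such rendering, reusing the hypothetical ime_fragment struct above: fragments are replayed in time order so that newer bytes overwrite older ones, and any byte not covered by a fragment is a hole served by the BFS. fetch_from_nvme() is a hypothetical helper.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical helper: copy `len` payload bytes of fragment `f`,
     * starting at `frag_off` within the fragment, into `dst`. */
    void fetch_from_nvme(char *dst, const struct ime_fragment *f,
                         uint64_t frag_off, uint64_t len);

    /* Render a read range by replaying fragments oldest-first. */
    void render(char *dst, uint8_t *covered, uint64_t read_off,
                uint64_t read_len, const struct ime_fragment *frags,
                size_t nfrags)
    {
        memset(covered, 0, read_len);          /* 0 = hole, 1 = data in IME */
        for (size_t i = 0; i < nfrags; i++) {  /* frags sorted by seqno */
            uint64_t beg = frags[i].file_offset;
            uint64_t end = beg + frags[i].length;
            if (end <= read_off || beg >= read_off + read_len)
                continue;                      /* no overlap with the read */
            uint64_t lo = beg > read_off ? beg : read_off;
            uint64_t hi = end < read_off + read_len ? end : read_off + read_len;
            fetch_from_nvme(dst + (lo - read_off), &frags[i], lo - beg, hi - lo);
            memset(covered + (lo - read_off), 1, hi - lo);
        }
        /* Remaining covered[i] == 0 ranges must be read from the BFS. */
    }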
How IME works: IME Metadata (1/3)

Purpose of IME metadata:
► Describe each data fragment
► Know the state of the data fragments
► Know on which data block, on which NVMe device, the fragments are located
► Know which blocks (on other servers) are needed for reconstructions

Each server contains one (or two) device(s) for logs:
► The logs are kept on a separate NVMe device called the commit log device
► Metadata are in RAM, reconstructed at server boot from the commit log device (see the sketch below)
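A sketch of that boot-time rebuild: the in-RAM index is repopulated by replaying fixed-size records from the commit log device. Both the record layout (the hypothetical ime_fragment above) and insert_fragment() are our inventions for illustration.

    #include <unistd.h>

    void insert_fragment(const struct ime_fragment *rec); /* hypothetical */

    /* Replay the commit log device to rebuild the in-RAM metadata. */
    void rebuild_metadata(int commitlog_fd)
    {
        struct ime_fragment rec;
        while (read(commitlog_fd, &rec, sizeof(rec)) == sizeof(rec))
            insert_fragment(&rec);
    }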
How IME works: IME Metadata (2/3)

IME metadata are distributed:
► Partial metadata view in RAM on each server
► All metadata accessible to all servers through a Distributed Hash Table (DHT)
► The hash function provides a tuple containing the ids of the peer servers, e.g.:

File1 → peers 2,3,1,4
File3 → peers 1,2,4,3
File4 → peers 4,1,3,2
File6 → peers 3,4,2,1

A sketch of such a hash-to-tuple mapping follows.
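One way such a mapping can work, as a sketch (IME's actual hash is not published here): derive a deterministic permutation of server ids from the file identifier, so every client computes the same tuple without coordination. Entry 0 would be the primary; the following entries the DHT-copy peers.

    #include <stdint.h>

    /* Cheap 64-bit hash (splitmix64 finalizer). */
    static uint64_t mix(uint64_t x)
    {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }

    /* Fill `tuple` with a deterministic permutation of 0..nservers-1
     * seeded by the file id (seeded Fisher-Yates shuffle). */
    void peer_tuple(uint64_t file_id, int nservers, int *tuple)
    {
        for (int i = 0; i < nservers; i++)
            tuple[i] = i;
        for (int i = nservers - 1; i > 0; i--) {
            int j = (int)(mix(file_id + (uint64_t)i) % (uint64_t)(i + 1));
            int t = tuple[i]; tuple[i] = tuple[j]; tuple[j] = t;
        }
    }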
How IME works: IME Metadata (3/3)

► Network parallelism: the hash function assigns the metadata of each file to a server
► Node-level fault tolerance: the hash tuple provides the list of servers holding copies (dhtcopies in the IME configuration file defines the number of copies)
► Self-optimising for noisy fabrics: a client may fall back to the backup servers listed in the hash tuple
How IME works: Bulk data - Mapping to IME servers

► Files are virtually split into 8MB chunks (buckets); each chunk is assigned to a server (see the sketch below).

[Figure: file offset vs. time; fragments fall into Bucket 1 (IME server N), Bucket 2 (IME server N+x), Bucket 3 (IME server N+y)]
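A sketch of the bucket-to-server mapping under the same assumptions, reusing mix() and peer_tuple() from the DHT sketch above; hashing the bucket index spreads consecutive 8MB chunks across servers, which matches the N, N+x, N+y picture.

    #define BUCKET_SIZE (8ULL * 1024 * 1024)   /* 8MB buckets */

    /* Which server owns the bucket containing `offset` of file `file_id`? */
    int server_for_offset(uint64_t file_id, uint64_t offset, int nservers)
    {
        int tuple[64];                         /* assumes nservers <= 64 */
        uint64_t bucket = offset / BUCKET_SIZE;
        peer_tuple(mix(file_id) ^ mix(bucket), nservers, tuple);
        return tuple[0];                       /* primary owner */
    }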
How IME works: Bulk data - Sending fragments to servers

Aggregating IOs (write):
► An RDMA buffer from a client node to a server contains 8 sub-buffers; the "preferred" network write IO is 1MB.
► A sub-buffer (128KB) contains either a parity block or data fragments of the same file; each sub-buffer is written to one NVMe device.

A layout sketch follows.
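The arithmetic is 8 x 128KB = 1MB. A hypothetical C layout of that RDMA buffer (names and fields are ours, for illustration only):

    #include <stdint.h>

    #define SUBBUF_SIZE       (128 * 1024)  /* one NVMe write unit      */
    #define SUBBUFS_PER_RDMA  8             /* 8 x 128KB = 1MB transfer */

    struct subbuf {
        uint8_t  data[SUBBUF_SIZE];
        uint64_t file_id;     /* fragments of this single file, or...   */
        int      is_parity;   /* ...a parity block for the parity group */
    };

    struct rdma_write {                     /* "preferred" 1MB network IO */
        struct subbuf sub[SUBBUFS_PER_RDMA];
    };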
How IME works: Bulk data - Reading data

Client nodes know which server(s) to query to retrieve data.

Client prefetcher:
► Client nodes enable a prefetch engine when a pattern is detected in the data accesses (a detection sketch follows)
► Works on strided IOs

On a read, server N flattens the fragments, retrieves the data via local or remote NVMe device(s) or the BFS, and prepares the RDMA to the client node.
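A sketch of the kind of strided-pattern detection a client prefetcher might use; this is illustrative, not IME's actual heuristic.

    #include <stdint.h>

    /* Return 1 and the stride if the last n read offsets are equally
     * spaced; the next read is then expected at offs[n-1] + stride. */
    int detect_stride(const uint64_t *offs, int n, int64_t *stride_out)
    {
        if (n < 3)
            return 0;
        int64_t stride = (int64_t)(offs[1] - offs[0]);
        for (int i = 2; i < n; i++)
            if ((int64_t)(offs[i] - offs[i - 1]) != stride)
                return 0;              /* not strided: no prefetch */
        *stride_out = stride;
        return 1;
    }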
How IME works: Resilient to SSD Failures (1/4)

[Figure: the file cache across IME servers 0..N holds buffers protected with varying erasure-coding schemes, e.g. 8+3, 8+1, 6+0, 8+2, 6+1, 4+1, 8+0, 1+1]

► IME supports multiple resilience levels through flexible, adaptive erasure coding.
► System-wide default up to 15+3.
► Runtime flexible: applications can override defaults and select a specific erasure coding scheme.

Erasure coding options: none, or any N+K with N from 1 to 15 and K from 1 to 3 (1+1, 1+2, 1+3, 2+1, 2+2, 2+3, … 15+1, 15+2, 15+3).
How IME works: Resilient to SSD Failures (2/4)

IME implements erasure coding across servers:
► Allows for the loss of data drives as well as entire servers
► Clients do all the erasure-coding computations
• Except for data recovery after a device loss, which is done between servers

Clients work with parity groups (PG):
► A PG is made of D+P buffers (D data + P parity; also written M+N or N+K)

Parity geometry can be defined in several ways (a precedence sketch follows):
► Globally, in ime.conf → def_pgeom = N+K
► For one FUSE client, in ime-fuse.conf → --pgeom=N+K
► For one native IME client application → IM_CLIENT_PGEOM=N+K
► The client options take precedence over the global default value
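A minimal sketch of that precedence for a native client, assuming the IM_CLIENT_PGEOM variable carries an "N+K" string as shown on the slide; the parsing and function name are ours.

    #include <stdio.h>
    #include <stdlib.h>

    /* Client env var wins; otherwise fall back to def_pgeom from ime.conf. */
    void effective_pgeom(int conf_n, int conf_k, int *n, int *k)
    {
        const char *env = getenv("IM_CLIENT_PGEOM");
        if (env && sscanf(env, "%d+%d", n, k) == 2)
            return;
        *n = conf_n;
        *k = conf_k;
    }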
How IME works: Resilient to SSD Failures (3/4)

What happens if there isn't enough client data to fill all the buffers?
► The parity geometry adapts (see the sketch below)
► The number of parity buffers stays the same
► The number of data buffers drops below the requested geometry
► Example: a 4+2 default may result in 3+2, 2+2 or 1+2
► The recovery guarantee is always respected
• 4+2 cannot degrade to 3+1

What's the recovery model of IME?
► After a drive loss, the buffers that belonged to PGs on that drive are reconstructed automatically on other drives of the same server.
• As all other buffers from those PGs are on other servers, a single server can lose many drives without losing parity-protected data.
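The adaptation rule reduces to a couple of lines; here is a sketch (our naming) that makes the guarantee explicit:

    /* Keep the parity count, shrink only the data count, so a 4+2 write
     * may go out as 3+2, 2+2 or 1+2, but never as 3+1. */
    void adapt_pgeom(int avail_data_bufs, int req_d, int req_p,
                     int *eff_d, int *eff_p)
    {
        *eff_d = avail_data_bufs < req_d ? avail_data_bufs : req_d;
        *eff_p = req_p;   /* recovery guarantee preserved */
    }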
How IME works: Resilient to SSD Failures (4/4)
How IME works: Recovery Models - Recap

Erasure coding with PG (IME data):
► Allows NVMe drive reconstruction and node-level fault tolerance (bulk data)
► def_pgeom = N+K defines the parity geometry in the IME configuration file

DHT copies (IME metadata):
► Allows node-level fault tolerance (IME metadata)
► dhtcopies defines the number of copies in the IME configuration file
How IME works: Automated tiering (1/2)

► IME IO operations (a state-transition sketch follows this list)
• Reading: IME clients reading data
o If the data exist in IME, they are read from IME
o If the data don't exist in IME, they are read from the BFS
o Auto-prestage can be enabled in IME 1.3
• Prestaging
o Preloading data from the BFS to IME
o Prestaged data are clean
• Writing
o New or modified data are written to IME, regardless of their original location
o Data written to IME are dirty
• Releasing
o Data are marked as deletable and then freed from IME, regardless of their state
• Syncing
o Modified data are copied back to the BFS, automatically or per user request
o Data waiting for sync are pending; they become clean again once synced
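The operations above imply a small state machine over the data states described on the next slide (clean, dirty, pending, with release possible from any state). A sketch, with event names of our choosing:

    #include <string.h>

    enum ime_state { CLEAN, DIRTY, PENDING };

    enum ime_state on_event(enum ime_state s, const char *ev)
    {
        if (!strcmp(ev, "prestage")) return CLEAN;   /* BFS -> IME copy   */
        if (!strcmp(ev, "write"))    return DIRTY;   /* new/modified data */
        if (!strcmp(ev, "sync")   && s == DIRTY)   return PENDING;
        if (!strcmp(ev, "synced") && s == PENDING) return CLEAN;
        return s;   /* releasing frees data from IME regardless of state */
    }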
How IME works: Automated tiering (2/2)

► States of data in IME
• Clean: data in the IME cache have an exact copy on the BFS; they have either been synchronized or prestaged.
• Dirty: new or modified data in IME without an exact copy on the BFS.
• Pending: dirty data awaiting synchronization to the BFS.
• Deletable: data that may be freed from IME. Data can be manually marked as deletable regardless of their state.
► Read coherency
• Clean and dirty data are read from IME.
• Non-cached data are retrieved from the BFS.
• /!\ Overwriting directly into the BFS is currently not supported. IME does not track changes on the BFS: if data are clean or dirty in IME, they should not be modified on the BFS.
o IME 1.3.x will support "mtime change" detection so that clean data may be automatically evicted from IME.
How IME works: IO Management - Data Flow to Servers
Introduction: IME Client Interfaces

IME native API spec: https://github.com/DDNStorage/ime_native

MPI implementations supporting IME MPI-IO upstream:
► MVAPICH2 2.3.1
► Open MPI > 4.0.1
► MPICH 3.3
Return on Investment
Simple server nodes with single-port NVMe drives
Device bandwidth aligned with network bandwidth
Better endurance: reduced write amplification
ROI: Cost-efficiency - Simple Hardware

IME140:
► 10x single-port NVMe devices
► Dual Intel 4108 CPUs (8C, 1.8GHz)
► 12x 8GB DIMMs
► OS drive (128GB SATA DoM)
► Two 16x PCIe slots for HDR100/EDR/OPA/40 or 100 GbE

IME240:
► 24x single-port NVMe devices
► Dual Intel E5-2680 v4 CPUs (14C, 2.4GHz)
► 8x 16GB DIMMs
► Two OS drives (1TB SATA)
► Two 16x PCIe slots for HDR100/EDR/OPA/40 or 100 GbE

⌨ Live Demo: Listing NVMe devices

► Simple server nodes with single-port NVMe drives
► Limited switch footprint: 1 switch HDR port per box (split cable)
ROI: Cost-efficiency - Bandwidth Balanced

IME240: >600K IOPS | 20GB/s | 2 rack units

► Device bandwidth aligned with network bandwidth
► Random 4K write: > 1M IOPS
► Random 4K read: > 600K IOPS
Introduction: IME is Flash Native (1/2)

Log structured at the storage device level:
• High device throughput (NAND flash)
• Increased device lifetime due to reduced write amplification

[Figure: random writes arrive at offsets corresponding to existing data (valid sectors with user data vs. free, unused blocks); with IME's log-structured layout, the SSD sees writes to new block ranges]
ROI: Cost-efficiency - Endurance

[Chart: write amplification comparison; lower is better]

► Main factors contributing to extended endurance (see the trim sketch below):
• Log-structured layout (no read-modify-write, including for parity-protected data)
• IOs aggregated in 128KB (aligned) chunks on disk
• Unmap (trim) commands on released blocks
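On Linux, unmapping released blocks on a raw block device is done with the BLKDISCARD ioctl; a minimal sketch follows (we do not know which exact mechanism IME uses internally).

    #include <fcntl.h>
    #include <linux/fs.h>      /* BLKDISCARD */
    #include <stdint.h>
    #include <sys/ioctl.h>

    /* Discard (trim) `length` bytes starting at `offset` on the device. */
    int discard_range(int dev_fd, uint64_t offset, uint64_t length)
    {
        uint64_t range[2] = { offset, length };
        return ioctl(dev_fd, BLKDISCARD, &range);   /* 0 on success */
    }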
Manageability
User Space: simple installation
Integration with open source tools
Easy to monitor
Automated tiering
Manageability: User Space - Simple Installation (1/4)

Server packages:
► Platform: ime-240-1.00-219.el7.x86_64.rpm
► NVMe user space: ime-nvme-1.8-213.el7.noarch.rpm, ime-mem-dev-1.0-454.el7.noarch.rpm
► Network stack: libfabric-1.4.0.ddn1.0-el7.x86_64.rpm, ime-net-libfabric-1.2.2-1573.el7.x86_64.rpm
► Erasure coding: libisal-2.16.0-el7.x86_64.rpm
► IME: ime-3rdparty-1.2.2-1573.el7.x86_64.rpm, ime-common-1.2.2-1573.el7.x86_64.rpm, ime-server-1.2.2-1573.el7.x86_64.rpm

Client packages:
► Network stack: libfabric-1.4.0.ddn1.0-el7.x86_64.rpm, ime-net-libfabric-1.2.2-1573.el7.x86_64.rpm
► Erasure coding: libisal-2.16.0-el7.x86_64.rpm
► Locking: ime-ulockmgr-1.2.2-1573.el7.x86_64.rpm
► Libfuse + hwloc: ime-3rdparty-1.2.2-1573.el7.x86_64.rpm
► IME: ime-common-1.2.2-1573.el7.x86_64.rpm, ime-client-1.2.2-1573.el7.x86_64.rpm
► MPI: mvapich-verbs-2.2.ddn1.4-el7.x86_64.rpm

⌨ Live Demo: Installation
Manageability: User Space - Simple Installation (2/4)

⌨ Live Demo: Configuration

Preparing the Backing File System
► Register a path on the BFS (doesn't need to be at the root):
ime-bfs-prepare -b <path> -n <name>

Preparing all nodes (clients and servers)
► Configuration file /etc/ddn/ime/ime.conf
• Should be identical on all nodes

Preparing the client nodes
► IME FUSE configuration file: /etc/ddn/ime/ime-fuse.conf

Preparing the server nodes
► Register the NVMe devices: ime-nvm-toolbox -AO

ime.conf:

    # Declare a backing file system. There may be several.
    ime_bfs mybfs {   # logical name
        bfs_uuid = 271ba1a5-ae21-4b14-818a-28ccd97968e0;
        mount_point = /bfs/lustre/ime;
        bfs_type = lustre;
    }

    # Define a node profile
    im_node_profile ime240 {
        devs = /dev/ime_nvme{0,1,2,3,4,5,...};
        cmtl_devs = /dev/ime_nvme23;
        netdevs = [ib0-verbs,ib1-verbs];
        server_port = 13814;
        client_port = 5813,5814,5815,5816;
        heartbeat_port = 2222;
    }

    im_pool pool_msst {
        # peerno:hostname:enabled:profile:ipquad
        0:dime240-02:yes:ime240:172.30.3.12,172.30.3.22;
        1:dime240-03:yes:ime240:172.30.3.13,172.30.3.23;
        2:dime240-04:yes:ime240:172.30.3.14,172.30.3.24;
        3:dime240-05:yes:ime240:172.30.3.15,172.30.3.25;
    }

    im_cluster main {
        pools = [pool_msst];
        bfs = mybfs;
        dhtcopies = 2;
        def_pgeom = 3+1;
    }
Manageability: User Space - Simple Installation (3/4)

► Dockerized IME client (1 FUSE mount point per container):
• Available in a future IME release
► Early evaluations on NVIDIA DGX show similar benefits

[Charts: performance comparisons; higher is better]
Manageability: User Space - Simple Installation (4/4)

⌨ Live Demo: Starting IME

Start the IME service on all server nodes
► service ime-server start
► Default log file path: /var/log/ime/ime-server.log

Mount IME on client nodes
► service ime-fuse start
► Default log file path: /var/log/ime/ime-fuse.log
Manageability: Integration with Open Source Tools

⌨ Live Demo: IO tools

FIO (https://github.com/axboe/fio)
► POSIX: write in the FUSE mount point with standard FIO engines
► IME native: FIO IME engines (sync and async) relying on the IME native interface

IOR (https://github.com/hpc/ior)
► POSIX interface, and MPIIO interface (if not compiled with MPIIO supporting IME)
► IME MPIIO interface (compiled with IME support)
► IME native interface

YAPIO, Yet Another Parallel I/O Testing Tool (https://github.com/00pauln00/yapio)
► POSIX interface
► IME native interface

MDtest (https://github.com/hpc/ior)
► POSIX interface
► IME native interface
Manageability: Easy to Monitor (1/7)

► Monitoring tool in the terminal
► Metrics can be retrieved in JSON format (and used with your own tools!)
► Preconfigured Grafana solution
► Turn-key solution: DDN Insight
Manageability: Easy to Monitor (2/7)

► Monitoring tool in the terminal (ime-monitor --nvm-stats)

⌨ Live Demo: Monitoring IME Metrics

NVM metrics:
► Internal ID identifying a unique drive in IME
► Name or path used to open the device
► Read throughput (IO/s) and bandwidth (MB/s)
► Write throughput (IO/s) and bandwidth (MB/s)
► Free space (%)
► Device state
Manageability: Easy to Monitor (2/7)

► Monitoring tool in the terminal (ime-monitor --rpc)

⌨ Live Demo: Monitoring IME Metrics
Manageability: Easy to Monitor (3/7)

► Monitoring tool in the terminal (ime-monitor --frgbs)

⌨ Live Demo: Monitoring IME Metrics

Data migration statistics:
► BFS Sync Subsystem: auto-sync and manual-sync subsystem status
► Auto Flush: auto-sync subsystem status
► Data Cleanup: auto-cleanup subsystem status
► Percentage Free: percentage of free space on the server
Manageability: Easy to Monitor (4/7)

► Monitoring tool in the terminal (ime-monitor --bfs-stats)

⌨ Live Demo: Monitoring IME Metrics
Manageability: Easy to Monitor (5/7)

► Metrics can be retrieved in JSON format
► Preconfigured Telegraf - InfluxDB - Grafana solution

[Diagram: a Telegraf agent (MIT license) runs on each IME server node and feeds InfluxDB (MIT license) on an IME monitoring node; Grafana (Apache 2.0 license) provides real-time monitoring]
Manageability: Easy to Monitor (6/7)

⌨ Live Demo: Monitoring IME Metrics

► Aggregated dashboard
► Per-server dashboard
Manageability: Easy to Monitor (7/7)

► Turn-key solution (next release): IME metrics integrated in DDN Insight:
○ Aggregated views for clients, servers, devices
○ Performance and status data collection
○ Event monitoring and alerts
○ Live and historical data analysis
Manageability: Automated tiering

Initial values in the IME configuration file
► flush_threshold_ratio = when to start syncing (default if not specified: 40%)
► min_free_space_ratio = when to start releasing (default if not specified: 15%)

Can be changed on the fly from a client node (requires root credentials)
► ime-sys-ctl -u <flush_threshold_ratio>
► ime-sys-ctl -n <min_free_space_ratio>

Auto-sync can be toggled on and off per file (= pinning)
► ime-ctl -m <file> → auto-sync on
► ime-ctl -M <file> → auto-sync off
Performance
Close to peak BW even with single shared file
Low random read latency
Dynamic load balancing
Declustered rebuild
Performance: Close to peak BW - Even with single shared file (1/5)

► Parallel file systems can exhibit extremely poor performance for shared-file IO due to internal lock management, a result of managing files in large lock units (a performance barrier at the filesystem level).
► IME eliminates contention by managing IO fragments directly and coalescing IOs prior to flushing to the parallel file system.

⌨ Live Demo: Single-shared-file "Hard" I/O
Performance: Close to peak BW - Even with single shared file (2/5)

IME240 hardware:
► Memory: 8x 16GB DDR4-2400 RAM
► CPUs: 2x Intel E5-series processors; 2x PCIe 3.0 ports (16x) for drives, 2x PCIe 3.0 ports (16x) for fabrics
► Fabric adapters: 2x InfiniBand EDR/FDR, or 2x 10/40/100GbE Ethernet, or 2x Intel Omni-Path
► Data drives: 23x Intel NVMe P3520 1.2TB

[Chart: per-component hard limits. Aggregated B/W (1MB IOs), read/write: 25/25 GB/s, 105/105 GB/s, 30/30 GB/s, 25/25 GB/s, and 38.2/29.6 GB/s for the NVMe drives. Latency (4K IOs), read/write: 23/18 µs, ~20/~15 µs (180/210 µs at queue depth 64), ~3/~3 µs, ~0.1/~0.1 µs]
Performance: Close to peak BW - Even with single shared file (3/5)

[Charts: bandwidth results, higher is better (single IME240 server saturation)]
Performance: Close to peak BW - Even with single shared file (4/5)

[Chart: I/O roofline model, single shared file, IME server bandwidths (native interface). Bandwidth (GB/s) vs. I/O payload (bytes), with sequential/random read and write series; hard limits shown for 23x P3520 1.2TB NVMe drives, InfiniBand (2x EDR) and PCIe 3.0 32x lanes; 4K IOs are latency-driven; higher is better]
Performance: Close to peak BW - Even with single shared file (5/5)

► Extracting results from IO500 (November 2017) where the client count is 100 nodes or more
► Filesystem options show huge degradation when the IO pattern is tough
► Only IME is able to present flash to the applications efficiently
Low random read latency: I/O critical path breakdown on a random 4K read IO (1/6)

The critical path of a random 4K read has four components:
a) Client node processing
b) Communication layer (protocol, hardware transport layer)
c) Server(s) processing
d) Disk access (no software cache)

[Diagram: compute nodes running IME clients, a switch, and IME appliances running IME servers, annotated with stages a-d]
Low random read latency: I/O critical path breakdown on a random 4K read IO (2/6)

Client side: +10 µs for a 4K random (single-shared-file) read
► Erasure coding management (slide 22)
► IO aggregation (write)
► Prefetcher (sequential read)
► FUSE vs. native overhead
Low random read latency: I/O critical path breakdown on a random 4K read IO (3/6)

Network layer: 19 µs (InfiniBand FDR)
► Network protocol (Verbs over InfiniBand, TCP over Ethernet, …)
► Network topology (hops?)
► IRQs (+ context switches)
► NUMA effects
► Network software stack (OFED drivers + CCI* + IME network layer)

* Common Communication Interface (cci-forum.com)
Low random read latency: I/O critical path breakdown on a random 4K read IO (4/6)

Server processing: 25 µs (no data fetched from the BFS here)
► Fetching data from the Backing File System (when needed)
► IME internal read operation
► IME task engine
Low random read latency: I/O critical path breakdown on a random 4K read IO (5/6)

Drive access: 114 µs
► Intel NVMe P3520 1.2TB
► Queue depth between 1 and 10
► Disk has free blocks
Low random read latency: I/O critical path breakdown on a random 4K read IO (6/6)

Total: 168 µs (10 + 19 + 25 + 114)
Performance: Dynamic Load Balancers - Data Placement on NVMe devices

The IME IO scheduler in a server is:
► Performance oriented
○ Slow drives are solicited less
► NUMA aware
○ A network buffer has higher affinity to NVMe devices in the same NUMA node
Performance: Dynamic Load Balancers - Data Placement on Server Nodes

IME Non-Deterministic (ND) data placement
► Enabled via an environment variable on the client: IM_CLIENT_DATA_PLACEMENT_TYPE=NONDETERMINISTIC
► IME performance-aware load balancing ensures all available performance is utilized:
○ ½ server lost: ½ of one server's performance lost
○ 1 server lost: 1 server's performance lost

Deterministic data placement
► In traditional systems all servers run at the rate of the slowest server:
○ ½ server lost: ½ of all performance lost
○ 1 server lost: all performance lost
Performance: Declustered rebuild

When a disk failure occurs:
► Reconstruction is triggered automatically on other drives of the same server
► It uses parity blocks located on the other server nodes

Reconstruction is fast
► Only the data that resided on the failed device need to be reconstructed
○ If the drive held 10% of used capacity, only those 10% are rebuilt
► The performance limit is mainly set by the hardware onto which the drive is reconstructed
○ Aggregated bandwidth of the network adapters
○ Aggregated bandwidth of the remaining NVMe devices
Performance: We keep improving performance
Use cases (1/3): Simulating an "AI workload"

Simulating an AI workload: random 67KB reads from 576 processes.
► Big file on the BFS
► Reading from the BFS
► Prestage
► Read from IME (uncached)
► Read from IME (cached)
► Write output
► Sync & release

⌨ Live Demo: Use case 1
Use cases (2/3): Accelerating your applications (MPIIO)

PnetCDF I/O benchmark using the S3D application I/O kernel: https://github.com/wkliao/S3D-IO
A data checkpoint is performed at regular time intervals; its data consist of three- and four-dimensional array variables of type double.
► Run with the PFS
► Effect of recompiling the app with an MPI supporting IME
► Run with IME

⌨ Live Demo: Use case 2
Use cases (3/3): Accelerating NFS

[Diagram: compute nodes access IME, which fronts the NFS server]

► Brings scale-out, flash-native performance to NFS access
► Shields the NFS server from "tough" IO
► Increases IO throughput from the NFS hardware
► Zero application changes: replace the NFS mount by an IME mount
Fault Tolerance
Fault Tolerance (1/3)

IME already supports
► Data drive failures (erasure coding)
► Node failure (node ejection relying on the Raft consensus algorithm)

IME 1.3 (upcoming release)
► Disk hot swap
► Node reinsertion
Fault Tolerance (2/3)

Demo with the following scenario
► 4x IME240 nodes (parity 2+1, DHT copies 3)
1) Server 3 is ejected; 1TB of data needs to be rebuilt
2) 3 NVMe drives are ejected on server 4; 800GB of data is rebuilt on server 4
Fault Tolerance (3/3)

[Chart: rebuild demo results with parity 2+1, DHT copies 3]
Takeaways

Hardware/ROI
► Simple server nodes with single-port NVMe drives
► Device bandwidth aligned with network bandwidth
► Better endurance: reduced write amplification

Manageability
► User space: simple installation
► Integration with open source tools
► Easy to monitor
► Automated tiering
► Flexible erasure coding scheme
► Node fault tolerant

Performance
► Close to peak BW even with a single shared file
► Low random read latency
► Dynamic load balancers
► Declustered rebuild
Q&A72
Feature suggestions
Multi BFSEncryption
Data compressionMulti network
...
Thank You!

Keep in touch with us
sales@ddn.com
@ddn_limitless
company/datadirect-networks
9351 Deering Avenue, Chatsworth, CA 91311
1.800.837.2298 | 1.818.700.4000
Backup Slides: POSIX Compliance (1/2)

► Introducing the PJD POSIX compliance test suite
► 8789 tests in 17 categories
► Open source: https://github.com/pjd/pjdfstest.git
Backup Slides: POSIX Compliance (2/2)
Backup Slides: IME240 Block diagram
IME 1.2 Fault Tolerance

► 4x IME240 with parity = 2+1, dhtcopies = 3
► Device/server failures are transparent to the application
► Automatic data rebuild with no service interruption
► Native declustered, distributed erasure coding ensures fast rebuild
IME 1.2 Fault Tolerance

[Timeline: (1) an I/O write-intensive job starts; (2) server 3 fails holding 1TB of data; (3) data rebuild zone, ~3 minutes; (4) normal service resumed; production continues throughout]
IME 1.2 Fault Tolerance

Continued production
► Even after a single node failure, the rebuilt data are still protected against further failures:
• 3 failing devices on surviving servers
• A 2nd node failing