February 2015, Bill Loewe - HPC Advisory Council
Seagate Confidential
Agenda
• File System Metadata, a growing issue
• Parallel File System - Lustre Overview
• Metadata and Distributed Namespace
• Test setup and implementation for metadata testing
• Scaling Metadata Servers
• High Availability
Metadata Performance
File system performance is typically viewed in terms of bandwidth.
The bandwidth problem has largely been addressed, but metadata is a growing issue.
We see this in workloads that must access and process large numbers of files:
• Genome processing
• CPU chip manufacturing
• Video compositing/rendering
Lustre Parallel File System
• Lustre is an open-source, distributed parallel file system.
• Its object-based design provides extreme scalability.
• Compute clients interact directly with storage servers.
• Lustre is composed of:
– Clients
– Metadata servers and targets
– Storage servers and targets
Lustre Distributed NamespacE (DNE)
Distributed NamespacE (DNE) is a new feature, available in Lustre 2.5, that allows multiple MDS/MDT components to participate in a single file system.
• DNE allows the namespace to be divided across multiple metadata servers.
• This enables the size of the namespace and the metadata throughput to scale with the number of servers.
• The Lustre DNE project consists of two phases.
Remote Directories
Phase 1, Lustre 2.5 release
Remote Directories -- Lustre sub-directories are distributed over multiple metadata targets (MDTs). The distribution of sub-directories is defined by an administrator.
[Figure: example namespace with Root, directories dir a through dir e, sub-directories dir b2 through dir e2, and files]
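To make the placement rule concrete, here is a toy model of phase 1 remote directories. It is a sketch under stated assumptions, not Lustre code: the class and method names are hypothetical, the root lives on MDT0 (the master/default MDT, per the DNE hardware slide), a directory created without an explicit MDT index stays on its parent's MDT, and a file's metadata lives on the MDT of its containing directory. On a real Lustre 2.5 system an administrator creates a remote directory with `lfs mkdir -i <mdt_index>`; check the Lustre manual for your release.

```python
from pathlib import PurePosixPath


class Namespace:
    """Toy model of DNE phase 1 remote directories (hypothetical, not Lustre code)."""

    def __init__(self):
        # The root directory lives on MDT0, the master/default MDT.
        self.mdt_of = {PurePosixPath("/"): 0}

    def mkdir(self, path, mdt_index=None):
        """Create a directory; pinning it to mdt_index makes it a remote directory."""
        p = PurePosixPath(path)
        parent_mdt = self.mdt_of[p.parent]  # raises KeyError if the parent is missing
        # Without an explicit index, the new directory stays on its parent's MDT.
        self.mdt_of[p] = parent_mdt if mdt_index is None else mdt_index

    def mdt_for_file(self, path):
        """A file's metadata lives on the MDT of the directory that contains it."""
        return self.mdt_of[PurePosixPath(path).parent]


if __name__ == "__main__":
    ns = Namespace()
    ns.mkdir("/dir_a")                # stays on MDT0
    ns.mkdir("/dir_b", mdt_index=1)   # remote directory on MDT1
    ns.mkdir("/dir_b/dir_b2")         # inherits MDT1 from dir_b
    print(ns.mdt_for_file("/dir_a/file"), ns.mdt_for_file("/dir_b/dir_b2/file"))
```

Because only the administrator decides which sub-trees move, load balance depends on choosing directories whose workloads are roughly independent.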
Striped Directories
Phase 2, Lustre 2.7
Striped Directories -- The contents of a given directory are distributed over multiple MDTs.
[Figure: a striped directory containing dir c2, dir e2, and files]
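With striped directories, each entry name has to be mapped to one of the directory's stripes, and a hash of the name is the natural way to do that. The sketch below assumes a 64-bit FNV-1a hash taken modulo the stripe count; Lustre offers FNV-1a among its directory hash types, but its exact name-to-stripe mapping may differ, so treat this as illustrative. The function names are hypothetical.

```python
FNV64_OFFSET_BASIS = 0xCBF29CE484222325
FNV64_PRIME = 0x100000001B3
MASK64 = (1 << 64) - 1


def fnv1a_64(data: bytes) -> int:
    """64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime."""
    h = FNV64_OFFSET_BASIS
    for byte in data:
        h ^= byte
        h = (h * FNV64_PRIME) & MASK64  # keep the result within 64 bits
    return h


def stripe_for_name(name: str, stripe_count: int) -> int:
    """Pick which directory stripe (one per MDT in this model) holds an entry."""
    return fnv1a_64(name.encode("utf-8")) % stripe_count


if __name__ == "__main__":
    for entry in ["dir c2", "dir e2", "file0", "file1"]:
        print(entry, "-> stripe", stripe_for_name(entry, 4))
```

Hashing spreads entries of one large directory across MDTs without administrator input, which is what lets a single hot directory scale beyond one metadata server.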
Engineered Storage Solutions for HPC, Big Data & Cloud
ClusterStor stack:
• Parallel file system/Object
• Data protection
• Linux OS
• Flash optimization
• BIOS/IPMI
• GEM diagnostics
• Custom x86 embedded server
• Seagate storage platforms
• High availability
• File system (ext4)
• High-speed networking (InfiniBand/40GbE)
• Seagate storage devices
Architected, Integrated, Optimized, Qualified, Supported
Lustre Components
• Clients and MDS: directory operations, file open/close, metadata, and concurrency
• MDS and OSS: file creation, file status, and recovery
• Clients and OSS: file I/O and locking
[Figure: clients connected to one MDS and two OSSs]
ClusterStor Management Unit (CMU): Management and Metadata (MDS/MDT)
CSM Manager and MDS/MGS Nodes
• 2RU 4-node Sandy Bridge servers
– Server 1: CSM management
– Server 2: Boot
– Server 3: MGS
– Server 4: MDS
• Fault tolerance (active/passive)
• Serviceability
2U24 JBOD – MDT
• SAS JBOD for MDS/MGS/management
• Disk configuration:
– Qty 4: Lustre management (MGS)
– Qty 4: ClusterStor management and NFS
– Qty 2: global hot spares
– Qty 14: drives for MDT
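The disk configuration above accounts for every slot in the 2U24 enclosure (4 + 4 + 2 + 14 = 24). A quick tally makes that arithmetic explicit; the constant and function names are hypothetical, and the role counts are taken directly from the slide.

```python
CMU_JBOD_SLOTS = 24  # 2U24 enclosure

# Drive roles from the slide's disk configuration.
CMU_DRIVE_ROLES = {
    "Lustre management (MGS)": 4,
    "ClusterStor management and NFS": 4,
    "Global hot spares": 2,
    "MDT": 14,
}


def check_allocation(roles: dict, slots: int) -> int:
    """Return the number of allocated drives, or raise if it does not match the slot count."""
    total = sum(roles.values())
    if total != slots:
        raise ValueError(f"{total} drives allocated, but the enclosure has {slots} slots")
    return total


if __name__ == "__main__":
    print(check_allocation(CMU_DRIVE_ROLES, CMU_JBOD_SLOTS))  # 4 + 4 + 2 + 14 = 24
```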
Scalable Storage Unit (SSU)
• 5U84 enclosure
• Two Object Storage Servers (OSSs) per SSU
• Two trays of 42 HDDs each for Object Storage Targets
• High availability (H/A) on each SSU
• InfiniBand QDR/FDR and 40 Gb Ethernet data network connectivity
ClusterStor & Lustre 2.5 DNE Hardware
• DNE is available in ClusterStor v2.0.
– MDT0 is the master and default MDT in a DNE environment.
• DNE servers are configured in active/active pairs.
– Seagate 2U24 with 2 MDS embedded server modules
• Scale metadata capacity and performance with DNE server pairs.
[Figure: Base MDS with an example namespace tree: Root, dir a through dir e, sub-directories dir b2 through dir e2, and files]
ClusterStor Hardware and the Lustre File System
• Object Storage Server: Seagate Embedded Application Server
• Object Storage Target: Seagate 5U84 Storage Bridge Bay Enclosure
• Metadata and Management Servers: 2U x 4 servers
• Metadata Target: Seagate 2U24 JBOD
Client file access:
1) Client: Where is the file?
2) File is at…
3) Single file (3,072 KB)
4) The file is broken into block stripe segments (1,024 KB each)
5a) File block stripe 1 of 3 (1,024 KB)
5b) File block stripe 2 of 3 (1,024 KB)
5c) File block stripe 3 of 3 (1,024 KB)
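Steps 3 to 5 describe round-robin (RAID-0 style) striping: the file is cut into stripe-size chunks that are assigned to the stripes in turn. The sketch below reproduces the slide's example (a 3,072 KB file, 1,024 KB stripes, 3 stripes) and also maps an arbitrary file offset to a stripe and an offset within that stripe's object. It is a simplified model under that round-robin assumption; the function names are hypothetical, and real Lustre layouts carry more detail than stripe size and stripe count.

```python
KIB = 1024


def split_into_stripes(file_size, stripe_size, stripe_count):
    """Cut a file into stripe-size chunks assigned round-robin (RAID-0 style).

    Returns a list of (stripe_index, file_offset, length) tuples.
    """
    chunks = []
    offset = 0
    while offset < file_size:
        length = min(stripe_size, file_size - offset)
        stripe_index = (offset // stripe_size) % stripe_count
        chunks.append((stripe_index, offset, length))
        offset += length
    return chunks


def locate(offset, stripe_size, stripe_count):
    """Map a file offset to (stripe_index, offset within that stripe's object)."""
    block = offset // stripe_size          # which stripe-size block of the file
    stripe_index = block % stripe_count    # round-robin over the stripes
    rounds = block // stripe_count         # full passes over all stripes before this block
    return stripe_index, rounds * stripe_size + offset % stripe_size


if __name__ == "__main__":
    # The slide's example: a 3,072 KB file, 1,024 KB stripes, 3 stripes.
    for stripe_index, off, length in split_into_stripes(3072 * KIB, 1024 * KIB, 3):
        print(f"stripe {stripe_index + 1} of 3: file offset {off // KIB} KB, {length // KIB} KB")
```

Once the client has the layout from the metadata server (steps 1 and 2), it can compute every chunk's location itself, which is why data I/O goes straight to the storage servers.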
Scaling MDS and DNEs
• MDS + 4 DNE servers (2 ADUs)
• mdtest create/stat/delete
• Mean of 5 iterations
[Chart: mdtest scaling, MDS + 4 DNEs; y-axis: operations per second (0 to 600,000); series: mean create, mean stat, mean remove]
Metadata High Availability
• MDT failover ensures that the Lustre file system remains available in the face of MDS node failure.
• Based on the existing OSS pair failover model
• Failover is graceful, quick, and non-disruptive.
• Failback is automatic and non-disruptive.
[Chart: High Availability and Performance; operations per second (0 to 140,000) before failover, while failed over, and after failover; series: mean create, mean stat, mean remove]
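The failover and failback behavior described above can be sketched as a small state model: each MDT has a home server, the partner in the pair serves it while the home server is down, and ownership returns home automatically once that server recovers. This is a hypothetical toy model, not the ClusterStor HA software; it ignores fencing, client reconnection, and recovery timing, and the class, method, and server names are illustrative only.

```python
class FailoverPair:
    """Toy model of an MDS failover pair (hypothetical; not the ClusterStor HA software).

    Each MDT has a home server. If the home server is down, its partner serves
    the MDT; as soon as the home server is back, the MDT fails back to it.
    """

    def __init__(self, servers, home):
        if len(servers) != 2:
            raise ValueError("a failover pair has exactly two servers")
        self.servers = tuple(servers)
        self.home = dict(home)          # MDT name -> home server
        self.up = set(self.servers)     # servers currently running

    def partner(self, server):
        a, b = self.servers
        return b if server == a else a

    def fail(self, server):
        self.up.discard(server)

    def recover(self, server):
        self.up.add(server)

    def owner(self, mdt):
        """Return the server currently serving an MDT."""
        home = self.home[mdt]
        if home in self.up:
            return home                 # normal operation, or automatic failback
        if self.partner(home) in self.up:
            return self.partner(home)   # failed over to the partner
        raise RuntimeError(f"{mdt} unavailable: both servers are down")


if __name__ == "__main__":
    # Active/active DNE pair: each server is home to one MDT.
    pair = FailoverPair(("dne1", "dne2"), {"MDT1": "dne1", "MDT2": "dne2"})
    pair.fail("dne1")
    print(pair.owner("MDT1"), pair.owner("MDT2"))  # both served by dne2
    pair.recover("dne1")
    print(pair.owner("MDT1"))                      # failed back to dne1
```

An active/passive pair fits the same model by giving only one of the two servers a home MDT; during failover the surviving server carries both loads, which is consistent with the chart tracking performance before, during, and after failover.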
Green Machine: Environmentally Aware Cold Storage Solution
Power, Space, Cooling, Green
• Lightweight, small footprint
• Cold-storage-optimized design
• Recyclable chassis, reduced metal
• Responsible disposal of old chassis
• Zero heat emission, ambient cooling/no fans
• HDDs tolerant of high operating temperatures
• Dynamic power management, low-power servers
• Aggressive TCO goals
• Lowest operating cost
• Reduced carbon footprint
“Best for the Planet”
Typical Use Cases
• Retrieve content (photographs, etc.) from deep archive while maintaining a consistent user experience
• Online picture/social media store use cases
– Pictures older than 45 days kept in cold storage
• Retrieve a patient's MRIs/X-rays
• Use cases leveraging tape-based solutions