“A Cost-Effective, High-Bandwidth Storage Architecture”. Gibson et al. ASPLOS’98
Shimin Chen, Big Data Reading Group Presentation
NASD (Network-Attached Secure Disks)
LAN: Local Area Network; PAN: Peripheral Area Network; SAN: Storage Area Network
Goal: avoid the file-server bottleneck
• Direct drive-client transfer
• Asynchronous oversight: capabilities carry the file manager's policy decisions and are enforced at the drives (see the sketch after this list)
• Cryptographic integrity: all messages are protected by cryptographic digests; the computation is expensive and requires hardware support
• Object-based interface: in contrast to fixed-size blocks, drives export variable-size objects and manage their own layout
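A minimal sketch of asynchronous oversight with capabilities, assuming a shared per-drive key and HMAC digests (names and message format are illustrative, not the paper's actual wire protocol):

import hmac, hashlib

# Hypothetical sketch of NASD-style asynchronous oversight.
# The file manager and the drive share a secret key; clients never see it.
DRIVE_KEY = b"per-drive secret shared with the file manager"

def issue_capability(object_id: int, rights: str) -> dict:
    """File manager: encode a policy decision as a capability.
    The digest binds the rights to the object, so the drive can
    verify it later without asking the file manager again."""
    token = f"{object_id}:{rights}".encode()
    digest = hmac.new(DRIVE_KEY, token, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": rights, "digest": digest}

def drive_check(request_op: str, capability: dict) -> bool:
    """Drive: enforce the capability locally (no file-manager round trip)."""
    token = f"{capability['object_id']}:{capability['rights']}".encode()
    expected = hmac.new(DRIVE_KEY, token, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, capability["digest"]):
        return False                      # forged or tampered capability
    return request_op in capability["rights"]

# Client obtains a capability once, then talks to the drive directly.
cap = issue_capability(object_id=42, rights="read")
assert drive_check("read", cap)
assert not drive_check("write", cap)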
File Systems for NASD
• NFS and AFS ported to NASD – on the Andrew benchmark, NASD-NFS and NFS run times are within 5%
• A parallel filesystem for NASD (Cheops)
Scaling of a Parallel Data Mining Application
• NASD: n clients read a single Cheops/NASD parallel-FS file striped across n drives; bandwidth scales linearly to 45 MB/s.
• NFS: a single file striped across n disks behind one server; bandwidth bottlenecks near 20 MB/s.
• NFS-parallel: each client reads its own replica of the file on an independent disk, still through the one server; maximum bandwidth 22.5 MB/s.
• All NFS configurations show the maximum achievable bandwidth with the given number of disks, each twice as fast as a NASD drive, and up to 10 clients.
Discussions
• Requires SAN (devices and computers on the same network)
• Object-based storage device (OSD) – supports an NFS or CIFS interface
• Direct client-device transfer, async oversight
– Google File System: chunkserver ≈ NASD drive
– GFS does not use an object interface
pNFS
Standardizing Storage Clusters. Garth Goodson, Sai Susarla, and Rahul Iyer
Network Appliance
IETF Draft: pNFS Problem Statement. Garth Gibson and Peter Corbett
CMU/Panasas and Network Appliance
Presented by Steve Schlosser
NFSv4.1: a standard interface for parallel storage access
• Motivation
– Commodity clusters = scalable compute
– Single-server storage (NFS) doesn't scale
– Network-attached clustered storage systems provide similar scaling (e.g., NASD, Panasas, Lustre, GPFS, GFS, et al.)
– But they are proprietary, one-off, not interoperable
• pNFS standardizes the interface to parallel storage systems using an extension of the NFSv4 protocol
pNFS
• Separate control and data paths
• Data is striped across back-end storage nodes
• On open, the client obtains a recallable data layout object from the metadata server (MDS)
– And access rights, if supported
• Using the layout, the client fetches data directly from the storage servers
– Layouts can be cached, and are recalled by the MDS if the layout changes
• Before close, the client commits metadata changes to the MDS, if any
• Layout drivers run within a standard NFSv4.1 client
– Three data layout types: block (iSCSI, FCP), object (OSD), and file (NFSv4.1)
– The vendor provides this driver; the rest of the client remains unmodified
• The client can always fall back to standard NFSv4 through the MDS
File operations
[Figure: clients, metadata server (MDS), and data servers. The sequence, sketched in code below, is: 1) OPEN and 2) LAYOUTGET at the MDS, which returns layout objects; 3) READ/WRITE directly against the data servers; 4) LAYOUTCOMMIT, 5) LAYOUTRETURN, and 6) CLOSE back at the MDS.]
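A rough sketch of this sequence; the classes below are in-memory stand-ins, not the real NFSv4.1 operations, servers, or Linux client code:

# Hypothetical sketch of the pNFS sequence shown in the figure above.
class DataServer:
    def __init__(self): self.objects = {}
    def write(self, obj_id, off, data):
        buf = self.objects.setdefault(obj_id, bytearray())
        buf[off:off + len(data)] = data

class MetadataServer:
    def __init__(self, data_servers, stripe_size=4):
        self.ds, self.stripe = data_servers, stripe_size
    def open(self, path): return path                      # 1) OPEN
    def layout_get(self, handle):                          # 2) LAYOUTGET
        return {"stripe": self.stripe, "objects": list(range(len(self.ds)))}
    def layout_commit(self, handle, layout): pass          # 4) LAYOUTCOMMIT
    def layout_return(self, handle, layout): pass          # 5) LAYOUTRETURN
    def close(self, handle): pass                          # 6) CLOSE

def pnfs_write(mds, data_servers, path, data):
    handle = mds.open(path)
    layout = mds.layout_get(handle)
    stripe = layout["stripe"]
    for i in range(0, len(data), stripe):                  # 3) READ/WRITE goes
        ds_index = (i // stripe) % len(data_servers)       #    straight to the
        data_servers[ds_index].write(                      #    data servers
            layout["objects"][ds_index],
            (i // stripe) // len(data_servers) * stripe,
            data[i:i + stripe])
    mds.layout_commit(handle, layout)
    mds.layout_return(handle, layout)
    mds.close(handle)

servers = [DataServer() for _ in range(3)]
pnfs_write(MetadataServer(servers), servers, "/a/file", b"0123456789ab")
print([bytes(s.objects[i]) for i, s in enumerate(servers)])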
What is standardized?
[Figure: the pNFS protocol runs between clients and the MDS, a storage access protocol runs between clients (via their layout driver) and the data servers, and a control protocol runs between the MDS and the data servers.]
• Only the pNFS protocol is specified
– It supports 3 data classes: block, object, file
• The storage access and control protocols are left unspecified
pNFS client stack
[Figure: client stack. User applications sit on the NFSv4.1 client, which contains generic pNFS layout handling/caching; below it are the layout-specific drivers (file, object/OSD, block), which use Sun RPC over TCP/IP or RDMA, SCSI, or FCP on the data path. A rough sketch of such a driver interface follows.]
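A hypothetical shape for a layout-specific driver plugging into the generic layer (this is not the real Linux pNFS driver interface; names and the toy data-server policy are assumptions):

from abc import ABC, abstractmethod

class LayoutDriver(ABC):
    @abstractmethod
    def parse_layout(self, opaque_layout: bytes):
        """Decode the MDS-supplied layout into a driver-specific form."""
    @abstractmethod
    def read(self, layout, offset: int, length: int) -> bytes:
        """Fetch data directly from storage using the layout."""

class FileLayoutDriver(LayoutDriver):
    """File layout: the data servers are themselves NFSv4.1 servers."""
    def parse_layout(self, opaque_layout):
        return opaque_layout.decode().split(",")   # e.g. a list of DS addresses
    def read(self, layout, offset, length):
        ds = layout[offset % len(layout)]          # pick a data server (toy policy)
        return f"<{length} bytes from {ds}>".encode()

drv = FileLayoutDriver()
lo = drv.parse_layout(b"ds1.example,ds2.example")
print(drv.read(lo, offset=0, length=4096))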
File sharing semantics
• Close-to-open consistency (toy illustration below)
– A client sees changes made by other clients only after they close the file
– Assumes write-sharing of files is rare
– Preserves NFSv4 compatibility
• Clients notify the metadata server after updates
– The control protocol can be lazy
• A client cannot issue multi-server requests
– Write atomicity must be handled by clients
– Clients can always fall back to NFSv4 through the MDS
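A toy model of what close-to-open consistency promises (illustration only, not pNFS code): a reader that opens after the writer closes is guaranteed to see the write; a reader whose open preceded the close is not.

class Server:
    def __init__(self):
        self.committed = b""      # contents visible to subsequent opens

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = None
        self.dirty = b""
    def open(self):
        self.cache = self.server.committed     # revalidate on open
    def write(self, data):
        self.dirty = data                      # buffered locally until close
    def close(self):
        self.server.committed = self.dirty     # publish at close

srv = Server()
writer, reader = Client(srv), Client(srv)
writer.open(); writer.write(b"new data")
reader.open()                 # opened before the writer closed ...
stale = reader.cache          # ... so it may legitimately see old contents
writer.close()
reader.open()                 # open after close: guaranteed to see the write
assert reader.cache == b"new data"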
Conclusions
• pNFS could provide standard access to cluster storage systems
• Implementation and standardization are underway
– Files: NetApp, IBM, Sun
– Objects: Panasas
– Blocks: EMC
– For what it's worth, Connectathon and Bake-a-thon results suggest that prototypes are interoperating…
GPFS: A Shared-Disk File System for Large Computing Clusters
Frank Schmuck and Roger Haskin, FAST'02
Presented by Babu Pillai, Big Data Reading Group
03 Dec 2007
Outline
• Overview
• Parallel data access
• Metadata management
• Handling Failures
• Unanswered questions
GPFS Overview
• IBM’s General Parallel File System
• Deployed at 100s of sites
• Largest: 1500+ node, 2PB ASCI Purple at LLNL
• Goal: provide fast parallel FS access
– Inter- and intra-file shared access
– Scale to 1000s of nodes and disks
Overview 2
• Up to 4096 disks, 1 TB each (in 2002)
• Assumes no disk intelligence, only block access needed
Low-level details
• Very large files (64-bit size), sparse files
• Stripes files across disks
• Large block size
• Efficient support for large directories – extensible hashing
• Metadata journaling
Parallel Data Access
• Central token / lock manager
• Byte range locking on files
• Fine-grain sub-block sharing – "data shipping" to/from the node that owns the file block (see sketch below)
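A hypothetical sketch of GPFS-style byte-range tokens: the first writer is granted a token for the whole file, and a later writer only causes the overlapping part to be split off, so non-overlapping writers proceed without further lock traffic (this grant policy is simplified; the paper's token manager grants the largest non-conflicting range).

class ByteRangeTokenManager:
    def __init__(self, file_size):
        self.grants = []                      # list of (start, end, node)
        self.file_size = file_size

    def acquire(self, node, start, end):
        conflicts = [g for g in self.grants
                     if g[2] != node and g[0] < end and start < g[1]]
        for c in conflicts:                   # revoke only the overlapping part
            self.grants.remove(c)
            if c[0] < start:
                self.grants.append((c[0], start, c[2]))
            if end < c[1]:
                self.grants.append((end, c[1], c[2]))
        self.grants.append((start, end, node))
        return (start, end)

mgr = ByteRangeTokenManager(file_size=1 << 20)
mgr.acquire("node0", 0, 1 << 20)      # first writer: whole file
mgr.acquire("node1", 4096, 8192)      # second writer: splits node0's token
print(mgr.grants)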
Metadata Management
• File metadata
– A "metanode" lock holder coordinates updates
– Other nodes update it lazily (akin to data shipping)
• Allocation tables
– Partitioned into many regions
– Each node is assigned a region to allocate from (sketched below)
– Deallocations are shipped lazily
• Others – a similar pattern of coarse central locking, with distributed shipping for fine-grain sharing
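A hypothetical sketch of a segmented allocation map: each node allocates from a region it currently owns, so common-case allocation needs no cross-node locking (names and the region-assignment policy are assumptions, not GPFS's actual data structures).

class AllocationMap:
    def __init__(self, total_blocks, num_regions):
        size = total_blocks // num_regions
        self.regions = [set(range(i * size, (i + 1) * size))
                        for i in range(num_regions)]
        self.owner = {}                       # region index -> node

    def assign_region(self, node):
        for i in range(len(self.regions)):
            if i not in self.owner and self.regions[i]:
                self.owner[i] = node
                return i
        raise RuntimeError("no free region; a real system would reassign one")

    def allocate(self, node, region):
        assert self.owner[region] == node     # only the owner touches its region
        return self.regions[region].pop()

amap = AllocationMap(total_blocks=1024, num_regions=8)
r = amap.assign_region("node3")
block = amap.allocate("node3", r)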
Scaling tweaks
• Multiple token managers (by inode)
• Token eviction to limit memory
• Optimized token protocol
– Batched requests (see sketch below)
– Lazy release on deletes
– Revocation handled by the new token holder
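A small sketch of two of these tweaks, under the assumption that token management is spread across servers by hashing the inode number and that requests for many inodes are batched per server (names are illustrative):

TOKEN_SERVERS = ["tm0", "tm1", "tm2", "tm3"]

def token_server_for(inode: int) -> str:
    # Spread token management across several servers by inode.
    return TOKEN_SERVERS[inode % len(TOKEN_SERVERS)]

def batch_requests(inodes):
    """Group per-inode token requests by the server that owns them."""
    batches = {}
    for ino in inodes:
        batches.setdefault(token_server_for(ino), []).append(ino)
    return batches

print(batch_requests([3, 7, 11, 12]))   # {'tm3': [3, 7, 11], 'tm0': [12]}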
Failure Handling
• Journaling of metadata updates
• Failover of token manager service
• Handles network partitions among nodes
– The majority partition can continue
– Disk fencing keeps minority nodes from writing
• Simple replication for disk failures
Conclusions
• Good handling of parallel access
• Good combination of coarse-grain central, fine-grain distributed control
• Details missing
• Failure handling seems minimal
Lustre FS
Parallel File System Blitz (Big Data Reading Group)
Presented by: Julio López
Lustre Overview
• Cluster parallel file system with a POSIX interface
• Target: supercomputing environments
– Tens of thousands of nodes
– Petabytes of storage capacity
• Deployed at LLNL, ORNL, PNNL, LANL, PSC
• Development
– License: GPL
– Initial developer: Cluster File Systems
– Maintainer: Sun Microsystems, Inc.
User Point of View
• Single unified FS view
• Kernel support: mounted like a regular FS
• User-level library for parallel applications
– Higher throughput, lower latency
• Tools to manipulate file striping
• No client-side caching (lightweight version)
• POSIX semantics
– Behaves like multiple processes accessing a local FS
– No application rewriting
POSIX semantics: why are they important?
• HPC access patterns: clients issue disjoint byte-range accesses that may fall in overlapping blocks
• Guaranteed consistency
• Simpler application code
[Figure: multiple clients writing disjoint byte ranges that share file-system blocks.]
Architecture
• MDS: metadata server
• OSS: object storage server; each OSS fronts one or more OSTs (object storage targets, the object storage devices), backed by enterprise RAID
• Clients
[Figure: many clients, a pair of MDSes, and several OSSes with their OSTs on enterprise RAID.]
Operation
• Clients retrieve layout data from the MDS: the LoV (logical object volume) describes how the file is striped over OSTs
• Clients then communicate directly with the OSSes (not the OSTs) to read and write data
[Figure: a client obtains the LoV from the MDS, then talks to the OSSes hosting the file's OSTs. A sketch of how such a layout maps offsets to OSTs follows.]
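A hypothetical sketch of how a striped layout (the LoV the client fetched) could map a file offset to an object on an OST; field names are illustrative, not Lustre's actual layout format.

from dataclasses import dataclass

@dataclass
class Layout:
    stripe_size: int          # bytes per stripe
    objects: list             # (ost_name, object_id) per stripe, round-robin

def locate(layout: Layout, offset: int):
    stripe_index = offset // layout.stripe_size
    ost, obj = layout.objects[stripe_index % len(layout.objects)]
    offset_in_object = ((stripe_index // len(layout.objects)) * layout.stripe_size
                        + offset % layout.stripe_size)
    return ost, obj, offset_in_object

lov = Layout(stripe_size=1 << 20,
             objects=[("OST0", 101), ("OST1", 102), ("OST2", 103)])
print(locate(lov, 5 * (1 << 20) + 12345))   # -> ('OST2', 103, 1048576 + 12345)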
More features
• MDS and OSS run Linux
– A native FS serves as the storage backend (e.g., ext3)
• Networking: TCP/IP over Ethernet, InfiniBand, Myrinet, Quadrics, and others
– RDMA when available
• Distributed lock manager (LDLM)
• High availability
– Single MDT with failover MDS
– Failover OSS
– Online upgrades
– Dynamic OST additions
Discussion
Boxwood: Abstractions as the Foundation for Storage Infrastructure
John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, Lidong Zhou
Microsoft Research Silicon Valley
Main Idea
Problem: building distributed, fault-tolerant, storage-intensive software is hard
Idea: high-level abstractions provided directly by the storage subsystem (e.g., hash tables, lists, trees)
adieu, block-oriented interfaces!
high-level applications become easier to build
no need to manage free space for sparse/non-contiguous data structures
the system can exploit structural information for performance (caching, prefetching, etc.)
Approach
Experiment with high(er)-level abstractions for storage (sketched below)
chunk store: allocate/free/read/write variable-sized data (memory-like interface)
B-tree: create/insert/delete/lookup/enumerate (built on the chunk store)
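In-memory stand-ins for the two abstractions just listed; operation names follow the slide, but everything else is illustrative rather than Boxwood's real RPC interfaces:

import itertools

class ChunkStore:
    """allocate/free/read/write of variable-sized chunks by handle."""
    def __init__(self):
        self._next = itertools.count(1)
        self._chunks = {}
    def allocate(self, size: int) -> int:
        handle = next(self._next)
        self._chunks[handle] = bytearray(size)
        return handle
    def free(self, handle: int):
        del self._chunks[handle]
    def write(self, handle: int, offset: int, data: bytes):
        self._chunks[handle][offset:offset + len(data)] = data
    def read(self, handle: int, offset: int, length: int) -> bytes:
        return bytes(self._chunks[handle][offset:offset + length])

class BTree:
    """create/insert/delete/lookup/enumerate; a dict stands in for the
    distributed B-Link tree that Boxwood actually implements."""
    def __init__(self, chunk_store: ChunkStore):
        self._store = chunk_store           # real tree nodes would live in chunks
        self._items = {}
    def insert(self, key, value): self._items[key] = value
    def delete(self, key): self._items.pop(key, None)
    def lookup(self, key): return self._items.get(key)
    def enumerate(self): return sorted(self._items.items())

cs = ChunkStore()
h = cs.allocate(4096)
cs.write(h, 0, b"file data lives in chunks")
bt = BTree(cs)
bt.insert(("/", "readme.txt"), h)           # BoxFS-style: directory entry -> chunk handle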
Boxwood system = collection of services
service = software module running on multiple machines
implemented in user-level processes
Test “application”: BoxFS, a multi-node NFS file server
Target environment: machine room/enterprise setting
processing nodes with local storage, high speed network
machines in single administrative domain, physically secure
Boxwood Architecture
[Figure: layered architecture. From bottom up: raw storage, the replicated logical device (RLDev), the chunk store, and the B-tree, with a storage application on top; a column of shared services (consensus, lock, and transaction services) runs alongside.]
• Distributed consensus service: stores global state, e.g. the identity of the master lock server and the locations of replicated data; 2k+1 instances tolerate k failures
• Distributed lock service: multiple-reader, single-writer locks; master/slave structure
• Transaction service: undo/redo logging; logs are stored on the RLDev
• RLDev: reliable "media"; block-oriented replicated storage with fixed-size segments, 2-way replication, and a primary/secondary structure
• Chunk store: backed by a pair of chunk managers and the RLDev; the chunk managers map a unique chunk handle (id) to an RLDev offset
• B-tree: a distributed implementation of the B-Link tree (a high-concurrency B-tree)
BoxFS: multi-node NFS file server
Multiple nodes export single FS via NFS v2
User-level implementation, ~2500 lines C#
B-trees store directory structure, entries
File data stored in chunk store
Simplifying assumptions for performance
data cache flushed every 30 seconds
B-tree log (metadata operations) flushed to stable storage every 30 seconds instead of on transaction commit
access times not always updated
[Figure: BoxFS layered on the B-Link tree, chunk store, and underlying services.]
Performance
• RLDev read/write (2-8 servers)
– Throughput scales linearly; limited by disk latency at small sizes (0.5 KB) and by TCP throughput at large sizes (256 KB)
• B-tree insert/delete/lookup (2-8 servers)
– Throughput scales linearly with private trees (no contention)
– Throughput flattens with a single public tree and a partitioned key space (contention in the upper levels of the tree)
• BoxFS vs. stock NFS using Connectathon (single server)
– BoxFS: no RLDev replication; NFS: native Win2003 server on NTFS
– Performance is comparable; BoxFS, with its asynchronous metadata log flushes, wins on metadata-intensive operations (B-trees)
• BoxFS scaling (2-8 servers)
– Read file: linear scaling (no contention)
– MkDirEnt, write file: flat performance, limited by lock contention on metadata updates
Ursa Minor: a brief review. Michael Kozuch
Motivation
• Data distribution – choose a member of a protocol family along several dimensions (a sketch of such a per-object choice follows):
– Data encoding: replication or erasure coding
– Block size
– Storage-node fault type: crash or Byzantine
– Number of faults to tolerate
– Timing model: synchronous or asynchronous
– Data location
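A hypothetical record capturing these per-object choices; field names and the toy sizing rule are assumptions, not Ursa Minor's actual types or protocol math.

from dataclasses import dataclass
from typing import Literal

@dataclass
class Distribution:
    encoding: Literal["replication", "erasure"]
    block_size: int                      # bytes
    fault_type: Literal["crash", "byzantine"]
    faults_tolerated: int                # t
    timing: Literal["synchronous", "asynchronous"]

    def storage_nodes_needed(self) -> int:
        # Toy rule of thumb: plain replication needs t+1 copies to survive
        # t crash faults; other protocol-family members need more nodes.
        if self.encoding == "replication" and self.fault_type == "crash":
            return self.faults_tolerated + 1
        raise NotImplementedError("depends on the chosen protocol member")

archive = Distribution("erasure", 64 * 1024, "crash", 2, "asynchronous")
hot_data = Distribution("replication", 4 * 1024, "crash", 1, "synchronous")
print(hot_data.storage_nodes_needed())   # -> 2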
Architecture
• Object-based
– Objects are stored as slices with common characteristics
– Objects are similar to files: create/delete, read/write, get/set attributes, etc.
Results