“A Cost-Effective, High-Bandwidth Storage Architecture”. Gibson et al. ASPLOS’98
Shimin Chen, Big Data Reading Group Presentation
NASD (Network-Attached Secure Disks)
LAN: Local Area Network; PAN: Peripheral Area Network; SAN: Storage Area Network
Goal: avoid the file-server bottleneck
• Direct drive-client transfer
• Asynchronous oversight: capabilities carry the file manager's policy decisions and are enforced at the drives (see the sketch after this list)
• Cryptographic integrity: all messages are protected by cryptographic digests; the computation is expensive and requires hardware support
• Object-based interface: in contrast to fixed-size blocks, drives export variable-size objects and manage their own layout
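A minimal sketch of asynchronous oversight with capabilities, assuming a shared per-drive key and HMAC digests (names and message format are illustrative, not the paper's actual wire protocol):

import hmac, hashlib

# Hypothetical sketch of NASD-style asynchronous oversight.
# The file manager and the drive share a secret key; clients never see it.
DRIVE_KEY = b"per-drive secret shared with the file manager"

def issue_capability(object_id: int, rights: str) -> dict:
    """File manager: encode a policy decision as a capability.
    The digest binds the rights to the object, so the drive can
    verify it later without asking the file manager again."""
    token = f"{object_id}:{rights}".encode()
    digest = hmac.new(DRIVE_KEY, token, hashlib.sha256).hexdigest()
    return {"object_id": object_id, "rights": rights, "digest": digest}

def drive_check(request_op: str, capability: dict) -> bool:
    """Drive: enforce the capability locally (no file-manager round trip)."""
    token = f"{capability['object_id']}:{capability['rights']}".encode()
    expected = hmac.new(DRIVE_KEY, token, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, capability["digest"]):
        return False                      # forged or tampered capability
    return request_op in capability["rights"]

# Client obtains a capability once, then talks to the drive directly.
cap = issue_capability(object_id=42, rights="read")
assert drive_check("read", cap)
assert not drive_check("write", cap)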
File Systems for NASD
• NFS and AFS ported to NASD – on the Andrew benchmark, NASD-NFS and NFS run times are within 5%
• A parallel filesystem for NASD (Cheops)
Scaling of a Parallel Data Mining Application
• NASD: n clients read a single Cheops/NASD parallel-FS file striped across n drives; bandwidth scales linearly to 45 MB/s.
• NFS: a single file striped across n disks behind one server; bandwidth bottlenecks near 20 MB/s.
• NFS-parallel: each client reads its own replica of the file on an independent disk, still through the one server; maximum bandwidth 22.5 MB/s.
• All NFS configurations show the maximum achievable bandwidth with the given number of disks, each twice as fast as a NASD drive, and up to 10 clients.
Discussions
• Requires SAN (devices and computers on the same network)
• Object-based storage device (OSD) – supports an NFS or CIFS interface
• Direct client-device transfer, async oversight
– Google File System: chunkserver ≈ NASD drive
– GFS does not use an object interface
pNFS
Standardizing Storage Clusters. Garth Goodson, Sai Susarla, and Rahul Iyer
Network Appliance
IETF Draft: pNFS Problem Statement. Garth Gibson and Peter Corbett
CMU/Panasas and Network Appliance
Presented by Steve Schlosser
NFSv4.1: a standard interface for parallel storage access
• Motivation
– Commodity clusters = scalable compute
– Single-server storage (NFS) doesn't scale
– Network-attached clustered storage systems provide similar scaling (e.g., NASD, Panasas, Lustre, GPFS, GFS, et al.)
– But they are proprietary, one-off, not interoperable
• pNFS standardizes the interface to parallel storage systems using an extension of the NFSv4 protocol
pNFS
• Separate control and data paths
• Data is striped across back-end storage nodes
• On open, the client obtains a recallable data layout object from the metadata server (MDS)
– And access rights, if supported
• Using the layout, the client fetches data directly from the storage servers
– Layouts can be cached, and are recalled by the MDS if the layout changes
• Before close, the client commits metadata changes to the MDS, if any
• Layout drivers run within a standard NFSv4.1 client
– Three data layout types: block (iSCSI, FCP), object (OSD), and file (NFSv4.1)
– The vendor provides this driver; the rest of the client remains unmodified
• The client can always fall back to standard NFSv4 through the MDS
File operations
[Figure: clients, metadata server (MDS), and data servers. The sequence, sketched in code below, is: 1) OPEN and 2) LAYOUTGET at the MDS, which returns layout objects; 3) READ/WRITE directly against the data servers; 4) LAYOUTCOMMIT, 5) LAYOUTRETURN, and 6) CLOSE back at the MDS.]
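A rough sketch of this sequence; the classes below are in-memory stand-ins, not the real NFSv4.1 operations, servers, or Linux client code:

# Hypothetical sketch of the pNFS sequence shown in the figure above.
class DataServer:
    def __init__(self): self.objects = {}
    def write(self, obj_id, off, data):
        buf = self.objects.setdefault(obj_id, bytearray())
        buf[off:off + len(data)] = data

class MetadataServer:
    def __init__(self, data_servers, stripe_size=4):
        self.ds, self.stripe = data_servers, stripe_size
    def open(self, path): return path                      # 1) OPEN
    def layout_get(self, handle):                          # 2) LAYOUTGET
        return {"stripe": self.stripe, "objects": list(range(len(self.ds)))}
    def layout_commit(self, handle, layout): pass          # 4) LAYOUTCOMMIT
    def layout_return(self, handle, layout): pass          # 5) LAYOUTRETURN
    def close(self, handle): pass                          # 6) CLOSE

def pnfs_write(mds, data_servers, path, data):
    handle = mds.open(path)
    layout = mds.layout_get(handle)
    stripe = layout["stripe"]
    for i in range(0, len(data), stripe):                  # 3) READ/WRITE goes
        ds_index = (i // stripe) % len(data_servers)       #    straight to the
        data_servers[ds_index].write(                      #    data servers
            layout["objects"][ds_index],
            (i // stripe) // len(data_servers) * stripe,
            data[i:i + stripe])
    mds.layout_commit(handle, layout)
    mds.layout_return(handle, layout)
    mds.close(handle)

servers = [DataServer() for _ in range(3)]
pnfs_write(MetadataServer(servers), servers, "/a/file", b"0123456789ab")
print([bytes(s.objects[i]) for i, s in enumerate(servers)])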
What is standardized?
[Figure: the pNFS protocol runs between clients and the MDS, a storage access protocol runs between clients (via their layout driver) and the data servers, and a control protocol runs between the MDS and the data servers.]
• Only the pNFS protocol is specified
– It supports 3 data classes: block, object, file
• The storage access and control protocols are left unspecified
pNFS client stack
[Figure: client stack. User applications sit on the NFSv4.1 client, which contains generic pNFS layout handling/caching; below it are the layout-specific drivers (file, object/OSD, block), which use Sun RPC over TCP/IP or RDMA, SCSI, or FCP on the data path. A rough sketch of such a driver interface follows.]
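A hypothetical shape for a layout-specific driver plugging into the generic layer (this is not the real Linux pNFS driver interface; names and the toy data-server policy are assumptions):

from abc import ABC, abstractmethod

class LayoutDriver(ABC):
    @abstractmethod
    def parse_layout(self, opaque_layout: bytes):
        """Decode the MDS-supplied layout into a driver-specific form."""
    @abstractmethod
    def read(self, layout, offset: int, length: int) -> bytes:
        """Fetch data directly from storage using the layout."""

class FileLayoutDriver(LayoutDriver):
    """File layout: the data servers are themselves NFSv4.1 servers."""
    def parse_layout(self, opaque_layout):
        return opaque_layout.decode().split(",")   # e.g. a list of DS addresses
    def read(self, layout, offset, length):
        ds = layout[offset % len(layout)]          # pick a data server (toy policy)
        return f"<{length} bytes from {ds}>".encode()

drv = FileLayoutDriver()
lo = drv.parse_layout(b"ds1.example,ds2.example")
print(drv.read(lo, offset=0, length=4096))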
File sharing semantics
• Close-to-open consistency (toy illustration below)
– A client sees changes made by other clients only after they close the file
– Assumes write-sharing of files is rare
– Preserves NFSv4 compatibility
• Clients notify the metadata server after updates
– The control protocol can be lazy
• A client cannot issue multi-server requests
– Write atomicity must be handled by clients
– Clients can always fall back to NFSv4 through the MDS
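A toy model of what close-to-open consistency promises (illustration only, not pNFS code): a reader that opens after the writer closes is guaranteed to see the write; a reader whose open preceded the close is not.

class Server:
    def __init__(self):
        self.committed = b""      # contents visible to subsequent opens

class Client:
    def __init__(self, server):
        self.server = server
        self.cache = None
        self.dirty = b""
    def open(self):
        self.cache = self.server.committed     # revalidate on open
    def write(self, data):
        self.dirty = data                      # buffered locally until close
    def close(self):
        self.server.committed = self.dirty     # publish at close

srv = Server()
writer, reader = Client(srv), Client(srv)
writer.open(); writer.write(b"new data")
reader.open()                 # opened before the writer closed ...
stale = reader.cache          # ... so it may legitimately see old contents
writer.close()
reader.open()                 # open after close: guaranteed to see the write
assert reader.cache == b"new data"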
Conclusions
• pNFS could provide standard access to cluster storage systems
• Implementation and standardization are underway
– Files: NetApp, IBM, Sun
– Objects: Panasas
– Blocks: EMC
– For what it's worth, Connectathon and Bake-a-thon results suggest that prototypes are interoperating…
GPFS: A Shared-Disk File System for Large Computing Clusters
Frank Schmuck and Roger Haskin, FAST'02
Presented by Babu Pillai, Big Data Reading Group
03 Dec 2007
Outline
• Overview
• Parallel data access
• Metadata management
• Handling Failures
• Unanswered questions
GPFS Overview
• IBM’s General Parallel File System
• Deployed at 100s of sites
• Largest: 1500+ node, 2PB ASCI Purple at LLNL
• Goal: provide fast parallel FS access
– Inter- and intra-file shared access
– Scale to 1000s of nodes and disks
Overview 2
• Up to 4096 disks, 1 TB each (in 2002)
• Assumes no disk intelligence, only block access needed
Low-level details
• Very large files (64-bit size), sparse files
• Stripes files across disks
• Large block size
• Efficient support for large directories – extensible hashing
• Metadata journaling
Parallel Data Access
• Central token / lock manager
• Byte range locking on files
• Fine-grain sub-block sharing – "data shipping" to/from the node that owns the file block (see sketch below)
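A hypothetical sketch of GPFS-style byte-range tokens: the first writer is granted a token for the whole file, and a later writer only causes the overlapping part to be split off, so non-overlapping writers proceed without further lock traffic (this grant policy is simplified; the paper's token manager grants the largest non-conflicting range).

class ByteRangeTokenManager:
    def __init__(self, file_size):
        self.grants = []                      # list of (start, end, node)
        self.file_size = file_size

    def acquire(self, node, start, end):
        conflicts = [g for g in self.grants
                     if g[2] != node and g[0] < end and start < g[1]]
        for c in conflicts:                   # revoke only the overlapping part
            self.grants.remove(c)
            if c[0] < start:
                self.grants.append((c[0], start, c[2]))
            if end < c[1]:
                self.grants.append((end, c[1], c[2]))
        self.grants.append((start, end, node))
        return (start, end)

mgr = ByteRangeTokenManager(file_size=1 << 20)
mgr.acquire("node0", 0, 1 << 20)      # first writer: whole file
mgr.acquire("node1", 4096, 8192)      # second writer: splits node0's token
print(mgr.grants)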
Metadata Management
• File metadata
– A "metanode" lock holder coordinates updates
– Other nodes update it lazily (akin to data shipping)
• Allocation tables
– Partitioned into many regions
– Each node is assigned a region to allocate from (sketched below)
– Deallocations are shipped lazily
• Others – a similar pattern of coarse central locking, with distributed shipping for fine-grain sharing
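A hypothetical sketch of a segmented allocation map: each node allocates from a region it currently owns, so common-case allocation needs no cross-node locking (names and the region-assignment policy are assumptions, not GPFS's actual data structures).

class AllocationMap:
    def __init__(self, total_blocks, num_regions):
        size = total_blocks // num_regions
        self.regions = [set(range(i * size, (i + 1) * size))
                        for i in range(num_regions)]
        self.owner = {}                       # region index -> node

    def assign_region(self, node):
        for i in range(len(self.regions)):
            if i not in self.owner and self.regions[i]:
                self.owner[i] = node
                return i
        raise RuntimeError("no free region; a real system would reassign one")

    def allocate(self, node, region):
        assert self.owner[region] == node     # only the owner touches its region
        return self.regions[region].pop()

amap = AllocationMap(total_blocks=1024, num_regions=8)
r = amap.assign_region("node3")
block = amap.allocate("node3", r)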
Scaling tweaks
• Multiple token managers (by inode)
• Token eviction to limit memory
• Optimized token protocol
– Batched requests (see sketch below)
– Lazy release on deletes
– Revocation handled by the new token holder
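A small sketch of two of these tweaks, under the assumption that token management is spread across servers by hashing the inode number and that requests for many inodes are batched per server (names are illustrative):

TOKEN_SERVERS = ["tm0", "tm1", "tm2", "tm3"]

def token_server_for(inode: int) -> str:
    # Spread token management across several servers by inode.
    return TOKEN_SERVERS[inode % len(TOKEN_SERVERS)]

def batch_requests(inodes):
    """Group per-inode token requests by the server that owns them."""
    batches = {}
    for ino in inodes:
        batches.setdefault(token_server_for(ino), []).append(ino)
    return batches

print(batch_requests([3, 7, 11, 12]))   # {'tm3': [3, 7, 11], 'tm0': [12]}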
Failure Handling
• Journaling of metadata updates
• Failover of token manager service
• Handles network partitions among nodes
– The majority partition can continue
– Disk fencing keeps minority nodes from writing
• Simple replication for disk failures
Conclusions
• Good handling of parallel access
• Good combination of coarse-grain central, fine-grain distributed control
• Details missing
• Failure handling seems minimal
Lustre FS
Parallel File System Blitz (Big Data Reading Group)
Presented by: Julio López
Lustre Overview
• Cluster parallel file system with a POSIX interface
• Target: supercomputing environments
– Tens of thousands of nodes
– Petabytes of storage capacity
• Deployed at LLNL, ORNL, PNNL, LANL, PSC
• Development
– License: GPL
– Initial developer: Cluster File Systems
– Maintainer: Sun Microsystems, Inc.
User Point of View
• Single unified FS view
• Kernel support: mounted like a regular FS
• User-level library for parallel applications
– Higher throughput, lower latency
• Tools to manipulate file striping
• No client-side caching (lightweight version)
• POSIX semantics
– Behaves like multiple processes accessing a local FS
– No application rewriting
POSIX semantics: why are they important?
• HPC access patterns: clients issue disjoint byte-range accesses that may fall in overlapping blocks
• Guaranteed consistency
• Simpler application code
[Figure: multiple clients writing disjoint byte ranges that share file-system blocks.]
Architecture
• MDS: metadata server
• OSS: object storage server; each OSS fronts one or more OSTs (object storage targets, the object storage devices), backed by enterprise RAID
• Clients
[Figure: many clients, a pair of MDSes, and several OSSes with their OSTs on enterprise RAID.]
Operation
• Clients retrieve layout data from the MDS: the LoV (logical object volume) describes how the file is striped over OSTs
• Clients then communicate directly with the OSSes (not the OSTs) to read and write data
[Figure: a client obtains the LoV from the MDS, then talks to the OSSes hosting the file's OSTs. A sketch of how such a layout maps offsets to OSTs follows.]
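A hypothetical sketch of how a striped layout (the LoV the client fetched) could map a file offset to an object on an OST; field names are illustrative, not Lustre's actual layout format.

from dataclasses import dataclass

@dataclass
class Layout:
    stripe_size: int          # bytes per stripe
    objects: list             # (ost_name, object_id) per stripe, round-robin

def locate(layout: Layout, offset: int):
    stripe_index = offset // layout.stripe_size
    ost, obj = layout.objects[stripe_index % len(layout.objects)]
    offset_in_object = ((stripe_index // len(layout.objects)) * layout.stripe_size
                        + offset % layout.stripe_size)
    return ost, obj, offset_in_object

lov = Layout(stripe_size=1 << 20,
             objects=[("OST0", 101), ("OST1", 102), ("OST2", 103)])
print(locate(lov, 5 * (1 << 20) + 12345))   # -> ('OST2', 103, 1048576 + 12345)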
More features
• MDS and OSS run Linux
– A native FS serves as the storage backend (e.g., ext3)
• Networking: TCP/IP over Ethernet, InfiniBand, Myrinet, Quadrics, and others
– RDMA when available
• Distributed lock manager (LDLM)
• High availability
– Single MDT with failover MDS
– Failover OSS
– Online upgrades
– Dynamic OST additions
Discussion
Boxwood: Abstractions as the Foundation for Storage Infrastructure
John MacCormick, Nick Murphy, Marc Najork, Chandramohan A. Thekkath, Lidong Zhou
Microsoft Research Silicon Valley
Main Idea
Problem: building distributed, fault-tolerant, storage-intensive software is hard
Idea: high-level abstractions provided directly by the storage subsystem (e.g., hash tables, lists, trees)
adieu, block-oriented interfaces!
high-level applications become easier to build
no need to manage free space for sparse/non-contiguous data structures
the system can exploit structural information for performance (caching, prefetching, etc.)
Approach
Experiment with high(er)-level abstractions for storage (sketched below)
chunk store: allocate/free/read/write variable-sized data (memory-like interface)
B-tree: create/insert/delete/lookup/enumerate (built on the chunk store)
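In-memory stand-ins for the two abstractions just listed; operation names follow the slide, but everything else is illustrative rather than Boxwood's real RPC interfaces:

import itertools

class ChunkStore:
    """allocate/free/read/write of variable-sized chunks by handle."""
    def __init__(self):
        self._next = itertools.count(1)
        self._chunks = {}
    def allocate(self, size: int) -> int:
        handle = next(self._next)
        self._chunks[handle] = bytearray(size)
        return handle
    def free(self, handle: int):
        del self._chunks[handle]
    def write(self, handle: int, offset: int, data: bytes):
        self._chunks[handle][offset:offset + len(data)] = data
    def read(self, handle: int, offset: int, length: int) -> bytes:
        return bytes(self._chunks[handle][offset:offset + length])

class BTree:
    """create/insert/delete/lookup/enumerate; a dict stands in for the
    distributed B-Link tree that Boxwood actually implements."""
    def __init__(self, chunk_store: ChunkStore):
        self._store = chunk_store           # real tree nodes would live in chunks
        self._items = {}
    def insert(self, key, value): self._items[key] = value
    def delete(self, key): self._items.pop(key, None)
    def lookup(self, key): return self._items.get(key)
    def enumerate(self): return sorted(self._items.items())

cs = ChunkStore()
h = cs.allocate(4096)
cs.write(h, 0, b"file data lives in chunks")
bt = BTree(cs)
bt.insert(("/", "readme.txt"), h)           # BoxFS-style: directory entry -> chunk handle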
Boxwood system = collection of services
service = software module running on multiple machines
implemented in user-level processes
Test “application”: BoxFS, a multi-node NFS file server
Target environment: machine room/enterprise setting
processing nodes with local storage, high speed network
machines in single administrative domain, physically secure
Boxwood Architecture
[Figure: layered architecture. From bottom up: raw storage, the replicated logical device (RLDev), the chunk store, and the B-tree, with a storage application on top; a column of shared services (consensus, lock, and transaction services) runs alongside.]
• Distributed consensus service: stores global state, e.g. the identity of the master lock server and the locations of replicated data; 2k+1 instances tolerate k failures
• Distributed lock service: multiple-reader, single-writer locks; master/slave structure
• Transaction service: undo/redo logging; logs are stored on the RLDev
• RLDev: reliable "media"; block-oriented replicated storage with fixed-size segments, 2-way replication, and a primary/secondary structure
• Chunk store: backed by a pair of chunk managers and the RLDev; the chunk managers map a unique chunk handle (id) to an RLDev offset
• B-tree: a distributed implementation of the B-Link tree (a high-concurrency B-tree)
BoxFS: multi-node NFS file server
Multiple nodes export single FS via NFS v2
User-level implementation, ~2500 lines C#
B-trees store directory structure, entries
File data stored in chunk store
Simplifying assumptions for performance
data cache flushed every 30 seconds
B-tree log (metadata operations) flushed to stable storage every 30 seconds instead of on transaction commit
access times not always updated
[Figure: BoxFS layered on the B-Link tree, chunk store, and underlying services.]
Performance
• RLDev read/write (2-8 servers)
– Throughput scales linearly; limited by disk latency at small sizes (0.5 KB) and by TCP throughput at large sizes (256 KB)
• B-tree insert/delete/lookup (2-8 servers)
– Throughput scales linearly with private trees (no contention)
– Throughput flattens with a single public tree and a partitioned key space (contention in the upper levels of the tree)
• BoxFS vs. stock NFS using Connectathon (single server)
– BoxFS: no RLDev replication; NFS: native Win2003 server on NTFS
– Performance is comparable; BoxFS, with its asynchronous metadata log flushes, wins on metadata-intensive operations (B-trees)
• BoxFS scaling (2-8 servers)
– Read file: linear scaling (no contention)
– MkDirEnt, write file: flat performance, limited by lock contention on metadata updates
Ursa Minor: a brief review. Michael Kozuch
Motivation
• Data distribution – choose a member of a protocol family along several dimensions (a sketch of such a per-object choice follows):
– Data encoding: replication or erasure coding
– Block size
– Storage-node fault type: crash or Byzantine
– Number of faults to tolerate
– Timing model: synchronous or asynchronous
– Data location
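A hypothetical record capturing these per-object choices; field names and the toy sizing rule are assumptions, not Ursa Minor's actual types or protocol math.

from dataclasses import dataclass
from typing import Literal

@dataclass
class Distribution:
    encoding: Literal["replication", "erasure"]
    block_size: int                      # bytes
    fault_type: Literal["crash", "byzantine"]
    faults_tolerated: int                # t
    timing: Literal["synchronous", "asynchronous"]

    def storage_nodes_needed(self) -> int:
        # Toy rule of thumb: plain replication needs t+1 copies to survive
        # t crash faults; other protocol-family members need more nodes.
        if self.encoding == "replication" and self.fault_type == "crash":
            return self.faults_tolerated + 1
        raise NotImplementedError("depends on the chosen protocol member")

archive = Distribution("erasure", 64 * 1024, "crash", 2, "asynchronous")
hot_data = Distribution("replication", 4 * 1024, "crash", 1, "synchronous")
print(hot_data.storage_nodes_needed())   # -> 2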
Architecture
• Object-based
– Objects are stored as slices with common characteristics
– Objects are similar to files: create/delete, read/write, get/set attributes, etc.
Results