OAK RIDGE NATIONAL LABORATORY / U.S. DEPARTMENT OF ENERGY
I/O Virtualization: Robust Storage Management in the Machine-Room and Beyond
Sudharshan Vazhkudai
Oak Ridge National Laboratory
Contributor: Xiaosong Ma (NCSU)
Virtualization in HPC
Nashville, TN
September 20th, 2006
Problem Space: Petascale Storage Crisis (1)
- Scaling to a 1 PF system creates unique storage challenges in terms of both "availability" and "performance"
- Availability: storage failure is a significant contributor to system downtime
  - "when averaged over one month, 90% of jobs submitted to the system should complete without having to be resubmitted as a result of failure" -- NSF Petascale computing solicitation (NSF 06-573)
- Performance: several DOE applications (GYRO, POP, TSI) demand sustained I/O throughput on the order of GB/sec to a TB/sec
  - Bandwidth mismatch in diverse I/O operations such as staging, offloading, checkpointing, prefetching, and end-user data delivery
  - Storage performance improves more slowly than CPU performance
  - User data needs grow faster than available compute power
Problem Space: Petascale Storage Crisis (2)
- Increasing number of processors per I/O node: failure is the norm, not the exception!

    System          # CPUs   # I/O Nodes   Ratio
    Cray Red Storm  10368    256           41:1
    Blue Gene/L     65536    1024          64:1

- Macroscopic view:

    System       # CPUs   MTBF/I           Outage Source
    ASCI Q       8192     6.5 hrs          Storage, CPU
    ASCI White   8192     40 hrs           Storage, CPU
    Google       15000    20 reboots/day   Storage, mem

- Microscopic view (from both commercial and HPC centers):
  - 3% to 7% of disks, 3% to 16% of controllers, and up to 12% of SAN switches fail per year
  - That is 10 times the rate expected from the disk vendor specification sheets
- We are well aware of the memory wall; data balance rears its head on the way to petascale
  - Implied disk need is 1,000,000 disks [Bell, Gray, Szalay, CACM Jan. 2006]
  - The fraction of system cost devoted to the storage subsystem is large: 1,000,000 disks at $100/disk puts ~50% of system cost in storage
  - The power requirement for storage is also large: 50 W per disk implies 50 MW of power for 1,000,000 disks (cooling is extra)
- Failure rates are only bound to grow manifold! The brute-force approach of simply applying more funds to match the computational scale will not work!
Problem Space: Petascale Storage Crisis (3)
- Suboptimal center operations due to failure and bandwidth mismatch
- Uptime is low because of resubmits
  - Failures due to staging errors and erroneous purging of result data
  - Data staging is stealing center performance
- Uptime is low because checkpoints and restarts are expensive
  - Checkpointing by an application on 100 thousand processors: >100,000 files per timestep; millions of files per run; ~several TB of data; I/O can take O(minutes)!
  - "we need to find a way to avoid full file system complexity in order to save checkpoints and system communications" -- SciDAC 2006, Alan Gara, Blue Gene architect, IBM
- No intelligent prefetching beyond file system buffering
  - A wealth of application access pattern information is ignored
- End-user data delivery is still an open area of research despite high-speed transfer tools and networks!
Approach
- If you cannot afford a balanced system, develop management strategies to compensate
- Exploit opportunities throughout the HEC I/O stack:
  - Application level
  - Parallel file system
  - Many unused resources: memory, cluster node-local storage, idle desktop storage (both in the machine room and on the client side)
  - Disparate storage entities, including archives and remote sources
- Concerted use of the above can be brought to bear on urgent supercomputing issues such as staging, offloading, prefetching, checkpointing, data recovery, and the I/O bandwidth bottleneck
I/O Access Virtualization Stack
- Application level: access patterns; hints; metadata specification (recovery hints, delivery/performance constraints, jitter/failure tolerance, automatic capture?); well-defined, simple interfaces to the functionality below
- Parallel file system I/O access optimizations: prefetching; data reorganization; augmenting file systems with intelligent metadata that aids data availability and narrows the performance gap
- I/O virtualization middleware:
  - Construct new storage abstractions: aggregate RAM disk; aggregate node-local storage as a cache for checkpointing or staging; relaxed POSIX I/O access
  - Seamless data pathway: to the above storage abstractions, archives, and remote storage; online data reconstruction/recovery; eager result-data offloading; lazy/planned migrations; optimizations such as collective operations and data sieving
- End-user data delivery: client-side caches, intermediate cache overlays (e.g., FreeLoader, IBP)
Application-aware prefetching and caching
- Prefetching and caching are increasingly important for performance: I/O speed falls further and further behind CPU and memory
- Scientific applications are full of hints regarding access patterns!
  - MPI-IO file views that define the range of accesses
  - Repeated (iterative) behavior is common in timestep simulations
- Smart prefetching/caching through the parallel file system or I/O library (see the sketch below)
  - Automatic access pattern analysis through MPI-IO calls
  - An automatic "learning" process through the initial timesteps
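Below is a minimal sketch of the kind of prefetching such a library could perform once the iterative pattern is learned: each rank issues an asynchronous MPI-IO read-ahead for its predicted next-timestep region and overlaps it with computation. The file name, the fixed per-timestep stride, and the step count are illustrative assumptions; a real implementation would infer the stride automatically from the MPI_File_set_view() and read calls observed during the first few timesteps.

```c
#include <mpi.h>
#include <stdlib.h>

#define TS_BYTES (1 << 20)  /* assumed bytes consumed per rank per timestep */
#define NSTEPS   10         /* illustrative number of timesteps */

int main(int argc, char **argv)
{
    MPI_File    fh;
    MPI_Request pre_req;
    int         rank, nprocs;
    char       *cur  = malloc(TS_BYTES);
    char       *next = malloc(TS_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File_open(MPI_COMM_WORLD, "timestep.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Timestep 0 is read synchronously; it also "teaches" the pattern. */
    MPI_File_read_at(fh, (MPI_Offset)rank * TS_BYTES, cur, TS_BYTES,
                     MPI_BYTE, MPI_STATUS_IGNORE);

    for (int t = 1; t < NSTEPS; t++) {
        /* Prefetch: the repeated pattern predicts that this rank's data
         * for timestep t sits one whole-application stride further on. */
        MPI_Offset off = ((MPI_Offset)t * nprocs + rank) * TS_BYTES;
        MPI_File_iread_at(fh, off, next, TS_BYTES, MPI_BYTE, &pre_req);

        /* ... compute on `cur` while the read-ahead is in flight ... */

        MPI_Wait(&pre_req, MPI_STATUS_IGNORE);
        char *tmp = cur; cur = next; next = tmp;  /* swap buffers */
    }

    MPI_File_close(&fh);
    free(cur);
    free(next);
    MPI_Finalize();
    return 0;
}
```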
Augmenting File System Metadata: Recovery Hints for Fault Tolerance
- Embed recovery metadata about transient job data into the parallel file system: extend file system metadata to include recovery hints
- Specification of rich metadata (recovery hints):
  - A persistent copy of the job input data is usually held remotely
  - Information regarding the "sources" and "sinks" of a user's job data becomes an integral part of the transient data on the supercomputer
  - Sample metadata can include URIs, credentials, etc.
  - Metadata is specified as part of end-user job submission
- Enables elegant, automatic "recovery" and "offloading" without manual intervention (see the sketch below)
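As a concrete illustration, a minimal sketch follows that attaches a recovery hint to a staged input file. The slide proposes extending parallel file system metadata itself; this sketch stands in with Linux extended attributes, and the attribute name, file path, and URI are illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>   /* Linux extended attributes as a stand-in */

/* Tag staged job-input data with the URI of its persistent remote copy.
 * On I/O-node failure or a staging error, a recovery daemon (or the file
 * system itself) reads the hint back and re-fetches the missing pieces. */
int tag_with_recovery_hint(const char *staged_path, const char *source_uri)
{
    if (setxattr(staged_path, "user.recovery.source_uri",
                 source_uri, strlen(source_uri), 0) != 0) {
        perror("setxattr");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Hypothetical staged input file and its remote source copy. */
    return tag_with_recovery_hint("/scratch/job42/input.dat",
                                  "gsiftp://source/dataset");
}
```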
Constructing Novel Storage Abstractions
- Several I/O operations:
  - Require high-speed access to storage (checkpointing TBs of data is cumbersome)
  - Often create many large transient datasets
  - Thus, we need better tools to address the storage bandwidth bottleneck
- How? Intermediate storage nodes for transient data
  - ~200 TB of aggregate memory in the PF machine
  - Abundant node-local storage
- Aggregate RAM disk: a memory-based storage abstraction built from residual memory left unused after application resource allocation
  - Redundantly mounted on PEs
  - Optimized, relaxed POSIX I/O interfaces to the memory-based storage
  - Checkpoint not just to a neighboring PE's memory, but to the aggregate resource (see the striping sketch below)
  - Similarly, node-local storage can be efficiently aggregated
- Previous work:
  - Already designed and prototyped a working storage aggregation on a LAN
  - Results: aggregate I/O bandwidth of up to 1 Gb/sec on a LAN
  - A similar architecture can be adapted for the aggregate RAM disk, with O(TB/sec) of I/O bandwidth, alleviating the storage bandwidth bottleneck!
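The striping sketch below shows the core of such an aggregation: a checkpoint buffer is scattered round-robin, RAID-0 style, across a set of storage targets. The targets here are illustrative tmpfs-backed files standing in for other nodes' residual memory; a real aggregate RAM disk would ship each stripe over the interconnect to a memory server, behind a relaxed POSIX interface.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NTARGETS    4              /* illustrative number of memory servers */
#define STRIPE_SIZE (64 * 1024)    /* assumed stripe unit */

/* Scatter `buf` round-robin across the targets, RAID-0 style:
 * stripe s goes to target s % NTARGETS at offset (s / NTARGETS) units. */
int checkpoint_striped(const char *buf, size_t len, int fds[NTARGETS])
{
    for (size_t off = 0; off < len; off += STRIPE_SIZE) {
        size_t n      = (len - off < STRIPE_SIZE) ? len - off : STRIPE_SIZE;
        size_t stripe = off / STRIPE_SIZE;
        off_t  t_off  = (off_t)(stripe / NTARGETS) * STRIPE_SIZE;
        if (pwrite(fds[stripe % NTARGETS], buf + off, n, t_off) < 0)
            return -1;
    }
    return 0;
}

int main(void)
{
    static char data[1 << 20];     /* dummy 1 MB checkpoint image */
    int  fds[NTARGETS];
    char path[64];
    for (int i = 0; i < NTARGETS; i++) {
        snprintf(path, sizeof path, "/dev/shm/ckpt.%d", i);  /* tmpfs */
        fds[i] = open(path, O_CREAT | O_WRONLY, 0600);
    }
    return checkpoint_striped(data, sizeof data, fds);
}
```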
Seamless Data Pathway
- A transparent data pipeline that ties together disparate storage elements in the machine room: parallel file systems, new storage abstractions, mass storage archives, and even remote data sources!
- Concerted use of these storage elements can help:
  - Improve data availability
  - Alleviate the I/O bandwidth bottleneck in several day-to-day storage operations
- Use cases:
  - Online recovery of staged job-input data lost to I/O node failure or staging error
  - Eager offloading of result data to its destination to avoid purging errors
  - Lazy migration of checkpoint images from intermediate storage abstractions to stable storage or remote destinations through the data pathway
  - Even offer a collective storage front: a parallel file system + archive (see the failover sketch below)
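A minimal sketch of such a collective storage front follows: a dataset is resolved through an ordered pathway of storage elements, falling over to the next element when one fails or has purged the data. The mount points are illustrative assumptions, and the final remote tier would in practice invoke a bulk-transfer tool rather than a plain open.

```c
#include <stdio.h>

/* Ordered storage pathway, fastest first; paths are illustrative. */
static const char *pathway[] = {
    "/pfs",         /* parallel file system             */
    "/freeloader",  /* intermediate / client-side cache */
    "/archive",     /* mass storage archive             */
};

FILE *pathway_open(const char *dataset)
{
    char path[4096];
    for (size_t i = 0; i < sizeof pathway / sizeof *pathway; i++) {
        snprintf(path, sizeof path, "%s/%s", pathway[i], dataset);
        FILE *f = fopen(path, "rb");
        if (f)
            return f;  /* first element that holds the data wins */
        /* Element failed, or the data was purged: fail over. */
    }
    return NULL;       /* last resort: re-fetch from the remote source */
}
```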
Online Data Recovery
- Why?
  - Standard data availability techniques are designed with persistent data in mind
  - RAID reconstruction is time consuming: a 160 GB disk takes O(dozens of minutes)
  - RAID techniques do not help with partitions caused by I/O node failures
  - Replication can consume too much space, and popular parallel file systems (Lustre, PVFS) do not support it
  - We need novel mechanisms for "transient data availability" that complement existing approaches!
- What makes it feasible?
  - Natural data redundancy in the staged job data; job input data is usually immutable
  - Network costs drop drastically every year, and bulk transfer tools keep improving
  - Support for partial data fetches
- How?
  - The parallel file system uses the "recovery metadata" to proactively fetch pieces of the missing staged data
  - Employ multiple nodes for parallel patching to hide latency and improve application throughput (see the sketch below)
  - Perform bulk, remote, collective I/O and rearrange locally to match the local striping policy
- Preliminary results:
  - Large remote I/O requests (256 MB) and local shuffling can be overlapped with client activity
  - Data reconstruction from ORNL to PSC achieves good scaling in parallelism
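A sketch of the patching arithmetic appears below. With round-robin striping over N I/O nodes, the stripes lost with failed node f are exactly f, f+N, f+2N, and so on; each of several patcher processes claims a share of them and re-fetches the corresponding byte ranges from the source named in the recovery metadata. The parameters are illustrative, and fetch_remote_range() is a hypothetical helper around a partial-get transfer tool.

```c
#include <stdint.h>

/* Hypothetical helper: partial remote get of [off, off+len) from `uri`. */
extern void fetch_remote_range(const char *uri, uint64_t off, uint64_t len);

void patch_failed_node(const char *recovery_uri, uint64_t file_size,
                       uint64_t stripe_size, int n_ionodes,
                       int failed_node, int patcher_rank, int npatchers)
{
    uint64_t nstripes = (file_size + stripe_size - 1) / stripe_size;

    /* Walk the failed node's stripes; deal them round-robin to patchers. */
    uint64_t idx = 0;
    for (uint64_t s = (uint64_t)failed_node; s < nstripes;
         s += (uint64_t)n_ionodes, idx++) {
        if (idx % (uint64_t)npatchers != (uint64_t)patcher_rank)
            continue;
        uint64_t off = s * stripe_size;
        uint64_t len = (off + stripe_size <= file_size)
                           ? stripe_size : file_size - off;
        /* Large remote requests (e.g., 256 MB) amortize latency and can
         * be overlapped with the application's foreground activity. */
        fetch_remote_range(recovery_uri, off, len);
    }
}
```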
Eager Offloading of Result-data
- Offloading result data is equally important for local visualization and interpretation
- Storage system failure and purging of scratch space can cause loss of result data
- Eager offloading: equivalent to data reconstruction
  - Transparent data migration using "sink"/destination metadata supplied as part of job submission
  - Data offloading can be overlapped with computation (see the sketch below)
  - Can fail over to intermediate storage/archives for planned transfers in the future
- Needs coordination with the parallel file system and job management tools
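A minimal sketch of the overlap follows: the application enqueues each result file as it is closed, and a helper thread drains the queue to the "sink" named in the job-submission metadata while computation continues. offload_to_sink() is a hypothetical wrapper around a bulk-transfer tool, and the fixed-size ring buffer omits overflow handling for brevity.

```c
#include <pthread.h>

#define MAXQ 128
static const char *queue[MAXQ];
static int head, tail;   /* ring buffer; overflow check omitted */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

/* Hypothetical wrapper around a bulk-transfer tool. */
extern void offload_to_sink(const char *path, const char *sink_uri);

/* Started once at job launch:
 *   pthread_create(&t, NULL, offloader, (void *)sink_uri); */
static void *offloader(void *sink_uri)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail)
            pthread_cond_wait(&ready, &lock);
        const char *path = queue[head++ % MAXQ];
        pthread_mutex_unlock(&lock);
        offload_to_sink(path, sink_uri);  /* overlapped with computation */
    }
    return NULL;
}

/* The application calls this as each result file is closed. */
void result_file_done(const char *path)
{
    pthread_mutex_lock(&lock);
    queue[tail++ % MAXQ] = path;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}
```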
FreeLoader: Improving End-User Data Delivery with Client-Side Collaborative Caching
- Problem space:
  - Data deluge: increasing dataset sizes (NIH, SDSS, SNS, TSI)
  - Locality of interest: collaborating scientists routinely analyze and visualize these datasets
  - The desktop is an integral part: end users consume data locally for ever-increasing processing power, convenience, and control, but are limited by secondary storage capabilities
- Enabling trends:
  - Unused storage: more than 50% of desktop storage is unused
  - Immutable data: data is usually write-once read-many, with remote source copies
  - Connectivity: well-connected, secure LAN settings
- FreeLoader aggregate storage cache:
  - Scavenges O(GB) contributions from desktops
  - A parallel I/O environment across loosely connected workstations, aggregating I/O as well as network bandwidth (see the sketch below)
  - NOT a file system, but a low-cost, local storage solution enabling client-side caching and locality
  - http://www.csm.ornl.gov/~vazhkuda/Morsels
[Figure: end-user data delivery throughput (MB/sec, 0-120) vs. dataset size (512 MB, 4 GB, 32 GB, 64 GB) for FreeLoader, PVFS, HPSS-Hot, HPSS-Cold, RemoteNFS, and wget-ncbi]
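The bandwidth aggregation at the heart of the cache can be sketched as follows: a dataset is scattered in fixed-size "morsels" across benefactor workstations, and the client fetches from all benefactors concurrently so that their network links and disks add up. benefactor_get() is a hypothetical RPC that retrieves one morsel, and the counts are illustrative.

```c
#include <pthread.h>

#define NBENEFACTORS 8           /* illustrative number of donor desktops */
#define MORSEL_SIZE  (1 << 20)   /* assumed morsel size */

/* Hypothetical RPC: fetch one morsel from a benefactor into `dst`. */
extern void benefactor_get(int benefactor, unsigned long morsel, char *dst);

struct fetch { int benefactor; unsigned long nmorsels; char *base; };

static void *fetch_worker(void *arg)
{
    struct fetch *f = arg;
    /* Benefactor b holds morsels b, b+N, b+2N, ... (round-robin). */
    for (unsigned long i = 0; i < f->nmorsels; i++) {
        unsigned long m = (unsigned long)f->benefactor + i * NBENEFACTORS;
        benefactor_get(f->benefactor, m, f->base + m * MORSEL_SIZE);
    }
    return NULL;
}

/* Pull every morsel of a dataset, one thread per benefactor. */
void fetch_dataset(char *buf, unsigned long total_morsels)
{
    pthread_t    tid[NBENEFACTORS];
    struct fetch f[NBENEFACTORS];
    for (int b = 0; b < NBENEFACTORS; b++) {
        f[b].benefactor = b;
        f[b].nmorsels   = total_morsels / NBENEFACTORS
                        + (b < (int)(total_morsels % NBENEFACTORS) ? 1 : 0);
        f[b].base       = buf;
        pthread_create(&tid[b], NULL, fetch_worker, &f[b]);
    }
    for (int b = 0; b < NBENEFACTORS; b++)
        pthread_join(tid[b], NULL);
}
```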
Putting It All Together…
[Architecture diagram: a supercomputer center (compute nodes, interconnection network, I/O nodes, parallel file system) tied to archives, data caches (FreeLoader, IBP), and end-user/mirror sites. Data is staged for execution at the supercomputer site with metadata for failover, online reconstruction, and recovery; on failure, data access fails over to the source copy of the dataset (gsiftp://source/dataset) or a mirror site (http://mirror/dataset). When offloading result data fails due to end-resource unavailability, it fails over to nearby caches. Arrows in the legend distinguish failover, failure/unavailability, and staging/offloading paths.]