OAK RIDGE NATIONAL LABORATORY / U.S. DEPARTMENT OF ENERGY
I/O Virtualization: Robust Storage Management in the Machine-Room and Beyond
Sudharshan Vazhkudai
Oak Ridge National Laboratory
Contributor: Xiaosong Ma (NCSU)
Virtualization in HPC
Nashville, TN
September 20th, 2006
Problem Space: Petascale Storage Crisis (1)
- Scaling to a 1 PF system creates unique storage challenges in terms of both "availability" and "performance"
- Availability: storage failure is a significant contributor to system downtime
  - "when averaged over one month, 90% of jobs submitted to the system should complete without having to be resubmitted as a result of failure" -- NSF Petascale computing solicitation (NSF 06-573)
- Performance: several DOE applications (GYRO, POP, TSI) demand sustained I/O throughput on the order of GB/sec to a TB/sec
  - Bandwidth mismatch in diverse I/O operations such as staging, offloading, checkpointing, prefetching, and end-user data delivery
  - Storage performance improves more slowly than CPU performance
  - User data needs grow faster than available compute power
Problem Space: Petascale Storage Crisis (2)
- Increasing number of processors per I/O node: failure is the norm, not the exception!

    System          # CPUs   # I/O Nodes   Ratio
    Cray Red Storm  10368    256           41:1
    Blue Gene/L     65536    1024          64:1

- Macroscopic view:

    System       # CPUs   MTBF/I           Outage Source
    ASCI Q       8192     6.5 hrs          Storage, CPU
    ASCI White   8192     40 hrs           Storage, CPU
    Google       15000    20 reboots/day   Storage, mem

- Microscopic view (from both commercial and HPC centers):
  - 3% to 7% of disks, 3% to 16% of controllers, and up to 12% of SAN switches fail per year
  - That is 10 times the rate expected from the disk vendor specification sheets
- We are well aware of the memory wall; data balance rears its head on the way to petascale
  - Implied disk need is 1,000,000 disks [Bell, Gray, Szalay, CACM Jan. 2006]
  - The fraction of system cost devoted to the storage subsystem is large: 1,000,000 disks at $100/disk puts ~50% of system cost in storage
  - The power requirement for storage is also large: 50 W per disk implies 50 MW of power for 1,000,000 disks (cooling is extra)
- Failure rates are only bound to grow manifold! The brute-force approach of simply applying more funds to match the computational scale will not work!
Problem Space: Petascale Storage Crisis (3)
- Suboptimal center operations due to failure and bandwidth mismatch
- Uptime is low because of resubmits
  - Failures due to staging errors and erroneous purging of result data
  - Data staging is stealing center performance
- Uptime is low because checkpoints and restarts are expensive
  - Checkpointing by an application on 100 thousand processors: >100,000 files per timestep; millions of files per run; ~several TB of data; I/O can take O(minutes)!
  - "we need to find a way to avoid full file system complexity in order to save checkpoints and system communications" -- SciDAC 2006, Alan Gara, Blue Gene architect, IBM
- No intelligent prefetching beyond file system buffering
  - A wealth of application access pattern information is ignored
- End-user data delivery is still an open area of research despite high-speed transfer tools and networks!
Approach
- If you cannot afford a balanced system, develop management strategies to compensate
- Exploit opportunities throughout the HEC I/O stack:
  - Application level
  - Parallel file system
  - Many unused resources: memory, cluster node-local storage, idle desktop storage (both in the machine room and on the client side)
  - Disparate storage entities, including archives and remote sources
- Concerted use of the above can be brought to bear on urgent supercomputing issues such as staging, offloading, prefetching, checkpointing, data recovery, and the I/O bandwidth bottleneck
I/O Access Virtualization Stack
- Application level: access patterns; hints; metadata specification (recovery hints, delivery/performance constraints, jitter/failure tolerance, automatic capture?); well-defined, simple interfaces to the functionality below
- Parallel file system I/O access optimizations: prefetching; data reorganization; augmenting file systems with intelligent metadata that aids data availability and narrows the performance gap
- I/O virtualization middleware:
  - Construct new storage abstractions: aggregate RAM disk; aggregate node-local storage as a cache for checkpointing or staging; relaxed POSIX I/O access
  - Seamless data pathway: to the above storage abstractions, archives, and remote storage; online data reconstruction/recovery; eager result-data offloading; lazy/planned migrations; optimizations such as collective operations and data sieving
- End-user data delivery: client-side caches, intermediate cache overlays (e.g., FreeLoader, IBP)
Application-aware prefetching and caching
- Prefetching and caching are increasingly important for performance: I/O speed falls further and further behind CPU and memory
- Scientific applications are full of hints regarding access patterns!
  - MPI-IO file views that define the range of accesses
  - Repeated (iterative) behavior is common in timestep simulations
- Smart prefetching/caching through the parallel file system or I/O library (see the sketch below)
  - Automatic access pattern analysis through MPI-IO calls
  - An automatic "learning" process through the initial timesteps
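Below is a minimal sketch of the kind of prefetching such a library could perform once the iterative pattern is learned: each rank issues an asynchronous MPI-IO read-ahead for its predicted next-timestep region and overlaps it with computation. The file name, the fixed per-timestep stride, and the step count are illustrative assumptions; a real implementation would infer the stride automatically from the MPI_File_set_view() and read calls observed during the first few timesteps.

```c
#include <mpi.h>
#include <stdlib.h>

#define TS_BYTES (1 << 20)  /* assumed bytes consumed per rank per timestep */
#define NSTEPS   10         /* illustrative number of timesteps */

int main(int argc, char **argv)
{
    MPI_File    fh;
    MPI_Request pre_req;
    int         rank, nprocs;
    char       *cur  = malloc(TS_BYTES);
    char       *next = malloc(TS_BYTES);

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    MPI_File_open(MPI_COMM_WORLD, "timestep.dat",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

    /* Timestep 0 is read synchronously; it also "teaches" the pattern. */
    MPI_File_read_at(fh, (MPI_Offset)rank * TS_BYTES, cur, TS_BYTES,
                     MPI_BYTE, MPI_STATUS_IGNORE);

    for (int t = 1; t < NSTEPS; t++) {
        /* Prefetch: the repeated pattern predicts that this rank's data
         * for timestep t sits one whole-application stride further on. */
        MPI_Offset off = ((MPI_Offset)t * nprocs + rank) * TS_BYTES;
        MPI_File_iread_at(fh, off, next, TS_BYTES, MPI_BYTE, &pre_req);

        /* ... compute on `cur` while the read-ahead is in flight ... */

        MPI_Wait(&pre_req, MPI_STATUS_IGNORE);
        char *tmp = cur; cur = next; next = tmp;  /* swap buffers */
    }

    MPI_File_close(&fh);
    free(cur);
    free(next);
    MPI_Finalize();
    return 0;
}
```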
Augmenting File System Metadata: Recovery Hints for Fault Tolerance
- Embed recovery metadata about transient job data into the parallel file system: extend file system metadata to include recovery hints
- Specification of rich metadata (recovery hints):
  - A persistent copy of the job input data is usually held remotely
  - Information regarding the "sources" and "sinks" of a user's job data becomes an integral part of the transient data on the supercomputer
  - Sample metadata can include URIs, credentials, etc.
  - Metadata is specified as part of end-user job submission
- Enables elegant, automatic "recovery" and "offloading" without manual intervention (see the sketch below)
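As a concrete illustration, a minimal sketch follows that attaches a recovery hint to a staged input file. The slide proposes extending parallel file system metadata itself; this sketch stands in with Linux extended attributes, and the attribute name, file path, and URI are illustrative assumptions.

```c
#include <stdio.h>
#include <string.h>
#include <sys/xattr.h>   /* Linux extended attributes as a stand-in */

/* Tag staged job-input data with the URI of its persistent remote copy.
 * On I/O-node failure or a staging error, a recovery daemon (or the file
 * system itself) reads the hint back and re-fetches the missing pieces. */
int tag_with_recovery_hint(const char *staged_path, const char *source_uri)
{
    if (setxattr(staged_path, "user.recovery.source_uri",
                 source_uri, strlen(source_uri), 0) != 0) {
        perror("setxattr");
        return -1;
    }
    return 0;
}

int main(void)
{
    /* Hypothetical staged input file and its remote source copy. */
    return tag_with_recovery_hint("/scratch/job42/input.dat",
                                  "gsiftp://source/dataset");
}
```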
Constructing Novel Storage Abstractions
- Several I/O operations:
  - Require high-speed access to storage (checkpointing TBs of data is cumbersome)
  - Often create many large transient datasets
  - Thus, we need better tools to address the storage bandwidth bottleneck
- How? Intermediate storage nodes for transient data
  - ~200 TB of aggregate memory in the PF machine
  - Abundant node-local storage
- Aggregate RAM disk: a memory-based storage abstraction built from residual memory left unused after application resource allocation
  - Redundantly mounted on PEs
  - Optimized, relaxed POSIX I/O interfaces to the memory-based storage
  - Checkpoint not just to a neighboring PE's memory, but to the aggregate resource (see the striping sketch below)
  - Similarly, node-local storage can be efficiently aggregated
- Previous work:
  - Already designed and prototyped a working storage aggregation on a LAN
  - Results: aggregate I/O bandwidth of up to 1 Gb/sec on a LAN
  - A similar architecture can be adapted for the aggregate RAM disk, with O(TB/sec) of I/O bandwidth, alleviating the storage bandwidth bottleneck!
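The striping sketch below shows the core of such an aggregation: a checkpoint buffer is scattered round-robin, RAID-0 style, across a set of storage targets. The targets here are illustrative tmpfs-backed files standing in for other nodes' residual memory; a real aggregate RAM disk would ship each stripe over the interconnect to a memory server, behind a relaxed POSIX interface.

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define NTARGETS    4              /* illustrative number of memory servers */
#define STRIPE_SIZE (64 * 1024)    /* assumed stripe unit */

/* Scatter `buf` round-robin across the targets, RAID-0 style:
 * stripe s goes to target s % NTARGETS at offset (s / NTARGETS) units. */
int checkpoint_striped(const char *buf, size_t len, int fds[NTARGETS])
{
    for (size_t off = 0; off < len; off += STRIPE_SIZE) {
        size_t n      = (len - off < STRIPE_SIZE) ? len - off : STRIPE_SIZE;
        size_t stripe = off / STRIPE_SIZE;
        off_t  t_off  = (off_t)(stripe / NTARGETS) * STRIPE_SIZE;
        if (pwrite(fds[stripe % NTARGETS], buf + off, n, t_off) < 0)
            return -1;
    }
    return 0;
}

int main(void)
{
    static char data[1 << 20];     /* dummy 1 MB checkpoint image */
    int  fds[NTARGETS];
    char path[64];
    for (int i = 0; i < NTARGETS; i++) {
        snprintf(path, sizeof path, "/dev/shm/ckpt.%d", i);  /* tmpfs */
        fds[i] = open(path, O_CREAT | O_WRONLY, 0600);
    }
    return checkpoint_striped(data, sizeof data, fds);
}
```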
Seamless Data Pathway
- A transparent data pipeline that ties together disparate storage elements in the machine room: parallel file systems, new storage abstractions, mass storage archives, and even remote data sources!
- Concerted use of these storage elements can help:
  - Improve data availability
  - Alleviate the I/O bandwidth bottleneck in several day-to-day storage operations
- Use cases:
  - Online recovery of staged job-input data lost to I/O node failure or staging error
  - Eager offloading of result data to its destination to avoid purging errors
  - Lazy migration of checkpoint images from intermediate storage abstractions to stable storage or remote destinations through the data pathway
  - Even offer a collective storage front: a parallel file system + archive (see the failover sketch below)
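A minimal sketch of such a collective storage front follows: a dataset is resolved through an ordered pathway of storage elements, falling over to the next element when one fails or has purged the data. The mount points are illustrative assumptions, and the final remote tier would in practice invoke a bulk-transfer tool rather than a plain open.

```c
#include <stdio.h>

/* Ordered storage pathway, fastest first; paths are illustrative. */
static const char *pathway[] = {
    "/pfs",         /* parallel file system             */
    "/freeloader",  /* intermediate / client-side cache */
    "/archive",     /* mass storage archive             */
};

FILE *pathway_open(const char *dataset)
{
    char path[4096];
    for (size_t i = 0; i < sizeof pathway / sizeof *pathway; i++) {
        snprintf(path, sizeof path, "%s/%s", pathway[i], dataset);
        FILE *f = fopen(path, "rb");
        if (f)
            return f;  /* first element that holds the data wins */
        /* Element failed, or the data was purged: fail over. */
    }
    return NULL;       /* last resort: re-fetch from the remote source */
}
```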
Online Data Recovery
- Why?
  - Standard data availability techniques are designed with persistent data in mind
  - RAID reconstruction is time consuming: a 160 GB disk takes O(dozens of minutes)
  - RAID techniques do not help with partitions caused by I/O node failures
  - Replication can consume too much space, and popular parallel file systems (Lustre, PVFS) do not support it
  - We need novel mechanisms for "transient data availability" that complement existing approaches!
- What makes it feasible?
  - Natural data redundancy in the staged job data; job input data is usually immutable
  - Network costs drop drastically every year, and bulk transfer tools keep improving
  - Support for partial data fetches
- How?
  - The parallel file system uses the "recovery metadata" to proactively fetch pieces of the missing staged data
  - Employ multiple nodes for parallel patching to hide latency and improve application throughput (see the sketch below)
  - Perform bulk, remote, collective I/O and rearrange locally to match the local striping policy
- Preliminary results:
  - Large remote I/O requests (256 MB) and local shuffling can be overlapped with client activity
  - Data reconstruction from ORNL to PSC achieves good scaling in parallelism
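A sketch of the patching arithmetic appears below. With round-robin striping over N I/O nodes, the stripes lost with failed node f are exactly f, f+N, f+2N, and so on; each of several patcher processes claims a share of them and re-fetches the corresponding byte ranges from the source named in the recovery metadata. The parameters are illustrative, and fetch_remote_range() is a hypothetical helper around a partial-get transfer tool.

```c
#include <stdint.h>

/* Hypothetical helper: partial remote get of [off, off+len) from `uri`. */
extern void fetch_remote_range(const char *uri, uint64_t off, uint64_t len);

void patch_failed_node(const char *recovery_uri, uint64_t file_size,
                       uint64_t stripe_size, int n_ionodes,
                       int failed_node, int patcher_rank, int npatchers)
{
    uint64_t nstripes = (file_size + stripe_size - 1) / stripe_size;

    /* Walk the failed node's stripes; deal them round-robin to patchers. */
    uint64_t idx = 0;
    for (uint64_t s = (uint64_t)failed_node; s < nstripes;
         s += (uint64_t)n_ionodes, idx++) {
        if (idx % (uint64_t)npatchers != (uint64_t)patcher_rank)
            continue;
        uint64_t off = s * stripe_size;
        uint64_t len = (off + stripe_size <= file_size)
                           ? stripe_size : file_size - off;
        /* Large remote requests (e.g., 256 MB) amortize latency and can
         * be overlapped with the application's foreground activity. */
        fetch_remote_range(recovery_uri, off, len);
    }
}
```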
Eager Offloading of Result-data
- Offloading result data is equally important for local visualization and interpretation
- Storage system failure and purging of scratch space can cause loss of result data
- Eager offloading: equivalent to data reconstruction
  - Transparent data migration using "sink"/destination metadata supplied as part of job submission
  - Data offloading can be overlapped with computation (see the sketch below)
  - Can fail over to intermediate storage/archives for planned transfers in the future
- Needs coordination with the parallel file system and job management tools
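A minimal sketch of the overlap follows: the application enqueues each result file as it is closed, and a helper thread drains the queue to the "sink" named in the job-submission metadata while computation continues. offload_to_sink() is a hypothetical wrapper around a bulk-transfer tool, and the fixed-size ring buffer omits overflow handling for brevity.

```c
#include <pthread.h>

#define MAXQ 128
static const char *queue[MAXQ];
static int head, tail;   /* ring buffer; overflow check omitted */
static pthread_mutex_t lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  ready = PTHREAD_COND_INITIALIZER;

/* Hypothetical wrapper around a bulk-transfer tool. */
extern void offload_to_sink(const char *path, const char *sink_uri);

/* Started once at job launch:
 *   pthread_create(&t, NULL, offloader, (void *)sink_uri); */
static void *offloader(void *sink_uri)
{
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail)
            pthread_cond_wait(&ready, &lock);
        const char *path = queue[head++ % MAXQ];
        pthread_mutex_unlock(&lock);
        offload_to_sink(path, sink_uri);  /* overlapped with computation */
    }
    return NULL;
}

/* The application calls this as each result file is closed. */
void result_file_done(const char *path)
{
    pthread_mutex_lock(&lock);
    queue[tail++ % MAXQ] = path;
    pthread_cond_signal(&ready);
    pthread_mutex_unlock(&lock);
}
```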
FreeLoader: Improving End-User Data Delivery with Client-Side Collaborative Caching
- Problem space:
  - Data deluge: increasing dataset sizes (NIH, SDSS, SNS, TSI)
  - Locality of interest: collaborating scientists routinely analyze and visualize these datasets
  - The desktop is an integral part: end users consume data locally for ever-increasing processing power, convenience, and control, but are limited by secondary storage capabilities
- Enabling trends:
  - Unused storage: more than 50% of desktop storage is unused
  - Immutable data: data is usually write-once read-many, with remote source copies
  - Connectivity: well-connected, secure LAN settings
- FreeLoader aggregate storage cache:
  - Scavenges O(GB) contributions from desktops
  - A parallel I/O environment across loosely connected workstations, aggregating I/O as well as network bandwidth (see the sketch below)
  - NOT a file system, but a low-cost, local storage solution enabling client-side caching and locality
  - http://www.csm.ornl.gov/~vazhkuda/Morsels
[Figure: end-user data delivery throughput (MB/sec, 0-120) vs. dataset size (512 MB, 4 GB, 32 GB, 64 GB) for FreeLoader, PVFS, HPSS-Hot, HPSS-Cold, RemoteNFS, and wget-ncbi]
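The bandwidth aggregation at the heart of the cache can be sketched as follows: a dataset is scattered in fixed-size "morsels" across benefactor workstations, and the client fetches from all benefactors concurrently so that their network links and disks add up. benefactor_get() is a hypothetical RPC that retrieves one morsel, and the counts are illustrative.

```c
#include <pthread.h>

#define NBENEFACTORS 8           /* illustrative number of donor desktops */
#define MORSEL_SIZE  (1 << 20)   /* assumed morsel size */

/* Hypothetical RPC: fetch one morsel from a benefactor into `dst`. */
extern void benefactor_get(int benefactor, unsigned long morsel, char *dst);

struct fetch { int benefactor; unsigned long nmorsels; char *base; };

static void *fetch_worker(void *arg)
{
    struct fetch *f = arg;
    /* Benefactor b holds morsels b, b+N, b+2N, ... (round-robin). */
    for (unsigned long i = 0; i < f->nmorsels; i++) {
        unsigned long m = (unsigned long)f->benefactor + i * NBENEFACTORS;
        benefactor_get(f->benefactor, m, f->base + m * MORSEL_SIZE);
    }
    return NULL;
}

/* Pull every morsel of a dataset, one thread per benefactor. */
void fetch_dataset(char *buf, unsigned long total_morsels)
{
    pthread_t    tid[NBENEFACTORS];
    struct fetch f[NBENEFACTORS];
    for (int b = 0; b < NBENEFACTORS; b++) {
        f[b].benefactor = b;
        f[b].nmorsels   = total_morsels / NBENEFACTORS
                        + (b < (int)(total_morsels % NBENEFACTORS) ? 1 : 0);
        f[b].base       = buf;
        pthread_create(&tid[b], NULL, fetch_worker, &f[b]);
    }
    for (int b = 0; b < NBENEFACTORS; b++)
        pthread_join(tid[b], NULL);
}
```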
Putting It All Together…
[Architecture diagram: a supercomputer center (compute nodes, interconnection network, I/O nodes, parallel file system) tied to archives, data caches (FreeLoader, IBP), and end-user/mirror sites. Data is staged for execution at the supercomputer site with metadata for failover, online reconstruction, and recovery; on failure, data access fails over to the source copy of the dataset (gsiftp://source/dataset) or a mirror site (http://mirror/dataset). When offloading result data fails due to end-resource unavailability, it fails over to nearby caches. Arrows in the legend distinguish failover, failure/unavailability, and staging/offloading paths.]