Concurrent Write Sharing: Overcoming the Bane of File Systems
Garth Gibson
Professor, School of Computer Science, Carnegie Mellon University; Co-Founder and Chief Scientist, Panasas Inc.
Agenda
• Simple File Systems Don't Do Write Sharing
• HPC Checkpointing: N-1 versus N-N concurrent write
  • N-1 has usability advantages & performance challenges
• PLFS: Parallel Log-structured File System
  • Library represents a file as many logs of written values
  • In production at Los Alamos, showing good benefits for important apps, brilliant benefits for benchmarks
  • Eliminates write size & alignment problems
  • Read performance doesn't suffer as expected
  • Index importing needs a parallel implementation
  • Backend abstracted to enable use of unusual storage, such as HPC on Hadoop HDFS
Garth Gibson, June 19, 2013 · www.pdl.cmu.edu
[Figure: N-N file IO versus N-1 file IO]
[Figure: a parallel N-1 file striped across RAID Groups 1-3; Processes 1-4 each write their own blocks (e.g. Process 1 writes blocks 11-14) interleaved into the one shared file]
File Systems full of Locks for Consistency
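The contrast between the two patterns can be sketched in plain file I/O (file names, sizes, and serial simulation are illustrative, not an MPI checkpoint): in N-N each process appends sequentially to its own file, while in N-1 every process seeks to offsets interleaved with every other process's, which is exactly what forces a consistency-preserving file system to take locks.

```python
import os
import tempfile

BLOCK = 4        # bytes each process writes per step (illustrative)
NPROCS = 4       # number of "processes" (simulated serially here)
NSTEPS = 3       # write phases per process

def checkpoint_n_n(dirname):
    """N-N: each process writes its own file -- no shared state, no locks."""
    for rank in range(NPROCS):
        with open(os.path.join(dirname, f"ckpt.{rank}"), "wb") as f:
            for step in range(NSTEPS):
                f.write(bytes([rank]) * BLOCK)   # purely sequential appends

def checkpoint_n_1(path):
    """N-1: all processes write strided regions of one shared file.
    Every writer's offsets interleave with every other writer's, so a
    real file system must serialize them with locks."""
    with open(path, "wb") as f:
        f.truncate(NPROCS * NSTEPS * BLOCK)
        for rank in range(NPROCS):
            for step in range(NSTEPS):
                # rank r's step s lands at a stride-interleaved offset
                f.seek((step * NPROCS + rank) * BLOCK)
                f.write(bytes([rank]) * BLOCK)

with tempfile.TemporaryDirectory() as d:
    checkpoint_n_n(d)
    shared = os.path.join(d, "ckpt.shared")
    checkpoint_n_1(shared)
    with open(shared, "rb") as fh:
        data = fh.read()
    # the shared file interleaves every rank's blocks
    print([data[i * BLOCK] for i in range(NPROCS * NSTEPS)])
    # -> [0, 1, 2, 3, 0, 1, 2, 3, 0, 1, 2, 3]
```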
Cross graph comparisons not meaningful
[Figure: N-N versus N-1 concurrent-write bandwidth on LANL GPFS, PSC Lustre, and LANL PanFS; N-N reaches 1.8-4.5 GB/s while N-1 reaches only 10-100 MB/s, gaps of 45X, 90X, and 330X]
N-1 Concurrent Write Often Not Scalable
N-1 versus N-N Checkpointing
• N-N writing easier for lock-happy file systems
• But many users prefer N-1 checkpoints
  • Prefer to manage 1 file, rather than thousands+
  • Can't avoid mapping because of N-M restarts
  • >50% of LANL cycles use N-1 checkpointing
  • 2 of 8 open science apps written for Roadrunner
  • At least 8 of 23 parallel IO benchmarks, including BTIO, FLASH IO, Chombo, QCD
• Some programmers switch, but many don't
  • One app wrote 10K lines of code (bulkio) to "fix"
N-1 Write-Optimization via Log-Structuring
• 1991 LFS paper "write optimized" seeks during writes (instead of reads)
• Multiple projects emphasized eliminating seeks for checkpoint capture
  • PSC Zest "write where the head is" checkpoint FS
  • ADIOS file formatting library uses delayed-write
  • PVFS experiments embedded log ordering
• In retrospect, log-structured writing not as important as decoupling file system locks
Parallel Log-structured File System
• Open source library for FUSE, MPI-IO
  • http://github.com/plfs (released v2.4 this week)
  • Part of FAST FORWARD plan for Exascale HPC
• Big team centered on Los Alamos Nat. Lab.
  • Gary Grider, Aaron Torres, Brett Kettering, Alfred Torrez, David Shrader, David Bonnie, John Bent, Sorin Faibish, Percy Tzelnic, Uday Gupta, William Tucker, Jun He, Carlos Maltzahn, Chuck Cranor
• This talk draws heavily on LA-UR-11-11964, SC09, PDSW09, Cluster12, CMU-PDL-12-115
  • Jun He in HPDC13 today (index management)
  • Other papers in PDSW12, DISCS12, 2x MSST12
PLFS Virtual Layer
[Figure: the logical file /foo, written by host1, host2, and host3, is stored as a container directory /foo/ with per-host subdirectories host1/, host2/, host3/; each subdirectory holds per-writer data logs (data.131, data.132, data.279, data.281, data.148) plus an index file (indx) on the physical underlying parallel file system]
PLFS Decouples Logical from Physical
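The decoupling can be sketched as a toy in-memory model (class and field names are hypothetical, not the PLFS API): each writer appends only to its own log and records an index entry mapping a logical extent to a position in that log; reads are resolved through the index.

```python
from dataclasses import dataclass, field

@dataclass
class IndexEntry:
    logical_off: int   # offset in the logical (virtual) file
    length: int
    log_id: int        # which per-writer data log holds the bytes
    physical_off: int  # offset within that log

@dataclass
class PlfsLikeFile:
    """Toy model of a PLFS-style container: one append-only log per
    writer plus an index. Writers never touch each other's logs, so
    concurrent N-1 writes need no file-system locking."""
    logs: dict = field(default_factory=dict)   # log_id -> bytearray
    index: list = field(default_factory=list)

    def write(self, writer, logical_off, data):
        log = self.logs.setdefault(writer, bytearray())
        self.index.append(IndexEntry(logical_off, len(data), writer, len(log)))
        log.extend(data)                        # always an append

    def read(self, logical_off, length):
        out = bytearray(length)
        # last-writer-wins: replay index entries in write order
        for e in self.index:
            lo = max(logical_off, e.logical_off)
            hi = min(logical_off + length, e.logical_off + e.length)
            if lo < hi:
                src = self.logs[e.log_id]
                start = e.physical_off + (lo - e.logical_off)
                out[lo - logical_off:hi - logical_off] = src[start:start + (hi - lo)]
        return bytes(out)

f = PlfsLikeFile()
f.write(1, 0, b"AAAA")   # writer 1 writes logical bytes 0..3
f.write(2, 4, b"BBBB")   # writer 2 writes logical bytes 4..7 concurrently
f.write(1, 2, b"CC")     # an overwrite is just another append + index entry
print(f.read(0, 8))      # -> b'AACCBBBB'
```

Note the design point the slides make: every physical write is an append to a private log, so write size and alignment stop mattering to the backend.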
[Figure: with PLFS, N-1 write bandwidth matches N-N bandwidth on GPFS, PSC Lustre, and LANL PanFS, with speedups of 8X, 25X, and 100X in a simple test (3000X at scale), reaching 31 GB/s; note CU boundaries and IO node limits]
N-1 @ N-N BW in Simple Test
FLASH IO on Roadrunner
[Figure: FLASH IO checkpoint bandwidth on Roadrunner: 150X speedup with PLFS]
LANL1: 2X better than hand tuned library
[Figure: LANL1 checkpoint bandwidth: 2X-4X speedups with PLFS]
LANL3: More success in production
[Figure: LANL3 production checkpoint bandwidth rising from 300 MB/s to 6.0 GB/s (20X) with PLFS, and perhaps to 8.5 GB/s (28X?)]
PLFS Checkpoint BW Summary
[Figure: per-application checkpoint bandwidth speedups with PLFS: 2X, 5X, 7X, 23X, 28X, 83X, and 150X]
[Figure: write bandwidth with and without PLFS for stripe-aligned, block-aligned, and unaligned writes]
Does not Suffer "Alignment" Preferences
PLFS Virtual Layer
[Figure: the same /foo container, but the per-host subdirectories with their data logs and index files are now distributed across federated backend volumes /Vol1, /Vol2, /Vol3 of the physical underlying parallel file system]
Distribute over Federated Backends
Read Bandwidth
[Figure: read bandwidth for uniform restart, non-uniform restart, and archiving workloads]
• Wrote increasing addresses
• Reads increasing addresses
• Result is a mergesort with deep prefetch (one per log)
• So write optimized is also read optimized!
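The read path above can be sketched with the standard library's k-way merge (the per-log index contents here are hypothetical examples): each log's index is already sorted because its writer wrote increasing addresses, so merging the indexes yields one sequential logical stream, with one prefetch stream per log.

```python
import heapq

# Hypothetical per-log indexes: (logical_offset, length) entries, each
# list already sorted because every writer wrote increasing addresses.
log_indexes = [
    [(0, 4), (12, 4), (24, 4)],   # log from writer 0
    [(4, 4), (16, 4), (28, 4)],   # log from writer 1
    [(8, 4), (20, 4), (32, 4)],   # log from writer 2
]

# A sequential read of the logical file is a k-way mergesort of the
# per-log indexes; a real reader keeps one deep prefetch outstanding
# per log, since each log is consumed strictly in order.
merged = list(heapq.merge(*log_indexes))
offsets = [off for off, _ in merged]
print(offsets)   # -> [0, 4, 8, 12, 16, 20, 24, 28, 32]
```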
Open for Read does N Squared Index Reading
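The quadratic cost is easy to count in a toy model (a sketch of the problem, not the PLFS implementation): with N writer indexes and N reading processes, a naive open has every reader fetch every index, while a parallelized build has each reader fetch one index and exchange the parsed entries.

```python
N = 8  # writers == reading processes (illustrative scale)

index_files = [f"host{w}/indx" for w in range(N)]  # one index per writer

# Naive open-for-read: every reader builds the full global index
# itself, so each of the N readers reads all N index files.
naive = 0
for reader in range(N):
    for idx in index_files:
        naive += 1          # one index-file read
# -> N * N index reads in total

# Parallelized construction: each reader reads exactly one index file,
# then the parsed entries are exchanged in memory (e.g. an all-gather),
# so the file system sees only N index reads in total.
parallel = 0
for reader in range(N):
    parallel += 1           # reader r reads index_files[r] only

print(naive, parallel)      # -> 64 8
```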
Read optimizations
Abstracted Backend Storage
• PLFS uses abstracted backend storage
  • E.g. an HPC app with PLFS can run on a cloud with non-POSIX HDFS as the native file system
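A minimal sketch of what such a backend abstraction can look like (interface and class names are hypothetical, not PLFS's actual API): the log-structured layer only ever appends and reads back, so a narrow store interface suffices, and an append-only store like HDFS can sit behind it just as well as a POSIX directory.

```python
import os
import tempfile
from abc import ABC, abstractmethod

class BackendStore(ABC):
    """Narrow storage interface a PLFS-like layer could write through.
    Append-only semantics are enough for log-structured writes, which
    is why even a non-POSIX store such as HDFS can serve as backend."""
    @abstractmethod
    def append(self, name, data): ...
    @abstractmethod
    def read_all(self, name): ...

class PosixStore(BackendStore):
    """Backend over an ordinary POSIX directory."""
    def __init__(self, root):
        self.root = root
    def append(self, name, data):
        with open(os.path.join(self.root, name), "ab") as f:
            f.write(data)
    def read_all(self, name):
        with open(os.path.join(self.root, name), "rb") as f:
            return f.read()

class MemoryStore(BackendStore):
    """In-memory stand-in for a remote store (e.g. an HDFS client)."""
    def __init__(self):
        self.objs = {}
    def append(self, name, data):
        self.objs[name] = self.objs.get(name, b"") + data
    def read_all(self, name):
        return self.objs[name]

def write_log(store, name):
    # the PLFS-like layer only appends, so any BackendStore works
    store.append(name, b"checkpoint-")
    store.append(name, b"data")
    return store.read_all(name)

with tempfile.TemporaryDirectory() as d:
    posix_result = write_log(PosixStore(d), "data.131")
mem_result = write_log(MemoryStore(), "data.131")
print(posix_result == mem_result == b"checkpoint-data")  # -> True
```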
PLFS Summary and Futures
• N-1 checkpoints feature intensive concurrent writing
• File systems choke on consistency-preserving locks
• PLFS decouples concurrency with per-processor logs
  • Order-of-magnitude and larger wins for taking a checkpoint
  • Insensitive to ideal write sizes or alignments
  • Typical reading also faster because it mergesorts logs
  • Index construction on read can be parallelized
• Ongoing work with PLFS
  • N-N checkpointing benefits from hashing logs over federated backend storage systems: an easy way to scalable metadata throughput
  • Burst buffers using NAND flash for faster checkpoints can be managed by PLFS
  • In-burst-buffer local processing needs exposed structure