![Page 1: Operating System Support for Space Allocation in Grid Storage Systems Douglas Thain University of Notre Dame IEEE Grid Computing, Sep 2006](https://reader035.vdocuments.mx/reader035/viewer/2022062315/56649c7b5503460f9492f383/html5/thumbnails/1.jpg)
Operating System Support for Space Allocation in Grid Storage Systems
Douglas Thain
University of Notre Dame
IEEE Grid Computing, Sep 2006
Bad News:
Many large distributed systems fall to pieces under heavy load!
Example: Grid3 (OSG)
Robert Gardner, et al. (102 authors), The Grid3 Production Grid: Principles and Practice, IEEE HPDC 2004
The Grid2003 Project has deployed a multi-virtual organization, application-driven grid laboratory
that has sustained for several months the production-level services required by…
ATLAS, CMS, SDSS, LIGO…
Grid2003: The Details
• The good news:
– 27 sites with 2800 CPUs
– 40,985 CPU-days provided over 6 months
– 10 applications with 1300 simultaneous jobs
• The bad news on ATLAS jobs:
– 40-70 percent utilization
– 30 percent of jobs would fail.
– 90 percent of failures were site problems.
– Most site failures were due to disk space!
A Thought Experiment
[Figure: a job of 1,000,000 tasks runs on many CPUs; each task reads its input from, and writes its output to, a shared disk.]
1 – Only a problem when load > capacity.
2 – Grids are employed by users with infinite needs!
Need Space Allocation
• Grid storage managers:
– SRB: Storage Resource Broker at SDSC.
– SRM: Storage Resource Manager at LBNL.
– NeST: Networked Storage at UW-Madison.
– IBP: Internet Backplane Protocol at UTK.
• But they get no help from the OS.
– A runaway log file can invalidate the careful accounting of the grid storage manager.
Outline
• Grids Need OS Support for Allocation
• A Model of Space Allocation
• Three Implementations
– User-Level Library
– Loopback Devices
– AllocFS: Kernel Filesystem
• Application to a Cluster
A Model of Space Allocation
[Figure: a tree of nested allocations. root contains jobs and home; jobs contains j1 and j2, home contains alice and betty, and further nodes are labeled data and core. Each node is annotated with its size and used space over several snapshots (e.g., root: size 1000 GB, used 0 GB, then 100 GB, then 700 GB); other nodes show sizes of 100 GB, 500 GB and 10 GB.]
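Three commands:
mkalloc (dir) (size) – create a new allocation
lsalloc (dir) – list allocations and their usage
rm -rf (dir) – delete an allocation and its contents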
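The model can be made concrete with a small in-memory sketch. It assumes, in line with the lsalloc example on the Kernel Implementation slide (where the parent's USED figure includes adir's full 25M), that a sub-allocation's whole size counts as used in its parent, and that file data is charged only to the innermost allocation. The class and method names are hypothetical stand-ins for the three commands, and permission checks are omitted.

```python
import errno


class Alloc:
    """A node in a hypothetical in-memory model of nested space allocations."""

    def __init__(self, name, size, parent=None):
        self.name = name
        self.size = size          # reserved space (any unit, e.g. GB)
        self.used = 0             # file data plus full sizes of child allocations
        self.parent = parent
        self.children = {}

    def path(self):
        if self.parent is None:
            return self.name
        return f"{self.parent.path()}/{self.name}"

    def mkalloc(self, name, size):
        """Create a sub-allocation; it must fit in this allocation's free space."""
        if name in self.children:
            raise FileExistsError(errno.EEXIST, "allocation exists", name)
        if self.used + size > self.size:
            raise OSError(errno.ENOSPC, "not enough free space", self.path())
        self.used += size         # the whole reservation counts against the parent
        child = Alloc(name, size, parent=self)
        self.children[name] = child
        return child

    def write(self, amount):
        """Charge ordinary file data to this (innermost) allocation only."""
        if self.used + amount > self.size:
            raise OSError(errno.ENOSPC, "allocation full", self.path())
        self.used += amount

    def rm_rf(self):
        """Delete this allocation; its full size is returned to the parent."""
        if self.parent is not None:
            del self.parent.children[self.name]
            self.parent.used -= self.size
            self.parent = None

    def lsalloc(self):
        """Yield (used, size, path) for this allocation and all below it."""
        yield self.used, self.size, self.path()
        for child in self.children.values():
            yield from child.lsalloc()
```

For example, creating jobs (100) under root (1000), then j1 (10) under jobs, and writing 5 into j1 leaves used counters of 100, 10 and 5: each level sees only its direct children's reservations, so no write ever has to walk the whole tree.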
No Built-In Allocation Policy
• In order to make an allocation:
– Must have permission to mkdir.
– New allocation must fit in available space.
• Need something more complex?
– Check a remote database for a global quota?
– Delete an allocation after a certain time?
– Send email when an allocation is full?
• Use a storage manager at a higher level.
– SRB, SRM, NeST, IBP, etc.
No Built-In Allocation Policy
[Figure: a request ("need 10 GB") goes to the grid storage manager, which applies its own policy (check database, charge credit card, consult human...), creates the allocation and makes it writeable by alice, then replies "ok, use jobs/j5". Inside jobs (100 GB), used space rises from 10 GB to 20 GB as the new 10 GB allocation j5 joins j4. The user then works in j5 through ordinary file access and may subdivide it further, e.g. into task1 and task2 of 5 GB each.]
mkalloc /jobs/j5 10GB
setacl /jobs/j5 alice write
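The division of labor above (policy in the storage manager, mechanism in the OS) can be sketched as a single planning step. This is a hypothetical illustration: `plan_allocation`, the "jN" naming scheme and the quota argument are assumptions, and the only policy modeled is a remaining-quota check standing in for "check database". The mkalloc and setacl command lines are copied from the slide; their exact syntax is not verified against any real tool.

```python
import re


def plan_allocation(user, size_gb, parent, existing, quota_left_gb):
    """Apply a (toy) site policy, then return the new allocation path and the
    commands a storage manager would run to create and share it."""
    # Site-specific policy goes here (database lookup, billing, a human...);
    # this sketch only checks a remaining-quota figure supplied by the caller.
    if size_gb > quota_left_gb:
        raise PermissionError(f"{user} has only {quota_left_gb} GB of quota left")
    # Choose the next unused job directory name: j1, j2, ...
    taken = [int(m.group(1)) for name in existing
             if (m := re.fullmatch(r"j(\d+)", name))]
    path = f"{parent}/j{max(taken, default=0) + 1}"
    commands = [
        ["mkalloc", path, f"{size_gb}GB"],     # OS enforces size and fit
        ["setacl", path, user, "write"],       # hand the allocation to the user
    ]
    return path, commands
```

Returning the commands instead of executing them (for instance with `subprocess.run`) keeps the policy logic testable on a machine without allocation support.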
User Level Library
[Figure: two applications, each linked with LibAlloc, write files into allocations under root (1000 GB), jobs, j1 and j2. Each LibAlloc write takes three steps: 1 – lock/read, 2 – stat/write, 3 – write/unlock. After the writes, the allocations' used counters have risen (e.g., 2 GB used of 10 GB, 5 GB used of 100 GB).]
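A minimal sketch of that three-step write path, assuming each allocation directory keeps its counters in a small state file. The file name `.alloc_state`, its "size used" text format and byte-level accounting are hypothetical, as are the function names; the paper's actual locking scheme (including the 2-second lock cache) is not modeled. It uses POSIX `flock` through Python's `fcntl` module, so it runs on Unix only, and it charges only the innermost allocation.

```python
import errno
import fcntl
import os
import tempfile

STATE = ".alloc_state"  # hypothetical per-allocation state file: "<size> <used>"


def make_alloc(alloc_dir, size):
    """Create an allocation directory with an empty usage counter (bytes)."""
    os.makedirs(alloc_dir, exist_ok=True)
    with open(os.path.join(alloc_dir, STATE), "w") as st:
        st.write(f"{size} 0")


def alloc_write(alloc_dir, filename, offset, data):
    """Write data into alloc_dir/filename, charging any growth to the allocation."""
    with open(os.path.join(alloc_dir, STATE), "r+") as st:
        fcntl.flock(st, fcntl.LOCK_EX)                # 1: lock...
        try:
            size, used = map(int, st.read().split())  # ...and read the state
            path = os.path.join(alloc_dir, filename)
            try:
                old_len = os.stat(path).st_size       # 2: stat the file...
            except FileNotFoundError:
                old_len = 0
            growth = max(0, offset + len(data) - old_len)
            if used + growth > size:
                raise OSError(errno.ENOSPC, "allocation full", alloc_dir)
            fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
            try:
                os.pwrite(fd, data, offset)           # ...and write the data
            finally:
                os.close(fd)
            st.seek(0)                                # 3: write the new state...
            st.truncate()
            st.write(f"{size} {used + growth}")
            st.flush()
        finally:
            fcntl.flock(st, fcntl.LOCK_UN)            # ...and unlock
    return growth


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        make_alloc(tmp, 100)
        print(alloc_write(tmp, "out.dat", 0, b"x" * 60))  # 60 new bytes charged
```

Every data write now also costs a lock, a read and a rewrite of the state file, which is where the extra latency of the library approach comes from; an application that writes without going through `alloc_write` bypasses the accounting entirely.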
User Level Library
• Some details about locking: see paper.
• Applicability
– Must modify apps or servers to employ it.
– Fails if non-enabled apps interfere.
– But can be employed anywhere without privileges.
• Performance
– Optimization: cache locks until idle for 2 sec.
– At best, writes double in latency.
– At worst, shared directories ping-pong locks.
• Recovery
– fixalloc: traverses the directory structure and recomputes current allocations.
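A fixalloc-style repair pass can be sketched under the same hypothetical layout: an allocation is any directory holding a `.alloc_state` file with "size used" in bytes (the file name, format and function names are assumptions, not the tool's real interface). Files are charged to their innermost allocation, and each nested allocation is charged to its parent at its full size. The pass touches every file once, which is why recovery time grows with the number of files.

```python
import os
import tempfile

STATE = ".alloc_state"  # hypothetical per-allocation state file: "<size> <used>"


def is_alloc(d):
    return os.path.isfile(os.path.join(d, STATE))


def read_state(d):
    with open(os.path.join(d, STATE)) as st:
        size, used = map(int, st.read().split())
    return size, used


def write_state(d, size, used):
    with open(os.path.join(d, STATE), "w") as st:
        st.write(f"{size} {used}")


def _charged(d):
    """Bytes that directory d contributes to its innermost enclosing allocation."""
    total = 0
    for entry in os.scandir(d):
        if entry.name == STATE:
            continue
        if entry.is_dir(follow_symlinks=False):
            if is_alloc(entry.path):
                fixalloc(entry.path)                  # repair the nested allocation
                total += read_state(entry.path)[0]    # and charge its full size here
            else:
                total += _charged(entry.path)         # plain directory: its contents
        elif entry.is_file(follow_symlinks=False):
            total += entry.stat(follow_symlinks=False).st_size
    return total


def fixalloc(alloc_dir):
    """Recompute and store the used counter of alloc_dir and all allocations below."""
    size, _ = read_state(alloc_dir)
    used = _charged(alloc_dir)
    write_state(alloc_dir, size, used)
    return used


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        write_state(tmp, 1000, 999)                   # a stale counter
        with open(os.path.join(tmp, "f"), "wb") as f:
            f.write(b"x" * 100)
        print(fixalloc(tmp))                          # 100
</test>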
Loopback Filesystems
[Figure: inside root (1000 GB), a 100 GB file is mounted as the jobs filesystem, which holds j1 and j2 (e.g., a 10 GB allocation).]
dd if=/dev/zero of=/jobs.fs bs=1M count=102400
losetup /dev/loopN /jobs.fs
mke2fs /dev/loopN
mount /dev/loopN /jobs
Loopback Filesystems
• Applicability
– Works with any standard application.
– Must be root to deploy and manage allocations.
– Limited to approximately 10-100 allocations.
• Performance
– Ordinary reads and writes: no overhead.
– Allocations: must touch every block to reserve!
– Massively increases I/O traffic to disk.
• Recovery
– Must scan the hierarchy, then fsck and mount every allocation.
– Disastrous for large file systems!
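AllocFS: Kernel-Level Filesystem
Inode Table
| # | uid | size | used | parent |
|---|-----|------|------|--------|
| 2 | 0 | 1000 GB | 700 GB | 2 |
| 3 | 0 | 100 GB | 99 GB | 2 |
| 4 | 34 | 10 GB | 5 GB | 3 |
| 5 | 34 | | | 4 |
| 6 | 56 | | | 3 |
| 7 | 56 | | | 7 |
[Figure: the directory tree root, jobs, j1, j2 and two files, each entry pointing to its inode (2-7) in the table above.]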
1 – To update allocation state, update fields in the in-core inode.
2 – To create or delete an allocation, update the parent's allocation state, which is already cached for other reasons.
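A small model of the inode table, loaded with the slide's first rows (sizes in GB), shows why the kernel bookkeeping is cheap: each operation updates fields of one or two in-core inodes. The field names follow the table; the rule that a non-allocation inode's parent field names the allocation it is charged to is an assumption read off the table, and the function names are hypothetical rather than AllocFS kernel code.

```python
import errno
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Inode:
    uid: int
    parent: int
    size: Optional[int] = None   # set only on inodes that are allocations
    used: int = 0


def charge(table: Dict[int, Inode], ino: int, amount: int) -> None:
    """Charge a write to the inode itself if it is an allocation, otherwise
    to the allocation named by its parent field (an assumption)."""
    node = table[ino]
    alloc = node if node.size is not None else table[node.parent]
    if alloc.used + amount > alloc.size:
        raise OSError(errno.ENOSPC, "allocation full")
    alloc.used += amount           # one in-core field update


def create_alloc(table: Dict[int, Inode], parent_ino: int, new_ino: int,
                 uid: int, size: int) -> None:
    """Touches two inodes: the new allocation and its parent."""
    parent = table[parent_ino]
    if parent.used + size > parent.size:
        raise OSError(errno.ENOSPC, "parent allocation too small")
    parent.used += size
    table[new_ino] = Inode(uid=uid, parent=parent_ino, size=size)


def delete_alloc(table: Dict[int, Inode], ino: int) -> None:
    """Return a deleted allocation's full size to its parent."""
    node = table.pop(ino)
    table[node.parent].used -= node.size


# The first rows of the slide's table (uid, parent, size, used).
table = {
    2: Inode(0, 2, 1000, 700),
    3: Inode(0, 2, 100, 99),
    4: Inode(34, 3, 10, 5),
    5: Inode(34, 4),
}
```

Writing 1 GB through inode 5 raises inode 4's used field from 5 to 6 GB; creating a 1 GB allocation under inode 3 fills it to exactly 100 GB, so a second one is refused.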
AllocFS: Kernel-Level Filesystem
• Applicability
– Works with any ordinary application.
– Must load the module and be root to install.
– Binary compatible with existing EXT2 filesystems.
– Once loaded, ordinary users may employ it.
• Performance
– No measurable overhead on I/O.
– Creating an allocation: touch two inodes.
– Deleting an allocation: same as deleting a directory.
• Recovery
– fixalloc: traverses the directory structure and recomputes current allocations.
Library Adds Latency
Allocation Performance
• Loopback Filesystem
– 1 second per 25 MB of allocation (about 40 sec/GB).
– Must touch every single block.
– Big increase in unnecessary I/O traffic!
• Allocation Library
– 227 usec regardless of size.
– Several synchronous disk ops.
• Kernel Level Filesystem
– 32 usec regardless of size.
– Touch one inode.
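Plugging these figures into a simple cost model shows how far apart the three approaches are. The sketch assumes 1 GB = 1024 MB and that loopback cost scales linearly with size, as the per-MB figure implies; the function name is hypothetical.

```python
# Per-allocation costs quoted on the Allocation Performance slide.
LOOPBACK_MB_PER_SEC = 25.0   # loopback: 1 second per 25 MB reserved
LIBRARY_USEC = 227.0         # user-level library, regardless of size
KERNEL_USEC = 32.0           # kernel filesystem, regardless of size


def allocation_seconds(method: str, size_mb: float) -> float:
    """Estimated time to create one allocation of size_mb megabytes."""
    if method == "loopback":
        return size_mb / LOOPBACK_MB_PER_SEC   # must touch every block
    if method == "library":
        return LIBRARY_USEC / 1e6              # several synchronous disk ops
    if method == "kernel":
        return KERNEL_USEC / 1e6               # inode updates only
    raise ValueError(f"unknown method: {method}")


if __name__ == "__main__":
    for m in ("loopback", "library", "kernel"):
        print(m, allocation_seconds(m, 100 * 1024))   # a 100 GB allocation
```

For a 100 GB allocation the loopback estimate is 102400 / 25 = 4096 seconds, a little over an hour, against 227 or 32 microseconds for the other two.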
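Comparison
| | Privilege Required | Space Guaranteed? | Max # of Allocations | Write Performance | Allocation Performance | Recovery |
|---|---|---|---|---|---|---|
| Library | any user | no | no limit | 2x latency | usec | fixalloc once |
| Loopback | root to install and use | yes | 10-100 | no change | secs to mins | fsck and mount each alloc |
| Kernel | root to install | yes | no limit | no change | usec | fixalloc once |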
A Physical Experiment
[Figure: many CPUs run the tasks of a job; each task reads its input from, and writes its output to, a shared disk.]
Four configurations:
1 – No allocations.
2 – Back off when failures are detected.
3 – Heuristic: don't start a job unless free space > threshold.
4 – Allocate space for each job.
The disk has space for only 10 jobs. Vary the load: the number of simultaneous jobs.
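The contrast between configurations 1 and 4 can be reproduced in a toy model; this is not the experiment's code or data. It assumes every job writes an output of the same size, one unit at a time and round-robin with the other running jobs, and that a failed job's partial output stays on the disk. `simulate`, `Outcome` and the default parameters are hypothetical, and configurations 2 and 3 are not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class Outcome:
    completed: int
    failed: int
    waves: int


def simulate(n_jobs: int, allocate: bool, capacity: int = 10,
             units_per_job: int = 4) -> Outcome:
    """Toy model: the disk holds `capacity` complete job outputs."""
    if allocate:
        # Configuration 4: a job starts only once a full-size allocation is
        # granted, so at most `capacity` run per wave and none hit a full disk.
        return Outcome(n_jobs, 0, math.ceil(n_jobs / capacity))
    # Configuration 1: every job starts at once and competes for free space.
    disk_units = capacity * units_per_job
    written = [0] * n_jobs
    running = set(range(n_jobs))
    used = completed = failed = 0
    while running:
        for job in sorted(running):          # sorted() copies, so removal is safe
            if used == disk_units:
                running.remove(job)          # write fails: disk full; its
                failed += 1                  # partial output stays on disk
                continue
            written[job] += 1
            used += 1
            if written[job] == units_per_job:
                running.remove(job)          # output complete
                completed += 1
    return Outcome(completed, failed, 1)
```

In this toy, 10 jobs all finish either way; with 11 jobs and no allocations only 7 finish and 4 fail, and with 20 jobs none finish, while allocation runs all 20 in two waves. Load beyond capacity turns into queueing instead of failure.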
Allocations Improve Robustness
Summary
• Grids require space allocations in order to become robust under heavy loads.
• Explicit operating system support for allocations is needed in order to make them manageable and efficient.
• User-level approximations are possible, but have overheads in performance and management.
• AllocFS provides allocations compatible with EXT2 with no measurable overhead.
Library Implementation
• http://www.cctools.org/chirp
• Solaris, Linux, Mac, Windows
• Start server with -Q 100GB
Kernel Implementation
• http://www.cctools.org/allocfs
• Works with Linux 2.4.21.
• Install over an existing EXT2 FS.
– (And uninstall without loss.)
% mkalloc /mnt/alloctest/adir 25M
mkalloc: /mnt/alloctest/adir allocated 25600 blocks.
% lsalloc -r /mnt/alloctest
USED TOTAL PCT PATH
25.01M 87.14M 28% /mnt/alloctest
10.00M 25.00M 39% /mnt/alloctest/adir
A Final Thought
[Some think] traditional OS issues are either solved problems or minor problems. We believe that
building such vast distributed systems upon the fragile infrastructure provided by today’s operating systems is analogous to building castles on sand.
The Persistent Relevance of the Local Operating System to Global Applications
Jay Lepreau, Bryan Ford, and Mike Hibler
SIGOPS European Workshop, September 1996
For More Information:
• Cooperative Computing Lab:
– http://www.cse.nd.edu/~ccl
• Douglas Thain
• Related Talks:
– "Grid Deployment of Bioinformatics Apps..." (Session 4A, Friday)
– "Cacheable Decentralized Groups..." (Session 5B, Friday)
Extra Slides
Existing Tools Not Suitable for the Grid
• User and Group Quotas
– Don't always correspond to allocation needs!
  • A user might want one allocation per job.
  • Or, many users may want to share an allocation.
• Disk Partitions
– Very expensive to create, change, manage.
– Not hierarchical: only root can manage.
• ZFS Allocations
– Cheap to create, change, manage.
– Not hierarchical: only root can manage.
Library Suffers on Small Writes
Recovery Is Linear in the Number of Files