Outline for Today
• Objective
– Metadata complications
– More on naming
• Attribute-based file naming: “Why can’t I find my files?”
• Administrative
– Not yet.
Metadata
• File size
• File type
• Protection - access control information
• History: creation time, last modification, last access.
• Location of file - which device
• Location of individual blocks of the file on disk.
• Owner of file
• Group(s) of users associated with file
Operations on Directories (UNIX)
• link (oldpathname, newpathname) - make entry pointing to file
• unlink (filename) - remove entry pointing to file
• mknod (dirname, type, device) - used (e.g. by mkdir utility function) to create a directory (or named pipe, or special file)
• getdents(fd, buf, structsize) - reads dir entries
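As a rough illustration of the directory operations listed above, the following Python sketch goes through the standard `os` module wrappers (`os.link`, `os.unlink`, `os.mkdir`, `os.scandir`) rather than the raw system calls. The file names and the helper `demo_directory_ops` are made up for the example, and a POSIX-style file system with hard-link support is assumed.

```python
import os
import tempfile

def demo_directory_ops():
    """Exercise link, unlink, mkdir and directory reading in a scratch directory."""
    with tempfile.TemporaryDirectory() as root:
        original = os.path.join(root, "notes.txt")
        alias = os.path.join(root, "alias.txt")
        with open(original, "w") as f:
            f.write("metadata lecture\n")

        os.link(original, alias)                 # second directory entry, same i-node
        links_after_link = os.stat(original).st_nlink

        os.unlink(original)                      # removes one entry, not the file itself
        with open(alias) as f:                   # data is still reachable via the other entry
            surviving_data = f.read()

        os.mkdir(os.path.join(root, "subdir"))   # the mkdir utility's job
        entries = sorted(e.name for e in os.scandir(root))  # read the directory entries
        return links_after_link, surviving_data, entries

result = demo_directory_ops()
print(result)
```

The file contents survive the `unlink` because the i-node is only reclaimed once its last directory entry is gone.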
Metadata & Performance
• There are two popular approaches for improving the performance of metadata operations and recovery:
– Journaling
– Soft Updates
• Journaling systems record metadata operations on an auxiliary log
• Soft Updates uses ordered writes (Ganger & Patt, OSDI 94)
Metadata Operations
• Metadata operations modify the structure of the file system
– Creating, deleting, or renaming files, directories, or special files
• Data must be written to disk in such a way that the file system can be recovered to a consistent state after a system crash
General Rules of Ordering
1) Never point to a structure before it has been initialized (inode < direntry)
2) Never re-use a resource before nullifying all previous pointers to it
3) Never reset the old pointer to a live resource before the new pointer has been set (renaming)
Metadata Integrity
• FFS uses synchronous writes to guarantee the integrity of metadata
– Any operation modifying multiple pieces of metadata will write its data to disk in a specific order
– These writes will be blocking
• Guarantees integrity and durability of metadata updates
Deleting a file
[Figure: directory entries abc, def, ghi pointing to i-node-1, i-node-2, i-node-3]
Assume we want to delete file “def”
Deleting a file
[Figure: directory entries abc, def, ghi; i-node-2 has been removed, so “def” points to a missing i-node]
Cannot delete i-node before directory entry “def”
Deleting a file
• Correct sequence is
1. Write to disk directory block containing deleted directory entry “def”
2. Write to disk i-node block containing deleted i-node
• Leaves the file system in a consistent state
Creating a file
[Figure: directory entries abc, ghi pointing to i-node-1, i-node-3]
Assume we want to create new file “tuv”
Creating a file
[Figure: directory entries abc, ghi, tuv; only i-node-1 and i-node-3 exist, so “tuv” points to a missing i-node]
Cannot write directory entry “tuv” before i-node
Creating a file
• Correct sequence is
1. Write to disk i-node block containing new i-node
2. Write to disk directory block containing new directory entry
• Leaves the file system in a consistent state
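The two orderings can be checked with a toy model. This is only a sketch under simplifying assumptions: the "disk" is a Python dictionary, each write is atomic, a crash happens after the first of the two writes, and i-node number 4 for “tuv” is invented for the example. A dangling entry is a directory entry whose i-node is not on disk.

```python
def new_disk():
    """On-disk state: three directory entries and their three i-nodes."""
    return {"dir": {"abc": 1, "def": 2, "ghi": 3}, "inodes": {1, 2, 3}}

def dangling_entries(disk):
    """Names of directory entries that point at an i-node missing from disk."""
    return sorted(name for name, ino in disk["dir"].items()
                  if ino not in disk["inodes"])

def crash_after_first_write(writes):
    """Apply only the first write, as if the system crashed before the second."""
    disk = new_disk()
    writes[0](disk)
    return dangling_entries(disk)

# Correct orders: [first write, second write]
delete_ok = [lambda d: d["dir"].pop("def"), lambda d: d["inodes"].discard(2)]
create_ok = [lambda d: d["inodes"].add(4), lambda d: d["dir"].update(tuv=4)]

outcomes = {
    "delete, directory block first": crash_after_first_write(delete_ok),
    "delete, i-node block first": crash_after_first_write(delete_ok[::-1]),
    "create, i-node block first": crash_after_first_write(create_ok),
    "create, directory block first": crash_after_first_write(create_ok[::-1]),
}
for label, dangling in outcomes.items():
    print(f"{label}: dangling entries = {dangling}")
```

With the correct order a crash can at worst leave an unreferenced i-node, which a consistency checker can reclaim. With the wrong order it leaves a directory entry pointing at nothing.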
Synchronous Updates
• Used by FFS to guarantee consistency of metadata:
– All metadata updates are done through blocking writes
• Increases the cost of metadata updates
• Can significantly impact the performance of whole file system
SOFT UPDATES
• Use delayed writes (write back)
• Maintain dependency information about cached pieces of metadata: This i-node must be updated before/after this directory entry
• Guarantee that metadata blocks are written to disk in the required order
First Problem
• Synchronous writes guaranteed that metadata operations were durable once the system call returned
• Soft Updates guarantee that file system will recover into a consistent state but not necessarily the most recent one
– Some updates could be lost
Second Problem
• Cyclical dependencies:
– Same directory block contains entries to be created and entries to be deleted
– These entries point to i-nodes in the same block
Example
We want to delete file “def” and create new file “xyz”
[Figure: Block A (directory block) holds “def” and NEW “xyz”; Block B (i-node block) holds i-node-2 and NEW i-node-3]
Example
• Cannot write block A before block B:
– Block A contains a new directory entry pointing to block B
• Cannot write block B before block A:
– Block A contains a deleted directory entry pointing to block B
The Solution
• Roll back metadata in one of the blocks to an earlier, safe state
(Safe state does not contain new directory entry)
[Figure: Block A’, showing “def” and an empty slot where the new entry “xyz” would go]
The Solution
• Write first block with metadata that were rolled back (block A’ of example)
• Write blocks that can be written after first block has been written (block B of example)
• Roll forward block that was rolled back
• Write that block
• Breaks the cyclical dependency, but block A must now be written twice
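The rollback sequence can be traced on the “def”/“xyz” example with a small simulation. This is a sketch, not the Soft Updates implementation: blocks are plain Python values, i-node numbers 2 and 3 follow the example, and the helper names `consistent` and `simulate` are invented here. The invariant checked after every write is that each on-disk directory entry points at an initialized on-disk i-node.

```python
ON_DISK_A = {"def": 2}   # directory block before the two operations
ON_DISK_B = {2}          # i-node block before the two operations

cache_A = {"xyz": 3}     # cached block A: "def" deleted, "xyz" added
cache_B = {3}            # cached block B: i-node 2 freed, i-node 3 initialized

def consistent(disk):
    """Every directory entry on disk must point at an i-node that is on disk."""
    return all(ino in disk["B"] for ino in disk["A"].values())

def simulate(writes):
    """Apply block writes in order and record consistency after each one."""
    disk = {"A": dict(ON_DISK_A), "B": set(ON_DISK_B)}
    states = []
    for block, contents in writes:
        disk[block] = contents
        states.append(consistent(disk))
    return states

# Rolled-back block A': keep the deletion, drop entries whose i-node
# is not yet initialized on disk (here: the new entry "xyz").
rolled_back_A = {name: ino for name, ino in cache_A.items() if ino in ON_DISK_B}

a_first = simulate([("A", cache_A), ("B", cache_B)])
b_first = simulate([("B", cache_B), ("A", cache_A)])
with_rollback = simulate([("A", rolled_back_A), ("B", cache_B), ("A", cache_A)])

print("A then B:      ", a_first)
print("B then A:      ", b_first)
print("A', B, then A: ", with_rollback)
```

Both naive orders pass through an inconsistent on-disk state after the first write, while the three-write sequence stays consistent throughout.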
Journaling
• Journaling systems maintain an auxiliary log that records all meta-data operations
• Write-ahead logging ensures that the log is written to disk before any blocks containing data modified by the corresponding operations.
– After a crash, can replay the log to bring the file system to a consistent state
Journaling
• Log writes are performed in addition to the regular writes
• Journaling systems incur log write overhead but
– Log writes can be performed efficiently because they are sequential
– Metadata blocks do not need to be written back after each update
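A minimal sketch of write-ahead logging with replay, assuming a redo-only log of (key, value) records and a dictionary standing in for the metadata blocks. The class `JournaledFS` and its method names are hypothetical. The `sync_log` flag chooses between forcing the log on every operation and buffering log records in memory.

```python
class JournaledFS:
    """Toy metadata store with a write-ahead, redo-only log."""

    def __init__(self, sync_log=True):
        self.sync_log = sync_log
        self.log_buffer = []   # log records still in memory
        self.disk_log = []     # log records on disk (sequential appends)
        self.cache = {}        # in-memory metadata
        self.disk_meta = {}    # metadata blocks on disk

    def update(self, key, value):
        self.log_buffer.append((key, value))
        if self.sync_log:
            self.flush_log()            # durable as soon as the call returns
        self.cache[key] = value         # the block itself is written back later

    def flush_log(self):
        self.disk_log.extend(self.log_buffer)
        self.log_buffer.clear()

    def write_back(self, key):
        self.flush_log()                # write-ahead rule: log first, block second
        self.disk_meta[key] = self.cache[key]

    def crash_and_recover(self):
        self.log_buffer.clear()         # anything only in memory is lost
        for key, value in self.disk_log:
            self.disk_meta[key] = value  # replay is idempotent
        self.cache = dict(self.disk_meta)
        return dict(self.disk_meta)

forced = JournaledFS(sync_log=True)
forced.update("tuv", "i-node-4")
recovered_forced = forced.crash_and_recover()

buffered = JournaledFS(sync_log=False)
buffered.update("tuv", "i-node-4")
recovered_buffered = buffered.crash_and_recover()

print(recovered_forced, recovered_buffered)
```

In both cases the recovered state is consistent. Only the forced log preserves the most recent update, at the cost of a log write per operation.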
Journaling
• Journaling systems can provide
– same durability semantics as FFS if log is forced to disk after each meta-data operation
– the laxer semantics of Soft Updates if log writes are buffered until entire buffers are full
• Will discuss two implementations
– Log to file
– Write Ahead File System
Log-to-File
• Maintains a circular log in a pre-allocated file in the FFS (about 1% of file system size)
• Buffer manager uses a write-ahead logging protocol to ensure proper synchronization between regular file data and the log
Log-to-File
• Buffer header of each modified block in cache identifies the first and last log entries describing an update to the block
• System uses
– First item to decide which log entries can be purged from log
– Second item to ensure that all relevant log entries are written to disk before the block is flushed from the cache
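The role of the two buffer-header fields can be sketched as follows. This is an illustrative model only: log entries are identified by increasing integers, and the class `LogToFile` with its methods is invented for the example rather than taken from the system described.

```python
class LogToFile:
    """Tracks, per dirty block, the first and last log entries that describe it."""

    def __init__(self):
        self.next_entry = 0
        self.flushed_upto = -1   # highest log entry known to be on disk
        self.dirty = {}          # block id -> [first_entry, last_entry]

    def log_update(self, block):
        entry = self.next_entry
        self.next_entry += 1
        if block in self.dirty:
            self.dirty[block][1] = entry      # extend the "last" field
        else:
            self.dirty[block] = [entry, entry]
        return entry

    def flush_block(self, block):
        first, last = self.dirty[block]
        if self.flushed_upto < last:
            self.flushed_upto = last          # force the log up to "last" first
        del self.dirty[block]                 # then the block may leave the cache

    def purgeable_before(self):
        """Log entries numbered below this value are no longer needed."""
        if not self.dirty:
            return self.next_entry
        return min(first for first, _ in self.dirty.values())

log = LogToFile()
log.log_update("inode-block")   # entry 0
log.log_update("dir-block")     # entry 1
log.log_update("dir-block")     # entry 2
before = log.purgeable_before()
log.flush_block("inode-block")
after_inode = log.purgeable_before()
log.flush_block("dir-block")
after_dir = log.purgeable_before()
print(before, after_inode, after_dir)
```

The oldest "first" entry among dirty blocks bounds how much of the circular log can be reclaimed.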
WAFS
• Implements its log in an auxiliary file system: Write Ahead File System (WAFS)
– Can be mounted and unmounted
– Can append data
– Can return data by sequential or keyed reads
• Keys for keyed reads are log-sequence-numbers (LSNs) that correspond to logical offsets in the log
WAFS
• Log is implemented as a circular buffer within the physical space allocated to the file system.
• Buffer header of each modified block in cache contains LSNs of first and last log entries describing an update to the block
WAFS
• Major advantage of WAFS is additional flexibility:
– Can put WAFS on separate disk drive to avoid I/O contention
– Can even put it in NVRAM
• Normally uses synchronous writes
– Metadata operations are persistent upon return from the system call
– Same durability semantics as FFS
Recovery
• Superblock has address of last checkpoint
– LFFS-file has frequent checkpoints
– LFFS-wafs much less frequent checkpoints
• First recover the log
• Then read the log from logical end (backward pass) and undo all aborted operations
• Do forward pass and reapply all updates that have not yet been written to disk
Other Approaches
• Using non-volatile cache (Network Appliances)
– Ultimate solution: can keep data in cache forever
– Additional cost of NVRAM
• Simulating NVRAM with
– Uninterruptible power supplies
– Hardware-protected RAM (Rio): cache is marked read-only most of the time
Other Approaches
• Log-structured file systems
– Not always possible to write all related meta-data in a single disk transfer
– Sprite-LFS adds small log entries to the beginning of segments
– BSD-LFS makes segments temporary until all metadata necessary to ensure the recoverability of the file system are on disk.
Feature Comparison
Summary of Journaling vs. Soft Updates
• Journaling alone is not sufficient to “solve” the meta-data update problem
– Cannot realize its full potential when synchronous semantics are required
• When that condition is relaxed, journaling and Soft Updates perform comparably in most cases
Extending Metadata
• File size
• File type
• Protection - access control information
• History: creation time, last modification, last access.
• Location of file - which device
• Location of individual blocks of the file on disk.
• Owner of file
• Group(s) of users associated with file
• <attribute, value> pairs
A Naming Problem
[Figure: directory tree with nodes usr, project, coursearchive, and cwd; semester directories fall99 through fall03 and spring99 through spring02; cps210 and cps110 directories beneath the semesters]
Find the lecture where metadata was discussed
[Figure: the same directory tree reorganized; nodes usr, project, coursearchive, cwd, semester directories fall99 through fall03 and spring99 through spring02, and course directories cps210, cps110, …]
Find the lecture where metadata was discussed
A Naming Problem
[Figure: the same directory tree with symbolic links added; labels include spring99 through spring02 and cps210]
Find the lecture where metadata was discussed
With symbolic links
A Naming Problem
• It gets worse:
– /home/home5/carla/talks
– 2 laptops (one lives at work, one at home)
– desktop machine at home
• Forest not a tree!
– Growing more like kudzu
Attributes in File Systems
• Metadata: <category, value>
• How to assign?
– User provided – too much work
– Content analysis – restricted by formats
• Semantic file system provided transducers
– Context analysis
• Access-based or inter-file relationships
• Once you have them
– Virtual directories – “views”
– Indexing
Virtual Directories
[Figure: the course archive directory tree, with a virtual directory populated by automated symbolic links]
Find the lecture where metadata was discussed
Query: <class, cps210>
Automated symbolic links
Lecture10.ppt
Virtual Directories
[Figure: the course archive directory tree, with a virtual directory holding the query results]
Find the lecture where metadata was discussed
Query: <type, ppt> AND <topic, files>
Results: Lecture10.ppt, metadata.ppt, raid.ppt
Versions?
Issues with Virtual Directories
• What if I want to create a file under a virtual directory that doesn’t have a path location already?
• How does the system maintain consistency? We should make sure that when a file changes, its contents are still consistent with the query.
– What if somewhere a new file is created that should match the query and be included?
– What if currently matching file is changed to not match?
• How do I construct a query that captures exactly the set of files I wish to group together?
Example: HAC File System (Gopal & Manber, OSDI 99)
• Semantic directories created within the hierarchy (given a pathname in the tree) by issuing a query over the scope inherited from parent
– Physically exist as directory files containing symlinks
• Creates symbolic links to all files that satisfy query
• User can also explicitly add symbolic links to this semantic directory as well as remove ones returned by the query as posed.
– Query is a starting point for organization.
• Reevaluate queries whenever something in scope changes.
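To make the idea of a semantic directory concrete, here is a small Python sketch in the spirit of the description above. It is not HAC's code: the attribute catalog is a plain dictionary, queries are conjunctions of <attribute, value> pairs, and `refresh_semantic_dir`, its `added`/`removed` parameters, and the sample attributes are all assumptions made for illustration. A POSIX-style file system with symlink support is assumed.

```python
import os
import tempfile

def matches(attrs, query):
    """True if the file's attributes satisfy every <attribute, value> pair."""
    return all(attrs.get(key) == value for key, value in query)

def refresh_semantic_dir(sem_dir, catalog, query, added=(), removed=()):
    """Re-evaluate the query and rebuild the symlinks, honoring user edits."""
    os.makedirs(sem_dir, exist_ok=True)
    wanted = {path for path, attrs in catalog.items() if matches(attrs, query)}
    wanted = (wanted | set(added)) - set(removed)
    for name in os.listdir(sem_dir):                 # drop stale links
        os.unlink(os.path.join(sem_dir, name))
    for path in wanted:                              # basename clashes are ignored here
        os.symlink(path, os.path.join(sem_dir, os.path.basename(path)))
    return sorted(os.listdir(sem_dir))

def demo():
    with tempfile.TemporaryDirectory() as root:
        catalog = {}
        for name, attrs in [
            ("Lecture10.ppt", {"type": "ppt", "topic": "files", "class": "cps210"}),
            ("metadata.ppt", {"type": "ppt", "topic": "files"}),
            ("raid.ppt", {"type": "ppt", "topic": "files"}),
            ("notes.txt", {"type": "txt", "topic": "files"}),
        ]:
            path = os.path.join(root, name)
            open(path, "w").close()
            catalog[path] = attrs
        query = [("type", "ppt"), ("topic", "files")]
        sem_dir = os.path.join(root, "by-query")
        first = refresh_semantic_dir(sem_dir, catalog, query)
        # A file in scope changes, so the query is re-evaluated.
        catalog[os.path.join(root, "raid.ppt")]["topic"] = "disks"
        second = refresh_semantic_dir(sem_dir, catalog, query)
        return first, second

first_view, second_view = demo()
print(first_view, second_view)
```

Rebuilding every link on each change is the simplest possible policy. It sidesteps the question of how to detect which changes actually affect a query's result.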
Context-based Relationships
• Premise: Context is what user might remember best.
• Previous work
– Hoarding for disconnected access (inter-file relationships)
– Google: textual context for link and feedback from search behavior (assumption of popularity over many users)
Access-based
• Use context of user’s session at access time
• Application knowledge – modify apps to provide hints
– Example: subject of email associated with attached file
• Feedback from “find” type queries
– Searches are for rarely accessed files and usually only one user – limits statistical info
Inter-file
• Attributes can be shared/propagated among related files
• Determining relationships
– User access patterns – temporal locality
– Inter-file content analysis
• Similarity – duplication -- hashing
• Versions
Challenges
• Mechanisms
– Storage of large numbers of attributes that get automatically generated
– User interface
• Context switches
– Creating false positive relationships
Background: Inter-file Relationships
Hoarding - Prefetching for Disconnected Information Access
• Caching for availability (not just latency)
• Cache misses, when operating disconnected, have no redeeming value. (Unlike in connected mode, they can’t be used as the triggering mechanism for filling the cache.)
• How to preload the cache for subsequent disconnection? Planned or unplanned.
• What does it mean for replacement?
SEER’s Hoarding Scheme: Semantic Distance
• Observer monitors user access patterns, classifying each access by type.
• Correlator calculates semantic distance among files
• Clustering algorithm assigns each file to one or more projects
• Only entire projects are hoarded.
Defining Semantic Distance
• Temporal semantic distance - elapsed time between two file references. Time scale effects :-(
• Sequence-based semantic distance - number of intervening file references between the two of interest. At what point? Open? Close?
• Lifetime semantic distance - accounts for concurrently open files - overlapping lifetimes
Calc of Lifetime Distance
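[Figure: timeline of file lifetimes: foo.c is open while foo.h and bar.h are opened; foo.o is opened after foo.c has been closed]
• Distance from A to B is 0 if A is not closed before B is opened (overlap); otherwise it is the number of intervening opens, including the open of B itself
– foo.c -> foo.h: 0
– foo.c -> bar.h: 0
– foo.c -> foo.o: 3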
• How to turn semantic distance between two references into semantic distance between files? Summarize - geometric mean.
• Using months of data. Only store n nearest neighbors for each file and files within distance M
• External investigators can incorporate some extra info (e.g. heuristics used by Tait, makefile)
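The lifetime distance rule can be worked through in a few lines of Python. This is a sketch under assumptions: the reference stream is a list of (action, file) pairs invented to match the foo.c example, only the first open of each target file is recorded, and the function name `lifetime_distances` is not from SEER.

```python
def lifetime_distances(events, source):
    """Lifetime semantic distance from `source` to each file opened after it."""
    distances = {}
    opens_since = None      # opens seen since `source` was opened; None before that
    source_open = False
    for action, name in events:
        if name == source:
            if action == "open":
                source_open, opens_since = True, 0
            else:
                source_open = False
            continue
        if action == "open" and opens_since is not None:
            opens_since += 1
            # 0 while lifetimes overlap, otherwise count of intervening opens
            distances.setdefault(name, 0 if source_open else opens_since)
    return distances

trace = [
    ("open", "foo.c"),
    ("open", "foo.h"),
    ("open", "bar.h"),
    ("close", "foo.h"),
    ("close", "bar.h"),
    ("close", "foo.c"),
    ("open", "foo.o"),
]
distances = lifetime_distances(trace, "foo.c")
print(distances)
```

For foo.o the count is 3 because foo.h, bar.h, and foo.o itself were all opened after foo.c.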
Real World Complications
• Meaningless clutter in the reference stream (e.g. find command)
• Shared libraries - an apparent link between unrelated files - want to hoard but not use in distance calculations and clustering
• Rare but critical files, temp files, directories
• Multi-tasking clutter
• Delete and recreate by same filename.
• Examine metadata then open – 1 or 2 accesses?
• SEER tracing itself – avoid accesses by root
Evaluation
• Metric
– Hoard misses usually do not allow continuation of activity (stops trace) – counting misses is meaningless.
– Time to 1st miss – would depend on hoard size
– Miss-free hoard size – size necessary to ensure no misses
• Method
– Live deployment – difficulty in making comparisons
• Only long enough disconnections
• Subtract off suspensions
– Trace-driven simulation -- reproducible
• What kind of traces are valid?
Metadata
• File size
• File type
• Protection - access control information
• History: creation time, last modification, last access.
• Location of file - which device
• Location of individual blocks of the file on disk.
• Owner of file
• Group(s) of users associated with file
Access Control for Files
• Access control lists - detailed list attached to file of users allowed (denied) access, including kind of access allowed/denied.
• UNIX RWX - owner, group, everyone
UNIX access control
• Each file carries its access control with it.
rwx rwx rwx setuid
– First rwx: Owner (UID)
– Second rwx: Group (GID)
– Third rwx: Everybody else
– setuid: when bit set, it allows process executing object to assume UID of owner temporarily - enter owner domain (rights amplification)
• Owner has chmod, chgrp rights (granting, revoking)
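The rwx check and the setuid rule can be sketched with plain bit arithmetic. This is a simplified model, not the kernel's algorithm: it ignores the superuser and any supplementary mechanisms, and the function names `may_access` and `effective_uid_on_exec` plus the sample UIDs and GIDs are invented for the example. `stat.S_ISUID` is the standard library constant for the setuid bit.

```python
import stat

RIGHTS = {"r": 4, "w": 2, "x": 1}

def may_access(mode, file_uid, file_gid, uid, gids, want):
    """Pick exactly one rwx triple (owner, else group, else other) and test it."""
    if uid == file_uid:
        shift = 6          # owner bits
    elif file_gid in gids:
        shift = 3          # group bits
    else:
        shift = 0          # everybody else
    granted = (mode >> shift) & 0o7
    needed = sum(RIGHTS[c] for c in want)
    return (granted & needed) == needed

def effective_uid_on_exec(mode, file_uid, caller_uid):
    """With the setuid bit set, the process runs with the owner's UID."""
    return file_uid if mode & stat.S_ISUID else caller_uid

mode = 0o4750   # setuid, rwx for owner, r-x for group, nothing for others
owner_rw = may_access(mode, 100, 20, uid=100, gids={20}, want="rw")
group_x = may_access(mode, 100, 20, uid=200, gids={20}, want="x")
group_w = may_access(mode, 100, 20, uid=200, gids={20}, want="w")
other_r = may_access(mode, 100, 20, uid=300, gids={30}, want="r")
euid = effective_uid_on_exec(mode, file_uid=100, caller_uid=200)
print(owner_rw, group_x, group_w, other_r, euid)
```

In this example a group member may execute the file but not write it, and while executing it temporarily takes on the owner's UID.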
The Access Model
• Authorization problems can be represented abstractly by an access model.
– each row represents a subject/principal/domain
– each column represents an object
– each cell: accesses permitted for the {subject, object} pair
• read, write, delete, execute, search, control, or any other method
• In real systems, the access matrix is sparse and dynamic.
• need a flexible, efficient representation
Access Matrix
[Figure: access matrix with rows TA grp, Chris, Pat and columns gradefile, solutions, proj1, luvltr, hotgossip; cells hold rights such as rwx, rw, r, rx]
Two Representations
• ACL - Access Control Lists
– Columns of previous matrix
– Permissions attached to Objects
– ACL for file hotgossip: Chris, rw; Pat, rw
• Capabilities
– Rows of previous matrix
– Permissions associated with Subject
– Tickets, Namespace (what it is that one can name)
– Capabilities held by Pat: luvltr, rw; hotgossip, rw
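The two representations are just two ways of slicing the same sparse matrix, which a short Python sketch can show. Only the cells stated in the text (hotgossip and Pat's luvltr entry) are included, and the function names `acl_for` and `capabilities_of` are invented for the example.

```python
# Sparse access matrix: only non-empty cells are stored.
matrix = {
    ("Chris", "hotgossip"): "rw",
    ("Pat", "hotgossip"): "rw",
    ("Pat", "luvltr"): "rw",
}

def acl_for(obj):
    """Column view: the list stored with an object."""
    return {subj: rights for (subj, o), rights in matrix.items() if o == obj}

def capabilities_of(subject):
    """Row view: the list held by a subject."""
    return {obj: rights for (s, obj), rights in matrix.items() if s == subject}

hotgossip_acl = acl_for("hotgossip")
pat_caps = capabilities_of("Pat")
print(hotgossip_acl)
print(pat_caps)
```

Storing only the non-empty cells reflects the sparseness of real access matrices. Either view can be rebuilt from the other, but each makes a different question cheap to answer.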
Access Control Lists
• Approach: represent the access matrix by storing its columns with the objects.
• Tag each object with an access control list (ACL) of authorized subjects/principals.
• To authorize an access requested by S for O
– search O’s ACL for an entry matching S
– compare requested access with permitted access
– access checks are often made only at bind time
Access Control Lists
Use of access control lists to manage file access
Access Control Lists
Two access control lists
Capabilities
• Approach: represent the access matrix by storing its rows with the subjects.
• Tag each subject with a list of capabilities for the objects it is permitted to access.
– A capability is an unforgeable object reference, like a pointer.
– It endows the holder with permission to operate on the object
• e.g., permission to invoke specific methods
– Typically, capabilities may be passed from one subject to another.
• Rights propagation and confinement problems
Dynamics of Protection Schemes
• How to endow software modules with appropriate privilege?
– What mechanism exists to bind principals with subjects?
• e.g., setuid syscall, setuid bit
– What principals should a software module bind to?
• privilege of creator: but may not be sufficient to perform the service
• privilege of owner or system: dangerous
Dynamics of Protection Schemes
• How to revoke privileges?
• What about adding new subjects or new objects?
• How to dynamically change the set of objects accessible (or vulnerable) to different processes run by the same user?
– Need-to-know principle / Principle of minimal privilege
– How do subjects change identity to execute a more privileged module?
• protection domain, protection domain switch (enter)
Protection Domains
• Processes execute in a protection domain, initially inherited from subject
• Goal: to be able to change protection domains
• Introduce a level of indirection
• Domains become protected objects with operations defined on them: owner, copy, control
[Figure: access matrix extended with a Domain0 row and column; cells carry owner (o) and copy (c) flags such as rwo and rxc, plus ctl and enter rights on domains]
• If a domain holds a right to some object with the copy flag, it can transfer that right to another domain.
• If a domain is the owner of some object, it can grant rights to that object, with or without copy, to another domain
• If a domain is the owner of an object or has the ctl right over another domain, it can remove a right to the object from that domain
• Rights propagation.
[Figure: the access matrix after rights propagation, with additional r and rc entries granted]