
Knowledge is Power

Remzi Arpaci-Dusseau

University of Wisconsin, Madison

Systems Without Knowledge

System designers often have limited knowledge
• About the applications they run
• About the other systems they interact with

Result: The “curse of generality”
• Missed performance optimizations
• Limited functionality
• Costly, too

Didacticism and Systems

How to gain knowledge?
• Depends on environment

Sometimes it’s easy
• A scientific application w/ cooperative developers

Sometimes it’s not
• Internals of Microsoft file system

What We Do

Build systems that acquire and exploit knowledge
• “Gray box” techniques
• Make assumptions, probe + measure, learn something about how something works
• Use knowledge to control systems in unexpected ways

Result
• Increase functionality, improve performance, increase robustness and manageability too

Outline

Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O

Conclusions

The People

Gray-box file placement
• With James Nugent, Andrea Arpaci-Dusseau

Semantically-smart disks
• With Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau

Scientific apps, the Grid, and I/O
• With John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny

Gray-box Control over File Placement

Controlled File Placement

Typical “Unix” file system: Little control over layout
• Just a simple API of open(), read(), write(), close()

Some applications want more control
• e.g., a web server that knows which files are often accessed together

Usual default: Use the raw disk
• Harder to manage, doesn’t integrate w/ other apps

What Might Be Better

Use normal file system
• Convenience

Expose control over layout to applications
• Control

Do the above without changing the file system
• Can’t always change the system you’re using

PLACE

A gray-box “Information and Control Layer” (ICL)
• It’s just a library

Simple API for file placement
• Exposes “FFS-like” groups
• Place_Creat(file, mode, groupNumber);

No changes to underlying file system

[Figure: processes (P) using the PLACE library on top of the file system]

PLACE Outline

Basic operation
• Gray-box knowledge
• Key techniques

Assessment
• Accuracy
• Performance

Conclusions

Allocation Knowledge

Gray-box assumption: “FFS”-like allocation
• Splits disk into numerous consecutive “groups”
• Spreads directories across groups
• Puts files (inodes/data) that are within same directory into same “group”

Many variants
• Our focus: ext2 (but with other variants in mind)

Exploiting Knowledge for Control

Key structure: Shadow Directory Tree (SDT)

To create a file /foo/bar in group 1:
• Create file /.H/1/bar
• Rename /.H/1/bar to /foo/bar

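[Figure: directory tree rooted at /, containing the shadow directory .h/ (with per-group subdirectories 1/, 2/, ..., n/) and foo/; bar appears under 1/ and under foo/]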
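The create-then-rename step can be sketched with ordinary file calls. This is a minimal sketch, not the PLACE library itself: `place_creat` and its `shadow_root` argument are hypothetical names, the shadow directory for each group is assumed to exist already, and whether the new file really lands in the intended group depends on the underlying file system following FFS-like allocation.

```python
import os


def place_creat(path, mode, group, shadow_root):
    """Create `path` via the shadow directory for `group` (hypothetical sketch).

    Assumes shadow_root/<group>/ already sits in the desired on-disk group.
    """
    shadow_path = os.path.join(shadow_root, str(group), os.path.basename(path))
    # Step 1: create the file inside the shadow directory for the group.
    fd = os.open(shadow_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, mode)
    os.close(fd)
    # Step 2: rename it to its final name; a rename within one file system
    # keeps the same inode, so the allocation decision is preserved.
    os.rename(shadow_path, path)
    return path
```

The application then opens the file by its final path as usual; nothing in the underlying file system is changed.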

Challenge: Building the SDT

How to ensure that the shadow directory for each group K is in the right on-disk location?

Basic approach to creating a directory in a group:
• Repeat
   • Mkdir(tmp);
   • If (tmp is in the desired group)
      • Break;
   • Bias();

Point of portability: Bias() routine
• Must account for different allocation algorithms
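The retry loop above can be exercised against a toy model. Everything here is an assumption for illustration: `ToyFS` is a stand-in whose allocator puts each new directory in the group that currently holds the fewest directories, and the bias step simply leaves misplaced directories in place so the allocator moves on. A real Bias() routine would have to match the actual file system's allocation algorithm.

```python
class ToyFS:
    """Toy allocator: a new directory goes to the group with fewest directories."""

    def __init__(self, num_groups):
        self.dir_count = [0] * num_groups
        self.group_by_name = {}

    def mkdir(self, name):
        group = self.dir_count.index(min(self.dir_count))
        self.dir_count[group] += 1
        self.group_by_name[name] = group

    def rmdir(self, name):
        self.dir_count[self.group_by_name.pop(name)] -= 1

    def rename(self, old, new):
        self.group_by_name[new] = self.group_by_name.pop(old)

    def group_of(self, name):
        return self.group_by_name[name]


def mkdir_in_group(fs, name, target_group, max_tries=64):
    """Repeat mkdir until a directory lands in target_group, biasing in between."""
    fillers = []
    placed = False
    for attempt in range(max_tries):
        tmp = f"{name}.tmp{attempt}"
        fs.mkdir(tmp)
        if fs.group_of(tmp) == target_group:
            fs.rename(tmp, name)   # keep the correctly placed directory
            placed = True
            break
        fillers.append(tmp)        # bias: leave it so the allocator moves on
    for filler in fillers:         # clean up the misplaced directories
        fs.rmdir(filler)
    return placed
```

Only the body of the bias step is specific to the allocation policy, which is why the slide calls Bias() the point of portability.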

Some Complications

Controlled directory placement
• Similar to system initialization (hence, slow)
• To speed up, use shadow cache of directories

Crash recovery
• Crash may leave junk in SDT
• Periodic sweep of SDT cleaner fixes this

Level of control depends on underlying FS
• e.g., FFS vs. ext2 behavior for large files

Assessment

Does it work?

Non-place: 250 files in 1 directory

Non-place: 250 files in 10 directories

Non-place: 250 files in 100 directories

PLACE: 250 files in 100 directories into 1 group

Performance (Small Files)

Performance of 250 200-KB file reads (random)

Performance (Big Files)

Each point: Bandwidth attained reading 100-MB file

PLACE Conclusions

PLACE: Gray-box approach to file layout

Simple and effective control over placement

Main technique: Shadow Directory Tree
• Use to control placement
• Construction and maintenance are keys

Controlled layout can improve performance
• Micro-benchmarks
• Web server and I/O parameterization (see USENIX ’03)


Semantically-smart Disk Systems

Semantically-Smart Disk System (SDS)

Disk system that understands file system
• Data structures
• Operations

Operates underneath unmodified FS
• Must discover layout + on-disk structures
• Must “reverse engineer” block stream

Exploits knowledge and “smarts” to implement new class of services

[Figure: file system on top of the SDS, which has its own cache ($) and CPU]

SDS Outline

Semantic Knowledge: Acquisition
• Off-line
• On-line

Semantic Knowledge: Exploitation
• Case studies

Conclusions

Static Knowledge: File System Layout

Challenge: How to discover layout information?
• White-box approach: Embed knowledge in SDS
   • Trend: FS layout does not change frequently

[Figure: on-disk layout: Superblk, then Group 1 (I-Bitmap, D-Bitmap, Inodes, Data) and Group 2 (I-Bitmap, D-Bitmap, Inodes, Data)]

Layout Discovery with EOF

EOF: Extraction Of File-systems
• Tool to automatically determine layout
• Uses gray-box techniques

Basic operation
• Start with “soft” model of file system
• Probe process (P): Initiates traffic
• SDS: Monitors activity from FS

Two distinct tasks:
• Classifying blocks by type
• Identifying fields within an inode

Result: “Hardened” model of file system structures + fields

[Figure: probe process (P), file system, and SDS]

EOF: More Details

Multi-phase procedure:
• Bootstrap: Summary blocks
• Data/data bitmaps
• Inodes/inode bitmaps
• Inode fields, directory entries

Key techniques
• Known patterns: Data blocks
• Isolation: Know all but one block, one block must be…
• Assertions: Check assumptions at each step

EOF: Simplified Example

Create file: Touches many data structures
• Directory data, directory inode, file data (known pattern), file inode, data bitmap, inode bitmap

Reset to beginning of file, write block again
• File data (known pattern), file inode
• Now, can classify inode block (isolation)
• Assertion: only two blocks observed
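The overwrite step can be written down as a small classifier. This is a sketch under stated assumptions, not the EOF tool: `KNOWN_PATTERN`, `classify_overwrite`, and the dictionary representation of the observed write traffic are all hypothetical.

```python
KNOWN_PATTERN = b"EOFPROBE"  # hypothetical marker the probe writes into file data


def classify_overwrite(observed, known_types):
    """Classify the single unknown block seen when a file block is rewritten.

    observed: dict mapping block address -> bytes written to that block.
    known_types: dict mapping block address -> type name, updated in place.
    """
    # Assertion: rewriting one block should touch file data + file inode only.
    assert len(observed) == 2, "expected exactly two blocks"
    unknown = []
    for addr, payload in observed.items():
        if payload.startswith(KNOWN_PATTERN):
            known_types[addr] = "data"       # known pattern: file data
        elif addr not in known_types:
            unknown.append(addr)
    # Isolation: all but one block is known, so that one must be the inode block.
    assert len(unknown) == 1, "isolation needs exactly one unknown block"
    known_types[unknown[0]] = "inode"
    return unknown[0]
```

If either assertion fails, the soft model of the file system was wrong for this step, which is the kind of check the assertions technique is meant to provide.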

EOF: Overhead and Summary

Performance: A few minutes per GB
• Probably OK, only done “once” per new file system
• Scales well with faster disks (sequential bandwidth)

Limitations: “FFS”-like file systems (ext2/3, BSD FFS)

Have Knowledge, Will Innovate

Knowing structures is not enough (sometimes)
• Data block overloading (data, pointer, directory)
• High-level operations not known (create, delete)

Requires new on-line techniques
• Direct classification
• Indirect classification
• Block association
• Operation inferencing

A Simple Example: Smarter Caching

Modern RAID may have significant cache
• Volatile (DRAM)
• Non-volatile (NVRAM)

How to exploit semantic information to cache more intelligently?

[Figure: file system on top of the SDS, which contains a cache ($)]

Storing Meta-Data in NVRAM

Start with simple meta-data: inodes, bitmaps, etc.
• Good for meta-data intensive workloads

[Figure: on-disk layout (Super, then I-Bit, D-Bit, Inode, Data for each group) alongside the NVRAM cache]

Direct Classification

Given address, determine type directly

Direct classification via bounds check
• Given disk address, can check bounds to determine type (superblock, bitmaps, inodes, general data block)

[Figure: on-disk layout: Super, then I-Bit, D-Bit, Inode, Data for each group]
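A bounds check of this kind amounts to a lookup in a sorted table of region boundaries. The layout below is made up for illustration (the block numbers, `LAYOUT`, `DISK_BLOCKS`, and `classify` are all hypothetical); in an SDS the table would come from the discovered file system layout.

```python
from bisect import bisect_right

# Hypothetical layout: (first block of region, block type), sorted by address.
LAYOUT = [
    (0, "superblock"),
    (1, "inode bitmap"),     # group 1
    (2, "data bitmap"),
    (3, "inodes"),
    (35, "data"),
    (1024, "inode bitmap"),  # group 2
    (1025, "data bitmap"),
    (1026, "inodes"),
    (1058, "data"),
]
REGION_STARTS = [start for start, _ in LAYOUT]
DISK_BLOCKS = 2048  # hypothetical disk size in blocks


def classify(addr):
    """Return the block type for a disk address using only bounds checks."""
    if not 0 <= addr < DISK_BLOCKS:
        raise ValueError(f"block {addr} is outside the disk")
    # Find the last region whose first block is <= addr.
    return LAYOUT[bisect_right(REGION_STARTS, addr) - 1][1]
```

Note that the "data" answer is only a general classification: as an earlier slide points out, data blocks are overloaded (data, pointer, directory), so telling those apart needs the other on-line techniques.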

Getting Rid Of The Dead

If file blocks are deleted, remove them from cache
• No need to keep dead blocks around

Problem: How to determine if a file is deleted?
• Need to look for signs of deletion

Three different places to look:
• Inode bitmaps
• Directory that contains file
• Inode itself

Operation inferencing via block differencing

Operation Inferencing: Detecting Deletes (Inode Bitmap)

[Figure: on-disk layout (Super, then I-Bit, D-Bit, Inode, Data for each group); on an inode bitmap write, the SDS reads the old version and diffs it against the new one. Result: deleted files]
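Block differencing on an inode bitmap can be sketched as a bitwise comparison of the old and new versions. This is an illustrative sketch: `freed_inodes` is a hypothetical name, and the bit numbering (bit 0 of byte 0 is inode 0) is an assumption that a real SDS would take from the discovered layout.

```python
def freed_inodes(old_bitmap, new_bitmap):
    """Return inode numbers whose bit went from 1 (allocated) to 0 (free)."""
    if len(old_bitmap) != len(new_bitmap):
        raise ValueError("bitmap versions must have the same length")
    freed = []
    for byte_index, (old, new) in enumerate(zip(old_bitmap, new_bitmap)):
        cleared = old & ~new & 0xFF      # bits set before, clear now
        for bit in range(8):
            if cleared & (1 << bit):
                freed.append(byte_index * 8 + bit)
    return freed
```

Each inode number returned is a candidate deleted file whose blocks could then be dropped from the cache.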

Operation Inferencing: Overheads

Space overhead
• Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial)

Time overhead
• CPU: Difference operation is like an extra copy
• Disk: May require block read (if small/no cache)

[In paper: Quantified time and space overheads]
• Main point: There is a CPU and memory cost

Case Studies

Experimental Set-up

Problem: Don’t have SDS hardware to use (yet!)

“Cost-effective” alternative:
• Software prototype

Insert driver underneath the FS
• Much like software RAID

Good because…
• Traffic stream similar

Bad because…
• CPU, memory not isolated from host

[Figure: file system on top of the SDS driver, both within the OS]

Fast RAID Reconstruction

Observe: When reconstructing data onto hot spare, no need to reconstruct data that isn’t live

Trend: Less live data in performance-sensitive I/O systems

Question: How can we perform reconstruction quickly?

[Figure: mirror and hot spare]

Traditional Approaches

Why not in the file system?
• File system doesn’t know what RAID is

Why not in the storage system?
• RAID doesn’t know what blocks are live (minimally it does, if block has never been written)

The Semantic Way

Easy: Scan disk, only copy live blocks
• Key piece of knowledge: Bitmaps
• Plus, need to watch for “unmapped” writes

Optionally, can copy “dead” blocks later
• Useful if SDS doesn’t feel “sure” about its knowledge
• Guaranteed correct with prioritized recovery
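The live-first copy order can be sketched as follows. This is a simplified illustration, not the SDS implementation: `reconstruct` is a hypothetical function, blocks are modeled as list entries, and the handling of “unmapped” writes that arrive during reconstruction is left out.

```python
def reconstruct(mirror, spare, live_bitmap, copy_dead=True):
    """Copy blocks from the surviving mirror onto the hot spare, live first.

    mirror, spare: lists of block contents (spare is updated in place).
    live_bitmap: list of booleans, True where the file system's bitmap
    marks the block as allocated.
    Returns the order in which block numbers were copied.
    """
    live = [n for n, is_live in enumerate(live_bitmap) if is_live]
    dead = [n for n, is_live in enumerate(live_bitmap) if not is_live]
    # Prioritized recovery: live blocks first, dead blocks optionally later.
    order = live + (dead if copy_dead else [])
    for n in order:
        spare[n] = mirror[n]
    return order
```

With `copy_dead=True` every block is eventually copied, so the result is correct even if the bitmap knowledge was wrong; the bitmap only decides what is restored first.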

Fast Reconstruction: A Graph

Fast reconstruction: Less live data -> less time
• How data is spread across disk affects recovery time

[Figure: reconstruction results; RAID-5, IBM disks]

Semantic Conclusions

Innovation in traditional storage stack is limited
• File system: high but not low-level info
• Storage system: low but not high-level info

Semantically-smart disks: Best of both worlds?
• Takes advantage of “smart” disk systems
• Exploit low-level information…
• …with high-level knowledge of file system

A remaining challenge
• Overcoming the “file system obfuscation” problem


Trends in Scientific Computing

What constitutes a job is increasingly complex
• Not your simple process anymore

Data demands increasing
• Not just cycles anymore

Wide-area collaboration
• “Grids” facilitate sharing

The Question

How to run scientific workloads on the WAN?

[Figure: home site and remote site connected by a WAN]

Scientific Outline

Typical “scientific” jobs
• Structure
• Properties

Migratory file services
• Components
• Performance

Conclusions

First Things First

Study of modern scientific applications
• A “measure then build” approach

Suite of six applications
• BLAST: Searches genomic databases for matching proteins
• IBIS: Global-scale simulation of earth systems
• CMS: High-energy physics testing software
• Nautilus: Simulation of molecular dynamics
• Messkit Hartree-Fock: Simulation of atomic interactions
• AMANDA: Astrophysics simulation of cosmic events

An Example: AMANDA

A single “job” is a multi-process pipeline -> batch pipelined
• Each process is a blue circle

There are many types of I/O
• Endpoint (red): unique input/output of pipeline
• Pipeline private (green): shared between pipe processes
• Batch shared (yellow): shared across all pipes in batch

[Figure: AMANDA pipeline, annotated with I/O volumes (4K, 1M, 23M, 126M, 26M, 5M, 3M, 505M) and run times (2188 s, 42 s, 955 s, 3601 s)]

Some Things We Learned

Demands of a single pipeline are modest
• Modern PC with disk can handle demand
• Aggregation of I/O could be harder (WAN)

Lots of sharing of data within and across pipelines
• Systems should (have to?) take advantage of this

Towards Systems Support

Systems Support

Need to build systems support for global execution
• Should support “batch-pipelined” jobs effectively

Goals
• Performance: Throughput is what matters (NOT simple metrics like “availability”)
• Failures: Must be handled effectively (again, with goal of improving performance)

Migratory File Services

Migratory file service
• I/O environment for “batch-pipelined” workloads
• Integrates performance and failure management
• Key: Understanding of workloads

Three pieces of implementation
• Virtual batch overlay
• Migratory proxies
• Workflow manager

The Virtual Batch Overlay

Want familiar and controllable remote environment
• But often are stuck w/ particular queueing system
• Further, cannot assume all relevant s/w installed

Glide in our own “virtual batch system”
• On each node, run master, virtual machine, and migratory proxy (described next)

[Figure: a node running the master (M), virtual machine (VM), and migratory proxy (MP)]

Migratory Proxies

Migratory proxies: Run on each remote node
• Fetch and cache data from home node
• Cooperative cache for batch inputs
• Localize I/O that is pipeline local

[Figure: remote node running M, VM, and MP with J, connected to the home node over the WAN]
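The proxy's two roles (caching inputs fetched from home and keeping pipeline-local I/O local) can be sketched as a small read-through cache. This is an illustrative sketch only: `MigratoryProxy`, `fetch_from_home`, and `write_private` are hypothetical names, and the cooperative cache shared among proxies is not modeled.

```python
class MigratoryProxy:
    """Toy proxy: caches home-node data and keeps pipeline-private data local."""

    def __init__(self, fetch_from_home):
        self.fetch_from_home = fetch_from_home  # callable: name -> bytes (WAN fetch)
        self.cache = {}   # batch inputs already fetched from the home node
        self.local = {}   # pipeline-private data that never crosses the WAN

    def write_private(self, name, data):
        # Pipeline-local output stays on the remote node.
        self.local[name] = data

    def read(self, name):
        if name in self.local:
            return self.local[name]
        if name not in self.cache:
            # Cache miss: one fetch over the WAN, then serve from the cache.
            self.cache[name] = self.fetch_from_home(name)
        return self.cache[name]
```

Repeated reads of the same batch input then cost a single wide-area fetch per proxy.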

Workflow Manager

Where workload knowledge is encapsulated

Takes workflow description
• Job dependencies
• File indicators

Runs each job while taking failures into account
• Transactional management
• Proxy failure and job failure are not catastrophic (just rerun the job!)
• Proactive data replication
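Running jobs in dependency order and rerunning on failure can be sketched with Python's standard `graphlib` module. This is a minimal sketch, not the actual workflow manager: `run_workflow`, the `deps` dictionary format, `max_retries`, and the use of `RuntimeError` to signal a failed job are all assumptions, and data replication is not modeled.

```python
from graphlib import TopologicalSorter


def run_workflow(deps, run_job, max_retries=3):
    """Run jobs in dependency order, rerunning a job when it fails.

    deps: dict mapping job name -> set of jobs it depends on.
    run_job: callable that raises RuntimeError if the job (or its proxy) fails.
    Returns (execution order, attempts used per job).
    """
    order = list(TopologicalSorter(deps).static_order())
    attempts = {}
    for job in order:
        for attempt in range(1, max_retries + 1):
            attempts[job] = attempt
            try:
                run_job(job)
                break                  # job succeeded, move on
            except RuntimeError:
                continue               # not catastrophic: just rerun the job
        else:
            raise RuntimeError(f"{job} failed {max_retries} times")
    return order, attempts
```

A job that fails once is simply run again before its dependents start.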

Performance

By exploiting knowledge, order of magnitude improvement over naïve approach


Conclusions

The theme: Knowledge is power
• If you know how FS decides on file layout, you can control it (PLACE)
• If you know details of FS on-disk structures, you can gain FS-level knowledge behind a block-based interface (Semantic disks)
• If you know something about workloads and their I/O behaviors, you can optimize performance and handle failure gracefully

“Beware of false knowledge; it is more dangerous than ignorance.”
Bernard Shaw

http://www.cs.wisc.edu/wind
