
Knowledge is Power

Remzi Arpaci-Dusseau

University of Wisconsin, Madison


Systems Without Knowledge

System designers often have limited knowledge
• About the applications they run
• About the other systems they interact with

Result: The “curse of generality”
• Missed performance optimizations
• Limited functionality
• Costly, too


Didacticism and Systems

How to gain knowledge?
• Depends on environment

Sometimes it’s easy
• A scientific application w/ cooperative developers

Sometimes it’s not
• Internals of Microsoft file system


What We Do

Build systems that acquire and exploit knowledge
• “Gray box” techniques
• Make assumptions, probe + measure, learn something about how something works
• Use knowledge to control systems in unexpected ways

Result
• Increase functionality, improve performance, increase robustness and manageability too


Outline

Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O

Conclusions


The People

Gray-box file placement
• With James Nugent, Andrea Arpaci-Dusseau

Semantically-smart disks
• With Muthian Sivathanu, Vijayan Prabhakaran, Florentina Popovici, Tim Denehy, Andrea Arpaci-Dusseau

Scientific apps, the Grid, and I/O
• With John Bent, Doug Thain, Andrea Arpaci-Dusseau, Miron Livny


Gray-box Control over File Placement


Controlled File Placement

Typical “Unix” file system: Little control over layout
• Just a simple API of open(), read(), write(), close()

Some applications want more control
• e.g., a web server that knows which files are often accessed together

Usual default: Use the raw disk
• Harder to manage, doesn’t integrate w/ other apps


What Might Be Better

Use normal file system
• Convenience

Expose control over layout to applications
• Control

Do the above without changing the file system
• Can’t always change the system you’re using


PLACE

A gray-box “Information and Control Layer” (ICL)
• It’s just a library

Simple API for file placement
• Exposes “FFS-like” groups
• Place_Creat(file, mode, groupNumber);

No changes to underlying file system

[Figure: boxes labeled P, P, PLACE, and File System]


PLACE Outline

Basic operation
• Gray-box knowledge
• Key techniques

Assessment
• Accuracy
• Performance

Conclusions


Allocation Knowledge

Gray-box assumption: “FFS”-like allocation
• Splits disk into numerous consecutive “groups”
• Spreads directories across groups
• Puts files (inodes/data) that are within same directory into same “group”

Many variants
• Our focus: ext2 (but with other variants in mind)


Exploiting Knowledge for Control

Key structure: Shadow Directory Tree (SDT)

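To create a file /foo/bar in group 1:
• Create file /.H/1/bar
• Rename /.H/1/bar to /foo/bar

[Figure: directory tree rooted at /, with a shadow directory .h/ holding per-group subdirectories 1/, 2/, ..., n/ next to foo/; the file bar is shown twice]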
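
The create-then-rename step can be sketched in Python. This is a minimal illustration under stated assumptions, not the PLACE library itself: the name `place_creat`, the `root` argument, and the scratch-directory demo are hypothetical, and the sketch presumes the shadow directories under `.H/` already sit in the desired groups. It also relies on POSIX rename semantics.

```python
import os
import tempfile


def place_creat(root, path, mode, group):
    """Create `path` (relative to `root`) by way of the shadow directory for `group`.

    Hypothetical stand-in for Place_Creat(file, mode, groupNumber).
    """
    shadow_dir = os.path.join(root, ".H", str(group))
    shadow_path = os.path.join(shadow_dir, os.path.basename(path))
    final_path = os.path.join(root, path)

    # Step 1: create the file inside the shadow directory. Under FFS-like
    # allocation the new inode is expected to land in that directory's group.
    fd = os.open(shadow_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, mode)

    # Step 2: rename into the target directory. Within one file system a
    # rename changes the name, not the inode, so the placement is kept.
    os.rename(shadow_path, final_path)
    return fd


# Demo in a scratch directory; no real placement control happens here.
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, ".H", "1"))
os.makedirs(os.path.join(root, "foo"))
fd = place_creat(root, "foo/bar", 0o644, group=1)
inode_via_fd = os.fstat(fd).st_ino
os.close(fd)
```

The caller sees an ordinary file descriptor, which fits the idea that PLACE is just a library layered over an unmodified file system.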


Challenge: Building the SDT

How to ensure that shadow directory for each group K is in the right on-disk location?

Basic approach to creating a directory in a group:
• Repeat
  • Mkdir(tmp);
  • If (tmp is in the desired group)
    • Break;
  • Bias();

Point of portability: Bias() routine
• Must account for different allocation algorithms
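
The repeat loop can be sketched as follows. This is an illustration only: `make_dir_in_group`, `ToyFS`, and the round-robin allocator are hypothetical, and the toy `bias` merely records the misplaced directory, whereas a real Bias() routine has to steer the actual allocator. The helper `ext2_group_of_inode` uses the ext2 convention that inodes are numbered from 1 and assigned to groups in order.

```python
import itertools


def ext2_group_of_inode(ino, inodes_per_group):
    """Block group holding inode `ino` on ext2 (inodes are numbered from 1)."""
    return (ino - 1) // inodes_per_group


def make_dir_in_group(target, mkdir_tmp, group_of, bias, max_tries=1000):
    """Repeat: mkdir, check the group, otherwise bias the allocator and retry."""
    for _ in range(max_tries):
        tmp = mkdir_tmp()
        if group_of(tmp) == target:
            return tmp
        bias(tmp)
    raise RuntimeError("gave up before reaching group %d" % target)


class ToyFS:
    """Toy allocator for illustration: new directories rotate through groups."""

    def __init__(self, ngroups=4):
        self.ngroups = ngroups
        self.counter = itertools.count()
        self.group = {}      # directory name -> group it landed in
        self.misplaced = []  # directories handed to bias()

    def mkdir_tmp(self):
        n = next(self.counter)
        name = "tmp%d" % n
        self.group[name] = n % self.ngroups
        return name

    def bias(self, name):
        # Placeholder: a real Bias() is file-system specific.
        self.misplaced.append(name)


fs = ToyFS()
placed = make_dir_in_group(2, fs.mkdir_tmp, fs.group.get, fs.bias)
```

Passing `group_of` and `bias` as parameters mirrors the slide's point that Bias() is the portability boundary.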


Some Complications

Controlled directory placement
• Similar to system initialization (hence, slow)
• To speed up, use shadow cache of directories

Crash recovery
• Crash may leave junk in SDT
• Periodic sweep by an SDT cleaner fixes this

Level of control depends on underlying FS
• e.g., FFS vs. ext2 behavior for large files


Assessment


Does it work?

Non-place: 250 files in 1 directory

Non-place: 250 files in 10 directories

Non-place: 250 files in 100 directories

PLACE: 250 files in 100 directories into 1 group


Performance (Small Files)

Performance of 250 200-KB file reads (random)


Performance (Big Files)

Each point: Bandwidth attained reading 100-MB file


PLACE Conclusions

PLACE: Gray-box approach to file layout

Simple and effective control over placement

Main technique: Shadow Directory Tree
• Use to control placement
• Construction and maintenance are keys

Controlled layout can improve performance
• Micro-benchmarks
• Web server and I/O parameterization (see USENIX ’03)


Outline

Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O

Conclusions


Semantically-smart Disk Systems


Semantically-Smart Disk System (SDS)

Disk system that understands file system
• Data structures
• Operations

Operates underneath unmodified FS
• Must discover layout + on-disk structures
• Must “reverse engineer” block stream

Exploits knowledge and “smarts” to implement new class of services

[Figure: File System above an SDS box labeled $ and CPU]


SDS Outline

Semantic Knowledge: Acquisition
• Off-line
• On-line

Semantic Knowledge: Exploitation
• Case studies

Conclusions


Static Knowledge: File System Layout

Challenge: How to discover layout information?
• White-box approach: Embed knowledge in SDS
• Trend: FS layout does not change frequently

[Figure: on-disk layout. Superblk, then Group 1 (I-Bitmap, D-Bitmap, Inodes, Data) and Group 2 (I-Bitmap, D-Bitmap, Inodes, Data)]


Layout Discovery with EOF

EOF: Extraction Of File-systems
• Tool to automatically determine layout
• Uses gray-box techniques

Basic operation
• Start with “soft” model of file system
• Probe process (P): Initiates traffic
• SDS: Monitors activity from FS

Two distinct tasks:
• Classifying blocks by type
• Identifying fields within an inode

Result: “Hardened” model of file system structures + fields

[Figure: boxes labeled P, File System, and SDS]


EOF: More Details

Multi-phase procedure:
• Bootstrap: Summary blocks
• Data/data bitmaps
• Inodes/inode bitmaps
• Inode fields, directory entries

Key techniques
• Known patterns: Data blocks
• Isolation: Know all but one block, one block must be…
• Assertions: Check assumptions at each step


EOF: Simplified Example

Create file: Touches many data structures
• Directory data, directory inode, file data (known pattern), file inode, data bitmap, inode bitmap

Reset to beginning of file, write block again
• File data (known pattern), file inode
• Now, can classify inode block (isolation)
• Assertion: only two blocks observed
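
The rewrite step above combines all three techniques, and a small Python sketch may make that concrete. Everything named here is an assumption for illustration: `KNOWN_PATTERN`, `classify_by_isolation`, the block addresses, and the dictionary representation of observed writes are not taken from the EOF tool.

```python
KNOWN_PATTERN = b"\xAB" * 16  # assumed marker the probe process writes as file data


def classify_by_isolation(observed, known_types, unknown_label):
    """Classify the blocks seen during one probe step.

    observed:      {block_address: content} written during the step
    known_types:   {block_address: type} learned in earlier steps
    unknown_label: type to assign to the single unexplained block
    """
    types = dict(known_types)
    unknown = []
    for addr, content in observed.items():
        if content.startswith(KNOWN_PATTERN):
            types[addr] = "data"  # known pattern: this is file data
        elif addr not in types:
            unknown.append(addr)
    # Isolation: all but one block is explained, so that block must be
    # of the type the probe step was designed to expose.
    assert len(unknown) == 1, "expected exactly one unclassified block"
    types[unknown[0]] = unknown_label
    return types


# Step from the slide: rewrite the first block of an existing file.
observed = {5000: KNOWN_PATTERN + b"payload", 130: b"inode bytes"}
assert len(observed) == 2  # assertion from the slide: only two blocks observed
types = classify_by_isolation(observed, {}, "inode")
```

If the assertion fails, the assumed model of the file system does not hold for this step, which is the purpose of checking assumptions at each phase.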


EOF: Overhead and Summary

Performance: A few minutes per GB
• Probably OK, only done “once” per new file system
• Scales well with faster disks (sequential bandwidth)

Limitations: “FFS”-like file systems (ext2/3, BSD FFS)


Have Knowledge, Will Innovate

Knowing structures is not enough (sometimes)
• Data block overloading (data, pointer, directory)
• High-level operations not known (create, delete)

Requires new on-line techniques
• Direct classification
• Indirect classification
• Block association
• Operation inferencing


A Simple Example: Smarter Caching

Modern RAID may have significant cache
• Volatile (DRAM)
• Non-volatile (NVRAM)

How to exploit semantic information to cache more intelligently?

[Figure: FileSystem above an SDS box labeled $]


Storing Meta-Data in NVRAM

Start with simple meta-data: inodes, bitmaps, etc.
• Good for meta-data intensive workloads

[Figure: disk layout (Super, then I-Bit, D-Bit, Inode, Data repeated for two groups) with an NVRAM cache]


Direct Classification

Given address, determine type directly

Direct classification via bounds check
• Given disk address, can check bounds to determine type (superblock, bitmaps, inodes, general data block)

[Figure: disk layout (Super, then I-Bit, D-Bit, Inode, Data repeated for two groups)]
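
A bounds check of this kind can be sketched in a few lines of Python. The `Layout` fields and all sizes below are made-up example values, not the geometry of any real file system; in an SDS they would come from the discovered layout.

```python
from dataclasses import dataclass


@dataclass
class Layout:
    # All sizes are in blocks and are invented for this example.
    superblock_blocks: int = 1
    group_blocks: int = 100
    ibitmap_blocks: int = 1
    dbitmap_blocks: int = 1
    inode_blocks: int = 10


def classify(addr, layout):
    """Map a block address to a block type using only static bounds."""
    if addr < layout.superblock_blocks:
        return "superblock"
    # Offset of the block within its group.
    offset = (addr - layout.superblock_blocks) % layout.group_blocks
    if offset < layout.ibitmap_blocks:
        return "inode bitmap"
    offset -= layout.ibitmap_blocks
    if offset < layout.dbitmap_blocks:
        return "data bitmap"
    offset -= layout.dbitmap_blocks
    if offset < layout.inode_blocks:
        return "inode"
    return "data"


layout = Layout()
kinds = [classify(a, layout) for a in (0, 1, 2, 3, 13, 101)]
```

The check needs no per-block state, which is why it only yields the coarse "general data block" label for everything in the data region.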


Getting Rid Of The Dead

If file blocks are deleted, remove them from cache
• No need to keep dead blocks around

Problem: How to determine if a file is deleted?
• Need to look for signs of deletion

Three different places to look:
• Inode bitmaps
• Directory that contains file
• Inode itself

Operation inferencing via block differencing


Operation Inferencing: Detecting Deletes (Inode Bitmap)

[Figure: disk layout (Super, then I-Bit, D-Bit, Inode, Data repeated for two groups). The SDS reads the old version of the I-Bitmap and diffs it. Result: deleted files]
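
The bitmap diff can be illustrated with a short Python sketch. The function name `freed_inodes` and the sample bitmaps are hypothetical, and the bit order (least significant bit of byte 0 is the first inode) is an assumption of this example.

```python
def freed_inodes(old_bitmap, new_bitmap, first_inode=1):
    """Return inode numbers whose bit went from 1 (allocated) to 0 (free)."""
    freed = []
    for byte_index, (old, new) in enumerate(zip(old_bitmap, new_bitmap)):
        cleared = old & ~new & 0xFF  # bits set before, clear now
        for bit in range(8):
            if cleared & (1 << bit):
                freed.append(first_inode + byte_index * 8 + bit)
    return freed


old = bytes([0b00001111, 0b00000001])  # cached copy of the inode bitmap
new = bytes([0b00001011, 0b00000000])  # version being written by the FS
deleted = freed_inodes(old, new)
```

The old version must be available for the comparison, either from a cache or through an extra disk read.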


Operation Inferencing: Overheads

Space overhead
• Block cache of inodes, indirect pointers, bitmaps, etc. (could be substantial)

Time overhead
• CPU: Difference operation is like an extra copy
• Disk: May require block read (if small/no cache)

[In paper: Quantified time and space overheads]
• Main point: There is a CPU and memory cost


Case Studies


Experimental Set-up

Problem: Don’t have SDS hardware to use (yet!)

“Cost-effective” alternative:
• Software prototype

Insert driver underneath FS
• Much like software RAID

Good because…
• Traffic stream similar

Bad because…
• CPU, memory not isolated from host

[Figure: boxes labeled FileSystem, SDS, and OS]


Fast RAID Reconstruction

Observe: When reconstructing data onto hot spare, no need to reconstruct data that isn’t live

Trend: Less live data in performance-sensitive I/O systems

Question: How can we perform reconstruction quickly?

[Figure: Mirror and Hot Spare]


Traditional Approaches

Why not in the file system?
• File system doesn’t know what RAID is

Why not in the storage system?
• RAID doesn’t know what blocks are live (minimally it does, if block has never been written)


The Semantic Way

Easy: Scan disk, only copy live blocks
• Key piece of knowledge: Bitmaps
• Plus, need to watch for “unmapped” writes

Optionally, can copy “dead” blocks later
• Useful if SDS doesn’t feel “sure” about its knowledge
• Guaranteed correct with prioritized recovery
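
The prioritized copy order can be sketched in Python. This is a simplified model under stated assumptions: disks are plain lists of blocks, `live_bitmap` is a list of booleans derived from the file system's data bitmaps, and writes that arrive during reconstruction (the “unmapped” writes mentioned above) are not handled.

```python
def reconstruct(mirror, spare, live_bitmap, copy_dead_later=True):
    """Copy live blocks from `mirror` to `spare` first, dead blocks afterwards.

    Returns the order in which block numbers were copied.
    """
    order = []
    live = [i for i, is_live in enumerate(live_bitmap) if is_live]
    dead = [i for i, is_live in enumerate(live_bitmap) if not is_live]
    for i in live:
        spare[i] = mirror[i]
        order.append(i)
    if copy_dead_later:
        # Copying the rest later keeps the result correct even if the
        # liveness information turned out to be wrong.
        for i in dead:
            spare[i] = mirror[i]
            order.append(i)
    return order


mirror = [b"a", b"b", b"c", b"d"]
spare = [None] * 4
order = reconstruct(mirror, spare, live_bitmap=[True, False, False, True])

spare_fast = [None] * 4
order_fast = reconstruct(mirror, spare_fast, [True, False, False, True],
                         copy_dead_later=False)
```

With `copy_dead_later=False` only the live blocks are copied, which is where the time savings come from when little of the disk is live.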


Fast Reconstruction: A Graph

Fast reconstruction: Less live data -> less time
• How data is spread across disk affects recovery time

[Figure: reconstruction results; RAID-5, IBM disks]


Semantic Conclusions

Innovation in traditional storage stack is limited
• File system: high but not low-level info
• Storage system: low but not high-level info

Semantically-smart disks: Best of both worlds?
• Takes advantage of “smart” disk systems
• Exploit low-level information…
• …with high-level knowledge of file system

A remaining challenge
• Overcoming the “file system obfuscation” problem


Outline

Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O

Conclusions


Trends in Scientific Computing

What constitutes a job is increasingly complex
• Not your simple process anymore

Data demands increasing
• Not just cycles anymore

Wide-area collaboration
• “Grids” facilitate sharing


The Question

How to run scientific workloads on the WAN?

[Figure: Home and Remote sites connected by a WAN]


Scientific Outline

Typical “scientific” jobs
• Structure
• Properties

Migratory file services
• Components
• Performance

Conclusions


First Things First

Study of modern scientific applications
• A “measure then build” approach

Suite of six applications
• BLAST: Searches genomic databases for matching proteins
• IBIS: Global-scale simulation of earth systems
• CMS: High-energy physics testing software
• Nautilus: Simulation of molecular dynamics
• Messkit Hartree-Fock: Simulation of atomic interactions
• AMANDA: Astrophysics simulation of cosmic events


An Example: AMANDA

A single “job” is a multi-process pipeline -> batch pipelined
• Each process is a blue circle

There are many types of I/O
• Endpoint (red): unique input/output of pipeline
• Pipeline private (green): shared between pipe processes
• Batch shared (yellow): shared across all pipes in batch

[Figure: AMANDA pipeline diagram annotated with data sizes (4K, 1M, 23M, 126M, 26M, 5M, 3M, 505M) and stage run times (2188 s, 42 s, 955 s, 3601 s)]


Some Things We Learned

Demands of a single pipeline are modest
• Modern PC with disk can handle demand
• Aggregation of I/O could be harder (WAN)

Lots of sharing of data within and across pipelines
• Systems should (have to?) take advantage of this


Towards Systems Support


Systems Support

Need to build systems support for global execution
• Should support “batch-pipelined” jobs effectively

Goals
• Performance: Throughput is what matters (NOT simple metrics like “availability”)
• Failures: Must be handled effectively (again, with goal of improving performance)


Migratory File Services

Migratory file service
• I/O environment for “batch-pipelined” workloads
• Integrates performance and failure management
• Key: Understanding of workloads

Three pieces of implementation
• Virtual batch overlay
• Migratory proxies
• Workflow manager


The Virtual Batch Overlay

Want familiar and controllable remote environment
• But often are stuck w/ particular queueing system
• Further, cannot assume all relevant s/w installed

Glide in our own “virtual batch system”
• On each node, run master, virtual machine, and migratory proxy (described next)

[Figure: a node with boxes labeled M, VM, and MP]


Migratory Proxies

Migratory proxies: Run on each remote node
• Fetch and cache data from home node
• Cooperative cache for batch inputs
• Localize I/O that is pipeline local

[Figure: Home and Remote connected by a WAN; boxes labeled M, VM, MP, and J]
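
A single proxy's fetch-and-cache behavior can be sketched as below. This is a toy model, not the authors' implementation: the class `MigratoryProxy`, its method names, and the file names are invented for illustration, and cooperation between proxies on different nodes is not modeled.

```python
class MigratoryProxy:
    """Toy proxy for one remote node; all names here are illustrative."""

    def __init__(self, fetch_from_home):
        self.fetch_from_home = fetch_from_home  # callable: name -> bytes, crosses the WAN
        self.cache = {}       # batch inputs already fetched from the home node
        self.local = {}       # pipeline-private files, never sent home
        self.wan_fetches = 0  # how many reads actually crossed the WAN

    def read(self, name):
        if name in self.local:
            return self.local[name]
        if name not in self.cache:
            self.cache[name] = self.fetch_from_home(name)
            self.wan_fetches += 1
        return self.cache[name]

    def write_private(self, name, data):
        # Pipeline-local I/O stays on the remote node.
        self.local[name] = data


home_files = {"calibration.db": b"shared batch input"}
proxy = MigratoryProxy(home_files.__getitem__)
first = proxy.read("calibration.db")
second = proxy.read("calibration.db")
proxy.write_private("stage1.out", b"intermediate")
```

Only the first read of a batch input pays the WAN cost, and intermediate pipeline data never leaves the node.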


Workflow Manager

Where workload knowledge is encapsulated

Takes workflow description
• Job dependencies
• File indicators

Runs each while taking failures into account
• Transactional management
• Proxy failure and job failure are not catastrophic (just rerun the job!)
• Proactive data replication
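
The "just rerun the job" policy over a dependency description can be sketched with Python's standard `graphlib` module (Python 3.9+). The function `run_workflow`, the job names, and the retry limit are assumptions for this example; transactional management and data replication are not modeled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(deps, run_job, max_attempts=3):
    """Run jobs in dependency order, rerunning a job when it fails.

    deps: {job: set of jobs it depends on}. Returns {job: attempts used}.
    """
    attempts = {}
    for job in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_attempts + 1):
            try:
                run_job(job)
                attempts[job] = attempt
                break
            except RuntimeError:
                continue  # not catastrophic: just rerun the job
        else:
            raise RuntimeError("job %r failed %d times" % (job, max_attempts))
    return attempts


# Hypothetical three-stage pipeline in which one stage fails once.
failures_left = {"simulate": 1}
log = []


def flaky_run(job):
    if failures_left.get(job, 0) > 0:
        failures_left[job] -= 1
        raise RuntimeError("transient failure")
    log.append(job)


attempts = run_workflow(
    {"simulate": {"generate"}, "analyze": {"simulate"}}, flaky_run
)
```

Because the manager knows the dependencies, a failed job can be rerun on its own without restarting the jobs it depends on.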


Performance

By exploiting knowledge, order of magnitude improvement over naïve approach


Outline

Overview

Knowledge and its applications
• Gray-box file placement
• Semantically-smart disks
• Scientific apps, the Grid, and I/O

Conclusions


Conclusions

The theme: Knowledge is power
• If you know how FS decides on file layout, you can control it (PLACE)
• If you know details of FS on-disk structures, you can gain FS-level knowledge behind a block-based interface (Semantic disks)
• If you know something about workloads and their I/O behaviors, you can optimize performance and handle failure gracefully


“Beware of false knowledge; it is more dangerous than ignorance.”

Bernard Shaw

http://www.cs.wisc.edu/wind