TRANSCRIPT
Wide-Area Cooperative Storage with CFS
Morris et al.
Presented by Milo Martin, UPenn
Feb 17, 2004
(some slides based on slides by authors)
Overview
• Problem: content distribution
• Solution: distributed read-only file system
• Implementation
  • Provide file system interface
  • Using a Distributed Hash Table (DHT)
  • Extend Chord with redundancy and caching
Problem: Content Distribution
• Serving static content with inexpensive hosts
  • open-source distributions
  • off-site backups
  • tech report archive
[Diagram: many inexpensive nodes serving content across the Internet]
Example: mirror open-source distributions
• Multiple independent distributions
  • Each has high peak load, low average
  • Individual servers are wasteful
• Solution:
  • Option 1: single powerful server
  • Option 2: distributed service
    • But how do you find the data?
Assumptions
• Storage is cheap and plentiful
• Many participants
• Many reads, few updates
• Heterogeneous nodes
  • Storage
  • Bandwidth
• Physical locality
  • Exploit for performance
  • Avoid for resilience
Goals
• Avoid hot spots due to popular content
• Distribute the load (traffic and storage)
  • “Many hands make light work”
• High availability
  • Using replication
• Data integrity
  • Using secure hashes
• Limited support for updates
• Self managing/repairing
CFS Approach
• Break content into immutable “blocks”
• Use blocks to build a file system
• Identify blocks via secure hash
  • Unambiguous name for a block
  • Self-verifying
• Distributed Hash Table (DHT) of blocks
  • Distribute blocks
  • Replicate and cache blocks
  • Algorithm for finding blocks (e.g., Chord)
• Content search beyond the scope
Outline
• File system structure
• Review: Chord distributed hashing
• DHash block management
• CFS evaluation
• Brief overview of Oceanstore
• Discussion
Hash-based Read-only File System
• Assume retrieval substrate
  • put(data): inserts data with key Hash(data)
  • get(h) -> data
• Assume “root” identifier known
• Build a read-only file system using this interface
  • Based on SFSRO [OSDI 2000]
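A minimal sketch of this put/get substrate (illustrative names, not DHash's actual code): a content-addressed store where a block's key is the SHA-1 hash of its contents, which makes every fetched block self-verifying.

```python
# Illustrative sketch of the assumed retrieval substrate, not the CFS implementation.
import hashlib

class ContentStore:
    """Content-addressed block store: key = SHA-1(data), so blocks are self-verifying."""

    def __init__(self):
        self.blocks = {}  # hash (hex string) -> data bytes

    def put(self, data: bytes) -> str:
        """Insert data; its key is the secure hash of its contents."""
        key = hashlib.sha1(data).hexdigest()
        self.blocks[key] = data
        return key

    def get(self, key: str) -> bytes:
        """Fetch data and verify it against the requested key."""
        data = self.blocks[key]
        if hashlib.sha1(data).hexdigest() != key:
            raise ValueError("block failed integrity check")
        return data
```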
Client-server interface
• Files have unique names
• Files are read-only (single writer, many readers)
• Publishers split files into blocks
• Clients check files for authenticity [SFSRO]
[Diagram: the FS client translates “insert file f” / “lookup file f” into block inserts and block lookups against server nodes]
File System Structure
• Build file system bottom up
  • Break files into blocks
  • Hash each block
  • Create a “file block” that points to hashes
  • Create “directory blocks” that point to files
• Ultimately arrive at “root” hash
• SFSRO: blocks can be encrypted
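A hedged sketch of the bottom-up construction on top of such a store (the block encoding and function names here are my own, not CFS's): data blocks go in first, then a file block listing their hashes, then a directory block whose hash can serve as the root.

```python
# Illustrative sketch of building the hash tree bottom up; `store` is the
# put/get substrate sketched earlier, not the real DHash layer.
import json

BLOCK_SIZE = 8192  # 8 KByte blocks, as in the CFS prototype

def publish_file(store, data: bytes) -> str:
    """Split a file into blocks, insert each, and insert a 'file block' of their hashes."""
    block_hashes = [store.put(data[i:i + BLOCK_SIZE])
                    for i in range(0, len(data), BLOCK_SIZE)]
    file_block = json.dumps({"type": "file", "blocks": block_hashes}).encode()
    return store.put(file_block)

def publish_directory(store, entries: dict) -> str:
    """Insert a 'directory block' mapping names to file-block hashes; its hash can be the root."""
    dir_block = json.dumps({"type": "dir", "entries": entries}).encode()
    return store.put(dir_block)

# Usage: root = publish_directory(store, {"README": publish_file(store, readme_bytes)})
```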
File System Example
[Diagram: a root block points to a directory block, which points to file blocks, which point to data blocks]
File System Lookup Example
• Root hash verifies data integrity
  • Recursive verification
  • If the root hash is correct, the data can be verified
• Prefetch data blocks
[Diagram: lookup descends from the root block through directory and file blocks to data blocks, verifying each hash along the way]
File System Updates
• Updates supported, but kludgey
  • Single updater
  • All-or-nothing updates
• Export a new version of the filesystem
  • Requires a new root hash
  • Conceptually rebuild file system
  • Insert modified blocks with new hashes
• Introduce signed “root block”
  • Key is hash of public key (not the hash of contents)
  • Allow updates if they are signed with private key
  • Sequence number prevents replay
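A rough sketch of the signed root-block rule, under assumed names (this is not the CFS implementation): the block is keyed by the hash of the publisher's public key, and an update is accepted only if the signature over the new root hash and sequence number verifies and the sequence number increases.

```python
# Illustrative sketch of signed, mutable root blocks; not CFS's actual code.
import hashlib

class RootBlockStore:
    """Mutable 'root blocks': keyed by Hash(public key), accepted only if the
    signature over (new root hash, sequence number) verifies and seq# increases."""

    def __init__(self):
        self.roots = {}  # Hash(public_key) -> (root_hash, seq, signature)

    def put_root(self, public_key: bytes, root_hash: str, seq: int,
                 signature: bytes, verify) -> str:
        """verify(public_key, message, signature) -> bool is supplied by the caller
        (e.g., an RSA or Ed25519 verifier); the store never needs the private key."""
        key = hashlib.sha1(public_key).hexdigest()
        message = f"{root_hash}:{seq}".encode()
        if not verify(public_key, message, signature):
            raise ValueError("bad signature")
        current = self.roots.get(key)
        if current is not None and seq <= current[1]:
            raise ValueError("stale sequence number (replay rejected)")
        self.roots[key] = (root_hash, seq, signature)
        return key

    def get_root(self, key: str):
        return self.roots.get(key)
```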
Update Example
[Diagram: a new data block forces new file, directory, and root blocks along its path, while unmodified blocks are shared between versions; the signed root block is keyed by Hash(Public Key)]
File System Summary
• Simple interface
  • put(data): inserts data with key Hash(data)
  • get(h) -> data
  • put_root(sign_private(hash, seq#), public key)
  • refresh(key): allows discarding of stale data
• Recursive integrity verification
Review of Goals and Assumptions
• Data integrity
  • Using secure hashes
• Limited support for updates
• Distribute load; avoid hot spots
• High availability
• Self managing/repairing
• Exploit locality
• Assumptions
  • Many participants
  • Heterogeneous nodes
[Annotation: data integrity and limited updates are handled at the File System Layer]
Outline
• File system structure
• Review: Chord distributed hashing
• DHash block management
• CFS evaluation
• Brief overview of Oceanstore
• Discussion
Server Structure
• DHash stores, balances, replicates, caches blocks
• DHash uses Chord [SIGCOMM 2001] to locate blocks
[Diagram: each node (Node 1, Node 2, …) runs a DHash layer on top of a Chord layer]
Chord Hashes a Block ID to its Successor
• Nodes and blocks have randomly distributed IDs
• Successor: node with next highest ID
[Diagram: circular ID space with nodes N10, N32, N60, N80, N100; each block is stored at its successor, e.g., B11 and B30 at N32; B33, B40, B52 at N60; B65, B70 at N80; B99 at N100; B112, B120, …, B10 wrap around to N10]
Basic Lookup - Linear Time
• Lookups find the ID’s predecessor
• Correct if successors are correct
[Diagram: the query “Where is block 70?” is forwarded node by node around the ring (N5, N10, N20, …, N60) until the predecessor of 70 answers “N80”]
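A toy sketch of the linear-time lookup (data structures are illustrative): each node only knows its successor, so the query hops around the ring until the target ID falls between a node and its successor.

```python
# Illustrative sketch of linear-time successor lookup on a Chord-style ring.
def find_successor(successor, start, block_id, ring_size):
    """successor: dict mapping node ID -> its successor's ID on the ring.
    Starting at `start`, hop successor to successor until block_id falls in
    the half-open interval (current, successor]; O(N) hops in the worst case."""
    def in_interval(x, a, b):
        # is x in (a, b] on the circular ID space?
        if a < b:
            return a < x <= b
        return x > a or x <= b  # interval wraps around zero

    current = start
    while not in_interval(block_id % ring_size, current, successor[current]):
        current = successor[current]
    return successor[current]

# Example ring from the slide, successors 5->10->20->32->40->60->80->99->110->5:
# find_successor({5:10, 10:20, 20:32, 32:40, 40:60, 60:80, 80:99, 99:110, 110:5}, 5, 70, 128) -> 80
```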
Successor Lists Ensure Robust Lookup
• Each node stores r successors, r = 2 log N
• Lookup can skip over dead nodes to find blocks
[Diagram: each node on the ring keeps a list of its next r successors (e.g., N5 stores 10, 20, 32; N40 stores 60, 80, 99), so a lookup can route around a failed node]
Chord Finger Table Allows O(log N) Lookups
• See [SIGCOMM 2001] for table maintenance
• Reasonable lookup latency
[Diagram: N80's finger table points to nodes ½, ¼, 1/8, 1/16, 1/32, 1/64, and 1/128 of the way around the ID space]
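A hedged sketch of why fingers give O(log N) hops (this mirrors Chord's closest-preceding-finger rule in spirit; details are simplified): forward to the known node that most closely precedes the target without overshooting it, roughly halving the remaining ID-space distance each hop.

```python
# Illustrative sketch of finger-based routing; not the Chord implementation.
def closest_preceding_finger(fingers, current, block_id, ring_size):
    """fingers: node IDs this node knows about (its finger table).
    Return the known node that most closely precedes block_id on the ring,
    so each hop roughly halves the remaining ID-space distance."""
    def dist(a, b):
        return (b - a) % ring_size  # clockwise distance from a to b

    best = current
    for f in fingers:
        # f is usable if it lies strictly between current and block_id
        if 0 < dist(current, f) < dist(current, block_id):
            if dist(f, block_id) < dist(best, block_id):
                best = f
    return best

# Usage: from N80 looking up block 47 on a 128-ID ring, pass N80's finger IDs.
```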
DHash/Chord Interface
• lookup() returns a list of node IDs closer in ID space to the block ID
  • Sorted, closest first
[Diagram: on a server, DHash calls Chord's Lookup(blockID) and receives a list of <node ID, IP address> pairs; Chord maintains a finger table of <node ID, IP address> entries]
DHash Uses Other Nodes to Locate Blocks
[Diagram: Lookup(BlockID=45) is forwarded in three numbered steps across the ring until it reaches N50, the successor of block 45]
Availability and Resilience: Replicate blocks at r successors
• Hashed IP Addr. ensures independent replica failure
• High storage cost for replication; does this matter?
[Diagram: block 17 is stored at its successor node and replicated at the next r successors on the ring; the legend distinguishes the original copy from the replicas]
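An illustrative sketch of replica placement (parameter names are mine): a block lives at its successor and the next r-1 nodes clockwise, so failures are independent as long as node IDs, derived from hashed IP addresses, are spread around the ring.

```python
# Illustrative sketch of choosing the r nodes that hold a block's replicas.
def replica_set(node_ids, block_id, r, ring_size):
    """Return the r nodes that should hold block_id: its successor plus the
    following r-1 nodes clockwise on the ring."""
    ring = sorted(node_ids)
    # index of the successor: first node with ID >= block_id, wrapping around
    idx = next((i for i, n in enumerate(ring) if n >= block_id % ring_size), 0)
    return [ring[(idx + k) % len(ring)] for k in range(min(r, len(ring)))]

# Usage: replica_set([5, 10, 20, 40, 50, 60, 68, 80, 99, 110], 17, 3, 128) -> [20, 40, 50]
```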
Lookups find replicas
• RPCs in a Lookup(BlockID=17) when a node has failed:
  1. Lookup step
  2. Get successor list
  3. Failed block fetch
  4. Block fetch
[Diagram: the lookup routes toward block 17's successor; when the fetch from the original holder fails, the client fetches the block from one of the replicas on the successor list]
First Live Successor Manages Replicas
• Node can locally determine that it is the first live successor; automatically repair and re-replicate
[Diagram: when block 17's original holder fails, the next live successor already holds a copy of 17 and re-replicates it to maintain r copies]
Reduce Overheads: Caches Along Lookup Path
• RPCs in a Lookup(BlockID=45):
  1. Chord lookup
  2. Chord lookup
  3. Block fetch
  4. Send to cache
• Send only to second-to-last hop in routing path
[Diagram: after block 45 is fetched, a copy is sent back to the second-to-last node on the lookup path, which caches it]
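A minimal sketch of path caching under assumed helper names (fetch_block and send_to_cache are hypothetical): fetch the block from the last node on the lookup path and hand a copy to the second-to-last hop, so later lookups for the same block terminate a hop earlier.

```python
# Illustrative sketch of caching at the second-to-last hop of a lookup path.
def fetch_with_path_caching(path, fetch_block, send_to_cache):
    """path: node IDs visited by the Chord lookup; the last entry holds the block.
    fetch_block(node_id) returns the block; send_to_cache(node_id, block)
    asks a node to cache it.  Only the second-to-last hop receives a copy."""
    block = fetch_block(path[-1])
    if len(path) >= 2:
        send_to_cache(path[-2], block)  # future lookups stop one hop earlier
    return block
```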
Caching at Fingers Limits Load
• Only O(log N) nodes have fingers pointing to N32
• This limits the single-block load on N32
Virtual Nodes Allow Heterogeneity
• Hosts may differ in disk/net capacity
• Hosts may advertise multiple IDs
  • Chosen as SHA-1(IP Address, index)
  • Each ID represents a “virtual node”
• Host load proportional to # v.n.’s
  • Manually controlled
  • Automatic adaptation possible
[Diagram: Node A runs virtual nodes N10, N60, and N101, while Node B runs only N5]
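A short sketch of deriving virtual-node IDs as SHA-1 over the host's IP address and an index; the exact byte encoding here is my assumption.

```python
# Illustrative sketch of virtual-node ID derivation; the encoding is assumed.
import hashlib

def virtual_node_ids(ip_address: str, count: int, id_bits: int = 160):
    """Derive `count` virtual-node IDs for one host by hashing (IP address, index).
    A better-provisioned host advertises more IDs and so receives more load."""
    ids = []
    for index in range(count):
        digest = hashlib.sha1(f"{ip_address}:{index}".encode()).digest()
        ids.append(int.from_bytes(digest, "big") % (2 ** id_bits))
    return ids

# Usage: a host with twice the capacity might call virtual_node_ids("192.0.2.1", 2)
```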
Physical Locality-aware Path Choice
• Each node monitors RTTs to its own fingers
• Pick smallest RTT (a greedy choice)
  • Tradeoff: ID-space progress vs. delay
[Diagram: during Lookup(47), a node may forward to a distant finger 100 ms away that makes more ID-space progress, or to a nearby finger only 10 ms away that makes less; similar 50 ms vs. 12 ms choices recur on the path to B47]
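A hedged sketch of the greedy, latency-aware hop choice (the scoring rule is illustrative; CFS's server-selection heuristic may differ): among fingers that still precede the target, pick the one with the smallest measured RTT.

```python
# Illustrative sketch of latency-aware next-hop choice; not CFS's actual heuristic.
def pick_next_hop(fingers_rtt, current, block_id, ring_size):
    """fingers_rtt: dict of finger node ID -> measured RTT in ms.
    Greedily pick the lowest-RTT finger that still makes forward progress
    toward block_id without overshooting it."""
    def dist(a, b):
        return (b - a) % ring_size  # clockwise distance

    candidates = [f for f in fingers_rtt
                  if 0 < dist(current, f) < dist(current, block_id)]
    if not candidates:
        return None  # no finger precedes the target; fall back to the successor
    return min(candidates, key=lambda f: fingers_rtt[f])
```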
Why Blocks Instead of Files?
• Cost: one lookup per block
  • Can tailor cost by choosing a good block size
  • Need prefetching for high throughput
  • What is a good block size?
  • Higher latency?
• Benefit: load balance is simple
  • For large files
  • Storage cost of large files is spread out
  • Popular files are served in parallel
Block Storage
• Long-term blocks are stored for a fixed time
  • Publishers need to refresh periodically
• Cache uses LRU
[Diagram: each node's disk is split between an LRU cache and long-term block storage]
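A simplified sketch of per-node block storage (the data structures are my own): long-term blocks expire after a fixed lifetime unless the publisher refreshes them, while copies picked up along lookup paths sit in an LRU cache.

```python
# Illustrative sketch of a node's block store: expiring long-term storage + LRU cache.
import time
from collections import OrderedDict

class BlockStorage:
    def __init__(self, lifetime_seconds, cache_capacity):
        self.lifetime = lifetime_seconds
        self.long_term = {}                 # key -> (data, expiry time)
        self.cache = OrderedDict()          # key -> data, in LRU order
        self.cache_capacity = cache_capacity

    def store_long_term(self, key, data):
        """Inserted blocks live for a fixed time; publishers must refresh them."""
        self.long_term[key] = (data, time.time() + self.lifetime)

    def refresh(self, key):
        """Extend a block's lifetime so it is not discarded as stale."""
        if key in self.long_term:
            data, _ = self.long_term[key]
            self.long_term[key] = (data, time.time() + self.lifetime)

    def cache_block(self, key, data):
        """LRU cache of blocks seen on lookup paths."""
        self.cache[key] = data
        self.cache.move_to_end(key)
        if len(self.cache) > self.cache_capacity:
            self.cache.popitem(last=False)  # evict the least recently used block

    def get(self, key):
        if key in self.long_term:
            data, expiry = self.long_term[key]
            if time.time() < expiry:
                return data
            del self.long_term[key]         # stale: the publisher never refreshed it
        if key in self.cache:
            self.cache.move_to_end(key)     # mark as recently used
            return self.cache[key]
        return None
```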
Preventing Flooding
• What prevents a malicious host from inserting junk data?
• Answer: not much
• Kludge: capacity limit (e.g., 0.1%)
  • Per source/destination pair
• How real is this problem?
• Refresh requirement allows recovery
Review of Goals and Assumptions
• Data integrity
  • Using secure hashes
• Limited support for updates
• Distribute load; avoid hot spots
• High availability
• Self managing/repairing
• Exploit locality
• Assumptions
  • Many participants
  • Heterogeneous nodes
[Annotation: data integrity and updates map to the File System Layer; load distribution, availability, repair, locality, and node heterogeneity map to the DHash/Chord Layer]
Outline
• File system structure
• Review: Chord distributed hashing
• DHash block management
• CFS evaluation
• Brief overview of Oceanstore
• Discussion
CFS Project Status
• Working prototype software
  • Some abuse prevention mechanisms
  • SFSRO file system client
    • Guarantees authenticity of files, updates, etc.
• Some measurements on RON testbed
• Simulation results to test scalability
Experimental Setup (12 nodes)
• One virtual node per host
• 8 KByte blocks
• RPCs use UDP
• Caching turned off
• Proximity routing turned off
[Map: 12 RON testbed hosts, including CA-T1, CCI, Aros, Utah, CMU, MIT, MA-Cable, Cisco, Cornell, NYU, and OR-DSL, with links to vu.nl, lulea.se, ucl.uk, and kaist.kr]
CFS Fetch Time for 1MB File
• Average over the 12 hosts
• No replication, no caching; 8 KByte blocks
[Graph: fetch time (seconds) vs. prefetch window (KBytes)]
Distribution of Fetch Times for 1MB
[Graph: fraction of fetches vs. time (seconds), for 8, 24, and 40 KByte prefetch windows]
CFS Fetch Time vs. Whole File TCP
[Graph: fraction of fetches vs. time (seconds), comparing a 40 KByte prefetch window against whole-file TCP]
Robustness vs. Failures
• Six replicas per block; (1/2)^6 ≈ 0.016
[Graph: fraction of failed lookups vs. fraction of failed nodes]
Much Related Work
• SFSRO (Secure file system, read only)
• Freenet
• Napster
• Gnutella
• PAST
• CAN
• …
Later Work: Read/Write Filesystems
• Ivy and Eliot
  • Extend the idea to read/write file systems
  • Full NFS or AFS-like semantics
• All the nasty issues of distributed file systems
  • Consistency
  • Partitioning
  • Conflicts
• Oceanstore (earlier work, actually)
Outline
• File system structure
• Review: Chord distributed hashing
• DHash block management
• CFS evaluation
• Brief overview of Oceanstore
• Discussion
CFS Summary (from their talk)
• CFS provides peer-to-peer r/o storage
• Structure: DHash and Chord
• It is efficient, robust, and load-balanced
• It uses block-level distribution
• The prototype is as fast as whole-file TCP
OceanStore (Similarities to CFS)
• Ambitious global-scale storage “utility” built from an overlay network
• Similar goals:
  • Distributed, replicated, highly available
  • Explicit caching
• Similar solutions:
  • Hash-based identifiers
  • Multi-hop overlay-network routing
OceanStore (Differences)
• “Utility” model
  • Implies some managed layers
  • Handles explicit motivation of participants
• Many applications/interfaces
  • Filesystem, web content, PDA sync., …
OceanStore (More Differences)
• Explicit support for updates
  • Built into the system at almost every layer
  • Uses Byzantine commit
  • Requires “maintained” inner ring
  • Updates on encrypted data
• Plaxton-tree-based routing
  • Fast, probabilistic method
  • Slower, reliable method
• Erasure codes limit replication overhead
Aside: Research Approach
• OceanStore
  • Original ASPLOS “vision paper”
  • Paints the big picture
  • Spent 3+ years fleshing it out
• Chord
  • Evolution from Chord -> CFS -> Ivy
  • More follow-on research by others
Issues and Discussion
• Kludge or pragmatic?
  • CFS’s replication
  • CFS’s caching (is a “closer” server better?)
  • CFS’s multiple virtual servers
  • CFS’s locality-based routing
• Are “root” blocks cacheable?
• Bandwidth is reasonable…
  • …what about latency?
• Block-based or file-based?
• How separate are the layers, really?