edelweiss: automatic storage reclamation for distributed programming

Post on 30-Dec-2015



TRANSCRIPT

Edelweiss: Automatic Storage Reclamation for Distributed Programming

Neil Conway, Peter Alvaro, Emily Andrews, Joseph M. Hellerstein

University of California, Berkeley

Mutable shared state

Frequent source of bugs

Hard to scale

Event Logging

• Accumulate & exchange sets of immutable events
– No mutation/deletion

• To delete: add new event “Event X should be ignored”

• Current state: query over event log

Event Logging

i_log = Set.new
d_log = Set.new

Insert(k, v): i_log << [k,v]

Delete(k): d_log << k

View(): i_log.notin(d_log, :k => :k)

Example: Key-Value Store

Mutable State

tbl = Hash.new

Insert(k, v): tbl[k] = v

Delete(k): tbl.delete(k)

View(): tbl

Update-in-place

Deletion

Set union

Compute “live” keys
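The two slide fragments above can be fleshed out into runnable Ruby. This is a minimal sketch, not the authors' code: the class names MutableKVS and EventLogKVS are made up for illustration, and plain Ruby's `reject` stands in for the `notin` operator used on the slide. It assumes each key is inserted at most once.

```ruby
require "set"

# Mutable-state version: update in place, delete destructively.
class MutableKVS
  def initialize
    @tbl = {}
  end

  def insert(k, v)
    @tbl[k] = v
  end

  def delete(k)
    @tbl.delete(k)
  end

  def view
    @tbl.dup
  end
end

# Event-logging version: both logs only ever grow (set union).
class EventLogKVS
  attr_reader :i_log, :d_log

  def initialize
    @i_log = Set.new # [key, value] insertion events
    @d_log = Set.new # keys named by deletion events
  end

  def insert(k, v)
    @i_log << [k, v]
  end

  def delete(k)
    @d_log << k
  end

  # "Live" entries: insertions whose key has no matching deletion.
  def view
    @i_log.reject { |k, _v| @d_log.include?(k) }.to_h
  end
end

kvs = EventLogKVS.new
kvs.insert(:a, 1)
kvs.insert(:b, 2)
kvs.delete(:a)
live = kvs.view # only the entry for :b is still live
```

Note that after the delete, i_log still holds both insertions; nothing is ever removed from either log, which is exactly the storage problem the rest of the talk addresses.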

Benefits of Event Logging

1. Concurrency
2. Replication
3. Undo/redo
4. Point-in-time query, audit trails

(Sometimes: performance!)

Example Applications

• Multi-version concurrency control (MVCC)

• Write-ahead logging (WAL)

• Stream processing

• Log-structured file systems

Also: CRDTs, tombstones, purely functional data structures, accounting ledgers.

Observation: Logs consume unbounded storage

Solution: Discard log entries that are “no longer useful” (garbage collection)

Challenge: Discard log entries that are “no longer useful” (garbage collection)

Traditional Approach

“No longer useful” defined by application semantics
– No framework support
– Every system requires custom GC logic
– Reinvented many times
• >25 papers propose ~same scheme!

Engineering Challenges

1. Difficult to implement correctly
– Too aggressive: destroy live data
– Too conservative: storage leak

2. Ongoing maintenance burden
– GC scheme and application code must be updated together

Our Approach

1. New language: Edelweiss
– Based on Datalog
– No constructs for deletion or mutation!

2. Automatically generate safe, application-specific distributed GC protocols

3. Present several in-depth case studies
– Reliable unicast/broadcast, key-value store, causal consistency, atomic registers

Base Data (“Event Logs”) → Query → Derived Data (“Live View”)

The queries define how log entries contribute to the view.

Goal: Find log entries that will never contribute to the view in the future.

A log entry is useful iff it might contribute to the view.

Semantics of Base Data

• Accumulate and broadcast to other nodes

• Datalog: monotonic
– Set union: grows over time

• CALM Theorem [CIDR’11]: event log guaranteed to be eventually consistent

Semantics of Derived Data

Grows and shrinks over time
– e.g., KVS keys added and removed

Hence, not monotonic

Common Pattern

Live View = set difference between growing sets

Key-Value Store

Insertions that haven’t been deleted

Reliable Broadcast

Outbound messages that haven’t been acknowledged

Causal Consistency

Writes that haven’t been replaced by a causally later write to the same key
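The first two instances of the pattern share one shape: a growing "positive" set filtered by a growing "negative" set. A small sketch in plain Ruby follows; the helper name live_view and the sample data are assumptions made for illustration, not part of Edelweiss.

```ruby
require "set"

# Live view = tuples of the growing set whose key has not yet
# shown up in the second growing set. The block extracts the key.
def live_view(grows, shrinks_by)
  grows.reject { |tuple| shrinks_by.include?(yield(tuple)) }
end

# Key-value store: insertions that haven't been deleted.
inserts = Set[[:a, 1], [:b, 2]]
deletes = Set[:a]
live_keys = live_view(inserts, deletes) { |k, _v| k }

# Reliable broadcast: outbound messages that haven't been acknowledged.
sent  = Set[[1, "hello"], [2, "world"]]
acked = Set[1]
pending = live_view(sent, acked) { |id, _payload| id }
```

Both calls use the same helper; only the meaning of the two sets changes.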

Semantics of Set Difference

X = Y – Z
– Z grows: X shrinks
– If t appears in Z, t will never again appear in X
– “Anti-monotone with respect to Z”
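The anti-monotone behaviour can be checked directly with Ruby's Set difference. This is only an illustration with made-up values; the variable names mirror the X, Y, Z of the slide.

```ruby
require "set"

y = Set[1, 2, 3]
z = Set.new

x_before = y - z # Z is empty, so X equals Y
z << 2           # Z grows ...
x_after = y - z  # ... so X shrinks: 2 drops out
y << 4           # Y may keep growing
y << 2           # re-adding 2 to Y changes nothing: 2 is still in Z
x_later = y - z  # 2 never reappears in X
```

Because Z never shrinks, the element 2 is excluded from X permanently, which is what makes the matching Y entry safe to discard.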

i_log = Set.new
d_log = Set.new

Insert(k, v): i_log << [k,v]

Delete(k): d_log << k

View(): i_log.notin(d_log, :k => :k)

Can reclaim from i_log upon match in d_log
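A hand-written version of that reclamation step might look like the sketch below. It is written in plain Ruby under the assumption that `notin` behaves like an anti-join on the key; the lambdas view and reclaim are hypothetical stand-ins for what Edelweiss would generate automatically.

```ruby
require "set"

i_log = Set.new
d_log = Set.new

# Live view: insertions with no matching deletion.
view = -> { i_log.reject { |k, _v| d_log.include?(k) }.to_set }

# Reclamation: physically drop insertions that a deletion has matched.
reclaim = -> { i_log.delete_if { |k, _v| d_log.include?(k) } }

i_log << [:a, 1] << [:b, 2] << [:c, 3]
d_log << :a << :c

before = view.call
reclaim.call
after = view.call # same view, smaller i_log
```

The view is identical before and after reclamation, which is the safety property the generated GC has to preserve. The d_log entries are still retained here; reclaiming them is a separate problem.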

Other Analysis Techniques

• Reclaim from negative notin input
– Often called “tombstones”
– E.g., how to reclaim from d_log in the KVS

• Reclaim from join input tables
• Disseminate GC metadata automatically
• Exploit user knowledge for better GC
– Punctuations [Tucker & Maier ’03]

Whole Program Analysis

• For each query q, find condition when input t will never contribute to q’s output
– “Reclamation condition” (RC)

• For each tuple t, find the conjunction of the RCs for t over all queries
– When all consumers no longer need t: safe to reclaim
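The conjunction rule can be sketched as a list of predicates, one per consuming query. Everything named here (acked_by_all, archived, reclaimable?) is hypothetical and only illustrates the "all consumers agree" condition, not how Edelweiss derives the RCs.

```ruby
require "set"

# Hypothetical state that two different queries consult.
acked_by_all = Set[1, 2] # message ids acknowledged by every receiver
archived     = Set[2, 3] # message ids already copied to an audit view

# One reclamation condition (RC) per query that reads the log.
rcs = [
  ->(msg) { acked_by_all.include?(msg[:id]) },
  ->(msg) { archived.include?(msg[:id]) }
]

# A tuple is safe to reclaim only when every RC holds for it.
def reclaimable?(tuple, rcs)
  rcs.all? { |rc| rc.call(tuple) }
end

log = [{ id: 1 }, { id: 2 }, { id: 3 }]
garbage = log.select { |m| reclaimable?(m, rcs) }
```

Only message 2 satisfies both conditions; messages 1 and 3 are each still needed by one consumer and must be kept.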

Edelweiss Input Program → Source-to-Source Rewriter → Datalog Output Program → Datalog Evaluator

– Edelweiss input program: “positive” program, no deletion or state mutation
– Source-to-source rewriter: compute RCs, add deletion rules
– Datalog output program: input program + deletion rules

Comparison of Program Size

Only 19 rules!

Takeaways

No storage management code!
– Similar to malloc/free vs. GC

Programs are concise and declarative
– Developer: just compute current view
– Log entries removed automatically

Reclamation logic and application code always in sync

Conclusions

• Event logging: powerful design pattern
– Problem: need for hand-written distributed storage reclamation code

• Datalog: natural fit for event logging
• Storage reclamation as a compiler rewrite?

Results:
– Automatic, safe GC synthesis!
– High-level, declarative programs
• No storage management code
• Focus on solving domain problem

Thank You!

Future Work: Checkpoints

• Closely related to simple event logging
– Summarize many log entries with a single “checkpoint” record
– View = last checkpoint + Query(ΔLogs)

• General goal: reclaim space by structural transformation, not just discarding data
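The checkpoint idea can be illustrated with a ledger whose view is a running balance. This is a sketch under stated assumptions: the Checkpoint struct, the sequence-numbered entries and both helper methods are invented for the example, and the talk does not say how Edelweiss would implement checkpoints.

```ruby
# A checkpoint summarizes every log entry up to and including upto_seq.
Checkpoint = Struct.new(:upto_seq, :total)

# Hypothetical ledger entries: [sequence number, amount].
log = [[1, 50], [2, -20], [3, 5], [4, 10]]

# View = last checkpoint + query over the entries logged since then.
def balance(checkpoint, log)
  delta = log.select { |seq, _amt| seq > checkpoint.upto_seq }
  checkpoint.total + delta.sum { |_seq, amt| amt }
end

# Fold the entries up to upto_seq into a new checkpoint record.
def take_checkpoint(checkpoint, log, upto_seq)
  covered = log.select { |seq, _| seq > checkpoint.upto_seq && seq <= upto_seq }
  Checkpoint.new(upto_seq, checkpoint.total + covered.sum { |_seq, amt| amt })
end

empty = Checkpoint.new(0, 0)
full_view = balance(empty, log)

cp = take_checkpoint(empty, log, 3)
trimmed = log.reject { |seq, _| seq <= cp.upto_seq } # entries 1..3 reclaimed
trimmed_view = balance(cp, trimmed)
```

Three log entries are replaced by one checkpoint record while the view stays the same, which is the "structural transformation" the slide contrasts with simply discarding data.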

Future Work: Theory

• Current analysis is somewhat ad hoc
• If program does not reclaim storage, two possibilities:
1. Program is “not reclaimable” in principle
• (Possible program bug!)
2. Our analysis is not complete
• (Possible analysis bug!)

How to characterize the class of “not reclaimable” programs?

Reclaiming KVS Deletions

• Good question
• X.notin(Y): how to reclaim from Y?
1. Y is a dense ordered set; compress it.
2. Prove that each Y tuple matches exactly one X tuple

i_log = Set.new
d_log = Set.new

Insert(k, v): i_log << [k,v]

Delete(k): d_log << k

View(): i_log.notin(d_log, :k => :k)

k is a key of i_log
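Option 1 above (compressing a dense ordered set) can be sketched with integer ranges. The example assumes the Y keys are integers such as sequence numbers; the helpers compress and covered? are hypothetical names, not Edelweiss functions.

```ruby
# Collapse runs of consecutive integers into inclusive ranges.
def compress(ids)
  ids.sort.uniq
     .slice_when { |a, b| b != a + 1 }
     .map { |run| (run.first..run.last) }
end

# Membership test against the compressed form.
def covered?(ranges, id)
  ranges.any? { |r| r.cover?(id) }
end

d_log = [1, 2, 3, 4, 5, 7, 8, 10]
ranges = compress(d_log)
```

Eight entries shrink to three ranges, and a fully dense set would shrink to a single range no matter how many keys it held. Membership answers are unchanged, so a `notin` against the compressed form yields the same view.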
