

fect lease timeouts. Sharon answered that someone within Google who understood hardware came up with a safe timeout value (which is on the order of several seconds, as I found out later). Someone asked about server replacements. Sharon answered that reboots are okay, as no state gets lost, but losing a disk would be a problem.

A Distributed File System for a Wide-Area High Performance Computing Infrastructure

Edward Walker, University of Texas at Austin

Ed Walker works in the Texas Advanced Computing Lab and uses NSF TeraGrid, a national high-performance computing infrastructure for performing large-scale engineering and scientific problems. TeraGrid currently uses GPFS crossmounts for supporting remote file sharing. But because of operating system issues, not all sites can use IBM’s GPFS, and in a survey of users in 2005, scp was cited as the most important data management tool.

Ed pointed out that many desktops are becoming computation science–capable and that the majority of links within TeraGrid participants have less than 2% utilization, so that bandwidth can be used. He then described XUFS, a userspace overlay that hooks file system calls by interposing a shared object before libc to get transparent file system redirection. XUFS has goals of location transparency (since laptops and even desktops move easily), performance, and private name space, but not file sharing, as his research has shown that scientific computing files are rarely shared (umask of 077). XUFS aggressively uses local caches, and it also performs write-on-close to sync up locally made changes with the remote copy.
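To make the write-on-close idea concrete, here is a minimal sketch (my own construction in Python; the real XUFS is a C shared object interposed ahead of libc, and the class and method names here are invented):

```python
import os
import shutil

class CachedRemoteFile:
    """Toy write-on-close cache: all writes go to a local copy, and the
    remote copy is synchronized only when the file is closed."""

    def __init__(self, remote_path, cache_dir):
        self.remote = remote_path
        self.local = os.path.join(cache_dir, os.path.basename(remote_path))
        if os.path.exists(remote_path):
            shutil.copy(remote_path, self.local)  # populate the local cache
        else:
            open(self.local, "w").close()
        self.fh = open(self.local, "a")

    def write(self, text):
        self.fh.write(text)  # hits only the local cache (fast)

    def close(self):
        self.fh.close()
        shutil.copy(self.local, self.remote)  # write-on-close: sync remote
```

A flush tool like the one XUFS provides would amount to forcing this close-time synchronization on demand after a client crash.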

In performance testing, XUFS does as well as or better than GPFS in most cases (the exception being smaller files). XUFS has a command-line tool for flushing the local write cache in case of a client crash, and automatic recovery in case of host crashes or network outages. Emin Gün Sirer asked about the lack of support for file sharing. Ed answered that out of nearly 2000 GPFS users, only one had changed the default permissions to allow group read permissions within directories.

OSDI ’06: 7th USENIX Symposium on Operating Systems Design and Implementation

Sponsored by USENIX in cooperation with ACM SIGOPS

Seattle, Washington
November 6–8, 2006

Opening Remarks

Summarized by Rik Farrow

OSDI 2006 began with record rains in Seattle, but the rain and local flooding did nothing to dampen the mood in the conference. Jeff Mogul started out with the usual summary of the number of papers submitted versus those accepted (149/27), telling us that each paper was reviewed multiple times and that shepherds helped with each accepted paper. Papers that did not meet the format required by the CFP were rejected without review. As OSDI, together with SOSP, is the top venue for publishing refereed operating-system-related papers, hopeful authors are strongly motivated to do much more than adhere to formatting.

Jeff thanked the program committee members and the people and organizations who sponsored OSDI. He pointed out that registration fees do not begin to cover the cost of the conference, but through the work of Robbert van Renesse and USENIX, enough money was raised to pay for registration fees and travel expenses for 72 students, as well as two receptions. Jeff also told us that he had set up osdi2006.blogspot.com so that summaries and comments on papers could be posted in real time during the conference.

Brian Bershad announced the winner of the SIGOPS Hall of Fame award, “Safe Kernel Extensions without Runtime Checking,” by George C. Necula and Peter Lee. The two Best Paper awards went to “Rethink the Sync” and “Bigtable: A Distributed Storage System for Structured Data.”

I found most of the papers exciting and was busy emailing links to abstracts to friends and acquaintances who I thought would likely be interested (most were). OSDI certainly has become one of my favorite conferences, being full of great information and new ideas.

LOCAL STORAGE

Summarized by Anthony Nicholson

Rethink the Sync

Edmund B. Nightingale, Kaushik Veeraraghavan, Peter M. Chen, and Jason Flinn, University of Michigan

Awarded Best Paper

The authors note that asynchronous I/O provides good user-perceptible performance, but it does not provide reliable and timely safety of data on disk. Synchronous I/O provides data safety guarantees but incurs significant overhead. Ed Nightingale presented a new model of “externally synchronous” I/O that resolves this tension by approximating the performance of asynchronous I/O while providing the data safety of synchronous I/O. The main reason synchronous I/O is slow is that applications must block until data has been written safely to disk. The authors argue that only the user, not applications, should be considered “external” to the system. Therefore, under their model, applications need not block on I/O operations but can perform other work while writes are queued. The system only blocks when some output that depends on a pending write is about to be externalized to the screen, disk, or network; in other words, when any event occurs that would make the user think that the I/O has completed.

;login: vol. 32, no. 1, p. 86
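The core of the externally synchronous model can be sketched as follows (a simplified, single-process toy of my own; the real system tracks commit dependencies inside the kernel, across processes):

```python
import os

class ExtSyncFile:
    """Toy model of externally synchronous I/O: write() never blocks;
    pending writes are forced to disk only just before any output that
    depends on them becomes visible to the user."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
        self.pending = []  # queued writes, not yet durable

    def write(self, data):
        self.pending.append(data)  # returns immediately, like async I/O

    def _commit(self):
        for chunk in self.pending:
            os.write(self.fd, chunk)
        os.fsync(self.fd)  # data is now durable, as with sync I/O
        self.pending.clear()

    def externalize(self, message):
        # User-visible output first forces the writes it depends on to
        # disk, preserving the causal ordering of synchronous I/O.
        self._commit()
        print(message)
```

Between externalizations, unrelated writes simply accumulate in the queue, which is where the batching wins described below come from.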

External synchrony preserves the same causal ordering of writes as synchronous I/O. The fact that unrelated I/O operations can be batched and overlapped without violating causal ordering is what enables the performance wins in their system. Their implementation leverages their prior work (Speculator, SOSP ’05) to track causal dependencies across multiple applications and throughout the kernel. “Commit dependencies” inside the kernel track all the processes and objects that are causally dependent on a given pending write. Commit dependencies are forwarded to applications that become tainted by uncommitted data, to ensure preservation of causal ordering. They modified the Linux ext3 file system to support external synchrony, and they compared the performance of their system to native ext3 in both asynchronous and synchronous I/O mode and to ext3 synchronous with write barriers. Their evaluation results show that ext3, even using synchronous I/O, does not guarantee data durability across crashes or power failures, but ext3 with write barriers provides the same data safety as external synchrony, although at a severe performance cost. Their performance on various file system benchmarks shows performance close to that of asynchronous ext3, while providing superior security guarantees to that of synchronously mounted ext3. The performance of ext3 using synchronous I/O is also an order of magnitude slower than that of external synchrony. Their performance on the SPECweb99 benchmark also shows that external synchrony adds minimal latency overhead as compared to asynchronous file systems.

David Anderson from Carnegie Mellon asked whether there was a corner case where an application would do an asynchronous write, see it failed, and change behavior based on that. Ed responded that in their system, by the time an application discovers a write has failed it would have already moved on, complicating recovery. He argued, however, that “failure notification” usually means kernel panic and crashing owing to hardware failure, so the user has bigger problems in that case. George Candea from EPFL asked about network latency in the MySQL benchmark from their paper. Ed said that the client and server were on the same box in their evaluation, because the SpecWeb benchmark was more concerned with network latency. George explained that he was more interested in group commit latency. In that case, Ed said, their system gets the benefit of group commit but wouldn’t be able to commit multiple transactions from the same client because of the rules of external synchrony. Micah Brodsky from MIT asked whether performance would suffer under external synchrony for high-bandwidth, sequential I/O operations. Ed said you would still see improvement, because the OS wouldn’t be blocking between operations if they weren’t dependent on each other. Micah wondered how their performance would compare to that of asynchronous I/O for such large sequential operations. Ed noted that asynchronous I/O is limited by the speed of writing to memory, and external synchrony would be similarly limited.

Type-Safe Disks

Gopalan Sivathanu, Swaminathan Sundararaman, and Erez Zadok, Stony Brook University

Gopalan Sivathanu began by noting that data on disk consists of two things: data and pointers. The structure of pointers on disk implies how disk blocks are organized via higher-level abstractions into files and directories. Unfortunately, today’s disks are pointer-oblivious. The file system knows all about the semantics of data and the disk knows about the hardware details. But because the interface between OS and disk is so constrained, little information is exchanged between the two. Everything is just reading and writing blocks. The disk doesn’t know the high-level reason for a read/write. For example, it would be nice for a RAID system to prioritize the handling of metadata blocks, because those are more important. Type-safe disks try to bridge this semantic gap. The authors propose an extended interface between the kernel and disk, to enable both type-awareness (disks tracking pointers) and type-safety (disks using pointer information to enforce constraints).

Because type-safe disks are conscious of the relationship between data and pointers, they can do such things as automatically garbage-collect data blocks that are no longer referenced by any file or directory. Offloading these tasks to the disk lets the file system component of the operating system shrink in size and complexity. The authors added additional API calls between file system and disk, such as “allocate block,” “create pointer,” and “delete pointer.” They have implemented a prototype in Linux as a pseudo-device driver and have ported the ext3 and vfat file systems to support TSD. The porting effort was minimal (approximately two person-weeks). Their case study is a security application (ACCESS: A Capability-Conscious Extended Storage System), which is disk-enforced access control. To access data, an application must provide the disk with a valid capability (an encryption key). Thus the maximum amount of data that can be exposed is that which is currently in use or cached at a higher level. Traditionally, if the OS were compromised, then all data on disk would be compromised. ACCESS establishes a security perimeter on the disk itself instead.

;login: February 2007, Conference Reports, p. 87
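A minimal sketch of the garbage-collection idea, using invented Python stand-ins for the “allocate block,” “create pointer,” and “delete pointer” calls (the real interface sits between the kernel and the disk, and tracks on-disk block addresses):

```python
class TypeSafeDisk:
    """Toy model: the disk tracks pointers between blocks and reclaims
    any block whose last incoming pointer is deleted."""

    def __init__(self):
        self.blocks = {}   # block number -> data
        self.refs = {}     # block number -> incoming-pointer count
        self.next_block = 0

    def alloc_block(self, data=b""):
        n = self.next_block
        self.next_block += 1
        self.blocks[n] = data
        self.refs[n] = 0
        return n

    def create_pointer(self, src, dst):
        self.refs[dst] += 1  # e.g. a directory entry now references dst

    def delete_pointer(self, src, dst):
        self.refs[dst] -= 1
        if self.refs[dst] == 0:
            # No file or directory references this block any more;
            # the disk can garbage-collect it without OS involvement.
            del self.blocks[dst]
            del self.refs[dst]
```

In the real system, root blocks are implicitly live; here, a block simply starts unreferenced until a pointer is created to it.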

Michael Scott from the University of Rochester asked how TSD would work with a file system that doesn’t use the hierarchical pointer design, such as a file system that stores file data in a chain of blocks interconnected by pointers. Gopalan answered that they can support such file systems because the pointers tracked by type-safe disks need not be the same as those maintained by the file system in its own metadata. Margo Seltzer from Harvard asked how this was different from the semantically smart disk work. Gopalan responded that semantically smart disks need to do a lot of operations at the disk level to infer the sort of information that type-safe disks are explicitly given through the API. Margo concluded that the two solutions are basically the same but the implementation cost is different. Gopalan disagreed.

Another questioner asked where private keys for disk blocks can be stored, if we don’t trust the operating system that can read application memory. That discussion was taken offline. Emin Gün Sirer from Cornell asked how they settled on the API between the kernel and disk. Why not just move the whole file system into the disk? Gopalan answered that because we often want to run multiple file systems (or none at all, for certain databases) at once, not all functionality can be pushed into the disk. He was not confident that their API was the most minimal interface possible.

Finally, Chad Verbowski from Microsoft Research asked how they would properly keep the parity blocks in a RAID system. Gopalan answered that if one wants to use type-safe disks, then all layers of the file system software stack must be modified, including the RAID software.

Stasis: Flexible Transactional Storage

Russell Sears and Eric Brewer, University of California, Berkeley

Rusty Sears stated that systems researchers in particular often abandon off-the-shelf storage solutions because of their poor performance and reinvent the wheel on every project that juggles a large amount of data. The goal of their project was to provide a transactional storage framework that provides good performance off the shelf but is easily extensible to meet the needs of users without requiring that applications be rewritten because they are too tied into the underlying plumbing. This saves users from having to concern themselves with details of logging, recovery, etc., unless they really want to. Stasis has the following three design principles: (1) Provide simple, thin APIs to low-level components. (2) Ensure high-level semantics via local invariants. (3) Make all module interactions explicit; this lets them place policy decisions in replaceable modules.

Russell gave an example of implementing a concurrent hash table on top of Stasis. They wrap operations with Stasis calls so that the underlying layers know how to undo the operation that is being wrapped, in case a transaction needs to be rolled back. Russell also discussed their case study: persistent objects. In other words, these are systems that support transactional updates over a series of objects. To optimize these updates, Stasis can conserve log bandwidth by only logging diffs. Also, they halve memory usage by deferring page cache updates. Low-level modules implement log updates and defer writes to the page cache in order to save memory. They implemented group commit in the log manager, and evaluation shows good performance gains. Since the log manager sees all updates that are produced, it could be extended to transparently implement automatic replication, etc. Stasis can also ensure temporal ordering of certain writes (such as external synchrony). Similar to type-safe disks, the type system of the disk could be coupled to a type system of a higher-level application, since these modules are all extensible. The authors are interested in implementing zero-copy I/O for the page file or replicating the page file on multiple servers.
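The undo-wrapping idea can be illustrated with a toy transaction (my own sketch, not the actual Stasis C API):

```python
class Transaction:
    """Toy rollback mechanism: each wrapped operation registers an undo
    callback, and abort() replays the undos in reverse order."""

    def __init__(self):
        self.undo_log = []

    def do(self, redo, undo):
        result = redo()
        self.undo_log.append(undo)  # remember how to reverse this step
        return result

    def abort(self):
        for undo in reversed(self.undo_log):
            undo()
        self.undo_log.clear()

# Usage: a transactional insert into an ordinary dict.
table = {}
txn = Transaction()
txn.do(lambda: table.__setitem__("key", 42),
       lambda: table.pop("key", None))
txn.abort()  # rolls the insert back
```

The point of the design is that the structure being wrapped (here a dict, in the talk a concurrent hash table) stays ordinary code; only the inverse operations need to be supplied.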

Compared to Berkeley DB and SQL, performance is reasonable, with the optimizations described here doubling throughput and halving required memory. There were no questions. The project Web page is http://www.cs.berkeley.edu/~sears/stasis/.



RUNTIME RELIABILITY MECHANISMS

Summarized by Geoffrey Lefebvre

SafeDrive: Safe and Recoverable Extensions Using Language-Based Techniques

Feng Zhou, Jeremy Condit, Zachary Anderson, and Ilya Bagrak, University of California, Berkeley; Rob Ennals, Intel Research Berkeley; Matthew Harren, George Necula, and Eric Brewer, University of California, Berkeley

Feng Zhou began his talk by noting that many operating systems and applications run loadable extensions. These extensions are often buggier than their hosts and execute in the same protection domain. To address this issue, the authors present SafeDrive, a language-based approach to extension safety. SafeDrive can be decomposed into two principal components: the Deputy source-to-source compiler and the run-time recovery system. Although the principles presented could be applied to other systems, the talk and the paper focused on adding type-based checking and restart capability to existing Linux device drivers. The core idea is to transform code written in C into a safe variant, addressing issues such as out-of-bound array accesses, null-terminated strings, and unions.

Previous approaches to retrofitting type safety to C, such as CCured, required the use of fat pointers, which contain both the pointer and bound information. This approach unfortunately requires changing the memory layout of data structures, making the modified extensions incompatible with their host’s binaries. Instead of using fat pointers, Deputy relies on the programmer adding lightweight annotations to header files and extension source code. Since SafeDrive relies on run-time checks, it must provide mechanisms to deal with violations. SafeDrive enforces the invariant that no driver code will execute after a failure. The SafeDrive run-time system tracks all kernel resources used by a device driver, using wrappers around kernel API functions. Each tracked resource is paired with a compensation operation that performs an undo when a fault occurs in a driver. As an example, the compensation operation for a spinlock is to release the lock.
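As a rough illustration of what such a run-time check amounts to (an invented Python analogue of my own; real Deputy annotations are attached to C pointers in headers, not passed as arguments):

```python
def checked_read(buf, count, index):
    """Toy version of the run-time check a Deputy-style bound annotation
    compiles into: 'count' stands in for an annotation declaring how many
    elements 'buf' is allowed to expose to the extension."""
    if not (0 <= index < count <= len(buf)):
        raise IndexError("annotation violation: out-of-bounds access")
    return buf[index]
```

A violation raised here corresponds to the fault that triggers SafeDrive’s recovery system and its compensation operations.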

Compared to approaches using hardware memory protection such as Nooks, SafeDrive provides finer-grained memory protection. This allows it to catch more errors at compile time and exhibit less run-time overhead.

Andrew Baumann from the University of New South Wales stated that many bugs are concurrency-related and asked whether SafeDrive can detect and recover from such errors. Feng Zhou answered that SafeDrive does not currently address this issue. Brad Karp from University College London asked whether SafeDrive handled integer overflow, specifically the case where a signed integer overflow could result in a different memory allocation than the size requested. Feng Zhou answered that SafeDrive doesn’t deal with integer overflow but that, in most cases, memory allocation routines should handle this issue. Rik Farrow asked what happens when programmers make errors writing the annotations. The answer is that the Deputy type system will catch some of the errors. Michael Swift from the University of Wisconsin asked how easy it would be to integrate this with Nooks to allow catching errors that only Nooks is able to catch. Feng Zhou said that it should be feasible to do so. Finally, someone from the University of California, Santa Cruz, asked whether SafeDrive was able to cope with complex data structures. Feng replied that by being able to deal directly with pointers, Deputy is able to do so.

BrowserShield: Vulnerability-Driven Filtering of Dynamic HTML

Charles Reis, University of Washington; John Dunagan, Helen J. Wang, and Opher Dubrovsky, Microsoft; Saher Esmeir, Technion

Charles Reis began by pointing out that vulnerabilities in browsers are dealt with by patching, but there will often be a long delay between the time a patch is released and its application. This delay opens a dangerous time window when attackers can even use the patch as a blueprint to create exploits. A previous approach, Shield, aims to provide protection equivalent to patching but with easier deployment and roll-back. Shield allows filtering of malicious static content using vulnerability signatures. BrowserShield’s goal is to provide similar protection for dynamic Web content by rewriting embedded scripts into safe equivalents on their way to the browser.

BrowserShield consists of a JavaScript library and a logic injector, which could be located at the server, at the firewall, or even as a proxy on the client. Both components are configured using flexible policies that can be tailored to address specific vulnerabilities. The injector performs a first translation, which modifies the HTML to remove exploits and wraps all embedded scripts to force them to be invoked via the BrowserShield library. The library performs a second translation during page rendering by dynamically rewriting scripts to access the HTML document tree via an interposition layer. The evaluation shows that, combined with anti-virus and HTML filtering, BrowserShield provides patch-equivalent protection for all 19 vulnerabilities in IE for which patches were released in 2005.
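A toy version of the injector’s first translation might look like this (my own sketch; `shield.eval` is an invented library entry point, and the real rewriter parses JavaScript rather than using regular expressions):

```python
import re

def inject(html, shield_call="shield.eval"):
    """Toy first translation: wrap the body of each embedded script so
    it can only execute through a policy-enforcing library function."""
    def wrap(match):
        script = match.group(1)
        # Hand the script text to the shield library as a string, so the
        # library can rewrite and police it at render time.
        return "<script>%s(%r)</script>" % (shield_call, script)
    return re.sub(r"<script>(.*?)</script>", wrap, html, flags=re.S)
```

The second, render-time translation then interposes on the wrapped script’s accesses to the document tree.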

George Candea from EPFL asked what guarantees can be provided that pages will be rendered correctly. The answer is that it’s much easier to roll back a policy than an applied browser patch when something goes wrong. BrowserShield policy only affects pages, not the browser itself. George then asked how to evaluate the policies. Charles answered that the two metrics are how easy it is to write a policy and how thoroughly policies can be tested. Jason Flinn from the University of Michigan asked whether the scripting was part of the TCB. Charles Reis said it is. Brad Chen from Google asked about specific things that the authors would like to see added to or removed from JavaScript. Charles Reis answered that a smaller API would make it easier to achieve complete interposition. Benjamin Reed from Yahoo! inquired how to debug a policy once deployed. Charles Reis answered that it’s important to first distinguish whether a problem is due to BrowserShield or due to the page or browser. A solution would be to render the page in an unprotected browser running in a virtual machine. Benjamin further asked what the core dump should look like. Charles answered that BrowserShield could be enhanced with a set of debugging policies to generate the appropriate information.

Finally, Diwaker Gupta from UCSD asked how to prevent infinite loops in the script translation. Charles answered that if the script had an infinite loop then its interpretation would fall into an infinite loop, but he didn’t think that it was possible for a script to force the translation step into an infinite loop.

XFI: Software Guards for System Address Spaces

Úlfar Erlingsson, Microsoft Research, Silicon Valley; Martín Abadi, Microsoft Research, Silicon Valley, and University of California, Santa Cruz; Michael Vrable, University of California, San Diego; Mihai Budiu, Microsoft Research, Silicon Valley; George C. Necula, University of California, Berkeley

Úlfar Erlingsson began by introducing XFI, a software-based protection system that provides safe system extensions. He proceeded with a demo of a JPEG image containing an exploit being rendered by both an unmodified JPEG library and an XFIed version of the same legacy library. The first case resulted in an application crash, whereas the XFIed version properly trapped and gracefully aborted the rendering.

The XFI implementation creates safe extensions from existing Windows x86 Portable Executables by performing binary rewriting. The rewriter adds inline guards, short machine-code sequences that perform checks at run time. Guards are placed on computed control-flow transfers, using unique labels as valid target identifiers. A guard verifies that the label is present at the target site before performing a computed jump or call. This mechanism ensures that all transfers remain within the control-flow graph. The rewriter also adds memory-access guards to ensure that computed memory accesses lie within valid memory regions. The validation uses a fast path when the address lies within bounds established at load time. Memory accesses outside of these bounds fall through to a slow path that validates the address using data structures similar to page tables. The binary is also modified to use two stacks, a scoped stack and an allocation stack. The scoped stack contains return addresses and variables that are accessed by name only. The allocation stack contains data that can be accessed using computed access. This mechanism allows the integrity of the scoped stack to be established through static verification.
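The label-checking guard can be modeled in a few lines (a toy of my own; real guards are inline x86 sequences operating on code addresses, and the constants here are invented):

```python
CALL_LABEL = 0x12345677  # unique identifier the verifier requires at targets

# A tiny "binary": address -> (label found at the target site, code there).
targets = {
    0x100: (CALL_LABEL, lambda: "render_jpeg"),
    0x200: (0x0BAD,     lambda: "useful_gadget"),  # not a valid target
}

def guarded_call(addr):
    """Inline guard for a computed call: the transfer is allowed only if
    the expected label is present at the destination."""
    label, code = targets.get(addr, (None, None))
    if label != CALL_LABEL:
        raise RuntimeError("control-flow violation trapped")
    return code()
```

Jumping to an attacker-chosen address fails the label check, which is how all computed transfers are kept inside the control-flow graph.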

Because rewriting binaries is a tricky process, XFI doesn’t require this step to be trusted. Instead, XFI relies on a trusted verifier to parse the binary at load time, ensuring that binaries have the appropriate structure and the appropriate guards. The verifier is simple and fast, and consists of only 3000 lines of straightforward code, mostly x86 decoding tables, increasing confidence in its correctness.

Bryan Ford from MIT asked how XFI ensures that the heap is not incorrectly seen to contain matching identifiers. Úlfar answered that a guard checks whether computed control flows are to targets within the binary, in addition to matching identifiers and the verifier ensuring that identifiers are unique. Jim Lawson from MSB Associates asked whether XFI rejects self-modifying code. Úlfar acknowledged this. Rik Farrow asked about binary size increase. Úlfar answered that code size can increase by a factor of two or more but that most of the additional code is either in nonexecutable verification hints or in out-of-band trampolines, which are not invoked often. Therefore the increase in code size has a minimal impact on performance and i-cache behavior.



OS IMPLEMENTATION STRATEGIES

Summarized by Andrew Miklas

Operating System Profiling via Latency Analysis

Nikolai Joukov, Avishay Traeger, and Rakesh Iyer, Stony Brook University; Charles P. Wright, Stony Brook University and IBM T.J. Watson Research Center; Erez Zadok, Stony Brook University

Nikolai presented a new approach to operating system profiling. He showed that the logarithmic distributions of an OS’s latencies can reveal most, if not all, aspects of the OS’s internal operation. He presented several methods to analyze these profiles and provided interesting examples of the sorts of conclusions that can be drawn using the profiler data. Among others, the authors were able to diagnose a locking error in the Linux kernel without having to examine the source code. Also, because the latency can be measured entirely outside of the kernel, the method can be used to analyze operating systems even if the source is not available.

The basic technique involves repeatedly measuring the times required to complete OS requests. The latency of each request depends on the path taken through the code and interactions with other processes. Therefore, the distribution of these values can be used to make inferences about an operating system’s inner workings. The latencies are used to generate histograms that can be analyzed either visually or by a provided automatic data-analysis toolset. Various events, such as being blocked on a lock, show up as spikes on the histogram. Correlations between different requests can reveal their contention on a shared resource. For example, similar spikes on the histograms of two different system calls that only appear when the two are run together suggest that they contend on the same lock.
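The histogram construction is simple to sketch (my own Python approximation of the technique, measuring from user space as the paper allows):

```python
import math
import time
from collections import Counter

def latency_profile(op, runs=200):
    """Bucket per-request latencies by log2(nanoseconds). Distinct
    events along the request path (lock waits, disk I/O, page faults)
    tend to land in separate buckets and show up as spikes."""
    hist = Counter()
    for _ in range(runs):
        start = time.perf_counter_ns()
        op()
        elapsed = max(1, time.perf_counter_ns() - start)
        hist[int(math.log2(elapsed))] += 1  # logarithmic bucket index
    return hist
```

Comparing the histogram of a system call run alone against the same call run alongside another is the correlation analysis described above: a spike that appears only in the combined run suggests contention on a shared resource.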

Although the profiler can be used without any kernel changes whatsoever, it can take advantage of kernel instrumentation points if they are available. Loadable drivers can also be used as vantage points from which to measure latencies. The authors created a tool that can patch Linux file systems to automatically produce latency data. The presented profiler adds negligible overhead. The generated profiles are usually smaller than 1 kilobyte. The profiler adds less than 200 CPU cycles per profiled call, whereas existing profilers add overhead per profiled event. For example, if a system call is trying to acquire several semaphores and to perform I/O, existing profilers would add overhead for each such event. As a result, the presented profiler is efficient. As measured with the Postmark benchmark, the system’s CPU time was affected by less than 4%.

Marcel Rosu of the IBM T.J. Watson Research Center asked whether the technique makes heavy use of CPU cycle counters in order to cheaply measure intervals of time and whether they tested their system on CPUs that use variable clocks, such as those found in notebook computers. Although they didn’t test on variable-clock CPUs, “it should be possible,” Nikolai said, “to simply apply a scaling factor to handle the cases where the CPU isn’t running at its maximum frequency. Also, remember that our system doesn’t rely on using CPU cycle counters. We could use an off-CPU high-precision timer if the CPU lacked an appropriate method of measuring intervals.” Brad Chen of Intel asked whether the profiler can be used to analyze anomalous cases, rather than just bad implementations of the main case. The paper describes a “sample profile,” where instead of folding all the latencies for a given test together, they were separated based on fixed time intervals. Nikolai’s group used this method to analyze ReiserFS’s performance during the relatively infrequent periods of time when the Linux buffer-flushing daemon bdflush was active.

CRAMM: Virtual Memory Support for Garbage-Collected Applications

Ting Yang and Emery D. Berger, University of Massachusetts Amherst; Scott F. Kaplan, Amherst College; J. Eliot B. Moss, University of Massachusetts Amherst

Ting Yang began by saying that the memory management component of an OS and the heap-management component of a garbage-collected (GC) run-time environment must cooperate with each other to ensure good performance under all memory loads. Today’s operating systems don’t provide enough VM sophistication to support GC run-times effectively when memory pressure is significant. As a result, current run-times are reactive; they only resize the heap once performance has degraded. In contrast, CRAMM is predictive; it can determine the best heap size and suggest that the run-time adjust it before the system begins to experience performance loss.

CRAMM consists of two components: an extension that sits in the kernel’s VM subsystem, and a heap-size model that exists in the run-time environment. The kernel-mode component tracks each process’s working-set size (WSS): the amount of memory that the process needs so that it does only a small amount of swapping. The model component computes the heap size that would cause the process’s WSS to just fit within the maximum amount of memory the operating system is willing to allocate. If the current heap size and the computed optimal heap size differ, the model asks the run-time to adjust the size of the heap. The system can therefore quickly respond to changes in memory pressure by reducing the size of the heap before it is swapped out, avoiding reclamation-triggered thrashing. CRAMM adds only about 1–2.5% overhead during ordinary test runs where memory is plentiful. In exchange, CRAMM dramatically improves performance in situations where memory pressure suddenly increases.

; LOGIN: FEBRUARY 2007 CONFERENCE REPORTS 91
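The predictive sizing loop might look roughly like this. This is a minimal sketch assuming a simple linear heap-to-WSS relation, which is an illustration rather than CRAMM’s actual model; all names and constants are invented:

```python
# Hypothetical sketch of CRAMM-style predictive heap sizing.  The kernel
# component reports how much memory the OS is willing to give the
# process; the run-time model picks the heap size whose predicted
# working-set size (WSS) just fits.  The linear WSS model (fixed
# overhead plus a slope per MB of heap) is an illustrative assumption.

def target_heap_size(available_mb, wss_per_heap_mb=0.8, overhead_mb=20):
    """Largest heap whose predicted WSS (overhead + slope*heap) fits."""
    return max((available_mb - overhead_mb) / wss_per_heap_mb, 1.0)

def maybe_resize(heap_mb, available_mb, slack=0.05):
    """Suggest a new heap size only if the change is worth acting on."""
    target = target_heap_size(available_mb)
    if abs(target - heap_mb) / heap_mb > slack:
        return target     # run-time should grow/shrink the heap to this
    return heap_mb        # close enough: leave the heap alone

# Memory pressure rises: the OS offers less memory, and the model
# shrinks the heap *before* the process starts to thrash.
print(maybe_resize(heap_mb=200, available_mb=120))   # suggests 125.0
```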

Chris Stewart, University of Rochester, asked whether the authors assumed that when using copying garbage collectors, the number of pages used to hold “copied survivors” does not change rapidly, and wondered whether their system therefore would be able to handle flash loads when the run-time uses copying GCs. Ting explained that they keep track of the CS value from one invocation of the GC to the next and apply a smoothing function. The maximum value ever seen for the CS value is weighted more heavily to ensure that they don’t underestimate this parameter.

Flight Data Recorder: Monitoring Persistent-State Interactions to Improve Systems Management

Chad Verbowski, Emre Kıcıman, Arunvijay Kumar, and Brad Daniels, Microsoft Research; Shan Lu, University of Illinois at Urbana-Champaign; Juhan Lee, Microsoft MSN; Yi-Min Wang, Microsoft Research; Roussi Roussev, Florida Institute of Technology

Chad presented the Flight Data Recorder (FDR), a new tool which will ship with Windows Vista that allows all changes to the persistent state of a system to be logged for later analysis. He explained that various system management tasks that up until now have been something of a black art essentially reduce to queries over the logs gathered by the FDR. As a motivating example, Chad told of a server at Microsoft that would exhibit extremely poor performance every few weeks. A system administrator eventually determined that this was because the system’s page file was being inappropriately shrunk. Unfortunately, he was unable to determine why this was happening. The best he could do was to send out email to the other admins asking them to make sure they weren’t resizing the file. However, after running the FDR for a few weeks, the logs were used to quickly pinpoint the offending script.

Currently, system management tools break down into roughly three categories. Some use a similar logging approach but, owing to space constraints, activate only on demand. These types of tools are of limited usefulness when trying to determine why a particular piece of configuration data changed. Another class of tools uses signatures to look for known-bad configurations. However, creating signatures general enough to be useful across a wide variety of machines is a time-consuming process. Finally, manifest-based approaches require that applications provide the system with a list of all configuration dependencies. However, the authors point out that the uninstall tools included in many software packages leave behind files and configuration data, suggesting that building complete manifests is impractical.

The proposed approach simply logs all changes to the system’s persistent state. The main contribution of the work is its novel method of encoding the activity logs. This method requires on average just 0.5–0.9 bytes per event. Since typical systems generate on the order of 10 million events per day, the resulting logs are small enough to be practically sent over the network, archived, correlated with other systems, and quickly queried. The authors report that they are able to execute common queries against a day’s worth of stored data in as little as three seconds. They also note that it should be possible to serve as many as 5,000 systems running the FDR with a single archive server.

The authors see the FDR as being useful not only for system management but also for ensuring that various security and management policies are being followed. For example, the logs captured by the FDR can be used to determine how often a locked-down production server is modified without proper approval. The FDR can also be used to assist in locating system “extensibility points”: configuration settings that control the loading of extra system services or plug-ins. This has important implications for detecting and removing malware. In summary, the FDR has made it possible to know about everything that is happening on a system.

Q: Can the FDR be used to predict how a system will respond to a configuration change? A: We can search for another system that already has the configuration change but is otherwise similar to the machine in question. If we can find such a machine, we can use it to approximate how this system would behave with the change. Q: What is the right way to query the logs generated by the FDR? Should we be using SQL, or perhaps a customized query engine with a programmatic API, etc.? A: Right now, we just expose the raw tables as they are in the log file. Any special-built query engine should be optimized for queries of the form “What files have been changed since time T?” since the most common requests seem to be those that look for modified configuration entries. Q: Other types of events might be worth logging. For example, did you consider logging all socket activity? A: We did think about logging IPC activities, but the system doesn’t currently implement this feature.

WORK-IN-PROGRESS REPORTS (WIPS)

Summarized by Andrew Miklas

Taking the Trust out of Global-Scale Web Services

Nikolaos Michalakis

If you were to contract out the hosting of a dynamic Web site to a number of content delivery networks, how could you be sure that every system serving your customers was running the exact software you supplied to the CDNs? This problem becomes even more severe for content dynamically generated by ordinary Internet users, where contract law might not be sufficient motivation to ensure that the distributors don’t behave maliciously.

Nikolaos is researching ways to certify that dynamic content is served correctly in an environment where the delivery systems are not fully trusted by the content providers. The basic design has clients forward a fraction of the signed responses from one server to other replicas for verification. If the verifying replica computes a different result, it publishes the erroneous signed response so that other hosts can learn of the misbehaving server.
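The spot-check design can be sketched as follows. This is an illustrative guess at the mechanics, not Nikolaos’s implementation; in particular, HMAC with a shared key stands in for the per-server public-key signatures a real deployment would need, and all function names are invented:

```python
# Hypothetical sketch of signed-response spot-checking: clients forward
# some signed responses to peer replicas, which recompute the result and
# can publish a mismatching signed response as proof of misbehavior.
import hashlib
import hmac

KEY = b"replica-key"   # illustrative shared key; the real design would
                       # use per-server public-key signatures

def sign(server_id, request, result):
    msg = f"{server_id}|{request}|{result}".encode()
    return hmac.new(KEY, msg, hashlib.sha256).hexdigest()

def serve(server_id, request, compute):
    result = compute(request)
    return {"server": server_id, "request": request,
            "result": result, "sig": sign(server_id, request, result)}

def verify(response, compute):
    """A peer replica recomputes the result.  A signed-but-wrong result
    is publishable evidence, since the server vouched for it."""
    expected = compute(response["request"])
    return (response["result"] == expected and
            response["sig"] == sign(response["server"],
                                    response["request"],
                                    response["result"]))

correct = lambda req: req.upper()          # software the CDN should run
malicious = lambda req: req.upper() + "!"  # a tampered replica

good = serve("s1", "hello", correct)
bad = serve("s2", "hello", malicious)
assert verify(good, correct)
assert not verify(bad, correct)   # evidence against server s2
```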

Information Flow for the Masses

Max Krohn, Micah Brodsky, Natan Cliffer, M. Frans Kaashoek, Eddie Kohler, Robert Morris, and Alex Yip

Today’s Web sites are serving an increasing amount of user-contributed content. They are no longer places where users go to passively download information; instead, they act as meeting points where people can exchange content. For example, consider Wikipedia, which is built on content contributed by its users.

However, sites today do not allow users to contribute to the applications running on them. Wikipedia does not allow anonymous users to patch up the MediaWiki software running on the live servers, as it does with its content. The main reason this can’t be allowed in today’s hosting environments is security; a malicious patch could be used to leak ordinarily inaccessible information.

Traz is a novel Web-hosting environment that applies Asbestos-like reasoning to Web servers. User-contributed code may be allowed to manipulate data ordinarily inaccessible to the contributing user, but it will not be allowed to leak this information back to that user. Traz therefore allows Web sites to safely execute user-provided code against privileged information. Traz runs on ordinary commodity operating systems and allows developers to write their applications using any programming language.

AutoBash: Hammering the Futz Out of System Management

Ya-Yunn Su and Jason Flinn

We’ve all experienced it: a system that isn’t performing quite the way we’d like. Maybe the video hardware doesn’t set the appropriate resolution when plugging an external monitor into a notebook. Perhaps the wireless card doesn’t reassociate with the nearest access point correctly on wake-up. No matter what the problem, we typically use the same approach to solve it: Type the symptoms into Google, find a page that describes a fix, apply the steps described in the fix, and check to see whether the problem is corrected. If not, back out the changes, and repeat. This process, which Ya-Yunn termed “futzing,” requires a substantial amount of manual intervention and can lead to serious frustration.

AutoBash is a tool that automates the futzing process. When the user notices a configuration error, he or she describes the symptoms to the tool. AutoBash will search its database for scenarios where a user started with the current configuration, applied some changes, and ended up with a new configuration that satisfies the desired description. Once a record is found, the steps required to adjust the user’s configuration will be automatically replayed. If the user is dissatisfied with the result, he or she will be given the opportunity to have the changes automatically rolled back and will possibly be presented with another solution. Finally, should the tool be unable to automatically correct the error, it will watch as the user does so manually, so that other users may benefit from the futzing done by this user.

Dynamic Software Updating for the Linux Kernel

Iulian Neamtiu and Michael Hicks

Software updates are a necessary evil. In order to apply them, a service must typically be restarted. For many applications, the downtime that must be incurred in order to restart is undesirable. Worse, updates to system software such as the kernel can necessitate a full reboot of the machine, resulting in an even longer stretch of downtime.

Ginseng, a tool worked on by the presenter, can apply updates to running user-mode services. It does this by determining strategic locations in a service’s execution where the code can be safely updated. Patching kernel code, however, is far more difficult, owing to its low-level and highly concurrent nature. Iulian described some of his work to make Ginseng able to update a live Linux kernel.

iTrustPage: Preventing Users from Filling Out Phishing Web Forms

Troy Ronda and Stefan Saroiu

The U.S. economy loses billions of dollars each year to phishing attacks. Even worse, phishing erodes the public’s trust in the Web as a platform for e-commerce. Troy claimed that many forms used to legitimately gather information originate from well-established Web sites, whereas phishing attacks are usually launched from newly created sites. Fortunately, there are a number of services that can be used to estimate the popularity, and thus the trustworthiness, of a site. Phishing pages must also appear similar to their targets. Although it is difficult to design an algorithm to compare two Web pages, Troy mentioned that a person can usually determine with ease whether one Web site is mimicking another.

These two key observations form the basis of iTrustPage, a Firefox extension that helps users avoid phishing attacks. When a user tries to fill out a form on a page that is not well established, iTrustPage stops the user from proceeding and asks the user to describe the task that he or she is trying to accomplish. Using this description, iTrustPage makes a query to Google and shows the user the Web sites associated with the first few hits. The user then indicates which Web site looks most similar to the page the user was expecting. Finally, the tool redirects the user to the organization’s legitimate Web page and away from the phishing attempt. iTrustPage is currently available for download at http://www.cs.toronto.edu/~ronda/itrustpage/.

Failures in the Real World

Bianca Schroeder and Garth A. Gibson

A major challenge in running large-scale systems is that component failure is the norm rather than the exception. Unfortunately, most work on dealing with failures is based on simplistic assumptions rather than real failure data. Bianca has been collecting and analyzing failure data from several real-world installations. The initial results indicate that many commonly used failure models are not supported by real data. For example, the probability of a RAID failure can be an order of magnitude larger, based on actual observation, than one would expect given the standard model, which uses exponentially distributed intervals between failures.

Motivated by these initial results, Bianca is continuing to collect and analyze failure data from a large variety of real-world installations. By carefully grounding new failure models in real data, researchers will be able to more accurately model a system’s response to component failure.

Dynamically Instrumenting Operating Systems with JIT Recompilations

Marek Olszewski, Keir Mierle, Adam Czajkowski, and Angela Demke Brown

Operating systems, like applications, grow more complicated each year. Unfortunately, the techniques and tools used to instrument, trace, and debug kernel-level code have not advanced as quickly as their user-mode counterparts. For example, tools such as Valgrind have made it possible to inject instrumentation into running user-mode code using JIT recompilation techniques. Because these probes are dynamically compiled directly into the surrounding code, they perform much better than traditional patch-and-redirect techniques. Unfortunately, JIT instrumentation tools are currently unable to instrument kernel code.

Marek plans to create JIT instrumentation tools that can be used with kernel code. Given the time-sensitive nature of much of a kernel, this approach may make it possible to instrument code that previously could not be probed for performance reasons. By bringing the proven benefits of JIT instrumentation to the kernel, Marek will assist systems programmers in better understanding their operating systems and ultimately will help them to produce more efficient and correct kernels.

Pattern Mining Kernel Trace Data to Detect Systemic Problems

Christopher LaRosa, Li Xiong, and Ken Mandelberg

Profilers, debuggers, and system call tracers can all be used to diagnose performance issues within a process. However, diagnosing performance problems that result from the interplay of two or more processes can be complicated. For example, determining why an X server is exhibiting poor performance can involve gathering and correlating traces from both the X server and any active X clients. Unfortunately, there are few tools to automatically correlate traces, and programmers must usually resort either to poring over the gathered data by hand or to writing ad hoc scripts.

Christopher plans to apply data-mining techniques to system-wide activity traces. Using these techniques, anomalous conditions that might impact system performance can be automatically detected and isolated, even if they span multiple processes. He provided an example involving a stock-ticker toolbar applet that unnecessarily flooded the X server with requests. His trace analyzer was able to automatically detect the excess of X calls and pinpoint their origin. He would appreciate it if DTrace or LTT users would share their hard-to-find bugs with him, so that he can test the effectiveness of his system’s automatic detection.

Spectrum: Overlay Network Bandwidth Provisioning

Dejan Kostic

Overlay networks are currently used to efficiently disseminate content. However, because of their decentralized nature, it can be difficult to ensure that there is enough outbound capacity to support all receivers, as well as to prevent other overlays from “stealing” bandwidth from higher-priority services. This presents a serious problem when overlays are used to transfer streaming media, where timely delivery of content is necessary for the system to operate correctly.

Dejan is working on algorithms to measure and disseminate bandwidth availability information throughout an overlay network. By doing so, the system can make globally optimal decisions about how much bandwidth to dedicate to a media stream. This is especially useful when the same overlay is carrying a variety of different content. For example, a BitTorrent-like transfer through the overlay might be permitted as long as it doesn’t cause anyone’s video stream to drop below a certain bit rate.

An Infrastructure for Characterizing the Sensitivity of Parallel Applications to OS Noise

Kurt B. Ferreira, Ron Brightwell, and Patrick Bridges

Many commodity operating systems do not scale well to the number of processors found in today’s supercomputers. When running such operating systems, as much as 50% of the system’s performance can be consumed by the operating system itself. For this reason, many of today’s largest supercomputers run stripped-down operating systems that impose as small an overhead as possible.

Kurt’s research seeks to understand exactly how this overhead, termed “OS noise,” affects the bottom-line performance of various scientific computing applications running on large supercomputers. He is also interested in finding ways to reduce the overheads found in ordinary operating systems in order to make them more suitable for use on large supercomputers.

Limits of Power and Latency Reductions by Intelligent Grouping

David Essary

Disk accesses are very expensive operations; entire classes of applications are limited by the I/O capability of their hosts rather than by raw processing power. Improving an I/O subsystem’s ability to quickly respond to requests can greatly improve the overall efficiency of such systems.

David’s research seeks to improve storage access time and throughput by carefully controlling how data is physically laid out on disk. Data can even be stored on multiple drives in an array to give the reading process more flexibility when deciding how to optimally read the data back. These techniques can result in a 70% reduction in disk-related latencies and energy consumption. David also discussed some of his work on predictive retrieval algorithms and how he had explored theoretical limits. Finally, he pointed out that his work to reduce seek operations not only improves performance but also reduces drive power consumption.

Distributed Filename Look-up Using DNS

Cristian Tapus, David Noblet, and Jason Hickey

One of the challenges in building a distributed file system is finding a way to locate the data and metadata associated with files. Usually, this information is replicated to provide reliability and thus might be distributed across a wide-area network. A centralized directory service is undesirable because it provides a single point of failure and may behave as a bottleneck for the system.

Cristian noted that many of the problems faced when serving file metadata and data are in fact the same as those solved by DNS. For example, both DNS and FS metadata are used to resolve names in a hierarchical name space to addresses. For these reasons, Cristian suggested using DNS itself as a localization service for the data and metadata of files in a distributed FS. Looking up a file would involve making a DNS query for a name such as “passwd.etc.mojavefs.caltech.edu.” The query would return the addresses of the replica file servers that could serve the named file. Replication of the metadata is handled automatically by the caching mechanisms built into DNS.
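The path-to-name mapping implied by the example above is straightforward to sketch. The zone name comes from the example in the summary; escaping characters that DNS labels disallow, which a real system would need, is omitted here:

```python
# Sketch of the path-to-DNS-name mapping described above: /etc/passwd in
# the mojavefs zone becomes passwd.etc.mojavefs.caltech.edu.  Path
# components are reversed so the most-specific label comes first, as in
# a normal DNS name.

def path_to_dns(path, zone="mojavefs.caltech.edu"):
    components = [c for c in path.strip("/").split("/") if c]
    return ".".join(reversed(components)) + "." + zone

print(path_to_dns("/etc/passwd"))   # passwd.etc.mojavefs.caltech.edu
```

A lookup would then be an ordinary DNS query for that name, with the answer listing the replica file servers; intermediate resolvers cache the answer, which is how the design gets metadata replication “for free.”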


Bounded Inconsistency BFT Protocols: Trading Consistency for Throughput

Atul Singh, Petros Maniatis, Peter Druschel, and Timothy Roscoe

Many protocols exist for ensuring the high availability of a system despite the potential for Byzantine failure of its components. However, these protocols have negative scaling properties: The more nodes added, the higher the performance penalties associated with keeping all of the components synchronized.

Atul proposed a solution where replicas may return results that differ slightly from the correct result. The key is that the variability of the response is bounded; a client can be sure the true value is within some range of the returned quantity. By allowing the replicas to run slightly out of sync, the overall performance of the system can be improved. This approach can be useful for applications that don’t require precise results. For example, it might be acceptable for a disk quota system to allow a user to consume at most 5% more disk space than the allotted quota.

EyesOn: A Secure File System That Supports Intelligent Version Creation and Management

Yougang Song and Brett D. Fleisch

Versioning file systems are often used to enhance the security capabilities of an operating system. Since they preserve the change history for each file, they can help system administrators both detect intrusions and roll back unauthorized changes. However, maintaining a comprehensive change history can become overwhelmingly expensive in both disk space and performance overhead.

The EyesOn system aims to preserve normal file operations and existing file structures while deferring the complexity of recovery operations to the time they are requested. EyesOn extends the same strategy used by file system journaling to record its in-memory modified data in a log without significant additional processing. EyesOn uses these logs to create file versions that can be used to accelerate the retrieval of a file’s change history. Versions are created based on user-supplied predicates that can make use of statistics stored in the log. For example, predicates can use the elapsed time since a file was last modified, the total size of the change, or whether the user has explicitly requested that a snapshot of the file be taken at this time. Two types of versions are created in EyesOn. Normal versions are created for quickly retrieving recent changes and will automatically be culled once they reach a certain age. Landmark versions are created to keep valuable information for a longer time.

Robust Isolation of Browser-Based Applications

Charles Reis

First it was online email. Next came online scheduling, photo archiving, and journaling. Today, companies have begun testing online word processors and spreadsheet applications. Eventually, it’s possible that most applications will be run on racks of systems in far-away data centers and served over the Web.

If the future of applications is the Web, then in some sense the future of operating systems is the Web browser. Although browsers may never directly interact with hardware, they certainly will fulfill other roles traditionally handled by an operating system. For example, Web browsers should ensure that a buggy or malicious script on one site doesn’t adversely affect the scripts of another. Browsers should also protect the locally stored data associated with one site from unauthorized access by scripts from another site. Charles is currently looking at ways to build these types of containment mechanisms into browsers such that changes to the server-side applications are kept to a minimum.

Stealth Attacks on Kernel Data

Arati Baliga

Rootkits use an array of impressive techniques to hide themselves from detection. Some go so far as to rewrite portions of the in-memory kernel image to perfect the illusion. New system calls might be added to render the rootkit’s processes invisible to ps. Others carefully manipulate the process lists and file system handlers to evade detection.

Arati is investigating all of the ways in which rootkits can tamper with the running kernel image by solely manipulating kernel data. She hopes to use her findings to develop monitoring systems that can’t be easily fooled. In particular, she is looking at attacks that do not employ conventional hiding techniques yet are able to cause stealth damage to the system and evade detection by state-of-the-art integrity monitoring tools.

PROGRAM ANALYSIS TECHNIQUES

Summarized by Geoffrey Lefebvre

EXPLODE: A Lightweight, General System for Finding Serious Storage System Errors

Junfeng Yang, Can Sar, and Dawson Engler, Stanford University

Junfeng Yang began his talk by noting that storage system errors are the worst kind of error, since they can result in corruption of persistent state, possibly leading to permanent loss of data. To address this issue, the authors presented EXPLODE, a system to find errors in storage systems. EXPLODE borrows ideas from model checking by being comprehensive, but instead of running the checked system inside a model checker, EXPLODE runs client-defined checkers inside the checked system. Running on a real system allows it to easily check storage stacks. A file system checker can be used to find errors in storage layers either above or below the checked system. Another advantage of this approach is that checkers are easy to write. A checker for a storage system can be written in less than 200 lines of C++ and often consists of a simple wrapping layer around existing utilities such as mkfs and fsck. The authors state that this work completely subsumes their previous work (FiSC).

Because bugs are often triggered by corner cases, EXPLODE’s core idea is to explore all choices. EXPLODE provides choose(N), an N-way fork that allows checkers to fork at every decision point during testing and explore every possible operation. Before exploring a decision point, EXPLODE checkpoints the state of the system. A checkpoint is simply the recorded sequence of return values from choose(). To restore a checkpoint, EXPLODE deterministically replays this sequence from the initial state. These two mechanisms allow EXPLODE to perform an exhaustive state exploration. At any point during testing, a checker has the ability to force crashes. Upon a forced crash, EXPLODE generates crash disks based on all possible orderings of dirty buffers. A checker-supplied routine then tests all disks for specific invariants.
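A toy version of the choose()/checkpoint mechanism helps make this concrete. This is a from-scratch sketch of the idea, not EXPLODE’s C++ implementation: a checkpoint is just the list of values choose() has returned so far, and a state is “restored” by deterministically replaying that list from the initial state.

```python
# Hypothetical sketch of EXPLODE-style exploration: choose(N) is an
# N-way fork; a checkpoint is the recorded sequence of choose() return
# values; restoring means replaying that sequence from the initial
# state.  explore() enumerates every path through the choices.

class Explorer:
    def __init__(self):
        self.prefix = []    # forced return values (the "checkpoint")
        self.trace = []     # choices made on the current run
        self.pending = []   # checkpoints whose branches are unexplored

    def choose(self, n):
        i = len(self.trace)
        v = self.prefix[i] if i < len(self.prefix) else 0
        if i >= len(self.prefix):
            # New decision point: queue the sibling branches for later.
            for alt in range(1, n):
                self.pending.append(self.trace + [alt])
        self.trace.append(v)
        return v

    def explore(self, checker):
        results = []
        self.pending.append([])           # start from the initial state
        while self.pending:
            self.prefix = self.pending.pop()
            self.trace = []
            results.append(checker(self))  # deterministic replay + extend
        return results

# Toy checker with two decision points: 2 * 3 = 6 paths in total.
def checker(ex):
    a = ex.choose(2)   # e.g., "did this write reach disk?"
    b = ex.choose(3)   # e.g., "which buffer ordering crashed?"
    return (a, b)

paths = Explorer().explore(checker)
assert sorted(paths) == [(0, 0), (0, 1), (0, 2),
                         (1, 0), (1, 1), (1, 2)]
```

In the real system the "state" being replayed lives in an actual kernel and file system, which is why EXPLODE must also force deterministic thread scheduling, as described next.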

A challenge faced by the authors was dealing with nondeterminism. The choose() primitive could be called by nonchecking code such as an interrupt handler. EXPLODE solves this problem by filtering on thread ID. EXPLODE must deterministically schedule all threads involved with the checked system. This includes the checker’s thread but also all threads belonging to the checked storage system. By adjusting thread priorities, EXPLODE is able to enforce deterministic scheduling most of the time and can detect when it fails to do so.

Junfeng then presented an evaluation of EXPLODE. The authors tested various file and storage systems, such as ext2, ext3, JFS, the VFS layer, NFS, Berkeley DB, and VMware, with EXPLODE and found bugs in all of them. Someone from UCSD asked about the option not to replay instrumented kernel functions and whether doing so could lead to nondeterminism. Junfeng answered that short traces were deterministic, since calls to these kernel functions succeed most of the time, but that long traces had nondeterminism. Someone asked whether it was possible to test subcomponents of file systems. Junfeng answered that normally file system implementations are not structured that cleanly, so it makes more sense to check a file system as a whole. Another person asked how EXPLODE handled nondeterminism created by the disk actually performing an operation out of order internally. Junfeng answered that EXPLODE uses a RAM disk, which avoids this problem.

Securing Software by Enforcing Data-Flow Integrity

Miguel Castro, Microsoft Research; Manuel Costa, Microsoft Research Cambridge; Tim Harris, Microsoft Research

Manuel Costa began his presentation by noting that most of the software in use today is written in C++. This body of software has a large number of defects, and there exist many ways to exploit these defects, such as corrupting control data. He presented some of the various approaches to securing software. He noted that removing or avoiding all defects is hard and that, although it is possible to prevent attacks based on control-data exploits, certain attacks can succeed without compromising control flow. Approaches based on tainting can prevent noncontrol-data exploits, but they may lead to false positives and have a high overhead.

To address these issues, the authors present a new approach to securing software based on enforcing Data-Flow Integrity (DFI). Their approach uses reaching-definitions analysis to compute a data-flow graph at compile time. For every load, the analysis computes the set of stores that may produce the loaded data. An ID is assigned to every store operation and, for each load, the set of allowed IDs is computed. The results of the analysis are used to add run-time checks that enforce data-flow integrity. Stores are instrumented to write their ID into the run-time definition table (RDT). The RDT keeps track of the last store to write to each memory location. Loads are instrumented to check whether the store recorded in the RDT is in their set of allowed writes. If a store ID is not in the set during a check, an exception is raised. Because the analysis is conservative, an exception guarantees the presence of an error; there are no false positives. Manuel also noted that control-flow attacks are a form of data-flow attack, so it is possible to use the same mechanism to protect control-flow data.
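The load/store instrumentation can be sketched at the interpreter level. This is a hypothetical illustration of the mechanism, not the authors’ compiler-inserted checks; the addresses, IDs, and allowed sets are invented:

```python
# Hypothetical sketch of DFI enforcement: each store writes its ID into
# a run-time definition table (RDT); each load checks that the last
# store to its address is in the load's statically computed allowed set.
RDT = {}   # address -> ID of the last store to write there
MEM = {}   # address -> value (stand-in for program memory)

class DFIViolation(Exception):
    pass

def store(addr, value, store_id):
    MEM[addr] = value
    RDT[addr] = store_id          # instrumentation added to every store

def load(addr, allowed_ids):
    # allowed_ids comes from compile-time reaching-definitions analysis.
    if RDT.get(addr) not in allowed_ids:
        raise DFIViolation(
            f"store {RDT.get(addr)} may not define address {addr:#x}")
    return MEM[addr]

# Legitimate flow: store 7 is a reaching definition for this load.
store(0x1000, 42, store_id=7)
assert load(0x1000, allowed_ids={7, 9}) == 42

# An overflow-style write from an unrelated store is caught on load.
store(0x1000, 0xDEAD, store_id=3)
try:
    load(0x1000, allowed_ids={7, 9})
except DFIViolation:
    print("data-flow integrity violation detected")
```

When the store IDs in an allowed set are renamed into a contiguous range, the set-membership test above can be replaced by a single subtraction and compare.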

Manuel described a set of opti-mizations to improve the run-time performance. The most im-portant optimization is torename store IDs so that they ap-pear as a continuous range in the

; LOGIN: FEBRUARY 2007 CONFERENCE REPORTS 97

Page 13: OSDI’06:7thUSENIX SymposiumonOperating SystemsDesignand ... · operating-system-relatedpapers, hopefulauthorsarestronglymo - tivatedtodomuchmorethan adheretoformatting. Jeffthankedtheprogramcom

definition set. The membership test can then be replaced with a subtraction and a compare. The evaluation presented demonstrates that this optimization is fundamental to the performance of DFI. On average, DFI imposes a space overhead of around 50%, and the run-time overhead ranges from 44% to 103%.
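The range-check optimization can be illustrated with a small sketch (the bit width and names here are assumptions, not the paper's implementation): once a load's allowed IDs form a contiguous range, set membership collapses to one subtraction and one unsigned compare.

```python
# Illustrative sketch: if the allowed store IDs for a load are renamed to
# the contiguous range [base, base + count), membership can be tested with
# a single subtraction followed by an unsigned comparison.

def in_renamed_range(store_id, base, count):
    # Unsigned wrap-around: if store_id < base, the subtraction underflows
    # (masked to 32 bits here) and the comparison fails.
    return ((store_id - base) & 0xFFFFFFFF) < count

# Allowed IDs renamed to the contiguous range {100, 101, 102}:
assert in_renamed_range(101, base=100, count=3)
assert not in_renamed_range(99, base=100, count=3)
assert not in_renamed_range(103, base=100, count=3)
```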

Brad Karp from University College London asked whether DFI handles function pointers and other complex program constructs. Manuel answered that DFI handles function pointers but has problems with certain pieces of software written in assembly. These problems can result in failing to detect certain DFI violations; an alternative would be to use CFI in such cases. It is important to understand that these limitations are a property of the program being instrumented, not of the attack. Bill Bolosky from Microsoft Research asked whether every store had to be instrumented. Manuel acknowledged that this was the case.

From Uncertainty to Belief: Inferring the Specification Within

Ted Kremenek and Paul Twohey, Stanford University; Godmar Back, Virginia Polytechnic Institute and State University; Andrew Ng and Dawson Engler, Stanford University

Ted Kremenek began by saying that all systems have correctness rules, such as not leaking memory, or acquiring some lock before accessing data. We can check these rules using program analysis, but missed rules lead to missed bugs. The specification of a system is the set of these rules and invariants. The problem addressed in this talk is how to find errors that violate these rules when we don't know what the rules are or, more precisely, how we can infer the specification. The talk presented a general framework for doing so, and the technique was described using an example: finding resource leaks by inferring allocators and deallocators.

There are many sources of knowledge that can be used to infer system rules, such as behavioral tendencies (i.e., programs are generally correct) and function names, but there isn't a way to bind all of this information together. To address this issue, the authors present an approach based on the Annotation Factor Graph (AFG), a form of probabilistic graphical model. This approach reduces all forms of knowledge, whether from evidence or intuition, to probabilities. The idea is to express the program properties to be inferred as annotation variables and to infer these annotations by combining scores obtained from factors. Factors represent models of domain-specific knowledge and are used to score assignments of values to annotation variables based on the belief that an assignment is relatively more or less likely to be correct.

The talk focused on how to infer resource ownership using AFGs. First, function return values and parameters are annotated. The domain for a return-value annotation variable is to return or not return ownership, and the domain for a function parameter is to claim or not claim ownership. Ted then described some of the factors used to infer resource ownership. Some use static analysis based on common programming axioms, such as that a resource should never be claimed twice; others are based on ad hoc knowledge such as function names.
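The factor-based scoring can be pictured with a toy model (everything here, from the variable name to the factor weights, is invented for illustration, not the authors' implementation): each factor scores an assignment of annotation values, the scores multiply, and normalizing yields a belief for each assignment.

```python
# Toy sketch of factor-graph-style scoring for one annotation variable:
# does the return value of a function named "alloc_buf" return ownership?

import itertools

domains = {"ret:alloc_buf": ["ro", "not-ro"]}   # "ro" = returns ownership

def name_factor(assign):
    # Ad-hoc knowledge: a name containing "alloc" suggests an allocator.
    return 0.8 if assign["ret:alloc_buf"] == "ro" else 0.2

def behavior_factor(assign):
    # Behavioral evidence (e.g., from checker runs); uniform in this toy.
    return 1.0

factors = [name_factor, behavior_factor]

def scores():
    total = {}
    for values in itertools.product(*domains.values()):
        assign = dict(zip(domains, values))
        s = 1.0
        for f in factors:
            s *= f(assign)        # factor scores multiply
        total[values] = s
    z = sum(total.values())
    return {v: s / z for v, s in total.items()}  # normalized beliefs

beliefs = scores()
assert abs(beliefs[("ro",)] - 0.8) < 1e-9  # name evidence dominates here
```

In the real system many variables and factors interact over a shared graph; this sketch only shows how heterogeneous evidence is reduced to comparable probabilities.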

The evaluation was based on inferring annotations in five projects: SDL, OpenSSH, GIMP, XNU, and the Linux kernel. The authors' technique obtained over 90% accuracy on the top 20 ranked annotations. It was also able to infer allocators unknown to or misclassified by Coverity Prevent. Using their technique, the authors found memory leaks in all five projects. Ted described a complex memory leak they were able to find in the GIMP library.

Someone from Yahoo! asked how well the tool performs when allocators that are simply wrappers around malloc are factored out. Ted answered that although many allocators call malloc, the tool doesn't use this knowledge. Brad Chen from Google asked whether the tool would be confused by realloc. Ted answered that the tool did infer that realloc was both an allocator and a deallocator. Corner cases such as this one are detected automatically but have to be dealt with separately; they had a similar issue with certain functions in the Linux kernel. Such outliers have to be modeled explicitly but are fairly rare. Micah Brodsky from MIT asked whether AFGs are an instance of Bayesian networks. Ted answered that AFGs and Bayes nets belong to the same family of probabilistic models. The authors actually started with Bayes nets but found the class to be too rigid; AFGs were much easier to work with. Someone asked what other things could be modeled. Ted answered that locks are very behavioral, so their approach would work well, and that the temporal relationship could be expressed as a grammar.



DISTRIBUTED SYSTEM INFRASTRUCTURE

Summarized by Anthony Nicholson

HQ Replication: A Hybrid Quorum Protocol for Byzantine Fault Tolerance

James Cowling, Daniel Myers, and Barbara Liskov, MIT CSAIL; Rodrigo Rodrigues, INESC-ID and Instituto Superior Técnico; Liuba Shrira, Brandeis University

The authors address the problem of building reliable client-server distributed systems. They note that the current state of the art either requires a small number of replicas (3f+1) but has high communication overhead, or reduces communication complexity by requiring a much larger number of replicas (5f+1); the second case still suffers from degraded performance under write contention. They propose a hybrid scheme that combines the best aspects of both to achieve a low number of replicas (3f+1) while bounding communication overhead.

Their system uses a two-phase write protocol. First, a client obtains a timestamp grant from each replica; this grant is essentially a promise to execute the given operation at a given sequence number, assuming agreement from a quorum of replicas. In the second phase, the client forms a certificate from 2f+1 matching grants and sends this certificate to all the replicas, which then complete the write operation. A certificate proves that a quorum of replicas has agreed to a given ordering of operations; importantly, the existence of a certificate precludes the existence of a conflicting certificate. Replicas are forbidden to have two outstanding grants in progress, and while a grant is in progress they return the currently outstanding grant to clients as proof that they are busy. The authors have deployed both their Hybrid Quorum (HQ) and BFT prototypes on Emulab. Their results show that HQ performs better than BFT up until around 25% contention.
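The certificate-formation step of the write protocol can be sketched as follows (a simplified illustration; the function and field names are invented, and real grants would carry signatures):

```python
# Simplified sketch of HQ's second phase: the client collects timestamp
# grants from replicas and forms a certificate once 2f + 1 of them match.

from collections import Counter

def form_certificate(grants, f):
    """grants: one (sequence_number, operation_hash) promise per replica.
    Returns a certificate dict if 2f + 1 grants match, else None."""
    needed = 2 * f + 1
    for grant, votes in Counter(grants).items():
        if votes >= needed:
            return {"grant": grant, "votes": votes}  # proof of agreed order
    return None  # contention: the protocol falls back to conflict resolution

f = 1  # tolerate one Byzantine fault; 3f + 1 = 4 replicas
grants = [(5, "op-A"), (5, "op-A"), (5, "op-A"), (6, "op-B")]
cert = form_certificate(grants, f)
assert cert is not None and cert["grant"] == (5, "op-A")
```

Because any two quorums of size 2f+1 among 3f+1 replicas intersect in at least f+1 replicas (at least one of them honest), two conflicting certificates cannot both exist, which is the property the talk emphasized.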

Atul Adya from Microsoft Research noted that their protocol allows clients to commit operations on behalf of other clients and asked whether this was a security hole. James said that since certificates are free-standing and cryptographically signed, any client can send a certificate to a replica and it will be committed faithfully on behalf of the originating client. Bill Bolosky, also from MSR, noted that commit could be painful because potentially thousands of operations occur per second. James noted that all data lives completely in RAM in their experiments, and a reboot is considered a failure. Petros Maniatis from Intel Research asked why the authors chose to do a two-phase protocol with 3f+1 replicas rather than a one-phase protocol with 5f+1 replicas. James noted that they could have done so but decided that 5f+1 is just too many. The last question asked whether colluding replicas could simply always tell clients that they had an outstanding grant, to force the slow path of the protocol. James responded that yes, in the worst case this would happen.

BAR Gossip

Harry C. Li, Allen Clement, Edmund L. Wong, Jeff Napper, Indrajit Roy, Lorenzo Alvisi, and Michael Dahlin, University of Texas at Austin

The motivating scenario for this talk is a live, peer-to-peer streaming media application. Such applications can fall apart in the presence of malicious nodes, or if it is not in a node's self-interest to forward a data packet on to its peers. The authors introduce BAR Gossip, a gossip protocol that makes it in the interest of selfish nodes to act for the benefit of the overall system, while detecting and punishing Byzantine (malicious) nodes. BAR Gossip is in fact two protocols: balanced exchange and optimistic push.

During balanced exchange, in each round every node selects one partner; the partners exchange their transaction histories and then trade equal numbers of data updates. Clients must commit to their histories before discovering their partner's history; similarly, they must first send the data updates in encrypted form before subsequently sending the key. This makes bandwidth a "sunk cost," so it is illogical for selfish nodes to withhold the data key from their partner later on. The authors also introduce the notion of verifiable partner selection to prevent clients from talking to more than one partner per round (to accumulate more updates).
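The "ciphertext first, key later" incentive can be sketched with a toy exchange (the `Peer` class and XOR-stream cipher here are illustrative stand-ins, not the BAR Gossip wire protocol):

```python
# Illustrative sketch of one side of a balanced exchange: a peer sends the
# requested updates encrypted under a fresh key, and only reveals the key
# afterward. Having already spent the bandwidth, a selfish peer gains
# nothing by withholding the key.

import os
from hashlib import sha256

def xor_encrypt(data: bytes, key: bytes) -> bytes:
    # Toy stream cipher: XOR with a hash-derived keystream (self-inverse).
    stream = sha256(key).digest() * (len(data) // 32 + 1)
    return bytes(a ^ b for a, b in zip(data, stream))

class Peer:
    def __init__(self, updates):
        self.updates = updates            # list of (update_id, payload)

    def offer(self, wanted):
        """Phase 1: ship the wanted updates, encrypted; key comes later."""
        self.key = os.urandom(16)
        to_send = [u for u in self.updates if u[0] in wanted]
        return [(uid, xor_encrypt(data, self.key)) for uid, data in to_send]

    def reveal_key(self):
        """Phase 2: bandwidth is already sunk, so withholding is irrational."""
        return self.key

a = Peer([("u1", b"frame-1")])
b = Peer([("u2", b"frame-2")])
blobs_for_b = a.offer(wanted={"u1"})
blobs_for_a = b.offer(wanted={"u2"})
# Keys are exchanged only after both ciphertext batches have arrived:
uid, blob = blobs_for_a[0]
assert xor_encrypt(blob, b.reveal_key()) == b"frame-2"
```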

The second part of BAR Gossip is optimistic push, which handles bootstrapping for new nodes. In this case, unequal numbers of updates may be exchanged, but the lesser peer must sacrifice the same amount of energy and bandwidth by sending junk to even out the exchange. Their simulation results show that following the protocol was the most beneficial strategy for clients in all cases.

Rob Sherwood from Maryland asked about cases where clients assign different weights to inbound and outbound traffic. Harry argued that such asymmetry can be handled by sizing key requests accordingly. Rob responded that in cases where one is downloading illegal content, it might be overwhelmingly important never to upload anything. Harry countered that in such extreme cases BAR Gossip might not work. Jim Liang from UIUC



noted that the authors assume that all clients know the complete membership list. Harry said that they are currently looking at handling cases where clients have only partial membership information. Jim further asked how the BAR Gossip protocol achieves low latency for streaming media. Harry said their experiments showed an average latency of 20 seconds for a live video stream, which compares favorably with existing streaming media applications (such as the free NCAA Final Four video feed). Another questioner asked how their protocol handles collusion. Harry replied that they have no explicit mechanism for this, because handling collusion in game theory is difficult; their results showed, however, that BAR Gossip is robust against small colluding groups (with up to 30% of nodes colluding). The last question concerned the scalability of BAR Gossip. Harry cited three limitations in scaling their system: (1) handling dynamic membership churn, (2) handling nodes with only partial membership information, and (3) locality-aware partner selection. They are currently looking at all three issues.

Bigtable: A Distributed Storage System for Structured Data

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Google, Inc.

Awarded Best Paper

Mike Burrows described Bigtable, a storage system designed for in-house use at Google. Bigtable developed in response to a need to store information on over 10 billion URLs, with wide variety in the size of objects associated with a given URL and in usage patterns. Commercial databases are obviously unsuitable owing to the scale (petabytes of data on billions of objects, with thousands of clients and servers). Bigtable stores data in a three-dimensional, sparse, sorted map: data values are located by their row, column, and timestamp.

Since these tables can be quite large, Bigtable allows dynamic partitioning by row into tablets, which are distributed over many Bigtable servers. Clients can manage locality by choosing row keys so that data that should be grouped together achieves spatial locality. A given tablet is owned by one server, but load balancing can shift tablets around. Similarly to tablets, locality groups are partitions by column instead of row; these, however, segregate data within a tablet. All data is stored as files in the Google File System (GFS), with different locality groups stored as different files in GFS. One master Bigtable server controls the many tablet servers. Clients access a tablet by requesting a handle from the Chubby lock service (described in another OSDI talk), then reading and writing directly with the tablet server that owns the given tablet. The client talks to the master only when it needs to manipulate metadata (e.g., to create a table or manipulate ACLs). Currently, Bigtable is deployed on over 24,000 machines in approximately 400 clusters and is used by over 60 projects at Google.
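The (row, column, timestamp) data model and the role of row-key choice can be pictured with a toy in-memory table (this `TinyTable` is an illustration of the model only, not Google's API):

```python
# Toy sketch of Bigtable's data model: a sparse, sorted map from
# (row, column, timestamp) to a value. Keeping row keys sorted is what
# lets related rows land in the same tablet; choosing keys carefully
# (e.g., reversed hostnames) clusters related data.

import bisect

class TinyTable:
    def __init__(self):
        self.cells = {}   # (row, column) -> {timestamp: value}
        self.rows = []    # sorted row keys, the basis for range scans

    def put(self, row, column, ts, value):
        self.cells.setdefault((row, column), {})[ts] = value
        i = bisect.bisect_left(self.rows, row)
        if i == len(self.rows) or self.rows[i] != row:
            self.rows.insert(i, row)

    def get(self, row, column):
        versions = self.cells.get((row, column), {})
        return versions[max(versions)] if versions else None  # latest ts

t = TinyTable()
# Reversed-hostname row keys keep pages from one domain adjacent:
t.put("com.example/index", "contents:", ts=2, value="<html>v2</html>")
t.put("com.example/index", "contents:", ts=1, value="<html>v1</html>")
assert t.get("com.example/index", "contents:") == "<html>v2</html>"
```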

The first question concerned the poor random-read performance described in the paper, which the authors attributed to shared network links; the questioner added that it seems an easy optimization to move tablet servers closer to the GFS storage on which the tablets actually reside. Mike answered that, by default, if a GFS server is collocated with the tablet server, the tablet's data will be stored on that GFS server. Atul Adya from Microsoft Research asked how they deal with hierarchical data: does this generate thousands of columns in their table paradigm? Mike noted that Bigtable is not a database, and such hierarchical data situations are not suited to Bigtable; their clients need to conform to how Bigtable works, not the other way around. George Candea from EPFL asked whether they had any technical insights to offer based on their experiences. Mike said that if you have the freedom to do so, build something with a custom API that matches the needs of both clients and users.

DISTRIBUTED SYSTEMS OF LITTLE THINGS

Summarized by Anthony Nicholson

EnsemBlue: Integrating Distributed Storage and Consumer Electronics

Daniel Peek and Jason Flinn, University of Michigan

Daniel Peek described EnsemBlue, a framework for integrating consumer electronic devices (CEDs) into commodity distributed file systems (DFSes). This has been difficult because of the closed nature of CEDs: such devices cannot simply run the DFS's client software to integrate their storage with all the user's other computing devices. Instead, Dan described how EnsemBlue leverages the user's general-purpose computers (such as desktops and laptops) to act as a bridge between the DFS and each CED that connects to the computer (e.g., when an iPod syncs with a user's desktop machine).

A key challenge here involves namespace conflicts between the DFS and the proprietary naming structures found on CEDs. EnsemBlue handles this by tracking



mappings between the name of an object in the DFS and its name on each given CED. Since the CEDs form a closed system, the general-purpose computers must execute all custom code in the system; these computers therefore need to know when data is updated, so they can take actions such as updating custom indexes on CEDs. The authors leverage the fact that every DFS already has a distributed notification protocol: the cache-consistency mechanism. They therefore introduce the concept of a "persistent query," an object in the DFS that indicates what the query is looking for, such as new mp3 files added to the DFS. The authors presented an example of how a persistent query for all new m4a files could be used to implement a transcoder from m4a to mp3 files. Finally, the authors described how they handle disconnected devices that cannot speak with the general file server. In such situations, several of the user's devices might be able to contact each other but not the remote file server; one of the devices then becomes a "pseudo-file server," acting as a file server to the best of its ability and serving to the other devices those files that it happens to have at the time.
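The persistent-query idea can be sketched as follows (the class, its fields, and the transcoding stand-in are all invented for illustration; EnsemBlue's actual query objects live inside the DFS itself):

```python
# Illustrative sketch of a persistent query: an object that describes what
# to watch for, driven by the DFS's cache-consistency notifications and
# executed by a general-purpose computer on behalf of a closed CED.

class PersistentQuery:
    def __init__(self, predicate, action):
        self.predicate = predicate   # e.g., "is this a new .m4a file?"
        self.action = action         # e.g., transcode m4a -> mp3
        self.matches = []

    def on_invalidation(self, path):
        """Called from the DFS cache-consistency callback whenever an
        object changes; runs the action on matching files."""
        if self.predicate(path):
            self.matches.append(self.action(path))

transcoder = PersistentQuery(
    predicate=lambda p: p.endswith(".m4a"),
    action=lambda p: p[:-4] + ".mp3",  # stand-in for real transcoding
)
transcoder.on_invalidation("/dfs/music/song.m4a")
transcoder.on_invalidation("/dfs/docs/notes.txt")
assert transcoder.matches == ["/dfs/music/song.mp3"]
```

The design point worth noting is the reuse: rather than inventing a new notification channel, the existing cache-consistency traffic doubles as the trigger mechanism.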

One questioner asked how this work would fit into the Universal Plug and Play (UPnP) initiative. Dan responded that currently, EnsemBlue requires read and write access to the device (through USB, for example); their future work will allow them to work with arbitrary protocols such as UPnP. Jawwad Shamsi from Wayne State University asked whether they require a separate general-purpose device for each mobile device. Dan answered that any number of CEDs can connect to any number of general-purpose computers. Christopher Stewart from the University of Rochester asked whether they had encountered any performance tradeoffs in building the protocol that interacts with the dedicated host machine. Dan responded that they hadn't measured the performance of integrating data back into the DFS through the general-purpose computer, but since their system is weakly consistent anyway, such performance would not be that important.

Persistent Personal Names for Globally Connected Mobile Devices

Bryan Ford, Jacob Strauss, Chris Lesniewski-Laas, Sean Rhea, Frans Kaashoek, and Robert Morris, Massachusetts Institute of Technology

Currently, locating users' personal devices over the Internet is difficult. Local discovery protocols such as Bonjour don't work over long distances, and requesting DNS names for all of one's devices is impractical at best. Bryan Ford argued that people should be able to name their devices in a personal fashion and have global connectivity with each other's devices, without having to keep changing the way the devices locate each other. Their proposed solution is called UIA (Unmanaged Internet Architecture).

In UIA, users manage their own personal namespaces. When they acquire a new device (such as a phone, laptop, or camera) they assign a name to the device and "introduce" it to their existing devices. Each device has a unique endpoint identifier (EID), which is just a hash of its public key. All clients running UIA belong to an overlay that lets this namespace exist atop IP. Users assign short personal names to identify their friends; other users' devices can then be named by the combination of friend name and device name, and devices gossip to propagate their name records. Friendship is transitive, so if Alice knows Bob directly, and Bob knows Charlie, Alice can name Charlie's phone as Phone.Charlie.Bob. The authors have implemented UIA on Linux, Mac OS X, and the Nokia 770 tablet; a UIA name daemon and router run atop the TCP/IP stack in userspace, with some GUI controls as well. Bryan also showed an entertaining demo video that highlighted the ease of use of the UIA paradigm. Their code is available for download (in a rough state!) on their project Web page: http://pdos.csail.mit.edu/uia/.
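The transitive resolution of a name like Phone.Charlie.Bob can be sketched with a toy resolver (the data layout and EID strings are invented; real EIDs are public-key hashes propagated by gossip):

```python
# Toy resolver for UIA-style personal names: from Alice's viewpoint,
# "Phone.Charlie.Bob" walks friend links right-to-left (Alice -> Bob ->
# Charlie) and then looks up the device name in Charlie's namespace.

namespaces = {
    # user -> (friends: name -> user, devices: name -> device EID)
    "alice":   ({"Bob": "bob"}, {}),
    "bob":     ({"Charlie": "charlie"}, {"Laptop": "eid-f00d"}),
    "charlie": ({}, {"Phone": "eid-beef"}),
}

def resolve(viewer, dotted_name):
    parts = dotted_name.split(".")        # e.g., ["Phone", "Charlie", "Bob"]
    device, friend_path = parts[0], parts[1:]
    user = viewer
    for friend in reversed(friend_path):  # walk Bob, then Charlie
        friends, _ = namespaces[user]
        user = friends[friend]
    _, devices = namespaces[user]
    return devices[device]                # the named device's EID

assert resolve("alice", "Phone.Charlie.Bob") == "eid-beef"
```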

Mark Aiken from Microsoft Research asked how this system manages connectivity to devices that move while being communicated with. Bryan answered that in UIA's routing protocol, devices track other devices in their local social neighborhood and hope to find a few devices that are stable enough to serve as rendezvous points for disseminating current IP address information; these stationary nodes help bootstrap what is essentially a distributed DNS scheme. Another questioner asked whether the authors had conducted any usability studies of their system. Bryan answered that they hadn't had any users outside their research group. Mahur Shah from HP noted that all the examples in the talk deal with devices that are fully owned by one user, and wondered how this would work for shared devices (in a family setting, for example). Bryan noted that they discuss this in the paper: each physical device can have multiple EIDs and go by different names.



A Modular Network Layer for Sensornets

Cheng Tien Ee, Rodrigo Fonseca, Sukun Kim, Daekyeong Moon, and Arsalan Tavakoli, University of California, Berkeley; David Culler, University of California, Berkeley, and Arch Rock Corporation; Scott Shenker, University of California, Berkeley, and International Computer Science Institute (ICSI); Ion Stoica, University of California, Berkeley

Cheng Tien Ee described work at Berkeley on a modular network layer for sensor networks. The authors recognize that sensornet software from different organizations does not interoperate easily; the specific problem they address is monolithic, vertically integrated network stacks. Intuitively, one would expect that if such network layers were modularized, there would be a good deal of overlap among different implementations. Since the authors argue that we are probably stuck with multiple network protocols, they focus on making it as efficient as possible to run multiple network protocols at once on one system.

Their solution decomposes the network layer into modules. First, they split the network layer into the data plane and the control plane. Each of these is further subdivided into components, such as an output queue and a forwarding engine in the data plane, and a routing engine and routing topology in the control plane. They show how diverse protocols can actually share components, such as output queues, resulting in run-time benefits and code reuse. Their evaluation of several common protocols showed that protocol-specific code made up a small fraction of the total code base for their implemented examples.
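The sharing of data-plane components between protocols can be sketched as follows (the interfaces here are invented; the real system is written for TinyOS-class hardware, not Python):

```python
# Illustrative sketch of the decomposition: two routing protocols supply
# their own control-plane routing engines, but share a single output
# queue and forwarding engine in the data plane.

from collections import deque

class OutputQueue:                   # shared data-plane component
    def __init__(self):
        self.q = deque()
    def enqueue(self, pkt):
        self.q.append(pkt)

class Forwarder:                     # shared forwarding engine
    def __init__(self, queue):
        self.queue = queue
    def forward(self, pkt, routing_engine):
        # The protocol-specific part is only the next-hop decision.
        pkt["next_hop"] = routing_engine.next_hop(pkt["dst"])
        self.queue.enqueue(pkt)

class TreeRouting:                   # protocol-specific routing engine
    def next_hop(self, dst):
        return "parent"              # collection tree: always toward root

class FloodRouting:                  # a second protocol, same interface
    def next_hop(self, dst):
        return "broadcast"

shared_q = OutputQueue()
fwd = Forwarder(shared_q)
fwd.forward({"dst": "sink"}, TreeRouting())
fwd.forward({"dst": "all"}, FloodRouting())
assert [p["next_hop"] for p in shared_q.q] == ["parent", "broadcast"]
```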

Matt Welsh from Harvard was concerned about the interplay among different network protocols with regard to such things as packet scheduling and memory usage. Cheng responded that memory management is indeed a cross-layer issue and that they are currently looking at dealing with such effects. Matt also asked whether the code was available, and Cheng said they are currently attempting to integrate their work into TinyOS. Eddie Kohler from UCLA asked about the types of protocols that would fit this model less well than the examples the authors chose. Cheng answered that they can decompose any class of protocols, but the main difficulty with more complex protocols is decomposing routing engines (REs) and forwarding engines (FEs) into smaller ones that can be better reused. An example of such a protocol is one with multiple phases during the forwarding of a packet along its path; for such protocols, it is not immediately clear how the further decomposition can be done. An indication of an improperly decomposed network protocol would be multiple similar functions, such as packet-forwarding methods, being implemented within one type of component. The last questioner noted that there are protocols they can't support because of SP (Sensornet Protocol); for example, anything with rate limiting can't be represented to SP. To the question of whether there is anything the authors would have wanted from the lower-level abstractions that they don't currently supply, Cheng replied that he didn't need anything else from SP and, on the contrary, found it often provided more information than necessary.

OPERATING SYSTEM STRUCTURE

Summarized by Leonid Ryzhyk

Making Information Flow Explicit in HiStar

Nickolai Zeldovich and Silas Boyd-Wickizer, Stanford University; Eddie Kohler, University of California, Los Angeles; David Mazières, Stanford University

Nickolai Zeldovich stated that the HiStar operating system aims to prevent malicious and buggy software from leaking sensitive user data by making all information flow within the system explicit. The HiStar kernel implements six types of objects used as building blocks for user-level software: containers, segments, address spaces, threads, gates, and devices. Each object is assigned a security label that controls how the object can be modified or observed and can be thought of as a taint. Data in a tainted object can be accessed only by other tainted objects, and any data that flows outside the system has to be untainted first. Untainting can be performed only by a thread that holds an untaint label.

Coming up with the right design for taint tracking can be difficult if one wants to avoid covert channels. In a naive design, malicious applications could communicate by modifying and observing the taint levels of different objects; HiStar closes this covert channel by making all non-thread object labels immutable. To avoid covert channels arising from resource allocation, HiStar provides a specialized IPC abstraction in which the client donates initial resources to the server.
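The flavor of taint-label checking can be conveyed with a heavily simplified sketch (HiStar's labels carry per-category levels; here they are reduced to plain sets of taint categories, and all names are invented):

```python
# Much-simplified sketch of taint checks in the HiStar spirit: a thread
# may observe an object only if the thread carries every taint category
# on the object, and only a thread holding untaint privilege for a
# category may remove (declassify) it.

def can_observe(thread_taint, obj_taint):
    # Information may flow to the thread only if the thread is at least
    # as tainted as the object.
    return obj_taint <= thread_taint

def untaint(data_taint, category, privileges):
    if category not in privileges:
        raise PermissionError(f"no untaint privilege for {category!r}")
    return data_taint - {category}

user_secret = {"alice-private"}
scanner_thread = {"alice-private"}   # a virus scanner runs tainted
network_daemon = set()               # untainted: allowed to talk to the net

assert can_observe(scanner_thread, user_secret)
assert not can_observe(network_daemon, user_secret)
# Only a thread with the matching privilege may declassify:
assert untaint(user_secret, "alice-private",
               privileges={"alice-private"}) == set()
```

This also illustrates the superuser point from the talk: a "root" thread is just one holding untaint privilege for every category, rather than something the kernel trusts unconditionally.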

Flexibility is achieved by introducing multiple categories of taint. For example, UNIX users can be emulated by assigning a taint category to each user. A superuser can then be implemented as a thread holding untaint privilege for all user categories. As a result, root has no special privileges in the system and is not fundamentally trusted by the kernel.

Nickolai illustrated the HiStar architecture using two case studies. First, a running example of a virus scanner was used to introduce the taint-based access-control model and to demonstrate how HiStar allows encapsulating application-specific security policies in a separate component. Second, he described an implementation of the UNIX authentication process based on an untrusted authentication service.

The main group of questions related to the HiStar resource-management policy and to dealing with different types of covert channels. Nickolai explained that static preallocation of resources is preferred in cases where you really want tight control over information flow, but in less-sensitive applications that is likely to be too restrictive. With regard to covert channels, someone asked how HiStar dealt with latency-based covert channels. Nickolai replied that although they were working on some ideas, the current implementation did not prevent such channels.

Another question was whether the authentication service could leak the password by denying and allowing login requests. The answer was that this could not happen, since every request is essentially handled by a newly forked instance of the authentication service, which is not allowed to communicate any data back to the original instance of the service. Another interesting question was whether HiStar could accommodate legacy applications requiring communication with the outside world. Nickolai said that HiStar could accommodate most legacy applications, but it may not be able to provide any added security if the application is monolithic and requires frequent network interaction.

Splitting Interfaces: Making Trust Between Applications and Operating Systems Configurable

Richard Ta-Min, Lionel Litty, and David Lie, University of Toronto

In a modern OS, the trusted computing base (TCB) of an application includes not only the kernel but also all the privileged user-level services that run under the root account. By compromising one of these services, an attacker gets access to all sensitive data in the system.

Proxos aims to reduce the TCB by isolating sensitive applications inside their own private instances of the OS running under the control of a virtual machine. All other applications run inside the public "commodity" OS, which is also used for communication among secure applications running inside private OSes. This is achieved by selectively routing some system calls issued by secure applications to the commodity OS; the developer specifies which calls should be routed using the Proxos routing language. To benefit from the Proxos architecture, secure applications need to be split into components running inside the commodity OS and components running inside the private OS.
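The system-call routing idea can be sketched as follows (the rule syntax below is invented for illustration and is not the actual Proxos routing language):

```python
# Illustrative sketch of Proxos-style system-call routing: developer-
# specified rules decide, per call and per argument, whether the private
# OS handles the call itself or forwards it to the commodity OS.

rules = [
    # (syscall name, predicate on the argument, destination)
    ("open", lambda path: path.startswith("/secrets/"), "private"),
    ("open", lambda path: True,                         "commodity"),
    ("read", lambda fd: True,                           "private"),
]

def route(syscall, arg):
    for name, pred, dest in rules:
        if name == syscall and pred(arg):
            return dest           # first matching rule wins
    return "commodity"            # default: the untrusted path

assert route("open", "/secrets/key.pem") == "private"
assert route("open", "/tmp/cache") == "commodity"
```

The design choice this illustrates is that trust becomes configurable per call: only the calls touching sensitive state stay inside the small private OS, while everything else reuses the commodity OS.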

The main performance overhead comes from context switching between the private and the commodity OS, which turns out to be an order of magnitude slower than a Linux kernel call. However, at least for the proof-of-concept applications that have been ported to Proxos (a Web browser, an SSH authentication server, and the Apache Web server with an SSL certificate service), this did not prove to be a problem, as the resulting end-to-end overhead was negligible.

Q: The private OS can become as complex as Linux itself. So how does this help reduce the TCB? A: Yes, security-sensitive applications still have to trust the entire Linux kernel, but not the privileged processes running on top. Q: Is it necessary to write proxy code for each ioctl to the commodity OS? A: Yes, but in our experience the things you want to isolate do not require this. Q: So, are you claiming that context-switch time is irrelevant? A: This is the case for the applications that we have ported so far. Of course, for applications that make more kernel calls the performance impact would be greater. Q: In your performance evaluation you compare the overhead of Proxos against Linux running on top of Xen. What would be the overhead compared to Linux running on hardware? A: We haven't done such experiments, but the overhead can be estimated based on available performance data for Xen.

Connection Handoff Policies for TCP Offload Network Interfaces

Hyong-youb Kim and Scott Rixner, Rice University

Hyong-youb Kim focused on efficient utilization of the TCP offload capabilities available in some modern NICs. Whereas offload-capable NICs can potentially improve the performance of network-intensive applications by taking over some of the TCP processing load, it turns out that, if used without care, this feature can easily saturate the NIC and degrade overall system performance.

Three techniques for optimizing the connection handoff policy were proposed:



1. Prioritize packet processing on the NIC by giving packets handled by the host processor higher priority than packets processed by the NIC.

2. Dynamically adapt the number of TCP connections handled by the NIC based on the length of the packet queues.

3. Compensate for the handoff cost by offloading long-lived connections to the NIC.

The proposed techniques were evaluated using a system simulator that models the NIC as a MIPS processor with 32 MB of RAM.
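The adaptive handoff policy (techniques 2 and 3 above) can be sketched as a simple decision function; the thresholds and names here are invented for illustration, not taken from the paper:

```python
# Illustrative policy sketch: offload long-lived connections to the NIC,
# but stop handing off new ones when the NIC's packet queue grows, so the
# NIC is never driven into saturation.

MAX_NIC_CONNS = 64        # limit adapted downward when queues build up
QUEUE_HIGH_WATER = 1000   # packets queued on the NIC

def should_offload(conn_age_packets, nic_conns, nic_queue_len):
    if nic_queue_len > QUEUE_HIGH_WATER:
        return False                   # NIC is already overloaded
    if nic_conns >= MAX_NIC_CONNS:
        return False                   # respect the adaptive connection cap
    return conn_age_packets > 100      # long-lived: worth the handoff cost

assert should_offload(conn_age_packets=500, nic_conns=10, nic_queue_len=50)
assert not should_offload(500, 10, nic_queue_len=5000)   # queue too long
assert not should_offload(5, 10, 50)   # short-lived: keep it on the host
```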

Q: If there is a sudden change in the number of received packets, would there be a lot of overhead in switching? For example, what happens if someone mounts a DoS attack by sending lots of long packets and then stopping? A: The NIC currently does not switch connections back to the host, so there's no overhead involved in reducing the number of connections. If the host wants to hand a connection to the NIC it has to transfer a small buffer, but that's cheap. Q: You simulate the NIC as a single general-purpose processor; however, actual network processors are more complex and have multiple specialized cores. Are your results representative of what would be observed on real hardware? A: Network processors are not good at handling NIC workloads, so they should not be used for NICs. Network processors are built for switching packets; NIC workloads are different, and network processors do not handle them well.

DISTRIBUTED STORAGE AND LOCKING

Summarized by Leonid Ryzhyk

Ceph: A Scalable, High-Performance Distributed File System

Sage A. Weil, Scott A. Brandt, Ethan L. Miller, Darrell D.E. Long, and Carlos Maltzahn, University of California, Santa Cruz

Sage A. Weil described Ceph as a high-performance distributed file system designed to accommodate hundreds of petabytes of data. The main principles underlying the Ceph architecture are separation of data and metadata, use of intelligent storage devices, and adaptable dynamic metadata management.

In his talk, Sage described the two main components of Ceph, the metadata store called MDS and the distributed object store called RADOS. MDS achieves excellent scalability by dynamically partitioning the directory tree among metadata servers based on metadata access frequencies. In addition, MDS simplifies the metadata structure by replacing file allocation data with seeds to a system-wide well-known hashing function called CRUSH, which is very stable in the face of storage device failures and other storage capacity changes.
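The core idea of computing placement from a function rather than storing allocation tables can be illustrated without CRUSH itself, which is a considerably more elaborate, hierarchy-aware algorithm. A minimal stand-in using rendezvous (highest-random-weight) hashing, with made-up device names, shows the principle:

```python
# Stand-in for the placement-by-function idea above: replica locations
# are *computed* from the object name and the current device list, so no
# per-file allocation table needs to be stored or replicated. CRUSH itself
# is more sophisticated (weighted, hierarchy-aware); this is only a sketch.
import hashlib

def place(obj_name, devices, replicas=3):
    """Return the `replicas` devices with the highest hash score for obj."""
    def score(dev):
        h = hashlib.sha256(f"{obj_name}:{dev}".encode()).hexdigest()
        return int(h, 16)
    return sorted(devices, key=score, reverse=True)[:replicas]

devs = [f"osd{i}" for i in range(8)]
before = place("inode42.chunk0", devs)
# Any client can recompute the same placement with no table lookup:
assert place("inode42.chunk0", devs) == before
# Removing one device perturbs only the placements that used it:
after = place("inode42.chunk0", [d for d in devs if d != "osd7"])
assert len(after) == 3
```

The stability property mentioned in the summary falls out of this scheme: when a device is added or removed, only objects whose top-scoring set included that device move.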

The RADOS object store provides scalability and reliability through replication, failure detection, and recovery. RADOS achieves scalability by using intelligent object store nodes that take advantage of a very compact representation of the cluster state (made possible by CRUSH), enabling them to efficiently communicate with local peers to quickly recover from failures or any other changes to the cluster. To enable efficient writes while guaranteeing data safety, RADOS implements a two-phase write acknowledgment protocol. The first acknowledgment confirms that the update has been propagated to all replicas, while the second confirms that the update has been physically committed to disk. RADOS uses the EBOFS object file system for local data storage. EBOFS implements a non-POSIX interface that supports features such as atomic transactions and asynchronous commit notifications.
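The two-phase acknowledgment can be modeled in a few lines. The class and callback names below are illustrative, not RADOS's actual interface:

```python
# Sketch of a two-phase write acknowledgment as described above: the first
# ack fires once every replica buffers the update in memory; the second
# fires once every replica reports it durable on disk. Names are made up.

class Replica:
    def __init__(self):
        self.memory, self.disk = [], []
    def buffer(self, update):
        self.memory.append(update)   # phase 1: in-memory only
    def commit(self):
        self.disk.extend(self.memory)  # phase 2: durable
        self.memory.clear()

def replicated_write(update, replicas, on_ack, on_commit):
    for r in replicas:
        r.buffer(update)
    on_ack(update)      # client may proceed: update visible on all replicas
    for r in replicas:
        r.commit()
    on_commit(update)   # client may discard its copy: update is durable

events = []
replicas = [Replica() for _ in range(3)]
replicated_write("put obj1", replicas,
                 on_ack=lambda u: events.append(("ack", u)),
                 on_commit=lambda u: events.append(("commit", u)))
assert events == [("ack", "put obj1"), ("commit", "put obj1")]
assert all(r.disk == ["put obj1"] for r in replicas)
```

The early first acknowledgment is what keeps write latency low, while the second lets clients know when it is safe to stop buffering the update against replica crashes.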

Q: Does Ceph support directory move/rename? A: Yes. The only thing that moves is the inode, not the directory. Q: What is the effect of network latency on performance numbers? A: One of the assumptions is that we are deploying in a data center environment. In principle, however, you could deploy it in a wider scenario if you have sufficient bandwidth. Q: In your model, metadata lookups are done on the server. What are the characteristics of your workload that make this more appropriate for scalability than doing the lookup on the client? A: High-performance workloads with high contention for metadata require consistency to be managed within the metadata server. Q: What is the recovery strategy for metadata? A: The short-term log doubles as a journal, so if a metadata server fails, another node can rescan its journal.

Distributed Directory Service in the Farsite File System

John R. Douceur and Jon Howell, Microsoft Research

Jon Howell described Farsite as a distributed serverless file system for networks of workstations. The focus of the current talk was on the design and implementation of a distributed metadata service for Farsite. The design goals included support for fully functional directory move operations, including moves across partitions, support for atomic moves, support for load balancing, and hotspot mitigation.

104 ; LOGIN: VOL. 32, NO. 1

The authors observe that the conventional approach to metadata partitioning based on path names precludes load balancing. Instead, they suggest implementing partitioning based on hierarchical immutable file identifiers. Since the identifier hierarchy is not tied to the path hierarchy, load balancing and directory (re)naming become completely orthogonal. File identifiers are compactly represented using a variant of the Elias gamma coding. Efficient lookup is achieved by storing the file map in a data structure similar to the Lampson prefix table.
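Standard Elias gamma coding (the summary notes Farsite uses a variant of it) encodes a positive integer n as (bit-length of n minus one) zeros followed by n's binary representation, so small identifier components cost only a few bits:

```python
# Elias gamma coding, the basis for the compact file identifiers above.
# Encode: write (len(bin(n)) - 1) zeros, then n in binary.
# Decode: count leading zeros z, then read the next z + 1 bits as n.

def gamma_encode(n):
    assert n >= 1
    b = bin(n)[2:]                      # binary without the '0b' prefix
    return "0" * (len(b) - 1) + b

def gamma_decode(bits):
    zeros = len(bits) - len(bits.lstrip("0"))
    return int(bits[zeros:zeros + zeros + 1], 2)

assert gamma_encode(1) == "1"
assert gamma_encode(5) == "00101"
assert all(gamma_decode(gamma_encode(n)) == n for n in range(1, 100))
```

Because the code is prefix-free and self-delimiting, a hierarchical identifier can be stored as a simple concatenation of encoded components with no separators.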

To implement atomic rename operations, the recursive path leases mechanism is introduced. It enables safe locking of the chain of file identifiers from the file-system root to a file, without putting excessive pressure on the root server. Finally, the Farsite metadata service minimizes false sharing by implementing a fine-grained locking scheme, where a client acquires locks for individual file metadata fields required for the requested access mode, rather than for the entire file.
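The idea of locking a root-to-file identifier chain can be sketched as follows. This is a hypothetical toy model: it uses shared leases on ancestors and an exclusive lock on the leaf, and it does not model the recursive delegation that lets Farsite avoid hammering the root server.

```python
# Hypothetical sketch in the spirit of recursive path leases: to mutate a
# file, a client holds shared leases on every ancestor identifier and an
# exclusive lock on the leaf. Farsite's actual mechanism additionally
# delegates leases recursively so the root server is rarely contacted.

class LeaseTable:
    def __init__(self):
        self.shared = {}        # identifier -> count of shared holders
        self.exclusive = set()  # identifiers held exclusively

    def acquire_chain(self, path_ids):
        *ancestors, leaf = path_ids
        if leaf in self.exclusive or self.shared.get(leaf, 0) > 0:
            return False        # leaf already in use
        if any(a in self.exclusive for a in ancestors):
            return False        # an ancestor is being renamed/moved
        for a in ancestors:
            self.shared[a] = self.shared.get(a, 0) + 1
        self.exclusive.add(leaf)
        return True

t = LeaseTable()
assert t.acquire_chain(["root", "dirA", "file1"])
assert t.acquire_chain(["root", "dirA", "file2"])  # ancestors are shared
assert not t.acquire_chain(["root", "dirA"])       # dirA held shared
```

The point the sketch makes is that shared ancestor leases let many clients operate under the same directory concurrently, while an exclusive lock anywhere on the chain safely blocks a conflicting rename.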

Q: Does the repeated load rebalancing have an impact on efficiency? A: Yes; one of the design decisions we made was that when we delegate load we can never coalesce it again. In practice it seems to work well. Q: Why aren't we already all using Farsite-like things on our mostly empty disks? Is there a drawback to this approach? A: The greatest limitation is that Byzantine fault tolerance depends on assumptions about how many machines may fail, but if they are all running homogeneous software and have a similar vulnerability, they can all fail or misbehave in the same way. Q: Suppose we adopt a less pure approach and put some stuff inside protected infrastructure. How does that change things? A: Without having to care about Byzantine fault tolerance, things would be much simpler. However, metadata load is a huge factor, which we would like to distribute among multiple machines.

The Chubby Lock Service for Loosely-Coupled Distributed Systems

Mike Burrows, Google Inc.

Chubby is a large-scale distributed lock service used in several Google products, including GFS and Bigtable. Mike focused mainly on introducing the Chubby API and the motivation behind it and describing the ways Chubby has been used, rather than on how it was implemented.

The main purpose of Chubby is to provide distributed-systems developers with a reliable and scalable implementation of the distributed consensus protocol. However, experience has shown that even when implemented as a library, the consensus protocol is still difficult for developers to use. Therefore, Chubby encapsulates it inside the familiar lock service API.

In addition to providing lock and unlock operations, Chubby allows associating small data records with locks and adopts a UNIX-like naming scheme for them, which makes it look and feel like a file system. However, it is not well suited for storing large amounts of data and lacks a number of filesystem features such as file renaming, atomic multifile operations, and partial-file reads and writes. This lack of features helps simplify the Chubby design and prevents developers from misusing it as a distributed file system.

Lock clients can be notified of certain types of events, including file content changes, file creation/deletion, and lock acquisition. Chubby is designed to support large numbers of clients per lock. Although changes of lock ownership are typically infrequent, clients tend to periodically poll the lock, creating a lot of read traffic. To reduce this traffic, a consistent write-through client-side cache is used.
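The read-traffic optimization can be modeled as a server that invalidates cached copies before acknowledging a write, so that client polling is served locally. This is a simplified illustration, not Chubby's actual protocol (which uses leases and keep-alive RPCs); all names here are made up.

```python
# Simplified model of a consistent client-side cache for a lock service:
# the server invalidates every client's cached copy *before* completing a
# write, so a cached read is always consistent. Illustrative only.

class LockServer:
    def __init__(self):
        self.data = {}
        self.clients = []
    def write(self, name, value):
        self.data[name] = value
        for c in self.clients:            # invalidate before ack
            c.cache.pop(name, None)
    def read(self, name):
        return self.data.get(name)

class Client:
    def __init__(self, server):
        self.server, self.cache = server, {}
        server.clients.append(self)
    def read(self, name):
        if name not in self.cache:        # miss: one RPC, then local
            self.cache[name] = self.server.read(name)
        return self.cache[name]

srv = LockServer()
c = Client(srv)
srv.write("/ls/cell/master", "node-17")
assert c.read("/ls/cell/master") == "node-17"   # first read: RPC
srv.write("/ls/cell/master", "node-03")         # invalidates c's copy
assert c.read("/ls/cell/master") == "node-03"   # refetched once, then cached
```

This is why frequent polling of rarely changing lock data is cheap: after the first read, polls never leave the client until the record actually changes.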

Q: Are there any examples of interactions that you had with the user community that led to the file-system abstraction? A: No, the design decision happened before the user base. I would put the main reason down to sharing an office with Rob Pike and Sean Quinlan (Plan 9 people), so everything looked like an FS. Q: Chubby allows developers to easily get reliability guarantees by using a lock server instead of a state machine. Were there other projects that had to go off and implement a state machine? A: Well, we did. There is a state machine library that we use; at present there are no other users of it. Q: It seems like large-scale tools are becoming increasingly integrated. Have you thought about bad interactions where one misbehaving application of a tool can cause cascading failure in other applications? A: Yes, it happens all the time. My system has managed to bring down many others. What you do is analyze exactly what happened and fix your code so that that particular problem can't happen again; of course, something else happens next time.



LARGE DISTRIBUTED SYSTEMS

Summarized by Prashanth Radhakrishnan

Experiences Building PlanetLab

Larry Peterson, Andy Bavier, Marc E. Fiuczynski, and Steve Muir, Princeton University

This talk, given by Larry Peterson, was about the authors' experience building PlanetLab (PL). PL is a global platform for deploying and evaluating planetary-scale network services. PL has machines spread around the world, with users' services running in a slice of PL's global resources.

The PL design was a synthesis of existing ideas to produce a fundamentally new system: It was experience- and conflict-driven.

Larry listed the requirements identified at the time PL was conceived and the design challenges they faced. Given its scale, PL had to rely on site autonomy and decentralized control for sustainability, while also managing the trust relationships between the users of PL and the owners of the machines. Next, it had to balance the need for resource isolation while coping with support for many users with minimal resources. Finally, PL had to be a stable, usable system, supporting long-running services and short experiments, while continuously evolving based on feedback.

PL's management architecture has the following key features to address the design challenges. PlanetLab Control (PLC), a centralized front-end, acts as the trusted intermediary between PL users and node owners. To support long-lived slices and accommodate scarce resources, PL decouples slice creation from resource allocation. Node-owner autonomy is achieved by making sure that only owners generate resources on their nodes and that they can directly allocate a fraction of their node's resources to virtual machines (VMs) of a specific slice. To support slice management through third-party services, PLC allows delegation of slice creation by granting tickets to such services. For scalability, PL was designed so that multiple PL-like systems can coexist and federate with each other. As per the principle of least privilege, management functionality has been factored into self-contained services, isolated into their own VMs and granted minimal privileges. To address the resource allocation issues, PL provides fair sharing of CPU and network bandwidth and simple mechanisms to protect against thrashing and overuse. Finally, keeping PL's control plane orthogonal to the VMM, leveraging existing software, and rolling out upgrades incrementally helped PL evolve while also being operational.

Larry concluded with lessons learned from their experience. Key among them was the observation that decentralization follows centralization; that is, a centralized model is important for a system to achieve critical mass, and it is only by federation that the system can scale.

During the Q&A session, Sean Rhea of Intel Research Berkeley asked about Larry's comments on the proposal to set aside physical boxes for measurements. Larry said he was not convinced about reserving physical resources, but rather thought that logical isolation was sufficient. David Anderson of CMU noted that Larry's talk presented a rosy picture of PL, in contrast to the PL panel at WORLDS '06 that discussed problems with PL. David asked about the observed problems with running latency-sensitive services, disk thrashing, and scheduling. Larry said that there was room for improvement in scheduling. He also noted that since the PL code is available, the community was welcome to track down bugs that hamper their research and report patches. He said that there was a known kernel bug that could cause problems with latency-sensitive slices and that things would improve when the next kernel upgrade is rolled out.

iPlane: An Information Plane for Distributed Services

Harsha Madhyastha, Tomas Isdal, Michael Piatek, Colin Dixon, Thomas Anderson, and Arvind Krishnamurthy, University of Washington; Arun Venkataramani, University of Massachusetts Amherst

Harsha Madhyastha presented iPlane, a service that provides accurate predictions of end-to-end Internet path performance. He started off with the observation that large-scale distributed services, such as BitTorrent, depend on information about the state of the network for good performance. But most current Internet measurement efforts, such as GNP and Vivaldi, provide only latency predictions between a pair of nodes. In contrast, iPlane measures a richer set of metrics, such as latency, loss rate, and available bandwidth.

iPlane continuously performs measurements to generate and maintain an atlas of the Internet by doing traceroutes from a few distributed vantage points. For scalability, targets are clustered on the basis of BGP atoms, and a representative target from each atom is used to approximate the performance of targets in the atom.

iPlane uses structural information such as the router-level topology and autonomous system (AS) topology to predict paths between arbitrary nodes in the Internet. This prediction is made by composing partial segments of known Internet paths so as to exploit the similarity of Internet routes. Next, iPlane measures the link properties in the Internet core and edge. In the Internet core, the special vantage points measure the link attributes. Link properties at the Internet edges are obtained by participating in BitTorrent swarms and measuring the links to the end-hosts while interacting with them.

Thus, to measure path properties between any two hosts, first the path between them is predicted. Then, iPlane composes the measured properties of the constituent path segments to predict the performance of the composite path.
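The composition step follows standard per-metric rules: latencies add along the path, capacity is the bottleneck (minimum) link, and loss rates combine multiplicatively. A sketch, with made-up segment data:

```python
# Sketch of composing per-segment measurements into an end-to-end
# prediction, as described above: latency adds, capacity is the minimum,
# and delivery probabilities multiply. The segment values are invented.

def compose(segments):
    latency = sum(s["latency_ms"] for s in segments)
    capacity = min(s["capacity_mbps"] for s in segments)
    delivery = 1.0
    for s in segments:
        delivery *= 1.0 - s["loss_rate"]
    return {"latency_ms": latency,
            "capacity_mbps": capacity,
            "loss_rate": 1.0 - delivery}

path = [
    {"latency_ms": 5.0,  "capacity_mbps": 1000, "loss_rate": 0.0},   # edge
    {"latency_ms": 40.0, "capacity_mbps": 100,  "loss_rate": 0.01},  # core
    {"latency_ms": 12.0, "capacity_mbps": 8,    "loss_rate": 0.02},  # edge
]
pred = compose(path)
assert pred["latency_ms"] == 57.0
assert pred["capacity_mbps"] == 8        # bottleneck is the last-mile link
assert abs(pred["loss_rate"] - (1 - 0.99 * 0.98)) < 1e-12
```

The design payoff is that measuring a modest number of shared segments lets iPlane predict properties for the quadratically many host pairs that reuse them.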

iPlane has been demonstrated to improve the performance of several representative overlay services, such as content distribution networks, swarming peer-to-peer file sharing, and VoIP.

During the Q&A session, Buck Krasic of the University of British Columbia observed that participating in BitTorrent swarms to measure Internet edge link properties might result in conservative estimates; for example, BitTorrent clients may have multiple connections open, and the bandwidth that iPlane observes might be just a fraction. He asked whether iPlane had some technique to compensate for that. Harsha replied that they were using BitTorrent to measure bandwidth capacity, not the available bandwidth, and that bandwidth capacity can be measured using a pair of back-to-back packets. And since measurements from BitTorrent are based on passive monitoring of a TCP connection of several packets, it is likely that at least one pair of back-to-back packets will be observed.

Fidelity and Yield in a Volcano Monitoring Sensor Network

Geoff Werner-Allen and Konrad Lorincz, Harvard University; Jeff Johnson, University of New Hampshire; Jonathan Lees, University of North Carolina; Matt Welsh, Harvard University

Geoff Werner-Allen presented this science-centric evaluation of a 19-day sensor network deployment at Reventador, an active volcano in Ecuador. The data collected by the sensor network deployment was evaluated based on five metrics, namely, robustness, event detection accuracy, data transfer performance, timing accuracy, and data fidelity. Of these, Geoff dealt with robustness, timing accuracy, and data fidelity in his talk.

The sensor hardware they used for volcano monitoring was small and, unlike conventional standalone dataloggers, which are unwieldy, provided real-time data acquisition while logging data to a local flash drive. Their sensor network contained 16 sensor nodes, each equipped with a seismometer, a microphone, and an antenna. These nodes continuously sample seismic and acoustic signals and log the data to local flash memory. They also run an event-detection algorithm that transmits a time-stamped report to the base station upon detection of a seismic event. The base station, located 4.6 km away from the sensor deployment, initiates data collection if it receives triggers from more than 30% of the sensor nodes within a certain time window.
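The base station's trigger rule can be sketched directly from the description above. The window length and function names are illustrative assumptions, not values from the paper:

```python
# Sketch of the base station's trigger rule: start a global data pull
# when more than 30% of the 16 nodes report an event within a window.
# The 10-second window is an assumed value for illustration.

WINDOW_S = 10.0
NUM_NODES = 16
THRESHOLD = 0.30

def should_collect(reports, now):
    """reports: list of (node_id, timestamp) event-detection triggers."""
    recent = {node for node, t in reports if now - t <= WINDOW_S}
    return len(recent) / NUM_NODES > THRESHOLD

reports = [(i, 100.0 + i * 0.5) for i in range(6)]   # 6 distinct nodes
assert should_collect(reports, now=105.0)            # 6/16 = 37.5% > 30%
assert not should_collect(reports[:4], now=105.0)    # 4/16 = 25%
```

Voting across nodes like this filters out spurious single-node triggers, which matters because each collection run is expensive over the long multi-hop radio path to the deployment.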

The overall robustness of the system was limited by power outages at the base station and a single three-day software failure. Discounting these, the mean node uptime exceeded 96%, indicating that the sensor nodes themselves were reliable. The Flooding Time Synchronization Protocol (FTSP) was used for time synchronization between sensor nodes. Although predeployment results with FTSP were good, during deployment they ran into stability issues, leading to occasional incorrect time-stamp reports. They developed a time rectification approach that filters and remaps recorded timestamps to accurately recover timing despite the incorrect time stamps. They evaluated the fidelity of the collected data by performing an analysis of the seismic and acoustic signals from a seismological perspective. Their results indicate that the collected signal quality and timing match the expectations of the infrasonic and seismic activity produced by the volcano.
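A minimal sketch conveys the filter-then-remap idea behind time rectification. The outlier-rejection rule and the data below are invented for illustration and are not the authors' exact algorithm:

```python
# Sketch in the spirit of time rectification: discard implausible
# (node_time, global_time) sync points, fit a line through the survivors,
# and remap sample timestamps through it. Filtering rule is illustrative.

def rectify(sync_points, node_times):
    # Filter: keep points whose offset is close to the median offset.
    offsets = sorted(g - n for n, g in sync_points)
    median = offsets[len(offsets) // 2]
    good = [(n, g) for n, g in sync_points if abs((g - n) - median) < 1.0]
    # Least-squares line g = a*n + b through the surviving points.
    k = len(good)
    sn = sum(n for n, _ in good); sg = sum(g for _, g in good)
    snn = sum(n * n for n, _ in good); sng = sum(n * g for n, g in good)
    a = (k * sng - sn * sg) / (k * snn - sn * sn)
    b = (sg - a * sn) / k
    return [a * n + b for n in node_times]

# Three consistent sync points plus one corrupted report:
sync = [(0.0, 50.0), (10.0, 60.0), (20.0, 70.0), (15.0, 999.0)]
fixed = rectify(sync, [5.0, 12.0])
assert abs(fixed[0] - 55.0) < 1e-9 and abs(fixed[1] - 62.0) < 1e-9
```

The essential property is that a few corrupted FTSP reports do not poison the mapping, because they are rejected before the fit rather than averaged into it.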

Geoff concluded his talk with the three lessons they learned from this deployment: Ground truth and self-validation are critical, network infrastructure is more brittle than sensor nodes, and it's important to build confidence with domain scientists.

During the Q&A session, Geoff was asked whether they look at sensor networks as a tool that scientists in other domains could use without requiring a computer scientist's presence. Geoff responded by stating that this was a great observation and that they wanted sensor networks eventually to be used like that. Someone from Stony Brook University asked whether it was possible to simply broadcast time from the base station. Geoff replied that such a broadcast would work only with single-hop networks, which isn't the case with the Reventador deployment. Mehul Shah of HP Labs asked about the fidelity of the measurements with respect to their relevance to the domain scientists. Geoff said that the scientists are still working on the results obtained and that their initial observations are encouraging.
