
EROS: A Capability System

Computer and Information Sciences Technical Report MS-CIS-97-03

J. S. Shapiro, J. M. Smith and D. J. Farber
Distributed Systems Laboratory
University of Pennsylvania, Philadelphia, PA 19104-6389
{shap,jms,[email protected] 23, 1997

Abstract

Capabilities define a uniform semantics for system service invocation, enforce separation of concerns and encapsulation, and allow each program to be restricted to exactly that set of authority it requires (the principle of least privilege). Capability systems therefore readily contain and reduce errors at the application level and improve component testability. If carefully architected, a capability system should be both faster and simpler than a comparable access-control-based system. In practice, implementations have failed to demonstrate such performance.

This paper provides an architectural overview of EROS, the Extremely Reliable Operating System. EROS is a persistent capability system which provides complete accountability for persistent, consumable and multiplexed resources. By choosing abstractions to leverage conventional hardware protection, and by exploiting hardware support in the implementation, a fast pure capability architecture can be demonstrated. This paper describes the system design and the proposed evaluation method for EROS. An implementation of EROS for the x86 platform is currently underway, and we expect to report benchmark results by mid 1998.

1 Introduction

Control of protected resources may use (object, user, authority) triples, which aggregate as an access control list [Sal75] (ACL). ACLs have two disadvantages: they are a static lattice of authorities rather than a dynamic one, and the notion of user introduces many opportunities for unintended sharing of authority.

A capability is an unforgeable (object, type, authority) triple [Den66]. A capability system is one in which capabilities provide the system's mechanism for security and access control. To access any object in such a system, a program must hold and invoke a capability for that object. Possession of a capability is a necessary and sufficient proof of authority to perform the operations authorized by the capability on the object that it names.

The capability model is attractive for several reasons: it is possible to construct a formal semantics for such a system, and to prove useful properties from that semantics. Simultaneously, capabilities provide a natural framework in which to expose the underlying machine representation in a secure and consistent fashion, allowing programmers to better understand application performance and providing fine-grain accountability for resources with low overhead. Finally, programs in a capability-based system can be confined [Lam73]. A confined program cannot leak information to an unauthorized party; it lives in a "black box." Such an environment is a safe container within which to run an untrusted program.

Architecturally, since they eliminate consideration of user identity, capabilities reduce the number of name spaces that must be implemented by the supervisor. Capability-based systems are readily secured. The set of objects reachable by a program is readily computable: it is a transitive, reflexive, semi-symmetric closure of objects reachable from the program's initial capability set. If the primitive capabilities are properly designed this set can be shown to be closed.[2] The capability model should also be faster: all other things being equal there is one less criterion to examine (user identity) and no list to traverse to determine access rights.

EROS (the Extremely Reliable Operating System) is a new capability system which provides these advantages while delivering unusually high performance. Preliminary measurements suggest that EROS delivers performance on most critical kernel paths that closely approaches the maximum derivable from the underlying hardware.

[2] The presence of a shared mutable name space (the file system) in most ACL systems violates this closure. This defeats the reachability computation, rendering security difficult to impossible in practice.

2 Background

Capabilities were the first protection mechanism to be given a semantically rigorous definition [Den66], and were used by several multiprocessing systems of the 1970's (e.g., C.mmp [Wul81], CAL [Lam76], Plessey 2000 [Lev84], Sigma-7, Burroughs B5700 [Org73]) as their underlying protection model. Operating systems for these machines focused on two issues:

1. bridging the semantic gap between the operating system and the application, and

2. extending the protection semantics of capabilities to the abstractions defined by the system supervisor [Wul81, Lam76, For94].

The first of these issues motivated the CISC architectures of the late 70's and early 80's.

A variety of architectural shortcomings resulted in low performance for these systems, effectively causing the abandonment[3] of fine-grain capability architectures:

- Excessive abstraction and the resulting complex objects (e.g., files) preclude enforcing authorities using the underlying hardware [Col88, Wul81, Lam76].

- Buffered messages are complex, slow, and the message buffer is an unaccountably multiplexed resource [Wul81, Lam76, For94].

- Small granularity protection domains are not well supported by hardware, and the difficulty of implementation [Wul81, Lam76] leads to complexity and its associated performance cost.

- Encryption, if used to support unforgeability, has significant time (decryption) and space (extra bits) overhead.

- Indirection through "object tables" is costly, but is often used to allow object deletion without locating all outstanding capabilities [Wul81, Lam76].

- Dynamic allocation in the supervisor is complex, and usually improperly accounted for, e.g., in the use of variable-length capability lists.

[3] With the notable exception of the IBM AS/400.

- Passive stores require reconstruction of authorities on restart, which requires persistence, but a common file system name space subverts many advantages of capabilities [Wul81, Lam76].

- Localized persistence, where modules decide independently when to store, requires communicating programs to implement their own, often expensive, consistency mechanism (e.g., transactions).

Collectively, these failings have deprecated capability systems among both researchers and commercial users. However, the security and reliability guarantees of the capability model are compelling.

The EROS design process had four phases:

1. Identify the principles that should constrain the design.

2. Examine the various possibilities for primitive, supervisor-implemented abstractions suitable for a capability architecture.

3. Reduce these to a minimum basis set that can be realized effectively using the hardware support available on modern architectures.

4. Implement this set efficiently.

Thus our focus was on performance, and the result is a pure capability system providing performance competitive with monolithic systems. The rest of the paper is about how we achieved it.

3 Design Principles

The architecture of the EROS kernel began with a set of design principles, some of which are taken directly from CAL [Lam76].

Protection Domains. All code outside the supervisor runs within some protection domain. A protection domain provides a context that holds the authorities accessible to a process. In EROS, each process is an independent protection domain. A related principle is isolation: the supervisor should not be compromisable by any action performed by non-supervisor code. Regrettably, this has led us to admit device mini-drivers into the kernel. Protection domains aid with the principle of least privilege, which means that applications should have access to only and exactly those services and objects that are necessary to their operation, minimizing the failure scope of the application, simplifying security arguments [Bom92], enforcing encapsulation [Wul81], and easing software maintenance [Lyc78].

Objects. The system provides a virtual world composed of objects [Lam76]. These objects come in a variety of types, which are described in Section 4. Objects are encapsulated: each object defines a set of operations that can be performed on that object by means of an invocation. Objects are named exclusively by kernel-protected capabilities. An object can only be manipulated by performing an invocation on a capability naming the object.

Active, Single Level Store. The capability model must extend uniformly to the disk; there should be no translation required at this layer in the memory hierarchy. EROS therefore implements a single level store. The state of the machine is periodically snapshotted and written to the store using an efficient asynchronous checkpoint mechanism [Lan92]. The store is active: processes are included in the checkpoint and are restarted automatically on recovery. This eliminates the need for the startup and shutdown phases of many programs, and reduces the frequency of program fabrication. Perhaps more important, transparent persistence eliminates the need to re-acquire authorities on system restart. An implication for the supervisor is that it should be recoverable; there should be no state visibly held by the kernel between units of operation; all persistent state should be in checkpointable objects.[4]

Atomic Units of Operation. All services performed by the supervisor consist of a single, atomic unit of operation with a clearly defined commit point. A supervisor operation either executes to completion or does not alter the state of the system. This simplifies the semantics of each operation, facilitates our approach to persistence, reduces the number of boundary conditions that applications need to check, and lays the groundwork for a fully multithreadable supervisor. The alternative is transactions, which have prohibitive overhead [Sel95].

Expose Multiplexing. EROS is a real-time kernel. It is therefore essential to expose resource multiplexing and virtualization to user control. This requires both that resource granularity be sufficiently fine-grain (pages, for example, must be named) and that suitable multiplexing controls be accessible to applications in an appropriately secure fashion.

[4] EROS does not quite meet this objective: the list of currently running threads and the list of active reserves is kept by the kernel.

Account for Everything. No resource should go unaccounted for. Complex accounting should be minimized, and if possible eliminated. A corollary to this is the elimination of metadata. Metadata is implicitly allocated by other operations, making it difficult to account for properly.

Minimality. The supervisor should not perform any operation that is not required to satisfy these design principles.

4 The EROS Architecture

The EROS supervisor implements a small number of primitive objects, which are shown in Table 1.

  number           a self-describing 96 bit quantity
  page             the basic unit of storage for user data
  capability page  a unit of storage for capabilities
  node             an alternative unit of storage for capabilities
  address space    a mapping from offsets to pages
  process          the container for a computation, and also the unit of protection domain

  Table 1: EROS Object Types

All objects, including supervisor services, are designated by unforgeable capabilities, which are defined as a triple:

  Object × Type × CapInfo

The Type and CapInfo fields combine to define the Rights conveyed by the capability. The meaning of the CapInfo field is specific to the capability type. Different capabilities may name the same object but convey different authorities over that object.

4.1 Invocation

To exercise the object designated by a capability, the holding process invokes that capability. The object is invoked, performs the requested service, and replies to the invoker. The invocation trap is the only system call implemented by the EROS supervisor.

Every invocation (including interprocess communication) sends up to 64 kilobytes of data, four data register values, and four capabilities. The CALL invocation transmits this information and leaves the process waiting for a reply in the same format. The RETURN invocation sends a response and renders the process available for further calls.[5] The SEND operation implements a send-only invocation. The sender remains running after the send completes; no response is expected. SEND can be used to implement non-blocking return, or to initiate parallel execution.

[5] The RETURN operation implements "return and wait for next call."
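To make the shape of an invocation concrete, the following is a minimal sketch of how the parameter block and the three invocation styles might be surfaced to a C++ program. The names (Message, CapSlot, eros_call, eros_return, eros_send) are illustrative assumptions, not the actual EROS system interface.

    #include <cstddef>
    #include <cstdint>

    // Index of a slot in the process's capability registers node; actual
    // capabilities stay in kernel-protected storage and are never exposed
    // to the application as raw bits. (Name and representation assumed.)
    using CapSlot = unsigned;

    // One invocation carries up to 64 kilobytes of data, four data register
    // values, and four capabilities (Section 4.1).
    struct Message {
        const void*   snd_data;    // outgoing payload, at most 64 KB
        std::size_t   snd_len;
        std::uint32_t snd_w[4];    // four data register values
        CapSlot       snd_cap[4];  // four capabilities to transmit

        void*         rcv_data;    // buffer for the reply payload
        std::size_t   rcv_limit;
        std::uint32_t rcv_w[4];    // reply data registers
        CapSlot       rcv_cap[4];  // slots to receive reply capabilities
    };

    // The invocation trap is the only system call; these wrappers are an
    // assumed user-level veneer over it.
    int eros_call(CapSlot invoked, Message& msg);        // send, then block for the reply
    int eros_return(CapSlot resume, Message& msg);       // reply, become available again
    int eros_send(CapSlot invoked, const Message& msg);  // send only; keep running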

4.2 Numbers

A number capability is a capability holding 96 bits of data. Number capabilities serve several roles:

- The zero number capability is used in all places where a "void capability" might be expected.

- The zero number capability serves to identify invalid address ranges.

- Number capabilities are used to hold process register values when a process is not executing.

The only operation supported by a number capability is the operation that obtains its value.

4.3 Basic Units of Storage

The basic units of storage in the EROS architecture are the data page (page) and the capability page (node). A node contains 32 capabilities, in the same way that a page holds 4096 bytes (your architecture may vary). Capability pages cannot be mapped into a process address space or otherwise manipulated directly by the application. This partitioning ensures that they are unforgeable.

Pages and nodes are named by page capabilities and node capabilities, which may carry a "read-only" restriction. Node capabilities may in addition carry "no-call" and "weak" restrictions. The "no-call" restriction suppresses the invocation of untrusted address space exception handlers (see Section 4.7). Any capability fetched via a "weak" capability is weakened according to the rules shown in Table 2. The combination of weak and read-only restrictions therefore guarantees transitive read-only access.

  Input       Output
  NumberCap   NumberCap
  PageCap     RO PageCap
  NodeCap     RO, NC, WK NodeCap
  SpaceCap    RO, NC, WK SpaceCap
  all others  NumberCap(0)

  Table 2: The weaken operation
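The rules of Table 2 amount to a small total function over capability types. A minimal sketch, assuming hypothetical CapType and restriction-bit names that are not part of the published EROS interface:

    #include <cstdint>

    enum class CapType { Number, Page, Node, Space, Other };

    // Restriction bits carried in the CapInfo field (names assumed).
    constexpr std::uint32_t RO = 1u << 0;  // read-only
    constexpr std::uint32_t NC = 1u << 1;  // no-call
    constexpr std::uint32_t WK = 1u << 2;  // weak

    struct Capability {
        CapType       type;
        std::uint32_t restrictions;  // subset of RO | NC | WK
        std::uint64_t object;        // object named by the capability
    };

    // Any capability fetched through a weak capability is weakened per
    // Table 2; everything else degrades to NumberCap(0).
    Capability weaken(const Capability& c) {
        switch (c.type) {
        case CapType::Number:
            return c;                                                  // NumberCap -> NumberCap
        case CapType::Page:
            return {c.type, c.restrictions | RO, c.object};            // -> RO PageCap
        case CapType::Node:
        case CapType::Space:
            return {c.type, c.restrictions | RO | NC | WK, c.object};  // -> RO, NC, WK
        default:
            return {CapType::Number, 0, 0};                            // -> NumberCap(0)
        }
    }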

Using pages and nodes as the two fundamental units of storage, we construct the remaining abstractions needed for a general purpose, extensible operating system.

4.4 Address Spaces

An address space is a tree of nodes whose leaves are pages (Figure 1). Unmapped pages are indicated by a Number Capability holding the value zero.

[Figure 1: A 19 page address space. The figure shows a tree whose root node capability names a node; the node's slots (0-15) hold node capabilities, page capabilities, and null capabilities, and the subordinate nodes hold the page capabilities naming the pages.]

If set, the read-only bit in each capability indicates that the associated subtree cannot be written in this address space; sense capabilities are considered read-only.

Address spaces are recursively defined. One process can construct a new address space that maps a subset of its current address space and pass this new address space to another party. The recipient can then merge the transferred address space into its main address space, establishing a shared mapping.

It is sometimes useful for the internal structure of an address space to be opaque to its user. To facilitate this, we introduced a specialization of a node capability known as an address space capability. Node capabilities and address space capabilities can be used interchangeably in defining the address space tree. The difference between an address space capability and a node capability is that the address space capability is opaque: slots of the named node cannot be examined by means of an address space capability. Most address spaces are constructed exclusively from node, page, and null capabilities.

For page, node, and address space capabilities, the CapInfo field contains the read-only, no-call (suppresses fault handler invocation), and blss information about the object. BLSS (biased log space size) is defined as

  BLSS ≡ log2(Address Space Size) - 1

and defines the size of the subspace defined by a given page, node, or address space capability. Recording the size in the capability allows the address space fabricator to short-circuit levels in the mapping tree. One consequence is that most mapping trees for native EROS domains are only two or three nodes tall, reducing the number of object faults necessary to resolve an address.
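Read literally, the definition above turns a power-of-two space size into a small integer. The helper below is an illustrative sketch under that reading; the exact bias and encoding used by the real address space fabricator are not specified here and may differ.

    #include <cstdint>

    // BLSS (biased log space size), following the formula in Section 4.4:
    //   BLSS = log2(address space size) - 1
    // The bias of 1 is taken from the formula as stated; this is a sketch,
    // not the kernel's actual encoding.
    unsigned blss_of(std::uint64_t space_size_bytes) {
        unsigned log2 = 0;
        while (log2 + 1 < 64 && (1ull << (log2 + 1)) <= space_size_bytes)
            ++log2;
        return log2 - 1;
    }

    // Under this reading, a single 4096-byte page has blss_of(4096) == 11,
    // and a full 4 GB space has blss_of(1ull << 32) == 31.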

4.5 Processes

A process is constructed as a suitable arrangement of nodes (Figure 2). Every process has a node that acts as the root of the process structure, an additional node that holds the (kernel-implemented) capability registers for that process, and some architecture-dependent number of annex register nodes that hold the register values of the process when it is not running. These register values are stored in number capabilities.

[Figure 2: An EROS Process. The figure shows the process root node, the capability registers node, and the registers annex node, linked by node capabilities; the capability register slots may hold any capability, the annex slots hold number capabilities, one root slot is always zero, and another holds the capability to the address space.]

While node capabilities often serve as address space capabilities, they are not suitable for use as process capabilities. The capabilities in a process node must be of specific types for the process to be well-formed.[6] In addition, there is a reserved slot in the process root known as the brand. The brand is part of the EROS confinement mechanism, and must not be visible or modifiable by manipulators of the process.

A process capability conveys the authority to start and stop a process, to examine its registers (both data and capability), to alter or replace its address space, to change its scheduling authority, or to alter its fault handler. This enables the holder of a process capability to alter essentially anything about the process; its advantage over a node capability is that it enforces the capability type restrictions for well-formed processes and protects the brand.

4.6 User-Level Objects

Just as processes are named by process capabilities, the programs embodied within these processes are named by start capabilities. Performing a CALL on a capability causes the supervisor to fabricate a resume capability to the caller. This capability is provided to the recipient, and conveys the right to reply to this call. As with all other capabilities, the resume capability can be transferred or copied. All copies of a resume capability are efficiently invalidated when any of them is invoked, which ensures that every CALL receives at most one return.

The CapInfo field of the start capability is set when it is fabricated, and is provided to the recipient by the kernel whenever the recipient is invoked. This allows the process to provide different interfaces to different callers, and can be used to distinguish service versions, service classes (e.g., administrator vs. user), or distinct "objects" implemented by the process.

Invoking a start or resume capability to a badly-formed process behaves as though one had invoked the zero number key. All invocations of the zero number key return a result code that (by convention) is not used by any other object.

[6] The kernel is not jeopardized by malformed processes. Such processes do not run, but the effects of invoking one are well-defined.
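As an illustration of how a program might use the CapInfo value delivered with its start capabilities, the loop below presents two interfaces, one per assumed info value. The receive wrapper, handler functions, and info codes are hypothetical, and the Message, CapSlot, and eros_return names reuse the sketch from Section 4.1.

    #include <cstdint>

    // Reuses the hypothetical Message, CapSlot, and eros_return
    // declarations from the Section 4.1 sketch.
    struct Incoming {
        std::uint32_t cap_info;  // CapInfo of the start capability that was invoked
        CapSlot       resume;    // resume capability fabricated by the kernel for CALL
        Message       msg;       // data and capabilities sent by the caller
    };

    Incoming wait_for_call();                      // hypothetical receive wrapper
    std::uint32_t handle_admin(const Message& m);  // privileged operations
    std::uint32_t handle_user(const Message& m);   // restricted operations

    // Assumed CapInfo values chosen when the two start capabilities were made.
    constexpr std::uint32_t kAdminInterface = 1;
    constexpr std::uint32_t kUserInterface = 2;
    constexpr std::uint32_t kErrUnknownInterface = 0xffffffffu;

    void serve_forever() {
        for (;;) {
            Incoming in = wait_for_call();
            Message reply{};
            switch (in.cap_info) {
            case kAdminInterface: reply.snd_w[0] = handle_admin(in.msg); break;
            case kUserInterface:  reply.snd_w[0] = handle_user(in.msg);  break;
            default:              reply.snd_w[0] = kErrUnknownInterface; break;
            }
            // RETURN invalidates every copy of the resume capability, so the
            // caller receives at most one reply.
            eros_return(in.resume, reply);
        }
    }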

4.7 Exceptions

All exceptions taken by a process are redirected by the kernel to a process known as a keeper. Both address spaces and processes can designate a keeper by placing a start capability in the appropriate slot of the address space or process. When an exception occurs, the kernel synthesizes an upcall to the keeper process with a message describing the exception conditions. Memory exceptions are directed to the appropriate address space keeper, or if none is defined, to the process keeper. All other exceptions are reported to the process keeper.

5 The Store

Much of the benefit of the EROS system derives from careful organization of the store and an efficient consistent snapshot mechanism.

5.1 Organization of the Store

All objects in EROS can be reduced to node and page objects. Every page and node is uniquely identified by an object identifier (OID) which encodes its location in the store. The store itself is managed as a set of (possibly duplexed) ranges of consecutively numbered pages and nodes (Figure 3). For simplicity of I/O management, nodes are clustered into page-size units for purposes of storage on the disk.

[Figure 3: Disk Ranges. The figure shows a disk divided into page ranges and node ranges.]

On startup, the kernel scans all attached disks to load their range tables, and constructs an in-core master table of object ranges. Each range table entry is of the form

  {page, node} × [Start OID, End OID)

The in-core table is relatively small, and is kept sorted by starting OID.

When a capability for an object is first invoked, the object must be brought in from the disk. To do so, the kernel consults the master range table to find the physical location of the containing range in the store, computes the offset within that range of the disk page containing the object, and initiates a disk read operation. Once fetched, the object resides in an in-memory object cache, and the capability is prepared by modifying it to point directly to the in-core object. To facilitate capability depreparation on object removal, the capability is placed in a doubly-linked list whose head is the object itself. The preparation and depreparation of capabilities is entirely transparent to application code.

This storage organization has several advantages, notably in the absence of any "indirection blocks". In a system having a 2 terabyte store of 1000 2-gigabyte disks, each having 64 ranges (an unusually expensive configuration), the total overhead to locate the containing disk page is 16 in-memory comparisons to traverse the sorted range table. This multi-terabyte system has an in-core table overhead of 1.2 megabytes, which is considerably less than would be used by an equivalent number of file systems.[7]

[7] One of our early users has an application that generates 14 terabytes of new data per year, all of which must be online for periods of many years.
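A sketch of the in-core lookup just described: a table of ranges sorted by starting OID, searched with a binary search, which is where the 16-comparison figure for 64,000 ranges comes from. The field names and the one-table-per-object-kind simplification are assumptions.

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    using OID = std::uint64_t;

    // One entry of the in-core master range table, covering
    // [start_oid, end_oid) for a single object kind (page or node);
    // a separate table per kind is assumed here for simplicity.
    struct Range {
        OID           start_oid;     // inclusive
        OID           end_oid;       // exclusive
        unsigned      disk;          // physical placement (fields assumed)
        std::uint64_t first_sector;
    };

    // The table is kept sorted by starting OID, so locating the containing
    // range is a binary search: about log2(64,000) = 16 comparisons for the
    // 1000-disk, 64-ranges-per-disk configuration cited in the text.
    const Range* find_range(const std::vector<Range>& table, OID oid) {
        auto it = std::upper_bound(
            table.begin(), table.end(), oid,
            [](OID o, const Range& r) { return o < r.start_oid; });
        if (it == table.begin())
            return nullptr;          // oid precedes every range
        --it;
        return (oid < it->end_oid) ? &*it : nullptr;
    }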

  Case         I-refs (sup)  D-refs (sup)  ITLB miss (sup)  DTLB miss (sup)  cycles (sup)  time (S+U)
  Null IPC     28.6          58.14         0                1                206           2.485 µs
  14 byte IPC  36.51         82.04         1.5              1.5              333.64        3.68 µs
  Page Fault   ???           ???           ???              ???              ???           ~3 µs

  Table 3: Supervisor footprint (p120)

5.2 The Checkpoint Mechanism

In addition to page and node ranges, the store also contains log ranges. Before an object in the main memory cache can be modified, space is reserved for it within the log range. Objects within the log are not stored in OID order; the log contains overhead pages that provide a directory of objects in the log. This directory is not referenced during normal operation, as a complete directory of the log is maintained in memory.

On a periodic basis, a dedicated service process initiates a checkpoint operation. The checkpoint operation stabilizes a systemwide consistent image via a three phase process: snapshot, stabilization, and migration.

Snapshot. The snapshot phase efficiently creates a consistent snapshot of the entire system:

1. All dirty objects are marked copy on write.

2. All page table entries are marked read-only.

3. The thread list is traversed, and the in-core checkpoint directory entry for the root node of each active process is updated to reflect the fact that the process was running at the time of the checkpoint.

Snapshot is a synchronous operation. The current implementation takes comfortably under 100 ms on a slow i486, and can be shrunk by further exploitation of lazy marking to a few microseconds. Having constructed a consistent snapshot, execution is permitted to resume.

Stabilization. Once a snapshot has been captured, the frozen objects are asynchronously written to the previously reserved space in the checkpoint log. When all objects frozen by the checkpoint have gone to the log, the checkpoint directory is written to the log. Finally, a new checkpoint header is written to indicate that the checkpoint has committed.

Stabilization does not violate the real-time requirement. The cost of stabilization is built into the cost of dirtying objects; every object dirtied will cause at most two stabilization writes for objects of that type. These writes are asynchronous and non-blocking, provided that there is sufficient main memory and log space to allow a new object to be dirtied.

Migration. After the checkpoint has committed, a migrator is started to copy the objects back to their official locations within page and node ranges. Because the migrator works on large collections of objects whose disk locations can be sorted, the current EROS migrator is capable of saturating the sustainable disk subsystem bandwidth on most machines. The principal difficulty in migration is slowing it down enough to use the machine for other purposes (yes, this is hyperbole; it will be replaced by real numbers when we have them in a few weeks).

It is rare for a checkpoint to occupy more than 30% of the log before committing. No single checkpoint is permitted to occupy more than 70% of the available log space. Allowing this much of the log to be committed to a single checkpoint serves to dampen jitter in page dirty rates across checkpoints. Restricting the checkpoints ensures that a stable rate of two-stage migration throughput is achieved for large object mutations.
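The snapshot phase itself is small enough to sketch. The object, page table, and directory types below are stand-ins for the kernel's real data structures, which this paper does not spell out.

    #include <vector>

    // Stand-in types; the real kernel objects are of course more involved.
    struct CachedObject { bool dirty; bool copy_on_write; };
    struct PageTableEntry { bool writable; };
    struct Process { CachedObject* root_node; };
    struct CheckpointDirEntry { bool was_running; };

    CheckpointDirEntry& dir_entry_for(CachedObject* root);  // in-core checkpoint directory

    // Snapshot phase of a checkpoint (Section 5.2): synchronous, run between
    // units of operation.
    void snapshot(std::vector<CachedObject*>& dirty_objects,
                  std::vector<PageTableEntry*>& ptes,
                  std::vector<Process*>& thread_list) {
        // 1. All dirty objects are marked copy-on-write.
        for (CachedObject* o : dirty_objects)
            if (o->dirty) o->copy_on_write = true;

        // 2. All page table entries are marked read-only, so the next store
        //    to any frozen page faults and triggers the copy.
        for (PageTableEntry* pte : ptes)
            pte->writable = false;

        // 3. The thread list is traversed and the in-core checkpoint directory
        //    entry for each active process's root node records that the
        //    process was running at snapshot time.
        for (Process* p : thread_list)
            dir_entry_for(p->root_node).was_running = true;

        // Execution then resumes; stabilization and migration proceed
        // asynchronously against the frozen image.
    }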

6 Reserves

In addition to capabilities for real resources, EROS implements capabilities for consumable resources and resource virtualization in the form of processor capacity reserves and working set reserves.

6.1 Processor Capacity Reserves

A processor capacity reserve is a 6-tuple of the form: (period, duration, quanta, start time, active priority, inactive priority). The duration specifies the number of guaranteed nanoseconds of computation within each period, beginning at the given start time (immediately if the start time is zero). The quanta specifies how often a process running under this reserve will be preempted in favor of other ready processes under the same reserve. The active priority gives the priority of this process during its reserved period. The inactive priority indicates the priority (if any) with which the process runs when its period has been depleted. Reserves whose inactive priority is below that of the kernel idle thread (which is always ready to run) will not execute outside of their reserved duration.

Processor reserves are allocated by a user-level agent known as the reserve manager. The reserve manager accepts requests for reserves, determines if those requests can be met without violating previous reservations, and if so, returns a capability to an appropriate kernel reserve. This reserve capability can be placed in the schedule slot of a process's root node; all processes sharing a given reserve capability will run under the named reserve.
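The 6-tuple maps directly onto a small record. The field names below, and the shape of the request to the reserve manager, are assumptions for illustration rather than the actual EROS interface.

    #include <cstdint>

    // Processor capacity reserve, per Section 6.1:
    // (period, duration, quanta, start time, active priority, inactive priority)
    struct CpuReserve {
        std::uint64_t period_ns;         // length of each reservation period
        std::uint64_t duration_ns;       // guaranteed computation per period
        std::uint64_t quantum_ns;        // preemption granularity among sharers
        std::uint64_t start_time_ns;     // 0 means "starting immediately"
        int           active_priority;   // priority while the reservation holds
        int           inactive_priority; // priority once the period is depleted
                                         // (below the idle thread => do not run)
    };

    // The reserve manager is a user-level agent: it admits or rejects a
    // request and, on success, hands back a capability to a kernel reserve
    // that can be installed in a process's schedule slot. The call below is
    // purely illustrative:
    //   Capability request_reserve(const CpuReserve& wanted);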

In contrast to previous implementations [Kit93, Mer93], processor reserves are not inherited across IPC operations. Priority inheritance raises a host of protection domain violation issues: if reserves are inherited, a caller might call a service and then shut off its reserve, denying service to other callers. Reserve inheritance lends itself to "QoS cross-talk" [Les96] (which manifests as variance), and understanding the behavior of nested server calls in the face of priority inheritance is difficult. Instead, EROS enables each application to construct a single end-to-end reservation by instancing its services under whatever processor reserve is appropriate.

The EROS CPU reserve implementation differs from previous implementations by unifying reserves and priority management. Depending on the selected admission control mechanism, reserves can be used to provide isochronous scheduling guarantees or softer periodic service at high priority. We have found it occasionally useful to run infrequently executed, unreserved processes at higher priority than the reserved processes, most notably those processes associated with system recovery.

6.2 Working Set Reserves

The capacity to use the processor is of little benefit if the application is not in memory. To ensure that performance-critical application code and data remains resident, EROS provides working set reserves. A working set reserve describes the total number of pages and nodes, and the number of dirty pages and nodes, permitted to the working set. Working sets are subdivisible; an application can fabricate an empty working set and transfer reservation from some existing working set to the new one. A working set may also be marked non-persistent, exempting any data in it from the checkpoint mechanism.

Typical applications define a single working set for their process. Applications may also specify an address subspace that operates under its own working set. This enables protocol modules to declare working storage for in-transit packets to be non-persistent, reducing the overhead of restart after checkpoint and reducing the complexity of orderly cleanup after a system restart has occurred.

7 Mapping to Hardware

While pages and nodes are an elegant way of specifying a system, they are not an especially good way of running that system efficiently.

Addressing-related capabilities translate directly into hardware page table entries; the associated access rights checks are implemented by the native hardware addressing mechanism. The process and the address space abstractions, however, must be translated into forms that can be efficiently used by the hardware. This is accomplished by two caches: the process cache and the mapping cache. The management of these caches is presented in detail elsewhere [cite POS paper]; it is sketched here to facilitate understanding of the benchmarks section.

To facilitate efficient context switching, the state of actively executing processes is loaded into a process cache. Each active entry in the process cache contains the complete architected register set of some process. The layout of a process cache entry is optimized for context unload and reload rather than specification convenience. Once loaded into the process cache, a process remains cached until forced out by a more active process. In practice, most processes block most of the time; the process cache is usually able to contain the entire active process set.

The mapping cache holds hardware-specific mapping table entries. On tree-structured mapping hardware, it contains page tables. For hash-structured mapping systems, we implement a large second level translation buffer in software and fall back to walking the address space tree.

Hardware-specific mapping structures are constructed by traversing the address space structure and updating the appropriate entry or entries in the mapping cache. To ensure that the mapping entries are properly updated when the address space node slots are altered, a dependency table is used to keep track of the projection from the abstract address space to the hardware mapping tables.
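A sketch of the dependency table idea: every hardware mapping entry constructed from a given address space node slot is recorded, so that altering the slot can invalidate exactly the entries projected from it. The structures and names are assumptions, not the kernel's actual representation.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    // Identifies one slot of one address space node.
    struct NodeSlot {
        std::uint64_t node_oid;
        unsigned      slot;
        bool operator==(const NodeSlot& o) const {
            return node_oid == o.node_oid && slot == o.slot;
        }
    };
    struct NodeSlotHash {
        std::size_t operator()(const NodeSlot& s) const {
            return std::hash<std::uint64_t>()(s.node_oid) ^ (s.slot * 0x9e3779b9u);
        }
    };

    // A hardware page table entry or software TLB entry derived from a slot.
    struct MappingEntry {
        void invalidate();  // drop the hardware mapping (stub for the sketch)
    };

    // Dependency table: abstract address-space slot -> derived hardware entries.
    class DependencyTable {
        std::unordered_map<NodeSlot, std::vector<MappingEntry*>, NodeSlotHash> deps_;
    public:
        // Recorded while walking the address space tree to build a mapping.
        void record(const NodeSlot& slot, MappingEntry* e) {
            deps_[slot].push_back(e);
        }
        // Called when a node slot is altered: tear down exactly the hardware
        // entries that were projected from that slot.
        void slot_changed(const NodeSlot& slot) {
            auto it = deps_.find(slot);
            if (it == deps_.end()) return;
            for (MappingEntry* e : it->second) e->invalidate();
            deps_.erase(it);
        }
    };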

8 Proposed Evaluation Method

In the absence of a substantial native application base, research microkernels have been evaluated by comparing the performance of a familiar environment (such as UNIX) to its rehosted equivalent on top of the microkernel. This is an inappropriate evaluation method for EROS for several reasons:

- EROS is designed to support mission-critical applications that demand high availability, hands-off 24/7 operation, very large stores and/or strong fault containment. These are applications for which UNIX and similar systems are inappropriate.

- EROS is designed for applications that place the system under high stress, such as online transaction processing. Many of the simplifying assumptions that are essential to making UNIX fast (e.g., statistically justified overcommitment of swap space) are inappropriate for such an environment.

- Operating systems are usually rehosted for the sake of legacy application support. A nominal (10% or less) degradation of performance is acceptable for such applications.

- Benchmarking rehosted environments does not expose any of the advantages of the new system. If what you want to do is run UNIX, you should run UNIX.

Equally important, both our own past experience and that of others suggest that this evaluation strategy does not lend itself to a correct understanding of the performance of the respective systems [Che93, Bom92].[8]

[8] The conclusions of Bershad and Chen in [Che93] are mistaken. Simply reversing the order of user and supervisor memory delays in Figure 2-1 makes it apparent that the operating system structure had essentially no effect on application behavior. The graph actually reveals the poor design and architecture of Mach 3.0 and the inefficiency of its IPC mechanism, and strongly suggests that a better microkernel architecture should exceed the performance of the native UNIX system.

8.1 Issues in Comparison

Three differences in semantics between UNIX and EROS make direct comparisons of certain operations difficult: accountability, persistence, and the absence of the fork() primitive. None of these features has a direct equivalent in conventional systems, and each has associated costs and benefits.

Accountability. The EROS architecture requires (as a matter of policy) that all storage be provided by the application, and that every piece of application-provided storage have real backing store. This adds to the cost of demand-zero memory regions, as the metadata and content structures (nodes and pages respectively) must be explicitly purchased by some user-level application. No analogous accountability is applied by UNIX and similar systems, since the analogous storage is fabricated from a statistically overcommitted resource (the swap area). Transparent persistence of an overcommitted swap area cannot be recovered by restarting.

Persistence. By design, EROS has no analogue to the UNIX sync operation. Under normal circumstances, applications build structures in memory and rely on the underlying checkpoint mechanism to render them persistent. A separate, fine-grain mechanism exists for use by transactioned facilities.

The overhead associated with file system metadata is substantial and well explored [Gan94, Sel95]. Ultimately, this overhead derives from the fact that the run-time state (including that of the file system implementation) is not persistent, and therefore cannot be relied on to resume from a consistent state. The use of a transparent system-wide checkpoint operation removes this impediment, which eliminates the need for transactioned metadata update. Explicit metadata management is relegated to those exceptional applications such as databases that have unusual promptness requirements.

Since this reflects the normal behavior of the system (i.e., what the user will actually see), we view it as a realistic (albeit unfair) comparison. The reliability tradeoffs inherent in this policy are discussed in Section ??.

fork. UNIX systems place heavy reliance on the fork() primitive, which has no analogous operation in EROS. In benchmarks, and also in real systems, the fork call is nearly always followed immediately by exec. The end result is the creation of a new process with a new address space. The metric of interest, then, is the create process operation. This is the one that should be benchmarked.

8.2 Benchmarks

The performance of applications is dependent on three factors: memory speed and configuration, processor speed, and a weighted measure of the performance of certain critical services:

1. Address space mapping management,

2. Page fault handling,

3. Reading and writing data to the store (file operations), and

4. Interprocess communication performance.

Processor speed and memory subsystem are independent of the choice of operating system. Critical service operations can be compared directly by microbenchmarks. The lmbench benchmark suite does a reasonably effective job of measuring such low-level operations [McV93]. We therefore propose to adapt the lmbench microbenchmark suite to the EROS platform for measurement purposes.
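For illustration only, a cycle-counting harness of the sort such microbenchmarks are built on might look like the following on x86, using the rdtsc counter. This sketch is not part of lmbench and glosses over the warm-up, loop-overhead, and cache-state controls a real measurement needs.

    #include <cstdint>
    #include <cstdio>
    #include <x86intrin.h>  // __rdtsc()

    // Average cost, in cycles, of one call to op(). Real microbenchmarks
    // (lmbench included) also subtract loop overhead and pin down cache and
    // TLB state; this sketch does neither.
    template <typename Op>
    std::uint64_t cycles_per_op(Op op, int iterations = 100000) {
        std::uint64_t start = __rdtsc();
        for (int i = 0; i < iterations; ++i)
            op();
        std::uint64_t end = __rdtsc();
        return (end - start) / iterations;
    }

    int main() {
        // Stand-in for the operation under test (an IPC round trip, a page
        // fault, a mapping update, ...).
        volatile int sink = 0;
        std::uint64_t c = cycles_per_op([&] { sink = sink + 1; });
        std::printf("%llu cycles per operation\n",
                    static_cast<unsigned long long>(c));
        return 0;
    }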

For each benchmark, we propose to measure both CPU cycles and cache misses in the instruction, data, and TLB caches. The latter measurements expose deferred post-service overheads that would not otherwise be demonstrated by a microbenchmark.

9 Related Work

This section needs to be expanded.

Most of the early hardware protection mechanisms were capability based [cite Plessey, Sigma-7, C.mmp, B5700], but these systems failed to deliver acceptable performance. The dramatic and highly publicized failure of the Intel i432 led mainstream computer architects to abandon fine-grain capability architectures. The Intel/Siemens BiiN architecture (better known as the i960) demonstrated that hardware-based capability systems could be high-performance, but went undeployed due to a contractual breakdown between the principals. The IBM AS/400 is the only widely deployed capability system in use today.

9.1 KeyKOS

KeyKOS is an earlier capability system from which the EROS architecture is derived [Har85]. Originally constructed for the IBM 370, and later ported to the Motorola 88000 family, KeyKOS delivered performance roughly equivalent to that of Mach 2.5 [Bom92]. In addition to formal underpinnings described elsewhere [Sha97], EROS brings the performance of the architecture into line with that of more aggressive systems such as L4.

9.2 Mach 4.0

Bryan Ford has given considerable attention to the problem of IPC performance in Mach, and has shown that it can be substantially improved [MACH4:Migrating]. For dekernelized systems, including EROS, IPC operations are a performance-critical benchmark. Various elements of the Mach messaging semantics restrict the realizable performance.

9.3 L3 and L4

L3 [L3:IPC, L3:SpaceMux], and later L4, have established new standards for interprocess communication times. The work already done on EROS has demonstrated that these times can be matched in the EROS context [Sha96d]. The principal insight to draw from this work, in our opinion, is that context switch and exception handling have much more to do with the hardware architecture than with the specifics of any particular operating system architecture.

10 Conclusions

While there are many ways to construct capability systems, very few are likely to be efficient. EROS achieves high performance by use of a design methodology comprising the steps of:

1. Identify minimal sets of primitive memory objects and operations, which are required to provide system semantics and cannot be constructed from even more primitive objects and services.

2. Examine these sets in light of architectural support and microbenchmarks to select an efficient basis from which the capability system can be constructed.

3. Use the hardware performance (e.g., for TLB table updates or disk throughput) as the design target for performance of each of the primitives.

The critical path operations of the EROS system meet or substantially exceed the performance of all other current operating systems. Substantial further improvement on the critical paths is possible only by changes to the underlying hardware; the two most important changes are a fast privilege boundary crossing mechanism and a tagged or software-managed TLB architecture.

The key point to take away from this paper is that capability systems are not slower than "conventional" operating systems. EROS provides a clear existence proof, operating at or near the performance limits imposed by the underlying hardware. The performance presented here is achieved by an implementation constructed in a high-level, portable, optimization-challenged source language (C++), augmented by less than 1500 lines of assembly code, which, including drivers, compiles to less than 160 kilobytes of code.

Further information on the EROS system can be found via our web page at http://www.cis.upenn.edu/~eros.

References

[Bom92] Allen C. Bomberger, A. Peri Frantz, William S. Frantz, Ann C. Hardy, Norman Hardy, Charles R. Landau, Jonathan S. Shapiro. "The KeyKOS NanoKernel Architecture," Proceedings of the USENIX Workshop on Micro-Kernels and Other Kernel Architectures, USENIX Association, April 1992, pp. 95-112.

[Che93] J. Bradley Chen and Brian N. Bershad. "The Impact of Operating System Structure on Memory System Performance," Proc. 14th SOSP, December 1993.

[Col88] Robert P. Colwell, Edward F. Gehringer, E. Douglas Jensen. "Performance Effects of Architectural Complexity in the Intel 432," ACM Transactions on Computer Systems, 6(3), August 1988, pp. 296-339.

[Den66] J. B. Dennis and E. C. Van Horn. "Programming Semantics for Multiprogrammed Computations," Communications of the ACM, vol. 9, pp. 143-154, March 1966.

[For94] Bryan Ford and Jay Lepreau. "Evolving Mach 3.0 to a Migrating Threads Model," Proceedings of the Winter USENIX Conference, January 1994. ftp://mancos.cs.utah.edu/papers/thread-migrate.ps.Z

[Gan94] Gregory R. Ganger and Yale N. Patt. "Metadata Update Performance in File Systems," Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, Nov. 1994, pp. 49-60.

[Har85] Norman Hardy. "The KeyKOS Architecture," Operating Systems Review, Oct. 1985, pp. 8-25.

[Key86] Key Logic, Inc. U.S. Patent 4,584,639: Computer Security System.

[Kit93] Takuro Kitayama, Tatsuo Nakajima, Hiroshi Arakawa, and Hideyuki Tokuda. "Integrated Management of Priority Inversion in Real-Time Mach," IEEE Real-Time Systems Symposium, December 1993.

[Lam73] Butler W. Lampson. "A Note on the Confinement Problem," Communications of the ACM, Vol. 16, No. 10, 1973.

[Lam76] Butler W. Lampson and Howard E. Sturgis. "Reflections on an Operating System Design," Communications of the ACM, Vol. 19, No. 5, May 1976.

[Lan92] Charles R. Landau. "The Checkpoint Mechanism in KeyKOS," Proceedings of the Second International Workshop on Object Orientation in Operating Systems, IEEE, September 1992, pp. 86-91.

[Les96] Need citation here.

[Lie93] Jochen Liedtke. "Improving IPC by Kernel Design," Proceedings of the 14th ACM Symposium on Operating System Principles, ACM, 1993.

[Lie95] Jochen Liedtke. Improved Address-Space Switching on Pentium Processors by Transparently Multiplexing User Address Spaces, GMD TR 933, November 1995.

[Lev84] Henry M. Levy. Capability-Based Computer Systems. Digital Press, 1984.

[Lyc78] H. Lycklama and D. L. Bayer. "The MERT Operating System," Bell System Technical Journal, 57(6, part 2), pp. 2049-2086, July/August 1978.

[McV93] Larry McVoy and C. Staelin. "lmbench: Portable Tools for Performance Analysis," Proceedings of the 1996 USENIX Technical Conference, San Diego, CA, January 1996, pp. 279-295.

[Mer93] C. W. Mercer, S. Savage and H. Tokuda. "Processor Capacity Reserves: An Abstraction for Managing Processor Usage," Proc. 4th Workshop on Workstation Operating Systems, October 1993.

[Org73] "Computer System Organization: The B5700/B6700 Series," Academic Press, 1973.

[Sal75] Jerome H. Saltzer and Michael D. Schroeder. "The Protection of Information in Computer Systems," Proceedings of the IEEE, 63(9), Sept. 1975, pp. 1278-1308.

[Sel95] M. Seltzer, K. Smith, H. Balakrishnan, J. Chang, S. McMains, and V. Padmanabhan. "File System Logging versus Clustering: A Performance Comparison," Proceedings of the 1995 USENIX Technical Conference, January 1995, New Orleans, LA, pp. 249-264.

[Sel95] Margo Seltzer, Yasuhiro Endo, Christopher Small, Keith Smith. "Dealing with Disaster: Surviving Misbehaved Kernel Extensions," Proceedings of the 1996 Symposium on Operating System Design and Implementation.

[Tho78] K. Thompson. "UNIX Implementation," Bell System Technical Journal, 57(6, part 2), pp. 1931-1946, July/August 1978.

[Sha96a] Jonathan S. Shapiro. A Programmer's Introduction to EROS. Available via the EROS home page at http://www.cis.upenn.edu/~eros

[Sha96b] Jonathan S. Shapiro. The EROS Object Reference Manual. In progress. Draft available via the EROS home page at http://www.cis.upenn.edu/~eros

[Sha96c] Jonathan S. Shapiro, David J. Farber, and Jonathan M. Smith. "State Caching in the EROS Kernel: Implementing Efficient Orthogonal Persistence in a Pure Capability System," Proceedings of the 7th International Workshop on Persistent Object Systems, Cape May, NJ, 1996.

[Sha96d] Jonathan S. Shapiro, David J. Farber, Jonathan M. Smith. "The Measured Performance of a Fast Local IPC," Proceedings of the 5th International Workshop on Object Orientation in Operating Systems, Seattle, Washington, November 1996, IEEE.

[Sha97] Jonathan S. Shapiro and Sam Weber. Verifying Operating System Security. Department of Computer and Information Science Technical Report MS-CIS-97-26, Forthcoming.

[Wul81] William A. Wulf, Roy Levin, and Samuel P. Harbison. HYDRA/C.mmp: An Experimental Computer System. McGraw Hill, 1981.