Virtualization Mechanisms for Mobility, Security

and System Administration

Shaya Potter

Submitted in partial fulfillment of the

requirements for the degree

of Doctor of Philosophy

in the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2010

© 2010

Shaya Potter

All Rights Reserved

ABSTRACT

Virtualization Mechanisms for Mobility, Security and System Administration

Shaya Potter

This dissertation demonstrates that operating system virtualization is an effective

method for solving many different types of computing problems. We have designed

novel systems that make use of commodity software while solving problems that were

not conceived when the software was originally written. We show that by leveraging

and extending existing virtualization techniques, and introducing new ones, we can

build these novel systems without requiring the applications or operating systems to

be rewritten.

We introduce six architectures that leverage operating system virtualization. *Pod

creates fully secure virtual environments and improves user mobility. AutoPod re-

duces the downtime needed to apply kernel patches and perform system maintenance.

PeaPod creates least-privilege systems by introducing the pea abstraction. Strata im-

proves the ability of administrators to manage large numbers of machines by introduc-

ing the Virtual Layered File System. Apiary builds upon Strata to create a new form

of desktop security by using isolated persistent and ephemeral application containers.

Finally, ISE-T applies the two-person control model to system administration.

By leveraging operating system virtualization, we have built these architectures

on Linux without requiring any changes to the underlying kernel or user-space ap-

plications. Our results, with real applications, demonstrate that operating system

virtualization has minimal overhead. These architectures solve problems with min-

imal impact on end-users while providing functionality that would previously have

required modifications to the underlying system.

Contents

Contents i

List of Figures vii

List of Tables ix

Acknowledgments xi

1 Introduction 1

1.1 OS Virtualization Security and User Mobility . . . . . . . . . . . . . 3

1.2 Mobility to Improve Administration . . . . . . . . . . . . . . . . . . . 5

1.3 Isolating Cooperating Processes . . . . . . . . . . . . . . . . . . . . . 6

1.4 Managing Large Numbers of Machines . . . . . . . . . . . . . . . . . 6

1.5 A Desktop of Isolated Applications . . . . . . . . . . . . . . . . . . . 7

1.6 Two-Person Control Administration . . . . . . . . . . . . . . . . . . . 8

1.7 Technical Contributions . . . . . . . . . . . . . . . . . . . . . . . . . 9

2 Overview of Operating System Virtualization 12

2.1 Operating System Kernel Virtualization . . . . . . . . . . . . . . . . 13

2.2 File System Virtualization . . . . . . . . . . . . . . . . . . . . . . . . 14

2.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3 *Pod: Improving User Mobility 20

3.1 *Pod Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.1.1 Secure Operating System Virtualization . . . . . . . . . . . . 24

3.2 Using a *Pod Device . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

3.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

3.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4 AutoPod: Reducing Downtime for System Maintenance 41

4.1 AutoPod Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 43

4.2 Migration Across Different Kernels . . . . . . . . . . . . . . . . . . . 45

4.3 Autonomic System Status Service . . . . . . . . . . . . . . . . . . . . 49

4.4 AutoPod Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

4.4.1 System Services . . . . . . . . . . . . . . . . . . . . . . . . . . 52

4.4.2 Desktop Computing . . . . . . . . . . . . . . . . . . . . . . . 53

4.4.3 Setting Up and Using AutoPod . . . . . . . . . . . . . . . . . 55


4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5 PeaPod: Isolating Cooperating Processes 63

5.1 PeaPod Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

5.2 PeaPod Virtualization . . . . . . . . . . . . . . . . . . . . . . . . . . 68

5.2.1 Pea Virtualization . . . . . . . . . . . . . . . . . . . . . . . . 69

5.2.2 Pea Configuration Rules . . . . . . . . . . . . . . . . . . . . . 73

5.2.2.1 File System . . . . . . . . . . . . . . . . . . . . . . . 73

5.2.2.2 Transition Rules . . . . . . . . . . . . . . . . . . . . 76

5.2.2.3 Networking Rules . . . . . . . . . . . . . . . . . . . . 77

5.2.2.4 Shared Namespace Rules . . . . . . . . . . . . . . . . 78

5.2.2.5 Managing Rules . . . . . . . . . . . . . . . . . . . . 78

5.3 Security Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

5.4 Usage Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.4.1 Email Delivery . . . . . . . . . . . . . . . . . . . . . . . . . . 83

5.4.2 Web Content Delivery . . . . . . . . . . . . . . . . . . . . . . 85

5.4.3 Desktop Computing . . . . . . . . . . . . . . . . . . . . . . . 87


5.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92

6 Strata: Managing Large Numbers of Machines 95

6.1 Strata Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

6.2 Strata Usage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102

6.2.1 Creating Layers and Repositories . . . . . . . . . . . . . . . . 103

6.2.2 Creating Appliance Templates . . . . . . . . . . . . . . . . . . 103

6.2.3 Provisioning and Running Appliance Instances . . . . . . . . . 104

6.2.4 Updating Appliances . . . . . . . . . . . . . . . . . . . . . . . 105

6.2.5 Improving Security . . . . . . . . . . . . . . . . . . . . . . . . 106

6.3 Virtual Layered File System . . . . . . . . . . . . . . . . . . . . . . . 107

6.3.1 Layers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109

6.3.2 Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.3.2.1 Dependency Example . . . . . . . . . . . . . . . . . 114

6.3.2.2 Resolving Dependencies . . . . . . . . . . . . . . . . 114

6.3.3 Layer Creation . . . . . . . . . . . . . . . . . . . . . . . . . . 116

6.3.4 Layer Repositories . . . . . . . . . . . . . . . . . . . . . . . . 117

6.3.5 VLFS Composition . . . . . . . . . . . . . . . . . . . . . . . . 119

6.4 Improving Appliance Security . . . . . . . . . . . . . . . . . . . . . . 122


6.5.1 Reducing Provisioning Times . . . . . . . . . . . . . . . . . . 126

6.5.2 Reducing Update Times . . . . . . . . . . . . . . . . . . . . . 127

6.5.3 Reducing Storage Costs . . . . . . . . . . . . . . . . . . . . . 128

6.5.4 Virtualization Overhead . . . . . . . . . . . . . . . . . . . . . 130

6.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132

7 Apiary: A Desktop of Isolated Applications 136

7.1 Apiary Usage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 139

7.2 Apiary Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2.1 Process Container . . . . . . . . . . . . . . . . . . . . . . . . . 144

7.2.2 Display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

7.2.3 File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147

7.2.4 Inter-Application Integration . . . . . . . . . . . . . . . . . . . 148


7.3.1 Handling Exploits . . . . . . . . . . . . . . . . . . . . . . . . . 153

7.3.1.1 Malicious Files . . . . . . . . . . . . . . . . . . . . . 154

7.3.1.2 Malicious Plugins . . . . . . . . . . . . . . . . . . . . 155

7.3.2 Usage Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158

7.3.3 Performance Measurements . . . . . . . . . . . . . . . . . . . 161

7.3.3.1 Application Performance . . . . . . . . . . . . . . . . 161

7.3.3.2 Container Creation . . . . . . . . . . . . . . . . . . . 162

7.3.4 File System Efficiency . . . . . . . . . . . . . . . . . . . . . . 165

7.3.5 File System Virtualization Overhead . . . . . . . . . . . . . . 167

7.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169

8 ISE-T: Two-Person Control Administration 173

8.1 Usage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176

8.2 ISE-T Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180

8.2.1 Isolation Containers . . . . . . . . . . . . . . . . . . . . . . . 181

8.2.2 ISE-T’s File System . . . . . . . . . . . . . . . . . . . . . . . 183

8.2.3 ISE-T System Service . . . . . . . . . . . . . . . . . . . . . . . 184

8.3 ISE-T for Auditing . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187


8.4.1 Software Installation . . . . . . . . . . . . . . . . . . . . . . . 190

8.4.2 System Services . . . . . . . . . . . . . . . . . . . . . . . . . . 192

8.4.3 Configuration Changes . . . . . . . . . . . . . . . . . . . . . . 193

8.4.4 Exploit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194

8.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195

9 Conclusions and Future Work 198

Bibliography 203

A Restricted System Calls 221

A.1 Host-Only System Calls . . . . . . . . . . . . . . . . . . . . . . . . . 221

A.2 Root-Squashed System Calls . . . . . . . . . . . . . . . . . . . . . . . 223

A.3 Option-Checked System Calls . . . . . . . . . . . . . . . . . . . . . . 224

A.4 Per-Virtual-Environment System Calls . . . . . . . . . . . . . . . . . 225

List of Figures

3.1 *Pod Virtualization Overhead . . . . . . . . . . . . . . . . . . . . . . 32

3.2 *Pod Checkpoint/Restart vs. Normal Startup Latency . . . . . . . . 34

4.1 AutoPod Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44

5.1 PeaPod Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66

5.2 Example of Read/Write Rules . . . . . . . . . . . . . . . . . . . . . 74

5.3 Protecting a Device . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.4 Directory-Default Rule . . . . . . . . . . . . . . . . . . . . . . . . . . 75

5.5 Transition Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

5.6 Networking Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

5.7 Namespace Access Rules . . . . . . . . . . . . . . . . . . . . . . . . . 78

5.8 Compiler Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

5.9 Set of Multiple Rule Files . . . . . . . . . . . . . . . . . . . . . . . . 79

5.10 Email Delivery Configuration . . . . . . . . . . . . . . . . . . . . . . 84

5.11 Web Delivery Rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

5.12 Desktop Application Rules . . . . . . . . . . . . . . . . . . . . . . . . 87

5.13 PeaPod Virtualization Overhead . . . . . . . . . . . . . . . . . . . . . 91

6.1 How Layers, Repositories, and VLFSs Fit Together . . . . . . . . . . 101

6.2 Layer Definition for MySQL Server . . . . . . . . . . . . . . . . . . . 109

6.3 Layer Definition for Provisioned Appliance . . . . . . . . . . . . . . . 109

6.4 Metadata for MySQL Server Layer . . . . . . . . . . . . . . . . . . . 111

6.5 Metadata Specification . . . . . . . . . . . . . . . . . . . . . . . . . . 111

6.6 Storage Overhead . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129

6.7 Postmark Overhead in Multiple VAs . . . . . . . . . . . . . . . . . . 131

6.8 Kernel Build Overhead in Multiple VAs . . . . . . . . . . . . . . . . . 132

6.9 Apache Overhead in Multiple VAs . . . . . . . . . . . . . . . . . . . . 133

7.1 Apiary Screenshot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140

7.2 Usage Study Task Times . . . . . . . . . . . . . . . . . . . . . . . . . 160

7.3 Application Performance with 25 Containers . . . . . . . . . . . . . . 162

7.4 Application Startup Time . . . . . . . . . . . . . . . . . . . . . . . . 164

7.5 Postmark Overhead in Apiary . . . . . . . . . . . . . . . . . . . . . . 167

7.6 Kernel Build Overhead in Apiary . . . . . . . . . . . . . . . . . . . . 168

8.1 ISE-T Usage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

List of Tables

3.1 Per-Device *Pod File System Sizes . . . . . . . . . . . . . . . . . . . 31

3.2 Benchmark Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . 31

3.3 *Pod Checkpoint Sizes . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1 Application Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . . . 58

4.2 AutoPod Migration Costs . . . . . . . . . . . . . . . . . . . . . . . . 59

5.1 Application Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.1 VA Provisioning Times . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.2 VA Update Times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127

6.3 Layer Repository vs. Static VAs . . . . . . . . . . . . . . . . . . . . . 130

7.1 Application Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 161

7.2 File System Instantiating Times . . . . . . . . . . . . . . . . . . . . . 163

7.3 Apiary’s VLFS Layer Storage Breakdown . . . . . . . . . . . . . . . . 166

7.4 Comparing Apiary’s Storage Requirements Against a Regular Desktop 166

7.5 Update Times for Apiary’s VLFSs . . . . . . . . . . . . . . . . . . . . 166

8.1 ISE-T Commands . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 178

8.2 Administration Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . 189

Acknowledgments

My deepest thanks go to my advisor, Jason Nieh, for his continual support and guid-

ance. His constant questioning, demanding of explanations and objective evaluation

has helped develop ideas that I would not have been able to reach on my own, while

also teaching me skills that I hope remain with me. I am constantly amazed by how

many different studies, projects and papers he is able to juggle while retaining the

ability to ask insightful questions. He has provided the model that I aspire to be.

There are many people at Columbia who have been a significant part of my grad-

uate experience. My officemates, Dinesh Subhraveti, Dan Phung and Dana Glasner

have been good friends, acted as sounding boards, provided valuable feedback, and,

in general, made the graduate experience an enjoyable one. I’ve worked on many

projects together with Ricardo Baratto and Oren Laadan and I am always amazed

by their abilities. Stelios Sidiroglou-Douskos, Mike Locasto, Carlo Perez and Gong Su

provided valuable feedback and friendship. I’d also like to thank Angelos Keromytis and

Steven M. Bellovin for providing help and guidance in my research. In addition, I’d

like to thank Erez Zadok, Gail Kaiser and Chandra Narayanaswami for serving on my

Ph.D. committee. Finally, I’d be remiss if I did not thank the administrative staff in

the Computer Science Department, including Alice Cueba, Twinkle Edwards, Elias

Tesfaye and Susan Tritto for handling many tasks that enabled me to focus on my

research.

Finally, I’d like to thank my parents, whose constant support and belief in me have

enabled all my accomplishments.

Dedicated in memory of my grandmothers,

לייב הירש צבי בת יוכבד and יצחק חיים בת מאשא אלתע

They were proud of all my accomplishments and were always

looking forward to the day when my Ph.D. would be complete.

Their memory will be with me always.

Chapter 1

Introduction

Computer use is more widespread today than it was even 10 years ago, but we are

still using software designs from 20 or 30 years ago. Although these designs are

well tested and understood, they were created to solve the problems of that time.

Today’s users face difficulties that the original software designers did not imagine.

We can redesign the operating system and applications to attempt to address these

problems, but this creates new, relatively untested software and designs and may

force users and administrators to learn fundamentally new models of usage. This

dissertation demonstrates that many problems can be solved not by redesigning and

rewriting the applications, but instead by virtualizing the interfaces through which

existing applications interact with the operating system.

Virtualization is the creation of a layer of indirection between two entities that

previously communicated directly. For example, in hardware virtualization [28, 34,

142,147], a virtual machine monitor (VMM) places a layer of indirection between an

operating system and the underlying hardware. A VMM provides a complete virtualized

hardware platform for an operating system, enabling any operating system

supporting that platform to run as though on physical hardware. Hardware virtual-

ization has been shown to enable operating systems to take advantage of hardware

for which they were not designed. The Disco project [34] demonstrated how to run

an operating system not designed for ccNUMA architectures on those architectures

by using a VMM.

Operating systems can also be virtualized in multiple ways, most commonly by

providing each process with its own virtualized and protected memory mappings. In-

stead of letting a process directly access the machine’s memory, the operating system,

with hardware support, places a layer of indirection between the processes and physi-

cal memory, creating a virtualized mapping between the process’s memory space and

the physical machine’s memory space. This provides security, efficiency and flexibil-

ity. The processes’ memories are isolated from one another, but memory can still be

shared among processes.

Memory, however, is not the only operating system interface that can be virtu-

alized. Zap [100] and FiST [152] demonstrated that an operating system’s kernel

state and file systems can be virtualized as well. Kernel virtualization operates by

virtualizing the system call interface, that is, by placing a layer of indirection between

processes and the system calls they use to access the operating system kernel’s func-

tionality and ephemeral state. Similarly, file system virtualization works by placing a

layer of indirection between processes and the underlying physical file systems, or the

operating system’s persistent state. Instead of accessing the machine’s kernel and file

system directly using built-in system call and file system functions, the application

running in the virtualized operating system executes a function within the virtualiza-

tion layer. The virtualization layer can modify the parameters passed to it, perform

work required by the desired virtualization, call built-in kernel and file system func-


tions to perform the desired real work, and modify the return value passed to the

calling process.
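
The interposition steps described above can be illustrated with a minimal sketch. It assumes a toy private process-ID namespace; the class name `VirtualizationLayer` and its methods are hypothetical and are not taken from the systems described in this dissertation. The sketch only shows the pattern of rewriting parameters, calling the built-in function, and rewriting the return value.

```python
import os


class VirtualizationLayer:
    """Toy interposition layer that maps virtual PIDs to host PIDs (hypothetical)."""

    def __init__(self):
        self._virt_to_host = {}
        self._host_to_virt = {}
        self._next_vpid = 1

    def register(self, host_pid):
        """Assign the next free virtual PID to a host process."""
        vpid = self._next_vpid
        self._next_vpid += 1
        self._virt_to_host[vpid] = host_pid
        self._host_to_virt[host_pid] = vpid
        return vpid

    def getpid(self, real_getpid=os.getpid):
        # Call the built-in function, then rewrite its return value.
        host_pid = real_getpid()
        return self._host_to_virt[host_pid]

    def kill(self, vpid, sig, real_kill=os.kill):
        # Rewrite the parameter, then call the built-in function.
        if vpid not in self._virt_to_host:
            # Processes outside the namespace are invisible to the caller.
            raise ProcessLookupError(vpid)
        return real_kill(self._virt_to_host[vpid], sig)
```

A process registered with the layer sees only its virtual identifier, while the host continues to use the real one. Passing `real_getpid` and `real_kill` as parameters is a convenience of this sketch so that the built-in calls can be substituted in a test.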

This dissertation demonstrates that by leveraging different forms of operating sys-

tem virtualization, we can use commodity operating systems and software in novel

ways and solve problems that the original developers could not have anticipated. By

virtualizing the interfaces, we do not change the applications or operating system, but

instead create specialized environments that enable us to solve problems. Although

virtualized environments, from the perspective of processes, look and behave like the

system they are virtualizing, they can look and behave very differently to the sys-

tems on which they are hosted. This decoupling of execution environment and host

environment lets us create tools that run on the host and solve new problems without

modifying a well-tested operating system and application code. For example, we can

create virtual private namespaces for applications distinct from the namespace of the

physical computer. To the processes running within the virtualized environment, the

environment looks like a regular machine, provides the same application interface,

and does not require applications to be rewritten. Similarly, because operating sys-

tem virtualization only interposes itself between the application and the underlying

operating system kernel, the underlying kernel’s binary and source code do not have

to be modified either.

1.1 OS Virtualization Security and User Mobility

Some forms of operating system virtualization [85, 100] are limited to isolating a

single user’s processes and are not designed to provide any security constraints. This

is especially noticeable for processes that run with elevated privileges, such as those

provided to root on Unix systems. Without secure virtualization, operating system


virtualization can only solve single user problems, substantially limiting its use. To

enable secure virtualization, we have enabled each virtualized environment to have a

unique set of virtualized users. Virtualizing the set of users gives each environment

an isolated set of privileges. However, unlike hardware virtualization, where each

virtual machine has a full operating system instance and therefore its own isolated

privileged state, operating systems generally only have a single set of privileged states.

Therefore, in addition to providing unique sets of virtualized users, we also restrict

the abilities of virtualized root users. If the virtualized root users were not restricted,

they could be treated equivalently to the root user of the underlying system, enabling

them to break the virtualization abstraction. This dissertation demonstrates how

operating system virtualization can be used to simply virtualize the set of users while

restricting the abilities of the privileged but virtualized root user.
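
The combination of per-environment users and a restricted root can be sketched as follows. This is an illustration under stated assumptions, not the mechanism implemented in this dissertation: the operation names in `HOST_ONLY` and `ROOT_ONLY` are chosen for the example, and the `VirtualEnvironment` class is hypothetical.

```python
# Operations that would affect the whole host (illustrative choice only).
HOST_ONLY = frozenset({"reboot", "load_kernel_module"})
# Operations that require root, but only affect the environment itself.
ROOT_ONLY = frozenset({"chown", "bind_low_port"})


class VirtualEnvironment:
    """Hypothetical environment with its own, isolated set of users."""

    def __init__(self, name):
        self.name = name
        self.users = {0: "root"}  # every environment has its own root

    def add_user(self, uid, username):
        self.users[uid] = username

    def may_perform(self, uid, operation):
        """Return True if `uid` in this environment may perform `operation`."""
        if uid not in self.users:
            return False          # unknown in this environment
        if operation in HOST_ONLY:
            return False          # denied even to the virtualized root
        if operation in ROOT_ONLY:
            return uid == 0       # allowed to the virtualized root only
        return True
```

Two environments can assign the same numeric user ID to different users without conflict, and the root user of either environment can administer that environment without being able to affect the host.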

We then show that operating system virtualization can be combined with check-

point/restart functionality to improve mobile users’ computing experience. Many

users lug around bulky, heavy computers simply to have access to their data and

applications. To solve this problem, we created *Pod devices. A *Pod is a physi-

cal storage device, such as a portable hard disk or USB thumb drive, containing a

complete application-specific environment, such as a desktop or web environment.

*Pod devices run their applications on whatever host computer is available at the

user’s current location. By storing the entire environment on the portable device,

users can move it between computers while retaining a common usage environment.

Operating system virtualization, coupled with process migration technology, enables

users to move their running processes and data between physical machines, much like

a laptop can be suspended and resumed when changing locations. We have built a

number of *Pod devices that enable users to carry an application [109,110,113] or an

entire desktop [114] with them.


1.2 Mobility to Improve Administration

Building on *Pod, we demonstrate how operating system virtualization and check-

point/restart ability can improve system maintenance, much of which requires taking

the machine offline and shutting down all active processes. Among other problems,

this prevents the kernel from being patched quickly, as it requires the machine to

be rebooted for the patch to take effect, thereby killing all running processes on the

machine. To address this, we developed AutoPod [112], a system that enables un-

scheduled operating system updates while preserving application service availability.

AutoPod leverages *Pod’s virtualization abstraction to provide a group of processes

and associated users with an isolated machine-independent virtualized environment

decoupled from the underlying operating system instance. This enables AutoPod to

run each independent service in its own isolated environment, preventing a security

fault in one from propagating to other services running on the same machine. This

virtualized environment is integrated with a checkpoint/restart system that allows

processes to be suspended, resumed and migrated across operating system kernel

versions with different security and maintenance patches. AutoPod incorporates a

system status service to determine when operating system patches need to be applied

to the current host, then automatically migrates application services to another host

to preserve their availability while the current host is updated and rebooted. Auto-

Pod’s ability to migrate processes across kernel versions also increases *Pod’s value

by making it possible for users to move their *Pod between machines that are not

running the exact same kernel version.


1.3 Isolating Cooperating Processes

AutoPod envisions virtual computer usage growing rapidly as users create and use

many task-specific virtual computers, as is already occurring with the rise of virtual

appliances. But more computers mean more targets for malicious attackers, making

it even more important to keep them secure. Operating system virtualization, as

in a pod, provides namespaces that isolate processes from the host, enabling a level

of least-privilege isolation as single services are constrained to independent pods.

Today’s services, however, are complex applications with many distinct components.

Even within a pod, each component of the service has access to all resources required

by every component within the system, which is not a true least-privilege system.

To solve this problem, we developed PeaPod [115], which combines the pod with

a pea (Protection and Encapsulation Abstraction). As AutoPod demonstrates, pods

can be used to isolate services into separate virtual machine environments. The pea is

used within a pod to provide finer-grained isolation among application components of

a single service while still enabling them to interact. This allows services composed of

multiple distinct processes to be constructed more securely. PeaPod enables processes

to work together while limiting the resources each process can access to only those

needed to perform its job.

1.4 Managing Large Numbers of Machines

Although virtualization provides numerous benefits, such as minimizing the amount

of hardware to maintain by putting multiple virtual machines on a single physical

host, this can also make it harder for administrators to maintain an increased num-

ber of virtual machines. Just as the proliferation of virtual machines affects security,


it also significantly increases the administrative burden. Instead of managing a sin-

gle machine providing a number of services, one manages many independent virtual

machines that each provide a single service. When security holes are discovered in

core operating system functionality, each virtual machine must be fixed separately.

This dissertation shows that operating system virtualization improves manage-

ment of large systems. Although virtualization decreases the amount of physical

hardware to manage, it does not reduce, and can even increase, the number of ma-

chine instances to be managed. Strata improves this situation by introducing the

Virtual Layered File System (VLFS). Instead of having independent file systems for

each service, the VLFS enables a file system to be divided into a set of shareable

layers and combined into a single file system namespace view. This enables many

machines to be stored efficiently because data that is common to more than one only

has to be stored once. It allows efficient provisioning because none of the shared files

have to be copied into place. Finally, it improves maintenance because the patched

layer only has to be installed once, and is then pulled into all the VLFSs that use

that layer.
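
The idea of combining shareable layers into a single namespace view can be sketched with ordinary dictionaries. This is a simplified model under stated assumptions: each layer is a mapping from path to content, the first layer containing a path wins, and `LayeredView` is a hypothetical name. Mutating a shared dictionary stands in for installing a patched layer; it does not describe how the VLFS itself is implemented.

```python
class LayeredView:
    """Hypothetical single-namespace view over an ordered list of shared layers."""

    def __init__(self, layers):
        # Layers are searched in order; the layer objects are shared, not copied.
        self.layers = list(layers)

    def read(self, path):
        for layer in self.layers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def listing(self):
        """Return every path visible in the combined namespace."""
        return sorted(set().union(*self.layers))


# One base layer stored once and shared by two machines.
base = {"/lib/libc.so": "libc v1"}
mysql_vm = LayeredView([{"/usr/sbin/mysqld": "mysqld"}, base])
apache_vm = LayeredView([{"/usr/sbin/apache2": "apache2"}, base])

# Patching the shared layer once is visible in every view that uses it.
base["/lib/libc.so"] = "libc v2"
```

Because both views reference the same base layer, the common data is stored once, nothing is copied when a view is created, and the patched file appears in both views after a single update.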

1.5 A Desktop of Isolated Applications

Once we can manage multiple independent machines efficiently, we can use those

machines in novel ways. For instance, Apiary improves the ability to create secure

computer desktops. Apiary leverages Strata’s VLFS to contain each application in

an independent and isolated container. Even if one application is exploited, the

exploit will be confined to that application and the rest of the user’s data will remain

secure. Similarly, because VLFSs allow very quick provisioning, Apiary can run

desktop applications ephemerally in addition to their regular persistent execution


models. An ephemeral application is an application whose container is provisioned

anew for each execution of the application. Once the execution is complete, the

container is removed from the system. This means that even if an application executed

ephemerally is exploited, the exploit will not persist because the next ephemeral

execution will be within a fresh container. Finally, because independent applications

do not provide the integrated feel users expect from their desktops, Apiary enables

applications to integrate securely at specific points. Apiary improves on PeaPod for

desktop scenarios by enabling applications to be securely isolated without requiring

complicated access rules to be designed and written.
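
The ephemeral execution model can be sketched as a provision, run, and remove cycle. The sketch assumes that a temporary directory stands in for a container's file system; it provides no process isolation, and `run_ephemeral` is a hypothetical helper rather than part of Apiary.

```python
import shutil
import tempfile
from pathlib import Path


def run_ephemeral(app, template):
    """Provision a fresh directory from `template`, run `app` in it, then remove it.

    `template` maps file names to their initial text content.
    """
    root = Path(tempfile.mkdtemp(prefix="ephemeral-"))
    try:
        for name, content in template.items():
            (root / name).write_text(content)
        return app(root)
    finally:
        # Anything the application wrote is discarded with the container.
        shutil.rmtree(root)
```

Whatever an application writes during one execution, including files dropped by an exploit, is gone before the next execution starts, because each run begins from the same clean template.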

1.6 Two-Person Control Administration

Finally, we have leveraged operating system virtualization to provide high assurance

system administration. In a traditional operating system, the administrative user is

an all-powerful entity who can perform any task with no record of the changes made

on the system and no check on their power. This causes two problems. First, they

are able to subvert the security of the system with malicious intent. Second, changes

made by single users are prone to error.

ISE-T [111] changes this model by applying the concept of two-person control

to system administration. Two-person control changes system administration in two

ways. First, instead of performing administrative actions directly on the machine,

the changes are first performed on a sandbox that mirrors the machine being admin-

istrated. By providing two administrators with their own sandboxes to perform the

same administrative task, ISE-T can extract their changes, compare them for equiv-

alence, and, if equivalent, commit them to the underlying machine. Second, in cases

where the two-person control system is too expensive, ISE-T can extract the changed


state and store it in a secure audit log for future verification before committing it to

the underlying system. This enables a high assurance system with little additional

administration cost.
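
The extract, compare, and commit cycle of two-person control can be sketched as follows. The sketch assumes that a machine and its sandboxes are modeled as dictionaries from path to file content and that equivalence means exact equality of the extracted changes; both are simplifications, and the function names are hypothetical.

```python
def extract_changes(before, after):
    """Return {path: new content, or None for a deletion} for every changed path."""
    changes = {}
    for path in before.keys() | after.keys():
        if before.get(path) != after.get(path):
            changes[path] = after.get(path)
    return changes


def two_person_commit(machine, sandbox_a, sandbox_b):
    """Commit the sandboxed changes only if both administrators made the same ones."""
    changes_a = extract_changes(machine, sandbox_a)
    changes_b = extract_changes(machine, sandbox_b)
    if changes_a != changes_b:
        return False  # disagreement: nothing reaches the machine
    for path, content in changes_a.items():
        if content is None:
            machine.pop(path, None)
        else:
            machine[path] = content
    return True
```

Neither administrator modifies the machine directly; a change reaches it only when both sandboxes yield the same set of modifications.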

1.7 Technical Contributions

This dissertation contributes multiple technical innovations and their associated ar-

chitectures:

1. We introduce an operating system virtualization platform that provides secure

virtual machines without any underlying operating system changes. This is

necessary to enable multiple virtual environments to run in parallel on a single

machine as well as to enable secure execution of untrusted processes.

2. We introduce a portable storage-based computing environment. By combining

our secure operating system virtualization platform, a checkpoint/restart sys-

tem, and portable storage devices, we created the *Pod architecture to migrate

a user’s processes between machines securely.

3. We introduce a checkpoint/restart mechanism to enable the migration of pro-

cesses between machines running different kernels. This is accomplished by sav-

ing the checkpoint/restart state in a kernel-independent format so that it can

be adapted to the internal data structures of the kernel to which the processes

are being migrated. The AutoPod architecture improves system management

by allowing administrators to administrate machines without terminating pro-

cesses. It also improves the utility of *Pod by not limiting users to machines

running the same kernel version.


4. We introduce the pea process isolation abstraction. Peas allow individual pro-

cesses in a multi-process system to cooperate while contained in individual

resource-restricted compartments. The PeaPod architecture creates least-privilege

environments for the multiple processes that constitute services in use today.

5. We introduce the Virtual Layered File System (VLFS). The VLFS improves sys-

tem administration by enabling system administrators to divide a file system

into distinct subset layers and use the layers for multiple simultaneous installa-

tions. The VLFS combines traditional package management with unioning file

systems in a new way, yielding powerful new functionality. The Strata architec-

ture permits administrators to provision and manage large numbers of virtual

machines efficiently.

6. We introduce the concepts of a containerized desktop and ephemeral application

execution. In a containerized desktop, each desktop application is fully isolated

in its own container with its own file system. This prevents an exploited applica-

tion from accessing data belonging to other applications. Ephemeral application

execution creates a single-use application container and file system for each indi-

vidual application execution. Ephemeral containers prevent malicious data from

having any persistent effect on the system and isolate faults to a single appli-

cation instance. The Apiary architecture provides a new way to secure desktop

applications by isolating each application within its own container, while let-

ting the isolated applications interact in a secure manner through ephemeral

execution.

7. We introduce two-person control for system administration to create a high as-

surance form of system administration. This helps keep system administration

faults from impacting a system. We use the same mechanism to introduce au-


ditable system administration, increasing assurance with little additional cost.

The ISE-T architecture enables systems to be administrated within this two-

person control model.

Chapter 2

Overview of Operating System

Virtualization

To understand how operating system virtualization allows us to solve new software

problems without requiring the software to be rewritten, we first explain what operating

system virtualization is and how it works. Many people are familiar with hardware

virtualization, where real operating systems run on virtual hardware provided by a

virtualization layer that sits between the host machine and the operating system. Operating system

virtualization differs from hardware virtualization in where it places the virtualization

layer. Instead of virtualizing the hardware interfaces, it virtualizes the operating sys-

tem interfaces to provide virtualized views of the underlying host operating system.

Unlike hardware virtualization, where different operating systems can run in paral-

lel, operating system virtualization is restricted to the same operating system as the

host. This dissertation explores the benefits of virtualizing the two primary operating

system elements that applications leverage: the kernel that provides the runtime, but

ephemeral, state of a process, and the file system that provides the long-term stable

storage on which processes depend.

2.1 Operating System Kernel Virtualization

Applications depend heavily on kernel state during their runtime, from simple things

like process identifiers to more complicated states like inter-process communication

(IPC) keys, file descriptors and memory mappings. Some of these states already have

an element of virtualization that enables multiple processes to coexist on a single

system. For example, each process has its own file descriptor and virtual memory

namespaces. On the other hand, states such as process identifiers and IPC keys are

shared within a single namespace accessible to all processes. One primary use of

operating system kernel virtualization is to create multiple parallel namespaces that

are fully isolated from one another [7, 74, 116]. But this requires significant in-kernel

modifications.

Operating system kernel virtualization is commonly implemented by virtualizing

resource identifiers. Every resource that a virtualized process accesses has a virtual

identifier which corresponds to a physical operating system resource identifier. When

an operating system resource is created for a virtualized process, such as with IPC

key creation, the virtualization layer, instead of returning the corresponding physical

name to the process, intercepts the physical name value and returns a virtual name

to the process. Similarly, any time a virtualized process passes a virtual identifier

to the operating system, the virtualization layer intercepts it, replacing it with the

appropriate physical identifier.

This type of operating system kernel virtualization is easily implemented by sys-

tem call interposition. System call interposition can create a virtualized namespace

because all operating system resources are accessed through system calls. By inter-

posing on the system call, the virtualization abstraction can intercept the virtual

resource identifier the process passes in with the system call, and, if valid, replace


it with the correct physical resource identifier. Similarly, whenever a physical re-

source is created and has its identifier passed back to a process, the virtualization

abstraction can intercept the value and replace it with a newly created and mapped

virtual identifier. By virtualizing a process so that it can only access virtual named

resources, operating system virtualization decouples a process’s execution from the

underlying namespace of the host machine. Many commodity operating systems,

including Solaris [116] and Linux [6], now include this functionality natively.
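
The identifier translation described above can be sketched as a pair of lookup tables. This is only an illustrative model, not the implementation used in this dissertation: the class VirtualNamespace, the function interposed_kill and the host_kill callback are hypothetical names, and a real virtualization layer performs this translation inside interposed system calls rather than in a user-level program.

```python
import itertools


class VirtualNamespace:
    """Maps virtual identifiers seen by a virtualized process to physical ones."""

    def __init__(self):
        self._next_virtual = itertools.count(1)
        self._virt_to_phys = {}
        self._phys_to_virt = {}

    def to_virtual(self, physical_id):
        """Return path: hide a physical identifier behind a virtual one."""
        if physical_id not in self._phys_to_virt:
            virtual_id = next(self._next_virtual)
            self._phys_to_virt[physical_id] = virtual_id
            self._virt_to_phys[virtual_id] = physical_id
        return self._phys_to_virt[physical_id]

    def to_physical(self, virtual_id):
        """Entry path: translate a virtual identifier, rejecting unknown ones."""
        if virtual_id not in self._virt_to_phys:
            raise PermissionError(f"identifier {virtual_id} is not in this namespace")
        return self._virt_to_phys[virtual_id]


def interposed_kill(namespace, virtual_pid, host_kill):
    """Hypothetical wrapper for a kill-like call: translate, then call the real one."""
    return host_kill(namespace.to_physical(virtual_pid))
```

Because each namespace keeps its own tables, two namespaces can hand out the same virtual identifier for different physical resources, and an identifier that was never mapped into a namespace cannot be named from inside it.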

Kernel virtualization is not limited to creating independent and isolated names-

paces, but can also change how the kernel behaves. Instead of simply translating

resource identifiers, kernel virtualization can change how the system calls interact

with those identifiers. For instance, it can change the security semantics of system

calls. Many system calls have built-in security checks to decide whether a process has

permission to execute a specific functionality. Once the kernel is virtualized through

the system call interface, the virtualized system calls can allow a process to access a

resource it would have otherwise been prevented from accessing, or vice versa.

2.2 File System Virtualization

Kernel virtualization and system call interposition enable virtualization of the ephemeral

kernel state, but each process also uses the file system, which provides the processes

with persistent storage. By virtualizing the file system, we enable processes to have a

file system view that is independent of the host machine’s file system. For instance,

when creating multiple parallel kernel namespaces, one often intends to provide vir-

tual machine environments. To do this, one must also provide a private file system

namespace for each environment. If private file system namespaces are inadvertently

omitted, the file system is shared and isolation is severely weakened.


In fact, commodity operating systems offer the ability to virtualize the file system

in exactly this way, such as through the chroot facility, which confines a process to

a subset of the underlying machine’s file system. But because current

commodity operating systems are not built to support multiple namespaces, we must

address the security issues this causes. Although chroot can provide processes within

a pod a virtualized file system namespace, there are many ways to break out of the

standard chrooted environment, especially if one allows the chroot system call to be

used by processes within the virtualized file system environment [58].

To provide secure file system virtualization, the virtualization mechanism must

enforce the chrooted environment’s limitation at all times. We have implemented a

barrier directory in the underlying file system that prevents processes within a pod

from crossing it. Even if a process is able to break the chrooted virtualized file system

view, the process will never be able to access any files outside the virtualized area.

To enforce a barrier, we interpose on the file system’s ->permission method, which

determines if a process can access a file or directory. For example, if a process tries to

access a file a few directories below the current directory, the permission function is

called on each directory in order as well as on the file itself. If a call determines that

the process does not have permission on that directory, the chain of calls ends, because

the process must have permission to traverse the directory hierarchy in order to access

the file. By interposing on the permission function, we can deny permission to access

the barrier directory to processes within a pod. The process cannot traverse the

barrier and so cannot access any file outside the virtualized file system environment.
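
The effect of the barrier can be illustrated with a small model of path traversal. This is a sketch under simplifying assumptions: paths are plain strings rather than kernel inode and directory entry structures, the functions permission and resolve are hypothetical names, and the example paths are invented for illustration. It models a process that has already broken out of its chrooted view and tries to walk upward with "..".

```python
import posixpath


def permission(directory, barrier, in_pod):
    """Stand-in for the interposed permission check on one path component."""
    return not (in_pod and directory == barrier)


def resolve(start, path, barrier, in_pod=True):
    """Resolve path relative to start, checking permission at every step."""
    current = start
    for name in path.split("/"):
        if name in ("", "."):
            continue
        if name == "..":
            current = posixpath.dirname(current)
        else:
            current = posixpath.join(current, name)
        # One denied component ends the whole chain of calls.
        if not permission(current, barrier, in_pod):
            raise PermissionError(f"barrier reached at {current}")
    return current
```

In this model, a pod process starting at /pods/p1/root with the barrier at /pods/p1 can resolve home/user/notes.txt, but any path that climbs through the barrier fails at the first step that lands on it, no matter what follows.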

However, file system virtualization is not limited to the creation of private file

system namespaces. Much as the barrier directory is implemented by interposing

on the file system’s permission function, one can also interpose on all functionality

the file system exposes to the operating system in order to create virtualized file


system instances. Just as pod virtualization allows differentiating virtual machine

environments without unique machine or operating system instances, file system vir-

tualization permits differentiating each pod’s file system namespace in unique ways

without requiring each pod to have a unique physical file system. For instance, file

system virtualization enables pods to have unique file system security policies. It

can even create file system views totally independent of the underlying file system by

combining multiple individual file systems into a single view.

In fact, this is exactly how stackable file systems [124, 152] work. Stackable file

systems provide a completely virtual file system by interposing on the kernel’s file

system operations. Instead of interposing directly, as with system call virtualization,

stackable file systems create a virtual file system that the kernel uses as a regular file

system. But rather than having a data store on a block device of its own, it leverages

the data stored within other file systems. This enables stackable file systems to

interpose directly on the physical file system by leveraging the operating system’s file

system interface. Instead of executing the file system’s functions directly, including

using its directory entry and inode structures, file system virtualization interposes on

those functions and provides its own set of file system structures that map onto those

of the underlying physical file system.

By interposing between the kernel and physical file systems, stackable file systems

allow easy creation of virtual file systems. The virtual file system is then able to

modify operations as appropriate for the needs of the system. For example, a unioning

semantic can be implemented with a stackable file system that combines multiple

underlying physical directories into a single view by interposing on the ->readdir

method. Whenever a program calls the operation, the stackable file system creates a

virtualized view by running the operation against all the underlying directories that

are being unioned into a single view and returning the unioned set of data.
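
As a rough user-level illustration of the unioning semantic, the following sketch lists several directories and merges the results, letting earlier layers take precedence. The function name union_readdir is hypothetical, and the sketch ignores issues a real unioning file system must handle, such as entries deleted in an upper layer.

```python
import os


def union_readdir(layers):
    """Return one merged listing for an ordered list of directories.

    Earlier layers take precedence: a name already seen in a higher
    layer is not reported again when it appears in a lower one.
    """
    seen = set()
    entries = []
    for layer in layers:
        for name in sorted(os.listdir(layer)):
            if name not in seen:
                seen.add(name)
                entries.append(name)
    return entries
```

A stackable file system would perform the equivalent merge inside its own readdir operation, so programs see a single directory without knowing that several underlying directories contribute to it.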


2.3 Related Work

Many different systems have been created to enable the virtualization of kernel states.

They can be loosely grouped into four categories:

Operating system provided virtualization. This is most notable in operating

systems that provide alternate namespaces for the creation of containers, including

Solaris’s Zones [116], Linux’s Vserver [7] and Containers [6], and BSD’s Jail mode [74].

Systems in this category are the least flexible, as their techniques are tightly coupled

to the underlying system. This prevents them from being leveraged to solve problems

for which they were not explicitly designed.

Direct interposition on system calls. This enables code to directly intercept

the system call within the kernel. The kernel does not call the built-in system call’s

function, but instead executes the function provided by the virtualization layer, which

in turn calls the built-in one if needed. This very old technique, common in MS-DOS,

was used in Terminate and Stay Resident (TSR) programs [99]. In more modern

usage, Zap [100] implements its virtualization by interposing directly on a set of

system calls that it desires to virtualize, as well as by providing a generic interface

to enable other virtualization layers to interpose on whatever system call they desire.

The architectures in this dissertation use this approach.

Kernel-based system call trace and trapping. This is most notably provided

by the ptrace system call [144], which provides tracing and debugging facilities that

enable one process to completely control the execution of another. For instance,

a controlling process can be notified whenever the controlled process attempts to

execute a system call. Instead of letting the system call run directly, the controlling

process chooses to allow or disallow the system call, to change the parameters being

passed to the system call, or even to cause a totally separate code path to be executed.


This is a very flexible approach because while the interposition is being enforced by the

kernel via the ptrace system call, it runs as a regular user space program. However,

due to the many context switches between the user space program using the

ptrace system call and the kernel, performance suffers.

User space-based system call trace and trapping. Instead of trapping in

the kernel, one can provide a user space library that provides its own system call

wrapper function [1]. Well-behaved programs do not execute system calls directly,

but call a library function that wraps the system call, enabling the system call to be

virtualized by replacing that library function with one that enforces the virtualization

of the kernel state. But this only works for well-behaved applications and cannot be

used to enforce security schemes, as any application can execute system calls directly

and avoid the library’s interposition mechanism.

File system virtualization. Operating systems’ file system interfaces have also

been virtualized in multiple ways. Modern operating systems provide a Virtual File

System (VFS) interface [73]. This enables different types of file systems to be used

with the operating system in a manner transparent to all applications. In addition,

modern operating systems support network file system shares using protocols such as

NFS [135] and SMB [151]. These network file systems provide virtualized access to

a remote file system while enabling applications to treat its contents as though they

were stored locally.

A common way to create virtualized file system access is through stackable file

systems. For example, Plan 9 [104] offered the 9P distributed file system protocol [105]

to enable the creation of virtual file systems. HURD [35] and Spring [78] also included

extensible file system interfaces. More commonly today, the NFS protocol serves as

the basis for other file systems that virtualize and extend the Unix file system via the

SFS Toolkit [89]. It exposes the NFS interface to user space programs, allowing them


to provide file system functionality safely. But the NFS protocol is very complicated.

User space file systems that depend on it must fully understand it to be implemented

correctly.

The more usual approach is to leverage kernel functionality to create these virtual-

ized file systems. This is generally easier to implement than an NFS-based approach

because the kernel’s file system interface is simpler than the one the NFS protocol

exposes. This can be implemented via a user space file system such as FUSE [137]

that provides the necessary kernel hooks. Alternatively, the entire file system can be

built as an in-kernel file system that can be dynamically loaded and unloaded, as in

FiST [152], which behaves as a native file system. In general, the in-kernel approach

yields significantly better performance because fewer context switches are needed.

Kernel-based virtualized file systems are known as stackable file systems and have

been implemented in many different operating systems [97,124,130,152].

Chapter 3

*Pod: Improving User Mobility

A key problem mobile users face is the lack of a common environment as they switch

locations. The computer at the office is configured differently from the one at home,

which is again different from the one at the library. Even though mobile users have

massive computing power at each location, they cannot easily take advantage of it.

These locations can have different sets of software installed, which can make it difficult

for a user to complete a task. Moreover, mobile users want consistent access to their

files, which is difficult to guarantee as they move around. The current personal

computer framework ties a user’s data to a single machine.

Laptops are a common attempt to solve the problems posed by mobility.

Laptops enable users to carry their data and applications with them wherever

they go. But laptops only mask the problem, as they do not leverage the existing

infrastructure and suffer from a number of difficulties of their own. First, laptops

are not as full-featured as a desktop computer. They have less storage and smaller

physical features like keyboards and displays. They are slower because cooling and

space constraints prevent the fastest processors from being used in a laptop. Even

laptops considered to be Desktop Replacements have speed limitations, tend to be

as heavy as 8 or 9 pounds, and are not meant to be extremely mobile. Second, be-

cause laptops use small, specialized, and moving parts, they are more fault-prone.

This manifests itself in moving parts like a fan or hard disk breaking down, or in an

internal connection coming loose, as when memory is unseated from its socket.

To address these mobility and reliability problems posed by laptops, we have

designed and built the *Pod architecture. *Pod leverages operating system virtual-

ization to enable the creation of application-specific portable devices that decouple

a user’s application environment from any one physical machine. Depending on the

mobile user’s needs, the *Pod architecture lets users carry a single application or a

large set of applications, as well as large sets of data.

For instance, many users do most of their computing work through a web browser.

They read email through webmail interfaces, interact with friends on social network-

ing websites, and even use word processors and spreadsheets without leaving the web

browser. But while a web browser is available on every Internet-connected computer,

it will not necessarily be configured according to their needs. For instance, helper ap-

plications, browser plugins, bookmarks and cookies will not move with them between

machines. For these users, we leveraged the *Pod architecture to create a WebPod

device that contains a web browser, plugins and helper applications needed within

the web environment.

Many mobile users, however, require a more full-featured computing environment.

They do not want to store all their data on the Internet, nor to be limited to the

applications available via the web. They expect the traditional desktop experience.

Although they have access to powerful computers at many locations, these computers

are not configured correctly for their work. For these users, we leveraged the *Pod

architecture to create a DeskPod device containing all of the desktop applications a

user requires, along with their data.


The *Pod architecture enables mobile users to obtain the same persistent, person-

alized computing experience at any computer. *Pod takes advantage of commodity

storage devices that can easily fit in a user’s pocket yet store large amounts of data.

These devices range from flash memory sticks that can hold 64 GB of data to portable

hard disks, such as an Apple iPod, that can hold 120 GB of data. These devices can

hold a user’s entire computing environment, including applications and all their data.

The *Pod architecture allows a user to decouple their computing session from

the underlying computer, so that it can be suspended to a portable storage device,

carried around easily, and resumed from the storage device on a completely different

computer. Users have ubiquitous access to computing power, at work, home, school,

library or even an Internet cafe, and the *Pod architecture enables them to continue

working, even in the face of faulty components, simply by moving their *Pod-based

environment to a new host machine. *Pod provides this functionality without mod-

ifying, recompiling or relinking any applications or the operating system kernel, and

with only a negligible impact on performance.

The *Pod architecture does have limitations, as shown by our MediaPod and

GamePod devices. These devices enable users to carry with them a multimedia

player and a game playing environment, respectively. Although they are very flexible

in what media formats and games they can play, they do not provide any computing

capabilities of their own. Moreover, although they allow users to move their envi-

ronment among computers, they do not let them make use of the environment on

the go, when they have no access to a computer. This is in contrast to devices such

as Apple’s iPod and Nintendo’s Game Boy and DS portable devices, which can only

play a limited number of formats, but provide their own computing ability and are

therefore usable on the go, without a powerful computer. These devices are popular

with users on the move, so MediaPod and GamePod are less likely to replace them.


3.1 *Pod Architecture

*Pod operates by encapsulating a user’s computing session within a virtualized exe-

cution environment and storing all states associated with the session on the portable

storage device. *Pod also leverages THINC [27] to virtualize the display so that

the application session can be scaled to different display resolutions as a user moves

among computers. This enables a computing session to run the same way on any

host despite different operating system environments and display hardware. These

virtualization mechanisms enable *Pod to isolate and protect the host from untrusted

applications that a user may run as part of their *Pod session. *Pod virtualization

also prevents other applications outside of the computing session that may be running

on the host from accessing any of the session’s data, protecting the user’s privacy.

We have combined *Pod’s virtualization with Zap’s checkpoint/restart mecha-

nism [100], allowing users to suspend the entire computing session to the portable

storage device so that it can be migrated between physical computers by simply mov-

ing the storage device to a new computer and resuming the session there. *Pod

preserves on the portable device the file system states and process execution states

associated with the computing session. A limitation of this approach is that Zap

only supports homogeneous migration, so it can migrate only between machines run-

ning the exact same kernel. In Chapter 4, we demonstrate heterogeneous migration,

thereby removing this limitation.

As a result, *Pod enables users to maintain a common environment, no matter

what computer they are using. Devices built upon the *Pod architecture are also less

prone to problems because they do not contain a complete operating system, only the

programs needed for one specific application environment. Various operating system

services that a normal machine depends on are not needed, so maintenance is simpler.


To the user, a *Pod-based device appears no different than a private computer, even

though it runs on a host that may be running other applications. Those applications

run outside the session provided by the *Pod device and are not visible to a user

within the *Pod session. To provide strong security, the *Pod can store the session

on an encrypted file system. If the *Pod device is lost or stolen, an attacker will only

be able to use it as his own personal storage device.

3.1.1 Secure Operating System Virtualization

In order to enable a *Pod device to be used on computers that are not controlled by

the *Pod user, we must securely isolate the *Pod device from the underlying machine.

Previous operating system virtualization techniques either are not designed to provide

secure isolation and therefore do not protect the host machine from rogue processes

running within the device’s context, or require significant changes to the underlying

operating system.

For example, pods as introduced by Zap [100] provide a level of isolation and enable

multiple pods to coexist on a single system, but they were not designed to be secure.

Zap’s virtualization operates by providing each environment with its own virtual

private namespace. A pod contains its own host-independent view of operating system

resources such as PID/GID, IPC, memory, file system and devices. The namespace

is the only means for the processes to access the underlying operating system. Zap

introduces this namespace to decouple processes from the host’s operating system.

But Zap assumes that the person using the pod already has privileged access to the

machine, and therefore is not directly concerned with a user breaking out of the

abstraction. Without protecting the host, no one would allow *Pod devices to use

their systems. Therefore, we leverage operating system virtualization at both the


kernel and file system levels to create the secure pod abstraction, enabling untrusted

*Pod devices to be used securely.

Protecting the host from rogue processes requires a complete virtualization ab-

straction that totally confines the process and prevents it from breaking the abstrac-

tion and effecting change to the host machine. The secure pod abstraction achieves

this in two ways. First, it prevents processes within it from accessing any file system

outside of the *Pod device. Second, while it lets processes in the *Pod device context

run with privilege, it prevents any privileged action that can break the abstraction.

Many previous operating system virtualization architectures relied on the chroot

functionality to provide a private file system namespace in which processes run. While

chroot can give a set of processes a virtualized file system namespace, there are many

ways to break out of the standard chrooted environment, especially if one allows the

chroot system call to be used by the virtualized processes. To prevent this, the

secure pod abstraction virtualizes the file system interface and implements a barrier,

thereby enforcing the chrooted environment even while allowing the chroot system

call.

We can implement a barrier easily because file systems provide a ->permission

method that determines if a process can access a file. For example, if a process

tries to access a file a few directories below the current directory, the file system’s

->permission method is called on each directory as well as the file itself, in order. If

any call determines that the process does not have permission on a directory, the chain

of calls ends. Even if the ->permission method were to determine that the process

has access to the file itself, it must have permission to traverse the directory hierarchy

to reach the file. We implemented a barrier simply by stacking a small virtual file

system on top of the staging directory that virtualized the underlying ->permission

method to prevent the virtualized processes from accessing the parent directory of the


staging directory. This effectively confines *Pod processes to the *Pod’s file system

by preventing a rogue process from ever walking past the *Pod’s file system root.

The secure pod abstraction also takes advantage of the user identifier (UID) se-

curity model in traditional file systems to support multiple security domains on the

same system running on the same operating system kernel. For example, since each

secure pod has its own private file system, it has its own /etc/passwd file that deter-

mines its list of users and their corresponding UIDs. In traditional Unix file systems,

the UID of a process determines what permissions it has when accessing a file. This

means that since the *Pod’s file system is separate from the host file system, a *Pod

process is effectively running in a separate security domain from another process with

the same UID that is running directly on the host system. Although both processes

have the same UID, the *Pod process is only allowed to access files in its own file sys-

tem namespace. Similarly, this model allows multiple secure pods on a single system

to contain independent processes running with the same UID.

This UID model also makes migration straightforward when a user uses a *Pod

device on a host in one administrative domain and then moves the device to a host

in another. Even if the user has computer accounts in both administrative

domains, it is unlikely that the user will have the same UID in both domains if they

are administratively separate. Nevertheless, the secure pod abstraction enables the

user to run the same *Pod device with access to the same files in both domains.

Suppose the user has UID 100 on a machine in administrative domain A and starts

a pod connecting to a file server residing in domain A. Suppose that all virtualized

processes are then running with UID 100. When the user moves to a machine in

administrative domain B where they have UID 200, they can migrate their *Pod

device to the new machine and continue running its processes. Those processes can

continue to run as UID 100 and continue to access the same set of files on the *Pod


file system, even though the user’s real UID has changed. This works even if there is a

regular user on the new machine with a UID of 100. Whereas this example considers

the case of a *Pod device with all processes running with the same UID, it is easy to

see that the secure pod abstraction supports running processes with many different

UIDs.
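
The separation of security domains can be modeled in a few lines. This is a deliberately simplified sketch: the Process and File types and the may_access function are hypothetical, file permissions are reduced to an owner-only check, and a file system namespace is represented by a plain string.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Process:
    uid: int
    fs_namespace: str


@dataclass(frozen=True)
class File:
    owner_uid: int
    fs_namespace: str


def may_access(process, file):
    """A process reaches only files in its own namespace that its UID owns."""
    return (process.fs_namespace == file.fs_namespace
            and process.uid == file.owner_uid)
```

In this model, a process with UID 100 inside one pod cannot reach a file owned by UID 100 on the host or in another pod, because the UID is only meaningful within the file system namespace in which the process runs.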

This only works for regular processes, however, because they do not have

special privileges. But because the root UID 0 is privileged and treated specially by

the operating system kernel, the secure pod virtualization abstraction treats UID 0

processes within a secure pod specially as well. We must do this to prevent processes

with privilege from breaking the virtualization abstraction, accessing resources on the

host, and harming it. The secure pod abstraction does not disallow UID 0 processes,

as this would limit the range of application services that can be virtualized. Instead,

it restricts such processes to ensure that they function correctly when virtualized.

While a process is running in user space, its UID does not have any effect on

process execution. Its UID only matters when it tries to access the underlying kernel

via one of the kernel entry points, namely devices and system calls. Since the secure

pod abstraction already provides a virtual file system that includes a virtual /dev with

a limited set of secure devices, the device entry point is already secured. Furthermore,

the secure pod abstraction disallows device nodes on the *Pod device’s file system.

The only system calls of concern are those that could allow a root process to break the

virtualization abstraction. Only a small number of system calls can be used for this

purpose. These system calls are listed and described in further detail in Appendix A.

Secure pod virtualization classifies these system calls into three classes.

The first class of system calls are those affecting only the host system and serving

no purpose within a virtualized context. Examples of these system calls include those

that load and unload kernel modules (create_module, delete_module) or that reboot


the host system (reboot). Since they only affect the host, they would break the secure

pod abstraction by allowing processes within it to make administrative changes to

the host. System calls that are part of this class are therefore made inaccessible by

default to virtualized processes.

The second class of system calls are those forced to run unprivileged. Just as

NFS, by default, squashes root on a client machine to act as user nobody, secure pod

virtualization forces privileged processes to act as the nobody user when they execute

some system calls. Examples of these system calls include those that set resource

limits and ioctl system calls. Because some system calls, such as setrlimit and

nice, can allow a privileged process to increase its resource limits beyond predefined

limits imposed on virtualized processes, privileged virtualized processes are by de-

fault treated as unprivileged when executing these system calls. Similarly, the ioctl

system call is a multiplexer that effectively allows any driver on the host to install its

own set of system calls. It is impossible to audit the large set of system calls, given

that a *Pod device may be used on a wide range of machine configurations, so we

conservatively treat access to this system call as unprivileged by default.

The final class of system calls are those that are required for regular applications

to run, but that have options that will give the processes access to the underlying

host resources, breaking the isolation provided by the secure pod abstraction. Since

these system calls are required by applications, the secure pod virtualization checks all

their options to ensure that they are limited to resources to which the *Pod device has

access, making sure they do not break the secure pod abstraction. For example, the

mknod system call can be used by privileged processes to make named pipes or files in

certain application services. It is therefore desirable to make it available to virtualized

processes. But it can also be used to create device nodes that provide access to the

underlying host resources. The secure pod’s kernel virtualization mechanism checks


the options of the system call and only allows it to continue if it is not trying to create

a device.
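The three classes above can be summarized as a small policy table. The following is a minimal sketch, not the kernel module's actual implementation: the system call names come from the text, while the table layout and the names `DENIED`, `SQUASHED`, `mknod_allowed`, and `decide` are illustrative assumptions. It models only how a UID 0 process inside a pod would be treated.

```python
import stat

# Hypothetical policy table for UID 0 processes inside a secure pod.
# The syscall names are taken from the text; the structure is an assumption.
DENIED = {"create_module", "delete_module", "reboot"}   # class 1: host-only calls
SQUASHED = {"setrlimit", "nice", "ioctl"}                # class 2: run as "nobody"


def mknod_allowed(mode: int) -> bool:
    """Class 3 check: permit mknod unless it would create a device node."""
    return not (stat.S_ISCHR(mode) or stat.S_ISBLK(mode))


def decide(syscall: str, mode: int = 0) -> str:
    """Return how a privileged pod process's system call is treated."""
    if syscall in DENIED:
        return "deny"
    if syscall in SQUASHED:
        return "run_as_nobody"
    if syscall == "mknod":
        return "allow" if mknod_allowed(mode) else "deny"
    return "allow"  # every other call passes through unchanged


if __name__ == "__main__":
    print(decide("reboot"))
    print(decide("ioctl"))
    print(decide("mknod", stat.S_IFIFO | 0o600))   # named pipe
    print(decide("mknod", stat.S_IFCHR | 0o600))   # character device
```

Only the mknod check inspects an argument. The first two classes are decided from the system call name alone, which keeps the per-call cost of the check small.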

3.2 Using a *Pod Device

A user starts a *Pod device simply by plugging it into a computer. The computer

detects the device and automatically tries to restart the *Pod session. The user

is asked for a password. Authentication can also be done without a password by

using built-in fingerprint readers available on some USB drives [11]. Once a user is

authorized, the *Pod device mounts its file system, restarts its desktop computing

session, and attaches a *Pod viewer to the session, making the associated set of

applications available and visible to the user. Applications running in a *Pod session

appear to the underlying operating system just like other applications that may be

running on the host machine, and they use the host’s network interface in the same

manner.

Once the *Pod is started, the user can use the applications available in the com-

puting environment. When the user wants to leave the computer, they simply close

the *Pod viewer. The *Pod session is quickly checkpointed to the *Pod storage de-

vice, which can then be unplugged and carried around by the user. When the user is

ready to use another computer, they simply plug in the *Pod device and the session

restarts exactly where it was suspended. With a *Pod-based device, the user does not

need to manually launch applications and reload documents. The *Pod’s integrated

checkpoint/restart functionality maintains a user’s computing session persistently as

a user moves from one computer to another, even including ephemeral states such

as copy/paste buffers. If the host machine crashes, it takes down the current *Pod

session with it. But since a *Pod device does not provide its own operating system,

the user can simply plug it into a new host machine and start a fresh *Pod session. The

only data lost is that not committed to disk when the host machine crashes. In addi-

tion, the *Pod device’s file system is automatically backed up when connected to the

user’s primary computer. This enables quick recovery if the device is lost. The user

can replicate the file system on a new device and continue working.

3.3 Experimental Results

We have implemented four *Pod devices: WebPod [113], DeskPod [114], Media-

Pod [109] and GamePod [110]. Each *Pod device contains three components: a

simple viewer application for accessing the *Pod session, an unmodified XFree86 4.3

display server with THINC’s virtual display device driver, and a loadable kernel mod-

ule in Linux that requires no changes to the Linux kernel. The kernel module provides

the secure pod’s operating system virtualization layer and Zap’s process migration

mechanism. We present experimental results using our Linux prototype to quantify

the overhead of using the *Pod device on various applications.

Experiments were conducted on two IBM PC machines, each with a 933 MHz

Intel Pentium-III CPU and 512 MB RAM. The machines each had a 100 Mbps NIC

and were connected to one another via 100 Mbps Ethernet and a 3Com Superstack

II 3900 switch. Two machines were used as hosts for running the *Pod device and

the other was used as a web server for measuring web benchmark performance. To

demonstrate *Pod’s ability to operate across different operating system distributions,

each machine was configured with a different Linux distribution. The machines ran

both Debian 3.0 (“Woody”) and 3.1 (“Sarge”) with a Linux 2.4.18 kernel.

Device      File system size
WebPod      163 MB
GamePod     283 MB
DeskPod     418 MB
MediaPod    633 MB

Table 3.1 – Per-Device *Pod File System Sizes

Name        Description                                                  Linux
getpid      Average getpid runtime                                       350 ns
ioctl       Average runtime for the FIONREAD ioctl                       427 ns
semaphore   IPC semaphore variable is created and removed                1370 ns
fork-exit   Process forks and waits for child, which calls exit          44.7 µs
            immediately
fork-sh     Process forks and waits for child to run /bin/sh to run      3.89 ms
            a program that prints “hello world” then exits
iBench      Average time it takes to load a set of web pages             826 ms

Table 3.2 – Benchmark Descriptions

We used a 40 GB Apple iPod as the *Pod portable storage device, although a

much smaller USB memory drive would have sufficed. Both PCs used FireWire to

connect to the iPod. We built an unoptimized *Pod file system by bootstrapping a

Debian GNU/Linux installation onto the iPod and installing the appropriate set of

applications. In all cases, we included a simple KDE 2.2.2 environment. WebPod

additionally included the Konqueror 2.2.2 web browser. GamePod included Quake

2, Tetris and Solitaire. DeskPod added on top of WebPod the entire KDE Office

Suite, with all the desktop applications a user needs. Finally, MediaPod added on

top of DeskPod multiple media-related applications, including video, DVD and music

players, with their related codecs. We removed the extra packages needed to boot a

full Linux system, as *Pod is just a lightweight application environment, not a full

operating system. As can be seen in Table 3.1, the various *Pod devices we built

all have minimal storage requirements, enabling them to be stored on many portable

devices with ease. In addition, our unoptimized *Pod file systems could be even

smaller if the file system were built from scratch instead of by installing programs

and libraries as needed.

[Figure 3.1 – *Pod Virtualization Overhead. Normalized performance (axis from 0.2 to 1.6) of Plain Linux and *Pod for the getpid, ioctl, semaphore, fork-exit, fork-sh, and iBench benchmarks.]

To measure the cost of *Pod’s virtualization, we took a range of benchmarks that

represent various operations that occur in a normal application environment and

measured their performance on both our Linux *Pod prototype and a vanilla Linux

system. We used a set of micro-benchmarks that represent operations executed by real

applications as well as a real web browsing application benchmark. Table 3.2 shows

the 6 benchmarks we used along with their performance on a vanilla Linux system in

which all benchmarks were run from a local disk. These benchmarks were then run for

comparison purposes in the *Pod portable storage environment. To obtain accurate,

repeatable results, we rebooted the system between measurements. Additionally, the

system call micro-benchmarks directly used the TSC register available on Pentium

CPUs to record timestamps at the significant measurement events. Each timestamp’s

average cost was 58 ns. The files for the benchmarks were stored on the *Pod’s file

system. All of these benchmarks were performed in a *Pod environment running

on the PC machine running Debian Unstable with a Linux 2.4.18 kernel. Figure

3.1 shows the results of running our benchmarks under both configurations, with

the vanilla Linux configuration normalized to 1. A smaller number is better for all


benchmark results.

Figure 3.1 shows that *Pod virtualization overhead is small. *Pod incurs less

than 10% overhead for most of the micro-benchmarks and less than 4% overhead for

the iBench application workload. The overhead for the simple system call getpid

benchmark is only 7% compared to vanilla Linux, reflecting the fact that *Pod vir-

tualization for these kinds of system calls only requires an extra procedure call and a

hash table lookup. The most expensive benchmark for *Pod is semget+semctl, which

took 51% longer than vanilla Linux. The cost reflects the fact that our untuned *Pod

prototype needs to allocate memory and do a number of namespace translations.

Kernel semaphores are widely used by web browsers such as Mozilla and Konqueror

to perform synchronization. The ioctl benchmark also has high overhead because

of the 12 separate assignments *Pod performs to protect the call against malicious processes.

This is large relative to the simple FIONREAD ioctl, which just performs a simple

dereference. But because the protection is simple, it adds only 200 ns of overhead to any

ioctl. There is minimal overhead for functions such as fork and the fork/exec

combination. This is indicative of what happens when the web browser loads a plugin

such as Adobe Acrobat, where the web browser runs the acroread program in the

background.
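As a worked example of how the normalized values in Figure 3.1 relate to the absolute times in Table 3.2, the sketch below takes the figures quoted in the text at face value: a 427 ns FIONREAD ioctl with roughly 200 ns of added overhead, and a semaphore benchmark that takes 51% longer than its 1370 ns baseline. The helper names `normalized` and `absolute_ns` are illustrative only.

```python
def normalized(pod_ns: float, plain_ns: float) -> float:
    """Runtime under *Pod relative to vanilla Linux (vanilla = 1.0)."""
    return pod_ns / plain_ns


def absolute_ns(plain_ns: float, ratio: float) -> float:
    """Recover an absolute runtime from a baseline and a normalized value."""
    return plain_ns * ratio


PLAIN_IOCTL_NS = 427        # Table 3.2
IOCTL_OVERHEAD_NS = 200     # approximate overhead quoted in the text
PLAIN_SEMAPHORE_NS = 1370   # Table 3.2
SEMAPHORE_RATIO = 1.51      # "51% longer than vanilla Linux"

if __name__ == "__main__":
    ioctl_ratio = normalized(PLAIN_IOCTL_NS + IOCTL_OVERHEAD_NS, PLAIN_IOCTL_NS)
    print(f"ioctl normalized runtime: {ioctl_ratio:.2f}")
    sem_ns = absolute_ns(PLAIN_SEMAPHORE_NS, SEMAPHORE_RATIO)
    print(f"semaphore runtime under *Pod: {sem_ns:.0f} ns")
```

Under these assumptions the ioctl benchmark normalizes to about 1.47 and the semaphore benchmark corresponds to roughly 2069 ns. Large relative overheads on sub-microsecond operations therefore remain small in absolute terms.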

Figure 3.1 shows that *Pod has low virtualization overhead for real applications

as well as micro-benchmarks. This is illustrated by the performance on the iBench

benchmark, which is a modified version of the Web Text Page Load test from the

Ziff-Davis iBench 1.5 benchmark suite. It consists of a JavaScript-controlled load

of a set of web pages from the web benchmark server. iBench also uses JavaScript

to measure how long it takes to download and process each web page, then deter-

mines the average download time per page. The pages contain both text and bitmap

graphics, with pages varying in the proportions of text and graphics. The graphics

are embedded images in GIF and JPEG formats. Our results show that running the

iBench benchmark in the *Pod environment incurs no performance overhead versus

running in vanilla Linux from local SCSI storage.

[Figure 3.2 – *Pod Checkpoint/Restart vs. Normal Startup Latency. Time (s), on an axis from 0.1 to 100.0, for Checkpoint, Restart, and Plain startup across WebPod (1 Browser, 10 Browsers), DeskPod (Desktop), MediaPod (Totem, Ogle, mpg123), and GamePod (Solitaire, Tetris, Quake).]

To measure the cost of checkpointing and restarting *Pod sessions, as well as

demonstrate *Pod’s ability to improve the way a user works with various applica-

tions, we migrated multiple *Pod sessions containing different sets of applications.

For WebPod, we migrated multiple sessions containing different numbers of open

browser windows between the two machines described above. For DeskPod, we mi-

grated a session containing the KWrite word processor, the KSpread spreadsheet and

the Konqueror web browser, each displaying a document, in addition to a Konsole

terminal application. This is indicative of a regular desktop environment. For Media-

Pod, we migrated multiple sessions containing different sets of running desktop and

multimedia applications: first, a MediaPod using the Totem media player playing an


XviD encoded version of a DVD; second, a MediaPod using Ogle playing a straight

DVD image copied to the MediaPod; third, a MediaPod playing an mp3 file using

the mpg123 program.

Figure 3.2 shows how long it takes to checkpoint to disk and warm cache restart the

multiple *Pod sessions described above. We compared this to how long it would take

to warm cache startup each session independently. Figure 3.2 shows that, in general,

it is significantly faster to checkpoint and restart *Pod sessions than it is to start the

same kind of session from scratch. Checkpointing and restarting a *Pod, even with

many browser windows opened, takes under a second. A *Pod user can disconnect,

plug in to another machine, and start using their session again very quickly. Many

tasks have a large startup time, such as Ogle, which iterates through all the files on the

DVD image to determine if they have to be decrypted and calculates the decryption

key. Furthermore, these experiments were run across two different machines with two

different operating system environments, demonstrating that *Pod can indeed work

across different software environments.

In contrast, Figure 3.2 shows that starting the applications the traditional way is

much slower in all cases. For instance, starting a browsing session takes 12 seconds

when opening the browser windows with actual web content. Even starting a web

browsing session by opening a single browser window takes more than a second. In

the mpg123 case, the application appears to start faster than *Pod can restart it,

but this is not a direct comparison. For plain startup,

all we are doing is starting the small 136 KB mpg123 application, while for *Pod

restart, we are restarting the entire KDE desktop environment as well. It should

be noted that *Pod’s approach to restarting applications is fundamentally different

than plain restarting, as *Pod returns the application sessions to where they were

executing when they were suspended. For instance, a restarted MediaPod session


will continue playing the file from where it was, while a WebPod session will show the

web browser’s content, even if the content on the web has changed in the meantime.

Similarly, restarting a *Pod device requires restarting all applications associated with

it, including the desktop environment, as opposed to starting a plain application that

uses the desktop environment that is already running.

Device      Session      Checkpoint size
WebPod      1 Web        25 MB
WebPod      10 Web       46 MB
DeskPod     Desktop      50 MB
MediaPod    Totem        44 MB
MediaPod    Ogle         27 MB
MediaPod    mpg123       17 MB
GamePod     Solitaire    44 MB
GamePod     Tetris       22 MB
GamePod     Quake        50 MB

Table 3.3 – *Pod Checkpoint Sizes

Table 3.3 shows the amount of storage needed to store the checkpointed sessions

using *Pod for each of the *Pod devices and sessions described. The results reported

show checkpointed image sizes without applying any compression techniques to reduce

the image size. These results show that the checkpointed state that needs to be saved

is very modest and easy to store on any portable storage device. Given the modest

size of the checkpointed images, there is no need for any additional compression,

which would reduce the minimal storage demands, but add additional latency due

to the need to compress and decompress the checkpointed images. The checkpointed

image size in all cases was 50 MB or less.

3.4 Related Work

*Pod builds upon our previous work on MobiDesk [26], which provides a hosted desk-

top infrastructure that improves management by enabling the desktop sessions to be

migrated between the back-end infrastructure machines. *Pod differs from MobiDesk

in two fundamental ways. First, it builds upon MobiDesk by coupling its compute ses-

sion migration with portable storage to improve users’ mobility. Second, MobiDesk is

limited to a single administrative domain. Unlike *Pod devices, which can be moved


between machines managed by different users and organizations, MobiDesk sessions

can only exist within a single organization and therefore do not require the secure

operating system virtualization abstraction.

Given the ubiquity of web browsers on modern computers, many traditional ap-

plications are becoming web-enabled, allowing the mobile user to use them from any

computer. Common applications such as email [2, 5], instant messaging [20], and

even word processing and spreadsheet applications [3] have been ported to a web

services environment that is usable within a simple web browser. The advantage of

this approach is that users effectively store their data on centrally managed servers

accessible from any networked computer.

But even the web user relies on various applications, such as Adobe Acrobat

Reader, to be available on whatever computer they are using at the moment. If the

application is already installed on the host, the web browser can use it, but otherwise,

the user is unable to complete the task at hand. Some web-based applications have

been created to fill these gaps, such as one that converts PDF files to simple image files

viewable from any web browser. These approaches, however, are application-specific

and often quite limited. For instance, converting PDF files to simple image files cuts

out useful features of the native application, such as the ability to search the PDF.

Similarly, items like cookies and bookmarks allow a user to work more efficiently, but

do not travel with the user as they move between web browsers on different machines.

Another solution that solves many of the above problems is the use of thin-client

computing [14, 27, 51]. The thin-client approach provides several significant advan-

tages over traditional desktop computing. Clients can be essentially stateless appli-

ances that do not need to be backed up or restored, require almost no maintenance

or upgrades, and do not store any sensitive data that can be lost or stolen. Server

resources can be physically secured in protected data centers and centrally adminis-


tered, with all the attendant benefits of easier maintenance and cheaper upgrades.

Computing resources can be consolidated and shared across many users, resulting

in more effective utilization of computing hardware. Moreover, the ability of thin

clients to decouple display from application execution over a network offers a myr-

iad of other benefits, including graphical remote access to a persistent session from

anywhere, screen sharing for remote collaboration, and instant technical support.

A number of solutions resembling the thin-client approach have sprung up in the

past. The model has come and gone many times, however, whether in mainframe

dumb terminals, X terminals, or network computers, without being able to displace

the desktop computer. No matter how fast the network connection is, the connection

between the computer and the local video device will be significantly faster. For exam-

ple, one would need gigabit ethernet to transfer a decoded DVD across the network,

while even 10-year-old PCs have enough video bandwidth to do this. Although giga-

bit ethernet is becoming more common today, we are transitioning to high-definition

video streams which require many times more bandwidth. Similarly, many applica-

tions, especially 3D-oriented ones, need to transfer large amounts of data quickly, and

have been shown to use as much bandwidth as possible, as they are what is pushing

the state of the art in hardware graphic devices.

The emergence of cheap, portable storage devices has led to the development of

web browsers for USB drives, including Stealth Surfer [10] and Portable Firefox [8].

These approaches only provide the ability to run a web browser on a USB drive.

Unlike *Pod, they do not provide a complete application environment. The various

programs and plugins that make the user’s experience more comfortable do not work

within this environment. The U3 platform [12] has attempted to provide a standard

way to enable applications to store data and launch applications. But it has not gained

any traction in the marketplace and, unlike *Pod, does not address mobile users’ need


for persistent application sessions that can be easily moved between locations.

Systems like SoulPad [37] and the Collective [41] provide a solution similar to

*Pod, but are based on using a bootable Linux distribution like Knoppix and VMware [142]

on a USB drive. For these systems, Knoppix provides a Linux operating system that

can boot from a USB drive for certain hardware platforms. VMware provides a vir-

tual machine monitor (VMM) that enables an entire operating system environment

and its applications to be suspended and resumed from disk. They are designed to

take over the host computer they are plugged into by booting their own operating

system. They then launch a VMware VM that runs the migratable operating sys-

tem environment. Unlike *Pod, they do not rely on any software installed on the

host. However, they require minutes to start up given the need to boot and configure

an entire operating system for the specific host being used. *Pod does not need to

provide an entire operating system instance for the virtual machine to run, and so

is much more lightweight. *Pod requires less storage, so it can operate on smaller

USB drives, and does not require rebooting the host into another operating system,

so it starts up much faster. However, unlike these systems, *Pod is limited to the

same operating system interface as the host machine, and requires a secure operating

system virtualization layer to be written for every operating system it is to be used

with.

Moka5 [94] attempts to optimize the management and distribution of these portable

hardware virtual machine-based devices by storing the virtual machine on the network

and only requiring the user to carry a small cache storage device. This cache storage

device provides a base host operating system and the ability to page in the necessary

parts of the virtual machine on demand. Whereas today’s storage devices can easily

hold the entire virtual machine, this cache architecture improves management. It

allows the virtual machines to be upgraded on the server by a central administrator,


with updates pulled into the cache when the machine is rebooted.

In general, providing virtualization and checkpoint/restart capabilities using a

VMM such as VMware represents an interesting alternative to the *Pod operating

system virtualization approach. VMMs virtualize the underlying machine hardware

while *Pod virtualizes the operating system. VMMs can checkpoint and restart an

entire operating system environment. However, unlike *Pod, VMMs cannot check-

point and restart applications without also checkpointing and restarting the operat-

ing system. *Pod virtualization operates at a finer granularity than virtual machine

approaches by virtualizing individual sessions instead of complete operating system

environments. Using VMMs can be more space- and time-intensive because the op-

erating system must be included on the portable storage device.

Chapter 4

AutoPod: Reducing Downtime for

System Maintenance

A key problem many organizations face is keeping their computer services available

while the underlying machines are maintained. These services run on increasingly

networked computers, which are frequent targets of attacks that attempt to exploit

vulnerable software they could be running. To prevent these attacks from succeed-

ing, software vendors frequently release patches to address security and maintenance

issues. But for these patches to be effective, they must be applied to the machines.

System maintenance, however, commonly results in a system service being un-

available. For example, patching an operating system may mean that the whole

system is down for a length of time. If system administrators fix an operating sys-

tem security problem immediately, they risk upsetting their users because of loss of

data. If the underlying hardware has to be replaced, the machine will have to be shut

down. The system administrators must schedule downtime in advance and in coop-

eration with users, leaving the computer vulnerable until repaired. If the operating

system is patched successfully, downtime may be limited to just a few minutes during

the reboot. Even then, users incur additional inconvenience and delays in starting

applications again and attempting to restore their sessions. If the patch is not success-

ful, downtime can extend for many hours while the problem is diagnosed and solved.

Downtime due to security and maintenance problems is costly as well as inconvenient.

Therefore, it is not uncommon for systems to continue running unpatched software

long after a security exploit is well-known [123].

To address these problems, we have designed and built AutoPod, a system that

provides an easy-to-use autonomic infrastructure [77] for operating system self-maintenance.

AutoPod is unique because it enables unscheduled operating system updates

of commodity operating systems while preserving application service availability dur-

ing system maintenance. AutoPod functions without modifying, recompiling, or re-

linking applications or operating system kernels. We have done this by combining

three key mechanisms: a lightweight operating system virtualization isolation ab-

straction that can be used at the level of individual applications, a checkpoint/restart

mechanism that operates across operating system versions with different security and

maintenance patches, and an autonomic system status service that monitors the sys-

tem for system faults and security updates.

AutoPod combines *Pod’s secure pod abstraction with a novel checkpoint/restart

mechanism that uniquely decouples processes from the underlying system and main-

tains process state semantics, allowing processes to migrate across different machines

with different operating system versions. The checkpoint/restart mechanism intro-

duces a platform-independent intermediate format for saving the states associated

with processes and AutoPod virtualization. AutoPod combines this format with

higher-level functions for saving and restoring process states to yield a degree of porta-

bility impossible with previous approaches. This checkpoint/restart mechanism relies

on the same kind of operating system semantics that allow applications to function


correctly across operating system versions with different security and maintenance

patches.

AutoPod combines these mechanisms with an autonomic system status service.

The service monitors the system for faults and security updates. When the service

detects new security updates, it downloads and installs them automatically. If the

update requires a reboot, the service uses AutoPod’s checkpoint/restart capability to

save the AutoPod’s state, reboot the machine into the newly repaired environment,

and restart the processes within the AutoPod without data loss. This permits fast

recovery from downtime even when other machines are not available to run appli-

cation services. Alternatively, if another machine is available, the AutoPod can be

migrated to the new machine while the original machine is maintained and rebooted,

further reducing application service downtime. This allows security patches to be

applied to operating systems in a timely manner with minimal impact on application

service availability. Once the original machine is updated, applications can continue

to execute even though the underlying operating system has changed. Similarly, if

the service detects an imminent system fault, AutoPod can checkpoint the processes,

migrate, and restart them on a new machine before the fault causes their execution

to fail.
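The behavior of the system status service described above can be sketched as a small planning function. This is a minimal sketch under stated assumptions, not AutoPod's implementation: the event names, step names, and the `plan` function are hypothetical, and the case of an imminent fault with no second machine is left unhandled because the text does not describe it.

```python
from enum import Enum


class Event(Enum):
    SECURITY_UPDATE = "security_update"
    IMMINENT_FAULT = "imminent_fault"


def plan(event: Event, needs_reboot: bool = False, spare_host: bool = False) -> list:
    """Return the ordered steps the status service would take (illustrative)."""
    if event is Event.SECURITY_UPDATE:
        steps = ["download_update", "install_update"]
        if not needs_reboot:
            return steps
        if spare_host:
            # Keep the service running elsewhere while the original reboots.
            return steps + ["checkpoint", "migrate_to_spare",
                            "restart_on_spare", "reboot_original"]
        # No second machine: save state, reboot in place, then resume.
        return steps + ["checkpoint", "reboot", "restart"]
    if event is Event.IMMINENT_FAULT:
        if not spare_host:
            raise ValueError("fault handling without a spare host is not described")
        return ["checkpoint", "migrate_to_spare", "restart_on_spare"]
    raise ValueError(f"unknown event: {event!r}")


if __name__ == "__main__":
    print(plan(Event.SECURITY_UPDATE, needs_reboot=True))
    print(plan(Event.SECURITY_UPDATE, needs_reboot=True, spare_host=True))
    print(plan(Event.IMMINENT_FAULT, spare_host=True))
```

In every path that interrupts the host, the checkpoint step comes before the reboot or migration, which is what allows processes to resume without data loss.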

4.1 AutoPod Architecture

The AutoPod architecture is based on *Pod’s secure pod abstraction. As shown in

Figure 4.1, AutoPod permits server consolidation by allowing multiple pods to run on

a single machine while enabling automatic machine status monitoring. As each pod

provides a complete secure virtual machine abstraction, it is able to run any server

application that would run on a regular machine. By consolidating multiple machines

into distinct pods running on a single server, the administrator has fewer physical

hardware and operating system instances to manage. Similarly, when kernel security

holes are discovered, server consolidation minimizes the number of machines to be

upgraded and rebooted. The AutoPod system monitor further improves manageability

by constantly monitoring the host system for stability and security problems.

[Figure 4.1 – AutoPod Model. Components shown: Pod A, Pod B, AutoPod System Monitor, AutoPod Virtualization Layer, Host Operating System, Host Hardware.]

By leveraging the secure pod abstraction, AutoPod is able to securely isolate

multiple independent services running on a single machine. Operating system virtu-

alization restricts what operating system resources are accessible to processes within

it simply by not providing identifiers to certain resources within its namespace. An

AutoPod can then be constructed to provide access only to resources needed for its

service. An administrator configures the AutoPod in the same way a regular machine

is configured and installs applications within it. The secure pod abstraction en-


forces secure isolation to prevent exploited services from attacking the host or other

services on it. Similarly, secure isolation allows running multiple services from dif-

ferent organizations, with different sets of users and administrators on a single host,

while retaining the semantics of multiple distinct and individually managed machines.

Multiple services that previously ran on multiple machines can now run on a single

machine.

For example, a web server pod is easily configured to contain only the files the web

server needs to run and the content it is to serve. The web server pod could have its

own IP address, decoupling its network presence from that of the underlying system.

Using a firewall, the pod’s network access is limited to client-initiated connections.

Connections to the pod’s IP address are limited to the ports served by the application

running within this pod. If multiple isolated web servers are required, multiple pods

can be set up on a single machine. If one web server application is compromised, its

pod limits further harm to the system, because the only resources the compromised

pod can access are those explicitly needed by its service. Because this web server

pod does not need to initiate connections to other hosts, it is easy to firewall it

to prevent it from directly initiating connections to other systems. This limits an

attacker’s ability to use the exploited service as a launching point for other attacks.

Furthermore, there is no need to disable other network services commonly enabled by

the operating system to guard against the compromised pod because those services,

and the operating system itself, reside outside the pod’s context.
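The firewall policy for the web server pod can be expressed as a simple predicate over connection attempts. The sketch below is illustrative only: the pod address (taken from the RFC 5737 documentation range), the served ports, and the names `Conn` and `permitted` are assumptions, and a real deployment would express the same rules in the host's packet filter.

```python
from dataclasses import dataclass

POD_IP = "192.0.2.10"        # hypothetical pod address (documentation range)
SERVED_PORTS = {80, 443}     # assumed ports served by the web server pod


@dataclass(frozen=True)
class Conn:
    """A connection attempt, described by who initiates it and its target."""
    src_ip: str
    dst_ip: str
    dst_port: int


def permitted(conn: Conn) -> bool:
    """Apply the web server pod's firewall policy to a new connection."""
    if conn.src_ip == POD_IP:
        return False                          # the pod may not initiate connections
    if conn.dst_ip == POD_IP:
        return conn.dst_port in SERVED_PORTS  # only the served ports are reachable
    return True                               # traffic unrelated to this pod


if __name__ == "__main__":
    print(permitted(Conn("198.51.100.7", POD_IP, 80)))   # client request
    print(permitted(Conn("198.51.100.7", POD_IP, 22)))   # unserved port
    print(permitted(Conn(POD_IP, "198.51.100.7", 25)))   # pod-initiated
```

Blocking pod-initiated connections is what limits a compromised web server's usefulness as a launching point for further attacks.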

4.2 Migration Across Different Kernels

AutoPod complements the secure pod virtualization abstraction with a cross-kernel

checkpoint/restart system that improves the mobility of services within a data cen-


ter. Checkpoint/restart provides the glue that permits a pod to checkpoint services,

migrate the state to a new machine, and restart them across other computers with

different hardware and operating system kernels. AutoPod’s migration is limited to

machines with a common CPU architecture, and that run “compatible” operating

systems. Compatibility is determined by the extent to which they differ in their

API and internal semantics. Minor versions are normally limited to maintenance

and security patches, without affecting the kernel’s API. Major versions carry sig-

nificant changes, like modifying the application’s execution semantics or introducing

new functionality, that may break application compatibility. Nevertheless, they are

usually backward compatible. For instance, the Linux kernel has two major versions,

2.4 and 2.6, with over 30 minor versions each. Linux 2.6 significantly differs from

2.4 in how threads behave, and also introduces various new system calls. This im-

plies that migration across minor versions is generally not restricted, but migration

between major versions is only feasible from older to newer.
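
The compatibility rule above can be summarized in a short sketch. The Kernel type, its fields, and the can_migrate function are hypothetical names introduced only for illustration; the rule is deliberately simplified and ignores conversion filters and emulation layers, which are discussed below.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Kernel:
    arch: str     # CPU architecture, e.g. "i386"
    major: tuple  # major version series, e.g. (2, 4)
    minor: int    # minor version within that series, e.g. 18


def can_migrate(src: Kernel, dst: Kernel) -> bool:
    """Simplified rule: same architecture, any minor version,
    and across major versions only from older to newer."""
    if src.arch != dst.arch:
        return False
    # Tuples compare element by element, so (2, 4) <= (2, 6) holds.
    return src.major <= dst.major
```

Under these assumptions, a pod on a 2.4.5 kernel can move to 2.4.18 or to a 2.6 kernel on the same architecture, but a pod on 2.6 cannot move back to 2.4.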

To support migration across different kernels, AutoPod’s checkpoint/restart mechanism

employs three key design principles: storing operating system state in an adequate

abstract representation, converting between the abstract representation and operat-

ing system-specific state using specialized filters, and using well-established native

kernel interfaces to access and alter the state.

AutoPod’s checkpoint/restart mechanism relies on an intermediate abstract for-

mat to represent the state to be saved. While the low-level details maintained by

the operating system may change radically between different kernels, the high-level

properties are unlikely to change since they reflect the actual semantics upon which

the application relies. AutoPod describes the state of a process in terms of this

higher-level semantic information rather than kernel-specific data. To illustrate this,

consider the data that describe inter-process relationships, e.g., parent, child, siblings


and threads. The operating system normally optimizes for speed by keeping multiple

data structures to reflect these relationships. But this format has limited portability

across different kernels; in Linux, the exact technique did indeed change between 2.4

and 2.6. Instead, AutoPod uses a tree structure to capture a high-level representation

of the relationships, mirroring its semantics. The same holds for other resources, e.g.,

communication sockets, pipes, open files and system timers. AutoPod extracts the

relevant state the way it is encapsulated in the operating system’s API, rather than

in the details of its implementation. Doing so maximizes portability across kernel

versions by adopting properties that are considered highly stable.
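
As a minimal sketch of this idea, the following turns flat per-process records into a nested tree that records only who is whose child. The record layout (pid, parent pid, name) and the function name are assumptions made for illustration; they are not AutoPod's actual checkpoint format.

```python
def build_process_tree(records):
    """Turn flat (pid, parent_pid, name) records into a nested tree.

    Assumes exactly one process whose parent lies outside the pod;
    that process becomes the root of the tree.
    """
    nodes = {pid: {"pid": pid, "name": name, "children": []}
             for pid, _ppid, name in records}
    root = None
    for pid, ppid, _name in records:
        if ppid in nodes:
            nodes[ppid]["children"].append(nodes[pid])
        else:
            root = nodes[pid]  # parent is not part of the pod
    return root


records = [(1, 0, "init"), (2, 1, "exim4"), (3, 1, "sshd"), (4, 3, "bash")]
tree = build_process_tree(records)
```

The resulting tree carries no kernel-specific pointers or list linkage, only the parent/child semantics that an application can observe.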

To accommodate differences in semantics that inevitably occur between kernel

versions, AutoPod uses specialized conversion filters. The checkpointed state data

is saved and restored as a stream. The conversion filters manipulate the contents

of this stream. Although typically they are designed to translate between different

representations, they can be used to perform other operations such as compression

and encryption. Their main advantages are extreme flexibility and being executed

like regular helper applications. Building on the example above, because the thread

model changes between Linux 2.4 and 2.6, a filter can easily be designed to make the

former abstract data adhere to the new semantics. Additional filters can be built if

semantic changes occur in the future. This is a very robust and powerful solution.
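
The filter idea can be sketched as a chain of functions applied to the checkpoint stream. Everything specific here is assumed for illustration: the JSON-lines record format, the field names thread_group and tgid, and the helper names. The sketch also runs the filters in-process on a complete buffer, whereas the text describes filters that run as separate helper applications on a stream.

```python
import bz2
import json


def rename_field(old, new):
    """Build a filter that renames one field in every record.

    This stands in for a real semantic conversion between kernel versions.
    """
    def apply(stream: bytes) -> bytes:
        out = []
        for line in stream.splitlines():
            record = json.loads(line)
            if old in record:
                record[new] = record.pop(old)
            out.append(json.dumps(record, sort_keys=True))
        return "\n".join(out).encode()
    return apply


def run_filters(stream: bytes, filters) -> bytes:
    """Pass the checkpoint stream through each filter in order."""
    for f in filters:
        stream = f(stream)
    return stream


checkpoint = b'{"pid": 7, "thread_group": 7}\n{"pid": 8, "thread_group": 7}'
image = run_filters(checkpoint,
                    [rename_field("thread_group", "tgid"), bz2.compress])
restored = bz2.decompress(image)
```

Because every filter has the same bytes-in, bytes-out shape, a translation filter and a compression filter compose without either knowing about the other.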

AutoPod leverages high-level native kernel services in order to transform the in-

termediate representation of the checkpointed image into the complete internal state

required by the target kernel during restart. Continuing with the previous example,

AutoPod restores the structure of the process tree by exploiting the native fork sys-

tem call. According to the abstract process tree data, a sequence of fork calls is

issued to replicate the original relationships. This avoids dealing with any internal

kernel details. Moreover, high-level primitives of this sort remain virtually unchanged


across minor or major kernel changes. Finally, these services are available for use by

loadable kernel modules, enabling AutoPod to perform cross-kernel migration without

requiring modifications to the kernel.
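
The fork-based reconstruction can be illustrated at user level. The sketch below walks an abstract tree and issues one fork per child, so the kernel builds its own internal process relationships. The tree format and function names are hypothetical, node names are assumed to be unique and free of spaces, and AutoPod itself performs the equivalent work from a kernel module rather than from a Python script.

```python
import os


def _spawn(node, out_fd):
    """Run in a freshly forked process: report identity, then fork children."""
    os.write(out_fd, f"{node['name']} {os.getpid()} {os.getppid()}\n".encode())
    child_pids = []
    for child in node["children"]:
        pid = os.fork()
        if pid == 0:  # in the child: recurse, then always exit
            try:
                _spawn(child, out_fd)
            finally:
                os._exit(0)
        child_pids.append(pid)
    for pid in child_pids:  # reap children so no zombies remain
        os.waitpid(pid, 0)


def restore_process_tree(tree):
    """Recreate `tree` with fork() and return {name: (pid, parent_pid)}."""
    read_fd, write_fd = os.pipe()
    root_pid = os.fork()
    if root_pid == 0:
        os.close(read_fd)
        try:
            _spawn(tree, write_fd)
        finally:
            os._exit(0)
    os.close(write_fd)
    data = b""
    while True:  # read until every forked process has exited
        chunk = os.read(read_fd, 4096)
        if not chunk:
            break
        data += chunk
    os.close(read_fd)
    os.waitpid(root_pid, 0)
    result = {}
    for line in data.decode().splitlines():
        name, pid, ppid = line.split()
        result[name] = (int(pid), int(ppid))
    return result


tree = {"name": "init", "children": [
    {"name": "exim4", "children": []},
    {"name": "sshd", "children": [{"name": "bash", "children": []}]},
]}
restored = restore_process_tree(tree)
```

Each restored process reports the parent that the kernel assigned to it, which matches the parent recorded in the abstract tree even though no kernel data structure was written directly.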

To eliminate possible dependencies on low-level kernel details, AutoPod’s check-

point/restart mechanism requires processes to be suspended before being check-

pointed. Suspending processes creates a quiescent state necessary to guarantee the

correctness of the checkpointed image, and substantially reduces the amount of in-

formation that needs to be saved by avoiding transient data. For example, consider

a checkpoint started while one of the processes is executing the exit system call. It

would take tremendous effort and detail to ensure a proper and consistent capture of

such a transient state. Instead, by first suspending all processes, such ongoing activ-

ities are either completed or interrupted. AutoPod uses this property to guarantee a

consistent and static state during the checkpoint.
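
The suspend-then-checkpoint ordering can be sketched at user level with job-control signals. SIGSTOP and SIGCONT are used here only as a stand-in; the text does not state which mechanism AutoPod uses to suspend processes, and the helper names are hypothetical.

```python
import os
import signal
import subprocess
import sys


def suspend(pid):
    """Stop a child process and wait until the kernel reports it as stopped."""
    os.kill(pid, signal.SIGSTOP)
    _, status = os.waitpid(pid, os.WUNTRACED)
    return os.WIFSTOPPED(status)


def resume(pid):
    """Let a previously stopped process continue."""
    os.kill(pid, signal.SIGCONT)


child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
stopped = suspend(child.pid)
# A checkpoint of the now-quiescent process would be taken here.
resume(child.pid)
child.kill()   # clean up the demonstration process
child.wait()
```

Waiting for the stop notification before proceeding is the important step: it ensures the process is no longer in the middle of an operation when its state is read.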

Finally, we must ensure that changes in system call interfaces are properly handled.

AutoPod has a virtualization layer that employs system call interposition to maintain

namespace consistency. It follows that a change in the semantics for any system call

intercepted could raise an issue in migrating across such differences. Fortunately,

such changes are rare, and when they occur, they are hidden by standard libraries

from the application level lest they break the applications. Consequently, AutoPod is

protected just as legacy applications are protected. On the other hand, the addition

of new system calls to the kernel requires that the encapsulation be extended to

support them. Moreover, it restricts migration back to older versions. For instance, an

application that invokes the new waitid system call in Linux 2.6 cannot be migrated

back to 2.4 unless an emulation layer exists there.

AutoPod uses two techniques to save and restore device-specific states, depending

on the device class. Some devices provide standard interfaces for applications to read


and set their state. Common sound cards and the Intel MMX processor extensions

are two notable examples. With these, it is easy to query the device for its state

prior to checkpoint and to reestablish it during restart.

However, many device drivers maintain internal state that is practically inacces-

sible from the outside. AutoPod ensures that processes within its session only have

access to such devices through the virtual device drivers provided by the AutoPod

device. This makes it simple to checkpoint the device-specific data associated with

the processes. For instance, the AutoPod display system is built using its own virtual

display device driver which is not tied to any specific hardware device, and keeps its

entire state in regular memory. As a result, its state can be readily checkpointed

as a simple matter of saving that process similarly to others. After the process is

restarted, the AutoPod viewer on the host reconnects to the virtual display driver to

display the complete session.

4.3 Autonomic System Status Service

AutoPod provides a generic autonomic framework for managing system state. The

framework can monitor multiple sources for information and use this information to

make autonomic decisions about when to checkpoint pods, migrate them to other

machines, and restart them. Although there are many items that can be monitored,

our service monitors two in particular. First, it monitors the vendor’s software security

update repository to ensure that the system stays up to date with the latest security

patches. Second, it monitors the underlying hardware of the system to ensure that an

imminent fault is detected before the fault occurs and corrupts application state. By

monitoring these two sets of information, the autonomic system status service is able

to reboot or shut down the computer while checkpointing or migrating the processes.


This helps to ensure that data is not lost or corrupted because of a forced reboot or

hardware fault propagating into the running processes.

Many operating system vendors enable users to automatically check for and install

system updates. Examples of these include Microsoft’s Windows Update service

and Debian’s security repositories. These updates are guaranteed genuine through

cryptographically signed hashes that verify that the contents come from the vendors.

But some of these updates require reboots. In the case of Debian GNU/Linux, this

is limited to kernel upgrades. We provide a simple service that monitors security

update repositories. The autonomic service downloads all security updates and uses

AutoPod’s checkpoint/restart mechanism to enable the updates that need reboots

without disrupting running applications and causing them to lose state.

Commodity systems also provide information about the current state of the system

that can indicate if the system has an imminent failure on its hands. Subsystems, such

as a hard disk’s Self-Monitoring Analysis Reporting Technology (S.M.A.R.T.) [46] let

an autonomic service monitor the system’s hardware state. S.M.A.R.T. provides diag-

nostic information, such as temperature and read/write error rates, on the hard drives

in the system that can indicate if the hard disk is nearing failure. Many commodity

computer motherboards also have the ability to measure CPU and case temperature,

as well as the speeds of the fans that regulate those temperatures. If the temperature in

the machine rises too high, hardware in the machine can fail catastrophically. Simi-

larly, if the fans fail and stop spinning, the temperature will likely rise out of control.

Our autonomic service monitors these sensors. If it detects an imminent failure, it

will attempt to migrate a pod to a cooler system, and shut down the machine to

prevent the hardware from being destroyed.

Many administrators use an uninterruptible power supply (UPS) to avoid data loss or

corruption during a power loss. Although one can shut down a computer when the


battery backup runs low, most applications are not written to save their data in the

presence of a forced shutdown. AutoPod, on the other hand, monitors UPS status.

If the battery backup becomes low, it can quickly checkpoint the pod’s state to avoid

any data loss when the computer is forced to shut down.

Similarly, the operating system kernel on the machine monitors the state of the

system, and if irregularities occur, such as DMA timeouts or resetting the IDE bus, it

logs them. Our autonomic service monitors the kernel logs to discover these irregular

conditions. When the hardware monitoring systems or the kernel logs provide infor-

mation about possible pending system failures, the autonomic service checkpoints the

pods running on the system and migrates them to a new system to be restarted. This

ensures that state is not lost and informs administrators that maintenance is needed.
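
A minimal sketch of the log-monitoring step follows. The regular expressions and sample messages are illustrative assumptions only; real kernel messages vary by driver and kernel version, and the text does not list the patterns the service matches.

```python
import re

# Illustrative patterns only; not the patterns used by the actual service.
FAILURE_PATTERNS = [
    re.compile(r"dma.*timeout", re.IGNORECASE),
    re.compile(r"ide\d+: reset", re.IGNORECASE),
]


def suspicious_lines(log_lines):
    """Return the log lines that match any known failure pattern."""
    return [line for line in log_lines
            if any(p.search(line) for p in FAILURE_PATTERNS)]


def should_evacuate(log_lines, threshold=1):
    """Decide whether enough irregularities were seen to migrate the pods."""
    return len(suspicious_lines(log_lines)) >= threshold


sample = [
    "hda: DMA timeout error",
    "ide0: reset: success",
    "eth0: link up",
]
```

The threshold parameter is a hypothetical knob: a single DMA timeout may be transient, so a deployment might require several matches before it checkpoints and migrates.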

Many policies can be implemented to determine to which system a pod should be

migrated when a machine needs maintenance. Our autonomic service allows a pod to

be migrated within a specified set of clustered machines. The autonomic service gets

reports at regular intervals from the other machines’ autonomic services that report

each machine’s load. If the autonomic service decides that it must migrate a pod, it

chooses the machine in its cluster with the lightest load.
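
The lightest-load policy is simple enough to state directly. In this sketch the host names, the load values, and the function name are hypothetical, and a load report is assumed to be a single number per machine.

```python
def pick_target(load_reports, exclude):
    """Return the least-loaded machine other than `exclude`, or None."""
    candidates = {host: load for host, load in load_reports.items()
                  if host != exclude}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


reports = {"node-a": 0.9, "node-b": 0.2, "node-c": 0.5}
target = pick_target(reports, exclude="node-a")
```

Excluding the machine that needs maintenance matters: it may itself report a light load, for example after its services have already started to fail.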

4.4 AutoPod Examples

We give two brief examples to illustrate how AutoPod can be used to improve appli-

cation availability for system services such as email delivery and desktop computing.

In both cases we describe the architecture of the system and show how it can be

run within AutoPod, enabling administrators to reduce downtime in the face of ma-

chine maintenance. We also discuss how a system administrator can set up and use

AutoPod.


4.4.1 System Services

Administrators like to run many services on a single machine. By doing this, they are

able to benefit from improved machine utilization, but this gives each service access

to many resources not needed to perform its job. A classic example of this is email

delivery. Email delivery services such as Exim and Sendmail are often run on the

same system as other Internet services to improve resource utilization and simplify

system administration through server consolidation. But these services, Sendmail

in particular, have been exploited many times because they have access to system

resources, such as a shell program, that they do not need to perform their job.

AutoPod can isolate email delivery to provide a significantly

higher level of security in light of the many attacks on mail transfer agents. Consider

isolating an Exim service installation, the default Debian mail transfer agent. Using

AutoPod, Exim can execute in a resource-restricted pod that isolates email delivery

from other services on the system. Since AutoPod allows migrating a service between

machines, the email delivery pod is migratable. If a fault is discovered in the under-

lying host machine, the email delivery service can be moved to another system while

the original host is patched, keeping the email service available.

With this email delivery example, a simple system configuration can prevent the

common buffer overflow exploit of getting the privileged server to execute a local shell.

By simply removing shells from within the Exim pod, we are limiting the amateur

attacker’s ability to exploit flaws, while requiring very little additional knowledge

about how to configure the service. AutoPod can further automatically monitor

system status and checkpoint the Exim pod if a fault is detected to ensure that no

data is lost or corrupted. Similarly, in the event that a machine has to be rebooted,

the service can automatically be migrated to a new machine to avoid downtime.


A common problem system administrators face is that forced machine downtime,

e.g., for reboots, can make a service unavailable. A common way to avoid this is to

throw multiple machines at the problem. By providing the service through a cluster

of machines, system administrators can upgrade the individual machines in a rolling

manner. This enables system administrators to upgrade the systems while keeping

the service available. But more machines increase management complexity and cost.

AutoPod, in conjunction with hardware virtual machine monitors, improves this

situation immensely. Using a virtual machine monitor to provide two virtual ma-

chines on a single host, AutoPod can then run a pod within a virtual machine to

enable a single node maintenance scenario that decreases costs as well as manage-

ment complexity. During regular operation, all application services run within the

pod on one virtual machine. To upgrade the operating system on the running virtual

machine, bring the second virtual machine online and migrate the pod to the new

virtual machine. Once the initial virtual machine is upgraded and rebooted, migrate

the pod back to it. Only one physical machine is needed, reducing costs. Only one

virtual machine is in use for the majority of the time, reducing management com-

plexity. Because AutoPod runs unmodified applications, any application service that

can be installed can take advantage of its general single node maintenance.

4.4.2 Desktop Computing

As personal computers have become ubiquitous in large corporate, government, and

academic organizations, the cost of owning and maintaining them is growing un-

manageable. These computers are increasingly networked, which only complicates

matters. They must be constantly patched and upgraded to protect them and their

data from the myriad of viruses and other attacks commonplace on today’s networks.


To solve this problem, many organizations have turned to thin-client solutions such

as Microsoft’s Windows Terminal Services and Sun’s Sun Ray. Thin clients allow ad-

ministrators to centralize many of their administrative duties because only a single

computer or cluster of computers needs to be maintained in a central location, while

stateless client devices are used to access users’ desktop computing environments. Although

thin-client solutions lower some administrative costs, this comes at the expense of the

semantics that users normally expect from a private desktop. For instance, users who

use their own private desktop expect to be isolated from their coworkers. However,

in a shared thin-client environment, users share the same machine. There may be

many shared files, and a user’s computing behavior can impact the performance of

other users on the system.

Although a thin-client environment minimizes the number of machines, the cen-

tralized servers still need to be administered, and since they are more heavily utilized,

management becomes more difficult. For instance, on a private system, one only has

to schedule system maintenance with a single user. However, in a thin-client envi-

ronment, one has to schedule maintenance with all the users on the system to avoid

data loss.

AutoPod enables system administrators to solve these problems by allowing each

user to run a desktop session within a pod. Instead of users sharing a single file

system, AutoPod provides each pod with three file systems: a shared read-only file

system of all the regular system files users expect in their desktop environments,

a private writeable file system for a user’s persistent data, and a private writeable

file system for a user’s temporary data. By sharing common system files, AutoPod

provides centralization benefits that simplify system administration. By providing

private writeable file systems for each pod, AutoPod provides each user with privacy

benefits similar to a private machine.


Coupling pod virtualization and isolation mechanisms with a migration mecha-

nism can provide scalable computing resources for the desktop and improve desktop

availability. If a user needs access to more computing resources, for instance while do-

ing complex mathematical computations, AutoPod can migrate that user’s session to

a more powerful machine. If maintenance needs to be done on a host machine, Auto-

Pod can migrate the desktop sessions to other machines without scheduling downtime

and without forcibly terminating any programs users are running.

4.4.3 Setting Up and Using AutoPod

To demonstrate how simple it is to set up a pod to run within the AutoPod en-

vironment, we provide a step-by-step walkthrough on how one would create a new

pod that can run the Exim mail transfer agent. Setting up AutoPod to provide the

Exim pod on Linux is straightforward and leverages the same skill set and experience

system administrators already have on standard Linux systems. AutoPod is started

by loading its kernel module into a Linux system and using its user-level utilities to

set up and insert processes into a pod.

Creating a pod’s file system is the same as creating a chroot environment. Ad-

ministrators with experience creating a minimal environment containing only the

application they want to isolate do not need to do any extra work. However, many

administrators do not have experience creating such an environment and therefore

need an easy way to create an environment in which to run their application. These

administrators can take advantage of Debian’s debootstrap utility that allows a user

to quickly set up an environment equivalent to a base Debian installation. An administrator

would run debootstrap stable /autopod to install the most recently

released Debian system into the /autopod directory. While this also includes many


packages that are not required by the installation, it provides a small base to work

from.

To configure Exim, an administrator edits the appropriate configuration files

within the /autopod/etc/exim4/ directory. To run Exim in a pod, an administrator

runs mount -o bind /autopod /autopod/exim/root to bind-mount the

pod directory onto the staging area directory, where the pod expects it to be. autopod

add exim is used to create a new pod named exim which uses /autopod/exim/root

as the root for its file system. Finally, autopod addproc exim /usr/sbin/exim4

is used to start Exim within the pod by executing the /usr/sbin/exim4 program,

which is located at /autopod/exim/root/usr/sbin/exim4.
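
The walkthrough so far can be collected into one ordered command list. The sketch below only builds and prints the commands quoted in the text; it does not execute them. The function name and its parameters are hypothetical, and it assumes that the staging directory already exists and that the Exim configuration has been edited by hand between the first and second steps.

```python
import shlex


def exim_pod_setup_commands(base="/autopod", pod="exim"):
    """Return the setup commands from the walkthrough, in order."""
    root = f"{base}/{pod}/root"
    return [
        ["debootstrap", "stable", base],         # base Debian environment
        ["mount", "-o", "bind", base, root],     # expose it at the staging area
        ["autopod", "add", pod],                 # create the pod
        ["autopod", "addproc", pod, "/usr/sbin/exim4"],  # start Exim inside it
    ]


for argv in exim_pod_setup_commands():
    # Dry run: print each command instead of running it.
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Keeping the commands as argument lists rather than shell strings means they could later be passed to a process launcher without further quoting.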

AutoPod isolates the processes running within a pod from the rest of the system,

which helps contain intrusions if they occur. But since a pod does not have to be

maintained by itself, but can be maintained in the context of a larger system, one can

also prune down the environment and remove many programs that an attacker could

use against the system. For instance, if an Exim pod does not need to run shell scripts,

there is no reason to leave programs such as /bin/bash, /bin/sh, and /bin/dash

within the environment. But these programs will be necessary in the future if the

administrator wants to upgrade the package using normal Debian methods. Because

it is easy to recreate the environment, one approach would be to remove all the

programs that are not wanted within the environment and recreate the environment

when an upgrade is needed. Another would be to move those programs outside the

pod, perhaps by creating a /autopod-backup directory. To upgrade the pod using

normal Debian methods, the programs can be moved back into the pod’s file system.

If an administrator wants to manually reboot the system without killing the pro-

cesses within the Exim pod, they can first checkpoint the pod to disk by running

autopod checkpoint exim -o /exim.ck, which tells AutoPod to checkpoint the


processes associated with the Exim pod to the file /exim.ck. The system can then be

rebooted, potentially with an updated kernel. Once it comes back up, the pod can be

restarted from the /exim.ck file by running autopod restart exim -i /exim.ck.

These mechanisms are the same as those used by the AutoPod system status service

for controlling the checkpointing and migration of pods.

Standard Debian facilities can be used for running other services within a pod.

Once the base environment is set up, an administrator can chroot into this environ-

ment to continue setup. By editing the /etc/apt/sources.list file appropriately

and running apt-get update, an administrator will be able to install any Debian

package into the pod. In the Exim example, Exim does not need to be installed

since it is the default mail transfer agent (MTA) and is already included in the base

Debian installation. If one wanted to install another MTA, such as Sendmail, one

could run apt-get install sendmail, which will download Sendmail and all the

packages needed to run it. This will work for any service available within Debian.

An administrator can also use the dpkg --purge option to remove packages that are

not required by a given pod. For instance, in running an Apache web server in a pod,

one can remove the default Exim mail transfer agent because Apache does not need

it.

4.5 Experimental Results


We implemented AutoPod as a loadable kernel module in Linux, which requires no

changes to the kernel, as well as a user space system status monitoring service. We

present some experimental results using our Linux prototype to quantify the overhead

of using AutoPod on various applications. Experiments were conducted on three IBM

Netfinity 4500R machines, each with a 933 MHz Intel Pentium-III CPU, 512 MB RAM,

9.1 GB SCSI HD, and 100 Mbps Ethernet connected to a 3Com Superstack II 3900

switch. One of the machines was used as an NFS server from which directories were

mounted to construct the virtual file system for the pod on the other client systems.

One client ran Debian Stable with a Linux 2.4.5 kernel, and the other ran Debian

Unstable with a Linux 2.4.18 kernel.

Name      Applications                                              Normal Startup
Email     Exim 3.36                                                 504 ms
Web       Apache 1.3.26 and MySQL 4.0.14                            2.1 s
Desktop   Xvnc – VNC 3.3.3r2 X Server                               19 s
          KDE – Entire KDE 2.2.2 environment, including window
                manager, panel and assorted background daemons
                and utilities
          SSH – openssh 3.4p1 client inside a KDE Konsole
                terminal connected to a remote host
          Shell – The Bash 2.05a shell running in a Konsole
                terminal
          KGhostView – A PDF viewer with a 450KB 16-page PDF
                file loaded
          Konqueror – A modern standards-compliant web browser
                that is part of KDE
          KOffice – The KDE word processor and spreadsheet
                programs

Table 4.1 – Application Scenarios

To measure the cost of AutoPod migration and demonstrate the ability of Auto-

Pod to migrate real applications, we migrated three application scenarios: an email

delivery service using Exim and Procmail, a web content delivery service using Apache

and MySQL, and a KDE desktop computing environment. Table 4.1 describes the

configurations of the application scenarios we migrated and shows the time it takes

to start up on a regular Linux system. To demonstrate our AutoPod prototype’s

ability to migrate across Linux kernels with different minor versions, we checkpointed

each application workload on the 2.4.5 kernel client machine and restarted it on the

2.4.18 kernel machine. For these experiments, the workloads were checkpointed to

and restarted from a local disk.


Case      Checkpoint   Restart   Size      Compressed
Email     11 ms        14 ms     284 KB    84 KB
Web       308 ms       47 ms     5.3 MB    332 KB
Desktop   851 ms       942 ms    35 MB     8.8 MB

Table 4.2 – AutoPod Migration Costs

Table 4.2 shows the time to checkpoint and restart each application workload. Mi-

gration time also has to take into account network transfer time. As this is dependent

on the transport medium, we include the uncompressed and compressed checkpoint

image sizes. In all cases, checkpoint and restart times were significantly faster than

the regular startup times listed in Table 4.1, with each operation taking less than a

second, even when performed on separate machines or across a reboot. Moreover,

a number of techniques have since been pioneered to further minimize downtime, in-

cluding pre-copy/incremental checkpointing [43,81,141] and intelligent quiescing [81].

Pre-copy/incremental checkpointing minimizes the amount of time the services will

be unavailable by taking partial checkpoints during the service’s execution and only

saving what has changed since the last checkpoint was taken. Intelligent quiescing

minimizes the time checkpointing takes by keeping the services available until the

entire service is ready to checkpoint.

We also show that the actual checkpoint images saved were modestly sized for

complex workloads. For example, the Desktop pod had over 30 different processes

running, including the KDE desktop applications, substantial underlying window

system infrastructure, inter-application sharing, and a rich desktop interface managed

by a window manager. Even with all these applications running, they checkpoint to

a very reasonable 35 MB uncompressed for a full desktop environment. Additionally,

if checkpoint images must be transferred over a slow link, Table 4.2 shows that they

can be compressed very well with bzip2.
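
To give a sense of scale for the network transfer component, the following worked example estimates the ideal transfer time for the Desktop image sizes in Table 4.2 over the 100 Mbps Ethernet used in the testbed. It assumes decimal megabytes and a link with no protocol overhead, latency, or contention, so the results are lower bounds rather than measurements.

```python
def transfer_seconds(size_bytes, link_bits_per_second):
    """Ideal transfer time: ignores protocol overhead, latency and contention."""
    return size_bytes * 8 / link_bits_per_second


MB = 1_000_000               # assumption: decimal megabytes
FAST_ETHERNET = 100_000_000  # 100 Mbps

desktop_raw = transfer_seconds(35 * MB, FAST_ETHERNET)   # about 2.8 s
desktop_bz2 = transfer_seconds(8.8 * MB, FAST_ETHERNET)  # about 0.7 s
```

Under these assumptions, transferring the uncompressed Desktop image would take longer than checkpointing and restarting it combined, which is why the compressed size matters when the link is the bottleneck.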


4.6 Related Work

Virtual machine monitors (VMMs) have been used to provide secure isolation [28,

142,147] and to migrate an entire operating system environment [128]. Unlike Auto-

Pod, standard VMMs decouple processes from the underlying machine hardware, but

tie them to an instance of an operating system. As a result, VMMs cannot migrate

processes apart from that operating system instance and cannot continue running

those processes if the operating system instance goes down, such as during security

upgrades. In contrast, AutoPod decouples process execution from the underlying

operating system, allowing it to migrate processes to another system when an op-

erating system instance is upgraded. VMMs have been proposed to support online

maintenance of systems [87] by having a microvisor that supports at most two virtual

machines running on the machine at the same time, effectively giving each physical

machine the ability to act as its own hot spare. This proposal, however, explicitly

depends on AutoPod’s heterogeneous migration without providing this functionality

itself.

Many systems have been proposed to support process migration [22, 24, 40, 42,

54,85,95,106,119,120,125,129], but they do not allow migration across independent

machines running different operating system versions. TUI [131] provides support for

process migration across machines running different operating systems and hardware

architectures. Unlike AutoPod, TUI has to compile applications on each platform

using a special compiler and does not work with unmodified legacy applications. Au-

toPod builds on Zap [100] to support transparent migration across systems running

the same kernel version. AutoPod goes beyond Zap in providing transparent migra-

tion across minor kernel versions, which is essential for making applications available

during operating system security upgrades.


Replication in clustered systems can provide the ability to do rolling upgrades. By

leveraging many nodes, individual nodes can be taken down for maintenance without

significantly impacting the load that the cluster can handle. For example, web content

is commonly delivered by multiple web servers behind a front end manager. This

front end manager enables an administrator to bring down back end web servers

for maintenance by directing requests only to the active web servers. This simple

solution is effective because it is easy to replicate web servers to serve the same

content. Although this model works fine for web server loads, as the individual jobs

are very short, it does not work for long-running jobs, such as a user’s desktop. In

the web server case, replication and upgrades are easy to do because only one web

server is used to serve any individual request and any web server can be used to serve

any request. For long-running stateful applications, such as a user’s desktop, requests

cannot be arbitrarily redirected to any desktop computing environment because each

user’s desktop session is unique. Although specialized hardware support could be

used to keep replicas synchronized by having all of them process all operations, this is

prohibitively expensive for most workloads and does not address how to resynchronize

the replicas in the presence of rolling upgrades.

Another possible solution is allowing the kernel to be hot-pluggable. Although

micro-kernels are not prevalent, they are able to upgrade their parts on the fly. More

commonly, many modern monolithic kernels have kernel modules that can be inserted

and removed dynamically. This can allow upgrading parts of a monolithic kernel

without requiring reboots. The Nooks [136] system extends this concept by enabling

kernel drivers and other kernel functionality, such as file systems, to be isolated in

their own domains to help isolate faults in kernel code and provide a more reliable

system. However, in all of these cases, there is still a base kernel on the machine that

cannot be replaced without a reboot. If that part must be replaced, all data is lost.


The K42 operating system can be dynamically updated [29], enabling software

patches to be applied to a running kernel even in the presence of data structure

changes. But it requires a completely new operating system design and does not

work with any commodity operating system. Even on K42, it is not yet possible to

upgrade the kernel while running realistic application workloads.

Chapter 5

PeaPod: Isolating Cooperating

Processes

A key problem faced by today’s computers is that they are difficult to secure due to

the numerous complex services they run. If a single service is exploited, an attacker is

able to access all the resources available to the machine it is running on. To prevent

this from occurring, it is important to design systems with security principles [126] in

mind to limit the damage that can occur when security is breached. One of the most

important principles is ensuring that one operates in a Least-Privilege environment.

Least-Privilege environments require that a user or a program has access only to

the resources that are required to complete their job. Even if the user’s or service’s

environment is exploited, the attacker will be constrained. For a system with many

distinct users and uses, designing a least-privilege system can prove to be very difficult,

as many independent application systems can be used in many different and unknown

ways.

A common approach to providing least-privilege environments is to separate each

individual service into its own sandbox container environment, such as provided by

AutoPod. Many sandbox container environments have been developed to isolate

untrusted applications [7, 60, 65, 74, 86, 118, 144]. However, many of these approaches

have suffered from being too complex and too difficult to configure to use in practice,

and have often been limited by an inability to work seamlessly with existing system

tools and applications. Virtual machine monitors (VMMs) offer a more attractive

approach by providing a much easier-to-use isolation model of virtual machines, which

look like separate and independent systems apart from the underlying host system.

However, because VMMs need to run an entire operating system instance in each

virtual machine, the granularity of isolation is very coarse, enabling malicious code

in a virtual machine to use the entire set of operating system resources. Multiple

operating system instances also need to be maintained, adding administrative overhead.

A primary problem with a sandbox container that attempts to isolate a single

service is that many services are composed of many interdependent and cooperating

programs. Each individual application that makes up the service has its own set

of access requirements. However, since they all run within the same sandbox con-

tainer, each individual application ends up with access to the superset of resources

that are needed by all the programs that make up the service, thereby negating the

least-privilege principle. One cannot divide the programs into distinct sandbox con-

tainer environments since many programs are interdependent and expect to work

from within a single context.

We leveraged operating system virtualization to design and build PeaPod, which

sandboxes complete services while restricting each service’s interdependent and

cooperating components to least-privilege environments. PeaPod combines two key

virtualization abstractions in its virtualization layer. First, it

leverages the secure pod abstraction to provide a sandbox container for entire services

to run within. Second, it introduces the pea (Protection and Encapsulation Abstrac-


tion). A pea is an easy-to-use least-privilege mechanism that enables further isolation

among application components that need to share limited system resources within a

single pod. It can prevent compromised application components from attacking other

components within the same pod. A pea provides a simple resource-based model that

restricts access to other processes, IPC, file system and network resources available

to the pod as a whole.

PeaPod improves upon previous approaches by not requiring any operating system

modifications, as well as avoiding the time of check, time of use (TOCTOU) race

conditions that affect many of them [145]. For instance, unlike other approaches

that perform file system security checks at the system call level and therefore do not

check the actual file system object that the operating system uses, PeaPod leverages

file system virtualization to integrate directly into the kernel’s file system security

framework. It avoids these race conditions by performing all file system security

checks within the regular file system security paths and on the same file system

objects that the kernel itself uses.

5.1 PeaPod Model

The PeaPod model combines the previously introduced operating system virtualiza-

tion secure pod abstraction with a new abstraction called peas. The secure pod

abstraction, as shown in AutoPod, is useful for separating distinct application ser-

vices into separate machine environments. Peas are used in a pod to provide fine-

grained isolation among application components that may need to interact within a

single machine environment, such as using interprocess communication mechanisms,

including signals, shared memory, IPC messages and semaphores, and process forking

and execution. Figure 5.1 shows how pods and peas work together. Each pod, and

the resources contained within it, is fully independent of every other pod, but each

pod can have an arbitrary number of peas associated with it to apply extra

security restrictions among its cooperating processes.

Figure 5.1 – PeaPod Model

A pea is an abstraction that can contain a group of processes, restrict those pro-

cesses in interacting with processes outside of the pea, and limit their access to only

a subset of system resources. Unlike the secure pod abstraction, which achieves isola-

tion by controlling what resources are located within the namespace, a pea achieves

isolation levels by controlling what system resources within a namespace its processes

are allowed to access and interact with. For example, a process in a pea can see file

system resources and processes available to other peas within a single pod, but can

be restricted from accessing them. Unlike processes in separate pods, processes in

separate peas in a single pod share the same namespace and can be allowed to inter-


act using traditional interprocess communication mechanisms. Processes can also be

allowed to move between peas in the same pod. However, by default, a process in a

pea cannot access any resource that is not made available to that pea, be it a process

pid, IPC key or file system entry.

Peas can support a wide range of resource restriction policies. By default, pro-

cesses contained in a pea can only interact with other processes in the same pea.

They have no access to other resources, such as file system and network resources or

processes outside of the pea. This provides a set of fail safe defaults, as any extra

access has to be explicitly allowed by the administrator.

The pea abstraction allows for processes running on the same system to have

varying levels of isolation by running in separate peas. Many peas can be used side

by side to provide flexibility in implementing a least-privilege system for programs

that are composed of multiple components that must work together, but do not all

need the same level of privilege. One usage scenario would be to have a severely

resource limited pea in which a privileged process executes. The process is, however,

allowed to use traditional Unix semantics to work with less privileged programs that

are in less resource restricted peas.

For example, peas can be used to allow a web server appliance the ability to serve

dynamic content via CGI in a more secure manner. Since the web server and the CGI

scripts need separate levels of privilege, and have different resource requirements, they

should not have to run within the same security context. By configuring two separate

peas for a web service, one for the web server to run within, and a separate one for the

specific CGI programs it wants to execute, one limits the damage that can occur if a

fault is discovered within the web server. If one manages to execute malicious code

within the context of the web server, one can only use resources that are allocated to

the web server’s pea, as well as only execute the specific programs that are needed


as CGIs. Since the CGI programs will also only run within their specific security

context, the ability for malicious code to do harm is severely limited.

Peas and pods together provide secure isolation based on flexible resource restric-

tion for programs as opposed to restricting access based on users. Peas and pods

also do not subvert underlying system restrictions based on user permissions, but

instead complement such models by offering additional resource control based on the

environment in which a program is executed. Instead of allowing programs with root

privileges to do anything they want to a system, PeaPod allows a system to control

the execution of such programs to limit their ability to harm a system even if they

are compromised.

5.2 PeaPod Virtualization

To support the PeaPod virtualization abstraction design of secure and isolated names-

paces on commodity operating systems, we leveraged the secure pod virtualization

architecture described in Chapter 3.1.1. For example, if one had a web server that

just serves static content, one can easily set up a web server pod to contain only the

files the web server needs to run and the content it wants to serve. The web server

pod could have its own IP address, decoupling its network presence from the under-

lying system. It could also limit network access to client-initiated connections. If the

web server application gets compromised, the pod limits the ability of an attacker

to further harm the system since the only resources the attacker has access to are

the ones explicitly needed by the service. Furthermore, there is no need to carefully

disable other network services commonly enabled by the operating system that might

be compromised, as only the single service is running within the pod.


5.2.1 Pea Virtualization

Peas are supported using virtualization mechanisms that label resources and enforce

a simple set of configurable permission rules to impose levels of isolation among

processes running within a single pod. For example, when a process is created via the

fork() or clone() system call, its process identifier is tagged with the identifier of

the pea in which it was created. Peas leverage the pod’s shadow process identifier for

this tag, so a new process is placed in the same pea as its parent process. A process’s ability to access

pod resources is then dictated by the set of access permissions rules associated with its

pea. Like pod virtualization, the key pea operating system virtualization mechanisms

are system call interposition and file system stacking.

Pea virtualization employs system call interposition to virtualize the kernel and

wrap existing system calls. Kernel virtualization enables peas to enforce restrictions

on process interactions by controlling access to process and IPC virtual identifiers.

Since each resource is labeled with the pea in which it was created, the kernel virtu-

alization mechanism checks if the pea labels of the calling process and the resource

to be accessed are the same. When a process in one pea tries to send a signal to a

process in a separate pea by using the kill system call, the system returns an error

value of EPERM, as the process exists, but this process has no permission to signal

it. However, a parent process is able to use the wait system call to clean up a termi-

nated child process’s state, even if that child process is running within a separate pea,

since wait does not modify a process by affecting its execution. This is analogous to

a regular user being able to list the metadata of a file, such as owner and permission

bits, even if the user has no permission to read from or write to the file.
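The label check on signals and the exception for wait can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names Proc, pea_kill and pea_wait are hypothetical and are not PeaPod interfaces, the real checks run inside the kernel-level virtualization layer, and the sketch covers only the default policy in which peas have been given no access to each other.

```python
import errno
from dataclasses import dataclass
from typing import Optional


@dataclass
class Proc:
    """Hypothetical stand-in for a process labeled with the pea it runs in."""
    pid: int
    pea: str
    parent: Optional[int] = None


def pea_kill(caller: Proc, target: Proc) -> int:
    """Return 0 if the signal may be delivered, or -EPERM if the pea labels differ."""
    if caller.pea != target.pea:
        # The target exists, but the caller has no permission to signal it.
        return -errno.EPERM
    return 0


def pea_wait(caller: Proc, child: Proc) -> int:
    """Allow a parent to reap its own child regardless of the child's pea label."""
    if child.parent != caller.pid:
        return -errno.ECHILD
    # No label comparison here: wait does not affect the child's execution.
    return child.pid


web = Proc(pid=10, pea="web")
cgi = Proc(pid=11, pea="cgi", parent=10)
print(pea_kill(web, cgi) == -errno.EPERM)  # signal across peas is refused
print(pea_wait(web, cgi))                  # parent may still reap the child
```

The asymmetry is deliberate in this sketch: only the operation that changes another process's execution consults the labels.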

When a new process is created, it executes in the pea security domain of its

parent. However, when the process executes a new program, the security domain of


the parent might not be the appropriate security domain to execute the new program

in. Therefore, one wants the ability to explicitly transition the process from one

pea security domain to another on new program execution. To support this, peas

provide a single type of pea transition rule that lets a pea determine how a process

can transition from its current pea to another. This transition rule is specified by

a program filename and pea identifier. A pea is able to have multiple pea access

transition rules of this type. The rule specifies that a process should be moved

into the pea specified by the pea identifier if it executes the program specified by the

given filename. This is useful when it is desirable to have that new program execution

occur in an environment with different resource restrictions. For example, an Apache

web server running in a pea may want to execute its CGI child processes in a pea

with different restrictions. Pea transitioning is supported by interposing on the exec

system call and transitioning peas if the process to be executed matches a pea access

transition rule for the current pea. Note that pea access transition rules are one-way

transitions that do not allow a process to return to its previous pea unless its new

pea explicitly provides for such a transition.

Kernel virtualization is used to control network access inside the pea. Peas provide

two networking access rule types. One allows processes in the pea to make outgoing

network connections on a pod’s virtual network adapters, while the other allows

processes in the pea to bind to specific ports on the adapter to receive incoming

connections. Pea network access rules can allow complete access to a pod network

adapter, or only allow access on a per-port basis. Since any network access occurs

through system calls, peas simply check the arguments of networking system calls,

such as bind and connect, to ensure that the calling process is allowed to perform the specified action.

Pea virtualization employs a set of file system access rules and file system vir-

tualization to provide each pea with its own permission set on top of the pod file


system. To provide a least-privilege environment, processes should not have access

to file system privileges they do not need. For example, while Sendmail has to write

to /var/spool/mqueue, it only has to read its configuration from /etc/mail and

should not need to have write permission on its configuration. To implement such a

least-privilege environment, peas allow files to be tagged with additional permissions

that overlay the respective underlying file permissions. File system permissions deter-

mine access rights based on the user identity of the process while pea file permission

rules determine access rights based on the pea context in which a process is executed.

Each pea file permission rule can selectively allow or deny the use of the underlying

read, write and execute permissions of a file on a per-pea basis. The underlying file

permission is always enforced, but pea permissions can further restrict whether the

underlying permission is allowed to take effect. The final permission is achieved by

performing a bitwise AND operation on the pea and file system permissions. For

example, if the pea permission rule allowed for read and execute, the permission set of

r-x would be triplicated to r-xr-xr-x for the three sets of Unix permissions and the

bitwise AND operation would mask out any write permission that the underlying file

system allows. This prevents any process in the pea from opening the file to modify

it.
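The masking step can be illustrated with a short Python sketch. The function name effective_mode and its parameters are hypothetical, chosen only for this example, and the sketch assumes the file mode carries just the nine standard permission bits.

```python
import stat


def effective_mode(file_mode: int, pea_read: bool, pea_write: bool, pea_execute: bool) -> int:
    """Mask a Unix permission mode with a pea rule replicated for user, group and other."""
    pea_bits = (4 if pea_read else 0) | (2 if pea_write else 0) | (1 if pea_execute else 0)
    # Triplicate the three pea bits, e.g. r-x (0o5) becomes r-xr-xr-x (0o555).
    pea_mask = (pea_bits << 6) | (pea_bits << 3) | pea_bits
    # The underlying permission is always enforced; the pea can only remove bits.
    return file_mode & pea_mask


# A file with mode rwxrw-r-- seen from a pea that allows read and execute only.
mode = effective_mode(0o764, pea_read=True, pea_write=False, pea_execute=True)
print(stat.filemode(stat.S_IFREG | mode))  # -r-xr--r--
```

Because the result is an AND of the two permission sets, a pea rule can never grant a right that the underlying file system withholds.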

Enforcing on-disk labeling of every single file, such as supported through access

control lists provided by many modern file systems, is inflexible if a single underly-

ing file system is going to be used for multiple disparate pods and peas. As each

pea in each pod can use the same files with different permission schemes, storing

the pea’s permission data on disk is not feasible. Instead, peas support the abil-

ity to dynamically label each file within a pod’s file system based on two simple

path-matching permission rules: path-specific permission rules and directory-default

permission rules. A path-specific permission matches an exact path on the file sys-


tem. For instance, if there is a path-specific permission for /home/user/file, only

that file will be matched with the appropriate permission set. On the other hand, if

there is a directory-default permission for the directory /home/user/, then any file

under that directory in the directory tree can match it, and inherit its permission set.

Given a set of path-specific and directory-default permissions for a pea, the algorithm

for determining which permission applies to a given path starts with the complete

path and walks up the path to the root directory until it finds a matching permission

rule. The algorithm can be described in four simple steps:

1. If the specific path has a path-specific permission, return that permission set.

2. Otherwise, choose the path’s directory as the current directory to test.

3. If the directory being tested has a directory-default permission, return that

permission set.

4. Otherwise set its parent as the current directory to test and go back to step 3.

If there is no path-specific permission, the closest directory-default permission to

the specified path becomes the permission set for that path. By default, peas give the

root directory “/” a directory-default permission denying all permissions; thus, the

default for every file on the system, unless otherwise specified, is deny. This ensures

that the peas have a fail safe default setup and do not allow access to any files unless

specified by the administrator.
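The four steps and the fail safe default on “/” can be sketched as follows. This is a minimal illustration, not PeaPod's implementation: the function resolve_permission and the dictionary representation of the rules are assumptions made for the example, permission sets are kept as opaque strings, and paths are assumed to be absolute.

```python
import posixpath
from typing import Dict


def resolve_permission(path: str,
                       path_rules: Dict[str, str],
                       dir_defaults: Dict[str, str]) -> str:
    """Return the permission set that applies to an absolute path."""
    # Normalize keys so that "/home/user/" and "/home/user" name the same directory.
    path_rules = {posixpath.normpath(p): perm for p, perm in path_rules.items()}
    dir_defaults = {posixpath.normpath(d): perm for d, perm in dir_defaults.items()}
    path = posixpath.normpath(path)

    # Step 1: a path-specific permission on the exact path wins.
    if path in path_rules:
        return path_rules[path]

    # Step 2: otherwise start with the directory that contains the path.
    current = posixpath.dirname(path)
    while True:
        # Step 3: the closest directory-default permission applies.
        if current in dir_defaults:
            return dir_defaults[current]
        # Step 4: move to the parent directory and test again.
        parent = posixpath.dirname(current)
        if parent == current:
            # Reached the root without a match: fall back to the default deny.
            return "deny"
        current = parent


rules = {"/home/user/file": "read"}
defaults = {"/home/user/": "read,write"}
print(resolve_permission("/home/user/file", rules, defaults))        # read
print(resolve_permission("/home/user/docs/a.txt", rules, defaults))  # read,write
print(resolve_permission("/etc/passwd", rules, defaults))            # deny
```

The walk terminates at the root, so the cost of a lookup is bounded by the depth of the path rather than by the number of rules.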

The semantics of pea file permission are based on file path name. If a file has more

than one path name, such as via a hard link, both have to be protected by the same

permission; otherwise, depending on what order the file is accessed, the permission set

it gets will be determined simply based on the path name that was accessed initially.

This issue only occurs on creating the initial set of pea file access permissions. Once


the pea is set up, any hard links that are created will obey the regular file system

permissions. For instance, one is not allowed to create a hard link to a file that one

does not have permission to access. On the other hand, if one has permission to access

the file, a path-specific permission rule will be created for the newly created file that

corresponds to the permission of the path name it was linked to.

The pea architecture uses file system virtualization to integrate the pea file sys-

tem namespace restrictions into the regular kernel permission model, thereby avoid-

ing TOCTOU race conditions. It accomplishes this by virtualizing the file sys-

tem’s ->lookup method, which fills in the respective file’s inode structure, and the

->permission method, which uses the stored permission data to make simple per-

mission determinations. A file system’s ->permission method is a standard part of

the operating system’s security infrastructure, so no kernel changes are necessary.

5.2.2 Pea Configuration Rules

5.2.2.1 File System

Many system resources in Unix, including normal files, directories, and system devices,

are accessed via files, so controlling access to the file system is crucial. Each pea must

be restricted to those files used by its component processes. This control is important

for security, because processes that work together do not necessarily need the same

access rights to files. All file system access is controlled by path-specific and directory-

default rules, which specify a file or directory and an access right.

The access right values for file rules are read, write, and execute, similar to

standard Unix permissions. For convenience, we also define allow and deny, which

respectively grant and deny all three of read, write and execute and cannot be combined

with other access values in the same rule. When a path-specific or directory-default rule gives


access to a directory entry, it implicitly gives execute, but not read or write, access

to all parent directories of the file, up to the root directory. On the other hand,

if a separate path-specific rule denies access to a directory, then access to both the

directory and its contents will be denied. This occurs even if a separate directory-

default rule would give access to subdirectories or files, as the path-specific rule is a

better match.

pod mailserver {
    pea sendmail {
        path /etc/mail/aliases read
        path /etc/mail/aliases.db read
    }
    pea newaliases {
        path /etc/mail/aliases read
        path /etc/mail/aliases.db read,write
    }
}

Figure 5.2 – Example of Read/Write Rules

Consider the case of the Sendmail mail daemon and the newaliases command

with regard to the system-wide aliases file. Sendmail runs as the root user and needs

to be able to read the aliases file in order to know where it should forward mail

or otherwise redirect it. newaliases is a symbolic link to sendmail and typically

also runs as the root user in order to update the aliases file and convert it into the

database format used by the Sendmail daemon. In our example, newaliases runs in

its own pea and is able to read from /etc/mail/aliases and read from and write

to /etc/mail/aliases.db. Meanwhile sendmail runs in another pea and is able to

read both files, but not write to them. Each pea uses two path-specific rules to express

these access rights, as shown in Figure 5.2.

Similar rules can protect a device like /dev/dsp. When a user logs into a system

locally, via the console, they are typically given control of local devices, such as the

physical display and the sound card. Any application that the user runs has access to

read from and write to these local devices, even though this privilege is not necessary.

For example, we want to restrict playing and recording of sound files to the play

and rec applications, which are part of SoX [9]. Figure 5.3 describes the rules that

provide the appropriate access to the device.

pod music {
    pea play {
        path /dev/dsp write
    }
    pea rec {
        path /dev/dsp read
    }
}

Figure 5.3 – Protecting a Device

The other file system rule is the directory-default rule. It uses the same access

values as path-specific rules, but it is used to specify the default access for files below

a directory. Any file or sub-directory will inherit the same access flags since access is

determined by matching the longest possible path prefix. Unlike path-specific rules,

directory-default rules can deny access to a directory in general, while still allowing

access to specific files. Figure 5.4 describes a pea that denies access to all files in

/bin, while only allowing access to /bin/ls.

pod fileLister {
    pea onlyLs {
        dir-default /bin deny
        path /bin/ls allow
    }
}

Figure 5.4 – Directory-Default Rule


5.2.2.2 Transition Rules

When Sendmail and Procmail are used together to deliver mail to local users, the

sendmail process creates a new process and executes the procmail program to deliver

the mail to the user’s spool. Procmail needs different security settings, so it must

transition from a Sendmail pea to a Procmail pea. Rules must be defined that state

to which pea a process will transition upon execution. When a process calls the

execve system call, we examine the file name to be executed and perform a longest

prefix match on all the transition rules. For instance, by specifying a directory for a

transition, PeaPod will cause a pea transition to occur for any program executed that

is located in that directory, unless there is a more specific transition rule available.

Figure 5.5 creates a pea for Sendmail and Procmail, and specifies that a process

should transition when the procmail program is executed.

pod mailserver {
    pea sendmail {
        transition /usr/bin/procmail procmail
    }
    pea procmail {
    }
}

Figure 5.5 – Transition Rules

PeaPod does not provide the ability for a process to transition to another pea

except by executing a new program. If it did, a process could open an allowed

file in one pea and then transition to another pea where access to that file was not

allowed and thus circumvent the security restrictions.
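The exec-time lookup described in this section can be sketched in Python. The sketch assumes that the longest prefix match is done on whole path components, which the text does not spell out; the function pea_after_exec, the nested dictionary of rules, and the /usr/libexec rule with its helpers pea are hypothetical and serve only to show a directory-level transition.

```python
import posixpath
from typing import Dict


def pea_after_exec(current_pea: str,
                   program: str,
                   transitions: Dict[str, Dict[str, str]]) -> str:
    """Return the pea a process runs in after it executes `program`."""
    rules = transitions.get(current_pea, {})
    candidate = posixpath.normpath(program)
    while True:
        # The first match while walking upward is the longest matching prefix.
        if candidate in rules:
            return rules[candidate]
        parent = posixpath.dirname(candidate)
        if parent == candidate:
            # No rule matched: the process stays in its current pea.
            return current_pea
        candidate = parent


transitions = {
    "sendmail": {
        "/usr/bin/procmail": "procmail",  # rule from Figure 5.5
        "/usr/libexec": "helpers",        # hypothetical directory-level rule
    },
    "procmail": {},                        # no rules: transitions are one-way
}
print(pea_after_exec("sendmail", "/usr/bin/procmail", transitions))  # procmail
print(pea_after_exec("procmail", "/usr/sbin/sendmail", transitions)) # procmail
```

Since the rules are looked up only in the table of the current pea, a process in the procmail pea cannot return to the sendmail pea unless the procmail pea itself lists such a transition.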


5.2.2.3 Networking Rules

PeaPod provides two rules that define the network capabilities a pea exposes to the

processes running within it. First, peas are able to restrict a process from instantiating

an outgoing connection. Second, peas are able to limit what ports a process can bind

to and listen for incoming connections. By default, peas do not let processes make

any outgoing connections or bind to any port. Whereas a full network firewall is an

important part of any security architecture, it is orthogonal to the goals of PeaPod

and therefore belongs in its own security layer.

Continuing the simplified Sendmail/Procmail usage case, an administrator would

want to easily confine the network presence of processes running within Sendmail/Proc-

mail peas as shown in Figure 5.6. By allowing sendmail to make outgoing connections,

to enable it to send messages, as well as bind to port 25, the standard port for receiving

messages, Sendmail can continue to work normally. However, processes running

within the procmail pea, which is otherwise less restricted, are not allowed to bind to

any port, while they are allowed to initiate outgoing network connections. This allows

programs, such as spam filters that require checking network-based

information, to continue to work.

pod mailserver {
    pea sendmail {
        outgoing allow
        bind tcp/25
    }
    pea procmail {
        outgoing allow
    }
}

Figure 5.6 – Networking Rules
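The two networking rules of Figure 5.6 can be modeled with a small Python sketch. The class PeaNetRules and the helpers may_connect and may_bind are hypothetical names for this illustration; the sketch covers only the on/off outgoing rule and per-port bind rules, and it assumes the checks are made when a process calls connect or bind.

```python
from dataclasses import dataclass, field
from typing import Set, Tuple


@dataclass
class PeaNetRules:
    """Hypothetical in-memory form of one pea's networking rules."""
    outgoing: bool = False  # default: no outgoing connections
    bind_ports: Set[Tuple[str, int]] = field(default_factory=set)  # default: no ports


def may_connect(rules: PeaNetRules) -> bool:
    """Check made when a process in the pea calls connect."""
    return rules.outgoing


def may_bind(rules: PeaNetRules, proto: str, port: int) -> bool:
    """Check made when a process in the pea calls bind."""
    return (proto, port) in rules.bind_ports


sendmail = PeaNetRules(outgoing=True, bind_ports={("tcp", 25)})
procmail = PeaNetRules(outgoing=True)
print(may_bind(sendmail, "tcp", 25))  # True
print(may_bind(procmail, "tcp", 25))  # False
```

A pea created with no rules at all denies both operations, which matches the default described above.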


5.2.2.4 Shared Namespace Rules

PeaPod provides a single namespace rule for allowing processes to access the pod’s

virtual private identifiers that do not belong to their own pea. PeaPod enables peas

to be configured to only have access to resources tagged with specific pea identifiers

or with the special global pea identifier that enables access to every virtual private

resource in the pod. This rule is used to create a global pea with access to all the

resources of a pod, for instance to allow a process to start up and shut down services

running within a resource-restricted pea. Figure 5.7 describes a pod that has a pea,

global_access, that is able to access every resource in the pod, as well as a pea, test1,

that is able to access the resources created within one of its sibling peas, test2.

pod service {
    pea global_access {
        namespace global
    }
    pea test1 {
        namespace test2
    }
    pea test2 {
    }
}

Figure 5.7 – Namespace Access Rules
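The effect of the namespace rules in Figure 5.7 can be sketched as a single access check. The function may_access and the dictionary that maps each pea to the pea identifiers named in its namespace rules are assumptions for this example, not PeaPod data structures.

```python
from typing import Dict, Set

GLOBAL = "global"  # the special identifier used in Figure 5.7


def may_access(caller_pea: str, resource_pea: str,
               namespace_rules: Dict[str, Set[str]]) -> bool:
    """Decide whether a process in caller_pea may use a resource labeled resource_pea."""
    if caller_pea == resource_pea:
        return True  # a pea can always use its own resources
    allowed = namespace_rules.get(caller_pea, set())
    return GLOBAL in allowed or resource_pea in allowed


namespace_rules = {
    "global_access": {"global"},
    "test1": {"test2"},
    "test2": set(),
}
print(may_access("test1", "test2", namespace_rules))  # True
print(may_access("test2", "test1", namespace_rules))  # False
```

Access granted this way is not symmetric: test1 can reach resources created in test2, but test2 gains nothing in return.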

5.2.2.5 Managing Rules

To make it simpler for administrators to create peas in a pod, we allow groups of rules

to be saved to a file and included in the main configuration file for a given PeaPod

configuration. These groups of rules would typically describe the minimum resources

necessary for a single application. Application packagers can include rule group files

in their package and administrators can share rule groups with each other.


path /usr/bin/gcc read,execute
dir-default /usr/lib/gcc-lib read,execute
path /usr/bin/cpp read,execute
path /usr/lib/libiberty.a read
path /usr/bin/ar read,execute
path /usr/bin/as read,execute
path /usr/bin/ld read,execute
path /usr/bin/ranlib read,execute
path /usr/bin/strip read,execute

Figure 5.8 – Compiler Rules

pod workstation {
    pea kernel-development {
        include "stdlibs"
        include "compiler"
        include "tar"
        include "bzip2"
        dir-default /usr/local/src/ read
        dir-default /scratch/binaries allow
    }
}

Figure 5.9 – Set of Multiple Rule Files

A rule group, seen in Figure 5.8 for a compiler, would be stored in a central

location. An administrator uses an include rule to reference the external file as part

of a development PeaPod. Figure 5.9 shows a pea that includes the tools necessary to

build a Linux kernel from source; it permits read access to the source code itself and

provides a writable directory for the binaries.
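One way to picture the include rule is as a textual expansion of a pea's rule list. The following Python sketch makes that assumption explicit; the function expand_includes and the dictionary of rule groups are hypothetical, the group is abbreviated to two of the rules from Figure 5.8, and nested includes are deliberately not handled because the text does not say whether they are supported.

```python
from typing import Dict, List


def expand_includes(rules: List[str], groups: Dict[str, List[str]]) -> List[str]:
    """Replace each include rule with the rules stored in the named group."""
    expanded: List[str] = []
    for rule in rules:
        keyword, _, rest = rule.partition(" ")
        if keyword == "include":
            name = rest.strip().strip('"')
            expanded.extend(groups[name])  # raises KeyError for an unknown group
        else:
            expanded.append(rule)
    return expanded


groups = {
    "compiler": [
        "path /usr/bin/gcc read,execute",
        "path /usr/bin/ld read,execute",
    ],
}
pea_rules = ['include "compiler"', "dir-default /usr/local/src/ read"]
for rule in expand_includes(pea_rules, groups):
    print(rule)
```

Keeping the expansion purely textual means that the local rules an administrator writes after the include lines are evaluated by the same matching logic as the packaged ones.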

These management rules demonstrate PeaPod’s ability to isolate the specific re-

source needs of individual programs from the local policy an administrator defines.

The knowledge needed to build a set of rules for a program service that provides the

specific set of resources needed to execute is not always readily available to users of

security systems. However, this knowledge is available to the authors and distributors

of the system. PeaPod’s management rules allow the creation and distribution of rule


files that define the specific set of resources needed to execute a program service, while

enabling the local administrator to further define the resource-restriction policy.

5.3 Security Analysis

Saltzer and Schroeder [126] describe several principles for designing and building

secure systems. These include:

• Economy of mechanism: Simpler and smaller systems are easier to understand,

which makes it easier to verify that they do not allow unwanted access.

• Fail safe defaults: Systems should deny access by default and explicitly choose

when to allow it.

• Complete mediation: Systems should check every access to protected objects.

• Least-privilege: A process should only have access to the privileges and resources

it needs to do its job.

• Psychological acceptability: If users are not willing to accept the requirements

that the security system imposes, such as very complex passwords that the users

are forced to write down, security is impaired. Similarly, if using the system is

too complicated, users will misconfigure it and end up leaving it wide open.

• Work factor: Security designs should force an attacker to do extra work

to break the system. The classic quantifiable example is when one adds a single

bit to an encryption key, one doubles the key space an attacker has to search.

PeaPod is designed to satisfy these six principles. PeaPod provides economy of

mechanism using a thin virtualization layer, based on system call interposition for


kernel virtualization and file system stacking for file system virtualization, that only

adds a modest amount of code to a running system. The largest part of the system

is due to the use of a null stackable file system with 7000 lines of C code, but this file

system was generated using a simple high-level file system language [152], and only

50 lines of code were added to this well-tested file system to implement PeaPod’s file

system security. Furthermore, PeaPod changes neither applications nor the under-

lying operating system kernel. The modest amount of code to implement PeaPod

makes the system easier to understand. As the PeaPod security model provides only

resources that are explicitly stated, it is relatively easy to understand the security

properties of resource access provided by the model.

PeaPod provides fail safe defaults by only allowing access to resources that have

been explicitly given to peas and pods. If a resource is not created within a pea, or

explicitly made available to that pea, no process within that pea will be allowed to

access it. Whereas a pea can be configured to enable access to all resources of the

pod, this is an explicit action an administrator has to take.

PeaPod provides for complete mediation of all resources available on the host machine

by ensuring that all resource accesses occur through the pod’s virtual namespace.

Unless a file, process or other operating system resource was explicitly placed in the

pod by the administrator or created within the pod, PeaPod’s virtualization will not

allow a process within a pod to access the resource.

PeaPod provides a least-privilege environment by enabling an administrator to

only include the data necessary for each service. PeaPod can provide separate pods

for individual services so that separate services are isolated and restricted to the

appropriate set of resources. Even if a service is exploited, PeaPod will limit the

attacker to the resources the administrator provided for that service. While one can

achieve similar isolation by running each individual service on a separate machine,


this leads to inefficient use of resources. PeaPod maintains the same least-privilege

semantic of running individual services on separate machines, while making efficient

use of machine resources at hand. For instance, an administrator could run MySQL

and Sendmail mail transfer services on a single machine, but within different pods.

If the Sendmail pod gets exploited, the pod model ensures that the MySQL pod

and its data will remain isolated from the attacker. Furthermore, PeaPod’s peas are

explicitly designed to enable least-privilege environments by restricting programs in

an environment that can be easily limited to provide the least amount of access for

the encapsulated program to do its job.

PeaPod provides psychological acceptability by leveraging the knowledge and skills

system administrators already use to set up system environments. Because pods

provide a virtual machine model, administrators can use their existing knowledge and

skills to run their services within pods. Furthermore, peas use a simple resource-based

model that does not require a detailed understanding of any underlying operating

system specifics. This differs from other least-privilege architectures that force an

administrator to learn new principles or complicated configuration languages that

require a detailed understanding of operating system principles.

Similar to least-privilege, PeaPod increases the work factor that it would take to

compromise a system simply by not making available the resources that attackers

depend on to harm a system once they have broken in. For example, because PeaPod

can selectively control which programs are visible within each pea, it would

be very difficult to get a root shell on a system that does not have access to any shell

program. While removing the shell does not create a complete least-privilege system,

it is a simple change that creates a lesser privilege system and therefore increases the

work factor that would be required to compromise the system.


5.4 Usage Examples

We briefly describe three examples that help illustrate how the PeaPod virtualization

layer can be used to improve computer security and application availability for differ-

ent application scenarios. The application scenarios are email delivery, web content

delivery, and desktop computing. In the following examples we make extensive use of

PeaPod’s ability to compose rule files in order to simplify the rules. Instead of listing

every file and library necessary to execute a program, we isolate them into a separate

rule file to place the focus on the actual management of the service that the pea is

trying to protect.

5.4.1 Email Delivery

For email delivery, PeaPod’s virtualization layer can isolate different components

of email delivery to provide a significantly higher level of security in light of the

many attacks on Sendmail vulnerabilities that have occurred [15,16,83,88]. Consider

isolating a Sendmail installation that also provides mail delivery and filtering via

Procmail. Email delivery services are often run on the same system as other Internet

services to improve resource utilization and simplify system administration through

server consolidation. However, this can provide additional resources to services that

do not need them, potentially increasing the damage that can be done to the system

if attacked.

As shown in Figure 5.10, using PeaPod’s virtualization layer, both Sendmail and

Procmail can execute in the same pod, which isolates email delivery from other ser-

vices on the system. Furthermore, Sendmail and Procmail can be placed in separate

peas, which allows necessary interprocess communication mechanisms between them

while improving isolation. This pod is a common example of a privileged service that


pod mail-delivery {
    pea sendmail {
        include "stdlibs"
        include "sendmail"
        dir-default /etc read
        dir-default /var/spool/mqueue allow
        dir-default /var/spool/mail allow
        dir-default /var/run allow
        path /usr/bin/procmail read, execute
        transition /usr/bin/procmail procmail
        bind tcp/25
        outgoing allow
    }
    pea procmail {
        dir-default / allow
        outgoing allow
    }
}

Figure 5.10 – Email Delivery Configuration

has child helper applications. In this case, the Sendmail pea is configured with full

network access to receive email, but only with access to files necessary to read its

configuration and to send and deliver email. Sendmail would be denied write access

to file system areas such as /usr/bin to prevent modification to those executables,

and would only be allowed to transition a process to the Procmail pea if it is execut-

ing Procmail, the only new program its pea allows it to execute. On mail delivery,

Sendmail would then exec Procmail, which transitions the process into the Procmail

pea. The Procmail pea is configured with a more liberal access permission, namely

allowing access to the pod’s entire file system, enabling it to run other programs, such

as SpamAssassin. Although an administrator could configure programs Procmail ex-

ecutes, such as SpamAssassin, to run within their own peas, this example keeps them

all within a single pea to demonstrate a simple configuration. As a result, the Send-

mail/Procmail pod can provide full email delivery service while isolating Sendmail


such that even if Sendmail is compromised by an attack, such as a buffer overflow,

the attacker would be contained in the Sendmail pea and would not even be able to

execute programs, such as a root shell, to further compromise the system.
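
To make the effect of the rules in Figure 5.10 concrete, the following sketch models how a pea's file system rules might be evaluated. It is not PeaPod's implementation and rests on several assumptions of ours: a `path` rule takes precedence over `dir-default` rules, the closest enclosing `dir-default` applies to a subtree, and `allow` means read, write and execute. The `PeaRules` class is hypothetical, and the included "stdlibs" and "sendmail" rule files are omitted.

```python
import posixpath

ALL = frozenset({"read", "write", "execute"})  # assumed meaning of "allow"


class PeaRules:
    def __init__(self, name):
        self.name = name
        self.paths = {}         # exact path -> permission set
        self.dir_defaults = {}  # directory -> permission set for its subtree
        self.transitions = {}   # executable path -> target pea name

    def permissions(self, path):
        path = posixpath.normpath(path)
        if path in self.paths:
            return self.paths[path]
        # Walk up the tree and use the closest enclosing dir-default rule.
        current = path
        while True:
            if current in self.dir_defaults:
                return self.dir_defaults[current]
            if current == "/":
                return frozenset()  # fail-safe default: nothing is permitted
            current = posixpath.dirname(current)

    def allows(self, path, perm):
        return perm in self.permissions(path)

    def exec_target(self, path):
        """Pea a process ends up in after exec, or None if exec is denied."""
        path = posixpath.normpath(path)
        if not self.allows(path, "execute"):
            return None
        return self.transitions.get(path, self.name)


# A subset of the Sendmail pea from Figure 5.10.
sendmail = PeaRules("sendmail")
sendmail.dir_defaults["/etc"] = frozenset({"read"})
sendmail.dir_defaults["/var/spool/mqueue"] = ALL
sendmail.paths["/usr/bin/procmail"] = frozenset({"read", "execute"})
sendmail.transitions["/usr/bin/procmail"] = "procmail"
```

Under these assumptions, `sendmail.exec_target("/usr/bin/procmail")` yields the Procmail pea, while an attempt to execute a shell from the Sendmail pea is denied because no rule makes a shell visible.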

5.4.2 Web Content Delivery

For web content delivery, PeaPod’s virtualization layer can isolate different compo-

nents of web content delivery to provide a significantly higher level of security in light

of common web server attacks that may exploit CGI script vulnerabilities. Consider

isolating an Apache web server front end, a MySQL database back-end, and CGI

scripts that interface between them. Although one could run Apache and MySQL

in separate pods, because they are providing a single service, it makes sense to run

them within a single pod that is isolated from the rest of the system. However, be-

cause both Apache and MySQL are within the pod’s single namespace, if an exploit

is discovered in Apache, it could be used to perform unauthorized modifications to

the MySQL database.

To provide greater isolation among different web content delivery components,

Figure 5.11 describes a set of three peas in a pod: one for Apache, a second for

MySQL, and a third for the CGI programs. Each pea is configured to contain the

minimal set of resources needed by the processes running within the respective pea.

The Apache pea includes the apache binary, configuration files and the static HTML

content, as well as a transition permission to execute all CGI programs into the

CGI pea. The CGI pea contains the relevant CGI programs as well as access to

the MySQL daemon’s named socket, allowing interprocess communication with the

MySQL daemon to perform the relevant SQL queries. The MySQL pea contains

the mysql daemon binary, configuration files and the files that make up the relevant


pod web-delivery {
    pea apache {
        include "stdlibs"
        path /usr/sbin/apache read,execute
        path /usr/sbin/apachectl read,execute
        dir-default /var/www read,execute
        transition /var/www/cgi-bin cgi
        bind tcp/80
    }
    pea cgi {
        include "stdlibs"
        include "perl"
        dir-default /var/www/data allow
        path /tmp/mysql.sock allow
    }
    pea mysql {
        include "stdlibs"
        path /usr/sbin/mysqld read, execute
        path /tmp/mysql.sock allow
        dir-default /usr/share/mysql read
        dir-default /var/lib/mysql allow
    }
}

Figure 5.11 – Web Delivery Rules

databases. As Apache is the only program exposed to the outside world, it is the

process that is most likely to be directly exploited. However, if an attacker is able

to exploit it, the attacker is limited to a pea that is able only to read or write specific

Apache files, as well as execute specific CGI programs into a separate pea. As the

only way to access the database is through the CGI programs, the only access to the

database an attacker would have is what is allowed by said programs. By writing

the CGI programs carefully to sanitize the inputs passed to them, one can protect

these entry points. Consequently, the ability of an attacker to cause serious harm

to such a web content delivery system running with PeaPod’s virtualization layer is

significantly reduced.
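
The containment argument above can be checked mechanically against the rule set. The sketch below transcribes the file rules and the single transition of Figure 5.11 into plain Python data (the included rule files are omitted) and asks which peas can touch a resource and which peas an exploited process can reach through exec transitions. The data layout and function names are ours, chosen only for illustration.

```python
# Paths each pea may access, transcribed from Figure 5.11.
PEA_ACCESS = {
    "apache": {"/usr/sbin/apache", "/usr/sbin/apachectl", "/var/www"},
    "cgi": {"/var/www/data", "/tmp/mysql.sock"},
    "mysql": {"/usr/sbin/mysqld", "/tmp/mysql.sock",
              "/usr/share/mysql", "/var/lib/mysql"},
}

# Exec-time transitions: source pea -> peas it may transition into.
TRANSITIONS = {"apache": {"cgi"}}


def peas_with_access(resource):
    """Names of the peas whose rules mention the given resource."""
    return sorted(p for p, res in PEA_ACCESS.items() if resource in res)


def reachable_peas(start):
    """Peas a process can end up in by following exec transitions from start."""
    seen, stack = {start}, [start]
    while stack:
        for nxt in TRANSITIONS.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```

In this model the Apache pea has no rule for the MySQL socket or the database files; the only route from Apache toward the database runs through the CGI pea.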


pod desktop {
    pea firefox {
        include "firefox"
        dir-default /home/spotter/.mozilla allow
        dir-default /home/spotter/tmp allow
        dir-default /home/spotter/download allow
        transition /usr/bin/mpg123 mpg123
        transition /usr/bin/acroread acroread
    }
    pea mp3 {
        include "stdlibs"
        path /usr/bin/mpg123 read, execute
        path /dev/dsp write
        dir-default /home/spotter/music allow
    }
    pea acroread {
        include "stdlibs"
        include "acroread"
    }
}

Figure 5.12 – Desktop Application Rules

5.4.3 Desktop Computing

For desktop computing, PeaPod’s virtualization layer enables desktop computing en-

vironments to run multiple desktops from different security domains within multiple

pods. Peas can also be used within the context of such a desktop computing envi-

ronment to provide additional isolation. Many applications used on a daily basis,

such as mp3 players [64] and web browsers [67], have had bugs that turn into security

holes when maliciously created files are viewed by them. These holes allow attackers

to execute malicious code or gain access to the entire local file system. Figure 5.12

describes a set of PeaPod rules that can contain a small set of desktop applications

being used by a user with the /home/spotter home directory.


To secure an mp3 player, a pea can be created within the desktop computing pod

that restricts the mp3 player’s use of files outside of a special mp3 directory. As

most users store their music within its own subtree, this is not a serious restriction.

Most mp3 content should not be trusted, especially if one is streaming mp3s from a

remote site. By running the mp3 player within this fully restricted pea, a malicious

mp3 cannot compromise the user’s desktop session. This mp3 player pea is simply

configured with four file system permissions. First, a path-specific permission that

provides access to the mp3 player itself is required to load the application. Second, a

directory-default permission that provides access to the entire mp3 directory subtree

is required to give the process access to the mp3 file library. Third is a directory-

default permission to a directory meant to store temporary files so the mp3 player

can be used as a helper application. Finally, a path-specific permission that provides

access to the /dev/dsp audio device is required to allow the process to play audio.

To secure a web browser, a pea can be created within a desktop computing pod

that restricts the web browser’s access to system resources. Consider the Mozilla Fire-

fox web browser as an example. A Firefox pea would need to have all the files Firefox

needs to run accessible from within the pea. Mozilla dynamically loads libraries and

stores them along with its plugins within the /usr/lib/firefox directory. By pro-

viding a directory-default permission that provides access to that directory, as well

as another directory-default permission that provides access to the user’s .mozilla

directory, the Firefox web browser can run normally within this special Firefox pea.

Users also want the ability to download and save files, as well as launch viewers, such

as for postscript or mp3 files, directly from the web browser. This involves a simple

reconfiguration of Firefox to change its internal application.tmp dir variable to be

a directory that is writable within the Mozilla pea. By creating such a directory,

such as download within the user’s home directory, and providing a directory-default


permission allowing access, we allow one to explicitly save files, as well as implicitly

save them when one wants to execute a helper application. Similarly, just like Mozilla

is configured to run helper applications for certain file types, one could configure the

Mozilla pea to execute those helper applications within their respective peas. As

shown in Figure 5.12, for an mp3 player, configuring such a pea for these processes is

fairly simple. The only addition one would have to make is to provide an additional

pea transition permission to the Mozilla pea that tells PeaPod’s virtualization

layer to transition the process to a separate pea on execution of programs such as the

mpg123 mp3 player or the Acrobat Reader PDF viewer.

However, this desktop computing example is also the most complicated, and shows

the difficulty that can occur in trying to secure a complex desktop. In this example we

only attempt to secure a simplified desktop and isolate three applications, and yet it

is the largest rule set. Many desktop environments are made up of many applications

and each application would need its own set of rules. To avoid the need to create

rules for each individual application, we created Apiary, described in Chapter 7, to

specifically address desktop security.


We implemented PeaPod’s virtualization layer as a loadable kernel module in Linux

that requires no changes to the Linux kernel source code or design. We present

experimental results using our Linux prototype to quantify the overhead of using

PeaPod on various applications. Experiments were conducted on two IBM Netfinity

4500R machines, each with a 933 MHz Intel Pentium-III CPU, 512 MB RAM, 9.1 GB

SCSI HD and a 100 Mbps Ethernet connected to a 3Com Superstack II 3900 switch.

One of the machines was used as an NFS server from which directories were mounted


Name        Description

getpid      average getpid runtime
ioctl       average runtime for the FIONREAD ioctl
semaphore   IPC Semaphore variable is created and removed
fork-exit   process forks and waits for child that calls exit immediately
fork-sh     process forks and waits for child to run /bin/sh to run a program that prints "hello world" then exits
Postmark    Use Postmark Benchmark to simulate Sendmail performance
Apache      Runs Apache 1.3 under load and measures average request time
Make        Linux Kernel 2.4.21 compile with up to 10 processes active at one time
MySQL       "TPC-W like" interactions benchmark that uses Tomcat 4 and MySQL 4

Table 5.1 – Application Benchmarks

to construct the virtual file system for the PeaPod on the other client system. The

client ran Debian stable with a 2.4.21 kernel.

To measure the cost of PeaPod’s virtualization layer, we used a range of micro-

benchmarks and real application workloads and measured their performance on our

Linux PeaPod prototype and a vanilla Linux system. Table 5.1 shows the five micro-

benchmarks and four application benchmarks we used to quantify PeaPod’s virtual-

ization overhead. To obtain accurate measurements, we rebooted the system between

measurements. The files for the benchmarks were stored on the NFS server. All

of these benchmarks were performed in a chrooted environment on the NFS client

machine running Debian Unstable. Figure 5.13 shows the results of running the

benchmarks under both configurations, with the vanilla Linux configuration normal-

ized to one. Since all benchmarks measure the time to run the benchmark, a smaller

number is better for all benchmark results.
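
The normalization used in Figure 5.13 is a simple ratio of run times. As a sketch, with hypothetical timings that are not measurements from our experiments:

```python
def normalize(vanilla_seconds: float, peapod_seconds: float) -> float:
    """Run time relative to vanilla Linux, which is defined as 1.0."""
    return peapod_seconds / vanilla_seconds


def overhead_percent(vanilla_seconds: float, peapod_seconds: float) -> float:
    """Overhead of the virtualized run, as a percentage of the vanilla run."""
    return (normalize(vanilla_seconds, peapod_seconds) - 1.0) * 100.0


if __name__ == "__main__":
    # Hypothetical example: 10.0 s on vanilla Linux, 10.4 s under PeaPod.
    print(normalize(10.0, 10.4), overhead_percent(10.0, 10.4))
```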

The results in Figure 5.13 show that PeaPod’s pea virtualization layer, as

expected, imposes negligible overhead over the already existing pod virtualization.

This is because to enforce resource isolation, all PeaPod has to do is compare the re-


[Figure 5.13 plot: normalized performance (y-axis, 0.7 to 1.6) of Plain Linux and PeaPod for the getpid, ioctl, semaphore, fork-exit, fork-sh, Postmark, Apache, Make and MySQL benchmarks]

Figure 5.13 – PeaPod Virtualization Overhead

source’s pea attribute against the process trying to access it. For PIDs and IPC keys,

it is a single equality test, which is minimal extra work beyond looking up the virtu-

alized mapping in a hash table. On the other hand, for file system entries it might

have to iterate through a small set of rules. Furthermore, this only matters for file

system operations that care about permissions, such as open; for all other file system

operations, such as read and write, there is no extra pea overhead. Therefore, just

like *Pod, PeaPod incurs less than 10% overhead for most of the micro-benchmarks

and less than 4% overhead for the application workloads. For the system call micro-

benchmarks, PeaPod has to do very little extra work to restrict each process to its

pea context. Similarly, PeaPod has to do very little work to virtualize the file system.

This is apparent from both the Postmark benchmark and a set of real applications.

Postmark was configured to operate on files between 512 and 10K bytes in size, rep-

resentative of the individual files on a mail server queue, with an initial set of 20,000

files and to perform 200,000 transactions. PeaPod exhibits very little overhead in the


postmark benchmark as it does not require any additional I/O operations. In addi-

tion, PeaPod exhibited a maximum overhead of 4% in real application benchmarks.

This overhead was measured using the http_load benchmark [108] to place a parallel

fetch load on an Apache 1.3 server by simulating 30 parallel clients continuously

fetching a set of files and measuring the average request time for each HTTP session.

Similarly, we tested MySQL as part of a web commerce scenario outlined by TPC-W

with a bookstore servlet running on top of Tomcat 4 with a MySQL 4 back-end. The

PeaPod overhead for this scenario was less than 2% versus vanilla Linux.
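
The reason the pea check is cheap can be illustrated with a small model. This sketch is not the kernel module's code: the class names, the first-match rule semantics, and the `rule_checks` counter are assumptions made for illustration. It shows the single equality test added to the existing namespace lookup, and that rule evaluation happens at open time only.

```python
class PodNamespace:
    def __init__(self):
        self.pids = {}  # virtual PID -> (host PID, owning pea)

    def add(self, vpid, host_pid, pea):
        self.pids[vpid] = (host_pid, pea)

    def lookup_pid(self, vpid, caller_pea):
        entry = self.pids.get(vpid)   # hash lookup the pod already performs
        if entry is None:
            return None
        host_pid, owner_pea = entry
        if owner_pea != caller_pea:   # the one extra test that peas add
            return None
        return host_pid


class PeaFileSystem:
    def __init__(self, rules):
        self.rules = rules      # list of (path prefix, permitted operations)
        self.rule_checks = 0    # how often the rule list was consulted

    def open(self, path, mode):
        self.rule_checks += 1
        for prefix, allowed in self.rules:
            if path == prefix or path.startswith(prefix.rstrip("/") + "/"):
                if mode in allowed:
                    return path  # stand-in for a file handle
                break            # first matching rule decides
        raise PermissionError(path)

    def read(self, handle):
        # No rule evaluation here: access was decided when the file was opened.
        return f"<contents of {handle}>"
```

Opening a file once and reading it many times consults the rule list a single time, which matches the observation that read and write carry no extra pea overhead.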

5.6 Related Work

A number of other approaches have explored the idea of virtualizing the operating

system environment to provide application isolation. FreeBSD’s Jail mode [74] provides

a chroot-like environment that processes cannot break out of. However, Jail is limited;

for example, it does not allow IPC within a jail [57], and therefore many real-world

applications will not work. More recently, Linux Vserver [7] and

Containers [6], and Solaris Zones [116] offer a VM abstraction similar to PeaPod’s

pods, but require substantial in-kernel modifications to support the abstraction. Al-

though these systems share the simplicity of the Pod abstraction, they do not provide

finer-grained isolation as provided with peas.

Similarly, VMMs have been used to provide secure isolation [28,142,147]. Unlike

PeaPod’s virtualization layer, VMMs decouple processes from the underlying machine

hardware, but tie them to an instance of an operating system. As a result, VMMs

provide an entire operating system instance and namespace for each VM and lack the

ability to isolate components within an operating system. If a single process in a VM

is exploitable, malicious code can use it to access the entire set of operating system


resources. As PeaPod’s virtualization layer decouples processes from the underlying

operating system and its resulting namespace, it is natively able to limit the separate

processes of a larger system to the appropriate resources needed by them. Further-

more, VMMs require more administrative overhead due to requiring administration

of multiple full OS instances, as well as imposing higher memory overhead due to the

requirements of the underlying operating system.

Many systems have been developed to isolate untrusted applications. NSA’s Se-

curity Enhanced Linux [86], which is based upon the Flask Architecture [133], im-

plements a policy language that is used to implement models that enforce privilege

separation. The policy language is very flexible but also very complex to use. The

example security policy is over 80 pages long. There is research into creating tools to

make policy analysis tractable [21], but the language’s complexity makes it difficult

for the average end user to construct an appropriate policy.

System call interception has been used by systems such as Janus [59, 144], Sys-

trace [118], MAPbox [17], Software Wrappers [80] and Ostia [60]. These systems can

enable flexible access controls per system call, but they have been limited by the

difficulty of creating appropriate policy configurations. TRON [30], SubDomain [49]

and Alcatraz [84] also operate at the system call level but focus on limiting access to

the underlying file system. TRON allows transitions between different isolation units

but requires application modifications to use this feature, while SubDomain supports

an implicit transition on execution of a new child process. These systems provide a

model somewhat similar to the file system approach used by PeaPod peas. However,

the pea’s file system virtualization is based on a full-fledged stackable file system that

integrates fully with regular kernel security infrastructure and provides much better

performance. Similarly, PeaPod’s kernel virtualization layer provides a complete

process-isolation solution that is not just limited to file system protection.


Safer languages and runtime environments, most notably Java, have been devel-

oped to prevent common software errors and isolate applications in language-based

virtual machine environments. These solutions require applications to be rewritten or

recompiled, often with some loss in performance. Other language-based tools [25,48]

have also been developed to harden applications against common attacks, such as

buffer overflow attacks. PeaPod’s virtualization layer complements these approaches

by providing isolation of legacy applications without modification.

Chapter 6

Strata: Managing Large Numbers

of Machines

A key problem organizations face is how to efficiently provision and maintain the

large number of machines deployed throughout their organizations. This problem is

exemplified by the growing adoption and use of virtual appliances (VAs). VAs are

pre-built software bundles run inside virtual machines (VMs). For example, one VM

might be tailored to be a web server VA, while another might be tailored to be a

desktop computing VA. Since VAs are often tailored to a specific application, these

configurations can be smaller and simpler, potentially resulting in reduced resource

requirements and more secure deployments.

VAs simplify application deployment. Once an application is installed in a VA,

it is easily deployed by end users with minimal hassle because both the software

and its configuration have already been set up in the VA. A new VA can be easily

created by cloning an existing VA that already contains a base installation of the

necessary software, then modifying it by adding applications and changing the system

configuration. There is no need to set up the common components from scratch.

But while virtualization and VAs decrease the cost of hardware, they can tremen-

dously increase the human cost of administering these machines. As VAs are cloned

and modified, creating an ever-increasing sprawl of different configurations, organiza-

tions that once had a few hardware machines to manage now find themselves juggling

many more VAs with diverse system configurations and software installations. For

instance, in the past, an organization would have run services such as web, mail, da-

tabases, file storage and shell access on a single machine because these services share

many common files. By dividing these services into separate VAs, instead of a single

machine with five services, one now has five independent VAs.

This causes many management problems. First, because these VAs share a lot

of common data, they are inefficient to store, as there are multiple copies of many

common files. Although storage is cheap, the bandwidth available to write data to the

disks is not. Copying the VA into place is extremely time-consuming and negatively

impacts the performance of the other VAs running on the system.

Second, by increasing the number of systems in use, we increase the number of

systems needing security updates. Although software patches are released for security

threats, constantly deploying patches and upgrading software creates a management

nightmare as the number of VAs in the enterprise continues to rise. Many VAs may

be turned off, suspended, or not even actively managed, making patch deployment

before a security problem hits even more difficult. This problem is exacerbated by

the large number of VAs in a large organization. Although the management of any

one VA may not be difficult, the need to manage many different VAs results in a huge

scaling problem for large organizations as management overhead grows linearly with

the number of VAs needing maintenance. Instead of a single update for a machine

running five services, the administrator now must apply the update five separate

times.


Finally, as VAs are increasingly networked, the management problem only grows,

given the myriad viruses and other attacks commonplace today. Security problems

can wreak havoc on an organization’s virtual computing infrastructure. While vir-

tualization can improve security via isolation, the sprawl of machines increases the

number of hiding places for an attacker. Instead of a single actively used machine to

monitor for malicious changes, administrators now have to monitor many less used

machines. Furthermore, as VAs are designed to be dropped in place and not actively

managed, administrators might not even know what VAs have been put into use by

their end users.

Many approaches have been used to address these problems, including traditional

package management systems [4, 56], copy-on-write disks [91] and new VM storage

formats [41, 103]. Unfortunately, these approaches suffer from various drawbacks

that limit their utility and effectiveness in practice. They either incur management

overheads that grow with the number of VAs, or require all VAs to have the same

configuration, eliminating the main advantages of VAs. The fundamental problem

with previous approaches is that they are based on a monolithic file system or block

device. These file systems and block devices address their data at the block layer

and are simply used as a storage entity. They have no direct concept of what the

file system contains or how it is modified. However, managing VAs is essentially

done by making changes to the file system. As a result, any upgrade or maintenance

operation needs to be done to each VA independently, even when they all need the

same maintenance.

We have built Strata, a novel system that integrates file system unioning with

package management semantics, by introducing the Virtual Layered File System

(VLFS) and using it to solve VA management problems. Strata makes VA creation

and provisioning fast. It simplifies the regular maintenance and upgrades that must


be performed on provisioned VA instances. Finally, it improves the administrator’s

ability to detect and recover from security exploits.

Strata achieves this by providing three architectural components: layers, layer

repositories and Virtual Layered File Systems. A layer is a file hierarchy of related

files that are typically installed and upgraded as a unit. Layers are analogous to

software packages in package management systems. Like software packages, a layer

may require other layers to function correctly, just as applications often require various

system libraries to run. Strata associates dependency information with each layer that

defines relationships among layers. Unlike software packages, which must be installed

into each VA’s file system, layers can be shared directly among multiple VAs.

Layer repositories are used to store layers centrally within a virtualization infras-

tructure, enabling them to be shared among multiple VAs. Layers are updated and

maintained in the layer repository. For example, if a new version of an application

becomes available, a new layer is added to the repository. If a patch for an appli-

cation is issued, the corresponding layer is patched by creating a new layer with the

patch. Different versions of the same application may be available through different

layers in the layer repository. The layer repository is typically stored in a shared

storage infrastructure accessible by the VAs, such as a Storage Area Network (SAN).

Storing layers on the SAN does not impact VA performance because a SAN is where

a traditional VA’s monolithic file system is stored.

The VLFS is the file system for a VA. Unlike a traditional monolithic file sys-

tem, it is a collection of individual layers dynamically composed into a single view.

This is analogous to a traditional file system managed by a package manager that

is composed of many packages extracted into it. Each VA has its own VLFS, which

typically consists of a private read-write layer and a set of read-only layers shared

through the layer repository. The private read-write layer is used for all file system


modifications private to the VA that occur during runtime, such as modifying user

data. The shared read-only layers allow VAs with very different system configurations

and applications to share common layers representing software components common

across VAs. Layer changes to shared layers only need be done once in the reposi-

tory and are then automatically propagated to all VLFSs, resulting in management

overhead independent of the number of VAs.
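
A VLFS can be pictured as an ordered lookup across layers. The following is a minimal in-memory sketch, not Strata's implementation: the `Layer` and `VLFS` classes are hypothetical, files are modeled as dictionary entries, and we assume that the private read-write layer is consulted before the shared layers.

```python
class Layer:
    def __init__(self, name, files=None):
        self.name = name
        self.files = dict(files or {})  # path -> contents


class VLFS:
    def __init__(self, private, shared):
        self.private = private        # per-VA read-write layer
        self.shared = list(shared)    # read-only layers from the repository

    def read(self, path):
        # The first layer that contains the path provides the file.
        for layer in [self.private, *self.shared]:
            if path in layer.files:
                return layer.files[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # All runtime modifications land in the private layer.
        self.private.files[path] = data


# Two VAs with different software share one libc layer object.
libc = Layer("libc", {"/lib/libc.so": "libc v1"})
web = VLFS(Layer("web-private"),
           [Layer("apache", {"/usr/sbin/apache": "apache"}), libc])
mail = VLFS(Layer("mail-private"),
            [Layer("sendmail", {"/usr/sbin/sendmail": "sendmail"}), libc])
web.write("/etc/motd", "web server")
```

Both VAs see the same libc file through the shared layer, while the file written by the web VA stays in its private layer and is invisible to the mail VA.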

By dynamically building a VLFS out of discrete layers, Strata introduces file

system unioning as the package management semantic. This provides a number of

management benefits. First, Strata is able to create and provision VAs more quickly

and easily. To create a template VA, an administrator just selects the applications

and tools of interest from the layer repository. The template VA’s VLFS automati-

cally unions the selected layers together with a read-write layer and incorporates any

additional layers needed to resolve any necessary dependencies. This template VA

then becomes the single image end users in an enterprise will use when they want to

use this service. End users can instantly create provisioned instances of this template

VA because no copying or on-demand paging is needed to instantiate its file system,

as all the layers are accessed from the shared layer repository. The layer repository

allows easy identification of the applications and tools of interest, and the VLFS au-

tomatically resolves dependencies on other layers, so provisioning VAs is relatively

easy. Because VAs are just defined by their associated sets of layers, Strata also offers

a new way to build VAs simply by combining existing ones.
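
Dependency resolution during template creation amounts to computing a transitive closure over the layer dependency information. The sketch below assumes a repository modeled as a dictionary from layer name to the names it depends on; the layer names and the `resolve_layers` function are hypothetical and do not reflect Strata's actual metadata format.

```python
def resolve_layers(selected, depends_on):
    """Return the selected layers plus everything they transitively require."""
    resolved = set()
    pending = list(selected)
    while pending:
        layer = pending.pop()
        if layer in resolved:
            continue
        if layer not in depends_on:
            raise KeyError(f"layer not in repository: {layer}")
        resolved.add(layer)
        pending.extend(depends_on[layer])
    return sorted(resolved)


# Hypothetical repository contents.
REPOSITORY = {
    "apache": ["libc", "openssl"],
    "openssl": ["libc"],
    "mysql": ["libc"],
    "libc": [],
}
```

Selecting only `apache` pulls in `libc` and `openssl` automatically, so the administrator names the features of interest and the remaining layers follow from the recorded dependencies.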

Second, Strata simplifies upgrades and maintenance of provisioned VAs. If a layer

contains a bug to be fixed, the administrator creates a replacement layer with the

fix and updates the template VA. This prompts the provisioned VAs to incorporate

the layer into their VLFS’s namespace view. Traditional VAs, which are provisioned

and updated by replacing their file system [41, 103], have to be rebooted in order to


incorporate changes by making use of a new block device. Strata, however, allows

online upgrades like a traditional package management system. Unlike package man-

agement system upgrades, in which a significant amount of time is spent deleting the

existing files and copying the new files into place, upgrades in a VLFS are atomic,

preventing the file system from ever being in an inconsistent state.

Finally, this semantic allows VAs managed by Strata to easily recover from security

exploits. A VLFS distinguishes between files installed via its package manager, which are

stored in a shared read-only layer, and the changes made over time, which are stored

in the private read-write layer. If a VA is compromised and an attacker installs new

malware or modifies an existing application, these changes will be separated from the

deployed system’s initial state and isolated to the read-write layer. Such changes are

easier to identify and remove, returning the VA to a clean state.

6.1 Strata Basics

Figure 6.1 shows Strata’s three architectural components: layers, layer repositories

and VLFSs. A layer is a distinct self-contained set of files that corresponds to a specific

functionality. Strata classifies layers into three categories: software layers with self-

contained applications and system libraries, configuration layers with configuration

file changes for a specific VA, and private layers allowing each provisioned VA to be

independent. Layers can be mixed and matched, and may depend on other layers.

For example, a single application or system library is not fully independent, but

depends on the presence of other layers, such as those that provide needed shared

libraries. Strata enables layers to enumerate their dependencies on other layers. This

dependency scheme allows automatic provisioning of a complete, fully consistent file

system by selecting the main features desired within the file system.


Figure 6.1 – How Layers, Repositories, and VLFSs Fit Together

Layers are provided through layer repositories. As Figure 6.1 shows, a layer repos-

itory is a file system share containing a set of layers made available to VAs. When

an update is available, the old layer is not overwritten. Instead, a new version of

the layer is created and placed within the repository, making it available to Strata’s

users. Administrators can also remove layers from the repository, e.g., those with

known security holes, to prevent them from being used. Layer repositories are gener-

ally stored on centrally managed file systems, such as a SAN or NFS, but they can also

be provided by protocols such as FTP and HTTP and mirrored locally. Layers from

multiple layer repositories can form a VLFS as long as they are compatible with one

another. This allows layers to be provided in a distributed manner. Layers provided

by different maintainers can have the same layer names, causing a conflict. This,

however, is no different from traditional package management systems as packages


with the same package name, but different functionality, can be provided by different

package repositories.

As Figure 6.1 shows, a VLFS is a collection of layers from layer repositories that

are composed into a single file system namespace. The layers making up a particular

VLFS are defined by the VLFS’s layer definition file, which enumerates all the layers

that will be composed into a single VLFS instance. To provision a VLFS, an admin-

istrator selects software layers that provide the desired functionality and lists them

in the VLFS’s layer definition file.

Within a VLFS, layers are stacked on top of one another and composed into a single

file system view. An implication of this composition mechanism is that layers on top

can obscure files on layers below them, only allowing the contents of the file instance

contained within the higher level to be used. This means that files in the private

or configuration layers can obscure files in lower layers, such as when one makes a

change to a default version of a configuration file located within a software layer.

However, to prevent an ambiguous situation from occurring, where the file system’s

contents depend on the order of the software layers, Strata prevents software layers

that contain the same file names from being composed into a single VLFS.
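The composition rule described above can be illustrated with a minimal Python sketch. It models each layer as an in-memory mapping from path to contents, which is a simplification made only for illustration; the names `Layer`, `check_software_layers` and `compose`, and the configuration layer name, are hypothetical and are not part of Strata.

```python
from dataclasses import dataclass, field


@dataclass
class Layer:
    name: str
    kind: str  # "software", "configuration", or "private"
    files: dict = field(default_factory=dict)  # path -> contents


def check_software_layers(layers):
    """Reject any two software layers that contain the same path."""
    owner = {}
    for layer in layers:
        if layer.kind != "software":
            continue
        for path in layer.files:
            if path in owner:
                raise ValueError(
                    f"{path} is in both {owner[path]} and {layer.name}")
            owner[path] = layer.name


def compose(layers):
    """Return the unified view; `layers` is ordered from bottom to top."""
    check_software_layers(layers)
    view = {}
    for layer in layers:
        # A higher layer obscures the same path in every layer below it.
        view.update(layer.files)
    return view


mysql = Layer("mysql-server", "software", {"/etc/mysql/my.cnf": "default"})
config = Layer("sql-server-config", "configuration",
               {"/etc/mysql/my.cnf": "tuned"})
view = compose([mysql, config])
```

In this sketch the configuration layer's copy of the file wins because it is stacked above the software layer, while two software layers carrying the same path are refused outright, so the result never depends on the order of the software layers.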

6.2 Strata Usage Model

Strata’s usage model is centered around the usage of layers to quickly create VLFSs

for VAs as shown in Figure 6.1. Strata allows an administrator to compose together

layers to form template VAs. These template VAs can be used to form other template

appliances that extend their functionality, as well as to provide the VA that end users

will provision and use. Strata is designed to be used within the same setup as a

traditional VM architecture. This architecture includes a cluster of physical machines


that are used to host VM execution as well as a shared SAN that stores all the VM

disk images that can be executed. However, instead of storing disk images on the

SAN, Strata stores the layers that will be used by the VMs it manages.

6.2.1 Creating Layers and Repositories

Layers are first created and stored in layer repositories. Layer creation is similar

to the creation of packages in a traditional package management system, where one

builds the software, installs it into a private directory, and turns that directory into

a package archive, or in Strata’s case, a layer. For instance, to create a layer that

contains the MySQL SQL server, the layer maintainer would download the source

archive for MySQL, extract it, and build it normally. However, instead of installing

it into the system’s root directory, one installs it into a virtual root directory that

becomes the file system component of this new layer. The layer maintainer then

defines the layer’s metadata, including its name (mysql-server in this case) and

an appropriate version number to uniquely identify this layer. Finally, the entire

directory structure of the layer is copied into a layer repository, making the layer

available to users of that repository.
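As a rough illustration of installing into a virtual root, the sketch below builds the install command in Python. It assumes the software's Makefile honors the common `DESTDIR` convention (as Automake-generated Makefiles do); the text does not say which mechanism Strata's maintainers use, and the function names and example path are hypothetical.

```python
import subprocess


def install_command(virtual_root):
    """Command that installs into virtual_root instead of the real root."""
    # DESTDIR is a convention, not a guarantee: the Makefile must support it.
    return ["make", "install", f"DESTDIR={virtual_root}"]


def install_into_layer(source_dir, virtual_root):
    """Run the install step inside source_dir; raises if make fails."""
    subprocess.run(install_command(virtual_root), cwd=source_dir, check=True)


command = install_command("/tmp/mysql-server-layer/filesystem")
```

The resulting directory tree then plays the role of the layer's file system component.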

6.2.2 Creating Appliance Templates

Given a layer repository, an administrator can then create template VAs. Creating a

template VA involves:

1. Creating the template VA with an identifiable name and the VLFS it will use.

2. Determining what repositories are available to it.

3. Selecting a set of layers that provide the functionality desired.


For example, to create a template VA that provides a MySQL SQL server, an

administrator creates an appliance/VLFS named sql-server and selects the lay-

ers needed for a fully functional MySQL server file system, most importantly, the

mysql-server layer. Strata composes these layers together into the VLFS in a read-

only manner along with a read-write private layer, making the VLFS usable within

a VM. The administrator boots the VM and makes the appropriate configuration

changes to the template VA, storing them within the VLFS’s private layer. Finally,

the private layer belonging to the template appliance’s VLFS is frozen and becomes

the template’s configuration layer. As another example, to create an Apache web

server appliance, an administrator creates an appliance/VLFS named web-server,

and selects the layers required for an Apache web server, most importantly, the layer

containing the Apache httpd program.

Strata extends this template model by allowing multiple template VAs to be com-

posed together into a single new template. For example, an administrator can create

a new template VA/VLFS, sql+web-server, composed of the MySQL and Apache

template VAs. The resulting VLFS has the combined set of software layers from both

templates, both of their configuration layers, and a new configuration layer containing

the configuration state that integrates the two services together, for a total of three

configuration layers.
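The arithmetic of template composition can be sketched in a few lines of Python. Templates are modeled here as plain dictionaries, and the layer and configuration names other than `mysql-server` are hypothetical placeholders.

```python
def combine_templates(name, templates, integration_config):
    """Union the software layers and keep every configuration layer."""
    software = set()
    configs = []
    for template in templates:
        software |= template["software"]
        configs += template["configs"]
    # The new template adds one configuration layer of its own.
    return {"name": name, "software": software,
            "configs": configs + [integration_config]}


sql = {"name": "sql-server",
       "software": {"base", "mysql-server"},
       "configs": ["sql-server-config"]}
web = {"name": "web-server",
       "software": {"base", "apache"},
       "configs": ["web-server-config"]}
both = combine_templates("sql+web-server", [sql, web],
                         "sql+web-server-config")
```

Shared software layers such as the base appear only once in the combined set, while the configuration layers accumulate: one from each source template plus the integrating one.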

6.2.3 Provisioning and Running Appliance Instances

Given templates, VAs are efficiently and quickly provisioned and deployed by end

users by cloning the available templates. Provisioning a VA involves:

1. Creating a virtual machine container with a network adapter and a virtual

disk.


2. Using the network adapter’s MAC address as the machine’s identifier, which

identifies the VLFS created for this machine.

3. Forming the VLFS by referencing the already existing template VLFS and com-

bining the template’s read-only software and configuration layers with a read-

write private layer provided by the VM’s virtual disk.

As each VM managed by Strata does not have a physical disk off which to boot,

Strata network boots each VM. When the VM boots, its BIOS discovers a network

boot server which provides it with a boot image, including a base Strata environment.

The VM boots this base environment, which then determines which VLFS should be

mounted for the provisioned VM using the MAC address of the machine. Once the

proper VLFS is mounted, the machine transitions to using it as its root file system.

6.2.4 Updating Appliances

Strata upgrades provisioned VAs efficiently using a simple three-step process. First,

an updated layer is installed into a shared layer repository. Second, administrators

modify the template appliances under their control to incorporate the update.

Finally, all provisioned VAs based on that template will automatically incorporate the

update as well. Note that updating appliances is much simpler than updating generic

machines, as appliances are not independently managed machines. This means that

extra software that can conflict with an upgrade will not be installed into a centrally

managed appliance. Centrally managed appliance updates are limited to changes to

their configuration files and what data files they store.

Strata’s updates propagate automatically even if the VA is not currently running.

If a VA is shut down, the VA will compose whatever updates have been applied to its

templates automatically, never leaving the file system in a vulnerable state, because


it composes its file system afresh each time it boots. If it is suspended, Strata delays

the update until the VA is resumed, as updating layers is a quick task. Updating

is significantly quicker than resuming, so this does not add much to its cost.

Furthermore, VAs are upgraded atomically, as Strata adds and removes all the

changed layers in a single operation. This is not like a traditional package management

system which, when upgrading a package, first uninstalls it before reinstalling the

newer version. The traditional method leaves the file system in an inconsistent state

for a short period of time because it is possible that files needed for program execution

may not be available. For instance, when the libc package is upgraded, its contents

are first removed from the file system before being replaced. Any application that

tries to execute during the interim will fail to dynamically link because the main

library on which it depends is not present within the file system at that moment.

6.2.5 Improving Security

Strata makes it much easier to manage VAs that have had their security compromised.

By dividing a file system into a set of shared read-only layers and storing all file system

modifications inside the private read-write layer, Strata separates changes made to the

file system via layer management from regular runtime modifications. This enables

Strata to easily determine when system files have been compromised as the changes

will be readily visible in the private layer. This allows Strata to not rely on tools like

Tripwire [79] or maintain separate databases to determine if files have been modified

from their installed state. Similarly, this check can be run external to the VA, as it

just needs access to the private layer, thereby preventing an attacker from disabling

it. This reduces management load because no external database has to be

kept in sync with the file system state as it changes.


This segregation of modified file system state also enables quick recovery from a

compromised system. By simply replacing the VA’s private layer with a fresh private

layer, the compromised system is immediately fixed by returning it to its default

freshly provisioned state. However, unlike reinstalling a system from scratch, replac-

ing the private layer does not require throwing away the contents of the old private

layer. Strata enables the layer to be mounted within the file system, enabling admin-

istrators to have easy access to the files located within it to move the uncompromised

files back to their proper places.
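The external check described in this section can be approximated by comparing the private layer's directory tree against the shared layers. The Python sketch below is an illustration under simplifying assumptions: layers are ordinary directories, deletions are not modeled, and the function names and demo paths are hypothetical.

```python
import os
import tempfile


def list_files(root):
    """Return the set of file paths under root, relative to root."""
    found = set()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            found.add(os.path.relpath(full, root))
    return found


def audit_private_layer(private_root, shared_roots):
    """Split private-layer files into those obscuring a shared file
    ("modified") and those that exist only in the private layer ("added")."""
    shared = set()
    for root in shared_roots:
        shared |= list_files(root)
    private = list_files(private_root)
    return {"modified": sorted(private & shared),
            "added": sorted(private - shared)}


# Demo: one shared layer, and a private layer that shadows a system binary
# and adds an unknown file.
with tempfile.TemporaryDirectory() as tmp:
    shared_dir = os.path.join(tmp, "shared")
    private_dir = os.path.join(tmp, "private")
    for root, rel in [(shared_dir, "bin/ls"),
                      (private_dir, "bin/ls"),
                      (private_dir, "tmp/rootkit")]:
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, "w").close()
    report = audit_private_layer(private_dir, [shared_dir])
```

Because the audit only reads the layer directories, it can run outside the VA, in keeping with the point above that an attacker inside the VA cannot disable it.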

6.3 Virtual Layered File System

Strata introduces the concept of a virtual layered file system in place of traditional

monolithic file systems. Strata’s VLFS allows file systems to be created by composing

layers together into a single file system namespace view. Strata allows these layers

to be shared by multiple VLFSs in a read-only manner or to remain read-write and

private to a single VLFS.

Every VLFS is defined by a layer definition file (LDF), which specifies what

software layers should be composed together. An LDF is a simple text file that

lists the layers and their respective repositories. The LDF’s layer list syntax is

repository/layer version and can be preceded by an optional modifier command.

When an administrator wants to add or remove software from the file system, instead

of modifying the file system directly, they modify the LDF by adding or removing

the appropriate layers.

Figure 6.2 contains an example LDF for a MySQL SQL server template appliance.

The LDF lists each individual layer included in the VLFS along with its corresponding

repository. Each layer also has a number indicating which version will be composed


into the file system. If an updated layer is made available, the LDF is updated to

include the new layer version instead of the old one. If the administrator of the VLFS

does not want to update the layer, they can hold a layer at a specific version, with

the = syntax element. This is demonstrated by the mailx layer in Figure 6.2, which

is being held at the version listed in the LDF.

Strata allows an administrator to explicitly select only the few layers corresponding

to the exact functionality desired within the file system. Other layers needed in the file

system are implicitly selected by the layers’ dependencies as described in Section 6.3.2.

Figure 6.2 shows how Strata distinguishes between explicitly and implicitly selected

layers. Explicitly selected layers are listed first and separated from the implicitly

selected layers by a blank line. In this case, the MySQL server has only one explicit

layer, mysql-server, but has 21 implicitly selected layers. These include utilities such

as Perl and TCP Wrappers (tcpd), as well as libraries such as OpenSSL (libssl). It also

includes a layer providing a shared base common to all VLFSs. Strata distinguishes

explicit layers from implicit layers to allow future reconfigurations to remove one

implicit layer in favor of another if dependencies need to change.

When an end user provisions an appliance by cloning a template, an LDF is

created for the provisioned VA. Figure 6.3 shows an example introducing another

syntax element, @, that instructs Strata to reference another VLFS’s LDF as the

basis for this VLFS. This lets Strata clone the referenced VLFS by including its

layers within the new VLFS. In this case, because the user wants only to deploy

the SQL server template, this VLFS LDF only has to include the single @ line. In

general, a VLFS can reference more than one VLFS template, assuming that layer

dependencies allow all the layers to coexist.


main/mysql-server 5.0.51a-3

main/base 1
main/libdb4.2 4.2.52-18
main/apt-utils 0.5.28.6
main/liblocale-gettext-perl 1.01-17
main/libtext-charwidth-perl 0.04-1
main/libtext-iconv-perl 1.2-3
main/libtext-wrapi18n-perl 0.06-1
main/debconf 1.4.30.13
main/tcpd 7.6-8
main/libgdbm3 1.8.3-2
main/perl 5.8.4-8
main/psmisc 21.5-1
main/libssl0.9.7 0.9.7e-3
main/liblockfile1 1.06
main/adduser 3.63
main/libreadline4 4.3-11
main/libnet-daemon-perl 0.38-1
main/libplrpc-perl 0.2017-1
main/libdbi-perl 1.46-6
main/ssmtp 2.61-2
=main/mailx 3a8.1.2-0.20040524cvs-4

Figure 6.2 – Layer Definition for MySQL Server

@main/sql-server

Figure 6.3 – Layer Definition for Provisioned Appliance
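The LDF syntax described in this section is simple enough to parse in a few lines. The Python sketch below is a hedged illustration, not Strata's parser: it assumes one entry per line, a single space between the layer reference and its version, the `=` and `@` modifiers as the first character, and the first blank line as the boundary between explicit and implicit layers. The function and field names are hypothetical.

```python
def parse_ldf(text):
    """Parse LDF text into a list of entry dictionaries."""
    entries = []
    explicit = True
    seen_entry = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            # The first blank line after an entry ends the explicit section.
            if seen_entry:
                explicit = False
            continue
        modifier = ""
        if line[0] in "=@":
            modifier, line = line[0], line[1:]
        ref, _, version = line.partition(" ")
        repository, _, name = ref.partition("/")
        entries.append({
            "repository": repository,
            "name": name,
            "version": version.strip() or None,
            "held": modifier == "=",       # layer pinned at this version
            "template": modifier == "@",   # reference to another VLFS's LDF
            "explicit": explicit,
        })
        seen_entry = True
    return entries


SAMPLE = (
    "main/mysql-server 5.0.51a-3\n"
    "\n"
    "main/base 1\n"
    "main/perl 5.8.4-8\n"
    "=main/mailx 3a8.1.2-0.20040524cvs-4\n"
)
entries = parse_ldf(SAMPLE)
clone = parse_ldf("@main/sql-server\n")
```

A provisioned appliance's LDF, such as the one in Figure 6.3, parses to a single entry flagged as a template reference with no version.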

6.3.1 Layers

Strata’s layers are composed of three components: metadata files, the layer’s file

system and configuration scripts. The metadata files define the information that

describes the layer. This includes its name, version and dependency information.

This information is important to ensure that a VLFS is composed correctly. The

metadata file contains all the metadata that is specified for the layer. Figure 6.4

shows an example metadata file. Figure 6.5 shows the full metadata syntax. The

metadata file has a single field per line with two elements, the field type and the field

contents. In general, the metadata file’s syntax is Field Type: value, where value


can be either a single entry or a comma-separated list of values.
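A minimal Python sketch of reading such a metadata file follows. It assumes that a field continued onto a second line (as `Depends` is in Figure 6.4) marks the continuation with leading whitespace, as Debian control files do; the text does not state this, so treat it as an assumption. The sample lists only the two dependencies that are visible in Figure 6.4, and the function names are hypothetical.

```python
def parse_metadata(text):
    """Parse 'Field Type: value' lines into a dictionary of strings."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[0].isspace() and current is not None:
            # Continuation line: append to the field being read.
            fields[current] += " " + line.strip()
        else:
            name, _, value = line.partition(":")
            current = name.strip()
            fields[current] = value.strip()
    return fields


def split_list(value):
    """Split a comma-separated field value into its entries."""
    return [item.strip() for item in value.split(",") if item.strip()]


SAMPLE = (
    "Layer: mysql-server\n"
    "Version: 5.0.51a-3\n"
    "Depends: perl (>= 5.6),\n"
    " tcpd (>= 7.6-4)\n"
)
fields = parse_metadata(SAMPLE)
depends = split_list(fields["Depends"])
```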

The layer’s file system is a self-contained set of files providing a specific function-

ality. The files are the individual items in the layer that are composed into a larger

VLFS. There are no restrictions on the types of files that can be included. They can

be regular files, symbolic links, hard links or device nodes. Similarly, each directory

entry can be given whatever permissions are appropriate. A layer can be seen as a

directory stored on the shared file system that contains the same file and directory

structure that would be created if the individual items were installed into a traditional

file system. On a traditional UNIX system, the directory structure would typically

contain directories such as /usr, /bin and /etc. Symbolic links work as expected

between layers since they work on path names, but one limitation is that hard links

cannot exist between layers.

The layer’s configuration scripts are run when a layer is added to or removed from a

VLFS to allow proper integration of the layer within the VLFS. Although many layers

are just a collection of files, other layers need to be integrated into the system as a

whole. For example, a layer that provides MP3 file playing capability should register

itself with the system’s MIME database to allow programs contained within the layer

to be launched automatically when a user wants to play an MP3 file. Similarly, if the

layer were removed, it should remove the programs contained within itself from the

MIME database.

Strata supports four types of configuration scripts: pre-remove, post-remove, pre-

install and post-install. If they exist in a layer, the appropriate script is run before

or after a layer is added or removed. For example, a pre-remove script can be used to

shut down a daemon before it is actually removed, while a post-remove script can be

used to clean up file system modifications in the private layer. Similarly, a pre-install

script can ensure that the file system is as the layer expects, while the post-install


script can start daemons included in the layer. The configuration scripts can be

written in any scripting language. The layer must include the proper dependencies

to ensure that the scripting infrastructure is composed into the file system in order

to allow the scripts to run.

Layer: mysql-server
Version: 5.0.51a-3
Depends: ..., perl (>= 5.6),
         tcpd (>= 7.6-4),...

Figure 6.4 – Metadata for MySQL Server Layer

Layer: Layer Name
Version: Version of Layer Unit
Conflicts: layer1 (opt. constraint), ...
Depends: layer1 (...),
         layer2 (...) | layer3, ...
Pre-Depends: layer1 (...), ...
Provides: virtual_layer, ...

Figure 6.5 – Metadata Specification

Layers are stored on disk as a directory tree named by the layer’s name and

version. For instance, version 5.0.51a of the MySQL server, with a Strata layer

version of 3, would be stored under the directory mysql-server 5.0.51a-3. Within

this directory, Strata defines a metadata file, a filesystem directory and a scripts

directory corresponding to the layer’s three components.
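The on-disk layout can be sketched in Python as follows. The separator between the layer name and version is printed as a space in the text above; the sketch exposes it as a parameter rather than relying on it. The function name and the minimal metadata contents are hypothetical.

```python
import tempfile
from pathlib import Path


def create_layer_skeleton(repo, name, version, separator=" "):
    """Create an empty layer directory with its three components."""
    layer = Path(repo) / f"{name}{separator}{version}"
    (layer / "filesystem").mkdir(parents=True)  # the layer's file tree
    (layer / "scripts").mkdir()                 # configuration scripts
    (layer / "metadata").write_text(
        f"Layer: {name}\nVersion: {version}\n")
    return layer


# Demo in a throwaway directory standing in for a layer repository.
with tempfile.TemporaryDirectory() as repo:
    layer = create_layer_skeleton(repo, "mysql-server", "5.0.51a-3")
    dirname = layer.name
    layout = sorted(p.name for p in layer.iterdir())
```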

6.3.2 Dependencies

A key Strata metadata element is enumeration of the dependencies that exist between

layers. Strata’s dependency scheme is heavily influenced by the dependency scheme

in Linux distributions such as Debian and Red Hat. In Strata, every layer composed

into Strata’s VLFS is termed a layer unit. Every layer unit is defined by its name

and version. Two layer units that have the same name but different layer versions are


different units of the same layer. A layer refers to the set of layer units of a particular

name. Every layer unit in Strata has a set of dependency constraints placed within

its metadata. There are four types of dependency constraints:

• dependency

• pre-dependency

• conflict

• provide

Dependency and Pre-Dependency: Dependency and pre-dependency con-

straints are similar in that they require another layer unit to be integrated at the

same time as the layer unit that specifies them. They differ only in the order in which the

layers’ configuration scripts are executed to integrate them into the VLFS. A regular

dependency does not dictate order of integration. A pre-dependency dictates that the

dependency has to be integrated before the dependent layer. Figure 6.4 shows that the

MySQL layer depends on TCP Wrappers (tcpd) because it dynamically links against

the shared library libwrap.so.0 provided by TCP Wrappers. MySQL cannot run

without this shared library, so the layer units that contain MySQL must depend on a

layer unit containing an appropriate version of the shared library. These constraints

can also be versioned to further restrict which layer units satisfy the constraint. For

example, shared libraries can add functionality that breaks their application binary

interface (ABI), breaking in turn any applications that depend on that ABI. Since

MySQL is compiled against version 0.7.6 of the libwrap library, the dependency con-

straint is versioned to ensure that a compatible version of the library is integrated at

the same time.


Conflict: Conflict constraints indicate that layer units cannot be integrated into

the same VLFS. This generally occurs because the layer units depend on exclusive

access to the same operating system resource. This can be a TCP port in the case

of an Internet daemon, or a file pathname that both layer units contain, in which case

they would obscure each other. For this reason, two layer units of the same layer

are by definition in conflict because they will contain some of the same files.

An example of this constraint occurs when the ABI of a shared library changes

without any source code changes, generally due to an ABI change in the tool chain

that builds the shared library. Because the ABI has changed, the new version can

no longer satisfy any of the previous dependencies. But because nothing else has

changed, the file on disk will usually not be renamed either. A new layer must then

be created with a different name, ensuring that the library with the new ABI is

never used to satisfy an old dependency on the original layer. Because the new layer

contains the same files as the old layer, it must conflict with the older layer to ensure

that they are not integrated into the same file system.

Provide: Provide dependency constraints introduce virtual layers. A regular

layer provides a specific set of files, but a virtual layer indicates that a layer provides

a particular piece of general functionality. Layer units that depend on a certain piece

of general functionality can depend on a specific virtual layer name in the normal

manner, while layer units that provide that functionality will explicitly specify that

they do. For example, layer units that provide webmail or content management soft-

ware depend on the presence of a web server, but which one is not important. Instead

of depending on a particular web server, they depend on the virtual layer name httpd.

Similarly, layer units containing a web server, such as Apache or Boa, are defined to

provide the httpd virtual layer name and therefore satisfy those dependencies. Unlike

regular layer units, virtual layers are not versioned.


6.3.2.1 Dependency Example

Figure 6.2 shows how dependencies can affect a VLFS in practice. This VLFS has only

one explicit layer, mysql-server, but 21 implicitly selected layers. The mysql-server

layer itself has a number of direct dependencies, including Perl, TCP Wrappers, and

the mailx program. These dependencies in turn depend on the Berkeley DB library

and the GNU dbm library, among others. Using its dependency mechanism, Strata

is able to automatically resolve all the other layers needed to create a complete file

system by specifying just a single layer.

Returning to Figure 6.4, this example defines a subset of the layers that the mysql-

server layer requires to be composed into the same VLFS to allow MySQL to run

correctly. More generally, Figure 6.5 shows the complete syntax for the dependency

metadata. Provides is the simplest, with only a comma-separated list of virtual layer

names. Conflicts adds an optional version constraint to each conflicted layer to limit

the layer units that are actually in conflict. Depends and Pre-Depends add a Boolean

OR of multiple layers in their dependency constraints to allow multiple layers to satisfy

the dependency.
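The dependency syntax of Figure 6.5 can be parsed into a list of clauses, each clause being a list of alternatives. The Python sketch below is illustrative only: the text shows just the `>=` operator, so accepting any run of `<`, `>` and `=` characters as an operator is an assumption, and the function names are hypothetical.

```python
import re

# name, optionally followed by "(operator version)"
TERM = re.compile(
    r"^(?P<name>[^\s()|,]+)\s*"
    r"(?:\(\s*(?P<op>[<>=]+)\s*(?P<version>[^\s()]+)\s*\))?$"
)


def parse_term(text):
    """Parse one alternative, e.g. 'perl (>= 5.6)' or 'layer3'."""
    match = TERM.match(text.strip())
    if match is None:
        raise ValueError(f"cannot parse dependency term: {text!r}")
    return (match["name"], match["op"], match["version"])


def parse_depends(value):
    """Return a list of clauses; each clause lists acceptable alternatives."""
    clauses = []
    for clause in value.split(","):
        if clause.strip():
            clauses.append([parse_term(alt) for alt in clause.split("|")])
    return clauses


versioned = parse_depends("perl (>= 5.6), tcpd (>= 7.6-4)")
alternatives = parse_depends("layer1, layer2 | layer3")
```

Every clause must be satisfied, but any one alternative within a clause is enough to satisfy it.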

6.3.2.2 Resolving Dependencies

To allow an administrator to select only the layers explicitly desired within the VLFS,

Strata automatically resolves dependencies to determine which other layers must be

included implicitly. To allow dependency resolution, Strata first provides a database

of all the available layer units’ locations and metadata. The collection of layer units

can be viewed as three sets: the set of layer units themselves, the set of dependency

relations for each individual layer unit, and the set of conflict relations (C) that define

which layer units cannot be integrated into the same file system. This collection can


be viewed as a directed dependency graph connecting layer units to the layer units

on which they depend.

A layer unit can be integrated into the VLFS when two principles hold. First,

there must be a set of layer units (I) that fulfills total closure of all the dependencies,

that is, every layer unit in the set has every dependency filled. Second, (I × I) ∩ C = ∅

must hold, meaning that none of the layer units in I can conflict with each other.

Determining when these principles hold is a problem that has been shown to be

polynomial time reducible to 3-SAT [47, 139]. Because 3-SAT is NP-complete, this

could be very difficult to solve naively, but an optimized Davis-Putnam SAT solver [52]

can be used to solve it efficiently [47].
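The two principles can be checked directly for a given candidate set I. The Python sketch below only verifies a candidate; finding one is the hard part that a SAT solver handles. Layer units are reduced to plain strings, dependencies are given as clauses of alternatives, and the conflict between the two web servers is an illustrative assumption. All names are hypothetical.

```python
from itertools import product


def dependencies_closed(selected, depends):
    """Every clause of every selected unit is met by some selected unit."""
    return all(
        any(alt in selected for alt in clause)
        for unit in selected
        for clause in depends.get(unit, [])
    )


def conflict_free(selected, conflicts):
    """True when (I x I) intersected with C is empty."""
    return not any(pair in conflicts
                   for pair in product(selected, repeat=2))


def can_integrate(selected, depends, conflicts):
    return (dependencies_closed(selected, depends)
            and conflict_free(selected, conflicts))


depends = {
    "mysql-server": [["perl"], ["tcpd"]],
    "webmail": [["apache", "boa"]],   # either web server satisfies this
}
conflicts = {("apache", "boa")}       # assumed: both want the same TCP port

ok = can_integrate({"mysql-server", "perl", "tcpd"}, depends, conflicts)
missing = can_integrate({"mysql-server", "perl"}, depends, conflicts)
clash = can_integrate({"webmail", "apache", "boa"}, depends, conflicts)
```

Because the product enumerates ordered pairs in both directions, a conflict recorded in C in either order is detected.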

Even when a layer unit can be integrated into the VLFS, however, there will often

be many sets of implicitly selected layer units that allow this. Strata therefore has to

evaluate which of those sets is the best. Linux distributions already face this problem

and tools have been developed to address it, such as Apt [36] and Smart [98]. Strata

leverages Smart and adopts the same metadata database format that Debian uses for

packages for its own layers, as Smart already knows how to parse it. When Smart

is used with a regular Linux distribution, administrators request that it install or

remove packages and Smart determines whether the operation can succeed and what

is the best set of packages to add or remove to achieve that goal. In Strata, when an

administrator requests that a layer be added to or removed from a template appliance,

Smart also evaluates if the operation can succeed and what is the best set of layers to

add or remove. Instead of acting directly on the contents of the file system, however,

Strata only has to update the template’s LDF with the set of layers to be composed

into the file system.


6.3.3 Layer Creation

Strata allows layers to be created in two ways. First, the .deb packages used by

Debian-derived distributions and the .rpm packages used by Red Hat-derived distributions

can be directly converted into layers. Strata converts packages into layers in two

steps. First, the relevant metadata from the package is extracted, including its name

and version. Second, the package’s file contents are extracted into a private directory

that will be the layer’s file system components. When using converted packages,

Strata leverages the underlying distribution’s tools to run the configuration scripts

belonging to the newly created layers correctly. Instead of using the distribution’s

tools to unpack the software package, Strata composes the layers together and uses

the distribution’s tools as though the packages have already been unpacked. Although

Strata is able to convert packages from different Linux distributions, it cannot mix

and match them because they are generally ABI incompatible with one another.

More commonly, Strata leverages existing packaging methodologies to simplify the

creation of layers from scratch. In a traditional system, when administrators install

a set of files, they copy the files into the correct places in the file system using the

root of the file system tree as their starting point. For instance, an administrator

might run make install to install a piece of software compiled on the local machine.

In Strata, layer creation is a three-step process. First, instead of copying the files

into the root of the local file system, the layer creator installs the files into their own

specific directory tree. That is, they make a blank directory to hold a new file system

tree that is created by having the make install copy the files into a tree rooted at

that directory, instead of the actual file system root.

Second, the layer maintainer extracts programs that integrate the files into the

underlying file system and creates scripts that run when the layer is added to and


removed from the file system. Examples of this include integration with GNOME’s

GConf configuration system, creation of encryption keys, or creation of new local

users and groups for new services that are added. This leverages skills that package

maintainers in a traditional package management world already have.

Finally, the layer maintainer needs to set up the metadata correctly. Some ele-

ments of the metadata, such as the name of the layer and its version, are simple to

set, but dependency information can be much harder. But because package man-

agement tools have already had to address this issue, Strata is able to leverage the

tools they have built. For example, package management systems have created tools

that infer dependencies from the shared libraries that an executable dynamically

links against [117]. Instead of requiring the layer maintainer to enumerate each shared

library dependency, we can programmatically determine which shared libraries are

required and populate the dependency fields based on those versions of the library

currently installed on the system where the layer is being created.
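One way to approximate this inference is to read the `NEEDED` entries that `readelf -d` prints for a dynamically linked executable and map each shared library name to the layer that provides it. The Python sketch below parses such output from a string; the mapping table from library name to layer and version is a hypothetical stand-in for whatever index the tooling maintains, and this is not presented as the method used by the cited tools.

```python
import re

NEEDED = re.compile(r"\(NEEDED\)\s+Shared library: \[(?P<soname>[^\]]+)\]")


def needed_sonames(readelf_output):
    """Extract shared library names from `readelf -d` output."""
    return [m["soname"] for m in NEEDED.finditer(readelf_output)]


def infer_depends(readelf_output, soname_to_layer):
    """Build a Depends value from the libraries an executable needs.

    Libraries missing from soname_to_layer are skipped in this sketch.
    """
    clauses = []
    for soname in needed_sonames(readelf_output):
        if soname in soname_to_layer:
            layer, version = soname_to_layer[soname]
            clause = f"{layer} (>= {version})"
            if clause not in clauses:
                clauses.append(clause)
    return ", ".join(clauses)


SAMPLE_OUTPUT = (
    " 0x0000000000000001 (NEEDED)             Shared library: [libwrap.so.0]\n"
    " 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]\n"
)
# Hypothetical index: installed library -> (providing layer, its version).
INDEX = {"libwrap.so.0": ("tcpd", "7.6-4")}
depends_field = infer_depends(SAMPLE_OUTPUT, INDEX)
```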

6.3.4 Layer Repositories

Strata provides local and remote layer repositories. Local layer repositories are pro-

vided by locally accessible file system shares made available by a SAN. They contain

layer units to be composed into the VLFS. This is similar to a regular virtualization

infrastructure in which all the virtual machines’ disks are stored on a shared SAN.

Each layer unit is stored as its own directory; a local layer repository contains a set of

directories, each of which corresponds to a layer unit. The local layer repository’s con-

tents are enumerated in a database file providing a flat representation of the metadata

of all the layer units present in the repository. The database file is used for making

a list of what layers can be installed and their dependency information. By storing


the shared layer repository on the SAN, Strata lets layers be shared securely among

different users’ appliances. Even if the machine hosting the VLFS is compromised,

the read-only layers will stay secure, as the SAN will enforce the read-only semantic

independently of the VLFS.

Remote layer repositories are similar to local layer repositories, but are not acces-

sible as file system shares. Instead, they are provided over the Internet, by protocols

such as FTP and HTTP, and can be mirrored into a local layer repository. Instead of

mirroring the entire remote repository, Strata allows on-demand mirroring, where all

the layers provided by the remote repository are accessible to the VAs, but must be

mirrored to the local mirror before they can be composed into a VLFS. This allows

administrators to store only the needed layers while maintaining access to all the

layers and updates that the repository provides. Administrators can also filter which

layers should be available to prevent end users from using layers that violate admin-

istration policy. In general, an administrator will use these remote layer repositories

to provide the majority of layers, much as administrators use a publicly managed

package repository from a regular Linux distribution.

Layer repositories let Strata operate within an enterprise environment by handling

three distinct yet related issues. First, Strata has to ensure that not all end users

have access to every layer available within the enterprise. For instance, administra-

tors may want to restrict certain layers to certain end users for licensing, security or

other corporate policy reasons. Second, as enterprises get larger, they gain levels of

administration. Strata must support the creation of an enterprise-wide policy while

also enabling small groups within the enterprise to provide more localized adminis-

tration. Third, larger enterprises supporting multiple operating systems cannot rely

on a single repository of layers because of inherent incompatibilities among operating

systems.


By allowing a VLFS to use multiple repositories, Strata solves these three prob-

lems. First, multiple repositories let administrators compartmentalize layers accord-

ing to the needs of their end users. By providing end users with access only to

needed repositories, organizations prevent their end users from using the other lay-

ers. Second, by allowing sub-organizations to set up their own repositories, Strata

lets a sub-organization’s administrator provide the layers that end users need with-

out requiring intervention by administrators of global repositories. Finally, multiple

repositories allow Strata to support multiple operating systems, as each distinct op-

erating system has its own set of layer repositories. Strata supports multiple layer

repositories by providing a directory of layer repositories that can contain multiple

subdirectories, each of which serves as a mount point for a layer repository file system

share, or as a location to store the layers themselves locally. This enables adminis-

trators to use regular file system share controls to determine which layer repositories

users can access.
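
The directory-of-repositories arrangement can be sketched in a few lines of Python. This is only a model under stated assumptions: each subdirectory of the root is treated as one repository and each subdirectory inside it as one layer unit, and the function name `available_layers` is invented for illustration.

```python
from pathlib import Path


def available_layers(repos_root):
    """List (repository, layer) pairs visible under the repository directory.

    Each subdirectory of `repos_root` is treated as one repository (a mount
    point for a file system share, or a local store), and each subdirectory
    inside it as one layer unit. Repositories the caller cannot read are
    skipped, mirroring access control by file system share permissions.
    """
    found = []
    for repo in sorted(Path(repos_root).iterdir()):
        if not repo.is_dir():
            continue
        try:
            layers = sorted(p.name for p in repo.iterdir() if p.is_dir())
        except PermissionError:
            continue  # repository not accessible to this user
        found.extend((repo.name, layer) for layer in layers)
    return found
```

Because visibility is decided by ordinary directory permissions, no Strata-specific access control list is needed in this model.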

6.3.5 VLFS Composition

To create a VLFS, Strata has to solve a number of file system-related problems. First,

Strata has to support the ability to combine numerous distinct file system layers into a

single static view. This is equivalent to installing software into a shared read-only file

system. Second, because users expect to treat the VLFS as a normal file system, for

instance, by creating and modifying files, Strata has to let VLFSs be fully modifiable.

Similarly, users must also be able to delete files that exist on the read-only layer.

By basing the VLFS on top of unioning file systems [102, 150], Strata solves all

these problems. Unioning file systems join multiple layers into a single namespace.

Unioning file systems have been extended to apply attributes such as read-only and


read-write to their layers. The VLFS leverages this property to force shared layers

to be read-only, while the private layer remains read-write. If a file from a shared

read-only layer is modified, it is copied-on-write (COW) to the private read-write

layer before it is modified. For example, LiveCDs use this functionality to provide

a modifiable file system on top of the read-only file system provided by the CD.

Finally, unioning file systems use whiteouts to obscure files located on lower layers.

For example, if a file located on a read-only layer is deleted, a whiteout file will be

created on the private read-write layer. This file is interpreted specially by the file

system and is not revealed to the user while also preventing the user from seeing files

with the same name.
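
The behavior described above can be modeled with a small Python sketch. It is a toy in-memory model, not Strata's kernel implementation: layers are plain dictionaries, and the `Union` class and `WHITEOUT` marker are names chosen for illustration.

```python
WHITEOUT = object()  # marker stored in the private layer for deleted names


class Union:
    """Toy union of read-only shared layers under one private read-write layer."""

    def __init__(self, shared_layers):
        self.shared = shared_layers  # list of dicts, highest priority first
        self.private = {}            # path -> bytes, or WHITEOUT

    def read(self, path):
        if path in self.private:
            if self.private[path] is WHITEOUT:
                raise FileNotFoundError(path)  # hidden by a whiteout
            return self.private[path]
        for layer in self.shared:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # New or fully rewritten files are stored in the private layer.
        self.private[path] = data

    def append(self, path, data):
        # Copy-on-write: copy the file up to the private layer, then modify
        # the copy. The shared layer is never touched.
        self.private[path] = self.read(path) + data

    def unlink(self, path):
        self.read(path)  # raises FileNotFoundError if not visible
        if any(path in layer for layer in self.shared):
            self.private[path] = WHITEOUT  # obscure the lower-layer file
        else:
            del self.private[path]
```

In this model the shared dictionaries are never mutated, which is the property the read-only attribute enforces in a real unioning file system.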

However, Strata has to solve two additional problems. First, Strata must main-

tain the usage semantic that users can recover deleted system files by reinstalling or

upgrading the layer that contains them. For example, in a traditional monolithic file

system managed by a package management system, reinstalling a package will replace

any files that might have been deleted. However, if the VLFS only used a traditional

union file system, the whiteouts stored in the private layer would persist and continue

to obscure the file even if the shared layer was replaced.

To solve this problem, Strata provides a VLFS with additional writeable layers

associated with each read-only shared layer. Instead of containing file data, as does

the topmost private writeable layer, these layers just contain whiteout marks that will

obscure files contained within their associated read-only layer. The user can delete a

file located in a shared read-only layer, but the deletion only persists for the lifetime

of that particular instance of the layer. When a layer is replaced during an upgrade or

reinstall, a new empty whiteout layer will be associated with the replacement, thereby

removing any preexisting whiteouts. In a similar way, Strata handles the case where

a file belonging to a shared read-only layer is modified and therefore copied to the


VLFS’s private read-write layer. Strata provides a revert command that lets the

owner of a file that has been modified revert the file to its original pristine state.

While a regular VLFS unlink operation would have removed the modified file from

the private layer and created a whiteout mark to obscure the original file, revert

only removes the copy in the private layer, thereby revealing the original below it.
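
A minimal Python sketch of these semantics follows. It assumes an in-memory model in which every shared layer carries its own set of whiteout marks; the class and method names (`Layer`, `VLFS`, `replace_layer`) are illustrative and are not taken from Strata's implementation.

```python
class Layer:
    """A shared read-only layer plus its own associated whiteout marks."""

    def __init__(self, name, files):
        self.name = name
        self.files = dict(files)
        self.whiteouts = set()


class VLFS:
    def __init__(self, layers):
        self.layers = list(layers)  # highest priority first
        self.private = {}           # topmost private read-write layer

    def read(self, path):
        if path in self.private:
            return self.private[path]
        for layer in self.layers:
            if path in layer.files and path not in layer.whiteouts:
                return layer.files[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.private[path] = data   # modified files live in the private layer

    def unlink(self, path):
        self.read(path)             # raises if the file is not visible
        self.private.pop(path, None)
        for layer in self.layers:
            if path in layer.files:
                layer.whiteouts.add(path)  # whiteout stored with that layer

    def revert(self, path):
        # Remove only the private copy; no whiteout is created, so the
        # pristine file in the shared layer becomes visible again.
        if path not in self.private:
            raise FileNotFoundError(path)
        del self.private[path]

    def replace_layer(self, name, files):
        # The replacement layer starts with an empty whiteout set, so
        # earlier deletions no longer obscure its files.
        self.layers = [Layer(name, files) if l.name == name else l
                       for l in self.layers]
```

Keeping the whiteouts with the layer rather than in the private layer is what lets a reinstall or upgrade restore deleted files in this model.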

Second, Strata supports adding and removing layers dynamically without taking

the file system offline. This is equivalent to installing, removing or upgrading a

software package while a monolithic file system is online. While some upgrades,

specifically of the kernel, will require the VA to be rebooted, most should be able to

occur without taking the VA offline. However, if a layer is removed from a union,

its data is effectively removed as well because unions operate only on file system

namespaces and not on the data the underlying files contain. If an administrator

wants to remove a layer from the VLFS, they must take the VA offline, because

layers cannot be removed while in use.

To solve this problem, Strata emulates a traditional monolithic file system. When

an administrator deletes a package containing files in use, the processes that are

currently using those files will continue to work. This occurs by virtue of unlink’s

semantic of first removing a file from the file system’s namespace, and only removing

its data after the file is no longer in use. This lets processes continue to run because

the files they need will not be removed until after the process terminates. This

creates a semantic in which a currently running program can be using versions of files

no longer available to other programs.

Existing package managers use this semantic to allow a system to be upgraded

online, and it is widely understood. Strata applies the same semantic to layers. When

a layer is removed from a VLFS, Strata marks the layer as unlinked, removing it from

the file system namespace. Although this layer is no longer part of the file system


namespace and thus cannot be used by any operations such as open that work on the

namespace, it does remain part of the VLFS, enabling data operations such as read

and write to continue working correctly for previously opened files.
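
The distinction between namespace operations and data operations can be illustrated with a short Python model. It is a sketch under simplifying assumptions: an open file holds a direct reference to its layer, and the `unlinked` flag stands in for whatever bookkeeping the real VLFS uses.

```python
class Layer:
    def __init__(self, name, files):
        self.name = name
        self.files = dict(files)
        self.unlinked = False  # True once removed from the namespace


class OpenFile:
    """Handle that refers to the layer directly, bypassing the namespace."""

    def __init__(self, layer, path):
        self.layer = layer
        self.path = path

    def read(self):
        return self.layer.files[self.path]


class VLFS:
    def __init__(self, layers):
        self.layers = list(layers)

    def open(self, path):
        # Namespace operation: unlinked layers are skipped.
        for layer in self.layers:
            if not layer.unlinked and path in layer.files:
                return OpenFile(layer, path)
        raise FileNotFoundError(path)

    def remove_layer(self, name):
        # The layer stays in the VLFS so existing handles keep working.
        for layer in self.layers:
            if layer.name == name:
                layer.unlinked = True
```

A process that opened a file before `remove_layer` can still read it, while a later `open` of the same path fails, matching the behavior of unlink on a monolithic file system.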

6.4 Improving Appliance Security

In today’s world, machines are continually attacked and administrators work hard

to deflect the attacks. But even with an administrator’s best efforts, attacks still

succeed from time to time. A main problem in dealing with possibly compromised

machines is detecting whether they have indeed been compromised. Just because an

attack is detected does not mean that the attacker was able to change the machine in

a persistent way. Many administrators employ additional tools such as Tripwire [79]

to aid in this effort, but this creates an added burden. There are extra tools and

databases to be maintained and possibly neglected. This leaves the administrators

not always knowing what, or if, the attacker modified. A clean reinstall is often the

best option, but this causes two problems: downtime and lost data. Although an

administrator can back up the system before it is reinstalled, this further adds to the

time lost to repairs.

To address these problems, Strata not only manages appliances, but also keeps

them more secure, improves compromise detection, and makes it easier to fix compro-

mised machines. Strata does this in three fundamental ways. First, many machines

are exploited because they provide functionality that is not needed and therefore

not maintained appropriately. Strata improves auditing by allowing an administrator

to examine each VLFS configuration to determine if unneeded layers, and there-

fore pieces of software, are being included. As opposed to a traditional monolithic

file system, where files can become hidden among their peers, a VLFS enables an


administrator to determine easily which layers are included and isolate file system

modifications stored in the private read-write layer.

Similarly, in the face of an attempted compromise, the VLFS lets an administrator

determine quickly if the file system has been compromised simply by checking the file

system’s private layer. Because any changes made to the file system cause a change to

the private read-write layer, an administrator can see if any system binaries or libraries

have been copied up to the private layer. If this has occurred, the administrator

knows that the system has been maliciously modified. The attacker has no ability

to modify the shared read-only layers because the layer repository’s file system share

enforces the read-only access to the shared contents. To modify the contents in

the shared layer repositories, an attacker would have to find a way to attack the

file system share itself. Although the attacker can still modify the appliance’s file

system, administrators can easily tell that this has happened by noticing the system

files stored within the VLFS’s private read-write layer. Administrators can detect

these modifications without relying on external databases that have to be maintained

separately and updated whenever the file system is changed.
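
As an illustration of this check, the following Python sketch flags files in the private layer that live in system directories. The list of system directories and the function name are assumptions made for the example; an administrator would adapt them to the appliance in question.

```python
import posixpath

# Assumed set of directories that should only contain layer-provided files.
SYSTEM_DIRS = ("/bin", "/sbin", "/lib", "/usr/bin", "/usr/sbin", "/usr/lib")


def suspicious_paths(private_paths):
    """Return private-layer paths that fall inside a system directory.

    `private_paths` are paths as they appear in the VLFS namespace but whose
    contents are stored in the private read-write layer.
    """
    hits = []
    for path in private_paths:
        norm = posixpath.normpath(path)
        if any(norm.startswith(d + "/") for d in SYSTEM_DIRS):
            hits.append(norm)
    return sorted(hits)


if __name__ == "__main__":
    private = ["/home/alice/notes.txt", "/bin/ls", "/usr/lib/libssl.so"]
    for path in suspicious_paths(private):
        print("modified system file:", path)
```

No external checksum database is consulted; the location of the file in the private layer is itself the evidence.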

Second, by leveraging Strata’s layer concept, an administrator can deploy fixes

to all of the machines more quickly, without having to worry about machines not

currently running or forgotten altogether. When a layer update is available to fix a

security hole, an administrator needs only to import it into the local layer repository.

Systems managed by Strata will detect that the layer repository has been updated

and identify that updates are available for a layer that is being used in the local

VLFS. Strata will automatically include the new layer into the VLFS’s namespace

while removing the old one.

Finally, with a VLFS, it is simple to recreate a fresh system. By replacing the

compromised private layer with a fresh layer, the system is instantly cleaned. This is


equivalent to deploying a new virtual appliance, as the private layer is what distin-

guishes virtual appliance clones. As opposed to physical systems, where reinstalling

the system can require overwriting the compromised system, cleaning a system with

Strata does not require losing the contents of the compromised machine. Because

cleaning the system does not require getting rid of the compromised private layer,

an administrator need not waste time backing it up and can make it available within

the appliance’s file system as a regular directory without it being composed into the

normal file system view. This puts the system back online quickly while also

allowing easy import of data to be preserved from the compromised system.

Quickly fixing compromised systems is useful, but often results in discarding the

authorized configuration changes made to that system. Until now, we have described

a single VLFS containing multiple read-only layers shared among appliances and one

read-write layer containing the virtual appliance’s private data. But the appliance’s

private data need not be limited to a single layer. An end user of a deployed appliance

can create their own configuration layers to lock in whatever persistent configuration

changes they desire. Regular configuration layers are read-only and shared between

appliances, but this configuration layer is read-only and accessible only to the local

appliance. In practice, the end user will initially create a VLFS as described above

that has only one read-write layer for private data. Configuration changes are usually

done at the outset and remain static for an extended period, so static configuration

changes can be confined to this private layer. When the user is satisfied with the

configuration, they convert the read-write private layer to a read-only configuration

layer to lock it in, while adding a new private layer to contain the file system changes

that occur during regular usage. If the machine’s configuration is corrupted due to

system compromise or an administrator’s authorized changes, the user can quickly

revert back to the locked down configuration, kept as it is on a read-only layer.

6.5 Experimental Results



We have implemented Strata as a loadable kernel module on an unmodified Linux 2.6

series kernel. The loadable kernel module implements Strata’s VLFS as a stackable

file system. We present experimental results using our Linux prototype to manage

various VAs, demonstrating its ability to reduce management costs while incurring

only modest performance overhead. Experiments were conducted on VMware ESX

3.0 running on an IBM BladeCenter with 14 IBM HS20 eServer blades with dual

3.06 GHz Intel Xeon CPUs, 2.5 GB RAM, and a Q-Logic Fibre Channel 2312 host

bus adapter connected to an IBM ESS Shark SAN with 1 TB of disk space. The

blades were connected by a gigabit Ethernet switch. This is a typical virtualization

infrastructure in an enterprise computing environment where all virtual machines are

centrally stored and run. We compare plain Linux VMs with a virtual block device

stored on the SAN and formatted with the Ext3 file system to VMs managed by

Strata with the layer repository also stored on the SAN. By storing both the plain

VM’s virtual block device and Strata’s layers on the SAN, we eliminate any differences

in performance due to hardware architecture.

To measure management costs, we quantify the time taken by two common tasks,

provisioning and updating VAs. We quantify the storage and time costs for provi-

sioning many VAs and the performance overhead for running various benchmarks

using the VAs. We ran experiments on five VAs: an Apache web server, a MySQL

SQL server, a Samba file server, an SSH server providing remote access, and a remote

desktop server providing a complete GNOME desktop environment. While the server

VAs had relatively few layers, the desktop VA has very many layers. This enables

the experiments to show how the VLFS performance scales as the number of layers

increases. To provide a basis for comparison, we provisioned these VAs using the


normal VMware virtualization infrastructure and plain Debian package management

tools, and Strata. To make a conservative comparison to plain VAs and to test larger

numbers of plain VAs in parallel, we minimized the disk usage of the VAs. The

desktop VA used a 2 GB virtual disk, while all others used a 1 GB virtual disk.

6.5.1 Reducing Provisioning Times

Table 6.1 shows how long it takes Strata to provision VAs versus regular and COW

copying. To provision a VA using Strata, Strata copies a default VMware VM with

an empty sparse virtual disk and provides it with a unique MAC address. It then

creates a symbolic link on the shared file system from a file named by the MAC

address to the layer definition file that defines the configuration of the VA. When the

VA boots, it accesses the file denoted by its MAC address, mounts the VLFS with

the appropriate layers, and continues execution from within it. To provision a plain

VA using regular methods, we use QEMU’s qemu-img tool to create both raw copies

and COW copies in the QCOW2 disk image format.
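
The Strata provisioning step described above amounts to creating one symbolic link. The Python sketch below models it under stated assumptions: the layer definition file is taken to list one layer name per line, which the source does not specify, and the function names are invented for illustration.

```python
import os


def provision(shared_dir, mac, layer_definition):
    """Link a file named by the VA's MAC address to its layer definition file."""
    link = os.path.join(shared_dir, mac)
    os.symlink(layer_definition, link)
    return link


def layers_for(shared_dir, mac):
    """At boot: follow the MAC-named link and read the list of layers.

    Assumes a hypothetical format with one layer name per line.
    """
    with open(os.path.join(shared_dir, mac)) as f:
        return [line.strip() for line in f if line.strip()]
```

Since only a link is created, the cost is independent of the size of the layers, which is consistent with the millisecond provisioning times in Table 6.1.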

Our measurements for all five VAs show that COW copies and Strata take

about the same amount of time to provision VAs, while creating a raw image takes

much longer. Creating a raw image for a VA takes 3 to almost 6 minutes and is

dominated by the cost of copying data to create a new instance of the VA. For larger

VAs, these provisioning times would only get worse. In contrast, Strata provisions

VAs in only a few milliseconds because a null VMware VM has essentially no data to

copy. Layers do not need to be copied, so copying overhead is essentially zero. While

COW images can be created in a similar amount of time, they do not provide any of

the management benefits of Strata, as each new COW image is independent of the

base image from which it was created.


          Apache    MySQL     Samba     SSH       Desktop
Plain     184s      179s      183s      174s      355s
Strata    0.002s    0.002s    0.002s    0.002s    0.002s
QCOW2     0.003s    0.003s    0.003s    0.003s    0.003s

Table 6.1 – VA Provisioning Times

            Plain     Strata
VM Wake     14.66s    NA
Network     43.72s    NA
Update      10.22s    1.041s
Suspend     3.96s     NA
Total       73.2s     1.041s

Table 6.2 – VA Update Times

6.5.2 Reducing Update Times

Table 6.2 shows how long it takes to update VAs using Strata versus traditional

package management. We provisioned ten VA instances each of Apache, MySQL,

Samba, SSH and Desktop for a total of 50 provisioned VAs. All were kept in a

suspended state. When a security patch [146] was available for the tar package,

installed in all the VAs, we updated them. Strata simply updates the layer definition

files of the VM templates, which it can do even when the VAs are not active. When

the VA is later resumed during normal operation, it automatically checks to see if

the layer definition file has been updated and updates the VLFS namespace view

accordingly, an operation that is measured in microseconds. To update a plain VA

using normal package management tools, each VA instance must be resumed and

retrieve a network address. An administrator or script must ssh into each VA, fetch

and install the updated packages, and finally re-suspend the VA.

Table 6.2 shows the average time to update each VA using traditional methods ver-

sus Strata. We break down the update time into times to resume the VM, get access

to the network, actually perform the update, and re-suspend the VA. The measure-

ments show that the cost of performing an update is dominated by the management


overhead of preparing the VAs to be updated. Preparation is itself dominated by

getting an IP address and becoming accessible on a busy network. While this cost is

not excessive on a quiet network, on a busy network it can take a significant amount

of time for the client to get a DHCP address, and for the ARP tables on the machine

controlling the update to find the target machine. In our test, the average total time

to update each plain VA is about 73 seconds. In contrast, Strata takes only a second

to update each VA. As this is an order of magnitude shorter even than resuming the

VA, Strata is able to delay the update to a point when the VA will be resumed from

standby normally without impacting its ability to quickly respond. Strata provides

over 70 times faster update times than traditional package management when man-

aging even a modest number of VAs. Strata’s ability to decrease update times would

only improve as the number of VAs being managed grows.
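
The figures quoted above follow directly from Table 6.2. The short Python calculation below reproduces them; the fleet-wide totals additionally assume that the 50 VAs are updated one after another, which is an assumption of this sketch and not a measured result.

```python
# Values taken from Table 6.2 (seconds per VA).
PLAIN_TOTAL = 73.2
STRATA_TOTAL = 1.041
PLAIN_NETWORK = 43.72
NUM_VAS = 50  # the 50 suspended VAs used in the experiment

# Ratio of per-VA update times, and the share of the plain update
# spent waiting for network access.
speedup = PLAIN_TOTAL / STRATA_TOTAL
network_share = PLAIN_NETWORK / PLAIN_TOTAL

# Assumption: VAs are updated sequentially, so per-VA times add up.
fleet_plain = PLAIN_TOTAL * NUM_VAS
fleet_strata = STRATA_TOTAL * NUM_VAS

if __name__ == "__main__":
    print(f"speedup per VA: {speedup:.1f}x")
    print(f"network share of plain update: {network_share:.0%}")
    print(f"sequential fleet update: {fleet_plain:.0f}s vs {fleet_strata:.1f}s")
```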

6.5.3 Reducing Storage Costs

Figure 6.6 shows the total storage space required for different numbers of VAs stored

with raw and COW disk images versus Strata. We show the total storage space for 1

Apache VA, 5 VAs corresponding to an Apache, MySQL, Samba, SSH, and Desktop

VA, and 50 VAs corresponding to ten instances of each of the five VAs. As expected,

for raw images, the total storage space required grows linearly with the number of

VA instances. In contrast, the total storage space using COW disk images and Strata

is relatively constant and relatively independent of the number of VA instances. For

one VA, the storage space required for the disk image is less than the storage space

required for Strata, as the layer repository used contains more layers than those used

by any one of the VAs. In fact, to run a single VA, the layer repository size could be

trimmed down to the same size as the traditional VA.


[Figure 6.6 – Storage Overhead. Y-axis: Size (MB), log scale from 1 to 100,000; x-axis: 1 VM, 5 VMs, 50 VMs; series: Plain VM, Strata.]

For larger numbers of VAs, however, Strata provides a substantial reduction in the

storage space required, because VAs share layers and do not require duplicate storage.

For 50 VAs, Strata reduces the storage space required by an order of magnitude over

the raw disk images. Table 6.3 shows that there is much duplication among the VAs,

as the layer repository of 405 distinct layers needed to build the different VLFSs for

multiple services is basically the same size as the largest service. Although Strata

initially has no significant storage benefit over COW disk images, each COW disk

image is independent of the version it was created from and must therefore be

managed independently. This increases storage usage, as the same updates must be

independently applied to many independent disk images. While other mechanisms

exist, such as deduplication, that can help with storage usage, they increase overhead

due to the effort that is required to find duplicates. Additionally, deduplication does

not help with the management of the individual VAs as the updates will still have

to be applied to each individual system independently.
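
A back-of-the-envelope estimate based on the sizes in Table 6.3 shows where the order-of-magnitude reduction comes from. The Python calculation below is a sketch under several assumptions: each plain VA stores a full copy of the size listed for its service, the space used by Strata's private layers is ignored, and 1 GB is taken as 1024 MB. The measured values plotted in Figure 6.6 may differ.

```python
MB_PER_GB = 1024  # assumption: binary units

# Static VA sizes and repository size from Table 6.3, in MB.
va_size_mb = {
    "Apache": 217,
    "MySQL": 206,
    "Samba": 169,
    "SSH": 127,
    "Desktop": 1.7 * MB_PER_GB,
}
repo_size_mb = 1.8 * MB_PER_GB
INSTANCES_PER_VA = 10  # 50 VAs: ten instances of each of the five

# Plain VAs: every instance keeps its own copy, so storage grows linearly.
plain_total_mb = INSTANCES_PER_VA * sum(va_size_mb.values())

# Strata: all instances share one repository (private layers ignored).
strata_total_mb = repo_size_mb

ratio = plain_total_mb / strata_total_mb

if __name__ == "__main__":
    print(f"plain: {plain_total_mb:.0f} MB, Strata: {strata_total_mb:.0f} MB")
    print(f"ratio: {ratio:.1f}x")
```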


            Repo     Apache    MySQL     Samba     SSH       Desktop
Size        1.8GB    217MB     206MB     169MB     127MB     1.7GB
# Layers             43        23        30        12        404
Shared               191MB     162MB     152MB     123MB     169MB
Unique               26MB      44MB      17MB      4MB       1.6GB

Table 6.3 – Layer Repository vs. Static VAs

6.5.4 Virtualization Overhead

To measure the virtualization cost of Strata’s VLFS, we used a range of micro-

benchmarks and real application workloads to measure the performance of our Linux

Strata prototype, then compared the results against vanilla Linux systems within a

virtual machine. The virtual machine’s local file system was formatted with the Ext3

file system and given read-only access to a SAN partition formatted with Ext3 as

well. We performed all benchmarks in every scenario described above.

To demonstrate the effect that Strata’s VLFS has on system performance, we

performed a number of benchmarks. Postmark [76] is a synthetic test that measures

how the system would behave if used as a mail server. Our postmark test operated on

files between 512 and 10K bytes to simulate the mail-server’s spool directory, with an

initial set of 20,000, and performed 200,000 transactions. Postmark is very intensive

on a few specific file system operations such as lookup(), create() and unlink()

because it is constantly creating, opening and removing files. Figure 6.7 shows that

running this benchmark within a traditional VA is significantly faster than running

it in Strata. This is because as Strata composes multiple file system namespaces

together, it places significant overhead on namespace operations such as lookup().
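
A simplified model makes the source of this overhead concrete. In the Python sketch below, a lookup probes layers from the top down until the name is found, so a name that exists nowhere, as is the case just before a new file is created, costs one probe per layer. This is only a model; it ignores any caching the real implementation may perform.

```python
def lookup(layers, name):
    """Probe layers top-down; return (found, number_of_layers_probed)."""
    probes = 0
    for layer in layers:
        probes += 1
        if name in layer:
            return True, probes
    return False, probes


if __name__ == "__main__":
    # 404 layers, as in the desktop VA of Table 6.3; contents are invented.
    layers = [set() for _ in range(403)] + [{"libc.so.6"}]
    print(lookup(layers, "libc.so.6"))      # found only in the last layer
    print(lookup(layers, "new-mail-file"))  # negative lookup probes every layer
```

A workload such as Postmark, which constantly creates and removes files, pays this per-layer cost on every operation, whereas workloads dominated by reads and writes of already opened files do not.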

To demonstrate that postmark’s results are not indicative of performance in real-

life scenarios, we ran two application benchmarks to measure the overhead Strata

imposes in a desktop and server VA scenario. First, we timed a multi-threaded build

of the Linux 2.6.18.6 kernel with two concurrent jobs using the VM’s two CPUs. In

all scenarios, we added the layers required to build a kernel to the layers needed to

provide the service, generally adding 8 additional layers to each case. Figure 6.8

shows that while Strata imposes a slight overhead on the kernel build compared to

the underlying file system it uses, the cost is minimal, under 5% at worst.

[Figure 6.7 – Postmark Overhead in Multiple VAs. Y-axis: Time (s); series: Plain VM, Strata.]

Second, we measured the number of HTTP transactions completed per second by

an Apache web server placed under load. We imported

the database of a popular guitar tab search engine and used the http_load [108]

benchmark to continuously perform a set of 20 search queries on the database

for 60 seconds. For each case that did not already contain Apache, we added the

appropriate layers to the layer definition file to make Apache available. Figure 6.9

shows that Strata imposes a minimal overhead of only 5%.


[Figure 6.8 – Kernel Build Overhead in Multiple VAs. Y-axis: Time (s); series: Plain VM, Strata.]

6.6 Related Work

The most common way to provision and maintain machines today is using the package

management system built into the operating system [4, 56]. Package managers view

the file system into which they install packages as a simple container for files, not

as a partner in the management of the machine. This causes them to suffer from a

number of flaws in their management of large numbers of VAs. They are not space- or

time-efficient, as each provisioned VA needs an independent copy of the package’s files

and requires time-consuming copying of many megabytes or gigabytes into each VA’s

file system. These inefficiencies affect both provisioning and updating of a system

because a lot of time is spent downloading, extracting and installing the individual

packages into the many independent VAs.

[Figure 6.9 – Apache Overhead in Multiple VAs. Y-axis: Fetches/s; series: Plain VM, Strata.]

As the package manager does not work in partnership with the file system, the file

system is unable to distinguish the different types of files it contains. A file installed

from a package and a file modified or created in the course of usage are

indistinguishable. Specialized tools are needed to traverse the entire file system to determine

if a file belongs to a package or was created or modified after the package was in-

stalled. For instance, to determine if a VA has been compromised, an administrator

must determine if any system files have been modified. Finally, package management

systems work in the context of a running system to modify the file system directly.

These standard tools often do not work outside the context of a running system, for

example, for a VA that is suspended or turned off.

For local scenarios, the size and time efficiencies of provisioning a VA can be

improved by utilizing copy-on-write (COW) disks, such as QEMU’s QCOW2 [91]

format. This enables VAs to be provisioned quickly, as little data has to be written

to disk immediately due to the COW property. However, once provisioned, each

COW copy is now fully independent from the original, is equivalent to a regular copy,


and therefore suffers from all the same maintenance problems as a regular VA. Even

if the original disk image is updated, the changes would be incompatible with the

cloned COW images. This is because COW disks operate at the block level. As files

get modified, they use different blocks on their underlying device. Therefore, it is

likely that the original and cloned COW images address the same blocks for different

pieces of data. For similar reasons, COW disks do not help with VA creation, as

multiple COW disks cannot be combined together into a single disk image.

Both the Collective [41] and Ventana [103] attempt to solve the VA maintenance

problem by building upon COW concepts. Both systems enable VAs to be provisioned

quickly by performing a COW copy of each VA’s system file system. However, they

suffer from the fact that they manage this file system at either the block device

or monolithic file system level, providing users with only a single file system. While

ideally an administrator could supply a single homogeneous shared image for all users,

in practice, users want access to many heterogeneous images that must be maintained

independently and therefore increase the administrator’s work. The same is true for

VAs provisioned by the end user, although both systems let a VA maintain a disk,

separate from the shared system disk, that persists across upgrades.

Mirage [121] attempts to improve the disk image sprawl problem by introducing a

new storage format, the Mirage Index Format (MIF), to enumerate what files belong

to a package. However, it does not help with the actual image sprawl in regard to

machine maintenance, because each machine reconstituted by Mirage still has a fully

independent file system, as each image has its own personal copy. Although each

provisioned machine can be tracked, they are now independent entities and suffer

from the same problems as a traditional VA.

Stork [38] improves on package management for container-based systems by en-

abling containers to hard link to an underlying shared file system so that files are only


stored once across all containers. By design, it cannot help with managing indepen-

dent machines, virtual machines, or VAs, because hard links are a function internal

to a specific file system and not usable between separate file systems.

Union file systems [102,150] provide the ability to compose multiple different file

system namespaces into a single namespace view. Unioning file systems are commonly

used to provide a COW file system from a read-only copy, such as with LiveCDs. How-

ever, unioning file systems by themselves do not directly help with VA management,

as the underlying file system has to be maintained using regular tools. Strata builds

upon and leverages this mechanism by improving its ability to handle deleted files as

well as managing the layers that belong to the union. This allows Strata to provide

a solution that enables efficient provisioning and management of VAs.

Chapter 7

Apiary: A Desktop of Isolated

Applications

In today’s world of highly connected computers, desktop security and privacy are

major issues. Desktop users interact constantly with untrusted data they receive

from the Internet by visiting new websites, downloading files and emailing strangers.

All these activities use information whose safety the user cannot verify. Data can be

constructed maliciously to exploit bugs and vulnerabilities in applications, enabling

attackers to take control of users’ desktops. For example, a major flaw was recently

discovered in Adobe Acrobat products that enables an attacker to take control of a

desktop when a maliciously constructed PDF file is viewed [18]. Adobe estimated that

a fix would not be released until nearly a month after the exploit appeared in the wild. Even

in the absence of bugs, untrusted data can be constructed to invade users’ privacy.

For example, websites often store cookies that allow advertisers to

track user behavior across multiple websites.

The prevalence of untrusted data and buggy software makes application fault

containment increasingly important. Many approaches have been proposed to isolate

applications from one another using mechanisms such as process containers [7, 116]

or virtual machines [147]. For instance, in Chapter 5, we introduced PeaPod to

leverage process containers to isolate the components of a single application. Faults

are confined so that if an application is compromised, only that application and the

data it can access are available to an attacker. By having only one application per

container, each individual container becomes a simpler system, making it easier to

determine if unwanted processes are running within it.

However, existing approaches to isolating applications suffer from an unresolved

tension between ease of use and degree of fault containment. Some approaches [72,92]

provide an integrated desktop feel but only partial isolation. They are relatively easy

to use, but do not prevent vulnerable applications from compromising the system

itself. Other approaches [122, 143] have less of an integrated desktop feel but fully

isolate applications into distinct environments, typically by using separate virtual

machines. These approaches effectively limit the impact of compromised applications,

but are harder to use because users are forced to manage multiple desktops. Virtual

machine (VM) approaches also require managing multiple machine instances and

incur high overhead to support multiple operating system instances, making them

too expensive to allow more than a couple of fault containment units per desktop.

To address these problems, we introduce Apiary, which provides strong isolation

for robust application fault containment while retaining the integrated look, feel and

ease of use of a traditional desktop environment. Apiary accomplishes this by us-

ing well-understood technologies like thin clients, operating system containers and

unioning file systems in novel ways. It does this using three key mechanisms.

First, it decomposes a desktop’s applications into isolated containers. Each con-

tainer is an independent software appliance that provides all system services an appli-

cation needs to execute. To retain traditional desktop semantics, Apiary integrates


these containers in a controlled manner at the display and file system. Apiary’s

containers prevent an exploit from compromising the user’s other applications. For

example, by having separate web browser and personal finance containers, any com-

promise from web browsing would not be able to access personal financial information.

At the same time, Apiary makes the web browser and personal finance containers look

and feel like part of the same integrated desktop, with all normal windowing functions

and cut-and-paste operations operating seamlessly across containers.

Second, it introduces the concept of ephemeral containers. Ephemeral containers

are execution environments with no access to user data that are quickly instantiated

from a clean state for only a single application execution. When the application

terminates, the container is archived, but never used again. Apiary uses ephemeral

containers as a fundamental building block of the integrated desktop experience while

preventing contamination across containers. For example, users often expect to view

PDF documents from the web, but need separate web browser and PDF viewer con-

tainers for fault containment. If a user always views PDF documents in the same

PDF viewer container, a single malicious document could exploit the container and

have access to future documents the user wants to keep private, like bills and bank

statements. Instead, Apiary enables the web browser to automatically instantiate a

new ephemeral PDF viewer container for each individual PDF document. Even if the

PDF file is malicious, it will have no effect on other PDF files because the container

instance it exploited will never be used again.

As illustrated by this PDF example, ephemeral containers have three benefits.

First, they prevent compromises, because exploits, even if triggered, cannot persist.

Second, they protect users from compromised applications. Even when an application

has been compromised, a new ephemeral container running that application in parallel

will remain uncompromised because it is guaranteed to start from a clean state.


Third, they help protect user privacy when using the Internet. For example, while

cookies must be accepted to use many websites, web browsers in separate ephemeral

containers can be used for different websites to prevent cookies from tracking user

behavior across websites.

Apiary’s third mechanism is Strata’s VLFS. Apiary leverages the VLFS to allow

the many application containers used in Apiary to be efficiently stored and instanti-

ated. Since each container’s VLFS will share the layers that are common to them,

Apiary’s storage requirements are the same as those of a traditional desktop. Similarly, since

no data has to be copied to create a new VLFS instance, Apiary is able to quickly

instantiate ephemeral containers for a single application execution.

Apiary’s approach differs markedly from the approach taken by PeaPod in Chap-

ter 5. In PeaPod, we isolate the different process components of a single larger ap-

plication, such as an email server. These applications contain processes that require

access to large amounts of the same data, but with differing levels of privilege and

therefore they cannot be fully isolated. Furthermore, in many of these applications,

the security model is well understood and therefore simple sets of rules can be created

to isolate each component. However, desktop security is much more complicated. As

can be seen in Section 5.4.3, just isolating one small portion of the desktop involved

the creation of the largest set of rules. In Apiary, we enable the isolation of desktop

applications without any rules.

7.1 Apiary Usage Model

Figure 7.1 shows the Apiary desktop. It looks and works like a regular desktop. Users

launch programs from a menu or from within other programs, switch among launched

programs using a taskbar, interact with running programs using the keyboard and


Figure 7.1 – Apiary screenshot showing a desktop session. At the topmost left is (1), an application menu that provides access to all available applications. Just below it, the window list (2) allows users to easily switch among running applications. (3) is the composite display view of all the visible running applications.

mouse, and have a single display with an integrated window system and clipboard

functionality that contains all running programs.

Although Apiary provides a look and feel similar to a regular desktop, it provides

fault containment by isolating applications into separate containers. Containers en-

force isolation so that applications running inside cannot get out. Apiary isolates

individual applications, not individual programs. An application in Apiary can be

understood as a software appliance made up of multiple programs used together in a

single environment to accomplish a specific task. For instance, a user’s web browser

and word processor would be considered separate applications and isolated from one

another. The software appliance model means that users can install separate isolated


applications containing many or all of the same programs, but used for different pur-

poses. For example, a banking application contains a web browser for accessing a

bank’s website, while a web surfing application also contains a web browser, but for

general web browsing. Both appliances make use of the same web browser program,

but are listed as different applications in the application menu.

Apiary provides two types of containers: ephemeral and persistent. Ephemeral

containers are created fresh for each application execution. Persistent containers, like

a traditional desktop, maintain their state across application executions. Apiary lets

users select whether an application should launch within an ephemeral or a persis-

tent container. Windows belonging to ephemeral applications are, by default, given

distinct border colors so that users can quickly identify based on appearance in which

mode an application is executing.

Ephemeral containers provide a powerful mechanism for protecting desktop se-

curity and user privacy when running common desktop operations, such as viewing

untrusted data, that do not require storing persistent states. Users will typically

run multiple ephemeral containers at the same time, and, in some cases, multiple

ephemeral containers for the same application at the same time. They provide im-

portant benefits for a wide range of uses.

Ephemeral containers prevent compromises because exploits cannot persist. For

example, a malicious PDF document that exploits an ephemeral PDF viewer will

have no persistent effect on the system because the exploit is isolated in the container

and will disappear when the container finishes executing.

Ephemeral containers protect user privacy when using the Internet. For example,

many websites require cookies to function, but also store advertisers’ cookies to track

user behavior across websites and compromise privacy. Apiary makes it easy to

use multiple ephemeral web browser containers simultaneously, each with separate


cookies, making it harder to track users across websites.

Ephemeral containers protect users from compromises that may have already oc-

curred on their desktop. If a web browser has been compromised, parallel and future

uses of the web browser will allow an attacker to steal sensitive information when

the user accesses important websites (e.g., for banking). Ephemeral containers are

guaranteed to launch from a clean slate. By using a separate ephemeral web browser

container for accessing a banking site, Apiary ensures that an already exploited web

browser installation cannot compromise user privacy.

Ephemeral containers allow applications to launch other applications safely. For

example, users often receive email attachments such as PDF documents that they

wish to view. To avoid compromising an email container, Apiary creates a separate

ephemeral PDF viewer container for the PDF. Even if it is malicious, it will have

no effect on the user’s desktop, as it only affects the isolated ephemeral container.

Similarly, ephemeral word processor or spreadsheet containers will be created for

viewing these email attachments to prevent malicious files from compromising the

system. In general, Apiary allows applications to cause other applications to be

safely launched in ephemeral containers by default to support scenarios that involve

multiple applications.

Isolated persistent containers are necessary for applications that maintain state

across executions to prevent a single application compromise from affecting the entire

system. Users typically run one persistent container per application to avoid needing

to track which persistent application container contains which persistent information.

Some applications only run in persistent containers, while others may run in both

types of containers. For example, an email application is typically used in a persistent

container to maintain email state across executions. On the other hand, a web browser

will be used both in a persistent container, to access a user’s trusted websites and to

remember browsing history, plugins and bookmarks, and in an ephemeral container,

to view untrusted websites. Note

that files stored in both kinds of containers are private by default and not accessible

outside their container.

Apiary’s containers work together to provide a security system that differs fun-

damentally from common security schemes that attempt to lock down applications

within a restricted-privilege environment. In Apiary, each application container is

an independent entity that is entirely isolated from every other application container

on the Apiary desktop. One does not have to apply any security analysis or com-

plex isolation rules to determine which files a specific application should be able to

access. Also, in most other schemes, an application, once exploited, will continue

to be exploited, even if the exploited application is restricted from accessing other

applications’ data. Apiary’s ephemeral containers, however, prevent an exploit from

persisting between application execution instances.

Apiary provides every desktop with two ways to share files between containers.

First, containers can use standard file system share concepts to create directories

that can be seen by multiple containers. This has the benefit of allowing any data

stored in the shared directory to be automatically available to the other containers

that have access to the share. Second, Apiary supplies every desktop with a special

persistent container with a file explorer. The explorer has access to all of the user’s

containers and can manage all of the user’s files, including copying them between

containers. This is useful if a user decides they want to preserve a file from an

ephemeral container, or move a file from one persistent container to another, as, for

instance, when emailing a set of files. The file explorer container cannot be used in an

ephemeral manner, its functionality cannot be invoked by any other application on


the system, and no other application is allowed to execute within it. This prevents an

exploited container from using the file explorer container to corrupt others. Note that

both of these mechanisms break the isolation barrier that exists between containers.

File system shares can be used by an exploited container as a vector to infect other

containers, while a user can be tricked into moving a malicious file between containers.

However, this is a tension that will always exist in security systems that are meant

to be usable by a diverse crowd of users.

7.2 Apiary Architecture

To support its container model, Apiary must have four capabilities. First, Apiary

must be able to run applications within secure containers to provide application iso-

lation. Second, Apiary must provide a single integrated display view of all running

applications. Third, Apiary must be able to instantiate individual containers quickly

and efficiently. Finally, for a cohesive desktop experience, Apiary must allow appli-

cations in different containers to interact in a controlled manner.

Apiary does this by using a virtualization architecture that consists of three main

components: an operating system container that provides a virtual execution environment,

a virtual display system that provides a virtual display server and viewer,

and the VLFS. Additionally, Apiary provides a desktop daemon that runs on the

host. This daemon instantiates containers, manages their lifetimes and ensures that

they are correctly integrated.

7.2.1 Process Container

Apiary’s containers are essential to Apiary’s ability to isolate applications from one

another. By providing isolated containers, individual applications can run in parallel


within separate containers, and have no conception that there are other applications

running. This enforces fault containment, as an exploited process will only have

access to whatever files are available within its own container.

Apiary’s containers leverage features such as Solaris’s zones [116], FreeBSD’s

jails [74] and Linux’s containers [6] to create isolated execution environments. Each

container has its own private kernel namespace, file system and display server, provid-

ing isolation at the process, file system and display levels. Programs within separate

containers can only interact using normal network communication mechanisms. In

addition, each container has an application control daemon that enables the virtual

display viewer to query the container for its contents and interact with it.

7.2.2 Display

Apiary’s virtual display system is crucial to complete process isolation and a cohe-

sive desktop experience. If containers were to share a single display directly, ma-

licious applications could leverage built-in mechanisms in commodity display archi-

tectures [61, 93] to insert events and messages into other applications that share the

display, enabling the malicious application to remotely control the others, effectively

exploiting them as well. Many existing commodity security systems do not isolate

applications at the display level, providing an easy vector for attackers to further

exploit applications on the desktop.

But although independent displays isolate the applications from one another, they

do not provide the single cohesive display users expect. This cohesive display has two

elements. First, the display views have to be integrated into a single view. Second,

Apiary has to provide the normal desktop metaphors that users want, including a

single menu structure for launching applications and an integrated task switcher that


allows the user to switch among all running applications.

Apiary’s virtual display system incorporates both of these elements. First, Api-

ary’s virtual display provides each container with its own virtual display similar to

existing systems [14, 27, 51, 140]. This virtual display operates by decoupling the

display state from the underlying hardware and enabling the display output to be

redirected anywhere.

Second, Apiary enables these independent displays to be integrated into a single

display view. While a regular remote framework provides all the information needed

to display each desktop, it assumes that there is no other display in use, and there-

fore expects to be able to draw the entire display area. In Apiary, where multiple

containers are in use, this assumption does not hold. Therefore, to enable multiple

displays to be integrated into a single view, the Apiary viewer composes the display

together using the Porter-Duff [107] over operation.
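
As a rough illustration of this composition step, the following sketch applies the Porter-Duff over operator to premultiplied RGBA pixels. The pixel representation, the function names and the assumption that areas not covered by a container's windows are fully transparent are ours for illustration only; they are not taken from the Apiary viewer.

```python
from typing import List, Tuple

# Assumed representation: premultiplied RGBA, each component in [0.0, 1.0].
Pixel = Tuple[float, float, float, float]


def over(src: Pixel, dst: Pixel) -> Pixel:
    """Porter-Duff 'over': src drawn on top of dst, premultiplied alpha."""
    k = 1.0 - src[3]  # fraction of dst that remains visible through src
    return tuple(s + d * k for s, d in zip(src, dst))


def compose(layers: List[List[Pixel]]) -> List[Pixel]:
    """Compose per-container displays, listed bottom to top, into one view."""
    result: List[Pixel] = [(0.0, 0.0, 0.0, 0.0)] * len(layers[0])
    for layer in layers:
        result = [over(top, bottom) for top, bottom in zip(layer, result)]
    return result
```

Where a container's display is transparent, the displays below it show through unchanged; where it is opaque, it fully covers them.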

Apiary’s viewer provides an integrated menu system that lists all the applications

users are able to launch. Apiary leverages the application control daemon running

within each container to enumerate all the applications within the container, much

like a regular menu in a traditional desktop. Instead of providing the menu directly

in the screen, however, it transmits the collected data back to the viewer, which then

integrates this information into its own menu, associating the menu entry with the

container it came from. When a user selects a program from the viewer’s menu, the

viewer instructs the correct daemon to execute it within its container.
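
The menu integration described above can be sketched as follows. This is a minimal model under stated assumptions: the class and method names (ControlDaemon, Viewer, enumerate_programs, execute) are hypothetical, and the communication between viewer and daemon is reduced to direct method calls.

```python
from typing import Dict, List, Tuple


class ControlDaemon:
    """Stand-in for a per-container application control daemon (hypothetical API)."""

    def __init__(self, container: str, programs: List[str]) -> None:
        self.container = container
        self.programs = programs
        self.launched: List[str] = []

    def enumerate_programs(self) -> List[str]:
        return list(self.programs)

    def execute(self, program: str) -> None:
        if program not in self.programs:
            raise ValueError(f"{program} is not available in {self.container}")
        self.launched.append(program)


class Viewer:
    def __init__(self, daemons: List[ControlDaemon]) -> None:
        self.daemons: Dict[str, ControlDaemon] = {d.container: d for d in daemons}

    def build_menu(self) -> List[Tuple[str, str]]:
        """Each entry is (container, program), so it remembers where it came from."""
        return [(name, prog)
                for name, daemon in sorted(self.daemons.items())
                for prog in daemon.enumerate_programs()]

    def launch(self, entry: Tuple[str, str]) -> None:
        container, program = entry
        self.daemons[container].execute(program)  # run inside the owning container
```

Because each menu entry carries its container, two appliances that ship the same program appear as two distinct entries and are launched in their own containers.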

Similarly, to manage running applications effectively, Apiary provides a single

taskbar with which the user can switch between all applications running within the

integrated desktop. Apiary leverages the system’s ability to enumerate windows and

switch applications [63] by having the daemon enumerate all the windows provided by

its container and transmit this information to the viewer. The viewer then integrates


this information into a single taskbar with buttons corresponding to application win-

dows. When the user switches windows using the taskbar, the viewer communicates

with the daemon and instructs it to bring the correct window to the foreground.

Note that by stacking the independent displays, the windowing semantic is changed

slightly from a traditional desktop. In a traditional desktop, when one brings a win-

dow to the foreground, only that window will be brought up. In Apiary, each display

can feature multiple windows, each of which can be raised to the foreground. How-

ever, in Apiary, bringing up a window also brings its entire display layer to the

foreground. Consequently, all other windows in that display will be raised above the

windows provided by all other displays.
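
The changed windowing semantic can be made concrete with a small model. The sketch below assumes each container's display keeps its own bottom-to-top window order and that displays are themselves stacked bottom to top; the StackedDesktop class and its names are illustrative and not part of Apiary.

```python
from typing import Dict, List


class StackedDesktop:
    def __init__(self, displays: Dict[str, List[str]]) -> None:
        # Both orders are bottom to top; dict insertion order gives the display order.
        self.windows = {name: list(wins) for name, wins in displays.items()}
        self.display_order = list(displays)

    def raise_window(self, display: str, window: str) -> None:
        """Raise one window, which also raises its entire display layer."""
        wins = self.windows[display]
        wins.remove(window)
        wins.append(window)
        self.display_order.remove(display)
        self.display_order.append(display)

    def z_order(self) -> List[str]:
        """Global bottom-to-top order of all windows on the composed desktop."""
        return [w for d in self.display_order for w in self.windows[d]]
```

For example, with a browser display holding windows b1 and b2 below a mail display holding m1, raising b1 yields the order m1, b2, b1: window b2 ends up above m1 even though the user never raised it.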

7.2.3 File System

Apiary requires containers to be efficient in storage space and instantiation time. Con-

tainers must be storage-efficient to allow regular desktops to support the large number

of application containers used within the Apiary desktop. Containers must be effi-

ciently instantiated to provide fast interactive response time, especially for launching

ephemeral containers. Both of these requirements are difficult to meet using tradi-

tional independent file systems for each container. Each container’s file system would

be using its own storage space, which would be inefficient for a large number of con-

tainers, as it means many duplicated files. More important, the desktop becomes

much harder to maintain because each independent file system must be updated indi-

vidually. Similarly, instantiating the container requires copying the file system, which

can include many megabytes or gigabytes of storage space. Copying time prevents

the container from being instantiated quickly. Although file systems that support a

branching semantic [32, 103] can be used to quickly provision a new container’s file


system from a template image, each template image will still be independent and

therefore inefficient with regard to space, maintenance and upgrades.

Apiary leverages Strata’s Virtual Layered File System to meet these requirements.

The VLFS enables file systems to be created by composing layers together into a single

file system namespace view. VLFSs are built by combining a set of shared software

layers together in a read-only manner with a per-container private read-write layer.

Multiple VLFSs providing multiple applications are as efficient as a single regular

file system because all common files are stored only once in the set of shared layers.

Therefore, Apiary is able to store efficiently the file systems its containers need. This

also allows Apiary to manage its containers easily. To update every VLFS that uses

a particular layer, the administrator need only replace the single layer containing the

files that need updating. The VLFS also lets Apiary instantiate each container’s file

system efficiently. No data has to be copied into place because each of the software

layers is shared in a read-only manner. The instantiation is transparent to the end

user and nearly instantaneous.
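
To illustrate why no copying is needed, the following sketch models a layered file system as a list of shared read-only layers plus a private read-write layer. It is a simplification under our own assumptions: layers are dictionaries rather than directory trees, deletions are recorded as whiteout entries in the manner of common union file systems, and the LayeredFS name is hypothetical rather than the VLFS implementation.

```python
from typing import Dict, List, Optional, Set

Layer = Dict[str, bytes]  # path -> contents; a real layer would be a directory tree


class LayeredFS:
    def __init__(self, shared_layers: List[Layer]) -> None:
        self.shared = shared_layers      # read-only, held by reference, topmost first
        self.private: Layer = {}         # per-container read-write layer
        self.deleted: Set[str] = set()   # whiteouts hiding files in shared layers

    def read(self, path: str) -> Optional[bytes]:
        if path in self.deleted:
            return None
        if path in self.private:
            return self.private[path]
        for layer in self.shared:
            if path in layer:
                return layer[path]
        return None

    def write(self, path: str, data: bytes) -> None:
        self.deleted.discard(path)
        self.private[path] = data        # shared layers are never modified

    def delete(self, path: str) -> None:
        self.private.pop(path, None)
        self.deleted.add(path)
```

Creating a new instance only records references to the shared layers, so instantiation cost does not depend on the amount of data in them, and changes made in one container are invisible to every other container.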

7.2.4 Inter-Application Integration

Apiary provides independent containers for fault containment, but must also ensure

that they do not limit effective use of the desktop. For instance, if Firefox is totally

isolated from the PDF viewer, how does one view a PDF file? The PDF viewer could

be included within the Firefox container, but this violates the isolation that should

exist between Firefox and an application viewing untrusted content. Similarly, users

could copy the file from the Firefox container to the PDF viewer container, but this

is not the integrated feel that users expect.

Apiary solves this problem by enabling applications to execute specific applications


in new ephemeral containers. Every application used within Apiary is preconfigured

with a list of programs that it enables other applications to use in an ephemeral

manner. Apiary refers to these as global programs. For instance, a Firefox container

can specify /usr/bin/firefox and an Xpdf container can specify /usr/bin/xpdf

as global programs. Program paths marked global exist in all containers. Apiary

accomplishes this by populating a single global layer, shared by all the containers’

VLFSs, with a wrapper program for each global program. This wrapper program is

used to instantiate a new ephemeral container and execute the requested program

within it. Apiary only allows for the execution in a new ephemeral container and

not in a preexisting persistent or ephemeral container, as that would break Apiary

isolation constraints and cannot be done without risk to the preexisting container.

When executed, the wrapper program determines how it was executed and what

options were passed to it. It connects over the network to the Apiary desktop dae-

mon on the same host and passes this information to it. The daemon maintains a

mapping of global programs to containers and determines which container is being

requested to be instantiated ephemerally. This ensures that only the specified global

programs’ containers will be instantiated, preventing an attacker from instantiating

and executing arbitrary programs. Apiary is then able to instantiate the correct fresh

ephemeral container, along with all the required desktop services, including a display

server. The display server is then automatically connected to the viewer. Finally, the

daemon executes the program as it was initially called in the new container.
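
The request path from wrapper to daemon can be sketched as follows. This is a simplified model, not the prototype's code: the network connection is replaced by a direct function call, container instantiation is reduced to allocating an identifier, and the DesktopDaemon class, its methods and the identifier format are hypothetical.

```python
import itertools
from typing import Dict, List


class DesktopDaemon:
    def __init__(self, global_programs: Dict[str, str]) -> None:
        # Mapping of global program path -> appliance that provides it.
        self.global_programs = dict(global_programs)
        self._ids = itertools.count(1)
        self.running: Dict[str, List[str]] = {}  # container id -> argv executed in it

    def handle_request(self, argv: List[str]) -> str:
        appliance = self.global_programs.get(argv[0])
        if appliance is None:
            # Only registered global programs may trigger an instantiation.
            raise PermissionError(f"{argv[0]} is not a global program")
        # Always a fresh container; existing containers are never reused.
        container_id = f"{appliance}-ephemeral-{next(self._ids)}"
        self.running[container_id] = list(argv)  # executed exactly as it was called
        return container_id


def wrapper(daemon: DesktopDaemon, argv: List[str]) -> str:
    """Stand-in for the wrapper at a global path: forward its own invocation."""
    return daemon.handle_request(argv)
```

Two requests for the same global program produce two distinct containers, while a request for a path that is not in the mapping is refused.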

To ensure that ephemeral containers are discarded when no longer needed, Api-

ary’s desktop daemon monitors the process executed within the container. When it

terminates, Apiary terminates the container. Similarly, as the Apiary viewer knows

which containers are providing windows to it, if it determines that no more windows

are being provided by the container, it instructs the desktop daemon to terminate


the container. This ensures that an exploited process does not continue running in

the background.

Merely running a new program in a fresh container, however, is not enough to in-

tegrate applications correctly. When Firefox downloads a PDF and executes a PDF

viewer, it must enable the viewer to view the file. By default this would fail, because the Firefox

and ephemeral PDF viewer containers do not share the same file system. To enable

this functionality, Apiary enables small private read-only file shares between a parent

container and the child ephemeral container it instantiates. Because well-behaved

applications such as Firefox, Thunderbird and OpenOffice only use the system’s tem-

porary file directory to pass files among them, Apiary restricts this automatic file

sharing ability to files located under /tmp. To ensure that there are no namespace

conflicts between containers, Apiary provides containers with their own private di-

rectory under /tmp to use for temporary files, and they are preconfigured to use that

directory as their temporary file directory.

But providing a fully shared temporary file directory allows an exploited container

to access private files that are placed there when passed to an ephemeral container.

For instance, if a user downloads a malicious PDF and a bank statement in close

succession, they will both exist in the temporary file directory at the same time. To

prevent this, Apiary provides a special file system that enhances the read-only shares

with an access control list (ACL) that determines which containers can access which

files. By default, these directories will appear empty to the rest of the containers,

as they do not have access to any of the files. This prevents an exploited container

from accessing data not explicitly given to it. A file will only be visible within the

directories if the Apiary desktop daemon instructs the file system to reveal that file by

adding the container to the file’s ACL. This occurs when a global program’s wrapper

is executed and the daemon determines that a file was passed to it as an option. The


daemon then adds the ephemeral container to the file’s ACL. Because the directory

structure is consistent between containers, simply executing the requested program

in the new ephemeral container with the same options is sufficient.
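
The access control behavior of the shared temporary directory can be sketched as follows. The sketch assumes a single owning container per directory and models the daemon's instruction to the file system as a grant call; the AclTmpShare class and its methods are illustrative names, not the interface of Apiary's file system.

```python
from typing import Dict, List, Set


class AclTmpShare:
    """One container's private /tmp directory, shared read-only under an ACL."""

    def __init__(self, owner: str) -> None:
        self.owner = owner
        self.files: Dict[str, bytes] = {}
        self.acl: Dict[str, Set[str]] = {}

    def create(self, name: str, data: bytes) -> None:
        self.files[name] = data
        self.acl[name] = set()  # new files are visible to no other container

    def grant(self, name: str, container: str) -> None:
        """In this sketch, called only on behalf of the desktop daemon."""
        self.acl[name].add(container)

    def listdir(self, container: str) -> List[str]:
        if container == self.owner:
            return sorted(self.files)
        return sorted(n for n, allowed in self.acl.items() if container in allowed)

    def read(self, name: str, container: str) -> bytes:
        if container != self.owner and container not in self.acl.get(name, set()):
            raise FileNotFoundError(name)  # the file appears not to exist
        return self.files[name]
```

In the scenario above, a viewer container that was handed only the malicious PDF sees a directory with that single file, and the bank statement stored beside it remains invisible.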

Apiary enables the file explorer container discussed in Section 7.1 in a similar

way. The file explorer container is set up like all other containers in Apiary. It is

fully isolated from the rest of the containers and users interact with it via the regular

display viewer. It differs from the rest of the containers in that other containers are

not fully isolated from it. This is necessary as users can store their files in multiple

locations, most notably, the container’s /tmp directory and the user’s home directory.

Apiary’s file explorer provides read-write access to each of these areas as a file share

within the file explorer’s FS namespace. Apiary prevents any executable located

within these file systems from executing within the file explorer container to prevent

malicious programs from exploiting it. Users are able to use normal copy/paste

semantics to move files among containers. While this is more involved than a normal

desktop with only a single namespace, users generally do not have to move files among

containers.

The primary situation in which users might desire to move files between containers

is when interacting with an ephemeral container, as a user might want to preserve

a file from there. For instance, a user can run their web browser in an ephemeral

container to maintain privacy, but also download a file they want to keep. While the

ephemeral container is active, a user can just use the file explorer to view all active

containers. To avoid situations where the user only remembers after terminating the

ephemeral container that it had files they wanted to keep, Apiary archives all newly

created or modified non-hidden files that are accessible to the file explorer when the

ephemeral container terminates. This allows a user to gain access to them even after

the ephemeral container has terminated. Apiary automatically trims this archive if


no visible data was stored within the ephemeral container, such as in the case of an

ephemeral web browser that the user only used to view a web page, and did not save

a specific file. Similarly, Apiary provides the user the ability to trim the archive to

remove ephemeral container archives that do not contain data they need.
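
The archiving rule can be sketched as a filter over the files an ephemeral container created or modified. The sketch assumes that a hidden file is one whose path contains a component beginning with a dot, which is the usual Unix convention but is not spelled out above; the function names are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, List


def is_hidden(path: str) -> bool:
    """Assumed convention: any path component starting with '.' is hidden."""
    return any(part.startswith(".") for part in PurePosixPath(path).parts)


def files_to_archive(changed_paths: Iterable[str]) -> List[str]:
    return sorted(p for p in changed_paths if not is_hidden(p))


def archive_on_exit(archives: Dict[str, List[str]],
                    container_id: str,
                    changed_paths: Iterable[str]) -> None:
    """Keep an archive only if the container left visible data behind."""
    kept = files_to_archive(changed_paths)
    if kept:
        archives[container_id] = kept
```

A browser session that only wrote to its hidden profile directory therefore leaves no archive, while a session in which the user saved a document does.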

Apiary also turns the desktop viewer into an inter-process communication (IPC)

proxy that can enable IPC states to be shared among containers in a controlled

and secure manner. This means that only an explicitly allowed IPC state is shared.

For example, one of the most basic ways desktop applications share state is via the

shared desktop clipboard. To handle the clipboard, each container’s desktop daemon

monitors the clipboard for changes. Whenever a change is made to one container’s

clipboard, this update is sent to the Apiary viewer and then propagated to all the

other containers. The Apiary viewer also keeps a copy of the clipboard so that any

future container can be initialized with the current clipboard state. This enables

users to continue to use the clipboard with applications in different containers in a

manner consistent with a traditional desktop. This model can be extended to other

IPC states and operations.
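
The clipboard propagation described above can be sketched as a small proxy. This is a minimal model under our own assumptions: per-container clipboards are dictionary entries, the monitoring daemons are reduced to an on_change call, and the ClipboardProxy name is hypothetical.

```python
from typing import Dict, Optional


class ClipboardProxy:
    """Viewer-side proxy that relays clipboard updates between containers."""

    def __init__(self) -> None:
        self.current: Optional[str] = None             # the viewer's own copy
        self.containers: Dict[str, Optional[str]] = {}  # container -> its clipboard

    def attach(self, name: str) -> None:
        # A newly created container starts with the current clipboard state.
        self.containers[name] = self.current

    def on_change(self, source: str, data: str) -> None:
        # A container reported a clipboard change; propagate it to all containers.
        self.current = data
        for name in self.containers:
            self.containers[name] = data
```

Only the clipboard contents cross the container boundary here, which reflects the intent that the viewer shares explicitly allowed state and nothing else.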


We have implemented a remote desktop Apiary prototype system for Linux desktop

environments. The prototype consists of a virtual display driver for the X window sys-

tem that provides a virtual display for individual containers based on MetaVNC [140],

a set of user space utilities that enable container integration and a loadable kernel

module for the Linux 2.6 kernel that provides the ability to create and mount VLFSs.

Apiary uses a Linux container-like mechanism to provide the isolated containers [100]

and the VLFS.


Our prototype’s VLFS layer repository contained 214 layers created by converting

the set of Debian packages needed by the set of applications we tested into individual

layers. Using these layers, we are able to create per-application appliances for each in-

dividual application by simply selecting which high level applications we want within

the appliance, such as Firefox, with the dependencies between the layers ensuring that

all the required layers are included. Using these appliances, we are able to instantly

provision persistent and ephemeral containers for the applications as needed.
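Selecting a high-level application and letting layer dependencies pull in everything else amounts to a transitive closure over the dependency graph. The following is a minimal sketch under that assumption; the function `resolve_layers` and the layer names in `DEPENDS` are hypothetical and do not reflect the actual repository contents or metadata format.

```python
def resolve_layers(selected, depends):
    """Return the selected layers plus all of their transitive dependencies."""
    required = set()
    stack = list(selected)
    while stack:
        layer = stack.pop()
        if layer in required:
            continue  # already handled; also guards against dependency cycles
        required.add(layer)
        stack.extend(depends.get(layer, ()))
    return required


# Hypothetical dependency metadata, one entry per layer.
DEPENDS = {
    "firefox": ["libgtk", "libc"],
    "libgtk": ["libglib", "libc"],
    "libglib": ["libc"],
    "libc": [],
    "xpdf": ["libc"],
}

appliance = resolve_layers(["firefox"], DEPENDS)
print(sorted(appliance))
```

Because the result is a set of layer names rather than a copy of their contents, two appliances that both need a given layer simply reference it.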

Using this prototype, we used real exploits to evaluate Apiary’s ability to contain

and recover from attacks. We conducted a user study to evaluate Apiary’s ease of

use compared to a traditional desktop. We also measured Apiary’s performance with

real applications in terms of runtime overhead, startup time and storage efficiency.

For our experiments, we compared a plain Linux desktop with common applications

installed to an Apiary desktop that has applications available to be used in persistent

and ephemeral containers. The applications we used are the Pidgin instant messenger,

the Firefox web browser, the Thunderbird email client, the OpenOffice.org office suite,

the MPlayer media player and the Xpdf PDF viewing program. Experiments were

conducted on an IBM HS20 eServer blade with dual 3.06 GHz Intel Xeon CPUs and

2.5 GB RAM. All desktop application execution occurred on the blade. Participants

in the usage study connected to the blade via a Thinkpad T42p laptop with a 1.8

GHz Intel Pentium-M CPU and 2GB of RAM running the MetaVNC viewer.

7.3.1 Handling Exploits

We tested two scenarios that illustrate Apiary’s ability to contain and recover from a

desktop application exploit, as well as explore how different decisions can affect the

security of Apiary’s containers.


7.3.1.1 Malicious Files

Many desktop applications have been shown to be vulnerable to maliciously created

files that enable an attacker to subvert the target machine and destroy the data.

These attacks are prevalent on the Internet, as many users will download and view

whatever files are sent to them. To demonstrate this problem, we use two malicious

files [62, 64] that exploit old versions of Xpdf and mpg123 respectively. The mpg123

program was stored within the MPlayer container. The mpg123 exploit works by

creating an invalid mp3 file that triggers a buffer overflow in old versions of mpg123,

enabling the exploit to execute any program it desires. The Xpdf exploit works by

exploiting a behavior of how Xpdf launched helper programs, that is, by passing a

string to sh -c. By including a back-tick (` `) string within a URL embedded in

the PDF file, an attacker could get Xpdf to launch unknown programs. Both of these

exploits are able to leverage sudo to perform privileged tasks, in this case, deleting the

entire file system. Sudo is exploited because popular distributions require users to use

it to gain root privileges and have it configured to run any applications. Additionally,

sudo, by default, caches the user’s credentials to avoid needing to authenticate the

user each time it needs to perform a privileged action. However, this enables local

exploits to leverage the cached credentials to gain root privileges.
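The helper-launch weakness described above is an instance of shell command injection. The sketch below only builds command strings and never executes them. The helper name `browser-helper` and the exact way the URL is interpolated are assumptions for illustration, not Xpdf's actual code.

```python
import shlex


def unsafe_helper_command(url):
    # Interpolating the URL into a string meant for `sh -c` lets the shell
    # interpret it. Back-ticks are expanded even inside double quotes.
    return f'browser-helper "{url}"'


def quoted_helper_command(url):
    # shlex.quote wraps the URL in single quotes, so the shell treats the
    # back-ticks as literal characters.
    return "browser-helper " + shlex.quote(url)


def helper_argv(url):
    # Passing an argument vector avoids the shell entirely.
    return ["browser-helper", url]


url = "http://example.com/`touch owned`"
print(unsafe_helper_command(url))
print(quoted_helper_command(url))
print(helper_argv(url))
```

Either safer form closes this particular hole, but neither limits the damage once some other bug is exploited, which is the case the containers are meant to handle.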

In the plain Linux system, recovering from these exploits required us to spend a

significant amount of time reinstalling the system from scratch, as we had to install

many individual programs, not just the one that was exploited. Additionally, we

had to recover a user’s 23GB home directory from backup. Reinstalling a basic

Debian installation took 19 minutes. However, reinstalling the complete desktop

environment took a total of 50 minutes. Recovering the user’s home directory, which

included multimedia files, research papers, email and many other assorted files, took


an additional 88 minutes when transferred over a Gbps LAN.

Apiary protected the desktop and enabled easier recovery. It protected the desktop

by letting the malicious files be viewed within an ephemeral container. Even though

the exploit proceeded as expected and deleted the container’s entire file system, the

damage it caused is invisible to the user, because that ephemeral container was never

to be used again. Even when we permitted the exploit to execute within a persistent

container, Apiary enabled significantly easier recovery from the exploit. As shown in

Table 7.2, Apiary can provision a file system in just a few milliseconds. This is nearly

6 orders of magnitude faster than the traditional method of recovering a system by

reinstallation. Furthermore, Apiary’s persistent containers divide up home directory

content between them, eliminating the need to recover the entire home directory if

one application is exploited.
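The order-of-magnitude comparison can be checked with a few lines of arithmetic. The sketch assumes the 50-minute full desktop reinstallation reported above and a provisioning time of 5 ms, the Apiary figure from Table 7.2; the variable names are illustrative.

```python
import math

reinstall_s = 50 * 60      # full desktop reinstallation, in seconds
provision_s = 0.005        # VLFS provisioning time from Table 7.2

speedup = reinstall_s / provision_s
orders = math.log10(speedup)

print(f"speedup: {speedup:,.0f}x, about {orders:.2f} orders of magnitude")
```

The ratio comes out at roughly 600,000, or a little under six orders of magnitude, and that is before counting the 88 minutes needed to restore the home directory.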

This also shows how persistent containers can be constructed in a more secure

manner to prevent exploits from harming the user. As a large amount of the above

user’s data, such as media files, is only accessed in a read-only manner, the data can

be stored on file system shares. This enables the user to allow the different containers

to have different levels of access to the share. The file explorer container can access

it in a read-write manner, enabling a user to manage the contents of the file system

share, while the actual applications that view these files can be restricted to accessing

them in a read-only manner, protecting the files from exploits.

7.3.1.2 Malicious Plugins

Applications are also exploited via malware that users are tricked into downloading

and installing. This can be an independent program or a plugin that integrates with

an already-installed application. For example, malicious attackers can try to convince

users to download a “codec” they need to view a video. Recently, a malicious Firefox


extension was discovered [31] that leverages Firefox’s extension and plugin mechanism

to extract a user’s banking username and password from the browser when the user

visits their bank’s website and sends the information to the attacker. These attacks

are common because users are badly conditioned to allow a browser to install what

it needs when it asks to install something.

In a traditional environment, this malicious extension persists until it is discovered

and removed. As it does not affect regular use of the browser, there is very little to

alert users that they have been attacked. As this exploit is not readily available to

the public, we simulated its presence with the non-malicious Greasemonkey Firefox

extension. Much like the malicious file example, Apiary prevented the extension from

persisting when installed into an ephemeral container. Even when a user allowed the

installation of the extension, it did not persist to future executions of Firefox.

However, this exploit poses a significant risk if it enters the user’s persistent web

browser container. While one might expect Firefox extensions to be uninstallable

through Firefox’s extension manager, this is only true of extensions that are installed

through it. If an extension is installed directly into the file system, it cannot be

uninstalled this way. Although it can be disabled, it must later be removed from the

file system. This applies equally to Apiary and traditional machines. While users can

quickly recreate the entire persistent Firefox container, that requires knowing that the

installation was exploited. Apiary handles this situation more elegantly by allowing

the user to use Firefox in multiple web browsing containers. In this case, we created

a general-purpose web browsing container for regular use, as well as a financial web

browsing container for the bank website only. Apiary refused to install any add-ons

in the financial web browsing container, keeping it isolated and secure even when the

general-purpose web browsing container was compromised.

Apiary enables the creation of multiple independent application containers, each


containing the same application, but performing different tasks, such as visiting a

bank website. Because the great majority of the VLFS’s layers are shared, the user

incurs very little cost for these multiple independent containers. This approach can

be extended to other related but independent tasks, for instance, using a media player

to listen to one’s personal collection of music, as opposed to listening to Internet radio

from an untrusted source.

This scenario also reveals a problem with how plugins and other extensions are

currently handled. When the browser provides its own package management interface

independent of the system’s built-in package manager, this affects Apiary,

because certain application extensions might be needed in an ephemeral container,

but if they are not known to the package manager, they cannot be easily included.

Even today, however, many plugins and browser extensions are globally installable

and manageable via the package manager itself in systems like Debian. In these

systems, this yields the benefit that when multiple users wish to use an extension, it

only has to be installed once. In Apiary, it additionally provides the benefit that it

can become part of the application container’s definition, making it available to the

ephemeral container without requiring it to be manually installed by the user on each

ephemeral execution.

Similarly, one can create containers with functionality provided by other

containers. A LaTeX paper-writing container can provide Emacs, LaTeX and a PDF viewer.

This PDF viewer is separate from the primary PDF container and its ephemeral in-

stances. This demonstrates how application containers can be designed to deliver a

specific functionality even when it overlaps with that of other parts of the system.

A user would want to include the PDF viewer within the LaTeX container, as it is a

primary component of the paper-writing process, and not just a helper application

to be isolated. But as this copy of Xpdf is not made into a global program, no appli-


cation will call into this container. Because the layers are shared between containers,

it costs nothing to include it in the LaTeX container. If Xpdf were not in the LaTeX

container, users would have to go through multiple steps of copying the generated

PDF files to the PDF container to view them, as papers are not generally kept in the

/tmp directory.

7.3.2 Usage Study

We performed a usage study that evaluated the ability of users to use Apiary’s con-

tainerized application model with our prototype environment, focusing on their abil-

ity to execute applications from within other programs. Participants were mostly

recruited from within our local university, including faculty, staff and students. All of

the users were experienced computer users, including many experienced Linux users.

24 participants took part in the study.

For our study, we created three distinct environments. The first was a plain Linux

environment running the Xfce4 desktop. It provided a normal desktop Linux expe-

rience with a background of icons for files and programs and a full-fledged panel

application with a menu, task switcher, clock and other assorted applets. Second

was a full Apiary environment. It provided a much sparser experience, as the current

Apiary prototype only provides a set of applications and not a full desktop environ-

ment. Finally we supplied a neutered Apiary environment that differs from the full

environment in not launching any child applications within ephemeral containers.

The three environments enable us to compare the participants’ experience along

two axes. First, we can compare the plain Linux environment, where each application

is only installed once and always run from the same environment, to the neutered

Apiary environment, where each application is also only installed once and run from


the same environment. This allows us to measure the cost of using the Apiary viewer,

with its built-in taskbar and application menu, against plain Linux, where the taskbar

and application menu are regular applications within the environment. Second, the

full and neutered Apiary desktops enable us to isolate the actual and perceived cost to

the participants of instantiating ephemeral containers for application execution. We

presented the environments to the participants in random order and iterated through

all 6 permutations equally.
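Presenting three environments in all six possible orders equally often is a standard counterbalancing scheme. A minimal sketch follows, assuming the 24 participants are split evenly across the permutations; the function name, the seed and the shuffling step are illustrative assumptions rather than the study's actual procedure.

```python
import random
from collections import Counter
from itertools import permutations

ENVIRONMENTS = ("plain Linux", "neutered Apiary", "full Apiary")


def assign_orders(n_participants, seed=0):
    """Give each participant one presentation order, using every order equally."""
    orders = list(permutations(ENVIRONMENTS))      # 3! = 6 possible orders
    if n_participants % len(orders) != 0:
        raise ValueError("participants must be a multiple of the order count")
    schedule = orders * (n_participants // len(orders))
    random.Random(seed).shuffle(schedule)          # randomize who gets which order
    return schedule


schedule = assign_orders(24)
counts = Counter(schedule)
for order, n in counts.items():
    print(n, " -> ".join(order))
```

With 24 participants, each of the six orders is used four times, so learning effects from seeing one environment before another are spread evenly across the three environments.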

We timed the participants as they performed a number of specific multi-step tasks

in each environment that were designed to measure the overhead of using multiple

applications that needed to interact with one another. In summary, the tasks were:

(1) download and view a PDF file with Firefox and Xpdf and follow a link embedded

in the PDF back to the web; (2) read an email in Thunderbird that contains an

attachment that is to be edited in OpenOffice and returned to the sender; (3) create

a document in OpenOffice that contains text copied and pasted from the web and

sent by email as a PDF file; (4) create a “Hello World” web page in OpenOffice and

preview it in Firefox; and (5) launch a link received in the Pidgin IM client in Firefox.

As Figure 7.2 shows, the average time to complete each task, when averaged over

all the users doing tasks in random order, only differed by a few seconds in any di-

rection for all tasks in all environments. Figure 7.2 shows that, in all cases, users

performed their tasks more quickly in the neutered Apiary environment than in the plain

Linux environment. This indicates that Apiary’s simpler environment is actually

faster to use than the plain Linux environment with its bells and whistles like ap-

plication launchers and applets running within taskbar panels. While this may seem

strange initially, it is perfectly understandable. Many environments that are simple

to use with minimal distractions, for example, the command line, are faster, but less

user-friendly, than others. Moreover, even though users were a little slower in the


Figure 7.2 – Usage Study Task Times (bar chart; y-axis: Time (s), 0 to 100; x-axis: Tasks 1 to 5; series: Plain Linux, Persistent, Ephemeral)

full Apiary environment than in the neutered version, they were still generally faster

than in the plain Linux environment. This indicates that while the full Apiary envi-

ronment has a small amount of overhead, in practice, users are just as effective there

as in the plain Linux environment.

We also asked participants to rate their perceived ease of use of each environment. Most

users perceived the prototype environments to be as easy to use as the plain Linux

environment. While some users preferred the polish of the plain Linux environment,

more preferred the simplicity of the environment provided by Apiary. Most users

could not discern a difference between the full and neutered Apiary desktops.

We also asked the participants a number of questions, including whether they

could imagine using the Apiary environment full-time, and whether they would prefer

to do so if it would keep their desktop more secure. All of the participants expressed

a willingness to use this environment full-time, and a large majority indicated that


Test     Description
Untar    Untar a compressed Linux 2.6.19 kernel source code archive
Gzip     Compress a 250 MB Linux kernel source tar archive
Octave   Octave 3.0.1 (MATLAB 4 clone) running a numerical benchmark [68]
Kernel   Build the 2.6.19 kernel

Table 7.1 – Application Benchmarks

they would prefer to use Apiary over the plain Linux environment if it would keep

their applications more secure. The majority of those who would not prefer Apiary

expressed concern with bugs they perceived in the prototype. In addition, a few

expressed interest in the system, but said that their preference would depend on the

level of security they expected from the computer they were using.

7.3.3 Performance Measurements

7.3.3.1 Application Performance

To measure the performance overhead of Apiary on real applications, we compared

the runtime performance of a number of applications within the Apiary environment

against their performance in a traditional environment.

Table 7.1 lists our application tests. We focus mostly on file system benchmarks,

as others have shown [27, 100] that display and operating system virtualization have

little overhead. The untar benchmark tests file creation and throughput, while the gzip benchmark tests file

system throughput and computation. The Octave benchmark is a pure computation

benchmark. The kernel build benchmark tests computation as well as stressing the file

system, because of the large number of lookups that occur due to the large size of the

kernel source tree and the repeated execution of the preprocessor, compiler and linker.

To stress the system with many containers and provide a conservative performance

measure, each test was run in parallel with 25 instances. To avoid out-of-memory


Figure 7.3 – Application Performance with 25 Containers (bar chart; y-axis: Time (s), 0 to 1400; x-axis: Untar, Gzip, Octave, Kernel; series: Plain, Apiary)

conditions, as the Octave benchmark requires 100-200 MB of memory at various points

during its execution, we ran the benchmarks staggered 5 seconds apart to ensure they

kept their phases of high memory usage apart and avoided the benchmarks being killed

by Linux’s out-of-memory handler. As is shown in Figure 7.3, Apiary imposes almost

no overhead in most cases, with about 10% overhead in the kernel build case, with

the VLFS’s constant need to perform lookups on the file system incurring an extra

cost. This demonstrates that Apiary is able to scale to a large number of concurrent

containers with minimal overhead.
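The staggered launch of 25 parallel benchmark instances can be sketched as a small driver loop. This is an illustrative sketch rather than the harness actually used: `run_staggered` is a hypothetical helper, and the demonstration substitutes a fake sleep function so that it finishes immediately.

```python
import time


def run_staggered(launch, instances=25, stagger_s=5.0, sleep=time.sleep):
    """Start `instances` jobs, waiting `stagger_s` seconds between starts.

    `launch(i)` should start job i without blocking (for example by calling
    subprocess.Popen inside a container) and return a handle for it.
    """
    handles = []
    for i in range(instances):
        if i > 0:
            sleep(stagger_s)       # offset each start from the previous one
        handles.append(launch(i))
    return handles


# Demonstration with stand-ins: record the waits instead of really sleeping.
waits = []
started = run_staggered(lambda i: f"benchmark-{i}", sleep=waits.append)
print(len(started), "instances started; last one begins", sum(waits), "s after the first")
```

With 25 instances and a 5-second stagger, the last instance starts 120 seconds after the first, so the memory-hungry phases of successive runs are offset rather than coinciding.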

7.3.3.2 Container Creation

For ephemeral containers to be useful, container instantiation must be quick. We

measured this cost in two ways: first, how long it takes to instantiate its VLFS, and

second, how long the application takes to start up within the container. We quantify

how long it takes to instantiate a container and compare Apiary to other common


          Pidgin    Firefox   T-Bird    OOffice   Xpdf      MPlayer
Create    317 s     276 s     294 s     365 s     291 s     294 s
Extract   82 s      86 s      87 s      150 s     81 s      81 s
FS-Snap   0.016 s   0.015 s   0.016 s   0.020 s   0.009 s   0.010 s
Apiary    0.005 s   0.005 s   0.005 s   0.005 s   0.005 s   0.005 s

Table 7.2 – File System Instantiating Times

approaches. We compare how long it takes to set up a VLFS against how long it

takes to set up a container file system using Debian’s traditional bootstrapping tools

(Create), how long it would take to extract the same file system from a tar archive

(Extract), and how long it takes a file system with a snapshot operation to create

a new snapshot and branch of a preexisting file system namespace (FS-Snap), as

shown in Table 7.2. To minimize network effects with the bootstrapping tools, we

used a local Debian mirror on the local 100Mbps campus network, and were able to

saturate the connection while fetching the packages to be installed.

Table 7.2 shows that Apiary instantiates containers with a VLFS composed of

nearly 200 layers almost instantaneously. This compares very favorably with

traditional ways of setting up a system. Table 7.2 shows that it takes a significant amount

of time to create a file system for the application container using Debian’s bootstrap-

ping tool, and even extracting a tar archive takes a significant amount of time as

well. This discourages creating ephemeral application containers, as users will not

want to wait minutes for their applications to start. Tar archives also suffer from

their need to be actively maintained and rebuilt whenever they need fixes. Therefore,

the amount of administrative work increases linearly with the number of applications

in use. As Apiary creates the file system nearly instantaneously, it is able to support

the creation of ephemeral application containers with no noticeable overhead to the

users. While Table 7.2 shows that file systems (in this case Btrfs) with a snapshot

and branch operation can also perform it quickly, the user would have to manage


Figure 7.4 – Application Startup Time (bar chart; y-axis: Time (s), up to 20; x-axis: Pidgin, Firefox, T-Bird, OOffice, MPlayer, Xpdf; series: Plain (C), Persistent (C), Plain (W), Persistent (W), Ephemeral; C = cold cache, W = warm cache)

each of the application’s independent file systems separately.

To quantify startup time, we measured how long it takes for the application to open

and then be automatically closed. In the case of Firefox, Xpdf and OpenOffice.org,

this includes the time it takes to display the initial page of a document, while Pidgin,

MPlayer and Thunderbird are only loading the program. For ephemeral containers,

we measure the total time it takes to set up the container and execute the application

within it. Ephemeral containers differ from persistent containers only in the time it

takes to set up the new ephemeral container, which is never a cold-cache operation

because the system is already in use. We compare these results to cold and warm cache

application startup times for both plain Linux and Apiary’s persistent containers.

We include cold cache results for benchmarking purposes and warm cache results to

demonstrate the results users would normally see.

As Figure 7.4 shows, while running within a container induced some overhead


on startup, it is generally under 25% in both cold and warm cache scenarios. This

overhead is mostly due to the added overhead of opening the many files needed by

today’s complex applications. The most complex application, OpenOffice, requires

the most, while the least complex application, Xpdf, is almost equivalent to the plain

Linux case. In addition, while the maximum absolute extra time spent in the cold

cache case was nearly 5 seconds for OpenOffice, in the warm cache case it dropped

to under 0.5 seconds. In addition, ephemeral containers provide an interesting result.

Even though they have a fresh new file system and would be thought to be equivalent

to a cold cache startup, they are nearly equivalent to the warm cache case. This is

because their underlying layers are already cached by the system due to their uses

by other containers. The ephemeral case has a slightly higher overhead due to the

need to create the container and execute a display server inside of it in addition to

regular application startup. However, as this takes under 10 milliseconds, it adds

only a minimal amount to the ephemeral application startup time.

7.3.4 File System Efficiency

To support a large number of containers, Apiary must store and manage its file system

efficiently. This means that storage space should not significantly increase with an

increasing number of instantiated containers and should be easily manageable in

terms of application updates. For each application’s VLFS, Table 7.3 shows its size,

its number of layers, the amount of state shared with the other application VLFSs,

and the amount of state unique to it. For instance, the 129 layers that make up

Firefox’s VLFS require 353 MB, of which 330 MB are shared with other applications

and 23 MB are unique to the Firefox VLFS. In general, as Table 7.3 shows, there

is a lot of duplication among the containers, as the layer repository of 214 distinct


           Repo     Pidgin   Firefox  T-Bird   OOffice  Xpdf     MPlayer
Size       743 MB   394 MB   353 MB   367 MB   645 MB   339 MB   355 MB
# Layers   -        147      129      125      186      130      162
Shared     -        322 MB   330 MB   335 MB   329 MB   330 MB   326 MB
Unique     -        72 MB    23 MB    32 MB    316 MB   9 MB     29 MB

Table 7.3 – Apiary’s VLFS Layer Storage Breakdown

        Single FS   Multiple FSs   VLFSs
Size    743 MB      2.1 GB         743 MB

Table 7.4 – Comparing Apiary’s Storage Requirements Against a Regular Desktop

            Traditional   Apiary
Avg. Time   18 s          0.12 s

Table 7.5 – Update Times for Apiary’s VLFSs

layers needed to build the different VLFSs for the different applications is the same

magnitude as the largest application.
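The shared and unique columns of Table 7.3 can be understood as set operations over each application's layer list. The sketch below assumes that "shared" means the layer is also used by at least one other application's VLFS. The layer names and sizes are made up for illustration and are not taken from the repository.

```python
def storage_breakdown(app_layers, layer_size_mb):
    """Split each application's VLFS size into shared and unique parts."""
    result = {}
    for app, layers in app_layers.items():
        # All layers used by any other application.
        others = set().union(*(l for a, l in app_layers.items() if a != app))
        shared = sum(layer_size_mb[x] for x in layers & others)
        unique = sum(layer_size_mb[x] for x in layers - others)
        result[app] = {"size": shared + unique, "shared": shared, "unique": unique}
    return result


# Hypothetical layers and sizes in MB.
LAYER_SIZE_MB = {"libc": 10, "gtk": 30, "firefox": 23, "xpdf": 9}
APP_LAYERS = {
    "firefox": {"libc", "gtk", "firefox"},
    "xpdf": {"libc", "gtk", "xpdf"},
}

breakdown = storage_breakdown(APP_LAYERS, LAYER_SIZE_MB)
repository_mb = sum(LAYER_SIZE_MB[x] for x in set().union(*APP_LAYERS.values()))
separate_fs_mb = sum(entry["size"] for entry in breakdown.values())
print(breakdown)
print("repository:", repository_mb, "MB; separate file systems:", separate_fs_mb, "MB")
```

In this toy data the repository stores each layer once, while independent file systems would store the shared layers once per application, which is the effect Table 7.4 reports at full scale.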

Table 7.4 shows that using individual VLFSs for each application container con-

sumes approximately the same amount of file system space as a regular desktop file

system containing all the applications because each layer only has to be stored once.

This is in contrast to the traditional method of provisioning multiple independent file

systems for each application container, which consumes a significantly larger amount

of disk space. Similarly, if multiple desktops are provided on a server, the VLFS usage

would remain constant with the size of the repository, while the other cases would

grow linearly with the number of desktops.

To demonstrate how Apiary allows users to maintain their many containers ef-

ficiently, we instantiated one container for each of the five applications previously

mentioned. When a security update was necessary [146], we applied the update to

each container. Table 7.5 shows the average times for the five application container

file systems. This demonstrates that while individual updates by themselves are not

too long, when there are multiple container file systems for each individual user, the


Figure 7.5 – Postmark Overhead in Apiary (bar chart; y-axis: Time (s), 0 to 2000; benchmark: Postmark; series: Plain Desktop and the Pidgin, Firefox, T-Bird, OpenOffice, Xpdf and MPlayer VLFSs)

amount of time to apply common updates will rise linearly, and as the traditional

method is two orders of magnitude slower than Apiary, it will be impacted to a

much greater extent.
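The linear growth in update cost can be made concrete with the averages from Table 7.5. The sketch assumes that total update time is simply the per-file-system average multiplied by the number of container file systems; the function name is illustrative.

```python
TRADITIONAL_S = 18.0   # average update time per file system, Table 7.5
APIARY_S = 0.12        # average update time per VLFS, Table 7.5


def total_update_time(per_fs_seconds, n_filesystems):
    """Total time to apply one update to every container file system."""
    return per_fs_seconds * n_filesystems


ratio = TRADITIONAL_S / APIARY_S
for n in (5, 50, 500):
    print(n, total_update_time(TRADITIONAL_S, n), round(total_update_time(APIARY_S, n), 2))
print(f"ratio: about {ratio:.0f}x")
```

Under this assumption, both totals grow linearly with the number of file systems, but the traditional total is about 150 times larger at every size.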

7.3.5 File System Virtualization Overhead

To measure the virtualization cost of VLFS in the Apiary operating system virtual-

ization environment, we re-ran the benchmarks from Chapter 6. These benchmarks

differ from Chapter 6 in that they are not run within a hardware virtual machine,

but rather within an operating system virtualization namespace, and that instead of

the backing store of the VLFS being on a fast SAN device, they are on the slower

host machine disks.

Figure 7.5 shows that Postmark runs faster within a plain Linux environment

than when run within the VLFS. However, it should be noted that these results show


Figure 7.6 – Kernel Build Overhead in Apiary (bar chart; y-axis: Time (s), 0 to 300; benchmark: Kernel Build; series: Plain Desktop and the Pidgin, Firefox, T-Bird, OpenOffice, Xpdf and MPlayer VLFSs)

significantly less overhead than those in Chapter 6. This is because even though

the disks are slower, as indicated by the plain Linux results, the operating system

virtualization overhead is minimal compared to the overhead imposed by the virtual

machine monitor in Chapter 6. Most notable is a decrease in memory pressure which

enables the VLFS to operate more efficiently because more data can remain cached.

Figure 7.6 shows similar results with the multi-threaded build of the Linux 2.6.18.6

kernel. In Chapter 6, the VLFS showed a 5% overhead; here, overhead is essentially

zero. Even though the SAN’s file system, used for the tests in Chapter 6, is signif-

icantly faster than the blade’s file system, the results here are much faster overall.

This again indicates the amount of overhead imposed by virtual machine monitors

over operating system virtualization.


7.4 Related Work

Isolation mechanisms such as VMs [143, 147] and OS containers [7, 116] are com-

monly used to increase the security of applications. However, if used for desktop

applications, this isolation prevents an integrated desktop experience. Products like

VMware’s Unity [143] attempt to solve part of this issue by combining the applica-

tions from multiple VMs into a single display with a single menu and taskbar, as well

as providing file system sharing between host and VMs. The applications, however,

are still fully isolated from one another, preventing them from leveraging other ap-

plications installed into separate VMs. While VMs provide superior isolation, they

suffer higher overhead due to running independent operating systems. This impacts

performance and makes them less suited for ephemeral usage on account of their long

startup times. However, Apiary can leverage them if one does not want to trust a

single operating system kernel.

Tahoma [122] is similar to Apiary in that it creates fully isolated application

environments that remain part of a single desktop environment. Tahoma creates

browser applications that are limited to certain resources, such as certain URLs, and

that are fully isolated from each other. However, it only provides these

isolated application environments for web browsers. It does not provide any way

to integrate these isolated environments and does not provide ephemeral application

environments. Google’s Chrome web browser [66] builds upon some of these ideas to

isolate web browser pages within a single browser. While its multiple-process model uses OS

mechanisms to isolate separate web pages that are concurrently viewed, the browser as a whole is not

isolated from the system itself. For instance, any plugin that is executed


has the same access to the underlying system as does the user running the browser.

Modern web browsers improve privacy by providing private browsing modes that

prevent browser state from being committed to disk. While they serve a similar

purpose to ephemeral containers, private browsing is fundamentally different. First,

it has to be written into the program itself. Many different types of programs have

privacy modes to prevent them from recording state and this model requires them

to implement it independently. Second, it only provides a basic level of privacy.

For instance, it cannot prevent a plugin from writing state to disk. Furthermore, it

makes the entire browser and any helper program or plugin that it executes part of the

trusted computing base (TCB). This means that the user’s entire desktop becomes

part of the TCB. If any of those elements gets exploited, no privacy guarantees can

be enforced. Apiary’s ephemeral containers make the entire execution private and

support any application with a state a user desires to remain private without any

application modifications. It also keeps the TCB much smaller, by only requiring that

the underlying OS kernel and the minimal environment of Apiary’s system daemon

be trusted.

Lampson’s Red/Green isolation [82] and WindowBox [23] resemble Apiary’s abil-

ity to run multiple applications in parallel. These isolation schemes involve users run-

ning two or more separate environments, for instance, a red environment for regular

usage and a green environment for actions requiring a higher level of trust. However,

unlike Apiary’s ephemeral containers, if an exploit can enter the green container, it

will persist. Furthermore, by requiring two separate virtual machines, one increases

the amount of work a user has to do to manage their machines. Apiary, by leveraging

the VLFS, minimizes the overhead required to manage multiple machines.

Storage Capsules [33] also attempt to mitigate this problem by securely running the

applications requiring trust in the same operating system environment as the un-


trusted applications, while keeping their data isolated from one another. However,

this involves significant startup and teardown costs for each execution within a secure

storage capsule.

File systems and block devices with branching or COW semantics [32,103,128] can

be used to create a fresh file system namespace for a new container quickly. However,

these file systems do not help to manage the large number of containers that exist

within Apiary. Because each container has a unique file system with different sets

of applications, administrators must create individual file systems tailored to each

application. They cannot create a single template file system with all applications

because applications can have conflicting dependency requirements or desire to use

the same file system path locations. Furthermore, if all applications are in a single file

system, they are not isolated from each other. This results in a set of space-inefficient

file systems, as each file system has an independent copy of many common files. This

inefficiency also makes management harder. When security holes are discovered and

fixed, each individual file system must be updated independently.

Many systems have been created that attempt to provide security through iso-

lation mechanisms [17, 30, 49, 84, 86, 118, 144]. All these systems differ from Apiary

in that they try to isolate the many different components that make up a standard

fully-integrated single system using sets of rules to determine which of the machine’s

resources the application should be able to access. This often results in one of two

outcomes. First, a policy is created that is too strict and does not let the application

run correctly. Second, a policy is created that is too lenient and lets an exploited ap-

plication interact with data and applications it should not be able to access. Apiary,

on the other hand, forces each component to be fully isolated within its own container

before determining on which levels it should be integrated. As each container provides

all the resources that the application needs to execute in an isolated environment, no


complicated rule sets have to be created to determine what it can access.

Solitude [72] provides isolation via its Isolation File System (IFS), which a user

can throw away. This is similar to Apiary’s ephemeral containers. However, the IFSs

are not fully isolated. First, Solitude does not create a new IFS for each application

execution. Second, the IFS is built on top of a base file system with which it can

share data, breaking the isolation. To handle this, Solitude implements taint tracking

on files shared with the underlying base file system. This helps determine post facto

what other applications may have been corrupted by a maliciously constructed file.

Similarly, Solitude only provides isolation at the file system level. Because each appli-

cation still shares a single display, malicious and exploited applications can leverage

built-in mechanisms in commodity display architectures [61, 93] to insert events and

messages into other applications sharing the display.

Chapter 8

ISE-T: Two-Person Control

Administration

All organizations that rely on system administrators to manage their machines must

prevent accidental and malicious administrative faults from entering their systems.

As systems become more complex, it gets easier for administrators to make mistakes.

From a security perspective, these complex systems create an environment where it

is easier for a rogue user, whether an insider or outsider, to hide their attacks. For

example, Robert Hanssen, an FBI agent who was a Soviet spy, was able to evade

detection because he was the administrator of some of the FBI’s counterintelligence

computer systems [149]. He could see whether the FBI had identified his drop sites

and if he was being investigated [45].

Most approaches to insider attacks involve intrusion detection or role separation,

both of which are ineffective against rogue system administrators who can replace the

system module that enforces the separation or performs the intrusion detection. This

attack vector was described over thirty years ago by Karger and Schell [101] and still

remains a serious problem. Even if administrators can be trusted, they must deal

with very complicated software, and it is hard to catch mistakes before they cause

problems. If a mistake takes down an important service, the machine may not be

usable or administratable, and malicious attackers can act with impunity.

There are several ways to address faults, including partitioning, restore points and

peer review. One highly effective approach is two-person control [13], for example,

two pilots in an airplane, two keys for a safe deposit box, or running two or more

computations in parallel and comparing the results. We believe this concept can

be extended to problems in system administration by using virtualization to create

duplicate environments.

Toward this end, we created the “I See Everything Twice” [70] (ISE-T, pronounced

“ice tea”) architecture. ISE-T provides a general mechanism to clone execution en-

vironments, independently execute computations to modify the clones, and compare

how the resulting modified clones have diverged. The system can be used in a num-

ber of ways, such as performing the same task in two initially identical clones, or

executing the same computation in the same way in clones with some differences. By

providing clones, ISE-T creates a system where computation actions can be “seen

twice,” applying the concept used for fault-tolerant computing to other forms of two-

person control systems. There is, however, a crucial difference between our use of

replicas and that of fault-tolerant computing. We test for equivalence between two

replicas that may not be identical, rather than simply running two identical replicas

in lockstep and ensuring they remain identical.

By applying the ISE-T architecture to system administration, we are able to in-

troduce the two-person control concept to system administration. As ISE-T allows

a system to be easily cloned into multiple distinct execution domains, we can create

separate cloned environments for multiple administrators. ISE-T can then compare

the sets of changes produced by each administrator to determine if equivalent changes


were made. ISE-T allows administration to proceed in both a fail-safe and an au-

ditable manner.

ISE-T forces administrative acts to be performed multiple times before they are

considered correct. Current systems give full access to the machine to individual

administrators. This means that one person can accidentally or maliciously break

the system. ISE-T offers a new way to avoid this problem. ISE-T does not allow

any administrator to modify the underlying system directly, but instead creates in-

dividual clones for two administrators to work on independently. ISE-T is then able

to compare the changes each administrator performs. If the changes are equivalent,

ISE-T has a high assurance that the changes are correct and will commit them to

the base system. But if it detects discrepancies between the two sets of changes, it

will notify the administrators so that they can resolve the problem. This enables

fail-safe administration by catching accidental errors, while also preventing a single

administrator from maliciously damaging the system.

ISE-T leverages both virtualization and unioning file systems to produce the

clones. ISE-T uses both operating system virtualization, as in Solaris Zones [116]

and Linux VServer [7], and hardware virtualization as in VMware [142], to provide

each administrator with an isolated environment. ISE-T builds upon DejaView [81]

and Strata, using union file systems to yield a layered file system that provides the

initial file system namespace in one layer, while capturing all the system administra-

tor’s file system changes in a separate layer. This allows easy isolation of changes,

simplifying equivalence testing. ISE-T’s requiring everything to be installed twice

blocks many real attacks. A single malicious system administrator cannot create an

intentional back door, weaken firewall rules, or create unauthorized user accounts.

ISE-T is admittedly an expensive solution, too expensive for many commercial

sites. For high-risk situations, such as in the financial, government, and military sec-


tors, the added cost may be acceptable if risk is reduced. In fact, two-person controls

are already routine in those environments, ranging from checks that require two sig-

natures to requiring two people for nuclear weapons work. But we also demonstrate

how ISE-T can be used in a less expensive manner by introducing a form of auditable

system administration. Instead of requiring two system administrators at all times,

ISE-T can save all the changes performed by the system administrator to a log, which

is audited to provide a higher level of assurance that the administrator is behaving

properly.

In a similar manner, ISE-T can be extended to train less experienced system ad-

ministrators. First, ISE-T allows a junior system administrator to perform tasks in

parallel with a more senior system administrator. While only the senior administra-

tor’s solution will be committed to the system, the junior system administrator can

learn from how their solution differs from the senior system administrator’s. Second,

ISE-T can be extended to provide an approval mode, in which a junior system ad-

ministrator is given tasks to complete, but instead of being committed immediately,

they will be presented for the senior system administrator to approve or disapprove.

8.1 Usage Model

Systems managed by ISE-T are used by two classes of users, privileged and unpriv-

ileged. ISE-T does not change how regular users interact with the machine. They

are able to install any program into their personal space and run any program on the

system, including regular programs and setuid UNIX programs such as passwd

that raise the privileges of the process on execution.

However, ISE-T fundamentally changes the way system administrators interact

with the machine. In regular systems, when administrators need to perform maintenance

on the machine, they use their administrative privilege to run arbitrary programs,

for example, by executing a shell or using sudo. In these systems, administrators

can modify the system directly.

[Figure 8.1 – ISE-T Usage Model. Components shown: System, ISE-T Service, Administrative Clone #1, Administrative Clone #2.]

As ISE-T prevents system administrators from executing arbitrary programs with

administrative privileges, the above model will not work with ISE-T. Instead, ISE-T

provides a new approach as shown in Figure 8.1. Instead of administering a sys-

tem directly, ISE-T creates administration clones. Each clone is fully isolated from

others and from the base system. ISE-T instantiates a clone for each administrator.

Once both administrators are finished making changes, ISE-T compares the clones

for equivalence and commits the changes if they pass the test. As opposed to a regu-

lar system, where the administrator can interleave file system changes with program

execution, in ISE-T only file system changes are committed to the underlying system.


Therefore ISE-T requires administrators to use other methods if they require file sys-

tem changes and program execution to be interleaved on the actual system, such as

for rotating log files or exploratory changes to diagnose a subtle system malfunction.

To allow this, ISE-T provides a new ise-t command that is used in a manner

similar to su. Instead of spawning a shell on the existing system, ise-t spawns a

new isolated container for that administrator. This container contains a clone of the

underlying file system. Within this clone, the administrators can perform generic

administrative actions, as on a regular system, but the changes will be isolated to

this new container. When the administrators are finished with their changes, they

exit the new container’s shell, much as they would exit a root shell; the container

itself is terminated, while its file system persists.

ISE-T then compares the changes each administrator performed for equivalence.

ISE-T performs this task automatically after the second administrator exits their

administration session and notifies both of the administrators of the results. If the

changes are equivalent, ISE-T automatically commits the changes to the underly-

ing base system. Otherwise, ISE-T notifies the administrators of the file system

discrepancies that exist between the two administration environments, allowing the

administrators to correct them.
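
The session flow described above can be sketched as a small state machine. This is only an illustration under simplifying assumptions, not ISE-T's implementation: the class `TwoPersonSession` and its method names are hypothetical, and each administrator's file system changes are modeled as a plain dictionary from path to contents.

```python
class TwoPersonSession:
    """Toy model of one two-person administration session (hypothetical API)."""

    def __init__(self, base):
        self.base = dict(base)   # committed state of the base system
        self.changes = {}        # administrator name -> {path: new contents}
        self.done = set()        # administrators who have exited their session

    def new(self, admin):
        # Each administrator gets a private, initially empty change set.
        self.changes[admin] = {}

    def write(self, admin, path, data):
        # Changes stay in the administrator's clone, never in the base.
        self.changes[admin][path] = data

    def finish(self, admin):
        """Mark an administrator as done; test once both have finished."""
        self.done.add(admin)
        if len(self.done) < 2:
            return "waiting"
        first, second = (self.changes[a] for a in sorted(self.done))
        if first == second:
            self.base.update(first)   # equivalent: commit one set of changes
            return "committed"
        # Not equivalent: report the paths on which the two clones disagree.
        return sorted(p for p in first.keys() | second.keys()
                      if first.get(p) != second.get(p))
```

In this sketch the base dictionary is touched only inside `finish`, mirroring the rule that no administrator modifies the underlying system directly.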

Command       Description
ise-t new     Create an administration environment
ise-t enter   Enter administration environment
ise-t done    Ready for equivalence testing
ise-t diff    Results of a failed equivalence test

Table 8.1 – ISE-T Commands

Because ISE-T only looks at file system changes, this can prevent it from per-

forming administrative actions that affect only the runtime of the system. To address

this, ISE-T provides a raw control mechanism via the file system, and allows itself


to be integrated with configuration management systems. First, ISE-T’s raw control

mechanism is implemented via a specialized file system namespace where an adminis-

trator can write commands. For instance, if the administrators want to kill a process,

stop a service or reboot the machine, those actions performed directly within their

administration container will have no effect on the base system. Some actions can

be inferred directly from the file system. For instance, if the system’s set of startup

programs is changed, ISE-T can infer that the service should be started, stopped or

restarted when the changes are committed to the underlying system. But this ap-

proach only helps when the file system is being changed. Sometimes administrators

want to stop or restart services without modifying the file system. ISE-T therefore

provides a defined method for killing processes, stopping and starting services, and

rebooting the machine using files stored on the local file system. ISE-T provides

each administrator with a special /admin directory for performing these predefined

administrative actions.

For example, if the administrator wants to reboot the machine, they create an

empty reboot file in the /admin directory. If both administrators create the file,

the system will reboot itself after the other changes are committed. Similarly, the

administrators can create a halt file to halt the machine. In addition, the /admin

directory has kill and services subdirectories. To kill a process, administrators

create individual files with the names of the process identifiers of processes running

on the base system that they want to kill. Similarly, if a user desires to stop, start, or

restart an init.d service, they create a file named for that service prefixed with stop,

start, or restart, such as stop.apache or restart.apache within the services

directory. ISE-T performs the appropriate actions when the changes are committed

to the base system. The files created within the /admin directory are not committed

to the base system; they are only used for performing runtime changes to the system.
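
The /admin conventions above map naturally onto a small parser. The following sketch is an assumption about how such a directory could be read, not ISE-T's code: the function names and the tuple encoding of actions are hypothetical, and the text does not say how ISE-T treats an action requested by only one administrator, so the sketch simply reports such actions separately.

```python
import os

def read_admin_actions(admin_dir):
    """Return the set of runtime actions requested in one clone's /admin tree."""
    actions = set()
    for name in ("reboot", "halt"):
        if os.path.exists(os.path.join(admin_dir, name)):
            actions.add((name,))
    kill_dir = os.path.join(admin_dir, "kill")
    if os.path.isdir(kill_dir):
        for entry in os.listdir(kill_dir):
            if entry.isdigit():               # file name is a process identifier
                actions.add(("kill", int(entry)))
    services_dir = os.path.join(admin_dir, "services")
    if os.path.isdir(services_dir):
        for entry in os.listdir(services_dir):
            verb, _, service = entry.partition(".")
            if verb in ("stop", "start", "restart") and service:
                actions.add((verb, service))
    return actions

def agreed_actions(admin_dir_a, admin_dir_b):
    """Split requested actions into those both clones asked for and the rest."""
    a = read_admin_actions(admin_dir_a)
    b = read_admin_actions(admin_dir_b)
    return a & b, a ^ b     # (agreed, requested by only one administrator)
```

Only the agreed set would be eligible to run after the file system changes are committed; the files themselves are never copied to the base system.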


Many systems already exist to manage systems and perform these types of tasks,

namely, configuration management systems such as lcfg [19]. At a high level, con-

figuration management systems work by storing configuration information on a cen-

tralized policy server that controls a set of managed clients. In general, the policy

server will contain a set of template configuration files that it uses to create the actual

configuration file for the managed clients based on information contained in its own

configuration. Configuration management systems also generally support the ability

to run predefined programs and scripts and execute predefined actions on the clients

they are managing.

When ISE-T is integrated with any configuration management system, it no longer

manages the individual machines. Instead of the managed clients being controlled

by ISE-T, the configuration policy server is managed by ISE-T and the clients are

managed by the configuration management system. This offers a number of bene-

fits. First, it simplifies the comparison of two different systems, as ISE-T can focus

on the single configuration language of the configuration management system. Second,

configuration management systems already have tools to manage the runtime state of their

client machines, such as stopping and starting services and restarting them when the

configuration changes. Third, many organizations are already accustomed to using

configuration management systems. By implementing ISE-T on the server side, they

can enforce the two-person control model in a more centralized manner.

8.2 ISE-T Architecture

To implement the two-person administrative control semantic, ISE-T provides three

architectural components. First, because the two administrators cannot administer

the system directly, they must be provided with isolated environments in which they


can perform their administrative acts. To ensure isolation, ISE-T provides container

mechanisms that allow ISE-T to create parallel environments based on the underlying

system to be administered. This allows ISE-T to fully isolate each administrator’s

clone environment from each other and from the base system.

Second, we note that any persistent administrative action must involve a change

to the file system. If the file system is not affected, the action will not survive a

reboot. While some administrative acts only affect the ephemeral runtime state of

the machine, the majority are more persistent. The file system is therefore a central

component in ISE-T’s two-person administrative control. ISE-T provides a file system

that can create branches of itself as well as isolate the changes made to it. This allows

for easy creation of clone containers and comparison of the changes performed to both

environments.

Finally, ISE-T provides the ISE-T System Service. This service instantiates and

manages the lifetimes of the administration environments. It is able to compare

the two separate administration environments for equivalence to determine if the

changes performed to them should be committed to the base system. ISE-T’s System

Service performs this via an equivalence test that compares the two administration

environments’ file system modifications for equivalence. If the two environments are

equivalent, the changes will be committed to the underlying base system. Otherwise,

the ISE-T System Service will notify the two administrators of the discrepancies and

allow them to fix their environments appropriately.

8.2.1 Isolation Containers

ISE-T can leverage multiple types of container environments depending on admin-

istrative needs. In general, the choice will be between hardware virtual machine


containers and operating system containers. Hardware virtual machines such as

VMware [142] provide a virtualized hardware platform with a separate operating

system kernel, yielding a complete operating system instance. Operating system

containers such as Solaris Zones [116], however, are just isolated kernel namespaces

running on a single machine.

For ISE-T, there are two main differences between these containers. First, hard-

ware virtual machines allow the administrators to install and test new operating

system kernels, as each container will be running its own kernel. Operating system

containers, on the other hand, prevent the administrators from testing the underlying

kernel, as there is only one kernel running, that of the underlying host machine. Sec-

ond, as hardware virtual machines require their own kernel and a complete operating

system instance, they make it time-consuming to create administration clones. Oper-

ating system containers, however, can be created almost instantly. As both types of

containers have significant benefits for different types of administrative acts, ISE-T

supports both. For most actions, administrators will prefer operating system con-

tainers, but they can still use a complete hardware virtual machine to test kernel

changes.

When ISE-T is integrated with a configuration management system, ISE-T does

not have to use any isolation container mechanism at all, as the configuration man-

agement system already isolates the administrators from the client system. Instead,

ISE-T simply provides each administrator with their own configuration management

tree and lets both administrators perform the changes.


8.2.2 ISE-T’s File System

To support its file system needs, ISE-T leverages the branching ability of some file

systems. Unlike a regular file system, a branchable file system can be snapshotted at

some point in time and branched for future use. This allows ISE-T to quickly clone

the file system of the machine being managed. Because each file system branch is

independent, ISE-T can capture any file system changes in the newly created branch

by comparing the branch’s state to the initial file system’s state. Similarly, ISE-T

can then compare the sets of file system changes from both administration clones to

one another.

Although a classical branchable file system allows changes to be captured, it does

not make it possible to determine efficiently what has changed, because the branch

is a complete file system namespace. Iterating through the complete file system can

take a significant amount of time, place a large strain on the file system, and decrease

system performance. ISE-T therefore requires two features of its file system. First,

it must be able to duplicate the file system to provide each administrator with their

own independent file system on which to make changes. Second, it must allow easy

isolation of each administrator’s changes to test them for equivalence.

To meet these requirements, ISE-T creates layered file systems for each adminis-

tration environment. Multiple file systems can be layered together into a single file

system namespace for each environment. This enables each administration environ-

ment to have a layered file system composed of two layers, a single shared layer that

is the file system of the machine they are administrating, as well as a layer containing

all the changes the administrator makes on the file system.
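
The two-layer arrangement can be illustrated with an in-memory model. This is a sketch under stated assumptions rather than the prototype's file system: `LayeredFS` is a hypothetical name, files are dictionary entries, and the deletion marker is an assumption modeled on the whiteout entries that union file systems commonly use.

```python
class LayeredFS:
    """Two-layer namespace: a shared read-only base plus a private change layer."""

    _DELETED = object()   # assumed marker recording that a file was removed

    def __init__(self, base):
        self.base = base      # shared by every clone; never modified here
        self.changes = {}     # this administrator's private layer

    def read(self, path):
        # The private layer takes precedence over the shared base layer.
        if path in self.changes:
            data = self.changes[path]
            if data is self._DELETED:
                raise FileNotFoundError(path)
            return data
        if path in self.base:
            return self.base[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self.changes[path] = data

    def remove(self, path):
        self.read(path)                     # raises if the file is not visible
        self.changes[path] = self._DELETED

    def modified_paths(self):
        # Isolating an administrator's changes needs no scan of the base layer.
        return sorted(self.changes)
```

Because every modification lands in `changes`, listing what an administrator did costs time proportional to the number of changes, not to the size of the whole file system.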


8.2.3 ISE-T System Service

ISE-T’s System Service has a number of responsibilities. First, it manages the life-

times of each administrator’s environment. When administration is required, it has to

set up the environments quickly. Second, when the administration session has been

completed and the changes committed to the underlying system, it removes them

from the system and frees up their space. Third, it evaluates the two environments

for equivalence by running a number of equivalence tests to determine if the two ad-

ministrators performed the same set of modifications. Finally, it has to either notify

the administrators of the discrepancies between their two environments or commit

the equivalent environments’ changes to the underlying base system.

ISE-T’s layered file system allows the system service to easily determine which

changes each administrator made, as each administrator’s changes are confined to

their personal layer of the layered file system. To determine if the changes are equiva-

lent, ISE-T first isolates the files that will not be committed to the base system, that

is, the administrator’s personal files in their branch, such as shell history. Instead of

merely removing them, ISE-T saves them for archival and audit purposes. ISE-T then

iterates through the files in each environment, comparing the file system contents and

files directly to one another. If each administrator’s branch has an equivalent set of

file system changes, ISE-T can then simply commit a set to the base system. On the

other hand, if the files contained within each branch are not equivalent, ISE-T flags

the differences and reports them. The administrators then confer to ensure that they

perform the same steps to create the same set of files to commit to the base system.

Ways of determining equivalence can vary based on the type of file and what is

considered to be equivalent in context. For instance, a configuration file modified

by both administrators with different text editors can visually appear equivalent,


but can differ if one uses spaces and another uses tabs. These files are equivalent

insofar as applications parse them the same way, but are different on a character by

character level. However, there are some languages (e.g., Python) where the amount

of white space matters and can have a great effect on how the script executes. On

the other hand, two files that have exactly the same file contents can have varying

metadata associated with the file, such as permissions, extended attributes, or time

data. Similarly, some sets of files need not be compared for equivalence, such as

the shell history that records the steps the administrators take in their respective

environments, and, in general, the home directory contents of the administrator in

his administration environment. ISE-T removes these files from the comparison, and

never commits them to the underlying system.

Taking this into consideration, ISE-T’s prototype comparison algorithm deter-

mines these sets of differences.

1. Directory entries which do not exist in both sets of changes are different.

2. Directory entries with different UIDs, GIDs, or permission sets are different.

3. Directory entries of different file types (Regular File, Symbolic Link, Directory,

Device Node, or Named Pipe) are different.

For directory entries of the same type, ISE-T performs the appropriate compari-

son.

• Device nodes must be of the same type.

• Symbolic links must contain the exact same path.

• Regular files must have the same size and the exact same contents.
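
The rules listed above can be sketched as a directory comparison. This is a minimal illustration, not the prototype's code: the function names are hypothetical, each administrator's change layer is assumed to be an ordinary directory tree, and "same type" for device nodes is interpreted here as having the same device number, which is an assumption.

```python
import os
import stat

def _entries(root):
    """Map each relative path under root to its lstat() result."""
    found = {}
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            full = os.path.join(dirpath, name)
            found[os.path.relpath(full, root)] = os.lstat(full)
    return found

def _same_contents(path_a, path_b, chunk=1 << 16):
    """Compare two regular files byte for byte."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a, block_b = fa.read(chunk), fb.read(chunk)
            if block_a != block_b:
                return False
            if not block_a:
                return True

def differing_entries(root_a, root_b):
    """Return the sorted relative paths on which two change sets differ."""
    a, b = _entries(root_a), _entries(root_b)
    different = set(a.keys() ^ b.keys())      # rule 1: present on one side only
    for rel in a.keys() & b.keys():
        sa, sb = a[rel], b[rel]
        pa, pb = os.path.join(root_a, rel), os.path.join(root_b, rel)
        same_owner = (sa.st_uid, sa.st_gid) == (sb.st_uid, sb.st_gid)
        same_perms = stat.S_IMODE(sa.st_mode) == stat.S_IMODE(sb.st_mode)
        if not (same_owner and same_perms):   # rule 2: UID, GID, or permissions
            different.add(rel)
        elif stat.S_IFMT(sa.st_mode) != stat.S_IFMT(sb.st_mode):
            different.add(rel)                # rule 3: file type
        elif stat.S_ISCHR(sa.st_mode) or stat.S_ISBLK(sa.st_mode):
            if sa.st_rdev != sb.st_rdev:      # assumed meaning of "same type"
                different.add(rel)
        elif stat.S_ISLNK(sa.st_mode):
            if os.readlink(pa) != os.readlink(pb):
                different.add(rel)
        elif stat.S_ISREG(sa.st_mode):
            if sa.st_size != sb.st_size or not _same_contents(pa, pb):
                different.add(rel)
    return sorted(different)
```

An empty result would correspond to a passed equivalence test; any returned path is a discrepancy to report to the administrators.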


There are two major problems with this approach. First, this comparison takes

place at a very low semantic level. It does not take into account simple differences

between files that make no difference in practice. However, without writing a parser

for each individual configuration language, one will not easily be able to compare

equivalence. Second, there are certain files, such as encryption keys, that will never be

generated identically, even though equivalent actions were taken to create them. This

can be important, as some keys are known to be weaker and a malicious administrator

can construct one by hand.

Both of these problems can be solved by integrating ISE-T with a configuration

management system and teaching ISE-T the configuration management system’s lan-

guage. First, these systems simplify the comparison by enabling it to focus on the

configuration management system’s language. Even though most configuration man-

agement systems work by creating template configuration files for the different ap-

plications, these files are not updated regularly and can be put through the stricter

exact comparison test. On the other hand, when ISE-T understands the language

of the configuration management system, it can rely on a more relaxed equivalence

test. Second, configuration management systems already deal with dynamic files like

encryption keys. A common way configuration management systems deal with these

types of files is by creating them directly on the managed client machines. Because

ISE-T understands the configuration management system’s language, the higher level

semantics that instruct the system to create the file will be compared for equivalence

instead of the files themselves. However, a potential weakness of ISE-T is in dealing

with files that cannot easily be created on the fly and will differ between two system

administration environments, such as databases. For instance, two identical database

operations can result in different databases due to different timestamps or reordering

of updates on the database server.


8.3 ISE-T for Auditing

Although two-person control is useful for providing high assurance that faults

are not going to creep into the system, its expense can make it impractical in many

situations. For example, since the two-person control model requires the concurrence

of two system administrators on all actions, it can prevent time-sensitive actions if

only a single administrator is available. Similarly, while the two-person control model

provides a very high degree of assurance for a price, it would be useful if organizations

could get a somewhat higher degree of assurance at a lower price. To achieve these

goals, we can combine ISE-T’s mechanisms with audit trail principles to create an

auditable system administration semantic.

In auditable system administration, every system administration act is logged to

a secure location for review. The ISE-T System Service creates cloned administra-

tion environments for the two administrators and can capture the state they change

in order to compare for equivalence. For auditable system administration, ISE-T’s

mechanism can also be used. The audit system prevents the single system adminis-

trator from modifying the system directly, instead requiring the creation of a cloned

administration environment where the administrator can perform the changes before

they are committed to the underlying system. Instead of comparing for equivalence

against a second system administrator, the changes are logged so that they can be

examined at some time in the future, while being immediately committed to the un-

derlying system. Audit systems are known to increase assurance against malicious

changes, as the would-be perpetrator knows there is a good chance their actions will

be caught. Similarly, depending on the frequency and number of audits performed, it

can help prevent administration faults from persisting for long periods of time in the

system. However, it does not provide as much assurance as two-person control, be-


cause the administrator can use the fact that his changes are committed immediately

to create back doors in the system that will not be discovered until later.

Auditable system administration needs to be tied directly to an issue-tracking ser-

vice. This allows an auditor to associate an administrative action with its intended

result. Every time an administrator invokes ISE-T to administer the system, an

issue-tracking number is passed into the system to tie that action to the issue in the

tracker. This allows the auditor to compare the actual results with what the auditor

expects to have occurred. In addition, auditable system administration can be used

in combination with the two-person control system when only a single administrator

is available and immediate action is needed. With auditing, the action can be per-

formed by the single administrator, but can be immediately audited when the second

administrator becomes available.
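
The auditable mode can be sketched as a commit that is applied immediately and recorded for later review. The record layout, the function name `audit_commit`, and the in-memory list standing in for the secure log are all assumptions made for illustration; the text specifies only that changes are logged and tied to an issue-tracking number.

```python
import json
import time

def audit_commit(base, changes, admin, issue_id, log):
    """Commit one administrator's changes at once and log them for review."""
    record = {
        "time": time.time(),
        "admin": admin,
        "issue": issue_id,        # ties the action to the issue tracker
        "changes": dict(changes), # what the auditor will compare to the issue
        "audited": False,         # set once an auditor has reviewed the record
    }
    log.append(json.dumps(record, sort_keys=True))
    base.update(changes)          # no second administrator to wait for
    return record

def pending_audits(log):
    """Return the logged records that no auditor has reviewed yet."""
    records = [json.loads(line) for line in log]
    return [r for r in records if not r["audited"]]
```

In a real deployment the log would have to live somewhere the audited administrator cannot modify, which this in-memory list does not model.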


To test the efficacy of ISE-T’s layered file system approach, we recruited 9 experienced

computer users with varying levels of system administration experience, though all

were familiar with managing their own machines. We provided each user with a

VMware virtual machine running Debian GNU/Linux 3.0. Each VM was configured

to create an ISE-T administration environment that would allow the users to perform

multiple administration tasks isolated from the underlying base system. Our ISE-T

prototype uses UnionFS [150] to provide the layered file system needed by ISE-T. We

asked the users to perform the eleven administration tasks listed in Table 8.2. The

user study was conducted in virtual machines running on an IBM HS20 eServer blade

with dual 3.06 GHz Intel Xeon CPUs and 2.5 GB RAM running VMware Server 1.0.

These tasks were picked to be representative of common administration tasks, and

included a common way for a malicious administrator to create a back door in the

system.

Category               Description                               Result          Desired
Software Installation  Install official rdesktop package         Equivalent      Yes
                       Compile & install rdesktop from source    Equivalent      Yes
                       Install all pending security updates      Equivalent      Yes
System Services        Install SSH daemon from package           Not Equivalent  No
                       Remove PPP using package manager          Equivalent      Yes
Configuration Changes  Edit machine’s persistent hostname        Equivalent      Yes
                       Edit the inetd.conf to enable a service   Not Equivalent  No
                       Add a daily run cron job                  Equivalent      Yes
                       Remove an hour run cron job               Equivalent      Yes
                       Change the time of a cron job             Equivalent      Yes
Exploit                Create a backdoor setuid root shell       Not Equivalent  Yes

Table 8.2 – Administration Tasks

Each task was performed in a separate ISE-T container, so that each administra-

tion task was isolated from the others, and none of the tasks depended on the results

of a previous task. We used ISE-T to capture the changes each user performed for

each task in its own file system. We were then able to compare each user against the

others for each of the eleven tasks to see where their modifications differed.
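The per-task comparison can be illustrated with a short sketch. It is not ISE-T's implementation: it assumes each user's captured changes are available as an ordinary directory tree, it compares file contents only (not ownership or permissions), and the names snapshot and diff_changes are hypothetical.

```python
import hashlib
import os


def snapshot(root):
    """Map each regular file under root (by relative path) to a SHA-256 digest."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if not os.path.isfile(full):
                continue  # skip broken symlinks and other non-regular entries
            with open(full, "rb") as fh:
                result[os.path.relpath(full, root)] = hashlib.sha256(fh.read()).hexdigest()
    return result


def diff_changes(a, b):
    """Return the sorted paths whose presence or content differs between two snapshots."""
    return sorted(p for p in set(a) | set(b) if a.get(p) != b.get(p))
```

Two users' changes for a task would then be considered equivalent when diff_changes returns an empty list for their snapshots after pruning.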

For every test, ISE-T prunes the changes to remove files that would not affect

equivalence, as described in Chapter 8.2.3. Notably, in our prototype, ISE-T prunes

the /root directory, which is the home directory of the root user, and therefore

would contain differences in files such as .bash_history, among others, that are

particular to each user’s approach to the task. Similarly, ISE-T prunes the /var

subtree to remove any files that are not equivalent. For instance, depending on what

tools an administrator uses, different files are created. A cache of packages might be

downloaded and installed via the apt-get tool instead of manually. The reasoning

behind this pruning is that the /var tree is meant as a read-write file system for


per-system usage. Tools will modify it; if different tools are used, different changes

will be made. The entire directory tree cannot be pruned, however, because there

are files or directories within it that are necessary for runtime use and those changes

have to be committed to the underlying file system. Therefore, only those changes

that are equivalent are committed, while those that are different are ignored. ISE-T

also prunes the /tmp directory, as the contents of this directory would also not be

committed to the underlying disk. Finally, due to the UnionFS implementation used

for these experiments, ISE-T also prunes the whiteout files created by UnionFS if

there is no equivalent file on the underlying file system. In many cases, temporary

files with random names will be created; when they are deleted, UnionFS will create

a whiteout file, even if there is no underlying file to white out. As this whiteout file

does not have an impact on the underlying file system, it is ignored. On the other

hand, whiteout files that do correspond to underlying files and therefore indicate that

the file was deleted are not ignored.
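The pruning rules above can be summarized in a minimal sketch. It assumes a change set is represented as a dictionary from absolute path to content digest, which is an illustrative choice rather than the prototype's data structure; the function names are hypothetical. The ".wh." prefix is the naming convention UnionFS uses for whiteout entries.

```python
import posixpath

PRUNED_DIRS = ("/root", "/tmp")  # directories pruned wholesale, per the text
WHITEOUT_PREFIX = ".wh."         # UnionFS names a whiteout ".wh.<name>"


def under(path, directory):
    """True if path is the directory itself or lies beneath it."""
    return path == directory or path.startswith(directory + "/")


def prune(changes, base_paths):
    """Drop /root, /tmp, and whiteouts that have no file beneath them.

    changes: dict of absolute path -> digest for one user's layer.
    base_paths: set of absolute paths on the underlying file system.
    """
    kept = {}
    for path, digest in changes.items():
        if any(under(path, d) for d in PRUNED_DIRS):
            continue
        name = posixpath.basename(path)
        if name.startswith(WHITEOUT_PREFIX):
            target = posixpath.join(posixpath.dirname(path), name[len(WHITEOUT_PREFIX):])
            if target not in base_paths:
                continue  # whiteout of a file that never existed below
        kept[path] = digest
    return kept


def prune_var(a, b):
    """Within /var, keep only the entries on which both users agree."""
    def keep(x, other):
        return {p: d for p, d in x.items() if not under(p, "/var") or other.get(p) == d}
    return keep(a, b), keep(b, a)
```

Whiteouts that do correspond to an underlying file survive prune, so a real deletion still participates in the equivalence check.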

8.4.1 Software Installation

In the software installation category, we had the users perform three separate tests

to demonstrate that when multiple users install the same piece of software, as long

as they install it in the same general way, the two installations will be equivalent.

To demonstrate this, the users were first instructed to install the rdesktop program

from its Debian package. Users had multiple ways of installing the package, including

downloading and installing it by hand via dpkg, using apt-get to download it and

any unfulfilled dependencies, as well as using the aptitude front end to apt-get.

Most users decided to install the package via apt-get, but even those who did not

made equivalent changes. The only differences were in pruned directories, demon-


strating that installing a piece of pre-packaged software using regular tools results in

an equivalent system.

Second, the users were instructed to build the rdesktop program from source code

and install it into the system. In this case, multiple differences could have occurred.

First, if the compiler were to create a different binary each time the source code is

compiled, even without any changes, it would be difficult to check for equivalence.

Second, programs generally can be installed in different areas of the file system, such

as /usr versus /usr/local. In this case, all the testers decided to install the program

into the default location, avoiding the latter problem, while also demonstrating that

as long as the same source code is compiled by the same tool chain, it will result in

the same binary. However, some program source code, such as the Linux kernel, will

dynamically modify its source during build, for example to define when the program

was built. In these cases, we would expect equivalence testing to be more difficult,

as each build will result in a different binary. A simple solution would be to patch

the source code to avoid this behavior. A more complicated solution would involve

evaluating the produced binary’s code and text sections with the ability to determine

that certain text section modifications are inconsequential. Again, in this case, the

only differences were in pruned directories, notably the /root home directory, to

which the users downloaded the source for rdesktop.
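Section-aware binary comparison is beyond a short example, but a cruder variant of the same idea can be sketched: mask byte ranges that match a known build-timestamp pattern before comparing. The timestamp format below is an assumption made purely for illustration, and the function names are hypothetical; real builds embed such data in many different forms.

```python
import hashlib
import re

# Assumed timestamp format, e.g. b"2010-03-01 12:00:00"; real builds vary.
TIMESTAMP = re.compile(rb"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}")


def normalized_digest(binary):
    """Hash a binary's bytes after masking embedded build timestamps."""
    masked = TIMESTAMP.sub(b"0000-00-00 00:00:00", binary)
    return hashlib.sha256(masked).hexdigest()


def builds_equivalent(a, b):
    """True if two binaries differ only in the masked timestamp bytes."""
    return normalized_digest(a) == normalized_digest(b)
```

Masking by pattern can hide a deliberate change that happens to look like a timestamp, so a production check would need to restrict masking to known offsets or sections.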

Finally, we instructed the users to install all the pending security updates. This

is more complicated than the first test, as many packages were upgraded. Although

differences existed between the environments of the users, the differences were con-

fined to the /var file system tree and depended on how they performed the upgrade.

This is because Debian provides multiple ways to do an upgrade of a complete system

and those cause different log files to be written. As they all installed the same set of

packages, the rest of the file system, as expected, contained no differences.


8.4.2 System Services

Our second set of tests involved adding and removing services. Users were instructed

to install SSH and remove PPP. These tests extended the previous package

installation tests and demonstrated how services would be started and stopped

automatically; they also included files we knew would fail equivalence testing.

For the first test, we instructed the users to install the SSH daemon. This test

sought to demonstrate that ISE-T can detect when a new service is installed and

therefore enable it when the changes are committed. In Linux systems, a System-V

init script has to be added to the system to allow it to be started each time the

machine boots. If the user’s administration environment contains a new init script,

ISE-T automatically determines that the service should be started when this set

of administration changes is committed to the base system. This test also sought to

demonstrate that certain files are always going to be different between users if created

within their private environment. The SSH host key for each environment is different

because it is created based on the kernel’s random entropy pool, which is different for

each user and therefore will never be the same if created in a separate environment.

A way around this would be not to create it within the private branch of each user,

but instead to create it after the equivalent changes are committed, for example, the

first time the service’s init script is executed.
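Detecting a newly installed service from a change set can be sketched as follows. The sketch assumes the change set is a collection of absolute paths and that init scripts live directly in /etc/init.d, as on Debian; the function name new_services is hypothetical and this is not the prototype's code.

```python
INIT_DIR = "/etc/init.d/"  # System-V init script directory on Debian


def new_services(changes, base_paths):
    """Return names of init scripts present in a change set but not in the base system."""
    services = []
    for path in changes:
        if not path.startswith(INIT_DIR):
            continue
        name = path[len(INIT_DIR):]
        # Skip nested paths, the directory itself, and hidden entries such as whiteouts.
        if not name or "/" in name or name.startswith("."):
            continue
        if path not in base_paths:
            services.append(name)
    return sorted(services)
```

Each returned name identifies a service that would be started once the equivalent changes are committed; generating the SSH host key at that point, rather than inside each private environment, avoids the unavoidable key mismatch described above.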

For the second test, we instructed the users to remove the PPP daemon. This test

demonstrated that there are multiple ways to remove a package in a Debian system,

and depending on the way the package is removed, the underlying file system will be

different. Specifically, a package can either be removed or purged. When a package

is removed, files marked as configuration files are left behind, allowing the package

to be reinstalled and have the configuration remain the same. On the other hand,


when a package is purged, the package manager will remove the package and all the

configuration files associated with it. In this case, the users chose different ways to

remove the package, and ISE-T was able to determine the differences for those that

chose to remove or purge it.

8.4.3 Configuration Changes

Our third set of tests involved modifications to configuration files on the system and

included five separate tests in two categories. The first category was composed of

simple file configuration changes. We first instructed the users to modify the host

name of the machine persistently from debian to iset, which is accomplished by

editing the /etc/hostname file. As expected, because this configuration change is very

simple, all users modified the system’s hostname in the exact same manner, allowing

ISE-T to determine that all the systems were equivalent.

Next, we instructed the users to modify the /etc/inetd.conf to enable the

discard service. In this case, as the file is more free-form, their changes were not ex-

act, and many were not equivalent. For example, some users enabled it for both TCP

and UDP, while others enabled it for TCP alone. Also, some added a comment, while

others did not. Whereas the first change is not equivalent, the second change should

be considered equivalent, but this cannot be determined by a simple diff. One needs

to parse the files correctly to determine that they are equivalent, an ability lacking in

our ISE-T prototype. However, ISE-T was able to clearly report the differences that

existed between the users’ environments.
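A format-aware comparison of inetd.conf, which the prototype lacks, might look like the following sketch. It treats a file as the set of its active entries, ignoring comment lines, blank lines and spacing; the function names are hypothetical.

```python
def inetd_entries(text):
    """Return the set of active inetd.conf entries, ignoring comments and spacing."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # comments and blank lines do not change behavior
        entries.add(tuple(line.split()))
    return entries


def inetd_equivalent(a, b):
    """True if two inetd.conf files enable exactly the same service entries."""
    return inetd_entries(a) == inetd_entries(b)
```

Under this comparison, a file that adds only a comment is equivalent to one that does not, while enabling discard for both TCP and UDP remains distinct from enabling it for TCP alone.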

The second category of tests involved setting up and removing cron jobs and was

composed of three tests. First, we provided the users with a script in /root that

we instructed them to install in the system in a manner so that it will be executed


daily. In Debian there are two ways to have a cron job execute daily. The best way

is to simply place the script in the /etc/cron.daily directory, as all the programs

located within it are executed once a day. On the other hand, a user could use

the /etc/crontab file and instruct it exactly what time each day to execute the

program. We were looking for users to place the file in /etc/cron.daily, which

the large majority did, but one user, who perhaps was not used to Debian, modified

/etc/crontab directly. Again, ISE-T was able to detect the difference when it existed

and report the discrepancy. Similarly, we instructed users to remove a cron job that

ran hourly. The cron job’s script was already located in /etc/cron.hourly and not

in crontab. All users deleted it from the /etc/cron.hourly directory, even the

user who modified /etc/crontab in the previous test. Finally, we had an entry in

/etc/crontab that executed a cron job at 4:35 AM every day, and instructed the

users to change it to 5:35 AM. As expected, all users performed the same set of

changes by modifying the /etc/crontab file.

8.4.4 Exploit

In our final test, we asked the users to create a back door shell into the system. This

test demonstrates that ISE-T will detect when one of the two system administrators

makes a rogue administration change to the system, as the other administrator will

not have made that same administration change. We did not specify how or where

the shell should be created, and left it up to the user. The users created setuid shells

in many different locations, some as “dot files” hidden from a normal ls listing,

and some just performed setuid on existing shells. In all cases, as none of the users

made this rogue modification in the same manner, ISE-T was able to show the rogue

changes as different in every environment.
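ISE-T flags the back door simply because the environments differ. As a complement, a reviewer examining a reported difference could scan a captured change tree for setuid-root files. The sketch below is an illustration of that idea, not part of the prototype; it assumes the changes are available as a directory tree, and its function names are hypothetical.

```python
import os
import stat


def is_setuid_root(mode, uid):
    """True for a regular file owned by uid 0 with the setuid bit set."""
    return stat.S_ISREG(mode) and uid == 0 and bool(mode & stat.S_ISUID)


def setuid_root_files(root):
    """List paths under root (relative to it) that are setuid-root regular files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.lstat(full)  # lstat: do not follow symlinks
            if is_setuid_root(st.st_mode, st.st_uid):
                found.append(os.path.relpath(full, root))
    return sorted(found)
```

Because os.walk reports every directory entry, "dot files" that a normal ls listing hides are included in the scan.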


8.5 Related Work

Two-person control mechanisms are known to provide high levels of assurance [134].

Multiple examples exist with regard to nuclear weapons. For instance, to launch

a nuclear weapon, two operators must separately confirm that launch orders are

valid and must turn their launch keys together to launch the missiles. In fact, every

sensitive action concerning nuclear weapons must be performed by two people with

the same training and authority [39, Chapter 2]. The same notion is applied in many

financial settings: banks will require two people to be involved in certain tasks, such

as opening a safe deposit box [148], and companies can require two people to sign

a check [55] over a certain amount. This makes it much more difficult for a single

person to commit fraud.

As far as we know, this mechanism has never been applied directly to system

administration. In the Compartmented Mode Workstation (CMW), the system ad-

ministration job is split into roles, so that many traditional administration actions

require more than one user’s involvement [138]. This demarcation of roles was first

pioneered in Multics at MIT [75]. Similarly, the Clark-Wilson model was designed

to prevent unauthorized and improper modifications to a system to ensure its in-

tegrity [44]. All these systems simply divided the administrators’ actions among

different users who performed different actions. This differs fundamentally from the

traditional notion of two-person control where both people do the same exact action.

More recently, many products have been created to help prevent and detect when

accidental mistakes occur in a system. SudoSH [69] is able to provide a higher level

of assurance during system administration as it records all keystrokes entered during

a session and is able to replay the session. However, while sudosh can provide an

audit log of what the administrator did, it does not provide the assurances provided


by the two-person control model. Even if one were to audit the record or replay

it, one is not guaranteed to get the same result. Although auditing this record can

be useful for detecting accidental mistakes, it cannot detect malicious changes. For

instance, a file fetched from the Internet can be modified. If the administrators can

control which files are fetched, they can manipulate them before and after the sudosh

session. ISE-T, on the other hand, does not care about the steps administrators take

to accomplish a task, only the end result as it appears on the file system.

Part of the reason accidental mistakes occur is that knowledge is not easily passed

between experienced and inexperienced system administrators. Although systems

like administration diaries and wikis can help, they do not easily associate specific

administration actions with specific problems. Trackle [50] attempts to solve this by

combining an issue tracker with a logged console session. Issues can be annotated,

edited and cross-referenced while the logged console session logs all actions taken

and file changes and stores them with the issue, improving institutional memory.

Although this allows less experienced system administrators to see the exact steps a

previous administrator took to fix a similar or equivalent issue, it does not actually

prevent mistakes from entering and remaining in the system, nor does it prevent a

malicious administrator from operating.

ISE-T’s notion of file system views was first explored in Plan 9 [104]. In Plan

9, it is a fundamental part of the system’s operation. As Plan 9 does not view

manipulating the file system view as a privileged operation, each process can craft

the namespace view it or its children will see. A more restricted notion of file system

views is described by Ioannidis [71]. There, its purpose is to overlay a different set of

permissions on an existing file system.

Finally, a common way to make a system tolerant of administration faults is to

use file system versioning, which allows rolling back to a configuration file’s previous


state if an error is made. Operating systems such as Tops-20 [53] and VMS [90]

include native operating system support for versioning as a standard feature of their

file systems. These operating systems employ a copy-on-write semantic that involves

versioning a file each time a process changes it. Other file systems, such as Ver-

sionFS [96], ElephantFS [127] and CVFS [132], have been created to provide better

control of the file system versioning semantic.

Chapter 9

Conclusions and Future Work

This dissertation demonstrates that many different types of modern computing prob-

lems can be solved in a relatively simple manner with different forms of operating

system virtualization.

First, we presented *Pod. *Pod decouples a user’s computing experience from

a single machine while providing them with the same persistent, personalized com-

puting session they expect from a regular computer. *Pod allows different types of

applications to be stored on a small portable storage device that can be easily carried

on a key chain or in a user’s pocket, thereby allowing the user increased mobility.

*Pod uses operating system and display virtualization to decouple the computing

session from the host on which it is currently running. It combines this virtualiza-

tion mechanism with a checkpoint/restart system that lets *Pod users suspend their

computing session, move around, and resume their session at any computer.

Second, we presented AutoPod. AutoPod expands on *Pod by enabling isolated

applications running within a pod to be transparently migrated across machines run-

ning different operating system kernel versions. This lets maintenance occur promptly,

as system administrators do not have to take down all applications running on a ma-

chine when it needs maintenance. Instead, the applications are migrated to a new

machine where they can continue execution. As AutoPod enables this across differ-

ent kernel versions, security patches can be applied to operating systems in a timely

manner with minimal impact on the availability of application services.

Third, we presented PeaPod, an operating system virtualization layer that en-

ables secure isolation of legacy applications. The virtualization layer leverages pods

and introduces peas to encapsulate processes. Pods provide an easy-to-use lightweight

virtual machine abstraction that can securely isolate individual applications without

the need to run an operating system instance in the pod. Peas provide a fine-grained

least-privilege mechanism that can further isolate application components within

pods. PeaPod’s virtualization layer can isolate untrusted applications, preventing

them from being used to attack the underlying host system or other applications

even if they are compromised.

Fourth, we presented Strata, which improves the way system administrators man-

age the VAs under their control by introducing the virtual layered file system. By

addressing their contents by file location instead of block address, VLFSs allow Strata

to quickly and simply provision VAs, as no data needs to be copied into place. Strata

provides improved management, as file system modifications are isolated and upgrades

can be stored centrally and applied atomically. It also allows Strata to create new

VLFSs and VAs by composing together smaller base VLFSs and VAs that provide

core components. Strata significantly reduces the amount of disk space required for

multiple VAs, allows them to be provisioned almost instantaneously and allows them

to be quickly updated no matter how many are in use. The research into Strata’s VLFS

also enabled DejaView’s ability to provide a time-traveling desktop [81]. By layering

a blank layer over the file system snapshot, DejaView was able to quickly recreate a

fully writable file system view.


Fifth, we presented Apiary, which introduces a new compartmentalized applica-

tion desktop paradigm. Instead of running one’s applications in a single environment

with complex rules to isolate the applications from each other, Apiary allows them

to be easily and completely isolated while retaining the integrated feel users expect

from their desktop computer. The key innovations that make this possible are the use

of virtual layered file systems and the ephemeral application execution environments

they enable. The VLFS allows the multiple containers to be stored as efficiently

as a single regular desktop, while also allowing containers to be instantiated almost

instantly. This functionality enables the creation of the ephemeral containers that

provide an always fresh and clean environment for applications to run in.

Apiary’s usage model of fully isolating each application works well in many sce-

narios, but can cause complications in others. For instance, as each application’s file

system is fully isolated, if one wanted to send a file as an email attachment, one could

not create a new email message and attach the file to it; the email program might

not have access to the file system containing the file. Although Apiary provides a

method for users to copy files between containers, this can have an impact on users’

ability to use the system efficiently. Applying Apiary’s principles to non-desktop

environments, such as smartphones and tablets, where user interface paradigms are

not as ingrained as on the desktop, can enable user interface metaphors that behave

seamlessly without compromising Apiary’s application isolation.

Apiary also raises a number of interesting follow-up questions as it only explores

the benefits of applications that can run in total isolation. There are smaller appli-

cations, such as browser plugins, that cannot run in total isolation, but must remain

part of a larger environment. An interesting follow-up would be to explore

how Apiary’s concepts apply to multiple components of a single application, where

the components cannot be run independently.


The ephemeral execution model introduced by Apiary provides multiple avenues

for follow-up. For instance, many network-facing services, such as mail and web

services, continuously run based on untrusted input they receive from the network.

These services have also been consistently exploited due to flaws in their programs.

However, the ephemeral execution model, as presented by Apiary, is not a perfect

fit for these services as they need some level of “write” access to the underlying sys-

tem that will be persistent. An interesting area of research would be to understand

how these services operate and how ephemeral execution could be leveraged to pro-

vide more security while still allowing the persistent data storage that these services

require.

Finally, we presented ISE-T, which enables and applies the two-person control

model to system administration. In administration, this model requires two admin-

istrators to perform the same administrative act with equivalent results for the ad-

ministrative changes to be allowed to affect the system that is being modified. ISE-T

creates multiple parallel environments for the administrators to perform their adminis-

trative changes independently. ISE-T then compares the results of the administrative

changes for equivalence. When the results are equivalent, there is a high assurance

that system administration faults have not been introduced into the system, be they

malicious or accidental in nature.

ISE-T’s application of the two-person control model is just an element of a

larger vision of applying this dual control model to solving computing problems. In

particular, we want to explore how the ability to create dual environments can pro-

vide improved systems management and security of systems in general. For system

management, patching a system is critical to ensure that it remains secure. However,

many patches can introduce new bugs as well. By being able to create two environ-

ments that run in parallel, one can test the known working system against a patched


system to ensure that the patch does not introduce any new faults. Similarly, it

can improve security as we can create two parallel environments that differ randomly

in areas such as their processes’ address space layouts and stacks. As code injection

attacks are directly tied to these layouts, by running two systems in parallel with

different layouts, an attack will result in fundamentally different results on the two

systems, allowing one to detect that an attack is occurring.

Bibliography

[1] Fakeroot. http://fakeroot.alioth.debian.org/.

[2] Gmail. https://gmail.google.com.

[3] Google Docs. https://docs.google.com.

[4] The RPM Package Manager. http://www.rpm.org/.

[5] Hotmail. http://www.hotmail.com.

[6] Linux Containers. http://lxc.sourceforge.net/.

[7] Linux VServer Project. http://www.linux-vserver.org/.

[8] Portable Firefox. http://johnhaller.com/jh/mozilla/portable_firefox/.

[9] SoX - Sound eXchange. http://sox.sourceforge.net.

[10] Stealth Surfer. http://www.stealthsurfer.biz/.

[11] Trek Thumbdrive TOUCH. http://www.thumbdrive.com/p-thumbdrive.php?product=tdswipecrypto.

[12] U3 Platform. http://www.u3.com.

[13] US DoD Joint Publication 1-02, DOD Dictionary of Military and Associated

Terms (as amended through 9 June 2004).

[14] Virtual Network Computing. http://www.realvnc.com/.

[15] Sendmail v.5 Vulnerability. Technical Report CA-1995-08, CERT Coordination

Center, August 1995.

[16] MIME Conversion Buffer Overflow in Sendmail Versions 8.8.3 and 8.8.4. Tech-

nical Report CA-1997-05, CERT Coordination Center, January 1997.

[17] Anurag Acharya and Mandar Raje. MAPbox: Using Parameterized Behavior

Classes to Confine Applications. In The 9th USENIX Security Symposium,

Denver, CO, August 2000.

[18] Adobe Systems Incorporated. Buffer Overflow Issue in Versions 9.0 and Earlier

of Adobe Reader and Acrobat. http://www.adobe.com/support/security/advisories/apsa09-01.html, February 2009.

[19] Paul Anderson. LCFG: A Practical Tool for System Configuration. Usenix

Association, August 2008.

[20] http://www.aim.com/get_aim/express/.

[21] Myla Archer, Elizabeth Leonard, and Matteo Pradella. Towards a Methodol-

ogy and Tool for the Analysis of Security-Enhanced Linux. Technical Report

NRL/MR/5540--02-8629, NRL, August 2002.

[22] Yeshayahu Artsy, Hung-Yang Chang, and Raphael Finkel. Interprocess Com-

munication in Charlotte. IEEE Software, 4(1):22–28, January 1987.

[23] Dirk Balfanz and Daniel R. Simon. WindowBox: A Simple Security Model for

the Connected Desktop. In The 4th USENIX Windows Systems Symposium,

Seattle, WA, August 2000.

[24] Amnon Barak and Richard Wheeler. MOSIX: An Integrated Multiprocessor

UNIX. In The 1989 USENIX Winter Technical Conference, pages 101–112,

San Diego, CA, February 1989.

[25] Arash Baratloo, Navjot Singh, and Timothy Tsai. Transparent Run-Time De-

fense Against Stack Smashing Attacks. In The 2000 USENIX Annual Technical

Conference, San Diego, CA, June 2000.

[26] Ricardo Baratto, Shaya Potter, Gong Su, and Jason Nieh. MobiDesk: Mobile

Virtual Desktop Computing. In The 10th Annual ACM International Confer-

ence on Mobile Computing and Networking, Philadelphia, PA, September 2004.

[27] Ricardo A. Baratto, Leonard N. Kim, and Jason Nieh. THINC: A Virtual

Display Architecture for Thin-Client Computing. In The 20th ACM Symposium

on Operating Systems Principles, Brighton, United Kingdom, October 2005.

[28] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex

Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the Art of

Virtualization. In The 19th ACM Symposium on Operating Systems Principles,

Bolton Landing, NY, October 2003.

[29] Andrew Baumann, Jonathan Appavoo, Dilma Da Silva, Jeremy Kerr, Orran

Krieger, and Robert W. Wisniewski. Providing Dynamic Update in an Oper-

ating System. In 2005 USENIX Annual Technical Conference, pages 279–291,

Anaheim, CA, April 2005.

[30] Andrew Berman, Virgil Bourassa, and Erik Selberg. TRON: Process-specific

File Protection for the UNIX Operating System. In The 1995 USENIX Winter

Technical Conference, pages 165–175, New Orleans, LA, January 1995.

[31] bitdefender. Trojan.pws.chromeinject.b. http://www.bitdefender.com/VIRUS-1000451-en--Trojan.PWS.ChromeInject.B.html, November 2008.

[32] Jeff Bonwick and Bill Moore. ZFS: The Last Word In File Systems. http://opensolaris.org/os/community/zfs/docs/zfs_last.pdf, November 2005.

[33] Kevin Borders, Eric Vander Weele, Billy Lau, and Atul Prakash. Protecting

Confidential Data on Personal Computers with Storage Capsules. In The 18th

USENIX Security Symposium, Montreal, Canada, August 2009.

[34] Ed Bugnion, Scott Devine, and Mendel Rosenblum. Disco: Running Commod-

ity Operating Systems on Scalable Multiprocessors. In The 16th ACM Sym-

posium on Operating Systems Principles, pages 143–156, Saint Malo, France,

December 1997.

[35] Thomas Bushnell. The HURD: Towards a New Strategy of OS Design. http://www.gnu.org/software/hurd/hurd-paper.html, 1994.

[36] Bruce Byfield. An Apt-Get Primer. http://www.linux.com/articles/40745,

December 2004.

[37] Ramon Caceres, Casey Carter, Chandra Narayanaswami, and Mandayam

Raghunath. Reincarnating PCs with Portable SoulPads. In The 3rd Interna-

tional Conference on Mobile Systems, Applications, and Services, pages 65–78,

Seattle, WA, June 2005. ACM.

[38] Justin Cappos, Scott Baker, Jeremy Plichta, Duy Nyugen, Jason Hardies, Matt

Borgard, Jeffry Johnston, and John H. Hartman. Stork: Package Management

for Distributed VM Environments. In The 21st Large Installation System Ad-

ministration Conference, Dallas, TX, November 2007.

[39] Ashton B. Carter, John D. Steinbruner, and Charles A. Zraket, editors. Man-

aging Nuclear Operations. The Brookings Institution, Washington, DC, 1987.

[40] Jeremy Casas, Dan Clark, Rabi Konuru, Steve Otto, Robert Prouty, and

Jonathan Walpole. MPVM: A Migration Transparent Version of PVM. Com-

puting Systems, 8(2):171–216, 1995.

[41] Ramesh Chandra, Nickolai Zeldovich, Constantine Sapuntzakis, and Monica S.

Lam. The Collective: A Cache-Based System Management Architecture. In

The 2nd Symposium on Networked Systems Design and Implementation, pages

259–272, Boston, MA, April 2005.

[42] David R. Cheriton. The V Distributed System. Communications of the ACM,

31(3):314–333, March 1988.

[43] Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul,

Christian Limpach, Ian Pratt, and Andrew Warfield. Live Migration of Virtual

Machines. In The 2nd Symposium on Networked Systems Design and Imple-

mentation, pages 273–286, Boston, MA, April 2005.

[44] David D. Clark and David R. Wilson. A Comparison of Commercial and Mil-

itary Computer Security Policies. IEEE Symposium on Security and Privacy,

0:184, April 1987.

[45] Commission for Review of FBI Security Programs, William Webster, chair.

Webster Report: A Review of FBI Security Programs, March 2002.

[46] Small Form Factors Committee. Specification for Self-Monitoring, Analysis and

Reporting Technology (S.M.A.R.T.). Technical Report SFF-8035, Technical

Committee T13 AT Attachment, April 1996.

[47] Roberto Di Cosmo, Berke Durak, Xavier Leroy, Fabio Mancinelli, and Jerome

Vouillon. Maintaining Large Software Distributions: New Challenges from the

FOSS Era. EASST Newsletter, 12:7–20, 2006.

[48] Crispin Cowan, Calton Pu, Dave Maier, Jonathan Walpole, Peat Bakke, Steve

Beattie, Aaron Grier, Perry Wagle, Qian Zhang, and Heather Hinton. Stack-

Guard: Automatic Adaptive Detection and Prevention of Buffer-Overflow At-

tacks. In The 7th USENIX Security Conference, pages 63–78, San Antonio, TX,

January 1998.

[49] Crispin Cowan, Steve Beattie, Greg Kroah-Hartman, Calton Pu, Perry Wagle,

and Virgil Gligor. SubDomain: Parsimonious Server Security. In 14th USENIX

Systems Administration Conference, New Orleans, LA, December 2000.

[50] Daniel S. Crosta, Matthew J. Singleton, and Benjamin A. Kuperman. Fighting

Institutional Memory Loss: The Trackle Integrated Issue and Solution Tracking

System. In The 20th Large Installation System Administration Conference,

pages 287–298, Washington, DC, December 2006.

[51] B.C. Cumberland, G. Carius, and A. Muir. Microsoft Windows NT Server 4.0,

Terminal Server Edition: Technical Reference. Microsoft Press, Redmond, WA,

August 1999.

[52] Martin Davis and Hilary Putnam. A Computing Procedure for Quantification

Theory. Journal of the ACM, 7(3):201–215, July 1960.

[53] Digital Equipment Corporation. TOPS-20 User’s Guide, January 1980.

[54] Fred Douglis and John Ousterhout. Transparent Process Migration: Design Al-

ternatives and the Sprite Implementation. Software - Practice and Experience,

21(8):757–785, August 1991.

[55] Michael Sack Elmaleh. Nonprofit Fraud Prevention. http://www.

understand-accounting.net/Nonprofitfraudprevention.html, 2007.

[56] Javier Fernandez-Sanguino. Debian GNU/Linux FAQ - Chapter 8 - The

Debian Package Management Tools. http://www.debian.org/doc/FAQ/

ch-pkgtools.en.html.

[57] FreeBSD Project. Developer’s Handbook. http://www.freebsd.org/doc/en_

US.ISO8859-1/books/developers-handbook/secure-chroot.html.

[58] Steve Friedl. Best Practices for UNIX chroot() Operations. http://unixwiz.

net/techtips/chroot-practices.html, January 2002.

[59] Tal Garfinkel. Traps and Pitfalls: Practical Problems in System Call Inter-

position Based Security Tools. In The 10th Annual Network and Distributed

Systems Security Symposium, San Diego, CA, February 2003.

[60] Tal Garfinkel, Ben Pfaff, and Mendel Rosenblum. Ostia: A Delegating Architecture

for Secure System Call Interposition. In The 11th Network and Distributed

Systems Security Symposium, February 2004.

[61] James Gettys and Robert W. Scheifler. Xlib - C Language X Interface. X

Consortium, Inc., 1996. p. 224.

[62] Martyn Gilmore. 10Day CERT Advisory on PDF Files. http://seclists.

org/fulldisclosure/2003/Jun/0463.html, June 2003.

[63] Gnome.org. Libwnck Reference Manual. http://library.gnome.org/devel/

libwnck/.

[64] GOBBLES Security. Local/Remote Mpg123 Exploit. http://www.opennet.

ru/base/exploits/1042565884_668.txt.html, January 2003.

[65] L. Gong and R. Schemers. Implementing Protection Domains in the Java De-

velopment Kit 1.2. In The 1998 Internet Society Symposium on Network and

Distributed System Security, pages 125–134, San Diego, CA, 1998.

[66] Google. Google Chrome - Features. http://www.google.com/chrome/intl/

en/features.html.

[67] GreyMagic Security Research. Reading Local Files in Netscape 6 and Mozilla.

http://sec.greymagic.com/adv/gm001-ns/, April 2002.

[68] Philippe Grosjean. Speed Comparison of Various Number Crunching Packages

(Version 2). http://www.sciviews.org/benchmark/, March 2003.

[69] Douglas Hanks. Sudosh. http://sourceforge.net/projects/sudosh/.

[70] Joseph Heller. Catch-22. Simon and Schuster, 1961.

[71] Sotiris Ioannidis, Steven M. Bellovin, and Jonathan Smith. Sub-Operating

Systems: A New Approach to Application Security. In SIGOPS European

Workshop, Saint-Emilion, France, September 2002.

[72] Shvetank Jain, Fareha Shafique, Vladan Djeric, and Ashvin Goel. Application-

level Isolation and Recovery with Solitude. In The 3rd ACM European Confer-

ence on Computer Systems, pages 95–107, Glasgow, Scotland, April 2008.

[73] Michael K. Johnson. Linux Kernel Hackers’ Guide. The Linux Documentation

Project, 1997.

[74] Poul-Henning Kamp and Robert N. M. Watson. Jails: Confining the Omnipo-

tent Root. In The 2nd International SANE Conference, MECC, Maastricht,

The Netherlands, May 2000.

[75] Paul Karger. Personal Communication, May 2009.

[76] Jeffrey Katcher. PostMark: A New File System Benchmark. Technical Report

TR3022, Network Appliance, Inc., October 1997.

[77] Jeffrey O. Kephart and David M. Chess. The Vision of Autonomic Computing.

IEEE Computer, pages 41–50, January 2003.

[78] Yousef A. Khalidi and Michael N. Nelson. Extensible File Systems in Spring.

In The 14th ACM Symposium on Operating Systems Principles, pages 1–14,

Asheville, NC, December 1993. ACM.

[79] Gene Kim and Eugene Spafford. Experience with Tripwire: Using Integrity

Checkers for Intrusion Detection. In The 1994 System Administration, Net-

working, and Security Conference, April 1994.

[80] Calvin Ko, Timothy Fraser, Lee Badger, and Douglas Kilpatrick. Detecting and

Countering System Intrusions Using Software Wrappers. In The 9th USENIX

Security Symposium, Denver, CO, August 2000.

[81] Oren Laadan, Ricardo Baratto, Dan Phung, Shaya Potter, and Jason Nieh. De-

jaView: A Personal Virtual Computer Recorder. In The 21st ACM Symposium

on Operating Systems Principles, Stevenson, WA, October 2007.

[82] Butler Lampson. Accountability and Freedom. http://research.microsoft.

com/en-us/um/people/blampson/slides/accountabilityandfreedom.ppt,

September 2005.

[83] Jeffrey P. Lanza and Shawn V. Hernan. Remote Buffer Overflow in Sendmail.

Technical Report CA-2003-07, CERT Coordination Center, March 2003.

[84] Zhenkai Liang, V.N. Venkatakrishnan, and R. Sekar. Isolated Program Exe-

cution: An Application Transparent Approach for Executing Untrusted Pro-

grams. In 19th Annual Computer Security Applications Conference, Las Vegas,

NV, December 2003.

[85] Michael Litzkow, Todd Tannenbaum, Jim Basney, and Miron Livny. Checkpoint

and Migration of UNIX Processes in the Condor Distributed Processing System.

Technical Report 1346, University of Wisconsin Madison Computer Sciences,

April 1997.

[86] Peter Loscocco and Stephen Smalley. Integrating Flexible Support for Security

Policies into the Linux Operating System. In The FREENIX Track: 2001

USENIX Annual Technical Conference, Boston, MA, June 2001.

[87] David E. Lowell, Yasushi Saito, and Eileen J. Samberg. Devirtualizable Virtual

Machines Enabling General, Single-node, Online Maintenance. In The 11th

International Conference on Architectural Support for Programming Languages

and Operating Systems, Boston, MA, October 2004.

[88] Art Manion, Shawn V. Hernan, and Jeffrey P. Lanza. Buffer Overflow in Sendmail.

Technical Report CA-2003-12, CERT Coordination Center, March 2003.

[89] David Mazieres. A Toolkit for User-Level File Systems. In The 2001 USENIX

Annual Technical Conference, pages 261–274, Boston, MA, June 2001.

[90] Kirby McCoy. VMS File System Internals. Digital Press, 1990.

[91] Mark McLoughlin. QCOW2 Image Format. http://www.gnome.org/~markmc/

qcow-image-format.htm, September 2008.

[92] Microsoft. Microsoft Application Virtualization. http://www.microsoft.com/

systemcenter/appv/default.mspx.

[93] Microsoft Corp. SendMessage Function. http://msdn.microsoft.com/en-us/

library/ms644950(VS.85).aspx.

[94] Moka5. Moka5 Technology Overview. http://www.moka5.com/node/381,

November 2006.

[95] Sape J. Mullender, Guido Van Rossum, Andrew S. Tanenbaum, Robert van

Renesse, and Hans Van Staveren. Amoeba: A Distributed Operating System

for the 1990s. IEEE Computer, 23(5):44–53, May 1990.

[96] Kiran-Kumar Muniswamy-Reddy, Charles P. Wright, Andrew Himmer, and

Erez Zadok. A Versatile and User-Oriented Versioning File System. In The

3rd USENIX Conference on File and Storage Technologies, pages 115–128, San

Francisco, CA, March/April 2004.

[97] Rajeev Nagar. Filter Drivers. In Windows NT File System Internals: A Devel-

oper’s Guide. O’Reilly, September 1997.

[98] Gustavo Niemeyer. Smart Package Manager. http://labix.org/smart.

[99] Peter Norton, Peter Aitken, and Richard Wilton. The Peter Norton PC Pro-

grammer’s Bible: The Ultimate Reference to the IBM PC and Compatible Hard-

ware and Systems Software. Microsoft Press, 1993.

[100] Steven Osman, Dinesh Subhraveti, Gong Su, and Jason Nieh. The Design and

Implementation of Zap: A System for Migrating Computing Environments. In

The 5th Symposium on Operating Systems Design and Implementation, Boston,

MA, December 2002.

[101] Paul A. Karger and Roger R. Schell. Multics Security Evaluation: Vulnerability

Analysis, Volume II. Technical Report ESD-TR-74-193, HQ Electronic Systems

Division: Hanscom AFB, MA, June 1974.

[102] Jan-Simon Pendry and Marshall Kirk McKusick. Union Mounts in 4.4BSD-lite.

In The 1995 USENIX Technical Conference, New Orleans, LA, January 1995.

[103] Ben Pfaff, Tal Garfinkel, and Mendel Rosenblum. Virtualization Aware File

Systems: Getting Beyond the Limitations of Virtual Disks. In The 3rd Symposium

on Networked Systems Design and Implementation, San Jose, CA, May 2006.

[104] Rob Pike, David L. Presotto, Ken Thompson, and Howard Trickey. Plan 9

from Bell Labs. In The 1990 Summer UKUUG Conference, pages 1–9, London,

United Kingdom, July 1990. UKUUG.

[105] Rob Pike and Dennis M. Ritchie. The Styx Architecture for Distributed Systems.

Bell Labs Technical Journal, 4(2):146–152, 1999.

[106] James S. Plank, Micah Beck, Gerry Kingsley, and Kai Li. Libckpt: Transparent

Checkpointing under Unix. In The 1995 USENIX Winter Technical Conference,

pages 213–223, New Orleans, LA, January 1995.

[107] Thomas Porter and Tom Duff. Compositing Digital Images. Computer Graph-

ics, 18(3):253–259, July 1984.

[108] Jef Poskanzer. http://www.acme.com/software/http_load/.

[109] Shaya Potter, Ricardo Baratto, Oren Laadan, Leonard Kim, and Jason Nieh.

MediaPod: A Personalized Multimedia Desktop In Your Pocket. In The 11th

IEEE International Symposium on Multimedia, pages 219–226, San Diego, CA,

December 2009.

[110] Shaya Potter, Ricardo Baratto, Oren Laadan, and Jason Nieh. GamePod:

Persistent Gaming Sessions on Pocketable Storage Devices. In The 3rd Inter-

national Conference on Mobile Ubiquitous Computing, Systems, Services, and

Technologies, Sliema, Malta, October 2009.

[111] Shaya Potter, Steven M. Bellovin, and Jason Nieh. Two-Person Control

Administration: Preventing Administrative Faults through Duplication. In

The 23rd Large Installation System Administration Conference, Baltimore, MD,

November 2009.

[112] Shaya Potter and Jason Nieh. Reducing Downtime Due to System Maintenance

and Upgrades. In The 19th Large Installation System Administration Conference,

pages 47–62, San Diego, CA, December 2005.

[113] Shaya Potter and Jason Nieh. WebPod: Persistent Web Browsing Sessions

with Pocketable Storage Devices. In The 14th International World Wide Web

Conference, Chiba, Japan, May 2005.

[114] Shaya Potter and Jason Nieh. Highly Reliable Mobile Desktop Computing in

Your Pocket. In The 2006 IEEE Computer Society Signature Conference on

Software Technology and Applications, September 2006.

[115] Shaya Potter, Jason Nieh, and Matt Selsky. Secure Isolation of Untrusted

Legacy Applications. In The 21st Large Installation System Administration

Conference, pages 117–130, Dallas, TX, November 2007.

[116] Daniel Price and Andrew Tucker. Solaris Zones: Operating System Support

for Consolidating Commercial Workloads. In 18th Large Installation System

Administration Conference, November 2004.

[117] Debian Project. DDP Developers’ Manuals. http://www.debian.org/doc/

devel-manuals.

[118] Niels Provos. Improving Host Security with System Call Policies. In The 12th

USENIX Security Symposium, Washington, DC, August 2003.

[119] Jim Pruyne and Miron Livny. Managing Checkpoints for Parallel Programs. In

The 2nd Workshop on Job Scheduling Strategies for Parallel Processing, Hon-

olulu, HI, April 1996.

[120] Richard F. Rashid and George G. Robertson. Accent: A Communication Ori-

ented Network Operating System Kernel. In The 8th ACM Symposium on Op-

erating System Principles, pages 64–75, Bretton Woods, NH, December 1984.

[121] Darrell Reimer, Arun Thomas, Glenn Ammons, Todd Mummert, Bowen

Alpern, and Vasanth Bala. Opening Black Boxes: Using Semantic Information

to Combat Virtual Machine Image Sprawl. In The 2008 ACM International

Conference on Virtual Execution Environments, Seattle, WA, March 2008.

[122] Charles Reis and Steven D. Gribble. Isolating Web Programs in Modern

Browser Architectures. In The 4th ACM European Conference on Computer

Systems, Nuremberg, Germany, March 2009.

[123] Eric Rescorla. Security Holes... Who Cares? In The 12th USENIX Security

Conference, Washington, D.C., August 2003.

[124] David Rosenthal. Evolving the Vnode Interface. In The 1990 USENIX Summer

Technical Conference, pages 107–118, June 1990.

[125] Marc Rozier, Vadim Abrossimov, Francois Armand, I. Boule, Michel Gien,

Marc Guillemont, F. Herrman, Claude Kaiser, S. Langlois, P. Leonard, and

W. Neuhauser. Overview of the Chorus Distributed Operating System. In

The Workshop on Micro-Kernels and Other Kernel Architectures, pages 39–70,

Seattle, WA, 1992.

[126] Jerome H. Saltzer and Michael D. Schroeder. The Protection of Information in

Computer Systems. In The 4th ACM Symposium on Operating System Princi-

ples, Yorktown Heights, NY, October 1973.

[127] Douglas S. Santry, Michael J. Feeley, Norman C. Hutchinson, Alistair C. Veitch,

Ross W. Carton, and Jacob Ofir. Deciding When to Forget in the Elephant

File System. In The 17th ACM Symposium on Operating Systems Principles,

Charleston, SC, December 1999.

[128] Constantine P. Sapuntzakis, Ramesh Chandra, Ben Pfaff, Jim Chow, Monica S.

Lam, and Mendel Rosenblum. Optimizing the Migration of Virtual Comput-

ers. In The 5th Symposium on Operating Systems Design and Implementation,

Boston, MA, December 2002.

[129] Brian K. Schmidt. Supporting Ubiquitous Computing with Stateless Consoles

and Computation Caches. PhD thesis, Computer Science Department, Stanford

University, August 2000.

[130] Glenn C. Skinner and Thomas K. Wong. “Stacking” Vnodes: A Progress Report.

In The 1993 USENIX Summer Technical Conference, pages 1–27, Cincinnati,

Ohio, June 1993.

[131] Peter Smith and Norman C. Hutchinson. Heterogeneous Process Migration:

The Tui System. Software – Practice and Experience, 28(6):611–639, 1998.

[132] Craig A. N. Soules, Garth R. Goodson, John D. Strunk, and Gregory R. Ganger.

Metadata Efficiency in a Comprehensive Versioning File System. In The 2nd

USENIX Conference on File and Storage Technologies, San Francisco, CA,

March 2003.

[133] Ray Spencer, Stephen Smalley, Peter Loscocco, Mike Hibler, David Andersen,

and Jay Lepreau. The Flask Security Architecture: System Support for Diverse

Security Policies. In The 8th USENIX Security Symposium, Washington, DC,

August 1999.

[134] Peter Stein and Peter Feaver. Assuring Control of Nuclear Weapons. University

Press of America, 1987.

[135] Sun Microsystems, Inc. NFS: Network File System Protocol Specification. Tech-

nical Report RFC 1094, Internet Engineering Task Force, March 1989.

[136] Michael M. Swift, Brian N. Bershad, and Henry M. Levy. Improving the Re-

liability of Commodity Operating Systems. In The 19th ACM Symposium on

Operating Systems Principles, pages 207–222, Bolton Landing, NY, USA, Oc-

tober 2003. ACM Press.

[137] Miklos Szeredi. Filesystem in Userspace. http://fuse.sourceforge.net/.

[138] Johnny S. Tolliver. Compartmented Mode Workstation (CMW) Comparisons.

In 17th DOE Computer Security Group Training Conference, Milwaukee, WI,

May 1995.

[139] Anthony Towns. Checking Installability is an NP-Complete Prob-

lem. http://www.mail-archive.com/[email protected]/

msg03311.html, November 2007.

[140] Satoshi Uchino. MetaVNC - A Window Aware VNC. http://metavnc.

sourceforge.net/.

[141] VMware, Inc. VMware VMotion for Live Migration of Virtual Machines. http:

//www.vmware.com/products/vi/vc/vmotion.html.

[142] VMware, Inc. http://www.vmware.com.

[143] VMware, Inc. VMware Workstation 6.5 Release Notes. http://www.vmware.

com/support/ws65/doc/releasenotes_ws65.html, October 2008.

[144] David Wagner. Janus: An Approach for Confinement of Untrusted Applica-

tions. Master’s thesis, University of California, Berkeley, 1999.

[145] Robert N. M. Watson. Exploiting Concurrency Vulnerabilities in System Call

Wrappers. In The 1st USENIX Workshop on Offensive Technologies, Boston,

MA, August 2007.

[146] Florian Weimer. DSA-1438-1 Tar – Several Vulnerabilities. http://www.ua.

debian.org/security/2007/dsa-1438, December 2007.

[147] Andrew Whitaker, Marianne Shaw, and Steven D. Gribble. Scale and Perfor-

mance in the Denali Isolation Kernel. In The 5th Symposium on Operating

Systems Design and Implementation, Boston, MA, December 2002.

[148] Wilshire State Bank. Safe Deposit Boxes. https://www.wilshirebank.com/

public/additional_safedeposit.asp, 2008.

[149] David Wise. Spy: The Inside Story of how the FBI’s Robert Hanssen Betrayed

America. Random House, 2002.

[150] Charles P. Wright, Jay Dave, Puja Gupta, Harikesavan Krishnan, David P.

Quigley, Erez Zadok, and Mohammad Nayyer Zubair. Versatility and Unix

Semantics in Namespace Unification. ACM Transactions on Storage, 2(1):1–32,

February 2006.

[151] X/Open, editor. Protocols for X/Open PC Interworking: SMB, Version 2.

X/Open Company Ltd, 1992.

[152] Erez Zadok and Jason Nieh. FiST: A Language for Stackable File Systems. In

The 2000 USENIX Annual Technical Conference, pages 55–70, San Diego, CA,

June 2000.

Appendix A

Restricted System Calls

To securely isolate regular Linux processes, we interpose on a number of additional

system calls beyond what is necessary for other forms of virtualization. Below is a

complete list of the few system calls that require more than plain virtualization on

Linux. We give the reasoning for the interposition, where it is not self-explanatory,

and note what functionality was changed from the base system call. Most system

calls do not require more than simple virtualization to ensure isolation because vir-

tualization of the resources itself isolates them. For example, the kill system call

cannot signal a process outside the virtualized environment because the virtualized

namespace will not map it, so the system call cannot reference the process.
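
The kill example can be illustrated with a small Python model. This is only a sketch of the idea of a private process namespace, not the interposition code used by the system; the class VirtualEnvironment, its methods, and the host_kill callback are hypothetical names introduced for illustration, and returning ESRCH for an unmapped identifier is an assumption.

```python
import errno


class VirtualEnvironment:
    """Hypothetical model of a per-environment process namespace."""

    def __init__(self):
        self._virt_to_host = {}  # virtual PID -> host PID
        self._next_vpid = 1

    def add_process(self, host_pid):
        """Register a host process and return its virtual PID."""
        vpid = self._next_vpid
        self._next_vpid += 1
        self._virt_to_host[vpid] = host_pid
        return vpid

    def kill(self, vpid, sig, host_kill):
        """Translate vpid and forward the signal to the host.

        An identifier with no mapping cannot name any process, so the call
        fails with ESRCH (assumed error code) before reaching the host.
        """
        host_pid = self._virt_to_host.get(vpid)
        if host_pid is None:
            return -errno.ESRCH
        return host_kill(host_pid, sig)
```

A process inside the environment can only name processes by virtual identifier, so a host identifier that was never mapped simply does not resolve, and no explicit permission check is needed.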

A.1 Host-Only System Calls

These system calls are generally not needed in a virtualized environment and are

therefore not allowed.

1. mount – If a user within a virtualized environment were able to mount

a file system, they could mount a file system with device nodes already

present and would thus be able to access the underlying system directly

in a manner not controlled by the virtualization architecture. Any file

systems that need to be mounted within the virtualized environment must

be mounted by the host.

2. stime, adjtimex, settimeofday – Allow a privileged process to adjust

the host’s clock.

5. acct – Sets the file on the host that BSD process accounting information

should be written to.

6. swapon, swapoff – Control swap space allocation.

8. reboot – Causes the system to reboot or changes Ctrl-Alt-Delete func-

tionality.

9. ioperm, iopl – Allow a privileged process to gain direct access to under-

lying hardware resources.

11. create_module, init_module, delete_module, query_module – Insert

and remove kernel modules.

15. nfsservctl – Enables a privileged process inside a virtual environment

to change the host’s internal NFS server.

16. bdflush – Controls the kernel’s buffer-dirty-flush daemon.

17. sysctl – A deprecated system call that enables runtime setting of kernel

parameters.

18. clock_settime – Sets the realtime clock and is only usable by processes

with privilege on a regular system.
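
A deny list of this kind can be sketched as a simple table lookup. The Python below is illustrative only: the function interpose, its parameters, and the choice of EPERM as the returned error are assumptions, while the system call names are those listed above.

```python
import errno

# System call names from the list above; the table itself is illustrative.
HOST_ONLY = frozenset({
    "mount", "stime", "adjtimex", "settimeofday", "acct", "swapon", "swapoff",
    "reboot", "ioperm", "iopl", "create_module", "init_module",
    "delete_module", "query_module", "nfsservctl", "bdflush", "sysctl",
    "clock_settime",
})


def interpose(name, in_virtual_env, real_call, *args):
    """Refuse host-only calls made from inside a virtual environment.

    real_call stands in for the unmodified system call (hypothetical).
    """
    if in_virtual_env and name in HOST_ONLY:
        return -errno.EPERM  # assumed error code
    return real_call(*args)
```

Calls made on the host itself, and calls not in the table, pass through unchanged.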


A.2 Root-Squashed System Calls

These system calls are generally useful within a virtualized environment but

treat the privileged root user in a manner that breaks the virtualization

abstraction. They can, however, be used safely as long as the root user is given

no special privilege.

1. nice, setpriority, sched_setscheduler, sched_setparam – These system

calls let a process change its priority. If a process is running as root

(UID 0), it can increase its priority and freeze out other processes on the

system. Therefore, we prevent any virtualized process from increasing its

priority.

5. ioctl – This system call is a demultiplexer that allows kernel

device drivers and subsystems to add their own functions that can be

called from user space. Because these functions can expose functionality that

allows root to access the underlying host, all ioctl requests outside a limited,

audited safe set are squashed to user nobody, much as NFS squashes root.

6. setrlimit – This system call allows processes running as UID 0 to raise

their resource limits beyond what was preset, thereby allowing them to

disrupt other processes on the system by using too many resources. We

therefore prevent virtualized processes from using this system call to in-

crease the resources available to them.

7. mlock, mlockall – These system calls allow a privileged process to pin

an arbitrary amount of memory, thereby allowing a virtualized process to

lock all available memory and starve all other processes on the host. We

therefore squash a privileged process to user nobody when it attempts

to call either of these system calls and treat it like an unprivileged process.
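
Root squashing can be pictured as substituting an unprivileged identity before the usual privilege check runs. The following Python sketch makes that concrete under stated assumptions: NOBODY_UID is the conventional value 65534, and squash_uid, priority_change_allowed, and mlock_allowed are hypothetical helper names, not functions from the system described here.

```python
NOBODY_UID = 65534  # assumption: the conventional UID of user "nobody"


def squash_uid(uid, in_virtual_env):
    """Return the UID that a privilege check should see for this caller."""
    if in_virtual_env and uid == 0:
        return NOBODY_UID
    return uid


def priority_change_allowed(uid, in_virtual_env, current_nice, requested_nice):
    """Allow a priority increase (a lower nice value) only to unsquashed root."""
    if requested_nice >= current_nice:
        return True  # keeping or lowering priority is always allowed
    return squash_uid(uid, in_virtual_env) == 0


def mlock_allowed(uid, in_virtual_env, requested_bytes, memlock_limit):
    """Unsquashed root may pin any amount; everyone else is held to the limit."""
    if squash_uid(uid, in_virtual_env) == 0:
        return True
    return requested_bytes <= memlock_limit
```

Because only the identity seen by the check changes, the system call itself stays available to every process in the environment, including root.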

A.3 Option-Checked System Calls

These are system calls that are used within a virtualized environment, but can be

used in a way that can break the virtualization. Therefore, the options passed to

them are checked to ensure they are valid options for the virtualized environment.

1. mknod – This system call allows a privileged user to create special files,

such as pipes, sockets, devices, and even regular files. Because a privileged

process needs to use this functionality, the system call cannot be disabled.

However, if the process could create a device, the device would be an

access point to the underlying host system. Therefore, when a virtualized

process uses this system call, the options are checked to prevent it from

creating a device special file, while allowing the other types.

2. ustat – This system call returns information about a mounted file system,

specifically how much free space remains. This can be useful for a pro-

cess within a virtualized environment, but it can also provide information

about a host’s file systems that is not accessible to the processes within

the virtualized environment. Therefore, the options passed to this system

call are checked to ensure that they name the device of a file system that is

available within the virtualized environment.

3. quotactl – This system call sets a limit on the amount of space individual

users can use on a given file system. Virtualized processes are only able

to call it for file systems available within their environment.
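
The option checks above amount to inspecting an argument before the call proceeds. As a hedged illustration, the Python sketch below uses the standard stat module to classify the file type requested from mknod and a set membership test for the device passed to ustat; the function names and the env_devices parameter are hypothetical.

```python
import stat


def mknod_allowed(mode, in_virtual_env):
    """Refuse character and block device nodes inside a virtual environment.

    Pipes, sockets, and regular files remain allowed.
    """
    if not in_virtual_env:
        return True
    return not (stat.S_ISCHR(mode) or stat.S_ISBLK(mode))


def ustat_allowed(dev, env_devices):
    """Allow a query only for a device backing a file system that is
    visible inside the environment (env_devices is a hypothetical set)."""
    return dev in env_devices
```

For example, a mode of stat.S_IFIFO | 0o644 is accepted, while stat.S_IFCHR | 0o600 is refused for a virtualized process.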


A.4 Per-Virtual-Environment System Calls

These system calls are virtualized in addition to the IPC, shared memory, and

process namespace virtualization provided by Zap [100].

1. sethostname, gethostname, setdomainname, getdomainname, uname, newuname,

olduname – These system calls read and write the name for the underlying

host. We wrap these system calls to read and write a virtual environment-

specific name and allow each virtual environment to set the name inde-

pendently.

8. socketcall – This system call provides access to the multitude of socket

system calls available in the kernel. Because a secure virtualized environ-

ment provides each environment with its own network namespace, this

system call is restricted to operating only on the namespace that belongs

to the virtualized environment.

9. keyctl, add_key, request_key – These system calls affect the key management

provided by the kernel. Because keys can be associated with

user and group identifiers, they must be virtualized to a per-virtualized-

environment namespace.

12. mq_open, mq_unlink, mq_timedsend, mq_timedreceive, mq_notify, mq_getsetattr

– These system calls provide access to the kernel’s POSIX message queues.

Because they are used by name, they have to be virtualized on a per-

environment basis.
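
Per-environment state of this kind can be modeled as a private copy of each name that the wrapped calls read and write. The Python sketch below shows the idea for the host name only; the classes HostNames and Environment are hypothetical, and initializing each environment from the host's names is an assumption made for the example.

```python
class HostNames:
    """Hypothetical holder for a host name and a domain name."""

    def __init__(self, hostname, domainname=""):
        self.hostname = hostname
        self.domainname = domainname


class Environment:
    """Hypothetical virtual environment with its own private names."""

    def __init__(self, host_names):
        # Assumption: each environment starts from a copy of the host's names.
        self.names = HostNames(host_names.hostname, host_names.domainname)

    def sethostname(self, name):
        """Write the environment-specific name; the host is left untouched."""
        self.names.hostname = name

    def gethostname(self):
        """Read the environment-specific name."""
        return self.names.hostname
```

Setting the name in one environment changes neither the host's name nor the name seen in any other environment.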
