
RICE UNIVERSITY

Efficient Hardware/Software Architectures for Highly Concurrent Network Servers

by

Paul Willmann

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE

Doctor of Philosophy

Approved, Thesis Committee:

Behnaam Aazhang, Chair
J. S. Abercrombie Professor of Electrical and Computer Engineering

Alan L. Cox
Associate Professor of Computer Science and of Electrical and Computer Engineering

David B. Johnson
Associate Professor of Computer Science and of Electrical and Computer Engineering

Scott Rixner
Associate Professor of Computer Science and of Electrical and Computer Engineering

Houston, Texas

November 2007


Abstract

Internet services continue to incorporate increasingly bandwidth-intensive applica-

tions, including audio and high-quality, feature-length video. As the pace of unipro-

cessor performance improvements slows, however, network servers can no longer rely

on uniprocessor technology to fuel the overall performance improvements necessary

for next-generation, high-bandwidth applications. Furthermore, rising per-machine

power costs in the datacenter are driving demand for solutions that enable consoli-

dation of multiple servers onto one machine, thus improving overall efficiency. This

dissertation presents strategies that improve the efficiency and performance of server

I/O using both virtual-machine concurrency and thread concurrency. Contemporary

virtual machine monitors (VMMs) aim to improve server efficiency by enabling con-

solidation of separate isolated servers onto one physical machine. However, modern

VMMs incur heavy device virtualization penalties, ultimately reducing application

performance by up to a factor of 3. Contemporary parallelized operating systems

aim to improve server performance by exploiting thread parallelism using multiple

processors. However, the concurrency and communication models used to imple-

ment that parallelism impose significant performance penalties, severely damaging

the server’s ability to leverage more processors to attain higher performance. This

dissertation examines the architectural sources of these inefficiencies and introduces

new OS- and VMM-level architectures that greatly reduce them.


Acknowledgments

I would like to thank Dr. Scott Rixner and Dr. Alan Cox for their steady tech-

nical guidance throughout the course of this research. Also, I would like to thank

Dr. Behnaam Aazhang and Dr. David Johnson for their perspectives regarding and

support of this work. Thanks also to Dr. Vijay Pai, who helped and encouraged

me from the beginning of my graduate school career through its conclusion, even

after he moved on to a new opportunity at Purdue University. Additionally, I want

to acknowledge Jeff Shafer’s significant contributions regarding development and de-

bugging of the CDNA prototype hardware. David Carr helped tremendously with

bringup of and scripting support for the Xen VMM environment. Marcos Huerta

provided the formatting template for this document and graciously helped me mod-

ify it to suit my needs. I also want to thank my family and friends, whose constant

support and encouragement made this work possible. Finally, I want to particularly

thank my wife Leighann. Her perspectives on the scientific process and academic

research made this work better, and her enduring love and patience made my life

better. Thank you.


Contents

1 Introduction
  1.1 Server Concurrency Trends
  1.2 Contributions
  1.3 Dissertation Organization

2 Background
  2.1 Contemporary Server and CPU Technology
  2.2 Existing OS Support for Concurrent Network Servers
  2.3 Existing VMM Support for Concurrent Network Servers
    2.3.1 Private I/O
    2.3.2 Software-Shared I/O
    2.3.3 Hardware-shared I/O
    2.3.4 Protection Strategies for Direct-Access Private and Shared Virtualized I/O
  2.4 Hardware Support for Concurrent Server I/O
    2.4.1 Hardware Support for Parallel Receive-side OS Processing
    2.4.2 User-level Network Interfaces
  2.5 Summary

3 Parallelization Strategies for OS Network Stacks
  3.1 Background
  3.2 Parallel Network Stack Architectures
    3.2.1 Message-based Parallelism (MsgP)
    3.2.2 Connection-based Parallelism (ConnP)
  3.3 Methodology
    3.3.1 Evaluation Hardware
    3.3.2 Parallel TCP Benchmark
  3.4 Evaluation using One 10 Gigabit NIC
  3.5 Evaluation using Multiple Gigabit NICs
  3.6 Discussion and Analysis
    3.6.1 Locking Overhead
    3.6.2 Scheduler Overhead
    3.6.3 Cache Behavior

4 Concurrent Direct Network Access
  4.1 Networking in Xen
    4.1.1 Hypervisor and Driver Domain Operation
    4.1.2 Device Driver Operation
    4.1.3 Performance
  4.2 CDNA Architecture
    4.2.1 Multiplexing Network Traffic
    4.2.2 Interrupt Delivery
    4.2.3 DMA Memory Protection
    4.2.4 Discussion
  4.3 CDNA NIC Implementation
  4.4 Evaluation
    4.4.1 Experimental Setup
    4.4.2 Single Guest Performance
    4.4.3 Memory Protection
    4.4.4 Scalability

5 Protection Strategies for Direct I/O in Virtual Machine Monitors
  5.1 Background
  5.2 IOMMU-based Protection
    5.2.1 Single-use Mappings
    5.2.2 Shared Mappings
    5.2.3 Persistent Mappings
  5.3 Software-based Protection
  5.4 Protection Properties
    5.4.1 Inter-Guest Protection
    5.4.2 Intra-Guest Protection
  5.5 Experimental Setup
  5.6 Evaluation
    5.6.1 TCP Stream
    5.6.2 VoIP Server
    5.6.3 Web Server
    5.6.4 Discussion

6 Conclusion
  6.1 Orchestrating OS parallelization to characterize and improve I/O processing
  6.2 Reducing virtualization overhead using a hybrid hardware/software approach
  6.3 Improving performance and efficiency of protection strategies for direct-access I/O
  6.4 Summary


List of Figures

1.1 Uniprocessor frequency and network bandwidth history.
1.2 Network I/O throughput disparity between the modern FreeBSD operating system and link capacity, using either six 1-Gigabit interfaces or one 10-Gigabit interface.
1.3 Network I/O throughput disparity between native Linux and virtualized Linux, using six 1-Gigabit Ethernet interfaces.
2.1 Uniprocessor performance history (data source: Standard Performance Evaluation Corporation).
2.2 The efficiency/parallelism continuum of OS network-stack parallelization strategies.
2.3 A contemporary software-based, shared-I/O virtualization architecture.
3.1 Aggregate transmit throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
3.2 Aggregate receive throughput for uniprocessor, message-parallel and connection-parallel network stacks using 6 NICs.
3.3 The outbound control path in the application thread context.
3.4 Aggregate transmit throughput for the ConnP-L network stack as the number of locks is varied.
3.5 Profile of L2 cache misses per 1 Kilobyte of payload data (transmit test).
3.6 Profile of L2 cache misses per 1 Kilobyte of payload data (receive test).
4.1 Shared networking in the Xen virtual machine environment.
4.2 The CDNA shared networking architecture in Xen.
4.3 Transmit throughput for Xen and CDNA (with CDNA idle time).
4.4 Receive throughput for Xen and CDNA (with CDNA idle time).


List of Tables

2.1 I/O virtualization methods.
3.1 FreeBSD network bandwidth (Mbps) using a single processor and a 10 Gbps network interface.
3.2 Aggregate throughput for uniprocessor, message-parallel and connection-parallel network stacks.
3.3 Percentage of lock acquisitions for global TCP/IP locks that do not succeed immediately when transmitting data.
3.4 Cycles spent managing the scheduler and scheduler synchronization per Kilobyte of payload.
3.5 Percentage of L2 cache misses within the network stack to global data structures.
4.1 Transmit and receive performance for native Linux 2.6.16.29 and paravirtualized Linux 2.6.16.29 as a guest OS within Xen 3.
4.2 Transmit performance for a single guest with 2 NICs using Xen and CDNA.
4.3 Receive performance for a single guest with 2 NICs using Xen and CDNA.
4.4 CDNA 2-NIC transmit performance with and without DMA memory protection.
4.5 CDNA 2-NIC receive performance with and without DMA memory protection.
5.1 TCP Stream profile.
5.2 OpenSER profile.
5.3 Web Server profile using write().
5.4 Web Server profile using zero-copy sendfile().


Chapter 1

Introduction

Internet services continue to incorporate ever more bandwidth-intensive, high-

performance applications, including audio and high-quality, feature-length video. Fur-

thermore, network services are proliferating into every aspect of businesses, with even

small-scale organizations leveraging robust storage, database, and voice-over-IP tech-

nology to manage resources, facilitate communications, and reduce costs. However,

ongoing processor trends toward chip multiprocessing present new challenges and op-

portunities for server architectures as those architectures strive to keep pace with

performance and efficiency demands. This dissertation addresses these challenges

with new operating system and virtual machine monitor architectures designed to

provide efficient, high-performance network input/output (I/O) support for coming

generations of servers.

1.1 Server Concurrency Trends

Throughout the vast expansion of Internet technology in the 1990s, processor per-

formance and network server bandwidth both grew at exponential rates. Figure 1.1


[Figure: log-scale plot of Ethernet bandwidth (Mbps) and processor frequency (MHz) by year, 1980-2005.]

Figure 1.1: Uniprocessor frequency and network bandwidth history.

shows the progression of uniprocessor frequency (for the Intel IA32 family of proces-

sors) and network interface bandwidth since 1982. Though frequency alone is not a

comprehensive measure of performance, the figure does show a qualitative compari-

son of Ethernet and commodity uniprocessor trends over the past twenty five years.

The figure shows the exponential growth of both uniprocessor frequency and Ethernet

bandwidth throughout the 1990s and early 2000s. However, the rate of uniproces-

sor frequency increases shows a marked decline in 2003, when physical circuit limitations, such as increasing per-cycle interconnect delays, started to overwhelm contemporary CPU architectures. Rather than continuing to rely on ever-larger global interconnects, CPU architects turned to multicore designs.

The move to multicore designs has implications for both server performance and

efficiency. In terms of performance, it is important that the server be able to leverage

its multiple processors to keep pace with network bandwidth improvements, just as

past servers have leveraged faster uniprocessors to deliver more server performance.

In terms of efficiency, multicore architectures provide new opportunities to consol-


[Figure: two panels, (a) Transmit and (b) Receive, plotting TCP throughput (Mb/s) against link capacity for a uniprocessor OS and a multiprocessor OS (p=4).]

Figure 1.2: Network I/O throughput disparity between the modern FreeBSD operating system and link capacity, using either six 1-Gigabit interfaces or one 10-Gigabit interface.

idate isolated servers from several different machines onto just one machine. Such

consolidation maximizes utilization of the server’s physical CPU and I/O resources

and can substantially reduce power and cooling costs, according to a study by In-

tel [22]. Virtual machine monitor (VMM) software provides multiplexing facilities

that enables this kind of consolidation and sharing, but it is important that virtual-

ization overheads are kept low so as to maximize the capacity of a consolidated server

and thus maximize the associated power and cooling savings.

However, modern server software architectures fall well short of meeting both

performance and efficiency demands given modern network link speeds. Figure 1.2

illustrates the gap between the theoretical peak I/O throughput of a modern server

and its achieved throughput. The figure shows TCP network throughput, using ei-

ther six 1 Gigabit Ethernet network interface cards (NICs) or a single 10 Gigabit

Ethernet NIC. The throughput achieved by uniprocessor and multiprocessor-capable


[Figure: TCP throughput (Mb/s) for transmit and receive workloads, native versus virtualized.]

Figure 1.3: Network I/O throughput disparity between native Linux and virtualized Linux, using six 1-Gigabit Ethernet interfaces.

OS configurations are compared to the theoretical aggregate TCP throughput offered

by the physical links. The operating system used is FreeBSD, which uses a simi-

lar parallelization strategy to that of Linux and achieves similar performance (not

shown). The uniprocessor configurations use just one 2.2 GHz Opteron processor

core, whereas the multiprocessor configurations use all four cores of a chip multi-

processor that features two chips with two cores each. In all cases, the application’s

thread count is matched to the number of processors. The application is a lightweight

microbenchmark that simply sends or receives data, thus isolating and stressing the

operating system’s network stack. As the figure shows, existing approaches for net-

work OS multiprocessing can improve network performance in some cases. However,

the performance improvement can be meager (or nonexistent) and falls well short of


being able to saturate link resources. Furthermore, current multiprocessor OS orga-

nizations are poorly suited for managing a single, high-bandwidth NIC, sustaining

less than half of the available link bandwidth in the best case.
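For concreteness, the following is a minimal sketch of the kind of transmit-side microbenchmark described above: one thread per processor opens a TCP connection and sends a fixed buffer in a tight loop, so nearly all CPU time is spent in the operating system's network stack. The address, port, and buffer size shown are arbitrary illustrative choices, not the parameters used in the experiments.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* One sender: open a TCP connection and transmit a fixed buffer forever,
 * so nearly all CPU time is spent in the kernel's protocol stack.  One
 * such thread would run per processor. */
static void blast(const char *ip, int port)
{
    static char buf[64 * 1024];          /* payload chunk; contents are irrelevant */
    struct sockaddr_in dst;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return;

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons((unsigned short)port);
    inet_pton(AF_INET, ip, &dst.sin_addr);

    if (connect(fd, (struct sockaddr *)&dst, sizeof(dst)) == 0) {
        while (send(fd, buf, sizeof(buf), 0) > 0)
            ;                            /* keep the link saturated */
    }
    close(fd);
}

int main(void)
{
    blast("192.168.0.2", 5001);          /* hypothetical receiver address and port */
    return 0;
}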

Whereas native OS performance is significantly less than ideal, Figure 1.3 shows

that virtualizing an operating system reduces its I/O performance even more. Fig-

ure 1.3 compares the TCP network throughput of native, unvirtualized Linux using

six 1 Gigabit NICs to that achieved by a virtualized Linux server using the Xen vir-

tual machine monitor. Linux is used as the benchmark native system here because

the Xen open-source VMM features mature support for Linux but only preliminary

support for FreeBSD; regardless, the performance limitations illustrated by the figure

are inherent to the VMM architecture, not the OS. The system under test uses a

single 2.4 GHz Opteron processor core. For both transmit and receive workloads, vir-

tualization imposes an I/O slowdown of more than 300% versus native-OS execution.

1.2 Contributions

This dissertation contributes to the field at the hardware, virtual machine monitor,

and operating system levels. This work tackles the efficiency and performance issues

within and across each of these levels that cause the performance disparities illustrated

in Figures 1.2 and 1.3. The fundamental approach of this research is to use a com-

bination of hardware and software to architect strategies that minimize inefficiencies

in concurrent network servers. Combined, these strategies form a software/hardware


architecture that will efficiently leverage coming generations of multicore network

servers. This architecture comprises three parts, each of which has its own set of

contributions: strategies for efficient network-stack processing by a parallelized oper-

ating system, strategies for efficient network I/O sharing in VMM environments, and

strategies for efficient isolation of direct-access I/O devices in VMM environments.

Defining and exploring the continuum of network stack parallelization.

At the operating system level, an efficient parallelization of the network protocol

processing stack is required so as to maximize system performance and efficiency

on coming generations of chip multiprocessor hardware and to relieve the protocol

processing bottleneck demonstrated in Figure 1.2. Past research efforts have stud-

ied this problem in the context of improving protocol-processing performance for

100 Mb/s Ethernet using the SGI Challenge shared-memory multiprocessor. Though

these studies examined some of the tradeoffs for two different strategies for network-

stack parallelization, ten years later there exists no consensus among OS architects

regarding the unit of concurrency for network stack processing or the method of syn-

chronization. This dissertation defines and explores a continuum of practical paral-

lelization strategies for modern operating systems. The strategies on this continuum

vary according to their unit of concurrency and their method of synchronization and

have different overhead characteristics, and thus different performance and scalability.

Whereas past studies have used emulated network devices and idealized, experimental

research operating systems, this dissertation examines network-stack parallelization


on a real hardware platform using a modern network operating system, including

all of the device and software overhead. Through understanding this continuum of

parallelization and efficiency, this work finds that designing the operating system to

maximize parallel execution can actually decrease performance on parallel hardware

when sources of overhead are ignored. Further, this study identifies the hardware/software interface between high-bandwidth devices and operating systems as a performance bottleneck, and it shows that, once that bottleneck is overcome, efficient network-stack parallelization strategies significantly improve performance over contemporary, inefficient strategies.

Designing a hardware/software architecture for shared I/O virtualiza-

tion. Contemporary architectures for I/O virtualization enable economical sharing

of physical devices among virtual machines. These architectures multiplex I/O traf-

fic, manage device notification messages, and perform direct memory access (DMA)

memory protection entirely in software. However, this software-based design incurs

heavy penalties versus native OS execution, as depicted in Figure 1.3. In contrast,

contemporary research by others has examined purely hardware-based architectures

that aim to reduce software overhead. This dissertation contributes a new architec-

ture for shared I/O virtualization that permits concurrent, direct access by untrusted

virtual machines using a combination of both hardware and software. This research

includes an examination of a prototype network interface developed using this hy-

brid hardware/software approach, which proves effective at eliminating most of the

overhead associated with traditional, software-based shared I/O virtualization. Be-


yond the prototype itself, the primarily software-based mechanism for enforcing DMA

memory protection is an entirely new contribution, differing greatly from contempo-

rary hardware-based mechanisms. The prototype device is a standard expansion card

that requires modest additional hardware beyond that found in a commodity network

interface. This low cost and the architecture's compatibility with existing, unmodified commodity hardware make it ideal for commodity network servers.

Developing and exploring alternative strategies for virtualized direct

I/O access. Unlike the shared, direct-access architecture explored in this disser-

tation, the prevailing industrial solution for high-performance virtualized I/O is to

provide private, direct access to a single device by a single virtual machine. This

obviates the device’s need for multiplexing of traffic among multiple operating sys-

tems, but such systems still need reliable DMA memory protection mechanisms to

prevent an untrusted virtual machine from potentially misusing an I/O device to ac-

cess another virtual machine’s memory. This dissertation contributes to the field by

developing and examining new hardware- and software-based strategies for managing

DMA memory protection and compares them to the state-of-the-art strategy. Con-

temporary high-availability server architectures use a hardware I/O memory manage-

ment unit (IOMMU) to enforce the memory access rules established by the memory

management unit, and commodity CPU manufacturers are aggressively pursuing in-

clusion of IOMMU hardware in next-generation processors. Though the aim of these

architectures is to provide near-native virtualized I/O performance, the strategy for


managing the system’s IOMMU hardware can greatly impact performance and effi-

ciency. This research contributes two novel strategies for achieving direct I/O access

using an IOMMU that, unlike the state-of-the-art strategy, reuse IOMMU entries to

reduce the total overhead of ensuring I/O safety. Further, this research finds that

the software-based DMA memory protection strategy introduced in this dissertation

performs comparably to the most aggressive hardware-based strategy. Contrary to

much of the industrial enthusiasm for IOMMUs in coming commodity servers, this

dissertation concludes that an IOMMU is not necessarily required to achieve safe,

high-performance virtualized I/O.
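As a rough illustration of what such a software-based protection check involves, the sketch below validates every page touched by a guest-supplied DMA request against a per-guest permission table before the request is posted to the NIC. The table layout and helper functions are hypothetical assumptions; this shows only the general shape of the check, not the specific mechanism evaluated later in this dissertation.

#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL

/* Hypothetical per-guest permission check: may this guest direct the NIC
 * to read (or write) the given machine page? */
extern bool guest_owns_page(int guest_id, uint64_t machine_page, bool is_write);

/* Hypothetical: hand a validated (address, length) pair to the NIC. */
extern int nic_post_dma(uint64_t machine_addr, uint32_t len, bool is_write);

/* Validate every page a DMA transfer would touch before the buffer
 * address ever reaches the device; reject the whole request otherwise. */
int safe_post_dma(int guest_id, uint64_t machine_addr, uint32_t len, bool is_write)
{
    uint64_t first, last, page;

    if (len == 0)
        return -1;

    first = machine_addr / PAGE_SIZE;
    last  = (machine_addr + len - 1) / PAGE_SIZE;

    for (page = first; page <= last; page++) {
        if (!guest_owns_page(guest_id, page, is_write))
            return -1;                   /* unauthorized page: block the DMA */
    }
    return nic_post_dma(machine_addr, len, is_write);
}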

1.3 Dissertation Organization

This dissertation presents these contributions in three studies and is organized as

follows. Chapter 2 first provides some background regarding the motivation for this

work and the state-of-the-art hardware and software architectures of contemporary

network servers. Chapter 3 then presents a comparison and analysis of parallelization

strategies for modern thread-parallel operating systems. Chapter 4 introduces the

concurrent direct network access (CDNA) architecture for delivering efficient shared

access to virtualized operating systems in VMM environments. Chapter 5 follows

up with a comparison and analysis of hardware- and software-based strategies for

providing isolation among untrusted virtualized operating systems that have direct

access to I/O hardware, and Chapter 6 concludes.


Chapter 2

Background

Over the past forty years, there have been extensive industrial and academic

efforts toward improving the performance and efficiency of servers. These efforts

have touched on both multiprocessing concurrency and virtual machine concurrency.

Furthermore, there have been many efforts to coordinate the architecture of I/O

hardware (such as network interfaces) with software (both applications and operating

systems) to improve the performance and efficiency of the overall server. This chapter

discusses the background of contemporary server technology and its limitations and

then explores the prior research efforts that are related to the themes and strategies

of this dissertation.

2.1 Contemporary Server and CPU Technology

The Internet expansion of the 1990s was sustained with exponential growth in

processor performance and network server bandwidth. Figure 1.1 shows this expo-

nential progression of uniprocessor frequency (for the Intel IA32 family of processors)

and network interface bandwidth since 1982. These steady, exponential performance


improvements came from process-technology advances, such as feature-size reductions that enabled higher clock frequencies, and from architectural innovations, such

as superscalar instruction execution in CPUs and packet checksum offloading for net-

work interfaces. However, the dominant server design relied on a fairly constant architecture: a single processor using a single network interface, both of which improved exponentially in performance between generations. The commoditization and rapid improvement of this architecture have yielded superior cost efficiencies com-

pared to more specialized designs, ultimately motivating companies such as Google

to standardize their server platforms on this commodity architecture [8].

The consistency of the commodity hardware architecture ensured that legacy soft-

ware architectures could readily leverage successive generations of higher-performance

hardware, ultimately producing higher-performance, higher-capacity servers. Effi-

ciency improvements in system software, such as zero-copy I/O and efficient event-

driven execution models, provided additional server performance on the same archi-

tectures. However, these software innovations did not rely on architectural changes

and instead improved performance using the existing contemporary architecture.

Figure 2.1 confirms this processor trend in terms of performance and shows that

it is not specific to the Intel IA32 family of processors. This figure plots the highest

reported SPEC CPU integer benchmark scores, selected across all families of proces-

sors, for each year since 1995. The left-hand portion of the figure shows SPEC95

scores, and the right-hand portion shows SPEC2000 scores. In 2000, the same com-


[Figure: highest reported SPECint95 (left) and SPECint2000 (right) scores by year, with a long-term performance trend line.]

Figure 2.1: Uniprocessor performance history (data source: Standard Performance Evaluation Corporation).

puter system was evaluated using both SPEC95 and SPEC2000, so the “2000” point

in both graphs corresponds to the same computer. The y axes of each graph are

scaled to each other such that a line with a certain slope in the SPEC95 graph will

have the same slope in the SPEC2000 graph. Though the two benchmark suites are

not identical, they are designed to measure performance on contemporary hardware.

From 1995 to 2003, the average rate of benchmark improvement was 43% per year

(shown as the “Long-term Performance Trend” line in the figure). Though processor

frequencies remained mostly unchanged in the period of 2003-2007 (as shown for the

Intel family of processors in Figure 1.1), processor architects were still able to pro-

duce performance improvements for that time period. However, the rate of SPEC

benchmark improvement dropped significantly, to 19% per year.


The decline in frequency and performance increases stems primarily from transistor-

and circuit-level physical limitations. One of the most significant of these is parasitic

wire capacitance among transistors inside CPUs. The progression toward smaller

transistors has enabled larger-scale integration, but it has also led to increasing rela-

tive delay (in cycles) for global interconnect from generation to generation [15]. The

poor scalability of global interconnection networks (and thus the control circuitry of

modern superscalar processors) is contributing to the slowdown in uniprocessor per-

formance improvements from year to year. This scalability problem is also driving

CPU manufacturers toward multicore CPU designs. It is this migration and the sub-

sequent poor performance of contemporary server operating systems and VMMs that

serves in part as motivation for this work.

Though commodity chip multiprocessor technology is new, software and hard-

ware architects have been developing OS and VMM support for larger-scale, special-

purpose concurrent servers over the past four decades. Contemporary OS and VMM

solutions are derived from these prior endeavors, and current I/O architectures bear

many similarities to their ancestors. However, recent advances in server I/O (such

as the adoption of 10 Gigabit Ethernet) have placed new stresses on these architec-

tures, exposing inefficiencies and bottlenecks that prevent modern systems from fully

utilizing their I/O capabilities, as depicted in Figures 1.2 and 1.3 in the Introduction.

The poor I/O scalability and performance of modern concurrent servers are at-

tributable to both software and hardware inefficiencies. Many existing operating


systems are designed to maximize opportunities for parallelism. However, designs

that maximize parallelism incur higher synchronization and thread-scheduling over-

head, ultimately reducing performance. Furthermore, both operating systems and

VMMs are designed to interact with the traditional serialized hardware interface

exported by I/O devices. Operating system performance is bottlenecked by this in-

terface when the multithreaded higher levels of the OS must wait for single-threaded

device-management operations to complete. This serialized interface has more far-

reaching effects on VMM design, and consequently VMMs experience even heavier

efficiency penalties relative to native-OS performance. These penalties stem primar-

ily from the separation of device management (in one privileged OS instance) from

server computation (in traditional, untrusted OS instances) and the software virtual-

ization layers needed between them. Combined, all of these inefficiencies significantly

degrade OS and VMM performance and will prevent future servers from scaling with

contemporary I/O capabilities.

2.2 Existing OS Support for Concurrent Network Servers

Given the continuing trend toward commodity chip multiprocessor hardware, the

trend away from vast improvements in uniprocessor performance, and the ongoing

improvements in Ethernet link throughput, operating system architects must con-

sider efficient methods to close the I/O performance gap. The organization of the

operating system’s network stack is particularly important. An operating system’s


network stack implements protocol processing (typically TCP/IP or UDP). TCP pro-

cessing is the only operation being performed in the microbenchmark examined in

Figure 1.2, but there remains a clear performance gap imposed by the overhead of pro-

tocol processing. To close that gap, multiprocessor operating systems must efficiently

orchestrate concurrent protocol processing.

There exist two principal strategies for parallelizing the operating system’s net-

work stack, both of which derive from research in the mid-1990s that was conducted

using large-scale SGI Challenge shared-memory multiprocessors. These strategies

differ according to their unit of concurrency. Though current OS implementations

are derived from one or the other of these two strategies, no consensus exists among

developers regarding the most appropriate organization for emerging processing and

I/O hardware.

Nahum et al. first examined a parallelization scheme that attempted to treat

messages (usually packets) as the fundamental unit of concurrency [41]. In its most

extreme implementation, this message-parallel (or MsgP) strategy attempts to pro-

cess each packet in the system in parallel using separate processors. Because a server

has a constant stream of packets, the message-parallel approach maximizes the the-

oretically achievable concurrency. Though this message-oriented organization ideally

scales limitlessly given the abundance of in-flight packets available in a typical server,

Nahum et al. found that repeated synchronization for connection state shared among

packets belonging to the same connection (such as reorder queues) severely limited


scalability [41]. However, that study found that the scalability and performance of

the message-parallel organization were highly dependent on the synchronization char-

acteristics of the system.

The connection-parallel (or ConnP) strategy treats connections as the fundamen-

tal unit of concurrency. When the operating system receives a message for transmis-

sion or has received a message from the network, a ConnP organization first associates

the message with its connection. The OS then uses a synchronization mechanism to

move the packet to a thread responsible for processing its connection. Hence, ConnP

organizations avoid the MsgP inefficiencies associated with repeated synchronization

to shared connection state. However, ConnP organizations also limit concurrency

by enlarging the granularity of parallelism from packets to connections. In its most

extreme form, a ConnP organization has as many threads as there are connections.
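The following sketch illustrates the dispatch step that a connection-parallel organization performs: the connection's 4-tuple is hashed to select a worker, so every packet of a given connection is handled by the same thread and per-connection TCP state needs no further locking. The types and the worker_enqueue helper are hypothetical placeholders rather than the interface of any particular operating system.

#include <stdint.h>

#define NWORKERS 4                       /* e.g., one worker thread per core */

struct pkt {
    uint32_t src_ip, dst_ip;             /* connection 4-tuple */
    uint16_t src_port, dst_port;
    /* ... headers and payload ... */
};

/* Hypothetical per-worker queue with its own lock (not shown). */
extern void worker_enqueue(int worker, struct pkt *p);

/* Mix the 4-tuple into a small hash; a real stack would use a stronger hash. */
static unsigned conn_hash(const struct pkt *p)
{
    uint32_t h = p->src_ip ^ p->dst_ip;

    h ^= ((uint32_t)p->src_port << 16) | p->dst_port;
    h ^= h >> 16;
    return h;
}

/* Called once per inbound or outbound packet: every packet of a given
 * connection hashes to the same worker, so per-connection TCP state is
 * only ever touched by that worker's thread. */
void connp_dispatch(struct pkt *p)
{
    worker_enqueue((int)(conn_hash(p) % NWORKERS), p);
}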

After Nahum et al. examined the MsgP organization, Yates et al. conducted a

similar study using the SGI Challenge architecture. In this study, Yates et al. ex-

amined connection-oriented parallelization strategies that treat connections as the

fundamental unit of concurrency [62]. This study persistently mapped network stack

operations for a specific connection to a specific worker thread in the OS. This strat-

egy eliminates any shared connection state among threads and thus eliminates the

synchronization overhead of message-oriented parallelizations. Consequently, this or-

ganization yielded excellent scalability as threads were added.


[Figure: Uniproc, Grouped ConnP, ConnP, In-Order MsgP, and MsgP arranged along a continuum trading efficiency for concurrency.]

Figure 2.2: The efficiency/parallelism continuum of OS network-stack parallelization strategies.

Both of these prior works ran on the mid-90s era SGI Challenge multiprocessor, utilized the user-space x-kernel, and used a simulated network device. However,

modern servers feature processors with very different synchronization costs relative

to processing, utilize operating systems that bear little resemblance to the x-kernel,

and incur the real-world overhead associated with interrupts and device management.

These works ultimately concluded that the synchronization and packet-ordering

overhead associated with fine-grained packet-level processing could severely damage

performance, and that a connection-oriented network stack yielded better efficiency

and performance. However, ten years later there is little if any consensus regarding

the “correct” organization for modern network servers. FreeBSD and Linux both

utilize a variant of a message-parallel network stack organization, whereas Solaris 10

and DragonflyBSD both feature connection-parallel organizations.

Hence, though prior research suggested three points of consideration for net-

work stack parallelization architectures (serial, ConnP, and MsgP), modern practical

variants represent several additional points of interest that make different efficiency

and performance tradeoffs. Consequently, there exist several additional points along


a continuum of both concurrency and efficiency. Figure 2.2 depicts this concur-

rency/efficiency continuum. Whereas uniprocessor organizations incur no synchro-

nization or thread-scheduling overhead and hence are the most efficient, they are also

the least concurrent. Conversely, purely MsgP organizations exploit the highest level

of concurrency but experience overhead that reduces their efficiency. ConnP organi-

zations attempt to trade some of the MsgP concurrency for increased efficiency, and

better overall performance.

Real-world MsgP and ConnP implementations also compromise concurrency for

efficiency, though some of these compromises are motivated by pragmatism. Whereas

theoretical MsgP organizations attempt limitless message-level concurrency, real-

world implementations such as Linux and FreeBSD process messages coming into

the network stack from any given source in-order. In effect, packets from a given

hardware interface or from a given application thread will be processed in-order.

However, these in-order MsgP stacks utilize fine-grained synchronization to enable

message parallelism, particularly between received and transmitted packets. Simi-

larly, realizable ConnP organizations do not have limitless connection parallelism;

instead, ConnP organizations such as Solaris 10 and DragonflyBSD associate each

connection with a group, and then process that group in parallel.
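By contrast, an in-order MsgP stack of the kind described above lets any available thread carry a packet through the stack but protects shared per-connection state with fine-grained locks. The sketch below illustrates that pattern under stated assumptions: conn_lookup and tcp_input_locked are hypothetical placeholders, and a user-level mutex stands in for the kernel's locking primitives.

#include <pthread.h>

struct pkt;                              /* opaque packet, as in the previous sketch */

struct tcpcb {                           /* per-connection protocol control block */
    pthread_mutex_t lock;
    /* ... sequence numbers, reorder queue, socket pointer ... */
};

/* Hypothetical helpers: find the connection for a packet, and run TCP
 * input processing with that connection's lock already held. */
extern struct tcpcb *conn_lookup(const struct pkt *p);
extern void tcp_input_locked(struct tcpcb *tp, struct pkt *p);

/* Any thread may carry any packet, but it must take the connection's
 * lock first; two threads handling packets of the same connection
 * therefore serialize here. */
void msgp_input(struct pkt *p)
{
    struct tcpcb *tp = conn_lookup(p);

    pthread_mutex_lock(&tp->lock);
    tcp_input_locked(tp, p);
    pthread_mutex_unlock(&tp->lock);
}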

As Figure 1.2 shows, a massive performance gap exists between achievable through-

put of modern in-order MsgP organizations and the link capacity of modern 10 Giga-

bit interfaces. Furthermore, the figure shows that current multiprocessor operating


systems are not effective at reducing this gap, but that parallel network interfaces

achieve higher throughput than a single interface. This performance gap and the lack

of existing solutions that close it motivate the research in this dissertation.

Thus, prior research has established that there are at least two methods of paral-

lelizing an operating system network stack, which differ according to their unit

of concurrency. However, prior evaluations have been conducted in the context of an

experimental user-space operating system that uses a simulated network device, running on hardware that had very different synchronization overhead char-

acteristics from modern hardware. So while previous research established that there

were different approaches to parallelizing an OS network stack, that research did not

fully evaluate those approaches on practical hardware in a practical software environ-

ment. In Chapter 3, this dissertation reaches beyond that prior work by examining

the way the network stack’s unit of concurrency affects efficiency and performance

in a real operating system with a real network device, including the associated de-

vice and scheduling overhead. This research also breaks ground by examining and

comparing how the means for synchronization (locks versus threads) within a given

organization can affect performance and efficiency, thus rounding out an examina-

tion of an entire continuum of network stack parallelization strategies, whereas prior

research examined only some of the points on that continuum.


                                   I/O Access
                                   Private                           Shared
  Hypervisor / Software Manager    System/360 and /370               System/360 and /370 Networking [19, 36],
                                   Disks and Terminals [37, 43]      Xen 1.0 [7], VMware ESX [56]
  Operating System                 POWER4 LPAR [24]                  POWER5 LPAR [4], Xen 2.0+ [16],
                                                                     VMware Workstation [53]

Table 2.1: I/O virtualization methods.

2.3 Existing VMM Support for Concurrent Network Servers

Beyond OS multiprocessing, virtualization is another method of achieving

server parallelism. Just as parallelized operating systems can exploit connection-level

parallelism, parallel virtual machines exploit a coarser-grained connection parallelism

by managing separate connections inside separate virtual machines. There is a sig-

nificant amount of existing research in the field with respect to virtualization tech-

niques, much of which predates modern operating system research. With respect

to server I/O, past virtualized architectures have exploited private device architec-

tures (in which a device is assigned to just one virtual machine) and shared device

architectures (in which I/O resources for one device are shared among many virtual

machines). These private and shared I/O architectures have been realized using ei-

ther hardware-based or software-based techniques. All of these architectures require

that an I/O device cannot be used by a VM to gain access to another VM’s resources,

and prior research has explored these isolation issues as well.

The first widely available virtualized system was IBM’s System/370, which was

first deployed almost 40 years ago [43]. Though demand for server consolidation has

inspired new research in virtualization technology for commodity systems, contem-

porary I/O virtualization architectures bear a strong resemblance to IBM’s original

concepts, particularly with respect to network I/O virtualization. Over time, this

I/O virtualization architecture has led to significant software inefficiencies, ultimately

manifesting themselves in large performance degradations such as those depicted in

Figure 1.3.


There are two approaches to I/O virtualization. Private I/O virtualization archi-

tectures statically partition a machine’s physical devices, such as disk and network

controllers, among the system’s virtual machines. In a Private I/O environment, only

one virtual machine has access to a particular device. Shared I/O virtualization archi-

tectures enable multiple virtualized operating systems to access a particular device.

Existing Shared I/O systems use software interfaces to multiplex I/O requests from

different virtual machines onto a single device. Private I/O architectures have the

benefit of near-native performance, but they require each virtual machine in a system

to have its own private set of network, disk, and terminal devices. Because this costly

requirement is impractical on both commodity servers and large-scale servers capable

of running hundreds of virtual machines, current-generation VMMs employ Shared

I/O architectures.

2.3.1 Private I/O

IBM’s System/360 was the first widely available virtualization solution [43]. The

System/370 was an extended version of this same architecture and featured hardware-

assisted enhancements for processor and memory virtualization, but it supported

I/O using the same mechanisms [17, 37, 49]. The first I/O architectures developed for

System/360 and System/370 did not permit shared access to physical I/O resources.

Instead, a particular VM instance had private access to a specific resource, such as a

terminal. To permit many users to access a single costly disk, the System/360 and

System/370 architecture extended the idea of private device access by sub-dividing

contiguous regions on a disk into logically separate, virtual “mini-disks” [43]. Though

multiple virtual machines could access the same physical disk via the mini-disk ab-

straction, these VMs did not concurrently share access to the same mini-disk region,

and hence mini-disks still represented logical private I/O access. System/360 and

System/370 required that I/O operations (triggered by the start-io instruction) be


trapped and interpreted by the system hypervisor. The hypervisor ensured that a

given virtual machine had permission to access a specific device, and that the given

VM owned the physical memory locations being read from or written to by the pend-

ing I/O command. The hypervisor would then actually restart the I/O operation,

returning control to the virtual machine only after the operation completed. Hence,

the System/360 and System/370 hypervisor managed I/O resources.

More recent virtualization systems have also relied on private device access, such as

IBM’s first release of the logical partitioning (LPAR) architecture featuring POWER4

processors [24]. The POWER4 architecture isolated devices at the PCI-slot level and

assigned them to a particular VM instance for management. Each VM required a

physically distinct disk controller for disk access and a physically distinct network

interface for network access. Unlike the System/360 and System/370 architecture,

the POWER4’s I/O devices accessed host memory asynchronously via DMA using

OS-provided DMA descriptors. Since a buggy or malicious guest OS could provide

DMA descriptors pointing to memory locations for which the given VM has no access

permissions, the POWER4 employs an IOMMU [9]. The IOMMU validates all PCI

operations per-slot using a set of hypervisor-maintained permissions. Hence, the

POWER4’s hypervisor can set up the IOMMU at device-initialization time, but I/O

resources can be directly managed at runtime by the guest operating systems.
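Conceptually, the check such an IOMMU performs on every device access can be modeled as a lookup in a hypervisor-maintained I/O page table, as in the following sketch. The structure names and table format are illustrative assumptions, not those of the POWER4 hardware.

#include <stdbool.h>
#include <stdint.h>

#define IO_PAGE_SHIFT 12                 /* 4 KB I/O pages, for illustration */

struct io_pte {
    uint64_t machine_page;               /* translation target */
    bool     valid;
    bool     writable;
};

struct io_table {                        /* one table per device or PCI slot,
                                            maintained only by the hypervisor */
    struct io_pte *entries;
    uint64_t nentries;
};

/* Model of the per-access check: translate a device-issued I/O address,
 * permitting the DMA only if a valid mapping with sufficient permission
 * exists.  Returns true and fills *machine_addr on success. */
bool iommu_translate(const struct io_table *tbl, uint64_t io_addr,
                     bool is_write, uint64_t *machine_addr)
{
    uint64_t idx = io_addr >> IO_PAGE_SHIFT;
    uint64_t page_off = io_addr & ((1ULL << IO_PAGE_SHIFT) - 1);

    if (idx >= tbl->nentries || !tbl->entries[idx].valid)
        return false;                    /* no mapping: the access is blocked */
    if (is_write && !tbl->entries[idx].writable)
        return false;                    /* write through a read-only mapping */

    *machine_addr = (tbl->entries[idx].machine_page << IO_PAGE_SHIFT) | page_off;
    return true;
}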

2.3.2 Software-Shared I/O

Requiring private access to I/O devices imposes significant hardware costs and

scalability limitations, since each VM must have its own private hardware and device

slots. The first shared-I/O virtualization solutions were part of the development of

networking support for System/360 and System/370 between physically separated

virtual machines [19, 36]. This networking architecture supported shared access to

network I/O resources by use of a virtualized spool-file interface that was serviced


by a special-purpose virtual machine, or I/O domain, dedicated to networking. The

various general-purpose VMs in a machine could read from or write to virtualized

spool files. The system hypervisor would interpret these reads and writes based on

whether or not the spool locations were on a physically remote machine; if the data

was on a remote machine, the hypervisor would invoke the special-purpose networking

VM. This networking VM would then use its physical network interfaces to connect to

a remote machine. The remote machine used this same virtualized spool architecture

and dedicated networking VM to service requests. The networking I/O domain was

trusted to not violate memory protection rules, so on System/370 architectures that

supported “Preferred-Machine” execution, the I/O domain could be granted direct

access to the network interfaces and would not require the hypervisor to manage

network I/O [17].

This software architecture for sharing devices through virtualized interfaces is

logically identical to most virtualization solutions today. Xen, VMware, and the

POWER5 virtualization architectures all share access to devices through virtualized

software interfaces and rely on a dedicated software entity to actually perform physical

device management [4, 7, 53]. Subsequent releases of Xen and VMware have moved

device management either into the hypervisor and out of an I/O domain (as is the

case with VMware ESX [56]) or into an I/O domain and out of the hypervisor (as is

the case with Xen versions 2.0 and higher [16]). Furthermore, different architectures

use different interfaces for implementing shared I/O access. For example, the Denali

isolation kernel provides a high-level interface that operates on packets [58]. The Xen

VMM provides an interface that mimics that of a real network interface card but

abstracts away many of the register-level management details [16]. VMware can sup-

port either an emulated register-level interface that implements the precise semantics

of a hardware NIC, or it can support a higher-level interface similar to Xen’s [53, 56].


[Figure: a guest domain's frontend driver exchanges data/control with a backend driver and multiplexing layer in the driver domain, which uses a native device driver to reach the I/O device(s); the virtual machine monitor delivers interrupts from the hardware and virtual interrupts between domains.]

Figure 2.3: A contemporary software-based, shared-I/O virtualization architecture.

Regardless of the interface, however, the overall organization is fundamentally quite

similar.

Figure 2.3 depicts the organization of a typical modern Shared I/O architecture,

which is heavily influenced by IBM’s original software-based Shared I/O architecture

for sharing network resources. These modern architectures grant direct hardware

access to only a single virtual machine instance, referred to as the “driver domain” in

the figure. Each consolidated server instance exists in an operating system running

inside an unprivileged “guest domain”. The driver domain is a privileged operating

system instance (for example, running Linux) whose sole responsibility is to manage

physical hardware in the machine and present a virtualized software interface to the

guest domains. This interface is exported by the driver domain’s backend driver,

as depicted in the figure. Guest domains access this virtualized interface and issue

I/O requests using their own frontend drivers. Upon reception of I/O requests in

the backend driver, the driver domain uses a separate multiplexing module inside the

operating system (such as a software Ethernet bridge in the case of network traffic)

to map requests to physical device drivers. The driver domain then uses native device

drivers to access hardware, thus carrying out the various I/O operations requested

by guest domains.
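
As a concrete illustration of this split-driver organization, the following sketch shows a simplified transmit ring shared between a guest's frontend driver and the driver domain's backend driver. The structure layout, field names, and notification hook are hypothetical simplifications for illustration only; they do not reproduce the actual Xen netfront/netback interface.

/*
 * Illustrative sketch only: a simplified frontend/backend transmit ring in the
 * style of the split-driver model described above.  The layout and the
 * notification hook are hypothetical, not the real Xen ring ABI.
 */
#include <stdint.h>

#define RING_SIZE 256

struct tx_request {
    uint64_t guest_frame;   /* guest page holding the packet data        */
    uint16_t offset;        /* offset of the packet within that page     */
    uint16_t length;        /* packet length in bytes                    */
};

struct tx_ring {
    struct tx_request req[RING_SIZE];
    volatile uint32_t prod; /* advanced by the frontend (guest domain)   */
    volatile uint32_t cons; /* advanced by the backend (driver domain)   */
};

/* Frontend side: post one packet and notify the backend. */
static int frontend_post_tx(struct tx_ring *ring, uint64_t frame,
                            uint16_t off, uint16_t len,
                            void (*notify_backend)(void))
{
    if (ring->prod - ring->cons == RING_SIZE)
        return -1;                        /* ring full; caller retries later */

    struct tx_request *r = &ring->req[ring->prod % RING_SIZE];
    r->guest_frame = frame;
    r->offset      = off;
    r->length      = len;

    __sync_synchronize();                 /* publish the request before prod */
    ring->prod++;
    notify_backend();                     /* virtual-interrupt style "kick"  */
    return 0;
}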

This I/O virtualization architecture addresses the many requirements for imple-

menting Shared I/O in a virtualized environment featuring untrusted virtualized

servers. First, the architecture provides a method to multiplex I/O requests from

various guest operating systems onto a single commodity device, which exports just

one management interface to software. This software-only architecture is practical

insofar as it supports a large class of existing commodity devices. Second, the architec-

ture provides a centralized, trusted interface with which to safely translate virtualized

I/O requests originating from an untrusted guest into trusted requests operating on

physical hardware and memory. Third, this architecture provides inter-VM message

notification using a virtualized interrupt system. This messaging system is imple-

mented by the VMM and is used by guest and driver domains to notify each other of

pending requests and event completion.

However, forcing all I/O operations to be forwarded through the driver domain

incurs significant overhead that ultimately reduces performance. The magnitude of

the performance loss for network I/O under the Xen VMM is depicted in Figure 1.3,

and Sugerman et al. have reported similar results using the VMware VMM [53].

These performance losses are attributable to the inefficiency of moving I/O requests

into and out of the driver domain and multiplexing those requests once inside the

driver domain.

2.3.3 Hardware-Shared I/O

Direct I/O access (rather than indirect access through management software)

eliminates the overhead of moving all I/O traffic through a software management

entity for multiplexing and memory protection purposes. In addition to supporting

private management of I/O devices by just one software entity at a time (either the

OS or hypervisor), the System/360 and System/370 fully implemented the direct ac-

cess storage device (DASD) architecture. The DASD architecture enabled concurrent

programs, operating systems, and the hypervisor to access disks directly and simul-

taneously [12]. DASD-capable devices had several distinct, separately addressable

channels. Software subroutines called channel programs performed programmed I/O

on a channel to carry out a disk request. Disk-access commands in channel programs

executed synchronously. A significant benefit of multi-channel DASD hardware was

that it permitted one channel program to access the disk while another performed data

comparisons (in local memory) to determine if further record accesses were required.

This hardware support for concurrency significantly improved I/O throughput.

On a virtualized system, the hypervisor would trap upon execution of the privi-

leged start-io instruction that was meant to begin a channel program. The hyper-

visor would then inspect all of the addresses to be used by the device in the pend-

ing channel program, possibly substituting machine-physical addresses for machine-

virtual addresses. The hypervisor would also verify the ownership of each address on

the disk to be accessed. After ensuring the validity of each address in the programmed-

I/O channel subroutine, the hypervisor would execute the modified subroutine [17].

This trap-modify-execute interpretive execution model enabled the hypervisor to

check and ensure that no virtual machine could read or write from another VM’s

physical memory and that no virtual machine could access another VM’s disk area.

The synchronous nature of the interface afforded the hypervisor simplicity with

respect to memory protection enforcement; it was sufficient to check the current

permission state in the hypervisor and base I/O operations on that state. Modern

operating systems, however, operate devices asynchronously to achieve concurrency.

Devices use DMA to access host memory at an indeterminate time after software

makes an I/O request. Rather than simply querying memory ownership state at

the instant software issues an I/O request, a modern virtualization solution that sup-

ports concurrent access by separate virtual machines to a single physical device would

require tracking of memory ownership state over time.

2.3.4 Protection Strategies for Direct-Access Private and Shared Virtualized I/O

Providing untrusted virtual machines with direct access to I/O resources (as in

Private I/O architectures or Hardware-Shared I/O architectures) can substantially

improve performance by avoiding software overheads associated with indirect access

(as in Software-Shared I/O architectures). However, VMs with direct I/O access

could maliciously or accidentally use a commodity I/O device to access another VM’s

memory via the device’s direct memory access (DMA) facilities. Furthermore, a fault

by the device could generate an invalid request to an unrequested region of memory,

possibly corrupting memory.

One approach to providing isolation among operating systems that have direct

I/O access is to leverage a hardware I/O memory management unit (IOMMU). The

IOMMU translates all DMA requests by a device according to the IOMMU’s page

table, which is managed by the VMM. Before making a DMA request, an untrusted

VM must first request that the VMM install a valid mapping in the IOMMU, so that

later the device’s transaction will proceed correctly with a current, valid translation.

Hence, the VMM can effectively use the IOMMU to enforce system-wide rules for

controlling what memory an I/O device (under the direction of an untrusted VM) may

access. By requesting immediate destruction of IOMMU translations, an untrusted

VM can furthermore protect itself against later, errant requests by a faulty I/O device.
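
To make this protection discipline concrete, the following C sketch walks through the map-before-DMA, unmap-after-DMA pattern just described. The routine names (vmm_iommu_map, vmm_iommu_unmap, nic_start_dma, nic_wait_dma_complete) are hypothetical placeholders rather than the interface of any particular VMM or device.

/* Hypothetical VMM and device interfaces, for illustration only. */
#include <stdint.h>
#include <stddef.h>

extern uint64_t vmm_iommu_map(void *guest_buf, size_t len, int writable);
extern void     vmm_iommu_unmap(uint64_t bus_addr, size_t len);
extern void     nic_start_dma(uint64_t bus_addr, size_t len);
extern void     nic_wait_dma_complete(void);

void dma_receive_into(void *buf, size_t len)
{
    /* 1. Ask the VMM to install an IOMMU translation covering this buffer. */
    uint64_t bus_addr = vmm_iommu_map(buf, len, /* writable = */ 1);

    /* 2. Hand the bus address to the device; the IOMMU confines its DMA to
     *    the pages the VMM mapped on this VM's behalf. */
    nic_start_dma(bus_addr, len);
    nic_wait_dma_complete();

    /* 3. Tear the translation down immediately so a later, errant DMA by a
     *    faulty device can no longer reach this memory. */
    vmm_iommu_unmap(bus_addr, len);
}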

Contemporary commodity virtualization solutions run on standard x86 hardware,

which typically lacks an IOMMU. Hence, these solutions forbid direct I/O access

and instead use software to implement both protection and sharing of I/O resources

among untrusted guest operating systems. Confining direct I/O accesses only within

the trusted VMM ensures that all DMA descriptors used by hardware have been

constructed by trusted software. Though commodity VMMs confine direct I/O within

privileged software, they provide indirect, shared access to their unprivileged VMs

using a variety of different software interfaces.

IBM’s high-availability virtualization platforms feature IOMMUs and can support

private direct I/O by untrusted guest operating systems, but they do not support

shared direct I/O. The POWER4 platform supports logical partitioning of hardware

resources among guest operating systems but does not permit concurrent sharing of

resources [24]. The POWER5 platform adds support for concurrent sharing using

software, effectively sacrificing direct I/O access to gain sharing [4]. This sharing

mechanism works similarly to commodity solutions, effectively confining direct I/O

access within what IBM refers to as a “Virtual I/O Server”. Unlike commodity

VMMs, however, this software-based interface is used solely to gain flexibility, not

safety. When a device is privately assigned to a single untrusted guest OS, the

POWER5 platform can still use its IOMMU to support safe, direct I/O access.

The high overhead of software-based shared I/O virtualization motivated recent

research toward hardware-based techniques that support simultaneous, direct-access

network I/O by untrusted guest operating systems. These efforts each take a different

approach to implementing isolation and protection. Liu et al. developed an Infiniband-

based prototype that supports direct access by applications running within untrusted

virtualized guest operating systems [35]. This work adopted the Infiniband model

of registration-based direct I/O memory protection, in which trusted software (the

VMM) must validate and register the application’s memory buffers before those

buffers can be used for network I/O. Registration is similar to programming an

IOMMU but has different overhead characteristics, because registrations require inter-

action with the device rather than modification of IOMMU page table entries. Unlike

an IOMMU, registration alone cannot provide any protection against malfunctioning

by the device, since the protection mechanism is partially enforced within the I/O

device.

Raj and Schwan also developed an Ethernet-based prototype device that sup-

ports shared, direct I/O access by untrusted guests [45]. Because of hardware-

implementation constraints, their prototype has limited addressability of main mem-

ory and thus requires all network data to be copied through VMM-managed bounce-

buffers. This strategy permits the VMM to validate each buffer but does not provide

any protection against faulty accesses by the device within its addressable memory

range.

AMD and Intel have recently proposed the addition of IOMMUs to their upcoming

architectures [3, 23]. Though they will be new to commodity architectures, IOMMUs

are established components in high-availability server architectures [9]. Ben-Yehuda

et al. recently explored the TCP-stream network performance of IBM’s state-of-the-

art IOMMU-based architectures using both non-virtualized, “bare-metal” Linux and

paravirtualized Linux running under Xen [10]. They reported that the state-of-the-

art IOMMU-management strategy can incur significant overhead. They hypothesized

that modifications to the single-use IOMMU-management strategy could avoid such

penalties.

The concurrent direct network access (CDNA) architecture described in Chapter 4

of this dissertation is an Ethernet-based prototype that supports concurrent, direct

network access by untrusted guest operating systems. Unlike the Ethernet-based

prototype developed by Raj and Schwan, the CDNA prototype requires neither extra

copying nor bounce buffers; instead, the CDNA architecture uses a novel software-

based memory protection mechanism. Like registration, this software-based strategy

offers no protection against faulty device behavior. The CDNA architecture does not

fundamentally require software-based DMA memory protection. Rather, CDNA can

be used with an IOMMU to implement DMA memory protection. This dissertation

explores such an approach to DMA memory protection further in Chapter 5. Like all

direct, shared-I/O architectures, CDNA fundamentally requires that the device be

able to access several guests’ memory simultaneously. Consequently, the device could

still use the wrong VM’s data for a particular I/O transaction, and hence it is not

possible to guard against faulty behavior by the device even when using an IOMMU.

Overall, there has been quite a large volume of research regarding support for I/O

virtualization, some of which dates back forty years. Given the continuing demand for

server consolidation, researchers continue to develop architectures for private I/O,

software-shared I/O, and hardware-shared I/O, as well as protection strategies that

allow direct access to a particular I/O device by an untrusted virtual machine.

This dissertation explores a novel approach that combines several of these techniques

to create a new, hybrid architecture. This research uses a combination of software and

hardware to facilitate shared I/O with much greater efficiency and performance than

past approaches. Further, this research explores new software techniques for managing

hardware designed to enforce I/O memory protection policies, also achieving higher

efficiency and performance.

2.4 Hardware Support for Concurrent Server I/O

In addition to software support for thread and VM concurrency, there has been sig-

nificant research in the field regarding concurrent-access I/O devices. These concurrent-

access I/O devices alone do not provide an architectural solution to the performance

and efficiency challenges with respect to server concurrency. However, they can be

used in concurrency-aware architectures to improve the efficiency of the software.

2.4.1 Hardware Support for Parallel Receive-side OS Processing

Proposals for parallel receive queues on NICs (such as receive-side scaling (RSS) [40])

are a beginning toward providing explicitly concurrent access to I/O devices by mul-

tiple threads. Such architectures maintain separate queues for received packets that

can be processed simultaneously by the operating system. The NIC classifies packets

into a specific queue according to a hashing function that is usually based on the IP

address and port information in each IP packet header. Because this IP address and

port information is unique per connection in traditional protocols (such as TCP and

UDP), the NIC can distribute incoming packets into specific queues according to con-

nection. This distribution ensures that packets for the same connection are not placed

in different queues and thus later processed out-of-order by the operating system.
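
The classification step can be sketched as follows. Real RSS implementations use a Toeplitz hash keyed by a configurable secret; the simple mixing function below is an assumed stand-in that only demonstrates how hashing the connection 4-tuple yields a stable, per-connection queue assignment.

#include <stdint.h>

#define NUM_RX_QUEUES 4        /* assumption: one receive queue per core */

/* Simple integer mixer standing in for the Toeplitz hash used by real RSS. */
static uint32_t mix32(uint32_t h)
{
    h ^= h >> 16; h *= 0x7feb352d;
    h ^= h >> 15; h *= 0x846ca68b;
    h ^= h >> 16;
    return h;
}

/* Every packet of a given TCP/UDP connection hashes to the same queue, so a
 * single connection's packets are never processed out of order across cores. */
unsigned classify_rx_queue(uint32_t src_ip, uint32_t dst_ip,
                           uint16_t src_port, uint16_t dst_port)
{
    uint32_t h = mix32(src_ip) ^ mix32(dst_ip) ^
                 mix32(((uint32_t)src_port << 16) | dst_port);
    return h % NUM_RX_QUEUES;
}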

While this approach should efficiently improve concurrency for single, non-virtualized

operating systems in receive-dominated workloads, such proposals do not improve

transmit-side concurrency. Though parallel receive queues are a necessary component

to improving the efficiency of receive-side network stack concurrency, this driver/NIC

interface leaves the larger network stack design issues unresolved. These issues must

be confronted to prevent inefficiencies in the network stack from rendering any archi-

tectural improvements useless. Consequently, this dissertation examines the larger

network stack issues in detail.

Additionally, a restricted interface such as RSS that considers only receive con-

currency is not amenable to supporting concurrent direct hardware access by parallel

virtualized operating systems. A more flexible interface would be beneficial for ex-

tracting the most utility from a modified hardware architecture. At minimum, an

RSS-style NIC architecture would need to be modified to enable more flexible classi-

fication of incoming packets based on the virtual machine they belong to rather than

the connection they are associated with. Even so, such a modified device architec-

ture would be insufficient because it would still require all transmit operations to be

performed via traditional software sharing rather than direct hardware access.

2.4.2 User-level Network Interfaces

User-level network interfaces provide a more flexible hardware/software interface

that allows concurrent user-space applications to directly access a special-purpose

NIC [44, 52]. In effect, the NIC provides a separate hardware context to each request-

ing application instance. Hence, user-level NICs provide the functional equivalent of

implementing parallel transmit and receive queues on a single traditional NIC, which

could be used as a component toward building an interface that breaks the scalability

limitations of traditional NICs. However, user-level NIC architectures lack two key

features required for use in efficient, concurrent network servers.

First, user-level NICs do not provide context-private event notification. Instead,

applications written for user-level NICs typically poll the status of a private context

to determine if that context is ready to be serviced. While this is perfectly suitable

for high-performance message-passing applications in which the application may not

have any work to do until a new message arrives, a polling model is inappropriate for

general-purpose operating systems or virtual machine monitors in which many other

applications or devices may require service.

Second, user-level network interfaces require a single trusted software entity to im-

plement direct memory access (DMA) memory protection, which limits the scalability

of this approach. For unvirtualized environments, this entity is the operating system;

for virtualized environments, the entity is a single trusted “driver domain” OS in-

stance. Like all applications, user-level NIC applications manipulate virtual memory

addresses rather than physical addresses. Hence, the addresses provided by an ap-

plication to a particular hardware context on a user-level NIC are virtual addresses.

However, commodity architectures (such as x86) require I/O devices to use physical

addresses. To inform the NIC of the appropriate virtual-to-physical address trans-

lations, applications invoke the trusted managing software entity to perform an I/O

interaction with the NIC (typically referred to as memory registration) that updates

the NIC’s current translations. Liu et al. present an implementation of the Infiniband

user-level NIC architecture with support for the Xen VMM and show that memory

registration costs can significantly degrade performance [35]. Unlike user-level NIC

applications that typically only invoke memory registration twice (once during ini-

tialization and again during application termination), operating systems frequently

create and destroy virtual-to-physical mappings at runtime, especially when utilizing

zero-copy I/O. Hence, the costly memory registration model is inappropriate for op-

erating systems running on a VMM. The concurrent direct network access

architecture presented in this dissertation avoids these registration costs by using a

lightweight, primarily software-based protection strategy instead. This dissertation

also explores other IOMMU-based strategies for efficient memory protection that are

attractive alternatives to costly on-device memory registration.
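
For reference, the OpenFabrics verbs interface is one concrete instance of the registration model discussed above. The sketch below shows only the per-buffer register/deregister pattern whose cost the text refers to; queue-pair setup, posting of the actual send, and error handling are omitted, and the protection domain pd is assumed to have been allocated already.

#include <stddef.h>
#include <infiniband/verbs.h>

int send_with_registration(struct ibv_pd *pd, void *buf, size_t len)
{
    /* Register (and pin) the buffer with the NIC.  This device interaction is
     * the costly step that an OS doing zero-copy I/O would repeat every time
     * it created a short-lived mapping. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (mr == NULL)
        return -1;

    /* ... post a send work request that references mr->lkey ... */

    /* Deregistration is another device interaction, paid per mapping. */
    return ibv_dereg_mr(mr);
}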

Thus, prior research has examined hardware support for both OS and VMM con-

currency, but this hardware alone is not sufficient to address the problems of each.

Furthermore, these prior endeavors each either only solved one aspect of concurrency

(as in the RSS model which addresses receive parallelism, but not transmit paral-

lelism) or they lacked important components necessary for high-performance servers

(such as high-performance, low-overhead DMA memory protection so as to facilitate

zero-copy I/O). This dissertation uses both hardware and software to create a compre-

hensive solution or, when applicable, to determine and characterize the components

that are still necessary for one. Moreover, this research represents a

fundamentally different approach than these past efforts by using a hardware/software

synthesis to achieve a comprehensive architectural analysis and solution rather than

using a primarily hardware-based approach that examines only part of the problems

and their overhead.

2.5 Summary

Server technology has become increasingly important for academic and commer-

cial applications, and the Internet era has brought explosive growth in demand from

home users. The demand for efficient, high-performance server technology has mo-

tivated extensive research over the past several decades that touches on issues related

to the contributions of this dissertation. Though there is extensive research in this

area, there are new challenges with regard to supporting new levels of thread- and

virtual-machine-level concurrency. This chapter has described the efficiency and per-

formance challenges observed in modern systems and has outlined the research that

is most closely related to solving these problems. Previous research has explored

some variations of OS architectures that support thread-parallel network I/O pro-

cessing, but the research in this dissertation reaches beyond that by exploring a fuller

spectrum of OS architectures and by examining them on real, rather than simulated,

hardware and software. Furthermore, previous research has explored the performance,

efficiency, and protection issues related to different I/O virtualization architectures,

but the research in this dissertation presents a novel architecture

that brings with it different performance, efficiency, and protection characteristics.

Finally, a key aspect of the research in this dissertation is that it uses hardware

to improve the efficiency of software. There have been several prior efforts to use

hardware to support OS and VMM concurrency, but these efforts have been almost

exclusively hardware-centric and did not address issues relevant to real-world appli-

cation performance, such as support for zero-copy I/O in modern server applications.

The architecture presented in this dissertation uses hardware in synthesis with soft-

ware to comprehensively address efficiency and performance of real-world applications

running on modern thread- and virtual-machine-concurrent network servers.

Chapter 3

Parallelization Strategies for OS Network Stacks

As established in the previous chapter, network server architectures will feature

chip multiprocessors in the future. Furthermore, the slowdown in uniprocessor per-

formance improvements means that network servers will have to leverage parallel

processors to meet the ever increasing demand for network services. A wide range of

parallel network stack organizations have been proposed and implemented. Among

the parallel network stack organizations, there exist two major categories: message-

based parallelism (MsgP) and connection-based parallelism (ConnP). These organi-

zations expose different levels of concurrency, in terms of the maximum available

parallelism within the network stack. They also achieve different levels of efficiency,

in terms of achieved network bandwidth per processing core, as they incur differing

cache, synchronization, and scheduling overheads.

The costs of synchronization and scheduling have changed dramatically in the

years since the parallel network stack organizations introduced in Chapter 2 were

originally proposed and studied. Though processors have become much faster, the

gap between processor and memory performance has become much greater, increasing

the cost, in terms of lost execution cycles, of synchronization and scheduling. Fur-

thermore, technology trends and architectural complexity are preventing uniprocessor

performance growth from keeping pace with Ethernet bandwidth increases. Both of

these factors motivate a fresh examination of parallel network stack architectures on

modern parallel hardware.

Today, network servers are frequently faced with tens of thousands of simultaneous

connections. The locking, cache, and scheduling overheads of parallel network stack

organizations vary depending on the number of active connections in the system.

However, network performance evaluations generally focus on the bandwidth over a

small number of connections, often just one. In contrast, this study evaluates the

different network stack organizations under widely varying connection loads.

This study has four main contributions. First, this study presents a fair com-

parison of uniprocessor, message-based parallel, and connection-based parallel net-

work stack organizations on modern multiprocessor hardware. Three competing net-

work stack organizations are implemented within the FreeBSD 7 operating system:

message-based parallelism (MsgP), connection-based parallelism using threads for

synchronization (ConnP-T), and connection-based parallelism using locks for syn-

chronization (ConnP-L). The uniprocessor version of FreeBSD is efficient, but its

performance falls short of saturating the fastest available network interfaces. Utiliz-

ing 4 cores, the parallel stack organizations can outperform the uniprocessor stack,

but at reduced efficiency.

Second, this study compares the performance of the different network stack or-

ganizations when using a single 10 Gbps network interface versus multiple 1 Gbps

network interfaces. Unsurprisingly, a uniprocessor network stack can more efficiently

utilize a single 10 Gbps network interface, as multiple network interfaces generate

additional interrupt overheads. However, the interactions between the network stack

and the device serialize the parallel stack organizations when only a single network

interface is present in the system. The parallel network stack organizations benefit

from the device-level parallelism that is exposed by having multiple network inter-

faces, allowing a system with multiple 1 Gbps network interfaces to outperform a

system with a single 10 Gbps network interface. With multiple interfaces, the par-

allel organizations are able to process interrupts concurrently on multiple processors

and experience reduced lock contention at the device level.

Third, this study presents an analysis of the locking and scheduling overhead

incurred by the different parallel stack organizations. MsgP experiences significant

locking overhead, but is still able to outperform the uniprocessor for almost all connec-

tion loads. In contrast, ConnP-T has very low locking overhead but incurs significant

scheduling overhead, leading to reduced performance compared to even the unipro-

cessor kernel for all but the heaviest loads. ConnP-L mitigates the locking overhead of

MsgP, by grouping connections so that there is little global locking, and the scheduling

overhead of ConnP-T, by using the requesting thread for network processing rather

than forwarding the request to another thread.

Finally, this study analyzes the cache behavior of the different parallel stack orga-

nizations. Specifically, this study categorizes data sharing within the network stack

as either concurrent or serial. If a datum may be accessed simultaneously by two or

more threads, that datum is shared concurrently. If, however, a datum may only be

accessed by one thread at a time, but it may be accessed by different threads over

time, that datum is shared serially. CMP organizations with shared caches will likely

reduce the cache misses to concurrently shared data, but are unlikely to provide any

benefit for serially shared data. Unfortunately, this study shows that there is a sig-

nificant amount of serial sharing in the parallel network stack organizations, but very

little concurrent sharing.

The remainder of this chapter proceeds as follows. The next section further mo-

tivates the need for parallelized network stacks in current and future systems. Sec-

tion 3.2 describes the parallel network stack architectures that are evaluated in this

paper. Section 3.3 then describes the hardware and software used to evaluate each

organization. Sections 3.4 and 3.5 present evaluations of the organizations using one

10 Gbps interface and six 1 Gbps interfaces, respectively. Section 3.6 provides a dis-

cussion of these results. This chapter is based in part on my previously published

work [59].

3.1 Background

The most efficient network stacks in modern operating systems are designed for

uniprocessor systems. There are still concurrent threads in such operating systems,

but locking and scheduling overhead are minimized as only one thread can execute at

a time. For example, a lock operation can often be made atomic simply by masking

interrupts during the operation. Despite their efficiency, such network stacks are not

capable of saturating a modern 10 Gbps Ethernet link. In 2004, Hurwitz and Feng

found that, using Linux 2.4 and 2.5 uniprocessor kernels (with TCP segmentation

offloading), they were only able to achieve about 2.5 Gbps on a 2.4 GHz Intel Pentium

4 Xeon system [20].

Increasing processor performance has allowed uniprocessor network stacks to achieve

higher bandwidth, but they still are not close to saturating a 10 Gbps Ethernet link.

Table 3.1 shows the performance of FreeBSD 7 on a modern 2.2 GHz Opteron unipro-

cessor system. The first row shows the performance of the uniprocessor kernel, which

remains nearly constant around 4 Gbps as the number of connections in the system

is varied. While this is an improvement over the performance reported in 2004, it

is still less than one half of the link’s capacity. Though the use of jumbo frames

can improve these numbers, network servers connected to the Internet will continue

to use standard 1500-byte Ethernet frames for the foreseeable future in order to

interoperate with legacy hardware.

In the face of technology constraints and uniprocessor complexity, architects have

turned to chip multiprocessors to continue to provide additional processing perfor-

mance [14, 18, 25, 30, 31, 32, 33, 34, 42, 54]. The network stack within the operating

OS Type             Processors   24 conns   192 conns   16384 conns
Uniprocessor only   1            4177       4156        4037
SMP capable         1            3688       3796        3774
SMP capable         4            3328       3251        1821

Table 3.1: FreeBSD network bandwidth (Mbps) using a single processor and a 10 Gbps network interface.

system will have to be able to take advantage of such architectures in order to keep

up with increases in network bandwidth demand. However, parallelizing the network

stack inherently reduces its efficiency. A symmetric multiprocessing (SMP) kernel

must use a more expensive implementation of lock operations as there is now physi-

cal concurrency in the system. For a lock operation to be atomic, it must be ensured

that threads running on the other processors will not interfere with the read-modify-

write sequence required to acquire and release a lock. On x86 hardware, this is

accomplished by adding the lock prefix to lock acquisition instructions. The lock

prefix causes the instruction to be extremely expensive, as it serializes all instruction

execution on the processor and it locks the system bus to ensure that the proces-

sor can do an atomic read-modify-write with respect to the other processors in the

system. Scheduling is also potentially more expensive, as the operating system now

must schedule multiple threads across multiple physical processors. As the second

row in Table 3.1 shows, in FreeBSD 7, the overhead of making the kernel SMP capa-

ble results in a 7–12% reduction in efficiency. Note that this is still using just a single

physical processor.
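
The following minimal sketch, written with GCC's atomic builtins, shows the kind of lock acquisition an SMP-capable kernel must perform; on x86 the atomic exchange is an implicitly locked instruction, so it carries the same bus- and pipeline-serializing cost as a lock-prefixed read-modify-write. A uniprocessor kernel can instead make the same operation atomic simply by masking interrupts around it.

typedef volatile int spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    /* __sync_lock_test_and_set compiles to an xchg on x86, which the
     * processor treats as locked: the read-modify-write is made atomic with
     * respect to every other processor in the system. */
    while (__sync_lock_test_and_set(l, 1))
        while (*l)
            ;                   /* spin locally until the lock looks free */
}

static inline void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);     /* store 0 with release semantics */
}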

As the number of processors increases, lock contention becomes a major issue. The

third row of Table 3.1 shows the results of this effect. With the same SMP capable

kernel with 4 physical processors, not only does the efficiency further decrease, but the

absolute performance also decreases. Note that the problem gets dramatically worse

as the number of connections is increased. This is because with a larger number of

connections, each connection has much lower bandwidth, so less work is accomplished

for each lock acquisition.

These results strongly motivate a reexamination of network stack parallelization

strategies in the face of modern technology trends. It seems unlikely that uniproces-

sor performance will scale fast enough to keep up with increasing network bandwidth

demands, so the efficiency of uniprocessor network stacks can no longer be relied

upon to provide the necessary networking performance. Furthermore, the inefficien-

cies of modern SMP capable network stacks lead to a situation where small-scale

chip multiprocessors are only going to make the situation worse, as networking per-

formance actually gets worse, not better, using 4 processing cores. There have been

several proposals to use a single core of a multiprocessor to achieve the efficiencies

of a uniprocessor network stack [6, 11, 46, 47, 48]. However, this is not a solution,

either, as each core of a CMP is likely to provide less performance than a monolithic

uniprocessor. So, if a uniprocessor is insufficient, there is no reason to believe a single

core of a CMP will be able to do any better. Furthermore, dedicating multiple cores

for network processing reintroduces the need for synchronization. The remainder of this chapter

will examine the continuum of parallelization strategies depicted in Figure 2.2 and

analyze their behavior on small scale multiprocessor systems to better understand

this situation.

3.2 Parallel Network Stack Architectures

As was introduced in Chapter 2 and depicted in Figure 2.2, there are two primary

network stack parallelization strategies: message-based parallelism and connection-

based parallelism. Using message-based parallelism, any message (or packet) may be

processed simultaneously with respect to other messages. Hence, messages for a single

connection could be processed concurrently on different threads, potentially resulting

in improved performance. Connection-based parallelism is more coarse-grained; at the

beginning of network processing (either at the top or bottom of the network stack),

messages and packets are classified according to the connection with which they are

associated. All packets for a certain connection are then processed by a single thread

at any given time. However, each thread may be responsible for processing one or

more connections.

These parallelization strategies were studied in the mid-1990s, between the in-

troduction of 100 Mbps and 1 Gbps Ethernet. Despite those efforts, there is not a

solid consensus among modern operating system developers on how to design effi-

cient and scalable parallel network stacks. Major subsystems of FreeBSD and Linux,

including the network stack, have been redesigned in recent years to improve perfor-

mance on parallel hardware. Both operating systems now incorporate variations of

message-based parallelism within their network stacks. Conversely, Sun has recently

redesigned the Solaris operating system for their high-throughput computing micro-

processors and it now incorporates a variation of connection-based parallelism [55].

DragonflyBSD also uses connection-based parallelism within its network stack.

Each strategy was implemented within the FreeBSD 7 operating system to enable a

fair comparison of the trade-offs among the different strategies. This section provides

a more detailed explanation of how each parallelization strategy works.

3.2.1 Message-based Parallelism (MsgP)

Message-based parallel (MsgP) network stacks, such as FreeBSD, exploit paral-

lelism by allowing multiple threads to operate within the network stack simultane-

ously. Two types of threads may perform network processing: one or more application

threads and one or more inbound protocol threads. When an application thread makes

a system call, that calling thread context is “borrowed” to then enter the kernel and

carry out the requested service. So, for example, a read or write call on a socket

would loan the application thread to the operating system to perform networking

tasks. Multiple such application threads can be executing within the kernel at any

given time. The network interface’s driver executes on an inbound protocol thread

whenever the network interface card (NIC) interrupts the host, and it may transfer

packets between the NIC and host memory. After servicing the NIC, the inbound

protocol thread processes received packets “up” through the network stack.

Given that multiple threads can be active within the network stack, FreeBSD uti-

lizes fine-grained locking around shared kernel structures to ensure proper message

ordering and connection state consistency. As a thread attempts to send or receive

a message on a connection, it must acquire various locks when accessing shared con-

nection state, such as the global connection hash table lock (for looking up TCP

connections) and per-connection locks (for both socket state and TCP state). If a

thread is unable to obtain a lock, it is placed in the lock’s queue of waiting threads

and yields the processor, allowing another thread to execute. To prevent priority

inversion, priority propagation from the waiting threads to the thread holding the

lock is performed.
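
The following sketch outlines this locking discipline for an inbound TCP segment. The lock primitives and the lookup routine are illustrative placeholders rather than the actual FreeBSD interfaces, but the ordering, in which the global connection-table lock is acquired first and released once the per-connection lock is held, reflects the structure described above.

#include <stdint.h>

struct fourtuple  { uint32_t laddr, faddr; uint16_t lport, fport; };
struct conn_lock  { volatile int held; };
struct connection { struct conn_lock lock; /* socket and TCP state ... */ };

/* Placeholder primitives standing in for the kernel's mutexes and hash table. */
extern void lock(struct conn_lock *);
extern void unlock(struct conn_lock *);
extern struct conn_lock conn_hash_lock;                 /* global table lock */
extern struct connection *conn_hash_lookup(const struct fourtuple *);

void msgp_deliver_segment(const struct fourtuple *ft)
{
    lock(&conn_hash_lock);            /* global lock, contended by all cores */
    struct connection *c = conn_hash_lookup(ft);
    lock(&c->lock);                   /* per-connection state lock           */
    unlock(&conn_hash_lock);          /* drop the global lock early          */

    /* ... TCP processing on this connection; other threads may meanwhile
     * process messages for other connections, or non-conflicting portions of
     * this connection's state protected by other fine-grained locks ... */

    unlock(&c->lock);
}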

As is characteristic of message-based parallel network stacks, FreeBSD’s locking

organization thus allows concurrent processing of different messages on the same

connection, so long as the various threads are not accessing the same portion of the

connection state at the same time. For example, one thread may process TCP timeout

state based on the reception of a new ACK, while at the same time another thread

is copying data into that connection’s socket buffer for later transmission. However,

note that the inbound thread configuration described is not the FreeBSD 7 default.

Rather, the operating system’s network stack has been configured to use the optional

direct-dispatch mechanism. Normally, dedicated parallel driver threads service each

NIC and then hand off inbound packets to a single protocol thread via a shared

queue. That protocol thread then processes the received packets “up” through the

network stack. The default configuration thus limits the performance of MsgP and is

hence not considered in this paper. The thread-per-NIC model also differs from the

message-parallel organization described by Nahum et al. [41], which used many more

worker threads than interfaces. Such an organization requires a sophisticated scheme

to ensure these worker threads do not reorder inbound packets that were received in

order, and hence that organization is also not considered.

3.2.2 Connection-based Parallelism (ConnP)

To compare connection parallelism in the same framework as message parallelism,

FreeBSD 7 was modified to support two variants of connection-based parallelism

(ConnP) that differ in how they serialize TCP/IP processing within a connection. The

first variant assigns each connection to one of a small number of protocol processing

threads (ConnP-T). The second variant assigns each connection to one of a small

number of locks (ConnP-L).

Connection Parallelism Serialized by Threads (ConnP-T)

Connection-based parallelism using threads utilizes several kernel threads dedi-

cated to per-connection protocol processing. Each protocol thread is responsible for

processing a subset of the system’s connections. At each entry point into the TCP/IP

protocol stack, the requested operation is enqueued for service by a particular protocol

thread based on the connection that is being processed. Each connection is uniquely

mapped to a single protocol thread for the lifetime of that connection. Later, the pro-

tocol threads dequeue requests and process them appropriately. No per-connection

state locking is required within the TCP/IP protocol stack, because the state of each

connection is only manipulated by a single protocol thread.

The kernel protocol threads are simply worker threads that are bound to a specific

CPU. They dequeue requests and perform the appropriate processing; the messaging

system between the threads requesting service and kernel protocol threads maintains

strict FIFO ordering. Within each protocol thread, several data structures that are

normally system-wide (such as the TCP connection hash table) are replicated so

that they are thread-private. Kernel protocol threads provide both synchronous and

asynchronous interfaces to threads requesting service.

If a requesting thread requires a return value or if the requester must maintain

synchronous semantics (that is, the requester must wait until the kernel thread com-

pletes the desired request), that requester yields the processor and waits for the kernel

thread to complete the requested work. Once the kernel protocol thread completes

the desired function, the kernel thread sends the return value back to the requester

and signals the waiting thread. This is the common case for application threads,

which require a return value to determine if the network request succeeded. However,

interrupt threads (such as those that service the network interface card and pass “up”

packets received on the network) do not require synchronous semantics. In this case,

the interrupt context classifies each packet according to its connection and enqueues

the packet for the appropriate kernel protocol thread. The connection-based paral-

lel stack uniquely maps a packet or socket request to a specific protocol thread by

hashing the 4-tuple of remote IP address, remote port number, local IP address, and

local port number. This implementation of connection-based parallelism is like that

of DragonflyBSD.
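
The dispatch step can be summarized by the sketch below, in which the connection 4-tuple is hashed to select the owning protocol thread and the request is placed on that thread's FIFO. The names and structures are hypothetical simplifications of the modified FreeBSD implementation.

#include <stdint.h>

#define NUM_PROTO_THREADS 4          /* one protocol thread pinned per core */

struct proto_request;                /* a classified packet or socket request */
extern void fifo_enqueue(unsigned thread_id, struct proto_request *req);

/* The mapping is a pure function of the 4-tuple, so a connection stays with
 * the same protocol thread for its lifetime. */
static unsigned conn_to_thread(uint32_t laddr, uint16_t lport,
                               uint32_t faddr, uint16_t fport)
{
    uint32_t h = laddr ^ faddr ^ (((uint32_t)lport << 16) | fport);
    h ^= h >> 16;
    return h % NUM_PROTO_THREADS;
}

void connp_t_dispatch(struct proto_request *req,
                      uint32_t laddr, uint16_t lport,
                      uint32_t faddr, uint16_t fport)
{
    /* No per-connection locks are needed: only the owning protocol thread
     * ever touches this connection's state after the handoff. */
    fifo_enqueue(conn_to_thread(laddr, lport, faddr, fport), req);
}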

Connection Parallelism Serialized by Locks (ConnP-L)

Just as in thread-serialized connection parallelism, connection-based parallelism

using locks is based upon the principle of isolating connections into groups that are

each bound to a single entity during execution. As the name implies, however, the

binding entity is not a thread; instead, each group is isolated by a mutual exclusion

lock.

When an application thread enters the kernel to obtain service from the network

stack, the network system call maps the connection being serviced to a particular

group using a mechanism identical to that employed by thread-serialized connection

parallelism. However, rather than building a message and passing it to that group’s

specific kernel protocol thread for service, the calling thread directly obtains the lock

for the group associated with the given connection. After that point, the calling

thread may access any of the group-private data structures, such as the group-private

connection hash table or group-private per-connection structures. Hence, these locks

serve to ensure that at most one thread may be accessing each group’s private con-

nection structures at a time. Upon completion of the system call in the network

stack, the calling thread releases the group lock, allowing another thread to obtain

that group’s lock if necessary. Threads accessing connections in different groups may

proceed concurrently through the network stack without obtaining any stack-specific

locks other than the group lock.
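
A corresponding sketch of the ConnP-L send path appears below; the group count, lock primitives, and data structures are again illustrative placeholders. The essential difference from ConnP-T is that the calling thread performs the protocol work itself while holding the group lock, rather than handing the request to a dedicated protocol thread.

#include <stdint.h>

#define NUM_GROUPS 128               /* matches the ConnP-L(128) configuration */

struct grp_mutex  { volatile int held; };
struct conn_group {
    struct grp_mutex lock;
    /* group-private connection hash table and per-connection state ... */
};

extern struct conn_group groups[NUM_GROUPS];
extern void grp_lock(struct grp_mutex *);
extern void grp_unlock(struct grp_mutex *);

/* Same 4-tuple hash as ConnP-T, so inbound packets and system calls for one
 * connection always map to the same group. */
static unsigned conn_to_group(uint32_t laddr, uint16_t lport,
                              uint32_t faddr, uint16_t fport)
{
    uint32_t h = laddr ^ faddr ^ (((uint32_t)lport << 16) | fport);
    h ^= h >> 16;
    return h % NUM_GROUPS;
}

void connp_l_send(uint32_t laddr, uint16_t lport,
                  uint32_t faddr, uint16_t fport /* , data to send ... */)
{
    struct conn_group *g = &groups[conn_to_group(laddr, lport, faddr, fport)];

    grp_lock(&g->lock);      /* the only stack-specific lock on this path */
    /* ... perform TCP/IP send processing using g's private structures ... */
    grp_unlock(&g->lock);
}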

Inbound packet processing is also analogous to connection-based parallelism using

threads. After receiving a packet, the inbound protocol thread classifies the packet

into a group. Unlike the thread-oriented connection-parallel case, the inbound thread

need not hand off the packet from the driver to the worker thread corresponding

to the packet’s connection group. Instead, the inbound thread directly obtains the

appropriate group lock for the packet and then processes the packet “up” the protocol

stack without any thread handoff. This control flow is similar to the message-parallel

stack, but the lock-serialized connection-parallel stack does not require any further

protocol locks after obtaining the connection group lock. As in the MsgP case, there

is one inbound protocol thread for each NIC, but the number of groups may far

exceed the number of threads. This implementation of connection-based parallelism

is similar to the implementation used in Solaris 10.

3.3 Methodology

To gain insights into the behavior and characteristics of the parallel network stack

architectures described in Section 3.2, these architectures were evaluated on a modern

chip multiprocessor. All stack architectures were implemented within the 2006-03-27

repository version of the FreeBSD 7 operating system to facilitate a fair comparison.

This section describes the benchmarking methodology and hardware platforms.

3.3.1 Evaluation Hardware

The parallel network stack organizations were evaluated using a 4-way SMP

Opteron system, using either a single 10 Gbps Ethernet interface or six 1 Gbps Ether-

net interfaces. The system consists of two dual-core 2.2 GHz Opteron 275 processors

and four 512 MB PC2700 DIMMs per processor (two per memory channel). Each of

the four processor cores has a private level-2 cache. The 10 Gbps NIC evaluation is

based on a Myricom 10 Gbps PCI-Express Ethernet interface. The six 1 Gbps NIC

evaluation is based on three dual-port Intel PRO/1000-MT Ethernet interfaces that

are spread across the motherboard’s PCI-X bus segments.

In both configurations, data is transferred between the 4-way Opteron’s Ethernet

interface(s) and one or more client systems. The 10 Gbps configuration uses one

client with an identical 10 Gbps interface as the system under test, whereas the six-

NIC configuration uses three client systems that each have two Gigabit Ethernet

interfaces. Each client is directly connected to the 4-way Opteron without the use of

a switch. For the 10 Gbps evaluation, the client system uses faster 2.6 GHz Opteron

285 processors and PC3200 memory, so that the client will never be a bottleneck in

any of the tests. For the six-NIC evaluation, each client was independently tested to

confirm that it can simultaneously sustain the theoretical peak bandwidth of its two

interfaces. Therefore, all results are determined solely by the behavior of the 4-way

Opteron 275 system.

3.3.2 Parallel TCP Benchmark

Most existing network benchmarks evaluate single-connection performance. How-

ever, modern multithreaded server applications simultaneously manage tens to thou-

sands of connections. This parallel network traffic behaves quite differently than a

single network connection. To address this issue, a multithreaded, event-driven, net-

work benchmark was developed that distributes traffic across a configurable number

of connections. The benchmark distributes connections evenly across threads and

utilizes libevent to manage connections within a thread. For all of the experiments

in this chapter, the number of threads used by the benchmark is equal to the number

of processor cores being used. Each thread manages an equal number of connections.

For evaluations using 6 NICs, the application’s connections are distributed across the

server’s NICs equally such that each of the four threads uses each NIC, and every

thread has the same number of connections that map to each NIC.

Each thread sends data over all of its connections using zero-copy sendfile().

Threads receive data using read(). The sending and receiving socket buffer sizes

are set to be sufficiently large (typically 256 KB) to accommodate the large TCP

windows for high-bandwidth connections. Using larger socket buffers did not improve

performance for any test. All experiments use the standard 1500-byte maximum

transmission unit and do not utilize TCP segmentation offload, which currently is

not implemented in FreeBSD. The benchmark is always run for 3 minutes.
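
The inner loop of one sender thread resembles the sketch below, which assumes FreeBSD's sendfile(2) interface. Connection setup, the libevent event loop, and error handling (including partial sends that return EAGAIN on a non-blocking socket) are omitted, and the chunk size is chosen only to match the 256 KB socket buffers mentioned above.

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>            /* sendfile() on FreeBSD */

#define CHUNK (256 * 1024)      /* matches the 256 KB socket buffer size */

/* Send one chunk of a file over one connection using zero-copy sendfile(). */
static int send_chunk(int file_fd, int sock_fd, off_t *offset)
{
    off_t sent = 0;

    if (sendfile(file_fd, sock_fd, *offset, CHUNK, NULL, &sent, 0) == -1)
        return -1;              /* EAGAIN/partial-send handling omitted */

    *offset += sent;
    return 0;
}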

Stack Type     24 conns   192 conns   16384 conns
UP             4177       4156        4037
MsgP           3328       3251        1821
ConnP-T(4)     2543       2475        2483
ConnP-L(128)   3321       3240        1861

Table 3.2: Aggregate throughput (Mbps) for uniprocessor, message-parallel and connection-parallel network stacks.

3.4 Evaluation using One 10 Gigabit NIC

Table 3.2 shows the aggregate throughput across all the connections of the parallel

TCP benchmark described in Section 3.3.2 when using a single 10 Gbps interface. The

table presents the throughput for each network stack organization when the evaluated

system is transmitting data on 24, 192, or 16384 simultaneous connections.

“UP” is the uniprocessor version of the FreeBSD kernel running on a single core of

the Opteron server. The rest of the configurations are run on all 4 cores. “MsgP” is

the multiprocessor FreeBSD-based MsgP kernel described in Section 3.2.1. “ConnP-

T(4)” is the multiprocessor FreeBSD-based ConnP-T kernel described in Section 3.2.2,

using 4 kernel protocol threads for TCP/IP stack processing that are each pinned to a

different core. “ConnP-L(128)” is the multiprocessor FreeBSD-based ConnP-L kernel

described in Section 3.2.2. ConnP-L(128) divides the connections among 128 locks

within the TCP/IP stack.

As Table 3.2 shows, none of the parallel organizations outperform the “UP” kernel.

This corroborates prior evaluations of 10 Gbps Ethernet that used hosts with two

processors and an SMP variant of Linux and exhibited worse performance than when

the hosts used a uniprocessor kernel [20]. Of the parallel organizations, MsgP and

ConnP-L perform approximately the same and outperform ConnP-T when using 24

or 192 connections. However, ConnP-T performs best when using 16384 connections.

Both the software interface to the single 10 Gbps NIC and the various overheads

inherent to each parallel approach limit performance and prevent the parallel orga-

nizations from outperforming the uniprocessor. When using one NIC, performance

is limited by the serialization constraints imposed by the device’s interface. Because

the device has a single physical interrupt line, only one thread is triggered when

the device raises an interrupt, and hence one thread carries received packets “up”

through the network stack as described in Section 3.2.1. Transmit-side traffic also

faces a device-imposed serialization constraint. Because multiple threads can poten-

tially request to transmit a packet at the same time and invoke the NIC’s driver, the

driver requires acquisition of a mutual exclusion lock to ensure consistency of shared

state related to transmitting packets. Process profiling shows that for all connection

loads, the driver’s lock is held by a core in the system nearly 100% of the time, and

that even with 16384 connections, MsgP and ConnP-L organizations show more than

50% idle time. The ConnP-T organization is also constrained by the driver’s lock,

but it is able to outperform the other organizations with 16384 connections because

it does not constrain received acknowledgement packets to be processed by the single

interrupt thread, as the other organizations do. Instead, it is able to distribute re-

ceive processing to protocol threads running on all of the processor cores. However,

ConnP-T performs worse than the uniprocessor because of the significant scheduler

overheads associated with ConnP-T’s thread handoff mechanism.

3.5 Evaluation using Multiple Gigabit NICs

As is shown in the previous section, using a single 10 Gbps interface limits the

parallelism available to the network stack at the device interface. This external bot-

tleneck prevents the parallelism within the network stack from being exercised. To

provide additional inbound parallelism and to reduce the degree to which a single

driver’s lock can serialize network stack processing, the uniprocessor, message-parallel,

and connection-parallel organizations are evaluated using six Gigabit Ethernet NICs

[Figure 3.1: Aggregate transmit throughput for uniprocessor, message-parallel, and connection-parallel network stacks using 6 NICs. The plot shows throughput (Mb/s) versus the number of connections (24 through 16384) for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels.]

rather than a single 10 Gigabit NIC. Hence, on the inbound processing path there

are six different interrupts with six different interrupt threads to feed the network

stack in parallel. Each NIC has a separate driver instance with a separate driver

lock, reducing the probability that the network stack will contend for a driver lock.

This model more closely resembles the abundant thread parallelism that is presented

to the operating system at the application layer by the parallel benchmark and hence

fully stresses the network stack’s parallel processing capabilities. Because the single

10 Gbps-NIC configuration leaves processing resources underutilized for every organization and cannot effectively isolate the network stack, it is not examined further.

Figures 3.1 and 3.2 depict the aggregate TCP throughput across all connections

for the various network stack organizations when using six separate Gigabit interfaces.

Figure 3.1 shows that the “UP” kernel performs well when transmitting on a small

number of connections, achieving a bandwidth of 3804 Mb/s with 24 connections.


[Figure 3.2: Aggregate receive throughput for uniprocessor, message-parallel, and connection-parallel network stacks using 6 NICs. The plot shows throughput (Mb/s) versus the number of connections (24 through 16384) for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels.]

However, total bandwidth decreases as the number of connections increases. MsgP

performs better, providing an 11% improvement over the uniprocessor bandwidth at 24 connections; it quickly ramps up to 4630 Mb/s, holds steady through 768 connections, and then decreases to 3403 Mb/s with 16384 connections. ConnP-

T(4) achieves its peak bandwidth of 3169 Mb/s with 24 connections and provides

approximately steady bandwidth as the number of connections increases. Finally, the ConnP-L(128) curve is shaped similarly to that of MsgP, but its throughput is higher in magnitude and always exceeds that of the uniprocessor kernel. ConnP-L(128) delivers

steady performance around 5440 Mb/s for 96–768 connections and then gradually

decreases to 4747 Mb/s with 16384 connections. This peak performance is roughly

the peak TCP throughput deliverable by the three dual-port Gigabit NICs.

Figure 3.2 shows the aggregate TCP throughput across all connections when re-

ceiving data on six Gigabit interfaces. Again, ConnP-L(128) performs best, followed


by MsgP, ConnP-T(4), and the uniprocessor kernel. Unlike the transmit case, the par-

allel organizations always outperform the uniprocessor, and in many cases they receive

at a higher rate than they transmit. The ConnP-L(128) organization is again able to

receive at near-peak performance at 384 connections and holds approximately steady,

receiving over 5 Gb/s of throughput using 16384 connections. Both the ConnP-T(4)

and uniprocessor kernels also receive steady (but lower) bandwidth across all con-

nection loads tested, only slightly decreasing as connections are added. Conversely,

MsgP does not provide as consistent bandwidth across the various connection loads,

but it does uniformly outperform both ConnP-T(4) and “UP”.

3.6 Discussion and Analysis

The locking, scheduling, and cache overheads of the network stack vary depending

on both the parallel network stack organization and the number of active connections

in the system. The following subsections will examine these issues for the best per-

forming hardware configuration, a system with six 1 Gbps network interfaces. All of

the statistics within this section were collected using either the Opterons’ performance

counters or FreeBSD’s lock-profiling facilities.

3.6.1 Locking Overhead

There are two significant costs of locking within the parallelized network stacks.

The first is that SMP locks are fundamentally more expensive than uniprocessor locks.

In a uniprocessor kernel, a simple atomic test-and-set instruction can be used to

protect against interference across context switches, whereas SMP systems must use system-wide locking to ensure proper synchronization among simultaneously running

threads. This is likely to incur significant overhead in the SMP case. For example,

on x86 architectures, the lock prefix, which is used to ensure that an instruction


[Figure 3.3: The outbound control path in the application thread context. The diagram traces a socket send through Socket Send, TCP Send, TCP Output, IP Output, Ethernet Output, the interface queue, and the driver, marking each acquisition (A) and release (R) of the Socket Buffer, Connection Hash-table, Connection, Route Hash-table, Route, and TX Interface Queue locks; bold denotes global locks, regular type denotes per-connection locks.]

is executed atomically across the system, effectively locks all other cores out of the

memory system during the execution of the locked instruction.
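As an illustration, the following sketch shows a minimal test-and-set spinlock built from compiler atomics; it is not FreeBSD's actual mutex implementation, but on x86 the atomic operation compiles to a lock-prefixed instruction with exactly the system-wide serialization cost described above.

    /*
     * Minimal test-and-set spinlock sketch; not FreeBSD's mutex
     * implementation.  On x86, __atomic_test_and_set() compiles to a
     * lock-prefixed instruction, which serializes the memory system as
     * described above.  A uniprocessor kernel can avoid this cost by
     * relying on a plain test-and-set plus disabled preemption.
     */
    #include <stdbool.h>

    typedef struct {
        bool locked;                      /* false = free, true = held */
    } spinlock_t;

    static inline void
    spin_lock(spinlock_t *l)
    {
        /* Spin until the previous value of 'locked' was false. */
        while (__atomic_test_and_set(&l->locked, __ATOMIC_ACQUIRE))
            ;                             /* busy-wait */
    }

    static inline void
    spin_unlock(spinlock_t *l)
    {
        __atomic_clear(&l->locked, __ATOMIC_RELEASE);
    }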

The second is that contention for global locks within the network stack is sig-

nificantly increased when multiple threads are actively performing network tasks si-


OS Type         6 conns  192 conns  16384 conns
MsgP                 89        100          100
ConnP-L(4)           60         56           52
ConnP-L(8)           51         30           26
ConnP-L(16)          49         18           14
ConnP-L(32)          41         10            7
ConnP-L(64)          37          6            4
ConnP-L(128)         33          5            2

Table 3.3: Percentage of lock acquisitions for global TCP/IP locks that do not succeed immediately when transmitting data.

multaneously. As an illustration of how locks can contend within the network stack,

Figure 3.3 shows the locking required in the control path for send processing within

the sending application’s thread context in the MsgP network stack of FreeBSD 7.

Most of the locks pictured are associated with a single socket buffer or connection.

Therefore, it is unlikely that multiple application threads would contend for those

locks since connection-oriented applications do not use multiple application threads

to send data over the same connection. However, those locks could be shared with

the kernel’s inbound protocol threads that are processing receive traffic on the same

connection. Global locks that must be acquired by all threads that are sending (or

possibly receiving) data over any connection are far more problematic.

There are two global locks on the send path: the Connection Hash-table lock

and the Route Hash-table lock. These locks protect the hash tables that map a

particular connection to its individual Connection lock and that map a particular connection to its individual Route lock, respectively. These locks are also used in lieu of

explicit reference counting for individual connections and locks. Watson presents a

more detailed description of locking within the FreeBSD network stack [57].

There is very little contention for the Route Hash-table lock because the cor-

responding Route lock is quickly acquired and released so a thread is unlikely to be

blocked while holding the Route Hash-table lock and waiting for a Route lock. In

contrast, the Connection Hash-table lock is highly contended. This lock must be


acquired by any thread performing any network operation on any connection. Fur-

thermore, it is possible for a thread to block while holding the lock and waiting for

its corresponding Connection lock, which can be held for quite some time.
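The lookup pattern that makes the Connection Hash-table lock a global serialization point can be sketched as follows; the names are hypothetical, and FreeBSD's actual inpcb code differs in detail.

    /*
     * Hypothetical sketch of the lookup pattern described above; FreeBSD's
     * actual inpcb code differs in detail.  Every thread sending or
     * receiving on any connection passes through the single global
     * hash-table lock before taking the per-connection lock, and it may
     * block on the per-connection lock while still holding the global one.
     */
    #include <pthread.h>
    #include <stddef.h>

    struct conn {
        pthread_mutex_t lock;             /* per-connection lock             */
        struct conn    *hash_next;        /* hash-bucket chain               */
        /* ... protocol control block state ...                              */
    };

    struct conn_table {
        pthread_mutex_t  table_lock;      /* global Connection Hash-table lock */
        struct conn    **buckets;
        unsigned         nbuckets;
    };

    static struct conn *
    conn_lookup_and_lock(struct conn_table *t, unsigned hash)
    {
        struct conn *c;

        pthread_mutex_lock(&t->table_lock);        /* global lock            */
        c = t->buckets[hash % t->nbuckets];        /* locate the connection  */
        if (c != NULL)
            pthread_mutex_lock(&c->lock);          /* may block while the    */
                                                   /* table lock is held     */
        pthread_mutex_unlock(&t->table_lock);
        return (c);                                /* caller releases c->lock */
    }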

Table 3.3 depicts global TCP/IP lock contention when sending data, measured as

the percentage of lock acquisitions that do not immediately succeed because another

thread holds the lock. ConnP-T is omitted from the table because it eliminates

global TCP/IP locking completely. As the table shows, the MsgP network stack

experiences significant contention for the Connection Hash-table lock, which leads

to considerable overhead as the number of connections increases.

One would expect that as connections are added, contention for per-connection

locks would decrease, and in fact lock profiling supports this conclusion. However,

because other locks (such as that guarding the scheduler) are acquired while holding

the per-connection lock, and because those other locks are system-wide and become

highly contended during heavy loads, detailed locking profiles show that the average

time per-connection locks are held increases dramatically. Hence, though contention

for per-connection locks decreases, the increasing cost for a contended lock is so

much greater that the system exhibits increasing average acquisition times for per-

connection locks as connections are added. This increased per-connection acquisition

time in turn leads to longer waits for the Connection Hash-table lock, eventually

bogging down the system with contention.

Whereas the MsgP stack relies on repeated acquisition of the Connection Hash-table

and Connection locks to continue stack processing, ConnP-L stacks can also become

periodically bottlenecked if a single group becomes highly contended. Table 3.3 shows

the contention for the Network Group locks for ConnP-L stacks as the number of net-

work groups is varied from 4 to 128. The table demonstrates that contention

for the Network Group locks consistently decreases as the number of network groups

increases. Though ConnP-L(4)’s Network Group lock contention is high at over 50%


[Figure 3.4: Aggregate transmit throughput for the ConnP-L network stack as the number of locks is varied. The plot shows throughput (Mb/s) versus the number of connections for ConnP-L(4) through ConnP-L(128).]

                        Transmit                          Receive
Stack Type      24 conns  192 conns  16384 conns  24 conns  192 conns  16384 conns
UP                   452        440          423       350        378          421
MsgP                1305       1818         2448      1125       1126         1158
ConnP-T(4)          3617       3602         4535       858        957         1547
ConnP-L(128)        1056        924         1064       598        519          524

Table 3.4: Cycles spent managing the scheduler and scheduler synchronization per Kilobyte of payload.

for all connection loads, increasing the number of network groups to 128 reduces

contention from 52% to just 2% for the heaviest connection load.
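For illustration, the following sketch shows how a ConnP-L stack might map a connection to one of its network groups; the hash function and structure names are assumptions rather than the exact implementation evaluated here.

    /*
     * Illustrative sketch of ConnP-L's connection-to-group mapping; the hash
     * function and structure names are assumptions rather than the exact
     * implementation evaluated here.  All connections that hash to the same
     * group are protected by that group's single lock, so contention falls
     * as NGROUPS grows (Table 3.3).
     */
    #include <pthread.h>
    #include <stdint.h>

    #define NGROUPS 128                       /* e.g., ConnP-L(128)          */

    struct net_group {
        pthread_mutex_t lock;                 /* protects all state in group */
        /* ... per-group connection lists, hash tables, timers ...           */
    };

    static struct net_group groups[NGROUPS];

    static struct net_group *
    group_for_connection(uint32_t laddr, uint16_t lport,
                         uint32_t faddr, uint16_t fport)
    {
        uint32_t h = laddr ^ faddr ^ ((uint32_t)lport << 16 | fport);
        return (&groups[h % NGROUPS]);
    }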

Figure 3.4 shows the effect that increasing the number of network groups has on aggre-

gate throughput for 6, 192, and 16384 connections. As is suggested by the contention

reduction associated with larger numbers of network groups, network throughput in-

creases with more network groups. However, there are diminishing returns as more

groups are added.

3.6.2 Scheduler Overhead

The ConnP-T kernel trades the locking overhead of the ConnP-L and MsgP kernels

for scheduling overhead. As operations are requested for a particular connection,


[Figure 3.5: Profile of L2 cache misses per 1 Kilobyte of payload data (transmit test). Stacked bars divide misses between the scheduler and the network stack for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels at 24, 192, and 16384 connections.]

they must be scheduled onto the appropriate protocol thread. As Figures 3.1 and 3.2

showed, this results in stable but low total bandwidth for ConnP-T(4) as connections scale. ConnP-L approximates the reduced intra-stack locking properties of ConnP-T

and adopts the simpler scheduling properties of MsgP; locking overhead is minimized

by the additional groups and scheduling overhead is minimized since messages are

not transferred to protocol threads. This results in consistently better performance

than the other parallel organizations.

To further explain this behavior, Table 3.4 shows the number of cycles spent

managing the scheduler and scheduler synchronization per KB of payload data trans-

mitted and received. This shows the overhead of the scheduler normalized to network

bandwidth. Though MsgP experiences significantly less scheduling overhead than

ConnP-T in most cases, locking overhead within the threads negates the scheduler

advantage as connections are added. In contrast, the scheduler overhead of ConnP-T

remains high, particularly when transmitting, corresponding to relatively low band-

width. Conversely, ConnP-L exhibits stable scheduler overhead that is much lower

than that of either MsgP or ConnP-T, contributing to its higher throughput. ConnP-L does not require a thread handoff mechanism, and its lower lock contention relative to MsgP results in fewer context switches by threads waiting for locks.


[Figure 3.6: Profile of L2 cache misses per 1 Kilobyte of payload data (receive test). Stacked bars divide misses among data copying, the scheduler, and the network stack for the UP, MsgP, ConnP-T(4), and ConnP-L(128) kernels at 24, 192, and 16384 connections.]

All of the network stack organizations examined experience higher scheduler over-

head when transmitting than when receiving. The reference FreeBSD 7 operating

system utilizes an interrupt-serialized task queue architecture for processing received

packets. This architecture obviates the need for explicit mutual exclusion locking

within NIC drivers when processing received packets, though locking is still required

on the transmit path. Each of the organizations examined benefits from this optimiza-

tion. Because FreeBSD’s kernel-adaptive mutual exclusion locks invoke the thread

scheduler when acquisitions repeatedly fail, eliminating lock acquisition attempts nec-

essarily reduces scheduler overhead.

The ConnP-T organization experiences an additional reduction in scheduler over-

head when processing received packets. In this organization, inbound packets are

queued asynchronously for later processing by the appropriate network protocol thread,

which eliminates the need to block the thread that enqueues a packet or to later no-

tify a blocked thread of completion. When sending, most processing occurs when

the application attempts to send data, which requires a more scheduler-intensive syn-

chronous call, and hence ConnP-T exhibits significantly higher scheduler overhead

when transmitting than when receiving.
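The asymmetry between the receive and transmit paths can be seen in the following simplified sketch of a ConnP-T-style hand-off; the structures are hypothetical, and the evaluated kernel uses FreeBSD's scheduler and synchronization primitives rather than this simplified queue.

    /*
     * Simplified sketch of a ConnP-T-style hand-off; the structures are
     * hypothetical and the evaluated kernel uses FreeBSD's scheduler rather
     * than this queue.  Received packets are enqueued asynchronously for the
     * connection's pinned protocol thread, so the enqueueing thread never
     * blocks; a send, by contrast, is a synchronous request that must wake
     * the protocol thread and wait for its reply, which is where the
     * scheduler overhead in Table 3.4 is incurred.
     */
    #include <pthread.h>
    #include <stddef.h>

    struct pkt {
        struct pkt *next;
        /* ... packet data ... */
    };

    struct proto_thread {
        pthread_mutex_t lock;
        pthread_cond_t  wakeup;
        struct pkt     *rx_head, *rx_tail;    /* inbound packet queue        */
    };

    /* Receive path: enqueue the packet and return immediately. */
    static void
    connp_t_enqueue_rx(struct proto_thread *pt, struct pkt *p)
    {
        pthread_mutex_lock(&pt->lock);
        p->next = NULL;
        if (pt->rx_tail != NULL)
            pt->rx_tail->next = p;
        else
            pt->rx_head = p;
        pt->rx_tail = p;
        pthread_cond_signal(&pt->wakeup);     /* wake the pinned thread      */
        pthread_mutex_unlock(&pt->lock);
    }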


Table 3.4 shows that the reference ConnP-T implementation in this dissertation incurs

heavy overhead in the thread scheduler, and hence an effective ConnP-T organi-

zation would require a more efficient interprocessor communication mechanism. A

lightweight mechanism for interprocessor communication, as implemented in Dragon-

flyBSD, would enable efficient intra-kernel messaging between processor cores. Such

an efficient messaging mechanism is likely to greatly benefit the ConnP-T organization

by allowing message transfer without invoking the general-purpose scheduler.

3.6.3 Cache Behavior

Figures 3.5 and 3.6 show the number of L2 cache misses per KB of payload data

transmitted and received, respectively. The stacked bars separate the L2 cache misses

based upon where in the operating system the misses occurred. On the transmit side,

all of the L2 cache misses occur either in the scheduler or in the network stack. On the

receive side, there are also misses copying the data from the kernel to the application.

Recall that zero-copy transmit is used, so the corresponding copy from the application

to the kernel does not occur on the transmit side.

The figures show the efficiency of the cache hierarchy normalized to network band-

width. The uniprocessor kernel incurs very few cache misses relative to the multipro-

cessor configurations. The lack of data migration between processor caches accounts

for the uniprocessor kernel’s cache efficiency. As the number of connections is in-

creased, the additional connection state within the kernel stresses the cache and

directly results in increased cache misses and decreased throughput [27, 28].

The parallel network stacks incur significantly more cache misses per KB of trans-

mitted data because of data migration and lock accesses. Surprisingly, ConnP-T(4)

incurs the most cache misses despite each thread being pinned to a specific proces-

sor core. One might expect that such pinning would improve locality by eliminating

migration of many connection data structures. However, Figure 3.5 shows that for


the cases with 6 and 192 connections, ConnP-T(4) exhibits more misses in the net-

work stack than any of the other organizations. While thread pinning can improve

locality by eliminating migration of connection metadata, frequently updated socket

metadata is still shared between the application and protocol threads, which leads

to data migration and a higher cache miss rate. Pinning the protocol threads does

result in better utilization of the caches for the 16384-connection load when trans-

mitting, however. In this case, ConnP-T(4) exhibits the fewest network-stack L2

cache misses. However, the relatively higher number of L2 cache misses caused by

the scheduler prevents this advantage from translating into a performance benefit.

Other than the cache misses due to data copying, the cache miss profiles for

transmit and receive are quite similar. However, ConnP-T(4) incurs far fewer cache

misses in the scheduler when receiving data than it does when transmitting data.

This is directly related to the reduced scheduling overhead on the receive side, as

discussed in the previous section.

The cache misses within the network stack can be divided between misses to

concurrently shared data and serially shared data. Global network stack data is

concurrently shared, as it may be simultaneously accessed by multiple threads in

order to transmit or receive data. In contrast, per-connection data is serially shared,

as it is only accessed by a single thread at a time, although it may be accessed by

multiple threads over time. In a true MsgP organization, the per-connection data will

also be concurrently shared, as multiple threads can process packets from the same

connection simultaneously. However, in a practical, in-order MsgP implementation,

as described in Section 3.2.1, per-connection data may be accessed by at most two

threads at a time, one sending data and one receiving data on the same connection.

Table 3.5 indicates the percentage of the cache misses within the network stack

that are due to global data structures, and are therefore concurrently shared. The

remaining L2 cache misses are due to per-connection data structures. As previously


                        Transmit                          Receive
Stack Type      24 conns  192 conns  16384 conns  24 conns  192 conns  16384 conns
UP                    4%         3%          27%        1%         1%          12%
MsgP                 12%        14%          32%       16%        15%          22%
ConnP-T(4)            7%         9%          15%       14%        13%          11%
ConnP-L(128)         15%        19%          21%       18%        18%          20%

Table 3.5: Percentage of L2 cache misses within the network stack to global data structures.

stated, these per-connection data structures are rarely, if ever, accessed by different

threads concurrently.

Data that is concurrently shared is most likely to benefit from a CMP with a

shared cache. Therefore, the percentages of Table 3.5 indicate the possible reduction

in L2 cache misses within the network stack if a CMP with a shared cache were used.

For most connection loads and network stack organizations, fewer than 20% of the L2

cache misses are due to global data. As there is no guarantee that a shared L2 will

eliminate these misses entirely, the benefits of a shared cache for the network stack

are likely to be minimal. Furthermore, it is also possible for a shared cache to have a

detrimental effect on the serially shared data. Previous work has shown that shared

caches can hurt performance when the cores are not actively sharing data [29]. If

the processor cores must compete to store per-connection state with each other, this

could potentially lead to an overall increase in L2 cache misses within the network

stack, despite the benefits for concurrently shared data.

The lock contention, scheduling, and cache efficiency data show that the different

concurrency models and the different synchronization mechanisms employed by their

implementations directly impact network stack efficiency and throughput. Though

all of these parallelized organizations can outperform the uniprocessor when using 4

cores, each parallel organization experiences higher locking overhead, decreased cache

efficiency, and higher scheduling overhead than a uniprocessor network stack. The

ConnP-L organization maximizes performance and efficiency compared to the MsgP

and ConnP-T organizations. ConnP-L mitigates the locking overhead of the highly


contended MsgP organization by grouping connections to reduce global locking.

ConnP-L also benefits from reduced scheduling overhead as compared to ConnP-T,

since ConnP-L does not require inter-thread communication or message passing to

carry out network stack processing. Hence, though the ConnP-L parallelism model

is more restricted than that of MsgP, ConnP-L still provides the same level of par-

allelism expected by most applications (e.g., connection- or socket-level parallelism)

and achieves higher efficiency and higher throughput.


Chapter 4

Concurrent Direct Network Access

In many organizations, the economics of supporting a growing number of Internet-

based services has created a demand for server consolidation. In such organizations,

maximizing machine utilization and increasing the efficiency of the overall server is

just as important as increasing the efficiency of each individual operating system, as

in Chapter 3. Consequently, there has been a resurgence of interest in machine virtu-

alization [1, 2, 7, 13, 16, 21, 35, 53, 58]. A virtual machine monitor (VMM) enables

multiple virtual machines, each encapsulating one or more services, to share the same

physical machine safely and fairly. In principle, general-purpose operating systems,

such as Unix and Windows, offer the same capability for multiple services to share

the same physical machine. However, VMMs provide additional advantages. For ex-

ample, VMMs allow services implemented in different or customized environments,

including different operating systems, to share the same physical machine.

Modern VMMs for commodity hardware, such as VMware [1, 13] and Xen [7],

virtualize processor, memory, and I/O devices in software. This enables these VMMs

to support a variety of hardware. In an attempt to decrease the software overhead

of virtualization, both AMD and Intel are introducing hardware support for virtu-

alization [2, 21]. Specifically, their hardware support for processor virtualization is

currently available, and their hardware support for memory virtualization is immi-

nent. As these hardware mechanisms mature, they should reduce the overhead of

virtualization, improving the efficiency of VMMs.


Despite the renewed interest in system virtualization, there is still no clear solu-

tion to improve the efficiency of I/O virtualization. To support networking, a VMM

must present each virtual machine with a virtual network interface that is multi-

plexed in software onto a physical network interface card (NIC). The overhead of this

software-based network virtualization severely limits network performance [38, 39, 53].

For example, a Linux kernel running within a virtual machine on Xen is only able

to achieve about 30% of the network throughput that the same kernel can achieve

running directly on the physical machine.

This study proposes and evaluates concurrent direct network access (CDNA), a

new I/O virtualization architecture that combines software and hardware components

to significantly reduce the overhead of network virtualization in VMMs. The CDNA

network virtualization architecture provides virtual machines running on a VMM safe

direct access to the network interface. With CDNA, each virtual machine is allocated

a unique context on the network interface and communicates directly with the network

interface through that context. In this manner, the virtual machines that run on the

VMM operate as if each has access to its own dedicated network interface.

Using CDNA, a single virtual machine running Linux can transmit at a rate of

1867 Mb/s with 51% idle time and receive at a rate of 1874 Mb/s with 41% idle

time. In contrast, at 97% CPU utilization, Xen is only able to achieve 1602 Mb/s for

transmit and 1112 Mb/s for receive. Furthermore, with 24 virtual machines, CDNA

can still transmit and receive at a rate of over 1860 Mb/s, but with no idle time. In

contrast, Xen is only able to transmit at a rate of 891 Mb/s and receive at a rate of

558 Mb/s with 24 virtual machines.

The CDNA network virtualization architecture achieves this dramatic increase in

network efficiency by dividing the tasks of traffic multiplexing, interrupt delivery, and

memory protection among hardware and software in a novel way. Traffic multiplexing

is performed directly on the network interface, whereas interrupt delivery and memory


[Figure 4.1: Shared networking in the Xen virtual machine environment. Guest domains with front-end drivers exchange packet data with back-end drivers in the driver domain via page flipping; the driver domain's Ethernet bridge and NIC driver manage the physical NIC, and the hypervisor dispatches interrupts to the domains as virtual interrupts.]

protection are performed by the VMM with support from the network interface.

This division of tasks into hardware and software components simplifies the overall

software architecture, minimizes the hardware additions to the network interface, and

addresses the network performance bottlenecks of Xen.

The remainder of this study proceeds as follows. The next section discusses net-

working in the Xen VMM in more detail. Section 4.2 describes how CDNA manages

traffic multiplexing, interrupt delivery, and memory protection in software and hard-

ware to provide concurrent access to the NIC. Section 4.3 then describes the custom

hardware NIC that facilitates concurrent direct network access on a single device.

Finally, Section 4.4 presents the experimental methodology and results. This study

is based on one of my previously published works [60].


4.1 Networking in Xen

4.1.1 Hypervisor and Driver Domain Operation

A VMM allows multiple guest operating systems, each running in a virtual ma-

chine, to share a single physical machine safely and fairly. It provides isolation be-

tween these guest operating systems and manages their access to hardware resources.

Xen is an open source VMM that supports paravirtualization, which requires modifi-

cations to the guest operating system [7]. By modifying the guest operating systems

to interact with the VMM, the complexity of the VMM can be reduced and overall

system performance improved.

Xen performs three key functions in order to provide virtual machine environ-

ments. First, Xen allocates the physical resources of the machine to the guest oper-

ating systems and isolates them from each other. Second, Xen receives all interrupts

in the system and passes them on to the guest operating systems, as appropriate. Fi-

nally, all I/O operations go through Xen in order to ensure fair and non-overlapping

access to I/O devices by the guests.

Figure 4.1 shows the organization of the Xen VMM. Xen consists of two elements:

the hypervisor and the driver domain. The hypervisor provides an abstraction layer

between the virtual machines, called guest domains, and the actual hardware, en-

abling each guest operating system to execute as if it were the only operating system

on the machine. However, the guest operating systems cannot directly communicate

with the physical I/O devices. Exclusive access to the physical devices is given by the

hypervisor to the driver domain, a privileged virtual machine. Each guest operating

system is then given a virtual I/O device that is controlled by a paravirtualized driver,

called a front-end driver. In order to access a physical device, such as the network in-

terface card (NIC), the guest’s front-end driver communicates with the corresponding

back-end driver in the driver domain. The driver domain then multiplexes the data


streams for each guest onto the physical device. The driver domain runs a modified

version of Linux that uses native Linux device drivers to manage I/O devices.

As the figure shows, in order to provide network access to the guest domains, the

driver domain includes a software Ethernet bridge that interconnects the physical

NIC and all of the virtual network interfaces. When a packet is transmitted by a

guest, it is first transferred to the back-end driver in the driver domain using a page

remapping operation. Within the driver domain, the packet is then routed through

the Ethernet bridge to the physical device driver. The device driver enqueues the

packet for transmission on the network interface as if it were generated normally

by the operating system within the driver domain. When a packet is received, the

network interface generates an interrupt that is captured by the hypervisor and routed

to the network interface’s device driver in the driver domain as a virtual interrupt.

The network interface’s device driver transfers the packet to the Ethernet bridge,

which routes the packet to the appropriate back-end driver. The back-end driver

then transfers the packet to the front-end driver in the guest domain using a page

remapping operation. Once the packet is transferred, the back-end driver requests

that the hypervisor send a virtual interrupt to the guest notifying it of the new packet.

Upon receiving the virtual interrupt, the front-end driver delivers the packet to the

guest operating system’s network stack, as if it had come directly from the physical

device.

4.1.2 Device Driver Operation

The driver domain in Xen is able to use unmodified Linux device drivers to ac-

cess the network interface. Thus, all interactions between the device driver and the

NIC are as they would be in an unvirtualized system. These interactions include

programmed I/O (PIO) operations from the driver to the NIC, direct memory access


(DMA) transfers by the NIC to read or write host memory, and physical interrupts

from the NIC to invoke the device driver.

The device driver directs the NIC to send packets from buffers in host memory

and to place received packets into preallocated buffers in host memory. The NIC

accesses these buffers using DMA read and write operations. In order for the NIC to

know where to store or retrieve data from the host, the device driver within the host

operating system generates DMA descriptors for use by the NIC. These descriptors

indicate the buffer’s length and physical address on the host. The device driver notifies

the NIC via PIO that new descriptors are available, which causes the NIC to retrieve

them via DMA transfers. Once the NIC reads a DMA descriptor, it can either read

from or write to the associated buffer, depending on whether the descriptor is being

used by the driver to transmit or receive packets.

Device drivers organize DMA descriptors in a series of rings that are managed

using a producer/consumer protocol. As they are updated, the producer and con-

sumer pointers wrap around the rings to create a continuous circular buffer. There

are separate rings of DMA descriptors for transmit and receive operations. Transmit

DMA descriptors point to host buffers that will be transmitted by the NIC, whereas

receive DMA descriptors point to host buffers that the OS wants the NIC to use as it

receives packets. When the host driver wants to notify the NIC of the availability of a

new DMA descriptor (and hence a new packet to be transmitted or a new buffer to be

posted for packet reception), the driver first creates the new DMA descriptor in the

next-available slot in the driver’s descriptor ring and then increments the producer

index on the NIC to reflect that a new descriptor is available. The driver updates

the NIC’s producer index by writing the value via PIO into a specific location, called

a mailbox, within the device’s PCI memory-mapped region. The network interface

monitors these mailboxes for such writes from the host. When a mailbox update is

detected, the NIC reads the new producer value from the mailbox, performs a DMA


System         Transmit (Mb/s)  Receive (Mb/s)
Native Linux              5126            3629
Xen Guest                 1602            1112

Table 4.1: Transmit and receive performance for native Linux 2.6.16.29 and paravirtualized Linux 2.6.16.29 as a guest OS within Xen 3.

read of the descriptor indicated by the index, and then is ready to use the DMA

descriptor. After the NIC consumes a descriptor from a ring, the NIC updates its

consumer index, transfers this consumer index to a location in host memory via DMA,

and raises a physical interrupt to notify the host that state has changed.
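The producer/consumer protocol described above can be sketched in C as follows; the field names and mailbox layout are illustrative assumptions rather than a particular NIC's register interface.

    /*
     * Illustrative sketch of the transmit-descriptor protocol described
     * above; the field names and mailbox layout are assumptions rather than
     * a particular NIC's register interface.
     */
    #include <stdint.h>

    struct dma_desc {
        uint64_t addr;       /* physical address of the host buffer          */
        uint32_t len;        /* buffer length in bytes                       */
        uint32_t flags;      /* device-specific flags                        */
    };

    #define RING_SIZE 256

    struct tx_ring {
        struct dma_desc    desc[RING_SIZE];   /* descriptor ring in host memory */
        uint32_t           prod;              /* driver-owned producer index    */
        volatile uint32_t *mailbox;           /* PIO-mapped mailbox on the NIC  */
    };

    /* Post one packet buffer for transmission. */
    static void
    post_tx_buffer(struct tx_ring *r, uint64_t phys, uint32_t len)
    {
        struct dma_desc *d = &r->desc[r->prod % RING_SIZE];

        d->addr  = phys;                      /* 1. build the descriptor        */
        d->len   = len;
        d->flags = 0;
        r->prod++;
        *r->mailbox = r->prod;                /* 2. PIO write: the NIC notices  */
                                              /*    the update and DMAs the new */
                                              /*    descriptor from host memory */
    }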

In an unvirtualized operating system, the network interface trusts that the device

driver gives it valid DMA descriptors. Similarly, the device driver trusts that the NIC

will use the DMA descriptors correctly. If either entity violates this trust, physical

memory can be corrupted. Xen also requires this trust relationship between the device

driver in the driver domain and the NIC.

4.1.3 Performance

Despite the optimizations within the paravirtualized drivers to support commu-

nication between the guest and driver domains (such as using page remapping rather

than copying to transfer packets), Xen introduces significant processing and com-

munication overheads into the network transmit and receive paths. Table 4.1 shows

the networking performance of both native Linux 2.6.16.29 and paravirtualized Linux

2.6.16.29 as a guest operating system within Xen 3 Unstable [1] on a modern Opteron-

based system with six Intel Gigabit Ethernet NICs. In both configurations, check-

sum offloading, scatter/gather I/O, and TCP Segmentation Offloading (TSO) were

enabled. Support for TSO was recently added to the unstable development branch of

Xen and is not currently available in the Xen 3 release. As the table shows, a guest

[1] Changeset 12053:874cc0ff214d from 11/1/2006.


domain within Xen is only able to achieve about 30% of the performance of native

Linux. This performance gap strongly motivates the need for networking performance

improvements within Xen.

4.2 CDNA Architecture

With CDNA, the network interface and the hypervisor collaborate to provide the

abstraction that each guest operating system is connected directly to its own net-

work interface. This eliminates many of the overheads of network virtualization in

Xen. Figure 4.2 shows the CDNA architecture. The network interface must support

multiple contexts in hardware. Each context acts as if it is an independent physical

network interface and can be controlled by a separate device driver instance. Instead

of assigning ownership of the entire network interface to the driver domain, the hy-

pervisor treats each context as if it were a physical NIC and assigns ownership of

contexts to guest operating systems. Notice the absence of the driver domain from

the figure: each guest can transmit and receive network traffic using its own private

context without any interaction with other guest operating systems or the driver do-

main. The driver domain, however, is still present to perform control functions and

allow access to other I/O devices. Furthermore, the hypervisor is still involved in

networking, as it must guarantee memory protection and deliver virtual interrupts to

the guest operating systems.

With CDNA, the communication overheads between the guest and driver domains

and the software multiplexing overheads within the driver domain are eliminated

entirely. However, the network interface now must multiplex the traffic across all of

its active contexts, and the hypervisor must provide protection across the contexts.

The following sections describe how CDNA performs traffic multiplexing, interrupt

delivery, and DMA memory protection.


[Figure 4.2: The CDNA shared networking architecture in Xen. Each guest domain's NIC driver exchanges packet data directly with its own context on the CDNA NIC; the hypervisor retains control of the NIC, dispatches physical interrupts, and delivers virtual interrupts to the guests.]

4.2.1 Multiplexing Network Traffic

CDNA eliminates the software multiplexing overheads within the driver domain

by multiplexing network traffic on the NIC. The network interface must be able to

identify the source or target guest operating system for all network traffic. The net-

work interface accomplishes this by providing independent hardware contexts and

associating a unique Ethernet MAC address with each context. The hypervisor as-

signs a unique hardware context on the NIC to each guest operating system. The

device driver within the guest operating system then interacts with its context ex-

actly as if the context were an independent physical network interface. As described

in Section 4.1.2, these interactions consist of creating DMA descriptors and updating

a mailbox on the NIC via PIO.

Each context on the network interface therefore must include a unique set of

mailboxes. This isolates the activity of each guest operating system, so that the NIC

can distinguish between the different guests. The hypervisor assigns a context to

a guest simply by mapping the I/O locations for that context’s mailboxes into the

guest’s address space. The hypervisor also notifies the NIC that the context has been

allocated and is active. As the hypervisor only maps each context into a single guest’s


address space, a guest cannot accidentally or intentionally access any context on the

NIC other than its own. When necessary, the hypervisor can also revoke a context at

any time by notifying the NIC, which will shut down all pending operations associated

with the indicated context.

To multiplex transmit network traffic, the NIC simply services all of the hardware

contexts fairly and interleaves the network traffic for each guest. When network

packets are received by the NIC, it uses the Ethernet MAC address to demultiplex

the traffic, and transfers each packet to the appropriate guest using available DMA

descriptors from that guest’s context.

4.2.2 Interrupt Delivery

In addition to isolating the guest operating systems and multiplexing network traf-

fic, the hardware contexts on the NIC must also be able to interrupt their respective

guests. As the NIC carries out network requests on behalf of any particular context,

the CDNA NIC updates that context’s consumer pointers for the DMA descriptor

rings, as described in Section 4.1.2. Normally, the NIC would then interrupt the guest

to notify it that the context state has changed. However, in Xen all physical inter-

rupts are handled by the hypervisor. Therefore, the NIC cannot physically interrupt

the guest operating systems directly. Even if it were possible to interrupt the guests

directly, that could create a much higher interrupt load on the system, which would

decrease the performance benefits of CDNA.

Under CDNA, the NIC keeps track of which contexts have been updated since the

last physical interrupt, encoding this set of contexts in an interrupt bit vector, which

is stored in the hypervisor’s private memory-mapped control context on the NIC. To

signal a set of interrupts to the hypervisor, the NIC raises a physical interrupt, which

invokes the hypervisor’s interrupt service routine (ISR). The hypervisor then reads

the interrupt bit vector from the NIC via programmed I/O. Next, the hypervisor


decodes the vector and schedules virtual interrupts to each of the guest operating

systems that have pending updates from the NIC. Because the Xen scheduler guar-

antees that these virtual interrupts will be delivered, the hypervisor can immediately

acknowledge the set of interrupts that have been processed. The hypervisor performs

this acknowledgment by writing the processed vector back to the NIC to a separate

acknowledgment location in the hypervisor’s private memory-mapped control context.

After acknowledgment and after the hypervisor’s ISR has run, the hypervisor’s

scheduler will execute and select the next guest operating system to run. When

each guest operating system is subsequently scheduled by the hypervisor, the CDNA network interface driver within that guest receives the virtual interrupts that the

hypervisor has sent. The virtual interrupts are received by the paravirtualized guest

as if they were actual physical interrupts from the hardware. At that time, the guest’s

driver examines the updates from the NIC and determines what further action, such

as processing received packets, is required.
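A sketch of the hypervisor-side dispatch that this design implies is shown below; the register layout and helper function are illustrative assumptions, and Xen's actual event-channel interfaces differ.

    /*
     * Sketch of the hypervisor-side interrupt dispatch described above; the
     * register layout and helper function are illustrative assumptions, and
     * Xen's actual event-channel interfaces differ.
     */
    #include <stdint.h>

    #define CDNA_MAX_CONTEXTS 32

    /* Hypervisor-private, memory-mapped control context on the NIC. */
    struct cdna_ctrl {
        volatile uint32_t intr_vector;    /* bit i set: context i has updates */
        volatile uint32_t intr_ack;       /* written back to acknowledge      */
    };

    /* Assumed hypervisor helper: deliver a virtual interrupt to the guest
     * that owns CDNA context 'ctx'. */
    void send_virtual_interrupt(unsigned ctx);

    static void
    cdna_isr(struct cdna_ctrl *ctrl)
    {
        uint32_t pending = ctrl->intr_vector;          /* programmed-I/O read */
        unsigned ctx;

        for (ctx = 0; ctx < CDNA_MAX_CONTEXTS; ctx++)
            if (pending & (1u << ctx))
                send_virtual_interrupt(ctx);   /* delivery guaranteed by Xen  */

        ctrl->intr_ack = pending;              /* positive acknowledgment     */
    }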

During the development of this research, an alternative NIC-to-hypervisor event

notification mechanism was explored. Instead of providing the interrupt vector to

the hypervisor via memory-mapped I/O, the NIC would transfer an interrupt bit

vector into the hypervisor’s memory space using DMA. The interrupt bit vectors would be stored in a circular buffer using a producer/consumer protocol to ensure that they were processed by the host before being overwritten by the NIC. The vectors would

then be processed identically to the memory-mapped I/O implementation. However,

further examination found that under heavy load, it was possible that the ring buffer

could fill up and interrupts could be lost, even with a very large event ring buffer

of 256 entries. The positive-acknowledgment strategy ensures more reliable delivery

under heavy load, though it does incur some additional overhead. At minimum, the

memory-mapped I/O implementation requires an additional programmed-I/O write

(for the acknowledgment) compared to the ring-buffer implementation. When several


interrupt vectors are processed in one invocation of the hypervisor’s ISR, the ring-

buffer implementation can save one memory-mapped I/O read per vector processed,

since those vectors are read from host memory rather than from the memory-mapped

location on the NIC. The ring-buffer-based strategy was evaluated in [61].

4.2.3 DMA Memory Protection

In the x86 architecture, network interfaces and other I/O devices use physical

addresses when reading or writing host system memory. The device driver in the

host operating system is responsible for doing virtual-to-physical address translation

for the device. The physical addresses are provided to the network interface through

read and write DMA descriptors as discussed in Section 4.1.2. By exposing physical

addresses to the network interface, the DMA engine on the NIC can be co-opted into

compromising system security by a buggy or malicious driver. There are two key

I/O protection violations that are possible in the x86 architecture. First, the device

driver could instruct the NIC to transmit packets containing a payload from physical

memory that does not contain packets generated by the operating system, thereby

creating a security hole. Second, the device driver could instruct the NIC to receive

packets into physical memory that was not designated as an available receive buffer,

possibly corrupting memory that is in use.

In the conventional Xen network architecture discussed in Section 4.1.2, Xen trusts

the device driver in the driver domain to only use the physical addresses of network

buffers in the driver domain’s address space when passing DMA descriptors to the

network interface. This ensures that all network traffic will be transferred to/from

network buffers within the driver domain. Since guest domains do not interact with

the NIC, they cannot initiate DMA operations, so they are prevented from causing

either of the I/O protection violations in the x86 architecture.


Though the Xen I/O architecture guarantees that untrusted guest domains cannot

induce memory protection violations, any domain that is granted access to an I/O

device by the hypervisor can potentially direct the device to perform DMA operations

that access memory belonging to other guests, or even the hypervisor. The Xen

architecture does not fundamentally solve this security defect but instead limits the

scope of the problem to a single, trusted driver domain [16]. Therefore, as the driver

domain is trusted, it is unlikely to intentionally violate I/O memory protection, but

a buggy driver within the driver domain could do so unintentionally.

This solution is insufficient for the CDNA architecture. In a CDNA system, device

drivers in the guest domains have direct access to the network interface and are able

to pass DMA descriptors with physical addresses to the device. Thus, the untrusted

guests could read or write memory in any other domain through the NIC, unless

additional security features are added. To maintain isolation between guests, the

CDNA architecture validates and protects all DMA descriptors and ensures that a

guest maintains ownership of physical pages that are sources or targets of outstanding

DMA accesses. Although the hypervisor and the network interface share the respon-

sibility for implementing these protection mechanisms, the more complex aspects are

implemented in the hypervisor.

The most important protection provided by CDNA is that it does not allow guest

domains to directly enqueue DMA descriptors into the network interface descriptor

rings. Instead, the device driver in each guest must call into the hypervisor to per-

form the enqueue operation. This allows the hypervisor to validate that the physical

addresses provided by the guest are, in fact, owned by that guest domain. This pre-

vents a guest domain from arbitrarily transmitting from or receiving into another

guest domain’s memory. The hypervisor prevents guest operating systems from independently

enqueueing unauthorized DMA descriptors by establishing the hypervisor’s exclusive


write access to the host memory region containing the CDNA descriptor rings during

driver initialization.

As discussed in Section 4.1.2, conventional I/O devices autonomously fetch and

process DMA descriptors from host memory at runtime. Though hypervisor-managed

validation and enqueuing of DMA descriptors ensures that DMA operations are valid

when they are enqueued, the physical memory could still be reallocated before it is

accessed by the network interface. There are two ways in which such a protection

violation could be exploited by a buggy or malicious device driver. First, the guest

could return the memory to the hypervisor to be reallocated shortly after enqueueing

the DMA descriptor. Second, the guest could attempt to reuse an old DMA descriptor

in the descriptor ring that is no longer valid.

When memory is freed by a guest operating system, it becomes available for re-

allocation to another guest by the hypervisor. Hence, ownership of the underlying

physical memory can change dynamically at runtime. However, it is critical to pre-

vent any possible reallocation of physical memory during a DMA operation. CDNA

achieves this by delaying the reallocation of physical memory that is being used in

a DMA transaction until after that pending DMA has completed. When the hyper-

visor enqueues a DMA descriptor, it first establishes that the requesting guest owns

the physical memory associated with the requested DMA. The hypervisor then incre-

ments the reference count for each physical page associated with the requested DMA.

This per-page reference counting system already exists within the Xen hypervisor; so

long as the reference count is non-zero, a physical page cannot be reallocated. Later,

the hypervisor then observes which DMA operations have completed and decrements

the associated reference counts. For efficiency, the reference counts are only decre-

mented when additional DMA descriptors are enqueued, but there is no reason why

they could not be decremented more aggressively, if necessary.
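The validation and reference-counting steps can be sketched as a hypercall handler of the following form; the function and field names are hypothetical, and the real implementation additionally handles multi-page buffers, error reporting, and batched decrements of completed descriptors' reference counts.

    /*
     * Hypothetical sketch of the hypercall that enqueues a DMA descriptor on
     * a guest's behalf; the function and field names are assumptions.  The
     * real implementation also handles multi-page buffers, error reporting,
     * and batched decrements of completed descriptors' reference counts.
     */
    #include <stdint.h>
    #include <stdbool.h>

    struct dma_desc {
        uint64_t addr;
        uint32_t len;
        uint32_t flags;
        uint32_t seq;                          /* strictly increasing         */
    };

    /* Assumed existing hypervisor facilities. */
    bool page_owned_by(unsigned domid, uint64_t phys);
    void get_page_ref(uint64_t phys);          /* pin: defer reallocation     */

    static int
    cdna_enqueue_desc(unsigned domid, struct dma_desc *ring, uint32_t ring_size,
                      uint32_t *prod, uint32_t *seq,
                      uint64_t phys, uint32_t len, uint32_t flags)
    {
        struct dma_desc *d;

        if (!page_owned_by(domid, phys))       /* reject foreign memory        */
            return (-1);
        get_page_ref(phys);                    /* hold until the DMA completes */

        d = &ring[*prod % ring_size];
        d->addr  = phys;
        d->len   = len;
        d->flags = flags;
        d->seq   = (*seq)++;
        (*prod)++;
        return (0);
    }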


After enqueuing DMA descriptors, the device driver notifies the NIC by writing

a producer index into a mailbox location within that guest’s context on the NIC.

This producer index indicates the location of the last of the newly created DMA

descriptors. The NIC then assumes that all DMA descriptors up to the location

indicated by the producer index are valid. If the device driver in the guest increments

the producer index past the last valid descriptor, the NIC will attempt to use a stale

DMA descriptor that is in the descriptor ring. Since that descriptor was previously

used in a DMA operation, the hypervisor may have decremented the reference count

on the associated physical memory and reallocated the physical memory.

To prevent such stale DMA descriptors from being used, the hypervisor writes a

strictly increasing sequence number into each DMA descriptor. The NIC then checks

the sequence number before using any DMA descriptor. If the descriptor is valid,

the sequence numbers will be continuous modulo the size of the maximum sequence

number. If they are not, the NIC will refuse to use the descriptors and will report a

guest-specific protection fault error to the hypervisor. Because each DMA descriptor

in the ring buffer gets a new, increasing sequence number, a stale descriptor will have

a sequence number exactly equal to the correct value minus the number of descriptor

slots in the buffer. Making the maximum sequence number at least twice as large as

the number of DMA descriptors in a ring buffer prevents aliasing and ensures that

any stale sequence number will be detected.
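On the network interface, the stale-descriptor check then reduces to a modular comparison, as in the following sketch under the same assumptions as above.

    /*
     * Sketch of the network interface's stale-descriptor check, under the
     * same assumptions as above.  The maximum sequence number is at least
     * twice the ring size, so a stale descriptor, whose sequence number lags
     * by exactly the ring size, can never alias a valid one.
     */
    #include <stdint.h>

    #define SEQ_MOD 65536u                     /* maximum sequence number + 1 */

    /* Returns 0 if the descriptor may be used, or -1 to report a
     * guest-specific protection fault to the hypervisor. */
    static int
    check_desc_seq(uint32_t *expected_seq, uint32_t desc_seq)
    {
        if (desc_seq != (*expected_seq % SEQ_MOD))
            return (-1);
        (*expected_seq)++;
        return (0);
    }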

4.2.4 Discussion

The CDNA interrupt delivery mechanism is neither device nor Xen specific. This

mechanism only requires the device to make an interrupt bit vector available to the hypervisor prior to raising a physical interrupt. This is a relatively simple

mechanism from the perspective of the device and is therefore generalizable to a va-


riety of virtualized I/O devices. Furthermore, it does not rely on any Xen-specific

features.

The handling of the DMA descriptors within the hypervisor is linked to a par-

ticular network interface only because the format of the DMA descriptors and their

rings is likely to be different for each device. As the hypervisor must validate that

the host addresses referred to in each descriptor belong to the guest operating system

that provided them, the hypervisor must be aware of the descriptor format. Fortu-

nately, there are only three fields of interest in any DMA descriptor: an address, a

length, and additional flags. This commonality should make it possible to generalize

the mechanisms within the hypervisor by having the NIC notify the hypervisor of

its preferred format. The NIC would only need to specify the size of the descriptor

and the location of the address, length, and flags. The hypervisor would not need

to interpret the flags, so they could just be copied into the appropriate location. A

generic NIC would also need to support the use of sequence numbers within each

DMA descriptor. Again, the NIC could notify the hypervisor of the size and location

of the sequence number field within the descriptors.
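For instance, the NIC's self-description could be as small as the hypothetical structure below; the field names are invented here, since no such generic interface exists in the current implementation.

    #include <stdint.h>

    /* Hypothetical descriptor-format description: enough for generic hypervisor
     * code to locate the address and length fields it must validate, the flags
     * it copies through untouched, and the sequence-number field it fills in. */
    struct dma_desc_format {
        uint16_t desc_size;      /* total size of one DMA descriptor, in bytes  */
        uint16_t addr_offset;    /* offset of the host (physical) address field */
        uint16_t len_offset;     /* offset of the buffer-length field           */
        uint16_t flags_offset;   /* offset of the device-specific flags         */
        uint16_t seq_offset;     /* offset of the sequence-number field         */
        uint16_t seq_size;       /* width of the sequence number, in bytes      */
    };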

CDNA’s DMA memory protection is specific to Xen only insofar as Xen permits

guest operating systems to use physical memory addresses. Consequently, the current

implementation must validate the ownership of those physical addresses for every

requested DMA operation. For VMMs that only permit the guest to use virtual

addresses, the hypervisor could just as easily translate those virtual addresses and

ensure physical contiguity. The current CDNA implementation does not rely on

physical addresses in the guest at all; rather, a small library translates the driver’s

virtual addresses to physical addresses within the guest’s driver before making a

hypercall request to enqueue a DMA descriptor. For VMMs that use virtual addresses,

this library would do nothing.


4.3 CDNA NIC Implementation

To evaluate the CDNA concept in a real system, RiceNIC, a programmable and

reconfigurable FPGA-based Gigabit Ethernet network interface [50], was modified

to provide virtualization support. RiceNIC contains a Virtex-II Pro FPGA with

two embedded 300MHz PowerPC processors, hundreds of megabytes of on-board

SRAM and DRAM memories, a Gigabit Ethernet PHY, and a 64-bit/66 MHz PCI

interface [5]. Custom hardware assist units for accelerated DMA transfers and MAC

packet handling are provided on the FPGA. The RiceNIC architecture is similar to

the architecture of a conventional network interface. With basic firmware and the

appropriate Linux or FreeBSD device driver, it acts as a standard Gigabit Ethernet

network interface that is capable of fully saturating the Ethernet link while only using

one of the two embedded processors.

To support CDNA, both the hardware and firmware of the RiceNIC were modified

to provide multiple protected contexts and to multiplex network traffic. The network

interface was also modified to interact with the hypervisor through a dedicated context

to allow privileged management operations. The modified hardware and firmware

components work together to implement the CDNA interfaces.

To support CDNA, the most significant addition to the network interface is the

specialized use of the 2 MB SRAM on the NIC. This SRAM is accessible via PIO from

the host. For CDNA, 128 KB of the SRAM is divided into 32 partitions of 4 KB each.

Each of these partitions is an interface to a separate hardware context on the NIC.

Only the SRAM can be memory mapped into the host’s address space, so no other

memory locations on the NIC are accessible via PIO. As a context’s memory partition

is the same size as a page on the host system and because the region is page-aligned,

the hypervisor can trivially map each context into a different guest domain’s address

space. The device drivers in the guest domains may use these 4 KB partitions as


general purpose shared memory between the corresponding guest operating system

and the network interface.
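The address arithmetic behind this mapping is trivial, as the sketch below illustrates under the assumption that the partitions begin at the start of the memory-mappable SRAM window.

    #include <stdint.h>

    #define CDNA_CONTEXTS      32
    #define CONTEXT_PART_SIZE  4096u   /* one host page per context */

    /* Byte offset of context ctx's partition within the PIO-accessible SRAM
     * window: because each partition is exactly one page and page-aligned, the
     * hypervisor can map it into a guest with an ordinary page-table entry. */
    static inline uint64_t context_partition_offset(unsigned ctx)
    {
        return (uint64_t)ctx * CONTEXT_PART_SIZE;
    }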

Within each context’s partition, the lowest 24 memory locations are mailboxes

that can be used to communicate from the driver to the NIC. When any mailbox

is written by PIO, a global mailbox event is automatically generated by the FPGA

hardware. The NIC firmware can then process the event and efficiently determine

which mailbox and corresponding context has been written by decoding a two-level

hierarchy of bit vectors. All of the bit vectors are generated automatically by the

hardware and stored in a data scratchpad for high speed access by the processor. The

first bit vector in the hierarchy determines which of the 32 potential contexts have

updated mailbox events to process, and the second vector in the hierarchy determines

which mailbox(es) in a particular context have been updated. Once the specific

mailbox has been identified, that off-chip SRAM location can be read by the firmware

and the mailbox information processed.
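A plausible firmware decoding loop is sketched below; the scratchpad layout and the handler hook are placeholders, since the exact firmware data structures are not specified here.

    #include <stdint.h>

    #define MAX_CONTEXTS 32

    /* Assumed scratchpad layout of the two-level event hierarchy. */
    struct mailbox_events {
        uint32_t context_vec;                 /* bit i: context i has events  */
        uint32_t mailbox_vec[MAX_CONTEXTS];   /* bit j: mailbox j was written */
    };

    static void handle_mailbox(int ctx, int mbox) { (void)ctx; (void)mbox; } /* stub */

    /* Decode every (context, mailbox) pair that was written by PIO, then let a
     * handler read the corresponding off-chip SRAM location and act on it.
     * __builtin_ctz returns the index of the lowest set bit. */
    static void process_mailbox_events(struct mailbox_events *ev)
    {
        while (ev->context_vec) {
            int ctx = __builtin_ctz(ev->context_vec);
            while (ev->mailbox_vec[ctx]) {
                int mbox = __builtin_ctz(ev->mailbox_vec[ctx]);
                handle_mailbox(ctx, mbox);
                ev->mailbox_vec[ctx] &= ev->mailbox_vec[ctx] - 1;  /* clear bit */
            }
            ev->context_vec &= ev->context_vec - 1;                /* clear bit */
        }
    }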

The mailbox event and associated hierarchy of bit vectors are managed by a

small hardware core that snoops data on the SRAM bus and dispatches notification

messages when a mailbox is updated. A small state machine decodes these messages

and incrementally updates the data scratchpad with the modified bit vectors. This

state machine also handles event-clear messages from the processor that can clear

multiple events from a single context at once.

Each context requires 128 KB of storage on the NIC for metadata, such as the rings

of transmit- and receive-DMA descriptors provided by the host operating systems.

Furthermore, each context uses 128 KB of memory on the NIC for buffering transmit

packet data and 128 KB for receive packet data. However, the NIC’s transmit and

receive packet buffers are each managed globally, and hence packet buffering is shared

across all contexts.


                        Domain Execution Profile                                     Interrupts/s
System  NIC      Mb/s   Hyp.   Driver OS  Driver User  Guest OS  Guest User  Idle    Driver Dom.  Guest OS
Xen     Intel    1602   19.8%  35.7%      0.8%         39.7%     1.0%        3.0%    7,438        7,853
Xen     RiceNIC  1674   13.7%  41.5%      0.5%         39.5%     1.0%        3.8%    8,839        5,661
CDNA    RiceNIC  1865   10.8%  0.1%       0.2%         42.7%     1.7%        44.5%   0            13,903

Table 4.2: Transmit performance for a single guest with 2 NICs using Xen and CDNA.

                        Domain Execution Profile                                     Interrupts/s
System  NIC      Mb/s   Hyp.   Driver OS  Driver User  Guest OS  Guest User  Idle    Driver Dom.  Guest OS
Xen     Intel    1112   25.7%  36.8%      0.5%         31.0%     1.0%        5.0%    11,138       5,193
Xen     RiceNIC  1075   30.6%  39.4%      0.6%         28.8%     0.6%        0%      10,946       5,163
CDNA    RiceNIC  1850   9.9%   0.2%       0.2%         52.6%     0.6%        36.5%   0            7,484

Table 4.3: Receive performance for a single guest with 2 NICs using Xen and CDNA.

The modifications to the RiceNIC to support CDNA were minimal. The major

hardware change was the additional mailbox storage and handling logic. This could

easily be added to an existing NIC without interfering with the normal operation

of the network interface—unvirtualized device drivers would use a single context’s

mailboxes to interact with the base firmware. Furthermore, the computation and

storage requirements of CDNA are minimal. Only one of the RiceNIC’s two embedded

processors is needed to saturate the network, and only 12 MB of memory on the NIC

is needed to support 32 contexts. Therefore, with minor modifications, commodity

network interfaces could easily provide sufficient computation and storage resources

to support CDNA.

4.4 Evaluation

4.4.1 Experimental Setup

The performance of Xen and CDNA network virtualization was evaluated on

an AMD Opteron-based system running Xen 3 Unstable (changeset 12053:874cc0ff214d

from 11/1/2006). This system used a Tyan S2882 motherboard with a single

Opteron 250 processor and 4 GB of DDR400


SDRAM. Xen 3 Unstable was used because it provides the latest support for high-

performance networking, including TCP segmentation offloading, and the most recent

version of Xenoprof [39] for profiling the entire system.

In all experiments, the driver domain was configured with 256 MB of memory

and each of 24 guest domains were configured with 128 MB of memory. Each guest

domain ran a stripped-down Linux 2.6.16.29 kernel with minimal services for memory

efficiency and performance. For the base Xen experiments, a single dual-port Intel

Pro/1000 MT NIC was used in the system. In the CDNA experiments, two RiceNICs

configured to support CDNA were used in the system. Linux TCP parameters and

NIC coalescing options were tuned in the driver domain and guest domains for optimal

performance. For all experiments, checksum offloading and scatter/gather I/O were

enabled. TCP segmentation offloading was enabled for experiments using the Intel

NICs, but disabled for those using the RiceNICs due to lack of support. The Xen

system was set up to communicate with a similar Opteron system that was running a

native Linux kernel. This system was tuned so that it could easily saturate two NICs

both transmitting and receiving so that it would never be the bottleneck in any of

the tests.

To validate the performance of the CDNA approach, multiple simultaneous con-

nections across multiple NICs to multiple guest domains were needed. A multi-

threaded, event-driven, lightweight network benchmark program was developed to

distribute traffic across a configurable number of connections. The benchmark pro-

gram balances the bandwidth across all connections to ensure fairness and uses a

single buffer per thread to send and receive data to minimize the memory footprint

and improve cache performance.


4.4.2 Single Guest Performance

Tables 4.2 and 4.3 show the transmit and receive performance of a single guest

operating system over two physical network interfaces using Xen and CDNA. The

first two rows of each table show the performance of the Xen I/O virtualization

architecture using both the Intel and RiceNIC network interfaces. The third row of

each table shows the performance of the CDNA I/O virtualization architecture.

The Intel network interface can only be used with Xen through the use of software

virtualization. However, the RiceNIC can be used with both CDNA and software vir-

tualization. To use the RiceNIC interface with software virtualization, a context was

assigned to the driver domain and no contexts were assigned to the guest operating

system. Therefore, all network traffic from the guest operating system is routed via

the driver domain as it normally would be, through the use of software virtualization.

Within the driver domain, all of the mechanisms within the CDNA NIC are used

identically to the way they would be used by a guest operating system when config-

ured to use concurrent direct network access. As the tables show, the Intel network

interface performs similarly to the RiceNIC network interface. Therefore, the benefits

achieved with CDNA are the result of the CDNA I/O virtualization architecture, not

the result of differences in network interface performance.

Note that in Xen the interrupt rate for the guest is not necessarily the same as it is

for the driver. This is because the back-end driver within the driver domain attempts

to interrupt the guest operating system whenever it generates new work for the front-

end driver. This can happen at a higher or lower rate than the actual interrupt rate

generated by the network interface depending on a variety of factors, including the

number of packets that traverse the Ethernet bridge each time the driver domain is

scheduled by the hypervisor.


                        Domain Execution Profile                                     Interrupts/s
DMA Protection   Mb/s   Hyp.   Driver OS  Driver User  Guest OS  Guest User  Idle    Driver Dom.  Guest OS
Enabled          1865   10.8%  0.1%       0.2%         42.7%     1.7%        44.5%   0            13,903
Disabled         1865   1.9%   0.2%       0.2%         37.0%     1.8%        58.9%   0            14,202

Table 4.4: CDNA 2-NIC transmit performance with and without DMA memory protection.

                        Domain Execution Profile                                     Interrupts/s
DMA Protection   Mb/s   Hyp.   Driver OS  Driver User  Guest OS  Guest User  Idle    Driver Dom.  Guest OS
Enabled          1850   9.9%   0.2%       0.2%         52.6%     0.6%        36.5%   0            7,484
Disabled         1850   2.2%   0.2%       0.3%         49.5%     0.8%        47.0%   0            7,616

Table 4.5: CDNA 2-NIC receive performance with and without DMA memory protection.

Table 4.2 shows that using all of the available processing resources, Xen’s software

virtualization is not able to transmit at line rate over two network interfaces with ei-

ther the Intel hardware or the RiceNIC hardware. However, only 41% of the processor

is used by the guest operating system. The remaining resources are consumed by Xen

overheads—using the Intel hardware, approximately 20% in the hypervisor and 37%

in the driver domain performing software multiplexing and other tasks.

As the table shows, CDNA is able to saturate two network interfaces, whereas

traditional Xen networking cannot. Additionally, CDNA performs far more efficiently,

with 45% processor idle time. The increase in idle time is primarily the result of two

factors. First, nearly all of the time spent in the driver domain is eliminated. The

remaining time spent in the driver domain is unrelated to networking tasks. Second,

the time spent in the hypervisor is decreased. With Xen, the hypervisor spends the

bulk of its time managing the interactions between the front-end and back-end virtual

network interface drivers. CDNA eliminates these communication overheads with the

driver domain, so the hypervisor instead spends the bulk of its time managing DMA

memory protection.

Table 4.3 shows the receive performance of the same configurations. Receiving

network traffic requires more processor resources, so Xen only achieves 1112 Mb/s

with the Intel network interface, and slightly lower with the RiceNIC interface. Again,


Xen overheads consume the bulk of the time, as the guest operating system only

consumes about 32% of the processor resources when using the Intel hardware.

As the table shows, not only is CDNA able to saturate the two network interfaces,

it does so with 37% idle time. Again, nearly all of the time spent in the driver domain

is eliminated. As with the transmit case, the CDNA architecture permits the hyper-

visor to spend its time performing DMA memory protection rather than managing

higher-cost interdomain communications as is required using software virtualization.

In summary, the CDNA I/O virtualization architecture provides significant per-

formance improvements over Xen for both transmit and receive. On the transmit

side, CDNA requires half the processor resources to deliver about 200 Mb/s higher

throughput. On the receive side, CDNA requires 63% of the processor resources to

deliver about 750 Mb/s higher throughput.

4.4.3 Memory Protection

The software-based protection mechanisms in CDNA can potentially be replaced

by a hardware IOMMU. For example, AMD has proposed an IOMMU architecture

for virtualization that restricts the physical memory that can be accessed by each

device [2]. AMD’s proposed architecture provides memory protection as long as each

device is only accessed by a single domain. For CDNA, such an IOMMU would have

to be extended to work on a per-context basis, rather than a per-device basis. This

would also require a mechanism to indicate a context for each DMA transfer. Since

CDNA only distinguishes between guest operating systems and not traffic flows, there

are a limited number of contexts, which may make a generic system-level context-

aware IOMMU practical.

Tables 4.4 and 4.5 show the performance of the CDNA I/O virtualization archi-

tecture both with and without DMA memory protection under transmit and receive

tests, respectively. (The performance of CDNA with DMA memory protection en-


abled was replicated from Tables 4.2 and 4.3 for comparison purposes.) By disabling

DMA memory protection, the performance of the modified CDNA system establishes

an upper bound on achievable performance in a system with an appropriate IOMMU.

However, there would be additional hypervisor overhead to manage the IOMMU that

is not accounted for by this experiment. Since CDNA can already saturate two net-

work interfaces for both transmit and receive traffic, the effect of removing DMA

protection is to increase the idle time by about 10–15%, depending on the workload.

As the table shows, this increase in idle time is the direct result of reducing the num-

ber of hypercalls from the guests and the time spent in the hypervisor performing

protection operations.

Even as systems begin to provide IOMMU support for techniques such as CDNA,

older systems will continue to lack such features. In order to generalize the design

of CDNA for systems with and without an appropriate IOMMU, wrapper functions

could be used around the hypercalls within the guest device drivers. The hypervisor

must notify the guest whether or not there is an IOMMU. When no IOMMU is

present, the wrappers would simply call the hypervisor, as described here. When

an IOMMU is present, the wrapper would instead create DMA descriptors without

hypervisor intervention and only invoke the hypervisor to set up the IOMMU. Such

wrappers already exist in modern operating systems to deal with such IOMMU issues.
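Such a wrapper might look like the following sketch, in which the hypercall names are invented for illustration and do not correspond to real Xen hypercalls.

    #include <stdint.h>
    #include <stdbool.h>

    struct dma_desc { uint64_t addr; uint32_t len; uint32_t flags; };

    static bool have_iommu;  /* reported by the hypervisor at driver registration */

    /* Stubbed stand-ins for the two protection paths and the descriptor ring. */
    static void hypercall_enqueue_validated_desc(uint64_t paddr, uint32_t len, uint32_t flags)
    { (void)paddr; (void)len; (void)flags; }
    static uint64_t hypercall_iommu_map(uint64_t paddr, uint32_t len)
    { (void)len; return paddr; }
    static void write_desc_to_ring(const struct dma_desc *d) { (void)d; }

    /* One wrapper hides the difference: without an IOMMU the VMM both validates
     * the buffer and writes the descriptor; with an IOMMU the VMM only installs
     * the mapping, and the guest driver writes the descriptor itself. */
    static void enqueue_buffer(uint64_t paddr, uint32_t len, uint32_t flags)
    {
        if (!have_iommu) {
            hypercall_enqueue_validated_desc(paddr, len, flags);
        } else {
            struct dma_desc d = {
                .addr  = hypercall_iommu_map(paddr, len),  /* returns I/O address */
                .len   = len,
                .flags = flags,
            };
            write_desc_to_ring(&d);
        }
    }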

4.4.4 Scalability

Figures 4.3 and 4.4 show the aggregate transmit and receive throughput, respec-

tively, of Xen and CDNA with two network interfaces as the number of guest operat-

ing systems varies. The percentage of CPU idle time is also plotted above each data

point. CDNA outperforms Xen for both transmit and receive both for a single guest,

as previously shown in Tables 4.2 and 4.3, and as the number of guest operating

systems is increased.


[Figure: aggregate transmit throughput (Mbps) versus number of Xen guests (1-24) for CDNA/RiceNIC and Xen/Intel, with the CDNA idle-time percentage annotated above each data point.]

Figure 4.3: Transmit throughput for Xen and CDNA (with CDNA idle time).

[Figure: aggregate receive throughput (Mbps) versus number of Xen guests (1-24) for CDNA/RiceNIC and Xen/Intel, with the CDNA idle-time percentage annotated above each data point.]

Figure 4.4: Receive throughput for Xen and CDNA (with CDNA idle time).

As the figures show, the performance of both CDNA and software virtualization

degrades as the number of guests increases. For Xen, this results in declining band-

width, but the marginal reduction in bandwidth decreases with each increase in the

number of guests. For CDNA, while the bandwidth remains constant, the idle time


decreases to zero. Despite the fact that there is no idle time for 8 or more guests,

CDNA is still able to maintain constant bandwidth. This is consistent with the level-

ing of the bandwidth achieved by software virtualization. Therefore, it is likely that

with more CDNA NICs, the throughput curve would have a similar shape to that

of software virtualization, but with a much higher peak throughput when using 1–4

guests.

These results clearly show that not only does CDNA deliver better network per-

formance for a single guest operating system within Xen, but it also maintains signifi-

cantly higher bandwidth as the number of guest operating systems is increased. With

24 guest operating systems, CDNA’s transmit bandwidth is a factor of 2.1 higher than

Xen’s and CDNA’s receive bandwidth is a factor of 3.3 higher than Xen’s.


Chapter 5

Protection Strategies for Direct I/O in Virtual

Machine Monitors

As the CDNA architecture shows, direct I/O access by guest operating systems

can significantly improve performance. Preferably, guest operating systems within

a virtual machine monitor would be able to directly access all I/O devices without

the need for the data to traverse an intermediate software layer within the virtual

machine monitor [45, 60]. However, if a guest can directly access an I/O device, then

it can potentially direct the device to access memory that it does not own via direct

memory access (DMA). Therefore, the virtual machine monitor must still be able to

ensure that guest operating systems do not access each other’s memory indirectly

through the shared I/O devices in the system. Both IOMMUs [10] and software-

based methods (as established in the previous chapter) can provide DMA memory

protection for the virtual machine monitor. They do so by preventing guest operating

systems from directing I/O devices to access memory that is not owned by that guest,

while still allowing the guest to directly access the device.

This study is the first experimental study that performs a head-to-head compar-

ison of DMA memory protection strategies supporting direct access to I/O devices

from untrusted guest operating systems within a virtual machine monitor. Specifi-

cally, three hardware IOMMU-based strategies and one software-based strategy are

explored. The first IOMMU-based strategy uses single-use I/O memory mappings

that are created before each I/O operation and immediately destroyed after each I/O


operation. The second IOMMU-based strategy uses shared I/O memory mappings

that can be reused by multiple, concurrent I/O operations. The third IOMMU-based

strategy uses persistent I/O memory mappings that can be reused until they need

to be reclaimed to create new mappings. Finally, the software-based strategy uses

validated DMA descriptors that can only be used for one I/O operation.

The comparison of these four strategies yields several insights. First, all four

strategies provide equivalent protection between guest operating systems for direct

access to shared I/O devices in a virtual machine monitor. All of the techniques

prevent a guest operating system from directing the device to access memory that

does not belong to that guest. The traditional single-use strategy, however, provides

this protection at the greatest cost. Second, there is significant opportunity to reuse

IOMMU mappings which can reduce the cost of providing protection. Multiple con-

current I/O operations are able to share the same mappings often enough that there

is a noticeable decrease in the overhead of providing protection. That overhead can

further be decreased by allowing mappings to persist so that they can also be reused

by future I/O operations. Finally, the software-based protection strategy performs

comparably to the best of the IOMMU-based strategies.

The next section provides background on how I/O devices access main mem-

ory and the possible memory protection violations that can occur when doing so.

Sections 5.2 and 5.3 discuss the three IOMMU-based protection strategies and the

one software-based protection strategy. Section 5.4 then describes the protection

properties afforded by the four strategies. Section 5.5 describes the experimental

methodology and Section 5.6 evaluates the protection strategies.


5.1 Background

Modern server I/O devices, including disk and network controllers, utilize direct

memory access (DMA) to move data between the host’s main memory and the device’s

on-board buffers. The device uses DMA to access memory independently of the host

CPU, so such accesses must be controlled and protected. To initiate a DMA operation,

the device driver within the operating system creates DMA descriptors that refer to

regions of memory. Each DMA descriptor typically includes an address, a length,

and a few device-specific flags. In commodity x86 systems, devices lack support for

virtual-to-physical address translation, so DMA descriptors always contain physical

addresses for main memory. Once created, the device driver passes the descriptors

to the device, which will later use the descriptors to transfer data to or from the

indicated memory regions autonomously. When the requested I/O operations have

been completed, the device raises an interrupt to notify the device driver.
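A representative descriptor layout is shown below; every device defines its own exact format, so the field widths here are only illustrative.

    #include <stdint.h>

    /* Representative DMA descriptor; real devices define their own layouts. */
    struct dma_descriptor {
        uint64_t buf_addr;   /* physical address of the host memory buffer */
        uint32_t buf_len;    /* length of the region, in bytes              */
        uint32_t flags;      /* device-specific flags (e.g., end of packet) */
    };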

For example, to transmit a network packet, the network interface’s device driver

might create two DMA descriptors. The first descriptor might point to the packet

headers and the second descriptor might point to the packet payload. Once created,

the device driver would then notify the network interface that there are new DMA

descriptors available. The precise mechanism of that notification depends on the

particular network interface, but typically involves a programmed I/O operation to

the device telling it the location of the new descriptors. The network interface would

then retrieve the descriptors from main memory using DMA—if they were not written

to the device directly by programmed I/O. The network interface would then retrieve

the two memory regions that compose the network packet and transmit them over

the network. Finally, the network interface would interrupt the host to indicate that

the packet has been transmitted. In practice, notifications from the device driver and


interrupts from the network interface would likely be aggregated to cover multiple

packets for efficiency.

Three potential memory access violations can occur on every I/O transfer initiated

using this DMA architecture:

1. The device driver could create a DMA descriptor with an incorrect address.

2. The memory referenced by the DMA descriptor could be repurposed after the

descriptor was created by the device driver, but before it is used by the device.

3. The device itself could initiate a DMA transfer to a memory address not refer-

enced by the DMA descriptor.

These violations could occur either because of failures or because of malicious intent.

However, as devices are typically not user-programmable, the last type of violation is

only likely to occur as a result of a hardware or software failure on the device.

In a non-virtualized environment, the operating system is responsible for pre-

venting the first two types of memory access violations. This requires the operating

system to trust the device driver to create the correct DMA descriptors and to pin

physical memory used by I/O devices. A failure of the operating system to prevent

these memory access violations could potentially result in system failure. In a vir-

tualized environment, however, the virtual machine monitor cannot trust the guest

operating systems to prevent these memory access violations, as a memory access

violation incurred by one guest operating system can potentially harm other guest

operating systems or even bring down the whole system. Therefore, a virtual machine

monitor requires mechanisms to prevent one guest operating system from intention-

ally or accidentally directing and I/O device to access the memory of another guest

operating system. The only way that would be possible is via one of the first two

types of memory access violations. Depending on the reliability of the I/O devices, it


may also be desirable to try to prevent the third type of memory access violation, as

well (although it is frequently not possible to protect against a misbehaving device,

as will be discussed in Section 5.4). The following sections describe mechanisms and

strategies for preventing these memory access violations.

5.2 IOMMU-based Protection

A virtual machine monitor can utilize an I/O memory management unit (IOMMU)

to help provide DMA memory protection when allowing direct access to I/O devices.

Whereas a virtual memory management unit enforces access control and provides

address translation services for software as it accesses memory, an IOMMU enforces

access control and provides address translation services for I/O devices as they access

memory. The IOMMU uses page table entries (PTEs) that each specify translation

from an I/O address to a physical memory address and specify access control (such

as which devices are permitted to use the given PTE).

An IOMMU only permits I/O devices to access memory for which a valid mapping

exists in the IOMMU page table. Thus, in an IOMMU-based system, there must be

a valid IOMMU translation for each host memory buffer to be used in an upcoming

DMA descriptor. Otherwise, the DMA descriptor will refer to a region unmapped by

the IOMMU, and the I/O transaction will fail.

The following subsections present three strategies for using an IOMMU to provide

DMA memory protection in a virtual machine monitor. The strategies primarily differ

in the extent to which IOMMU mappings are allowed to be reused.

5.2.1 Single-use Mappings

A common strategy for managing an IOMMU is to create a single-use mapping

for each I/O transaction. The Linux DMA-Mapping interface, for example, implements


a single-use mapping strategy. Ben-Yehuda et al. also explored a single-use map-

ping strategy in the context of virtual machine monitors [10]. In such a single-use

strategy, the driver must ensure that a new IOMMU mapping is created for each

DMA descriptor. The IOMMU mapping is then destroyed once the corresponding

I/O transaction has completed. In a virtualized system, the trusted virtual machine

monitor is responsible for creating and destroying IOMMU mappings at the driver’s

request. If the VMM does not create the mapping, either because the driver did not

request it or because the request referred to memory not owned by the guest, then

the device will be unable to perform the corresponding DMA operation.

To carry out an I/O transaction using a single-use mapping strategy, the virtual

machine monitor (VMM), untrusted guest operating system (GOS), and the device

(DEV) carry out the following steps:

1. GOS: The guest OS requests an IOMMU mapping for the memory buffer in-

volved in the I/O transaction.

2. VMM: The VMM validates that the requesting guest OS has appropriate read

or write permission for each memory page in the buffer to be mapped.

3. VMM: The VMM marks the memory buffer as “in I/O use”, which prevents the

buffer from being reallocated to another guest OS during an I/O transaction.

4. VMM: The VMM creates one or more IOMMU mappings for the buffer. As

with virtual memory management units, one mapping is usually required for

each memory page in the buffer.

5. GOS: The guest OS creates a DMA descriptor with the IOMMU-mapped ad-

dress that was returned by the VMM.

6. DEV: The device carries out its I/O transaction as directed by the DMA de-

scriptor and it notifies the driver upon completion.


7. GOS: The driver requests destruction of the corresponding IOMMU mapping(s).

8. VMM: The VMM validates that the mappings belong to the guest OS making

the request.

9. VMM: The VMM destroys the IOMMU mappings.

10. VMM: The VMM clears the “in I/O use” marker associated with each memory

page referred to by the recently-destroyed mapping(s).
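The sketch below condenses these ten steps into the guest driver's view of a single transaction; the hook names are placeholders for the real hypercalls and driver entry points.

    #include <stdint.h>
    #include <stddef.h>

    typedef uint64_t io_addr_t;   /* address valid only through the IOMMU */

    /* Placeholder hooks: the map call stands for steps 1-4, the device call for
     * steps 5-6, and the unmap call for steps 7-10. */
    static io_addr_t vmm_iommu_map(void *buf, size_t len)      { (void)len; return (io_addr_t)(uintptr_t)buf; }
    static void vmm_iommu_unmap(io_addr_t a, size_t len)       { (void)a; (void)len; }
    static void device_do_transaction(io_addr_t a, size_t len) { (void)a; (void)len; }

    /* One single-use mapping lives exactly as long as one I/O transaction, so
     * the guest pays a mapping hypercall and an unmapping hypercall every time. */
    static void do_io_single_use(void *buf, size_t len)
    {
        io_addr_t io_addr = vmm_iommu_map(buf, len);  /* validate, pin, create PTE  */
        device_do_transaction(io_addr, len);          /* device DMAs, then notifies */
        vmm_iommu_unmap(io_addr, len);                /* destroy PTE, clear in-use  */
    }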

5.2.2 Shared Mappings

Rather than creating a new IOMMU mapping for each new DMA descriptor,

it is possible to share a mapping among DMA descriptors so long as the mapping

points to the same underlying memory page and remains valid. Sharing IOMMU

mappings is advantageous because it avoids the overhead of creating and destroying

a new mapping for each I/O request by instead reusing an existing mapping. To

implement sharing, the guest operating system must keep track of which IOMMU

mappings are currently valid, and it must keep track of how many pending I/O

requests are currently using the mapping. To protect a guest’s memory from errant

device accesses, an IOMMU mapping should be destroyed once all outstanding I/O

requests that use the mapping have been completed. Though the untrusted guest

operating system has responsibilities for carrying out a shared-mapping strategy,

it need not function correctly to ensure isolation among operating systems, as is

discussed further in Section 5.4.

To carry out a shared-mapping strategy, the guest OS and the VMM perform many

of the same steps as required by the single-use strategy. The shared-mapping strategy

differs at the initiation and termination of an I/O transaction. Before step 1 would

occur in a single-use strategy, the guest operating system first queries a table of

known, valid IOMMU mappings to see if a mapping for the I/O memory buffer already


exists. If so, the driver uses the previously established IOMMU-mapped address for

a DMA descriptor, and then passes the descriptor to the device, in effect skipping

steps 1–4. If not, the guest and VMM follow steps 1–4 to create a new mapping.

Whether a new mapping is created or not, before step 5, the guest operating system

increments its own reference count for the mapping (or sets it to one for a new

mapping). This reference count is separate from the reference count maintained by

the VMM.

Steps 5 and 6 then proceed as in the single-use strategy. After these steps have

completed, the driver calls the guest operating system to decrement its reference

count. If the reference count is zero, no other I/O transactions are in progress that

are using this mapping, and it is appropriate to call the VMM to destroy the mapping

as in steps 7–10 of the single-use strategy. Otherwise, the IOMMU mapping is still

being used by another I/O transaction within the guest OS, so steps 7–10 are skipped.
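A guest-side sketch of this bookkeeping follows; the table layout, linear lookup, and stubbed hypercalls are illustrative assumptions rather than the evaluated implementation.

    #include <stdint.h>
    #include <stdbool.h>

    #define TABLE_SLOTS 1024
    #define PAGE_SHIFT  12

    struct io_mapping { uint64_t pfn; uint64_t io_addr; unsigned refs; bool valid; };
    static struct io_mapping table[TABLE_SLOTS];

    /* Stubbed hypercalls standing in for steps 1-4 and 7-10. */
    static uint64_t vmm_iommu_map_page(uint64_t pfn)  { return pfn << PAGE_SHIFT; }
    static void     vmm_iommu_unmap_page(uint64_t io) { (void)io; }

    static struct io_mapping *lookup(uint64_t pfn, bool want_free)
    {
        for (int i = 0; i < TABLE_SLOTS; i++)
            if (want_free ? !table[i].valid
                          : (table[i].valid && table[i].pfn == pfn))
                return &table[i];
        return NULL;
    }

    /* Acquire an I/O address for a page, reusing a still-valid mapping when one
     * exists; otherwise create a new mapping via the VMM (steps 1-4). */
    static uint64_t shared_map_get(uint64_t pfn)
    {
        struct io_mapping *m = lookup(pfn, false);
        if (!m) {
            m = lookup(pfn, true);
            if (!m)
                return 0;          /* table full: fallback path omitted */
            *m = (struct io_mapping){ .pfn = pfn, .io_addr = vmm_iommu_map_page(pfn),
                                      .refs = 0, .valid = true };
        }
        m->refs++;                 /* guest-side count of in-flight I/O users */
        return m->io_addr;
    }

    /* Release after completion; the shared strategy destroys the mapping once no
     * in-flight transaction uses it (the persistent strategy would keep it). */
    static void shared_map_put(uint64_t pfn)
    {
        struct io_mapping *m = lookup(pfn, false);
        if (m && --m->refs == 0) {
            vmm_iommu_unmap_page(m->io_addr);
            m->valid = false;
        }
    }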

5.2.3 Persistent Mappings

IOMMU mappings can further be reused by allowing them to persist even after all

I/O transactions using the mapping have completed. Compared to a shared mapping

strategy, such a persistent mapping strategy attempts to further reduce the overhead

associated with creating and destroying IOMMU mappings inside the VMM. Whereas

sharing exploits reuse among mappings only when a mapping is being actively used

by at least one I/O transaction, persistence exploits temporal reuse across periods of

inactivity.

The infrastructure and mechanisms for implementing a persistent mapping strat-

egy are similar to those required by a shared mapping strategy. The primary difference

is that the guest operating system does not request that mappings be destroyed after

the I/O transactions using them complete. In effect, this means that mappings persist

until they must be recycled. Therefore, in contrast to the shared mapping strategy,


when the guest’s reference count is decremented after step 6, the I/O transaction is

complete and steps 7–10 are always skipped. This should dramatically reduce the

number of hypercalls into the VMM.

As mappings are now persistent, they must be recycled whenever a new mapping

is needed. This changes the behavior of step 1 when compared to the shared mapping

case. Before performing step 1, as in the shared mapping case, the guest operating

system first queries a table of known, valid IOMMU mappings to see if a mapping

for the I/O memory buffer already exists. If one does not, a new mapping is needed

and the guest operating system must select an idle mapping to be recycled. In step 1,

the guest then passes this idle mapping to the virtual machine monitor along with

the request to create a new mapping. Steps 8, 10, and 2–4 are then performed by

the VMM to modify the mapping(s) for use by the new I/O transaction. Note that

step 9 can be skipped, as one valid mapping is going to be immediately replaced by

another valid mapping.

5.3 Software-based Protection

IOMMU-based protection strategies enforce safety even when untrusted software

provides unverified DMA descriptors directly to hardware, because the DMA oper-

ations generated by any device are always subject to later validation. However, an

IOMMU is not necessary to ensure full isolation among untrusted guest operating

systems, even when they use DMA-capable hardware that directly reads and writes

host memory. Rather than relying on hardware to perform late validation during

I/O transactions, a lightweight software-based system performs early validation of

DMA descriptors before they are used by hardware. The software-based strategy

also must protect validated descriptors from subsequent unauthorized modification

by untrusted software, thus ensuring that all I/O transactions operate only on buffers


that have been approved by the VMM. The CDNA architecture relies on a software-

based protection mechanism, as introduced in Chapter 4. This study compares that

approach to IOMMU-based approaches.

The runtime operation of a software-based protection strategy works much like

a single-use IOMMU-based strategy, since both validate permissions for each I/O

transaction. Whereas the single-use IOMMU-based strategy uses the VMM to create

IOMMU mappings for each transaction, software-based I/O protection creates the

actual DMA descriptor. The descriptor is valid only for the single I/O transaction.

Unlike an IOMMU-based system, an untrusted guest OS’s driver must first register

itself with the VMM during initialization. At that time, the VMM takes ownership

of the driver’s DMA descriptor region and the driver’s status region, revoking write

permissions from the guest. This prevents the guest from independently creating

or modifying DMA descriptors, or modifying the status region. Finally, the VMM

must prevent the guest from redirecting the device to different descriptor and status regions. This can be

trivially accomplished by only mapping the device’s configuration registers into the

VMM’s address space, and not into the guests’ address spaces.

After initialization, the runtime operation of the software-based strategy is similar

to the single-use IOMMU-based strategy outlined in Section 5.2.1. Steps 1–3 of a

software-based strategy are identical. In step 4, the VMM creates a DMA descriptor

in the write-protected DMA descriptor region, obviating the OS’s role in step 5. The

device carries out the requested operation using the validated descriptor, as in step 6,

and because the descriptor is write-protected, the untrusted guest cannot induce an

unauthorized transaction. When the device signals completion of the transaction,

the VMM inspects the device’s state (which is usually written via DMA back to the

host) to see which DMA descriptors have been used. The VMM then processes those

completed descriptors, as in step 10, permitting the associated guest memory buffers

to be reallocated.


5.4 Protection Properties

The protection strategies presented in Sections 5.2 and 5.3 can be used to prevent

the memory access violations presented in Section 5.1. Those memory access viola-

tions, however, can occur both across multiple guests (inter-guest) and within a single

guest (intra-guest). A virtual machine monitor must provide inter-guest protection

in order to operate reliably. A guest operating system may additionally benefit if

the virtual machine monitor can also help provide intra-guest protection. This sec-

tion describes the protection properties of the four previously presented protection

strategies.

5.4.1 Inter-Guest Protection

Perhaps surprisingly, all four strategies provide equivalent protection against the

first two types of memory access violations presented in Section 5.1: creation of

an incorrect DMA descriptor and repurposing the memory referenced by a DMA

descriptor. In all of the IOMMU-based strategies, if the device driver creates a

DMA descriptor that refers to memory that is not owned by that guest operating

system, the device will be unable to perform that DMA, as no IOMMU mapping will

exist. The only requirement to maintain this protection is that the VMM must never

create an IOMMU mapping for a guest that does not refer to that guest’s memory.

Similarly, only the VMM can repurpose memory to another guest, so as long as it does

not do so while there is an existing IOMMU mapping to that memory, the second

memory protection violation can never occur. The software-based approach provides

exactly the same guarantees by only allowing the VMM to create DMA descriptors.

Therefore, these strategies allow the VMM to provide protection.

The third type of memory access violation, the device initiating a rogue DMA

operation, is more difficult to prevent. If the device is shared among multiple guest


operating systems, then no strategy can prevent this type of protection violation. For

example, if a network interface is allowed to receive packets for two guest operating

systems, there is no way for the VMM to prevent the device from sending the traffic

destined for one guest to the other. This is one simple example of many protection

violations that a shared device can commit.

However, if a device is privately assigned to a single guest operating system, the

IOMMU-based strategies can be used to provide protection against faulty device

behavior. In this case, the VMM simply has to ensure that there are only IOMMU

mappings to the guest that privately owns the device. In that manner, there is no

way the device can even access memory that does not belong to that guest. However,

the software-based strategy cannot even provide this level of protection. As DMA

descriptors are pre-validated, there is no way to stop the device from simply ignoring

the DMA descriptor and accessing any physical memory.

5.4.2 Intra-Guest Protection

None of the four protection strategies can protect the guest OS from the first

two types of access violations caused by its device drivers. In essence, the protection

afforded to the guest OS by any of the strategies is only as good as the implementation

of the strategy in a device driver. Consider the IOMMU-based strategies. For an

actual access violation to be prevented, the device driver would have to map the

correct buffer through the IOMMU but construct an incorrect DMA descriptor for

it. Such an error, however, seems unlikely. In the case of the software-based strategy,

such a scenario is impossible because the memory protection on the buffer and the

creation of the DMA descriptor are combined into one operation by the VMM.

In contrast, the IOMMU-based strategies offer some protection against the third

type of memory access violation, the device initiating a rogue DMA operation. Of

these strategies, the single-use and shared strategies will offer the greatest protection


against this type of memory access violation because the only pages that could be

corrupted are those that are the target of a pending I/O operation. However, the

persistent strategy offers very little protection, as there will be a significant number

of active mappings at any given time that the device could erroneously use.

5.5 Experimental Setup

The protection strategies described here were evaluated on a system with an AMD

Opteron 250 processor that includes an integrated graphics address relocation table

(GART) alongside the memory controller. The GART can be used to translate mem-

ory addresses using physical-to-physical address mappings. Therefore, with the ap-

propriate software infrastructure, a GART can model the functionality of an IOMMU.

GART mappings are established at the memory-page granularity (in this case, 4

KB). Each page requires a separate GART mapping. Software programs the GART

hardware to create a mapping at a specific location within the GART’s contigu-

ous physical address range that points to a memory-backed memory location. The

GART’s physical address range is often referred to as the GART “aperture”.

GART mappings are organized in an array in memory. An index into the mapping

array corresponds to a page index into the aperture. When an I/O device accesses

a location in the GART’s aperture, the GART transparently redirects the memory

access to a target memory location as specified by the corresponding GART mapping’s

address. For unused or unmapped locations within the aperture, software creates a

dummy mapping pointing to a single, shared garbage memory page.
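In software, this indexing can be modeled as in the sketch below; real GART entries also carry valid and coherency bits, and the hardware's translations must be flushed after updates, so this is only a schematic.

    #include <stdint.h>

    #define PAGE_SHIFT   12
    #define GART_PAGES   ((512u * 1024u * 1024u) >> PAGE_SHIFT)  /* 512 MB aperture */

    /* Schematic GART state: entry i holds the physical page that aperture page i
     * redirects to; unused entries point at one shared garbage page. */
    static uint64_t gart_table[GART_PAGES];
    static uint64_t garbage_page_phys;     /* dummy target for unmapped slots */
    static uint64_t aperture_base_phys;    /* physical base of the aperture   */

    /* Install a mapping and return the aperture address to place in a DMA
     * descriptor; every device access through that address is redirected (and
     * therefore confined) by the table. */
    static uint64_t gart_map(uint32_t index, uint64_t target_phys)
    {
        gart_table[index] = target_phys;
        return aperture_base_phys + ((uint64_t)index << PAGE_SHIFT);
    }

    /* Tear down a mapping without ever leaving a dangling translation. */
    static void gart_unmap(uint32_t index)
    {
        gart_table[index] = garbage_page_phys;
    }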

So long as an I/O device can only access memory within the GART aperture, all

of that device’s accesses will be subject to remapping and access controls as specified

by the virtual machine monitor. Thus, the GART’s mapping table limits I/O device


accesses to those regions approved by the VMM, just as an IOMMU limits I/O device

accesses.

Unlike an IOMMU-based system, however, a device could still generate an access

outside the GART region, thus bypassing access controls. As a practical measure, I

modify the prototype network interface to only accept DMA requests that lie within

the VMM-specified GART aperture. Even though this system architecture could

allow a faulty device to access memory outside the GART aperture, the architecture

faithfully models the overheads of a system for which all of the network interface’s

DMA requests are subject to the IOMMU strategy implemented by the VMM. Hence,

this architecture is an effective means for examining the efficiency and performance

of the various IOMMU management strategies under consideration.

Ben-Yehuda et al. identified that platform-specific IOMMU implementation de-

tails can significantly affect performance and influence the efficiency of a system’s

protection strategy [10]. Specifically, that work noted that the inability to individu-

ally replace IOMMU mappings without globally flushing the CPU cache can severely

degrade performance. The GART-based IOMMU implementation used in this work

does not incur the cache-flush penalties associated with the IBM platform, and thus

the GART-based implementation should represent a low-overhead upper-bound with

respect to architectural efficiency and performance.

I implement the IOMMU- and software-based protection strategies in the open

source Xen 3 virtual machine monitor [7]. I evaluate these strategies on a variety of

network-intensive workloads, including a TCP stream microbenchmark, a voice-over-

IP (VoIP) server benchmark, and a static-content web server benchmark. The stream

microbenchmark either transmits or receives bulk data over a TCP connection to a

remote host. The VoIP benchmark uses the OpenSER server. In this benchmark,

OpenSER acts as a SIP registrar and 50 clients simultaneously initiate calls as quickly

as possible. The web server benchmark uses the lighttpd web server to host static


HTTP content. In this benchmark, 32 clients simultaneously replay requests from

various web traces as quickly as possible. Three web traces are used in this study:

“CS”, “IBM”, and “WC”. The CS trace is from a computer science departmental web

server and has a working set of 1.2 GB of data. The IBM trace is from an IBM web

server and has a working set of 1.1 GB of data. The WC trace is from the 1998 World

Cup soccer web server and has a working set of 100 MB of data. For all benchmarks,

the client machine is never saturated, so the server machine is always the bottleneck.

The server under test uses a 2.4 GHz Opteron processor, has two Gigabit Ether-

net network interface cards, and features DDR 400 DRAM. The network interfaces

are publicly available prototypes that support shared, direct access [51]. A single

unprivileged guest operating system has 1.4 GB of memory. The IOMMU-based

strategies employ 512 MB of physical GART address space for remapping. In each

benchmark, direct access for the guest is granted only for the network interface cards.

Because the guest’s memory allocation is large enough to hold each benchmark and its

corresponding data set, other I/O is insignificant. For the web-based workloads, the

guest’s buffer cache is warmed prior to performance testing. For all of the benchmarks,

each configuration was performance tested at least five times with each benchmark.

Because there was effectively no variance across runs for a given configuration and

benchmark, the statistics reported are averages of those runs.

5.6 Evaluation

Network server applications can stress network I/O in different ways, depending

on the characteristics of the application and its workload. Applications may gen-

erate large or small network packets, and may or may not utilize zero-copy I/O.

For an application running on a virtualized guest operating system, these network

characteristics interact with the I/O protection strategy implemented by the VMM.


Protection Strategy   CPU % Total   CPU % Prot.   Reuse TX (%)   Reuse RX (%)   HC/DMA

Stream Transmit
None                  41            0             N/A            N/A            0
Single-use            64            23            N/A            N/A            .88
Shared                59            18            39             0              .55
Persistent            51            10            100            100            0
Software              56            15            N/A            N/A            .90

Stream Receive
None                  53            0             N/A            N/A            0
Single-use            79            26            N/A            N/A            .37
Shared                73            20            39             0              .10
Persistent            66            13            100            100            0
Software              64            11            N/A            N/A            .39

Table 5.1: TCP Stream profile.

Consequently, the efficiency of the I/O protection strategy can affect application per-

formance in different ways.

For all applications, I evaluate the four protection strategies presented earlier,

and I compare each to the performance of a system lacking any I/O protection at

all (“None”). “Single-use”, “Shared”, and “Persistent” all use an IOMMU to enforce

protection, using either single-use, shared-mapping, or persistent-mapping strategies,

respectively, as described in Section 5.2. “Software” uses software-based I/O protec-

tion, as described in Section 5.3.

5.6.1 TCP Stream

A TCP stream microbenchmark either transmits or receives bulk TCP data and

thus isolates network I/O performance. This benchmark does not use zero-copy I/O.

Table 5.1 shows the CPU efficiency and overhead associated with each protection

mechanism when streaming data over two network interfaces. The table shows the

total percentage of CPU consumed while executing the benchmark and the percentage

of CPU spent implementing the given protection strategy. The table also shows


the percentage of times a buffer to be used in an I/O transaction (either transmit

or receive) already has a valid IOMMU mapping that can be reused. Finally, the

table shows the number of VMM invocations, or hypercalls (HC), required per DMA

descriptor used by the network interface driver.

When either transmitting or receiving, all of the strategies achieve the same TCP

throughput (1865 Mb/s transmitting, 1850 Mb/s receiving), but they differ accord-

ing to how costly they are in terms of CPU consumption. The single-use protec-

tion strategy is the most costly, with its repeated construction and destruction of

IOMMU mappings consuming 23% of total CPU resources for transmit and 26% for

receive. The shared strategy reclaims some of this overhead through its sharing of

in-use mappings, though this reuse only exists for transmitted packets (data in the

transmit-stream case, TCP ACK packets in the receive case). The lack of reuse for

received packets is caused by the XenoLinux buffer allocator, which dedicates an en-

tire 4 KB page for each receive buffer, regardless of the buffer’s actual size. This

over-allocation is an artifact of the XenoLinux I/O architecture, which was designed

to remap received packets to transfer them between guest operating systems. Regard-

less, the persistent strategy achieves 100% reuse of mappings, as the small number of

persistent mappings that cover network buffers essentially become permanent. This

further reduces overhead relative to single-use and shared. Notably, the number of

hypercalls per DMA operation rounds to zero. However, management of the persis-

tent mappings—mapping lookup and recycling, as described in Section 5.2.3—still

consumes over 10% of the processor's resources.

Surprisingly, the overhead incurred by the software-based technique is comparable

to the IOMMU-based persistent mapping strategy. The software-based technique

certainly requires far more hypercalls per DMA than the IOMMU-based strategies.

However, the cost of those VMM invocations and the associated page-verification

operations is similar to the cost of managing persistent mappings for an IOMMU.


Protection Strategy   Calls/Sec.   CPU % Prot.   Reuse TX (%)   Reuse RX (%)   HC/DMA
None                  3005         0             N/A            N/A            0
Single-use            2790         6.1           N/A            N/A            .68
Shared                2835         6.0           4              0              .65
Persistent            2901         2.1           100            100            0
Software              2895         3.5           N/A            N/A            .67

Table 5.2: OpenSER profile.

5.6.2 VoIP Server

Table 5.2 shows the performance and overhead profile for the OpenSER VoIP ap-

plication benchmark for the various protection strategies. The OpenSER benchmark

is largely CPU-intensive and therefore only uses one of the two network interface cards.

Though the strategies rank similarly in efficiency for the OpenSER benchmark as in

the TCP Stream benchmark, Table 5.2 shows one significant difference with respect to

reuse of IOMMU mappings. Whereas the shared strategy was able to reuse mappings

39% of the time for transmit packets under the TCP Stream benchmark, OpenSER

sees only 4% reuse. Unlike typical high-bandwidth streaming applications, OpenSER

only sends and receives very small TCP messages in order to initiate and terminate

VoIP phone calls. Consequently, the shared strategy provides only a minimal effi-

ciency and performance improvement over the high-overhead single-use strategy for

the OpenSER benchmark, indicating that sharing alone does not provide an efficiency

gain for applications that are heavily reliant on small messages.

5.6.3 Web Server

Table 5.3 shows the performance, overhead, and sharing profiles of the various pro-

tection strategies when running a web server under each of three different trace work-

loads, “CS”, “IBM”, and “WC”. As in the TCP Stream and OpenSER benchmarks,

the different strategies rank identically among each other in terms of performance


and overhead.

Protection    HTTP     CPU %        Reuse (%)       HC/
Strategy      Mbps     Prot.      TX       RX       DMA
---------------------------------------------------------
CS Trace
None          1336     0          N/A      N/A      0
Single-use    1142     18.2       N/A      N/A      .66
Shared        1162     16.3       40       0        .42
Persistent    1252     5.3        100      100      0
Software      1212     9.1        N/A      N/A      .67

IBM Trace
None          359      0          N/A      N/A      0
Single-use    322      8.5        N/A      N/A      .70
Shared        322      8.3        22       0        .58
Persistent    338      2.4        100      100      0
Software      326      4.5        N/A      N/A      .71

WC Trace
None          714      0          N/A      N/A      0
Single-use    617      11.8       N/A      N/A      .68
Shared        619      11.1       30       0        .50
Persistent    655      3.0        100      100      0
Software      632      5.9        N/A      N/A      .69

Table 5.3: Web Server profile using write().

Each of the different traces generates messages of different sizes and

requires different amounts of web-server compute overhead. For the write()-based

implementation of the web server, however, the server is always completely saturated

for each workload. “CS” is primarily network-limited, generating relatively large re-

sponse messages with an average HTTP message size of 34 KB. “IBM” is largely

compute-limited, generating relatively small HTTP responses with an average size

of 2.8 KB. “WC” lies in between, with an average response size of 6.7 KB. As the

table shows, the amount of reuse exploited by the shared strategy is dependent on

the average HTTP response being generated. Larger average messages lead to larger

amounts of reuse for transmitted buffers under the shared strategy. Though larger

amounts of reuse slightly reduce the CPU overhead for the shared strategy relative


to the single-use strategy, the reuse is not significant enough under these workloads

to yield significant performance benefits.

Protection    HTTP              CPU %          Reuse (%)            HC/
Strategy      Mbps              Prot.     TX Hdr.  TX File   RX     DMA
------------------------------------------------------------------------
CS Trace
None          1378 (35% idle)   0         N/A      N/A       N/A    0
Single-use    1291 (7% idle)    27.6      N/A      N/A       N/A    .37
Shared        1330 (17% idle)   17.7      82       72        0      .17
Persistent    1342 (23% idle)   11.5      100      96        100    .02
Software      1351 (21% idle)   13.7      N/A      N/A       N/A    .37

IBM Trace
None          475               0         N/A      N/A       N/A    0
Single-use    403               14.0      N/A      N/A       N/A    .43
Shared        413               12.3      34       50        0      .35
Persistent    438               4.3       100      99        100    0
Software      422               6.2       N/A      N/A       N/A    .43

WC Trace
None          961               0         N/A      N/A       N/A    0
Single-use    760               19.9      N/A      N/A       N/A    .39
Shared        796               16.0      53       62        0      .27
Persistent    872               5.1       100      100       100    0
Software      833               8.7       N/A      N/A       N/A    .40

Table 5.4: Web Server profile using zero-copy sendfile().

As in the other benchmarks, receive buffers are not subject to reuse with the

shared-mapping strategy. Regardless of the workload, the persistent strategy is 100%

effective at reusing existing mappings as the mappings again become effectively per-

manent. As in the other benchmarks, the software-based strategy achieves application

performance between the shared and persistent IOMMU-based strategies.

For all of the previous workloads, the network application utilized the write()

system call to send any data. Consequently, all buffers that are transmitted to the

network interface have been allocated by the guest operating system’s network-buffer

allocator. Using the zero-copy sendfile() interface, however, the guest OS generates

network buffers for the packet headers, but then appends the application’s file buffers


rather than copying the payload. This interface has the potential to change the

amount of reuse exploitable by a protection strategy. Using sendfile(), the packet-

payload footprint for IOMMU mappings is no longer limited to the number of internal

network buffers allocated by the OS, but instead is limited only by the size of physical

memory allocated to the guest.

Table 5.4 shows the performance, efficiency, and sharing profiles for the different

protection strategies for web-based workloads when the server uses sendfile() to

transmit HTTP responses. Note that for the “CS” trace, the host CPU is not com-

pletely saturated, and so the CPU’s idle time percentage is annotated next to HTTP

performance in the table. For the other traces, the CPU is completely saturated.

The table separates reuse statistics for transmitted buffers according to whether the

buffer was a packet header or packet payload. As compared to Table 5.3,

Table 5.4 shows that the shared strategy is more effective overall at exploiting reuse

using sendfile() than with write(). Consequently, the shared strategy gives a

larger performance and efficiency benefit relative to the single-use strategy when us-

ing sendfile(). Table 5.4 also shows that the persistent strategy is highly effective

at capturing file reuse, even though the total working-set sizes of the “CS” and “IBM”

traces are each more than twice as large as the 512 MB mapping space afforded by

the GART. Finally, the table shows that the software-based strategy performs better

than either the shared or single-use IOMMU strategies for all workloads, and can

perform even better than the persistent strategy on the CS trace, though it consumes

more CPU resources.

5.6.4 Discussion

The architecture of the GART imposes some limitations on this study. In partic-

ular, it is infeasible to evaluate a direct map strategy using the IOMMU. Under this

strategy, the VMM creates a persistent identity mapping for each VM that permits


access to its entire memory. This mapping is created by the VMM when the VM

is started and updated only if the memory allocated to the VM changes. Moreover,

because the direct map strategy uses an identity mapping, there is no need for the

device driver to translate the address that is provided by the guest OS into an ad-

dress that is suitable for DMA. Unfortunately, the GART cannot implement such an

identity mapping because the address of the aperture cannot overlap with that of

physical memory.

Like the other protection strategies, the direct map strategy has pros and cons. It

provides the same protection between guest operating systems as the other IOMMU-

based strategies, but it provides the least safety within a guest operating system.

For example, under the persistent mapping strategy, a page will only be mapped by

the IOMMU if it is the target of an I/O operation. Moreover, an unused mapping

may ultimately be destroyed. In contrast, under the direct map strategy, all pages

are mapped at all times. The direct map strategy’s unique advantage is that it can

be implemented entirely within the VMM without support from the guest OS. Its

implementation is, in effect, transparent to the guest OS.

Although it is not possible to determine the performance of the direct map strat-

egy experimentally using the GART-based setup, it is reasonable to argue that its

performance must be bounded by the performance of the “Persistent” and “None”

strategies. Although, in many cases, the “Persistent” strategy achieves near 100%

reuse, the direct map strategy could have lower overhead because the device driver

does not have to translate the address that is provided by the guest OS into an address

that is suitable for DMA.

The GART’s translation table is a single, one-dimensional array. Consequently, if

an IOTLB miss occurs, address translation requires at most one memory access. In

contrast, the coming IOMMUs from AMD and Intel will use multilevel translation

tables, similar to the page tables used by the processor’s MMU. Thus, both updates


by the hypervisor and IOTLB misses may have a higher cost because of the additional

memory references incurred by walking multilevel translation tables.
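The cost difference shows up in the number of dependent memory accesses needed to resolve an IOTLB miss. The sketch below is purely illustrative (real GART and IOMMU table formats differ): a flat, GART-style table resolves a miss with a single array access, whereas a multilevel table requires one access per level.

    #include <stdint.h>

    #define PAGE_SHIFT 12

    /* Flat, GART-style table: an IOTLB miss costs a single memory access. */
    uint64_t gart_translate(const uint64_t *gart_table, uint64_t io_addr)
    {
        return gart_table[io_addr >> PAGE_SHIFT];          /* 1 access */
    }

    /* Multilevel, page-table-style walk: each level adds a dependent access,
     * so both hypervisor updates and IOTLB misses become more expensive. */
    uint64_t multilevel_translate(const uint64_t *root, uint64_t io_addr)
    {
        const uint64_t *l2 =
            (const uint64_t *)(uintptr_t)root[(io_addr >> 30) & 0x1ff];
        const uint64_t *l1 =
            (const uint64_t *)(uintptr_t)l2[(io_addr >> 21) & 0x1ff];
        return l1[(io_addr >> PAGE_SHIFT) & 0x1ff];        /* 3 accesses */
    }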

Regardless of the benchmark, the data in Section 5.6 shows many opportunities

for reuse of mappings in network I/O applications. However, some of this reuse is

a consequence of the difference between the mapping’s granularity (i.e., a 4-kilobyte

memory page) and the granularity of a network packet (i.e., 1500 bytes). Hence,

adjacent buffers in the same memory page can be reused for multiple packets be-

cause the packet size is smaller than that of a memory page. A hardware technique

that increases the maximum transaction size of a DMA operation could invert

this relationship and decrease the amount of reuse exploitable by the existing im-

plementations examined here. For example, network interfaces that support TCP

segmentation offload provide the abstraction to the operating system of a NIC that

has a much larger maximum transmission unit (i.e., 16 kilobytes instead of 1500 bytes).

In this case, the Shared protection strategy could approach the reuse properties of

the Single-use strategy, since a memory page would likely be used only once for one

large buffer rather than being used multiple times. However, previous studies by Kim

et al. show that the payload data for the web-based traces examined in this study

have significant reuse, and hence one would still expect to see reuse benefits in the

Persistent protection strategy [26].
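The arithmetic behind this granularity argument is simple and is illustrated by the snippet below: with a 1500-byte MTU, two full packet buffers fit in a single 4 KB page and can share one mapping, whereas a 16 KB TSO-style buffer spans four pages and eliminates that incidental sharing. The snippet is illustrative only.

    #include <stdio.h>

    int main(void)
    {
        const int page_size = 4096;       /* mapping granularity */
        const int mtu       = 1500;       /* packet granularity  */
        const int tso_buf   = 16 * 1024;  /* TSO-style buffer    */

        /* Buffers per page > 1 means adjacent packets can reuse one mapping. */
        printf("1500-byte buffers per 4 KB page: %d\n", page_size / mtu);     /* 2 */

        /* A large TSO buffer spans several pages, so no page is reused. */
        printf("4 KB pages per 16 KB TSO buffer: %d\n", tso_buf / page_size); /* 4 */
        return 0;
    }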

Xen differs from many virtualization systems in that it exposes host physical ad-

dresses to the guest OS. In particular, the guest OS, and not the VMM, is responsible

for translating between pseudo-physical addresses that are used at most levels of the

guest OS and host physical addresses that are used at the device level. This does

not, however, fundamentally change the implementation of the various protection

strategies.
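For example, a Xen guest driver performs a translation along the lines of the following sketch before handing a buffer address to the device; pfn_to_mfn() is the guest-kernel interface XenoLinux provides for this purpose, while the surrounding helper and its name are illustrative only.

    #include <stdint.h>

    #define PAGE_SHIFT 12

    /* Guest-kernel translation from a pseudo-physical frame number to the
     * machine frame number; performed by the guest, not by the VMM. */
    extern unsigned long pfn_to_mfn(unsigned long pfn);

    /* Illustrative helper: convert a pseudo-physical buffer address into the
     * machine address that is programmed into the device for DMA. */
    uint64_t pseudo_phys_to_dma_addr(uint64_t pseudo_phys)
    {
        unsigned long pfn = (unsigned long)(pseudo_phys >> PAGE_SHIFT);
        uint64_t off = pseudo_phys & ((1u << PAGE_SHIFT) - 1);

        return ((uint64_t)pfn_to_mfn(pfn) << PAGE_SHIFT) | off;
    }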


Chapter 6

Conclusion

As demand for high-bandwidth network services continues to grow, network servers

must continue to deliver more and more performance. Simultaneously, power and

cooling continue to be first-class concerns for datacenter servers, and thus network

servers must support the highest levels of efficiency possible. Architectural trends to-

ward chip multiprocessors are straining contemporary OS network stacks and network

hardware, exposing efficiency bottlenecks that can prevent software architectures from

gaining any substantial performance through multiprocessing. And whereas multicore

architectures offer an opportunity to utilize physical server resources more efficiently

through consolidation, inefficiencies inherent to modern I/O sharing

architectures and protection strategies severely damage performance and undermine

overall server efficiency.

This dissertation addresses key OS and VMM architectural components that can

limit I/O performance and efficiency in modern thread-concurrent and VM-concurrent

servers. Each of these components has separate performance and efficiency issues that

have tangible effects on the ability of a server to support its network applications.

The OS parallelization strategy affects the performance of network I/O processing by

the operating system, affects the maximum throughput attainable on a given connec-

tion, and thus affects application throughput and scalability. The virtual machine

monitor’s I/O virtualization architecture affects the overhead required to share ac-

cess to a given I/O device, which thus affects the maximum aggregate application


performance attainable on the system and affects the ability of the system to support

larger numbers of concurrent virtual machines. Furthermore, the VMM’s memory-

protection strategy also affects the overhead of device virtualization, which affects

application performance, and affects the level of isolation supported by the system,

which affects the operating systems’ and hence applications’ stability. The design

decisions of each strategy explored and the characteristics of the resulting architec-

tures have implications for server architects who will seek to build upon this work

and tackle remaining and future challenges facing server architecture. Those design

decisions, their characteristics, and the corresponding implications are discussed in

further detail in the sections that follow.

6.1 Orchestrating OS parallelization to characterize and improve I/O processing

The trend toward chip multiprocessing hardware necessitates parallelization of

the operating system’s network stack. This dissertation establishes that a contin-

uum of parallelism and efficiency exists among contemporary protocol network stack

organizations, and this research explores points along that continuum. Along with

the synchronization mechanism employed by each organization, the parallelization

strategy has a direct impact on overall efficiency and ultimately throughput. This re-

search found that a traditional network interface featuring a single high-bandwidth

link imposes an inherent bottleneck with regard to its single interface (i.e., packet

queue) to the operating system, which limited throughput regardless of the network

stack organization used. However, introducing parallelism at the network interface

(by using separate interfaces) exposed the scheduling and synchronization efficiency

characteristics of each organization on the continuum. Through examining these

characteristics, it is clear that attempting to maximize theoretical parallelism in the


network stack can actually hurt performance, even on a highly parallel machine. This

research finds that the less-parallel, connection-parallel network stack organization is

both more efficient and higher-performing than organizations that attempt to

maximize packet parallelism.

Though this dissertation explored primarily performance and efficiency, the se-

lection of a connection-parallel network stack within the operating system has im-

plications well beyond just the operating system. This study showed that hardware

support is needed to overcome the bottleneck imposed by the serialized interface ex-

ported by a single high-bandwidth NIC. To efficiently support a connection-parallel

network stack, the network interface card would first require parallel packet queues

so that multiple threads could access the NIC at the same time, without synchroniza-

tion. Second, the NIC would require some form of packet classification that can map

incoming packets to specific connections (or connection groups) and then place them

in a specific packet queue on-board the NIC associated with that connection. With

the additional capability to fire a separate interrupt for each separate queue, it would

be possible to closely mimic the behavior of the parallel-NIC prototype evaluated in

this work, in which packets for a specific connection “queue” (a NIC in this case) are

persistently mapped to that same queue.
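A minimal sketch of that classification step appears below: the NIC hashes each packet's connection 4-tuple and uses the hash to select a receive queue (and hence an interrupt and a host thread). The hash function and structure names are illustrative, not a description of any particular NIC.

    #include <stdint.h>

    #define NUM_QUEUES 8   /* one queue (and interrupt) per connection group */

    struct flow_tuple {
        uint32_t src_ip, dst_ip;
        uint16_t src_port, dst_port;
    };

    static uint32_t flow_hash(const struct flow_tuple *t)
    {
        uint32_t h = t->src_ip ^ t->dst_ip;
        h ^= ((uint32_t)t->src_port << 16) | t->dst_port;
        h ^= h >> 16;
        return h;
    }

    /* Every packet of a connection hashes to the same queue, so the host
     * thread serving that queue needs no cross-thread synchronization. */
    unsigned classify_to_queue(const struct flow_tuple *t)
    {
        return flow_hash(t) % NUM_QUEUES;
    }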

Even with such hardware support, additional challenges in both the hardware

and software remain, including support for load balancing. In the experiments evalu-

ated in this dissertation, the load was purposefully spread evenly across the separate

connections and their groups, and a connection always hashed to the same group.

However, a static hash mechanism may lead to undesirable overload conditions for

only a subset of connection groups, leading to under-utilization in lightly used groups.

In this case, it would be desirable to migrate busy connections to lightly-loaded con-

nection groups. If the hardware is responsible for mapping packets to connections,

though, then clearly the hardware must participate in this scheme. One can imag-


ine several possibilities for providing this support, including full control by hardware

(where the hardware attempts to detect the overload condition and notifies the OS

of a migration), full control by software (where the software detects overload and

migrates specific connections by notifying the NIC), or something in between, where

the software provides hints to the hardware about future possibilities for migration.

Regardless, the issue of load-balancing across multiple queues will be a critical area

for maintaining high performance with connection-parallel network stacks that have

hardware support.
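As one illustration of the software-controlled end of this spectrum, the sketch below periodically compares per-group load and asks the NIC to remap a busy connection from the most heavily loaded group to the most lightly loaded one. nic_remap_connection() is a hypothetical NIC control operation, and any real policy would also need to guard against thrashing and packet reordering.

    #include <stdint.h>

    #define NUM_GROUPS 8

    extern uint64_t group_load[NUM_GROUPS];            /* e.g., packets per second */
    extern int nic_remap_connection(uint32_t conn_id, unsigned new_group);

    /* busiest_conn[g] identifies the most active connection in group g. */
    void rebalance(const uint32_t busiest_conn[NUM_GROUPS])
    {
        unsigned hot = 0, cold = 0;

        for (unsigned g = 1; g < NUM_GROUPS; g++) {
            if (group_load[g] > group_load[hot])  hot  = g;
            if (group_load[g] < group_load[cold]) cold = g;
        }

        /* Migrate only under significant imbalance to avoid thrashing. */
        if (group_load[hot] > 2 * group_load[cold])
            nic_remap_connection(busiest_conn[hot], cold);
    }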

6.2 Reducing virtualization overhead using a hybrid hardware/software approach

Whereas OS support for thread-parallelism incurs performance-damaging inef-

ficiencies, contemporary software-based techniques for providing shared access to

an I/O device also incur severe performance overheads. Though the contemporary

software-based virtualization architecture supports a variety of hardware, the hypervi-

sor and driver domain consume as much as 70% of the execution time during network

transfers. This dissertation introduces the novel CDNA I/O virtualization architec-

ture, which is a hybrid hardware/software approach to providing safe, shared access

to a single network interface. CDNA uses hardware to perform traffic multiplexing,

a combination of hardware and software to facilitate event notification from the I/O

device to a particular virtual machine, and a combination of hardware and software

to enforce isolation of DMA requests initiated by each untrusted virtual machine.

This study demonstrates that a hybrid hardware/software approach is both eco-

nomical and effective. The CDNA prototype device required about 12 MB of on-

board storage and used a 300 MHz embedded processor, resources comparable to those of

modern network interface cards. Using these resources, the CDNA architecture im-


proved transmit and receive performance for concurrent virtual machines by factors

of 2.1 and 3.3, respectively, over a standard software-based shared I/O virtualization

architecture. And whereas a purely hardware-based approach could require costly

memory-registration operations or system-level hardware modifications to enforce

DMA memory protection, the lightweight, software-based DMA memory protection

strategy introduced in this research incurs relatively little overhead and requires no

system-level modifications.
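The essence of that software-based protection strategy can be sketched as follows: the guest cannot write DMA descriptors into the NIC directly; instead it posts them through a hypercall, and the VMM verifies that every page touched by a descriptor belongs to the calling guest before enqueueing the descriptor itself. The structures and function names below are hypothetical simplifications of the mechanism evaluated in Section 5.3.

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12

    struct dma_desc {
        uint64_t machine_addr;   /* address the NIC will read from or write to */
        uint32_t len;
    };

    /* VMM-maintained ownership check and trusted enqueue path (hypothetical). */
    extern bool page_owned_by(unsigned domid, uint64_t mfn);
    extern void nic_enqueue_desc(const struct dma_desc *d);

    /* Handler for the hypercall a guest issues to post a DMA descriptor. */
    int hypercall_post_dma(unsigned domid, const struct dma_desc *d)
    {
        uint64_t first = d->machine_addr >> PAGE_SHIFT;
        uint64_t last  = (d->machine_addr + d->len - 1) >> PAGE_SHIFT;

        for (uint64_t mfn = first; mfn <= last; mfn++)
            if (!page_owned_by(domid, mfn))
                return -1;        /* reject: descriptor references foreign memory */

        nic_enqueue_desc(d);      /* only the VMM writes the NIC descriptor ring */
        return 0;
    }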

Moving traffic multiplexing to the hardware proved to be the biggest source of

performance and efficiency improvement in the CDNA architecture. By not forcing

the VMM to inspect, demultiplex, and page-flip data between virtual machines, the

relatively simple hardware of the CDNA prototype dramatically reduced total I/O

virtualization overhead. This reduction occurs despite introducing the new overhead

of software-based DMA memory protection for supporting direct I/O access.

Beyond the performance and efficiency issues explored in this study, the CDNA

architecture presents new opportunities for I/O virtualization research, including gen-

eralization to other devices and challenges not related to performance. As presented,

the CDNA device is a prototype. Though there is nothing specific about the prototype

or its architecture that prevents it from being adapted to other types of devices (such

as graphics cards or disk controllers), actually doing this generalization remains an

area for future exploration. Clearly the performance and efficiency benefits demon-

strated for network I/O could prove advantageous for other types of I/O, though

the magnitude of the benefit would depend on the importance of I/O in any particular

workload. Generalizing the CDNA interface would require development of a general

software interface for communicating DMA updates to the virtual machine monitor,

since the prototype’s method is actually based on updates to the NIC-specific DMA

descriptor structure. Further, generalization would require a method to generically,

concisely describe the control region (and mechanisms) for any particular device so


that the VMM maintains control of actually enqueueing DMA descriptors, as is re-

quired by the software-based DMA memory protection method.

Another area open for exploration with the CDNA architecture is that of provid-

ing quality of service guarantees, including support for customizable service allocation

and prioritization. For example, it would be advantageous to be able to guarantee

that a high-priority virtual machine receives performance according to its application

needs (which could be either high bandwidth, or low latency, or both). Traditional

software-based I/O virtualization allows fine-grained centralized load balancing, be-

cause the VMM actively controls the flow of I/O into and out of the device. With

direct-access, hardware-shared devices such as the CDNA architecture, however, the

hardware ultimately determines the order and priority that concurrent requests are

processed. Thus, the hardware must have some mechanism for implementing the de-

sired quality-of-service policy. Furthermore, the hardware must support some mech-

anism for the VMM to communicate the desired policy to the hardware. Finally, in

cases when the device supports unsolicited I/O (such as a network interface, which

receives packets from a network), it would be advantageous for the hardware to track

usage statistics and report them to the VMM so that it could make better decisions

that might avoid I/O failure (e.g., packet loss).
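As a simple illustration of the kind of policy the hardware would have to enforce, the sketch below implements a weighted round-robin pass over per-VM transmit queues, with the weights serving as the knob the VMM would use to communicate the desired policy. The structures and callbacks are hypothetical.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_VMS 4

    struct vm_queue {
        uint32_t weight;                   /* set by the VMM to express policy */
        uint32_t credit;
        bool (*has_packet)(unsigned vm);
        void (*send_next)(unsigned vm);
    };

    /* One scheduling round over all per-VM transmit queues. */
    void tx_schedule_round(struct vm_queue q[NUM_VMS])
    {
        for (unsigned vm = 0; vm < NUM_VMS; vm++)
            q[vm].credit += q[vm].weight;          /* refill credits each round */

        for (unsigned vm = 0; vm < NUM_VMS; vm++) {
            while (q[vm].credit > 0 && q[vm].has_packet(vm)) {
                q[vm].send_next(vm);
                q[vm].credit--;
            }
        }
    }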

6.3 Improving performance and efficiency of protection strategies for direct-access I/O

CDNA’s performance and efficiency gains versus software-based virtualization il-

lustrate the effectiveness of direct I/O access by untrusted virtual machines. Though

direct I/O access overcomes these performance penalties, it requires new protection strate-

gies to prevent the guest operating systems from directing the device to violate mem-

ory protection.


This dissertation has evaluated a variety of DMA memory protection strategies

for direct access to I/O devices within virtual machine monitors. As others have

noted, overhead for managing DMA memory protection using an IOMMU in a virtu-

alized environment can noticeably degrade network I/O performance, ultimately af-

fecting application throughput. Even with the novel IOMMU-based strategies aimed

at reducing this overhead by reusing installed mappings in the IOMMU hardware,

there remains a nonzero overhead that reduces throughput. However, this research

has shown that reuse-based strategies are effective at reducing overhead relative to

the state-of-the-art, single-use strategy. Furthermore, this research shows that the

software-based implementation for providing DMA memory protection introduced

with the CDNA architecture can deliver performance and efficiency comparable to

the most aggressive reuse-based strategies. These results held true across a wide ar-

ray of TCP-based applications with different I/O demand characteristics, resulting

in several key insights.

This research also explored the differences in the level of protection offered by

different strategies and the level of efficiency gained through reuse. All of the strate-

gies (single-use, shared, persistent, and software-based) explored in this study provide

equivalent protection between guest operating systems when those guest operating

systems are sharing a single device and have direct access. Further, all of these tech-

niques prevent a guest operating system from directing the device to access memory

that does not belong to that guest. The traditional single-use strategy, however, pro-

vides this protection at the greatest cost, consuming from 6–26% of the CPU. This

cost can be reduced by reusing IOMMU mappings. Multiple concurrent network

transmit operations are typically able to share the same mappings 20–40% of the

time, yielding small performance improvements. However, due to Xen’s I/O architec-

ture, network receive operations are usually unable to share mappings. In contrast,

even with a small pool of persistent IOMMU mappings, reuse approaches 100% in


almost all cases, reducing the overhead of protection to only 2–13% of the CPU. Fi-

nally, the software-based protection strategy performs comparably to the best of the

IOMMU-based strategies, consuming only 3–15% of the CPU for protection.

Comparing the performance and protection offered by hardware- and software-

based DMA protection strategies shows that an IOMMU provides surprisingly limited

benefits beyond what is possible with software. This finding comes despite industrial

enthusiasm for deploying IOMMU hardware in the next generation of commodity sys-

tems. As these new systems arrive, a new comparison that uses an actual IOMMU

(rather than the GART-modeled IOMMU in this study) would be useful

to quantify the performance of “direct-map” IOMMU-based protection strategies.

In such a strategy, the entire physical memory space of a given virtual machine is

mapped (usually once) by the IOMMU and remains mapped for the lifetime of the

virtual machine. Such a strategy should not impose the remapping overhead or reuse-

lookup overheads of the strategies explored in this dissertation, but it should not

perform any better than the “no-protection” case, either. Hence, it is unlikely that

even future, improved IOMMU-based designs will offer significantly better perfor-

mance than the best-performing strategies explored in this dissertation, which came

within just a few percent of native performance.

Though it is possible to achieve near-native performance with an optimized, reuse-

based protection strategy, the results in this study also show that inefficient use of

hardware structures designed to reduce the burden of software (such as an IOMMU)

can in fact significantly degrade performance. Hence, this dissertation offers a

warning to system architects who would use hardware to solve an architectural prob-

lem without consideration of the software overhead. The availability of an IOMMU

does not significantly improve performance unless one compares against a naive,

worst-case implementation of DMA memory protection. This underscores the need


for software architects to work closely with hardware architects to solve problems

such as DMA memory protection and, in general, I/O virtualization.

6.4 Summary

This dissertation explored performance and efficiency of server concurrency at

both the OS and VMM levels and introduced several hybrid hardware/software tech-

niques that strategically use hardware to improve software efficiency and performance.

By changing the way the operating system uses its parallel processors to facilitate I/O

processing, by changing the responsibilities of I/O devices to more efficiently integrate

with a virtualized environment, and by strategically using memory protection hard-

ware to reduce the total cost of using that hardware, these techniques each modify

the overall hardware/software system architecture. The OS and VMM architectures

introduced and explored in this dissertation provide a scalable means to deliver high-

performance, efficient I/O for contemporary and future commodity servers. As servers

continue to support more and more cores on a chip, thread concurrency and VMM

concurrency will be increasingly critical for system integrators facing performance

challenges for high-bandwidth applications and efficiency challenges for system con-

solidation in the datacenter. The techniques explored in this dissertation rely on a

synthesis of software orchestration, parallel computation resources (in the OS and

among virtual machines) and lightweight, efficient interfaces with hardware that sup-

port the desired level of concurrency. Given the cost and performance advantages

that were repeatedly found using the hybrid hardware/software approach explored

in this dissertation, this approach should be a guiding principle for hardware and

software architects facing the future challenges of concurrent server architecture.


Bibliography

[1] Keith Adams and Ole Agesen. A comparison of software and hardware tech-

niques for x86 virtualization. In Proceedings of the Conference on Architectural

Support for Programming Languages and Operating Systems (ASPLOS), pages

2–13, October 2006.

[2] Advanced Micro Devices. Secure Virtual Machine Architecture Reference Man-

ual, May 2005. Publication 33047, Revision 3.01.

[3] Advanced Micro Devices. AMD I/O Virtualization Technology (IOMMU) Spec-

ification, February 2007. Publication 34434, Revision 1.20.

[4] W. J. Armstrong, R. L. Arndt, D. C. Boutcher, R. G. Kovacs, D. Larson, K. A.

Lucke, N. Nayar, and R. C. Swanberg. Advanced virtualization capabilities of

POWER5 systems. IBM Journal of Research and Development, 49(4/5):523–532,

2005.

[5] Avnet Design Services. Xilinx Virtex-II Pro Development Kit: User’s Guide,

November 2003. ADS-003704.

[6] Kalpana Banerjee, Aniruddha Bohra, Suresh Gopalakrishnan, Murali Rangara-

jan, and Liviu Iftode. Split-OS: An operating system architecture for clusters of

intelligent devices. Work-in-Progress Session at the 18th Symposium on Oper-

ating Systems Principles, October 2001.

[7] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho,

Rolf Neugebauer, Ian Pratt, and Andrew Warfield. Xen and the art of virtualiza-


tion. In Proceedings of the Symposium on Operating Systems Principles, pages

164–177, October 2003.

[8] Luiz A. Barroso, Jeffrey Dean, and Urs Hölzle. Web search for a planet: The

Google cluster architecture. IEEE Micro, 23(2):22–28, March–April 2003.

[9] Muli Ben-Yehuda, Jon Mason, Orran Krieger, Jimi Xenidis, Leendert Van Doorn,

Asit Mallick, Jun Nakajima, and Elsie Wahlig. Utilizing IOMMUs for virtual-

ization in Linux and Xen. In Proceedings of the Linux Symposium, pages 71–85,

July 2006.

[10] Muli Ben-Yehuda, Jimi Xenidis, Michal Ostrowski, Karl Rister, Alexis Bruem-

mer, and Leendert Van Doorn. The price of safety: Evaluating IOMMU perfor-

mance. In Proceedings of the 2007 Linux Symposium, pages 9–19, July 2007.

[11] Tim Brecht, G. (John) Janakiraman, Brian Lynn, Vikram Saletore, and Yoshio

Turner. Evaluating network processing efficiency with processor partitioning and

asynchronous I/O. In Proceedings of EuroSys 2006, pages 265–278, April 2006.

[12] D. T. Brown, R. L. Eibsen, and C. A. Thorn. Channel and direct access device

architecture. IBM Systems Journal, 11(3):186–199, 1972.

[13] S. Devine, E. Bugnion, and M. Rosenblum. Virtualization system including

a virtual machine monitor for a computer with a segmented architecture. US

Patent #6,397,242, October 1998.

[14] Keith Diefendorff. Power4 focuses on memory bandwidth. Microprocessor Report,

13(13):1–7, October 1999.

[15] M. Engelhardt, G. Schindler, W. Steinhogl, and G. Steinlesberger. Challenges of

interconnection technology till the end of the roadmap and beyond. Microelec-

tronic Engineering, 64(1–4):3–10, October 2002.


[16] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and

Mark Williamson. Safe hardware access with the Xen virtual machine monitor.

In Proceedings of the Workshop on Operating System and Architectural Support

for the on-demand IT Infrastructure, October 2004.

[17] P. H. Gum. System/370 extended architecture: Facilities for virtual machines.

IBM Journal of Research and Development, 27(6):530–544, 1983.

[18] Lance Hammond, Basem A. Nayfeh, and Kunle Olukotun. A single-chip multi-

processor. Computer, 30(9):79–85, September 1997.

[19] E. C. Hendricks and T. C. Hartmann. Evolution of a virtual machine subsystem.

IBM Systems Journal, 18(1):111–142, 1979.

[20] Justin Hurwitz and Wu-chun Feng. End-to-end performance of 10-gigabit Eth-

ernet on commodity systems. IEEE Micro, 24(1):10–22, Jan./Feb. 2004.

[21] Intel. Intel Virtualization Technology Specification for the Intel Itanium Archi-

tecture (VT-i), April 2005. Order Number 305942-002, Revision 2.0.

[22] Intel Corporation. Energy-efficient performance for the data center, September

2006. Order Number 315018-001US.

[23] Intel Corporation. Intel Virtualization Technology for Directed I/O, May 2007.

Order Number D51397-002, Revision 1.0.

[24] J. Jann, L. M. Browning, and R. S. Burugula. Dynamic reconfiguration: Basic

building blocks for autonomic computing on IBM pSeries servers. IBM Systems

Journal, 42(1):29–37, 2003.

[25] Sanjiv Kapil, Harlan McGhan, and Jesse Lawrendra. A chip multithreaded

processor for network-facing workloads. IEEE Micro, 24(2):20–30, Mar./Apr.

2004.


[26] Hyong-youb Kim, Vijay S. Pai, and Scott Rixner. Improving web server through-

put with network interface data caching. In Proceedings of the Tenth Inter-

national Conference on Architectural Support for Programming Languages and

Operating Systems, pages 239–250, October 2002.

[27] Hyong-youb Kim and Scott Rixner. Performance characterization of the FreeBSD

network stack. Technical Report TR05-450, Rice University Computer Science

Department, June 2005.

[28] Hyong-youb Kim and Scott Rixner. TCP offload through connection handoff. In

Proceedings of EuroSys, pages 279–290, April 2006.

[29] Seongbeom Kim, Dhruba Chandra, and Yan Solihin. Fair cache sharing and

partitioning in a chip multiprocessor architecture. In Proceedings of the 13th

International Conference on Parallel Architectures and Compilation Techniques,

pages 111–122, 2004.

[30] Poonacha Kongetira, Kathirgamar Aingaran, and Kunle Olukotun. Niagara: A

32-way multithreaded SPARC processor. IEEE Micro, 25(2):21–29, Mar./Apr.

2005.

[31] David Koufaty and Deborah T. Marr. Hyperthreading technology in the netburst

microarchitecture. IEEE Micro, 23(2):56–65, 2003.

[32] Kevin Krewell. UltraSPARC IV mirrors predecessor. Microprocessor Report,

17(11):1–6, November 2003.

[33] Kevin Krewell. Double your Opterons; double your fun. Microprocessor Report,

18(10):26–28, October 2004.

[34] Kevin Krewell. Sun’s Niagara pours on the cores. Microprocessor Report,

18(9):11–13, September 2004.


[35] Jiuxing Liu, Wei Huang, Bulent Abali, and Dhabaleswar K. Panda. High per-

formance VMM-bypass I/O in virtual machines. In Proceedings of the USENIX

Annual Technical Conference, pages 29–42, June 2006.

[36] R. A. MacKinnon. The changing virtual machine environment: Interfaces to real

hardware, virtual hardware, and other virtual machines. IBM Systems Journal,

18(1):18–46, 1979.

[37] M. McGrath. Virtual machine computing in an engineering environment. IBM

Systems Journal, 11(2):131–149, June 1972.

[38] Aravind Menon, Alan L. Cox, and Willy Zwaenepoel. Optimizing network vir-

tualization in Xen. In Proceedings of the USENIX Annual Technical Conference,

June 2006.

[39] Aravind Menon, Jose Renato Santos, Yoshio Turner, G. (John) Janakiraman,

and Willy Zwaenepoel. Diagnosing performance overheads in the Xen virtual

machine environment. In Proceedings of the ACM/USENIX Conference on Vir-

tual Execution Environments, pages 13–23, June 2005.

[40] Microsoft Corporation. Scalable networking: Eliminating the receive processing

bottleneck – Introducing RSS. In Proceedings of the Windows Hardware Engi-

neering Conference, April 2004.

[41] E. M. Nahum, D. J. Yates, J. F. Kurose, and D. Towsley. Performance issues in

parallelized network protocols. In Proceedings of the Symposium on Operating

Systems Design and Implementation, pages 125–137, November 1994.

[42] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung

Chang. The case for a single-chip multiprocessor. In Proceedings of the Seventh


International Conference on Architectural Support for Programming Languages

and Operating Systems, pages 2–11, October 1996.

[43] R. P. Parmelee, T. I. Peterson, C. C. Tillman, and D. J. Hatfield. Virtual storage

and virtual machine concepts. IBM Systems Journal, 11(2):99–130, 1972.

[44] Ian Pratt and Keir Fraser. Arsenic: A user-accessible Gigabit Ethernet interface.

In Proceedings of IEEE INFOCOM, pages 67–76, April 2001.

[45] Himanshu Raj and Karsten Schwan. High performance and scalable I/O vir-

tualization via self-virtualized devices. In Proceedings of the 16th International

Symposium on High Performance Distributed Computing, pages 179–188, June

2007.

[46] Murali Rangarajan, Aniruddha Bohra, Kalpana Banerjee, Enrique V. Carrera,

Ricardo Bianchini, Liviu Iftode, and Willy Zwaenepoel. TCP Servers: Offloading

TCP/IP Processing in Internet Servers. Design, Implementation, and Perfor-

mance. Computer Science Department, Rutgers University, March 2002. Tech-

nical Report DCR-TR-481.

[47] Greg Regnier, Srihari Makineni, Ramesh Illikkal, Ravi Iyer, Dave Minturn, Ram

Huggahalli, Don Newell, Linda Cline, and Annie Foong. TCP Onloading for

Data Center Servers. Computer, 37(11):48–58, November 2004.

[48] Greg Regnier, Dave Minturn, Gary McAlpine, Vikram A. Saletore, and Annie

Foong. ETA: Experience with an Intel Xeon Processor as a Packet Processing

Engine. IEEE Micro, 24(1):24–31, January 2004.

[49] L. H. Seawright and R. A. MacKinnon. VM/370–a study of multiplicity and

usefulness. IBM Systems Journal, 18(1):4–17, 1979.


[50] Jeff Shafer and Scott Rixner. A Reconfigurable and Programmable Gigabit Eth-

ernet Network Interface Card. Rice University, Department of Electrical and

Computer Engineering, December 2006. Technical Report TREE0611.

[51] Jeffrey Shafer and Scott Rixner. RiceNIC: A reconfigurable network interface

for experimental research and education. In Proceedings of the Workshop on

Experimental Computer Science, June 2007.

[52] Piyush Shivam, Pete Wyckoff, and Dhabaleswar K. Panda. EMP: Zero-copy

OS-bypass NIC-driven Gigabit Ethernet message passing. In Proceedings of the

ACM/IEEE Conference on Supercomputing (CDROM), pages 57–57, November

2001.

[53] J. Sugerman, G. Venkitachalam, and B. Lim. Virtualizing I/O devices on

VMware Workstation’s hosted virtual machine monitor. In Proceedings of the

USENIX Annual Technical Conference, pages 1–14, June 2001.

[54] J. M. Tendler, J. S. Dodson, J. S. Fields Jr., H. Le, and B. Sinharoy. POWER4

system microarchitecture. IBM Journal of Research and Development, 46(1):5–

26, January 2002.

[55] Sunay Tripathi. FireEngine—a new networking architecture for the Solaris op-

erating system. White paper, Sun Microsystems, June 2004.

[56] VMware Inc. VMware ESX server: Platform for virtualizing servers, storage and

networking. http://www.vmware.com/pdf/esx datasheet.pdf, 2006.

[57] Robert N. M. Watson. Introduction to multithreading and multiprocessing in

the FreeBSD SMPng network stack. In Proceedings of EuroBSDCon, November

2005.


[58] A. Whitaker, M. Shaw, and S. Gribble. Scale and performance in the Denali

isolation kernel. In Proceedings of the Symposium on Operating Systems Design

and Implementation (OSDI), pages 195–210, December 2002.

[59] Paul Willmann, Scott Rixner, and Alan L. Cox. An evaluation of network stack

parallelization strategies in modern operating systems. In Proceedings of the

USENIX Annual Technical Conference, pages 91–96, 2006.

[60] Paul Willmann, Jeffrey Shafer, David Carr, Aravind Menon, Scott Rixner,

Alan L. Cox, and Willy Zwaenepoel. Concurrent direct network access for virtual

machine monitors. In Proceedings of the 13th International Symposium on High

Performance Computer Architecture, pages 306–317, February 2007.

[61] Paul Willmann, Jeffrey Shafer, David Carr, Aravind Menon, Scott Rixner,

Alan L. Cox, and Willy Zwaenepoel. Concurrent direct network access for

virtual machine monitors. In Proceedings of the International Symposium on

High-Performance Computer Architecture, February 2007.

[62] David J. Yates, Erich M. Nahum, James F. Kurose, and Don Towsley. Net-

working support for large scale multiprocessor servers. In Proceedings of the

ACM SIGMETRICS International Conference on Measurement and Modeling of

Computer Systems, pages 116–125, May 1996.
