Web Servers: Implementation and Performance

Erich Nahum
IBM T.J. Watson Research Center
www.research.ibm.com/people/n/nahum
[email protected]
Contents of This Tutorial

• Introduction to HTTP
• HTTP Servers:
  – Outline of an HTTP Server Transaction
  – Server Models: Processes, Threads, Events
  – Event Notification: Asynchronous I/O
• HTTP Server Workloads:
  – Workload Characteristics
  – Workload Generation
• Server TCP Issues:
  – Introduction to TCP
  – Server TCP Dynamics
  – Server TCP Implementation Issues
• Other Issues (time permitting):
  – Large Site Studies
  – Clusters
  – Running Experiments
  – Brief Overview of Other Topics
Things Not Covered in Tutorial
• Client-side issues: DNS, HTML rendering
• Proxies: some similarities, many differences
• Dynamic Content: CGI, PHP, ASP, etc.
• QoS for Web Servers
• SSL/TLS and HTTPS
• Content Distribution Networks (CDNs)
• Security and Denial of Service

If time is available, these may be covered briefly at the end.
Assumptions and Expectations
• Some familiarity with the WWW as a user (has anyone here not used a browser?)
• Some familiarity with networking concepts (e.g., unreliability, reordering, race conditions)
• Familiarity with systems programming (e.g., know what sockets, hashing, caching are)
• Examples will be based on C & Unix, taken from BSD, Linux, AIX, and real servers (sorry, Java and Windows fans)
Objectives and Takeaways
After this tutorial, hopefully we will all know:

• Basics of server implementation & performance
• Pros and cons of various server architectures
• Difficulties in workload generation
• Interactions between HTTP and TCP
• The design loop of implement, measure, profile, debug, and fix

Many lessons should be applicable to any networked server, e.g., files, mail, news, DNS, LDAP, etc.
Timeline
• Intro, HTTP, server transaction: 40 min
• Server models, event notification: 40 min
• Workload characterization & generation: 40 min
• Intro to TCP, dynamics, implementation: 40 min
• Clusters, large site studies, experiments: 30 min
• Other topics: time permitting
Acknowledgements
Many people contributed comments and suggestions to this tutorial, including:
Abhishek Chandra, Mark Crovella, Suresh Chari, Peter Druschel, Jim Kurose, Balachander Krishnamurthy, Vivek Pai, Jennifer Rexford, and Anees Shaikh.
Errors are all mine, of course.
Chapter 1: Introduction to HTTP
Introduction to HTTP
• HTTP: Hypertext Transfer Protocol
  – Communication protocol between clients and servers
  – Application-layer protocol for the WWW
• Client/Server model:
  – Client: browser that requests, receives, and displays objects
  – Server: receives requests and responds to them
• Protocol consists of various operations
  – Few in HTTP 1.0 (RFC 1945, 1996)
  – Many more in HTTP 1.1 (RFC 2616, 1999)

[Figure: clients (a laptop with Netscape, a desktop with Explorer) send HTTP requests to a server running Apache, which sends back HTTP responses.]
How are Requests Generated?
• User clicks on something
• Uniform Resource Locator (URL):
  – http://www.nytimes.com
  – https://www.paymybills.com
  – ftp://ftp.kernel.org
  – news://news.deja.com
  – telnet://gaia.cs.umass.edu
  – mailto:[email protected]
• Different URL schemes map to different services
• Hostname is converted from a name to a 32-bit IP address (DNS resolution)
• Connection is established to the server

Most browser requests are HTTP requests.
What Happens Then?

• Client downloads the HTML document
  – Sometimes called the "container page"
  – Typically in text format (ASCII)
  – Contains instructions for rendering (e.g., background color, frames)
  – Links to other pages
• Many have embedded objects:
  – Images: GIF, JPG (logos, banner ads)
  – Usually automatically retrieved, i.e., without user involvement; can sometimes be controlled (e.g., browser options, junkbusters)

Sample HTML file:

<html>
<head>
<meta name="Author" content="Erich Nahum">
<title> Linux Web Server Performance </title>
</head>
<body text="#00000">
<img width=31 height=11 src="ibmlogo.gif">
<img src="images/new.gif">
<h1>Hi There!</h1>
Here's lots of cool linux stuff!
<a href="more.html">Click here</a> for more!
</body>
</html>
So What’s a Web Server Do?
• Respond to client requests, where the client is typically a browser
  – Can be a proxy, which aggregates client requests (e.g., AOL)
  – Could be a search-engine spider or custom agent (e.g., Keynote)
• May have work to do on the client's behalf:
  – Is the client's cached copy still good?
  – Is the client authorized to get this document?
  – Is the client a proxy on someone else's behalf?
  – Run an arbitrary program (e.g., stock trade)
• Hundreds or thousands of simultaneous clients
  – Hard to predict how many will show up on a given day
  – Many requests are in progress concurrently

Server capacity planning is non-trivial.
What do HTTP Requests Look Like?
GET /images/penguin.gif HTTP/1.0
User-Agent: Mozilla/0.9.4 (Linux 2.2.19)
Host: www.kernel.org
Accept: text/html, image/gif, image/jpeg
Accept-Encoding: gzip
Accept-Language: en
Accept-Charset: iso-8859-1,*,utf-8
Cookie: B=xh203jfsf; Y=3sdkfjej
<cr><lf>

• Messages are in ASCII (human-readable)
• A blank carriage-return/line-feed pair indicates the end of the headers
• Headers may communicate private information (browser, OS, cookie information, etc.)
What Kind of Requests are there?
Called Methods:
• GET: retrieve a file (95% of requests)
• HEAD: just get meta-data (e.g., modification time)
• POST: submit a form to a server
• PUT: store the enclosed document as the URI
• DELETE: remove the named resource
• LINK/UNLINK: in 1.0, gone in 1.1
• TRACE: HTTP "echo" for debugging (added in 1.1)
• CONNECT: used by proxies for tunneling (1.1)
• OPTIONS: request for server/proxy options (1.1)
What Do Responses Look Like?
HTTP/1.0 200 OK
Server: Tux 2.0
Content-Type: image/gif
Content-Length: 43
Last-Modified: Fri, 15 Apr 1994 02:36:21 GMT
Expires: Wed, 20 Feb 2002 18:54:46 GMT
Date: Mon, 12 Nov 2001 14:29:48 GMT
Cache-Control: no-cache
Pragma: no-cache
Connection: close
Set-Cookie: PA=wefj2we0-jfjf
<cr><lf>
<data follows…>

• Similar format to requests (i.e., ASCII)
What Responses are There?
• 1XX: Informational (defined in 1.0, used in 1.1)
  100 Continue, 101 Switching Protocols
• 2XX: Success
  200 OK, 206 Partial Content
• 3XX: Redirection
  301 Moved Permanently, 304 Not Modified
• 4XX: Client error
  400 Bad Request, 403 Forbidden, 404 Not Found
• 5XX: Server error
  500 Internal Server Error, 503 Service Unavailable, 505 HTTP Version Not Supported
What are all these Headers?
Headers specify capabilities and properties:

• General: Connection, Date
• Request: Accept-Encoding, User-Agent
• Response: Location, Server
• Entity: Content-Encoding, Last-Modified
• Hop-by-hop: Proxy-Authenticate, Transfer-Encoding

The server must pay attention to these to respond properly.
Summary: Introduction to HTTP
• The major application on the Internet
  – The majority of traffic is HTTP (or HTTP-related)
• Client/server model:
  – Clients make requests, servers respond to them
  – Done mostly in ASCII text (helps debugging!)
• Various headers and commands
  – Too many to go into detail here
  – We'll focus on common server ones
  – Many web books/tutorials exist (e.g., Krishnamurthy & Rexford 2001)
Chapter 2: Outline of a Typical HTTP Transaction
Outline of an HTTP Transaction
• In this section we go over the basics of servicing an HTTP GET request from user space
• For this example, we'll assume a single process running in user space, similar to Apache 1.3
• At each stage, see what the costs/problems can be
• Also try to think of where costs can be optimized
• We'll describe relevant socket operations as we go

A server in a nutshell:

initialize;
forever do {
    get request;
    process;
    send response;
    log request;
}
Readying a Server
• The first thing a server does is notify the OS that it is interested in WWW server requests; these typically arrive on TCP port 80. Other services use different ports (e.g., SSL is on 443)
• It allocates a socket and bind()'s it to the address (port 80)
• The server calls listen() on the socket to indicate its willingness to receive requests
• It calls accept() to wait for a request to come in (and blocks)
• When accept() returns, we have a new socket which represents a new connection to a client

s = socket();               /* allocate listen socket */
bind(s, 80);                /* bind to TCP port 80 */
listen(s);                  /* indicate willingness to accept */
while (1) {
    newconn = accept(s);    /* accept new connection */
    /* handle request */
}
Processing a Request
• getpeername() is called to get the remote host's address (getsockname() returns the local address; the remote end comes from getpeername(), or from accept() itself)
  – for logging purposes (optional, but done by most)
• gethostbyaddr() is called to get the name of the other end
  – again, for logging purposes
• gettimeofday() is called to get the time of the request
  – both for the Date header and for logging
• read() is called on the new socket to retrieve the request
• the request is determined by parsing the data
  – "GET /images/jul4/flag.gif"

remoteIP = getpeername(newconn);
remoteHost = gethostbyaddr(remoteIP);
gettimeofday(&currentTime);
read(newconn, reqBuffer, sizeof(reqBuffer));
reqInfo = serverParse(reqBuffer);
Processing a Request (cont)
• stat() is called to test the file path
  – to see if the file exists/is accessible
  – it may not be there, or may only be available to certain people
  – e.g., "/microsoft/top-secret/plans-for-world-domination.html"
• stat() is also used for file meta-data
  – e.g., size of file, last modified time
  – "Have the plans changed since last time I checked?"
• might have to stat() multiple files just to get to the end
  – e.g., 4 stats in the example path above
• assuming all is OK, open() is called to open the file

fileName = parseOutFileName(requestBuffer);
fileAttr = stat(fileName);
serverCheckFileStuff(fileName, fileAttr);
open(fileName);
Responding to a Request
• read() is called to read the file into user space
• write() is called to send the HTTP headers on the socket (early servers called write() once per header!)
• write() is called to write the file to the socket
• close() is called to close the socket
• close() is called to close the open file descriptor
• write() is called on the log file

read(fileName, fileBuffer);
headerBuffer = serverFigureHeaders(fileName, reqInfo);
write(newSock, headerBuffer);
write(newSock, fileBuffer);
close(newSock);
close(fileName);
write(logFile, requestInfo);
Optimizing the Basic Structure
• As we will see, a great deal of locality exists in web requests and web traffic.
• Much of the work described above doesn't really need to be performed each time.
• Optimizations fall into two categories: caching and custom OS primitives.
Optimizations: Caching
The idea is to exploit locality in client requests. Many files are requested over and over (e.g., index.html).

• Why open and close files over and over again? Instead, cache open file descriptors and manage them LRU.
• Why stat them again and again? Cache the path name and access characteristics.
• Similarly, cache HTTP header info on a per-URL basis, rather than re-generating it over and over.

fileDescriptor = lookInFDCache(fileName);
metaInfo = lookInMetaInfoCache(fileName);
headerBuffer = lookInHTTPHeaderCache(fileName);
Optimizations: Caching (cont)
• Instead of reading and writing the data each time, cache the data, as well as the meta-data, in user space
• Even better, mmap() the file so that two copies don't exist in user and kernel space
• Since we see the same clients over and over, cache the reverse name lookups (or better yet, don't do resolves at all; log only IP addresses)

fileData = lookInFileDataCache(fileName);
fileData = lookInMMapCache(fileName);
remoteHostName = lookRemoteHostCache(remoteIP);
Optimizations: OS Primitives
• Rather than calling accept(), getpeername(), and read() separately, add a new primitive, acceptExtended(), which combines the three:

acceptExtended(listenSock, &newSock, readBuffer, &remoteInfo);

• Instead of calling gettimeofday(), use a memory-mapped counter that is cheap to access (a few instructions rather than a system call):

currentTime = *mappedTimePointer;

• Instead of calling write() many times, use writev():

buffer[0] = firstHTTPHeader;
buffer[1] = secondHTTPHeader;
buffer[2] = fileDataBuffer;
writev(newSock, buffer, 3);
OS Primitives (cont)

• Rather than calling read() & write(), or write() with an mmap()'ed file, use a new primitive called sendfile() (or transmitfile()). Bytes stay in the kernel.
• While we're at it, add a header option to sendfile() so that we don't have to call write() at all.
• Also add an option to close the connection so that we don't have to call close() explicitly.

httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers,
         httpInfo->fileDescriptor, OPT_CLOSE_WHEN_DONE);

All this assumes proper OS support. Most have it these days.
An Accelerated Server Example

• acceptex() is called
  – gets new socket, request, remote host IP address
• a string match in a hash table is done to parse the request
  – the hash table entry contains the relevant meta-data, including modification times, file descriptors, permissions, etc.
• sendfile() is called
  – pre-computed header, file descriptor, and close option
• the log is written back asynchronously (buffered write())

That's it!

acceptex(socket, newConn, reqBuffer, remoteHostInfo);
httpInfo = cacheLookup(reqBuffer);
sendfile(newConn, httpInfo->headers,
         httpInfo->fileDescriptor, OPT_CLOSE_WHEN_DONE);
write(logFile, requestInfo);
Complications
• Much of this assumes sharing is easy:
  – but this depends on the server architectural model
  – if multiple processes are being used, as in Apache, it is difficult to share data structures
• Take, for example, mmap():
  – mmap() maps a file into the address space of a process
  – a file mmap'ed in one address space can't be re-used for a request for the same file served by another process
  – Apache 1.3 does use mmap() instead of read()
  – in this case, mmap() eliminates one data copy versus a separate read() & write() combination, but the process will still need to open() and close() the file
Complications (cont)
• Similarly, meta-data info needs to be shared:
  – e.g., file size, access permissions, last modified time, etc.
• While locality is high, cache misses can and do happen:
  – if a previously unseen file is requested, the process can block waiting for disk
• The OS can impose other restrictions:
  – e.g., limits on the number of open file descriptors
  – e.g., sockets typically buffer about 64 KB of data; if a process tries to write() a 1 MB file, it will block until the other end receives the data
• Need to be able to cope with the misses without slowing down the hits
Summary: Outline of a Typical HTTP Transaction
• A server can perform many steps in the process of servicing a request
• Different actions are taken depending on many factors:
  – e.g., 304 Not Modified if the client's cached copy is good
  – e.g., 404 Not Found, 401 Unauthorized
• Most requests are for a small subset of the data:
  – we'll see more about this in the Workload section
  – we can leverage that fact for performance
• The architectural model affects the possible optimizations
  – we'll go into this in more detail in the next section
Chapter 3: Server Architectural Models
Server Architectural Models
Several approaches to server structure:
• Process-based: Apache, NCSA
• Thread-based: JAWS, IIS
• Event-based: Flash, Zeus
• Kernel-based: Tux, AFPA, ExoKernel

We will describe the advantages and disadvantages of each. Fundamental tradeoffs exist between performance, protection, sharing, robustness, extensibility, etc.
Process Model (ex: Apache)
• A process is created to handle each new request:
  – the process can block on the appropriate actions (e.g., socket read, file read, socket write)
  – concurrency is handled via multiple processes
• Quickly becomes unwieldy:
  – process creation is expensive, so a pre-forked pool is created instead
  – an upper limit on the number of processes is enforced, first by the server, eventually by the operating system
  – concurrency is limited by this upper bound
Process Model: Pros and Cons
• Advantages:
  – Most importantly, consistent with the programmer's way of thinking: most programmers think in terms of a linear series of steps to accomplish a task.
  – Processes are protected from one another; one can't nuke data in some other address space. Similarly, if one crashes, the others are unaffected.
• Disadvantages:
  – Slow. Forking is expensive; allocating the stack and VM data structures for each process adds up and puts pressure on the memory system.
  – Difficult to share info across processes; have to use locking.
  – No control over scheduling decisions.
Thread Model (Ex: JAWS)
• Use threads instead of processes. Threads consume fewer resources than processes (e.g., stack, VM allocation).
• Creating and deleting threads is cheaper than processes.
• Similarly, a pre-forked thread pool is created. There may be limits on the number of threads, but hopefully less of an issue than with processes, since fewer resources are required.
Thread Model: Pros and Cons
• Advantages:
  – Faster than processes; creating/destroying is cheaper.
  – Maintains the programmer's way of thinking.
  – Sharing is enabled by default.
• Disadvantages:
  – Less robust: threads are not protected from each other.
  – Requires proper OS support; otherwise, if one thread blocks on a file read, the whole address space blocks.
  – Can still run out of threads if servicing many clients concurrently.
  – Can exhaust certain per-process limits not encountered with processes (e.g., number of open file descriptors).
  – Limited or no control over scheduling decisions.
Event Model (Ex: Flash)
• Use a single process and deal with requests in an event-driven manner, like a giant switchboard.
• Use the non-blocking option (O_NDELAY) on sockets, do everything asynchronously, never block on anything, and have the OS notify us when something is ready:

while (1) {
    accept new connections until none remaining;
    call select() on all active file descriptors;
    for each FD:
        if (fd ready for reading) call read();
        if (fd ready for writing) call write();
}
Event-Driven: Pros and Cons

• Advantages:
  – Very fast.
  – Sharing is inherent, since there's only one process.
  – Don't even need locks as in thread models.
  – Can maximize concurrency in the request stream easily.
  – No context-switch costs or extra memory consumption.
  – Complete control over scheduling decisions.
• Disadvantages:
  – Less robust: a failure can halt the whole server.
  – Pushes against per-process resource limits (like file descriptors).
  – Not every OS has full asynchronous I/O, so the server can still block on a file read. Flash uses helper processes to deal with this (the AMPED architecture).
In-Kernel Model (Ex: Tux)
• Dedicated kernel thread for HTTP requests:
  – One option: put the whole server in the kernel.
  – More likely, just deal with static GET requests in the kernel to capture the majority of requests.
  – Punt dynamic requests to a full-scale server in user space, such as Apache.

[Figure: a user-space server sits above the socket layer and the kernel's TCP/IP/Ethernet stack, with HTTP handled above the user/kernel boundary; a kernel-space server moves the HTTP component below the boundary, next to TCP/IP.]
In-Kernel Model: Pros and Cons
• An in-kernel event model:
  – Avoids transitions to user space, copies across the user-kernel boundary, etc.
  – Leverages already-existing asynchronous primitives in the kernel (the kernel doesn't block on a file read, etc.)
• Advantages:
  – Extremely fast; tight integration with the kernel.
  – A small component without a full server optimizes the common case.
• Disadvantages:
  – Less robust: bugs can crash the whole machine, not just the server.
  – Harder to debug and extend, since kernel programming is required, which is not as well-known as sockets.
  – Similarly, harder to deploy: the APIs are OS-specific (Linux, BSD, NT), whereas sockets & threads are (mostly) standardized.
  – HTTP evolves over time, so kernel code has to be modified in response.
So What’s the Performance?
• The graph shows server throughput for Tux, Flash, and Apache.
• Experiments done on a 400 MHz P/II, gigabit Ethernet, Linux 2.4.9-ac10, 8 client machines, and the WaspClient workload generator.
• Tux is fastest, but Flash is close behind.
Summary: Server Architectures
• There are many ways to code up a server
  – Tradeoffs in speed, safety, robustness, ease of programming and extensibility, etc.
• Multiple servers exist for each kind of model
  – Not clear that a consensus exists.
• A better case can be made for in-kernel servers as devices, e.g., a reverse-proxy accelerator or an Akamai CDN node
• User-space servers have a role:
  – The OS should provide the proper primitives for efficiency
  – Leave HTTP-protocol-related actions in user space
  – In this case, the event-driven model is attractive
• Key pieces of a fast event-driven server:
  – Minimize copying
  – An efficient event notification mechanism
Chapter 4: Event Notification
Event Notification Mechanisms
• Recall how Flash works:
  – One process, many FDs, calling select() on all active socket descriptors.
  – All sockets are set with the O_NDELAY flag (non-blocking)
  – A single address space aids sharing for performance
  – File reads and writes don't have non-blocking support, thus the helper processes (the AMPED architecture)
• The point is to exploit concurrency/parallelism:
  – Can read one socket while waiting to write on another
• Event notification:
  – A mechanism for the kernel and application to notify each other of interesting/important events
  – E.g., connection arrivals, socket closes, data available to read, space available for writing
State-Based: Select & Poll

• select() and poll():
  – State-based: is the socket ready for reading/writing?
  – The select() interface has FD_SET bitmasks turned on/off based on interest
  – poll() uses a simple array: a larger structure but a simpler implementation
• Performance costs:
  – The kernel scans O(N) descriptors to set bits
  – The user application scans O(N) descriptors
  – select() bit manipulation can be expensive
• Problems:
  – Traffic is bursty; connections are not all active at once
    • #(active connections) << #(open connections)
    • Costs are O(total connections), not O(active connections)
  – The application keeps specifying the interest set repeatedly
Event-Based Notification
• Propose an event-based approach, rather than a state-based one:
  – "Something just happened on socket X", rather than "socket X is ready for reading or writing"
  – The server takes an event as an indication the socket might be ready
  – Multiple events can happen on a single socket (e.g., packets draining (implying writable) or accumulating (readable))
• The API has the following:
  – The application notifies the kernel by calling declare_interest() once per file descriptor (e.g., after accept()), rather than multiple times as in select()/poll()
  – The kernel queues events internally
  – The application calls get_next_event() to see changes

Banga, Mogul & Druschel (USENIX '99)
Event-Based Notification (cont)
• Problems:
  – The kernel has to allocate storage for the event queue. Little's Law says it needs to be proportional to the event rate
  – Bursty applications could overflow the queue
  – Multiple events can be addressed by coalescing them per FD, resulting in storage that is O(total connections)
• The application has to change the way it thinks:
  – Respond to events, instead of checking state
  – If events are missed, connections might get stuck
• Evaluation shows it scales nicely:
  – cost is O(active), not O(total)
• Windows NT has something similar, called I/O completion ports
Notification in the Real World
POSIX Real-Time Signals:
  – A different concept: Unix signals are invoked when something is ready on a file descriptor
  – Signals are expensive and difficult to control (e.g., no ordering), so applications can suppress signals and then retrieve them via sigwaitinfo()
  – If the signal queue fills up, events will be dropped. A separate signal is raised to notify the application about signal queue overflow

Problems:
  – If the signal queue overflows, the app must fall back on a state-based approach. Chandra and Mosberger propose signal-per-fd (coalescing events per file descriptor)
  – Only one event is retrieved at a time: Provos and Lever propose sigtimedwait4() to retrieve multiple signals at once
Notification in the Real World (cont)

• Sun's /dev/poll:
  – The app notifies the kernel of interest by writing to the special file /dev/poll
  – The app does an ioctl() on /dev/poll for the list of ready FDs
  – App and kernel are still both state-based
  – The kernel still pays O(total connections) to create the FD list
• Libenzi's /dev/epoll (patch for Linux 2.4):
  – Uses /dev/epoll as the interface, rather than /dev/poll
  – The application writes interest to /dev/epoll and ioctl()'s to get events
  – Events are coalesced on a per-FD basis
  – Semantically identical to RT signals with sig-per-fd & sigtimedwait4()
Real File Asynchronous I/O
• Like setting O_NDELAY (non-blocking) on file descriptors:
  – The application can queue reads and writes on FDs and pick them up later (like dry cleaning)
  – Requires support in the file system (e.g., callbacks)
• Currently doesn't exist on many OSs:
  – A POSIX specification exists
  – Solaris has a non-standard version
  – Linux has it slated for the 2.5 kernel
• Two current candidates on Linux:
  – SGI's /dev/kaio and Ben LaHaise's /dev/aio
• A proper implementation would allow Flash to eliminate its helpers
Summary: Event Notification
• The goal is to exploit concurrency
  – Concurrency in user workloads means the host CPU can overlap multiple events to maximize parallelism
  – Keep the network and disk busy; never block
• Event notification changes applications:
  – from state-based to event-based
  – requires a change in thinking
• The goal is to minimize costs:
  – user/kernel crossings and testing idle socket descriptors
• Event-based notification is not yet fully deployed:
  – Most mechanisms only support network I/O, not file I/O
  – Full deployment of the asynchronous I/O spec should fix this
Chapter 5: Workload Characterization
Workload Characterization
• Why characterize workloads?
  – Gives an idea about traffic behavior ("Which documents are users interested in?")
  – Aids in capacity planning ("Is the number of clients increasing over time?")
  – Aids in implementation ("Does caching help?")
• How do we capture them?
  – Through server logs (typically enabled)
  – Through packet traces (harder to obtain and to process)
Factors to Consider
• Where do I get logs from: client, proxy, or server?
  – Client logs give us an idea, but not necessarily the same picture
  – Same for proxy logs
  – What we care about is the workload at the server
• Is the trace representative?
  – Corporate POP vs. news vs. shopping site
• What kind of time resolution?
  – e.g., second, millisecond, microsecond
• Does the trace/log capture all the traffic?
  – e.g., incoming link only, or one node out of a cluster
Probability Refresher
• Lots of variability in workloads
  – Use probability distributions to express it
  – Want to consider many factors
• Some terminology/jargon:
  – Mean: average of the samples
  – Median: half are bigger, half are smaller
  – Percentiles: dump samples into N bins (the median is the 50th-percentile number)
• Heavy-tailed:
  – Pr[X > x] ~ c x^(-a) as x -> infinity
Important Distributions
Some frequently-seen distributions:

• Normal (mean mu, std. dev. sigma):
  f(x) = 1/(sigma sqrt(2 pi)) * e^(-(x - mu)^2 / (2 sigma^2))

• Lognormal (x >= 0; sigma > 0):
  f(x) = 1/(x sigma sqrt(2 pi)) * e^(-(ln x - mu)^2 / (2 sigma^2))

• Exponential (x >= 0):
  f(x) = lambda * e^(-lambda x)

• Pareto (x >= k; shape a, scale k):
  f(x) = a k^a / x^(a+1)
More Probability
• The graph shows 3 distributions, each with average = 2.
• Note that the average ≠ the median in all cases!
• Different distributions have different "weight" in the tail.
What Info is Useful?
• Request methods: GET, POST, HEAD, etc.
• Response codes: success, failure, not-modified, etc.
• Size of requested files
• Size of transferred objects
• Popularity of requested files
• Number of embedded objects
• Inter-arrival time between requests
• Protocol support (1.0 vs. 1.1)
Sample Logs for Illustration
Name:         Chess 1997          Olympics 1998        IBM 1998            IBM 2001
Description:  Kasparov-Deep Blue  Nagano 1998          Corporate presence  Corporate presence
              event site          Olympics event site
Period:       2 weeks, May 1997   2 days, Feb 1998     1 day, June 1998    1 day, Feb 2001
Hits:         1,586,667           5,800,000            11,485,600          12,445,739
Bytes:        14,171,711          10,515,507           54,697,108          28,804,852
Clients:      256,382             80,921               86,0211             319,698
URLs:         2,293               30,465               15,788              42,874

We'll use statistics generated from these logs as examples.
Request Methods
• KR01: an "overwhelming majority" are GETs, few POSTs
• The IBM2001 trace starts seeing a few 1.1 methods (CONNECT, OPTIONS, LINK), but still very small (1/10^5 %)

         Chess 1997   Olympics 1998   IBM 1998   IBM 2001
GET      96%          99.6%           99.3%      97%
HEAD     04%          00.3%           00.08%     02%
POST     00.007%      00.04%          00.02%     00.2%
Others:  noise        noise           noise      noise
Response Codes
• The table shows the percentage of responses.
• The majority are OK and NOT_MODIFIED.
• Consistent with numbers from AW96, KR01.

Code  Meaning            Chess 1997  Olympics 1998  IBM 1998   IBM 2001
200   OK                 85.32       76.02          75.28      67.72
204   NO_CONTENT         --.--       --.--          00.00001   --.--
206   PARTIAL_CONTENT    00.25       --.--          --.--      --.--
301   MOVED_PERMANENTLY  00.05       --.--          --.--      --.--
302   MOVED_TEMPORARILY  00.05       00.05          01.18      15.11
304   NOT_MODIFIED       13.73       23.24          22.84      16.26
400   BAD_REQUEST        00.001      00.0001        00.003     00.001
401   UNAUTHORIZED       --.--       00.001         00.0001    00.001
403   FORBIDDEN          00.01       00.02          00.01      00.009
404   NOT_FOUND          00.55       00.64          00.65      00.79
407   PROXY_AUTH         --.--       --.--          --.--      00.002
500   SERVER_ERROR       --.--       00.003         00.006     00.07
501   NOT_IMPLEMENTED    --.--       00.0001        00.0005    00.006
503   SERVICE_UNAVAIL    --.--       --.--          00.0001    00.0003
???   UNKNOWN            00.0003     00.00004       00.005     00.0004
Web Servers: Implementation and Performance
Erich Nahum 65
Resource (File) Sizes
• Shows file/memory usage (not weighted by frequency!)
• Lognormal body, consistent with results from AW96, CB96, KR01.
• AW96, CB96: sizes have a Pareto tail; Downey01: sizes are lognormal.
Tails of the File Size Distribution
• Shows the complementary CDF (CCDF) of file sizes.
• Haven’t done the curve fitting, but looks Pareto-ish.
Response (Transfer) Sizes
• Shows network usage (weighted by frequency of requests)
• Lognormal body, Pareto tail, consistent with CBC95, AW96, CB96, KR01
Tails of Transfer Size
• Shows the complementary CDF (CCDF) of transfer sizes.
• Looks somewhat Pareto-like; certainly some big transfers.
Resource Popularity
• Follows a Zipf-like model: p(r) ∝ r^(-alpha) (alpha = 1 is "true" Zipf; other values are "Zipf-like")
• Consistent with CBC95, AW96, CB96, PQ00, KR01
• Shows that caching popular documents is very effective
Number of Embedded Objects
• Mah97: avg 3, 90% are 5 or less
• BC98: Pareto distribution, median 0.8, mean 1.7
• Arlitt98 World Cup study: median 15 objects, 90% are 20 or less
• MW00: median 7-17, mean 11-18, 90% 40 or less
• STA00: median 5, 30 (2 traces), 90% 50 or less
• Mah97, BC98, SCJO01: embedded objects tend to be smaller than container objects
• KR01: median is 8-20, Pareto distribution

Trend seems to be that the number is increasing over time.
Session Inter-Arrivals
• Inter-arrival time between successive requests ("think time"):
  – difference between user requests vs. ALL requests
  – partly depends on definition of the session boundary
• CB96: variability across multiple timescales, "self-similarity"; average load very different from peak or heavy load
• SCJO01: log-normal, 90% less than 1 minute
• AW96: independent and exponentially distributed
• KR01: Pareto with a = 1.5; session arrivals follow a Poisson distribution, but requests follow a Pareto
Protocol Support
• IBM.com 2001 logs:
  – show roughly 53% of client requests are 1.1
• KA01 study:
  – 92% of servers claim to support 1.1 (as of Sep 00)
  – only 31% actually do; most fail to comply with the spec
• SCJO01 show:
  – avg 6.5 requests per persistent connection
  – 65% have 2 connections per page, the rest more
  – 40-50% of objects are downloaded over persistent connections

Appears that we are in the middle of a slow transition to 1.1
Summary: Workload Characterization
• Traffic is variable:
  – responses vary across multiple orders of magnitude
• Traffic is bursty:
  – peak loads much larger than average loads
• Certain files are more popular than others:
  – a Zipf-like distribution captures this well
• Transfers have a two-sided aspect:
  – most responses are small (zero is pretty common)
  – most of the bytes are from large transfers
• Controversy over Pareto vs. log-normal distributions
• Non-trivial for workload generators to replicate
Chapter 6: Workload Generators
Why Workload Generators?

• Allows stress-testing and bug-finding
• Gives us some idea of server capacity
• Allows a scientific process for comparing approaches
  – e.g., server models, gigabit adaptors, OS implementations
• Assumption is that a difference in the testbed translates to some difference in the real world
• Allows the performance debugging cycle:
  measure -> reproduce -> find problem -> fix and/or improve -> measure again
Problems with Workload Generators
• Only as good as our understanding of the traffic
• Traffic may change over time
  – generators must change, too
• May not be representative
  – e.g., are file size distributions from IBM.com similar to mine?
• May ignore important factors
  – e.g., browser behavior, WAN conditions, modem connectivity
• Still, useful for diagnosing and treating problems
How does W. Generation Work?
• Many clients, one server
  – matches the asymmetry of the Internet
• Server is populated with some kind of synthetic content
• Simulated clients produce requests for the server
• Master process controls clients and aggregates results
• Goal is to measure the server, not the client or network
• Must be robust to error conditions
  – e.g., if the server keeps sending "404 Not Found", will the clients notice?
Evolution: WebStone

• The original workload generator, from SGI in 1995
• Process-based workload generator, implemented in C
• Clients talk to the master via sockets
• Configurable: # client machines, # client processes, run time
• Measures several metrics: avg + max connect time, response time, throughput rate (bits/sec), # pages, # files
• 1.0 only does GETs; CGI support added in 2.0
• Static requests over 5 different file sizes:

  Percentage   Size
  35.00        500 B
  50.00        5 KB
  14.00        50 KB
  0.90         500 KB
  0.10         5 MB

www.mindcraft.com/webstone
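The table above is a discrete distribution, so a WebStone-style client can pick a file size by inverting the cumulative weights. This helper (ours, purely for illustration) maps a uniform random number in [0,1) to a size:

```c
/* Map a uniform random u in [0,1) to a WebStone file size using the
 * cumulative form of the table above (35%, 50%, 14%, 0.9%, 0.1%). */
long webstone_size(double u)
{
    static const double cum[]  = { 0.35, 0.85, 0.99, 0.999, 1.0 };
    static const long   size[] = { 500L, 5L*1024, 50L*1024,
                                   500L*1024, 5L*1024*1024 };
    for (int i = 0; i < 5; i++)
        if (u < cum[i])
            return size[i];
    return size[4];   /* unreachable for u in [0,1) */
}
```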
Evolution: SPECWeb96
• Developed by SPEC
  – Standard Performance Evaluation Corporation
  – non-profit group with many benchmarks (CPU, FS)
• Attempt to be more representative
  – based on logs from NCSA, HP, Hal Computers
• 4 classes of files, with a Poisson distribution between each class:

  Percentage   Size
  35.00        0-1 KB
  50.00        1-10 KB
  14.00        10-100 KB
  1.00         100 KB - 1 MB
SPECWeb96 (cont)
• Notion of scaling versus load:
  – the number of directories in the data set doubles as expected throughput quadruples: dirs = sqrt(throughput/5) * 10
  – requests are spread evenly across all application directories
• Process-based WG
• Clients talk to the master via RPCs (less robust)
• Still only does GETs, no keep-alive
www.spec.org/osg/web96
Evolution: SURGE
• Scalable URL Reference GEnerator
  – Barford & Crovella at Boston University CS Dept.
• Much more worried about representativeness; captures:
  – server file size distributions
  – request size distribution
  – relative file popularity
  – embedded file references
  – temporal locality of reference
  – idle periods ("think times") of users
• Process/thread-based WG
SURGE (cont)
• Notion of a "user-equivalent":
  – statistical model of a user
  – active "off" time (between URLs)
  – inactive "off" time (between pages)
• Captures various levels of burstiness
• Not validated, but shows that the load generated is different from SPECWeb96's, with more burstiness in terms of CPU and # active connections

www.cs.wisc.edu/~pb
Evolution: S-client
• Almost all workload generators are closed-loop:
  – client submits a request, waits for the server, maybe thinks for some time, repeats as necessary
• Problem with the closed-loop approach:
  – client can't generate requests faster than the server can respond
  – limits the generated load to the capacity of the server
  – in the real world, arrivals don't depend on server state
    • i.e., real users have no idea about the load on a server when they click on a site, although successive clicks may have this property
  – in particular, can't overload the server
• S-client tries to be open-loop:
  – generates connections at a particular rate
  – independent of server load/capacity
S-Client (cont)

• How is s-client open-loop?
  – connects asynchronously at a particular rate
  – uses the non-blocking connect() socket call
• Did the connect complete within a particular time?
  – if yes, continue normally
  – if not, the socket is closed and a new connect initiated
• Other details:
  – uses a single-address-space event-driven model like Flash
  – calls select() on large numbers of file descriptors
  – can generate large loads
• Problems:
  – client capacity is still limited by active FDs
  – an "arrival" is a TCP connect, not an HTTP request
www.cs.rice.edu/CS/Systems/Web-measurement
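A minimal sketch of the open-loop connect logic (ours, not s-client's actual code): start a non-blocking connect(), give it a deadline, and abandon the socket if the deadline passes so the client can keep up its arrival rate:

```c
#include <errno.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put a socket into non-blocking mode; returns 0 on success. */
int make_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    return flags < 0 ? -1 : fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Open-loop connection attempt: begin a non-blocking connect(), wait
 * at most timeout_ms for the socket to become writable, and close it
 * if it does not.  Returns the fd, or -1 on timeout/error.  (A real
 * client would also check SO_ERROR after select() succeeds.) */
int timed_connect(const struct sockaddr_in *srv, int timeout_ms)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (make_nonblocking(fd) < 0) {
        close(fd);
        return -1;
    }
    if (connect(fd, (const struct sockaddr *)srv, sizeof(*srv)) < 0 &&
        errno != EINPROGRESS) {
        close(fd);
        return -1;
    }
    fd_set wfds;
    FD_ZERO(&wfds);
    FD_SET(fd, &wfds);
    struct timeval tv = { timeout_ms / 1000, (timeout_ms % 1000) * 1000 };
    if (select(fd + 1, NULL, &wfds, NULL, &tv) != 1) {
        close(fd);   /* give up: don't let a slow server throttle us */
        return -1;
    }
    return fd;
}
```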
Evolution: SPECWeb99
• In response to people "gaming" the benchmark, now includes rules:
  – IP maximum segment lifetime (MSL) must be at least 60 seconds (more on this later!)
  – link-layer maximum transmission unit (MTU) must not be larger than 1460 bytes (Ethernet frame size)
  – dynamic content may not be cached
    • not clear that this is followed
  – servers must log requests
    • W3C common log format is sufficient but not mandatory
  – resulting workload must be within 10% of target
  – error rate must be below 1%
• Metric has changed:
  – now "number of simultaneous conforming connections": the rate of a connection must be greater than 320 Kbps
SPECWeb99 (cont)

• Directory size has changed:
  – dirs = (25 + (400000/122000) * simultaneous conns) / 5.0
• Improved HTTP 1.0/1.1 support:
  – keep-alive requests (client closes after N requests)
  – cookies
• Back-end notion of user demographics:
  – used for ad rotation
  – request includes user_id and last_ad
• Request breakdown:
  – 70.00% static GET
  – 12.45% dynamic GET
  – 12.60% dynamic GET with custom ad rotation
  – 04.80% dynamic POST
  – 00.15% dynamic GET calling CGI code
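The directory-size formula above, written out as code (helper name ours):

```c
/* SPECWeb99 data-set scaling: the directory count grows linearly with
 * the number of simultaneous conforming connections. */
int specweb99_dirs(int simultaneous_conns)
{
    return (int)((25.0 + (400000.0 / 122000.0) * simultaneous_conns) / 5.0);
}
```

For example, 100 simultaneous connections call for 70 directories under this formula.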
SPECWeb99 (cont)

• Other breakdowns:
  – 30% HTTP 1.0 with no keep-alive or persistence
  – 70% HTTP 1.0 with keep-alive to "model" persistence
  – still has 4 classes of file size with a Poisson distribution
  – supports Zipf popularity
• Client implementation details:
  – master-client communication now uses sockets
  – code includes sample Perl code for CGI
  – client configurable to use threads or processes
• Much more info on setup, debugging, tuning
• All results posted to the web page
  – including configuration & back-end code

www.spec.org/osg/web99
So how realistic is SPECWeb99?
• We’ll compare a few characteristics:
  – file size distribution (body)
  – file size distribution (tail)
  – transfer size distribution (body)
  – transfer size distribution (tail)
  – document popularity
• Visual comparison only
  – no curve-fitting, r-squared plots, etc.
  – point is to give a feel for accuracy
SpecWeb99 vs. File Sizes
• SpecWeb99: In the ballpark, but not very smooth
SpecWeb99 vs. File Size Tail
• SpecWeb99 tail isn’t as long as real logs (900 KB max)
SpecWeb99 vs.Transfer Sizes
• Doesn’t capture 304 (not modified) responses
• Coarser distribution than real logs (i.e., not smooth)
Spec99 vs.Transfer Size Tails
• SpecWeb99 does OK, although tail drops off rapidly (and in fact, no file is greater than 1 MB in SpecWeb99!).
Spec99 vs. Resource Popularity
• SpecWeb99 seems to do a good job, although tail isn’t long enough
Evolution: TPC-W

• From the Transaction Processing Council (TPC)
  – more known for database workloads like TPC-D
  – metrics include dollars/transaction (unlike SPEC)
  – provides a specification, not source
  – meant to capture a large e-commerce site
• Models an online bookstore:
  – web serving, searching, browsing, shopping carts
  – online transaction processing (OLTP)
  – decision support (DSS)
  – secure purchasing (SSL), best sellers, new products
  – customer registration, administrative updates
• Has a notion of scaling per user:
  – 5 MB of DB tables per user
  – 1 KB per shopping item, 25 KB per item in static images
TPC-W (cont)

• Remote browser emulator (RBE):
  – emulates a single user
  – sends an HTTP request, parses the response, waits ("thinks"), repeats
• Metrics:
  – WIPS: shopping
  – WIPSb: browsing
  – WIPSo: ordering
• Setups tend to be very large:
  – multiple image servers, application servers, load balancer
  – DB back end (typically SMP)
  – example: IBM 12-way SMP w/DB2, 9 PCs w/IIS: $1M

www.tpc.org/tpcw
Summary: Workload Generators
• Only the beginning. Many other workload generators:
  – httperf from HP
  – WAGON from IBM
  – WaspClient from IBM
  – others?
• Both workloads and generators change over time:
  – both started simple, got more complex
  – as workloads change, so must generators
• No single "good" generator
  – SPECWeb99 seems the favorite (a 2002 version is rumored to be in the works)
• Implementation issues similar to servers:
  – they are network-based request producers (i.e., produce GETs instead of 200 OKs)
  – implementation affects capacity planning of clients! (want to make sure clients are not the bottleneck)
Chapter 7: Introduction to TCP
Introduction to TCP
• Layering is a common principle in network protocol design
• TCP is the major transport protocol in the Internet
• Since HTTP runs on top of TCP, much interaction between the two
• Asymmetry in client-server model puts strain on server-side TCP implementations
• Thus, major issue in web servers is TCP implementation and behavior
[Figure: protocol layering: application / transport / network / link / physical]
The TCP Protocol
• Connection-oriented, point-to-point protocol:
  – connection establishment and teardown phases
  – "phone-like" circuit abstraction
  – one sender, one receiver
• Originally optimized for certain kinds of transfer:
  – Telnet (interactive remote login)
  – FTP (long, slow transfers)
  – the Web is like neither of these
• Lots of work on TCP, beyond the scope of this tutorial
  – e.g., know of 3 separate TCP tutorials!
TCP Protocol (cont)
• Provides a reliable, in-order, byte-stream abstraction:
  – recovers lost packets and detects/drops duplicates
  – detects and drops bad packets
  – preserves order in the byte stream; no "message boundaries"
  – full-duplex: bi-directional data flow on the same connection
• Flow- and congestion-controlled:
  – flow control: sender will not overwhelm the receiver
  – congestion control: sender will not overwhelm the network!
  – send and receive buffers
  – congestion and flow-control windows

[Figure: the application writes through the socket layer into the TCP send buffer; data segments flow to the receiver's TCP receive buffer, ACK segments flow back, and the receiving application reads the data]
The TCP Header

Fields enable the following:
• uniquely identifying a connection (the 4-tuple of client/server IP addresses and port numbers)
• identifying a byte range within that connection
• a checksum value to detect corruption
• identifying protocol transitions (SYN, FIN)
• informing the other side of your state (ACK)

[Figure: the 32-bit-wide TCP header layout: source port #, dest port #, sequence number, acknowledgement number, header length, flags (U A P R S F), receiver window size, checksum, urgent-data pointer, options (variable length), application data (variable length)]
Establishing a TCP Connection
• Client sends SYN with initial sequence number (ISN)
• Server responds with its own SYN w/seq number and ACK of client (ISN+1) (next expected byte)
• Client ACKs server's ISN+1
• The ‘3-way handshake’
• All sequence-number arithmetic is modulo 2^32
[Figure: the 3-way handshake: the client's connect() sends SYN(X); the server, in listen() on port 80, replies SYN(Y) + ACK(X+1); the client sends ACK(Y+1); the server's accept() and read() then proceed]
Sending Data
• Sender puts data on the wire:
  – holds a copy in case of loss
  – sender must observe the receiver's flow-control window
  – sender can discard data when the ACK is received
• Receiver sends acknowledgments (ACKs):
  – ACKs can be piggybacked on data going the other way
  – protocol says the receiver should ACK every other packet, in an attempt to reduce ACK traffic (delayed ACKs)
  – the delay should not be more than 500 ms (typically 200)
  – we'll see how this causes problems later
Preventing Congestion
• Sender may not only overrun the receiver, but may also overrun intermediate routers:
  – no way to explicitly know router buffer occupancy, so we need to infer it from packet losses
  – assumption is that losses stem from congestion, namely, that intermediate routers have no available buffers
• Sender maintains a congestion window (CW):
  – never have more than CW of unacknowledged data outstanding (or RWIN of data; the min of the two)
  – successive ACKs from the receiver cause CW to grow
• How CW grows depends on which of 2 phases we're in:
  – slow start: initial state
  – congestion avoidance: steady state
  – switch between the two when CW > slow-start threshold
Congestion Control Principles
• Lack of congestion control would lead to congestion collapse (Jacobson 88).
• Idea is to be a "good network citizen".
• Would like to transmit as fast as possible without loss.
• Probe the network to find the available bandwidth.
• In steady state: linear increase in CW per RTT.
• After a loss event: CW is halved.
• This is called additive increase / multiplicative decrease (AIMD).
• Various papers on why AIMD leads to network stability.
Slow Start
• Initial CW = 1.
• After each ACK, CW += 1.
• Continue until:
  – loss occurs, OR
  – CW > slow-start threshold
• Then switch to congestion avoidance.
• If we detect loss, cut CW in half.
• Result: exponential increase in window size per RTT.

[Figure: slow start: the sender transmits one segment, then two, then four, each round one RTT apart]
Congestion Avoidance
  until (loss) {
      after CW packets ACKed:
          CW += 1;
  }
  ssthresh = CW / 2;
  depending on loss type:
      SACK/fast retransmit:    CW /= 2;  continue;
      coarse-grained timeout:  CW = 1;   go to slow start;

(This is for TCP Reno/SACK; TCP Tahoe always sets CW = 1 after a loss.)
How are losses recovered?
Say a packet is lost (data or ACK!)

• Coarse-grained timeout:
  – sender does not receive an ACK after some period of time
  – the event is called a retransmission time-out (RTO)
  – the RTO value is based on the estimated round-trip time (RTT)
  – RTT is adjusted over time using an exponentially weighted moving average:
      RTT = (1-x)*RTT + x*sample   (x is typically 0.1)
• First done in TCP Tahoe
[Figure: lost-ACK scenario: the sender transmits Seq=92, 8 bytes of data; the receiver's ACK=100 is lost; the retransmission timer expires and the sender resends Seq=92, which is then ACKed]
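The RTT estimator above is a one-line exponentially weighted moving average; as a sketch (function name ours):

```c
/* One EWMA step of the RTT estimate used to set the RTO:
 *   RTT = (1 - x) * RTT + x * sample,   x typically 0.1 */
double rtt_update(double rtt, double sample, double x)
{
    return (1.0 - x) * rtt + x * sample;
}
```

With x = 0.1, a single 200 ms sample moves a 100 ms estimate only to 110 ms, so one delayed packet doesn't whipsaw the timeout.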
Fast Retransmit
• Receiver expects N, gets N+1:
  – immediately sends ACK(N)
  – this is called a duplicate ACK
  – does NOT delay ACKs here!
  – continues sending dup ACKs for each subsequent packet (until N arrives)
• Sender gets 3 duplicate ACKs:
  – infers N is lost and resends it
  – 3 chosen so that out-of-order packets don't trigger fast retransmit accidentally
  – called "fast" since we don't need to wait for a full RTO
[Figure: fast retransmit: SEQ=3000 (size 1000) is lost; SEQ=4000, 5000, and 6000 each elicit a duplicate ACK 3000; after the 3rd dup ACK the sender retransmits SEQ=3000]
Introduced in TCP Reno
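The trigger condition is easy to express; this toy sender-side counter (ours, not from any real stack) fires after the 3rd duplicate ACK:

```c
/* Toy fast-retransmit trigger: track the last ACK value seen and count
 * duplicates; fire (return 1) on the 3rd duplicate ACK. */
struct fr_state {
    unsigned long last_ack;
    int dup_acks;
};

int fr_on_ack(struct fr_state *s, unsigned long ack)
{
    if (ack == s->last_ack) {
        if (++s->dup_acks == 3)
            return 1;            /* retransmit the segment at 'ack' */
    } else {
        s->last_ack = ack;       /* new data ACKed; reset the count */
        s->dup_acks = 0;
    }
    return 0;
}
```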
Other loss recovery methods
• Selective Acknowledgements (SACK):
  – returned ACKs contain an option w/SACK block
  – block says, "got up to N-1 AND got N+1 through N+3"
  – a single ACK can generate a retransmission
• New Reno partial ACKs:
  – a new ACK during fast retransmit may not ACK all outstanding data. Ex:
    • have ACK of 1, waiting for 2-6, get 3 dup ACKs of 1
    • retransmit 2, get ACK of 3, can now infer 4 was lost as well
• Other schemes exist (e.g., Vegas)
• Reno has been prevalent; SACK is now catching on
How about Connection Teardown?
• Either side may terminate a connection. (In fact, a connection can stay half-closed.) Let's say the server closes first (typical in WWW).
• Server sends a FIN with its sequence number (i.e., the FIN is a byte in sequence).
• Client ACKs the FIN ("next expected").
• Client sends its own FIN when ready.
• Server ACKs the client's FIN as well.

[Figure: teardown exchange: the server's close() sends FIN(X); the client replies ACK(X+1); the client's close() later sends FIN(Y); the other side replies ACK(Y+1); the first closer then sits in a timed wait before the connection is fully closed]
The TCP State Machine
• TCP uses a finite state machine, kept by each side of a connection, to keep track of what state the connection is in.
• State transitions reflect inherent races that can happen in the network, e.g., two FINs passing each other in the network.
• Certain things can go wrong along the way, i.e., packets can be dropped or corrupted. In fact, the machine is not perfect; certain problems can arise that were not anticipated in the original RFC.
• This is where timers come in, which we will discuss more later.
TCP State Machine: Connection Establishment
[Figure: connection-establishment state machine: CLOSED goes to SYN_SENT when the client application calls connect() (send SYN), and to LISTEN when the server application calls listen(); LISTEN goes to SYN_RCVD on receiving a SYN (send SYN + ACK); SYN_SENT goes to ESTABLISHED on receiving the SYN & ACK (send ACK); SYN_RCVD goes to ESTABLISHED on receiving the ACK]
• CLOSED: more implied than actual, i.e., no connection
• LISTEN: willing to receive connections (accept call)
• SYN-SENT: sent a SYN, waiting for SYN-ACK
• SYN-RECEIVED: received a SYN, waiting for an ACK of our SYN
• ESTABLISHED: connection ready for data transfer
TCP State Machine: Connection Teardown
[Figure: connection-teardown state machine: active close: ESTABLISHED goes to FIN_WAIT_1 when close() is called (send FIN), to FIN_WAIT_2 on receiving the ACK of the FIN, then to TIME_WAIT on receiving the peer's FIN (send ACK); TIME_WAIT waits 2*MSL (240 seconds) before CLOSED; passive close: ESTABLISHED goes to CLOSE_WAIT on receiving a FIN (send ACK), to LAST_ACK when close() is called (send FIN), and to CLOSED on receiving the final ACK; a simultaneous close passes through CLOSING]
• FIN-WAIT-1: we closed first, waiting for ACK of our FIN (active close)
• FIN-WAIT-2: we closed first, other side has ACKED our FIN, but not yet FIN'ed
• CLOSING: other side closed before it received our FIN
• TIME-WAIT: we closed, other side closed, got ACK of our FIN
• CLOSE-WAIT: other side sent FIN first, not us (passive close)
• LAST-ACK: other side sent FIN, then we did, now waiting for ACK
Summary: TCP Protocol
• Protocol provides reliability in the face of complex network behavior
• Tries to trade off efficiency with being a "good network citizen"
• Vast majority of bytes transferred on the Internet today are TCP-based:
  – Web
  – mail
  – news
  – peer-to-peer (Napster, Gnutella, FreeNet, KaZaA)
Chapter 8: TCP Dynamics
TCP Dynamics

• In this section we'll describe some of the problems you can run into as a WWW server interacting with TCP.
• Most of these affect the response time as seen by the client, not the throughput generated by the server.
• Ideally, a server developer shouldn't have to worry about this stuff, but in practice, we'll see that's not the case.
• Examples we'll look at include:
  – the initial window size
  – the delayed-ACK problem
  – Nagle and its interaction with delayed ACK
  – small receive windows interfering with loss recovery
TCP’s Initial Window Problem
• Recall congestion control:
  – the sender's initial congestion window is set to 1
• Recall delayed ACKs:
  – ACK every other packet
  – set a 200 ms delayed-ACK timer
• Short-term deadlock:
  – sender is waiting for an ACK, since it sent only 1 segment
  – receiver is waiting for a 2nd segment before ACKing
• Problem is worse than it seems:
  – multiple objects per web page
  – IE does not do pipelining!

[Figure: timeline: the client sends GET /index.html; one RTT later the server's 1st segment arrives; the client holds its ACK for the 200 ms delayed-ACK timer; only then does the server send the 2nd and 3rd segments, which are ACKed together]
Solving the IW Problem
Solution: set IW = 2-4 (RFC 2414)
  – didn't affect many BSD systems, since they (incorrectly) counted the connection setup in the congestion-window calculation
  – the delayed ACK still happens, but is now out of the critical path of the download's response time

[Figure: with IW = 2 the server sends the 1st and 2nd segments back-to-back; the client ACKs both immediately (no delayed-ACK stall), and the 3rd segment follows]
Receive Window Size Problem
Recall fast retransmit:

• Amount of data in flight:
  – MIN(cong win, recv win)
  – can't ever have more than that outstanding
• In order for FR to work:
  – enough data has to be in flight
  – after the lost packet, 3 more segments must arrive
  – so 4.5 KB of receive-side buffer space must be available
  – note many web documents are less than 4.5 KB!
Receive Window Size (cont)
• Previous discussion assumes large enough receive windows!
  – early versions of MS Windows had a 16 KB default recv. window
• Balakrishnan et al. 1998:
  – studied server TCP traces from the 1996 Olympic Web Server
  – showed over 50% of clients have a receive window < 10 KB
  – many suffer coarse-grained retransmission timeouts (RTOs)
  – even SACK would not have helped!

[Figure: with RWIN = 2000, only SEQ=3000 (lost) and SEQ=4000 fit in the window; it is illegal for the sender to send more, so no 3rd dup ACK ever arrives and the sender must wait out the RTO before retransmitting SEQ=3000]
Fixing Receive Window Problem
• Balakrishnan et al. 98
  – "right-edge recovery"
  – also proposed by Lin & Kung 98
  – now an RFC (3042, "Limited Transmit")
• How does it work?
  – arrival of a dup ACK means a segment has left the network
  – when a dup ACK is received, send the next segment (not a retransmission)
  – continue with the 2nd and 3rd dup ACKs
  – idea is to "keep the ACK clock flowing" by forcing more duplicate ACKs to be generated
  – claim is that it would have avoided 25% of the coarse-grained timeouts in the 96 Olympics trace

[Figure: right-edge recovery: each duplicate ACK (advertising RWIN = 1000) lets the sender transmit a new segment (SEQ=5000, then SEQ=6000); these generate the 3rd dup ACK, triggering fast retransmit of SEQ=3000 without an RTO]
The Nagle Algorithm
• Different types of TCP traffic exist:
  – some apps (e.g., telnet) send one byte of data, then wait for the ACK
  – others (e.g., FTP) use full-size segments
• Recall the server can write() to a socket at any time
  – once written, should the host stack send? Or should it wait, hoping for more data?
• Sending many small packets is bad for 2 reasons:
  – uses more network bandwidth (raises the ratio of headers to content)
  – uses more CPU (many costs are per-packet, not per-byte)
The Nagle Algorithm
Solution is the Nagle algorithm:
• If a full-size segment of data is available, just send it.
• If a small segment is available, and there is no unacknowledged data outstanding, send it.
• Otherwise, wait until either:
  – more data arrives from above (can coalesce into a packet), or
  – an ACK arrives acknowledging the outstanding data
• Idea is to have at most one small packet outstanding.
Interaction of Nagle & Delayed ACK
• Nagle and delayed ACKs cause (temporary) deadlock:
  – sender wants to send 1.5 segments; sends the first, full one
  – Nagle prevents the second from being sent (since it's not full size, and now we have unACKed data outstanding)
  – sender waits for the delayed ACK from the receiver
  – receiver is waiting for a 2nd segment before sending the ACK
  – similar to the IW=1 problem earlier
• Result: many disable Nagle, via the setsockopt() call.
[Figure: the server write()s a full-size 1st segment, then write()s a half-size 2nd segment that Nagle forbids it from sending; the client waits for a 2nd segment before ACKing, so the exchange stalls for the 200 ms delayed-ACK timer]
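Disabling Nagle is a one-line setsockopt(); a sketch (helper name ours):

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Disable the Nagle algorithm so small writes are sent immediately
 * rather than waiting for outstanding data to be ACKed.
 * Returns 0 on success, -1 on error. */
int disable_nagle(int fd)
{
    int on = 1;
    return setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof(on));
}
```

The cost is that a careless server can now emit a flood of small packets.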
Interaction of Nagle & Delayed ACK
• For example, in WWW servers:
  – the original NCSA server issued a write() for every header
  – Apache does its own buffering to issue a single write() call
  – other servers use writev() (e.g., Flash)
  – if not careful, you can flood the network with small packets
• More of an issue when using persistent connections:
  – closing the connection forces the data out with the FIN bit
  – but persistent connections and 1.0 "keep-alives" are affected
• Mogul and Minshall 2001 evaluate a number of modifications to Nagle to deal with this
• Linux has a similar "TCP_CORK" option:
  – suppresses any non-full segment
  – the application has to remember to disable TCP_CORK when finished
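A sketch of corking (helper name ours); note TCP_CORK is Linux-specific, so this won't compile elsewhere:

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cork (corked = 1) while assembling header + body, then uncork
 * (corked = 0) to release any partial final segment.  Forgetting to
 * uncork leaves the tail of the response sitting in the kernel until
 * the connection closes. */
int set_cork(int fd, int corked)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &corked, sizeof(corked));
}
```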
Summary: TCP Dynamics
• Many ways in which an HTTP transfer can interact with TCP
• Interaction of factors can cause delays in response time as seen by clients
• Hard to shield server developers from having to understand these issues
• Mistakes can cause problems such as flood of small packets
Chapter 9: TCP Implementation
Server TCP Implementation
• In this section we look at ways in which the host TCP implementation is stressed under large web server workloads. Most of these techniques deal with large numbers of connections:
  – looking up arriving TCP segments when there are large numbers of connections
  – dealing with the TIME-WAIT state caused by closing large numbers of connections
  – managing large numbers of timers to support connections
  – dealing with the memory consumption of connection state
• Removing data-touching operations: byte copying and checksums
In the beginning…BSD 4.3
• Recall how demultiplexing works:
  – given a packet, we want to find the connection state (the PCB in BSD)
  – keyed on the 4-tuple of source and destination ports & IP addresses
• Original BSD:
  – used a one-behind cache with linear search to match the 4-tuple
  – assumption was "the next segment is very likely from the same connection"
  – assumed a solitary, long-lived, FTP-like transfer
  – average miss time is O(N/2) (N = length of the PCB list)
[Figure: a one-behind cache in front of a linear PCB list; an arriving packet for IP 10.1.1.2, port 5194 misses the cache and must walk the list from the head]
PCB Hash Tables
• McKenney & Dove, SIGCOMM 92:
  – linear search with a one-behind cache doesn't work well for transaction workloads
  – hashing does much better
  – hash based on the 4-tuple
  – cost: O(1) (constant time)
• BSD added a hash table in the 90's
  – other BSD-derived Unixes (such as AIX) quickly followed
• Algorithmic work on hash tables:
  – e.g., CLR book, "perfect" hash tables
  – none specific to web workloads
  – hash table sizing is problematic
[Figure: PCB hash table: an arriving packet for IP 10.1.1.2, port 5194 hashes directly to its bucket, versus the O(N) search of the linear list]
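A toy version of the bucket computation (the mixing function is illustrative, not what any particular stack uses):

```c
#include <stdint.h>

/* Hash a connection's 4-tuple into one of nbuckets PCB chains,
 * giving O(1) expected lookup instead of an O(N) list walk. */
unsigned pcb_hash(uint32_t laddr, uint16_t lport,
                  uint32_t faddr, uint16_t fport, unsigned nbuckets)
{
    uint32_t h = laddr ^ faddr ^ ((uint32_t)lport << 16) ^ fport;
    h ^= h >> 16;            /* fold high bits into the bucket index */
    return h % nbuckets;
}
```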
Problem of Old Duplicates

• Recall that in the Internet:
  – packets may be arbitrarily duplicated, delayed, and reordered
  – while rare, this case must be accounted for
• Consider the following:
  – two hosts connect, transfer data, close
  – the client starts a new connection using the same 4-tuple
  – a duplicate packet arrives from the first connection
  – that connection has been closed; the state is gone
  – how can you distinguish the duplicate from a new packet?
[Figure: a complete connection (SYN(X), GET /index.html, content + FIN, teardown) is followed by a new connection reusing SYN(X) on the same 4-tuple; a delayed duplicate SYN(X) from the old connection then arrives and cannot be distinguished]
Role of the TIME-WAIT State

• Solution: don't do that!
  – prevent the same 4-tuple from being reused
  – one side must remember the 4-tuple for a period of time, to reject old packets
  – the spec says whoever closes the connection must do this (in the TIME-WAIT state)
  – the period is 2 times the maximum segment lifetime (MSL), after which it is assumed no packet from the previous conversation can still be alive
  – MSL is defined as 2 minutes in RFC 1122
[Figure: the same timeline, but the server holds the 4-tuple in TIME-WAIT for 2 * MSL after the close; the old duplicate SYN (X) is rejected, and a later SYN (Z) starts a fresh connection]
TIME-WAIT Problem in Servers
• Recall that in a WWW server, the server closes the connection!
  – the asymmetry of the client/server model means many clients per server
  – each PCB sticks around for 2*MSL units of time
• Mogul's 1995 CA Election server study:
  – shows large numbers (90%) of PCBs in TIME-WAIT
  – would have been 97% if the server had followed the proper MSL!
• Example: a server doing 1000 connections/sec:
  – assume MSL is 120 seconds and each request takes 1 second
  – the server has 1000 connections in the ESTABLISHED state
  – and 240,000 connections in the TIME-WAIT state!
• FTY99 propose & evaluate 3 schemes:
  – require the client to close (requires changing HTTP)
  – have the client use a new TCP "client close" option (requires changing TCP)
  – have the client send a reset (a browser change; MS did this for a while)
  – claim a 50% improvement in throughput, 85% in memory use
Dealing with TIME-WAIT
• Sorting hash table entries (Aron & Druschel '99):
  – demultiplexing requires that all PCBs in a hash bucket be examined before you can give up and declare a PCB not found
  – since most lookups are for existing connections, most matches will be in the ESTABLISHED state rather than TIME-WAIT
  – can sort each PCB chain so that TIME-WAIT entries are at the end; thus, ESTABLISHED entries are at the front of the chain
[Figure: a PCB hash bucket chain sorted by state — one ESTABLISHED entry at the front of the chain, followed by four TIME_WAIT entries]
Server Timer Management
• Each TCP connection can have up to 5 timers associated with it:
  – delayed ACK, retransmission, persistence, keep-alive, and time-wait
• Original BSD:
  – linear linked list of PCBs
  – fast timer (200 ms): walk all PCBs for the delayed ACK timer
  – slow timer (500 ms): walk all PCBs for all other timers
  – time is kept in relative form, so each walk must subtract time (500 ms) from every PCB for the 4 larger timers
  – cost: O(#PCBs), not O(#active timers)
[Figure: linear PCB list with per-connection timers — delayed ACK in 100 ms, RTO in 2 secs, persist in 10 secs, keep-alive in 1 sec, TIME-WAIT in 30 secs]
Server Timer Management (cont)
• Can again exploit the semantics of the TIME-WAIT state:
  – if PCBs are sorted by state, the delayed ACK timer walk can stop at the first PCB in TIME-WAIT, since ACKs are not delayed for connections in the TIME-WAIT state
  – Aron and Druschel show a 25 percent HTTP throughput improvement using this technique
  – they attribute most of the win to reduced timer processing, but it probably helps PCB lookup as well
[Figure: the same sorted hash bucket chain — the delayed ACK walk stops at the first TIME_WAIT entry instead of traversing the whole chain]
Customized PCB Tables
• Maintain 2 sets of PCBs: normal and TIME-WAIT
  – first done in BSDI in '96
  – still must search both PCB tables on lookup
• Aron & Druschel '99:
  – can compress TIME-WAIT PCBs, since only the ports, addresses, and sequence numbers are needed
  – the normal table still holds full PCB state
  – they show you can save a lot of pinned kernel RAM (from 31 MB to 5 MB, an 82% reduction)
  – this leaves more RAM available for the disk cache, which leads to better performance
[Figure: two hash tables — a regular PCB table holding full state for ESTABLISHED connections, and a separate TIME-WAIT table holding compressed entries]
Scalable Timers: Timing Wheels
• Varghese & Lauck, SOSP 1987:
  – use a hash-table-like structure called a timing wheel
  – events are ordered by relative time in the future
  – given an event at future time T, put it in slot (T mod N)
  – each slot's list is sorted by time (scheme 5)
• On each clock tick:
  – the wheel "turns" one slot (mod N)
  – look at the first item in the chain:
    • if ready, fire it, then check the next
    • if the slot is empty or the item is not ready to fire, all done
  – continue until a non-ready item is encountered (or the end of the list)
[Figure: a timing wheel with N = 10 slots and current time 12 — the wheel pointer sits at slot 2, whose chain holds timers expiring at 12, 22, and 42]
Timing Wheels (cont)
• Variant (scheme 6 in the paper):
  – just insert into the wheel slot; don't bother to sort
  – check all timers in the slot on each tick
• Original SOSP 1987 paper:
  – the premise was more for large-scale simulations
  – which have lots of events happening "in the future"
• Algorithmic costs (assuming a good hash function):
  – O(1) average time for the basic dictionary operations: insertion, cancellation, and per-tick bookkeeping
  – O(N) (N = number of timers) worst case for scheme 6
  – O(log(N)) worst case for scheme 5
• Deployment:
  – used in FreeBSD as of release 3.4 (scheme 6)
  – a variant in Linux 2.4 (a hierarchy of timer wheels with cascading)
  – Aron claims "about the same perf" as his approach
Data-Touching Operations
• Lots of research in the high-speed networking community about how touching data is bad
  – especially as CPU speeds increase relative to memory speeds
• Several ways to avoid data copying:
  – use mmap() as described earlier to cut to one copy
  – use IO-Lite primitives (a new API) to move buffers around
  – use the sendfile() API combined with an integrated zero-copy I/O system in the kernel
• There is also a cost to reading the data for checksums:
  – Jacobson showed how the checksum can be folded into the copy for free, with some complexity on the receive side
  – IO-Lite and the exokernel use checksum caches
  – advanced network cards compute the checksum for you
    • originally on the SGI FDDI card (1995)
    • now on all gigabit adaptors and some 100baseT adaptors
Summary: Implementation Issues
• Scaling problems happen in large WWW servers:
  – asymmetry of the client/server model
  – large numbers of connections
  – large amounts of data transferred
• Approaches fall into one or more categories:
  – hashing
  – caching
  – exploiting common-case behavior
  – exploiting semantic information
  – not touching the data
• Most OS's have added support for these techniques over the last 3 years