

Computer Networks and ISDN Systems 30 (1998) 2179–2192

Measuring proxy performance with the Wisconsin Proxy Benchmark

Jussara Almeida 1, Pei Cao *,1

Department of Computer Sciences, University of Wisconsin-Madison, 1210 West Dayton Street, Madison, WI 53705, USA

Abstract

As Web proxies become increasingly widespread, there is a critical need to establish a benchmark that can compare the performance of different proxy servers and predict their performance in practice. In this paper, we describe the Wisconsin Proxy Benchmark (WPB) and the performance comparison of four proxy implementations using the benchmark. Using the benchmark, we also study the effect of more disk arms on proxy performance, and the effect of low-bandwidth modem client connections. We find that proxy implementations differ significantly in their performance characteristics. In addition, though disk arms appear to be the bottleneck limiting proxy throughput, adding extra disks does not result in a performance improvement for every proxy implementation. Finally, we find that the latency advantage of caching proxies vanishes in the presence of modem connections. © 1998 Elsevier Science B.V. All rights reserved.

Keywords: World Wide Web caching; Proxy server; Benchmarking; Performance analysis

1. Introduction

As the World Wide Web continues to grow, caching proxies become a critical component to handle Web traffic and reduce both network traffic and client latency. However, despite their importance, there has been little understanding of how different proxy servers perform compared with each other, and how they behave under different workloads. The design and implementation of a proxy benchmark is the first step to test and understand the performance characteristics of a proxy server. A benchmark allows customers not only to test the performance of a proxy running on different software and hardware platforms, but also to compare different proxy implementations and choose one that best matches

* Corresponding author. 1 E-mail: {jussara,cao}@cs.wisc.edu.

their requirements. The Wisconsin Proxy Benchmark (WPB) has been developed in an attempt to provide a tool to analyze and predict the performance of different proxy products in real-life situations.

The main feature of WPB is that it tries to replicate the workload characteristics found in real-life Web proxy traces. WPB consists of Web client and Web server processes. First, it generates server responses whose sizes follow the heavy-tailed Pareto distribution described in [1]. In other words, it includes very large files with a non-negligible probability. This is important because the heavy-tailed distribution of file sizes significantly impacts proxy behavior, as the proxy must handle (and store in the local cache) files with a wide range of sizes. Second, the benchmark generates a request stream that has the same temporal locality as those found in real proxy traces. Studies have shown that the probability that a document is requested t requests after the last request to it is

0169-7552/98/$ – see front matter © 1998 Elsevier Science B.V. All rights reserved. PII: S0169-7552(98)00247-5


proportional to 1/t [2,3]. The benchmark replicates this probability distribution and measures the hit ratio of the proxy cache. Third, the benchmark emulates Web server latencies by forcing the server process to delay sending back responses. This is because the benchmark is often run in a local area network, and there is no natural way to incur long latencies when fetching documents from the servers. However, Web server latencies affect the resource requirements at the proxy server, particularly network descriptors. Thus, the benchmark supports configurable server latencies in testing proxy systems.

The main performance data collected by the benchmark are latency, proxy hit ratio and byte hit ratio, and number of client errors. There is no single performance number, since different environments weight the four performance metrics differently. Proxy throughput is estimated by dividing the number of concurrent requests by the request latency.

Using the benchmark, we compare the performance of four popular proxy servers — Apache, CERN, Squid and a CERN-derived commercial proxy — running on the same hardware and OS. We find that caching incurs significant overhead in terms of client latency in all proxy systems. We also find that the different implementation styles of the proxy software result in quite different performance characteristics, including hit ratio, client latency and client connection errors. In addition, the proxy implementations stress the CPU and disks differently.

We then use the benchmark to analyze the impact of adding one extra disk on the overall performance of proxies. We can only experiment with Squid and the CERN-derived proxy, as neither Apache nor CERN allows one to spread the cache storage over multiple disks. The results show that the disk is the main bottleneck during the operation of busy proxies. However, although the bottleneck is alleviated when one extra disk is added, the overall proxy performance does not improve as much as we expected. In fact, Squid's performance remains about the same.

Finally, using a client machine that emulates multiple modem connections [4], we analyze the behavior of the four proxies when they must handle requests sent by clients connected through low-bandwidth connections. The results show that in this case transmission delays are the main component of client latencies and the low bandwidth clearly limits the overall performance. As a consequence, caching does not significantly reduce client latencies.

The paper is organized as follows. Section 2 describes the design and implementation of WPB. Section 3 presents the performance comparison of four popular proxy servers. Sections 4 and 5 discuss the effect of multiple disks and low-bandwidth connections on proxy performance. Section 6 presents related work. Finally, Section 7 discusses future work and concludes.

2. Wisconsin Proxy Benchmark

As the World Wide Web continues to grow, many institutions are considering the installation of proxy servers for performance and security reasons. As the industry moves to meet customer demands and produces many Web proxy server products, there is a critical need for a benchmark that compares the performance of various proxy products. This section describes our effort to build a proxy benchmark called the Wisconsin Proxy Benchmark (WPB).

Fig. 1 shows an overview of the design of WPB and illustrates a typical benchmarking setup. The general setup of the benchmark is that a collection of Web client machines is connected to the proxy system under test, which is in turn connected to a collection of Web server machines. There can be more than one client or server process running on a client or server machine. The client and server processes run the client and server code in the benchmark, instead of running a browser or a Web server. There is also a master process to coordinate the actions of the client processes and generate an overall performance report. Some of the setup parameters are defined in a configuration file. The following sections describe each of the main components of WPB as well as the main performance metrics reported.

2.1. Master process

A master process is needed to coordinate the actions of client processes. Once a client process starts, it sends an 'ALIVE' message to the master process, and then waits for a 'GO' message from the master. Once it receives 'GO', it starts sending


Fig. 1. Design of WPB 1.0. [Figure: client machines running client processes and a master process connect to the proxy system under test, which in turn connects to server machines running server processes.]

HTTP requests to the proxy as described below. The master process waits until all client processes have sent it the 'ALIVE' message, then sends 'GO' to all of them. After sending all 'GO' messages, the master waits for the end of the experiment, when it receives statistics from all clients and servers. Based on these data, it generates an overall performance report. The number of client and server processes that the master has to wait for is given in a configuration file, whose name is passed as the command line argument to the master process. The master process may run on one of the client machines.
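The barrier described above can be sketched as follows. The literal 'ALIVE' and 'GO' messages come from the text, but the wire format, port handling and function names are assumptions, not WPB's actual code:

```python
# A minimal sketch of the master's coordination barrier, assuming a plain
# TCP protocol with literal "ALIVE" and "GO" messages (the paper names the
# messages but not the wire format, so these details are hypothetical).
import socket

def run_master(port, num_clients):
    """Wait for an ALIVE message from every client, then broadcast GO."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(num_clients)
    conns = []
    # Collect one ALIVE per client process before releasing anyone.
    while len(conns) < num_clients:
        conn, _ = srv.accept()
        if conn.recv(16).startswith(b"ALIVE"):
            conns.append(conn)
    # All clients are ready: release them simultaneously.
    for conn in conns:
        conn.sendall(b"GO")
    srv.close()
    return conns  # left open so clients can report statistics later
```

Releasing all clients in one pass is what lets the benchmark start the load on the proxy at (approximately) the same instant on every client.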

2.2. Client processes

The client process runs on the client machine and issues HTTP requests one after another with no thinking time in between. This means that clients send requests as fast as the proxy can handle them. The client process takes the following parameters as command line arguments: URL of the proxy server, number of HTTP requests to issue, seed for the random number generator, and name of the configuration file. The configuration file specifies the Web servers to which the client should send requests. Currently, the clients send HTTP requests in the format of "GET http://server_name:port_number/dummy[filenum].html HTTP/1.0". The server_name, port_number and filenum vary from request to request. Clearly, our client code does not yet include other types of HTTP requests or HTTP/1.1 persistent

connections. We are working on a new version of the benchmark that fixes these limitations.

The client process varies the server_name, port_number and filenum of each request so that the request stream has a particular inherent hit ratio (i.e. the hit ratio under an infinite cache) and follows the temporal locality pattern observed in existing proxy traces. The client process sends requests in two stages. During the first stage, the client sends N requests, where N is the command line argument specifying the number of requests to send. For each request, the client picks a random server machine, picks a random port at the server, and sends an HTTP request with the filenum increasing from 1 to N. Thus, during the first stage there is no cache hit in the request stream, since the file number increases from 1 to N. These requests serve to populate the cache, and also stress the cache replacement mechanism in the proxy. The requests are all recorded in an array that is used in the second stage.

During the second stage, the client also sends N requests, but for each request, it picks a random number and takes different actions depending on the random number. If the number is higher than a certain constant, a new request is issued. If the number is lower than the constant, the client re-issues a request that it has issued before. Thus, the constant is the inherent hit ratio in the request stream. If the client needs to re-issue an old request, it chooses the request that it issued t requests ago with a probability that is proportional to 1/t. More specifically, the


client program maintains the sum of 1/t for t from 1 to the number of requests issued (call it S). Every time it has to issue an old request, it picks a random number r from 0 to 1, calculates r * S, and chooses t such that

sum_{i=1}^{t-1} 1/i  <  r * S  <  sum_{i=1}^{t} 1/i.

In essence, t is chosen with probability 1/(S * t). The above temporal locality pattern is chosen

based on a number of studies of the locality in Web access streams seen by the proxy. (We have inspected the locality curves of the requests generated by our code and found them to be similar to those obtained from traces.) Note that we only capture temporal locality, and do not model spatial locality at all. We plan to include spatial locality models in the future.

Finally, the inherent hit ratio in the second stage of requests can be specified in the configuration file. The default value is 50%.
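The second-stage generation described above can be sketched by inverting the cumulative harmonic weights. Names and structure here are mine, not WPB's actual client code:

```python
# Sketch of WPB's second stage: with probability hit_ratio, re-issue an old
# request, where the request issued t requests ago has weight 1/t; otherwise
# issue a brand-new document number.
import random

def choose_old_request(history, rng):
    """Pick a past request; the one issued t requests ago has weight 1/t."""
    n = len(history)
    s = sum(1.0 / t for t in range(1, n + 1))   # S = sum of 1/t
    r = rng.random() * s                         # r * S
    acc = 0.0
    for t in range(1, n + 1):
        acc += 1.0 / t     # stop at the t where the cumulative sum passes r*S
        if r <= acc:
            return history[n - t]                # issued t requests ago
    return history[0]

def second_stage(history, num_requests, hit_ratio=0.5, seed=1):
    """Issue num_requests, re-referencing old requests with P = hit_ratio."""
    rng = random.Random(seed)
    next_new = max(history) + 1 if history else 1
    stream = []
    for _ in range(num_requests):
        if rng.random() < hit_ratio and history:
            req = choose_old_request(history, rng)   # potential cache hit
        else:
            req = next_new                           # a brand-new document
            next_new += 1
        stream.append(req)
        history.append(req)
    return stream
```

A request re-issued this way lands on a document already in the cache (a hit, under an infinite cache), which is why the constant equals the inherent hit ratio of the stream.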

2.3. Server processes

The server process listens on a particular port on the server machine. When the server receives an HTTP request, it parses the URL and finds the filenum. It then chooses a random number (according to a particular distribution function) as the size of the HTML document, and forks a child process. The child process sleeps for a specified number of seconds, constructs a byte array of the specified size, fills the array with random strings, replies to the HTTP request with the array attached to a pre-filled response header, and then exits. The response header specifies that the document was last modified at a fixed date, and expires in three days. The server process also makes sure that if it is servicing a request that it has serviced before, the file size and array contents stay the same.
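One simple way to guarantee that a replayed request gets the same size and contents is to derive the random generator's seed from the filenum. The paper states the requirement but not the mechanism, so this sketch is an assumption:

```python
# Hypothetical reconstruction: seed the generator with the file number so
# that every replay of the same request reproduces the same response body.
import random
import string

def response_body(filenum, max_size=40 * 1024):
    """Return a deterministic pseudo-random body for a given filenum."""
    rng = random.Random(filenum)        # same filenum -> same byte stream
    size = rng.randrange(1, max_size)   # size drawn from the same seed
    chars = string.ascii_letters
    return "".join(rng.choice(chars) for _ in range(size)).encode()
```

This avoids keeping any per-file state on the server while still making repeated requests byte-identical, which matters for verifying cache correctness.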

The sleeping time in the child process models the response delay seen by the proxy. It is important to model this delay because, in practice, latency in serving HTTP requests affects the resource requirements at the proxy. Originally, we set the sleeping time to be a random variable from 0 to 10 seconds, to reflect the varying latencies seen by the proxy. In our preliminary testing, in order to reduce the variance among experiments, we have fixed the latency to a specific number of seconds that can be set through a command line argument. We are still investigating whether variable latencies would expose different problems in proxy performance from constant latency.

Currently, the server process does not support other types of GET requests, such as conditional GET. The server process also gives a fixed last-modified date and time-to-live for every response. These limitations will be addressed once we learn more about the distribution of these parameters in practice.

The server program uses two different file size distributions. The default distribution is very primitive, consisting of a uniform distribution from 0 to 40 KB for 99% of the requests, and 1 MB for 1% of the requests. The other distribution is the heavy-tailed Pareto distribution, for which two parameters, α and k, must be specified in the configuration file. The parameter k represents the minimum file size, and α determines the average file size according to

α = average / (average − k).   (1)

Typical values of α and k are 1.1 and 3.0 KB, as shown in [1].
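A Pareto generator with these parameters can be sketched with standard inverse-CDF sampling. This is a common construction and an assumption here, since the paper does not show WPB's actual generator:

```python
# Sketch of a heavy-tailed file-size generator: inverse-CDF sampling from a
# Pareto distribution with minimum size k and shape alpha (typical values
# from the text: alpha = 1.1, k = 3.0 KB).
import random

def pareto_size(alpha=1.1, k=3.0, rng=random):
    """Return a file size in KB with P(X > x) = (k/x)**alpha for x >= k."""
    u = rng.random()
    while u == 0.0:          # guard against the zero corner case
        u = rng.random()
    return k / (u ** (1.0 / alpha))

def alpha_from_average(average, k=3.0):
    """Eq. (1): derive the shape parameter from the desired average size."""
    return average / (average - k)
```

With alpha just above 1, the mean is finite but the variance is not, so occasional very large files dominate the byte traffic, which is exactly the property the benchmark needs to stress proxy storage.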

2.4. Configuration file

The configuration file specifies the following: the machine where the master process runs, the port number the master process is listening at, the total number of client processes, the total number of server machines, and the specification of the server processes for each server machine. The specification of server processes includes the name of the server machine, the base port number, and the number of ports at which there should be server processes listening. This way one server machine can host more than one server process. The inherent hit ratio is an optional parameter, as are α and k, the parameters that define the heavy-tailed file size distribution. It is also possible to specify the average file size instead of α; in this case, α can be calculated from Eq. (1). The default minimum file size is 3.0 KB.

2.5. Main performance metrics

At the end of the experiment, the master process receives from clients the total number of requests and the sum of response bytes, and receives from


servers the total number of requests and bytes serviced. Clients also send the average latency and the total number of errors observed during the two phases of the experiment. An error is reported every time a connection between a client and the proxy could not be established. The master process generates an overall performance report that contains the following data: average latency in the first and second phase, total number of errors observed during both phases, hit ratio and byte hit ratio. Proxy throughput, i.e., the number of requests handled per time unit, can be estimated from the average latencies and the number of requests served concurrently.
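As a rough illustration of the report, the metrics can be assembled from the client- and server-side totals. Deriving the hit ratio as the fraction of requests that never reached the servers is my assumption, as is the throughput estimate of concurrent clients divided by average latency; function and field names are hypothetical:

```python
# Hypothetical assembly of the master's report from the collected totals.
def performance_report(client_reqs, client_bytes, server_reqs, server_bytes,
                       avg_latency_s, num_clients):
    # Assumption: every miss is forwarded to a server, so requests that
    # never reached the servers were served from the proxy cache.
    hit_ratio = 1.0 - server_reqs / client_reqs
    byte_hit_ratio = 1.0 - server_bytes / client_bytes
    # Throughput estimate: concurrent requests divided by average latency.
    throughput = num_clients / avg_latency_s
    return {"hit_ratio": hit_ratio,
            "byte_hit_ratio": byte_hit_ratio,
            "throughput_rps": throughput}
```

For example, 40 concurrent clients with a 5 s average latency would give an estimated throughput of 8 requests per second.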

2.6. Measuring the behavior of overloaded proxy servers

We have also incorporated features into WPB to generate traffic with peak loads that exceed the capacity of the proxy, based on the scalable client architecture presented in [5]. We have changed the client process code to have a pair of processes — sender and handler — generating HTTP traffic to the proxy using non-blocking sockets. The sender process is responsible for establishing a connection with the proxy. After the connection is established, it passes the socket descriptor to the handler and sends a new connection request. The handler is responsible for keeping track of all pending connections and receiving the requested files from the proxy. If a connection cannot be established in a reasonable amount of time (500 ms in our implementation), the sender issues a new connection request immediately. Each sender–handler pair manages a predefined number of socket descriptors, so that the number of sockets (simulated clients) trying to establish new connections is kept constant.
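The sender side of this scheme can be sketched with non-blocking sockets and select(). The 500 ms timeout and the fixed number of outstanding attempts come from the text; the loop structure and names are my reconstruction:

```python
# Sketch of the sender in the scalable-client scheme: keep a fixed number of
# non-blocking connect attempts outstanding, retiring any attempt that does
# not complete within 500 ms (details are a reconstruction, not WPB code).
import errno
import select
import socket
import time

def sender_loop(proxy_addr, num_slots, duration_s, on_connected):
    """Keep num_slots connection attempts outstanding until duration_s ends."""
    pending = {}                              # socket -> attempt start time
    deadline = time.time() + duration_s
    while time.time() < deadline:
        # Top up to the configured number of simultaneous attempts.
        while len(pending) < num_slots and time.time() < deadline:
            s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
            s.setblocking(False)
            err = s.connect_ex(proxy_addr)
            if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
                s.close()                     # immediate failure: retry
                continue
            pending[s] = time.time()
        # A socket becomes writable once its connect completes.
        _, writable, _ = select.select([], list(pending), [], 0.05)
        now = time.time()
        for s in writable:
            del pending[s]
            on_connected(s)                   # hand the socket to the handler
        # Attempts older than 500 ms count as errors and are retried.
        for s, started in list(pending.items()):
            if now - started > 0.5:
                del pending[s]
                s.close()
    for s in pending:
        s.close()
```

Because a timed-out slot is refilled immediately, the offered load stays constant even when the proxy is saturated, which is what lets this mode push the proxy past its capacity.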

We have done some experiments with this feature of WPB using different proxies. The main results show that there are many connection time-outs, which are reported as errors. The timeout period used in our implementation (500 ms) is much larger than the maximal round-trip time between client and proxy in our testbed (described in Section 3). Therefore, the proxies were genuinely overloaded and could not handle many of the requests. For some proxies, we observed such a large number of errors that the effective proxy throughput was very low, and the latency was unrealistically low. Since none of the proxies handles the overload well, in the rest of the paper we compare the proxies using the regular WPB (i.e. not using the scalable clients).

3. Performance of example systems

We used WPB to measure the performance of four proxy systems:

Apache version 1.3b2

This is a multiprocess proxy that forks, at startup, a predefined number of processes to handle incoming requests. This number is dynamically adjusted in response to the current load on the proxy. Apache stores cached documents in a four-level directory tree. However, the number of entries in each subdirectory as well as the size of directories at the same level are not fixed. Apache also copies each document to a temporary file in the root cache directory before copying it to its final location.

CERN version 3.0A
This is a multiprocess proxy that forks a new process to handle each request. After handling a single request, the new process terminates. It uses the file system to cache data (Web documents) as well as proxy meta-data, such as expiration times, content type, etc. Web pages are stored in separate files and the directory structure is derived from the structure of the URL: each URL component is translated into a directory name. The directory path for a specific URL is called the URL directory. Metadata for each file, used to determine whether the file has expired, is kept in the corresponding URL directory.

A CERN-derived commercial proxy
This is a multiprocess proxy that at startup forks a predefined number of processes that are responsible for handling all the incoming requests. After servicing a predefined number of requests, a process terminates and gets respawned by a master process. Therefore, the total number of processes remains constant. For our experiments, we set the number of processes to 32, the default value. It uses the file system to cache data and proxy meta-data. The proxy cache is separated into one or more three-level cache sections. Each cache section contains 64 subdirectories and can hold 100 to 250 MB of data. It uses an algorithm to determine the directory where a document should be stored that tries to ensure equal dispersion of documents across the sections. The proxy uses the RSA MD5 algorithm to reduce a URL to 8 characters, which it uses as the file name of the document and to determine the storage directory. This proxy server will be identified as Proxy N.

Squid version 1.1.14
This is a single-process event-driven system using non-blocking network I/O calls. One process handles all incoming requests. It manages its own resources and keeps meta-data about the cache contents in main memory. Therefore, it is not necessary to access the disk to determine if a file is stored in the cache. Squid uses main memory to cache in-transit objects, to cache the most recently used objects and to cache error responses. The cache storage on disk is structured as a three-level cache: there are 16 sections, each one containing 256 subdirectories where files are stored. Squid maps URLs to cache object names using 'fingerprinting'. It implements its own DNS cache and uses a predefined number of 'DNS servers' to which it sends non-blocking DNS requests.

We have run a set of experiments using the Wisconsin Proxy Benchmark 1.0 to analyze how the above systems perform under different loads. We varied the number of client processes and collected statistics for caching and no-caching configurations. Before presenting the main results, we describe the hardware platform on which the experiments were run.

We ran our experiments on a Cluster of Workstations (COW) consisting of 40 Sun SPARC 20s, each with two 66 MHz CPUs, 64 MB of main memory and two 1 GB disks. The COW nodes are connected through 100-BaseT Ethernet links. We ran experiments varying the number of clients from 8 to 96, using four client machines and two server machines. We kept the ratio between the number of client processes and the number of server processes at 4 to 1. The cache size was set to 75 MB. The server latency is set to 3 seconds in all the experiments. The inherent hit ratio in the client request stream is the default value of 50%. Thus, the maximum hit ratio that can be observed in the experiments (which is an average of the two phases) is around 25%.

Fig. 2. Client latency for CERN 3.0A.

Fig. 2 shows, for CERN 3.0A, how the average latency varies as a function of the number of clients in both the caching and no-caching configurations. For the caching configuration, it also shows the latencies in the first and second phases of the experiment. Clearly, under the no-caching configuration, latency increases very slowly, whereas under the caching configuration, it increases with the number of clients.

When a large number of clients are accessing the proxy, a cache hit may take longer than a cache miss. In other words, the server delay is sometimes shorter than the time spent retrieving the cached file from the proxy's disk. It is interesting to note that this is true despite the fact that we try to model transmission overhead by imposing a delay of 3 seconds in the server. For CERN, this limit is 32 clients; after this point, the second-phase latency is larger than the no-caching average latency.

Fig. 3 shows the hit ratio and byte hit ratio as a function of the number of clients. The curves are similar to each other and show a slight decrease as the number of clients increases. Fig. 4 shows the number of connection errors as the number of clients increases. As can be seen, beyond 40 concurrent clients, CERN is unable to handle all the client connection requests. The number of errors increases as the number of clients increases. However, during the second phase of the caching experiment, when cache hits occur, no errors are observed.

Fig. 5 shows the client latencies for Apache. Although the curves are unstable, it is clear that the client latency in the caching configuration increases


Fig. 3. Hit ratio for CERN 3.0A.

Fig. 4. Client errors for CERN 3.0A.

as a function of the number of clients. However, the latency remains constant when caching is disabled. Latency in the second phase of the caching experiments remains shorter than that in the no-caching configuration until the number of clients exceeds 72. After this point, it increases significantly. The degradation of Apache's performance as the number of clients increases is also noteworthy. The latency increases to almost 30 seconds for 88 clients in the first phase of the experiment. This is probably a consequence of thrashing, as the two-phase write operations involve many memory copies. Fig. 6 shows the hit ratio curves for Apache. It seems that the hit ratio in Apache is not significantly affected by the number of clients.

Fig. 5. Latency for Apache 1.3b2.

Fig. 6. Hit ratio for Apache 1.3b2.

No connection errors were observed during the experiments. Fig. 7 shows the latency curves for Proxy N. The

average latency under the no-caching configuration increases linearly with the number of clients. Also, the second-phase latency for the caching experiments is lower than the no-caching latency. Since latency increases linearly with the number of clients even when caching is disabled, we suspect that as the number of clients increases, more requests are delayed waiting for available processes to handle them, because the number of proxy processes is always the same. Fig. 8 shows the hit ratio curves for Proxy N. Its hit ratio degrades very slowly, and the only proxy that has a better hit ratio is Squid (shown next). This may be due to different algorithms for cleaning the cache. Fig. 9 shows the number of connection errors. Clearly, the fixed number of processes results


Fig. 7. Latency for Proxy N.

Fig. 8. Hit ratio for Proxy N.

in a high number of errors during the experiments, especially when there is no hit in the cache. With 32 processes, the proxy is unable to handle all the requests if there are more than 48 clients. If the number of proxy processes is increased to 64, the overall behavior is the same, but the saturation point is shifted to 88 clients. The results show that the number of processes must be carefully tuned in order to minimize the number of errors. As a consequence, Proxy N may have problems handling bursty traffic, since this parameter must be statically chosen.

Fig. 10 shows the latency curves for Squid. These curves behave similarly to those for Apache

Fig. 9. Client error for Proxy N.

Fig. 10. Latency for Squid 1.1.14.

and CERN. However, CERN has a slightly better average latency in the caching experiments. These results are consistent with those presented in [6]. When caching is disabled, Squid performs better. Fig. 11 shows the hit ratio curves. Among all four proxies, Squid achieves the highest hit ratio, maintaining a constant hit ratio independent of the number of clients. The byte hit ratio is unstable due to the variance in document sizes in the experiments.

We also found that Squid treats all incoming requests fairly, resulting in equal response times when server delays are the same. This is in sharp contrast with CERN, whose responses to different client requests vary significantly depending on the process scheduling at the proxy. Thus, event-driven architectures clearly show an advantage in terms of fairness.


Table 1
The effect of multiple disks on Squid performance

Metrics                    One disk      Two disks     No caching
First phase latency (s)    9.00 (2%)     9.88 (1%)     3.65 (0.6%)
Second phase latency (s)   5.44 (1.1%)   5.89 (5%)     3.42 (2%)
First phase errors         0             0             0
Second phase errors        0             0             0
Hit ratio (%)              20.29 (1%)    22.50 (1%)    0
Byte hit ratio (%)         5.67 (22%)    11.52 (9%)    0

Table 2
The effect of multiple disks on Proxy N performance

Metrics                    One disk      Two disks     No caching
First phase latency (s)    9.82 (5%)     8.78 (2%)     6.71 (5%)
Second phase latency (s)   5.42 (4%)     4.91 (5%)     5.34 (6%)
First phase errors         86.67 (4%)    68.67 (7%)    77.67 (17%)
Second phase errors        12.67 (32%)   14.33 (20%)   16.33 (52%)
Hit ratio (%)              21.61 (2%)    22.46 (0.7%)  0
Byte hit ratio (%)         12.18 (56%)   21.70 (31%)   0

Fig. 11. Hit ratio for Squid 1.1.14.

4. Effect of adding disk arms

Several studies have claimed that disks are a main bottleneck in the performance of busy proxies. Thus, we wanted to analyze the impact of spreading the cached files over multiple disks. We expected that by increasing the number of disks, disk queuing would be reduced, the time spent servicing each disk request would be reduced, and the overall performance of the proxy would be improved.

Only two of the four proxy servers in our testbed allowed us to specify multiple directories for cache storage: Squid and Proxy N. In this section, we present results for both Squid and Proxy N when the cache storage is spread over one and two disks. We also compare these results with those collected when caching was disabled. For these experiments, we used the same cache size (75 MB), since we are interested only in understanding the impact of one extra disk on proxy performance. Table 1 shows the results for Squid. Surprisingly, there is no improvement in overall performance when we add an extra disk. In fact, there is a small slowdown in the latency of both phases, while hit ratio remains about the same. Table 2 shows results for Proxy N. For Proxy N, the extra disk yielded a 10% reduction in client latency. Hit ratio remained about the same. In addition, the number of errors in the first phase decreased by 20%. This is probably a side effect of the improvement in latency: because requests are handled faster, new requests can be taken off the pending-connections queue sooner, and the probability of finding that queue full decreases.

Comparing Proxy N's latency for the two-disk and no-caching experiments, it can be seen that adding the extra disk indeed alleviates the disk bottleneck for this proxy. In the no-caching experiment, there is minimal load on the disk. In the caching experiment,



when the proxy has only one disk arm to use, the client latency in the first phase is increased by 46%, mostly attributable to the disk bottleneck. With two disk arms, however, the latency increase is limited to only 30%. Furthermore, with two disk arms the processing of cache hits is sped up and the second-phase latency is significantly improved.
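The 46% and 30% figures follow directly from the first-phase latencies in Table 2; a quick check (the 30% in the text reflects rounding):

```python
# First-phase latencies for Proxy N (Table 2), in seconds
no_cache, one_disk, two_disks = 6.71, 9.82, 8.78

inc_one = (one_disk - no_cache) / no_cache   # caching penalty, one disk arm
inc_two = (two_disks - no_cache) / no_cache  # caching penalty, two disk arms
print(f"one disk: +{inc_one:.0%}, two disks: +{inc_two:.0%}")
# -> one disk: +46%, two disks: +31%
```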

However, despite the improvements, we were surprised by these results, since we expected a more drastic improvement in latency. During those experiments, we also collected disk, processor and paging activity using vmstat and iostat in order to get a better picture of how system resources are consumed by the proxies.

Table 3 shows the main statistics collected by these tools for Proxy N. Disk is clearly a bottleneck, since it was busy over 94% of the time. With the extra disk, read and write requests were spread over two disks and, as a consequence, service time (svc_t), the average time spent servicing a request, dropped significantly due to reduced queuing delay. The average queue length (wait) drops to almost 0 for both disks. The total number of read and write

Table 3
Proxy N resource consumption for the one-disk, two-disk and no-caching experiments

Metrics             One disk        Two disks        No caching

Disk 1
  reads/s           –               10.60 (7%)       –
  writes/s          –               41.43 (10%)      –
  Kbytes read/s     –               43.53 (9%)       –
  Kbytes written/s  –               239.96 (5%)      –
  wait              –               0.068 (31%)      –
  svc_t             –               106.96 (8%)      –
  busy (%)          –               79.11 (9%)       –

Disk 2
  reads/s           7.06 (4%)       7.00 (10%)       0.0106 (153%)
  writes/s          57.47 (4%)      26.08 (12%)      0.53 (4%)
  Kbytes read/s     18.09 (11%)     29.54 (8%)       0.0576 (158%)
  Kbytes written/s  344.09 (1%)     156.53 (16%)     4.02 (5%)
  wait              1.82 (11%)      0.00168 (67%)    0
  svc_t             174.46 (2%)     67.53 (8%)       24.18 (4%)
  busy (%)          94.36 (3%)      48.83 (12%)      1.14 (4%)

cpu idle (%)        76.17 (0.8%)    72.79 (4%)       80.11 (0.7%)
cpu user (%)        6.57 (3%)       7.36 (16%)       6.72 (1%)
cpu system (%)      17.25 (3%)      19.83 (11%)      13.17 (4%)

page-in/s           31.54 (2%)      39.30 (10%)      0.062 (173%)
page-out/s          180.57 (4%)     198.40 (10%)     0.539 (173%)

operations is higher for the two-disk experiment, a consequence of reduced contention for disk access. Disk throughput increases, as does disk bandwidth. Processor and paging activity remain almost the same, with a slight increase in CPU utilization in both user and system modes. However, the statistics show that the load is not evenly distributed between the disks: disk one received a larger number of requests. This may be one explanation for the less-than-expected improvement from the two disk arms, and we are still investigating the issue.

Table 4 shows similar results for Squid. It is clear that the disk bottleneck was reduced when the extra disk was added. The service time (svc_t) was reduced by almost 50% for both disks. The queue length (actv) and busy time were also reduced. As a consequence, disk throughput and bandwidth increased. However, these improvements were not reflected in client latency. Processor utilization remained about the same, but paging activity increases when the extra disk is added. We learned later that a bug in Squid 1.1.14 prevented it from achieving better performance with multiple disks; the bug was fixed in Squid 1.1.20.



Table 4
Squid resource consumption for the one-disk, two-disk and no-caching experiments

Metrics             One disk        Two disks        No caching

Disk 1
  reads/s           –               12.93 (5%)       –
  writes/s          –               17.30 (4%)       –
  Kbytes read/s     –               61.60 (7%)       –
  Kbytes written/s  –               134.48 (3%)      –
  wait              –               0.000778 (23%)   –
  svc_t             –               79.62 (2%)       –
  busy (%)          –               43.71 (3%)       –

Disk 2
  reads/s           14.71 (6%)      13.35 (5%)       0.0797 (12%)
  writes/s          28.98 (12%)     17.56 (3%)       0.663 (14%)
  Kbytes read/s     58.48 (9%)      69.17 (5%)       0.243 (45%)
  Kbytes written/s  207.07 (15%)    134.50 (3%)      6.67 (12%)
  wait              0.80 (18%)      0.0041 (37%)     0
  svc_t             153.23 (11%)    80.56 (2%)       25.69 (10%)
  busy (%)          58.62 (8%)      43.50 (2%)       1.297 (16%)

cpu idle (%)        78.59 (2%)      79.14 (0.5%)     67.88 (8%)
cpu user (%)        7.04 (7%)       6.73 (1%)        10.74 (18%)
cpu system (%)      14.32 (7%)      14.11 (2%)       21.35 (18%)

page-in/s           37.45 (9%)      47.96 (0.7%)     0.0433 (128%)
page-out/s          63.18 (34%)     100.97 (6%)      3.01 (86%)

5. Effect of low-bandwidth client connections

We were also interested in analyzing the impact of low-bandwidth connections on proxy performance. We wanted to analyze how proxies behave when they must handle requests sent by several clients using slow connections, such as modems. We used a modem emulator [4] which delays each transmitted IP packet so as to achieve an effective bandwidth smaller than the one provided by the network connection. This modem emulator, implemented in the Linux operating system, assigns to each process a predefined maximum bandwidth that it can use. The default bandwidth used in our experiments is 28.8 Kbps for each process. We used WPB to measure the performance of the four proxy servers in our testbed when requests were sent by clients running on top of this modified version of Linux. We spread 30 clients over three DEC Celebris XL590 90 MHz Pentium machines, each with 32 MB of main memory and a standard 10 Mbps Ethernet card. We spawned 20 WPB server processes on two COW nodes. We compare the results of these experiments with those collected when the same number of clients were spread over the same machines running an unmodified Linux. The network connecting clients and proxies is non-dedicated and includes one router in the path.
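The per-packet delay such an emulator must introduce is simply the packet's transmission time at the emulated line rate. A quick sketch (the 1500-byte frame size is an assumption for illustration, not taken from [4]):

```python
def packet_delay(size_bytes, bandwidth_bps):
    """Seconds needed to clock one packet out at the emulated line rate."""
    return size_bytes * 8 / bandwidth_bps

# a full 1500-byte Ethernet frame at the two bandwidths compared below
slow = packet_delay(1500, 28_800)       # 28.8 Kbps modem: ~417 ms per packet
fast = packet_delay(1500, 10_000_000)   # 10 Mbps Ethernet: ~1.2 ms per packet
```

The emulated modem is thus over 300 times slower per packet, which is why network transmission dominates every other component of latency in these experiments.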

Tables 5 and 6 show the results obtained for all four proxies. No errors were observed during the experiments for any proxy server. It is clear that the low-bandwidth effect dominates. The latency increases by more than a factor of 2 for all proxies. It is also interesting to notice that the difference between the latencies observed in the first and second phases is smaller for the low-bandwidth connection. This reflects the fact that the time spent in communication between client and proxy is the dominant factor in overall performance, much more important than the delay introduced by cache misses, since the connections between servers and proxy are much faster in our setup. The decrease in hit ratio is also noteworthy, especially for Apache and Proxy N. From the results collected by iostat and vmstat, it is clear that the bottleneck is transmission in the network. Despite the shorter service time, disk throughput is lower for the low-bandwidth experiments. The utilization of both



Table 5
The effect of low bandwidth connections on Squid and Proxy N performance

Metric                      28 Kbps         10 Mbps

Squid
  First phase latency (s)   12.13 (7%)      4.74 (1%)
  Second phase latency (s)  13.19 (42%)     2.45 (2%)
  Hit ratio                 25.01 (0.5%)    25.11 (0.4%)
  Byte hit ratio            21.36 (61%)     19.58 (19%)
  reads/s                   2.44 (51%)      6.68 (3%)
  writes/s                  6.10 (57%)      26.67 (8%)
  svc_t (ms)                65.11 (45%)     182.76 (5%)
  disk busy (%)             11.53 (54%)     45.3 (4%)
  cpu idle (%)              97.98 (1%)      81.78 (2%)
  cpu user (%)              0.51 (65%)      5.56 (6%)
  cpu system (%)            1.46 (60%)      12.68 (8%)

Proxy N
  First phase latency (s)   13.66 (15%)     5.02 (13%)
  Second phase latency (s)  10.39 (9%)      4.29 (67%)
  Hit ratio                 15.24 (2%)      20.86 (9%)
  Byte hit ratio            11.30 (2%)      17.95 (8%)
  reads/s                   1.48 (56%)      6.06 (8%)
  writes/s                  17.22 (11%)     50.80 (23%)
  svc_t (ms)                110.86 (6%)     155.78 (13%)
  disk busy (%)             25.52 (14%)     83.19 (20%)
  cpu idle (%)              95.06 (0.7%)    79.66 (7%)
  cpu user (%)              1.07 (21%)      5.99 (25%)
  cpu system (%)            3.79 (11%)      14.36 (27%)

disk and CPU is much lower for the 28.8 Kbps experiment and, as a consequence, overall proxy throughput degrades.
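The slowdown factors can be read straight off the first-phase latencies reported in Tables 5 and 6:

```python
# First-phase latencies (s) at 28.8 Kbps vs 10 Mbps, from Tables 5 and 6
latency = {
    "Squid":   (12.13, 4.74),
    "Proxy N": (13.66, 5.02),
    "CERN":    (12.61, 4.98),
    "Apache":  (10.78, 5.51),
}
slowdown = {proxy: slow / fast for proxy, (slow, fast) in latency.items()}
for proxy, factor in slowdown.items():
    print(f"{proxy}: {factor:.2f}x slower over modem links")
```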

6. Related work

Benchmarking Web servers is an active research area. Several benchmarks for Web servers have been developed, including WebStone [7] and SPECweb [8]. There are also studies on the overload behavior of these benchmarks and on improving them [5]. However, none of these benchmarks can easily be used to measure proxy performance. We borrowed several ideas from them when designing the Wisconsin Proxy Benchmark. Unlike these benchmarks, however, the Wisconsin Proxy Benchmark incorporates characteristics that make it appropriate for analyzing the performance of proxy servers.

Table 6
The effect of low bandwidth connections on CERN and Apache performance

Metric                      28 Kbps         10 Mbps

CERN
  First phase latency (s)   12.61 (4%)      4.98 (21%)
  Second phase latency (s)  11.19 (21%)     2.92 (7%)
  Hit ratio                 19.12 (24%)     22.98 (7%)
  Byte hit ratio            15.59 (38%)     18.26 (9%)
  reads/s                   0.27 (21%)      1.18 (85%)
  writes/s                  13.64 (10%)     35.48 (85%)
  svc_t (ms)                63.49 (9%)      78.36 (72%)
  disk busy (%)             19.50 (13%)     53.99 (87%)
  cpu idle (%)              87.12 (3%)      30.79 (18%)
  cpu user (%)              2.68 (32%)      15.50 (13%)
  cpu system (%)            10.16 (18%)     53.69 (7%)

Apache
  First phase latency (s)   10.78 (4%)      5.51 (42%)
  Second phase latency (s)  9.75 (7%)       3.00 (20%)
  Hit ratio                 12.89 (4%)      26.06 (40%)
  Byte hit ratio            8.84 (25%)      18.07 (51%)
  reads/s                   6.58 (3%)       9.48 (52%)
  writes/s                  11.31 (57%)     33.64 (76%)
  svc_t (ms)                55.46 (43%)     115.64 (12%)
  disk busy (%)             26.93 (35%)     65.62 (47%)
  cpu idle (%)              94.05 (1%)      78.50 (6%)
  cpu user (%)              0.64 (26%)      2.98 (32%)
  cpu system (%)            5.27 (17%)      18.58 (21%)

There have also been many studies measuring the performance of Squid proxies in real use [6,9]. Our results confirm most of the findings in those studies. Unlike those studies, WPB can be used as a tool to pinpoint the performance problems of a variety of proxy products.

Finally, there have been many studies on the characteristics of proxy traces [2,3,10–12]. Our research benefits from those studies both in designing the benchmark and in understanding the requirements on proxy servers.

7. Conclusion

In this paper, we have described the design of the Wisconsin Proxy Benchmark and the results of using the benchmark to compare four proxy implementations. We also used the benchmark to study the effects of extra disk arms and modem client connections.



Our main findings are the following:

(1) Load on caching proxies can contribute significantly to client latency. In all the proxies we tested, the caching configuration results in higher average latency than the no-caching configuration when the number of clients is high, despite the multi-second server latencies. Thus, though caching proxies are successful at reducing network traffic, existing software is much less successful at reducing client latency.

(2) Process-based proxy implementations must take care to avoid client connection errors. Neither CERN nor Proxy N can handle all requests, and many errors occur. These errors reflect the fact that many requests were dropped due to overflow of the pending-connections queue. For Proxy N, this can be explained by the fact that the number of proxy processes handling all requests is fixed. For CERN, the overhead of forking a new process for each request may be an explanation. On the other hand, no errors were observed for Apache: its dynamic adjustment of the number of processes appears to be successful. Squid does not suffer from this problem because of its event-driven architecture.
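The overflow mechanism can be illustrated with a toy discrete-time simulation (all rates hypothetical, not measured): with a fixed service rate below the arrival rate, the pending-connections queue fills and every excess arrival is dropped, whereas a service rate that matches arrivals (as Apache's process-pool adjustment approximates) drops nothing.

```python
from collections import deque

def simulate(arrivals_per_tick, served_per_tick, backlog, ticks):
    """Toy model of a proxy's pending-connections queue: requests that
    arrive while the queue is full are dropped -- the connection errors
    that WPB counts. Returns (served, dropped)."""
    queue, dropped, served = deque(), 0, 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            if len(queue) < backlog:
                queue.append(1)          # connection accepted into backlog
            else:
                dropped += 1             # backlog full: connection refused
        for _ in range(min(served_per_tick, len(queue))):
            queue.popleft()              # a proxy process takes a request
            served += 1
    return served, dropped

fixed_pool   = simulate(3, 2, 5, 100)    # too few processes: steady drops
adaptive_pool = simulate(3, 3, 5, 100)   # capacity matches load: no drops
```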

(3) In terms of hit ratios, Squid and Apache maintain roughly constant hit ratios across load levels. For both CERN and Proxy N, hit ratio decreases significantly as the number of clients increases. In terms of latency, Apache has the worst performance when the load is high, probably due to paging.

(4) Disk is the main bottleneck in the operation of busy proxies: the disk is busy up to 90% of the time while the CPU is idle more than 70% of the time. Adding an extra disk reduces the disk bottleneck. However, for Squid 1.1.14, this reduction did not translate into an improvement in the overall performance of the proxy, due to an implementation bug. For Proxy N, an improvement of 10% was achieved.

(5) When a proxy must handle requests sent through very low bandwidth connections, the time spent in the network dominates. Both disk and CPU remain idle more than 70% of the time. As a consequence, proxy throughput decreases and client latency increases by more than a factor of two.

Clearly, much more work remains to be done. The current version of the benchmark does not model DNS lookups, HTTP 1.1 persistent connections, conditional GET and other forms of HTTP requests, realistic path names for URLs, or spatial locality in Web accesses (i.e., once a user accesses a document from one Web server, they tend to access other documents at the same server). We are continuing development of the benchmark and hope to eliminate these limitations in future versions. In addition, we need to perform more experiments to better understand the impact of slow modem client connections and ways to improve proxy performance in those contexts. Lastly, we plan to investigate the effect of application-level main-memory caching of hot documents and its proper implementation (e.g., avoiding double buffering with the operating system's file buffer cache).

References

[1] M. Crovella, A. Bestavros, Self-similarity in World Wide Web traffic: evidence and possible causes, in: Proc. 1996 Sigmetrics Conf. on Measurement and Modeling of Computer Systems, Philadelphia, May 1996.

[2] P. Cao, S. Irani, Cost-aware WWW proxy caching algorithms, in: Proc. 1997 USENIX Symp. on Internet Technology and Systems, December 1997, http://www.cs.wisc.edu/cao/papers/gd-size.html

[3] P. Lorenzetti, L. Rizzo, L. Vicisano, Replacement policies for a proxy cache, Technical Report, Universita di Pisa, Italy, October 1996, http://www.iet.unipi.it/luigi/caching.ps.gz

[4] K. Beach, P. Cao, Modem emulator: faking multiple modem connections on a LAN, CS736 Project Report, 1998 ([email protected] for more detail).

[5] G. Banga, P. Druschel, Measuring the capacity of a Web server, in: Proc. 1997 USENIX Symp. on Internet Technology and Systems, December 1997.

[6] C. Maltzahn, K. Richardson, D. Grunwald, Performance issues of enterprise level Web proxies, in: Proc. 1997 ACM Sigmetrics Int. Conf. on Measurement and Modelling of Computer Systems, June 1997, pp. 13–23.

[7] G. Trent, M. Sake, WebSTONE: the first generation in HTTP server benchmarking, Technical Report, MTS, Silicon Graphics Inc., February 1995 (available from http://www-europe.sgi.com/TEXT/Products/WebFORCE/WebStone/paper.html).

[8] SPEC Group, SPECweb96 benchmark, Technical Report, 1996 (available from http://www.spec.org/osg/web96/).

[9] V. Soloviev, Analyzing Squid’s performance, private com-munication, 1998.

[10] B.M. Duska, D. Marwood, M.J. Feeley, The measured access characteristics of World-Wide-Web client proxy caches, in: Proc. USENIX Symp. on Internet Technology and Systems, December 1997.

[11] S. Gribble, E. Brewer, System design issues for Internet middleware service: deduction from a large client trace, in:



Proc. USENIX Symp. on Internet Technology and Systems,December 1997.

[12] T.M. Kroeger, D.D.E. Long, J.C. Mogul, Exploring the bounds of Web latency reduction from caching and prefetching, in: Proc. USENIX Symp. on Internet Technology and Systems, December 1997.

Jussara Almeida received M.S. and B.S. degrees from Universidade Federal de Minas Gerais, Brazil, in 1997 and 1994, respectively. In September 1997, she joined the Computer Sciences Department at the University of Wisconsin-Madison as a graduate student with a scholarship from CNPq/Brazil. She is currently working toward a Ph.D. degree under Prof. Pei Cao's supervision. She is a member of the Wisweb research project, and her research interests include operating systems, networking protocols and the performance of the World Wide Web.

Pei Cao received her Ph.D. degree from Princeton University in 1995 and is currently an assistant professor of Computer Science at the University of Wisconsin-Madison. Her research interests are in distributed systems, including the World Wide Web, operating systems and computer architecture. She leads the WisWeb research group at the University of Wisconsin-Madison 2, whose goal is to build consistent, scalable and high-performance Web services. She has served on numerous conference program committees and is a member of the IEEE Computer Society Task Force on Internetworking.

2 http://www.cs.wisc.edu/wisweb/