


THE USE OF DATA MINING TO PREDICT WEB PERFORMANCE

LESZEK BORZEMSKI

Institute of Information Science and Engineering, Wroclaw University of Technology, Wroclaw, Poland

Web mining is the area of data mining that deals with the extraction of interesting knowledge from World Wide Web data. The purpose of this article is to show how data mining may offer a promising strategy for discovering and building knowledge usable in the prediction of Web performance. We introduce a novel Web mining dimension, Web performance mining, which discovers knowledge about Web performance issues using data mining. The analysis is aimed at the characterization of Web performance as seen by the end users. Our strategy involves discovering knowledge that characterizes Web performance perceived by end users and then making use of this knowledge to guide users in future Web surfing. For that, a predictive model using a two-phase mining procedure is constructed on the basis of clustering and decision tree techniques. The usefulness of the method for the prediction of future Web performance has been confirmed in a real-world experiment, which showed an average correct prediction ratio of about 80%. The WING (Web pING) measurement infrastructure was used for active measurements and data gathering.

INTRODUCTION

Data Mining (DM), also known as Knowledge Discovery in Databases (KDD), means a process of nontrivial extraction of implicit, previously unknown, and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases (Chen et al. 1996). The term "data mining" sometimes refers to the overall KDD process and sometimes only to the methods and techniques used in data pattern extraction and knowledge discovery; the terms DM and KDD are sometimes used interchangeably as well. Here we follow a broader definition that is oriented toward the whole knowledge discovery process, which typically involves the following iterative steps: data selection, data preparation and transformation, data analysis to identify patterns, and the evaluation of mining results (Bose and Mahapatra 2001; Zhang et al. 2003). Mining knowledge from databases has been identified by many researchers as a key research topic in database systems, machine learning, and knowledge management. DM is used for a variety of purposes, ranging from improving service or performance to analyzing and detecting interesting patterns and characteristics in different application domains. Generally, DM is an approach to knowledge generation, which is the first and basic process in knowledge management (Pechenizkiy et al. 2005; Spiegler 2003).

Address correspondence to Leszek Borzemski, Institute of Information Science and Engineering, Wroclaw University of Technology, Wybrzeze Wyspianskiego 27, Wroclaw, 50-370, Poland. E-mail: [email protected]

Cybernetics and Systems: An International Journal, 37: 587–608. Copyright © 2006 Taylor & Francis Group, LLC. ISSN 0196-9722 print/1087-6553 online. DOI: 10.1080/01969720600734586
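The iterative KDD steps listed above (selection, preparation and transformation, pattern analysis, evaluation) can be sketched as a simple pipeline. This is only an illustration of the process shape; the function bodies and the toy data are hypothetical, not part of the article's method:

```python
# A minimal, illustrative sketch of the iterative KDD process. All
# function names and the toy data below are hypothetical.

def select(records):
    # Data selection: keep only complete records.
    return [r for r in records if r is not None]

def prepare(records):
    # Preparation/transformation: normalize values to floats.
    return [float(r) for r in records]

def analyze(records):
    # Pattern analysis: here, a trivial "pattern" -- the mean.
    return sum(records) / len(records)

def evaluate(pattern, threshold=0.0):
    # Evaluation of mining results: accept, or trigger another iteration.
    return pattern > threshold

def kdd(records):
    # One pass through the iterative KDD steps.
    data = prepare(select(records))
    pattern = analyze(data)
    return pattern, evaluate(pattern)

pattern, ok = kdd([1, None, 2, 3])
print(pattern, ok)  # -> 2.0 True
```

In practice the evaluation step feeds back into selection and preparation, which is what makes the process iterative rather than a one-shot pipeline.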

Web mining is the application of DM methods and techniques to discover useful knowledge from World Wide Web data. Web mining now focuses on four main research directions related to the categories of Web data: Web content mining, Web usage mining, Web structure mining, and Web user profile mining (Chakrabarti 2003; Furnkranz 2005; Wang et al. 2005). Typically in Web mining, we analyze such data sources as the content of Web documents (usually text and graphics), Web data logs (e.g., IP addresses, date and time of Web access), data describing the Web structure (i.e., HTML and XML tags), and Web user profile data. Web content mining discovers what Web pages are about and reveals new knowledge from them. Web usage mining concerns the identification of patterns in user navigation through Web pages and is performed for reasons of service personalization, system improvement, and usage characterization (Facca and Lanzi 2005); it is also known as clickstream analysis. Web structure mining investigates how Web documents are structured and discovers the model underlying the link structures of the World Wide Web. Web user profile mining discovers user profiles based on users' behavior on the Web, for example, for the needs of e-commerce recommendation systems (Cho et al. 2002; Kim et al. 2002).

The purpose of this article is to present some of the basic ideas underlying the application of data mining in Web performance evaluation. We propose a method of Web network behavior evaluation and prediction using DM techniques; thereby, we introduce a novel Web mining dimension, Web performance mining, which discovers knowledge about Web performance behavior from Web data. Our aim is to characterize Web performance from the perspective of end users (Web clients, i.e., Web browsers), that is, in the sense of Web server-to-browser throughput. In particular, driven by clients' needs, we focus on predicting the performance behavior of the Web through the knowledge gained in DM. Thanks to the approach presented in this article, we can extract previously unknown knowledge from data to predict an upcoming state of good or poor performance in access to a Web site. Our method involves discovering knowledge that characterizes variability in Web performance and using it in order to guide users in their further Internet exploration.

Several applications can benefit from knowing the possible future performance characteristics of the Web. In fact, network performance prediction allows applications to optimize their operational efficiency, which is directly impacted by network interactions.

This article presents a methodology, tools, and an empirical study of Web performance mining. We want to answer how to develop a simple yet fully usable DM-based predictive model describing Web performance from the perspective of end users. To attain our goal, there are a few separate issues to be dealt with. First, the Web performance problem should be defined so that it is adequate for the end users' performance view. The second question is what Web performance data we should use: how and where do we get, measure, or collect the needed data? The next question is how we predict Web performance parameters using DM.

From the end-user’s perspective, Web performance is measured by

the time between clicking the Web page link and the completion of total

page downloading. This is the period of time from the point at which the

user requests access to the Web page to the point at which the data is

presented on the user’s computer. In order to effectively do his or her

Web surfing, the entire data transfer path including the path through

the Web server, local computer, Web intermediary systems, and the net-

work must present as little delay and big throughput as possible. Down-

loading time depends mostly on the total page size (including embedded

objects), Web page and site design, Web server response, network

latency, and available data transfer rate. However, it has never been easy

to determine whether slow responses perceived by the end-users are due

to either network or end system on both sides, i.e., user and Web server

DATA MINING IN WEB PERFORMANCE PREDICTION 589

sides. Generally, we cannot exactly diagnose and isolate performance

problem’s key sources. Since almost 60–80% of downloading latency

as perceived by users refers to the network bottlenecks issues on the

transmission path between the user and Web host (Cardellini et al.

2002), we focus on Web data transmission performance.

Application-level data measured near a client (or in a similar location) is needed for the evaluation, estimation, and prediction of Web page downloading performance. Generally, the user is mainly interested in what throughput (transfer speed) is achieved while downloading a page. Therefore, in Web page access, there is a major interest in measuring server-to-client HTTP traffic to determine the available bandwidth. Client-side measurements and processing can be made by the client itself, but only by means of special clients, not the commonly used Web browsers. In practice this is a nontrivial task, and a special measurement and processing infrastructure must be provided. To be an effective solution, such an infrastructure should provide its service for a community of clients located nearby. Therefore, we have developed the measurement infrastructure called WING for the active probing, measuring, collecting, and visualization of Web transactions (Borzemski and Nowak 2004b). WING can instantly or periodically probe selected Web servers; collect and store data about Web page downloading; and preprocess that data for further analysis, including statistical analysis and data mining.

Web performance can be measured and evaluated by means of passive observations or active probing (benchmarking). The dataset used in our DM analysis was collected actively by the WING system. We measured periodic downloading of specific Web pages from several Web sites around the world. In Borzemski and Nowak (2004a), the obtained dataset was used to derive a descriptive overall performance model of the Web as seen by end users in the Wroclaw University of Technology campus network. This model was developed using a traditional data analysis approach and showed the correlation between median values of TCP connections' Round-Trip Times (RTTs) and HTTP throughputs over all servers under consideration. Here, we use the same raw dataset in the deployment of a new predictive performance model by means of DM.

The prediction of Internet performance has always been a challenge and a topical issue (Abusina et al. 2005; Arlitt et al. 2005; He et al. 2005). Such prediction might be useful when a Web client schedules its activity in time and space, choosing when to access the Web server and which Web server to select. The decision might be based on the prediction of network performance prior to actually starting the access and data transfer. Examples include peer-to-peer and overlay networks, Web-based distributed computing infrastructures and grids (Baker et al. 2002), as well as distributed corporate portals (Daniel and Ward 2005) used to assess the contributions of intranets, grids, and portals to knowledge management initiatives (Yousaf and Welzl 2005).

Many techniques are used to generate knowledge by means of DM. The core mining techniques are clustering, classification, association, and time series analysis. In our mining we used the clustering and classification mining functions as standardized in the DB2 Intelligent Miner for Data software by IBM (IBM 2005).

The rest of this article is organized as follows. First, we give the background of the research and review the related work. Next, we present the proposed DM-based method to predict Web performance. After that, we overview the measurement methodology and experiment setup and show the application example of our method using real-world Web measurements. Finally, we present the conclusion and describe future work.

BACKGROUND AND RELATED WORK

Performance has always been a key issue in the design and operation of computer systems. This is especially critical with regard to the Internet. The performance of the Web is crucial to user satisfaction, as users expect high quality of Web applications. As e-business is often overwhelmed by performance problems, Web service providers would like to act as NSPs (Network Service Providers) do for their services and adopt the concept of the Quality of Web Service (Cardellini et al. 2002; Casalicchio and Colajanni 2001), which refers to the user's view of Web performance. NSPs often specify service levels, called service level agreements, committed to be provided to the users. The network performance parameters, such as packet loss, delay, jitter, and throughput, are then observed as well as predicted within the framework of network traffic management (Abusina et al. 2005). It is therefore important to understand Web performance issues. Web performance mining might help in this by showing the type of problems and when they can occur. The application of DM methods and algorithms can predict Web performance behavior while the user interacts with the Web. Predictive models are perhaps the most popular results of DM and have proven their usefulness in several applications.

Measurements in the World Wide Web have resulted in many datasets with huge amounts of performance data collected for administrative or operational reasons by means of passive and/or active measurements. They are mostly spatio-temporal datasets organized in the form of time series of categorical or numerical data. Examples of such datasets include the logs and traffic traces from Web servers, e-business sites, and Internet links.

Basically, DM deals with datasets obtained from observational studies, which are connected with passive measurements. In our work, by contrast, we deal with datasets collected in active measurements. In passive measurements, we monitor a network, whereas in active measurements, we generate our own traffic and observe the response. One crucial issue with passive measurements is that they rely on traffic flowing across the link being measured. If we want to gather data over a long period of time, then there is the problem of the size and complexity of the data repositories, and appropriate sampling is needed. Even when collecting all traffic (which is practically impossible), we may end up with mislabeled or incomplete datasets. Moreover, very often we cannot obtain suitable datasets without excitation and experimental design. Therefore, we need to construct an experimental design in such a way as to be able to estimate the effects of network probing. Data usable for Web performance evaluation as considered in this article can be gathered effectively only in appropriately designed active measurements.
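The operational difference matters: an active measurement generates its own request and times the observed response. A minimal sketch of one active probe follows; the URL and the timing granularity are illustrative only (the article's WING system records far more detailed per-transaction timings):

```python
import time
import urllib.request

def throughput_kbps(nbytes, seconds):
    # Bytes transferred divided by transfer time, in KB/s.
    return (nbytes / 1024.0) / seconds

def active_probe(url):
    # Active measurement: generate our own traffic and observe the
    # response -- fetch the object and time the whole transfer.
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        body = resp.read()
    elapsed = time.monotonic() - start
    return len(body), elapsed, throughput_kbps(len(body), elapsed)

# Example (hypothetical URL):
# size, secs, kbps = active_probe("http://example.org/rfc1945.txt")
```

A passive measurement, in contrast, would only observe traffic already flowing across the monitored link, without injecting any request of its own.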

Performance prediction may be done in a short-term or long-term way, using formula-based or history-based algorithms. Short-term forecasting requires the instantaneous measuring of network performance and real-time calculations using a forecasting formula. However, very often we are not able to measure and calculate the performance indexes instantaneously; in such cases we consider long-term forecasting. The purpose of long-term forecasting is to be able to predict, with a high degree of certainty, how the Web (a specific Web server) will perform in the future based on its past performance and discovered knowledge. Accurate long-term forecasting is generally thought of as a time-consuming and tedious process but is essential for almost any Web user. It gives the opportunity to schedule user activities related to the Web, e.g., in grid systems when a group of users works together and shares common Web and network resources. In this article, we deal with long-term prediction based on historical information.

Data analysis in Web and Internet measurement projects is usually performed using traditional statistical analysis. We first introduced DM into the performance prediction analysis of Internet paths in our TRACE project (Borzemski 2004), where we evaluated the IP-related performance of the communication path between the user and Internet host locations on a long-term prediction scale. Using DM, we discovered how the round-trip times of the packets and the number of hops they pass on the routing path may vary with the day of the week and the time of the measurement. After that, we used this knowledge to build a decision tree guiding users to the future characteristics of relevant properties of a given Internet path on a long-term scale.

Users perceive good Internet performance as characterized by low latency and high throughput. The network latency is usually estimated by the RTT, which is the delay between sending the request for data and receiving (the first bit of) the reply. The lower the latency, the faster we can perform low-data-volume activities. The other key element of network performance, throughput, also affects Web applications. Throughput is the "network bandwidth" metric, which gives the actual number of bytes transferred over a network path during a fixed amount of time. Throughput determines the "speed" of a network as perceived by the end user. The higher the throughput of the Internet connection, the faster the user can surf the Internet.

When browsing the Web, users are concerned with the performance of downloading entire pages, which are constructed from the base page and embedded objects. Various factors and solutions impact Web performance, among them Web site architectures, available network throughput, and the browsers themselves. For instance, to speed up Web site service, we can organize a number of servers in a cluster with front-end components called Web switches that distribute incoming requests among the servers (Borzemski and Zatwarnicki 2005). However, it has never been easy to determine whether slow responses are due to network problems, to end-system problems on either side (i.e., the user and server sides), or to both. All these factors may affect the ultimate user-to-server (and vice versa) performance. User-perceived Web quality is extremely difficult to study in an integrated way because we cannot exactly diagnose and isolate the key sources of Web performance problems, which, moreover, are transient and involve very complex relationships between different factors that may influence each other (Cardellini et al. 2002; Casalicchio and Colajanni 2001).

Although Web applications are usually stateless, there are new applications that require predictable Web performance, and they are becoming a considerable portion of Internet traffic. They can be implemented, for example, in Web-based grid infrastructures built within the Internet to aggregate a wide variety of resources, including supercomputers, storage systems, and data sources distributed all over the world, used as a single unified resource for virtual communities or as a service for knowledge management in scientific laboratories (Baker et al. 2002; Lin and Hsueh 2003; Tian and Nakamori 2005). Well-predicted node-to-node TCP/IP throughput could then be a key issue in such applications.

Several active and passive measurement projects have been built on the Internet, e.g., those in Brownlee et al. (2001), CAIDA (2005), Claffy and McCreary (1999), Luckie et al. (2001), MyKeynote (2005), SLAC (2005), and Zhang and Duffield (2001). Mostly they aim at performance problems related to the whole Internet or a significant part of it, collect large amounts of measured data regarding, for instance, round-trip delay among several node pairs over a few hours, days, or months, and use specific measurement and data analysis infrastructures. These projects can build so-called Internet weather reports at the IP level. Most of them only measure the traffic and present the results as aggregated and temporary observations but do not provide any network performance forecasting.

As new grid-based solutions are developed over the Internet, such performance prediction services are needed. The Network Weather Service (NWS), whose functionality is analogous to weather forecasting (Wolski 1998), is used in grids for making predictions of the performance of various resource components, including the network itself, by sending out and monitoring lightweight probes through the network to sink destinations at regular intervals. It is intended to be a lightweight, noninvasive monitoring system. NWS operates over a distributed set of performance sensors (network monitors) from which it gathers readings of the instantaneous network conditions. It can also monitor and forecast the performance of computational resources; NWS sensors also exist for components such as CPU and disk. NWS runs only in UNIX operating system environments and requires much installation and administration work. It uses numerical models to generate short-term forecasts of what the conditions will be for a given time frame. However, the basic NWS prediction techniques are not representative of the transfer speed obtainable for large files (10 MB to 1 GB) and do not support long-term forecasts. New NWS developments address these problems; e.g., Swany and Wolski (2002) show a technique developed for forecasting long HTTP transfers using a combination of short NWS TCP/IP bandwidth probes and previously observed HTTP transfers, particularly for longer-range predictions.

Besides grids, there is the world of peer-to-peer applications, such as Gnutella and resilient overlay networks, which are becoming a large portion of Internet traffic. Such peer-to-peer (P2P) application networks are also built among scientific communities, and such initiatives likewise require well-predictable Internet performance.

In the area of Web performance, probably the best known is the commercial service MyKeynote (MyKeynote 2005). Our WING system can also be used for similar measurements, providing some featured evaluations not available in competing systems (Borzemski 2006).

PROPOSED DATA-MINING-BASED PREDICTION METHOD

In this section, we show how we construct a DM-based Web performance prediction model. We do not want to forecast the particular value of the RTT and throughput at a specific time; rather, we want a prediction of Web performance in the sense of general characteristics on a long-term scale. We classify Web performance into one of several classes. The classes are derived from past data and define distinguishable conditions of Web performance behavior described by the RTT and throughput categories. We assume that the time of day and day of week mainly explain the variability in the RTT and throughput.

We propose to use a two-phase DM-based method in which the clustering mining function is followed by the tree classification mining function in such a way that the result of clustering is the input to classification.

Clustering segments performance data records into groups (clusters) having similar properties. For this type of discovery, we use in this article the neural clustering algorithm, which employs a Kohonen Feature Map neural network (IBM 2005). The result of the clustering function shows the number of detected clusters and the characteristics of the data records that make up each cluster. To partition the dataset so that measurement records with similar characteristics are grouped together, we use the day of the week, the time of day, the average round-trip time, and the throughput as the active attributes participating in the creation of clusters.
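The clustering phase can be illustrated with a small sketch. Note the substitution: the article uses a Kohonen Feature Map as implemented in IBM Intelligent Miner, while the sketch below uses plain k-means purely to show the idea of grouping measurement records over the four active attributes; the records are synthetic and the attribute scaling is glossed over:

```python
import random

# Illustrative stand-in for the clustering phase. The article uses a
# Kohonen Feature Map (IBM Intelligent Miner); plain k-means is shown
# here only to convey the grouping idea. Records are synthetic tuples:
# (day-of-week, time-of-day interval, RTT in ms, throughput in KB/s).
# In practice the attributes would be scaled to comparable ranges.

def kmeans(points, k, iters=20, seed=0):
    rnd = random.Random(seed)
    centers = rnd.sample(points, k)
    for _ in range(iters):
        # Assign each record to its nearest center (squared distance).
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: sum(
                (a - b) ** 2 for a, b in zip(p, centers[c])))
            groups[i].append(p)
        # Recompute each center as the per-attribute mean of its group.
        for i, g in enumerate(groups):
            if g:
                centers[i] = tuple(sum(a) / len(g) for a in zip(*g))
    return centers, groups

records = [(1, 9, 50, 250), (1, 10, 55, 240),
           (6, 22, 150, 60), (7, 23, 160, 55)]
centers, groups = kmeans(records, k=2)
```

On these four records the low-RTT/high-throughput pair and the high-RTT/low-throughput pair end up in separate clusters, which is exactly the kind of distinguishable performance condition the method looks for.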

One of the disadvantages of cluster models is that there are no explicit rules to define each cluster. The model obtained by clustering is thus difficult to implement, and there is no clear understanding of how the model assigns cluster IDs. Therefore, we propose to employ classification, which may give a simpler model of the classes. The induced model consists of patterns, essentially generalizations over the data records, that are useful in distinguishing the classes. Once a model is induced, it can be used to automatically predict the class of other unclassified (future) data records.

The decision tree is one of the most popular classification algorithms in current use in DM. A decision tree is a tree-shaped structure that represents sets of decisions. These decisions generate rules for the classification of a dataset. Trees can be grown to arbitrary accuracy and use validation datasets to avoid spurious detail. They are easy to understand and modify. Moreover, in our situation the tree representation is preferable because it provides explicit, easy-to-understand rules for each cluster for Web users, who are usually non-experts in data mining.

Hence, in the second step of the method, we use the tree classification mining function. The classification builds a decision-making structure (a decision tree). Here, we use a Classification and Regression Tree (CART) technique modified for categorical attributes (IBM 2005). CART segments a dataset by creating two-way splits on the basis of the two attributes, time-of-day and day-of-week. The classes in the decision tree are the cluster IDs obtained in the first step of the method. The decision tree represents the knowledge in the form of IF-THEN rules; one rule can be created for each path from the root to a leaf, and the leaf node holds the class prediction.
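The shape of the resulting model can be sketched as follows. The article's trees are induced by CART from the mined clusters; the splits and cluster IDs below are hypothetical, chosen only to show how each root-to-leaf path reads as one IF-THEN rule over day-of-week and time-of-day:

```python
# Illustrative sketch of the second phase's output: a decision
# structure over day-of-week (1..7) and hour (0..23) whose leaves hold
# cluster IDs. The splits and IDs are hypothetical, not mined results.

def predict_cluster(day_of_week, hour):
    # Each root-to-leaf path is one IF-THEN rule, e.g.,
    # IF day-of-week >= 6 AND hour >= 20 THEN cluster 3.
    if day_of_week <= 5:      # weekday subtree
        if hour < 8:
            return 1          # e.g., "low RTT, high throughput"
        return 2              # e.g., "working hours, congested"
    if hour >= 20:            # weekend subtree
        return 3
    return 1

# Evaluate the rules over a few (day, hour) combinations.
rules = [predict_cluster(d, h) for d in (1, 6) for h in (2, 21)]
print(rules)  # -> [1, 2, 1, 3]
```

A user consulting such a tree before a transfer reads off the predicted performance class directly from the rule matching the planned day and time.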

MEASUREMENT FRAMEWORK

To measure Web performance, we used the WING system developed at our laboratory (Borzemski and Nowak 2004b). Several tools exist to measure different parameters of network and Web performance, but due to our specific needs, we developed the measurement system from scratch. Figure 1 shows the WING architecture. WING is a network measurement system that measures end-to-end Web performance along the path between the Web site and the end user. It is implemented at our university side only.

Figure 1. WING architecture.

WING can send Web page requests to the targeted Web site and monitor the response. It can collect live HTTP trace data near a user workstation, distill all key aspects of each Web transaction during browsing (Figure 2), and store all time-stamped measurements in the database for further analysis. WING uses a real browser running under a user operating system; hence, it perceives Web page downloading in the same manner as a real browser. The system may be freely programmed for periodic measurements using scripts or may be used in ad hoc mode, in which case it returns a visualization of page downloading by showing the HTTP timeline chart and a number of detailed and aggregated data about the downloading process. For the needs of Web performance prediction, it determines the average transfer rate of the HTTP objects downloaded over a TCP connection. WING measures the time interval between the first byte packet and the last byte packet of the object received by the client using that connection. The transfer rate (throughput) is then calculated by dividing the number of bytes transferred by the amount of time taken to transfer them. The FIRST BYTE (Figure 2) is the time between the sending of the GET request and the reception of the first packet of the requested component. The LEFT BYTES is the time spent downloading the rest of the requested object. WING also estimates the RTT from the CONNECT time, the time taken by the browser to form a connection with the server; it is shown in Figure 2 as the time between the SYN packet sent and the SYN-ACK packet received by the client. The current implementation of WING is done for MS IE, which is the most popular Web browser on the Internet; however, the service can monitor the activity of any browser. We should note that browsers download Web pages in different ways, so the Web page downloading time chart, and as a result the actual user-perceived performance, can differ, and a result obtained with one browser may be inadequate for another. Measurements can be made instantly or periodically. The application of instant probing is shown in Borzemski (2006). For the needs of this article, WING was used in periodic Web measurements.
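The two derived quantities described above can be sketched directly from per-transaction timestamps. The formulas (RTT from CONNECT time; throughput as bytes divided by first-to-last-byte time) are the ones stated in the text; the timestamp values themselves are made up for illustration:

```python
# Sketch of the RTT and throughput computations described above, from
# per-transaction timestamps as WING records them. Timestamp values
# below (seconds) are illustrative only.

def rtt_estimate(syn_sent, syn_ack_received):
    # CONNECT time: from sending the SYN to receiving the SYN-ACK.
    return syn_ack_received - syn_sent

def transfer_rate(nbytes, first_byte_time, last_byte_time):
    # Bytes transferred divided by the interval between the first and
    # the last byte packet of the object, in bytes per second.
    return nbytes / (last_byte_time - first_byte_time)

# Example: a 137582-byte object (the size of the rfc1945.txt probe).
rtt = rtt_estimate(0.000, 0.063)            # 63 ms connection setup
rate = transfer_rate(137582, 0.120, 0.620)  # transferred over 0.5 s
```

With these numbers the estimated RTT is 63 ms and the transfer rate about 269 KB/s, i.e., the kind of per-transaction record that feeds the mining pipeline.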

The measurements analyzed in this article were performed from September 4, 2002, to April 9, 2003. In the measurements, we used the rfc1945.txt file as the probe, which was downloaded from several Web sites. The file rfc1945.txt is large enough (its original size is 137582 bytes) to estimate the average transfer rate and yet not too large to overload Internet links and target Web servers. The target servers were chosen randomly via the Google search mechanism. Among a few hundred links found by Google, we chose 209 direct links to that file. After preliminary tests in which some servers died, we started to measure 83 servers. These servers were probed at regular intervals of 2 h 40 min, i.e., 10 times a day over a 24-hour period. Figure 3 shows the testbed configuration and Figure 4 shows a partial list of target servers. More information about the experiment testbed and statistical data analysis of the measurements is given in Borzemski and Nowak (2004a).

Figure 2. A typical Web transaction diagram.

APPLYING THE METHOD TO REAL-LIFE SITUATIONS

In this section, we exemplify how we can build a DM-based prediction

model for Web performance forecasting in real-life situations. The fol-

lowing data preparation road map for DM is proposed. It involves the

server selection, data selection, data preparation and transformation,

clustering, cluster result analysis, cluster characterization using a

decision tree, and evaluation of the mining results. Since our model is valid only for a given Web server connection, first of all we must choose a server for further analysis.

Figure 3. Testbed configuration.

DATA MINING IN WEB PERFORMANCE PREDICTION 599

Here we show the performance prediction model for the server whose network traffic exhibits the greatest degree of self-similarity. Thus the selection of the server

for further DM analysis can be done in the following way. First, we eliminate servers with more than 10% failed Web transactions. The pool of servers is reduced to 63 servers. Next, we purge the dataset by filtering out the data records for those servers that had more than 5 failed measurements per day. We obtain a set of 33 servers. The final selection is

made on the basis of the network traffic self-similarity characteristic (Leland et al. 1994). We chose the server whose traffic exhibits high self-similarity, evaluated for both the RTT and throughput time series. The Hurst parameter H is calculated for the traffic data from each server. Four candidate servers are considered in the final selection: #77, #161, #167, and #181. Server #161 (www.ii.uib.no, Bergen, Norway) is finally selected; its parameter H is around 0.63 for both the RTT and throughput series. Next, in the data selection step, we select

records and clean the data. Only non-error Web transactions are considered. Missing RTT and throughput values are estimated as averages. A data record with RTT > 200 ms is classified as an outlier. Four attributes (fields) are selected for DM: RTT, THROUGHPUT, DAY-OF-WEEK, and TIME-OF-DAY. Thirty-four records are completed, and 75 records are dropped in this step. In the data preparation and transformation

phase, we digitize TIME-OF-DAY into 9 equal-width intervals (00:00–02:40, . . . , 21:20–00:00), DAY-OF-WEEK into 7 equal-width intervals, and RTT into 7 non-equal-width bins with breakpoints (0, 46, 56, 70, 90, 130, 165, 200 ms), and categorize THROUGHPUT with the text labels low, medium, and high, where medium stands for 180–260 KB/s. Figure 5a shows characteristics of the four final candidate servers, and Figure 5b gives a sample database used in DM (before THROUGHPUT categorization).

Figure 4. A partial list of target Web servers.

Now we show the clustering and decision tree mining analysis results for the chosen Web server. We use IBM Intelligent Miner for Data 8.1, and the measurements are stored in a relational table in a DB2 database whose rows contain the records of measurements collected for server #161 at the sampling times. In clustering, we use the following active attributes: RTT, THROUGHPUT, DAY-OF-WEEK, and TIME-OF-DAY. Figure 6 presents 9 clusters (the biggest ones among the 16 clusters derived), which were identified when all records from the dataset are mined, i.e., over the whole time horizon under consideration.

The clusters are ordered according to their size; the smallest one includes about 7% of the records and the biggest one about 18%. The description of the clusters shows how they differ from each other. For instance, cluster #4 (9.41% of the population) defines the set of records where DAY-OF-WEEK is predominantly 4 (Wednesday), TIME-OF-DAY is predominantly 5 (time interval 10:20–13:00), RTT is predominantly 2 (47–55 ms), and THROUGHPUT is medium.
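The cluster descriptions in Figure 6 report, for each attribute, the predominant (modal) value and its share of the cluster. That characterization step can be sketched as follows; the record layout is hypothetical, and Intelligent Miner's demographic clustering itself is not reproduced here.

```python
from collections import Counter

def cluster_profile(records, attrs):
    """Characterize one cluster the way Figure 6 does: for each listed
    attribute, return the predominant value and the fraction of the
    cluster's records that carry it."""
    profile = {}
    for a in attrs:
        counts = Counter(r[a] for r in records)
        value, n = counts.most_common(1)[0]  # modal value and its count
        profile[a] = (value, n / len(records))
    return profile
```

Applied to the records of cluster #4, such a profile would read, e.g., DAY-OF-WEEK predominantly 4 with some share of the cluster, mirroring the textual description above.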

After clustering, we explore classification, using the results of clustering as inputs in decision tree deployment. The general objective of creating the decision-making model is to use it to predict the RTT and throughput behavior most probable in the future.

Figure 5. (a) Server characteristics; (b) sample database.

We assume that the only a priori information to be given in the future is the DAY-OF-WEEK and the TIME-OF-DAY. A fragment of the resulting decision tree for the total dataset is shown in Figure 7. The decision tree achieves an accuracy of 75%. To extract classification rules from the

decision tree, we need to represent the knowledge in the form of IF-THEN rules. One rule is created for each path from the root to a leaf. The leaf node holds the class prediction, and the purity of a leaf node indicates the percentage of correctly predicted records in that node. As an example, from the part of the decision tree shown in Figure 7 we can extract the following classification rule: IF (TIME < 4.5) AND (DAY ≥ 3.5) AND (TIME ≥ 3.5) AND (DAY < 4.5) THEN CLUSTER_ID = 4. It says that if we want to download Web resources from server #161 between 00:00 a.m. and 10:40 a.m. (TIME < 4.5), on Wednesday, Thursday, Friday, or Saturday (DAY ≥ 3.5), and when this is after 8:00 a.m. (TIME ≥ 3.5) and on Sunday, Monday, Tuesday, or Wednesday (DAY < 4.5), then we can expect the network behavior described by cluster #4.

Figure 6. Characteristics of clusters.

Figure 7. A fragment of the decision tree for server #161.
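Read as code, the extracted rule becomes a short predicate; note that for integer-coded attributes its four conditions jointly reduce to DAY = 4 and TIME = 4. The sketch below covers only the one leaf shown in Figure 7; the other leaves of the tree are not reproduced.

```python
def predict_cluster(day: int, time_slot: int):
    """One path of the decision tree, as an IF-THEN rule.
    day: 1..7 (the article implies Wednesday = 4);
    time_slot: 1..9 (2h40m slots from midnight).
    Returns 4 for the leaf of Figure 7, else None."""
    if time_slot < 4.5 and day >= 3.5 and time_slot >= 3.5 and day < 4.5:
        return 4
    return None
```

In other words, the rule fires only for Wednesday mornings between 08:00 and 10:40, the regime that cluster #4 characterizes.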

Further analysis includes moving-window (horizon) and incremental DM. In both analyses, to obtain a single result we make multiple predictions, each time employing the same two-phase mining procedure of clustering and classification, but each time using a specific dataset defined by the "mining window." In incremental DM we use a one-week increment. The DM set starts small (one week of data) at 9/4/2002 and increases incrementally, week by week, up to the whole dataset. The result is shown in Figure 8. The accuracy varies, and we discover "abnormal" network conditions (10/9/02–11/19/02 and after 3/5/03), which are connected with a network reconfiguration, as confirmed by a separate analysis of traceroute data.
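The incremental schedule can be sketched as a generator of growing mining sets. The function below is illustrative: it only produces the (start, cutoff) date pairs that delimit each weekly increment, with the actual two-phase mining applied to each resulting subset.

```python
from datetime import date, timedelta

def weekly_increments(start: date, end: date):
    """Yield (start, cutoff) pairs: the mining set keeps the same start
    and its cutoff advances one week at a time until it covers the full
    range, mirroring the incremental-DM schedule in the article."""
    cutoff = start + timedelta(weeks=1)
    while cutoff <= end:
        yield (start, cutoff)
        cutoff += timedelta(weeks=1)
```

Each yielded pair selects all measurement records with timestamps in [start, cutoff), and the clustering-plus-classification procedure is rerun on that subset.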

Moving-window DM is the approach that addresses timeliness matters in our history-based predictions. Mining over the entire history is likely to be impractical in this application, as the Web is dynamically changing. The mining windows are defined on the basis of the timestamps and include all samples from the dataset stream in the last n units of time (n = 1 week, 2 weeks, . . . , 20 weeks). Such windows are moved over the dataset, starting from the beginning date.

Figure 8. Incremental DM.

The accuracy of the prediction of such moving-window DM for one, six, and twenty weeks is shown in Figures 9, 10, and 11, respectively. As we can see, the accuracy of the prediction varies greatly from one window to another when the window size is equal to one week: recent data from a single week is not enough for a prediction. Six-week windows give fairly high accuracies, around 70–80%. Twenty-week windows give better results, but such long windows could be rather impractical due to the dynamic characteristics of the Web.

Figure 9. Moving window mining (window size = 1).

Figure 10. Moving window mining (window size = 6).


CONCLUSION AND FUTURE WORK

In this article, we introduced a new application of DM connected with the prediction of Web performance, called Web performance mining. We proposed a two-phase mining procedure based on clustering and classification. We demonstrated in a real-world experiment that our approach can play a useful role in Web performance prediction. The sample model gave a fairly high accuracy of about 80%.

There are still challenges in Web performance mining. We would like

to suggest some future research directions: (1) developing new measurement scenarios to obtain datasets characterized by different time-scale samplings, different target servers (server types, localizations,

and Internet connectivity), other WING locations, and other probe

definitions (type, size, and sampling); (2) a consideration of new issues,

performance factors, and DM techniques used in the analysis of data;

and (3) attempting to deal with integration issues in building automated

DM solutions for Web-based Grid applications.

REFERENCES

Abusina, Z. U. M., S. M. S. Zabir, A. Asir, D. Chakraborty, T. Suganuma, and

N. Shiratori. 2005. An engineering approach to dynamic prediction of

network performance from application logs. Int. Network Mgmt. 15:151–162.

Figure 11. Moving window mining (window size = 20).

Arlitt, M., B. Krishnamurthy, and J. C. Mogul. 2005. Predicting short-transfer latency from TCP arcana: A trace-based validation. In Proc. of Internet Measurement Conference IMC'05. Berkeley: USENIX Association, pp. 119–124.

Baker, M., R. Buyya, and D. Laforenza. 2002. Grids and grid technologies for

wide-area distributed computing. Softw. Pract. Exper. 32:1437–1466.

Borzemski, L. 2004. Data mining in evaluation of Internet path performance. In

Innovations in Applied Artificial Intelligence. Proc. 17th International

Conference on Industrial and Engineering Applications of Artificial Intelligence and Expert Systems IEA/AIE 2004. Lecture Notes in Artificial Intelligence, Vol. 3029, edited by B. Orchard, Ch. Yang, and M. Ali. Berlin:

Springer-Verlag, pp. 643–652.

Borzemski, L. 2006. Testing, measuring, and diagnosing Web sites from the

user’s perspective. International Journal of Enterprise Information Systems,

2:54–66.

Borzemski, L. and Z. Nowak. 2004a. An empirical study of Web quality: Measur-

ing the Web from the Wroclaw University of Technology campus. In Engin-

eering advanced Web applications, edited by M. Matera and S. Comai.

Princeton, NJ: Rinton Publishers, pp. 307–320.

Borzemski, L. and Z. Nowak. 2004b. WING: A Web probing, visualization, and

performance analysis service. In Web Engineering, Proc. 4th International

Conference on Web Engineering ICWE 2004. Lecture notes in computer

science, Vol. 3140, edited by N. Koch, P. Fraternali, and M. Wirsing. Berlin:

Springer-Verlag, pp. 601–602.

Borzemski, L. and K. Zatwarnicki. 2005. Using adaptive fuzzy-neural control to

minimize response time in cluster-based Web systems. In Advances in Web

Intelligence. Proc. 3rd Atlantic Web Intelligence Conference AWIC'05. Lecture Notes in Artificial Intelligence, Vol. 3528, edited by P. S. Szczepaniak, J. Kacprzyk, and A. Niewiadomski. Berlin: Springer-Verlag, pp. 63–68.

Bose, I. and R. K. Mahapatra. 2001. Business data mining—A machine learning

perspective. Information & Management, 39:211–225.

Brownlee, N., Kc. Claffy, M. Murray, and E. Nemeth. 2001. Methodology for

passive analysis of a university Internet link. In Proc. Passive and Active

Measurement Workshop, Amsterdam.

CAIDA. 2005. The Cooperative Association for Internet Data Analysis.

http://www.caida.org/home (17 July 2005).

Cardellini, V., E. Casalicchio, M. Colajanni, and P. S. Yu. 2002. The state of the

art in locally distributed Web-server systems. ACM Computing Surveys,

34:263–311.

Casalicchio, E. and M. Colajanni. 2001. A client-aware dispatching algorithm

for Web clusters providing multiple services. In Proc. World Wide Web 10,

Hong Kong, pp. 535–544.

Chakrabarti, S. 2003. Mining the Web: Analysis of Hypertext and Semistructured

Data. San Francisco: Morgan Kaufmann.


Chen, M.-S., J. Han, and P. S. Yu. 1996. Data mining: An overview from a data-

base perspective. IEEE Trans. Knowledge and Data Engineering, 8:866–883.

Cho, Y. H., J. K. Kim, and S. H. Kim. 2002. A personalized recommender system

based on Web usage mining and decision tree induction. Expert Systems with

Applications, 23:329–342.

Claffy, K. C. and S. McCreary. 1999. Internet measurement and data analysis:

Passive and active measurement. University of California, San Diego: CAIDA.

Available: http://www.caida.org/outreach/papers/1999/Nae4hansen/Nae4hansen.html

Daniel, E. and J. Ward. 2005. Enterprise portals: Addressing the organizational

and individual perspectives of information systems. In Proc. Thirteenth Eur-

opean Conference on Information Systems ECIS 2005, edited by D. Bartmann,

F. Rajola, J. Kallinikos, D. Avison, R. Winter, P. Ein–Dor, J. Becker,

F. Bondendorf, and C. Weinhardt. Regensburg, Germany.

Facca, F. and P. Lanzi. 2005. Mining interesting knowledge from weblogs:

A survey. Data & Knowledge Engineering, 53:225–241.

Furnkranz, J. 2005. Web mining. In Data mining and knowledge discovery handbook,

edited by M. Oded and L. Rokach. Berlin: Springer-Verlag, pp. 899–920.

He, Q., C. Dovrolis, and M. Ammar. 2005. On the predictability of large

transfer TCP throughput. In Proc. SIGCOMM’05, New York: ACM Press,

pp. 145–156.

IBM. 2005. DB2 Intelligent Miner. Available: http://www-306.ibm.com/software/data/iminer/tools.html (17 July 2005).

Kim, J. K., Y. H. Cho, W. J. Kim, J. R. Kim, and J. H. Suh. 2002. A personalized

recommendation procedure for Internet shopping support. Electronic Commerce Research and Applications, 1:301–313.

Leland, W., M. Taqqu, W. Willinger, and D. Wilson. 1994. On the self-similar

nature of Ethernet traffic. IEEE/ACM Trans. Networking, 2:1–15.

Lin, F.-R. and C.-M. Hsueh. 2003. Knowledge map creation and maintenance for

virtual communities of practice. In Proc. 36th International Conference on

Systems Sciences HICSS’03, edited by R. Sprague. Los Alamitos: IEEE

Press, P. 69.1.

Luckie, M. J., A. J. McGregor, and H.-W. Braun. 2001. Towards improving

packet probing techniques. In Proc. 1st ACM SIGCOMM Workshop on Inter-

net Measurement, edited by V. Paxson. New York: ACM Press, pp. 145–150.

MyKeynote. 2005. MyKeynote diagnosis page. Available: http://www.mykeynote.com (17 July 2005).

Pechenizkiy, M., A. Tsymbal, and S. Puuronen. 2005. Knowledge management

challenges in knowledge discovery systems. In Proc. 16th International Work-

shop on Database and Expert Systems Applications DEXA’05. Los Alamitos:

IEEE Press, pp. 433–437.


SLAC. 2005. Internet monitoring at Stanford Linear Accelerator Center. Available:

http://www.slac.stanford.edu/comp/net/wan-mon.html (17 July 2005).

Spiegler, I. 2003. Technology and knowledge: Bridging a ‘‘generating’’ gap. Infor-

mation & Management, 40:533–539.

Swany, M. and R. Wolski. 2002. Multivariate resource performance forecasting

in the network weather service. In Proc. of the IEEE/ACM SC2002 Conference. Los Alamitos: IEEE Press, pp. 1–10.

Tian, J. and Y. Nakamori. 2005. Consideration on a service for knowledge man-

agement in scientific laboratories. In Proc. IEEE International Conference on

Services Systems and Services Management. Los Alamitos: IEEE Press, pp.

886–891.

Wang, X., A. Abraham, and K. A. Smith. 2005. Intelligent web traffic mining and

analysis. Journal of Network and Computer Applications, 28:147–165.

Wolski, R. 1998. Dynamically forecasting network performance using the

network weather service. Cluster Computing, 1:119–132.

Yousaf, M. M. and M. Welzl. 2005. A reliable network measurement and prediction architecture for grid scheduling. 1st IEEE/IFIP International Workshop

on Autonomic Grid Networking and Management AGNM’05, edited by M. Z.

Hasan and V. Sander. Barcelona.

Zhang, S., C. Zhang, and Q. Yang. 2003. Data preparation for data mining.

Applied Artificial Intelligence, 17:375–381.

Zhang, Y. and N. Duffield. 2001. On the constancy of Internet path properties. In

Proc. 1st ACM SIGCOMM Workshop on Internet Measurement, edited by

V. Paxson. New York: ACM Press, pp. 197–211.
