The VLDB Journal (2008) 17:401–417. DOI 10.1007/s00778-006-0014-1

    REGULAR PAPER

Predicting WWW surfing using multiple evidence combination

    Mamoun Awad

    Latifur Khan

    Bhavani Thuraisingham

Received: 15 April 2005 / Accepted: 23 September 2005 / Published online: 4 November 2006. © Springer-Verlag 2006

Abstract  The improvement of many applications, such as web search, latency reduction, and personalization/recommendation systems, depends on surfing prediction. Predicting user surfing paths involves tradeoffs between model complexity and predictive accuracy. In this paper, we combine two classification techniques, namely, the Markov model and Support Vector Machines (SVM), and resolve prediction using Dempster's rule. Such fusion overcomes the inability of the Markov model to predict unseen data, as well as the problem of multiclass classification in the case of SVM, especially when dealing with a large number of classes. We apply feature extraction to increase the discriminative power of SVM. In addition, during prediction we employ domain knowledge to reduce the number of classifiers, improving accuracy and reducing prediction time. We demonstrate the effectiveness of our hybrid approach by comparing our results with widely used techniques, namely, SVM, the Markov model, and association rule mining.

    1 Introduction

Surfing prediction is an important research area upon which many application improvements depend. Applications such as latency reduction, web search, and personalization/recommendation systems utilize surfing prediction to improve their performance.

M. Awad (B) · L. Khan · B. Thuraisingham
University of Texas at Dallas, Dallas, Texas, USA
e-mail: [email protected]

L. Khan
e-mail: [email protected]

B. Thuraisingham
e-mail: [email protected]

Latency of viewing web documents is an early application of surfing prediction. Web caching and prefetching methods have been developed to prefetch multiple pages to improve the performance of World Wide Web systems. The fundamental concept behind all these caching algorithms is the ordering of web documents using ranking factors, such as popularity and document size, according to the server's local existing knowledge. Prefetching the highest-ranking documents results in a significant reduction of latency during document viewing [1–4].

Improvements in web search engines can also be achieved using predictive models. Surfers can be viewed as having walked over the entire WWW link structure. The distribution of visits over all WWW pages is computed and used for re-weighting and re-ranking results. Surfer path information is considered more important than the text keywords entered by the surfers; hence, the more accurate the predictive models, the better the search results [5].

In recommendation systems, collaborative filtering (CF) has been applied successfully to find the top k users having the same tastes or interests as a given target user, based on that user's records [6]. The k-Nearest-Neighbor (kNN) approach is used to compare a user's historical profile and records with the profiles of other users to find the top k similar users. Using association rules, Mobasher et al. [7] propose a method that matches an active user session with frequent itemset sessions and predicts the next page the user is likely to visit. These CF-based techniques suffer from well-known limitations, including scalability and efficiency [7,8]. Pitkow and Pirolli [9] explore pattern extraction and pattern matching based on a Markov model that predicts future surfing paths. Longest repeating subsequences (LRS) are proposed to reduce the model complexity (not the predictive accuracy) by focusing on significant surfing patterns.

There are several problems with the current state-of-the-art solutions. First, the predictive accuracy of a proposed solution, such as the Markov model, is low; for example, the maximum training accuracy is 41% [9]. Second, prediction using association rule mining (ARM) and LRS pattern extraction is done by choosing the path with the highest probability in the training set; hence, any new surfing path is misclassified because the probability of such a path occurring in the training set is zero. Third, the sparse nature of the user sessions used in training can result in unreliable predictors [6,10]. Fourth, WWW prediction is a multiclass problem, and prediction can resolve to too many classes. Most multiclass techniques, such as one-vs-one and one-vs-all, are based on binary classification, and to resolve a prediction it is required to consult all classifiers. In WWW prediction, we generate a large number of classifiers because the number of classes is very large (11,700 classes in our experiments); hence, prediction accuracy is low [11] because the prediction model fails to choose the correct class. Finally, many of the previous methods have ignored domain knowledge as a means for improving prediction. Domain knowledge plays a key role in improving the predictive accuracy because it can be used to eliminate irrelevant classifiers or to reduce their effectiveness by assigning them lower weights.

In this paper, we combine two powerful classification techniques, namely, support vector machines (SVM) and the Markov model, to increase the predictive accuracy in WWW prediction. The Markov model is a powerful technique for predicting seen data; however, it cannot predict unseen data (see Sect. 6). On the other hand, SVM is a powerful technique that can predict not only seen data but also unseen data. However, when dealing with too many classes, prediction might take a long time because we need to consult a large number of classifiers. Also, when one data point may belong to many classes (for example, after visiting the web pages p1, p2, p3, one user might go to page 10 while another might go to page 100), the predictive power of SVM may decrease. To overcome these drawbacks of SVM, we apply feature extraction to extract relevant features from web pages to train SVM. We also extract domain knowledge from the training set and incorporate this knowledge in the testing set to improve the prediction of SVM. By fusing both classifiers, namely SVM and the Markov model, we overcome the major drawbacks of each technique and improve the predictive accuracy.

The contribution of this paper is as follows. First, we overcome the drawbacks of SVM in WWW prediction by applying feature extraction to the user paths; we also incorporate domain knowledge in SVM prediction. Second, we fuse the SVM and Markov model classifiers in a hybrid prediction approach using Dempster's rule [12] to overcome the drawbacks of both classifiers. Finally, we compare our approach with different approaches, namely, the Markov model, ARM, and SVM, on a standard benchmark dataset and demonstrate the superiority of our method.

The organization of the paper is as follows. In Sect. 2, we discuss related work regarding the surfing prediction problem. In Sect. 3, we present the feature extraction process. In Sect. 4, we present the different prediction models used in surfing prediction. In Sect. 5, we present a new hybrid approach combining SVM and the Markov model in WWW prediction, using Dempster's rule for evidence combination. In Sect. 6, we compare our results with other methods using a standard benchmark training set. In Sect. 7, we summarize the paper and outline future research.

    2 Related work

Several predictive methods and models have been developed to first accurately predict the surfing path and then reduce the latency of viewing web documents. Various caching and prefetching techniques [1–4] are some of the early methods developed to reduce surfing latency. The most advanced and intelligent methods [9,13,14] for predicting the surfing path are developed using the knowledge acquired from a surfer's previous path history. In this section, we first discuss in detail several well-known and widely used web surfing path prediction techniques and their usefulness in recent practice.

The fundamental concept behind caching and prefetching algorithms is to order various web documents using ranking factors, such as popularity and document size, according to the server's local existing knowledge. The highest-ranking documents are then prefetched in advance, which results in a significant reduction of latency during document viewing. Yang et al. [1] further use a so-called page replacement policy, which specifies conditions under which a new page will replace an existing one. To further improve the accuracy of caching and prefetching of web documents, this web caching algorithm is combined with a web log-mining technique [9]. Chinen and Yamaguchi [2] present an interactive method to reduce network latency by prefetching only popular documents. They use embedded hyperlinks to prefetch the referenced pages. This method is later improved by Duchamp [3] by associating a frequency factor with those hyperlinks. Chen and co-workers [4,15] propose an integration of web caching and prefetching algorithm (IWCP). The IWCP is a cache replacement algorithm that takes into consideration the prefetching rules governed by the prefetching schemes; hence, it is an integration scheme of both caching and prefetching.

As the caching techniques started to lose their lustre, several recent studies were conducted to develop intelligent methods for prefetching web documents. Pandey et al. [13] present an intelligent prefetching method based on a proxy server. This predictive model is based on association rules. The proxy server has the full cooperation of the server and access to the server data. The job of this proxy server is to generate association rules after formatting the logs by filtering and processing the server data. These association rules are later used to predict the browser's surfing path.

Several other recent studies have developed different methods using sequential data mining, path analysis, the Markov model, and decision trees based on the knowledge captured during the user's web surfing. Su et al. [14] propose the WhatNext predictive method, which uses an n-gram model. In this model, an n-gram, where n is greater than or equal to four, is first built from the existing domain knowledge. Using these n-grams, a future surfing path is forecast.

Nasraoui and Pavuluri [16] propose a Context Ultra-Sensitive Approach based on a two-step Recommender system (CUSA-2-step-Rec). Their system tries to overcome the drawbacks of the single-step profile prediction recommendation procedure. In CUSA-2-step-Rec, a neural network, the profile-specific URL predictor, is trained to complete the missing parts of ground-truth sessions. This learning process is applied to each profile independently with a separate set of training data. Recommendation is done in two steps. First, a user session is mapped to one of the prediscovered profiles. Second, one of the profile-specific URL predictors is used to provide the final recommendations.

Our work differs in two ways. First, while the focus in [16] is to provide recommendations (many pages), we focus on predicting the next page only; however, our method can easily be generalized to predict the next n pages by treating the Dempster's rule output as a score and choosing the top n pages with the highest scores. Second, the main reason to use a set of URL predictors in [16] is to overcome the high complexity of architecture and training when a single neural network is used. We use a hybrid model for prediction to improve accuracy using different prediction techniques.

Nasraoui and co-workers [17,18] propose a web recommendation system using fuzzy inference. After preprocessing the log files to obtain user sessions, clustering is applied to group profiles using hierarchical unsupervised niche clustering (H-UNC) [18–20]. Context-sensitive URL associations are inferred using a fuzzy approximate-reasoning-based engine.

Joachims et al. [21] propose the WebWatcher system. WebWatcher is a learning tour-guide software agent that assists users browsing the Web. It extracts knowledge from two sources: previous user clicks and the hypertext structure, i.e., the relationship between web pages traversed by the user. Furthermore, the user can provide WebWatcher with her favorites and interests (domain knowledge). A target function that maps a page to a set of relevant pages is learned using the nearest neighbor approach.

Another technique applied in WWW prediction is ARM. A recommendation engine, proposed by Mobasher et al. [7], matches an active user session with the frequent itemsets in the database and predicts the next page the user is likely to visit. The engine works as follows. Given an active session of size w, the engine finds all frequent itemsets of length w + 1 satisfying some minimum support (minsupp) and containing the current active session. Prediction for an active session A is based on the confidence of the corresponding association rules.
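To make the mechanics concrete, the following is a minimal, illustrative sketch (not Mobasher et al.'s actual engine) of such a confidence-based lookup. It assumes the frequent itemsets and their support counts have already been mined (for example, with Apriori); the function and variable names are ours.

```python
# Illustrative ARM-style next-page prediction: rank candidate pages by the
# confidence of the rule  active_session -> page.
def predict_next_page(active_session, itemset_support):
    """active_session: collection of page ids of size w.
    itemset_support: dict mapping frozenset of page ids -> support count."""
    session = frozenset(active_session)
    base = itemset_support.get(session)
    if not base:
        return None                       # the session itself is not frequent
    best_page, best_conf = None, 0.0
    for itemset, support in itemset_support.items():
        # Frequent itemsets of length w + 1 that contain the active session.
        if len(itemset) == len(session) + 1 and session < itemset:
            confidence = support / base
            (candidate,) = itemset - session
            if confidence > best_conf:
                best_page, best_conf = candidate, confidence
    return best_page

# Toy usage with made-up supports:
supports = {frozenset({1, 5}): 10, frozenset({1, 5, 9}): 6, frozenset({1, 5, 7}): 3}
print(predict_next_page({1, 5}, supports))   # -> 9
```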

There are two problems with the ARM technique. First, when using a global minimum support (minsupp), rare hits, i.e., web pages that are rarely visited, will not be included in the frequent sets because they will not achieve enough support. One solution is to use a very small support threshold; however, we then end up with very large frequent itemsets, which are computationally hard to handle. Second, ARM considers an unordered itemset; hence, the support for a specific itemset is evaluated based on the frequencies of all combinations of that itemset. For example, the support of the user path ⟨1,5,9⟩ is computed by adding up the frequencies of all combinations of that path, i.e., ⟨1,5,9⟩, ⟨1,9,5⟩, ⟨5,9,1⟩, ⟨5,1,9⟩, ⟨9,5,1⟩, and ⟨9,1,5⟩. Although it is evident that these paths are not the same, ARM treats them all as one path or set; hence, many user patterns can be lost. In our work, we consider each path as a separate instance.

Grcar et al. [10] compare SVM with kNN using different datasets. Their main focus is to study the sparsity problem and data quality using different prediction models. The final finding is that SVM performs better on high-sparsity data sets, while kNN outperforms SVM on low-sparsity data sets.

Our work is related to the n-gram model [14] and the LRS model [9]. However, our approach differs from them in the following ways. First, the n-gram-based WhatNext algorithm ignores all n-grams of length less than four. This eliminates some important data that could well be helpful for predicting user surfing paths. Similarly, the LRS algorithm uses only the LRS combined with a Kth-order Markov model. Therefore, it eliminates a set of segments of user information, which can adversely affect the prediction accuracy. In addition, the main focus of LRS is to reduce modeling complexity by reducing the data set. Furthermore, all these models are probabilistic; therefore, they might fail to predict for unobserved data points.

In this paper, we employ the SVM classification technique, which shows greater accuracy in predicting not only seen surfing paths but also unseen ones. We improve the accuracy of SVM by applying feature extraction and by utilizing domain knowledge in prediction. We also combine the Markov model and SVM to achieve the maximum predictive power and to overcome the drawbacks of each model by fusing the two outcomes using Dempster's rule. To the best of our knowledge, no one has combined SVM and the Markov model in WWW prediction; this is the first attempt in this area.

    3 Feature extraction

In Web prediction, the available source of training data is users' sessions, which are sequences of pages that users visit within a period of time. In order to improve the predictive ability of different classification techniques, we need to extract additional features besides the page ids. In this section, we first introduce some problematic aspects of the data set we are handling, namely, the user session logs. Next, we discuss the feature extraction process and introduce the definition of each feature.

In mining the Web, the only source of training examples is the logs, which contain sequences of pages/clicks that users have visited/made, the time, the date, and an estimate of the period of time the user stays on each page. Notice that many of these features, such as date and time, are not relevant. It is evident that we have very few features in the logs; however, we can extract useful features that might increase the prediction capabilities. Also, many models [7,9], such as the Markov model and ARM, as well as our approach, apply a sliding window on the user sessions to unify the length of training paths. If we apply a sliding window, we end up with many repeated instances that will dominate the probability calculations. Furthermore, the page ids in the user sessions are nominal attributes. Nominal attributes bear no internal structure and take one of a finite number of possible values [11]. Many data mining techniques, such as SVM, require continuous attributes because they use a metric measure in their computations (the dot product in the case of SVM). In our implementation, we use bit vectors to represent page ids.
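As a small illustration of the bit-vector idea (our own sketch, not necessarily the paper's exact encoding), each nominal page id can be mapped to a vector with a single non-zero entry:

```python
# One page id -> one position in a fixed-length 0/1 vector, so that SVM's
# dot-product computations remain meaningful for nominal attributes.
def page_to_bit_vector(page_id, all_page_ids):
    index = {p: i for i, p in enumerate(sorted(all_page_ids))}
    vector = [0] * len(index)
    vector[index[page_id]] = 1
    return vector

print(page_to_bit_vector(3, {1, 2, 3, 4, 5}))   # [0, 0, 1, 0, 0]
```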

    3.1 User session features

We use a set of features to represent each user session. These features are computed using a frequency matrix, as shown in Fig. 1. The first row and column represent the enumeration of web page ids. Each cell in the matrix represents the number of times (frequency) that users have visited two pages in sequence: freq(x, y) is the number of times that users have visited page y after page x. For example, cell (1,2) contains the number of times that users have visited page 2 after page 1. Notice that freq(1,2) is not necessarily equal to freq(2,1), and freq(x, x) is always equal to 0.

We extract several features from user sessions. In addition, we apply a sliding window of size N to break long user sessions into N-size sessions, and we apply feature extraction to each of these sessions. To further elaborate on the sliding window concept, we present the following example.

Example 1  Suppose we have a user session A = ⟨1,2,3,4,5,6,7⟩, where 1,2,...,7 is the sequence of pages some user u has visited. Suppose, also, that we use a sliding window of size 5. We apply feature extraction to A = ⟨1,2,3,4,5,6,7⟩ and end up with the following user sessions of five-page length: B = ⟨1,2,3,4,5⟩, C = ⟨2,3,4,5,6⟩, and D = ⟨3,4,5,6,7⟩. Notice that the outcomes/labels of the sessions A, B, C, and D are 7, 5, 6, and 7, respectively. This way, we end up with four user sessions, A, B, C, and D. In general, the total number of sessions extracted using a sliding window of size w from an original session A is |A| − w + 1.

Fig. 1 Frequency matrix
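A minimal sketch of the sliding-window step in Example 1 (names and structure are ours); each window's last page is treated as the outcome/label:

```python
# Break one long session into fixed-length windows of size w; the last page
# of each window is the label, and the preceding pages feed feature extraction.
def sliding_windows(session, w):
    return [session[i:i + w] for i in range(len(session) - w + 1)]

A = [1, 2, 3, 4, 5, 6, 7]
for s in sliding_windows(A, 5):
    print(s[:-1], "->", s[-1])
# [1, 2, 3, 4] -> 5
# [2, 3, 4, 5] -> 6
# [3, 4, 5, 6] -> 7
```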

Most of the features we extract are based on the frequency matrix that we generate from the training set (see Fig. 2). We define the transition probability from page x to page y as the probability that users visit page y after page x:

pr(x → y) = freq(x, y) / Σ_{t=1..N} freq(x, t)    (1)

where N is the total number of pages traversed by all users and freq(x, t) is the value of cell (x, t) in the frequency matrix. For example, pr(1 → 2) equals freq(1,2) divided by the total sum of frequencies in row 1 (i.e., Σ_{t=1..N} freq(1, t)). Notice that the transition probabilities in each row sum to 1 (Σ_{t=1..N} pr(x → t) = 1), and that going from one page to another is treated as an independent event. Note that in (1) the source page x is fixed and the destination page varies. This makes sense here unless there is domain knowledge that ties the transition from one page to another to some condition, for example, that users who traversed pages x, y, z should visit page w next. We also define the weighted transition probability (WTP) from page x to page y as in Eq. (2):

wpr(x → y) = pr(x → y) / Σ_{t=1..N} pr(t → y)    (2)

where N is the total number of pages that users have visited. Notice that wpr weights the transition probability by dividing it by the total sum of the transition probabilities of all other transitions into the same destination page. For example, wpr(1 → 2) is the transition probability of going from page 1 to page 2 (pr(1 → 2)) divided by the total transition probabilities of all transitions into page 2 (Σ_{t=1..N} pr(t → 2)). Note that in (2) the destination page y is fixed; we take into consideration that the source page varies while the destination page stays the same. The features we extract are as follows:

1. Outcome probability (OP): This is the transition probability of the outcome of the session. For a session X = ⟨x1, x2, ..., xN−2, xN−1⟩, the outcome probability is computed as in (3), where xN is the label of the session and x1, ..., xN−1 is the sequence of pages traversed before reaching page xN. Intuitively, the OP helps to distinguish between long and short sessions having the same outcome/label, because their outcome probabilities will differ. Notice that, in general, the longer the session, the less likely the OPs are to be the same.

OP(X) = pr(x1 → x2) × pr(x2 → x3) × ... × pr(xN−2 → xN−1)    (3)

2. Weighted outcome probability (WOP): This is the product of the WTPs of each transition up to the outcome/label of the session. For a session X = ⟨x1, x2, ..., xN−1, xN⟩, the WOP is computed as in (4):

WOP(X) = wpr(x1 → x2) × wpr(x2 → x3) × ... × wpr(xN−2 → xN−1)    (4)

where wpr(x → y) is the WTP defined in (2). The WOP can also be used to distinguish between sessions; in contrast to the outcome probability, each transition probability pr(x → y) is weighted by dividing it by the sum of all transition probabilities into the same page y. This yields values different from the outcome probability, which further discriminate between sessions.

3. Number of hops: Here, we consider the number of hops in each session. The number of hops is defined as the number of transitions in a user session (it corresponds to the order of the Markov model, as we will see in Sect. 4.1). For example, the numbers of hops of A, B, C, and D in Example 1 are 5, 3, 3, and 3, respectively. Recall that we treat the last page as the outcome.

4. Probability sum (PS): This is the sum of the transition probabilities of each transition in the session:

PS(⟨x1, x2, ..., xN⟩) = Σ_{i=1..N−2} pr(xi → xi+1)    (5)

where ⟨x1, x2, ..., xN⟩ is the user session. This feature boosts longer sessions over shorter ones. Longer sessions embed better information than shorter ones because they carry more information about the pattern of going from one page to another [9].

5. Weighted probability sum (WPS): This is the sum of the weighted transition probabilities of each transition in the session:

WPS(⟨x1, x2, ..., xN⟩) = Σ_{i=1..N−2} wpr(xi → xi+1)    (6)

where ⟨x1, x2, ..., xN⟩ is the user session and wpr is defined as in Eq. (2).

6. K previous pages (KPP): These are the last K pages a user traversed before visiting the last page (outcome/label). For example, in Example 1, the 2PP of A, B, C, and D are ⟨5, 6⟩, ⟨3, 4⟩, ⟨4, 5⟩, and ⟨5, 6⟩, respectively.
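The following is a rough, self-contained sketch (our own code, with illustrative names) of how the frequency matrix, the transition probabilities (1)–(2), and the six features above could be computed for one windowed session:

```python
# Rough sketch of the frequency matrix, the transition probabilities in
# (1)-(2), and the session features OP, WOP, hops, PS, WPS, and KPP.
from collections import defaultdict

def build_freq(sessions):
    freq = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for x, y in zip(s, s[1:]):
            freq[x][y] += 1
    return freq

def pr(freq, x, y):
    # Eq. (1): frequency of x -> y normalised by the whole row of x.
    row_total = sum(freq[x].values())
    return freq[x][y] / row_total if row_total else 0.0

def wpr(freq, x, y, pages):
    # Eq. (2): pr(x -> y) normalised by all transition probabilities into y.
    col_total = sum(pr(freq, t, y) for t in pages)
    return pr(freq, x, y) / col_total if col_total else 0.0

def session_features(session, freq, pages, k=2):
    path = session[:-1]                     # last page is the outcome/label
    op, wop, ps, wps = 1.0, 1.0, 0.0, 0.0
    for x, y in zip(path, path[1:]):        # transitions before the outcome
        p, w = pr(freq, x, y), wpr(freq, x, y, pages)
        op *= p                             # outcome probability, Eq. (3)
        wop *= w                            # weighted outcome probability, Eq. (4)
        ps += p                             # probability sum, Eq. (5)
        wps += w                            # weighted probability sum, Eq. (6)
    hops = len(path) - 1                    # number of hops
    kpp = path[-k:]                         # K previous pages
    return op, wop, hops, ps, wps, kpp

sessions = [[1, 2, 3, 4, 5], [2, 3, 4, 5, 6], [3, 4, 5, 6, 7]]
pages = {p for s in sessions for p in s}
freq_matrix = build_freq(sessions)
print(session_features(sessions[0], freq_matrix, pages))
```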

no user has visited page y directly after page x. We do not distinguish between these two events and simply exclude any classifier f_i when freq(x_n, i) = 0.
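Since the surrounding discussion is abbreviated here, the following is only a hedged sketch of that filtering step, with hypothetical names: given the last page x_n of a testing session, classifiers for pages that never followed x_n in the training data are skipped.

```python
# Keep only the one-vs-all classifiers whose target page has a nonzero
# frequency after the session's last page; all other classifiers are skipped.
def candidate_classifiers(last_page, freq, classifiers):
    """classifiers: dict mapping page id -> trained one-vs-all classifier.
    freq: frequency matrix with freq[x][y] = number of times y followed x."""
    return {page: clf for page, clf in classifiers.items()
            if freq[last_page][page] > 0}
```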

5 Multiple evidence combination for WWW prediction using Dempster's rule

In this section, we present a hybrid model for WWW prediction that uses the SVM and the Markov model as bodies of evidence, based on Dempster's rule for evidence combination. Figure 5 shows that the resolution of the class that an outcome (i.e., a testing point) belongs to is based on the fusion of two separate classifiers, namely, the SVM and the Markov model. Recall that prediction using SVM has an extra advantage over the Markov model: SVM can predict for unseen data, while the Markov model works better with seen data. Hence, this hybrid model takes advantage of the best of both models in making a final prediction.

Dempster's rule is one part of the Dempster–Shafer evidence combination framework for fusing independent bodies of evidence. In Dempster's rule, the sources of evidence should be in the form of basic probabilities. The output of the SVM classifier, (14) and (15), is the signed distance of an instance to the hyperplane. Since this output is not a probability [23], we first need to convert/map it into a posterior probability P(class | input). In this section, we first present a method to convert SVM output into a posterior probability by fitting a sigmoid function after the SVM output [24]. Next, we present the background of Dempster–Shafer theory.

    5.1 Fitting a sigmoid after support vector machine

The output value produced by SVMs, (14) and (15), is not a probability [23]. It must be mapped to a posterior probability before the SVM output can be used in Dempster's rule.

Fig. 5 A hybrid model using Dempster's rule for evidence combination

A number of ways in which this can be accomplished have been presented [23–25]. Vapnik [23] proposes decomposing the feature space F into an orthogonal and a nonorthogonal direction. The separating hyperplane can then be considered: first, the direction orthogonal to the hyperplane is examined and, second, the N − 1 other directions that are not orthogonal to the separating hyperplane are reviewed. As a parameter for the orthogonal direction, a scaled version of f(x), t, is employed. A vector u is used to represent all other directions. The posterior probability is fitted using a sum of cosine terms and depends on both t and u, P(y = 1 | t, u):

P(y = 1 | t, u) = a0(u) + Σ_{n=1..N} an(u) cos(n f)    (16)

This promising method has some limitations, since it requires the solution of a linear system for every evaluation of the SVM. The sum of cosine terms is not restricted to lie between 0 and 1 and is not constrained to be monotonic in f. In order to treat P(y = 1 | f) as a probability, there is a very strong requirement that f be monotonic [24].

Hastie and Tibshirani [26] fit Gaussians to the class-conditional densities p(f | y = 1) and p(f | y = −1). Here, a single tied variance is estimated for both Gaussians. The posterior probability P(y = 1 | f) is a sigmoid whose slope is determined by the tied variance.

One can compute the mean and the variance of each Gaussian from the data set and apply Bayes' rule to obtain the posterior probability as follows:

P(y = 1 | f) = p(f | y = 1) P(y = 1) / Σ_{i=−1,1} p(f | y = i) P(y = i)    (17)

where P(y = i) are the prior probabilities computed from the training set and f is as defined in (14). The posterior is an analytic function of f with the form:

P(y = 1 | f) = 1 / (1 + exp(a f² + b f + c))    (18)

The problem with this method is that (18) is not monotonic and the assumption of Gaussian class-conditional densities is often violated [24].

In this study, we implement a parametric method to fit the posterior P(y = 1 | f) directly instead of estimating the class-conditional densities p(f | y). The class-conditional densities between the margins are apparently exponential [24]. Applying Bayes' rule (17) to two exponentials suggests using a parametric form of the sigmoid:

P(y = 1 | f) = 1 / (1 + exp(A f + B))    (19)

where f is as defined in (14). This sigmoid model is equivalent to assuming that the output of the SVM is proportional to the log odds of a positive example. The parameters A and B of (19) are fit using maximum likelihood estimation from a training set (f_i, t_i), where the t_i are the target probabilities defined as

t_i = (y_i + 1) / 2    (20)

The parameters A and B are found by minimizing the negative log likelihood of the training data, which is a cross-entropy error function:

min − Σ_i [ t_i log(p_i) + (1 − t_i) log(1 − p_i) ]    (21)

    where

p_i = 1 / (1 + exp(A f_i + B))    (22)

The minimization in (21) is a two-parameter minimization; hence, it can be performed using any of many optimization algorithms. For robustness, we implement a model-trust minimization algorithm based on the Levenberg–Marquardt algorithm [27], whose pseudocode is shown in the Appendix.
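As a rough illustration (not the Appendix's model-trust routine), the two parameters can also be fitted with a generic optimizer; the helper below is our own sketch using SciPy and assumes the SVM decision values and ±1 labels are given as NumPy arrays.

```python
# Sketch of fitting A and B in (19) by minimising the cross-entropy (21),
# using a generic SciPy optimiser instead of the Appendix's model-trust
# Levenberg-Marquardt routine.
import numpy as np
from scipy.optimize import minimize

def fit_sigmoid(f, y):
    """f: array of SVM decision values; y: array of labels in {-1, +1}."""
    t = (y + 1) / 2.0                                    # target probabilities, Eq. (20)

    def neg_log_likelihood(params):
        A, B = params
        p = 1.0 / (1.0 + np.exp(A * f + B))              # Eq. (22)
        p = np.clip(p, 1e-12, 1.0 - 1e-12)               # numerical safety
        return -np.sum(t * np.log(p) + (1 - t) * np.log(1 - p))   # Eq. (21)

    result = minimize(neg_log_likelihood, x0=np.array([-1.0, 0.0]),
                      method="Nelder-Mead")
    return result.x                                      # fitted (A, B)

# Usage: A, B = fit_sigmoid(decision_values, labels)
#        posterior = 1.0 / (1.0 + np.exp(A * f_new + B))   # Eq. (19)
```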

5.2 Dempster–Shafer evidence combination

Dempster–Shafer theory (also known as the theory of belief functions) is a mathematical theory of evidence [12] that is considered a generalization of the Bayesian theory of subjective probability. The Dempster–Shafer theory differs from Bayesian theory in that a belief function, rather than a Bayesian probability distribution, is used to represent a chance. A further difference is that probability values are assigned to sets of possibilities rather than to single events. Unlike Bayesian methods, which often map unknown priors to random variables, the Dempster–Shafer framework does not specify priors and conditionals.

The Dempster–Shafer theory is based on two ideas: first, the notion of obtaining degrees of belief for one question from subjective probabilities for a related question; second, Dempster's rule for combining such degrees of belief when they are based on independent items of evidence. Since we use two independent sources of evidence, namely, SVM and the Markov model, we are interested in the latter part of the Dempster–Shafer theory, namely, Dempster's rule. We use it to combine two bodies of evidence (SVM and the Markov model) in WWW prediction. The reader is referred to [28,29] for more details regarding this theory and its applications.

5.2.1 Dempster's rule for evidence combination

Dempster's rule is a well-known method for aggregating two different bodies of evidence over the same reference set. Suppose we want to combine evidence for a hypothesis C. In WWW prediction, C is the assignment of a page during prediction for a user session; for example, which page a user might visit next after visiting pages p1, p3, p4, and p10. C is a member of 2^Θ, i.e., the power set of Θ, where Θ is our frame of discernment. A frame of discernment is an exhaustive set of mutually exclusive elements (hypotheses, propositions). All of the elements in this power set, including the elements of Θ, are propositions. Given two independent sources of evidence, m1 and m2, Dempster's rule combines them as follows:

m_{1,2}(C) = Σ_{A,B: A∩B=C} m1(A) m2(B) / Σ_{A,B: A∩B≠∅} m1(A) m2(B)    (23)

Here, A and B are supersets of C; they are not necessarily proper supersets, i.e., they may be equal to C or to the frame of discernment Θ. m1 and m2 are functions (also known as masses of belief) that assign a coefficient between 0 and 1 to different elements of 2^Θ; m1(A) is the portion of belief assigned to A by m1. m_{1,2}(C) is the combined Dempster–Shafer probability for a hypothesis C. To elaborate further on Dempster–Shafer theory, we present the following example.

Example 3  Consider a website that contains three separate web pages B, C, and D, each with a hyperlink to the other two. We are interested in predicting the next web page (i.e., B, C, or D) a user visits after he/she visits several pages. We may form the following propositions, which correspond to proper subsets of Θ:

P_B: The user will visit page B.
P_C: The user will visit page C.
P_D: The user will visit page D.
{P_B, P_C}: The user will visit either page B or page C.
{P_D, P_B}: The user will visit either page D or page B.
{P_D, P_C}: The user will visit either page D or page C.

With these propositions, 2^Θ would consist of the following:

2^Θ = { {P_D}, {P_B}, {P_C}, {P_D, P_B}, {P_B, P_C}, {P_D, P_C}, {P_D, P_B, P_C}, ∅ }

In many applications, basic probabilities for every proper subset of Θ may not be available. In these cases, a nonzero m(Θ) accounts for all those subsets for which we have no specific belief. Since we expect the user to visit only one web page (it is impossible to visit two pages at the same time), we have positive evidence for individual pages only, i.e.,

m(A) > 0 only for A ∈ { {P_D}, {P_B}, {P_C} }.

The uncertainty of the evidence, m(Θ), in this scenario is

m(Θ) = 1 − Σ_A m(A)    (24)

In (23), the numerator accumulates the evidence that supports a particular hypothesis, and the denominator conditions it on the total evidence for those hypotheses supported by both sources. Applying this combination formula to Example 3, assuming we have two bodies of evidence, namely, SVM and the Markov model, would yield

m_{svm,markov}(P_B) = W / Σ_{A,B: A∩B≠∅} m_svm(A) m_markov(B)

where W = m_svm(P_B) m_markov(P_B) + m_svm(P_B) m_markov({P_C, P_B}) + m_svm(P_B) m_markov({P_D, P_B}) + ···

5.2.2 Using Dempster–Shafer theory in WWW prediction

There are several reasons we use Dempster's rule in WWW prediction. First, Dempster's rule is designed to fuse two (or more) independent bodies of evidence. The basic idea in Dempster–Shafer theory is that the arguments for and against a hypothesis should be kept separate rather than considering one to be the complement of the other. This key idea is important because it suits the WWW prediction application and other real-time applications such as satellite image segmentation [30] and medical data mining [31]. Note that, in WWW prediction, uncertainty is involved because so many parameters, such as user behavior and web site changes, are involved in computing the belief; i.e., there is a certain amount of ignorance in computing our belief because of these various changing parameters. Hence, events and their complements should be kept separate. Second, domain knowledge can be incorporated in Dempster's rule through the uncertainty computations [see (24)]. Furthermore, Dempster's rule is adaptable when the domain knowledge changes, for example, when new data arrives. In WWW prediction, the domain knowledge changes because new clicks are logged, new customers arrive, and new web pages are created. This makes the application of Dempster–Shafer appropriate. Finally, the output of Dempster's rule is a confidence measurement; hence, there is no need to adopt a new confidence measurement.

We have two sources of evidence: the output of SVM and the output of the Markov model. SVM predicts the next web page based on features extracted from the user sessions, while the Markov model relies on the conditional probability of the user session. These two models operate independently. Furthermore, we assume that for any session x that does not appear in the training set, the Markov prediction is zero. If we use Dempster's rule for combination of evidence, we get the following:

m_{svm,markov}(C) = Σ_{A,B: A∩B=C} m_svm(A) m_markov(B) / Σ_{A,B: A∩B≠∅} m_svm(A) m_markov(B)    (25)

In the case of WWW prediction, we can simplify this formulation because we have beliefs only for singleton classes (i.e., the final prediction is a single web page, never more than one page) and for the body of evidence itself (m(Θ)). This means that for any other proper subset A of Θ for which we have no specific belief, m(A) = 0. For example, based on Example 3, we would have the following terms in the numerator of (25):

m_svm(P_B) m_markov(P_B),
m_svm(P_B) m_markov({P_C, P_B}),
m_svm(P_B) m_markov({P_D, P_B}),
m_svm(P_B) m_markov(Θ),
m_markov(P_B) m_svm({P_C, P_B}),
m_markov(P_B) m_svm({P_D, P_B}),
m_markov(P_B) m_svm(Θ).

Since we have nonzero basic probability assignments only for the singleton subsets of Θ (i.e., the final prediction should resolve to exactly one web page) and for Θ itself, this means that

m_svm(P_B) m_markov(P_B) > 0,
m_svm(P_B) m_markov({P_C, P_B}) = 0, since m_markov({P_B, P_C}) = 0,
m_svm(P_B) m_markov({P_D, P_B}) = 0, since m_markov({P_D, P_B}) = 0,
m_svm(P_B) m_markov(Θ) > 0,
m_svm({P_B, P_C}) m_markov(P_B) = 0, since m_svm({P_B, P_C}) = 0,
m_svm({P_B, P_D}) m_markov(P_B) = 0, since m_svm({P_B, P_D}) = 0,
m_svm(Θ) m_markov(P_B) > 0.

After eliminating the zero terms, we get the simplified Dempster's combination rule:

m_{svm,markov}(P_B) = [ m_svm(P_B) m_markov(P_B) + m_svm(P_B) m_markov(Θ) + m_svm(Θ) m_markov(P_B) ] / Σ_{A,B: A∩B≠∅} m_svm(A) m_markov(B)    (26)

Since we are interested in ranking the hypotheses, we can further simplify (26) because the denominator is independent of any particular hypothesis (i.e., it is the same for all):

m_{svm,markov}(P_B) ∝ m_svm(P_B) m_markov(P_B) + m_svm(P_B) m_markov(Θ) + m_svm(Θ) m_markov(P_B)    (27)

Here ∝ denotes "is proportional to". m_svm(Θ) and m_markov(Θ) represent the uncertainty in the bodies of evidence for the SVM and Markov models, respectively. For m_svm(Θ) and m_markov(Θ) in (27), we proceed as follows. For SVM, we use the margin values to compute the uncertainty, based on the maximum distance of the training examples from the margin:

m_svm(Θ) = 1 / ln(e + SVM_margin)    (28)

where SVM_margin is the maximum distance of a training example from the margin and e is Euler's number. For the Markov model uncertainty, we use the maximum probability of a training example:

m_markov(Θ) = 1 / ln(e + Markov_probability)    (29)

where Markov_probability is the maximum probability of a training example. Note that, in both models, the uncertainty is inversely proportional to the corresponding maximum value. Below, we show the basic steps of our algorithm for predicting WWW surfing using multiple evidence combination:

    Algorithm: WWW Prediction Using Hybrid Model
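Beyond the algorithm listing itself, a minimal sketch of the core fusion step it relies on, i.e., the proportional score in (27) with the uncertainty masses (28) and (29), might look as follows; the per-page masses from SVM and the Markov model are assumed inputs, and all names are illustrative.

```python
# Sketch of the fused score in (27) with the uncertainty masses (28)-(29);
# the per-page singleton masses m_svm and m_markov are assumed to be given.
import math

def uncertainty(max_value):
    # Eqs. (28)/(29): uncertainty is inversely related to the maximum
    # margin (SVM) or maximum probability (Markov) seen in training.
    return 1.0 / math.log(math.e + max_value)

def fused_scores(m_svm, m_markov, svm_margin, markov_probability):
    u_svm = uncertainty(svm_margin)
    u_markov = uncertainty(markov_probability)
    pages = set(m_svm) | set(m_markov)
    return {
        page: m_svm.get(page, 0.0) * m_markov.get(page, 0.0)
              + m_svm.get(page, 0.0) * u_markov
              + u_svm * m_markov.get(page, 0.0)        # Eq. (27)
        for page in pages
    }

# Prediction is the page with the highest fused score:
scores = fused_scores({10: 0.6, 100: 0.3}, {10: 0.5},
                      svm_margin=2.1, markov_probability=0.5)
print(max(scores, key=scores.get))   # -> 10
```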

    6 Evaluation

In this section, we present experimental results for WWW surfing prediction using four prediction models, namely, the Markov model (LRS model), SVM, the association rule model, and a hybrid method (HMDR) based on Dempster's rule for evidence combination.

Here, we first define the prediction measurements that we use in our results. Second, we present the data set used in this research. Third, we present the experimental setup. Finally, we present our results.

In all models except SVM, we used the N-gram representation of paths [7,9]. The N-grams are tuples of the form ⟨X1, X2, ..., Xn⟩ that represent a sequence of page clicks by a set of surfers of a web page. One may choose any length N for the N-grams to record. In the SVM model, we apply feature extraction as shown in Sect. 3. Note that the N-gram is embedded in the fifth feature, namely, KPP (Fig. 2).

The following definitions will be used in the following sections to measure the performance of the prediction. Pitkow et al. [9] used these parameters to measure the performance of the Markov model. Since we consider both the generalization accuracy and the training accuracy, we add two additional measurements that take the generalization accuracy into consideration, namely, Pr(Hit|MisMatch) and the overall accuracy.

Pr(Match): the probability that a penultimate path (i.e., second to last in a series or sequence), ⟨x_{n−1}, x_{n−2}, ..., x_{n−k}⟩, observed in the validation set is matched by the same penultimate path in the training set.

Pr(Hit|Match): the conditional probability that page x_n is correctly predicted for the testing instance ⟨x_{n−1}, x_{n−2}, ..., x_{n−k}⟩, given that ⟨x_{n−1}, x_{n−2}, ..., x_{n−k}⟩ matches a penultimate path in the training set.

Pr(Hit) = Pr(Match) × Pr(Hit|Match): the probability that the page visited in the test set is the one estimated from the training set as the most likely to occur.

Pr(Miss|Match): the conditional probability that a page is incorrectly classified, given that its penultimate path ⟨x_{n−1}, x_{n−2}, ..., x_{n−k}⟩ matches a penultimate path in the training set.

Pr(Miss) = Pr(Match) × Pr(Miss|Match): the probability that a page x_n with a matched penultimate path is incorrectly classified.

Pr(Hit|MisMatch): the conditional probability that a page x_n is correctly predicted for the testing instance ⟨x_{n−1}, x_{n−2}, ..., x_{n−k}⟩, given that ⟨x_{n−1}, x_{n−2}, ..., x_{n−k}⟩ does not match any penultimate path in the training set. This measurement corresponds to the generalization accuracy, and it is considered more accurate than the training accuracy (represented by Pr(Hit|Match)).

Overall accuracy: A = Pr(Hit|Mismatch) × Pr(Mismatch) + Pr(Hit|Match) × Pr(Match). A is the overall accuracy, which combines both matching/seen and mismatching/unseen testing examples in computing the accuracy.

The following relations hold for the above measurements:

Pr(Hit|Match) = 1 − Pr(Miss|Match)    (30)

Pr(Hit) / Pr(Miss) = Pr(Hit|Match) / Pr(Miss|Match)    (31)

Pr(Hit|Match) corresponds to the training accuracy because it shows the proportion of training examples that are correctly classified. Pr(Hit|Mismatch) corresponds to the generalization accuracy because it shows the proportion of unseen examples that are correctly classified. The overall accuracy A combines both.
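A small sketch (our own helper, with an assumed input structure) of how these measurements can be computed from a set of test predictions:

```python
# Sketch of the measurements above; each test example is assumed to be a
# tuple (penultimate_path, true_next_page, predicted_next_page), and
# training_paths is the set of penultimate paths seen during training.
def evaluate(test_examples, training_paths):
    match = mismatch = hit_match = hit_mismatch = 0
    for path, truth, prediction in test_examples:
        if tuple(path) in training_paths:
            match += 1
            hit_match += int(prediction == truth)
        else:
            mismatch += 1
            hit_mismatch += int(prediction == truth)
    pr_match = match / len(test_examples)
    pr_hit_given_match = hit_match / match if match else 0.0
    pr_hit_given_mismatch = hit_mismatch / mismatch if mismatch else 0.0
    # Overall accuracy A combines seen (matched) and unseen (mismatched) cases.
    overall = pr_hit_given_match * pr_match + pr_hit_given_mismatch * (1 - pr_match)
    return pr_match, pr_hit_given_match, pr_hit_given_mismatch, overall
```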

    6.1 Data processing

In this section, we first describe the process used to collect the raw data and then the steps for processing it. These preprocessing steps include collecting information from users' sessions, data cleaning, user identification, and session identification. As the details of each of these preprocessing tasks can be found in [32], we discuss them only briefly here for the purpose of understanding the structure of the processed data set.

For equal comparison purposes, and in order to avoid duplicating already existing work, we have used the data collected by Pitkow et al. [9] from Xerox.com for May 10, 1998 and May 13, 1998.

During this period, about 200 K web hits were received each day on about 15 K web documents or files. The raw data was collected by embedding cookies on the users' desktops. In cases where cookies did not exist, a set of fallback heuristics was used to collect users' web browsing information.

Several attributes are collected using the above method, including the user's IP address, a time stamp with date and starting time, the visited URL address, the referring URL address, and the browser information or agent.

Once we have the access log data for the intended period, the next step is to uniquely identify users and their sessions. Without identifying unique users and their sessions, it is impossible to trace the paths they have visited during surfing, especially with modern web pages that use hidden cookies and embedded session IDs. Session identification is the most crucial element of data preprocessing. This task is quite complicated, since two or more users may have the same IP address. Pirolli et al. solve this problem by looking at the browser information [33]. If two log entries in the access logs have the same IP address but their browser information shows different browsing software and/or operating systems, then it can be concluded that the IP address represents two distinct users. By the same token, if a log entry has the same IP address as other log entries but its visited page is not in any way hyperlinked to the pages visited by the previous entries, it can also be concluded that the new entry belongs to a separate user. Also, a user can surf the web for a short period of time and then return for browsing after a significant amount of waiting. To manage this situation, a default timeout of 30 min is used to delimit a session for that particular user: if the time between page requests for a user exceeds 30 min, a new session is considered for that user [32].
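The following is a rough sketch of these heuristics (our own code and names): entries are grouped by IP address and user agent, and a session boundary is inserted whenever two consecutive requests are more than 30 minutes apart.

```python
# Sketch of the user/session identification heuristics: group log entries by
# (IP address, user agent) and start a new session when the gap between
# consecutive requests exceeds 30 minutes.
from collections import defaultdict
from datetime import timedelta

def sessionize(log_entries, timeout=timedelta(minutes=30)):
    """log_entries: iterable of (ip, agent, timestamp, url), timestamps ascending."""
    by_user = defaultdict(list)
    for ip, agent, timestamp, url in log_entries:
        by_user[(ip, agent)].append((timestamp, url))
    sessions = []
    for visits in by_user.values():
        current, last_time = [], None
        for timestamp, url in visits:
            if last_time is not None and timestamp - last_time > timeout:
                sessions.append(current)          # gap too large: close session
                current = []
            current.append(url)
            last_time = timestamp
        if current:
            sessions.append(current)
    return sessions
```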

6.2 Experimental setup

In this section, we present the setting of our experiments. All our experiments were executed on a multiprocessor machine with two 900 MHz Pentium III processors and 4 GB of memory.

We use LIBSVM [34] for the SVM implementation, with the ν-SVM formulation and an RBF kernel. In our experiments, we set ν very low (ν = 0.001). Since we address a multiclass classification problem, we implement a one-vs-all scheme, because it reduces the number of classifiers that need to be computed compared with one-vs-one classification. In addition, we train SVM using the features described in Sect. 3. Also, we incorporate in prediction the domain knowledge previously mentioned in Sect. 4.2.2.
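A minimal sketch of this setup, assuming features and labels have already been extracted; it uses scikit-learn's NuSVC (which wraps LIBSVM) with an RBF kernel rather than calling LIBSVM directly, so details differ from the authors' exact configuration.

```python
# Sketch of a one-vs-all nu-SVM with an RBF kernel via scikit-learn's NuSVC
# (a LIBSVM wrapper); features and labels from Sect. 3 are assumed given.
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import NuSVC

def train_one_vs_all(X, y, nu=0.001):
    base = NuSVC(nu=nu, kernel="rbf")
    return OneVsRestClassifier(base).fit(X, y)

# Usage (with extracted feature vectors and next-page labels):
# model = train_one_vs_all(np.array(feature_vectors), np.array(labels))
# predicted_page = model.predict(np.array([test_features]))[0]
```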

In ARM, we generate the rules using the Apriori algorithm proposed in [35]. We set minsupp to a very low value (minsupp = 0.0001) to capture rarely visited pages, and we implement the idea of the recommendation engine proposed by Mobasher et al. [7].

We divide the data set into three partitions. Two partitions are used for training and the third is used for testing/prediction. Table 2 presents the distribution of the training and testing data points for each hop.

    6.3 Results

In this section, we present and compare the results of prediction using four different models, namely, ARM, SVM, the Markov model, and the multiple evidence combination model.

Table 2 The distribution of the training and testing sets for each hop

Number of hops   Size of training set   Size of testing set
1                23,028                 12,060
2                24,222                 12,481
3                22,790                 11,574
4                21,795                 10,931
5                20,031                 9,946
6                18,352                 8,895
7                16,506                 7,834

Regarding the training time, it is important to mention that we assume an off-line training process. On one hand, the training time for the Markov model and ARM is very small (a few minutes). On the other hand, training SVM using one-vs-all is an overhead, because we need to generate a large number of classifiers (see Sect. 4.2.1). In our experiments, we have 5,430 different web pages; hence, we create 5,430 classifiers. The total training time for SVM is 26.3 h. Even though, in all prediction models, we need to consult a large number of classifiers in order to resolve the final prediction outcome, the average prediction time for one instance is negligible, as we can see from Table 3.

We consider up to seven hops in our experimental results for all models. Recall that we break long sessions using a sliding window. A sliding window of size 3 corresponds to one hop because the surfer hops once before reaching the outcome, i.e., the last page. Results vary with the number of hops because different patterns are revealed for different numbers of hops. We also show the effect of including domain knowledge on the prediction accuracy of SVM.

Table 3 Average prediction time for all methods

Method            Time (ms)
SVM               49.8
Markov            0.12
ARM               0.10
Dempster's rule   51.1

Table 4 Using all probability measurements with one hop

                     ARM    Markov   SVM     Dempster's rule
Pr(Match)            0.59   0.59     0.59    0.59
Pr(Hit|Match)        0.09   0.153    0.136   0.177
Pr(Hit)              0.05   0.091    0.081   0.105
Pr(Miss|Match)       0.91   0.846    0.863   0.822
Pr(Miss)             0.53   0.501    0.511   0.487
Pr(Hit|Mismatch)     0.0    0.0      0.093   0.082
Pr(Hit)/Pr(Miss)     0.1    0.181    0.158   0.215
Overall accuracy     0.05   0.091    0.119   0.138

Table 4 shows the prediction results of all techniques using all measurements with one hop. The first column of Table 4 lists the different measurements mentioned above; subsequent columns correspond to the prediction results of ARM, Markov, SVM, and Dempster's rule. There are several points to note. First, the value of Pr(Hit|Mismatch) is zero for both the ARM and Markov models because neither model can predict for unseen/mismatched data. Second, the Dempster's rule method achieves the best scores on all measurements. For example, the training accuracies (Pr(Hit|Match)) for ARM, Markov, SVM, and Dempster's rule are 9, 15, 13, and 17%, respectively. The overall accuracies for ARM, Markov, SVM, and Dempster's rule are 5, 9, 11, and 13%, respectively. This shows that our hybrid method improves the predictive accuracy. Third, even though Pr(Hit|Match) for SVM is lower than for the Markov model, the overall accuracy of SVM is better than that of the Markov model, because Pr(Hit|Mismatch) is zero for the Markov model while it is 9% for SVM. Finally, notice that ARM has the poorest prediction results. ARM uses the concept of itemsets instead of item lists (ordered itemsets); hence, the support of one path is computed based on the frequencies of that path and of all its combinations. It is also very sensitive to the minsupp value. This might have caused important patterns to be lost or mixed.

Figure 6 depicts Pr(Hit|Match) for all prediction techniques using different numbers of hops. The x-axis represents the number of hops, and the y-axis represents Pr(Hit|Match). In this figure, the Dempster's rule method achieves the best training accuracy for all numbers of hops. The Pr(Hit|Match) values, i.e., training accuracies, using three hops for ARM, SVM, Markov, and Dempster's rule are 10, 13, 17, and 23%, respectively. Notice that the SVM training accuracy is lower than the Markov model training accuracy because here we only consider the accuracy among seen/matched data. Pr(Match) is generally high, as we can see from Table 5. Recall that we break the data set into two parts for each number of hops: two-thirds are used for training and one-third for testing. Pr(Match) is the probability that an example in the testing set also appears in the training set. Since we get a different data set for each number of hops, Pr(Match) differs as well; it is 37%, regardless of technique, in the case of three hops. When we increase the number of hops (the sliding window is larger), the number of training points decreases, so Pr(Match) is adversely affected. Note that the Pr(Match) values are valid for all prediction models, namely, SVM, Markov, ARM, and multiple evidence combination.

Fig. 6 Comparable results of all techniques based on Pr(Hit|Match)

Table 5 The Pr(Match), i.e., the probability of seen data, in the data set using different numbers of hops

Number of hops   Pr(Match)
1                0.59
2                0.40
3                0.37
4                0.29
5                0.25
6                0.21
7                0.22

Figure 7 shows the generalization accuracy for all techniques. The x-axis represents the number of hops, and the y-axis depicts Pr(Hit|Mismatch). Recall that generalization is the ability to predict correctly for unseen data. The ARM and Markov models have zero generalization accuracy because they are probabilistic models. The Dempster's rule method gains its generalization accuracy from the SVM fusion. The highest generalization accuracy achieved is 10%, using three hops, for both SVM and Dempster's rule. In general, prediction accuracy in WWW surfing is low; hence, this increase in generalization accuracy is considered very important.

Fig. 7 Generalization accuracy (i.e., Pr(Hit|Mismatch)) using all prediction methods

Fig. 8 Comparable results based on the overall accuracy

In Fig. 8, we show the overall accuracy (on the y-axis) of all techniques for different numbers of hops (on the x-axis). Notice that Dempster's rule achieves the best overall accuracy, SVM comes second, the Markov model is third, and finally ARM. For example, using five hops, the overall accuracies for ARM, Markov, SVM, and Dempster's rule are 1, 7, 10, and 14%, respectively. Notice that SVM has outperformed the Markov model in overall accuracy because SVM generalizes better than the Markov model. Dempster's rule proves to combine the best of both SVM and the Markov model, as it keeps its superiority over all techniques, on all measurements, and for all numbers of hops.

Figure 9 shows the results of all methods measured by Pr(Hit)/Pr(Miss). It is very similar to Fig. 6 because of the relationships between these two measurements [see (30) and (31)]. The reason we use this measurement is that it does not depend on Pr(Match), which might vary from one data set to another [see (30) and (31)]. As we can see, the Dempster's rule method outperforms the other techniques for almost all hops. For example, using three hops, the Pr(Hit)/Pr(Miss) values for ARM, SVM, Markov, and Dempster's rule are 12, 15, 23, and 28%, respectively. Figure 10 shows the results of all methods measured by the overall ratio of hits to misses, i.e.,

overall(Hit/Miss) = number of correct predictions / number of incorrect predictions.

Fig. 9 Pr(Hit)/Pr(Miss) for all methods using different numbers of hops

Fig. 10 Overall hit/miss accuracy using different numbers of hops

The x-axis depicts the number of hops, while the y-axis represents the overall hit/miss ratio. It is very similar to Fig. 8, as both consider the generalization accuracies. As we can see, the Dempster's rule method outperforms the other techniques for almost all numbers of hops. For example, using seven hops, the overall hit/miss ratios for the ARM, SVM, Markov, and Dempster's rule models are 0.6, 12, 8, and 18%, respectively. A larger overall hit/miss ratio means a higher hit value and a lower miss value; therefore, this confirms the superiority of Dempster's rule over the other techniques.

One important remark is that, in all figures, the number of hops affects the accuracy: the larger the number of hops, the higher the accuracy. Recall that the number of hops reflects the portion of user history used. Intuitively, when we use a larger portion of the user history in training, we expect more accurate results. This is not unexpected, and a similar result can be found in [7]. Another remark is that even though accuracy has improved using our hybrid model, it is still low. The reason might be the large number of classes/pages each instance might resolve to. In addition, we predict only the single most probable page/class that the user might visit; recall that we consider prediction of the next page the user might visit, in contrast to recommendation systems, which provide a list of recommended/relevant pages. Furthermore, conflicting training paths can be found in the training set. For example, we can have two outcomes, P1 and P2, that correspond to one path ⟨P4, P3, P8⟩, i.e., a user, after traversing pages P4, P3, and P8, visits page P1 in one session, and in another session she visits P2 after traversing the same path. This not only confuses the training process and the prediction process but also affects the prediction results.

    7 Conclusion and future work

In this paper, we use a hybrid method for WWW prediction based on Dempster's rule for evidence combination to improve the prediction accuracy. Improvements in WWW prediction contribute to many applications, such as web search engines, latency reduction, and recommendation systems.

We used two sources of evidence/prediction in our hybrid model. The first body of evidence is SVM. To further improve the prediction of SVM, we applied two ideas. First, we extracted several probabilistic features from the surfing sessions and trained SVM with these features. Second, we incorporated domain knowledge in prediction to improve prediction accuracy. The second body of evidence is the widely used Markov model, which is a probabilistic model. Dempster's rule proves its effectiveness by combining the best of SVM and the Markov model, as demonstrated by the fact that its predictive accuracy outperformed all other techniques.

We compared the results obtained using all techniques, namely, SVM, ARM, the Markov model, and the hybrid method, against a standard benchmark data set. Our results show that the hybrid method (Dempster's rule) outperforms all other techniques, mainly because of the incorporation of domain knowledge (applied in SVM), the ability to generalize to unseen instances (applied in SVM), and the incorporation of the Markov model.

We would like to extend our research along the following directions. First, we would like to extend the work to address incremental prediction of WWW surfing; in other words, the training set will take into consideration that user surfing patterns can change over time. Second, we intend to use other classification techniques such as neural networks. Third, we would like to extend this work to real-time prediction in various domains. Finally, we would like to investigate an automated method for re-training and re-prediction when the behavior of surfers changes with time.

    Appendix A: pseudo-code for the sigmoid training

This appendix shows the pseudo-code for training the sigmoid that maps SVM outputs to probabilities. The algorithm is a model-trust algorithm, based on the Levenberg-Marquardt algorithm [27].

    ALGORITHM SVM-Sigmoid-Training
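As an illustrative stand-in (not the model-trust pseudo-code itself), the following minimal Python sketch fits Platt's sigmoid P(y = 1 | f) = 1/(1 + exp(Af + B)) to SVM decision values by minimizing the negative log-likelihood, using plain Newton steps with a backtracking line search; the function and variable names are assumptions made for the example.

```python
# Minimal sketch (not the model-trust pseudo-code): fit Platt's sigmoid
#   P(y = 1 | f) = 1 / (1 + exp(A*f + B))
# to SVM decision values f with labels y in {+1, -1}, by minimizing the
# negative log-likelihood with Newton steps and a backtracking line search.
import numpy as np

def fit_sigmoid(f, y, max_iter=100, min_step=1e-10, ridge=1e-12):
    f = np.asarray(f, dtype=float)
    y = np.asarray(y)
    n_pos = int(np.sum(y > 0))
    n_neg = len(y) - n_pos

    # Platt's smoothed target probabilities (guard against overfitting)
    t = np.where(y > 0, (n_pos + 1.0) / (n_pos + 2.0), 1.0 / (n_neg + 2.0))

    def nll(a, b):
        # numerically stable negative log-likelihood of the sigmoid fit
        z = a * f + b
        return float(np.sum(np.where(z >= 0,
                                     t * z + np.log1p(np.exp(-z)),
                                     (t - 1.0) * z + np.log1p(np.exp(z)))))

    A, B = 0.0, np.log((n_neg + 1.0) / (n_pos + 1.0))    # Platt's starting point
    fval = nll(A, B)
    for _ in range(max_iter):
        z = A * f + B
        p = np.where(z >= 0, np.exp(-z) / (1.0 + np.exp(-z)),
                     1.0 / (1.0 + np.exp(z)))             # P(y = 1 | f), stable
        gA, gB = np.dot(f, t - p), np.sum(t - p)          # gradient of the NLL
        if abs(gA) < 1e-5 and abs(gB) < 1e-5:
            break
        w = p * (1.0 - p)                                 # Hessian weights
        hAA, hAB, hBB = np.dot(f * f, w) + ridge, np.dot(f, w), np.sum(w) + ridge
        det = hAA * hBB - hAB * hAB
        dA = -(hBB * gA - hAB * gB) / det                 # Newton direction
        dB = -(hAA * gB - hAB * gA) / det
        step, slope = 1.0, gA * dA + gB * dB              # slope < 0 (descent)
        while step >= min_step:
            cand = nll(A + step * dA, B + step * dB)
            if cand < fval + 1e-4 * step * slope:         # Armijo condition
                A, B, fval = A + step * dA, B + step * dB, cand
                break
            step /= 2.0
        else:
            break                                         # line search failed
    return A, B

# Usage: map a new decision value f_test to a probability
# A, B = fit_sigmoid(decision_values, labels)
# prob = 1.0 / (1.0 + np.exp(A * f_test + B))
```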


    References

1. Yang, Q., Zhang, H., Li, T.: Mining web logs for prediction models in WWW caching and prefetching. In: 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), August 26–29, pp. 473–478 (2001)

2. Chinen, K., Yamaguchi, S.: An interactive prefetching proxy server for improvement of WWW latency. In: Proceedings of the Seventh Annual Conference of the Internet Society (INET'97), Kuala Lumpur, June 1997

3. Duchamp, D.: Prefetching hyperlinks. In: Proceedings of the Second USENIX Symposium on Internet Technologies and Systems (USITS), Boulder, CO, pp. 127–138 (1999)

4. Teng, W.-G., Chang, C.-Y., Chen, M.-S.: Integrating web caching and web prefetching in client-side proxies. IEEE Trans. Parallel Distrib. Syst. 16(5), 444–455 (2005)

5. Brin, S., Page, L.: The anatomy of a large-scale hypertextual web search engine. In: Proceedings of the 7th International WWW Conference, Brisbane, Australia, pp. 107–117 (1998)

6. Burke, R.: Hybrid recommender systems: survey and experiments. User Model. User-Adapted Interact. 12(4), 331–370 (2002)

7. Mobasher, B., Dai, H., Luo, T., Nakagawa, M.: Effective personalization based on association rule discovery from web usage data. In: Proceedings of the ACM Workshop on Web Information and Data Management (WIDM'01), pp. 9–15 (2001)

8. Sarwar, B.M., Karypis, G., Konstan, J., Riedl, J.: Analysis of recommender algorithms for e-commerce. In: Proceedings of the 2nd ACM E-Commerce Conference (EC'00), October 2000, Minneapolis, Minnesota, pp. 158–167 (2000)

9. Pitkow, J., Pirolli, P.: Mining longest repeating subsequences to predict World Wide Web surfing. In: Proceedings of the 2nd USENIX Symposium on Internet Technologies and Systems (USITS'99), Boulder, Colorado, October 1999, pp. 139–150 (1999)

10. Grcar, M., Fortuna, B., Mladenic, D.: kNN versus SVM in the collaborative filtering framework. In: WebKDD '05, August 21, Chicago, Illinois, USA (2005)

11. Chung, V., Li, C.H., Kwok, J.: Dissimilarity learning for nominal data. Pattern Recognition 37(7), 1471–1477 (2004)

12. Lalmas, M.: Dempster–Shafer's theory of evidence applied to structured documents: modelling uncertainty. In: Proceedings of the 20th Annual International ACM SIGIR, Philadelphia, PA, pp. 110–118 (1997)

13. Pandey, A., Srivastava, J., Shekhar, S.: A web intelligent prefetcher for dynamic pages using association rules: a summary of results. In: SIAM Workshop on Web Mining (2001)

14. Su, Z., Yang, Q., Lu, Y., Zhang, H.: WhatNext: a prediction system for web requests using n-gram sequence models. In: Proceedings of the First International Conference on Web Information Systems Engineering, Hong Kong, June 2000, pp. 200–207 (2000)

15. Chang, C.-Y., Chen, M.-S.: A new cache replacement algorithm for the integration of web caching and prefetching. In: Proceedings of the ACM 11th International Conference on Information and Knowledge Management (CIKM-02), November 4–9, pp. 632–634 (2002)

16. Nasraoui, O., Pavuluri, M.: Complete this puzzle: a connectionist approach to accurate web recommendations based on a committee of predictors. In: Mobasher, B., Liu, B., Masand, B., Nasraoui, O. (eds.) Proceedings of WebKDD 2004, Workshop on Web Mining and Web Usage Analysis, part of the ACM KDD: Knowledge Discovery and Data Mining Conference, Seattle, WA (2004)

17. Nasraoui, O., Petenes, C.: Combining web usage mining and fuzzy inference for website personalization. In: Proceedings of WebKDD, pp. 37–46 (2003)

18. Nasraoui, O., Krishnapuram, R.: One step evolutionary mining of context sensitive associations and web navigation patterns. In: SIAM International Conference on Data Mining, Arlington, VA, April 2002, pp. 531–547 (2002)

19. Kraft, D.H., Chen, J., Martin-Bautista, M.J., Vila, M.A.: Textual information retrieval with user profiles using fuzzy clustering and inferencing. In: Szczepaniak, P.S., Segovia, J., Kacprzyk, J., Zadeh, L.A. (eds.) Intelligent Exploration of the Web. Physica-Verlag, Heidelberg (2002)

20. Nasraoui, O., Krishnapuram, R.: An evolutionary approach to mining robust multi-resolution web profiles and context sensitive URL associations. International Journal of Computational Intelligence and Applications 2(3), 339–348 (2002)

21. Joachims, T., Freitag, D., Mitchell, T.: WebWatcher: a tour guide for the World Wide Web. In: Proceedings of IJCAI-97, pp. 770–777 (1997)

22. Cristianini, N., Shawe-Taylor, J.: Introduction to Support Vector Machines, pp. 93–122. Cambridge University Press, Cambridge (2000)

23. Vapnik, V.: Statistical Learning Theory. Wiley, New York (1998)

24. Platt, J.: Probabilities for SV machines. In: Smola, A., Bartlett, P., Schölkopf, B., Schuurmans, D. (eds.) Advances in Large Margin Classifiers. Original title: Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, pp. 61–74. MIT Press, Cambridge (1999)

25. Wahba, G.: Multivariate function and operator estimation, based on smoothing splines and reproducing kernels. In: Casdagli, M., Eubank, S. (eds.) Nonlinear Modeling and Forecasting, SFI Studies in the Sciences of Complexity, vol. XII, pp. 95–112 (1992)

26. Hastie, T., Tibshirani, R.: Classification by pairwise coupling. In: Proceedings of the 1997 Conference on Advances in Neural Information Processing Systems 10, Denver, Colorado, pp. 507–513 (1997)

27. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing, 2nd edn. Cambridge University Press, Cambridge (1992)

28. Aslandogan, Y.A., Yu, C.T.: Evaluating strategies and systems for content based indexing of person images on the Web. In: Proceedings of the Eighth ACM International Conference on Multimedia, Marina del Rey, California, United States, pp. 313–321 (2000)

29. Shafer, G.: A Mathematical Theory of Evidence. Princeton University Press, Princeton (1976)

30. Bendjebbour, A., Delignon, Y., Fouque, L., Samson, V., Pieczynski, W.: Multisensor image segmentation using Dempster–Shafer fusion in Markov fields context. IEEE Trans. Geosci. Remote Sens. 39(8), 1789–1798 (2001)

31. Aslandogan, Y.A., Mahajani, G.A., Taylor, S.: Evidence combination in medical data mining. In: Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC'04), vol. 2, p. 465 (2004)

32. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining World Wide Web browsing patterns. J. Knowl. Inf. Syst. 1(1) (1999)

33. Pirolli, P., Pitkow, J., Rao, R.: Silk from a sow's ear: extracting usable structures from the web. In: Proceedings of the 1996 Conference on Human Factors in Computing Systems (CHI-96), Vancouver, British Columbia, Canada, pp. 118–125 (1996)

34. Chang, C., Lin, C.: LIBSVM: a library for support vector machines. http://www.csie.ntu.edu.tw/~cjlin/libsvm (2001)

35. Agrawal, R., Srikant, R.: Fast algorithms for mining association rules in large databases. In: Proceedings of the 20th International Conference on Very Large Data Bases, San Francisco, CA, pp. 487–499 (1994)