[advances in intelligent and soft computing] proceedings of the international conference on...

S.C. Satapathy et al. (Eds.): Proceedings of the InConINDIA 2012, AISC 132, pp. 897–905. springerlink.com © Springer-Verlag Berlin Heidelberg 201

Design and Implementation of an Effective Web Server Log Preprocessing System

Saritha Vemulapalli1 and M. Shashi2

1 Department of Information Technology, VNR Vignana jyothi Inst. of Engg. & Tech,

Hyderabad, A.P, India [email protected]

2 Department of CS & SE, Andhra University College of Engg (A),

Visakhapatnam, A.P, India [email protected]

Abstract. The WWW constitutes a huge, distributed and dynamically growing hypermedium that supports access to information and services. As more organizations rely on the WWW to conduct business, user behavior analysis is becoming difficult in web-based applications. Information about users' interactions with a website is stored in server logs and serves as a huge electronic survey of the website. Web usage mining deals with discovering usage patterns from server logs in order to understand and better serve the needs of web users. The raw information contained in a log file is noisy. Preprocessing includes cleaning, user identification, sessionization, path completion and structurization, and is a prerequisite for improving the accuracy and efficiency of the subsequent mining process. This paper presents an effective web log preprocessing system. Experimental results show that the proposed system reduces the log file to about 12% of its original record count and improves the performance of preprocessing in identifying users, sessions, path completion and structurization.

Keywords: Data Mining, Web Log Mining, Web Usage Mining, Preprocessing, Cleaning, User Identification, Sessionization, Path Completion.

1 Introduction

Since 1991, the WWW has become very popular and has developed rapidly. It now forms a large distributed information source comprising 8.75 million websites, 2.5 billion web pages and a great many users [1]. The WWW constitutes a huge, widely distributed and dynamically growing hypermedium that supports access to information and services. With the explosive growth of the information sources available on the WWW, providing web users with precisely the information they need is becoming a critical issue in web-based applications. It has become necessary for users to rely on automated tools in order to find, extract, filter and evaluate the desired information. As a result, web usage mining has attracted a lot of attention in recent times [2].


Web-based applications generate and collect large volumes of data in their day-to-day activities. The majority of this data is generated automatically by web servers and collected in server logs in an unstructured format. Web mining is the application of data mining to the extraction of interesting knowledge from WWW documents and services, expressed in the form of textual, linkage or usage information [3]. Web mining can be divided into web content mining, web structure mining and web usage mining. Web content mining is the process of discovering useful knowledge from the raw data (text, image, audio or video data) available in web pages. Web structure mining is the process of analyzing the links between the pages of a website using the web topology. Cooley et al. [4] introduced the term web usage mining in 1997; it is defined as the process of extracting useful information from server logs (i.e. users' history) to improve web services and performance. The discovered user access patterns can be used in a variety of applications, such as identifying the typical behavior of users [5], clustering users with similar access patterns and adding navigational links [6]. Typical applications are website design and management, web personalization, adaptive websites, recommendation systems, cross-marketing strategies, promotional campaigns and user behavior analysis.

The paper is organized as follows. Section 2 gives an overview of web usage mining. Section 3 presents the design and implementation of the proposed preprocessing system together with the related algorithms. Section 4 covers the experimental results, which demonstrate the effectiveness and efficiency of our algorithms. Conclusions are given in Section 5.

2 Web Usage Mining Process

Web usage mining is the discovery of user access patterns from server logs. It consists of data collection, preprocessing, pattern discovery and analysis, and visualization [7]. The data used for the mining process can be collected from the server side, the client side, proxy servers, the website topology, web page contents and user profile information. Server logs, which are collected as a result of users' interactions with a website and are represented in standard formats (e.g. Common Log Format [8] and Extended Common Log Format [9]), are the primary source of data for web usage mining. The raw information in a web server log file is not structured, complete, reliable or consistent. Preprocessing techniques, which involve cleaning, user identification, session identification, path completion and data structurization, can improve the quality of the data [10]. Statistical and data mining techniques can then be applied to the preprocessed web log data in order to discover statistics and user access patterns, which are presented using visualization techniques such as charts, graphs and reports.

2.1 Common Log Format

Each line in a log file represented in the common log format has the following syntax.

[Host/IP Rfcname Userid [DD/MMM/YYYY:HH:MM:SS -0000] "Method /Path HTTP/1.x" Code Bytes]

A "-" in a field indicates missing data.
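As an illustration of the syntax above, the following is a minimal Python sketch that parses one common log format line with a regular expression. The function name `parse_clf_line`, the dictionary keys and the sample line are assumptions made for this example; they are not part of the proposed system.

```python
import re
from datetime import datetime

# One CLF line: host rfcname userid [timestamp] "method path protocol" code bytes
CLF_PATTERN = re.compile(
    r'^(?P<host>\S+) (?P<rfcname>\S+) (?P<userid>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-)$'
)

def parse_clf_line(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line.strip())
    if match is None:
        return None
    # A "-" in a field indicates missing data, so it is mapped to None.
    record = {k: (None if v == "-" else v) for k, v in match.groupdict().items()}
    record["code"] = int(record["code"])
    if record["bytes"] is not None:
        record["bytes"] = int(record["bytes"])
    record["timestamp"] = datetime.strptime(
        record["timestamp"], "%d/%b/%Y:%H:%M:%S %z"
    )
    return record

# Hypothetical sample line (not taken from the analyzed log).
sample = '192.0.2.10 - - [15/Nov/2010:10:15:32 +0530] "GET /index.html HTTP/1.1" 200 5120'
parsed = parse_clf_line(sample)
```

Lines that do not follow the format return `None`, so a caller can count or skip malformed entries instead of failing on them.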


2.2 Extended Common Log Format

It is an extension of the common log format that carries additional information such as user_agent, cookie and referrer. User_agent is the visitor's browser version and operating system. Referrer is the URL from which the visitor came. Each line in a log file represented in the extended common log format has the following syntax.

[s-computername s-ip s-port c-ip rfcname cs-userid date time cs-method cs-uri-stem cs-uri-query cs-version sc-status time-taken sc-bytes cs(user-agent) cs(cookie) cs(referrer)]
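The field list above maps naturally onto a dictionary per log line. The following Python sketch assumes that fields are separated by single spaces and that spaces inside a value (for example in the user agent) are encoded as "+", so that a plain split is sufficient; the function name and the sample line are hypothetical.

```python
# Field order as listed in the extended common log format syntax above.
ECLF_FIELDS = [
    "s-computername", "s-ip", "s-port", "c-ip", "rfcname", "cs-userid",
    "date", "time", "cs-method", "cs-uri-stem", "cs-uri-query", "cs-version",
    "sc-status", "time-taken", "sc-bytes",
    "cs(user-agent)", "cs(cookie)", "cs(referrer)",
]

def parse_eclf_line(line):
    """Split one log line into a field-name -> value dict.

    Returns None for blank lines, comment lines (starting with '#') and
    lines whose field count does not match ECLF_FIELDS.
    """
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    values = line.split()
    if len(values) != len(ECLF_FIELDS):
        return None
    return dict(zip(ECLF_FIELDS, values))

# Hypothetical sample line (not taken from the analyzed log).
sample = (
    "WEB1 192.0.2.1 80 198.51.100.7 - - 2010-11-15 10:15:32 GET /index.html - "
    "HTTP/1.1 200 15 5120 Mozilla/5.0+(Windows+NT+6.1) - http://www.example.com/"
)
record = parse_eclf_line(sample)
```

Values are kept as strings here; converting `sc-status` or the date and time fields can be deferred to the stage that needs them.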

3 Design and Implementation of Proposed Preprocessing System

The proposed preprocessing system uses the server logs of www.vnrvjiet.ac.in. These logs are generated implicitly as a result of user interactions with the website and are represented in the Extended Common Log Format (ECLF). Most researchers consider the web server log file to be the most reliable and accurate data source for the web usage mining (WUM) process.

The following are some of the drawbacks of using server logs.

• Since HTTP is stateless, web server logs do not identify sessions or users.
• Web caches keep track of the pages that are requested and save a copy of these pages for a certain period. Hence, these requests are not recorded in log files.
• IP addresses are misinterpreted because of shared computers.
• The browser's back button is "the second most used feature" on the web; it accounts for 41% of all user interaction requests for web documents [11].

Our proposed system addresses all the above issues.

3.1 Data Preprocessing

As web server logs are not designed for data mining, preprocessing must be carried out in order to obtain reliable and accurate data. Low-quality data will lead to low-quality mining results. Data preprocessing techniques can improve the quality of the data, thereby helping to improve the accuracy and efficiency of the subsequent mining process. Nearly 80% of the mining effort is spent on improving the quality of the data [12]. The proposed preprocessing system, shown in Fig. 1, consists of the following components: cleaning, user identification, session identification, path completion and data structurization. The implementation issues are explained below.

Fig. 1. Proposed Preprocessing System


Data Cleaning: The process of removing entries that are irrelevant or redundant for pattern discovery. HTTP is a stateless and connectionless protocol that requires a separate connection for every file requested from the web server. In general, a user does not explicitly request all of the graphics on a web page; they are downloaded automatically because of the embedded HTML tags. In real-world data, irrelevant files are found at a ratio of up to 10:1, depending on how many graphics and other files the web pages contain [10]. The main intent of web usage mining is to get a picture of the user's behavior, so file requests that the user did not explicitly make are of no interest. Removing such entries decreases memory usage and improves performance.

The following rules are used for data cleaning in our proposed system:

i) Remove all attributes that contain no data at all, as well as attributes that are not essential for the analysis.
ii) Remove log entries for image, sound, video, flash animation, frame, pop-up page, script and style sheet files.
iii) Remove access records generated by automatic search engine agents such as crawlers, spiders and robots. Spiders are widely used by web search engine tools to update their search indexes [13]. Spider requests can be identified by looking for:
    a) All hosts that have requested the page "robots.txt".
    b) Self-declared crawlers. Many crawlers voluntarily declare themselves in the user agent field of the log, so this field is checked for whether it contains either a URL or an email address.
iv) Remove log entries that have a status of "error" or "failure". All entries with a status code outside the 200 range are removed.
v) Remove log entries whose HTTP request method is other than GET or POST.
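Rule iii can be sketched in Python as follows. This is only an illustration under stated assumptions: records are dictionaries keyed by the ECLF field names, and the two regular expressions are simple heuristics chosen for this example rather than the patterns used by the proposed system. The sample records are hypothetical.

```python
import re

# Heuristics for rule iii(b): a user agent that embeds a URL or an email address.
URL_PATTERN = re.compile(r"https?://|www\.", re.IGNORECASE)
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def find_crawler_hosts(records):
    """Rule iii(a): collect every client IP that requested robots.txt."""
    return {
        r["c-ip"] for r in records
        if r["cs-uri-stem"].lower().endswith("/robots.txt")
    }

def declares_itself_crawler(user_agent):
    """Rule iii(b): the user agent contains a URL or an email address."""
    return bool(URL_PATTERN.search(user_agent) or EMAIL_PATTERN.search(user_agent))

def is_crawler_request(record, crawler_hosts):
    """A request is treated as a crawler request if either heuristic fires."""
    return (record["c-ip"] in crawler_hosts
            or declares_itself_crawler(record["cs(user-agent)"]))

# Hypothetical records, reduced to the fields the heuristics need.
records = [
    {"c-ip": "203.0.113.5", "cs-uri-stem": "/robots.txt",
     "cs(user-agent)": "ExampleBot/1.0"},
    {"c-ip": "203.0.113.5", "cs-uri-stem": "/index.html",
     "cs(user-agent)": "ExampleBot/1.0"},
    {"c-ip": "198.51.100.7", "cs-uri-stem": "/index.html",
     "cs(user-agent)": "Mozilla/5.0+(Windows+NT+6.1)"},
    {"c-ip": "198.51.100.9", "cs-uri-stem": "/about.html",
     "cs(user-agent)": "Mozilla/5.0+(compatible;+ExampleBot/2.1;++http://www.example.com/bot.html)"},
]
crawler_hosts = find_crawler_hosts(records)
flags = [is_crawler_request(r, crawler_hosts) for r in records]
```

Both heuristics can misfire: a browser extension may add a URL to the user agent string, and a crawler that neither reads robots.txt nor declares itself will pass undetected.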

The following is an algorithm for data cleaning:

Input: Records of the server log file, represented as log_file R = {R1, R2, …, Ri, …, Rn}, where Ri = <F1, F2, …, Fj, …, Fn> is a record in log_file and is defined as <s-computername, s-ip, s-port, c-ip, rfcname, cs-userid, date, time, cs-method, cs-uri-stem, cs-uri-query, cs-version, sc-status, time-taken, sc-bytes, cs(user-agent), cs(cookie), cs(referrer)>.

Output: log_information and data_cleaning, which are database objects.

Algorithm:

Begin
1. Remove the attributes that are not essential for the analysis, such as <s-computername, s-ip, s-port, rfcname, cs-uri-query, time-taken, sc-bytes>, from log_file.
2. Remove the attributes that contain no data in any record of log_file. // "-" indicates missing values
3. FOR each Record Ri in log_file DO // 1<=i<=n
       Insert Ri into log_information
   END FOR
4. FOR each Record Ri in log_information DO
       IF (Ri does not represent an image, sound, video, flash animation, frame, pop-up page, script or style sheet file, a crawler request, an error request, or a request other than GET or POST) Then
           Insert Ri into data_cleaning
       END IF
   END FOR
End
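The cleaning algorithm can be sketched in Python as follows, assuming that records are dictionaries keyed by the ECLF field names. The extension list is an assumption (the rules name file types, not extensions), the set of crawler hosts is passed in as a parameter, and step 2 as well as frame and pop-up page detection are left out because they depend on the site.

```python
# Assumed extensions for irrelevant files; adjust to the site being analyzed.
IRRELEVANT_EXTENSIONS = (
    ".gif", ".jpg", ".jpeg", ".png", ".ico", ".bmp",  # images
    ".mp3", ".wav", ".avi", ".mpg", ".swf",           # sound, video, flash
    ".js", ".css",                                    # scripts, style sheets
)

# Attributes removed in step 1 of the algorithm.
NON_ESSENTIAL_FIELDS = {
    "s-computername", "s-ip", "s-port", "rfcname",
    "cs-uri-query", "time-taken", "sc-bytes",
}

def project(record):
    """Step 1: drop the attributes that are not needed for the analysis."""
    return {k: v for k, v in record.items() if k not in NON_ESSENTIAL_FIELDS}

def is_relevant(record, crawler_hosts=frozenset()):
    """Step 4: keep only successful GET/POST page requests from non-crawlers."""
    if record["cs-uri-stem"].lower().endswith(IRRELEVANT_EXTENSIONS):
        return False
    if record["c-ip"] in crawler_hosts:
        return False
    if not 200 <= int(record["sc-status"]) <= 299:
        return False
    return record["cs-method"].upper() in ("GET", "POST")

def clean(log_file, crawler_hosts=frozenset()):
    """Return (log_information, data_cleaning) as lists of record dicts."""
    log_information = [project(r) for r in log_file]  # steps 1 and 3
    data_cleaning = [r for r in log_information
                     if is_relevant(r, crawler_hosts)]  # step 4
    return log_information, data_cleaning

# Hypothetical records, reduced to the fields the filter needs.
log_file = [
    {"c-ip": "198.51.100.7", "s-port": "80", "cs-method": "GET",
     "cs-uri-stem": "/index.html", "sc-status": "200"},
    {"c-ip": "198.51.100.7", "s-port": "80", "cs-method": "GET",
     "cs-uri-stem": "/images/logo.GIF", "sc-status": "200"},
    {"c-ip": "198.51.100.7", "s-port": "80", "cs-method": "GET",
     "cs-uri-stem": "/missing.html", "sc-status": "404"},
    {"c-ip": "198.51.100.7", "s-port": "80", "cs-method": "HEAD",
     "cs-uri-stem": "/index.html", "sc-status": "200"},
    {"c-ip": "203.0.113.5", "s-port": "80", "cs-method": "GET",
     "cs-uri-stem": "/index.html", "sc-status": "200"},
]
log_information, data_cleaning = clean(log_file, crawler_hosts={"203.0.113.5"})
```

In this small example only the first record survives: the others are an image, an error response, a HEAD request and a request from a host marked as a crawler.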

User Identification: The process of identifying the unique users who interact with a website through a web browser. The analysis of web usage does not require knowledge of a user's identity; however, it is necessary to distinguish among different users.

The following rules are used for user identification in our proposed system:

i) If the IP address is different, a new user is assumed.
ii) If the IP address is the same but the operating system or browser software is different, a new user is assumed.
iii) If the IP address, operating system and browser software are the same but the HTTP version is different, a new user is assumed.

The following is an algorithm for user identification:

Input: Records of data_cleaning "R" = {R1, R2, …, Ri, …, Rn}.

Output: Records of users_info "U" = {U1, U2, …, Uj, …, Un}, which is a database object, where Uj = <F1, F2, …, Fk, …, Fn> is a record in users_info.

Algorithm:

Begin
1. Select the IP, user_agent and version fields of the records of data_cleaning.
2. Insert R1 into users_info // R1 is the first record in R
3. FOR each Record Ri in data_cleaning DO // 1<=i<=n
       IF, for every Record Uj in users_info, ((IP is different) OR (IP is the same, but the operating system or browser software is different) OR (IP, operating system and browser software are the same, but the HTTP version is different)) Then
           Insert Ri into users_info;
       END IF
   END FOR
End
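Taken together, the three rules amount to treating the triple (IP address, user agent, HTTP version) as the key of a user. A minimal Python sketch under that reading is given below; records are assumed to be dictionaries keyed by the ECLF field names, and a set lookup stands in for the scan over users_info. The sample data is hypothetical.

```python
def identify_users(data_cleaning):
    """Return the unique (IP, user agent, HTTP version) triples
    in order of first appearance."""
    users_info = []
    seen = set()
    for record in data_cleaning:
        key = (record["c-ip"], record["cs(user-agent)"], record["cs-version"])
        if key not in seen:  # differs from every user identified so far
            seen.add(key)
            users_info.append(key)
    return users_info

# Hypothetical cleaned records.
data_cleaning = [
    {"c-ip": "198.51.100.7", "cs(user-agent)": "BrowserA", "cs-version": "HTTP/1.1"},
    {"c-ip": "198.51.100.7", "cs(user-agent)": "BrowserA", "cs-version": "HTTP/1.1"},
    {"c-ip": "198.51.100.7", "cs(user-agent)": "BrowserB", "cs-version": "HTTP/1.1"},
    {"c-ip": "198.51.100.7", "cs(user-agent)": "BrowserA", "cs-version": "HTTP/1.0"},
    {"c-ip": "198.51.100.8", "cs(user-agent)": "BrowserA", "cs-version": "HTTP/1.1"},
]
users_info = identify_users(data_cleaning)
```

Here one IP address yields three users because the user agent or the HTTP version differs, and the second IP address yields a fourth.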


User's Session Identification: Web logs span long periods of time, so it is very likely that users will visit the website more than once. Session identification is the process of identifying the sequence of activities of a single user during a single visit within a defined duration [14]. Since the HTTP protocol is stateless and connectionless, discovering users' sessions from the server log is a complex task.

The following rules are used for session identification in our proposed system:

i) A new session begins whenever there is a new user.
ii) A new session begins whenever the time gap between consecutive requests made by the same user exceeds the threshold Δt = 10 minutes and the referrer is "-".
iii) A new session begins if the URL in the referrer field has never been accessed before in the current session.
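For the requests of a single user, the three rules can be sketched in Python as follows. The sketch assumes that requests are already sorted by time, that referrer URLs have been normalized to the same form as the requested URLs, and that rule iii applies only when a referrer is present; these are interpretations made for the example, and the request data is hypothetical.

```python
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=10)  # threshold Δt from rule ii

def sessionize(user_requests):
    """Split one user's time-ordered requests into sessions (lists of URLs).

    Each request is a (timestamp, url, referrer) tuple; the referrer is "-"
    when it is missing.
    """
    sessions = []
    current = []
    last_time = None
    for timestamp, url, referrer in user_requests:
        if not current:
            start_new = False  # rule i: the first request opens the first session
        elif referrer == "-":
            start_new = timestamp - last_time > SESSION_GAP  # rule ii
        else:
            start_new = referrer not in current  # rule iii
        if start_new:
            sessions.append(current)
            current = []
        current.append(url)
        last_time = timestamp
    if current:
        sessions.append(current)
    return sessions

# Hypothetical requests of one user.
t0 = datetime(2010, 11, 15, 10, 0)
user_requests = [
    (t0, "/index.html", "-"),
    (t0 + timedelta(minutes=2), "/courses.html", "/index.html"),
    (t0 + timedelta(minutes=5), "/faculty.html", "/index.html"),
    (t0 + timedelta(minutes=30), "/index.html", "-"),
    (t0 + timedelta(minutes=31), "/contact.html", "/index.html"),
    (t0 + timedelta(minutes=33), "/library.html", "/search"),
]
sessions = sessionize(user_requests)
```

The fourth request starts a new session because it has no referrer and follows a 25-minute gap; the last request starts another because its referrer was never visited in the session in progress.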

Path Completion: Some important page requests are not recorded in the server log because of caching, which causes the problem of incomplete paths. Path completion is the process of reconstructing the user's navigation path by appending the missed page requests (page requests that are not recorded in the server log) within the identified sessions.

The following rules are used for path completion in our proposed system:

i) Within the identified user's sessions, if the URL in the referrer field of a page request is not the same as the URL of the last page the user requested, and the URL in the referrer field is in the user's history, it is assumed that the user used the "back" button. The missing page references inferred through this rule are added to the user's session file.
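The rule states when missing references are inferred but not exactly which pages are added. The Python sketch below assumes one common choice: walk back through the pages seen so far, re-adding each one, until the referrer page is reached. The function name and the sample session are hypothetical.

```python
def complete_path(session):
    """Reconstruct a session path by inserting inferred back-button revisits.

    session is a list of (url, referrer) pairs in request order; the
    referrer is "-" when it is missing.
    """
    path = []
    for url, referrer in session:
        if path and referrer != "-" and referrer != path[-1] and referrer in path:
            # The user is assumed to have pressed "back" until reaching the
            # referrer page; those revisits came from the cache and were
            # never logged, so they are appended here.
            index = len(path) - 2
            while path[index] != referrer:
                path.append(path[index])
                index -= 1
            path.append(referrer)
        path.append(url)
    return path

# Hypothetical session: the user goes A -> B -> C, then back to A, then to D.
session = [
    ("/A", "-"),
    ("/B", "/A"),
    ("/C", "/B"),
    ("/D", "/A"),
]
reconstructed = complete_path(session)
```

For this session the logged path is /A, /B, /C, /D, while the reconstructed path also contains the revisits to /B and /A that the back button produced.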

The following is an algorithm for sessionization & path completion:

Input: Records of data_cleaning "R" = {R1, R2, …, Ri, …, Rn}, sorted in ascending order of date and time, and records of users_info "U" = {U1, U2, …, Uj, …, Un}.

Output: Records of users_sessions "S" = {S1, S2, …, Sk, …, Sn}, which is a database object, where Sk = {Uj, pathi} is a session in S, Uj is a record in users_info and pathi is defined as urli1urli2 …urlin // 1<=i<=n. Also records of users_sessions_path "RS" = {RS1, RS2, …, RSk, …, RSn}, which is a database object, where RSk = {Uj, pathi} is a reconstructed session in RS, Uj is a record in users_info and pathi is defined as urli1urli2 …urlin // 1<=i<=n.

Algorithm:

Begin
Set S = { }, RS = { };
FOR each Record Uj in users_info DO
    Create a new Session Sk & Reconstructed Session RSk;
    FOR each Record Ri in data_cleaning DO
        IF (Values of IP, user_agent & version are same) Then
            IF ((Referrer is "-" & Time gap between consecutive requests by the same user > 10 min) OR (URL in Referrer field has never been accessed before in current session)) Then
                Create new Session Sk & Reconstructed Session RSk;
                Add uri-stem field to pathi of the current Session Sk & pathi of the current Reconstructed Session RSk;
            Else
                Add uri-stem field to pathi of the current Session Sk;
                IF (URL in Referrer field is not equivalent to URL of last page user has requested) Then
                    Add missing page references to pathi of the current Reconstructed Session RSk;
                Else
                    Add uri-stem field to pathi of the current Reconstructed Session RSk;
                END IF
            END IF
        END IF
    END FOR
    FOR each Session in Sk & RSk DO
        Insert Sk into users_sessions;
        Insert RSk into users_sessions_path;
    END FOR
END FOR
End

Data Structurization: The process of transforming the data and storing it in a form suitable as input to pattern discovery. Separate tables are designed in the relational database for each object identified in the various stages of preprocessing.
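As a rough illustration of such a table design, the following Python sketch creates three of the objects named in the algorithms (users_info, users_sessions and users_sessions_path) in an in-memory SQLite database. SQLite, the column names and the space-separated path encoding are assumptions made for this example; the database and schema actually used by the system are not specified here.

```python
import sqlite3

# Assumed schema: table names follow the algorithms, columns are illustrative.
SCHEMA = """
CREATE TABLE users_info (
    user_id    INTEGER PRIMARY KEY,
    ip         TEXT NOT NULL,
    user_agent TEXT NOT NULL,
    version    TEXT NOT NULL,
    UNIQUE (ip, user_agent, version)
);
CREATE TABLE users_sessions (
    session_id INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users_info(user_id),
    path       TEXT NOT NULL
);
CREATE TABLE users_sessions_path (
    session_id INTEGER PRIMARY KEY REFERENCES users_sessions(session_id),
    path       TEXT NOT NULL
);
"""

def create_database(path=":memory:"):
    """Create the preprocessing tables and return the open connection."""
    connection = sqlite3.connect(path)
    connection.executescript(SCHEMA)
    return connection

connection = create_database()
cursor = connection.execute(
    "INSERT INTO users_info (ip, user_agent, version) VALUES (?, ?, ?)",
    ("198.51.100.7", "BrowserA", "HTTP/1.1"),  # hypothetical user
)
user_id = cursor.lastrowid
connection.execute(
    "INSERT INTO users_sessions (user_id, path) VALUES (?, ?)",
    (user_id, "/index.html /courses.html"),
)
connection.commit()
session_count = connection.execute(
    "SELECT COUNT(*) FROM users_sessions WHERE user_id = ?", (user_id,)
).fetchone()[0]
```

The UNIQUE constraint on users_info mirrors the user identification rules, so inserting the same (IP, user agent, version) triple twice is rejected by the database.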

4 Experimental Results

The proposed system was developed in the Java programming language for IIS web server logs represented in ECLF. An experimental analysis was carried out to validate the effectiveness and efficiency of the proposed preprocessing system. The server log file of www.vnrvjiet.ac.in for 15th Nov 2010, containing 10,375 records, was selected for analysis. The results of preprocessing are shown in Table 1. After cleaning, the number of records is reduced to 1,220 (12% of the original records), and 235 unique users and 589 user sessions are identified.

Table 1. The Results of Data Preprocessing

Records in log file    Records after cleaning    No. of unique users    Sessions
10,375                 1,220                     235                    589

Table 2 shows the results of the data cleaning process. Row 1 represents the records in the raw server log file; row 2, the records after removing image, sound, video, flash animation, frame, pop-up page, script and style sheet files; row 3, the records after further removing crawler requests; and row 4, the records after further removing error requests. Table 3 shows the results of user identification. Row 1 represents the unique users identified using the IP address alone; row 2, the unique users identified using the IP address and user_agent; and row 3, the unique users identified using the IP address, user_agent and version.


Table 2. Results of Data Cleaning Process

Cleaning process    No. of records
1                   10,375
2                   1,581
3                   1,230
4                   1,220

Table 3. Results of User Identification Process

User identification process    No. of users
1                              215
2                              234
3                              235

Table 4. Results of Session Identification Process

Session identification process    No. of sessions
1                                 269
2                                 253
3                                 818
4                                 589

User identification using the IP address alone is not sufficient or reliable. It can result in several users being erroneously grouped together as one user: although an IP address may represent only one person, in most cases an IP address is shared by more than one person (for example at a library or an internet cafe), and one user may also use multiple computers. So, different users sharing the same host cannot be distinguished.

Our experimental results show that unique users can be identified more effectively using the user_agent and version fields along with the IP address. The rationale behind this rule is that a user navigating the website rarely employs more than one browser, much less more than one OS.

Table 4 shows the results of the user session identification process. Row 1 represents the sessions identified when the time gap between two consecutive page requests exceeds 10 minutes; row 2, the sessions identified using a session duration of 30 minutes; row 3, the sessions identified based on whether the URL in the referrer field has been accessed before in the current session; and row 4, the sessions identified using the proposed session identification rules.

Time-based methods are not reliable because users may engage in other activities after opening a web page, and because factors such as a busy communication line, the loading time of the components of a web page and the content size of web pages are not considered. The referrer-based method introduces confusion when a user types a URL directly or uses a bookmark to reach pages that are not connected via links, and the identified sessions may contain more than one visit by the same user at different times. Our experimental results show that sessions can be identified more effectively using our proposed session identification rules.

5 Conclusion and Future Enhancements

The raw information contained in a web server log file as a result of users' interactions with a website is not structured, complete, reliable or consistent. As web server logs are not designed for data mining, preprocessing must be carried out to improve the accuracy and efficiency of the subsequent mining process. Low-quality data will lead to low-quality mining results. The server logs of www.vnrvjiet.ac.in were analyzed using the proposed preprocessing system in order to identify unique users and user sessions and to perform path completion and data structurization, which play a major role in the web usage mining process of discovering useful hidden patterns that reflect the typical behavior of users. Experimental results show that the proposed system reduces the log file to about 12% of its original record count and improves the performance of preprocessing in identifying users, sessions, path completion and structurization.

The proposed system can be enhanced in the future for more accurate session identification and path completion.

References

1. Gudivada, V.N.: Information retrieval on the World Wide Web. IEEE Internet Computing 1(5), 58–68 (1997)

2. Cooley, R., Mobasher, B., Srivastava, J.: Web mining: information and pattern discovery on the World Wide Web. In: International Conference on Tools with Artificial Intelligence, pp. 558–567. IEEE, Newport Beach (1997)

3. Cooley, R., Mobasher, B., Srivastava, J.: Data preparation for mining World Wide Web browsing patterns. Journal of Knowledge and Information System, 1–27 (1999)

4. Cooley, R., Mobasher, B.S., Srivastava, J.: Grouping Web page references into transactions for mining World Wide Web browsing patterns. In: Knowledge and Data Engineering Workshop, pp. 2–9. IEEE, Newport Beach (1997)

5. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.-N.: Web usage mining: Discovery and applications of usage patterns from Web data. SIGKDD Explorations 1, 12–23 (2000)

6. Masseglia, F., Poncelet, P., Teisseire, M.: Using data mining techniques on Web access logs to dynamically improve Hypertext structure. ACM SigWeb Letters 8(3), 13–19 (1999)

7. Pabarskaite, Z., Raudys, A.: A process of knowledge discovery from web log data: Systematization and critical review. Journal of Intelligent Information Systems 28(1), 79–104 (2007)

8. Configuration file of W3C httpd (1995), http://www.w3.org/Daemon/User/Config/

9. W3C Extended Log File Format (1996), http://www.w3.org/TR/WD-logfile.html

10. Cooley, R., Mobasher, B., Srivastava, J.: Data Preparation for Mining World Wide Web Browsing Patterns. J. Knowledge and Information Systems 1(1), 5–32 (1999)

11. Catledge, L.D., Pitkow, J.E.: Characterizing browsing strategies in the world-wide web. Computer Networks and ISDN Systems 27, 1065–1073 (1995)

12. Frieder, O., Grossman, D.A.: Information Retrieval: Algorithms and Heuristics. The Information Retrieval Series, 2nd edn (2004)

13. Tanasa, D., Trousse, B.: Advanced data preprocessing for intersites Web usage mining. IEEE Intelligent Systems 19, 59–65 (2004)

14. Spiliopoulou, M.: Managing Interesting Rules in Sequence Mining. In: Żytkow, J.M., Rauch, J. (eds.) PKDD 1999. LNCS (LNAI), vol. 1704, pp. 554–560. Springer, Heidelberg (1999)
