13.--analysis of web server log -full
DESCRIPTION
ÂTRANSCRIPT
ANALYSIS OF WEB SERVER LOG BY WEB USAGE MINING FOR EXTRACTING
USERS PATTERNS
OM KUMAR C. U.1 & P. BHARGAVI
2
1Department of Computer Science, Sree Vidyaniketan Engineering College, Tirupathi, Andhra Pradesh, India
2Senior Asst.Prof , Dept of CSE, Sree Vidyanikethan Engineering College , Tirupathi, Andhra Pradesh, India
ABSTRACT
WWW is a system of interlinked hypertext documents accessed via the Internet. Around 11 Hundred million
people access internet daily. And so the information available on WWW is also growing. With this continued growth of
information and proliferation of web services and web based information systems, web sites are also growing to host them.
Before analyzing such data using data mining technique the Servers web log need to be preprocessed. The log file data
offer insight into website usage. They can be collected from Web servers, Proxy Servers, Web Client. Web Usage mining
applies data mining technique to extract knowledge from these web log files. This paper discusses about the Log files and
uses Web mining techniques to extract usage patterns by using WEKA.
KEYWORDS: Pre-Processing, Web Usage Mining, Web Server Log Data, Classification, Clustering, Rule Based
Mining, Pattern Discovery
INTRODUCTION
WWW continues to grow at an astounding rate in both information and users perspective. The scale of information
on the internet is growing at an comprehensible rate, similar to the mystifying size of planets and stars. Internet has become
a place where a massive amount of information and data is being generated every day. Every Minute YouTube users upload
48 hours of video, Facebook users share 684,748 pieces of content, Instagram users share 3600 pictures and Tumblr shares
27,778 new posting. Over the last decade with the continued increase in the usage of WWW, Web mining has been
established as an important area of research. Web mining is used to analyse users using WWW who leave abundant
information in web log, which is structurally complex and incremental in nature. A Log file is a record that records
everything that goes in and out of a particular server. Analysing such data will yield knowledge but pre-processing of that
data is required before analysing it. Once analysing the Log File, they provide activities of users over a potentially long
period of time. They can be collected from web server, proxy server and Web client. These logs when mined properly
provide useful information for decision making. They contain information such as username, IP Address, timestamp, bytes
transferred, referred URL, User agent.
Based on the research of web mining [8] they are classified into 3 domains.
Web Content Mining
Web Structure Mining
Web Usage Mining.
Web usage mining in Figure 2 is a process of extracting information from server logs (i.e.,) users history. They
help in finding out what users are looking out in Internet.
International Journal of Computer Science Engineering
and Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 2, Jun 2013, 123-136
© TJPRC Pvt. Ltd.
124 Om Kumar C.U. & P. Bhargavi
Web Structure Mining in Figure 2 is the process of using graph theory to analyse node and connection structure of
web sites. They are of 2 types.
Extracting Pattern from Hyperlinks in the Web
They are structural components that connect the web page to different location.
Mining Document Structure
Analyses tree like structure of page to describe html, xml tag usages.Web content mining in Figure 2 is the mining,
finding pattern and extracting knowledge from web contents. They are of 2 types.
Information retrieval view
Database View
RELATED WORK
Data Preprocessing
The Log file contains immaterial attributes. So before mining the Log file, preprocessing needs to be done.
General preprocessing techniques applied on data are cleaning, Integration, Transformation and reduction. By applying the
above preprocessing techniques incomplete attributes, noisy data which contain errors and inconsistent data that has
discrepancies can be removed. By applying Data preprocessing we improve the quality of data.
Once the cleaned data is transformed, User sessions may be tracked to identify the user, and from it the user
patterns can be extracted.
Figure 1: Data Pre-Processing
The obtained data is now clean and can be analysed. Web mining process is considered for analysing the pre-
processed Log File.
Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 125
Figure 2: Web Mining
Web Usage Mining
The goal of web usage mining is to get into the records of the servers (log files) that store the transactions that are
performed in the web in order to find patterns revealing the usage the customers [7][11]. We can also distinguish here:
General access pattern tracking. Here we combine the access patterns of a group rather than an individual to get a
trend that allow us to organize the web structure in such a way that the user is facilitated.
Customized access pattern tracking. Here we gather information regarding a client’s behavior with the website.
Based on the gathered information suggestions and advices are provided to improve the quality.
Web Content Mining
Here information is gathered regarding the search performed on the content to identify user patterns. There are two
main strategies of Web Content Mining are as follows:
Information Retrieval View: R. Kosala et al. summarized the research works done for unstructured data and semi-
structured data from information retrieval view. Study has revealed that researches use frequent words, which is generally a
single word. These Single words are considered as training corpuses for ranking the content in the web based on the number
of referrals. For the semi-structured data, all the works utilize the HTML structures inside the documents and some utilized
the hyperlink structure between the documents for document representation.
Database View
As for the database view, in order to have the better information management and querying on the web, the mining
always tries to infer the structure of the web site to transform a web site to become a database.
Web Structured Mining
This type of mining is usually done to reveal the structure of websites by gathering structure related data. Typically
it takes into account two types of links: static and dynamic.
SERVER LOG
The server responds to user requests and the server log records all the transactions right from start up to shutdown
of the server. They provide time stamp to user requests and respond by recording requested ID with the requested action.
126 Om Kumar C.U. & P. Bhargavi
These Log files can be located in 3 places.
Web Servers- A web server is dispenses the web pages as they are requested.
Proxy Server- A proxy server is a intermediary compute that acts as a computer hub through which user requests
are processed.
Web Client- A Web client is a computer application, such as a web browser, that runs on a user local computer or
workstation and connects to a server as necessary
Table 1: Types of Server Logs
S.No Type of Server Log Example
Transfer- these log records remote hosts visiting it with
its time stamp. 120.236.0.14 -2007-11-05
2 Agent - A remote host surfs through a browser. Agent
log records the user agent (browser). InternetExplorer/5.0(win 7;)
3 Error- This log records all the warnings, errors caused
to its system with the time and type of error.
10/10/2012 10:07:09 unable to open
file‖academics.‖ —No such file 05/06/2008
13:04:10—could not load repository template
extension.
4
Referrer- They provide extra features like user reference
as links. When used with agent log provides detail like
the type of user and the user agent. They can track
external hosts using your document from your space.
http://myblaze.sez.html>/pictures/myspace/happy.gif
Contents of Log File
A server log file is a log file that automatically creates and maintains the activities performed in it. It maintains a
history of page requests. It helps us in understanding how and when your website pages and application are being accessed
by the web browser. These log files contain information such as the IP address of the remote host,content requested, and a
time of request.
SYNTAX
IPaddress, logproprietor, Username, [DD:MM:YYYY: Timestamp GMToffset] , "req method" .
Ex
104.11.13.108 - - [13/Jan/2006:16:56:12 -0600]
"GET /EDC/cell.htm HTTP/1.0" 200 4093.
IP address- The IP address of the http request is recorded to identify the remote host. Ex: 204.31.113.138.
log proprietor- The name of the owner making an http request is recorded through this field. They do not expose
this information for security purpose. When they are not exposed they are denoted by (-).
Username- This field records the name of the user when it gets a http request. They do not expose this information
for security purpose. When they are not exposed they are denoted by (-).
date- Request date is recorded here in the mentioned format. Ex: 13/Jan/2006.
time- Request Time of the HTTP request is recorded here in astronomical format. Ex: 16:56:12.
Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 127
GMToffset- This field shows the time difference between the actual request time and Greenwich Mean time so
that request from corner of the world can be analysed in any part of the world. Ex: -0600.
Reqmethod- The request type of the request is stored. Ex: GET.
Types of Log Formats
NCSA Log Formats,
W3C Extended Log Format,
Microsoft IIS Log Format,
Sun One Web Server Format.
NCSA Log Formats
National Centre for Supercomputing Application (NCSA) established in 1986 developed a web server called httpd
at its centre. This web server had a log initially which had several extensions later.
NCSA Common Log or Access Log Format
Stores basic information about the request received.
Syntax
Host IP address, Proprietor, Username, date: time, request method, status code, byte size.
Ex
200.40.12.4, -, -, [2006/Oct/10:10:16:52 +0500], GET /svec.html http 1.0‖, 200, 1460.
NCSA Combined Log Format
Stores all common log information with two additional fields. ―referrer‖ –history of request, ―user_agent‖ – type
of browser.
Syntax
Host IPaddress, Proprietor, Username, date: time, request method, status code, byte size, referrer, User_agent,
Cookie.
Ex
200.40.12.4, -, -,[2006/oct/10:10:16:52 +0500], GET /svec.html http 1.0‖, 200, 1460, http://www.cbci.com/,
―Mozilla/5.0 (WIN:7)‖, ―UserID=om123;Pwd=101112‖.
NCSA Separate Log Format
In this type the information is split into 3 log files instead of storing it in a single file.
The three log files are a) access log
b) Referral log
c) Agent log
128 Om Kumar C.U. & P. Bhargavi
W3C Extended Log Format
The worldwide web consortium (w3c) is an international standards organization. They provide rich information
hence the name extended log format.
The lines starting with # contain directives.
#version <int><int>
#Software- the software which generated the log .
#date-<date><time>
#fields-this directive lists a sequence of entries. They are as follows.
Table 2: W3C Directives
Acronym Description
C Client
S Server
CS Client to Server
SC Server to Client
r Remote host
Sr Server to Remote host
rS Remote host to Server
Ex
#version: 2.0
#Software: Microsoft windows server1.0
#date : < date> <time>
#field: C-ip S-ip CS-username CS-method CS-uri CS-version CS-user-agent SC-status
2000-04-14 10: 16: 42 , 192.16.14.1, 200.14.100.4 - GET/Mypictures.gif http1.0 ―Mozilla/4.0‖ 200.
Microsoft IIS Log File
Internet information service enables you to track or record the activities happening in your website through File
transfer protocol (ftp), Network News Transfer Protocol (NNTP), Simple Mail Transfer Protocol (SMTP) by allowing you
to choose a log format that works in synchronization with your system environment. Some supplementary attributes
provided are as follows. Elapsed time, total bytes transferred, target file.
Syntax
IP add, date timestamp, Server name, Server IP, elapsed time, http request size, byte size, status code, error,
request method.
Ex
192.16.10.1,-,10/4/01 14:02:10, svec, 170.42.14.2, 1604, 140, 4240, 200, 0, GET, /Mypicture.gif.
Sun One Web Server
They are similar in functionality with the above mentioned log formats. But provides more security in 2 ways.
Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 129
-By using Secure Socket Layer (SSL) between client and server.
-Administrator can provide access controls or permissions to files and directories.
Request & Response by the Server
All the log formats specified above has fields that record http request in the form of elapse time and response in
the form of status code.
To Handle Request
Authorization
Checks user ID and password
URI Translation
Translates the Uniform Resource Identifier to local system path.
Checking
Checks the correctness of file path with user privileges.
MIME type checking:
Checks the Multi-Purpose Internet mail Encoding of the requested resource.
Input
Prepares the system for reading input.
Output
Prepares the output for client.
Service
Generates the response to client
Log Entry
Record the activity into the Log.
Error
This field is used only if any of the above mentioned field fails from its normal execution. They are of 2 types.
They are as follows.
Connection Errors
They happen when a connection established for communication with the web server drops.
They are classified as follows:
Void URL
This simply means that the format of the Uniform Resource Locater is invalid.
130 Om Kumar C.U. & P. Bhargavi
Host Not Found
This error occurs when the Server could not be found with its host/domain name.
Time Out
When a connection could not be established with in a predetermined time this error occurs. The default time out is
set to 90 seconds.
Connection Refused
This error occurs when an identified host refuses connection through its default port.
No response from Web Server
When an identified Web server fails to respond with a time period this error has said to be occurred.
Unexpected Error
These are errors that does not report itself in an anticipated manner; it cannot be classified into one of the
predetermined categories.
To Handle Response
If a connection is established successfully with a Web Server then the Server responds with one of the following
status codes.
Web Server Status Code & Messages
The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request [6].
Table 3-clearly explains the message status.
Table 3: Status Codes
Status Code Message
1xx Informational
100 Continue
101 Switching protocols
2xx Success
201 Created
201 Accepted
203 Non authoritative Information
204 No content
205 Reset content
206 Partial content
3xx Redirection
301 Moved permanently
302 Moved temporarily
303 See other
304 Not modified
305 Use proxy
4xx Client error
401 Bad request
402 Unauthorized
403 Payment required
404 Forbidden
405 Method not allowed
Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 131
Table 3:Contd.,
406 Not acceptable
407 Proxy Authentication required
408 Request timeout
409 Conflict
410 Gone
411 Length Required
412 Precondition Failed
413 Required entity too long
414 Required URI too long
415 Unsupported media type
5xx Server error
501 Not implemented
502 Bad gateway
503 Service unavailable
504 Gateway timeout
505 http version timeout
EXPERIMENTAL RESULTS
A company’s log file in Figure 3 is analyzed using WEKA. We apply classification and clustering [9][15]
technique in it.
Figure 3: Log File
(Username, ReqType and UserAgent of the Log File are Not Considered here for Mining)
By applying If Then classification [10] rules we obtain
Rule
IF
freemem <= 193.5
freemem > 117.5
THEN
usr = 93
+ 0.0001 * outtime
132 Om Kumar C.U. & P. Bhargavi
- 0.0016 * intime
- 0.0019 * bytesize
+ 0.0022 * req
- 1.9716 * exec
Rule: 2
IF
freemem > 304.5
fork <= 1.095
bytesize > 1626.5
THEN
usr = 88
+ 0.152 * intime
- 0.0009 * outime
- 3.0119 *req
- 0.0011 * exec
+ 0.31 * freeswap
Similarly around 17 rules can be mined. Applying kmeans we obtained prior probabilities of clusters.
Mean Distribution
Attribute: UserId
Normal Distribution. Mean = 25.1373 StdDev = 60.3306
Attribute:In Time
Normal Distribution. Mean = 16.3085 StdDev = 33.4104
Attribute: Out Time
Normal Distribution. Mean = 2979.7651 StdDev = 1538.001
Attribute: Byte Size
Normal Distribution. Mean = 271.0738 StdDev = 215.9147
Attribute: Request
Normal Distribution. Mean = 2.3778 StdDev = 2.7902
Attribute: Exec
Normal Distribution. Mean = 3.7507 StdDev = 6.2205
Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 133
Figure 4: Kmeans Cluster-Distribution
Cluster: 0 Prior probability: 0.6381
Cluster: 1 Prior probability: 0.3619
From the above graph Cluster 0 contains maximum number of users.
Farthest First
Farthest first is a variant of K Means that places each cluster centre in turn at the point furthermost from the
existing cluster centre. This point must lie within the data area. This greatly speeds up the clustering in most of the cases
since less reassignment and adjustment is needed.
Cluster Centroids
Cluster 0
1098.0 rajesh M Other password 10/9/2010 10/9/2010 2
Cluster 1
123.0 Vaishnavee F Mobile Bill 9.940870123E9 9/1/2009 9/1/2009 227
Time taken to build model (full training data) : 0.06 seconds
Clustered Instances
0 740 (95%)
1 42 (5%)
Figure 5: Centroids of Cluster
134 Om Kumar C.U. & P. Bhargavi
CONCLUSIONS AND FUTURE WORK
There is a growing trend among companies, organizations and individuals alike to gather information from log
files to gather information regarding user but it is a challenging task for them to fulfill the user needs .Web mining has
valuable uses to marketing of business and a direct impact to the success of their promotional strategies and internet traffic.
This information is gathered on a daily basis and continues to be analyzed consistently [15]. Analysis of this pertinent
information will help companies to develop promotions that are more effective, internet accessibility, inter-company
communication and structure, and productive marketing skills through web usage mining This paper gives a detailed look
about servers, data mining, web mining, web server log file and its format. Further we extracted patterns of the user using
clustering, decision trees and If-Then rules.
The extended work to this research work is to mine the log file based user clicks .This need to have a deep insight
in to log files that stores clicks of the user.
REFERENCES
1. V.V.R.MaheswaraRao , Dr.V.Valli Kumari ―An Enhanced Pre-Processing Research Framework For Web Log
Data Using a Learning Algorithm‖, Journal Of Computer science & information technology‖ ,pp.01-15, 2011.
2. Kobra Etminani ,Mohammad-R. Akbarzadeh-T.,Noorali Raeeji Yanehsari,‖ Web Usage Mining: users
navigational patterns extraction from web logs using Ant-based Clustering Method‖, International Fuzzy System
Association &European Society Of fuzzy Logic Technology(IFSA-EUSFLAT )‖ ,pp.396-401, 2009.
3. Mohd Helmy Abd Wahab, Mohd Norzali Haji Mohd, Hafizul Fahri Hanafi, Mohamad Farhan Mohamad Mohsin,‖
Data Pre-processing on Web Server Logs for Generalized Association Rules Mining Algorithm‖,Proc.Of World
Academy Of Science, Engineering and Technology, Vol 36 pp.970-977, Dec 2008.
4. Ratnesh Kumar Jain1 , Dr. R. S. Kasana, Dr. Suresh Jain,‖ Efficient Web Log Mining using Doubly Linked Tree‖,
International Journal of Computer Science and Information Security ‖, Vol. 3, No. 1, pp.402-407, 2009.
5. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava,‖ Data Preparation for Mining World Wide Web
Browsing Patterns‖.
6. K. R. Suneetha, Dr. R. Krishnamoorthi,‖ Identifying User Behavior by Analyzing Web Server Access Log File‖,
International Journal of Computer Science and Network Security, VOL.9 No.4, pp.327-332, April 2009.
7. S. K. Pani , L.Panigrahy, V.H.Sankar, Bikram Keshari Ratha, A.K.Mandal, S.K.Padhi,‖ Web Usage Mining: A
Survey on Pattern Extraction from Web Logs‖, International Journal of Instrumentation, Control & Automation ,
Volume 1, Issue 1,pp.15-23, 2011.
8. Arvind Kumar Sharma,Dr. P.C. Gupta,‖ Exploration of Efficient Methodologies for the Improvement In Web
Mining Techniques: A Survey‖, International Journal of Research in IT & Management Vol 1, Issue 3, pp.85-95 ,
July 2011.
9. Stavros Valsamidis, Sotirios Kontogiannis, Ioannis Kazanidis, Theodosios Theodosiouand Alexandros Karakos,‖
A Clustering Methodology of Web Log Data for Learning Management Systems‖, pp. 154–167, 2012.
Analysis of Web Server Log by Web Usage Mining for Extracting Users Patterns 135
10. M. Malarvizhi, S. A. Sahaaya Arul Mary,‖ Preprocessing of Educational Institution Web Log Data for Finding
Frequent Patterns using Weighted Association Rule Mining Technique‖, European Journal of Scientific Research
Vol.74 No.4, pp. 617-633,2012.
11. Sawan Bhawsar, Kshitij Pathak, Sourabh Mariya, Sunil Parihar,‖ Extraction of Business Rules from Web logs to
Improve Web Usage Mining‖, Vol 2, Issue 8, Aug,pp.333-340, 2012.
12. Vijay K. Gurbani,Eric Burger, Carol Davids, Tricha Anjal,‖ SIP CLF: A Common Log Format (clf) For The
Session Initiation Protocol‖,pp.1-8, 2010. [13]
13. Thanakorn Pamutha, Siriporn Chimphlee, Chom Kimpan, and Parinya Sanguansat,‖ Data Preprocessing on Web
Server Log Files for Mining Users Access Patterns‖, International Journal of Research and Reviews in Wireless
Communications (IJRRWC), Vol. 2, No. 2, June ,pp.92-98, 2012,.
14. Navin Kumar Tyagi, A. K. Solanki and Manoj Wadhwa,‖ Analysis of Server Log by Web Usage Mining for
Website Improvement‖, International Journal of Computer Science Issues(IJCSI), Vol. 7, Issue 4, No 8, July
pp.17-20, 2010.
15. Ian H.Witten, and Eibe Frank,‖ Data Mining: Practical Machine Learning Toolsand Techniques with Java
Implementations‖Morgan Kaufman Publishers, 1999.