Design and Implementation of a Web Log Preprocessing System Supporting Path Completion
Batchimeg AI lab. 2005.04.19
Outline
- Introduction
- Background
- Related work
- Proposed System
- Experiment and Result
- Conclusion and Future work
Introduction
Web Log Mining Process
- A visitor uses the Web site: viewing news, downloading, shopping, auctions.
- Logged data: IP, OS, agent, time, URL, referring page, date, cookie, method, status, user ID, bytes, …
- Storage: saved Web log data on the Web server; DB.
- Mining steps: Preprocessing → Pattern Discovery → Pattern Analysis
- Data analysis: visualization tools, knowledge query, intelligent agents.
My research area: Web log preprocessing
Background (1/4) - Log format
- Client IP: 210.126.19.93
- Date: 23/Jan/2005
- Access time: 13:37:12
- Method: GET (request a page); other methods include POST (send data to the server) and HEAD
- Protocol: HTTP/1.1
- Status code: 200 (success); other codes such as 401, 301, 500 signal errors or redirects
- File size: 2705
- Agent type: Mozilla/4.0
- Operating system: Windows NT
Example record:
210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] "GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" 200 2705 "http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)" …
Navigation recorded by this entry:
http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225 → http://www.olloo.mn/modules.php?name=News&file=friend&op=FriendSend&sid=8225
A visitor (210.126.19.93) viewed a news article and then sent it to a friend.
The log has 285014 lines (records).
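The fields listed above can be pulled out of a record with a regular expression. The following is a minimal sketch, assuming the log uses the common "combined" layout seen in the example record; the names `LOG_PATTERN` and `parse_log_line` are illustrative and not part of the presented system.

```python
import re
from datetime import datetime

# Layout assumed: host ident user [time] "request" status bytes "referrer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return the fields of one log record as a dict, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the textual fields that are easier to use as typed values.
    record["time"] = datetime.strptime(record["time"], "%d/%b/%Y:%H:%M:%S %z")
    record["status"] = int(record["status"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record

# The example record from the slide.
sample = (
    '210.126.19.93 - - [23/Jan/2005:13:37:12 -0800] '
    '"GET /modules.php?name=News&file=friend&op=FriendSend&sid=8225 HTTP/1.1" '
    '200 2705 '
    '"http://www.olloo.mn/modules.php?name=News&file=article&catid=25&sid=8225" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)
record = parse_log_line(sample)
```

The referrer field is what later links one page view to the previous one, so it is kept as a separate named group.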
Background (2/4) - User identification, Session identification
Preprocessing steps: Log Cleaning → User Identification → Session Identification → Path Completion → Formatting
- User identification identifies each user accessing the Web site. Users are distinguished by IP + browser (alternatively UserID + IP + OS, or a cookie).
- Session identification finds each user's access patterns and frequently accessed paths.
[Figure: user identification example. Log entries are grouped by IP and browser: 202.131.3.100 with Mozilla/5.0 (Windows NT), 202.131.3.100 with Mozilla/4.0 (Win2000), and 210.126.19.93 with Mozilla/4.0 (Windows NT). Visited pages shown: A,B,C,D,F,A,L; A,B,G,L; N,O.]
[Figure: session identification example. The visited pages are split into the sessions A,B,C,D,F; A,L; A,B,G,L; N,O.]
Background (3/4) - Server Log and Caching
- If a client had to request every Web page from the server, browsing would be slower.
- The solution to this problem is caching.
- Clients and proxy servers save local copies of pages, which are reused for "back" and "forward" navigation.
[Figure: Missed page views at the server. Client, cache, and server exchange "Request" and "Send" messages for pages P3 to P6; a repeated request for P3 is answered from the cache and is never logged by the server.]
Background (4/4) - Path completion
Not all requested pages are recorded in the Web log, because cached pages are served without reaching the server.
[Figure: topological structure of an example site with pages A.html to Q.html.]
Path completion example:
Before | After
A,B,C,D,F | A,B,C,D,C,B,F
A,L | A,L
A,B,G,I | A,B,A,G,I
N,O | N,O
Related work
Related works | Using topological structure | Removing images | Removing robot text | User/session identification | Path completion
R. Cooley [12] | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | O | X | IP, Agent | X
Pitkow [7] | O | O | X | Session ID | O
Shahabi [2] | X | O | X | Session ID | O
Chen, Park [3] | X | O | X | Login, IP | X
O: used, X: not used
Proposed System (1/7) - Preprocessing
Preprocessing steps:
- Data cleaning (eliminate irrelevant information)
- Web site's topological structure (find the hyperlink relations between Web pages), constructed from the Web log data on the server
- User identification and session identification (identify each user, find each user's access pattern)
- Path completion
- User grouping, performed after session identification and path completion
- Result
Why preprocessing?
- Preprocessing can take up to 60-80% of the time spent analyzing the data.
- Incomplete preprocessing can easily produce invalid patterns and wrong conclusions.
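The data cleaning step can be illustrated with a small filter. This is a sketch under assumptions: the extension list, the treatment of "robot text" as requests for robots.txt, and the names `IMAGE_EXTENSIONS`, `is_relevant` and `clean_urls` are illustrative choices, not the rules used by the presented system. Apart from the first URL, the sample URLs are made up.

```python
from urllib.parse import urlsplit

# Assumed list of image suffixes to drop; the actual list is configurable in practice.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png", ".bmp", ".ico")

def is_relevant(url):
    """Keep page requests; drop image files and robots.txt requests."""
    path = urlsplit(url).path.lower()  # ignore the query string and letter case
    if path.endswith(IMAGE_EXTENSIONS):
        return False
    if path.endswith("/robots.txt"):
        return False
    return True

def clean_urls(urls):
    """Return only the requests worth analyzing, in their original order."""
    return [url for url in urls if is_relevant(url)]

requests = [
    "/modules.php?name=News&file=article&catid=25&sid=8225",
    "/images/logo.gif",        # hypothetical image request
    "/robots.txt",             # hypothetical robot request
    "/themes/banner.JPG?v=2",  # hypothetical image request with a query string
]
cleaned = clean_urls(requests)
```

Filtering on the path rather than the full URL keeps a query string from hiding an image suffix.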
Proposed System (2/7)
- Build the site's topological structure. It supports data preprocessing and analysis:
  - user identification
  - path completion
- Goal of the proposed system: discover similar user groups, relevant page groups, and frequently accessed paths.
Proposed System (3/7) - Algorithm of Topological Structure
[Flow chart "Make Topological Structure": Begin → Not end of log file? → Find "http" data → Enter URL to URL_Queue → URL_Queue not empty? → Get head, define depth → Add link to Topo_Str_DB → Is there another record? → End]
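One way to read the flow chart is as a breadth-first walk over referrer and page pairs taken from the log. The sketch below assumes that reading; `build_topology`, the in-memory dictionaries standing in for Topo_Str_DB, and the single-letter page names are illustrative assumptions.

```python
from collections import deque

def build_topology(link_pairs, root):
    """Build hyperlink relations and page depths from (referrer, page) pairs."""
    links = {}
    for referrer, page in link_pairs:
        # A record without a referrer contributes no hyperlink.
        if referrer and referrer != page:
            links.setdefault(referrer, set()).add(page)

    # Breadth-first traversal from the root page assigns each page its depth.
    depth = {root: 0}
    queue = deque([root])
    while queue:
        head = queue.popleft()
        for child in sorted(links.get(head, ())):
            if child not in depth:
                depth[child] = depth[head] + 1
                queue.append(child)
    return links, depth

# Hypothetical (referrer, page) pairs for a small site rooted at page A.
pairs = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("B", "F"),
    ("A", "G"), ("G", "I"), ("A", "L"), ("", "N"),
]
links, depth = build_topology(pairs, "A")
```

Pages that are never reached from the root, such as N here, get no depth, which makes gaps in the reconstructed site map easy to spot.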
Proposed System (4/7) - Make the topological structure
Topological structure
- Input: URL path and link
- Output: complete sitemap (tree)
Queue entries hold link, path, depth, and referrer. Example (depth, page):
0. Index.html (A)
  1. L.html (referrer)
    2. Sport/Team/football.html
    2. Sport/News/Mongolia.html
  1. Sport.html
    2. Sport/Team/
      3. Sport/Team/football.html
    2. Sport/Advice/
    ...
URLs and paths shown:
olloo.mn/L.html
olloo.mn/L.html, Sport/Team/football.html
olloo.mn/L.html, Sport/News/Mongolia.html
olloo.mn/Sport.html
olloo.mn/Sport.html, /Team/football.html
olloo.mn/Sport.html, /Advice/
[Figure: resulting site tree with depth levels 0 to 3: Index.html (A); Sport.html, L.html; Sport/Team/, Sport/Advice, Sport/News/Mongolia.html; Sport/Team/football.html.]
Proposed System (5/7) - User Identification
[Flow chart of the user identification algorithm: Begin → Not end of log DB? → IP not in IPSet? → Current IP's agent and OS the same? → Save the IP, agent and OS → Assign to the User Set, increase the user counter → Are there other records? → End]
.. for similar user group
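The user identification flow chart can be sketched as a lookup keyed on IP, agent and OS. This is a simplified illustration, assuming each record already carries those three fields; `identify_users` and the record layout are hypothetical names.

```python
def identify_users(records):
    """Give every distinct (IP, agent, OS) combination its own user number."""
    user_ids = {}      # (ip, agent, os) -> user number; plays the role of the User Set
    assignments = []   # user number of each record, in input order
    for rec in records:
        key = (rec["ip"], rec["agent"], rec["os"])
        if key not in user_ids:
            # New combination: save it and increase the user counter.
            user_ids[key] = len(user_ids) + 1
        assignments.append(user_ids[key])
    return user_ids, assignments

# IP and browser values taken from the Background (2/4) example.
log_records = [
    {"ip": "202.131.3.100", "agent": "Mozilla/4.0", "os": "Win2000"},
    {"ip": "202.131.3.100", "agent": "Mozilla/5.0", "os": "Windows NT"},
    {"ip": "202.131.3.100", "agent": "Mozilla/4.0", "os": "Win2000"},
    {"ip": "210.126.19.93", "agent": "Mozilla/4.0", "os": "Windows NT"},
]
users, assignments = identify_users(log_records)
```

Two visitors behind the same IP are still separated here as long as their agent or OS differs, which is the case the IP-only approach misses.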
Proposed System (6/7) - Session identification
[Flow chart of the session identification algorithm: Begin → Not end of log DB? → IP not in User Set? → Refer page empty? → Time taken > 25.5? → Start new session / Append the page to the session → Go to path completion → Are there other records? → End]
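The session identification flow chart can be sketched as follows. Several points are assumptions: the slide does not give the unit of the 25.5 threshold (minutes are assumed here), an empty referrer is taken to be "" or "-", and `identify_sessions` and the record fields are illustrative names.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=25.5)  # unit assumed; the slide only says 25.5

def identify_sessions(records, timeout=SESSION_TIMEOUT):
    """Split each user's page views into sessions, returned as (user, pages) pairs."""
    sessions = []
    open_session = {}  # user -> index of that user's current session
    last_seen = {}     # user -> time of that user's previous request
    for rec in sorted(records, key=lambda r: r["time"]):
        user = rec["user"]
        starts_new = (
            user not in open_session                   # first request of this user
            or rec["referrer"] in ("", "-")            # empty referring page
            or rec["time"] - last_seen[user] > timeout # too long since last request
        )
        if starts_new:
            sessions.append((user, []))
            open_session[user] = len(sessions) - 1
        sessions[open_session[user]][1].append(rec["page"])
        last_seen[user] = rec["time"]
    return sessions

def _rec(user, minute, page, referrer):
    """Build a sample record `minute` minutes after a fixed start time."""
    start = datetime(2005, 1, 23, 13, 0, 0)
    return {"user": user, "time": start + timedelta(minutes=minute),
            "page": page, "referrer": referrer}

sample_records = [
    _rec("u1", 0, "A", "-"), _rec("u1", 1, "B", "A"), _rec("u1", 2, "C", "B"),
    _rec("u1", 3, "A", "-"), _rec("u1", 4, "L", "A"),
    _rec("u2", 5, "N", "-"), _rec("u2", 40, "O", "N"),
]
sessions = identify_sessions(sample_records)
```

In the sample, user u1 starts a second session on the request with an empty referrer, and user u2 starts one because 35 minutes pass between requests.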
Proposed System (7/7) - Path completion
[Flow chart of the path completion algorithm: Begin → Not end of session set? → Does a page in the session link to the next page in that session? → If yes, check the next page; if no, search for that page in the site map and complete the path → End]
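The path completion step can be sketched as backtracking through the pages already visited until one is found that links to the next requested page. The hyperlink relations below are an assumed site map chosen to match the before and after example from the Background slides; `complete_path` is an illustrative name, not the system's own routine.

```python
def complete_path(session, links):
    """Insert the cached 'back' page views that the server never logged."""
    if not session:
        return []
    completed = [session[0]]
    history = [session[0]]  # pages the user can return to with the Back button
    for page in session[1:]:
        # Walk back through the history to the nearest page that links to `page`.
        idx = len(history) - 1
        while idx >= 0 and page not in links.get(history[idx], ()):
            idx -= 1
        if idx >= 0:
            # Every page stepped back through was viewed again from the cache.
            for j in range(len(history) - 2, idx - 1, -1):
                completed.append(history[j])
            del history[idx + 1:]
        completed.append(page)
        history.append(page)
    return completed

# Assumed hyperlink relations (site map) for the example pages.
site_links = {
    "A": {"B", "G", "L"},
    "B": {"C", "F"},
    "C": {"D"},
    "G": {"I"},
}
completed_1 = complete_path(["A", "B", "C", "D", "F"], site_links)
completed_2 = complete_path(["A", "B", "G", "I"], site_links)
completed_3 = complete_path(["N", "O"], site_links)
```

When no earlier page links to the next one, as with N and O, the session is left unchanged rather than guessed at.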
Experiment (1/4)
[Figure: raw log data and the URLs in the Web server log of www.olloo.mn]
Experiment (2/4)
[Figure: topological structure]
Experiment (3/4) - Data cleaning
[Figure: cleaning result. Chart of log size in K (axis 0 to 60000) before and after cleaning, for the logs dated 2005.01.03, 2005.01.10, 2005.01.17, 2005.01.31, 2005.02.19, 2005.02.26, 2005.03.14, 2003.03.31, 2003.04.05.]
Experiment (4/4)
Result
This result can help discover similar user groups, relevant page groups, and frequently accessed paths in Web usage mining (WUM).
[Figures: user group; path completion]
Interface of the Path Completion Preprocessing System (PCPS)
Screenshots, in order:
- Start the new project.
- Give the project name and folder.
- Add the log file to the project.
- Choose the log file to add.
- Choose whether to remove the image files: select the files to analyze and the files to clean.
- Cleaned log and information: the pages and files selected for analysis.
- Topological structure
- Browser
- System
Comparing other preprocessing approaches to the Proposed System
Related works | Creation of topological structure | Using topological structure | Removing images | Removing robot text | User/session identification | Path completion
R. Cooley [12] | X | O | O | O | Login, IP, Agent | O
1996 Olympics site [8] | X | X | O | X | Cookie | X
Yan, Jacobsen [5] | X | X | O | X | IP, Agent | X
Pitkow [7] | X | O | O | X | Session ID | O
Shahabi [2] | X | X | O | X | Session ID | O
Chen, Park [3] | X | X | O | X | Login, IP | X
Proposed System | O | O | O | O | IP, Agent, Grouping | O
O: used, X: not used
Conclusion
Approach | Identified number of accesses | Identified number of users | Identified number of sessions
Without path completion | 18019 | 2812 | 10407
Proposed System | 18019 | 3061 | 11019
- My work focuses on the preprocessing stage of Web log mining and improves pattern discovery. Without the proposed system, 3061 - 2812 = 249 users are neglected.
- This paper presented a new approach and practicable algorithms.
- This approach can achieve better precision than some existing approaches.
Reference
[1] R. Cooley, B. Mobasher, and J. Srivastava. Web mining: Information and Pattern Discovery on the World Wide Web. Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN 55455, USA, 1998.
[2] C. Shahabi and F.B. Kashani. A Framework for Efficient and Anonymous Web Usage Mining Based on Client-Side Tracking. 2001.
[3] M.S. Chen, J.S. Park, P.S. Yu. Data mining for path traversal patterns in a Web environment. 1996.
[4] H. Mannila, H. Toivonen. Discovering generalized episodes using minimal occurrence. 1996.
[5] T. Yan, M. Jacobsen, H. Garcia-Molina, U. Dayal. From user access patterns to dynamic hypertext linking. 1996.
[6] J. Pitkow. In search of reliable usage data on the WWW. 1997.
[7] J. Pitkow, P. Pirolli and R. Rao. Silk. Extracting usable structures from the Web. 1996.
[8] S. Elo-Dean and M. Viveros. Data mining the IBM official 1996 Olympics Web site.
[9] Open Market Inc. Open Market Web reporter. http://www.openmarket.com, 1996.
[10] net.Genesis. net.analysis desktop. http://www.netgen.com, 1996.
[11] Doru Tanasa, Brigitte Trousse. Advanced data preprocessing for intersites Web Usage Mining. 2004.
[12] R. Cooley. Web Usage Mining: Discovery and Application of Interesting Patterns from Web Data. PhD thesis, Dept. of Computer Science, Univ. of Minnesota, 2000.