Web Usage Mining and Using Ontology for Capturing Web Usage Semantic

Post on 05-Dec-2014

DESCRIPTION

Professor Ismail Toroslu gave a lecture on "Web Usage Mining and Using Ontology for Capturing Web Usage Semantic" in the Distinguished Lecturer Series - Leon The Mathematician. More information is available at: http://dls.csd.auth.gr

TRANSCRIPT

<ul><li> 1. İsmail Hakkı Toroslu, Middle East Technical University, Department of Computer Engineering, Ankara, Turkey. Web Usage Mining and Using Ontology for Capturing Web Usage Semantic </li>
<li> 2. 08/28/11 PART I: A New Approach for Reactive Web Usage Data Processing </li>
<li> 3. OUTLINE <ul><li>Web Mining </li></ul><ul><li>Previous Session Reconstruction Heuristics </li></ul><ul><li>Smart-SRA </li></ul><ul><li>Agent Simulator </li></ul><ul><li>Experimental Results </li></ul><ul><li>Conclusion </li></ul></li>
<li> 4. Web Mining <ul><li>Data mining: discovering and retrieving useful and interesting patterns from a large dataset. </li></ul><ul><li>Web mining: the dataset is the huge body of web data. </li></ul><ul><li>Dimensions: <ul><li>Web content mining </li><li>Web structure mining </li><li>Web usage mining </li></ul></li></ul></li>
<li> 5. Web Usage Mining (WUM): the application of data mining techniques to web log data in order to discover user access patterns. Example user web access log: <table><tr><th>IP Address</th><th>Request Time</th><th>Method</th><th>URL</th><th>Protocol</th><th>Success of Return Code</th><th>Number of Bytes Transmitted</th></tr><tr><td>144.123.121.23</td><td>[25/Apr/2005:03:04:41 -05]</td><td>GET</td><td>A.html</td><td>HTTP/1.0</td><td>200</td><td>3290</td></tr><tr><td>144.123.121.23</td><td>[25/Apr/2005:03:04:43 -05]</td><td>GET</td><td>B.html</td><td>HTTP/1.0</td><td>200</td><td>2050</td></tr><tr><td>144.123.121.23</td><td>[25/Apr/2005:03:04:48 -05]</td><td>GET</td><td>C.html</td><td>HTTP/1.0</td><td>200</td><td>4130</td></tr></table></li>
<li> 6. Phases of Web Usage Mining (diagram): raw server log → pre-processing (session reconstruction heuristics) → user session file → pattern discovery (Apriori, GSP, SPADE) → rules and patterns → pattern analysis → interesting knowledge → applications </li>
<li> 7. 
Session Reconstruction (section: Previous Session Reconstruction Heuristics) <ul><li>Sessions are reconstructed by heuristics that select and group the requests belonging to the same user session. </li></ul><ul><li>Types: <ul><li>Reactive: requests are processed after the web server has handled them. </li><li>Proactive: processing occurs while the user is interactively browsing the web site. </li></ul></li></ul></li>
<li> 8. Previous session reconstruction heuristics: <ul><li>Time-oriented heuristics </li></ul><ul><li>Navigation-oriented heuristic </li></ul>New reactive session reconstruction technique: Smart-SRA. It combines these heuristics with "site topology" information in order to increase the accuracy of the reconstructed sessions. </li>
<li> 9. Example web topology graph (figure not reproduced in the transcript). Example web page request sequence: <table><tr><th>Page</th><td>P1</td><td>P20</td><td>P13</td><td>P49</td><td>P34</td><td>P23</td></tr><tr><th>Timestamp (min)</th><td>0</td><td>6</td><td>15</td><td>29</td><td>32</td><td>47</td></tr></table></li>
<li> 10. Time-Oriented Heuristics 1 <ul><li>Total session time: the duration of a discovered session is limited by a threshold. <ul><li>Discovered sessions for the request sequence of slide 9 (threshold 30 min): [P1, P20, P13, P49] and [P34, P23] </li></ul></li></ul></li>
<li> 11. Time-Oriented Heuristics 2 <ul><li>Page-stay time: the time spent on any page is limited by a threshold. <ul><li>Discovered sessions for the request sequence of slide 9 (threshold 10 min): [P1, P20, P13], [P49, P34] and [P23] </li></ul></li></ul></li>
<li> 12. 
Navigation-Oriented Heuristic <ul><li>Adding page WP<sub>N+1</sub> to a session [WP<sub>1</sub>, WP<sub>2</sub>, ..., WP<sub>N</sub>]: <ul><li>If WP<sub>N</sub> has a hyperlink to WP<sub>N+1</sub>, the session becomes [WP<sub>1</sub>, WP<sub>2</sub>, ..., WP<sub>N</sub>, WP<sub>N+1</sub>]. </li><li>If WP<sub>N</sub> does not have a hyperlink to WP<sub>N+1</sub> and WP<sub>Kmax</sub> is the nearest page having a hyperlink to WP<sub>N+1</sub>, backward browser moves are added: [WP<sub>1</sub>, WP<sub>2</sub>, ..., WP<sub>N</sub>, WP<sub>N-1</sub>, WP<sub>N-2</sub>, ..., WP<sub>Kmax</sub>, WP<sub>N+1</sub>]. </li></ul></li></ul></li>
<li> 13. Navigation-Oriented Heuristic, example on the page sequence P1, P20, P13, P49, P34, P23: <table><tr><th>Current Session</th><th>Condition</th><th>New Page</th></tr><tr><td>[ ]</td><td></td><td>P1</td></tr><tr><td>[P1]</td><td>Link[P1, P20] = 1</td><td>P20</td></tr><tr><td>[P1, P20]</td><td>Link[P20, P13] = 0, Link[P1, P13] = 1</td><td>P13</td></tr><tr><td>[P1, P20, P1, P13]</td><td>Link[P13, P49] = 1</td><td>P49</td></tr><tr><td>[P1, P20, P1, P13, P49]</td><td>Link[P49, P34] = 0, Link[P13, P34] = 1</td><td>P34</td></tr><tr><td>[P1, P20, P1, P13, P49, P13, P34]</td><td>Link[P34, P23] = 1</td><td>P23</td></tr></table>Resulting session: [P1, P20, P1, P13, P49, P13, P34, P23] </li>
<li> 14. Smart-SRA contains two phases: <ul><li>Phase 1: Shorter request sequences (candidate sessions) are constructed by using the overall session duration time and page-stay time criteria. <ul><li>Each candidate session satisfies the overall session duration time limit. </li></ul></li></ul><ul><li>Phase 2: Candidate sessions are partitioned into maximal sub-sessions such that: <ul><li>between each consecutive page pair in a session there is a hyperlink from the previous page to the next page; </li><li>the page-stay time criterion is also satisfied. </li></ul></li></ul><ul><li>Phase 2 adds the referrer constraints of the topology rule while eliminating the need for inserting backward browser movements. </li></ul></li>
<li> 15. <ul><li><ul><li>1. 
Determine the web pages in the candidate session that have no referrer (to their left) and remove them from the candidate session </li></ul></li></ul><ul><li><ul><li>2. For each of these pages, and for each previously constructed session: <ul><li>if there is a hyperlink from the last page of the session to the web page, and the page-stay time constraint is satisfied, append the web page to the session </li></ul></li></ul></li></ul><ul><li><ul><li>3. Remove non-maximal sessions </li></ul></li></ul>Smart-SRA, steps of Phase 2: a candidate session is processed from left to right by repeating the steps above until the candidate session is empty. </li>
<li> 16. Smart-SRA: example web topology (figure not reproduced in the transcript) and example candidate session: <table><tr><th>Page</th><td>P1</td><td>P20</td><td>P13</td><td>P49</td><td>P34</td><td>P23</td></tr><tr><th>Timestamp (min)</th><td>0</td><td>6</td><td>9</td><td>12</td><td>14</td><td>15</td></tr></table></li>
<li> 17. Smart-SRA, Phase 2 on the example candidate session: <table><tr><th>Iteration</th><th>Candidate Session</th><th>New Session Set (before)</th><th>Temp Page Set</th><th>Temp Session Set</th><th>New Session Set (after)</th></tr><tr><td>1</td><td>[P1, P20, P13, P49, P34, P23]</td><td></td><td>{P1}</td><td>[P1]</td><td>[P1]</td></tr><tr><td>2</td><td>[P20, P13, P49, P34, P23]</td><td>[P1]</td><td>{P20, P13}</td><td>[P1, P20], [P1, P13]</td><td>[P1, P20], [P1, P13]</td></tr><tr><td>3</td><td>[P49, P34, P23]</td><td>[P1, P20], [P1, P13]</td><td>{P49, P34}</td><td>[P1, P13, P34], [P1, P13, P49]</td><td>[P1, P13, P34], [P1, P13, P49], [P1, P20]</td></tr><tr><td>4</td><td>[P23]</td><td>[P1, P13, P34], [P1, P13, P49], [P1, P20]</td><td>{P23}</td><td>[P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]</td><td>[P1, P13, P34, P23], [P1, P13, P49, P23], [P1, P20, P23]</td></tr></table></li>
<li> 18. 
Agent Simulator <ul><li>Models the behavior of web users and generates both the users' navigation and the log data kept by the web server. </li></ul><ul><li>Used to compare the performance of alternative session reconstruction heuristics. </li></ul><ul><li>Uses four primitive behaviors to simulate the complex navigation of a web user. </li></ul></li>
<li> 19. Agent Simulator, User Behavior I: a web user can start a new session with any one of the possible entry pages of the web site. </li>
<li> 20. Agent Simulator, User Behavior II: a web user can select a new page that has a link from the most recently accessed page. (Figure: moves 1-2 on the example topology of pages P1, P13, P20, P23, P34, P49.) </li>
<li> 21. Agent Simulator, User Behavior III: a web user can select as the next page one that has a link from any of the previously browsed pages. (Figure: moves 1-5 on the example topology.) </li>
<li> 22. Agent Simulator, User Behavior IV: a web user can terminate the session. (Figure: moves 1-6 on the example topology.) </li>
<li> 23. Agent Simulator: parameters for simulating the behavior of a web user <ul><li>Session Termination Probability (STP) </li></ul><ul><li>Link from Previous pages Probability (LPP) </li></ul><ul><li>New Initial page Probability (NIP) </li></ul></li>
<li> 24. Experimental Results: heuristics tested <ul><li>Time-oriented heuristic (heur1): total session time &lt; 30 min </li></ul><ul><li>Time-oriented heuristic (heur2): page-stay time &lt; 10 min </li></ul><ul><li>Navigation-oriented heuristic (heur3) </li></ul><ul><li>Smart-SRA heuristic (heur4) </li></ul></li>
<li> 25. Experimental Results: accuracy <ul><li>A reconstructed session H captures a real session R if R occurs as a contiguous subsequence of H (R ⊏ H). </li></ul><ul><li>R = [P1, P3, P5] </li></ul><ul><li>H = [P9, P1, P3, P5, P8] =&gt; R ⊏ H </li></ul><ul><li>H = [P1, P9, P3, P5, P8] =&gt; R ⊄ H </li></ul></li>
<li> 26. 
Experimental Results: parameters for generating user sessions and web topology <table><tr><th>Parameter</th><th>Fixed value</th><th>Range</th></tr><tr><td>Number of web pages (nodes) in topology</td><td>300</td><td></td></tr><tr><td>Average outdegree</td><td>15</td><td></td></tr><tr><td>Average page-stay time</td><td>2.2 min</td><td></td></tr><tr><td>Deviation of page-stay time</td><td>0.5 min</td><td></td></tr><tr><td>Number of agents</td><td>10000</td><td></td></tr><tr><td>STP</td><td>5%</td><td>1%-20%</td></tr><tr><td>LPP</td><td>30%</td><td>0%-90%</td></tr><tr><td>NIP</td><td>30%</td><td>0%-90%</td></tr></table></li>
<li> 27. Experimental Results: Accuracy vs. STP (plot not reproduced in the transcript) </li>
<li> 28. Experimental Results: Accuracy vs. LPP (plot not reproduced in the transcript) </li>
<li> 29. Experimental Results: Accuracy vs. NIP (plot not reproduced in the transcript) </li>
<li> 30. Conclusion <ul><li>New session reconstruction heuristic: Smart-SRA <ul><li>Does not allow sequences with unrelated consecutive requests (no hyperlink from the previous one to the next one) </li><li>Inserts no artificial browser (back) requests to prevent unrelated consecutive requests </li><li>Produces only maximal sessions </li></ul></li></ul><ul><li>Agent simulator </li></ul><ul><li>Accuracy measure </li></ul><ul><li>Experimental results show that Smart-SRA outperforms the previous heuristics </li></ul></li>
<li> 31. PART II: Semantically Enriched Event Based Model for Web Usage Mining </li>
<li> 32. OUTLINE <ul><li>Introduction </li></ul><ul><li>Related Work </li></ul><ul><li>Semantic Event Based Sessions </li></ul><ul><li>Formal Definition of Semantic Events </li></ul><ul><li>Algorithms for Mining Semantic Event Patterns </li></ul><ul><li>Experimental Results </li></ul><ul><li>Conclusion </li></ul></li>
<li> 33. 
Introduction <ul><li>Traditional WUM is based on pageviews, but the user interaction model is changing. </li></ul><ul><li>Users do not care about pageviews; they use a web site to achieve high-level goals such as: <ul><li>Finding and viewing a video </li><li>Buying tickets </li><li>Searching for the nearest Italian restaurant </li><li>Listening to a song, etc. </li></ul></li></ul></li>
<li> 34. Introduction <ul><li>We should analyze usage data as a series of events, for example: <ul><li>Search Mediterranean restaurants </li><li>Search Italian restaurants </li><li>View the reviews for Restaurant A </li><li>View the reviews for Restaurant B </li><li>Click the web site link of Restaurant A </li></ul></li></ul><ul><li>Incorporating semantic knowledge in the process is the logical choice: <ul><li>A method should be devised to capture user behavior </li><li>Usage data should be mapped to the semantic space </li><li>An algorithm should be developed to exploit semantic relations </li></ul></li></ul></li>
<li> 35. Introduction <ul><li>In this work we propose: <ul><li>a method for tracking and logging domain-level events </li><li>a method for injecting semantics into events </li><li>a semantic ordering of events </li><li>an algorithm for computing sequences of frequent events </li></ul></li></ul><ul><li>The proposed system was tested with two web sites: <ul><li>Music streaming site </li><li>Mobile network operator's site </li></ul></li></ul></li>
<li> 36. 
Semantic Event Based Sessions <ul><li>Events are conceptual actions that the user performs to achieve a certain effect. </li></ul><ul><li>Events are used to capture business actions that are defined in the site's domain. </li></ul><ul><li>The site admin is responsible for defining and tracking events. </li></ul><ul><li>Events are tracked via a JavaScript client. </li></ul></li>
<li> 37. <ul><li>Example events: ...</li></ul></li></ul>
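The Phase 2 steps of Smart-SRA (slide 15) can be sketched in Python as below. This is a minimal illustration, not the authors' implementation. The names `LINKS`, `has_link`, `smart_sra_phase2`, `EXAMPLE_CANDIDATE` and `max_stay` are hypothetical, and the topology is an assumption inferred from the links used in the worked examples on slides 13 and 17, since the topology figure is not part of the transcript. The slides do not say what happens to a page that cannot be appended to any session; the sketch simply drops it.

```python
from typing import Dict, List, Set, Tuple

Request = Tuple[str, int]  # (page, timestamp in minutes)

# Assumed topology, inferred from the links used in the slide examples.
LINKS: Dict[str, Set[str]] = {
    "P1": {"P20", "P13"},
    "P13": {"P49", "P34"},
    "P20": {"P23"},
    "P34": {"P23"},
    "P49": {"P23"},
}

# Candidate session of slide 16.
EXAMPLE_CANDIDATE: List[Request] = [
    ("P1", 0), ("P20", 6), ("P13", 9), ("P49", 12), ("P34", 14), ("P23", 15),
]


def has_link(src: str, dst: str) -> bool:
    """True if the assumed topology has a hyperlink src -> dst."""
    return dst in LINKS.get(src, set())


def smart_sra_phase2(candidate: List[Request], max_stay: int = 10) -> List[List[str]]:
    """Partition one candidate session into maximal link-connected sessions."""
    remaining = list(candidate)
    sessions: List[List[Request]] = []
    while remaining:
        # Step 1: pages with no referrer to their left in the candidate.
        start_pages = [
            req for i, req in enumerate(remaining)
            if not any(has_link(prev[0], req[0]) for prev in remaining[:i])
        ]
        remaining = [req for req in remaining if req not in start_pages]
        if not sessions:
            # First iteration: every such page opens a session.
            sessions = [[req] for req in start_pages]
            continue
        # Step 2: append each page to every session whose last page links
        # to it, provided the page-stay limit holds.
        extended: List[List[Request]] = []
        used: Set[int] = set()
        for page, ts in start_pages:
            for idx, sess in enumerate(sessions):
                last_page, last_ts = sess[-1]
                if has_link(last_page, page) and ts - last_ts <= max_stay:
                    extended.append(sess + [(page, ts)])
                    used.add(idx)
        # Step 3: sessions that were extended are no longer maximal.
        # Assumption: a page that extends no session is dropped.
        sessions = [s for i, s in enumerate(sessions) if i not in used] + extended
    return [[page for page, _ in sess] for sess in sessions]


if __name__ == "__main__":
    for session in smart_sra_phase2(EXAMPLE_CANDIDATE):
        print(session)
```

Under the assumed topology, running the sketch on the candidate session of slide 16 yields the three sessions in the last row of the trace on slide 17 (in some order): [P1, P13, P34, P23], [P1, P13, P49, P23] and [P1, P20, P23].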