Advanced databases –
Inferring implicit/new knowledge from data(bases):
Web mining, esp. Web usage mining
Bettina Berendt
Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 29 November 2007
Semi-structured and unstructured data
Unstructured data: has "no" structure (esp. not a relational one)
Common sources of unstructured data include:
Documents: Word documents, PowerPoint presentations, newsletters, source code, hard-copy documents
Images and graphics
Semi-structured data: has "some" structure (partly structured, partly unstructured)
Common sources of semi-structured data include:
E-mails
TCP/IP packets
XML data
Images and graphics
Documents (all listed previously)
Web and text are two particularly interesting representatives
Agenda
Intro: Web Mining, specifically Web Usage Mining
Data Acquisition, Understanding, and Preparation
Forms of analysis; mining techniques
Case study 1: A multi-channel retailer (method: association-rule discovery)
Case study 2: Search in an educational portal (method: sequence mining / generalized-sequence discovery)
Case study 3: Search in a community portal
Web Mining
Knowledge discovery (aka Data mining):
"the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." [1]
Web Mining: the application of data mining techniques to the content, (hyperlink) structure, and usage of Web resources.
Web mining areas:
Web content mining
Web structure mining
Web usage mining
[1] Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (Eds.) (1996). Advances in Knowledge Discovery and Data Mining. Boston, MA: AAAI/MIT Press
Web Usage Mining: Basics and data sources
Definition of Web usage mining:
discovery of meaningful patterns from data generated by client-server transactions on one or more Web servers
Typical Sources of Data
automatically generated data stored in server access logs, referrer logs, agent logs, and client-side cookies
e-commerce and product-oriented user events (e.g., shopping cart changes, ad or product click-throughs, purchases)
user profiles and/or user ratings
meta-data, page attributes, page content, site structure
This is a slide from 2002 ...
Web usage is more than "browsing": Interactions on the Web
Social viewpoint
User – server: search engine, online store, digital library, ...
User – user: "Web 2.0" (and all its precursors)
Technical viewpoint
Access content ("read")
Create content ("write")
Navigate
Web Usage Mining
What’s in a typical Web server log …
<ip_addr> - - <date> <method> <file> <protocol> <code> <bytes> <referrer> <user_agent>
203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 - - [01/Jun/1999:03:09:23 -0600] "GET /Calls/Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.30.5.145 - - [01/Jun/1999:03:09:24 -0600] "GET /Calls/Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
203.252.234.33 - - [01/Jun/1999:03:12:31 -0600] "GET / HTTP/1.0" 200 4980 "" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/line.gif HTTP/1.0" 200 190 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/red.gif HTTP/1.0" 200 104 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:12:35 -0600] "GET /Images/earthani.gif HTTP/1.0" 200 10689 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.252.234.33 - - [01/Jun/1999:03:13:11 -0600] "GET /CP.html HTTP/1.0" 200 3218 "http://www.acr-news.org/" "Mozilla/4.06 [en] (Win95; I)"
203.30.5.145 - - [01/Jun/1999:03:13:25 -0600] "GET /Calls/AWAC.html HTTP/1.0" 200 104 "http://www.acr-news.org/Calls/OWOM.html" "Mozilla/4.5 [en] (Win98; I)"
(Requests to www.acr-news.org)
… and what does it mean?
Each line records one request: the client's IP address (or host name), two identity fields that are empty here (- -), the date and time with time-zone offset, the request line (method, file, protocol), the HTTP status code, the number of bytes sent, the referrer (the URL from which the request was made), and the user agent (browser and operating system).
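A log line in the layout shown above can be split into its fields with a regular expression. The following is a minimal sketch, not a full log parser: the group names follow the field template on the slide, and the function name `parse_log_line` is a hypothetical helper introduced here for illustration. The referrer and user-agent fields are treated as optional, since not every server logs them.

```python
import re
from datetime import datetime

# Field names follow the template on the slide; the pattern itself is an
# illustrative assumption, not an official grammar of the log format.
LOG_PATTERN = re.compile(
    r'(?P<ip_addr>\S+) \S+ \S+ \[(?P<date>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<file>\S+) (?P<protocol>[^"]+)" '
    r'(?P<code>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    # Convert the timestamp, status code, and byte count to native types.
    entry["date"] = datetime.strptime(entry["date"], "%d/%b/%Y:%H:%M:%S %z")
    entry["code"] = int(entry["code"])
    entry["bytes"] = 0 if entry["bytes"] == "-" else int(entry["bytes"])
    return entry

sample = (
    '203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] '
    '"GET /Calls/OWOM.html HTTP/1.0" 200 3942 '
    '"http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" '
    '"Mozilla/4.5 [en] (Win98; I)"'
)
entry = parse_log_line(sample)
print(entry["ip_addr"], entry["file"], entry["code"], entry["bytes"])
```

In this example the referrer shows that the visitor arrived from a Lycos search for "advertising psychology".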
Sources and destinations
Logs may extend beyond visits to the site and show where a visitor was before (referrer) ...
203.30.5.145 - - [01/Jun/1999:03:09:21 -0600] "GET /Calls/OWOM.html HTTP/1.0" 200 3942 "http://www.lycos.com/cgi-bin/pursuit?query=advertising+psychology-&maxhits=20&cat=dir" "Mozilla/4.5 [en] (Win98; I)"
... and where s/he went next (URL rewriting):
Preprocessing of Web Usage Data
[Figure: preprocessing pipeline. Inputs: Raw Usage Data, Site Structure and Content. Steps: Data Cleaning, User/Session Identification, Page View Identification, Path Completion, Episode Identification. Outputs: Usage Statistics, Server Session File, Episode File.]
Preprocessing of Web Usage Data (continued)
[Same figure, with part of the pipeline annotated "not always necessary and/or done".]
Data Preprocessing (1)
Data cleaning
remove irrelevant references and fields in server logs
remove references due to spider navigation
remove erroneous references
add missing references due to caching (done after sessionization)
Data integration
synchronize data from multiple server logs
integrate semantics, e.g., meta-data (such as content labels)
integrate e-commerce and application server data
integrate demographic / registration data
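The data-cleaning steps listed above can be sketched as a simple filter over parsed log entries. This is a minimal illustration under stated assumptions: the file suffixes, the robot markers, and the rule "status code 400 or higher counts as erroneous" are choices made here, not taken from the slides, and the two entries marked as hypothetical are invented for the demonstration.

```python
# Assumed lists; a real site needs its own, site-specific choices.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png", ".css", ".js", ".ico")
ROBOT_MARKERS = ("bot", "crawler", "spider")

def is_relevant(entry):
    """Keep only successful page requests that do not look like robot traffic."""
    path = entry["file"].split("?", 1)[0].lower()
    if path.endswith(IRRELEVANT_SUFFIXES):
        return False  # embedded objects, not pages
    agent = (entry.get("user_agent") or "").lower()
    if path == "/robots.txt" or any(marker in agent for marker in ROBOT_MARKERS):
        return False  # spider navigation
    if entry["code"] >= 400:
        return False  # erroneous references
    return True

entries = [
    {"file": "/Calls/OWOM.html", "code": 200,
     "user_agent": "Mozilla/4.5 [en] (Win98; I)"},
    {"file": "/Calls/Images/earthani.gif", "code": 200,
     "user_agent": "Mozilla/4.5 [en] (Win98; I)"},
    # Hypothetical robot request, not from the original log.
    {"file": "/CP.html", "code": 200, "user_agent": "ExampleBot/1.0"},
    # Hypothetical failed request, not from the original log.
    {"file": "/missing.html", "code": 404,
     "user_agent": "Mozilla/4.06 [en] (Win95; I)"},
]
cleaned = [e for e in entries if is_relevant(e)]
print([e["file"] for e in cleaned])
```

Robots that do not identify themselves in the user-agent field are not caught by such a filter; they are usually detected later from their session behaviour.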
Data Preprocessing (2)
Data Transformation
user identification
sessionization / episode identification
pageview identification
a pageview is a set of page files and associated objects that contribute to a single display in a Web Browser
Data Reduction
sampling and dimensionality reduction (ignoring certain pageviews / items)
Identifying User Transactions (i.e., sets or sequences of pageviews possibly with associated weights)
Why sessionize?
Quality of the patterns discovered in KDD depends on the quality of the data on which mining is applied.
In Web usage analysis, these data are the sessions of the site visitors: the activities performed by a user from the moment she enters the site until the moment she leaves it.
Difficult to obtain reliable usage data due to proxy servers and anonymizers, dynamic IP addresses, missing references due to caching, and the inability of servers to distinguish among different visits.
Cookies and embedded session IDs produce the most faithful approximation of users and their visits, but are not used in every site, and not accepted by every user.
Therefore, heuristics are needed that can sessionize the available access data.
Mechanisms for User Identification
Examples: page tags (using JavaScript), some browser plugins
Examples of "software agents"
Page tagging with JavaScript: see also http://www.bruceclay.com/analytics/disadvantages.htm
Sessionization strategies: Sessionization heuristics
These heuristics are quite accurate! (see Spiliopoulou et al., 2003)
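One common family of sessionization heuristics is time-based: a new session starts when the gap between two consecutive requests of the same visitor exceeds a threshold. The sketch below illustrates this idea only; the 10-minute threshold, the use of the IP address as visitor key, and the last request in the sample are assumptions made for this example and are not taken from the slide.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def sessionize(requests, max_gap=timedelta(minutes=10)):
    """Split each visitor's requests into sessions at gaps longer than max_gap.

    requests: iterable of (visitor_key, timestamp, url) tuples.
    Returns a list of (visitor_key, [url, ...]) sessions.
    """
    by_visitor = defaultdict(list)
    for visitor_key, timestamp, url in requests:
        by_visitor[visitor_key].append((timestamp, url))

    sessions = []
    for visitor_key, visits in by_visitor.items():
        visits.sort()
        current = [visits[0]]
        for previous, visit in zip(visits, visits[1:]):
            if visit[0] - previous[0] > max_gap:
                # Gap too long: close the running session and start a new one.
                sessions.append((visitor_key, [u for _, u in current]))
                current = []
            current.append(visit)
        sessions.append((visitor_key, [u for _, u in current]))
    return sessions

requests = [
    ("203.30.5.145", datetime(1999, 6, 1, 3, 9, 21), "/Calls/OWOM.html"),
    ("203.252.234.33", datetime(1999, 6, 1, 3, 12, 31), "/"),
    ("203.252.234.33", datetime(1999, 6, 1, 3, 13, 11), "/CP.html"),
    ("203.30.5.145", datetime(1999, 6, 1, 3, 13, 25), "/Calls/AWAC.html"),
    # Hypothetical later visit, added to show a session split.
    ("203.30.5.145", datetime(1999, 6, 1, 4, 30, 0), "/Calls/OWOM.html"),
]
sessions = sessionize(requests)
for visitor, urls in sessions:
    print(visitor, urls)
```

Using the IP address alone as visitor key is the weakest option, because proxy servers and dynamic IP addresses blur visitors; combining it with the user agent, or using cookies where available, is usually preferable.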
Path Completion
Refers to the problem of inferring missing user references due to caching.
Effective path completion requires extensive knowledge of the link structure within the site
Referrer information in server logs can also be used in disambiguating the inferred paths.
Problem gets much more complicated in frame-based sites.
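The role of the referrer in path completion can be illustrated with a small sketch. It assumes that, when the referrer of a request is not the page requested immediately before but a page seen earlier in the session, the visitor went back to that page through cached copies (for example with the browser's back button). This is a simplification: it ignores the site's link structure, and the function name and page names are hypothetical.

```python
def complete_path(requests):
    """Infer cached back-steps in one session.

    requests: list of (url, referrer) pairs in time order.
    Returns the page sequence with inferred back-steps inserted.
    """
    path = []
    for url, referrer in requests:
        if path and referrer and referrer != path[-1] and referrer in path:
            # Position of the most recent visit to the referrer.
            last_seen = len(path) - 1 - path[::-1].index(referrer)
            # Assume the visitor stepped back page by page to the referrer.
            path.extend(reversed(path[last_seen:len(path) - 1]))
        path.append(url)
    return path

# Hypothetical session: A -> B -> C, then D is requested with referrer A.
session = [("A", None), ("B", "A"), ("C", "B"), ("D", "A")]
print(complete_path(session))
```

Here the inferred path is A, B, C, B, A, D: the two back-steps never reached the server because the pages were served from the cache.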
Why integrate semantics?
Basic idea: associate each requested page with one or more domain concepts, to better understand the process of navigation / Web usage
Example: a shopping site
From ...
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478
To ...
[Figure: the requests and the transitions between them are labelled with concepts: Search by category; Refine search; Search by category + title; Choose item; Look at individual product.]
From URLs to topics / concepts: Basics of semantic session modelling
1 request → 1 concept or n concepts
Concepts can concern content or service
Concepts can be part of an ontology (simple case: concept hierarchy)
Session = set / sequence / tree / graph of requests
also possible: n requests → 1 concept
Ontology-based behaviour modelling – basic ideas (1)
The request for a Web page signals interest in the concept(s) and relations dealt with in this page – interest in the obtained content as well as in the requested service.
Formally: a request as a (multi)set, or as a vector, of concepts/relations.
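A request can be turned into a multiset of concepts by inspecting its path and query parameters. The sketch below uses the shopping-site URLs from the example; the concept names (`search`, `criterion:l`, `criterion:p`, `view_item`) and the mapping rules are hypothetical and serve only to illustrate the idea of one request mapping to several concepts.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

def concepts_for(url):
    """Map one requested URL to a multiset (Counter) of concepts."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    concepts = Counter()
    if parts.path == "/search.html":
        concepts["search"] += 1  # service concept
        for name in params:
            if name in ("l", "p"):  # assumed to be search criteria
                concepts["criterion:" + name] += 1  # content concepts
    elif parts.path == "/mlesen.html":
        concepts["view_item"] += 1
    return concepts

urls = [
    "/search.html?l=ostsee%20strand&syn=023785&ord=asc",
    "/search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc",
    "/mlesen.html?Item=3456&syn=023785",
]
for url in urls:
    print(dict(concepts_for(url)))
```

Summing the Counters of all requests in a session yields a concept vector for the whole session.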
Resulting format: if the request is the instance
Usually flat file (format like Web server log) or database
Resulting format: If a session is the instance
What features can a session have?
Refer again to the example:
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:03:51 +0100] "GET /search.html?l=ostsee%20strand&syn=023785&ord=asc HTTP/1.0" 200 1759
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:05:06 +0100] "GET /search.html?l=ostsee%20strand&p=low&syn=023785&ord=desc HTTP/1.0" 200 8450
p3ee24304.dip.t-dialin.net - - [19/Mar/2002:12:06:41 +0100] "GET /mlesen.html?Item=3456&syn=023785 HTTP/1.0" 200 3478
[Figure: concept labels as before: Search by category; Refine search; Search by category + title; Choose item; Look at individual product.]
Web Usage and E-Business Analytics
Basic Framework for E-Commerce Data Analysis
[Figure: architecture diagram. Data sources: Web/Application Server Logs, Site Content, Operational Database (customers, orders, products). Processing: Data Cleaning / Sessionization Module, Content Analysis Module, Data Integration Module. Intermediate data: Site Map, Site Dictionary, Integrated Sessionized Data, E-Commerce Data Mart, Data Cube. Tools: Data Mining Engine, OLAP Tools. Analyses: Session Analysis / Static Aggregation, Pattern Analysis, OLAP Analysis.]
Web Usage and E-Business Analytics
Session Analysis
Static Aggregation and Statistics
OLAP
Data Mining
Different Levels of Analysis
Session Analysis
Simplest form of analysis: examine individual or groups of server sessions and e-commerce data.
Advantages:
Gain insight into typical customer behaviors.
Trace specific problems with the site.
Drawbacks:
LOTS of data.
Difficult to generalize.
Static Aggregation (Reports)
Most common form of analysis.
Data aggregated by predetermined units such as days or sessions.
Generally gives most “bang for the buck.”
Advantages:
Gives quick overview of how a site is being used.
Minimal disk space or processing power required.
Drawbacks:
No ability to “dig deeper” into the data.
Page View           Number of Sessions   Average View Count per Session
Home Page           50,000               1.5
Catalog Ordering    500                  1.1
Shopping Cart       9000                 2.3
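A report of this kind can be computed directly from sessionized data. The sketch below assumes that "Average View Count per Session" means the average number of views of a page within those sessions that contain it; this reading, the function name, and the toy sessions are assumptions made for illustration.

```python
from collections import Counter

def page_report(sessions):
    """Return {page: (number_of_sessions, average_views_per_session)}."""
    session_counts = Counter()  # sessions containing the page at least once
    view_counts = Counter()     # total views of the page
    for session in sessions:
        for page, views in Counter(session).items():
            session_counts[page] += 1
            view_counts[page] += views
    return {
        page: (session_counts[page], view_counts[page] / session_counts[page])
        for page in session_counts
    }

# Toy sessions, each a list of viewed pages (hypothetical data).
sessions = [
    ["Home Page", "Shopping Cart", "Home Page"],
    ["Home Page"],
    ["Shopping Cart", "Shopping Cart", "Shopping Cart"],
]
report = page_report(sessions)
for page, (n_sessions, avg_views) in report.items():
    print(page, n_sessions, avg_views)
```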
Online Analytical Processing (OLAP)
Allows changes to aggregation level for multiple dimensions.
Generally associated with a Data Warehouse.
Advantages & Drawbacks
Very flexible
Requires significantly more resources than static reporting.
Page View                 Number of Sessions   Average View Count per Session
Kid's Stuff Products      2,000                5.9

Page View                 Number of Sessions   Average View Count per Session
Kid's Stuff Products
  Electronics
    Educational           63                   2.3
    Radio-Controlled      93                   2.5
Data Mining: Going deeper
Sequence mining, Markov chains: prediction of next event
Association rules: discovery of associated events or application objects
Clustering: discovery of visitor groups with common properties and interests
Session clustering: discovery of visitor groups with common behaviour
Classification: characterization of visitors with respect to a set of predefined classes; card fraud detection
KDD Techniques for Web Applications: Examples (1)
Calibration of a Web server:
Prediction of the next page invocation over a group of concurrent Web users under certain constraints
Sequence mining, Markov chains
Cross-selling of products:
Mapping of Web pages/objects to products
Discovery of associated products
Association rules, Sequence Mining
Placement of associated products on the same page
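Prediction of the next page invocation can be illustrated with a first-order Markov chain, in which the next page depends only on the current one. The following is a minimal sketch on toy sessions; the page names and function names are hypothetical, and real applications typically need smoothing and higher-order models.

```python
from collections import Counter, defaultdict

def fit_transitions(sessions):
    """Count page-to-page transitions over all sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts

def predict_next(counts, page):
    """Return (most likely next page, estimated probability) or None."""
    if page not in counts:
        return None
    following, n = counts[page].most_common(1)[0]
    return following, n / sum(counts[page].values())

# Toy sessions (hypothetical data).
sessions = [
    ["home", "catalog", "product"],
    ["home", "catalog", "cart"],
    ["home", "catalog", "product", "cart"],
]
transitions = fit_transitions(sessions)
print(predict_next(transitions, "catalog"))
```

A server could use such estimates to prefetch or cache the pages that are most likely to be requested next.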
KDD Techniques for Web Applications: Examples (2)
Sophisticated cross-selling and up-selling of products:
Mapping of pages/objects to products of different price groups
Identification of Customer Groups
Clustering, Classification
Discovery of associated products of the same/different price categories
Association rules, Sequence Mining
Formulation of recommendations to the end-user
Suggestions on associated products
Suggestions based on the preferences of similar users
A multi-channel retailer, its business goals, and analysis questions
General goals: “Standard e-tailer goals“ – attract users/shoppers and convert them into customers
Specific goals: assess the success of the Web site – in relation to other distribution channels
Questions of the evaluation:
• What business metrics can be calculated from Web usage data, transaction and demographic data for determining online success?
• Are there cross-channel effects between a company‘s e-shop and its physical stores?
Background: Internet market shares [BCG 2002]
[Figure: stacked bar chart (0% to 100%) of the market shares of pure Internet companies and multi-channel businesses in 1999, 2000, 2001, and 2002 (proj.); segment values 52/48, 54/46, 67/33, and 69/31.]
Outline of the KDD process
Business understanding: customer buying process
Data: Web server sessions, transaction info.
Data understanding (main step): modelling the semantics of the site in terms of a hierarchy of service concepts
Data preparation: session IDs; usual data cleaning steps; linking of sessions and transaction information (anonymized)
Modelling / pattern discovery: Web metrics, cluster analysis, association rules, sequence mining + correlation analysis, questionnaire study, qualitative market analysis
Evaluation: interesting patterns
Agenda – Case Study
Business Understanding
Data understanding and preparation
Pattern discovery + evaluation: Success metrics
Pattern disc. + eval.: Behavioural patterns
Pattern disc. + eval.: User types
Pattern disc. + eval.: Behaviour & demographics
Description of the site and its services
The retailer operates an e-shop and more than 5000 retail shops in over 10 European countries
It sells a wide range of consumer electronics
Online customers can pay, pick-up/deliver and return both online and offline
Web pages provide for all tasks in the customer buying process
Purchase Phases (Page Concepts) at Large MC Retailers
[Figure: the phases are built up step by step, with the labels Home (Acquisition), Product Impression, Product Click-Through, Offline info, Transaction, Purchase.]
1. Acquisition (home): all Web pages that are semantically related to the initial acquisition of a visitor
2. Catalogue information (infcat): pages providing an overview of product categories
3. Information product (infprod): pages displaying information about a specific product
4. Offline information (offinfo): all pages related to any offline information: store locator (pages for finding physical stores in one's neighbourhood), information about offline services, offline referrers, etc.
5. Transaction (transact): steps before an actual purchase, starting with a customer entering the order process: check-out, input of customer data, payment and delivery preferences (online or offline), etc.
6. Purchase: indicates whether a visitor completed the transaction process and bought a product, e.g. by invoking an order confirmation page
Data and data preparation
Data sources and sample:
92,467 sessions from the company’s Web logs from 21 days in 2002
anonymized transaction information of 13,653 customers who bought online over a period of 8 months in 2001/02.
621 transaction records (21 days) were linked to Web-usage records
Data preparation:
Sessions were determined by session IDs
Robot visits eliminated, usual data cleaning steps
Each URL request mapped to a service concept from {c1,...,cn}
Session representation: s = [w1, ...wn], with wi = weight of ci, indicating whether or not the concept was visited (1/0), or how often it was visited
Customer record: feature vector incl. session and transaction data
Site semantics: A service concept hierarchy
760,535 page requests were mapped onto the concepts from this hierarchy:
[Figure: concept tree rooted at "Any". Node labels: Information, Transaction, Services, Acquisition, Other, Information Product, Information Catalog, Fulfillment/Service, Customer Data, Shopping Cart, Payment, Company Infos, Registration, Offline Referrer, Advertiser, Other, Store Locator, Home, Game, Offline Service and Support. Multi-channel concepts are marked in the figure.]
Types of patterns
Conversion rates (~ confidence of content-specified sequential association rules) for assessing business success
Association rule and sequence analysis for understanding online/offline preferences and their temporal development
Cluster analysis for customer segmentation
Correlation analysis for investigating the relationship between demographic indicators and online/offline preferences
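The reading of a conversion rate as the confidence of a sequential rule can be sketched as follows: among all sessions that contain a start concept, count the fraction in which a goal concept occurs later. The function name and the toy sessions are hypothetical; the concept names follow the purchase phases of the case study.

```python
def conversion_rate(sessions, start, goal):
    """Fraction of sessions containing `start` in which `goal` occurs later."""
    with_start = 0
    converted = 0
    for session in sessions:
        if start in session:
            with_start += 1
            first = session.index(start)
            if goal in session[first + 1:]:
                converted += 1
    return converted / with_start if with_start else 0.0

# Toy sessions as sequences of visited concepts (hypothetical data).
sessions = [
    ["home", "infcat", "infprod", "transact", "purchase"],
    ["home", "infcat", "infprod"],
    ["infprod", "transact"],
    ["home", "offinfo"],
]
print(conversion_rate(sessions, "transact", "purchase"))
print(conversion_rate(sessions, "infprod", "purchase"))
```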
Session representation
Each session is represented as a feature vector on the multi-channel concepts
Two methods are used for the definition of new conversion metrics:
weighted-concept method (number of visits to a concept)
dichotomized-concept method (whether or not the concept was visited)

Weighted-concept method:
Session   home   infcat   infprod   service   transact   purch.   offinfo
A         0      3        7         4         2          1        0
B         1      3        5         0         0          0        2
...

Dichotomized-concept method:
Session   home   infcat   infprod   service   transact   purch.   offinfo
A         0      1        1         1         1          1        0
B         1      1        1         0         0          0        1
...
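The relation between the two representations is simple: the dichotomized vector records only whether each weight is non-zero. A minimal sketch using the values of sessions A and B from the tables above (the function name is a hypothetical helper):

```python
CONCEPTS = ["home", "infcat", "infprod", "service", "transact", "purch.", "offinfo"]

def dichotomize(weights):
    """Turn visit counts into 1/0 flags (concept visited or not)."""
    return [1 if weight > 0 else 0 for weight in weights]

# Weighted-concept vectors of sessions A and B, taken from the table.
weighted = {
    "A": [0, 3, 7, 4, 2, 1, 0],
    "B": [1, 3, 5, 0, 0, 0, 2],
}
dichotomized = {name: dichotomize(vector) for name, vector in weighted.items()}
for name, vector in dichotomized.items():
    print(name, dict(zip(CONCEPTS, vector)))
```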
"Internal consistency" of preferences – payment and delivery preferences
Online payment → Direct delivery (s=0.27, c=0.97): fewer than 1/3 are traditional online users!
Online payment → In-store pickup (s=0.02, c=0.03)
Cash on delivery → Direct delivery (s=0.02, c=0.03)
In-store payment → In-store pickup (s=0.69, c=0.94)
Site is primarily used to collect information.
s: support, c: confidence of the sequence
"Internal consistency" of preferences – return preferences
Return → In-store (s=0.06, c=0.87)
Return → Mail-in (s=0.04, c=0.13)
Customers may wish personal assistance.
(a result supported by the service mix analysis of different multi-channel retailers and by questionnaire results)
s: support, c: confidence of the association rule
Development of preferences over time
Direct delivery → In-store pickup in 1 following transaction (s=0.001, c=0.15)
Direct delivery → Direct delivery in all following transactions (s=0.003, c=0.85)
In-store pickup → Direct delivery in 1 following transaction (s=0.001, c=0.10) (*)
In-store pickup → In-store pickup in all following transactions (s=0.004, c=0.90)
Results for payment migration are similar.
90% of repeat customers did not change transaction preferences at all.
Rule (*) as an indicator of the development of trust?!
s: support, c: confidence of the sequence
Association-rule mining
Coenen, F. (2003). Association rule mining and its wider context. AI2003 Association Rule Mining Tutorial, Cambridge, December 2003.
http://www.csc.liv.ac.uk/~frans/KDD/Tutorials/tutorialAI2003.ppt
pp. 5 – 20, covering
What is an association rule?
What are interestingness measures for association rules?
support, confidence, lift (there are also further measures)
cf. the „performance measures“ recall, precision, etc. for classifiers
How is association-rule mining performed?
the basic apriori algorithm
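The three interestingness measures and the level-wise idea behind apriori can be sketched in a few lines. This is an illustration on hypothetical toy transactions, not the implementation covered in the tutorial; the item names merely echo the payment and delivery options of case study 1, and the numbers are unrelated to the results reported there.

```python
from itertools import combinations

def support(transactions, itemset):
    """Fraction of transactions that contain every item of the itemset."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(transactions, lhs, rhs):
    """support(lhs and rhs) / support(lhs) for the rule lhs -> rhs."""
    return support(transactions, set(lhs) | set(rhs)) / support(transactions, lhs)

def lift(transactions, lhs, rhs):
    """Confidence of lhs -> rhs relative to the baseline support of rhs."""
    return confidence(transactions, lhs, rhs) / support(transactions, rhs)

def apriori(transactions, min_support):
    """Level-wise search for all itemsets with support >= min_support."""
    items = {item for t in transactions for item in t}
    level = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while level:
        current = {}
        for candidate in level:
            s = support(transactions, candidate)
            if s >= min_support:
                current[candidate] = s
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets, then prune candidates that have
        # an infrequent (k-1)-subset (the apriori property).
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        level = [c for c in candidates
                 if all(frozenset(sub) in current for sub in combinations(c, k - 1))]
    return frequent

# Hypothetical toy transactions.
transactions = [frozenset(t) for t in [
    {"online payment", "direct delivery"},
    {"online payment", "direct delivery"},
    {"in-store payment", "in-store pickup"},
    {"in-store payment", "in-store pickup"},
    {"in-store payment", "direct delivery"},
]]
rule = ({"online payment"}, {"direct delivery"})
print(support(transactions, rule[0] | rule[1]))
print(confidence(transactions, *rule))
print(lift(transactions, *rule))
frequent = apriori(transactions, min_support=0.4)
print(len(frequent))
```

A lift above 1 indicates that the two sides of the rule occur together more often than would be expected if they were independent.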
The site
Business understanding / problem definition:
* How do users search in this online catalog?
* Which search criteria are popular?
* Which are efficient?
The concept hierarchies / site ontology (excerpt)
SEITE1-...LI (1st page of a list) or SEITEn-...LI (further page)
LA ("Land": country/state), SA ("Schulart": school type), SU ("Suche": search)
Sequence mining – one result pattern: successful search for a school in Germany
[Figure: one example pattern, annotated with "a refinement", "a repetition", and "a continuation"]
select t
from node a b, template a * b as t
where a.url startswith "SEITE1-"
  and a.occurrence = 1
  and b.url contains "1SCHULE"
  and b.occurrence = 1
  and (b.support / a.support) >= 0.2
(Berendt & Spiliopoulou, VLDB J. 2000)
/liste.html?offset=920&zeilen=20&anzahl=1323&sprache=de&sw_kategorie=de&erscheint=&suchfeld=&suchwert=&staat=de&region=by&schultyp=
An overview of the WUM formalism and algorithm
Berendt, B. (2007). Web Usage Mining - Modelling: frequent-pattern mining I (sequence mining with WUM, classification and clustering).
http://vasarely.wiwi.hu-berlin.de/WebMining07/index5_final.ppt
pp. 10-19
Understanding the semantics of requests. Step 1: Domain ontology
• community portal ka2portal.aifb.uni-karlsruhe.de
• ontology-based:
  • Knowledge base in F-Logic
  • Static pages: annotations
  • Dynamic pages: generated from queries
  • Queries also in F-Logic
  • Logs contain these queries
[Figure: ontology screenshot; visible label: affiliation]
In the preparation of a log file (with recommendations for open-source tools)
1. Use qualitative methods for application understanding (read!)
2. Inspect the site and the URLs for data understanding
   1. Generate Analog reports for getting base statistics of usage
   2. Build concept system / hierarchy and mapping: URLs → concepts (notation: WUMprep regex)
3. Use WUMprep for data preparation
   1. Remove unwanted entries (pictures etc.)
   2. Sessionize
   3. Remove robots
   4. Replace URLs by concepts
   5. (Build a database)
4. Use WEKA for modelling
   1. [ Transform log file into ARFF (WUMprep4WEKA) ]
   2. Cluster, classify, find association rules, ...
5. Use WUM for modelling
6. Select patterns based on objective interestingness measures (support, confidence, lift, ...) and on subjective interestingness measures (unexpected? application-relevant?)
7. Present results in tabular, textual and graphical form (use Excel, ...)
8. Interpret the results
9. Make recommendations for site improvement etc.
In the case study:
[Same list of steps as in the preparation of a log file; on the original slide, part of the list is marked "done".]
URLs of the tools
Analog: http://www.analog.cx/
WUMprep: http://www.hypknowsys.de/
WEKA: http://www.cs.waikato.ac.nz/ml/weka/
WUM: http://www.hypknowsys.de/
Short introductions to WUMprep
Lüderitz, S. (2006). Pre-processing of webserver logs for data mining. http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/Lecture/OtherSlides/luederitz-presentation1-slides_2006_07_10.pdf
(pp. 30-32)
Dettmar, G. (2003). Logfile-Preprocessing using WUMprep. http://warhol.wiwi.hu-berlin.de/~berendt/lehre/2003w/wmi/Student_Presentations/Gebhard_WUMprep.pdf
Materials for your case study
Original log
A transformed log (to simplify your work of sessionizing)
Some explanation: http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/Lecture/OtherSlides//explaining-the-ka2portal-logs.html
(original log and transformed log are hyperlinked there)
The ontology
http://annotation.semanticweb.org/iswc/iswc.daml
You can browse this ontology (it is the default ontology, see Wizard) for example with the Ontomat tool: http://annotation.semanticweb.org/ontomat/simple.html
Unfortunately, the site itself is not running any more! Use www.archive.org to inspect earlier versions
To structure your case study:
More details in
CRISP-DM 1.0. Step-by-step data mining guide.
www.crisp-dm.org/CRISPWP-0800.pdf
Next lecture
[Figure: overview diagram with the labels Inputs, Data preparation, Algorithm, Outputs, Evaluation, Multirelational data mining]
What if the input isn't in a table (or even multiple tables)?
Mining semi-structured / unstructured data II (text)
References / background reading (1)
Data preparation
Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data preparation for mining world wide web browsing patterns. J. of Knowledge and Information Systems, 1, 5-32. http://citeseer.ist.psu.edu/cooley99data.html
Spiliopoulou, M., Mobasher, B., Berendt, B., & Nakagawa, M. (2003). A framework for the evaluation of session reconstruction heuristics in Web-usage analysis. INFORMS Journal on Computing, 15, 171-190. http://warhol.wiwi.hu-berlin.de/~berendt/Papers/spiliopoulou_etal_2003.pdf
Web mining
Baldi, P., Frasconi, P., & Smyth, P. (2003). Modeling the Internet and the Web. Probabilistic Methods and Algorithms. Chichester, UK: John Wiley & Sons. http://ibook.ics.uci.edu/
Liu, B. (2006). Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer. http://www.cs.uic.edu/%7Eliub/WebMiningBook.html
A general overview of Web usage mining
Srivastava, J., Desikan, P., & Kumar, V. (2004). Web Mining - Concepts, Applications and Research Directions. In H. Kargupta, A. Joshi, K. Sivakumar, & Y. Yesha (Eds.), Data Mining: Next Generation Challenges and Future Directions (pp. 405-423). Menlo Park, CA: AAAI/MIT Press. (earlier, longer version: http://www.ieee.org.ar/downloads/Srivastava-tut-paper.pdf)
References / background reading (2)
Case study 1
Teltzrow, M., & Berendt, B. (2003). Web-Usage-Based Success Metrics for Multi-Channel Businesses. In Proceedings of the WebKDD 2003 Workshop - Webmining as a Premise to Effective and Intelligent Web Applications. August 27th, 2003, Washington DC, USA. Held in conjunction with The Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. http://warhol.wiwi.hu-berlin.de/~teltzrow/teltzrow_berendt_webkdd03.pdf
Teltzrow, M., Berendt, B., & Günther, O. (2003). Consumer behaviour at multi-channel retailers. In Proceedings of the 4th IBM eBusiness Conference, School of Management, University of Surrey, 9th December 2003. http://warhol.wiwi.hu-berlin.de/~berendt/Papers/teltzrow_berendt_guenther_2003.pdf
Case study 2
Berendt, B. & Spiliopoulou, M. (2000). Analysis of navigation behaviour in web sites integrating multiple information systems. The VLDB Journal, 9, 56-75. http://vasarely.wiwi.hu-berlin.de/Home/berendt-spiliopoulou-vldbj00.pdf