
Web Mining/Web Usage Mining

MMIS 2 VU SS 2011 - 707.025

Denis Helic

KMI, TU Graz

Mar 08, 2012

Denis Helic (KMI, TU Graz) Web Mining/Web Usage Mining Mar 08, 2012 1 / 36


Introduction

The Web is the largest data source in the world

Web Mining aims to extract and mine knowledge from the data on the Web

Data → Information (Data in context) → Knowledge (Information in context)

Typically, knowledge resides in the human mind

Automatic extraction to prepare it for humans


Example: Navigational behavior on the Web

Study by Huberman et al. in 1998

Strong Regularities in World Wide Web Surfing

Observing the number of links users follow on a website

Theoretical model confirmed with the log analysis of several large websites


Example: Navigational behavior on the Web

Figure: Number of links followed vs. number of users


Introduction

Web Mining is a multidisciplinary field

Data mining, machine learning, network science

Statistics, information retrieval, multimedia, etc.

Databases, in particular NoSQL databases

Map/Reduce, GraphDB, etc.

Lack of structure, heterogeneity → very challenging task


Opportunities and Challenges

The amount of information is huge and easily accessible

The coverage of information is huge (information on anything)

All types of information exist (structured databases, text, multimedia, ...)

Much of the Web information is semi-structured (HTML)

Much of the Web information is linked


Opportunities and Challenges

A lot of redundancy (copy&paste instead of linking)

A lot of noise (advertisement, copyright notices, navigation panels, ...)

A lot of Web services that provide different responses for different request parameters

The Web is dynamic (information changes → snapshots)

It is a virtual society → not only about data but also about people


Web Mining

Web Mining classification

Web Usage Mining

User access and interaction patterns

Search access and search interaction → search query logs

Navigation and browsing → access logs

We will deal mostly with this topic


Web Mining

Web Structure Mining

Discover knowledge from the link structure

E.g. PageRank

But also HITS algorithm

Discussed in e.g. Web Science or MMIS1


Web Mining

Web Content Mining

Mining, integration and extraction of knowledge from the Web content

E.g. clustering search results according to the content similarity

Sentiment analysis (positive, negative opinions, ...)

Discussed in e.g. Application areas of Knowledge Management


Web Mining

A subcategory that belongs to all other categories

Web Metadata Mining

Extraction of knowledge from the user metadata, e.g. tags

Tags are also content, tags are typically represented as links, and tags are a specific product of interaction with the system

But other types of metadata are possible: e.g. Wikipedia categories

We will deal with extraction of hierarchies from Web metadata


Data Sources

Web Metadata Mining

Datasets of diverse social Web sites

E.g. Wikipedia dumps

Crawls from tagging systems, e.g. delicious or flickr

Typically crawled via APIs offered by those systems

Very large files


Delicious crawl

000001b9e9e5be0c86cac873e42c2c4d basil3whitehouse http://en.wikipedia.org/wiki/Roomano 1176073200 food cheese

00000c9d3fee7592680fa80646c36fa7 NicoC http://en.wikipedia.org/wiki/Green_Anaconda 1170720000 anacondas animalinfo
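
The records above can be split into fields programmatically. The following is a minimal Python sketch, assuming each record is whitespace-separated and consists of an identifier, a user name, a URL, a Unix timestamp, and then any number of tags. The field names and the Bookmark type are illustrative assumptions, not part of the dataset description.

```python
from datetime import datetime, timezone
from typing import NamedTuple, Tuple


class Bookmark(NamedTuple):
    record_id: str          # assumed: opaque identifier of the record
    user: str
    url: str
    posted: datetime        # assumed: Unix timestamp, interpreted as UTC
    tags: Tuple[str, ...]


def parse_bookmark(record: str) -> Bookmark:
    """Split one whitespace-separated record into its assumed fields."""
    record_id, user, url, timestamp, *tags = record.split()
    posted = datetime.fromtimestamp(int(timestamp), tz=timezone.utc)
    return Bookmark(record_id, user, url, posted, tuple(tags))


record = (
    "000001b9e9e5be0c86cac873e42c2c4d basil3whitehouse "
    "http://en.wikipedia.org/wiki/Roomano 1176073200 food cheese"
)
bookmark = parse_bookmark(record)
print(bookmark.user, bookmark.posted.date(), bookmark.tags)
```

Because the tags are simply "everything after the fourth field", a record with no tags yields an empty tuple rather than an error.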


Data Sources

Web Usage Mining

Server level collection

Client level collection

Proxy level collection

Very large files and multiple files


Server level collection

2012-03-07 00:14:20,469 |INFO|/af/AEIOU/Conrad_von_H%C3%B6tzendorf,_Franz_Freiherr|-|66.249.66.206|Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

2012-03-07 00:14:21,026 |INFO|/af/Wissenssammlungen/Fossilien/Escharella|-|62.47.22.30|Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)
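
Log records of this shape can be parsed with a few lines of code. The sketch below is in Python and assumes that each record is a single line with six "|"-separated fields: timestamp, log level, requested path, a field whose meaning the slide does not state (shown as "-"), client IP, and user agent. The LogEntry type, the field names, and the crawler check are illustrative assumptions.

```python
from datetime import datetime
from typing import NamedTuple


class LogEntry(NamedTuple):
    time: datetime
    level: str
    path: str
    aux: str      # meaning not stated in the source; "-" in both samples
    ip: str
    agent: str


def parse_entry(line: str) -> LogEntry:
    """Parse one '|'-separated record; the field meanings are assumed."""
    stamp, level, path, aux, ip, agent = (
        part.strip() for part in line.split("|", 5)
    )
    time = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")
    return LogEntry(time, level, path, aux, ip, agent)


def is_crawler(entry: LogEntry) -> bool:
    """Crude robot check based on substrings of the agent string."""
    agent = entry.agent.lower()
    return any(word in agent for word in ("bot", "crawler", "spider"))


googlebot_line = (
    "2012-03-07 00:14:20,469 |INFO|"
    "/af/AEIOU/Conrad_von_H%C3%B6tzendorf,_Franz_Freiherr|-|66.249.66.206|"
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
browser_line = (
    "2012-03-07 00:14:21,026 |INFO|"
    "/af/Wissenssammlungen/Fossilien/Escharella|-|62.47.22.30|"
    "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)"
)
entries = [parse_entry(googlebot_line), parse_entry(browser_line)]
human_entries = [entry for entry in entries if not is_crawler(entry)]
print(len(entries), len(human_entries))
```

Separating crawler traffic from browser traffic is a common first cleaning step, since robots do not reflect the navigation behavior of human users.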


Server level collection

POST data typically not logged

Cache hits not logged

Tracking of user sessions is difficult

Cookies, query data stored in separate files → integration

Single site but multiple users


Client level collection

JavaScript, plugin, or extension code

E.g. Google Analytics sending client data from JavaScript to Google for a specific site

Search toolbars for collecting search and navigation(!) paths

No problems with caching or sessions

Single or multiple sites but single user


Proxy level collection

Proxy servers in organizations

Multiple users and multiple sites

Users are anonymous

Still possible to track sessions with heuristics


Web Usage Mining Framework

Raw Logs and Site Files → Preprocessing → Preprocessed Clickstream Data → Rules, Patterns, and Statistics → "Interesting" Rules, Patterns, and Statistics

Figure 1: High Level Web Usage Mining Process

IP Address  Userid  Time  Method/URL/Protocol  Status  Size  Referrer  Agent

123.456.78.9  -  [25/Apr/1998:03:04:41 -0500]  "GET A.html HTTP/1.0"  200  3290  -  Mozilla/3.04 (Win95, I)

123.456.78.9  -  [25/Apr/1998:03:05:34 -0500]  "GET B.html HTTP/1.0"  200  2050  A.html  Mozilla/3.04 (Win95, I)

123.456.78.9  -  [25/Apr/1998:03:05:39 -0500]  "GET L.html HTTP/1.0"  200  4130  -  Mozilla/3.04 (Win95, I)

123.456.78.9  -  [25/Apr/1998:03:06:02 -0500]  "GET F.html HTTP/1.0"  200  5896  B.html  Mozilla/3.04 (Win95, I)

123.456.78.9  -  [25/Apr/1998:03:06:58 -0500]  "GET A.html HTTP/1.0"  200  3290  -  Mozilla/3.01 (X11, I, IRIX6.2, IP22)

123.456.78.9  -  [25/Apr/1998:03:07:42 -0500]  "GET B.html HTTP/1.0"  200  2050  A.html  Mozilla/3.01 (X11, I, IRIX6.2, IP22)

123.456.78.9  -  [25/Apr/1998:03:07:55 -0500]  "GET R.html HTTP/1.0"  200  8140  L.html  Mozilla/3.04 (Win95, I)

123.456.78.9  -  [25/Apr/1998:03:09:50 -0500]  "GET C.html HTTP/1.0"  200  1820  A.html  Mozilla/3.01 (X11, I, IRIX6.2, IP22)

123.456.78.9  -  [25/Apr/1998:03:10:02 -0500]  "GET O.html HTTP/1.0"  200  2270  F.html  Mozilla/3.04 (Win95, I)

123.456.78.9  -  [25/Apr/1998:03:10:45 -0500]  "GET J.html HTTP/1.0"  200  9430  C.html  Mozilla/3.01 (X11, I, IRIX6.2, IP22)

123.456.78.9  -  [25/Apr/1998:03:12:23 -0500]  "GET G.html HTTP/1.0"  200  7220  B.html  Mozilla/3.04 (Win95, I)

209.456.78.2  -  [25/Apr/1998:05:05:22 -0500]  "GET A.html HTTP/1.0"  200  3290  -  Mozilla/3.04 (Win95, I)

209.456.78.3  -  [25/Apr/1998:05:06:03 -0500]  "GET D.html HTTP/1.0"  200  1680  A.html  Mozilla/3.04 (Win95, I)

Figure 2: Sample Web Server Log

SIGKDD Explorations. Jan 2000. Volume 1, Issue 2 - page 15


Web Metadata Mining

Process slightly different

E.g. instead of log data we have a raw dataset

Depending on the task there might be some additional steps

E.g. extracting a hierarchy

After the analysis apply an optimal algorithm for hierarchy extraction


Web Metadata Preprocessing

Preprocessing typically involves removing irrelevant data

Stemming

Grouping and integration of data

Sorting, etc.


Web Metadata Preprocessing

Depending on the file size (up to e.g. 40-50 GB), Unix shell commands are very useful

E.g. awk, sed, sort, uniq, grep, wc, ...

Also perl

E.g. distribution of items: sort -n -r data.txt | uniq -c

Filter lines: grep -v "null"
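
The same two operations can be expressed in a scripting language when more control is needed. The following Python sketch mirrors grep -v "null" followed by sort | uniq -c; the function names and the sample lines are hypothetical.

```python
from collections import Counter
from typing import Iterable, Iterator


def drop_null_lines(lines: Iterable[str]) -> Iterator[str]:
    """Like grep -v "null": skip every line containing the substring."""
    return (line for line in lines if "null" not in line)


def item_distribution(lines: Iterable[str]) -> Counter:
    """Like sort | uniq -c: count how often each distinct line occurs."""
    return Counter(line.rstrip("\n") for line in lines)


# Hypothetical input; in practice this would be an open file object,
# which is also iterated line by line without loading it fully.
sample = ["food\n", "cheese\n", "null\n", "food\n"]
counts = item_distribution(drop_null_lines(sample))
for item, count in counts.most_common():
    print(count, item)
```

Unlike sort, the counter keeps one entry per distinct item in memory, so this variant suits data with many repeated items better than data where almost every line is unique.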


Web Metadata Preprocessing

Wikipedia dumps, e.g. link dump 30G

(12,0,'Alain_Badiou'),(12,0,'Albert_Camus')

perl -p -i.bac -e "s/\((.+?),'(.+?)',.+?'\)(,|;)/\1,\2\n/g" test.txt
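
As an alternative illustration of the same kind of transformation, here is a minimal Python sketch that turns the tuples of the sample line into (source id, target title) pairs. It assumes the three-field layout of the sample above (a numeric id, a numeric second field, a single-quoted title); the pattern and function name are illustrative and not taken from the original course material.

```python
import re
from typing import Iterator, Tuple

# One tuple of the assumed form (id,number,'title'); a backslash-escaped
# quote inside the title is accepted and kept in its raw, escaped form.
TUPLE = re.compile(r"\((\d+),(-?\d+),'((?:[^'\\]|\\.)*)'\)")


def extract_links(sql_values: str) -> Iterator[Tuple[int, str]]:
    """Yield (source id, target title) for every tuple in the text."""
    for match in TUPLE.finditer(sql_values):
        yield int(match.group(1)), match.group(3)


line = "(12,0,'Alain_Badiou'),(12,0,'Albert_Camus')"
for source_id, title in extract_links(line):
    print(f"{source_id},{title}")
```

For a dump of tens of gigabytes the input would be processed line by line from the file rather than held in memory as one string.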


Web Usage Preprocessing

Difficult because on the server we have only the IP address, agent, and server click stream

We need to identify users and sessions

Single IP but multiple sessions because of ISP proxies

Multiple IPs but single user using different machines

Multiple agents but single user even from the same machine


Web Usage Preprocessing

Assuming that each user has been identified (e.g. through cookies/IP-agent/path analysis)

We need to extract sessions

Difficult to know when the user left the site for another site

Session time-out, typically 30 minutes

Problems with client side caching

If session state is managed elsewhere, it is difficult to know what content is served


Web Usage Preprocessing

Session heuristics

Time heuristics

Total time must not exceed 30 minutes

Total time at a single page must not exceed 10 minutes

Path heuristics (href)

A page must be reached from a previous page in the same session
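
The two time heuristics can be sketched in a few lines of Python. This is a minimal sketch, assuming the requests of one already identified user are given in time order as (timestamp in seconds, page) pairs; the names and the sample timestamps are hypothetical, and the path heuristic is left out because it additionally requires referrer or site link information.

```python
from typing import List, Tuple

SESSION_LIMIT = 30 * 60  # total session time in seconds
PAGE_LIMIT = 10 * 60     # maximum time at a single page in seconds

Request = Tuple[float, str]  # (timestamp in seconds, requested page)


def split_sessions(requests: List[Request]) -> List[List[Request]]:
    """Split one user's time-ordered requests into sessions."""
    sessions: List[List[Request]] = []
    current: List[Request] = []
    for request in requests:
        timestamp = request[0]
        if current:
            # Total time since the first request of the session.
            too_long = timestamp - current[0][0] > SESSION_LIMIT
            # Time spent on the previously requested page.
            page_stay = timestamp - current[-1][0] > PAGE_LIMIT
            if too_long or page_stay:
                sessions.append(current)
                current = []
        current.append(request)
    if current:
        sessions.append(current)
    return sessions


# Hypothetical click stream: three quick requests, a long pause, two more.
clicks = [(0, "A.html"), (53, "B.html"), (58, "L.html"),
          (7241, "A.html"), (7282, "D.html")]
for session in split_sessions(clicks):
    print([page for _, page in session])
```

A request that violates either limit starts a new session, so a long pause and an overly long browsing period are handled by the same loop.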


Pattern Discovery

Based on insights and algorithms from statistics, data mining, machine learning and pattern recognition

Statistical analysis, association rules, clustering, classification, sequential patterns, ...

Statistical analysis: descriptive statistics

Frequency, mean, median, mode, standard deviation, ...

E.g. access statistics, average time spent on page, etc.

Outlier detection, e.g. invalid URLs
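
As an illustration of such descriptive statistics, the following Python sketch estimates the time spent on a page as the gap between a request and the next request in the same session. The session data, function names, and this way of estimating viewing time are assumptions made for the example; the last page of a session has no following request and is therefore not measured.

```python
from collections import defaultdict
from statistics import mean, median
from typing import Dict, List, Tuple

Session = List[Tuple[float, str]]  # time-ordered (timestamp in s, page)


def time_on_page(sessions: List[Session]) -> Dict[str, List[float]]:
    """Collect, per page, the gaps to the following request."""
    durations: Dict[str, List[float]] = defaultdict(list)
    for session in sessions:
        for (start, page), (end, _) in zip(session, session[1:]):
            durations[page].append(end - start)
    return durations


def summarize(durations: Dict[str, List[float]]) -> Dict[str, dict]:
    """Frequency, mean, and median of the viewing time for each page."""
    return {
        page: {"visits": len(v), "mean": mean(v), "median": median(v)}
        for page, v in durations.items()
    }


# Hypothetical sessions.
sessions = [
    [(0, "A"), (30, "B"), (90, "C")],
    [(0, "A"), (50, "B"), (60, "D")],
]
stats = summarize(time_on_page(sessions))
for page, summary in sorted(stats.items()):
    print(page, summary)
```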


Pattern Discovery

Association rules: correlation statistics

Which pages are often visited in the same session

Correlation of visits to two non-linked pages

Improving the site navigation structure

Clustering: grouping similar items together

E.g. usage clusters and page clusters

Improving search results by showing similar pages
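
The idea of finding pages that are often visited in the same session can be sketched with simple pair counting. The Python sketch below computes support and confidence for page pairs only, which is a deliberately reduced form of association rule mining and not a full algorithm such as Apriori; the sessions, the threshold, and the function name are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, List, Tuple


def pair_rules(
    sessions: List[List[str]], min_support: float
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Map each rule (a, b), read as a -> b, to (support, confidence)."""
    total = len(sessions)
    page_count: Counter = Counter()
    pair_count: Counter = Counter()
    for session in sessions:
        pages = sorted(set(session))  # each page counts once per session
        page_count.update(pages)
        pair_count.update(combinations(pages, 2))
    rules = {}
    for (a, b), together in pair_count.items():
        support = together / total
        if support >= min_support:
            rules[(a, b)] = (support, together / page_count[a])
            rules[(b, a)] = (support, together / page_count[b])
    return rules


# Hypothetical sessions, each given as the list of visited pages.
visits = [["A", "B", "C"], ["A", "B"], ["A", "D"], ["B", "C"]]
rules = pair_rules(visits, min_support=0.5)
for (a, b), (support, confidence) in sorted(rules.items()):
    print(f"{a} -> {b}: support={support:.2f} confidence={confidence:.2f}")
```

A rule with high confidence between two pages that are not linked directly is a candidate for a new link in the site navigation structure.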


Pattern Discovery

Classification: assigning pages labels from a predefined set

User profiling

Classifying users into product groups, e.g. music, movies, etc.

Sequential patterns: identify time-ordered sequences of visits

Can be used to predict future visit patterns

Identify points of changing directions, etc.
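
One simple way to use time-ordered visit sequences for prediction is to count which page most often follows a given page. The Python sketch below uses such first-order transition counts; this is only one possible technique, chosen for brevity, and the sessions and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def transition_counts(sessions: List[List[str]]) -> Dict[str, Counter]:
    """Count, for each page, which pages were requested directly after it."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts: Dict[str, Counter], page: str) -> Optional[str]:
    """Return the most frequent follower of a page, or None if unseen."""
    followers = counts.get(page)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Hypothetical sessions as ordered lists of visited pages.
paths = [["A", "B", "F"], ["A", "B", "C"], ["A", "B", "C", "J"], ["A", "D"]]
counts = transition_counts(paths)
print(predict_next(counts, "A"), predict_next(counts, "B"))
```

Pages with many different followers of similar frequency are candidates for the points of changing direction mentioned above.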


Pattern Analysis

Visual analytics

Filter out what is not needed

Concentrate on patterns important for the task at hand

E.g. to improve navigation structure

Identify navigation sequences and navigational hubs

E.g. problems in continuing from the hubs

Potential improvements, e.g. more links, hierarchy, more hints about what is behind links, ...


Applications

Personalization

Dynamic recommendations of links, pages, products, ...

E.g. Facebook

You click a couple of times on liberal blogs posted by your liberal friends

The conservative blogs posted by your conservative friends are not shown anymore


Applications

System improvement

Depending on access patterns, you might design new caching strategies

Also load balancing, or data distribution

Security: you might recognize malicious access, ...


Applications

Site modification

Redesigning content and structure

Better linking

More usable navigation structures

Removing distractions, etc.

Evaluation of improvements


Further Info

Web usage mining: discovery and applications of usage patterns from Web data: http://dl.acm.org/citation.cfm?id=846183.846188

Book: Web Data Mining, http://www.cs.uic.edu/~liub/WebMiningBook.html

Tutorial: Web Content Mining, http://www.cs.uic.edu/~liub/WebContentMining.html
