
Page 1: Data Mining - Emory Universitylxiong/cs570s08/share/slides/10.pdf · 2009. 7. 22. · Web Mining Web mining vs. data mining Structure (or lack of it) Linkage structure and lack of

4/9/2008 1

Data Mining: Concepts and Techniques

Web Mining

Li Xiong

Slides credits: Jiawei Han and Micheline Kamber;

Anand Rajaraman, Jeffrey D. Ullman

Olfa Nasraoui

Bing Liu

Web Mining

Web mining vs. data mining

Structure (or lack of it): linkage structure and lack of structure in textual information

Scale: data generated per day is comparable to largest conventional data warehouses

Speed: often need to react to evolving usage patterns in real time (e.g., merchandising)

Web Mining

Structure mining: extracting info from the topology of the Web (links among pages)

Content mining: extracting info from page content (text, images, audio or video, etc.); natural language processing and information retrieval

Usage mining: extracting info from users' usage data on the Web (how users visit the pages or make transactions)


Web Mining

Web structure mining: Web graph structure and link analysis

Web text mining: text representation and IR models

Web usage mining: collaborative filtering

Structure of Web Graph

Web as a directed graph: pages = nodes, hyperlinks = edges

Problem: understand the macroscopic structure and evolution of the Web graph

Practical implications: crawling, browsing, computation of link analysis algorithms

Power-law degree distribution

Source: Broder et al, 00

Bow-tie Structure (Broder et al. 00)

The Daisy Structure (Donato et al. 05)

Link Analysis

Problem: exploit the link structure of a graph to order or prioritize the set of objects within the graph

Application of social network analysis at the actor level: centrality and prestige

Algorithms: PageRank, HITS

PageRank (Brin & Page’98)

Intuition: Web pages are not equally “important”

www.joe-schmoe.com vs. www.stanford.edu

Links as citations: a page cited often is more important

www.stanford.edu has 23,400 inlinks

www.joe-schmoe.com has 1 inlink

Recursive model: links from heavily linked pages are weighted more

PageRank is essentially eigenvector prestige from social network analysis

Simple Recursive Flow Model

Each link's vote is proportional to the importance of its source page

If page P with importance x has n outlinks, each link gets x/n votes

Page P's own importance is the sum of the votes on its inlinks

[Figure: three-page graph with Yahoo (y), Amazon (a), and M'soft (m); edge labels y/2, y/2, a/2, a/2, m]

Flow equations:

y = y/2 + a/2
a = y/2 + m
m = a/2

Solving the equations with the constraint y + a + m = 1 gives y = 2/5, a = 2/5, m = 1/5

Matrix formulation

Web link matrix M: one row and one column per web page

$$M_{ij} = \begin{cases} 1/O_j & \text{if } (j, i) \in E \\ 0 & \text{otherwise} \end{cases}$$

where E is the set of links and O_j is the number of out-links of page j

Rank vector r: one entry per web page

Flow equation: r = Mr

r is an eigenvector of M

Matrix formulation Example

[Figure: the Yahoo / Amazon / M'soft graph]

Link matrix M (rows and columns ordered y, a, m):

       y     a     m
y     1/2   1/2    0
a     1/2    0     1
m      0    1/2    0

Flow equations:

y = y/2 + a/2
a = y/2 + m
m = a/2

In matrix form, r = Mr:

$$\begin{pmatrix} y \\ a \\ m \end{pmatrix} = \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 1 \\ 0 & 1/2 & 0 \end{pmatrix} \begin{pmatrix} y \\ a \\ m \end{pmatrix}$$
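The matrix form of the three-page example can be checked mechanically. The following is a minimal sketch, assuming the out-link lists implied by the flow equations; the names `out_links`, `link_matrix`, and `mat_vec` are illustrative and not from the slides. It builds M column by column and confirms that r = (2/5, 2/5, 1/5) satisfies r = Mr.

```python
from fractions import Fraction

# Assumed out-link lists for the example (y = Yahoo, a = Amazon, m = M'soft).
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
pages = ["y", "a", "m"]

def link_matrix(out_links, pages):
    """Build M with M[i][j] = 1/|O(j)| when page j links to page i, else 0."""
    index = {p: k for k, p in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for j, src in enumerate(pages):
        for dst in out_links[src]:
            M[index[dst]][j] = Fraction(1, len(out_links[src]))
    return M

def mat_vec(M, r):
    """Plain matrix-vector product M r."""
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(M))]

M = link_matrix(out_links, pages)
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
print(mat_vec(M, r) == r)  # True when r is a fixed point of the flow equations
```

Exact fractions are used here only so that the fixed-point check is an equality rather than a floating-point comparison.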

Power Iteration method

Solving the equation r = Mr

Suppose there are N web pages

Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$

Iterate: $r^{(k+1)} = M r^{(k)}$

Stop when $|r^{(k+1)} - r^{(k)}|_1 < \varepsilon$

$|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm

Can use any other vector norm, e.g., Euclidean

Power Iteration Example

[Figure: the Yahoo / Amazon / M'soft graph]

Link matrix M (rows and columns ordered y, a, m):

       y     a     m
y     1/2   1/2    0
a     1/2    0     1
m      0    1/2    0

Successive rank vectors:

y     1/3   1/3   5/12   3/8     ...   2/5
a     1/3   1/2   1/3    11/24   ...   2/5
m     1/3   1/6   1/4    1/6     ...   1/5
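The power iteration steps above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a production implementation; the function name `power_iteration` and the defaults for `eps` and `max_iter` are assumptions, not values from the slides.

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[i][j] * r[j] for j in range(n)) for i in range(n)]
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Link matrix of the Yahoo / Amazon / M'soft example (rows and columns y, a, m).
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

rank = power_iteration(M)
print([round(x, 4) for x in rank])  # approaches (2/5, 2/5, 1/5)
```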

Random Walk Interpretation

Imagine a random web surfer

At any time t, the surfer is on some page P

At time t+1, the surfer follows an outlink from P uniformly at random

The surfer ends up on some page Q linked from P

The process repeats indefinitely

p(t) is the probability distribution whose ith component is the probability that the surfer is at page i at time t

The stationary distribution

Where is the surfer at time t+1?

p(t+1) = Mp(t)

Suppose the random walk reaches a state such that p(t+1) = Mp(t) = p(t)

Then p(t) is a stationary distribution for the random walk

Our rank vector r satisfies r = Mr
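The random-surfer reading can be illustrated by simulation. The sketch below assumes the three-page Yahoo / Amazon / M'soft graph from the earlier example; `simulate_surfer`, the step count, and the seed are illustrative choices. The visit frequencies of a long walk should come out close to the rank vector (2/5, 2/5, 1/5).

```python
import random

# Assumed out-link lists for the example (y = Yahoo, a = Amazon, m = M'soft).
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}

def simulate_surfer(out_links, start, steps, seed=0):
    """Follow a uniformly random out-link at each step and return visit frequencies."""
    rng = random.Random(seed)
    visits = {page: 0 for page in out_links}
    page = start
    for _ in range(steps):
        page = rng.choice(out_links[page])
        visits[page] += 1
    return {p: count / steps for p, count in visits.items()}

freq = simulate_surfer(out_links, "y", 200_000)
print(freq)  # estimates of the stationary distribution
```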


Existence and Uniqueness of the Solution

Theory of random walks (aka Markov processes):

A finite Markov chain defined by the stochastic matrix has a unique stationary probability distribution if the matrix is irreducible and aperiodic.


CS583, Bing Liu, UIC 20

M is not a stochastic matrix

M is the transition matrix of the Web graph:

$$M_{ij} = \begin{cases} 1/O_j & \text{if } (j, i) \in E \\ 0 & \text{otherwise} \end{cases}$$

It does not satisfy

$$\sum_{i=1}^{n} M_{ij} = 1$$

Many web pages have no out-links

Such pages are called dangling pages.
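A small check makes the dangling-page problem concrete. This sketch assumes a hypothetical variant of the three-page example in which M'soft has no out-links; `column_sums` and `dangling_pages` are illustrative helper names. The column of a dangling page sums to 0 instead of 1, so M is not stochastic.

```python
def column_sums(M):
    """Sum each column of M; every sum must equal 1 for M to be stochastic."""
    n = len(M)
    return [sum(M[i][j] for i in range(n)) for j in range(n)]

def dangling_pages(out_links):
    """Pages with no out-links."""
    return sorted(p for p, targets in out_links.items() if not targets)

# Hypothetical variant of the example: M'soft (m) has no out-links.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": []}
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]

print(column_sums(M))             # the last column sums to 0
print(dangling_pages(out_links))  # ['m']
```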

M is not irreducible

Irreducible means that the Web graph G is strongly connected.

Definition: A directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a path from u to v.

A general Web graph is not irreducible because, for some pair of nodes u and v, there is no path from u to v.
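Strong connectivity can be tested with two graph searches: every node must be reachable from a chosen root, and the root must be reachable from every node (checked by searching the reversed graph). The sketch below is one simple way to do this; the function names and the two small graphs are illustrative assumptions.

```python
from collections import deque

def reachable(graph, start):
    """Set of nodes reachable from start by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.get(u, []):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def is_strongly_connected(graph):
    """True if every node can reach every other node."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    if not nodes:
        return True
    reverse = {u: [] for u in nodes}
    for u, targets in graph.items():
        for v in targets:
            reverse[v].append(u)
    root = next(iter(nodes))
    return reachable(graph, root) == nodes and reachable(reverse, root) == nodes

connected = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
trap = {"y": ["y", "a"], "a": ["y", "m"], "m": ["m"]}  # m links only to itself
print(is_strongly_connected(connected), is_strongly_connected(trap))
```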

M is not aperiodic

A state i in a Markov chain being periodic means that there exists a directed cycle that the chain has to traverse.

Definition: A state i is periodic with period k > 1 if k is the largest number such that all paths leading from state i back to state i have a length that is a multiple of k.

If a state is not periodic (i.e., k = 1), it is aperiodic.

A Markov chain is aperiodic if all states are aperiodic.
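For a strongly connected graph, the period can be computed as a greatest common divisor over edges, using breadth-first distances from any root. The sketch below relies on that standard graph-theoretic fact; it assumes every node is reachable from `root`, and the name `chain_period` is illustrative.

```python
from collections import deque
from math import gcd

def chain_period(graph, root):
    """Period of a strongly connected directed graph.

    Assumption: every node is reachable from root. The period is the gcd of
    dist[u] + 1 - dist[v] over all edges (u, v), with dist the BFS distance.
    """
    dist = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    period = 0
    for u, targets in graph.items():
        for v in targets:
            period = gcd(period, dist[u] + 1 - dist[v])
    return period

cycle3 = {0: [1], 1: [2], 2: [0]}                        # a pure 3-cycle
example = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}  # has a self-loop at y
print(chain_period(cycle3, 0), chain_period(example, "y"))
```

A result of 1 means the chain is aperiodic; the self-loop at y is what makes the three-page example aperiodic.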

Solution: Random teleports

Add a link from each page to every page

At each time step, the random surfer has a small probability of teleporting along those links

With probability β, follow a link at random

With probability 1-β, jump to some page uniformly at random

Common values for β are in the range 0.8 to 0.9

Random teleports Example (β = 0.8)

[Figure: the Yahoo / Amazon / M'soft graph]

Rows and columns ordered y, a, m:

$$0.8 \begin{pmatrix} 1/2 & 1/2 & 0 \\ 1/2 & 0 & 0 \\ 0 & 1/2 & 1 \end{pmatrix} + 0.2 \begin{pmatrix} 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \\ 1/3 & 1/3 & 1/3 \end{pmatrix} = \begin{pmatrix} 7/15 & 7/15 & 1/15 \\ 7/15 & 1/15 & 1/15 \\ 1/15 & 7/15 & 13/15 \end{pmatrix}$$

Successive rank vectors:

y     1   1.00   0.84   0.776   ...   7/11
a     1   0.60   0.60   0.536   ...   5/11
m     1   1.40   1.56   1.688   ...   21/11
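The teleport example can be reproduced with exact fractions. This is a minimal sketch, assuming the matrix M shown in the example (M'soft's column is (0, 0, 1)); `google_matrix` and `step` are illustrative names. It builds A = βM + (1-β)/N with β = 0.8, checks that every column of A sums to 1, and iterates from (1, 1, 1).

```python
from fractions import Fraction

def google_matrix(M, beta):
    """A[i][j] = beta * M[i][j] + (1 - beta) / N."""
    n = len(M)
    return [[beta * M[i][j] + (1 - beta) / n for j in range(n)] for i in range(n)]

def step(A, r):
    """One power-iteration step r <- A r."""
    return [sum(A[i][j] * r[j] for j in range(len(r))) for i in range(len(A))]

half = Fraction(1, 2)
M = [[half, half, 0],
     [half, 0, 0],
     [0, half, 1]]
A = google_matrix(M, Fraction(4, 5))

# Every column of A sums to 1, so A is stochastic for this M.
print([sum(A[i][j] for i in range(3)) for j in range(3)])

r = [Fraction(1)] * 3
first = step(A, r)            # (1, 3/5, 7/5), i.e. 1.00, 0.60, 1.40
for _ in range(200):
    r = step(A, r)
print([float(x) for x in r])  # approaches (7/11, 5/11, 21/11)
```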

Matrix formulation

Matrix A: A_ij = βM_ij + (1-β)/N

M_ij = 1/|O(j)| when j→i and M_ij = 0 otherwise

Verify that A is a stochastic matrix

The PageRank vector r is the principal eigenvector of this matrix, satisfying r = Ar

Equivalently, r is the stationary distribution of the random walk with teleports

Advantages and Limitations of PageRank

Fighting spam

PageRank is a global measure and is query independent

Computed offline

Criticism: query independence. It cannot distinguish between pages that are authoritative in general and pages that are authoritative on the query topic.

HITS: Capturing Authorities & Hubs (Kleinberg’98)

Intuitions

Pages that are widely cited are good authorities

Pages that cite many other pages are good hubs

HITS (Hypertext-Induced Topic Selection)

When the user issues a search query, HITS expands the list of relevant pages returned by a search engine and produces two rankings

1. Authorities are pages containing useful information and linked to by hubs, e.g., course home pages, home pages of auto manufacturers

2. Hubs are pages that link to authorities, e.g., course bulletin, list of US auto manufacturers

[Figure: hubs pointing to authorities]

Matrix Formulation

Transition (adjacency) matrix A: A[i, j] = 1 if page i links to page j, 0 if not

The hub score vector h: score is proportional to the sum of the authority scores of the pages it links to

h = λAa, where the constant λ is a scale factor

The authority score vector a: score is proportional to the sum of the hub scores of the pages it is linked from

a = μA^T h, where the constant μ is a scale factor

[Figure: hubs pointing to authorities]

Transition Matrix Example

[Figure: the Yahoo / Amazon / M'soft graph]

A (rows and columns ordered y, a, m):

       y   a   m
y      1   1   1
a      1   0   1
m      0   1   0

Iterative algorithm

Initialize h, a to all 1's

h = Aa

Scale h so that its max entry is 1.0

a = A^T h

Scale a so that its max entry is 1.0

Continue until h, a converge

Iterative Algorithm Example

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}, \qquad A^T = \begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$

Successive authority scores:

a(yahoo)    =   1   1   1     1      ...   1
a(amazon)   =   1   1   4/5   0.75   ...   0.732
a(m'soft)   =   1   1   1     1      ...   1

Successive hub scores:

h(yahoo)    =   1   1     1      1      ...   1.000
h(amazon)   =   1   2/3   0.71   0.73   ...   0.732
h(m'soft)   =   1   1/3   0.29   0.27   ...   0.268
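The iterative algorithm and the numbers in the example can be reproduced with a short sketch. The function name `hits` and the fixed iteration count are assumptions made for illustration; a real implementation would more likely stop on a convergence test.

```python
def hits(A, iterations=50):
    """Alternate h = A a and a = A^T h, scaling each so its max entry is 1.0."""
    n = len(A)
    h = [1.0] * n
    a = [1.0] * n
    for _ in range(iterations):
        h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]
        top = max(h)
        h = [x / top for x in h]
        a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]
        top = max(a)
        a = [x / top for x in a]
    return h, a

# Adjacency matrix of the example (rows and columns ordered yahoo, amazon, m'soft).
A = [[1, 1, 1],
     [1, 0, 1],
     [0, 1, 0]]

h1, a1 = hits(A, iterations=1)  # h = (1, 2/3, 1/3), a = (1, 4/5, 1)
h, a = hits(A)
print([round(x, 3) for x in h], [round(x, 3) for x in a])
```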

Existence and Uniqueness of the Solution

h = λAa

a = μA^T h

h = λμAA^T h

a = λμA^TA a

Under reasonable assumptions about A, the dual iterative algorithm converges to vectors h* and a* such that:

• h* is the principal eigenvector of the matrix AA^T

• a* is the principal eigenvector of the matrix A^TA

Strengths and weaknesses of HITS

Strength: its ability to rank pages according to the query topic, which may be able to provide more relevant authority and hub pages.

Weaknesses:

Easily spammed

Topic drift

Inefficiency at query time

PageRank and HITS

Model

PageRank: depends on the links into S

HITS: depends on the value of the other links out of S

Characteristics

Spam resistance

Query independence

Destinies post-1998

PageRank: trademark of Google

HITS: not commonly used by search engines (Ask.com?)

Web Mining

Web structure mining: Web graph structure, link analysis

Web text mining

Web usage mining: collaborative filtering

Text Mining

Text mining refers to data mining using text documents as data.

Tasks: text summarization, text classification, text clustering, …

Intersection with Information Retrieval and Natural Language Processing

Levels of text representations

Character (character n-grams and sequences)

Words (stop-words, stemming, lemmatization)

Phrases (word n-grams, proximity features)

Part-of-speech tags

Taxonomies / thesauri

Vector-space model

Language models

Full-parsing

Cross-modality

Collaborative tagging / Web2.0

Templates / Frames

Ontologies / First order theories

N-Gram

N-gram: a sub-sequence of n items from a given sequence.

The items can be characters, words or base pairs according to the application.

Unigram, bigram, trigram

Example: Google n-gram corpus, 4-grams with counts

serve as the incoming (92)
serve as the incubator (99)
serve as the independent (794)
serve as the index (223)
serve as the indication (72)
serve as the indicator (120)
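Extracting n-grams is a sliding window over the sequence. The helper below is a minimal sketch; the name `ngrams` is illustrative, and it works for any sliceable sequence, so the same function covers word and character n-grams.

```python
def ngrams(items, n):
    """All contiguous sub-sequences of length n, in order of appearance."""
    return [tuple(items[k:k + n]) for k in range(len(items) - n + 1)]

words = "serve as the index".split()
print(ngrams(words, 2))  # word bigrams
print(ngrams("web", 2))  # character bigrams
```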

Bag-of-Words Document Representation


Vector space model

Each document is represented as a vector.

Given a collection of documents D, let V = {t_1, t_2, ..., t_|V|} be the set of distinct words/terms in the collection. V is called the vocabulary.

A weight w_ij > 0 is associated with each term t_i of a document d_j. For a term that does not appear in document d_j, w_ij = 0.

d_j = (w_1j, w_2j, ..., w_|V|j)

TFIDF Weighting

TF (term frequency), IDF (inverse document frequency)

tf(w): term frequency (number of word occurrences in a document)

df(w): document frequency (number of documents containing the word)

N: number of all documents

tfidf(w): relative importance of the word in the document

$$\mathrm{tfidf}(w) = \mathrm{tf}(w) \cdot \log\left(\frac{N}{\mathrm{df}(w)}\right)$$
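The weighting formula can be applied to a toy collection as follows. This is a minimal sketch, assuming documents are already tokenized into word lists and using the natural logarithm (the slide does not fix the base); the function name `tfidf` and the three sample documents are illustrative.

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document weights tf(w) * log(N / df(w)) for tokenized documents."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))  # count each word once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights

docs = [["web", "mining", "web"], ["data", "mining"], ["web", "graph"]]
w = tfidf(docs)
print(w[0])  # "web" occurs twice in document 0 and appears in 2 of 3 documents
```

A word that appears in every document gets weight 0, since log(N/N) = 0.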

Similarity between document vectors

Each document is represented as a vector of weights D = <x>

Cosine similarity (normalized dot product) is the most widely used similarity measure between two document vectors

…calculates the cosine of the angle between document vectors

…efficient to calculate (sum of products of intersecting words)

…similarity value between 0 (different) and 1 (the same)

$$\mathrm{Sim}(D_1, D_2) = \frac{\sum_i x_{1i} x_{2i}}{\sqrt{\sum_j x_{1j}^2}\,\sqrt{\sum_k x_{2k}^2}}$$
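Cosine similarity over sparse document vectors only needs the words the two documents share, which is why it is cheap to compute. The sketch below assumes each document is a dictionary from term to weight; the name `cosine_similarity` and the zero-vector convention (return 0.0) are illustrative choices.

```python
import math

def cosine_similarity(d1, d2):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * d2[t] for t, w in d1.items() if t in d2)
    norm1 = math.sqrt(sum(w * w for w in d1.values()))
    norm2 = math.sqrt(sum(w * w for w in d2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # convention chosen here for empty vectors
    return dot / (norm1 * norm2)

doc_a = {"web": 1.0, "mining": 1.0}
doc_b = {"web": 1.0}
doc_c = {"graph": 2.0}
print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
```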

Web Mining

Web structure mining: Web graph structure, link analysis

Web text mining

Web usage mining: collaborative filtering

Web Usage Data

Web logs: low level. Track queries and individual pages/items requested by a Web browser

Application logs: higher level. Record when customers check in and check out, items placed in or removed from the shopping cart, etc.

Web Usage Mining

Association rule mining: discover associations between pages and products

Sequential pattern discovery: helps to discover visit patterns and make predictions about visit patterns

Clustering: group similar sessions into clusters, which may correspond to user profiles / modes of usage of the website

Collaborative filtering: filter/recommend pages and products based on similar users

4/9/2008 Data Mining: Principles and Algorithms 46

Collaborative Filtering: Motivation

User perspective: lots of web pages, online products, books, movies, etc. Reduce my choices… please…

Manager perspective: “If I have 3 million customers on the web, I should have 3 million stores on the web.” CEO of Amazon.com [SCH01]

Basic Approaches

Collaborative Filtering (CF)

Based on the active user's history

Based on other users' collective behavior

Content-based Filtering

Based on keywords and other features

Collaborative Filtering: A Framework

[Figure: user-item rating matrix. Rows are the users U = {u1, u2, …, ui, …, um}; columns are the items I = {i1, i2, …, ij, …, in}. A few cells hold known ratings (e.g., 3, 1.5, 2) and the rest are unknown, e.g., rij = ?]

Unknown function f: U × I → R

The task:

Q1: Find unknown ratings

Q2: Which items should we recommend to this user?

...

Collaborative Filtering: Main Methods

User-User Methods

Memory-based: K-NN

Model-based: Clustering

Item-Item Methods

Correlation Analysis

Linear Regression

Belief Network

Association Rule Mining

User-User method: Intuition

[Figure label: Target Customer]

Q1: How to measure similarity?

Q2: How to select neighbors?

Q3: How to combine?

How to Measure Similarity?

Users are vectors in product-dimension space

[Figure: users u_a and u_i as rating vectors over items i1 … in]

Pearson correlation coefficient

$$w_p(a, i) = \frac{\sum_{j \in C} (r_{aj} - \bar{r}_a)(r_{ij} - \bar{r}_i)}{\sqrt{\sum_{j \in C} (r_{aj} - \bar{r}_a)^2 \, \sum_{j \in C} (r_{ij} - \bar{r}_i)^2}}$$

where C is the set of commonly rated items

Cosine measure

$$w_c(a, i) = \frac{r_a \cdot r_i}{\|r_a\|_2 \, \|r_i\|_2}$$

Nearest Neighbor Approaches [SAR00a]

Offline phase: do nothing… just store transactions

Online phase: identify users highly similar to the active one

Best K ones

All with a measure greater than a threshold

Prediction

$$r_{aj} = \bar{r}_a + \frac{\sum_i w(a, i)\,(r_{ij} - \bar{r}_i)}{\sum_i w(a, i)}$$

Here $\bar{r}_a$ is user a's neutral rating, $(r_{ij} - \bar{r}_i)$ is user i's deviation, and the fraction is user a's estimated deviation
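The Pearson weight and the prediction rule can be combined into a small memory-based sketch. Several choices here are assumptions rather than details from the slides: each user is a dictionary from item to rating, a user's neutral rating is taken as the mean over all items that user rated, every other user who rated the item counts as a neighbor (no top-K or threshold selection), and the prediction falls back to the active user's mean when the weights sum to zero. Many implementations use absolute values of the weights in the denominator; the sketch follows the formula as written above.

```python
import math

def mean(ratings):
    """A user's neutral rating: mean over all items the user rated (assumption)."""
    return sum(ratings.values()) / len(ratings)

def pearson(ra, ri):
    """Pearson correlation of two users over their commonly rated items."""
    common = set(ra) & set(ri)
    if not common:
        return 0.0
    ma, mi = mean(ra), mean(ri)
    num = sum((ra[j] - ma) * (ri[j] - mi) for j in common)
    den = math.sqrt(sum((ra[j] - ma) ** 2 for j in common) *
                    sum((ri[j] - mi) ** 2 for j in common))
    return num / den if den else 0.0

def predict(active, others, item):
    """Predicted rating: active user's mean plus the weighted neighbor deviation."""
    num = den = 0.0
    for ri in others:
        if item not in ri:
            continue
        w = pearson(active, ri)
        num += w * (ri[item] - mean(ri))
        den += w
    return mean(active) if den == 0 else mean(active) + num / den

active = {"i1": 4, "i2": 2}
neighbor = {"i1": 4, "i2": 2, "i3": 6}
print(predict(active, [neighbor], "i3"))
```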

Clustering [BRE98]

Offline phase: build clusters (k-means, k-medoids, etc.)

Online phase: identify the nearest cluster to the active user

Prediction:

Use the center of the cluster

Weighted average between cluster members

Weights depend on the active user

Clustering vs. k-NN Approaches

K-NN using the Pearson measure is slower but more accurate

Clustering is more scalable

[Figure labels: Active user; Bad recommendations]

Reference: Link Analysis

Brin, S. and Page, L. The anatomy of a large-scale hypertextual Web search engine (PageRank). In Computer Networks and ISDN Systems, 1998

J. Kleinberg. Authoritative sources in a hyperlinked environment (HITS). In ACM-SIAM Symp. Discrete Algorithms, 1998

S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins, Mining the link structure of the World Wide Web. IEEE Computer’99

D. Cai, X. He, J. Wen, and W. Ma, Block-level Link Analysis. SIGIR'2004.


References: Collaborative Filtering

Charu C. Aggarwal, Joel L. Wolf, Kun-Lung Wu, Philip S. Yu: Horting Hatches an Egg: A New Graph-Theoretic Approach to Collaborative Filtering. KDD 1999: 201-212

J. Breese, D. Heckerman, C. Kadie: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In Proc. 14th Conf. Uncertainty in Artificial Intelligence, Madison, July 1998.

Yoon Ho Cho and Jae Kyeong Kim: Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 2003

William W. Cohen, Robert E. Schapire, and Yoram Singer. Learning to order things. In Advances in Neural Information Processing Systems 10, Denver, CO, 1997

Toshihiro Kamishima: Nantonac collaborative filtering: recommendation based on order responses. KDD 2003: 583-588

Lee, C.-H, Kim, Y.-H., Rhee, P.-K. Web personalization expert with combining collaborative filtering and association rule mining technique. Expert Systems with Applications, v 21, n 3, October, 2001, p 131-137

W. Lin, 2001P, online presentation available at: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/LinAlvarezRuiz_WebKDD2000.ppt

Weiyang Lin, Sergio A. Alvarez, and Carolina Ruiz. Efficient adaptive-support association rule mining for recommender systems. Data Mining and Knowledge Discovery, 6:83--105, 2002

G. Linden, B. Smith, and J. York, "Amazon.com Recommendations: Item-to-item collaborative filtering", IEEE Internet Computing, Vol. 7, No. 1, pp. 76-80, Jan. 2003.

Badrul M. Sarwar, George Karypis, Joseph A. Konstan, John Riedl: Analysis of recommendation algorithms for e-commerce. ACM Conf. Electronic Commerce 2000: 158-167

B. Sarwar, G. Karypis, J. Konstan, and J. Riedl: Application of dimensionality reduction in recommender systems--a case study. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.

B. M. Sarwar, G. Karypis, J. A. Konstan, and J. Riedl. Item-based collaborative filtering recommendation algorithms. WWW’01

B. Sarwar, 2000P, online presentation available at: http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/WEBKDD2000_ARCHIVE/badrul.ppt

J. Ben Schafer, Joseph A. Konstan, John Riedl: E-Commerce Recommendation Applications. Data Mining and Knowledge Discovery 5(1/2): 115-153, 2001

L.H. Ungar and D.P. Foster: Clustering Methods for Collaborative Filtering, AAAI Workshop on Recommendation Systems, 1998.

Yi-Fan Wang, Yu-Liang Chuang, Mei-Hua Hsu and Huan-Chao Keh: A personalized recommender system for the cosmetic business. Expert Systems with Applications, v 26, n 3, April, 2004, Pages 427-434

S. Vucetic and Z. Obradovic. A regression-based approach for scaling-up personalized recommender systems in e-commerce. In ACM WebKDD 2000 Web Mining for E-Commerce Workshop, 2000.

Kai Yu, Xiaowei Xu, Martin Ester, and Hans-Peter Kriegel: Selecting relevant instances for efficient accurate collaborative filtering. In Proceedings of the 10th CIKM, pages 239--246. ACM Press, 2001.

Cheng Zhai, Spring 2003 online course notes available at: http://sifaka.cs.uiuc.edu/course/2003-497CXZ/loc/cf.ppt