Mining Frequent Patterns II: Mining Sequential & Navigational Patterns Bamshad Mobasher DePaul University


Page 1

Mining Frequent Patterns II: Mining Sequential & Navigational Patterns
Bamshad Mobasher, DePaul University

Page 2

Sequential Pattern Mining

- Association rule mining does not consider the order of transactions.
- In many applications such orderings are significant. For example:
  - In market basket analysis, it is interesting to know whether people buy some items in sequence, e.g., buying a bed first and then bed sheets some time later.
  - In Web usage mining, it is useful to find the navigational patterns of users in a Web site from their sequences of page visits.

Page 3

Sequential Patterns: Extending Frequent Itemsets

- Sequential patterns add an extra dimension to frequent itemsets and association rules: time.
  - Items can appear before, after, or at the same time as each other.
  - General form: "x% of the time, when A appears in a transaction, B appears within z transactions."
    - Note that other items may appear between A and B, so sequential patterns do not necessarily imply consecutive appearances of items (in terms of time).
- Examples:
  - Renting "Star Wars", then "Empire Strikes Back", then "Return of the Jedi", in that order
  - A collection of ordered events within an interval
  - Most sequential pattern discovery algorithms are based on extensions of the Apriori algorithm for discovering itemsets.
- Navigational patterns:
  - These can be viewed as a special form of sequential patterns that capture the navigational behavior of the users of a site.
  - In this case a session is a consecutive sequence of pageview references for a user over a specified period of time.

Page 4

Objective

- Given a set S of input data sequences (or a sequence database), the problem of mining sequential patterns is to find all the sequences that have a user-specified minimum support.
- Each such sequence is called a frequent sequence, or a sequential pattern.
- The support for a sequence is the fraction of the total data sequences in S that contain this sequence.

Page 5

Sequence Databases

- A sequence database consists of an ordered list of elements or events.
  - Each element can be a set of items or a single item (a singleton set).
- Transaction databases vs. sequence databases:

  A sequence database           A transaction database
  SID  sequence                 TID  itemset
  10   <a(abc)(ac)d(cf)>        10   a, b, d
  20   <(ad)c(bc)(ae)>          20   a, c, d
  30   <(ef)(ab)(df)cb>         30   a, d, e
  40   <eg(af)cbc>              40   b, e, f

  Elements in (…) are sets.

Page 6

Subsequence vs. Supersequence

- A sequence is an ordered list of events, denoted <e1 e2 … el>.
- Given two sequences α = <a1 a2 … an> and β = <b1 b2 … bm>:
- α is called a subsequence of β, denoted α ⊆ β, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.
- Examples:
  - <(ab), d> is a subsequence of <(abc), (de)>
  - <3, (4, 5), 8> is contained in (is a subsequence of) <6, (3, 7), 9, (4, 5, 8), (3, 8)>
  - <a.html, c.html, f.html> ⊆ <a.html, b.html, c.html, d.html, e.html, f.html, g.html>

Page 7

What Is Sequential Pattern Mining?

- Given a set of sequences and a support threshold, find the complete set of frequent subsequences.

  A sequence database
  SID  sequence
  10   <a(abc)(ac)d(cf)>
  20   <(ad)c(bc)(ae)>
  30   <(ef)(ab)(df)cb>
  40   <eg(af)cbc>

- In a sequence such as <(ef)(ab)(df)cb>, an element may contain a set of items; items within an element are unordered, and we list them alphabetically.
- <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)>.
- Given support threshold min_sup = 2, <(ab)c> is a sequential pattern.
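The containment test and support count above can be sketched in Python. This is a minimal illustration, not code from the deck: `parse`, `is_subsequence`, and `support` are hypothetical helper names, elements are modeled as frozensets, and a greedy left-to-right scan suffices for the subsequence test.

```python
def parse(seq):
    """Parse slide notation like 'a(abc)(ac)d(cf)' into a list of item sets."""
    elements, i = [], 0
    while i < len(seq):
        if seq[i] == "(":
            j = seq.index(")", i)          # closing paren of this element
            elements.append(frozenset(seq[i + 1:j]))
            i = j + 1
        else:
            elements.append(frozenset(seq[i]))  # singleton element
            i += 1
    return elements

def is_subsequence(alpha, beta):
    """True if alpha = <a1 ... an> is contained in beta: there exist increasing
    positions j1 < ... < jn with each ak a subset of b_jk (greedy scan)."""
    j = 0
    for a in alpha:
        while j < len(beta) and not a <= beta[j]:
            j += 1
        if j == len(beta):
            return False
        j += 1
    return True

def support(pattern, database):
    """Number of data sequences in the database containing the pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)

# The sequence database from the slide
db = [parse(s) for s in ["a(abc)(ac)d(cf)", "(ad)c(bc)(ae)",
                         "(ef)(ab)(df)cb", "eg(af)cbc"]]
```

With this sketch, `support(parse("(ab)c"), db)` counts the sequences with SIDs 10 and 30, so <(ab)c> meets min_sup = 2, as on the slide.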

Page 8

Another Example

[Table: transactions sorted by customer ID]

Page 9

Example (continued)

[Tables: sequences produced from the transactions, and the final sequential patterns]

Page 10

GSP Mining Algorithm

- Very similar to the Apriori algorithm
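To make the Apriori analogy concrete, here is a much-simplified sketch of the GSP level-wise scheme: generate candidate k-sequences by joining frequent (k-1)-sequences, prune candidates with an infrequent subsequence, then count support with a database scan. This illustrative version handles only sequences of single items (no itemset elements, no time constraints), and it is exercised on the A1..A6 session trails that appear later in the deck.

```python
from collections import Counter

def is_subseq(pattern, session):
    """True if pattern occurs in session in order (not necessarily contiguously)."""
    it = iter(session)
    return all(item in it for item in pattern)

def gsp(sessions, min_count):
    """Simplified GSP for sequences of single items."""
    # L1: frequent 1-sequences
    counts = Counter()
    for s in sessions:
        for item in set(s):
            counts[(item,)] += 1
    freq = {p: c for p, c in counts.items() if c >= min_count}
    result, k = dict(freq), 2
    while freq:
        prev = set(freq)
        # Join step: extend p with the last item of q when p[1:] == q[:-1]
        candidates = {p + (q[-1],) for p in prev for q in prev if p[1:] == q[:-1]}
        # Prune step: every (k-1)-subsequence of a candidate must be frequent
        candidates = {c for c in candidates
                      if all(c[:i] + c[i + 1:] in prev for i in range(k))}
        # Support counting scan
        counts = Counter()
        for c in candidates:
            for s in sessions:
                if is_subseq(c, s):
                    counts[c] += 1
        freq = {p: n for p, n in counts.items() if n >= min_count}
        result.update(freq)
        k += 1
    return result

sessions = [
    ["A1", "A2", "A3"], ["A1", "A2", "A3"], ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"], ["A5", "A2", "A4", "A6"], ["A5", "A2", "A3", "A6"],
]
patterns = gsp(sessions, min_count=3)
```

On these sessions the only frequent 3-sequence is <A1, A2, A3> (support 3): candidates such as <A5, A2, A4> are pruned in the join/prune phase because <A5, A4> is not frequent.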

Page 11

Sequential Pattern Mining Algorithms

- Apriori-based method: GSP (Generalized Sequential Patterns; Srikant & Agrawal, 1996)
- Pattern-growth methods: FreeSpan and PrefixSpan (Han et al., 2000; Pei et al., 2001)
- Vertical format-based mining: SPADE (Zaki, 2000)
- Constraint-based sequential pattern mining: SPIRIT (Garofalakis et al., 1999; Pei et al., 2002)
- Mining closed sequential patterns: CloSpan (Yan, Han & Afshar, 2003)

From: J. Han and M. Kamber, Data Mining: Concepts and Techniques, www.cs.uiuc.edu/~hanji

Page 12

Mining Navigation Patterns

- Each session induces a user trail through the site.
- A trail is a sequence of web pages followed by a user during a session, ordered by time of access.
- A sequential pattern in this context is a frequent trail.
- Sequential pattern mining can help identify common navigational sequences, which in turn helps in understanding common user behavioral patterns.
- If the goal is to make predictions about future user actions based on past behavior, approaches such as Markov models (e.g., Markov chains) can be used.

Page 13

Mining Navigational Patterns: Markov Chains

- Another approach: model the navigational sequences through the site as a state-transition diagram without cycles (a directed acyclic graph).
- A Markov chain consists of a set of states (pages or pageviews in the site)

    S = {s1, s2, …, sn}

  and a set of transition probabilities

    P = {p1,1, …, p1,n, p2,1, …, p2,n, …, pn,1, …, pn,n}

- A path r from a state si to a state sj is a sequence of states in which the transition probabilities between all consecutive states are greater than 0.
- The probability of reaching a state sj from a state si via a path r is the product of all the transition probabilities along the path.
- The probability of reaching sj from si is the sum of these path probabilities over all paths from si to sj.

Page 14

Constructing a Markov Chain from Web Navigational Data

- Add a unique start state.
  - The start state has a transition to the first page of each session (representing the start of a session).
  - Alternatively, it could have a transition to every state, assuming that every page can potentially be the start of a session.
- Add a unique final state.
  - The last page in each trail has a transition to the final state (representing the end of the session).
- The transition probabilities are obtained by counting click-throughs.
- The Markov chain built this way is called absorbing, since we always end up in the final state.

Page 15

A Hypothetical Markov Chain

[Figure: an example Markov chain]

- What is the probability that a user who visits the Home page purchases a product?
  - Home -> Search -> PD -> $ = 1/3 * 1/2 * 1/2 = 1/12 = 0.083
  - Home -> Cat -> PD -> $ = 1/3 * 1/3 * 1/2 = 1/18 = 0.056
  - Home -> Cat -> $ = 1/3 * 1/3 = 1/9 = 0.111
  - Home -> RS -> PD -> $ = 1/3 * 2/3 * 1/2 = 1/9 = 0.111
  - Sum = 0.361
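The path enumeration above can be checked mechanically. The sketch below encodes the transition probabilities read off the slide's figure (state labels as on the slide) and sums path probabilities by recursing over the acyclic chain; `reach_probability` is an illustrative name, not from the deck.

```python
from fractions import Fraction

# Transition probabilities from the slide's hypothetical chain
chain = {
    "Home":   {"Search": Fraction(1, 3), "Cat": Fraction(1, 3), "RS": Fraction(1, 3)},
    "Search": {"PD": Fraction(1, 2)},
    "Cat":    {"PD": Fraction(1, 3), "$": Fraction(1, 3)},
    "RS":     {"PD": Fraction(2, 3)},
    "PD":     {"$": Fraction(1, 2)},
    "$":      {},
}

def reach_probability(chain, src, dst):
    """Sum of path probabilities over all paths from src to dst.
    The recursion terminates because the chain is acyclic."""
    if src == dst:
        return Fraction(1)
    return sum((p * reach_probability(chain, nxt, dst)
                for nxt, p in chain.get(src, {}).items()), Fraction(0))

prob = reach_probability(chain, "Home", "$")
```

Exact arithmetic gives 1/12 + 1/18 + 1/9 + 1/9 = 13/36 ≈ 0.361, matching the slide's sum.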

Page 16

Markov Chain Example

[Figure: Web site hyperlink graph over pages A, B, C, D, E]

Sessions:
  A, B
  A, B
  A, B, C
  A, B, C
  A, B, C, D
  A, B, C, E
  A, C, E
  A, C, E
  A, B, D
  A, B, D
  A, B, D, E
  B, C
  B, C
  B, C, D
  B, C, E
  B, D, E

Calculating conditional probabilities for transitions, e.g., the transition B -> C:
  Total occurrences of B: 14
  Total occurrences of B -> C: 8
  Pr(C|B) = 8/14 = 0.57
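The click-through counting on this slide is mechanical enough to sketch. Assuming the Start and Final states described on page 14, the code below estimates every transition probability as a conditional frequency; `transition_probs` is an illustrative name.

```python
from collections import Counter

# The sixteen sessions from the slide
sessions = [
    ["A", "B"], ["A", "B"], ["A", "B", "C"], ["A", "B", "C"],
    ["A", "B", "C", "D"], ["A", "B", "C", "E"], ["A", "C", "E"], ["A", "C", "E"],
    ["A", "B", "D"], ["A", "B", "D"], ["A", "B", "D", "E"],
    ["B", "C"], ["B", "C"], ["B", "C", "D"], ["B", "C", "E"], ["B", "D", "E"],
]

def transition_probs(sessions):
    """Estimate Pr(next | current) by counting click-throughs,
    with an artificial Start before and Final after each session."""
    pair_counts, state_counts = Counter(), Counter()
    for s in sessions:
        trail = ["Start"] + s + ["Final"]
        for cur, nxt in zip(trail, trail[1:]):
            pair_counts[(cur, nxt)] += 1
            state_counts[cur] += 1
    return {(cur, nxt): n / state_counts[cur]
            for (cur, nxt), n in pair_counts.items()}

probs = transition_probs(sessions)
```

This reproduces Pr(C|B) = 8/14 ≈ 0.57 from this slide, as well as the Start -> A (11/16 ≈ 0.69) and A -> B (9/11 ≈ 0.82) edges that appear on the full chain of the next slide.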

Page 17

Markov Chain Example (cont.)

The full Markov chain (with Start and Final states) built from the sessions above:

  Start -> A: 0.69    Start -> B: 0.31
  A -> B: 0.82        A -> C: 0.18
  B -> C: 0.57        B -> D: 0.21        B -> Final: 0.14
  C -> D: 0.20        C -> E: 0.40        C -> Final: 0.40
  D -> E: 0.33        D -> Final: 0.67
  E -> Final: 1.00

Probability that someone will visit page C? Paths S-B-C, S-A-C, and S-A-B-C:
  (0.31 * 0.57) + (0.69 * 0.18) + (0.69 * 0.82 * 0.57) = 0.62

Probability that someone who has visited B will visit E? Paths B-D-E, B-C-E, and B-C-D-E:
  (0.21 * 0.33) + (0.57 * 0.40) + (0.57 * 0.20 * 0.33) = 0.335

Probability that someone visiting page C will leave the site? 0.40 = 40%

Page 18

Mining Frequent Trails Using Markov Chains

- Support s in [0, 1): accept only trails whose initial probability is above s.
- Confidence c in [0, 1): accept only trails whose trail probability is above c.
  - Recall: the probability of a trail is obtained by multiplying the transition probabilities of the links in the trail.
- Mining for patterns:
  - Find all trails whose initial probability is higher than s and whose trail probability is above c.
  - Use depth-first search on the Markov chain to compute the trails.
  - The average time needed to find the frequent trails is proportional to the number of web pages in the site.


Page 19

Markov Chains: Another Example

  ID  Session Trail
  1   A1 > A2 > A3
  2   A1 > A2 > A3
  3   A1 > A2 > A3 > A4
  4   A5 > A2 > A4
  5   A5 > A2 > A4 > A6
  6   A5 > A2 > A3 > A6

Page 20

Frequent Trails From Example (Support = 0.1, Confidence = 0.3)

  Trail         Probability
  A1 > A2 > A3  0.67
  A5 > A2 > A3  0.67
  A2 > A3       0.67
  A1 > A2 > A4  0.33
  A5 > A2 > A4  0.33
  A2 > A4       0.33
  A4 > A6       0.33

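These tables can be reproduced by a depth-first search over the chain built from the page 19 sessions. One assumption made here, since the slides do not state it: the tables list only maximal trails, so the sketch below records a trail only when no extension keeps its probability above c. All names (`reach`, `frequent_trails`, etc.) are illustrative.

```python
from collections import Counter

# Session trails from the page 19 example
sessions = [
    ["A1", "A2", "A3"], ["A1", "A2", "A3"], ["A1", "A2", "A3", "A4"],
    ["A5", "A2", "A4"], ["A5", "A2", "A4", "A6"], ["A5", "A2", "A3", "A6"],
]

# Build the absorbing chain by counting click-throughs with Start/Final added
pair, total = Counter(), Counter()
for sess in sessions:
    trail = ["Start"] + sess + ["Final"]
    for cur, nxt in zip(trail, trail[1:]):
        pair[(cur, nxt)] += 1
        total[cur] += 1
succ = {}
for (cur, nxt), n in pair.items():
    succ.setdefault(cur, []).append((nxt, n / total[cur]))

def reach(state, target):
    """Sum of path probabilities from state to target (the chain is acyclic)."""
    if state == target:
        return 1.0
    return sum(p * reach(nxt, target) for nxt, p in succ.get(state, []))

def frequent_trails(s, c):
    """DFS for maximal trails: first page reachable with probability > s,
    trail probability (product of transition probabilities) > c."""
    results = {}
    def dfs(state, trail, p):
        ext = [(nxt, q) for nxt, q in succ.get(state, [])
               if nxt != "Final" and p * q > c]
        if not ext and len(trail) > 1:          # maximal trail of length >= 2
            results[" > ".join(trail)] = round(p, 2)
        for nxt, q in ext:
            dfs(nxt, trail + [nxt], p * q)
    for page in total:
        if page != "Start" and reach("Start", page) > s:
            dfs(page, [page], 1.0)
    return results

trails = frequent_trails(s=0.1, c=0.3)
```

Under these assumptions the search returns exactly the seven trails in the confidence-0.3 table, and raising c to 0.5 leaves the three trails of the next slide.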

Page 21

Frequent Trails From Example (Support = 0.1, Confidence = 0.5)

  Trail         Probability
  A1 > A2 > A3  0.67
  A5 > A2 > A3  0.67
  A2 > A3       0.67

Page 22

Efficient Management of Navigational Trails

- Approach: store sessions in an aggregated sequence tree.
  - Initially introduced in the Web Utilization Miner (WUM) (Spiliopoulou, 1998).
  - For each occurrence of a sequence, start a new branch or increase the frequency counts of matching nodes.
  - In the example below, note that s6 contains "b" twice, hence the sequence is <(b,1), (d,1), (b,2), (e,1)>.

Page 23

Mining Navigational Patterns

The aggregated sequence tree can be used directly to determine support and confidence for navigational patterns:

  Support = count at the node / count at the root
  Confidence = count at the node / count at the parent

  Navigation pattern a -> b:           Support = 11/35 = 0.31   Confidence = 11/21 = 0.52
  Navigation pattern a -> b -> e:      Support = 11/35 = 0.31   Confidence = 11/11 = 1.00
  Navigation pattern a -> b -> e -> f: Support = 3/35 = 0.086   Confidence = 3/11 = 0.27

Note that each node represents a navigational path ending at that node.
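A minimal sketch of such a tree is a count trie. The sessions below are hypothetical toy data (the slide's actual session data behind the 35-count example is not shown), and unlike WUM's aggregated tree this simplified version does not track repeated-occurrence numbers such as (b,2).

```python
class Node:
    """One node of a simplified aggregated sequence tree (a count trie)."""
    def __init__(self):
        self.count = 0
        self.children = {}

def build_tree(sessions):
    """Aggregate sessions: each session increments counts along its path,
    creating new branches where no matching node exists."""
    root = Node()
    for sess in sessions:
        root.count += 1            # root count = number of sessions
        node = root
        for page in sess:
            node = node.children.setdefault(page, Node())
            node.count += 1
    return root

def stats(root, pattern):
    """Support and confidence of a navigation pattern, read off the tree:
    support = node count / root count, confidence = node count / parent count."""
    node, parent = root, None
    for page in pattern:
        parent, node = node, node.children.get(page)
        if node is None:
            return 0.0, 0.0
    return node.count / root.count, node.count / parent.count

# Hypothetical sessions for illustration
sessions = [["a", "b", "e"], ["a", "b", "e", "f"], ["a", "c"], ["b", "d"]]
root = build_tree(sessions)
```

With these toy sessions, the pattern a -> b has count 2 at its node, 3 at its parent "a", and 4 at the root, so support = 2/4 and confidence = 2/3, computed directly from the tree exactly as the slide describes.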