
Efficient Mining of Correlated Sequential Patterns Based on Null Hypothesis

Cindy Xide Lin†§, Ming Ji†, Marina Danilevsky†, Jiawei Han†

†Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, USA
§Twitter Inc, San Francisco, CA, USA

[email protected], {mingji1, danilev1, hanj}@illinois.edu

ABSTRACT

Frequent pattern mining has been a widely studied topic in the research area of data mining for more than a decade. However, pattern mining with real data sets is complicated: a huge number of co-occurrence patterns are usually generated, a majority of which are either redundant or uninformative. The true correlation relationships among data objects are buried deep among a large pile of useless information. To overcome this difficulty, mining correlations has been recognized as an important data mining task for its many advantages over mining frequent patterns.

In this paper, we formally propose and define the task of mining frequent correlated sequential patterns from a sequential database. With this aim in mind, we re-examine various interestingness measures to select the appropriate one(s), which can disclose succinct relationships of sequential patterns. We then propose PSBSpan, an efficient mining algorithm based on the framework of the pattern-growth methodology which mines frequent correlated sequential patterns. Our experimental study on real datasets shows that our algorithm has outstanding performance in terms of both efficiency and effectiveness.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—Data Mining

General Terms

Algorithms

Keywords

Frequent Pattern Mining, Correlated Pattern Mining

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Web-KR'12, October 29, 2012, Maui, Hawaii, USA.
Copyright 2012 ACM 978-1-4503-1711-5/12/10 ...$15.00.

1. INTRODUCTION

Frequent pattern mining has been a widely studied topic in data mining research. Common approaches include association rule mining [8], sequential pattern mining [22], graph pattern mining [30], etc. However, all pattern mining approaches have a hard time with real datasets. When the minimum support threshold is high, in general only obvious, common-sense ‘knowledge’ will be found, whereas when the minimum support is low, a huge number of patterns will usually be generated, most of which are redundant, uninformative or just random combinations of popular data objects [3]. The question of how to discover truly useful patterns [25, 10, 4, 33] that are buried deep among a large pile of useless information has recently attracted substantial attention from researchers.

Example 1.1. What makes a frequent pattern ‘interesting’? We crawled 13,409,424 Flickr photos containing geo-spatial information, and generated frequent patterns by treating the stream of photos uploaded by each user as one sequence. We discovered a huge number of patterns such as popular tourism trails, some of which revealed a clear picture of tourists’ interests (e.g., traveling from the center of San Francisco to the Pacific seashore, as shown in Figure 1(b)), while others are just combinations of popular locations (e.g., Figure 1(a)).

[Figure 1: Popular Tours at San Francisco. (a) Frequent Pattern: Bay Bridge, Downtown. (b) Interesting Pattern: Coit Tower, Transamerica Pyramid, Downtown, Union Square.]

In this paper, we study the problem of finding sequences that are both popular and correlated in an input sequential database. This task is widely applicable in a variety


S1  support vector machine           S2  graph support classification
S3  support vector machine           S4  graph theory
S5  support evidence                 S6  graph pattern mining
S7  machine learning                 S8  graph pattern mining
S9  spectral clustering algorithm    S10 sequence pattern mining
S11 spectral clustering algorithm    S12 novel association pattern mining
S13 spectral clustering method       S14 construction algorithm
S15 spectral clustering model        S16 EM algorithm

Table 1: The Example Sequence Database SDB

of settings, including popular event tracking [17, 18], research topic analysis [20], market basket problems [1], etc. Although the traditional association pattern mining problem is well defined and has been thoroughly studied over the last decade, there is currently no canonical way to measure the degree of the so-called correlation between sequential patterns. We believe that there should intuitively be more than one ‘correct’ solution to define this new type of pattern, especially among different scenarios. Let us start with a toy example to illustrate (i) the differences between an association and a correlated pattern, and (ii) how a reasonable correlation measure can effectively mine correlated sequential patterns.

Example 1.2. Suppose we have a mini database made up of 16 word phrases extracted from bibliographical records (see Table 1). Some phrases therein are research topics, e.g., ‘support vector machine’ and ‘machine learning’, while others are combinations of popular terms, e.g., ‘support machine’, ‘graph mining’ and ‘clustering algorithm’. All five of these patterns are association patterns because they appear together frequently, but only two are correlated, as they express specific, meaningful research topics, whereas the other three express either useless or overly broad meanings.

Although the answer to whether a sequential pattern is correlated is not an absolute ‘Yes’ or ‘No’, we at least expect it to match common knowledge, so that the phrase ‘support vector machine’ would be more correlated than ‘support vector’. Thus, under an appropriate measure of correlation, a long pattern should be allowed to be more correlated than its sub-patterns. Based on this observation, we re-examine a number of interestingness measures and make careful selections for our mining task in later sections.

In this paper, we propose a novel algorithm for mining frequent correlated sequential patterns, based on extensions of the pattern growth methodology [23]. Specifically, we make the following contributions:

1. Although there have been extensive studies on mining item pairs, correlated association rules, and recurrent sequence patterns [3, 16], to the best of our knowledge, this paper is the first work to formally propose and define the task of mining correlated sequential patterns.

2. We analyze the ‘good’ and ‘bad’ properties that a reasonable correlation measure should satisfy, and select measure(s) appropriate to our mining task.

3. We propose an efficient mining algorithm based on the pattern growth framework [23], and demonstrate the outstanding performance of our algorithm, in terms of both efficiency and effectiveness, on two real datasets.

2. PRELIMINARIES

In this section, we formally define the problem of mining correlated sequential patterns (Section 2.1), and theoretically and empirically analyze certain properties under this definition, in order to select appropriate association measure(s) for our mining task (Section 2.2).

2.1 Problem Formulation

Let E be a set of distinct items. A sequence S is an ordered list of items, denoted as S = e1 e2 · · · e|S|, where ei ∈ E is an item. For convenience, we refer to the i-th item ei as S[i]. An input sequence database is a set of sequences, denoted as SDB = {S1, S2, · · · , SN}.

Definition 2.1. (Subsequence and Super-sequence) For two sequences S = e1 e2 · · · e|S| and S′ = e′1 e′2 · · · e′|S′| (|S| ≤ |S′|), S is said to be a subsequence of S′, denoted by S ⊆ S′, if there exists a series of one-to-one mapping positions 1 ≤ p1 < p2 < · · · < p|S| ≤ |S′| s.t. S[i] = S′[pi] (i.e., ei = e′pi) for any i ∈ {1, 2, · · · , |S|}. In particular, for |S| < |S′|, we call S a proper (strict) subsequence of S′, denoted by S ⊂ S′. We may also say S′ is a super-sequence of S, or S′ contains S.

A pattern is also a sequence in the context of a sequential database. For two patterns P and P′, if P is a subsequence of P′, then P is said to be a sub-pattern of P′, and P′ is a super-pattern of P.

Definition 2.2. (Projected Database, Support, and Probability) For an input sequence database SDB and a sequence P, DB(P) is the set of sequences in SDB in which P appears as a subsequence. We define the support of P as Sup(P) = |DB(P)|, and therefore the probability of P as Pr(P) = Sup(P)/|SDB|. Any sequence in DB(P) is referred to as P’s supporting sequence, and DB(P) is called the projected database of SDB based on the sequence P.
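To make Definitions 2.1 and 2.2 concrete, the following is a minimal Python sketch (illustrative only, not the implementation used in the paper), modeling sequences as tuples of items:

from fractions import Fraction

def is_subsequence(s, t):
    # S ⊆ S' (Definition 2.1): the items of s appear in t in the same order.
    it = iter(t)
    return all(item in it for item in s)  # each membership test advances `it`

def projected_db(sdb, p):
    # DB(P): the supporting sequences of P in SDB (Definition 2.2).
    return [s for s in sdb if is_subsequence(p, s)]

sdb = [("support", "vector", "machine"), ("graph", "support", "classification"),
       ("support", "vector", "machine"), ("graph", "theory")]
p = ("support", "machine")
sup = len(projected_db(sdb, p))   # Sup(P) = |DB(P)| = 2
pr = Fraction(sup, len(sdb))      # Pr(P) = Sup(P)/|SDB| = 1/2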

Definition 2.3. (Cutting and Cutting Set) For a sequence S (|S| ≥ 2) and an ordered list of subsequences C = {c1, c2, · · · , c|C|} (|C| ≥ 2), we call C a cutting of S if every ci ∈ C is non-empty (i.e., |ci| > 0) and the concatenation of c1, c2, · · · , c|C| equals S. Specifically, C is a k-piece cutting if |C| = k. We denote the set of all k-piece cuttings of S as Ck(S), and furthermore C2..k(S) = C2(S) ∪ · · · ∪ Ck(S).
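Enumerating Ck(S) reduces to choosing k − 1 cut positions; a small illustrative sketch (the helper names are ours):

from itertools import combinations

def cuttings(s, k):
    # C_k(S): all ways to cut s into k non-empty consecutive pieces.
    for cuts in combinations(range(1, len(s)), k - 1):
        bounds = (0,) + cuts + (len(s),)
        yield [s[bounds[i]:bounds[i + 1]] for i in range(k)]

# C_{2..3}(('a','b','c')) yields a|bc, ab|c, and a|b|c.
for k in (2, 3):
    for c in cuttings(("a", "b", "c"), k):
        print(c)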

A number of interesting correlation measures defined on association patterns [1, 3] have been proposed and analyzed [29, 25], including χ2, lift, all-confidence, max-confidence, Kulczynski and cosine. Formally, a pattern P is said to be frequent if Sup(P) ≥ min_sup, and P is said to be correlated if its correlation score is no less than min_cor, where min_sup and min_cor are specified empirically by users. The task of mining frequent correlated sequential patterns is therefore to find the complete set of sequences in an input database which are both frequent and correlated.


2.2 Measure Selection

In this section, we analyze the ‘good’ and ‘bad’ properties a reasonable correlation measure should satisfy, and make careful selections for our mining task. Let us start by introducing the well-known Apriori property on support.

Lemma 1. (Monotonicity on Support) Given two patterns P and P′ in a sequence database SDB, if P′ is a super-pattern of P (i.e., P ⊆ P′), then Sup(P) ≥ Sup(P′).

Theorem 1. (Apriori Property on Support) If P is not frequent, none of its super-patterns P′ are frequent either. Or, equivalently, if P is frequent, all of its sub-patterns are frequent too.

This Apriori property on support can be well utilized to improve mining efficiency: if a sub-pattern P is not frequent, we can eliminate any further exploration of the super-patterns of P. However, an appropriate correlation measure may not satisfy the Apriori property on correlation score. Recall Example 1.2: the word phrase ‘support vector machine’ is more interesting than ‘support machine’, yet the former is a super-pattern of the latter. Indeed, it is often the case in real situations that a super-pattern is more correlated than its sub-pattern(s). We admit that the failure of the Apriori property on correlation brings computational challenges, but on the other hand it makes the mining results more effective and meaningful. Therefore, we re-examine various interestingness measures [2, 29, 14] in search of a measure we could use to mine correlated sequential patterns, and ultimately draw the motivation for our correlation measure from the null hypothesis in Ngram testing.

Tests of correlation between n random variables typically set up a null hypothesis that holds if the n random variables are independent. The n items in a correlated pattern that fail the test might then be considered to be related or dependent in some way, since they have failed to exhibit statistical independence. Formally speaking, for n items a1, a2, · · · , an that make up a correlated pattern a1 a2 · · · an, we would expect the probability of these items appearing together to be significantly larger than the product of the probabilities of each item appearing separately. Moreover, we extend the concept of an individual item to any arbitrary sub-sequence of a correlated pattern, and propose a correlation measure on sequential patterns as:

$$\mathrm{CorSeq}(P) = \frac{\Pr(P)}{\max_{C=\{c_i\} \in \mathcal{C}_{2..|P|}(P)} \left\{ \prod_i \Pr(c_i) \right\}} \qquad (1)$$

In the above formula, we express the combination of information units by ‘cutting’ (see Definition 2.3). The general idea is: a correlated sequential pattern P is expected to fail the null hypothesis for any arbitrary cutting that makes up P, i.e., the probability of a correlated pattern should be significantly larger than the product of the probabilities of the sub-patterns appearing in any of its cuttings. Hence, we use the ratio between the probability of P and the maximum, over all possible cuttings of P, of the joint probability of its sub-patterns appearing separately, as the correlation score of P.
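For reference, Equation (1) can be evaluated by brute force over all cuttings, as sketched below; this is exponential in |P|, and the PSBSpan algorithm of Section 3 exists precisely to avoid such a computation:

from itertools import combinations

def is_subsequence(s, t):
    it = iter(t)
    return all(e in it for e in s)

def pr(sdb, p):
    # Pr(P) per Definition 2.2.
    return sum(is_subsequence(p, s) for s in sdb) / len(sdb)

def cor_seq(sdb, p):
    # Equation (1): Pr(P) divided by the maximum, over all cuttings of P
    # into 2..|P| pieces, of the product of the pieces' probabilities.
    n = len(p)
    best = 0.0
    for k in range(2, n + 1):
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0,) + cuts + (n,)
            prod = 1.0
            for i in range(k):
                prod *= pr(sdb, p[bounds[i]:bounds[i + 1]])
            best = max(best, prod)
    return pr(sdb, p) / best if best > 0 else 0.0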

Theorem 2. The correlation measure defined by Equation (1) does not satisfy the Apriori property on correlation score, i.e., if P is correlated, its sub-patterns P′ are not necessarily correlated too.

Proof. We prove Theorem 2 by the counterexample given in Section 3: in Table 3, ‘support machine’ is not correlated although its super-pattern ‘support vector machine’ is.

3. AN EFFICIENT MINING ALGORITHM

In this section, we first introduce the classical PrefixSpan algorithm [23] (Section 3.1), based on which a three-stage mining method is developed to extract correlated sequential patterns (Section 3.2).

3.1 The PrefixSpan Algorithm

PrefixSpan [23] belongs to a series of pattern-growth methodologies [7, 9, 30]. The major idea is that instead of projecting sequence databases by considering all possible occurrences of frequent subsequences, the projection is based only on frequent prefixes, as any frequent subsequence can always be found by growing a frequent prefix. Generally speaking, sequential patterns can be mined by a prefix-projection method in three steps: (i) find all frequent length-1 patterns; (ii) divide the search space into projected databases based on prefixes; and (iii) mine each projected database recursively and output sequential patterns with prefixes added to the front.
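A compact illustrative sketch of prefix projection for sequences of single items (a simplification; the published PrefixSpan additionally uses pseudo-projection and other optimizations omitted here):

from collections import Counter

def prefix_span(projected, prefix, min_sup, out):
    # `projected` holds the suffixes that follow `prefix` in its supporting
    # sequences; an item's support here is the number of suffixes containing it.
    counts = Counter()
    for suffix in projected:
        counts.update(set(suffix))
    for item, sup in sorted(counts.items()):
        if sup < min_sup:
            continue
        pattern = prefix + (item,)
        out[pattern] = sup
        # Grow the prefix: keep whatever follows the first occurrence of `item`.
        prefix_span([s[s.index(item) + 1:] for s in projected if item in s],
                    pattern, min_sup, out)

out = {}
prefix_span([("support", "vector", "machine"), ("support", "vector", "machine"),
             ("support", "evidence"), ("machine", "learning")], (), 2, out)
# out == {('machine',): 3, ('support',): 3, ('support', 'vector'): 2, ...}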

However, there are challenges in computing the correlation score (Equation (1)) of a frequent pattern during the mining step of PrefixSpan, because we do not know the probability of any of its sub-patterns. The naïve solution would be to generate all frequent patterns first, create an in-memory index on these patterns, and re-examine all frequent patterns by calculating their correlation scores. However, such a solution is undesirable for several reasons:

1. First, to avoid missing useful patterns, the minimum support is usually set low, and the in-memory index of frequent patterns may exceed the available primary memory as a result. In fact, this very situation occurs in our experiments (Section 4).

2. Second, even if we are able to create an in-memory index for frequent patterns, the efficiency of the pattern mining algorithm will rely heavily on the efficiency of the hash function of the index (i.e., the cost of accessing one value), especially since the patterns themselves may have various formats and structures.

3. Third, when an algorithm is disk-based, it is easier to improve its efficiency by utilizing parallel computing [27].

3.2 The PSBSpan Mining Algorithm

Definition 3.1. (k-Piece Correlated Pattern and k-Piece Maximum Cutting Probability) A frequent pattern P is called a k-piece correlated pattern (k ≥ 2) if k-CorSeq(P) ≥ min_cor, where k-CorSeq(P) is derived from Equation (1) by setting a constraint on the cutting size. For convenience, we call the denominator of Equation (2) the k-piece maximum cutting probability.

$$k\text{-}\mathrm{CorSeq}(P) = \frac{\Pr(P)}{\max_{C=\{c_i\} \in \mathcal{C}_{2..k}(P)} \left\{ \prod_i \Pr(c_i) \right\}} \qquad (2)$$


association 0.06     vector 0.13        machine 0.19          learning 0.06
spectral 0.25        clustering 0.25    algorithm 0.25        method 0.06
model 0.06           mining 0.25        classification 0.06   graph 0.25
EM 0.06              theory 0.06        pattern 0.25          novel 0.06
sequence 0.06        support 0.25       construction 0.06     evidence 0.06

Table 2: The Vocabulary E for Example 3.1

Non-Single-Item Pattern          Support   PrefixSpan (Corpre)   SuffixSpan (Corpost)   Binding
graph mining                     2         2.0                   2.0
graph pattern                    2         2.0                   2.0
graph pattern mining             2         4.0 X                 2.0
pattern mining                   4         4.0 X                 4.0 X                  X
spectral clustering              4         4.0 X                 4.0 X                  X
spectral clustering algorithm    2         2.0                   4.0 X
support machine                  2         2.7                   2.7
support vector                   2         4.0 X                 4.0 X                  X
support vector machine           2         5.3 X                 4.0 X                  X
vector machine                   2         5.3 X                 5.3 X                  X

Table 3: The Sequential Patterns Extracted at Different Stages

Definition 3.2. (Prefix and Suffix Upper-bound) For a sequence P = e1 e2 · · · en (n ≥ 2), {e1 e2 · · · en−1, en} and {e1, e2 e3 · · · en} are two cuttings of P, based on which it is easy to prove that

$$\mathrm{CorSeq}_{pre}(P) = \frac{\Pr(P)}{\Pr(e_1 e_2 \cdots e_{n-1}) \Pr(e_n)} \quad \text{and} \quad \mathrm{CorSeq}_{post}(P) = \frac{\Pr(P)}{\Pr(e_2 e_3 \cdots e_n) \Pr(e_1)}$$

are two upper-bounds of CorSeq(P), called the prefix upper-bound and the suffix upper-bound, respectively.
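Both bounds follow directly from Equation (1): each of the two cuttings above belongs to C2..|P|(P), so the maximum in the denominator of Equation (1) is at least as large as that particular cutting’s product. For the prefix case,

$$\mathrm{CorSeq}(P) = \frac{\Pr(P)}{\max_{C=\{c_i\} \in \mathcal{C}_{2..|P|}(P)} \prod_i \Pr(c_i)} \;\le\; \frac{\Pr(P)}{\Pr(e_1 e_2 \cdots e_{n-1})\,\Pr(e_n)} = \mathrm{CorSeq}_{pre}(P),$$

and symmetrically for the suffix case. These bounds matter because, unlike CorSeq(P) itself, they can be evaluated from quantities already available during prefix (resp. suffix) growth.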

We now introduce a mining algorithm (called PSBSpan) that generates the complete set of 2-piece correlated patterns in three steps: the PrefixSpan step, the SuffixSpan step, and the Binding step, and then extend the results to correlated patterns of pieces of arbitrary size.

The PrefixSpan Step. This step is almost identical to the traditional PrefixSpan algorithm, with one additional calculation:

1. We first calculate the prefix upper-bound of each frequent pattern. A pattern is said to be a potentially correlated pattern if its prefix upper-bound is no less than the minimum correlation threshold min_cor. We output a frequent pattern only if (i) it is a potentially correlated pattern, or (ii) it is a prefix of a potentially correlated pattern. The intuition behind this is: (a) if a pattern is not a potentially correlated pattern, it is definitely not a correlated pattern, so it is unnecessary to consider it further; (b) the cost of pruning is low: by the time we have found a given frequent pattern, we must have previously accessed the probability of its immediate prefix, and the amount of space needed to store the probability of each prefix of the current pattern and of each single item is minuscule.

2. For each pattern we select to output using the above step, we output not only its probability, but also the probability of each of its proper prefixes. We will show the utility of outputting this additional information in the discussion of the Binding step.

Example 3.1. We illustrate the mining process (shown in Table 3) of PrefixSpan with a toy sequence database consisting of the abbreviated titles of 16 publications (listed in Table 1). The vocabulary with word probabilities is given in Table 2. The parameters are empirically set to min_sup = 2 and min_cor = 3.0.

The frequent patterns are listed in the first column of Table 3, along with their supports. The prefix upper-bounds of those frequent patterns are calculated in the second column. We can see that 10 non-single-item patterns are generated during the mining procedure, only 6 of which (marked with X) have prefix upper-bounds larger than the minimum correlation threshold of 3.0.
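For instance, for P = ‘support vector machine’, Table 2 gives Pr(‘machine’) = 0.19 and Pr(‘support’) = 0.25, and Table 3 gives Sup(P) = Sup(‘support vector’) = Sup(‘vector machine’) = 2, i.e., probability 2/16 each. Hence Corpre(P) = (2/16) / ((2/16) × 0.19) = 1/0.19 ≈ 5.3, and Corpost(P) = (2/16) / ((2/16) × 0.25) = 1/0.25 = 4.0, matching the corresponding entries in Table 3.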

Theorem 3. The output sequences of the PrefixSpan step are automatically sorted lexicographically.

Proof. Consider two sequential patterns S = e1 e2 · · · en and S′ = e′1 e′2 · · · e′m in the output; w.l.o.g., suppose S < S′, with ei = e′i for all i = 1, 2, · · · , j and ej+1 < e′j+1, i.e., S∗ = e1 e2 · · · ej is the longest common prefix of S and S′. The two sequences are projected into two different sub-databases DB(S∗ + ej+1) and DB(S∗ + e′j+1), respectively, at the function call PrefixSpan(DB(S∗), S∗, min_sup, min_cor). Since PrefixSpan generates all patterns in DB(S∗ + ej+1) before the patterns in DB(S∗ + e′j+1), S is output earlier than S′.

The SuffixSpan Step. This step is a mirrored version of the PrefixSpan step, but with an additional sorting step at the end: (i) project databases based on suffixes and generate patterns by concatenating suffixes to the end of patterns mined from projected databases; (ii) calculate the suffix upper-bound of each pattern, and output a frequent pattern if it is a potentially correlated pattern or if it is a suffix of a potentially correlated pattern, together with the probability of its suffixes; and (iii) sort the output sequences in lexicographic order by (disk-based) sorting methods [15].

Example 3.2. Following Example 3.1, the suffix upper-bounds are listed in the third column of Table 3 for each frequent pattern. 6 of the 10 frequent patterns (marked with X) have suffix upper-bounds larger than the minimum correlation threshold 3.0 and therefore remain as potentially correlated patterns.

The Binding Step. With the two sorted output lists generated from the two previous steps, we (i) find their overlapping set; (ii) do a final verification for each pattern in the overlapping set, to check whether it is a ‘true’ 2-piece correlated pattern according to Equation (2); and (iii) output a frequent pattern if it is a 2-piece correlated pattern or if it is a prefix of a 2-piece correlated pattern, along with the 2-piece maximum cutting probability of each of its prefixes.


The procedure is basically the merge step of merge sort, and runs in linear time [15]; the verification step is easy, since we already have the probability of any prefix or suffix of a potentially correlated pattern. For each 2-piece correlated pattern, we output its probability as well as the 2-piece maximum cutting probability of each of its proper prefixes.
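An illustrative sketch of the Binding step as a linear merge of the two lexicographically sorted streams (in-memory lists for brevity, whereas the real algorithm streams from disk; verify_2piece is a hypothetical stand-in for the exact check of Equation (2) using the stored prefix and suffix probabilities):

def bind(prefix_out, suffix_out, min_cor, verify_2piece):
    # prefix_out / suffix_out: lexicographically sorted lists of patterns that
    # survived the prefix-bound and suffix-bound pruning, respectively.
    result, i, j = [], 0, 0
    while i < len(prefix_out) and j < len(suffix_out):
        if prefix_out[i] < suffix_out[j]:
            i += 1
        elif suffix_out[j] < prefix_out[i]:
            j += 1
        else:
            p = prefix_out[i]
            # The pattern passed both upper-bound tests; run the exact
            # verification against Equation (2).
            if verify_2piece(p) >= min_cor:
                result.append(p)
            i += 1
            j += 1
    return result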

Example 3.3. After ‘binding’ the results generated in Examples 3.1 and 3.2, the ‘truly’ correlated patterns are marked with an X in the last column of Table 3.

Extending to k-Piece Correlated Patterns. To find correlated patterns of pieces of arbitrary size, the only problem we need to solve is how to extend k-piece correlated patterns to (k + 1)-piece ones. Let us start with two theorems:

Theorem 4. The 2-piece correlated patterns generated by PSBSpan are automatically sorted lexicographically.

Proof. Correlated patterns are generated at the Binding step only if they are contained in the result of the PrefixSpan step. Since the results generated by the PrefixSpan step are sorted lexicographically as per Theorem 3, and Binding does not change the order among patterns from the PrefixSpan step, the final results of PSBSpan are also automatically sorted lexicographically.

Theorem 5. For a k-piece correlated pattern P (|P| ≥ k), if we have the k-piece maximum cutting probability of each of its proper prefixes and the probability of each of its proper prefixes and suffixes, we can calculate the (k + 1)-piece correlation score (Equation (2)) and the (k + 1)-piece maximum cutting probability of P.

Proof. Write k-Mcp(·) for the k-piece maximum cutting probability, and let Q + c′ = P denote a split of P into a proper prefix Q and a non-empty final piece c′. We apply the following transformation:

$$(k+1)\text{-}\mathrm{Mcp}(P) = \max_{C=\{c_i\} \in \mathcal{C}_{2..k+1}(P)} \left\{ \prod_i \Pr(c_i) \right\} = \max_{Q,\,c'\,:\,Q+c'=P} \Big\{ \max\big( \Pr(Q),\; k\text{-}\mathrm{Mcp}(Q) \big) \cdot \Pr(c') \Big\},$$

where the Pr(Q) term accounts for the 2-piece cuttings {Q, c′}, and the k-Mcp(Q) term for the cuttings in which the prefix Q is itself cut into 2..k pieces (giving 3..(k + 1) pieces of P in total). Thus, given the k-piece maximum cutting probabilities of P’s proper prefixes and the probabilities of its proper prefixes and suffixes, it is easy to obtain the (k + 1)-piece maximum cutting probability and the (k + 1)-piece correlation score (Equation (2)) of P, and therefore to judge whether P is a (k + 1)-piece correlated pattern.
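The recurrence above translates into a small incremental update; an illustrative sketch (the dictionaries pr and mcp, holding Pr(·) for p and its proper prefixes and suffixes, and the k-piece maximum cutting probabilities of its proper prefixes, are hypothetical names of ours):

def next_mcp(p, pr, mcp):
    # (k+1)-piece maximum cutting probability of pattern p (a tuple),
    # computed from the k-piece values of its proper prefixes.
    best = 0.0
    for i in range(1, len(p)):            # every split p = prefix + last piece
        prefix, last = p[:i], p[i:]
        # The prefix may stay whole (a 2-piece cutting of p) or be cut
        # further into 2..k pieces (3..k+1 pieces of p in total).
        head = max(pr[prefix], mcp.get(prefix, 0.0))
        best = max(best, head * pr[last])
    return best
# p is then a (k+1)-piece correlated pattern iff pr[p] / next_mcp(p, pr, mcp)
# is no less than min_cor.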

4. EXPERIMENTS

In this section, we conduct experiments on two real datasets to show the performance of the PSBSpan algorithm. All the algorithms were implemented in Java (Eclipse Helios) and the experiments were performed on a Windows 7 server with Intel Core 2 Duo processors and 2GB of main memory.

DBLP Dataset. The Digital Bibliography and Library Project¹ is a web-accessible database of bibliographic information on computer science publications. In this paper, we use a collection of DBLP articles [26] released by the ArnetMiner group of Tsinghua University, which contains 1,632,442 publications and 1,741,170 researchers. We consider therein 32,224 papers published in prestigious conferences (e.g., SIGKDD, SIGIR, SIGMOD, VLDB, SDM, etc.) in the areas of databases, data mining, and machine learning.

Flickr Dataset. Flickr² is an image and video hosting website. In this paper, we use a collection of 13,409,424 Flickr photos supplied by Kodak Inc. Each photo is associated with a publishing user, a title, geographical information, a set of tags, etc. We consider therein the 13.7% of photos taken in 12 metropolitan cities famous among tourists.

4.1 Efficiency Evaluation

In this section, we evaluate the mining efficiency of our PSBSpan algorithm.

PrefixSpan+. The number of generated frequent patterns may reach several million, and creating indexes on these frequent patterns exceeds the capacity of our primary memory. To avoid using indexes, we implemented an alternative method, called PrefixSpan+, as a baseline to compare with our PSBSpan algorithm. Generally, PrefixSpan+ is the same approach as the traditional PrefixSpan algorithm [23], but with additional correlation testing during the pattern-growth-based mining process. However, since computing the correlation score (Equation (1)) of a sequence depends on the probability of every arbitrary sub-sequence, PrefixSpan+ has no choice but to go back to the original input database SDB and count the supports of these sub-sequences by scanning the whole database.

¹ www.informatik.uni-trier.de/~ley/db/
² www.flickr.com


Measure          Top Ranked Patterns

support          object oriented database; distributed database system; database management system; relational database system; object oriented system; data management system; object database system; association rule mining; support vector machine; oriented database system; data base system; time series data; object oriented database system; real time database

all-confidence   object oriented database; association rule mining; peer peer network; object oriented system; nearest neighbor search; nearest neighbor query; self organizing map; distributed database system; concurrency control database; database management system; relational database system; wireless sensor network; real time database; mining association rule

lift             support vector machine; nonnegative matrix factorization; reverse nearest neighbor; conditional random field; named entity recognition; nearest neighbor moving; latent dirichlet allocation; object oriented database; nearest neighbor uncertain; singular value decomposition; privacy preserving publishing; association rule mining; continuous nearest neighbor; nearest neighbor search

cor              nonnegative matrix factorization; singular value decomposition; conditional random field; named entity recognition; aqualogic data service platform; latent dirichlet allocation; association rule mining; join algorithm multiprocessor; optimized rule numeric attribute; inductive logic programming; reverse nearest neighbor; wireless data broadcast; privacy preserving data publishing; message table content index

Table 4: Case Study on DBLP


[Figure 2: PSBSpan vs. PrefixSpan+. Running time (in sec) (a) w.r.t. the size of the input database |SDB|; (b) w.r.t. the minimum support threshold min_sup; (c) w.r.t. the minimum correlation threshold min_cor.]

In Figure 2, we show the running time (in seconds) of PSBSpan and PrefixSpan+ while varying the size of the input database (Figure 2(a)), the minimum support threshold (Figure 2(b)), and the minimum correlation threshold (Figure 2(c)). We can see that in all cases, PSBSpan significantly outperforms PrefixSpan+. In fact, when the size of the database is larger than 25K or the minimum support threshold is lower than 0.01%, the running time of PrefixSpan+ becomes intolerable.

4.2 Case Study

In this experiment, we perform case studies on two real datasets, i.e., DBLP and Flickr, to show the effectiveness of our method.

4.2.1 Study on DBLP

In Table 4, we list the top-ranked sequential patterns (whose size is larger than two) according to four measures: support (see Definition 2.2), all-confidence [29], lift [14], and cor (Equation (1)). We can see that patterns with the highest support values are mostly random combinations of popular words: even though some phrases make sense as high-level concepts, e.g., ‘database system’, their useless duplicates may appear multiple times, such as ‘oriented database system’, ‘data base system’, and ‘object database system’.

As discussed in Section 2.2, the measure all-confidence satisfies the Apriori property, i.e., a super-pattern can never have a higher all-confidence score than its sub-patterns, which is obviously unreasonable in real situations. For instance, ‘object oriented database system’ is a meaningful phrase, but its useless sub-pattern ‘object oriented system’ is ranked higher by all-confidence (see the second row of Table 4).

The traditional measure lift, defined on itemsets, and the cor measure we propose for sequential patterns (Equation (1)) share the same intuition: the probability of a correlated pattern should be significantly larger than the joint probability of the ‘information units’ that make it up. The main difference is that the only information units considered by lift are the single items appearing in the pattern, while the concept of an information unit in cor is extended to any sub-pattern of the correlated pattern, so that the cor measure considers a more complete set of decompositions than the lift measure. We refer to this extended concept as a ‘cutting’ (Definition 2.3) and define the cor measure using cuttings in Equation (1).

For the reasons brought up previously, ‘nearest neighbor’ is an interesting information unit, but ‘nearest’ and ‘neighbor’ separately are not. Using the lift measure, ‘reverse nearest neighbor’, ‘nearest neighbor moving’, ‘nearest neighbor uncertain’, and ‘nearest neighbor search’ all appear as highly ranked patterns (see the third row of Table 4), but only ‘reverse nearest neighbor’ is considered a truly correlated pattern by the cor measure (see the last row of Table 4).

4.2.2 Study on Flickr

We select the Flickr photos taken in 12 popular metropolitan cities (as listed in [31]), and cluster them into 1,200 places of interest; by doing so, the photos uploaded by each user become a sequence of visited locations in the database.

For three selected cities in three different countries (London, San Francisco, and Paris), we draw the top correlated pattern of each city using Google Maps in Figures 3(a), 3(b) and 3(c), and compare it with each city’s top frequent pattern, shown in Figures 3(d), 3(e) and 3(f), respectively.

[Figure 3: Case Study on Flickr. (a) Tour at London by PSBSpan: British Museum, Somerset House, Trafalgar Square, St Bride’s Church. (b) Tour at San Francisco by PSBSpan: Coit Tower, Transamerica Pyramid, Downtown, Union Square. (c) Tour at Paris by PSBSpan: Palais Garnier, Musée du Louvre, Luxembourg Palace. (d) Tour at London by PrefixSpan: Millennium Bridge, Big Ben. (e) Tour at San Francisco by PrefixSpan: Bay Bridge, Downtown. (f) Tour at Paris by PrefixSpan: Musée du Louvre, Notre Dame Cathedral.]

For all three cities without exception, the top trail mined by PrefixSpan (i.e., the most frequent pattern) is a ‘random’ connection of two of the top three popular places of the city, e.g., the Louvre Museum with the Notre Dame Cathedral, and Big Ben with the Millennium Bridge. For these three trails, the traveling distance varies from 1.7 to 5.1 miles, and the walking time ranges from 13 minutes to 1 hour 4 minutes, which does not make for a pleasant tour.

In contrast, the top trails mined in each city by PSBSpan (i.e., the most correlated frequent pattern in each city) are highly consistent and localized in their geographical locations, and reveal reasonable tourist interests. For example, Downtown, Union Square, Transamerica Pyramid and Coit Tower is a sequence of locations leading from the bustling center of San Francisco to the beautiful shore of the Pacific Ocean.

5. RELATED WORK

PSBSpan is a novel algorithm for mining correlated sequential patterns in a sequential database. It inherits pattern growth and database projection from PrefixSpan [23] to accelerate mining, and utilizes the binding technique to improve the accessibility of information. To the best of our knowledge, this paper is the first work aiming at correlated sequential pattern mining. There are, however, several lines of related work.

Ngram Testing has been proposed in natural language processing and topic modeling to extract meaningful phrases via statistical tests on the co-occurrence of the words in a phrase. An Ngram is a sequence of N units, or tokens, of text, where those units are typically single characters or strings delimited by spaces [2, 32, 28]. Many methods have been proposed to determine whether an Ngram is a meaningful phrase [5, 2, 19] by testing the association or computing statistical measures such as mutual information [5] among the units. Some other approaches rely on hypothesis testing

techniques, such as null hypothesis testing [2], where the authors design a null hypothesis that holds if two random variables are independent of each other. Other tests, including the χ2 test and Student’s t-test [19], have also been employed for hypothesis testing.

A number of different Correlation Measures have been discussed extensively in the pattern mining literature. (i) χ2 [6, 16] is a measure adopted in correlation relationship mining; its definition follows the standard definition in statistics. (ii) Lift [16] is also a correlation measure computed from the support of the itemsets. (iii) All-confidence is a measure that can disclose correlation relationships among data objects; it has the nice null-invariance property and the downward closure property. (iv) Coherence [16], (v) Kulczynski, (vi) max-confidence and (vii) cosine [29] are four other good measures and are useful in discovering correlation information. (viii) Bond [21] is an interesting correlation measure that offers information about the conjunctive support of a pattern as well as its disjunctive and negative supports. (ix) Pearson’s correlation coefficient [24] is a nominal measure which can analyze the correlation relationships among nominal variables. Given the variety of measures proposed, Tan et al. [25] discuss how to select the right measure for a given application, showing that each measure is useful for some applications, but not for others. Wu et al. [29] re-examine the null-invariant measures and present a generalization of the measures in one mathematical framework, which helps us understand and select the proper measure for different applications.

Substantial research efforts have been devoted to mining Correlated Itemsets. Based on extensions of a pattern-growth methodology, efficient algorithms [16] have been proposed to mine the correlation relationships among patterns. Ke et al. [11] mine correlations from quantitative databases efficiently by utilizing normalized mutual information and all-confidence to perform a two-level pruning; they show that mining correlations is more effective than mining associations. Besides, in the graph pattern mining literature, there is also some work on mining correlated and representative graph patterns [4, 13, 12].



6. CONCLUSION

To the best of our knowledge, this paper is the first study on mining frequent correlated sequential patterns from a sequential database. To formally define the problem, we analyze ‘good’ and ‘bad’ properties for the selection of correlation measures. We point out that forcing the correlation score to satisfy the Apriori property is harmful to the effectiveness of the mining result, and use this analysis to carefully select the appropriate measures for calculating a correlation score.

Moreover, we develop an efficient three-stage mining method, Prefix-Suffix-Binding Span (PSBSpan), based on an extension of the pattern growth methodology. Experimental studies on real datasets reveal that our mining method is able to discover ‘truly’ succinct and interesting patterns, while remaining efficient for large-scale datasets.

7. REFERENCES

[1] R. Agrawal, T. Imielinski, and A. N. Swami. Mining association rules between sets of items in large databases. In SIGMOD, 1993.

[2] S. Banerjee and T. Pedersen. The design, implementation, and use of the ngram statistics package. In CICLing, 2003.

[3] S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. In SIGMOD, 1997.

[4] C. Chen, C. X. Lin, X. Yan, and J. Han. On effective presentation of graph patterns: a structural representative approach. In CIKM, 2008.

[5] K. W. Church and P. Hanks. Word association norms, mutual information, and lexicography. Comput. Linguist., 1990.

[6] G. Corder and D. Foreman. Nonparametric Statistics for Non-Statisticians: A Step-by-Step Approach. Wiley, 2009.

[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd edition, 2006.

[8] J. Han and J. Pei. Mining frequent patterns by pattern-growth: Methodology and implications. 2000.

[9] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, and M. Hsu. FreeSpan: frequent pattern-projected sequential pattern mining. In KDD, 2000.

[10] M. A. Hasan, V. Chaoji, S. Salem, J. Besson, and M. J. Zaki. Origami: Mining representative orthogonal graph patterns. In ICDM, 2007.

[11] Y. Ke, J. Cheng, and W. Ng. Mining quantitative correlated patterns using an information-theoretic approach. In KDD, 2006.

[12] Y. Ke, J. Cheng, and J. X. Yu. Efficient discovery of frequent correlated subgraph pairs. In ICDM, 2009.

[13] Y. Ke, J. Cheng, and J. X. Yu. Top-k correlative graph mining. In SDM, 2009.

[14] S. Kim, M. Barsky, and J. Han. Efficient mining of top correlated patterns based on null-invariant measures. In PKDD, 2011.

[15] D. E. Knuth. The Art of Computer Programming: Sorting and Searching. Addison-Wesley, 1968.

[16] Y.-K. Lee, W.-Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient mining of correlated patterns. In ICDM, 2003.

[17] C. X. Lin, Q. Mei, J. Han, Y. Jiang, and M. Danilevsky. The joint inference of topic diffusion and evolution in social communities. In ICDM, 2011.

[18] C. X. Lin, B. Zhao, Q. Mei, and J. Han. PET: a statistical model for popular events tracking in social communities. In KDD, 2010.

[19] C. D. Manning and H. Schütze. Foundations of Statistical Natural Language Processing. MIT Press, 1999.

[20] Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In KDD, 2007.

[21] E. Omiecinski. Alternative interest measures for mining associations in databases. IEEE Trans. Knowl. Data Eng., 2003.

[22] J. Pei and J. Han. Constrained frequent pattern mining: a pattern-growth view. In KDD, 2002.

[23] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Trans. Knowledge and Data Engineering, 2004.

[24] J. L. Rodgers and W. A. Nicewander. Thirteen ways to look at the correlation coefficient. The American Statistician, 1988.

[25] P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the right objective measure for association analysis. In KDD, 2002.

[26] J. Tang, J. Zhang, L. Yao, J. Li, L. Zhang, and Z. Su. ArnetMiner: extraction and mining of academic social networks. In KDD, 2008.

[27] C. Wang, W. Wang, J. Pei, Y. Zhu, and B. Shi. Scalable mining of large disk-based graph databases. In KDD, 2004.

[28] G. I. Webb. Self-sufficient itemsets: An approach to screening potentially interesting associations between items. TKDD, 4(1), 2010.

[29] T. Wu, Y. Chen, and J. Han. Re-examination of interestingness measures in pattern mining: A unified framework. Data Mining and Knowledge Discovery, 2010.

[30] X. Yan and J. Han. gSpan: Graph-based substructure pattern mining. In ICDM, 2002.

[31] Z. Yin, L. Cao, J. Han, J. Luo, and T. S. Huang. Diversified trajectory pattern ranking in geo-tagged social media. In SDM, 2011.

[32] J. Zhang, B. Jiang, M. Li, J. Tromp, X. Zhang, and M. Q. Zhang. Computing exact p-values for DNA motifs. Bioinformatics, 23(5):531–537, 2007.

[33] S. Zhang, J. Yang, and S. Li. RING: An integrated method for frequent representative subgraph mining. In ICDM, 2009.