an efficient algorithm for enumerating closed patterns in transaction databases takeaki uno tatsuya...

21
An Efficient Algorithm for An Efficient Algorithm for Enumerating Closed Patterns in Enumerating Closed Patterns in Transaction Databases Transaction Databases Takeaki Uno Takeaki Uno Tatsuya Asai Tatsuya Asai Yuzo Uchida Yuzo Uchida Hiroki Hiroki Arimura Arimura National Institute of Informatics, JAPAN Kyushu University, JAPAN Kyushu University, JAPAN Hokkaido University, JAPAN 4/Oct/2004 Discovery Science

Upload: morgan-parks

Post on 31-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

An Efficient Algorithm for Enumerating An Efficient Algorithm for Enumerating Closed Patterns in Transaction DatabasesClosed Patterns in Transaction Databases

Takeaki UnoTakeaki Uno

Tatsuya AsaiTatsuya Asai

Yuzo UchidaYuzo Uchida

Hiroki ArimuraHiroki Arimura

National Institute of Informatics, JAPAN

Kyushu University, JAPAN

Kyushu University, JAPAN

Hokkaido University, JAPAN

4/Oct/2004 Discovery Science

Page 2: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Transaction DatabaseTransaction Database

・・ Transaction database Transaction database TT ::

a database composed of transactionstransactions defined on itemset itemset II i.e.,   TT , t ∈TT , t ⊆II

-- basket data

-- links of web pages

-- words in documents

・・  A subset of I I is called a patternpattern

Real world data is often large and sparse

1,2,5,6,72,3,4,51,2,7,8,91,7,92,7,92

TT ==

Page 3: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Occurrences of PatternOccurrences of Pattern

・・  For a pattern P,

occurrenceoccurrence of P ::     a transaction in TT including P

denotationdenotation of P :: set of occurrences of P

・・  The size of denotation is called frequencyfrequency of P

1,2,5,6,7,92,3,4,51,2,7,8,91,7,92,7,92

TT ==   denotation of {1,2}==  { {1,2,5,6,7,9}, {1,2,7,8,9} }

Page 4: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Frequent PatternFrequent Pattern

・・  Given a minimum supportminimum support θ,

Frequent patternFrequent pattern: a pattern s.t. (frequency) ≧≧ θ

(a subset of items, which is included in at least θ transactions)

Ex.)Ex.) 1,2,5,6,7,92,3,4,51,2,7,8,91,7,92,7,92

TT ==

patterns included inpatterns included in at least 3 at least 3 transactionstransactions{1} {2} {7} {9}{1,7} {1,9}{2,7} {2,9} {7,9}{2,7,9}

Important role in discovering interesting knowledge

・・  However, # frequent patterns is often large…

Page 5: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Closed Pattern Closed Pattern [Pasquier et. al. 1999]

・ ・ Patterns having the same denotations quite similar

・・  Classify patterns into equivalence classesequivalence classes by their denotations

・・  Closed patternClosed pattern: the maximal in an equivalence class

    ( = = intersection of occurrences in the denotation)

・・  Closure Closure of a pattern  : the closed pattern belonging

to its equivalence class closed pattern

non-closed patterns

equivalence class

φφ

1,2,51,2,5 3,5,73,5,7

Page 6: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Advantages of Closed PatternAdvantages of Closed Pattern

Completeness: Completeness: [Mannila’96; Pasquier ‘99]

-- The set CC of frequent closed patterns have the complete information

on the set FF of all frequent patterns and their frequencies -- Any maximal association rule can be constructed from the set CC

-- The set CC is sufficient for building classification rules over itemsets

Compactness: Compactness: [Mannila ‘96]

-- |Maximal Frequent| ≦≦ |C| ≦≦ |F|

-- Frequent closed patterns are possibly exponentially fewer than |F|

Th 1Th 1 [This paper]: For any nn and mm, we can construct a database

of n n items and mm transactions that |C| = O(m2) while |F| = 2Ω(n+m).

Page 7: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Problem and ResultProblem and Result

・・  PROBLEM: PROBLEM: given a transaction database,

find all frequent closed patternsall frequent closed patterns

・ ・ Many existing studies, theory and practice

We propose prefix preserving closure extension, and an efficient algorithm LCM LCM (Linear time Closed pattern Miner)

・ ・ Theoretical advantage: linear time in ##frequent closed patterns, use small memory・ ・ Practical advantage: faster than the other algorithms for many datasets (almost all datasets of KDDcup and FIMI’03)

Page 8: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Existing ApproachExisting Approach

・・  Frequent pattern mining based approaches:

enumerate frequent patterns, and output closed patterns among them

・・  Reduce the computation time by avoiding non-closed patterns:

During the enumeration,

-- eliminate unnecessary patterns from memory

- - prune unnecessary branches of the recursion

(not complete)

Page 9: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Our ApproachOur Approach

・・  Existing algorithms

- - possibly operate many non-closed patterns

-- require much memory for storing obtained patterns

We propose

closure extension based enumeration operate closed patterns only (linear time)prefix preserving closure extension no memory for previously obtained patterns (small memory)some algorithms for fast computation faster then other algorithms

Page 10: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Closure Extension Closure Extension [Pasquier et. al. ’99][Pasquier et. al. ’99]

・・  Closure extensionClosure extension: a rule for constructing a closed pattern from another closed pattern

add an item, and take its closure

closed pattern

・・  Any closed pattern is a closure extension of

at least one other closed pattern

・ ・ Any closed pattern has strictly smaller size than

any its closure extension

++ item

closure

Page 11: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

frequentfrequent

Acyclic Relation Acyclic Relation [essentially Pasquier et. al. ’99][essentially Pasquier et. al. ’99]

Closure extensionClosure extension induces an acyclic search graph

・・ We compute in linear time all closed patterns by closure extension

・ ・ However, we still have to store obtained closed patterns in memory…

Page 12: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Prefix Preserving Closure Extension Prefix Preserving Closure Extension [new][new]

・・  Prefix preserving closure extensionPrefix preserving closure extension (ppc extension) is

a variation of closure extension

Def. closure tailDef. closure tail of a closed pattern P

⇔⇔ the minimum j s.t. closure (P ∩ {1,…,j}) ==  P

Def. Def. H == closure(P {∪ i}) (closure extension of P)

is a ppc extensionppc extension of P

   ⇔⇔     i > closure tail and H ∩{1,…,i-1} ==  P ∩{1,…,i-1}

no duplication occurs by depth-first search

“Any” closed pattern H is generated from another “uniqueunique” closed pattern by ppc extension (i.e., from closure(H ∩{1,…,i-1}) )

Page 13: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Relation of ppc extension Relation of ppc extension [new][new]

・・  Any closed pattern is a ppc extension of unique closed pattern

ppc extensionppc extension forms a treetree

We can proceed depth-first searchdepth-first search by ppc extension,

without storing closed patterns in memorywithout storing closed patterns in memory

frequentfrequent

Page 14: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

ExampleExample

closure extension

ppc extension

1,2,5,6,7,92,3,4,51,2,7,8,91,7,92,7,92

TT ==

φ

{1,7,9}

{2,7,9}

{1,2,7,9}

{7,9}

{2,5}

{2}

{2,3,4,5}

{1,2,7,8,9} {1,2,5,6,7,9}

・・  closure extension acyclic

・・  ppc extension tree

Page 15: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Fast ComputationFast Computation

To generate a ppc extension for closed pattern P and item i, we

1.1. compute the denotation of P {∪ i}

2.2. compute the closure of P {∪ i}

3.3. compare the prefix

We propose efficient algorithms for these tasks

Page 16: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Occurrence Deliver Occurrence Deliver [new][new]

・ ・ Compute the denotations of P {∪ i} for all i’s at once, by transposing the trimmed database

・・  Trimmed database is composed of - - items to be added - - transactions including P

linear timelinear time in the size of

trimmed databasedenotation of 1,2,3denotation of 1,2,4denotation of 1,2,5

AA

B

B

C

A B C

3 4 5

33

4 55

A BC

pattern: 1,2denotation: A,B,C

・・  Efficient for sparse datasets

Page 17: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

Anytime Database Reduction Anytime Database Reduction [new][new]

・ ・ Reduce the database, by [fp-growth, etc]

  ◆  ◆ Remove item e, if e is included in less than θ transactions

             oror included in all transactions

  ◆  ◆ merge identical transactions into one

・ ・ Recursively apply trimming and this reduction, in the recursion

   database size becomes small in lower levels of the recursion

・・  For taking closure, keep the intersection of merged transactions

← ← closure operation is to take the intersection of transactions

Page 18: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

ExperimentsExperiments

・・ Computational environmentComputational environment

CPU, memory:   AMD Athron XP 1600+, 224MB

OS, Programming language, compiler: Linux, C, gcc

・・ Algorithms compared with,Algorithms compared with,

FP-growth, afopt, MAFIA, PATRICIAMINE, kDCI

(All these marked high scores at competition FIMI03)

・・ Datasets Datasets

1313 dataset of real world real world, machine learning, artificialartificial datasets

used in FIMI03 and KDD-cup, with specified supports

・・ Won 12 12 databases for every support

(other than Accident dataset of middle supports)

・・ outperfroms especially smaller supports

ResultResult

Page 19: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

resultsresults

Page 20: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

ConclusionConclusion

Closed patternsClosed patterns: representatives of frequent patternsrepresentatives of frequent patterns [Pasquier et.al.’00]

-- much fewer than frequent patterns (possibly exponentially)

-- useful in compact representation and rule induction

・・ We proposed an algorithm LCMLCM for mining closed patterns in databases

-- prefix preserving closure extensionprefix preserving closure extension for tree-shaped search space

-- time complexity is linear in linear in ##closed patternsclosed patterns, and small memory footprintsmall memory footprint

- - practical speed up: occurrence deliveroccurrence deliver and anytime database reductionanytime database reduction

・・ Experiments show that LCM outperforms other algorithms in most

instances, in KDDcup and FIMI datasets, especially with small supports

Future workFuture work: closed patterns for sequences, trees, and other structures

LCM is submitted to FIMI04 competition, be looking forward to it!

Page 21: An Efficient Algorithm for Enumerating Closed Patterns in Transaction Databases Takeaki Uno Tatsuya Asai Yuzo Uchida Hiroki Arimura National Institute

List of DatasetsList of Datasets

Real datasetsReal datasets

     ・・ BMS-WebVeiw-1

     ・・ BMS-WebVeiw-2

     ・・ BMS-POS

     ・・ Retail

     ・・ Kosarak

     ・・ Accidents

Machine learning benchmarkMachine learning benchmark

     ・・ Chess

     ・・ Mushroom

     ・・ Pumsb

     ・・ Pumsb*

     ・・ Connect

Aartificial datasetsAartificial datasets

    ・・ T10I4D100K

    ・・ T40I10D100K