Parallel Mining Frequent Patterns: A Sampling-based Approach

Shengnan Cong

TRANSCRIPT

Page 1: Parallel Mining Frequent Patterns: A Sampling-based Approach

Parallel Mining Frequent Patterns: A Sampling-based Approach

Shengnan Cong

Page 2: Parallel Mining Frequent Patterns: A Sampling-based Approach

Talk Outline

Background
– Frequent pattern mining
– Serial algorithm

Parallel frequent pattern mining
– Parallel framework
– Load balancing problem

Experimental results

Optimization

Summary

Page 3: Parallel Mining Frequent Patterns: A Sampling-based Approach

Frequent Pattern Analysis

Frequent pattern
– A pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

Motivation
– To find the inherent regularities in data

Applications
– Basket data analysis
  • What products were often purchased together?
– DNA sequence analysis
  • What kinds of DNA are sensitive to this new drug?
– Web log analysis
  • Can we automatically classify web documents?

Page 4: Parallel Mining Frequent Patterns: A Sampling-based Approach

Frequent Itemset Mining

Itemset
– A collection of one or more items
  • Example: {Milk, Juice}
– k-itemset
  • An itemset that contains k items

Transaction
– An itemset
– A dataset is a collection of transactions

Support
– The number of transactions containing an itemset
  • Example: Support({Milk, Juice}) = 2 (see the sketch below)

Frequent-itemset mining
– To output all itemsets whose support values are no less than a predefined threshold in a dataset

Example dataset (support threshold = 2):

Transaction | Items
T1          | Milk, Bread, Cookies, Juice
T2          | Milk, Juice
T3          | Milk, Eggs
T4          | Bread, Cookies, Coffee

Frequent itemsets: {Milk}:3, {Bread}:2, {Cookies}:2, {Juice}:2, {Milk, Juice}:2, {Bread, Cookies}:2
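To make the support computation concrete, here is a minimal C++ sketch (C++ being the language of the implementation described later in the talk); the data layout and names are illustrative, not from the talk.

    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    using Itemset = std::set<std::string>;

    // Support(X) = the number of transactions that contain every item of X.
    // std::includes works here because std::set iterates in sorted order.
    int support(const Itemset& x, const std::vector<Itemset>& db) {
        int count = 0;
        for (const Itemset& t : db)
            if (std::includes(t.begin(), t.end(), x.begin(), x.end()))
                ++count;
        return count;
    }

    int main() {
        std::vector<Itemset> db = {
            {"Milk", "Bread", "Cookies", "Juice"},
            {"Milk", "Juice"},
            {"Milk", "Eggs"},
            {"Bread", "Cookies", "Coffee"}};
        std::cout << support({"Milk", "Juice"}, db) << "\n";  // prints 2
    }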

Page 5: Parallel Mining Frequent Patterns: A Sampling-based Approach

Frequent Itemset Mining

Frequent-itemset mining is computationally expensive.

Brute-force approach (a code sketch follows):
– Given d items, there are 2^d possible candidate itemsets.
– Count the support of each candidate by scanning the database.
– Match each transaction against every candidate.
– Complexity ~ O(NMW), where N is the number of transactions, W the transaction width, and M the number of candidates => expensive, since M = 2^d.

[Figure: N transactions of width W matched against a list of M candidates, using the example dataset above.]
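The blow-up is easy to see in code. A hedged sketch of the brute-force approach on the toy dataset above, enumerating all 2^d candidates with a bitmask (illustrative only; M = 2^d makes this infeasible beyond small d):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    using Itemset = std::set<std::string>;

    int main() {
        std::vector<Itemset> db = {
            {"Milk", "Bread", "Cookies", "Juice"},
            {"Milk", "Juice"},
            {"Milk", "Eggs"},
            {"Bread", "Cookies", "Coffee"}};
        std::vector<std::string> items = {"Milk",  "Bread", "Cookies",
                                          "Juice", "Eggs",  "Coffee"};
        const std::size_t d = items.size();
        const int minsup = 2;

        // M = 2^d - 1 nonempty candidates: the exponential factor in O(NMW).
        for (std::uint32_t mask = 1; mask < (1u << d); ++mask) {
            Itemset x;
            for (std::size_t i = 0; i < d; ++i)
                if (mask & (1u << i)) x.insert(items[i]);

            // One full database scan per candidate (the N*W factor).
            int sup = 0;
            for (const Itemset& t : db)
                if (std::includes(t.begin(), t.end(), x.begin(), x.end()))
                    ++sup;

            if (sup >= minsup) {
                for (const auto& item : x) std::cout << item << " ";
                std::cout << ": " << sup << "\n";
            }
        }
    }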

Page 6: Parallel Mining Frequent Patterns: A Sampling-based Approach

Mining Frequent Itemsets in Serial

FP-growth algorithm [Han et al. @SIGMOD 2000]
– One of the most efficient serial algorithms for mining frequent itemsets [FIMI’03].
– A divide-and-conquer algorithm.

Mining process of FP-growth (a sketch of the tree structure follows):
– Step 1: Identify the frequent 1-items with one scan of the dataset.
– Step 2: Construct a tree structure (FP-tree) for the dataset with another dataset scan.
– Step 3: Traverse the FP-tree and construct a projection (sub-tree) for each frequent 1-item. Recursively mine the projections.
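A minimal sketch of the FP-tree data structure and the Step 2 insertion, assuming each transaction has already been reduced to its frequent items in descending frequency order. Names are illustrative; a real FP-tree also keeps parent pointers and per-item side links from a header table (used in Step 3), omitted here for brevity.

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Each FP-tree node stores an item, a count, and children keyed by item.
    struct FPNode {
        std::string item;
        int count = 0;
        std::map<std::string, std::unique_ptr<FPNode>> children;
    };

    // Step 2: insert one transaction whose items are already sorted by
    // descending global frequency, so shared prefixes share tree nodes.
    void insert(FPNode& root, const std::vector<std::string>& ordered) {
        FPNode* cur = &root;
        for (const std::string& item : ordered) {
            std::unique_ptr<FPNode>& child = cur->children[item];
            if (!child) {
                child = std::make_unique<FPNode>();
                child->item = item;
            }
            ++child->count;  // the path's count grows with every reuse
            cur = child.get();
        }
    }

Inserting the five ordered transactions of the running example on the next slides into an empty root yields exactly the tree shown there.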

Page 7: Parallel Mining Frequent Patterns: A Sampling-based Approach

Example of FP-growth Algorithm

Step 1: scan the dataset once to count the 1-items (support threshold = 3).

Input dataset:

TID | Items
100 | {f, a, c, d, g, i, m, p}
200 | {a, b, c, f, l, m, o}
300 | {b, f, h, j, o}
400 | {b, c, k, s, p}
500 | {a, f, c, e, l, p, m, n}

Frequent 1-items and their frequencies: f:4, c:4, a:3, b:3, m:3, p:3

Page 8: Parallel Mining Frequent Patterns: A Sampling-based Approach

Example of FP-growth Algorithm (cont’d)

Step 2: scan the dataset again; insert each transaction’s frequent items, sorted in descending frequency order, into the FP-tree.

TID | Items bought             | Ordered frequent items
100 | {f, a, c, d, g, i, m, p} | {f, c, a, m, p}
200 | {a, b, c, f, l, m, o}    | {f, c, a, b, m}
300 | {b, f, h, j, o}          | {f, b}
400 | {b, c, k, s, p}          | {c, b, p}
500 | {a, f, c, e, l, p, m, n} | {f, c, a, m, p}

[Figure: the FP-tree after each of the five insertions. The final FP-tree (support threshold = 3; item frequencies f:4, c:4, a:3, b:3, m:3, p:3):]

root
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Page 9: Parallel Mining Frequent Patterns: A Sampling-based Approach

Example of FP-growth Algorithm (cont’d)

Step 3:
– Traverse the FP-tree by following the side link of each frequent 1-item and accumulate its prefix paths.
– Build an FP-tree structure (projection) for the accumulated prefix paths.
– If the projection contains only one path, enumerate all the combinations of the items (see the sketch below); else recursively mine the projection.

Item | Prefix paths
c    | f:3
a    | fc:3
b    | fca:1, f:1, c:1
m    | fca:2, fcab:1
p    | fcam:2, cb:1

Example for item m: m’s prefix paths are fca:2 and fcab:1. The m-projection is a single path, root → f:3 → c:3 → a:3 (b is infrequent within the projection and is dropped). All frequent patterns concerning m: m, fm, cm, am, fcm, fam, cam, fcam.

[Figure: the full FP-tree with m’s side link highlighted, and the resulting m-projection.]
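The single-path base case is where the pattern count becomes explicit: every combination of the path’s items, extended with the suffix, is frequent. A minimal sketch, assuming the projection has already been reduced to one path (names illustrative):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <string>
    #include <vector>

    // One node of a single-path projection: an item and its count.
    struct PathItem { std::string item; int count; };

    // Enumerate every combination of the path's items; each combination,
    // extended with the suffix, is frequent with support equal to the
    // minimum count among the chosen items (and the suffix itself).
    void enumerateSinglePath(const std::vector<PathItem>& path,
                             const std::string& suffix, int suffixCount) {
        const std::size_t k = path.size();
        for (std::uint32_t mask = 0; mask < (1u << k); ++mask) {
            std::string pattern;
            int sup = suffixCount;
            for (std::size_t i = 0; i < k; ++i)
                if (mask & (1u << i)) {
                    pattern += path[i].item;
                    sup = std::min(sup, path[i].count);
                }
            std::cout << pattern << suffix << ":" << sup << "\n";
        }
    }

    int main() {
        // The m-projection above: root -> f:3 -> c:3 -> a:3, suffix m (count 3).
        enumerateSinglePath({{"f", 3}, {"c", 3}, {"a", 3}}, "m", 3);
    }

Run on the m-projection, this prints the eight patterns listed on the slide (m, fm, cm, fcm, am, fam, cam, fcam), each with support 3.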

Page 10: Parallel Mining Frequent Patterns: A Sampling-based Approach

Talk Outline

Background
– Frequent-itemset mining
– Serial algorithm

Parallel frequent pattern mining
– Parallel framework
– Load balancing problem

Experimental results

Optimization

Summary

Page 11: Parallel Mining Frequent Patterns: A Sampling-based Approach

Parallel Frequent-itemset Mining

Time profile of the serial FP-growth algorithm:
– Identify the frequent single items: 1.31%
– Build the tree structure for the whole DB: 1.91%
– Make a projection for each frequent single item from the tree structure and mine the projection: 96.78%

Parallelization framework for FP-growth, following divide and conquer (a hedged MPI sketch follows):
– Identify the frequent single items in parallel.
– Partition the frequent single items and assign each subset of frequent items to a processor.
– Each processor builds the tree structure related to its assigned items from the DB.
– Each processor makes projections for the assigned items from its local tree and mines the projections independently.
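The experimental setup later in the talk says the system is written in C++ and MPI, so the task-distribution skeleton might look like the following hedged sketch; the item list, the round-robin assignment, and mineProjection() are illustrative stand-ins, not the talk’s actual code.

    #include <mpi.h>
    #include <string>
    #include <vector>

    // Stand-in for the per-task work done locally: build the tree for the
    // assigned item from the DB, project, and mine the projection (omitted).
    void mineProjection(const std::string& item) { /* FP-growth machinery */ }

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, nprocs = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        // Assume every rank already knows the global frequent 1-items
        // (e.g. local counts combined with MPI_Allreduce, not shown).
        std::vector<std::string> freqItems = {"f", "c", "a", "b", "m", "p"};

        // Static partition of items (= mining tasks) across processors;
        // each task then runs to completion with no communication.
        for (std::size_t i = rank; i < freqItems.size(); i += nprocs)
            mineProjection(freqItems[i]);

        MPI_Finalize();
        return 0;
    }

A static assignment like this is exactly what exposes the load balancing problem of the next slide: the tasks’ mining times vary wildly.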

Page 12: Parallel Mining Frequent Patterns: A Sampling-based Approach

Load Balancing Problem

Reason:
– The large projections take too long to mine relative to the mining time of the overall dataset.

Solution:
– The larger projections must be partitioned.

Challenge:
– Identifying the larger projections.

[Figures: speedup on pumsb vs. the optimal speedup for 1–64 processors, and the per-projection mining times for pumsb: the largest task takes 204.7 seconds against an average of 26.7 seconds.]

Page 13: Parallel Mining Frequent Patterns: A Sampling-based Approach

How To Identify The Larger Projections?

Static estimation
– Based on dataset parameters
  • Number of items, number of transactions, length of transactions, …
– Based on the characteristics of the projection
  • Depth, bushiness, tree size, number of leaves, fan-out, fan-in, …

Result: no correlation found with any of the above.

[Figure: mining time vs. FP-tree depth across projections, showing no correlation.]

Page 14: Parallel Mining Frequent Patterns: A Sampling-based Approach

Dynamic Estimation

Runtime sampling
– Use the relative mining time of a sample to estimate the relative mining time of the whole dataset.
– Trade-off: accuracy vs. overhead.

Random sampling: randomly select a subset of the records.
– Not accurate.

[Figure: pumsb, per-projection mining time of the whole dataset vs. a 1% random sample (overhead 1.03%); the sample’s relative times do not track the whole dataset’s.]

Page 15: Parallel Mining Frequent Patterns: A Sampling-based Approach

Selective Sampling

Sampling based on frequency (a code sketch follows):
– Discard the infrequent items.
– Discard a fraction t of the most frequent 1-items.

Example: frequent 1-items <(f:4), (c:4), (a:3), (b:3), (m:3), (p:3), (l:2), (s:2), (n:2), (q:2)> (Supmin = 2), t = 20%. The top two items f and c are discarded along with the infrequent items, so {f, a, c, d, g, i, m, p} becomes {a, m, p} and {f, a, b, g, i, p} becomes {a, b, p}.

When filtering the top 20%, sampling takes on average 1.41% of the sequential mining time and still provides fairly good accuracy.

[Figure: pumsb, per-projection mining time of the whole dataset vs. the selective sample with the top 20% filtered (overhead 0.71%); the sample’s relative times track the whole dataset’s closely.]
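A minimal sketch of how such a sample might be constructed (the talk gives no code; all names are illustrative): count the 1-item frequencies, drop the infrequent items and the top fraction t of the frequent ones, and project every transaction onto what remains.

    #include <algorithm>
    #include <map>
    #include <set>
    #include <string>
    #include <vector>

    using Transaction = std::vector<std::string>;

    // Build the selective sample: drop items below minSup and the top
    // fraction t of the frequent 1-items (by descending frequency).
    std::vector<Transaction> selectiveSample(const std::vector<Transaction>& db,
                                             int minSup, double t) {
        // Count 1-item frequencies with one scan.
        std::map<std::string, int> freq;
        for (const auto& tr : db)
            for (const auto& item : tr) ++freq[item];

        // Sort the frequent items by descending frequency.
        std::vector<std::pair<std::string, int>> frequent;
        for (const auto& [item, f] : freq)
            if (f >= minSup) frequent.push_back({item, f});
        std::sort(frequent.begin(), frequent.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        // Keep only the frequent items outside the top t fraction.
        std::set<std::string> keep;
        const std::size_t cut = static_cast<std::size_t>(t * frequent.size());
        for (std::size_t i = cut; i < frequent.size(); ++i)
            keep.insert(frequent[i].first);

        // Project every transaction onto the kept items.
        std::vector<Transaction> sample;
        for (const auto& tr : db) {
            Transaction s;
            for (const auto& item : tr)
                if (keep.count(item)) s.push_back(item);
            if (!s.empty()) sample.push_back(s);
        }
        return sample;
    }

Mining the sample with FP-growth then yields per-projection times whose relative sizes are used to rank, and later partition, the real mining tasks.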

Page 16: Parallel Mining Frequent Patterns: A Sampling-based Approach

Why Selective Sampling Works

The mining time is proportional to the number of frequent itemsets in the result (observed in experiments).

Given a frequent L-itemset, all of its subsets are frequent itemsets, and there are 2^L − 1 of them.

Removing one item at the root reduces the total number of itemsets in the result and therefore reduces the mining time roughly by half.

The most frequent items are close to the root. The mining time of their projections is negligible, but their presence inflates the number of itemsets in the results.
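To see the halving concretely: a frequent 20-itemset contributes 2^20 − 1 = 1,048,575 frequent subsets, while removing one of its items leaves 2^19 − 1 = 524,287, about half. Dropping the few most frequent items therefore shrinks the sample’s mining time sharply while, as the previous slide’s accuracy figures suggest, barely distorting the relative cost of the remaining projections.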

Page 17: Parallel Mining Frequent Patterns: A Sampling-based Approach

Talk Outline

Background
– Frequent-itemset mining
– Serial algorithm

Parallel frequent pattern mining
– Parallel framework
– Load balancing problem

Experimental results

Optimization

Summary

Page 18: Parallel Mining Frequent Patterns: A Sampling-based Approach

Experimental Setup

A Linux cluster with 64 nodes
– 1GHz Pentium III processor and 1GB memory per node.

Implementation: C++ and MPI.

Datasets:

Dataset     | #Transactions | #Items | Max Trans. Length
mushroom    | 8,124         | 23     | 23
connect     | 57,557        | 43     | 43
pumsb       | 49,046        | 7,116  | 74
pumsb_star  | 49,046        | 7,116  | 63
T40I10D100K | 100,000       | 999    | 77
T50I5D500K  | 500,000       | 5,000  | 94

Page 19: Parallel Mining Frequent Patterns: A Sampling-based Approach

Speedups: Frequent-itemset Mining

[Figure: speedup vs. number of processors (1, 2, 4, 8, 16, 32, 64) for Par-FP against the optimal speedup on six datasets: mushroom, connect, pumsb, pumsb_star, T40I10D100K, and T50I5D500K. The pumsb_star panel is flagged "Needs multi-level task partitioning" (see the optimization below).]

Page 20: Parallel Mining Frequent Patterns: A Sampling-based Approach

Experimental Results For Selective Sampling

Overhead of selective sampling (average 1.4%):

Dataset     | Overhead
mushroom    | 0.71%
connect     | 1.80%
pumsb       | 0.71%
pumsb_star  | 2.90%
T40I10D100K | 2.05%
T50I5D500K  | 0.28%

Effectiveness of selective sampling:
– Selective sampling improves the performance by 37% on average.

[Figure: speedups on 64 processors for each dataset, with and without sampling.]

Page 21: Parallel Mining Frequent Patterns: A Sampling-based Approach

Optimization: Multi-level Partitioning

Problem analysis (pumsb_star):
– Per-task mining times are highly skewed: the largest tasks take about 275, 133, and 65 seconds out of a 576-second total.
– Maximal subtask with 1-level partitioning: 131 seconds.
– Optimal speedup with 1-level partitioning: 576/131 = 4.4.

Conclusion:
– We need multi-level task partitioning to obtain better speedup.

Challenges:
– How many levels are necessary?
– Which sub-subtasks should be further partitioned?

[Figures: Par-FP speedup on pumsb_star vs. optimal, and the per-task mining-time chart; a second annotation marks the maximal subtask after one more level of partitioning at 65 seconds.]

Page 22: Parallel Mining Frequent Patterns: A Sampling-based Approach

Optimization: Multi-level Partitioning

Observations:
– The mining time of the maximal subtask derived from a task is about 1/2 of the mining time of the task itself.
– The mining time of the maximal subtask derived from the top-1 task is about the same as that of the top-2 task.

Reason:
– There is one very long frequent pattern <a b c d e f g …> of length L in the dataset. Since the mining time is proportional to the number of frequent itemsets in the result:

a: 2^(L-1)   max subtask -> ab: 2^(L-2)
b: 2^(L-2)   max subtask -> bc: 2^(L-3)
c: 2^(L-3)   max subtask -> cd: 2^(L-4)
…

abcde: 2^(L-5)
(If we partition task a down to abcde, the mining time of the derived subtask for abcde is about 1/16 of that of task a.)

[Figure: the same per-task mining-time chart as on the previous slide.]

Page 23: Parallel Mining Frequent Patterns: A Sampling-based Approach

Optimization: Multi-level Partitioning

Multi-level partitioning heuristic (a code sketch follows):
– If a subtask’s mining time (estimated by selective sampling) is greater than (1/(4N)) · Σ_{i=1}^{M} T_i, partition it so that the maximal mining time of the derived sub-subtasks is less than (1/(4N)) · Σ_{i=1}^{M} T_i.
(N: number of processors; M: total number of tasks; T_i: mining time of subtask i.)

Result:

[Figure: speedup on pumsb_star for 1–64 processors, comparing optimal, one-level, and multi-level partitioning; multi-level partitioning improves markedly over one-level.]
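A hedged sketch of the heuristic (illustrative names; the estimated times come from selective sampling). Following the observation two slides back, partitioning a task one level deeper is modeled here as producing subtasks of about half its time; the real system would instead split the projection on its frequent items.

    #include <queue>
    #include <vector>

    // A mining task with its estimated time (from selective sampling) and
    // its partitioning depth.
    struct Task { double time; int level; };

    std::vector<Task> partitionTasks(const std::vector<Task>& tasks, int nprocs) {
        double total = 0;
        for (const Task& t : tasks) total += t.time;
        const double threshold = total / (4.0 * nprocs);  // (1/(4N)) * sum of T_i

        std::vector<Task> result;
        std::queue<Task> pending;
        for (const Task& t : tasks) pending.push(t);
        while (!pending.empty()) {
            Task t = pending.front();
            pending.pop();
            if (t.time < threshold) {
                result.push_back(t);  // small enough to schedule as-is
            } else {
                // Partition one level deeper: each derived subtask is
                // modeled as taking about half the parent's time.
                pending.push({t.time / 2, t.level + 1});
                pending.push({t.time / 2, t.level + 1});
            }
        }
        return result;
    }

Because large tasks keep re-entering the queue until they fall under the threshold, the number of partitioning levels is decided per task rather than fixed in advance, which answers both challenges on the previous slide.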

Page 24: Parallel Mining Frequent Patterns: A Sampling-based Approach

Summary

Data mining is an important application of parallel processing.

We developed a framework for parallel frequent-itemset mining and achieved good speedups.

We proposed the selective sampling technique to address the load balancing problem, improving the speedups by 45% on average on 64 processors.

Page 25: Parallel Mining Frequent Patterns: A Sampling-based Approach


Questions?