An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams

An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams. Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accepted for publication in the Proceedings of the First International Workshop on Knowledge Discovery in Data Streams, to be held in conjunction with the 15th European Conference on Machine Learning (ECML 2004). Adviser: Jia-Ling Koh. Speaker: Shu-Ning Shin. Date: 2005.3.4.


TRANSCRIPT

1

An Efficient Algorithm for Mining Frequent Itemsets over the Entire History of Data Streams

Hua-Fu Li, Suh-Yin Lee, and Man-Kwan Shan. Accepted for publication in the Proceedings of the First International Workshop on Knowledge Discovery in Data Streams, to be held in conjunction with the 15th European Conference on Machine Learning (ECML 2004).

Adviser: Jia-Ling Koh
Speaker: Shu-Ning Shin
Date: 2005.3.4

2

Introduction
• This paper proposes a single-pass algorithm, called DSM-FI (Data Stream Mining for Frequent Itemsets), which has three major features:
– single streaming data scan for counting itemsets' frequency information
– extended prefix-tree-based compact pattern representation
– top-down frequent itemset discovery scheme

3

Problem Definition (1)
• Items: Ψ = {i1, i2, …, im}
• Data stream: DS = B1, B2, …, BN
• Block: B = [tsi, T1, T2, …, Tk]
• Current length of the data stream: CL = |B1| + |B2| + … + |BN|
• Minimum support: ms ∈ (0, 1)
• Maximal estimated support error: ε ∈ (0, ms)

4

Problem Definition (2)
• True support of itemset X: Tsup(X)
• Estimated support of X: Esup(X)
• X is a frequent itemset if Tsup(X) ≥ ms × CL
• X is a sub-frequent itemset if ε × CL ≤ Tsup(X) < ms × CL
• X is an infrequent itemset if Tsup(X) < ε × CL
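The three support classes can be sketched as a small helper; `classify` is an illustrative name, not from the paper:

```python
def classify(tsup, ms, eps, cl):
    """Classify an itemset by its true support Tsup over the current
    stream length CL, given minimum support ms and maximal estimated
    support error eps, with 0 < eps < ms < 1."""
    if tsup >= ms * cl:
        return "frequent"
    if tsup >= eps * cl:
        return "sub-frequent"
    return "infrequent"

# With ms = 30%, eps = 25%, and CL = 10 (the values used in the later example):
print(classify(3, 0.3, 0.25, 10))  # frequent
print(classify(2, 0.3, 0.25, 10))  # infrequent, since 2 < 0.25 * 10
```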

5

Data Structure: IsFI-forest (1)
• Item-suffix Frequent Itemset forest.
• Extended prefix-tree-based summary data structure.
• Definition:
1. An IsFI-forest consists of:
– DHT: Dynamic Header Table
– a set of CFI-trees (item-suffixes): Candidate Frequent Itemset trees of item-suffixes

6

Data Structure: IsFI-forest (2)
2. Entry of DHT: (item-id, support, block-id, head-link)
3. Entry of CFI-tree(item-suffix): (item-id, support, block-id, node-link)
4. Each CFI-tree(item-suffix) has a specific DHT(item-suffix)
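The two entry layouts map naturally onto small record types; a minimal sketch (the field names follow the slide, while the class names and the `children` map are my own additions):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DHTEntry:
    """Dynamic Header Table entry: (item-id, support, block-id, head-link)."""
    item_id: str
    support: int
    block_id: int
    head_link: Optional["CFINode"] = None  # first node of this item in the tree

@dataclass
class CFINode:
    """CFI-tree node: (item-id, support, block-id, node-link)."""
    item_id: str
    support: int
    block_id: int
    node_link: Optional["CFINode"] = None         # next node with the same item-id
    children: dict = field(default_factory=dict)  # child nodes keyed by item-id
```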

7

Construction of IsFI-forest
1. Read a transaction T = (x1 x2 … xk) from the current block BN.
2. Item-suffix projection IsProjection(T): transaction T is converted into k small transactions (x1 x2 … xk), (x2 … xk), …, (xk−1 xk), (xk).
3. Insert IsProjection(T) into the IsFI-forest.
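Step 2 above is just the list of suffixes of T; a one-line sketch (`is_projection` is an illustrative name):

```python
def is_projection(transaction):
    """Item-suffix projection: (x1 x2 .. xk) -> (x1..xk), (x2..xk), ..., (xk)."""
    return [transaction[i:] for i in range(len(transaction))]

print(is_projection("acdef"))
# ['acdef', 'cdef', 'def', 'ef', 'f']
```

This reproduces the projection of T1 = acdef used in the worked example.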

8

Algorithm: DSM-FI
• DSM-FI is composed of four steps:
– Step 1: reading a block of transactions
– Step 2: constructing the IsFI-forest
– Step 3: pruning the infrequent information from the IsFI-forest
– Step 4: top-down frequent itemset discovery scheme

9

Example
• ms = 30%, ε = 25%
• DS = {B1, B2}
– B1 = {(acdef), (abe), (cef), (acdf), (cef), (df)}
– B2 = {(def), (bef), (be), (bde)}
• DSM-FI:
– Step 1: read B1 into main memory
– Step 2: construct the IsFI-forest
– Step 3: prune infrequent items
– Step 4: mine frequent itemsets from the IsFI-forest
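The thresholds the following slides rely on fall straight out of these parameters; a quick check of the arithmetic:

```python
B1 = ["acdef", "abe", "cef", "acdf", "cef", "df"]
B2 = ["def", "bef", "be", "bde"]
ms, eps = 0.30, 0.25

cl = len(B1) + len(B2)   # CL = 10 once both blocks have arrived
print(eps * len(B1))     # 1.5 -> pruning threshold after B1 (used in step 3)
print(ms * cl)           # 3.0 -> frequency threshold after B2 (used in step 4)
```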

10

Example – step 2 (1)
• (1) T1 = acdef
– IsProjection(acdef) = acdef, cdef, def, ef, f
– Updated: [CFI-tree(a), DHT(a)], [CFI-tree(c), DHT(c)], [CFI-tree(d), DHT(d)], [CFI-tree(e), DHT(e)], [CFI-tree(f), DHT(f)]

[Figure: the IsFI-forest after inserting T1; five CFI-trees rooted at a:1:1, c:1:1, d:1:1, e:1:1 and f:1:1, each with its own DHT, and every node and header entry with support 1 and block-id 1.]

11

Example – step 2 (2)
• (2) T2 = abe
– IsProjection(abe) = abe, be, e
– Updated: [CFI-tree(a), DHT(a)], [CFI-tree(b), DHT(b)], [CFI-tree(e), DHT(e)]

[Figure: the IsFI-forest after inserting T2; the supports of a and e rise to 2 (a:2:1, e:2:1), and a new CFI-tree(b) with DHT(b) is created (b:1:1).]

12

Example – step 2 (3)
• (6) T6 = df
– IsProjection(df) = df, f
– Updated: [CFI-tree(d), DHT(d)], [CFI-tree(f), DHT(f)]

[Figure: the IsFI-forest after all six transactions of B1; the CFI-tree roots are a:3:1, c:4:1, d:3:1, e:4:1, f:5:1, and b:1:1.]

13

Example – step 3
• b is an infrequent item: Esup(b) = 1 < ε × |B1| = 0.25 × 6 = 1.5
• Delete CFI-tree(b), DHT(b), and every entry for b.

[Figure: the IsFI-forest after B1 (roots a:3:1, c:4:1, d:3:1, e:4:1, f:5:1, b:1:1), from which CFI-tree(b), DHT(b), and all b entries are removed.]
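The pruning test on b can be reproduced directly from B1 (a naive recount of Esup(b) for illustration):

```python
B1 = ["acdef", "abe", "cef", "acdf", "cef", "df"]
eps = 0.25

esup_b = sum(1 for t in B1 if "b" in t)  # b occurs only in (abe) -> Esup(b) = 1
prune = esup_b < eps * len(B1)           # 1 < 1.5
print(prune)  # True -> delete CFI-tree(b), DHT(b), and the b entries
```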

14

Example – Step 4
• Start top-down frequent itemset discovery from frequent item a (frequency threshold: ms × CL = 0.3 × 10 = 3)

[Figure: the IsFI-forest after processing B2; the CFI-tree roots are a:3:1, c:4:1, d:5:1, e:8:1, f:7:1, and b:3:2.]

• Frequent: a; maximal candidate: cef
– Tsup(cef) = 3, so {c, e, f, ce, cf, ef, cef} are frequent
• Maximal candidate: def
– Maximal candidate ef: Tsup(ef) = 2, not frequent
– Two-candidates de, df: both are frequent
• Maximal candidate: be
– be is frequent

15

Upper Bound of Space
• i = 1, 2, …, k
• Nodes of all DHT(i): (k² − k)/2
• Nodes of the set of CFI-trees: Σ_{j=1..k} Σ_{i=1..j} i · C(j, i) / 2

16

Maximal Estimated Error
• (X, X.support, X.block_id)
– X.block_id = 1: Tsup(X) = Esup(X)
– X.block_id > 1: Tsup(X) ≤ Esup(X) + ε × (X.block_id − 1) × k
• k: the average size of a block
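A sketch of the bound, reading the error term as ε × (X.block_id − 1) × k in the style of Lossy Counting; the function name and the sample numbers are mine, not from the paper:

```python
def tsup_upper_bound(esup, block_id, eps, k):
    """Bound the true support of X recorded as (X, X.support, X.block_id).
    If X entered in block 1 its count is exact; otherwise it may have been
    missed in the first (block_id - 1) blocks of average size k, by at most
    eps per transaction."""
    if block_id == 1:
        return esup                       # Tsup(X) = Esup(X)
    return esup + eps * (block_id - 1) * k

print(tsup_upper_bound(4, 3, 0.25, 6))  # 4 + 0.25 * 2 * 6 = 7.0
```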

17

Performance (1)
• IBM datasets: T10.I5.D1M, T30.I20.D1M.
• 20 blocks with size 50K.
• Comparison of execution time and memory usage of DSM-FI:

18

Performance (2)
• Comparison of DSM-FI and Lossy Counting: