8/8/2019 A Comparative Study of Data Mining Algorithms To
A Comparative Study of Data Mining Algorithms to Generate Frequent Itemsets and Association Rules
Anupma Sangwan
-
What is Data Mining?
Many definitions exist:
Extraction of implicit, previously unknown, and potentially useful information from data.
The task of discovering interesting patterns from vast amounts of data.
-
What is (not) Data Mining?
What is not Data Mining?
Look up a phone number in a phone directory.
Query a Web search engine for information about Amazon.
What is Data Mining?
Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com).
-
Data Mining Tasks
Prediction Methods
Use some variables to predict unknown or future
values of other variables.
Description Methods
Find human-interpretable patterns that describe
the data.
-
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
-
What Is Association Mining?
Association rule mining:
Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Frequent pattern: a pattern (set of items, sequence, etc.) that occurs frequently in a database.
Motivation: finding regularities in data
Which products are often purchased together? Beer and diapers?!
What are the subsequent purchases after buying a PC?
-
Association Rules
An Example
Market-basket model
Look for combinations of products
Put the SHOES near the SOCKS so that if a customer buys one, they will buy the other.
-
Association Rules Purpose
Provide rules that correlate the presence of one set of items with the presence of another set of items.
Example: customers who buy beer often also buy diapers.
-
Basic Concepts & Terms in Association Rules
Itemset: X = {x1, ..., xk}
Find all the rules X => Y with minimum confidence and support:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y
Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F
Let min_support = 50%, min_conf = 50%:
A => C (50%, 66.7%)
C => A (50%, 100%)
(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both.)
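These definitions can be sketched in a few lines of Python; the function names and toy transaction database are my own, chosen to reproduce the slide's numbers:

```python
# Toy transaction database from the slide.
transactions = [
    {"A", "B", "C"},   # TID 10
    {"A", "C"},        # TID 20
    {"A", "D"},        # TID 30
    {"B", "E", "F"},   # TID 40
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """supp(X ∪ Y) / supp(X): conditional probability that a
    transaction having X also contains Y."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"A", "C"}, transactions))        # 0.5  -> the 50% support
print(confidence({"A"}, {"C"}, transactions))   # ~0.667 -> A => C (66.7%)
print(confidence({"C"}, {"A"}, transactions))   # 1.0  -> C => A (100%)
```

Note that support is symmetric in X and Y, while confidence is not: A => C and C => A share the same support but differ in confidence, exactly as on the slide.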
-
Mining Association Rules: Example
Min. support 50%, Min. confidence 50%
Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F
Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%
For rule A => C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.7%
-
Frequent Itemset Algorithms
Some of the algorithms that generate frequent itemsets are as follows:
AIS Algorithm
SETM Algorithm
Apriori Algorithm
FP-Growth Algorithm
AprioriTID Algorithm
-
Discovering the Association rule
Find all frequent itemsets (itemsets with at least minimum support).
Use these frequent itemsets to generate rules.
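The second step can be sketched in Python. `generate_rules` is a hypothetical helper of my own; it assumes step 1 has already produced a table mapping each frequent itemset to its support:

```python
from itertools import combinations

# Hypothetical frequent-itemset table (itemset -> support), as step 1
# would produce on the earlier toy database.
freq = {
    frozenset({"A"}): 0.75,
    frozenset({"B"}): 0.5,
    frozenset({"C"}): 0.5,
    frozenset({"A", "C"}): 0.5,
}

def generate_rules(freq, min_conf):
    """Step 2: for each frequent itemset Z, emit every rule X => Z - X
    whose confidence supp(Z) / supp(X) meets min_conf."""
    rules = []
    for itemset, supp in freq.items():
        if len(itemset) < 2:
            continue   # rules need a nonempty antecedent and consequent
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / freq[antecedent]
                if conf >= min_conf:
                    rules.append((set(antecedent), set(itemset - antecedent), conf))
    return rules

for x, y, conf in generate_rules(freq, 0.5):
    print(f"{x} => {y}  (conf={conf:.1%})")
```

On this table it emits the two rules from the earlier slide: A => C at 66.7% confidence and C => A at 100%.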
-
Discovering Large Itemsets
Multiple passes over the data.
First pass: count the support of individual items.
Subsequent passes:
Generate candidates using the previous pass's large itemsets.
Go over the data and check the actual support of the candidates.
Stop when no new large itemsets are found.
-
Apriori Algorithm
The first scalable algorithm for association rule mining.
An improvement over the AIS and SETM algorithms (Agrawal and Srikant 1994).
-
Apriori: A Candidate Generation-and-test Approach
Any subset of a frequent itemset must be frequent:
if {beer, diaper, nuts} is frequent, so is {beer, diaper}
every transaction having {beer, diaper, nuts} also contains {beer, diaper}
Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested!
Method:
generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
test the candidates against the DB.
-
Apriori Algorithm: Pseudo Code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};                          // count item occurrences
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;        // generate new (k+1)-itemset candidates
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;  // find the support of all the candidates
    Lk+1 = candidates in Ck+1 with min_support; // take only those with support over minsup
end
return ∪k Lk;

Join step: Ck is generated by joining Lk-1 with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
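The level-wise loop above can be rendered as a minimal Python implementation (illustrative only; the function name and the inline join/prune are my own):

```python
from itertools import combinations

def apriori(db, min_support):
    """Level-wise Apriori: L1 from one scan, then alternate candidate
    generation (join + prune) with a counting scan over the database."""
    n = len(db)
    # L1: frequent 1-itemsets.
    counts = {}
    for t in db:
        for item in t:
            fs = frozenset([item])
            counts[fs] = counts.get(fs, 0) + 1
    L = {fs for fs, c in counts.items() if c / n >= min_support}
    all_frequent = set(L)
    k = 2
    while L:
        # Join: unite pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in L for b in L if len(a | b) == k}
        # Prune: keep only candidates whose (k-1)-subsets are all frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in L for s in combinations(c, k - 1))}
        # One pass over the database to count the surviving candidates.
        counts = {c: sum(c <= t for t in db) for c in candidates}
        L = {c for c, cnt in counts.items() if cnt / n >= min_support}
        all_frequent |= L
        k += 1
    return all_frequent

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(sorted(sorted(s) for s in apriori(tdb, 0.5)))
# nine frequent itemsets, including {B, C, E}, matching the example slide
```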
-
Candidate generation
Join step: p and q are two (k-1)-large itemsets identical in the first k-2 items; join by adding the last item of q to p.

insert into Ck
select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
from Lk-1 p, Lk-1 q
where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Prune step: check all the subsets, and remove any candidate with an infrequent subset.

forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
        if (s not in Lk-1) then
            delete c from Ck
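The join and prune steps can be sketched as one small Python function (illustrative; it represents each itemset as a sorted tuple so the "identical first k-2 items" join condition is a simple prefix test):

```python
from itertools import combinations

def apriori_gen(L_prev, k):
    """apriori-gen: join two (k-1)-itemsets that agree on their first
    k-2 items (in sorted order), then prune any candidate that has an
    infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in L_prev)
    candidates = set()
    # Join step: p and q share a prefix, and p's last item < q's last item.
    for p in prev:
        for q in prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                candidates.add(p + (q[-1],))
    # Prune step: every (k-1)-subset of a candidate must be large.
    prev_set = {frozenset(s) for s in L_prev}
    return {frozenset(c) for c in candidates
            if all(frozenset(s) in prev_set for s in combinations(c, k - 1))}
```

For example, from L2 = {AC, BC, BE, CE} the join produces only {B, C, E} (from BC and BE), and the prune keeps it because BC, BE, and CE are all in L2.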
-
The Apriori Algorithm: An Example
Frequency 50%, Confidence 100%: A => C, B => E, BC => E, CE => B, BE => C

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan, C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2: {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan, C2:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}

3rd scan, L3:
Itemset    sup
{B, C, E}  2
-
Apriori Problem?
L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;

Problem: every pass goes over the whole data.
-
Algorithm AprioriTid
Uses the database only once; builds a storage set C^k.
Members have the form <TID, {Xk}>, where the Xk are potentially frequent k-itemsets in transaction TID.
For k = 1, C^1 is the database itself.
Uses C^k in pass k+1.
-
Algorithm AprioriTid
L1 = {large 1-itemsets};                          // count item occurrences
C^1 = database D;                                 // the storage set is initialized with the database
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = apriori-gen(Lk-1);                       // generate new k-itemset candidates
    C^k = ∅;                                      // build a new storage set
    forall entries t ∈ C^k-1 do begin
        // determine candidate itemsets contained in transaction t.TID
        Ct = {c ∈ Ck | (c - c[k]) ∈ t.set-of-itemsets ∧ (c - c[k-1]) ∈ t.set-of-itemsets};
        forall candidates c ∈ Ct do               // find the support of all the candidates
            c.count++;
        if (Ct ≠ ∅) then C^k += <t.TID, Ct>;      // remove empty entries
    end
    Lk = {c ∈ Ck | c.count ≥ minsup};             // take only those with support over minsup
end
Answer = ∪k Lk;
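A rough Python rendering of this idea follows; it is a sketch, not the paper's exact formulation (candidate generation is inlined, the containment test checks all (k-1)-subsets rather than the two the paper uses, and all names are my own). The key point it preserves: after pass 1, counting works against the shrinking storage set instead of rescanning the database.

```python
from itertools import combinations

def apriori_tid(db, min_support):
    """AprioriTid sketch: replace the raw database with a storage set of
    (TID, candidate itemsets) entries; pass k counts candidates against
    the pass-(k-1) storage set."""
    n = len(db)
    # Pass 1: C^1 is the database itself, viewed as 1-itemsets per transaction.
    storage = [(tid, {frozenset([i]) for i in t}) for tid, t in enumerate(db)]
    counts = {}
    for _, sets in storage:
        for s in sets:
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c / n >= min_support}
    all_frequent = set(L)
    k = 2
    while L:
        # Candidate k-itemsets: simple join + prune on L_{k-1}.
        Ck = {a | b for a in L for b in L if len(a | b) == k}
        Ck = {c for c in Ck
              if all(frozenset(s) in L for s in combinations(c, k - 1))}
        new_storage, counts = [], {c: 0 for c in Ck}
        for tid, sets in storage:
            # c is contained in transaction tid iff all its (k-1)-subsets
            # appear in the transaction's entry of the storage set.
            Ct = {c for c in Ck
                  if all(frozenset(s) in sets for s in combinations(c, k - 1))}
            for c in Ct:
                counts[c] += 1
            if Ct:                       # drop empty entries
                new_storage.append((tid, Ct))
        storage = new_storage
        L = {c for c, cnt in counts.items() if cnt / n >= min_support}
        all_frequent |= L
        k += 1
    return all_frequent
```

On the example database below ({1,3,4}, {2,3,5}, {1,2,3,5}, {2,5}) with 50% support, the storage set shrinks from 4 entries to 2 by pass 3, exactly as the next slide shows.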
-
Database:
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

C^1:
TID  Set-of-itemsets
100  {{1}, {3}, {4}}
200  {{2}, {3}, {5}}
300  {{1}, {2}, {3}, {5}}
400  {{2}, {5}}

L1:
Itemset  Support
{1}      2
{2}      3
{3}      3
{5}      3

C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

C^2:
TID  Set-of-itemsets
100  {{1 3}}
200  {{2 3}, {2 5}, {3 5}}
300  {{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}}
400  {{2 5}}

L2:
Itemset  Support
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3: {2 3 5}

C^3:
TID  Set-of-itemsets
200  {{2 3 5}}
300  {{2 3 5}}

L3:
Itemset  Support
{2 3 5}  2
-
Advantage
C^k can be smaller than the database: if a transaction contains no k-itemset candidates, it is excluded from C^k.
For large k, each entry may be smaller than the transaction, since the transaction might contain only a few candidates.
-
Mining Frequent Patterns Without Candidate Generation (FP-Growth)
Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure:
highly condensed, but complete for frequent pattern mining
avoids costly database scans
Develop an efficient, FP-tree-based frequent pattern mining method:
a divide-and-conquer methodology: decompose mining tasks into smaller ones
avoid candidate generation: sub-database test only!
-
Construct FP-tree from a Transaction DB
min_support = 0.5

TID  Items bought               (ordered) frequent items
100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
300  {b, f, h, j, o}            {f, b}
400  {b, c, k, s, p}            {c, b, p}
500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

Header table:
Item  frequency
f     4
c     4
a     3
b     3
m     3
p     3

Steps:
1. Scan the DB once, find frequent 1-itemsets (single-item patterns).
2. Order frequent items in frequency-descending order.
3. Scan the DB again, construct the FP-tree.

(FP-tree figure: a root {} whose main branch is f:4, c:3, a:3 with children m:2, p:2 and b:1, m:1, p:1; side branches f:4, b:1 and c:1, b:1, p:1; header-table links point to each item's node occurrences.)
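The three construction steps can be sketched as a minimal prefix-tree build in Python. This is illustrative only: the class and function names are my own, node-links from the header table are omitted, and frequency ties are broken alphabetically for determinism (the slide breaks them differently).

```python
from collections import Counter

class FPNode:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(db, min_support):
    """Step 1: count items in one scan and keep the frequent ones.
    Step 2: fix a frequency-descending item order.
    Step 3: insert each reordered transaction into a prefix tree,
    so transactions sharing a prefix share tree nodes."""
    n = len(db)
    freq = Counter(item for t in db for item in t)
    freq = {i: c for i, c in freq.items() if c / n >= min_support}
    order = sorted(freq, key=lambda i: (-freq[i], i))  # ties: alphabetical
    root = FPNode(None, None)
    for t in db:
        node = root
        # Keep only frequent items, in frequency-descending order.
        for item in [i for i in order if i in t]:
            if item not in node.children:
                node.children[item] = FPNode(item, node)
            node = node.children[item]
            node.count += 1
    return root, freq
```

Because each path from the root is a shared prefix of several reordered transactions, the tree is typically far smaller than the database while still encoding every frequent pattern.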
-
Objective
To determine the effectiveness and efficiency of these algorithms in terms of the following parameters:
- Types of itemsets generated by the algorithms, taking into account the same database.
- Time units taken by the algorithms to generate the frequent itemsets.
- Association rules designed on the basis of the frequent itemsets generated by the algorithms.
- Size of the database.
- Varying the minimum support and minimum confidence.
-
Research Methodology
Implement these algorithms, connect them to a database, and analyze the results.
-
Scope and Relevance of Study
1. Inventory Management:
Goal: a consumer appliance repair company wants to anticipate the nature of repairs on its consumer products and keep its service vehicles equipped with the right parts, to reduce the number of visits to consumer households.
Approach: process the data on tools and parts required in previous repairs at different consumer locations and discover the co-occurrence patterns.
-
Cont...
2. Market Analysis: which combinations are frequent?
3. Health Care: analyze patient disease history; find relationships between diseases.
-
References
[1] Agrawal R., Imielinski T., Swami A. Mining Association Rules between Sets of Items in Large Databases. Proc. ACM SIGMOD Int. Conf. on Management of Data, 1993, p. 207-216.
[2] Agrawal R., Srikant R. Fast Algorithms for Mining Association Rules. Proc. Int. Conf. Very Large Data Bases, 1994, p. 487-499.
[3] M. Houtsma and A. Swami. Set-Oriented Mining of Association Rules. Research Report RJ 9567, IBM Almaden Research Center, San Jose, California, October 1993.
[4] Lecture notes and presentation slides of Professor Anita Wasilewska, State University of New York, Stony Brook.
[5] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann/Elsevier India, 2001.
[6] Arun Pujari, Data Mining Techniques, Universities Press (India) Pvt. Ltd., 2001.
[7] Qi Luo, Advancing Knowledge Discovery and Data Mining, 2008 Workshop on Knowledge and Data Mining, p. 3-5.
[8] Rupnik, Kukar, Bajec, Krisper, DMDSS: Data Mining Based Decision Support System to Integrate Data Mining and Decision Support. 28th Int. Conf. Information Technology Interfaces ITI 2006, June 19-22, 2006, Cavtat, Croatia.
-
Thank You
Any Queries?