Business Intelligence & Data Mining-9
TRANSCRIPT
-
8/10/2019 Business Intelligence & Data Mining-9
1/35
Association Rules
Market Basket Analysis
Association Rules
Usually applied to market baskets, but other applications are possible.
Useful rules contain novel and actionable information: e.g., on Thursdays grocery customers are likely to buy diapers and beer together.
Trivial rules contain already-known information: e.g., people who buy maintenance agreements are the ones who have also bought large appliances.
Some novel rules may not be useful: e.g., new hardware stores most commonly sell toilet rings.
Association Rule: Basic Concepts
Given: (1) a set of transactions; (2) each transaction is a set of items (e.g., items purchased by a customer in a visit)
Find: (all?) rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications
Retailing (what other products should the store stock up on?)
Attached mailing in direct marketing
Market basket analysis (what do people buy together?)
Catalog design (which items should appear next to each other?)
What Is Association Rule Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Examples. Rule form: Body => Head [support, confidence].
buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, CS, A) [1%, 75%]
Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support
support, s: probability that a transaction contains {X & Y & Z}
confidence, c: conditional probability that a transaction having {X & Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A => C (50%, 66.6%)
C => A (50%, 100%)

[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap who buy both]
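As a concrete check of these definitions, here is a small sketch (my own code, not from the lecture) that computes support and confidence over the transaction table above:

```python
# Toy transaction database from the slide (minsup = minconf = 50%).
# The helper functions are my own illustration of the definitions.

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(transaction contains rhs | transaction contains lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # 0.5  -> rule A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))   # 0.666... (2 of 3 A-transactions contain C)
print(confidence({"C"}, {"A"}, transactions))   # 1.0  (every C-transaction contains A)
```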
Mining Association Rules: An Example
For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle (Agrawal, 1995): any subset of a frequent itemset must also be frequent.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Generate C1: all unique 1-itemsets
Generate L1: all 1-itemsets with minimum support
Join Step: Ck is generated by joining Lk-1 with L1; since any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, only frequent itemsets need to be extended
Prune Step: Lk is generated by selecting from Ck those itemsets with minimum support
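The generate-join-prune-count loop above can be sketched in Python (a minimal illustration, not the lecturer's implementation; the names `apriori`, `db`, and `minsup` are mine, and `minsup` is an absolute count):

```python
# Minimal Apriori sketch. Itemsets are frozensets; one database scan per level.
from itertools import combinations

def apriori(db, minsup):
    db = [frozenset(t) for t in db]
    # C1 / L1: unique 1-itemsets with minimum support
    counts = {}
    for t in db:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c >= minsup}
    frequent = {s: counts[s] for s in L}
    k = 2
    while L:
        # Join step: size-k candidates from unions of frequent (k-1)-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Apriori principle: drop candidates with an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(sub) in L for sub in combinations(c, k - 1))}
        # Count step: scan the database once for this level
        counts = {c: sum(c <= t for t in db) for c in C}
        L = {c for c, n in counts.items() if n >= minsup}
        frequent.update((c, counts[c]) for c in L)
        k += 1
    return frequent

# The four-transaction database used on the next slide:
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, minsup=2))
```

Run on the example database with minsup = 2, this reproduces L1, L2, and L3 = {2, 3, 5} from the worked example that follows.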
The Apriori Algorithm: An Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D -> C1:
{1} 2, {2} 3, {3} 3, {4} 1, {5} 3

L1 (support >= 2):
{1} 2, {2} 3, {3} 3, {5} 3

C2 (candidates from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D -> C2 counts:
{1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2

L2:
{1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

C3:
{1 3 5}, {2 3 5}

Scan D -> L3:
{2 3 5} 2
Is Apriori Fast Enough?
Performance Bottlenecks
The core of the Apriori algorithm: use frequent (k-1)-itemsets to generate candidate frequent k-itemsets; use database scans and matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation
Huge candidate sets: 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
Multiple scans of the database: needs (n+1) scans, where n is the length of the longest pattern
Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine completeness
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
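A naive version of this subset function can be sketched as follows (my own illustration: it enumerates each transaction's k-subsets and checks them against the candidate set; the hash-tree described above exists to speed up exactly this lookup, and is not implemented here):

```python
# For each transaction, enumerate its k-subsets and bump the count of any
# subset that is a candidate. This is what the hash-tree computes faster.
from itertools import combinations

def count_supports(db, candidates, k):
    counts = {c: 0 for c in candidates}
    for t in db:
        for sub in combinations(sorted(t), k):
            s = frozenset(sub)
            if s in counts:
                counts[s] += 1
    return counts

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [frozenset(p) for p in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]]
print(count_supports(db, C2, 2))   # {2,5} occurs 3 times, {1,2} once, etc.
```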
Criticism of Support and Confidence
Example 1 (Aggarwal & Yu, PODS '98):
Among 5000 students: 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal
play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
not play basketball => eat cereal [35%, 87.5%] has lower support but higher confidence!
play basketball => not eat cereal [20%, 33.3%] is more informative, although with lower support and confidence

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
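The numbers in this table can be re-checked directly (a small sketch of my own, not from the slides):

```python
# The 2x2 basketball/cereal table above, checked numerically.
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support    = both / n           # 0.40  -> the rule's 40% support
confidence = both / basketball  # 0.666... -> the rule's 66.7% confidence
base_rate  = cereal / n         # 0.75  -> overall fraction eating cereal

# Confidence is BELOW the base rate: playing basketball actually lowers
# the chance of eating cereal, despite the seemingly high 66.7% confidence.
print(confidence < base_rate)   # True
print(support, confidence, base_rate)
```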
Criticism of Support and Confidence
Example 2:
X and Y: positively correlated
X and Z, Y and Z: negatively correlated
Yet the support and confidence of X => Z dominate
We need a measure of dependent or correlated events: P(B|A)/P(B) is called the lift of rule A => B

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule    Support   Confidence
X=>Y    25%       50%
X=>Z    37.5%     75%
Y=>Z    12.5%     50%
Other Interestingness Measures: lift
Lift = P(B|A)/P(B) = P(A ^ B) / (P(A) * P(B))
takes both P(A) and P(B) into consideration
P(A ^ B) = P(A) * P(B) if A and B are independent events
If lift < 1, A and B are negatively correlated
If lift > 1, A and B are positively correlated

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule    Support   Lift
X=>Y    25%       2
X=>Z    37.5%     0.86
Y=>Z    12.5%     0.57
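The lift values in the table can be reproduced from the eight rows (a minimal sketch; the helper names `p` and `lift` are mine):

```python
# The X/Y/Z example above: 8 rows, lift computed from raw counts.
rows = list(zip(
    [1, 1, 1, 1, 0, 0, 0, 0],   # X
    [1, 1, 0, 0, 0, 0, 0, 0],   # Y
    [0, 1, 1, 1, 1, 1, 1, 1],   # Z
))

def p(pred):
    """Empirical probability of a predicate over the rows."""
    return sum(1 for r in rows if pred(r)) / len(rows)

def lift(a, b):
    # lift(A => B) = P(A and B) / (P(A) * P(B))
    return p(lambda r: a(r) and b(r)) / (p(a) * p(b))

X, Y, Z = (lambda r: r[0] == 1), (lambda r: r[1] == 1), (lambda r: r[2] == 1)
print(lift(X, Y))   # 2.0    -> positively correlated
print(lift(X, Z))   # ~0.857 -> negatively correlated (0.86 in the table)
print(lift(Y, Z))   # ~0.571 -> negatively correlated (0.57 in the table)
```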
Extensions
Multiple-Level Association Rules
Items often form a hierarchy. Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels can be quite useful.
The transaction database can be encoded based on dimensions and levels.

[Hierarchy diagram: Food -> milk, bread; milk -> skim, 2% (brands: Fraser, Sunset); bread -> white, wheat]
Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules (ancestors): milk => bread [20%, 60%]
Then find their lower-level weaker rules (descendants): 2% milk => wheat bread [6%, 50%]
Variations of mining multiple-level association rules:
Level-crossed association rules: 2% milk => Wonder wheat bread
Association rules with multiple, alternative hierarchies: 2% milk => Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
+ One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
- Lower-level items do not occur as frequently, so if the support threshold is
  too high: miss low-level associations
  too low: generate too many high-level associations
Reduced Support: reduced minimum support at lower levels
Needs modification to the basic algorithm
Uniform Support
Multi-level mining with uniform support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to ancestor relationships between items.
Example:
milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the expected value based on the rule's ancestor.
Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items: milk (15%), bread (10%)
Then mine their lower-level, weaker frequent itemsets: 2% milk (5%), wheat bread (4%)
Different min-support thresholds across levels lead to different algorithms:
If adopting the same min-support across levels, reject itemset t if any of t's ancestors is infrequent
If adopting reduced min-support at lower levels, examine only those descendants whose ancestors are frequent and whose own support is >= the reduced min-support
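The reduced-support, progressive-deepening idea can be sketched as follows (the hierarchy, thresholds, and support values are illustrative, borrowed from the milk/bread examples above; the function and variable names are mine):

```python
# Sketch of reduced min-support across two levels of an item hierarchy:
# a lower-level item is examined only if its ancestor was frequent.

hierarchy = {                      # child -> parent (illustrative)
    "2% milk": "milk", "skim milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
}
level_minsup = {1: 0.10, 2: 0.04}  # reduced support at the lower level

def frequent_at_level(supports, level, frequent_parents):
    out = {}
    for item, sup in supports[level].items():
        parent = hierarchy.get(item)
        # progressive deepening: examine descendants of frequent ancestors only
        if level == 1 or parent in frequent_parents:
            if sup >= level_minsup[level]:
                out[item] = sup
    return out

supports = {
    1: {"milk": 0.15, "bread": 0.10},
    2: {"2% milk": 0.05, "skim milk": 0.03, "wheat bread": 0.04},
}
L1 = frequent_at_level(supports, 1, set())
L2 = frequent_at_level(supports, 2, set(L1))
print(L1)   # milk and bread both pass the 10% level-1 threshold
print(L2)   # 2% milk and wheat bread pass 4%; skim milk (3%) does not
```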
Sequence pattern mining
A sequence events database consists of sequences of values or events changing with time, with data recorded at regular intervals
Characteristic sequence features: trend, cycle, seasonal, irregular
Examples:
Financial: stock prices, inflation
Biomedical: blood pressure
Meteorological: precipitation
Two types of sequence data
Event series
Record events that happen at certain times
E.g., network logins
Time series
Record changes of certain (typically numeric) values over time
E.g., stock price movements, blood pressure
Event series
A series can be represented in two ways:
As a sequence (string) of events, with an empty slot if no event occurs at a given time; this makes multiple simultaneous events hard to represent
As a set of tuples {(time, event)}, which allows multiple events at the same time
Types of interesting info
Which events happen often (not too interesting)
Which groups of events happen often: people who rent Star Wars also rent Star Trek
Which sequences of events happen often: renting Star Wars, then Empire Strikes Back, then Return of the Jedi, in that order
Associations of events within a time window: people who rent Star Wars tend to rent Empire Strikes Back within one week
Similarities/Differences with Association Rules
Similarities:
Groups of events : frequent itemsets
Associations : association rules
Differences:
Notion of (time) windows: people who rent Star Wars tend to rent Empire Strikes Back within one week
Ordering of events is important
Episodes
A partially ordered sequence of events

[Diagram: three episode types]
Parallel episode {A, B}: B follows A OR A follows B (order immaterial)
Serial episode A -> B: B follows A
General episode: order between A & B unknown or immaterial, but A & B precede C
Sub-episode / super-episode
If A, B & C occur within a time window:
A & B is a sub-episode of A, B & C
A, B & C is a super-episode of A, B, C, A & B, and B & C
Frequent episodes / Episode Rules
Frequent episodes: find episodes that appear often
Episode rules:
Used to emphasize the effect of events on episodes
Support/confidence as defined for association rules
Example (window size = 11):
AB-C-DEABE-F-A-DFECDAABBCDE
Episode Rules : Example
Meaning: given that the episode on the left appears, the episode on the right appears 80% of the time.
This essentially says that when (A, B) appears, C then appears (within a given window size).

[Diagram: parallel episode {A, B} => episode with A, B followed by C; window size 10: support 4%, confidence 80%]
Mining episode rules
Apriori principle for episodes: an episode can be frequent only if all its sub-episodes are frequent
Thus an Apriori-based algorithm can be applied
However, there are a few tricky issues
Mining episode rules
Recognizing episodes in sequences:
Parallel episodes: standard association-rule techniques
Serial/general episodes: finite-state-machine-based construction
Alternative: count parallel episodes first, then use them to generate candidate episodes of other types
Counting the number of windows:
One event appears in w windows for window size w
O.K. if the sequence is long, as the ratios even out
However, when the sequence is short, the edges can dominate
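The window-counting convention and its edge effect can be sketched as follows (my own code; the toy sequence is made up; windows are slid past both ends of the sequence so that an event at a single time point falls inside exactly w windows, which is what makes the edges dominate for short sequences):

```python
# Count windows [start, start+w) that contain a given parallel episode.

def windows_containing(sequence, episode, w):
    """sequence: list of (time, event) tuples; episode: set of event types."""
    times = [t for t, _ in sequence]
    lo, hi = min(times) - w + 1, max(times)   # slide past both edges
    count = 0
    for start in range(lo, hi + 1):
        inside = {e for t, e in sequence if start <= t < start + w}
        if episode <= inside:                 # every episode event in window
            count += 1
    return count

seq = [(1, "A"), (3, "B"), (4, "C"), (9, "A")]
# Each isolated occurrence of A lies in exactly w = 3 windows:
print(windows_containing(seq, {"A"}, 3))        # 6 (two occurrences x 3 windows)
print(windows_containing(seq, {"A", "B"}, 3))   # 1 (only one window holds both)
```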