Business Intelligence & Data Mining-9
TRANSCRIPT
-
8/10/2019 Business Intelligence & Data Mining-9
1/35
Association Rules
Market Basket Analysis
Association Rules
Usually applied to market baskets, but other applications are possible.
Useful rules contain novel and actionable information: e.g., on Thursdays grocery customers are likely to buy diapers and beer together.
Trivial rules contain already-known information: e.g., people who buy maintenance agreements are the ones who have also bought large appliances.
Some novel rules may not be useful: e.g., new hardware stores most commonly sell toilet rings.
Association Rule: Basic Concepts
Given: (1) a set of transactions; (2) each transaction is a set of items (e.g., items purchased by a customer in a visit)
Find: (all?) rules that correlate the presence of one set of items with that of another set of items
E.g., 98% of people who purchase tires and auto accessories also get automotive services done
Applications
Retailing (what other products should the store stock up on?)
Attached mailing in direct marketing
Market basket analysis (what do people buy together?)
Catalog design (which items should appear next to each other?)
What Is Association Rule Mining?
Association rule mining: finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
Examples. Rule form: Body => Head [support, confidence].
buys(x, diapers) => buys(x, beers) [0.5%, 60%]
major(x, CS) ^ takes(x, DB) => grade(x, CS, A) [1%, 75%]
Rule Measures: Support and Confidence
Find all the rules X & Y => Z with minimum confidence and support
support, s: probability that a transaction contains {X & Y & Z}
confidence, c: conditional probability that a transaction having {X & Y} also contains Z

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

With minimum support 50% and minimum confidence 50%, we have:
A => C (50%, 66.6%)
C => A (50%, 100%)

[Venn diagram: customers who buy beer, customers who buy diapers, and the overlap who buy both]
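As a concrete check of these definitions, here is a small sketch (my own code, not from the lecture) that computes support and confidence over the transaction table above:

```python
# Toy transaction database from the slide (minsup = minconf = 50%).
# The helper functions are my own illustration of the definitions.

transactions = [
    {"A", "B", "C"},   # TID 2000
    {"A", "C"},        # TID 1000
    {"A", "D"},        # TID 4000
    {"B", "E", "F"},   # TID 5000
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(transaction contains rhs | transaction contains lhs)."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"A", "C"}, transactions))        # 0.5  -> rule A => C has 50% support
print(confidence({"A"}, {"C"}, transactions))   # 0.666... (2 of 3 A-transactions contain C)
print(confidence({"C"}, {"A"}, transactions))   # 1.0  (every C-transaction contains A)
```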
Mining Association Rules: An Example
For rule A => C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%

The Apriori principle (Agrawal, 1995): any subset of a frequent itemset must also be frequent.

Transaction ID   Items Bought
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Min. support 50%, min. confidence 50%

Frequent Itemset   Support
{A}                75%
{B}                50%
{C}                50%
{A,C}              50%
Mining Frequent Itemsets: the Key Step
Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
Use the frequent itemsets to generate association rules.
The Apriori Algorithm
Generate C1: all unique 1-itemsets
Generate L1: all 1-itemsets with minimum support
Join Step: Ck is generated by joining Lk-1 with L1; since any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset, only frequent itemsets need to be extended
Prune Step: Lk is generated by selecting from Ck those itemsets with minimum support
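The generate-join-prune-count loop above can be sketched in Python (a minimal illustration, not the lecturer's implementation; the names `apriori`, `db`, and `minsup` are mine, and `minsup` is an absolute count):

```python
# Minimal Apriori sketch. Itemsets are frozensets; one database scan per level.
from itertools import combinations

def apriori(db, minsup):
    db = [frozenset(t) for t in db]
    # C1 / L1: unique 1-itemsets with minimum support
    counts = {}
    for t in db:
        for item in t:
            s = frozenset([item])
            counts[s] = counts.get(s, 0) + 1
    L = {s for s, c in counts.items() if c >= minsup}
    frequent = {s: counts[s] for s in L}
    k = 2
    while L:
        # Join step: size-k candidates from unions of frequent (k-1)-itemsets
        C = {a | b for a in L for b in L if len(a | b) == k}
        # Apriori principle: drop candidates with an infrequent (k-1)-subset
        C = {c for c in C
             if all(frozenset(sub) in L for sub in combinations(c, k - 1))}
        # Count step: scan the database once for this level
        counts = {c: sum(c <= t for t in db) for c in C}
        L = {c for c, n in counts.items() if n >= minsup}
        frequent.update((c, counts[c]) for c in L)
        k += 1
    return frequent

# The four-transaction database used on the next slide:
db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(apriori(db, minsup=2))
```

Run on the example database with minsup = 2, this reproduces L1, L2, and L3 = {2, 3, 5} from the worked example that follows.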
The Apriori Algorithm: An Example

Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D -> C1:
{1} 2, {2} 3, {3} 3, {4} 1, {5} 3

L1 (support >= 2):
{1} 2, {2} 3, {3} 3, {5} 3

C2 (candidates from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D -> C2 counts:
{1 2} 1, {1 3} 2, {1 5} 1, {2 3} 2, {2 5} 3, {3 5} 2

L2:
{1 3} 2, {2 3} 2, {2 5} 3, {3 5} 2

C3:
{1 3 5}, {2 3 5}

Scan D -> L3:
{2 3 5} 2
Is Apriori Fast Enough?
Performance Bottlenecks
The core of the Apriori algorithm: use frequent (k-1)-itemsets to generate candidate frequent k-itemsets; use database scans and matching to collect counts for the candidate itemsets
The bottleneck of Apriori: candidate generation
Huge candidate sets: 10^4 frequent 1-itemsets will generate ~10^7 candidate 2-itemsets; to discover a frequent pattern of size 100, e.g., {a1, a2, ..., a100}, one needs to generate 2^100 ≈ 10^30 candidates
Multiple scans of the database: needs (n+1) scans, where n is the length of the longest pattern
Methods to Improve Apriori's Efficiency
Hash-based itemset counting: a k-itemset whose corresponding hash bucket count is below the threshold cannot be frequent
Transaction reduction: a transaction that does not contain any frequent k-itemset is useless in subsequent scans
Sampling: mine on a subset of the given data with a lower support threshold, plus a method to determine completeness
Dynamic itemset counting: add new candidate itemsets only when all of their subsets are estimated to be frequent
How to Count Supports of Candidates?
Why is counting supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
A leaf node of the hash-tree contains a list of itemsets and counts
An interior node contains a hash table
Subset function: finds all the candidates contained in a transaction
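A naive version of this subset function can be sketched as follows (my own illustration: it enumerates each transaction's k-subsets and checks them against the candidate set; the hash-tree described above exists to speed up exactly this lookup, and is not implemented here):

```python
# For each transaction, enumerate its k-subsets and bump the count of any
# subset that is a candidate. This is what the hash-tree computes faster.
from itertools import combinations

def count_supports(db, candidates, k):
    counts = {c: 0 for c in candidates}
    for t in db:
        for sub in combinations(sorted(t), k):
            s = frozenset(sub)
            if s in counts:
                counts[s] += 1
    return counts

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
C2 = [frozenset(p) for p in [(1, 2), (1, 3), (1, 5), (2, 3), (2, 5), (3, 5)]]
print(count_supports(db, C2, 2))   # {2,5} occurs 3 times, {1,2} once, etc.
```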
Criticism of Support and Confidence
Example 1 (Aggarwal & Yu, PODS '98):
Among 5000 students: 3000 play basketball, 3750 eat cereal, 2000 both play basketball and eat cereal
play basketball => eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%
not play basketball => eat cereal [35%, 87.5%] has lower support but higher confidence!
play basketball => not eat cereal [20%, 33.3%] is more informative, although with lower support and confidence

             basketball   not basketball   sum(row)
cereal       2000         1750             3750
not cereal   1000         250              1250
sum(col.)    3000         2000             5000
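The numbers in this table can be re-checked directly (a small sketch of my own, not from the slides):

```python
# The 2x2 basketball/cereal table above, checked numerically.
n = 5000
basketball, cereal, both = 3000, 3750, 2000

support    = both / n           # 0.40  -> the rule's 40% support
confidence = both / basketball  # 0.666... -> the rule's 66.7% confidence
base_rate  = cereal / n         # 0.75  -> overall fraction eating cereal

# Confidence is BELOW the base rate: playing basketball actually lowers
# the chance of eating cereal, despite the seemingly high 66.7% confidence.
print(confidence < base_rate)   # True
print(support, confidence, base_rate)
```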
Criticism of Support and Confidence
Example 2:
X and Y: positively correlated
X and Z, Y and Z: negatively correlated
Yet the support and confidence of X => Z dominate
We need a measure of dependent or correlated events: P(B|A)/P(B) is called the lift of rule A => B

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule    Support   Confidence
X=>Y    25%       50%
X=>Z    37.5%     75%
Y=>Z    12.5%     50%
Other Interestingness Measures: lift
Lift = P(B|A)/P(B) = P(A ^ B) / (P(A) * P(B))
takes both P(A) and P(B) into consideration
P(A ^ B) = P(A) * P(B) if A and B are independent events
If lift < 1, A and B are negatively correlated
If lift > 1, A and B are positively correlated

X: 1 1 1 1 0 0 0 0
Y: 1 1 0 0 0 0 0 0
Z: 0 1 1 1 1 1 1 1

Rule    Support   Lift
X=>Y    25%       2
X=>Z    37.5%     0.86
Y=>Z    12.5%     0.57
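The lift values in the table can be reproduced from the eight rows (a minimal sketch; the helper names `p` and `lift` are mine):

```python
# The X/Y/Z example above: 8 rows, lift computed from raw counts.
rows = list(zip(
    [1, 1, 1, 1, 0, 0, 0, 0],   # X
    [1, 1, 0, 0, 0, 0, 0, 0],   # Y
    [0, 1, 1, 1, 1, 1, 1, 1],   # Z
))

def p(pred):
    """Empirical probability of a predicate over the rows."""
    return sum(1 for r in rows if pred(r)) / len(rows)

def lift(a, b):
    # lift(A => B) = P(A and B) / (P(A) * P(B))
    return p(lambda r: a(r) and b(r)) / (p(a) * p(b))

X, Y, Z = (lambda r: r[0] == 1), (lambda r: r[1] == 1), (lambda r: r[2] == 1)
print(lift(X, Y))   # 2.0    -> positively correlated
print(lift(X, Z))   # ~0.857 -> negatively correlated (0.86 in the table)
print(lift(Y, Z))   # ~0.571 -> negatively correlated (0.57 in the table)
```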
Extensions
Multiple-Level Association Rules
Items often form a hierarchy. Items at the lower level are expected to have lower support.
Rules regarding itemsets at appropriate levels can be quite useful.
The transaction database can be encoded based on dimensions and levels.

[Hierarchy diagram: Food -> milk, bread; milk -> skim, 2% (brands: Fraser, Sunset); bread -> white, wheat]
Mining Multi-Level Associations
A top-down, progressive deepening approach:
First find high-level strong rules (ancestors): milk => bread [20%, 60%]
Then find their lower-level weaker rules (descendants): 2% milk => wheat bread [6%, 50%]
Variations of mining multiple-level association rules:
Level-crossed association rules: 2% milk => Wonder wheat bread
Association rules with multiple, alternative hierarchies: 2% milk => Wonder bread
Multi-level Association: Uniform Support vs. Reduced Support
Uniform Support: the same minimum support for all levels
+ One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support
- Lower-level items do not occur as frequently, so if the support threshold is
  too high: miss low-level associations
  too low: generate too many high-level associations
Reduced Support: reduced minimum support at lower levels
Needs modification to the basic algorithm
Uniform Support
Multi-level mining with uniform support:
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Multi-level Association: Redundancy Filtering
Some rules may be redundant due to ancestor relationships between items.
Example:
milk => wheat bread [support = 8%, confidence = 70%]
2% milk => wheat bread [support = 2%, confidence = 72%]
We say the first rule is an ancestor of the second rule.
A rule is redundant if its support is close to the expected value based on the rule's ancestor.
Multi-Level Mining: Progressive Deepening
A top-down, progressive deepening approach:
First mine high-level frequent items: milk (15%), bread (10%)
Then mine their lower-level, weaker frequent itemsets: 2% milk (5%), wheat bread (4%)
Different min-support thresholds across levels lead to different algorithms:
If adopting the same min-support across levels, reject itemset t if any of t's ancestors is infrequent
If adopting reduced min-support at lower levels, examine only those descendants whose ancestors are frequent and whose own support is >= the reduced min-support
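The reduced-support, progressive-deepening idea can be sketched as follows (the hierarchy, thresholds, and support values are illustrative, borrowed from the milk/bread examples above; the function and variable names are mine):

```python
# Sketch of reduced min-support across two levels of an item hierarchy:
# a lower-level item is examined only if its ancestor was frequent.

hierarchy = {                      # child -> parent (illustrative)
    "2% milk": "milk", "skim milk": "milk",
    "wheat bread": "bread", "white bread": "bread",
}
level_minsup = {1: 0.10, 2: 0.04}  # reduced support at the lower level

def frequent_at_level(supports, level, frequent_parents):
    out = {}
    for item, sup in supports[level].items():
        parent = hierarchy.get(item)
        # progressive deepening: examine descendants of frequent ancestors only
        if level == 1 or parent in frequent_parents:
            if sup >= level_minsup[level]:
                out[item] = sup
    return out

supports = {
    1: {"milk": 0.15, "bread": 0.10},
    2: {"2% milk": 0.05, "skim milk": 0.03, "wheat bread": 0.04},
}
L1 = frequent_at_level(supports, 1, set())
L2 = frequent_at_level(supports, 2, set(L1))
print(L1)   # milk and bread both pass the 10% level-1 threshold
print(L2)   # 2% milk and wheat bread pass 4%; skim milk (3%) does not
```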
Sequence pattern mining
A sequence events database consists of sequences of values or events changing with time, with data recorded at regular intervals
Characteristic sequence features: trend, cycle, seasonal, irregular
Examples:
Financial: stock prices, inflation
Biomedical: blood pressure
Meteorological: precipitation
Two types of sequence data
Event series
Record events that happen at certain times
E.g., network logins
Time series
Record changes of certain (typically numeric) values over time
E.g., stock price movements, blood pressure
Event series
A series can be represented in two ways:
As a sequence (string) of events, with an empty slot if no event occurs at a given time; this makes multiple simultaneous events hard to represent
As a set of tuples {(time, event)}, which allows multiple events at the same time
Types of interesting info
Which events happen often (not too interesting)
Which groups of events happen often: people who rent Star Wars also rent Star Trek
Which sequences of events happen often: renting Star Wars, then Empire Strikes Back, then Return of the Jedi, in that order
Associations of events within a time window: people who rent Star Wars tend to rent Empire Strikes Back within one week
Similarities/Differences with Association Rules
Similarities:
Groups of events : frequent itemsets
Associations : association rules
Differences:
Notion of (time) windows: people who rent Star Wars tend to rent Empire Strikes Back within one week
Ordering of events is important
Episodes
A partially ordered sequence of events

[Diagram: three episode types]
Parallel episode {A, B}: B follows A OR A follows B (order immaterial)
Serial episode A -> B: B follows A
General episode: order between A & B unknown or immaterial, but A & B precede C
Sub-episode / super-episode
If A, B & C occur within a time window:
A & B is a sub-episode of A, B & C
A, B & C is a super-episode of A, B, C, A & B, and B & C
Frequent episodes / Episode Rules
Frequent episodes: find episodes that appear often
Episode rules:
Used to emphasize the effect of events on episodes
Support/confidence as defined for association rules
Example (window size = 11):
AB-C-DEABE-F-A-DFECDAABBCDE
Episode Rules : Example
Meaning: given that the episode on the left appears, the episode on the right appears 80% of the time.
This essentially says that when (A, B) appears, C then appears (within a given window size).

[Diagram: parallel episode {A, B} => episode with A, B followed by C; window size 10: support 4%, confidence 80%]
Mining episode rules
Apriori principle for episodes: an episode can be frequent only if all its sub-episodes are frequent
Thus an Apriori-based algorithm can be applied
However, there are a few tricky issues
Mining episode rules
Recognizing episodes in sequences:
Parallel episodes: standard association-rule techniques
Serial/general episodes: finite-state-machine-based construction
Alternative: count parallel episodes first, then use them to generate candidate episodes of other types
Counting the number of windows:
One event appears in w windows for window size w
O.K. if the sequence is long, as the ratios even out
However, when the sequence is short, the edges can dominate
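The window-counting convention and its edge effect can be sketched as follows (my own code; the toy sequence is made up; windows are slid past both ends of the sequence so that an event at a single time point falls inside exactly w windows, which is what makes the edges dominate for short sequences):

```python
# Count windows [start, start+w) that contain a given parallel episode.

def windows_containing(sequence, episode, w):
    """sequence: list of (time, event) tuples; episode: set of event types."""
    times = [t for t, _ in sequence]
    lo, hi = min(times) - w + 1, max(times)   # slide past both edges
    count = 0
    for start in range(lo, hi + 1):
        inside = {e for t, e in sequence if start <= t < start + w}
        if episode <= inside:                 # every episode event in window
            count += 1
    return count

seq = [(1, "A"), (3, "B"), (4, "C"), (9, "A")]
# Each isolated occurrence of A lies in exactly w = 3 windows:
print(windows_containing(seq, {"A"}, 3))        # 6 (two occurrences x 3 windows)
print(windows_containing(seq, {"A", "B"}, 3))   # 1 (only one window holds both)
```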