
http://www.csse.monash.edu.au/~webb

Intelligent Systems

Exploratory pattern discovery

Geoff Webb

Copyright © Geoffrey I Webb 2006

Outline

• Tutorial covers
  • Data Mining
  • Exploratory Pattern Discovery
  • Association rules
  • Interestingness (objective functions)
  • False discoveries
  • Limitations of minimum support
  • K-most interesting pattern discovery
  • Itemset discovery
  • Contrast rule discovery
  • Impact rules


Part 1:

Data Mining


Data mining

• Data mining seeks to discover unanticipated knowledge from data

• Exponential growth in the quantity of data stored gives urgency to the pursuit of practical analytic approaches that address
  • Large volumes of data
  • Low quality data
  • Post-hoc analysis
  • Loosely defined analytical objectives


So what’s the big deal?

• Don’t statistics identify patterns in data?
• Conventional statistics do not address
  • searching quintillions of potential correlations, eg.
    • market basket data $2^{100,000}$
    • US phone calls $2^{100,000,000}$
    • human genome $2^{3,000,000,000}$
  • selecting most interesting from millions of correlations


Example: Should we stock vitamins?

• Major national retailer with detailed records of customer purchasing behaviour

• Considering deleting a low volume product line

• Does data provide evidence of indirect contribution to bottom line?


Example: Steel rolling mill

• Complex control problem for expensive production process influenced by input materials, desired output and state of equipment

• Currently uses imperfect model

• Objective: use data to identify circumstances in which model is deficient

Photo courtesy G.C. Goodwin, S. Graebe and M. Salgado. Control System Design, Prentice Hall, 2000.


Example: Synchrotron x-ray data analysis

• Synchrotron x-ray scatter patterns reflect micro-structure of material analysed.

[Figure: two scatter patterns, labelled Normal and Malignant]

• Can x-ray scatter plots be used for cancer diagnosis?


A growth area

• The sum of human data stored doubles every 7 years

• Data mining is critical to commerce
  • Fraud detection
  • Information retrieval
• and to science
  • Bioinformatics
  • Mass data analysis


Large unmet demand for good PhDs!


Beyond statistics

• Data mining goes beyond the traditional realm of statistics by encompassing
  • problem formulation
  • interactions between the business process and the analytic process
  • knowledge management
  • data manipulation

[Diagram: Analytics, Business processes, Data, Other knowledge sources]


Generating models

• The core of the data mining process is generating models from data

  Eg neural networks, support vector machines, decision trees

• Most research concentrates on this aspect
• Surrounding activities are also very important
  • Defining analytic task
  • Sourcing data
  • Preprocessing data
  • Identifying appropriate forms of model
  • Identifying appropriate techniques for generating models
  • Interpreting models
  • Applying models


Part 2:

Exploratory Pattern Discovery


The perils of model selection

• Many data mining techniques seek to identify a single model that best fits the observed data.

• In many applications many models will (almost) equally fit the data

bruises=f & gill-attachment=f & gill-spacing=c & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]

bruises=f & gill-spacing=c & veil-color=w & ring-number=o → poisonous [Coverage=0.406 (3296); Support=0.388 (3152); Confidence=0.956]


Perils of model selection (cont.)

• Data mining systems often make arbitrary choices
  • without warning
• A system may have no basis on which to select models, but an expert often will
  • ease / cost of operationalisation
  • comprehensibility / compatibility with existing knowledge and beliefs
  • social / legal / ethical / political acceptability


Exploratory pattern discovery

• Exploratory pattern discovery seeks all patterns that satisfy user-defined constraints

• The user can select from these patterns
  • can use criteria that might be infeasible to quantify


Patterns

• Rules
  • <antecedent> → <consequent>
• Itemsets
  • <condition1> & <condition2> & …
• Sequences
  • <event1>, <event2>, …

• Structures


Rules

• <antecedent> → <consequent>
  • IF <antecedent> THEN <consequent>
  • IF temp > 36.8 AND pulse > 120 THEN call doctor
• Antecedent
  = condition
  = left hand side, LHS
  = conditions under which the consequent holds / applies
• Consequent
  = conclusion
  = right hand side, RHS
  = action to perform or conclusion to reach


Theoretical foundations

• Substantial bodies of theory in Formal Logic, Computational Logic, and Artificial Intelligence can be brought to bear to utilise rules once they are inferred.

• If the antecedent entails the consequent and the antecedent is known (believed) then the consequent can be concluded.

• Can be extended to probabilistic basis.
• Supports complex reasoning.
• Modular knowledge representation.
  • can capture knowledge nuggets


Rule discovery as search

• Rule discovery can be viewed as search through a space of expressible rules.

• The rule space (search space / description space) can be partially ordered on generality.

• A → C is a generalisation of B → C iff B entails A (A must be true if B is true)
  • proper generalisation iff A does not also entail B
• If A → C is a generalisation of B → C then B → C is a specialisation of A → C.
• Eg. IF age > 30 THEN X is a generalisation of
  • IF age > 31 THEN X
  • IF age > 30 AND gender = male THEN X


Generalization lattice for antecedents

{}
{A} {B} {C} {D}
{A,B} {A,C} {A,D} {B,C} {B,D} {C,D}
{A,B,C} {A,B,D} {A,C,D} {B,C,D}
{A,B,C,D}


Search tree for antecedents

{}
{A} {B} {C} {D}
{A,B} {A,C} {A,D} {B,C} {B,D} {C,D}
{A,B,C} {A,B,D} {A,C,D} {B,C,D}
{A,B,C,D}
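
A search tree visits each node of the lattice once rather than once per path. The following is a minimal sketch of one common way to do that, assuming the items are kept in a fixed order and each node is only extended with items that come later in that order. The function name `search_tree` is hypothetical and not taken from the slides.

```python
def search_tree(items, prefix=()):
    """Yield every subset of `items` exactly once, depth first."""
    yield set(prefix)
    for i, item in enumerate(items):
        # Only items after `item` may extend this branch, so no subset repeats.
        yield from search_tree(items[i + 1:], prefix + (item,))


nodes = list(search_tree(["A", "B", "C", "D"]))
print(len(nodes))  # 16 nodes, one per subset of {A, B, C, D}
```

The fixed ordering is a design choice: it removes the duplicate paths of the lattice while still reaching every antecedent.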


Search tree with consequent propagation

{} {A,B,C,D}
{A} {B,C,D} | {B} {A,C,D} | {C} {A,B,D} | {D} {A,B,C}
{A,B} {C,D} | {A,C} {B,D} | {A,D} {B,C} | {B,C} {A,D} | {B,D} {A,C} | {C,D} {A,B}
{A,B,C} {D} | {A,B,D} {C} | {A,C,D} {B} | {B,C,D} {A}
{A,B,C,D} {}


Propositional rule discovery

• Antecedent and consequent are propositions

• Often restricted to antecedent and consequent both conjunctions of Boolean terms
  • IF temp > 36.8 AND pulse > 120 THEN blood pressure > 140 AND condition = critical


Rule discovery is inherently intractable

• If
  • there are n propositions,
  • antecedents can be any set of propositions and
  • consequents are a single proposition
  then
  • size of search space ≈ $n 2^n$
• It is essential to use powerful pruning techniques to limit the search space
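
To give a feel for how fast this estimate grows, here is a small sketch that evaluates n times 2 to the power n for a few values of n. The function name `rule_space_size` is an illustrative assumption.

```python
def rule_space_size(n):
    """Approximate number of rules: n consequents times 2**n antecedents."""
    return n * 2 ** n


for n in (10, 20, 100):
    print(n, rule_space_size(n))
```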


Part 3:

Association rules


Association rule discovery

• Developed for market basket analysis
  • a basket is a collection of products purchased in a single transaction
  • an itemset is a set of products
    • all baskets are itemsets
  • market basket analysis seeks to identify products that are associated with each other
    • diapers and beer
• Can generalize to itemset = any conjunction of Boolean terms


Transaction and tabular data

• Transaction data
  • Each record is a set of items involved in a single transaction
  • Eg. market basket, web site traversal, amino acids in a protein
• Tabular data
  • Each record consists of a vector of values for the predefined attributes or fields
  • Eg. A patient’s signs and symptoms, employee details, the amino acids at each site in a protein
• While association rules were developed for transaction data they generalise directly to attribute-value data


Support and confidence

• F(X) = proportion of records that satisfy condition X
• Coverage(A → C) = F(A)
• Support(A → C) = F(A & C)
• Confidence(A → C) = support(A → C) / coverage(A → C)
  • Maximum likelihood estimate of P(C | A)

[Diagram: regions labelled A and C]


Frequent itemsets

• An itemset is frequent if its cover equals or exceeds a user defined minimum

• Downward closure
  • frequency is anti-monotone
  • if an itemset I is not frequent then no specialization of I is frequent


Association rules

• Antecedent and consequent are frequent itemsets

• An association rule indicates that the presence of the antecedent increases the probability that the consequent will be present
  • bread & butter → honey


Association rule discovery

• Requires minimum support constraint
• Finds all rules that satisfy minimum support together with other user specified constraints such as minimum confidence
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • support(bread → honey) = 0.05
  • confidence(bread → honey) = 0.50
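
The bread and honey figures can be reproduced directly from the definitions of coverage, support and confidence. The sketch below is illustrative only: the function names and the synthetic list of baskets are assumptions, constructed so that the counts match the example (100 bread, 100 honey, 50 both, 1000 in total).

```python
def frequency(transactions, items):
    """F(X): proportion of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= t for t in transactions) / len(transactions)


def coverage(transactions, antecedent, consequent):
    return frequency(transactions, antecedent)


def support(transactions, antecedent, consequent):
    return frequency(transactions, set(antecedent) | set(consequent))


def confidence(transactions, antecedent, consequent):
    return (support(transactions, antecedent, consequent)
            / coverage(transactions, antecedent, consequent))


# Synthetic baskets: 50 with both items, 50 bread only, 50 honey only, 850 with neither.
transactions = ([{"bread", "honey"}] * 50 + [{"bread"}] * 50
                + [{"honey"}] * 50 + [set()] * 850)

print(support(transactions, {"bread"}, {"honey"}))     # 0.05
print(confidence(transactions, {"bread"}, {"honey"}))  # 0.5
```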


The frequent itemset approach

• Find all frequent itemsets
• Generate all association rules therefrom
• Assumes
  • a minimum support constraint
  • sparse data


Finding frequent itemsets

• Once frequent itemsets are found rule generation is straightforward

• Research has concentrated on efficient frequent itemset generation


The Apriori algorithm

Apriori(T, ε)
    L_1 ← frequent 1-itemsets relative to T
    k ← 2
    while L_{k-1} ≠ ∅
        C_k ← Generate(L_{k-1})
        for t ∈ T
            for c ∈ Subsets(C_k, t)
                count[c]++
        L_k ← { c ∈ C_k | count[c] ≥ ε }
        k++
    return the union of all L_k

TRANSACTIONS
    a,b,c
    a,b,d
    a,d

PROCESS, ε=2
    L_1  {{a},{b},{d}}
    C_2  {{a,b},{a,d},{b,d}}
    L_2  {{a,b},{a,d}}
    C_3  {{a,b,d}}
    L_3  {}
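
The pseudocode and trace above can be turned into a short runnable sketch. This is one possible reading rather than a reference implementation: the names `generate` and `apriori` are assumptions, ε is treated as an absolute count, and candidate generation is a plain join with no subset pruning, which is what makes {a,b,d} appear as a candidate in the trace.

```python
from itertools import combinations


def generate(prev_level, k):
    """Join frequent (k-1)-itemsets into candidate k-itemsets (no subset pruning)."""
    return {a | b for a, b in combinations(prev_level, 2) if len(a | b) == k}


def apriori(transactions, min_count):
    """Return all itemsets contained in at least `min_count` transactions."""
    transactions = [frozenset(t) for t in transactions]
    items = {item for t in transactions for item in t}
    level = {frozenset([i]) for i in items
             if sum(i in t for t in transactions) >= min_count}
    frequent = set(level)
    k = 2
    while level:
        candidates = generate(level, k)
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c for c, n in counts.items() if n >= min_count}
        frequent |= level
        k += 1
    return frequent


result = apriori([{"a", "b", "c"}, {"a", "b", "d"}, {"a", "d"}], min_count=2)
print(sorted(sorted(s) for s in result))
# [['a'], ['a', 'b'], ['a', 'd'], ['b'], ['d']]
```

Counting each candidate with a full pass over the data keeps the sketch short. Practical implementations usually add the subset-pruning step and a faster counting structure.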


Closed itemsets

• In practice many itemsets cover exactly the same items
  • Eg pregnant, pregnant & woman

• A closed itemset is the most specific itemset that covers a particular set of items

• More efficient to find all closed frequent itemsets than all frequent itemsets

• Can generate all association rules from closed itemsets


Closed Itemsets Example

Full set of itemsets for gill-size=n, gill-color=b & spore-print-color=w

gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-color=b [Coverage=1728]
gill-color=b & spore-print-color=w [Coverage=1728]
gill-size=n & gill-color=b [Coverage=1728]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]

Closed itemsets

gill-size=n [Coverage=2512]
spore-print-color=w [Coverage=2388]
gill-size=n & spore-print-color=w [Coverage=1824]
gill-size=n & gill-color=b & spore-print-color=w [Coverage=1728]
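
The reduction from seven itemsets to four can be checked mechanically. The sketch below assumes the working rule that an itemset is closed when no proper superset has the same coverage, and it only compares against the itemsets listed in the dictionary. The name `closed_itemsets` is hypothetical.

```python
# Coverage counts copied from the example; keys are itemsets.
coverage = {
    frozenset({"gill-size=n"}): 2512,
    frozenset({"spore-print-color=w"}): 2388,
    frozenset({"gill-size=n", "spore-print-color=w"}): 1824,
    frozenset({"gill-color=b"}): 1728,
    frozenset({"gill-color=b", "spore-print-color=w"}): 1728,
    frozenset({"gill-size=n", "gill-color=b"}): 1728,
    frozenset({"gill-size=n", "gill-color=b", "spore-print-color=w"}): 1728,
}


def closed_itemsets(coverage):
    """Keep itemsets that have no proper superset with the same coverage."""
    return [s for s in coverage
            if not any(s < t and coverage[s] == coverage[t] for t in coverage)]


for itemset in closed_itemsets(coverage):
    print(" & ".join(sorted(itemset)), coverage[itemset])
```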


Part 4:

Interestingness (objective functions)


Interestingness (Objective Functions)

• Need some means of selecting the most (potentially) interesting patterns

• Many different measures of interestingness may be relevant

• Most measures relate to the degree to which the antecedent and consequent are interdependent
  • P(A & C) – P(A) P(C)


Interestingness measures: lift

• lift = confidence / (cover(consequent) / n)
  • proportional increase in confidence in context of antecedent
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
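
As a worked check of the lift formula, the sketch below computes lift from raw record counts. The function name and parameter names are illustrative assumptions.

```python
def lift(n, cover_antecedent, cover_consequent, cover_both):
    """Lift of antecedent → consequent from raw record counts."""
    confidence = cover_both / cover_antecedent
    return confidence / (cover_consequent / n)


# 1000 transactions, 100 bread, 100 honey, 50 with both.
print(lift(n=1000, cover_antecedent=100, cover_consequent=100, cover_both=50))
```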


M-estimates

• Problem: many rules with low support will have unrealistically high confidence and lift
• Example: 1000 records, 500 females, 1 age>=90, 1 female & age>=90
  • confidence(age>=90 → female) = 1.00
  • lift(age>=90 → female) = 2.00
• M-estimate is Bayesian estimate of true confidence and lift
  • biases confidence toward prior
  • confidence estimate = (support + m * prior) / (coverage + m)
  • lift estimate = confidence estimate / prior
  • Eg confidence estimate = (1 + 2 * 0.5) / (1 + 2) = 0.667
    lift estimate = 0.667 / 0.500 = 1.333
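
The m-estimate arithmetic can be written out as a small sketch. Here support and coverage are record counts, as in the worked figures, and the function names are assumptions made for illustration.

```python
def m_estimate_confidence(support_count, coverage_count, prior, m):
    """Confidence shrunk toward `prior`; `m` sets the strength of the prior."""
    return (support_count + m * prior) / (coverage_count + m)


def m_estimate_lift(support_count, coverage_count, prior, m):
    return m_estimate_confidence(support_count, coverage_count, prior, m) / prior


# age>=90 → female: 1 matching record, 1 covered record, prior P(female) = 0.5, m = 2.
print(round(m_estimate_confidence(1, 1, prior=0.5, m=2), 3))  # 0.667
print(round(m_estimate_lift(1, 1, prior=0.5, m=2), 3))        # 1.333
```

With m = 0 the estimate reduces to the raw confidence, and larger values of m pull low-support rules more strongly toward the prior.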


Interestingness measures: leverage

• leverage = support – (cover(antecedent) / n) × (cover(consequent) / n)
  • absolute increase in comparison to expected cases if antecedent and consequent independent
• Also known as interest
• Example: 1000 transactions, 100 bread, 100 honey, 50 bread & honey
  • confidence(bread → honey) = 0.50
  • lift(bread → honey) = 5.00
  • leverage(bread → honey) = 0.04
• Example2: 1000 transactions, 10 batteries, 5 vodka, 1 batteries & vodka
  • lift(batteries → vodka) = 20.00
  • leverage(batteries → vodka) = 0.0009
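
The contrast between the two examples is easy to reproduce. The sketch below computes leverage as the observed joint proportion minus the proportion expected under independence, using record counts as inputs. The function name and parameter names are assumptions.

```python
def leverage(n, cover_antecedent, cover_consequent, cover_both):
    """Observed joint frequency minus the frequency expected under independence."""
    expected = (cover_antecedent / n) * (cover_consequent / n)
    return cover_both / n - expected


print(leverage(1000, 100, 100, 50))  # bread → honey
print(leverage(1000, 10, 5, 1))      # batteries → vodka
```

A rule with very high lift can still have leverage close to zero, because leverage is bounded by how many records the rule actually covers.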


Spurious rules

• If condition X is unrelated to conditions A and B,
  • confidence(A & X → B) ≈ confidence(A → B)
  • lift(A & X → B) ≈ lift(A → B)
  • Eg pregnant & AI Researcher → oedema
• One core rule can result in many spurious rules
• If problem ignored, majority of rules can be spurious!


Need to test up the generalization lattice

{} {A,B,C,D}
{A} {B,C,D} | {B} {A,C,D} | {C} {A,B,D} | {D} {A,B,C}
{A,B} {C,D} | {A,C} {B,D} | {A,D} {B,C} | {B,C} {A,D} | {B,D} {A,C} | {C,D} {A,B}
{A,B,C} {D} | {A,B,D} {C} | {A,C,D} {B} | {B,C,D} {A}
{A,B,C,D} {}


Minimum Improvement

• The improvement of rule X → Y [conf=c] = min(c – k | Z ⊂ X, Z → Y [conf=k])

• A minimum improvement constraint can eliminate many spurious rules
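
To make the improvement measure concrete, here is a small sketch that takes the minimum confidence gain over every proper subset of the antecedent, including the empty antecedent. The confidence values are hypothetical numbers chosen for illustration, and the name `improvement` and the dictionary layout are assumptions.

```python
from itertools import combinations


def improvement(antecedent, confidence_of):
    """Smallest confidence gain of a rule over any proper generalisation.

    `confidence_of` maps a frozenset antecedent to the confidence of
    (antecedent → Y) for one fixed consequent Y.
    """
    antecedent = frozenset(antecedent)
    c = confidence_of[antecedent]
    generalisations = (frozenset(z)
                       for size in range(len(antecedent))
                       for z in combinations(antecedent, size))
    return min(c - confidence_of[z] for z in generalisations)


# Hypothetical confidences for the consequent "oedema".
conf = {
    frozenset(): 0.05,
    frozenset({"pregnant"}): 0.20,
    frozenset({"female"}): 0.08,
    frozenset({"pregnant", "female"}): 0.20,
}
print(improvement({"pregnant", "female"}, conf))  # 0.0: adding "female" gains nothing
print(improvement({"pregnant"}, conf))
```

A minimum improvement constraint of any positive size would discard the first rule and keep the second.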


Non redundant rules

• ∀ x, y, z, s, c: if x → y [conf = 1.0] and x → z [supp=s, conf=c] then x, y → z [supp=s, conf=c]
  Eg pregnant → oedema [supp=0.1, conf=0.2], so pregnant, female → oedema [supp=0.1, conf=0.2]
• A rule X → Y [supp=s, conf=c] is redundant iff ∃ x ∈ X: X\x → Y [supp=s, conf=c] or ∃ y ∈ Y: X → Y\y [supp=s, conf=c]
  Eg, pregnant, female → oedema
• Closed itemset approaches lead to efficient generation of non-redundant rules because a rule is non-redundant iff all immediate specialisations are closed itemsets.

• Note, redundant rules have improvement of 0.0.


Effect

                      filter
dataset               none  non-redundant    %  improvement > 0    %
bms webview            170            170  100              155   91
covtype                998            815   82              143   14
ipums.la.99            973            959   99              481   49
kddcup98               995            992  100              939   94
letter-recognition     541            524   97              421   78
mush                   891            469   53              128   14
retail                 590            590  100              519   88
shuttle                666            595   89              312   47
splice-junction        748            727   97              699   93
ticdata-2000           996            996  100              988   99


Part 5:

False discoveries


False discoveries

• Massive search leads to high risk of false discoveries

• eg 100 observations, two independent events each occurring with 0.5 probability,
  • the probability of perfect correlation is $7.8 \times 10^{-31}$.
  • if there are 1000 events then there are $2^{1000} = 1.07 \times 10^{301}$ antecedent – consequent pairs.
• What constitutes a false discovery depends upon the analytic objective
• Usually should include rules where
  • antecedent and consequent are independent
  • antecedent and consequent are independent given a generalisation of the antecedent
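
The two magnitudes in the example can be checked with a few lines of arithmetic. This sketch assumes that "perfect correlation" means the two events agree on every one of the 100 observations, each agreement having probability 0.5.

```python
# Probability that two independent fair events agree on all 100 observations.
p_perfect = 0.5 ** 100
print(f"{p_perfect:.2e}")  # 7.89e-31

# Number of subsets of 1000 events, the scale of the rule space.
pairs = 2 ** 1000
print(f"{float(pairs):.2e}")  # 1.07e+301
```

The search space is so much larger than the reciprocal of the chance probability that some perfectly correlated pairs are expected to appear by chance alone.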


Testing independence

• Cannot perform simple test of independence because of multiple comparisons problem
  • used previously (eg Webb, Butler & Newlands, 2003) as a statistically unsound filter


Standard statistical correction

• Bonferroni
  • To maintain experimentwise risk ≤ α for n tests
  • use critical value = α / n
• Holm procedure
  • To maintain experimentwise risk ≤ α for n tests with p values ordered from lowest to highest, $p_1 \ldots p_n$
  • Accept tests corresponding to $p_1 \ldots p_k$, where k is the highest value such that $p_i \le \alpha / (n - i + 1)$ for all $1 \le i \le k$

Example with α = 0.05 and n = 4:

p values         0.0100, 0.0200, 0.0400, 0.0400
critical values  0.0125, 0.0167, 0.0250, 0.0500
outcome          accept, reject, reject, reject (0.0200 > 0.0167, so the procedure stops at the second test)
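
Both corrections are short enough to sketch in code. The version below assumes α = 0.05, which is the value implied by the listed critical values, and uses "accept" in the slide's sense of accepting a discovery. With the four p values listed, the step-down comparison stops at the second test because 0.0200 is greater than 0.05 / 3 ≈ 0.0167. The function names are illustrative.

```python
def bonferroni(p_values, alpha=0.05):
    """Accept a test only if its p value is at most alpha / n."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]


def holm(p_values, alpha=0.05):
    """Step-down Holm procedure; returns accept flags in the input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    accepted = [False] * n
    for rank, i in enumerate(order, start=1):
        if p_values[i] > alpha / (n - rank + 1):
            break  # this test and every larger p value are rejected
        accepted[i] = True
    return accepted


p = [0.0100, 0.0200, 0.0400, 0.0400]
print([round(0.05 / (len(p) - i), 4) for i in range(len(p))])  # critical values
print(holm(p))
print(bonferroni(p))
```

Holm never accepts fewer tests than Bonferroni, since its first critical value equals the Bonferroni value and later ones are larger.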


Direct adjustment

• I used to think “cannot perform simple adjustment such as Bonferroni or Holm because rule spaces are so large, eg $2^{1000}$ (> 1.0E+301)
  • would result in unacceptable type-2 error
  • eg adjusted critical value = 5.0E-303”
• However, search is often restricted to small antecedents (eg. ≤ 4) resulting in Bonferroni adjusted critical values of magnitude 1.0E-10 … 1.0E-20.
• With such adjustments often many rules can be found
• Cannot order p values to apply Holm procedure


Discovery as hypothesis generation

• Important to trade-off the risks of both type-1 and type-2 errors

• Perhaps best viewed as hypothesis generation, recognising that ‘discovered’ patterns require independent assessment


Hypothesis testing: proposal

• Why not automate such assessment?

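[Diagram: Data is split into Exploratory and Holdout sets. Exploratory Pattern Discovery on the exploratory set produces Patterns; Statistical Evaluation against the holdout set produces Sound Patterns. Annotations: small set preferable; Holm adjustment; any hypothesis test; limited type-2 error.]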


Direct adjustment vs Holdout

Direct adjustment
• All data used for exploration and evaluation
• Bonferroni adjustment
• Larger adjustment
• Adjustment alters with size of search space

Holdout
• Half data used for each of exploration and evaluation
• Holm procedure
• Smaller adjustment
• Adjustment alters with number of rules found


Case study: Ten widely used data sets

Name             Description                               Records  Attribute-values
BMS webview      products viewed at a commercial website    59,601               497
covtype          forest cover data                         581,012               125
ipums.la.99      Los Angeles census data                    88,443             1,874
kddcup98         charity donors                             52,256            19,662
letter-recog’n   digital image recognition                  20,000                74
mush             identification of poisonous mushrooms       8,124               127
retail           retail market basket data                  88,162            16,470
shuttle          records of space shuttle flight data       58,000                34
splice-junction  DNA sequence records                        3,177               243
ticdata-2000     insurance risk assessment                   5,822               689


Detecting spurious rules

• Assuming interest only in positive associations
  • P(C | A) > P(C)
• For any rule A → C, want to assess whether it has higher confidence than all its generalisations
• Eg, is confidence(pregnant & female → B) >
  • confidence(pregnant → B)
  • confidence(female → B)
  • confidence(true → B)


Detecting spurious rules (cont)

• Perform one-tailed Fisher exact tests with respect to each generalisation
  • Reject if any test does not exceed critical value
  • no need to adjust for multiple comparisons with respect to the multiple tests for a single rule
• Use Holm adjustment for strict control of type-1 error
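
For readers unfamiliar with the Fisher exact test, the sketch below computes the one-tailed p value for a positive association in a 2x2 table using only the standard library. It covers a single comparison (the rule against the generalisation with an empty antecedent) and is an illustrative assumption, not the procedure used to produce the results in these slides. The function name and argument names are hypothetical.

```python
from math import comb


def fisher_exact_greater(both, a_only, c_only, neither):
    """One-tailed Fisher exact p value for a positive association.

    Arguments are the four cell counts of the 2x2 table for A and C.
    """
    n = both + a_only + c_only + neither
    row_a = both + a_only          # records satisfying A
    col_c = both + c_only          # records satisfying C
    total = comb(n, row_a)
    upper = min(row_a, col_c)
    # Sum hypergeometric probabilities of tables at least as extreme as observed.
    tail = sum(comb(col_c, k) * comb(n - col_c, row_a - k)
               for k in range(both, upper + 1))
    return tail / total


# bread → honey: 50 both, 50 bread only, 50 honey only, 850 neither.
print(fisher_exact_greater(50, 50, 50, 850))
```

The resulting p value would then be compared with the adjusted critical value, for example one obtained from the Holm procedure.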


Spurious rules case study: high support & confidence non-redundant rules

Name                Records  Attribute-values  # Rules  # Accepted   %
bms webview          59,601               497   22,135       1,747   8
covtype             581,012               125   10,018           0   0
ipums.la.99          88,443             1,874    9,857         288   3
kddcup98             52,256            19,662    9,863          40  <1
letter-recognition   20,000                74    7,978         952  12
mush                  8,124               127    8,957       1,266  14
retail               88,162            16,470   11,656          97   1
shuttle              58,000                34    9,760         876   9
splice-junction       3,177               243    8,937         132   1
ticdata-2000          5,822               689   10,438          30  <1


KDDCUP98: 99.5% of rules rejected

The following 40 rules passed holdout evaluation
…
ETH12<=0 → HC15<=0 [Coverage=0.987 (25786); Support=0.946 (24722); Confidence=0.959; Lift=1.00]
…
The following 9843 rules failed holdout evaluation, adjusted critical value = 5.09E-06
…
NOEXCH=0 & ETH12<=0 → HC15<=0 [Coverage=0.984 (25703); Support=0.943 (24644); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & MDMAUD_F=X → HC15<=0 [Coverage=0.981 (25629); Support=0.940 (24573); Confidence=0.959; Lift=1.00]
…
NOEXCH=0 & ETH12<=0 & ADATE_2>=9706 & MDMAUD_R=X → HC15<=0 [Coverage=0.981 (25623); Support=0.940 (24567); Confidence=0.959; Lift=1.00]
…


Comparison of direct adjustment and holdout tests on artificial data

[Charts: True Discoveries, False Discoveries and Experimentwise Error, each comparing Holdout with Direct adjustment]

Averages over 100 runs, 84 true rules at antecedent size 4


Comparison on real data

[Chart: Letter Recognition. No of rules (axis 0 to 3000) against search space size (2.33E+03 to 1.47E+09), Direct vs Holdout]

[Chart: Retail. No of rules (axis 0 to 1200) against search space size (1.36E+08 to 4.56E+26), Direct vs Holdout]


Part 6:

Limitations of minimum support


Limitations of minimum support

• Discontinuity in ‘interestingness’ function
• The vodka and caviar problem
  • some high value associations are infrequent
• Feast or famine
  • minimum support is a crude control mechanism
  • often results in too few or too many associations
• Cannot handle dense data
• Cannot prune search space using constraints on relationship between antecedent and consequent
  • eg confidence
• Minimum support may not be relevant
  • cannot be sufficiently low to capture all valid rules
  • cannot be sufficiently high to exclude all spurious rules


Very low support rules can be significant

Data file: Brijs retail.itl [50% sample]

44081 cases / 44081 holdout cases / 16470 items

The following 5 rules passed holdout evaluation

168 & 4685 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]

168 & 3021 → 1 [Coverage=0.000 (3); Support=0.000 (3); Confidence estimate=0.601; Lift estimate=192.06]

1476 & 4685 → 1 [Coverage=0.000 (2); Support=0.000 (2); Confidence estimate=0.502; Lift estimate=160.21]

168 & 783 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]

3021 & 4685 → 1 [Coverage=0.000 (4); Support=0.000 (3); Confidence estimate=0.501; Lift estimate=160.05]


Very high support rules can be spurious

Data file: covtype.data
581012 cases / 125 values

ST15=0 → ST07=0 [Coverage=1.000 (581009); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST07=0 → ST15=0 [Coverage=1.000 (580907); Support=1.000 (580904); Confidence=1.000; Lift=1.00]
ST15=0 → ST36=0 [Coverage=1.000 (581009); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST36=0 → ST15=0 [Coverage=1.000 (580893); Support=1.000 (580890); Confidence=1.000; Lift=1.00]
ST15=0 → ST08=0 [Coverage=1.000 (581009); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
ST08=0 → ST15=0 [Coverage=1.000 (580833); Support=1.000 (580830); Confidence=1.000; Lift=1.00]
…..

197,183,686 such rules have highest support

Page 67: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Roles of constraints

1. Select most relevant patterns
  • patterns that are likely to be interesting

2. Control the number of patterns that the user must consider

3. Make computation feasible

Page 68: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Minimum support can get overloaded!

Select most relevant

Control the number

Make computation feasible

Page 69: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Part 6:

K-most interesting pattern discovery

Page 70: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

K-most interesting pattern discovery

• Find k patterns that maximise a measure of interest within other constraints that the user may specify
  • removes need for minimum support constraint
  • efficient with dense data
  • empowers user to use relevant measure of interest
  • user specifies number of patterns to be returned
  • does not require either monotone or anti-monotone constraints

• Relies on efficient search
  • must be able to retain all data in memory
  • constraints must sufficiently constrain the search space
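The idea of keeping only the k best patterns can be sketched with a bounded heap. This is a brute-force illustration under stated assumptions, not the efficient search the slide refers to: the function name `k_most_interesting`, the `max_antecedent` limit, the toy data and the choice of leverage as the measure of interest are all mine, and any other measure could be substituted.

```python
import heapq
from itertools import combinations


def k_most_interesting(transactions, k, max_antecedent=2):
    """Return the k rules with the highest leverage, best first (exhaustive search)."""
    n = len(transactions)
    items = sorted(set().union(*transactions))

    def freq(itemset):
        # Fraction of transactions that contain every item in itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    heap = []  # min-heap holding the k best (leverage, antecedent, consequent) so far
    for size in range(1, max_antecedent + 1):
        for antecedent in combinations(items, size):
            coverage = freq(set(antecedent))
            if coverage == 0:
                continue  # the rule covers no cases
            for consequent in items:
                if consequent in antecedent:
                    continue
                support = freq(set(antecedent) | {consequent})
                leverage = support - coverage * freq({consequent})
                entry = (leverage, antecedent, consequent)
                if len(heap) < k:
                    heapq.heappush(heap, entry)
                elif entry > heap[0]:
                    # Better than the weakest retained rule: replace it.
                    heapq.heapreplace(heap, entry)
    return sorted(heap, reverse=True)


# Hypothetical toy data: a and b usually occur together, c occurs alone.
transactions = [{"a", "b"}] * 4 + [{"a"}] + [{"c"}] * 5
top_rules = k_most_interesting(transactions, k=2)
for leverage, antecedent, consequent in top_rules:
    print(f"{' & '.join(antecedent)} -> {consequent}  leverage={leverage:.2f}")
```

No minimum support is needed: the user chooses k and the measure, and the weakest retained rule sets the bar that new candidates must beat. A practical system would use that bar to prune the search instead of enumerating every rule.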

Page 71: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Part 7:

Itemset discovery

Page 72: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Itemset discovery

• In some contexts it is the collection of correlated variables that is of interest, and the rule structure is superfluous.

• If A is associated with B then B must be associated with A (in the sense of the presence of the antecedent increasing the probability of the presence of the consequent).

• Discovering interesting itemsets is an area that has been little explored.
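The symmetry claimed on this slide can be checked directly for lift. A short derivation, assuming lift is defined as confidence divided by the frequency of the consequent and that P(A) and P(B) are both non-zero:

$$\mathrm{lift}(A \rightarrow B) = \frac{P(B \mid A)}{P(B)} = \frac{P(A \wedge B)}{P(A)\,P(B)} = \frac{P(A \mid B)}{P(A)} = \mathrm{lift}(B \rightarrow A)$$

So if the presence of A raises the probability of B (lift greater than 1), the presence of B raises the probability of A by the same factor. Confidence is not symmetric in this way, which is one reason the direction of a rule can be uninformative when only the association matters.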

Page 73: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Part 8:

Contrast discovery

Page 74: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Contrast sets (emerging patterns)

• Sometimes it is interesting to identify differences between contrasting groups

• Eg: how do purchasing patterns differ between weekends and weekdays?

• Contrast sets find sets of conditions that differ significantly between groups

$$\exists ij\; P(cset \mid G_i) \neq P(cset \mid G_j)$$

$$\max_{ij} \left| support(cset, G_i) - support(cset, G_j) \right| \geq \delta$$
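The support comparison behind a contrast set can be sketched as follows. This is an illustration only: the function name `contrast_deviation`, the record layout and the toy weekend/weekday data are hypothetical, and the sketch computes the largest difference in support between groups without performing the statistical test needed to establish that the difference is significant.

```python
def contrast_deviation(records, group_key, cset):
    """Support of cset within each group, and the largest difference between groups.

    records: list of dicts; cset: dict mapping attribute -> required value.
    """
    totals, hits = {}, {}
    for record in records:
        group = record[group_key]
        totals[group] = totals.get(group, 0) + 1
        # A record matches when every condition in the contrast set holds.
        if all(record.get(attr) == value for attr, value in cset.items()):
            hits[group] = hits.get(group, 0) + 1
    supports = {g: hits.get(g, 0) / totals[g] for g in totals}
    # The largest pairwise absolute difference equals max minus min.
    deviation = max(supports.values()) - min(supports.values())
    return supports, deviation


# Hypothetical toy data: does buying beer differ between weekends and weekdays?
records = (
    [{"period": "weekend", "beer": True}] * 3
    + [{"period": "weekend", "beer": False}] * 1
    + [{"period": "weekday", "beer": True}] * 1
    + [{"period": "weekday", "beer": False}] * 3
)
supports, deviation = contrast_deviation(records, "period", {"beer": True})
print(supports, deviation)
```

A candidate contrast set would be reported only if this deviation is large enough to matter to the user and the difference between groups also passes a significance test.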

Page 75: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Contrast sets (cont.)

• Different analytic objective to association rules
  • more directed
  • focus on differences between groups instead of associations between variables

• Different to classification rules
  • not discriminative
  • no attempt to distinguish all individuals of each group
  • find all contrasts rather than sufficient discriminators

Page 76: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Can be discovered by existing techniques!

• Contrast / emerging pattern discovery is strictly equivalent to standard exploratory rule discovery with the consequent restricted to the group variable

$$\exists ij\; P(cset \mid G_i) \neq P(cset \mid G_j) \;\Leftrightarrow\; \exists ij\; P(G_i \mid cset) \neq P(G_j \mid cset)$$

Page 77: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Part 9:

Impact rules

Page 78: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Impact rules (quantitative association rules)

• Most rule discovery techniques require that numeric variables be discretised.

• This often loses important information.

• Impact rules associate an antecedent with a distribution on a numeric variable.

• The user specifies what makes a distribution interesting
  • eg largest mean, smallest standard deviation, …

• System finds rules that maximise the measure of interest within other user-specified constraints

Page 79: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Impact rule discovery example

LengthOfStay: mean = 10.6; min = -6; max = 1687; sum = 367781

COUNTRYOFBIRTH=1100 -> LengthOfStay: Coverage=0.054 (1861); Mean=22.2; Min=-4; Max=1687; Sum=41314; Impact=21612.4

ADMITDay=Wednesday -> LengthOfStay: Coverage=0.159 (5518); Mean=13.3; Min=0; Max=1548; Sum=73389; Impact=15307.6
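The statistics in this output can be sketched for a single candidate antecedent. The slide does not define the impact measure, so the sketch assumes one plausible definition: the number of covered cases multiplied by the difference between their mean and the overall mean. The function name `impact_rule`, the field names and the toy records are hypothetical.

```python
from statistics import mean


def impact_rule(records, condition, target):
    """Summarise the target's distribution over the records that satisfy condition."""
    values = [r[target] for r in records]
    covered = [r[target] for r in records if condition(r)]
    if not covered:
        return None  # the antecedent covers no cases
    overall_mean = mean(values)
    covered_mean = mean(covered)
    return {
        "coverage": len(covered) / len(values),
        "count": len(covered),
        "mean": covered_mean,
        "min": min(covered),
        "max": max(covered),
        "sum": sum(covered),
        # Assumed definition: covered cases x (covered mean - overall mean).
        "impact": len(covered) * (covered_mean - overall_mean),
    }


# Hypothetical toy data: length of stay by day of admission.
records = [
    {"admit_day": "Wednesday", "length_of_stay": 10},
    {"admit_day": "Wednesday", "length_of_stay": 20},
    {"admit_day": "Monday", "length_of_stay": 5},
    {"admit_day": "Tuesday", "length_of_stay": 5},
]
result = impact_rule(records, lambda r: r["admit_day"] == "Wednesday", "length_of_stay")
print(result)
```

Because the numeric variable is never discretised, the rule can be ranked on any property of the covered distribution, such as its mean or spread, rather than on the frequency of an arbitrary interval.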

Page 80: Http://webb Intelligent Systems Exploratory pattern discovery Geoff Webb

Summary

• Exploratory pattern discovery empowers the user to select the patterns that are most useful

• Rules provide a modular and powerful knowledge representation formalism

• Association rules discover associations between qualitative variables that are frequent

• K-optimal rules discover associations between qualitative variables that optimise a measure of interest

• Impact rules discover associations between qualitative and quantitative variables

• Contrasts discover differences in distributions over variables between different groups

• If you mine for patterns without appropriate statistical evaluation, expect to find fool’s gold!