Course on Data Mining (581550-4)

Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS, Autumn 2001

[Figure: course timeline. Topics: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam. Dates shown: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.]



Today 26.10.2001

• Summary:
  – Course organization
  – What is data mining?
• Today's subject:
  – Association rules
• Next week's program:
  – Lecture: Episodes
  – Exercise: Associations
  – Seminar: Associations



Course Organization: Lectures, Exercises, Exam

• 12 lectures: 24.10.-30.11.2001
  – Wed 14-16, Fri 12-14 (A217)
    • Wed: normal lecture
    • Fri: seminar-like lecture (except for 26.10.)
• 5 exercise sessions: 1.11.-29.11.2001
  – Thu 12-14 (A318)
• Home exam:
  – Given: 28.11., due: 21.12.2001
• Language:
  – Lecturing language is Finnish
  – Slides and material are in English



Group Work

• Group for seminar (and exercise) work:
  – 10 groups of 3 persons, 2 groups per lecture
  – Dates are agreed at the beginning of the course
  – Articles are given out on the previous week's Wednesday
• Seminar presentations:
  – Written presentation (around 3-5 printed pages) due at the start of the seminar
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation




Course Organization / Groups

• Group presentation time allocation:
  – Fri 2.11.: Group 1, Group 2 (associations)
  – Fri 9.11.: Group 3, Group 4 (episodes)
  – Fri 16.11.: Group 5, Group 6 (text mining)
  – Fri 23.11.: Group 7, Group 8 (clustering)
  – Fri 30.11.: Group 9, Group 10 (KDD process)



Course Evaluation

• Passing the course: min 30 points
  – Home exam: min 13 points (max 30 points)
  – Exercises/experiments: min 8 points (max 20 points)
    • At least 3 returned and reported experiments
  – Group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
  – Attending the lectures (5/7)
  – Attending the seminars (4/5)
  – Attending the exercises (4/5)




Course Material

• Lecture slides
• Original articles
• Seminar presentations
• Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
• Remember to check the course website and folder for the material!




Summary: What is Data Mining?

• Ultimately:
  – "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
• Often just:
  – "Tell something interesting about this data", "Describe this data"
• Exploratory, semi-automatic data analysis on large data sets



Summary: What is Data Mining?

• Data mining: semi-automatic discovery of interesting patterns from large data sets
• Knowledge discovery is a process:
  – Preprocessing
  – Data mining
  – Postprocessing
• To be mined, used or utilized: different …
  – Databases (relational, object-oriented, spatial, WWW, …)
  – Knowledge (characterization, clustering, association, …)
  – Techniques (machine learning, statistics, visualization, …)
  – Applications (retail, telecom, Web mining, log analysis, …)



Summary: Typical KDD Process

[Figure: KDD process in three numbered phases: (1) Preprocessing, (2) Data mining, (3) Postprocessing. Labels in the figure: Operational Database, Raw data, Selection, Time based selection, Input data (cleaned, verified, focused), Eval. of interestingness, Results (selected usable patterns), Utilization.]



Association Rules

[Figure: lecture outline with the parts Basics, Multi-level Rules, Examples, Generation, Constraints.]



Market Basket Analysis

• Analysis of customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"

  Customer 1: milk, eggs, sugar, bread
  Customer 2: milk, eggs, cereal, bread
  Customer 3: eggs, sugar



• Given:
  – A database of customer transactions (e.g., shopping baskets), where each transaction is a set of items (e.g., products)
• Find:
  – Groups of items which are frequently purchased together




• Extract information on purchasing behavior
  – "IF buys beer and sausage, THEN also buys mustard with high probability"
• Actionable information: can suggest...
  – New store layouts and product assortments
  – Which products to put on promotion
• The MBA approach is applicable whenever a customer purchases multiple things in proximity
  – Credit cards
  – Services of telecommunication companies
  – Banking services
  – Medical treatments




• Useful:
  "On Thursdays, grocery store consumers often purchase diapers and beer together."
• Trivial:
  "Customers who purchase maintenance agreements are very likely to purchase large appliances."
• Inexplicable/unexpected:
  "When a new hardware store opens, one of the most sold items is toilet rings."




Association Rules: Basics

• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Comprehensibility: simple to understand
• Utilizability: provide actionable information
• Efficiency: efficient discovery algorithms exist
• Applications:
  – Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.



• Typical representation formats for association rules:
  – diapers ⇒ beer [0.5%, 60%]
  – buys:diapers ⇒ buys:beer [0.5%, 60%]
  – "IF buys diapers, THEN buys beer in 60% of the cases. Diapers and beer are bought together in 0.5% of the rows in the database."
• Other representations (used in Han's book):
  – buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]




  diapers ⇒ beer [0.5%, 60%]
    (1)     (2)   (3)   (4)

  "IF buys diapers (1), THEN buys beer (2) in 60% of the cases (4); both occur together in 0.5% of the rows (3)"

1 Antecedent, left-hand side (LHS), body
2 Consequent, right-hand side (RHS), head
3 Support, frequency ("in how big a part of the data the things in the left- and right-hand sides occur together")
4 Confidence, strength ("if the left-hand side occurs, how likely the right-hand side is to occur")



• Support: denotes the frequency of the rule within the transactions.

  support(A ⇒ B [s, c]) = p(A ∪ B) = support({A, B})

• Confidence: denotes the percentage of transactions containing A that also contain B.

  confidence(A ⇒ B [s, c]) = p(B | A) = p(A ∪ B) / p(A) = support({A, B}) / support({A})

  Here p(A ∪ B) is the probability that a transaction contains all the items of both A and B.




• Minimum support:
  – High ⇒ few frequent itemsets ⇒ few valid rules which occur very often
  – Low ⇒ many valid rules which occur rarely
• Minimum confidence:
  – High ⇒ few rules, but all "almost logically true"
  – Low ⇒ many rules, many of them very "uncertain"
• Typical values: minimum support = 2-10%, minimum confidence = 70-90%




• Transaction:

  Relational format    Compact format
  <Tid, item>          <Tid, itemset>
  <1, item1>           <1, {item1, item2}>
  <1, item2>           <2, {item3}>
  <2, item3>

• Item vs. itemset: single element vs. set of items
• Support of an itemset I: number of transactions containing I
• Minimum support: threshold for support
• Frequent itemset: itemset with support ≥ minimum support




• Given: (1) a database of transactions, (2) each transaction is a list of items purchased by a customer in one visit
• Find: all rules with minimum support and confidence

  Transaction ID   Items Bought
  100              A, B, C
  200              A, C
  400              A, D
  500              B, E, F

  Itemset              Support
  {A}                  3 or 75%
  {B} and {C}          2 or 50%
  {D}, {E} and {F}     1 or 25%
  {A, C}               2 or 50%
  Other item pairs     max 25%

• If min. support is 50% and min. confidence is 50%, then A ⇒ C [50%, 66.6%] and C ⇒ A [50%, 100%]
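The support and confidence figures in this example can be checked with a few lines of Python. This is a minimal sketch: the function names `support` and `confidence` are illustrative choices, and exact fractions are used only to keep the arithmetic visible.

```python
from fractions import Fraction

# Transactions from the example table (TID -> items bought).
transactions = {
    100: {"A", "B", "C"},
    200: {"A", "C"},
    400: {"A", "D"},
    500: {"B", "E", "F"},
}

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(set(lhs) | set(rhs)) / support(lhs)

print(support({"A", "C"}))        # 1/2
print(confidence({"A"}, {"C"}))   # 2/3
print(confidence({"C"}, {"A"}))   # 1
```

Both rules share the same support because support depends only on the combined itemset {A, C}; only the confidence depends on the direction of the rule.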



Association Rule Generation

• Association rule mining is a two-step process:

  STEP 1: Find the frequent itemsets: the sets of items that have minimum support.
  – So-called Apriori trick: a subset of a frequent itemset must also be a frequent itemset:
    • i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with size from 1 to k (k-itemset)

  STEP 2: Use the frequent itemsets to generate association rules.



Frequent Sets with Apriori

• Join step: Ck is generated by joining Lk-1 with itself
• Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:

    Ck: candidate itemsets of size k
    Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = {candidates generated from Lk};
        for each transaction t in database do
            increment the count of all candidates in Ck+1 that are contained in t
        Lk+1 = {candidates in Ck+1 with min_support}
    end
    return ∪k Lk;



Apriori Candidate Generation

• The Apriori principle:
  Any subset of a frequent itemset must be frequent
• L3 = {abc, abd, acd, ace, bcd}
• Self-joining: L3 * L3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L3
• C4 = {abcd}
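The join and prune steps can be sketched as follows. Note one assumption: the join below unites any two (k-1)-itemsets whose union has exactly k items, which is a looser join than the prefix-based self-join on the slide. It may produce extra candidates before pruning, but the prune step removes them, so the final candidate set is the same for this example. The name `apriori_gen` is illustrative.

```python
from itertools import combinations

def apriori_gen(frequent, k):
    """Build size-k candidates from the frequent (k-1)-itemsets."""
    frequent = {frozenset(s) for s in frequent}
    # Join step: unite pairs of (k-1)-itemsets whose union has exactly k items.
    joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
    return {
        c for c in joined
        if all(frozenset(s) in frequent for s in combinations(c, k - 1))
    }

L3 = ["abc", "abd", "acd", "ace", "bcd"]
C4 = apriori_gen(L3, 4)
print(sorted("".join(sorted(c)) for c in C4))  # ['abcd']
```

Here acde is dropped because its subset ade is not in L3, exactly as on the slide.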



Apriori Example (1/6)

Database D:

  TID   Items
  100   1 3 4
  200   2 3 5
  300   1 2 3 5
  400   2 5

Scan D to count C1:

  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {4}       1
  {5}       3

L1:

  itemset   sup.
  {1}       2
  {2}       3
  {3}       3
  {5}       3

C2 (generated from L1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D to count C2:

  itemset   sup
  {1 2}     1
  {1 3}     2
  {1 5}     1
  {2 3}     2
  {2 5}     3
  {3 5}     2

L2:

  itemset   sup
  {1 3}     2
  {2 3}     2
  {2 5}     3
  {3 5}     2

C3 (generated from L2): {2 3 5}

Scan D to count C3, giving L3:

  itemset   sup
  {2 3 5}   2
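The trace on database D can be reproduced with a compact level-wise implementation. This is a sketch, not the lecturers' code: the minimum support count of 2 is an assumption inferred from the example (where {4} with support 1 is dropped and itemsets with support 2 are kept), and the join unites any two frequent itemsets whose union has k items before pruning.

```python
from itertools import combinations

def apriori(transactions, min_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}
    candidates = {frozenset([i]) for i in items}
    frequent = {}
    k = 1
    while candidates:
        # Scan the database once to count every candidate of size k.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        k += 1
        # Join frequent (k-1)-itemsets, then prune by the Apriori principle.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)  # min_count=2 is an assumed threshold
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

The printed itemsets correspond to L1, L2 and L3 of the example; the loop stops once no candidates of the next size survive pruning.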



Search Space of Database D

  1  2  3  4  5
  12  13  14  15  23  24  25  34  35  45
  123  124  125  134  135  145  234  235  245  345
  1234  1235  1245  1345  2345
  12345

Apriori Trick on Level 1

[Figure: the same itemset lattice with itemsets highlighted after level 1. Since {4} is infrequent in D, no itemset containing item 4 can be frequent.]

Apriori Trick on Level 2

[Figure: the same itemset lattice with itemsets highlighted after level 2. Since {1 2} and {1 5} are infrequent in D, none of their supersets can be frequent.]




Is Apriori Fast Enough?

• The core of the Apriori algorithm:
  – Use frequent (k-1)-itemsets to generate candidate frequent k-itemsets
  – Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  – Multiple scans of database:
    • Needs (n + 1) scans, where n is the length of the longest pattern
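The two candidate counts quoted above are easy to verify. The sketch below assumes that the candidate 2-itemsets are all unordered pairs of frequent items, and that a frequent pattern of size 100 implies that all of its subsets are candidates at some level.

```python
import math

# Unordered pairs of 10^4 frequent items: the candidate 2-itemsets.
n_items = 10**4
pairs = math.comb(n_items, 2)
print(pairs)  # 49995000

# Subsets of a 100-item pattern.
subsets = 2**100
print(f"{subsets:.3e}")  # 1.268e+30
```

So the pair count is about 5 × 10^7, i.e. on the order of 10^7, and 2^100 is roughly 1.3 × 10^30.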



Is Apriori Fast Enough?

• In practice:
  – For the basic Apriori approach, the number of attributes in a row is usually much more critical than the number of transaction rows
  – For example:
    • 50 attributes each having 1-3 values, 100 000 rows (not very bad)
    • 50 attributes each having 10-100 values, 100 000 rows (quite bad)
    • 10 000 attributes each having 5-10 values, 100 rows (very bad...)
  – Notice:
    • One attribute might have several different values
    • Association rule algorithms typically treat every attribute-value pair as one attribute (2 attributes with 5 values each => "10 attributes")
• There are some ways to overcome the problem...



Improving Apriori Performance

• Hash-based itemset counting:
  – A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction:
  – A transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning:
  – Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling:
  – Mining on a subset of the given data, with a lower support threshold plus a method to determine the completeness



Association Rules from Itemsets

• Pseudo-code:

    for every frequent itemset l
        generate all nonempty proper subsets s of l
        for every such subset s of l
            output the rule "s ⇒ (l-s)" if support(l)/support(s) ≥ min_conf,
            where min_conf is the minimum confidence threshold

• E.g.: frequent set l = {abc}, subsets s = {a, b, c, ab, ac, bc}
  – a ⇒ b, a ⇒ c, b ⇒ c
  – a ⇒ bc, b ⇒ ac, c ⇒ ab
  – ab ⇒ c, ac ⇒ b, bc ⇒ a
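The rule-generation loop can be sketched in Python as below. As an illustration it uses the support counts of the frequent itemsets from the earlier A-F example (4 transactions, minimum support 50%); the function name `generate_rules` and the dictionary layout are assumptions made for this sketch.

```python
from itertools import combinations

def generate_rules(support, min_conf):
    """Yield (lhs, rhs, confidence) for each rule meeting min_conf."""
    for itemset, sup in support.items():
        # Proper nonempty subsets only: sizes 1 .. len(itemset) - 1.
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, size)):
                conf = sup / support[lhs]
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf

# Support counts of the frequent itemsets in the A-F example.
support = {
    frozenset("A"): 3,
    frozenset("B"): 2,
    frozenset("C"): 2,
    frozenset("AC"): 2,
}

rules = sorted(generate_rules(support, min_conf=0.5), key=lambda r: sorted(r[0]))
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 3))
```

Because every subset of a frequent itemset is itself frequent, the lookup `support[lhs]` always succeeds when the dictionary holds all frequent itemsets. No further database scan is needed at this stage.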



Association Rule Generation

• Rule 1 to remember:
  – Generating frequent sets is slow (especially itemsets of size 2)
  – Generating association rules from frequent itemsets is fast
• Rule 2 to remember:
  – For itemset generation, the support threshold is used
  – For association rules, the confidence threshold is used
• What happens in reality, how long does it take to create frequent sets and association rules?
  – Let's take small real-life examples…
  – Experiments are made with a Citum 4/275 Alpha server with 512 MB of main memory & Red Hat Linux release 5.0 (kernel 2.0.30)



Performance Example (1/4)

[Figure: a telecom network sending alarms to the Network Management System. Layers labeled in the figure: Switched Network, Access Network. Legend: MSC = mobile station controller, BSC = base station controller, BTS = base station transceiver.]




• Telecom data containing alarms:

  1234 EL1 PCM 940926082623 A1 ALARMTEXT..

  Fields: alarm number (1234), alarming network element (EL1), alarm type (PCM), date and time (940926082623), alarm severity class (A1)

• Example data 1:
  – 43 478 alarms (26.9.94 - 5.10.94; ~10 days)
  – 2 234 different types of alarms, 23 attributes, 5503 different values
• Example data 2:
  – 73 679 alarms (1.2.95 - 22.3.95; ~7 weeks)
  – 287 different types of alarms, 19 attributes, 3411 different values



[Figure: two panels titled "Data set 1 (~10 days)" and "Data set 2 (~7 weeks)".]

Example rule:

  alarm_number=1234, alarm_type=PCM ⇒ alarm_severity=A1 [2%, 45%]




• Example results for data 1:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 109 719    Time: 12.02 s
  – Frequent sets: 79 311      Time: 64 855.73 s
  – Rules: 3 750 000           Time: 860.60 s
• Example results for data 2:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 43 600     Time: 1.70 s
  – Frequent sets: 13 321      Time: 10 478.93 s
  – Rules: 509 075             Time: 143.35 s



Selecting the Interesting Rules?

• Usually the result set is very big; one must select the interesting rules based on:
  – Objective measures:
    Two popular measurements: support and confidence
  – Subjective measures (Silberschatz & Tuzhilin, KDD95):
    A rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it)
• These issues will be discussed with KDD processes



Boolean vs. Quantitative Rules

• Boolean vs. quantitative association rules (based on the types of values handled)
  – Boolean: rule concerns associations between the presence or absence of items (e.g. "buys A" or "does not buy A")

      buys=SQLServer, buys=DMBook ⇒ buys=DBMiner [2%, 60%]
      buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]

  – Quantitative: rule concerns associations between quantitative items or attributes

      age=30..39, income=42..48K ⇒ buys=PC [1%, 75%]
      age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]



Quantitative Rules

• Quantitative attributes: e.g., age, income, height, weight
• Categorical attributes: e.g., color of car

  CID   height   weight   income
  1     168      75.4     30.5
  2     175      80.0     20.3
  3     174      70.3     25.8
  4     170      65.2     27.0

Problem: too many distinct values for quantitative attributes

Solution: transform quantitative attributes into categorical ones via discretization (more about this in the seminar!)
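A very simple form of discretization is fixed cut points. The sketch below bins the height column of the table above; the cut points (170 and 175) and the interval labels are assumptions chosen only for illustration, since the slides leave the discretization method to the seminar.

```python
def discretize(value, edges, labels):
    """Return the label of the first bin whose upper edge exceeds `value`."""
    for edge, label in zip(edges, labels):
        if value < edge:
            return label
    return labels[-1]  # value is at or above the last cut point

heights = {1: 168, 2: 175, 3: 174, 4: 170}  # CID -> height, from the table
edges = [170, 175]                          # assumed cut points
labels = ["<170", "170..174", ">=175"]      # assumed interval names

items = {cid: f"height={discretize(h, edges, labels)}" for cid, h in heights.items()}
print(items[1])  # height=<170
print(items[3])  # height=170..174
```

After this step each row carries a categorical item such as "height=170..174", which an ordinary Boolean association rule miner can handle.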



Single- vs. Multi-dimensional Rules

• Single-dimensional vs. multi-dimensional associations
  – Single-dimensional: items or attributes in the rule refer to only one dimension (e.g., to "buys")

      Beer, Chips ⇒ Bread [0.4%, 52%]
      buys(x, "Beer") ^ buys(x, "Chips") ⇒ buys(x, "Bread") [0.4%, 52%]

  – Multi-dimensional: items or attributes in the rule refer to two or more dimensions (e.g., "buys", "time_of_transaction", "customer_category")

      In the following example: nationality, age, income



Multi-dimensional Rules

  CID   nationality   age   income
  1     Italian       50    low
  2     French        40    high
  3     French        30    high
  4     Italian       50    medium
  5     Italian       45    high
  6     French        35    high

RULES:

  nationality = French ⇒ income = high [50%, 100%]
  income = high ⇒ nationality = French [50%, 75%]
  age = 50 ⇒ nationality = Italian [33%, 100%]
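The three rules can be verified directly from the table. In this sketch each rule side is written as a dictionary of attribute-value conditions; `matches` and `rule_stats` are hypothetical helper names.

```python
rows = [
    {"nationality": "Italian", "age": 50, "income": "low"},
    {"nationality": "French", "age": 40, "income": "high"},
    {"nationality": "French", "age": 30, "income": "high"},
    {"nationality": "Italian", "age": 50, "income": "medium"},
    {"nationality": "Italian", "age": 45, "income": "high"},
    {"nationality": "French", "age": 35, "income": "high"},
]

def matches(row, conditions):
    """True if the row satisfies every attribute=value condition."""
    return all(row[attr] == value for attr, value in conditions.items())

def rule_stats(lhs, rhs):
    """Return (support, confidence) of the rule lhs => rhs."""
    n_lhs = sum(matches(r, lhs) for r in rows)
    n_both = sum(matches(r, {**lhs, **rhs}) for r in rows)
    return n_both / len(rows), n_both / n_lhs

print(rule_stats({"nationality": "French"}, {"income": "high"}))  # (0.5, 1.0)
print(rule_stats({"income": "high"}, {"nationality": "French"}))  # (0.5, 0.75)
```

The third rule, age = 50 ⇒ nationality = Italian, covers 2 of the 6 rows, which gives the 33% support and 100% confidence shown above.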



Single- vs. Multi-level Rules

• Single-level vs. multi-level associations
  – Single-level: associations between items or attributes from the same level of abstraction (i.e., from the same level of hierarchy)

      Beer, Chips ⇒ Bread [0.4%, 52%]

  – Multi-level: associations between items or attributes from different levels of abstraction (i.e., from different levels of hierarchy)

      Beer:Karjala, Chips:Estrella:Barbeque ⇒ Bread [0.1%, 74%]

More about multi-level association rules on the next slides…



Multi-level Association Rules

• It is difficult to find interesting patterns at a too primitive level
  – high support = too few rules
  – low support = too many rules, most uninteresting
• Approach: reason at a suitable level of abstraction
• A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts
• Multi-level association rules: rules which combine associations with a hierarchy of concepts



• Items often form hierarchies
• Items at the lower level are expected to have lower support
• Rules regarding itemsets at appropriate levels could be quite useful
• Transaction database can be encoded based on dimensions and levels

[Figure: concept hierarchy. Food → milk, bread; milk → skim, 2%; bread → white, wheat; brands (Fraser, Sunset) at the lowest level. Nodes are numbered within each level, so an item code has one digit per level.]

  TID   Items
  T1    {111, 121, 211, 221}
  T2    {111, 211, 222, 323}
  T3    {112, 122, 221, 411}
  T4    {111, 121}
  T5    {111, 122, 211, 221, 413}

  121 = milk - 2% - Fraser
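With this encoding, moving up the hierarchy amounts to keeping only a prefix of each item code. The sketch below counts support at each level for the encoded transactions T1-T5; the `*` masking notation and the function names are assumptions made for illustration.

```python
from collections import Counter

transactions = {
    "T1": ["111", "121", "211", "221"],
    "T2": ["111", "211", "222", "323"],
    "T3": ["112", "122", "221", "411"],
    "T4": ["111", "121"],
    "T5": ["111", "122", "211", "221", "413"],
}

def generalize(code, level):
    """Keep the first `level` digits and mask the rest with '*'."""
    return code[:level] + "*" * (len(code) - level)

def level_support(level):
    """Count, per generalized item, the transactions that contain it."""
    counts = Counter()
    for items in transactions.values():
        # A set, so each generalized item counts once per transaction.
        counts.update({generalize(code, level) for code in items})
    return counts

print(level_support(1)["1**"])  # 5
print(level_support(2)["12*"])  # 4
print(level_support(3)["121"])  # 2
```

The counts shrink from 5 to 4 to 2 as the item becomes more specific, which illustrates why lower-level items are expected to have lower support.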



• A top-down, progressive deepening approach:
  – First find high-level strong rules:
      milk ⇒ bread [20%, 60%]
  – Then find their lower-level "weaker" rules:
      2% milk ⇒ wheat bread [6%, 50%]
• Variations at mining multi-level association rules:
  – Level-crossed association rules:
      milk ⇒ wheat bread
  – Association rules with multiple, alternative hierarchies:
      milk ⇒ Wonder bread




• Generalizing/specializing values of attributes…
  – ...from specialized to general: support of rules increases (new rules may become valid)
  – ...from general to specialized: support of rules decreases (rules may become invalid when their support falls under the threshold)
• Too low level => too many rules and too primitive
    Pepsi light 0.5l bottle ⇒ Taffel Barbeque Chips 200gr
• Too high level => uninteresting rules
    Food ⇒ Clothes




Redundancy Filtering

• Some rules may be redundant due to "ancestor" relationships between items
• Example (milk has 4 subclasses):
  – milk ⇒ wheat bread [support = 8%, confidence = 70%]
  – 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor
  – Above, the second rule could be redundant
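The "expected" support in this example can be worked out as follows. This sketch assumes that the four subclasses of milk share the ancestor's support equally (1/4 each), which the slide suggests but does not state, and it uses a hypothetical tolerance of 0.5 percentage points to decide what counts as "close".

```python
def expected_support(ancestor_support, share):
    """Support a descendant rule would have if it only mirrored its ancestor."""
    return ancestor_support * share

ancestor = 0.08    # milk => wheat bread
observed = 0.02    # 2% milk => wheat bread
share = 1 / 4      # assumption: 4 equally popular subclasses of milk
tolerance = 0.005  # assumption: how close counts as "close"

expected = expected_support(ancestor, share)
redundant = abs(observed - expected) <= tolerance
print(round(expected, 4), redundant)  # 0.02 True
```

The observed support of 2% matches the expected 8% / 4 = 2%, and the confidences (70% vs. 72%) are also similar, so the more specific rule adds little information.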



Uniform vs. Reduced Support

• Uniform support: the same minimum support for all levels
  + One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
  – Lower-level items do not occur as frequently. If the support threshold is
    • too high ⇒ miss low-level associations
    • too low ⇒ generate too many high-level associations
• Reduced support: reduced minimum support at lower levels



Uniform Support

Multi-level mining with uniform support

  Level 1 (min_sup = 5%):  Milk [support = 10%]
  Level 2 (min_sup = 5%):  2% Milk [support = 6%], Skim Milk [support = 4%]



Reduced Support

Multi-level mining with reduced support

  Level 1 (min_sup = 5%):  Milk [support = 10%]
  Level 2 (min_sup = 3%):  2% Milk [support = 6%], Skim Milk [support = 4%]
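The difference between the two strategies shows up in which items survive at level 2. The sketch below applies a per-level threshold to the supports from the two figures; the dictionary layout and the function name `frequent_items` are illustrative.

```python
# Item -> (hierarchy level, support), taken from the two figures.
supports = {
    "Milk": (1, 0.10),
    "2% Milk": (2, 0.06),
    "Skim Milk": (2, 0.04),
}

def frequent_items(min_sup_by_level):
    """Items whose support reaches the threshold of their own level."""
    return [
        name for name, (level, sup) in supports.items()
        if sup >= min_sup_by_level[level]
    ]

print(frequent_items({1: 0.05, 2: 0.05}))  # ['Milk', '2% Milk']
print(frequent_items({1: 0.05, 2: 0.03}))  # ['Milk', '2% Milk', 'Skim Milk']
```

With uniform support (5% at both levels) Skim Milk is lost; with the threshold reduced to 3% at level 2 it is kept.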



Progressive Deepening

• A top-down, progressive deepening approach:
  – First mine high-level frequent items:
      milk (15%), bread (10%)
  – Then mine their lower-level "weaker" frequent itemsets:
      2% milk (5%), wheat bread (4%)
• Different min_support thresholds across multi-levels lead to different algorithms:
  – If adopting the same min_support across multi-levels, then do not examine t if any of t's ancestors is infrequent
  – If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor's support is frequent/non-negligible



Constraint-Based Mining

• Interactive, exploratory mining of gigabytes of data?
  – Could it be real? Yes, by making good use of constraints!
• What kinds of constraints can be used in mining?
  – Knowledge type constraint: classification, association, etc.
  – Data constraint: SQL-like queries
    • Find product pairs sold together in Vancouver in Dec. '98
  – Dimension/level constraints:
    • In relevance to region, price, brand, customer category
  – Interestingness constraints:
    • Strong rules (min_support ≥ 3%, min_confidence ≥ 60%)
  – Rule constraints (see the next slides)



Rule Constraints

• Two kinds of rule constraints:
  – Rule form constraints: meta-rule guided mining
    • Metarule: P(X, Y) ^ Q(X, W) ⇒ takes(X, "database systems")
    • Matching rule: age(X, "30..39") ^ income(X, "41K..60K") ⇒ takes(X, "database systems")
  – Rule content constraint: constraint-based query optimization (Ng, et al., SIGMOD'98)
    • sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000



Rule Constraints

• 1-variable vs. 2-variable constraints (Lakshmanan, et al. SIGMOD'99):
  – 1-var: a constraint confining only one side (L/R) of the rule, e.g.,
    • sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
  – 2-var: a constraint confining both sides (L and R), e.g.,
    • sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
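The two example constraints can be written as simple predicates over a candidate rule. The slides do not say which numeric item attribute the aggregates range over, so this sketch assumes it is a price, and the price table and item names below are entirely hypothetical.

```python
# Hypothetical item prices, used only to exercise the constraints.
price = {"a": 21, "b": 22, "c": 23, "d": 24, "e": 1500, "f": 300}

def one_var(lhs, rhs):
    """sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000"""
    left = [price[i] for i in lhs]
    right = [price[i] for i in rhs]
    return sum(left) < 100 and min(left) > 20 and len(left) > 3 and sum(right) > 1000

def two_var(lhs, rhs):
    """sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)"""
    left = [price[i] for i in lhs]
    right = [price[i] for i in rhs]
    return sum(left) < min(right) and max(right) < 5 * sum(left)

lhs = ["a", "b", "c", "d"]          # prices sum to 90
print(one_var(lhs, ["e"]), two_var(lhs, ["e"]))  # True False
print(one_var(lhs, ["f"]), two_var(lhs, ["f"]))  # False True
```

Each conjunct of the 1-variable constraint looks at one side only, so it can be checked while that side is being built. The 2-variable constraint compares both sides and can only be evaluated once both are known.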



Summary

• Association rule mining:
  – Probably the most significant contribution from the database community in KDD
  – Rather simple concept, but the "thinking" gives a basis for extensions and other methods
  – A large number of papers have been published
• Many interesting issues have been explored
• Interesting research directions:
  – Association analysis in other types of data: spatial data, multimedia data, time series data, etc.



References

• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.

• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.

• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.

• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.

• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.

• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.

• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.

• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.

• D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.

• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.



• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.

• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.

• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.

• E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.

• J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.

• J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.

• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.

• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.

• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.

• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.




• F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.

• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.

• H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.

• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.

• H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.

• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.

• R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.

• R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.

• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.




• J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.

• J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00, Dallas, TX, 11-20, May 2000.

• J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00, Boston, MA, Aug. 2000.

• G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.

• B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.

• S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.

• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.

• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.

• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.




• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.

• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.

• R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.

• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.

• H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.

• D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.

• K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.

• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.

• M. Zaki. Generating Non-Redundant Association Rules. KDD'00, Boston, MA, Aug. 2000.

• O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.




Next Week

• Lecture 31.10.: Episodes and recurrent patterns

– Mika gives the lecture

• Exercise 1.11.: Associations

– Pirjo takes care of you! :-)

• Seminar 2.11.: Associations

– Pirjo gives the lecture

– 2 group presentations




• Seminar presentations:

– Articles are given on the previous week's Wed

– Written presentation (around 3-5 printed pages) due by the start of the seminar:

• Can be either an HTML page or a printable document in PostScript/PDF format

– 30 minutes of presentation

– 5-15 minutes of discussion

– Active participation

Seminar Presentations



• Seminar presentations:

– Try to understand the "message" in the article

– Try to present the basic ideas as clearly as possible, use examples

– Do not present detailed mathematics or algorithms

– Test: do you understand your own presentation?

– In the presentation, use PowerPoint or conventional slides

Seminar Presentations



Seminar Presentations/Groups 1-2

Quantitative Rules

R. Srikant, R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM-SIGMOD 1996

MINERULE

Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New SQL-like Operator for Mining Association Rules". VLDB 1996: 122-133



Thank you for your attention and have a nice weekend!

Thanks to Jiawei Han from Simon Fraser University for his slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.
