Course on Data Mining
Mika Klemettinen and Pirjo Moen
University of Helsinki/Dept of CS, Autumn 2001

Course on Data Mining (581550-4)
• Topics: Intro/Ass. Rules, Episodes, Text Mining, Clustering, KDD Process, Appl./Summary, Home Exam
• Dates: 24./26.10., 30.10., 7.11., 14.11., 21.11., 28.11.

Today 26.10.2001
• Summary:
  – Course organization
  – What is data mining?
• Today's subject:
  – Association rules
• Next week's program:
  – Lecture: Episodes
  – Exercise: Associations
  – Seminar: Associations

Course Organization: Lectures, Exercises, Exam
• 12 lectures: 24.10.-30.11.2001
  – Wed 14-16, Fri 12-14 (A217)
    • Wed: normal lecture
    • Fri: seminar-like lecture (except for 26.10.)
• 5 exercise sessions: 1.11.-29.11.2001
  – Thu 12-14 (A318)
• Home exam:
  – Given: 28.11., to be returned by: 21.12.2001
• Language:
  – Lecturing language is Finnish
  – Slides and material are in English

Course Organization: Group Work
• Group for seminar (and exercise) work:
  – 10 groups of 3 persons each, 2 groups per lecture
  – Dates are agreed at the beginning of the course
  – Articles are given on the previous week's Wed
• Seminar presentations:
  – Written presentation (around 3-5 printed pages) due by the start of the seminar:
    • Can be either an HTML page or a printable document in PostScript/PDF format
  – 30 minutes of presentation
  – 5-15 minutes of discussion
  – Active participation

Course Organization / Groups
• Group presentation time allocation:
  – Fri 2.11.: Group 1, Group 2 (associations)
  – Fri 9.11.: Group 3, Group 4 (episodes)
  – Fri 16.11.: Group 5, Group 6 (text mining)
  – Fri 23.11.: Group 7, Group 8 (clustering)
  – Fri 30.11.: Group 9, Group 10 (KDD process)

Course Organization: Course Evaluation
• Passing the course: min 30 points
  – home exam: min 13 points (max 30 points)
  – exercises/experiments: min 8 points (max 20 points)
    • at least 3 returned and reported experiments
  – group presentation: min 4 points (max 10 points)
• Remember also the other requirements:
  – Attending the lectures (5/7)
  – Attending the seminars (4/5)
  – Attending the exercises (4/5)

Course Organization: Course Material
• Lecture slides
• Original articles
• Seminar presentations
• Book: "Data Mining: Concepts and Techniques" by Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers, August 2000. 550 pages. ISBN 1-55860-489-8
• Remember to check the course website and folder for the material!

Summary: What is Data Mining?
• Ultimately:
  – "Extraction of interesting (non-trivial, implicit, previously unknown, potentially useful) information or patterns from data in large databases"
• Often just:
  – "Tell something interesting about this data", "Describe this data"
• Exploratory, semi-automatic data analysis on large data sets

Summary: What is Data Mining?
• Data mining: semi-automatic discovery of interesting patterns from large data sets
• Knowledge discovery is a process:
  – Preprocessing
  – Data mining
  – Postprocessing
• To be mined, used or utilized: different …
  – Databases (relational, object-oriented, spatial, WWW, …)
  – Knowledge (characterization, clustering, association, …)
  – Techniques (machine learning, statistics, visualization, …)
  – Applications (retail, telecom, Web mining, log analysis, …)

Summary: Typical KDD Process
[Figure: typical KDD process in three phases: (1) preprocessing, (2) data mining, (3) postprocessing. Labels in the figure: operational database, selection, time based selection, raw data, input data (cleaned, verified, focused), results, evaluation of interestingness, selected usable patterns, utilization]

Association Rules
• Basics
• Multi-level Rules
• Examples
• Generation
• Constraints

Market Basket Analysis
• Analysis of customer buying habits by finding associations and correlations between the different items that customers place in their "shopping basket"
  – Customer1: Milk, eggs, sugar, bread
  – Customer2: Milk, eggs, cereal, bread
  – Customer3: Eggs, sugar

Market Basket Analysis
• Given:
  – A database of customer transactions (e.g., shopping baskets), where each transaction is a set of items (e.g., products)
• Find:
  – Groups of items which are frequently purchased together

Market Basket Analysis
• Extract information on purchasing behavior
  – "IF a customer buys beer and sausage, THEN the customer also buys mustard with high probability"
• Actionable information: can suggest...
  – New store layouts and product assortments
  – Which products to put on promotion
• MBA approach is applicable whenever a customer purchases multiple things in proximity
  – Credit cards
  – Services of telecommunication companies
  – Banking services
  – Medical treatments

Market Basket Analysis
• Useful:
  "On Thursdays, grocery store consumers often purchase diapers and beer together."
• Trivial:
  "Customers who purchase maintenance agreements are very likely to purchase large appliances."
• Inexplicable/unexpected:
  "When a new hardware store opens, one of the most sold items is toilet rings."

Association Rules: Basics
• Association rule mining:
  – Finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
• Comprehensibility: Simple to understand
• Utilizability: Provide actionable information
• Efficiency: Efficient discovery algorithms exist
• Applications:
  – Market basket data analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

Association Rules: Basics
• Typical representation formats for association rules:
  – diapers ⇒ beer [0.5%, 60%]
  – buys:diapers ⇒ buys:beer [0.5%, 60%]
  – "IF buys diapers, THEN buys beer in 60% of the cases. Diapers and beer are bought together in 0.5% of the rows in the database."
• Other representations (used in Han's book):
  – buys(x, "diapers") ⇒ buys(x, "beer") [0.5%, 60%]
  – major(x, "CS") ^ takes(x, "DB") ⇒ grade(x, "A") [1%, 75%]

Association Rules: Basics
diapers ⇒ beer [0.5%, 60%]
"IF buys diapers, THEN buys beer in 60% of the cases, in 0.5% of the rows"
Parts of the rule: (1) diapers ⇒ (2) beer [(3) 0.5%, (4) 60%]
1 Antecedent, left-hand side (LHS), body
2 Consequent, right-hand side (RHS), head
3 Support, frequency ("in how big part of the data the things in left- and right-hand sides occur together")
4 Confidence, strength ("if the left-hand side occurs, how likely the right-hand side occurs")

Association Rules: Basics
• Support: denotes the frequency of the rule within transactions.
  $\mathrm{support}(A \Rightarrow B\ [s, c]) = p(A \cup B) = \mathrm{support}(\{A,B\})$
• Confidence: denotes the percentage of transactions containing A which also contain B.
  $\mathrm{confidence}(A \Rightarrow B\ [s, c]) = p(B \mid A) = p(A \cup B) / p(A) = \mathrm{support}(\{A,B\}) / \mathrm{support}(\{A\})$
Here $p(A \cup B)$ is the fraction of transactions that contain all items of both A and B.
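The two definitions can be tried out on the three customer baskets from the market basket slide. The sketch below is only an illustration: the function names `support` and `confidence` and the representation of baskets as frozensets are assumptions made here, not part of the course material.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(lhs, rhs, transactions):
    """Estimate p(rhs | lhs) as support(lhs and rhs together) / support(lhs)."""
    both = set(lhs) | set(rhs)
    return support(both, transactions) / support(lhs, transactions)


# Baskets of Customer1, Customer2 and Customer3 from the market basket slide.
baskets = [
    frozenset({"milk", "eggs", "sugar", "bread"}),
    frozenset({"milk", "eggs", "cereal", "bread"}),
    frozenset({"eggs", "sugar"}),
]

print(support({"milk", "bread"}, baskets))       # milk and bread share 2 of 3 baskets
print(confidence({"sugar"}, {"eggs"}, baskets))  # every basket with sugar also has eggs
print(confidence({"eggs"}, {"sugar"}, baskets))  # only 2 of the 3 egg baskets have sugar
```

Note that confidence is not symmetric: sugar ⇒ eggs and eggs ⇒ sugar share the same support but differ in confidence.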

Association Rules: Basics
• Minimum support:
  – High ⇒ few frequent itemsets ⇒ few valid rules which occur very often
  – Low ⇒ many valid rules which occur rarely
• Minimum confidence:
  – High ⇒ few rules, but all "almost logically true"
  – Low ⇒ many rules, many of them very "uncertain"
• Typical values: minimum support = 2-10 %, minimum confidence = 70-90 %

Association Rules: Basics
• Transaction:
  – Relational format <Tid, item>: <1, item1>, <1, item2>, <2, item3>
  – Compact format <Tid, itemset>: <1, {item1, item2}>, <2, {item3}>
• Item vs. itemset: single element vs. set of items
• Support of an itemset I: # of transactions containing I
• Minimum support: threshold for support
• Frequent itemset: itemset with support ≥ minimum support
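Converting from the relational format to the compact format is a simple grouping by transaction id. A minimal sketch, assuming the rows are available as `(tid, item)` tuples; the helper names `to_compact` and `support_count` are made up for this illustration:

```python
from collections import defaultdict


def to_compact(rows):
    """Group <Tid, item> rows into a {Tid: itemset} mapping."""
    itemsets = defaultdict(set)
    for tid, item in rows:
        itemsets[tid].add(item)
    return dict(itemsets)


def support_count(itemset, compact):
    """Number of transactions whose itemset contains every item of `itemset`."""
    return sum(set(itemset) <= items for items in compact.values())


relational = [(1, "item1"), (1, "item2"), (2, "item3")]
compact = to_compact(relational)
print(compact)
print(support_count({"item1", "item2"}, compact))
```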

Association Rules: Basics
• Given: (1) database of transactions, (2) each transaction is a list of items bought (purchased by a customer in a visit)
• Find: all rules with minimum support and confidence

Transaction ID   Items Bought
100              A, B, C
200              A, C
400              A, D
500              B, E, F

Itemset              Support
{A}                  3 or 75%
{B} and {C}          2 or 50%
{D}, {E} and {F}     1 or 25%
{A,C}                2 or 50%
Other item pairs     max 25%

• If min. support 50% and min. confidence 50%, then A ⇒ C [50%, 66.6%], C ⇒ A [50%, 100%]
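As a worked check, the two confidence values follow directly from the supports listed in the table (both rules share the support of {A,C}, which is 2/4 = 50%):

```latex
\begin{align*}
\mathrm{confidence}(A \Rightarrow C) &= \frac{\mathrm{support}(\{A,C\})}{\mathrm{support}(\{A\})}
  = \frac{2/4}{3/4} = \frac{2}{3} \approx 66.7\% \\
\mathrm{confidence}(C \Rightarrow A) &= \frac{\mathrm{support}(\{A,C\})}{\mathrm{support}(\{C\})}
  = \frac{2/4}{2/4} = 100\%
\end{align*}
```

The 66.6% on the slide is the same value, truncated rather than rounded.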

Association Rule Generation
• Association rule mining is a two-step process:
STEP 1: Find the frequent itemsets: the sets of items that have minimum support.
  – So-called Apriori trick: every subset of a frequent itemset must also be a frequent itemset:
    • i.e., if {AB} is a frequent itemset, both {A} and {B} must be frequent itemsets
  – Iteratively find frequent itemsets with size from 1 to k (k-itemsets)
STEP 2: Use the frequent itemsets to generate association rules.

Frequent Sets with Apriori
• Join Step: C_k is generated by joining L_{k-1} with itself
• Prune Step: Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset
• Pseudo-code:
    C_k: candidate itemsets of size k; L_k: frequent itemsets of size k
    L_1 = {frequent items};
    for (k = 1; L_k != ∅; k++) do begin
        C_{k+1} = {candidates generated from L_k};
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t
        L_{k+1} = {candidates in C_{k+1} with min_support}
    end
    return ∪_k L_k;
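The pseudo-code can be turned into a short runnable sketch. This is one possible reading, not the lecturers' implementation: the function name `apriori`, the use of an absolute support count (`min_count`), and the simplified join (any two frequent k-itemsets whose union has k + 1 items) are assumptions made here. The data is the four-transaction database D used in the Apriori example slides of this lecture.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {frozenset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # L1: count single items and keep the frequent ones.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(level)
    k = 1
    while level:
        # Join step: unions of two frequent k-itemsets that have size k + 1.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        # One database scan to count the surviving candidates.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(level)
        k += 1
    return frequent


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, min_count=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print(sorted(itemset), result[itemset])
```

Each pass of the `while` loop corresponds to one database scan in the pseudo-code; the loop ends as soon as a level produces no frequent itemsets.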

Apriori Candidate Generation
• The Apriori principle: Any subset of a frequent itemset must be frequent
• L_3 = {abc, abd, acd, ace, bcd}
• Self-joining: L_3 * L_3
  – abcd from abc and abd
  – acde from acd and ace
• Pruning:
  – acde is removed because ade is not in L_3
• C_4 = {abcd}
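The join and prune steps of this example can be reproduced with a few lines of code. The sketch assumes itemsets are kept as sorted tuples and joins two itemsets when they agree on everything except their last item; the name `apriori_gen` follows common usage in the literature but is otherwise an illustrative choice.

```python
from itertools import combinations


def apriori_gen(prev_level):
    """Generate candidate k-itemsets from the frequent (k-1)-itemsets."""
    prev = sorted(prev_level)
    prev_set = set(prev)
    k = len(prev[0]) + 1
    # Join: combine itemsets that share their first k-2 items.
    joined = [
        a + (b[-1],)
        for a in prev
        for b in prev
        if a[:-1] == b[:-1] and a[-1] < b[-1]
    ]
    # Prune: keep a candidate only if all of its (k-1)-subsets are frequent.
    return [
        c for c in joined
        if all(s in prev_set for s in combinations(c, k - 1))
    ]


L3 = [tuple("abc"), tuple("abd"), tuple("acd"), tuple("ace"), tuple("bcd")]
C4 = apriori_gen(L3)
print(["".join(c) for c in C4])
```

The join produces abcd and acde; acde is dropped in the prune step because ade (and also cde) is missing from L3.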

Apriori Example (1/6)
Database D:
TID   Items
100   1 3 4
200   2 3 5
300   1 2 3 5
400   2 5

Scan D → C_1:
itemset   sup.
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3

L_1 (itemsets of C_1 with support count ≥ 2):
itemset   sup.
{1}       2
{2}       3
{3}       3
{5}       3

Apriori Example (2/6)
C_2 (generated from L_1):
itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}

Scan D → C_2 with counts:
itemset   sup
{1 2}     1
{1 3}     2
{1 5}     1
{2 3}     2
{2 5}     3
{3 5}     2

L_2:
itemset   sup
{1 3}     2
{2 3}     2
{2 5}     3
{3 5}     2

Apriori Example (3/6)
C_3 (generated from L_2):
itemset
{2 3 5}

Scan D → L_3:
itemset   sup
{2 3 5}   2

Apriori Example (4/6): Search Space of Database D
1 2 3 4 5
12 13 14 15 23 24 25 34 35 45
123 124 125 134 135 145 234 235 245 345
1234 1235 1245 1345 2345
12345

Apriori Example (5/6): Apriori Trick on Level 1
[Figure: the same search space with the pruned itemsets marked. Since {4} is not frequent, every itemset containing 4 is pruned: 14, 24, 34, 45, 124, 134, 145, 234, 245, 345, 1234, 1245, 1345, 2345, 12345]

Apriori Example (6/6): Apriori Trick on Level 2
[Figure: the same search space with further itemsets marked as pruned. Since {1 2} and {1 5} are not frequent, their supersets 123, 125, 135 and 1235 are pruned as well; of the 3-itemsets, only 235 remains as a candidate]

Is Apriori Fast Enough?
• The core of the Apriori algorithm:
  – Use frequent (k – 1)-itemsets to generate candidate frequent k-itemsets
  – Use database scan and pattern matching to collect counts for the candidate itemsets
• The bottleneck of Apriori: candidate generation
  – Huge candidate sets:
    • 10^4 frequent 1-itemsets will generate on the order of 10^7 candidate 2-itemsets
    • To discover a frequent pattern of size 100, e.g., {a1, a2, …, a100}, one needs to generate 2^100 ≈ 10^30 candidates.
  – Multiple scans of database:
    • Needs (n + 1) scans, where n is the length of the longest pattern
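The two candidate counts can be checked with a short calculation: the first counts all pairs of 10^4 frequent items, the second counts all nonempty subsets of a 100-item pattern, each of which is itself frequent and therefore has to be generated as a candidate.

```latex
\begin{align*}
\binom{10^4}{2} &= \frac{10^4\,(10^4 - 1)}{2} = 49\,995\,000 \approx 5 \times 10^{7} \\
\sum_{k=1}^{100} \binom{100}{k} &= 2^{100} - 1 \approx 1.27 \times 10^{30}
\end{align*}
```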

Is Apriori Fast Enough?
• In practice:
  – For the basic Apriori approach, the number of attributes in a row is usually much more critical than the number of transaction rows
  – For example:
    • 50 attributes each having 1-3 values, 100.000 rows (not very bad)
    • 50 attributes each having 10-100 values, 100.000 rows (quite bad)
    • 10.000 attributes each having 5-10 values, 100 rows (very bad...)
  – Notice:
    • One attribute might have several different values
    • Association rule algorithms typically treat every attribute-value pair as one attribute (2 attributes with 5 values each => "10 attributes")
• There are some ways to overcome the problem...
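The remark about attribute-value pairs can be illustrated with a small encoding step. The rows below are hypothetical (the attribute names and values are invented for this sketch), as is the helper `row_to_items`; the point is only that each distinct attribute-value pair becomes one item.

```python
def row_to_items(row):
    """Encode one relational row as a set of "attribute=value" items."""
    return {f"{attribute}={value}" for attribute, value in row.items()}


# Hypothetical rows: 2 attributes with 2 distinct values each.
rows = [
    {"alarm_type": "PCM", "alarm_severity": "A1"},
    {"alarm_type": "PCM", "alarm_severity": "A2"},
    {"alarm_type": "LINK", "alarm_severity": "A1"},
]

transactions = [row_to_items(r) for r in rows]
all_items = set().union(*transactions)
print(sorted(all_items))
print(len(all_items))  # number of items the mining algorithm has to handle
```

With many attributes that each take many values, the number of items, and with it the number of candidate itemsets, grows accordingly.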

Improving Apriori Performance
• Hash-based itemset counting:
  – A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Transaction reduction:
  – A transaction that does not contain any frequent k-itemset is useless in subsequent scans
• Partitioning:
  – Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Sampling:
  – Mining on a subset of given data, lower support threshold + a method to determine the completeness
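As an illustration of the first idea, hash-based itemset counting, the sketch below hashes every 2-itemset of each transaction into a small table of buckets during a scan. The bucket function, the number of buckets and the name `hash_filter` are arbitrary choices made for this example; the data is the database D from the Apriori example.

```python
from collections import Counter
from itertools import combinations

NUM_BUCKETS = 7  # illustrative table size


def bucket(pair):
    """Toy hash function for a pair of integer items."""
    a, b = pair
    return (10 * a + b) % NUM_BUCKETS


def hash_filter(transactions, min_count):
    """Split the occurring 2-itemsets into (kept, pruned) by bucket count."""
    buckets = Counter()
    pairs = set()
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            buckets[bucket(pair)] += 1
            pairs.add(pair)
    # A pair whose bucket total is below the threshold cannot be frequent.
    kept = {p for p in pairs if buckets[bucket(p)] >= min_count}
    return kept, pairs - kept


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
kept, pruned = hash_filter(D, min_count=2)
print(sorted(pruned))
print(sorted(kept))
```

The filter is only a necessary condition: a pair such as (1, 4) occurs once but survives because it shares a bucket with a more frequent pair, so the kept pairs still need an exact count.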

Association Rules from Itemsets
• Pseudo-code:
    for every frequent itemset l
        generate all nonempty proper subsets s of l
        for every nonempty proper subset s of l
            output the rule "s ⇒ (l-s)" if support(l)/support(s) ≥ min_conf,
            where min_conf is the minimum confidence threshold
• E.g.: frequent set l = {abc}, subsets s = {a, b, c, ab, ac, bc}
  – a ⇒ b, a ⇒ c, b ⇒ c
  – a ⇒ bc, b ⇒ ac, c ⇒ ab
  – ab ⇒ c, ac ⇒ b, bc ⇒ a
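The rule generation step can be sketched as follows, assuming the frequent itemsets and their support counts are already available as a dictionary. The function name `generate_rules` is an illustrative choice, and the input is the set of frequent itemsets from the earlier four-transaction example (A, B, C and {A,C} at minimum support 50%).

```python
from itertools import combinations


def generate_rules(frequent, min_conf):
    """Yield (lhs, rhs, confidence) for every rule s => (l - s) that holds.

    `frequent` maps each frequent itemset (a frozenset) to its support.
    """
    for itemset, supp in frequent.items():
        # Nonempty proper subsets only, so that the consequent is never empty.
        for size in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(sorted(itemset), size)):
                conf = supp / frequent[lhs]
                if conf >= min_conf:
                    yield lhs, itemset - lhs, conf


frequent = {
    frozenset("A"): 3,
    frozenset("B"): 2,
    frozenset("C"): 2,
    frozenset("AC"): 2,
}
rules = list(generate_rules(frequent, min_conf=0.5))
for lhs, rhs, conf in rules:
    print(sorted(lhs), "=>", sorted(rhs), round(conf, 3))
```

Looking up `frequent[lhs]` is safe because every subset of a frequent itemset is itself frequent, so no further database scan is needed in this step.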

Association Rule Generation
• Rule 1 to remember:
  – Generating frequent sets is slow (especially itemsets of size 2)
  – Generating association rules from frequent itemsets is fast
• Rule 2 to remember:
  – For itemset generation, support threshold is used
  – For association rules, confidence threshold is used
• What happens in reality, how long does it take to create frequent sets and association rules?
  – Let's take small real-life examples…
  – Experiments are made with Citum 4/275 Alpha server with 512 MB of main memory & Red Hat Linux release 5.0 (kernel 2.0.30)

Performance Example (1/4)
[Figure: a network management system receiving alarms from the switched network (MSC nodes) and the access network (BSC and BTS nodes). Legend: MSC = mobile station controller, BSC = base station controller, BTS = base station transceiver]

Performance Example (2/4)
• Telecom data containing alarms:
    1234 EL1 PCM 940926082623 A1 ALARMTEXT..
  Fields, in order: alarm number, alarming network element, alarm type, date and time, alarm severity class
• Example data 1:
  – 43 478 alarms (26.9.94 - 5.10.94; ~ 10 days)
  – 2 234 different types of alarms, 23 attributes, 5503 different values
• Example data 2:
  – 73 679 alarms (1.2.95 - 22.3.95; ~ 7 weeks)
  – 287 different types of alarms, 19 attributes, 3411 different values

Performance Example (3/4)
[Figures: Data set 1 (~10 days), Data set 2 (~7 weeks)]
Example rule:
alarm_number=1234, alarm_type=PCM ⇒ alarm_severity=A1 [2%, 45%]

Performance Example (4/4)
• Example results for data 1:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 109 719   Time: 12.02 s
  – Frequent sets: 79 311   Time: 64 855.73 s
  – Rules: 3 750 000   Time: 860.60 s
• Example results for data 2:
  – Frequency threshold: 0.1 (lowest possible with this data)
  – Candidate sets: 43 600   Time: 1.70 s
  – Frequent sets: 13 321   Time: 10 478.93 s
  – Rules: 509 075   Time: 143.35 s

Selecting the Interesting Rules?
• Usually the result set is very big, so one must select the interesting rules based on:
  – Objective measures: two popular measures are support and confidence
  – Subjective measures (Silberschatz & Tuzhilin, KDD95): a rule (pattern) is interesting if it is unexpected (surprising to the user) and/or actionable (the user can do something with it)
• These issues will be discussed with KDD processes

Boolean vs. Quantitative Rules
• Boolean vs. quantitative association rules (based on the types of values handled)
  – Boolean: Rule concerns associations between the presence or absence of items (e.g. "buys A" or "does not buy A")
      buys=SQLServer, buys=DMBook ⇒ buys=DBMiner [2%, 60%]
      buys(x, "SQLServer") ^ buys(x, "DMBook") ⇒ buys(x, "DBMiner") [0.2%, 60%]
  – Quantitative: Rule concerns associations between quantitative items or attributes
      age=30..39, income=42..48K ⇒ buys=PC [1%, 75%]
      age(x, "30..39") ^ income(x, "42..48K") ⇒ buys(x, "PC") [1%, 75%]

Quantitative Rules
• Quantitative attributes: e.g., age, income, height, weight
• Categorical attributes: e.g., color of car
Problem: too many distinct values for quantitative attributes
Solution: transform quantitative attributes into categorical ones via discretization (more about this in the seminar!)

CID   height   weight   income
1     168      75.4     30.5
2     175      80.0     20.3
3     174      70.3     25.8
4     170      65.2     27.0
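One simple form of discretization is equal-width binning. The sketch below applies it to the height column of the table; the bin width of 5 and the helper name `to_interval` are assumptions made for illustration, and other discretization methods are equally possible.

```python
def to_interval(value, width):
    """Map an integer value to an equal-width interval label, e.g. "170..174"."""
    low = (value // width) * width
    return f"{low}..{low + width - 1}"


# Heights by customer id (CID), taken from the table above.
heights = {1: 168, 2: 175, 3: 174, 4: 170}
items = {cid: f"height={to_interval(h, 5)}" for cid, h in heights.items()}
for cid, item in items.items():
    print(cid, item)
```

After binning, customers 3 and 4 share the item `height=170..174`, so the attribute can contribute to frequent itemsets even though all four original values are distinct.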

Single- vs. Multi-dimensional Rules
• Single-dimensional vs. multi-dimensional associations
  – Single-dimensional: Items or attributes in the rule refer to only one dimension (e.g., to "buys")
      Beer, Chips ⇒ Bread [0.4%, 52%]
      buys(x, "Beer") ^ buys(x, "Chips") ⇒ buys(x, "Bread") [0.4%, 52%]
  – Multi-dimensional: Items or attributes in the rule refer to two or more dimensions (e.g., "buys", "time_of_transaction", "customer_category")
      In the following example: nationality, age, income

Multi-dimensional Rules
CID   nationality   age   income
1     Italian       50    low
2     French        40    high
3     French        30    high
4     Italian       50    medium
5     Italian       45    high
6     French        35    high

RULES:
nationality = French ⇒ income = high [50%, 100%]
income = high ⇒ nationality = French [50%, 75%]
age = 50 ⇒ nationality = Italian [33%, 100%]
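The support and confidence of the three rules can be recomputed from the table. In this sketch each customer is a dictionary and a rule side is a dictionary of attribute-value conditions; the helper names `matches` and `rule_stats` are illustrative.

```python
customers = [
    {"nationality": "Italian", "age": 50, "income": "low"},
    {"nationality": "French", "age": 40, "income": "high"},
    {"nationality": "French", "age": 30, "income": "high"},
    {"nationality": "Italian", "age": 50, "income": "medium"},
    {"nationality": "Italian", "age": 45, "income": "high"},
    {"nationality": "French", "age": 35, "income": "high"},
]


def matches(row, conditions):
    """True if the row satisfies every attribute=value condition."""
    return all(row[attr] == value for attr, value in conditions.items())


def rule_stats(lhs, rhs, rows):
    """Return (support, confidence) of the rule lhs => rhs."""
    n_lhs = sum(matches(r, lhs) for r in rows)
    n_both = sum(matches(r, {**lhs, **rhs}) for r in rows)
    return n_both / len(rows), n_both / n_lhs


print(rule_stats({"nationality": "French"}, {"income": "high"}, customers))
print(rule_stats({"income": "high"}, {"nationality": "French"}, customers))
print(rule_stats({"age": 50}, {"nationality": "Italian"}, customers))
```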

Single- vs. Multi-level Rules
• Single-level vs. multi-level associations
  – Single-level: Associations between items or attributes from the same level of abstraction (i.e., from the same level of hierarchy)
      Beer, Chips ⇒ Bread [0.4%, 52%]
  – Multi-level: Associations between items or attributes from different levels of abstraction (i.e., from different levels of hierarchy)
      Beer:Karjala, Chips:Estrella:Barbeque ⇒ Bread [0.1%, 74%]
• More about multi-level association rules on the next slides…

Multi-level Association Rules
• It is difficult to find interesting patterns at a too primitive level
  – high support = too few rules
  – low support = too many rules, most uninteresting
• Approach: reason at a suitable level of abstraction
• A common form of background knowledge is that an attribute may be generalized or specialized according to a hierarchy of concepts
• Multi-level association rules: rules which combine associations with a hierarchy of concepts
Multi-level Association Rules
• Items often form hierarchies
• Items at the lower level are expected to have lower support
• Rules regarding itemsets at appropriate levels could be quite useful
• Transaction database can be encoded based on dimensions and levels
[Figure: concept hierarchy rooted at Food; level 1: milk, bread; level 2: skim, 2% (under milk) and white, wheat (under bread); level 3: brands such as Fraser and Sunset]
Multi-level Association Rules
TID  Items
T1   {111, 121, 211, 221}
T2   {111, 211, 222, 323}
T3   {112, 122, 221, 411}
T4   {111, 121}
T5   {111, 122, 211, 221, 413}
Each item code has one digit per hierarchy level, e.g. 121 = milk - 2% - Fraser
[Figure: the Food concept hierarchy (milk, bread; skim, 2%, white, wheat; Fraser, Sunset) with a digit 1 or 2 on each branch]
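With this encoding, the support of a generalized item at any level can be counted by truncating the item codes. The sketch below assumes the codes are stored as three-character strings and writes a generalized item with "*" for the dropped digits (for example "12*" for 2% milk of any brand); that notation and the function names are assumptions made for the illustration.

```python
from collections import Counter

# Encoded transaction table from the slide. Items are kept as strings so that
# each character is one hierarchy level (e.g. "121" = milk - 2% - Fraser).
TRANSACTIONS = {
    "T1": {"111", "121", "211", "221"},
    "T2": {"111", "211", "222", "323"},
    "T3": {"112", "122", "221", "411"},
    "T4": {"111", "121"},
    "T5": {"111", "122", "211", "221", "413"},
}


def generalize(item, level):
    """Keep the first `level` digits of an item code and mask the rest."""
    return item[:level] + "*" * (len(item) - level)


def level_support(transactions, level):
    """Return {generalized item: support} for one hierarchy level.

    A transaction counts once for a generalized item, no matter how many of
    its items fall under it.
    """
    counts = Counter()
    for items in transactions.values():
        counts.update({generalize(item, level) for item in items})
    total = len(transactions)
    return {code: count / total for code, count in counts.items()}


if __name__ == "__main__":
    for level in (1, 2, 3):
        print(level, sorted(level_support(TRANSACTIONS, level).items()))
```

Moving from level 3 to level 1 can only keep or raise the support of an item's ancestor, which is why lower-level items are expected to have lower support.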
Multi-level Association Rules
• A top-down, progressive deepening approach:
– First find high-level strong rules:
milk ⇒ bread [20%, 60%]
– Then find their lower-level "weaker" rules:
2% milk ⇒ wheat bread [6%, 50%]
• Variations in mining multi-level association rules:
– Level-crossed association rules:
milk ⇒ wheat bread
– Association rules with multiple, alternative hierarchies:
milk ⇒ Wonder bread
Multi-level Association Rules
• Generalizing/specializing values of attributes…
– ...from specialized to general: support of rules increases (new rules may become valid)
– ...from general to specialized: support of rules decreases (rules may become invalid when their support falls under the threshold)
• Too low level => too many rules and too primitive
Pepsi light 0.5l bottle ⇒ Taffel Barbeque Chips 200gr
• Too high level => uninteresting rules
Food ⇒ Clothes
Redundancy Filtering
• Some rules may be redundant due to "ancestor" relationships between items
• Example (milk has 4 subclasses):
– milk ⇒ wheat bread [support = 8%, confidence = 70%]
– 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule
• A rule is redundant if its support is close to the "expected" value, based on the rule’s ancestor
– Above, the second rule could be redundant
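The redundancy test can be worked through for the milk example. The slide does not say how the "expected" value is derived; this sketch assumes that the 4 subclasses of milk share the ancestor's support equally, so the expected support of the 2% milk rule is 8% / 4 = 2%. The relative tolerance of 25% is likewise a hypothetical choice, and the function names are made up for the illustration.

```python
def expected_support(ancestor_support, share):
    """Support expected for a descendant rule if the item behaved like its ancestor.

    `share` is the fraction of the ancestor item's occurrences attributed to
    the descendant item (assumed here, not given on the slide).
    """
    return ancestor_support * share


def is_redundant(rule_support, ancestor_support, share, tolerance=0.25):
    """True if the rule's support is within `tolerance` (relative) of the expected value."""
    expected = expected_support(ancestor_support, share)
    return abs(rule_support - expected) <= tolerance * expected


if __name__ == "__main__":
    # milk => wheat bread has support 8%; 2% milk is assumed to be 1/4 of milk.
    print(expected_support(8.0, 1 / 4))   # expected support in percent
    print(is_redundant(2.0, 8.0, 1 / 4))  # observed support of the 2% milk rule
```

Under these assumptions the observed 2% matches the expected 2%, so the second rule adds no information beyond its ancestor. A fuller check would also compare the confidences (72% against 70%).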
Uniform vs. Reduced Support
• Uniform Support: the same minimum support for all levels
+ One minimum support threshold. No need to examine itemsets containing any item whose ancestors do not have minimum support.
– Lower level items do not occur as frequently. If the support threshold is
• too high ⇒ miss low level associations
• too low ⇒ generate too many high level associations
• Reduced Support: reduced minimum support at lower levels
Uniform Support
Multi-level mining with uniform support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Reduced Support
Multi-level mining with reduced support
Level 1 (min_sup = 5%): Milk [support = 10%]
Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
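The effect of the two threshold settings on the milk example can be checked directly. This is a small sketch using the supports and min_sup values from the uniform and reduced support slides; the dictionary layout and the function name are assumptions for the illustration.

```python
# Item -> (hierarchy level, support in percent), as on the slides.
SUPPORT = {
    "Milk": (1, 10.0),
    "2% Milk": (2, 6.0),
    "Skim Milk": (2, 4.0),
}

UNIFORM = {1: 5.0, 2: 5.0}  # same min_sup on every level
REDUCED = {1: 5.0, 2: 3.0}  # lower min_sup on the lower level


def frequent_items(support, min_sup_by_level):
    """Return the items whose support reaches the threshold of their own level."""
    return [
        item
        for item, (level, sup) in support.items()
        if sup >= min_sup_by_level[level]
    ]


if __name__ == "__main__":
    print("uniform:", frequent_items(SUPPORT, UNIFORM))
    print("reduced:", frequent_items(SUPPORT, REDUCED))
```

With uniform support, Skim Milk (4%) misses the 5% threshold; with the reduced level-2 threshold of 3% it becomes frequent.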
Progressive Deepening
• A top-down, progressive deepening approach:
– First mine high-level frequent items:
milk (15%), bread (10%)
– Then mine their lower-level "weaker" frequent itemsets:
2% milk (5%), wheat bread (4%)
• Different min_support thresholds across multi-levels lead to different algorithms:
– If adopting the same min_support across multi-levels, then do not examine an itemset t if any of t’s ancestors is infrequent
– If adopting reduced min_support at lower levels, then examine only those descendants whose ancestor’s support is frequent/non-negligible
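The top-down search can be sketched as a level-by-level traversal that only descends below frequent items. This is a simplified illustration for single items rather than itemsets; the supports are the ones on the slide, while the CHILDREN map, the threshold values and the function name are assumptions.

```python
from collections import deque

# Supports (percent) from the slide and a hypothetical parent -> children map.
SUPPORT = {"milk": 15.0, "bread": 10.0, "2% milk": 5.0, "wheat bread": 4.0}
CHILDREN = {"milk": ["2% milk"], "bread": ["wheat bread"]}


def progressive_deepening(roots, children, support, min_sup_by_level):
    """Mine frequent items top-down, pruning below infrequent ancestors.

    Returns (frequent, examined): the frequent items in visiting order and
    every item whose support was actually looked at.
    """
    frequent, examined = [], []
    queue = deque((item, 1) for item in roots)
    while queue:
        item, level = queue.popleft()
        examined.append(item)
        if support[item] >= min_sup_by_level[level]:
            frequent.append(item)
            # Only descendants of a frequent item are ever examined.
            queue.extend((child, level + 1) for child in children.get(item, []))
    return frequent, examined


if __name__ == "__main__":
    roots = ["milk", "bread"]
    print(progressive_deepening(roots, CHILDREN, SUPPORT, {1: 5.0, 2: 5.0}))
    print(progressive_deepening(roots, CHILDREN, SUPPORT, {1: 5.0, 2: 3.0}))
    print(progressive_deepening(roots, CHILDREN, SUPPORT, {1: 12.0, 2: 3.0}))
```

In the third call bread is infrequent at level 1, so wheat bread is never examined even though its support would pass the level-2 threshold.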
Constraint-Based Mining
• Interactive, exploratory mining of gigabytes of data?
– Could it be real? By making good use of constraints!
• What kinds of constraints can be used in mining?
– Knowledge type constraint: classification, association, etc.
– Data constraint: SQL-like queries
• Find product pairs sold together in Vancouver in Dec.’98
– Dimension/level constraints:
• In relevance to region, price, brand, customer category
– Interestingness constraints:
• Strong rules (min_support 3%, min_confidence 60%)
– Rule constraints (see the next slides)
Rule Constraints
• Two kinds of rule constraints:
– Rule form constraints: meta-rule guided mining
• Metarule: P(X, Y) ^ Q(X, W) ⇒ takes(X, "database systems")
• Matching rule: age(X, "30..39") ^ income(X, "41K..60K") ⇒ takes(X, "database systems")
– Rule content constraint: constraint-based query optimization (Ng, et al., SIGMOD’98)
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
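A rule form constraint can be read as a template that candidate rules must fit. The sketch below checks the metarule P(X, Y) ^ Q(X, W) ⇒ takes(X, "database systems") under a simplifying assumption: P and Q may be any two different predicates over the same customer X, and the right-hand side must be exactly the given atom. The Atom type and the function name are hypothetical.

```python
from typing import NamedTuple


class Atom(NamedTuple):
    """One predicate over the customer variable X, e.g. age(X, "30..39")."""
    predicate: str
    value: str


RHS_TEMPLATE = Atom("takes", "database systems")


def matches_metarule(lhs, rhs, n_lhs=2, rhs_template=RHS_TEMPLATE):
    """True if the rule lhs => rhs is an instance of the metarule.

    The left-hand side must consist of exactly `n_lhs` atoms with pairwise
    different predicates, and the right-hand side must equal the template.
    """
    predicates = {atom.predicate for atom in lhs}
    return len(lhs) == n_lhs and len(predicates) == n_lhs and rhs == rhs_template


if __name__ == "__main__":
    rule_lhs = [Atom("age", "30..39"), Atom("income", "41K..60K")]
    print(matches_metarule(rule_lhs, Atom("takes", "database systems")))
```

Only rules that fit the template need to be generated and counted, which is what makes the metarule useful as a pruning device.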
Rule Constraints
• 1-variable vs. 2-variable constraints (Lakshmanan, et al., SIGMOD’99):
– 1-var: A constraint confining only one side (L/R) of the rule, e.g.,
• sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000
– 2-var: A constraint confining both sides (L and R), e.g.,
• sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)
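The two example constraints can be written as predicates over a candidate rule. In this sketch each side of a rule is a list of numeric item values; treating them as prices is an assumption, as are the function names and the sample numbers. In the 1-variable example every conjunct mentions only one side of the rule, while each conjunct of the 2-variable example relates both sides.

```python
def one_var_ok(lhs, rhs):
    """sum(LHS) < 100 ^ min(LHS) > 20 ^ count(LHS) > 3 ^ sum(RHS) > 1000."""
    if not lhs or not rhs:
        return False
    return sum(lhs) < 100 and min(lhs) > 20 and len(lhs) > 3 and sum(rhs) > 1000


def two_var_ok(lhs, rhs):
    """sum(LHS) < min(RHS) ^ max(RHS) < 5 * sum(LHS)."""
    if not lhs or not rhs:
        return False
    return sum(lhs) < min(rhs) and max(rhs) < 5 * sum(lhs)


if __name__ == "__main__":
    lhs = [21, 22, 23, 24]   # hypothetical item values, sum = 90
    print(one_var_ok(lhs, [600, 500]), two_var_ok(lhs, [600, 500]))
    print(one_var_ok(lhs, [200, 300]), two_var_ok(lhs, [200, 300]))
```

A 1-variable conjunct can be checked while one side of the rule is being built, whereas a 2-variable conjunct needs both sides before it can be evaluated.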
Summary
• Association rule mining:
– Probably the most significant contribution from the database community in KDD
– Rather simple concept, but the "thinking" gives a basis for extensions and other methods
– A large number of papers have been published
• Many interesting issues have been explored
• Interesting research directions:
– Association analysis in other types of data: spatial data, multimedia data, time series data, etc.
References
• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. In Journal of Parallel and Distributed Computing (Special Issue on High Performance Data Mining), 2000.
• R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93, 207-216, Washington, D.C.
• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94, 487-499, Santiago, Chile.
• R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95, 3-14, Taipei, Taiwan.
• R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98, 85-93, Seattle, Washington.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97, 265-276, Tucson, Arizona.
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97, 255-264, Tucson, Arizona, May 1997.
• K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99, 359-370, Philadelphia, PA, June 1999.
• D.W. Cheung, J. Han, V. Ng, and C.Y. Wong. Maintenance of discovered association rules in large databases: An incremental updating technique. ICDE'96, 106-114, New Orleans, LA.
• M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98, 299-310, New York, NY, Aug. 1998.
• G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00, 512-521, San Diego, CA, Feb. 2000.
• Y. Fu and J. Han. Meta-rule-guided mining of association rules in relational databases. KDOOD'95, 39-46, Singapore, Dec. 1995.
• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96, 13-23, Montreal, Canada.
• E.-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. SIGMOD'97, 277-288, Tucson, Arizona.
• J. Han, G. Dong, and Y. Yin. Efficient mining of partial periodic patterns in time series database. ICDE'99, Sydney, Australia.
• J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95, 420-431, Zurich, Switzerland.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00, 1-12, Dallas, TX, May 2000.
• T. Imielinski and H. Mannila. A database perspective on knowledge discovery. Communications of ACM, 39:58-64, 1996.
• M. Kamber, J. Han, and J. Y. Chiang. Metarule-guided mining of multi-dimensional association rules using data cubes. KDD'97, 207-210, Newport Beach, California.
• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A.I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94, 401-408, Gaithersburg, Maryland.
• F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98, 582-593, New York, NY.
• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97, 220-231, Birmingham, England.
• H. Lu, J. Han, and L. Feng. Stock movement and n-dimensional inter-transaction association rules. SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD'98), 12:1-12:7, Seattle, Washington.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94, 181-192, Seattle, WA, July 1994.
• H. Mannila, H Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.
• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96, 122-133, Bombay, India.
• R.J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97, 452-461, Tucson, Arizona.
• R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained associations rules. SIGMOD'98, 13-24, Seattle, Washington.
• N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99, 398-416, Jerusalem, Israel, Jan. 1999.
• J.S. Park, M.S. Chen, and P.S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95, 175-186, San Jose, CA, May 1995.
• J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00, Dallas, TX, 11-20, May 2000.
• J. Pei and J. Han. Can We Push More Constraints into Frequent Pattern Mining? KDD'00. Boston, MA. Aug. 2000.
• G. Piatetsky-Shapiro. Discovery, analysis, and presentation of strong rules. In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, 229-238. AAAI/MIT Press, 1991.
• B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98, 412-421, Orlando, FL.
• S. Ramaswamy, S. Mahajan, and A. Silberschatz. On the discovery of interesting patterns in association rules. VLDB'98, 368-379, New York, NY.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98, 343-354, Seattle, WA.
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95, 432-443, Zurich, Switzerland.
• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98, 494-502, Orlando, FL, Feb. 1998.
• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98, 594-605, New York, NY.
• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95, 407-419, Zurich, Switzerland, Sept. 1995.
• R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96, 1-12, Montreal, Canada.
• R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97, 67-73, Newport Beach, California.
• H. Toivonen. Sampling large databases for association rules. VLDB'96, 134-145, Bombay, India, Sept. 1996.
• D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98, 1-12, Seattle, Washington.
• K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97, 96-103, Newport Beach, CA, Aug. 1997.
• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithm for discovery of association rules. Data Mining and Knowledge Discovery, 1:343-374, 1997.
• M. Zaki. Generating Non-Redundant Association Rules. KDD'00. Boston, MA. Aug. 2000.
• O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00, 461-470, San Diego, CA, Feb. 2000.
Course Organization
Next Week
• Lecture 31.10.: Episodes and recurrent patterns
– Mika gives the lecture
• Exercise 1.11.: Associations
– Pirjo takes care of you! :-)
• Seminar 2.11.: Associations
– Pirjo gives the lecture
– 2 group presentations
Seminar Presentations
• Seminar presentations:
– Articles are given on the previous week's Wed
– A written presentation (around 3-5 printed pages) is due by the start of the seminar:
• Can be either an HTML page or a printable document in PostScript/PDF format
– 30 minutes of presentation
– 5-15 minutes of discussion
– Active participation
Seminar Presentations
• Seminar presentations:
– Try to understand the "message" in the article
– Try to present the basic ideas as clearly as possible, use examples
– Do not present detailed mathematics or algorithms
– Test: do you understand your own presentation?
– In the presentation, use PowerPoint or conventional slides
Seminar Presentations/Groups 1-2
• Quantitative Rules: R. Srikant, R. Agrawal: "Mining Quantitative Association Rules in Large Relational Tables", Proc. of the ACM-SIGMOD 1996
• MINERULE: Rosa Meo, Giuseppe Psaila, Stefano Ceri: "A New SQL-like Operator for Mining Association Rules". VLDB 1996: 122-133
Thank you for your attention and have a nice weekend!
Thanks to Jiawei Han from Simon Fraser University for his slides which greatly helped in preparing this lecture! Also thanks to Fosca Giannotti and Dino Pedreschi from Pisa for their slides.
Introduction to Data Mining (DM)