Multi-Level Association Rules
TRANSCRIPT
Support for an itemset X in a transactional database D is defined as count(X) / |D|.
For an association rule X ⇒ Y, we can calculate:
support(X ⇒ Y) = support(X ∪ Y)
confidence(X ⇒ Y) = support(X ∪ Y) / support(X)
Support (S) and Confidence (C) can also be related to joint and conditional probabilities as follows:
support(X ⇒ Y) = P(X ∪ Y)
confidence(X ⇒ Y) = P(Y | X)
The number of association rules that can be derived from a dataset D is exponentially large. Interesting association rules are those whose support and confidence exceed the thresholds minSupp and minConf.
Frequent itemsets (also called large itemsets) are those itemsets whose support is greater
than minSupp. The apriori property (downward closure property) states that every subset of a
frequent itemset is also a frequent itemset.
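The definitions above can be checked on a toy transactional database. A minimal sketch in Python; the transactions are made up for illustration:

```python
# Toy transactional database D: each transaction is a set of items.
D = [
    {"computer", "printer"},
    {"computer", "printer", "scanner"},
    {"computer"},
    {"printer"},
    {"computer", "printer"},
]

def support(itemset):
    """support(X) = count(X) / |D| -- fraction of transactions containing X."""
    return sum(1 for t in D if itemset <= t) / len(D)

def confidence(X, Y):
    """confidence(X => Y) = support(X u Y) / support(X)."""
    return support(X | Y) / support(X)

s = support({"computer", "printer"})        # 3 of 5 transactions
c = confidence({"computer"}, {"printer"})   # 0.6 / 0.8
print(s, c)
```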
Multi Level Association Rules – Concepts:
o Rules generated by mining data at different levels of abstraction
o Mining at different levels is essential to support business decision making
o Data is massive and highly sparse at the primitive level
o Rules at a high concept level often amount to common-sense knowledge
o Rules at a low concept level may not always be interesting
Example:
o Items in task-relevant data are usually at the primitive level
o Primitive-level data items occur least frequently
buys(hp-laptop computer) ⇒ buys(canon-inkjet printer)
vs
buys(laptop computer) ⇒ buys(inkjet printer)
vs
buys(computer) ⇒ buys(printer)
o Support-Confidence framework
o Top-down strategy for accumulating counts
o Algorithms – Apriori and its variations
o Variations include:
o Uniform support for all levels
o Reduced support at lower levels
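A concept hierarchy is what makes multi-level mining possible: each primitive item can be generalized to its ancestor before counts are accumulated top-down. A small illustrative sketch; the hierarchy entries below are invented for this example:

```python
# Hypothetical concept hierarchy: child -> parent.
hierarchy = {
    "hp-laptop": "laptop",
    "dell-laptop": "laptop",
    "laptop": "computer",
    "canon-inkjet": "inkjet-printer",
    "inkjet-printer": "printer",
}

def generalize(item, levels=1):
    """Walk `levels` steps up the hierarchy (stopping at the root)."""
    for _ in range(levels):
        if item not in hierarchy:
            break
        item = hierarchy[item]
    return item

# A primitive-level transaction viewed one level up, then at the top level:
t = {"hp-laptop", "canon-inkjet"}
print({generalize(i, 1) for i in t})   # {'laptop', 'inkjet-printer'}
print({generalize(i, 2) for i in t})   # {'computer', 'printer'}
```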
Mining (UNIFORM SUPPORT):
o Same support threshold at all levels of abstraction
o Itemsets whose ancestors do not satisfy minimum support are not examined
o A higher support threshold loses interesting associations at lower abstraction levels
o A lower support threshold generates many uninteresting associations at higher abstraction levels
o Alternate search strategies
o Level-by-level independent:
Full-breadth search
No background knowledge is used for pruning
Leads to examining many infrequent items
o Level-cross filtering by single item:
An item at level i is examined only if its parent node at level i-1 is frequent
May miss frequent items at lower abstraction levels (due to reduced support)
o Level-cross filtering by k-itemset:
A k-itemset at level i is examined only if the corresponding k-itemset at level i-1 is frequent
May miss frequent k-itemsets at lower abstraction levels (due to reduced support)
o Controlled level-cross filtering by single item:
o A modified level-cross filtering by single item
o Sets a level passage threshold for each level
o Allows the inspection of lower abstractions even if the ancestor fails to satisfy the
min_sup threshold
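The examination test behind controlled level-cross filtering can be sketched as a single predicate. A hedged illustration in Python; the levels, supports, and thresholds are all made-up figures, not values from the source:

```python
min_sup = {1: 0.05, 2: 0.05, 3: 0.05}   # uniform minimum support per level
passage = {1: 0.03, 2: 0.03}            # level passage thresholds

# item -> (level, support); all figures invented for illustration
counts = {
    "computer":     (1, 0.10),
    "laptop":       (2, 0.06),   # frequent
    "desktop":      (2, 0.04),   # fails min_sup, passes passage threshold
    "tablet":       (2, 0.01),   # fails both
    "hp-laptop":    (3, 0.05),
    "dell-desktop": (3, 0.03),
    "ipad":         (3, 0.02),
}
parent = {"laptop": "computer", "desktop": "computer", "tablet": "computer",
          "hp-laptop": "laptop", "dell-desktop": "desktop", "ipad": "tablet"}

def examined(item):
    """Examine a node at level i if its parent at level i-1 is frequent,
    OR the parent at least passes the level passage threshold."""
    level, _ = counts[item]
    if level == 1:
        return True
    p_level, p_sup = counts[parent[item]]
    return p_sup >= min_sup[p_level] or p_sup >= passage[p_level]

print(examined("hp-laptop"))     # parent is frequent
print(examined("dell-desktop"))  # parent only passes the passage threshold
print(examined("ipad"))          # parent fails both -> pruned
```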
Computer ⇒ Printer
(at the same abstraction level)
Computer ⇒ InkJet Printer (cross-level association rule)
(at different abstraction levels)
Redundancy:
Laptop Computer ⇒ InkJet Printer
(support = 10%, confidence = 70%)
vs
HP Laptop Computer ⇒ InkJet Printer
(support = 5%, confidence = 68%)
o The second rule is redundant because of its ancestor relationship with the first: its support and confidence are close to the values expected from the ancestor rule
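One way to make this redundancy test concrete (following the idea that a descendant rule is uninteresting when its support is close to what the ancestor rule already predicts) is to compare the observed support against an expected support scaled by the descendant's share of the ancestor item. The 50% HP share and the tolerance below are assumptions for illustration:

```python
ancestor_support = 0.10   # Laptop Computer => InkJet Printer
share_hp = 0.5            # assumed: HP laptops are half of laptop sales
expected = ancestor_support * share_hp

descendant_support = 0.05 # HP Laptop Computer => InkJet Printer

def is_redundant(observed, expected, tol=0.25):
    """Redundant if observed support deviates from the expected support
    by less than `tol` (relative deviation)."""
    return abs(observed - expected) / expected < tol

print(is_redundant(descendant_support, expected))  # close to expected -> redundant
```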
Multi-Dimensional Association Rules – Concepts:
=> Rules involving more than one dimension or predicate
• buys(X, “IBM Laptop Computer”) ⇒ buys(X, “HP Inkjet Printer”)
(single-dimensional)
• age(X, “20..25”) ∧ occupation(X, “student”) ⇒ buys(X, “HP Inkjet Printer”)
(multi-dimensional – inter-dimension association rule)
• age(X, “20..25”) ∧ buys(X, “IBM Laptop Computer”) ⇒ buys(X, “HP Inkjet Printer”)
(multi-dimensional – hybrid-dimension association rule)
• Attributes can be categorical or quantitative
• Quantitative attributes are numeric and incorporate an implicit ordering or hierarchy (age, income, ...)
• Numeric attributes must be discretized
• Three different approaches to mining multi-dimensional association rules:
o Using static discretization of quantitative attributes
o Using dynamic discretization of quantitative attributes
o Using distance-based discretization with clustering
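Static discretization simply maps each numeric value into a predefined interval label before mining begins. A minimal sketch, with assumed bin edges:

```python
# Assumed, hypothetical bin edges for an "age" attribute.
bins = [(18, 25), (26, 35), (36, 50), (51, 120)]

def discretize_age(age):
    """Map a numeric age to a categorical interval label prior to mining."""
    for lo, hi in bins:
        if lo <= age <= hi:
            return f"age[{lo}..{hi}]"
    return "age[other]"

print(discretize_age(22))   # age[18..25]
print(discretize_age(40))   # age[36..50]
```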
Mining using Static Discretization:
• Discretization is static and occurs prior to mining
• Discretized attributes are treated as categorical
• Use the Apriori algorithm to find all frequent k-predicate sets
• Every subset of a frequent predicate set must also be frequent
• If, in a data cube, the 3-D cuboid (age, income, buys) is frequent, then (age, income), (age, buys), and (income, buys) must also be frequent
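The downward-closure check over predicate sets can be sketched directly. Assuming the cuboid example above, with a hypothetical collection of frequent predicate sets:

```python
from itertools import combinations

# Hypothetical frequent predicate sets (names follow the cube example).
frequent = {frozenset(s) for s in [
    ("age",), ("income",), ("buys",),
    ("age", "income"), ("age", "buys"), ("income", "buys"),
    ("age", "income", "buys"),
]}

def closure_holds(pred_set):
    """True if every proper subset of the predicate set is frequent."""
    return all(frozenset(c) in frequent
               for k in range(1, len(pred_set))
               for c in combinations(sorted(pred_set), k))

print(closure_holds(("age", "income", "buys")))
```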
Mining using Dynamic Discretization:
• Also known as mining quantitative association rules
• Numeric attributes are dynamically discretized
• Consider rules of the form
A_quan1 ∧ A_quan2 ⇒ A_cat
(2-D quantitative association rules)
age(X, ”20..25”) ∧ income(X, ”30K..40K”) ⇒ buys(X, ”Laptop Computer”)
• ARCS (Association Rule Clustering System) is one approach for mining quantitative association rules
• Two-step mining process:
o Perform clustering to find the intervals of the attributes involved
o Obtain association rules by searching for groups of clusters that occur together
• The resulting rules must satisfy:
o Clusters in the rule antecedent are strongly associated with clusters in the rule
consequent
o Clusters in the antecedent occur together
o Clusters in the consequent occur together
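The two-step ARCS process can be loosely sketched as binning two quantitative attributes into a 2-D grid and merging adjacent occupied cells into one cluster, which then maps to a rule's intervals. This is a rough illustration with invented data, not the actual ARCS algorithm:

```python
from collections import Counter

# (age_bin, income_bin) grid cells for buyers of "Laptop Computer";
# data is made up for illustration.
cells = Counter([(20, 30), (20, 30), (21, 30), (21, 31), (40, 80)])
dense = {c for c, n in cells.items() if n >= 1}  # low threshold: tiny toy data

# Greedily grow one cluster of adjacent dense cells from a seed cell.
seed = (20, 30)
cluster = {seed}
grew = True
while grew:
    grew = False
    for c in dense - cluster:
        if any(abs(c[0] - m[0]) <= 1 and abs(c[1] - m[1]) <= 1 for m in cluster):
            cluster.add(c)
            grew = True

# The cluster's bounding intervals become the rule antecedent.
ages = sorted(a for a, _ in cluster)
incomes = sorted(i for _, i in cluster)
rule = (f'age(X,"{ages[0]}..{ages[-1]}") and '
        f'income(X,"{incomes[0]}K..{incomes[-1]}K") => '
        f'buys(X,"Laptop Computer")')
print(rule)
```

The isolated cell (40, 80) is never merged, so it does not widen the rule's intervals; that is the point of clustering before rule generation.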