Presentation Asrul


  • Slide 1

    Association Rules: Apriori Algorithm

    Machine Learning Overview

    Sales Transaction and Association Rules

    Apriori Algorithm

    Example

  • Slide 2

    Machine Learning

    Common ground of the presented methods:

    Statistical Learning Methods (frequency/similarity-based)

    Distinctions:

    Data are represented in a vector space or symbolically

    Supervised or unsupervised learning

    Scalable (works with very large data) or not

    Extracting Rules from Examples

    Apriori Algorithm

    FP-Growth

    Large quantities of stored (symbolic) data; "large" often means extremely large, so it is important that the algorithm is scalable

    Unsupervised learning

  • Slide 3

    Cluster Analysis

    K-Means

    EM (Expectation Maximization)

    COBWEB

    Clustering Assessment: KNN (k-Nearest Neighbor)

    Data are represented in a vector space

    Unsupervised learning

    Uncertain knowledge

    Naive Bayes

    Belief Networks (Bayesian Networks)

    The main tool is probability theory, which assigns to each item a numerical degree of belief between 0 and 1

    Learning from Observation (symbolic data)

    Supervised learning

  • Slide 4

    Decision Trees

    ID3

    C4.5

    CART

    Learning from Observation (symbolic data)

    Supervised learning

    Supervised Classifiers: Artificial Neural Networks

    Feed-forward networks with one layer: Perceptron

    With several layers: Backpropagation Algorithm

    RBF Networks

    Support Vector Machines

    Data are represented in a vector space

    Supervised learning

  • Slide 5

    Prediction

    Linear Regression

    Logistic Regression

    Sales Transaction Table

    We would like to perform a basket analysis of the set of products in a single transaction

    Discovering, for example, that a customer who buys shoes is likely to buy socks:

    Shoes ⇒ Socks

  • Slide 6

    Transactional Database

    The set of all sales transactions is called the population

    We represent the transactions as one record per transaction

    Each transaction is represented by a data tuple

    TID    Items
    TX1    Shoes, Socks, Tie
    TX2    Shoes, Socks, Tie, Belt, Shirt
    TX3    Shoes, Tie
    TX4    Shoes, Socks, Belt

    Socks ⇒ Tie

    Socks is the rule antecedent

    Tie is the rule consequent

  • Slide 7

    Support and Confidence

    Any given association rule has a support level and a confidence level

    Support is the percentage of the population which satisfies the rule

    The confidence is the percentage of the transactions satisfying the antecedent in which the consequent is also satisfied

    Transactional Database

    TID    Items
    TX1    Shoes, Socks, Tie
    TX2    Shoes, Socks, Tie, Belt, Shirt
    TX3    Shoes, Tie
    TX4    Shoes, Socks, Belt

    For the rule Socks ⇒ Tie:

    Support is 50% (2/4): Socks and Tie occur together in 2 of the 4 transactions

    Confidence is 66.67% (2/3): of the 3 transactions containing Socks, 2 also contain Tie
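    Below is a minimal sketch, in Python, of how these two numbers can be computed; it is not part of the original slides, and the transaction data and variable names are taken from the example above.

        # Compute support and confidence for the rule Socks => Tie
        # over the four example transactions.
        transactions = [
            {"Shoes", "Socks", "Tie"},                   # TX1
            {"Shoes", "Socks", "Tie", "Belt", "Shirt"},  # TX2
            {"Shoes", "Tie"},                            # TX3
            {"Shoes", "Socks", "Belt"},                  # TX4
        ]

        antecedent, consequent = {"Socks"}, {"Tie"}

        n_antecedent = sum(1 for t in transactions if antecedent <= t)
        n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)

        support = n_both / len(transactions)  # 2/4 = 50%
        confidence = n_both / n_antecedent    # 2/3 = 66.67%
        print(f"support = {support:.2%}, confidence = {confidence:.2%}")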

  • Slide 8

    Apriori Algorithm

    Mining for associations among items in a large database of sales transactions is an important database mining function

    For example, the information that a customer who purchases a keyboard also tends to buy a mouse at the same time is represented in the association rule below:

    Keyboard ⇒ Mouse [support = 6%, confidence = 70%]

    Association Rules

    Based on the types of values, association rules can be classified into two categories: Boolean association rules and quantitative association rules

    Boolean association rule: Keyboard ⇒ Mouse [support = 6%, confidence = 70%]

    Quantitative association rule: (Age = 26...30) ⇒ (Cars = 1, 2) [support = 3%, confidence = 36%]

  • Slide 9

    Minimum Support Threshold

    The support of an association pattern is the percentage of task-relevant data transactions for which the pattern is true

    For a rule A ⇒ B:

    support(A ⇒ B) = P(A ∪ B)

    support(A ⇒ B) = (# tuples containing both A and B) / (total # of tuples)

    Minimum Confidence Threshold

    Confidence is defined as the measure of certainty or trustworthiness associated with each discovered pattern

    It is the probability of B given that all we know is A

    For a rule A ⇒ B:

    confidence(A ⇒ B) = P(B | A)

    confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)

  • Slide 10

    Itemset

    A set of items is referred to as an itemset

    An itemset containing k items is called a k-itemset

    An itemset can be seen as a conjunction of items (or a predicate)

    Frequent Itemset

    Suppose min_sup is the minimum support threshold

    An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to min_sup

    If an itemset satisfies minimum support, then it is a frequent itemset
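    A minimal sketch, in Python, of finding the frequent 1-itemsets for an absolute min_sup count; it is not from the slides, and it reuses the shoe-shop transactions from earlier.

        from collections import Counter

        # Count each item's occurrence frequency and keep those >= min_sup.
        transactions = [
            {"Shoes", "Socks", "Tie"},
            {"Shoes", "Socks", "Tie", "Belt", "Shirt"},
            {"Shoes", "Tie"},
            {"Shoes", "Socks", "Belt"},
        ]
        min_sup = 2  # absolute occurrence count

        counts = Counter(item for t in transactions for item in t)
        L1 = {item for item, c in counts.items() if c >= min_sup}
        print(L1)  # {'Shoes', 'Socks', 'Tie', 'Belt'} (set order may vary)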

  • Slide 11

    Strong Rules

    Rules that satisfy both a minimum support threshold and a minimum confidence threshold are called strong

    Association Rule Mining

    Find all frequent itemsets

    Generate strong association rules from the frequent itemsets

    The Apriori algorithm mines frequent itemsets for Boolean association rules

  • Slide 12

    Apriori Algorithm

    Level-wise search

    k-itemsets (itemsets with k items) are used to explore (k+1)-itemsets from transactional databases for Boolean association rules

    First, the set of frequent 1-itemsets is found (denoted L1)

    L1 is used to find L2, the set of frequent 2-itemsets; L2 is used to find L3, and so on, until no frequent k-itemsets can be found

    Finally, strong association rules are generated from the frequent itemsets
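    A minimal sketch, in Python, of the join step that generates candidate k-itemsets from Lk-1; it is not code from the slides, it assumes itemsets are kept as sorted tuples, and it uses the frequent 1-itemsets of the TDB trace below as input.

        # Join L(k-1) with itself: two (k-1)-itemsets are merged
        # when they agree on their first k-2 items.
        def apriori_gen(L_prev):
            L_prev = sorted(L_prev)
            candidates = []
            for i in range(len(L_prev)):
                for j in range(i + 1, len(L_prev)):
                    a, b = L_prev[i], L_prev[j]
                    if a[:-1] == b[:-1]:               # first k-2 items agree
                        candidates.append(a + b[-1:])  # join step
            return candidates

        # L1 = {A}, {B}, {C}, {E} from the example below:
        print(apriori_gen([("A",), ("B",), ("C",), ("E",)]))
        # [('A', 'B'), ('A', 'C'), ('A', 'E'), ('B', 'C'), ('B', 'E'), ('C', 'E')]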

    Example

    Database TDB (min_sup = 2):

    Tid    Items
    10     A, C, D
    20     B, C, E
    30     A, B, C, E
    40     B, E

    1st scan → C1:

    Itemset    sup
    {A}        2
    {B}        3
    {C}        3
    {D}        1
    {E}        3

    L1:

    Itemset    sup
    {A}        2
    {B}        3
    {C}        3
    {E}        3

    C2 (generated from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

    2nd scan → C2 with counts:

    Itemset    sup
    {A, B}     1
    {A, C}     2
    {A, E}     1
    {B, C}     2
    {B, E}     3
    {C, E}     2

    L2:

    Itemset    sup
    {A, C}     2
    {B, C}     2
    {B, E}     3
    {C, E}     2

    C3 (generated from L2): {B, C, E}

    3rd scan → L3:

    Itemset      sup
    {B, C, E}    2

  • Slide 13

    The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemsets

    It employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets

    Apriori Property

    The Apriori property is used to reduce the search space

    Apriori property: all nonempty subsets of a frequent itemset must also be frequent

    Anti-monotone in the sense that if a set cannot pass a test, all its supersets will fail the same test as well
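    A minimal sketch, in Python, of the prune step this property enables: a candidate is dropped as soon as one of its (k-1)-subsets is not frequent. The code is not from the slides, and the first candidate below is hypothetical, included only to show a pruning.

        from itertools import combinations

        # Keep only candidates whose (k-1)-subsets are all frequent.
        def prune(candidates, L_prev):
            L_prev = set(L_prev)
            return [c for c in candidates
                    if all(s in L_prev for s in combinations(c, len(c) - 1))]

        L2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
        candidates = [("A", "C", "E"), ("B", "C", "E")]  # ("A","C","E") is hypothetical
        print(prune(candidates, L2))
        # [('B', 'C', 'E')] -- ('A','C','E') is pruned: ('A','E') is not frequent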

  • Slide 14

    Apriori Property

    The Apriori property is used to reduce the search space, since finding each Lk requires one full scan of the database (Lk is the set of frequent k-itemsets)

    If an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent: P(I) < min_sup

    If an item A is added to the itemset I, the resulting itemset I ∪ {A} cannot occur more frequently than I; therefore I ∪ {A} is not frequent either: P(I ∪ {A}) < min_sup

    Scalable Methods for Mining Frequent Patterns

    The downward closure property of frequent patterns: any subset of a frequent itemset must be frequent

    If {beer, diaper, nuts} is frequent, so is {beer, diaper}

    i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

    Scalable mining methods: three major approaches

    Apriori (Agrawal & Srikant @ VLDB'94)

    Frequent pattern growth (FP-growth: Han, Pei & Yin @ SIGMOD'00)

    Vertical data format approach (Charm: Zaki & Hsiao @ SDM'02)

  • Slide 15

    Algorithm

    1. Scan the (entire) transaction database to get the support S of each 1-itemset, compare S with min_sup, and get the set of frequent 1-itemsets, L1

    2. Join Lk-1 with Lk-1 to generate a set of candidate k-itemsets, and use the Apriori property to prune the infrequent k-itemsets from this set

    3. Scan the transaction database to get the support S of each remaining candidate k-itemset, compare S with min_sup, and get the set of frequent k-itemsets, Lk

    4. If the candidate set is not empty, go to step 2

    5. For each frequent itemset l, generate all nonempty subsets of l

    6. For every nonempty subset s of l, output the rule s ⇒ (l − s) if its confidence C ≥ min_conf

    Example: for I = {A1, A2, A5}, the candidate rules s ⇒ (I − s) are:

    A1 ⇒ A2 ∧ A5

    A2 ⇒ A1 ∧ A5

    A5 ⇒ A1 ∧ A2

    A1 ∧ A2 ⇒ A5

    A1 ∧ A5 ⇒ A2

    A2 ∧ A5 ⇒ A1
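    A minimal sketch, in Python, of steps 5 and 6 for a single frequent itemset; it is not code from the slides, and the support counts below are made-up numbers used only to exercise the function.

        from itertools import combinations

        # For a frequent itemset l, emit every rule s => (l - s) with
        # confidence support(l) / support(s) >= min_conf.
        def gen_rules(l, support, min_conf):
            rules = []
            for size in range(1, len(l)):
                for s in map(frozenset, combinations(l, size)):
                    conf = support[l] / support[s]
                    if conf >= min_conf:
                        rules.append((set(s), set(l - s), conf))
            return rules

        support = {  # hypothetical support counts
            frozenset({"A1"}): 6, frozenset({"A2"}): 7, frozenset({"A5"}): 2,
            frozenset({"A1", "A2"}): 4, frozenset({"A1", "A5"}): 2,
            frozenset({"A2", "A5"}): 2, frozenset({"A1", "A2", "A5"}): 2,
        }
        for s, rhs, conf in gen_rules(frozenset({"A1", "A2", "A5"}), support, 0.7):
            print(s, "=>", rhs, f"(confidence {conf:.0%})")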

  • Slide 16

    Example

    Five transactions from a supermarket (diaper = "fralda" in Portuguese):

    TID    List of Items
    1      Beer, Diaper, Baby Powder, Bread, Umbrella
    2      Diaper, Baby Powder
    3      Beer, Diaper, Milk
    4      Diaper, Beer, Detergent
    5      Beer, Milk, Coca-Cola

    Step 1

    min_sup = 40% (2/5)

    C1:

    Item           Support
    Beer           4/5
    Diaper         4/5
    Baby Powder    2/5
    Bread          1/5
    Umbrella       1/5
    Milk           2/5
    Detergent      1/5
    Coca-Cola      1/5

    L1:

    Item           Support
    Beer           4/5
    Diaper         4/5
    Baby Powder    2/5
    Milk           2/5
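    A minimal sketch, in Python, reproducing Step 1 on this dataset, with support expressed as a fraction of the five transactions; it is not part of the slides.

        from collections import Counter

        # Count each item and keep those with support >= 40%.
        transactions = [
            {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
            {"Diaper", "Baby Powder"},
            {"Beer", "Diaper", "Milk"},
            {"Diaper", "Beer", "Detergent"},
            {"Beer", "Milk", "Coca-Cola"},
        ]
        min_sup = 0.40
        n = len(transactions)

        counts = Counter(item for t in transactions for item in t)
        L1 = {item: c / n for item, c in counts.items() if c / n >= min_sup}
        print(L1)  # Beer 0.8, Diaper 0.8, Baby Powder 0.4, Milk 0.4 (order may vary)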

  • Slide 17

    Step 2 and Step 3

    min_sup = 40% (2/5)

    C2:

    Item                   Support
    Beer, Diaper           3/5
    Beer, Baby Powder      1/5
    Beer, Milk             2/5
    Diaper, Baby Powder    2/5
    Diaper, Milk           1/5
    Baby Powder, Milk      0

    L2:

    Item                   Support
    Beer, Diaper           3/5
    Beer, Milk             2/5
    Diaper, Baby Powder    2/5

    Step 4

    C3 candidates and their supports:

    Item                         Support
    Beer, Diaper, Baby Powder    1/5
    Beer, Diaper, Milk           1/5
    Beer, Milk, Baby Powder      0
    Diaper, Baby Powder, Milk    0

    No 3-itemset reaches min_sup, so L3 is empty and candidate generation stops
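    A minimal sketch, in Python, of how the C2 counts above are obtained by testing each candidate pair against each transaction; it is not from the slides.

        from itertools import combinations
        from collections import Counter

        transactions = [
            {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
            {"Diaper", "Baby Powder"},
            {"Beer", "Diaper", "Milk"},
            {"Diaper", "Beer", "Detergent"},
            {"Beer", "Milk", "Coca-Cola"},
        ]
        L1 = ["Baby Powder", "Beer", "Diaper", "Milk"]  # frequent items from Step 1
        C2 = [frozenset(p) for p in combinations(L1, 2)]

        # Count how often each candidate pair occurs in a transaction.
        counts = Counter()
        for t in transactions:
            for pair in C2:
                if pair <= t:
                    counts[pair] += 1

        L2 = {pair: c for pair, c in counts.items() if c / len(transactions) >= 0.40}
        for pair, c in L2.items():
            print(set(pair), f"{c}/5")  # the three frequent pairs, each with its count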

  • Slide 18

    Step 5

    min_sup = 40%, min_conf = 70%

    Item (A, B)            Support(A,B)    Support(A)    Confidence
    Beer, Diaper           60%             80%           75%
    Beer, Milk             40%             80%           50%
    Diaper, Baby Powder    40%             80%           50%
    Diaper, Beer           60%             80%           75%
    Milk, Beer             40%             40%           100%
    Baby Powder, Diaper    40%             40%           100%

    Results (rules whose confidence reaches min_conf = 70%):

    Beer ⇒ Diaper (support 60%, confidence 75%)

    Diaper ⇒ Beer (support 60%, confidence 75%)

    Milk ⇒ Beer (support 40%, confidence 100%)

    Baby Powder ⇒ Diaper (support 40%, confidence 100%)
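    A minimal sketch, in Python, showing why Milk ⇒ Beer is strong while the reverse rule is not: both share the same support, but confidence divides by different antecedent counts. The code is not from the slides.

        transactions = [
            {"Beer", "Diaper", "Baby Powder", "Bread", "Umbrella"},
            {"Diaper", "Baby Powder"},
            {"Beer", "Diaper", "Milk"},
            {"Diaper", "Beer", "Detergent"},
            {"Beer", "Milk", "Coca-Cola"},
        ]

        # confidence(A => B) = P(B | A) over the transactions above
        def confidence(antecedent, consequent):
            n_ante = sum(1 for t in transactions if antecedent <= t)
            n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
            return n_both / n_ante

        print(confidence({"Milk"}, {"Beer"}))  # 1.0 -> Milk => Beer is strong
        print(confidence({"Beer"}, {"Milk"}))  # 0.5 -> Beer => Milk fails min_conf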

  • Slide 19

    Interpretation

    Some results are believable, like Baby Powder ⇒ Diaper

    Some rules need additional analysis, like Milk ⇒ Beer

    Some rules are unbelievable, like Diaper ⇒ Beer

    This example could contain spurious results because the dataset is so small

    Machine Learning Overview

    Sales Transaction and Association Rules

    Apriori Algorithm

    Example

  • Slide 20

    How to make Apriori faster?

    FP-growth