Association Rule Mining


  • Slide 1/19

    Association Rule Mining

    Mining Association Rules in Large Databases

    Association rule mining

    Algorithms: Apriori and FP-Growth

    Max and closed patterns

    Mining various kinds of association/correlation rules

  • Slide 2/19

    Max-patterns & Closed Patterns

    If there are frequent patterns with many items, enumerating all of them is costly.

    We may instead be interested in finding only the boundary frequent patterns.

    Two types: max-patterns and closed patterns.

    A frequent pattern {a1, ..., a100} contains $\binom{100}{1} + \binom{100}{2} + \cdots + \binom{100}{100} = 2^{100} - 1 \approx 1.27 \times 10^{30}$ frequent sub-patterns!

    Max-pattern: a frequent pattern without a proper frequent super-pattern.

    In the table below (Min_sup = 2), BCDE and ACD are max-patterns; BCD is not a max-pattern.

    Tid | Items
    10  | A, B, C, D, E
    20  | B, C, D, E
    30  | A, C, D, F

    Min_sup = 2
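    To make the definition concrete, here is a brute-force sketch (plain Python of my own, not from the slides; the helper names are made up) that enumerates the frequent itemsets of this table and keeps those with no frequent proper superset:

        from itertools import combinations

        def frequent_itemsets(transactions, min_sup):
            # Enumerate every itemset over the observed items; keep the frequent ones.
            items = sorted(set().union(*transactions))
            freq = {}
            for k in range(1, len(items) + 1):
                for cand in combinations(items, k):
                    s = frozenset(cand)
                    support = sum(1 for t in transactions if s <= t)
                    if support >= min_sup:
                        freq[s] = support
            return freq

        def max_patterns(freq):
            # A max-pattern is frequent and has no proper frequent super-pattern.
            return [s for s in freq if not any(s < t for t in freq)]

        db = [{'A', 'B', 'C', 'D', 'E'}, {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}]
        freq = frequent_itemsets(db, min_sup=2)
        print(sorted(''.join(sorted(s)) for s in max_patterns(freq)))  # ['ACD', 'BCDE']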

  • Slide 3/19

    Maximal Frequent Itemset

    An itemset is maximal frequent if none of its immediate supersets is frequent.

    (Figure: the itemset lattice, with the border separating the maximal frequent itemsets from the infrequent itemsets.)

    Closed Itemset

    An itemset is closed if none of its immediate supersets has the same support as the itemset.

    TID | Items
    1   | {A,B}
    2   | {B,C,D}
    3   | {A,B,C,D}
    4   | {A,B,D}
    5   | {A,B,C,D}

    Itemset | Support        Itemset   | Support
    {A}     | 4              {A,B,C}   | 2
    {B}     | 5              {A,B,D}   | 3
    {C}     | 3              {A,C,D}   | 2
    {D}     | 4              {B,C,D}   | 3
    {A,B}   | 4              {A,B,C,D} | 2
    {A,C}   | 2
    {A,D}   | 3
    {B,C}   | 3
    {B,D}   | 4
    {C,D}   | 3
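    The two definitions can be checked mechanically. The following brute-force sketch (my own illustration, not slide material; it assumes Min_sup = 2, which the slide does not state explicitly) computes both sets for the table above:

        from itertools import combinations

        def itemset_supports(transactions):
            # Support of every itemset over the observed items
            # (exponential, but fine for a 4-item example).
            items = sorted(set().union(*transactions))
            return {frozenset(c): sum(1 for t in transactions if set(c) <= t)
                    for k in range(1, len(items) + 1)
                    for c in combinations(items, k)}

        db = [{'A', 'B'}, {'B', 'C', 'D'}, {'A', 'B', 'C', 'D'},
              {'A', 'B', 'D'}, {'A', 'B', 'C', 'D'}]
        sup = itemset_supports(db)
        min_sup = 2  # assumed threshold
        freq = {s for s, c in sup.items() if c >= min_sup}

        # Maximal frequent: no frequent proper superset.
        maximal = [s for s in freq if not any(s < t for t in freq)]
        # Closed: no proper superset with the same support.
        closed = [s for s in freq
                  if not any(s < t and sup[t] == sup[s] for t in sup)]

        print([''.join(sorted(s)) for s in maximal])  # ['ABCD']
        print(sorted(''.join(sorted(s)) for s in closed))
        # ['AB', 'ABCD', 'ABD', 'B', 'BCD', 'BD']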


  • Slide 5/19

    Maximal vs Closed Itemsets

    MaxMiner: Mining Max-patterns

    Idea: generate the complete set-enumeration tree one level at a time, while pruning where applicable.

    Set-enumeration tree over {A, B, C, D} (each node shows an itemset, with the items that may still be appended in parentheses):

    {} (ABCD)
    A (BCD)   B (CD)   C (D)   D ()
    AB (CD)   AC (D)   AD ()   BC (D)   BD ()   CD ()
    ABC (D)   ABD ()   ACD ()   BCD ()
    ABCD ()

  • Slide 6/19

    Local Pruning Techniques (e.g. at node A)

    Check the frequency of ABCD and of AB, AC, AD.

    If ABCD is frequent, prune the whole sub-tree.

    If AC is NOT frequent, remove C from the parentheses before expanding.

    {} (ABCD)
    A (BCD)   B (CD)   C (D)   D ()
    AB (CD)   AC (D)   AD ()   BC (D)   BD ()   CD ()
    ABC (D)   ABD ()   ACD ()   BCD ()
    ABCD ()

    Algorithm MaxMiner

    Initially, generate one node N = ∅, where h(N) = ∅ and t(N) = {A, B, C, D}.

    Consider expanding N:

    If h(N) ∪ t(N) is frequent, do not expand N.

    If for some i ∈ t(N), h(N) ∪ {i} is NOT frequent, remove i from t(N) before expanding N.

    Apply global pruning techniques.

    {} (ABCD)

  • Slide 7/19

    Global Pruning Technique (across sub-trees)

    When a max-pattern is identified (e.g. ABCD), prune all nodes N (e.g. B, C and D) where h(N) ∪ t(N) is a subset of it (e.g. of ABCD).

    {} (ABCD)
    A (BCD)   B (CD)   C (D)   D ()
    AB (CD)   AC (D)   AD ()   BC (D)   BD ()   CD ()
    ABC (D)   ABD ()   ACD ()   BCD ()
    ABCD ()
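    Putting the expansion rule and both pruning techniques together, a compact sketch (plain Python of my own; brute-force support counting, so this is a simplification rather than Bayardo's full MaxMiner) looks like this:

        def maxminer_sketch(transactions, items, min_sup):
            # Brute-force support counting; real MaxMiner counts more cleverly.
            def support(itemset):
                return sum(1 for t in transactions if itemset <= t)

            found = []

            def expand(head, tail):
                hw = frozenset(head | set(tail))
                # Global pruning: a known max-pattern already subsumes h(N) ∪ t(N).
                if any(hw <= m for m in found):
                    return
                # Local pruning 1: if h(N) ∪ t(N) is frequent, record it and
                # prune the whole sub-tree below this node.
                if support(hw) >= min_sup:
                    found.append(hw)
                    return
                # Local pruning 2: drop tail items i with h(N) ∪ {i} infrequent.
                tail = [i for i in tail if support(head | {i}) >= min_sup]
                if not tail:
                    if head:
                        found.append(frozenset(head))
                    return
                for k, i in enumerate(tail):
                    expand(head | {i}, tail[k + 1:])

            expand(frozenset(), list(items))
            # Keep only patterns without a frequent proper superset.
            uniq = set(found)
            return [m for m in uniq if not any(m < n for n in uniq)]

        db = [{'A', 'B', 'C', 'D', 'E'}, {'B', 'C', 'D', 'E'}, {'A', 'C', 'D', 'F'}]
        for p in maxminer_sketch(db, 'ABCDEF', min_sup=2):
            print(''.join(sorted(p)))  # ACD and BCDE

    Run on the three-transaction table from the slides, this reproduces the worked example that follows.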

    Example (root node)

    Tid | Items
    10  | A, B, C, D, E
    20  | B, C, D, E
    30  | A, C, D, F

    Min_sup = 2

    Items  | Frequency
    A      | 2
    B      | 2
    C      | 3
    D      | 3
    E      | 2
    F      | 1
    ABCDEF | 0

    ABCDEF is infrequent, and F is infrequent, so F is removed from the tail before expanding the root:

    A (BCDE)   B (CDE)   C (DE)   D (E)   E ()

    Max patterns: (none yet)

  • Slide 8/19

    Example (node A)

    Tid | Items
    10  | A, B, C, D, E
    20  | B, C, D, E
    30  | A, C, D, F

    Min_sup = 2

    Items | Frequency
    ABCDE | 1
    AB    | 1
    AC    | 2
    AD    | 2
    AE    | 1

    ABCDE (= h(N) ∪ t(N) at node A) is infrequent, so node A must be expanded; AB and AE are infrequent, so B and E are removed from A's tail:

    A (BCDE)   B (CDE)   C (DE)   D (E)   E ()
    AC (D)   AD ()

    Max patterns: (none yet)

    Example (node B)

    Tid | Items
    10  | A, B, C, D, E
    20  | B, C, D, E
    30  | A, C, D, F

    Min_sup = 2

    Items | Frequency
    BCDE  | 2

    BCDE (= h(N) ∪ t(N) at node B) is frequent, so BCDE is recorded as a max-pattern and the whole sub-tree under B is pruned (the extensions BC, BD, BE need not be expanded).

    A (BCDE)   B (CDE)   C (DE)   D (E)   E ()
    AC (D)   AD ()

    Max patterns: BCDE

  • Slide 9/19

    Example (node AC)

    Tid | Items
    10  | A, B, C, D, E
    20  | B, C, D, E
    30  | A, C, D, F

    Min_sup = 2

    Items | Frequency
    ACD   | 2

    ACD (= h(N) ∪ t(N) at node AC) is frequent, so ACD is recorded as a max-pattern and the sub-tree is pruned.

    A (BCDE)   B (CDE)   C (DE)   D (E)   E ()
    AC (D)   AD ()

    Max patterns: BCDE, ACD

    Frequent Closed Patterns

    For a frequent itemset X, if there exists no item y (y ∉ X) such that every transaction containing X also contains y, then X is a frequent closed pattern.

    In the table below, ab is a frequent closed pattern.

    Closed patterns are a concise representation of frequent patterns: they reduce the number of patterns and rules (N. Pasquier et al., ICDT'99).

    TID | Items
    10  | a, b, c
    20  | a, b, c
    30  | a, b, d
    40  | a, b, d
    50  | e, f

    Min_sup = 2
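    As a quick check of the definition on this table (a tiny brute-force sketch of mine; the condition "some extra item y occurs in every transaction containing X" is tested directly):

        def is_closed(X, transactions):
            # X is closed iff no extra item occurs in every transaction containing X.
            covering = [t for t in transactions if X <= t]
            extras = set.intersection(*map(set, covering)) - X
            return not extras

        db = [{'a', 'b', 'c'}, {'a', 'b', 'c'}, {'a', 'b', 'd'},
              {'a', 'b', 'd'}, {'e', 'f'}]
        print(is_closed({'a', 'b'}, db))  # True: ab is a closed pattern
        print(is_closed({'a'}, db))       # False: every transaction with a also has b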

  • Slide 10/19

    Max Pattern vs. Frequent Closed Pattern

    Max pattern ⇒ closed pattern:

    If itemset X is a max pattern, adding any item to it would not give a frequent pattern; thus there exists no item y such that every transaction containing X also contains y.

    Closed pattern ⇏ max pattern:

    In the table below, ab is a closed pattern, but not max (abc and abd are frequent supersets).

    TID | Items
    10  | a, b, c
    20  | a, b, c
    30  | a, b, d
    40  | a, b, d
    50  | e, f

    Min_sup = 2

    Mining Frequent Closed Patterns: CLOSET

    Flist: list of all frequent items in support-ascending order.

    Here Flist: d-a-f-e-c.

    Divide the search space:

    Patterns having d

    Patterns having a but not d, etc.

    Find frequent closed patterns recursively:

    Among the transactions having d, cfa is frequent closed,

    so cfad is a frequent closed pattern.

    J. Pei, J. Han & R. Mao, "CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets", DMKD'00.

    TID | Items
    10  | a, c, d, e, f
    20  | a, b, e
    30  | c, e, f
    40  | a, c, d, f
    50  | c, e, f

    Min_sup = 2
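    The divide-and-conquer scheme can be sketched as follows (an illustrative simplification of my own: explicit conditional databases and a subsumption check stand in for CLOSET's FP-tree representation and pruning optimizations):

        from collections import Counter

        def closet_sketch(transactions, min_sup):
            # Illustrative divide-and-conquer closed-pattern miner.
            closed = {}  # pattern (frozenset) -> support

            def record(pattern, support):
                # Subsumption check: a pattern subsumed by a known closed
                # pattern with the same support is not closed itself.
                for p, s in closed.items():
                    if s == support and pattern <= p:
                        return
                closed[pattern] = support

            def mine(db, prefix):
                support = len(db)
                if support < min_sup:
                    return
                counts = Counter(i for t in db for i in t)
                # Items present in every transaction of db always accompany
                # the prefix, so they belong to its closure.
                always = frozenset(i for i, c in counts.items() if c == support)
                pattern = prefix | always
                if pattern:
                    record(pattern, support)
                # f-list: remaining frequent items, support-ascending.
                flist = sorted((i for i, c in counts.items()
                                if min_sup <= c < support),
                               key=lambda i: (counts[i], i))
                # Partition: patterns with flist[k] but none of flist[:k].
                for k, x in enumerate(flist):
                    tail = set(flist[k + 1:])
                    cond = [t & tail for t in db if x in t]
                    mine(cond, pattern | {x})

            mine([frozenset(t) for t in transactions], frozenset())
            return {''.join(sorted(p)): s for p, s in closed.items()}

        db = [{'a', 'c', 'd', 'e', 'f'}, {'a', 'b', 'e'}, {'c', 'e', 'f'},
              {'a', 'c', 'd', 'f'}, {'c', 'e', 'f'}]
        print(closet_sketch(db, 2))
        # {'acdf': 2, 'a': 3, 'ae': 2, 'cf': 4, 'cef': 3, 'e': 4} (order may vary)

    On the slide's dataset this finds cfad (= acdf) with support 2, as claimed; processing the f-list in ascending order ensures the subsuming closed patterns are found before their non-closed subsets.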


  • Slide 12/19

    Multi-level Association: Uniform Support vs. Reduced Support

    Uniform Support: the same minimum support for all levels.

    + One minimum support threshold; no need to examine itemsets containing any item whose ancestors do not have minimum support.

    - Lower-level items do not occur as frequently, so if the support threshold is
      too high, we miss low-level associations;
      too low, we generate too many high-level associations.

    Multi-level Association: Uniform Support vs. Reduced Support

    Reduced Support: reduced minimum support at lower levels. There are 4 search strategies:

    Level-by-level independent: independent search at all levels (no misses).

    Level-cross filtering by k-itemset: prune a k-pattern if the corresponding k-pattern at the upper level is infrequent.

    Level-cross filtering by single item: prune an item if its parent node is infrequent.

    Controlled level-cross filtering by single item: consider subfrequent items that pass a passage threshold.

  • Slide 13/19

    Uniform Support

    Multi-level mining with uniform support:

    Level 1 (min_sup = 5%):  Milk [support = 10%]
    Level 2 (min_sup = 5%):  full fat Milk [support = 6%]   Skim Milk [support = 4%]  <- X: pruned (4% < 5%)

    Reduced Support

    Multi-level mining with reduced support:

    Level 1 (min_sup = 5%):  Milk [support = 10%]
    Level 2 (min_sup = 3%):  full fat Milk [support = 6%]   Skim Milk [support = 4%]  (both kept)
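    A minimal sketch of the two policies (Python; the hierarchy, supports and thresholds come from the figures above, the rest is my own scaffolding):

        # Each node is a path from the top of the concept hierarchy.
        supports = {('Milk',): 0.10,
                    ('Milk', 'full fat Milk'): 0.06,
                    ('Milk', 'Skim Milk'): 0.04}

        def frequent_nodes(supports, min_sup_per_level):
            # Keep a node only if it meets its level's threshold and its
            # parent was kept (no need to examine items under pruned ancestors).
            kept = set()
            for path, sup in sorted(supports.items(), key=lambda kv: len(kv[0])):
                level = len(path)
                if sup >= min_sup_per_level[level] and (level == 1 or path[:-1] in kept):
                    kept.add(path)
            return kept

        print(frequent_nodes(supports, {1: 0.05, 2: 0.05}))  # uniform: Skim Milk pruned
        print(frequent_nodes(supports, {1: 0.05, 2: 0.03}))  # reduced: all three kept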

  • Slide 14/19

    Pattern Evaluation

    Association rule algorithms tend to produce too many rules;

    many of them are uninteresting or redundant.

    Redundant, e.g., if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence.

    Interestingness measures can be used to prune/rank the derived patterns.

    In the original formulation of association rules, support & confidence are the only measures used.

    Computing Interestingness Measures

    Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table.

    Contingency table for X → Y:

              Y     ¬Y
    X         f11   f10   f1+
    ¬X        f01   f00   f0+
              f+1   f+0   |T|

    f11: support of X and Y
    f10: support of X and ¬Y
    f01: support of ¬X and Y
    f00: support of ¬X and ¬Y

    These counts are used to define various measures: support, confidence, lift, Gini, J-measure, etc.
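    For concreteness, here is a small helper of mine (not slide material) computing several of these measures from the four cells; the definitions match the formulas given on the following slides:

        def measures(f11, f10, f01, f00):
            # Joint and marginal probabilities from the contingency table.
            n = f11 + f10 + f01 + f00
            p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
            return {
                'support':    p_xy,
                'confidence': p_xy / p_x,                # P(Y|X)
                'lift':       (p_xy / p_x) / p_y,        # P(Y|X) / P(Y)
                'interest':   p_xy / (p_x * p_y),        # P(X,Y) / (P(X) P(Y))
                'PS':         p_xy - p_x * p_y,          # Piatetsky-Shapiro
                'phi':        (p_xy - p_x * p_y)
                              / (p_x * (1 - p_x) * p_y * (1 - p_y)) ** 0.5,
            }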

  • Slide 15/19

    Drawback of Confidence

              Coffee   ¬Coffee
    Tea       15       5         20
    ¬Tea      75       5         80
              90       10        100

    Association rule: Tea → Coffee

    Confidence = P(Coffee | Tea) = 15/20 = 0.75

    but P(Coffee) = 0.9.

    Although confidence is high, the rule is misleading:

    P(Coffee | ¬Tea) = 75/80 = 0.9375

    Statistical Independence

    Population of 1000 students

    600 students know how to swim (S)

    700 students know how to bike (B)

    420 students know how to swim and bike (S,B)

    P(S,B) = 420/1000 = 0.42

    P(S) × P(B) = 0.6 × 0.7 = 0.42

    P(S,B) = P(S) × P(B) => statistical independence

    P(S,B) > P(S) × P(B) => positively correlated

    P(S,B) < P(S) × P(B) => negatively correlated

  • Slide 16/19

    Statistical-based Measures

    Measures that take statistical dependence into account:

    $$\mathrm{Lift} = \frac{P(Y \mid X)}{P(Y)}$$

    $$\mathrm{Interest} = \frac{P(X,Y)}{P(X)\,P(Y)}$$

    $$PS = P(X,Y) - P(X)\,P(Y)$$

    $$\phi\text{-coefficient} = \frac{P(X,Y) - P(X)\,P(Y)}{\sqrt{P(X)\,[1 - P(X)]\;P(Y)\,[1 - P(Y)]}}$$

    Example: Lift/Interest

              Coffee   ¬Coffee
    Tea       15       5         20
    ¬Tea      75       5         80
              90       10        100

    Association rule: Tea → Coffee

    Confidence = P(Coffee | Tea) = 0.75

    but P(Coffee) = 0.9.

    Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
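    Feeding the tea/coffee table into the measures helper sketched earlier reproduces these numbers:

        print(measures(f11=15, f10=5, f01=75, f00=5))
        # confidence = 0.75, lift = interest ≈ 0.8333, PS = 0.15 - 0.2*0.9 = -0.03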

  • Slide 17/19

    Drawback of Lift & Interest

         Y    ¬Y            |          Y    ¬Y
    X    10   0    10       |    X    90   0    90
    ¬X   0    90   90       |    ¬X   0    10   10
         10   90   100      |         90   10   100

    $$\mathrm{Lift} = \frac{0.1}{(0.1)(0.1)} = 10 \qquad\qquad \mathrm{Lift} = \frac{0.9}{(0.9)(0.9)} = 1.11$$

    Statistical independence: if P(X,Y) = P(X)P(Y), then Lift = 1.
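    Both tables encode a perfect implication P(Y|X) = 1, yet lift differs wildly because it also depends on P(Y); the measures helper sketched earlier confirms this:

        print(measures(10, 0, 0, 90)['lift'])  # 10.0
        print(measures(90, 0, 0, 10)['lift'])  # 1.111...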

    There are lots of measures proposed in the literature.

    Some measures are good for certain applications, but not for others.

    What criteria should we use to determine whether a measure is good or bad?

    What about Apriori-style support-based pruning? How does it affect these measures?

  • Slide 18/19

    Properties of a Good Measure

    Piatetsky-Shapiro: 3 properties a good measure M must satisfy:

    M(A,B) = 0 if A and B are statistically independent.

    M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged.

    M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged.
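    As a quick sanity check (my own verification, not from the slides), the PS measure defined earlier satisfies all three properties:

        $$PS(A,B) = P(A,B) - P(A)\,P(B)$$

        $$\text{P1: } P(A,B) = P(A)P(B) \;\Rightarrow\; PS = 0$$

        $$\text{P2: } \frac{\partial\, PS}{\partial P(A,B)} = 1 > 0 \quad (\text{increasing in } P(A,B))$$

        $$\text{P3: } \frac{\partial\, PS}{\partial P(A)} = -P(B) < 0 \quad (\text{decreasing in } P(A) \text{ when } P(B) > 0)$$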

    Comparing Different Measures

    10 examples of contingency tables:

    Example  f11   f10   f01   f00
    E1       8123  83    424   1370
    E2       8330  2     622   1046
    E3       9481  94    127   298
    E4       3954  3080  5     2961
    E5       2886  1363  1320  4431
    E6       1500  2000  500   6000
    E7       4000  2000  1000  3000
    E8       4000  2000  2000  2000
    E9       1720  7121  5     1154
    E10      61    2483  4     7452

    Rankings of the contingency tables using various measures: (figure not reproduced)

  • Slide 19/19

    Property under Variable Permutation

         B    ¬B             A    ¬A
    A    p    q         B    p    r
    ¬A   r    s         ¬B   q    s

    Does M(A,B) = M(B,A)?

    Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.

    Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.