data mining: association rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11....

21
Data Mining: Association Rules Jef Wijsen Universit´ e de Mons (UMONS) J. Wijsen Association Rules 1 / 21

Upload: others

Post on 15-Mar-2021

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Data Mining: Association Rules

Jef Wijsen

Universite de Mons (UMONS)

J. Wijsen Association Rules 1 / 21

Page 2: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

La BD et les regles

1 Des items {A,B,C ,D, . . .}.2 Tout ensemble fini d’items est appele un itemset ou une transaction.

3 La BD est un ensemble (plutot un bag) de transactions.

Par ex.,{B,C ,D,E}{A,B,C}{A,C ,D}{A,B,C ,D}

Une regle d’association est une expression de la forme X ⇒ Y ou X et Ysont des itemsets disjoints. Par ex., {B,C} ⇒ {D} ou B,C ⇒ D.

J. Wijsen Association Rules 2 / 21

Page 3: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Le probleme

Le support d’une regle X ⇒ Y est s si s% pourcent des transactionscontiennent tous les items en X ∪ Y .

La confiance d’une regle X ⇒ Y est c si c% des transactions quicontiennent X , contiennent aussi Y .

Normalement, on utilise une fraction entre 0 et 1 au lieu d’unpourcentage.

Dans l’exemple, la regle B,C ⇒ D a un support de 2/4 et uneconfiance de 2/3.

Soient donnes un seuil de support (souvent ≤ 0.1) et un seuil deconfiance (plutot proche de 1.0), trouver toutes les regles quidepassent ces seuils.

J. Wijsen Association Rules 3 / 21

Page 4: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Fichier ; BD de transactions

outlook temperature humidity windy play

sunny hot high false nosunny hot high true no

. . .;

{outlook=sunny,temperature=hot,humidity=high,windy=false,play=no}{outlook=sunny,temperature=hot,humidity=high,windy=true,play=no}

“outlook=sunny”, “temperature=hot” jouent le role d’items.

Il n’y a pas d’attribut cible !

J. Wijsen Association Rules 4 / 21

Page 5: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Exercice

Creer un fichier paniers.arff pour stocker le contenu du “panier dela menagere” des clients qui passent a la caisse.

Decouvrir des regles d’association en paniers.arff.

Comprendre les messages de type “Size of set of large

itemsets L(2): 26”

J. Wijsen Association Rules 5 / 21

Page 6: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Attention : Confiance 6= Correlation

Consider “basket⇒ corn flakes” and suppose there are 5000 students asfollows :

3000 students play basketball ;

3750 students eat corn flakes ;

2000 students both play basketball and eat corn flakes.

The rule “basket⇒ corn flakes” has a confidence of 20003000 = 2/3, but note :

Pr(basket ∧ corn flakes)

Pr(basket)Pr(corn flakes)= 0.889 < 1 .

J. Wijsen Association Rules 6 / 21

Page 7: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Exercice

Pour l’ensemble

{AB,ACDE ,BCDF ,ABCD,ABCF},

trouvez tout sous-ensemble de ABCDEF avec un support count ≥ 2.

Le support count de X , denote σ(X ), est le nombre de transactionsqui incluent X . Par exemple, σ(BC ) = 3.

Notez :I le support de X ⇒ Y est egal a σ(XY )

N avec N le nombre detransactions (a multiplier par 100 pour obtenir un pourcentage) ;

I la confiance de X ⇒ Y est egale a σ(XY )σ(X ) .

J. Wijsen Association Rules 7 / 21

Page 8: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Apriori : elagage

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 8 / 21

Page 9: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Apriori : k-itemsets inclus dans une transaction

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 9 / 21

Page 10: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Apriori : candidats “haches” et comptage

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 10 / 21

Page 11: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Apriori : candidats “haches” et comptage

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 11 / 21

Page 12: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Apriori : validation experimentale

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 12 / 21

Page 13: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Apriori : construire les regles

J. Wijsen Association Rules 13 / 21

Page 14: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Itemsets frequents maximaux

J. Wijsen Association Rules 14 / 21

Page 15: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Itemsets frequents fermes

J. Wijsen Association Rules 15 / 21

Page 16: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Parcours en largeur ou en profondeur

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 16 / 21

Page 17: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Parcours en profondeur

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 17 / 21

Page 18: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Representations dites “horizontale” et “verticale”

Source: Pang-Ning Tan, Michael Steinbach, and Vipin Kumar: Introduction toData Mining. Addison Wesley, 2006

J. Wijsen Association Rules 18 / 21

Page 19: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Calcul recursif de cover

Pour chaque enfant x · s de s, cover(x · s) est l’intersection de cover(s)avec le cover du preceding-sibling de s qui contient x .

cover(a · d) = cover(d) ∩ cover(a)cover(b · d) = cover(d) ∩ cover(b)cover(c · d) = cover(d) ∩ cover(c)

cover(a · bd) = cover(b · d) ∩ cover(a · d)cover(a · cd) = cover(c · d) ∩ cover(a · d)cover(b · cd) = cover(c · d) ∩ cover(b · d)

cover(a · bcd) = cover(b · cd) ∩ cover(a · cd)

J. Wijsen Association Rules 19 / 21

Page 20: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

FP-tree : une structure de donnees pour calculer lesintersections en RAM (Random Access Memory)

Parmi les 8 transactions qui contiennent a, il y en a 5 qui contiennentaussi b. Parmi ces 5 transactions qui incluent ab, il y en a 3 quicontiennent aussi c . Etc.

J. Wijsen Association Rules 20 / 21

Page 21: Data Mining: Association Rulesinformatique.umons.ac.be/ssi/teaching/dwdm/ass.pdf · 2018. 11. 26. · Data Mining: Association Rules Author: Jef Wijsen Created Date: 11/26/2018 2:24:06

Exemple : les transactions avec suffixe e

J. Wijsen Association Rules 21 / 21