ECML / PKDD 2004 Discovery Challenge

Page 1: ECML / PKDD 2004 Discovery Challenge

ECML / PKDD 2004 Discovery Challenge

Mining Strong Associations and Exceptions in the STULONG Data Set

Eduardo Corrêa Gonçalves and Alexandre Plastino*

*work sponsored by CNPq research grant 300879/00-8

Universidade Federal Fluminense

Department of Computer Science Niterói, Rio de Janeiro, Brazil

{egoncalves,plastino}@ic.uff.br - http://www.ic.uff.br

Page 2: ECML / PKDD 2004 Discovery Challenge

Outline of the talk

1. Atherosclerosis Data Set

2. Multidimensional Association Rules

3. Exceptions

4. Data Preparation

5. Results

6. Summary

Page 3: ECML / PKDD 2004 Discovery Challenge

Atherosclerosis Data Set

STULONG Data Set: risk factors of atherosclerosis in a population of 1417 middle-aged men from the Czech Republic.

Four tables are included in this data set:

Entry: data related to entry examinations performed on these men (the first step of the STULONG project).

Control: data related to long-term observations.

Letter: additional information about the health status of 403 men.

Death: data related to the patients who died.

Page 4: ECML / PKDD 2004 Discovery Challenge

Basic Groups of Patients

The patients were classified into three basic groups, according to the results of the entry examinations:

A. Normal Group : men without the presence of any risk factor.

B. Risk Group : men with the presence of one or more risk factors.

C. Pathologic Group : men with either an identified cardiovascular disease or other serious disease.

Page 5: ECML / PKDD 2004 Discovery Challenge

Contribution

The main contribution of this work is to present strong association rules and exceptions mined from the Entry table.

The mining process was directed toward discovering relations among the following characteristics of the patients in the basic groups:

Social factors.

Physical activities during free time.

Alcohol consumption.

Smoking.

Results of the biochemical examinations and the physical check-up.

Page 6: ECML / PKDD 2004 Discovery Challenge

Outline of the talk

1. Atherosclerosis Data Set

2. Multidimensional Association Rules

3. Exceptions

4. Data Preparation

5. Results

6. Summary

Page 7: ECML / PKDD 2004 Discovery Challenge

Multidimensional Association Rules

Multidimensional Association Rules (J. Han and M. Kamber, 2001) represent combinations of attribute values that often occur together in a database.

They can be mined from relational databases or data warehouses.

Example:

(DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”)

meaning: “men who are heavy beer consumers also tend to be heavy smokers”.

This rule involves two attributes (or dimensions): DailyBeerCons and Smoking.

Page 8: ECML / PKDD 2004 Discovery Challenge

Multidimensional Association Rules: Formal Definition

$A_1 = a_1, \ldots, A_n = a_n \rightarrow B_1 = b_1, \ldots, B_m = b_m$

$A_i$ ($1 \le i \le n$) and $B_j$ ($1 \le j \le m$): distinct attributes (dimensions) from a database relation.

$a_i$ and $b_j$: values from the domains of $A_i$ and $B_j$, respectively.

Generic representation: $A \rightarrow B$

A is the antecedent and B is the consequent of the rule. Several attributes can be involved in both the antecedent and the consequent.

Page 9: ECML / PKDD 2004 Discovery Challenge

Interest Measures: Support and Confidence

Support index (Sup): the probability that a tuple matches all conditions in A → B.

Confidence index (Conf): the probability that a tuple matches B, given that it matches A.

$\mathrm{Sup}(A \rightarrow B) = P(A, B)$ and $\mathrm{Conf}(A \rightarrow B) = P(B \mid A)$.

The support indicates the relevance and the confidence indicates the validity of an association rule.

Support / Confidence Framework (Agrawal et al, 1993): finding all rules that match user-provided minimum support and minimum confidence.
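As a minimal sketch of these two definitions, the Python below estimates support and confidence by counting matching tuples. The four rows are toy data invented for illustration, not STULONG records, and the function names are hypothetical.

```python
def matches(row, conditions):
    """True when the tuple satisfies every (attribute = value) condition."""
    return all(row.get(attr) == value for attr, value in conditions.items())


def support(rows, conditions):
    """Fraction of tuples matching all conditions: an estimate of P(conditions)."""
    return sum(matches(row, conditions) for row in rows) / len(rows)


def confidence(rows, antecedent, consequent):
    """Estimate of P(B | A) = Sup(A -> B) / Sup(A)."""
    sup_a = support(rows, antecedent)
    if sup_a == 0:
        return 0.0  # convention chosen here: report 0 when A never occurs
    return support(rows, {**antecedent, **consequent}) / sup_a


# Toy relation (hypothetical values, for illustration only).
rows = [
    {"DailyBeerCons": ">1l", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": ">1l", "Smoking": "no"},
    {"DailyBeerCons": "no", "Smoking": ">20 cig/day"},
    {"DailyBeerCons": "no", "Smoking": "no"},
]
A = {"DailyBeerCons": ">1l"}
B = {"Smoking": ">20 cig/day"}

print(support(rows, {**A, **B}))  # Sup(A -> B)
print(confidence(rows, A, B))     # Conf(A -> B)
```

A minimum-support and minimum-confidence filter, as in the Support / Confidence Framework, would simply keep the rules whose two values reach the user-provided thresholds.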

Page 10: ECML / PKDD 2004 Discovery Challenge

Interest Measures: Support and Confidence

Problems with the Support / Confidence Framework (Brin et al, 1997):

Generation of a huge number of rules:

Most of these rules are obvious.

In many cases, the rules express relations that do not actually hold.

Page 11: ECML / PKDD 2004 Discovery Challenge

Interest Measures: Support and Confidence

| Id | Association Rule | SupA | SupB | Sup | Conf |
|----|------------------|------|------|-----|------|
| R1 | (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) | 0.1193 | 0.2602 | 0.0448 | 0.3758 |
| R2 | (DailyBeerCons = “>1l”) → (Married = “yes”) | 0.1193 | 0.8487 | 0.0905 | 0.7584 |

The support and confidence values of R2 are higher than those of R1.

Is R2, in fact, more interesting than R1?

Page 12: ECML / PKDD 2004 Discovery Challenge

Negative Dependence

| Id | Association Rule | SupA | SupB | Sup | Conf |
|----|------------------|------|------|-----|------|
| R2 | (DailyBeerCons = “>1l”) → (Married = “yes”) | 0.1193 | 0.8487 | 0.0905 | 0.7584 |

R2 appears to imply that men who are heavy beer consumers tend to be married.

84.87% of men are married. However, the probability that a man is married, given that he is a heavy beer consumer, is only 75.84%.

Heavy beer consumers are, in fact, less likely to be married. There is a negative dependence between being married and being a heavy beer consumer.

Page 13: ECML / PKDD 2004 Discovery Challenge

Positive Dependence

| Id | Association Rule | SupA | SupB | Sup | Conf |
|----|------------------|------|------|-----|------|
| R1 | (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) | 0.1193 | 0.2602 | 0.0448 | 0.3758 |

26.02% of men are heavy smokers. The probability that a man is a heavy smoker, given that he is a heavy beer consumer, is 37.58%.

Heavy beer consumers are more likely to smoke a lot.

There is a positive dependence between being a heavy beer consumer and being a heavy smoker.

Page 14: ECML / PKDD 2004 Discovery Challenge

Strong Association Rule

| Id | Association Rule | SupA | SupB | Sup | Conf |
|----|------------------|------|------|-----|------|
| R1 | (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) | 0.1193 | 0.2602 | 0.0448 | 0.3758 |
| R2 | (DailyBeerCons = “>1l”) → (Married = “yes”) | 0.1193 | 0.8487 | 0.0905 | 0.7584 |

Conclusions:

R1 is a strong association rule, while R2 does not express a true association.

In order to mine interesting information, we need to evaluate the type of dependence between the antecedent and the consequent of a rule.
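The comparison used for R1 and R2 can be sketched as a small Python helper: a rule shows positive dependence when Conf(A → B) exceeds Sup(B) and negative dependence when it falls below. The function name and the tolerance parameter are assumptions made for this illustration; the numbers are the table values for R1 and R2.

```python
def dependence(conf, sup_b, tol=1e-9):
    """Classify the dependence between A and B by comparing P(B|A) with P(B)."""
    if conf > sup_b + tol:
        return "positive"
    if conf < sup_b - tol:
        return "negative"
    return "independent"


r1 = dependence(conf=0.3758, sup_b=0.2602)  # heavy beer consumer -> heavy smoker
r2 = dependence(conf=0.7584, sup_b=0.8487)  # heavy beer consumer -> married
print(r1, r2)
```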

Page 15: ECML / PKDD 2004 Discovery Challenge

Lift and RI

Lift: how many times more frequent B is when A occurs.

$\mathrm{Lift}(A \rightarrow B) = \dfrac{\mathrm{Conf}(A \rightarrow B)}{\mathrm{Sup}(B)}$

RI - Rule Interest (G. Piatetsky-Shapiro, 1991): the difference between the actual support of a rule and the support expected if A and B were independent, that is, the fraction of tuples matched by the rule beyond expectation.

$\mathrm{RI}(A \rightarrow B) = \mathrm{Sup}(A \rightarrow B) - \mathrm{Sup}(A) \times \mathrm{Sup}(B)$

We believe that using different interest measures (Sup, Conf, Lift and RI) provides alternative analyses of the same data and gives a better understanding of the associations.
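To make the two formulas concrete, this Python sketch recomputes Lift and RI for R1 and R2 from the SupA, SupB, Sup and Conf values given for those rules. The function names are illustrative assumptions, not part of the authors' tools.

```python
def lift(conf, sup_b):
    """Lift(A -> B) = Conf(A -> B) / Sup(B); above 1 means positive dependence."""
    return conf / sup_b


def rule_interest(sup_ab, sup_a, sup_b):
    """RI(A -> B) = Sup(A -> B) - Sup(A) * Sup(B); positive means more
    co-occurrence than expected under independence."""
    return sup_ab - sup_a * sup_b


# R1: (DailyBeerCons = ">1l") -> (Smoking = ">20 cig/day")
lift_r1 = lift(0.3758, 0.2602)
ri_r1 = rule_interest(0.0448, 0.1193, 0.2602)

# R2: (DailyBeerCons = ">1l") -> (Married = "yes")
lift_r2 = lift(0.7584, 0.8487)
ri_r2 = rule_interest(0.0905, 0.1193, 0.8487)

print(f"R1: lift={lift_r1:.3f}, RI={ri_r1:.4f}")
print(f"R2: lift={lift_r2:.3f}, RI={ri_r2:.4f}")
```

With these inputs R1 has a lift above 1 and a positive RI, while R2 has a lift below 1 and a negative RI, which agrees with the positive and negative dependence found for the two rules.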

Page 16: ECML / PKDD 2004 Discovery Challenge

Outline of the talk

1. Atherosclerosis Data Set

2. Multidimensional Association Rules

3. Exceptions

4. Data Preparation

5. Results

6. Summary

Page 17: ECML / PKDD 2004 Discovery Challenge

Exceptions

In our approach, exceptions represent association rules that become much weaker in some specific subsets of the database.

Example: Does the rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) become weaker in any subset of the database?

Mined exception:

(DailyBeerCons = “>1l”) & (Age = “≥ 50”) ↛ (Smoking = “>20 cig/day”)

meaning: “among the men who are 50 years old or above, the support of the association between being a heavy beer consumer and being a heavy smoker is surprisingly lower than expected”.

Page 18: ECML / PKDD 2004 Discovery Challenge

Exceptions

(DailyBeerCons = “>1l”) & (Age = “≥ 50”) ↛ (Smoking = “>20 cig/day”)

This exception was obtained because the conventional rule (DailyBeerCons = “>1l”) & (Age = “≥ 50”) → (Smoking = “>20 cig/day”) did not achieve its expected support.

This expected support is computed from the support of the original rule (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”) and the support of the condition (Age = “≥ 50”).

Page 19: ECML / PKDD 2004 Discovery Challenge

Exceptions: Formal Definition

Let D be a database relation.

Let R: A → B be a multidimensional association rule.

Let $Z = \{Z_1 = z_1, \ldots, Z_k = z_k\}$ be a set of conditions defined over D, where $Z \cap (A \cup B) = \emptyset$. Z is called the probe set.

An exception related to the positive rule R is an implication of the form:

$A \wedge Z \nrightarrow B$

Page 20: ECML / PKDD 2004 Discovery Challenge

Candidate Exceptions

Exceptions are extracted from candidate exceptions. A candidate exception is an expression of the form:

$A \wedge Z \rightarrow B$

An exception is mined only if its candidate does not achieve the expected support.

This expectation is computed from the support of the original rule A → B and the support of the conditions that compose the probe set Z:

$\mathrm{ExpSup}(A \wedge Z \rightarrow B) = \mathrm{Sup}(A \rightarrow B) \times \mathrm{Sup}(Z)$
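One way to read the expected support, offered here as an interpretation rather than the authors' own derivation: if the probe set Z were statistically independent of the rule A → B, the joint probability of A, B and Z would factor into a product, and that product is exactly the expected support.

```latex
\begin{align*}
P(A, B, Z) &= P(A, B)\, P(Z)
  && \text{(assuming $Z$ is independent of $A$ and $B$)} \\
           &= \mathrm{Sup}(A \rightarrow B) \times \mathrm{Sup}(Z) \\
           &= \mathrm{ExpSup}(A \wedge Z \rightarrow B)
\end{align*}
```

Under this reading, an actual support well below the product signals a negative dependence between Z and the rule.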

Page 21: ECML / PKDD 2004 Discovery Challenge

The Interest Measure (IM) Index

We developed two interest measures to evaluate the degree of interestingness of an exception.

The IM (Interest Measure) index evaluates the strength (relevance) of an exception.

$\mathrm{IM}(E) = 1 - \dfrac{\mathrm{Sup}(A \wedge Z \rightarrow B)}{\mathrm{ExpSup}(A \wedge Z \rightarrow B)}$

An exception E is potentially interesting if the actual support of A ∧ Z → B is much lower than its expected support.

This measure captures the type of dependence between Z and A → B. The closer the value is to 1, the stronger the negative dependence.

Page 22: ECML / PKDD 2004 Discovery Challenge

Example of the IM Index

R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”), Sup(R) = 4.48%

Z = {(Age = “≥ 50”)}, Sup(Z) = 22.82%

The expected support for A ∧ Z → B can be computed as 4.48% × 22.82% = 1.02%.

The actual support of A ∧ Z → B is 0.48%.

The exception E1: A ∧ Z ↛ B is potentially interesting because IM(E1) = 1 - (0.48 / 1.02) = 0.53.

The actual support value of E1 is 53% lower than what is expected.
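The arithmetic of this example can be reproduced with a few lines of Python. This is only a sketch: the function names are hypothetical, and the supports are written as fractions rather than percentages.

```python
def expected_support(sup_rule, sup_z):
    """ExpSup(A ^ Z -> B) = Sup(A -> B) * Sup(Z)."""
    return sup_rule * sup_z


def interest_measure(actual_sup, sup_rule, sup_z):
    """IM = 1 - actual support / expected support of the candidate exception."""
    return 1 - actual_sup / expected_support(sup_rule, sup_z)


# Values from the example: Sup(R) = 4.48%, Sup(Z) = 22.82%, actual support = 0.48%.
exp_sup = expected_support(0.0448, 0.2282)
im_e1 = interest_measure(actual_sup=0.0048, sup_rule=0.0448, sup_z=0.2282)

print(f"ExpSup = {exp_sup:.4f}, IM(E1) = {im_e1:.2f}")
```

The sketch does not guard against an expected support of zero; a real implementation would need to decide how to treat that case.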

Page 23: ECML / PKDD 2004 Discovery Challenge

Degree of Unexpectedness

A high value for the IM measure is not a guarantee that we found interesting information.

R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”), Sup(R) = 4.48%

Z = {(Alcohol = “no”)}, Sup(Z) = 9.47%

The expected support for A ∧ Z → B can be computed as 4.48% × 9.47% = 0.42%.

The actual support for this candidate rule is 0.00%.

IM(A ∧ Z ↛ B) = 1 - (0.00 / 0.42) = 1.00.

However, this exception represents information that is obvious. The IM index cannot detect the strong negative dependence between A and Z.

Page 24: ECML / PKDD 2004 Discovery Challenge

Degree of Unexpectedness

The DU (Degree of Unexpectedness) index is used to determine the validity of an exception.

This measure captures how much the negative dependence between a probe set Z and a rule A → B exceeds the negative dependence between Z and either A or B.

$\mathrm{DU}(E) = \mathrm{IM}(E) - \max\left(1 - \dfrac{\mathrm{Sup}(A \wedge Z)}{\mathrm{ExpSup}(A \wedge Z)},\; 1 - \dfrac{\mathrm{Sup}(B \wedge Z)}{\mathrm{ExpSup}(B \wedge Z)}\right)$

where $\mathrm{ExpSup}(A \wedge Z) = \mathrm{Sup}(A) \times \mathrm{Sup}(Z)$ and $\mathrm{ExpSup}(B \wedge Z) = \mathrm{Sup}(B) \times \mathrm{Sup}(Z)$.

The further the value is above 0, the more interesting the exception. If DU(E) ≤ 0, the exception is uninteresting.

Page 25: ECML / PKDD 2004 Discovery Challenge

Example of the DU Index

R: (DailyBeerCons = “>1l”) → (Smoking = “>20 cig/day”), Sup(R) = 4.48%, Sup(A) = 11.93%, Sup(B) = 26.02%

Z = {(Age = “≥ 50”)}, Sup(Z) = 22.82%, Sup(A ∧ Z) = 2.00%, Sup(B ∧ Z) = 6.00%

1) compute the negative dependence between A and Z:

1 - (2.00% / (11.93% × 22.82%)) = 0.27

2) compute the negative dependence between B and Z:

1 - (6.00% / (26.02% × 22.82%)) = -0.01

The exception E1: A ∧ Z ↛ B is, in fact, interesting because:

DU(E1) = 0.53 - max(0.27,-0.01) = 0.26
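The two steps and the final subtraction can be sketched in Python as follows. The function names are hypothetical and the supports are given as fractions. Because the sketch keeps full precision in the intermediate values, its result can differ in the last digit from the 0.26 above, which subtracts values already rounded to two decimals.

```python
def negative_dependence(sup_joint, sup_x, sup_z):
    """1 - Sup(X ^ Z) / (Sup(X) * Sup(Z)); positive when X and Z
    co-occur less often than expected under independence."""
    return 1 - sup_joint / (sup_x * sup_z)


def degree_of_unexpectedness(im, sup_a, sup_b, sup_z, sup_az, sup_bz):
    """DU(E) = IM(E) - max(negative dependence of A and Z, of B and Z)."""
    dep_a = negative_dependence(sup_az, sup_a, sup_z)
    dep_b = negative_dependence(sup_bz, sup_b, sup_z)
    return im - max(dep_a, dep_b)


# Values from the example; IM(E1) is recomputed from Sup(R), Sup(Z)
# and the actual support of the candidate (0.48%).
sup_a, sup_b, sup_z = 0.1193, 0.2602, 0.2282
im_e1 = 1 - 0.0048 / (0.0448 * sup_z)
du_e1 = degree_of_unexpectedness(im_e1, sup_a, sup_b, sup_z,
                                 sup_az=0.02, sup_bz=0.06)

print(f"DU(E1) = {du_e1:.4f}")
```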

Page 26: ECML / PKDD 2004 Discovery Challenge

Outline of the talk

1. Atherosclerosis Data Set

2. Multidimensional Association Rules

3. Exceptions

4. Data Preparation

5. Results

6. Summary

Page 27: ECML / PKDD 2004 Discovery Challenge

Data Preparation

The following relations in the ARFF format (Witten and Frank, 2000) were generated from the original Entry table:

ENTRYTOT: 1249 tuples (men from groups A, B and C).

ENTRYA: 276 tuples (only men from group A).

ENTRYB: 859 tuples (only men from group B).

ENTRYC: 114 tuples (only men from group C).

Page 28: ECML / PKDD 2004 Discovery Challenge

Data Preparation

The data was enriched with new fields, and the continuous attributes were discretized.

| Field | Possible Values |
|-------|-----------------|
| Cholesterol | “desirable” (< 200), “bordering” (200 – 239), “high” (≥ 240) |
| Triglycerides | “desirable” (< 150), “bordering” (150 – 200), “high” (201 – 499), “very high” (≥ 500) |
| BMI (body mass index) | “underweight” (bmi < 20), “normal” (20 ≤ bmi < 25), “overweight” (25 ≤ bmi < 30), “obese” (30 ≤ bmi < 40), “morbidly obese” (bmi ≥ 40) |
| Blood Pressure | “normal”, “normal / high”, “high” |
| Skin Folds | “8-20”, “21-30”, “31-40”, “>40” |
| Age | “38-39”, “40-44”, “45-49”, “≥ 50” |
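Discretization of this kind amounts to mapping each numeric value to the label of the range it falls in. The Python sketch below does this for two of the fields, using the cut points listed for Cholesterol and BMI. The function names are hypothetical, and treating a non-integer cholesterol value between 239 and 240 as “bordering” is an assumption of this sketch.

```python
def cholesterol_category(value):
    """Map a cholesterol reading to one of the three listed ranges."""
    if value < 200:
        return "desirable"
    if value < 240:
        return "bordering"  # 200 - 239
    return "high"           # 240 and above


def bmi_category(bmi):
    """Map a body mass index to one of the five listed ranges."""
    if bmi < 20:
        return "underweight"
    if bmi < 25:
        return "normal"
    if bmi < 30:
        return "overweight"
    if bmi < 40:
        return "obese"
    return "morbidly obese"


print(cholesterol_category(215), bmi_category(27.3))
```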

Page 29: ECML / PKDD 2004 Discovery Challenge

Outline of the talk

1. Atherosclerosis Data Set

2. Multidimensional Association Rules

3. Exceptions

4. Data Preparation

5. Results

6. Summary

Page 30: ECML / PKDD 2004 Discovery Challenge

Results

We developed two programs in C++ (g++ compiler):

MULTMINE: used to mine strong multidimensional association rules.

EXCEPMINE: used to mine exceptions.

We used the following thresholds in the experiments:

Minimum support = 1% (MULTMINE).

Minimum IM = 0.30 and minimum DU = 0.05 (EXCEPMINE).
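How the two EXCEPMINE thresholds act as a filter can be sketched in Python. This is not the authors' C++ code: the function name is hypothetical, the comparison is assumed to be inclusive, and the last candidate is invented to show a rejection. The first two (IM, DU) pairs are the values reported for the exceptions in the Results part of the talk.

```python
MIN_IM = 0.30  # minimum IM used with EXCEPMINE
MIN_DU = 0.05  # minimum DU used with EXCEPMINE


def passes_thresholds(im, du, min_im=MIN_IM, min_du=MIN_DU):
    """Keep a candidate only if both indices reach their minimum
    (inclusive comparison is an assumption of this sketch)."""
    return im >= min_im and du >= min_du


candidates = [
    ("apprentice school & great act. -/-> 15-20 cig/day", 0.4755, 0.2069),
    ("university & Group C -/-> BMI normal", 0.7018, 0.3052),
    ("hypothetical weak candidate", 0.35, 0.01),  # invented for illustration
]

kept = [name for name, im, du in candidates if passes_thresholds(im, du)]
print(kept)
```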

Page 31: ECML / PKDD 2004 Discovery Challenge

Group A - EntryALL

(Group = “A”) → (Education = “university”)

| SupA | SupB | Sup | Conf | Lift | RI |
|------|------|-----|------|------|----|
| 0.2210 | 0.2762 | 0.0873 | 0.3949 | 1.430 | 0.0262 |

Group A is the only group in which a university degree is the most frequent education level (Conf = 0.3949).

(Group = “A”) → (PhysActAfterJob = “great activity”)

| SupA | SupB | Sup | Conf | Lift | RI |
|------|------|-----|------|------|----|
| 0.2210 | 0.0857 | 0.0320 | 0.1449 | 1.692 | 0.0131 |

There is a strong positive dependence between belonging to Group A and practicing physical activities intensely in free time (Lift = 1.692).

Page 32: ECML / PKDD 2004 Discovery Challenge

Alcohol Consumption x Smoking

(DailyBeerCons = “>1l”) → (SmokingDuration = “>20 years”)

| Group | SupA | SupB | Sup | Conf | Lift | RI |
|-------|------|------|-----|------|------|----|
| A | 0.0688 | 0.1667 | 0.0145 | 0.2105 | 1.263 | 0.0030 |
| B | 0.1362 | 0.5751 | 0.0908 | 0.6667 | 1.159 | 0.0125 |
| C | 0.1140 | 0.4737 | 0.0789 | 0.6923 | 1.461 | 0.0249 |

Drinking a lot and smoking for more than 20 years are positively dependent in groups A, B, and C (Lift and RI columns).

However, there are far fewer long-term smokers in Group A (SupB column). In groups B and C, most of the heavy beer consumers have smoked cigarettes for more than 20 years (Conf column).

Men from group B tend to smoke and drink more (SupA, SupB and Sup columns).

Page 33: ECML / PKDD 2004 Discovery Challenge

Alcohol Consumption x Cholesterol

(Alcohol = “No”) → (Cholesterol = “desirable”)

| Group | SupA | SupB | Sup | Conf | Lift | RI |
|-------|------|------|-----|------|------|----|
| A | 0.0870 | 0.3370 | 0.0507 | 0.5833 | 1.731 | 0.0214 |
| B | 0.0861 | 0.1828 | 0.0186 | 0.2162 | 1.183 | 0.0029 |
| C | 0.1316 | 0.1316 | 0.0263 | 0.2000 | 1.520 | 0.0090 |

Not drinking alcohol and having cholesterol in the desirable range are positively dependent in groups A, B, and C (Lift and RI columns).

There are fewer alcohol consumers in Group C (SupA column).

In group A, most of the men who do not drink alcohol have cholesterol in the desirable range (Conf column).

Page 34: ECML / PKDD 2004 Discovery Challenge

Education x Smoking

(Education = “university”) → (Smoking = “no”)

| Group | SupA | SupB | Sup | Conf | Lift | RI |
|-------|------|------|-----|------|------|----|
| A | 0.3949 | 0.5109 | 0.2210 | 0.5596 | 1.095 | 0.0193 |
| B | 0.2526 | 0.1793 | 0.0664 | 0.2627 | 1.465 | 0.0211 |
| C | 0.1667 | 0.2018 | 0.0877 | 0.5263 | 2.608 | 0.0541 |

People with the highest education degree are less likely to be smokers (Lift and RI columns).

In groups A and C, the majority of men with a university degree do not smoke (Conf column). The support of this rule is very high in group A.

In group B, most of them are smokers (Conf column). However, not smoking and having a university degree are still strongly positively dependent (Lift and RI columns).

Page 35: ECML / PKDD 2004 Discovery Challenge

Skin Folds x Body Mass Index

(Skin Folds = “≤ 20”) → (BMI = “normal”)

| Group | SupA | SupB | Sup | Conf | Lift | RI |
|-------|------|------|-----|------|------|----|
| A | 0.2319 | 0.5326 | 0.1558 | 0.6719 | 1.261 | 0.0323 |
| B | 0.2154 | 0.3586 | 0.1478 | 0.6865 | 1.914 | 0.0706 |
| C | 0.1140 | 0.2632 | 0.0789 | 0.6923 | 2.631 | 0.0489 |

Most of the men classified into the lowest range of the attribute Skin Folds have a body mass index in the normal range (Conf column).

Both attributes are strongly positively dependent (Lift and RI columns).

There are far fewer men with a normal BMI in Group C (SupB column).

Page 36: ECML / PKDD 2004 Discovery Challenge

Exceptions

(Education = “apprentice school”) & (PhysActAfterJob = “great act.”) ↛ (Smoking = “15-20 cig/day”)

IM = 0.4755, DU = 0.2069

Original rule: “people whose education level is apprentice school tend to smoke a lot”.

Exception: Among the men who practice physical activities intensely in their free time, the support of the original rule is 47.55% lower than expected.

The degree of unexpectedness is equal to 20.69%.

Page 37: ECML / PKDD 2004 Discovery Challenge

Exceptions

(Education = “university”) & (Group = “C”) ↛ (BMI = “normal”)

IM = 0.7018, DU = 0.3052

Original rule: “people with the highest education degree tend to have a body mass index in the normal range”.

Exception: Among the men who belong to Group C, the support of the original rule is 70.18% lower than expected.

The degree of unexpectedness is equal to 30.52%.

Page 38: ECML / PKDD 2004 Discovery Challenge

Outline of the talk

1. Atherosclerosis Data Set

2. Multidimensional Association Rules

3. Exceptions

4. Data Preparation

5. Results

6. Summary

Page 39: ECML / PKDD 2004 Discovery Challenge

Summary

We presented some strong association rules and exceptions mined from the STULONG Data Set, concerning the entry examinations.

The strong association rules showed how the correlations among patient characteristics differ across the three basic groups.

Exceptions indicated negative patterns associated with previously known strong positive rules. These exceptions were mined from candidates that do not achieve an expected support value.

Page 40: ECML / PKDD 2004 Discovery Challenge

Future Work

Apply the same approach to the relations Letter, Control and Death.

Besides mining rules with a large deviation between the actual and the expected support, we intend to investigate the interestingness of rules with a large deviation between the actual and the expected confidence.

Page 41: ECML / PKDD 2004 Discovery Challenge

Universidade Federal Fluminense

http://www.uff.br

Niterói, Rio de Janeiro, Brazil

Thank you!