ECML/PKDD 2003 Discovery Challenge

Attribute-Value and First Order Data Mining within the STULONG project

Anneleen Van Assche, Sofie Verbaeten, Darek Krzywania, Jan Struyf, Hendrik Blockeel

Department of Computer Science

Katholieke Universiteit Leuven


Outline

Data Mining Effort

• Data

• Data preprocessing

• Data mining

• Evaluation criteria

Discovered Knowledge

• Initial exploration

• Entry data

• Control data

Conclusions


Data

• we studied 2 of the 4 data matrices from the STULONG data set:

the Entry data matrix

the Control data matrix

• men in the Entry data are divided into 3 subgroups based on occurrence of risk factors:

normal group (NG): none of these risk factors

risk group (RG): at least one of the risk factors

pathological group (PG): manifested serious disease


Data preprocessing

• missing values / empty entries / “not stated” / “no”

• data = propositionalisation of relational database

many empty entries + redundancies (e.g. personal anamnesis)

1-n relation from Entry to Control data set

solution: relational representation (ILP)

background knowledge can be used

• new features for trend analysis in control examinations


Data preprocessing

Attribute-value: Entry data set

converted to Weka .arff format

introduction of new attributes (e.g. BMI, …)

Relational: Entry + Control data set

converted to relational ILP format

introduction of background knowledge
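The attribute-value preprocessing step can be sketched as follows. This is a minimal illustration, not the authors' conversion script: the attribute names (`weight_kg`, `height_cm`, `bmi`) and the relation name are hypothetical, and the only facts relied on are the standard BMI definition and the basic ARFF layout (`@relation`, `@attribute`, `@data`, `?` for a missing value).

```python
from __future__ import annotations


def bmi(weight_kg: float, height_cm: float) -> float:
    """Body mass index: weight in kilograms divided by squared height in metres."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def to_arff(relation: str, attributes: list[tuple[str, str]], rows: list[list]) -> str:
    """Render rows as a minimal ARFF document; None is written as '?' (missing)."""
    lines = [f"@relation {relation}", ""]
    for name, arff_type in attributes:
        lines.append(f"@attribute {name} {arff_type}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join("?" if value is None else str(value) for value in row))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical records (weight in kg, height in cm); not taken from STULONG.
    people = [(81.0, 180.0), (70.0, None)]
    rows = []
    for weight, height in people:
        derived = None if height is None else round(bmi(weight, height), 1)
        rows.append([weight, height, derived])
    attributes = [("weight_kg", "numeric"), ("height_cm", "numeric"), ("bmi", "numeric")]
    print(to_arff("entry_sketch", attributes, rows))
```

Writing missing measurements as `?` keeps them distinguishable from a genuine "no" answer, which is one of the preprocessing concerns listed earlier.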


Data mining

• Entry data in .arff format → Weka

classification (ZeroR, OneR, NB, Decision Stump, Decision Table, J48, …)

regression (Linear Regression, M5’)

association rules (Apriori)

• Entry + Control data in ILP format → ACE

classification (Tilde)

regression (Tilde)

since the class distributions are skewed, it is better to use regression to predict the chance of being positive/negative than to use classification


Evaluation criteria

• 10-fold cross-validation

• classifiers

ROC analysis (area under the curve, AUC)

accuracy

• regression models

Relative error (RE)

Pearson’s correlation coefficient (r)
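The evaluation measures listed above can be written down in a few lines. The sketch below is illustrative only: it assumes the error measure is the relative absolute error as Weka defines it (absolute error divided by the absolute error of always predicting the mean), which the slides do not spell out, and it computes AUC as the fraction of positive/negative pairs that are ranked correctly.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def auc(y_true, scores):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def relative_absolute_error(y_true, y_pred):
    """Total absolute error relative to always predicting the mean of y_true.

    Assumed definition; a value near 1.0 means no better than the mean predictor.
    """
    mean = sum(y_true) / len(y_true)
    model_error = sum(abs(t - p) for t, p in zip(y_true, y_pred))
    baseline_error = sum(abs(t - mean) for t in y_true)
    return model_error / baseline_error


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equally long sequences."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)
```

Under these definitions, a model that ranks cases at random has an AUC of about 0.5 and a model that always predicts the mean has a relative absolute error of 1.0, which gives the reference points for reading the result tables.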




Initial exploration of Entry

Comparison of mean values of attributes for the three subgroups

reached education

responsibility in job

physical activity in job

physical activity after job

Skinfold above musculus triceps

Skinfold above musculus subscapularis


Initial exploration of Entry

Correlation between BMI and skin fold for the three subgroups


Results from the Entry data set

Relations between social factors and other characteristics

• education level ↔ physical activity in job

• education level ↔ smoking

• pensioner ↔ drinking

• age ↔ blood pressure

Relations between physical activities and other characteristics

• activity after job ↔ smoking

• duration of way to work ↔ drinking

...

...



Correlation between skinfolds and BMI in particular risk groups

• regression task: predict BMI using SUBSC and TRIC

• classification task: predict OVERWEIGHT (OW) (1 if BMI > 25, else 0)

Experiment  Size  ACC  RAE   r       AUC
OW_T        6.0   71%  0.83   0.417  0.71
OW_NG       0.6   53%  1.01  -0.009  0.50
OW_RG       3.9   74%  0.78   0.467  0.73
OW_PG       1.0   75%  0.97   0.205  0.55
BMI_T       6.9   -    0.77   0.486  -
BMI_NG      1.0   -    0.90   0.309  -
BMI_RG      3.9   -    0.72   0.529  -
BMI_PG      1.7   -    0.86   0.304  -



Correlation between skinfolds and BMI in particular subgroups

• correlation is strongest in risk group

• in all groups, SUBSC > approximately 15 is the most important split for distinguishing overweight from non-overweight

• SUBSC ↑ → BMI ↑

• influence of TRIC on BMI less than influence of SUBSC
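A single-attribute split of the kind described above can be found with a simple threshold search. The sketch below is a generic decision stump, not the learner used in the experiments (those were run in Weka and Tilde); the function names and the sample values are made up for illustration and are not STULONG measurements.

```python
def overweight(bmi_value, threshold=25.0):
    """OVERWEIGHT label as defined for the classification task: 1 if BMI > 25, else 0."""
    return 1 if bmi_value > threshold else 0


def best_split(values, labels):
    """Return (threshold, accuracy) of the best rule 'predict 1 if value > threshold'.

    Candidate thresholds are the midpoints between consecutive distinct values.
    """
    candidates = sorted(set(values))
    best_threshold, best_accuracy = None, -1.0
    for low, high in zip(candidates, candidates[1:]):
        threshold = (low + high) / 2
        correct = sum((v > threshold) == bool(y) for v, y in zip(values, labels))
        acc = correct / len(values)
        if acc > best_accuracy:
            best_threshold, best_accuracy = threshold, acc
    return best_threshold, best_accuracy


if __name__ == "__main__":
    # Synthetic example values, chosen only to exercise the function.
    subsc = [8, 12, 14, 16, 22, 30]
    bmi_values = [22.0, 23.5, 24.8, 26.1, 27.0, 29.5]
    labels = [overweight(b) for b in bmi_values]
    print(best_split(subsc, labels))
```

Running the same search separately on SUBSC and TRIC and comparing the resulting accuracies is one simple way to see which skinfold separates the classes better.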


Correlation between skinfolds and BMI in particular subgroups

[Figure: regression trees predicting BMI, with splits on SUBSC (thresholds 10, 15, 20, 35, 70) and TRIC (threshold 15) and yes/no branches; one tree is the example for the risk group. Predicted BMI values in the nodes range from 21.9 to 29.6.]



Staying healthy in the risk group (RG)

• task: predict whether a person in the RG came down with cardio disease

• new attribute ILL introduced based on HODN0 attr from Control

• no good performance (most correlation coefficients < 0.05)

• best correlation (0.15) for cholesterol level

• if cholesterol < 250 then higher chance to stay healthy


Results from the Control data set

• relational Control data set → Tilde

• task: predict whether person from risk group comes down with cardio disease (1) or not (0)

• use only control examinations (ce) before the patient’s cardio disease:

ce.year ≤ ROKi

• numeric attributes: extra features

compute trend over different ce’s

slope of the least-squares fit of the attribute over the time interval [T – N, T]

T: start of patient’s first disease

N: parameter chosen by Tilde
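The trend feature described above can be sketched as an ordinary least-squares slope over a time window. This is an illustration under stated assumptions, not the Tilde background knowledge itself: the window is taken to be the N years up to and including T, each control examination is represented as a hypothetical `(year, value)` pair, and the function name `trend_slope` is invented here.

```python
def trend_slope(exams, t, n):
    """Least-squares slope of value against year for exams with t - n <= year <= t.

    exams: iterable of (year, value) pairs; pairs with a missing value (None) are skipped.
    Returns None when the window holds fewer than two distinct years.
    """
    points = [(year, value) for year, value in exams
              if value is not None and t - n <= year <= t]
    if len(points) < 2:
        return None
    mean_x = sum(x for x, _ in points) / len(points)
    mean_y = sum(y for _, y in points) / len(points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        return None  # all examinations in the same year: slope undefined
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx


if __name__ == "__main__":
    # Hypothetical systolic blood pressure readings; not STULONG data.
    readings = [(1976, 120.0), (1978, 124.0), (1980, 128.0), (1990, 200.0)]
    print(trend_slope(readings, t=1980, n=10))
```

Only examinations inside the window contribute, so a reading taken after T (1990 in the usage example) is ignored, in line with the restriction to control examinations before the disease.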



Statistics on the Control data experiments

Input attributes        Size  ACC  RAE   r       AUC   AUC (33%)
Job                     1.0   68%  0.79   0.162  0.54  0.55
Physical activity       0.1   68%  1.01  -0.09   0.43  0.51
Smoking                 3.7   67%  0.98   0.153  0.57  0.61
Diet                    0.0   68%  1.00  -0.11   0.43  0.50
BMI                     1.4   67%  1.03  -0.04   0.46  0.50
Blood Pressure          3.3   63%  1.03   0.106  0.55  0.58
Cholesterol             9.1   64%  1.03   0.147  0.59  0.57
Glycaemia & Uric acid   3.3   66%  0.81   0.435  0.71  0.71
BMI & Cholesterol       10.6  63%  1.05   0.147  0.59  0.57
Smoking & Cholesterol   12.5  63%  1.04   0.164  0.59  0.60
All                     8.5   66%  0.91   0.351  0.71  0.69



• Some interesting subgroups from the decision trees:

proportion of class 1 in whole group = 32%

total population = 1417

IF glycaemia > 7.2 and BMI > 23.5 in each examination and diastolic blood pressure slope during last 10 years < -77 THEN 64% (103)

IF systolic blood pressure slope during last 20 years < -0.97 THEN 53% (122)

IF glycaemia > 7.2 in each examination THEN 48% (434)

IF patient leaves to full retirement in some examination THEN 20% (233)

IF reduced smoking in some examination and slope in number of cigarettes during last 20 years < 0.48 but during last 10 years > -0.11 THEN 16% (116)

IF glycaemia < 7.2 in some examination THEN 7% (285)
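One way to compare these subgroups is to relate each proportion to the 32% baseline (lift) and each subgroup size to the total population (coverage). The sketch below copies the percentages and counts from the rules above; it assumes that the number in parentheses is the subgroup size, and the short labels are paraphrases introduced here.

```python
BASE_RATE = 0.32    # proportion of class 1 in the whole group
POPULATION = 1417   # total population

# (short label, proportion of class 1, assumed subgroup size)
SUBGROUPS = [
    ("glycaemia > 7.2, BMI > 23.5, diastolic slope condition", 0.64, 103),
    ("systolic slope over last 20 years < -0.97", 0.53, 122),
    ("glycaemia > 7.2 in each examination", 0.48, 434),
    ("full retirement in some examination", 0.20, 233),
    ("reduced smoking with cigarette-slope conditions", 0.16, 116),
    ("glycaemia < 7.2 in some examination", 0.07, 285),
]


def lift(proportion, base_rate=BASE_RATE):
    """Ratio of the subgroup's class-1 proportion to the overall class-1 proportion."""
    return proportion / base_rate


def coverage(size, population=POPULATION):
    """Fraction of the population that falls into the subgroup."""
    return size / population


def summarise(subgroups):
    """Return (label, lift, coverage) rows, rounded for display."""
    return [(label, round(lift(p), 2), round(coverage(n), 3)) for label, p, n in subgroups]


if __name__ == "__main__":
    for label, subgroup_lift, subgroup_coverage in summarise(SUBGROUPS):
        print(f"{label}: lift {subgroup_lift}, coverage {subgroup_coverage}")
```

A lift above 1 marks a subgroup with more class-1 cases than the baseline and a lift below 1 marks fewer; the first rule, at 64% against 32%, has a lift of 2.0.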



• Glycaemia most important attribute

• also blood pressure, cholesterol and smoking …

• slope of numeric attributes very useful

• statistics may be negatively biased due to cross-validation


Conclusions

• used variety of data mining algorithms

propositional techniques

multi-relational techniques

• results consistent over different algorithms

• much discovered knowledge difficult to handle

• interpretation of results by domain experts is necessary

• careful handling of results

• even if the accuracy of a classifier is no higher than that of predicting the average, the classifier can still be informative!


The End

Thanks for your attention!!
