attribute interactions in machine learning (jakulin/int/interaction-slides.pdf)
TRANSCRIPT
Attribute Interactions in Machine Learning
Aleks Jakulin
Faculty of Computer and Information Science
University of Ljubljana
Attribute Interactions – p.1/17
A Classification Problem
ATTRIBUTES LABEL
Name Hair Height Weight Lotion Result
Sarah blonde average light no sunburned
Dana blonde tall average yes tanned
Alex brown short average yes tanned
Annie blonde short average no sunburned
Emily red average heavy no sunburned
Pete brown tall heavy no tanned
John brown average heavy no tanned
Katie blonde short light yes tanned
TASK: PREDICT AN INSTANCE’S CLASS GIVEN THE ATTRIBUTE VALUES.
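As a concrete illustration (a hand-written rule, not an induced model), one rule consistent with all eight rows shows that Hair and Lotion together determine the Result, while neither attribute alone separates the classes:

```python
# The eight instances from the table above.
data = [
    ("Sarah", "blonde", "average", "light",   "no",  "sunburned"),
    ("Dana",  "blonde", "tall",    "average", "yes", "tanned"),
    ("Alex",  "brown",  "short",   "average", "yes", "tanned"),
    ("Annie", "blonde", "short",   "average", "no",  "sunburned"),
    ("Emily", "red",    "average", "heavy",   "no",  "sunburned"),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  "tanned"),
    ("John",  "brown",  "average", "heavy",   "no",  "tanned"),
    ("Katie", "blonde", "short",   "light",   "yes", "tanned"),
]

def classify(hair, height, weight, lotion):
    # Hair and Lotion interact: blonde skin is sunburned only without lotion.
    if hair == "red" or (hair == "blonde" and lotion == "no"):
        return "sunburned"
    return "tanned"

correct = sum(classify(h, ht, w, l) == r for _, h, ht, w, l, r in data)
print(correct, "/", len(data))  # 8 / 8
```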
Attribute Interactions – p.2/17
Interactions
“We cannot conquer a group of interacting attributes by
dividing them.”
Most machine learning algorithms assume either
that all attributes are independent (naïve Bayes, logistic regression, linear SVM, perceptron),
or that all attributes are dependent (classification trees,
constructive induction, rules, kernel methods, instance-based
methods).
However, voting ensembles, where a number of classifiers
trained on subsets of attributes or instances vote to predict
the label (attribute decomposition, random forests, decision
graphs, subspace methods), yield good results. Why?
Attribute Interactions – p.3/17
Voting
[Diagram: the attributes Hair, Name, Height, Lotion, and Weight each vote on the Result.]
Attribute Interactions – p.4/17
Voting
[Diagram: a latent SKIN node is inserted between the attributes and the Result. We declare this to be a TRUE INTERACTION; other parts of the graph are labelled SPURIOUS RELATIONSHIP and MODERATOR.]
Attribute Interactions – p.4/17
Voting
[Diagram: latent SKIN and SIZE nodes among the attributes and the Result. We declare these to be a TRUE INTERACTION and a FALSE INTERACTION; other parts of the graph are labelled SPURIOUS RELATIONSHIP, MODERATOR, and LATENT CAUSE (twice).]
Attribute Interactions – p.4/17
Simpson’s Paradox
[Bar chart: death rate (%) of tuberculosis patients, New York vs. Richmond; y-axis 0 to 0.6.]
Attribute Interactions – p.5/17
Simpson’s Paradox
[Bar chart: death rate (%) of tuberculosis patients, New York vs. Richmond, shown separately for White, Non-White, and Both; y-axis 0 to 0.6.]
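The reversal can be reproduced with hypothetical counts (explicitly NOT the historical figures; the numbers below are chosen only to exhibit the pattern): one city has the higher overall death rate yet the lower rate within every racial group, because the groups are unevenly distributed across the cities. A minimal sketch:

```python
# Hypothetical (patients, deaths) counts -- NOT the historical data.
groups = {
    ("New York", "white"):     (10000, 20),
    ("New York", "non-white"): (1000,  5),
    ("Richmond", "white"):     (1000,  1),
    ("Richmond", "non-white"): (10000, 40),
}

def rate(cells):
    """Death rate in percent over a list of (patients, deaths) cells."""
    patients = sum(p for p, _ in cells)
    deaths = sum(d for _, d in cells)
    return 100.0 * deaths / patients

ny_overall = rate([groups[("New York", r)] for r in ("white", "non-white")])
ri_overall = rate([groups[("Richmond", r)] for r in ("white", "non-white")])
print("overall: New York %.3f%%, Richmond %.3f%%" % (ny_overall, ri_overall))

for race in ("white", "non-white"):
    ny = rate([groups[("New York", race)]])
    ri = rate([groups[("Richmond", race)]])
    print("%-9s New York %.3f%% vs Richmond %.3f%%" % (race, ny, ri))
# Richmond is worse overall, yet better within EACH racial group.
```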
Attribute Interactions – p.5/17
Information Gain
[Diagram: label C and attributes A, B.]
An attribute is an information source. We want to estimate the amount of information shared between two sources.

The amount learned about a label C from an attribute A is quantified by information gain: GainC(A) := H(A) + H(C) − H(AC).

Interpretation: our ignorance about an unknown C is reduced by GainC(A) given knowledge of A.

Information gain is sufficient if all attributes are conditionally independent with respect to the label, i.e., when there are only 2-way interactions.
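On the toy sunburn table from the first slide, the gain formula can be evaluated directly from empirical entropies. A minimal sketch (class values abbreviated to "burn"/"tan"):

```python
from collections import Counter
from math import log2

def H(*columns):
    """Empirical entropy (in bits) of the joint distribution of the columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(k / n * log2(k / n) for k in Counter(joint).values())

# Hair (A) and Result (C), from the table on the first slide.
hair = ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"]
result = ["burn", "tan", "tan", "burn", "burn", "tan", "tan", "tan"]

gain = H(hair) + H(result) - H(hair, result)  # GainC(A) := H(A) + H(C) - H(AC)
print("GainC(Hair) = %.3f bits" % gain)
```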
Attribute Interactions – p.6/17
Interaction Gain
[Diagram: label C and attributes A, B.]
How to estimate the amount of information shared among three attributes?

The generalization of information gain to 3-way interactions is interaction gain:

IG3(ABC) := H(AB) + H(AC) + H(BC) − H(A) − H(B) − H(C) − H(ABC)
          = GainC(AB) − GainC(A) − GainC(B).

If IG is negative: a false interaction.
If IG is positive: a true interaction.
If IG is zero: no 3-way interaction.
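Both forms of the definition can be checked numerically on the sunburn table: the interaction gain of Hair and Lotion with respect to the Result comes out positive, i.e., a true interaction. A sketch:

```python
from collections import Counter
from math import log2

def H(*columns):
    """Empirical entropy (in bits) of the joint distribution of the columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(k / n * log2(k / n) for k in Counter(joint).values())

# A = Hair, B = Lotion, C = Result, from the table on the first slide.
A = ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"]
B = ["no", "yes", "yes", "no", "no", "no", "no", "yes"]
C = ["burn", "tan", "tan", "burn", "burn", "tan", "tan", "tan"]

ig3 = (H(A, B) + H(A, C) + H(B, C)
       - H(A) - H(B) - H(C) - H(A, B, C))

# Equivalent form: GainC(AB) - GainC(A) - GainC(B).
def gain(attr_cols, label):
    return H(*attr_cols) + H(label) - H(*attr_cols, label)

ig3_alt = gain((A, B), C) - gain((A,), C) - gain((B,), C)
print("IG3 = %.3f bits" % ig3)  # positive: Hair and Lotion truly interact
```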
Attribute Interactions – p.7/17
False Interaction Analysis
[Interaction dendrogram over the attributes: age, marital-status, relationship, hours-per-week, sex, workclass, native-country, race, education, education-num, occupation, capital-gain, capital-loss, fnlwgt; height axis 0 to 1000.]

The Census/Adult domain from UCI, with 2 classes of individuals: rich and poor.

The similarity between two attributes is proportional to the negated 3-way interaction gain between them and the label.

Only false interactions were taken into consideration.

Agglomerative clustering was used to create the interaction dendrogram.

Attribute Interactions – p.8/17
True Interaction Analysis
[Interaction graph: nodes native_country, age, race, workclass, occupation, capital_loss, capital_gain, education, marital_status, relationship, hours_per_week; edge strengths range from 23% to 100%, with many edges incident to native_country.]
A percentage on an interaction graph edge indicates the strength of a true interaction.

Native country appears to be an important moderator, moderating a large number of 2-way interactions.
True interactions are rarely transitive relations.
True interactions are a forest of trees, not a single tree.
Attribute Interactions – p.9/17
Interaction Significance (1)
When is an interaction significant?
Special statistics for conditional dependence and independence tests, e.g., Cochran-Mantel-Haenszel.

Evaluate classifier performance on unseen data by comparing:

A classifier assuming independence between two attributes (voting).

A classifier exploiting the dependence between two attributes via interaction resolution (segmentation).
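The voting/segmentation contrast can be sketched with a toy naive-Bayes-style classifier on synthetic XOR data, the archetypal true interaction (this illustrates the comparison itself, not the thesis's actual wrapper or its evaluation protocol):

```python
from collections import Counter

def cond_prob(value, cls, column, labels, k=1):
    """Laplace-smoothed estimate of P(value | cls)."""
    n_cls = sum(1 for c in labels if c == cls)
    n_val = sum(1 for v, c in zip(column, labels) if v == value and c == cls)
    return (n_val + k) / (n_cls + k * len(set(column)))

def predict(a, b, cols, labels, joint):
    """joint=False: voting (A and B treated as independent given C).
    joint=True: segmentation (the pair AB treated as a single attribute)."""
    n = len(labels)
    best, best_p = None, -1.0
    for cls, cnt in Counter(labels).items():
        p = cnt / n
        if joint:
            p *= cond_prob((a, b), cls, list(zip(*cols)), labels)
        else:
            p *= cond_prob(a, cls, cols[0], labels)
            p *= cond_prob(b, cls, cols[1], labels)
        if p > best_p:
            best, best_p = cls, p
    return best

# XOR: the label depends on A and B only jointly, never individually.
A = [0, 0, 1, 1, 0, 0, 1, 1]
B = [0, 1, 0, 1, 0, 1, 0, 1]
C = [a ^ b for a, b in zip(A, B)]

acc = {}
for name, joint in (("voting", False), ("segmentation", True)):
    acc[name] = sum(predict(a, b, (A, B), C, joint) == c
                    for a, b, c in zip(A, B, C))
print(acc)  # voting is at chance level; segmentation is perfect
```

Here the sketch scores on the training data for brevity; the slide's test uses unseen data.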
Attribute Interactions – p.10/17
Interaction Significance (2)
[Interaction graph for the 'breast' domain: several dozen attributes (DRUZINSK, ED_FAZAS, KTZHT, ED_UPA, VASK.INV, OPERACIJ, RT, UPASEL, ED_PAI2, PAI2SEL, INV.KAPS, NEOKT, ANAMNEZA, LOKOREG., ODDALJEN, MKL, ED_PREGL.BE, NOV.TU, MENSEL, NKL, UICC, KT, D_STEVILO, INVAZIJ1, D_PAI1, PAI1SEL, UPARSEL, INVAZIJA, HR, D_PR.B, HT, ZDRAVLJE, LOKALIZA, KAT.L, VELSEL, NODSEL, D_DFS2, UPAR, GRADSEL, TIP.TU, GRADUS, LIMF.INV, D_ER.B, ED_KATL, ED_KAT.D, VEL.C); almost all edge strengths are 0-4%, with a few exceptions such as 20%, 27%, 33%, and 100%.]

There are generally few significant interactions.
Attribute Interactions – p.11/17
Interaction Significance (2)
[The same 'breast' interaction graph as on the previous slide.]
ODDALJEN > 0: y
ODDALJEN <= 0:
:...LOKOREG. <= 0: n
LOKOREG. > 0: y
A PERFECT CLASSIFICATION TREE FOR THE 'BREAST' DOMAIN, INDUCED BY C4.5.
But they matter: non-myopic feature selection, non-myopic split selection, non-myopic discretization, rules, trees, constructive induction.
Attribute Interactions – p.11/17
Classification Performance
‘adult’ Base False True
NBC 0.416 0.352 0.392
LR 1.562 0.418 1.564
SVM — — —
‘breast’ Base False True
NBC 0.262 0.187 0.171
LR 0.016 0.016 0.016
SVM 0.032 0.032 0.016
A wrapper algorithm detects true or false interactions with interaction gain and uses minimal-error attribute reduction to resolve them. No feature selection or parameter tuning was used.
It improves results with logistic regression, SVM, and the
naïve Bayesian classifier.
There must be enough data!
Attribute Interactions – p.12/17
Applications
Prediction:
Resolving significant interactions helps improve classification performance.

Interactions limit or prevent myopia in discretization and feature selection.

Interactions justify constructive induction.

Analysis:

Interactions are interesting, especially if unexpected: interactions between treatments, symptoms, etc.
Attribute Interactions – p.13/17
Summary of Contributions
Two kinds of interactions: true and false interactions.
Interaction gain is an interaction probe, able to detect and classify 3-way interactions.

The pragmatic interaction significance test, based on a comparison of classification performance on unseen data.

True and false interaction analysis methodology, with interaction graphs and interaction dendrograms.

Improving classification performance with interaction resolution.
Attribute Interactions – p.14/17
Further Work
A full-fledged tool for interaction analysis.
Support for numerical and ordered attributes.
Generalization to k-way interactions.
Improved methods of resolution, especially of false interactions.

Exploration of the implications of interactions for discretization, split selection, etc.
Applications.
Attribute Interactions – p.15/17
Cardinality of Attributes
[Scatter plot, Adult/Census: improvement by replacement (y-axis, -0.1 to 0.04) vs. number of joint attribute values (x-axis, 0 to 1600).]
The greater the number of values in the constituent attributes, the lower the chance that the interaction between them is significant.

Attribute Interactions – p.16/17
Attribute Reduction
[Scatter plot, Adult: improvement by replacement using the Cartesian product (y-axis) vs. improvement by replacement using minimal-error reduction (MinErr, x-axis); both axes -0.08 to 0.04.]
Minimal-error attribute reduction often yields better results
than using the non-reduced Cartesian product of attributes.
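One plausible reading of minimal-error reduction (a sketch under that assumption, not the thesis's implementation): map each joint value of the Cartesian product to the class it predicts best, so the reduced attribute has at most as many values as there are classes. On the sunburn table:

```python
from collections import Counter, defaultdict

def minimal_error_reduction(a, b, labels):
    """Replace the Cartesian product of attributes a and b with a reduced
    attribute: each joint value maps to its majority class, minimizing
    training error. A plausible sketch, not the author's exact procedure."""
    counts = defaultdict(Counter)
    for va, vb, c in zip(a, b, labels):
        counts[(va, vb)][c] += 1
    majority = {joint: ctr.most_common(1)[0][0] for joint, ctr in counts.items()}
    return [majority[(va, vb)] for va, vb in zip(a, b)]

hair   = ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"]
lotion = ["no", "yes", "yes", "no", "no", "no", "no", "yes"]
result = ["burn", "tan", "tan", "burn", "burn", "tan", "tan", "tan"]

reduced = minimal_error_reduction(hair, lotion, result)
print(sorted(set(reduced)))  # the 5 joint values collapse to 2 groups
```

Because the table is consistent, the reduced attribute here coincides with the label; on noisy data it would merely be the lowest-error grouping of the joint values.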
Attribute Interactions – p.17/17