attribute interactions in machine learning (jakulin/int/interaction-slides.pdf)
TRANSCRIPT
Attribute Interactions in Machine Learning
Aleks Jakulin
Faculty of Computer and Information Science
University of Ljubljana
Attribute Interactions – p.1/17
A Classification Problem
ATTRIBUTES LABEL
Name Hair Height Weight Lotion Result
Sarah blonde average light no sunburned
Dana blonde tall average yes tanned
Alex brown short average yes tanned
Annie blonde short average no sunburned
Emily red average heavy no sunburned
Pete brown tall heavy no tanned
John brown average heavy no tanned
Katie blonde short light yes tanned
TASK: PREDICT AN INSTANCE’S CLASS GIVEN THE ATTRIBUTE VALUES.
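As a concrete illustration (a hand-written rule, not an induced model), one rule consistent with all eight rows shows that Hair and Lotion together determine the Result, while neither attribute alone separates the classes:

```python
# The eight instances from the table above.
data = [
    ("Sarah", "blonde", "average", "light",   "no",  "sunburned"),
    ("Dana",  "blonde", "tall",    "average", "yes", "tanned"),
    ("Alex",  "brown",  "short",   "average", "yes", "tanned"),
    ("Annie", "blonde", "short",   "average", "no",  "sunburned"),
    ("Emily", "red",    "average", "heavy",   "no",  "sunburned"),
    ("Pete",  "brown",  "tall",    "heavy",   "no",  "tanned"),
    ("John",  "brown",  "average", "heavy",   "no",  "tanned"),
    ("Katie", "blonde", "short",   "light",   "yes", "tanned"),
]

def classify(hair, height, weight, lotion):
    # Hair and Lotion interact: blonde skin is sunburned only without lotion.
    if hair == "red" or (hair == "blonde" and lotion == "no"):
        return "sunburned"
    return "tanned"

correct = sum(classify(h, ht, w, l) == r for _, h, ht, w, l, r in data)
print(correct, "/", len(data))  # 8 / 8
```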
Attribute Interactions – p.2/17
Interactions
“We cannot conquer a group of interacting attributes by
dividing them.”
Most machine learning algorithms assume either
that all attributes are independent (naïve Bayes, logistic regression, linear SVM, perceptron),
or that all attributes are dependent (classification trees,
constructive induction, rules, kernel methods, instance-based
methods).
However, voting ensembles, where a number of classifiers
trained on subsets of attributes or instances vote to predict
the label (attribute decomposition, random forests, decision
graphs, subspace methods), yield good results. Why?
Attribute Interactions – p.3/17
Voting
[Diagram: the attributes Hair, Name, Height, Lotion, and Weight each vote on the Result.]
Attribute Interactions – p.4/17
Voting
[Diagram: a latent SKIN node is inserted between the attributes and the Result. We declare this to be a TRUE INTERACTION; other parts of the graph are labelled SPURIOUS RELATIONSHIP and MODERATOR.]
Attribute Interactions – p.4/17
Voting
[Diagram: latent SKIN and SIZE nodes among the attributes and the Result. We declare these to be a TRUE INTERACTION and a FALSE INTERACTION; other parts of the graph are labelled SPURIOUS RELATIONSHIP, MODERATOR, and LATENT CAUSE (twice).]
Attribute Interactions – p.4/17
Simpson’s Paradox
[Bar chart: death rate (%) of tuberculosis patients, New York vs. Richmond; y-axis 0 to 0.6.]
Attribute Interactions – p.5/17
Simpson’s Paradox
[Bar chart: death rate (%) of tuberculosis patients, New York vs. Richmond, shown separately for White, Non-White, and Both; y-axis 0 to 0.6.]
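The reversal can be reproduced with hypothetical counts (explicitly NOT the historical figures; the numbers below are chosen only to exhibit the pattern): one city has the higher overall death rate yet the lower rate within every racial group, because the groups are unevenly distributed across the cities. A minimal sketch:

```python
# Hypothetical (patients, deaths) counts -- NOT the historical data.
groups = {
    ("New York", "white"):     (10000, 20),
    ("New York", "non-white"): (1000,  5),
    ("Richmond", "white"):     (1000,  1),
    ("Richmond", "non-white"): (10000, 40),
}

def rate(cells):
    """Death rate in percent over a list of (patients, deaths) cells."""
    patients = sum(p for p, _ in cells)
    deaths = sum(d for _, d in cells)
    return 100.0 * deaths / patients

ny_overall = rate([groups[("New York", r)] for r in ("white", "non-white")])
ri_overall = rate([groups[("Richmond", r)] for r in ("white", "non-white")])
print("overall: New York %.3f%%, Richmond %.3f%%" % (ny_overall, ri_overall))

for race in ("white", "non-white"):
    ny = rate([groups[("New York", race)]])
    ri = rate([groups[("Richmond", race)]])
    print("%-9s New York %.3f%% vs Richmond %.3f%%" % (race, ny, ri))
# Richmond is worse overall, yet better within EACH racial group.
```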
Attribute Interactions – p.5/17
Information Gain
[Diagram: label C and attributes A, B.]
An attribute is an information source. We want to estimate the amount of information shared between two sources.

The amount learned about a label C from an attribute A is quantified by information gain: GainC(A) := H(A) + H(C) − H(AC).

Interpretation: our ignorance about an unknown C is reduced by GainC(A) given knowledge of A.

Information gain is sufficient if all attributes are conditionally independent with respect to the label, i.e., when there are only 2-way interactions.
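On the toy sunburn table from the first slide, the gain formula can be evaluated directly from empirical entropies. A minimal sketch (class values abbreviated to "burn"/"tan"):

```python
from collections import Counter
from math import log2

def H(*columns):
    """Empirical entropy (in bits) of the joint distribution of the columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(k / n * log2(k / n) for k in Counter(joint).values())

# Hair (A) and Result (C), from the table on the first slide.
hair = ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"]
result = ["burn", "tan", "tan", "burn", "burn", "tan", "tan", "tan"]

gain = H(hair) + H(result) - H(hair, result)  # GainC(A) := H(A) + H(C) - H(AC)
print("GainC(Hair) = %.3f bits" % gain)
```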
Attribute Interactions – p.6/17
Interaction Gain
[Diagram: label C and attributes A, B.]
How to estimate the amount of information shared among three attributes?

The generalization of information gain to 3-way interactions is interaction gain:

IG3(ABC) := H(AB) + H(AC) + H(BC) − H(A) − H(B) − H(C) − H(ABC)
          = GainC(AB) − GainC(A) − GainC(B).

If IG is negative: a false interaction.
If IG is positive: a true interaction.
If IG is zero: no 3-way interaction.
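Both forms of the definition can be checked numerically on the sunburn table: the interaction gain of Hair and Lotion with respect to the Result comes out positive, i.e., a true interaction. A sketch:

```python
from collections import Counter
from math import log2

def H(*columns):
    """Empirical entropy (in bits) of the joint distribution of the columns."""
    joint = list(zip(*columns))
    n = len(joint)
    return -sum(k / n * log2(k / n) for k in Counter(joint).values())

# A = Hair, B = Lotion, C = Result, from the table on the first slide.
A = ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"]
B = ["no", "yes", "yes", "no", "no", "no", "no", "yes"]
C = ["burn", "tan", "tan", "burn", "burn", "tan", "tan", "tan"]

ig3 = (H(A, B) + H(A, C) + H(B, C)
       - H(A) - H(B) - H(C) - H(A, B, C))

# Equivalent form: GainC(AB) - GainC(A) - GainC(B).
def gain(attr_cols, label):
    return H(*attr_cols) + H(label) - H(*attr_cols, label)

ig3_alt = gain((A, B), C) - gain((A,), C) - gain((B,), C)
print("IG3 = %.3f bits" % ig3)  # positive: Hair and Lotion truly interact
```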
Attribute Interactions – p.7/17
False Interaction Analysis
[Interaction dendrogram over the attributes: age, marital-status, relationship, hours-per-week, sex, workclass, native-country, race, education, education-num, occupation, capital-gain, capital-loss, fnlwgt; height axis 0 to 1000.]

The Census/Adult domain from UCI, with 2 classes of individuals: rich and poor.

The similarity between two attributes is proportional to the negated 3-way interaction gain between them and the label.

Only false interactions were taken into consideration.

Agglomerative clustering was used to create the interaction dendrogram.

Attribute Interactions – p.8/17
True Interaction Analysis
[Interaction graph: nodes native_country, age, race, workclass, occupation, capital_loss, capital_gain, education, marital_status, relationship, hours_per_week; edge strengths range from 23% to 100%, with many edges incident to native_country.]
A percentage on an interaction graph edge indicates the strength of a true interaction.

Native country appears to be an important moderator, moderating a large number of 2-way interactions.
True interactions are rarely transitive relations.
True interactions are a forest of trees, not a single tree.
Attribute Interactions – p.9/17
Interaction Significance (1)
When is an interaction significant?
Special statistics for conditional dependence and independence tests, e.g., Cochran-Mantel-Haenszel.

Evaluate classifier performance on unseen data by comparing:

A classifier assuming independence between two attributes (voting).

A classifier exploiting the dependence between two attributes via interaction resolution (segmentation).
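The voting/segmentation contrast can be sketched with a toy naive-Bayes-style classifier on synthetic XOR data, the archetypal true interaction (this illustrates the comparison itself, not the thesis's actual wrapper or its evaluation protocol):

```python
from collections import Counter

def cond_prob(value, cls, column, labels, k=1):
    """Laplace-smoothed estimate of P(value | cls)."""
    n_cls = sum(1 for c in labels if c == cls)
    n_val = sum(1 for v, c in zip(column, labels) if v == value and c == cls)
    return (n_val + k) / (n_cls + k * len(set(column)))

def predict(a, b, cols, labels, joint):
    """joint=False: voting (A and B treated as independent given C).
    joint=True: segmentation (the pair AB treated as a single attribute)."""
    n = len(labels)
    best, best_p = None, -1.0
    for cls, cnt in Counter(labels).items():
        p = cnt / n
        if joint:
            p *= cond_prob((a, b), cls, list(zip(*cols)), labels)
        else:
            p *= cond_prob(a, cls, cols[0], labels)
            p *= cond_prob(b, cls, cols[1], labels)
        if p > best_p:
            best, best_p = cls, p
    return best

# XOR: the label depends on A and B only jointly, never individually.
A = [0, 0, 1, 1, 0, 0, 1, 1]
B = [0, 1, 0, 1, 0, 1, 0, 1]
C = [a ^ b for a, b in zip(A, B)]

acc = {}
for name, joint in (("voting", False), ("segmentation", True)):
    acc[name] = sum(predict(a, b, (A, B), C, joint) == c
                    for a, b, c in zip(A, B, C))
print(acc)  # voting is at chance level; segmentation is perfect
```

Here the sketch scores on the training data for brevity; the slide's test uses unseen data.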
Attribute Interactions – p.10/17
Interaction Significance (2)
[Interaction graph for the 'breast' domain: several dozen attributes (DRUZINSK, ED_FAZAS, KTZHT, ED_UPA, VASK.INV, OPERACIJ, RT, UPASEL, ED_PAI2, PAI2SEL, INV.KAPS, NEOKT, ANAMNEZA, LOKOREG., ODDALJEN, MKL, ED_PREGL.BE, NOV.TU, MENSEL, NKL, UICC, KT, D_STEVILO, INVAZIJ1, D_PAI1, PAI1SEL, UPARSEL, INVAZIJA, HR, D_PR.B, HT, ZDRAVLJE, LOKALIZA, KAT.L, VELSEL, NODSEL, D_DFS2, UPAR, GRADSEL, TIP.TU, GRADUS, LIMF.INV, D_ER.B, ED_KATL, ED_KAT.D, VEL.C); almost all edge strengths are 0-4%, with a few exceptions such as 20%, 27%, 33%, and 100%.]

There are generally few significant interactions.
Attribute Interactions – p.11/17
Interaction Significance (2)
[The same 'breast' interaction graph as on the previous slide.]
ODDALJEN > 0: y
ODDALJEN <= 0:
:...LOKOREG. <= 0: n
LOKOREG. > 0: y
A PERFECT CLASSIFICATION TREE FOR THE 'BREAST' DOMAIN, INDUCED BY C4.5.
But they matter: non-myopic feature selection, non-myopic split selection, non-myopic discretization, rules, trees, constructive induction.
Attribute Interactions – p.11/17
Classification Performance
‘adult’ Base False True
NBC 0.416 0.352 0.392
LR 1.562 0.418 1.564
SVM — — —
‘breast’ Base False True
NBC 0.262 0.187 0.171
LR 0.016 0.016 0.016
SVM 0.032 0.032 0.016
A wrapper algorithm detects true or false interactions with interaction gain and uses minimal-error attribute reduction to resolve them. No feature selection or parameter tuning was used.
It improves results with logistic regression, SVM, and the
naïve Bayesian classifier.
There must be enough data!
Attribute Interactions – p.12/17
Applications
Prediction:
Resolving significant interactions helps improve classification performance.

Interactions limit or prevent myopia in discretization and feature selection.

Interactions justify constructive induction.

Analysis:

Interactions are interesting, especially if unexpected: interactions between treatments, symptoms, etc.
Attribute Interactions – p.13/17
Summary of Contributions
Two kinds of interactions: true and false interactions.
Interaction gain is an interaction probe, able to detect and classify 3-way interactions.

The pragmatic interaction significance test, based on a comparison of classification performance on unseen data.

True and false interaction analysis methodology, with interaction graphs and interaction dendrograms.

Improving classification performance with interaction resolution.
Attribute Interactions – p.14/17
Further Work
A full-fledged tool for interaction analysis.
Support for numerical and ordered attributes.
Generalization to k-way interactions.
Improved methods of resolution, especially of false interactions.

Exploration of the implications of interactions for discretization, split selection, etc.
Applications.
Attribute Interactions – p.15/17
Cardinality of Attributes
[Scatter plot, Adult/Census: improvement by replacement (y-axis, -0.1 to 0.04) vs. number of joint attribute values (x-axis, 0 to 1600).]
The greater the number of values in the constituent attributes, the lower the chance that the interaction between them is significant.

Attribute Interactions – p.16/17
Attribute Reduction
[Scatter plot, Adult: improvement by replacement using the Cartesian product (y-axis) vs. improvement by replacement using minimal-error reduction (MinErr, x-axis); both axes -0.08 to 0.04.]
Minimal-error attribute reduction often yields better results
than using the non-reduced Cartesian product of attributes.
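One plausible reading of minimal-error reduction (a sketch under that assumption, not the thesis's implementation): map each joint value of the Cartesian product to the class it predicts best, so the reduced attribute has at most as many values as there are classes. On the sunburn table:

```python
from collections import Counter, defaultdict

def minimal_error_reduction(a, b, labels):
    """Replace the Cartesian product of attributes a and b with a reduced
    attribute: each joint value maps to its majority class, minimizing
    training error. A plausible sketch, not the author's exact procedure."""
    counts = defaultdict(Counter)
    for va, vb, c in zip(a, b, labels):
        counts[(va, vb)][c] += 1
    majority = {joint: ctr.most_common(1)[0][0] for joint, ctr in counts.items()}
    return [majority[(va, vb)] for va, vb in zip(a, b)]

hair   = ["blonde", "blonde", "brown", "blonde", "red", "brown", "brown", "blonde"]
lotion = ["no", "yes", "yes", "no", "no", "no", "no", "yes"]
result = ["burn", "tan", "tan", "burn", "burn", "tan", "tan", "tan"]

reduced = minimal_error_reduction(hair, lotion, result)
print(sorted(set(reduced)))  # the 5 joint values collapse to 2 groups
```

Because the table is consistent, the reduced attribute here coincides with the label; on noisy data it would merely be the lowest-error grouping of the joint values.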
Attribute Interactions – p.17/17