PhD Completion Seminar
Posted on 12-Feb-2017
TRANSCRIPT
Background · Ranking Dependencies in Noisy Data · A Framework for Adjusting Dependency Measures · Adjustments for Clustering Comparison Measures · Conclusions
Simone Romano’s PhD Completion Seminar
Design and Adjustment of Dependency Measures Between Variables
November 30th 2015
Supervisor: Prof. James Bailey
Co-Supervisor: A/Prof. Karin Verspoor
Computing and Information Systems (CIS)
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
Background: Examples of Applications; Categories of Dependency Measures; Thesis Motivation
Ranking Dependencies in Noisy Data: Motivation; Design of the Randomized Information Coefficient (RIC); Comparison Against Other Measures
A Framework for Adjusting Dependency Measures: Motivation; Adjustment for Quantification; Adjustment for Ranking
Adjustments for Clustering Comparison Measures: Motivation; Detailed Analysis of Contingency Tables; Application Scenarios
Conclusions
Examples of Applications
Dependency Measures
A dependency measure D is used to assess the amount of dependency between variables.
Example 1: After collecting weight and height for many people, we can compute D(weight, height).
Example 2: Assess the amount of dependency between search queries in Google:
https://www.google.com/trends/correlate/
Dependency measures are fundamental to a number of applications in machine learning and data mining.
Applications of Dependency Measures
Supervised learning
- Feature selection [Guyon and Elisseeff, 2003]
- Decision tree induction [Criminisi et al., 2012]
- Evaluation of classification accuracy [Witten et al., 2011]
Unsupervised learning
- External clustering validation [Strehl and Ghosh, 2003]
- Generation of alternative or multi-view clusterings [Muller et al., 2013, Dang and Bailey, 2015]
- Exploration of the clustering space using results from the Meta-Clustering algorithm [Caruana et al., 2006, Lei et al., 2014]
Exploratory analysis
- Inference of biological networks [Reshef et al., 2011, Villaverde et al., 2013]
- Analysis of neural time-series data [Cohen, 2014]
Application example (1): feature selection / decision tree induction
Application: Identify whether the class C is dependent on a feature F.
Toy Example: Is the class C = cancer dependent on the feature F = smoker, according to this data set of 20 patients?
Use of dependency measure: Compute D(F,C).
Data sample (excerpt):

Smoker  Cancer
No      -
Yes     +
Yes     +
Yes     -
No      +
No      -
Yes     +
...     ...
Yes     +

A contingency table is a useful tool: it counts the co-occurrences of feature values and class values.

             +    -   total
Smoker       6    2     8
Non smoker   4    8    12
total       10   10    20

⇒ if the feature is dependent on the class, induce a split in the decision tree on smoker = yes/no.
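As a minimal sketch (not code from the seminar), one way to compute a dependency measure D(F,C) from such a contingency table is the plug-in mutual information; the helper name is ours:

```python
import numpy as np

def mutual_information(table):
    """Mutual information (in nats) from a contingency table of co-occurrence counts."""
    p = np.asarray(table, dtype=float)
    p /= p.sum()                        # joint probabilities
    px = p.sum(axis=1, keepdims=True)   # row marginals (feature values)
    py = p.sum(axis=0, keepdims=True)   # column marginals (class values)
    mask = p > 0                        # convention: 0 * log 0 = 0
    return float((p[mask] * np.log(p[mask] / (px * py)[mask])).sum())

# Toy contingency table: rows = smoker / non smoker, columns = + / -
D = mutual_information([[6, 2],
                        [4, 8]])        # > 0 ⇒ some dependency in this sample
```

A value of exactly 0 would indicate that the empirical joint distribution factorizes, i.e. no dependency in the sample.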
Application example (2): external clustering validation
Application: Compare a clustering solution B to a reference clustering A.
Toy Example: N = 15 data points
reference clustering A with 2 clusters, stars and circles
clustering solution B with 2 clusters, red and blue
Use of dependency measure: Compute D(A,B)
Once again, the contingency table is a useful tool that assesses the amount of overlap between A and B:
                 B
           red  blue  total
A  stars    4     4      8
   circles  2     5      7
   total    6     9     15
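A sketch of one pair-counting comparison measure on this table, the Rand index (the helper name is ours, not the seminar's):

```python
from math import comb

def rand_index(table):
    """Rand index of two clusterings from their contingency table (pair-counting)."""
    n = sum(map(sum, table))
    pairs_both = sum(comb(nij, 2) for row in table for nij in row)  # together in A and B
    pairs_a = sum(comb(sum(row), 2) for row in table)               # together in A
    pairs_b = sum(comb(sum(col), 2) for col in zip(*table))         # together in B
    total = comb(n, 2)
    # agreements = pairs grouped together in both + pairs separated in both
    return (total + 2 * pairs_both - pairs_a - pairs_b) / total

ri = rand_index([[4, 4],
                 [2, 5]])   # N = 15 points, as in the toy example
```

The index is 1 for identical clusterings and approaches its minimum when the two clusterings disagree on most pairs.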
Application example (3): genetic network inference
Application: Identify if the gene G1 is interacting with the gene G2
Toy Example: We have a time series of values for each G1 and G2:
Use of dependency measure: Compute D(G1,G2)
time   G1        G2
t1     20.4400   19.7450
t2     19.0750   20.3300
t3     20.0650   20.1700
...    ...       ...
[Figure: time series of G1 and G2 over time t = 0 to 140]
Here there is no contingency table because the variables are numerical.
[Figure: scatter plot of G1 (18 to 26) vs. G2 (19 to 25)]
Categories of Dependency measures
Categories of Dependency Measures
Dependency measures can be divided into two categories: measures between categorical variables and measures between numerical variables.
Between Categorical Variables
These measures can be computed naturally on a contingency table. For example on:
[The two contingency tables above: the decision-tree split on smoker vs. cancer, and the clustering comparison of A vs. B]
- Information theoretic [Cover and Thomas, 2012]: e.g. mutual information (a.k.a. information gain)
- Based on pair-counting [Albatineh et al., 2006]: e.g. Rand index, Jaccard similarity
- Based on set-matching [Meila, 2007]: e.g. classification accuracy, agreement between annotators
- Others, mostly employed as splitting criteria [Kononenko, 1995]: e.g. Gini gain, Chi-square
Between Numerical Variables
No contingency table. For example:
Biological interaction: [scatter plot of G1 vs. G2, as above]
- Estimators of mutual information [Khan et al., 2007]: e.g. kNN estimator, kernel estimator, estimators based on grids
- Correlation based: e.g. Pearson's correlation, distance correlation [Szekely et al., 2009], randomized dependence coefficient [Lopez-Paz et al., 2013]
- Kernel based: e.g. Hilbert-Schmidt Independence Criterion [Gretton et al., 2005]
- Based on information theory: e.g. the Maximal Information Coefficient (MIC) [Reshef et al., 2011], the mutual information dimension [Sugiyama and Borgwardt, 2013], the total information coefficient [Reshef et al., 2015]
Thesis Motivation
Even if a dependency measure D has nice theoretical properties, in practice dependencies are estimated with D on finite data.
The following goals of dependency measures are challenging:
Detection: Test for the presence of dependency. E.g. test dependence between two genes (Example 3).
Quantification: Summarize the amount of dependency in an interpretable fashion. E.g. assess the amount of overlap between two clusterings (Example 2).
Ranking: Sort the relationships of different variables. E.g. rank many features in decision trees (Example 1).
To improve performance on the three goals above, we need information on the distribution of D.
For example, when ranking noisy relationships
The distribution of D(X,Y) when the relationship between X and Y is noisy should not overlap with the distribution of D(X,Y) on a noiseless relationship:
Motivation
Mutual information I(X,Y) is good for ranking relationships with different levels of noise between variables:
high I ⇒ little noise; small I ⇒ much noise
It can also be computed between sets of variables, e.g. I(X,Y) = I({X1,X2},Y) = I({weight, height}, BMI).
Mutual Information quantifies the information shared between two variables
$$ MI(X,Y) = \int_{-\infty}^{+\infty}\!\int_{-\infty}^{+\infty} f_{X,Y}(x,y)\,\log\frac{f_{X,Y}(x,y)}{f_X(x)\,f_Y(y)}\;dx\,dy $$
Importance of MI: It is based on well-established theory and quantifies non-linear interactions which might be missed if, e.g., Pearson's correlation coefficient r(X,Y) is used.
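A quick sketch of that point (ours, not the seminar's): Pearson's r can be near zero on a perfectly deterministic, non-linear relationship.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 10_000)
y = x ** 2                       # Y is a deterministic function of X

# The linear measure misses the dependency: r is close to 0
r = np.corrcoef(x, y)[0, 1]
```

Mutual information, by contrast, would be large here because knowing X fully determines Y.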
Estimation of Mutual Information
Many estimators of mutual information:
Acronym   Type                              Sets of vars.   Complexity
Iew       Discretization (equal width)      no              O(n^1.5)
Ief       Discretization (equal frequency)  no              O(n^1.5)
IA        Adaptive partitioning             no              O(n^1.5)
Imean     Mean nearest neighbours           yes             O(n^2)
IKDE      Kernel density estimation         yes             O(n^2)
IkNN      Nearest neighbours                yes             O(n^1.5) best, O(n^2) worst
Discretization-based estimators of mutual information exhibit good complexity, but are not applicable to sets of variables.
Discretization-based estimators use fixed grids and compute mutual information on a contingency table.
[Figure: scatter plot of X vs. Y with a fixed grid overlaid]
For example, Iew discretizes using equal-width binning.
                 Discretized X
                 b1   ···  bj   ···  bc
Discretized Y
  a1             n11  ···  ···  ···  n1c
  ···
  ai             ···  ···  nij  ···  ···
  ···
  ar             nr1  ···  ···  ···  nrc
n_ij counts the number of points in a particular bin. Mutual information can then be computed as:

$$ I_{ew}(X,Y) = \sum_{i=1}^{r}\sum_{j=1}^{c}\frac{n_{ij}}{N}\,\log\frac{n_{ij}\,N}{a_i\,b_j} $$

where a_i and b_j are the row and column marginal counts of the contingency table.
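A sketch of this estimator under the definitions above (the bin count and helper name are our choices):

```python
import numpy as np

def I_ew(x, y, bins=8):
    """Equal-width grid estimate of mutual information (in nats)."""
    n_ij, _, _ = np.histogram2d(x, y, bins=bins)   # contingency table n_ij
    N = n_ij.sum()
    a = n_ij.sum(axis=1, keepdims=True)            # row marginals a_i
    b = n_ij.sum(axis=0, keepdims=True)            # column marginals b_j
    mask = n_ij > 0                                # skip empty cells (0 log 0 = 0)
    return float((n_ij[mask] / N * np.log(n_ij[mask] * N / (a * b)[mask])).sum())

rng = np.random.default_rng(1)
x = rng.normal(size=5_000)
weak = I_ew(x, rng.normal(size=5_000))              # independent ⇒ near 0
strong = I_ew(x, x + 0.1 * rng.normal(size=5_000))  # near-noiseless linear ⇒ large
```

Note that `weak` is slightly above 0: this is the systematic grid-dependent bias criticized on the next slide.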
Criticism
The discretization approach is less popular for numerical variables because:
- There is a systematic estimation bias which depends on the grid size.
However, when comparing dependencies, systematic estimation biases cancel each other out [Kraskov et al., 2004, Margolin et al., 2006, Schaffernicht et al., 2010].
Thus, discretization is not bad for comparing/ranking relationships!
Simone Romano University of Melbourne
Design and Adjustment of Dependency Measures Between Variables
Background Ranking Dependencies in Noisy Data A Framework for Adjusting Dependency Measures Adjustments for Clustering Comparison Measures Conclusions
Motivation
Comparing relationships / comparing estimations of I
Task: Given a strong relationship s and a weak relationship w, compare the estimates Îs and Îw of the true values Is and Iw.
- Systematic biases cancel out when comparing relationships: they translate the distributions of the estimates by a fixed amount.
- It is beneficial to reduce the variance.
Challenge: Decreasing the variance of the estimation
Design of the Randomized Information Coefficient (RIC)
Randomized Information Coefficient (RIC)
Idea:
- Generate many random grids with different cardinalities using random cut-offs
- Estimate the normalized mutual information on each of them (normalized because of the different cardinalities)
- Average the estimates
[Figure: three different random grids over the same scatter plot of X vs. Y; the normalized mutual information computed on each grid is then averaged]
Parameters:
- Kr: tunes the number of random grids
- Dmax: tunes the maximum grid cardinality generated
Features:
- Proved to decrease the variance, as in random forests [Geurts, 2002]
- Still good complexity, O(n^1.5)
- Easy to extend to sets of variables
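The idea can be sketched as follows; this is an illustrative reconstruction, and the exact normalization and parameter conventions of the thesis may differ:

```python
import numpy as np

def nmi_on_grid(x, y, x_edges, y_edges):
    """Normalized mutual information of the contingency table induced by one grid."""
    n_ij, _, _ = np.histogram2d(x, y, bins=(x_edges, y_edges))
    p = n_ij / n_ij.sum()
    px, py = p.sum(axis=1), p.sum(axis=0)
    mask = p > 0
    mi = (p[mask] * np.log(p[mask] / np.outer(px, py)[mask])).sum()
    hx = -(px[px > 0] * np.log(px[px > 0])).sum()
    hy = -(py[py > 0] * np.log(py[py > 0])).sum()
    return mi / max(np.sqrt(hx * hy), 1e-12)       # one possible normalization

def ric(x, y, Kr=50, Dmax=10, seed=0):
    """Average normalized MI over Kr random grids built from random cut-offs."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(Kr):
        cx, cy = rng.integers(2, Dmax + 1, size=2)  # random grid cardinalities
        x_edges = np.sort(np.r_[x.min(), rng.uniform(x.min(), x.max(), cx - 1), x.max()])
        y_edges = np.sort(np.r_[y.min(), rng.uniform(y.min(), y.max(), cy - 1), y.max()])
        scores.append(nmi_on_grid(x, y, x_edges, y_edges))
    return float(np.mean(scores))
```

Averaging over many independently drawn grids is what reduces the estimation variance, in direct analogy with averaging trees in a random forest.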
Random discretization of sets of variables
Relationship between Y and X = {X1,X2}
[Figure: Y plotted against X = {X1, X2}; the relationship is also shown against X′ = (X1 + X2)/2]
Need to randomly discretize X ⇒ just choose some random seeds:
[Figure: random seed points partitioning the (X1, X2) space into discretization cells]
Detection of Relationship
Task: Using a permutation test, identify whether a relationship exists:
- Generate 500 values of RIC under complete noise
- Sort the values and identify the value x of RIC at position 500 × 95% = 475
- Generate 500 values of RIC under a particular relationship
- Count how many values are greater than x
⇒ the bigger the count, the higher the power of RIC
[Figure: the benchmark relationship types (linear, quadratic, cubic, sinusoidal with low/high/varying frequency, 4th root, circle, step function, two lines, X, circle-bar) at noise levels 1, 6, 11, 16, 21, 26]
Tested on many relationships and levels of noise
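The four-step procedure above can be sketched generically; this is our illustration with |Pearson's r| standing in for RIC, and the noise model is an assumption:

```python
import numpy as np

def power(score, x, y_pattern, noise=0.3, n_sim=500, alpha=0.05, seed=0):
    """Power of a dependency measure, following the permutation procedure above."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: score under complete noise (permuted y), take the 95th percentile
    null = np.sort([score(x, rng.permutation(y_pattern)) for _ in range(n_sim)])
    threshold = null[int(n_sim * (1 - alpha)) - 1]     # position 500 * 95% = 475
    # Steps 3-4: score under the noisy relationship, count exceedances
    hits = sum(score(x, y_pattern + rng.normal(scale=noise, size=len(x)))
               > threshold for _ in range(n_sim))
    return hits / n_sim

abs_r = lambda a, b: abs(np.corrcoef(a, b)[0, 1])      # stand-in for RIC
rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 200)
linear_power = power(abs_r, x, x)          # high: |r| easily detects a linear signal
quadratic_power = power(abs_r, x, x ** 2)  # low: |r| has little power here
```

The contrast between the two calls is exactly why measures like RIC are benchmarked across many relationship shapes, not just linear ones.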
Power as the number of random grids increases
Kr controls the number of random grids
[Plot: area under the power curve (0 to 1) vs. the parameter Kr (50 to 200); RIC's optimum is at Kr = 200]
Figure: Average power for each relationship; every line is a relationship.
More random grids ⇒ less estimation variance ⇒ more power
Comparison Against Other Measures
Comparison with Other Measures
Extensively compared with other measures on the task of relationship detection
[Bar chart: average rank by power across relationships, left to right: RIC, TICe, IKDE, dCorr, HSIC, RDC, MIC, IkNN, Ief, GMIC, Iew, r2, IA, ACE, Imean, MID]
Figure: Average rank across relationships (e.g. a measure ranks 1st when its power is highest on a relationship).
Comparison - Biological Network Inference
Reverse engineering of networks of genes when the ground truth is known
[Bar chart: average rank by mean average precision across networks, left to right: RIC, dCorr, IKDE, IkNN, HSIC, ACE, r2, GMIC, Ief, IA, RDC, Iew, Imean, MIC, MID]
Figure: Average rank across networks (e.g. a measure ranks 1st when its average precision is highest on a network).
Also compared on:
- Feature filtering for regression
- Feature selection for regression
RIC shows competitive performance
Conclusion - Message
We proposed the Randomized Information Coefficient (RIC):
- It reduces the variance of grid-based normalized mutual information when comparing relationships
- It randomly discretizes multiple variables
Take-away message:
- There are different ways to generate random grids (random cut-offs / random seeds)
- The larger the number of grids, the smaller the variance
The Randomized Information Coefficient: Ranking Dependencies in Noisy Data. Simone Romano, James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Under review at the Machine Learning Journal.
Hypothesis so far...
So far we compared numerical variables on samples of fixed size n
Dependency measures might have biases if they:
- Compare samples with different n
- Compare categorical variables
Need for adjustment in these cases
Motivation for Adjustment For QuantificationPearson’s correlation between two variables X and Y estimated on a data sampleSn = {(xk , yk )} of n data points:
r(Sn|X ,Y ) ,
∑nk=1(xk − x)(yk − y)√∑n
k=1(xk − x)2∑n
k=1(yk − y)2(1)
[Figure: example scatter plots with correlation values 1, 0.8, 0.4, 0, -0.4, -0.8, -1 (first row); 1, 1, 1, -1, -1, -1 (second row); and 0 (third row)]
Figure: From https://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
r²(S_n|X,Y) can be used as a proxy for the amount of noise in linear relationships:
- 1 if noiseless
- 0 if complete noise
The Maximal Information Coefficient (MIC) was published in Science [Reshef et al., 2011] and has 499 citations to date according to Google Scholar.
MIC(X,Y) can be used as a proxy for the amount of noise in functional relationships:
Figure: From the online supplementary material of [Reshef et al., 2011]
MIC should be equal to:
- 1 if the relationship between X and Y is functional and noiseless
- 0 if there is complete noise
Challenge
Nonetheless, its estimation is challenging on a finite data sample Sn of n data points.
We simulate 10,000 fully noisy relationships between X and Y on 20 and 80 data points:
[Figure: distributions of MIC(S_20|X,Y) and MIC(S_80|X,Y) under complete noise, with values ranging from 0.2 towards 1]
Values can be high by chance! The user expects values close to 0 in both cases.
Challenge: Adjust the estimated MIC to better exploit the range [0, 1]
Adjustment for Quantification
Adjustment for Chance
- We define a framework for adjustment:

Adjustment for Quantification:
$$ AD \triangleq \frac{D - E[D_0]}{\max D - E[D_0]} $$

- It uses the distribution D_0 of the measure under independent variables:
  - r²_0: Beta distribution
  - MIC_0: can be computed using Monte Carlo permutations
This type of adjustment is used in κ-statistics. Its application is beneficial to other dependency measures:
- Adjusted r² ⇒ Ar²
- Adjusted MIC ⇒ AMIC
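The adjustment can be sketched with E[D0] estimated by Monte Carlo permutations (the helper names are ours):

```python
import numpy as np

def adjusted(D, x, y, D_max=1.0, n_perm=200, seed=0):
    """AD = (D - E[D0]) / (max D - E[D0]), with E[D0] from Monte Carlo permutations."""
    rng = np.random.default_rng(seed)
    # Permuting y simulates the null distribution D0 under independence
    e_d0 = np.mean([D(x, rng.permutation(y)) for _ in range(n_perm)])
    return (D(x, y) - e_d0) / (D_max - e_d0)

r2 = lambda a, b: np.corrcoef(a, b)[0, 1] ** 2    # a measure with max D = 1

rng = np.random.default_rng(2)
x, y = rng.normal(size=30), rng.normal(size=30)   # complete noise, small n
raw = r2(x, y)                  # inflated above 0 by chance
ar2 = adjusted(r2, x, y)        # centred near 0 under independence
```

Subtracting the baseline E[D0] is what makes the adjusted measure read as 0 on average under complete noise, regardless of sample size.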
Adjusted measures enable better interpretability
Task: Obtain 1 for a noiseless relationship and 0 for complete noise (on average).
Noise:   0%    20%    40%    60%    80%    100%
r²:      1     0.66   0.39   0.2    0.073  0.035
Ar²:     1     0.65   0.37   0.17   0.044  0.00046

Figure: Ar² becomes zero on average at 100% noise
Noise:   0%    20%    40%    60%    80%    100%
MIC:     1     0.7    0.47   0.34   0.27   0.26
AMIC:    1     0.6    0.29   0.11   0.021  0.0014

Figure: AMIC becomes zero on average at 100% noise
Not biased towards small sample size n
Average value of D for different % of noise ⇒ estimates can be high because of chance at small n (e.g. because of missing values)
[Figure: average value vs. noise level (0 to 100%) for Raw r² (n = 10, 20, 30, 40, 100, 200) and Raw MIC (n = 20, 40, 60, 80), and for the adjusted Ar² and AMIC at the same sample sizes; the raw measures are inflated at small n, while the adjusted measures are not]
Adjustment for Ranking
Motivation for Adjustment for Ranking
Say that we want to predict the risk of cancer C using equally unpredictive variables X1 and X2, defined as follows:
- X1 ≡ patient had breakfast today, X1 = {yes, no}
- X2 ≡ patient eye color, X2 = {green, blue, brown}
[Figure: candidate splits on X1 = yes/no and on X2 = green/blue/brown]
Problem: When ranking variables, dependency measures are biased towards the selection of variables with many categories.
This still happens because of finite samples!
Selection bias experiment
Experiment: n = 100 data points; class C with 2 categories.
- Generate a variable X1 with 2 categories (independently from C)
- Generate a variable X2 with 3 categories (independently from C)
- Compute Gini(X1,C) and Gini(X2,C); give a win to the variable that gets the highest value
- REPEAT 10,000 times
[Bar chart: probability of selection; X2 ≈ 0.7, X1 ≈ 0.3]
Result: X2 gets selected 70% of the time (Bad). Given that they are equally unpredictive, we expected 50%.
Challenge: adjust the estimated Gini gain to obtain unbiased rankings
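The experiment can be reproduced with a short simulation (a sketch of ours, with 2,000 repetitions instead of 10,000 to keep it quick):

```python
import numpy as np

def gini_gain(x, c):
    """Gini gain of class labels c from splitting on categorical feature x."""
    def gini(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - (p ** 2).sum()
    gain = gini(c)
    for v in np.unique(x):
        gain -= (x == v).mean() * gini(c[x == v])   # weighted child impurity
    return gain

rng = np.random.default_rng(0)
n, trials = 100, 2_000
wins_x2 = 0
for _ in range(trials):
    c = rng.integers(0, 2, n)     # class C, 2 categories
    x1 = rng.integers(0, 2, n)    # 2 categories, independent of C
    x2 = rng.integers(0, 3, n)    # 3 categories, independent of C
    wins_x2 += gini_gain(x2, c) > gini_gain(x1, c)
selection_rate = wins_x2 / trials  # well above the 50% one would expect
```

Both features are pure noise, yet the 3-category feature wins most comparisons: exactly the finite-sample selection bias the adjustment targets.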
Adjustment for Ranking
We propose two adjustments for ranking:

Standardization:

    SD ≜ (D − E[D0]) / √Var(D0)

Quantifies statistical significance, like a p-value.

Adjustment for Ranking:

    AD(α) ≜ D − q0(1 − α)

Penalizes on statistical significance according to α, where q0 is the quantile of the null distribution D0 (small α, more penalization).
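A minimal sketch of both adjustments, estimating the null distribution D0 by Monte Carlo permutations (the permutation approach and helper names here are assumptions for illustration; the thesis derives some of these quantities analytically):

```python
import numpy as np

def null_distribution(measure, x, y, n_perm=1000, rng=None):
    """Sample the null D_0 by permuting x, which destroys any dependency with y."""
    rng = rng or np.random.default_rng(0)
    return np.array([measure(rng.permutation(x), y) for _ in range(n_perm)])

def standardized(measure, x, y, **kw):
    """SD = (D - E[D0]) / sqrt(Var(D0))."""
    d0 = null_distribution(measure, x, y, **kw)
    return (measure(x, y) - d0.mean()) / d0.std()

def adjusted(measure, x, y, alpha=0.05, **kw):
    """AD(alpha) = D - q0(1 - alpha), with q0 the null quantile function."""
    d0 = null_distribution(measure, x, y, **kw)
    return measure(x, y) - np.quantile(d0, 1.0 - alpha)

# Toy usage with a simple dependency measure (absolute Pearson correlation):
x = np.arange(50, dtype=float)
y = x.copy()
measure = lambda a, b: abs(np.corrcoef(a, b)[0, 1])
print(standardized(measure, x, y, n_perm=200))  # large: highly significant
print(adjusted(measure, x, y, n_perm=200))      # stays close to 1 after penalization
```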
Adjustment for Ranking
Standardized Gini (SGini) corrects for Selection bias
Select unpredictive features X1 with 2 categories and X2 with 3 categories.

[Figure: probability of selection for X1 vs. X2]

Experiment: X1 and X2 each get selected on average almost 50% of the time (Good).

Being similar to a p-value, this is consistent with the literature on decision trees [Frank and Witten, 1998, Dobra and Gehrke, 2001, Hothorn et al., 2006, Strobl et al., 2007].
Nonetheless, we found that this is a simplistic scenario.
Adjustment for Ranking
Standardized Gini (SGini) might be biased
Fix the predictiveness of features X1 and X2 to a constant ≠ 0.

[Figure: probability of selection for X1 vs. X2]

Experiment: SGini becomes biased towards X1 because it is more statistically significant (Bad).

This behavior has been overlooked in the decision tree community.
Use AD(α) to penalize less, or even to tune the bias! ⇒ AGini(α)
Adjustment for Ranking
Application to random forest
Why random forest? It is a good classifier to try first when there are "meaningful" features [Fernandez-Delgado et al., 2014].

Plug in different splitting criteria.
Experiment: 19 data sets with categorical variables.

[Figure: mean AUC (90–91.5) of random forests using AGini(α), SGini, and Gini as splitting criteria, for α in 0–0.8; the same α is used for all data sets]
And α can be tuned for each data set with cross-validation.
Adjustment for Ranking
Conclusion - Message
Dependency estimates can be high because of chance under finite samples.
Adjustments can help for:
Quantification, to have an interpretable value in [0, 1]
Ranking, to avoid biases towards:
I missing values
I categorical variables with more categories
A Framework to Adjust Dependency Measure Estimates for Chance, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. Under submission in SIAM International Conference on Data Mining 2016 (SDM-16). Arxiv: http://arxiv.org/abs/1510.07786
Motivation
Clustering Validation
Given a reference clustering V, we want to validate the clustering solution U (blue/red)
⇒ we need dependency measures
There are two very popular measures based on adjustments:
The Adjusted Rand Index (ARI)[Hubert and Arabie, 1985]
∼ 3000 citations
The Adjusted Mutual Information (AMI)[Vinh et al., 2009]
∼ 200 citations
No clear connection between them; users use them both.
Motivation
Both computed on a contingency table
Notation: Contingency table M
ai = Σj nij are the row marginals and bj = Σi nij are the column marginals.

            V
        b1 · · · bj · · · bc
    a1  n11 · · ·       · n1c
U    ·
    ai        · nij ·
     ·
    ar  nr1 · · ·       · nrc

ARI - Adjustment of the Rand Index (RI), based on counting pairs of objects:

    ARI = (RI − E[RI]) / (max RI − E[RI])

AMI - Adjustment of Mutual Information (MI), based on information theory:

    AMI = (MI − E[MI]) / (max MI − E[MI])
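For concreteness, the pair-counting ARI can be computed directly from a contingency table. This is a minimal sketch of the Hubert and Arabie adjusted form in its sum-of-pairs expression, not the thesis implementation:

```python
from math import comb
import numpy as np

def ari(M):
    """Adjusted Rand Index from a contingency table M = [n_ij]."""
    M = np.asarray(M)
    n = int(M.sum())
    sum_ij = sum(comb(int(x), 2) for x in M.ravel())     # pairs together in both
    sum_a = sum(comb(int(a), 2) for a in M.sum(axis=1))  # row-marginal pairs
    sum_b = sum(comb(int(b), 2) for b in M.sum(axis=0))  # column-marginal pairs
    expected = sum_a * sum_b / comb(n, 2)                # E[RI] term
    max_index = (sum_a + sum_b) / 2                      # max RI term
    return (sum_ij - expected) / (max_index - expected)

print(ari([[10, 0], [0, 10]]))  # 1.0 for perfect agreement
```

Under the permutation model, the index is 0 in expectation for independent clusterings and 1 only for identical ones.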
Motivation
Link: generalized information theory
Generalized information theory is based on the Tsallis q-entropy:

    Hq(V) ≜ (1 / (q − 1)) (1 − Σj (bj / N)^q)

which generalizes Shannon's entropy:

    lim_{q→1} Hq(V) = H(V) ≜ −Σj (bj / N) log (bj / N)

Link between measures: the Mutual Information (MIq) based on Tsallis q-entropy links RI and MI:

    MIq=2 ∝ RI        lim_{q→1} MIq = MI

Challenge: Compute E[MIq] to connect ARI and AMI
Challenge 2.0: Compute Var(MIq) for standardization
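The limit above can be checked numerically with a small sketch (natural logarithm assumed for the Shannon entropy):

```python
import numpy as np

def tsallis_entropy(b, q):
    """Hq(V) = (1 - sum_j (b_j/N)^q) / (q - 1) for cluster sizes b."""
    p = np.asarray(b, dtype=float) / np.sum(b)
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

def shannon_entropy(b):
    """H(V) = -sum_j (b_j/N) log(b_j/N), natural log."""
    p = np.asarray(b, dtype=float) / np.sum(b)
    return -np.sum(p * np.log(p))

b = [10, 10, 10, 70]                 # cluster sizes b_j, N = 100
print(tsallis_entropy(b, 2.0))       # ≈ 0.48
print(tsallis_entropy(b, 1.0001))    # approaches the Shannon value below
print(shannon_entropy(b))
```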
Motivation
We propose techniques applicable to a broader class of measures. We can do:

I Exact computation of measures in Lφ, where S ∈ Lφ is a linear function of the entries of the contingency table:

    S = α + β Σij φij(nij)

  (α and β are constants)

I Asymptotic approximation of measures in Nφ (non-linear)

[Figure: families of measures we can adjust, including the Rand Index (RI), MI, Jaccard (J), generalized information-theoretic measures, VI, and NMI]
Detailed Analysis of Contingency Tables
Exact Expected Value by Permutation Model
E[S] is obtained by summation over all possible contingency tables M obtained by permutations:

    E[S] = Σ_M S(M) P(M) = α + β Σ_M Σij φij(nij) P(M)

I There is no method to exhaustively generate all M with fixed marginals
I Extremely time expensive (permutations: O(N!))

However, it is possible to swap the inner summation with the outer summation:

    Σ_M Σij φij(nij) P(M)  =  Σij Σ_{nij} φij(nij) P(nij)
        (to swap)                  (swapped)

I nij has a known hypergeometric distribution
I Computation time is dramatically reduced! ⇒ O(max{rN, cN})
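The collapsed summation can be sketched directly: a naive implementation that sums φ over the hypergeometric distribution of each nij (for illustration only; it does not reproduce the optimized O(max{rN, cN}) algorithm):

```python
from math import comb

def expected_phi_sum(a, b, phi):
    """E[sum_ij phi(n_ij)] under the permutation model, using the fact that
    each n_ij is hypergeometric with parameters (N, a_i, b_j)."""
    N = sum(a)
    assert sum(b) == N
    total = 0.0
    for ai in a:
        denom = comb(N, ai)
        for bj in b:
            # P(n_ij = k) = C(b_j, k) C(N - b_j, a_i - k) / C(N, a_i)
            for k in range(max(0, ai + bj - N), min(ai, bj) + 1):
                total += phi(k) * comb(bj, k) * comb(N - bj, ai - k) / denom
    return total

# Sanity check: with phi(k) = k, sum_ij n_ij = N for every table, so E[S] = N.
print(expected_phi_sum([3, 7], [4, 6], lambda k: k))  # ≈ 10.0
```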
Detailed Analysis of Contingency Tables
Exact Variance Computation
We have to compute the second moment E[S²], which requires:

    Σ_M ( Σ_{i=1}^r Σ_{j=1}^c φij(nij) )² P(M)
      = Σ_M Σ_{i,j,i′,j′} φij(nij) · φi′j′(ni′j′) P(M)                        (to swap)
      = Σ_{i,j,i′,j′} Σ_{nij} Σ_{ni′j′} φij(nij) · φi′j′(ni′j′) P(nij, ni′j′)   (swapped)

Contribution: the computation of P(nij, ni′j′) is technically challenging. We use the hypergeometric model: draws from an urn with N marbles of 3 colors: red, blue, and white.
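As a sanity check on these moments, E[S] and Var(S) can also be estimated by brute-force Monte Carlo over random permutations (a sketch with an assumed simple φ, not the exact method from the slide):

```python
import numpy as np

def mc_moments(a, b, phi, n_perm=2000, rng=None):
    """Monte Carlo estimate of E[S] and Var(S) for S = sum_ij phi(n_ij),
    with contingency tables drawn under the permutation model."""
    rng = rng or np.random.default_rng(0)
    u = np.repeat(np.arange(len(a)), a)   # fixed labeling with row marginals a
    v = np.repeat(np.arange(len(b)), b)   # labeling with column marginals b
    vals = []
    for _ in range(n_perm):
        M = np.zeros((len(a), len(b)), dtype=int)
        np.add.at(M, (u, rng.permutation(v)), 1)   # random contingency table
        vals.append(sum(phi(int(k)) for k in M.ravel()))
    vals = np.array(vals)
    return vals.mean(), vals.var()

# With phi(k) = k, S = N for every table, so the variance must be exactly 0.
print(mc_moments([3, 7], [4, 6], lambda k: k))     # (10.0, 0.0)
print(mc_moments([3, 7], [4, 6], lambda k: k * k)) # non-trivial variance
```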
Detailed Analysis of Contingency Tables
Finally, we can define adjustments…

Definition: Adjusted Mutual Information q (AMIq)

    AMI2 = ARI        lim_{q→1} AMIq = AMI

We can finally relate ARI and AMI to generalized information theory!

Also define: a generalized Standardized Mutual Information q (SMIq) for selection bias.

Their complexities:

    Name    Computational complexity
    AMI     O(max{rN, cN})
    SMI     O(max{rcN³, c²N³})

Table: Complexity when comparing two clusterings: N objects; r, c numbers of clusters
Application Scenarios
Application Scenarios
Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2.

Example: Do you prefer U1 or U2?

U1 (row marginals on the left; the columns of V have sizes 10, 10, 10, 70):

     8 |  8  0  0  0
     7 |  0  7  0  0
     7 |  0  0  7  0
    78 |  2  3  3 70

AMI chooses this one because of the many 0's.

U2 (the columns of V have sizes 10, 10, 10, 70):

    10 |  7  1  1  1
    10 |  1  7  1  1
    10 |  1  1  7  1
    70 |  1  1  1 67

ARI chooses this one.

When there are small clusters in V, use AMI because it likes 0's.
Application Scenarios
Equal sized clusters...
Task: Clustering validation. Given a reference clustering V, choose the best clustering solution among U1 and U2.

Example: Do you prefer U1 or U2?

U1 (row marginals on the left; the columns of V have sizes 25, 25, 25, 25):

    17 | 17  0  0  0
    17 |  0 17  0  0
    17 |  0  0 17  0
    49 |  8  8  8 25

AMI chooses this one because of the many 0's.

U2 (the columns of V have sizes 25, 25, 25, 25):

    24 | 20  2  1  1
    25 |  2 20  2  1
    23 |  1  1 20  1
    28 |  2  2  2 22

ARI chooses this one.

When there are big equal-sized clusters in V, use ARI because 0's are misleading.
Application Scenarios
SMIq can be used to correct selection bias
Reference clustering with 4 clusters, and solutions U with different numbers of clusters.

[Figure: probability of selection (q = 1.001) vs. the number of sets r in U (2–10), for SMIq, AMIq, and NMIq]
Application Scenarios
Correct for selection bias with SMIq for any q
Reference clustering with 4 clusters, and solutions U with different numbers of clusters.

[Figure: probability of selection (q = 2) vs. the number of sets r in U (2–10), for SMIq, AMIq, and NMIq]
Application Scenarios
Conclusion - Message
We computed generalized information theoretic measures to propose AMIq and SMIq to:
I identify the application scenarios of ARI and AMI
I correct for selection bias
Take away message:
I Use AMI when the reference is unbalanced and has small clusters
I Use ARI when the reference has big equal sized clusters
I Use SMIq to correct for selection bias
Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance, Simone Romano, James Bailey, Nguyen Xuan Vinh, and Karin Verspoor. Published in Proceedings of the 31st International Conference on Machine Learning 2014, pp. 1143–1151 (ICML-14).

Adjusting for Chance Clustering Comparison Measures, Simone Romano, Nguyen Xuan Vinh, James Bailey, and Karin Verspoor. To submit to the Journal of Machine Learning Research.
Summary
Studying the distribution of the estimates D, we:
I Designed RIC
I Adjusted for quantification
I Adjusted for ranking
These results can aid detection, quantification, and ranking of relationships as follows
Detection: RIC can be used to detect relationships between continuous variables because it has high power.
Quantification: Adjustment for quantification can be used to obtain a more interpretable range of values, e.g. AMIC and AMIq.
Ranking: Adjustment for ranking can be used to correct for biases towards variables with missing values or variables with many categories, e.g. AGini(α) for random forests.
Future Work
I Dependency measure estimates can obtain high values because of chance also when they are computed on different numbers of dimensions ⇒ study adjustments that are unbiased towards different dimensionality
I Adjustment via permutations is slow ⇒ compute more analytical adjustments, e.g. for MIC
I The random seeds discretization technique for RIC might have problems with high dimensionality ⇒ generate random seeds in random subspaces ⇒ study multivariable discretization using random trees
I Inject randomness in other estimators of mutual information ⇒ e.g. choose different random kernel widths for the KDE estimator
Papers
S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "Adjusting for Chance Clustering Comparison Measures". To submit to the Journal of Machine Learning Research.

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "A Framework to Adjust Dependency Measure Estimates for Chance". Under submission in SIAM International Conference on Data Mining 2016 (SDM-16).

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "The Randomized Information Coefficient: Ranking Dependencies in Noisy Data". Under review in the Machine Learning Journal.

S. Romano, J. Bailey, N. X. Vinh, and K. Verspoor, "Standardized Mutual Information for Clustering Comparisons: One Step Further in Adjustment for Chance". Published in Proceedings of the 31st International Conference on Machine Learning 2014, pp. 1143–1151 (ICML-14).

Collaborations:

Y. Lei, J. C. Bezdek, N. X. Vinh, J. Chan, S. Romano, and J. Bailey, "Extending information theoretic validity indices for fuzzy clusterings". Submitted to the Transactions on Fuzzy Systems Journal.

N. X. Vinh, J. Chan, S. Romano, J. Bailey, C. Leckie, K. Ramamohanarao, and J. Pei, "Discovering outlying aspects in large datasets". Submitted to the Data Mining and Knowledge Discovery Journal.

N. X. Vinh, J. Chan, S. Romano, and J. Bailey, "Effective global approaches for mutual information based feature selection". Published in Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2014, pp. 512–521.

Y. Lei, J. C. Bezdek, J. Chan, N. X. Vinh, S. Romano, and J. Bailey, "Generalized information theoretic cluster validity indices for soft clusterings". Published in Proceedings of Computational Intelligence and Data Mining (CIDM), 2014, pp. 24–31.
Thank You All
In particular
My supervisors: James Bailey, Karin Verspoor, and Vinh Nguyen
Committee Chair: Tim Baldwin
My fellow PhD students
Questions?
Code available online:
https://github.com/ialuronico
References I
Albatineh, A. N., Niewiadomska-Bugaj, M., and Mihalko, D. (2006). On similarity indices and correction for chance agreement. Journal of Classification, 23(2):301–313.

Caruana, R., Elhawary, M., Nguyen, N., and Smith, C. (2006). Meta clustering. In Data Mining, 2006. ICDM'06. Sixth International Conference on, pages 107–118. IEEE.

Cohen, M. X. (2014). Analyzing neural time series data: theory and practice. MIT Press.

Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.

Criminisi, A., Shotton, J., and Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2-3):81–227.

Dang, X. H. and Bailey, J. (2015). A framework to uncover multiple alternative clusterings. Machine Learning, 98(1-2):7–30.
References II
Dobra, A. and Gehrke, J. (2001). Bias correction in classification tree construction. In ICML, pages 90–97.

Fernandez-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? The Journal of Machine Learning Research, 15(1):3133–3181.

Frank, E. and Witten, I. H. (1998). Using a permutation test for attribute selection in decision trees. In ICML, pages 152–160.

Geurts, P. (2002). Bias/Variance Tradeoff and Time Series Classification. PhD thesis, Département d'Électricité, Électronique et Informatique, Institut Montefiore, Université de Liège.

Gretton, A., Bousquet, O., Smola, A., and Scholkopf, B. (2005). Measuring statistical dependence with Hilbert-Schmidt norms. In Algorithmic Learning Theory, pages 63–77. Springer.

Guyon, I. and Elisseeff, A. (2003). An introduction to variable and feature selection. The Journal of Machine Learning Research, 3:1157–1182.
References III
Hothorn, T., Hornik, K., and Zeileis, A. (2006). Unbiased recursive partitioning: A conditional inference framework. Journal of Computational and Graphical Statistics, 15(3):651–674.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2:193–218.

Khan, S., Bandyopadhyay, S., Ganguly, A. R., Saigal, S., Erickson III, D. J., Protopopescu, V., and Ostrouchov, G. (2007). Relative performance of mutual information estimation methods for quantifying the dependence among short and noisy data. Physical Review E, 76(2):026209.

Kononenko, I. (1995). On biases in estimating multi-valued attributes. In International Joint Conferences on Artificial Intelligence, pages 1034–1040.

Kraskov, A., Stogbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical Review E, 69(6):066138.

Lei, Y., Vinh, N. X., Chan, J., and Bailey, J. (2014). Filta: Better view discovery from collections of clusterings via filtering. In Machine Learning and Knowledge Discovery in Databases, pages 145–160. Springer.
References IV
Lopez-Paz, D., Hennig, P., and Scholkopf, B. (2013). The randomized dependence coefficient. In Advances in Neural Information Processing Systems, pages 1–9.

Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Favera, R. D., and Califano, A. (2006). ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7(Suppl 1):S7.

Meila, M. (2007). Comparing clusterings—an information based distance. Journal of Multivariate Analysis, 98(5):873–895.

Muller, E., Gunnemann, S., Farber, I., and Seidl, T. (2013). Discovering multiple clustering solutions: Grouping objects in different views of the data. Tutorial at ICML.

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., Lander, E. S., Mitzenmacher, M., and Sabeti, P. C. (2011). Detecting novel associations in large data sets. Science, 334(6062):1518–1524.

Reshef, Y. A., Reshef, D. N., Finucane, H. K., Sabeti, P. C., and Mitzenmacher, M. M. (2015). Measuring dependence powerfully and equitably. arXiv preprint arXiv:1505.02213.
References V
Schaffernicht, E., Kaltenhaeuser, R., Verma, S. S., and Gross, H.-M. (2010). On estimating mutual information for feature selection. In Artificial Neural Networks ICANN 2010, pages 362–367. Springer.

Strehl, A. and Ghosh, J. (2003). Cluster ensembles—a knowledge reuse framework for combining multiple partitions. The Journal of Machine Learning Research, 3:583–617.

Strobl, C., Boulesteix, A.-L., and Augustin, T. (2007). Unbiased split selection for classification trees based on the Gini index. Computational Statistics & Data Analysis, 52(1):483–501.

Sugiyama, M. and Borgwardt, K. M. (2013). Measuring statistical dependence via the mutual information dimension. In Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence, pages 1692–1698. AAAI Press.

Szekely, G. J., Rizzo, M. L., et al. (2009). Brownian distance covariance. The Annals of Applied Statistics, 3(4):1236–1265.

Villaverde, A. F., Ross, J., and Banga, J. R. (2013). Reverse engineering cellular networks with information theoretic methods. Cells, 2(2):306–329.
References VI
Vinh, N. X., Epps, J., and Bailey, J. (2009). Information theoretic measures for clusterings comparison: is a correction for chance necessary? In ICML, pages 1073–1080. ACM.

Witten, I. H., Frank, E., and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques. 3rd edition.