


Learning the Semantics of Discrete Random Variables: Ordinal or Categorical?

José Miguel Hernández-Lobato 1,3,*, James Robert Lloyd 3,*, Daniel Hernández-Lobato 2 and Zoubin Ghahramani 3

1 Harvard University. 2 Universidad Autónoma de Madrid. 3 Cambridge University. * Equal contributors.

Neural Information Processing Systems Foundation

1. Introduction

Motivation: When specifying a probabilistic model of data, the form of the model will typically depend on the spaces in which random variables take their values.

Problem: Automatic data analysis techniques must identify the type of data without supervision. It is not trivial to distinguish between categorical and ordinal data. Furthermore, inferring the ordering of the labels in the case of ordinal data is difficult.

[Figure: example draws and empirical densities for four data types.
- Continuous data: −0.90, 0.18, 1.59, −1.13, −0.08, ...
- Count data: 12, 10, 5, 7, 12, 11, 4, 8, 11, 4, ...
- Categorical data: 2, 2, 2, 4, 1, 4, 2, 4, 2, 2, ...
- Ordinal data: 3, 3, 3, 3, 1, 3, 3, 3, 3, 1, ...
Categorical labels: 1 "Physics", 2 "Statistics", 3 "Algebra", 4 "Calculus", 5 "Operating Systems".
Ordinal labels: 1 "very low", 2 "low", 3 "medium", 4 "high", 5 "very high".]

Solution: We present some first attempts at this problem by fitting ordinal regression and multi-class classification models and then evaluating their quality of fit. Our ordinal regression models can learn the true ordering in ordinal data.

4. Multi-class Classification for Categorical Data

We have that $y_i = \arg\max_{k \in \mathcal{L}} f_k(x_i)$, where $f_{v_1}, \dots, f_{v_L}$ are latent functions sampled from a GP. Define $\mathbf{f} = (f_{v_1}(x_1), f_{v_1}(x_2), \dots, f_{v_1}(x_n), \dots, f_{v_L}(x_1), f_{v_L}(x_2), \dots, f_{v_L}(x_n))^\mathsf{T}$. The likelihood of $\mathbf{f}$ given $\mathbf{y} = (y_1, \dots, y_n)^\mathsf{T}$ and $\mathbf{X} = (x_1, \dots, x_n)^\mathsf{T}$ is

$$p(\mathbf{y}|\mathbf{X}, \mathbf{f}) = \prod_{i=1}^{n} \prod_{k \neq y_i} \Theta\left(f_{y_i}(x_i) - f_k(x_i)\right).$$

Define $\mathbf{f}_{v_l} = (f_{v_l}(x_1), \dots, f_{v_l}(x_n))^\mathsf{T}$. The prior for $\mathbf{f}$ is $p(\mathbf{f}) = \prod_{l=1}^{L} \mathcal{N}(\mathbf{f}_{v_l}|\mathbf{0}, \mathbf{K}_{v_l})$.

EP approximates the posterior $p(\mathbf{f}|\mathcal{D})$ as $q(\mathbf{f})$. The predictive distribution for $y_\star$ is

$$p(y_\star|x_\star, \mathcal{D}) \approx \int p(y_\star|\mathbf{f}_\star)\, p(\mathbf{f}_\star|\mathbf{f})\, q(\mathbf{f})\, d\mathbf{f}\,,$$

where $\mathbf{f}_\star = (f_{v_1}(x_\star), \dots, f_{v_L}(x_\star))^\mathsf{T}$ and $p(y_\star|\mathbf{f}_\star) = \prod_{k \neq y_\star} \Theta\left(f_{y_\star}(x_\star) - f_k(x_\star)\right)$. This can be computed by solving a one-dimensional numerical integral.
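As an illustrative sketch (not the authors' code), the hard argmax likelihood above can be evaluated for a finite set of latent values; the function name and array layout are our own:

```python
import numpy as np

def multiclass_likelihood(F, y):
    """Evaluate p(y | X, f) = prod_i prod_{k != y_i} Theta(f_{y_i}(x_i) - f_k(x_i)).

    F : (n, L) array of latent function values, F[i, k] = f_k(x_i).
    y : (n,) array of labels in {0, ..., L-1}.
    The product of Heaviside steps equals 1 exactly when, for every i,
    the latent value of the observed label is the row maximum.
    """
    n = F.shape[0]
    f_y = F[np.arange(n), y]                 # f_{y_i}(x_i) for each i
    return float(np.all(f_y[:, None] >= F))  # 1.0 if consistent, else 0.0
```

For example, with `F = [[2, 1, 0], [0, 3, 1]]`, the likelihood of `y = [0, 1]` is 1.0 (each label indexes its row maximum), while `y = [2, 1]` gives 0.0.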

[Figure: synthetic multi-class data on (x1, x2) with labels 1-5, together with 3D surface plots of two of the latent functions, f_1(x) and f_5(x).]

6. Results for Identifying the Type of Data

We compare OR-L and OR-SE with a multi-class classifier (MC) based on linear (MC-L) and squared exponential (MC-SE) covariance functions.

We consider the ordinal regression tasks and four additional multi-class datasets: Glass, Iris, New Thyroid, and Wine with 6, 3, 2 and 3 class labels, respectively.

Table: Avg. Test LL, Ordinal Tasks.

Method   OR-L    OR-SE   MC-L    MC-SE
Auto    -0.679  -0.726  -0.874  -0.706
Boston  -0.901  -0.795  -0.957  -0.856
Fires   -1.044  -1.070  -1.050  -1.084
Yacht   -0.181  -0.180  -0.897  -0.207
Wins     4 (OR)          0 (MC)

Table: Avg. Test LL, Multi-class Tasks.

Method   OR-L    OR-SE   MC-L    MC-SE
Glass   -0.224  -0.133  -1.264  -0.096
Iris    -0.079  -0.092  -0.331  -0.112
Thyroid -0.065  -0.077  -0.187  -0.066
Wine    -0.205  -0.113  -0.076  -0.103
Wins     1.5 (OR)        2.5 (MC)
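The decision rule implied by these tables can be sketched as follows; the helper name is ours, and the values in the usage example are taken from the Auto and Glass rows above:

```python
def identify_data_type(avg_ll_ordinal, avg_ll_multiclass):
    """Infer the semantics of a discrete variable by comparing the
    held-out average test log-likelihood of the best-fitting ordinal
    regression model against the best-fitting multi-class model."""
    return "ordinal" if avg_ll_ordinal > avg_ll_multiclass else "categorical"

# Auto MPG: best OR model (-0.679) vs best MC model (-0.706)
print(identify_data_type(-0.679, -0.706))  # ordinal
# Glass: best OR model (-0.133) vs best MC model (-0.096)
print(identify_data_type(-0.133, -0.096))  # categorical
```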

2. Ordinal Regression when the Ordering is Known

Assume a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$, where $y_i \in \mathcal{L} = \{1, \dots, L\}$ and $\sigma$ is a permutation of $1, \dots, L$ such that $\sigma(1), \dots, \sigma(L)$ is correctly ordered.

A sample $f$ from a Gaussian process (GP) maps the $x_i$ to the real line, which is split into $L$ contiguous intervals with boundaries $b_0 < \dots < b_L$, $b_0 = -\infty$ and $b_L = \infty$.

Let $f_i = f(x_i)$. The likelihood for $f_i$ and $\mathbf{b} = (b_1, \dots, b_{L-1})$ given $y_i$ is then

$$p(y_i|f_i, \mathbf{b}, \sigma) = \prod_{l=1}^{L-1} \Theta\left[\mathrm{sign}(\sigma(y_i) - l - 0.5)(f_i - b_l)\right], \qquad (1)$$

where $\Theta$ is the Heaviside step function, $p(\mathbf{b}) = \prod_{l=1}^{L-1} \mathcal{N}(b_l|m_l^0, v_l^0)$ and $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{m}, \mathbf{K})$.

Expectation propagation (EP) approximates the exact posterior $p(\mathbf{f}, \mathbf{b}|\mathcal{D})$ as $q(\mathbf{f}, \mathbf{b})$. The predictive distribution for the label $y_\star$ of a new vector $x_\star$ is approximated as

$$p(y_\star|x_\star, \mathcal{D}) = \int p(y_\star|f_\star, \mathbf{b})\, p(f_\star|\mathbf{f})\, p(\mathbf{f}, \mathbf{b}|\mathcal{D})\, d\mathbf{f}\, d\mathbf{b} \approx \int p(y_\star|f_\star, \mathbf{b})\, p(f_\star|\mathbf{f})\, q(\mathbf{f}, \mathbf{b})\, d\mathbf{f}\, d\mathbf{b}\,, \qquad (2)$$

where $f_\star = f(x_\star)$ and $p(f_\star|\mathbf{f})$ is the GP predictive distribution for $f_\star$ given $\mathbf{f}$.
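A minimal sketch of the likelihood in Eq. (1), assuming labels in {1, ..., L} and a permutation supplied as a mapping from label to rank (the function and argument names are ours):

```python
import numpy as np

def ordinal_likelihood(f_i, b, y_i, sigma):
    """Evaluate Eq. (1): p(y_i | f_i, b, sigma) = prod_{l=1}^{L-1}
    Theta[sign(sigma(y_i) - l - 0.5) (f_i - b_l)].

    f_i   : latent value f(x_i).
    b     : interior boundaries (b_1, ..., b_{L-1}), sorted ascending.
    y_i   : observed label in {1, ..., L}.
    sigma : dict mapping each label to its position in the ordering.

    The product is 1 exactly when f_i falls in the interval whose index
    matches sigma(y_i), i.e. b_{r-1} < f_i < b_r with r = sigma(y_i).
    """
    rank = sigma[y_i]
    val = 1.0
    for l, b_l in enumerate(b, start=1):  # l = 1, ..., L-1
        s = np.sign(rank - l - 0.5)       # +1 if rank > l, else -1
        val *= 1.0 if s * (f_i - b_l) > 0 else 0.0
    return val
```

With boundaries `b = [-1, 1]` (so L = 3) and the identity ordering, `f_i = 0` is consistent with label 2, while `f_i = 2` is consistent only with label 3.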

[Figure: synthetic ordinal data on (x1, x2) with labels 1-5, a 3D surface plot of the latent GP function f(x), and the learned interval boundaries on the real line assigning intervals to labels 1-5.]

3. A Search Algorithm for Finding the True Ordering

The EP approximation of the model evidence can be maximized with respect to the hyper-parameters. Let $\tilde{z}(\sigma)$ be the value of the maximized approximation given $\sigma$. We can infer $\sigma$ by further optimizing $\tilde{z}(\sigma)$ as follows:

Require: Dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$ with $y_i \in \mathcal{L} = \{v_1, \dots, v_L\}$.
 1: Select $\sigma$ uniformly at random and compute $\tilde{z}(\sigma)$.
 2: Generate the set $P$ of all 2-element subsets of $\{1, \dots, L\}$.
 3: finished <- False.
 4: while not finished do
 5:   finished <- True.
 6:   for every subset $\{i, j\}$ in $P$ do
 7:     Generate $\sigma_{i,j}$ by swapping the elements $i$ and $j$ in $\sigma$.
 8:     Compute $\tilde{z}(\sigma_{i,j})$.
 9:   end for
10:   Find indices $\{k, l\}$ such that $\tilde{z}(\sigma_{k,l}) \geq \tilde{z}(\sigma_{i,j})$ for all $i, j$.
11:   if $\tilde{z}(\sigma_{k,l}) > \tilde{z}(\sigma)$ then
12:     finished <- False, $\sigma \leftarrow \sigma_{k,l}$, $\tilde{z}(\sigma) \leftarrow \tilde{z}(\sigma_{k,l})$.
13:   end if
14: end while
15: return $\sigma$
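The greedy swap search above can be sketched in Python. Here `evidence` stands in for the EP evidence approximation $\tilde{z}(\sigma)$ with tuned hyper-parameters; it is a user-supplied callable and the function name is our own:

```python
import itertools
import random

def greedy_order_search(labels, evidence, seed=0):
    """Greedy pairwise-swap search for the ordering sigma that maximizes
    an (approximate) model evidence z(sigma).

    labels   : the L label values to order.
    evidence : callable mapping a tuple ordering to a float score.
    """
    rng = random.Random(seed)
    sigma = list(labels)
    rng.shuffle(sigma)                          # step 1: random initial ordering
    best = evidence(tuple(sigma))
    pairs = list(itertools.combinations(range(len(sigma)), 2))
    improved = True
    while improved:                             # steps 4-14
        improved = False
        cands = []
        for i, j in pairs:                      # score every pairwise swap
            s = sigma[:]
            s[i], s[j] = s[j], s[i]
            cands.append((evidence(tuple(s)), s))
        top, s_top = max(cands, key=lambda t: t[0])
        if top > best:                          # accept the best improving swap
            best, sigma, improved = top, s_top, True
    return sigma, best
```

As a toy check, scoring an ordering by its number of concordant pairs with the natural order makes the search recover the identity permutation.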


5. Results for Learning the Ordering

Regression problems from the UCI repository. The target variable is discretized using equal-probability binning: Boston Housing, Forest Fires, Auto MPG and Yacht.

We fix L = 5, except in Forest Fires, where we fix L = 3. The accuracy of each method is computed in terms of the absolute value of Kendall's tau correlation coefficient between the true ranking of the labels and the ranking discovered by our algorithm.

We use linear (OR-L) and squared exponential (OR-SE) covariance functions.

Table: Average Kendall's tau.

Method   Auto    Boston  Fires   Yacht
OR-L     1.000   1.000   0.333   1.000
OR-SE    0.840   0.968   0.427   1.000
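A minimal sketch of this evaluation pipeline, using quantile-based binning and a naive O(n^2) Kendall's tau; both helper names are ours and the exact binning details of the paper are not specified here:

```python
import numpy as np

def equal_probability_bins(t, L):
    """Discretize a continuous target into L labels of (roughly)
    equal frequency, using empirical quantiles as bin edges."""
    edges = np.quantile(t, np.linspace(0, 1, L + 1)[1:-1])
    return np.digitize(t, edges) + 1   # labels in 1..L

def abs_kendall_tau(a, b):
    """|Kendall's tau| between two rankings (no ties assumed)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    conc = sum((a[i] - a[j]) * (b[i] - b[j]) > 0 for i, j in pairs)
    return abs(2.0 * conc - len(pairs)) / len(pairs)

y = equal_probability_bins(np.arange(100.0), L=5)          # 20 points per label
print(abs_kendall_tau([1, 2, 3, 4, 5], [1, 2, 3, 5, 4]))   # 0.8
```

A single adjacent transposition among 5 labels flips 1 of the 10 pairs, giving |tau| = (9 - 1)/10 = 0.8.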

7. Conclusions

- We have focused on distinguishing categorical data from ordinal data.
- Our solution works by evaluating the fit of ordinal and multi-class models.
- We can find the label ranking using a search procedure.
- Linear models correctly identify the true ranking most of the time, while non-linear models are less accurate.
- The test log-likelihood can be used to correctly identify the data type.

http://jmhl.org/ [email protected]