


Learning the Semantics of Discrete Random Variables: Ordinal or Categorical?

José Miguel Hernández-Lobato 1,3,*, James Robert Lloyd 3,*, Daniel Hernández-Lobato 2 and Zoubin Ghahramani 3

1 Harvard University. 2 Universidad Autónoma de Madrid. 3 Cambridge University. * Equal contributors.

Neural Information Processing Systems Foundation

1. Introduction

Motivation: When specifying a probabilistic model of data, the form of the model will typically depend on the spaces in which random variables take their values.

Problem: Automatic data analysis techniques must identify the type of data without supervision. It is not trivial to distinguish between categorical and ordinal data. Furthermore, inferring the ordering of the labels in the case of ordinal data is difficult.

[Figure: example draws and empirical densities for four data types.
- Continuous data: −0.90, 0.18, 1.59, −1.13, −0.08, ...
- Count data: 12, 10, 5, 7, 12, 11, 4, 8, 11, 4, ...
- Categorical data: 2, 2, 2, 4, 1, 4, 2, 4, 2, 2, ...
- Ordinal data: 3, 3, 3, 3, 1, 3, 3, 3, 3, 1, ...
Categorical labels: 1 "Physics", 2 "Statistics", 3 "Algebra", 4 "Calculus", 5 "Operating Systems".
Ordinal labels: 1 "very low", 2 "low", 3 "medium", 4 "high", 5 "very high".]

Solution: We present some first attempts at this problem by fitting ordinal regression and multi-class classification models and then evaluating their quality of fit. Our ordinal regression models can learn the true ordering in ordinal data.

4. Multi-class Classification for Categorical Data

We have that $y_i = \arg\max_{k \in \mathcal{L}} f_k(x_i)$, where $f_{v_1}, \dots, f_{v_L}$ are latent functions sampled from a GP. Define $\mathbf{f} = (f_{v_1}(x_1), f_{v_1}(x_2), \dots, f_{v_1}(x_n), \dots, f_{v_L}(x_1), f_{v_L}(x_2), \dots, f_{v_L}(x_n))^\mathsf{T}$. The likelihood of $\mathbf{f}$ given $\mathbf{y} = (y_1, \dots, y_n)^\mathsf{T}$ and $\mathbf{X} = (x_1, \dots, x_n)^\mathsf{T}$ is

$$p(\mathbf{y}|\mathbf{X}, \mathbf{f}) = \prod_{i=1}^{n} \prod_{k \neq y_i} \Theta\left(f_{y_i}(x_i) - f_k(x_i)\right).$$

Define $\mathbf{f}_{v_l} = (f_{v_l}(x_1), \dots, f_{v_l}(x_n))^\mathsf{T}$. The prior for $\mathbf{f}$ is $p(\mathbf{f}) = \prod_{l=1}^{L} \mathcal{N}(\mathbf{f}_{v_l}|\mathbf{0}, \mathbf{K}_{v_l})$.

EP approximates the posterior $p(\mathbf{f}|\mathcal{D})$ as $q(\mathbf{f})$. The predictive distribution for $y_\star$ is

$$p(y_\star|x_\star, \mathcal{D}) \approx \int p(y_\star|\mathbf{f}_\star)\, p(\mathbf{f}_\star|\mathbf{f})\, q(\mathbf{f})\, d\mathbf{f}\,,$$

where $\mathbf{f}_\star = (f_{v_1}(x_\star), \dots, f_{v_L}(x_\star))^\mathsf{T}$ and $p(y_\star|\mathbf{f}_\star) = \prod_{k \neq y_\star} \Theta\left(f_{y_\star}(x_\star) - f_k(x_\star)\right)$. This can be computed by solving a one-dimensional numerical integral.
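As an illustrative sketch (not the authors' code), the hard argmax likelihood above can be evaluated for a finite set of latent values; the function name and array layout are our own:

```python
import numpy as np

def multiclass_likelihood(F, y):
    """Evaluate p(y | X, f) = prod_i prod_{k != y_i} Theta(f_{y_i}(x_i) - f_k(x_i)).

    F : (n, L) array of latent function values, F[i, k] = f_k(x_i).
    y : (n,) array of labels in {0, ..., L-1}.
    The product of Heaviside steps equals 1 exactly when, for every i,
    the latent value of the observed label is the row maximum.
    """
    n = F.shape[0]
    f_y = F[np.arange(n), y]                 # f_{y_i}(x_i) for each i
    return float(np.all(f_y[:, None] >= F))  # 1.0 if consistent, else 0.0
```

For example, with `F = [[2, 1, 0], [0, 3, 1]]`, the likelihood of `y = [0, 1]` is 1.0 (each label indexes its row maximum), while `y = [2, 1]` gives 0.0.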

[Figure: synthetic multi-class data on (x1, x2) with labels 1-5, together with 3D surface plots of two of the latent functions, f_1(x) and f_5(x).]

6. Results for Identifying the Type of Data

We compare OR-L and OR-SE with a multi-class classifier (MC) based on linear (MC-L) and squared exponential (MC-SE) covariance functions.

We consider the ordinal regression tasks and four additional multi-class datasets: Glass, Iris, New Thyroid, and Wine with 6, 3, 2 and 3 class labels, respectively.

Table: Avg. Test LL, Ordinal Tasks.

Method   OR-L    OR-SE   MC-L    MC-SE
Auto    -0.679  -0.726  -0.874  -0.706
Boston  -0.901  -0.795  -0.957  -0.856
Fires   -1.044  -1.070  -1.050  -1.084
Yacht   -0.181  -0.180  -0.897  -0.207
Wins     4 (OR)          0 (MC)

Table: Avg. Test LL, Multi-class Tasks.

Method   OR-L    OR-SE   MC-L    MC-SE
Glass   -0.224  -0.133  -1.264  -0.096
Iris    -0.079  -0.092  -0.331  -0.112
Thyroid -0.065  -0.077  -0.187  -0.066
Wine    -0.205  -0.113  -0.076  -0.103
Wins     1.5 (OR)        2.5 (MC)
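The decision rule implied by these tables can be sketched as follows; the helper name is ours, and the values in the usage example are taken from the Auto and Glass rows above:

```python
def identify_data_type(avg_ll_ordinal, avg_ll_multiclass):
    """Infer the semantics of a discrete variable by comparing the
    held-out average test log-likelihood of the best-fitting ordinal
    regression model against the best-fitting multi-class model."""
    return "ordinal" if avg_ll_ordinal > avg_ll_multiclass else "categorical"

# Auto MPG: best OR model (-0.679) vs best MC model (-0.706)
print(identify_data_type(-0.679, -0.706))  # ordinal
# Glass: best OR model (-0.133) vs best MC model (-0.096)
print(identify_data_type(-0.133, -0.096))  # categorical
```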

2. Ordinal Regression when the Ordering is Known

Assume a dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$, where $y_i \in \mathcal{L} = \{1, \dots, L\}$ and $\sigma$ is a permutation of $1, \dots, L$ such that $\sigma(1), \dots, \sigma(L)$ is correctly ordered.

A sample $f$ from a Gaussian process (GP) maps the $x_i$ to the real line, which is split into $L$ contiguous intervals with boundaries $b_0 < \dots < b_L$, $b_0 = -\infty$ and $b_L = \infty$.

Let $f_i = f(x_i)$. The likelihood for $f_i$ and $\mathbf{b} = (b_1, \dots, b_{L-1})$ given $y_i$ is then

$$p(y_i|f_i, \mathbf{b}, \sigma) = \prod_{l=1}^{L-1} \Theta\left[\mathrm{sign}(\sigma(y_i) - l - 0.5)(f_i - b_l)\right], \qquad (1)$$

where $\Theta$ is the Heaviside step function, $p(\mathbf{b}) = \prod_{l=1}^{L-1} \mathcal{N}(b_l|m_l^0, v_l^0)$ and $p(\mathbf{f}) = \mathcal{N}(\mathbf{f}|\mathbf{m}, \mathbf{K})$.

Expectation propagation (EP) approximates the exact posterior $p(\mathbf{f}, \mathbf{b}|\mathcal{D})$ as $q(\mathbf{f}, \mathbf{b})$. The predictive distribution for the label $y_\star$ of a new vector $x_\star$ is approximated as

$$p(y_\star|x_\star, \mathcal{D}) = \int p(y_\star|f_\star, \mathbf{b})\, p(f_\star|\mathbf{f})\, p(\mathbf{f}, \mathbf{b}|\mathcal{D})\, d\mathbf{f}\, d\mathbf{b} \approx \int p(y_\star|f_\star, \mathbf{b})\, p(f_\star|\mathbf{f})\, q(\mathbf{f}, \mathbf{b})\, d\mathbf{f}\, d\mathbf{b}\,, \qquad (2)$$

where $f_\star = f(x_\star)$ and $p(f_\star|\mathbf{f})$ is the GP predictive distribution for $f_\star$ given $\mathbf{f}$.
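A minimal sketch of the likelihood in Eq. (1), assuming labels in {1, ..., L} and a permutation supplied as a mapping from label to rank (the function and argument names are ours):

```python
import numpy as np

def ordinal_likelihood(f_i, b, y_i, sigma):
    """Evaluate Eq. (1): p(y_i | f_i, b, sigma) = prod_{l=1}^{L-1}
    Theta[sign(sigma(y_i) - l - 0.5) (f_i - b_l)].

    f_i   : latent value f(x_i).
    b     : interior boundaries (b_1, ..., b_{L-1}), sorted ascending.
    y_i   : observed label in {1, ..., L}.
    sigma : dict mapping each label to its position in the ordering.

    The product is 1 exactly when f_i falls in the interval whose index
    matches sigma(y_i), i.e. b_{r-1} < f_i < b_r with r = sigma(y_i).
    """
    rank = sigma[y_i]
    val = 1.0
    for l, b_l in enumerate(b, start=1):  # l = 1, ..., L-1
        s = np.sign(rank - l - 0.5)       # +1 if rank > l, else -1
        val *= 1.0 if s * (f_i - b_l) > 0 else 0.0
    return val
```

With boundaries `b = [-1, 1]` (so L = 3) and the identity ordering, `f_i = 0` is consistent with label 2, while `f_i = 2` is consistent only with label 3.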

[Figure: synthetic ordinal data on (x1, x2) with labels 1-5, a 3D surface plot of the latent GP function f(x), and the learned interval boundaries on the real line assigning intervals to labels 1-5.]

3. A Search Algorithm for Finding the True Ordering

The EP approximation of the model evidence can be maximized with respect to the hyper-parameters. Let $\tilde{z}(\sigma)$ be the value of the maximized approximation given $\sigma$. We can infer $\sigma$ by further optimizing $\tilde{z}(\sigma)$ as follows:

Require: Dataset $\mathcal{D} = \{x_i, y_i\}_{i=1}^{n}$ with $y_i \in \mathcal{L} = \{v_1, \dots, v_L\}$.
 1: Select $\sigma$ uniformly at random and compute $\tilde{z}(\sigma)$.
 2: Generate the set $P$ of all 2-element subsets of $\{1, \dots, L\}$.
 3: finished <- False.
 4: while not finished do
 5:   finished <- True.
 6:   for every subset $\{i, j\}$ in $P$ do
 7:     Generate $\sigma_{i,j}$ by swapping the elements $i$ and $j$ in $\sigma$.
 8:     Compute $\tilde{z}(\sigma_{i,j})$.
 9:   end for
10:   Find indices $\{k, l\}$ such that $\tilde{z}(\sigma_{k,l}) \geq \tilde{z}(\sigma_{i,j})$ for all $i, j$.
11:   if $\tilde{z}(\sigma_{k,l}) > \tilde{z}(\sigma)$ then
12:     finished <- False, $\sigma \leftarrow \sigma_{k,l}$, $\tilde{z}(\sigma) \leftarrow \tilde{z}(\sigma_{k,l})$.
13:   end if
14: end while
15: return $\sigma$
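The greedy swap search above can be sketched in Python. Here `evidence` stands in for the EP evidence approximation $\tilde{z}(\sigma)$ with tuned hyper-parameters; it is a user-supplied callable and the function name is our own:

```python
import itertools
import random

def greedy_order_search(labels, evidence, seed=0):
    """Greedy pairwise-swap search for the ordering sigma that maximizes
    an (approximate) model evidence z(sigma).

    labels   : the L label values to order.
    evidence : callable mapping a tuple ordering to a float score.
    """
    rng = random.Random(seed)
    sigma = list(labels)
    rng.shuffle(sigma)                          # step 1: random initial ordering
    best = evidence(tuple(sigma))
    pairs = list(itertools.combinations(range(len(sigma)), 2))
    improved = True
    while improved:                             # steps 4-14
        improved = False
        cands = []
        for i, j in pairs:                      # score every pairwise swap
            s = sigma[:]
            s[i], s[j] = s[j], s[i]
            cands.append((evidence(tuple(s)), s))
        top, s_top = max(cands, key=lambda t: t[0])
        if top > best:                          # accept the best improving swap
            best, sigma, improved = top, s_top, True
    return sigma, best
```

As a toy check, scoring an ordering by its number of concordant pairs with the natural order makes the search recover the identity permutation.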


5. Results for Learning the Ordering

Regression problems from the UCI repository. The target variable is discretized using equal-probability binning: Boston Housing, Forest Fires, Auto MPG and Yacht.

We fix L = 5, except in Forest Fires, where we fix L = 3. The accuracy of each method is computed in terms of the absolute value of Kendall's tau correlation coefficient between the true ranking of the labels and the ranking discovered by our algorithm.

We use linear (OR-L) and squared exponential (OR-SE) covariance functions.

Table: Average Kendall's tau.

Method   Auto    Boston  Fires   Yacht
OR-L     1.000   1.000   0.333   1.000
OR-SE    0.840   0.968   0.427   1.000
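A minimal sketch of this evaluation pipeline, using quantile-based binning and a naive O(n^2) Kendall's tau; both helper names are ours and the exact binning details of the paper are not specified here:

```python
import numpy as np

def equal_probability_bins(t, L):
    """Discretize a continuous target into L labels of (roughly)
    equal frequency, using empirical quantiles as bin edges."""
    edges = np.quantile(t, np.linspace(0, 1, L + 1)[1:-1])
    return np.digitize(t, edges) + 1   # labels in 1..L

def abs_kendall_tau(a, b):
    """|Kendall's tau| between two rankings (no ties assumed)."""
    n = len(a)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    conc = sum((a[i] - a[j]) * (b[i] - b[j]) > 0 for i, j in pairs)
    return abs(2.0 * conc - len(pairs)) / len(pairs)

y = equal_probability_bins(np.arange(100.0), L=5)          # 20 points per label
print(abs_kendall_tau([1, 2, 3, 4, 5], [1, 2, 3, 5, 4]))   # 0.8
```

A single adjacent transposition among 5 labels flips 1 of the 10 pairs, giving |tau| = (9 - 1)/10 = 0.8.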

7. Conclusions

- We have focused on distinguishing categorical data from ordinal data.
- Our solution works by evaluating the fit of ordinal and multi-class models.
- We can find the label ranking using a search procedure.
- Linear models correctly identify the true ranking most of the time, while non-linear models are less accurate.
- The test log-likelihood can be used to correctly identify the data type.

http://jmhl.org/ [email protected]