580.691 Learning Theory
Reza Shadmehr
Classification via regression, Fisher linear discriminant,
Bayes classifier, confidence and error rate of the Bayes classifier
Classification via regression
• Suppose we wish to classify vector x as belonging to either class C0 or C1.
• We can approach the problem as if it were a regression problem:
$$D = \left\{ \left(\mathbf{x}^{(1)}, y^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, y^{(n)}\right) \right\}, \qquad y^{(i)} \in \{0, 1\}$$

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}, \qquad \mathbf{x} = \begin{bmatrix} 1 & x_1 & x_2 \end{bmatrix}^T$$

$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} \end{bmatrix}, \qquad \mathbf{w}_{ML} = \left( X^T X \right)^{-1} X^T \mathbf{y}$$

$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
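A minimal numpy sketch of this least-squares classifier, assuming a two-column data matrix and 0/1 labels (the toy data and variable names below are illustrative, not from the slides):

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least-squares fit: w_ML = (X^T X)^{-1} X^T y, with a prepended bias column."""
    Xa = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xa, y, rcond=None)[0]

def predict(w_ml, X):
    """Assign C1 when the predicted y-hat is >= 0.5, else C0."""
    Xa = np.column_stack([np.ones(len(X)), X])
    return (Xa @ w_ml >= 0.5).astype(int)

# toy usage: two Gaussian clouds labeled 0 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]
w_ml = fit_ls_classifier(X, y)
print("w_ML =", w_ml, " training accuracy:", np.mean(predict(w_ml, X) == y))
```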
Classification via regression
• Model:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}$$
$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
[Figure: the fitted regression surface $\hat{y}(x_1, x_2)$ plotted over the $(x_1, x_2)$ plane; the vertical axis is $y$, ranging from -0.5 to 1.5.]
The decision boundary is the line where the predicted value equals 0.5:

$$0.5 = w_0 + w_1 x_1 + w_2 x_2 \quad\Longrightarrow\quad x_2 = \frac{0.5 - w_0 - w_1 x_1}{w_2}$$
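Continuing the sketch above, the boundary line can be read off the fitted weights; a small illustration (the weight values here are made up):

```python
import numpy as np

def boundary_x2(w, x1):
    """x2 on the decision boundary, where w0 + w1*x1 + w2*x2 = 0.5."""
    w0, w1, w2 = w
    return (0.5 - w0 - w1 * x1) / w2

w = np.array([-1.0, 0.2, 0.3])          # illustrative weights, e.g. from the fit above
x1_grid = np.linspace(-4, 6, 50)
x2_grid = boundary_x2(w, x1_grid)       # points on the line x^T w = 0.5
```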
Classification via regression: concerns
• Model:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}$$
$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
• Sometimes an x can give us a $\hat{y}$ that is outside our range (outside [0, 1]).

[Figure: two example datasets in the $(x_1, x_2)$ plane, each shown with the line $\mathbf{x}^T \mathbf{w}_{ML} = 0.5$ (the decision boundary) and the line $\mathbf{x}^T \mathbf{w}_{ML} = 1$. This classification looks good; this one, not so good.]
Classification via regression: concerns
• Model:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}$$
$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
$$\varepsilon^{(n)} = y^{(n)} - \hat{y}^{(n)}$$
• Variance of the error (which is equal to the variance of y) depends on x, unlike in regression.
Since y is a random variable that can take only the values 0 or 1, the regression error will not be normally distributed.
$$E[y \mid x] = 0 \cdot P(y = 0 \mid x) + 1 \cdot P(y = 1 \mid x) = P(y = 1 \mid x)$$
$$E[y^2 \mid x] = 0^2 \cdot P(y = 0 \mid x) + 1^2 \cdot P(y = 1 \mid x) = P(y = 1 \mid x)$$
$$\mathrm{var}[y \mid x] = E[y^2 \mid x] - E[y \mid x]^2 = P(y = 1 \mid x)\left(1 - P(y = 1 \mid x)\right) = P(y = 1 \mid x)\, P(y = 0 \mid x)$$
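As a quick numerical check (an illustration, not from the slides): at the decision boundary, where $P(y = 1 \mid x) = 0.5$, the error variance is 0.25, its maximum; for a point with $P(y = 1 \mid x) = 0.9$ it drops to 0.09. The error variance therefore changes with x, which is exactly the concern raised above.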
Regression as projection
• A linear regression function projects each data point:
Each data point $\mathbf{x}^{(n)} = [x_1, x_2]^T$ is projected onto $\mathbf{w}_1$:

$$\hat{y} = w_0 + \mathbf{w}_1^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2$$
$$z^{(n)} = \mathbf{w}_1^T \mathbf{x}^{(n)}$$
For a given w1, there will be a specific distribution of the projected points z={z(1),z(2),…,z(n)}. We can study how well the projected points are distributed into classes.
[Figure: the same two-class data shown three times, each with a different choice of $\mathbf{w}_1$ and hence a different projection of the points.]
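A small numpy sketch of the projection step just described, to make the idea concrete (toy data and names are illustrative):

```python
import numpy as np

def project(X, w1):
    """Project each row of X onto the direction w1: z^(n) = w1^T x^(n)."""
    return X @ w1

# toy data: two labeled clouds, as in the scatter plots
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

for w1 in (np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)):
    z = project(X, w1)
    # compare the class-conditional distributions of the projected points
    print(w1, "class means:", z[y == 0].mean(), z[y == 1].mean(),
          "class stds:", z[y == 0].std(), z[y == 1].std())
```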
Fisher discriminant analysis
• Suppose we wish to classify vector x as belonging to either class C0 or C1.
Class y = 0: $n_0$ points, mean $\boldsymbol{\mu}_0$, variance $\Sigma_0$
Class y = 1: $n_1$ points, mean $\boldsymbol{\mu}_1$, variance $\Sigma_1$
• Class descriptions in the classification (or projected) space, i.e., the mean and variance of $\hat{y}$ for x's that belong to class 0 or class 1:
$$E[\hat{y} \mid \mathbf{x} \in C_0] = w_0 + \boldsymbol{\mu}_0^T \mathbf{w}, \qquad \mathrm{var}[\hat{y} \mid \mathbf{x} \in C_0] = \mathbf{w}^T \Sigma_0 \mathbf{w}$$
$$E[\hat{y} \mid \mathbf{x} \in C_1] = w_0 + \boldsymbol{\mu}_1^T \mathbf{w}, \qquad \mathrm{var}[\hat{y} \mid \mathbf{x} \in C_1] = \mathbf{w}^T \Sigma_1 \mathbf{w}$$

Estimated from the data:

$$\hat{\boldsymbol{\mu}}_0 = E[\mathbf{x} \mid \mathbf{x} \in C_0] = \frac{1}{n_0} \sum_{\mathbf{x}^{(i)} \in C_0} \mathbf{x}^{(i)}, \qquad \hat{\boldsymbol{\mu}}_1 = E[\mathbf{x} \mid \mathbf{x} \in C_1] = \frac{1}{n_1} \sum_{\mathbf{x}^{(i)} \in C_1} \mathbf{x}^{(i)}$$

$$\hat{\Sigma}_0 = \mathrm{var}[\mathbf{x} \mid \mathbf{x} \in C_0] = \frac{1}{n_0} \sum_{\mathbf{x}^{(i)} \in C_0} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_0\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_0\right)^T$$

$$\hat{\Sigma}_1 = \mathrm{var}[\mathbf{x} \mid \mathbf{x} \in C_1] = \frac{1}{n_1} \sum_{\mathbf{x}^{(i)} \in C_1} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right)^T$$
Fisher discriminant analysis
• Find w so that when each point is projected to the classification space, the classes are maximally separated.
$$J(\mathbf{w}) = \frac{\text{separation of projected means}}{\text{sum of within-class variances}} = \frac{\left(\hat{\boldsymbol{\mu}}_0^T \mathbf{w} - \hat{\boldsymbol{\mu}}_1^T \mathbf{w}\right)^2}{n_0\, \mathbf{w}^T \hat{\Sigma}_0 \mathbf{w} + n_1\, \mathbf{w}^T \hat{\Sigma}_1 \mathbf{w}}$$
[Figure: two projections of the same two-class data, one giving large separation of the projected classes, the other small separation.]
Fisher discriminant analysis
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} J(\mathbf{w})$$

$$J(\mathbf{w}) = \frac{\left(\hat{\boldsymbol{\mu}}_0^T \mathbf{w} - \hat{\boldsymbol{\mu}}_1^T \mathbf{w}\right)^2}{n_0\, \mathbf{w}^T \hat{\Sigma}_0 \mathbf{w} + n_1\, \mathbf{w}^T \hat{\Sigma}_1 \mathbf{w}} = \frac{\left(\mathbf{m}^T \mathbf{w}\right)^2}{\mathbf{w}^T S \mathbf{w}}, \qquad \mathbf{m} \equiv \hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1, \quad S \equiv n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1$$

S is symmetric positive definite, so we can always write it as $S = R^T R$, where R is a "square root" matrix.

Using R, change the coordinate system of J from w to v, with $\mathbf{v} = R\mathbf{w}$ (so $\mathbf{w} = R^{-1}\mathbf{v}$):

$$J(\mathbf{v}) = \frac{\left(\mathbf{m}^T R^{-1}\mathbf{v}\right)^2}{\left(R^{-1}\mathbf{v}\right)^T R^T R \left(R^{-1}\mathbf{v}\right)} = \frac{\left(\left(R^{-T}\mathbf{m}\right)^T \mathbf{v}\right)^2}{\mathbf{v}^T \mathbf{v}} = \left(\left(R^{-T}\mathbf{m}\right)^T \frac{\mathbf{v}}{\|\mathbf{v}\|}\right)^2$$
Fisher discriminant analysis
$$J(\mathbf{v}) = \left(\left(R^{-T}\mathbf{m}\right)^T \frac{\mathbf{v}}{\|\mathbf{v}\|}\right)^2 \quad \text{is maximum when} \quad \mathbf{v} = a\, R^{-T}\mathbf{m} = a\, R^{-T}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right)$$

(The dot product of a vector of norm 1 and another vector is maximum when the two have the same direction.)

Transforming back to w:

$$\mathbf{w} = R^{-1}\mathbf{v} = a\, R^{-1} R^{-T}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right) = a\, S^{-1}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right) = a\left(n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1\right)^{-1}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right)$$

where a is an arbitrary constant. The bias can then be estimated from the data:

$$w_0 = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)$$
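A compact numpy sketch of this estimator (toy data and names are illustrative). The constant a is arbitrary; here its sign is chosen so that class-1 points project to larger values, i.e., $\mathbf{w} \propto S^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_0)$:

```python
import numpy as np

def fisher_discriminant(X, y):
    """w proportional to (n0*Sigma0 + n1*Sigma1)^{-1} (mu1 - mu0), plus the bias w0."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S = len(X0) * np.cov(X0, rowvar=False, bias=True) \
      + len(X1) * np.cov(X1, rowvar=False, bias=True)
    w = np.linalg.solve(S, mu1 - mu0)    # a = 1; any nonzero a gives the same direction
    w0 = np.mean(y - X @ w)              # w0 = (1/n) sum_i (y_i - w^T x_i)
    return w0, w

# usage on balanced toy data, classifying with the same 0.5 threshold as before
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 1], 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]
w0, w = fisher_discriminant(X, y)
labels = (w0 + X @ w >= 0.5).astype(int)
print("training accuracy:", np.mean(labels == y))
```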
Bayesian classification
• Suppose we wish to classify vector x as belonging to one of the classes {1, …, L}. We are given labeled data and need to form a classification function:
$$D = \left\{ \left(\mathbf{x}^{(1)}, c^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, c^{(n)}\right) \right\}, \qquad c^{(i)} \in \{1, \ldots, L\}$$
$$\hat{c} = \hat{c}(\mathbf{x}), \qquad \hat{c} \in \{1, \ldots, L\}$$
$$\hat{c} = \arg\max_{l = 1, \ldots, L} P(c = l \mid \mathbf{x})$$

Classify x into the class l that maximizes the posterior probability:

$$P(c = l \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c = l)\, P(c = l)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid c = l)\, P(c = l)}{\sum_{l'=1}^{L} p(\mathbf{x} \mid c = l')\, P(c = l')}$$

with likelihood $p(\mathbf{x} \mid c = l)$, prior $P(c = l)$, and marginal $p(\mathbf{x})$.
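A generic numpy sketch of this arg-max rule, assuming the class-conditional likelihoods and priors are known (the two Gaussian classes below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def bayes_classify(x, likelihoods, priors):
    """Return argmax_l P(c=l | x) and the full posterior over the L classes."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    posterior = joint / joint.sum()          # divide by the marginal p(x)
    return int(np.argmax(posterior)), posterior

# usage with two made-up Gaussian classes
likelihoods = [lambda x: norm.pdf(x, loc=0.0, scale=1.0),
               lambda x: norm.pdf(x, loc=3.0, scale=1.0)]
priors = [0.5, 0.5]
label, post = bayes_classify(1.2, likelihoods, priors)
print(label, post)
```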
Classification when distributions have equal variance
• Suppose we wish to classify a person as male or female based on height.
What we have:

$$p(x \mid c = 0) = N\left(\mu_1, \sigma^2\right), \qquad p(x \mid c = 1) = N\left(\mu_2, \sigma^2\right), \qquad P(c = 1) = q$$

What we want:

$$\hat{c} = \hat{c}(x), \qquad \hat{c} = 1 \text{ if } P(c = 1 \mid x) \ge 0.5; \quad \hat{c} = 0 \text{ otherwise}$$

[Figure: the class-conditional densities $p(x \mid c = 0)$ (female) and $p(x \mid c = 1)$ (male) as functions of height x (cm). Note that the two densities have equal variance.]

Assume equal probability of being male or female: $P(c = 1) = 0.5$.

[Figure: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$ over height x (cm), together with their sum, the marginal:]

$$p(x) = \sum_{i=0}^{1} p(x \mid c = i)\, P(c = i)$$
Classification when distributions have equal variance
[Figure: left, the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$; right, the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$ as functions of height x (cm), with the decision boundary marked.]

Decision boundary: the x where $P(c = 0 \mid x) = P(c = 1 \mid x)$, equivalently where $p(x \mid c = 0)\, P(c = 0) = p(x \mid c = 1)\, P(c = 1)$.

The posterior is

$$P(c = 0 \mid x) = \frac{p(x \mid c = 0)\, P(c = 0)}{p(x)}$$

To classify, we really don't need to compute the posterior probability. All we need is the ratio

$$\frac{p(x \mid c = 0)\, P(c = 0)}{p(x \mid c = 1)\, P(c = 1)}$$

If this ratio is greater than 1, then we choose class 0, otherwise class 1. The boundaries between classes occur where the ratio is 1; in other words, the boundary occurs where the log of the ratio is 0.
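A numpy/scipy sketch of this height example. The means, shared standard deviation, and prior below are illustrative stand-ins, not values from the slides:

```python
import numpy as np
from scipy.stats import norm

# illustrative parameters: female (c=0) and male (c=1) heights in cm
mu0, mu1, sigma, q = 165.0, 180.0, 7.0, 0.5     # q = P(c=1)

def posterior_c1(x):
    """P(c=1 | x) from Bayes rule with Gaussian class-conditional densities."""
    num = norm.pdf(x, mu1, sigma) * q
    return num / (num + norm.pdf(x, mu0, sigma) * (1 - q))

def classify(x):
    return (posterior_c1(x) >= 0.5).astype(int)

# with equal variances and equal priors the boundary is the midpoint of the means
x = np.linspace(140, 200, 601)
boundary = x[np.argmin(np.abs(posterior_c1(x) - 0.5))]   # numerically ~ (mu0 + mu1) / 2
print(boundary, classify(np.array([170.0, 185.0])))
```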
Uncertainty of the classifier
Starting with our likelihood and prior:

$$p(x \mid c = 0), \qquad p(x \mid c = 1), \qquad P(c = 1) = q$$

we compute a posterior probability distribution as a function of x:

$$p(c \mid x): \qquad P(c = 1 \mid x), \quad P(c = 0 \mid x) = 1 - P(c = 1 \mid x)$$

This is a binomial distribution. We can compute the variance of this distribution:

$$E[c \mid x] = 1 \cdot P(c = 1 \mid x) + 0 \cdot P(c = 0 \mid x) = P(c = 1 \mid x)$$
$$E[c^2 \mid x] = 1^2 \cdot P(c = 1 \mid x) + 0^2 \cdot P(c = 0 \mid x) = P(c = 1 \mid x)$$
$$\mathrm{var}[c \mid x] = E[c^2 \mid x] - E[c \mid x]^2 = P(c = 1 \mid x)\, P(c = 0 \mid x)$$
[Figure: left, the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$; right, $\mathrm{var}[c \mid x]$, as functions of height x (cm).]

Classification is most uncertain at the decision boundary.
Classification when distributions have unequal variance
What we have:

$$p(x \mid c = 0) = N\left(\mu_1, \sigma_1^2\right), \qquad p(x \mid c = 1) = N\left(\mu_2, \sigma_2^2\right), \qquad P(c = 1) = q$$

Classification:

$$\hat{c} = 1 \text{ if } P(c = 1 \mid x) \ge 0.5; \quad \hat{c} = 0 \text{ otherwise}$$

Assume: $P(c = 1) = 0.5$.

[Figure: for the unequal-variance case, the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$, the marginal $p(x) = \sum_{i=0}^{1} p(x \mid c = i)\, P(c = i)$, the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$, and $\mathrm{var}[c \mid x]$, all as functions of height x (cm).]
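With unequal variances the posterior $P(c = 1 \mid x)$ can cross 0.5 at two points, so a decision region need not be a single interval. A numerical sketch (the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

mu0, s0 = 165.0, 6.0      # class 0 (illustrative)
mu1, s1 = 180.0, 12.0     # class 1, with the larger variance (illustrative)
q = 0.5                   # P(c = 1)

def posterior_c1(x):
    num = norm.pdf(x, mu1, s1) * q
    return num / (num + norm.pdf(x, mu0, s0) * (1 - q))

# find where P(c=1|x) crosses 0.5, i.e., sign changes of P(c=1|x) - 0.5
x = np.linspace(100, 250, 5001)
d = posterior_c1(x) - 0.5
crossings = x[:-1][np.sign(d[:-1]) != np.sign(d[1:])]
print("approximate decision boundaries:", crossings)   # typically two values here
```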
Bayes error rate: Probability of misclassification
[Figure: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$ over x, with the decision boundary $x^*$ dividing the axis into the decision regions $R_0$ and $R_1$.]

$$P(\text{error}) = P\left(x \in R_0 \mid c = 1\right) P(c = 1) + P\left(x \in R_1 \mid c = 0\right) P(c = 0) = \int_{R_0} p(x \mid c = 1)\, P(c = 1)\, dx + \int_{R_1} p(x \mid c = 0)\, P(c = 0)\, dx$$

The first integral is the probability of data belonging to $c_1$ but being classified as $c_0$; the second is the probability of data belonging to $c_0$ but being classified as $c_1$.

In general, it is actually quite hard to compute P(error) because we will need to integrate the posterior probabilities over decision regions that may be discontinuous (for example, when the distributions have unequal variances). To help with this, there is the Chernoff bound.
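Before turning to the bound, the error rate of a one-dimensional example can also be checked by direct numerical integration (the parameters below are illustrative; on a grid, the decision regions are simply the points where each class wins):

```python
import numpy as np
from scipy.stats import norm

mu0, s0, mu1, s1, q = 165.0, 6.0, 180.0, 12.0, 0.5    # illustrative parameters

x = np.linspace(100, 260, 20001)
g0 = norm.pdf(x, mu0, s0) * (1 - q)    # p(x|c=0) P(c=0)
g1 = norm.pdf(x, mu1, s1) * q          # p(x|c=1) P(c=1)

# R1 is where class 1 wins (g1 > g0), R0 is the rest; add the misclassified mass of each region
in_R1 = g1 > g0
p_error = np.sum(np.where(in_R1, g0, g1)) * (x[1] - x[0])
print("Bayes error rate ≈", p_error)
```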
Bayes error rate: Chernoff bound
In the two-class classification problem, we note that the classification error depends on the area under the minimum of the two posterior probabilities.
$$P(\text{error} \mid x) = \min\left[ P(c = 0 \mid x),\, P(c = 1 \mid x) \right]$$
$$P(\text{error}, x) = P(\text{error} \mid x)\, p(x) = \min\left[ p(x \mid c = 0)\, P(c = 0),\, p(x \mid c = 1)\, P(c = 1) \right]$$
$$P(\text{error}) = \int P(\text{error}, x)\, dx$$

[Figure: the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$ over x, with $P(\text{error} \mid x)$ shown as their pointwise minimum.]
To compute the minimum, we will need the following inequality:
$$\min[a, b] \le a^{\beta}\, b^{1-\beta}, \qquad a, b \ge 0, \quad 0 \le \beta \le 1$$
To help figure out this inequality, we note that:

$$a^{\beta}\, b^{1-\beta} = \left(\frac{a}{b}\right)^{\beta} b$$

Without loss of generality, suppose that b is smaller than a. Then a/b > 1, and we have:

$$\left(\frac{a}{b}\right)^{\beta} b \ge b = \min[a, b]$$

So we can think of the term $a^{\beta} b^{1-\beta}$ (for all values of $\beta$) as an upper bound on $\min[a, b]$. Returning to our P(error) problem, we can replace the min[·] function with our inequality:

$$P(\text{error}) \le P(c = 0)^{\beta}\, P(c = 1)^{1-\beta} \int p(x \mid c = 0)^{\beta}\, p(x \mid c = 1)^{1-\beta}\, dx$$
The bound is found by numerically finding the value of $\beta$ that minimizes the above expression. The key benefit here is that our search is in the one-dimensional space of $\beta$, and we also got rid of the discontinuous decision regions.
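A sketch of that one-dimensional search over $\beta$, assuming Gaussian class-conditional densities with illustrative parameters and evaluating the integral on a grid:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

mu0, s0, mu1, s1, q = 165.0, 6.0, 180.0, 12.0, 0.5    # illustrative parameters

x = np.linspace(100, 260, 20001)
p0, p1 = norm.pdf(x, mu0, s0), norm.pdf(x, mu1, s1)
dx = x[1] - x[0]

def chernoff(beta):
    """Bound on P(error): P(c=0)^b P(c=1)^(1-b) * integral of p(x|0)^b p(x|1)^(1-b) dx."""
    return (1 - q) ** beta * q ** (1 - beta) * np.sum(p0 ** beta * p1 ** (1 - beta)) * dx

res = minimize_scalar(chernoff, bounds=(0.0, 1.0), method="bounded")
print("Chernoff bound ≈", res.fun, "at beta ≈", res.x)
```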
Bayes error rate: Chernoff bound
$$P(\text{error}, x) = \min\left[ p(x \mid c = 0)\, P(c = 0),\, p(x \mid c = 1)\, P(c = 1) \right]$$
$$P(\text{error}) = \int P(\text{error}, x)\, dx$$