580.691 Learning Theory
Reza Shadmehr
Classification via regression, Fisher linear discriminant,
Bayes classifier, confidence and error rate of the Bayes classifier
Classification via regression
• Suppose we wish to classify vector x as belonging to either class C0 or C1.
• We can approach the problem as if it were a regression problem:
$$D = \left\{ \left(\mathbf{x}^{(1)}, y^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, y^{(n)}\right) \right\}, \qquad y^{(i)} \in \{0, 1\}$$

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}, \qquad \mathbf{x} = \begin{bmatrix} 1 & x_1 & x_2 \end{bmatrix}^T$$

$$X = \begin{bmatrix} 1 & x_1^{(1)} & x_2^{(1)} \\ \vdots & \vdots & \vdots \\ 1 & x_1^{(n)} & x_2^{(n)} \end{bmatrix}, \qquad \mathbf{w}_{ML} = \left( X^T X \right)^{-1} X^T \mathbf{y}$$

$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
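A minimal numpy sketch of this least-squares classifier, assuming a two-column data matrix and 0/1 labels (the toy data and variable names below are illustrative, not from the slides):

```python
import numpy as np

def fit_ls_classifier(X, y):
    """Least-squares fit: w_ML = (X^T X)^{-1} X^T y, with a prepended bias column."""
    Xa = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(Xa, y, rcond=None)[0]

def predict(w_ml, X):
    """Assign C1 when the predicted y-hat is >= 0.5, else C0."""
    Xa = np.column_stack([np.ones(len(X)), X])
    return (Xa @ w_ml >= 0.5).astype(int)

# toy usage: two Gaussian clouds labeled 0 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([3, 3], 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]
w_ml = fit_ls_classifier(X, y)
print("w_ML =", w_ml, " training accuracy:", np.mean(predict(w_ml, X) == y))
```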
Classification via regression
• Model:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}$$
$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
[Figure: the fitted regression surface $\hat{y}(x_1, x_2)$ plotted over the $(x_1, x_2)$ plane; the vertical axis is $y$, ranging from -0.5 to 1.5.]
The decision boundary is the line where the predicted value equals 0.5:

$$0.5 = w_0 + w_1 x_1 + w_2 x_2 \quad\Longrightarrow\quad x_2 = \frac{0.5 - w_0 - w_1 x_1}{w_2}$$
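Continuing the sketch above, the boundary line can be read off the fitted weights; a small illustration (the weight values here are made up):

```python
import numpy as np

def boundary_x2(w, x1):
    """x2 on the decision boundary, where w0 + w1*x1 + w2*x2 = 0.5."""
    w0, w1, w2 = w
    return (0.5 - w0 - w1 * x1) / w2

w = np.array([-1.0, 0.2, 0.3])          # illustrative weights, e.g. from the fit above
x1_grid = np.linspace(-4, 6, 50)
x2_grid = boundary_x2(w, x1_grid)       # points on the line x^T w = 0.5
```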
Classification via regression: concerns
• Model:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}$$
$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
• Sometimes an x can give us a $\hat{y}$ that is outside our range (outside [0, 1]).

[Figure: two example datasets in the $(x_1, x_2)$ plane, each shown with the line $\mathbf{x}^T \mathbf{w}_{ML} = 0.5$ (the decision boundary) and the line $\mathbf{x}^T \mathbf{w}_{ML} = 1$. This classification looks good; this one, not so good.]
Classification via regression: concerns
• Model:

$$\hat{y} = w_0 + w_1 x_1 + w_2 x_2 = \mathbf{x}^T \mathbf{w}$$
$$\mathbf{x} \in C_1 \text{ if } \hat{y} \ge 0.5; \text{ otherwise } \mathbf{x} \in C_0$$
$$\varepsilon^{(n)} = y^{(n)} - \hat{y}^{(n)}$$
• Variance of the error (which is equal to the variance of y) depends on x, unlike in regression.
Since y is a random variable that can take only the values 0 or 1, the regression error will not be normally distributed.
$$E[y \mid x] = 0 \cdot P(y = 0 \mid x) + 1 \cdot P(y = 1 \mid x) = P(y = 1 \mid x)$$
$$E[y^2 \mid x] = 0^2 \cdot P(y = 0 \mid x) + 1^2 \cdot P(y = 1 \mid x) = P(y = 1 \mid x)$$
$$\mathrm{var}[y \mid x] = E[y^2 \mid x] - E[y \mid x]^2 = P(y = 1 \mid x)\left(1 - P(y = 1 \mid x)\right) = P(y = 1 \mid x)\, P(y = 0 \mid x)$$
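As a quick numerical check (an illustration, not from the slides): at the decision boundary, where $P(y = 1 \mid x) = 0.5$, the error variance is 0.25, its maximum; for a point with $P(y = 1 \mid x) = 0.9$ it drops to 0.09. The error variance therefore changes with x, which is exactly the concern raised above.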
Regression as projection
• A linear regression function projects each data point:
Each data point $\mathbf{x}^{(n)} = [x_1, x_2]^T$ is projected onto $\mathbf{w}_1$:

$$\hat{y} = w_0 + \mathbf{w}_1^T \mathbf{x} = w_0 + w_1 x_1 + w_2 x_2$$
$$z^{(n)} = \mathbf{w}_1^T \mathbf{x}^{(n)}$$
For a given w1, there will be a specific distribution of the projected points z={z(1),z(2),…,z(n)}. We can study how well the projected points are distributed into classes.
[Figure: the same two-class data shown three times, each with a different choice of $\mathbf{w}_1$ and hence a different projection of the points.]
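A small numpy sketch of the projection step just described, to make the idea concrete (toy data and names are illustrative):

```python
import numpy as np

def project(X, w1):
    """Project each row of X onto the direction w1: z^(n) = w1^T x^(n)."""
    return X @ w1

# toy data: two labeled clouds, as in the scatter plots
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 2], 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]

for w1 in (np.array([1.0, 0.0]), np.array([1.0, 1.0]) / np.sqrt(2)):
    z = project(X, w1)
    # compare the class-conditional distributions of the projected points
    print(w1, "class means:", z[y == 0].mean(), z[y == 1].mean(),
          "class stds:", z[y == 0].std(), z[y == 1].std())
```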
Fisher discriminant analysis
• Suppose we wish to classify vector x as belonging to either class C0 or C1.
Class y = 0: $n_0$ points, mean $\boldsymbol{\mu}_0$, variance $\Sigma_0$
Class y = 1: $n_1$ points, mean $\boldsymbol{\mu}_1$, variance $\Sigma_1$
• Class descriptions in the classification (or projected) space, i.e., the mean and variance of $\hat{y}$ for x's that belong to class 0 or class 1:
$$E[\hat{y} \mid \mathbf{x} \in C_0] = w_0 + \boldsymbol{\mu}_0^T \mathbf{w}, \qquad \mathrm{var}[\hat{y} \mid \mathbf{x} \in C_0] = \mathbf{w}^T \Sigma_0 \mathbf{w}$$
$$E[\hat{y} \mid \mathbf{x} \in C_1] = w_0 + \boldsymbol{\mu}_1^T \mathbf{w}, \qquad \mathrm{var}[\hat{y} \mid \mathbf{x} \in C_1] = \mathbf{w}^T \Sigma_1 \mathbf{w}$$

Estimated from the data:

$$\hat{\boldsymbol{\mu}}_0 = E[\mathbf{x} \mid \mathbf{x} \in C_0] = \frac{1}{n_0} \sum_{\mathbf{x}^{(i)} \in C_0} \mathbf{x}^{(i)}, \qquad \hat{\boldsymbol{\mu}}_1 = E[\mathbf{x} \mid \mathbf{x} \in C_1] = \frac{1}{n_1} \sum_{\mathbf{x}^{(i)} \in C_1} \mathbf{x}^{(i)}$$

$$\hat{\Sigma}_0 = \mathrm{var}[\mathbf{x} \mid \mathbf{x} \in C_0] = \frac{1}{n_0} \sum_{\mathbf{x}^{(i)} \in C_0} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_0\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_0\right)^T$$

$$\hat{\Sigma}_1 = \mathrm{var}[\mathbf{x} \mid \mathbf{x} \in C_1] = \frac{1}{n_1} \sum_{\mathbf{x}^{(i)} \in C_1} \left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right)\left(\mathbf{x}^{(i)} - \boldsymbol{\mu}_1\right)^T$$
Fisher discriminant analysis
• Find w so that when each point is projected to the classification space, the classes are maximally separated.
$$J(\mathbf{w}) = \frac{\text{separation of projected means}}{\text{sum of within-class variances}} = \frac{\left(\hat{\boldsymbol{\mu}}_0^T \mathbf{w} - \hat{\boldsymbol{\mu}}_1^T \mathbf{w}\right)^2}{n_0\, \mathbf{w}^T \hat{\Sigma}_0 \mathbf{w} + n_1\, \mathbf{w}^T \hat{\Sigma}_1 \mathbf{w}}$$
[Figure: two projections of the same two-class data, one giving large separation of the projected classes, the other small separation.]
Fisher discriminant analysis
$$\mathbf{w}^* = \arg\max_{\mathbf{w}} J(\mathbf{w})$$

$$J(\mathbf{w}) = \frac{\left(\hat{\boldsymbol{\mu}}_0^T \mathbf{w} - \hat{\boldsymbol{\mu}}_1^T \mathbf{w}\right)^2}{n_0\, \mathbf{w}^T \hat{\Sigma}_0 \mathbf{w} + n_1\, \mathbf{w}^T \hat{\Sigma}_1 \mathbf{w}} = \frac{\left(\mathbf{m}^T \mathbf{w}\right)^2}{\mathbf{w}^T S \mathbf{w}}, \qquad \mathbf{m} \equiv \hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1, \quad S \equiv n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1$$

S is symmetric positive definite, so we can always write it as $S = R^T R$, where R is a "square root" matrix.

Using R, change the coordinate system of J from w to v, with $\mathbf{v} = R\mathbf{w}$ (so $\mathbf{w} = R^{-1}\mathbf{v}$):

$$J(\mathbf{v}) = \frac{\left(\mathbf{m}^T R^{-1}\mathbf{v}\right)^2}{\left(R^{-1}\mathbf{v}\right)^T R^T R \left(R^{-1}\mathbf{v}\right)} = \frac{\left(\left(R^{-T}\mathbf{m}\right)^T \mathbf{v}\right)^2}{\mathbf{v}^T \mathbf{v}} = \left(\left(R^{-T}\mathbf{m}\right)^T \frac{\mathbf{v}}{\|\mathbf{v}\|}\right)^2$$
Fisher discriminant analysis
$$J(\mathbf{v}) = \left(\left(R^{-T}\mathbf{m}\right)^T \frac{\mathbf{v}}{\|\mathbf{v}\|}\right)^2 \quad \text{is maximum when} \quad \mathbf{v} = a\, R^{-T}\mathbf{m} = a\, R^{-T}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right)$$

(The dot product of a vector of norm 1 and another vector is maximum when the two have the same direction.)

Transforming back to w:

$$\mathbf{w} = R^{-1}\mathbf{v} = a\, R^{-1} R^{-T}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right) = a\, S^{-1}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right) = a\left(n_0 \hat{\Sigma}_0 + n_1 \hat{\Sigma}_1\right)^{-1}\left(\hat{\boldsymbol{\mu}}_0 - \hat{\boldsymbol{\mu}}_1\right)$$

where a is an arbitrary constant. The bias can then be estimated from the data:

$$w_0 = \frac{1}{n}\sum_{i=1}^{n}\left(y^{(i)} - \mathbf{w}^T \mathbf{x}^{(i)}\right)$$
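A compact numpy sketch of this estimator (toy data and names are illustrative). The constant a is arbitrary; here its sign is chosen so that class-1 points project to larger values, i.e., $\mathbf{w} \propto S^{-1}(\hat{\boldsymbol{\mu}}_1 - \hat{\boldsymbol{\mu}}_0)$:

```python
import numpy as np

def fisher_discriminant(X, y):
    """w proportional to (n0*Sigma0 + n1*Sigma1)^{-1} (mu1 - mu0), plus the bias w0."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S = len(X0) * np.cov(X0, rowvar=False, bias=True) \
      + len(X1) * np.cov(X1, rowvar=False, bias=True)
    w = np.linalg.solve(S, mu1 - mu0)    # a = 1; any nonzero a gives the same direction
    w0 = np.mean(y - X @ w)              # w0 = (1/n) sum_i (y_i - w^T x_i)
    return w0, w

# usage on balanced toy data, classifying with the same 0.5 threshold as before
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, size=(50, 2)),
               rng.normal([2, 1], 1.0, size=(50, 2))])
y = np.r_[np.zeros(50), np.ones(50)]
w0, w = fisher_discriminant(X, y)
labels = (w0 + X @ w >= 0.5).astype(int)
print("training accuracy:", np.mean(labels == y))
```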
Bayesian classification
• Suppose we wish to classify vector x as belonging to one of the classes {1, …, L}. We are given labeled data and need to form a classification function:
$$D = \left\{ \left(\mathbf{x}^{(1)}, c^{(1)}\right), \ldots, \left(\mathbf{x}^{(n)}, c^{(n)}\right) \right\}, \qquad c^{(i)} \in \{1, \ldots, L\}$$
$$\hat{c} = \hat{c}(\mathbf{x}), \qquad \hat{c} \in \{1, \ldots, L\}$$
$$\hat{c} = \arg\max_{l = 1, \ldots, L} P(c = l \mid \mathbf{x})$$

Classify x into the class l that maximizes the posterior probability:

$$P(c = l \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid c = l)\, P(c = l)}{p(\mathbf{x})} = \frac{p(\mathbf{x} \mid c = l)\, P(c = l)}{\sum_{l'=1}^{L} p(\mathbf{x} \mid c = l')\, P(c = l')}$$

with likelihood $p(\mathbf{x} \mid c = l)$, prior $P(c = l)$, and marginal $p(\mathbf{x})$.
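A generic numpy sketch of this arg-max rule, assuming the class-conditional likelihoods and priors are known (the two Gaussian classes below are made up for illustration):

```python
import numpy as np
from scipy.stats import norm

def bayes_classify(x, likelihoods, priors):
    """Return argmax_l P(c=l | x) and the full posterior over the L classes."""
    joint = np.array([lik(x) * pr for lik, pr in zip(likelihoods, priors)])
    posterior = joint / joint.sum()          # divide by the marginal p(x)
    return int(np.argmax(posterior)), posterior

# usage with two made-up Gaussian classes
likelihoods = [lambda x: norm.pdf(x, loc=0.0, scale=1.0),
               lambda x: norm.pdf(x, loc=3.0, scale=1.0)]
priors = [0.5, 0.5]
label, post = bayes_classify(1.2, likelihoods, priors)
print(label, post)
```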
Classification when distributions have equal variance
• Suppose we wish to classify a person as male or female based on height.
What we have:

$$p(x \mid c = 0) = N\left(\mu_1, \sigma^2\right), \qquad p(x \mid c = 1) = N\left(\mu_2, \sigma^2\right), \qquad P(c = 1) = q$$

What we want:

$$\hat{c} = \hat{c}(x), \qquad \hat{c} = 1 \text{ if } P(c = 1 \mid x) \ge 0.5; \quad \hat{c} = 0 \text{ otherwise}$$

[Figure: the class-conditional densities $p(x \mid c = 0)$ (female) and $p(x \mid c = 1)$ (male) as functions of height x (cm). Note that the two densities have equal variance.]

Assume equal probability of being male or female: $P(c = 1) = 0.5$.

[Figure: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$ over height x (cm), together with their sum, the marginal:]

$$p(x) = \sum_{i=0}^{1} p(x \mid c = i)\, P(c = i)$$
Classification when distributions have equal variance
[Figure: left, the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$; right, the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$ as functions of height x (cm), with the decision boundary marked.]

Decision boundary: the x where $P(c = 0 \mid x) = P(c = 1 \mid x)$, equivalently where $p(x \mid c = 0)\, P(c = 0) = p(x \mid c = 1)\, P(c = 1)$.

The posterior is

$$P(c = 0 \mid x) = \frac{p(x \mid c = 0)\, P(c = 0)}{p(x)}$$

To classify, we really don't need to compute the posterior probability. All we need is the ratio

$$\frac{p(x \mid c = 0)\, P(c = 0)}{p(x \mid c = 1)\, P(c = 1)}$$

If this ratio is greater than 1, then we choose class 0, otherwise class 1. The boundaries between classes occur where the ratio is 1; in other words, the boundary occurs where the log of the ratio is 0.
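A numpy/scipy sketch of this height example. The means, shared standard deviation, and prior below are illustrative stand-ins, not values from the slides:

```python
import numpy as np
from scipy.stats import norm

# illustrative parameters: female (c=0) and male (c=1) heights in cm
mu0, mu1, sigma, q = 165.0, 180.0, 7.0, 0.5     # q = P(c=1)

def posterior_c1(x):
    """P(c=1 | x) from Bayes rule with Gaussian class-conditional densities."""
    num = norm.pdf(x, mu1, sigma) * q
    return num / (num + norm.pdf(x, mu0, sigma) * (1 - q))

def classify(x):
    return (posterior_c1(x) >= 0.5).astype(int)

# with equal variances and equal priors the boundary is the midpoint of the means
x = np.linspace(140, 200, 601)
boundary = x[np.argmin(np.abs(posterior_c1(x) - 0.5))]   # numerically ~ (mu0 + mu1) / 2
print(boundary, classify(np.array([170.0, 185.0])))
```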
Uncertainty of the classifier
Starting with our likelihood and prior:

$$p(x \mid c = 0), \qquad p(x \mid c = 1), \qquad P(c = 1) = q$$

we compute a posterior probability distribution as a function of x:

$$p(c \mid x): \qquad P(c = 1 \mid x), \quad P(c = 0 \mid x) = 1 - P(c = 1 \mid x)$$

This is a binomial distribution. We can compute the variance of this distribution:

$$E[c \mid x] = 1 \cdot P(c = 1 \mid x) + 0 \cdot P(c = 0 \mid x) = P(c = 1 \mid x)$$
$$E[c^2 \mid x] = 1^2 \cdot P(c = 1 \mid x) + 0^2 \cdot P(c = 0 \mid x) = P(c = 1 \mid x)$$
$$\mathrm{var}[c \mid x] = E[c^2 \mid x] - E[c \mid x]^2 = P(c = 1 \mid x)\, P(c = 0 \mid x)$$
[Figure: left, the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$; right, $\mathrm{var}[c \mid x]$, as functions of height x (cm).]

Classification is most uncertain at the decision boundary.
Classification when distributions have unequal variance
What we have:

$$p(x \mid c = 0) = N\left(\mu_1, \sigma_1^2\right), \qquad p(x \mid c = 1) = N\left(\mu_2, \sigma_2^2\right), \qquad P(c = 1) = q$$

Classification:

$$\hat{c} = 1 \text{ if } P(c = 1 \mid x) \ge 0.5; \quad \hat{c} = 0 \text{ otherwise}$$

Assume: $P(c = 1) = 0.5$.

[Figure: for the unequal-variance case, the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$, the marginal $p(x) = \sum_{i=0}^{1} p(x \mid c = i)\, P(c = i)$, the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$, and $\mathrm{var}[c \mid x]$, all as functions of height x (cm).]
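With unequal variances the posterior $P(c = 1 \mid x)$ can cross 0.5 at two points, so a decision region need not be a single interval. A numerical sketch (the parameter values are illustrative):

```python
import numpy as np
from scipy.stats import norm

mu0, s0 = 165.0, 6.0      # class 0 (illustrative)
mu1, s1 = 180.0, 12.0     # class 1, with the larger variance (illustrative)
q = 0.5                   # P(c = 1)

def posterior_c1(x):
    num = norm.pdf(x, mu1, s1) * q
    return num / (num + norm.pdf(x, mu0, s0) * (1 - q))

# find where P(c=1|x) crosses 0.5, i.e., sign changes of P(c=1|x) - 0.5
x = np.linspace(100, 250, 5001)
d = posterior_c1(x) - 0.5
crossings = x[:-1][np.sign(d[:-1]) != np.sign(d[1:])]
print("approximate decision boundaries:", crossings)   # typically two values here
```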
Bayes error rate: Probability of misclassification
[Figure: the scaled densities $p(x \mid c = 0)\,P(c = 0)$ and $p(x \mid c = 1)\,P(c = 1)$ over x, with the decision boundary $x^*$ dividing the axis into the decision regions $R_0$ and $R_1$.]

$$P(\text{error}) = P\left(x \in R_0 \mid c = 1\right) P(c = 1) + P\left(x \in R_1 \mid c = 0\right) P(c = 0) = \int_{R_0} p(x \mid c = 1)\, P(c = 1)\, dx + \int_{R_1} p(x \mid c = 0)\, P(c = 0)\, dx$$

The first integral is the probability of data belonging to $c_1$ but being classified as $c_0$; the second is the probability of data belonging to $c_0$ but being classified as $c_1$.

In general, it is actually quite hard to compute P(error) because we will need to integrate the posterior probabilities over decision regions that may be discontinuous (for example, when the distributions have unequal variances). To help with this, there is the Chernoff bound.
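Before turning to the bound, the error rate of a one-dimensional example can also be checked by direct numerical integration (the parameters below are illustrative; on a grid, the decision regions are simply the points where each class wins):

```python
import numpy as np
from scipy.stats import norm

mu0, s0, mu1, s1, q = 165.0, 6.0, 180.0, 12.0, 0.5    # illustrative parameters

x = np.linspace(100, 260, 20001)
g0 = norm.pdf(x, mu0, s0) * (1 - q)    # p(x|c=0) P(c=0)
g1 = norm.pdf(x, mu1, s1) * q          # p(x|c=1) P(c=1)

# R1 is where class 1 wins (g1 > g0), R0 is the rest; add the misclassified mass of each region
in_R1 = g1 > g0
p_error = np.sum(np.where(in_R1, g0, g1)) * (x[1] - x[0])
print("Bayes error rate ≈", p_error)
```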
Bayes error rate: Chernoff bound
In the two-class classification problem, we note that the classification error depends on the area under the minimum of the two posterior probabilities.
$$P(\text{error} \mid x) = \min\left[ P(c = 0 \mid x),\, P(c = 1 \mid x) \right]$$
$$P(\text{error}, x) = P(\text{error} \mid x)\, p(x) = \min\left[ p(x \mid c = 0)\, P(c = 0),\, p(x \mid c = 1)\, P(c = 1) \right]$$
$$P(\text{error}) = \int P(\text{error}, x)\, dx$$

[Figure: the posteriors $P(c = 0 \mid x)$ and $P(c = 1 \mid x)$ over x, with $P(\text{error} \mid x)$ shown as their pointwise minimum.]
To compute the minimum, we will need the following inequality:
$$\min[a, b] \le a^{\beta}\, b^{1-\beta}, \qquad a, b \ge 0, \quad 0 \le \beta \le 1$$
To help figure out this inequality, we note that:

$$a^{\beta}\, b^{1-\beta} = \left(\frac{a}{b}\right)^{\beta} b$$

Without loss of generality, suppose that b is smaller than a. Then a/b > 1, and we have:

$$\left(\frac{a}{b}\right)^{\beta} b \ge b = \min[a, b]$$

So we can think of the term $a^{\beta} b^{1-\beta}$ (for all values of $\beta$) as an upper bound on $\min[a, b]$. Returning to our P(error) problem, we can replace the min[·] function with our inequality:

$$P(\text{error}) \le P(c = 0)^{\beta}\, P(c = 1)^{1-\beta} \int p(x \mid c = 0)^{\beta}\, p(x \mid c = 1)^{1-\beta}\, dx$$
The bound is found by numerically finding the value of $\beta$ that minimizes the above expression. The key benefit here is that our search is in the one-dimensional space of $\beta$, and we also got rid of the discontinuous decision regions.
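A sketch of that one-dimensional search over $\beta$, assuming Gaussian class-conditional densities with illustrative parameters and evaluating the integral on a grid:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

mu0, s0, mu1, s1, q = 165.0, 6.0, 180.0, 12.0, 0.5    # illustrative parameters

x = np.linspace(100, 260, 20001)
p0, p1 = norm.pdf(x, mu0, s0), norm.pdf(x, mu1, s1)
dx = x[1] - x[0]

def chernoff(beta):
    """Bound on P(error): P(c=0)^b P(c=1)^(1-b) * integral of p(x|0)^b p(x|1)^(1-b) dx."""
    return (1 - q) ** beta * q ** (1 - beta) * np.sum(p0 ** beta * p1 ** (1 - beta)) * dx

res = minimize_scalar(chernoff, bounds=(0.0, 1.0), method="bounded")
print("Chernoff bound ≈", res.fun, "at beta ≈", res.x)
```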
Bayes error rate: Chernoff bound
$$P(\text{error}, x) = \min\left[ p(x \mid c = 0)\, P(c = 0),\, p(x \mid c = 1)\, P(c = 1) \right]$$
$$P(\text{error}) = \int P(\text{error}, x)\, dx$$