Discriminant Functions Generative Models Discriminative Models
Linear Models for Classification
Oliver Schulte - CMPT 726
Bishop PRML Ch. 4
Classification: Hand-written Digit Recognition
xi = (28 × 28 pixel image of a handwritten digit)
ti = (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)
• Each input vector is classified into one of K discrete classes
• Denote the classes by Ck
• Represent the input image as a vector xi ∈ R784
• We have a target vector ti ∈ {0, 1}10
• Given a training set {(x1, t1), . . . , (xN , tN)}, the learning problem is to construct a “good” function y(x) from these
• y : R784 → R10
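The 1-of-K target encoding above can be sketched as follows (a minimal illustration; the helper name `one_hot` is my own):

```python
import numpy as np

def one_hot(digit, K=10):
    # 1-of-K coding: t has a single 1 in the position of the true class
    t = np.zeros(K, dtype=int)
    t[digit] = 1
    return t

t3 = one_hot(3)  # the target vector t_i shown above, for the digit 3
```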
Generalized Linear Models
• As in the previous chapter on linear models for regression, we will use a “linear” model for classification:

y(x) = f (wTx + w0)

• This is called a generalized linear model
• f (·) is a fixed non-linear function, e.g. the threshold function

f (u) = 1 if u ≥ 0, and 0 otherwise

• The decision boundary between classes will be a linear function of x
• Can also apply a non-linearity to x, as in φi(x) for regression
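As a concrete sketch (with a made-up weight vector), the thresholded model can be written as:

```python
import numpy as np

def f(u):
    # fixed non-linearity: threshold at zero
    return 1 if u >= 0 else 0

def y(x, w, w0):
    # generalized linear model: y(x) = f(w^T x + w0)
    return f(w @ x + w0)

w = np.array([1.0, -1.0])
w0 = 0.5
label = y(np.array([2.0, 1.0]), w, w0)  # w^T x + w0 = 1.5 >= 0, so class 1
```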
Overview
• Linear Regression for Classification
• The Fisher Linear Discriminant, or How to Draw a Line Between Classes
• The Perceptron, or The Smallest Neural Net
• Logistic Regression, or The Statistician’s Classifier
Outline
Discriminant Functions
Generative Models
Discriminative Models
Discriminant Functions with Two Classes
(Figure: geometry of a linear discriminant in two dimensions. The decision surface y = 0 is orthogonal to w and separates region R1, where y > 0, from region R2, where y < 0. x⊥ is the orthogonal projection of x onto the surface, y(x)/‖w‖ is the signed distance of x from it, and −w0/‖w‖ is the distance of the surface from the origin.)
• Start with the 2-class problem, ti ∈ {0, 1}
• Simple linear discriminant

y(x) = wTx + w0

apply a threshold function to get the classification
• The decision surface is a line (a hyperplane in higher dimensions), orthogonal to w
• The projection of x in the direction of w is wTx/‖w‖
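These geometric facts can be checked numerically; a small sketch with hand-picked numbers:

```python
import numpy as np

w = np.array([3.0, 4.0])   # ||w|| = 5
w0 = -5.0
x = np.array([2.0, 1.0])

y_x = w @ x + w0                        # y(x) = 6 + 4 - 5 = 5
proj = (w @ x) / np.linalg.norm(w)      # projection of x onto w: w^T x / ||w||
dist = y_x / np.linalg.norm(w)          # signed distance of x from the surface y = 0
```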
Multiple Classes
(Figure: ambiguous regions from combining binary linear discriminants. Left: one-versus-the-rest (C1 vs. not C1, C2 vs. not C2) leaves a region marked “?” claimed by no class. Right: one-versus-one (C1 vs. C2, C1 vs. C3, C2 vs. C3) also leaves a region “?” with conflicting votes.)
• A linear discriminant between two classes separates with a hyperplane
• How to use this for multiple classes?
• One-versus-the-rest method: build K − 1 classifiers, between Ck and all others
• One-versus-one method: build K(K − 1)/2 classifiers, between all pairs
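The classifier counts for the two schemes, as a quick sanity check:

```python
def one_versus_rest_count(K):
    # one binary classifier C_k vs. the rest, for K - 1 of the classes
    return K - 1

def one_versus_one_count(K):
    # one binary classifier per unordered pair of classes
    return K * (K - 1) // 2
```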
Multiple Classes
(Figure: decision regions Ri, Rj, Rk formed by K linear functions. Any point x̂ on the line segment between two points xA and xB of Rk also lies in Rk.)
• A solution is to build K linear functions:

yk(x) = wkTx + wk0

and assign x to the class arg maxk yk(x)
• This gives connected, convex decision regions: let x̂ lie on the segment between xA and xB, both in Rk:

x̂ = λxA + (1 − λ)xB

By linearity, yk(x̂) = λyk(xA) + (1 − λ)yk(xB); since yk is largest at both xA and xB,

yk(x̂) > yj(x̂), ∀j ≠ k

so x̂ also lies in Rk.
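A minimal sketch of the arg-max rule (toy weights chosen by hand):

```python
import numpy as np

def classify(x, W, w0):
    # assign x to class arg max_k y_k(x), where y_k(x) = w_k^T x + w_k0
    return int(np.argmax(W @ x + w0))

# K = 2 classes over a 1-D input, weights picked for illustration
W = np.array([[1.0], [-1.0]])
w0 = np.array([0.0, 0.0])
```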
Least Squares for Classification
• How do we learn the decision boundaries (wk, wk0)?
• One approach is to use least squares, similar to regression
• Find W to minimize the squared error over all examples and all components of the label vector:

E(W) = (1/2) ∑n=1..N ∑k=1..K (yk(xn) − tnk)²

• After some algebra, we get a closed-form solution using the pseudo-inverse, as in regression
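A sketch of the pseudo-inverse solution on toy data (augmenting the inputs with a constant 1 to absorb wk0):

```python
import numpy as np

# N = 4 one-dimensional inputs, K = 2 classes with 1-of-K targets
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
T = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

Xt = np.hstack([np.ones((X.shape[0], 1)), X])  # augmented design matrix
W = np.linalg.pinv(Xt) @ T                     # minimizes the squared error E(W)

pred = np.argmax(Xt @ W, axis=1)               # classify by the largest output
```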
Problems with Least Squares
(Figure: the same two-class data set without (left) and with (right) additional easy points far from the boundary, together with the resulting decision boundaries.)
• Looks okay... the least-squares decision boundary is similar to the logistic regression decision boundary (more later)
• It gets worse by adding easy points?!
• Why?
• If the target value is 1, points far from the boundary will have a high value of y(x), say 10; this is a large squared error, so the boundary is moved
More Least Squares Problems
(Figure: a three-class data set that is easily separated by hyperplanes.)
• Easily separated by hyperplanes, but not found using least squares!
• Remember that least squares is the MLE under a Gaussian noise model for a continuous target; here we have discrete targets
• We’ll address these problems later with better models
• First, a look at a different criterion for a linear discriminant
Fisher’s Linear Discriminant
• The two-class linear discriminant acts as a projection:

y = wTx

followed by a threshold: classify as C1 if y ≥ −w0
• In which direction w should we project?
• One which separates the classes “well”
• Intuition: we want the (projected) centres of the classes to be far apart, and each class (projection) to be clustered around its centre
Fisher’s Linear Discriminant
(Figure: two overlapping classes projected onto two directions. Left: projecting onto the line joining the class means leaves considerable overlap. Right: the Fisher direction gives well-separated projections.)
• A natural idea would be to project in the direction of the line connecting the class means
• However, this is problematic if the classes have variance in this direction (i.e. are not clustered around the mean)
• Fisher criterion: maximize the ratio of inter-class separation (between classes) to intra-class variance (within classes)
Math time - FLD
• Projection yn = wTxn
• Inter-class separation is the distance between the projected class means (good):

mk = (1/Nk) ∑n∈Ck wTxn

• Intra-class variance (bad):

sk² = ∑n∈Ck (yn − mk)²

• Fisher criterion:

J(w) = (m2 − m1)² / (s1² + s2²)

maximize with respect to w
Math time - FLD
J(w) = (m2 − m1)² / (s1² + s2²) = (wTSBw) / (wTSWw)

Between-class covariance (mk is now the unprojected mean of class Ck):

SB = (m2 − m1)(m2 − m1)T

Within-class covariance:

SW = ∑n∈C1 (xn − m1)(xn − m1)T + ∑n∈C2 (xn − m2)(xn − m2)T

Lots of math:

w ∝ SW⁻¹(m2 − m1)

If the within-class covariance SW is isotropic (proportional to the identity matrix), this reduces to the class mean difference vector m2 − m1.
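The closed-form Fisher direction can be sketched on toy data; with isotropic within-class scatter the result is just the normalized mean difference, as noted above:

```python
import numpy as np

def fisher_direction(X1, X2):
    # w ∝ S_W^{-1} (m2 - m1); inputs hold one example per row
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

# isotropic within-class scatter, class means (0, 0) and (3, 4)
base = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
w = fisher_direction(base, base + np.array([3.0, 4.0]))  # ≈ (0.6, 0.8)
```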
FLD Summary
• FLD is a dimensionality reduction technique (more later in the course)
• The criterion for choosing the projection is based on class labels
• It still suffers from outliers (e.g. the earlier least-squares example)
Perceptrons
• The term perceptron is used to refer to many neural network structures (more next week)
• The classic type is a fixed non-linear transformation of the input, one layer of adaptive weights, and a threshold:

y(x) = f (wTφ(x))

• Developed by Rosenblatt in the 1950s
• The main difference from the methods we’ve seen so far is the learning algorithm
Perceptron Learning
• Two-class problem
• For ease of notation, we will use t = 1 for class C1 and t = −1 for class C2
• We saw that squared error was problematic
• Instead, we’d like to minimize the number of misclassified examples
• An example is misclassified if wTφ(xn)tn < 0
• Perceptron criterion:

EP(w) = − ∑n∈M wTφ(xn)tn

a sum over the misclassified examples only
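The criterion can be evaluated directly; a small sketch with hand-picked weights:

```python
import numpy as np

def perceptron_criterion(w, Phi, t):
    # E_P(w) = -sum over misclassified n of w^T phi(x_n) t_n
    margins = (Phi @ w) * t          # negative exactly for misclassified examples
    return float(-margins[margins < 0].sum())

Phi = np.array([[1.0, 2.0], [1.0, -1.0]])
t = np.array([1.0, -1.0])
E_good = perceptron_criterion(np.array([0.0, 1.0]), Phi, t)   # no mistakes
E_bad = perceptron_criterion(np.array([0.0, -1.0]), Phi, t)   # both wrong
```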
Perceptron Learning Algorithm
• Minimize the error function using stochastic gradient descent (a gradient step per example):

w(τ+1) = w(τ) − η∇EP(w) = w(τ) + η φ(xn) tn    (update applied only if example n is misclassified)

• Iterate over all training examples; only change w if the example is misclassified
• Guaranteed to converge if the data are linearly separable
• Will not converge if they are not
• May take many iterations
• Sensitive to initialization
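The update rule above can be sketched in NumPy (names like `perceptron_train` are illustrative, not from the slides; the toy dataset is hypothetical and linearly separable, so the algorithm converges):

```python
import numpy as np

def perceptron_train(Phi, t, eta=1.0, max_epochs=1000):
    """Stochastic-gradient perceptron: w <- w + eta * phi_n * t_n
    whenever example n is misclassified, i.e. (w^T phi_n) * t_n <= 0."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        updated = False
        for phi_n, t_n in zip(Phi, t):
            if (w @ phi_n) * t_n <= 0:     # misclassified (or on the boundary)
                w = w + eta * phi_n * t_n
                updated = True
        if not updated:                    # zero training errors: converged
            break
    return w

# toy linearly separable data; phi(x) = [1, x1, x2] (constant bias feature)
X = np.array([[2.0, 2.0], [1.5, 3.0], [-2.0, -1.0], [-1.0, -2.5]])
t = np.array([1, 1, -1, -1])
Phi = np.hstack([np.ones((4, 1)), X])
w = perceptron_train(Phi, t)
assert np.all((Phi @ w) * t > 0)           # every example correctly classified
```

Note the update only fires on misclassified examples, exactly as in the gradient of EP(w), which is zero elsewhere.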
![Page 48: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/48.jpg)
Discriminant Functions Generative Models Discriminative Models
Perceptron Learning Illustration
[Figure: four panels (axes from −1 to 1) illustrating successive perceptron updates; the decision boundary moves after each misclassified example until the data are separated]
• Note there are many hyperplanes with 0 error
• Support vector machines (in a few weeks) have a nice way of choosing one
![Page 53: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/53.jpg)
Discriminant Functions Generative Models Discriminative Models
Limitations of Perceptrons
• Perceptrons can only solve problems that are linearly separable in feature space
• The same is true of the other models in this chapter
• The canonical example of a non-separable problem is XOR
• Real datasets can look like this too
Expressiveness of perceptrons
Consider a perceptron with g = step function (Rosenblatt, 1957, 1960)
Can represent AND, OR, NOT, majority, etc.
Represents a linear separator in input space:
∑_j Wj xj > 0, i.e. W · x > 0

[Figure: truth tables of I1 AND I2, I1 OR I2, and I1 XOR I2 plotted in the (I1, I2) plane; AND and OR are linearly separable, XOR is not]
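The XOR limitation is easy to verify empirically. The sketch below (hypothetical NumPy code, not from the slides) runs many epochs of perceptron updates on XOR and confirms the training error never reaches zero:

```python
import numpy as np

# XOR in input space: no line w0 + w1*x1 + w2*x2 = 0 separates the classes.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
t = np.array([-1, 1, 1, -1])               # XOR labels encoded in {-1, +1}
Phi = np.hstack([np.ones((4, 1)), X])      # add a constant bias feature

w = np.zeros(3)
for _ in range(1000):                      # many epochs of perceptron updates
    for phi_n, t_n in zip(Phi, t):
        if (w @ phi_n) * t_n <= 0:         # misclassified: apply the update
            w += phi_n * t_n

errors = int(np.sum((Phi @ w) * t <= 0))
assert errors > 0                          # at least one example is always wrong
```

Because no linear separator exists, some example is misclassified under every weight vector, so the update loop cycles forever instead of converging.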
![Page 54: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/54.jpg)
Discriminant Functions Generative Models Discriminative Models
Outline
Discriminant Functions
Generative Models
Discriminative Models
![Page 55: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/55.jpg)
Discriminant Functions Generative Models Discriminative Models
Intuition for Logistic Regression in Generative Models
• Classification with joint probabilities, with 2 classes C1 and C2:
• Choose C1 if

1 < p(C1|x) / p(C2|x) = p(C1, x) / p(C2, x)
⇐⇒ 0 < ln( p(C1|x) / p(C2|x) ) = ln p(C1|x) − ln p(C2|x)

• The quantity a = ln( p(C1|x) / p(C2|x) ) is called the log-odds.
• Logistic regression assumes that the log-odds are a linear function of the feature vector: a = wTx + w0.
• This is often true, given suitable assumptions about the class-conditional densities p(x|Ck).
![Page 59: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/59.jpg)
Discriminant Functions Generative Models Discriminative Models
Probabilistic Generative Models
• With 2 classes, C1 and C2:

p(C1|x) = p(x|C1)p(C1) / p(x)                                    (Bayes' rule)
        = p(x|C1)p(C1) / (p(x, C1) + p(x, C2))                   (sum rule)
        = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))           (product rule)

• In generative models we specify the class-conditional distribution p(x|Ck), which generates the data for each class
![Page 63: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/63.jpg)
Discriminant Functions Generative Models Discriminative Models
Probabilistic Generative Models - Example
• Let’s say we observe x, the current temperature
• Determine whether we are in Vancouver (C1) or Honolulu (C2)
• Generative model:

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))

• p(x|C1) is a distribution over typical temperatures in Vancouver, e.g. p(x|C1) = N(x; 10, 5)
• p(x|C2) is a distribution over typical temperatures in Honolulu, e.g. p(x|C2) = N(x; 25, 5)
• Class priors: p(C1) = 0.1, p(C2) = 0.9
• p(C1|x = 15) = (0.0484 · 0.1) / (0.0484 · 0.1 + 0.0108 · 0.9) ≈ 0.33
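The posterior in this example can be checked numerically; a sketch using only the Python standard library (the helper name `normal_pdf` is illustrative), treating 5 as the standard deviation of both Gaussians:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma), sigma = std. deviation."""
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

# class-conditional densities evaluated at the observed temperature x = 15
lik_van = normal_pdf(15, 10, 5)   # Vancouver: N(15; 10, 5) ~ 0.0484
lik_hon = normal_pdf(15, 25, 5)   # Honolulu:  N(15; 25, 5) ~ 0.0108
prior_van, prior_hon = 0.1, 0.9

# Bayes' rule with the sum/product-rule denominator
post_van = lik_van * prior_van / (lik_van * prior_van + lik_hon * prior_hon)
assert abs(lik_van - 0.0484) < 1e-3
assert abs(post_van - 0.33) < 0.01
```

Even though a 15-degree day is far more typical of Vancouver, the strong prior toward Honolulu pulls the posterior down to about one third.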
![Page 65: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/65.jpg)
Discriminant Functions Generative Models Discriminative Models
Generalized Linear Models
• Suppose we have built a model for predicting the log-odds. We can use it to compute the class probability as follows:

p(C1|x) = p(x|C1)p(C1) / (p(x|C1)p(C1) + p(x|C2)p(C2))
        = 1 / (1 + exp(−a)) ≡ σ(a)

where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = ln [ p(x, C1) / p(x, C2) ].
![Page 67: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/67.jpg)
Discriminant Functions Generative Models Discriminative Models
Logistic Sigmoid
[Plot of σ(a) for a from −5 to 5: the curve rises from 0 through 0.5 at a = 0 toward 1]

• The function σ(a) = 1/(1 + exp(−a)) is known as the logistic sigmoid
• It squashes the real axis down to [0, 1]
• It is continuous and differentiable
![Page 68: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/68.jpg)
Discriminant Functions Generative Models Discriminative Models
Multi-class Extension
• There is a generalization of the logistic sigmoid to K > 2 classes:

p(Ck|x) = p(x|Ck)p(Ck) / ∑_j p(x|Cj)p(Cj) = exp(ak) / ∑_j exp(aj)

where ak = ln p(x|Ck)p(Ck)

• a.k.a. the softmax function
• If some ak ≫ aj for all j ≠ k, then p(Ck|x) goes to 1
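A sketch of the softmax function in NumPy (the max-subtraction is a standard numerical-stability trick, not something discussed on the slide):

```python
import numpy as np

def softmax(a):
    """softmax(a)_k = exp(a_k) / sum_j exp(a_j)."""
    a = np.asarray(a, dtype=float)
    e = np.exp(a - a.max())        # subtracting the max leaves the ratio unchanged
    return e / e.sum()

p = softmax([2.0, 1.0, 0.1])
assert abs(p.sum() - 1.0) < 1e-12      # a valid probability distribution
assert softmax([100.0, 0.0, 0.0])[0] > 0.999   # dominant a_k drives p(C_k|x) to 1
```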
![Page 70: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/70.jpg)
Discriminant Functions Generative Models Discriminative Models
Example Logistic Regression
[Plot: two-class data on axes from −6 to 6 with a logistic regression decision boundary]
![Page 71: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/71.jpg)
Discriminant Functions Generative Models Discriminative Models
Gaussian Class-Conditional Densities
• Back to the log-odds a in the logistic sigmoid for 2 classes
• Let’s assume the class-conditional densities p(x|Ck) are Gaussians with the same covariance matrix Σ:

p(x|Ck) = 1 / ( (2π)^{D/2} |Σ|^{1/2} ) · exp{ −(1/2)(x − µk)T Σ^{−1} (x − µk) }

• Then a takes a simple form:

a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ] = wTx + w0

• Note that the quadratic terms xT Σ^{−1} x cancel
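The cancellation of the quadratic terms can be verified numerically. The sketch below uses hypothetical parameter values and the standard closed form w = Σ^{−1}(µ1 − µ2), with w0 collecting the remaining constant terms; the check is that the log-odds computed from the Gaussian joints agrees exactly with wTx + w0:

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])     # shared covariance matrix
Sinv = np.linalg.inv(Sigma)
mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])
pi1, pi2 = 0.3, 0.7                            # class priors

def log_joint(x, mu, prior):
    # log p(x|Ck) + log p(Ck); the (2pi)^{D/2}|Sigma|^{1/2} factor is the
    # same for both classes, so it cancels in the log-odds and is omitted
    d = x - mu
    return -0.5 * d @ Sinv @ d + np.log(prior)

# linear form predicted by the slide
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(pi1 / pi2)

for _ in range(5):
    x = rng.normal(size=2)
    a = log_joint(x, mu1, pi1) - log_joint(x, mu2, pi2)
    assert abs(a - (w @ x + w0)) < 1e-10       # quadratic terms cancelled
```

If the two classes had different covariance matrices, the quadratic terms would not cancel and the decision boundary would be quadratic rather than linear.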
![Page 75: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/75.jpg)
Discriminant Functions Generative Models Discriminative Models
Maximum Likelihood Learning
• We can fit the parameters of this model using maximum likelihood
• The parameters are µ1, µ2, Σ, and p(C1) ≡ π, p(C2) ≡ 1 − π; refer to them collectively as θ
• For a datapoint xn from class C1 (tn = 1):

p(xn, C1) = p(C1)p(xn|C1) = πN(xn|µ1, Σ)

• For a datapoint xn from class C2 (tn = 0):

p(xn, C2) = p(C2)p(xn|C2) = (1 − π)N(xn|µ2, Σ)
![Page 78: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/78.jpg)
Discriminant Functions Generative Models Discriminative Models
Maximum Likelihood Learning
• The likelihood of the training data is:

p(t|π, µ1, µ2, Σ) = ∏_{n=1}^N [π N(xn|µ1, Σ)]^{tn} [(1 − π) N(xn|µ2, Σ)]^{1−tn}

• As usual, ln is our friend:

ℓ(t; θ) = ∑_{n=1}^N { tn ln π + (1 − tn) ln(1 − π) } + { tn ln N(xn|µ1, Σ) + (1 − tn) ln N(xn|µ2, Σ) }

• The first group of terms involves only π, the second only µ1, µ2, Σ; maximize with respect to each separately
![Page 80: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/80.jpg)
Discriminant Functions Generative Models Discriminative Models
Maximum Likelihood Learning - Class Priors
• Maximization with respect to the class prior parameter π is straightforward:

∂ℓ(t; θ)/∂π = ∑_{n=1}^N [ tn/π − (1 − tn)/(1 − π) ] = 0  ⇒  π = N1 / (N1 + N2)

• N1 and N2 are the numbers of training points in each class
• The prior is simply the fraction of points in each class
![Page 83: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/83.jpg)
Discriminant Functions Generative Models Discriminative Models
Maximum Likelihood Learning - Gaussian Parameters
• The other parameters can also be found in the same fashion
• Class means:

µ1 = (1/N1) ∑_{n=1}^N tn xn        µ2 = (1/N2) ∑_{n=1}^N (1 − tn) xn

• These are the means of the training examples from each class
• Shared covariance matrix (N = N1 + N2):

Σ = (N1/N) S1 + (N2/N) S2,  where Sk = (1/Nk) ∑_{n∈Ck} (xn − µk)(xn − µk)T

• A weighted average of the class covariances
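The closed-form ML estimates above can be sketched directly in NumPy (the toy dataset and all names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)
# sample toy data from two Gaussians (spherical covariance for simplicity)
N1, N2 = 60, 40
X1 = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(N1, 2))   # class C1
X2 = rng.normal(loc=[3.0, 3.0], scale=1.0, size=(N2, 2))   # class C2
X = np.vstack([X1, X2])
t = np.concatenate([np.ones(N1), np.zeros(N2)])            # t=1 for C1, t=0 for C2

# closed-form ML estimates from the slides
N = len(t)
pi_hat = t.sum() / N                                # = N1 / (N1 + N2)
mu1 = (t[:, None] * X).sum(axis=0) / t.sum()        # mean of class-1 examples
mu2 = ((1 - t)[:, None] * X).sum(axis=0) / (1 - t).sum()
S1 = (X[t == 1] - mu1).T @ (X[t == 1] - mu1) / N1   # per-class covariances
S2 = (X[t == 0] - mu2).T @ (X[t == 0] - mu2) / N2
Sigma = (N1 / N) * S1 + (N2 / N) * S2               # weighted average

assert abs(pi_hat - 0.6) < 1e-12                    # fraction of points in C1
assert np.allclose(mu1, X1.mean(axis=0))            # class mean recovered
assert Sigma.shape == (2, 2) and np.allclose(Sigma, Sigma.T)
```

All four estimates are simple sums over the data, which is why fitting this generative model needs no iterative optimization at all.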
![Page 85: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/85.jpg)
Discriminant Functions Generative Models Discriminative Models
Probabilistic Generative Models Summary
• Fitting Gaussians using the ML criterion is sensitive to outliers
• The simple linear form for a in the logistic sigmoid occurs for more than just Gaussian distributions
• It arises for any distribution in the exponential family, a large class of distributions
![Page 86: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/86.jpg)
Discriminant Functions Generative Models Discriminative Models
Outline
Discriminant Functions
Generative Models
Discriminative Models
![Page 87: Linear Models for Classificationoschulte/teaching/726/spring11/...Discriminant FunctionsGenerative ModelsDiscriminative Models Generalized Linear Models Similar to previous chapter](https://reader035.vdocuments.mx/reader035/viewer/2022081601/60ee9861b7aebd7e74681ced/html5/thumbnails/87.jpg)
Discriminant Functions Generative Models Discriminative Models
Probabilistic Discriminative Models
• The generative model made assumptions about the form of the class-conditional distributions (e.g. Gaussian), which resulted in a logistic sigmoid of a linear function of x
• A discriminative model instead uses the functional form explicitly:

p(C1|x) = 1 / (1 + exp(−(wTx + w0)))

and finds w directly
• The generative model had 2M + M(M + 1)/2 + 1 parameters, where M is the dimensionality of x
• The discriminative model has only M + 1 parameters
Maximum Likelihood Learning - Discriminative Model
• As usual we can use the maximum likelihood criterion for learning:

p(t|w) = ∏_{n=1}^{N} y_n^{t_n} (1 − y_n)^{1−t_n}, where y_n = p(C1|x_n)

• Taking the negative log gives the cross-entropy error E(w) = −∑_n {t_n ln y_n + (1 − t_n) ln(1 − y_n)}, whose gradient is

∇E(w) = ∑_{n=1}^{N} (y_n − t_n) x_n

• This time there is no closed-form solution, since y_n = σ(wᵀx_n)
  • Could use (stochastic) gradient descent
  • But Iterative Reweighted Least Squares (IRLS) is a better technique
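As a sketch of the gradient-descent option (the toy data and function names here are illustrative, not from the slides), the update uses exactly the gradient ∑(y_n − t_n)x_n above:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_gd(X, t, lr=0.1, n_iter=1000):
    """Batch gradient descent on the logistic-regression error E(w).

    X is N x (M+1) with a leading column of ones absorbing the bias w0;
    t is a length-N vector of 0/1 targets.
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y = sigmoid(X @ w)        # y_n = p(C1 | x_n)
        grad = X.T @ (y - t)      # sum_n (y_n - t_n) x_n
        w -= lr * grad / len(t)
    return w

# Toy 1D two-class data: class 0 centred at -2, class 1 at +2.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((100, 1)),
               np.vstack([rng.normal(-2, 1, (50, 1)),
                          rng.normal(+2, 1, (50, 1))])])
t = np.concatenate([np.zeros(50), np.ones(50)])
w = fit_logistic_gd(X, t)
acc = np.mean((sigmoid(X @ w) > 0.5) == t)
```

The gradient has the same (y_n − t_n)x_n form as the least-squares gradient for linear regression; the difference is that y_n passes through the sigmoid, which is why no closed-form solution exists.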
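A minimal sketch of the IRLS alternative (the Newton-Raphson update from PRML Sec. 4.3.3; the small ridge term `reg` is my own addition to keep the Hessian invertible on separable data):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic_irls(X, t, n_iter=20, reg=1e-3):
    """Iterative Reweighted Least Squares for logistic regression.

    Newton update: w <- w - (X^T R X)^{-1} X^T (y - t),
    with R = diag(y_n (1 - y_n)).
    """
    w = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y = sigmoid(X @ w)
        r = y * (1.0 - y)                      # diagonal of R
        H = X.T @ (r[:, None] * X) + reg * np.eye(X.shape[1])
        w = w - np.linalg.solve(H, X.T @ (y - t))
    return w

# Toy 1D two-class data with a bias column of ones.
rng = np.random.default_rng(1)
X = np.hstack([np.ones((100, 1)),
               np.vstack([rng.normal(-2, 1, (50, 1)),
                          rng.normal(+2, 1, (50, 1))])])
t = np.concatenate([np.zeros(50), np.ones(50)])
w = fit_logistic_irls(X, t)
acc = np.mean((sigmoid(X @ w) > 0.5) == t)
```

Each Newton step solves a weighted least-squares problem, which is where the name comes from; on a problem like this it typically converges in far fewer iterations than plain gradient descent.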
Generative vs. Discriminative
• Generative models
  • Can generate synthetic example data
    • Perhaps accurate classification is equivalent to accurate synthesis
  • Support learning with missing data
  • Tend to have more parameters
  • Require a good model of the class distributions
• Discriminative models
  • Only usable for classification
  • Don't solve a harder problem than you need to
  • Tend to have fewer parameters
  • Require a good model of the decision boundary
Conclusion
• Readings: Ch. 4.1.1-4.1.4, 4.1.7, 4.2.1-4.2.2, 4.3.1-4.3.3
• Generalized linear models y(x) = f(wᵀx + w₀)
• Threshold/max function for f(·)
  • Minimize with least squares
  • Fisher criterion - class separation
  • Perceptron criterion - mis-classified examples
• Probabilistic models: logistic sigmoid / softmax for f(·)
  • Generative model - assume class-conditional densities in the exponential family; obtain sigmoid
  • Discriminative model - directly model the posterior using the sigmoid (a.k.a. logistic regression, though it performs classification)
  • Can learn either using maximum likelihood
• All of these models are limited to linear decision boundaries in feature space