05 machine learning - supervised fisher classifier
TRANSCRIPT
Machine Learning for Data Mining: Fisher Linear Discriminant
Andres Mendez-Vazquez
May 26, 2015
1 / 43
Outline
1 Introduction
  History
  The Rotation Idea
  Classifiers as Machines for Dimensionality Reduction
2 Solution
  Use the Mean of each Class
  Scatter Measure
  The Cost Function
  A Transformation for Simplification and Defining the Cost Function
3 Where is this used?
  Applications
4 Relation with Least Squared Error
  What?
2 / 43
Rotation
Projecting
Projecting well-separated samples onto an arbitrary line usually produces a confused mixture of samples from all of the classes and thus produces poor recognition performance.
Something Notable
However, moving and rotating the line around might result in an orientation for which the projected samples are well separated.
Fisher linear discriminant (FLD)
It is a discriminant analysis that seeks directions that are efficient for discriminating in a binary classification problem.
5 / 43
This is actually coming from...
Classifier as
A machine for dimensionality reduction.
Initial Setup
We have:
$N$ $d$-dimensional samples $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N$.
$N_i$ is the number of samples in class $C_i$, for $i = 1, 2$.
Then, we ask for the projection of each $\mathbf{x}_i$ onto the line by means of
$$y_i = \mathbf{w}^T \mathbf{x}_i \qquad (1)$$
8 / 43
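A minimal NumPy sketch of this projection step; the data matrix X and the direction w below are illustrative placeholders, not values from the slides. Each row of X is one sample $\mathbf{x}_i$, and $y_i = \mathbf{w}^T\mathbf{x}_i$ collapses it to a single scalar on the line.

```python
import numpy as np

# Illustrative placeholders: 6 samples in d = 2 dimensions and an arbitrary direction w.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])
w = np.array([0.6, 0.8])

# y_i = w^T x_i : every d-dimensional sample becomes one scalar.
y = X @ w
print(y.shape)  # (6,) -- one projected value per sample
```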
Use the mean of each Class
Then
Select $\mathbf{w}$ such that class separation is maximized.
We then define the mean sample for each class
1. $C_1 \Rightarrow \mathbf{m}_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} \mathbf{x}_i$
2. $C_2 \Rightarrow \mathbf{m}_2 = \frac{1}{N_2}\sum_{i=1}^{N_2} \mathbf{x}_i$
Ok!!! This is giving us a measure of distance
Thus, we want to maximize the distance between the projected means:
$$m_1 - m_2 = \mathbf{w}^T (\mathbf{m}_1 - \mathbf{m}_2) \qquad (2)$$
where $m_k = \mathbf{w}^T \mathbf{m}_k$ for $k = 1, 2$.
10 / 43
However
We could simply seek
$$\max_{\mathbf{w}} \; \mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2) \quad \text{s.t.} \quad \sum_{i=1}^{d} w_i = 1$$
After all
We do not care about the magnitude of $\mathbf{w}$.
11 / 43
Fixing the Problem
To obtain good separation of the projected data
The difference between the means should be large relative to some measure of the standard deviations for each class.
We define a SCATTER measure (based on the sample variance)
$$s_k^2 = \sum_{\mathbf{x}_i \in C_k} \left(\mathbf{w}^T\mathbf{x}_i - m_k\right)^2 = \sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_k} (y_i - m_k)^2 \qquad (3)$$
We then define the within-class variance for the whole data
$$s_1^2 + s_2^2 \qquad (4)$$
14 / 43
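As a quick illustration, the sketch below (with made-up class labels and the same placeholder arrays as before) computes the projected class means and the scatters $s_1^2$, $s_2^2$ of equation (3) directly from the projections $y_i = \mathbf{w}^T\mathbf{x}_i$.

```python
import numpy as np

# Illustrative two-class toy data (not from the slides): rows are samples, labels in {1, 2}.
X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])
labels = np.array([1, 1, 1, 2, 2, 2])
w = np.array([0.6, 0.8])

y = X @ w  # projections y_i = w^T x_i

# Projected class means m_k and scatters s_k^2 = sum_{y_i in C_k} (y_i - m_k)^2
m = {k: y[labels == k].mean() for k in (1, 2)}
s2 = {k: ((y[labels == k] - m[k]) ** 2).sum() for k in (1, 2)}

within_class = s2[1] + s2[2]          # equation (4)
between_class = (m[1] - m[2]) ** 2    # squared distance between projected means
print(within_class, between_class)
```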
Finally, a Cost Function
The between-class variance
$$(m_1 - m_2)^2 \qquad (5)$$
The Fisher criterion
$$\frac{\text{between-class variance}}{\text{within-class variance}} \qquad (6)$$
Finally
$$J(\mathbf{w}) = \frac{(m_1 - m_2)^2}{s_1^2 + s_2^2} \qquad (7)$$
16 / 43
We use a transformation to simplify our life
First
$$J(\mathbf{w}) = \frac{\left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^2}{\sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_1} (y_i - m_1)^2 + \sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_2} (y_i - m_2)^2} \qquad (8)$$
Second
$$J(\mathbf{w}) = \frac{\left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)\left(\mathbf{w}^T\mathbf{m}_1 - \mathbf{w}^T\mathbf{m}_2\right)^T}{\sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_1} \left(\mathbf{w}^T\mathbf{x}_i - m_1\right)\left(\mathbf{w}^T\mathbf{x}_i - m_1\right)^T + \sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_2} \left(\mathbf{w}^T\mathbf{x}_i - m_2\right)\left(\mathbf{w}^T\mathbf{x}_i - m_2\right)^T} \qquad (9)$$
Third
$$J(\mathbf{w}) = \frac{\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)\left(\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)\right)^T}{\sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_1} \mathbf{w}^T(\mathbf{x}_i - \mathbf{m}_1)\left(\mathbf{w}^T(\mathbf{x}_i - \mathbf{m}_1)\right)^T + \sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_2} \mathbf{w}^T(\mathbf{x}_i - \mathbf{m}_2)\left(\mathbf{w}^T(\mathbf{x}_i - \mathbf{m}_2)\right)^T} \qquad (10)$$
18 / 43
Transformation
Fourth
$$J(\mathbf{w}) = \frac{\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}{\sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_1} \mathbf{w}^T(\mathbf{x}_i - \mathbf{m}_1)(\mathbf{x}_i - \mathbf{m}_1)^T\mathbf{w} + \sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_2} \mathbf{w}^T(\mathbf{x}_i - \mathbf{m}_2)(\mathbf{x}_i - \mathbf{m}_2)^T\mathbf{w}} \qquad (11)$$
Fifth
$$J(\mathbf{w}) = \frac{\mathbf{w}^T(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}}{\mathbf{w}^T\left[\sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_1} (\mathbf{x}_i - \mathbf{m}_1)(\mathbf{x}_i - \mathbf{m}_1)^T + \sum_{y_i = \mathbf{w}^T\mathbf{x}_i \in C_2} (\mathbf{x}_i - \mathbf{m}_2)(\mathbf{x}_i - \mathbf{m}_2)^T\right]\mathbf{w}} \qquad (12)$$
Now Rename
$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_w \mathbf{w}} \qquad (13)$$
19 / 43
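A short NumPy sketch of the rewritten criterion (13), again on placeholder toy data: $S_B$ is the between-class scatter $(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T$ and $S_w$ is the within-class scatter from equation (12), so $J(\mathbf{w})$ becomes a ratio of two quadratic forms.

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [2.0, 2.2],
              [6.0, 7.0], [6.5, 6.8], [7.0, 7.5]])  # illustrative toy data
labels = np.array([1, 1, 1, 2, 2, 2])
w = np.array([0.6, 0.8])

m1 = X[labels == 1].mean(axis=0)
m2 = X[labels == 2].mean(axis=0)

# Between-class scatter S_B = (m1 - m2)(m1 - m2)^T
d = (m1 - m2).reshape(-1, 1)
S_B = d @ d.T

# Within-class scatter S_w = sum over both classes of (x_i - m_k)(x_i - m_k)^T
S_w = np.zeros((2, 2))
for k, mk in ((1, m1), (2, m2)):
    D = X[labels == k] - mk
    S_w += D.T @ D

# Fisher criterion J(w) = w^T S_B w / w^T S_w w  -- equation (13)
J = (w @ S_B @ w) / (w @ S_w @ w)
print(J)
```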
Derive with respect to w
Thus
$$\frac{dJ(\mathbf{w})}{d\mathbf{w}} = \frac{d\left[\left(\mathbf{w}^T S_B \mathbf{w}\right)\left(\mathbf{w}^T S_w \mathbf{w}\right)^{-1}\right]}{d\mathbf{w}} = 0 \qquad (14)$$
Then
$$\frac{dJ(\mathbf{w})}{d\mathbf{w}} = \left(S_B\mathbf{w} + S_B^T\mathbf{w}\right)\left(\mathbf{w}^T S_w \mathbf{w}\right)^{-1} - \left(\mathbf{w}^T S_B \mathbf{w}\right)\left(\mathbf{w}^T S_w \mathbf{w}\right)^{-2}\left(S_w\mathbf{w} + S_w^T\mathbf{w}\right) = 0 \qquad (15)$$
Now, because of the symmetry of $S_B$ and $S_w$
$$\frac{dJ(\mathbf{w})}{d\mathbf{w}} = \frac{S_B\mathbf{w}\left(\mathbf{w}^T S_w \mathbf{w}\right) - \left(\mathbf{w}^T S_B \mathbf{w}\right)S_w\mathbf{w}}{\left(\mathbf{w}^T S_w \mathbf{w}\right)^2} = 0 \qquad (16)$$
20 / 43
Derive with respect to w
Thus
$$\frac{dJ(\mathbf{w})}{d\mathbf{w}} = \frac{S_B\mathbf{w}\left(\mathbf{w}^T S_w \mathbf{w}\right) - \left(\mathbf{w}^T S_B \mathbf{w}\right)S_w\mathbf{w}}{\left(\mathbf{w}^T S_w \mathbf{w}\right)^2} = 0 \qquad (17)$$
Then
$$\left(\mathbf{w}^T S_w \mathbf{w}\right) S_B\mathbf{w} = \left(\mathbf{w}^T S_B \mathbf{w}\right) S_w\mathbf{w} \qquad (18)$$
21 / 43
Now, Several Tricks!!!
First
$$S_B\mathbf{w} = (\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = \alpha(\mathbf{m}_1 - \mathbf{m}_2) \qquad (19)$$
Where $\alpha = (\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$ is a simple constant
It means that $S_B\mathbf{w}$ is always in the direction of $\mathbf{m}_1 - \mathbf{m}_2$!!!
In addition
$\mathbf{w}^T S_w \mathbf{w}$ and $\mathbf{w}^T S_B \mathbf{w}$ are scalar constants.
22 / 43
Now, Several Tricks!!!
Finally
$$S_w\mathbf{w} \propto (\mathbf{m}_1 - \mathbf{m}_2) \;\Rightarrow\; \mathbf{w} \propto S_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2) \qquad (20)$$
Once the data is transformed into $y_i$
Use a threshold $y_0$: $\mathbf{x} \in C_1$ iff $y(\mathbf{x}) \geq y_0$, or $\mathbf{x} \in C_2$ iff $y(\mathbf{x}) < y_0$.
Alternatively, maximum likelihood with a Gaussian can be used to classify the transformed data using Naive Bayes (by the Central Limit Theorem, $y = \mathbf{w}^T\mathbf{x}$ is a sum of random variables).
23 / 43
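Putting the pieces together, here is an end-to-end sketch of a two-class Fisher discriminant as the slides describe it: $\mathbf{w} \propto S_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ from equation (20). The midpoint of the projected class means is used as an illustrative threshold $y_0$ (the slides leave the choice of $y_0$ open), and the data are synthetic placeholders.

```python
import numpy as np

def fit_fisher(X1, X2):
    """Return (w, y0) for a two-class Fisher linear discriminant.

    X1, X2: arrays of shape (N1, d) and (N2, d) with the samples of C1 and C2.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_w (equation 12)
    S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # w proportional to S_w^{-1}(m1 - m2) (equation 20)
    w = np.linalg.solve(S_w, m1 - m2)
    # Illustrative threshold: midpoint of the projected class means.
    y0 = 0.5 * (w @ m1 + w @ m2)
    return w, y0

def predict(X, w, y0):
    # x in C1 iff y(x) = w^T x >= y0, else C2.
    return np.where(X @ w >= y0, 1, 2)

# Toy usage with synthetic Gaussian blobs (not data from the slides).
rng = np.random.default_rng(0)
X1 = rng.normal([0, 0], 1.0, size=(50, 2))
X2 = rng.normal([4, 3], 1.0, size=(50, 2))
w, y0 = fit_fisher(X1, X2)
print(predict(np.vstack([X1, X2]), w, y0))
```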
Applications
Something Notable
Bankruptcy prediction
In bankruptcy prediction based on accounting ratios and other financial variables, linear discriminant analysis was the first statistical method applied to systematically explain which firms entered bankruptcy vs. survived.
Face recognition
In computerized face recognition, each face is represented by a large number of pixel values. The linear combinations obtained using Fisher's linear discriminant are called Fisher faces.
Marketing
In marketing, discriminant analysis was once often used to determine the factors which distinguish different types of customers and/or products on the basis of surveys or other forms of collected data.
25 / 43
Applications
Something Notable
Biomedical studies
The main application of discriminant analysis in medicine is the assessment of the severity state of a patient and the prognosis of disease outcome.
26 / 43
Please
Your Reading Material, it is about the multi-class case
Section 4.1.6, "Fisher's discriminant for multiple classes", in "Pattern Recognition" by Bishop.
27 / 43
Relation with Least Squared Error
First
The least-squares approach to the determination of a linear discriminant was based on the goal of making the model predictions as close as possible to a set of target values.
Second
The Fisher criterion was derived by requiring maximum class separation in the output space.
29 / 43
How do we do this?
First
We have $N$ samples.
We have $N_1$ samples for class $C_1$.
We have $N_2$ samples for class $C_2$.
So, we choose the following targets for each class
The target for class $C_1$ is $\frac{N}{N_1}$.
The target for class $C_2$ is $-\frac{N}{N_2}$.
30 / 43
Thus
The new cost function
$$E = \frac{1}{2}\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right)^2 \qquad (21)$$
Deriving with respect to $\mathbf{w}$
$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right)\mathbf{x}_n = 0 \qquad (22)$$
Deriving with respect to $w_0$
$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right) = 0 \qquad (23)$$
31 / 43
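As a numerical companion to equations (21)-(23), the sketch below builds the $\pm N/N_k$ target coding from the previous slide and minimizes $E$ with an ordinary least-squares solve; the column of ones carries the bias $w_0$. All data here are synthetic placeholders, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
X1 = rng.normal([0, 0], 1.0, size=(40, 2))   # class C1 (synthetic)
X2 = rng.normal([4, 3], 1.0, size=(60, 2))   # class C2 (synthetic)
X = np.vstack([X1, X2])
N, N1, N2 = len(X), len(X1), len(X2)

# Target coding from the slides: t_n = N/N1 for C1 and -N/N2 for C2.
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])

# Minimize E = 1/2 sum_n (w^T x_n + w_0 - t_n)^2 via least squares.
A = np.hstack([np.ones((N, 1)), X])          # first column handles w_0
sol, *_ = np.linalg.lstsq(A, t, rcond=None)
w0, w = sol[0], sol[1:]
print(w0, w)
```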
Then
We have that
$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0 - t_n\right) = \sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0\right) - \sum_{n=1}^{N} t_n = \sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0\right) - N_1\frac{N}{N_1} + N_2\frac{N}{N_2} = \sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0\right)$$
Then
$$\left(\sum_{n=1}^{N}\mathbf{w}^T\mathbf{x}_n\right) + N w_0 = 0$$
32 / 43
Then
We have that
$$w_0 = -\mathbf{w}^T\left(\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n\right)$$
We rename $\frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n = \mathbf{m}$
$$\mathbf{m} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{x}_n = \frac{1}{N}\left[N_1\mathbf{m}_1 + N_2\mathbf{m}_2\right]$$
Finally
$$w_0 = -\mathbf{w}^T\mathbf{m}$$
33 / 43
Thus, we have
Something Notable
$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0\right)\mathbf{x}_n - \frac{N}{N_1}\sum_{\mathbf{x}_n \in C_1}\mathbf{x}_n + \frac{N}{N_2}\sum_{\mathbf{x}_n \in C_2}\mathbf{x}_n = 0$$
Thus
$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n + w_0\right)\mathbf{x}_n - N(\mathbf{m}_1 - \mathbf{m}_2) = 0$$
35 / 43
Next
Then
$$\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n - \mathbf{w}^T\mathbf{m}\right)\mathbf{x}_n = N(\mathbf{m}_1 - \mathbf{m}_2)$$
Thus
$$\left[\sum_{n=1}^{N}\left(\mathbf{w}^T\mathbf{x}_n - \mathbf{w}^T\mathbf{m}\right)\mathbf{x}_n\right] = N(\mathbf{m}_1 - \mathbf{m}_2)$$
Now
What? Please do this as take-home work for the next class.
36 / 43
Now, Do you have the solution?
You have a version in Duda and Hart, Section 5.8
$$\mathbf{w} = \left(X^T X\right)^{-1} X^T \mathbf{y}$$
Thus
$$X^T X \mathbf{w} = X^T \mathbf{y}$$
Now, we rewrite the data matrix with the normalization discussed earlier
$$X = \begin{bmatrix} \mathbf{1}_1 & X_1 \\ -\mathbf{1}_2 & -X_2 \end{bmatrix}$$
37 / 43
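A minimal sketch of this normal-equation route, under the same synthetic-data assumption as the earlier snippets: the rows of class $C_2$ are negated (the normalization above), the right-hand side uses the $N/N_k$ targets, and the solution comes from the pseudo-inverse. The final comparison against $S_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ anticipates the result derived on the next slides.

```python
import numpy as np

rng = np.random.default_rng(2)
X1 = rng.normal([0, 0], 1.0, size=(40, 2))   # class C1 (synthetic placeholder)
X2 = rng.normal([4, 3], 1.0, size=(60, 2))   # class C2 (synthetic placeholder)
N1, N2 = len(X1), len(X2)
N = N1 + N2

# Sign-normalized, augmented data matrix  X = [[ 1_1,  X1], [-1_2, -X2]]
A = np.vstack([
    np.hstack([np.ones((N1, 1)), X1]),
    np.hstack([-np.ones((N2, 1)), -X2]),
])
# Target vector after normalization:  y = [ (N/N1) 1_1 ; (N/N2) 1_2 ]
y = np.concatenate([np.full(N1, N / N1), np.full(N2, N / N2)])

# w_full = (X^T X)^{-1} X^T y  (pseudo-inverse solution); first entry is w_0.
w_full = np.linalg.pinv(A) @ y
w0, w = w_full[0], w_full[1:]

# The direction w should be proportional to S_w^{-1}(m1 - m2).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
fisher_dir = np.linalg.solve(S_w, m1 - m2)
print(w / fisher_dir)   # roughly the same constant in every component
```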
Thus, we have
Something Notable
$$\begin{bmatrix} \mathbf{1}_1^T & -\mathbf{1}_2^T \\ X_1^T & -X_2^T \end{bmatrix}\begin{bmatrix} \mathbf{1}_1 & X_1 \\ -\mathbf{1}_2 & -X_2 \end{bmatrix}\begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix} = \begin{bmatrix} \mathbf{1}_1^T & -\mathbf{1}_2^T \\ X_1^T & -X_2^T \end{bmatrix}\begin{bmatrix} \frac{N}{N_1}\mathbf{1}_1 \\ \frac{N}{N_2}\mathbf{1}_2 \end{bmatrix}$$
Thus, if we use the following definitions
$$\mathbf{m}_i = \frac{1}{N_i}\sum_{\mathbf{x} \in C_i}\mathbf{x}$$
$$S_w = \sum_{\mathbf{x}_i \in C_1}(\mathbf{x}_i - \mathbf{m}_1)(\mathbf{x}_i - \mathbf{m}_1)^T + \sum_{\mathbf{x}_i \in C_2}(\mathbf{x}_i - \mathbf{m}_2)(\mathbf{x}_i - \mathbf{m}_2)^T$$
39 / 43
Then
We have the following
$$\begin{bmatrix} n & (n_1\mathbf{m}_1 + n_2\mathbf{m}_2)^T \\ (n_1\mathbf{m}_1 + n_2\mathbf{m}_2) & S_w + n_1\mathbf{m}_1\mathbf{m}_1^T + n_2\mathbf{m}_2\mathbf{m}_2^T \end{bmatrix}\begin{bmatrix} w_0 \\ \mathbf{w} \end{bmatrix} = \begin{bmatrix} 0 \\ n(\mathbf{m}_1 - \mathbf{m}_2) \end{bmatrix}$$
Then
$$\begin{bmatrix} n w_0 + (n_1\mathbf{m}_1 + n_2\mathbf{m}_2)^T\mathbf{w} \\ (n_1\mathbf{m}_1 + n_2\mathbf{m}_2) w_0 + \left[S_w + n_1\mathbf{m}_1\mathbf{m}_1^T + n_2\mathbf{m}_2\mathbf{m}_2^T\right]\mathbf{w} \end{bmatrix} = \begin{bmatrix} 0 \\ n(\mathbf{m}_1 - \mathbf{m}_2) \end{bmatrix}$$
40 / 43
Thus
We have that
$$w_0 = -\mathbf{w}^T\mathbf{m}$$
$$\left[\frac{1}{n} S_w + \frac{n_1 n_2}{n^2}(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\right]\mathbf{w} = \mathbf{m}_1 - \mathbf{m}_2$$
Thus
Since the vector $(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w}$ is in the direction of $\mathbf{m}_1 - \mathbf{m}_2$,
we have that
$$\frac{n_1 n_2}{n^2}(\mathbf{m}_1 - \mathbf{m}_2)(\mathbf{m}_1 - \mathbf{m}_2)^T\mathbf{w} = (1 - \alpha)(\mathbf{m}_1 - \mathbf{m}_2)$$
41 / 43
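The boxed relation implies $\mathbf{w} \propto S_w^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$ once more, so the least-squares solution points along the Fisher direction. A quick check on synthetic placeholder data (same assumptions as the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(3)
X1 = rng.normal([0, 0], 1.0, size=(30, 2))   # synthetic class C1
X2 = rng.normal([3, 4], 1.0, size=(70, 2))   # synthetic class C2
n1, n2 = len(X1), len(X2)
n = n1 + n2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_w = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

# Solve [ (1/n) S_w + (n1*n2/n^2) (m1-m2)(m1-m2)^T ] w = m1 - m2
d = m1 - m2
A = S_w / n + (n1 * n2 / n**2) * np.outer(d, d)
w_lse = np.linalg.solve(A, d)

# ...and compare its direction with the Fisher direction S_w^{-1}(m1 - m2).
w_fisher = np.linalg.solve(S_w, d)
cos = (w_lse @ w_fisher) / (np.linalg.norm(w_lse) * np.linalg.norm(w_fisher))
print(cos)   # expected to be ~1.0: the two directions coincide
```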