
Linear vs. Quadratic Discriminant Classifier: an Overview

Alaa Tharwat

Electrical Dept. - Suez Canal University - Egypt
Scientific Research Group in Egypt (SRGE)

Email: [email protected]

April 2, 2016


Agenda

Introduction

Building a Classifier Model

Numerical Examples

Conclusions and Future Work


Introduction

The main objective is to:

Explain the principles of linear and quadratic classifiers.

Introduce numerical examples to explain how linear and quadratic classifiers work.


Introduction

A pattern or sample is represented by a vector or a set of m features, which represents one point in the m-dimensional space (R^m) called the pattern space.

The goal of the pattern classification process is to train a model using the labelled patterns to assign a class label to an unknown pattern.

The classifier is represented by c decision or discriminant functions ({f1, f2, . . . , fc}).

The decision functions are used to determine the decision boundaries between classes and the region or area of each class.


Introduction

Figure: The structure of building a classifier, which includes N samples and c discriminant functions or classes. The diagram shows the input sample (xi ∈ R^m, with features x1, . . . , xm), the discriminant functions f1(x), f2(x), . . . , fc(x), and a maximum selector that outputs the class label.


Introduction

Discriminant functions are used to build the decision boundaries that discriminate between the different classes, i.e. that separate them into different regions (ωi, i = 1, 2, . . . , c).

Assume we have two classes (ω1) and (ω2); thus, there are two different discriminant functions (f1 and f2), and the decision boundary is calculated as follows: S12 = f1 − f2. The decision region or class label of an unknown pattern x is calculated as follows:

sgn(S12(x)) = sgn(f1(x) − f2(x)) =
    Class 1    : for S12(x) > 0
    Undefined  : for S12(x) = 0
    Class 2    : for S12(x) < 0        (1)
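A minimal sketch of this two-class sign rule in Python (NumPy); the two discriminant functions below are placeholders chosen only for illustration, not functions from the slides:

```python
import numpy as np

def f1(x):                       # illustrative discriminant function (placeholder)
    return -0.5 * x @ x + np.array([3.5, 4.5]) @ x

def f2(x):                       # illustrative discriminant function (placeholder)
    return -0.5 * x @ x + np.array([3.5, 2.5]) @ x

def classify(x):
    """Label x from the sign of S12(x) = f1(x) - f2(x), as in Equation (1)."""
    s12 = f1(x) - f2(x)
    if s12 > 0:
        return "Class 1"
    if s12 < 0:
        return "Class 2"
    return "Undefined (on the boundary)"

print(classify(np.array([2.0, 4.0])))   # prints "Class 1"
```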


Introduction

The posterior probability of three classes.

Figure: Decision regions of three classes. The top panel shows the curves P(x|ω1)P(ω1), P(x|ω2)P(ω2), and P(x|ω3)P(ω3) over x; the bottom panel shows the decision boundaries S12, S13, and S23 that separate the regions ω1, ω2, and ω3 in the (x1, x2) plane.


Building a Classifier Model

Let ω1, ω2, . . . , ωc be the set of c classes; P(x|ωi) represents the likelihood function.

P(ωi) represents the prior probability of each class, which reflects the prior knowledge about that class; it is simply equal to the ratio between the number of samples in that class and the total number of samples in all classes (N).

Bayes formula calculates the posterior probability from the prior and the likelihood as follows:

P(ω = ωi | x) = P(x | ω = ωi) P(ωi) / P(x) = (likelihood × prior) / evidence        (2)

P(ω = ωi | x) represents the posterior probability (a posteriori), and P(x) represents the evidence, which is calculated as follows: P(x) = ∑_{i=1}^{c} P(x | ω = ωi) P(ωi).

P(x) is used only to scale the expressions in Equation (2), so that the posterior probabilities sum to one (∑_{i=1}^{c} P(ωi | x) = 1).

Generally, P(ωi | x) is calculated using the likelihood (P(x|ωi)) and the prior probability (P(ωi)).
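A small sketch of Equation (2), with made-up likelihood and prior values just to show that dividing by the evidence P(x) makes the posteriors sum to one:

```python
import numpy as np

# Hypothetical likelihoods P(x|w_i) and priors P(w_i) for c = 3 classes.
likelihood = np.array([0.05, 0.20, 0.01])
prior      = np.array([4/12, 4/12, 4/12])

evidence  = np.sum(likelihood * prior)        # P(x) = sum_i P(x|w_i) P(w_i)
posterior = likelihood * prior / evidence     # P(w_i|x), Equation (2)

print(posterior, posterior.sum())             # the posteriors sum to 1
print("predicted class:", int(np.argmax(posterior)) + 1)
```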


Building a Classifier Model

Assume that P(x|ωi) is normally distributed (P(x|ωi) ∼ N(µi, Σi)) as follows:

P(x|ωi) = N(µi, Σi) = (1 / √((2π)^m |Σi|)) exp(−½ (x − µi)^T Σi^{-1} (x − µi))        (3)

where µi represents the mean of the i-th class, Σi is the covariance matrix of the i-th class, |Σi| and Σi^{-1} represent the determinant and inverse of the covariance matrix, respectively, and m represents the number of features or variables of the sample (x).

Σ =
[ var(x1, x1)   cov(x1, x2)   . . .   cov(x1, xN)
  cov(x2, x1)   var(x2, x2)   . . .   cov(x2, xN)
      ...            ...      . . .       ...
  cov(xN, x1)   cov(xN, x2)   . . .   var(xN, xN) ]        (4)
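A direct NumPy transcription of Equation (3); the mean and covariance below are placeholders chosen only for illustration:

```python
import numpy as np

def gaussian_density(x, mu, cov):
    """Multivariate normal density N(mu, cov) evaluated at x (Equation (3))."""
    m = len(mu)
    diff = x - mu
    norm = 1.0 / np.sqrt((2 * np.pi) ** m * np.linalg.det(cov))
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(cov) @ diff)

# Placeholder mean and covariance matrix.
mu  = np.array([3.5, 4.5])
cov = np.array([[1.0, 0.0],
                [0.0, 1.0]])
print(gaussian_density(np.array([2.0, 2.0]), mu, cov))
```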


Building a Classifier Model Discriminant Functions for the Normal Density

Assume we have two classes ω1 and ω2, and each class has one discriminant function (fi, i = 1, 2). If we have an unknown pattern (x) and P(ω1|x) > P(ω2|x), then the unknown pattern belongs to the first class (ω1). Similarly, if P(ω2|x) > P(ω1|x), then x belongs to ω2.

fi(x) = ln(P(x|ω = ωi) P(ωi)) = ln(P(x|ω = ωi)) + ln(P(ωi)),  i = 1, 2
       (the common term ln P(x) is dropped because it is the same for both classes)

      = ln[ (1 / √((2π)^m |Σi|)) exp(−½ (x − µi)^T Σi^{-1} (x − µi)) ] + ln(P(ωi))

      = −½ (x − µi)^T Σi^{-1} (x − µi) − (m/2) ln(2π) − ½ ln|Σi| + ln(P(ωi))

      = −½ (x^T Σi^{-1} x + µi^T Σi^{-1} µi − 2 µi^T Σi^{-1} x) − (m/2) ln(2π) − ½ ln|Σi| + ln(P(ωi))        (5)


Building a Classifier Model Discriminant Functions for the Normal Density

The decision boundary or surface between the classes ω1 and ω2 is represented by the difference between the two discriminant functions as follows:

S12 = f1 − f2 = ln P(ω = ω1|x) − ln P(ω = ω2|x) = ln[ P(x|ω = ω1) P(ω1) / (P(x|ω = ω2) P(ω2)) ]

    = ln[ P(x|ω = ω1) / P(x|ω = ω2) ] + ln[ P(ω1) / P(ω2) ]

    = ln P(x|ω = ω1) + ln P(ω1) − ln P(x|ω = ω2) − ln P(ω2)        (6)


Building a Classifier Model Discriminant Functions for the Normal Density

S12(x) = −½ [ x^T Σ1^{-1} x − 2 µ1^T Σ1^{-1} x + µ1^T Σ1^{-1} µ1
            − (x^T Σ2^{-1} x − 2 µ2^T Σ2^{-1} x + µ2^T Σ2^{-1} µ2)
            + ln|Σ1| − ln|Σ2| ] + ln[ P(ω1) / P(ω2) ]

       = −½ x^T (Σ1^{-1} − Σ2^{-1}) x + (µ1^T Σ1^{-1} − µ2^T Σ2^{-1}) x
            − 0.5 (µ1^T Σ1^{-1} µ1 − µ2^T Σ2^{-1} µ2 + ln|Σ1| − ln|Σ2|) + ln[ P(ω1) / P(ω2) ]

       = x^T W x + w^T x + W0        (7)


Building a Classifier Model Discriminant Functions for the Normal Density

W = −½ (Σ1^{-1} − Σ2^{-1})        (8)

w^T = µ1^T Σ1^{-1} − µ2^T Σ2^{-1}        (9)

W0 = −0.5 (µ1^T Σ1^{-1} µ1 − µ2^T Σ2^{-1} µ2 + ln|Σ1| − ln|Σ2|) + ln[ P(ω1) / P(ω2) ]        (10)

where W0 represents the threshold or bias, w represents the slope (the coefficients of the linear term), and W is the coefficient matrix of the quadratic term x^T W x. Thus, the decision boundary is a quadratic function or curve, and the classifier is called a Quadratic Discriminant Classifier (QDC).

sgn(S12(x)) =
    +ve  if x^T W x + w^T x + W0 > 0  →  x ∈ ω1
     0   if x^T W x + w^T x + W0 = 0  (on the boundary)
    −ve  if x^T W x + w^T x + W0 < 0  →  x ∈ ω2        (11)
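A sketch of Equations (8)-(11) in NumPy: it builds W, w, and W0 for two classes and classifies a point from the sign of S12(x). The means, covariance matrices, and priors below are placeholders chosen only for illustration, not values from the slides.

```python
import numpy as np

def qdc_boundary(mu1, mu2, cov1, cov2, p1, p2):
    """Return W, w, W0 of S12(x) = x^T W x + w^T x + W0 (Equations (8)-(10))."""
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    W = -0.5 * (inv1 - inv2)                                           # (8)
    w = inv1 @ mu1 - inv2 @ mu2                                        # (9), as a column vector
    W0 = (-0.5 * (mu1 @ inv1 @ mu1 - mu2 @ inv2 @ mu2
                  + np.log(np.linalg.det(cov1)) - np.log(np.linalg.det(cov2)))
          + np.log(p1 / p2))                                           # (10)
    return W, w, W0

def sign_rule(x, W, w, W0):
    """Equation (11): the sign of S12(x) decides the class."""
    s12 = x @ W @ x + w @ x + W0
    return "omega_1" if s12 > 0 else ("omega_2" if s12 < 0 else "on the boundary")

# Placeholder class parameters.
mu1, mu2   = np.array([0.0, 0.0]), np.array([3.0, 3.0])
cov1, cov2 = np.eye(2), np.diag([4.0, 1.0])
W, w, W0   = qdc_boundary(mu1, mu2, cov1, cov2, p1=0.5, p2=0.5)
print(sign_rule(np.array([1.0, 1.0]), W, w, W0))   # the point is closer to mu1 -> omega_1
```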


Building a Classifier Model Special Case: Common Covariance Matrices

Assume the covariance matrices of all classes are equal (Σ1 = Σ2 = Σ); hence, the quadratic term W vanishes.

Similarly, the term ln|Σ1| − ln|Σ2| vanishes and W0 becomes easier to calculate.

Moreover, w becomes easier to compute.

The discriminant function is simplified from a quadratic to a linear function, and the classifier is called a Linear Discriminant Classifier (LDC).


Building a Classifier Model Special Case: Common Covariance Matrices

S12 = w^T x + W0        (12)

where

w = Σ^{-1} (µ1 − µ2)        (13)

and

W0 = −0.5 (µ1^T Σ^{-1} µ1 − µ2^T Σ^{-1} µ2) + ln[ P(ω1) / P(ω2) ]        (14)
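The corresponding LDC sketch for Equations (12)-(14), again with placeholder means, shared covariance matrix, and priors chosen only for illustration:

```python
import numpy as np

def ldc_boundary(mu1, mu2, cov, p1, p2):
    """Return w and W0 of the linear boundary S12(x) = w^T x + W0 (Equations (13)-(14))."""
    inv = np.linalg.inv(cov)
    w = inv @ (mu1 - mu2)                                              # (13)
    W0 = -0.5 * (mu1 @ inv @ mu1 - mu2 @ inv @ mu2) + np.log(p1 / p2)  # (14)
    return w, W0

# Placeholder means and a shared covariance matrix.
mu1, mu2 = np.array([3.5, 6.0]), np.array([2.5, 3.0])
cov = np.diag([1.0, 4.0])
w, W0 = ldc_boundary(mu1, mu2, cov, p1=0.5, p2=0.5)

x = np.array([3.0, 5.0])
s12 = w @ x + W0                                                       # (12)
print(s12, "->", "omega_1" if s12 > 0 else "omega_2")
```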


Building a Classifier Model Special Case: Common Covariance Matrices

The decision boundary is the point where S12 = 0, and this point is calculated as follows:

S12 = 0  →  (µ1 − µ2)^T Σ^{-1} x − 0.5 (µ1^T Σ^{-1} µ1 − µ2^T Σ^{-1} µ2) + ln[ P(ω1) / P(ω2) ] = 0        (15)

The decision boundary xDB is

xDB = (µ1 + µ2) / 2 + (Σ / (µ2 − µ1)) ln[ P(ω1) / P(ω2) ]        (16)

When the two classes are equiprobable, the second term vanishes and the decision boundary is the midpoint between the class means.

The decision boundary is closer to the class that has the lower prior probability. For example, if P(ωi) > P(ωj), then |µj − xDB| < |µi − xDB|.


Building a Classifier Model Special Case: Common Covariance Matrices

Figure: Steps of calculating the discriminant classifier given three classes, each class has four samples. The diagram shows the data matrix (X), the mean of each class (µi), the mean-centered data (Di = xi − µi), the covariance matrices (Σi = Di Di^T), the discriminant functions (fi), and the resulting decision boundaries (S12, S13, S23) in the (x1, x2) plane.


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

In this example, the features were statistically independent, i.e. all off-diagonal elements of the covariance matrices were zeros, and had the same variance (σ²). Thus,

1. The covariance matrices were diagonal and their diagonal elements were σ².

2. The geometrical interpretation of this case is that each class is centered around its mean, and the distances from the mean to all samples of the same class are equal.

3. The distributions of all classes are spherical in an m-dimensional space.


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

Given three different classes denoted by ω1, ω2, and ω3 (each row of a class matrix is a sample (x1, x2)):

ω1 = [3.00 4.00; 3.00 5.00; 4.00 4.00; 4.00 5.00],
ω2 = [3.00 2.00; 3.00 3.00; 4.00 2.00; 4.00 3.00],
ω3 = [6.00 2.00; 6.00 3.00; 7.00 2.00; 7.00 3.00]        (17)

The mean of each class is:

µ1 = [3.50 4.50], µ2 = [3.50 2.50], µ3 = [6.50 2.50]        (18)


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

Subtract the mean of each class from each sample in that class as follows:

D1 = [−0.50 −0.50; −0.50 0.50; 0.50 −0.50; 0.50 0.50],
D2 = [−0.50 −0.50; −0.50 0.50; 0.50 −0.50; 0.50 0.50],
D3 = [−0.50 −0.50; −0.50 0.50; 0.50 −0.50; 0.50 0.50]        (19)

The covariance matrix for each class (Σi) is:

Σ1 = Σ2 = Σ3 = [1.00 0.00; 0.00 1.00]        (20)

Σ1^{-1} = Σ2^{-1} = Σ3^{-1} = [1.00 0.00; 0.00 1.00]        (21)
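A short NumPy check of the numbers above. Note that, to reproduce the slides, the class "covariance" is computed as the unnormalized scatter Di^T Di of the mean-centered data (the deck's convention), not as the sample covariance divided by n − 1:

```python
import numpy as np

w1 = np.array([[3, 4], [3, 5], [4, 4], [4, 5]], dtype=float)
w2 = np.array([[3, 2], [3, 3], [4, 2], [4, 3]], dtype=float)
w3 = np.array([[6, 2], [6, 3], [7, 2], [7, 3]], dtype=float)

for i, w in enumerate([w1, w2, w3], start=1):
    mu = w.mean(axis=0)                  # class mean, Equation (18)
    D = w - mu                           # mean-centered data, Equation (19)
    cov = D.T @ D                        # scatter-style covariance, Equation (20)
    print(f"class {i}: mu = {mu}, cov =\n{cov}")
```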


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

The discriminant functions for each class are:

fi(x) = −½ (x − µi)^T Σi^{-1} (x − µi) − (m/2) ln(2π) − ½ ln|Σi| + ln(P(ωi))        (22)

(The term −(m/2) ln(2π) is identical for all classes, so it is dropped from the constants below; it does not affect the decision boundaries.)

f1 = −0.5x1² − 0.5x2² + 3.50x1 + 4.50x2 − 17.35
f2 = −0.5x1² − 0.5x2² + 3.50x1 + 2.50x2 − 10.35
f3 = −0.5x1² − 0.5x2² + 6.50x1 + 2.50x2 − 25.35        (23)

The decision boundaries between each pair of classes are as follows:

S12 = f1 − f2  →  x2 = 3.50
S13 = f1 − f3  →  x2 = 1.5x1 − 4.00
S23 = f2 − f3  →  x1 = 5.00        (24)

As shown, the decision boundary S12 depends only on x2. Thus, for all samples belonging to class ω1, x2 must be greater than 3.5 so that S12 is positive.
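A sketch that evaluates the three discriminant functions of Example 1 and checks which class wins on either side of the boundary x2 = 3.5; as in the slides, the common −(m/2) ln(2π) term is dropped:

```python
import numpy as np

means  = [np.array([3.5, 4.5]), np.array([3.5, 2.5]), np.array([6.5, 2.5])]
priors = [4/12, 4/12, 4/12]
cov_inv = np.eye(2)                       # every covariance matrix is the identity

def f(i, x):
    """f_i(x) from Equation (22); the common -(m/2) ln(2*pi) term is dropped,
    and ln|Sigma_i| = 0 because every covariance matrix is the identity."""
    mu = means[i]
    return -0.5 * (x - mu) @ cov_inv @ (x - mu) + np.log(priors[i])

for x in [np.array([3.0, 4.0]), np.array([3.0, 3.0])]:   # above and below x2 = 3.5
    scores = [f(i, x) for i in range(3)]
    print(x, np.round(scores, 2), "-> class", int(np.argmax(scores)) + 1)
```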


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

Figure: The calculated decision boundaries for three different classes where the features or variables are statistically independent and have the same variance. The plot shows the three classes, their means µ1, µ2, and µ3, and the boundaries S12 (x2 = 3.5), S13 (x2 = 1.5x1 − 4), and S23 (x1 = 5) in the (x1, x2) plane.


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

Figure: Classification of three Gaussian classes with the same covariance matrix (Σ1 = Σ2 = Σ3 = σ²I) (our first example). Top figure: the green, red, and blue surfaces represent the discriminant functions f1, f2, and f3, respectively. Bottom: decision boundaries (separation curves) S12 = f1 − f2, S13 = f1 − f3, and S23 = f2 − f3.


Numerical Examples Example 1: Equal Variance (Σi = σ2I)

Given an unknown or test sample (T = [2 2]):

f1 = −5.35,  f2 = −2.35,  f3 = −11.35        (25)

Since f2 is the maximum, the test sample is assigned to class ω2.

The slope of the discriminant function is not affected by changing the prior probability. On the other hand, the bias of each discriminant function changes according to the prior probability. Assume the prior probabilities of the three classes in our example were changed to be as follows: P(ω1) = 8/12, P(ω2) = 2/12, and P(ω3) = 2/12.

f1 = −0.5x1² − 0.5x2² + 3.50x1 + 4.50x2 − 16.94
f2 = −0.5x1² − 0.5x2² + 3.50x1 + 2.50x2 − 10.64
f3 = −0.5x1² − 0.5x2² + 6.50x1 + 2.50x2 − 25.64        (26)
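A sketch that reproduces Equation (25) by plugging the test sample T = [2, 2] into the three polynomials of Equation (23) and taking the maximum:

```python
import numpy as np

# Coefficients (a, b, c) of f_i = -0.5*x1**2 - 0.5*x2**2 + a*x1 + b*x2 + c, Equation (23).
coeffs = [(3.50, 4.50, -17.35),
          (3.50, 2.50, -10.35),
          (6.50, 2.50, -25.35)]

x1, x2 = 2.0, 2.0                                     # the test sample T
scores = [-0.5 * x1**2 - 0.5 * x2**2 + a * x1 + b * x2 + c for a, b, c in coeffs]
print(np.round(scores, 2))                            # [ -5.35  -2.35 -11.35]
print("assigned class:", int(np.argmax(scores)) + 1)  # class 2
```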


Numerical Examples Example 2: Equal Variance (Σi = Σ)

In this example, the covariance matrices of all classes were equal but arbitrary.

The variances of the variables were not equal.

The geometrical interpretation of this case is that the distributions of all classes are elliptical in the m-dimensional space.


Numerical Examples Example 2: Equal Variance (Σi = Σ)

Table: Feature values, mean, mean-centered data, and covariance matrices for all classes.

Pattern No. | x1   | x2   | Class | Mean (x1, x2) | D (x1, x2)   | Covariance Matrix (Σi)
1           | 3.00 | 5.00 | ω1    | 3.50, 6.00    | −0.50, −1.00 | Σ1 = [1.00 0.00; 0.00 4.00]
2           | 3.00 | 7.00 | ω1    |               | −0.50,  1.00 |
3           | 4.00 | 5.00 | ω1    |               |  0.50, −1.00 |
4           | 4.00 | 7.00 | ω1    |               |  0.50,  1.00 |
5           | 2.00 | 2.00 | ω2    | 2.50, 3.00    | −0.50, −1.00 | Σ2 = [1.00 0.00; 0.00 4.00]
6           | 2.00 | 4.00 | ω2    |               | −0.50,  1.00 |
7           | 3.00 | 2.00 | ω2    |               |  0.50, −1.00 |
8           | 3.00 | 4.00 | ω2    |               |  0.50,  1.00 |
9           | 6.00 | 1.00 | ω3    | 6.50, 2.00    | −0.50, −1.00 | Σ3 = [1.00 0.00; 0.00 4.00]
10          | 6.00 | 3.00 | ω3    |               | −0.50,  1.00 |
11          | 7.00 | 1.00 | ω3    |               |  0.50, −1.00 |
12          | 7.00 | 3.00 | ω3    |               |  0.50,  1.00 |


Numerical Examples Example 2: Equal Variance (Σi = Σ)

The values of the inverse of the covariance matrices are as follows:

Σ1^{-1} = Σ2^{-1} = Σ3^{-1} = [1.00 0.00; 0.00 0.25]        (27)

The discriminant functions were then calculated and their values are as follows:

f1 = −0.5x1² − 0.125x2² + 3.50x1 + 1.50x2 − 11.72
f2 = −0.5x1² − 0.125x2² + 2.50x1 + 0.75x2 − 5.35
f3 = −0.5x1² − 0.125x2² + 6.50x1 + 0.50x2 − 22.72        (28)

The decision boundaries between each pair of classes were then calculated as follows:

S12 = f1 − f2  →  x1 = 6.37 − 0.75x2
S13 = f1 − f3  →  x2 = 3.00x1 − 11.00
S23 = f2 − f3  →  x2 = 16x1 − 69.48        (29)
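A sketch for Example 2: because the covariance matrices are shared, the quadratic terms cancel in every pairwise difference, so the boundaries of Equation (29) are linear. The check below evaluates f1 − f2 at a point on S12 (constant terms common to every class are dropped, as in the slides):

```python
import numpy as np

cov_inv = np.diag([1.0, 0.25])                     # shared Sigma^-1, Equation (27)
means   = [np.array([3.5, 6.0]), np.array([2.5, 3.0]), np.array([6.5, 2.0])]
priors  = [4/12, 4/12, 4/12]

def f(i, x):
    """f_i(x) as in Equation (28); terms common to every class are dropped."""
    mu = means[i]
    return (-0.5 * x @ cov_inv @ x + mu @ cov_inv @ x
            - 0.5 * mu @ cov_inv @ mu + np.log(priors[i]))

# The quadratic term is identical for all classes, so S12 = f1 - f2 is linear.
x = np.array([3.0, 4.5])                           # a point on S12: x1 = 6.375 - 0.75*x2
print(round(f(0, x) - f(1, x), 6))                 # 0.0 -> x lies on the boundary S12
```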


Numerical Examples Example 2: Equal Variance (Σi = Σ)

Figure: The calculated decision boundaries for three different classes where their covariance matrices were equal but arbitrary. The plot shows the three classes, their means µ1, µ2, and µ3, and the linear boundaries S12 (x1 = 6.37 − 0.75x2), S13 (x2 = 3x1 − 11), and S23 (x2 = 16x1 − 69.48) in the (x1, x2) plane.


Numerical Examples Example 2: Equal Variance (Σi = Σ)

Figure: Classification of three Gaussian classes with the same covariance matrix (Σ1 = Σ2 = Σ3) (our second example). Green, red, and blue surfaces represent f1, f2, and f3, respectively.


Numerical Examples Example 3: Different Covariance matrices (Σi = arbitrary)

In this example, the covariance matrices were different for all classes; this can be considered the general case.

The distributions of all classes were different.

Table: Feature values, mean, mean-centered data, and covariance matrices for all classes.

Pattern No. | x1   | x2   | Class | Mean (x1, x2) | D (x1, x2)   | Covariance Matrix (Σi)
1           | 7.00 | 3.00 | ω1    | 7.50, 3.50    | −0.50, −0.50 | Σ1 = [1.00 0.00; 0.00 1.00]
2           | 8.00 | 3.00 | ω1    |               |  0.50, −0.50 |
3           | 7.00 | 4.00 | ω1    |               | −0.50,  0.50 |
4           | 8.00 | 4.00 | ω1    |               |  0.50,  0.50 |
5           | 2.00 | 2.00 | ω2    | 3.50, 2.50    | −1.50, −0.50 | Σ2 = [9.00 0.00; 0.00 1.00]
6           | 5.00 | 2.00 | ω2    |               |  1.50, −0.50 |
7           | 2.00 | 3.00 | ω2    |               | −1.50,  0.50 |
8           | 5.00 | 3.00 | ω2    |               |  1.50,  0.50 |
9           | 1.00 | 6.00 | ω3    | 3.00, 6.50    | −2.00, −0.50 | Σ3 = [16.00 0.00; 0.00 1.00]
10          | 5.00 | 6.00 | ω3    |               |  2.00, −0.50 |
11          | 1.00 | 7.00 | ω3    |               | −2.00,  0.50 |
12          | 5.00 | 7.00 | ω3    |               |  2.00,  0.50 |


Numerical Examples Example 3: Different Covariance matrices (Σi = arbitrary)

The values of the inverse of the covariance matrices are as follows:

Σ1^{-1} = [1.00 0.00; 0.00 1.00], Σ2^{-1} = [0.11 0.00; 0.00 1.00], Σ3^{-1} = [0.06 0.00; 0.00 1.00]        (30)

The discriminant functions were then calculated and their values are as follows:

f1 = −0.50x1² − 0.50x2² + 7.50x1 + 3.50x2 − 35.35
f2 = −0.06x1² − 0.50x2² + 0.39x1 + 2.50x2 − 6.00
f3 = −0.03x1² − 0.50x2² + 0.19x1 + 6.50x2 − 23.89        (31)

The decision boundaries between each pair of classes were then calculated as follows:

S12 = f1 − f2  →  x2 = 0.44x1² − 7.11x1 + 29.35
S13 = f1 − f3  →  x2 = −0.16x1² + 2.44x1 − 3.82
S23 = f2 − f3  →  x2 = −0.01x1² + 0.05x1 + 4.47        (32)
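A sketch that recomputes the quadratic boundary S12 of Equation (32) from the class statistics of Example 3 (classes 1 and 2), using the W, w, W0 construction of Equations (8)-(10) with the equal priors 4/12 used in the example:

```python
import numpy as np

mu1, mu2   = np.array([7.5, 3.5]), np.array([3.5, 2.5])       # classes 1 and 2 of Example 3
cov1, cov2 = np.diag([1.0, 1.0]), np.diag([9.0, 1.0])
p1 = p2 = 4 / 12                                               # equal priors

inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
W  = -0.5 * (inv1 - inv2)                                                    # Equation (8)
w  = inv1 @ mu1 - inv2 @ mu2                                                 # Equation (9)
W0 = (-0.5 * (mu1 @ inv1 @ mu1 - mu2 @ inv2 @ mu2
              + np.log(np.linalg.det(cov1)) - np.log(np.linalg.det(cov2)))
      + np.log(p1 / p2))                                                     # Equation (10)

# S12(x) = x^T W x + w^T x + W0; solving S12 = 0 for x2 reproduces
# x2 = 0.44*x1^2 - 7.11*x1 + 29.35 from Equation (32).
print(np.round(W, 2), np.round(w, 2), round(W0, 2))
```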


Numerical Examples Example 3: Different Covariance matrices (Σi = arbitrary)

Figure: The calculated decision boundaries for three different classes where their covariance matrices are different (our third example). The plot shows the three classes and the quadratic boundaries S12 = −0.44x1² + 7.11x1 + x2 − 29.35 = 0, S13 = −0.48x1² + 7.32x1 − 3x2 − 11.46 = 0, and S23 = −0.04x1² + 0.20x1 − 4x2 + 17.88 = 0 in the (x1, x2) plane.


Numerical Examples Example 3: Different Covariance matrices (Σi = arbitrary)

Figure: Classification of three Gaussian classes with different covariance matrices (our third example). Green, red, and blue surfaces represent the discriminant functions f1, f2, and f3, respectively.


Conclusions and Future Work

We explained how to construct linear and quadratic classifiers.

In the future, we will explain how the singularity problem occurs and how to solve it.
