
DM825 Introduction to Machine Learning
Lecture 7: Gaussian Discriminant Analysis, Naive Bayes
Marco Chiarandini
Department of Mathematics & Computer Science, University of Southern Denmark


Outline

1. Gaussian Discriminant Analysis

2. Naive Bayes: Multi-variate Bernoulli Event Model; Multinomial Event Model

3. Support Vector Machines


- Discriminative approach: learns p(y | x)
- Generative approach: learns p(x | y)


Generative Method

1. Model p(y) and p(x | y)

2. Learn the parameters of the models by maximizing the joint likelihood $p(x, y) = p(x \mid y)\, p(y)$

3. Express

$$p(y \mid x) = \frac{p(x \mid y)\, p(y)}{p(x)} = \frac{p(x \mid y)\, p(y)}{\sum_{y \in \mathcal{Y}} p(x \mid y)\, p(y)}$$

4. Predict

$$\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$$


Gaussian Discriminant Analysis

Let $\vec{x}$ be a vector of continuous variables. We will assume $p(\vec{x} \mid y)$ is a multivariate Gaussian distribution:

$$p(\vec{x}; \vec{\mu}, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu})^T \Sigma^{-1} (\vec{x} - \vec{\mu})\right)$$

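As a quick numerical illustration (my own sketch, not from the slides), the density formula above can be checked against SciPy's implementation:

```python
import numpy as np
from scipy.stats import multivariate_normal

def gaussian_density(x, mu, Sigma):
    """Multivariate Gaussian density, written directly from the formula above."""
    n = len(mu)
    diff = x - mu
    norm_const = (2 * np.pi) ** (n / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm_const

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
x = np.array([0.5, 0.5])

print(gaussian_density(x, mu, Sigma))         # direct formula
print(multivariate_normal(mu, Sigma).pdf(x))  # should agree
```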

Step 1: we model the probabilities:

$$y \sim \text{Bernoulli}(\phi), \qquad x \mid y = 0 \sim \mathcal{N}(\vec{\mu}_0, \Sigma), \qquad x \mid y = 1 \sim \mathcal{N}(\vec{\mu}_1, \Sigma)$$

that is:

$$p(y) = \phi^y (1 - \phi)^{1 - y}$$

$$p(x \mid y = 0) = \mathcal{N}(\vec{\mu}_0, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu}_0)^T \Sigma^{-1} (\vec{x} - \vec{\mu}_0)\right)$$

$$p(x \mid y = 1) = \mathcal{N}(\vec{\mu}_1, \Sigma) = \frac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\frac{1}{2} (\vec{x} - \vec{\mu}_1)^T \Sigma^{-1} (\vec{x} - \vec{\mu}_1)\right)$$

Step 2: we express the joint likelihood of a set of data $i = 1, \ldots, m$:

$$l(\phi, \mu_0, \mu_1, \Sigma) = \prod_{i=1}^m p(x^i, y^i) = \prod_{i=1}^m p(x^i \mid y^i)\, p(y^i)$$

We substitute the model assumptions above and maximize $\log l(\phi, \mu_0, \mu_1, \Sigma)$ in $\phi, \mu_0, \mu_1, \Sigma$.


Solutions:

$$\phi = \frac{\sum_{i=1}^m y^i}{m} = \frac{\sum_i I\{y^i = 1\}}{m}$$

$$\mu_0 = \frac{\sum_i I\{y^i = 0\}\, x^i}{\sum_i I\{y^i = 0\}} \qquad \mu_1 = \frac{\sum_i I\{y^i = 1\}\, x^i}{\sum_i I\{y^i = 1\}} \qquad \Sigma = \ldots$$

Compare with logistic regression, where we maximized the conditional likelihood instead!

Step 3 and 4:

$$\arg\max_y p(y \mid x) = \arg\max_y \frac{p(x \mid y)\, p(y)}{p(x)} = \arg\max_y p(x \mid y)\, p(y)$$

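A minimal NumPy sketch of these estimators and of the prediction rule in steps 3 and 4 (my own code, not from the slides; the pooled covariance, elided as "..." above, is filled in with the standard maximum-likelihood estimate):

```python
import numpy as np

def fit_gda(X, y):
    """ML estimates for GDA with shared covariance; assumes y has values in {0, 1}."""
    m = len(y)
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Pooled covariance: standard MLE (left as "..." on the slide).
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / m
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """arg max_y p(x | y) p(y); the shared normalizing constant cancels in the comparison."""
    Sinv = np.linalg.inv(Sigma)
    def log_joint(mu, prior):
        d = X - mu
        return -0.5 * np.einsum('ij,jk,ik->i', d, Sinv, d) + np.log(prior)
    return (log_joint(mu1, phi) > log_joint(mu0, 1 - phi)).astype(int)
```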

Comments

- In GDA: $x \mid y \sim$ Gaussian $\implies$ the posterior $p(y = 1 \mid x)$ has logistic form (see Sec. 4.2 of B1)
- In logistic regression we model $p(y \mid x)$ as logistic directly; a logistic posterior also arises from other class-conditional distributions, e.g.:

$$x \mid y = 0 \sim \text{Poisson}(\lambda_0), \quad x \mid y = 1 \sim \text{Poisson}(\lambda_1)$$
$$x \mid y = 0 \sim \text{ExpFam}(\eta_0), \quad x \mid y = 1 \sim \text{ExpFam}(\eta_1)$$

- Hence we make stronger assumptions in GDA. If we do not know where the data come from, logistic regression is more robust; if we do know, GDA may perform better.


- When $\Sigma$ is the same for all class-conditional densities, the decision boundaries are linear: linear discriminant analysis (LDA)
- When the class-conditional densities do not share $\Sigma$, the boundaries are quadratic: quadratic discriminant analysis (QDA)

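As an illustration (my own sketch, not from the slides), scikit-learn ships both variants, so the two boundary shapes can be compared on the same data:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                            QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)
# Two Gaussian classes with different covariances.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=200)
X1 = rng.multivariate_normal([2, 2], [[1.0, 0.8], [0.8, 1.5]], size=200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)     # shared Sigma -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class Sigma -> quadratic boundary
print(lda.score(X, y), qda.score(X, y))
```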

Chain Rule of Probability

The chain rule permits the calculation of the joint distribution of a set of random variables using only conditional probabilities.

Consider the set of events $A_1, A_2, \ldots, A_n$. To find the value of the joint distribution, we can apply the definition of conditional probability to obtain:

$$\Pr(A_n, A_{n-1}, \ldots, A_1) = \Pr(A_n \mid A_{n-1}, \ldots, A_1)\, \Pr(A_{n-1}, A_{n-2}, \ldots, A_1)$$

Repeating the process on each final term:

$$\Pr\left(\bigcap_{k=1}^n A_k\right) = \prod_{k=1}^n \Pr\left(A_k \,\Big|\, \bigcap_{j=1}^{k-1} A_j\right)$$

For example:

$$\Pr(A_4, A_3, A_2, A_1) = \Pr(A_4 \mid A_3, A_2, A_1)\, \Pr(A_3 \mid A_2, A_1)\, \Pr(A_2 \mid A_1)\, \Pr(A_1)$$

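A quick numerical sanity check of the chain rule (my own example): factor a random joint table over three binary events and verify that the product of conditionals recovers the joint.

```python
import numpy as np

rng = np.random.default_rng(1)
# A random joint distribution over three binary events A1, A2, A3.
joint = rng.random((2, 2, 2))
joint /= joint.sum()

# Chain rule: Pr(A1, A2, A3) = Pr(A1) Pr(A2 | A1) Pr(A3 | A1, A2)
p_a1 = joint.sum(axis=(1, 2))                              # Pr(A1)
p_a2_given_a1 = joint.sum(axis=2) / p_a1[:, None]          # Pr(A2 | A1)
p_a3_given_a12 = joint / joint.sum(axis=2, keepdims=True)  # Pr(A3 | A1, A2)

reconstructed = p_a1[:, None, None] * p_a2_given_a1[:, :, None] * p_a3_given_a12
print(np.allclose(reconstructed, joint))  # True
```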

Multi-variate Bernoulli Event Model

We want to decide whether an email is spam, $y \in \{0, 1\}$, given some discrete features $\vec{x}$.
How do we represent emails by a set of features?
Binary array: each element corresponds to a word in the vocabulary, and the bit indicates whether the word is present or not in the email.

$$\vec{x} = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ \vdots \\ 0 \\ 0 \end{bmatrix} \qquad \vec{x} \in \{0, 1\}^n, \quad n = 50000 \text{ (a large number)}$$

There are $2^{50000}$ possible bit vectors, hence $2^{50000} - 1$ parameters to learn for a full joint model.

We collect examples, look at those that are spam ($y = 1$) and learn $p(x \mid y = 1)$, then at those with $y = 0$ and learn $p(x \mid y = 0)$.

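A small sketch of this representation (the vocabulary and email are made up for illustration; in practice n would be around 50,000):

```python
import numpy as np

vocabulary = ["buy", "cheap", "hello", "meeting", "now", "viagra"]
word_index = {w: j for j, w in enumerate(vocabulary)}

def email_to_bits(email_text):
    """Binary feature vector: x_j = 1 iff vocabulary word j occurs in the email."""
    x = np.zeros(len(vocabulary), dtype=int)
    for word in email_text.lower().split():
        if word in word_index:
            x[word_index[word]] = 1
    return x

print(email_to_bits("Buy cheap viagra now"))  # [1 1 0 0 1 1]
```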

Step 1: For a given example $i$ we treat each $x_j^i$ independently:

$$p(y) \sim \phi_y \qquad \forall j:\ p(x_j \mid y = 0) \sim \phi_{j|y=0} \qquad \forall j:\ p(x_j \mid y = 1) \sim \phi_{j|y=1}$$

Step 2: Maximize the joint likelihood.
Assume the $x_j$ are conditionally independent given $y$. By the chain rule:

$$p(x_1, \ldots, x_{50000} \mid y) = p(x_1 \mid y)\, p(x_2 \mid y, x_1)\, p(x_3 \mid y, x_1, x_2) \cdots$$
$$= p(x_1 \mid y)\, p(x_2 \mid y)\, p(x_3 \mid y) \cdots \qquad \text{(cond. indep.)}$$
$$= \prod_{j=1}^{n} p(x_j \mid y)$$


$$l(\phi_y, \phi_{j|y=0}, \phi_{j|y=1}) = \prod_{i=1}^m p(\vec{x}^i, y^i) = \prod_{i=1}^m \left( \prod_{j=1}^n p(x_j^i \mid y^i) \right) p(y^i)$$

Solution:

$$\phi_y = \frac{\sum_i I\{y^i = 1\}}{m}$$

$$\phi_{j|y=1} = \frac{\sum_i I\{y^i = 1, x_j^i = 1\}}{\sum_i I\{y^i = 1\}} \qquad \phi_{j|y=0} = \frac{\sum_i I\{y^i = 0, x_j^i = 1\}}{\sum_i I\{y^i = 0\}}$$

Step 3 and 4: prediction as usual, but remember to use logarithms.
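A compact NumPy sketch of these estimates and of log-space prediction (my own code and variable names; X is assumed to be an m x n binary matrix, y in {0, 1}):

```python
import numpy as np

def fit_bernoulli_nb(X, y):
    """ML estimates phi_y, phi_{j|y=1}, phi_{j|y=0} (no smoothing yet, see next slide)."""
    phi_y = np.mean(y == 1)
    phi_j_1 = X[y == 1].mean(axis=0)  # fraction of spam emails containing word j
    phi_j_0 = X[y == 0].mean(axis=0)  # fraction of non-spam emails containing word j
    return phi_y, phi_j_1, phi_j_0

def predict_bernoulli_nb(X, phi_y, phi_j_1, phi_j_0):
    """arg max_y p(x | y) p(y), computed with logarithms to avoid underflow.
    Note: log(0) = -inf can occur without smoothing, which is the next topic."""
    log_p1 = (X * np.log(phi_j_1) + (1 - X) * np.log(1 - phi_j_1)).sum(axis=1) + np.log(phi_y)
    log_p0 = (X * np.log(phi_j_0) + (1 - X) * np.log(1 - phi_j_0)).sum(axis=1) + np.log(1 - phi_y)
    return (log_p1 > log_p0).astype(int)
```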


Laplace Smoothing

What if $p(x_{30000} \mid y = 1) = 0$ and $p(x_{30000} \mid y = 0) = 0$ because we do not have any observation in the training set with that word?

$$p(y = 1 \mid x) = \frac{p(x \mid y = 1)\, p(y = 1)}{p(x \mid y = 1)\, p(y = 1) + p(x \mid y = 0)\, p(y = 0)}$$
$$= \frac{\prod_{j=1}^{50000} p(x_j \mid y = 1)\, p(y = 1)}{\prod_{j=1}^{50000} p(x_j \mid y = 1)\, p(y = 1) + \prod_{j=1}^{50000} p(x_j \mid y = 0)\, p(y = 0)} = \frac{0}{0 + 0}$$

Laplace smoothing: assume some observations

$$p(x \mid y) = \frac{c(x, y) + k}{c(y) + k\,|x|} \qquad \phi_y = \frac{\sum_i I\{y^i = 1\} + 1}{m + K}$$

$$\phi_{j|y=1} = \frac{\sum_i I\{y^i = 1, x_j^i = 1\} + 1}{\sum_i I\{y^i = 1\} + 2} \qquad \phi_{j|y=0} = \frac{\sum_i I\{y^i = 0, x_j^i = 1\} + 1}{\sum_i I\{y^i = 0\} + 2}$$
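The corresponding change to the estimation sketch above (my own code): add 1 to every count and 2 to the denominators, since each x_j takes two values.

```python
import numpy as np

def fit_bernoulli_nb_smoothed(X, y):
    """Laplace-smoothed estimates for the multi-variate Bernoulli model."""
    m = len(y)
    n1 = np.sum(y == 1)
    n0 = m - n1
    phi_y = (n1 + 1) / (m + 2)                        # K = 2 classes
    phi_j_1 = (X[y == 1].sum(axis=0) + 1) / (n1 + 2)  # +1 / +2: two outcomes per word
    phi_j_0 = (X[y == 0].sum(axis=0) + 1) / (n0 + 2)
    return phi_y, phi_j_1, phi_j_0
```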


Multinomial Event Model

We look at a generalization of the previous Naive Bayes model that also takes into account the number of times a word appears, as well as its position.

Let $x_i \in \{1, 2, \ldots, K\}$, for example a continuous variable discretized in buckets.

Let $[x_1^i, x_2^i, \ldots, x_{n_i}^i]$, $x_j^i \in \{1, 2, \ldots, K\}$, represent the word in position $j$, with $n_i$ the number of words in the $i$th email.

Step 1:

$$p(y) \sim \phi_y \qquad \forall j:\ p(x_j = k \mid y = 0) \sim \phi_{k|y=0} \qquad \forall j:\ p(x_j = k \mid y = 1) \sim \phi_{k|y=1}$$

where it is assumed that $p(x_j = k \mid y)$ is the same for all $j$; the $\phi_{k|y}$ are parameters of multinomial distributions.


Step 2: Joint likelihood (by analogy with the Bernoulli event model):

$$l(\phi_y, \phi_{k|y=0}, \phi_{k|y=1}) = \prod_{i=1}^m \left( \prod_{j=1}^{n_i} p(x_j^i \mid y^i) \right) p(y^i)$$

Solution:

$$\phi_y = \frac{\sum_i I\{y^i = 1\}}{m} \qquad \phi_{k|y=1} = \frac{\sum_i \sum_{j=1}^{n_i} I\{y^i = 1, x_j^i = k\}}{\sum_i I\{y^i = 1\}\, n_i} \qquad \phi_{k|y=0} = \frac{\sum_i \sum_{j=1}^{n_i} I\{y^i = 0, x_j^i = k\}}{\sum_i I\{y^i = 0\}\, n_i}$$


Laplace Smoothing

With add-one smoothing over the $K$ possible word values:

$$\phi_{k|y=1} = \frac{\sum_i \sum_{j=1}^{n_i} I\{y^i = 1, x_j^i = k\} + 1}{\sum_i I\{y^i = 1\}\, n_i + K} \qquad \phi_{k|y=0} = \frac{\sum_i \sum_{j=1}^{n_i} I\{y^i = 0, x_j^i = k\} + 1}{\sum_i I\{y^i = 0\}\, n_i + K}$$
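A minimal sketch of the multinomial event model with add-one smoothing (my own implementation, not from the slides; docs are lists of word indices in 0..K-1, labels in {0, 1}):

```python
import numpy as np

def fit_multinomial_nb(docs, y, K):
    """phi_k_c[c, k] estimates p(x_j = k | y = c), assumed identical for every position j."""
    phi_y = np.mean(y == 1)
    phi_k_c = np.zeros((2, K))
    for c in (0, 1):
        words = np.concatenate([np.asarray(d) for d, label in zip(docs, y) if label == c])
        counts = np.bincount(words, minlength=K)
        phi_k_c[c] = (counts + 1) / (len(words) + K)  # add-one smoothing over K word values
    return phi_y, phi_k_c

def predict_multinomial_nb(doc, phi_y, phi_k_c):
    """arg max_y p(y) prod_j p(x_j | y), computed in log space."""
    doc = np.asarray(doc)
    log_p1 = np.log(phi_k_c[1, doc]).sum() + np.log(phi_y)
    log_p0 = np.log(phi_k_c[0, doc]).sum() + np.log(1 - phi_y)
    return int(log_p1 > log_p0)
```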


Support Vector Machines (Intro)

Support vector machines: a discriminative approach that implements a non-linear decision boundary on the basis of linear classifiers.

- lift points to a space where they are linearly separable
- find a linear separator

Let's focus first on how to find a linear separator. Desiderata:

predict "1" iff $\vec{\theta}^T \vec{x} \geq 0$
predict "0" iff $\vec{\theta}^T \vec{x} < 0$

Also wanted:

if $\vec{\theta}^T \vec{x} \gg 0$, very confident that $y = 1$
if $\vec{\theta}^T \vec{x} \ll 0$, very confident that $y = 0$

Hence it would be nice if:

$\forall i: y^i = 1$ we have $\vec{\theta}^T \vec{x}^i \gg 0$
$\forall i: y^i = 0$ we have $\vec{\theta}^T \vec{x}^i \ll 0$


Notation

Assume the training set is linearly separable.

Let's change notation: $y \in \{-1, 1\}$ (instead of $\{0, 1\}$ as in GLM).
Let $h$ output values in $\{-1, 1\}$:

$$f(z) = \text{sign}(z) = \begin{cases} 1 & \text{if } z \geq 0 \\ -1 & \text{if } z < 0 \end{cases}$$

(hence no probabilities, unlike in logistic regression)

$$h(\vec{\theta}, \vec{x}) = f(\vec{\theta}^T \vec{x}), \quad \vec{x} \in \mathbb{R}^{n+1}, \ \vec{\theta} \in \mathbb{R}^{n+1}$$
$$h(\vec{\theta}, \vec{x}) = f(\vec{\theta}^T \vec{x} + \theta_0), \quad \vec{x} \in \mathbb{R}^n, \ \vec{\theta} \in \mathbb{R}^n, \ \theta_0 \in \mathbb{R}$$

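In code, the hypothesis is just a thresholded inner product (a tiny sketch with made-up numbers):

```python
import numpy as np

def h(theta, theta0, x):
    """Linear classifier with labels in {-1, +1}: sign(theta^T x + theta0)."""
    return 1 if theta @ x + theta0 >= 0 else -1

theta = np.array([2.0, -1.0])
theta0 = -0.5
print(h(theta, theta0, np.array([1.0, 0.5])))  # 2*1 - 1*0.5 - 0.5 = 1.0 >= 0  -> +1
print(h(theta, theta0, np.array([0.0, 1.0])))  # -1 - 0.5 = -1.5 < 0           -> -1
```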

Functional Margin

Def.: The functional margin of a hyperplane $(\vec{\theta}, \theta_0)$ w.r.t. a specific example $(x^i, y^i)$ is:

$$\hat{\gamma}^i = y^i (\vec{\theta}^T \vec{x}^i + \theta_0)$$

For the decision boundary $\vec{\theta}^T \vec{x} + \theta_0 = 0$ that defines the linear boundary:
we want $\vec{\theta}^T \vec{x}^i + \theta_0 \gg 0$ if $y^i = +1$
we want $\vec{\theta}^T \vec{x}^i + \theta_0 \ll 0$ if $y^i = -1$

If $y^i (\vec{\theta}^T \vec{x}^i + \theta_0) > 0$ then $i$ is classified correctly.
Hence, we want to maximize $y^i (\vec{\theta}^T \vec{x}^i + \theta_0)$. This can be achieved by maximizing the worst case over the training set:

$$\hat{\gamma} = \min_i \hat{\gamma}^i$$

Note: scaling $\vec{\theta} \to 2\vec{\theta}$, $\theta_0 \to 2\theta_0$ would make $\hat{\gamma}$ arbitrarily large. Hence we impose $\|\vec{\theta}\| = 1$.


Hyperplanes

A hyperplane is a set of the form $\{\vec{x} \mid \vec{a}^T \vec{x} = b\}$ ($\vec{a} \neq 0$).

[Figure: a hyperplane $\vec{a}^T \vec{x} = b$ through a point $\vec{x}_0$, with normal vector $\vec{a}$]

- $\vec{a}$ is the normal vector
- hyperplanes are affine and convex sets


Geometric Margin

Def.: the geometric margin $\gamma^i$ is the distance of example $i$ from the linear separator (segment A-B in the figure).

$\frac{\vec{\theta}}{\|\vec{\theta}\|}$ is the unit vector orthogonal to the separating hyperplane.

The point B is: $x^i - \gamma^i \frac{\vec{\theta}}{\|\vec{\theta}\|}$

Since B is on the linear separator, substituting the expression above into $\vec{\theta}^T \vec{x} + \theta_0 = 0$:

$$\vec{\theta}^T \left( x^i - \gamma^i \frac{\vec{\theta}}{\|\vec{\theta}\|} \right) + \theta_0 = 0$$


Solving for $\gamma^i$:

$$\vec{\theta}^T x^i + \theta_0 = \gamma^i \frac{\vec{\theta}^T \vec{\theta}}{\|\vec{\theta}\|} = \gamma^i \|\vec{\theta}\|$$

$$\gamma^i = \left( \frac{\vec{\theta}}{\|\vec{\theta}\|} \right)^T x^i + \frac{\theta_0}{\|\vec{\theta}\|}$$

This was for the positive examples. We can develop the same for the negatives, but we would get a negative sign. Hence the quantity:

$$\gamma^i = y^i \left( \left( \frac{\vec{\theta}}{\|\vec{\theta}\|} \right)^T x^i + \frac{\theta_0}{\|\vec{\theta}\|} \right)$$

will always be positive.
To maximize the distance of the line from all points we maximize the worst case, that is:

$$\gamma = \min_i \gamma^i$$

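A tiny sketch computing both margins on a made-up training set (my own numbers), which also illustrates the relation $\gamma = \hat{\gamma} / \|\vec{\theta}\|$ stated next:

```python
import numpy as np

theta = np.array([3.0, 4.0])   # ||theta|| = 5
theta0 = -2.0
X = np.array([[1.0, 1.0], [0.0, 0.0], [2.0, 0.0]])
y = np.array([1, -1, 1])

functional = y * (X @ theta + theta0)           # hat{gamma}^i = y^i (theta^T x^i + theta0)
geometric = functional / np.linalg.norm(theta)  # gamma^i = hat{gamma}^i / ||theta||

print(functional, functional.min())  # worst-case functional margin hat{gamma}
print(geometric, geometric.min())    # worst-case geometric margin gamma
```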

- $\gamma = \dfrac{\hat{\gamma}}{\|\vec{\theta}\|}$
- Note that if $\|\vec{\theta}\| = 1$ then $\hat{\gamma}^i = \gamma^i$: the two margins coincide
- the geometric margin is invariant to scaling $\vec{\theta} \to 2\vec{\theta}$, $\theta_0 \to 2\theta_0$


Optimal margin classifier

$$\max_{\gamma, \vec{\theta}, \theta_0} \ \gamma \qquad (1)$$
$$\text{s.t.} \quad y^i (\vec{\theta}^T \vec{x}^i + \theta_0) \geq \gamma \quad \forall i = 1, \ldots, m \qquad (2)$$
$$\|\vec{\theta}\| = 1 \qquad (3)$$

(2) implements $\gamma = \min_i \gamma^i$

(3) is a nonconvex constraint, thanks to which the two margins are the same.
