
18-661 Introduction to Machine Learning

Multi-class Classification

Spring 2020

ECE – Carnegie Mellon University


Outline

1. Review of Logistic regression

2. Non-linear Decision Boundary

3. Multi-class Classification

Multi-class Naive Bayes

Multi-class Logistic Regression


Review of Logistic regression


Intuition: Logistic Regression

• x1 = # of times ’meet’ appears in an email

• x2 = # of times ’lottery’ appears in an email

• Define feature vector x = [1, x1, x2]

• Learn the decision boundary w0 + w1x1 + w2x2 = 0 such that

• If wᵀx ≥ 0 declare y = 1 (spam)

• If wᵀx < 0 declare y = 0 (ham)

[Figure: emails plotted with x1 (occurrences of 'meet') on the horizontal axis and x2 (occurrences of 'lottery') on the vertical axis, separated by the line w0 + w1x1 + w2x2 = 0]

Key Idea: map features into points in a high-dimensional space, and use hyperplanes to separate them

Intuition: Logistic Regression

• Suppose we want to output the probability of an email being spam/ham instead of just 0 or 1

• This gives information about the confidence in the decision

• Use a function σ(wᵀx) that maps wᵀx to a value between 0 and 1

[Figure: sigmoid-shaped curve of Prob(y = 1 | x), the probability that the predicted label is 1 (spam), plotted against wᵀx, the linear combination of features; the curve runs from 0 (ham) to 1 (spam)]

Key Problem: Finding optimal weights w that accurately predict this probability for a new email

Formal Setup: Binary Logistic Classification

• Input: x = [1, x1, x2, . . . , xD] ∈ R^(D+1)

• Output: y ∈ {0, 1}

• Training data: D = {(xn, yn), n = 1, 2, . . . , N}

• Model:

  p(y = 1 | x; w) = σ(wᵀx)

  and σ(·) stands for the sigmoid function

  σ(a) = 1 / (1 + e^(−a))

[Figure: plot of the sigmoid σ(a) for a from −6 to 6, increasing from 0 to 1 and crossing 0.5 at a = 0]
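
As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this model: the sigmoid and the resulting p(y = 1 | x; w) for a feature vector that already includes the leading 1. The weights and counts below are made-up values for the spam example.

```python
import numpy as np

def sigmoid(a):
    """Sigmoid function sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(w, x):
    """p(y = 1 | x; w) = sigma(w^T x) for a feature vector x = [1, x1, ..., xD]."""
    return sigmoid(np.dot(w, x))

# Example: x = [1, #'meet', #'lottery'] with illustrative (made-up) weights.
w = np.array([-1.0, -0.5, 2.0])
x = np.array([1.0, 0.0, 3.0])     # an email containing 'lottery' three times
print(predict_proba(w, x))        # probability the email is spam
```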

How to optimize w?

Probability of a single training sample (xn, yn):

p(yn | xn; w) = σ(wᵀxn) if yn = 1, and 1 − σ(wᵀxn) otherwise

Compact expression, exploiting that yn is either 1 or 0:

p(yn | xn; w) = σ(wᵀxn)^yn [1 − σ(wᵀxn)]^(1−yn)

Minimize the negative log-likelihood of the whole training data D, i.e. the cross-entropy error function

E(w) = −∑n {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]}
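
A small NumPy sketch of this cross-entropy error, assuming the training data are stored as a matrix X whose rows are the xn (with a leading 1) and a 0/1 label vector y; these variable names are illustrative, not from the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, X, y):
    """E(w) = -sum_n { y_n log sigma(w^T x_n) + (1 - y_n) log[1 - sigma(w^T x_n)] }.

    X: (N, D+1) matrix whose rows are the feature vectors x_n (with leading 1).
    y: (N,) vector of 0/1 labels.
    """
    p = sigmoid(X @ w)          # sigma(w^T x_n) for every sample
    eps = 1e-12                 # avoid log(0) for numerical safety
    return -np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```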

Gradient Descent Update for Logistic Regression

Cross-entropy Error Function

E(w) = −∑n {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]}

Simple fact: derivative of σ(a)

d σ(a)/da = σ(a)[1 − σ(a)]

Gradient of cross-entropy loss

∂E(w)/∂w = ∑n {σ(wᵀxn) − yn} xn

Remark

• en = σ(wᵀxn) − yn is called the error for the nth training sample.
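
The gradient above translates directly into a few lines of NumPy; this sketch assumes the same X and y layout as before and uses the per-sample errors en.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gradient(w, X, y):
    """dE/dw = sum_n (sigma(w^T x_n) - y_n) x_n, built from the errors e_n."""
    errors = sigmoid(X @ w) - y   # e_n = sigma(w^T x_n) - y_n, shape (N,)
    return X.T @ errors           # sum_n e_n * x_n, shape (D+1,)
```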

Numerical optimization

Gradient descent for logistic regression

• Choose a proper step size η > 0

• Iteratively update the parameters following the negative gradient to minimize the error function

  w(t+1) ← w(t) − η ∑n {σ(xnᵀw(t)) − yn} xn

• Can also perform stochastic gradient descent (with a possibly different learning rate η)

  w(t+1) ← w(t) − η {σ(xitᵀw(t)) − yit} xit

  where the index it is drawn uniformly at random from the training data indices {1, 2, · · · , N}
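
A hedged sketch of both update rules in NumPy, again assuming X holds the feature vectors as rows and y the 0/1 labels; the step size, iteration counts, and zero initialization are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def batch_gradient_descent(X, y, eta=0.01, iters=100):
    """w(t+1) = w(t) - eta * sum_n (sigma(x_n^T w(t)) - y_n) x_n."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * (X.T @ (sigmoid(X @ w) - y))
    return w

def stochastic_gradient_descent(X, y, eta=0.01, iters=1000, seed=0):
    """Same update, but each step uses one sample i_t drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        i = rng.integers(len(y))                      # i_t ~ Uniform{0, ..., N-1}
        w -= eta * (sigmoid(X[i] @ w) - y[i]) * X[i]
    return w
```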

Batch gradient descent vs SGD

[Figure: two plots of the probability of being spam for Emails 1–4 versus the number of iterations, both with learning rate η = 0.01; batch GD (left) is shown over 0–40 iterations and SGD (right) over 0–200 iterations]

Batch GD: fewer iterations, every iteration uses all samples

SGD: more iterations, every iteration uses one sample

Logistic regression vs linear regression

                         logistic regression                 linear regression
Training data            (xn, yn), yn ∈ {0, 1}               (xn, yn), yn ∈ R
Loss function            cross-entropy                       RSS
Prob. interpretation     yn | xn, w ∼ Ber(σ(wᵀxn))           yn | xn, w ∼ N(wᵀxn, σ²)
Gradient                 ∑n (σ(xnᵀw) − yn) xn                ∑n (xnᵀw − yn) xn

Cross-entropy loss function (logistic regression):

E(w) = −∑n {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]}

RSS loss function (linear regression):

RSS(w) = (1/2) ∑n (yn − wᵀxn)²


Non-linear Decision Boundary

How to handle more complex decision boundaries?

[Figure: scatter plot in the (x1, x2) plane where the two classes cannot be separated by a straight line]

• This data is not linearly separable

• Use non-linear basis functions to add more features

Adding polynomial features

• New feature vector is x = [1, x1, x2, x1², x2²]

• Pr(y = 1 | x) = σ(w0 + w1x1 + w2x2 + w3x1² + w4x2²)

• If w = [−1, 0, 0, 1, 1], the boundary is −1 + x1² + x2² = 0

• If −1 + x1² + x2² ≥ 0, declare spam

• If −1 + x1² + x2² < 0, declare ham

[Figure: the (x1, x2) scatter plot separated by the circular boundary −1 + x1² + x2² = 0]
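
To make the feature expansion concrete, here is a small sketch (not from the slides) that builds x = [1, x1, x2, x1², x2²], plugs in the weights w = [−1, 0, 0, 1, 1] from the slide, and applies the spam/ham rule; the example points are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def quadratic_features(x1, x2):
    """Map (x1, x2) to the expanded feature vector [1, x1, x2, x1^2, x2^2]."""
    return np.array([1.0, x1, x2, x1**2, x2**2])

w = np.array([-1.0, 0.0, 0.0, 1.0, 1.0])   # the weights from the slide

# Points inside the unit circle have -1 + x1^2 + x2^2 < 0 and are declared ham.
for x1, x2 in [(0.5, 0.5), (1.5, 0.0)]:
    phi = quadratic_features(x1, x2)
    label = "spam" if w @ phi >= 0 else "ham"
    print(x1, x2, sigmoid(w @ phi), label)
```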

Adding polynomial features

• What if we add many more features and define x = [1, x1, x2, x1², x2², x1³, x2³, . . . ]?

• We get a complex decision boundary

[Figure: the (x1, x2) scatter plot with a complex, highly non-linear decision boundary]

Can result in overfitting and bad generalization to new data points

Concept-check: Bias-Variance Trade-off

[Figure: two (x1, x2) decision-boundary plots side by side, labeled high bias (left) and high variance (right)]

Solution to Overfitting: Regularization

• Add a regularization term to the cross-entropy loss function

  E(w) = −∑n {yn log σ(wᵀxn) + (1 − yn) log[1 − σ(wᵀxn)]} + (λ/2) ‖w‖₂²

  where the last term is the regularization.

• Perform gradient descent on this regularized function

• Often, we do NOT regularize the bias term w0 (you will see this in the homework)

[Figure: the (x1, x2) scatter plot with the smoother decision boundary obtained after regularization]
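
A minimal sketch of gradient descent on this regularized objective, with the bias weight w0 (assumed to be the first coordinate of w) left unregularized as noted above; λ, η, and the iteration count are placeholder values.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def regularized_gradient_descent(X, y, lam=0.1, eta=0.01, iters=100):
    """Gradient descent on the cross-entropy loss plus (lambda/2) * ||w||_2^2,
    leaving the bias weight w0 (the first coordinate) unregularized."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y)   # gradient of the data term
        reg = lam * w                       # gradient of (lambda/2) ||w||^2
        reg[0] = 0.0                        # do not regularize the bias w0
        w -= eta * (grad + reg)
    return w
```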


Multi-class Classification

What if there are more than 2 classes?

• Dog vs. cat vs. crocodile

• Movie genres (action, horror, comedy, . . . )

• Part-of-speech tagging (verb, noun, adjective, . . . )

• . . .

[Figure: (x1, x2) scatter plot containing more than two classes of points]

Setup

Predict multiple classes/outcomes C1, C2, . . . , CK:

• Weather prediction: sunny, cloudy, raining, etc.

• Optical character recognition: 10 digits + 26 characters (lower and upper cases) + special characters, etc.

K = number of classes

Methods we've studied for binary classification:

• Naive Bayes

• Logistic regression

Do they generalize to multi-class classification?

Naive Bayes is already multi-class

Formal Definition

Given a random vector X ∈ R^K and a dependent variable Y ∈ [C], the Naive Bayes model defines the joint distribution

P(X = x, Y = c) = P(Y = c) P(X = x | Y = c)                          (1)
                = P(Y = c) ∏_{k=1}^{K} P(wordk | Y = c)^xk           (2)
                = πc ∏_{k=1}^{K} θck^xk                              (3)

where xk is the number of occurrences of the kth word, πc is the prior probability of class c (which allows multiple classes!), and θck is the weight of the kth word for the cth class.

Learning problem

Training data

D = {(xn, yn)}_{n=1}^{N} → D = {({xnk}_{k=1}^{K}, yn)}_{n=1}^{N}

Goal

Learn πc, c = 1, 2, · · · , C, and θck, ∀c ∈ [C], k ∈ [K], under the constraints

∑c πc = 1

and

∑k θck = ∑k P(wordk | Y = c) = 1

as well as πc, θck ≥ 0.

Our hammer: maximum likelihood estimation

Log-likelihood of the training data

L = log P(D) = log ∏_{n=1}^{N} πyn P(xn | yn)
             = log ∏_{n=1}^{N} (πyn ∏k θynk^xnk)
             = ∑n (log πyn + ∑k xnk log θynk)
             = ∑n log πyn + ∑n,k xnk log θynk

Optimize it!

(πc*, θck*) = arg max ∑n log πyn + ∑n,k xnk log θynk

Our hammer: maximum likelihood estimation

Optimization Problem

(πc*, θck*) = arg max ∑n log πyn + ∑n,k xnk log θynk

Solution

θck* = (# of times word k shows up in data points labeled as c) / (# of total trials for data points labeled as c)

πc* = (# of data points labeled as c) / N
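
These counting formulas are easy to implement; the sketch below (illustrative, with no Laplace smoothing) assumes X is an N×K matrix of word counts xnk and y holds class labels 0, …, C−1.

```python
import numpy as np

def fit_naive_bayes(X, y, num_classes):
    """MLE for the multi-class Naive Bayes word-count model.

    X: (N, K) matrix of word counts x_nk; y: (N,) class labels in {0, ..., C-1}.
    Returns pi (C,) and theta (C, K), where each row of theta sums to 1.
    """
    N, K = X.shape
    pi = np.zeros(num_classes)
    theta = np.zeros((num_classes, K))
    for c in range(num_classes):
        Xc = X[y == c]                        # data points labeled as class c
        pi[c] = len(Xc) / N                   # fraction of points labeled c
        counts = Xc.sum(axis=0)               # times word k shows up in class c
        theta[c] = counts / counts.sum()      # normalized by total trials in class c
    return pi, theta
```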


Logistic regression for predicting multiple classes?

• The linear decision boundary that we optimized was specific to binary classification.

  • If σ(wᵀx) ≥ 0.5, declare y = 1 (spam)

  • If σ(wᵀx) < 0.5, declare y = 0 (ham)

• How to extend it to multi-class classification?

[Figure: the sigmoid curve of Prob(y = 1 | x) against wᵀx, with y = 1 for spam and y = 0 for ham]

Idea: Express as multiple binary classification problems

The One-versus-Rest or One-Versus-All Approach

• For each class Ck, change the problem into binary classification

  1. Relabel training data with label Ck as positive (or '1')

  2. Relabel all the rest of the data as negative (or '0')

• Repeat this multiple times: Train K binary classifiers, using logistic regression to differentiate the two classes each time

[Figure: the multi-class (x1, x2) scatter plot with one class relabeled as positive and the one-versus-rest boundary w1ᵀx = 0 separating it from the rest]

The One-versus-Rest or One-Versus-All Approach

How to combine these linear decision boundaries?

• There is ambiguity in some of the regions (the 4 triangular areas)

• How do we resolve this?

[Figure: the (x1, x2) plane divided by the one-versus-rest boundaries; one marked point lies in a region claimed by both the square and the circle classifiers, another lies in a region claimed by none]

The One-versus-Rest or One-Versus-All Approach

How to combine these linear decision boundaries?

• Use the confidence estimates Pr(y = C1 | x) = σ(w1ᵀx), . . . , Pr(y = CK | x) = σ(wKᵀx)

• Declare the class Ck* whose index maximizes the confidence:

  k* = arg max_{k=1,...,K} Pr(y = Ck | x) = arg max_{k=1,...,K} σ(wkᵀx)

[Figure: for the ambiguous point, Pr(Circle) = 0.75, Pr(Square) = 0.6, and Pr(Triangle) = 0.2, so the point is assigned to the circle class]
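
A compact sketch of one-versus-rest with logistic regression, combining the relabeling step and the argmax-over-confidences rule above; fit_binary is a bare-bones gradient-descent trainer written just for this illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, eta=0.1, iters=500):
    """Plain gradient descent for binary logistic regression (0/1 labels)."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w

def one_vs_rest_fit(X, y, num_classes):
    """Train K classifiers; classifier k treats class k as 1 and all the rest as 0."""
    return np.stack([fit_binary(X, (y == k).astype(float)) for k in range(num_classes)])

def one_vs_rest_predict(W, X):
    """Pick k* = argmax_k sigma(w_k^T x) for each row of X."""
    return np.argmax(sigmoid(X @ W.T), axis=1)
```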

The One-Versus-One Approach

• For each pair of classes Ck and Ck′, change the problem into binary classification

  1. Relabel training data with label Ck as positive (or '1')

  2. Relabel training data with label Ck′ as negative (or '0')

  3. Disregard all other data

[Figure: the (x1, x2) scatter plot restricted to two of the classes, separated by a pairwise decision boundary]

The One-Versus-One Approach

• How many binary classifiers for K classes? K(K − 1)/2

• How to combine their outputs?

  • Given x, count the K(K − 1)/2 votes from the outputs of all binary classifiers and declare the winner as the predicted class.

  • Use confidence scores to resolve ties

[Figure: the multi-class (x1, x2) scatter plot with the pairwise decision boundaries overlaid]
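
A corresponding sketch of one-versus-one voting; each pairwise model is trained only on the two classes involved, and ties are broken here simply by the smaller class index rather than by confidence scores (the fit_binary helper is the same illustrative trainer as before, repeated so the snippet stands alone).

```python
import numpy as np
from itertools import combinations

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_binary(X, y, eta=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= eta * (X.T @ (sigmoid(X @ w) - y)) / len(y)
    return w

def one_vs_one_fit(X, y, num_classes):
    """Train K(K-1)/2 classifiers, one per pair (k, k'), on only those two classes."""
    models = {}
    for k, kp in combinations(range(num_classes), 2):
        mask = (y == k) | (y == kp)
        models[(k, kp)] = fit_binary(X[mask], (y[mask] == k).astype(float))
    return models

def one_vs_one_predict(models, x, num_classes):
    """Each pairwise classifier votes; the class with the most votes wins."""
    votes = np.zeros(num_classes)
    for (k, kp), w in models.items():
        votes[k if sigmoid(w @ x) >= 0.5 else kp] += 1
    return int(np.argmax(votes))   # ties broken by the smaller class index here
```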

Contrast these approaches

Number of Binary Classifiers to be trained

• One-Versus-All: K classifiers.

• One-Versus-One: K(K − 1)/2 classifiers – bad if K is large

Effect of Relabeling and Splitting Training Data

• One-Versus-All: imbalance in the number of positive and negative samples can cause bias in each trained classifier

• One-Versus-One: each classifier is trained on a small subset of the data (only the points labeled with those two classes are involved), which can result in high variance

Any other ideas?

• Hierarchical classification – we will see this in decision trees

• Multinomial Logistic Regression – directly output the probabilities of y being in each of the K classes, instead of reducing to a binary classification problem.

Multinomial logistic regression

Intuition: from the decision rule of our Naive Bayes classifier

y* = arg maxk p(y = Ck | x) = arg maxk log[p(x | y = Ck) p(y = Ck)]
   = arg maxk (log πk + ∑i xi log θki) = arg maxk wkᵀx

Essentially, we are comparing

w1ᵀx, w2ᵀx, · · · , wKᵀx

with one for each category.

First try

So, can we define the following conditional model?

p(y = Ck | x) = σ(wkᵀx).

This would not work, because

∑k p(y = Ck | x) = ∑k σ(wkᵀx) ≠ 1:

each summand can be any number (independently) between 0 and 1.

But we are close!

Learn the K linear models jointly to ensure the probabilities sum to 1!

Multinomial logistic regression

• Model: For each class Ck, we have a parameter vector wk and model the posterior probability as

  p(Ck | x) = exp(wkᵀx) / ∑k′ exp(wk′ᵀx)   ← this is called the softmax function

• Decision boundary: Assign x the label with the maximum posterior probability:

  arg maxk P(Ck | x) → arg maxk wkᵀx.

How does the softmax function behave?

Suppose we have

w1ᵀx = 100, w2ᵀx = 50, w3ᵀx = −20.

We would pick the winning class label 1.

Softmax translates these scores into well-formed conditional probabilities

p(y = 1 | x) = e^100 / (e^100 + e^50 + e^−20) < 1

• preserves relative ordering of scores

• maps scores to values between 0 and 1 that also sum to 1
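
A small NumPy sketch of the softmax applied to these scores; subtracting the maximum score before exponentiating avoids overflow on values like e^100 and does not change the resulting probabilities.

```python
import numpy as np

def softmax(scores):
    """Map scores [w_1^T x, ..., w_K^T x] to probabilities that sum to 1."""
    z = np.exp(scores - np.max(scores))   # shift by the max score for stability
    return z / z.sum()

print(softmax(np.array([100.0, 50.0, -20.0])))  # ~[1, 2e-22, 8e-53]: class 1 wins
```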

Sanity check

The multinomial model reduces to binary logistic regression when K = 2:

p(C1 | x) = exp(w1ᵀx) / (exp(w1ᵀx) + exp(w2ᵀx)) = 1 / (1 + e^(−(w1 − w2)ᵀx)) = 1 / (1 + e^(−wᵀx))

where w = w1 − w2. Multinomial logistic regression thus generalizes (binary) logistic regression to deal with multiple classes.

Parameter estimation

Discriminative approach: maximize the conditional likelihood

log P(D) = ∑n log P(yn | xn)

We will change yn to yn = [yn1 yn2 · · · ynK]ᵀ, a K-dimensional vector using 1-of-K encoding:

ynk = 1 if yn = k, and 0 otherwise

Ex: if yn = 2, then yn = [0 1 0 0 · · · 0]ᵀ.

⇒ ∑n log P(yn | xn) = ∑n log ∏_{k=1}^{K} P(Ck | xn)^ynk = ∑n ∑k ynk log P(Ck | xn)
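
The 1-of-K encoding is a one-liner in NumPy; this sketch assumes integer labels 0, …, K−1 (so the slide's yn = 2 corresponds to index 1 in 0-based indexing).

```python
import numpy as np

def one_hot(y, num_classes):
    """1-of-K encoding: label k becomes the unit vector with a 1 in position k."""
    Y = np.zeros((len(y), num_classes))
    Y[np.arange(len(y)), y] = 1.0
    return Y

print(one_hot(np.array([1, 0, 2]), 4))   # rows [0,1,0,0], [1,0,0,0], [0,0,1,0]
```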

Cross-entropy error function

Definition: negative log-likelihood

E(w1, w2, . . . , wK) = −∑n ∑k ynk log P(Ck | xn)
                     = −∑n ∑k ynk log ( exp(wkᵀxn) / ∑k′ exp(wk′ᵀxn) )

Properties

• Convex, therefore unique global optimum

• Optimization requires numerical procedures, analogous to those used for binary logistic regression
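
A final sketch that puts the pieces together: the multinomial cross-entropy E(w1, …, wK) and one gradient-descent step. The gradient Xᵀ(P − Y) used here mirrors the (σ(wᵀxn) − yn) xn form of the binary case; W stacks the class weight vectors as rows, and X, Y are the feature matrix and one-hot labels (illustrative names, not from the slides).

```python
import numpy as np

def softmax_rows(S):
    """Row-wise softmax with max-subtraction for numerical stability."""
    Z = np.exp(S - S.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def multinomial_cross_entropy(W, X, Y):
    """E(w_1, ..., w_K) = -sum_n sum_k y_nk log softmax(W x_n)_k.

    W: (K, D+1) weight matrix with one row per class.
    X: (N, D+1) features; Y: (N, K) one-hot labels.
    """
    P = softmax_rows(X @ W.T)
    return -np.sum(Y * np.log(P + 1e-12))

def gradient_step(W, X, Y, eta=0.1):
    """One gradient-descent step on E, using the gradient X^T (P - Y)."""
    P = softmax_rows(X @ W.T)
    return W - eta * ((P - Y).T @ X) / len(X)
```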

Summary

You should know

• What logistic regression is, and how to solve for w using gradient descent on the cross-entropy loss function

• The difference between Naive Bayes and Logistic Regression

• How to solve for the model parameters using gradient descent

• How to handle multi-class classification: one-versus-all, one-versus-one, multinomial logistic regression