
Page 1:

Machine Learning Basics Lecture 2: Linear Classification

Princeton University COS 495

Instructor: Yingyu Liang

Page 2:

Review: machine learning basics

Page 3:

Math formulation

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑦 = 𝑓(𝑥) ∈ 𝓗 that minimizes the empirical loss

𝐿(𝑓) = (1/𝑛) Σ𝑖 𝑙(𝑓, 𝑥𝑖, 𝑦𝑖)

• s.t. the expected loss is small (the two quantities are illustrated in the sketch below)

𝐿(𝑓) = 𝔼(𝑥,𝑦)~𝐷[𝑙(𝑓, 𝑥, 𝑦)]
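
To make the two quantities concrete, here is a minimal NumPy sketch (the distribution, the hypothesis 𝑓, and all names are my own toy choices, not from the lecture): the empirical loss averages 𝑙 over the 𝑛 training points, while the expected loss is approximated by averaging over a much larger fresh sample from the same 𝐷.

```python
import numpy as np

# Toy setup (my own, not from the slides): D generates y = 2x + Gaussian noise.
rng = np.random.default_rng(0)

def sample(n):
    x = rng.normal(size=n)
    return x, 2 * x + 0.5 * rng.normal(size=n)

f = lambda x: 1.8 * x                     # a fixed hypothesis f in H
loss = lambda x, y: (f(x) - y) ** 2       # l(f, x, y): squared loss

x_train, y_train = sample(20)             # n = 20 training points
x_fresh, y_fresh = sample(200_000)        # large fresh sample approximating D

print("empirical loss:", loss(x_train, y_train).mean())
print("expected loss (approx.):", loss(x_fresh, y_fresh).mean())
```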

Page 4:

Machine learning 1-2-3

• Collect data and extract features

• Build model: choose hypothesis class 𝓗 and loss function 𝑙

• Optimization: minimize the empirical loss

Page 5:

Machine learning 1-2-3

• Collect data and extract features   ← experience

• Build model: choose hypothesis class 𝓗 and loss function 𝑙   ← prior knowledge

• Optimization: minimize the empirical loss

Page 6:

Example: Linear regression

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑓𝑤(𝑥) = 𝑤ᵀ𝑥 (the linear model 𝓗) that minimizes the 𝑙2 loss

𝐿(𝑓𝑤) = (1/𝑛) Σ𝑖 (𝑤ᵀ𝑥𝑖 − 𝑦𝑖)²

Page 7:

Why 𝑙2 loss

• Why not choose another loss?
  • 𝑙1 loss, hinge loss, exponential loss, …

• Empirical: easy to optimize
  • For the linear case there is a closed-form solution: 𝑤 = (𝑋ᵀ𝑋)⁻¹𝑋ᵀ𝑦 (see the sketch below)

• Theoretical: a way to encode prior knowledge

Questions:

• What kind of prior knowledge?

• Is there a principled way to derive the loss?
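
As a concrete illustration of that closed form, here is a minimal NumPy sketch (the toy data and variable names are mine, not from the slides). np.linalg.solve(X.T @ X, X.T @ y) computes the same solution as the explicit inverse but is numerically preferable.

```python
import numpy as np

# Toy regression data: n points in d dimensions from a noisy linear model.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares solution: w = (X^T X)^{-1} X^T y
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # should be close to w_true
```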

Page 8:

Maximum Likelihood Estimation

Page 9:

Maximum Likelihood Estimation (MLE)

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑥, 𝑦): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• Would like to pick 𝜃 so that 𝑃𝜃(𝑥, 𝑦) fits the data well

Page 10:

Maximum Likelihood Estimation (MLE)

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑥, 𝑦): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• “fitness” of 𝜃 to one data point (𝑥𝑖, 𝑦𝑖):

likelihood(𝜃; 𝑥𝑖, 𝑦𝑖) ≔ 𝑃𝜃(𝑥𝑖, 𝑦𝑖)

Page 11:

Maximum Likelihood Estimation (MLE)

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑥, 𝑦): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• “fitness” of 𝜃 to the i.i.d. data points {(𝑥𝑖, 𝑦𝑖)}:

likelihood(𝜃; {(𝑥𝑖, 𝑦𝑖)}) ≔ 𝑃𝜃({(𝑥𝑖, 𝑦𝑖)}) = ∏𝑖 𝑃𝜃(𝑥𝑖, 𝑦𝑖)

Page 12:

Maximum Likelihood Estimation (MLE)

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑥, 𝑦): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• MLE: maximize “fitness” of 𝜃 to the i.i.d. data points {(𝑥𝑖, 𝑦𝑖)}

𝜃𝑀𝐿 = argmax𝜃∈Θ ∏𝑖 𝑃𝜃(𝑥𝑖, 𝑦𝑖)

Page 13:

Maximum Likelihood Estimation (MLE)

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑥, 𝑦): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• MLE: maximize “fitness” of 𝜃 to the i.i.d. data points {(𝑥𝑖, 𝑦𝑖)}

𝜃𝑀𝐿 = argmax𝜃∈Θ log[∏𝑖 𝑃𝜃(𝑥𝑖, 𝑦𝑖)]

𝜃𝑀𝐿 = argmax𝜃∈Θ Σ𝑖 log 𝑃𝜃(𝑥𝑖, 𝑦𝑖)

Page 14:

Maximum Likelihood Estimation (MLE)

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑥, 𝑦): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• MLE: negative log-likelihood loss (see the sketch below)

𝜃𝑀𝐿 = argmax𝜃∈Θ Σ𝑖 log 𝑃𝜃(𝑥𝑖, 𝑦𝑖)

𝑙(𝑃𝜃, 𝑥𝑖, 𝑦𝑖) = − log 𝑃𝜃(𝑥𝑖, 𝑦𝑖)

𝐿(𝑃𝜃) = −Σ𝑖 log 𝑃𝜃(𝑥𝑖, 𝑦𝑖)
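
As a small illustration of the negative log-likelihood loss (a toy coin-flip example of my own, not from the lecture), minimizing 𝐿(𝑃𝜃) over a Bernoulli family recovers the empirical frequency as the MLE:

```python
import numpy as np

# Model family: P_theta(y) = theta^y * (1 - theta)^(1 - y), theta in (0, 1).
rng = np.random.default_rng(0)
data = rng.binomial(1, 0.7, size=100)       # 100 i.i.d. coin flips

thetas = np.linspace(0.01, 0.99, 99)
# Negative log-likelihood L(P_theta) = -sum_i log P_theta(y_i)
nll = [-(data * np.log(t) + (1 - data) * np.log(1 - t)).sum() for t in thetas]

theta_ml = thetas[np.argmin(nll)]
print(theta_ml, data.mean())   # MLE equals the empirical frequency (up to grid resolution)
```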

Page 15:

MLE: conditional log-likelihood

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑦|𝑥): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• MLE: negative conditional log-likelihood loss

𝜃𝑀𝐿 = argmax𝜃∈Θ Σ𝑖 log 𝑃𝜃(𝑦𝑖|𝑥𝑖)

𝑙(𝑃𝜃, 𝑥𝑖, 𝑦𝑖) = − log 𝑃𝜃(𝑦𝑖|𝑥𝑖)

𝐿(𝑃𝜃) = −Σ𝑖 log 𝑃𝜃(𝑦𝑖|𝑥𝑖)

Only care about predicting y from x; do not care about p(x)

Page 16:

MLE: conditional log-likelihood

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Let {𝑃𝜃(𝑦|𝑥): 𝜃 ∈ Θ} be a family of distributions indexed by 𝜃

• MLE: negative conditional log-likelihood loss

𝜃𝑀𝐿 = argmax𝜃∈Θ Σ𝑖 log 𝑃𝜃(𝑦𝑖|𝑥𝑖)

𝑙(𝑃𝜃, 𝑥𝑖, 𝑦𝑖) = − log 𝑃𝜃(𝑦𝑖|𝑥𝑖)

𝐿(𝑃𝜃) = −Σ𝑖 log 𝑃𝜃(𝑦𝑖|𝑥𝑖)

P(y|x): discriminative; P(x,y): generative

Page 17:

Example: 𝑙2 loss

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑓𝜃(𝑥) that minimizes 𝐿(𝑓𝜃) = (1/𝑛) Σ𝑖 (𝑓𝜃(𝑥𝑖) − 𝑦𝑖)²

Page 18:

Example: 𝑙2 loss

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑓𝜃(𝑥) that minimizes 𝐿(𝑓𝜃) = (1/𝑛) Σ𝑖 (𝑓𝜃(𝑥𝑖) − 𝑦𝑖)²

• Define 𝑃𝜃(𝑦|𝑥) = Normal(𝑦; 𝑓𝜃(𝑥), 𝜎²)

• log 𝑃𝜃(𝑦𝑖|𝑥𝑖) = −(𝑓𝜃(𝑥𝑖) − 𝑦𝑖)²/(2𝜎²) − log 𝜎 − (1/2) log(2𝜋)

• 𝜃𝑀𝐿 = argmin𝜃∈Θ (1/𝑛) Σ𝑖 (𝑓𝜃(𝑥𝑖) − 𝑦𝑖)² (checked numerically in the sketch below)

𝑙2 loss: Normal + MLE
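
A quick numerical sanity check of this correspondence (a sketch of my own using scipy.stats.norm, not part of the lecture): for fixed 𝜎, the Gaussian negative log-likelihood equals the scaled squared error plus a constant, so both have the same minimizer.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.normal(size=5)            # observed targets y_i
f = rng.normal(size=5)            # stand-in values for f_theta(x_i)
sigma = 1.0

nll = -norm.logpdf(y, loc=f, scale=sigma).sum()        # -sum_i log P_theta(y_i | x_i)
sq = 0.5 / sigma**2 * ((f - y) ** 2).sum()             # scaled squared error
const = len(y) * (np.log(sigma) + 0.5 * np.log(2 * np.pi))
print(np.isclose(nll, sq + const))                     # True
```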

Page 19:

Linear classification

Page 20:

Example 1: image classification

[Figure: example images with labels “indoor” and “outdoor”]

Page 21:

Example 2: Spam detection

             #”$”   #”Mr.”   #”sale”   …   Spam?
Email 1        2       1        1          Yes
Email 2        0       1        0          No
Email 3        1       1        1          Yes
Email n        0       0        0          No
New email      0       0        1          ??

Page 22:

Why classification

• Classification: a kind of summary

• Easy to interpret

• Easy for making decisions

Page 23:

Linear classification

[Figure: a linear decision boundary 𝑤ᵀ𝑥 = 0 with normal vector 𝑤; Class 1 lies on the side where 𝑤ᵀ𝑥 > 0, Class 0 on the side where 𝑤ᵀ𝑥 < 0]

Page 24:

Linear classification: natural attempt

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Hypothesis 𝑓𝑤(𝑥) = 𝑤ᵀ𝑥 (the linear model 𝓗)
  • 𝑦 = 1 if 𝑤ᵀ𝑥 > 0
  • 𝑦 = 0 if 𝑤ᵀ𝑥 < 0

• Prediction: 𝑦 = step(𝑓𝑤(𝑥)) = step(𝑤ᵀ𝑥)

Page 25:

Linear classification: natural attempt

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑓𝑤(𝑥) = 𝑤ᵀ𝑥 to minimize the 0-1 loss

𝐿(𝑓𝑤) = (1/𝑛) Σ𝑖 𝕀[step(𝑤ᵀ𝑥𝑖) ≠ 𝑦𝑖]

• Drawback: difficult to optimize (see the sketch below)
  • NP-hard in the worst case
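
A minimal sketch of this natural attempt (the toy data and the candidate 𝑤 are made up for illustration): predict with step(𝑤ᵀ𝑥) and measure the 0-1 loss. Because this loss is piecewise constant in 𝑤, gradients carry no information, which is one way to see why direct optimization is hard.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -1.0]) > 0).astype(int)     # labels in {0, 1}

def predict(w, X):
    return (X @ w > 0).astype(int)                  # step(w^T x)

def zero_one_loss(w, X, y):
    return np.mean(predict(w, X) != y)              # (1/n) sum_i I[step(w^T x_i) != y_i]

print(zero_one_loss(np.array([1.0, 1.0]), X, y))    # loss of one candidate w
```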

Page 26:

Linear classification: simple approach

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑓𝑤(𝑥) = 𝑤ᵀ𝑥 that minimizes 𝐿(𝑓𝑤) = (1/𝑛) Σ𝑖 (𝑤ᵀ𝑥𝑖 − 𝑦𝑖)²

Reduce to linear regression; ignore the fact that 𝑦 ∈ {0,1}

Page 27:

Linear classification: simple approach

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Drawback: not robust to “outliers”

Page 28:

Compare the two

[Figure: plots of 𝑦 = 𝑤ᵀ𝑥 (linear regression) and 𝑦 = step(𝑤ᵀ𝑥) (classification) against 𝑤ᵀ𝑥]

Page 29:

Between the two

• Prediction bounded in [0,1]

• Smooth

• Sigmoid: 𝜎(𝑎) = 1 / (1 + exp(−𝑎))

Figure borrowed from Pattern Recognition and Machine Learning, Bishop

Page 30:

Linear classification: sigmoid prediction

• Squash the output of the linear function

Sigmoid(𝑤ᵀ𝑥) = 𝜎(𝑤ᵀ𝑥) = 1 / (1 + exp(−𝑤ᵀ𝑥))

• Find 𝑤 that minimizes 𝐿(𝑓𝑤) = (1/𝑛) Σ𝑖 (𝜎(𝑤ᵀ𝑥𝑖) − 𝑦𝑖)²

Page 31:

Linear classification: logistic regression

• Squash the output of the linear function

Sigmoid(𝑤ᵀ𝑥) = 𝜎(𝑤ᵀ𝑥) = 1 / (1 + exp(−𝑤ᵀ𝑥))

• A better approach: interpret it as a probability

𝑃𝑤(𝑦 = 1|𝑥) = 𝜎(𝑤ᵀ𝑥) = 1 / (1 + exp(−𝑤ᵀ𝑥))

𝑃𝑤(𝑦 = 0|𝑥) = 1 − 𝑃𝑤(𝑦 = 1|𝑥) = 1 − 𝜎(𝑤ᵀ𝑥)

Page 32:

Linear classification: logistic regression

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑤 that minimizes

𝐿(𝑤) = −(1/𝑛) Σ𝑖 log 𝑃𝑤(𝑦𝑖|𝑥𝑖)

𝐿(𝑤) = −(1/𝑛) Σ𝑖:𝑦𝑖=1 log 𝜎(𝑤ᵀ𝑥𝑖) − (1/𝑛) Σ𝑖:𝑦𝑖=0 log[1 − 𝜎(𝑤ᵀ𝑥𝑖)]

Logistic regression: MLE with sigmoid
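
Written out directly in NumPy (a bare sketch assuming 𝑦𝑖 ∈ {0,1}; the two per-class sums are combined into one expression using 𝑦𝑖 and 1 − 𝑦𝑖, and the numerical-stability safeguards of real implementations are omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logistic_loss(w, X, y):
    # L(w) = -(1/n) sum_i [ y_i log sigma(w^T x_i) + (1 - y_i) log(1 - sigma(w^T x_i)) ]
    p = sigmoid(X @ w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```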

Page 33:

Linear classification: logistic regression

• Given training data {(𝑥𝑖, 𝑦𝑖): 1 ≤ 𝑖 ≤ 𝑛} i.i.d. from distribution 𝐷

• Find 𝑤 that minimizes

𝐿(𝑤) = −(1/𝑛) Σ𝑖:𝑦𝑖=1 log 𝜎(𝑤ᵀ𝑥𝑖) − (1/𝑛) Σ𝑖:𝑦𝑖=0 log[1 − 𝜎(𝑤ᵀ𝑥𝑖)]

No closed-form solution; need to use gradient descent (a minimal sketch follows below)
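
A minimal gradient-descent sketch for this loss (the toy data, fixed step size, and iteration count are my own choices for illustration). The gradient used is ∇𝐿(𝑤) = (1/𝑛) Σ𝑖 (𝜎(𝑤ᵀ𝑥𝑖) − 𝑦𝑖) 𝑥𝑖, which follows from the sigmoid gradient property on the next slide:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def grad(w, X, y):
    # grad L(w) = (1/n) X^T (sigma(X w) - y)
    return X.T @ (sigmoid(X @ w) - y) / len(y)

# Toy data with labels in {0, 1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X @ np.array([2.0, -1.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)

w = np.zeros(2)
for _ in range(500):          # plain gradient descent with a fixed step size
    w -= 0.5 * grad(w, X, y)
print(w)
```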

Page 34:

Properties of sigmoid function

• Bounded

𝜎(𝑎) = 1 / (1 + exp(−𝑎)) ∈ (0, 1)

• Symmetric

1 − 𝜎(𝑎) = exp(−𝑎) / (1 + exp(−𝑎)) = 1 / (exp(𝑎) + 1) = 𝜎(−𝑎)

• Gradient

𝜎′(𝑎) = exp(−𝑎) / (1 + exp(−𝑎))² = 𝜎(𝑎)(1 − 𝜎(𝑎))
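
A quick numerical check of the gradient identity (my own sketch, not from the slides): a central finite difference is compared against 𝜎(𝑎)(1 − 𝜎(𝑎)).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

a = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (sigmoid(a + eps) - sigmoid(a - eps)) / (2 * eps)   # central difference
analytic = sigmoid(a) * (1 - sigmoid(a))
print(np.allclose(numeric, analytic, atol=1e-8))              # True
```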