TRANSCRIPT
25: Logistic Regression
Lisa Yan
June 3, 2020
1
Quick slide reference
2
3   Background                        25a_background
9   Logistic Regression               25b_logistic_regression
27  Training: The big picture         25c_lr_training
56  Training: The details, Testing    LIVE
59  Philosophy                        LIVE
63  Gradient Derivation               25e_derivation
Background
3
25a_background
1. Weighted sum
4
If X = (x_1, x_2, …, x_m):

z = θ_1 x_1 + θ_2 x_2 + ⋯ + θ_m x_m      (weighted sum)
  = ∑_{j=1}^m θ_j x_j
  = θ^T X                                 (dot product)
1. Weighted sum
5
Recall the linear regression model, where X = (x_1, x_2, …, x_m) and Y ∈ ℝ:

Y(X) = θ_0 + ∑_{j=1}^m θ_j x_j

How would you rewrite this expression as a single dot product? 🤔

(Recall: θ^T X = ∑_{j=1}^m θ_j x_j is a dot product / weighted sum.)
1. Weighted sum
6
Recall the linear regression model, where X = (x_1, x_2, …, x_m) and Y ∈ ℝ:

Y(X) = θ_0 + ∑_{j=1}^m θ_j x_j

How would you rewrite this expression as a single dot product?

Define x_0 = 1. Then:

Y(X) = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ⋯ + θ_m x_m = θ^T X

New X = (1, x_1, x_2, …, x_m).
Prepending x_0 = 1 to each feature vector X makes matrix operations cleaner.
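For concreteness, here is a minimal NumPy sketch of the bias trick (mine, not from the slides; the feature values and weights are made up):

```python
import numpy as np

# Hypothetical numbers for illustration: m = 3 raw features and a weight vector
# whose first entry is the bias theta_0.
x = np.array([2.0, 0.5, 1.0])
theta = np.array([0.1, 0.4, -0.2, 0.3])

x = np.concatenate(([1.0], x))  # prepend x_0 = 1 so the bias folds into the dot product
z = theta @ x                   # theta^T x = sum_{j=0}^{m} theta_j * x_j
print(z)                        # 0.1*1 + 0.4*2.0 + (-0.2)*0.5 + 0.3*1.0 = 1.1
```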
2. Sigmoid function σ(z)
7
• The sigmoid function:  σ(z) = 1 / (1 + e^(−z))
• Sigmoid squashes z to a number between 0 and 1.
• Recall the definition of probability: a number between 0 and 1.
  So σ(z) can represent a probability.

[Plot: σ(z) vs. z for z from −10 to 10; the curve rises from 0 to 1, crossing 0.5 at z = 0.]
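As a quick illustration (my sketch, not part of the slides), evaluating the sigmoid at a few points shows the squashing:

```python
import numpy as np

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^(-z)), which squashes any real z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

for z in (-10, -2, 0, 2, 10):
    print(z, float(sigmoid(z)))
# sigmoid(-10) ~ 0.00005, sigmoid(0) = 0.5, sigmoid(10) ~ 0.99995
```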
3. Conditional likelihood function
8
Training data (n datapoints):
• (x^(i), y^(i)) drawn i.i.d. from a distribution P(X = x^(i), Y = y^(i) | θ) = P(x^(i), y^(i) | θ)

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ)        (conditional likelihood of training data)
      = argmax_θ ∑_{i=1}^n log P(y^(i) | x^(i), θ)     (log conditional likelihood)
      = argmax_θ LL(θ)

• The MLE in this lecture is the estimator that maximizes the conditional likelihood.
• Confusingly, the log conditional likelihood is also written as LL(θ).
Review
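A tiny numeric illustration (mine, with made-up probabilities) of why maximizing the sum of logs is the same as maximizing the product:

```python
import math

# Made-up values of P(y^(i) | x^(i), theta) for three training examples
probs = [0.9, 0.8, 0.95]

likelihood = math.prod(probs)                     # product of conditional probabilities
log_likelihood = sum(math.log(p) for p in probs)  # LL: sum of logs
print(likelihood, math.exp(log_likelihood))       # same value (0.684); log is monotone,
                                                  # so the argmax is unchanged
```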
Logistic Regression
9
25b_logistic_regression
Prediction models so far
10
Linear Regression (Regression):
Y = θ_0 + ∑_{j=1}^m θ_j x_j
✅ Features X can be dependent
🤷‍♀️ Regression model (Y ∈ ℝ, not discrete)

Naïve Bayes (Classification):
ŷ = argmax_{y=0,1} P(Y | X) = argmax_{y=0,1} P(X | Y) P(Y)
✅ Tractable with the NB assumption, but…
⚠️ Realistically, the x_j features are not necessarily conditionally independent
🤷‍♀️ Actually models P(X, Y), not P(Y | X)

(Review)
Introducing Logistic Regression!
11
Linear Regression ideas + compute power → classification models
Logistic Regression
12
Logistic Regression Model: predict Ŷ as the most likely Y given our observation X = x:

ŷ = argmax_{y=0,1} P(y | X)

P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j),   where σ(z) = 1 / (1 + e^(−z)) is the sigmoid function

• Since Y ∈ {0, 1}:  P(Y = 0 | X = x) = 1 − σ(θ_0 + ∑_{j=1}^m θ_j x_j)
• The sigmoid is also known as the logistic function (its inverse is the logit).
Logistic Regression
13
(Slides courtesy of Chris Piech)

P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j)

[Cartoon: input features x = [0, 1, 1] and parameters θ feed the model, which outputs the conditional likelihood P(Y = 1 | X = x) = 0.81.]
Logistic Regression cartoon
14-16
(Slides courtesy of Chris Piech)

P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j)

[Cartoon, built up across slides 14-16: input features X (e.g., [0, 1, 1]) are combined with parameters θ into a weighted sum z; z passes through σ(z) to produce the output P(Y = 1 | X).]
Components of Logistic Regression
17-20
(Slides courtesy of Chris Piech)

P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j)

Reading the model left to right:
• θ: weights (aka parameters)
• θ_0 + ∑_{j=1}^m θ_j x_j: the weighted sum z
• σ(z): squashing function, maps z to a value between 0 and 1
• P(Y = 1 | X): the prediction
Different predictions for different inputs
21-22
(Slides courtesy of Chris Piech)

P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j)

[Cartoon: with θ fixed, input features x = [0, 1, 1] and x = [0, 0, 1] produce different values of P(Y = 1 | X).]
Parameters affect prediction
23-24
(Slides courtesy of Chris Piech)

P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j)

[Cartoon: with the input x fixed, different parameter vectors θ produce different values of P(Y = 1 | X).]
For simplicity
25
P(Y = 1 | X = x) = σ(θ_0 + ∑_{j=1}^m θ_j x_j)

P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x),   where x_0 = 1
Logistic regression classifier
26
Training: Estimate parameters θ = (θ_0, θ_1, θ_2, …, θ_m) from training data.

Testing: Given an observation X = (x_1, x_2, …, x_m), predict

P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x)
ŷ = argmax_{y=0,1} P(y | X)
Training: The big picture
27
25c_lr_training
Logistic regression classifier
28
Training: Estimate parameters θ = (θ_0, θ_1, θ_2, …, θ_m) from training data.
Choose θ that optimizes some objective:
1. Determine the objective function
2. Find its gradient with respect to θ
3. Solve analytically by setting the gradient to 0, or computationally with gradient ascent

We are modeling P(Y | X) directly, so we maximize the conditional likelihood of the training data.

Testing: Given an observation X = (x_1, x_2, …, x_m), predict
P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x)
ŷ = argmax_{y=0,1} P(y | X)
Estimating θ
29
1. Determine the objective function:

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ)

2. Find the gradient w.r.t. θ_j, for j = 0, 1, …, m
3. Solve
• No analytical derivation of θ_MLE…
• …but we can still compute θ_MLE with gradient ascent!

initialize x
repeat many times:
    compute gradient
    x += η * gradient
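To make the loop concrete, here is a minimal sketch of mine of gradient ascent on a simple concave function; the function and step size are arbitrary choices for illustration:

```python
def gradient_ascent(grad, x0=0.0, eta=0.1, steps=100):
    """Repeatedly step uphill in the direction of the gradient."""
    x = x0
    for _ in range(steps):
        x += eta * grad(x)
    return x

# f(x) = -(x - 3)^2 is concave with its maximum at x = 3, and f'(x) = -2(x - 3)
print(gradient_ascent(lambda x: -2 * (x - 3)))  # converges to ~3.0
```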
1. Determine objective function
30
P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x)

θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)

First: interpret the conditional likelihood under Logistic Regression.
Second: write a differentiable expression for the log conditional likelihood.
1. Determine objective function (interpret)
31-32
Suppose you have n = 2 training datapoints: (x^(1), 1), (x^(2), 0).
Consider the following expressions for a given θ:

A. σ(θ^T x^(1)) σ(θ^T x^(2))
B. (1 − σ(θ^T x^(1))) σ(θ^T x^(2))
C. σ(θ^T x^(1)) (1 − σ(θ^T x^(2)))
D. (1 − σ(θ^T x^(1))) (1 − σ(θ^T x^(2)))

1. Interpret the above expressions as probabilities. 🤔
2. If we let θ = θ_MLE, which probability should be highest?

Since σ(θ^T x^(i)) = P(Y = 1 | X = x^(i)), each expression is the probability of one particular pair of labels. C is the conditional likelihood of the labels actually observed (y^(1) = 1, y^(2) = 0), so C should be highest at θ = θ_MLE.
1. Determine objective function (write)
33-34
θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ),   where P(Y = 1 | X = x) = σ(θ^T x)

1. What is a differentiable expression for P(Y = y | X = x)? 🤔

P(Y = y | X = x) = { σ(θ^T x)       if y = 1
                   { 1 − σ(θ^T x)   if y = 0

2. What is a differentiable expression for LL(θ), the log conditional likelihood?

LL(θ) = log ∏_{i=1}^n P(y^(i) | x^(i), θ)

(Hint: recall the Bernoulli MLE!)
1. Determine objective function (write)
35
1. A differentiable expression for P(Y = y | X = x):

P(Y = y | X = x) = σ(θ^T x)^y (1 − σ(θ^T x))^(1−y)

2. A differentiable expression for LL(θ):

LL(θ) = ∑_{i=1}^n [ y^(i) log σ(θ^T x^(i)) + (1 − y^(i)) log(1 − σ(θ^T x^(i))) ]
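A direct transcription of LL(θ) into NumPy (my sketch; the toy data are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, X, y):
    """LL(theta) = sum_i y_i*log(sigma(theta^T x_i)) + (1 - y_i)*log(1 - sigma(theta^T x_i))."""
    p = sigmoid(X @ theta)  # p[i] = sigma(theta^T x^(i))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# made-up toy data, with x_0 = 1 already prepended to each row
X = np.array([[1.0, 2.0],
              [1.0, -1.0]])
y = np.array([1, 0])
print(log_likelihood(np.array([0.0, 1.0]), X, y))  # ~ -0.44
```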
2. Find gradient with respect to θ
36
Optimization problem:
θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)
LL(θ) = ∑_{i=1}^n [ y^(i) log σ(θ^T x^(i)) + (1 − y^(i)) log(1 − σ(θ^T x^(i))) ]

Gradient w.r.t. θ_j, for j = 0, 1, …, m (derived later):

∂LL(θ)/∂θ_j = ∑_{i=1}^n (y^(i) − σ(θ^T x^(i))) x_j^(i)

How do we interpret the gradient contribution of the i-th training datapoint? 🤔
2. Find gradient with respect to θ
37-39
∂LL(θ)/∂θ_j = ∑_{i=1}^n (y^(i) − σ(θ^T x^(i))) x_j^(i)

Interpreting the i-th datapoint's contribution:
• y^(i) is the true label, 1 or 0.
• σ(θ^T x^(i)) = P(Y = 1 | X = x^(i)), the model's current prediction.
• The difference is scaled by the j-th feature, x_j^(i).

Suppose y^(i) = 1 (the true class label for the i-th datapoint):
• If σ(θ^T x^(i)) ≥ 0.5, the prediction is correct.
• If σ(θ^T x^(i)) < 0.5, the prediction is incorrect → change θ_j more.
3. Solve
40
1. Optimization problem:
θ_MLE = argmax_θ ∏_{i=1}^n P(y^(i) | x^(i), θ) = argmax_θ LL(θ)
LL(θ) = ∑_{i=1}^n [ y^(i) log σ(θ^T x^(i)) + (1 − y^(i)) log(1 − σ(θ^T x^(i))) ]

2. Gradient w.r.t. θ_j, for j = 0, 1, …, m:
∂LL(θ)/∂θ_j = ∑_{i=1}^n (y^(i) − σ(θ^T x^(i))) x_j^(i)

3. Solve: Stay tuned!
(live) 25: Logistic Regression
Slides by Lisa Yan
August 12, 2020
41
Logistic Regression Model
42
P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x),   where x_0 = 1 and σ(z) = 1 / (1 + e^(−z)) is the sigmoid function

Ŷ = argmax_{y=0,1} P(y | X)      (Ŷ is the prediction of Y)

(Review)
Another view of Logistic Regression
43
Logistic Regression Model:
P(Y = 1 | X = x) = σ(θ^T x),   where θ^T x = ∑_{j=0}^m θ_j x_j

For the "correct" parameters θ:
• (x, 1) should have θ^T x > 0
• (x, 0) should have θ^T x ≤ 0

[Figure: datapoints placed on a number line by z = θ^T x; points with label 1 fall to the right of 0, points with label 0 to the left.]
Learning parameters
44
Training: Learn parameters θ = (θ_0, θ_1, …, θ_m).

θ_MLE = argmax_θ LL(θ)
LL(θ) = ∑_{i=1}^n [ y^(i) log σ(θ^T x^(i)) + (1 − y^(i)) log(1 − σ(θ^T x^(i))) ]

• No analytical derivation of θ_MLE…
• …but we can still compute θ_MLE with gradient ascent!

∂LL(θ)/∂θ_j = ∑_{i=1}^n (y^(i) − σ(θ^T x^(i))) x_j^(i),   for j = 0, 1, …, m

(Review)
Gradient Ascent
45
Walk uphill and you will find a local maximum (if your step is small enough).

[Figure: the surface LL(θ) plotted over (θ_1, θ_2).]

Logistic regression's LL(θ) is concave, so a local maximum is the global maximum.

(Review)
Training: The details
46
LIVE
Training: Gradient ascent step
47
3. Optimize.

∂LL(θ)/∂θ_j = ∑_{i=1}^n (y^(i) − σ(θ^T x^(i))) x_j^(i)

Gradient ascent step, repeated many times, for all thetas:

θ_j^new = θ_j^old + η · ∂LL(θ^old)/∂θ_j^old
        = θ_j^old + η · ∑_{i=1}^n (y^(i) − σ((θ^old)^T x^(i))) x_j^(i)

What does this look like in code?
Training: Gradient Ascent
48-55
Gradient ascent step:  θ_j^new = θ_j^old + η · ∑_{i=1}^n (y^(i) − σ((θ^old)^T x^(i))) x_j^(i)

In (pseudo)code:

initialize θ_j = 0 for 0 ≤ j ≤ m
repeat many times:
    gradient[j] = 0 for 0 ≤ j ≤ m
    for each training example (x, y):
        // update gradient[j] for current (x, y) example
        for each 0 ≤ j ≤ m:
            gradient[j] += (y − 1 / (1 + e^(−θ^T x))) · x_j
    θ_j += η * gradient[j] for all 0 ≤ j ≤ m

The important details:
• x_j is the j-th feature of input x = (x_1, …, x_m).
• Insert x_0 = 1 before training.
• Finish computing the full gradient before updating any part of θ.
• The learning rate η is a constant you set before training.
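A runnable NumPy version of the pseudocode above (a sketch of mine; the function name, step size, and toy data are all made up for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.05, num_iters=5000):
    """Batch gradient ascent on LL(theta), following the pseudocode above."""
    X = np.column_stack([np.ones(len(X)), X])      # insert x_0 = 1 before training
    theta = np.zeros(X.shape[1])                   # initialize theta_j = 0
    for _ in range(num_iters):                     # repeat many times
        # compute the full gradient over all n examples before touching theta
        gradient = X.T @ (y - sigmoid(X @ theta))  # sum_i (y^(i) - sigma(theta^T x^(i))) x^(i)
        theta += eta * gradient                    # learning rate eta is fixed before training
    return theta

# made-up, non-separable toy data: one feature, label mostly 1 for larger x
X = np.array([[0.5], [1.0], [1.5], [2.0], [2.5], [3.0]])
y = np.array([0, 0, 1, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(theta)  # theta[0] is the bias theta_0
```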
Testing
56
LIVE
Introducing notation ŷ
57
P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x)
Ŷ = argmax_{y=0,1} P(y | X)

ŷ = P(Y = 1 | X = x) = σ(θ^T x)

P(Y = y | X = x) = { ŷ       if y = 1
                   { 1 − ŷ   if y = 0

Lowercase ŷ is the conditional probability; uppercase Ŷ is the prediction of Y.
Testing: Classification with Logistic Regression
58
Training: Learn parameters θ = (θ_0, θ_1, …, θ_m) via gradient ascent:
θ_j^new = θ_j^old + η · ∑_{i=1}^n (y^(i) − σ((θ^old)^T x^(i))) x_j^(i)

Testing:
• Compute ŷ = P(Y = 1 | X = x) = σ(θ^T x) = 1 / (1 + e^(−θ^T x))
• Classify the instance as:
  1 if ŷ > 0.5 (equivalently, θ^T x > 0)
  0 otherwise

⚠️ Parameters θ_j are not updated during the testing phase.
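Prediction with learned parameters might look like the following sketch (mine; the θ values are made up). Note that θ is not updated at test time:

```python
import numpy as np

def predict(theta, x):
    """Classify one test example: 1 if y_hat = sigma(theta^T x) > 0.5, i.e. theta^T x > 0."""
    x = np.concatenate(([1.0], x))  # same x_0 = 1 convention as in training
    return 1 if float(x @ theta) > 0 else 0

# made-up parameters standing in for the output of training; theta is NOT updated here
theta = np.array([-4.2, 2.4])
print(predict(theta, np.array([0.7])))  # 0, since -4.2 + 2.4*0.7 = -2.52 <= 0
print(predict(theta, np.array([2.8])))  # 1, since -4.2 + 2.4*2.8 = +2.52 > 0
```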
Interlude for jokes/announcements
59
Announcements
1. Pset 6 due tomorrow at 1pm. No late days or on-time bonus for this pset.
2. Look out for extra office hours + review session for the Final Quiz
3. Final Quiz begins Friday 5pm and ends Sunday 5pm.
4. You're so close, you got this!
60
Ethics and datasets
61
Sometimes machine learning feels universally unbiased.
We can even prove our estimators are "unbiased" (mathematically).
Yet Google, Nikon, and HP all had biased datasets.
Should your data be unbiased?
62
Dataset: Google News
Bolukbasi et al., Man is to Computer Programmer as Woman is to
Homemaker? Debiasing Word Embeddings. NIPS 2016
Should our unbiased data collection reflect society's systemic bias?
How can we explain decisions?
63
If your task is image classification, reasoning about high-level features is relatively easy.
Everything can be visualized.

What if you are trying to classify social outcomes?
• Criminal recidivism
• Job performance
• Policing
• Terrorist risk
• At-risk kids

Ethics in Machine Learning is a whole new field.
64
Philosophy
65
LIVE
Intuition about Logistic Regression
66
Logistic Regression is trying to fit a line that separates data instances where y = 1 from those where y = 0:
• We call such data (or the functions generating the data) linearly separable.
• Naïve Bayes is linear too, because there is no interaction between different features.

Logistic Regression Model:
P(Y = 1 | X = x) = σ(θ^T x),   where θ^T x = ∑_{j=0}^m θ_j x_j

[Figure: a scatter of datapoints with the decision boundary θ^T x = 0 drawn between the two classes.]
Data is often not linearly separable
67
• It is not possible to draw a line that successfully separates all the y = 1 points (green) from the y = 0 points (red).
• Despite this, Logistic Regression and Naïve Bayes still often work well in practice.
Many tradeoffs in choosing an algorithm
68
Naïve Bayes vs. Logistic Regression:

• Modeling goal: Naïve Bayes models P(X, Y); Logistic Regression models P(Y | X).
• Generative or discriminative? Naïve Bayes is generative: it could use the joint distribution to generate new points (⚠️ but you might not need this extra effort). Logistic Regression is discriminative: it just tries to discriminate y = 0 vs. y = 1 (❌ it cannot generate new points, because it has no P(X, Y)).
• Continuous input features: Naïve Bayes ⚠️ needs a parametric form (e.g., Gaussian) or discretized buckets (for multinomial features); Logistic Regression ✅ handles them easily.
• Discrete input features: Naïve Bayes ✅ handles multi-valued discrete data naturally (multinomial P(x_j | Y)); Logistic Regression ⚠️ finds multi-valued discrete data hard (e.g., if x_j ∈ {A, B, C}, it is not necessarily good to encode it as 1, 2, 3).
Gradient Derivation
69
25e_derivation
Background: Calculus
70
Calculus refresher #1: the derivative of a sum is the sum of the derivatives:

d/dx ∑_{i=1}^n f_i(x) = ∑_{i=1}^n d f_i(x)/dx

Calculus refresher #2: the Chain Rule, aka decomposition of composed functions. If f(x) = g(z(x)), then:

df(x)/dx = (dg(z)/dz) · (dz/dx)
Are you ready?
71
Right now!!!
Compute gradient of log conditional likelihood
72
Find  ∂LL(θ)/∂θ_j,  where

LL(θ) = ∑_{i=1}^n [ y^(i) log σ(θ^T x^(i)) + (1 − y^(i)) log(1 − σ(θ^T x^(i))) ]      (log conditional likelihood)
Aside: Sigmoid has a beautiful derivative
73-74
Sigmoid function:  σ(z) = 1 / (1 + e^(−z))
Derivative:  dσ(z)/dz = σ(z)(1 − σ(z))

What is ∂/∂θ_j σ(θ^T x)? 🤔
A. σ(x_j)(1 − σ(x_j)) x_j
B. σ(θ^T x)(1 − σ(θ^T x)) θ_j
C. σ(θ^T x)(1 − σ(θ^T x)) x_j
D. σ(θ^T x) x_j (1 − σ(θ^T x) x_j)
E. None/other

Let z = θ^T x. Then (Chain Rule):

∂/∂θ_j σ(θ^T x) = (dσ(z)/dz) · (∂z/∂θ_j)

Since z = ∑_{j=0}^m θ_j x_j, we have ∂z/∂θ_j = x_j, so

∂/∂θ_j σ(θ^T x) = σ(θ^T x)(1 − σ(θ^T x)) x_j      (answer C)
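A quick finite-difference check of this identity (my sketch, not part of the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, h = 0.7, 1e-6
numeric = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)  # central finite difference
analytic = sigmoid(z) * (1 - sigmoid(z))               # sigma(z) * (1 - sigma(z))
print(numeric, analytic)                               # agree to many decimal places
```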
Re-introducing notation ŷ
75
P(Y = 1 | X = x) = σ(∑_{j=0}^m θ_j x_j) = σ(θ^T x)
Ŷ = argmax_{y=0,1} P(y | X)

ŷ = P(Y = 1 | X = x) = σ(θ^T x)

P(Y = y | X = x) = { ŷ       if y = 1
                   { 1 − ŷ   if y = 0

Equivalently:  P(Y = y | X = x) = ŷ^y (1 − ŷ)^(1−y)
Compute gradient of log conditional likelihood
76
Find  ∂LL(θ)/∂θ_j,  where

LL(θ) = ∑_{i=1}^n [ y^(i) log σ(θ^T x^(i)) + (1 − y^(i)) log(1 − σ(θ^T x^(i))) ]      (log conditional likelihood)

With ŷ^(i) = σ(θ^T x^(i)), this becomes:

LL(θ) = ∑_{i=1}^n [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]
Compute gradient of log conditional likelihood
77-78
Let ŷ^(i) = σ(θ^T x^(i)). Then:

∂LL(θ)/∂θ_j = ∑_{i=1}^n ∂/∂θ_j [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ]

            = ∑_{i=1}^n ∂/∂ŷ^(i) [ y^(i) log ŷ^(i) + (1 − y^(i)) log(1 − ŷ^(i)) ] · ∂ŷ^(i)/∂θ_j       (Chain Rule)

            = ∑_{i=1}^n [ y^(i) (1/ŷ^(i)) − (1 − y^(i)) (1/(1 − ŷ^(i))) ] · ŷ^(i) (1 − ŷ^(i)) x_j^(i)   (calculus)

            = ∑_{i=1}^n (y^(i) − ŷ^(i)) x_j^(i)                                                         (simplify)

            = ∑_{i=1}^n (y^(i) − σ(θ^T x^(i))) x_j^(i)
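As a sanity check on the final formula, here is a sketch of mine comparing it against a finite-difference gradient of LL(θ) on made-up data:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def LL(theta, X, y):
    p = sigmoid(X @ theta)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])  # x_0 = 1 prepended
y = np.array([1, 0, 1, 1, 0])
theta = rng.normal(size=3)

analytic = X.T @ (y - sigmoid(X @ theta))  # the derived gradient, all j at once
j, h = 1, 1e-6
e = np.zeros(3)
e[j] = h
numeric = (LL(theta + e, X, y) - LL(theta - e, X, y)) / (2 * h)
print(analytic[j], numeric)                # match up to finite-difference error
```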
Wow. We did it!
79
CS109 Wrap-up
80
LIVE
What have we learned in CS109?
81
A wild journey
82
Computer science + Probability
From combinatorics to probability…
83
P(E) + P(E^C) = 1
…to random variables and the Central Limit Theorem…
84
Bernoulli
Poisson
Gaussian
…to statistics, parameter estimation, and machine learning
85
[Images: a happy Bhutanese person; a log-likelihood curve LL(θ); a Bayes net with nodes Flu, Undergrad, Tired, Fever.]
Lots and lots of analysis
86
• Climate sensitivity
• Bloom filters
• Peer grading
• Biometric keystroke recognition
• WebMD inference
• Coursera A/B testing
Lots and lots of analysis
87
• Heart
• Ancestry
• Netflix
After CS109
88
Statistics
• Stats 200 – Statistical Inference
• Stats 208 – Intro to the Bootstrap
• Stats 209 – Group Methods/Causal Inference
Theory
• CS 161 – Algorithmic Analysis
• CS 168 – ~Modern~ Algorithmic Analysis
• Stats 217 – Stochastic Processes
• CS 238 – Decision Making Under Uncertainty
• CS 228 – Probabilistic Graphical Models
After CS109
89
AI
• CS 221 – Intro to AI
• CS 229 – Machine Learning
• CS 230 – Deep Learning
• CS 224N – Natural Language Processing
• CS 231N – Conv Neural Nets for Visual Recognition
• CS 234 – Reinforcement Learning
Applications
• CS 279 – Bio Computation
• Literally any class with numbers in it
What do you want to remember in 5 years?
90
Why study probability + CS?
91
Why study probability + CS?
92
[Chart of job projections. Source: US Bureau of Labor Statistics]
Why study probability + CS?
93
• Interdisciplinary
• Closest thing to magic
Why study probability + CS?
94
Everyone is welcome!
Technology magnifies.
What do we want magnified?
95
You are all one step closer to improving the world.
(all of you!)
96
The end
97