Gradient Ascent
Chris Piech
CS109, Stanford University
Our Path
Parameter Estimation
Linear Regression
Naïve Bayes
Logistic Regression
Deep Learning
Our Path
Linear Regression
Naïve Bayes
Logistic Regression
Deep Learning
Unbiased estimators
Maximizing likelihood
Bayesian estimation
Review
• Consider n I.I.D. random variables X1, X2, ..., Xn
§ Xi is a sample from density function f(Xi | θ)
§ What is the best choice of parameters θ?
Parameter Learning
Likelihood (of data given parameters):
$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$
Maximum Likelihood Estimation
$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$
$LL(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$
$\hat{\theta} = \operatorname*{argmax}_{\theta} \; LL(\theta)$
Argmax?
Option #1: Straight optimization
• General approach for finding the MLE of θ:
§ Determine a formula for LL(θ)
§ Differentiate LL(θ) w.r.t. (each) θ: $\frac{\partial LL(\theta)}{\partial \theta}$
§ To maximize, set $\frac{\partial LL(\theta)}{\partial \theta} = 0$
§ Solve the resulting (simultaneous) equations to get θMLE
o Make sure the derived θMLE is actually a maximum (and not a minimum or saddle point), e.g., check LL(θMLE ± ε) < LL(θMLE)
• This step is often ignored in expository derivations
• So, we'll ignore it here too (and won't require it in this class)
Computing the MLE
End Review
Maximizing Likelihood with Bernoulli
• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p):
• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p):
Maximizing Likelihood with Bernoulli
[Plot: PMF of Bernoulli, with P(X = 0) = 1 − p and P(X = 1) = p]
$f(X_i \mid p) = p^{x_i}(1 - p)^{1 - x_i}$, where $x_i = 0$ or $1$
PMF of Bernoulli (p = 0.2):
$f(x) = 0.2^x (1 - 0.2)^{1 - x}$
Bernoulli PMF
$X \sim \text{Ber}(p)$
$f(X = x \mid p) = p^{x}(1 - p)^{1 - x}$
• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Ber(p)
§ Probability mass function, f(Xi | p), can be written as:
$f(X_i \mid p) = p^{x_i}(1 - p)^{1 - x_i}$, where $x_i = 0$ or $1$
§ Likelihood:
$L(\theta) = \prod_{i=1}^{n} p^{X_i}(1 - p)^{1 - X_i}$
§ Log-likelihood:
$LL(\theta) = \sum_{i=1}^{n} \log\!\left(p^{X_i}(1 - p)^{1 - X_i}\right) = \sum_{i=1}^{n}\left[X_i \log(p) + (1 - X_i)\log(1 - p)\right]$
$= Y \log(p) + (n - Y)\log(1 - p)$, where $Y = \sum_{i=1}^{n} X_i$
§ Differentiate w.r.t. p, and set to 0:
$\frac{\partial LL(p)}{\partial p} = Y\frac{1}{p} - (n - Y)\frac{1}{1 - p} = 0 \;\Rightarrow\; p_{MLE} = \frac{Y}{n} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Maximizing Likelihood with Bernoulli
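As a quick sanity check on the algebra above, here is a minimal Python sketch (assuming NumPy; the sample values are made up for illustration) that compares the closed-form answer pMLE = (1/n) Σ Xi against a brute-force argmax of the log-likelihood:

```python
import numpy as np

# Hypothetical Bernoulli samples (made up for illustration)
samples = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])

def bernoulli_ll(p, xs):
    # LL(p) = sum_i [ x_i log(p) + (1 - x_i) log(1 - p) ]
    return np.sum(xs * np.log(p) + (1 - xs) * np.log(1 - p))

p_closed_form = samples.mean()                       # p_MLE = (1/n) sum x_i
grid = np.linspace(0.01, 0.99, 999)                  # candidate values of p
p_brute_force = grid[np.argmax([bernoulli_ll(p, samples) for p in grid])]

print(p_closed_form, p_brute_force)                  # both approximately 0.7
```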
Isn’t that the same as the sample mean?
Yes. For Bernoulli.
Maximum Likelihood Algorithm
1. Decide on a model for the distribution of your samples. Define the PMF / PDF for your sample.
2. Write out the log likelihood function.
3. State that the optimal parameters are the argmax of the log likelihood function.
4. Use an optimization algorithm to calculate the argmax.
• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Poi(λ)
§ PMF: $f(X_i \mid \lambda) = \dfrac{e^{-\lambda}\lambda^{X_i}}{X_i!}$
§ Likelihood: $L(\theta) = \prod_{i=1}^{n} \dfrac{e^{-\lambda}\lambda^{X_i}}{X_i!}$
§ Log-likelihood:
$LL(\theta) = \sum_{i=1}^{n} \log\!\left(\dfrac{e^{-\lambda}\lambda^{X_i}}{X_i!}\right) = \sum_{i=1}^{n}\left[-\lambda + X_i\log(\lambda) - \log(X_i!)\right] = -n\lambda + \log(\lambda)\sum_{i=1}^{n} X_i - \sum_{i=1}^{n}\log(X_i!)$
§ Differentiate w.r.t. λ, and set to 0:
$\frac{\partial LL(\lambda)}{\partial \lambda} = -n + \frac{1}{\lambda}\sum_{i=1}^{n} X_i = 0 \;\Rightarrow\; \lambda_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i$
Maximizing Likelihood with Poisson
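The same kind of numerical check works for Poisson; a sketch assuming NumPy, with made-up count data. The log(Xi!) term is dropped because it does not depend on λ, so it cannot change the argmax:

```python
import numpy as np

# Hypothetical Poisson counts (made up for illustration)
samples = np.array([3, 1, 4, 2, 2, 5, 3, 0, 2, 3])

def poisson_ll(lam, xs):
    # LL(lambda) = sum_i [ x_i log(lambda) - lambda ]   (constant log(x_i!) dropped)
    return np.sum(xs * np.log(lam) - lam)

lam_closed_form = samples.mean()                      # lambda_MLE = sample mean
grid = np.linspace(0.1, 10.0, 1000)                   # candidate values of lambda
lam_brute_force = grid[np.argmax([poisson_ll(l, samples) for l in grid])]

print(lam_closed_form, lam_brute_force)               # both approximately 2.5
```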
It is so general!
• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Uni(a, b)
§ PDF: $f(X_i \mid a, b) = \begin{cases} \frac{1}{b - a} & a \le x_i \le b \\ 0 & \text{otherwise} \end{cases}$
§ Likelihood: $L(\theta) = \begin{cases} \left(\frac{1}{b - a}\right)^{n} & a \le x_1, x_2, \ldots, x_n \le b \\ 0 & \text{otherwise} \end{cases}$
o Constraint a ≤ x1, x2, …, xn ≤ b makes differentiation tricky
o Intuition: want the interval size (b − a) to be as small as possible to maximize the likelihood for each data point
o But need to make sure all observed data is contained in the interval
• If all observed data is not in the interval, then L(θ) = 0
§ Solution: aMLE = min(x1, …, xn), bMLE = max(x1, …, xn)
Maximizing Likelihood with Uniform
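Since there is no derivative to set to zero here, the "optimizer" is just min and max; a minimal Python sketch (assuming NumPy), reusing the observed data from the next slide:

```python
import numpy as np

# Samples from Uni(a, b) (the observed data from the next slide)
samples = np.array([0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75])

# L(a, b) = (1 / (b - a))^n if every sample lies in [a, b], and 0 otherwise,
# so the tightest interval containing all of the data maximizes the likelihood.
a_mle = samples.min()
b_mle = samples.max()
print(a_mle, b_mle)  # 0.15 0.75
```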
• Consider I.I.D. random variables X1, X2, ..., Xn
§ Xi ~ Uni(0, 1)
§ Observe data:
o 0.15, 0.20, 0.30, 0.40, 0.65, 0.70, 0.75
[Plot: likelihood L(a, 1) as a function of a]
[Plot: likelihood L(0, b) as a function of b]
Understanding MLE with Uniform
• How do small samples affect MLE?
§ In many cases, $\mu_{MLE} = \frac{1}{n}\sum_{i=1}^{n} X_i$ = sample mean
o Unbiased. Not too shabby…
§ Estimating Normal, $\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}\left(X_i - \mu_{MLE}\right)^2$
o Biased. Underestimates σ² for small n (e.g., 0 for n = 1)
§ As seen with Uniform, aMLE ≥ a and bMLE ≤ b
o Biased. Problematic for small n (e.g., aMLE = bMLE when n = 1)
§ Small sample phenomena intuitively make sense:
o Maximum likelihood ⇒ best explain the data we've seen
o Does not attempt to generalize to unseen data
Small Samples = Problems
• Maximum Likelihood Estimators are generally:
§ Consistent: for ε > 0, $\lim_{n \to \infty} P\!\left(|\hat{\theta} - \theta| < \varepsilon\right) = 1$
§ Potentially biased (though asymptotically less so)
§ Asymptotically optimal
o Has the smallest variance of "good" estimators for large samples
§ Often used in practice where sample size is large relative to the parameter space
o But be careful, there are some very large parameter spaces
Properties of MLE
Maximum Likelihood Estimation
$L(\theta) = \prod_{i=1}^{n} f(X_i \mid \theta)$
$LL(\theta) = \sum_{i=1}^{n} \log f(X_i \mid \theta)$
$\hat{\theta} = \operatorname*{argmax}_{\theta} \; LL(\theta)$
Argmax 2: Gradient Ascent
Argmax
Option #1: Straight optimization
Argmax
Option #2: Gradient Ascent
Gradient Ascent
Walk uphill and you will find a local maximum (if your step size is small enough)
[Plot: p(samples | θ) as a function of θ, with the argmax marked]
Gradient Ascent
Walk uphill and you will find a local maximum (if your step size is small enough)
[Plot: p(samples | θ) as a function of θ, with the argmax marked]
Gradient Ascent
Walk uphill and you will find a local maximum (if your step size is small enough)
Especially good if the function is convex
[Plot: p(samples | θ) as a function of θ, showing two starting points θ1 and θ2]
$\theta_j^{\text{new}} = \theta_j^{\text{old}} + \eta \cdot \frac{\partial LL(\theta^{\text{old}})}{\partial \theta_j^{\text{old}}}$
Gradient Ascent
Repeat many times
Walk uphill and you will find a local maximum (if your step size is small enough)
This is some profound life philosophy
Gradient ascent is your bread-and-butter algorithm for optimization (e.g., argmax)
Initialize: θj = 0 for all 0 ≤ j ≤ m
Gradient Ascent
Calculate all θj
Gradient Ascent
Initialize: θj = 0 for all 0 ≤ j ≤ m
Repeat many times:
  gradient[j] = 0 for all 0 ≤ j ≤ m
  Calculate all gradient[j]'s based on data
  θj += η * gradient[j] for all 0 ≤ j ≤ m
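A minimal Python translation of the pseudocode above (assuming NumPy). It assumes the caller supplies a function, here called grad_ll (a hypothetical name), that returns the gradient of the log-likelihood at the current θ; the step size η and the iteration count are illustrative:

```python
import numpy as np

def gradient_ascent(grad_ll, num_params, eta=0.01, num_iters=1000):
    """Generic gradient ascent: repeat theta_j += eta * gradient[j] many times.

    grad_ll(theta) must return the gradient of LL at theta as a
    length-num_params NumPy array (an assumption of this sketch).
    """
    theta = np.zeros(num_params)          # Initialize: theta_j = 0 for all j
    for _ in range(num_iters):            # Repeat many times
        gradient = grad_ll(theta)         # Calculate all gradient[j]'s based on data
        theta += eta * gradient           # theta_j += eta * gradient[j] for all j
    return theta
```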
Review: Maximum Likelihood Algorithm
1. Decide on a model for the likelihood of your samples. This is often using a PMF or PDF.
2. Write out the log likelihood function.
3. State that the optimal parameters are the argmax of the log likelihood function.
4. Use an optimization algorithm to calculate the argmax.
Review: Maximum Likelihood Algorithm
1. Decide on a model for the likelihood of your samples. This is often using a PMF or PDF.
2. Write out the log likelihood function.
3. State that the optimal parameters are the argmax of the log likelihood function.
4. Calculate the derivative of LL with respect to theta.
5. Use an optimization algorithm to calculate the argmax.
Linear Regression Lite
Predicting Warriors
X1 = Opposing team ELO
X2 = Points in last game
X3 = Curry playing?
X4 = Playing at home?
Y = Warriors points
Predicting CO2 (simple)
(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))
N training datapoints
Linear Regression Lite Model
$Y = \theta \cdot X + Z$, where $Z \sim N(0, \sigma^2)$, so $Y \mid X \sim N(\theta X, \sigma^2)$
X = CO2 level
Y = Average Global Temperature
1) Write Likelihood Fn
(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))
N training datapoints
Model: $Y \mid X \sim N(\theta X, \sigma^2)$
First, calculate Likelihood of the data
Shorthand for:
1) Write Likelihood Fn
(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))
N training datapoints
Model: $Y \mid X \sim N(\theta X, \sigma^2)$
First, calculate Likelihood of the data
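The likelihood itself appears as an image in the source slide; a hedged reconstruction from the model Y | X ~ N(θX, σ²) stated above (treating σ² as a known constant) is:

$L(\theta) = \prod_{i=1}^{n} f\!\left(y^{(i)} \mid x^{(i)}, \theta\right) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{\left(y^{(i)} - \theta x^{(i)}\right)^2}{2\sigma^2}\right)$
$LL(\theta) = \sum_{i=1}^{n}\left[-\log\!\left(\sqrt{2\pi\sigma^2}\right) - \frac{\left(y^{(i)} - \theta x^{(i)}\right)^2}{2\sigma^2}\right]$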
2) Write Log Likelihood Fn
Second, calculate Log Likelihood of the data
Likelihood function:
(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:
3) State MLE as Optimization
Third, celebrate!
Log Likelihood:
(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:
4) Find derivative
Fourth, optimize!
N training datapoints: (x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))
Goal:
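The goal and the derivative on this slide were also images; continuing the same hedged reconstruction, maximizing LL(θ) is equivalent to minimizing the sum of squared errors, and the derivative (up to the positive constant 1/σ², which does not move the zero) is:

$\theta_{MLE} = \operatorname*{argmax}_{\theta} \; LL(\theta) = \operatorname*{argmax}_{\theta} \; -\sum_{i=1}^{n}\left(y^{(i)} - \theta x^{(i)}\right)^2$
$\frac{\partial LL(\theta)}{\partial \theta} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(y^{(i)} - \theta x^{(i)}\right)x^{(i)}$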
5) Run optimization code
(x(1), y(1)), (x(2), y(2)), . . . (x(n), y(n))N training datapoints:
Gradient Ascent
Initialize: θj = 0 for all 0 ≤ j ≤ m
Repeat many times:
  gradient[j] = 0 for all 0 ≤ j ≤ m
  Calculate all gradient[j]'s based on data and current setting of theta
  θj += η * gradient[j] for all 0 ≤ j ≤ m
Gradient Ascent
Initialize: θ = 0
Repeat many times:
  gradient = 0
  Calculate gradient based on data
  θ += η * gradient
Linear Regression (simple)
Initialize: θ = 0
Repeat many times:
  gradient = 0
  For each training example (x, y):
    Update gradient for current training example
  θ += η * gradient
Linear Regression (simple)
Initialize: θ = 0
Repeat many times:
  gradient = 0
  For each training example (x, y):
    gradient += 2(y − θx)(x)
  θ += η * gradient
Linear Regression (simple)
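A runnable version of the pseudocode above, assuming NumPy; the data, the step size η = 0.01, and the iteration count are made up for illustration (the data roughly follow y ≈ 3x, so the fitted θ should come out near 3):

```python
import numpy as np

# Made-up training data that roughly follows y ≈ 3x
xs = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([0.1, 2.9, 6.2, 8.8, 12.1])

theta = 0.0                                   # Initialize: theta = 0
eta = 0.01                                    # step size (illustrative)

for _ in range(1000):                         # Repeat many times
    gradient = 0.0                            # gradient = 0
    for x, y in zip(xs, ys):                  # For each training example (x, y):
        gradient += 2 * (y - theta * x) * x   #   gradient += 2(y - theta*x)(x)
    theta += eta * gradient                   # theta += eta * gradient

print(theta)  # approximately 3.0
```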
Linear Regression
Predicting CO2
X1 = Temperature
X2 = Elevation
X3 = CO2 level yesterday
X4 = GDP of region
X5 = Acres of forest growth
Y = CO2 levels
Training: Gradient ascent to choose the best thetas to describe your data
$\theta_{MLE} = \operatorname*{argmax}_{\theta} \; -\sum_{i=1}^{n}\left(Y^{(i)} - \theta^{T}x^{(i)}\right)^2$
Problem: Predict real value Y based on observing variable X
Linear Regression
Model: Linearly weight every feature
$Y = \theta_1 X_1 + \cdots + \theta_m X_m + \theta_{m+1} = \theta^{T} X$
Initialize: θj = 0 for all 0 ≤ j ≤ m
Repeat many times:
  gradient[j] = 0 for all 0 ≤ j ≤ m
  For each training example (x, y):
    For each parameter j:
      gradient[j] += 2(y − θTx)(x[j])
  θj += η * gradient[j] for all 0 ≤ j ≤ m
Linear Regression
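A vectorized sketch of the multi-parameter version, assuming NumPy and made-up data; a constant 1 is appended as the last feature so that θm+1 plays the role of the intercept in Y = θ1X1 + ... + θmXm + θm+1:

```python
import numpy as np

# Made-up data: two features per example plus a trailing 1 for the intercept
X = np.array([[1.0, 2.0, 1.0],
              [2.0, 0.5, 1.0],
              [3.0, 1.5, 1.0],
              [4.0, 3.0, 1.0]])
y = np.array([5.0, 4.5, 8.0, 12.0])

theta = np.zeros(X.shape[1])                        # Initialize: theta_j = 0 for all j
eta = 0.01                                          # step size (illustrative)

for _ in range(5000):                               # Repeat many times
    gradient = np.zeros_like(theta)                 # gradient[j] = 0 for all j
    for x_i, y_i in zip(X, y):                      # For each training example (x, y):
        gradient += 2 * (y_i - theta @ x_i) * x_i   #   gradient[j] += 2(y - theta^T x)(x[j])
    theta += eta * gradient                         # theta_j += eta * gradient[j]

print(theta)  # approaches the least-squares fit for this made-up data
```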
Predicting Warriors
Y = Warriors points
X1 = Opposing team ELO
X2 = Points in last game
X3 = Curry playing?
X4 = Playing at home?
θ1 = -2.3
θ2 = +1.2
θ3 = +10.2
θ4 = +3.3
θ5 = +95.4
$Y = \theta_1 X_1 + \cdots + \theta_m X_m + \theta_{m+1} = \theta^{T} X$
§ Training data: set of N pre-classified data instances
o N training pairs: (x(1), y(1)), (x(2), y(2)), …, (x(n), y(n))
• Use superscripts to denote the i-th training instance
§ Learning algorithm: method for determining g(X)
o Given a new input observation of x = x1, x2, …, xm
o Use g(x) to compute a corresponding output (prediction)
[Diagram: Training data → Learning algorithm → g(X) (Classifier); new input X → g(X) → Output (Class)]
The Machine Learning Process