Point Estimation / Linear Regression
Machine Learning – 10701/15781
Carlos Guestrin
Carnegie Mellon University
January 12th, 2005
Announcements
Recitations – New Day and Room
Doherty Hall 1212, Thursdays 5–6:30pm
Starting January 20th
Use the mailing list: [email protected]
Your first consulting job
A billionaire from the suburbs of Seattle asks you a question:
He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
You say: Please flip it a few times:
You say: The probability is:
He says: Why???
You say: Because…
Thumbtack – Binomial Distribution
P(Heads) = θ, P(Tails) = 1-θ
Flips are i.i.d.:
Independent events
Identically distributed according to the Binomial distribution
Sequence D of αH Heads and αT Tails
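As a concrete sketch (not from the slides; the function name is illustrative), the probability of one such i.i.d. sequence can be computed directly:

```python
def sequence_likelihood(theta, alpha_h, alpha_t):
    """P(D | theta): each head contributes theta, each tail (1 - theta)."""
    return theta ** alpha_h * (1 - theta) ** alpha_t

# A sequence with 3 heads and 2 tails under theta = 0.6
p = sequence_likelihood(0.6, 3, 2)
```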
Maximum Likelihood Estimation
Data: observed set D of αH Heads and αT Tails
Hypothesis: Binomial distribution
Learning θ is an optimization problem
What's the objective function?
MLE: Choose θ that maximizes the probability of the observed data:
θ̂ = argmax_θ P(D | θ) = argmax_θ θ^αH (1-θ)^αT
Your first learning algorithm
Take the log-likelihood and set its derivative to zero:
d/dθ [αH ln θ + αT ln(1-θ)] = 0  ⟹  θ̂ = αH / (αH + αT)
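In code, the resulting estimator is just the empirical frequency of heads; a minimal sketch (the function name is illustrative):

```python
def mle_theta(alpha_h, alpha_t):
    # Solving d/dtheta log P(D|theta) = 0 gives the empirical frequency.
    return alpha_h / (alpha_h + alpha_t)

mle_theta(3, 2)  # 0.6 -- and the same answer for 30 heads and 20 tails
```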
How many flips do I need?
Billionaire says: I flipped 3 heads and 2 tails.
You say: θ = 3/5, I can prove it!
He says: What if I flipped 30 heads and 20 tails?
You say: Same answer, I can prove it!
He says: What's better?
You say: Hmm… The more the merrier???
He says: Is this why I am paying you the big bucks???
Simple bound (based on Hoeffding’s inequality)
For N = αH + αT flips, let θ* be the true parameter. For any ε > 0:
P(|θ̂ - θ*| ≥ ε) ≤ 2e^(-2Nε²)
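The bound above can be evaluated numerically; a sketch (the function name is illustrative):

```python
import math

def hoeffding_bound(n, eps):
    """Upper bound on P(|theta_hat - theta*| >= eps) after n i.i.d. flips."""
    return 2 * math.exp(-2 * n * eps ** 2)

# The bound shrinks exponentially as the number of flips grows
hoeffding_bound(50, 0.1), hoeffding_bound(500, 0.1)
```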
PAC Learning
PAC: Probably Approximately Correct
Billionaire says: I want to know the thumbtack parameter θ, within ε = 0.1, with probability at least 1-δ = 0.95. How many flips?
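Inverting the Hoeffding bound for the smallest sufficient sample size answers the billionaire's question; a sketch (the function name is illustrative):

```python
import math

def flips_needed(eps, delta):
    # Solve 2 * exp(-2 N eps^2) <= delta for the smallest integer N.
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

flips_needed(0.1, 0.05)  # 185 flips suffice for eps = 0.1, delta = 0.05
```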
What about a prior?
Billionaire says: Wait, I know that the thumbtack is "close" to 50-50. What can you do?
You say: I can learn it the Bayesian way…
Rather than estimating a single θ, we obtain a distribution over possible values of θ.
Bayesian Learning
Use Bayes rule:
P(θ | D) = P(D | θ) P(θ) / P(D)
Or equivalently:
P(θ | D) ∝ P(D | θ) P(θ)
Bayesian Learning for Thumbtack
Likelihood function is simply Binomial:
P(D | θ) = θ^αH (1-θ)^αT
What about the prior?
Represents expert knowledge
Gives a simple posterior form
Conjugate priors:
Closed-form representation of the posterior
For the Binomial, the conjugate prior is the Beta distribution
Beta prior distribution – P(θ)
P(θ) = θ^(βH-1) (1-θ)^(βT-1) / B(βH, βT)  ~  Beta(βH, βT)
Likelihood function: P(D | θ) = θ^αH (1-θ)^αT
Posterior: P(θ | D) ∝ θ^(αH+βH-1) (1-θ)^(αT+βT-1)
Posterior distribution
Prior: Beta(βH, βT)
Data: αH heads and αT tails
Posterior distribution:
P(θ | D) ~ Beta(βH + αH, βT + αT)
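With a conjugate prior the update amounts to adding counts; a sketch assuming Beta(βH, βT) prior parameters (function names are illustrative):

```python
def beta_posterior(beta_h, beta_t, alpha_h, alpha_t):
    """Beta prior plus binomial counts -> Beta posterior parameters."""
    return beta_h + alpha_h, beta_t + alpha_t

def posterior_mean(beta_h, beta_t):
    """E[theta | D] for a Beta(beta_h, beta_t) posterior."""
    return beta_h / (beta_h + beta_t)

# Beta(2, 2) prior, then 3 heads and 2 tails -> Beta(5, 4)
post = beta_posterior(2, 2, 3, 2)
```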
Using Bayesian posterior
Posterior distribution:
P(θ | D) ~ Beta(βH + αH, βT + αT)
Bayesian inference: no longer a single parameter. To predict the next flip, integrate over θ:
P(x = H | D) = ∫ P(x = H | θ) P(θ | D) dθ
The integral is often hard to compute.
MAP: Maximum a posteriori approximation
As more data is observed, the Beta posterior becomes more certain (more peaked)
MAP: use the most likely parameter:
θ̂ = argmax_θ P(θ | D)
MAP for Beta distribution
MAP: use the most likely parameter:
θ̂ = argmax_θ P(θ | D) = (αH + βH - 1) / (αH + βH + αT + βT - 2)
Beta prior is equivalent to extra thumbtack flips
As N → ∞, the prior is "forgotten"
But for small sample sizes, the prior is important!
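The MAP estimate is the mode of the Beta posterior; a sketch (the function name is illustrative):

```python
def map_theta(alpha_h, alpha_t, beta_h, beta_t):
    # Mode of the Beta(alpha_h + beta_h, alpha_t + beta_t) posterior;
    # well-defined when both posterior parameters exceed 1.
    return (alpha_h + beta_h - 1) / (alpha_h + beta_h + alpha_t + beta_t - 2)

map_theta(3, 2, 1, 1)    # uniform Beta(1,1) prior: MAP equals the MLE, 3/5
map_theta(3, 2, 50, 50)  # strong 50-50 prior keeps the estimate near 0.5
```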
What about continuous variables?
Billionaire says: If I am measuring a continuous variable, what can you do for me?
You say: Let me tell you about Gaussians…
MLE for Gaussian
Prob. of i.i.d. samples x1,…,xN:
P(D | μ, σ) = ∏_i 1/(σ√(2π)) e^(-(x_i - μ)²/(2σ²))
Log-likelihood of data:
ln P(D | μ, σ) = -N ln(σ√(2π)) - Σ_i (x_i - μ)²/(2σ²)
Your second learning algorithm: MLE for the mean of a Gaussian
What's the MLE for the mean?
μ̂ = (1/N) Σ_i x_i
MLE for variance
Again, set the derivative to zero:
σ̂² = (1/N) Σ_i (x_i - μ̂)²
Learning Gaussian parameters
MLE:
μ̂ = (1/N) Σ_i x_i
σ̂² = (1/N) Σ_i (x_i - μ̂)²
Bayesian learning is also possible
Conjugate priors:
Mean: Gaussian prior
Variance: Wishart distribution
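The two MLE formulas can be computed in a few lines; a sketch (the function name is illustrative):

```python
def gaussian_mle(xs):
    """MLE mean and variance for i.i.d. Gaussian samples."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n  # note 1/N, not 1/(N-1)
    return mu, var
```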
Prediction of continuous variables
Billionaire says: Wait, that's not what I meant!
You say: Chill out, dude.
He says: I want to predict a continuous variable from continuous inputs: I want to predict salaries from GPA.
You say: I can regress that…
The regression problem
Instances: <xj, tj>
Learn: mapping from x to t(x)
Hypothesis space:
Given basis functions h1,…,hk
Find coefficients w = {w1,…,wk}
t(x) ≈ Σ_i wi hi(x)
Precisely, minimize the residual error:
Σ_j (tj - Σ_i wi hi(xj))²
Solve with simple matrix operations:
Set derivative to zero
Go to recitation Thursday 1/20
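The "simple matrix operations" can be sketched in plain Python without a linear-algebra library; `least_squares` and the basis-function setup are illustrative names, not from the lecture:

```python
def least_squares(xs, ts, basis):
    """Fit w minimizing sum_j (t_j - sum_i w_i h_i(x_j))^2 by solving the
    normal equations (Phi^T Phi) w = Phi^T t with Gaussian elimination."""
    k = len(basis)
    n = len(xs)
    phi = [[h(x) for h in basis] for x in xs]        # design matrix Phi
    A = [[sum(phi[j][r] * phi[j][c] for j in range(n)) for c in range(k)]
         for r in range(k)]                          # Phi^T Phi
    b = [sum(phi[j][r] * ts[j] for j in range(n)) for r in range(k)]  # Phi^T t
    for col in range(k):                             # forward elimination
        piv = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, k):
            f = A[r][col] / A[col][col]
            for c in range(col, k):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    w = [0.0] * k                                    # back substitution
    for r in range(k - 1, -1, -1):
        w[r] = (b[r] - sum(A[r][c] * w[c] for c in range(r + 1, k))) / A[r][r]
    return w

# Fit a line t = w0 + w1*x to noiseless data generated by t = 1 + 2x
w = least_squares([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0],
                  [lambda x: 1.0, lambda x: x])
```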
Billionaire (again) says: Why sum squared error???
You say: Gaussians, Dr. Gateson, Gaussians…
Model: the prediction is corrupted by Gaussian noise:
t(x) = Σ_i wi hi(x) + ε,  ε ~ N(0, σ²)
Learn w using MLE
But, why?
Maximizing log-likelihood
Maximize the conditional log-likelihood of the targets:
ln P(D | w, σ) = Σ_j ln P(tj | xj, w, σ)
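Filling in the step (a standard derivation, using the Gaussian noise model above): substituting the Gaussian density and taking logs gives

```latex
\ln P(\mathcal{D} \mid \mathbf{w}, \sigma)
  = \sum_{j=1}^{N} \ln\!\left[ \frac{1}{\sigma\sqrt{2\pi}}
    \exp\!\left( -\frac{\bigl(t_j - \sum_i w_i h_i(x_j)\bigr)^2}{2\sigma^2} \right) \right]
  = -N \ln\!\bigl(\sigma\sqrt{2\pi}\bigr)
    - \frac{1}{2\sigma^2} \sum_{j=1}^{N} \Bigl(t_j - \sum_i w_i h_i(x_j)\Bigr)^2
```

Since the first term does not depend on w, maximizing the log-likelihood over w is exactly minimizing the sum of squared errors.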
Bias-Variance Tradeoff
Choice of hypothesis class introduces learning bias
More complex class → less bias
More complex class → more variance
What you need to know
Go to recitation for regression
And the other recitations too
Point estimation:
MLE
Bayesian learning
MAP
Gaussian estimation
Regression:
Basis functions = features
Optimizing sum squared error
Relationship between regression and Gaussians
Bias-Variance trade-off