Machine Learning
CMPT 726
Simon Fraser University
CHAPTER 1: INTRODUCTION
Outline
• Comments on general approach.
• Probability Theory.
  • Joint, conditional and marginal probabilities.
  • Random variables.
  • Functions of random variables.
• Bernoulli Distribution (Coin Tosses).
  • Maximum likelihood estimation.
  • Bayesian learning with conjugate prior.
• The Gaussian Distribution.
  • Maximum likelihood estimation.
  • Bayesian learning with conjugate prior.
• More Probability Theory.
  • Entropy.
  • KL divergence.
Our Approach
• The course generally follows statistics, though the field is very interdisciplinary.
• Emphasis on predictive models: guess the value(s) of the target variable(s), i.e. "pattern recognition".
• Generally a Bayesian approach, as in the text.
• Compared to standard Bayesian statistics:
  • more complex models (neural nets, Bayes nets)
  • more discrete variables
  • more emphasis on algorithms and efficiency
Things Not Covered
• Within statistics:
  • Hypothesis testing.
  • Frequentist theory, learning theory.
• Other types of data (not random samples):
  • Relational data.
  • Scientific data (automated scientific discovery).
• Action + learning = reinforcement learning.

Could be optional – what do you think?
Probability Theory
Apples and Oranges (the running fruit-box example of Bishop §1.2: two boxes, red and blue, each containing apples and oranges)
Probability Theory
Marginal Probability
Conditional Probability
Joint Probability
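The slide's defining equations were images; assuming the standard setup of Bishop §1.2, with N total trials, n_{ij} the count of trials having X = x_i and Y = y_j, and c_i = Σ_j n_{ij}, the three quantities are:

$$ p(X = x_i, Y = y_j) = \frac{n_{ij}}{N}, \qquad p(X = x_i) = \frac{c_i}{N}, \qquad p(Y = y_j \mid X = x_i) = \frac{n_{ij}}{c_i} $$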
Probability Theory
Sum Rule
Product Rule
The Rules of Probability
Sum Rule
Product Rule
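The two rules named on these slides are, in Bishop's notation (eqs. 1.10-1.11):

$$ \text{sum rule:} \quad p(X) = \sum_{Y} p(X, Y) $$

$$ \text{product rule:} \quad p(X, Y) = p(Y \mid X)\, p(X) $$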
Bayes’ Theorem
posterior ∝ likelihood × prior
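Written in full, with the denominator obtained from the sum and product rules:

$$ p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y) $$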
Bayes’ Theorem: Model Version
• Let M be the model, E be the evidence.
• P(M|E) ∝ P(M) × P(E|M)

Intuition:
• prior = how plausible the event (model, theory) is a priori, before seeing any evidence.
• likelihood = how well the model explains the data.
Probability Densities
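The density formulas shown here are the standard ones (Bishop §1.2.1):

$$ p\big(x \in (a, b)\big) = \int_a^b p(x)\, dx, \qquad p(x) \ge 0, \qquad \int_{-\infty}^{\infty} p(x)\, dx = 1 $$

with cumulative distribution function $ P(z) = \int_{-\infty}^{z} p(x)\, dx $.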
Transformed Densities
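The change-of-variables rule this slide illustrates: under a nonlinear change of variable x = g(y), a density transforms with a Jacobian factor,

$$ p_y(y) = p_x\big(g(y)\big) \left| \frac{dg}{dy} \right| $$

so, unlike an ordinary function, the location of a density's maximum depends on the choice of variable.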
Expectations
Conditional Expectation (discrete)
Approximate Expectation (discrete and continuous)
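The standard definitions behind these headings (Bishop eqs. 1.33-1.37):

$$ \mathbb{E}[f] = \sum_x p(x)\, f(x) \quad \text{(discrete)}, \qquad \mathbb{E}[f] = \int p(x)\, f(x)\, dx \quad \text{(continuous)} $$

$$ \mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x), \qquad \mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n), \quad x_n \sim p(x) $$

A minimal sketch of the approximate (Monte Carlo) expectation, assuming we can sample from p(x); the choice of p as a standard normal and f(x) = x² is mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def approx_expectation(f, sampler, n=100_000):
    """Monte Carlo estimate: E[f] ~ (1/N) * sum_n f(x_n), with x_n ~ p."""
    samples = sampler(n)
    return f(samples).mean()

# Example: E[x^2] under a standard normal equals its variance, 1.0.
est = approx_expectation(lambda x: x ** 2, lambda n: rng.standard_normal(n))
print(est)  # close to 1.0
```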
Variances and Covariances
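The definitions on this slide (Bishop eqs. 1.38-1.41):

$$ \mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2 $$

$$ \mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[xy] - \mathbb{E}[x]\,\mathbb{E}[y] $$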
The Gaussian Distribution
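The density shown on this slide is the univariate normal:

$$ \mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\} $$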
Gaussian Mean and Variance
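The moments listed on this slide are the standard ones:

$$ \mathbb{E}[x] = \mu, \qquad \mathbb{E}[x^2] = \mu^2 + \sigma^2, \qquad \mathrm{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \sigma^2 $$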
The Multivariate Gaussian
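For a D-dimensional vector x with mean vector μ and covariance matrix Σ:

$$ \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2}\, |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left\{ -\tfrac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathsf{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\} $$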
Reading exponential probability formulas

• In an infinite space we cannot give every point equal probability: the sum Σx p(x) would grow to infinity.
• Instead, use exponentially decaying probabilities, e.g. p(n) = (1/2)^n.
• Suppose there is a relevant feature f(x) and we want to express that "the greater f(x) is, the less probable x is".
• Use p(x) ∝ exp(−f(x)).
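Written out with its normalizing constant (the standard Gibbs form; the slide leaves Z implicit):

$$ p(x) = \frac{1}{Z} \exp\big(-f(x)\big), \qquad Z = \sum_{x'} \exp\big(-f(x')\big) $$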
Example: exponential form and sample size

• Fair coin: the larger the sample size n, the less likely any particular sequence of n tosses is.
• p(n) = 2^(−n)

(Figure: ln[p(n)] plotted against sample size n; a straight line with slope −ln 2.)
Exponential Form: Gaussian mean
• The further x is from the mean, the less likely it is.
(Figure: ln[p(x)] plotted against (x − μ)²; a straight line with slope −1/(2σ²).)
Smaller variance decreases probability
• The smaller the variance σ², the less likely x is (away from the mean).
(Figure: ln[p(x)] plotted against σ⁻²; for fixed distance from the mean, approximately linear with slope −(x − μ)²/2.)
Minimal energy = max probability
• The greater the energy (of the joint state), the less probable the state is.
(Figure: ln[p(x)] plotted against the energy E(x); a straight line with slope −1.)
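Made explicit, this is the Boltzmann form (my notation; Z is the normalizing constant):

$$ p(\mathbf{x}) = \frac{1}{Z} \exp\big(-E(\mathbf{x})\big), \qquad Z = \sum_{\mathbf{x}} \exp\big(-E(\mathbf{x})\big) $$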
Gaussian Parameter Estimation
Likelihood function
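For N i.i.d. observations x = (x₁, …, x_N), the likelihood function plotted on this slide is (Bishop eq. 1.53):

$$ p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2) $$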
Maximum (Log) Likelihood
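The log-likelihood and its maximizers (Bishop eqs. 1.54-1.56):

$$ \ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi) $$

$$ \mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2 $$

A quick numerical sketch of these estimators, with synthetic data drawn from N(2, 9) (my choice of parameters, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=10_000)  # samples from N(2, 9)

mu_ml = x.mean()                        # (1/N) sum_n x_n
sigma2_ml = ((x - mu_ml) ** 2).mean()   # (1/N) sum_n (x_n - mu_ml)^2

print(mu_ml, sigma2_ml)  # close to 2.0 and 9.0
```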
Properties of μ_ML and σ²_ML
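The properties this slide states (Bishop eqs. 1.57-1.59): the ML mean is unbiased, but the ML variance is biased low,

$$ \mathbb{E}[\mu_{\mathrm{ML}}] = \mu, \qquad \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\, \sigma^2 $$

so the unbiased estimate rescales by N/(N − 1):

$$ \tilde{\sigma}^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2 $$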
Curve Fitting Re-visited
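The probabilistic model on this slide: the target t is the curve prediction y(x, w) plus Gaussian noise with precision β (Bishop eq. 1.60),

$$ p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}),\ \beta^{-1}\big) $$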
Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares error E(w).
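Maximizing the log-likelihood over w is equivalent to minimizing

$$ E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 $$

and the noise precision follows as $ \frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \big)^2 $ (Bishop eqs. 1.62-1.63).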
Predictive Distribution
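Plugging in the ML estimates gives the predictive distribution shown here (Bishop eq. 1.64):

$$ p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{\mathrm{ML}}),\ \beta_{\mathrm{ML}}^{-1}\big) $$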
Frequentism vs. Bayesianism
• Frequentists: probabilities are measured as the frequencies of repeatable events, e.g. coin flips, snowfalls in January.
• Bayesians: in addition, allow probabilities to be attached to parameter values, e.g. P(μ = 0).
• Frequentist model selection: give performance guarantees (e.g., the method is right 95% of the time).
• Bayesian model selection: choose a prior distribution over the parameters, then maximize the resulting cost function (the posterior).
MAP: A Step towards Bayes
Determine w_MAP by minimizing the regularized sum-of-squares error.
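With a Gaussian prior p(w | α) = N(w | 0, α⁻¹I), maximizing the posterior is equivalent to minimizing (Bishop eq. 1.67)

$$ \frac{\beta}{2} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 + \frac{\alpha}{2}\, \mathbf{w}^{\mathsf{T}} \mathbf{w} $$

i.e. the sum-of-squares error with a quadratic regularizer, λ = α/β.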
Bayesian Curve Fitting
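A fully Bayesian treatment integrates over w rather than picking a point estimate (Bishop eq. 1.68):

$$ p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w} $$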
Bayesian Predictive Distribution
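For the polynomial model this integral is tractable, and the predictive distribution is Gaussian with an input-dependent variance (Bishop eqs. 1.69-1.72):

$$ p(t \mid x, \mathbf{x}, \mathbf{t}) = \mathcal{N}\big(t \mid m(x),\ s^2(x)\big), \qquad s^2(x) = \beta^{-1} + \boldsymbol{\phi}(x)^{\mathsf{T}} \mathbf{S}\, \boldsymbol{\phi}(x) $$

where the first variance term is observation noise and the second reflects posterior uncertainty in w (m(x) and S are given in Bishop eqs. 1.70-1.72).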
Model Selection
Cross-Validation
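A minimal S-fold cross-validation sketch in plain NumPy (all names and the least-squares polynomial model are my own, purely for illustration):

```python
import numpy as np

def cross_val_error(x, t, degree, n_folds=5, seed=0):
    """Average held-out RMS error of a degree-`degree` polynomial fit."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        coeffs = np.polyfit(x[train], t[train], degree)   # fit on S-1 folds
        pred = np.polyval(coeffs, x[test])                # evaluate on held-out fold
        errors.append(np.sqrt(np.mean((pred - t[test]) ** 2)))
    return float(np.mean(errors))

# Example: pick the polynomial degree with the lowest cross-validated error.
x = np.linspace(0, 1, 40)
t = np.sin(2 * np.pi * x) + 0.2 * np.random.default_rng(1).standard_normal(40)
best = min(range(1, 6), key=lambda d: cross_val_error(x, t, d))
print(best)  # degree with the best held-out error
```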
Curse of Dimensionality
Rule of Thumb: 10 datapoints per parameter.
Curse of Dimensionality
Polynomial curve fitting, M = 3
Gaussian Densities in higher dimensions
Decision Theory
Inference step: determine either p(x, t) or p(t | x).

Decision step: for a given x, determine the optimal t.
Minimum Misclassification Rate
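For two classes, the decision regions R₁, R₂ should minimize the probability of a mistake (Bishop eq. 1.78):

$$ p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x} $$

which is achieved by assigning each x to the class with the larger posterior p(C_k | x).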
Minimum Expected Loss
Example: classify medical images as ‘cancer’ or ‘normal’
(Loss matrix: rows indexed by the true class, columns by the decision.)
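The matrix entries did not survive extraction; Bishop's textbook version of this example (PRML §1.5.2) penalizes a missed cancer far more heavily than a false alarm:

$$ L = \begin{pmatrix} 0 & 1000 \\ 1 & 0 \end{pmatrix} \qquad \text{rows: truth (cancer, normal); columns: decision (cancer, normal)} $$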
Minimum Expected Loss
Regions R_j are chosen to minimize the expected loss, shown below.
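The lost formula is the standard one (Bishop eq. 1.80):

$$ \mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x} $$

Equivalently, each x is assigned to the class j that minimizes $ \sum_k L_{kj}\, p(\mathcal{C}_k \mid \mathbf{x}) $.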
Why Separate Inference and Decision?
• Minimizing risk (the loss matrix may change over time).
• Unbalanced class priors.
• Combining models.
Decision Theory for Regression
Inference step: determine p(t | x).

Decision step: for a given x, make an optimal prediction y(x) for t.

Loss function: L(t, y(x)), with the expected loss given below.
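The expected loss to be minimized (Bishop eq. 1.86):

$$ \mathbb{E}[L] = \iint L\big(t, y(\mathbf{x})\big)\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt $$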
The Squared Loss Function
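For the squared loss $ L = \big(y(\mathbf{x}) - t\big)^2 $, the expected loss

$$ \mathbb{E}[L] = \iint \big(y(\mathbf{x}) - t\big)^2\, p(\mathbf{x}, t)\, d\mathbf{x}\, dt $$

is minimized by the conditional mean (the regression function), $ y(\mathbf{x}) = \mathbb{E}_t[t \mid \mathbf{x}] $ (Bishop eqs. 1.87-1.89).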
Generative vs Discriminative
Generative approach: model the class-conditional densities p(x | C_k) and priors p(C_k) (or the joint p(x, C_k)), then use Bayes' theorem to obtain p(C_k | x).

Discriminative approach: model p(C_k | x) directly.
Entropy
Important quantity in:
• coding theory
• statistical physics
• machine learning
Entropy
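The definition on this slide is the standard one:

$$ H[x] = -\sum_{x} p(x) \log_2 p(x) $$

(With log₂ the entropy is measured in bits; with the natural log, in nats.)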
Entropy
Coding theory: x discrete with 8 possible states; how many bits to transmit the state of x?
All states equally likely
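With all 8 states equally likely, p(x) = 1/8 for each, and

$$ H[x] = -8 \times \tfrac{1}{8} \log_2 \tfrac{1}{8} = 3 \ \text{bits} $$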
Entropy
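The worked example belonging to this slide is presumably Bishop's nonuniform 8-state distribution with probabilities (1/2, 1/4, 1/8, 1/16, 1/64, 1/64, 1/64, 1/64), for which

$$ H[x] = -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{4}\log_2\tfrac{1}{4} - \tfrac{1}{8}\log_2\tfrac{1}{8} - \tfrac{1}{16}\log_2\tfrac{1}{16} - \tfrac{4}{64}\log_2\tfrac{1}{64} = 2 \ \text{bits} $$

i.e. a nonuniform distribution has lower entropy than the uniform one (2 bits vs. 3 bits), matching the shortest average code length.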
The Maximum Entropy Principle
• Commonly used principle for model selection: maximize entropy.
• Example: in how many ways can N identical objects be allocated to M bins?

Entropy is maximized when the objects are spread evenly over the bins, i.e. p_i = 1/M for all i (see below).
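Counting the allocations and applying Stirling's approximation (Bishop eqs. 1.94-1.97):

$$ W = \frac{N!}{\prod_i n_i!}, \qquad H = \frac{1}{N} \ln W \;\longrightarrow\; -\sum_i p_i \ln p_i \quad \text{as } N \to \infty, \ p_i = \frac{n_i}{N} \ \text{fixed} $$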
Differential Entropy and the Gaussian
Put bins of width Δ along the real line.
Differential entropy is maximized (for fixed variance σ²) when p(x) is Gaussian; the maximizing value is given below.
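The formulas behind this slide (Bishop eqs. 1.103-1.110): letting the bin width Δ → 0 leaves the differential entropy

$$ H[x] = -\int p(x) \ln p(x)\, dx $$

which, for fixed variance σ², is maximized by the Gaussian N(μ, σ²), giving

$$ H[x] = \tfrac{1}{2}\big(1 + \ln(2\pi\sigma^2)\big) $$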
Conditional Entropy
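The definitions shown on this slide (Bishop eqs. 1.111-1.112):

$$ H[\mathbf{y} \mid \mathbf{x}] = -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y}\, d\mathbf{x}, \qquad H[\mathbf{x}, \mathbf{y}] = H[\mathbf{y} \mid \mathbf{x}] + H[\mathbf{x}] $$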
The Kullback-Leibler Divergence
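The definition and key property (Bishop eq. 1.113): for distributions p and q,

$$ \mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} $$

with KL(p‖q) ≥ 0 and equality if and only if p = q; note that it is not symmetric in p and q.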
Mutual Information
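The slide's content is the standard definition: mutual information is the KL divergence between the joint distribution and the product of marginals (Bishop eqs. 1.120-1.121),

$$ I[\mathbf{x}, \mathbf{y}] = \mathrm{KL}\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y}) \big) = H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}] $$

with I ≥ 0 and equality if and only if x and y are independent.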