Machine Learning
5. Parametric Methods
Based on Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (V1.1)
Parametric Methods
Need probabilities to make decisions (prior, evidence, likelihood)
Probability is a function of the input (observables)
Represent the function by selecting its general form (the model) with several unknown parameters
Find (estimate) the parameters from data so that they optimize a certain criterion (e.g., minimize the generalization error)
Parametric Estimation
Assume the sample comes from a distribution known up to its parameters
Sufficient statistics: parameters that completely define the distribution (e.g., mean and variance)
X = {x^t}_t where x^t ~ p(x)
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²), where θ = {μ, σ²}
Estimation
Assume a form for the distribution
Estimate its sufficient statistics from a sample
Use the distribution in classification or regression
Maximum Likelihood Estimation
X consists of independent and identically distributed (iid) samples
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X)
Why use the log likelihood?
Log is an increasing function: increasing the input increases the output, so maximizing the log of a function is equivalent to maximizing the function itself
Log converts products into sums: log(abc) = log(a) + log(b) + log(c)
This makes analysis and computation simpler
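To make this concrete, here is a minimal NumPy sketch (synthetic data and seed are arbitrary; the Gaussian model anticipates the slide a few sections below) showing that the closed-form MLE gives the highest log likelihood:

```python
import numpy as np

def log_likelihood(x, mu, sigma2):
    """L(mu, sigma2 | X) = sum_t log p(x^t | mu, sigma2) for a Gaussian model."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # sample with true mu = 3, sigma = 2

# For a Gaussian, the MLE is the sample mean and the (divisor-N) sample variance.
mu_hat, sigma2_hat = x.mean(), x.var()

print(log_likelihood(x, mu_hat, sigma2_hat))    # maximal
print(log_likelihood(x, 0.0, 1.0))              # any other value scores lower
```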
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)
Solving dL(p₀|X)/dp₀ = 0 gives the MLE p₀ = ∑_t x^t / N
Examples: Multinomial
Multinomial: generalization of Bernoulli to K mutually exclusive states, state i occurring with probability pᵢ
P(x₁, x₂, ..., x_K) = ∏_i pᵢ^(xᵢ)
L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i pᵢ^(xᵢ^t)
MLE: pᵢ = ∑_t xᵢ^t / N
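A short NumPy sketch of both estimators; the sample sizes and true probabilities are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: the MLE p0 is simply the fraction of successes.
x = rng.binomial(n=1, p=0.7, size=500)       # 0/1 sample with true p0 = 0.7
p0_hat = x.sum() / len(x)                    # sum_t x^t / N

# Multinomial with K = 3 states: p_i is the fraction of outcomes in state i.
states = rng.choice(3, size=500, p=[0.2, 0.3, 0.5])
counts = np.bincount(states, minlength=3)
p_hat = counts / counts.sum()                # sum_t x_i^t / N for each i

print(p0_hat, p_hat)
```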
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):
p(x) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
MLE estimates of the sufficient statistics:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
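A quick numerical check of these two formulas on a synthetic sample (seed and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=1.5, size=2000)   # true mu = 5, sigma^2 = 2.25

m = x.sum() / len(x)                  # MLE of the mean
s2 = ((x - m) ** 2).sum() / len(x)    # MLE of the variance (divisor N, not N-1)
print(m, s2)                          # close to 5 and 2.25
```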
Bias and Variance of an Estimator
The actual value of an estimator depends on the data: different samples of N points from the true distribution result in different estimator values
The value of an estimator is therefore a random variable, so we can ask about its mean and variance
The difference between the true value of a parameter and the mean of the estimator is the bias of the estimator
Usually we look for a formula resulting in an unbiased estimator with small variance
Bias and Variance
Unknown parameter θ; estimator dᵢ = d(Xᵢ) on sample Xᵢ
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
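A small simulation checking this decomposition, using the divisor-N variance estimator s² from the Gaussian slide above as d (sample size and true variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
true_var, N, trials = 4.0, 10, 100_000

# Many samples of size N; compute the divisor-N variance estimator on each.
samples = rng.normal(loc=0.0, scale=2.0, size=(trials, N))
d = samples.var(axis=1)

bias = d.mean() - true_var            # approx -true_var / N = -0.4, so d is biased
variance = d.var()
mse = np.mean((d - true_var) ** 2)
print(bias ** 2 + variance, mse)      # the two quantities agree
```

The estimated bias comes out near −σ²/N, which is the point made in the Variance Estimator slide below.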
Mean Estimator
The sample mean m = ∑_t x^t / N is an unbiased estimator of μ: E[m] = μ
Its variance, Var(m) = σ²/N, shrinks as N grows
Variance Estimator
The MLE s² = ∑_t (x^t − m)² / N is biased: E[s²] = ((N − 1)/N) σ²
Multiplying by N/(N − 1) gives an unbiased estimator of σ²
Optimal Estimators
Estimators are random variables, as they depend on a random sample from the distribution
We look for estimators (functions of samples) that have minimum expected square error
The expected square error can be decomposed into bias (drift) and variance
Sometimes we accept larger errors but require unbiased estimators
Bayes’ Estimator
Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full density: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Bayes Estimator
Sometimes we have prior information on the range of values the parameter may take
We cannot give its exact value, but we can provide a probability for each value (a density function)
This is the probability of the parameter before looking at the data
Bayes Estimator
Assume this density has a peak around the true value of the parameter
The maximum a posteriori estimate will then minimize the error
Bayes’ Estimator: Example
x^t ~ N(θ, σ₀²) and prior θ ~ N(μ, σ²)
θ_ML = m
θ_MAP = θ_Bayes = E[θ|X] = [(N/σ₀²) / (N/σ₀² + 1/σ²)] m + [(1/σ²) / (N/σ₀² + 1/σ²)] μ
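A small sketch of this estimator; the prior and noise parameters are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)

sigma0_2 = 1.0                # variance of x^t given theta (assumed known)
mu, sigma_2 = 0.0, 0.25       # prior: theta ~ N(mu, sigma^2)
theta_true = 1.0
x = rng.normal(theta_true, np.sqrt(sigma0_2), size=20)

N, m = len(x), x.mean()
w = (N / sigma0_2) / (N / sigma0_2 + 1 / sigma_2)
theta_bayes = w * m + (1 - w) * mu    # posterior mean: m shrunk toward the prior mean
print(m, theta_bayes)
```

With small N the estimate stays close to the prior mean μ; as N grows, the weight w → 1 and θ_Bayes → m.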
Parametric Classification
Use Bayes' rule: the discriminant is based on the posterior p(Cᵢ|x) ∝ p(x|Cᵢ) P(Cᵢ)
Assumptions
Class likelihoods are Gaussian:
p(x|Cᵢ) = (1/(√(2π) σᵢ)) exp(−(x − μᵢ)² / (2σᵢ²))
Discriminant (log posterior, dropping the common p(x) term):
gᵢ(x) = log p(x|Cᵢ) + log P(Cᵢ) = −(1/2) log 2π − log σᵢ − (x − μᵢ)² / (2σᵢ²) + log P(Cᵢ)
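A sketch of this discriminant in NumPy; the class priors, means, and variances below are placeholders standing in for values estimated from a training set:

```python
import numpy as np

def g(x, prior, mean, var):
    """Log-based discriminant g_i(x) for a Gaussian class likelihood."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(var)
            - (x - mean) ** 2 / (2 * var) + np.log(prior))

# Two classes with (prior, mean, variance) estimated beforehand (illustrative values).
classes = [(0.6, 20.0, 9.0), (0.4, 35.0, 16.0)]

x_new = 26.0
scores = [g(x_new, *c) for c in classes]
print(int(np.argmax(scores)))   # pick the class with the largest discriminant
```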
Example
Input: customer income; output: which of K car models the customer will prefer
Assume that for each model there is a group of customers with a certain income
Income within a single class is normally distributed
Example: True distributions (figure)
Example: Estimated parameters (figure)
Example: Discriminant (figure)
Evaluate the discriminants on a new input and select the class with the largest one
If the priors and variances are equal, the rule reduces to choosing the class whose mean is nearest to x
Boundary between classes
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
26
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
27
Equal variances: a single boundary halfway between the means
Different variances: two boundaries
Two Approaches
Likelihood-based approach (until now): calculate probabilities using Bayes' rule, then compute the discriminant function
Discriminant-based approach (later): estimate the discriminant directly, bypassing probability estimation
Where is the boundary between the classes?
Regression
x is the independent variable, r is the dependent variable
f is unknown; we want to approximate it to predict future values
Parametric approach: assume a model with a small number of parameters, y = g(x|θ)
Find the best parameters from the data; we also have to make an assumption about the noise, e.g., r = f(x) + ε with ε ~ N(0, σ²)
Regression
We have training data (x, r)
Find the parameters that maximize the likelihood, i.e., the parameters that make the data most probable
Regression
Under the Gaussian noise assumption, the log likelihood splits into a constant term and a data-fit term; ignore the constant, as it does not depend on the parameters
Regression
Maximizing the likelihood is then equivalent to minimizing the remaining term, the sum of squared errors
Least Squares Estimate
Minimize the sum of squared errors: E(θ|X) = ∑_t [r^t − g(x^t|θ)]²
Linear Regression
Assume a linear model: g(x|w₁, w₀) = w₁ x + w₀
Minimizing the squared error and setting its derivatives to zero gives 2 linear equations in 2 unknowns, which are easy to solve:
∑_t r^t = N w₀ + w₁ ∑_t x^t
∑_t r^t x^t = w₀ ∑_t x^t + w₁ ∑_t (x^t)²
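A sketch solving these normal equations on synthetic data (true coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
r = 3.0 * x + 1.0 + rng.normal(0, 1.0, size=50)   # true w1 = 3, w0 = 1, plus noise

# The two normal equations as a 2x2 linear system A @ [w0, w1] = b.
N = len(x)
A = np.array([[N, x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, b)
print(w0, w1)   # close to 1 and 3
```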
Polynomial Regression
g(x|w_k, ..., w₂, w₁, w₀) = w_k x^k + ... + w₂ x² + w₁ x + w₀
In matrix form, with design matrix D whose row t is [1, x^t, (x^t)², ..., (x^t)^k] and target vector r = [r¹, r², ..., r^N]^T:
w = (D^T D)⁻¹ D^T r
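A sketch of the same computation on synthetic data; np.linalg.lstsq solves the least-squares problem without explicitly inverting D^T D, which is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=40)
r = 2 * x ** 2 - x + 0.5 + rng.normal(0, 0.1, size=40)   # quadratic target plus noise

k = 2
D = np.vander(x, k + 1, increasing=True)   # row t is [1, x^t, (x^t)^2]
w, *_ = np.linalg.lstsq(D, r, rcond=None)  # least-squares solution for w
print(w)   # close to [0.5, -1, 2]
```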
Bias and Variance
Expected square error at x:
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   =        noise         +   squared error
Averaging the squared error over samples X:
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                          =         bias²         +        variance
Bias and Variance
Given a single point (x, r), what is the expected error? Variations are due to the noise and to the training sample
The first term is due to noise: it does not depend on the estimator and cannot be removed
Variance
The second term is the deviation of the estimator from the regression function
It depends on the estimator and on the training set; we average over all possible training samples
Example: Polynomial Regression
As we increase the degree of the polynomial, the bias decreases, since we allow a better fit to the points
The variance increases, since a small change in the training sample may produce a large change in the model parameters (see the simulation below)
The bias/variance dilemma holds for any machine learning system
We need a way to find the optimal model complexity that balances bias and variance
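A small simulation of this effect (the true function, noise level, and degrees are arbitrary choices): the squared bias at a fixed point typically falls with the degree while the variance grows:

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(2 * x)         # "true" regression function (illustrative)
x = np.linspace(0, 3, 25)
x0, fx0 = 1.5, np.sin(3.0)          # point at which we measure bias and variance

for degree in (1, 3, 7):
    preds = []
    for _ in range(500):            # many training sets of the same size
        r = f(x) + rng.normal(0, 0.3, size=x.size)
        w = np.polyfit(x, r, degree)
        preds.append(np.polyval(w, x0))
    preds = np.array(preds)
    print(degree, (preds.mean() - fx0) ** 2, preds.var())   # bias^2, variance
```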
Bias/Variance Dilemma
Example: gᵢ(x) = 2 has no variance but high bias; gᵢ(x) = ∑_t rᵢ^t / N has lower bias but some variance
As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data)
Bias/Variance dilemma: (Geman et al., 1992)
Bias/Variance Dilemma (figure)
Polynomial Regression (figure): the best fit is the one with minimum error
Model Selection
How do we select the right model complexity? This is different from estimating the model parameters
There are several procedures
Cross-Validation
We cannot calculate bias and variance, since we do not know the true model, but we can estimate the total generalization error
Set aside a portion of the data (the validation set)
Increase the model complexity, find the parameters on the training portion, and calculate the error on the validation set
Stop when the error ceases to decrease or even starts to increase (a sketch follows)
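A minimal sketch of this procedure, using polynomial degree as the complexity knob (data, split, and degree range are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 3, size=60)
r = np.sin(2 * x) + rng.normal(0, 0.2, size=60)

# Hold out a validation set.
x_tr, r_tr, x_val, r_val = x[:40], r[:40], x[40:], r[40:]

errors = {}
for degree in range(1, 10):
    w = np.polyfit(x_tr, r_tr, degree)                    # fit on the training portion
    errors[degree] = np.mean((np.polyval(w, x_val) - r_val) ** 2)
print(min(errors, key=errors.get), errors)                # complexity near the "elbow"
```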
Cross-Validation (figure): the best fit is at the "elbow" of the validation error curve
Regularization
Introduce a penalty for model complexity into the error function: augmented error = error on data + λ · model complexity
Find the optimal model complexity (e.g., the degree of the polynomial) and the optimal parameters (coefficients) that minimize this function
λ is the penalty for model complexity: if λ is too large, only very simple models will be admitted
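One common instance of this idea is ridge regression, which takes the penalty to be λ times the squared size of the coefficient vector; a sketch with arbitrary data and λ:

```python
import numpy as np

def ridge_fit(x, r, degree, lam):
    """Minimize sum_t (r^t - D w)^2 + lam * ||w||^2 via regularized normal equations."""
    D = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(D.T @ D + lam * np.eye(degree + 1), D.T @ r)

rng = np.random.default_rng(9)
x = rng.uniform(0, 3, size=30)
r = np.sin(2 * x) + rng.normal(0, 0.2, size=30)

w_plain = ridge_fit(x, r, degree=7, lam=0.0)   # no penalty: large coefficients
w_ridge = ridge_fit(x, r, degree=7, lam=1.0)   # penalized: smaller, smoother fit
print(np.abs(w_plain).max(), np.abs(w_ridge).max())
```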