Machine Learning
5. Parametric Methods
Based on Lecture Notes for E. Alpaydın (2004), Introduction to Machine Learning, © The MIT Press (V1.1)
Parametric Methods
Need probabilities to make decisions (prior, evidence, likelihood)
Probability is a function of the input (observables)
Represent the function by selecting its general form (the model) with several unknown parameters
Find (estimate) the parameters from data so that they optimize a certain criterion (e.g., minimize the generalization error)
Parametric Estimation
Assume the sample comes from a distribution known up to its parameters
Sufficient statistics: parameters that completely define the distribution (e.g., mean and variance)
X = {x^t}_t where x^t ~ p(x)
Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X
e.g., N(μ, σ²), where θ = {μ, σ²}
Estimation
Assume a form for the distribution
Estimate its sufficient statistics from a sample
Use the distribution in classification or regression
Maximum Likelihood Estimation
X consists of independent and identically distributed (iid) samples
Likelihood of θ given the sample X: l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)
Log likelihood: L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)
Maximum likelihood estimator (MLE): θ* = argmax_θ L(θ|X)
Why use the log likelihood?
Log is an increasing function: increasing the input increases the output, so maximizing the log of a function is equivalent to maximizing the function itself
Log converts products into sums: log(abc) = log(a) + log(b) + log(c)
This makes analysis and computation simpler
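To make this concrete, here is a minimal NumPy sketch (synthetic data and seed are arbitrary; the Gaussian model anticipates the slide a few sections below) showing that the closed-form MLE gives the highest log likelihood:

```python
import numpy as np

def log_likelihood(x, mu, sigma2):
    """L(mu, sigma2 | X) = sum_t log p(x^t | mu, sigma2) for a Gaussian model."""
    return np.sum(-0.5 * np.log(2 * np.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2))

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=1000)   # sample with true mu = 3, sigma = 2

# For a Gaussian, the MLE is the sample mean and the (divisor-N) sample variance.
mu_hat, sigma2_hat = x.mean(), x.var()

print(log_likelihood(x, mu_hat, sigma2_hat))    # maximal
print(log_likelihood(x, 0.0, 1.0))              # any other value scores lower
```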
Examples: Bernoulli/Multinomial
Bernoulli: two states, failure/success, x ∈ {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)
Solving dL(p₀|X)/dp₀ = 0 gives the MLE p₀ = ∑_t x^t / N
Examples: Multinomial
Multinomial: generalization of Bernoulli to K mutually exclusive states, state i occurring with probability pᵢ
P(x₁, x₂, ..., x_K) = ∏_i pᵢ^(xᵢ)
L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i pᵢ^(xᵢ^t)
MLE: pᵢ = ∑_t xᵢ^t / N
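A short NumPy sketch of both estimators; the sample sizes and true probabilities are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)

# Bernoulli: the MLE p0 is simply the fraction of successes.
x = rng.binomial(n=1, p=0.7, size=500)       # 0/1 sample with true p0 = 0.7
p0_hat = x.sum() / len(x)                    # sum_t x^t / N

# Multinomial with K = 3 states: p_i is the fraction of outcomes in state i.
states = rng.choice(3, size=500, p=[0.2, 0.3, 0.5])
counts = np.bincount(states, minlength=3)
p_hat = counts / counts.sum()                # sum_t x_i^t / N for each i

print(p0_hat, p_hat)
```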
Gaussian (Normal) Distribution
p(x) = N(μ, σ²):
p(x) = (1/√(2πσ²)) exp(−(x − μ)² / (2σ²))
MLE estimates of the sufficient statistics:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N
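A quick numerical check of these two formulas on a synthetic sample (seed and parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=1.5, size=2000)   # true mu = 5, sigma^2 = 2.25

m = x.sum() / len(x)                  # MLE of the mean
s2 = ((x - m) ** 2).sum() / len(x)    # MLE of the variance (divisor N, not N-1)
print(m, s2)                          # close to 5 and 2.25
```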
Bias and Variance of an Estimator
The actual value of an estimator depends on the data: different samples of N points from the true distribution result in different estimator values
The value of an estimator is therefore a random variable, so we can ask about its mean and variance
The difference between the true value of a parameter and the mean of the estimator is the bias of the estimator
Usually we look for a formula resulting in an unbiased estimator with small variance
Bias and Variance
Unknown parameter θ; estimator dᵢ = d(Xᵢ) on sample Xᵢ
Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]
Mean square error: r(d, θ) = E[(d − θ)²] = (E[d] − θ)² + E[(d − E[d])²] = Bias² + Variance
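A small simulation checking this decomposition, using the divisor-N variance estimator s² from the Gaussian slide above as d (sample size and true variance are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
true_var, N, trials = 4.0, 10, 100_000

# Many samples of size N; compute the divisor-N variance estimator on each.
samples = rng.normal(loc=0.0, scale=2.0, size=(trials, N))
d = samples.var(axis=1)

bias = d.mean() - true_var            # approx -true_var / N = -0.4, so d is biased
variance = d.var()
mse = np.mean((d - true_var) ** 2)
print(bias ** 2 + variance, mse)      # the two quantities agree
```

The estimated bias comes out near −σ²/N, which is the point made in the Variance Estimator slide below.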
Mean Estimator
The sample mean m = ∑_t x^t / N is an unbiased estimator of μ: E[m] = μ
Its variance, Var(m) = σ²/N, shrinks as N grows
Variance Estimator
The MLE s² = ∑_t (x^t − m)² / N is biased: E[s²] = ((N − 1)/N) σ²
Multiplying by N/(N − 1) gives an unbiased estimator of σ²
Optimal Estimators
Estimators are random variables, as they depend on a random sample from the distribution
We look for estimators (functions of samples) that have minimum expected square error
The expected square error can be decomposed into bias (drift) and variance
Sometimes we accept larger errors but require unbiased estimators
Bayes’ Estimator
Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)
Full density: p(x|X) = ∫ p(x|θ) p(θ|X) dθ
Maximum a posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ
Bayes Estimator
Sometimes we have prior information on the range of values the parameter may take
We cannot give its exact value, but we can provide a probability for each value (a density function)
This is the probability of the parameter before looking at the data
Bayes Estimator
Assume this density has a peak around the true value of the parameter
The maximum a posteriori estimate will then minimize the error
Bayes’ Estimator: Example
x^t ~ N(θ, σ₀²) and prior θ ~ N(μ, σ²)
θ_ML = m
θ_MAP = θ_Bayes = E[θ|X] = [(N/σ₀²) / (N/σ₀² + 1/σ²)] m + [(1/σ²) / (N/σ₀² + 1/σ²)] μ
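A small sketch of this estimator; the prior and noise parameters are arbitrary illustrative values:

```python
import numpy as np

rng = np.random.default_rng(4)

sigma0_2 = 1.0                # variance of x^t given theta (assumed known)
mu, sigma_2 = 0.0, 0.25       # prior: theta ~ N(mu, sigma^2)
theta_true = 1.0
x = rng.normal(theta_true, np.sqrt(sigma0_2), size=20)

N, m = len(x), x.mean()
w = (N / sigma0_2) / (N / sigma0_2 + 1 / sigma_2)
theta_bayes = w * m + (1 - w) * mu    # posterior mean: m shrunk toward the prior mean
print(m, theta_bayes)
```

With small N the estimate stays close to the prior mean μ; as N grows, the weight w → 1 and θ_Bayes → m.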
Parametric Classification
Use Bayes' rule: the discriminant is based on the posterior p(Cᵢ|x) ∝ p(x|Cᵢ) P(Cᵢ)
Assumptions
Class likelihoods are Gaussian:
p(x|Cᵢ) = (1/(√(2π) σᵢ)) exp(−(x − μᵢ)² / (2σᵢ²))
Discriminant (log posterior, dropping the common p(x) term):
gᵢ(x) = log p(x|Cᵢ) + log P(Cᵢ) = −(1/2) log 2π − log σᵢ − (x − μᵢ)² / (2σᵢ²) + log P(Cᵢ)
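A sketch of this discriminant in NumPy; the class priors, means, and variances below are placeholders standing in for values estimated from a training set:

```python
import numpy as np

def g(x, prior, mean, var):
    """Log-based discriminant g_i(x) for a Gaussian class likelihood."""
    return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(var)
            - (x - mean) ** 2 / (2 * var) + np.log(prior))

# Two classes with (prior, mean, variance) estimated beforehand (illustrative values).
classes = [(0.6, 20.0, 9.0), (0.4, 35.0, 16.0)]

x_new = 26.0
scores = [g(x_new, *c) for c in classes]
print(int(np.argmax(scores)))   # pick the class with the largest discriminant
```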
Example
Input: customer income; output: which of K car models the customer will prefer
Assume that for each model there is a group of customers with a certain income
Income within a single class is normally distributed
Example: True distributions (figure)
Example: Estimated parameters (figure)
Example: Discriminant (figure)
Evaluate the discriminants on a new input and select the class with the largest one
If the priors and variances are equal, the rule reduces to choosing the class whose mean is nearest to x
Boundary between classes
Based on E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
26
Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)
27
Equal variances: a single boundary halfway between the means
Different variances: two boundaries
Two Approaches
Likelihood-based approach (until now): calculate probabilities using Bayes' rule, then compute the discriminant function
Discriminant-based approach (later): estimate the discriminant directly, bypassing probability estimation
Where is the boundary between the classes?
Regression
x is the independent variable, r is the dependent variable
f is unknown; we want to approximate it to predict future values
Parametric approach: assume a model with a small number of parameters, y = g(x|θ)
Find the best parameters from the data; we also have to make an assumption about the noise, e.g., r = f(x) + ε with ε ~ N(0, σ²)
Regression
We have training data (x, r)
Find the parameters that maximize the likelihood, i.e., the parameters that make the data most probable
Regression
Under the Gaussian noise assumption, the log likelihood splits into a constant term and a data-fit term; ignore the constant, as it does not depend on the parameters
Regression
Maximizing the likelihood is then equivalent to minimizing the remaining term, the sum of squared errors
Least Squares Estimate
Minimize the sum of squared errors: E(θ|X) = ∑_t [r^t − g(x^t|θ)]²
Linear Regression
Assume a linear model: g(x|w₁, w₀) = w₁ x + w₀
Minimizing the squared error and setting its derivatives to zero gives 2 linear equations in 2 unknowns, which are easy to solve:
∑_t r^t = N w₀ + w₁ ∑_t x^t
∑_t r^t x^t = w₀ ∑_t x^t + w₁ ∑_t (x^t)²
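A sketch solving these normal equations on synthetic data (true coefficients and noise level are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=50)
r = 3.0 * x + 1.0 + rng.normal(0, 1.0, size=50)   # true w1 = 3, w0 = 1, plus noise

# The two normal equations as a 2x2 linear system A @ [w0, w1] = b.
N = len(x)
A = np.array([[N, x.sum()],
              [x.sum(), (x ** 2).sum()]])
b = np.array([r.sum(), (r * x).sum()])
w0, w1 = np.linalg.solve(A, b)
print(w0, w1)   # close to 1 and 3
```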
Polynomial Regression
g(x|w_k, ..., w₂, w₁, w₀) = w_k x^k + ... + w₂ x² + w₁ x + w₀
In matrix form, with design matrix D whose row t is [1, x^t, (x^t)², ..., (x^t)^k] and target vector r = [r¹, r², ..., r^N]^T:
w = (D^T D)⁻¹ D^T r
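A sketch of the same computation on synthetic data; np.linalg.lstsq solves the least-squares problem without explicitly inverting D^T D, which is numerically safer:

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=40)
r = 2 * x ** 2 - x + 0.5 + rng.normal(0, 0.1, size=40)   # quadratic target plus noise

k = 2
D = np.vander(x, k + 1, increasing=True)   # row t is [1, x^t, (x^t)^2]
w, *_ = np.linalg.lstsq(D, r, rcond=None)  # least-squares solution for w
print(w)   # close to [0.5, -1, 2]
```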
Bias and Variance
Expected square error at x:
E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                   =        noise         +   squared error
Averaging the squared error over samples X:
E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                          =         bias²         +        variance
Bias and Variance
Given a single point (x, r), what is the expected error? Variations are due to the noise and to the training sample
The first term is due to noise: it does not depend on the estimator and cannot be removed
Variance
The second term is the deviation of the estimator from the regression function
It depends on the estimator and on the training set; we average over all possible training samples
Example: Polynomial Regression
As we increase the degree of the polynomial, the bias decreases, since we allow a better fit to the points
The variance increases, since a small change in the training sample may produce a large change in the model parameters (see the simulation below)
The bias/variance dilemma holds for any machine learning system
We need a way to find the optimal model complexity that balances bias and variance
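A small simulation of this effect (the true function, noise level, and degrees are arbitrary choices): the squared bias at a fixed point typically falls with the degree while the variance grows:

```python
import numpy as np

rng = np.random.default_rng(7)
f = lambda x: np.sin(2 * x)         # "true" regression function (illustrative)
x = np.linspace(0, 3, 25)
x0, fx0 = 1.5, np.sin(3.0)          # point at which we measure bias and variance

for degree in (1, 3, 7):
    preds = []
    for _ in range(500):            # many training sets of the same size
        r = f(x) + rng.normal(0, 0.3, size=x.size)
        w = np.polyfit(x, r, degree)
        preds.append(np.polyval(w, x0))
    preds = np.array(preds)
    print(degree, (preds.mean() - fx0) ** 2, preds.var())   # bias^2, variance
```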
Bias/Variance Dilemma
Example: gᵢ(x) = 2 has no variance but high bias; gᵢ(x) = ∑_t rᵢ^t / N has lower bias but some variance
As we increase complexity, bias decreases (a better fit to data) and variance increases (fit varies more with data)
Bias/Variance dilemma: (Geman et al., 1992)
Bias/Variance Dilemma (figure)
Polynomial Regression (figure): the best fit is the one with minimum error
Model Selection
How do we select the right model complexity? This is different from estimating the model parameters
There are several procedures
Cross-Validation
We cannot calculate bias and variance, since we do not know the true model, but we can estimate the total generalization error
Set aside a portion of the data (the validation set)
Increase the model complexity, find the parameters on the training portion, and calculate the error on the validation set
Stop when the error ceases to decrease or even starts to increase (a sketch follows)
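A minimal sketch of this procedure, using polynomial degree as the complexity knob (data, split, and degree range are illustrative):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.uniform(0, 3, size=60)
r = np.sin(2 * x) + rng.normal(0, 0.2, size=60)

# Hold out a validation set.
x_tr, r_tr, x_val, r_val = x[:40], r[:40], x[40:], r[40:]

errors = {}
for degree in range(1, 10):
    w = np.polyfit(x_tr, r_tr, degree)                    # fit on the training portion
    errors[degree] = np.mean((np.polyval(w, x_val) - r_val) ** 2)
print(min(errors, key=errors.get), errors)                # complexity near the "elbow"
```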
Cross-Validation (figure): the best fit is at the "elbow" of the validation error curve
Regularization
Introduce a penalty for model complexity into the error function: augmented error = error on data + λ · model complexity
Find the optimal model complexity (e.g., the degree of the polynomial) and the optimal parameters (coefficients) that minimize this function
λ is the penalty for model complexity: if λ is too large, only very simple models will be admitted
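One common instance of this idea is ridge regression, which takes the penalty to be λ times the squared size of the coefficient vector; a sketch with arbitrary data and λ:

```python
import numpy as np

def ridge_fit(x, r, degree, lam):
    """Minimize sum_t (r^t - D w)^2 + lam * ||w||^2 via regularized normal equations."""
    D = np.vander(x, degree + 1, increasing=True)
    return np.linalg.solve(D.T @ D + lam * np.eye(degree + 1), D.T @ r)

rng = np.random.default_rng(9)
x = rng.uniform(0, 3, size=30)
r = np.sin(2 * x) + rng.normal(0, 0.2, size=30)

w_plain = ridge_fit(x, r, degree=7, lam=0.0)   # no penalty: large coefficients
w_ridge = ridge_fit(x, r, degree=7, lam=1.0)   # penalized: smaller, smoother fit
print(np.abs(w_plain).max(), np.abs(w_ridge).max())
```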