CHAPTER 4: Parametric Methods

Page 1: CHAPTER 4: Parametric Methods

Lecture Notes for E Alpaydın 2004 Introduction to Machine Learning © The MIT Press (V1.1)

Page 2: Parametric Estimation

X = { x^t }_t where x^t ~ p(x)

Parametric estimation: assume a form for p(x|θ) and estimate θ, its sufficient statistics, using X.
E.g., N(μ, σ²), where θ = { μ, σ² }

Problem: How can we obtain θ from X?

Assumption: X contains samples of a one-dimensional random variable. Later, in multivariate estimation, each example in X contains multiple measurements, not just a single one.

Page 3: Maximum Likelihood Estimation

Density function p with parameters θ is given, and x^t ~ p(x|θ)

Likelihood of θ given the sample X:
l(θ|X) = p(X|θ) = ∏_t p(x^t|θ)

We look for the θ that "maximizes the likelihood of the sample"!

Log likelihood:
L(θ|X) = log l(θ|X) = ∑_t log p(x^t|θ)

Maximum likelihood estimator (MLE):
θ* = argmax_θ L(θ|X)

Homework: Given the sample 0, 3, 3, 4, 5 and x ~ N(μ, σ²), use MLE to find (μ, σ²)!
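As a sanity check for the homework, here is a minimal Python sketch (assuming NumPy is available) of the closed-form ML estimates m = ∑_t x^t / N and s² = ∑_t (x^t − m)² / N derived on Page 5:

    import numpy as np

    def gaussian_mle(x):
        """ML estimates of a univariate Gaussian: sample mean and
        (biased, divide-by-N) sample variance."""
        x = np.asarray(x, dtype=float)
        m = x.mean()                # m = sum_t x^t / N
        s2 = ((x - m) ** 2).mean()  # s^2 = sum_t (x^t - m)^2 / N
        return m, s2

    m, s2 = gaussian_mle([0, 3, 3, 4, 5])
    print(m, s2)  # 3.0 2.8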

Page 4: Examples: Bernoulli/Multinomial

Bernoulli: two states, failure/success, x in {0, 1}
P(x) = p₀^x (1 − p₀)^(1−x)
L(p₀|X) = log ∏_t p₀^(x^t) (1 − p₀)^(1−x^t)
MLE: p₀ = ∑_t x^t / N

Multinomial: K > 2 states, x_i in {0, 1}
P(x₁, x₂, ..., x_K) = ∏_i p_i^(x_i)
L(p₁, p₂, ..., p_K|X) = log ∏_t ∏_i p_i^(x_i^t)
MLE: p_i = ∑_t x_i^t / N
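Both estimates are just frequency counts; a small NumPy sketch with made-up samples:

    import numpy as np

    # Bernoulli MLE: fraction of successes.
    x = np.array([1, 0, 1, 1, 0])
    p0 = x.mean()            # sum_t x^t / N

    # Multinomial MLE: each x^t is a 1-of-K indicator vector.
    X = np.array([[1, 0, 0],
                  [0, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1]])
    p = X.mean(axis=0)       # p_i = sum_t x_i^t / N
    print(p0, p)             # 0.6 [0.5 0.25 0.25]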

Page 5: Gaussian (Normal) Distribution

p(x) = N(μ, σ²):
p(x) = [1 / (√(2π) σ)] exp[−(x − μ)² / (2σ²)]

MLE for μ and σ²:
m = ∑_t x^t / N
s² = ∑_t (x^t − m)² / N

Page 6: Bias and Variance

Unknown parameter θ
Estimator d_i = d(X_i) on sample X_i

Bias: b_θ(d) = E[d] − θ
Variance: E[(d − E[d])²]

Mean square error of the estimator d:
r(d, θ) = E[(d − θ)²]
        = (E[d] − θ)² + E[(d − E[d])²]
        = Bias² + Variance

Bias²: error in the model itself. Variance: variation/randomness of the model.

Page 7: Bayes' Estimator

Treat θ as a random variable with prior p(θ)
Bayes' rule: p(θ|X) = p(X|θ) p(θ) / p(X)

Maximum a Posteriori (MAP): θ_MAP = argmax_θ p(θ|X)
Maximum Likelihood (ML): θ_ML = argmax_θ p(X|θ)
Bayes' estimator: θ_Bayes = E[θ|X] = ∫ θ p(θ|X) dθ

Comments: ML just takes the maximum of the density function. Compared with ML, MAP additionally considers the prior. The Bayes' estimator averages over all possible values of θ, weighted by how likely they are to occur (as measured by the posterior p(θ|X)).

For MAP see: http://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation

Page 8: Bayes' Estimator: Example

x^t ~ N(θ, σ₀²) and θ ~ N(μ, σ²)

θ_ML = m

θ_MAP = θ_Bayes = E[θ|X]
      = [ (N/σ₀²) / (N/σ₀² + 1/σ²) ] m + [ (1/σ²) / (N/σ₀² + 1/σ²) ] μ

As N grows (or as the prior variance σ² grows), the estimate converges to the sample average m.
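A small numeric sketch of this shrinkage formula (assuming NumPy; the noise variance σ₀², prior mean μ, prior variance σ², and the data are made-up illustration values):

    import numpy as np

    x = np.array([0., 3., 3., 4., 5.])       # sample, x^t ~ N(theta, sigma0^2)
    sigma0_sq, mu, sigma_sq = 4.0, 0.0, 1.0  # assumed known noise var and prior
    N, m = len(x), x.mean()

    w = (N / sigma0_sq) / (N / sigma0_sq + 1 / sigma_sq)
    theta_bayes = w * m + (1 - w) * mu       # posterior mean: between m and mu
    print(m, theta_bayes)                    # 3.0  1.666...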

Page 9: Parametric Classification

g_i(x) = p(x|C_i) P(C_i)
or equivalently
g_i(x) = log p(x|C_i) + log P(C_i)

With class-conditional densities p(x|C_i) = [1 / (√(2π) σ_i)] exp[−(x − μ_i)² / (2σ_i²)]:

g_i(x) = −(1/2) log 2π − log σ_i − (x − μ_i)² / (2σ_i²) + log P(C_i)

g_i(x) is a kind of (unnormalized) p(C_i|x).

Page 10: Parametric Classification (cont.)

Given the sample X = { x^t, r^t }_{t=1..N}, where
r_i^t = 1 if x^t ∈ C_i, and 0 otherwise,

the ML estimates are:
P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t x^t r_i^t / ∑_t r_i^t
s_i² = ∑_t (x^t − m_i)² r_i^t / ∑_t r_i^t

The discriminant becomes:
g_i(x) = −(1/2) log 2π − log s_i − (x − m_i)² / (2s_i²) + log P̂(C_i)
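These estimates and the discriminant are a few lines of Python (a sketch assuming NumPy; the inputs x and class labels y are made-up data):

    import numpy as np

    x = np.array([1.0, 1.2, 0.8, 3.0, 3.5, 2.9])  # 1-D inputs
    y = np.array([0, 0, 0, 1, 1, 1])              # class indices

    def fit_class(x, y, i):
        xi = x[y == i]
        prior = len(xi) / len(x)        # P^(C_i)
        m = xi.mean()                   # m_i
        s2 = ((xi - m) ** 2).mean()     # s_i^2
        return prior, m, s2

    def g(xq, prior, m, s2):
        # -1/2 log 2pi - log s - (x - m)^2 / (2 s^2) + log prior
        return (-0.5 * np.log(2 * np.pi) - 0.5 * np.log(s2)
                - (xq - m) ** 2 / (2 * s2) + np.log(prior))

    params = [fit_class(x, y, i) for i in (0, 1)]
    print(np.argmax([g(2.0, *p) for p in params]))  # predicted class for x = 2.0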

Page 11: [Figure] Equal variances: a single boundary at halfway between the means.

Page 12: [Figure] Variances are different: two boundaries. Homework!

Page 13: Regression

r = f(x) + ε, with ε ~ N(0, σ²)

Estimator: g(x|θ), so that p(r|x) ~ N(g(x|θ), σ²)

Log likelihood:
L(θ|X) = log ∏_{t=1..N} p(x^t, r^t)
       = log ∏_{t=1..N} p(r^t|x^t) + log ∏_{t=1..N} p(x^t)

Maximizing the probability of the sample again!

Page 14: Regression: From LogL to Error

L(θ|X) = log ∏_{t=1..N} [1 / (√(2π) σ)] exp[−(r^t − g(x^t|θ))² / (2σ²)]
       = −N log(√(2π) σ) − [1 / (2σ²)] ∑_{t=1..N} (r^t − g(x^t|θ))²

Maximizing L(θ|X) is therefore equivalent to minimizing the error
E(θ|X) = (1/2) ∑_{t=1..N} (r^t − g(x^t|θ))²

Skip to 20!

Page 15: Linear Regression

g(x^t|w₁, w₀) = w₁ x^t + w₀

Setting the derivatives of E to zero gives two equations in two unknowns:
∑_t r^t = N w₀ + w₁ ∑_t x^t
∑_t r^t x^t = w₀ ∑_t x^t + w₁ ∑_t (x^t)²

In matrix form, A w = y with
A = [ N         ∑_t x^t
      ∑_t x^t   ∑_t (x^t)² ],   w = [ w₀, w₁ ]ᵀ,   y = [ ∑_t r^t, ∑_t r^t x^t ]ᵀ

w = A⁻¹ y

Relationship to what we discussed in Topic 2??
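A minimal sketch of solving A w = y with NumPy (made-up data; np.linalg.solve is preferable to forming A⁻¹ explicitly):

    import numpy as np

    x = np.array([0., 1., 2., 3., 4.])
    r = np.array([1.1, 2.9, 5.2, 7.1, 8.8])   # roughly r = 2x + 1

    A = np.array([[len(x),       x.sum()],
                  [x.sum(), (x ** 2).sum()]])
    y = np.array([r.sum(), (r * x).sum()])

    w0, w1 = np.linalg.solve(A, y)             # w = A^{-1} y
    print(w0, w1)                              # close to 1 and 2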

Page 16: Polynomial Regression

g(x^t|w_k, ..., w₂, w₁, w₀) = w_k (x^t)^k + ... + w₂ (x^t)² + w₁ x^t + w₀

With the design matrix and target vector
D = [ 1   x¹    (x¹)²   ...  (x¹)^k
      1   x²    (x²)²   ...  (x²)^k
      ...
      1   x^N   (x^N)²  ...  (x^N)^k ],   r = [ r¹, r², ..., r^N ]ᵀ

the solution is w = (DᵀD)⁻¹ Dᵀ r

Here we get k+1 equations with k+1 unknowns!
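A sketch in NumPy (made-up data; np.vander builds D, and least squares is numerically safer than inverting DᵀD):

    import numpy as np

    x = np.array([0., 0.5, 1., 1.5, 2., 2.5])
    r = np.array([0.1, 0.4, 1.1, 2.2, 4.1, 6.3])   # roughly quadratic

    k = 2
    D = np.vander(x, k + 1, increasing=True)   # columns: 1, x, x^2
    w, *_ = np.linalg.lstsq(D, r, rcond=None)  # solves min ||D w - r||^2
    print(w)                                   # [w0, w1, w2]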

Page 17: Other Error Measures

Square error:
E(θ|X) = (1/2) ∑_{t=1..N} (r^t − g(x^t|θ))²

Relative square error:
E(θ|X) = ∑_{t=1..N} (r^t − g(x^t|θ))² / ∑_{t=1..N} (r^t − r̄)²

Absolute error:
E(θ|X) = ∑_t |r^t − g(x^t|θ)|

ε-sensitive error:
E(θ|X) = ∑_t 1(|r^t − g(x^t|θ)| > ε) (|r^t − g(x^t|θ)| − ε)
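Each of these is a one-liner in NumPy (a sketch with made-up values; pred stands for g(x^t|θ)):

    import numpy as np

    r = np.array([1.0, 2.0, 3.0, 4.0])
    pred = np.array([1.1, 1.8, 3.3, 3.9])
    eps = 0.15

    sq  = 0.5 * ((r - pred) ** 2).sum()                        # square error
    rel = ((r - pred) ** 2).sum() / ((r - r.mean()) ** 2).sum()  # relative
    ab  = np.abs(r - pred).sum()                               # absolute
    d   = np.abs(r - pred)
    eps_sensitive = ((d > eps) * (d - eps)).sum()              # eps-sensitive
    print(sq, rel, ab, eps_sensitive)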

Page 18: Bias and Variance

E[(r − g(x))² | x] = E[(r − E[r|x])² | x] + (E[r|x] − g(x))²
                     (noise)               (squared error)

E_X[(E[r|x] − g(x))² | x] = (E[r|x] − E_X[g(x)])² + E_X[(g(x) − E_X[g(x)])²]
                            (bias²)                 (variance)

To be revisited next week!

Page 19: Estimating Bias and Variance

M samples X_i = { x^t_i, r^t_i }, i = 1, ..., M,
are used to fit g_i(x), i = 1, ..., M

ḡ(x) = (1/M) ∑_i g_i(x)

Bias²(g) = (1/N) ∑_t (ḡ(x^t) − f(x^t))²
Variance(g) = (1/(N M)) ∑_t ∑_i (g_i(x^t) − ḡ(x^t))²

Initially skip!
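To make these estimators concrete, a simulation sketch (assumptions: f(x) = sin x, Gaussian noise with σ = 0.5, and a deliberately simple constant model g_i; all choices are made up for illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    f = np.sin
    x = np.linspace(0, 3, 25)                 # fixed evaluation points, N = 25
    M = 100                                   # number of training samples

    # Each sample X_i yields one fitted model; here g_i(x) = mean of its targets.
    g = np.empty((M, len(x)))
    for i in range(M):
        r_i = f(x) + rng.normal(0, 0.5, size=x.shape)
        g[i, :] = r_i.mean()                  # constant model

    g_bar = g.mean(axis=0)                    # average model over the M fits
    bias2 = ((g_bar - f(x)) ** 2).mean()      # (1/N) sum_t (g_bar - f)^2
    var = ((g - g_bar) ** 2).mean()           # (1/(NM)) sum_t sum_i (g_i - g_bar)^2
    print(bias2, var)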

Page 20: Bias/Variance Dilemma

Example: g_i(x) = 2 has no variance and high bias;
g_i(x) = ∑_t r^t_i / N has lower bias but some variance.

As we increase complexity, bias decreases (a better fit to the data) and variance increases (the fit varies more with the data).

Bias/Variance dilemma (Geman et al., 1992)

Page 21: [Figure] Bias and variance of the fits g_i and their average ḡ relative to f. Already visited as Topic 4!

Page 22: [Figure] Polynomial Regression: the best fit is the one with minimum error.

Page 23: Model Selection

Cross-validation: measure generalization accuracy by testing on data unused during training

Regularization: penalize complex models
E' = error on data + λ · model complexity

Akaike's information criterion (AIC), Bayesian information criterion (BIC)

Minimum description length (MDL): Kolmogorov complexity, shortest description of the data

Structural risk minimization (SRM)

Remark: will be discussed in more depth later (Topic 11)

Page 24: Bayesian Model Selection

Prior on models, p(model):
p(model|data) = p(data|model) p(model) / p(data)

Regularization, when the prior favors simpler models
Bayes: MAP of the posterior, p(model|data)
Average over a number of models with high posterior (voting, ensembles: Chapter 15)

Page 25: CHAPTER 5: Multivariate Methods

Page 26: Multivariate Data

Multiple measurements (sensors)
d inputs/features/attributes: d-variate
N instances/observations/examples

X = [ X₁¹   X₂¹   ...  X_d¹
      X₁²   X₂²   ...  X_d²
      ...
      X₁^N  X₂^N  ...  X_d^N ]

Page 27: Multivariate Parameters

Mean: E[x] = μ = [μ₁, ..., μ_d]ᵀ
Covariance: σ_ij ≡ Cov(X_i, X_j)
Correlation: Corr(X_i, X_j) ≡ ρ_ij = σ_ij / (σ_i σ_j)

Σ ≡ Cov(X) = E[(X − μ)(X − μ)ᵀ] =
    [ σ₁²   σ₁₂   ...  σ_1d
      σ₂₁   σ₂²   ...  σ_2d
      ...
      σ_d1  σ_d2  ...  σ_d² ]

Page 28: Parameter Estimation

Sample mean m: m_i = ∑_{t=1..N} x_i^t / N,  i = 1, ..., d
Covariance matrix S: s_ij = (1/N) ∑_{t=1..N} (x_i^t − m_i)(x_j^t − m_j)
Correlation matrix R: r_ij = s_ij / (s_i s_j)
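These estimators map directly onto NumPy (a sketch with made-up data; note that np.cov defaults to the unbiased divide-by-(N−1) estimate, so bias=True gives the divide-by-N version above):

    import numpy as np

    X = np.array([[1.0, 2.0],
                  [2.0, 3.5],
                  [3.0, 5.1],
                  [4.0, 6.8]])               # N = 4 instances, d = 2 features

    m = X.mean(axis=0)                       # sample mean, one entry per feature
    S = np.cov(X, rowvar=False, bias=True)   # covariance matrix, divide by N
    R = np.corrcoef(X, rowvar=False)         # correlation matrix
    print(m, S, R, sep="\n")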

Page 29: Multivariate Normal Distribution

x ~ N_d(μ, Σ)

p(x) = [1 / ((2π)^(d/2) |Σ|^(1/2))] exp[−(1/2)(x − μ)ᵀ Σ⁻¹ (x − μ)]

The quadratic form (x − μ)ᵀ Σ⁻¹ (x − μ) is the Mahalanobis distance between x and μ.

http://www.analyzemath.com/Calculators/inverse_matrix_3by3.html

Page 30: Multivariate Normal Distribution

Mahalanobis distance: (x − μ)ᵀ Σ⁻¹ (x − μ)
measures the distance from x to μ in terms of Σ (normalizes for differences in variances and correlations)

Bivariate case, d = 2:
p(x₁, x₂) = [1 / (2π σ₁ σ₂ √(1 − ρ²))] exp[−(1 / (2(1 − ρ²))) (z₁² − 2ρ z₁ z₂ + z₂²)]

where z_i = (x_i − μ_i) / σ_i is called the z-score for x_i, and ρ is the correlation between the two variables.
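A sketch of the Mahalanobis distance in NumPy (made-up μ and Σ; np.linalg.solve avoids forming Σ⁻¹ explicitly):

    import numpy as np

    mu = np.array([0.0, 0.0])
    Sigma = np.array([[2.0, 0.6],
                      [0.6, 1.0]])
    x = np.array([1.0, 1.5])

    d = x - mu
    maha_sq = d @ np.linalg.solve(Sigma, d)   # (x - mu)^T Sigma^{-1} (x - mu)
    print(np.sqrt(maha_sq))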

Page 31: [Figure] Bivariate Normal

Page 32: [Figure]

Page 33: Independent Inputs: Naive Bayes

If x_i are independent, the off-diagonals of Σ are 0, and the Mahalanobis distance reduces to a weighted (by 1/σ_i) Euclidean distance:

p(x) = ∏_{i=1..d} p_i(x_i) = [1 / ((2π)^(d/2) ∏_i σ_i)] exp[−(1/2) ∑_{i=1..d} ((x_i − μ_i)/σ_i)²]

If the variances are also equal, it reduces to the Euclidean distance.

Page 34: Parametric Classification

If p(x|C_i) ~ N(μ_i, Σ_i):
p(x|C_i) = [1 / ((2π)^(d/2) |Σ_i|^(1/2))] exp[−(1/2)(x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i)]

Discriminant functions are:
g_i(x) = log p(x|C_i) + log P(C_i)
       = −(d/2) log 2π − (1/2) log |Σ_i| − (1/2)(x − μ_i)ᵀ Σ_i⁻¹ (x − μ_i) + log P(C_i)

Page 35: Estimation of Parameters

P̂(C_i) = ∑_t r_i^t / N
m_i = ∑_t r_i^t x^t / ∑_t r_i^t
S_i = ∑_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / ∑_t r_i^t

Plugging these in:
g_i(x) = −(1/2) log |S_i| − (1/2)(x − m_i)ᵀ S_i⁻¹ (x − m_i) + log P̂(C_i)
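A compact sketch of these estimators and the resulting discriminant (assuming NumPy; the data is made up):

    import numpy as np

    X = np.array([[1.0, 2.0], [1.4, 1.8], [0.8, 2.3], [1.2, 2.4],   # class 0
                  [3.0, 0.5], [3.4, 0.9], [2.8, 0.6], [3.2, 1.0]])  # class 1
    y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

    def fit(X, y, i):
        Xi = X[y == i]
        prior = len(Xi) / len(X)                 # P^(C_i)
        m = Xi.mean(axis=0)                      # m_i
        S = (Xi - m).T @ (Xi - m) / len(Xi)      # S_i, divide by class count
        return prior, m, S

    def g(x, prior, m, S):
        d = x - m
        return (-0.5 * np.linalg.slogdet(S)[1]          # -1/2 log|S_i|
                - 0.5 * d @ np.linalg.solve(S, d)       # -1/2 Mahalanobis^2
                + np.log(prior))

    params = [fit(X, y, i) for i in (0, 1)]
    x_new = np.array([2.0, 1.5])
    print(np.argmax([g(x_new, *p) for p in params]))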

Page 36: Different S_i: Quadratic Discriminant

Expanding the quadratic form gives
g_i(x) = xᵀ W_i x + w_iᵀ x + w_i0
where
W_i = −(1/2) S_i⁻¹
w_i = S_i⁻¹ m_i
w_i0 = −(1/2) m_iᵀ S_i⁻¹ m_i − (1/2) log |S_i| + log P̂(C_i)

skip

Page 37: [Figure] Likelihoods, posterior for C₁, and the discriminant P(C₁|x) = 0.5.

Page 38: Common Covariance Matrix S

Shared common sample covariance:
S = ∑_i P̂(C_i) S_i

The discriminant reduces to
g_i(x) = −(1/2)(x − m_i)ᵀ S⁻¹ (x − m_i) + log P̂(C_i)

which is a linear discriminant:
g_i(x) = w_iᵀ x + w_i0
where
w_i = S⁻¹ m_i
w_i0 = −(1/2) m_iᵀ S⁻¹ m_i + log P̂(C_i)

Initially skip!

Page 39: [Figure] Common Covariance Matrix S. Initially skip!

Page 40: Diagonal S

When x_j, j = 1, ..., d, are independent, Σ is diagonal:
p(x|C_i) = ∏_j p(x_j|C_i)  (Naive Bayes' assumption)

g_i(x) = −(1/2) ∑_{j=1..d} ((x_j^t − m_ij)/s_j)² + log P̂(C_i)

Classify based on weighted Euclidean distance (in s_j units) to the nearest mean.

Likely covered in April!

Page 41: [Figure] Diagonal S: variances may be different.

Page 42: Diagonal S, Equal Variances

Nearest mean classifier: classify based on Euclidean distance to the nearest mean:
g_i(x) = −‖x − m_i‖² / (2s²) + log P̂(C_i)
       = −(1/(2s²)) ∑_{j=1..d} (x_j^t − m_ij)² + log P̂(C_i)

Each mean can be considered a prototype or template, and this is template matching.
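A nearest-mean classifier is a few lines (a sketch assuming NumPy and made-up class means; with equal priors the log P̂(C_i) term can be dropped):

    import numpy as np

    means = np.array([[1.0, 2.0],      # m_0
                      [3.0, 0.5]])     # m_1

    def nearest_mean(x, means):
        # Equivalent to maximizing g_i when variances and priors are equal.
        d2 = ((means - x) ** 2).sum(axis=1)   # squared Euclidean distances
        return int(np.argmin(d2))

    print(nearest_mean(np.array([2.8, 0.9]), means))   # 1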

Page 43: [Figure] Diagonal S, equal variances.

Page 44: Model Selection

As we increase complexity (less restricted S), bias decreases and variance increases.
Assume simple models (allow some bias) to control variance (regularization).

Assumption                  | Covariance matrix       | No. of parameters
Shared, hyperspheric        | S_i = S = s²I           | 1
Shared, axis-aligned        | S_i = S, with s_ij = 0  | d
Shared, hyperellipsoidal    | S_i = S                 | d(d+1)/2
Different, hyperellipsoidal | S_i                     | K · d(d+1)/2

Page 45: Discrete Features

Binary features: p_ij ≡ p(x_j = 1 | C_i)

If the x_j are independent (Naive Bayes'):
p(x|C_i) = ∏_{j=1..d} p_ij^(x_j) (1 − p_ij)^(1−x_j)

the discriminant is linear:
g_i(x) = log p(x|C_i) + log P(C_i)
       = ∑_j [ x_j log p_ij + (1 − x_j) log(1 − p_ij) ] + log P(C_i)

Estimated parameters: p̂_ij = ∑_t x_j^t r_i^t / ∑_t r_i^t

skip!

Page 46: Discrete Features

Multinomial (1-of-n_j) features: x_j ∈ {v₁, v₂, ..., v_nj}
p_ijk ≡ p(z_jk = 1 | C_i) = p(x_j = v_k | C_i)

If the x_j are independent:
p(x|C_i) = ∏_{j=1..d} ∏_{k=1..nj} p_ijk^(z_jk)
g_i(x) = ∑_j ∑_k z_jk log p_ijk + log P(C_i)

Estimated parameters: p̂_ijk = ∑_t z_jk^t r_i^t / ∑_t r_i^t

skip!

Page 47: Multivariate Regression

Multivariate linear model:
r^t = g(x^t | w₀, w₁, ..., w_d) + ε
g(x^t) = w₀ + w₁ x₁^t + w₂ x₂^t + ... + w_d x_d^t
E(w₀, w₁, ..., w_d | X) = (1/2) ∑_t (r^t − w₀ − w₁ x₁^t − ... − w_d x_d^t)²

Multivariate polynomial model: define new higher-order variables
z₁ = x₁, z₂ = x₂, z₃ = x₁², z₄ = x₂², z₅ = x₁x₂
and use the linear model in this new z space (basis functions, kernel trick, SVM: Chapter 10)

skip!
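A sketch of this polynomial trick (made-up 2-feature data; we expand into the z space and reuse ordinary least squares):

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.uniform(-1, 1, size=(50, 2))       # N = 50, d = 2
    x1, x2 = X[:, 0], X[:, 1]
    r = 1 + 2*x1 - x2 + 0.5*x1*x2 + rng.normal(0, 0.05, 50)

    # z space: bias, x1, x2, x1^2, x2^2, x1*x2
    Z = np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1*x2])
    w, *_ = np.linalg.lstsq(Z, r, rcond=None)
    print(w.round(2))    # close to [1, 2, -1, 0, 0, 0.5]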