3. Linear Models for Regression


  • 7/23/2019 3.Linear Models for Regression

    1/103

    Linear Models For Regression

    (Supervised Learning)

Contents

    Deterministic linear model regression

    Line fitting

    Curve fitting

    Regularization

    Basis function

    ML-based probabilistic linear model regression

    Bayesian linear model regression

    Maximum a posteriori (MAP) estimation

    Bayesian estimation

    Evidence approximation

Deterministic Linear Model Regression

What is Line Fitting?

Linear Model : Line Fitting

Given a vector of d-dimensional inputs $\mathbf{x} = (x_1, x_2, \ldots, x_d)$, we want to predict the target (response) $t$ using the linear model:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + \cdots + w_d x_d$$

The term $w_0$ is the intercept, also called the bias term. It is convenient to include the constant variable 1 in $\mathbf{x}$ and write $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \mathbf{x}$.

We observe a training set consisting of N observations $\mathbf{X} = (\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N)$, together with corresponding target values $\mathbf{t} = (t_1, t_2, \ldots, t_N)$.

Note that $\mathbf{X}$ is an $N \times (d+1)$ matrix.

How to Find (Learn) Optimal w

One option is to minimize the sum of the squares of the errors between the predictions $y(\mathbf{x}_n, \mathbf{w})$ for each data point $\mathbf{x}_n$ and the corresponding real-valued targets $t_n$:

$$E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \left( y(\mathbf{x}_n, \mathbf{w}) - t_n \right)^2$$

where N is the number of training data. In matrix form, $E(\mathbf{w}) = \frac{1}{2}\lVert \mathbf{X}\mathbf{w} - \mathbf{t} \rVert^2$.

Transforming the Objective Function

Stack the data into a matrix and use the norm operation to handle the sum:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n^T\mathbf{w} - t_n)^2 = \frac{1}{2}\left[(\mathbf{x}_1^T\mathbf{w} - t_1)^2 + (\mathbf{x}_2^T\mathbf{w} - t_2)^2 + \cdots + (\mathbf{x}_N^T\mathbf{w} - t_N)^2\right] = \frac{1}{2}\lVert \mathbf{X}\mathbf{w} - \mathbf{t} \rVert^2 = \frac{1}{2}(\mathbf{X}\mathbf{w} - \mathbf{t})^T(\mathbf{X}\mathbf{w} - \mathbf{t})$$

where the rows of $\mathbf{X}$ are $\mathbf{x}_1^T, \ldots, \mathbf{x}_N^T$ and $\mathbf{t} = (t_1, \ldots, t_N)^T$.

How to Find (Learn) Optimal w

Matrix Derivatives (Supplementary)

Derivatives (Vectors) (Supplementary)

Vector-by-scalar, scalar-by-vector (gradient), and vector-by-vector derivatives follow the conventions of matrix calculus.

Derivatives Examples (Supplementary)

Example (scalar-by-vector): $\frac{\partial}{\partial \mathbf{w}} \mathbf{w}^T \mathbf{A} \mathbf{w} = (\mathbf{A} + \mathbf{A}^T)\mathbf{w}$

http://en.wikipedia.org/wiki/Matrix_calculus

Optimal w : Derivation

$$\begin{aligned} E(\mathbf{w}) &= \frac{1}{2}(\mathbf{X}\mathbf{w} - \mathbf{t})^T(\mathbf{X}\mathbf{w} - \mathbf{t}) \\ &= \frac{1}{2}(\mathbf{w}^T\mathbf{X}^T - \mathbf{t}^T)(\mathbf{X}\mathbf{w} - \mathbf{t}) \\ &= \frac{1}{2}\left(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{w}^T\mathbf{X}^T\mathbf{t} - \mathbf{t}^T\mathbf{X}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right) \\ &= \frac{1}{2}\left(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\,\mathbf{t}^T\mathbf{X}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right) \end{aligned}$$

Setting the gradient to zero:

$$\frac{\partial E}{\partial \mathbf{w}} = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{t} = \mathbf{0} \quad\Rightarrow\quad \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{t} \quad\Rightarrow\quad \mathbf{w}^* = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$

Here we used $\frac{\partial}{\partial\mathbf{w}}\mathbf{w}^T\mathbf{A}\mathbf{w} = (\mathbf{A} + \mathbf{A}^T)\mathbf{w} = 2\mathbf{A}\mathbf{w}$ when $\mathbf{A}$ is symmetric, and $\mathbf{X}^T\mathbf{X}$ is symmetric.
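The normal-equation solution above can be sketched numerically; this is a minimal NumPy illustration on synthetic data (the data and true weights are illustrative, not from the slides):

```python
import numpy as np

# Least-squares line fitting via the normal equations w* = (X^T X)^{-1} X^T t.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
t = 1.0 + 2.0 * x + 0.1 * rng.standard_normal(50)   # noisy targets, true w = (1, 2)

X = np.column_stack([np.ones_like(x), x])   # prepend the constant 1 -> N x (d+1)
w = np.linalg.solve(X.T @ X, X.T @ t)       # solve X^T X w = X^T t

print(w)   # close to the true weights (1, 2)
```

Solving the linear system with `np.linalg.solve` is preferred over forming the inverse explicitly.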

Linear Model : Curve Fitting

Consider observing a training set consisting of N 1-dimensional observations $\mathbf{x} = (x_1, x_2, \ldots, x_N)$, together with corresponding real-valued targets $\mathbf{t} = (t_1, t_2, \ldots, t_N)$.

The previous model can only fit a linear relation between input and output. A polynomial model is more flexible:

$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$$

How to Find (Learn) Optimal w

As in the least-squares example, we can minimize the sum of the squares of the errors between the predictions $y(x_n, \mathbf{w})$ for each data point $x_n$ and the corresponding target values $t_n$. Note that $\mathbf{X}$ is now an $N \times (M+1)$ matrix; line fitting is the special case $M = 1$. If $x$ is d-dimensional, what is $\mathbf{X}$?
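Polynomial curve fitting reduces to the same least-squares machinery once the design matrix of powers is built; a minimal sketch on illustrative sinusoidal data:

```python
import numpy as np

# Polynomial curve fitting by least squares: columns of X are 1, x, ..., x^M.
rng = np.random.default_rng(1)
N, M = 20, 3
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(N)

X = np.vander(x, M + 1, increasing=True)    # N x (M+1) design matrix
w, *_ = np.linalg.lstsq(X, t, rcond=None)   # least-squares solution

mse = np.mean((X @ w - t) ** 2)             # training mean squared error
print(mse)
```

`np.linalg.lstsq` is numerically safer than inverting $\mathbf{X}^T\mathbf{X}$ when M grows.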

Various Fitting Results Depending on the Size of M

Overfitting : Why?

A high-order polynomial can match every training point while behaving badly between them. This is overfitting.

What Happens to w* When Overfitting Occurs?

Overfitting : Varying the Size of the Data

Generalization

The ultimate goal of supervised learning is to achieve good generalization by making accurate predictions for new test data that is not known during learning.

Choosing the values of parameters that minimize the loss function on the training data may not be the best option.

We would like to model the true regularities in the data and ignore the noise in the data.

Regularization

Add a penalty on the L2 norm of the weights to the error function:

$$\widetilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n^T\mathbf{w} - t_n)^2 + \frac{\lambda}{2}\lVert\mathbf{w}\rVert^2$$

How to Find (Learn) Optimal w

$$\mathbf{w}^*_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}$$

Least squares gives $(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$; regularized least squares adds the $\lambda\mathbf{I}$ term.
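The ridge formula above can be exercised directly; a minimal sketch on illustrative data, comparing against the unregularized exact interpolant:

```python
import numpy as np

# Ridge (regularized least-squares): w_ridge = (X^T X + lam*I)^{-1} X^T t.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)

M = 9                                         # high-order polynomial, N = M + 1
X = np.vander(x, M + 1, increasing=True)
lam = 1e-3
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(M + 1), X.T @ t)
w_exact = np.linalg.solve(X, t)               # exact interpolation (no penalty)

# The penalty keeps the coefficient vector much smaller than the interpolant's.
print(np.linalg.norm(w_ridge), np.linalg.norm(w_exact))
```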

Optimal w : Derivation

$$\begin{aligned} \widetilde{E}(\mathbf{w}) &= \frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n^T\mathbf{w} - t_n)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \\ &= \frac{1}{2}(\mathbf{X}\mathbf{w} - \mathbf{t})^T(\mathbf{X}\mathbf{w} - \mathbf{t}) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \\ &= \frac{1}{2}\left(\mathbf{w}^T\mathbf{X}^T\mathbf{X}\mathbf{w} - 2\,\mathbf{t}^T\mathbf{X}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right) + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w} \end{aligned}$$

Setting the gradient to zero:

$$\frac{\partial \widetilde{E}}{\partial\mathbf{w}} = \mathbf{X}^T\mathbf{X}\mathbf{w} - \mathbf{X}^T\mathbf{t} + \lambda\mathbf{w} = \mathbf{0} \quad\Rightarrow\quad (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})\mathbf{w} = \mathbf{X}^T\mathbf{t} \quad\Rightarrow\quad \mathbf{w}^*_{ridge} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{t}$$

Geometric Interpretation of Regularization

(As $\lambda$ increases, the weights are pulled toward zero.)

How to Choose the Regularization Parameter

Cross Validation
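The cross-validation procedure for selecting the regularization parameter can be sketched as follows; the data, fold count, and $\lambda$ grid are all illustrative:

```python
import numpy as np

# Choosing lambda by K-fold cross validation on a ridge-regularized model.
rng = np.random.default_rng(3)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
X = np.vander(x, 10, increasing=True)        # 9th-order polynomial features

def cv_error(lam, K=5):
    """Mean held-out squared error over K folds for a given lambda."""
    idx = rng.permutation(len(t))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        w = np.linalg.solve(X[trn].T @ X[trn] + lam * np.eye(X.shape[1]),
                            X[trn].T @ t[trn])
        errs.append(np.mean((X[val] @ w - t[val]) ** 2))
    return np.mean(errs)

lams = [1e-8, 1e-5, 1e-3, 1e-1, 10.0]
best = min(lams, key=cv_error)
print(best)
```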

Summary

Regression : line fitting
$\mathbf{X}$ is an $N \times (d+1)$ matrix; $\mathbf{w}$ is a $(d+1)$-dimensional vector.
Find (learn) the optimal $\mathbf{w}$ by minimizing the error.

Regression : curve fitting

$$y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$$

It is necessary to choose $M$ : cross validation.
It is necessary to choose $\lambda$ : cross validation.
What if $x$ is not 1-dimensional?

Linear Basis Function Models

Generalize the polynomial model by replacing powers of $x$ with fixed basis functions:

$$y(x, \mathbf{w}) = w_0\phi_0(x) + w_1\phi_1(x) + \cdots + w_{M-1}\phi_{M-1}(x) = \sum_{j=0}^{M-1} w_j\,\phi_j(x) = \mathbf{w}^T\boldsymbol{\phi}(x)$$

Curve fitting is the special case of polynomial basis functions $\phi_i(x) = x^i$:

$$y(x, \mathbf{w}) = w_0 + w_1 x + \cdots + w_{M-1}x^{M-1} = \sum_{j=0}^{M-1} w_j x^j$$

Linear Basis Function Models

For multi-dimensional inputs, use a multi-index $j = (j_1, j_2, \ldots, j_d)$. Example with 3-dimensional input and $M = 3$ polynomial basis functions:

$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_{1,1}x_1 + w_{1,2}x_2 + w_{1,3}x_3 + w_{2,1}x_1x_2 + w_{2,2}x_1x_3 + w_{2,3}x_2x_3 + w_{3,1}x_1^2 + w_{3,2}x_2^2 + w_{3,3}x_3^2 = \mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})$$

Approximately $(M-1)^D$ basis functions and weights are required for a D-dimensional input. Later, we consider only 1-dimensional input.

Popular Basis Functions

Basis Function : Gaussian

$$\phi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$$

Curve fitting with Gaussian basis functions proceeds exactly as before, with the design matrix built from the $\phi_j$ instead of powers of $x$.
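A minimal sketch of least-squares fitting with Gaussian basis functions; the centers, width, and data are illustrative choices, not values from the slides:

```python
import numpy as np

# Gaussian basis regression: phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)).
rng = np.random.default_rng(4)
x = np.linspace(0, 1, 25)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(25)

mu = np.linspace(0, 1, 9)                      # basis centers
s = 0.15                                       # common width
Phi = np.exp(-(x[:, None] - mu[None, :]) ** 2 / (2 * s ** 2))
Phi = np.column_stack([np.ones(len(x)), Phi])  # include the bias phi_0(x) = 1

w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
mse = np.mean((Phi @ w - t) ** 2)
print(mse)
```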

Regularization (Identical to the Previous Case)

Other Regularization

More generally, the penalty can take the form $\sum_{j=0}^{M-1}|w_j|^q$; the case $q = 2$ recovers the quadratic regularizer above.

ML-based Probabilistic Linear Model Regression

Probabilistic Perspective

Target values (observations) are often noisy. Model the target as the deterministic prediction plus Gaussian noise:

$$t = y(x, \mathbf{w}) + \epsilon, \qquad \epsilon \sim \mathcal{N}(0, \beta^{-1}), \qquad y(x, \mathbf{w}) = w_0 + \sum_{j=1}^{M} w_j x^j$$

so that $t$ is a scalar and $p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(x, \mathbf{w}), \beta^{-1})$.

Maximum Likelihood Estimation (MLE)

$$\mathbf{w}_{ML} = \arg\max_{\mathbf{w}} \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta\,(y(x_n, \mathbf{w}) - t_n)^2}{2}\right)$$

Maximizing the likelihood turns out to be equivalent to minimizing the least-squares loss function.

Derivation

Using $\ln\prod_{n=1}^{N} x_n = \sum_{n=1}^{N} \ln x_n$:

$$\begin{aligned} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) &= \ln \prod_{n=1}^{N} \frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta\,(y(x_n,\mathbf{w}) - t_n)^2}{2}\right) \\ &= \sum_{n=1}^{N}\left[\ln\frac{1}{\sqrt{2\pi\beta^{-1}}} - \frac{\beta\,(y(x_n,\mathbf{w}) - t_n)^2}{2}\right] \\ &= -\frac{\beta}{2}\sum_{n=1}^{N}(y(x_n,\mathbf{w}) - t_n)^2 + \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi \end{aligned}$$

Maximum Likelihood Estimation (MLE)

$$\arg\max_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) = \arg\min_{\mathbf{w}}\left[-\ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\right]$$

so maximizing the log-likelihood is equivalent to minimizing the least-squares loss function, and

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) \propto \mathbf{X}^T(\mathbf{t} - \mathbf{X}\mathbf{w}) = \mathbf{0} \quad\Rightarrow\quad \mathbf{w}_{ML} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{t}$$
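The ML solution also yields an estimate of the noise precision: maximizing the log-likelihood over $\beta$ gives $1/\beta_{ML}$ equal to the mean squared residual. A minimal sketch on illustrative data:

```python
import numpy as np

# Maximum-likelihood estimates for w and the noise precision beta.
rng = np.random.default_rng(5)
x = rng.uniform(-1, 1, 200)
t = 1.0 + 2.0 * x + 0.3 * rng.standard_normal(200)   # true noise std 0.3

X = np.column_stack([np.ones_like(x), x])
w_ml = np.linalg.solve(X.T @ X, X.T @ t)             # same as least squares
beta_ml = 1.0 / np.mean((X @ w_ml - t) ** 2)         # 1/beta = mean sq. residual

print(w_ml, 1.0 / np.sqrt(beta_ml))                  # noise std estimate
```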

Linear Basis Function Models

With basis functions, replace $\mathbf{X}$ by the design matrix $\boldsymbol{\Phi}$ with entries $\Phi_{nj} = \phi_j(x_n)$:

$$\nabla_{\mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta) \propto \boldsymbol{\Phi}^T(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) = \mathbf{0} \quad\Rightarrow\quad \mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

Predictive Distribution

$$p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1})$$

Bayesian Linear Model Regression

Bayes' Theorem

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$

$P(Y \mid X)$ : likelihood; $P(X)$ : prior probability distribution; $P(X \mid Y)$ : posterior probability distribution; $P(Y)$ : evidence (marginal distribution).

Discrete case: $P(Y) = \sum_{x \in X} P(Y, X = x) = \sum_{x \in X} P(Y \mid x)\,P(x)$.

Equivalently $P(X \mid Y) \propto P(Y \mid X)\,P(X)$ with normalization constant $\frac{1}{P(Y)}$; the product $P(Y \mid X)\,P(X)$ by itself might not be a probability distribution.

Normalization

$$P(X \mid Y = y) = \frac{P(Y = y \mid X)\,P(X)}{P(Y = y)}, \qquad P(Y = y) = \sum_{x \in X} P(Y = y \mid x)\,P(x)$$

Example: A test will produce 99% true positive results for drug users and 99% true negative results for non-drug users. Suppose that 0.5% of people are users of the drug. If a randomly selected individual tests positive, what is the probability he or she is a user?

If the domain of X is extremely large or high-dimensional, computing the normalizing sum is intractable.
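The drug-test example above works out directly with Bayes' theorem:

```python
# Bayes' theorem applied to the drug-test example from the slides.
p_user = 0.005                  # prior P(user) = 0.5%
p_pos_user = 0.99               # P(positive | user)
p_pos_clean = 1 - 0.99          # P(positive | non-user) = false-positive rate

# Evidence P(positive), summing over both cases of the hidden variable.
p_pos = p_pos_user * p_user + p_pos_clean * (1 - p_user)
p_user_pos = p_pos_user * p_user / p_pos    # posterior P(user | positive)

print(round(p_user_pos, 4))     # ≈ 0.3322 — only about 33%, despite a 99% accurate test
```

The low posterior comes from the small prior: true positives (0.99 × 0.005) are outnumbered by false positives (0.01 × 0.995).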

Applying Bayes' Theorem to Learning

$$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\,p(\mathbf{w})}{p(D)}$$

Possible Solutions

Conjugate Prior

If the posterior distributions $p(\theta \mid x)$ are in the same family as the prior probability distribution $p(\theta)$, the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function $p(x \mid \theta)$.

$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)} \propto P(Y \mid X)\,P(X)$$

With a conjugate prior we don't need to calculate $P(Y)$.

Most Probable Prediction

Prediction (classification) by MAP: output the value $v \in V$ with the highest posterior probability.

Predictive Distribution

Bayesian Estimation

Rather than a point estimate, treat $\mathbf{w}$ as a random variable: place a Gaussian prior on $\mathbf{w}$ and compute its posterior given $\mathbf{t}$ and $\mathbf{x}$, which is Gaussian with some mean $\mathbf{m}$ and covariance $\mathbf{S}$.

Normal (Gaussian) Distribution

If X is a normally distributed variable with mean $\mu$ and variance $\sigma^2$:

$$\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

with normalization factor $\frac{1}{\sqrt{2\pi\sigma^2}}$.

Multivariate Normal (Gaussian) Distribution

$$\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{k/2}|\boldsymbol{\Sigma}|^{1/2}}\exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right)$$

$\boldsymbol{\mu}$ : k-dimensional mean vector; $\boldsymbol{\Sigma}$ : $k \times k$ covariance matrix. The symmetric covariance matrix must always be positive definite; $\boldsymbol{\Sigma}^{-1}$ is the precision matrix.

Bayesian Estimation

Assume $\alpha$ and $\beta$ are known. We seek the posterior $p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta)$, a multivariate Gaussian over $\mathbf{w}$ with mean $\mathbf{m}$ and covariance $\mathbf{S}$, while the likelihood of each individual target is univariate Gaussian.

Likelihood of Multiple Univariate Gaussian Variables

If $d_1 \sim \mathcal{N}(\mu_1, \sigma_1^2)$ and $d_2 \sim \mathcal{N}(\mu_2, \sigma_2^2)$ are independent, their joint likelihood is a multivariate Gaussian with diagonal covariance:

$$p\big((d_1, d_2) \mid \mu_1, \mu_2, \sigma_1, \sigma_2\big) = p(d_1)\,p(d_2) = \mathcal{N}\!\left(\begin{pmatrix} d_1 \\ d_2 \end{pmatrix} \,\middle|\, \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{pmatrix}\right)$$

Bayesian Estimation

Prior: $p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$, and we will take $\mathbf{m}_0 = \mathbf{0}$, $\mathbf{S}_0 = \alpha^{-1}\mathbf{I}$. Likelihood: $p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)$. By Bayes' theorem, using the independence of $\mathbf{w}$, $\mathbf{X}$, $\alpha$, and $\beta$ in the factorization:

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = \frac{p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \alpha)}{\int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \alpha)\,d\mathbf{w}} \propto p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \alpha)$$

Posterior Distribution : Derivation

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) \propto \underbrace{\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)}_{\text{likelihood: univariate Gaussians}} \times \underbrace{\frac{1}{(2\pi)^{M/2}|\alpha^{-1}\mathbf{I}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}\right)}_{\text{prior: multivariate Gaussian}}$$

using $\exp(a)\exp(b) = \exp(a + b)$ to combine the exponents.

Posterior Distribution : Derivation

Combining the exponents (dropping constants):

$$-\frac{\beta}{2}(\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})^T(\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}) - \frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w} = -\frac{\beta}{2}\left(\mathbf{w}^T\boldsymbol{\Phi}^T\boldsymbol{\Phi}\mathbf{w} - 2\,\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} + \mathbf{t}^T\mathbf{t}\right) - \frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}$$

using $\mathbf{w}^T\boldsymbol{\Phi}^T\mathbf{t} = \mathbf{t}^T\boldsymbol{\Phi}\mathbf{w}$ (a scalar equals its own transpose, since $(\mathbf{A}\mathbf{A}^T)^T = \mathbf{A}\mathbf{A}^T$-style symmetry applies).

Posterior Distribution : Derivation

Completing the square: we are given a quadratic form $Q(\mathbf{x}) = \mathbf{x}^T\mathbf{A}\mathbf{x} + \cdots$ defining the exponent terms in a Gaussian distribution, and we need to determine the corresponding mean and covariance. A Gaussian exponent has the form

$$-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x} - \boldsymbol{\mu}) + const$$

where $const$ denotes terms independent of $\mathbf{x}$, and we have made use of the symmetry of $\boldsymbol{\Sigma}$.

Posterior Distribution : Derivation

Collect the quadratic and linear terms in $\mathbf{w}$:

$$-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} + \beta\,\mathbf{t}^T\boldsymbol{\Phi}\mathbf{w} - \frac{\beta}{2}\mathbf{t}^T\mathbf{t}$$

Define

$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}, \qquad \mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$$

(using the symmetry of $\mathbf{S}_N^{-1}$), so the exponent becomes

$$-\frac{1}{2}\mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{m}_N - \frac{\beta}{2}\mathbf{t}^T\mathbf{t}$$

Posterior Distribution : Derivation

Completing the square:

$$-\frac{1}{2}\mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{w} + \mathbf{w}^T\mathbf{S}_N^{-1}\mathbf{m}_N - \frac{\beta}{2}\mathbf{t}^T\mathbf{t} = -\frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^T\mathbf{S}_N^{-1}(\mathbf{w} - \mathbf{m}_N) + \frac{1}{2}\mathbf{m}_N^T\mathbf{S}_N^{-1}\mathbf{m}_N - \frac{\beta}{2}\mathbf{t}^T\mathbf{t}$$

The terms not involving $\mathbf{w}$ are constants, so

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$
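The posterior formulas $\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ and $\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$ translate directly into NumPy; the data, $\alpha$, and $\beta$ below are illustrative:

```python
import numpy as np

# Bayesian linear-regression posterior over w (prior N(0, I/alpha)).
rng = np.random.default_rng(6)
x = np.linspace(0, 1, 15)
t = 1.0 + 2.0 * x + 0.2 * rng.standard_normal(15)

Phi = np.column_stack([np.ones_like(x), x])   # design matrix: bias + x
alpha, beta = 2.0, 25.0                       # prior precision, noise precision

S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
S_N = np.linalg.inv(S_N_inv)
m_N = beta * S_N @ Phi.T @ t                  # posterior mean

print(m_N)   # pulled slightly from the ML estimate toward the zero prior mean
```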

Posterior : Mean and Variance

Meaning of $\boldsymbol{\Phi}^T$:

$$\boldsymbol{\Phi}^T = \begin{pmatrix} \phi_0(x_1) & \phi_0(x_2) & \cdots & \phi_0(x_N) \\ \phi_1(x_1) & \phi_1(x_2) & \cdots & \phi_1(x_N) \\ \vdots & & & \vdots \\ \phi_{M-1}(x_1) & \phi_{M-1}(x_2) & \cdots & \phi_{M-1}(x_N) \end{pmatrix}$$

so the entries of $\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ are sums over the data points:

$$(\boldsymbol{\Phi}^T\boldsymbol{\Phi})_{ij} = \sum_{n=1}^{N}\phi_i(x_n)\,\phi_j(x_n)$$

Posterior : Mean and Variance

The posterior is the product of two multivariate Gaussians over $\mathbf{w}$. The prior has precision matrix $\alpha\mathbf{I}$ (recall $\mathrm{cov}[\mathbf{w}] = E[\mathbf{w}\mathbf{w}^T] - E[\mathbf{w}]E[\mathbf{w}]^T = \alpha^{-1}\mathbf{I}$). The likelihood, rewritten in a different form as a distribution over $\mathbf{w}$:

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n \mid \mathbf{w}^T\boldsymbol{\phi}(x_n), \beta^{-1}\big) \;\propto\; \mathcal{N}\big(\mathbf{w} \mid \mathbf{w}_{ML},\; \beta^{-1}(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\big)$$

with $\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$, so its precision matrix is $\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$. The posterior is then the Gaussian with mean vector $\mathbf{m}_N$ and precision matrix $\mathbf{S}_N^{-1}$.

Posterior : Mean and Variance

$$\mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$$

The more instances we have seen, the larger the posterior precision becomes: $\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}$ grows with N, while $\alpha\mathbf{I}$ stays fixed.

$$\mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = (\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\beta\boldsymbol{\Phi}^T\mathbf{t}$$

With $\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$, we can write $\mathbf{m}_N = \mathbf{S}_N(\alpha\mathbf{I}\cdot\mathbf{0} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}) = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$, and as $\alpha \to 0$, $\mathbf{m}_N \to \mathbf{w}_{ML}$.

Bayesian Linear Regression : General Form

$$\mathbf{w}_{MAP} = \mathbf{m}_N$$

The posterior mean is also the MAP estimate, since the mode of a Gaussian is its mean.

Posterior : Meaning of Mean

For a general prior $\mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$,

$$\mathbf{m}_N = (\mathbf{S}_0^{-1} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\left(\mathbf{S}_0^{-1}\mathbf{m}_0 + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi}\,\mathbf{w}_{ML}\right), \qquad \mathbf{w}_{ML} = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

This mixes between the sample mean (ML estimate) and the prior mean:
The higher the precision of the prior, the less we believe the sample mean.
The higher the precision of the instances, the more we believe the sample mean (the more instances we have seen, the more we believe the sample mean).

Effect of Varying Covariance of Prior

(Posterior $p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta)$ under different prior covariances.)

MAP Estimation

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) \propto \prod_{n=1}^{N} p\big(t_n \mid \boldsymbol{\phi}(x_n), \mathbf{w}, \beta\big)\; p(\mathbf{w} \mid \alpha)$$

$$\ln p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = \sum_{n=1}^{N}\ln p\big(t_n \mid \boldsymbol{\phi}(x_n), \mathbf{w}, \beta\big) + \ln p(\mathbf{w} \mid \alpha) + const$$

The likelihood term:

$$\sum_{n=1}^{N}\ln\left[\sqrt{\frac{\beta}{2\pi}}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)\right] = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi - \frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2$$

MAP Estimation

The prior term (with $\alpha\mathbf{I}$ diagonal, so $|\alpha^{-1}\mathbf{I}|^{1/2} = \alpha^{-M/2}$):

$$\ln\left[\frac{1}{(2\pi)^{M/2}|\alpha^{-1}\mathbf{I}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}\right)\right] = -\frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + \frac{M}{2}\ln\alpha - \frac{M}{2}\ln 2\pi$$

Ignoring terms that do not depend on $\mathbf{w}$:

$$\ln p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = -\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2 - \frac{\alpha}{2}\mathbf{w}^T\mathbf{w} + const$$

MAP Estimation = Regularized Least Squares

$$-\ln p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) \propto \frac{1}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2 + \frac{\lambda}{2}\mathbf{w}^T\mathbf{w}, \qquad \lambda = \frac{\alpha}{\beta}$$

which is exactly regularized least squares. From the Bayesian estimation,

$$\mathbf{w}_{MAP} = \mathbf{m}_N = \beta\,\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}, \quad \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} \quad\Rightarrow\quad \mathbf{m}_N = \left(\frac{\alpha}{\beta}\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right)^{-1}\boldsymbol{\Phi}^T\mathbf{t} = (\lambda\mathbf{I} + \boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$
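The identity MAP = ridge with $\lambda = \alpha/\beta$ can be checked numerically; a minimal sketch on illustrative data:

```python
import numpy as np

# Check that the MAP estimate equals the ridge solution with lam = alpha/beta.
rng = np.random.default_rng(7)
Phi = np.column_stack([np.ones(12), np.linspace(0, 1, 12)])
t = Phi @ np.array([1.0, 2.0]) + 0.2 * rng.standard_normal(12)

alpha, beta = 1.0, 25.0
lam = alpha / beta

S_N_inv = alpha * np.eye(2) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ t)              # MAP = posterior mean
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(2), Phi.T @ t)

print(np.allclose(m_N, w_ridge))   # True
```

Both solve the same linear system, since dividing $(\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{m}_N = \beta\boldsymbol{\Phi}^T\mathbf{t}$ by $\beta$ gives the ridge normal equations.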

ML vs. MAP

$$\mathbf{w}_{ML} = \arg\max_{\mathbf{w}}\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right) = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

$$p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{ML}), \beta_{ML}^{-1}\big)$$

$$\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}}\left[\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)\right]\frac{1}{(2\pi)^{M/2}|\alpha^{-1}\mathbf{I}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}\right) = (\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \lambda\mathbf{I})^{-1}\boldsymbol{\Phi}^T\mathbf{t}$$

$$p(t \mid x, \mathbf{w}_{MAP}, \beta_{ML}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{MAP}), \beta_{ML}^{-1}\big)$$

Bayesian Linear Regression

Linear Gaussian

If $\mathbf{x}$ is Gaussian and $\mathbf{y} = \mathbf{A}\mathbf{x} + \mathbf{b} = f(\mathbf{x})$ plus Gaussian noise (a linear-Gaussian model), then the marginal

$$p(\mathbf{y}) = \int p(\mathbf{y} \mid \mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$

is also a Gaussian distribution.

Example: if $p(\mathbf{y}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \mathbf{K})$ and $p(\mathbf{t} \mid \mathbf{y}) = \mathcal{N}(\mathbf{t} \mid \mathbf{y}, \sigma^2\mathbf{I})$, the marginal distribution is

$$p(\mathbf{t}) = \int p(\mathbf{t} \mid \mathbf{y})\,p(\mathbf{y})\,d\mathbf{y} = \mathcal{N}(\mathbf{t} \mid \mathbf{0}, \mathbf{K} + \sigma^2\mathbf{I})$$

Predictive Distribution

Likelihood: $p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid \mathbf{w}^T\boldsymbol{\phi}(x), \beta^{-1}\big)$. Posterior: $p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$.

This is a linear-Gaussian pair with $\mathbf{A} = \boldsymbol{\phi}(x)^T$, $\mathbf{b} = \mathbf{0}$, so the predictive distribution is Gaussian, with a variance that varies depending on the input:

$$p(t \mid x, \mathbf{t}, \mathbf{X}, \alpha, \beta) = \mathcal{N}\!\left(t \,\middle|\, \mathbf{m}_N^T\boldsymbol{\phi}(x),\; \frac{1}{\beta} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)\right)$$

Predictive Distribution : Derivation

Predictive distribution (fixed and known $\alpha$, $\beta$): marginalize $\mathbf{w}$ out against the posterior,

$$p(t \mid x, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\,p(\mathbf{w} \mid \mathbf{t})\,d\mathbf{w}$$

which follows from the sum and product rules together with the conditional-independence structure:

$$p(t \mid x, \mathbf{t}) = \int p(t, \mathbf{w} \mid x, \mathbf{t})\,d\mathbf{w} = \int \frac{p(t, \mathbf{w}, x, \mathbf{t})}{p(x, \mathbf{t})}\,d\mathbf{w} = \int p(t \mid x, \mathbf{w})\,p(\mathbf{w} \mid \mathbf{t})\,d\mathbf{w}$$
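The predictive mean $\mathbf{m}_N^T\boldsymbol{\phi}(x)$ and variance $\beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)$ can be computed directly; a minimal sketch on illustrative data showing how the variance grows away from the training inputs:

```python
import numpy as np

# Bayesian predictive distribution for a linear model with phi(x) = (1, x).
rng = np.random.default_rng(8)
x = np.linspace(0, 1, 15)
t = 1.0 + 2.0 * x + 0.2 * rng.standard_normal(15)
Phi = np.column_stack([np.ones_like(x), x])

alpha, beta = 2.0, 25.0
S_N = np.linalg.inv(alpha * np.eye(2) + beta * Phi.T @ Phi)
m_N = beta * S_N @ Phi.T @ t

def predict(x_star):
    phi = np.array([1.0, x_star])
    mean = m_N @ phi
    var = 1.0 / beta + phi @ S_N @ phi   # input-dependent predictive variance
    return mean, var

m_in, v_in = predict(0.5)    # inside the data range
m_out, v_out = predict(3.0)  # far outside: much larger variance
print(v_in < v_out)          # True
```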

Predictive Distribution: ML vs. Bayes

Summary : ML, MAP and Bayesian

Maximum Likelihood (ML):

$$\mathbf{w}_{ML} = \arg\max_{\mathbf{w}}\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right), \qquad p(t \mid x, \mathbf{w}_{ML}, \beta_{ML}) = \mathcal{N}\big(t \mid \mathbf{w}_{ML}^T\boldsymbol{\phi}(x), \beta_{ML}^{-1}\big)$$

Maximum a Posteriori (MAP):

$$\mathbf{w}_{MAP} = \arg\max_{\mathbf{w}}\left[\prod_{n=1}^{N}\frac{1}{\sqrt{2\pi\beta^{-1}}}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)\right]\frac{1}{(2\pi)^{(M+1)/2}|\alpha^{-1}\mathbf{I}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}\right), \qquad p(t \mid x, \mathbf{w}_{MAP}, \beta_{ML})$$

Bayesian:

$$p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta) \propto \left[\prod_{n=1}^{N}\exp\left(-\frac{\beta}{2}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)\right]\exp\left(-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}\right) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$

$$p(t \mid x, \mathbf{t}, \mathbf{X}, \alpha, \beta) = \int p(t \mid x, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \mathbf{t}, \mathbf{X}, \alpha, \beta)\,d\mathbf{w} = \mathcal{N}\big(t \mid \mathbf{m}_N^T\boldsymbol{\phi}(x),\; \beta^{-1} + \boldsymbol{\phi}(x)^T\mathbf{S}_N\boldsymbol{\phi}(x)\big)$$

Evidence Approximation

Fully Bayesian Predictive Distribution

In the fully Bayesian treatment, $\alpha$ and $\beta$ are no longer fixed and known (as they were for the predictive distribution above); they are integrated out as well.

Evidence Approximation

The marginal likelihood (evidence) of the hyperparameters is obtained by integrating out $\mathbf{w}$:

$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta)\,p(\mathbf{w} \mid \alpha)\,d\mathbf{w}$$

By Bayes' theorem with data $D$:

$$p(\mathbf{w} \mid D) = \frac{p(D \mid \mathbf{w})\,p(\mathbf{w})}{p(D)}$$

Posterior Distribution for Hyperparameter

$$p(\alpha, \beta \mid \mathbf{t}, \mathbf{X}) = \frac{p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)\,p(\alpha)\,p(\beta)}{\iint p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)\,p(\alpha)\,p(\beta)\,d\alpha\,d\beta} \propto p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)\,p(\alpha)\,p(\beta)$$

where the evidence for the hyperparameters again comes from integrating out $\mathbf{w}$:

$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \int p(\mathbf{t} \mid \mathbf{w}, \mathbf{X}, \beta)\,p(\mathbf{w} \mid \alpha)\,d\mathbf{w}$$

With relatively flat priors over $\alpha$ and $\beta$, maximizing $p(\alpha, \beta \mid \mathbf{t}, \mathbf{X})$ amounts to maximizing the marginal likelihood $p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta)$.

Evidence Approximation

$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{n=1}^{N}p(t_n \mid \mathbf{x}_n, \mathbf{w}, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\exp\left(-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n - \mathbf{w}^T\boldsymbol{\phi}(x_n)\big)^2\right)$$

$$p(\mathbf{w} \mid \alpha) = \frac{1}{(2\pi)^{M/2}|\alpha^{-1}\mathbf{I}|^{1/2}}\exp\left(-\frac{1}{2}\mathbf{w}^T(\alpha\mathbf{I})\mathbf{w}\right)$$

Combining, using $\exp(a)\exp(b) = \exp(a+b)$ and $\alpha\mathbf{I}$ diagonal:

$$p(\mathbf{t} \mid \mathbf{X}, \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2}\left(\frac{\alpha}{2\pi}\right)^{M/2}\int\exp\big(-E(\mathbf{w})\big)\,d\mathbf{w}, \qquad E(\mathbf{w}) = \frac{\beta}{2}\lVert\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\rVert^2 + \frac{\alpha}{2}\mathbf{w}^T\mathbf{w}$$

Slide 91/103: Evidence Approximation (result)

Carrying out the Gaussian integral (derivation on the following slides) gives the log marginal likelihood

$$\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - E(\mathbf{m}_N) + \frac{1}{2}\ln|\mathbf{S}_N| - \frac{N}{2}\ln 2\pi, \qquad \mathbf{S}_N = \mathbf{A}^{-1}$$

where $\mathbf{m}_N$ and $\mathbf{S}_N$ are the mean and covariance of the posterior over $\mathbf{w}$.
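As a sanity check (not part of the slides), this formula can be evaluated numerically and compared against an independent route: marginally, $\mathbf{t} \sim \mathcal{N}(\mathbf{0},\, \beta^{-1}\mathbf{I} + \alpha^{-1}\Phi\Phi^T)$. A minimal numpy sketch, where the random matrix `Phi` is only a stand-in for a real design matrix of basis functions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 4
Phi = rng.standard_normal((N, M))   # stand-in design matrix (rows: phi(x_n)^T)
t = rng.standard_normal(N)
alpha, beta = 0.5, 2.0

def log_evidence(Phi, t, alpha, beta):
    """ln p(t | X, alpha, beta) via the slide's formula (ln|S_N| = -ln|A|)."""
    N, M = Phi.shape
    A = alpha * np.eye(M) + beta * Phi.T @ Phi        # A = S_N^{-1}
    m_N = beta * np.linalg.solve(A, Phi.T @ t)        # posterior mean
    E_mN = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
    return (M / 2 * np.log(alpha) + N / 2 * np.log(beta) - E_mN
            - 0.5 * np.linalg.slogdet(A)[1] - N / 2 * np.log(2 * np.pi))

# Independent check: log density of t under N(0, beta^{-1} I + alpha^{-1} Phi Phi^T).
C = np.eye(N) / beta + Phi @ Phi.T / alpha
direct = -0.5 * (t @ np.linalg.solve(C, t)
                 + np.linalg.slogdet(C)[1] + N * np.log(2 * np.pi))

assert np.isclose(log_evidence(Phi, t, alpha, beta), direct)
```

The two routes agree to numerical precision, which is a useful regression test when implementing evidence maximization.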

Slide 92/103: Log Marginal Likelihood: Derivation (completing the square)

From the derivation of the posterior distribution, $E(\mathbf{w})$ is quadratic in $\mathbf{w}$, so it can be rewritten by completing the square (general pattern: $\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x}-\boldsymbol{\mu}) + \mathrm{const}$):

$$E(\mathbf{w}) = \mathrm{const} + \frac{1}{2}(\mathbf{w} - \mathbf{m}_N)^T \mathbf{A}\, (\mathbf{w} - \mathbf{m}_N)$$

with

$$\mathbf{A} = \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\Phi^T\Phi, \qquad \mathbf{m}_N = \beta\, \mathbf{S}_N \Phi^T \mathbf{t}, \qquad \mathrm{const} = \frac{\beta}{2}\, \mathbf{t}^T\mathbf{t} - \frac{1}{2}\, \mathbf{m}_N^T \mathbf{A}\, \mathbf{m}_N$$
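The completed-square identity $E(\mathbf{w}) = E(\mathbf{m}_N) + \frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T\mathbf{A}(\mathbf{w}-\mathbf{m}_N)$ holds for every $\mathbf{w}$, which makes it easy to verify numerically. A toy sketch (random stand-in design matrix, numpy assumed):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 15, 3
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)
alpha, beta = 1.0, 3.0

A = alpha * np.eye(M) + beta * Phi.T @ Phi      # A = S_N^{-1}
m_N = beta * np.linalg.solve(A, Phi.T @ t)      # m_N = beta S_N Phi^T t

def E(w):
    """E(w) = beta/2 ||t - Phi w||^2 + alpha/2 w^T w."""
    return beta / 2 * np.sum((t - Phi @ w) ** 2) + alpha / 2 * w @ w

E_mN = E(m_N)
# Completing the square: E(w) = E(m_N) + 1/2 (w - m_N)^T A (w - m_N) for any w.
for _ in range(5):
    w = rng.standard_normal(M)
    quad = 0.5 * (w - m_N) @ A @ (w - m_N)
    assert np.isclose(E(w), E_mN + quad)
```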

Slide 93/103: Log Marginal Likelihood: Derivation (Gaussian integral)

Substituting the completed square into the integral (using $\exp a \cdot \exp b = \exp(a+b)$) and pulling the constant out:

$$p(\mathbf{t} \mid X, \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \exp(-\mathrm{const}) \int \exp\!\left(-\frac{1}{2}(\mathbf{w}-\mathbf{m}_N)^T \mathbf{A}\, (\mathbf{w}-\mathbf{m}_N)\right) d\mathbf{w}$$

The remaining integral is an unnormalized Gaussian and equals $(2\pi)^{M/2}\, |\mathbf{A}|^{-1/2}$, so

$$p(\mathbf{t} \mid X, \alpha, \beta) = \left(\frac{\beta}{2\pi}\right)^{N/2} \left(\frac{\alpha}{2\pi}\right)^{M/2} \exp(-\mathrm{const})\, (2\pi)^{M/2}\, |\mathbf{A}|^{-1/2}$$

Slide 94/103: Log Marginal Likelihood: Derivation (taking logs)

$$\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi + \frac{M}{2}\ln\alpha - \frac{M}{2}\ln 2\pi - \mathrm{const} + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}|$$

The $\frac{M}{2}\ln 2\pi$ terms cancel:

$$\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - \mathrm{const} - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln 2\pi$$

which, with $\mathbf{S}_N = \mathbf{A}^{-1}$ (so $-\frac{1}{2}\ln|\mathbf{A}| = \frac{1}{2}\ln|\mathbf{S}_N|$), is the result quoted earlier.

Slide 95/103: Log Marginal Likelihood: Derivation (final form)

Writing the constant out explicitly, with $\mathrm{const} = \frac{\beta}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\mathbf{m}_N^T\mathbf{A}\,\mathbf{m}_N$:

$$\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{1}{2}\left(\beta\, \mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T \mathbf{A}\, \mathbf{m}_N\right) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln 2\pi$$
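The two expressions for $E(\mathbf{m}_N)$ used in this derivation, the defining form $\frac{\beta}{2}\|\mathbf{t}-\Phi\mathbf{m}_N\|^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$ and the "const" form $\frac{\beta}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\mathbf{m}_N^T\mathbf{A}\mathbf{m}_N$, can be checked against each other. A toy numpy sketch (random stand-in design matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 12, 3
Phi = rng.standard_normal((N, M))
t = rng.standard_normal(N)
alpha, beta = 0.7, 1.5

A = alpha * np.eye(M) + beta * Phi.T @ Phi
m_N = beta * np.linalg.solve(A, Phi.T @ t)

# Form 1: definition of E evaluated at the posterior mean.
form1 = beta / 2 * np.sum((t - Phi @ m_N) ** 2) + alpha / 2 * m_N @ m_N
# Form 2: the "const" appearing in the log marginal likelihood.
form2 = beta / 2 * t @ t - 0.5 * m_N @ A @ m_N

assert np.isclose(form1, form2)
```

The equality follows from $\Phi^T\mathbf{t} = \beta^{-1}\mathbf{A}\mathbf{m}_N$, so the cross term $\beta\,\mathbf{m}_N^T\Phi^T\mathbf{t}$ collapses to $\mathbf{m}_N^T\mathbf{A}\mathbf{m}_N$.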

Slide 96/103: Maximizing the Evidence (Bayesian learning: regression)

Setting the derivatives of $\ln p(\mathbf{t} \mid X, \alpha, \beta)$ with respect to $\alpha$ and $\beta$ to zero yields the re-estimation equations

$$\alpha = \frac{\gamma}{\mathbf{m}_N^T \mathbf{m}_N}, \qquad \gamma = \sum_{i=1}^{M} \frac{\lambda_i}{\alpha + \lambda_i}$$

$$\frac{1}{\beta} = \frac{1}{N - \gamma} \sum_{n=1}^{N} \left(t_n - \mathbf{m}_N^T \boldsymbol{\phi}(\mathbf{x}_n)\right)^2$$

where the $\lambda_i$ are the eigenvalues of $\beta\Phi^T\Phi$ and $\gamma$ measures the effective number of well-determined parameters.
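The quantity $\gamma$ can be computed directly from the eigenvalues; each term $\lambda_i/(\alpha+\lambda_i)$ lies in $[0,1)$, so $\gamma$ interpolates between 0 (prior dominates) and $M$ (data dominates). A toy sketch with a random stand-in design matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 25, 5
Phi = rng.standard_normal((N, M))
alpha, beta = 0.4, 2.0

# lambda_i are the eigenvalues of beta * Phi^T Phi (symmetric, so eigvalsh).
lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)
gamma = np.sum(lam / (alpha + lam))

assert 0.0 <= gamma <= M
# Limiting behavior: alpha -> 0 gives gamma -> M; alpha -> infinity gives gamma -> 0.
assert np.isclose(np.sum(lam / (1e-12 + lam)), M)
assert np.sum(lam / (1e12 + lam)) < 1e-6
```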

Slide 97/103: Maximizing the Evidence: Derivation (terms in $\alpha$)

Differentiating the log evidence with respect to $\alpha$, the first term gives

$$\frac{\partial}{\partial\alpha}\, \frac{M}{2}\ln\alpha = \frac{M}{2\alpha}$$

For the determinant term, recall $\mathbf{A} = \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\Phi^T\Phi$, where $\Phi^T\Phi$ is an $M \times M$ (square, symmetric) matrix. If $\beta\Phi^T\Phi$ has eigenvalues $\lambda_i$, then $\mathbf{A}$ has eigenvalues $\lambda_i + \alpha$, so $|\mathbf{A}| = \prod_{i=1}^{M}(\lambda_i + \alpha)$ and

$$\frac{\partial}{\partial\alpha}\ln|\mathbf{A}| = \sum_{i=1}^{M}\frac{1}{\lambda_i + \alpha}$$

Slide 98/103: Maximizing the Evidence: Derivation (the quadratic term)

Keeping only the $\alpha$-dependent terms of the log evidence:

$$\ln p(\mathbf{t} \mid X, \alpha, \beta) \;\supset\; \frac{M}{2}\ln\alpha + \frac{1}{2}\mathbf{m}_N^T\mathbf{A}\,\mathbf{m}_N - \frac{1}{2}\ln|\mathbf{A}|$$

Substituting $\mathbf{m}_N = \beta\mathbf{A}^{-1}\Phi^T\mathbf{t}$ into the middle term:

$$\frac{1}{2}\mathbf{m}_N^T\mathbf{A}\,\mathbf{m}_N = \frac{\beta^2}{2}\, \mathbf{t}^T\Phi\, \mathbf{A}^{-1}\Phi^T\mathbf{t}$$

Using $\frac{\partial}{\partial\alpha}\mathbf{A}^{-1} = -\mathbf{A}^{-1}\left(\frac{\partial\mathbf{A}}{\partial\alpha}\right)\mathbf{A}^{-1} = -\mathbf{A}^{-1}\mathbf{A}^{-1}$ (since $\partial\mathbf{A}/\partial\alpha = \mathbf{I}$) together with $(\mathbf{A}^{-1})^T = (\mathbf{A}^T)^{-1} = \mathbf{A}^{-1}$:

$$\frac{\partial}{\partial\alpha}\, \frac{1}{2}\mathbf{m}_N^T\mathbf{A}\,\mathbf{m}_N = -\frac{\beta^2}{2}\, \mathbf{t}^T\Phi\, \mathbf{A}^{-1}\mathbf{A}^{-1}\Phi^T\mathbf{t} = -\frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N$$

Slide 99/103: Maximizing the Evidence: Derivation (solving for $\alpha$)

Collecting the three derivatives and setting the result to zero:

$$\frac{\partial}{\partial\alpha}\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N - \frac{1}{2}\sum_{i=1}^{M}\frac{1}{\lambda_i + \alpha} = 0$$

Multiplying through by $2\alpha$ and rearranging:

$$\alpha\, \mathbf{m}_N^T\mathbf{m}_N = M - \sum_{i=1}^{M}\frac{\alpha}{\lambda_i + \alpha} = \sum_{i=1}^{M}\left(1 - \frac{\alpha}{\lambda_i + \alpha}\right) = \sum_{i=1}^{M}\frac{\lambda_i}{\lambda_i + \alpha} \equiv \gamma$$

so that

$$\alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}$$

Slide 100/103: Maximizing the Evidence: Derivation (iterative solution for $\alpha$)

$$\frac{\partial}{\partial\alpha}\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{M}{2\alpha} - \frac{1}{2}\mathbf{m}_N^T\mathbf{m}_N - \frac{1}{2}\sum_{i=1}^{M}\frac{1}{\lambda_i + \alpha} = 0 \quad\Longrightarrow\quad \alpha = \frac{\gamma}{\mathbf{m}_N^T\mathbf{m}_N}, \qquad \gamma = \sum_{i=1}^{M}\frac{\lambda_i}{\alpha + \lambda_i}$$

Note that this gives only an implicit solution for $\alpha$, as both $\gamma$ and $\mathbf{m}_N$ depend on $\alpha$.

Iterative procedure for finding the optimal $\alpha$:
- Start with an initial choice for $\alpha$.
- Use $\alpha$ to compute $\mathbf{m}_N = \beta\, \mathbf{S}_N \Phi^T \mathbf{t}$ and $\gamma$.
- Use $\mathbf{m}_N$ and $\gamma$ to re-estimate $\alpha = \gamma / (\mathbf{m}_N^T\mathbf{m}_N)$.
- Repeat until convergence.
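The iteration above can be sketched in numpy (a toy run, not from the slides; the random `Phi` stands in for a real design matrix, and $\beta$ is held fixed so the eigenvalues $\lambda_i$ need computing only once):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M = 30, 4
Phi = rng.standard_normal((N, M))
w_true = rng.standard_normal(M)
beta = 5.0                                       # noise precision, held fixed here
t = Phi @ w_true + rng.standard_normal(N) / np.sqrt(beta)

lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)     # independent of alpha
alpha = 1.0                                      # initial choice
for _ in range(100):
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)   # m_N = beta S_N Phi^T t
    gamma = np.sum(lam / (alpha + lam))
    alpha_new = gamma / (m_N @ m_N)              # re-estimation formula
    if abs(alpha_new - alpha) < 1e-10:           # converged
        break
    alpha = alpha_new

assert np.isfinite(alpha) and alpha > 0
```

In practice this fixed-point iteration converges in a handful of steps.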

Slide 101/103: Maximizing the Evidence: Derivation (terms in $\beta$)

Starting again from

$$\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta - \frac{1}{2}\left(\beta\, \mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T \mathbf{A}\, \mathbf{m}_N\right) - \frac{1}{2}\ln|\mathbf{A}| - \frac{N}{2}\ln 2\pi$$

with $\mathbf{A} = \mathbf{S}_N^{-1} = \alpha\mathbf{I} + \beta\Phi^T\Phi$ and $\mathbf{m}_N = \beta\,\mathbf{S}_N\Phi^T\mathbf{t}$. The eigenvalues $\lambda_i$, defined by $\beta\Phi^T\Phi\, \mathbf{u}_i = \lambda_i \mathbf{u}_i$, are proportional to $\beta$ (if $\lambda_i = \beta b_i$ then $d\lambda_i/d\beta = b_i = \lambda_i/\beta$). Since $\ln|\mathbf{A}| = \sum_{i=1}^{M}\ln(\lambda_i + \alpha)$,

$$\frac{\partial}{\partial\beta}\ln|\mathbf{A}| = \sum_{i=1}^{M}\frac{1}{\lambda_i + \alpha}\, \frac{d\lambda_i}{d\beta} = \frac{1}{\beta}\sum_{i=1}^{M}\frac{\lambda_i}{\lambda_i + \alpha} = \frac{\gamma}{\beta}$$

Slide 102/103: Maximizing the Evidence: Derivation (solving for $\beta$)

Writing the quadratic term in its explicit $E(\mathbf{m}_N)$ form,

$$\frac{1}{2}\left(\beta\, \mathbf{t}^T\mathbf{t} - \mathbf{m}_N^T\mathbf{A}\, \mathbf{m}_N\right) = \frac{\beta}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 + \frac{\alpha}{2}\mathbf{m}_N^T\mathbf{m}_N$$

Differentiating the log evidence with respect to $\beta$ and setting the result to zero:

$$\frac{\partial}{\partial\beta}\ln p(\mathbf{t} \mid X, \alpha, \beta) = \frac{N}{2\beta} - \frac{1}{2}\sum_{n=1}^{N}\left(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2 - \frac{\gamma}{2\beta} = 0$$

which rearranges to the re-estimation equation

$$\frac{1}{\beta} = \frac{1}{N - \gamma}\sum_{n=1}^{N}\left(t_n - \mathbf{m}_N^T\boldsymbol{\phi}(\mathbf{x}_n)\right)^2$$

As with $\alpha$, this is an implicit equation ($\gamma$ and $\mathbf{m}_N$ depend on $\beta$) and is solved by the same kind of iteration.
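In practice the two re-estimation formulas are interleaved in a single loop. A minimal joint sketch (toy data; the random `Phi`, seed, and tolerances are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(5)
N, M = 50, 6
Phi = rng.standard_normal((N, M))
w_true = rng.standard_normal(M)
t = Phi @ w_true + 0.3 * rng.standard_normal(N)   # true noise std 0.3

alpha, beta = 1.0, 1.0                            # initial choices
for _ in range(200):
    lam = np.linalg.eigvalsh(beta * Phi.T @ Phi)  # eigenvalues depend on beta
    A = alpha * np.eye(M) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(A, Phi.T @ t)
    gamma = np.sum(lam / (alpha + lam))
    alpha_new = gamma / (m_N @ m_N)
    beta_new = (N - gamma) / np.sum((t - Phi @ m_N) ** 2)
    converged = (abs(alpha_new - alpha) < 1e-9 and abs(beta_new - beta) < 1e-9)
    alpha, beta = alpha_new, beta_new
    if converged:
        break

# The recovered noise variance 1/beta should be on the order of 0.3^2 = 0.09.
assert np.isfinite(alpha) and np.isfinite(beta) and beta > 0
assert 0 < 1 / beta < 1.0
```

This selects both hyperparameters from the training data alone, with no validation set, which is the practical appeal of the evidence approximation.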

Slide 103/103: End