An Introduction to Algebraic Statistics
Mathias Drton
Department of Statistics, University of Chicago
January, 2010
‘Algebraic statistics’
Application and development of techniques in
Algebraic Geometry, Commutative Algebra, and Combinatorics
to address problems in Statistics.
Instrumental paper:
Diaconis, Persi; Sturmfels, Bernd. Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 26 (1998), no. 1, 363–397.
Applied-minded algebraists get involved with Statistics
(AMS meetings, SIAM activity group, . . . ).
Some literature
Pistone, Riccomagno & Wynn: Algebraic Statistics (Exp. Design)
Pachter & Sturmfels: Algebraic Statistics for Computational Biology
Gibilisco et al. (Eds.): Algebraic and Geometric Methods in Statistics
Viana & Richards (Eds.): Algebraic Methods in Statistics and Probability (2nd volume in prep.)
These lectures
Material from Chapters 1, 2 and 5 in
Drton, Sullivant & Sturmfels: Lectures on Algebraic Statistics
Chapter 3: Conditional independence; Graphical models
Chapter 4: Hidden variable models
Chapter 6: Worked exercises
Chapter 7: Open problems
Lectures
Lecture I: Markov Bases for Exact Inference in Contingency Tables
(Chapter 1 in lecture notes)
Lecture II: Likelihood Ratio Tests and Singularities
(Section 2.3 in lecture notes)
Lecture III: Bayesian Integrals
(Section 5.1 in lecture notes)
Part I
Markov Bases for Exact Inference in Contingency Tables
1 Fisher's exact test for 2 × 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Lecture outline
1 Fisher’s exact test for 2× 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Example: Cancer treatment
Surgery versus radiation treatment for cancer patients:

                       Cancer        Cancer Not
                       Controlled    Controlled    Total
  Surgery                  21             0          21
  Radiation therapy        15             3          18
  Total                    36             3          39
Disease outcome independent of treatment?
Chi-square test p-value = 0.1788
Fisher’s exact test p-value = 0.08929
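Both p-values are easy to reproduce numerically; a minimal sketch, assuming scipy is available (and assuming the slide's chi-square value uses Yates' continuity correction, scipy's default for 2 x 2 tables):

from scipy.stats import chi2_contingency, fisher_exact

table = [[21, 0], [15, 3]]   # cancer treatment table from above

# chi-square test of independence (with continuity correction)
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(p_chi2)                # ~ 0.1788

# Fisher's exact test (two-sided)
odds, p_fisher = fisher_exact(table)
print(p_fisher)              # ~ 0.08929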
Independence model
Two discrete/categorical random variables
X ∈ [r] := {1, 2, . . . , r} and Y ∈ [c] := {1, 2, . . . , c}

Joint and marginal probabilities:

pij = P(X = i, Y = j),   pi+ = P(X = i),   p+j = P(Y = j)

X and Y independent (X⊥⊥Y) iff

pij = pi+ p+j for all i ∈ [r], j ∈ [c]

or, equivalently, the matrix P = (pij) has rank 1.
Chi-square test of independence
Counts from n i.i.d. copies of (X, Y):

Uij = Σ_{k=1}^n 1{X(k) = i, Y(k) = j},   i ∈ [r], j ∈ [c].

Contingency table U = (Uij) has a multinomial distribution:

P(U = u) = n! / (u11! u12! · · · urc!) · ∏_{i=1}^r ∏_{j=1}^c pij^{uij}.

Chi-square statistic (with expected counts ûij = Ui+ U+j / n):

X²(U) = Σ_{i=1}^r Σ_{j=1}^c (Uij − ûij)² / ûij  −→d  χ²_{(r−1)(c−1)} under H0, as n → ∞
Fisher’s exact test for 2× 2 table
Hypergeometric distribution:

If X⊥⊥Y, then

P(U11 = u11 | U1+ = u1+, U+1 = u+1) = (u1+ choose u11) (n − u1+ choose u+1 − u11) / (n choose u+1)

for u11 ∈ {max(0, u1+ + u+1 − n), . . . , min(u1+, u+1)}.
Exact test:
1 Choose a test statistic T(u)
  (e.g., X²(u), P(U11 = u11 | U1+ = u1+, U+1 = u+1), . . . )
2 P-value:

P(T(U) ≥ T(u) | U1+, U+1) = Σ_{v : T(v) ≥ T(u)} (u1+ choose v11) (n − u1+ choose u+1 − v11) / (n choose u+1)
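The exact test is small enough to spell out in code. A minimal sketch (not from the lecture notes), assuming scipy is available and taking T(v) to be the reciprocal of the hypergeometric probability, so that the test sums over tables at most as probable as the observed one:

from scipy.stats import hypergeom

def exact_p_value(u11, u12, u21, u22):
    n = u11 + u12 + u21 + u22
    r1, c1 = u11 + u12, u11 + u21        # margins u1+ and u+1
    H = hypergeom(n, r1, c1)             # U11 given the margins
    support = range(max(0, r1 + c1 - n), min(r1, c1) + 1)
    p_obs = H.pmf(u11)
    # sum P(V11 = v11 | margins) over all v with T(v) >= T(u)
    return sum(H.pmf(k) for k in support if H.pmf(k) <= p_obs + 1e-12)

print(exact_p_value(21, 0, 15, 3))       # ~ 0.08929, matching the slide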
Lecture outline
1 Fisher’s exact test for 2× 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Three-way table (Agresti, 2002)
White subjects were asked about:
(1) “Black children on school bus”, (2) “Black candidate for presidency”,
(3) “Black friend for dinner at home”
                          Home
President   Busing     Yes    No    ???

Yes         Yes         41    65      0
            No          71   157      1
            ???          1    17      0

No          Yes          2     5      0
            No           3    44      0
            ???          1     0      0

???         Yes          0     3      1
            No           0    10      0
            ???          0     0      1

??? = 'don't know'
Log-linear models
Discrete r.v. X1, . . . , Xm; Xℓ ∈ [rℓ]

State space: R = ∏_{ℓ=1}^m [rℓ]

Joint probability table: p = (pi | i ∈ R)

Probability simplex: ∆_{R−1}

Definition

Fix a matrix A ∈ Z^{d×R} whose columns all sum to the same value. The log-linear model associated with A is the set of positive probability tables

MA = { p = (pi) ∈ int(∆_{R−1}) : log p = (log pi) ∈ rowspan(A) },

where rowspan(A) is the linear space spanned by the rows of A.
Example: Independence model
X, Y: two discrete r.v. with joint probabilities pij > 0

X⊥⊥Y is equivalent to

log pij = log pi+ + log p+j = αi + βj,   i ∈ [r], j ∈ [c].

Suppose r = 2 and c = 3. Then log p ∈ R^{2×3} is in the row span of the (r + c) × rc = 5 × 6 matrix

          11 12 13 21 22 23
    α1  (  1  1  1  0  0  0 )
    α2  (  0  0  0  1  1  1 )
A = β1  (  1  0  0  1  0  0 )
    β2  (  0  1  0  0  1  0 )
    β3  (  0  0  1  0  0  1 )
Contingency tables
Based on n-sample, define m-way contingency table U:
Ui = Σ_{k=1}^n 1{X1(k) = i1, . . . , Xm(k) = im},   i = (i1, . . . , im) ∈ R
Let T (n) be the space of non-neg integer tables summing to n.
Definition
We call the vector Au the minimal sufficient statistic for the model MA, and the set of tables

F(u) = { v ∈ N^R : Av = Au }

is the fiber of a contingency table u ∈ T(n) with respect to the model MA.
Example: Independence model
Let u be an r × c table.
For the matrix A encoding the independence model X⊥⊥Y :
Au = ( u·+ ; u+· ),

where u·+ and u+· are the vectors of row and column sums of the table u.
If r = 2 and c = 3:

     ( 1 1 1 0 0 0 )   ( u11 )     ( u1+ )
     ( 0 0 0 1 1 1 )   ( u12 )     ( u2+ )
Au = ( 1 0 0 1 0 0 ) · ( u13 )  =  ( u+1 )
     ( 0 1 0 0 1 0 )   ( u21 )     ( u+2 )
     ( 0 0 1 0 0 1 )   ( u22 )     ( u+3 )
                       ( u23 )
Hierarchical models
Conditional independence:
X1 and X2 conditionally independent given X3 if
P(X1 = i ,X2 = j |X3 = k) = P(X1 = i |X3 = k)P(X2 = j |X3 = k).
Equivalent to matrices Pk = (pijk) having rank at most 1 for all k.
Log-linear formulation:

log pijk = α^{(13)}_{ik} + α^{(23)}_{jk}

No three-way interaction:

log pijk = α^{(12)}_{ij} + α^{(13)}_{ik} + α^{(23)}_{jk}
Conditional inference
Lemma
If p = e^{Aᵀα} ∈ MA and u ∈ T(n), then

P(U = u) = n! / (∏_{i∈R} ui!) · e^{αᵀ(Au)}.

Corollary

The conditional distribution is multivariate hypergeometric:

P(U = u | AU = Au) = [ 1 / ∏_{i∈R} ui! ] / [ Σ_{v∈F(u)} 1 / ∏_{i∈R} vi! ],

and does not depend on p.
Exact test
Consider the hypothesis testing problem

H0 : p ∈ MA versus H1 : p ∉ MA.
Maximum likelihood estimates p̂i

Expected counts ûi = n p̂i (same for all tables in a fiber F(u))

Chi-square statistic

X²(U) = Σ_{i∈R} (Ui − ûi)² / ûi

Exact p-value: P(X²(U) ≥ X²(u) | AU = Au)
Markov chain Monte Carlo
Exact p-value is equal to

[ Σ_{v∈F(u)} 1{X²(v) ≥ X²(u)} / ∏_{i∈R} vi! ] / [ Σ_{v∈F(u)} 1 / ∏_{i∈R} vi! ].

For larger counts or tables it is prohibitive to sum over the entire fiber.

Approximate the p-value by Markov chain Monte Carlo algorithms for sampling tables from the conditional distribution.

With prob. 1, MCMC yields a sequence of tables vt ∈ F(u) such that the proportion of tables with X²(vt) ≥ X²(u) converges to the p-value.

Problem

For an irreducible Metropolis-Hastings sampler, find a finite set of moves that connects any two tables in any fiber.
Lecture outline
1 Fisher’s exact test for 2× 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Markov basis – Definition
Log-linear model MA associated with matrix A
Integer kernel kerZ(A)
Definition
A finite subset B ⊂ kerZ(A) is a Markov basis for MA if for all u ∈ T(n) and all pairs v, v′ ∈ F(u) there exists a sequence u1, . . . , uL ∈ B such that

v′ = v + Σ_{k=1}^L uk   and   v + Σ_{k=1}^l uk ≥ 0 for all l = 1, . . . , L.

The elements of the Markov basis are called moves.
Metropolis-Hastings algorithm
Input: Contingency table u; Markov basis B for the model MA.
Output: Sequence (X²(vt))_{t=1}^∞ for tables vt in the fiber F(u).

Step 1: Initialize v1 = u.

Step 2: For t = 1, 2, . . . repeat the following steps:

(i) Select uniformly at random a move ut ∈ B.
(ii) If min(vt + ut) < 0, then set vt+1 = vt; else set vt+1 = vt + ut with probability q and vt+1 = vt with probability 1 − q, where

q = min{ 1, P(U = vt + ut | AU = Au) / P(U = vt | AU = Au) }.

(iii) Compute X²(vt).
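A minimal numpy sketch of this sampler (an illustration, not the lecture's reference code); it assumes tables are flattened non-negative integer vectors and that B contains one representative per ± pair of moves, with the sign drawn at random:

import numpy as np
from math import lgamma

def log_hypergeom(v):
    # log of the unnormalized conditional probability 1 / prod_i v_i!
    return -sum(lgamma(x + 1) for x in v)

def mh_sampler(u, B, steps, seed=0):
    rng = np.random.default_rng(seed)
    v = np.array(u, dtype=int)
    fiber_walk = []
    for _ in range(steps):
        b = B[rng.integers(len(B))] * rng.choice((-1, 1))
        w = v + b
        if w.min() >= 0:                         # otherwise stay at v
            log_q = log_hypergeom(w) - log_hypergeom(v)
            if np.log(rng.random()) < min(0.0, log_q):
                v = w
        fiber_walk.append(v.copy())
    return fiber_walk

The proportion of sampled tables vt with X²(vt) ≥ X²(u) then approximates the exact p-value.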
Markov basis for independence model
Let eij denote the r × c table with a single entry 1 in row i and column j, and 0s elsewhere.

Proposition

The (unique minimal) Markov basis for the independence model MX⊥⊥Y consists of the following 2 · (r choose 2) · (c choose 2) moves, each having one-norm 4:

B = { ±(eij + ekl − eil − ekj) : 1 ≤ i < k ≤ r, 1 ≤ j < l ≤ c }.
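A hedged sketch generating these moves for an r × c table (positive representatives only; the mh_sampler sketch above draws the sign), with one usage line on the cancer table:

import numpy as np
from itertools import combinations

def independence_moves(r, c):
    # one move e_ij + e_kl - e_il - e_kj per pair of rows and columns
    moves = []
    for i, k in combinations(range(r), 2):
        for j, l in combinations(range(c), 2):
            b = np.zeros((r, c), dtype=int)
            b[i, j] = b[k, l] = 1
            b[i, l] = b[k, j] = -1
            moves.append(b.ravel())
    return moves

walk = mh_sampler([21, 0, 15, 3], independence_moves(2, 2), steps=10000)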
Independence model – Proof
Idea: Show that we can use elements of B to bring any two distinct tables in the same fiber closer to one another.

Claim: Given v ≠ u, v ∈ F(u), show that there is b ∈ B such that (i) u + b ≥ 0 and (ii) ‖u − v‖1 > ‖u + b − v‖1.

Proof: Recall that Au yields the row and column sums:

(a) Since u ≠ v and Au = Av, there is at least one positive entry in u − v. WLOG, u11 − v11 > 0.
(b) Since Au = Av, there is a negative entry in the first row of u − v. WLOG, u12 − v12 < 0.
(c) Similarly, u22 − v22 > 0.
(d) Let b = e12 + e21 − e11 − e22. Then ‖u − v‖1 > ‖u + b − v‖1 and u + b ≥ 0, as desired.
Symbolic computation – 4ti2
Markov basis of the 'no 3-way interaction model' for a 2 × 2 × 2 table?

The matrix representing the model has format 12 × 8 (store in file no3way):

12 8
1 1 0 0 0 0 0 0
0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 0 1 1
1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0
0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1
1 0 0 0 1 0 0 0
0 1 0 0 0 1 0 0
0 0 1 0 0 0 1 0
0 0 0 1 0 0 0 1
Symbolic computation – 4ti2
Compute Markov basis (up to sign) using command markov no3way
Output in file no3way.mar:

1 8
1 -1 -1 1 -1 1 1 -1

The two moves

±(e111 + e122 + e212 + e221 − e112 − e121 − e211 − e222)

correspond to the quartic equation

p111 p122 p212 p221 = p112 p121 p211 p222

Recall: pijk ∝ θ^{(12)}_{ij} θ^{(13)}_{ik} θ^{(23)}_{jk}
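A quick numpy check (assuming the coordinate ordering 111, 112, . . . , 222 used in the file no3way) that the computed move indeed lies in the integer kernel of A:

import numpy as np

A = np.array([
    [1,1,0,0,0,0,0,0], [0,0,1,1,0,0,0,0], [0,0,0,0,1,1,0,0], [0,0,0,0,0,0,1,1],
    [1,0,1,0,0,0,0,0], [0,1,0,1,0,0,0,0], [0,0,0,0,1,0,1,0], [0,0,0,0,0,1,0,1],
    [1,0,0,0,1,0,0,0], [0,1,0,0,0,1,0,0], [0,0,1,0,0,0,1,0], [0,0,0,1,0,0,0,1],
])
move = np.array([1, -1, -1, 1, -1, 1, 1, -1])
assert (A @ move == 0).all()   # the sufficient statistic Au is preserved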
Polynomial algebra
Polynomial ring R[p] = R[p1, p2, . . . , pk]

For a non-negative integer table u = (u1, . . . , uk) ∈ N^k define the monomial

p^u = p1^{u1} p2^{u2} · · · pk^{uk}

For an integer table u = u+ − u− ∈ Z^k with positive and negative parts u+, u− ∈ N^k define the binomial

p^{u+} − p^{u−}

Example:

p = ( p11 p12 ; p21 p22 ),   u = ( 2 −2 ; −1 1 )   =⇒   p11² p22 − p12² p21
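The passage from an integer table to its binomial is mechanical; a small sympy sketch (illustration only) reproducing the example:

import sympy as sp

p11, p12, p21, p22 = sp.symbols('p11 p12 p21 p22')
p = (p11, p12, p21, p22)
u = (2, -2, -1, 1)                  # the example table, read row by row
plus = sp.prod([pi**max(ui, 0) for pi, ui in zip(p, u)])    # p^{u+}
minus = sp.prod([pi**max(-ui, 0) for pi, ui in zip(p, u)])  # p^{u-}
print(plus - minus)                 # p11**2*p22 - p12**2*p21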
Polynomial algebra
A subset I ⊂ R[p] is an ideal if

f, g ∈ I =⇒ f + g ∈ I
f ∈ I, h ∈ R[p] =⇒ hf ∈ I

Hilbert's basis theorem:
Every ideal I has a finite generating set f1, . . . , fm ∈ R[p], that is,

I = 〈 f1, . . . , fm 〉 = { Σ_{i=1}^m hi fi : h1, . . . , hm ∈ R[p] }
Fundamental theorem
Given a matrix A ∈ N^{d×k} for a log-linear model, define the (toric) ideal

IA := 〈 p^{u+} − p^{u−} : u ∈ kerZ(A) 〉 ⊂ R[p].

Theorem (Fundamental theorem of Markov bases)

A subset B of kerZ(A) is a Markov basis if and only if the corresponding set of binomials { p^{b+} − p^{b−} : b ∈ B } generates the ideal IA. In particular, a (finite) Markov basis always exists.
Example: Independence model for 2× 2 table
We have shown that a Markov basis (up to sign) is given by
b = ( 1 −1 ; −1 1 )

Hence, IA = I* := 〈 p11 p22 − p12 p21 〉

Example for IA ⊆ I*: Consider the tables

u = ( 4 1 ; 2 5 ),   v = ( 3 2 ; 3 4 ).

Since u − b = v, we have u − b+ = v − b− and thus

p11⁴ p12 p21² p22⁵ − p11³ p12² p21³ p22⁴ = p11³ p12 p21² p22⁴ (p11 p22 − p12 p21) ∈ I*
Computing Markov bases
Theorem
The ideal IA is a homogeneous ideal and its homogeneous elements are exactly the homogeneous polynomials f in R[p] that vanish on the log-linear model MA:

f(p) = 0 for all p ∈ MA.

For a matrix A = (aij) ∈ N^{d×k}, compute a Markov basis by eliminating the variables θ1, . . . , θd from the equation system

pj − θ1^{a1j} θ2^{a2j} · · · θd^{adj} = 0,   j = 1, . . . , k.

Software for Gröbner basis calculations: Macaulay2, Singular, 4ti2
Example: No 3-way interaction in 2× 2× 2 table
Equation system:
p111 = α11β11γ11, p112 = α11β12γ12,
p121 = α12β11γ21, p122 = α12β12γ22,
p211 = α21β21γ11, p212 = α21β22γ12,
p221 = α22β21γ21, p222 = α22β22γ22.
Variable elimination: Every relation among the pijk is a polynomial multiple of

p111 p122 p212 p221 − p112 p121 p211 p222

Markov basis:

±(e111 + e122 + e212 + e221 − e112 − e121 − e211 − e222)
Singular session
LIB "elim.lib";
ring R = 0,(p111,p112,p121,p122,p211,p212,p221,p222,
            a11,a12,a21,a22,b11,b12,b21,b22,c11,c12,c21,c22),dp;
ideal M = p111 - a11*b11*c11,
          p112 - a11*b12*c12,
          p121 - a12*b11*c21,
          p122 - a12*b12*c22,
          p211 - a21*b21*c11,
          p212 - a21*b22*c12,
          p221 - a22*b21*c21,
          p222 - a22*b22*c22;
eliminate(M, a11*a12*a21*a22*b11*b12*b21*b22*c11*c12*c21*c22);
Background reading
Cox, D.; Little, J.; O'Shea, D. Ideals, Varieties, and Algorithms. Springer, New York, 2007.
Database: http://mbdb.mis.mpg.de
Slim and long tables
Theorem
Let v ∈ Z^k be any integer vector. Then there are r2, r3 ∈ N and a coordinate projection π : Z^{3×r2×r3} → Z^k such that every minimal Markov basis for the no 3-way interaction model on 3 × r2 × r3 tables (X1 a r.v. with 3 states, X2 and X3 r.v. with r2 and r3 states, resp.) contains a table u with π(u) = v.
Theorem
Fix a set of interactions Γ for a hierarchical log-linear model, and fix r2, . . . , rm. There exists a number b(Γ, r2, . . . , rm) < ∞ such that the one-norms of the elements of any minimal Markov basis for Γ on s × r2 × · · · × rm tables are less than or equal to b(Γ, r2, . . . , rm). This bound is independent of s, which can grow large.
Exercise
Exercises 6.1 and 6.2 in the lecture notes
Perform an exact test for your favorite table
e.g. test 'no 3-way interaction' in the example from Agresti (2002) shown earlier:

                          Home
President   Busing     Yes    No    ???

Yes         Yes         41    65      0
            No          71   157      1
            ???          1    17      0

No          Yes          2     5      0
            No           3    44      0
            ???          1     0      0

???         Yes          0     3      1
            No           0    10      0
            ???          0     0      1

??? = 'don't know'
Part II
Likelihood Ratio Tests and Singularities
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff's theorem
6 Examples
Lecture outline
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff’s theorem
6 Examples
Example: Bayesian network
Sachs et al. (2005): Analysis of flow cytometry data
Expression values for 11 proteins discretized −→ ternary variables
Large sample size (observational part: n = 1200)
Bayesian network (conditional independence model):
Typical task: test absence of edges
Likelihood ratio test of absence of 'PKC → PKA' can be based on a χ²_4 distribution

See Chapter 3 in the lecture notes
Chi-square asymptotics
Theorem
Suppose
(i) {Pθ : θ ∈ Θ} is a regular exponential family (Θ ⊂ Rk open),
(ii) Θ0 ⊂ Θ1 are smooth submanifolds of Θ,
(iii) True parameter point θ0 ∈ Θ0.
Then the likelihood ratio statistic for testing
H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 \Θ0
tends to χ²_{dim(Θ1)−dim(Θ0)} as n → ∞.
Theorem covers Bayesian network example because
interior of probability simplex is regular exponential family, and
Bayesian networks define smooth submanifolds.
Regular exponential families
Definition
Let PΘ = {Pθ : θ ∈ Θ} be a family of probability distributions on X ⊆ R^m that have densities with respect to a measure ν. We call PΘ an exponential family if there is a statistic T : X → R^k and functions h : Θ → R^k and Z : Θ → R such that each distribution Pθ has ν-density

pθ(x) = (1/Z(θ)) exp{ 〈h(θ), T(x)〉 },   x ∈ X.

If

H = { η ∈ R^k : ∫_X exp{ 〈η, T(x)〉 } dν(x) < ∞ }

is an open subset of R^k and h a diffeomorphism between Θ and H, then we say that PΘ is a regular exponential family of order k.
Curved exponential families
Definition
Suppose {Pθ : θ ∈ Θ} is a regular exponential family. If Θ0 is a smooth submanifold of Θ, then {Pθ : θ ∈ Θ0} is a curved exponential family.
Well-developed large-sample theory for CEFs
Estimation and confidence intervals:
Maximum likelihood estimators are asymptotically normal.
Hypothesis testing:
Likelihood ratio statistics have asymptotic chi-square distributions, and so do Wald statistics.

Model selection:

Bayesian information criterion (BIC) is consistent and connected to the asymptotics of marginal likelihood integrals.
Example: Instrumental variables
Estimate coefficient γ43 in the system
X3 = γ35X5 + ε3,
X4 = γ43X3 + γ45X5 + ε4,
X5 = ε5
with εi ∼ N (0, ωi ) independent
(Path diagram: X5 → X3, X5 → X4, X3 → X4; X5 hidden.)

Variable X5 hidden: consider distributions

(X1, . . . , X4) ∼ N(0, Σ(γ, ω))

(γ, ω) → Σ(γ, ω) is a polynomial parametrization
Example: Instrumental variables
Estimate coefficient γ43 in the system
X1 = ε1,
X2 = ε2,
X3 = γ31X1 + γ32X2 + γ35X5 + ε3,
X4 = γ43X3 + γ45X5 + ε4,
X5 = ε5
with εi ∼ N (0, ωi ) independent
(Path diagram: X1 → X3, X2 → X3, X5 → X3, X5 → X4, X3 → X4; X5 hidden.)

Variable X5 hidden

Marginal distribution

(X1, . . . , X4) ∼ N(0, Σ(γ, ω))
Example: Instrumental variables
Covariance matrix parametrization is a polynomial map:
            ( ω1     0      γ31 ω1      γ43 γ31 ω1                                      )
Σ(γ, ω) =   (        ω2     γ32 ω2      γ43 γ32 ω2                                      )
            (               Var[X3]     γ43 Var[X3] + γ35 γ45 ω5                         )
            ( (sym.)                    ω4 + γ43² Var[X3] + γ45² ω5 + 2 γ45 γ43 γ35 ω5   )

with

Var[X3] = ω3 + γ31² ω1 + γ32² ω2 + γ35² ω5

Coordinate σij is a combinatorial expression summing terms associated with 'treks'

i ←− ℓ1 ←− ℓ2 ←− . . . ←− t −→ . . . −→ r2 −→ r1 −→ j
Example: Instrumental variables
In this hidden variable model test
H0 : γ31 = γ32 = 0
Null distribution of the LR statistic (n = 1000):

(Figure: empirical CDF F(x) of the LR statistic under the null; path diagram as above with X5 hidden.)
Algebraic exponential families
Asymptotic behavior of the LRT in instrumental variables example?
Hidden variable models ≠ curved exponential families

What is a suitable general framework to study hidden variable models?

Definition

Suppose {Pθ : θ ∈ Θ} is a regular exponential family. If Θ0 is a semi-algebraic subset of Θ, then the submodel {Pθ : θ ∈ Θ0} is an algebraic exponential family.
Semi-algebraic sets
Definition
Let R[t1, . . . , tk] be the ring of polynomials in the indeterminates t1, . . . , tk with real coefficients. A semi-algebraic set is a finite union of the form

Θ0 = ⋃_{i=1}^m { θ ∈ R^k | f(θ) = 0 for f ∈ Fi and h(θ) > 0 for h ∈ Hi },

where Fi, Hi ⊂ R[t1, . . . , tk] are collections of polynomials and all Hi finite.

Theorem (Tarski-Seidenberg)

If g : R^d → R^k is a polynomial map and Γ is a semi-algebraic set, then Θ0 = g(Γ) is semi-algebraic.
Lecture outline
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff’s theorem
6 Examples
Likelihood ratio test
Independent observations X (1), . . . ,X (n) with unknown distribution
Statistical model {Pθ : θ ∈ Θ}, Θ ⊆ Rk
Suppose Pθ have density functions pθ(x). Define likelihood function
Ln : Θ → R,   θ ↦ ∏_{i=1}^n pθ(X(i)).
Test H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 \Θ0 for some Θ0 ⊂ Θ1 ⊂ Θ.
Definition
The likelihood ratio test rejects H0 if the likelihood ratio statistic
λn = 2 log [ sup_{θ∈Θ1} Ln(θ) / sup_{θ∈Θ0} Ln(θ) ]

is "too large" =⇒ p-value PH0(λn ≥ λobs).
Canonical example: Normal means
Normal mean model {N(θ, Ik) : θ ∈ R^k}

Log-likelihood function

ℓn(θ) = −(nk/2) log(2π) − (n/2) ‖X̄n − θ‖² − (1/2) Σ_{i=1}^n ‖X(i) − X̄n‖².

Sample mean

X̄n = (1/n) Σ_{i=1}^n X(i)

Likelihood ratio statistic for testing H0 : θ ∈ Θ0 vs. H1 : θ ∉ Θ0:

λn = n · inf_{θ∈Θ0} ‖X̄n − θ‖² = inf_{θ∈Θ0} ‖ √n(X̄n − θ0) − √n(θ − θ0) ‖²

where θ0 is the true parameter.
Canonical example: Normal means
Asymptotics of the LR statistic determined by squared Euclidean distance between an N(0, Ik)-point and the "limit of √n(Θ0 − θ0)"

Example: Cuspidal cubic

Bivariate normal mean model

Θ0 cuspidal cubic {(θ1, θ2) : θ1³ = θ2²}

Tangent cone at θ0 = 0 is the half-ray {(θ1, θ2) : θ1 ≥ 0, θ2 = 0}

Limiting distribution of LRT is a mixture of chi-squares:

λn −→D (1/2) χ²_1 + (1/2) χ²_2.
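The mixture can be checked by simulation; a sketch (not from the notes) using the fact that the squared distance from Z ∼ N(0, I2) to the half-ray is Z2² when Z1 ≥ 0 and Z1² + Z2² when Z1 < 0:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 2))
# squared distance to the half-ray {theta1 >= 0, theta2 = 0}
dist2 = np.where(Z[:, 0] >= 0, Z[:, 1]**2, (Z**2).sum(axis=1))
# e.g. an empirical tail probability, lying between the chi2_1 and chi2_2 tails
print((dist2 > 3.84).mean())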
Chernoff’s theorem: Preparation
Definition (Tangent cone)
TCθ0(Θ0) = { lim_{n→∞} (θn − θ0)/βn : βn > 0, θn ∈ Θ0, θn −→ θ0 }

Definition (Fisher-information matrix)

Positive semi-definite matrix I(θ) with entries

I(θ)ij = Eθ[ (∂ log pθ(X)/∂θi) · (∂ log pθ(X)/∂θj) ],   i, j ∈ [k].
Chernoff’s theorem (for exponential families)
Theorem
Suppose {Pθ : θ ∈ Θ} is a regular exponential family with Θ ⊆ R^k. Let θ0 ∈ Θ0 ⊆ Θ ⊆ R^k be the true parameter point. If Θ0 is Chernoff-regular at θ0 and n → ∞, then the LR statistic λn for H0 : θ ∈ Θ0 vs. H1 : θ ∉ Θ0 converges to

min_{τ∈TCθ0(Θ0)} ‖ Z − I(θ0)^{1/2} τ ‖²

where Z ∼ N(0, Ik) and I(θ0)^{1/2} is any matrix square root of the Fisher-information I(θ0).
What is Chernoff-regularity?
Condition on how the tangent cone TCθ0(Θ0) approximates the set Θ0 locally at θ0 ∈ Θ0. Allows one to pass from sup_{θ∈Θ0} . . . to sup_{τ∈TCθ0(Θ0)} . . . .

For θ0 = 0:

distance(θ, TC0(Θ0)) = o(‖θ‖),   θ ∈ Θ0,
distance(τ, Θ0) = o(‖τ‖),   τ ∈ TC0(Θ0)

Definition

A set Θ0 ⊆ R^k is Chernoff-regular at θ0 if for all τ ∈ TCθ0(Θ0) and βn ↘ 0 there exists a sequence θn → θ0 in Θ0 such that

lim_{n→∞} (θn − θ0)/βn = τ.
Chernoff-regularity of semi-algebraic sets
Lemma
Semi-algebraic sets are everywhere Chernoff-regular.
Follows from the 'curve selection lemma', which implies that for all τ ∈ TCθ0(Θ0) there exists a (real analytic) map α : [0, ε) → Θ0 with α(0) = θ0 s.t.

τ = lim_{t→0+} (α(t) − α(0)) / t.

Corollary (Testing in a submodel)

Suppose {Pθ : θ ∈ Θ} is a regular exponential family with Θ ⊆ R^k. Let Θ0, Θ1 be semi-algebraic subsets of Θ. If the true parameter θ0 is in Θ0 and n → ∞, then the LR statistic for H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 \ Θ0 converges to

min_{τ∈TCθ0(Θ0)} ‖ Z − I(θ0)^{1/2} τ ‖² − min_{τ∈TCθ0(Θ1)} ‖ Z − I(θ0)^{1/2} τ ‖²,   Z ∼ N(0, Ik).
Lecture outline
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff’s theorem
6 Examples
Linear spaces
Lemma
If Θ0 is a d-dimensional linear subspace of R^k and X ∼ N(0, Σ) with positive definite covariance matrix Σ, then

inf_{θ∈Θ0} (X − θ)ᵀ Σ⁻¹ (X − θ) ∼ χ²_{k−d}.

Corollary

Likelihood ratio statistic is asymptotically chi-square when testing linear or smooth hypotheses.
Order-restricted inference
Example:
X1: Difference in blood pressure before and after taking 1 pill
X2: Difference in blood pressure before and after taking 2 pills

Suppose X1 ∼ N(µ1, σ0²) and X2 ∼ N(µ2, σ0²) and test:

H0 : µ2 ≥ µ1 ≥ 0 versus H1 : (µ2 < µ1 or µ1 < 0)

or possibly,

H0 : µ2 = µ1 = 0 versus H1 : µ2 ≥ µ1 ≥ 0
Mixture of chi-square distributions
(1/8) · χ²_0 + (1/2) · χ²_1 + (3/8) · χ²_2
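A simulation sketch (an assumption: this mixture arises from the first testing problem, H0 : µ2 ≥ µ1 ≥ 0, at the apex µ = 0): the LR statistic converges to the squared distance from Z ∼ N(0, I2) to the cone C = {x : 0 ≤ x1 ≤ x2}, computable by comparing a few candidate projections:

import numpy as np

def dist2_to_cone(z):
    # candidate projections: apex, the two (clipped) edge rays, z itself
    cands = [np.zeros(2), np.array([0.0, max(z[1], 0.0)])]
    t = max((z[0] + z[1]) / 2.0, 0.0)   # projection onto the ray x1 = x2
    cands.append(np.array([t, t]))
    if 0.0 <= z[0] <= z[1]:
        cands.append(z)                  # z already lies in the cone
    return min(float(((z - c)**2).sum()) for c in cands)

rng = np.random.default_rng(1)
d2 = np.array([dist2_to_cone(z) for z in rng.standard_normal((50_000, 2))])
print((d2 < 1e-12).mean())   # ~ 1/8, the weight of the chi^2_0 atom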
Convex cones – ‘Boundary problems’
Lemma
Distance between a standard normal random vector and a convex cone is distributed like a mixture of chi-square distributions.
Theorem (Miles, 1959; Drton & Klivans, 2009)
(a) H0 : θ ∈ { x ∈ R^k : x1 ≤ x2 ≤ · · · ≤ xk }
    Mixture weights ∝ coeff's of t(t − 1) · · · (t − k + 1)

(b) H0 : θ ∈ { x ∈ R^k : 0 ≤ x1 ≤ x2 ≤ · · · ≤ xk }
    Mixture weights ∝ coeff's of (t − 1)(t − 3) · · · (t − 2k + 1).
Singularities
Geometry of a semi-algebraic set Θ0 ⊆ R^k expresses itself algebraically in the vanishing ideal

I(Θ0) = { f ∈ R[t1, . . . , tk] : f(θ) = 0 for all θ ∈ Θ0 }.

Finite generating set

〈 f1, . . . , fs 〉 = I(Θ0),   f1, . . . , fs ∈ R[t1, . . . , tk]

Definition

A point θ0 in Θ0 is a singularity if the rank of the Jacobian matrix

Jf(θ0) = ( ∂fi(t)/∂tj )|_{t=θ0} ∈ R^{s×k}

is smaller than k − dim Θ0.
Algebraic tangent cone
Let θ0 be a root of the polynomial f ∈ R[t1, . . . , tk] and write

f(t) = Σ_{h=l}^L fh(t − θ0),

where fh is homogeneous, degree(fh) = h, and fl ≠ 0.

Since f(θ0) = 0, the minimal degree l ≥ 1, and we define fθ0,min = fl.

Tangent cone ideal:

{ fθ0,min : f ∈ I(Θ0) } ⊂ R[t1, . . . , tk].

Lemma

Suppose θ0 is a point in the semi-algebraic set Θ0 and f ∈ R[t1, . . . , tk] a polynomial such that f(θ0) = 0 and f(θ) ≥ 0 for all θ ∈ Θ0. Then every tangent vector τ ∈ TCθ0(Θ0) satisfies fθ0,min(τ) ≥ 0.
Example: Cuspidal cubic
Θ0 = {(θ1, θ2) : θ1³ = θ2²}

Tangent cone ideal for θ0 = 0 is generated by t2²

Associated algebraic tangent cone

{θ : θ2² = 0} = {θ : θ2 = 0}

Tangent cone at θ0 = 0 is the half-ray

{θ : θ1 ≥ 0, θ2 = 0}
Instrumental variables – Singularities
Covariance matrix

( ω1     0      γ31 ω1       γ43 γ31 ω1        )
(        ω2     γ32 ω2       γ43 γ32 ω2        )
(               ω3 + . . .   γ35 γ45 ω5 + . . . )
( (sym.)                     ω4 + . . .         )

(Path diagram: X1 → X3, X2 → X3, X5 → X3, X5 → X4, X3 → X4; X5 hidden.)

Vanishing ideal

I = 〈 σ12, σ13 σ24 − σ14 σ23 〉

Singular locus:

{ Σ = (σij) : σ12 = σ13 = σ14 = σ23 = σ24 = 0 }

coincides with H0 : γ31 = γ32 = 0
Instrumental variables – Tangent cone
Singularities are 'zero'

Vanishing ideal is homogeneous and thus equal to the tangent cone ideal

Algebraic tangent cone at a singularity: symmetric matrices of block form

( diagonal 2×2   rank ≤ 1 block )
(                arbitrary 2×2  )

Geometric tangent cone TC is the closed cone that contains all derivative directions. It is equal to the algebraic cone.
Instrumental variables – Asymptotics
Proposition
Consider testing

H0 : γ31 = γ32 = 0

in the instrumental variables example. Under the null and as n → ∞,

λn −→d max{ eigenvalues(W(2, I)) }

where W(2, I) is a standard 2 × 2 Wishart matrix with 2 degrees of freedom.

'Proof' (Details in worked exercises 6.4 and 6.5 in the lecture notes)

Tangent cone invariant under transformation with a matrix square root of the Fisher-information

Distance between a 2 × 2 matrix A and {rank ≤ 1} is given by the smaller singular value of A
Factor analysis
Factor analysis (conditional independence given hidden variable)
X1 = γ1 H + ε1,
X2 = γ2 H + ε2,
X3 = γ3 H + ε3,
X4 = γ4 H + ε4

(Graph: H → X1, X2, X3, X4.)

Multivariate normal distributions N4(µ, Σ) with µ ∈ R⁴ and Σ in

Θ0 = { ∆ + γγᵀ | ∆ ∈ R^{4×4} positive definite diagonal, γ ∈ R⁴ }

Software (e.g. factanal in R) for testing

H0 : Σ ∈ Θ0 vs. H1 : Σ ∉ Θ0,

uses the LRT and a χ²_2-approximation
Factor analysis
Histograms of 20,000 simulated p-values for sample size n = 1000:
(Figure: four histograms of the p-values, one for each loading vector Γ = (1, 1, 1, 1)ᵀ, (1, 1, 1, 0)ᵀ, (1, 1, 0, 0)ᵀ, and (1, 0, 0, 0)ᵀ.)

Factor loadings 0 or 1, cond. variances 1/3 =⇒ correlations 0 or 3/4.

Three types of limiting distributions?
Factor analysis – Singular session
LIB "sing.lib";
LIB "linalg.lib";

ring R = 0,(s11,s12,s13,s14, s22,s23,s24, s33,s34, s44,
            d1,d2,d3,d4, g1,g2,g3,g4),dp;

// Compute the vanishing ideal by elimination
ideal F = s11-(d1+g1^2), s12-g1*g2, s13-g1*g3, s14-g1*g4,
          s22-(d2+g2^2), s23-g2*g3, s24-g2*g4,
          s33-(d3+g3^2), s34-g3*g4,
          s44-(d4+g4^2);
ideal I = eliminate(F, d1*d2*d3*d4*g1*g2*g3*g4);
I;
Factor analysis – Singular session
ring RR = 0,(s11,s12,s13,s14, s22,s23,s24, s33,s34, s44),dp;
ideal I = fetch(R,I);
dim(groebner(I));

// Compute the singularities
ideal S = slocus(I); S;
primdecGTZ(S);

// Tangent cone at diagonal matrix
tangentcone(I);
// at matrix with s12=1
tangentcone( subst(I,s12,s12+1) );
// at regular point with s12=s13=1
tangentcone( subst(I,s12,s12+1,s13,s13+1) );
Factor analysis: Singularities and tangent cones
Theorem (D, 2009)
(i) A covariance matrix Σ is a singularity of the one-factor model if and only if Σ has at most one non-zero off-diagonal entry σij, i < j.

(ii) If Σ is diagonal then the tangent cone is the topological closure of

{ ∆ + γγᵀ | ∆ ∈ R^{m×m} diagonal, γ ∈ R^m }.

(iii) If Σ has exactly one non-zero off-diagonal entry that is positive, say σ12 > 0, then the tangent cone is the set of symmetric matrices

    ( θ11   θ12   θ13    . . .   θ1m  )
θ = ( θ12   θ22   cθ13   . . .   cθ1m ),   c ∈ [ σ12/σ11, σ22/σ12 ].
    (             θ33    . . .        )
    (                            θmm  )

The case σ12 < 0 is similar with c < 0.
Exercise: RC association model (Haberman, 1981)
Two discrete r.v. X1 and X2 with r1 and r2 states, respectively.
Logarithmic parametrization
log pij = αi + βj + γiδj , i ∈ [r1], j ∈ [r2]
What are the singularities? (in log-prob coordinates)

What do the tangent cones at the singularities look like?

What is the asymptotic distribution of the likelihood ratio statistic when testing the independence model X1⊥⊥X2 against the RC association model?
Part III
Bayesian Integrals
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Model selection: Setup
Observations X (1), . . . ,X (n) ∼ P i.i.d.
Unknown P assumed to be in (identifiable) ambient statistical model
{Pθ : θ ∈ Θ}, Θ ⊆ Rk .
True parameter θ0 is such that Pθ0 = P.
Call submodel given by Θ0 ⊂ Θ true if θ0 ∈ Θ0.
Model selection problem
Find the "simplest" true model from a set of competing submodels associated with

Θ1, Θ2, . . . , ΘM ⊆ Θ.
Score-based search
Strategy
Assign a score to each model and maximize the score.
Assume densities pθ(x), and define likelihood function
Ln : Θ → R,   θ ↦ ∏_{i=1}^n pθ(X(i)).

For submodel Θi, let

ℓ̂n(i) = sup{ log Ln(θ) | θ ∈ Θi },   i = 1, . . . , M.

If Θ1 ⊆ Θ2, then ℓ̂n(1) ≤ ℓ̂n(2).
Information criteria
Definition
The information criterion associated with a penalty function πn : [M] → R assigns the score

τn(i) = ℓ̂n(i) − πn(i)

to the i-th model, i = 1, . . . , M.

Example

AIC: πn(i) = dim(Θi) (Akaike)

BIC: πn(i) = (dim(Θi)/2) log(n) (Bayesian, Schwarz)

Information criteria strike a balance between model fit and model dimensionality.
Basic consistency result
Theorem (compare Haughton, 1988)
Consider a regular exponential family (Pθ | θ ∈ Θ); in particular, Θ ⊆ R^k is open. Let Θ1, Θ2 ⊆ Θ be any two sets.

1 Suppose θ0 ∈ Θ2 \ Θ1. If (1/n) |πn(2) − πn(1)| → 0 as n → ∞, then

lim_{n→∞} Pθ0( τn(1) < τn(2) ) = 1.

2 Suppose θ0 ∈ Θ1 ∩ Θ2. If πn(1) − πn(2) → ∞ as n → ∞, then

lim_{n→∞} Pθ0( τn(1) < τn(2) ) = 1.
Consistency
Corollary
Suppose a collection of models is given by closed sets Θ1, Θ2, . . . , ΘM. If the collection is closed under intersections, and Θi ⊂ Θj implies dim(Θi) < dim(Θj), then:
1 AIC identifies a true model with prob one as n→∞.
2 BIC identifies smallest true model with prob one as n→∞.
Example
1 Linear regression (random design)
2 Undirected graphical models
3 Determining rank in reduced-rank regression ('singularities')

4 Determining number of factors in factor analysis ('singularities')

5 Directed graphical models ('faithfulness'), hidden var's ('singularities')
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Bayesian model determination
Prior probability of model i :
P(Θi ), i = 1, . . . ,M
Prior distribution of parameter in model i :
Qi (θ), θ ∈ Θi
Likelihood function:
Ln(θ | X(1), . . . , X(n)) = ∏_{i=1}^n pθ(X(i))

Posterior probability of model i:

P(Θi | X(1), . . . , X(n)) ∝ P(Θi) ∫_{Θi} Ln(θ | X(1), . . . , X(n)) dQi(θ),

where the integral is the marginal/integrated likelihood.
Marginal likelihood
In typical applications, the models are parametrized:
θ = gi (γ), γ ∈ Rd
Priors Qi specified via distributions on γ that have densities pi (γ)
Marginal likelihood for one model (suppressing index i):
µn = ∫_{R^d} Ln( g(γ) | X(1), . . . , X(n) ) p(γ) dγ
   = ∫_{R^d} e^{ℓn( g(γ) | X(1), . . . , X(n) )} p(γ) dγ
Frequentist view
Suppose X (1), . . . ,X (n), · · · ∼ Pθ0 are i.i.d. with θ0 = g(γ0).
What is the asymptotic behavior of the sequence (µn)?
Asymptotics for marginal likelihood integrals
Theorem (Laplace approximation; Haughton, 1988)
Let {Pθ : θ ∈ Θ} be a regular exponential family with Θ ⊆ R^k. Consider an open set Γ ⊆ R^d and a smooth injective map g : Γ → R^k with continuous inverse. Let θ0 = g(γ0) be the true parameter, and assume that the prior density p(γ) is smooth and positive in a neighborhood of γ0. Then

log µn = ℓ̂n − (d/2) log(n) + Op(1),

where

ℓ̂n = sup_{γ∈Γ} ℓn( g(γ) | X(1), . . . , X(n) ).

Recall: Rn = Op(1) if ∀ε > 0 ∃Mε ∀n : P(|Rn| > Mε) < ε

Haughton actually gives an expansion of log µn up to Op(n^{−1/2})
Example: Normal means model
Observations:
X(1), . . . , X(n) ∼ N(θ, Ik×k),   θ ∈ Θ = R^k

Likelihood function:

Ln(θ | X(1), . . . , X(n)) = ( (2π)^{−k/2} )^n exp{ −(n/2) ‖X̄n − θ‖² }   (up to a factor not depending on θ)

Model parametrization g : R^d → R^k

Marginal likelihood

µn = Cn ∫_{R^d} exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ
Cuspidal cubic
Model Θ0 = { θ ∈ R² : θ2² = θ1³ }

Parametrized by g(γ) = (γ², γ³)

If γ0 ≠ 0, i.e., g(γ0) ≠ 0, then Haughton's Theorem applies.

If θ0 = g(γ0) ≠ 0, then

log ∫_{−∞}^{∞} exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ = −(1/2) log(n) + Op(1).

(Exponent ≈ quadratic in γ, Gaussian density with variance c/n)

What if θ0 = 0 ⟺ γ0 = 0?
Cuspidal cubic
Integral with normalizing constant omitted:

∫_{−∞}^{∞} exp{ −(1/2) [ (√n γ² − √n X̄n,1)² + (√n γ³ − √n X̄n,2)² ] } p(γ) dγ

Change of variables γ̄ = n^{1/4} γ:

n^{−1/4} ∫_{−∞}^{∞} exp{ −(1/2) [ (γ̄² − √n X̄n,1)² + (γ̄³/n^{1/4} − √n X̄n,2)² ] } p(γ̄/n^{1/4}) dγ̄.

Let θ0 = 0 and Z1, Z2 ~ind N(0, 1). Limit when multiplying by n^{1/4}:

∫_{−∞}^{∞} exp{ −(1/2) [ (γ̄² − Z1)² + Z2² ] } p(0) dγ̄.

Hence, log µn = ℓ̂n − (1/4) log(n) + Op(1)
Observation
Sequence of random integrals:

log µn = log ∫_{R^d} Cn exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ

       = ℓ̂n − (1/2) log(n) + Op(1)   if γ0 ≠ 0,
       = ℓ̂n − (1/4) log(n) + Op(1)   if γ0 = 0

Deterministic integrals (replace X̄n by the expectation θ0 = g(γ0)):

log ∫_{−∞}^{∞} Cn exp{ −(n/2) ‖g(γ0) − g(γ)‖² } p(γ) dγ

       = n log(C) − (1/2) log(n) + O(1)   if γ0 ≠ 0,
       = n log(C) − (1/4) log(n) + O(1)   if γ0 = 0
Same asymptotics!
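The γ0 = 0 rate in the deterministic integral is easy to verify numerically; a sketch, assuming scipy and a flat prior on [−1, 1], using ‖g(0) − g(γ)‖² = γ⁴ + γ⁶:

import numpy as np
from scipy.integrate import quad

def J(n):
    # deterministic Laplace integral for the cuspidal cubic at gamma_0 = 0
    return quad(lambda g: np.exp(-0.5 * n * (g**4 + g**6)), -1, 1)[0]

ns = [1e2, 1e3, 1e4]
slope = np.polyfit(np.log(ns), np.log([J(n) for n in ns]), 1)[0]
print(slope)   # ~ -1/4, i.e. the -(1/4) log(n) term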
Laplace integrals
Theorem
Let {Pθ : θ ∈ Θ} be a regular exponential family. Consider a polynomial map g : R^d → Θ, and let θ0 = g(γ0) be the true parameter. Assume that the prior density p(γ) is smooth and positive on a compact and semi-analytic supporting set. Then

log µn = ℓ̂n − q log(n) + (s − 1) log log(n) + Op(1),

where the rational number q ∈ (0, d/2] and the integer s ∈ [d] satisfy

log ∫ e^{−n ‖g(γ) − θ0‖²} p(γ) dγ = −q log(n) + (s − 1) log log(n) + O(1).
Remark
The remainder can be shown to converge in distribution.
Watanabe’s book
The theorem is proven in the book by Watanabe.

Watanabe also discusses algebraic techniques for computing the learning coefficient = growth index q and the multiplicity s.

Singular integrals:

Arnol'd, V.I.; Gusein-Zade, S.M.; Varchenko, A.N. Singularities of differentiable maps. Vol. I & II, 1985/88.

Work by Michael Greenblatt at UIC
Example: Sample vs true mean in normal means model
Random integral
log µn = log ∫_{R^d} exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ

Simple bound for any a > 0:

2 |〈X̄n − θ0, g(γ) − θ0〉| ≤ a ‖X̄n − θ0‖² + (1/a) ‖g(γ) − θ0‖²

Bound in exponent:

‖X̄n − g(γ)‖² ≤ 2 ‖g(γ) − θ0‖² + 2 ‖X̄n − θ0‖²   (a = 1)
‖X̄n − g(γ)‖² ≥ (1/2) ‖g(γ) − θ0‖² − ‖X̄n − θ0‖²   (a = 2)

If the deterministic integral based on e^{−n ‖g(γ) − θ0‖²} has an asymptotic expansion then the random integrals have the same growth behavior.
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Zeta function
Polynomial map f : Rd → [0,∞)
Smooth prior p(γ), positive on compact semi-analytic support
Laplace integral:

∫ e^{−n f(γ)} p(γ) dγ

Zeta function:

ζ(λ) = ∫ f(γ)^λ p(γ) dγ,   λ ∈ C, Re(λ) > 0

Theorem

The zeta function ζ(λ) can be continued (uniquely) to a meromorphic function on all of C. All poles are negative rational numbers. The largest pole is −q, the negated growth index, and the multiplicity s is the multiplicity of this pole.
Local view
For large n, the main contribution to

∫ e^{−n f(γ)} p(γ) dγ

comes from a neighborhood of

Vf = { γ : f(γ) = 0 } ∩ supp(p).

Since the prior support is assumed compact, study the asymptotics of

∫_{U(γ0)} e^{−n f(γ)} p(γ) dγ,   U(γ0) a small neighborhood of γ0,

for all γ0 ∈ Vf.

Note: For the marginal likelihood, f(γ) = 0 ⟺ g(γ) = θ0 ('identifiability' issues)
Resolution of singularities
Theorem (Hironaka, 1964; Atiyah, 1970)
In the considered setup, for every γ0 ∈ Vf, there exist

a neighborhood U(γ0) of γ0 ∈ R^d and

changes of coordinates

such that the zeta function becomes a finite sum of the form

∫_{U(γ0)} f(γ)^λ p(γ) dγ = Σ_α ∫_{[0,b]^d} ( u1^{2k1(α)} · · · ud^{2kd(α)} )^λ φα(u) u1^{h1(α)} · · · ud^{hd(α)} du,

where the φα are smooth and bounded away from zero on [0, b]^d.
Largest pole and multiplicity
Once in 'normal crossing form', meromorphic continuation and determination of poles is clear.

Example:

∫ (u^{2k})^λ u^h du = u^{2kλ+h+1} / (2kλ + h + 1),   pole at λ = −(h + 1)/(2k)

Growth index:

q = min_α min_{1≤j≤d} (hj(α) + 1) / (2 kj(α))

Multiplicity:

s = max_α #{ j : (hj(α) + 1) / (2 kj(α)) = q }
Example: Blow-up transformations
Product interval:

∫_{−1}^1 ∫_{−1}^1 e^{−n(x^4 + y^6)} dy dx ∼ n^{−1/4} n^{−1/6} · C = n^{−5/12} · C

Resolve by repeatedly applying the blow-up transformation, i.e., the pair of charts

x = x1, y = x1 y1;   x = x2 y2, y = y2.
Example: Blow-up transformations
First blow-up transformation gives

x^4 + y^6 = x1^4 (1 + x1^2 y1^6)    Jacobian: x1
          = y2^4 (x2^4 + y2^2)      Jacobian: y2

In 1st coordinates normal crossing, 4λ + 2 = 0, pole −1/2
In 2nd coordinates not normal crossing, repeat:

y^4 (x^4 + y^2) = x1^6 y1^4 (x1^2 + y1^2)    Jacobian: x1^2 y1
                = y2^6 (1 + x2^4 y2^2)       Jacobian: y2^2

In 2nd coordinates normal crossing, 6λ + 3 = 0, pole −1/2
In 1st coordinates not normal crossing, repeat:

x^6 y^4 (x^2 + y^2) = x1^12 y1^4 (1 + y1^2)    Jacobian: x1^4 y1
                    = x2^6 y2^12 (1 + x2^2)    Jacobian: x2^2 y2^4

Normal crossing in both coordinates: q = 5/12, s = 1
Resolution – Singular session
LIB "resolve.lib";
ring R = 0,(x,y),dp;
ideal J = x4+y6;
list L=resolve(J);
presentTree(L);

list L=resolve(J,0,"A");
presentTree(L);
LIB "reszeta.lib";
list coll=collectDiv(L);
LIB "resgraph.lib";
ResTree(L,coll[1]);
Distance of Newton polyhedron
∫_{−1}^1 ∫_{−1}^1 e^{−n(x^4 + y^6)} dy dx ∼ n^{−1/4} n^{−1/6} · C = n^{−5/12} · C

(Figure: Newton polyhedron of x^4 + y^6; the diagonal ray meets its boundary at (12/5, 12/5).)

Distance:

ρ = 4 · (3/5) = 6 · (2/5) = 12/5   =⇒   q = 1/ρ
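The exponent 5/12 can also be confirmed numerically; a sketch assuming scipy, estimating the slope of log I(n) against log n:

import numpy as np
from scipy.integrate import dblquad

def I(n):
    # the product-interval Laplace integral from the slide
    val, _ = dblquad(lambda y, x: np.exp(-n * (x**4 + y**6)),
                     -1, 1, lambda x: -1.0, lambda x: 1.0)
    return val

ns = [1e2, 1e3, 1e4]
slope = np.polyfit(np.log(ns), np.log([I(n) for n in ns]), 1)[0]
print(slope)   # ~ -5/12 = -1/rho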
Newton polyhedron
Polynomial
f(x) = Σ_{a∈N^d} ca x^a,   x^a = x1^{a1} · · · xd^{ad}

Newton polyhedron Pf is the convex hull of the set

⋃_{a : ca ≠ 0} ( {a} + [0, ∞)^d )

Distance:

ρ = min{ r : r · 1d ∈ Pf }

For A ⊂ R^d, define

fA(x) = Σ_{a∈A∩N^d} ca x^a
Non-degenerate exponents and remoteness
Theorem
If the polynomial f has a minimum at zero and is non-degenerate, that is, for any compact face A of the Newton polyhedron the equation system

∂fA(x)/∂x1 = . . . = ∂fA(x)/∂xd = 0

has no solution in (R \ {0})^d, then for small ε the growth index for the integral

∫_{[−ε,ε]^d} e^{−n f(γ)} p(γ) dγ

is q = 1/ρ and the multiplicity s is the codimension of the lowest-dimensional face containing the point at which the ray spanned by 1d first intersects the Newton polyhedron.
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Reduced rank regression
Multivariate regression model
Y = θX + ε,   θ ∈ R^{a×b}, rank(θ) ≤ h

(Graph: X1, X2 → H → Y1, Y2.)

Multivariate normal model (random design X)

Parametrize

θ = g(α, β) = αβᵀ,   α ∈ R^{a×h}, β ∈ R^{b×h}

Model selection problem: Determine h

WLOG: Assume the coordinates of X and ε are mutually independent with known variances.
Asymptotics – regular case
Consider the model given by rank h.

Suppose the true matrix θ0 has rank r ≤ h.

Interested in the asymptotics of the integral

∫∫ exp{ −n ‖αβᵀ − θ0‖² } dα dβ

Regular case:

The Jacobian of the map g(α, β) = αβᵀ achieves its maximal rank h(a + b − h) at a point (α0, β0) if and only if α0 β0ᵀ has full rank h.

If θ0 has rank r = h, then the set g⁻¹(θ0) ⊆ R^{ah+bh} is a smooth manifold of dimension h².

Reparametrize and apply the Laplace approximation (Haughton's result) to obtain

q = h(a + b − h)/2,   s = 1.
Asymptotics – singular case
Interested in the asymptotics of the integral

∫∫ exp{ −n ‖αβᵀ − θ0‖² } dα dβ

Singular case: rank of θ0 is equal to r < h

Aoyagi & Watanabe (2005): Found growth index q and multiplicity s as a function of (a, b, h, r)

Simplest case with singularities is model rank h = 1
Asymptotics – singular case for rank 1
Model rank h = 1

Only one singular point: θ0 = 0

Fiber

g⁻¹(θ0) = { (α0, β0) : α0 = 0 or β0 = 0 }

is singular at the origin (α0, β0) = 0 and smooth elsewhere.

Local integrals are

∫_{U(α0)} ∫_{U(β0)} exp{ −n (α1² + · · · + αa²)(β1² + · · · + βb²) } dα dβ,   (α0, β0) ∈ g⁻¹(0).
Case 1
Suppose α0 = (α01, . . . , α0k, 0, . . . , 0) ≠ 0. Then β0 = 0.

Shift (α0, β0) to the origin by the transformation ᾱi = αi − α0i.

The local integral becomes

∫_{U(0)} exp{ −n [ (ᾱ1 + α01)² + · · · + (ᾱk + α0k)² + ᾱ_{k+1}² + · · · + ᾱa² ] (β1² + · · · + βb²) } d(ᾱ, β)

The function of ᾱ in the exponent is bounded away from zero in a neighborhood U(0).

Asymptotics determined by that of

∫_{U(0)} exp{ −n (β1² + · · · + βb²) } dβ

which is a regular integral with growth index b/2 and multiplicity 1.
Case 2
Suppose α0 = β0 = 0.

Resolve (α1² + · · · + αa²)(β1² + · · · + βb²) by applying a blow-up to the first term and a blow-up to the second term.

We obtain

∫_{U(0,0)} α1^{2λ} β1^{2λ} α1^{a−1} β1^{b−1} (1 + α2² + . . .)^λ (1 + β2² + . . .)^λ dα dβ.

Consider

∫ α1^{2λ+a−1} β1^{2λ+b−1} dα1 dβ1 = α1^{2λ+a} β1^{2λ+b} / [ (2λ + a)(2λ + b) ].

Poles λ = −a/2 and λ = −b/2.
Asymptotics for rank 1
Proposition
The marginal likelihood for the reduced rank regression model for rank h = 1 has growth index and multiplicity

(q, s) = ( (a + b − 1)/2, 1 )   if θ0 ≠ 0,
(q, s) = ( min{a, b}/2, 1 )     if θ0 = 0 and a ≠ b,
(q, s) = ( a/2 = b/2, 2 )       if θ0 = 0 and a = b.

This can also be shown by looking at the Newton diagrams.
Exercise: Factor analysis
Let H and ε1, . . . , εd be mutually independent N (0, 1) r.v.
Define
X = αH + ε,   α ∈ R^d

Then X ∼ N(0, θ) with covariance matrix θ = I + ααᵀ, α ∈ R^d

(Graph: H → X1, . . . , X4.)

What is the growth behaviour of the marginal likelihood of this model?
Conclusion
Algebraic statistical models: a useful framework for discussing non-smooth statistical models.

Computational algebra: Markov bases, vanishing ideals, singular loci, tangent cones, resolution of singularities, . . .

Many open questions about classical statistical models . . .