An Introduction to Algebraic Statistics
Mathias Drton
Department of Statistics, University of Chicago
January, 2010
‘Algebraic statistics’
Application and development of techniques in
Algebraic Geometry, Commutative Algebra, and Combinatorics
to address problems in Statistics.
Instrumental paper:
Diaconis, Persi; Sturmfels, Bernd. Algebraic algorithms for sampling from conditional distributions. Annals of Statistics 26 (1998), no. 1, 363–397.
Applied-minded algebraists get involved with Statistics
(AMS meetings, SIAM activity group, . . . ).
Some literature
Pistone, Riccomagno & Wynn: Algebraic Statistics (Exp. Design)
Pachter & Sturmfels: Algebraic Statistics for Computational Biology
Gibilisco et al. (Eds.): Algebraic and Geometric Methods in Statistics
Viana & Richards (Eds.): Algebraic Methods in Statistics and Probability (2nd volume in prep.)
These lectures
Material from Chapters 1, 2 and 5 in
Drton, Sullivant & Sturmfels: Lectures on Algebraic Statistics
Chapter 3: Conditional independence; Graphical models
Chapter 4: Hidden variable models
Chapter 6: Worked exercises
Chapter 7: Open problems
Lectures
Lecture I: Markov Bases for Exact Inference in Contingency Tables
(Chapter 1 in lecture notes)
Lecture II: Likelihood Ratio Tests and Singularities
(Section 2.3 in lecture notes)
Lecture III: Bayesian Integrals
(Section 5.1 in lecture notes)
Part I
Markov Bases for Exact Inference in Contingency Tables
1 Fisher's exact test for 2 × 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Lecture outline
1 Fisher’s exact test for 2× 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Example: Cancer treatment
Surgery versus radiation treatment for cancer patients:

                       Cancer        Cancer Not
                       Controlled    Controlled    Total
  Surgery                  21             0          21
  Radiation therapy        15             3          18
  Total                    36             3          39
Disease outcome independent of treatment?
Chi-square test p-value = 0.1788
Fisher’s exact test p-value = 0.08929
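Both p-values are easy to reproduce numerically; a minimal sketch, assuming scipy is available (and assuming the slide's chi-square value uses Yates' continuity correction, scipy's default for 2 x 2 tables):

from scipy.stats import chi2_contingency, fisher_exact

table = [[21, 0], [15, 3]]   # cancer treatment table from above

# chi-square test of independence (with continuity correction)
chi2, p_chi2, dof, expected = chi2_contingency(table)
print(p_chi2)                # ~ 0.1788

# Fisher's exact test (two-sided)
odds, p_fisher = fisher_exact(table)
print(p_fisher)              # ~ 0.08929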
Independence model
Two discrete/categorical random variables
X ∈ [r] := {1, 2, . . . , r} and Y ∈ [c] := {1, 2, . . . , c}

Joint and marginal probabilities:

pij = P(X = i, Y = j),   pi+ = P(X = i),   p+j = P(Y = j)

X and Y independent (X⊥⊥Y) iff

pij = pi+ p+j for all i ∈ [r], j ∈ [c]

or, equivalently, the matrix P = (pij) has rank 1.
Chi-square test of independence
Counts from n i.i.d. copies of (X, Y):

Uij = Σ_{k=1}^n 1{X(k) = i, Y(k) = j},   i ∈ [r], j ∈ [c].

Contingency table U = (Uij) has a multinomial distribution:

P(U = u) = n! / (u11! u12! · · · urc!) · ∏_{i=1}^r ∏_{j=1}^c pij^{uij}.

Chi-square statistic (with expected counts ûij = Ui+ U+j / n):

X²(U) = Σ_{i=1}^r Σ_{j=1}^c (Uij − ûij)² / ûij  −→d  χ²_{(r−1)(c−1)} under H0, as n → ∞
Fisher’s exact test for 2× 2 table
Hypergeometric distribution:

If X⊥⊥Y, then

P(U11 = u11 | U1+ = u1+, U+1 = u+1) = (u1+ choose u11) (n − u1+ choose u+1 − u11) / (n choose u+1)

for u11 ∈ {max(0, u1+ + u+1 − n), . . . , min(u1+, u+1)}.
Exact test:
1 Choose a test statistic T(u)
  (e.g., X²(u), P(U11 = u11 | U1+ = u1+, U+1 = u+1), . . . )
2 P-value:

P(T(U) ≥ T(u) | U1+, U+1) = Σ_{v : T(v) ≥ T(u)} (u1+ choose v11) (n − u1+ choose u+1 − v11) / (n choose u+1)
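The exact test is small enough to spell out in code. A minimal sketch (not from the lecture notes), assuming scipy is available and taking T(v) to be the reciprocal of the hypergeometric probability, so that the test sums over tables at most as probable as the observed one:

from scipy.stats import hypergeom

def exact_p_value(u11, u12, u21, u22):
    n = u11 + u12 + u21 + u22
    r1, c1 = u11 + u12, u11 + u21        # margins u1+ and u+1
    H = hypergeom(n, r1, c1)             # U11 given the margins
    support = range(max(0, r1 + c1 - n), min(r1, c1) + 1)
    p_obs = H.pmf(u11)
    # sum P(V11 = v11 | margins) over all v with T(v) >= T(u)
    return sum(H.pmf(k) for k in support if H.pmf(k) <= p_obs + 1e-12)

print(exact_p_value(21, 0, 15, 3))       # ~ 0.08929, matching the slide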
Lecture outline
1 Fisher’s exact test for 2× 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Three-way table (Agresti, 2002)
White subjects were asked about:
(1) “Black children on school bus”, (2) “Black candidate for presidency”,
(3) “Black friend for dinner at home”
                          Home
President   Busing     Yes    No    ???

Yes         Yes         41    65      0
            No          71   157      1
            ???          1    17      0

No          Yes          2     5      0
            No           3    44      0
            ???          1     0      0

???         Yes          0     3      1
            No           0    10      0
            ???          0     0      1

??? = 'don't know'
Log-linear models
Discrete r.v. X1, . . . , Xm; Xℓ ∈ [rℓ]

State space: R = ∏_{ℓ=1}^m [rℓ]

Joint probability table: p = (pi | i ∈ R)

Probability simplex: ∆_{R−1}

Definition

Fix a matrix A ∈ Z^{d×R} whose columns all sum to the same value. The log-linear model associated with A is the set of positive probability tables

MA = { p = (pi) ∈ int(∆_{R−1}) : log p = (log pi) ∈ rowspan(A) },

where rowspan(A) is the linear space spanned by the rows of A.
Example: Independence model
X, Y: two discrete r.v. with joint probabilities pij > 0

X⊥⊥Y is equivalent to

log pij = log pi+ + log p+j = αi + βj,   i ∈ [r], j ∈ [c].

Suppose r = 2 and c = 3. Then log p ∈ R^{2×3} is in the row span of the (r + c) × rc = 5 × 6 matrix

          11 12 13 21 22 23
    α1  (  1  1  1  0  0  0 )
    α2  (  0  0  0  1  1  1 )
A = β1  (  1  0  0  1  0  0 )
    β2  (  0  1  0  0  1  0 )
    β3  (  0  0  1  0  0  1 )
Contingency tables
Based on n-sample, define m-way contingency table U:
Ui = Σ_{k=1}^n 1{X1(k) = i1, . . . , Xm(k) = im},   i = (i1, . . . , im) ∈ R
Let T (n) be the space of non-neg integer tables summing to n.
Definition
We call the vector Au the minimal sufficient statistic for the model MA, and the set of tables

F(u) = { v ∈ N^R : Av = Au }

is the fiber of a contingency table u ∈ T(n) with respect to the model MA.
Example: Independence model
Let u be an r × c table.
For the matrix A encoding the independence model X⊥⊥Y :
Au = ( u·+ ; u+· ),

where u·+ and u+· are the vectors of row and column sums of the table u.
If r = 2 and c = 3:

     ( 1 1 1 0 0 0 )   ( u11 )     ( u1+ )
     ( 0 0 0 1 1 1 )   ( u12 )     ( u2+ )
Au = ( 1 0 0 1 0 0 ) · ( u13 )  =  ( u+1 )
     ( 0 1 0 0 1 0 )   ( u21 )     ( u+2 )
     ( 0 0 1 0 0 1 )   ( u22 )     ( u+3 )
                       ( u23 )
Hierarchical models
Conditional independence:
X1 and X2 conditionally independent given X3 if
P(X1 = i ,X2 = j |X3 = k) = P(X1 = i |X3 = k)P(X2 = j |X3 = k).
Equivalent to matrices Pk = (pijk) having rank at most 1 for all k.
Log-linear formulation:

log pijk = α^{(13)}_{ik} + α^{(23)}_{jk}

No three-way interaction:

log pijk = α^{(12)}_{ij} + α^{(13)}_{ik} + α^{(23)}_{jk}
Conditional inference
Lemma
If p = e^{Aᵀα} ∈ MA and u ∈ T(n), then

P(U = u) = n! / (∏_{i∈R} ui!) · e^{αᵀ(Au)}.

Corollary

The conditional distribution is multivariate hypergeometric:

P(U = u | AU = Au) = [ 1 / ∏_{i∈R} ui! ] / [ Σ_{v∈F(u)} 1 / ∏_{i∈R} vi! ],

and does not depend on p.
Exact test
Consider the hypothesis testing problem

H0 : p ∈ MA versus H1 : p ∉ MA.
Maximum likelihood estimates p̂i

Expected counts ûi = n p̂i (same for all tables in a fiber F(u))

Chi-square statistic

X²(U) = Σ_{i∈R} (Ui − ûi)² / ûi

Exact p-value: P(X²(U) ≥ X²(u) | AU = Au)
Markov chain Monte Carlo
Exact p-value is equal to

[ Σ_{v∈F(u)} 1{X²(v) ≥ X²(u)} / ∏_{i∈R} vi! ] / [ Σ_{v∈F(u)} 1 / ∏_{i∈R} vi! ].

For larger counts or tables it is prohibitive to sum over the entire fiber.

Approximate the p-value by Markov chain Monte Carlo algorithms for sampling tables from the conditional distribution.

With prob. 1, MCMC yields a sequence of tables vt ∈ F(u) such that the proportion of tables with X²(vt) ≥ X²(u) converges to the p-value.

Problem

For an irreducible Metropolis-Hastings sampler, find a finite set of moves that connects any two tables in any fiber.
Lecture outline
1 Fisher’s exact test for 2× 2 contingency tables
2 Log-linear models for multi-way tables
3 Markov bases for exact conditional inference
Markov basis – Definition
Log-linear model MA associated with matrix A
Integer kernel kerZ(A)
Definition
A finite subset B ⊂ kerZ(A) is a Markov basis for MA if for all u ∈ T(n) and all pairs v, v′ ∈ F(u) there exists a sequence u1, . . . , uL ∈ B such that

v′ = v + Σ_{k=1}^L uk   and   v + Σ_{k=1}^l uk ≥ 0 for all l = 1, . . . , L.

The elements of the Markov basis are called moves.
Metropolis-Hastings algorithm
Input: Contingency table u; Markov basis B for the model MA.
Output: Sequence (X²(vt))_{t=1}^∞ for tables vt in the fiber F(u).

Step 1: Initialize v1 = u.

Step 2: For t = 1, 2, . . . repeat the following steps:

(i) Select uniformly at random a move ut ∈ B.
(ii) If min(vt + ut) < 0, then set vt+1 = vt; else set vt+1 = vt + ut with probability q and vt+1 = vt with probability 1 − q, where

q = min{ 1, P(U = vt + ut | AU = Au) / P(U = vt | AU = Au) }.

(iii) Compute X²(vt).
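A minimal numpy sketch of this sampler (an illustration, not the lecture's reference code); it assumes tables are flattened non-negative integer vectors and that B contains one representative per ± pair of moves, with the sign drawn at random:

import numpy as np
from math import lgamma

def log_hypergeom(v):
    # log of the unnormalized conditional probability 1 / prod_i v_i!
    return -sum(lgamma(x + 1) for x in v)

def mh_sampler(u, B, steps, seed=0):
    rng = np.random.default_rng(seed)
    v = np.array(u, dtype=int)
    fiber_walk = []
    for _ in range(steps):
        b = B[rng.integers(len(B))] * rng.choice((-1, 1))
        w = v + b
        if w.min() >= 0:                         # otherwise stay at v
            log_q = log_hypergeom(w) - log_hypergeom(v)
            if np.log(rng.random()) < min(0.0, log_q):
                v = w
        fiber_walk.append(v.copy())
    return fiber_walk

The proportion of sampled tables vt with X²(vt) ≥ X²(u) then approximates the exact p-value.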
Markov basis for independence model
Let eij denote the r × c table with a single entry 1 in row i and column j, and 0s elsewhere.

Proposition

The (unique minimal) Markov basis for the independence model MX⊥⊥Y consists of the following 2 · (r choose 2) · (c choose 2) moves, each having one-norm 4:

B = { ±(eij + ekl − eil − ekj) : 1 ≤ i < k ≤ r, 1 ≤ j < l ≤ c }.
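A hedged sketch generating these moves for an r × c table (positive representatives only; the mh_sampler sketch above draws the sign), with one usage line on the cancer table:

import numpy as np
from itertools import combinations

def independence_moves(r, c):
    # one move e_ij + e_kl - e_il - e_kj per pair of rows and columns
    moves = []
    for i, k in combinations(range(r), 2):
        for j, l in combinations(range(c), 2):
            b = np.zeros((r, c), dtype=int)
            b[i, j] = b[k, l] = 1
            b[i, l] = b[k, j] = -1
            moves.append(b.ravel())
    return moves

walk = mh_sampler([21, 0, 15, 3], independence_moves(2, 2), steps=10000)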
Independence model – Proof
Idea: Show that we can use elements of B to bring any two distinct tables in the same fiber closer to one another.

Claim: Given v ≠ u, v ∈ F(u), show that there is b ∈ B such that (i) u + b ≥ 0 and (ii) ‖u − v‖1 > ‖u + b − v‖1.

Proof: Recall that Au yields the row and column sums:

(a) Since u ≠ v and Au = Av, there is at least one positive entry in u − v. WLOG, u11 − v11 > 0.
(b) Since Au = Av, there is a negative entry in the first row of u − v. WLOG, u12 − v12 < 0.
(c) Similarly, u22 − v22 > 0.
(d) Let b = e12 + e21 − e11 − e22. Then ‖u − v‖1 > ‖u + b − v‖1 and u + b ≥ 0, as desired.
Symbolic computation – 4ti2
Markov basis of the 'no 3-way interaction model' for a 2 × 2 × 2 table?

The matrix representing the model has format 12 × 8 (store in file no3way):

12 8
1 1 0 0 0 0 0 0
0 0 1 1 0 0 0 0
0 0 0 0 1 1 0 0
0 0 0 0 0 0 1 1
1 0 1 0 0 0 0 0
0 1 0 1 0 0 0 0
0 0 0 0 1 0 1 0
0 0 0 0 0 1 0 1
1 0 0 0 1 0 0 0
0 1 0 0 0 1 0 0
0 0 1 0 0 0 1 0
0 0 0 1 0 0 0 1
Symbolic computation – 4ti2
Compute Markov basis (up to sign) using command markov no3way
Output in file no3way.mar:

1 8
1 -1 -1 1 -1 1 1 -1

The two moves

±(e111 + e122 + e212 + e221 − e112 − e121 − e211 − e222)

correspond to the quartic equation

p111 p122 p212 p221 = p112 p121 p211 p222

Recall: pijk ∝ θ^{(12)}_{ij} θ^{(13)}_{ik} θ^{(23)}_{jk}
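A quick numpy check (assuming the coordinate ordering 111, 112, . . . , 222 used in the file no3way) that the computed move indeed lies in the integer kernel of A:

import numpy as np

A = np.array([
    [1,1,0,0,0,0,0,0], [0,0,1,1,0,0,0,0], [0,0,0,0,1,1,0,0], [0,0,0,0,0,0,1,1],
    [1,0,1,0,0,0,0,0], [0,1,0,1,0,0,0,0], [0,0,0,0,1,0,1,0], [0,0,0,0,0,1,0,1],
    [1,0,0,0,1,0,0,0], [0,1,0,0,0,1,0,0], [0,0,1,0,0,0,1,0], [0,0,0,1,0,0,0,1],
])
move = np.array([1, -1, -1, 1, -1, 1, 1, -1])
assert (A @ move == 0).all()   # the sufficient statistic Au is preserved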
Polynomial algebra
Polynomial ring R[p] = R[p1, p2, . . . , pk]

For a non-negative integer table u = (u1, . . . , uk) ∈ N^k define the monomial

p^u = p1^{u1} p2^{u2} · · · pk^{uk}

For an integer table u = u+ − u− ∈ Z^k with positive and negative parts u+, u− ∈ N^k define the binomial

p^{u+} − p^{u−}

Example:

p = ( p11 p12 ; p21 p22 ),   u = ( 2 −2 ; −1 1 )   =⇒   p11² p22 − p12² p21
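The passage from an integer table to its binomial is mechanical; a small sympy sketch (illustration only) reproducing the example:

import sympy as sp

p11, p12, p21, p22 = sp.symbols('p11 p12 p21 p22')
p = (p11, p12, p21, p22)
u = (2, -2, -1, 1)                  # the example table, read row by row
plus = sp.prod([pi**max(ui, 0) for pi, ui in zip(p, u)])    # p^{u+}
minus = sp.prod([pi**max(-ui, 0) for pi, ui in zip(p, u)])  # p^{u-}
print(plus - minus)                 # p11**2*p22 - p12**2*p21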
Polynomial algebra
A subset I ⊂ R[p] is an ideal if

f, g ∈ I =⇒ f + g ∈ I
f ∈ I, h ∈ R[p] =⇒ hf ∈ I

Hilbert's basis theorem:
Every ideal I has a finite generating set f1, . . . , fm ∈ R[p], that is,

I = 〈 f1, . . . , fm 〉 = { Σ_{i=1}^m hi fi : h1, . . . , hm ∈ R[p] }
Fundamental theorem
Given a matrix A ∈ N^{d×k} for a log-linear model, define the (toric) ideal

IA := 〈 p^{u+} − p^{u−} : u ∈ kerZ(A) 〉 ⊂ R[p].

Theorem (Fundamental theorem of Markov bases)

A subset B of kerZ(A) is a Markov basis if and only if the corresponding set of binomials { p^{b+} − p^{b−} : b ∈ B } generates the ideal IA. In particular, a (finite) Markov basis always exists.
Example: Independence model for 2× 2 table
We have shown that a Markov basis (up to sign) is given by
b = ( 1 −1 ; −1 1 )

Hence, IA = I* := 〈 p11 p22 − p12 p21 〉

Example for IA ⊆ I*: Consider the tables

u = ( 4 1 ; 2 5 ),   v = ( 3 2 ; 3 4 ).

Since u − b = v, we have u − b+ = v − b− and thus

p11⁴ p12 p21² p22⁵ − p11³ p12² p21³ p22⁴ = p11³ p12 p21² p22⁴ (p11 p22 − p12 p21) ∈ I*
Computing Markov bases
Theorem
The ideal IA is a homogeneous ideal and its homogeneous elements are exactly the homogeneous polynomials f in R[p] that vanish on the log-linear model MA:

f(p) = 0 for all p ∈ MA.

For a matrix A = (aij) ∈ N^{d×k}, compute a Markov basis by eliminating the variables θ1, . . . , θd from the equation system

pj − θ1^{a1j} θ2^{a2j} · · · θd^{adj} = 0,   j = 1, . . . , k.

Software for Gröbner basis calculations: Macaulay2, Singular, 4ti2
Example: No 3-way interaction in 2× 2× 2 table
Equation system:
p111 = α11β11γ11, p112 = α11β12γ12,
p121 = α12β11γ21, p122 = α12β12γ22,
p211 = α21β21γ11, p212 = α21β22γ12,
p221 = α22β21γ21, p222 = α22β22γ22.
Variable elimination: Every relation among the pijk is a polynomial multiple of

p111 p122 p212 p221 − p112 p121 p211 p222

Markov basis:

±(e111 + e122 + e212 + e221 − e112 − e121 − e211 − e222)
Singular session
LIB "elim.lib";
ring R = 0,(p111,p112,p121,p122,p211,p212,p221,p222,
            a11,a12,a21,a22,b11,b12,b21,b22,c11,c12,c21,c22),dp;
ideal M = p111 - a11*b11*c11,
          p112 - a11*b12*c12,
          p121 - a12*b11*c21,
          p122 - a12*b12*c22,
          p211 - a21*b21*c11,
          p212 - a21*b22*c12,
          p221 - a22*b21*c21,
          p222 - a22*b22*c22;
eliminate(M, a11*a12*a21*a22*b11*b12*b21*b22*c11*c12*c21*c22);
Background reading
Cox, D.; Little, J.; O'Shea, D. Ideals, Varieties, and Algorithms. Springer, New York, 2007.
Database: http://mbdb.mis.mpg.de
Slim and long tables
Theorem
Let v ∈ Z^k be any integer vector. Then there are r2, r3 ∈ N and a coordinate projection π : Z^{3×r2×r3} → Z^k such that every minimal Markov basis for the no 3-way interaction model on 3 × r2 × r3 tables (X1 a r.v. with 3 states, X2 and X3 r.v. with r2 and r3 states, resp.) contains a table u with π(u) = v.
Theorem
Fix a set of interactions Γ for a hierarchical log-linear model, and fix r2, . . . , rm. There exists a number b(Γ, r2, . . . , rm) < ∞ such that the one-norms of the elements of any minimal Markov basis for Γ on s × r2 × · · · × rm tables are less than or equal to b(Γ, r2, . . . , rm). This bound is independent of s, which can grow large.
Exercise
Exercises 6.1 and 6.2 in the lecture notes
Perform an exact test for your favorite table
e.g. test 'no 3-way interaction' in the example from Agresti (2002) shown earlier:

                          Home
President   Busing     Yes    No    ???

Yes         Yes         41    65      0
            No          71   157      1
            ???          1    17      0

No          Yes          2     5      0
            No           3    44      0
            ???          1     0      0

???         Yes          0     3      1
            No           0    10      0
            ???          0     0      1

??? = 'don't know'
Part II
Likelihood Ratio Tests and Singularities
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff's theorem
6 Examples
Lecture outline
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff’s theorem
6 Examples
Example: Bayesian network
Sachs et al. (2005): Analysis of flow cytometry data
Expression values for 11 proteins discretized −→ ternary variables
Large sample size (observational part: n = 1200)
Bayesian network (conditional independence model):
Typical task: test absence of edges
Likelihood ratio test of absence of 'PKC → PKA' can be based on a χ²_4 distribution

See Chapter 3 in the lecture notes
Chi-square asymptotics
Theorem
Suppose
(i) {Pθ : θ ∈ Θ} is a regular exponential family (Θ ⊂ Rk open),
(ii) Θ0 ⊂ Θ1 are smooth submanifolds of Θ,
(iii) True parameter point θ0 ∈ Θ0.
Then the likelihood ratio statistic for testing
H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 \Θ0
tends to χ²_{dim(Θ1)−dim(Θ0)} as n → ∞.
Theorem covers Bayesian network example because
interior of probability simplex is regular exponential family, and
Bayesian networks define smooth submanifolds.
Regular exponential families
Definition
Let PΘ = {Pθ : θ ∈ Θ} be a family of probability distributions on X ⊆ R^m that have densities with respect to a measure ν. We call PΘ an exponential family if there is a statistic T : X → R^k and functions h : Θ → R^k and Z : Θ → R such that each distribution Pθ has ν-density

pθ(x) = (1/Z(θ)) exp{ 〈h(θ), T(x)〉 },   x ∈ X.

If

H = { η ∈ R^k : ∫_X exp{ 〈η, T(x)〉 } dν(x) < ∞ }

is an open subset of R^k and h a diffeomorphism between Θ and H, then we say that PΘ is a regular exponential family of order k.
Curved exponential families
Definition
Suppose {Pθ : θ ∈ Θ} is a regular exponential family. If Θ0 is a smooth submanifold of Θ, then {Pθ : θ ∈ Θ0} is a curved exponential family.
Well-developed large-sample theory for CEFs
Estimation and confidence intervals:
Maximum likelihood estimators are asymptotically normal.
Hypothesis testing:
Likelihood ratio statistics have asymptotic chi-square distributions, and so do Wald statistics.

Model selection:

Bayesian information criterion (BIC) is consistent and connected to the asymptotics of marginal likelihood integrals.
Example: Instrumental variables
Estimate coefficient γ43 in the system
X3 = γ35X5 + ε3,
X4 = γ43X3 + γ45X5 + ε4,
X5 = ε5
with εi ∼ N (0, ωi ) independent
(Path diagram: X5 → X3, X5 → X4, X3 → X4; X5 hidden.)

Variable X5 hidden: consider distributions

(X1, . . . , X4) ∼ N(0, Σ(γ, ω))

(γ, ω) → Σ(γ, ω) is a polynomial parametrization
Example: Instrumental variables
Estimate coefficient γ43 in the system
X1 = ε1,
X2 = ε2,
X3 = γ31X1 + γ32X2 + γ35X5 + ε3,
X4 = γ43X3 + γ45X5 + ε4,
X5 = ε5
with εi ∼ N (0, ωi ) independent
(Path diagram: X1 → X3, X2 → X3, X5 → X3, X5 → X4, X3 → X4; X5 hidden.)

Variable X5 hidden

Marginal distribution

(X1, . . . , X4) ∼ N(0, Σ(γ, ω))
Example: Instrumental variables
Covariance matrix parametrization is a polynomial map:
            ( ω1     0      γ31 ω1      γ43 γ31 ω1                                      )
Σ(γ, ω) =   (        ω2     γ32 ω2      γ43 γ32 ω2                                      )
            (               Var[X3]     γ43 Var[X3] + γ35 γ45 ω5                         )
            ( (sym.)                    ω4 + γ43² Var[X3] + γ45² ω5 + 2 γ45 γ43 γ35 ω5   )

with

Var[X3] = ω3 + γ31² ω1 + γ32² ω2 + γ35² ω5

Coordinate σij is a combinatorial expression summing terms associated with 'treks'

i ←− ℓ1 ←− ℓ2 ←− . . . ←− t −→ . . . −→ r2 −→ r1 −→ j
Example: Instrumental variables
In this hidden variable model test
H0 : γ31 = γ32 = 0
Null distribution of the LR statistic (n = 1000):

(Figure: empirical CDF F(x) of the LR statistic under the null; path diagram as above with X5 hidden.)
Algebraic exponential families
Asymptotic behavior of the LRT in instrumental variables example?
Hidden variable models ≠ curved exponential families

What is a suitable general framework to study hidden variable models?

Definition

Suppose {Pθ : θ ∈ Θ} is a regular exponential family. If Θ0 is a semi-algebraic subset of Θ, then the submodel {Pθ : θ ∈ Θ0} is an algebraic exponential family.
Semi-algebraic sets
Definition
Let R[t1, . . . , tk] be the ring of polynomials in the indeterminates t1, . . . , tk with real coefficients. A semi-algebraic set is a finite union of the form

Θ0 = ⋃_{i=1}^m { θ ∈ R^k | f(θ) = 0 for f ∈ Fi and h(θ) > 0 for h ∈ Hi },

where Fi, Hi ⊂ R[t1, . . . , tk] are collections of polynomials and all Hi finite.

Theorem (Tarski-Seidenberg)

If g : R^d → R^k is a polynomial map and Γ is a semi-algebraic set, then Θ0 = g(Γ) is semi-algebraic.
Lecture outline
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff’s theorem
6 Examples
Likelihood ratio test
Independent observations X (1), . . . ,X (n) with unknown distribution
Statistical model {Pθ : θ ∈ Θ}, Θ ⊆ Rk
Suppose Pθ have density functions pθ(x). Define likelihood function
Ln : Θ → R,   θ ↦ ∏_{i=1}^n pθ(X(i)).
Test H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 \Θ0 for some Θ0 ⊂ Θ1 ⊂ Θ.
Definition
The likelihood ratio test rejects H0 if the likelihood ratio statistic
λn = 2 log [ sup_{θ∈Θ1} Ln(θ) / sup_{θ∈Θ0} Ln(θ) ]

is "too large" =⇒ p-value PH0(λn ≥ λobs).
Canonical example: Normal means
Normal mean model {N(θ, Ik) : θ ∈ R^k}

Log-likelihood function

ℓn(θ) = −(nk/2) log(2π) − (n/2) ‖X̄n − θ‖² − (1/2) Σ_{i=1}^n ‖X(i) − X̄n‖².

Sample mean

X̄n = (1/n) Σ_{i=1}^n X(i)

Likelihood ratio statistic for testing H0 : θ ∈ Θ0 vs. H1 : θ ∉ Θ0:

λn = n · inf_{θ∈Θ0} ‖X̄n − θ‖² = inf_{θ∈Θ0} ‖ √n(X̄n − θ0) − √n(θ − θ0) ‖²

where θ0 is the true parameter.
Canonical example: Normal means
Asymptotics of the LR statistic determined by squared Euclidean distance between an N(0, Ik)-point and the "limit of √n(Θ0 − θ0)"

Example: Cuspidal cubic

Bivariate normal mean model

Θ0 cuspidal cubic {(θ1, θ2) : θ1³ = θ2²}

Tangent cone at θ0 = 0 is the half-ray {(θ1, θ2) : θ1 ≥ 0, θ2 = 0}

Limiting distribution of LRT is a mixture of chi-squares:

λn −→D (1/2) χ²_1 + (1/2) χ²_2.
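The mixture can be checked by simulation; a sketch (not from the notes) using the fact that the squared distance from Z ∼ N(0, I2) to the half-ray is Z2² when Z1 ≥ 0 and Z1² + Z2² when Z1 < 0:

import numpy as np

rng = np.random.default_rng(0)
Z = rng.standard_normal((100_000, 2))
# squared distance to the half-ray {theta1 >= 0, theta2 = 0}
dist2 = np.where(Z[:, 0] >= 0, Z[:, 1]**2, (Z**2).sum(axis=1))
# e.g. an empirical tail probability, lying between the chi2_1 and chi2_2 tails
print((dist2 > 3.84).mean())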
Chernoff’s theorem: Preparation
Definition (Tangent cone)
TCθ0(Θ0) = { lim_{n→∞} (θn − θ0)/βn : βn > 0, θn ∈ Θ0, θn −→ θ0 }

Definition (Fisher-information matrix)

Positive semi-definite matrix I(θ) with entries

I(θ)ij = Eθ[ (∂ log pθ(X)/∂θi) · (∂ log pθ(X)/∂θj) ],   i, j ∈ [k].
Chernoff’s theorem (for exponential families)
Theorem
Suppose {Pθ : θ ∈ Θ} is a regular exponential family with Θ ⊆ R^k. Let θ0 ∈ Θ0 ⊆ Θ ⊆ R^k be the true parameter point. If Θ0 is Chernoff-regular at θ0 and n → ∞, then the LR statistic λn for H0 : θ ∈ Θ0 vs. H1 : θ ∉ Θ0 converges to

min_{τ∈TCθ0(Θ0)} ‖ Z − I(θ0)^{1/2} τ ‖²

where Z ∼ N(0, Ik) and I(θ0)^{1/2} is any matrix square root of the Fisher-information I(θ0).
What is Chernoff-regularity?
Condition on how the tangent cone TCθ0(Θ0) approximates the set Θ0 locally at θ0 ∈ Θ0. Allows one to pass from sup_{θ∈Θ0} . . . to sup_{τ∈TCθ0(Θ0)} . . . .

For θ0 = 0:

distance(θ, TC0(Θ0)) = o(‖θ‖),   θ ∈ Θ0,
distance(τ, Θ0) = o(‖τ‖),   τ ∈ TC0(Θ0)

Definition

A set Θ0 ⊆ R^k is Chernoff-regular at θ0 if for all τ ∈ TCθ0(Θ0) and βn ↘ 0 there exists a sequence θn → θ0 in Θ0 such that

lim_{n→∞} (θn − θ0)/βn = τ.
Chernoff-regularity of semi-algebraic sets
Lemma
Semi-algebraic sets are everywhere Chernoff-regular.
Follows from the 'curve selection lemma', which implies that for all τ ∈ TCθ0(Θ0) there exists a (real analytic) map α : [0, ε) → Θ0 with α(0) = θ0 s.t.

τ = lim_{t→0+} (α(t) − α(0)) / t.

Corollary (Testing in a submodel)

Suppose {Pθ : θ ∈ Θ} is a regular exponential family with Θ ⊆ R^k. Let Θ0, Θ1 be semi-algebraic subsets of Θ. If the true parameter θ0 is in Θ0 and n → ∞, then the LR statistic for H0 : θ ∈ Θ0 vs. H1 : θ ∈ Θ1 \ Θ0 converges to

min_{τ∈TCθ0(Θ0)} ‖ Z − I(θ0)^{1/2} τ ‖² − min_{τ∈TCθ0(Θ1)} ‖ Z − I(θ0)^{1/2} τ ‖²,   Z ∼ N(0, Ik).
Lecture outline
4 Algebraic statistical models
5 Large-sample asymptotics and Chernoff’s theorem
6 Examples
Linear spaces
Lemma
If Θ0 is a d-dimensional linear subspace of R^k and X ∼ N(0, Σ) with positive definite covariance matrix Σ, then

inf_{θ∈Θ0} (X − θ)ᵀ Σ⁻¹ (X − θ) ∼ χ²_{k−d}.

Corollary

Likelihood ratio statistic is asymptotically chi-square when testing linear or smooth hypotheses.
Order-restricted inference
Example:
X1: Difference in blood pressure before and after taking 1 pill
X2: Difference in blood pressure before and after taking 2 pills

Suppose X1 ∼ N(µ1, σ0²) and X2 ∼ N(µ2, σ0²) and test:

H0 : µ2 ≥ µ1 ≥ 0 versus H1 : (µ2 < µ1 or µ1 < 0)

or possibly,

H0 : µ2 = µ1 = 0 versus H1 : µ2 ≥ µ1 ≥ 0
Mixture of chi-square distributions
(1/8) · χ²_0 + (1/2) · χ²_1 + (3/8) · χ²_2
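A simulation sketch (an assumption: this mixture arises from the first testing problem, H0 : µ2 ≥ µ1 ≥ 0, at the apex µ = 0): the LR statistic converges to the squared distance from Z ∼ N(0, I2) to the cone C = {x : 0 ≤ x1 ≤ x2}, computable by comparing a few candidate projections:

import numpy as np

def dist2_to_cone(z):
    # candidate projections: apex, the two (clipped) edge rays, z itself
    cands = [np.zeros(2), np.array([0.0, max(z[1], 0.0)])]
    t = max((z[0] + z[1]) / 2.0, 0.0)   # projection onto the ray x1 = x2
    cands.append(np.array([t, t]))
    if 0.0 <= z[0] <= z[1]:
        cands.append(z)                  # z already lies in the cone
    return min(float(((z - c)**2).sum()) for c in cands)

rng = np.random.default_rng(1)
d2 = np.array([dist2_to_cone(z) for z in rng.standard_normal((50_000, 2))])
print((d2 < 1e-12).mean())   # ~ 1/8, the weight of the chi^2_0 atom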
Convex cones – ‘Boundary problems’
Lemma
Distance between a standard normal random vector and a convex cone is distributed like a mixture of chi-square distributions.
Theorem (Miles, 1959; Drton & Klivans, 2009)
(a) H0 : θ ∈ { x ∈ R^k : x1 ≤ x2 ≤ · · · ≤ xk }
    Mixture weights ∝ coeff's of t(t − 1) · · · (t − k + 1)

(b) H0 : θ ∈ { x ∈ R^k : 0 ≤ x1 ≤ x2 ≤ · · · ≤ xk }
    Mixture weights ∝ coeff's of (t − 1)(t − 3) · · · (t − 2k + 1).
Singularities
Geometry of a semi-algebraic set Θ0 ⊆ R^k expresses itself algebraically in the vanishing ideal

I(Θ0) = { f ∈ R[t1, . . . , tk] : f(θ) = 0 for all θ ∈ Θ0 }.

Finite generating set

〈 f1, . . . , fs 〉 = I(Θ0),   f1, . . . , fs ∈ R[t1, . . . , tk]

Definition

A point θ0 in Θ0 is a singularity if the rank of the Jacobian matrix

Jf(θ0) = ( ∂fi(t)/∂tj )|_{t=θ0} ∈ R^{s×k}

is smaller than k − dim Θ0.
Algebraic tangent cone
Let θ0 be a root of the polynomial f ∈ R[t1, . . . , tk] and write

f(t) = Σ_{h=l}^L fh(t − θ0),

where fh is homogeneous, degree(fh) = h, and fl ≠ 0.

Since f(θ0) = 0, the minimal degree l ≥ 1, and we define fθ0,min = fl.

Tangent cone ideal:

{ fθ0,min : f ∈ I(Θ0) } ⊂ R[t1, . . . , tk].

Lemma

Suppose θ0 is a point in the semi-algebraic set Θ0 and f ∈ R[t1, . . . , tk] a polynomial such that f(θ0) = 0 and f(θ) ≥ 0 for all θ ∈ Θ0. Then every tangent vector τ ∈ TCθ0(Θ0) satisfies fθ0,min(τ) ≥ 0.
Example: Cuspidal cubic
Θ0 = {(θ1, θ2) : θ1³ = θ2²}

Tangent cone ideal for θ0 = 0 is generated by t2²

Associated algebraic tangent cone

{θ : θ2² = 0} = {θ : θ2 = 0}

Tangent cone at θ0 = 0 is the half-ray

{θ : θ1 ≥ 0, θ2 = 0}
Instrumental variables – Singularities
Covariance matrix

( ω1     0      γ31 ω1       γ43 γ31 ω1        )
(        ω2     γ32 ω2       γ43 γ32 ω2        )
(               ω3 + . . .   γ35 γ45 ω5 + . . . )
( (sym.)                     ω4 + . . .         )

(Path diagram: X1 → X3, X2 → X3, X5 → X3, X5 → X4, X3 → X4; X5 hidden.)

Vanishing ideal

I = 〈 σ12, σ13 σ24 − σ14 σ23 〉

Singular locus:

{ Σ = (σij) : σ12 = σ13 = σ14 = σ23 = σ24 = 0 }

coincides with H0 : γ31 = γ32 = 0
Instrumental variables – Tangent cone
Singularities are 'zero'

Vanishing ideal is homogeneous and thus equal to the tangent cone ideal

Algebraic tangent cone at a singularity: symmetric matrices of block form

( diagonal 2×2   rank ≤ 1 block )
(                arbitrary 2×2  )

Geometric tangent cone TC is the closed cone that contains all derivative directions. It is equal to the algebraic cone.
Instrumental variables – Asymptotics
Proposition
Consider testing

H0 : γ31 = γ32 = 0

in the instrumental variables example. Under the null and as n → ∞,

λn −→d max{ eigenvalues(W(2, I)) }

where W(2, I) is a standard 2 × 2 Wishart matrix with 2 degrees of freedom.

'Proof' (Details in worked exercises 6.4 and 6.5 in the lecture notes)

Tangent cone invariant under transformation with a matrix square root of the Fisher-information

Distance between a 2 × 2 matrix A and {rank ≤ 1} is given by the smaller singular value of A
Factor analysis
Factor analysis (conditional independence given hidden variable)
X1 = γ1 H + ε1,
X2 = γ2 H + ε2,
X3 = γ3 H + ε3,
X4 = γ4 H + ε4

(Graph: H → X1, X2, X3, X4.)

Multivariate normal distributions N4(µ, Σ) with µ ∈ R⁴ and Σ in

Θ0 = { ∆ + γγᵀ | ∆ ∈ R^{4×4} positive definite diagonal, γ ∈ R⁴ }

Software (e.g. factanal in R) for testing

H0 : Σ ∈ Θ0 vs. H1 : Σ ∉ Θ0,

uses the LRT and a χ²_2-approximation
Factor analysis
Histograms of 20,000 simulated p-values for sample size n = 1000:
(Figure: four histograms of the p-values, one for each loading vector Γ = (1, 1, 1, 1)ᵀ, (1, 1, 1, 0)ᵀ, (1, 1, 0, 0)ᵀ, and (1, 0, 0, 0)ᵀ.)

Factor loadings 0 or 1, cond. variances 1/3 =⇒ correlations 0 or 3/4.

Three types of limiting distributions?
Factor analysis – Singular session
LIB "sing.lib";
LIB "linalg.lib";

ring R = 0,(s11,s12,s13,s14, s22,s23,s24, s33,s34, s44,
            d1,d2,d3,d4, g1,g2,g3,g4),dp;

// Compute the vanishing ideal by elimination
ideal F = s11-(d1+g1^2), s12-g1*g2, s13-g1*g3, s14-g1*g4,
          s22-(d2+g2^2), s23-g2*g3, s24-g2*g4,
          s33-(d3+g3^2), s34-g3*g4,
          s44-(d4+g4^2);
ideal I = eliminate(F, d1*d2*d3*d4*g1*g2*g3*g4);
I;
Factor analysis – Singular session
ring RR = 0,(s11,s12,s13,s14, s22,s23,s24, s33,s34, s44),dp;
ideal I = fetch(R,I);
dim(groebner(I));

// Compute the singularities
ideal S = slocus(I); S;
primdecGTZ(S);

// Tangent cone at diagonal matrix
tangentcone(I);
// at matrix with s12=1
tangentcone( subst(I,s12,s12+1) );
// at regular point with s12=s13=1
tangentcone( subst(I,s12,s12+1,s13,s13+1) );
Factor analysis: Singularities and tangent cones
Theorem (D, 2009)
(i) A covariance matrix Σ is a singularity of the one-factor model if and only if Σ has at most one non-zero off-diagonal entry σij, i < j.

(ii) If Σ is diagonal then the tangent cone is the topological closure of

{ ∆ + γγᵀ | ∆ ∈ R^{m×m} diagonal, γ ∈ R^m }.

(iii) If Σ has exactly one non-zero off-diagonal entry that is positive, say σ12 > 0, then the tangent cone is the set of symmetric matrices

    ( θ11   θ12   θ13    . . .   θ1m  )
θ = ( θ12   θ22   cθ13   . . .   cθ1m ),   c ∈ [ σ12/σ11, σ22/σ12 ].
    (             θ33    . . .        )
    (                            θmm  )

The case σ12 < 0 is similar with c < 0.
Exercise: RC association model (Haberman, 1981)
Two discrete r.v. X1 and X2 with r1 and r2 states, respectively.
Logarithmic parametrization
log pij = αi + βj + γiδj , i ∈ [r1], j ∈ [r2]
What are the singularities? (in log-prob coordinates)

What do the tangent cones at the singularities look like?

What is the asymptotic distribution of the likelihood ratio statistic when testing the independence model X1⊥⊥X2 against the RC association model?
Part III
Bayesian Integrals
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Model selection: Setup
Observations X (1), . . . ,X (n) ∼ P i.i.d.
Unknown P assumed to be in (identifiable) ambient statistical model
{Pθ : θ ∈ Θ}, Θ ⊆ Rk .
True parameter θ0 is such that Pθ0 = P.
Call submodel given by Θ0 ⊂ Θ true if θ0 ∈ Θ0.
Model selection problem
Find the "simplest" true model from a set of competing submodels associated with

Θ1, Θ2, . . . , ΘM ⊆ Θ.
Score-based search
Strategy
Assign a score to each model and maximize the score.
Assume densities pθ(x), and define likelihood function
Ln : Θ → R,   θ ↦ ∏_{i=1}^n pθ(X(i)).

For submodel Θi, let

ℓ̂n(i) = sup{ log Ln(θ) | θ ∈ Θi },   i = 1, . . . , M.

If Θ1 ⊆ Θ2, then ℓ̂n(1) ≤ ℓ̂n(2).
Information criteria
Definition
The information criterion associated with a penalty function πn : [M] → R assigns the score

τn(i) = ℓ̂n(i) − πn(i)

to the i-th model, i = 1, . . . , M.

Example

AIC: πn(i) = dim(Θi) (Akaike)

BIC: πn(i) = (dim(Θi)/2) log(n) (Bayesian, Schwarz)

Information criteria strike a balance between model fit and model dimensionality.
Basic consistency result
Theorem (compare Haughton, 1988)
Consider a regular exponential family (Pθ | θ ∈ Θ); in particular, Θ ⊆ R^k is open. Let Θ1, Θ2 ⊆ Θ be any two sets.

1 Suppose θ0 ∈ Θ2 \ Θ1. If (1/n) |πn(2) − πn(1)| → 0 as n → ∞, then

lim_{n→∞} Pθ0( τn(1) < τn(2) ) = 1.

2 Suppose θ0 ∈ Θ1 ∩ Θ2. If πn(1) − πn(2) → ∞ as n → ∞, then

lim_{n→∞} Pθ0( τn(1) < τn(2) ) = 1.
Consistency
Corollary
Suppose a collection of models is given by closed sets Θ1, Θ2, . . . , ΘM. If the collection is closed under intersections, and Θi ⊂ Θj implies dim(Θi) < dim(Θj), then:
1 AIC identifies a true model with prob one as n→∞.
2 BIC identifies smallest true model with prob one as n→∞.
Example
1 Linear regression (random design)
2 Undirected graphical models
3 Determining rank in reduced-rank regression ('singularities')

4 Determining number of factors in factor analysis ('singularities')

5 Directed graphical models ('faithfulness'), hidden var's ('singularities')
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Bayesian model determination
Prior probability of model i :
P(Θi ), i = 1, . . . ,M
Prior distribution of parameter in model i :
Qi (θ), θ ∈ Θi
Likelihood function:
Ln(θ | X(1), . . . , X(n)) = ∏_{i=1}^n pθ(X(i))

Posterior probability of model i:

P(Θi | X(1), . . . , X(n)) ∝ P(Θi) ∫_{Θi} Ln(θ | X(1), . . . , X(n)) dQi(θ),

where the integral is the marginal/integrated likelihood.
Marginal likelihood
In typical applications, the models are parametrized:
θ = gi (γ), γ ∈ Rd
Priors Qi specified via distributions on γ that have densities pi (γ)
Marginal likelihood for one model (suppressing index i):
µn = ∫_{R^d} Ln( g(γ) | X(1), . . . , X(n) ) p(γ) dγ
   = ∫_{R^d} e^{ℓn( g(γ) | X(1), . . . , X(n) )} p(γ) dγ
Frequentist view
Suppose X (1), . . . ,X (n), · · · ∼ Pθ0 are i.i.d. with θ0 = g(γ0).
What is the asymptotic behavior of the sequence (µn)?
Asymptotics for marginal likelihood integrals
Theorem (Laplace approximation; Haughton, 1988)
Let {Pθ : θ ∈ Θ} be a regular exponential family with Θ ⊆ R^k. Consider an open set Γ ⊆ R^d and a smooth injective map g : Γ → R^k with continuous inverse. Let θ0 = g(γ0) be the true parameter, and assume that the prior density p(γ) is smooth and positive in a neighborhood of γ0. Then

log µn = ℓ̂n − (d/2) log(n) + Op(1),

where

ℓ̂n = sup_{γ∈Γ} ℓn( g(γ) | X(1), . . . , X(n) ).

Recall: Rn = Op(1) if ∀ε > 0 ∃Mε ∀n : P(|Rn| > Mε) < ε

Haughton actually gives an expansion of log µn up to Op(n^{−1/2})
Example: Normal means model
Observations:
X(1), . . . , X(n) ∼ N(θ, Ik×k),   θ ∈ Θ = R^k

Likelihood function:

Ln(θ | X(1), . . . , X(n)) = ( (2π)^{−k/2} )^n exp{ −(n/2) ‖X̄n − θ‖² }   (up to a factor not depending on θ)

Model parametrization g : R^d → R^k

Marginal likelihood

µn = Cn ∫_{R^d} exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ
Cuspidal cubic
Model Θ0 = { θ ∈ R² : θ2² = θ1³ }

Parametrized by g(γ) = (γ², γ³)

If γ0 ≠ 0, i.e., g(γ0) ≠ 0, then Haughton's Theorem applies.

If θ0 = g(γ0) ≠ 0, then

log ∫_{−∞}^{∞} exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ = −(1/2) log(n) + Op(1).

(Exponent ≈ quadratic in γ, Gaussian density with variance c/n)

What if θ0 = 0 ⟺ γ0 = 0?
Cuspidal cubic
Integral with normalizing constant omitted:

∫_{−∞}^{∞} exp{ −(1/2) [ (√n γ² − √n X̄n,1)² + (√n γ³ − √n X̄n,2)² ] } p(γ) dγ

Change of variables γ̄ = n^{1/4} γ:

n^{−1/4} ∫_{−∞}^{∞} exp{ −(1/2) [ (γ̄² − √n X̄n,1)² + (γ̄³/n^{1/4} − √n X̄n,2)² ] } p(γ̄/n^{1/4}) dγ̄.

Let θ0 = 0 and Z1, Z2 ~ind N(0, 1). Limit when multiplying by n^{1/4}:

∫_{−∞}^{∞} exp{ −(1/2) [ (γ̄² − Z1)² + Z2² ] } p(0) dγ̄.

Hence, log µn = ℓ̂n − (1/4) log(n) + Op(1)
Observation
Sequence of random integrals:

log µn = log ∫_{R^d} Cn exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ

       = ℓ̂n − (1/2) log(n) + Op(1)   if γ0 ≠ 0,
       = ℓ̂n − (1/4) log(n) + Op(1)   if γ0 = 0

Deterministic integrals (replace X̄n by the expectation θ0 = g(γ0)):

log ∫_{−∞}^{∞} Cn exp{ −(n/2) ‖g(γ0) − g(γ)‖² } p(γ) dγ

       = n log(C) − (1/2) log(n) + O(1)   if γ0 ≠ 0,
       = n log(C) − (1/4) log(n) + O(1)   if γ0 = 0
Same asymptotics!
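The γ0 = 0 rate in the deterministic integral is easy to verify numerically; a sketch, assuming scipy and a flat prior on [−1, 1], using ‖g(0) − g(γ)‖² = γ⁴ + γ⁶:

import numpy as np
from scipy.integrate import quad

def J(n):
    # deterministic Laplace integral for the cuspidal cubic at gamma_0 = 0
    return quad(lambda g: np.exp(-0.5 * n * (g**4 + g**6)), -1, 1)[0]

ns = [1e2, 1e3, 1e4]
slope = np.polyfit(np.log(ns), np.log([J(n) for n in ns]), 1)[0]
print(slope)   # ~ -1/4, i.e. the -(1/4) log(n) term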
Laplace integrals
Theorem
Let {Pθ : θ ∈ Θ} be a regular exponential family. Consider a polynomial map g : R^d → Θ, and let θ0 = g(γ0) be the true parameter. Assume that the prior density p(γ) is smooth and positive on a compact and semi-analytic supporting set. Then

log µn = ℓ̂n − q log(n) + (s − 1) log log(n) + Op(1),

where the rational number q ∈ (0, d/2] and the integer s ∈ [d] satisfy

log ∫ e^{−n ‖g(γ) − θ0‖²} p(γ) dγ = −q log(n) + (s − 1) log log(n) + O(1).
Remark
The remainder can be shown to converge in distribution.
Watanabe’s book
The theorem is proven in the book by Watanabe.

Watanabe also discusses algebraic techniques for computing the learning coefficient = growth index q and the multiplicity s.

Singular integrals:

Arnol'd, V.I.; Gusein-Zade, S.M.; Varchenko, A.N. Singularities of differentiable maps. Vol. I & II, 1985/88.

Work by Michael Greenblatt at UIC
Example: Sample vs true mean in normal means model
Random integral
log µn = log ∫_{R^d} exp{ −(n/2) ‖X̄n − g(γ)‖² } p(γ) dγ

Simple bound for any a > 0:

2 |〈X̄n − θ0, g(γ) − θ0〉| ≤ a ‖X̄n − θ0‖² + (1/a) ‖g(γ) − θ0‖²

Bound in exponent:

‖X̄n − g(γ)‖² ≤ 2 ‖g(γ) − θ0‖² + 2 ‖X̄n − θ0‖²   (a = 1)
‖X̄n − g(γ)‖² ≥ (1/2) ‖g(γ) − θ0‖² − ‖X̄n − θ0‖²   (a = 2)

If the deterministic integral based on e^{−n ‖g(γ) − θ0‖²} has an asymptotic expansion then the random integrals have the same growth behavior.
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Zeta function
Polynomial map f : Rd → [0,∞)
Smooth prior p(γ), positive on compact semi-analytic support
Laplace integral:

∫ e^{−n f(γ)} p(γ) dγ

Zeta function:

ζ(λ) = ∫ f(γ)^λ p(γ) dγ,   λ ∈ C, Re(λ) > 0

Theorem

The zeta function ζ(λ) can be continued (uniquely) to a meromorphic function on all of C. All poles are negative rational numbers. The largest pole is −q, the negated growth index, and the multiplicity s is the multiplicity of this pole.
Local view
For large n, the main contribution to

∫ e^{−n f(γ)} p(γ) dγ

comes from a neighborhood of

Vf = { γ : f(γ) = 0 } ∩ supp(p).

Since the prior support is assumed compact, study the asymptotics of

∫_{U(γ0)} e^{−n f(γ)} p(γ) dγ,   U(γ0) a small neighborhood of γ0,

for all γ0 ∈ Vf.

Note: For the marginal likelihood, f(γ) = 0 ⟺ g(γ) = θ0 ('identifiability' issues)
Resolution of singularities
Theorem (Hironaka, 1964; Atiyah, 1970)
In the considered setup, for every γ0 ∈ Vf, there exist

a neighborhood U(γ0) of γ0 ∈ R^d and

changes of coordinates

such that the zeta function becomes a finite sum of the form

∫_{U(γ0)} f(γ)^λ p(γ) dγ = Σ_α ∫_{[0,b]^d} ( u1^{2k1(α)} · · · ud^{2kd(α)} )^λ φα(u) u1^{h1(α)} · · · ud^{hd(α)} du,

where the φα are smooth and bounded away from zero on [0, b]^d.
Largest pole and multiplicity
Once in 'normal crossing form', meromorphic continuation and determination of poles is clear.

Example:

∫ (u^{2k})^λ u^h du = u^{2kλ+h+1} / (2kλ + h + 1),   pole at λ = −(h + 1)/(2k)

Growth index:

q = min_α min_{1≤j≤d} (hj(α) + 1) / (2 kj(α))

Multiplicity:

s = max_α #{ j : (hj(α) + 1) / (2 kj(α)) = q }
Example: Blow-up transformations
Product interval:

∫_{−1}^1 ∫_{−1}^1 e^{−n(x^4 + y^6)} dy dx ∼ n^{−1/4} n^{−1/6} · C = n^{−5/12} · C

Resolve by repeatedly applying the blow-up transformation, i.e., the pair of charts

x = x1, y = x1 y1;   x = x2 y2, y = y2.
Example: Blow-up transformations
First blow-up transformation gives

x^4 + y^6 = x1^4 (1 + x1^2 y1^6)    Jacobian: x1
          = y2^4 (x2^4 + y2^2)      Jacobian: y2

In 1st coordinates normal crossing, 4λ + 2 = 0, pole −1/2
In 2nd coordinates not normal crossing, repeat:

y^4 (x^4 + y^2) = x1^6 y1^4 (x1^2 + y1^2)    Jacobian: x1^2 y1
                = y2^6 (1 + x2^4 y2^2)       Jacobian: y2^2

In 2nd coordinates normal crossing, 6λ + 3 = 0, pole −1/2
In 1st coordinates not normal crossing, repeat:

x^6 y^4 (x^2 + y^2) = x1^12 y1^4 (1 + y1^2)    Jacobian: x1^4 y1
                    = x2^6 y2^12 (1 + x2^2)    Jacobian: x2^2 y2^4

Normal crossing in both coordinates: q = 5/12, s = 1
Resolution – Singular session
LIB "resolve.lib";
ring R = 0,(x,y),dp;
ideal J = x4+y6;
list L=resolve(J);
presentTree(L);

list L=resolve(J,0,"A");
presentTree(L);
LIB "reszeta.lib";
list coll=collectDiv(L);
LIB "resgraph.lib";
ResTree(L,coll[1]);
Distance of Newton polyhedron
∫_{−1}^1 ∫_{−1}^1 e^{−n(x^4 + y^6)} dy dx ∼ n^{−1/4} n^{−1/6} · C = n^{−5/12} · C

(Figure: Newton polyhedron of x^4 + y^6; the diagonal ray meets its boundary at (12/5, 12/5).)

Distance:

ρ = 4 · (3/5) = 6 · (2/5) = 12/5   =⇒   q = 1/ρ
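The exponent 5/12 can also be confirmed numerically; a sketch assuming scipy, estimating the slope of log I(n) against log n:

import numpy as np
from scipy.integrate import dblquad

def I(n):
    # the product-interval Laplace integral from the slide
    val, _ = dblquad(lambda y, x: np.exp(-n * (x**4 + y**6)),
                     -1, 1, lambda x: -1.0, lambda x: 1.0)
    return val

ns = [1e2, 1e3, 1e4]
slope = np.polyfit(np.log(ns), np.log([I(n) for n in ns]), 1)[0]
print(slope)   # ~ -5/12 = -1/rho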
Newton polyhedron
Polynomial
f(x) = Σ_{a∈N^d} ca x^a,   x^a = x1^{a1} · · · xd^{ad}

Newton polyhedron Pf is the convex hull of the set

⋃_{a : ca ≠ 0} ( {a} + [0, ∞)^d )

Distance:

ρ = min{ r : r · 1d ∈ Pf }

For A ⊂ R^d, define

fA(x) = Σ_{a∈A∩N^d} ca x^a
Non-degenerate exponents and remoteness
Theorem
If the polynomial f has a minimum at zero and is non-degenerate, that is, for any compact face A of the Newton polyhedron the equation system

∂fA(x)/∂x1 = . . . = ∂fA(x)/∂xd = 0

has no solution in (R \ {0})^d, then for small ε the growth index for the integral

∫_{[−ε,ε]^d} e^{−n f(γ)} p(γ) dγ

is q = 1/ρ and the multiplicity s is the codimension of the lowest-dimensional face containing the point at which the ray spanned by 1d first intersects the Newton polyhedron.
Lecture outline
7 Information criteria for model selection
8 Marginal likelihood integrals
9 Resolution of singularities and Newton polyhedra
10 Reduced rank regression
Reduced rank regression
Multivariate regression model
Y = θX + ε,   θ ∈ R^{a×b}, rank(θ) ≤ h

(Graph: X1, X2 → H → Y1, Y2.)

Multivariate normal model (random design X)

Parametrize

θ = g(α, β) = αβᵀ,   α ∈ R^{a×h}, β ∈ R^{b×h}

Model selection problem: Determine h

WLOG: Assume the coordinates of X and ε are mutually independent with known variances.
Asymptotics – regular case
Consider the model given by rank h.

Suppose the true matrix θ0 has rank r ≤ h.

Interested in the asymptotics of the integral

∫∫ exp{ −n ‖αβᵀ − θ0‖² } dα dβ

Regular case:

The Jacobian of the map g(α, β) = αβᵀ achieves its maximal rank h(a + b − h) at a point (α0, β0) if and only if α0 β0ᵀ has full rank h.

If θ0 has rank r = h, then the set g⁻¹(θ0) ⊆ R^{ah+bh} is a smooth manifold of dimension h².

Reparametrize and apply the Laplace approximation (Haughton's result) to obtain

q = h(a + b − h)/2,   s = 1.
Asymptotics – singular case
Interested in the asymptotics of the integral

∫∫ exp{ −n ‖αβᵀ − θ0‖² } dα dβ

Singular case: rank of θ0 is equal to r < h

Aoyagi & Watanabe (2005): Found growth index q and multiplicity s as a function of (a, b, h, r)

Simplest case with singularities is model rank h = 1
Asymptotics – singular case for rank 1
Model rank h = 1

Only one singular point: θ0 = 0

Fiber

g⁻¹(θ0) = { (α0, β0) : α0 = 0 or β0 = 0 }

is singular at the origin (α0, β0) = 0 and smooth elsewhere.

Local integrals are

∫_{U(α0)} ∫_{U(β0)} exp{ −n (α1² + · · · + αa²)(β1² + · · · + βb²) } dα dβ,   (α0, β0) ∈ g⁻¹(0).
Case 1
Suppose α0 = (α01, . . . , α0k, 0, . . . , 0) ≠ 0. Then β0 = 0.

Shift (α0, β0) to the origin by the transformation ᾱi = αi − α0i.

The local integral becomes

∫_{U(0)} exp{ −n [ (ᾱ1 + α01)² + · · · + (ᾱk + α0k)² + ᾱ_{k+1}² + · · · + ᾱa² ] (β1² + · · · + βb²) } d(ᾱ, β)

The function of ᾱ in the exponent is bounded away from zero in a neighborhood U(0).

Asymptotics determined by that of

∫_{U(0)} exp{ −n (β1² + · · · + βb²) } dβ

which is a regular integral with growth index b/2 and multiplicity 1.
Case 2
Suppose α0 = β0 = 0.

Resolve (α1² + · · · + αa²)(β1² + · · · + βb²) by applying a blow-up to the first term and a blow-up to the second term.

We obtain

∫_{U(0,0)} α1^{2λ} β1^{2λ} α1^{a−1} β1^{b−1} (1 + α2² + . . .)^λ (1 + β2² + . . .)^λ dα dβ.

Consider

∫ α1^{2λ+a−1} β1^{2λ+b−1} dα1 dβ1 = α1^{2λ+a} β1^{2λ+b} / [ (2λ + a)(2λ + b) ].

Poles λ = −a/2 and λ = −b/2.
Asymptotics for rank 1
Proposition
The marginal likelihood for the reduced rank regression model for rank h = 1 has growth index and multiplicity

(q, s) = ( (a + b − 1)/2, 1 )   if θ0 ≠ 0,
(q, s) = ( min{a, b}/2, 1 )     if θ0 = 0 and a ≠ b,
(q, s) = ( a/2 = b/2, 2 )       if θ0 = 0 and a = b.

This can also be shown by looking at the Newton diagrams.
Exercise: Factor analysis
Let H and ε1, . . . , εd be mutually independent N (0, 1) r.v.
Define
X = αH + ε,   α ∈ R^d

Then X ∼ N(0, θ) with covariance matrix θ = I + ααᵀ, α ∈ R^d

(Graph: H → X1, . . . , X4.)

What is the growth behaviour of the marginal likelihood of this model?
Conclusion
Algebraic statistical models: a useful framework for discussing non-smooth statistical models.

Computational algebra: Markov bases, vanishing ideals, singular loci, tangent cones, resolution of singularities, . . .

Many open questions about classical statistical models . . .