INFORMATION AND STATISTICS
Andrew R. Barron
YALE UNIVERSITY
DEPARTMENT OF STATISTICS
Presentation, April 30, 2015
Information Theory Workshop, Jerusalem
Information and Statistics
Topics in the abstract, from which I make a selection:

Information Theory and Inference:
- Flexible high-dimensional function estimation
- Neural nets: sigmoidal and sinusoidal activation functions
- Approximation and estimation bounds
- Minimum description length principle
- Penalized likelihood risk bounds and minimax rates
- Computational strategies

Achieving Shannon Capacity:
- Communication by regression
- Sparse superposition coding
- Adaptive successive decoding
- Rate, reliability, and computational complexity

Information Theory and Probability:
- General entropy power inequalities
- Entropic central limit theorem and its monotonicity
- Monotonicity of relative entropy in Markov chains
- Monotonicity of relative entropy in statistical mechanics
Andrew Barron Information Theory & Statistics of High-Dim Function Estimation
Plan for Information and Inference
Setting:
- Univariate & multivariate polynomials, sinusoids, sigmoids
- Fit to training data
- Statistical risk is the error of generalization to new data

The challenge of high-dimensional function estimation:
- Estimation failure of rigid approximation models in high dimensions
- Computational difficulties of flexible models in high dimensions

Flexible approximation:
- by stepwise subset selection
- by optimization of parameterized basis functions

Approximation bounds: relate error to the number of terms

Information-theoretic risk bounds: relate error to the number of terms and the sample size

Computational challenge: constructing an optimization path
The Problem
From observational or experimental data, relate a response variable Y to several explanatory variables X_1, X_2, ..., X_d
Common task throughout science and engineering
Central to the "Scientific Method"
Aspects of this problem are variously called: statistical regression, prediction, response surface estimation, analysis of variance, function fitting, function approximation, nonparametric estimation, high-dimensional statistics, data mining, machine learning, computational learning, pattern recognition, artificial intelligence, cybernetics, artificial neural networks, deep learning
Dimensionality
The blessing and the curse of dimensionality
With an increasing number of variables d there is exponential growth in the number of distinct terms that can be combined in modeling the function.

A larger number of relevant variables d allows, in principle, for better approximation of the response.

Large d might lead to a need for an exponentially large number of observations n, or for exponentially large computation time.

Under what conditions can we take advantage of the blessing and overcome the curse?
Example papers for some of what is to follow
Papers illustrating my background addressing these questions of high dimensionality (available from www.stat.yale.edu):

- A. R. Barron, R. L. Barron (1988). Statistical learning networks: a unifying view. Computing Science & Statistics: Proc. 20th Symp. on the Interface, ASA, pp. 192-203.
- A. R. Barron (1993). Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, Vol. 39, pp. 930-944.
- A. R. Barron, A. Cohen, W. Dahmen, R. DeVore (2008). Approximation and learning by greedy algorithms. Annals of Statistics, Vol. 36, pp. 64-94.
- A. R. Barron, C. Huang, J. Q. Li, and Xi Luo (2008). MDL principle, penalized likelihood, and statistical risk. Proc. IEEE Information Theory Workshop, Porto, Portugal, pp. 247-257. Also in Festschrift for Jorma Rissanen, Tampere Univ. Press, Finland.
Data Setting
Data: (X_i, Y_i), i = 1, 2, ..., n

Inputs: explanatory variable vectors X_i = (X_{i,1}, X_{i,2}, ..., X_{i,d})

Domain: either a unit cube in R^d or all of R^d

Random design: independent X_i ~ P

Output: response variable Y_i in R, with moment conditions (Bernstein constant c)

Relationship: E[Y_i | X_i] = f(X_i), as in:
- Perfect observation: Y_i = f(X_i)
- Noisy observation: Y_i = f(X_i) + ε_i with ε_i independent N(0, σ²)
- Classification: Y ∈ {0, 1} with f(X) = P[Y = 1 | X]

Function: f(x) unknown
Univariate function approximation: d = 1
Basis functions for series expansion:

    φ_0(x), φ_1(x), ..., φ_K(x), ...

Polynomial basis (with degree K):

    1, x, x², ..., x^K

Sinusoidal basis (with period L, and with K = 2k):

    1, cos(2π(1/L)x), sin(2π(1/L)x), ..., cos(2π(k/L)x), sin(2π(k/L)x)

Piecewise constant on [0, 1]:

    1{x ≥ 0}, 1{x ≥ 1/K}, 1{x ≥ 2/K}, ..., 1{x ≥ 1}

Other spline bases and wavelet bases
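As a concrete sketch of these bases (the target function, degrees, and sample size below are illustrative choices of mine, not from the slides), each basis can be assembled as a design matrix and fit by ordinary least squares:

```python
import numpy as np

# Sketch: build the three univariate bases above as design matrices
# and fit a toy target by least squares. All specific values are illustrative.

def poly_basis(x, K):
    # columns 1, x, x^2, ..., x^K
    return np.vander(x, K + 1, increasing=True)

def sinusoid_basis(x, k, L=1.0):
    # columns 1, cos(2*pi*(j/L)*x), sin(2*pi*(j/L)*x) for j = 1..k (so K = 2k)
    cols = [np.ones_like(x)]
    for j in range(1, k + 1):
        cols.append(np.cos(2 * np.pi * (j / L) * x))
        cols.append(np.sin(2 * np.pi * (j / L) * x))
    return np.column_stack(cols)

def step_basis(x, K):
    # columns 1{x >= j/K} for j = 0..K, piecewise constant on [0, 1]
    return np.column_stack([(x >= j / K).astype(float) for j in range(K + 1)])

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(200)

resid = {}
for name, Phi in [("poly", poly_basis(x, 5)),
                  ("sin", sinusoid_basis(x, 3)),
                  ("step", step_basis(x, 10))]:
    c, *_ = np.linalg.lstsq(Phi, y, rcond=None)
    resid[name] = np.mean((y - Phi @ c) ** 2)
    print(name, Phi.shape[1], "terms, mean sq. residual %.4f" % resid[name])
```

Here the sinusoidal basis contains the target frequency, so its residual drops to roughly the noise level, while the other bases only approximate.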
Univariate function approximation: d = 1
Standard 1-dim approximation models project to the linear span of the basis.

Rigid form (not flexible), with coefficients c_k adjusted to fit the response:

    f_K(x) = ∑_{k=0}^{K} c_k φ_k(x)

Flexible form, with a subset k_1, ..., k_m chosen to best fit the response, for a given number of terms m:

    ∑_{j=1}^{m} c_j φ_{k_j}(x)

Fit by all-subsets regression (if m and K are not too large) or by forward stepwise regression, selecting from the dictionary Φ = {φ_0, φ_1, ..., φ_K}
Multivariate function approximation: d > 1
Multivariate product bases:

    φ_k(x) = φ_{k_1,k_2,...,k_d}(x_1, x_2, ..., x_d) = φ_{k_1}(x_1) φ_{k_2}(x_2) ··· φ_{k_d}(x_d)

Rigid approximation model:

    ∑_{k_1=0}^{K} ∑_{k_2=0}^{K} ··· ∑_{k_d=0}^{K} c_k φ_k(x)

Exponential size: (K+1)^d terms in the sum. Requires an exponentially large sample size n >> (K+1)^d for accurate estimation. Statistically and computationally problematic.
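The (K+1)^d blow-up is easy to see in numbers (K = 5 and m = 50 are illustrative choices, not from the slides):

```python
# Term count (K+1)^d of the rigid tensor-product model, versus a
# flexible m-term subset whose size does not grow with d.
K, m = 5, 50
for d in (1, 2, 5, 10, 20):
    print("d = %2d: rigid model has %d terms; flexible model keeps m = %d"
          % (d, (K + 1) ** d, m))
```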
Flexible multivariate function approximation: d > 1
BY SUBSET SELECTION: a subset k_1, ..., k_m is chosen to fit the response, with a given number of terms m:

    ∑_{j=1}^{m} c_j φ_{k_j}(x)

Full forward stepwise selection: computationally infeasible for large d because the dictionary is exponentially large, of size (K+1)^d.

Ad hoc stepwise selection:
- SAS stepwise polynomials; Friedman MARS; Barron-Xiao MAPS (Ann. Statist. 1991)
- Each step searches only incremental modifications of terms
- Manageable number of choices, mKd, at each step
- Computationally fast; not known whether it approximates well
Flexible multivariate function approximation: d > 1
By internally parameterized models & nonlinear least squares: fit functions

    f_m(x) = ∑_{j=1}^{m} c_j φ(x, θ_j)

in the span of a parameterized dictionary Φ = {φ(·, θ) : θ ∈ Θ}.

Product bases, using continuous powers, frequencies, or thresholds:

    φ(x, θ) = φ_1(x_1, θ_1) φ_1(x_2, θ_2) ··· φ_1(x_d, θ_d)

Ridge bases, as in projection pursuit regression models, sinusoidal models, and single-hidden-layer neural nets:

    φ(x, θ) = φ_1(θ_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_d x_d)

Internal parameter vector θ of dimension d+1. The univariate function φ_1(z) is the activation function.
Building Non-linear Dictionaries
Examples of activation functions φ(z)
Perceptron networks: 1{z > 0} or sgn(z)

Sigmoidal networks: e^z/(1+e^z) or tanh(z)

Sinusoidal models: cos(z)

Hinging hyperplanes: (z)_+

Quadratic splines: 1, z, (z)_+²

Cubic splines: 1, z, z², (z)_+³

Polynomials: (z)^q
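A hedged sketch of these dictionary elements as plain functions, combined with the ridge form φ(x, θ) = φ_1(θ_0 + θᵀx) from the previous slide (names and test values are mine):

```python
import numpy as np

# The activation functions listed above, used inside a ridge unit
# phi(x, theta) = act(theta_0 + theta_1*x_1 + ... + theta_d*x_d).
activations = {
    "perceptron":   lambda z: (z > 0).astype(float),       # 1{z > 0}
    "sigmoid":      lambda z: 1.0 / (1.0 + np.exp(-z)),    # e^z / (1 + e^z)
    "sinusoid":     np.cos,
    "hinge":        lambda z: np.maximum(z, 0.0),          # (z)_+
    "quad_spline":  lambda z: np.maximum(z, 0.0) ** 2,     # (z)_+^2
    "cubic_spline": lambda z: np.maximum(z, 0.0) ** 3,     # (z)_+^3
}

def ridge_unit(x, theta, act):
    # theta has dimension d + 1: an intercept plus one weight per input
    return act(theta[0] + x @ theta[1:])

x = np.array([0.2, -0.5, 1.0])
theta = np.array([0.1, 1.0, 2.0, -0.5])   # dimension d + 1 = 4
for name, act in activations.items():
    print(name, ridge_unit(x, theta, act))
```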
Notation
Response vector: Y = (Y_i)_{i=1}^n in R^n

Dictionary vectors: Φ^(n) = { (φ(X_i, θ))_{i=1}^n : θ ∈ Θ } ⊂ R^n

Sample squared norm: ‖f‖²_(n) = (1/n) ∑_{i=1}^n f²(X_i)

Population squared norm: ‖f‖² = ∫ f²(x) P(dx)

Normalized dictionary condition: ‖φ‖ ≤ 1 for φ ∈ Φ
Flexible m-term nonlinear optimization

Impractical one-shot optimization.

Sample version: f_m achieves

    min_{(θ_j, c_j)_{j=1}^m} ‖Y − ∑_{j=1}^m c_j φ_{θ_j}‖²_(n)

Population version: f_m achieves

    min_{(θ_j, c_j)_{j=1}^m} ‖f − ∑_{j=1}^m c_j φ_{θ_j}‖²

Optimization of (θ_j, c_j)_{j=1}^m in R^{(d+2)m}.
Flexible m-term nonlinear optimization

GREEDY OPTIMIZATIONS

Step 1: choose c_1, θ_1 to achieve min ‖Y − c φ_θ‖²_(n)

Step m > 1: arrange

    f_m = α f_{m−1} + c φ(x, θ_m)

with α_m, c_m, θ_m chosen to achieve

    min_{α,c,θ} ‖Y − α f_{m−1} − c φ_θ‖²_(n)

Also acceptable, with res_i = Y_i − f_{m−1}(X_i): choose θ_m to achieve

    max_θ ∑_{i=1}^n res_i φ(X_i, θ)

Reduced dimension of the search space (still problematic?)

Forward stepwise selection of S_m = {φ_{θ_1}, ..., φ_{θ_m}}. Given S_{m−1}, combine the terms to achieve

    min_θ d(Y, span{φ_{θ_1}, ..., φ_{θ_{m−1}}, φ_θ})
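A minimal runnable sketch of the greedy steps above, assuming a cosine ridge dictionary and replacing the inner maximization over θ with a crude random candidate search (the target function, sizes, and search strategy are illustrative, not from the slides):

```python
import numpy as np

# Greedy m-term fit: at each step pick the dictionary element most
# correlated with the current residual, then refit all selected
# elements jointly by least squares (so the sample MSE never increases).
rng = np.random.default_rng(1)
n, d, m = 400, 3, 8
X = rng.standard_normal((n, d))
Y = np.cos(X @ np.array([1.0, -2.0, 0.5])) + 0.05 * rng.standard_normal(n)

fit, atoms, mses = np.zeros(n), [], []
for step in range(m):
    res = Y - fit
    # approximate max_theta sum_i res_i * phi(X_i, theta) over random candidates
    cand = rng.standard_normal((500, d))
    scores = np.abs(np.cos(X @ cand.T).T @ res)
    atoms.append(np.cos(X @ cand[np.argmax(scores)]))
    # refit coefficients of all selected atoms jointly
    A = np.column_stack(atoms)
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    fit = A @ coef
    mses.append(np.mean((Y - fit) ** 2))
    print(step + 1, "terms, sample MSE %.4f" % mses[-1])
```

The joint refit contains the previous fit in its span, which is one way to realize the relaxed step f_m = α f_{m−1} + c φ(x, θ_m).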
Basic m-term approximation and computation bounds

For either one-shot or greedy approximation (B. IT 1993, Lee et al. IT 1995):

Population version:

    ‖f − f_m‖ ≤ ‖f‖_Φ / √m

and moreover

    ‖f − f_m‖² ≤ inf_g { ‖f − g‖² + 2‖g‖²_Φ / m }

Sample version:

    ‖Y − f_m‖²_(n) ≤ ‖Y − f‖²_(n) + 2‖f‖²_Φ / m

where ‖f‖_Φ is the variation of f with respect to Φ (as will be defined on the next slide).
ℓ1 norm on coefficients in representation of f

Consider the range of a neural net, expressed via the bound

    |∑_j c_j sgn(θ_{0,j} + θ_{1,j} x_1 + ... + θ_{d,j} x_d)| ≤ ∑_j |c_j|

with equality if x is in a polygon where sgn(θ_j · x) = sgn(c_j) for all j.

This motivates the norm

    ‖f‖_Φ = lim_{ε→0} inf{ ∑_j |c_j| : ‖∑_j c_j φ_{θ_j} − f‖ ≤ ε }

called the variation of f with respect to Φ (B. 1991). Equivalently,

    ‖f‖_Φ = V_Φ(f) = inf{ V : f/V ∈ closure(conv(±Φ)) }

It appears in the bound ‖f − f_m‖ ≤ ‖f‖_Φ / √m.
ℓ1 norm on coefficients in representation of f

Finite sum representations f(x) = ∑_j c_j φ(x, θ_j): the variation is ‖f‖_Φ = ∑_j |c_j|, which is the ℓ1 norm of the coefficients in the representation of f in the span of Φ.

Infinite integral representation f(x) = ∫ e^{iθ·x} f̃(θ) dθ (Fourier representation, with Fourier transform f̃), for x in a unit cube. The variation ‖f‖_Φ is bounded by an L1 spectral norm:

    ‖f‖_cos = ∫_{R^d} |f̃(θ)| dθ

    ‖f‖_step ≤ ∫ |f̃(θ)| ‖θ‖_1 dθ

    ‖f‖_{q-spline} ≤ ∫ |f̃(θ)| ‖θ‖_1^{q+1} dθ

As we said, this ‖f‖_Φ appears in the numerator of the approximation bound.
Statistical Risk
The population accuracy of a function estimated from the sample.

Statistical risk: E‖f̂_m − f‖² = E(f̂_m(X) − f(X))², the expected squared generalization error on new X ~ P of the estimator f̂_m trained on the data (X_i, Y_i)_{i=1}^n.

Minimax optimal risk bound, via information theory:

    E‖f̂_m − f‖² ≤ ‖f_m − f‖² + (c m / n) log N(Φ, δ_n)

Here log N(Φ, δ_n) is the metric entropy of Φ at δ_n = 1/n; with Φ of metric dimension d, it is of order d log(1/δ_n), so

    E‖f̂_m − f‖² ≤ ‖f‖²_Φ / m + (c m d / n) log n

Need only n >> m d rather than n >> (K+1)^d.

The best bound is 2 ‖f‖_Φ √(c d log n / n), attained at m* = ‖f‖_Φ √(n / (c d log n)).
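A quick numeric check of this bias-variance tradeoff (the constant c, the variation v = ‖f‖_Φ, and d, n below are illustrative values):

```python
import numpy as np

# Risk tradeoff b(m) = v^2/m + c*m*d*log(n)/n from the slide,
# minimized at m* = v*sqrt(n/(c*d*log n)) with value 2*v*sqrt(c*d*log(n)/n).
v, c, d, n = 10.0, 1.0, 20, 10**6

def bound(m):
    return v**2 / m + c * m * d * np.log(n) / n

m_star = v * np.sqrt(n / (c * d * np.log(n)))
best = 2 * v * np.sqrt(c * d * np.log(n) / n)
print("m* = %.1f, bound(m*) = %.4f, closed form = %.4f" % (m_star, bound(m_star), best))
```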
Adaptation
Adapt the network size m and the choice of internal parameters.

The Minimum Description Length principle leads to a complexity-penalized least squares criterion. Let m̂ achieve

    min_m { ‖Y − f̂_m‖²_(n) + 2c (m/n) log N(Φ, δ_n) }

Information-theoretic risk bound:

    E‖f̂_{m̂} − f‖² ≤ min_m { ‖f_m − f‖² + 2c (m/n) log N(Φ, δ_n) }

Performs as well as if the best m* were known in advance. In the greedy case, ‖f‖²_Φ / m replaces ‖f_m − f‖².

ℓ1 penalized least squares:
- Achieves the same risk bound
- Retains the MDL interpretation (B., Huang, Li, Luo, 2008)
Confronting the computational challenge
Greedy search reduces the dimensionality of the optimization from md to just d. Obtain a current θ_m achieving within a constant factor of the maximum of

    J_n(θ) = (1/n) ∑_{i=1}^n res_i φ(X_i, θ)

This surface can still have many maxima, and we might get stuck at an undesirably low local maximum.

New computational strategies:
1. A special case in which the set of maxima can be identified.
2. An optimization path via solution to a PDE for ridge bases.
A special case in which the maxima can be identified
Insight from a special case:
- Sinusoidal dictionary: φ(x, θ) = e^{−iθ·x}
- Gaussian design: X_i ~ Normal(0, τ I)
- Target function: f(x) = ∑_{j=1}^{m_o} c_j e^{iα_j·x}

For step 1, with large n, the objective function becomes near its population counterpart

    J(θ) = E[f(X) e^{−iθ·X}] = ∑_{j=1}^{m_o} c_j E[e^{iα_j·X} e^{−iθ·X}]

which simplifies to

    ∑_{j=1}^{m_o} c_j e^{−(τ/2) ‖α_j − θ‖²}

For large τ it has precisely m_o maxima, one at each of the α_j in the target function.
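This can be checked numerically in one dimension (the frequencies α_j, coefficients c_j, and τ below are illustrative choices of mine):

```python
import numpy as np

# The population objective of the special case: a sum of Gaussian bumps
# J(theta) = sum_j c_j exp(-(tau/2) * ||alpha_j - theta||^2).
# For well-separated alpha_j and large tau it has one maximum per frequency.
tau = 25.0
alphas = np.array([-2.0, 0.0, 3.0])   # m_o = 3 target frequencies, d = 1
cs = np.array([1.0, 0.5, 0.8])

def J(theta):
    return np.sum(cs * np.exp(-(tau / 2) * (alphas - theta) ** 2))

grid = np.linspace(-4.0, 5.0, 2001)
vals = np.array([J(t) for t in grid])
interior = vals[1:-1]
peaks = grid[1:-1][(interior > vals[:-2]) & (interior > vals[2:])]
print("local maxima found near:", peaks)
```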
Optimization path for bounded ridge bases
A more general approach seeks approximate optimization of

    J(θ) = ∑_{i=1}^n r_i φ(θᵀ X_i)

Adaptive annealing (recent & current work with Luo, Chatterjee, Klusowski):
- Sample θ_t from the evolving density

      p_t(θ) = e^{t J(θ) − c_t} p_0(θ)

  along a sequence of values of t from 0 to t_final; use t_final of order (d log d)/n.
- Initialize with θ_0 drawn from a product prior p_0(θ), such as Normal(0, I) or a product of standard Cauchy.
- Starting from the random θ_0, define the optimization path θ_t such that its distribution tracks the target density p_t.
Optimization path
Adaptive annealing: arrange θ_t from the evolving density

    p_t(θ) = e^{t J(θ) − c_t} p_0(θ)

with θ_0 drawn from p_0(θ).

State evolution with a vector-valued change function G_t(θ):

    θ_{t+h} = θ_t − h G_t(θ_t)

or better, θ_{t+h} is the solution to

    θ_t = θ_{t+h} + h G_t(θ_{t+h})

with small step-size h, such that θ + h G_t(θ) is invertible with a positive definite Jacobian, and solves equations for the evolution of p_t(θ). As we will see, there are many such change functions G_t(θ), though not all are nice.
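A one-dimensional sketch of this idea (my toy example, not from the slides): in d = 1 the evolution equation (d/dt) p_t = (d/dθ)[G_t p_t] can be solved exactly, G_t(θ) = (1/p_t(θ)) ∫_{−∞}^{θ} p_t(s)(J(s) − E_{p_t}[J]) ds, so we can compute G_t on a grid and move particles by the explicit Euler step θ_{t+h} = θ_t − h G_t(θ_t):

```python
import numpy as np

# Particles start from the prior p_0 and are transported so that their
# distribution tracks p_t proportional to exp(t*J) * p_0. Toy objective
# J = cos and all constants are illustrative.
rng = np.random.default_rng(2)
J = np.cos                              # maximum value 1, attained at theta = 0 mod 2*pi

grid = np.linspace(-8, 8, 4001)
dg = grid[1] - grid[0]
log_p0 = -0.5 * grid**2                 # standard normal prior, unnormalized

def G_on_grid(t):
    p = np.exp(t * J(grid) + log_p0)
    p /= p.sum() * dg                   # normalize p_t on the grid
    EJ = np.sum(J(grid) * p) * dg
    flux = np.cumsum(p * (J(grid) - EJ)) * dg   # this is G_t * p_t
    return flux / p

theta = rng.standard_normal(2000)       # theta_0 ~ p_0
h, t_final = 0.01, 5.0
for k in range(int(t_final / h)):
    theta = theta - h * np.interp(theta, grid, G_on_grid(k * h))

print("mean J under p_0  : %.3f" % np.mean(J(rng.standard_normal(2000))))
print("mean J at t_final : %.3f" % np.mean(J(theta)))
```

By t_final the particles have contracted toward the dominant mode of p_t near θ = 0, so the average objective value is close to its maximum.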
Nice change functions Gt
A function on R^d is said to be nice if the logarithm of its magnitude is bounded by an expression of order logarithmic in d and in 1 + ‖θ‖².

A vector-valued function is said to be nice if its norm is nice.

For computational feasibility and distributional validity, we seek a nice change function G_t satisfying the upcoming density evolution rule.
Solve for the change Gt to track the density pt
Density evolution, by the Jacobian rule:

    p_{t+h}(θ) = p_t(θ + h G_t(θ)) det(I + h ∇G_tᵀ(θ))

Up to terms of order h,

    p_{t+h}(θ) = p_t(θ) + h [ G_tᵀ(θ) ∇p_t(θ) + p_t(θ) ∇ᵀG_t(θ) ]

in agreement, for small h, with the partial differential equation

    (∂/∂t) p_t(θ) = ∇ᵀ[ G_t(θ) p_t(θ) ]

The right side is G_tᵀ(θ) ∇p_t(θ) + p_t(θ) ∇ᵀG_t(θ). Dividing by p_t(θ), it is expressed in log-density form:

    (∂/∂t) log p_t(θ) = ∇ᵀG_t(θ) + G_tᵀ(θ) ∇ log p_t(θ)
Candidate solutions

Solution of smallest L2 norm of G_t(θ) p_t(θ) at a specific t:
- Let G_t(θ) p_t(θ) = ∇b(θ), the gradient of a function b(θ)
- Let f(θ) = (∂/∂t) p_t(θ)
- Set green(θ) proportional to 1/‖θ‖^{d−2}, harmonic for θ ≠ 0

The partial differential equation becomes the Poisson equation:

    ∇ᵀ∇b(θ) = f(θ)

with solution b(θ) = (f ∗ green)(θ). Using ∇green(θ) = c_d θ/‖θ‖^d,

    G_t(θ) p_t(θ) = ∇b(θ) = (f ∗ ∇green)(θ)

and therefore

    G_t(θ) = (f ∗ ∇green)(θ) / p_t(θ)

Not nice!
Candidate solutions

Perhaps the ideal solution is the one of smallest L2 norm of G_t(θ). It has G_t(θ) = ∇b_t(θ), the gradient of a function. The PDE in log-density form,

    ∇ᵀG_t(θ) + G_tᵀ(θ) ∇ log p_t(θ) = (∂/∂t) log p_t(θ)

then becomes an elliptic PDE in b_t(θ) for fixed t. With ∇ log p_t(θ) and (∂/∂t) log p_t(θ) arranged to be bounded, the solution may exist and be nice. But an explicit solution to this elliptic PDE is not available (except perhaps numerically in low-dimensional cases).

To achieve an explicit solution, give up G_t(θ) being a gradient. For ridge bases, we decompose into a system of first-order differential equations and integrate.
Candidate solution by decomposition of ridge sum

Optimize J(θ) = ∑_{i=1}^n r_i φ(X_iᵀ θ).

Target density p_t(θ) = e^{t J(θ) − c_t} p_0(θ), with c′_t = E_{p_t}[J].

The time score is (∂/∂t) log p_t(θ) = J(θ) − E_{p_t}[J].

Specialize the PDE in log-density form:

    ∇ᵀG_t(θ) + G_tᵀ(θ) ∇ log p_t(θ) = J(θ) − E_{p_t}[J]

The right side takes the form of a sum ∑ r_i [φ(X_iᵀθ) − a_i].

Likewise ∇ log p_t(θ) = t ∇J(θ) + ∇ log p_0(θ), where t ∇J(θ) is the sum

    t ∑ r_i X_i φ′(X_iᵀθ)

Here we suppress the role of the prior. It can be accounted for by appending d prior observations, with columns of the identity as extra input vectors, along with a multiple of the score of the marginal of the prior in place of φ′.
Approximate solution for ridge sums

Seek an approximate solution of the form

    G_t(θ) = ∑_i (x_i / ‖x_i‖²) g_i(u)

with u = (u_1, ..., u_n) evaluated at u_i = X_iᵀθ, for which

    ∇ᵀG_t(θ) = ∑_i (∂/∂u_i) g_i(u) + ∑_{i,j: i≠j} (x_iᵀx_j / ‖x_i‖²) (∂/∂u_j) g_i(u)

Can we ignore the coupling in the derivative terms? The ratios x_jᵀx_i / ‖x_i‖² are small for uncorrelated designs and large d. Match the remaining terms in the sums to solve for g_i(u).

Arrange g_i(u) to solve the differential equations

    (∂/∂u_i) g_i(u) + t g_i(u) [ r_i φ′(u_i) + rest_i ] = r_i [ φ(u_i) − a_i ]

where rest_i = ∑_{j≠i} r_j φ′(u_j) x_jᵀx_i / ‖x_i‖².
Integral form of solution

Differential equation for g_i(u_i), suppressing dependence on the coordinates other than i:

    (∂/∂u_i) g_i(u_i) + t g_i(u_i) [ r_i φ′(u_i) + rest_i ] = r_i [ φ(u_i) − a_i ]

Define the density factor

    m_i(u_i) = e^{t r_i φ(u_i) + t u_i rest_i}

which allows the above differential equation to be put in the form

    (∂/∂u_i) [ g_i(u_i) m_i(u_i) ] = r_i [ φ(u_i) − a_i ] m_i(u_i)

An explicit solution, evaluated at u_i = x_iᵀθ, is

    g_i(u_i) = r_i ∫_{c_i}^{u_i} m_i(ū) [ φ(ū) − a_i ] dū / m_i(u_i)

where c_i is such that φ(c_i) = a_i.
The derived change function G_t for evolution of θ_t

Include the u_j for j ≠ i upon which rest_i depends. Our solution is

    g_{i,t}(u) = r_i ∫_{c_i}^{u_i} e^{t r_i (φ(ū) − φ(u_i)) + t (ū − u_i) rest_i(u)} [ φ(ū) − a_i ] dū

Evaluating at u = Xθ, we have the change function

    G_t(θ) = ∑_i (x_i / ‖x_i‖²) g_{i,t}(Xθ)

for which θ_t evolves according to

    θ_{t+h} = θ_t + h G_t(θ_t)

For showing that g_{i,t}, G_t and ∇G_t are nice, assume the activation function φ and its derivative are bounded (e.g. a logistic sigmoid or a sinusoid).

Run several optimization paths in parallel, starting from independent choices of θ_0. This allows access to empirical computation of a_{i,t} = E_{p_t}[φ(x_iᵀθ_t)].
![Page 39: Andrew R. Barron YALE UNIVERSITY DEPARTMENT OF ...arb4/presentations/BarronInformation...Andrew Barron Information Theory & Statistics of High-Dim Function Estimation Example papers](https://reader034.vdocuments.mx/reader034/viewer/2022043012/5fab46848eb5dd7d410f36b7/html5/thumbnails/39.jpg)
Conjectured conclusion

We have derived the desired optimization procedure and the following.

Conjecture: With step size h of order 1/n², a number of steps of order n d log d, X_1, X_2, . . . , X_n i.i.d. Normal(0, I) in R^d, and a product of independent standard Cauchy priors p_0(θ): with high probability on the design X, the above procedure produces optimization paths θ_t whose distribution closely tracks the target

p_t(θ) = e^{t J(θ) − c_t} p_0(θ)

such that, with high probability, the solution paths have instances of J(θ_t) that are at least 1/2 the maximum.

Consequently, the relaxed greedy procedure is computationally feasible and achieves the indicated bounds for sparse linear combinations from the dictionary Φ = {φ(θ^T x) : θ ∈ R^d}.
Summary

Flexible approximation models: subset selection; nonlinearly parameterized bases, as with neural nets; ℓ₁ control on the coefficients of the combination.

Accurate approximation with a moderate number of terms; the proof is analogous to random coding.

Information-theoretic risk bounds, based on the minimum description length principle, show accurate estimation with a moderate sample size.

Computational challenges are being addressed; the adaptive annealing strategy appears promising.
Information and Statistics:

Nonparametric Rates of Estimation
Minimum Description Length Principle
Penalized Likelihood (one-sided concentration)
Implications for Greedy Term Selection
Shannon Capacity

Capacity: a channel θ → Y is a family of distributions {P_{Y|θ} : θ ∈ Θ}.

Information capacity: C = max_{P_θ} I(θ; Y)

Communications capacity theorem: C_com = C (Shannon 1948)

Data compression capacity; minimax redundancy: Red = min_{Q_Y} max_{θ∈Θ} D(P_{Y|θ} ‖ Q_Y)

Data compression capacity theorem: Red = C (Gallager; Davisson & Leon-Garcia; Ryabko)
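As an aside (an illustrative example, not from the talk), the information capacity C = max_{P_θ} I(θ; Y) of a discrete channel can be computed by the Blahut-Arimoto algorithm; the sketch below recovers the known capacity 1 − h(0.1) of a binary symmetric channel.

```python
import math

# Illustrative example (not from the talk): compute the information
# capacity C = max_{P_theta} I(theta; Y) of a binary symmetric channel
# by the Blahut-Arimoto algorithm, and compare with 1 - h(0.1).
W = [[0.9, 0.1],
     [0.1, 0.9]]                  # W[t][y] = P(Y = y | theta = t)
p = [0.5, 0.5]                    # input distribution, to be optimized

for _ in range(200):
    qy = [sum(p[t] * W[t][y] for t in range(2)) for y in range(2)]
    q = [[p[t] * W[t][y] / qy[y] for y in range(2)] for t in range(2)]
    # update p[t] proportional to exp(sum_y W[t][y] log q[t][y])
    w = [math.exp(sum(W[t][y] * math.log(q[t][y]) for y in range(2)))
         for t in range(2)]
    p = [w[t] / (w[0] + w[1]) for t in range(2)]

qy = [sum(p[t] * W[t][y] for t in range(2)) for y in range(2)]
C = sum(p[t] * W[t][y] * math.log2(W[t][y] / qy[y])
        for t in range(2) for y in range(2))
h = lambda x: -x * math.log2(x) - (1 - x) * math.log2(1 - x)
print(abs(C - (1 - h(0.1))) < 1e-9)   # BSC capacity: 1 - h(0.1) bits
```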
Setting for Statistical Capacity

Statistical Risk Setting:

Loss function ℓ(θ, θ′).

Kullback loss: ℓ(θ, θ′) = D(P_{Y|θ} ‖ P_{Y|θ′})

Squared metric loss, e.g. squared Hellinger loss: ℓ(θ, θ′) = d²(θ, θ′)

Statistical risk equals expected loss: Risk = E[ℓ(θ, θ̂)]
Statistical Capacity

Estimators θ̂_n, based on a sample Y of size n.

Minimax risk (Wald):

r_n = min_{θ̂_n} max_θ E ℓ(θ, θ̂_n)
Metric Entropy

Ingredients in determining minimax rates of statistical risk.

Kolmogorov metric entropy of S ⊂ Θ:

H(ε) = max{log Card(Θ_ε) : d(θ, θ′) > ε for θ, θ′ ∈ Θ_ε ⊂ S}

Loss assumption, for θ, θ′ ∈ S:

ℓ(θ, θ′) ∼ D(P_{Y|θ} ‖ P_{Y|θ′}) ∼ d²(θ, θ′)
Statistical Capacity

Information-theoretic determination of minimax rates, for infinite-dimensional Θ, with the metric entropy evaluated at a critical separation ε_n.

Statistical Capacity Theorem: minimax risk ∼ info capacity rate ∼ metric entropy rate:

r_n ∼ C_n/n ∼ H(ε_n)/n ∼ ε_n²

(Yang 1997; Yang and B. 1999; Haussler and Opper 1997)
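A small sketch of solving the critical equation H(ε_n) = n ε_n², under the illustrative assumption H(ε) = A/ε (as for Lipschitz functions on an interval), which yields ε_n² = (A/n)^{2/3}, i.e. minimax risk of order n^{−2/3}:

```python
# Illustrative sketch (assumed entropy H(eps) = A/eps, not from the
# talk): the critical separation solves H(eps_n) = n * eps_n^2, giving
# eps_n^2 = (A/n)^(2/3) for this entropy, found here by bisection.
def critical_eps(n, A=1.0):
    lo, hi = 1e-12, 1e6
    for _ in range(200):                 # bisection on H(eps) - n*eps^2
        mid = 0.5 * (lo + hi)
        if A / mid - n * mid ** 2 > 0:
            lo = mid                     # entropy still exceeds n*eps^2
        else:
            hi = mid
    return 0.5 * (lo + hi)

n = 10 ** 6
eps = critical_eps(n)
print(abs(eps ** 2 - n ** (-2 / 3)) / n ** (-2 / 3) < 1e-6)
```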
Information Theory Formulation of a Statistical Principle

Minimum Description Length (Rissanen 78, 83; B. 85; B. & Cover 91; ...)

Statistical measure of the complexity of Y:

L(Y) = min_q [ log 1/q(Y) + L(q) ]

(bits for Y given q, plus bits for q)

This is an information-theoretically valid codelength for Y for any L(q) satisfying Kraft summability: Σ_q 2^{−L(q)} ≤ 1.

The minimization is for q in a family indexed by parameters {p_θ(Y) : θ ∈ Θ} or by functions {p_f(Y) : f ∈ F}. The estimator p̂ is then p_{θ̂} or p_{f̂}.
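A minimal sketch of the two-stage codelength, with a hypothetical three-model Bernoulli family and codelengths L(q) chosen to satisfy the Kraft inequality:

```python
import math

# Minimal sketch: two-stage MDL codelength L(Y) = min_q [log2 1/q(Y) + L(q)]
# over a hypothetical three-model Bernoulli family, with codelengths L(q)
# chosen to satisfy the Kraft inequality sum_q 2^(-L(q)) <= 1.
models = {0.25: 2.0, 0.5: 1.0, 0.75: 2.0}    # theta -> L(q) in bits
assert sum(2 ** -Lq for Lq in models.values()) <= 1.0   # Kraft summability

def two_stage_codelength(y):
    k, n = sum(y), len(y)
    def data_bits(theta):                    # log2 1/q(y) under Bernoulli(theta)
        return -(k * math.log2(theta) + (n - k) * math.log2(1 - theta))
    return min(data_bits(theta) + Lq for theta, Lq in models.items())

y = [1, 1, 0, 1, 1, 1, 0, 1]                 # six ones out of eight
print(two_stage_codelength(y) <= 9.0)        # beats theta = 0.5 (8 + 1 bits)
```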
Statistical Aim

From training data x ⇒ estimator p̂.

Generalize to subsequent data x′.

Want log 1/p̂(x′) to compare favorably to log 1/p(x′), for targets p close to or in the families.

With expectation over X′, the loss becomes the Kullback divergence; the Bhattacharyya, Hellinger, and Rényi losses are also relevant.
Loss

Kullback information-divergence:

D(P_{X′} ‖ Q_{X′}) = E[ log p(X′)/q(X′) ]

Bhattacharyya, Hellinger, Rényi divergence:

d²(P_{X′}, Q_{X′}) = 2 log 1/E[ (q(X′)/p(X′))^{1/2} ]

Product model case: D(P_{X′} ‖ Q_{X′}) = n D(P ‖ Q) and d²(P_{X′}, Q_{X′}) = n d²(P, Q).

Relationship: d² ≤ D ≤ (2 + b) d² if the log density ratio is ≤ b.
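These relations can be checked numerically; the sketch below (an illustrative Bernoulli example) verifies d² ≤ D and the product-model identity for n = 2 by direct summation:

```python
import math

# Illustrative Bernoulli check of d^2 <= D and of the product-model
# identity D(P^n || Q^n) = n D(P || Q), here verified for n = 2.
def D(p, q):        # Kullback divergence, in nats
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def d2(p, q):       # Bhattacharyya divergence 2 log 1/E_P[(q/p)^(1/2)]
    bc = math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))
    return 2 * math.log(1 / bc)

p, q = 0.3, 0.6
print(d2(p, q) <= D(p, q))          # d^2 lower-bounds D

D2 = 0.0                            # D between the two-sample products
for a in (0, 1):
    for b in (0, 1):
        P = (p if a else 1 - p) * (p if b else 1 - p)
        Q = (q if a else 1 - q) * (q if b else 1 - q)
        D2 += P * math.log(P / Q)
print(abs(D2 - 2 * D(p, q)) < 1e-12)
```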
MDL Analysis

Redundancy of the two-stage code:

Red_n = (1/n) E{ min_q [ log 1/q(Y) + L(q) ] − log 1/p(Y) }

bounded by the index of resolvability:

Res_n(p) = min_q { D(p ‖ q) + L(q)/n }

Statistical risk analysis in the i.i.d. case, with the codelength doubled, L̃(q) = 2L(q):

E d²(p, p̂) ≤ min_q { D(p ‖ q) + L̃(q)/n }

(B. 85; B. & Cover 91; B., Rissanen & Yu 98; Li 99; Grünwald 07)
MDL Analysis: Key to Risk Consideration

Discrepancy between the training sample and the future sample:

Disc(p, q) = log p(Y)/q(Y) − log p(Y′)/q(Y′)

The future term may be replaced by its population counterpart.

Discrepancy control: if L(q) satisfies the Kraft sum, then

E[ inf_q { Disc(p, q) + 2L(q) } ] ≥ 0

from which the risk bound follows: Risk ≤ Redundancy ≤ Resolvability,

E d²(p, p̂) ≤ Red_n ≤ Res_n(p)
Statistically valid penalized likelihood

Likelihood penalties arise via:
number of parameters: pen(p_θ) = λ dim(θ)
roughness penalties: pen(p_f) = λ ‖f^{(s)}‖²
coefficient penalties: pen(θ) = λ ‖θ‖₁
Bayes estimators: pen(θ) = log 1/w(θ)
maximum likelihood: pen(θ) = constant

MDL / penalized likelihood:

p̂ = argmin_q { log 1/q(Y) + pen(q) }

Under what condition on the penalty will it be true that the sample-based estimate p̂ has risk controlled by the population counterpart?

E d²(p, p̂) ≤ inf_q { D(p ‖ q) + pen(q)/n }
Statistically valid penalized likelihood

Result with J. Li, C. Huang, and X. Luo (Festschrift for J. Rissanen, 2008).

Penalized likelihood:

p̂ = argmin_q { (1/n) log 1/q(Y) + pen_n(q) }

Penalty condition:

pen_n(q) ≥ (1/n) min_{q̃} { 2L(q̃) + Δ_n(q, q̃) }

where the distortion Δ_n(q, q̃) is the difference in the discrepancies at q and at a representer q̃.

Risk conclusion:

E d²(p, p̂) ≤ inf_q { D(p ‖ q) + pen_n(q) }
Information-theoretically valid penalties

Penalized likelihood:

min_{θ∈Θ} { log 1/p_θ(x) + Pen(θ) }

with a possibly uncountable Θ.

Valid codelength interpretation if there exist a countable Θ̃ and an L satisfying Kraft such that the above is not less than

min_{θ̃∈Θ̃} { log 1/p_{θ̃}(x) + L(θ̃) }
A variable-complexity, variable-distortion cover

Equivalently: penalized likelihood with a penalty Pen(θ) is information-theoretically valid with uncountable Θ if there are a countable Θ̃ and a Kraft-summable L(θ̃) such that, for every θ in Θ, there is a representer θ̃ in Θ̃ with

Pen(θ) ≥ L(θ̃) + log p_θ(x)/p_{θ̃}(x)

This is the link between the uncountable and countable cases.
Statistical-Risk Valid Penalty

For an uncountable Θ and a penalty Pen(θ), θ ∈ Θ, suppose there are a countable Θ̃ and L̃(θ̃) = 2L(θ̃), where L(θ̃) satisfies Kraft, such that for all x and θ∗:

min_{θ∈Θ} { [ log p_{θ∗}(x)/p_θ(x) − d²_n(θ∗, θ) ] + Pen(θ) }

≥ min_{θ̃∈Θ̃} { [ log p_{θ∗}(x)/p_{θ̃}(x) − d²_n(θ∗, θ̃) ] + L̃(θ̃) }

Proof of the risk conclusion: the second expression has expectation ≥ 0, so the first expression does too.

B., Li & Luo (Rissanen Festschrift 2008; Proc. Porto Info. Theory Workshop 2008)
ℓ₁ Penalties are codelength- and risk-valid

Regression setting: linear span of a dictionary. G is a dictionary of candidate basis functions, e.g. wavelets, splines, polynomials, trigonometric terms, sigmoids, explanatory variables and their interactions.

Candidate functions in the linear span: f_θ(x) = Σ_{g∈G} θ_g g(x)

Weighted ℓ₁ norm of the coefficients: ‖θ‖₁ = Σ_g a_g |θ_g|, with weights a_g = ‖g‖_n, where ‖g‖²_n = (1/n) Σ_{i=1}^n g²(x_i)

Regression model: p_θ(y|x) = Normal(f_θ(x), σ²)

ℓ₁ penalty (Lasso, Basis Pursuit): pen(θ) = λ ‖θ‖₁
Regression with ℓ₁ penalty

ℓ₁-penalized log-density estimation, i.i.d. case:

θ̂ = argmin_θ { (1/n) log 1/p_{f_θ}(x) + λ_n ‖θ‖₁ }

Regression with the Gaussian model:

min_θ { (1/(2σ²)) (1/n) Σ_{i=1}^n (Y_i − f_θ(x_i))² + (1/2) log 2πσ² + (λ_n/σ) ‖θ‖₁ }

Codelength-valid and risk-valid for

λ_n ≥ √(2 log(2p)/n), with p = Card(G)
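The ℓ₁-penalized least-squares estimate can be computed by proximal gradient descent; the sketch below (made-up data and step size, not the talk's code) uses the level λ_n = √(2 log(2p)/n) from the slide:

```python
import math, random

# Illustrative sketch (made-up data, not the talk's code): the
# l1-penalized least squares estimate computed by proximal gradient
# descent (ISTA), with the slide's level lambda_n = sqrt(2*log(2p)/n).
random.seed(0)
n, p = 100, 10
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
theta_star = [2.0, -1.5] + [0.0] * (p - 2)          # sparse target
y = [sum(X[i][j] * theta_star[j] for j in range(p)) + random.gauss(0, 0.5)
     for i in range(n)]

lam = math.sqrt(2 * math.log(2 * p) / n)            # threshold-level penalty

def soft(z, t):                                     # soft-thresholding prox
    return math.copysign(max(abs(z) - t, 0.0), z)

theta = [0.0] * p
step = 0.05                                         # small fixed step size
for _ in range(1000):
    fit = [sum(X[i][j] * theta[j] for j in range(p)) for i in range(n)]
    grad = [sum(X[i][j] * (fit[i] - y[i]) for i in range(n)) / n
            for j in range(p)]
    theta = [soft(theta[j] - step * grad[j], step * lam) for j in range(p)]

# the estimate is sparse and close to the true coefficients
print(all(abs(theta[j]) < 0.3 for j in range(2, p)))
print(abs(theta[0] - 2.0) < 0.6 and abs(theta[1] + 1.5) < 0.6)
```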
Adaptive risk bound specialized to regression

Again, for fixed design and λ_n = √(2 log(2p)/n), multiplying through by 4σ²:

E ‖f∗ − f_{θ̂}‖²_n ≤ inf_θ { 2 ‖f∗ − f_θ‖²_n + 4σ λ_n ‖θ‖₁ }

In particular, for all targets f∗ = f_{θ∗} with finite ‖θ∗‖₁, the risk bound 4σ λ_n ‖θ∗‖₁ is of order √(log M / n).

Details in Barron & Luo (Proc. Workshop on Information Theory Methods in Science & Engineering, Tampere, Finland, 2008).
Comment on proof

The variable-complexity cover property is demonstrated by choosing the representer f̃ of f_θ of the form

f̃(x) = (v/m) Σ_{k=1}^m g_k(x)

with g₁, . . . , g_m picked at random from G, independently, where g arises with probability proportional to |θ_g|.
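The sampling step can be sketched as follows (a toy dictionary, assuming nonnegative coefficients so that v = ‖θ‖₁ and no signs are needed); the expectation of the representer is exactly f_θ:

```python
import random

# Illustrative sketch (toy dictionary, assuming nonnegative
# coefficients): sampling m dictionary elements g with probability
# proportional to |theta_g| and averaging, scaled by v = ||theta||_1,
# gives a representer whose expectation is exactly f_theta.
random.seed(1)
dictionary = [lambda x, k=k: x ** k for k in range(1, 5)]   # toy dictionary
theta = [0.5, 0.0, 1.5, 1.0]                                # coefficients
v = sum(theta)                                              # l1 norm
f_theta = lambda x: sum(t * g(x) for t, g in zip(theta, dictionary))

def representer(m, x):
    ks = random.choices(range(4), weights=theta, k=m)       # prob ~ |theta_g|
    return (v / m) * sum(dictionary[k](x) for k in ks)

x = 0.8
avg = sum(representer(8, x) for _ in range(20000)) / 20000  # Monte Carlo mean
print(abs(avg - f_theta(x)) < 0.02)
```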
Practical Communication by Regression
Achieving Shannon Capacity: (with A. Joseph, S. Cho)
Gaussian Channel with Power Constraints
History of Methods
Communication by Regression
Sparse Superposition Coding
Adaptive Successive Decoding
Rate, Reliability, and Computational Complexity
Shannon Formulation

Input bits: u = (u₁, u₂, . . . , u_K)
↓ Encoded: x = (x₁, x₂, . . . , x_n)
↓ Channel: p(y|x)
↓ Received: y = (y₁, y₂, . . . , y_n)
↓ Decoded: û = (û₁, û₂, . . . , û_K)

Rate: R = K/n. Capacity: C = max I(X; Y).

Reliability: want small Prob{û ≠ u} and small Prob{fraction of mistakes ≥ α}.
Gaussian Noise Channel

Input bits: u = (u₁, u₂, . . . , u_K)
↓ Encoded: x = (x₁, x₂, . . . , x_n), with average power (1/n) Σ_{i=1}^n x_i² ≤ P
↓ Channel: p(y|x), where y = x + ε, ε ∼ N(0, σ²I)
↓ Received: y = (y₁, y₂, . . . , y_n)
↓ Decoded: û = (û₁, û₂, . . . , û_K)

Rate: R = K/n. Capacity: C = (1/2) log(1 + P/σ²).

Reliability: want small Prob{û ≠ u} and small Prob{fraction of mistakes ≥ α}.
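A two-line computation with this capacity formula (snr = 15 and R = 1.04 match the decoding-progression example later in the talk):

```python
import math

# Sketch: capacity C = (1/2) log2(1 + P/sigma^2) in bits per channel
# use, and the blocklength n needed to carry K bits at rate R = K/n.
# snr = 15 and R = 1.04 match the decoding-progression example later.
def capacity_bits(snr):
    return 0.5 * math.log2(1 + snr)

C = capacity_bits(15)            # 2 bits per channel use
K, R = 1024, 1.04
n = math.ceil(K / R)             # R = K/n, so n = ceil(K/R)
print(C == 2.0)
print(n == 985 and K / n < C)
```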
Shannon Theory meets Coding Practice

The Gaussian noise channel is the basic model for wireless communication (radio, cell phones, television, satellite, space) and wired communication (internet, telephone, cable).

Forney and Ungerboeck (1998) review modulation, coding, and shaping for the Gaussian channel. Richardson and Urbanke (2008) cover much of the state of the art in the analysis of coding.

There are fast encoding and decoding algorithms with empirically good performance for LDPC and turbo codes. Some tools exist for their theoretical analysis, but obstacles remain to a mathematical proof that these schemes achieve rates up to capacity for the Gaussian channel.

Arikan (2009) and Arikan and Telatar (2009) introduced polar codes; for adapting polar codes to the Gaussian channel, see Abbe and B. (2011).

The method here is different. Prior knowledge of the above is not necessary to follow what we present.
Sparse Superposition Code

Input bits: u = (u₁ . . . u_K)
Coefficients: β = (00∗0000000000∗00 . . . 0∗000000)^T
Sparsity: L entries non-zero out of N
Matrix: X, n by N, all entries independent Normal(0, 1)
Codeword: Xβ, a superposition of a subset of the columns
Receive: y = Xβ + ε, a statistical linear model
Decode: β̂ and û from X, y
Rate: R = K/n from K = log (N choose L), near L log((N/L) e)
Reliability: small Prob{fraction of β̂ mistakes ≥ α}, for small α
Outer RS code: rate 1 − 2α, corrects the remaining mistakes
Overall rate: R_tot = (1 − 2α) R

Is it reliable with rate up to capacity?
Partitioned Superposition Code

Input bits: u = (u₁, . . . , u_K)
Coefficients: β = (00∗00000, 00000∗00, . . . , 0∗000000)
Sparsity: L sections, each of size B = N/L, a power of 2; one non-zero entry in each section
Indices of non-zeros: (j₁, j₂, . . . , j_L), directly specified by u
Matrix: X, n by N, split into L sections
Codeword: Xβ
Receive: y = Xβ + ε
Decode: β̂ and û
Rate: R = K/n from K = L log(N/L) = L log B; may set B = n and L = nR/log n
Reliability: small Prob{fraction of β̂ mistakes ≥ α}
Outer RS code: corrects the remaining mistakes
Overall rate: up to capacity?
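A minimal sketch (toy sizes, constant power, no outer code) of the partitioned encoder, in which each block of log₂ B input bits directly selects the non-zero slot of one section:

```python
import math, random

# Illustrative sketch (toy sizes) of the partitioned superposition
# encoder: K = L*log2(B) input bits directly pick one non-zero slot in
# each of the L sections, and the codeword is X*beta.
random.seed(2)
L, B = 4, 8                        # sections of size B = N/L, a power of 2
N, n = L * B, 16
nb = int(math.log2(B))             # bits per section
K = L * nb                         # K = L log2 B
X = [[random.gauss(0, 1) for _ in range(N)] for _ in range(n)]

def encode(bits):
    assert len(bits) == K
    beta = [0.0] * N
    for l in range(L):             # nb bits select the slot in section l
        j = int("".join(map(str, bits[l * nb:(l + 1) * nb])), 2)
        beta[l * B + j] = 1.0      # constant power allocation, unscaled
    codeword = [sum(X[i][j] * beta[j] for j in range(N)) for i in range(n)]
    return beta, codeword

bits = [1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0]
beta, codeword = encode(bits)
print(sum(1 for b in beta if b != 0) == L)   # one non-zero per section
print(beta[5] == 1.0 and beta[8] == 1.0)     # slots 101 -> 5 and 000 -> 0
```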
Power Allocation

Coefficients: β = (00∗00000, 00000∗00, . . . , 0∗000000)
Indices of non-zeros: sent = (j₁, j₂, . . . , j_L)
Coefficient values: β_{j_ℓ} = √P_ℓ for ℓ = 1, 2, . . . , L
Power control: Σ_{ℓ=1}^L P_ℓ = P
Codewords Xβ have average power P

Power allocations:
Constant power: P_ℓ = P/L
Variable power: P_ℓ proportional to u_ℓ = e^{−2Cℓ/L}
Variable with leveling: P_ℓ proportional to max{u_ℓ, cut}
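The three allocations can be sketched as follows (the leveling parameter `cut` is an illustrative choice):

```python
import math

# Sketch of the three power allocations, each normalized so that the
# section powers sum to P; C is in nats and `cut` is an illustrative
# leveling parameter.
def allocations(P, L, C, cut):
    u = [math.exp(-2 * C * l / L) for l in range(1, L + 1)]
    v = [max(ul, cut) for ul in u]
    return {
        "constant": [P / L] * L,
        "variable": [P * ul / sum(u) for ul in u],
        "leveled":  [P * vl / sum(v) for vl in v],
    }

P, L = 7.0, 50                       # matches the figure: Power = 7, L = 50
C = 0.5 * math.log(1 + P)            # capacity in nats, taking sigma^2 = 1
out = allocations(P, L, C, cut=0.05)
print(all(abs(sum(Ps) - P) < 1e-9 for Ps in out.values()))
print(out["variable"][0] > out["variable"][-1])   # exponentially decaying
```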
Power Allocation

[Figure: power allocation P_ℓ versus section index ℓ, decaying across the sections; Power = 7, L = 50.]
Contrast Two Decoders

Decoders using the received y = Xβ + ε.

Optimal: least squares decoder

β̂ = argmin_β ‖y − Xβ‖²

minimizes the probability of error with a uniform input distribution; reliable for all R < C, with the best form of error exponent.

Practical: adaptive successive decoder

a fast decoder; reliable, using variable power allocation, for all R < C.
Adaptive Successive Decoder

Decoding steps:

Start [step 1]: compute the inner product of Y with each column of X; see which are above a threshold; form the initial fit as a weighted sum of the columns above threshold.

Iterate [step k ≥ 2]: compute the inner product of the residuals Y − Fit_{k−1} with each remaining column of X; see which are above threshold; add these columns to the fit.

Stop: at step k = log B, or if there are no inner products above threshold.
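A toy sketch of these decoding steps (constant power allocation, an illustrative threshold, and no outer RS code: a simplification for exposition, not the paper's decoder):

```python
import math, random

# Toy sketch of the adaptive successive decoder (constant power,
# illustrative threshold, no outer RS code: a simplification, not the
# paper's decoder).
random.seed(3)
L, B = 4, 8
N, n = L * B, 512
coef = 1.0                                    # constant power value
X = [[random.gauss(0, 1) for _ in range(N)] for _ in range(n)]
sent = [l * B + random.randrange(B) for l in range(L)]
y = [sum(coef * X[i][j] for j in sent) + random.gauss(0, 0.2)
     for i in range(n)]

def inner(col, v):
    return sum(X[i][col] * v[i] for i in range(n))

threshold = 0.6 * n * coef                    # illustrative threshold choice
decoded, resid = set(), y[:]
for step in range(int(math.log2(B))):         # at most log B steps
    hits = [j for j in range(N)
            if j not in decoded and inner(j, resid) > threshold]
    if not hits:                              # nothing above threshold: stop
        break
    decoded |= set(hits)                      # add these columns to the fit
    resid = [y[i] - sum(coef * X[i][j] for j in decoded) for i in range(n)]

print(decoded == set(sent))                   # all L sent columns recovered
```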
Decoding Progression

[Figure: plot of the likely progression of the weighted fraction of correct detections q_{1,k}, for snr = 15; B = 2^16, L = B, C = 2 bits, R = 1.04 bits (0.52 C), number of steps = 18. The curve g_L(x) is shown against the line x.]
Decoding Progression

[Figure: plot of the likely progression of the weighted fraction of correct detections q_{1,k}, for snr = 1; B = 2^16, L = B, C = 0.5 bits, R = 0.31 bits (0.62 C), number of steps = 7. The curve g_L(x) is shown against the line x.]
Rate and Reliability

Optimal: least squares decoder of the sparse superposition code. The probability of error is exponentially small in n for small Δ = C − R > 0:

Prob{Error} ≤ e^{−n(C−R)²/(2V)}

in agreement with the Shannon-Gallager optimal exponent, though with a possibly suboptimal V depending on the snr.

Practical: the adaptive successive decoder, with the outer RS code, achieves rates up to C_B approaching capacity:

C_B = C / (1 + c₁/log B)

The probability of error is exponentially small in L for R ≤ C_B:

Prob{Error} ≤ e^{−L(C_B−R)² c₂}

This improves to e^{−c₃ L (C_B−R)² (log B)^{1/2}} using a Bernstein bound. It is nearly optimal when C_B − R is of the same order as C − C_B. Our c₁ is near (2.5 + 1/snr) log log B + 4C.
Summary
Sparse superposition coding is fast and reliable at rates upto channel capacity
Formulation and analysis blends modern statisticalregression and information theory
Outline

Information and Probability:
Monotonicity of Information
Markov Chains
Martingales
Large Deviation Exponents
Information Stability (AEP)
Central Limit Theorem
Monotonicity of Information
Entropy Power Inequalities
Monotonicity of Information Divergence

Information inequality, X → X′:

D(P_{X′} ‖ P*_{X′}) ≤ D(P_X ‖ P*_X)

Chain rule:

D(P_{X,X′} ‖ P*_{X,X′}) = D(P_{X′} ‖ P*_{X′}) + E D(P_{X|X′} ‖ P*_{X|X′})
= D(P_X ‖ P*_X) + E D(P_{X′|X} ‖ P*_{X′|X})

Markov chain {X_n} with invariant P*:

D(P_{X_n} ‖ P*) ≤ D(P_{X_m} ‖ P*) for n > m

Convergence: log p_n(X_n)/p*(X_n) is a Cauchy sequence in L₁(P).
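The Markov-chain monotonicity can be checked numerically; a sketch for a two-state chain with its invariant distribution:

```python
import math

# Illustrative check: D(P_{X_n} || P*) is non-increasing along a Markov
# chain when P* is invariant, shown here for a small two-state chain.
K = [[0.9, 0.1],
     [0.2, 0.8]]                                  # transition kernel K[i][j]
pi = [2 / 3, 1 / 3]                               # invariant: pi K = pi
assert all(abs(sum(pi[i] * K[i][j] for i in range(2)) - pi[j]) < 1e-12
           for j in range(2))

def D(p, q):                                      # relative entropy in nats
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [0.99, 0.01]                                  # initial distribution
divs = []
for _ in range(20):
    divs.append(D(p, pi))
    p = [sum(p[i] * K[i][j] for i in range(2)) for j in range(2)]

print(all(divs[k + 1] <= divs[k] + 1e-12 for k in range(len(divs) - 1)))
```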
Monotonicity of Information Divergence
Information Inequality X → X ′
D(PX ′‖P∗X ′) ≤ D(PX‖P∗X )
Chain Rule
D(PX ,X ′‖P∗X ,X ′) = D(PX ′‖P∗X ′) + E D(PX |X ′‖P∗X |X ′)
= D(PX‖P∗X ) + 0
Markov Chain {Xn} with P∗ invariant
D(PXn‖P∗) ≤ D(PXm‖P∗) for n > m
Convergence
log pn(Xn)/p∗(Xn) is a Cauchy sequence in L1(P)
Andrew Barron Information Theory & Statistics of High-Dim Function Estimation
![Page 84: Andrew R. Barron YALE UNIVERSITY DEPARTMENT OF ...arb4/presentations/BarronInformation...Andrew Barron Information Theory & Statistics of High-Dim Function Estimation Example papers](https://reader034.vdocuments.mx/reader034/viewer/2022043012/5fab46848eb5dd7d410f36b7/html5/thumbnails/84.jpg)
Monotonicity of Information Divergence
Information Inequality X → X ′
D(PX ′‖P∗X ′) ≤ D(PX‖P∗X )
Chain Rule
D(PX ,X ′‖P∗X ,X ′) = D(PX ′‖P∗X ′) + E D(PX |X ′‖P∗X |X ′)
= D(PX‖P∗X )
Markov Chain {Xn} with P∗ invariant
D(PXn‖P∗) ≤ D(PXm‖P∗) for n > m
Convergence
log pn(Xn)/p∗(Xn) is a Cauchy sequence in L1(P)
Andrew Barron Information Theory & Statistics of High-Dim Function Estimation
![Page 85: Andrew R. Barron YALE UNIVERSITY DEPARTMENT OF ...arb4/presentations/BarronInformation...Andrew Barron Information Theory & Statistics of High-Dim Function Estimation Example papers](https://reader034.vdocuments.mx/reader034/viewer/2022043012/5fab46848eb5dd7d410f36b7/html5/thumbnails/85.jpg)
Monotonicity of Information Divergence
Information Inequality X → X ′
D(PX ′‖P∗X ′) ≤ D(PX‖P∗X )
Chain Rule
D(PX ,X ′‖P∗X ,X ′) = D(PX ′‖P∗X ′) + E D(PX |X ′‖P∗X |X ′)
= D(PX‖P∗X )
Markov Chain {Xn} with P∗ invariant
D(PXn‖P∗) ≤ D(PXm‖P∗) for n > m
Convergence
log pn(Xn)/p∗(Xn) is a Cauchy sequence in L1(P)
Andrew Barron Information Theory & Statistics of High-Dim Function Estimation
Pinsker–Kullback–Csiszár inequalities, with V the L_1 distance between P and P*:

V ≤ √(2D),        A ≤ D + √(2D)
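A quick finite-alphabet check of both inequalities (an added sketch; it reads A as the expected absolute log-likelihood ratio E_P|log(p/p*)|, which is one natural interpretation of the slide's A):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]
q = [0.3, 0.4, 0.3]
D = kl(p, q)
V = sum(abs(pi - qi) for pi, qi in zip(p, q))       # L1 distance
A = sum(pi * abs(math.log(pi / qi)) for pi, qi in zip(p, q) if pi > 0)

assert V <= math.sqrt(2 * D)          # Pinsker: V <= sqrt(2D)
assert A <= D + math.sqrt(2 * D)      # A <= D + sqrt(2D)
```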
Martingale Convergence and Limits of Information
Nonnegative martingales ρ_n correspond to densities of measures Q_n given by Q_n(A) = E[ρ_n 1_A]. Limits can be established in the same way by the chain rule: for n > m,

D(Q_n ‖ P) = D(Q_m ‖ P) + ∫ ρ_n log(ρ_n/ρ_m) dP

Thus D_n = D(Q_n ‖ P) is an increasing sequence. Suppose it is bounded. Then ρ_n is a Cauchy sequence in L_1(P) with limit ρ defining a measure Q. Also, log ρ_n is a Cauchy sequence in L_1(Q), and

D(Q_n ‖ P) ↗ D(Q ‖ P)
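The chain-rule identity behind the increasing sequence D_n can be verified exactly on a finite space, taking the coarser σ-field to be a partition (an added sketch, not from the slides):

```python
import math

# Finite sample space with uniform P; Q_n given by a density rho_n w.r.t. P.
P = [0.25, 0.25, 0.25, 0.25]
rho_n = [1.8, 0.6, 0.9, 0.7]            # fine-level density (mean 1 under P)
blocks = [[0, 1], [2, 3]]               # coarser sigma-field: a partition

# rho_m = E[rho_n | coarse partition]  (conditional expectation under P)
rho_m = [0.0] * 4
for b in blocks:
    avg = sum(rho_n[i] * P[i] for i in b) / sum(P[i] for i in b)
    for i in b:
        rho_m[i] = avg

def D(rho):
    """D(Q || P) = E_P[rho log rho]."""
    return sum(P[i] * rho[i] * math.log(rho[i]) for i in range(4) if rho[i] > 0)

increment = sum(P[i] * rho_n[i] * math.log(rho_n[i] / rho_m[i]) for i in range(4))
# Chain rule: D(Q_n||P) = D(Q_m||P) + int rho_n log(rho_n/rho_m) dP
assert abs(D(rho_n) - (D(rho_m) + increment)) < 1e-12
assert D(rho_n) >= D(rho_m)             # D_n increases along the filtration
```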
Monotonicity of Information Divergence: CLT

Central Limit Theorem setting:

{X_i} i.i.d., mean zero, finite variance
P_n = P_{Y_n} is the distribution of Y_n = (X_1 + X_2 + ... + X_n)/√n
P* is the corresponding normal distribution

For n > m,

D(P_n ‖ P*) < D(P_m ‖ P*)
Monotonicity of Information Divergence: CLT

Chain rule for n > m (not clear how to use directly in this case):

D(P_{Y_m,Y_n} ‖ P*_{Y_m,Y_n}) = D(P_n ‖ P*) + E D(P_{Y_m|Y_n} ‖ P*_{Y_m|Y_n})
                              = D(P_m ‖ P*) + E D(P_{Y_n|Y_m} ‖ P*_{Y_n|Y_m})
                              = D(P_m ‖ P*) + D(P_{n−m} ‖ P*)
Monotonicity of Information Divergence: CLT

Entropy power inequality

e^{2H(X+X')} ≥ e^{2H(X)} + e^{2H(X')}

yields

D(P_{2n} ‖ P*) ≤ D(P_n ‖ P*)

Information-theoretic proof of the CLT (B. 1986):

D(P_n ‖ P*) → 0 if and only if it is ever finite

(Johnson and B. 2004) With Poincaré constant R,

D(P_n ‖ P*) ≤ (2R/(n − 1 + 2R)) D(P_1 ‖ P*)

(Bobkov, Chistyakov, Götze 2013) Moment conditions and finite D(P_1 ‖ P*) suffice for this 1/n rate.
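For intuition, the entropy power inequality can be checked in closed form (an added sketch, not from the slides): equality holds for independent Gaussians, and the inequality is strict for, e.g., uniform summands.

```python
import math

# Entropy power N(X) = exp(2 h(X)) in one dimension, entropies in nats.
# Gaussian: h = 0.5*log(2*pi*e*var), so N = 2*pi*e*var, and the EPI
# holds with equality for independent Gaussians (variances add).
def gaussian_N(var):
    return 2 * math.pi * math.e * var

assert abs(gaussian_N(1.0 + 2.0) - (gaussian_N(1.0) + gaussian_N(2.0))) < 1e-9

# Uniform[0,1]: h = 0, so N = 1.  The sum of two independent Uniform[0,1]
# variables has the triangular density on [0,2], with entropy
# h = -2 * int_0^1 x log x dx = 1/2.
N_uniform = math.exp(2 * 0.0)
N_sum = math.exp(2 * 0.5)
assert N_sum >= N_uniform + N_uniform    # strict for non-Gaussian summands
```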
Monotonicity of Information Divergence: CLT

Entropy power inequality

e^{2H(X+X')} ≥ e^{2H(X)} + e^{2H(X')}

Generalized entropy power inequality (Madiman & B. 2006)

e^{2H(X_1+...+X_n)} ≥ (1/r) Σ_{s∈S} e^{2H(Σ_{i∈s} X_i)}

where r is the maximum number of sets in S in which any index appears.

Proof ingredients:
a simple L_2 projection property of the entropy derivative
a concentration inequality for sums of functions of subsets of independent variables:

Var(Σ_{s∈S} g_s(X_s)) ≤ r Σ_{s∈S} Var(g_s(X_s))
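The variance inequality in the proof can be verified exactly on a toy case with two independent bits and subsets S = {{1}, {2}, {1,2}}, where each index appears in r = 2 sets (an added sketch; the functions g_s are arbitrary choices):

```python
import itertools

# X1, X2 independent fair bits; subsets {1}, {2}, {1,2}; r = 2.
outcomes = list(itertools.product([0, 1], repeat=2))
prob = 0.25

g1 = lambda x: 3.0 * x[0]              # depends only on subset {1}
g2 = lambda x: -2.0 * x[1]             # depends only on subset {2}
g12 = lambda x: float(x[0] ^ x[1])     # depends on subset {1,2}
gs = [g1, g2, g12]

def var(f):
    m = sum(prob * f(x) for x in outcomes)
    return sum(prob * (f(x) - m) ** 2 for x in outcomes)

total = var(lambda x: sum(g(x) for g in gs))
# Var(sum_s g_s) <= r * sum_s Var(g_s)
assert total <= 2 * sum(var(g) for g in gs) + 1e-12
```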
Monotonicity of Information Divergence: CLT

Consequence of the generalized entropy power inequality: for all n > m,

D(P_n ‖ P*) ≤ D(P_m ‖ P*)

[Madiman and B. 2006; Tulino and Verdú 2006. An earlier, more elaborate proof by Artstein, Ball, Barthe, Naor 2004.]
Information Stability and Error Probability of Tests

Stability of log-likelihood ratios (AEP) (B. 1985, Orey 1985, Cover and Algoet 1986):

(1/n) log [ p(Y_1, Y_2, ..., Y_n) / q(Y_1, Y_2, ..., Y_n) ] → D(P‖Q) with P-probability 1,

where D(P‖Q) is the relative entropy rate

D(P‖Q) = lim (1/n) D(P_{Y^n} ‖ Q_{Y^n})

Optimal statistical test: the critical region A_n has asymptotic P-power 1 (at most finitely many mistakes: P(A_n^c i.o.) = 0) and achieves the optimal Q-probability of error

Q(A_n) = exp{−n[D + o(1)]}

This is the general form of the Chernoff–Stein lemma.
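The AEP statement can be illustrated by simulation in the simplest i.i.d. case, Bernoulli P versus Bernoulli Q (an added sketch, not from the slides):

```python
import math
import random

random.seed(0)
p, q = 0.7, 0.4                     # Bernoulli parameters under P and Q
D = p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

n = 200_000
llr = 0.0
for _ in range(n):
    y = 1 if random.random() < p else 0                  # sample from P
    llr += math.log(p / q) if y else math.log((1 - p) / (1 - q))

# (1/n) log [p(Y^n)/q(Y^n)] -> D(P||Q) with P-probability 1
assert abs(llr / n - D) < 0.02
```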
Optimality of the Relative Entropy Exponent

Information inequality: for any set A_n,

D(P_{Y^n} ‖ Q_{Y^n}) ≥ P(A_n) log [P(A_n)/Q(A_n)] + P(A_n^c) log [P(A_n^c)/Q(A_n^c)]

Consequence:

D(P_{Y^n} ‖ Q_{Y^n}) ≥ P(A_n) log [1/Q(A_n)] − H_2(P(A_n))

Equivalently,

Q(A_n) ≥ exp{ −[D(P_{Y^n} ‖ Q_{Y^n}) + H_2(P(A_n))] / P(A_n) }

Hence, for any sequence of pairs of joint distributions, no sequence of tests with P(A_n) approaching 1 can have a better Q(A_n) exponent than D(P_{Y^n} ‖ Q_{Y^n}).
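Both the information inequality and the resulting lower bound on Q(A_n) are easy to verify numerically for a single discrete pair (P, Q) and a candidate critical region (an added sketch, not from the slides):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def H2(x):
    """Binary entropy in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

P = [0.5, 0.3, 0.2]
Q = [0.2, 0.3, 0.5]
A = [0, 1]                      # a candidate critical region
PA = sum(P[i] for i in A)
QA = sum(Q[i] for i in A)
D = kl(P, Q)

# Data-processing form of the information inequality ...
lhs = PA * math.log(PA / QA) + (1 - PA) * math.log((1 - PA) / (1 - QA))
assert D >= lhs - 1e-12
# ... and the resulting lower bound on the Q-probability of error
assert QA >= math.exp(-(D + H2(PA)) / PA) - 1e-12
```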
Large Deviations, I-Projection, and Conditional Limit

P*: the information projection of Q onto a convex set C.
Pythagorean inequality (Csiszár 1975, Topsøe 1979): for P in C,

D(P ‖ Q) ≥ D(C ‖ Q) + D(P ‖ P*)

where

D(C ‖ Q) = inf_{P∈C} D(P ‖ Q)

For the empirical distribution P_n from an i.i.d. sample (Csiszár 1985):

Q{P_n ∈ C} ≤ exp{−n D(C ‖ Q)}

an information-theoretic representation of the Chernoff bound (when C is a half-space).
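For a half-space C = {P : E_P f ≥ t}, the I-projection of Q is an exponential tilt of Q, which makes the Pythagorean inequality easy to check numerically (an added sketch; the bisection targets the active mean constraint):

```python
import math

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Q uniform on {0,1,2}; C = {P : E_P[X] >= 1.5} (a half-space).
# The I-projection P* of Q onto C is a tilt q(x) e^{lam x}/Z with lam
# chosen so that the mean constraint is active.
xs = [0, 1, 2]
Q = [1/3, 1/3, 1/3]

def tilt(lam):
    w = [Q[i] * math.exp(lam * xs[i]) for i in range(3)]
    Z = sum(w)
    return [wi / Z for wi in w]

lo, hi = 0.0, 10.0
for _ in range(100):                       # bisect on the tilted mean
    mid = (lo + hi) / 2
    m = sum(x * p for x, p in zip(xs, tilt(mid)))
    lo, hi = (mid, hi) if m < 1.5 else (lo, mid)
P_star = tilt(lo)

P = [0.1, 0.2, 0.7]                        # some P in C (mean 1.6 >= 1.5)
assert sum(x * p for x, p in zip(xs, P)) >= 1.5
# Pythagorean inequality: D(P||Q) >= D(P*||Q) + D(P||P*)
assert kl(P, Q) >= kl(P_star, Q) + kl(P, P_star) - 1e-9
```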
Large Deviations, I-Projection, and Conditional Limit

For the empirical distribution P_n from an i.i.d. sample: if D(interior C ‖ Q) = D(C ‖ Q), then

Q{P_n ∈ C} = exp{−n[D(C ‖ Q) + o(1)]}

and the conditional distribution P_{Y_1,Y_2,...,Y_n | {P_n ∈ C}} converges to P*_{Y_1,Y_2,...,Y_n} in the I-divergence-rate sense (Csiszár 1985).