harvard university · hhznii ;x;s0 = z ds0p s(s 0) z y ab dq ab z y a dua y ab (uaub pq ab)e p p...

35
Statistical Mechanics of High Dimensional Inference Supplementary Material Madhu Advani Surya Ganguli Contents 1 Introduction 2 2 Deriving inference error from the statistical physics of disordered systems 2 2.1 Replica equations at finite temperature ........................................ 2 2.2 Replica equations in the low temperature limit ................................... 5 2.3 Moreau envelope formulation of the replica equations ................................ 7 2.4 Accuracy of inference from the perspective of noise ................................. 8 2.5 Noise-free replica equations .............................................. 8 3 Analytic example: quadratic loss and regularization 10 3.1 General quadratic MSE ................................................ 10 3.2 Optimal quadratic MSE ................................................ 11 3.3 Training and test errors ................................................ 12 3.4 Limiting cases and scaling ............................................... 12 3.5 Quadratic inference through a random matrix theory perspective ......................... 13 4 Lower bounds on the error of any convex inference procedure 14 4.1 Unregularized inference ................................................ 14 4.2 Regularized inference .................................................. 15 4.3 Lower bound on noise-free inference ......................................... 17 4.4 Example: bounds on compressed sensing ....................................... 18 5 Optimal inference 19 5.1 Unregularized inference ................................................ 19 5.2 Regularized inference .................................................. 22 5.3 Analytic bounds on MSE ............................................... 25 5.4 Limiting cases ...................................................... 25 5.5 Optimal noise-free inference .............................................. 27 5.6 The limits of small and large smoothing of Moreau envelopes ........................... 27 5.7 Comparison of optimal M-estimation with MMSE accuracy ............................ 29 A Review of Single Parameter Inference 30 A.1 Single parameter M-estimation ............................................ 30 A.2 Bounds on single parameter M-estimation ...................................... 30 A.3 Asymptotic analysis of single parameter Bayesian MMSE estimation ....................... 31 B Fisher Information 32 B.1 Fisher information for scalar inference ........................................ 32 B.2 Convolutional Fisher inequality ............................................ 32 B.3 Fisher information is minimized by a gaussian distribution ............................. 32 B.4 Relation between MMSE and Fisher information .................................. 33 B.5 Cramer-Rao bound ................................................... 33 C Moreau envelope and proximal map 34 C.1 Relation between proximal map and Moreau envelope ............................... 34 C.2 Inverse of the Moreau envelope ............................................ 34 C.3 Conditions for convexity of optimal M-estimators .................................. 35 1

Upload: others

Post on 29-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Statistical Mechanics of High Dimensional Inference

Supplementary Material

Madhu AdvaniSurya Ganguli

Contents

1 Introduction 2

2 Deriving inference error from the statistical physics of disordered systems 22.1 Replica equations at finite temperature . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22.2 Replica equations in the low temperature limit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52.3 Moreau envelope formulation of the replica equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72.4 Accuracy of inference from the perspective of noise . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82.5 Noise-free replica equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

3 Analytic example: quadratic loss and regularization 103.1 General quadratic MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103.2 Optimal quadratic MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113.3 Training and test errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.4 Limiting cases and scaling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123.5 Quadratic inference through a random matrix theory perspective . . . . . . . . . . . . . . . . . . . . . . . . . 13

4 Lower bounds on the error of any convex inference procedure 144.1 Unregularized inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144.2 Regularized inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154.3 Lower bound on noise-free inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174.4 Example: bounds on compressed sensing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

5 Optimal inference 195.1 Unregularized inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195.2 Regularized inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225.3 Analytic bounds on MSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255.4 Limiting cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255.5 Optimal noise-free inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.6 The limits of small and large smoothing of Moreau envelopes . . . . . . . . . . . . . . . . . . . . . . . . . . . 275.7 Comparison of optimal M-estimation with MMSE accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

A Review of Single Parameter Inference 30A.1 Single parameter M-estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30A.2 Bounds on single parameter M-estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30A.3 Asymptotic analysis of single parameter Bayesian MMSE estimation . . . . . . . . . . . . . . . . . . . . . . . 31

B Fisher Information 32B.1 Fisher information for scalar inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32B.2 Convolutional Fisher inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32B.3 Fisher information is minimized by a gaussian distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32B.4 Relation between MMSE and Fisher information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33B.5 Cramer-Rao bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

C Moreau envelope and proximal map 34C.1 Relation between proximal map and Moreau envelope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34C.2 Inverse of the Moreau envelope . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34C.3 Conditions for convexity of optimal M-estimators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

1

Page 2: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

1 Introduction

Here we provide both detailed derivations of results in the main paper as well as background information about classicalresults in statistics, to place our results on modern high dimensional inference in perspective.

2 Deriving inference error from the statistical physics of disordered systems

In this section, we derive consistency equations governing the mean squared error qs between the true parameters s0 ∈ RP andour estimate of the parameters s ∈ RP . Einstein summation convention is assumed unless otherwise stated and measurements(y1, ..., yN ) are drawn as

yµ = Xµjs0j + εµ. (1)

Given knowledge of the measurements y ∈ RN and the measurement or input matrix X ∈ RN×P (the µ’th row of X is themeasurement vector xµ in the main paper), but not the noise realizations εµ, which are drawn iid according to a probabilitydensity function Pε, we consider M-estimators of the form:

s = arg mins

N∑µ=1

ρ(yµ −Xµjsj) +

P∑j=1

σ(sj)

. (2)

The loss and regularizer ρ and σ will be assumed to be convex functions throughout this paper. In what follows we letX ∈ RN×P be a random matrix with elements drawn iid from a normal distribution with variance 1

P as Xµj ∼ N (0, 1P ),

where the 1P scaling serves to maintain an approximately unit norm for each measurement vector xµ so that both the noise

ε and signal Xs0 vectors have norms that are O(N) if each component of the noise and signal are O(1).

2.1 Replica equations at finite temperature

Result 1. Finite temperature RS relations We let β be the inverse temperature and derive the RS (replicasymmetric) equations in the high dimensional asymptotic (big-data) limit where N,P → ∞ and the ratio of samples tovariables has a finite limit N

P → α. We find that average squared residual error of the energy minimization problem (4),reduces to a set of 4 equations relating scalar order parameters qd, qs,Λρ,Λσ

qdΛ2σ

Λ2ρ

(⟨⟨〈 d− εqs 〉

2Pρ

⟩⟩ε,zε

)1

Λσ=

α

Λ2ρ

(Λρ −

⟨⟨ ⟨(δd)2

⟩Pρ

⟩⟩ε,zε

) qs =⟨⟨ ⟨

s− s0⟩2Pσ

⟩⟩s0,zs

Λρ =⟨⟨ ⟨

(δs)2⟩Pσ

⟩⟩s0,zs

Pρ(d|εqs) ∝ e− (d−εqs )2

2Λρ−βρ(d)

, Pσ(s|s0qd

) ∝ e−1

2Λσ(s−s0qd)

2−βσ(s) (3)

zs, zε are independent random unit gaussians, and s0, ε are random variables drawn from the parameter and noise prob-ability densities Ps and Pε respectively, and we define the effective noise εqs = ε +

√qszε and effective parameters

s0qd

= s0 +√qdzs. We define thermal fluctuations from the mean δs = s− 〈 s 〉 and δd = d− 〈 d 〉.

Derivation of Result 1. We now introduce a physical system which reduces to the above inference problem in the zerotemperature limit. The energy function of such a system is

E(u) =

N∑µ=1

ρ (εµ −Xµjuj) +

P∑j=1

σ(s0j − uj), (4)

where we have made a change of variable to write the problem in terms of the residual error u defined as u = s0 − s. The

Gibbs distribution on the residual error u is PG(u) = e−βE(u)

Z . We then compute the average Free Energy of the system

by applying the replica trick based on the identity limn→0〈〈Zn 〉〉−1

n = 〈〈 log(Z) 〉〉. Note that X, s0, ε here play the role ofquenched disorder. To proceed with the replica method, we first compute:

〈〈Zn 〉〉ε,X,s0 =⟨⟨ ∫ ∏n

a=1 duae−β

∑na=1 (

∑Nµ=1 ρ(Xµku

ak+εµ)+

∑Pj=1 σ(s0j−u

aj ))⟩⟩

ε,X,s0. (5)

Note that, since Xµj ∼ N (0, 1P ), for fixed ua, baµ =

∑Pj=1Xµju

aj , is a gaussian with a covariance structure

⟨⟨baµb

⟩⟩X

=Qabδµν so that

2

Page 3: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

〈〈Zn 〉〉ε,X,s0 =

∫ds0Ps(s

0)

∫ ∏ab

dQab

∫ ∏a

dua∏ab

δ(ua · ub − PQab)e−β∑Pj=1

∑a σ(s0j−u

aj )

∫ N∏µ=1

dεµPε(εµ)

∫1

|Q|N∏µa

dbaµ√2πe∑Nµ=1−β

∑a ρ(b

aµ+εµ)− 1

2

∑ab b

aµQ−1ab b

bµ ,

(6)

where |Q| denotes the determinant of Q. By noting that part of the integral factorizes in µ we have

〈〈Zn 〉〉ε,X,s0 =

∫ ∏ab

dQabe−NE(Q)

∫ds0

∏a

duaPs(s0)e−β

∑Pj=1

∑a σ(s0j−u

aj )∏ab

δ(ua · ub − PQab

), (7)

where

E(Q) = − log

(∫dεPε(ε)

∫ ∏a

dba1

|Q|e−

12 baQ−1

ab bb

e−β∑na=1 ρ(ε+b

a)

). (8)

We then replace our delta functions with integrals over dummy variables Q on the imaginary axis

〈〈Zn 〉〉ε,X,s0 =

∫ ∏ab

dQabdQabe−NE(Q)−

∑ab PQabQab

∫ds0

∏a

duaPs(s0)e−β

∑Pj=1

∑a σ(s0j−u

aj )+

∑ab Qabu

a·ub . (9)

We can simplify this by noting the factorization of the RHS to find

〈〈Zn 〉〉ε,X,s0 =

∫ n∏ab

dQabdQabe−N(E(Q)+ 1

α

∑ab QabQab−

1αS(Q)), (10)

where

S(Q) = log

(∫ds0

n∏a

duaPs(s0)e−β

∑a σ(s0−ua)+

∑ab Qabu

aub

). (11)

We now apply the replica symmetric (RS) assumption about the form of Q, and a corresponding form on Q. For a 6= b,Qab = Q0, Qaa = Q1, Qab = Q0, and Qaa = Q1. It is also useful to define the differences between the replicas ∆ = Q1 −Q0

and ∆ = Q1 − Q0. The RS assumption implies that

ba =√

∆xa +√Q0z, (12)

where xa, z are unit gaussian random variables and ∆ = Q1 −Q0, so that

e−E(Q) =⟨⟨ ∫ ∏n

a=1Dxae−β∑na=1 ρ(ε+

√Q0z+

√∆xa)

⟩⟩z,ε. (13)

Here Dx = dx e−x2/2√

2πis a gaussian integral. Noting the factorization over n replicas, this can be simplified as

e−E(Q) =⟨⟨e−nΛ(z,ε)

⟩⟩z,ε, (14)

where

Λ(z, ε) = − log

(∫Dxe−βρ(ε+

√Q0z+

√∆x)). (15)

Similarly we can factorize

eS(Q) =

∫ds0

n∏a

duaPs(s0)e−β

∑a σ(s0−ua)+Q0(

∑a u

a)2+∆

∑a (ua)2

, (16)

where ∆ = Q1 − Q0. To factorize the above expression, we exploit the identity 〈〈 eγη 〉〉η = eγ2

2 for unit gaussian η. Letting

γ =

√2Q0

∑a u

a

eS(Q) =

⟨⟨∫ds0Ps(s

0)∏a du

ae∑a

[η√

2Q0ua+∆(ua)2−βσ(s0−ua)

]⟩⟩η

. (17)

It follows that,

3

Page 4: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

eS(Q) =⟨⟨

enχ(η,s0)⟩⟩

η,s0, (18)

χ(η, s0) = log

(∫dueη√

2Q0u+∆u2−βσ(s0−u)

). (19)

To determine F , we calculate the low n Taylor expansion of 〈〈Zn 〉〉, to do this we compute:

limn→0

∂E

∂n= 〈〈Λ(z, ε) 〉〉z,ε, (20) lim

n→0

∂S

∂n=⟨⟨χ(η, s0)

⟩⟩η,s0

, (21)

limn→0

∂(QabQab

)∂n

= Q1Q1 −Q0Q0 = Q0∆ + Q0∆ + ∆∆. (22)

Using these limits to compute the free energy density given by the saddle point in (10)

F =−βFN

= minQ0,Q0,∆,∆

[〈〈Λ(z, ε) 〉〉z,ε +

1

α

(Q0∆ + Q0∆ + ∆∆

)− 1

α

⟨⟨χ(η, s0)

⟩⟩η,s0

]. (23)

Minimizing with respect to each of the order parameters Q0, Q0,∆, ∆, we find the following set of coupled equations definingthe order parameters:

∆ = −α∂〈〈Λ 〉〉z,ε∂Q0

(24)

Q0 + ∆ = −α∂〈〈Λ 〉〉z,ε

∂∆(25)

∆ =∂〈〈χ 〉〉η,s0

∂Q0

(26)

Q0 + ∆ =∂〈〈χ 〉〉η,s0

∂∆. (27)

By making a change of variables, we can write

Λ(z, ε) = − log

(1√∆

∫dye−

12∆ (y−

√Q0z−ε)

2−βρ(y)

), (28)

so that

∂〈〈Λ 〉〉z,ε∂Q0

= − 1

2∆√Q0

⟨⟨z∫dy(y−

√Q0z−ε)e−

12∆ (y−

√Q0z−ε)

2−βρ(y)∫dye−

12∆ (y−

√Q0z−ε)

2−βρ(y)

⟩⟩z,ε

. (29)

Using the fact that for a gaussian random variable z, 〈〈 zf(z) 〉〉z = 〈〈 f ′(z) 〉〉z, we can re-write this as

∂〈〈Λ 〉〉z,ε∂Q0

= − 1

2∆2

⟨⟨ ∫dy(y−

√Q0z−ε)2e−

12∆ (y−

√Q0z−ε)

2−βρ(y)∫dye−

12∆ (y−

√Q0z−ε)

2−βρ(y)

⟩⟩z,ε

⟨⟨( ∫dy(y−

√Q0z−ε)e−

12∆ (y−

√Q0z−ε)

2−βρ(y)∫dye−

12∆ (y−

√Q0z−ε)

2−βρ(y)

)2⟩⟩

z,ε

−∆

.

(30)Defining εQ0 = ε+

√Q0z yields

∆ =α

2∆2

⟨⟨ ⟨(y − εQ0

)2⟩Pρ(y)

− 〈 (y − εQ0) 〉2Pρ(y)

⟩⟩εQ0

− α

2∆, (31)

where the 〈 . 〉Pρ(y) denotes average over a Gibbs distributions Pρ(y) ∝ e−1

2∆ (y−√Q0z−ε)

2−βρ(y). Similarly, we find

∂〈〈Λ 〉〉z,ε∂∆

= − 1

2∆2

⟨⟨ ∫dy(y−

√Q0z−ε)2e−

12∆ (y−

√Q0z−ε)

2−βρ(y)∫dye−

12∆ (y−

√Q0z−ε)

2−βρ(y)

⟩⟩z,ε

+1

2∆, (32)

implying

Q0 + ∆ =α

2∆2

⟨⟨ ⟨(y − εQ0

)2⟩Pρ(y)

⟩⟩εQ0

− α

2∆. (33)

We can split this expectation into a variance term and bias term to yield

Q0 =α

2∆2

⟨⟨〈 y − εQ0 〉

2Pρ(y)

⟩⟩εQ0

. (34)

We can also compute order parameters (∆, Q0), and find

4

Page 5: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

∆ =

⟨⟨η

∫du u√

2Q0eη√

2Q0u+∆u2−βσ(s0−u)∫dueη

√2Q0u+∆u2−βσ(s0−u)

⟩⟩η,s0

, (35)

where by using the property of gaussian integrals to take the derivative with respect to η, we find

∆ =⟨⟨ ⟨

u2⟩Pσ(u)

− 〈u 〉2Pσ(u)

⟩⟩η,s0

, (36)

where Pσ(u) ∝ e∆

(u2+2

√Q02∆2 uη

)−βσ(s0−u)

∝ e∆

(u+

√Q02∆2 η

)2

−βσ(s0−u)

. Similarly, minimizing with respect to ∆, we find

Q0 + ∆ =

⟨⟨ ∫duu2eη

√2Q0u+∆u2−βσ(s0−u)∫

dueη√

2Q0u+∆u2−βσ(s0−u)

⟩⟩η,s0

=⟨⟨ ⟨

u2⟩Pσ(u)

⟩⟩η,s0

, (37)

splitting this into a bias and variance component yields

Q0 =⟨⟨〈u 〉2Pσ(u)

⟩⟩η,s0

. (38)

These relations, which hold for arbitrary temperature, may be written as

Q0 =α

2∆2

(⟨⟨〈 y − εQ0

〉2Pρ(y)

⟩⟩εQ0

), (39)

∆ =α

2∆2

(⟨⟨⟨δy2

⟩Pρ(y)

⟩⟩εQ0

−∆

), (40)

Q0 =⟨⟨〈u 〉2Pσ(u)

⟩⟩s0,η

, (41)

∆ =⟨⟨ ⟨

δs2⟩Pσ(u)

⟩⟩s0,η

, (42)

where qd = Q0

2∆2, δu = u− 〈u 〉, δy = y − 〈 y 〉, and

Pρ(y) ∝ e−1

2∆ (y−√Q0z−ε)

2−βρ(y), Pσ(u) ∝ e∆(s−s0+√qdη)

2−βσ(s). (43)

We make the simplifying change of variables ∆→ Λρ, ∆→ −12Λσ

, Q0 → qs, and Q0 → qd2Λ2

σ. We also relabel our gaussian unit

noise z → zε, η → zs to match the main paper.

2.2 Replica equations in the low temperature limit

Result 2. The low temperature limit:In the low temperature limit β →∞, the energy minimization problem and M-estimation problem are equivalent. We seeksolutions to the replica equations where the order parameters scale with β as Λρ =

λρβ , Λσ = λσ

β , so that λσ, λρ, qs, qd are

O(1) scalars and we find that the probability distributions for d and s simplify in the low temperature limit to:

Pρ(d|εqs) = N(ε,λρε′(εqs)

β

), (44)

ε(εqs) = Pλρ [ ρ ](εqs), (45)

Pσ(s|s0qd

) = N

(s,λσ s′(s0

qd)

β

), (46)

s(s0qd

) = Pλσ [σ ](s0qd

). (47)

where Pλ[ f ] is the proximal map function (see Appendix C for more information) defined as

Pλ[ f ](x) = arg miny

[(x− y)2

2λ+ f(y)

], (48)

and εqs = ε +√qszε, s

0qd

= s0 +√qdzs are random variables corresponding to the corrupted versions of the noise and

signal where zε, zs defined as zero mean, unit variance gaussian noise.

Derivation—.

Pρ(d|εqs) ∝ e−β[

(d−εqs )2

2λρ+ρ(d)

](49) Pσ(s|s0

qd) ∝ e

−β[

(s−s0qd)2

2λσ+σ(s)

](50)

If we define ε and s to be the mode (arg max) of distributions Pρ and Pσ respectively, which is equivalent to maximizing theargument of the exponentials, then

ε = Pλρ [ ρ ](εqs), (51) s = Pλσ [σ ](s0qd

). (52)

Taylor expanding Pρ and Pσ about these maxima yields

5

Page 6: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Pρ(d|εqs) ∝ e− β2

[1λρ

+ρ′′(ε)](d−ε)2

, (53) Pσ(s|s0qd

) ∝ e−β2 [ 1λσ

+σ′′(s)](s−s)2

. (54)

We can simplify the terms containing second derivatives in (53) and (54) as follows. Note that by the definition of theproximal map, it satisfies an interesting property: by differentiating the argument in the RHS of (48) and evaluating it atthe proximal point (LHS of (48)), we necessarily get zero. Applying this property to (51) yields:

ε− εqsλρ

+ ρ′(ε) = 0. (55)

Differentiating with respect to ε, we find

1

λρ+ ρ′′(ε) =

1

λρε′(εqs). (56)

Substituting this form into (53) and similarly for (54), it follows that

Pρ(d|εqs) ∝ e− β(d−ε)2

2λρε′(εqs ) , (57) Pσ(s|s0qd

) ∝ e− β(s−s)2

2λσs′(s0qd) . (58)

Result 3. M-estimator error predictions In the regularized case, consider any convex M-estimator defined by a lossfunction, regularizer pair (ρ, σ) where the parameters (s0

1, ..., s0P ) and noise (ε1, ..., εN ) are iid random variables. The

associated zero temperature replica symmetric equations simplify to:

qdλ2σ

λ2ρ

⟨⟨(ε− εqs)

2⟩⟩, (59)

1

λσ=

α

λρ〈〈 (1− ε′(εqs)) 〉〉, (60)

qs =⟨⟨ (

s− s0)2 ⟩⟩

s0,zs, (61)

λρ = λσ⟨⟨s′(s0

qd)⟩⟩s0,zs

, (62)

where

ε(εqs) = Pλρ [ ρ ](εqs), (63) s(s0qd

) = Pλσ [σ ](s0qd

). (64)

Derivation.— This result follows from the substituting the low temperature form of Pρ and Pσ in Result 2 into the finitetemperature consistency equations in Result 1 and taking the zero temperature limit.

We note that these replica equations have a well-defined relationship to the original M-estimation problem in (2). Considerfor example pairs of estimated and true signal components: (si, s

0i ) for i = 1, . . . P in the original high-dimensional inference

problem. We can obtain an empirical histogram of such joint pairs, that converges as P →∞ to a joint distribution P (si, s0i ).

This distribution is self-averaging - i.e. does not depend on the detailed realization of signal, measurements, or noise. Thereplica theory makes a mean-field prediction for this self-averaging joint distribution through (64), as described in the mainpaper. This joint distribution arises as an auxiliary scalar inference problem in which gaussian noise of variance qd is addedto s0, and this noise is cleaned up by a proximal descent step to obtain an estimate s. Integrating out the noise yields themean field prediction PMF (si, s

0i ).

A similar statement holds from the perspective of noise as opposed to signal estimation. Any inference procedure thatinfers an estimated signal s also implicitly estimates the noise through ε = y −Xs, thus yielding a decomposition of themeasurement outcome y into a sum of estimated signal contribution Xs and estimated noise contribution ε. Just as we didfor the estimated signal and the true signal, we could also consider the joint distribution of components of the estimatednoise and true noise, (εµ, εµ) for µ = 1, . . . , N . Again in the N → ∞ limit the empirical joint histogram of these quantitiesconverges to a self-averaging joint distribution P (εµ, εµ). The replica theory makes a mean field prediction for this jointdistribution through (63), also described in the main paper. This joint distribution arises as an auxiliary scalar inferenceproblem in which gaussian noise of variance qs is added to the true noise ε, and this additional noise is cleaned up by aproximal descent step to obtain an estimated noise ε. Integrating out the Gaussian noise yields the mean field predictionPMF (εµ, εµ). These distributional predictions can be derived by following a standard replica method in which the abovesteps are modified by the introduction of a delta function picking off one component of either the estimated signal or noise.See [1] section 8.2 for an example of this modification.

6

Page 7: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Result 4. The special case of unregularized M-estimation, for which σ = 0, follows directly from Result 3. In this case:

α⟨⟨

(ε− εqs)2⟩⟩

= qs (65) α〈〈 1− ε′(εqs) 〉〉 = 1 (66)

whereε(εqs) = Pλρ [ ρ ](εqs). (67)

Using the relation between the Moreau envelope and proximal map, we can also write these equations in the form:

α⟨⟨ (

λρMλρ [ ρ ]′(εqs))2 ⟩⟩

= qs (68) α⟨⟨λρMλρ [ ρ ]′′(εqs)

⟩⟩= 1 (69)

the mean square M-estimator error qs as well as the order parameter λρ are may be computed as the solution to this pairof equations, this result is in agreement with the findings of previous work [2].

Derivation of Result 4.The choice σ = 0 yields a linear map Pλσ [σ ](x) = x. Substituting this into (62) yields λρ = λσ, and into (61) yields

qs = qd. Using this simplification equations (59) and (60) become (65) and (66). Then, direct substitution of the relationbetween the proximal map and the Moreau envelope (74) derived in Appendix C.1 yields (68) and (69). Moreover, dividing(68) by the square of (69) yields the form of the replica equations exhibited in the main paper.

2.3 Moreau envelope formulation of the replica equations

While the above replica symmetric equations (59-61) provide a nice characterization of the mean field distributions PMF(ε, ε)and PMF(s, s0), it will also be useful to formulate these replica symmetric equations in terms of Moreau envelopes, insteadof proximal maps. This reformulation will simplify derivations of bounds on inference error in section 4.2 and derivationsof optimal inference procedures in section 5.2. To achieve this reformulation, we systematically apply the relation betweenproximal maps and Moreau envelopes (74) derived in Appendix C.1. This yields the result:

Result 5. M-estimator Formulation

qdλ2σ

= α⟨⟨ (Mλρ [ ρ ]′(εqs)

)2 ⟩⟩(70)

1

λσ= α

⟨⟨Mλρ [ ρ ]′′(εqs)

⟩⟩(71)

qs + qd

(1− 2

λσλρ

)= λ2

σ

⟨⟨ (Mλσ [σ ]′(s0

qd))2 ⟩⟩

s0qd

(72)

λρλσ− 1 = λσ

⟨⟨Mλσ [σ ]′′(s0

qd)⟩⟩

(73)

Derivation.—In Appendix C.1 we derive the relation between the proximal map and Moreau envelope:

Pλ[ f ](x) = x− λMλ[ f ]′(x). (74)

By applying the above relation to (59) and (60) respectively we arrive at (70) and (71). Next we derive (72), starting with(62):

λρ = λσ⟨⟨Pλσ [σ ]′(s0

qd)⟩⟩

= λσ(1−

⟨⟨λσMλσ [σ ]′′(s0

qd)⟩⟩)

. (75)

Rearranging yields

1− λρλσ

= λσ⟨⟨Mλσ [σ ]′′(s0

qd)⟩⟩. (76)

Finally, to derive (73), we begin with (61) to obtain

qs =⟨⟨ (Pλσ [σ ](s0 +

√qdz)− s0

)2 ⟩⟩s0,z

=⟨⟨ (√

qdz − λσMλσ [σ ]′(s0 +√qdz)

)2 ⟩⟩s0,z

. (77)

By expanding the square above we find

qs = qd − 2√qd⟨⟨zλσMλσ [σ ]′(s0 +

√qdz)

⟩⟩s0,z

+ λ2σ

⟨⟨ (Mλσ [σ ]′(s0

qd))2 ⟩⟩

s0qd

. (78)

Now, we again use the fact that for a gaussian random variable z, 〈〈 zf(z) 〉〉z = 〈〈 f ′(z) 〉〉z, to show that

7

Page 8: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

√qd⟨⟨zλσMλσ [σ ]′(s0 +

√qdz)

⟩⟩s0,z

= qd⟨⟨λσMλσ [σ ]′′(s0

qd)⟩⟩s0qd

= qd

(1− λρ

λσ

), (79)

where the last equality follows from (73). Substituting this into (78) yields

qs = qd − 2qd

(1− λσ

λρ

)+ λ2

σ

⟨⟨ (Mλσ [σ ]′(s0

qd))

2⟩⟩s0qd, (80)

which may be rearranged as

qs + qd

(1− 2

λσλρ

)= λ2

σ

⟨⟨ (Mλσ [σ ]′(s0

qd))2 ⟩⟩

s0qd

. (81)

Finally, note that dividing (70) by the square of (71) yields the form of the replica equation for qd exhibited in the mainpaper.

2.4 Accuracy of inference from the perspective of noise

Just as qs measures inference performance through the MSE (Mean Squared Error) in estimating the signal, it is also usefulto capture inference performance through the MSE in estimating the noise, which we work out in this section. This measureof inference performance, from the perspective of noise, will yield insights into the nature of optimal inference, as well as thenature of overfitting and generalization (see sections 5.1 and 5.2 below).

Result 6. We define qε =⟨⟨

(ε(εqs)− ε)2⟩⟩

, which corresponds to replica prediction for the MSE of an M-estimator in

estimating the the noise ε. In the regularized case we find:

qε = qs +qsα

[qdqs

(λρλσ− qsqd

)2

− qsqd

]. (82)

In the unregularized case and the over-sampled regime (α > 1), qd = qs and λρ = λσ, so this reduces to

qε = qs

(α− 1

α

). (83)

Derivation.—Using the definition of qε and the relationship between the Moreau envelope and proximal map (74), we obtain

qε =⟨⟨

(ε(εqs)− ε)2⟩⟩

=⟨⟨

(ε(εqs)− εqs +√qsz)

2⟩⟩

=⟨⟨ (√

qsz − λρMλρ [ ρ ]′(εqs))2 ⟩⟩

. (84)

Expanding the square above yields

qε = qs − 2√qsλρ

⟨⟨zMλρ [ ρ ]′(εqs)

⟩⟩+ λ2

ρ

⟨⟨ (Mλρ [ ρ ]′(εqs)

)2 ⟩⟩. (85)

Applying the identity 〈 zf(z) 〉z = 〈〈 f ′(z) 〉〉z where z is a unit gaussian random variable z, yields

qε = qs − 2qs⟨⟨λρMλρ [ ρ ]′′(εqs)

⟩⟩+⟨⟨ (

λρMλρ [ ρ ]′(εqs))2 ⟩⟩

. (86)

Then, substituting the replica symmetry equations (70),(71), yields

qε = qs − 2qsλραλσ

+qdλ

αλ2σ

= qs +qsα

[qdqs

(λρλσ− qsqd

)2

− qsqd

]. (87)

2.5 Noise-free replica equations

Motivated by the theory of compressed sensing which yields the interesting result that we can sometimes exactly recoverunknown sparse signals in the undersampled measurement regime α < 1 in the absence of noise via L1 minimization, weconsider the more general case of arbitrary convex inference in the absence of noise.

8

Page 9: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Result 7. In the limit of noise-free measurements, yµ = Xµ · s0. The following equalities hold in the regime where theMSE error is non-zero

s = mins

P∑j=1

σ(sj) s.t y = Xs. (88)

qd =1

αqs, (89)

λρ = αλσ, (90)

qs =⟨⟨

(s(s0qd

)− s0)2⟩⟩s0,zs

, (91)

⟨⟨s′(s0

qd)⟩⟩s0qd

= α, (92)

where s(s0qd

) = Pλσ [σ ](s0qd

). (93)

Derivation.—The problem is equivalent to a M-estimator of the form

s = arg mins

∑µ

γ

2(yµ −Xµjsj)

2+∑j

σ(sj)

, (94)

where we will take the limit γ → ∞ to strictly enforce equality on the constraint equations. Before we take this limit, wesubstitute the form of ρ into Result 3, yielding the following 4 equations:

qdλ2σ

=αqsγ

2

(1 + γλρ)2(95)

1

λσ=

αγ

1 + γλρ(96)

qs =⟨⟨

(Pλσ [σ ](s0qd

)− s0)2⟩⟩s0,zs

(97)

λρ = λσ⟨⟨Pλσ [σ ]′(s0

qd)⟩⟩s0qd. (98)

One can substitute (96) into (95) to show

qs = αqd. (99)

Now we consider two cases as γ → ∞. First, in the finite error regime, qs remains O(1) as γ → ∞, γλρ → ∞. Second, inthe error-free regime qs → 0, while γλρ remains O(1). In the finite error regime, the equations simplify and (96) becomesλρ = αλσ. Then (98) becomes ⟨⟨

Pλσ [σ ]′(s0qd

)⟩⟩s0qd

= α. (100)

Thus we obtain all the results in the box above corresponding to the finite error regime. We note that in the zero error limit,(96) becomes

λσ =1 + γλραγ

→ 0, (101)

since γλρ = O(1) as γ →∞. Applying the fact that λσ = 0 to (97) yields qs = qd, and to (98) yields λρ = λσ = 0. Thus wefind that in the zero error regime, as expected λρ, λσ, qd, qs → 0 is a solution to the RS equations. See [3] for a more detailedstudy of noise-free perfect inference for the specific case of L1 minimization.

9

Page 10: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

3 Analytic example: quadratic loss and regularization

3.1 General quadratic MSE

Result 8. As an example, we derive the asymptotic MSE, qs, analytically for the case of M-estimators with a square lossfunction and quadratic regularization:

ρ(x) =x2

2σ(x) = γ

x2

2, (102)

we find the normalized MSE (i.e. the fraction of unexplained variance) to be

qs =qs⟨⟨

(s0)2⟩⟩ =

φ+ αγ2λ2σ

α(1 + γλσ)2 − 1, (103)

where φ =

⟨⟨ε2⟩⟩⟨⟨

(s0)2⟩⟩ = 1

SNR is the inverse signal to noise ratio, and λσ is the non-negative root which solves the

quadratic equation:

αγλ2σ + (α− γ − 1)λσ − 1 = 0. (104)

Derivation. —From Result 3, the fact that f(x) = γ

2x2, Pλ[ f ](x) = x

1+λγ , allows us to rewrite (63) and (64) as

ε(εqs) = Pλρ [ ρ ](εqs) =εqs

1 + λρ, (105) s(s0

qd) = Pλσ [σ ](s0

qd) =

s0qd

1 + γλσ. (106)

Substituting the form of ε, s into (59 - 62) yields

qdλ2σ

λ2ρ

(1 + λρ)2

⟨⟨ε2qs⟩⟩εqs

(107)

1

λσ=

α

1 + λρ=⇒ λσ =

1 + λρα

(108)

Combining (108) and (107) yields

qd =

⟨⟨ε2qs⟩⟩

α=

⟨⟨ε2⟩⟩

+ qs

α(109)

qs =

⟨⟨(s0+z

√qd

1+λσγ− s0

)2⟩⟩

s0,z

, (110)

λρ =λσ

1 + γλσ(111)

which implies

qs =qd + (λσγ)2

⟨⟨(s0)2

⟩⟩s0

(1 + λσγ)2. (112)

Combining (108) and (111) we find that λσ must satisfy the quadratic equation

αγλ2σ + (α− γ − 1)λσ − 1 = 0. (113)

The constraint of non-negativity forces λσ to be the positive root, which is unique. Finally, substituting (109) into (112)yields

qs =

⟨⟨ε2⟩⟩

+ qs + α (γλσ)2 ⟨⟨

(s0)2⟩⟩

α (1 + γλσ)2 . (114)

Re-arranging to solve for qs and dividing by the variance of the parameters, we find

qs =qs⟨⟨

(s0)2⟩⟩ =

φ+ α(γλσ)2

α(1 + γλσ)2 − 1, (115)

where we define φ =

⟨⟨ε2⟩⟩⟨⟨

(s0)2⟩⟩ to be the noise to signal ratio. Note that in the limit of strong regularization (γ →∞), this

reduces to qs =⟨⟨

(s0)2⟩⟩

which is intuitive since the strong ridge penalty pins the estimator s to 0. The other extreme isthe un-regularized M-estimator, which corresponds simply to least squares regression and yields

qs =

⟨⟨ε2⟩⟩

α− 1, (116)

which demonstrates that the MSE is unbounded as the number of samples approaches the number of dimensions.

10

Page 11: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

3.2 Optimal quadratic MSE

Result 9. Optimal Quadratic M-estimation: We find that the optimal quadratic M-estimator has the form:

ρ(x) =x2

2σ(x) = φ

x2

2, (117)

and satisfies

qs =1− α− φ+

√(φ+ α− 1)2 + 4φ

2. (118)

This result does not depend on the general form of Pε and Ps except via φ, the inverse SNR.

Derivation of Optimal Regularization Parameter:To find the value of γ that minimizes qs, we set the derivative of (103) with respect to γ equal to zero:

dqsdγ

=2α(λσ + γ dλσdγ )

(α(1 + γλσ)2 − 1)2

(αγ2λ2

σ + (α− φ− 1)γλσ − φ)

= 0. (119)

We look for when

αγ2λ2σ + (α− φ− 1)γλσ − φ = 0, (120)

we then multiply (104) by γ:

αγ2λ2σ = γ − (α− γ − 1)γλσ. (121)

Substituting this result into (120) yields

(γ − φ)(1 + γλσ) = 0, (122)

so we find a fixed point of qs occurs at γ = φ.

Derivation of Optimal qsSubstituting the optimal regularization parameter γ = φ into (104) and multiplying by an overall factor of φ yields

α(φλσ)2 + (α− φ− 1)φλσ − φ = 0, (123)

multiplying both sides by the positive factor 1 + φ+ α, yields

α(1 + φ+ α)(φλσ)2 + (1 + φ+ α)(α− φ− 1)φλσ − φ(φ+ α+ 1) = 0. (124)

This can be expanded an re-arranged as

φ(αλσ − 1) (φλσ(1 + φ+ α) + φ+ α− 1) = φλσ(1 + φ− α) + 2φ. (125)

It follows that

φ(αλσ − 1) =φλσ(1 + φ− α) + 2φ

(φλσ(1 + φ+ α) + φ+ α− 1). (126)

Substitution of the the form of (φλσ)2

from (123) into (103) yields

qs =φ+ αφ2λ2

σ

α(1 + φλσ)2 − 1=

φλσ(1 + φ− α) + 2φ

(φλσ(1 + φ+ α) + φ+ α− 1). (127)

Combining the previous two equations implies that

qs = φ(αλσ − 1). (128)

We complete the derivation by substituting the value of φλσ, which is the positive root of a quadratic equation (123):

φλσ =φ+ 1− α+

√(φ+ 1− α)2 + 4αφ

2α. (129)

Substituting (129) into (128) yields

11

Page 12: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

qs =φ+ 1− α+

√(φ+ 1− α)2 + 4αφ− 2φ

2=

1− α− φ+√

(φ+ α− 1)2 + 4φ

2. (130)

3.3 Training and test errors

Etest =Etest⟨⟨(s0)2

⟩⟩ = qs + φ, (131)

Etrain =⟨⟨Pλρ [ ρ ](εqs)

2⟩⟩. (132)

Under our choice of ρ(x) = x2

2 , it follows that

Etrain =

⟨⟨ε2⟩⟩

+ qs

(1 + λρ)2. (133)

Combining (108) and (111) to find a quadratic equation in λρ and assuming the optimal setting γ = φ yields

(φλρ)2 + (φλρ)(α+ φ− 1)− φ = 0. (134)

This implies that in the optimal quadratic setting

λρ =qsφ, (135)

solving for Etrain = Etrain⟨⟨(s0)2

⟩⟩ yields

Etrain =φ2

Etest. (136)

3.4 Limiting cases and scaling

Result 10. Limiting case for optimal quadratic inference:

In the limit SNR� 1:

qs =

1− α α < 1,

1√SNR

α = 11

SNR(α−1) α > 1

(137)

Finite SNR:

qs =

1− α SNR

SNR+1 α� 1,1

2SNR

(√1 + 4SNR− 1

)α = 1

1SNR(α−1)+1 α� 1

(138)

In the limit SNR� 1

qs =1

SNR(α− 1) + 1(139)

The following form will be convenient for expanding qs in different limits. Factoring α+ φ− 1 out of (118) yields,

qs =

(α+ φ− 1

2

)(−1 +

α+ φ− 1

|α+ φ− 1|

√1 +

(α+ φ− 1)2

). (140)

SNR � 1:First consider α = 1, in which case (140) reduces to

qs =φ

2

(−1 +

√1 +

4

φ

)=

1

2SNR

(√1 + 4SNR− 1

). (141)

In order to Taylor expand this in small φ, we write the above expression as

qs =φ

2

(−1 +

(4

φ

)1/2√

1 +φ

4

)≈ φ

2

(−1 +

(4

φ

)1/2(1 +

φ

8

)), (142)

which approaches qs =√φ = 1√

SNRin the large SNR limit. Next, in the case α < 1, we expand (140) in small φ which yields

12

Page 13: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

qs =

(α+ φ− 1

2

)(−1−

√1 +

(α+ φ− 1)2

)≈ 1− α. (143)

For α > 1, the sign inside the square root changes, and expanding in small φ yields

qs =

(α+ φ− 1

2

)(−1 +

√1 +

(α+ φ− 1)2

)≈ φ

α+ φ− 1→ φ

α− 1. (144)

Finite SNR :Consider the limit of finite SNR with a very low sampling rate α� 1, in which case (140) may be written as

qs =1− α− φ+

√(φ+ 1)2 + 2α(φ− 1) + α2

2=

1− α− φ+ (1 + φ)√

1 + 2α(φ−1)+α2

(1+φ)2

2. (145)

Expanding this expression to first order in α yields

qs =1− α− φ+ (1 + φ)

(1 + α(φ−1)

(1+φ)2

)2

=2− α+ α (φ−1)

1+φ

2= 1− α

1 + φ= 1− α SNR

SNR + 1. (146)

These results characterize the way that, with greater SNR, increasing the measurement density yields a greater reduction inthe estimation error. The case of α = 1 is already shown in (141). In the limit of α� 1 with finite SNR:

qs =

(α+ φ− 1

2

)(−1 +

√1 +

(α+ φ− 1)2

)≈(α+ φ− 1

2

)(−1 +

(1 + 2

φ

(α+ φ− 1)2

))=

1

SNR(α− 1) + 1. (147)

SNR � 1:In the case of very low SNR (φ� 1) with finite α, the first order expansion of the expression (140) in small 1

φ is

qs =

(α+ φ− 1

2

)(−1 +

√1 +

(α+ φ− 1)2

)≈ φ

α+ φ− 1=

1

SNR(α− 1) + 1. (148)

3.5 Quadratic inference through a random matrix theory perspective

M-estimation under the choice ρ(x) = x2

2 and σ(x) = γ x2

2 becomes

s = arg mins

1

2

∑µ

yµ −∑j

Xµjsj

2

+ γ∑j

s2j

, (149)

which can be analyzed by differentiating the argument of the above expression with respect to s and setting the result equalto zero. This yields an expression for the estimated parameters:

s =(XTX + γI

)−1

XTy. (150)

Substituting y = Xs0 + ε into the previous equation, we write this in terms of the difference between the estimated and trueparameters:

s− s0 =(XTX + γI

)−1 (XT ε− γs0

). (151)

Calculating the MSE, we find

qs =1

P

⟨⟨∑i

(si − s0

i

)2 ⟩⟩=

1

P

⟨⟨(XT ε− γs0

)T (XTX + γI

)−T (XTX + γI

)−1 (XT ε− γs0

)⟩⟩. (152)

Averaging over ε, s0 yields

Pqs = σ2ε Tr

(X(XTX + γI

)−2

XT

)+ γ2σ2

s Tr

((XTX + γI

)−2), (153)

= σ2ε Tr

(XTX

(XTX + γI

)−2)

+ γ2σ2s Tr

((XTX + γI

)−2). (154)

13

Page 14: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Since XTX is a symmetric covariance matrix, it has an eigenvalue decomposition of the form

XTX = QTAQ, (155)

where A is a diagonal matrix of eigenvalues. Substituting the eigenvalue decomposition of XTX into (154) yields

Pqs = σ2ε Tr

(A (A + γI)

−2)

+ γ2σ2s Tr

((A + γI)

−2). (156)

This allows us to write the simplified asymptotic form:

qs = σ2ε

∫ρMP(λ)

λ

(λ+ γ)2 dλ+ γ2σ2

s

∫ρMP(λ)

1

(λ+ γ)2 dλ. (157)

Here ρMP(λ) is the distribution of eigenvalues of XTX. This distribution is known in random matrix theory as the Marchenko-Pastur distribution [?] and has the analytic form:

ρMP(λ) =1

√(λ+ − λ)(λ− λ−)

λ+ 1α<1(1− α)δ(λ) for λ ∈ [λ−, λ+], (158)

and the function is zero elsewhere. Here 1α<1 is equal to one when α < 1 and is zero otherwise, and

λ± =(√α± 1

)2. (159)

We divide (157) by the variance per parameter and simplify the expression further to yield

qs =qsσ2s

=

∫φλ+ γ2

(λ+ γ)2 ρ

MP(λ)dλ, (160)

where φ =σ2ε

σ2s. If we now consider the case of optimal quadratic inference γ = φ, so

qs =

∫φλ+ φ2

(λ+ φ)2 ρ

MP(λ)dλ =

∫φ

λ+ φρMP(λ)dλ =

∫∆(λ)ρMP(λ)dλ, (161)

where

∆(λ) =1

1 + λ · SNR. (162)

Thus, in the limit of low noise (SNR→∞) we find that for λ > 0: ∆(λ)→ 0, and only when λ = 0 does the function retaina positive value: ∆(0) = 1. Therefore, the behavior of the Marchenko-Pastur distribution near λ = 0 determines the largeSNR scaling behavior of the mean-squared error.

4 Lower bounds on the error of any convex inference procedure

4.1 Unregularized inference

Result 11. Unregularized High Dimensional Error Bound In the over-sampled (N > P ), but still high dimensionalregime α > 1, the asymptotic lower bound qs (estimator MSE) for any unregularized convex M-estimator is given by:

qs ≥1

αJ[ε+√qsz] ≥ 1

(α− 1)J [ ε ]. (163)

The form of the Cramer-Rao bound in this problem is derived in Appendix B.5 to be:

qs ≥1

αJ [ ε ]. (164)

Thus, the Cramer-Rao bound is not saturated for finite α. See Appendix B for more information on the Fisher informationand Cramer-Rao bound.

We define fqs to be the pdf of ε +√qsz where z is a unit gaussian random variable. The bound above follows from

constraints (65) and (66) derived in Result 4, which may be written as∫fqs (y)r′(y)dy =

1

α, (165)

∫fqs (y)r(y)2dy =

qsα, (166)

14

Page 15: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

where we have defined

r(y) = λρMλρ [ ρ ]′(y). (167)

Integrating (165) by parts and squaring the result yields(1

α

)2

=

(∫dyr(y)f ′qs (y)

)2

. (168)

By breaking the integrand into two parts and applying the Cauchy-Schwartz inequality, we find:(1

α

)2

=

(∫dy

f ′qs (y)√fqs (y)

√fqs (y)r(y)

)2

≤ J [ε+√qsz]

∫fqs (y)r(y)2dy. (169)

Substituting (166) into this yields the inequality

qs ≥1

αJ[ε+√qsz] . (170)

To complete the derivation, we apply the convolutional Fisher inequality (see Appendix B.2)

J [ ε1 + ε2 ] ≤ J [ ε1 ]J [ ε2 ]

J [ ε1 ] + J [ ε2 ], (171)

where ε1, ε2 are random variables drawn from different distributions. This result implies

J [ε+√qsz] ≤

J [ ε ]

1 + qsJ [ ε ]. (172)

Substituting this into (170) yields

qs ≥1

αJ[ε+√qsz] ≥ 1 + qsJ [ ε ]

αJ [ ε ]. (173)

This implies the inequality qsJ [ ε ] ≥ 1α (1 + qsJ [ ε ]), which may be applied repeatedly to the RHS of (173) to show

qs ≥1

αJ[ε+√qsz] ≥ 1

αJ [ ε ]

(1 +

1

α+

1

α2+ ...

). (174)

For α > 1, we can write this as

qs ≥1

αJ[ε+√qsz] ≥ 1

(α− 1)J [ ε ]. (175)

4.2 Regularized inference

Result 12. High Dimensional Regularized BoundsFor regularized M-estimators (regularizer is not assumed to be zero), we find that the estimator MSE qs is lower boundedby

qs ≥ qd − q2dJ[s0 +

√qdz]

= qMMSEs (qd), where qd =

1

αJ[ε+√qsz] . (176)

Here qMMSEs (qd) denotes the scalar minimum mean squared error of reconstructing random variable s0 corrupted by

gaussian noise with variance qd via Bayesian inference of the posterior mean:

qMMSEs (qd) =

⟨⟨ (s0 −

⟨s|s0 +

√qdz

⟩)2 ⟩⟩s0,z

. (177)

The above bound also leads to the inequality:

qs ≥1

αJ[ε+√qsz]

+ J[s0] , (178)

which implies that regularized M-estimators can leverage a highly informative signal distribution to improve on the Cramer-Rao bound in the high dimensional setting.

15

Page 16: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

In this result, we derive a lower bound on MSE qs over all convex Regularized M-estimators defined by pairs (ρ, σ), giventhe constraints imposed by Result 5. For notational convenience, we define:

rρ (y) = λρMλρ [ ρ ]′(y), (179) rσ (y) = λσMλσ [σ ]′(y). (180)

We also find it convenient to define fqs , gqd to be the pdf’s for ε +√qszε and s0 +

√qdzs respectively where zε, zs are

independent unit gaussian random variables. Thus, (72) and (73) may be written as

λρλσ− 1 =

∫g′qd (y)rσ (y)dy, (181)

∫gqd (y)rσ (y)

2dy = qs − qd − 2qd

∫g′qd (y)rσ (y)dy = qs − qd − 2qd

(λρλσ− 1

). (182)

Squaring (181) and applying the Cauchy-Schwartz inequality yields(λρλσ− 1

)2

=

(∫g′qd (y)√gqd (y)

√gqd (y)rσ (y)dy

)2

≤ J[s0 +

√qdz] ∫

gqd (y)rσ (y)2dy. (183)

Substituting (182) into the RHS of the above inequality yields(λρλσ− 1

)2

≤ J[s0 +

√qdz](

qs − qd − 2qd

(λρλσ− 1

)). (184)

Completing the square, we find(λρλσ− 1 + qdJ

[s0 +

√qdz])2

≤ J[s0 +

√qdz] (qs − qd + q2

dJ[s0 +

√qdz]). (185)

Since the LHS and Fisher information are non-negative, this implies (176):

qs ≥ qd − q2dJ[s0 +

√qdz]. (186)

To derive the form of qd in (176), we use the two constraint equations containing rρ (60) and (59). These may be written,by integrating (258) and (257) by parts, as∫

f ′qs (y)rρ (y)dy = − λραλσ

, (187)

∫fqs (y)rρ (y)

2dy =

qdλ2ρ

αλ2σ

. (188)

Squaring (187) and applying Cauchy Schwartz as before, yields(λραλσ

)2

=

(∫f ′qs (y)rρ (y)dy

)2

≤ J [ε+√qsz]

∫fqs (y)rρ (y)

2dy. (189)

By inserting (188) into the RHS, we find

qd ≥1

αJ[ε+√qsz] . (190)

By applying the relation between MMSE and Fisher information in Appendix B.4 to (186), we find that qs is lower boundedby the MMSE error of reconstruction of s0 given gaussian noise of variance qd:

qs ≥ qMMSEs (qd). (191)

Note that the function on the RHS monotonically increasing in qd since increasing the variance in the noise will makereconstruction more difficult. It follows that the strongest bound occurs for the largest possible value of qd:

qd =1

αJ[ε+√qsz] (192)

To complete the derivation, we apply the Fisher information convolution inequality (171) (see Appendix B.2) which impliesthat

J[s0 +

√qdz]≤

J[s0]

1 + qdJ[s0] , (193)

substituting this into (186) yields

16

Page 17: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

qs ≥ qd − q2dJ[s0 +

√qdz]≥ qd

1 + qdJ[s0] , (194)

and since the RHS is monotonically increasing in qd, it is lower bounded by the lowest allowed value of qd which is boundedby (190) so that

qs ≥1

αJ[ε+√qsz]

+ J[s0] . (195)

Result 13. Noise-free inferenceOne application of the high dimensional regularized bound above is that it produces a lower bound on the required densityof samples αr to achieve some permissible mean squared error per parameter Q in the big data limit only in terms of theallowed error Q and the Fisher information of the noise J [ ε ] and parameters J

[s0].

αr ≥1 +QJ [ ε ]

J [ ε ]

(1

Q− J

[s0])

. (196)

Derivation.—Re-arranging (178) we find

α ≥1− qsJ

[s0]

qsJ[ε+√qsz] ≥ (1− qsJ

[s0])(1 + qsJ [ ε ])

qsJ [ ε ], (197)

where the second inequality follows from (171). Differentiating the RHS of (197) with respect to qs, we find that it is amonotonically decreasing quantity since the Fisher information is positive. Thus, under the condition qs ≤ Q the strongestbound occurs at qs = Q and is

αr ≥ maxqs≤Q

[(1− qsJ

[s0])(1 + qsJ [ ε ])

qsJ [ ε ]

]=

[(1−QJ

[s0])(1 +QJ [ ε ])

QJ [ ε ]

]. (198)

Note that this inequality reveals how both the Fisher information of the noise and signal contributed to a lower bound onthe requisite measurement density to achieve a prescribed error.

4.3 Lower bound on noise-free inference

Result 14. In the noise free regime (ε→ 0), we find the following lower bound on the MSE qs

qs ≥ qMMSEs (qd), where qd =

qsα, (199)

and in terms of Fisher information,

qs ≥α(1− α)

J[s0 +

√qsα z] ≥ 1− α

J[s0] . (200)

Derivation.—Note that the Result can be derived directly from the regularized bound (176), which simplifies in the noise-free limit ε→ 0so that qd = qs

α and

qs ≥ qMMSEs (qd). (201)

From the relation between scalar Fisher information and MMSE (see Appendix B.2), the above equation may be written as

qs ≥ qd(1− qdJ[s0qd

]) =

qsα

(1− qsαJ[s0qd

]). (202)

By dividing both sides of this equation by qs and rearranging, we arrive at the inequality

qs ≥α(1− α)

J[s0qd

] =α(1− α)

J[s0 +

√qsα z] . (203)

From the convolutional Fisher inequality (see Appendix B.2) we know

17

Page 18: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

1

J[s0qd

] ≥ qd +1

J[s0] , (204)

substituting this into inequality into the previous equation, after some rearrangement, yields

qs ≥1− αJ[s0] . (205)

4.4 Example: bounds on compressed sensing

Result 15. For noise free compressed sensing, where the signal s0 is drawn from a distribution which is f-sparse (only afraction f of its components are nonzero), we find that the optimal convex M-estimator requires a number of measurementsN satisfying

N ≥ fP. (206)

to achieve perfect reconstruction to be possible in the asymptotic limit.

Consider the compressed sensing problem: the parameters being estimated are f-sparse so that a fraction f of them arenon-zero. In this scenario, the bound from the previous result becomes

limqs→0

J[s0 +

√qsα z]

= α1− fqs

+O(1). (207)

Substituting this into (203) we find that in the limit of perfect reconstruction, we have

α ≥ f. (208)

In other words we must have,

N

P≥ f, (209)

so the number of measurements required is greater than the size of signal multiplied by the fraction of non-zero elements foran M-estimator to achieve perfect reconstruction in the asymptotic limit.

18

Page 19: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

5 Optimal inference

5.1 Unregularized inference

Result 16. In the case of unregularized regression (σ = 0) and log-concave noise probability density Pε (or equivalentlyconvex noise energy Eε = − logPε), we find that the loss function with minimal MSE is

ρopt = −Mqs [ logPεqs ]. (210)

Mλ[ f ] is a Moreau envelope defined as

Mλ[ f ](x) = miny

[(x− y)2

2λ+ f(y)

], (211)

(see Appendix C.1) and qs is the minimal solution satisfying

qs =1

αJ[ε+√qsz] . (212)

The distribution of signal and noise estimates are obtained through

s =√qsz + s0, (213) ε = εMMSE(εqs) =

∫εP (ε|εqs)dε, (214)

with probabilities P (z) and P (εqs) respectively where z is is a unit gaussian random variable. The MSE per componentin estimating the noise is

qε = qMMSEε (qs) =

⟨⟨ (⟨ε|ε+

√qsz⟩− ε)2 ⟩⟩

ε,z. (215)

The training and test errors are

E train = σ2ε − qε = σ2

ε − qMMSEε (qs), E test = σ2

ε + qs. (216)

Derivation of the AlgorithmFor ease of notation we will write f = Pε and εq = ε+

√qz, where z is a unit gaussian random variable, so that εq is a random

variable drawn from pdf fq. Our goal is to minimize the error qs by optimizing ρ under the unregularized constraints fromResult 4 using the method of lagrange multipliers and calculus of variations. Under the definition r(x) = λρMλρ [ ρ ]′(x), theunregularized constraints (68) and (69) are

α

∫fqs (y)r(y)2dy − qs = 0, (217)

∫fqs (y)r′(y)dy − 1

α= 0. (218)

Introducing multipliers λ and µ, our lagrangian may be written

L = qs + λ

[∫fqs(y)r′(y)dy − 1

α

]+ µ

[∫fqs(y) (r(y))

2dy − qs

α

]. (219)

The strategy we employ is to first find the optimal r, and second to derive ρopt from r. Calculus of variations may be appliedif we combine the integrals in the Lagrangian so that

L = qs − λ1

α− µ 1

αqs +

∫S(r(y), r′(y))dy, (220)

where

S(r, r′) =(λr′ + µr2

)fqs . (221)

Defining δL to be the variational derivative of L (induced by varying r by a small change δr). Taylor expanding the actionS in the integrand gives:

δL =

∫ (∂S

∂rδr +

∂S

∂r′δr′)dy. (222)

Integrating the second term by parts gives

19

Page 20: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

δL =

∫ (∂S

∂r− d

dy

∂S

∂r′

)δrdy. (223)

For stationarity to hold for all δr we thus require the Euler-Lagrange (E-L) equation to hold, i.e.

∂S

∂r=

∂y

∂S

∂r′.. (224)

Inserting (221) into (224) yields

2µrfqs = λf ′qs . (225)

Re-arranging,

r(y) =d

dy

λ

2µlog(fqs(y)). (226)

To solve for λ2µ , we extremize the Lagrangian with respect to other variables after substituting the form of (225) into (219)

to find

L = qs − λ1

α− 1

αµqs −

λ2

∫(f ′qs)

2

fqs. (227)

We can now use the fact that J[ε+√qsz]

=∫ (f ′qs )2

fqsto simplify notation. Now, extremizing L with respect to λ gives

λ2µ = − 1

αJ[ε+√qsz]

, and with respect to µ gives(λ2µ

)2

= qsαJ[ε+

√qsz]

. Combining these results we find λ2µ = −qs. Note that

the zero solution can be neglected since λ2µ = 1

αJ[ε+√qsz]

. Substituting into L gives

L = qs +µ

αJ[ε+√qsz] ( 1

α− qsJ [ε+

√qsz]

). (228)

Because µ is still a free parameter, we choose qs to be the minimal value satisfying

J [ε+√qsz] qs =

1

α. (229)

Now we can substitute λ2µ = −qs into (226) yields

r(x) = −qsd

dxlog fqs(x). (230)

Thus, from the definition of r(x),

d

dxλρMλρ [ ρ

opt ](x) = −qsd

dxlog(fqs(x)). (231)

Integrating both sides of this equation, we find that

Mλρ [ ρopt ](x) = − qs

λρlog(fqs(x)) + b, (232)

where b is an arbitrary constant. We now invert the Moreau envelope, using the property that for convex f , Mλ[ f ] = gimplies f = −Mλ[−g ] (derived in Appendix C.2) yields

ρopt = −Mλρ [qsλρ

log fqs − b ] = −Mλρ [qsλρ

log fqs ] + b. (233)

Because additive constants will not effect on the M-estimation algorithm, we can write this as

ρopt(x) = −Mλρ [qsλρ

log fqs ](x) = −miny

[qsλρ

log fqs(y) +(x− y)2

2λρ

]= − qs

λρMqs [ log fqs ](x). (234)

The unregularized RS equations from Result 4 have an additional symmetry that we can exploit, the equations are unchangedunder the transform ρ→ cρ and λρ → 1

cλρ where c is a positive real number. Since we can re-scale the loss function withoutchanging the optimal of the inference algorithm, we simplify the optimal loss function by choosing λρ = qs so that

ρopt = −Mqs [ log fqs ]. (235)

Log-concavity of the noise distribution ensures that ρopt is convex (see C.3 for more details) so that the RS assumptions hold.

20

Page 21: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Result 17. Given the loss function form:

ρ = −Mq[ logPεq ]. (236)

The proximal map of ρ becomes:

Pq[ ρ ](εq) = εMMSE(εq). (237)

We first invert the Moreau envelope in (236) as explained in Appendix C.2, which yields

Mq[ ρ ](x) = − logPεq . (238)

We want to compute the proximal map of ρ, which is related to the Moreau envelope by the identity (401)

Pq[ ρ ](x) = x− qMq[ ρ ]′(x). (239)

To calculate the second term, we differentiate the form of the Moreau envelope and multiply by a constant:

−qMq[ ρ ]′(x) = qd

dxlogPεq (x) = q

ddx

∫1√

2πqse−

(x−y)2

2qs Pε(y)dy

Pεq (x)=

∫ (y−x)√2πqs

e−(x−y)2

2qs Pε(y)dy

Pεq (x)=

∫(ε− εqs)P (ε|εqs)dε. (240)

The final equality follows from Bayes rule: we can derive an expression for the prior on noise ε given a version corrupted bygaussian noise εq:

P (ε|εq) =P (εq|ε)P (ε)

P (εq)=

1√2πq

e−(εq−ε)2

2q P (ε)

P (εq). (241)

It then follows that

Pq[ ρ ](εqs) = εqs − qMq[ ρ ]′(εqs) =

∫εP (ε|εqs), dε = εMMSE(εqs), (242)

which yields the desired result that the proximal map becomes an MMSE estimate.

Distribution of estimatesWe let qs = qopt

s be the optimal MSE for the remainder of this section for notational simplicity. We can use Result 17 tocompute the distribution (67) of ε in the unregularized optimal case where we can choose λρ = qs so that,

εopt(εqs) = Pλρ [ ρopt ](εqs) = Pqs [ ρopt ](εqs) = εMMSE(εqs). (243)

MSE of noise estimationWe define qε to be the per component MSE of an algorithm in estimating noise: qε =

⟨⟨(ε− ε)2

⟩⟩. As we showed in (83),

in the optimal unregularized case,

qε = qs

(α− 1

α

). (244)

From (229), in the optimal case α satisfies

α =1

qsJ [ εqs ]. (245)

Combining these yields

qε = qs − q2sJ [ εqs ] = qMMSE

ε (qs). (246)

The final equality follows from the relation between scalar Fisher information and MMSE given in Appendix B.4.

Training and test errorsNext we compute the training and test error. The test error is simply

Etestopt = qs + σ2

ε =1

αJ [ εqs ]+ σ2

ε . (247)

We can also compute the training error

21

Page 22: Harvard University · hhZnii ;X;s0 = Z ds0P s(s 0) Z Y ab dQ ab Z Y a dua Y ab (uaub PQ ab)e P P j=1 P a ˙(s 0 ua j) Z YN =1 d P ( ) Z 1 jQjN Y a dba p 2ˇ e P N =1 P a ˆ(b a +

Etrainopt =

⟨⟨(εopt)

2⟩⟩. (248)

If we first compute the qMMSEε (qs),

qMMSEε (qs) =

⟨⟨ ∫(ε− εopt(εqs))

2P (ε|εqs)dε⟩⟩εqs

=⟨⟨ ∫

ε2P (ε|εqs)⟩⟩−⟨⟨

(εopt(εqs))2⟩⟩

εqs

= σ2ε − Etrain

opt , (249)

thus

Etrainopt = σ2

ε − qMMSEε (qs). (250)

5.2 Regularized inference

Result 18. Under the assumption of log-concave probability densities $P_\varepsilon$ and $P_s$, the optimal loss and regularizer pair is
$$\rho_{\rm opt} = -\mathcal{M}_{q_s}[\,\log P_{\varepsilon_{q_s}}\,], \qquad \sigma_{\rm opt} = -\mathcal{M}_{q_d}[\,\log P_{s_{q_d}}\,], \qquad (251)$$
where $q_s$, $q_d$ are chosen such that $q_s$ is the minimum non-negative order parameter satisfying
$$q_d = \frac{1}{\alpha J[\,\varepsilon + \sqrt{q_s}\, z\,]}, \qquad q_s = q_d\left(1 - q_d\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]\right) = q_s^{\rm MMSE}(q_d). \qquad (252)$$
The measurement deviations and parameter estimates take the values
$$\hat\varepsilon = \hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) = \int \varepsilon\, P(\varepsilon|\varepsilon_{q_s})\, d\varepsilon, \qquad (253) \qquad \hat s = \hat s_{\rm MMSE}(s^0_{q_d}) = \int s\, P(s|s^0_{q_d})\, ds, \qquad (254)$$
with probability densities $P(s^0_{q_d})$, $P(\varepsilon_{q_s})$. Moreover, the MSE of the optimal M-estimator in estimating $\varepsilon$ is
$$q_\varepsilon = \left\langle\left\langle (\hat\varepsilon(\varepsilon_{q_s}) - \varepsilon)^2 \right\rangle\right\rangle = q_\varepsilon^{\rm MMSE}(q_s). \qquad (255)$$
The corresponding test and training errors are
$$E_{\rm test} = \sigma_\varepsilon^2 + q_s = \sigma_\varepsilon^2 + q_s^{\rm MMSE}(q_d), \qquad E_{\rm train} = \sigma_\varepsilon^2 - q_\varepsilon = \sigma_\varepsilon^2 - q_\varepsilon^{\rm MMSE}(q_s). \qquad (256)$$

Derivation of the Algorithm
Our goal is to determine the optimal pair of functions $\rho$, $\sigma$, which we write as $\rho_{\rm opt}$, $\sigma_{\rm opt}$, that minimizes the MSE of our estimation. We optimize simultaneously over regularizer and loss function to minimize $q_s$ under the RS constraints (70-73). To simplify notation, we let $g = P_s$ denote the signal density and $f = P_\varepsilon$ the noise density, and we let $f_q$ and $g_a$ be the probability densities of the random variables $\varepsilon + \sqrt{q}\, z_\varepsilon$ and $s^0 + \sqrt{a}\, z_s$, respectively, for independent unit gaussian random variables $z_\varepsilon$, $z_s$. Under the definitions $r_\rho(x) = \lambda_\rho \mathcal{M}_{\lambda_\rho}[\,\rho\,]'(x)$ and $r_\sigma(x) = \lambda_\sigma \mathcal{M}_{\lambda_\sigma}[\,\sigma\,]'(x)$, we write these constraints as
$$\frac{q_d}{\lambda_\sigma^2} - \frac{\alpha}{\lambda_\rho^2}\int f_{q_s}(y)\, r_\rho(y)^2\, dy = 0, \qquad (257)$$
$$\frac{1}{\lambda_\sigma} - \frac{\alpha}{\lambda_\rho}\int f_{q_s}(y)\, r_\rho'(y)\, dy = 0, \qquad (258)$$
$$q_s - q_d - \int g_{q_d}(y)\left(r_\sigma(y)^2 - 2 q_d\, r_\sigma'(y)\right) dy = 0, \qquad (259)$$
$$\lambda_\rho - \lambda_\sigma + \lambda_\sigma \int g_{q_d}(y)\, r_\sigma'(y)\, dy = 0. \qquad (260)$$

Thus, we may define a Lagrangian for minimizing $q_s$ under the constraints (257, 258, 259, 260) as
$$\mathcal{L} = q_s + \gamma_1\!\left[\frac{q_d}{\lambda_\sigma^2} - \frac{\alpha}{\lambda_\rho^2}\int f_{q_s}(y)\, r_\rho(y)^2\, dy\right] + \gamma_2\!\left[\frac{1}{\lambda_\sigma} - \frac{\alpha}{\lambda_\rho}\int f_{q_s}(y)\, r_\rho'(y)\, dy\right] + \gamma_3\!\left[q_s - q_d - \int g_{q_d}(y)\left(r_\sigma(y)^2 - 2 q_d\, r_\sigma'(y)\right) dy\right] + \gamma_4\!\left[\lambda_\rho - \lambda_\sigma + \lambda_\sigma\int g_{q_d}(y)\, r_\sigma'(y)\, dy\right], \qquad (261)$$
where we have introduced Lagrange multipliers $\gamma_1, \gamma_2, \gamma_3, \gamma_4$. To apply the calculus of variations, we rewrite the Lagrangian as
$$\mathcal{L} = q_s + \frac{\gamma_1 q_d}{\lambda_\sigma^2} + \frac{\gamma_2}{\lambda_\sigma} + \gamma_3 q_s - \gamma_3 q_d + \gamma_4 \lambda_\rho - \gamma_4 \lambda_\sigma + \int \left(S_\rho(y) + S_\sigma(y)\right) dy, \qquad (262)$$


defining
$$S_\rho(y) = -\frac{\alpha f_{q_s}(y)}{\lambda_\rho}\left(\frac{\gamma_1}{\lambda_\rho}\, r_\rho(y)^2 + \gamma_2\, r_\rho'(y)\right), \qquad (263) \qquad S_\sigma(y) = -g_{q_d}(y)\left(\gamma_3\, r_\sigma(y)^2 - (\gamma_4\lambda_\sigma + 2\gamma_3 q_d)\, r_\sigma'(y)\right). \qquad (264)$$
Requiring stationarity with respect to both $\delta r_\rho$ and $\delta r_\sigma$ leads to the pair of Euler-Lagrange equations
$$\frac{\partial S_\rho(r_\rho, r_\rho'; y)}{\partial r_\rho} = \frac{d}{dy}\frac{\partial S_\rho(r_\rho, r_\rho'; y)}{\partial r_\rho'}, \qquad (265) \qquad \frac{\partial S_\sigma(r_\sigma, r_\sigma'; y)}{\partial r_\sigma} = \frac{d}{dy}\frac{\partial S_\sigma(r_\sigma, r_\sigma'; y)}{\partial r_\sigma'}, \qquad (266)$$
which lead to expressions for $r_\rho$ and $r_\sigma$:
$$r_\rho(y) = \frac{\lambda_\rho \gamma_2}{2\gamma_1}\frac{d}{dy}\log\left(f_{q_s}(y)\right), \qquad (267) \qquad r_\sigma(y) = -\left(\frac{\gamma_4\lambda_\sigma}{2\gamma_3} + q_d\right)\frac{d}{dy}\log\left(g_{q_d}(y)\right). \qquad (268)$$

Substituting this form of $r_\rho$ and $r_\sigma$ into (263) and (264), and expressing the results in terms of Fisher information, we find
$$\int S_\rho(y)\, dy = \alpha\frac{\gamma_2^2}{4\gamma_1}\, J[\,\varepsilon + \sqrt{q_s}\, z\,], \qquad (269) \qquad \int S_\sigma(y)\, dy = \gamma_3\left(\frac{\gamma_4\lambda_\sigma}{2\gamma_3} + q_d\right)^2 J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]. \qquad (270)$$
Thus, the Lagrangian (262) becomes
$$\mathcal{L} = q_s + \frac{\gamma_1 q_d}{\lambda_\sigma^2} + \frac{\gamma_2}{\lambda_\sigma} + \gamma_3 q_s - \gamma_3 q_d + \gamma_4\lambda_\rho - \gamma_4\lambda_\sigma + \alpha\frac{\gamma_2^2}{4\gamma_1}\, J[\,\varepsilon + \sqrt{q_s}\, z\,] + \gamma_3\left(\frac{\gamma_4\lambda_\sigma}{2\gamma_3} + q_d\right)^2 J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]. \qquad (271)$$

The optimization problem may be further simplified by extremizing with respect to $\gamma_1$ and $\gamma_2$, which yields
$$\frac{q_d}{\lambda_\sigma^2} = \frac{\alpha\gamma_2^2}{4\gamma_1^2}\, J[\,\varepsilon + \sqrt{q_s}\, z\,], \qquad (272) \qquad \frac{1}{\lambda_\sigma} = -\frac{\alpha\gamma_2}{2\gamma_1}\, J[\,\varepsilon + \sqrt{q_s}\, z\,], \qquad (273)$$
respectively. By combining these, we arrive at the constraints
$$\frac{\gamma_2}{2\gamma_1} = -\frac{q_d}{\lambda_\sigma}, \qquad (274) \qquad q_d = \frac{1}{\alpha J[\,\varepsilon + \sqrt{q_s}\, z\,]}. \qquad (275)$$

If we next extremize with respect to $\lambda_\rho$, we find that $\gamma_4 = 0$, so the constraint associated with $\gamma_4$ does not impact the minimization problem. Extremizing with respect to $\gamma_3$ and $\gamma_4$ then gives
$$q_s - q_d + q_d^2\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right] = 0, \qquad (276) \qquad \lambda_\sigma - \lambda_\rho = q_d\lambda_\sigma\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]. \qquad (277)$$
Combining (276) and (277) yields an equation determining one order parameter given the other three:
$$\frac{q_s}{q_d} = \frac{\lambda_\rho}{\lambda_\sigma}. \qquad (278)$$

It follows that the optimization problem may be written in terms of only $q_d$, $q_s$ as a minimization of $q_s$ subject to the constraints (275), (276). By defining
$$q_d(q_s) = \frac{1}{\alpha J[\,\varepsilon + \sqrt{q_s}\, z\,]}, \qquad (279)$$
the minimum MSE $q_s^{\rm opt}$ is the minimum value of $q_s$ satisfying (276):
$$q_s = q_d - q_d^2\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]. \qquad (280)$$
For convenience, we define $q_d^{\rm opt} = q_d(q_s^{\rm opt})$. Note that the form of (280) implies that $q_s$ is the MMSE of reconstructing $s^0$ from $s^0 + \sqrt{q_d}\, z$, where $z$ is unit gaussian noise (see Appendix B.4 for the relationship between Fisher information and MMSE).
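To make the fixed-point structure of (279)-(280) concrete, here is a small sketch (our illustration, not the paper's code) for the special case of gaussian signal and noise, where $J[\varepsilon + \sqrt{q_s} z] = 1/(\sigma_\varepsilon^2 + q_s)$ and $J[s^0 + \sqrt{q_d} z] = 1/(\sigma_s^2 + q_d)$; the parameter values are assumptions chosen for illustration. In this gaussian case the fixed point should coincide with the optimal quadratic MSE, the positive root of the quadratic in (302) below.

```python
import numpy as np

alpha, sigma_eps2, sigma_s2 = 1.5, 0.5, 1.0

q_s = sigma_s2                                    # start at the trivial-estimator error
for _ in range(500):                              # damped fixed-point iteration of (279)-(280)
    q_d = (sigma_eps2 + q_s) / alpha              # eq. (279) for gaussian noise
    q_s_new = q_d * sigma_s2 / (sigma_s2 + q_d)   # eq. (280) for a gaussian signal
    q_s = 0.5 * q_s + 0.5 * q_s_new

Phi = sigma_eps2 / sigma_s2
b = Phi + alpha - 1.0
q_bar = (-b + np.sqrt(b**2 + 4.0 * Phi)) / 2.0    # positive root of eq. (302)
print("iterated q_s     :", q_s)
print("quadratic theory :", q_bar * sigma_s2)
```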

To find the form of the optimal $\rho$, $\sigma$, we first combine (274) and (278) to reduce (267), and similarly (268), to
$$\frac{d}{dy}\mathcal{M}_{\lambda_\rho}[\,\rho_{\rm opt}\,](y) = -\frac{q_s^{\rm opt}}{\lambda_\rho}\frac{d}{dy}\log\left(f_{q_s^{\rm opt}}(y)\right), \qquad (281) \qquad \frac{d}{dy}\mathcal{M}_{\lambda_\sigma}[\,\sigma_{\rm opt}\,](y) = -\frac{q_d^{\rm opt}}{\lambda_\sigma}\frac{d}{dy}\log\left(g_{q_d^{\rm opt}}(y)\right). \qquad (282)$$
We can use the property derived in Appendix C.2 that $\mathcal{M}_\lambda[\,f\,] = g$ implies $f = -\mathcal{M}_\lambda[\,-g\,]$ under the assumption that $f$ and $g$ are convex, along with the fact that the constants introduced by integrating both sides of the above equations do not affect the M-estimator, to write
$$\rho_{\rm opt} = -\mathcal{M}_{\lambda_\rho}\!\left[\frac{q_s^{\rm opt}}{\lambda_\rho}\log\left(f_{q_s^{\rm opt}}\right)\right], \qquad (283) \qquad \sigma_{\rm opt} = -\mathcal{M}_{\lambda_\sigma}\!\left[\frac{q_d^{\rm opt}}{\lambda_\sigma}\log\left(g_{q_d^{\rm opt}}\right)\right]. \qquad (284)$$

Or equivalently,


$$\rho_{\rm opt} = -\frac{q_s^{\rm opt}}{\lambda_\rho}\mathcal{M}_{q_s^{\rm opt}}\!\left[\log\left(f_{q_s^{\rm opt}}\right)\right], \qquad (285) \qquad \sigma_{\rm opt} = -\frac{q_d^{\rm opt}}{\lambda_\sigma}\mathcal{M}_{q_d^{\rm opt}}\!\left[\log\left(g_{q_d^{\rm opt}}\right)\right]. \qquad (286)$$
The regularized RS equations from Result 3 have an additional symmetry that we can exploit: the equations are unchanged under the transformation $(\rho, \sigma) \to (c\rho, c\sigma)$ with $(\lambda_\rho, \lambda_\sigma) \to \left(\frac{1}{c}\lambda_\rho, \frac{1}{c}\lambda_\sigma\right)$, where $c$ is a positive real number. Since we can multiply $\sigma_{\rm opt}$ and $\rho_{\rm opt}$ by the same positive constant without affecting the results of the inference algorithm, we simplify the optimal M-estimator by rescaling such that $\lambda_\rho = q_s^{\rm opt}$, which from (278) implies $\lambda_\sigma = q_d^{\rm opt}$, so that
$$\rho_{\rm opt} = -\mathcal{M}_{q_s^{\rm opt}}\!\left[\log\left(f_{q_s^{\rm opt}}\right)\right], \qquad (287) \qquad \sigma_{\rm opt} = -\mathcal{M}_{q_d^{\rm opt}}\!\left[\log\left(g_{q_d^{\rm opt}}\right)\right]. \qquad (288)$$
Log-concavity of the noise and parameter distributions ensures that $\rho_{\rm opt}$ and $\sigma_{\rm opt}$ are both convex (see C.3 for more details).

Distribution of Estimates
For ease of notation, let $q_s = q_s^{\rm opt}$ and $q_d = q_d^{\rm opt}$ for the remainder of this section. As we derived previously (63, 64), the distributions of the measurement deviation $\hat\varepsilon = y - X\hat s$ and of the estimates $\hat s$ are
$$\hat\varepsilon_{\rm opt}(\varepsilon_{q_s}) = \mathcal{P}_{\lambda_\rho}[\,\rho_{\rm opt}\,](\varepsilon_{q_s}), \qquad (289) \qquad \hat s_{\rm opt}(s^0_{q_d}) = \mathcal{P}_{\lambda_\sigma}[\,\sigma_{\rm opt}\,](s^0_{q_d}). \qquad (290)$$
Since we can set $\lambda_\rho = q_s$, $\lambda_\sigma = q_d$ in the optimal case, we can apply Result 17 and find that
$$\hat\varepsilon_{\rm opt}(\varepsilon_{q_s}) = \mathcal{P}_{q_s}[\,\rho_{\rm opt}\,](\varepsilon_{q_s}) = \hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}), \qquad (291) \qquad \hat s_{\rm opt}(s^0_{q_d}) = \mathcal{P}_{q_d}[\,\sigma_{\rm opt}\,](s^0_{q_d}) = \hat s_{\rm MMSE}(s^0_{q_d}). \qquad (292)$$

It also follows that
$$q_s^{\rm MMSE}(q_d) = \left\langle\left\langle \int (s^0 - \hat s_{\rm opt})^2\, P(s^0|s^0_{q_d})\, ds^0 \right\rangle\right\rangle_{s^0_{q_d}} = \left\langle\left\langle \left(s^0 - \hat s_{\rm opt}\right)^2 \right\rangle\right\rangle = q_s, \qquad (293)$$
so $q_s$ is the MMSE of inferring $s^0$ from $s^0_{q_d}$.

MSE of Noise Estimation
As we showed in (82), in the optimal regularized case,
$$q_\varepsilon = q_s + \frac{q_s}{\alpha}\left[\frac{q_d}{q_s}\left(\frac{\lambda_\rho}{\lambda_\sigma} - \frac{q_s}{q_d}\right)^2 - \frac{q_s}{q_d}\right]. \qquad (294)$$
Choosing $\lambda_\rho = q_s$, $\lambda_\sigma = q_d$ and substituting the form of $q_d$ (279), this expression simplifies to
$$q_\varepsilon = q_s - q_s^2\, J[\,\varepsilon_{q_s}\,] = q_\varepsilon^{\rm MMSE}(q_s), \qquad (295)$$
where the final equality again uses the relation between scalar Fisher information and MMSE in Appendix B.4.

Training and Test Errors
The test error is simply
$$E_{\rm test} = \sigma_\varepsilon^2 + q_s = \sigma_\varepsilon^2 + q_s^{\rm MMSE}(q_d). \qquad (296)$$
From the definition of the training error, we have
$$E_{\rm train} = \left\langle\left\langle \hat\varepsilon^2 \right\rangle\right\rangle_{\varepsilon_{q_s}}, \qquad E_{\rm train}^{\rm opt} = \left\langle\left\langle (\hat\varepsilon_{\rm opt})^2 \right\rangle\right\rangle_{\varepsilon_{q_s}}. \qquad (297)$$

It then follows from (291) that
$$q_\varepsilon^{\rm MMSE}(q_s) = \left\langle\left\langle \int (\varepsilon - \hat\varepsilon_{\rm opt}(\varepsilon_{q_s}))^2\, P(\varepsilon|\varepsilon_{q_s})\, d\varepsilon \right\rangle\right\rangle_{\varepsilon_{q_s}} = \left\langle\left\langle \int \varepsilon^2\, P(\varepsilon|\varepsilon_{q_s})\, d\varepsilon \right\rangle\right\rangle - \left\langle\left\langle (\hat\varepsilon_{\rm opt}(\varepsilon_{q_s}))^2 \right\rangle\right\rangle_{\varepsilon_{q_s}} = \sigma_\varepsilon^2 - \left\langle\left\langle (\hat\varepsilon_{\rm opt}(\varepsilon_{q_s}))^2 \right\rangle\right\rangle_{\varepsilon_{q_s}}. \qquad (298)$$
Rearranging, and substituting the form of $E_{\rm train}^{\rm opt}$, it immediately follows that
$$E_{\rm train}^{\rm opt} = \sigma_\varepsilon^2 - q_\varepsilon^{\rm MMSE}(q_s). \qquad (299)$$


5.3 Analytic bounds on MSE

Result 19. The accuracy of unregularized M-estimation is invariant with respect to the true $s^0$ distribution, and the optimal MSE obeys
$$\frac{1}{\sigma_\varepsilon^2\, J[\,\varepsilon\,]}\,\frac{1}{\alpha-1} \;\le\; \frac{q_s^{\rm opt}}{\sigma_\varepsilon^2} \;\le\; \frac{1}{\alpha-1}. \qquad (300)$$
In the regularized case, the normalized optimal MSE $\bar q_s^{\rm opt} = q_s^{\rm opt}/\sigma_s^2$ satisfies the bounds
$$\frac{1}{\sigma_s^2\, J[\,s^0\,]}\, q_s^{\rm Quad}\!\left(\alpha, \frac{J[\,s^0\,]}{J[\,\varepsilon\,]}\right) \;\le\; \bar q_s^{\rm opt} \;\le\; q_s^{\rm Quad}\!\left(\alpha, \frac{\sigma_\varepsilon^2}{\sigma_s^2}\right), \qquad (301)$$
where $q_s^{\rm Quad}(\alpha, \Phi)$, the form of which is derived in (118), is the positive root of the quadratic
$$x^2 + (\Phi + \alpha - 1)x - \Phi = 0. \qquad (302)$$
Note that under the condition of gaussian signal and noise, $J[\,s^0\,] = \frac{1}{\sigma_s^2}$ and $J[\,\varepsilon\,] = \frac{1}{\sigma_\varepsilon^2}$, thus
$$\bar q_s^{\rm opt} = q_s^{\rm Quad}\!\left(\alpha, \frac{\sigma_\varepsilon^2}{\sigma_s^2}\right). \qquad (303)$$

Derivation. In the unregularized case, quadratic M-estimation (least-squares regression) has the MSE (116), which upper-bounds the optimal MSE, while the high dimensional Cramer-Rao bound (163) lower-bounds it.

The upper bound on the optimal regularized M-estimator follows from Result 9 for the optimal quadratic M-estimator, which necessarily has an error greater than or equal to that of the optimal M-estimator. To derive our lower bound, we apply the convolutional Fisher inequality to the bounds derived for regularized M-estimation:
$$q_d \ge \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]} \ge \frac{1}{\alpha}\left(\frac{1}{J[\,\varepsilon\,]} + q_s\right), \qquad (304)$$
$$q_s \ge q_d\left(1 - q_d\, J\!\left[\, s^0_{q_d} \,\right]\right) \ge q_d\left(1 - \frac{q_d\, J[\,s^0\,]}{q_d\, J[\,s^0\,] + 1}\right). \qquad (305)$$
Combining the previous two inequalities yields the desired lower bound.
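As a numerical illustration (our sketch; the Laplace choices and parameter values are assumptions made for concreteness), the bounds of Result 19 can be evaluated directly once $J[\varepsilon]$, $J[s^0]$ and the variances are known; for a Laplace density the Fisher information equals 2 divided by the variance.

```python
import numpy as np

def q_quad(alpha, phi):
    """Positive root of x^2 + (phi + alpha - 1) x - phi = 0, eq. (302)."""
    b = phi + alpha - 1.0
    return (-b + np.sqrt(b**2 + 4.0 * phi)) / 2.0

alpha = 2.0
sigma_eps2, sigma_s2 = 0.5, 1.0                    # assumed noise and signal variances
J_eps, J_s0 = 2.0 / sigma_eps2, 2.0 / sigma_s2     # Laplace densities: J = 2 / variance

# Unregularized bounds of eq. (300) on q_s^opt / sigma_eps^2:
print("unregularized:", 1.0 / (sigma_eps2 * J_eps * (alpha - 1.0)),
      "<= q_s/sig_eps^2 <=", 1.0 / (alpha - 1.0))

# Regularized bounds of eq. (301) on q_s^opt / sigma_s^2:
lower = q_quad(alpha, J_s0 / J_eps) / (sigma_s2 * J_s0)
upper = q_quad(alpha, sigma_eps2 / sigma_s2)
print("regularized  :", lower, "<= q_s/sig_s^2 <=", upper)
```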

5.4 Limiting cases

Result 20. Limiting cases for optimal M-estimation: The form of the optimal M-estimator derived for log-concave parameter and noise distributions further simplifies in the following limits of the measurement density $\alpha$ and the SNR.

High SNR ($\sigma_\varepsilon \to 0$): for $\alpha > 1$, $\;q_s \approx \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]}$; for $\alpha \ll 1$, $\;q_s \approx \sigma_s^2\, q_s^{\rm Quad}\!\left(\alpha, \frac{\sigma_\varepsilon^2}{\sigma_s^2}\right)$.

SNR of O(1): for $\alpha \gg 1$, $\;q_s \approx \frac{1}{\alpha J[\,\varepsilon\,]}$; for $\alpha \ll 1$, $\;q_s \approx \frac{1}{\frac{1}{\sigma_s^2} + \alpha J[\,\varepsilon_{q_s}\,]}$.

Low SNR ($\sigma_\varepsilon \to \infty$): for finite $\alpha$, $\;q_s \approx \frac{1}{\frac{1}{\sigma_s^2} + \alpha J[\,\varepsilon\,]}$.

Derivation. We choose $\sigma_s$ to be $O(1)$ and vary the scale of the noise $\sigma_\varepsilon$, corresponding to different limits of SNR. We also assume that $P_\varepsilon$, $P_s$ have finite Fisher information when their variances are finite. From Result 18, in cases where the optimal M-estimator may be computed, the MSE $q_s$ obeys the conditions
$$q_d = \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]}, \qquad (306) \qquad q_s = q_d\left(1 - q_d\, J\!\left[\, s^0_{q_d} \,\right]\right). \qquad (307)$$


High SNR ($\sigma_\varepsilon \to 0$)

Case $\alpha > 1$: In the over-sampled regime, where $\alpha > 1$, even least-squares regression achieves reconstruction with an error that vanishes as $\sigma_\varepsilon \to 0$, as can be seen in (116). The optimal M-estimator has a strictly lower error than least squares, thus $q_s \to 0$. It follows that $J[\,\varepsilon_{q_s}\,]$ approaches infinity, so from (306) we know $q_d \to 0$. It follows from (307) that in this limit of small $q_s$ and $q_d$, $q_s \approx q_d$. Substituting this into (306) yields
$$q_s \approx \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]}. \qquad (308)$$
Of course, we need $\sigma_\varepsilon$ sufficiently small so that $q_s \approx 0$, and it is sufficient to bound $q_s$ from above by the least-squares error $q_s^{\rm LS} = \frac{\sigma_\varepsilon^2}{\alpha-1}$; we thus require lower noise when $\alpha$ is close to one, such that $\sigma_\varepsilon^2 \ll \alpha - 1$. A regularizer is not required to accurately estimate $s^0$ in this regime, so information about the parameter distribution is not useful, and we find that the high dimensional unregularized bound is an accurate approximation of the achievable MSE.

Case $\alpha \ll 1$: $q_s$ is bounded above by $\sigma_s^2$, since the optimal estimator must outperform the trivial estimator that simply estimates zero for every parameter. $q_s$ is also non-zero for finite $J[\,s^0\,]$, as can be seen from the lower bound in (301). It follows that for $\alpha \to 0$,
$$q_d = \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]} \to \infty. \qquad (309)$$
In this limit, $s^0_{q_d}$ may be approximated as a gaussian with variance $q_d + \sigma_s^2$. Similarly, $q_s$ is $O(1)$ and the variance of $\varepsilon$ is small, so $\varepsilon_{q_s}$ is also approximately gaussian and
$$J[\,\varepsilon_{q_s}\,] \approx \frac{1}{q_s + \sigma_\varepsilon^2}, \qquad J\!\left[\, s^0_{q_d} \,\right] \approx \frac{1}{q_d + \sigma_s^2}. \qquad (310)$$
Substituting into the optimal M-estimator conditions and solving for $q_s$ yields the result that the optimal M-estimator error is equivalent to that of the optimal quadratic loss-regularizer pair in this regime.

SNR is O(1) ($\sigma_\varepsilon = O(1)$)

Case $\alpha \gg 1$: In the highly over-sampled regime, again even least-squares regression approaches zero error, so the optimal M-estimator error satisfies $q_s \to 0$. Since $\alpha$ is large,
$$q_d \approx \frac{1}{\alpha J[\,\varepsilon\,]} \to 0. \qquad (311)$$
It follows from (307) that for small $q_d$, $q_s \approx q_d$:
$$q_s \approx \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]} \approx \frac{1}{\alpha J[\,\varepsilon\,]}, \qquad (312)$$
where the final approximation follows from the fact that $q_s$ is small while $\sigma_\varepsilon$ is order 1. This is therefore a regime where the Cramer-Rao bound becomes tight and maximum likelihood becomes optimal.

Case $\alpha \ll 1$: It immediately follows from (306) that $q_d \to \infty$, which implies that $s^0_{q_d}$ can be approximated by a gaussian, so (307) becomes
$$q_s \approx q_d\left(1 - \frac{q_d}{q_d + \sigma_s^2}\right) = \frac{1}{\alpha J[\,\varepsilon_{q_s}\,] + \frac{1}{\sigma_s^2}}. \qquad (313)$$

Low SNR ($\sigma_\varepsilon \to \infty$)

Case $\alpha < \infty$: In this regime $q_d$ approaches infinity, which follows from our assumption that for large $\sigma_\varepsilon$ the Fisher information of the noise approaches zero. For large $q_d$, we may again approximate $s^0_{q_d}$ as a gaussian and recover (313). Since $q_s = O(1)$, the smoothing $\sqrt{q_s}\, z$ is very small compared to $\varepsilon$, so that $\varepsilon_{q_s} \approx \varepsilon$:
$$q_s \approx \frac{1}{\alpha J[\,\varepsilon\,] + \frac{1}{\sigma_s^2}}. \qquad (314)$$


5.5 Optimal noise-free inference

Result 21. In the case of noise-free reconstruction (for $P_s$ log-concave), the optimal regularization function is
$$\sigma_{\rm opt}(x) = -\mathcal{M}_{q_d}\!\left[\log P_{s^0_{q_d}}\right](x), \qquad (315)$$
and it yields an MSE of $q_s$, where $q_s$ and $q_d$ satisfy
$$q_s = q_s^{\rm MMSE}(q_d), \qquad (316) \qquad q_d = \frac{q_s}{\alpha}, \qquad (317)$$
and the estimated parameters $\hat s$ of the M-estimator are distributed such that they take the value
$$\hat s = \hat s_{\rm MMSE}(s^0_{q_d}) \qquad (318)$$
with probability density $P(s^0_{q_d})$.

This result follows from minimizing the MSE $q_s$ under the noise-free error relations of Result 7. We have $\lambda_\rho = \alpha\lambda_\sigma$ and $q_s = \alpha q_d$, and by substituting this pair of identities into (259) and (260), we find
$$\alpha q_d - q_d - \int g_{q_d}(y)\left(r_\sigma(y)^2 - 2 q_d\, r_\sigma'(y)\right) dy = 0, \qquad (319)$$
$$1 + \frac{1}{\alpha}\left(\int g_{q_d}(y)\, r_\sigma'(y)\, dy - 1\right) = 0. \qquad (320)$$
If we then follow the same procedure of constructing a Lagrangian and performing calculus of variations as in the previous section, we arrive at the form of $\sigma_{\rm opt}$ above, as well as the constraint on $q_s$:
$$q_s = \frac{\alpha(1 - \alpha)}{J\!\left[\, s^0 + \sqrt{\frac{q_s}{\alpha}}\, z \,\right]}. \qquad (321)$$
In order to remove $\alpha$ from the above equation, we first rearrange it to solve for $\alpha$:
$$\alpha = 1 - q_d\, J\!\left[\, s^0_{q_d} \,\right]. \qquad (322)$$
Substituting this form of $\alpha$ into the preceding equation allows us to write $q_s$ as
$$q_s = q_d\left(1 - q_d\, J\!\left[\, s^0_{q_d} \,\right]\right) = q_s^{\rm MMSE}(q_d), \qquad (323)$$
where the final equality follows from the relation between Fisher information and MMSE in Appendix B.4. We know from the regularized RS equations that the estimated parameters are distributed as
$$\hat s = \mathcal{P}_{\lambda_\sigma}[\,\sigma\,](s^0_{q_d}). \qquad (324)$$
Applying Result 17 to the form of $\sigma_{\rm opt}$, and using the fact that we can set $\lambda_\sigma = q_d$ for optimal inference, the distribution of estimated parameters (318) immediately follows. Note that it also follows directly from the form of the estimated parameters that
$$q_s = \left\langle\left\langle \left(\hat s_{\rm MMSE}(s^0_{q_d}) - s^0\right)^2 \right\rangle\right\rangle = q_s^{\rm MMSE}(q_d). \qquad (325)$$
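A quick sketch of the noise-free self-consistency equations (316)-(317) (ours, for illustration only, assuming a gaussian signal of variance $\sigma_s^2$, for which $q_s^{\rm MMSE}(q_d) = q_d\sigma_s^2/(\sigma_s^2 + q_d)$): in this gaussian case the fixed point can be checked against the closed form $q_s = (1-\alpha)\sigma_s^2$ for $\alpha < 1$ and $q_s = 0$ for $\alpha \ge 1$.

```python
import numpy as np

sigma_s2 = 1.0
for alpha in [0.25, 0.5, 0.75, 1.5]:
    q_s = sigma_s2                                # start from the trivial-estimator error
    for _ in range(2000):
        q_d = q_s / alpha                         # eq. (317)
        q_s = q_d * sigma_s2 / (sigma_s2 + q_d)   # eq. (316) for a gaussian signal
    prediction = max(0.0, (1.0 - alpha) * sigma_s2)
    print(f"alpha = {alpha:4.2f}: iterated q_s = {q_s:.4f}, closed form = {prediction:.4f}")
```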

5.6 The limits of small and large smoothing of Moreau envelopes

Result 22. We derive the form of the optimal M-estimator in the limits of both small and large smoothing. This yields simplified forms for both optimal unregularized and regularized inference at high and low measurement densities. Letting $f_q$ denote the density of a pdf $f$ convolved with a zero-mean gaussian of variance $q$, and defining $\mu_f$, $\sigma_f^2$ to be the mean and variance of $f$ respectively, we find (up to $x$-independent additive constants)*:
$$\lim_{q\to 0}\mathcal{M}_q[\,\log f_q\,](x) = \log f(x), \qquad (326) \qquad \lim_{q\to\infty}\mathcal{M}_q[\,\log f_q\,](x) = -\frac{(x-\mu_f)^2}{2\sigma_f^2}. \qquad (327)$$

*A sufficient condition for (326) is that $\left|\frac{\partial f}{f}\right|$ is bounded above. For details on the subgradient $\partial$ see e.g. [4]. A sufficient condition for (327) is that $f$ has finite moments/cumulants.


$$\mathcal{M}_q[\,\log f_q\,](x) = \min_y\left[\log f_q(y) + \frac{(x-y)^2}{2q}\right] = \log f_q\!\left(y^*_q(x)\right) + \frac{\left(x - y^*_q(x)\right)^2}{2q}, \qquad (328)$$
where $y^*_q(x)$ is the minimizer of the Moreau envelope (the prox function), and thus satisfies
$$y^*_q(x) = x - q\,\frac{\partial f_q\!\left(y^*_q(x)\right)}{f_q\!\left(y^*_q(x)\right)}. \qquad (329)$$
In the limit $q\to 0$,
$$\lim_{q\to 0}\mathcal{M}_q[\,\log f_q\,](x) = \lim_{q\to 0}\left[\log f_q\!\left(x - q\,\frac{\partial f_q\!\left(y^*_q(x)\right)}{f_q\!\left(y^*_q(x)\right)}\right) + \frac{q}{2}\left(\frac{\partial f_q\!\left(y^*_q(x)\right)}{f_q\!\left(y^*_q(x)\right)}\right)^2\right] = \log f_0(x) = \log f(x), \qquad (330)$$
where the second equality follows from the assumption of bounded $\left|\frac{\partial f}{f}\right|$ (so that $\lim_{q\to 0} q\left|\frac{\partial f}{f}\right| = 0$). For the case where $q \to \infty$, writing $\hat f(k)$ for the Fourier transform of $f$,

$$2\pi f_q(y) = \int dk\, e^{iky}\, e^{-\frac{qk^2}{2}}\, \hat f(k) = \int dk\, e^{iky - \frac{qk^2}{2} + \log \hat f(k)} = \int dk\, e^{-q\left(\frac{k^2}{2} - \frac{iky}{q} - \frac{1}{q}\log \hat f(k)\right)}. \qquad (331)$$
Thus, letting $H(k) = \log \hat f(k)$,
$$f_\infty(y) = \frac{1}{2\pi}\lim_{q\to\infty}\int dk\, e^{-q\left(\frac{k^2}{2} - \frac{iky}{q} - \frac{1}{q}H(k)\right)}. \qquad (332)$$

Now, assuming $f$ has finite cumulants, we can evaluate this integral via a saddle point argument:
$$f_\infty(y) = \lim_{q\to\infty} e^{-q\left(\frac{\bar k^2}{2} - \frac{i\bar k y}{q} - \frac{1}{q}H(\bar k)\right)}\sqrt{\frac{1}{2\pi\left(q - H''(\bar k)\right)}}, \qquad (333)$$
where $\bar k$ is the maximizer of the integrand. A perturbative expansion of $\bar k$ gives
$$\bar k = \frac{iy + H'(0)}{q} + \frac{H''(0)\left(iy + H'(0)\right)}{q^2} + \cdots \qquad (334)$$
It is not hard to show that $H(0) = 0$, $H'(0) = -i\mu$, and $H''(0) = -\sigma^2$, where $\mu$ and $\sigma^2$ are respectively the mean and variance of $f$, and we find
$$f_\infty(y) = \lim_{q\to\infty}\frac{1}{\sqrt{2\pi(q+\sigma^2)}}\, e^{-\frac{(y-\mu)^2}{2(q+\sigma^2)}}. \qquad (335)$$

Thus, the smoothed distribution tends to a gaussian, so in this limit we may approximate the original pdf $f$ by a gaussian with identical mean and variance. We can now use this approximation of $f_q$ to compute $\rho$ in the large-$q$ limit:
$$\rho_\infty(x) = \lim_{q\to\infty} -\inf_y\left[-\frac{(y-\mu)^2}{2(q+\sigma^2)} + \frac{(x-y)^2}{2q}\right]. \qquad (336)$$
After some manipulation we find
$$\rho_\infty(x) = \frac{(x-\mu)^2}{2\sigma^2}. \qquad (337)$$

We use this result in the main paper to demonstrate that, under suitable assumptions in the high dimensional regime, the optimal unregularized loss approaches a quadratic, and the optimal regularized M-estimator has a regularization function approaching a square penalty.

These results can be applied to simplify optimal inference in various limits. For example, in unregularized inference, as $\alpha \to 1$, $q_s \to \infty$, so the results of this section imply that the optimal loss function (210) simplifies to a quadratic in the under-sampled limit. Also, as $\alpha \to \infty$, $q_s \to 0$, implying that the optimal loss function simplifies to maximum likelihood in the over-sampled limit.

In the case of regularized inference, as $\alpha \to 0$, $q_d \to \infty$, so the optimal regularization function (251) approaches a quadratic. If we are additionally in the low-noise (high SNR) limit, then $q_s$ is $O(1)$ while $\varepsilon$ is small, so smoothing dominates and the optimal loss also approaches a quadratic. In the over-sampled limit $\alpha \to \infty$, $q_d \to 0$ and $q_s = q_d - q_d^2\, J\!\left[\, s^0_{q_d} \,\right] \approx q_d$; since the smoothing for both the loss and the regularizer approaches $0$ at the same rate, the optimal M-estimator (251) approaches MAP.
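The two limits in Result 22 can also be probed numerically. The sketch below is ours (not the paper's); the unit-scale Laplace density, grid sizes, and smoothing levels are assumptions chosen purely for illustration. It evaluates $\mathcal{M}_q[\log f_q]$ on a grid, comparing it with $\log f$ for small $q$, and for large $q$ it extracts the curvature of $-\mathcal{M}_q[\log f_q]$ at the mean, which should approach $1/\sigma_f^2 = 1/2$ for the unit Laplace (whose variance is 2).

```python
import numpy as np

def envelope(q, x_eval, half_width=60.0, n=6001):
    """M_q[log f_q] at the points x_eval, for f = Laplace(0, 1), via a y-grid."""
    y = np.linspace(-half_width, half_width, n)
    dy = y[1] - y[0]
    f = 0.5 * np.exp(-np.abs(y))
    kern = np.exp(-y**2 / (2 * q)) / np.sqrt(2 * np.pi * q)
    log_fq = np.log(np.maximum(np.convolve(f, kern, mode="same") * dy, 1e-300))
    pen = (x_eval[:, None] - y[None, :])**2 / (2 * q)
    return np.min(log_fq[None, :] + pen, axis=1)

x = np.array([-1.0, -0.5, 0.0, 0.5, 1.0])
print("small-q envelope:", envelope(0.01, x))
print("log f(x)        :", np.log(0.5 * np.exp(-np.abs(x))))

h = 0.05
for q in [5.0, 20.0, 80.0]:
    m = envelope(q, np.array([-h, 0.0, h]))
    curvature = -(m[0] - 2.0 * m[1] + m[2]) / h**2   # curvature of -M_q[log f_q] at the mean
    print(f"q = {q:5.1f}: curvature ~ {curvature:.3f} (large-q limit 0.5)")
```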


5.7 Comparison of optimal M-estimation with MMSE accuracy

In this section we derive replica theory predictions, using the finite temperature RS relations we derived in Result 1, to compute the Bayes-optimal MMSE. The MMSE estimator is simply the mean of the posterior distribution of the parameters given data $y, X$, and one may compute this posterior distribution using Bayes' rule:
$$P(s^0|y, X) = \frac{P(y|s^0, X)\,P(s^0)\,P(X)}{P(y|X)\,P(X)} = \frac{P(y|s^0, X)\,P(s^0)}{P(y|X)}. \qquad (338)$$
It follows that, under our assumption that noise and parameters are drawn iid from the pdfs $P_\varepsilon$ and $P_s$ respectively, the posterior distribution takes the form
$$P(s^0|y, X) \propto \prod_{\mu=1}^{N} P_\varepsilon(y_\mu - \mathbf{x}_\mu \cdot s^0)\prod_{j=1}^{P} P_s(s^0_j). \qquad (339)$$
In the finite temperature regime, under the choice of unit temperature $\beta = 1$, $\sigma = -\log P_s$, and $\rho = -\log P_\varepsilon$, the energy distribution (4) yields a Gibbs distribution precisely equivalent to the posterior over the parameters:
$$P_G(s) = \frac{1}{Z(y, X)}\, e^{-\beta\left(\sum_{\mu=1}^{N}\rho(y_\mu - \mathbf{x}_\mu\cdot s) + \sum_{j=1}^{P}\sigma(s_j)\right)} \propto \prod_{\mu=1}^{N} P_\varepsilon(y_\mu - \mathbf{x}_\mu\cdot s)\prod_{j=1}^{P} P_s(s_j). \qquad (340)$$

It follows that averaging $s$ over the Gibbs distribution, $\langle s \rangle$, yields the MMSE estimator, and that $\langle\langle (\langle s \rangle - s^0)^2 \rangle\rangle$ is simply the MMSE. The finite temperature replica symmetric equations from Result 1 allow us to characterize the MMSE solution as a one-dimensional MMSE inference problem. Define
$$\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) = \frac{1}{P(\varepsilon_{q_s})}\int \varepsilon\, P_\varepsilon(\varepsilon)\, \frac{1}{\sqrt{2\pi q_s}}\, e^{-\frac{(\varepsilon - \varepsilon_{q_s})^2}{2 q_s}}\, d\varepsilon, \qquad (341)$$
$$\hat s_{\rm MMSE}(s^0_{q_d}) = \frac{1}{P(s^0_{q_d})}\int s\, P_s(s)\, \frac{1}{\sqrt{2\pi q_d}}\, e^{-\frac{(s - s^0_{q_d})^2}{2 q_d}}\, ds. \qquad (342)$$
In searching for solutions to the finite temperature RS relations in this setting, we make the ansatz $\Lambda_\rho = q_s$, $\Lambda_\sigma = q_d$, so that $P_\rho$ and $P_\sigma$ become the posterior distributions of scalar gaussian-corrupted noise and corrupted signal, $P(\varepsilon|\varepsilon_{q_s})$ and $P(s^0|s^0_{q_d})$ respectively. In this setting, the finite temperature RS relations thus reduce to
$$\frac{1}{q_d} = \frac{\alpha}{q_s^2}\left\langle\left\langle \left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon_{q_s}\right)^2 \right\rangle\right\rangle_{\varepsilon_{q_s}}, \qquad (343)$$
$$\frac{1}{q_d} = \frac{\alpha}{q_s^2}\left(q_s - q_\varepsilon^{\rm MMSE}(q_s)\right), \qquad (344)$$
$$q_s = q_s^{\rm MMSE}(q_d), \qquad (345)$$
$$q_s = q_s^{\rm MMSE}(q_d). \qquad (346)$$

We now demonstrate that (343) is equivalent to (344). To this end,
$$\left\langle\left\langle \left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon_{q_s}\right)^2 \right\rangle\right\rangle_{\varepsilon_{q_s}} = \left\langle\left\langle \left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon - \sqrt{q_s}\, z_s\right)^2 \right\rangle\right\rangle_{\varepsilon, z_s} \qquad (347)$$
$$= \left\langle\left\langle \left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon\right)^2 \right\rangle\right\rangle - 2\sqrt{q_s}\left\langle\left\langle z_s\left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon\right) \right\rangle\right\rangle_{z_s, \varepsilon} + q_s. \qquad (348)$$
From the definition of $q_\varepsilon^{\rm MMSE}(q_s)$, we can rewrite the previous expression as
$$\left\langle\left\langle \left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon_{q_s}\right)^2 \right\rangle\right\rangle_{\varepsilon_{q_s}} = q_\varepsilon^{\rm MMSE}(q_s) + q_s - 2\sqrt{q_s}\left\langle\left\langle z_s\left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon\right) \right\rangle\right\rangle_{z_s, \varepsilon}. \qquad (349)$$

To complete the derivation we simplify the final term in the previous expression, defining
$$A = \sqrt{q_s}\left\langle\left\langle z_s\left(\hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s}) - \varepsilon\right) \right\rangle\right\rangle_{z_s, \varepsilon} = \sqrt{q_s}\left\langle\left\langle z_s\int \varepsilon'\, \frac{P_\varepsilon(\varepsilon')}{P(\varepsilon_{q_s})}\, \frac{1}{\sqrt{2\pi q_s}}\, e^{-\frac{(\varepsilon' - \varepsilon - \sqrt{q_s} z_s)^2}{2 q_s}}\, d\varepsilon' \right\rangle\right\rangle_{z_s, \varepsilon}. \qquad (350)$$
Now, applying the identity $\langle\langle z f(z) \rangle\rangle = \langle\langle f'(z) \rangle\rangle$ for a unit gaussian variable $z$ yields
$$A = \left\langle\left\langle \int \varepsilon'\left(\varepsilon' - \varepsilon - \sqrt{q_s} z_s\right)P(\varepsilon'|\varepsilon_{q_s})\, d\varepsilon' \right\rangle\right\rangle_{z_s, \varepsilon} - \left\langle\left\langle \langle\varepsilon\rangle\left(\langle\varepsilon\rangle - \varepsilon - \sqrt{q_s} z_s\right) \right\rangle\right\rangle_{z_s, \varepsilon} = \left\langle\left\langle \langle\varepsilon^2\rangle - \varepsilon_{q_s}\langle\varepsilon\rangle - \left(\langle\varepsilon\rangle^2 - \langle\varepsilon\rangle\varepsilon_{q_s}\right) \right\rangle\right\rangle_{z_s, \varepsilon}, \qquad (351)$$
where $\langle\varepsilon\rangle = \int \varepsilon' P(\varepsilon'|\varepsilon_{q_s})\, d\varepsilon' = \hat\varepsilon_{\rm MMSE}(\varepsilon_{q_s})$. The expression for $A$ can thus be simplified to
$$A = \left\langle\left\langle \langle\varepsilon^2\rangle - \langle\varepsilon\rangle^2 \right\rangle\right\rangle_{z_s, \varepsilon} = q_\varepsilon^{\rm MMSE}(q_s). \qquad (352)$$


Substituting this result into (343) yields (344). By applying the relationship between Fisher information and scalar MMSE (388), the finite temperature RS relations yield
$$q_\varepsilon^{\rm MMSE}(q_s) = q_s - q_s^2\, J[\,\varepsilon_{q_s}\,], \qquad (353) \qquad q_s^{\rm MMSE}(q_d) = q_d - q_d^2\, J\!\left[\, s^0_{q_d} \,\right]. \qquad (354)$$
Substituting these forms into (344) and (345) implies
$$q_d = \frac{1}{\alpha J[\,\varepsilon_{q_s}\,]}, \qquad (355) \qquad q_s = q_s^{\rm MMSE}(q_d) = q_d - q_d^2\, J\!\left[\, s^0_{q_d} \,\right]. \qquad (356)$$
This set of self-consistency equations for optimal Bayesian inference in the thermodynamic limit is consistent with previous MMSE predictions in the literature, such as [5] (eqn. 12), which yields a set of self-consistency equations equivalent to ours under the choice of gaussian noise, in which case (355) reduces to
$$q_d = \frac{\sigma_\varepsilon^2 + q_s}{\alpha}. \qquad (357)$$
Remarkably, eqs. (355)-(356) exactly match the RS relations (252) for optimal M-estimation, implying that the M-estimators we have derived achieve MMSE performance when the RS approximation is exact. A sufficient condition for the validity of RS is log-concavity of the noise and signal distributions.

A Review of Single Parameter Inference

A.1 Single parameter M-estimation

Consider the problem of inferring a single parameter $s^0$ given measurements $y_\mu$ for $\mu = 1, \ldots, N$:
$$y_\mu = s^0 + \varepsilon_\mu, \qquad (358)$$
where the noise $\varepsilon_\mu$ is drawn iid from a probability density function $P_\varepsilon$. We apply an unregularized M-estimator to estimate $s^0$:
$$\hat s = \arg\min_s \sum_\mu \rho(y_\mu - s) = \arg\min_s \sum_\mu \rho\left(\varepsilon_\mu - (s - s^0)\right). \qquad (359)$$
Thus $\hat s$ is the solution to
$$\sum_\mu \rho'\left(\varepsilon_\mu - (\hat s - s^0)\right) = 0. \qquad (360)$$
For an unbiased M-estimator, in the large-$N$ limit $\hat s - s^0$ becomes small, so we can Taylor expand in this small quantity:
$$\sum_\mu \rho'(\varepsilon_\mu) - \sum_\mu \rho''(\varepsilon_\mu)(\hat s - s^0) = 0. \qquad (361)$$
Rearranging and squaring both sides,
$$\left(\sum_\mu \rho''(\varepsilon_\mu)\right)^2 (\hat s - s^0)^2 = \left(\sum_\mu \rho'(\varepsilon_\mu)\right)^2 \to \sum_\mu \rho'(\varepsilon_\mu)^2, \qquad (362)$$
where we neglect the off-diagonal terms in the final square: the terms in the sum are uncorrelated random variables, so the average of the off-diagonal terms vanishes. Thus, in the asymptotic large-$N$ limit, after averaging over the noise, we find
$$q_s = (\hat s - s^0)^2 = \frac{\sum_\mu \rho'(\varepsilon_\mu)^2}{\left(\sum_\mu \rho''(\varepsilon_\mu)\right)^2} \to \frac{\left\langle\left\langle \rho'(\varepsilon)^2 \right\rangle\right\rangle}{\langle\langle \rho''(\varepsilon) \rangle\rangle^2}, \qquad (363)$$
where we have defined $q_s$ to be the MSE of the signal estimate.

A.2 Bounds on single parameter M-estimation

We can use this result on the form (363) of the MSE of an M-estimator to derive a bound on the MSE of M-estimation. We begin with the expression in the denominator and apply integration by parts:
$$\left(\int \rho''(x)\, P_\varepsilon(x)\, dx\right)^2 = \left(\int \rho'(x)\, P_\varepsilon'(x)\, dx\right)^2 = \left(\int \left(\rho'(x)\sqrt{P_\varepsilon(x)}\right)\left(\frac{P_\varepsilon'(x)}{\sqrt{P_\varepsilon(x)}}\right) dx\right)^2 \le \int P_\varepsilon(x)\,\rho'(x)^2\, dx \int \frac{P_\varepsilon'(x)^2}{P_\varepsilon(x)}\, dx, \qquad (364)$$
where the inequality follows from Cauchy-Schwarz. The final integral is simply the Fisher information from (382). This bound turns out to be a special case of a more general bound, the Cramer-Rao bound, that applies to any unbiased inference problem. Thus, our analytic asymptotic expression (363) for the MSE is bounded as
$$q_s = \frac{\left\langle\left\langle \rho'(\varepsilon)^2 \right\rangle\right\rangle}{\langle\langle \rho''(\varepsilon) \rangle\rangle^2} \ge \frac{1}{\int \frac{P_\varepsilon'(x)^2}{P_\varepsilon(x)}\, dx} = \frac{1}{J[\,\varepsilon\,]}. \qquad (365)$$
The Cauchy-Schwarz inequality is saturated when the functions in the inner product (364) are proportional, so that
$$\rho'(x)\sqrt{P_\varepsilon(x)} \propto \frac{P_\varepsilon'(x)}{\sqrt{P_\varepsilon(x)}}. \qquad (366)$$
It follows that $\rho(x) = -\log P_\varepsilon(x)$ is the asymptotically optimal M-estimator, and it achieves the minimum MSE amongst all unbiased estimators.
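As a numerical illustration of (363) and the bound (365) (our sketch, not from the paper; the Huber loss, the unit Laplace noise, and the grid are assumptions chosen for illustration), the asymptotic variance $\langle\langle \rho'(\varepsilon)^2 \rangle\rangle / \langle\langle \rho''(\varepsilon) \rangle\rangle^2$ can be evaluated by numerical integration. For a Huber loss under unit Laplace noise, the ratio should approach the bound $1/J[\varepsilon] = 1$ as the Huber width shrinks (where the loss approaches $-\log P_\varepsilon$ up to constants) and the noise variance $2$ as the width grows (least squares).

```python
import numpy as np

x = np.linspace(-40.0, 40.0, 400001)
dx = x[1] - x[0]
p = 0.5 * np.exp(-np.abs(x))                    # unit Laplace noise density

for delta in [0.05, 0.5, 2.0, 10.0]:
    drho = np.clip(x, -delta, delta)            # Huber loss: rho'(x)
    ddrho = (np.abs(x) < delta).astype(float)   # Huber loss: rho''(x)
    num = np.sum(drho**2 * p) * dx              # <<rho'(eps)^2>>
    den = (np.sum(ddrho * p) * dx) ** 2         # <<rho''(eps)>>^2
    print(f"delta = {delta:5.2f}: asymptotic variance = {num / den:.3f} (bound 1.0)")
```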

A.3 Asymptotic analysis of single parameter Bayesian MMSE estimation

Suppose we are given $N$ noisy measurements, indexed by $\mu = 1, \ldots, N$, of an unknown signal $s^0$:
$$y_\mu = s^0 + \varepsilon_\mu. \qquad (367)$$
We would like to estimate $s^0$ from the measurements $y_\mu$. To this end we apply Bayes' rule to compute a posterior distribution
$$P(s|y) \propto \prod_{\mu=1}^{N} P_\varepsilon(y_\mu - s)\, P_s(s) = e^{\sum_{\mu=1}^{N}\log\left(P_\varepsilon(y_\mu - s)\right)}\, P_s(s) = e^{\sum_{\mu=1}^{N}\log\left(P_\varepsilon(\varepsilon_\mu - (s - s^0))\right)}\, P_s(s). \qquad (368)$$
For large $N$, the posterior distribution concentrates in the vicinity of $s^0$. Therefore we can expand the distribution around $s^0$, taking $s - s^0$ to be small:
$$\sum_{\mu=1}^{N}\log\left(P_\varepsilon(\varepsilon_\mu - (s - s^0))\right) \approx \mathrm{const} - \sum_{\mu=1}^{N}\left[\log P_\varepsilon\right]'(\varepsilon_\mu)\,(s - s^0) + \frac{1}{2}\sum_{\mu=1}^{N}\left[\log P_\varepsilon\right]''(\varepsilon_\mu)\,(s - s^0)^2. \qquad (369)$$
These Taylor coefficients are random variables, and as $N\to\infty$ the central limit theorem implies
$$\sum_{\mu=1}^{N}\left[\log P_\varepsilon\right]'(\varepsilon_\mu) \to \mathcal{N}(0, N J[\,\varepsilon\,]), \qquad (370) \qquad \sum_{\mu=1}^{N}\left[\log P_\varepsilon\right]''(\varepsilon_\mu) \to -N J[\,\varepsilon\,]. \qquad (371)$$
Thus, in the large-$N$ limit, the precise measurement vector $y$ in the posterior distribution (368) may be replaced with these random variables, so that
$$P(s|s^0, z) \propto P_s(s)\, e^{-\frac{(s - s^0 - z\sqrt{q_d})^2}{2 q_d}}, \qquad (372)$$
where $z$ is a unit gaussian random variable and we define $q_d = \frac{1}{N J[\,\varepsilon\,]}$, so that $q_d$ is the unbiased error from the classical Cramer-Rao bound, achieved by maximum likelihood in the large-$N$ limit. The MMSE estimator is then
$$\hat s_{\rm MMSE} = \int s\, p(s|s^0, z)\, ds. \qquad (373)$$
It also follows that the MMSE achievable given the data is simply
$$q_s = \left\langle\left\langle \int (s - \hat s_{\rm MMSE})^2\, p(s|s^0, z)\, ds \right\rangle\right\rangle_{s^0, z}. \qquad (374)$$

Using the relationship between MMSE and Fisher information in Appendix B.4, the MMSE of inferring a random variable $s^0$ contaminated by gaussian noise of variance $q_d$ may be written in terms of the Fisher information:
$$q_s = q_s^{\rm MMSE}(q_d) = q_d - q_d^2\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]. \qquad (375)$$
Then, the convolutional Fisher inequality (171) implies
$$\frac{1}{J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]} \ge \frac{1}{J[\,s^0\,]} + q_d, \qquad (376)$$


so that
$$q_s = q_d - q_d^2\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right] \ge q_d\left(1 - \frac{q_d}{\frac{1}{J[\,s^0\,]} + q_d}\right) = q_d\left(\frac{1}{1 + q_d\, J[\,s^0\,]}\right) = \frac{1}{\frac{1}{q_d} + J[\,s^0\,]} = \frac{1}{N J[\,\varepsilon\,] + J[\,s^0\,]}, \qquad (377)$$
yielding a Fisher information bound on asymptotic MMSE inference for a single parameter:
$$q_s \ge \frac{1}{N J[\,\varepsilon\,] + J[\,s^0\,]}. \qquad (378)$$

Special case of gaussian noise and parameters:
In the case of gaussian noise and parameters ($\varepsilon \sim \mathcal{N}(0, \sigma_\varepsilon^2)$, $s^0 \sim \mathcal{N}(0, \sigma_s^2)$), (378) holds with equality for all $N$ (not just in the asymptotic limit):
$$q_s = \frac{1}{N J[\,\varepsilon\,] + J[\,s^0\,]}. \qquad (379)$$
Note that $J[\,\varepsilon\,] = \frac{1}{\sigma_\varepsilon^2}$, $J[\,s^0\,] = \frac{1}{\sigma_s^2}$, and that we need not consider derivatives higher than second order in $\log(P_\varepsilon(x))$, so we need only compute the two terms (370) and (371). In both cases the limiting relations hold exactly, since a sum of gaussian random variables is a gaussian random variable; because we do not need to invoke the central limit theorem, this result holds for any $N$. Further, the convolutional Fisher inequality (376) becomes an equality in the gaussian case, which follows from the fact that the Fisher information of a gaussian is its inverse variance:
$$\frac{1}{J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]} = \frac{1}{J\!\left[\, \sqrt{q_d + \sigma_s^2}\, z \,\right]} = q_d + \sigma_s^2 = q_d + \frac{1}{J[\,s^0\,]}. \qquad (380)$$
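A small Monte Carlo check of (379) (our sketch; $N$, the variances, and the number of trials are illustrative assumptions): for gaussian noise and a gaussian prior, the posterior mean from $N$ scalar measurements has MSE $1/(N/\sigma_\varepsilon^2 + 1/\sigma_s^2)$ exactly, for any $N$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, sigma_eps2, sigma_s2, trials = 5, 0.5, 1.0, 200000

s0 = rng.normal(0.0, np.sqrt(sigma_s2), size=trials)
y = s0[:, None] + rng.normal(0.0, np.sqrt(sigma_eps2), size=(trials, N))

# Posterior mean for a gaussian prior and gaussian noise (conjugate update):
post_prec = N / sigma_eps2 + 1.0 / sigma_s2
s_hat = (y.sum(axis=1) / sigma_eps2) / post_prec

print("Monte Carlo MSE            :", np.mean((s_hat - s0) ** 2))
print("theory 1/(N J[eps]+J[s0])  :", 1.0 / post_prec)
```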

B Fisher Information

B.1 Fisher information for scalar inference

The scalar Fisher information $J[\,\varepsilon\,]$ quantifies the accuracy of any unbiased estimator in reconstructing a signal corrupted by additive noise $\varepsilon$ drawn from a probability density function $P_\varepsilon$. The problem is to estimate the scalar $s^0$ given a corrupted measurement $y$:
$$y = s^0 + \varepsilon. \qquad (381)$$
The definition of the Fisher information in this scalar case is
$$J[\,\varepsilon\,] = \left\langle\left\langle \left(\frac{d}{ds^0}\log p(y|s^0)\right)^2 \right\rangle\right\rangle_\varepsilon = \int \frac{P_\varepsilon'(z)^2}{P_\varepsilon(z)}\, dz. \qquad (382)$$

B.2 Convolutional Fisher inequality

The scalar Fisher information obeys the following inequality:
$$J[\,\varepsilon_1 + \varepsilon_2\,] \le \frac{J[\,\varepsilon_1\,]\, J[\,\varepsilon_2\,]}{J[\,\varepsilon_1\,] + J[\,\varepsilon_2\,]}, \qquad (383)$$
where $\varepsilon_1$, $\varepsilon_2$ are drawn from separate probability densities. See [6], p. 674.

B.3 Fisher information is minimized by a gaussian distribution

For a noise distribution $P_\varepsilon$, the Fisher information is bounded below as
$$J[\,\varepsilon\,] \ge \frac{1}{\sigma_\varepsilon^2}, \qquad (384)$$
and for a fixed variance $\sigma_\varepsilon^2$ it is minimized by a gaussian distribution. This result follows from the inequality
$$1 = \left(\int P_\varepsilon(x)\left(\frac{(x - \langle\varepsilon\rangle)\, P_\varepsilon'(x)}{P_\varepsilon(x)}\right) dx\right)^2 \le \int (x - \langle\varepsilon\rangle)^2 P_\varepsilon(x)\, dx \int \frac{P_\varepsilon'(x)^2}{P_\varepsilon(x)}\, dx = \sigma_\varepsilon^2\, J[\,\varepsilon\,], \qquad (385)$$
where the leftmost equality follows from integration by parts, the inequality follows from Cauchy-Schwarz, and the Fisher information appears in the final equality by substituting its definition (382). This proves the desired bound. Moreover, Cauchy-Schwarz implies that the bound is saturated when
$$\frac{P_\varepsilon'(x)}{P_\varepsilon(x)} \propto \pm(x - \langle\varepsilon\rangle). \qquad (386)$$
Integrating both sides and choosing appropriate constants such that $P_\varepsilon$ is a probability distribution yields
$$P_\varepsilon(x) \propto e^{-\frac{(x - \langle\varepsilon\rangle)^2}{2\sigma_\varepsilon^2}}, \qquad (387)$$
so the bound is saturated only for a gaussian $P_\varepsilon$.
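The inequality $\sigma_\varepsilon^2 J[\varepsilon] \ge 1$ is easy to probe numerically; the sketch below (ours, with a grid and a choice of densities assumed for illustration) computes $\sigma^2 J$ for a few standard densities, with equality holding only in the gaussian case.

```python
import numpy as np

x = np.linspace(-30.0, 30.0, 600001)
dx = x[1] - x[0]

densities = {
    "gaussian": np.exp(-x**2 / 2) / np.sqrt(2 * np.pi),
    "laplace":  0.5 * np.exp(-np.abs(x)),
    "logistic": np.exp(-x) / (1 + np.exp(-x))**2,
}

for name, p in densities.items():
    p = p / (np.sum(p) * dx)                                       # renormalize on the grid
    var = np.sum(x**2 * p) * dx - (np.sum(x * p) * dx)**2
    J = np.sum(np.gradient(p, dx)**2 / np.maximum(p, 1e-300)) * dx  # J = int p'^2 / p
    print(f"{name:9s}: sigma^2 * J = {var * J:.4f}")
```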

B.4 Relation between MMSE and Fisher information

The MMSE of estimating a single random variable $s^0$ from measurements corrupted by gaussian noise of variance $q_d$ can be written in terms of the Fisher information as
$$q_s^{\rm MMSE}(q_d) = q_d - q_d^2\, J\!\left[\, s^0 + \sqrt{q_d}\, z \,\right]. \qquad (388)$$
See e.g. [7], equation 28, p. 38.
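Relation (388) can be checked numerically on a grid; the sketch below is ours (a Laplace prior, a particular $q_d$, and grid sizes are assumptions made for illustration). It computes the left-hand side as the mean posterior variance and the right-hand side from the Fisher information of the smoothed density.

```python
import numpy as np

q_d = 0.7
s = np.linspace(-25.0, 25.0, 3001)
ds = s[1] - s[0]
prior = 0.5 * np.exp(-np.abs(s))                          # P_s: Laplace(0, 1)

# Smoothed density g = P_s * N(0, q_d) and its Fisher information J[s0 + sqrt(q_d) z].
kern = np.exp(-s**2 / (2 * q_d)) / np.sqrt(2 * np.pi * q_d)
g = np.convolve(prior, kern, mode="same") * ds
J = np.sum(np.gradient(g, ds)**2 / np.maximum(g, 1e-300)) * ds
rhs = q_d - q_d**2 * J                                    # right-hand side of eq. (388)

# Left-hand side: MMSE = E_t[ Var(s | t) ] with posterior(s|t) prop to P_s(s) N(t - s; 0, q_d).
mask = np.abs(s) < 15.0                                   # keep t away from grid edges
t = s[mask]
post = prior[None, :] * np.exp(-(t[:, None] - s[None, :])**2 / (2 * q_d))
post /= post.sum(axis=1, keepdims=True)
post_mean = post @ s
post_var = post @ s**2 - post_mean**2
lhs = np.sum(post_var * g[mask]) * ds / (np.sum(g[mask]) * ds)

print("MMSE (mean posterior variance) :", lhs)
print("q_d - q_d^2 J[s0 + sqrt(q_d)z] :", rhs)
```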

B.5 Cramer-Rao bound

For the case of multiple parameters, the Cramer-Rao bound is written in terms of the Fisher information matrix. We state this bound here and derive the form of the Fisher information matrix for the problem considered in this work.

For any estimator $\hat s$ of the true parameters $s^0$ given data $z$, where the conditional probability of data given parameters is $p(z|s^0)$, we define the Fisher information matrix as
$$I_{ij} = \left\langle\left\langle \frac{\partial}{\partial s^0_i}\log p(z|s^0)\, \frac{\partial}{\partial s^0_j}\log p(z|s^0) \right\rangle\right\rangle_z. \qquad (389)$$
The Cramer-Rao bound states that, for any unbiased inference algorithm with estimates $\hat s$,
$$\left\langle\left\langle (\hat s_k - s^0_k)^2 \right\rangle\right\rangle_z \ge I^{-1}_{kk}. \qquad (390)$$
See [6] for a derivation of the Cramer-Rao bound.

Applying Cramer-Rao to the problem considered in this paper:
We first derive the form of the Fisher information matrix $I$ for the system studied in this paper:
$$y = X s^0 + \varepsilon, \qquad (391)$$
where $\varepsilon_\mu \sim f$ iid, $y \in \mathbb{R}^N$, $s^0 \in \mathbb{R}^P$, and $X \in \mathbb{R}^{N\times P}$. We apply the definition of the Fisher information matrix (389) to derive its form:
$$I_{jk} = \left\langle\left\langle \frac{\partial}{\partial s_j}\log p(X, y|s)\, \frac{\partial}{\partial s_k}\log p(X, y|s) \right\rangle\right\rangle_{X, y}. \qquad (392)$$
In this problem,
$$\log p(X, y|s) = \log p(y|X, s) + \log p(X) = \sum_{\mu=1}^{N}\log f(y_\mu - X_\mu\cdot s) + \log p(X), \qquad (393)$$
where $X_\mu$ is the $\mu$th row of the matrix $X$. Differentiating (393) and substituting into (392) yields
$$I_{jk} = \left\langle\left\langle \sum_{\mu=1}^{N} X_{\mu j} X_{\mu k}\left(\frac{f'(\varepsilon_\mu)}{f(\varepsilon_\mu)}\right)^2 \right\rangle\right\rangle_{X, \varepsilon}. \qquad (394)$$
For $j \ne k$ the entries of the input matrix are uncorrelated, so $I_{jk} = 0$. For $j = k$,
$$I_{jj} = \alpha\int dx\, \frac{f'(x)^2}{f(x)}. \qquad (395)$$
Thus, the Fisher information matrix has no dependence on $s^0$ and has the form:


$$I_{ij} = \delta_{ij}\, \alpha\int \frac{(f'(z))^2}{f(z)}\, dz = \delta_{ij}\, \alpha\, J[\,\varepsilon\,]. \qquad (396)$$
Substituting the Fisher information matrix into the Cramer-Rao bound yields the variance per parameter:
$$q_s \ge \frac{1}{\alpha J[\,\varepsilon\,]}. \qquad (397)$$
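A small simulation (ours, with problem sizes and a gaussian noise level chosen for illustration) makes (397) concrete. We draw $X$ with iid $\mathcal{N}(0, 1/P)$ entries, a scaling consistent with (395), and compare the per-parameter MSE of least squares (the maximum-likelihood estimator for gaussian noise) against the bound $1/(\alpha J[\varepsilon]) = \sigma_\varepsilon^2/\alpha$ and the finite-$\alpha$ least-squares value $\sigma_\varepsilon^2/(\alpha-1)$ quoted earlier.

```python
import numpy as np

rng = np.random.default_rng(1)
P, alpha, sigma_eps = 500, 4.0, 1.0
N = int(alpha * P)

X = rng.normal(0.0, 1.0 / np.sqrt(P), size=(N, P))
s0 = rng.normal(0.0, 1.0, size=P)
y = X @ s0 + rng.normal(0.0, sigma_eps, size=N)

s_hat, *_ = np.linalg.lstsq(X, y, rcond=None)   # least squares = ML for gaussian noise
mse = np.mean((s_hat - s0) ** 2)

J_eps = 1.0 / sigma_eps**2                      # gaussian noise: J = 1 / sigma_eps^2
print("least-squares MSE per parameter :", mse)
print("Cramer-Rao bound 1/(alpha J)    :", 1.0 / (alpha * J_eps))
print("least-squares theory s^2/(a-1)  :", sigma_eps**2 / (alpha - 1.0))
```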

C Moreau envelope and proximal map

C.1 Relation between proximal map and Moreau envelope

The Moreau envelope is a functional mapping parameterized by a regularization scalar $\lambda$. It maps a function $f$ to
$$\mathcal{M}_\lambda[\,f\,](x) = \min_y\left[\frac{(x-y)^2}{2\lambda} + f(y)\right]. \qquad (398)$$
We denote the special case $\mathcal{M}_1[\,f\,]$ by $\mathcal{M}[\,f\,]$. Two properties that follow from this definition are that the minimizers of $f$ and $\mathcal{M}_\lambda[\,f\,]$ coincide, and that the Moreau envelope is a lower bound on the function $f$. The closely related proximal map is defined as
$$\mathcal{P}_\lambda[\,f\,](x) = \arg\min_y\left[\frac{(x-y)^2}{2\lambda} + f(y)\right]. \qquad (399)$$
The proximal map can be viewed as a gradient descent step along the Moreau envelope:
$$\mathcal{P}_\lambda[\,f\,](x) - x = -\lambda\, \mathcal{M}_\lambda[\,f\,]'(x). \qquad (400)$$
To derive (400), we begin with the derivative of the Moreau envelope and show
$$\mathcal{M}_\lambda[\,f\,]'(x) = \frac{d}{dx}\min_y\left[\frac{(x-y)^2}{2\lambda} + f(y)\right] = \frac{d}{dx}\left[\frac{(x-\bar y)^2}{2\lambda} + f(\bar y)\right], \qquad (401)$$
where $\bar y$ is the minimizer of the argument on the right-hand side of (401). Differentiating with respect to $\bar y$ yields $0$ at the minimum, so $\bar y$ may effectively be treated as a constant and we need only differentiate with respect to $x$, which yields
$$\mathcal{M}_\lambda[\,f\,]'(x) = \frac{x - \bar y}{\lambda}. \qquad (402)$$
It follows that
$$\mathcal{M}_\lambda[\,f\,]'(x) = \frac{x - \mathcal{P}_\lambda[\,f\,](x)}{\lambda}. \qquad (403)$$
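For a concrete instance of these definitions (our sketch; the choice $f(x) = |x|$ and the grids are assumptions for illustration), the Moreau envelope of the absolute value is the Huber function and its proximal map is soft thresholding, and the identity (403) can be checked by finite differences.

```python
import numpy as np

lam = 0.8
x = np.linspace(-6.0, 6.0, 1201)
y = np.linspace(-10.0, 10.0, 4001)

obj = (x[:, None] - y[None, :])**2 / (2 * lam) + np.abs(y)[None, :]
M = obj.min(axis=1)                       # Moreau envelope M_lam[|.|](x)
prox = y[obj.argmin(axis=1)]              # proximal map P_lam[|.|](x)

soft = np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)                 # known prox: soft threshold
huber = np.where(np.abs(x) <= lam, x**2 / (2 * lam), np.abs(x) - lam / 2)  # known envelope

print("max |prox - soft threshold| :", np.max(np.abs(prox - soft)))
print("max |envelope - Huber|      :", np.max(np.abs(M - huber)))

# Identity (403): d/dx M_lam[f](x) = (x - P_lam[f](x)) / lam
dM = np.gradient(M, x[1] - x[0])
print("max identity residual       :", np.max(np.abs(dM - (x - prox) / lam)))
```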

C.2 Inverse of the Moreau envelope

Result 23. Procedure for inverting a Moreau envelope: For $\lambda > 0$ and $f$ a convex, lower semi-continuous function such that $\mathcal{M}_\lambda[\,f\,] = g$, the Moreau envelope can be inverted, so that $f = -\mathcal{M}_\lambda[\,-g\,]$.

To derive this result, we first consider the case $\lambda = 1$, from which the $\lambda > 0$ case will follow. Our assumption that $\mathcal{M}_\lambda[\,f\,] = g$ implies
$$g(x) = \mathcal{M}[\,f\,](x) = \min_y\left[\frac{(x-y)^2}{2} + f(y)\right] = \frac{x^2}{2} + \min_y\left[-xy + \frac{y^2}{2} + f(y)\right] = \frac{x^2}{2} - \max_y\left[xy - \frac{y^2}{2} - f(y)\right]. \qquad (404)$$
We now define the Fenchel conjugate $^*$, which operates on a function $h$ to yield $h^*(x) = \max_y\left[xy - h(y)\right]$. For notational simplicity, we also define the function $p_2(x) = \frac{x^2}{2}$. With this notation, (404) reduces to
$$g = p_2 - (f + p_2)^*. \qquad (405)$$
The Fenchel-Moreau theorem [8] states that if $h$ is a convex and lower semi-continuous function, then $h = (h^*)^*$. These properties are assumed true for $f$ and will also hold for $f + p_2$, so that (405) may be inverted, yielding
$$f = (p_2 - g)^* - p_2. \qquad (406)$$


We now write $f$ in terms of a Moreau envelope by expanding the previous expression:
$$f(x) = -\frac{x^2}{2} + \max_y\left[xy - \frac{y^2}{2} + g(y)\right] = -\min_y\left[\frac{(x-y)^2}{2} - g(y)\right] = -\mathcal{M}[\,-g\,](x). \qquad (407)$$
Thus, $\mathcal{M}[\,f\,] = g$ implies $f = -\mathcal{M}[\,-g\,]$. To extend this to $\lambda \ne 1$, we use the identity
$$\lambda\, \mathcal{M}_\lambda\!\left[\tfrac{1}{\lambda} f\right] = \mathcal{M}[\,f\,], \qquad (408)$$
which can be verified by substitution into the definition of the Moreau envelope (398). Combining the result that $\mathcal{M}[\,f\,] = g$ implies $f = -\mathcal{M}[\,-g\,]$ with (408) yields that
$$\mathcal{M}_\lambda\!\left[\tfrac{1}{\lambda} f\right] = \frac{1}{\lambda}\, g \qquad (409)$$
also implies
$$\frac{1}{\lambda} f = -\mathcal{M}_\lambda\!\left[-\tfrac{1}{\lambda} g\right], \qquad (410)$$
which completes the derivation, since the factor $\frac{1}{\lambda}$ may be absorbed into the definitions of $f$ and $g$.
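Result 23 can also be verified numerically; the sketch below (ours, with $f(x) = |x|$, $\lambda = 0.5$, and grid sizes chosen for illustration) computes $g = \mathcal{M}_\lambda[f]$ on a grid and checks that $-\mathcal{M}_\lambda[-g]$ recovers $f$ away from the grid edges.

```python
import numpy as np

lam = 0.5
y = np.linspace(-20.0, 20.0, 2001)
f = np.abs(y)
pen = (y[:, None] - y[None, :])**2 / (2 * lam)

g = np.min(f[None, :] + pen, axis=1)            # g = M_lam[f]
f_rec = -np.min(-g[None, :] + pen, axis=1)      # -M_lam[-g], should recover f

window = np.abs(y) < 10.0                       # avoid grid-edge effects
print("max |f_rec - f| on window:", np.max(np.abs(f_rec[window] - f[window])))
```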

C.3 Conditions for convexity of optimal M-estimators

A sufficient condition guaranteeing the convexity of an optimal regularized M-estimator is log-concavity of the signal and noise distributions $P_s$ and $P_\varepsilon$. Here we explain why this is the case. The main reason convexity matters is that the optimal loss and regularizer should be convex to guarantee the validity of the RS approximation in the zero temperature replica theory for M-estimation. First, we note that the functional relationship between the optimal loss (or regularizer) and the noise (or signal) distribution arises through the relation
$$h_{\rm opt} = -\mathcal{M}_q[\,\log f_q\,](x) = (q\log f_q + p_2)^* - p_2, \qquad (411)$$
where the function $f$ denotes the noise (or signal) distribution, $h_{\rm opt}$ denotes the optimal loss (or regularizer), and $q$ denotes the level of gaussian smoothing. The question we must ask is: under what conditions on $f$ is $h_{\rm opt}$ convex?

The basic idea is that if $f$ is a log-concave distribution, then $f_q$ is the convolution of a pair of log-concave distributions, which is itself log-concave; i.e. $\log f_q$ is a concave function. Now the sum $q\log f_q + p_2$ adds a concave function to the convex function $p_2(x) = \frac{1}{2}x^2$; it can nevertheless be shown that this sum is convex for any $q$. Moreover, this sum is in a sense less convex than $p_2$ itself (a function $a$ is more convex than $b$ if $a - b$ is convex). Furthermore, the Fenchel conjugate of a function less convex than $p_2$ is more convex than $p_2$, so $(q\log f_q + p_2)^*$ is more convex than $p_2$, and subtracting the convex function $p_2$ from it does not destroy its convexity. Therefore $h_{\rm opt}$ is convex. See [2] for a more detailed derivation of this result. Finally, we stress that log-concavity of the signal and noise is a sufficient condition for the validity of our optimal M-estimators, but it may not be necessary.

References

[1] Madhu Advani, Subhaneil Lahiri, and Surya Ganguli. Statistical mechanics of complex neural systems and high dimensional data. Journal of Statistical Mechanics: Theory and Experiment, 2013(03):P03014, 2013.

[2] D. Bean, P. J. Bickel, N. El Karoui, and B. Yu. Optimal M-estimation in high-dimensional regression. PNAS, 110(36):14563-8, 2013.

[3] S. Ganguli and H. Sompolinsky. Statistical mechanics of compressed sensing. Physical Review Letters, 104(18):188701, May 2010.

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.

[5] D. Guo, D. Baron, and S. Shamai. A single-letter characterization of optimal noisy compressed sensing. In 47th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 52-59. IEEE, 2009.

[6] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley and Sons, 2012.

[7] O. Rioul. Information theoretic proofs of entropy power inequalities. IEEE Transactions on Information Theory, 57(1):33-55, 2011.

[8] Shozo Koshi and Naoto Komuro. A generalization of the Fenchel-Moreau theorem. Proceedings of the Japan Academy, Series A, Mathematical Sciences, 59(5):178-181, 1983.
