much more on minimax (order bounds)
http://www-stat.stanford.edu/~imj/wald/wald1web.pdf
cf. lecture by Iain Johnstone
today’s lecture
• parametric estimation, Fisher information, Cramer-Rao lower bound: Ch. 4, Sec. 9.3
• information and estimation: Ch. 7
• universal denoising: Ch. 8
• (chapters and sections from new version of notes)
mean squared error estimation
bias-variance
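The body of this slide is not in the extracted text. As an added illustration (mine, not from the slides), the bias-variance decomposition MSE = bias² + variance can be checked numerically for a shrinkage estimator c·X̄ of a Gaussian mean; the factor c and all parameter values below are arbitrary choices.

# Added sketch (not from the slides): MSE = bias^2 + variance,
# checked by Monte Carlo for the shrinkage estimator c * Xbar of a Gaussian mean.
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n, c = 1.0, 1.0, 10, 0.8                      # arbitrary illustrative values

xbar = rng.normal(theta, sigma / np.sqrt(n), size=10**6)    # sample means over many repetitions
est = c * xbar                                              # shrinkage estimator c * Xbar

mse = np.mean((est - theta) ** 2)
bias2 = (np.mean(est) - theta) ** 2
var = np.var(est)
print(mse, bias2 + var)   # the two numbers agree up to Monte Carlo error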
Fisher Information
exercise:
exercise
note
• the r.h.s. depends on the estimator
• far from tight: consider the estimator that is identically 0
note:
multi-parameter case
Fisher information for a “location family”
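The slide content is not in the extracted text. As an added numerical sketch (mine, not from the slides): for a location family {f(x − θ)}, the Fisher information I(θ) = ∫ (f′(x))²/f(x) dx does not depend on θ. The standard Cauchy density is used below purely as an example, for which I(θ) = 1/2.

# Added sketch (not from the slides): Fisher information of a location family,
# I(theta) = ∫ (f'(x))^2 / f(x) dx, evaluated numerically for a standard Cauchy f.
import numpy as np

x = np.linspace(-200.0, 200.0, 400_001)
f = 1.0 / (np.pi * (1.0 + x**2))                 # standard Cauchy density
fprime = -2.0 * x / (np.pi * (1.0 + x**2) ** 2)  # its derivative, in closed form

integrand = fprime**2 / f
fisher = np.sum(0.5 * (integrand[1:] + integrand[:-1]) * np.diff(x))  # trapezoid rule
print(fisher)   # ≈ 0.5, independent of the (suppressed) location parameter theta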
Fisher Information and MMSE
recall
5 Notation and Conventions
Our conventions and notation for information measures, such as mutual information and relative entropy, are standard. The initiated reader is advised to skip this section. If U, V, W are three random variables taking values in Polish spaces 𝒰, 𝒱, 𝒲, respectively, and defined on a common probability space with a probability measure P, we let P_U, P_{U,V}, etc. denote the probability measures induced on 𝒰, the pair (𝒰, 𝒱), etc., while, e.g., P_{U|V} denotes a regular version of the conditional distribution of U given V. P_{U|v} is the distribution on 𝒰 obtained by evaluating that regular version at v. If Q is another probability measure on the same measurable space we similarly denote Q_U, Q_{U|V}, etc.

As usual, given two measures on the same measurable space, e.g., P and Q, define their relative entropy (divergence) by

D(P‖Q) = ∫ log(dP/dQ) dP        (12)

when P is absolutely continuous w.r.t. Q, defining D(P‖Q) = ∞ otherwise. An immediate consequence of the definitions of relative entropy and of the Radon-Nikodym derivative is that if f : 𝒰 → 𝒱 is measurable and one-to-one, and V = f(U), then

D(P_U‖Q_U) = D(P_V‖Q_V).        (13)

Following [5], we further use the notation

D(P_{U|V}‖Q_{U|V} | P_V) = ∫ D(P_{U|v}‖Q_{U|v}) dP_V(v),        (14)

where on the right side D(P_{U|v}‖Q_{U|v}) is a divergence in the sense of (12) between the measures P_{U|v} and Q_{U|v}. It will be convenient to write

D(P_{U|V}‖Q_{U|V})        (15)

to denote f(V) when f(v) = D(P_{U|v}‖Q_{U|v}). Thus D(P_{U|V}‖Q_{U|V}) is a random variable while D(P_{U|V}‖Q_{U|V} | P_V) is its expectation under P. With this notation, the chain rule for relative entropy (cf., e.g., [6, Subsection D.3]) is

D(P_{U,V}‖Q_{U,V}) = D(P_U‖Q_U) + D(P_{V|U}‖Q_{V|U} | P_U)        (16)

and is valid regardless of the finiteness of both sides of the equation.

The mutual information between U and V is defined as

I(U; V) = D(P_{U,V}‖P_U × P_V),        (17)

where P_U × P_V denotes the product measure induced by P_U and P_V. We note in passing, in line with the comment on relative entropy and one-to-one transformations leading to (13), that if f and g are two measurable one-to-one transformations and A = f(U) while B = g(V), then

I(U; V) = I(A; B).        (18)

Finally, the conditional mutual information between U and V, given W, is defined as

I(U; V | W) = D(P_{U,V|W}‖P_{U|W} × P_{V|W} | P_W).        (19)

The roles of U, V, W will be played in what follows by scalar random variables, vectors, or processes.
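For concreteness, here is a small added sketch (not part of the notes) of the quantities defined above on finite alphabets: D(P‖Q) as in (12), the conditional divergence as in (14), and a numerical check of the chain rule (16) on an arbitrary joint distribution.

# Added sketch (not from the notes): definitions (12), (14) and the chain rule (16)
# for finite alphabets, in nats.
import numpy as np

def kl(p, q):
    """Relative entropy D(p || q) for finite distributions p, q (nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

rng = np.random.default_rng(1)
P_UV = rng.random((3, 4)); P_UV /= P_UV.sum()    # an arbitrary joint law P of (U, V)
Q_UV = rng.random((3, 4)); Q_UV /= Q_UV.sum()    # an arbitrary joint law Q of (U, V)

P_U, Q_U = P_UV.sum(axis=1), Q_UV.sum(axis=1)    # marginals of U under P and Q
# conditional divergence D(P_{V|U} || Q_{V|U} | P_U), as in (14)
cond = sum(P_U[u] * kl(P_UV[u] / P_U[u], Q_UV[u] / Q_U[u]) for u in range(P_UV.shape[0]))

# chain rule (16): D(P_{U,V} || Q_{U,V}) = D(P_U || Q_U) + D(P_{V|U} || Q_{V|U} | P_U)
print(kl(P_UV.ravel(), Q_UV.ravel()), kl(P_U, Q_U) + cond)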
6 Relative Entropy and Mismatched Estimation
6.1 For slides
• Scalar Channel:
X ≥ 0
Y_γ | X ∼ Poisson(γ · X)
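A small added sketch (not in the notes) of the mutual information over this scalar Poisson channel, computed by direct summation for a binary input; the particular input law and SNR values are arbitrary choices, the result is in nats, and scipy is used only for the Poisson pmf.

# Added sketch (not from the notes): I(X; Y_γ) for Y_γ | X ~ Poisson(γ·X), binary input.
import numpy as np
from scipy.stats import poisson

def poisson_channel_mi(x_vals, p_x, gamma, y_max=500):
    y = np.arange(y_max + 1)
    # p(y | x) for each input value, then the output marginal p(y)
    p_y_given_x = np.array([poisson.pmf(y, gamma * x) for x in x_vals])
    p_y = p_x @ p_y_given_x
    # I(X; Y) = sum_{x,y} p(x) p(y|x) log[ p(y|x) / p(y) ]   (nats)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.where(p_y_given_x > 0, p_y_given_x / p_y, 1.0)
        terms = p_x[:, None] * p_y_given_x * np.log(ratio)
    return terms.sum()

x_vals = np.array([0.2, 2.0])     # arbitrary binary input alphabet
p_x = np.array([0.5, 0.5])        # arbitrary input law
for g in (0.5, 2.0, 10.0):
    print(g, poisson_channel_mi(x_vals, p_x, g))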
mutual information and MMSE
2 for GSV slide

Y = √γ · X + W,   where W is a standard Gaussian, independent of X

I(γ) = I(X; Y)

mmse(γ) = E[(X − E[X | Y])²]

[Guo, Shamai and Verdu 2005]:

(d/dγ) I(γ) = (1/2) · mmse(γ)

3 Introduction

In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information between the input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR), is equal to half the minimum mean square error (MMSE) in estimating the input based on the output. This simple relationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as to the continuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings where this relationship holds). When combined with Duncan's theorem [7], it was also shown to imply a remarkable relationship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributed continuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean value of the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual information to both types of MMSE thus served as a bridge between the two quantities.

More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimator that would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due to the mismatch is equal to twice the relative entropy between the true channel output distribution and the channel output distribution under Q, at SNR = γ.

This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEs continues to hold also in the mismatched case, i.e., when the filters are optimized for an underlying signal distribution that differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown to be the sum of the mutual information and the relative entropy between the true and mismatched output distributions, this relative entropy thus quantifying the penalty due to mismatch.

Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, the input, is a non-negative random variable while the conditional distribution of the output Y given the input is given by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, the channel input is X^T = {X_t, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on X^T, the output Y^T = {Y_t, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ · X^T. Often referred to as the "ideal Poisson channel" [19], this model is the canonical one for describing direct detection optical communication: the channel input represents the squared magnitude of the electric field incident on the photo-detector, while its output is the counting process describing the arrival times of the photons registered by the detector. Here the energy of the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel.
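A minimal added check (mine, not from the notes) of the GSV relation in a case where both sides are available in closed form, a Gaussian input X ∼ N(0, σ²), for which I(γ) = ½ log(1 + γσ²) and mmse(γ) = σ²/(1 + γσ²); the Monte Carlo part re-estimates mmse(γ) directly from the channel.

# Added sketch (not from the notes): dI/dγ = mmse(γ)/2 for a Gaussian input.
import numpy as np

sigma2 = 2.0                                   # input variance, arbitrary choice
I = lambda g: 0.5 * np.log1p(g * sigma2)
mmse = lambda g: sigma2 / (1.0 + g * sigma2)

g = np.linspace(0.1, 5.0, 50)
h = 1e-5
dI = (I(g + h) - I(g - h)) / (2 * h)           # numerical derivative of I(γ)
print(np.max(np.abs(dI - 0.5 * mmse(g))))      # ≈ 0

# Monte Carlo re-estimate of mmse(γ) directly from Y = √γ·X + W:
rng = np.random.default_rng(0)
gamma = 1.7
X = rng.normal(0.0, np.sqrt(sigma2), size=10**6)
Y = np.sqrt(gamma) * X + rng.normal(size=X.size)
Xhat = (np.sqrt(gamma) * sigma2 / (1.0 + gamma * sigma2)) * Y   # E[X|Y] for Gaussian X
print(np.mean((X - Xhat) ** 2), mmse(gamma))                    # ≈ equal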
(follows from J-MMSE and De-Bruijn)
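A compact way to see this (a sketch added here, not spelled out in the extracted slides; it assumes the standard normalization W ∼ N(0,1) and natural logarithms), combining the relation between Fisher information J and MMSE with de Bruijn's identity:

\begin{align*}
Y_\gamma &= \sqrt{\gamma}\,X + W \;=\; \sqrt{\gamma}\,\bigl(X + \sqrt{t}\,W\bigr), \qquad t = 1/\gamma,\quad W \sim \mathcal{N}(0,1),\\
J\bigl(X+\sqrt{t}\,W\bigr) &= \tfrac{1}{t}\Bigl(1 - \tfrac{\mathrm{mmse}(\gamma)}{t}\Bigr) \;=\; \gamma\bigl(1-\gamma\,\mathrm{mmse}(\gamma)\bigr) \qquad \text{(J--MMSE relation)},\\
\tfrac{d}{dt}\,h\bigl(X+\sqrt{t}\,W\bigr) &= \tfrac{1}{2}\,J\bigl(X+\sqrt{t}\,W\bigr) \qquad \text{(de Bruijn's identity)},\\
I(\gamma) &= h(Y_\gamma) - h(W) = h\bigl(X+\sqrt{t}\,W\bigr) + \tfrac{1}{2}\log\gamma - h(W),\\
\tfrac{d}{d\gamma}\,I(\gamma) &= -\tfrac{1}{\gamma^{2}}\cdot\tfrac{1}{2}\,J\bigl(X+\sqrt{t}\,W\bigr) + \tfrac{1}{2\gamma} \;=\; \tfrac{1}{2}\,\mathrm{mmse}(\gamma).
\end{align*}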
continuous time
1 for Duncan slide

AWGN channel:

dY_t = X_t dt + dW_t,   0 ≤ t ≤ T

W is standard white Gaussian noise, independent of X.

[Duncan 1970]:

I(X^T; Y^T) = (1/2) E[ ∫_0^T (X_t − E[X_t | Y^t])² dt ]

With γ playing the role of SNR:

dY_t = √γ X_t dt + dW_t,   0 ≤ t ≤ T

I(γ) = I(X^T; Y^T)

cmmse(γ) = E[ ∫_0^T (X_t − E[X_t | Y^t])² dt ]

so that [Duncan 1970] reads

I(γ) = (γ/2) · cmmse(γ)

The corresponding noncausal (smoothing) MMSE is

mmse(γ) = E[ ∫_0^T (X_t − E[X_t | Y^T])² dt ]
or, in its integral version,

I(snr) = (1/2) ∫_0^snr mmse(γ) dγ
[Zakai 2005]
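A worked example (added here; it is not in the extracted slides) that makes the continuous-time relations concrete, assuming the signal is a random constant X_t ≡ X ∼ N(0, σ²) held fixed over [0, T]:

\begin{align*}
\mathrm{mmse}(\gamma) &= \int_0^T \frac{\sigma^2}{1+\gamma\sigma^2 T}\,dt = \frac{\sigma^2 T}{1+\gamma\sigma^2 T}, \qquad
\mathrm{cmmse}(\gamma) = \int_0^T \frac{\sigma^2}{1+\gamma\sigma^2 t}\,dt = \frac{1}{\gamma}\,\log\bigl(1+\gamma\sigma^2 T\bigr),\\
I(\gamma) &= \tfrac{1}{2}\,\log\bigl(1+\gamma\sigma^2 T\bigr)
\;=\; \tfrac{\gamma}{2}\,\mathrm{cmmse}(\gamma)
\;=\; \tfrac{1}{2}\int_0^{\gamma} \mathrm{mmse}(s)\,ds .
\end{align*}

In particular, cmmse(snr) = (1/snr) ∫_0^snr mmse(γ) dγ in this example, in line with the causal/noncausal relation quoted in the Introduction.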
H(X) = Σ_{x∈𝒳} P(X = x) log( 1 / P(X = x) )

For mismatch:

cmse_{P,Q}(γ) = E_P[ ∫_0^T (X_t − E_Q[X_t | Y^t])² dt ]

mse_{P,Q}(γ) = E_P[ ∫_0^T (X_t − E_Q[X_t | Y^T])² dt ]

Relationship between cmse_{P,Q} and mse_{P,Q}?

cmse_{P,Q}(snr) = (1/snr) ∫_0^snr mse_{P,Q}(γ) dγ = (2/snr) [ I(snr) + D(P_Y‖Q_Y) ]
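An added numerical sketch (not in the notes) of the scalar-channel mismatch identity of [31] quoted in the Introduction, in the form D(P_Y‖Q_Y) at SNR = snr equals ½ ∫_0^snr [mse_{P,Q}(γ) − mmse_P(γ)] dγ, checked with Gaussian P = N(0, σ_P²) and Q = N(0, σ_Q²), where every quantity has a closed form; the variances are arbitrary choices.

# Added sketch (not from the notes): Verdu's scalar mismatch relation,
#   D(P_Y || Q_Y) at SNR = snr  =  1/2 * ∫_0^snr [mse_{P,Q}(γ) − mmse_P(γ)] dγ,
# checked for Gaussian P = N(0, sP2) and Q = N(0, sQ2).
import numpy as np

sP2, sQ2 = 1.5, 0.6                       # true / assumed input variances (arbitrary)
vP = lambda g: 1.0 + g * sP2              # output variance under P at SNR g
vQ = lambda g: 1.0 + g * sQ2              # output variance under Q at SNR g

mmse_P = lambda g: sP2 / vP(g)                          # matched MMSE under P
mse_PQ = lambda g: (sP2 + g * sQ2**2) / vQ(g) ** 2      # MSE of the Q-optimal estimator under P
D = lambda g: 0.5 * (np.log(vQ(g) / vP(g)) + vP(g) / vQ(g) - 1.0)   # D(N(0,vP) || N(0,vQ))

snr = 3.0
g = np.linspace(0.0, snr, 200_001)
excess = mse_PQ(g) - mmse_P(g)
half_integral = 0.5 * np.sum(0.5 * (excess[1:] + excess[:-1]) * np.diff(g))  # trapezoid rule
print(D(snr), half_integral)              # ≈ equal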
Duncan
SNR in Duncan
1 for Duncan slide
AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T
W is white Gaussian noise, independent of X[Duncan 1970]:
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
dYt =√γXtdt+ dWt, 0 ≤ t ≤ T
I(γ) = I(XT ;Y T )
cmmse(γ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
[Duncan 1970]:
I(γ) =1
2cmmse(γ)
2 Introduction
In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information betweenthe input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simplerelationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as thecontinuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings wherethis relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable rela-tionship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributedcontinuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean valueof the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual informationto both types of MMSE thus served as a bridge between the two quantities.
More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimatorthat would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due tothe mismatch is equal to the relative entropy between the true channel output distribution and the channel outputdistribution under Q, at SNR = γ.
This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEscontinues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distributionthat differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown tobe the sum of the mutual information and the relative entropy between the true and mismatched output distributions,this relative entropy thus quantifying the penalty due to mismatch.
Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, theinput, is a non-negative random variable while the conditional distribution of the output Y given the input isgiven by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, thechannel input is XT = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on XT , the outputY T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ ·XT . Often referred to as the“ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication:The channel input represents the squared magnitude of the electric field incident on the photo-detector, while itsoutput is the counting process describing the arrival times of the photons registered by the detector. Here the energyof the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it
2
1 for Duncan slide
AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T
W is white Gaussian noise, independent of X[Duncan 1970]:
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
dYt =√γXtdt+ dWt, 0 ≤ t ≤ T
I(γ) = I(XT ;Y T )
cmmse(γ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
[Duncan 1970]:
I(γ) =1
2cmmse(γ)
2 Introduction
In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information betweenthe input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simplerelationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as thecontinuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings wherethis relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable rela-tionship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributedcontinuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean valueof the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual informationto both types of MMSE thus served as a bridge between the two quantities.
More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimatorthat would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due tothe mismatch is equal to the relative entropy between the true channel output distribution and the channel outputdistribution under Q, at SNR = γ.
This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEscontinues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distributionthat differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown tobe the sum of the mutual information and the relative entropy between the true and mismatched output distributions,this relative entropy thus quantifying the penalty due to mismatch.
Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, theinput, is a non-negative random variable while the conditional distribution of the output Y given the input isgiven by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, thechannel input is XT = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on XT , the outputY T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ ·XT . Often referred to as the“ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication:The channel input represents the squared magnitude of the electric field incident on the photo-detector, while itsoutput is the counting process describing the arrival times of the photons registered by the detector. Here the energyof the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it
2
1 for Duncan slide
AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T
W is white Gaussian noise, independent of X[Duncan 1970]:
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
dYt =√γXtdt+ dWt, 0 ≤ t ≤ T
I(γ) = I(XT ;Y T )
cmmse(γ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
[Duncan 1970]:
I(γ) =1
2cmmse(γ)
2 Introduction
In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information betweenthe input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simplerelationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as thecontinuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings wherethis relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable rela-tionship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributedcontinuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean valueof the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual informationto both types of MMSE thus served as a bridge between the two quantities.
More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimatorthat would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due tothe mismatch is equal to the relative entropy between the true channel output distribution and the channel outputdistribution under Q, at SNR = γ.
This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEscontinues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distributionthat differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown tobe the sum of the mutual information and the relative entropy between the true and mismatched output distributions,this relative entropy thus quantifying the penalty due to mismatch.
Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, theinput, is a non-negative random variable while the conditional distribution of the output Y given the input isgiven by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, thechannel input is XT = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on XT , the outputY T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ ·XT . Often referred to as the“ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication:The channel input represents the squared magnitude of the electric field incident on the photo-detector, while itsoutput is the counting process describing the arrival times of the photons registered by the detector. Here the energyof the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it
2
1 for Duncan slide
AWGN channeldYt = Xtdt+ dWt, 0 ≤ t ≤ T
W is white Gaussian noise, independent of X[Duncan 1970]:
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
dYt =√γXtdt+ dWt, 0 ≤ t ≤ T
I(γ) = I(XT ;Y T )
cmmse(γ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
I(XT ;Y T ) =1
2E
�� T
0(Xt − E[Xt|Y t])2dt
�
[Duncan 1970]:
I(γ) =γ
2· cmmse(γ)
2 Introduction
In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information betweenthe input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR),is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simplerelationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as thecontinuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings wherethis relationship holds). When combined with Duncan’s theorem [7], it was also shown to imply a remarkable rela-tionship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributedcontinuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean valueof the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual informationto both types of MMSE thus served as a bridge between the two quantities.
More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimatorthat would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due tothe mismatch is equal to the relative entropy between the true channel output distribution and the channel outputdistribution under Q, at SNR = γ.
This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEscontinues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distributionthat differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown tobe the sum of the mutual information and the relative entropy between the true and mismatched output distributions,this relative entropy thus quantifying the penalty due to mismatch.
Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, theinput, is a non-negative random variable while the conditional distribution of the output Y given the input isgiven by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, thechannel input is XT = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on XT , the outputY T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ ·XT . Often referred to as the“ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication:The channel input represents the squared magnitude of the electric field incident on the photo-detector, while itsoutput is the counting process describing the arrival times of the photons registered by the detector. Here the energyof the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it
2 for GSV slide

Y = √γ · X + W, where W is a standard Gaussian, independent of X
I(γ) = I(X; Y),  mmse(γ) = E[(X − E[X|Y])²]
[Guo, Shamai and Verdu 2005]:
d/dγ I(γ) = ½ mmse(γ)

In continuous time ([Guo, Shamai and Verdu 2005], [Zakai 2005]), with mmse(γ) = E[ ∫₀^T (X_t − E[X_t|Y^T])² dt ], the same relation holds; or, in its integral version,

I(snr) = ½ ∫₀^snr mmse(γ) dγ.
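As a quick numerical sanity check of the scalar relation d/dγ I(γ) = ½ mmse(γ), here is a Monte Carlo sketch (not part of the source). It assumes an equiprobable binary input X ∈ {−1, +1}, chosen purely because E[X|Y] = tanh(√γ Y) is then available in closed form; the same noise samples are reused across SNR values so that the finite-difference derivative is stable.

import numpy as np

rng = np.random.default_rng(0)
n = 400_000
x = rng.choice([-1.0, 1.0], size=n)
w = rng.standard_normal(n)                        # common noise reused across gamma values

def mutual_info(gamma):                           # I(X;Y) in nats for Y = sqrt(gamma) X + W
    y = np.sqrt(gamma) * x + w
    log_p_y_given_x = -0.5 * (y - np.sqrt(gamma) * x) ** 2
    p_y = 0.5 * (np.exp(-0.5 * (y - np.sqrt(gamma)) ** 2)
                 + np.exp(-0.5 * (y + np.sqrt(gamma)) ** 2))
    return np.mean(log_p_y_given_x - np.log(p_y))  # Gaussian normalizing constants cancel

def mmse(gamma):                                   # E[(X - E[X|Y])^2] with E[X|Y] = tanh(sqrt(g) Y)
    y = np.sqrt(gamma) * x + w
    return np.mean((x - np.tanh(np.sqrt(gamma) * y)) ** 2)

g, h = 1.0, 1e-3
print((mutual_info(g + h) - mutual_info(g - h)) / (2 * h))   # numerical d/dgamma I(gamma)
print(0.5 * mmse(g))                                          # both numbers agree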
For mismatch:

cmse_{P,Q}(γ) = E_P[ ∫₀^T (X_t − E_Q[X_t|Y^t])² dt ]
mse_{P,Q}(γ) = E_P[ ∫₀^T (X_t − E_Q[X_t|Y^T])² dt ]

Relationship between cmse_{P,Q} and mse_{P,Q}?
cmmse(snr) = (1/snr) ∫₀^snr mmse(γ) dγ

Relationship between cmmse and mmse? Combining Duncan's theorem, cmmse(snr) = (2/snr) I(snr), with the integral version of the I-MMSE relation gives exactly the identity above: the causal MMSE at SNR level snr is the average of the noncausal MMSE over SNR levels uniformly distributed between 0 and snr.
What if X ∼ P but the estimator thinks X ∼ Q?

mse_{P,Q}(γ) = E_P[ (X − E_Q[X|Y])² ]

What is the Cost of Mismatch?
A new representation of relative entropy [Verdu 2010]:
D(P ‖ Q) = ½ ∫₀^∞ [ mse_{P,Q}(γ) − mse_{P,P}(γ) ] dγ

and, at any finite SNR level,

D(P_{Y_snr} ‖ Q_{Y_snr}) = ½ ∫₀^snr [ mse_{P,Q}(γ) − mse_{P,P}(γ) ] dγ,

or, in differential form,

d/dγ D(P_{Y_γ} ‖ Q_{Y_γ}) = ½ [ mse_{P,Q}(γ) − mse_{P,P}(γ) ].
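A minimal numerical sketch of the relation above (not from the source): it assumes the true prior P = N(0, 1) and a mismatched prior Q = N(0, s²), chosen purely because both conditional-mean estimators are then linear, and the matched MSE, the mismatched MSE and D(P‖Q) all have closed forms.

import numpy as np
from scipy.integrate import quad

# Check D(P||Q) = (1/2) * integral over gamma of [mse_{P,Q} - mse_{P,P}]
# for the scalar channel Y = sqrt(gamma) X + W, W ~ N(0,1).
s2 = 4.0   # variance assumed by the mismatched estimator (illustrative choice)

def mse_matched(g):            # mse_{P,P}(gamma) for X ~ N(0,1)
    return 1.0 / (1.0 + g)

def mse_mismatched(g):         # mse_{P,Q}(gamma): estimator E_Q[X|Y] = sqrt(g) s2 Y / (1 + g s2)
    if g <= 0.0:
        return 1.0
    b = g * s2 / (1.0 + g * s2)            # shrinkage factor applied to sqrt(g) Y
    return (1.0 - b) ** 2 + b ** 2 / g

integral, _ = quad(lambda g: mse_mismatched(g) - mse_matched(g), 0.0, np.inf)
kl = 0.5 * (np.log(s2) + 1.0 / s2 - 1.0)   # D(N(0,1) || N(0,s2)) in nats

print(0.5 * integral, kl)                  # the two numbers agree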
Causal vs. Non-causal Mismatched Estimation
cmse_{P,Q}(snr) = (1/snr) ∫₀^snr mse_{P,Q}(γ) dγ = (2/snr) [ I(snr) + D(P_{Y^T} ‖ Q_{Y^T}) ],

which answers the question of how cmse_{P,Q} relates to mse_{P,Q}: the causal mismatched MSE at SNR level snr is the average over SNR of the noncausal mismatched MSE, and the bridge between the two is the sum of the mutual information and the relative entropy between the true and mismatched output distributions.
minimax estimation
Theorem 6.4 Let P and Q be two probability measures that are members of P. For γ ≥ 0,

D(P_{Y^T_γ} ‖ Q_{Y^T_γ}) = γ · [ cmle_{P,Q}(γ) − cmle_{P,P}(γ) ].  (27)

Theorem 6.5 (under mild conditions)

D(P_{Y^T} ‖ Q_{Y^T}) ∝ cmle_{P,Q} − cmle_{P,P}  (28)
cmse_{P,Q} − cmse_{P,P} = D(P_{Y^T} ‖ Q_{Y^T})  (29)

• Girsanov-type theory for expressing log dQ_{Y^T}/d(law of homogeneous Poisson) as a filtering integral
• manipulating
D(P_{Y^T} ‖ Q_{Y^T}) = E_P[ log dP_{Y^T}/d(law of homogeneous Poisson) − log dQ_{Y^T}/d(law of homogeneous Poisson) ]
via 'orthogonality' etc.

Put together, Theorem 6.3 and Theorem 6.5 yield, for γ > 0,

cmle_{P,Q}(γ) − cmle_{P,P}(γ) = (1/γ) ∫₀^γ [ mle_{P,Q}(α) − mle_{P,P}(α) ] dα = (1/γ) D(P_{Y^T_γ} ‖ Q_{Y^T_γ}),  (30)

which is the Poissonian analogue of [33, Theorem 2]. On a technical note, the r.h.s. of (24), (25) and (26) are well-defined as integrals of non-negative Borel measurable functions, as will follow from our treatment in Section 9.

6.4 for slides: minimaxity

minimax(P, snr) ≜ min over filters {X̂_t(·)}_{0≤t≤T} of max_{P∈P} { E_P[ ∫₀^T ℓ(X_t, X̂_t(Y^t)) dt ] − cmse_{P,P}(snr) }

or, equivalently, in terms of the causal MSE attained by an arbitrary filter X̂,

minimax(P, snr) ≜ min_{X̂(·)} max_{P∈P} { cmse_{P,X̂}(snr) − cmse_{P,P}(snr) }

minimax(P, snr) = min_Q max_{P∈P} [ cmse_{P,Q}(snr) − cmse_{P,P}(snr) ]  (31)
= (2/snr) min_Q max_{P∈P} D(P_{Y^T_snr} ‖ Q_{Y^T_snr})  (32)
= (2/snr) max { I(Θ; Y^T_snr) : Θ is a P-valued RV }  (33)
= (2/snr) C({P_{Y^T_snr}}_{P∈P})  (34)

Furthermore, the 'strong redundancy-capacity' results are directly applicable here and imply:

6.5 strong red cap

For every ε > 0 and any filter {X̂_t(·)}_{0≤t≤T},

E_P[ ∫₀^T ℓ(X_t, X̂_t(Y^t)) dt ] − cmse_{P,P}(snr) ≥ (1 − ε) · minimax(P, snr)  (35)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w*(B) ≤ e · 2^{−ε·C({P_{Y^T_snr}}_{P∈P})},  (36)
w*(B) ≤ e · 2^{−ε·minimax(P,snr)},

w* being the capacity-achieving prior.
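The passage from (32) to (34) is the redundancy-capacity identity: the minimax divergence from the family to a single representative Q equals the capacity of the "channel" from the family index Θ to the output, and the minimizing Q is the capacity-achieving output mixture. The sketch below (not from the source) illustrates this identity on a toy finite family; the family members' output laws are taken to be Poisson with a few illustrative intensities on a truncated alphabet, and the capacity is computed with the Blahut-Arimoto algorithm, which is only a finite, discrete caricature of the continuous-time quantity in the text.

import numpy as np
from scipy.stats import poisson

# Rows of W: the output law P_Y under each member of a hypothetical finite family.
lams = [1.0, 2.0, 5.0]                                   # illustrative intensities
ys = np.arange(0, 60)
W = np.array([poisson.pmf(ys, lam) for lam in lams])     # W[p, y] = P_Y(y) under member p
W = W / W.sum(axis=1, keepdims=True)                     # renormalize after truncation

w = np.ones(len(lams)) / len(lams)                       # prior over the family
for _ in range(500):                                     # Blahut-Arimoto iterations
    q = w @ W                                            # mixture output Q_Y
    d = np.sum(W * np.log(W / q), axis=1)                # D(P_Y || Q_Y) for each member
    w = w * np.exp(d)
    w /= w.sum()

q = w @ W
d = np.sum(W * np.log(W / q), axis=1)
C = w @ d                                                # capacity = max over priors of I(Theta;Y)
print(C, d.max())                                        # worst-case divergence to the
                                                         # capacity-achieving mixture equals C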
Strong Converse
For every ε > 0 and any filter X̂(·),

cmse_{P,X̂}(snr) − cmse_{P,P}(snr) ≥ (1 − ε) · minimax(P, snr)  (36)

for all P ∈ P with the possible exception of sources in a subset B ⊂ P where

w*(B) ≤ e · 2^{−ε·C({P_{Y^T_snr}}_{P∈P})},  (37)
w*(B) ≤ e · 2^{−ε·minimax(P,snr)},

w* being the capacity-achieving prior.
7 Implications

7.1 Mutual Information and Minimum Mean Estimation Loss

Let X be a non-negative random variable and, for γ > 0, let Y_γ be a non-negative integer-valued random variable, jointly distributed with X such that the conditional law of Y_γ given X is Poisson(γX). When specialized to this setting, Theorem 2 of [14] gives

d/dγ I(X; Y_γ) = E[ X log X − E[X|Y_γ] log E[X|Y_γ] ].  (38)

It is instructive to observe that the right hand side of (38) is nothing but the minimum mean loss in estimating X based on Y_γ under the loss function ℓ. Indeed, denoting this minimum mean loss by mmle(γ), i.e.,

mmle(γ) ≜ E[ ℓ(X, E[X|Y_γ]) ],  (39)

we have

E[ ℓ(X, E[X|Y_γ]) ] = E[ X log(X / E[X|Y_γ]) − X + E[X|Y_γ] ]  (40)
= E[ X log X − X log E[X|Y_γ] ]  (41)
= E[ X log X − E[X|Y_γ] log E[X|Y_γ] ].  (42)

Thus, (38) can be stated as the "I-MMLE" relationship

d/dγ I(X; Y_γ) = mmle(γ),  (43)

in complete analogy with the I-MMSE relationship of [13]. To see one immediate benefit of the realization that the right hand side of (38) coincides with the minimum mean loss in (39), we first go through the following data processing argument: Fix γ′ < γ, let {B_i}_{i≥1} be i.i.d. Bernoulli(γ′/γ), independent of (X, Y_γ), and note that (X, Σ_{i=1}^{Y_γ} B_i) is equal in distribution to (X, Y_{γ′}). Since estimating X based on Σ_{i=1}^{Y_γ} B_i, which is a function of Y_γ and the randomization sequence {B_i}, cannot be better (in the sense of minimizing the expected loss under ℓ) than estimating X based on Y_γ, we have mmle(γ′) ≥ mmle(γ). Thus, mmle(γ) is non-increasing in γ, which, when combined with (43), yields the following analogue of [13, Corollary 1]:

Corollary 7.1 I(X; Y_γ) is concave in γ.

It is also worth pointing out that the I-MMLE relationship can be viewed as a direct consequence of Theorem 6.2. Indeed, in the notation of Section 6.2, (43) is expressed as

d/dγ I_P(X; Y_γ) = mle_{P,P}(γ),  (44)
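Here is a small numerical sanity check of the I-MMLE relation (43) (not part of the source); the two-point prior on X and the truncation of the output alphabet are illustrative choices.

import numpy as np
from scipy.stats import poisson

xs = np.array([1.0, 3.0])            # hypothetical support of X
px = np.array([0.5, 0.5])            # hypothetical prior
ys = np.arange(0, 100)               # truncated output alphabet (truncation error is negligible here)

def joint(gamma):
    pyx = poisson.pmf(ys[None, :], gamma * xs[:, None])   # p(y|x)
    return pyx, px @ pyx                                   # p(y|x) and p(y)

def mutual_info(gamma):              # I(X; Y_gamma) in nats
    pyx, py = joint(gamma)
    return np.sum(px[:, None] * pyx * np.log(pyx / py))

def mmle(gamma):                     # E[ l(X, E[X|Y_gamma]) ] with l(x, xh) = x log(x/xh) - x + xh
    pyx, py = joint(gamma)
    post = px[:, None] * pyx / py                          # p(x|y)
    xhat = xs @ post                                       # E[X | Y_gamma = y]
    loss = xs[:, None] * np.log(xs[:, None] / xhat) - xs[:, None] + xhat
    return np.sum(px[:, None] * pyx * loss)

g, h = 2.0, 1e-4
print((mutual_info(g + h) - mutual_info(g - h)) / (2 * h))  # numerical d/dgamma I(X;Y_gamma)
print(mmle(g))                                               # both ≈ the same value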
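The thinning step used in the data processing argument above is also easy to see numerically; the sketch below (not from the source, with arbitrary illustrative parameters) checks that Bernoulli(γ′/γ)-thinning of Y_γ reproduces the law of Y_{γ′}.

import numpy as np

rng = np.random.default_rng(1)
gamma, gamma_p, x, n = 3.0, 1.2, 2.5, 500_000   # hypothetical parameters, with gamma_p < gamma

y_gamma = rng.poisson(gamma * x, size=n)                 # draws of Y_gamma given X = x
thinned = rng.binomial(y_gamma, gamma_p / gamma)         # sum of Y_gamma Bernoulli(gamma'/gamma) variables
y_gamma_p = rng.poisson(gamma_p * x, size=n)             # direct draws of Y_gamma'

print(thinned.mean(), y_gamma_p.mean())                  # both ≈ gamma' * x
print(thinned.var(), y_gamma_p.var())                    # both ≈ gamma' * x (Poisson)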
"Minimax Filtering via Relations Between Information and Estimation"
ISIT 2013, IEEE International Symposium on Information Theory
July 7-12, 2013, Istanbul, Turkey
Albert No and T. Weissman
lookahead
question
1 for Duncan slide

    H(X) = Σ_{x∈X} P(X = x) log (1 / P(X = x))

AWGN channel: dYt = Xt dt + dWt, 0 ≤ t ≤ T
W is standard white Gaussian noise, independent of X

[Duncan 1970]: For example, consider Duncan’s relationship

    I(X^T; Y^T) = (1/2) E[ ∫_0^T (Xt − E[Xt|Y^t])² dt ]

⇔

    E[ log ( dP_{X^T,Y^T} / d(P_{X^T} × P_{Y^T}) ) ] = (1/2) E[ ∫_0^T (Xt − E[Xt|Y^t])² dt ]

⇔

    E[ log ( dP_{X^T,Y^T} / d(P_{X^T} × P_{Y^T}) ) − (1/2) ∫_0^T (Xt − E[Xt|Y^t])² dt ] = 0

What else can we say about the random variable

    log ( dP_{X^T,Y^T} / d(P_{X^T} × P_{Y^T}) ) − (1/2) ∫_0^T (Xt − E[Xt|Y^t])² dt ?

    Var[ log ( dP_{X^T,Y^T} / d(P_{X^T} × P_{Y^T}) ) − (1/2) ∫_0^T (Xt − E[Xt|Y^t])² dt ] = ?

    Var[ log ( dP_{X^T,Y^T} / d(P_{X^T} × P_{Y^T}) ) − (1/2) ∫_0^T (Xt − E[Xt|Y^t])² dt ] = 2 I(X^T; Y^T) = E[ ∫_0^T (Xt − E[Xt|Y^t])² dt ]

dYt = √γ Xt dt + dWt, 0 ≤ t ≤ T

For stationary X = {Xt} let

    mmse(X, d, γ) = Var(X_0 | Y^d_{−∞}),

and let I(γ) here be the mutual information rate, I(γ) = I(X^T; Y^T).

Can I(·) determine lmmse(d, snr)? How about I(·) and Sx(·)?

We’ve seen that I(·) determines both mmse(X, 0, γ) and mmse(X, ∞, γ). Does I(·) determine mmse(X, d, γ) in general? No: in general mmse(X, d, γ) ≠ mmse(X^(r), d, γ), where X^(r) is the time-reversed X.
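For a concrete feel for Duncan’s relationship, here is a minimal Monte Carlo sketch for the simplest simulable case (an assumed toy example, not from the notes): a signal constant in time, Xt ≡ X ∼ N(0, σ²), observed through dYt = X dt + dWt. Here the causal estimate has the closed form E[X|Y^t] = σ²Yt/(1 + σ²t) and I(X^T;Y^T) = ½ log(1 + σ²T), so both sides of the identity can be compared directly; all parameters below are arbitrary.

```python
# Monte Carlo illustration of Duncan's relationship
#   I(X^T; Y^T) = (1/2) E ∫_0^T (X_t - E[X_t | Y^t])^2 dt
# for the toy model X_t ≡ X ~ N(0, sigma2), dY_t = X dt + dW_t.
import numpy as np

rng = np.random.default_rng(0)
sigma2, T, n_steps, n_paths = 1.5, 2.0, 1000, 4000
dt = T / n_steps
t = np.arange(1, n_steps + 1) * dt

X = rng.normal(0.0, np.sqrt(sigma2), size=(n_paths, 1))        # the constant signal
dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))     # Brownian increments
Y = np.cumsum(X * dt + dW, axis=1)                             # Y_t on the time grid
Xhat = sigma2 * Y / (1.0 + sigma2 * t)                         # causal (filtering) estimate
causal_integral = ((X - Xhat) ** 2).sum(axis=1).mean() * dt    # ≈ E ∫ (X_t - X̂_t)² dt

print(0.5 * causal_integral)             # Monte Carlo estimate of the right-hand side
print(0.5 * np.log(1.0 + sigma2 * T))    # I(X^T; Y^T) for this model
```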
Monday, June 3, 2013
a time irreversible process
Monday, June 3, 2013
    d/dγ I(γ) = (1/2) mmse(γ),    mmse(γ) = E[ ∫_0^T (Xt − E[Xt|Y^T])² dt ]

or in its integral version

    I(snr) = (1/2) ∫_0^snr mmse(γ) dγ

    cmmse(snr) = (1/snr) ∫_0^snr mmse(γ) dγ
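The scalar counterpart of this relation, d/dγ I(X;Yγ) = ½ mmse(γ) for Yγ = √γ X + N with N ∼ N(0,1), is easy to check numerically; the binary prior below is an arbitrary illustrative choice, not anything prescribed by the notes.

```python
# Numerical check of the scalar I-MMSE relation d/dγ I(X; Y_γ) = (1/2) mmse(γ),
# with Y_γ = sqrt(γ) X + N(0,1) and X = ±1 equiprobable.
import numpy as np

y = np.linspace(-12, 12, 20001)                    # integration grid over the output
dy = y[1] - y[0]
phi = lambda z: np.exp(-z * z / 2) / np.sqrt(2 * np.pi)
trap = lambda f: (f.sum() - 0.5 * (f[0] + f[-1])) * dy   # trapezoid rule, uniform grid

def mutual_info(g):
    lik = {x: phi(y - np.sqrt(g) * x) for x in (-1.0, 1.0)}
    py = 0.5 * (lik[-1.0] + lik[1.0])
    return trap(sum(0.5 * lik[x] * np.log(lik[x] / py) for x in (-1.0, 1.0)))

def mmse(g):
    lik = {x: phi(y - np.sqrt(g) * x) for x in (-1.0, 1.0)}
    py = 0.5 * (lik[-1.0] + lik[1.0])
    xhat = 0.5 * (lik[1.0] - lik[-1.0]) / py       # E[X | Y_γ = y]
    return trap(sum(0.5 * lik[x] * (x - xhat) ** 2 for x in (-1.0, 1.0)))

g, h = 1.7, 1e-4
print((mutual_info(g + h) - mutual_info(g - h)) / (2 * h))   # numeric dI/dγ
print(0.5 * mmse(g))                                          # should match
```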
Relationship between cmmse and mmse?

What if X ∼ P but the estimator thinks X ∼ Q?

    mse_{P,Q}(γ) = E_P[ (X − E_Q[X|Y])² ]

What is the cost of mismatch?

    D(P || Q) = ∫_0^∞ [ mse_{P,Q}(γ) − mse_{P,P}(γ) ] dγ

    D(P_{Y_snr} || Q_{Y_snr}) = ∫_0^snr [ mse_{P,Q}(γ) − mse_{P,P}(γ) ] dγ

    d/dγ D(P_{Y_γ} || Q_{Y_γ}) = mse_{P,Q}(γ) − mse_{P,P}(γ)
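The mismatch identity can likewise be checked numerically in the scalar Gaussian channel. In the sketch below the two-point priors P and Q and the value of snr are arbitrary choices; the left side is computed by integrating the output densities, the right side by integrating the excess MSE over a γ grid.

```python
# Check of D(P_{Y_snr} || Q_{Y_snr}) = ∫_0^snr [mse_{P,Q}(γ) - mse_{P,P}(γ)] dγ
# for Y_γ = sqrt(γ) X + N(0,1) and two-point priors on {-1, +1}.
import numpy as np

y = np.linspace(-12, 12, 20001)
dy = y[1] - y[0]
phi = lambda z: np.exp(-z * z / 2) / np.sqrt(2 * np.pi)
trap = lambda f, dx: (f.sum() - 0.5 * (f[0] + f[-1])) * dx
P = {-1.0: 0.5, 1.0: 0.5}
Q = {-1.0: 0.2, 1.0: 0.8}

def out_density(prior, g):
    return sum(w * phi(y - np.sqrt(g) * x) for x, w in prior.items())

def mse(true_prior, est_prior, g):
    xhat = sum(x * w * phi(y - np.sqrt(g) * x) for x, w in est_prior.items()) / out_density(est_prior, g)
    integrand = sum(w * phi(y - np.sqrt(g) * x) * (x - xhat) ** 2 for x, w in true_prior.items())
    return trap(integrand, dy)

snr = 2.0
pY, qY = out_density(P, snr), out_density(Q, snr)
lhs = trap(pY * np.log(pY / qY), dy)                         # output relative entropy at snr

gammas = np.linspace(1e-4, snr, 400)
excess = np.array([mse(P, Q, g) - mse(P, P, g) for g in gammas])
rhs = trap(excess, gammas[1] - gammas[0])
print(lhs, rhs)                                              # the two should agree closely
```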
3 Introduction
In the seminal paper [13], Guo, Shamai and Verdu discovered that the derivative of the mutual information between the input and the output in a real-valued scalar Gaussian channel, with respect to the signal-to-noise ratio (SNR), is equal to the minimum mean square error (MMSE) in estimating the input based on the output. This simple relationship holds regardless of the input distribution, and carries over essentially verbatim to vectors, as well as the continuous-time Additive White Gaussian Noise (AWGN) channel (cf. [34, 21] for even more general settings where this relationship holds). When combined with Duncan's theorem [7], it was also shown to imply a remarkable relationship between the MMSEs in causal (filtering) and non-causal (smoothing) estimation of an arbitrarily distributed continuous-time signal corrupted by Gaussian noise: the filtering MMSE at SNR level γ is equal to the mean value of the smoothing MMSE with SNR uniformly distributed between 0 and γ. The relation of the mutual information to both types of MMSE thus served as a bridge between the two quantities.
More recently, Verdu has shown in [31] that when X ∼ P is estimated based on Y by a mismatched estimator that would have minimized the MSE had X ∼ Q, the integral over all SNR values up to γ of the excess MSE due to the mismatch is equal to the relative entropy between the true channel output distribution and the channel output distribution under Q, at SNR = γ.
Monday, June 3, 2013
Poisson Channel
6 Relative Entropy and Mismatched Estimation
6.1 For slides
• Scalar Channel: X ≥ 0, Yγ | X ∼ Poisson(γ · X)
• Continuous-time Channel: X^T a non-negative stochastic process; Y^T_γ | X^T non-homogeneous Poisson of intensity γ · X^T

• Note:

    D(exp(λ1) || exp(λ2)) = (1/λ1) · ℓ(λ1, λ2)

Compare with

    D(N(µ1, σ²) || N(µ2, σ²)) = (1/(2σ²)) · (µ1 − µ2)²

•

    I(X^T; Y^T_γ) = γ · E[ ∫_0^T ℓ(Xt, E[Xt|Y^t_γ]) dt ]

    cmmle(γ) = E[ ∫_0^T ℓ(Xt, E[Xt|Y^t_γ]) dt ],    mmle(γ) = E[ ∫_0^T ℓ(Xt, E[Xt|Y^T_γ]) dt ]

    cmmle(snr) = (1/snr) ∫_0^snr mmle(γ) dγ

Relationship between cmmle and mmle

For X independent of Z ∼ N(0, 1):

    d/dt h(X + √t Z) = (1/2) J(X + √t Z)
6.2 Random Variables

Suppose that X is a non-negative random variable and the conditional law of a r.v. Yγ, given X, is Poisson(γX). If X ∼ P, denote expectation w.r.t. the corresponding joint law of X and Yγ by EP, the distribution of Yγ by PYγ, the conditional expectation by EP[X|Yγ], etc. We denote the mutual information by IP(X;Yγ) or simply I(X;Yγ) when there is no ambiguity. Let further mleP,Q(γ) denote the mean loss under ℓ in estimating X based on Yγ using the estimator that would have been optimal had X ∼ Q when in fact X ∼ P, i.e.,

    mle_{P,Q}(γ) ≜ E_P[ ℓ(X, E_Q[X|Yγ]) ].    (20)

The following is a new representation of relative entropy, paralleling the Gaussian channel result of [31]:

Theorem 6.1 For any pair P, Q of probability measures over [a, b], where 0 < a < b < ∞,

    D(P || Q) = ∫_0^∞ [ mle_{P,Q}(γ) − mle_{P,P}(γ) ] dγ.    (21)

Theorem 6.1 is a direct consequence of the fact (proved in Section 9) that

    lim_{γ→∞} D(P_{Yγ} || Q_{Yγ}) = D(P || Q),    (22)

combined with the following result, which is the Poisson parallel of [31, Equation (24)]:
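For discrete priors, Theorem 6.1 lends itself to a direct numerical check, since mleP,Q(γ) is just a sum over the Poisson output alphabet. In the sketch below the two-point priors, the truncation of the γ-integral and the output cutoff are arbitrary illustrative choices.

```python
# Numerical check of D(P||Q) = ∫_0^∞ [mle_{P,Q}(γ) - mle_{P,P}(γ)] dγ
# for the Poisson channel Y_γ | X ~ Poisson(γX) and two-point priors on {a, b}.
from math import exp, log, lgamma
import numpy as np

a, b = 0.5, 2.0
P = {a: 0.3, b: 0.7}
Q = {a: 0.6, b: 0.4}

def pois(y, lam):
    return exp(y * log(lam) - lam - lgamma(y + 1))

def loss(x, xhat):
    return x * log(x / xhat) - x + xhat

def mle(true_prior, est_prior, gamma, ymax=400):
    """Mean loss E_P[ l(X, E_Q[X|Y_γ]) ], by summing over the output alphabet."""
    if gamma == 0.0:
        xhat0 = sum(x * w for x, w in est_prior.items())
        return sum(w * loss(x, xhat0) for x, w in true_prior.items())
    total = 0.0
    for y in range(ymax):
        qy = sum(w * pois(y, gamma * x) for x, w in est_prior.items())
        xhat = sum(x * w * pois(y, gamma * x) for x, w in est_prior.items()) / qy
        total += sum(w * pois(y, gamma * x) * loss(x, xhat) for x, w in true_prior.items())
    return total

gammas = np.linspace(0.0, 40.0, 401)                 # truncate the integral at γ = 40
excess = np.array([mle(P, Q, g) - mle(P, P, g) for g in gammas])
rhs = (excess.sum() - 0.5 * (excess[0] + excess[-1])) * (gammas[1] - gammas[0])
lhs = sum(w * log(w / Q[x]) for x, w in P.items())   # D(P||Q) for these two-point priors
print(lhs, rhs)                                      # the two numbers should be close
```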
Monday, June 3, 2013
quest for
This result was key in [33], where it was shown that the relationship between the causal and non-causal MMSEs continues to hold also in the mismatched case, i.e. when the filters are optimized for an underlying signal distribution that differs from the true one. The bridge between the two sides of the equality in this mismatched case was shown to be the sum of the mutual information and the relative entropy between the true and mismatched output distributions, this relative entropy thus quantifying the penalty due to mismatch.
Consider now the Poisson channel, by which we mean, for the case of scalar random variables, that X, the input, is a non-negative random variable while the conditional distribution of the output Y given the input is given by Poisson(γ · X), the parameter γ ≥ 0 here playing the role of SNR. In the continuous time setting, the channel input is X^T = {Xt, 0 ≤ t ≤ T}, a non-negative stochastic process, and conditionally on X^T, the output Y^T = {Yt, 0 ≤ t ≤ T} is a non-homogeneous Poisson process with intensity function γ · X^T. Often referred to as the “ideal Poisson channel” [19], this model is the canonical one for describing direct detection optical communication: The channel input represents the squared magnitude of the electric field incident on the photo-detector, while its output is the counting process describing the arrival times of the photons registered by the detector. Here the energy of the channel input signal is proportional to its l1 norm, rather than the l2 norm as in the Gaussian channel. Thus it is the amplification factor γ rather than γ² that plays the role of SNR. We refer to [32] for a review of the literature on the Poisson channel and its communication theoretic significance, and to [11] and references therein for applications of Poisson channel models in other fields.
The function ℓ0(x) = x log x − x + 1, x > 0 (where log denotes the natural logarithm throughout), being the convex conjugate of the Poisson distribution’s log moment generating function, arises naturally in analysis of Poisson and continuous time jump Markov processes in a variety of situations. These include relative entropy representation for jump Markov processes (see, e.g., equation (3.20) and Theorem 3.3 of [8]), large deviation local rate function for such processes ([8], Chapter 5 of [29]), mutual information in the Poisson channel (Section 19.5 and equation (19.135) of [20]), and logarithmic transformations in stochastic control theory (Section 3 of [9]). It is also intimately related to change-of-measure formulae for point processes in the spirit of the Girsanov transformation (Section VI.(5.5–6) of [4], [16], [28]). It is therefore not surprising that the function ℓ0 appears in this paper in representations for relative entropy and related calculations. It is less obvious, however, that using it to define estimation loss turns out to be very useful and, in particular, gives rise to a number of results that parallel the Gaussian theory.
Enter the loss function ℓ : [0,∞) × [0,∞) → [0,∞] defined by x̂ · ℓ0(x/x̂) or, more precisely,

    ℓ(x, x̂) = x log(x/x̂) − x + x̂,    (1)

where the right hand side of (1) is well-defined as an extended non-negative real number in view of our conventions 0 log 0 = 0, 0 log(0/0) = 0, c/0 = ∞ and log(c/0) = ∞ for c > 0. In Section 2, we exhibit properties of this loss function that show it is a natural one for measuring goodness of reconstruction of non-negative objects, and that it shares some of its key properties with the squared error loss, such as optimality of the conditional expectation under the mean loss criterion.
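The optimality property mentioned at the end of the paragraph (the expectation minimizes the mean loss under ℓ, just as for squared error) is easy to see numerically; the gamma-distributed X in the sketch below is an arbitrary example.

```python
# Under l(x, xhat) = x log(x/xhat) - x + xhat, the mean loss E[l(X, c)] over
# constants c is minimized at c = E[X]; here we confirm this on samples.
import numpy as np

rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, scale=1.5, size=100_000)     # a non-negative X, E[X] = 3

def mean_loss(c):
    return np.mean(X * np.log(X / c) - X + c)

cs = np.linspace(0.5, 8.0, 751)
best = cs[np.argmin([mean_loss(c) for c in cs])]
print(best, X.mean())     # the numerical minimizer should sit near the sample mean
```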
The goal of this paper is to show that a set of relations identical to those that hold for the Gaussian channel – ranging from Duncan's formula [7], to the I-MMSE of [13, 34], to Verdu's relationship between relative entropy and mismatched estimation [31], to the relationship between causal and non-causal estimation in continuous time for matched [13] and mismatched [33] filters – hold for the Poisson channel upon replacing the squared error loss by the loss function in (1).
It is instructive to note that while the relative entropy between two Gaussians of the same variance and means m1 and m2 is equal to (m1 − m2)², that between two exponentials of parameters λ1 and λ2 is equal to ℓ(λ1, λ2) (with additional multiplicative terms in both cases). Although this simple fact does not exclusively explain the Gaussian-Poissonian analogy, it lies at its heart, along with further properties of ℓ observed in Section 2.
Monday, June 3, 2013
[26] D. P. Palomar and S. Verdu, “Representation of Mutual Information via Input Estimates,” IEEE Trans. Information Theory, vol. 53, no. 2, pp. 453-470, Feb. 2007.
[27] B. Y. Ryabko, “Encoding a source with unknown but ordered probabilities,” Probl. Inf. Transm., pp. 134-139, Oct. 1979.
[28] A. Segall and T. Kailath, “Radon-Nikodym derivatives with respect to measures induced by discontinuous independent-increment processes,” Ann. Probab., vol. 3, no. 3, pp. 449-464, 1975.
[29] A. Shwartz and A. Weiss, Large Deviations for Performance Analysis: Queues, Communications, and Computing. Chapman & Hall, London, 1995.
[30] A. M. Tulino and S. Verdu, “Monotonic Decrease of the Non-Gaussianness of the Sum of Independent Random Variables: A Simple Proof,” IEEE Trans. Information Theory, vol. 52, no. 9, pp. 4295-4297, Sep. 2006.
[31] S. Verdu, “Mismatched estimation and relative entropy,” IEEE Trans. Information Theory, vol. 56, no. 8, pp. 3712-3720, Aug. 2010.
[32] S. Verdu, “Poisson communication theory,” International Technion Communication Day in Honor of Israel Bar-David, March 1999.
[33] T. Weissman, “The Relationship Between Causal and Noncausal Mismatched Estimation in Continuous-Time AWGN Channels,” IEEE Trans. Information Theory, vol. 56, no. 9, pp. 4256-4273, Sep. 2010.
[34] M. Zakai, “On mutual information, likelihood ratios, and estimation error for the additive Gaussian channel,” IEEE Trans. Information Theory, vol. 51, no. 9, pp. 3017-3024, Sep. 2005.
Figure 1: The loss function ℓ. (a) ℓ(1, x); (b) ℓ(x, 1) = x log x − x + 1.
Figure 2: The curves mleP,P(γ), cmleP,P(γ), mleP,Q(γ) and cmleP,Q(γ), marked respectively by A, B, C, D, of the example in Section 8.1, plotted here for p = 1/2 and q = 1/5.
Monday, June 3, 2013
An observation (and hint)

• Continuous-time Channel: X^T a non-negative stochastic process; Y^T_γ | X^T non-homogeneous Poisson of intensity γ · X^T

• Note:

    D(exp(λ1) || exp(λ2)) = (1/λ1) · ℓ(λ1, λ2)

    D(Poisson(λ1) || Poisson(λ2)) = ℓ(λ1, λ2)

Compare with

    D(N(µ1, σ²) || N(µ2, σ²)) = (1/(2σ²)) · (µ1 − µ2)²,    D(N(µ1, 1) || N(µ2, 1)) = (1/2) · (µ1 − µ2)²
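Both divergence identities in the note are one-line calculations, and can be confirmed numerically as well; the parameter values in the sketch below are arbitrary.

```python
# Check of D(Poisson(λ1)||Poisson(λ2)) = l(λ1, λ2) and
# D(Exp(λ1)||Exp(λ2)) = (1/λ1)·l(λ1, λ2), where l(x, xhat) = x log(x/xhat) - x + xhat
# and Exp(λ) has density λ exp(-λx).
from math import exp, log, lgamma

def loss(x, xhat):
    return x * log(x / xhat) - x + xhat

def pois(y, lam):
    return exp(y * log(lam) - lam - lgamma(y + 1))

l1, l2 = 2.5, 0.8

# Poisson: log ratio of the two pmfs is y log(λ1/λ2) - λ1 + λ2, summed against Poisson(λ1).
d_pois = sum(pois(y, l1) * (y * log(l1 / l2) - l1 + l2) for y in range(200))
print(d_pois, loss(l1, l2))

# Exponential: closed form D = log(λ1/λ2) + λ2/λ1 - 1.
d_exp = log(l1 / l2) + l2 / l1 - 1.0
print(d_exp, loss(l1, l2) / l1)
```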
•

    I(X^T; Y^T_γ) = γ · E[ ∫_0^T ℓ(Xt, E[Xt|Y^t_γ]) dt ]

    cmle_{P,Q}(snr) = (1/snr) · [ I(X^T; Y^T_snr) + D(P_{Y^T_snr} || Q_{Y^T_snr}) ]

    cmmle(γ) = E[ ∫_0^T ℓ(Xt, E[Xt|Y^t_γ]) dt ],    mmle(γ) = E[ ∫_0^T ℓ(Xt, E[Xt|Y^T_γ]) dt ]

    cmmle(snr) = (1/snr) ∫_0^snr mmle(γ) dγ

Relationship between cmmle and mmle
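A short sketch of where the factor γ comes from, assuming the standard point-process mutual information formula I(X^T;Y^T_γ) = E∫_0^T [λt log λt − λ̂t log λ̂t] dt with λt = γXt and λ̂t = E[λt|Y^t_γ] (cf. the pointer to [20] above), and writing X̂t := E[Xt|Y^t_γ]:

\begin{align*}
I\big(X^T;Y^T_\gamma\big)
 &= \mathbb{E}\int_0^T\Big[\gamma X_t\log(\gamma X_t)-\gamma\hat X_t\log(\gamma\hat X_t)\Big]\,dt\\
 &= \gamma\,\mathbb{E}\int_0^T\Big[X_t\log X_t-\hat X_t\log\hat X_t\Big]\,dt
    \;+\;\gamma\log\gamma\,\mathbb{E}\int_0^T\big[X_t-\hat X_t\big]\,dt\\
 &= \gamma\,\mathbb{E}\int_0^T\ell\big(X_t,\hat X_t\big)\,dt
    \;=\;\gamma\cdot\mathrm{cmmle}(\gamma),
\end{align*}

using E[Xt − X̂t] = 0 and E[ℓ(Xt, X̂t)] = E[Xt log Xt − X̂t log X̂t], exactly as in (40)–(42).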
Monday, June 3, 2013
Punch Line
[Rami Atar and T.W. 2012]:
under the above
Our emphasis is on the results for the mismatched setting, relating the cost of mismatch to relative entropy in the Poisson channel. The results for the exact (i.e., non-mismatched) setting, relating the minimum mean loss to mutual information, and causal to non-causal minimum mean estimation loss, are shown to follow as special cases. The latter results, for the exact setting, are consistent and in fact coincide with those of [14] – which considered a more general Poisson channel model that accommodates the presence of dark current – when specialized to the case of zero dark current. Our framework complements the results of [14] not only in extending the scope to the presence
Monday, June 3, 2013
and I mean everything
• i-mmse
• Duncan
• causal - non-causal
• mismatch
• minimax
Monday, June 3, 2013
the universal picture
Monday, June 3, 2013
universal denoising
Monday, June 3, 2013
universal probability assignments:
X1, X2, X3, . . . , Xi−1, Xi, . . .
Y1, Y2, Y3, . . . , Yi−1, Yi, . . .

    I(Xi; Yi | Y^{i−1}),    I(Y^{i−1}; Xi | X^{i−1})

    C = lim_{n→∞} max (1/n) I(X^n → Y^n)

    I(X^n → Y^n) ≫ I(Y^{n−1} → X^n) ⇒ “X causes Y”
    I(X^n → Y^n) ≪ I(Y^{n−1} → X^n) ⇒ “Y causes X”
    I(X^n → Y^n) ≈ I(Y^{n−1} → X^n) ≫ 0 ⇒ “X and Y are causing each other”
    I(X^n; Y^n) ≈ 0 ⇒ X and Y are essentially independent

    I(X → Y) = lim_{n→∞} (1/n) I(X^n → Y^n)

Q is universal if

    lim_{n→∞} (1/n) D(P_{X^n} || Q_{X^n}) = 0

for every stationary P, and pointwise universal if

    lim sup_{n→∞} (1/n) log [ P_{X^n}(X^n) / Q_{X^n}(X^n) ] ≤ 0    P-a.s.

for every stationary and ergodic P.
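To make the definitions concrete, here is a minimal sketch of a sequential probability assignment that is universal (and pointwise universal) over binary memoryless sources: the standard Krichevsky–Trofimov add-1/2 estimator. This is a textbook example rather than anything specific to these notes; universality over all stationary sources requires richer schemes such as CTW or Lempel-Ziv mentioned on the next slide. The source parameter below is arbitrary.

```python
# Pointwise universality of the Krichevsky–Trofimov (KT) sequential assignment
# over an i.i.d. Bernoulli(p) source: (1/n) log[P(x^n)/Q(x^n)] shrinks toward 0,
# since -log2 Q(x^n) exceeds -log2 of the maximum likelihood by at most
# about (1/2) log2 n + 1 bits.
import numpy as np

def kt_log_prob(x):
    """log2 Q_KT(x^n), assigning Q(x_{i+1}=1 | x^i) = (#ones + 1/2) / (i + 1)."""
    logq, ones = 0.0, 0
    for i, b in enumerate(x):
        p1 = (ones + 0.5) / (i + 1.0)
        logq += np.log2(p1 if b == 1 else 1.0 - p1)
        ones += b
    return logq

rng = np.random.default_rng(0)
p = 0.2
for n in (100, 1000, 10000, 100000):
    x = (rng.random(n) < p).astype(int)
    k = int(x.sum())
    log_p_true = k * np.log2(p) + (n - k) * np.log2(1 - p)   # log2 P_{X^n}(x^n)
    print(n, (log_p_true - kt_log_prob(x)) / n)              # (1/n) log[P(x^n)/Q(x^n)] → 0
```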
Monday, June 3, 2013
universal compressors (e.g.: Lempel-Ziv 78, CTW)
X1, X2, X3, . . . , Xi−1, Xi, . . .
Y1, Y2, Y3, . . . , Yi−1, Yi, . . .

    H(X) = Σ_x P_X(x) log (1 / P_X(x))

• I(X;Y) = I(Y;X)
• I(f(X); g(Y)) = I(X;Y) if f and g are one-to-one
• chain rules

    I(X;Y),    I(Xi; Yi | Y^{i−1}),    I(Y^{i−1}; Xi | X^{i−1})

    C = lim_{n→∞} max (1/n) I(X^n → Y^n)

    I(X^n → Y^n) ≫ I(Y^{n−1} → X^n) ⇒ “X causes Y”
    I(X^n → Y^n) ≪ I(Y^{n−1} → X^n) ⇒ “Y causes X”
    I(X^n → Y^n) ≈ I(Y^{n−1} → X^n) ≫ 0 ⇒ “X and Y are causing each other”
    I(X^n; Y^n) ≈ 0 ⇒ X and Y are essentially independent

    I(X → Y) = lim_{n→∞} (1/n) I(X^n → Y^n)

Q is universal if

    lim_{n→∞} (1/n) D(P_{X^n} || Q_{X^n}) = 0

for every stationary P, and pointwise universal if

    lim sup_{n→∞} (1/n) log [ P_{X^n}(X^n) / Q_{X^n}(X^n) ] ≤ 0    P-a.s.

for every stationary and ergodic P.
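The causality heuristics above can be tried out exactly on a small toy joint law by brute-forcing the conditional mutual informations in the directed-information sums. The model below (X i.i.d. fair bits, Yi a noisy copy of Xi−1) is an arbitrary example of "X causes Y"; the horizon and noise level are likewise arbitrary.

```python
# Brute-force the directed information sums for a toy causal model:
# X_i i.i.d. Bernoulli(1/2), Y_1 an independent fair bit, Y_i = X_{i-1} XOR Z_i,
# Z_i i.i.d. Bernoulli(q). Forward directed information should be large,
# reverse (delayed) directed information should be ~ 0.
import itertools
from math import log2
from collections import defaultdict

n, q = 4, 0.1

P = {}
for x in itertools.product((0, 1), repeat=n):
    for y in itertools.product((0, 1), repeat=n):
        prob = 0.5 ** n * 0.5                       # X^n and the independent Y_1
        for i in range(1, n):
            prob *= (1 - q) if y[i] == x[i - 1] else q
        P[(x, y)] = prob

def marg(keyfun):
    m = defaultdict(float)
    for (x, y), pr in P.items():
        m[keyfun(x, y)] += pr
    return m

def cond_mi(i, forward=True):
    """I(X^i; Y_i | Y^{i-1}) if forward, else I(Y^{i-1}; X_i | X^{i-1}); i is 1-based."""
    if forward:
        a, b, c = (lambda x, y: x[:i]), (lambda x, y: (y[i - 1],)), (lambda x, y: y[:i - 1])
    else:
        a, b, c = (lambda x, y: y[:i - 1]), (lambda x, y: (x[i - 1],)), (lambda x, y: x[:i - 1])
    pabc = marg(lambda x, y: (a(x, y), b(x, y), c(x, y)))
    pac, pbc, pc = marg(lambda x, y: (a(x, y), c(x, y))), marg(lambda x, y: (b(x, y), c(x, y))), marg(c)
    return sum(p * log2(p * pc[k[2]] / (pac[(k[0], k[2])] * pbc[(k[1], k[2])]))
               for k, p in pabc.items() if p > 0)

di_forward = sum(cond_mi(i, True) for i in range(1, n + 1))
di_reverse = sum(cond_mi(i, False) for i in range(1, n + 1))
print(di_forward, di_reverse)   # ≈ 3·(1 − h(0.1)) ≈ 1.59 bits forward, ≈ 0 reverse
```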
universal probability assignment
univ. sequential prob. assignment
(much more in ee376c)
univ. prediction, filtering, denoising, lossy compression
Monday, June 3, 2013