
Information Theory

Cosma Shalizi

15 June 2010, Complex Systems Summer School

Outline

Entropy and Information: measuring randomness and dependence in bits
Entropy and Ergodicity: dynamical systems as information sources, long-run randomness
Information and Inference: the connection to statistics

Cover and Thomas (1991) is the best single book on information theory.

Entropy and Information

Entropy

The most fundamental notion in information theory.
X = a discrete random variable, taking values in the set 𝒳.
The entropy of X is

H[X] ≡ −∑_{x∈𝒳} Pr(X = x) log₂ Pr(X = x)
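A minimal Python sketch of this definition, using base-2 logarithms so the answer comes out in bits (the coin probabilities are made up):

    import math

    def entropy(p):
        """Shannon entropy, in bits, of a discrete distribution given as {value: probability}."""
        return -sum(px * math.log2(px) for px in p.values() if px > 0)

    print(entropy({"heads": 0.5, "tails": 0.5}))   # 1.0: a fair coin carries one bit
    print(entropy({"heads": 0.9, "tails": 0.1}))   # about 0.469: a biased coin carries less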


Proposition
H[X] ≥ 0, and H[X] = 0 only when Pr(X = x) = 1 for some x. (EXERCISE)

Proposition
H[X] is maximal when all values of X are equally probable, and then H[X] = log₂ #𝒳. (EXERCISE)

Proposition
H[f(X)] ≤ H[X], with equality if and only if f is one-to-one.
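A quick numerical check of these propositions on made-up distributions, including a non-invertible map f that lumps outcomes together:

    import math

    def entropy(p):
        """Shannon entropy, in bits, of a distribution given as {value: probability}."""
        return -sum(px * math.log2(px) for px in p.values() if px > 0)

    uniform = {k: 0.25 for k in "abcd"}
    skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
    print(entropy(uniform))   # 2.0 = log2(4), the maximum for four outcomes
    print(entropy(skewed))    # about 1.357, strictly less

    # A non-invertible f can only lose information: H[f(X)] <= H[X].
    f = {"a": 0, "b": 0, "c": 1, "d": 1}
    lumped = {}
    for x, px in skewed.items():
        lumped[f[x]] = lumped.get(f[x], 0) + px
    print(entropy(lumped))    # about 0.722 <= 1.357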


Interpretations

H[X] measures:
how random X is
how variable X is
how uncertain we should be about X (the "paleface" problem; a consistent resolution leads to a completely subjective probability theory)

but the more fundamental interpretation is description length.


Description Length

H[X] = how concisely can we describe X?
Imagine X as a text message:
  in Reno
  in Reno send money
  in Reno divorce final
  marry me?
  in Reno send lawyers guns and money kthxbai

Known and finite number of possible messages (#𝒳).
I know what X is but won't show it to you.
You can guess it by asking yes/no (binary) questions.


First goal: ask as few questions as possible
Starting with "is it y?" is optimal iff X = y
Can always achieve no worse than ≈ log₂ #𝒳 questions
New goal: minimize the mean number of questions
Ask about more probable messages first
Still takes ≈ −log₂ Pr(X = x) questions to reach x
Mean is then H[X]


Theorem
H[X] is the minimum mean number of binary distinctions needed to describe X.

Units of H[X] are bits.

[Diagram: source → X → binary-coded message of length H[X] → receiver]
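A small sketch of the guessing-game picture, using Huffman's construction (which attains the optimum exactly when the probabilities are powers of 1/2); the message probabilities below are made up so that the mean number of questions equals H[X]:

    import heapq

    def huffman_lengths(p):
        """Number of yes/no questions per outcome under an optimal binary code."""
        heap = [(px, i, {x: 0}) for i, (x, px) in enumerate(p.items())]
        heapq.heapify(heap)
        counter = len(heap)
        while len(heap) > 1:
            p1, _, d1 = heapq.heappop(heap)
            p2, _, d2 = heapq.heappop(heap)
            # Merging two subtrees pushes every message in them one question deeper.
            merged = {x: depth + 1 for x, depth in {**d1, **d2}.items()}
            heapq.heappush(heap, (p1 + p2, counter, merged))
            counter += 1
        return heap[0][2]

    messages = {"in Reno": 0.5, "in Reno send money": 0.25,
                "in Reno divorce final": 0.125, "marry me?": 0.125}
    lengths = huffman_lengths(messages)
    mean_len = sum(p * lengths[m] for m, p in messages.items())
    print(lengths)    # 1, 2, 3, 3 questions respectively, i.e. about -log2 Pr(X = x)
    print(mean_len)   # 1.75 bits, which is exactly H[X] here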


Multiple Variables — Joint Entropy

Joint entropy of two variables X and Y:

H[X,Y] ≡ −∑_{x∈𝒳} ∑_{y∈𝒴} Pr(X = x, Y = y) log₂ Pr(X = x, Y = y)

Entropy of the joint distribution.
This is the minimum mean length to describe both X and Y.

H[X,Y] ≥ H[X]
H[X,Y] ≥ H[Y]
H[X,Y] ≤ H[X] + H[Y]
H[f(X),X] = H[X]
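A sketch computing the joint entropy of a made-up joint distribution and checking the bounds listed above:

    import math

    def H(dist):
        """Entropy, in bits, of a distribution given as {outcome: probability}."""
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    # Made-up joint distribution over (X, Y) pairs.
    joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}

    # Marginals, obtained by summing out the other variable.
    pX, pY = {}, {}
    for (x, y), p in joint.items():
        pX[x] = pX.get(x, 0) + p
        pY[y] = pY.get(y, 0) + p

    print(H(joint))        # about 1.722
    print(H(pX), H(pY))    # 1.0 and 1.0
    # H[X,Y] is at least each marginal entropy and at most their sum:
    print(max(H(pX), H(pY)) <= H(joint) <= H(pX) + H(pY))   # True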


Conditional Entropy

Entropy of the conditional distribution:

H[X|Y = y] ≡ −∑_{x∈𝒳} Pr(X = x | Y = y) log₂ Pr(X = x | Y = y)

Average over y:

H[X|Y] ≡ ∑_{y∈𝒴} Pr(Y = y) H[X|Y = y]

On average, how many bits are needed to describe X, after Y is given?
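A sketch computing H[X|Y] directly from this definition for a made-up joint distribution; the answer also matches H[X,Y] − H[Y], the identity stated next:

    import math

    def H(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
    pY = {}
    for (x, y), p in joint.items():
        pY[y] = pY.get(y, 0) + p

    # H[X|Y]: average the entropy of the conditional distributions Pr(X = x | Y = y).
    HXgY = 0.0
    for y, py in pY.items():
        cond = {x: p / py for (x, yy), p in joint.items() if yy == y}
        HXgY += py * H(cond)

    print(HXgY)                  # about 0.722
    print(H(joint) - H(pY))      # the same number: H[X|Y] = H[X,Y] - H[Y]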


H[X|Y] = H[X,Y] − H[Y]

the "text completion" principle
Note: H[X|Y] ≠ H[Y|X], in general
Chain rule:

H[X_1^n] = H[X_1] + ∑_{t=1}^{n−1} H[X_{t+1} | X_1^t]

Describe one variable, then the 2nd given the 1st, the 3rd given the first two, etc.


Mutual Information

Mutual information between X and Y:

I[X;Y] ≡ H[X] + H[Y] − H[X,Y]

How much shorter is the actual joint description than the sum of the individual descriptions?
Equivalently:

I[X;Y] = H[X] − H[X|Y] = H[Y] − H[Y|X]

How much can I shorten my description of either variable by using the other?

0 ≤ I[X;Y] ≤ min{H[X], H[Y]}

I[X;Y] = 0 if and only if X and Y are statistically independent.
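A sketch computing I[X;Y] for one dependent and one independent made-up joint distribution:

    import math

    def H(dist):
        return -sum(p * math.log2(p) for p in dist.values() if p > 0)

    def mutual_information(joint):
        """I[X;Y] = H[X] + H[Y] - H[X,Y] for a joint distribution over (x, y) pairs."""
        pX, pY = {}, {}
        for (x, y), p in joint.items():
            pX[x] = pX.get(x, 0) + p
            pY[y] = pY.get(y, 0) + p
        return H(pX) + H(pY) - H(joint)

    dependent   = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}
    independent = {("a", 0): 0.25, ("a", 1): 0.25, ("b", 0): 0.25, ("b", 1): 0.25}
    print(mutual_information(dependent))    # about 0.278 bits
    print(mutual_information(independent))  # 0.0: independence means no shared description length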


[Diagram: source → encoder → channel (noise) → decoder → receiver; X = what is sent, Y = what is received]

How much can we learn about what was sent from what we receive? I[X;Y]


Historically, this is the origin of information theory: sending coded messages efficiently (Shannon, 1948).
Stephenson (1999) is a historical dramatization, with a silly late-1990s story tacked on.

Channel capacity: C = max I[X;Y] as we change the distribution of X.
Any rate of information transfer < C can be achieved with arbitrarily small error rate, no matter what the noise.
No rate > C can be achieved without error.
C is also related to the value of information in gambling (Poundstone, 2005).
This is not the only model of communication! (Sperber and Wilson, 1995, 1990)
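As a concrete textbook case, the binary symmetric channel with flip probability f has capacity 1 − H(f); a sketch that recovers this by a crude search over input distributions (the flip probability is made up):

    import math

    def H2(p):
        """Binary entropy in bits."""
        return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def bsc_mutual_info(p_input, f):
        """I[X;Y] for a binary symmetric channel: Pr(X=1) = p_input, flip probability f."""
        p_out = p_input * (1 - f) + (1 - p_input) * f   # Pr(Y = 1)
        return H2(p_out) - H2(f)                        # H[Y] - H[Y|X]

    f = 0.1
    # Crude grid search over input distributions; the maximum is the channel capacity.
    capacity = max(bsc_mutual_info(p / 1000, f) for p in range(1001))
    print(capacity)      # about 0.531, attained at the uniform input
    print(1 - H2(f))     # the known closed form 1 - H(f), the same value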


Conditional Mutual Information

I[X;Y|Z] = H[X|Z] + H[Y|Z] − H[X,Y|Z]

How much extra information do X and Y give about each other, over and above what's in Z?
X ⊥ Y | Z if and only if I[X;Y|Z] = 0
The Markov property is completely equivalent to

I[X_{t+1}^∞ ; X_{−∞}^{t−1} | X_t] = 0

The Markov property is really about information flow.


What About Continuous Variables?

Differential entropy:

H(X) ≡ −∫ dx p(x) log₂ p(x)

where p has to be the probability density function.

H(X) < 0 is entirely possible.
Differential entropy varies under 1-1 maps (e.g. coordinate changes).
Joint and conditional entropy definitions carry over.
The mutual information definition carries over.
MI is non-negative and invariant under 1-1 maps.
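A sketch for a Gaussian density, comparing direct numerical integration with the standard closed form (1/2) log₂(2πeσ²), and showing the value going negative for small σ:

    import math

    def gaussian_diff_entropy_numeric(sigma, dx=1e-3, width=10):
        """Differential entropy (bits) of N(0, sigma^2) by a simple Riemann sum."""
        total = 0.0
        x = -width * sigma
        while x <= width * sigma:
            p = math.exp(-x * x / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
            if p > 0:
                total -= p * math.log2(p) * dx
            x += dx
        return total

    for sigma in (1.0, 0.1):
        print(gaussian_diff_entropy_numeric(sigma),
              0.5 * math.log2(2 * math.pi * math.e * sigma ** 2))
    # sigma = 1.0: about 2.047 bits either way
    # sigma = 0.1: about -1.27 bits -- differential entropy can be negative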


Relative Entropy

P, Q = two distributions on the same space 𝒳:

D(P‖Q) ≡ ∑_{x∈𝒳} P(x) log₂ [P(x)/Q(x)]

Or, if X is continuous,

D(P‖Q) ≡ ∫_𝒳 dx p(x) log₂ [p(x)/q(x)]

Or, if you like measure theory,

D(P‖Q) ≡ ∫ dP(ω) log₂ (dP/dQ)(ω)

a.k.a. the Kullback-Leibler divergence.
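A minimal sketch of the discrete definition on made-up P and Q, returning ∞ when Q misses support that P has:

    import math

    def kl_divergence(P, Q):
        """D(P||Q) in bits for discrete distributions given as {outcome: probability}."""
        d = 0.0
        for x, px in P.items():
            if px > 0:
                qx = Q.get(x, 0.0)
                if qx == 0.0:
                    return math.inf        # P not dominated by Q
                d += px * math.log2(px / qx)
        return d

    P = {"a": 0.5, "b": 0.4, "c": 0.1}
    Q = {"a": 0.4, "b": 0.4, "c": 0.2}
    print(kl_divergence(P, Q))   # about 0.061
    print(kl_divergence(Q, P))   # about 0.071: not symmetric
    print(kl_divergence(P, P))   # 0.0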


Relative Entropy Properties

D(P‖Q) ≥ 0, with equality if and only if P = Q
D(P‖Q) ≠ D(Q‖P), in general
D(P‖Q) = ∞ if Q gives probability zero to something with positive P-probability (P not dominated by Q)
Invariant under 1-1 maps


Joint and Conditional Relative Entropies

P, Q now distributions on 𝒳 × 𝒴:

D(P‖Q) = D(P(X)‖Q(X)) + D(P(Y|X)‖Q(Y|X))

where

D(P(Y|X)‖Q(Y|X)) = ∑_x P(x) D(P(Y|X = x)‖Q(Y|X = x))
                 = ∑_x P(x) ∑_y P(y|x) log₂ [P(y|x)/Q(y|x)]

and so on for more than two variables.


Relative entropy can be the basic concept:

H[X] = log₂ m − D(P‖U)

where m = #𝒳, U = the uniform distribution on 𝒳, P = the distribution of X.

I[X;Y] = D(J‖P ⊗ Q)

where P = the distribution of X, Q = the distribution of Y, J = the joint distribution.


Relative Entropy and Miscoding

Suppose the real distribution is P, but we think it's Q and use that for coding.
Our average code length (the cross-entropy) is

−∑_x P(x) log₂ Q(x)

But the optimum code length is

−∑_x P(x) log₂ P(x)

The difference is the relative entropy.
Relative entropy is the extra description length from getting the distribution wrong.
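A sketch checking that the miscoding penalty, cross-entropy minus entropy, is exactly D(P‖Q), for the same made-up P and Q as in the earlier sketch:

    import math

    def cross_entropy(P, Q):
        """Mean code length (bits) when data come from P but the code is built for Q."""
        return -sum(px * math.log2(Q[x]) for x, px in P.items() if px > 0)

    P = {"a": 0.5, "b": 0.4, "c": 0.1}
    Q = {"a": 0.4, "b": 0.4, "c": 0.2}

    H_P = cross_entropy(P, P)            # entropy: the optimum code length
    print(cross_entropy(P, Q))           # about 1.422 bits per symbol
    print(H_P)                           # about 1.361 bits per symbol
    print(cross_entropy(P, Q) - H_P)     # about 0.061 = D(P||Q), the price of miscoding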


Basics: Summary

Entropy = minimum mean description length; variability of the random quantity
Mutual information = reduction in description length from using dependencies
Relative entropy = excess description length from guessing the wrong distribution

Entropy and Ergodicity

Information Sources

X_1, X_2, ..., X_n, ... a sequence of random variables
X_s^t = (X_s, X_{s+1}, ..., X_{t−1}, X_t)
Any sort of random process will do:
a sequence of messages
successive outputs of a stochastic system
It need not be from a communication channel:
e.g., successive states of a dynamical system
or coarse-grained observations of the dynamics


Definition (Strict or Strong Stationarity)
For any k > 0, any T > 0, and all w ∈ 𝒳^k,

Pr(X_1^k = w) = Pr(X_{1+T}^{k+T} = w)

i.e., the distribution is invariant over time.


Law of large numbers for stationary sequences

Theorem (Ergodic Theorem)
If X is stationary, then the empirical distribution converges,

P̂_n → ρ

for some limit ρ, and for all nice functions f,

(1/n) ∑_{t=1}^n f(X_t) → E_ρ[f(X)]

but ρ may be random and depend on the initial conditions (one ρ per attractor).


Entropy Rate

The entropy rate, a.k.a. Shannon entropy rate, a.k.a. metric entropy rate:

h_1 ≡ lim_{n→∞} H[X_n | X_1^{n−1}]

How many extra bits do we need to describe the next observation (in the limit)?

Theorem
h_1 exists for any stationary process (and some others).


Examples of entropy rates:

IID: H[X_n | X_1^{n−1}] = H[X_1] = h_1
Markov: H[X_n | X_1^{n−1}] = H[X_n | X_{n−1}] = H[X_2 | X_1] = h_1
k-th-order Markov: h_1 = H[X_{k+1} | X_1^k]
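A sketch for a two-state Markov chain with a made-up transition matrix; under stationarity, h_1 = H[X_2 | X_1] works out to the stationary-weighted entropy of the transition rows:

    import math

    def H(row):
        return -sum(p * math.log2(p) for p in row if p > 0)

    # Made-up transition matrix: state 0 is "sticky", state 1 less so.
    T = [[0.9, 0.1],
         [0.4, 0.6]]

    # Stationary distribution pi solves pi = pi T; for two states it has a closed form.
    pi0 = T[1][0] / (T[0][1] + T[1][0])
    pi = [pi0, 1 - pi0]

    # Entropy rate h1 = H[X_n | X_{n-1}] = sum_i pi_i * H(row i).
    h1 = sum(pi[i] * H(T[i]) for i in range(2))
    print(pi)    # [0.8, 0.2]
    print(h1)    # about 0.569 bits/step; an IID source with the same marginal gives about 0.722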


Using the chain rule, we can re-write h_1 as

h_1 = lim_{n→∞} (1/n) H[X_1^n]

description length per unit time.


Topological Entropy Rate

W_n ≡ number of allowed words of length n
    ≡ #{w ∈ 𝒳^n : Pr(X_1^n = w) > 0}

log₂ W_n ≡ topological entropy
Topological entropy rate:

h_0 = lim_{n→∞} (1/n) log₂ W_n

H[X_1^n] = log₂ W_n if and only if each allowed word is equally probable; otherwise H[X_1^n] < log₂ W_n.


Metric vs. Topological Entropy Rates

h_0 = growth rate in the number of allowed words, counting all equally
h_1 = growth rate counting more probable words more heavily (an effective number of words)
So:

h_0 ≥ h_1

2^{h_1} = the effective number of choices of how to go on


KS Entropy Rate

h_1 = growth rate of the mean description length of trajectories
Chaos needs h_1 > 0
Coarse-graining deterministic dynamics, each partition B has its own h_1(B)
The Kolmogorov-Sinai (KS) entropy rate:

h_KS = sup_B h_1(B)

Theorem
If G is a generating partition, then h_KS = h_1(G)

h_KS is the asymptotic randomness of the dynamical system, or the rate at which the symbol sequence provides new information about the initial condition.


Entropy Rate and Lyapunov Exponents

In general (Ruelle's inequality), h_KS is bounded by the sum of the positive Lyapunov exponents:

h_KS ≤ ∑_{i=1}^d λ_i 1{λ_i > 0}

If the invariant measure is smooth, this is an equality (Pesin's identity).


Asymptotic Equipartition Property

When n is large, for any word x_1^n, either

Pr(X_1^n = x_1^n) ≈ 2^{-n h_1}

or

Pr(X_1^n = x_1^n) ≈ 0

More exactly, it is almost certain that

−(1/n) log Pr(X_1^n) → h_1

This is the entropy ergodic theorem, or Shannon-McMillan-Breiman theorem.
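A simulation sketch for a biased IID binary source (bias made up): −(1/n) log₂ Pr(X_1^n) settles down near h_1 = H[X_1] as n grows:

    import math, random

    random.seed(0)
    p = 0.8                                                  # Pr(X_t = 1) for the IID source
    h1 = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))    # = H[X_1] for an IID source

    for n in (10, 100, 1000, 10000):
        word = [1 if random.random() < p else 0 for _ in range(n)]
        log_prob = sum(math.log2(p) if x == 1 else math.log2(1 - p) for x in word)
        print(n, -log_prob / n)    # approaches h1
    print("h1 =", h1)              # about 0.722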


Relative entropy version:

−(1/n) log Q_θ(X_1^n) → h_1 + d(P‖Q_θ)

where

d(P‖Q_θ) = lim_{n→∞} (1/n) D(P(X_1^n)‖Q_θ(X_1^n))

The relative-entropy AEP implies the entropy AEP.


Entropy and Ergodicity: Summary

h_1 is the growth rate of the entropy, or the number of choices made in continuing the trajectory
It measures instability in dynamical systems
Typical sequences have probabilities shrinking at the entropy rate

Relative Entropy and Statistics

Relative Entropy and Sampling; Large Deviations

X_1, X_2, ..., X_n all IID with distribution P
Empirical distribution ≡ P̂_n
Law of large numbers (LLN): P̂_n → P

Theorem (Sanov)

−(1/n) log₂ Pr(P̂_n ∈ A) → min_{Q∈A} D(Q‖P)

or, for non-mathematicians,

Pr(P̂_n ≈ Q) ≈ 2^{-n D(Q‖P)}
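A sketch of the "for non-mathematicians" version for a fair coin, taking A = {Q : Q(heads) ≥ 0.7}, where the minimizing Q is Bernoulli(0.7); the exact tail probabilities have a decay exponent approaching D(Q‖P):

    import math

    def D_bernoulli(q, p):
        """D(Q||P) in bits between Bernoulli(q) and Bernoulli(p)."""
        return q * math.log2(q / p) + (1 - q) * math.log2((1 - q) / (1 - p))

    p, q = 0.5, 0.7           # fair coin; A = {distributions putting >= 70% on heads}
    for n in (50, 200, 1000):
        k_min = round(q * n)  # smallest head-count whose empirical frequency is >= q
        tail = sum(math.comb(n, k) for k in range(k_min, n + 1)) / 2 ** n
        print(n, -math.log2(tail) / n, D_bernoulli(q, p))
    # The exponent -(1/n) log2 Pr(Phat_n in A) creeps down toward D(Q||P), about 0.119 bits.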


Sanov’s theorem is part of the general theory of largedeviations:

Pr(fluctuations away from law of large numbers)→ 0exponentially in n

rate functon generally a relative entropy

More on large devations: Bucklew (1990); den Hollander (2000)LDP explains statistical mechanics; see Touchette (2008), ortalk to Eric Smith


Relative Entropy and Hypothesis Testing

Testing P vs. Q.
The optimal error rate (the chance of guessing Q when really P) goes like

Pr(error) ≈ 2^{-n D(Q‖P)}

(For dependent data, substitute the sum of conditional relative entropies for nD.)

More exact statement:

(1/n) log₂ Pr(error) → −D(Q‖P)

(For dependent data, substitute the conditional relative entropy rate for D.)

The bigger D(Q‖P) is, the easier it is to test which is right.


Method of Maximum Likelihood

Fisher (1922)
Data = X, with true distribution = P
Model distributions = Q_θ, θ = parameter
Look for the Q_θ which best describes the data.
The likelihood at θ is the probability of generating the data, Q_θ(x).
Estimate θ by maximizing the likelihood, or equivalently the log-likelihood L(θ) ≡ log Q_θ(x):

θ̂ ≡ argmax_θ L(θ) = argmax_θ ∑_{t=1}^n log Q_θ(x_t | x_1^{t−1})
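A sketch of the recipe for an IID Bernoulli model with a made-up true parameter, maximizing the log-likelihood over a crude grid:

    import math, random

    random.seed(2)
    true_p = 0.3                                   # made-up "true" parameter
    data = [1 if random.random() < true_p else 0 for _ in range(1000)]

    def log_likelihood(theta, xs):
        """Log-likelihood (base 2) of IID Bernoulli(theta) for binary data xs."""
        return sum(math.log2(theta) if x else math.log2(1 - theta) for x in xs)

    grid = [i / 500 for i in range(1, 500)]        # crude grid over (0, 1)
    theta_hat = max(grid, key=lambda th: log_likelihood(th, data))
    print(theta_hat)                # close to 0.3
    print(sum(data) / len(data))    # the empirical frequency, the exact Bernoulli MLE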


Maximum likelihood and relative entropy

Suppose we want the Q_θ which will best describe new data.
The optimal parameter value is

θ* = argmin_θ D(P‖Q_θ)

If P = Q_{θ_0} for some θ_0, then θ* = θ_0 (the true parameter value).
Otherwise θ* is the pseudo-true parameter value.


θ* = argmin_θ ∑_x P(x) log₂ [P(x)/Q_θ(x)]
   = argmin_θ ∑_x [P(x) log₂ P(x) − P(x) log₂ Q_θ(x)]
   = argmin_θ [−H_P[X] − ∑_x P(x) log₂ Q_θ(x)]
   = argmin_θ −∑_x P(x) log₂ Q_θ(x)
   = argmax_θ ∑_x P(x) log₂ Q_θ(x)

This is the expected log-likelihood.


We don’t know P but we do have P̂nFor IID case

θ̂ = argmaxθ

n∑t=1

log Qθ(xt )

= argmaxθ

1n

n∑t=1

log2 Qθ(xt )

= argmaxθ

∑x

P̂n(x) log2 Qθ(x)

So θ̂ comes from approximating P by P̂nθ̂ → θ∗ because P̂n → PNon-IID case (e.g. Markov) similar, more notation


Relative Entropy and Log-Likelihood

In general:

−H[X] − D(P‖Q) = the expected log-likelihood of Q
−H[X] = the optimal expected log-likelihood (ideal model)


Why Maximum Likelihood?

1. The inherent compelling rightness of the optimization principle (a bad answer)
2. Generally consistent: θ̂ converges on the optimal value (as we just saw)
3. Generally efficient: converges faster than other consistent estimators

(2) and (3) are really theorems of probability theory.
Let's look a bit more at efficiency.


Fisher Information

Fisher: Taylor-expand L to second order around its maximum.
The Fisher information matrix:

F_{uv}(θ_0) ≡ −E_{θ_0}[ ∂² log Q_θ(X) / ∂θ_u ∂θ_v ]|_{θ=θ_0}

F ∝ n (for IID, Markov, etc.)
Variance of θ̂ = F^{-1} (under some regularity conditions)
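A sketch for the Bernoulli model, where the per-observation Fisher information (with natural logs) is 1/(θ(1−θ)); the variance of the MLE across simulated data sets matches 1/(nF), the Cramér-Rao bound stated below:

    import random

    random.seed(3)
    theta, n, reps = 0.3, 500, 2000    # made-up parameter, sample size, replications

    # Per-observation Fisher information for Bernoulli(theta), with natural logs:
    # F = E[-d^2/dtheta^2 log q_theta(X)] = 1 / (theta * (1 - theta))
    F = 1 / (theta * (1 - theta))

    # Simulate many data sets; the MLE for each is just the sample mean.
    estimates = [sum(random.random() < theta for _ in range(n)) / n for _ in range(reps)]
    mean_est = sum(estimates) / reps
    var_est = sum((e - mean_est) ** 2 for e in estimates) / reps

    print(var_est)        # roughly 4.2e-4
    print(1 / (n * F))    # 0.00042: the Cramer-Rao floor, which the Bernoulli MLE attains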


The Information Bound

Theorem (Cramér-Rao)

F⁻¹ is the minimum variance for any unbiased estimator

because uncertainty in θ̂ depends on the curvature at the maximum
This leads to a whole information geometry, with F as the metric tensor (Amari et al., 1987; Kass and Vos, 1997; Kulhavý, 1996; Amari and Nagaoka, 1993/2000)


Relative Entropy and Fisher Information

F_uv(θ0) ≡ −E_θ0[ ∂² log Q_θ(X) / ∂θ_u ∂θ_v ]|_θ=θ0 = ∂²/∂θ_u∂θ_v D(Q_θ0 ‖ Q_θ)|_θ=θ0

Fisher information is how quickly the relative entropy grows with small changes in the parameters:

D(θ0 ‖ θ0 + ε) ≈ (1/2) ε^T F ε + O(‖ε‖³)

Intuition: “easy to estimate” = “easy to reject sub-optimal values”
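
A quick sanity check of this approximation (not from the slides): for a Bernoulli(p) family the Fisher information of one trial is F(p) = 1/(p(1−p)), so D(p0 ‖ p0+ε) should be close to ε²/(2 p0 (1−p0)) for small ε; the base value p0 = 0.3 and the step sizes are arbitrary.

import numpy as np

def kl_bernoulli(p, q):
    """D(Ber(p) || Ber(q)) in nats."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

p0 = 0.3
fisher = 1.0 / (p0 * (1 - p0))        # Fisher information of a single Bernoulli trial

for eps in (0.1, 0.01, 0.001):
    exact = kl_bernoulli(p0, p0 + eps)
    quad = 0.5 * fisher * eps**2      # the quadratic approximation through F
    print(eps, exact, quad)

The relative gap between the two columns shrinks roughly linearly in ε, which is the O(‖ε‖³) remainder at work.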


Maximum Entropy: A Dead End

Given constraints on the expectation values of functions, E[g1(X)] = c1, E[g2(X)] = c2, ..., E[gq(X)] = cq,

P̃_ME ≡ argmax_P { H[P] : ∀i, E_P[g_i(X)] = c_i }
     = argmax_P H[P] − ∑_{i=1}^q λ_i (E_P[g_i(X)] − c_i)

with Lagrange multipliers λ_i chosen to enforce the constraints
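
A minimal numerical sketch of this constrained maximization (not taken from the slides): maximize H[P] over distributions on {1, ..., 6} subject to the single, arbitrarily chosen constraint E[X] = 4.5, letting scipy handle the Lagrangian bookkeeping.

import numpy as np
from scipy.optimize import minimize

x = np.arange(1, 7)                      # illustrative support: faces of a die

def neg_entropy(p):
    return np.sum(p * np.log(p))         # minimize -H[P]

cons = ({"type": "eq", "fun": lambda p: p.sum() - 1.0},   # normalization
        {"type": "eq", "fun": lambda p: p @ x - 4.5})     # E[g(X)] = c

res = minimize(neg_entropy, np.full(6, 1.0 / 6), constraints=cons,
               bounds=[(1e-9, 1.0)] * 6, method="SLSQP")
print(np.round(res.x, 4), res.x @ x)

The resulting probabilities increase roughly geometrically in x, which is the exponential-family form derived on the next slide.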


Solution: Exponential Families

Generic solution:

P(x) = e^{−∑_{i=1}^q β_i g_i(x)} / ∫ dx e^{−∑_{i=1}^q β_i g_i(x)} = e^{−∑_{i=1}^q β_i g_i(x)} / Z(β1, β2, ..., βq)

again, the β enforce the constraints
Physics: canonical ensemble with extensive variables g_i and intensive variables β_i
Statistics: exponential family with sufficient statistics g_i and natural parameters β_i
If we take this family of distributions as basic, the MLE is the β such that E[g_i(X)] = ḡ_i(x), i.e., mean = observed
Best discussion of the connection is still Mandelbrot (1962)
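
The same toy constraint solved through this exponential-family form (again an illustration, not from the slides): find the natural parameter β so that the tilted distribution matches E[X] = 4.5, with Z(β) appearing only as the normalizing sum.

import numpy as np
from scipy.optimize import brentq

x = np.arange(1, 7)
target = 4.5

def constraint_gap(beta):
    w = np.exp(-beta * x)                # unnormalized weights, matching the sign convention above
    return (w @ x) / w.sum() - target

beta = brentq(constraint_gap, -5.0, 5.0) # natural parameter that enforces the constraint
p = np.exp(-beta * x)
p /= p.sum()                             # dividing by Z(beta)
print(beta, np.round(p, 4), p @ x)

Up to numerical tolerance this reproduces the distribution found by the direct entropy maximization above.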


The Method of Maximum Entropy

Calculate sample statistics ḡ_i(x)
Assume that the distribution of X is the one which maximizes the entropy under those constraints
i.e., the MLE in the exponential family with those sufficient statistics
Refinement: minimum relative entropy, i.e., minimize the divergence from a reference distribution; this also leads to an exponential family, but with a prefactor of the base density
Update distributions under new data by minimizing relative entropy
Often said to be the “least biased” estimate of P, or the one which makes the “fewest assumptions”
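
A sketch of the minimum relative entropy refinement (not from the slides): minimize D(P‖P_ref) on {1, ..., 6} under the same arbitrary mean constraint, with a made-up non-uniform reference distribution, and check that the base density shows up as a prefactor.

import numpy as np
from scipy.optimize import minimize

x = np.arange(1, 7)
ref = np.array([0.3, 0.2, 0.2, 0.1, 0.1, 0.1])   # hypothetical reference distribution

def kl_to_ref(p):
    return np.sum(p * np.log(p / ref))           # D(P || P_ref)

cons = ({"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ x - 4.5})
res = minimize(kl_to_ref, np.full(6, 1.0 / 6), constraints=cons,
               bounds=[(1e-9, 1.0)] * 6, method="SLSQP")

p = res.x
print(np.round(p, 4), p @ x)
print(np.round(p / ref, 4))   # the ratio p/ref grows roughly geometrically in x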


About MaxEnt

MaxEnt has lots of devotees who basically think it's the answer to everything
And it sometimes works, because

1. Exponential families are often decent approximations, and MLE is cool, but not everything is an exponential family
2. Conditional large deviations principle (Csiszár, 1995): if P̂ is constrained to lie in a convex set A, then

−(1/n) log Pr(P̂ ∈ B | P̂ ∈ A) → inf_{Q ∈ B∩A} D(Q‖P) − inf_{Q ∈ A} D(Q‖P)

so P̂ is exponentially close to argmin_{Q∈A} D(Q‖P), but the conditional LDP doesn't always hold
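
An illustrative simulation of the Gibbs-conditioning reading of this result (not in the slides): condition fair-die samples on their empirical mean lying in the convex set A = {mean ≥ 4}, and compare the conditional distribution of a single roll with the I-projection of the uniform law onto A, i.e. the exponential tilt whose mean sits exactly at the cutoff. Sample size, cutoff, and replicate count are arbitrary.

import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
x = np.arange(1, 7)
n, reps, cutoff = 40, 100_000, 4.0

rolls = rng.integers(1, 7, size=(reps, n))         # i.i.d. fair-die samples, P = uniform
kept = rolls[rolls.mean(axis=1) >= cutoff]         # condition on the constraint set A

cond = np.array([(kept == k).mean() for k in x])   # conditional per-roll distribution

# I-projection of the uniform law onto A: the tilt with mean exactly at the cutoff
beta = brentq(lambda b: (np.exp(b * x) @ x) / np.exp(b * x).sum() - cutoff, -5.0, 5.0)
tilt = np.exp(beta * x)
tilt /= tilt.sum()

print("conditional:  ", np.round(cond, 3))
print("I-projection: ", np.round(tilt, 3))

At this n the two rows should already agree to within roughly a couple of percentage points, and the agreement tightens as n grows.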


Updating by minimizing relative entropy can disagree with Bayes's rule (Seidenfeld, 1979, 1987; Grünwald and Halpern, 2003), contra claims by physicists
The “constraint rule” is certainly not required by logic or probability (Seidenfeld, 1979, 1987; Uffink, 1995, 1996)
MaxEnt (or MinRelEnt) is not the best rule for coming up with a prior distribution to use with Bayesian updating; all such rules suck (Kass and Wasserman, 1996)


Minimum Description Length Inference

Rissanen (1978, 1989)

Choose a model to concisely describe the data
maximum likelihood minimizes the description length of the data
... but you need to describe the model as well!
Two-part MDL:

D2(x, θ, Θ) = −log2 Qθ(x) + C(θ, Θ)

θ̂_MDL = argmin_{θ∈Θ} D2(x, θ, Θ)

D2(x, Θ) = D2(x, θ̂_MDL, Θ)

where C is a coding scheme for the parameters
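
A toy sketch of two-part MDL for a Bernoulli sample (not from the slides): the hypothetical coding scheme C quantizes θ to a grid of m values and charges log2 m bits for naming a grid point, on top of the −log2 Qθ(x) bits for the data; all the specific numbers are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
x = rng.random(200) < 0.9                # binary data from a Bernoulli(0.9) source
k, n = int(x.sum()), x.size

def two_part_length(m):
    grid = (np.arange(m) + 0.5) / m      # m candidate values of theta
    data_bits = -(k * np.log2(grid) + (n - k) * np.log2(1 - grid))
    return data_bits.min() + np.log2(m)  # best theta on this grid plus C(theta)

for m in (2, 4, 16, 256, 4096):
    print(m, round(two_part_length(m), 1))

A very coarse grid pays in poor fit, a very fine grid pays in parameter bits; the m that minimizes the total is the MDL trade-off in miniature.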


Must fix the coding scheme before seeing the data (EXERCISE: why?)
By the AEP,

n⁻¹ D2 → h1 + min_{θ∈Θ} d(P‖Qθ)

still, for finite n the coding scheme matters
(One-part MDL exists but would take too long)
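
A rough numerical illustration of this limit (not from the slides), reusing the hypothetical quantized-Bernoulli coding scheme from the sketch above: for an i.i.d. Bernoulli(0.8) source, h1 is the binary entropy of 0.8, and since the grid contains a near-optimal θ the divergence term is essentially zero.

import numpy as np

rng = np.random.default_rng(3)
p_true, m = 0.8, 256
grid = (np.arange(m) + 0.5) / m          # fixed parameter grid, chosen before the data

def per_symbol_d2(n):
    x = rng.random(n) < p_true
    k = int(x.sum())
    data_bits = -(k * np.log2(grid) + (n - k) * np.log2(1 - grid))
    return (data_bits.min() + np.log2(m)) / n

h1 = -(p_true * np.log2(p_true) + (1 - p_true) * np.log2(1 - p_true))
print("entropy rate h1:", round(h1, 4))
for n in (100, 1000, 10_000, 100_000):
    print(n, round(per_symbol_d2(n), 4))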


Why Use MDL?

1. The inherent compelling rightness of the optimization principle
2. Good properties: for reasonable sources, if the parametric complexity

COMP(Θ) = log ∑_{w ∈ X^n} max_{θ∈Θ} Qθ(w)

is small, i.e., if there aren't all that many words which get high likelihoods, then if MDL did well in-sample, it will generalize well to new data from the same source (a small numerical sketch follows below)

See Grünwald (2005, 2007) for much more
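
To make parametric complexity concrete, here is a small sketch (not from the slides) computing COMP(Θ) exactly for the Bernoulli model class at modest sample sizes, using the fact that all C(n, k) binary strings with k ones share the same maximized likelihood (k/n)^k (1−k/n)^(n−k).

import numpy as np
from math import comb

def parametric_complexity(n):
    """log of the sum, over all length-n binary strings, of the maximized likelihood."""
    total = 0.0
    for k in range(n + 1):
        theta = k / n                    # ML estimate for a string with k ones
        p = theta**k * (1 - theta)**(n - k) if 0 < k < n else 1.0
        total += comb(n, k) * p          # C(n, k) strings share this value
    return np.log(total)

for n in (10, 100, 1000):
    print(n, round(parametric_complexity(n), 3))

The value grows like (1/2) log n, so the per-symbol overhead vanishes, which is the sense in which this model class is “small”.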


Information and Statistics: Summary

Relative entropy controls large deviations
Relative entropy = ease of discriminating distributions
Easy discrimination ⇒ good estimation
Large deviations explain why MaxEnt works, when it does

References

Amari, Shun-ichi, O. E. Barndorff-Nielsen, Robert E. Kass, Steffen L. Lauritzen and C. R. Rao (1987). Differential Geometry in Statistical Inference, vol. 10 of Institute of Mathematical Statistics Lecture Notes-Monograph Series. Hayward, California: Institute of Mathematical Statistics. URL http://projecteuclid.org/euclid.lnms/1215467056.

Amari, Shun-ichi and Hiroshi Nagaoka (1993/2000). Methods of Information Geometry. Providence, Rhode Island: American Mathematical Society. Translated by Daishi Harada; first published as Joho Kika no Hoho, Tokyo: Iwanami Shoten Publishers.

Bucklew, James A. (1990). Large Deviation Techniques in Decision, Simulation, and Estimation. New York: Wiley-Interscience.

Cover, Thomas M. and Joy A. Thomas (1991). Elements of Information Theory. New York: Wiley.

Csiszár, Imre (1995). “Maxent, Mathematics, and Information Theory.” In Maximum Entropy and Bayesian Methods: Proceedings of the Fifteenth International Workshop on Maximum Entropy and Bayesian Methods (Kenneth M. Hanson and Richard N. Silver, eds.), pp. 35–50. Dordrecht: Kluwer Academic.

den Hollander, Frank (2000). Large Deviations. Providence, Rhode Island: American Mathematical Society.

Fisher, R. A. (1922). “On the Mathematical Foundations of Theoretical Statistics.” Philosophical Transactions of the Royal Society A, 222: 309–368. URL http://digital.library.adelaide.edu.au/coll/special/fisher/stat_math.html.

Grünwald, Peter (2005). “A Tutorial Introduction to the Minimum Description Length Principle.” In Advances in Minimum Description Length: Theory and Applications (P. Grünwald, I. J. Myung and M. Pitt, eds.). Cambridge, Massachusetts: MIT Press. URL http://arxiv.org/abs/math.ST/0406077.

Grünwald, Peter D. (2007). The Minimum Description Length Principle. Cambridge, Massachusetts: MIT Press.

Grünwald, Peter D. and Joseph Y. Halpern (2003). “Updating Probabilities.” Journal of Artificial Intelligence Research, 19: 243–278. doi:10.1613/jair.1164.

Kass, Robert E. and Paul W. Vos (1997). Geometrical Foundations of Asymptotic Inference. New York: Wiley.

Kass, Robert E. and Larry Wasserman (1996). “The Selection of Prior Distributions by Formal Rules.” Journal of the American Statistical Association, 91: 1343–1370. URL http://www.stat.cmu.edu/~kass/papers/rules.pdf.

Kulhavý, Rudolf (1996). Recursive Nonlinear Estimation: A Geometric Approach, vol. 216 of Lecture Notes in Control and Information Sciences. Berlin: Springer-Verlag.

Mandelbrot, Benoit (1962). “The Role of Sufficiency and of Estimation in Thermodynamics.” Annals of Mathematical Statistics, 33: 1021–1038. URL http://projecteuclid.org/euclid.aoms/1177704470.

Poundstone, William (2005). Fortune’s Formula: The Untold Story of the Scientific Betting Systems That Beat the Casinos and Wall Street. New York: Hill and Wang.

Rissanen, Jorma (1978). “Modeling by Shortest Data Description.” Automatica, 14: 465–471.

— (1989). Stochastic Complexity in Statistical Inquiry. Singapore: World Scientific.

Seidenfeld, Teddy (1979). “Why I Am Not an Objective Bayesian: Some Reflections Prompted by Rosenkrantz.” Theory and Decision, 11: 413–440. URL http://www.hss.cmu.edu/philosophy/seidenfeld/relating%20to%20other%20probability%20and%20statistical%20issues/Why%20I%20Am%20Not%20an%20Objective%20B.pdf.

— (1987). “Entropy and Uncertainty.” In Foundations of Statistical Inference (I. B. MacNeill and G. J. Umphrey, eds.), pp. 259–287. Dordrecht: D. Reidel. URL http://www.hss.cmu.edu/philosophy/seidenfeld/relating%20to%20other%20probability%20and%20statistical%20issues/Entropy%20and%20Uncertainty%20(revised).pdf.

Shannon, Claude E. (1948). “A Mathematical Theory of Communication.” Bell System Technical Journal, 27: 379–423. URL http://cm.bell-labs.com/cm/ms/what/shannonday/paper.html. Reprinted in Shannon and Weaver (1963).

Shannon, Claude E. and Warren Weaver (1963). The Mathematical Theory of Communication. Urbana, Illinois: University of Illinois Press.

Sperber, Dan and Deirdre Wilson (1990). “Rhetoric and Relevance.” In The Ends of Rhetoric: History, Theory, Practice (David Wellbery and John Bender, eds.), pp. 140–155. Stanford: Stanford University Press. URL http://dan.sperber.com/rhetoric.htm.

— (1995). Relevance: Cognition and Communication. Oxford: Basil Blackwell, 2nd edn.

Stephenson, Neal (1999). Cryptonomicon. New York: Avon Books.

Touchette, Hugo (2008). “The Large Deviations Approach to Statistical Mechanics.” E-print, arxiv.org. URL http://arxiv.org/abs/0804.0327.

Uffink, Jos (1995). “Can the Maximum Entropy Principle be Explained as a Consistency Requirement?” Studies in History and Philosophy of Modern Physics, 26B: 223–261. URL http://www.phys.uu.nl/~wwwgrnsl/jos/mepabst/mepabst.html.

— (1996). “The Constraint Rule of the Maximum Entropy Principle.” Studies in History and Philosophy of Modern Physics, 27: 47–79. URL http://www.phys.uu.nl/~wwwgrnsl/jos/mep2def/mep2def.html.