

Page 1

Information Theory

Ying Nian Wu, UCLA Department of Statistics

July 9, 2007, IPAM Summer School

Page 2

Goal: A gentle introduction to the basic concepts in information theory

Emphasis: understanding and interpretations of these concepts

Reference: Elements of Information Theory by Cover and Thomas

Page 3

Topics
• Entropy and relative entropy
• Asymptotic equipartition property
• Data compression
• Large deviation
• Kolmogorov complexity
• Entropy rate of a process

Page 4

Entropy: the randomness or uncertainty of a probability distribution.

Notation: $\Pr(X = a_k) = p_k$, and $p(x) = \Pr(X = x)$.

x:   a_1  a_2  ...  a_K
Pr:  p_1  p_2  ...  p_K

Example:

x:   a    b    c    d
Pr:  1/2  1/4  1/8  1/8

Page 5

Entropy

Definition:

$H(X) = H(p_1, \ldots, p_K) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)$

(Here log is base 2, so entropy is measured in bits.)

As before, $\Pr(X = a_k) = p_k$ and $p(x) = \Pr(X = x)$ for the distribution

x:   a_1  a_2  ...  a_K
Pr:  p_1  p_2  ...  p_K

Page 6

Entropy

$H(X) = -\sum_x p(x) \log p(x)$

Example:

x:   a    b    c    d
Pr:  1/2  1/4  1/8  1/8

$H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}$
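As a sanity check, here is a minimal Python sketch (mine, not from the slides) that evaluates this formula with base-2 logs; it prints 1.75, i.e. 7/4.

```python
# Sanity check of the example above: H(X) for Pr = (1/2, 1/4, 1/8, 1/8).
import math

p = [1/2, 1/4, 1/8, 1/8]
H = -sum(pk * math.log2(pk) for pk in p)
print(H)  # 1.75 bits = 7/4
```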

Page 7

Entropy

Definition for both discrete and continuous distributions:

$H(p) = E_p[-\log p(X)]$

Recall $E[f(X)] = \sum_x f(x) p(x)$. For a discrete distribution with $\Pr(X = a_k) = p_k$, this recovers $H(X) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)$.

Page 8

Entropy

Example:

x:       a    b    c    d
Pr:      1/2  1/4  1/8  1/8
-log p:  1    2    3    3

$H(X) = E_p[-\log p(X)] = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = \frac{7}{4}$

Page 9

Interpretation 1: cardinality

Uniform distribution $X \sim \mathrm{Unif}[A]$: there are $|A|$ elements in $A$, and all these choices are equally likely.

$H(X) = E[-\log p(X)] = E[\log |A|] = \log |A|$

Entropy can be interpreted as the log of a volume or size. Since an $n$-dimensional cube has $2^n$ vertices, the log of a count can also be interpreted as a dimensionality.

What if the distribution is not uniform?

Page 10

Asymptotic equipartition property

Any distribution is essentially a uniform distribution in long-run repetition.

$X_1, X_2, \ldots, X_n \sim p(x)$ independently, so

$p(X_1, X_2, \ldots, X_n) = p(X_1) p(X_2) \cdots p(X_n)$

Recall that if $X \sim \mathrm{Unif}[A]$, then $p(X) = 1/|A|$, a constant.

Is $p(X_1, \ldots, X_n)$ random? Yes, but in some sense it is essentially a constant!

Page 11

Law of large numbers

$X_1, X_2, \ldots, X_n \sim p(x)$ independently:

$\frac{1}{n}\sum_{i=1}^n f(X_i) \to E[f(X)] = \sum_x f(x) p(x)$

The long-run average converges to the expectation.

Page 12

Asymptotic equipartition property

$p(X_1, X_2, \ldots, X_n) = p(X_1) p(X_2) \cdots p(X_n)$, so by the law of large numbers

$-\frac{1}{n}\log p(X_1, \ldots, X_n) = \frac{1}{n}\sum_{i=1}^n [-\log p(X_i)] \to E[-\log p(X)] = H(p)$

Intuitively, in the long run,

$p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$
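A quick numeric illustration of this (my own sketch, not from the slides): for the four-symbol distribution used in the earlier examples, $-\frac{1}{n}\log_2 p(X_1,\ldots,X_n)$ settles near $H(p) = 7/4$ as $n$ grows.

```python
# Minimal simulation sketch (not from the slides): the AEP for the
# distribution Pr = (1/2, 1/4, 1/8, 1/8) from the earlier examples.
import math
import random

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
H = -sum(pk * math.log2(pk) for pk in p.values())  # 1.75

random.seed(0)
for n in (10, 100, 10_000):
    xs = random.choices(list(p), weights=list(p.values()), k=n)
    per_symbol = -sum(math.log2(p[x]) for x in xs) / n
    print(n, round(per_symbol, 3), "vs H(p) =", H)
```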

Page 13

Asymptotic equipartition property

$p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$

Recall that if $X \sim \mathrm{Unif}[A]$, then $p(X) = 1/|A|$, a constant.

Therefore, it is as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.

So the dimensionality per observation is $H(p)$.

We can make this more rigorous.

Page 14

Weak law of large numbers

$X_1, X_2, \ldots, X_n \sim p(x)$ independently, with $E[f(X)] = \sum_x f(x) p(x)$.

For any $\epsilon > 0$ and $\delta > 0$,

$\Pr\left( \left| \frac{1}{n}\sum_{i=1}^n f(X_i) - E[f(X)] \right| \le \epsilon \right) > 1 - \delta$ for all $n > n_0(\epsilon, \delta)$.

Page 15

Typical set

$p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$, as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.

The typical set $A_\epsilon^{(n)}$ is a subset of $\Omega^n$, where $\Omega = \{a_1, \ldots, a_K\}$ is the alphabet of $X$ with $\Pr(X = a_k) = p_k$.

Page 16

Typical set

$A_\epsilon^{(n)}$: the set of sequences $(x_1, x_2, \ldots, x_n) \in \Omega^n$ such that

$2^{-n(H(p)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(p)-\epsilon)}$

$p(A_\epsilon^{(n)}) > 1 - \epsilon$ for sufficiently large $n$

$(1-\epsilon)\, 2^{n(H(p)-\epsilon)} \le |A_\epsilon^{(n)}| \le 2^{n(H(p)+\epsilon)}$

So $p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$, as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.
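A small numeric sketch (my own; the Bernoulli(0.3) source, $n = 20$, and $\epsilon = 0.1$ are arbitrary choices for illustration) that checks both bounds by grouping binary sequences by their number of ones:

```python
# Sketch: typical set of a Bernoulli(q) source, checked by exact enumeration.
import math

q, n, eps = 0.3, 20, 0.1
H = -(q * math.log2(q) + (1 - q) * math.log2(1 - q))

prob_typical, size = 0.0, 0
for k in range(n + 1):                        # k = number of ones in the sequence
    logp = k * math.log2(q) + (n - k) * math.log2(1 - q)
    if -n * (H + eps) <= logp <= -n * (H - eps):
        size += math.comb(n, k)               # all comb(n, k) such sequences share logp
        prob_typical += math.comb(n, k) * 2 ** logp

print(prob_typical)                  # modest at n = 20; tends to 1 as n grows
print(size <= 2 ** (n * (H + eps)))  # True: the size upper bound holds
```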

Page 17

Interpretation 2: coin flipping

Flip a fair coin: {Head, Tail}.
Flip a fair coin twice independently: {HH, HT, TH, TT}.
...
Flip a fair coin $n$ times independently: $2^n$ equally likely sequences.

We may interpret entropy as the number of flips.

Page 18

Interpretation 2: coin flipping

Example:

x:      a    b    c    d
Pr:     1/4  1/4  1/4  1/4
flips:  HH   HT   TH   TT

The above uniform distribution amounts to 2 coin flips.

Page 19

Interpretation 2: coin flipping

$p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$, as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.

So $(X_1, \ldots, X_n)$ amounts to $nH(p)$ flips, and a single draw $X \sim p(x)$ amounts to $H(p)$ flips.

Page 20

Interpretation 2: coin flipping

x:      a    b    c    d
Pr:     1/2  1/4  1/8  1/8
flips:  H    TH   TTH  TTT

The flips follow a binary tree: H -> a; otherwise flip again, with TH -> b; then TTH -> c and TTT -> d.

Page 21

Interpretation 2: coin flipping

x:       a    b    c    d
Pr:      1/2  1/4  1/8  1/8
-log p:  1    2    3    3
flips:   H    TH   TTH  TTT

$H(X) = \frac{1}{2}\cdot 1 + \frac{1}{4}\cdot 2 + \frac{1}{8}\cdot 3 + \frac{1}{8}\cdot 3 = \frac{7}{4}$ flips
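A simulation sketch (not from the slides): draw from this distribution by flipping a fair coin down the tree; the average number of flips comes out near $H(X) = 7/4$.

```python
# Sketch: sample from Pr = (1/2, 1/4, 1/8, 1/8) via fair coin flips on the tree
# H -> a, TH -> b, TTH -> c, TTT -> d, and estimate the average number of flips.
import random

random.seed(0)

def draw():
    """Return (symbol, number of flips) from a sequence of fair coin flips."""
    for flips, symbol in ((1, 'a'), (2, 'b'), (3, 'c')):
        if random.random() < 0.5:   # heads: stop at this leaf
            return symbol, flips
    return 'd', 3                   # three tails: TTT

samples = [draw() for _ in range(100_000)]
print(sum(f for _, f in samples) / len(samples))  # about 1.75 = H(X) flips
```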

Page 22

Interpretation 3: coding

Example:

x:     a    b    c    d
Pr:    1/4  1/4  1/4  1/4
code:  00   01   10   11

length $= 2 = \log 4 = -\log(1/4)$

Page 23

Interpretation 3: coding

$p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$, as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.

How many bits to code the elements in $A$? $\log|A| = nH(p)$ bits.

This can be made more formal using the typical set $A_\epsilon^{(n)}$.

Page 24

Interpretation 3: coding

x:     a    b    c    d
Pr:    1/2  1/4  1/8  1/8
code:  0    10   110  111

Prefix code: no codeword is a prefix of another, so "abacbd" -> 0 10 0 110 10 111 decodes unambiguously.

$E[l(X)] = \sum_x l(x) p(x) = 1\cdot\frac{1}{2} + 2\cdot\frac{1}{4} + 3\cdot\frac{1}{8} + 3\cdot\frac{1}{8} = \frac{7}{4}$ bits

Page 25

Interpretation 3: coding

x:     a    b    c    d
Pr:    1/2  1/4  1/8  1/8
code:  0    10   110  111

The code mirrors the coin-flipping tree: H -> a, TH -> b, TTH -> c, TTT -> d.

A coded sequence is like a sequence of coin flips: a completely random sequence, which cannot be further compressed.

Optimal code: $l(x) = -\log p(x)$, so that $E[l(X)] = H(p)$.

Natural language behaves similarly: frequent words are short and rare words are long, e.g., the two words "I" and "probability".

Page 26

Optimal code

x:       a_1  a_2  ...  a_K
Pr:      p_1  p_2  ...  p_K
length:  l_1  l_2  ...  l_K

Kraft inequality for a prefix code: $\sum_k 2^{-l_k} \le 1$

Minimize $L = E[l(X)] = \sum_k p_k l_k$ subject to the Kraft inequality.

Optimal lengths: $l_k^* = -\log p_k$, giving $L^* = H(p)$.
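A quick check (my own sketch, not from the slides) of the Kraft inequality and the optimality claim for the prefix code of Page 24:

```python
# Sketch: Kraft inequality and average length for the code on Page 24.
import math

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}

kraft = sum(2 ** -len(w) for w in code.values())
avg_len = sum(p[x] * len(w) for x, w in code.items())
H = -sum(pk * math.log2(pk) for pk in p.values())

print(kraft)         # 1.0: Kraft holds with equality
print(avg_len, H)    # both 1.75: L* = H(p), the code is optimal
```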

Page 27

Wrong model

x:      a_1  a_2  ...  a_K
True:   p_1  p_2  ...  p_K
Wrong:  q_1  q_2  ...  q_K

Optimal code: $L^* = E[l^*(X)] = \sum_k p_k l_k^* = -\sum_k p_k \log p_k$

Wrong code: $L = E[l(X)] = \sum_k p_k l_k = -\sum_k p_k \log q_k$

Redundancy: $L - L^* = \sum_k p_k \log\frac{p_k}{q_k}$

George Box: "All models are wrong, but some are useful."

Page 28

Relative entropy

x:  a_1  a_2  ...  a_K
p:  p_1  p_2  ...  p_K
q:  q_1  q_2  ...  q_K

$D(p\|q) = E_p\!\left[\log\frac{p(X)}{q(X)}\right] = \sum_x p(x)\log\frac{p(x)}{q(x)}$

Also known as the Kullback-Leibler divergence.
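A sketch (the "wrong" model $q$ is an arbitrary uniform choice for illustration) checking that $D(p\|q)$ equals the coding redundancy $L - L^*$ from Page 27:

```python
# Sketch: D(p||q) equals the redundancy of coding p with the wrong model q.
import math

p = [1/2, 1/4, 1/8, 1/8]   # true distribution
q = [1/4, 1/4, 1/4, 1/4]   # wrong (uniform) model, chosen for illustration

D = sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q))
L_star = -sum(pk * math.log2(pk) for pk in p)               # optimal: H(p) = 1.75
L_wrong = -sum(pk * math.log2(qk) for pk, qk in zip(p, q))  # cross-entropy = 2.0

print(D)                   # 0.25 bits
print(L_wrong - L_star)    # the same 0.25 bits of redundancy
```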

Page 29

Relative entropy

$D(p\|q) = E_p\!\left[\log\frac{p(X)}{q(X)}\right] = -E_p\!\left[\log\frac{q(X)}{p(X)}\right] \ge -\log E_p\!\left[\frac{q(X)}{p(X)}\right] = -\log 1 = 0$

by Jensen's inequality.

For the uniform distribution $U$ on $A$:

$D(p\|U) = E_p[\log p(X)] + \log|A| = \log|A| - H(p)$

Page 30

Types

$X_1, X_2, \ldots, X_n \sim q(x)$ independently.

x:     a_1  a_2  ...  a_K
q:     q_1  q_2  ...  q_K
freq:  n_1  n_2  ...  n_K

$n_k$ = number of times $X_i = a_k$; $f_k = n_k/n$ is the normalized frequency.

Page 31

Types

Law of large numbers: $(f_1, \ldots, f_K) \to (q_1, \ldots, q_K)$.

Refinement: since $p(x_1, \ldots, x_n) = \prod_i p(x_i)$, the probability of observing the frequency vector $f$ is multinomial:

$\Pr(f) = \binom{n}{n f_1, \ldots, n f_K}\, q_1^{n f_1} \cdots q_K^{n f_K}$

Page 32

Types

Law of large numbers: $(f_1, \ldots, f_K) \to (q_1, \ldots, q_K)$.

Refinement:

$-\frac{1}{n}\log \Pr(f) \to D(f\|q)$, that is, $\Pr(f) \approx 2^{-nD(f\|q)}$

This is a large deviation result.
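A sketch (the source $q$, the sample size $n$, and the frequency vector are my own illustrative choices) comparing the exact multinomial probability of a type $f$ with the large-deviation rate $2^{-nD(f\|q)}$:

```python
# Sketch: exact multinomial probability of a type vs its large-deviation rate.
import math

q = [1/2, 1/4, 1/8, 1/8]              # true source distribution
n = 6000
counts = [2000, 2000, 1000, 1000]     # a deviant frequency vector (sums to n)
f = [c / n for c in counts]

# exact: Pr(f) = multinomial(n; counts) * prod_k q_k^{counts_k}
log2_multinom = (math.lgamma(n + 1)
                 - sum(math.lgamma(c + 1) for c in counts)) / math.log(2)
log2_prob = log2_multinom + sum(c * math.log2(qk) for c, qk in zip(counts, q))

D = sum(fk * math.log2(fk / qk) for fk, qk in zip(f, q))
print(-log2_prob / n)                 # ~0.085
print(D)                              # ~0.082; the gap is O(log n / n)
```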

Page 33

Kolmogorov complexity

Example: a string 011011011011...011

Program: for (i = 1 to n/3) write(011) end
The program can be translated to binary machine code.

Kolmogorov complexity = the length of the shortest machine code that reproduces the string. No probability distribution is involved.

If a long sequence is not compressible, then it has all the statistical properties of a sequence of coin flips:

string = f(coin flips)
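The slide's pseudocode as a short, runnable Python sketch (n = 30 is an arbitrary multiple of 3): the program has constant size no matter how long the string grows, which is exactly why this string has low Kolmogorov complexity.

```python
# Constant-size program that reproduces the regular string from the slide.
n = 30                    # any multiple of 3; the program does not grow with n
print("011" * (n // 3))   # 011011011...011
```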

Page 34

Joint and conditional entropy

Joint distribution: $p_{kj} = \Pr(X = a_k \;\&\; Y = b_j)$, $p(x, y) = \Pr(X = x \;\&\; Y = y)$:

       b_1   b_2   ...  b_J
a_1    p_11  p_12  ...  p_1J
a_2    p_21  p_22  ...  p_2J
...
a_K    p_K1  p_K2  ...  p_KJ

Marginal distributions: row sums $p_k = \Pr(X = a_k)$ and column sums $p_j = \Pr(Y = b_j)$; $p_X(x) = \Pr(X = x)$, $p_Y(y) = \Pr(Y = y)$.

e.g., eye color & hair color

Page 35

Joint and conditional entropy

Conditional distributions:

$\Pr(Y = b_j \mid X = a_k) = p_{kj}/p_k$, $\Pr(X = a_k \mid Y = b_j) = p_{kj}/p_j$

$p_{X|Y}(x \mid y) = \Pr(X = x \mid Y = y)$, $p_{Y|X}(y \mid x) = \Pr(Y = y \mid X = x)$

Chain rule:

$p_{X,Y}(x, y) = p_X(x)\, p_{Y|X}(y \mid x) = p_Y(y)\, p_{X|Y}(x \mid y)$

Page 36

Joint and conditional entropy

$H(X, Y) = -\sum_{x,y} p(x, y)\log p(x, y) = E[-\log p(X, Y)]$

$H(Y \mid X = x) = -\sum_y p(y \mid x)\log p(y \mid x)$

$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x) = -\sum_{x,y} p(x, y)\log p(y \mid x) = E[-\log p(Y \mid X)]$

Page 37

Joint and conditional entropy

From $p(x, y) = p_X(x)\, p_{Y|X}(y \mid x)$:

Chain rule: $H(X, Y) = H(X) + H(Y \mid X)$

More generally,

$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})$

Page 38

Mutual information

$I(X; Y) = \sum_{x,y} p(x, y)\log\frac{p(x, y)}{p(x)\,p(y)} = E\!\left[\log\frac{p(X, Y)}{p(X)\,p(Y)}\right] = D\big(p(x, y)\,\|\,p(x)\,p(y)\big)$

$I(X; Y) = H(X) - H(X \mid Y)$

$I(X; Y) = H(X) + H(Y) - H(X, Y)$
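A sketch (the 2x2 joint table is an arbitrary illustration, not from the slides) that computes $I(X;Y)$ both from the definition and from $H(X) + H(Y) - H(X,Y)$:

```python
# Sketch: mutual information from the definition and from the entropy identity.
import math

# p[x][y] = Pr(X = x, Y = y); an arbitrary 2x2 joint table for illustration
p = {'a1': {'b1': 1/4, 'b2': 1/4},
     'a2': {'b1': 3/8, 'b2': 1/8}}

px = {x: sum(row.values()) for x, row in p.items()}
py = {}
for row in p.values():
    for y, v in row.items():
        py[y] = py.get(y, 0.0) + v

def H(values):
    return -sum(v * math.log2(v) for v in values if v > 0)

I_def = sum(v * math.log2(v / (px[x] * py[y]))
            for x, row in p.items() for y, v in row.items() if v > 0)
I_ent = H(px.values()) + H(py.values()) - H(
    v for row in p.values() for v in row.values())

print(I_def, I_ent)   # equal, and nonnegative
```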

Page 39

Entropy rate

Stochastic process $(X_1, X_2, \ldots, X_n, \ldots)$, not independent.

Markov chain: $\Pr(X_{n+1} = x_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)$

Entropy rate: $H = \lim_{n\to\infty} \frac{1}{n} H(X_1, \ldots, X_n)$

Stationary process: $H = \lim_{n\to\infty} H(X_n \mid X_{n-1}, \ldots, X_1)$

Stationary Markov chain: $H = H(X_{n+1} \mid X_n)$

(compression)
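For a stationary Markov chain, the entropy rate can be computed as $H = -\sum_i \mu_i \sum_j P_{ij}\log P_{ij}$, where $\mu$ is the stationary distribution of the transition matrix $P$. A small sketch (the two-state transition matrix is an arbitrary illustration, not from the slides):

```python
# Sketch: entropy rate of a stationary two-state Markov chain.
import math

P = [[0.9, 0.1],    # transition matrix, an arbitrary illustration
     [0.4, 0.6]]

# stationary distribution: iterate mu <- mu P to convergence
mu = [0.5, 0.5]
for _ in range(1000):
    mu = [sum(mu[i] * P[i][j] for i in range(2)) for j in range(2)]

H_rate = -sum(mu[i] * P[i][j] * math.log2(P[i][j])
              for i in range(2) for j in range(2))
print(mu, H_rate)   # mu = (0.8, 0.2); H is a weighted average of row entropies
```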

Page 40

Shannon, 1948: approximations to English

1. Zero-order approximation: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
2. First-order approximation: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
3. Second-order approximation (digram structure as in English): ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
4. Third-order approximation (trigram structure as in English): IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
5. First-order word approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
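Approximations 5 and 6 amount to sampling words from unigram and bigram frequencies. A toy sketch of the second-order word approximation (the "corpus" here is a fragment of the slide's own sample output, used only for illustration):

```python
# Toy sketch of Shannon's second-order word approximation: sample each
# word from the empirical distribution given the previous word.
import random

corpus = ("the head and in frontal attack on an english writer that "
          "the character of this point is therefore another method").split()

bigrams = {}
for w1, w2 in zip(corpus, corpus[1:]):
    bigrams.setdefault(w1, []).append(w2)

random.seed(0)
word, out = corpus[0], []
for _ in range(12):
    out.append(word)
    word = random.choice(bigrams.get(word, corpus))  # fall back for dead ends
print(" ".join(out).upper())
```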

Page 41

Summary

• Entropy of a distribution measures its randomness or uncertainty: the log of the number of equally likely choices, the average number of coin flips, the average length of a prefix code (and, per Kolmogorov, the length of the shortest machine code, with incompressibility as randomness).

• Relative entropy from one distribution to another measures the departure of the first from the second: coding redundancy, large deviation.

• Conditional entropy, mutual information, entropy rate.