Information Theory
Ying Nian Wu, UCLA Department of Statistics
July 9, 2007, IPAM Summer School
Goal: A gentle introduction to the basic concepts in information theory
Emphasis: understanding and interpretations of these concepts
Reference: Elements of Information Theory by Cover and Thomas
Topics
• Entropy and relative entropy
• Asymptotic equipartition property
• Data compression
• Large deviation
• Kolmogorov complexity
• Entropy rate of a process
Entropy
Randomness or uncertainty of a probability distribution.
Notation: $p(x) = \Pr(X = x)$; for $X$ taking values $a_1, a_2, \ldots, a_K$, write $\Pr(X = a_k) = p_k$.
Example
$X$ takes values $a, b, c, d$ with probabilities $\Pr = 1/2, 1/4, 1/8, 1/8$.
Definition
$H(X) = H(p_1, \ldots, p_K) = -\sum_k p_k \log p_k = -\sum_x p(x) \log p(x)$
Entropy
Example
$X$ takes values $a, b, c, d$ with probabilities $\Pr = 1/2, 1/4, 1/8, 1/8$.
$H(X) = -\frac{1}{2}\log\frac{1}{2} - \frac{1}{4}\log\frac{1}{4} - \frac{1}{8}\log\frac{1}{8} - \frac{1}{8}\log\frac{1}{8} = \frac{7}{4}$ bits (log base 2).
Definition for both discrete and continuous
$H(p) = E_p[-\log p(X)]$
Recall $E[f(X)] = \sum_x f(x)\, p(x)$ for a discrete distribution (an integral replaces the sum in the continuous case), with $p(x) = \Pr(X = x)$.
Entropy
Example
$X$ takes values $a, b, c, d$ with probabilities $\Pr = 1/2, 1/4, 1/8, 1/8$, so $-\log p = 1, 2, 3, 3$.
$H(X) = \frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{8} \times 3 = \frac{7}{4}$ bits, i.e., $H(p) = E_p[-\log p(X)]$.
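To make the computation concrete, here is a minimal Python sketch (not part of the original slides) that evaluates the entropy of the running example; the function name entropy is ours:

import math

def entropy(probs, base=2):
    # Shannon entropy H = -sum p log p, skipping zero-probability outcomes
    return -sum(p * math.log(p, base) for p in probs if p > 0)

# The slides' example: Pr(a, b, c, d) = 1/2, 1/4, 1/8, 1/8
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits, matching H(X) = 7/4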
Interpretation 1: cardinality
Uniform distribution $X \sim \mathrm{Unif}[A]$.
There are $|A|$ elements in $A$; all these choices are equally likely.
$H(X) = E[-\log p(X)] = E[\log(1/(1/|A|))] = \log |A|$
Entropy can be interpreted as the log of volume or size: an $n$-dimensional cube has $2^n$ vertices, so entropy can also be interpreted as dimensionality.
What if the distribution is not uniform?
Asymptotic equipartition property
Any distribution is essentially a uniform distribution in long-run repetition.
$X_1, X_2, \ldots, X_n \sim p(x)$ independently, so $p(X_1, X_2, \ldots, X_n) = p(X_1)\, p(X_2) \cdots p(X_n)$.
Recall if $X \sim \mathrm{Unif}[A]$ then $p(X) = 1/|A|$, a constant.
Is $p(X_1, \ldots, X_n)$ random? But in some sense, it is essentially a constant!

Law of large numbers
$X_1, X_2, \ldots, X_n \sim p(x)$ independently:
$\overline{f(X)} = \frac{1}{n} \sum_{i=1}^n f(X_i) \to E[f(X)] = \sum_x f(x)\, p(x)$
The long-run average converges to the expectation.

Asymptotic equipartition property
Taking $f = -\log p$:
$-\frac{1}{n} \log p(X_1, \ldots, X_n) = -\frac{1}{n} \sum_{i=1}^n \log p(X_i) \to E[-\log p(X)] = H(p)$
Intuitively, in the long run, $p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$.
Recall if $X \sim \mathrm{Unif}[A]$ then $p(X) = 1/|A|$, a constant.
Therefore it is as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.
So the dimensionality per observation is $H(p)$.
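As an illustration of the AEP (a simulation sketch we add here, not from the slides), sample n values from the example distribution and watch $-\frac{1}{n}\log_2 p(X_1, \ldots, X_n)$ settle near $H(p) = 1.75$:

import math, random

probs = {"a": 1/2, "b": 1/4, "c": 1/8, "d": 1/8}   # the slides' example distribution
H = -sum(p * math.log2(p) for p in probs.values())

random.seed(0)
for n in [10, 100, 1000, 10000]:
    xs = random.choices(list(probs), weights=probs.values(), k=n)
    # -(1/n) log2 p(X_1, ..., X_n) = -(1/n) sum_i log2 p(X_i)
    avg = -sum(math.log2(probs[x]) for x in xs) / n
    print(n, round(avg, 3), "vs H(p) =", H)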
Asymptotic equipartition property
We can make it more rigorous.
$X_1, X_2, \ldots, X_n \sim p(x)$ independently.
Weak law of large numbers: for any $\epsilon > 0$,
$\Pr\left( \left| \frac{1}{n} \sum_{i=1}^n f(X_i) - E[f(X)] \right| \le \epsilon \right) \ge 1 - \epsilon$ for $n > n_0$,
where $E[f(X)] = \sum_x f(x)\, p(x)$.
Typical set
In the long run $p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$, as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.
Typical set $A_{n,\epsilon}$: the set of sequences $(x_1, x_2, \ldots, x_n)$ such that
$2^{-n(H(p)+\epsilon)} \le p(x_1, x_2, \ldots, x_n) \le 2^{-n(H(p)-\epsilon)}$
Then $\Pr(A_{n,\epsilon}) > 1 - \epsilon$ for sufficiently large $n$, and
$(1 - \epsilon)\, 2^{n(H(p)-\epsilon)} \le |A_{n,\epsilon}| \le 2^{n(H(p)+\epsilon)}$
Interpretation 2: coin flipping
Flip a fair coin: {Head, Tail}.
Flip a fair coin twice independently: {HH, HT, TH, TT}.
...
Flip a fair coin $n$ times independently: $2^n$ equally likely sequences.
We may interpret entropy as the number of flips.

Example
$a, b, c, d$ with $\Pr = 1/4, 1/4, 1/4, 1/4$, coded by flips HH, HT, TH, TT.
The above uniform distribution amounts to 2 coin flips.

Interpretation 2: coin flipping
$p(X_1, \ldots, X_n) \approx 2^{-nH(p)}$, as if $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$.
So $(X_1, \ldots, X_n)$ amounts to $nH(p)$ flips, and $X \sim p(x)$ amounts to $H(p)$ flips.

Example
$a, b, c, d$ with $\Pr = 1/2, 1/4, 1/8, 1/8$, coded by flips H, TH, TTH, TTT
(a binary tree: H on the first flip gives $a$; otherwise H on the second flip gives $b$; otherwise H on the third flip gives $c$ and T gives $d$).
$H(X) = \frac{1}{2} \times 1 + \frac{1}{4} \times 2 + \frac{1}{8} \times 3 + \frac{1}{8} \times 3 = \frac{7}{4}$ flips on average.
Interpretation 3: coding
Example
$a, b, c, d$ each with $\Pr = 1/4$, coded as 00, 01, 10, 11; code length $2 = \log 4 = -\log(1/4)$.
How many bits to code the elements in $A$, where $(X_1, \ldots, X_n) \sim \mathrm{Unif}[A]$ with $|A| = 2^{nH(p)}$? Answer: $nH(p)$ bits.
This can be made more formal using the typical set $A_{n,\epsilon}$.
Interpretation 3: coding
Prefix code for $\Pr = 1/2, 1/4, 1/8, 1/8$: $a \to 0$, $b \to 10$, $c \to 110$, $d \to 111$
(the same binary tree as before, with H written as 0 and T as 1).
E.g., the sequence abacbd is coded as 010011010111.
$E[l(X)] = \sum_x l(x)\, p(x) = 1 \times \frac{1}{2} + 2 \times \frac{1}{4} + 3 \times \frac{1}{8} + 3 \times \frac{1}{8} = \frac{7}{4}$ bits.
The coded bits behave like a sequence of coin flips: a completely random sequence that cannot be further compressed.

Optimal code
$l(x) = -\log p(x)$, so $E[l(X)] = H(p)$.
E.g., of two English words, the frequent "I" gets a short code while the infrequent "probability" gets a long one.
Optimal code
Assign to each symbol $a_k$ (probability $p_k$) a codeword of length $l_k$.
Kraft inequality for a prefix code: $\sum_k 2^{-l_k} \le 1$
Minimize $L = E[l(X)] = \sum_k p_k l_k$ subject to the Kraft inequality.
Optimal length $l_k^* = -\log p_k$, so $L^* = H(p)$.
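The slides do not construct a specific code here; as an illustrative sketch (our addition), the standard Huffman algorithm, one way to build an optimal prefix code, recovers the lengths 1, 2, 3, 3 for the running example, and the expected length equals $H(p) = 7/4$ because every $p_k$ is a power of 1/2:

import heapq, math

def huffman_lengths(probs):
    # Repeatedly merge the two least probable subtrees; each merge adds one bit
    # to the codeword of every symbol inside the merged subtrees.
    heap = [(p, k, [k]) for k, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)   # tie-breaker so the heap never compares the symbol lists
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for k in s1 + s2:
            lengths[k] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

probs = [1/2, 1/4, 1/8, 1/8]                       # the slides' example
lengths = huffman_lengths(probs)                   # [1, 2, 3, 3]
L = sum(p * l for p, l in zip(probs, lengths))     # expected code length E[l(X)]
H = -sum(p * math.log2(p) for p in probs)          # entropy H(p)
print(lengths, L, H)                               # L = H = 1.75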
Wrong model
True distribution $p_1, \ldots, p_K$, wrong model $q_1, \ldots, q_K$, on the same alphabet $a_1, \ldots, a_K$.
Optimal code: $L^* = E[l^*(X)] = -\sum_k p_k \log p_k$
Wrong code: $L = E[l(X)] = -\sum_k p_k \log q_k$
Redundancy: $L - L^* = \sum_k p_k \log \frac{p_k}{q_k}$
Box: "All models are wrong, but some are useful."
Relative entropy
Two distributions $p_1, \ldots, p_K$ and $q_1, \ldots, q_K$ on the same alphabet $a_1, \ldots, a_K$.
$D(p \| q) = E_p\!\left[\log \frac{p(X)}{q(X)}\right] = \sum_x p(x) \log \frac{p(x)}{q(x)}$
Kullback-Leibler divergence.

Relative entropy
$D(p \| q) = E_p\!\left[\log \frac{p(X)}{q(X)}\right] = -E_p\!\left[\log \frac{q(X)}{p(X)}\right] \ge -\log E_p\!\left[\frac{q(X)}{p(X)}\right] = -\log 1 = 0$
by Jensen's inequality.
Against the uniform distribution $U$ on $A$: $D(p \| U) = \log |A| - H(p)$.
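A small numerical sketch of these two slides (our addition): with the running example as the true distribution and a uniform distribution as a hypothetical wrong model, the coding redundancy equals $D(p \| q)$, which here is also $\log|A| - H(p) = 2 - 1.75 = 0.25$ bits:

import math

def kl(p, q):
    # Relative entropy D(p || q) = sum_x p(x) log2 (p(x) / q(x))
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

p = [1/2, 1/4, 1/8, 1/8]   # true distribution (the slides' example)
q = [1/4, 1/4, 1/4, 1/4]   # wrong model: uniform over the four symbols

L_opt   = -sum(pk * math.log2(pk) for pk in p)               # optimal expected length H(p)
L_wrong = -sum(pk * math.log2(qk) for pk, qk in zip(p, q))   # expected length under the wrong code
print(L_wrong - L_opt, kl(p, q))   # both equal 0.25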
Types
$X_1, X_2, \ldots, X_n \sim q(x)$ independently, over the alphabet $a_1, \ldots, a_K$ with probabilities $q_1, \ldots, q_K$.
$n_k$ = number of times $X_i = a_k$; $f_k = n_k / n$ is the normalized frequency.
Law of large numbers: $f_k \to q_k$, i.e., $(f_1, \ldots, f_K) \to (q_1, \ldots, q_K)$.

Refinement
$\Pr(f) = \Pr(n_1 = n f_1, \ldots, n_K = n f_K) = \frac{n!}{(n f_1)! \cdots (n f_K)!}\, q_1^{n f_1} \cdots q_K^{n f_K}$
(recall $p(x_1, \ldots, x_n) = \prod_i p(x_i)$).
$-\frac{1}{n} \log \Pr(f) \approx D(f \| q)$, so $\Pr(f) \approx 2^{-n D(f \| q)}$
Large deviation.
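To see the large-deviation approximation at work, here is a sketch (our addition) comparing the exact multinomial probability of a hypothetical atypical type f with the estimate $2^{-nD(f\|q)}$, using the running example as q; the two exponents agree up to a correction that vanishes as n grows:

import math

def kl(p, q):
    return sum(pk * math.log2(pk / qk) for pk, qk in zip(p, q) if pk > 0)

def log2_type_prob(counts, q):
    # Exact log2 Pr(f): multinomial coefficient times prod_k q_k^{n_k}
    n = sum(counts)
    out = math.lgamma(n + 1) / math.log(2)
    for nk, qk in zip(counts, q):
        out += nk * math.log2(qk) - math.lgamma(nk + 1) / math.log(2)
    return out

q = [1/2, 1/4, 1/8, 1/8]              # sampling distribution (the slides' example)
counts = [3000, 3000, 2000, 2000]      # a hypothetical atypical type, n = 10000
n = sum(counts)
f = [c / n for c in counts]
print(log2_type_prob(counts, q) / n)   # exact (1/n) log2 Pr(f)
print(-kl(f, q))                       # large-deviation approximation -D(f || q)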
Kolmogorov complexity
Example: a string 011011011011…011
Program: for (i = 1 to n/3) write(011) end. This can be translated into binary machine code.
Kolmogorov complexity = length of the shortest machine code that reproduces the string; no probability distribution is involved.
If a long sequence is not compressible, then it has all the statistical properties of a sequence of coin flips:
string = f(coin flips)
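Kolmogorov complexity itself is defined via shortest programs; as a rough illustration of the compressible-versus-random contrast (our addition, using an off-the-shelf compressor rather than a shortest program), compare the repetitive string with a coin-flip string of the same length:

import random, zlib

regular = "011" * 1000                                           # the slides' example pattern, repeated
random_bits = "".join(random.choice("01") for _ in range(3000))  # a coin-flip string of the same length

print(len(zlib.compress(regular.encode())))      # a few dozen bytes: a short description ("repeat 011") exists
print(len(zlib.compress(random_bits.encode())))  # hundreds of bytes: roughly one bit per coin flip is unavoidable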
Joint and conditional entropy
Joint distribution: a table with rows $a_1, \ldots, a_K$ (the values of $X$), columns $b_1, \ldots, b_J$ (the values of $Y$), and entries $p_{kj}$.
$\Pr(X = a_k \,\&\, Y = b_j) = p_{kj}$, i.e., $p(x, y) = \Pr(X = x \,\&\, Y = y)$
Marginal distributions: $\Pr(X = a_k) = p_{k\cdot}$, $\Pr(Y = b_j) = p_{\cdot j}$; $p_X(x) = \Pr(X = x)$, $p_Y(y) = \Pr(Y = y)$
E.g., eye color & hair color.
Conditional distribution
$\Pr(X = a_k \mid Y = b_j) = p_{kj} / p_{\cdot j}$, $\Pr(Y = b_j \mid X = a_k) = p_{kj} / p_{k\cdot}$
$p_{X|Y}(x \mid y) = \Pr(X = x \mid Y = y)$, $p_{Y|X}(y \mid x) = \Pr(Y = y \mid X = x)$
Chain rule
$p(x, y) = p_X(x)\, p_{Y|X}(y \mid x) = p_Y(y)\, p_{X|Y}(x \mid y)$
Joint and conditional entropy
$H(X, Y) = -\sum_{k,j} p_{kj} \log p_{kj} = -\sum_{x,y} p(x, y) \log p(x, y) = E[-\log p(X, Y)]$
$H(Y \mid X = x) = -\sum_y p(y \mid x) \log p(y \mid x)$
$H(Y \mid X) = \sum_x p(x)\, H(Y \mid X = x) = -\sum_{x,y} p(x, y) \log p(y \mid x) = E[-\log p(Y \mid X)]$
Since $p(x, y) = p_X(x)\, p_{Y|X}(y \mid x)$:
$H(X, Y) = H(X) + H(Y \mid X)$
Chain rule
$H(X_1, \ldots, X_n) = \sum_{i=1}^n H(X_i \mid X_1, \ldots, X_{i-1})$
Mutual information
$I(X; Y) = \sum_{x,y} p(x, y) \log \frac{p(x, y)}{p(x)\, p(y)} = E\!\left[\log \frac{p(X, Y)}{p(X)\, p(Y)}\right] = D(p(x, y) \,\|\, p(x)\, p(y))$
$I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)$
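These identities are easy to check numerically; here is a sketch (our addition) with a small hypothetical 2x2 joint table, computing $H(X)$, $H(Y)$, $H(X, Y)$, $H(Y \mid X)$, and $I(X; Y)$:

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# A hypothetical joint table p(x, y): rows are values of X, columns are values of Y.
joint = [[3/8, 1/8],
         [1/8, 3/8]]
px = [sum(row) for row in joint]            # marginal of X
py = [sum(col) for col in zip(*joint)]      # marginal of Y
H_X, H_Y = H(px), H(py)
H_XY = H([p for row in joint for p in row]) # joint entropy H(X, Y)
H_Y_given_X = H_XY - H_X                    # chain rule: H(X, Y) = H(X) + H(Y | X)
I = H_X + H_Y - H_XY                        # mutual information I(X; Y)
print(H_X, H_Y, H_XY, H_Y_given_X, I)       # I(X; Y) is about 0.19 bits here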
Entropy rate
Stochastic process $(X_1, X_2, \ldots, X_n, \ldots)$, not independent.
Markov chain: $\Pr(X_{n+1} = x_{n+1} \mid X_1 = x_1, \ldots, X_n = x_n) = \Pr(X_{n+1} = x_{n+1} \mid X_n = x_n)$
Entropy rate: $H = \lim_{n \to \infty} \frac{1}{n} H(X_1, \ldots, X_n)$
Stationary process: $H = \lim_{n \to \infty} H(X_n \mid X_{n-1}, \ldots, X_1)$
Stationary Markov chain: $H = H(X_n \mid X_{n-1})$
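For a stationary Markov chain the entropy rate can be computed directly from the transition matrix; a sketch with a hypothetical two-state chain (our addition):

import math

def H(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical transition matrix: P[i][j] = Pr(X_{n+1} = j | X_n = i).
P = [[0.9, 0.1],
     [0.5, 0.5]]
# Stationary distribution pi solves pi = pi P; for two states it has a closed form.
pi0 = P[1][0] / (P[0][1] + P[1][0])
pi = [pi0, 1 - pi0]
# Entropy rate H = H(X_n | X_{n-1}) = sum_i pi_i * H(row i of P).
rate = sum(w * H(row) for w, row in zip(pi, P))
print(pi, rate)   # about 0.56 bits per step; a fair-coin chain would give 1 bit per step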
Shannon, 1948 (compression)
1. Zero-order approximation: XFOML RXKHRJFFJUJ ZLPWCFWKCYJ FFJEYVKCQSGHYD QPAAMKBZAACIBZLHJQD.
2. First-order approximation: OCRO HLI RGWR NMIELWIS EU LL NBNESEBYA TH EEI ALHENHTTPA OOBTTVA NAH BRL.
3. Second-order approximation (digram structure as in English): ON IE ANTSOUTINYS ARE T INCTORE ST BE S DEAMY ACHIN D ILONASIVE TUCOOWE AT TEASONARE FUSO TIZIN ANDY TOBE SEACE CTISBE.
4. Third-order approximation (trigram structure as in English): IN NO IST LAT WHEY CRATICT FROURE BIRS GROCID PONDENOME OF DEMONSTURES OF THE REPTAGIN IS REGOACTIONA OF CRE.
5. First-order word approximation: REPRESENTING AND SPEEDILY IS AN GOOD APT OR COME CAN DIFFERENT NATURAL HERE HE THE A IN CAME THE TO OF TO EXPERT GRAY COME TO FURNISHES THE LINE MESSAGE HAD BE THESE.
6. Second-order word approximation: THE HEAD AND IN FRONTAL ATTACK ON AN ENGLISH WRITER THAT THE CHARACTER OF THIS POINT IS THEREFORE ANOTHER METHOD FOR THE LETTERS THAT THE TIME OF WHO EVER TOLD THE PROBLEM FOR AN UNEXPECTED.
Summary
• Entropy of a distribution measures its randomness or uncertainty: the log of the number of equally likely choices, the average number of coin flips, the average length of a prefix code (Kolmogorov: shortest machine code, randomness).
• Relative entropy from one distribution to another measures the departure from the first to the second: coding redundancy, large deviation.
• Conditional entropy, mutual information, entropy rate.