ECE 598: The Speech Chain
Lecture 12: Information Theory
TRANSCRIPT
Today

Information
- Speech as Communication
- Shannon's Measurement of Information

Entropy
- Entropy = Average Information
- "Complexity" or "sophistication" of a text

Conditional Entropy
- Conditional Entropy = Average Conditional Information
- Example: Confusion Matrix, Articulation Testing
- Conditional Entropy vs. SNR

Channel Capacity
- Mutual Information = Entropy - Conditional Entropy
- Channel Capacity = max{Mutual Information}

Finite State Language Models
- Grammars
- Regular Grammar = Finite State Grammar = Markov Grammar
- Entropy of a Finite State Grammar

N-Gram Language Models
- Maximum-Likelihood Estimation
- Cross-Entropy of Text given its N-Gram
Information
Speech as Communication
w_n, ŵ_n = words selected from a vocabulary V. The size of the vocabulary, |V|, is assumed to be finite.

No human language has a truly finite vocabulary! For a more accurate analysis, we should do a phoneme-by-phoneme analysis, i.e., w_n, ŵ_n = phonemes in language V.

If |V| is finite, then we can define p(w_n|h_n), where h_n = all relevant history, including:
- previous words of the same utterance, w_1, ..., w_{n-1}
- dialog history (what did the other talker just say?)
- shared knowledge, e.g., physical knowledge, cultural knowledge

These probabilities satisfy 0 <= p(w_n|h_n) <= 1 and Σ_{w_n in V} p(w_n|h_n) = 1.
[Diagram: the speech channel. The caller's words [..., w_n, ...] become speech (an acoustic signal); acoustic noise, babble, reverberation, etc. are added; the listener hears noisy speech and perceives the words [..., ŵ_n, ...].]
Shannon's Criteria for a Measure of "Information"

Information should be...

Non-negative:
I(w_n|h_n) >= 0

Zero if a word is perfectly predictable from its history:
I(w_n|h_n) = 0 if and only if p(w_n|h_n) = 1

Large for unexpected words:
I(w_n|h_n) is large if p(w_n|h_n) is small

Additive:
I(w_{n-1}, w_n) = I(w_{n-1}) + I(w_n)
Shannon's Measure of Information

All of the criteria are satisfied by the following definition of information:

I(w_n) = log_a(1/p(w_n|h_n)) = -log_a p(w_n|h_n)
Information is...

Non-negative:
p(w_n|h_n) < 1  =>  log_a p(w_n|h_n) < 0  =>  I(w_n|h_n) > 0

Zero if a word is perfectly predictable:
p(w_n|h_n) = 1  =>  log_a p(w_n|h_n) = 0  =>  I(w_n|h_n) = 0

Large if a word is unpredictable:
I(w_n) = -log_a p(w_n|h_n) is large if p(w_n|h_n) is small

Additive (when the words are independent):
p(w_{n-1}, w_n) = p(w_{n-1}) p(w_n)
log_a p(w_{n-1}, w_n) = log_a p(w_{n-1}) + log_a p(w_n)
I(w_{n-1}, w_n) = I(w_{n-1}) + I(w_n)
Bits, Nats, and Digits

Consider a string of random coin tosses, "HTHHHTTHTHHTTT":
- p(w_n|h_n) = 1/2
- -log_2 p(w_n|h_n) = 1 bit of information per symbol
- -ln p(w_n|h_n) = 0.69 nats (0.69 nats/bit)
- -log_10 p(w_n|h_n) = 0.3 digits (0.3 digits/bit)

Consider a random string of digits, "49873417":
- p(w_n|h_n) = 1/10
- -log_10 p(w_n|h_n) = 1 digit of information per symbol
- -log_2 p(w_n|h_n) = 3.32 bits (3.32 bits/digit)
- -ln p(w_n|h_n) = 2.3 nats (2.3 nats/digit)

Unless otherwise specified, information is usually measured in bits:

I(w|h) = -log_2 p(w|h)
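These unit conversions can be checked numerically. A minimal sketch (the function name `information` and its base argument are my own, not from the lecture):

```python
import math

def information(p, base=2.0):
    """Self-information -log_base(p) of an outcome with probability p."""
    return -math.log(p) / math.log(base)

# A fair coin toss: p(w_n|h_n) = 1/2
print(information(0.5, 2))        # 1.0 bit/symbol
print(information(0.5, math.e))   # ~0.693 nats/symbol
print(information(0.5, 10))       # ~0.301 digits/symbol

# A random decimal digit: p(w_n|h_n) = 1/10
print(information(0.1, 10))       # 1.0 digit/symbol
print(information(0.1, 2))        # ~3.32 bits/symbol
print(information(0.1, math.e))   # ~2.30 nats/symbol
```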
Entropy

Entropy = Average Information

How unpredictable is the next word? Entropy = Average Unpredictability.

H(p) = Σ_w p(w) I(w)
H(p) = -Σ_w p(w) log_2 p(w)

Notice that entropy is not a function of the word, w... It is a function of the probability distribution, p(w).
Example: Uniform Source

Entropy of a coin toss:

H(p) = -p("heads") log_2 p("heads") - p("tails") log_2 p("tails")
     = -0.5 log_2(0.5) - 0.5 log_2(0.5) = 1 bit/symbol

Entropy of a uniform source with N different words, p(w) = 1/N:

H(p) = -Σ_{w=1}^{N} p(w) log_2 p(w) = log_2 N

In general: if all words are equally likely, then the average unpredictability ("entropy") is equal to the unpredictability of any particular word (the "information" conveyed by that word): log_2 N bits.
Example: Non-Uniform Source

Consider the toss of a weighted coin, with the following probabilities:

p("heads") = 0.6, p("tails") = 0.4

Information communicated by each word:
I("heads") = -log_2 0.6 = 0.74 bits
I("tails") = -log_2 0.4 = 1.3 bits

Entropy = average information:

H(p) = -0.6 log_2(0.6) - 0.4 log_2(0.4) = 0.97 bits/symbol on average

The entropy of a non-uniform source is always less than the entropy of a uniform source with the same vocabulary:
- Entropy of a uniform source with an N-word vocabulary is log_2 N bits
- Information conveyed by a likely word is less than log_2 N bits
- Information conveyed by an unlikely word is more than log_2 N bits, but that word is unlikely to occur!
Example: Deterministic Source

Consider the toss of a two-headed coin, with the following probabilities:

p("heads") = 1.0, p("tails") = 0.0

Information communicated by each word:
I("heads") = -log_2 1.0 = 0 bits
I("tails") = -log_2 0.0 = infinite bits!!

Entropy = average information (using the convention 0 log 0 = 0):

H(p) = -1.0 log_2(1.0) - 0.0 log_2(0.0) = 0 bits/symbol on average

If you know in advance what each word will be, then you gain no information by listening to the message. The "entropy" (average information per symbol) is zero.
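The three coin examples above (fair, weighted, two-headed) can be reproduced with one small function. A minimal sketch (the `entropy` helper is my own name; 0 log 0 is treated as 0 by skipping zero-probability terms):

```python
import math

def entropy(probs):
    """H(p) = -sum_w p(w) log2 p(w), in bits; terms with p = 0 contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # fair coin: 1.0 bit/symbol
print(entropy([0.6, 0.4]))    # weighted coin: ~0.97 bits/symbol
print(entropy([1.0, 0.0]))    # two-headed coin: 0.0 bits/symbol
print(entropy([0.125] * 8))   # uniform 8-word source: 3.0 bits = log2(8)
```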
Example: Textual Complexity

"Twas brillig, and the slithy toves
did gyre and gimble in the wabe..."

p(w) = 2/13 for w = "and", w = "the"
p(w) = 1/13 for the other 9 words

H(p) = -Σ_w p(w) log_2 p(w)
     = -2(2/13) log_2(2/13) - 9(1/13) log_2(1/13)
     = 3.4 bits/word
Example: Textual Complexity

"How much wood would a woodchuck chuck if a woodchuck could chuck wood?"

p(w) = 2/13 for "wood", "a", "woodchuck", "chuck"
p(w) = 1/13 for the other 5 words

H(p) = -Σ_w p(w) log_2 p(w)
     = -4(2/13) log_2(2/13) - 5(1/13) log_2(1/13)
     = 3.0 bits/word
Example: The Speech Channel

"How much wood wood a wood chuck chuck if a wood chuck could chuck wood?"

p(w) = 5/15 for w = "wood"
p(w) = 4/15 for w = "chuck"
p(w) = 2/15 for w = "a"
p(w) = 1/15 for "how", "much", "if", "could"

H(p) = -Σ_w p(w) log_2 p(w) = 2.5 bits/word
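The three text examples can be reproduced by estimating p(w) from word counts. A sketch (the helper name is mine; the computed values match the slides' quoted 3.4, 3.0, and 2.5 bits/word up to rounding):

```python
from collections import Counter
import math

def text_entropy(words):
    """Unigram entropy of a text: -sum_w p(w) log2 p(w), with p(w) = count(w)/N."""
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

jabberwocky = "twas brillig and the slithy toves did gyre and gimble in the wabe".split()
woodchuck = "how much wood would a woodchuck chuck if a woodchuck could chuck wood".split()
degraded = "how much wood wood a wood chuck chuck if a wood chuck could chuck wood".split()

print(text_entropy(jabberwocky))  # ~3.39 bits/word
print(text_entropy(woodchuck))    # ~3.09 bits/word
print(text_entropy(degraded))     # ~2.47 bits/word
```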
Conditional Entropy (Equivocation)

Conditional Entropy = Average Conditional Information

Suppose that p(w|h) is "conditional" upon some history variable h. Then the information provided by w is

I(w|h) = -log_2 p(w|h)

Suppose that we also know the probability distribution of the history variable, p(h). The joint probability of w and h is

p(w,h) = p(w|h) p(h)

The average information provided by any word, w, averaged across all possible histories, is the "conditional entropy" H(p(w|h)):

H(p(w|h)) = Σ_w Σ_h p(w,h) I(w|h)
          = -Σ_h p(h) Σ_w p(w|h) log_2 p(w|h)
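A direct implementation of this double sum, run on a hypothetical toy distribution (both the helper name and the numbers are my own, for illustration only):

```python
import math

def conditional_entropy(p_h, p_w_given_h):
    """H(p(w|h)) = -sum_h p(h) sum_w p(w|h) log2 p(w|h).
    p_h maps h -> p(h); p_w_given_h maps h -> {w: p(w|h)}."""
    h_total = 0.0
    for h, ph in p_h.items():
        for pwh in p_w_given_h[h].values():
            if pwh > 0:
                h_total -= ph * pwh * math.log2(pwh)
    return h_total

# Toy example: two equally likely histories.
p_h = {"h1": 0.5, "h2": 0.5}
p_w_given_h = {
    "h1": {"a": 1.0},            # w fully determined by h1: contributes 0 bits
    "h2": {"a": 0.5, "b": 0.5},  # a fair coin given h2: contributes 1 bit
}
print(conditional_entropy(p_h, p_w_given_h))  # 0.5 bits
```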
Example: Communication

Suppose w and ŵ are not always the same, but we can estimate the probability p(ŵ|w). The entropy of the source is

H(p(w)) = -Σ_w p(w) log_2 p(w)

The conditional entropy of the received information is

H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log_2 p(ŵ|w)

The conditional entropy of the received message given the transmitted message is called Equivocation.
Example: Articulation Testing

The "caller" reads a list of nonsense syllables:
- Miller and Nicely, 1955: only one consonant per utterance is randomized
- Fletcher: CVC syllables, all three phonemes are randomly selected

The "listener" writes down what she hears. The lists are compared to compute p(ŵ|w).

[Diagram: the caller reads "a tug", "a sug", "a fug"; after acoustic noise, babble, reverberation, etc. are added to the speech signal, the listener writes "a tug", "a thug", "a fug".]
Confusion Matrix: Consonants at -6 dB SNR (Miller and Nicely, 1955)

[Table: confusion counts for the consonants p, t, k, f, θ, s, ʃ, b, d, g, v, ð, z, ʒ, m, n. Rows = called consonant (w), columns = perceived consonant (ŵ). The counts concentrate near the diagonal, but at -6 dB SNR each consonant is frequently confused with several others, e.g., /p/, /t/, and /k/ are heard nearly interchangeably.]
Conditional Probabilities p(ŵ|w), 1 significant digit (Miller and Nicely, 1955)

[Table: the same confusion data converted to conditional probabilities p(ŵ|w), rounded to one significant digit. Rows = called consonant (w), columns = perceived consonant (ŵ). A typical row has roughly four nonzero entries in the range 0.1 to 0.4; the nasals are the most reliable, with p(ŵ=m|w=m) ≈ 0.8 and p(ŵ=n|w=n) ≈ 0.7.]
Example: Articulation Testing, -6 dB SNR

At -6 dB SNR, p(ŵ|w) is nonzero for about 4 different possible responses. The equivocation is roughly

H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log_2 p(ŵ|w)
          = -Σ_w (1/18) Σ_ŵ (1/4) log_2(1/4)
          = -18 (1/18) 4 (1/4) log_2(1/4)
          = 2 bits
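This back-of-the-envelope calculation can be checked by building the idealized confusion matrix it assumes: 18 equiprobable called consonants, each heard as one of 4 equally likely responses. A sketch (the function name and the idealized counts are my own):

```python
import math

def equivocation(confusions):
    """H(p(ŵ|w)) from raw confusion counts, assuming the called words w
    are equiprobable. confusions maps w -> {ŵ: count}."""
    n_called = len(confusions)
    h = 0.0
    for row in confusions.values():
        total = sum(row.values())
        for count in row.values():
            p = count / total
            if p > 0:
                h -= (1.0 / n_called) * p * math.log2(p)
    return h

# Idealized -6 dB case: each of 18 consonants draws 4 equally likely responses.
idealized = {w: {"response%d" % j: 1 for j in range(4)} for w in range(18)}
print(equivocation(idealized))  # 2.0 bits
```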
Example: Perfect Transmission

At very high signal-to-noise ratio (for humans, SNR > 18 dB), the listener understands exactly what the talker said:

p(ŵ|w) = 1 (for ŵ = w) or p(ŵ|w) = 0 (for ŵ ≠ w)

So H(p(ŵ|w)) = 0: zero equivocation. Meaning: if you know exactly what the talker said, then that's what you'll write; there is no randomness left.
Example: No Transmission

At very low signal-to-noise ratio (for humans, SNR < -18 dB), the listener doesn't understand anything the talker said:

p(ŵ|w) = p(ŵ)

The listener has no idea what the talker said, so she has to guess. Her guesses follow the natural pattern of the language: she is more likely to write /s/ or /t/ than /ʃ/ or /ð/.

So H(p(ŵ|w)) = H(p(ŵ)) = H(p(w)): the conditional entropy is exactly the same as the original source entropy. Meaning: if you have no idea what the talker said, then you haven't learned anything by listening.
Equivocation as a Function of SNR

[Figure: equivocation (bits) vs. SNR.
Region 1 (SNR below about -18 dB): no transmission, random guessing; equivocation = entropy of the language.
Region 2 (between about -18 dB and +18 dB): equivocation depends on SNR.
Region 3 (SNR above about +18 dB): error-free transmission; equivocation = 0.
Dashed line = entropy of the language, e.g., for 18-consonant articulation testing, source entropy H(p(w)) = log_2 18 bits.]
Mutual Information, Channel Capacity

Definition: Mutual Information

On average, how much information gets correctly transmitted from caller to listener?

"Mutual Information" = Average Information in the Caller's Message, minus Conditional Randomness of the Listener's Perception:

I(p(ŵ,w)) = H(p(w)) - H(p(ŵ|w))
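A numerical sketch of this definition on a hypothetical binary channel (the 0.9/0.1 confusion probabilities are my own, chosen only for illustration):

```python
import math

def entropy(probs):
    """H(p) = -sum p log2 p, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def equivocation(p_w, p_wh_given_w):
    """H(p(ŵ|w)) = -sum_w p(w) sum_ŵ p(ŵ|w) log2 p(ŵ|w)."""
    return -sum(p_w[w] * p * math.log2(p)
                for w in p_w
                for p in p_wh_given_w[w].values() if p > 0)

# Hypothetical channel: "yes"/"no" source, 10% chance each word is misheard.
p_w = {"yes": 0.5, "no": 0.5}
p_wh_given_w = {"yes": {"yes": 0.9, "no": 0.1},
                "no":  {"no": 0.9, "yes": 0.1}}

h_source = entropy(p_w.values())           # 1.0 bit
h_equiv = equivocation(p_w, p_wh_given_w)  # ~0.47 bits
print(h_source - h_equiv)                  # ~0.53 bits correctly transmitted
```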
Listeners use Context to "Guess" the Message

Consider gradually increasing the complexity of the message in a noisy environment:

1. Caller says "yes, yes, no, yes, no." Entropy of the message is 1 bit/word; the listener gets enough from lip-reading to correctly guess every word.
2. Caller says "429986734." Entropy of the message is 3.2 bits/word; the listener can still understand just by lip-reading.
3. Caller says "Let's go scuba diving in Puerto Vallarta this January." The listener effortlessly understands the low-entropy parts of the message ("let's go"), but the high-entropy parts ("scuba diving," "Puerto Vallarta") are completely lost in the noise.
The Mutual Information Ceiling Effect

[Figure: mutual information transmitted (bits) and equivocation (bits) vs. source message entropy (bits).
Region 1: perfect transmission; equivocation = 0 and mutual information = source entropy.
Region 2: mutual information is clipped at an SNR-dependent maximum bit rate called the "channel capacity"; equivocation = source entropy - channel capacity.]
Definition: Channel Capacity

Capacity of a channel = the maximum number of bits per second that may be transmitted, error-free, through that channel:

C = max_p I(p(ŵ,w))

The maximum is taken over all possible source distributions, i.e., over all possible H(p(w)).
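The maximization can be sketched numerically for a binary symmetric channel, using the lecture's definition I = H(p(w)) - H(p(ŵ|w)); the crossover probability `eps` and the simple grid search over source distributions are my own illustration, not the lecture's method:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def mutual_information(q, eps):
    """Lecture's definition I = H(p(w)) - H(p(ŵ|w)) on a binary symmetric
    channel: source distribution (q, 1-q), each word misheard with prob. eps."""
    return entropy([q, 1 - q]) - entropy([eps, 1 - eps])

def capacity(eps, steps=1000):
    """C = max over source distributions; here a simple grid search over q."""
    return max(mutual_information(q / steps, eps) for q in range(1, steps))

print(capacity(0.1))  # ~0.531 bits, achieved at q = 0.5 (equals 1 - H(0.1))
print(capacity(0.5))  # 0.0 bits: a channel that flips half the words carries nothing
```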
Information Theory Jargon Review

Information = unpredictability of a word:
I(w) = -log_2 p(w)

Entropy = average information of the words in a message:
H(p(w)) = -Σ_w p(w) log_2 p(w)

Conditional entropy = average conditional information:
H(p(w|h)) = -Σ_h p(h) Σ_w p(w|h) log_2 p(w|h)

Equivocation = conditional entropy of the received message given the transmitted message:
H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log_2 p(ŵ|w)

Mutual information = entropy minus equivocation:
I(p(ŵ,w)) = H(p(w)) - H(p(ŵ|w))

Channel capacity = maximum mutual information:
C(SNR) = max_p I(p(ŵ,w))
Finite State Language Models

Grammar

The discussion so far has ignored the "history," h_n. How does h_n affect p(w_n|h_n)?

Topics that we won't discuss today, but that computational linguists are working on:
- Dialog context
- Shared knowledge

Topics that we will discuss:
- Previous words in the same utterance: p(w_n|w_1, ..., w_{n-1}) ≠ p(w_n)

A grammar = something that decides whether or not (w_1, ..., w_N) is a possible sentence.
A probabilistic grammar = something that calculates p(w_1, ..., w_N).
Grammar

Definition: A grammar, G, has four parts: G = { S, N, V, P }

N = a set of "non-terminal" nodes
  Example: N = { Sentence, NP, VP, NOU, VER, DET, ADJ, ADV }

V = a set of "terminal" nodes, a.k.a. a "vocabulary"
  Example: V = { how, much, wood, would, a }

S = the non-terminal node that sits at the top of every parse tree
  Example: S = { Sentence }

P = a set of production rules
  Example (CFG in "Chomsky normal form"):
    Sentence = NP VP
    NP = DET NP
    NP = ADJ NP
    NP = NOU
    NOU = wood
    NOU = woodchuck
Types of Grammar

A type 0 ("unrestricted") grammar can have anything on either side of a production rule.

A type 1 ("context sensitive grammar," CSG) has rules of the following form:
  <context1> N <context2> = <context1> STUFF <context2>
where
- <context1> and <context2> are arbitrary unchanged contexts
- N is an arbitrary non-terminal, e.g., "NP"
- STUFF is an arbitrary sequence of terminals and non-terminals, e.g., "the big ADJ NP" would be an acceptable STUFF

A type 2 grammar ("context free grammar," CFG) has rules of the following form:
  N = STUFF

A type 3 grammar ("regular grammar," RG) has rules of the following form:
  N1 = T1 N2
where N1, N2 are non-terminals and T1 is a terminal node, i.e., a word!
- Acceptable example: NP = the NP
- Unacceptable example: Sentence = NP VP
Regular Grammar = Finite State Grammar (Markov Grammar)

Let every non-terminal be a "state" and every production rule a "transition." Example:

S = a S
S = woodchuck VP
VP = could VP
VP = chuck NP
NP = how QP
QP = much NP
NP = wood

[Diagram: a finite-state machine with states S, VP, QP, NP, and END; the edges are labeled with the words a, woodchuck, could, chuck, how, much, and wood, one edge per production rule.]
Probabilistic Finite State Grammar

Every production rule has an associated conditional probability, p(production rule | LHS nonterminal). Example:

S = a S               0.5
S = woodchuck VP      0.5
VP = could VP         0.7
VP = chuck NP         0.3
NP = how QP           0.4
QP = much NP          1.0
NP = wood             0.6

[Diagram: the same finite-state machine with edge probabilities: a (0.5), woodchuck (0.5), could (0.7), chuck (0.3), how (0.4), much (1.0), wood (0.6).]
Calculating the Probability of Text

p("a woodchuck could chuck how much wood")
  = (0.5)(0.5)(0.7)(0.3)(0.4)(1.0)(0.6) = 0.0126

p("woodchuck chuck wood") = (0.5)(0.3)(0.6) = 0.09

p("A woodchuck could chuck how much wood. Woodchuck chuck wood.")
  = (0.0126)(0.09) = 0.001134

p(some very long text corpus) = p(1st sentence) p(2nd sentence) ...
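The grammar can be encoded as a transition table and the products above reproduced. A sketch (the dictionary encoding is my own; "END" marks the accepting state):

```python
# Each state maps an input word to (edge probability, next state).
pfsg = {
    "S":  {"a": (0.5, "S"), "woodchuck": (0.5, "VP")},
    "VP": {"could": (0.7, "VP"), "chuck": (0.3, "NP")},
    "NP": {"how": (0.4, "QP"), "wood": (0.6, "END")},
    "QP": {"much": (1.0, "NP")},
}

def sentence_probability(words, state="S"):
    """Multiply edge probabilities along the (unique) path for the sentence."""
    prob = 1.0
    for w in words:
        p, state = pfsg[state][w]
        prob *= p
    return prob

p1 = sentence_probability("a woodchuck could chuck how much wood".split())
p2 = sentence_probability("woodchuck chuck wood".split())
print(p1)       # ~0.0126
print(p2)       # ~0.09
print(p1 * p2)  # ~0.001134
```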
Cross-Entropy

Cross-entropy of an N-word text, T, given a language model, G:

H(T|G) = -Σ_{n=1}^{N} p(w_n|T) log_2 p(w_n|G, h_n)
       = -(1/N) Σ_{n=1}^{N} log_2 p(w_n|G, h_n)

where
- N = # words in the text
- p(w_n|T) = (# times w_n occurs)/N
- p(w_n|G, h_n) = probability of word w_n given its history h_n, according to language model G
Cross-Entropy: Example

T = "A woodchuck could chuck wood."

H(T|G) = -(1/N) Σ_{n=1}^{N} log_2 p(w_n|G, h_n)
       = -(1/5) { log_2(0.5) + log_2(0.5) + log_2(0.7) + log_2(0.3) + log_2(0.6) }
       = 4.989/5 = 0.998 bits

Interpretation: language model G assigns an entropy of 0.998 bits to the words in text T. This is a very low cross-entropy: G predicts T very well.
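The 0.998-bit figure can be reproduced from the model's per-word probabilities along the parse. A sketch (the helper name is mine):

```python
import math

def cross_entropy(word_probs):
    """H(T|G) = -(1/N) sum_n log2 p(w_n|G, h_n), given the model's
    probability for each word of the text, in order."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

# T = "a woodchuck could chuck wood", scored by the woodchuck PFSG:
# a (0.5), woodchuck (0.5), could (0.7), chuck (0.3), wood (0.6)
print(cross_entropy([0.5, 0.5, 0.7, 0.3, 0.6]))  # ~0.998 bits/word
```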
N-Gram Language Models

An N-gram is just a PFSG (probabilistic finite state grammar) in which each nonterminal is specified by the N-1 most recent terminals.

Definition of an N-gram:

p(w_n|h_n) = p(w_n|w_{n-N+1}, ..., w_{n-1})

The most common choices for N are 0, 1, 2, 3:
- Trigram (3-gram): p(w_n|h_n) = p(w_n|w_{n-2}, w_{n-1})
- Bigram (2-gram): p(w_n|h_n) = p(w_n|w_{n-1})
- Unigram (1-gram): p(w_n|h_n) = p(w_n)
- 0-gram: p(w_n|h_n) = 1/|V|
Example: A Woodchuck Bigram

T = "How much wood would a woodchuck chuck if a woodchuck could chuck wood"

Nonterminals are labeled by w_{n-1}. Edges are labeled with w_n and with p(w_n|w_{n-1}).

[Diagram: bigram state machine. From the start state 0: how (1.0). Then how -> much (1.0), much -> wood (1.0), wood -> would (1.0), would -> a (1.0), a -> woodchuck (1.0), woodchuck -> chuck (0.5) or could (0.5), chuck -> if (0.5) or wood (0.5), if -> a (1.0), could -> chuck (1.0).]
N-Gram: Maximum Likelihood Estimation

An N-gram is defined by its vocabulary V and by the probabilities Θ = { p(w_n|w_{n-N+1}, ..., w_{n-1}) }:

G = { V, Θ }

A text is a (long) string of words, T = { w_1, ..., w_N }. The probability of the text given an N-gram is

p(T|G) = Π_{n=1}^{N} p(w_n|G, w_{n-N+1}, ..., w_{n-1})

Maximizing p(T|G) is equivalent to minimizing the cross-entropy H(T|G) = -(1/N) Σ_{n=1}^{N} log_2 p(w_n|G, w_{n-N+1}, ..., w_{n-1}).

The "maximum likelihood" N-gram model is the set of probabilities, Θ, that maximizes p(T|G).
N-Gram: Maximum Likelihood Estimation

The "maximum likelihood" N-gram model is the set of probabilities, Θ, that maximizes p(T|G). These probabilities are given by:

p(w_n|w_{n-N+1}, ..., w_{n-1}) = N(w_{n-N+1}, ..., w_n) / N(w_{n-N+1}, ..., w_{n-1})

where
- N(w_{n-N+1}, ..., w_n) is the "frequency" of the N-gram w_{n-N+1}, ..., w_n (i.e., the number of times that the N-gram occurs in the data)
- N(w_{n-N+1}, ..., w_{n-1}) is the frequency of the (N-1)-gram w_{n-N+1}, ..., w_{n-1}

This is the set of probabilities you would have guessed, anyway!! For example, the woodchuck bigram assigned

p(w_n|w_{n-1}) = N(w_{n-1}, w_n) / N(w_{n-1})
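The counting formula can be applied directly to the woodchuck text to recover the bigram probabilities shown earlier. A sketch (the function name is mine; histories are counted over the positions that have a successor):

```python
from collections import Counter

def ml_bigram(words):
    """Maximum-likelihood bigram: p(w_n|w_{n-1}) = N(w_{n-1}, w_n) / N(w_{n-1})."""
    pair_counts = Counter(zip(words[:-1], words[1:]))
    hist_counts = Counter(words[:-1])  # every word that has a successor
    return {(prev, w): c / hist_counts[prev]
            for (prev, w), c in pair_counts.items()}

text = "how much wood would a woodchuck chuck if a woodchuck could chuck wood"
p = ml_bigram(text.split())
print(p[("a", "woodchuck")])      # 1.0
print(p[("woodchuck", "chuck")])  # 0.5
print(p[("woodchuck", "could")])  # 0.5
print(p[("chuck", "wood")])       # 0.5
```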
Cross-Entropy of an N-Gram

Cross-entropy of an N-word text, T, given an N-gram G:

H(T|G) = -Σ_{n=1}^{N} p(w_n|T) log_2 p(w_n|G, w_{n-N+1}, ..., w_{n-1})
       = -(1/N) Σ_{n=1}^{N} log_2 p(w_n|G, w_{n-N+1}, ..., w_{n-1})

where
- N = # words in the text
- p(w_n|T) = (# times w_n occurs)/N
- p(w_n|G, w_{n-N+1}, ..., w_{n-1}) = probability of word w_n given its history w_{n-N+1}, ..., w_{n-1}, according to N-gram language model G
Example: A Woodchuck Bigram

T = "a woodchuck could chuck wood."

H(T|G) = -(1/5) { log_2(p("a")) + log_2(1.0) + log_2(0.5) + log_2(1.0) + log_2(0.5) }
       = (2 - log_2 p("a")) / 5
       = 0.4 bits/word, plus (1/5 of) the information of the first word

[Diagram: the same bigram state machine as above.]
Example: A Woodchuck Bigram

The information of the first word must be set by assumption. Common assumptions include:
- Assume that the first word gives zero information (p("a") = 1, -log_2 p("a") = 0 bits); this focuses attention on the bigram.
- First word information given by its unigram probability (p("a") = 2/13, -log_2 p("a") = 2.7 bits); this gives a well-normalized entropy.
- First word information given by its 0-gram probability (p("a") = 1/9, -log_2 p("a") = 3.2 bits); a different well-normalized entropy.
N-Gram: Review

The "maximum likelihood" N-gram model is the set of probabilities, Θ, that maximizes p(T|G). These probabilities are given by:

p(w_n|w_{n-N+1}, ..., w_{n-1}) = N(w_{n-N+1}, ..., w_n) / N(w_{n-N+1}, ..., w_{n-1})

where N(w_{n-N+1}, ..., w_n) is the "frequency" of the N-gram (the number of times it occurs in the data), and N(w_{n-N+1}, ..., w_{n-1}) is the frequency of the (N-1)-gram.

This is the set of probabilities you would have guessed, anyway!! For example, the woodchuck bigram assigned p(w_n|w_{n-1}) = N(w_{n-1}, w_n) / N(w_{n-1}).
Review

Information = unpredictability of a word:
I(w) = -log_2 p(w)

Entropy = average information of the words in a message:
H(p(w)) = -Σ_w p(w) log_2 p(w)

Conditional entropy = average conditional information:
H(p(w|h)) = -Σ_h p(h) Σ_w p(w|h) log_2 p(w|h)

Equivocation = conditional entropy of the received message given the transmitted message:
H(p(ŵ|w)) = -Σ_w p(w) Σ_ŵ p(ŵ|w) log_2 p(ŵ|w)

Mutual information = entropy minus equivocation:
I(p(ŵ,w)) = H(p(w)) - H(p(ŵ|w))

Channel capacity = maximum mutual information:
C(SNR) = max_p I(p(ŵ,w))

Cross-entropy of text T = { w_1, ..., w_N } given language model G:
H(T|G) = -(1/N) Σ_{n=1}^{N} log_2 p(w_n|G, h_n)

Maximum likelihood estimate of an N-gram language model:
p(w_n|w_{n-N+1}, ..., w_{n-1}) = N(w_{n-N+1}, ..., w_n) / N(w_{n-N+1}, ..., w_{n-1})