Language Modeling
Roadmap (for next two classes)
- Review LM evaluation metrics: Entropy, Perplexity
- Smoothing: Good-Turing, Backoff and Interpolation, Absolute Discounting, Kneser-Ney
Language Model Evaluation Metrics
Applications
Entropy and perplexity
- Entropy measures information content, in bits: H(X) = −Σ_x p(x) log₂ p(x) is the expected message length with an ideal code. Use it if you want to measure in bits!
- Cross entropy measures the ability of a trained model to compactly represent test data: the average negative log probability of the test data under the model.
- Perplexity measures the average branching factor: PP = 2^(cross entropy).
Language model perplexity
Recipe: train a language model on training data; get the negative log probabilities of the test data; compute the average; exponentiate! (See the sketch below.)
Perplexity correlates rather well with speech recognition error rates and MT quality metrics.
LM perplexities for word-based models are normally between, say, 50 and 1000.
Need to drop perplexity by a significant fraction (not an absolute amount) to make a visible impact.
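A minimal sketch of that recipe, assuming we already have a per-token log₂ probability for each test word (the helper name and toy numbers are illustrative, not from the slides):

```python
def perplexity(token_logprobs):
    """Perplexity from per-token log2 probabilities of the test data."""
    avg_neg_logprob = -sum(token_logprobs) / len(token_logprobs)  # cross entropy, in bits
    return 2 ** avg_neg_logprob                                   # exponentiate!

# Toy check: a model assigning each test word probability 1/4 (log2 = -2 bits)
print(perplexity([-2.0] * 10))  # -> 4.0, the average branching factor
```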
Parameter estimation
What is it?
Parameter estimation
- Model form is fixed (coin unigrams, word bigrams, …); we have observations: H H H T T H T H H. We want to find the parameters.
- Maximum Likelihood Estimation: pick the parameters that assign the most probability to our training data.
  c(H) = 6; c(T) = 3, so P(H) = 6/9 = 2/3 and P(T) = 3/9 = 1/3
- MLE picks the parameters that are best for the training data… but these don't generalize well to test data – zeros!
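A tiny sketch of the coin MLE by relative frequency (nothing beyond standard-library counting):

```python
from collections import Counter

observations = list("HHHTTHTHH")
counts = Counter(observations)                       # c(H) = 6, c(T) = 3
total = sum(counts.values())
mle = {outcome: c / total for outcome, c in counts.items()}
print(mle)  # {'H': 0.666..., 'T': 0.333...}; any unseen outcome would get probability 0
```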
Smoothing
Take mass from seen events and give it to unseen events: Robin Hood for probability models.
MLE is at one end of the spectrum; the uniform distribution is at the other.
Need to pick a happy medium, and yet maintain a proper distribution.
Smoothing techniques
Laplace, Good-Turing, Backoff, Mixtures, Interpolation, Kneser-Ney
Laplace
From MLE: P(w_i | w_{i−1}) = c(w_{i−1} w_i) / c(w_{i−1})
To Laplace (add-one): P(w_i | w_{i−1}) = (c(w_{i−1} w_i) + 1) / (c(w_{i−1}) + V), where V is the vocabulary size.
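A small sketch of the add-one estimate on a toy corpus (variable names are illustrative):

```python
from collections import Counter

def laplace_bigram_prob(w_prev, w, bigram_counts, unigram_counts, vocab_size):
    """Add-one (Laplace) estimate of P(w | w_prev)."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + vocab_size)

tokens = "the cat sat on the mat".split()
bigrams, unigrams = Counter(zip(tokens, tokens[1:])), Counter(tokens)
V = len(unigrams)
print(laplace_bigram_prob("the", "cat", bigrams, unigrams, V))  # seen bigram
print(laplace_bigram_prob("the", "dog", bigrams, unigrams, V))  # unseen bigram, but now > 0
```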
Good-Turing Smoothing
New idea: Use counts of things you have seen to estimate those you haven’t
Good-Turing: Josh Goodman's Intuition
Imagine you are fishing. There are 8 species: carp, perch, whitefish, trout, salmon, eel, catfish, bass.
You have caught: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish.
How likely is it that the next fish caught is from a new species (one not seen in our previous catch)? 3/18
Assuming so, how likely is it that the next species is trout? It must be less than 1/18.
Slide adapted from Josh Goodman, Dan Jurafsky
Some more hypotheticals

| Species  | Puget Sound | Lake Washington | Greenlake |
|----------|-------------|-----------------|-----------|
| Salmon   | 8           | 12              | 0         |
| Trout    | 3           | 1               | 1         |
| Cod      | 1           | 1               | 0         |
| Rockfish | 1           | 0               | 0         |
| Snapper  | 1           | 0               | 0         |
| Skate    | 1           | 0               | 0         |
| Bass     | 0           | 1               | 14        |
| TOTAL    | 15          | 15              | 15        |
How likely is it to find a new fish in each of these places?
Good-Turing Smoothing
New idea: use counts of things you have seen to estimate those you haven't.
Good-Turing approach: use the frequency of singletons to re-estimate the frequency of zero-count n-grams.
Notation: N_c is the frequency of frequency c, i.e. the number of n-grams which appear c times. N_0 is the number of n-grams with count 0; N_1 is the number of n-grams with count 1.
![Page 25: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/25.jpg)
Good-Turing Smoothing
Estimate the probability of things which occur c times with the probability of things which occur c+1 times.
Discounted counts: steal mass from seen cases to provide for the unseen:
MLE: P(x) = c(x) / N
GT: c* = (c + 1) · N_{c+1} / N_c, so P_GT(x) = c*(x) / N; the total probability of unseen events is N_1 / N.
GT Fish Example
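A minimal sketch of the c* recipe that reproduces the fish-example numbers from the counts above (the fallback for the largest count is an assumption, discussed under Modifications below):

```python
from collections import Counter

def good_turing_counts(counts):
    """Discounted counts c* = (c+1) * N_{c+1} / N_c, plus the unseen-event mass N_1 / N."""
    N = sum(counts.values())
    freq_of_freq = Counter(counts.values())                    # N_c
    c_star = {}
    for event, c in counts.items():
        n_c, n_c1 = freq_of_freq[c], freq_of_freq.get(c + 1, 0)
        c_star[event] = (c + 1) * n_c1 / n_c if n_c1 else c    # keep c if N_{c+1} = 0
    return c_star, freq_of_freq.get(1, 0) / N

fish = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
c_star, p_unseen = good_turing_counts(fish)
print(p_unseen)         # 3/18: the chance the next fish is a new species
print(c_star["trout"])  # (1+1) * N_2 / N_1 = 2/3, so P_GT(trout) ≈ 0.67/18 < 1/18
```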
Enough about the fish… how does this relate to language?
Name some linguistic situations where the number of new words would differ.
- Different languages: Chinese has almost no morphology; Turkish has a lot of morphology – lots of new words in Turkish!
- Different domains: airplane maintenance manuals use a controlled vocabulary; random web posts use an uncontrolled vocabulary.
![Page 30: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/30.jpg)
Bigram Frequencies of Frequencies and GT Re-estimates
![Page 31: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/31.jpg)
Good-Turing Smoothing
From n-gram counts to conditional probability: P(w_i | w_{i−1}) = c*(w_{i−1} w_i) / c(w_{i−1})
Use c* from the GT estimate.
![Page 32: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/32.jpg)
Additional Issues in Good-Turing
General approach: the estimate of c* for N_c depends on N_{c+1}.
What if N_{c+1} = 0? More zero-count problems. Not uncommon: e.g. in the fish example, there are no species with count 4.
![Page 33: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/33.jpg)
Modifications
- Simple Good-Turing: compute the N_c bins, then smooth the N_c values to replace zeroes. Fit a linear regression in log space: log(N_c) = a + b · log(c). (A sketch follows below.)
- What about large c's? They should be reliable, so assume c* = c if c is large, e.g. c > k (Katz: k = 5).
- Typically combined with other approaches.
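A rough sketch of the log-space regression for smoothing the N_c bins (least squares by hand, just to show the idea; a real Simple Good-Turing implementation takes more care in switching between raw and smoothed values):

```python
import math

def smooth_freq_of_freq(freq_of_freq):
    """Fit log(N_c) = a + b*log(c) and return a function giving a smoothed N_c for any c."""
    cs = sorted(freq_of_freq)
    xs = [math.log(c) for c in cs]
    ys = [math.log(freq_of_freq[c]) for c in cs]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return lambda c: math.exp(a + b * math.log(c))

# Fish-example bins: N_1 = 3, N_2 = 1, N_3 = 1, N_10 = 1
smoothed = smooth_freq_of_freq({1: 3, 2: 1, 3: 1, 10: 1})
print(smoothed(4))  # N_4 is no longer zero, so c* for c = 3 is defined
```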
![Page 34: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/34.jpg)
Backoff and Interpolation
Another really useful source of knowledge: if we are estimating the trigram p(z|x,y) but count(xyz) is zero, use info from the bigram p(z|y), or even the unigram p(z).
How do we combine this trigram, bigram, and unigram info in a valid fashion?
![Page 37: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/37.jpg)
Backoff vs. Interpolation
Backoff: use trigram if you have it, otherwise bigram, otherwise unigram
Interpolation: always mix all three
![Page 39: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/39.jpg)
39
Backoff
Bigram distribution: P(z|y) = c(yz) / c(y)
But this could be zero… What if we fell back (or "backed off") to a unigram distribution, P(z) = c(z) / N?
That also could be zero…
![Page 40: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/40.jpg)
40
Backoff
What's wrong with this distribution (bigram estimate when it is nonzero, otherwise the unigram)?
It doesn't sum to one! Need to steal mass…
![Page 41: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/41.jpg)
41
Backoff
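A Katz-style version of that fix discounts the seen bigrams and hands the stolen mass α(y) to the unigram backoff. A minimal sketch, with an assumed absolute discount and illustrative names:

```python
from collections import Counter

def backoff_bigram_prob(y, z, bigrams, unigrams, discount=0.5):
    """Katz-style backoff with absolute discounting (a sketch, not the exact slide formula)."""
    N = sum(unigrams.values())
    if bigrams[(y, z)] > 0:
        return (bigrams[(y, z)] - discount) / unigrams[y]      # discounted bigram estimate
    # Mass stolen from the seen bigrams that start with y
    seen = [w for (prev, w) in bigrams if prev == y]
    alpha = discount * len(seen) / unigrams[y]
    # Renormalize the unigram backoff over words not seen after y
    unseen_mass = sum(unigrams[w] for w in unigrams if w not in seen) / N
    return alpha * (unigrams[z] / N) / unseen_mass

tokens = "the cat sat on the mat".split()
bigrams, unigrams = Counter(zip(tokens, tokens[1:])), Counter(tokens)
print(backoff_bigram_prob("the", "cat", bigrams, unigrams))  # seen: discounted MLE
print(backoff_bigram_prob("the", "sat", bigrams, unigrams))  # unseen: backed-off unigram
```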
![Page 42: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/42.jpg)
42
Mixtures
Given distributions P₁ and P₂, pick any number λ between 0 and 1: λ·P₁ + (1 − λ)·P₂ is a distribution. (Laplace is a mixture!)
![Page 43: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/43.jpg)
Interpolation
Simple interpolation: P̂(z|x,y) = λ₁·P(z|x,y) + λ₂·P(z|y) + λ₃·P(z), with the λs summing to 1.
Or, pick the interpolation values based on the context: λᵢ(x,y).
Intuition: higher weight on more frequent n-grams. (A sketch follows below.)
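A minimal sketch of simple interpolation (the lookup functions and λ values are placeholders; the λs would be tuned on held-out data as described next):

```python
def interpolated_trigram_prob(x, y, z, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates."""
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-9      # must remain a distribution
    return l1 * p_tri(x, y, z) + l2 * p_bi(y, z) + l3 * p_uni(z)
```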
![Page 44: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/44.jpg)
How to Set the Lambdas?
Use a held-out, or development, corpus. Choose the lambdas which maximize the probability of the held-out data: fix the n-gram probabilities, then search for lambda values that, when plugged into the previous equation, give the largest probability for the held-out set. Can use EM to do this search.
![Page 45: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/45.jpg)
Kneser-Ney Smoothing
Most commonly used modern smoothing technique. Intuition: improving backoff.
I can't see without my reading…… Compare P(Francisco|reading) vs. P(glasses|reading).
P(Francisco|reading) backs off to P(Francisco); even where P(glasses|reading) > 0, the high unigram frequency of Francisco can make the backed-off estimate larger than P(glasses|reading).
However, Francisco appears in few contexts, glasses in many.
Interpolate based on the number of contexts: words seen in more contexts are more likely to appear in others.
![Page 50: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/50.jpg)
50
Kneser-Ney Smoothing: bigrams
Modeling the diversity of contexts: count how many distinct words precede w, |{v : c(v, w) > 0}|.
So the continuation probability is P_continuation(w) = |{v : c(v, w) > 0}| / |{(u, v) : c(u, v) > 0}|.
Kneser-Ney Smoothing: bigrams
Backoff: P_KN(w_i | w_{i−1}) = (c(w_{i−1} w_i) − d) / c(w_{i−1}) if c(w_{i−1} w_i) > 0, otherwise α(w_{i−1}) · P_continuation(w_i), where d is an absolute discount and α(w_{i−1}) normalizes the leftover mass.
Kneser-Ney Smoothing: bigrams
Interpolation: P_KN(w_i | w_{i−1}) = max(c(w_{i−1} w_i) − d, 0) / c(w_{i−1}) + λ(w_{i−1}) · P_continuation(w_i), where λ(w_{i−1}) = (d / c(w_{i−1})) · |{w : c(w_{i−1}, w) > 0}|.
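A minimal sketch of the interpolated form on a toy corpus (variable names are assumptions; a production implementation would also handle unseen contexts and higher orders):

```python
from collections import Counter

def kneser_ney_bigram(bigrams, discount=0.75):
    """Interpolated Kneser-Ney bigram estimator (a minimal sketch)."""
    context_counts = Counter(prev for prev, _ in bigrams.elements())   # c(prev)
    continuations = Counter(w for (_, w) in bigrams)        # distinct left contexts per word
    followers = Counter(prev for (prev, _) in bigrams)      # distinct continuations per context
    total_bigram_types = len(bigrams)

    def prob(prev, word):
        p_cont = continuations[word] / total_bigram_types
        lam = discount * followers[prev] / context_counts[prev]
        discounted = max(bigrams[(prev, word)] - discount, 0) / context_counts[prev]
        return discounted + lam * p_cont

    return prob

tokens = "the cat sat on the mat the cat ran".split()
p_kn = kneser_ney_bigram(Counter(zip(tokens, tokens[1:])))
print(p_kn("the", "cat"))  # seen bigram: discounted count plus continuation mass
print(p_kn("the", "ran"))  # unseen bigram: continuation probability takes over
```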
OOV words: <UNK> word
Out Of Vocabulary = OOV words. We don't use GT smoothing for these, because GT assumes we know the number of unseen events.
Instead: create an unknown word token <UNK>.
Training of <UNK> probabilities: create a fixed lexicon L of size V; at the text normalization phase, any training word not in L is changed to <UNK>; now we train its probabilities like a normal word.
At decoding time, for text input: use the <UNK> probabilities for any word not seen in training, plus an additional penalty! <UNK> predicts the class of unknown words; then we need to pick a member.
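A small sketch of the normalization step (building L from the most frequent training words is one common choice, not something the slides prescribe):

```python
from collections import Counter

def replace_oov(tokens, lexicon):
    """Map any token outside the fixed lexicon L to the <UNK> symbol."""
    return [t if t in lexicon else "<UNK>" for t in tokens]

train = "the cat sat on the mat the cat ran".split()
V = 3
lexicon = {w for w, _ in Counter(train).most_common(V)}
print(replace_oov(train, lexicon))                    # training words outside L become <UNK>
print(replace_oov("the dog ran".split(), lexicon))    # unseen test words also map to <UNK>
```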
![Page 58: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/58.jpg)
Class-Based Language Models
Variant of n-gram models using classes or clusters.
Motivation: sparseness. Flight app.: instead of P(ORD|to), P(JFK|to), …, use P(airport_name|to). Relate the probability of an n-gram to word classes and a class n-gram.
IBM clustering: assume each word is in a single class; P(w_i|w_{i−1}) ≈ P(c_i|c_{i−1}) × P(w_i|c_i). Learn by MLE from data.
Where do classes come from? Hand-designed for the application (e.g. ATIS), or automatically induced clusters from the corpus.
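A toy sketch of the IBM-style class-bigram factorization (the class names and probabilities are illustrative only):

```python
def class_bigram_prob(w_prev, w, word2class, p_class_bigram, p_word_given_class):
    """IBM-style class bigram: P(w_i | w_{i-1}) ~ P(c_i | c_{i-1}) * P(w_i | c_i)."""
    c_prev, c = word2class[w_prev], word2class[w]
    return p_class_bigram[(c_prev, c)] * p_word_given_class[(w, c)]

# Hand-designed classes for a flight app (illustrative numbers, not estimated from data)
word2class = {"to": "PREP", "ORD": "AIRPORT", "JFK": "AIRPORT"}
p_class_bigram = {("PREP", "AIRPORT"): 0.4}
p_word_given_class = {("ORD", "AIRPORT"): 0.5, ("JFK", "AIRPORT"): 0.5}
print(class_bigram_prob("to", "ORD", word2class, p_class_bigram, p_word_given_class))  # 0.2
```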
![Page 63: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/63.jpg)
LM Adaptation
Challenge: need an LM for a new domain, but have little in-domain data.
Intuition: much of language is pretty general, so we can build from a 'general' LM + in-domain data.
Approach: LM adaptation. Train on a large domain-independent corpus, then adapt with the small in-domain data set.
What large corpus? Web counts! e.g. Google n-grams.
![Page 67: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/67.jpg)
Incorporating Longer Distance Context
Why use longer context? N-grams are an approximation, limited by model size and sparseness.
What sorts of information live in longer context? Priming, topic, sentence type, dialogue act, syntax.
![Page 71: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/71.jpg)
Long Distance LMs
Bigger n! With 284M words of training data, n-grams up to n = 6 improve; n = 7-20 do no better.
Cache n-gram: intuition is priming – a word used previously is more likely to be used again. Incrementally build a 'cache' unigram model on the test corpus and mix it with the main n-gram LM. (A sketch follows below.)
Topic models: intuition is that text is about some topic, and on-topic words are likely. P(w|h) ≈ Σ_t P(w|t) P(t|h)
Non-consecutive n-grams: skip n-grams, triggers, variable-length n-grams.
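A minimal sketch of the cache idea (the main-model interface and the mixing weight are assumptions):

```python
from collections import Counter

class CacheLM:
    """Cache LM sketch: mix a main model with a unigram cache of recently seen test words."""

    def __init__(self, main_prob, cache_weight=0.1):
        self.main_prob = main_prob          # assumed callable: P_main(word | history)
        self.cache_weight = cache_weight
        self.cache = Counter()

    def prob(self, word, history):
        p_cache = self.cache[word] / sum(self.cache.values()) if self.cache else 0.0
        return (1 - self.cache_weight) * self.main_prob(word, history) + \
               self.cache_weight * p_cache

    def observe(self, word):
        self.cache[word] += 1               # incrementally grow the cache as we decode
```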
![Page 75: Language Modeling 1. Roadmap (for next two classes) Review LM evaluation metrics Entropy Perplexity Smoothing Good-Turing Backoff and Interpolation](https://reader033.vdocuments.mx/reader033/viewer/2022052603/56649c6d5503460f9491f9da/html5/thumbnails/75.jpg)
Language Models
N-gram models: a finite approximation of an infinite context history.
Issues: zeroes and other sparseness.
Strategies: smoothing (add-one, add-δ, Good-Turing, etc.); use partial n-grams (interpolation, backoff).
Refinements: class, cache, topic, trigger LMs.