TRANSCRIPT
NLP Language Models
• Language Models (LM)
• Noisy Channel model
• Simple Markov Models
• Smoothing
Statistical Language Models
Two Main Approaches to NLP
• Knowledge-based (AI)
• Statistical models
- Inspired by speech recognition: probability of the next word based on the previous ones
- Statistical models are different from statistical methods
Probability Theory
• Let X be the uncertain outcome of some event; X is called a random variable.
• V(X) is the finite set of possible outcomes (not necessarily real numbers).
• P(X = x) is the probability of the particular outcome x (x ∈ V(X)).
• Example: X is the disease of your patient, V(X) the set of all possible diseases.
Conditional probability: the probability of the outcome of an event given the outcome of a second event.
Example: we pick two consecutive words at random from a book. We know the first word is "the" and we want the probability that the second word is "dog":
P(W2 = dog | W1 = the) = |W1 = the, W2 = dog| / |W1 = the|
Bayes’s law: P(x|y) = P(x) P(y|x) / P(y)
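As a minimal sketch (assuming a whitespace-tokenized corpus held in a Python list; the toy sentence below is made up), the conditional probability P(W2 = dog | W1 = the) can be estimated by counting adjacent word pairs:

```python
def cond_prob(tokens, first, second):
    """Estimate P(W2 = second | W1 = first) from counts of adjacent word pairs."""
    pair_count = sum(1 for w1, w2 in zip(tokens, tokens[1:]) if (w1, w2) == (first, second))
    first_count = tokens[:-1].count(first)     # occurrences of `first` that have a following word
    return pair_count / first_count if first_count else 0.0

tokens = "the dog saw the cat and the dog barked".split()
print(cond_prob(tokens, "the", "dog"))         # two of the three occurrences of "the" precede "dog" -> 0.667
```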
Bayes’s law: P(x|y) = P(x) P(y|x) / P(y)
P(disease | symptom) = P(disease) P(symptom | disease) / P(symptom)
P(w1,n | speech signal) = P(w1,n) P(speech signal | w1,n) / P(speech signal)
We only need to maximize the numerator (the denominator does not depend on w1,n).
P(speech signal | w1,n) expresses how well the speech signal fits the sequence of words w1,n.
Useful generalizations of Bayes’s law
- To find the probability of something happening, compute the probability that it happens given some second event times the probability of that second event.
- P(w,x | y,z) = P(w,x) P(y,z | w,x) / P(y,z)
  where w, x, y, z are separate events (e.g., each taking a word as its value)
- Chain rule: P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)
  also when conditioned on some event x:
  P(w1,w2,…,wn|x) = P(w1|x) P(w2|w1,x) … P(wn|w1,…,wn-1,x)
Statistical Model of a Language
• Statistical models of sequences of words (sentences) – language models
• Assign a probability to all possible sequences of words.
• For sequences of words of length n, assign a number to P(W1,n = w1,n), where w1,n is a sequence of n random variables w1, w2, …, wn, each taking any word as its value.
• Computing the probability of the next word is closely related to computing the probability of a sequence:
P(the big dog) = P(the) P(big | the) P(dog | the big)
P(the pig dog) = P(the) P(pig | the) P(dog | the pig)
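A minimal sketch of this chain-rule decomposition (the conditional probabilities below are made up purely for illustration, not estimated from any corpus):

```python
# P(w1..wn) = P(w1) * P(w2 | w1) * ... * P(wn | w1..wn-1)
cond_prob = {                                   # hypothetical values, keyed by (history, word)
    ((), "the"): 0.06,
    (("the",), "big"): 0.01,
    (("the", "big"), "dog"): 0.02,
}

def sequence_prob(words, cond_prob):
    p, history = 1.0, ()
    for w in words:
        p *= cond_prob[(history, w)]            # multiply in P(word | full history so far)
        history += (w,)
    return p

print(sequence_prob(["the", "big", "dog"], cond_prob))   # 0.06 * 0.01 * 0.02
```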
Ngram Model
• Simple but durable statistical model
• Useful to identify words in noisy, ambiguous input.
• Speech recognition: many input speech sounds are similar and confusable.
• Machine translation, spelling correction, handwriting recognition, predictive text input
• Other NLP tasks: part of speech tagging, NL generation, word similarity
CORPORA
• Corpora (singular corpus) are online collections of text or speech.
• Brown Corpus: a 1 million word collection from 500 written texts from different genres (newspaper, novels, academic).
• Punctuation can be treated as words.
• Switchboard corpus: 2,430 telephone conversations averaging 6 minutes each; 240 hours of speech and 3 million words.
• There is no punctuation.
• There are disfluencies: filled pauses and fragments, e.g. “I do uh main- mainly business data processing”
Training and Test Sets
• The probabilities of an N-gram model come from the corpus it is trained on.
• Data in the corpus is divided into a training set (or training corpus) and a test set (or test corpus).
• Perplexity: a measure used to compare statistical models on the test set.
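As a reminder of the standard definition (not spelled out on this slide), the perplexity of a model on a test set w1 … wN is P(w1 … wN)^(-1/N); a minimal sketch:

```python
import math

def perplexity(log_probs):
    """Perplexity from the per-word log probabilities (natural log) assigned to a test set."""
    return math.exp(-sum(log_probs) / len(log_probs))

# e.g. three test words to which the model assigned probabilities 0.1, 0.2 and 0.05
print(perplexity([math.log(0.1), math.log(0.2), math.log(0.05)]))   # ≈ 10.0
```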
Ngram Model
• How can we compute the probability of an entire sequence P(w1,w2,…,wn)?
• Decomposition using the chain rule of probability:
P(w1,w2,…,wn) = P(w1) P(w2|w1) P(w3|w1,w2) … P(wn|w1,…,wn-1)
• Assigns a conditional probability to possible next words considering the history.
• Markov assumption: we can predict the probability of some future unit without looking too far into the past.
• Bigrams consider only the previous unit; trigrams, the two previous units; n-grams, the previous n-1 units.
Ngram Model
• Assigns a conditional probability to possible next words. Only the n-1 previous words have an effect on the probability of the next word.
• For n = 3 (trigrams): P(wn|w1,…,wn-1) = P(wn|wn-2,wn-1)
• How do we estimate these trigram or N-gram probabilities?
Maximum Likelihood Estimation (MLE):
maximize the likelihood of the training set T given the model M, P(T|M).
• To create the model, use a training text (corpus), taking counts and normalizing them so they lie between 0 and 1.
Ngram Model
• For n = 3 (trigrams):
P(wn|w1,…,wn-1) = P(wn|wn-2,wn-1)
• To create the model, use a training text and record the pairs and triples of words that appear in the text and how many times.
P(wi|wi-2,wi-1) = C(wi-2,wi-1,wi) / C(wi-2,wi-1)
P(submarine | the, yellow) = C(the, yellow, submarine) / C(the, yellow)
Relative frequency: the observed frequency of a particular sequence divided by the observed frequency of its prefix.
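A minimal sketch of this counting-and-normalizing step (assuming a whitespace-tokenized training text; the toy text below is made up):

```python
from collections import Counter

def train_trigram_mle(tokens):
    """MLE trigram model: P(w3 | w1, w2) = C(w1 w2 w3) / C(w1 w2)."""
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    bigrams = Counter(zip(tokens, tokens[1:]))
    def prob(w1, w2, w3):
        return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0
    return prob

p = train_trigram_mle("we all live in the yellow submarine the yellow submarine".split())
print(p("the", "yellow", "submarine"))   # C(the yellow submarine) = 2 / C(the yellow) = 2 -> 1.0
```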
Language Models
• Statistical Models: Language Models (LM)
• Vocabulary V; a word w ∈ V
• Language L; a sentence s ∈ L; L ⊆ V* (usually infinite)
• s = w1, …, wN
• Probability of s: P(s)
Noisy Channel Model
message W → encoder → X (input to channel) → channel p(y|x) → Y (output from channel) → decoder → W* (attempt to reconstruct the message based on the output)
Noisy Channel Model in NLP
• In NLP we usually do not act on the coding. The problem is reduced to decoding: finding the most likely input given the output,
Î = argmax_I p(I) p(O|I)
I → noisy channel p(O|I) → O → decoder → Î
Î = argmax_I p(I) p(O|I)
p(I): Language Model; p(O|I): Channel Model
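A minimal sketch of this decoding rule (the candidate list and both probability tables below are made up, in the spirit of a spelling-correction example):

```python
def decode(observed, candidates, lm_prob, channel_prob):
    """Pick the input I that maximizes p(I) * p(O|I)."""
    return max(candidates, key=lambda i: lm_prob(i) * channel_prob(observed, i))

candidates = ["the", "thee", "them"]
lm_prob = {"the": 0.05, "thee": 0.0001, "them": 0.01}.get               # p(I), hypothetical
channel_prob = lambda o, i: {"the": 0.3, "thee": 0.6, "them": 0.05}[i]  # p("teh" | I), hypothetical
print(decode("teh", candidates, lm_prob, channel_prob))                 # -> "the"
```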
real language X → noisy channel → observed language Y
We want to retrieve X from Y.
Example: correct text (real language X) → noisy channel (errors) → text with errors (observed language Y)
Example: correct text (real language X) → noisy channel (space removing) → text without spaces (observed language Y)
Example: text (real language X, modeled by the language model) → noisy channel (acoustic model) → speech (observed language Y)
Example: machine translation — translation acts as the noisy channel; the source language and the target language play the roles of the real language X and the observed language Y.
Example: ASR (Automatic Speech Recognizer)
acoustic chain X1 … XT → ASR → word chain w1 … wN
p(word chain): language model; p(acoustic chain | word chain): acoustic model
Example: Machine Translation
p(X): Target Language Model; p(Y|X): Translation Model
• Naive implementation: enumerate every s ∈ L and compute p(s); the model has |L| parameters.
• But L is usually infinite. How do we estimate the parameters?
• Simplifications:
• History: hi = {w1, …, wi-1}
• Markov Models
• Markov Models of order n-1 (n-gram models):
• P(wi|hi) = P(wi|wi-n+1, …, wi-1)
• 0-gram
• 1-gram: P(wi|hi) = P(wi)
• 2-gram: P(wi|hi) = P(wi|wi-1)
• 3-gram: P(wi|hi) = P(wi|wi-2,wi-1)
• n large: more context information (more discriminative power)
• n small: more cases in the training corpus (more reliable estimates)
• Selecting n: e.g. for |V| = 20,000

n            num. parameters
2 (bigrams)  400,000,000 (= 20,000^2)
3 (trigrams) 8,000,000,000,000 (= 20,000^3)
4 (4-grams)  1.6 × 10^17 (= 20,000^4)
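These figures are just |V|^n, the parameter count stated right below; a quick check in Python (vocabulary size taken from the slide):

```python
V = 20_000
for n in (2, 3, 4):
    print(n, f"{V ** n:.1e}")   # 4.0e+08, 8.0e+12, 1.6e+17 parameters
```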
• Parameters of an n-gram model: |V|^n
• MLE estimation from a training corpus
• Problem of sparseness
• 1-gram model: P_MLE(w) = C(w) / N
• 2-gram model: P_MLE(wi|wi-1) = C(wi-1 wi) / C(wi-1)
• 3-gram model: P_MLE(wi|wi-2,wi-1) = C(wi-2 wi-1 wi) / C(wi-2 wi-1)
True probability distribution
The seen cases are overestimated; the unseen ones get a null probability.
Save a part of the probability mass from the seen cases and assign it to the unseen ones.
SMOOTHING
• Some methods operate on the counts: Laplace, Lidstone, Jeffreys-Perks
• Some methods operate on the probabilities: Held-Out, Good-Turing, discounting
• Some methods combine models: linear interpolation, back-off
Laplace (add 1)
P_Laplace(w1 … wn) = (C(w1 … wn) + 1) / (N + B)
P = probability of an n-gram
C = count of the n-gram in the training corpus
N = total number of n-grams in the training corpus
B = number of parameters of the model (possible n-grams)
Lidstone (generalization of Laplace)
P_Lid(w1 … wn) = (C(w1 … wn) + λ) / (N + B·λ)
λ = a small positive number
MLE: λ = 0; Laplace: λ = 1; Jeffreys-Perks: λ = ½
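A minimal sketch of Lidstone smoothing over bigrams (toy corpus, made up for illustration); λ = 1 gives Laplace and λ = ½ Jeffreys-Perks:

```python
from collections import Counter

def lidstone_prob(counts, total_n, num_possible, lam=0.5):
    """Lidstone-smoothed probability of an n-gram: (C + lambda) / (N + B*lambda)."""
    def prob(ngram):
        return (counts[ngram] + lam) / (total_n + num_possible * lam)
    return prob

tokens = "the cat saw the dog".split()
bigrams = Counter(zip(tokens, tokens[1:]))
vocab = set(tokens)
p = lidstone_prob(bigrams, total_n=len(tokens) - 1, num_possible=len(vocab) ** 2, lam=1.0)
print(p(("the", "cat")))   # seen bigram: (1+1)/(4+16) = 0.10
print(p(("dog", "cat")))   # unseen bigram still gets (0+1)/(4+16) = 0.05
```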
Held-Out
• Compute the percentage of the probability mass that has to be reserved for the n-grams unseen in the training corpus.
• We separate a held-out corpus from the training corpus.
• We count how often the n-grams unseen in the training corpus occur in the held-out corpus.
• An alternative to using a held-out corpus is Cross-Validation: held-out interpolation, deleted interpolation.
Held-Out
Let w1 … wn be an n-gram:
r = C(w1 … wn)
C1(w1 … wn) = count of the n-gram in the training set
C2(w1 … wn) = count of the n-gram in the held-out set
Nr = number of n-grams with count r in the training set
Tr = Σ over {w1 … wn : C1(w1 … wn) = r} of C2(w1 … wn)
P_ho(w1 … wn) = Tr / (Nr · N)
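A minimal sketch of this estimator over unigrams (toy training and held-out texts, made up; only n-grams seen in training are scored here):

```python
from collections import Counter

def held_out_probs(train_tokens, heldout_tokens):
    """Held-out estimate: every n-gram with training count r gets P_ho = Tr / (Nr * N)."""
    c1 = Counter(train_tokens)            # C1: counts in the training set
    c2 = Counter(heldout_tokens)          # C2: counts in the held-out set
    n_r = Counter(c1.values())            # Nr: number of types with training count r
    t_r = Counter()                       # Tr: held-out occurrences of those types
    for w, r in c1.items():
        t_r[r] += c2[w]
    n = len(train_tokens)                 # N
    return {w: t_r[r] / (n_r[r] * n) for w, r in c1.items()}

print(held_out_probs("a a a b b c".split(), "a a b c c d".split()))
```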
Good-Turing
r* = adjusted count
Nr = number of n-gram types occurring r times
E(Nr) = expected value of Nr
E(Nr+1) < E(Nr) (Zipf's law)
r* = (r + 1) · E(Nr+1) / E(Nr)
P_GT = r* / N
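A minimal sketch of the adjusted counts, using the observed Nr as a stand-in for E(Nr) (a real implementation would smooth the Nr counts first; the toy data are made up):

```python
from collections import Counter

def good_turing_counts(ngram_counts):
    """r* = (r + 1) * N_{r+1} / N_r, with the observed N_r used in place of E(N_r)."""
    n_r = Counter(ngram_counts.values())          # N_r: number of types seen exactly r times
    adjusted = {}
    for ngram, r in ngram_counts.items():
        if n_r[r + 1] > 0:
            adjusted[ngram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            adjusted[ngram] = float(r)            # no data to adjust with; keep the raw count
    return adjusted

counts = Counter("the cat the dog the cat a dog a bird".split())
print(good_turing_counts(counts))                 # P_GT of an n-gram would then be r* / N
```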
Combination of models:
Linear combination (interpolation)
• Linear combination of the 1-gram, 2-gram, 3-gram, … models
• Estimation of the λ weights using a development corpus
P_li(wn|wn-2,wn-1) = λ1 P1(wn) + λ2 P2(wn|wn-1) + λ3 P3(wn|wn-2,wn-1)
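A minimal sketch (the component models and λ weights below are made up; in practice the λs are tuned on the development corpus and must sum to 1):

```python
def interpolated_prob(w, w1, w2, p1, p2, p3, lambdas=(0.1, 0.3, 0.6)):
    """P_li(w | w1, w2) = l1*P1(w) + l2*P2(w | w2) + l3*P3(w | w1, w2)."""
    l1, l2, l3 = lambdas
    return l1 * p1(w) + l2 * p2(w, w2) + l3 * p3(w, w1, w2)

p1 = lambda w: 0.001                      # hypothetical unigram model
p2 = lambda w, prev: 0.02                 # hypothetical bigram model
p3 = lambda w, prev2, prev1: 0.0          # unseen trigram
print(interpolated_prob("dog", "the", "big", p1, p2, p3))   # 0.1*0.001 + 0.3*0.02 + 0.6*0 = 0.0061
```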
Katz’s Backing-Off
• Start with an n-gram model
• Back off to the (n-1)-gram for null (or low) counts
• Proceed recursively
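A minimal sketch of the recursive back-off idea (omitting the discounting and normalization weights that the full Katz method uses, so the values below are not a proper probability distribution; the data are made up):

```python
from collections import Counter

def backoff_prob(ngram, counts_by_order, total_words):
    """Use the relative frequency of the longest seen n-gram; otherwise drop the oldest context word."""
    n = len(ngram)
    if n == 1:
        return counts_by_order[1].get(ngram, 0) / total_words
    seen = counts_by_order[n].get(ngram, 0)
    prefix_count = counts_by_order[n - 1].get(ngram[:-1], 0)
    if seen > 0 and prefix_count > 0:
        return seen / prefix_count
    return backoff_prob(ngram[1:], counts_by_order, total_words)   # back off to the (n-1)-gram

tokens = "the big dog saw the big cat".split()
counts_by_order = {1: Counter((w,) for w in tokens),
                   2: Counter(zip(tokens, tokens[1:])),
                   3: Counter(zip(tokens, tokens[1:], tokens[2:]))}
print(backoff_prob(("a", "big", "dog"), counts_by_order, len(tokens)))   # trigram unseen -> P(dog|big) = 0.5
```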
• Operating on the history
• Class-based Models: clustering (or classifying) words into classes (POS, syntactic, semantic)
• Rosenfeld, 2000:
P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|wi-2,wi-1)
P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|wi-2,Ci-1)
P(wi|wi-2,wi-1) = P(wi|Ci) P(Ci|Ci-2,Ci-1)
P(wi|wi-2,wi-1) = P(wi|Ci-2,Ci-1)
• Structured Language Models (Jelinek & Chelba, 1999)
• Including the syntactic structure into the history
• Ti are the syntactic structures: binarized lexicalized trees