TRANSCRIPT
Lecture 2: Language modeling
LTAT.01.001 – Natural Language Processing
Kairit Sirts ([email protected])
20.02.2019
The task of language modeling
The cat sat on the mat
The mat sat on the cat
The cat mat the on sat
Language modeling
Task:
• Estimate the quality/fluency/grammaticality of a natural language sentence or segment

Why?
• Generate new sentences
• Choose between several variants, picking the best-sounding one
Language modeling
Word: w
Sentence: s = w_1 w_2 … w_n
Language modeling
Can we use grammaticality-checking rules to determine the fluency of a sentence s?
• Theoretically: yes
• In practice:
  • Grammar-checking software is unreliable
  • Grammar-checking software is only available for a few languages
  • Its output is often non-continuous, which means that:
    • It cannot be used in optimization
    • It cannot be used to easily choose the best output from many viable hypotheses
Language modeling
Instead, we will try to calculate/model:

P(s) = P(w_1 w_2 … w_n)

P(The cat sat on the mat) > P(The mat sat on the cat)
P(The mat sat on the cat) > P(The cat mat the on sat)
How to compute the sentence probability?
P(The cat sat on the mat) = #(The cat sat on the mat) / #(all sentences) = ?
P(The mat sat on the cat) = #(The mat sat on the cat) / #(all sentences) = ?
P(The cat mat the on sat) = #(The cat mat the on sat) / #(all sentences) = ?

# – the number (count) of such sentences

That's clearly not doable in general!
How to compute the sentence probability?
Factorize the joint probability:
• In general:

P(A, B, C) = P(A) P(B|A) P(C|A, B)

• Similarly:

P(w_1, w_2, …, w_n) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) … P(w_n|w_1, w_2, …, w_{n-1})

• It still does not solve the problem!
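The chain rule holds for any joint distribution; here is a quick numerical check on a made-up joint distribution over three binary variables (all numbers are hypothetical):

```python
from itertools import product

# A hypothetical joint distribution P(A, B, C) over binary variables.
joint = {
    (0, 0, 0): 0.10, (0, 0, 1): 0.05, (0, 1, 0): 0.15, (0, 1, 1): 0.10,
    (1, 0, 0): 0.05, (1, 0, 1): 0.20, (1, 1, 0): 0.05, (1, 1, 1): 0.30,
}

def marginal(vals):
    """Marginal probability of a prefix of variables, summing out the rest."""
    return sum(p for k, p in joint.items() if k[:len(vals)] == vals)

for a, b, c in product([0, 1], repeat=3):
    # Chain rule: P(a, b, c) = P(a) * P(b|a) * P(c|a, b)
    chain = marginal((a,)) \
        * (marginal((a, b)) / marginal((a,))) \
        * (joint[(a, b, c)] / marginal((a, b)))
    assert abs(chain - joint[(a, b, c)]) < 1e-12
print("chain rule verified")
```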
Sentence probability
• Cannot estimate directly:

P(w_1 w_2 … w_n) = #(w_1 w_2 … w_n) / #(all sentences)

• Cannot use the factorization either (the long conditioning histories are just as rare as whole sentences):

P(w_1 w_2 … w_n) = Π_{i=1..n} P(w_i | w_1 … w_{i-1})
Sentence probability
But word probabilities are doable:
• Take a huge text (millions/billions of words)
• Compute the probability for each word type (unique word):

P(w) = #(w) / #(all words in the text)

This is the maximum likelihood (ML) estimate.
Sentence probability
• What if we treat each word as independent of all other words? Then:

P(s) ≈ P(w_1) × P(w_2) × ⋯ × P(w_n)

• Under this model, all orderings of the same words receive the same probability:

P(The cat sat on the mat) = P(The mat sat on the cat)
P(The cat mat the on sat) = P(The mat sat on the cat)
Sentence probability
• Maybe add some context?

P(s) ≈ P(w_1) × P(w_2|w_1) × P(w_3|w_2) × ⋯ × P(w_n|w_{n-1})

P(The cat sat on the mat) = P(the) P(cat|the) P(sat|cat) P(on|sat) P(the|on) P(mat|the)
P(The mat sat on the cat) = P(the) P(mat|the) P(sat|mat) P(on|sat) P(the|on) P(cat|the)
P(The cat mat the on sat) = P(the) P(cat|the) P(mat|cat) P(the|mat) P(on|the) P(sat|on)
Sentence probability – Markov property
Independence assumption or Markov assumption (in the context of language modeling):
• The next word depends only on the current/last word
• This is precisely the model on the previous slide; it is called a bigram language model because it looks at word bigrams
N-gram language model
In general, we talk about n-gram language models, where the next word depends on a fixed history of n−1 words:
• Unigram model – all words are independent (the classical bag-of-words (BOW) approach)
• Bigram model – the next word depends on the last word
• Trigram model – the next word depends on the last two words
• 4-gram model
• 5-gram model
Computing n-gram probabilities
• Unigrams: P(w_i) = #(w_i) / #(all words)
• Bigrams: P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})
• Trigrams: P(w_i | w_{i-2}, w_{i-1}) = #(w_{i-2}, w_{i-1}, w_i) / #(w_{i-2}, w_{i-1})
Sentence probability according to an n-gram model
• If

P(w_i | w_1, w_2, …, w_{i-1}) ≈ P(w_i | w_{i-k}, …, w_{i-1})

• where k = order − 1:
  • Unigrams: k = 0
  • Bigrams: k = 1
  • Trigrams: k = 2, etc.
• Then

P(s) = Π_{i=1..n} P(w_i | w_1, w_2, …, w_{i-1}) ≈ Π_{i=1..n} P(w_i | w_{i-k}, …, w_{i-1})
Bigram language model: example
An example corpus:
1. the cat saw the mouse
2. the cat heard a mouse
3. the mouse heard
4. a mouse saw
5. a cat saw
Bigram       Count   Unigram   Count   Bigram prob
START the    3       START     5       0.6
the cat      2       the       4       0.5
cat saw      2       cat       3       0.67
saw the      1       saw       3       0.33
the mouse    2       the       4       0.5
mouse END    2       mouse     4       0.5
cat heard    1       cat       3       0.33
heard a      1       heard     2       0.5
a mouse      2       a         3       0.67
START a      2       START     5       0.4
mouse saw    1       mouse     4       0.25
saw END      2       saw       3       0.67
a cat        1       a         3       0.33
Bigram language model: example
P(The cat heard) = ?

Bigram       Bigram prob
START the    0.6
the cat      0.5
cat saw      0.67
saw the      0.33
the mouse    0.5
mouse END    0.5
cat heard    0.33
heard a      0.5
a mouse      0.67
START a      0.4
mouse saw    0.25
saw END      0.67
a cat        0.33
heard END    0.5
P(The cat heard) = P(START the) × P(the cat) × P(cat heard) × P(heard END)
= 0.6 × 0.5 × 0.33 × 0.5 = 0.0495
Bigram language model: example
P(The mouse saw the cat) = ?

P(the mouse saw the cat) = P(START the) × P(the mouse) × P(mouse saw) × P(saw the) × P(the cat) × P(cat END)
= 0.6 × 0.5 × 0.25 × 0.33 × 0.5 × 0 = 0

The bigram "cat END" never occurs in the corpus, so the whole sentence gets probability 0.
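The worked example above can be reproduced in a few lines. This sketch builds the bigram and history counts from the five-sentence corpus and scores a sentence by multiplying bigram probabilities; it uses exact fractions, so it yields 0.05 rather than the 0.0495 obtained from the rounded table values:

```python
from collections import Counter

# The five-sentence example corpus from the slides.
corpus = [
    "the cat saw the mouse",
    "the cat heard a mouse",
    "the mouse heard",
    "a mouse saw",
    "a cat saw",
]

bigrams, unigrams = Counter(), Counter()
for sent in corpus:
    toks = ["START"] + sent.split() + ["END"]
    unigrams.update(toks[:-1])              # history counts (END is never a history)
    bigrams.update(zip(toks[:-1], toks[1:]))

def p(w, h):
    """MLE bigram probability P(w | h)."""
    return bigrams[(h, w)] / unigrams[h]

def score(sentence):
    """Sentence probability under the bigram model."""
    toks = ["START"] + sentence.lower().split() + ["END"]
    prob = 1.0
    for h, w in zip(toks[:-1], toks[1:]):
        prob *= p(w, h)
    return prob

print(score("the cat heard"))           # 0.6 * 0.5 * (1/3) * 0.5 = 0.05
print(score("the mouse saw the cat"))   # unseen bigram "cat END" -> 0.0
```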
Morphology
Source: www.rabiaergin.com
Sparsity issues
Natural languages are sparse!

Consider a vocabulary of size 60,000:
• How many possible unigrams, bigrams, and trigrams are there?
• How large a text corpus do we need to obtain reliable statistics for all n-grams?
• Does more data solve the problem completely?
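The arithmetic behind the first question is quick to check:

```python
# Number of distinct n-grams possible over a vocabulary of 60,000 word types.
V = 60_000
print(f"unigrams: {V:,}")      # 60,000
print(f"bigrams:  {V**2:,}")   # 3,600,000,000
print(f"trigrams: {V**3:,}")   # 216,000,000,000,000
```

Even a billion-word corpus cannot contain most of the possible trigrams, which is why the zero counts above are unavoidable.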
Zipf’s law
• Given some corpus of natural language text, the frequency of any word is inversely proportional to its rank in the frequency table
• The most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Zipf’s law
Masrai and Milton, 2006. “How different is Arabic from Other Languages? The Relationship between Word Frequency and Lexical Coverage”
Smoothing
The general idea: find a way to fill the gaps in the counts
• Take care not to change the original distribution too much
• Fill in the gaps only as much as needed: as the corpus grows larger, there are fewer gaps to fill

• Smoothing methods:
  • Add-λ
  • Interpolation
  • (Modified) Kneser-Ney
  • There are others
Add λ method
Assume every n-gram occurs λ more times than it actually does.
• Usual (ML) bigram probability:

P(w_i | w_{i-1}) = #(w_{i-1}, w_i) / #(w_{i-1})

• Add 0 < λ ≤ 1 to all bigram counts:

P_λ(w_i | w_{i-1}) = (#(w_{i-1}, w_i) + λ) / (#(w_{i-1}) + λ|V|)

• Special case λ = 1: add-one smoothing
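A minimal sketch of the add-λ estimate above; the counts and vocabulary size are made up for illustration:

```python
from collections import Counter

def add_lambda_prob(w, h, bigrams, unigrams, vocab_size, lam=1.0):
    """Add-λ smoothed bigram probability P_λ(w | h), with 0 < λ <= 1."""
    return (bigrams[(h, w)] + lam) / (unigrams[h] + lam * vocab_size)

# Hypothetical counts for illustration.
bigrams = Counter({("the", "cat"): 2})
unigrams = Counter({"the": 4})
V = 10  # hypothetical vocabulary size

seen = add_lambda_prob("cat", "the", bigrams, unigrams, V)    # (2 + 1) / (4 + 10)
unseen = add_lambda_prob("dog", "the", bigrams, unigrams, V)  # (0 + 1) / (4 + 10)
print(seen, unseen)
```

Note that the unseen bigram now gets a small non-zero probability instead of 0.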
Add λ method
• Advantages:
  • Very simple
  • Easy to apply
• Disadvantages:
  • Performs poorly (according to Chen & Goodman)
  • All unseen events receive the same probability
  • All events are upgraded by the same λ
Interpolation (Jelinek-Mercer smoothing)
If the bigram w_{i-1} w_i is unseen:
• Originally its probability would be 0:

P(w_i | w_{i-1}) = 0

• Instead of 0, we could use the probability of the shorter n-gram (the unigram): P(w_i)
• We must make sure that the total probability mass remains the same
• Thus, interpolate between the unigram and bigram distributions:

P_JM(w_i | w_{i-1}) = λ P(w_i | w_{i-1}) + (1 − λ) P(w_i)
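A minimal sketch of the interpolation formula, with hypothetical ML estimates and λ = 0.7:

```python
def interpolated_prob(w, h, bigram_p, unigram_p, lam=0.7):
    """Jelinek-Mercer interpolation: lam * P(w|h) + (1 - lam) * P(w)."""
    return lam * bigram_p.get((h, w), 0.0) + (1 - lam) * unigram_p.get(w, 0.0)

# Hypothetical ML estimates for illustration.
bigram_p = {("the", "cat"): 0.5}
unigram_p = {"cat": 0.1, "dog": 0.05}

print(interpolated_prob("cat", "the", bigram_p, unigram_p))  # 0.7*0.5 + 0.3*0.1 = 0.38
print(interpolated_prob("dog", "the", bigram_p, unigram_p))  # 0.7*0.0 + 0.3*0.05 = 0.015
```

The unseen bigram ("the", "dog") falls back on the unigram probability of "dog" instead of getting 0.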
Interpolation (Jelinek-Mercer smoothing)
• Recursive formulation: the nth-order smoothed model is defined recursively as a linear interpolation between the nth-order maximum likelihood (ML) model and the (n−1)th-order smoothed model:

P_JM(w_i | w_{i-n+1}, …, w_{i-1}) = λ P_ML(w_i | w_{i-n+1}, …, w_{i-1}) + (1 − λ) P_JM(w_i | w_{i-n+2}, …, w_{i-1})

• The recursion can be grounded with:
  • a 1st-order unigram model, or
  • a 0th-order uniform model:

P(w) = 1 / |V|
Software for language modeling
• KenLM: https://github.com/kpu/kenlm
• SRILM: http://www.speech.sri.com/projects/srilm/
• IRSTLM: http://hlt-mt.fbk.eu/technologies/irstlm
• Others: http://www.statmt.org/moses/?n=FactoredTraining.BuildingLanguageModel
Language model evaluation
• Intrinsic evaluation
  • Perplexity
  • Quick and simple
  • Improvements in perplexity might not translate into improvements in downstream tasks
• Extrinsic evaluation
  • In a downstream task (machine translation, speech recognition, etc.)
  • More difficult and time-consuming
  • More accurate evaluation (although beware of confounding with other factors)
Perplexity
• Perplexity is a measurement of how well a probability model predicts a sample
• A language model is a probability model over language
• To evaluate a language model, compute the perplexity on a held-out set (test set):

PP = 2^(−(1/N) Σ_{i=1..N} log_2 P(w_i | w_{i-n+1}, …, w_{i-1}))
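The formula can be sketched directly; here `probs` stands for the model's conditional probability of each of the N test-set words (the numbers are hypothetical):

```python
import math

def perplexity(probs):
    """PP = 2^(-(1/N) * sum_i log2 P(w_i | history)).

    `probs` holds the model's conditional probability for each
    of the N words in the test set."""
    n = len(probs)
    return 2 ** (-sum(math.log2(p) for p in probs) / n)

# A model that always assigns 1/4 is as confused as a uniform choice among 4.
print(perplexity([0.25, 0.25, 0.25, 0.25]))  # 4.0
print(perplexity([0.1, 0.2, 0.05]))
```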
Perplexity
• The lower the perplexity, the better the language model, i.e. the less "surprised" the model is on seeing the evaluation data
• The exponent is really the cross-entropy, which measures the number of bits needed to represent a word:

H(p̃, p̂) = −Σ_i p̃(w_i) log_2 p̂(w_i)

• p̃(w_i) = #(w_i)/N – the empirical unigram probability
• p̂(w_i) = P_LM(w_i | w_{i-n+1}, …, w_{i-1}) – the model's probability
Perplexity
• Let's assume that the cross-entropy on a test set is 7.95
• This means that each word in the test set could be encoded with 7.95 bits
• The model perplexity would be 2^7.95 ≈ 247 per word
• This means that the model is as confused on the test data as if it had to choose uniformly at random among 247 possibilities for each word
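The arithmetic above can be checked directly:

```python
# A cross-entropy of 7.95 bits per word corresponds to perplexity 2^7.95.
pp = 2 ** 7.95
print(round(pp))  # ≈ 247
```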
Perplexity
• Perplexity is corpus-specific: only perplexities calculated on the same test set are comparable
• For a meaningful comparison, the vocabulary sizes of the two language models must also be the same, e.g.:
  • You can compare a bigram language model to a trigram language model that both use a vocabulary of size 10000
  • You cannot compare a trigram language model using a vocabulary of size 10000 to a trigram language model using a vocabulary of size 20000
Neural language models
• Window-based feed-forward neural language model• Recurrent neural language model
Feed-forward neural language model (Bengio et al., 2003)

x = [E(w_{i-n+1}); …; E(w_{i-2}); E(w_{i-1})]
h = g(x W_h + b_h)
P(w_i | w_{i-n+1}, …, w_{i-2}, w_{i-1}) = softmax(h W_s + b_s)

(E(w) is the embedding vector of word w; g is a non-linearity.)
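A toy forward pass of this architecture, assuming tanh as the non-linearity g and randomly initialized parameters (all sizes and word ids below are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: vocabulary 100, embedding dim 16, hidden dim 32, 3-word window.
V, d, H, n_hist = 100, 16, 32, 3

E = rng.normal(size=(V, d))               # embedding table
W_h = rng.normal(size=(n_hist * d, H))    # hidden-layer weights
b_h = np.zeros(H)
W_s = rng.normal(size=(H, V))             # output (softmax) weights
b_s = np.zeros(V)

def forward(history):
    """P(w_i | history) for a window of n_hist word ids."""
    x = np.concatenate([E[w] for w in history])  # x = [E(w_{i-3}); E(w_{i-2}); E(w_{i-1})]
    h = np.tanh(x @ W_h + b_h)                   # h = g(x W_h + b_h)
    logits = h @ W_s + b_s
    exps = np.exp(logits - logits.max())         # numerically stable softmax
    return exps / exps.sum()

p = forward([5, 17, 42])  # hypothetical word ids
print(p.shape, p.sum())   # a proper distribution over the vocabulary
```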
Recurrent neural language model (Mikolov et al., 2010)
Source: http://colah.github.io
h_i = g(x_i W + h_{i-1} U + b_h)
P(w_i | w_1, …, w_{i-2}, w_{i-1}) = softmax(h_i W_s + b_s)
Training the language model with cross-entropy loss
L_cross-entropy(ŷ, y) = −Σ_{i=1..|V|} y_i log ŷ_i = −log ŷ_t

|V| – the vocabulary size
t – the index of the correct word
Why is the softmax over a large vocabulary computationally costly?
• What is a softmax?

softmax(x_i) = e^{x_i} / Σ_j e^{x_j}

• Now take the derivative of this with respect to x_i
• The sum over the whole vocabulary will remain in the derivative (check it yourself)
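A straightforward softmax implementation (with the usual max-subtraction for numerical stability) makes the cost visible: the normalizer z sums over the whole vocabulary, and that same sum reappears in every entry of the derivative:

```python
import math

def softmax(scores):
    """softmax(x_i) = e^{x_i} / sum_j e^{x_j}, with max-subtraction for stability."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)  # normalizer: one term per vocabulary entry -> O(|V|)
    return [e / z for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs)
# d softmax_i / d x_j = softmax_i * (delta_ij - softmax_j), so the gradient
# also touches every vocabulary entry, which is what makes training with a
# large output vocabulary expensive.
```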
How to handle large softmax?
• Hierarchical softmax
  • Decompose the softmax layer into a binary tree
  • Reduces the complexity of the output distribution from O(|V|) to O(log |V|)
• Self-normalization
• Approximate softmax

source: https://becominghuman.ai
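A toy sketch of the hierarchical-softmax idea: each word's probability is a product of binary (sigmoid) decisions along its path in the tree, so scoring one word touches O(log |V|) nodes instead of |V|. The tree and node scores below are made up:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Toy binary tree over a 4-word vocabulary (hypothetical node scores).
# Going left at a node has probability sigmoid(score), right 1 - sigmoid(score).
paths = {  # word -> list of (node_score, go_left) decisions from the root
    "the": [(0.4, True),  (1.2, True)],
    "cat": [(0.4, True),  (1.2, False)],
    "sat": [(0.4, False), (-0.3, True)],
    "mat": [(0.4, False), (-0.3, False)],
}

def hs_prob(word):
    """P(word) as the product of branch probabilities along its path."""
    p = 1.0
    for score, go_left in paths[word]:
        p *= sigmoid(score) if go_left else 1 - sigmoid(score)
    return p

total = sum(hs_prob(w) for w in paths)
print(total)  # the branch probabilities sum to 1 (up to float rounding)
```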
What to do with infrequent words
• Typically, the vocabulary size is fixed, ranging anywhere between 10K and 200K words
• Still, there will always be words that are not part of the vocabulary (remember Turkish?)
• The most common approach is to simply replace all out-of-vocabulary (OOV) words with a special UNK token
• Another option is to reduce the sparsity by constructing the vocabulary from subword units: morphemes, characters, syllables, …
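A minimal sketch of the UNK replacement described above, with a hypothetical frequency threshold:

```python
from collections import Counter

# Build the vocabulary from training words above a frequency threshold;
# everything else maps to UNK.
train = "the cat saw the mouse and the mouse saw the cat".split()
counts = Counter(train)
min_count = 2  # hypothetical threshold
vocab = {w for w, c in counts.items() if c >= min_count} | {"UNK"}

def tokenize(text):
    return [w if w in vocab else "UNK" for w in text.split()]

print(tokenize("the dog saw the cat"))  # ['the', 'UNK', 'saw', 'the', 'cat']
```

Here "and" (seen once) and the unseen "dog" both map to UNK, so the model can still assign the sentence a probability.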
What to do with infrequent words
• What if there are no UNKs in the training set?
  • Use a random UNK vector during testing
  • Randomly replace some infrequent words with UNK during training
• Construct word embeddings from characters (we'll talk about this in more detail later)
  • Works for input (context) words
  • Cannot be used for output words
Character-level language model
• For instance, for generating text with markup
• A. Karpathy, 2015. The Unreasonable Effectiveness of Recurrent Neural Networks
• Generated text based on an LM trained on Wikipedia:
Using language models
• For scoring sentences
  • Speech recognition
  • Using an LM for text classification
  • Statistical machine translation
• For generating text
  • Neural machine translation
  • Dialogue generation
  • Abstractive summarization