Chapter 6. Statistical Inference: n-gram Model over Sparse Data
1
CHAPTER 6. STATISTICAL INFERENCE : N-GRAM MODEL OVER SPARSE DATA
Pusan National University, 2014. 4. 22
Myoungjin, Jung
Foundations of Statistical Natural Language Processing
2
INTRODUCTION Object of Statistical NLP
Do statistical inference for the field of natural language.
Statistical inference (broadly, two steps):
1. Taking some data generated by an unknown probability distribution (a corpus is needed).
2. Making some inferences about this distribution (inferring the probability distribution from that corpus).
Divides the problem into three areas (the three steps of statistical language processing):
1. Dividing the training data into equivalence classes.
2. Finding a good statistical estimator for each equivalence class.
3. Combining multiple estimators.
3
BINS : FORMING EQUIVALENCE CLASSES Reliability vs Discrimination
Ex) “large green ___________” → tree? mountain? frog? car?
“swallowed the large green ________” → pill? broccoli?
smaller n: more instances in training data, better statistical estimates (more reliability)
larger n: more information about the context of the specific instance (greater discrimination)
4
BINS : FORMING EQUIVALENCE CLASSES N-gram models
“n-gram” = a sequence of n words. Predicting the next word rests on the Markov assumption:
only the prior local context – the last few words – affects the next word.
Selecting an n : Vocabulary size = 20,000 words
n Number of bins
2 (bigrams) 400,000,000
3 (trigrams) 8,000,000,000,000
4 (4-grams) 1.6 × 10^17
P(w_n | w_1, …, w_{n-1})
5
Probability distribution: P(s), where s is a sentence. Ex.
P(If you’re going to San Francisco, be sure …) = P(If) × P(you’re | If) × P(going | If you’re) × P(to | If you’re going) × …
Markov assumption: only the last n-1 words are relevant for a prediction.
Ex. With n = 5: P(sure | If you’re going to San Francisco, be) = P(sure | San Francisco , be)
BINS : FORMING EQUIVALENCE CLASSES
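The Markov truncation of the history can be sketched in a few lines of Python (the helper name `markov_history` and the toy sentence handling are illustrative, not from the chapter):

```python
# Sketch: under an n-gram model, only the last n-1 words of the history
# are kept when predicting the next word. Names here are illustrative.

def markov_history(tokens, i, n):
    """Return the truncated history for position i under an n-gram model:
    only the last n-1 words are retained."""
    return tuple(tokens[max(0, i - (n - 1)):i])

sentence = "if you're going to san francisco be sure".split()

# Full history for the word "sure" (position 7) vs. its 5-gram history:
full_history = tuple(sentence[:7])
short_history = markov_history(sentence, 7, 5)

assert len(full_history) == 7
assert short_history == ("to", "san", "francisco", "be")
```

(Here the comma of the slide's example is dropped for simplicity; a real tokenizer would keep punctuation tokens.)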
6
BINS : FORMING EQUIVALENCE CLASSES
N-gram: a sequence of length n with a count. Ex. 5-gram: If you’re going to San
Markov assumption formalized:
P(w_k | w_1 … w_{k-1}) ≈ P(w_k | w_{k-n+1} … w_{k-1})
(only the last n-1 words of the history are used)
7
BINS : FORMING EQUIVALENCE CLASSES
Instead of P(s): only one conditional probability, P(w_k | w_1 … w_{k-1}), which the Markov assumption simplifies to P(w_k | w_{k-n+1} … w_{k-1}).
Next word prediction: NWP(h) = arg max_{w ∈ V} P(w | h), where V is the set of all words in the corpus.
8
BINS : FORMING EQUIVALENCE CLASSES
Ex. The easiest way (relative frequency):
P(w_k | w_{k-n+1} … w_{k-1}) = C(w_{k-n+1} … w_k) / C(w_{k-n+1} … w_{k-1})
P(San | If you’re going to) = C(If you’re going to San) / C(If you’re going to)
9
STATISTICAL ESTIMATORS Given the observed training data.
How do you develop a model (probability distribution) to predict future events? (a better probability estimate)
Probability estimate of the target feature.
Estimating the unknown probability distribution of n-grams.
P(w_n | w_1 … w_{n-1}) = P(w_1 … w_n) / P(w_1 … w_{n-1})
10
STATISTICAL ESTIMATORS Notation for the statistical estimation chapter.
N   Number of training instances
B   Number of bins the training instances are divided into
w_1^n   An n-gram w_1…w_n in the training text
C(w_1…w_n)   Frequency of the n-gram w_1…w_n in the training text
r   Frequency of an n-gram
f(·)   Frequency estimate of a model
N_r   Number of bins that have r training instances in them
T_r   Total count of n-grams of frequency r in further data
h   ‘History’ of preceding words
11
STATISTICAL ESTIMATORS Example - Instances in the training corpus:
“inferior to ________”
12
MAXIMUM LIKELIHOOD ESTIMATION (MLE) Definition
Using the relative frequency as a probability estimate. Example :
In the corpus, 10 training instances of “comes across” were found.
8 times they were followed by “as”: P(as) = 0.8. Once by “more” and once by “a”: P(more) = 0.1, P(a) = 0.1. Any word x not among these three: P(x) = 0.0.
Formula
P_MLE(w_1…w_n) = C(w_1…w_n) / N
P_MLE(w_n | w_1…w_{n-1}) = C(w_1…w_n) / C(w_1…w_{n-1})
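The MLE formulas above amount to counting and dividing. A minimal sketch (the toy corpus and function names are illustrative, not the chapter's code):

```python
from collections import Counter

# MLE for bigrams: P_MLE(w2 | w1) = C(w1 w2) / C(w1).
# Toy corpus echoing the "comes across" example; counts are made up.
tokens = "comes across as strange comes across as kind comes across more".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(w1, w2):
    """Relative-frequency estimate of P(w2 | w1); 0 for unseen events."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "across" is followed by "as" twice and "more" once in this toy corpus:
assert abs(p_mle("across", "as") - 2 / 3) < 1e-12
assert p_mle("across", "pill") == 0.0  # unseen bigram gets zero probability
```

The last assertion shows MLE's central weakness on sparse data: every unseen n-gram gets probability exactly zero, which the smoothing methods below repair.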
13
MAXIMUM LIKELIHOOD ESTIMATION (MLE)
14
MAXIMUM LIKELIHOOD ESTIMATION (MLE)
Example 1. A Paragraph Using Training Data
The bigram model uses the preceding word to help predict the next word. (End) In
general, this helps enormously, and gives us a much better model. (End) In some
cases the estimated probability of the word that actually comes next has gone up by
about an order of magnitude (was, to, sisters). (End) However, note that the bigram
model is not guaranteed to increase the probability estimate. (End)
Word counts (N = 79): C(the) = 7, C(bigram) = 2, C(model) = 3, C(the, bigram) = 2, C(the, bigram, model) = 2
1-gram: P(the) = 7/79, P(bigram) = 2/79
2-gram: P(bigram | the) = 2/7
3-gram: P(model | the, bigram) = 2/2
15
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Laplace’s law (1814; 1995)
Add a little bit of probability space to unseen events:
P_LAP(w_1…w_n) = (C(w_1…w_n) + 1) / (N + B)
16
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Word (N = 79 : B = seen(51) + unseen(70) = 121)
                         MLE         Laplace’s law
P(the)                   0.0886076   0.0400000
P(bigram | the)          0.2857143   0.0050951
P(model | the, bigram)   1.0000000   0.0083089
17
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Page 202-203 (Associated Press [AP] newswire vocabulary)
Laplace’s law adds a little probability space for unseen events, but it adds far too much.
A 44-million-word corpus yields a vocabulary of 400,653 words → about 160,000,000,000 possible bigrams, so the number of bins far exceeds the number of training instances.
Laplace’s law inserts B into the denominator to reserve probability space for unseen events, but as a result about 46.5% of the probability space ends up assigned to unseen events:
N_0 × P_LAP(·) = 74,671,100,000 × 0.000137 / 22,000,000 ≈ 0.465
18
Lidstone’s law (1920) and the Jeffreys-Perks law (1973)
Lidstone’s law: add some positive value λ instead of 1.
Jeffreys-Perks law:
λ = 0.5; called ELE (Expected Likelihood Estimation).
P_Lid(w_1…w_n) = (C(w_1…w_n) + λ) / (N + Bλ)
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
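The Laplace and Lidstone estimates differ only in the added constant, so one small function covers both (a minimal sketch; the function name is illustrative):

```python
# Lidstone smoothing: P_Lid = (C + lam) / (N + B*lam).
# lam = 1 gives Laplace's law; lam = 0.5 gives the Jeffreys-Perks law (ELE).

def p_lidstone(count, N, B, lam=0.5):
    """Smoothed probability of an n-gram seen `count` times,
    given N training instances divided into B bins."""
    return (count + lam) / (N + B * lam)

# Laplace (lam = 1) on the slide's example: C(the) = 7, N = 79, B = 121:
assert abs(p_lidstone(7, 79, 121, lam=1.0) - 8 / 200) < 1e-12  # = 0.04
# An unseen n-gram (count 0) still receives a little probability mass:
assert p_lidstone(0, 79, 121, lam=0.5) > 0.0
```

Note how the unseen-event probability shrinks as λ decreases, which is exactly the "trust in relative frequencies" trade-off discussed on the next slides.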
19
LIDSTONE’S LAW
Using Lidstone’s law, instead of adding one, add some smaller positive value λ. Writing μ = N / (N + Bλ), the estimate is a linear interpolation between the MLE and the uniform prior 1/B:
P_Lid = μ · C(w_1…w_n)/N + (1 − μ) · 1/B
20
LIDSTONE’S LAW
Here, λ = 0 gives the maximum likelihood estimate and λ = 1 gives Laplace’s law; as λ tends to infinity we approach the uniform estimate 1/B.
λ represents the trust we have in relative frequencies: λ < 1 implies more trust in relative frequencies than Laplace’s law, while λ > 1 represents less trust.
In practice, people use values of λ in the range 0 < λ < 1, a common value being λ = 0.5 (the Jeffreys-Perks law).
21
JEFFREYS-PERKS LAW
Using Lidstone’s law with different values of λ:

     MLE       Lidstone   Jeffreys-Perks   Lidstone   Laplace    Lidstone
     (λ = 0)   (λ = 0.3)  (λ = 0.5)        (λ = 0.7)  (λ = 1)    (λ = 2)
A    0.0886    0.0633     0.0538           0.0470     0.0400     0.0280
B    0.2857    0.0081     0.0063           0.0056     0.0051     0.0049
C    1.0000    0.0084     0.0085           0.0083     0.0083     0.0083

*A: P(the), B: P(bigram | the), C: P(model | the, bigram)
22
HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)
For each n-gram, w_1…w_n, let:
C_1(w_1…w_n) = frequency of w_1…w_n in the training data
C_2(w_1…w_n) = frequency of w_1…w_n in the held out data
Let T_r = Σ_{w_1…w_n : C_1(w_1…w_n) = r} C_2(w_1…w_n)
be the total number of times that all n-grams that appeared r times in the training text appeared in the held out data. An estimate for the probability of one of these n-grams is:
P_ho(w_1…w_n) = T_r / (N_r · N), where r = C_1(w_1…w_n).
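The T_r / (N_r · N) computation can be sketched directly from those definitions (toy bigram lists and the function name are illustrative, not the chapter's code):

```python
from collections import Counter

# Held-out estimation sketch: for each training frequency r,
# P_ho = T_r / (N_r * N), where N_r is the number of n-gram types seen
# r times in training and T_r is their total count in the held-out data.

def held_out_probs(train_bigrams, heldout_bigrams, N):
    """Return {r: P_ho}. N is the number of training instances,
    per the chapter's notation (toy data below uses equal-sized sets)."""
    c_train = Counter(train_bigrams)
    c_held = Counter(heldout_bigrams)
    n_r = Counter(c_train.values())   # N_r: types with training frequency r
    t_r = Counter()                   # T_r: their total held-out count
    for bg, r in c_train.items():
        t_r[r] += c_held[bg]
    return {r: t_r[r] / (n_r[r] * N) for r in n_r}

train = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]
held = [("a", "b"), ("b", "c"), ("b", "c"), ("a", "a")]
probs = held_out_probs(train, held, N=len(train))

# ("a","b") is the only type with r = 2; it appears once in held-out data:
assert abs(probs[2] - 1 / (1 * 4)) < 1e-12
```

All n-grams that share a training frequency r share the same held-out estimate, which is the equivalence-classing idea from the start of the chapter.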
23
[Full text ( ) : , respectively], unseen word : I don't know. [Word ( ) : , unseen word : 70, respectively] : (Training Data)
[Word ( ) : , unseen word : 51- , respectively] : (Held out Data)
(1-gram) Training data : , , ( ) Held out data : , , ( )
HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)
24
This checks how many times a bigram that occurred r times in the training text occurs in an additionally drawn text (further text).
• Held-out estimation : a method for predicting how often a bigram that appeared r times in the training text will appear in further text.
• Test data (independent of the training data) is only 5-10% of the whole data, but that is enough to be reliable. • We want to divide the data into training data and test data (validated vs. unvalidated data). • Held-out data (10%). • The held-out estimates of the n-grams are obtained from the held-out data.
HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)
25
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
Use data for both training and validation:
Divide the training data into 2 parts. Train on A, validate on B. Train on B, validate on A. Combine the two models.

  A: train      B: validate   → Model 1
  A: validate   B: train      → Model 2
  Model 1 + Model 2 → Final Model
26
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
Cross-validation : the training data is used both as initial training data and as held-out data.
On large training corpora, deleted estimation works better than held-out estimation
P_ho(w_1…w_n) = T_r^01 / (N_r^0 · N)  or  T_r^10 / (N_r^1 · N), where r = C(w_1…w_n)
P_del(w_1…w_n) = (T_r^01 + T_r^10) / (N · (N_r^0 + N_r^1)), where r = C(w_1…w_n)
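Deleted estimation just runs held-out counting in both directions and pools the statistics. A toy sketch (data and names are illustrative):

```python
from collections import Counter

# Deleted estimation: split the training data into parts 0 and 1 and combine
# the two held-out directions:
#   P_del = (T_r_01 + T_r_10) / (N * (N_r_0 + N_r_1))

def deleted_estimate(part0, part1, N):
    c0, c1 = Counter(part0), Counter(part1)
    nr = Counter()   # N_r_0 + N_r_1
    tr = Counter()   # T_r_01 + T_r_10
    for bg, r in c0.items():
        nr[r] += 1
        tr[r] += c1[bg]
    for bg, r in c1.items():
        nr[r] += 1
        tr[r] += c0[bg]
    return {r: tr[r] / (nr[r] * N) for r in nr}

part0 = [("a", "b"), ("a", "b"), ("b", "c")]
part1 = [("a", "b"), ("c", "a")]
est = deleted_estimate(part0, part1, N=len(part0) + len(part1))
assert set(est) == {1, 2}
```

Pooling the two directions is what damps the difference between N_r^0 and N_r^1 noted on the next slide.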
27
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
[Full text ( ) : , respectively], unseen word : I don't know. [Word ( ) : , unseen word : 70, respectively] : (Training Data)
[A-part word ( ) : , unseen word : 101, respectively] [B-part word ( ) : , unseen word : 90- , respectively]
A-part data : , ( )
B-part data : , ( )
, .
28
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
[B-part word ( ) : , unseen word : 90- , respectively] [A-part word ( ) : , unseen word : 101+ , respectively]
B-part data : , ( )
A-part data : , ( )
, .
[Result]
, .
29
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
Using the held-out estimation idea, we get the same effect by dividing the training data into two parts; this method is called cross-validation.
A more effective approach: combining the two directions reduces the difference between N_r^0 and N_r^1.
On a large training corpus, deleted estimation is more reliable than held-out estimation.
30
GOOD-TURING ESTIMATION(GOOD, 1953) : [BINOMIAL DISTRI-BUTION]
Idea: re-estimate probability mass assigned to N-grams with zero counts
Adjust actual counts to expected counts with formula
r* = (r + 1) · E[N_{r+1}] / E[N_r]
P_GT = r* / N
(r* is an adjusted frequency)
(E denotes the expectation of
random variable)
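A bare-bones sketch of the adjusted-count formula, using raw N_r in place of the expectations (illustrative only; practical Good-Turing implementations first smooth the N_r sequence, since high counts have N_{r+1} = 0):

```python
from collections import Counter

# Good-Turing: adjusted count r* = (r + 1) * N_{r+1} / N_r, then P_GT = r*/N.
# Raw N_r version for illustration; names and toy counts are made up.

def good_turing_adjusted_counts(ngram_counts):
    n_r = Counter(ngram_counts.values())   # frequency of frequencies
    def r_star(r):
        # Falls to 0 when N_{r+1} = 0 (highest counts) -- a known artifact
        # of the unsmoothed version.
        return (r + 1) * n_r[r + 1] / n_r[r] if n_r[r] else 0.0
    return {ng: r_star(r) for ng, r in ngram_counts.items()}

counts = Counter([("a", "b")] * 3 + [("b", "c")] * 2 + [("c", "a"), ("a", "c")])
adj = good_turing_adjusted_counts(counts)

# Singletons (r = 1, N_1 = 2) are discounted using N_2 = 1: r* = 2*1/2 = 1.0
assert adj[("c", "a")] == 1.0
```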
31
GOOD-TURING ESTIMATION(GOOD, 1953) : [BINOMIAL DISTRI-BUTION]
If
If
When it is small : When it is large : So, what was over-estimated is adjusted downward to an under-estimate.
32
NOTE
Drawback : over-estimation
[Two discounting models] (Ney and Essen, 1993; Ney et al., 1994)
Absolute discounting : lowers each over-estimated count by subtracting a fixed amount.
Linear discounting : adjusts it using a multiplicative factor.
33
NOTE
Drawback : over-estimation
[Natural Law of Succession] (Ristad, 1995)
34
COMBINING ESTIMATORS Basic Idea
Consider how to combine multiple probability estimate from various different models
How can you develop a model to utilize different length n-grams as appropriate?
Simple linear interpolation
P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2}, w_{n-1}),
where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1.
Combination of trigram, bigram and unigram.
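The mixture is a weighted sum with weights summing to 1; a minimal sketch (component probabilities and weights below are made-up stand-ins):

```python
# Simple linear interpolation of unigram, bigram, and trigram estimates:
#   P_li = l1*P1(w) + l2*P2(w|w2) + l3*P3(w|w1,w2), with l1 + l2 + l3 = 1.

def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-12  # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even when the trigram estimate is zero, the mixture stays nonzero:
p = interpolate(p_uni=0.01, p_bi=0.05, p_tri=0.0)
assert abs(p - (0.2 * 0.01 + 0.3 * 0.05)) < 1e-12
```

In practice the λ_i are themselves trained, e.g. by EM on held-out data.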
35
COMBINING ESTIMATORS [Katz’s backing-off] (Katz, 1987)
Example
36
COMBINING ESTIMATORS [Katz’s backing-off] (Katz, 1987) If sequence unseen : use shorter sequence Ex. If P(San | going to) = 0, Use P(San | to)
P_bo(w_n | w_{n-2}, w_{n-1}) = τ(w_n | w_{n-2}, w_{n-1})                      if C(w_{n-2} w_{n-1} w_n) > 0
                             = α(w_{n-2}, w_{n-1}) · P_bo(w_n | w_{n-1})      if C(w_{n-2} w_{n-1} w_n) = 0
(α : back-off weight; P_bo(w_n | w_{n-1}) : lower-order probability; τ : discounted higher-order probability)
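The control flow of back-off is simple; a schematic sketch (the discounted table, the α value, and the example trigrams are placeholders, not Katz's actual Good-Turing-based discounts):

```python
# Katz-style back-off (schematic): use the discounted higher-order estimate
# when the n-gram was seen; otherwise back off to a weighted lower-order one.

def backoff_prob(trigram, discounted_tri, alpha, bigram_prob):
    """P_bo(w3|w1,w2) = tau(w1,w2,w3) if seen, else alpha(w1,w2)*P(w3|w2)."""
    if trigram in discounted_tri:
        return discounted_tri[trigram]
    return alpha * bigram_prob

disc = {("going", "to", "bed"): 0.4}   # hypothetical discounted trigram table

# Seen trigram: the discounted estimate is used directly.
assert backoff_prob(("going", "to", "bed"), disc, 0.5, 0.2) == 0.4
# Unseen trigram "going to San": back off to alpha * P(San | to).
assert abs(backoff_prob(("going", "to", "San"), disc, 0.5, 0.2) - 0.1) < 1e-12
```

In real Katz back-off, τ comes from Good-Turing discounting and α is computed per history so the conditional distribution sums to 1.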
37
COMBINING ESTIMATORS [General linear interpolation]
P_li(w | h) = Σ_i λ_i(h) · P_i(w | h), where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1
38
COMBINING ESTIMATORS
Interpolated smoothing : P_int(w | h) = τ(w | h) + λ(h) · P_int(w | h′)
(τ : discounted higher-order probability; λ(h) : weight; P_int(w | h′) : lower-order probability)
Seems to work better than back-off smoothing
39
Witten Bell smoothing
P_WB(w_i | w_{i-1}) = λ(w_{i-1}) · P_ML(w_i | w_{i-1}) + (1 − λ(w_{i-1})) · P_WB(w_i)
1 − λ(w_{i-1}) = N_{1+}(w_{i-1} •) / (N_{1+}(w_{i-1} •) + C(w_{i-1}))
where N_{1+}(w_{i-1} •) = |{w : C(w_{i-1} w) > 0}|, the number of distinct word types that follow w_{i-1}
NOTE
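A toy bigram sketch of Witten-Bell, under the interpolated formulation above (the corpus and function name are illustrative, not the book's code):

```python
from collections import Counter

# Witten-Bell (bigram sketch): interpolate the ML bigram estimate with the
# unigram estimate; the weight for a history depends on how many distinct
# word types were observed after it.
tokens = "a b a c a b b c".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = len(tokens)

def p_witten_bell(w_prev, w):
    c_hist = sum(c for (h, _), c in bigrams.items() if h == w_prev)
    if c_hist == 0:                    # unseen history: fall back to unigram
        return unigrams[w] / N
    n_types = len({v for (h, v) in bigrams if h == w_prev})  # N1+(w_prev .)
    lam = c_hist / (c_hist + n_types)
    p_ml = bigrams[(w_prev, w)] / c_hist
    return lam * p_ml + (1 - lam) * (unigrams[w] / N)

# The unseen bigram (c, b) still gets probability via the unigram term:
assert p_witten_bell("c", "b") > 0
```

Histories followed by many different word types keep less of their ML mass, since they are the ones most likely to be followed by yet another new type.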
40
Absolute discounting: like Jelinek-Mercer, it involves interpolation of higher- and lower-order models. But instead of multiplying the higher-order estimate by a λ, we subtract a fixed discount D ∈ [0, 1] from each nonzero count:
P_abs(w_i | w_{i-1}) = max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + (1 − λ(w_{i-1})) · P(w_i)
To make it sum to 1: (1 − λ(w_{i-1})) = (D / C(w_{i-1})) · N_{1+}(w_{i-1} •)
Choose D using held-out estimation.
NOTE
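A bigram sketch of absolute discounting (toy corpus, illustrative names; D = 0.5 is a placeholder, which the slide suggests choosing by held-out estimation):

```python
from collections import Counter

# Absolute discounting (bigram sketch): subtract a fixed discount D from
# every nonzero bigram count and give the freed mass to the unigram model.
tokens = "a b a c a b b c".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = len(tokens)
D = 0.5

def p_abs(w_prev, w):
    c_hist = sum(c for (h, _), c in bigrams.items() if h == w_prev)
    n_types = len({v for (h, v) in bigrams if h == w_prev})  # N1+(w_prev .)
    discounted = max(bigrams[(w_prev, w)] - D, 0) / c_hist
    backoff_weight = D * n_types / c_hist   # mass freed by discounting
    return discounted + backoff_weight * (unigrams[w] / N)

# The conditional distribution still sums to 1 over the vocabulary:
total = sum(p_abs("a", w) for w in unigrams)
assert abs(total - 1.0) < 1e-9
```

The normalization check works because the discount frees exactly D · N_{1+}(w_prev •) counts, which is then spread according to the (normalized) unigram distribution.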
41
KN smoothing (1995): an extension of absolute discounting with a clever way of constructing the lower-order (back-off) model. Idea: the lower-order model is significant only when the count is small or zero in the higher-order model, and so should be optimized for that purpose:
P_KN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + λ(w_{i-1}) · P_cont(w_i)
NOTE
42
An empirical study of smoothing techniques for language modeling (1999)
For a bigram model, we would like to select a smoothed distribution P_KN that satisfies the following constraint on unigram marginals for all w_i:
(1) Σ_{w_{i-1}} P_KN(w_{i-1} w_i) = C(w_i) / Σ_{w_i} C(w_i)   (the constraint)
(2) From (1), expand the left side using P_KN(w_{i-1} w_i) = P(w_{i-1}) · P_KN(w_i | w_{i-1}).
(3) From (2), solve for the lower-order (continuation) distribution.
NOTE
43
C(w_i) = Σ_{w_{i-1}} C(w_{i-1}) · [ max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + (D / C(w_{i-1})) · N_{1+}(w_{i-1} •) · P_cont(w_i) ]
= Σ_{w_{i-1}} max(C(w_{i-1} w_i) − D, 0) + D · P_cont(w_i) · Σ_{w_{i-1}} N_{1+}(w_{i-1} •)
= C(w_i) − D · N_{1+}(• w_i) + D · P_cont(w_i) · N_{1+}(• •)
NOTE
44
N_{1+}(• w_i) = |{w_{i-1} : C(w_{i-1} w_i) > 0}|
N_{1+}(• •) = Σ_{w_i} N_{1+}(• w_i) = |{(w_{i-1}, w_i) : C(w_{i-1} w_i) > 0}|
P_cont(w_i) = N_{1+}(• w_i) / N_{1+}(• •)
NOTE
45
Generalizing to higher-order models, we have that
P_cont(w_i | w_{i-n+2} … w_{i-1}) = N_{1+}(• w_{i-n+2} … w_i) / N_{1+}(• w_{i-n+2} … w_{i-1} •)
where N_{1+}(• w_{i-n+2} … w_i) = |{w_{i-n+1} : C(w_{i-n+1} … w_i) > 0}| and N_{1+}(• w_{i-n+2} … w_{i-1} •) = Σ_{w_i} N_{1+}(• w_{i-n+2} … w_i)
NOTE
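A bigram sketch of Kneser-Ney putting the pieces together: absolute discounting plus the continuation probability for the lower-order model (toy corpus and names are illustrative; D = 0.5 is a placeholder):

```python
from collections import Counter

# Kneser-Ney (bigram sketch): absolute discounting, but the lower-order
# model is the *continuation* probability -- based on how many distinct
# left contexts a word appears in, not its raw frequency.
tokens = "a b a c a b b c d b".split()
bigrams = Counter(zip(tokens, tokens[1:]))
D = 0.5

# Continuation counts: N1+(. w) = number of distinct predecessors of w.
n1plus_pre = Counter()
for (h, w) in bigrams:
    n1plus_pre[w] += 1
n1plus_total = sum(n1plus_pre.values())   # N1+(. .)

def p_kn(w_prev, w):
    c_hist = sum(c for (h, _), c in bigrams.items() if h == w_prev)
    n_types = len({v for (h, v) in bigrams if h == w_prev})  # N1+(w_prev .)
    lam = D * n_types / c_hist
    p_cont = n1plus_pre[w] / n1plus_total
    return max(bigrams[(w_prev, w)] - D, 0) / c_hist + lam * p_cont

total = sum(p_kn("a", w) for w in set(tokens))
assert abs(total - 1.0) < 1e-9
```

A frequent word that appears after only one history (the classic "Francisco" example) gets a small continuation probability, which is exactly when the lower-order model matters.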