Chapter 6. Statistical Inference: n-gram Model over Sparse Data
1
CHAPTER 6. STATISTICAL INFERENCE : N-GRAM MODEL OVER SPARSE DATA
Pusan National University, 2014. 4. 22
Myoungjin, Jung
Foundations of Statistical Natural Language Processing
2
INTRODUCTION Object of Statistical NLP
Do statistical inference for the field of natural language.
Statistical inference (broadly, two steps):
1. Taking some data generated by an unknown probability distribution (a corpus is needed).
2. Making some inferences about this distribution (inferring the probability distribution from that corpus).
Divides the problem into three areas (the three steps of statistical language processing):
1. Dividing the training data into equivalence classes.
2. Finding a good statistical estimator for each equivalence class.
3. Combining multiple estimators.
3
BINS : FORMING EQUIVALENCE CLASSES Reliability vs Discrimination
Ex) “large green ___________” → tree? mountain? frog? car?
“swallowed the large green ________” → pill? broccoli?
smaller n: more instances in training data, better statistical estimates (more reliability)
larger n: more information about the context of the specific instance (greater discrimination)
4
BINS : FORMING EQUIVALENCE CLASSES N-gram models
“n-gram” = a sequence of n words. Predicting the next word rests on the Markov assumption:
only the prior local context – the last few words – affects the next word.
Selecting an n : Vocabulary size = 20,000 words
n Number of bins
2 (bigrams) 400,000,000
3 (trigrams) 8,000,000,000,000
4 (4-grams) 1.6 × 10^17
P(w_n | w_1, …, w_{n-1})
5
Probability distribution: P(s), where s is a sentence. Ex.
P(If you’re going to San Francisco, be sure …) = P(If) × P(you’re | If) × P(going | If you’re) × P(to | If you’re going) × …
Markov assumption: only the last n-1 words are relevant for a prediction.
Ex. With n = 5: P(sure | If you’re going to San Francisco, be) = P(sure | San Francisco , be)
BINS : FORMING EQUIVALENCE CLASSES
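The Markov truncation of the history can be sketched in a few lines of Python (the helper name `markov_history` and the toy sentence handling are illustrative, not from the chapter):

```python
# Sketch: under an n-gram model, only the last n-1 words of the history
# are kept when predicting the next word. Names here are illustrative.

def markov_history(tokens, i, n):
    """Return the truncated history for position i under an n-gram model:
    only the last n-1 words are retained."""
    return tuple(tokens[max(0, i - (n - 1)):i])

sentence = "if you're going to san francisco be sure".split()

# Full history for the word "sure" (position 7) vs. its 5-gram history:
full_history = tuple(sentence[:7])
short_history = markov_history(sentence, 7, 5)

assert len(full_history) == 7
assert short_history == ("to", "san", "francisco", "be")
```

(Here the comma of the slide's example is dropped for simplicity; a real tokenizer would keep punctuation tokens.)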
6
BINS : FORMING EQUIVALENCE CLASSES
N-gram: a sequence of length n with a count. Ex. 5-gram: If you’re going to San
Markov assumption formalized:
P(w_k | w_1 … w_{k-1}) ≈ P(w_k | w_{k-n+1} … w_{k-1})
(only the last n-1 words of the history are used)
7
BINS : FORMING EQUIVALENCE CLASSES
Instead of P(s): only one conditional probability, P(w_k | w_1 … w_{k-1}), which the Markov assumption simplifies to P(w_k | w_{k-n+1} … w_{k-1}).
Next word prediction: NWP(h) = arg max_{w ∈ V} P(w | h), where V is the set of all words in the corpus.
8
BINS : FORMING EQUIVALENCE CLASSES
Ex. The easiest way (relative frequency):
P(w_k | w_{k-n+1} … w_{k-1}) = C(w_{k-n+1} … w_k) / C(w_{k-n+1} … w_{k-1})
P(San | If you’re going to) = C(If you’re going to San) / C(If you’re going to)
9
STATISTICAL ESTIMATORS Given the observed training data.
How do you develop a model (probability distribution) to predict future events? (a better probability estimate)
Probability estimate of the target feature.
Estimating the unknown probability distribution of n-grams.
P(w_n | w_1 … w_{n-1}) = P(w_1 … w_n) / P(w_1 … w_{n-1})
10
STATISTICAL ESTIMATORS Notation for the statistical estimation chapter.
N   Number of training instances
B   Number of bins the training instances are divided into
w_1^n   An n-gram w_1…w_n in the training text
C(w_1…w_n)   Frequency of the n-gram w_1…w_n in the training text
r   Frequency of an n-gram
f(·)   Frequency estimate of a model
N_r   Number of bins that have r training instances in them
T_r   Total count of n-grams of frequency r in further data
h   ‘History’ of preceding words
11
STATISTICAL ESTIMATORS Example - Instances in the training corpus:
“inferior to ________”
12
MAXIMUM LIKELIHOOD ESTIMATION (MLE) Definition
Using the relative frequency as a probability estimate. Example :
In the corpus, 10 training instances of “comes across” were found.
8 times they were followed by “as”: P(as) = 0.8. Once by “more” and once by “a”: P(more) = 0.1, P(a) = 0.1. Any word x not among these three: P(x) = 0.0.
Formula
P_MLE(w_1…w_n) = C(w_1…w_n) / N
P_MLE(w_n | w_1…w_{n-1}) = C(w_1…w_n) / C(w_1…w_{n-1})
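The MLE formulas above amount to counting and dividing. A minimal sketch (the toy corpus and function names are illustrative, not the chapter's code):

```python
from collections import Counter

# MLE for bigrams: P_MLE(w2 | w1) = C(w1 w2) / C(w1).
# Toy corpus echoing the "comes across" example; counts are made up.
tokens = "comes across as strange comes across as kind comes across more".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_mle(w1, w2):
    """Relative-frequency estimate of P(w2 | w1); 0 for unseen events."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

# "across" is followed by "as" twice and "more" once in this toy corpus:
assert abs(p_mle("across", "as") - 2 / 3) < 1e-12
assert p_mle("across", "pill") == 0.0  # unseen bigram gets zero probability
```

The last assertion shows MLE's central weakness on sparse data: every unseen n-gram gets probability exactly zero, which the smoothing methods below repair.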
13
MAXIMUM LIKELIHOOD ESTIMATION (MLE)
14
MAXIMUM LIKELIHOOD ESTIMATION (MLE)
Example 1. A Paragraph Using Training Data
The bigram model uses the preceding word to help predict the next word. (End) In
general, this helps enormously, and gives us a much better model. (End) In some
cases the estimated probability of the word that actually comes next has gone up by
about an order of magnitude (was, to, sisters). (End) However, note that the bigram
model is not guaranteed to increase the probability estimate. (End)
Word counts (N = 79): C(the) = 7, C(bigram) = 2, C(model) = 3, C(the, bigram) = 2, C(the, bigram, model) = 2
1-gram: P(the) = 7/79, P(bigram) = 2/79
2-gram: P(bigram | the) = 2/7
3-gram: P(model | the, bigram) = 2/2
15
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Laplace’s law (1814; 1995)
Add a little bit of probability space to unseen events:
P_LAP(w_1…w_n) = (C(w_1…w_n) + 1) / (N + B)
16
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Word (N = 79 : B = seen(51) + unseen(70) = 121)
                         MLE         Laplace’s law
P(the)                   0.0886076   0.0400000
P(bigram | the)          0.2857143   0.0050951
P(model | the, bigram)   1.0000000   0.0083089
17
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
Page 202-203 (Associated Press [AP] newswire vocabulary)
Laplace’s law adds a little probability space for unseen events, but it adds far too much.
A 44-million-word corpus yields a vocabulary of 400,653 words → about 160,000,000,000 possible bigrams, so the number of bins far exceeds the number of training instances.
Laplace’s law inserts B into the denominator to reserve probability space for unseen events, but as a result about 46.5% of the probability space ends up assigned to unseen events:
N_0 × P_LAP(·) = 74,671,100,000 × 0.000137 / 22,000,000 ≈ 0.465
18
Lidstone’s law (1920) and the Jeffreys-Perks law (1973)
Lidstone’s law: add some positive value λ instead of 1.
Jeffreys-Perks law:
λ = 0.5; called ELE (Expected Likelihood Estimation).
P_Lid(w_1…w_n) = (C(w_1…w_n) + λ) / (N + Bλ)
LAPLACE’S LAW, LIDSTONE’S LAW AND THE JEFFREYS-PERKS LAW
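The Laplace and Lidstone estimates differ only in the added constant, so one small function covers both (a minimal sketch; the function name is illustrative):

```python
# Lidstone smoothing: P_Lid = (C + lam) / (N + B*lam).
# lam = 1 gives Laplace's law; lam = 0.5 gives the Jeffreys-Perks law (ELE).

def p_lidstone(count, N, B, lam=0.5):
    """Smoothed probability of an n-gram seen `count` times,
    given N training instances divided into B bins."""
    return (count + lam) / (N + B * lam)

# Laplace (lam = 1) on the slide's example: C(the) = 7, N = 79, B = 121:
assert abs(p_lidstone(7, 79, 121, lam=1.0) - 8 / 200) < 1e-12  # = 0.04
# An unseen n-gram (count 0) still receives a little probability mass:
assert p_lidstone(0, 79, 121, lam=0.5) > 0.0
```

Note how the unseen-event probability shrinks as λ decreases, which is exactly the "trust in relative frequencies" trade-off discussed on the next slides.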
19
LIDSTONE’S LAW
Using Lidstone’s law, instead of adding one, add some smaller positive value λ. Writing μ = N / (N + Bλ), the estimate is a linear interpolation between the MLE and the uniform prior 1/B:
P_Lid = μ · C(w_1…w_n)/N + (1 − μ) · 1/B
20
LIDSTONE’S LAW
Here, λ = 0 gives the maximum likelihood estimate and λ = 1 gives Laplace’s law; as λ tends to infinity we approach the uniform estimate 1/B.
λ represents the trust we have in relative frequencies: λ < 1 implies more trust in relative frequencies than Laplace’s law, while λ > 1 represents less trust.
In practice, people use values of λ in the range 0 < λ < 1, a common value being λ = 0.5 (the Jeffreys-Perks law).
21
JEFFREYS-PERKS LAW
Using Lidstone’s law with different values of λ:

     MLE       Lidstone   Jeffreys-Perks   Lidstone   Laplace    Lidstone
     (λ = 0)   (λ = 0.3)  (λ = 0.5)        (λ = 0.7)  (λ = 1)    (λ = 2)
A    0.0886    0.0633     0.0538           0.0470     0.0400     0.0280
B    0.2857    0.0081     0.0063           0.0056     0.0051     0.0049
C    1.0000    0.0084     0.0085           0.0083     0.0083     0.0083

*A: P(the), B: P(bigram | the), C: P(model | the, bigram)
22
HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)
For each n-gram, w_1…w_n, let:
C_1(w_1…w_n) = frequency of w_1…w_n in the training data
C_2(w_1…w_n) = frequency of w_1…w_n in the held out data
Let T_r = Σ_{w_1…w_n : C_1(w_1…w_n) = r} C_2(w_1…w_n)
be the total number of times that all n-grams that appeared r times in the training text appeared in the held out data. An estimate for the probability of one of these n-grams is:
P_ho(w_1…w_n) = T_r / (N_r · N), where r = C_1(w_1…w_n).
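The T_r / (N_r · N) computation can be sketched directly from those definitions (toy bigram lists and the function name are illustrative, not the chapter's code):

```python
from collections import Counter

# Held-out estimation sketch: for each training frequency r,
# P_ho = T_r / (N_r * N), where N_r is the number of n-gram types seen
# r times in training and T_r is their total count in the held-out data.

def held_out_probs(train_bigrams, heldout_bigrams, N):
    """Return {r: P_ho}. N is the number of training instances,
    per the chapter's notation (toy data below uses equal-sized sets)."""
    c_train = Counter(train_bigrams)
    c_held = Counter(heldout_bigrams)
    n_r = Counter(c_train.values())   # N_r: types with training frequency r
    t_r = Counter()                   # T_r: their total held-out count
    for bg, r in c_train.items():
        t_r[r] += c_held[bg]
    return {r: t_r[r] / (n_r[r] * N) for r in n_r}

train = [("a", "b"), ("a", "b"), ("b", "c"), ("c", "a")]
held = [("a", "b"), ("b", "c"), ("b", "c"), ("a", "a")]
probs = held_out_probs(train, held, N=len(train))

# ("a","b") is the only type with r = 2; it appears once in held-out data:
assert abs(probs[2] - 1 / (1 * 4)) < 1e-12
```

All n-grams that share a training frequency r share the same held-out estimate, which is the equivalence-classing idea from the start of the chapter.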
23
[Full text ( ) : , respectively], unseen word : I don't know. [Word ( ) : , unseen word : 70, respectively] : (Training Data)
[Word ( ) : , unseen word : 51- , respectively] : (Held out Data)
(1-gram) Training data : , , ( ) Held out data : , , ( )
HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)
24
This checks how many times a bigram that occurred r times in the training text occurs in an additionally drawn text (further text).
• Held-out estimation : a method for predicting how often a bigram that appeared r times in the training text will appear in further text.
• Test data (independent of the training data) is only 5-10% of the whole data, but that is enough to be reliable. • We want to divide the data into training data and test data (validated vs. unvalidated data). • Held-out data (10%). • The held-out estimates of the n-grams are obtained from the held-out data.
HELD OUT ESTIMATION(JELINEK AND MERCER, 1985)
25
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
Use data for both training and validation:
Divide the training data into 2 parts. Train on A, validate on B. Train on B, validate on A. Combine the two models.

  A: train      B: validate   → Model 1
  A: validate   B: train      → Model 2
  Model 1 + Model 2 → Final Model
26
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
Cross-validation : the training data is used both as initial training data and as held-out data.
On large training corpora, deleted estimation works better than held-out estimation
P_ho(w_1…w_n) = T_r^01 / (N_r^0 · N)  or  T_r^10 / (N_r^1 · N), where r = C(w_1…w_n)
P_del(w_1…w_n) = (T_r^01 + T_r^10) / (N · (N_r^0 + N_r^1)), where r = C(w_1…w_n)
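Deleted estimation just runs held-out counting in both directions and pools the statistics. A toy sketch (data and names are illustrative):

```python
from collections import Counter

# Deleted estimation: split the training data into parts 0 and 1 and combine
# the two held-out directions:
#   P_del = (T_r_01 + T_r_10) / (N * (N_r_0 + N_r_1))

def deleted_estimate(part0, part1, N):
    c0, c1 = Counter(part0), Counter(part1)
    nr = Counter()   # N_r_0 + N_r_1
    tr = Counter()   # T_r_01 + T_r_10
    for bg, r in c0.items():
        nr[r] += 1
        tr[r] += c1[bg]
    for bg, r in c1.items():
        nr[r] += 1
        tr[r] += c0[bg]
    return {r: tr[r] / (nr[r] * N) for r in nr}

part0 = [("a", "b"), ("a", "b"), ("b", "c")]
part1 = [("a", "b"), ("c", "a")]
est = deleted_estimate(part0, part1, N=len(part0) + len(part1))
assert set(est) == {1, 2}
```

Pooling the two directions is what damps the difference between N_r^0 and N_r^1 noted on the next slide.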
27
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
[Full text ( ) : , respectively], unseen word : I don't know. [Word ( ) : , unseen word : 70, respectively] : (Training Data)
[A-part word ( ) : , unseen word : 101, respectively] [B-part word ( ) : , unseen word : 90- , respectively]
A-part data : , ( )
B-part data : , ( )
, .
28
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
[B-part word ( ) : , unseen word : 90- , respectively] [A-part word ( ) : , unseen word : 101+ , respectively]
B-part data : , ( )
A-part data : , ( )
, .
[Result]
, .
29
CROSS-VALIDATION (DELETED ESTIMATION; JELINEK AND MERCER, 1985)
Using the held-out estimation idea, we get the same effect by dividing the training data into two parts; this method is called cross-validation.
A more effective approach: combining the two directions reduces the difference between N_r^0 and N_r^1.
On a large training corpus, deleted estimation is more reliable than held-out estimation.
30
GOOD-TURING ESTIMATION(GOOD, 1953) : [BINOMIAL DISTRI-BUTION]
Idea: re-estimate probability mass assigned to N-grams with zero counts
Adjust actual counts to expected counts with formula
r* = (r + 1) · E[N_{r+1}] / E[N_r]
P_GT = r* / N
(r* is an adjusted frequency)
(E denotes the expectation of
random variable)
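A bare-bones sketch of the adjusted-count formula, using raw N_r in place of the expectations (illustrative only; practical Good-Turing implementations first smooth the N_r sequence, since high counts have N_{r+1} = 0):

```python
from collections import Counter

# Good-Turing: adjusted count r* = (r + 1) * N_{r+1} / N_r, then P_GT = r*/N.
# Raw N_r version for illustration; names and toy counts are made up.

def good_turing_adjusted_counts(ngram_counts):
    n_r = Counter(ngram_counts.values())   # frequency of frequencies
    def r_star(r):
        # Falls to 0 when N_{r+1} = 0 (highest counts) -- a known artifact
        # of the unsmoothed version.
        return (r + 1) * n_r[r + 1] / n_r[r] if n_r[r] else 0.0
    return {ng: r_star(r) for ng, r in ngram_counts.items()}

counts = Counter([("a", "b")] * 3 + [("b", "c")] * 2 + [("c", "a"), ("a", "c")])
adj = good_turing_adjusted_counts(counts)

# Singletons (r = 1, N_1 = 2) are discounted using N_2 = 1: r* = 2*1/2 = 1.0
assert adj[("c", "a")] == 1.0
```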
31
GOOD-TURING ESTIMATION(GOOD, 1953) : [BINOMIAL DISTRI-BUTION]
If
If
When it is small : When it is large : So, what was over-estimated is adjusted downward to an under-estimate.
32
NOTE
Drawback : over-estimation
[Two discounting models] (Ney and Essen, 1993; Ney et al., 1994)
Absolute discounting : lowers each over-estimated count by subtracting a fixed amount.
Linear discounting : adjusts it using a multiplicative factor.
33
NOTE
Drawback : over-estimation
[Natural Law of Succession] (Ristad, 1995)
34
COMBINING ESTIMATORS Basic Idea
Consider how to combine multiple probability estimate from various different models
How can you develop a model to utilize different length n-grams as appropriate?
Simple linear interpolation
P_li(w_n | w_{n-2}, w_{n-1}) = λ_1 P_1(w_n) + λ_2 P_2(w_n | w_{n-1}) + λ_3 P_3(w_n | w_{n-2}, w_{n-1}),
where 0 ≤ λ_i ≤ 1 and Σ_i λ_i = 1.
Combination of trigram, bigram and unigram.
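The mixture is a weighted sum with weights summing to 1; a minimal sketch (component probabilities and weights below are made-up stand-ins):

```python
# Simple linear interpolation of unigram, bigram, and trigram estimates:
#   P_li = l1*P1(w) + l2*P2(w|w2) + l3*P3(w|w1,w2), with l1 + l2 + l3 = 1.

def interpolate(p_uni, p_bi, p_tri, lambdas=(0.2, 0.3, 0.5)):
    l1, l2, l3 = lambdas
    assert abs(l1 + l2 + l3 - 1.0) < 1e-12  # weights must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Even when the trigram estimate is zero, the mixture stays nonzero:
p = interpolate(p_uni=0.01, p_bi=0.05, p_tri=0.0)
assert abs(p - (0.2 * 0.01 + 0.3 * 0.05)) < 1e-12
```

In practice the λ_i are themselves trained, e.g. by EM on held-out data.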
35
COMBINING ESTIMATORS [Katz’s backing-off] (Katz, 1987)
Example
36
COMBINING ESTIMATORS [Katz’s backing-off] (Katz, 1987) If sequence unseen : use shorter sequence Ex. If P(San | going to) = 0, Use P(San | to)
P_bo(w_n | w_{n-2}, w_{n-1}) = τ(w_n | w_{n-2}, w_{n-1})                      if C(w_{n-2} w_{n-1} w_n) > 0
                             = α(w_{n-2}, w_{n-1}) · P_bo(w_n | w_{n-1})      if C(w_{n-2} w_{n-1} w_n) = 0
(α : back-off weight; P_bo(w_n | w_{n-1}) : lower-order probability; τ : discounted higher-order probability)
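The control flow of back-off is simple; a schematic sketch (the discounted table, the α value, and the example trigrams are placeholders, not Katz's actual Good-Turing-based discounts):

```python
# Katz-style back-off (schematic): use the discounted higher-order estimate
# when the n-gram was seen; otherwise back off to a weighted lower-order one.

def backoff_prob(trigram, discounted_tri, alpha, bigram_prob):
    """P_bo(w3|w1,w2) = tau(w1,w2,w3) if seen, else alpha(w1,w2)*P(w3|w2)."""
    if trigram in discounted_tri:
        return discounted_tri[trigram]
    return alpha * bigram_prob

disc = {("going", "to", "bed"): 0.4}   # hypothetical discounted trigram table

# Seen trigram: the discounted estimate is used directly.
assert backoff_prob(("going", "to", "bed"), disc, 0.5, 0.2) == 0.4
# Unseen trigram "going to San": back off to alpha * P(San | to).
assert abs(backoff_prob(("going", "to", "San"), disc, 0.5, 0.2) - 0.1) < 1e-12
```

In real Katz back-off, τ comes from Good-Turing discounting and α is computed per history so the conditional distribution sums to 1.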
37
COMBINING ESTIMATORS [General linear interpolation]
P_li(w | h) = Σ_i λ_i(h) · P_i(w | h), where 0 ≤ λ_i(h) ≤ 1 and Σ_i λ_i(h) = 1
38
COMBINING ESTIMATORS
Interpolated smoothing : P_int(w | h) = τ(w | h) + λ(h) · P_int(w | h′)
(τ : discounted higher-order probability; λ(h) : weight; P_int(w | h′) : lower-order probability)
Seems to work better than back-off smoothing
39
Witten Bell smoothing
P_WB(w_i | w_{i-1}) = λ(w_{i-1}) · P_ML(w_i | w_{i-1}) + (1 − λ(w_{i-1})) · P_WB(w_i)
1 − λ(w_{i-1}) = N_{1+}(w_{i-1} •) / (N_{1+}(w_{i-1} •) + C(w_{i-1}))
where N_{1+}(w_{i-1} •) = |{w : C(w_{i-1} w) > 0}|, the number of distinct word types that follow w_{i-1}
NOTE
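A toy bigram sketch of Witten-Bell, under the interpolated formulation above (the corpus and function name are illustrative, not the book's code):

```python
from collections import Counter

# Witten-Bell (bigram sketch): interpolate the ML bigram estimate with the
# unigram estimate; the weight for a history depends on how many distinct
# word types were observed after it.
tokens = "a b a c a b b c".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = len(tokens)

def p_witten_bell(w_prev, w):
    c_hist = sum(c for (h, _), c in bigrams.items() if h == w_prev)
    if c_hist == 0:                    # unseen history: fall back to unigram
        return unigrams[w] / N
    n_types = len({v for (h, v) in bigrams if h == w_prev})  # N1+(w_prev .)
    lam = c_hist / (c_hist + n_types)
    p_ml = bigrams[(w_prev, w)] / c_hist
    return lam * p_ml + (1 - lam) * (unigrams[w] / N)

# The unseen bigram (c, b) still gets probability via the unigram term:
assert p_witten_bell("c", "b") > 0
```

Histories followed by many different word types keep less of their ML mass, since they are the ones most likely to be followed by yet another new type.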
40
Absolute discounting: like Jelinek-Mercer, it involves interpolation of higher- and lower-order models. But instead of multiplying the higher-order estimate by a λ, we subtract a fixed discount D ∈ [0, 1] from each nonzero count:
P_abs(w_i | w_{i-1}) = max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + (1 − λ(w_{i-1})) · P(w_i)
To make it sum to 1: (1 − λ(w_{i-1})) = (D / C(w_{i-1})) · N_{1+}(w_{i-1} •)
Choose D using held-out estimation.
NOTE
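A bigram sketch of absolute discounting (toy corpus, illustrative names; D = 0.5 is a placeholder, which the slide suggests choosing by held-out estimation):

```python
from collections import Counter

# Absolute discounting (bigram sketch): subtract a fixed discount D from
# every nonzero bigram count and give the freed mass to the unigram model.
tokens = "a b a c a b b c".split()
bigrams = Counter(zip(tokens, tokens[1:]))
unigrams = Counter(tokens)
N = len(tokens)
D = 0.5

def p_abs(w_prev, w):
    c_hist = sum(c for (h, _), c in bigrams.items() if h == w_prev)
    n_types = len({v for (h, v) in bigrams if h == w_prev})  # N1+(w_prev .)
    discounted = max(bigrams[(w_prev, w)] - D, 0) / c_hist
    backoff_weight = D * n_types / c_hist   # mass freed by discounting
    return discounted + backoff_weight * (unigrams[w] / N)

# The conditional distribution still sums to 1 over the vocabulary:
total = sum(p_abs("a", w) for w in unigrams)
assert abs(total - 1.0) < 1e-9
```

The normalization check works because the discount frees exactly D · N_{1+}(w_prev •) counts, which is then spread according to the (normalized) unigram distribution.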
41
KN smoothing (1995): an extension of absolute discounting with a clever way of constructing the lower-order (back-off) model. Idea: the lower-order model is significant only when the count is small or zero in the higher-order model, and so should be optimized for that purpose:
P_KN(w_i | w_{i-1}) = max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + λ(w_{i-1}) · P_cont(w_i)
NOTE
42
An empirical study of smoothing techniques for language modeling (1999)
For a bigram model, we would like to select a smoothed distribution P_KN that satisfies the following constraint on unigram marginals for all w_i:
(1) Σ_{w_{i-1}} P_KN(w_{i-1} w_i) = C(w_i) / Σ_{w_i} C(w_i)   (the constraint)
(2) From (1), expand the left side using P_KN(w_{i-1} w_i) = P(w_{i-1}) · P_KN(w_i | w_{i-1}).
(3) From (2), solve for the lower-order (continuation) distribution.
NOTE
43
C(w_i) = Σ_{w_{i-1}} C(w_{i-1}) · [ max(C(w_{i-1} w_i) − D, 0) / C(w_{i-1}) + (D / C(w_{i-1})) · N_{1+}(w_{i-1} •) · P_cont(w_i) ]
= Σ_{w_{i-1}} max(C(w_{i-1} w_i) − D, 0) + D · P_cont(w_i) · Σ_{w_{i-1}} N_{1+}(w_{i-1} •)
= C(w_i) − D · N_{1+}(• w_i) + D · P_cont(w_i) · N_{1+}(• •)
NOTE
44
N_{1+}(• w_i) = |{w_{i-1} : C(w_{i-1} w_i) > 0}|
N_{1+}(• •) = Σ_{w_i} N_{1+}(• w_i) = |{(w_{i-1}, w_i) : C(w_{i-1} w_i) > 0}|
P_cont(w_i) = N_{1+}(• w_i) / N_{1+}(• •)
NOTE
45
Generalizing to higher-order models, we have that
P_cont(w_i | w_{i-n+2} … w_{i-1}) = N_{1+}(• w_{i-n+2} … w_i) / N_{1+}(• w_{i-n+2} … w_{i-1} •)
where N_{1+}(• w_{i-n+2} … w_i) = |{w_{i-n+1} : C(w_{i-n+1} … w_i) > 0}| and N_{1+}(• w_{i-n+2} … w_{i-1} •) = Σ_{w_i} N_{1+}(• w_{i-n+2} … w_i)
NOTE
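A bigram sketch of Kneser-Ney putting the pieces together: absolute discounting plus the continuation probability for the lower-order model (toy corpus and names are illustrative; D = 0.5 is a placeholder):

```python
from collections import Counter

# Kneser-Ney (bigram sketch): absolute discounting, but the lower-order
# model is the *continuation* probability -- based on how many distinct
# left contexts a word appears in, not its raw frequency.
tokens = "a b a c a b b c d b".split()
bigrams = Counter(zip(tokens, tokens[1:]))
D = 0.5

# Continuation counts: N1+(. w) = number of distinct predecessors of w.
n1plus_pre = Counter()
for (h, w) in bigrams:
    n1plus_pre[w] += 1
n1plus_total = sum(n1plus_pre.values())   # N1+(. .)

def p_kn(w_prev, w):
    c_hist = sum(c for (h, _), c in bigrams.items() if h == w_prev)
    n_types = len({v for (h, v) in bigrams if h == w_prev})  # N1+(w_prev .)
    lam = D * n_types / c_hist
    p_cont = n1plus_pre[w] / n1plus_total
    return max(bigrams[(w_prev, w)] - D, 0) / c_hist + lam * p_cont

total = sum(p_kn("a", w) for w in set(tokens))
assert abs(total - 1.0) < 1e-9
```

A frequent word that appears after only one history (the classic "Francisco" example) gets a small continuation probability, which is exactly when the lower-order model matters.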