INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning, Lecture 14, 16.11 1


Page 1:

INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING

Jan Tore Lønning, Lecture 14, 16.11

1

Page 2:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

2

Page 3:

Collocations

Page 4:

Does a sample belong to a population with mean µ?

4

1. General case, example: height, sentence length
   1. Known standard deviation σ: calculate the z-score
      z = (X̄ − µ) / √(σ² / n)
      and the corresponding p-value.
   2. Unknown standard deviation: approximate it by the sample standard deviation s; calculate the t-score
      t = (X̄ − µ) / √(s² / n)
      and use the t(n−1) density curve.
2. Proportion: n items, k successes, p̂ = k/n; p = µ (the expected proportion), σ² = p(1 − p).
   For large n, z-score: z = (p̂ − p) / √(p(1 − p) / n) and the corresponding p-value.
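The two scores above can be sketched in a few lines of Python; the sample mean, µ, σ, and n below are made-up illustrative numbers, not from the slides.

```python
import math
from statistics import NormalDist

# z-score for a sample mean when the population st.dev. sigma is known
def z_score(sample_mean, mu, sigma, n):
    return (sample_mean - mu) / math.sqrt(sigma ** 2 / n)

# two-sided p-value from the standard normal distribution
def p_value(z):
    return 2 * (1 - NormalDist().cdf(abs(z)))

# illustrative numbers: a sample of 100 sentence lengths with mean 17.2,
# tested against a population with mu = 15 and sigma = 6
z = z_score(sample_mean=17.2, mu=15.0, sigma=6.0, n=100)
```

With unknown σ one would plug in the sample standard deviation s instead and read the score off the t(n−1) curve rather than the normal.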

Page 5:

“T-test” for collocations

How likely/unlikely is it for w1 and w2 to occur together?

Null hypothesis, H0: P(w1=new, w2=companies) = P(w1=new) × P(w2=companies)

Alternative hypothesis, Ha: P(w1=new, w2=companies) > P(w1=new) × P(w2=companies)

The expectation: µ = p = P(w1=new) × P(w2=companies) = 3.615 × 10⁻⁷

p̂ = P(w1=new, w2=companies) = 5.591 × 10⁻⁷

N = 14307668

Manning and Schütze use the t-score: t = (X̄ − µ) / √(s² / n)

Since this is a proportion, it is more correct to use z = (p̂ − p) / √(p(1 − p) / n)

Page 6:

With z-distribution and σ

z = (x̄ − µ) / √(σ² / n) = (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / √(3.615 × 10⁻⁷ / 14307668) ≈ 1.24

(Better score than the book – but I still think it can be defended!)

z = (p̂ − p) / √(p(1 − p) / n)
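The slide's computation can be reproduced directly from its numbers; this sketch uses the proportion-based z-score:

```python
import math

# Numbers from the slides for the bigram "new companies"
p0 = 3.615e-7      # expected proportion under H0: P(new) * P(companies)
p_hat = 5.591e-7   # observed proportion of the bigram
N = 14307668       # corpus size

# z-score for a proportion; the variance under H0 is p0 * (1 - p0)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / N)
# z comes out near 1.24, below the 1.645 threshold for a one-sided 5% test
```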

Page 7:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

7

Page 8:

Collocations tests 8

1. T-test
2. Absolute frequencies
3. Dice score: 2 × P(w1 w2) / (P(w1) + P(w2))
4. Pointwise mutual information: log( P(w1 w2) / (P(w1) × P(w2)) )
5. χ² test
6. (Mutual information)
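The Dice score and PMI from the list can be sketched as follows. The counts C(new) = 15828 and C(companies) = 4675 are assumptions chosen to be consistent with the probabilities on the earlier slides, not stated here.

```python
import math

# Dice score: 2 * P(w1 w2) / (P(w1) + P(w2)); the corpus size N cancels out
def dice(c12, c1, c2):
    return 2 * c12 / (c1 + c2)

# Pointwise mutual information, base-2 log
def pmi(c12, c1, c2, N):
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# counts assumed consistent with the "new companies" example
d = dice(c12=8, c1=15828, c2=4675)
score = pmi(c12=8, c1=15828, c2=4675, N=14307668)
```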

Page 9:

Pointwise mutual information

In the t-test we compared P(new companies) with P(new) × P(companies)

Idea: compare these directly: P(new companies) / (P(new) × P(companies))

log( P(new companies) / (P(new) × P(companies)) ) = pointwise mutual information

Observe: log does not change the ranking, and log has a theoretical motivation.

Page 10:

Problems with PMI

If w1 and w2 only occur together, the score increases as the frequency of w2 decreases:
P(w1 w2) = P(w1) = P(w2), so P(w1 w2) / (P(w1) × P(w2)) = 1 / P(w2)

A bad measure of dependence, but a good measure of independence.

All measures are unreliable for small samples.

Page 11:

χ2 test

O11 = 8

E11 = P(new) × P(companies) × N = (C(new)/N) × (C(companies)/N) × N = (O11 + O12) × (O11 + O21) / N, etc.

H0: no association between the row and the column variable.

If H0 is true, the X² statistic has approximately a χ² distribution with (r−1)(c−1) degrees of freedom (d.f.)

Page 12:

In practice

Page 13:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

13

Page 14:

Comparing two samples 14

X = {x1, …, xn}, Y = {y1, …, ym}: «do they belong to the same population?»

1. General case, example: height, sentence length
   t = (X̄ − Ȳ) / √(sx² / n + sy² / m), with the t(k−1) density curve (k = min(n, m))

2. Proportions: n items, k successes, p̂ = k/n, s² = p̂(1 − p̂)
   z = (p̂1 − p̂2) / √(p̂1(1 − p̂1) / n + p̂2(1 − p̂2) / m), use the z density curve
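The two-proportion case can be sketched directly; the counts below are made up for illustration.

```python
import math

# z-score for comparing two proportions, as in case 2 above
def two_proportion_z(k1, n, k2, m):
    p1, p2 = k1 / n, k2 / m
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / m)
    return (p1 - p2) / se

# illustrative counts: 120/1000 successes vs 90/1000 successes
z = two_proportion_z(k1=120, n=1000, k2=90, m=1000)
```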

Page 15:

Applied to collocations

Assumption: 1-p is so close to 1 that p is a good approximation to s2 = p(1-p)

Page 16: INF5820 – 2011 fall Natural Language Processing...4 1. General case, example: height, sentence length. 1. Known standard deviation σ Calculate z-score: z = 𝑋 −𝜇 𝜎2 𝑛
Page 17: INF5820 – 2011 fall Natural Language Processing...4 1. General case, example: height, sentence length. 1. Known standard deviation σ Calculate z-score: z = 𝑋 −𝜇 𝜎2 𝑛

What is a collocation?

Non-compositional. Ex.: N-N compounds in English, written as one word in German/Norwegian: white wine

Non-substitutability: red wine vs. burgundy wine

Frozen

Or only frequent co-occurrence?

Page 18:

IR 13.5

Feature selection 18

Page 19:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

19

Page 20:

Feature selection 20

Including all words as features is too costly. Select features which:
- separate between the classes
- are expected to occur with a reasonable frequency

To select, use similar measures as for collocations, e.g.:
- raw frequency
- chi-square
- mutual information
- …

Page 21:

Apply chi square 21

Assume a binary classifier: only two classes {yes, no} and binary features

For every feature calculate the chi square score between the two values of the feature and the two classes

Select the words with the highest score as features

Page 22:

Chi square 22

                      Sense: musical instrument
                      Yes    No    Sums
"guitar" in context
  Yes                  25      5     30
  No                  275    695    970
  Sums                300    700   1000

Page 23:

Chi square 23

Observations:
                 Is in class s?
                 Yes    No     Sums
w in context
  Yes            O11    O10    O1x
  No             O01    O00    O0x
  Sums           Ox1    Ox0    N

Expectations:
                 Is in class s?
                 Yes              No               Sums
w in context
  Yes            E11=O1x×Ox1/N    E10=O1x×Ox0/N    O1x
  No             E01=O0x×Ox1/N    E00=O0x×Ox0/N    O0x
  Sums           Ox1              Ox0              N

Page 24:

Chi square 24

Observations:
                      Sense: musical instrument
                      Yes    No    Sums
"guitar" in context
  Yes                  25      5     30
  No                  275    695    970
  Sums                300    700   1000

Expectations:
                      Yes                 No                  Sums
"guitar" in context
  Yes                 9 = 30×300/1000     21 = 30×700/1000      30
  No                  291                 679                   970
  Sums                300                 700                  1000
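The expectations and the X² statistic for the guitar table can be checked with a short computation:

```python
# Observed counts from the slides: rows = "guitar" in context (yes/no),
# columns = sense "musical instrument" (yes/no)
O = [[25, 5], [275, 695]]
N = sum(sum(row) for row in O)                      # 1000

row = [sum(r) for r in O]                           # [30, 970]
col = [O[0][j] + O[1][j] for j in range(2)]         # [300, 700]

# Expected counts under independence: E_ij = row_i * col_j / N
E = [[row[i] * col[j] / N for j in range(2)] for i in range(2)]

# X^2 = sum over cells of (O - E)^2 / E
x2 = sum((O[i][j] - E[i][j]) ** 2 / E[i][j]
         for i in range(2) for j in range(2))
```

This reproduces E11 = 9 and E10 = 21 from the slide; the resulting X² (about 42) is far above the 3.84 critical value for one degree of freedom at the 5% level.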

Page 25:

Line, binary, ‘product’, BoW-only 25

Number of word features   Most frequent   Chi-square
   0                      0.528           0.528
  10                      0.632           0.786
  20                      0.724           0.826
  50                      0.816           0.846
 100                      0.844           0.862
 200                      0.864           0.878
 500                      0.864           0.898
1000                      0.892           0.906
2000                      0.912           0.912
5000                      0.918           0.916

Page 26:

Feature selection, contd. 26

Similarly to collocations, we may use other association measures, e.g. pointwise mutual information or mutual information.

A difference between the measures is how they trade off discrimination against frequency.

Page 27:

Pointwise mutual information 27

PMI = log( O11 / E11 )

Page 28:

Mutual information 28

I(W; C) = Σ_{i=0,1} Σ_{j=0,1} P̂(W=i, C=j) log [ P̂(W=i, C=j) / ( P̂(W=i) × P̂(C=j) ) ]
        = Σ_{i=0,1; j=0,1} (Oij / N) log [ (N × Oij) / (Oix × Oxj) ]
        = Σ_{i=0,1; j=0,1} (Oij / N) log ( Oij / Eij )
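The formula can be evaluated on the guitar table from the earlier slides (using a base-2 log is an assumption here; another base only rescales the value):

```python
import math

# 2x2 table: O[i][j] = count of (word in context = i, class = j)
O = [[25, 5], [275, 695]]
N = sum(sum(r) for r in O)
row = [sum(r) for r in O]                       # marginals O_ix
col = [O[0][j] + O[1][j] for j in range(2)]     # marginals O_xj

# I(W; C) = sum_ij (O_ij / N) * log( O_ij / E_ij ), with E_ij = row_i * col_j / N
mi = sum((O[i][j] / N) * math.log2(O[i][j] * N / (row[i] * col[j]))
         for i in range(2) for j in range(2))
```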

Page 29:

Multinomial logistic regression

Page 30:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

30

Page 31:

A slight reformulation

We saw that for NB, P(c1|f) > P(c2|f) iff

P(c1) ∏_{j=1..n} P(fj|c1) > P(c2) ∏_{j=1..n} P(fj|c2)

iff

log P(c1) + Σ_{j=1..n} log P(fj|c1) > log P(c2) + Σ_{j=1..n} log P(fj|c2)

This could also be written

( log P(c1) − log P(c2) ) + Σ_{j=1..n} ( log P(fj|c1) − log P(fj|c2) ) > 0

or, equivalently,

log( P(c1) / P(c2) ) + Σ_{j=1..n} log( P(fj|c1) / P(fj|c2) ) > 0

31

Page 32:

Reformulation, contd. 32

This has the form

w ⋅ f = Σ_{i=0..M} wi xi > 0

where

wj = wj¹ − wj²

and, with our earlier weights,

wj¹ = log P(fj|c1) and wj² = log P(fj|c2)

(The probability in this notation:

P(c1|f) = e^{w¹⋅f} / ( e^{w¹⋅f} + e^{w²⋅f} ) = 1 / ( 1 + e^{(w²−w¹)⋅f} ) = 1 / ( 1 + e^{−w⋅f} )

and similarly for P(c2|f).)
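The logistic form of this decision rule can be verified numerically; the weight and feature vectors here are made-up toy values, not from the slides.

```python
import math

# w_j = log P(f_j | c1) - log P(f_j | c2); toy values for illustration
w = [0.5, -1.2, 0.8]
f = [1, 0, 1]          # binary feature vector (f_0 could serve as a bias slot)

dot = sum(wi * fi for wi, fi in zip(w, f))
p_c1 = 1 / (1 + math.exp(-dot))   # P(c1 | f) in the logistic form
# the linear rule (dot > 0) and the probabilistic rule (p_c1 > 0.5) agree
```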

Page 33:

Multinomial logistic regression

We may generalize this to more than two classes. For each class cj, j = 1,…,k, we have a linear expression and the probability of belonging to class cj:

P(cj|f) = (1/Z) exp(wj ⋅ f) = (1/Z) ∏_i e^{wji fi} = (1/Z) ∏_i aji^{fi}

where

aji = e^{wji}

and

Z = Σ_{j=1..k} exp(wj ⋅ f), with wj ⋅ f = Σ_{i=0..M} wji xi

Naive Bayes (Bernoulli) – binary NB as a linear classifier
Logistic regression – multinomial logistic regression

33
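A minimal sketch of the softmax computation above, with made-up weights for k = 3 classes over two features plus a bias slot:

```python
import math

W = [[0.1, 0.5, -0.3],   # w_1 (toy values)
     [0.0, -0.2, 0.4],   # w_2
     [0.2, 0.1, 0.1]]    # w_3
f = [1.0, 2.0, 0.5]      # f[0] = 1 plays the role of the bias input x_0

# P(c_j | f) = exp(w_j . f) / Z, with Z summing over all k classes
scores = [sum(wi * fi for wi, fi in zip(w, f)) for w in W]
Z = sum(math.exp(s) for s in scores)
probs = [math.exp(s) / Z for s in scores]
# probs sums to 1; the predicted class is the argmax over probs
```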

Page 34:

Footnote: Alternative formulation

(In case you read other presentations, like Mitchell or Hastie et al.: they use a slightly different formulation, corresponding to

Z = 1 + Σ_{i=1..k−1} exp(wi ⋅ f)

where for i = 1, 2, …, k−1:

P(ci|f) = (1/Z) exp(wi ⋅ f) = (1/Z) ∏_j e^{wij fj} = (1/Z) ∏_j aij^{fj}

But

P(ck|f) = 1 / ( 1 + Σ_{i=1..k−1} exp(wi ⋅ f) )

The two formulations are equivalent, though: in the J&M formulation, divide the numerator and the denominator in each P(ci|f) by

exp(wk ⋅ f)

and you get this formulation (with adjustments to Z and w).)

34

Page 35:

Indicator variables

Already seen: categorical variables represented by indicator variables, taking the values 0, 1.

It is also usual to let the features indicate both observation and class:

P(cj|f) = (1/Z) exp(wj ⋅ f) = exp(wj ⋅ f) / Σ_{l=1..k} exp(wl ⋅ f)
        = exp( Σ_{i=0..m} wji xi ) / Σ_{l=1..k} exp( Σ_{i=0..m} wli xi )
        = exp( Σ_{i=0..n} wi fi(cj, x) ) / Σ_{l=1..k} exp( Σ_{i=0..n} wi fi(cl, x) )

35

Page 36:

Examples – J&M 36

Page 37:

Why called ”maximum entropy”?

See NLTK book for a further example

37

P(NN)+P(JJ)+P(NNS)+P(VB)=1

P(NN)+P(NNS)=0.8

P(VB)=1/20
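Under these constraints, the maximum-entropy distribution spreads the remaining probability mass as evenly as the constraints allow; a quick check:

```python
# Constraints from the slide:
#   P(NN) + P(JJ) + P(NNS) + P(VB) = 1
#   P(NN) + P(NNS) = 0.8
#   P(VB) = 1/20
p = {"VB": 1 / 20}
p["JJ"] = 1 - 0.8 - p["VB"]   # only JJ is left outside the constrained mass
p["NN"] = p["NNS"] = 0.8 / 2  # maximum entropy splits the 0.8 evenly
```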

Page 38:

Why called ”maximum entropy”?

Multinomial logistic regression yields the probability distribution which gives the maximum entropy, given (the constraints derived from) our training data.

38

Page 39:

Line – Most frequent BoW-features 39

Number of word features   NaiveBayes   SklearnClassifier(LogisticRegression())
   0                      0.528        0.528
  10                      0.528        0.528
  20                      0.534        0.546
  50                      0.576        0.624
 100                      0.688        0.732
 200                      0.706        0.752
 500                      0.744        0.804
1000                      0.774        0.838
2000                      0.802        0.846
5000                      0.826        0.850

Page 40:

Line, binary, ‘product’, BoW-only 40

Number of word features   Bernoulli   Chi-square   SKLearn-LogReg   Chi-square LogReg
   0                      0.528       0.528        0.528            0.528
  10                      0.632       0.786        0.636            0.786
  20                      0.724       0.826        0.738            0.830
  50                      0.816       0.846        0.810            0.866
 100                      0.844       0.862        0.858            0.898
 200                      0.864       0.878        0.888            0.906
 500                      0.864       0.898        0.902            0.924
1000                      0.892       0.906        0.912            0.928
2000                      0.912       0.912        0.924            0.928
5000                      0.918       0.916        0.922            0.928

Page 41:

Comparing and combining classifiers

Page 42:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

42

Page 43:

Maxent vs Naive Bayes

If the Naive Bayes assumption is warranted – i.e. the features are independent – the two yield the same result in the limit.

Otherwise, Maxent copes better with dependencies between features.

With Maxent you may throw in features and let the model decide whether they are useful

Maxent training is slower

43

Page 44:

Generative vs discriminative model

Generative (e.g. NB): models P(o, c); from it we can derive P(c|o), argmax_c P(c|o), P(o), argmax_o P(o), P(o|c), …

Discriminative (e.g. Maxent): models only P(c|o); we can derive argmax_c P(c|o).

See the NLTK book.

44

Page 45:

More than two classes (in general)

Any-of or multivalue classification:
- An item may belong to 1, 0, or more than 1 class
- Classes are independent
- Use n binary classifiers
- Example: documents

One-of or multinomial classification:
- Each item belongs to exactly one class
- Classes are mutually exclusive
- Example: POS tagging

45

Page 46:

One of classifiers

Many classifiers are built for binary problems.

Simply combining several binary classifiers does not result in a one-of classifier.


46

Page 47:

Combining binary classifiers

Build a classifier for each class compared to its complement.

For a test document, evaluate it for membership in each class.

Assign the document to the class with either:
- maximum probability
- maximum score
- maximum confidence

Multinomial logistic regression is a good example. Sometimes one postpones the decision and proceeds with the probabilities (soft classification), e.g. Maxent tagging.

47
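The "assign to the class with the maximum score" step amounts to a single argmax over the per-class binary outputs; the class names and scores here are made up for illustration.

```python
# Per-class confidence from n one-vs-rest binary classifiers (toy values)
scores = {"sports": 0.35, "politics": 0.85, "culture": 0.60}

# one-of decision: pick the class whose binary classifier is most confident
predicted = max(scores, key=scores.get)
```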