INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING Jan Tore Lønning, Lecture 14, 16.11 1


Page 1:

INF5830 – 2015 FALL NATURAL LANGUAGE PROCESSING

Jan Tore Lønning, Lecture 14, 16.11

1

Page 2:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

2

Page 3:

Collocations

Page 4:

Does a sample belong to a population with mean µ?

4

1. General case, example: height, sentence length
   1. Known standard deviation σ: calculate the z-score
      z = (X̄ − µ) / √(σ² / n)
      and the corresponding p-value.
   2. Unknown standard deviation: approximate it by the sample standard deviation s; calculate the t-score
      t = (X̄ − µ) / √(s² / n)
      and use the t(n−1) density curve.
2. Proportion: n items, k successes, p̂ = k/n; p = µ (the expected proportion), σ² = p(1 − p).
   For large n, z-score: z = (p̂ − p) / √(p(1 − p) / n) and the corresponding p-value.
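The two scores above can be sketched in a few lines of Python; the sample mean, µ, σ, and n below are made-up illustrative numbers, not from the slides.

```python
import math
from statistics import NormalDist

# z-score for a sample mean when the population st.dev. sigma is known
def z_score(sample_mean, mu, sigma, n):
    return (sample_mean - mu) / math.sqrt(sigma ** 2 / n)

# two-sided p-value from the standard normal distribution
def p_value(z):
    return 2 * (1 - NormalDist().cdf(abs(z)))

# illustrative numbers: a sample of 100 sentence lengths with mean 17.2,
# tested against a population with mu = 15 and sigma = 6
z = z_score(sample_mean=17.2, mu=15.0, sigma=6.0, n=100)
```

With unknown σ one would plug in the sample standard deviation s instead and read the score off the t(n−1) curve rather than the normal.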

Page 5:

“T-test” for collocations

How likely/unlikely is it for w1 and w2 to occur together?

Null hypothesis, H0: P(w1=new, w2=companies) = P(w1=new) × P(w2=companies)

Alternative hypothesis, Ha: P(w1=new, w2=companies) > P(w1=new) × P(w2=companies)

The expectation: µ = p = P(w1=new) × P(w2=companies) = 3.615 × 10⁻⁷

p̂ = P(w1=new, w2=companies) = 5.591 × 10⁻⁷

N = 14307668

Manning and Schütze use the t-score: t = (X̄ − µ) / √(s² / n)

Since this is a proportion, it is more correct to use z = (p̂ − p) / √(p(1 − p) / n)

Page 6:

With z-distribution and σ

z = (x̄ − µ) / √(σ² / n) = (5.591 × 10⁻⁷ − 3.615 × 10⁻⁷) / √(3.615 × 10⁻⁷ / 14307668) ≈ 1.24

(Better score than the book – but I still think it can be defended!)

z = (p̂ − p) / √(p(1 − p) / n)
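The slide's computation can be reproduced directly from its numbers; this sketch uses the proportion-based z-score:

```python
import math

# Numbers from the slides for the bigram "new companies"
p0 = 3.615e-7      # expected proportion under H0: P(new) * P(companies)
p_hat = 5.591e-7   # observed proportion of the bigram
N = 14307668       # corpus size

# z-score for a proportion; the variance under H0 is p0 * (1 - p0)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / N)
# z comes out near 1.24, below the 1.645 threshold for a one-sided 5% test
```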

Page 7:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

7

Page 8:

Collocations tests 8

1. T-test
2. Absolute frequencies
3. Dice score: 2 × P(w1 w2) / (P(w1) + P(w2))
4. Pointwise mutual information: log( P(w1 w2) / (P(w1) × P(w2)) )
5. χ² test
6. (Mutual information)
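The Dice score and PMI from the list can be sketched as follows. The counts C(new) = 15828 and C(companies) = 4675 are assumptions chosen to be consistent with the probabilities on the earlier slides, not stated here.

```python
import math

# Dice score: 2 * P(w1 w2) / (P(w1) + P(w2)); the corpus size N cancels out
def dice(c12, c1, c2):
    return 2 * c12 / (c1 + c2)

# Pointwise mutual information, base-2 log
def pmi(c12, c1, c2, N):
    return math.log2((c12 / N) / ((c1 / N) * (c2 / N)))

# counts assumed consistent with the "new companies" example
d = dice(c12=8, c1=15828, c2=4675)
score = pmi(c12=8, c1=15828, c2=4675, N=14307668)
```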

Page 9:

Pointwise mutual information

In the t-test we compared P(new companies) with P(new) × P(companies)

Idea: compare these directly: P(new companies) / (P(new) × P(companies))

log( P(new companies) / (P(new) × P(companies)) ) = pointwise mutual information

Observe: log does not change the ranking, and log has a theoretical motivation.

Page 10:

Problems with PMI

If w1 and w2 only occur together, the score increases as the frequency of w2 decreases:
P(w1 w2) = P(w1) = P(w2), so P(w1 w2) / (P(w1) × P(w2)) = 1 / P(w2)

A bad measure of dependence, but a good measure of independence.

All measures are unreliable for small samples.

Page 11:

χ2 test

O11 = 8

E11 = P(new) × P(companies) × N = (C(new)/N) × (C(companies)/N) × N = (O11 + O12) × (O11 + O21) / N, etc.

H0: no association between the row and the column variable.

If H0 is true, the X² statistic has approximately a χ² distribution with (r−1)(c−1) degrees of freedom (d.f.)

Page 12:

In practice

Page 13:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

13

Page 14:

Comparing two samples 14

X = {x1, …, xn}, Y = {y1, …, ym}: «do they belong to the same population?»

1. General case, example: height, sentence length
   t = (X̄ − Ȳ) / √(sx² / n + sy² / m), with the t(k−1) density curve (k = min(n, m))

2. Proportions: n items, k successes, p̂ = k/n, s² = p̂(1 − p̂)
   z = (p̂1 − p̂2) / √(p̂1(1 − p̂1) / n + p̂2(1 − p̂2) / m), use the z density curve
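The two-proportion case can be sketched directly; the counts below are made up for illustration.

```python
import math

# z-score for comparing two proportions, as in case 2 above
def two_proportion_z(k1, n, k2, m):
    p1, p2 = k1 / n, k2 / m
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / m)
    return (p1 - p2) / se

# illustrative counts: 120/1000 successes vs 90/1000 successes
z = two_proportion_z(k1=120, n=1000, k2=90, m=1000)
```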

Page 15:

Applied to collocations

Assumption: 1-p is so close to 1 that p is a good approximation to s2 = p(1-p)

Page 16: INF5820 – 2011 fall Natural Language Processing...4 1. General case, example: height, sentence length. 1. Known standard deviation σ Calculate z-score: z = 𝑋 −𝜇 𝜎2 𝑛
Page 17: INF5820 – 2011 fall Natural Language Processing...4 1. General case, example: height, sentence length. 1. Known standard deviation σ Calculate z-score: z = 𝑋 −𝜇 𝜎2 𝑛

What is a collocation?

Non-compositional. Ex.: N-N compounds in English, written as one word in German/Norwegian: white wine

Non-substitutability: red wine vs. burgundy wine

Frozen

Or only frequent co-occurrence?

Page 18:

IR 13.5

Feature selection 18

Page 19:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

19

Page 20:

Feature selection 20

Including all words as features is too costly. Select features which:
- separate between the classes
- are expected to occur with a reasonable frequency

To select, use similar measures as for collocations, e.g.:
- raw frequency
- chi-square
- mutual information
- …

Page 21:

Apply chi square 21

Assume a binary classifier: only two classes {yes, no} and binary features

For every feature calculate the chi square score between the two values of the feature and the two classes

Select the words with the highest score as features

Page 22:

Chi square 22

                      Sense: musical instrument
                      Yes    No    Sums
"guitar" in context
  Yes                  25      5     30
  No                  275    695    970
  Sums                300    700   1000

Page 23:

Chi square 23

Observations:
                 Is in class s?
                 Yes    No     Sums
w in context
  Yes            O11    O10    O1x
  No             O01    O00    O0x
  Sums           Ox1    Ox0    N

Expectations:
                 Is in class s?
                 Yes              No               Sums
w in context
  Yes            E11=O1x×Ox1/N    E10=O1x×Ox0/N    O1x
  No             E01=O0x×Ox1/N    E00=O0x×Ox0/N    O0x
  Sums           Ox1              Ox0              N

Page 24:

Chi square 24

Observations:
                      Sense: musical instrument
                      Yes    No    Sums
"guitar" in context
  Yes                  25      5     30
  No                  275    695    970
  Sums                300    700   1000

Expectations:
                      Yes                 No                  Sums
"guitar" in context
  Yes                 9 = 30×300/1000     21 = 30×700/1000      30
  No                  291                 679                   970
  Sums                300                 700                  1000
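The expectations and the X² statistic for the guitar table can be checked with a short computation:

```python
# Observed counts from the slides: rows = "guitar" in context (yes/no),
# columns = sense "musical instrument" (yes/no)
O = [[25, 5], [275, 695]]
N = sum(sum(row) for row in O)                      # 1000

row = [sum(r) for r in O]                           # [30, 970]
col = [O[0][j] + O[1][j] for j in range(2)]         # [300, 700]

# Expected counts under independence: E_ij = row_i * col_j / N
E = [[row[i] * col[j] / N for j in range(2)] for i in range(2)]

# X^2 = sum over cells of (O - E)^2 / E
x2 = sum((O[i][j] - E[i][j]) ** 2 / E[i][j]
         for i in range(2) for j in range(2))
```

This reproduces E11 = 9 and E10 = 21 from the slide; the resulting X² (about 42) is far above the 3.84 critical value for one degree of freedom at the 5% level.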

Page 25:

Line, binary, ‘product’, BoW-only 25

Number of word features   Most frequent   Chi-square
   0                      0.528           0.528
  10                      0.632           0.786
  20                      0.724           0.826
  50                      0.816           0.846
 100                      0.844           0.862
 200                      0.864           0.878
 500                      0.864           0.898
1000                      0.892           0.906
2000                      0.912           0.912
5000                      0.918           0.916

Page 26:

Feature selection, contd. 26

Similarly to collocations, we may use other association measures, e.g. pointwise mutual information or mutual information.

A difference between the measures is how they trade off discrimination against frequency.

Page 27:

Pointwise mutual information 27

PMI = log( O11 / E11 )

Page 28:

Mutual information 28

I(W; C) = Σ_{i=0,1} Σ_{j=0,1} P̂(W=i, C=j) log [ P̂(W=i, C=j) / ( P̂(W=i) × P̂(C=j) ) ]
        = Σ_{i=0,1; j=0,1} (Oij / N) log [ (N × Oij) / (Oix × Oxj) ]
        = Σ_{i=0,1; j=0,1} (Oij / N) log ( Oij / Eij )
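The formula can be evaluated on the guitar table from the earlier slides (using a base-2 log is an assumption here; another base only rescales the value):

```python
import math

# 2x2 table: O[i][j] = count of (word in context = i, class = j)
O = [[25, 5], [275, 695]]
N = sum(sum(r) for r in O)
row = [sum(r) for r in O]                       # marginals O_ix
col = [O[0][j] + O[1][j] for j in range(2)]     # marginals O_xj

# I(W; C) = sum_ij (O_ij / N) * log( O_ij / E_ij ), with E_ij = row_i * col_j / N
mi = sum((O[i][j] / N) * math.log2(O[i][j] * N / (row[i] * col[j]))
         for i in range(2) for j in range(2))
```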

Page 29:

Multinomial logistic regression

Page 30:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

30

Page 31:

A slight reformulation

We saw that for NB, P(c1|f) > P(c2|f) iff

P(c1) ∏_{j=1..n} P(fj|c1) > P(c2) ∏_{j=1..n} P(fj|c2)

iff

log P(c1) + Σ_{j=1..n} log P(fj|c1) > log P(c2) + Σ_{j=1..n} log P(fj|c2)

This could also be written

( log P(c1) − log P(c2) ) + Σ_{j=1..n} ( log P(fj|c1) − log P(fj|c2) ) > 0

or, equivalently,

log( P(c1) / P(c2) ) + Σ_{j=1..n} log( P(fj|c1) / P(fj|c2) ) > 0

31

Page 32:

Reformulation, contd. 32

This has the form

w ⋅ f = Σ_{i=0..M} wi xi > 0

where

wj = wj¹ − wj²

and, with our earlier weights,

wj¹ = log P(fj|c1) and wj² = log P(fj|c2)

(The probability in this notation:

P(c1|f) = e^{w¹⋅f} / ( e^{w¹⋅f} + e^{w²⋅f} ) = 1 / ( 1 + e^{(w²−w¹)⋅f} ) = 1 / ( 1 + e^{−w⋅f} )

and similarly for P(c2|f).)
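The logistic form of this decision rule can be verified numerically; the weight and feature vectors here are made-up toy values, not from the slides.

```python
import math

# w_j = log P(f_j | c1) - log P(f_j | c2); toy values for illustration
w = [0.5, -1.2, 0.8]
f = [1, 0, 1]          # binary feature vector (f_0 could serve as a bias slot)

dot = sum(wi * fi for wi, fi in zip(w, f))
p_c1 = 1 / (1 + math.exp(-dot))   # P(c1 | f) in the logistic form
# the linear rule (dot > 0) and the probabilistic rule (p_c1 > 0.5) agree
```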

Page 33:

Multinomial logistic regression

We may generalize this to more than two classes. For each class cj, j = 1,…,k, we have a linear expression and the probability of belonging to class cj:

P(cj|f) = (1/Z) exp(wj ⋅ f) = (1/Z) ∏_i e^{wji fi} = (1/Z) ∏_i aji^{fi}

where

aji = e^{wji}

and

Z = Σ_{j=1..k} exp(wj ⋅ f), with wj ⋅ f = Σ_{i=0..M} wji xi

Naive Bayes (Bernoulli) – binary NB as a linear classifier
Logistic regression – multinomial logistic regression

33
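A minimal sketch of the softmax computation above, with made-up weights for k = 3 classes over two features plus a bias slot:

```python
import math

W = [[0.1, 0.5, -0.3],   # w_1 (toy values)
     [0.0, -0.2, 0.4],   # w_2
     [0.2, 0.1, 0.1]]    # w_3
f = [1.0, 2.0, 0.5]      # f[0] = 1 plays the role of the bias input x_0

# P(c_j | f) = exp(w_j . f) / Z, with Z summing over all k classes
scores = [sum(wi * fi for wi, fi in zip(w, f)) for w in W]
Z = sum(math.exp(s) for s in scores)
probs = [math.exp(s) / Z for s in scores]
# probs sums to 1; the predicted class is the argmax over probs
```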

Page 34:

Footnote: Alternative formulation

(In case you read other presentations, like Mitchell or Hastie et al.: they use a slightly different formulation, corresponding to

Z = 1 + Σ_{i=1..k−1} exp(wi ⋅ f)

where for i = 1, 2, …, k−1:

P(ci|f) = (1/Z) exp(wi ⋅ f) = (1/Z) ∏_j e^{wij fj} = (1/Z) ∏_j aij^{fj}

But

P(ck|f) = 1 / ( 1 + Σ_{i=1..k−1} exp(wi ⋅ f) )

The two formulations are equivalent, though: in the J&M formulation, divide the numerator and the denominator in each P(ci|f) by

exp(wk ⋅ f)

and you get this formulation (with adjustments to Z and w).)

34

Page 35:

Indicator variables

Already seen: categorical variables represented by indicator variables, taking the values 0, 1.

It is also usual to let the features indicate both observation and class:

P(cj|f) = (1/Z) exp(wj ⋅ f) = exp(wj ⋅ f) / Σ_{l=1..k} exp(wl ⋅ f)
        = exp( Σ_{i=0..m} wji xi ) / Σ_{l=1..k} exp( Σ_{i=0..m} wli xi )
        = exp( Σ_{i=0..n} wi fi(cj, x) ) / Σ_{l=1..k} exp( Σ_{i=0..n} wi fi(cl, x) )

35

Page 36:

Examples – J&M 36

Page 37:

Why called ”maximum entropy”?

See NLTK book for a further example

37

P(NN)+P(JJ)+P(NNS)+P(VB)=1

P(NN)+P(NNS)=0.8

P(VB)=1/20
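Under these constraints, the maximum-entropy distribution spreads the remaining probability mass as evenly as the constraints allow; a quick check:

```python
# Constraints from the slide:
#   P(NN) + P(JJ) + P(NNS) + P(VB) = 1
#   P(NN) + P(NNS) = 0.8
#   P(VB) = 1/20
p = {"VB": 1 / 20}
p["JJ"] = 1 - 0.8 - p["VB"]   # only JJ is left outside the constrained mass
p["NN"] = p["NNS"] = 0.8 / 2  # maximum entropy splits the 0.8 evenly
```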

Page 38:

Why called ”maximum entropy”?

Multinomial logistic regression yields the probability distribution which gives the maximum entropy, given (the constraints derived from) our training data.

38

Page 39:

Line – Most frequent BoW-features 39

Number of word features   NaiveBayes   SklearnClassifier(LogisticRegression())
   0                      0.528        0.528
  10                      0.528        0.528
  20                      0.534        0.546
  50                      0.576        0.624
 100                      0.688        0.732
 200                      0.706        0.752
 500                      0.744        0.804
1000                      0.774        0.838
2000                      0.802        0.846
5000                      0.826        0.850

Page 40:

Line, binary, ‘product’, BoW-only 40

Number of word features   Bernoulli   Chi-square   SKLearn-LogReg   Chi-square LogReg
   0                      0.528       0.528        0.528            0.528
  10                      0.632       0.786        0.636            0.786
  20                      0.724       0.826        0.738            0.830
  50                      0.816       0.846        0.810            0.866
 100                      0.844       0.862        0.858            0.898
 200                      0.864       0.878        0.888            0.906
 500                      0.864       0.898        0.902            0.924
1000                      0.892       0.906        0.912            0.928
2000                      0.912       0.912        0.924            0.928
5000                      0.918       0.916        0.922            0.928

Page 41:

Comparing and combining classifiers

Page 42:

Today

- More on collocations
  - Repeat some statistics
  - T-test
  - Other tests
  - Differences
- Feature selection
- Multinomial logistic regression (MaxEnt)
- Comparing and combining classifiers

42

Page 43:

Maxent vs Naive Bayes

If the Naive Bayes assumption is warranted – i.e. the features are independent – the two yield the same result in the limit.

Otherwise, Maxent copes better with dependencies between features.

With Maxent you may throw in features and let the model decide whether they are useful

Maxent training is slower

43

Page 44:

Generative vs discriminative model

Generative (e.g. NB): models P(o, c); from it we can derive P(c|o), argmax_c P(c|o), P(o), argmax_o P(o), P(o|c), …

Discriminative (e.g. Maxent): models only P(c|o); we can derive argmax_c P(c|o).

See the NLTK book.

44

Page 45:

More than two classes (in general)

Any-of or multivalue classification:
- An item may belong to 1, 0, or more than 1 class
- Classes are independent
- Use n binary classifiers
- Example: documents

One-of or multinomial classification:
- Each item belongs to exactly one class
- Classes are mutually exclusive
- Example: POS tagging

45

Page 46:

One of classifiers

Many classifiers are built for binary problems.

Simply combining several binary classifiers does not result in a one-of classifier.


46

Page 47:

Combining binary classifiers

Build a classifier for each class compared to its complement.

For a test document, evaluate it for membership in each class.

Assign the document to the class with either:
- maximum probability
- maximum score
- maximum confidence

Multinomial logistic regression is a good example. Sometimes one postpones the decision and proceeds with the probabilities (soft classification), e.g. Maxent tagging.

47
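The "assign to the class with the maximum score" step amounts to a single argmax over the per-class binary outputs; the class names and scores here are made up for illustration.

```python
# Per-class confidence from n one-vs-rest binary classifiers (toy values)
scores = {"sports": 0.35, "politics": 0.85, "culture": 0.60}

# one-of decision: pick the class whose binary classifier is most confident
predicted = max(scores, key=scores.get)
```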