
Page 1: SIMS 290-2: Applied Natural Language Processing

Marti Hearst
Oct 23, 2006

(Slides developed by Preslav Nakov)

Page 2: Today

Feature selection
TF.IDF Term Weighting
Weka Input File Format

Page 3: Features for Text Categorization

Linguistic features
 Words
 – lowercase? (should we convert?)
 – normalized? (e.g. "texts" -> "text")
 Phrases
 Word-level n-grams
 Character-level n-grams
 Punctuation
 Part of speech

Non-linguistic features
 Document formatting
 Informative character sequences (e.g. &lt;)

Page 4: When Do We Need Feature Selection?

If the algorithm cannot handle all possible features
 – e.g. language identification for 100 languages using all words
 – text classification using n-grams

Good features can result in higher accuracy
What if we just keep all features?
 – Even the unreliable features can be helpful.
 – But we need to weight them:
   In the extreme case, the bad features can have a weight of 0 (or very close), which is… a form of feature selection!

Page 5: Why Feature Selection?

Not all features are equally good!

Bad features: best to remove
 – Infrequent
   unlikely to be seen again
   co-occurrence with a class can be due to chance
 – Too frequent
   mostly function words
 – Uniform across all categories

Good features: should be kept
 – Co-occur with a particular category
 – Do not co-occur with other categories

The rest: good to keep

Page 6: Types of Feature Selection

Feature selection reduces the number of features. Usually by:
 Eliminating features
 Weighting features
 Normalizing features

Sometimes by transforming parameters
 e.g. Latent Semantic Indexing using Singular Value Decomposition

Method may depend on problem type
 For classification and filtering, may want to use information from example documents to guide selection.

Page 7: Feature Selection

Task-independent methods
 Document Frequency (DF)
 Term Strength (TS)

Task-dependent methods
 Information Gain (IG)
 Mutual Information (MI)
 χ² statistic (CHI)

Empirically compared by Yang & Pedersen (1997)

Page 8: Yang & Pedersen Experiments

Compared feature selection methods for text categorization

5 feature selection methods:
 – DF, MI, CHI (plus IG, TS)
 – Features were just words, not phrases

2 classifiers:
 – kNN: k-Nearest Neighbor
 – LLSF: Linear Least Squares Fit

2 data collections:
 – Reuters-22173
 – OHSUMED: subset of MEDLINE (1990 & 1991 used)

Page 9: Document Frequency (DF)

DF: number of documents a term appears in

Based on Zipf's Law
Remove the rare terms (seen 1-2 times)
 Spurious
 Unreliable – can be just noise
 Unlikely to appear in new documents

Plus
 Easy to compute
 Task independent: do not need to know the classes

Minus
 Ad hoc criterion
 For some applications, rare terms can be good discriminators (e.g., in IR)
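As a concrete illustration of DF-based filtering, a minimal Python sketch (assuming documents are already tokenized; the function name is my own):

from collections import Counter

def df_filter(tokenized_docs, min_df=3):
    """Keep only terms whose document frequency (DF) is at least min_df."""
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                      # count each term once per document
    return {term for term, n in df.items() if n >= min_df}

docs = [["the", "jaguar", "runs"], ["the", "car", "runs"], ["the", "car", "stops"]]
print(df_filter(docs, min_df=2))                 # {'the', 'runs', 'car'}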

Page 10: Stop Word Removal

Common words from a predefined list
Mostly from closed-class categories:
 – unlikely to have a new word added
 – include: auxiliaries, conjunctions, determiners, prepositions, pronouns, articles
But also some open-class words like numerals

Bad discriminators
 uniformly spread across all classes
 can be safely removed from the vocabulary
 – Is this always a good idea? (e.g. author identification)
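A minimal sketch of list-based removal (the stop word list below is a tiny illustrative sample, not a standard list):

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "in", "to", "is", "are"}

def remove_stop_words(tokens):
    """Drop tokens that appear in the predefined stop word list."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "jaguar", "is", "a", "cat"]))   # ['jaguar', 'cat']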

Page 11: χ² Statistic (CHI)

χ² statistic (pronounced "kai square")
A commonly used method of comparing proportions.

Measures the lack of independence between a term and a category (Yang & Pedersen)

Page 12: χ² Statistic (CHI)

Is "jaguar" a good predictor for the "auto" class?

              Term = jaguar   Term ≠ jaguar
Class = auto        2              500
Class ≠ auto        3             9500

We want to compare:
 the observed distribution above; and
 the null hypothesis: that jaguar and auto are independent

Page 13: χ² Statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

If independent: Pr(j,a) = Pr(j) Pr(a)

So there would be: N Pr(j,a), i.e. N Pr(j) Pr(a)

 Pr(j) = (2+3)/N
 Pr(a) = (2+500)/N
 N = 2+3+500+9500 = 10005

which is N (5/N) (502/N) = 2510/N = 2510/10005 ≈ 0.25

              Term = jaguar   Term ≠ jaguar
Class = auto        2              500
Class ≠ auto        3             9500

Page 14: χ² Statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

              Term = jaguar   Term ≠ jaguar
Class = auto     2 (0.25)          500
Class ≠ auto        3             9500

observed: fo   expected: fe (in parentheses)

Page 15: χ² Statistic (CHI)

Under the null hypothesis (jaguar and auto independent): how many co-occurrences of jaguar and auto do we expect?

              Term = jaguar    Term ≠ jaguar
Class = auto     2 (0.25)        500 (502)
Class ≠ auto     3 (4.75)       9500 (9498)

observed: fo   expected: fe (in parentheses)

Page 16: χ² Statistic (CHI)

χ² is interested in (fo – fe)²/fe, summed over all table entries:

 χ²(j,a) = Σ (O – E)²/E
         = (2 – 0.25)²/0.25 + (3 – 4.75)²/4.75 + (500 – 502)²/502 + (9500 – 9498)²/9498
         ≈ 12.9    (p < .001)

The null hypothesis is rejected with confidence .999, since 12.9 > 10.83 (the value for .999 confidence).

              Term = jaguar    Term ≠ jaguar
Class = auto     2 (0.25)        500 (502)
Class ≠ auto     3 (4.75)       9500 (9498)

observed: fo   expected: fe (in parentheses)
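To make the arithmetic concrete, a small Python check of the example above (illustration only; the observed and expected counts are taken from the table, with expected values rounded as on the slide):

# (observed fo, expected fe) for the four cells of the jaguar/auto table
cells = [(2, 0.25), (3, 4.75), (500, 502.0), (9500, 9498.0)]

chi2 = sum((fo - fe) ** 2 / fe for fo, fe in cells)
print(round(chi2, 1))   # 12.9, which exceeds 10.83, the .999-confidence critical value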

Page 17: χ² Statistic (CHI)

There is a simpler formula for χ²:

 N = A + B + C + D
 A = #(t,c)     C = #(¬t,c)
 B = #(t,¬c)    D = #(¬t,¬c)
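The simpler formula, in the form given by Yang & Pedersen (1997), is:

 χ²(t,c) = N (AD – CB)² / [ (A+C) (B+D) (A+B) (C+D) ]

Plugging in the jaguar/auto table (A=2, B=3, C=500, D=9500, N=10005) gives about 12.8; the 12.9 on the previous slide differs only because the expected counts there were rounded.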

Page 18: χ² Statistic (CHI)

How to use χ² for multiple categories?

Compute χ² for each category and then combine:
 To require a feature to discriminate well across all categories, take the expected value of χ².
 Or, to weight for a single category, take the maximum.
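Written out (following Yang & Pedersen, 1997), the two combinations are:

 χ²_avg(t) = Σi Pr(ci) χ²(t, ci)

 χ²_max(t) = max over i of χ²(t, ci)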

Page 19: χ² Statistic (CHI)

Pluses
 normalized and thus comparable across terms
 χ²(t,c) is 0 when t and c are independent
 can be compared to the χ² distribution, 1 degree of freedom

Minuses
 unreliable for low-frequency terms

Page 20: Information Gain

A measure of the importance of the feature for predicting the presence of the class.

Has an information-theoretic justification

Defined as: the number of "bits of information" gained by knowing the term is present or absent
 – Based on Information Theory
 – We won't go into this in detail here.

Page 21: Information Gain (IG)

IG: number of bits of information gained by knowing the term is present or absent

t is the term being scored, ci is a class variable

 entropy: H(c)
 specific conditional entropy: H(c|t)
 specific conditional entropy: H(c|¬t)
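In terms of these quantities, IG can be written as:

 IG(t) = H(c) – [ Pr(t) H(c|t) + Pr(¬t) H(c|¬t) ]

where H(c) = – Σi Pr(ci) log Pr(ci), and the conditional entropies are defined analogously over Pr(ci|t) and Pr(ci|¬t). This matches the information gain criterion used by Yang & Pedersen (1997).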

Page 22: Mutual Information (MI)

The probability of seeing x and y together
vs.
the probability of seeing x anywhere times the probability of seeing y anywhere (independently).

MI = log ( P(x,y) / P(x)P(y) ) = log(P(x,y)) – log(P(x)P(y))

From the definition of conditional probability: P(x,y) = P(x|y)P(y)

   = log(P(x|y)P(y)) – log(P(x)P(y))
   = log(P(x|y)) + log(P(y)) – log(P(x)) – log(P(y))

MI = log(P(x|y)) – log(P(x))

Page 23: Mutual Information (MI)

Approximation (written out below) in terms of the 2x2 counts:

 A = #(t,c)     C = #(¬t,c)
 B = #(t,¬c)    D = #(¬t,¬c)
 N = A + B + C + D

 rare terms get higher scores
 does not use term absence
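The approximation referred to above, using these counts to estimate the probabilities (Pr(t,c) ≈ A/N, Pr(t) ≈ (A+B)/N, Pr(c) ≈ (A+C)/N), is:

 I(t,c) = log [ Pr(t,c) / (Pr(t) Pr(c)) ] ≈ log [ A N / ((A+B)(A+C)) ]

Only the joint presence count A drives the score, so nothing is learned from the term's absence, and a small term marginal A+B in the denominator inflates the score of rare terms, which are the two weaknesses listed above.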

Page 24: Using Mutual Information

Compute MI for each category and then combine:
 If we want to discriminate well across all categories, take the expected value of MI.
 To discriminate well for a single category, take the maximum.
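Written out (again following Yang & Pedersen, 1997):

 I_avg(t) = Σi Pr(ci) I(t, ci)

 I_max(t) = max over i of I(t, ci)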

Page 25: Mutual Information

Pluses
 I(t,c) is 0 when t and c are independent
 Has a sound information-theoretic interpretation

Minuses
 Small numbers produce unreliable results
 Does not use term absence

Page 26: Results from Yang & Pedersen '97

[Figure from Yang & Pedersen '97 comparing feature selection methods: CHI max, IG, and DF versus mutual information and term strength.]

Page 27: Feature Comparison

DF, IG and CHI are good and strongly correlated
 thus using DF is good, cheap and task-independent
 it can be used when IG and CHI are too expensive

MI is bad
 favors rare terms (which are typically bad)

Page 28: Term Weighting

In the study just shown, terms were (mainly) treated as binary features
 If a term occurred in a document, it was assigned 1
 Else 0

Often it is useful to weight the selected features
Standard technique: tf.idf

Page 29: TF.IDF Term Weighting

TF: term frequency
 definition: TF = tij – frequency of term i in document j
 purpose: makes the frequent words in the document more important

IDF: inverse document frequency
 definition: IDF = log(N/ni)
 – ni : number of documents containing term i
 – N : total number of documents
 purpose: makes rare words across documents more important

TF.IDF (for term i in document j)
 definition: tij log(N/ni)
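A minimal Python sketch of these definitions (illustration only; the function and variable names are my own):

import math
from collections import Counter

def tfidf(tokenized_docs):
    """For each document, return a dict mapping term -> tf.idf weight,
    with TF = raw count in the document and IDF = log(N / n_i)."""
    N = len(tokenized_docs)
    df = Counter()
    for doc in tokenized_docs:
        df.update(set(doc))                      # document frequency n_i
    weights = []
    for doc in tokenized_docs:
        tf = Counter(doc)                        # term frequency t_ij
        weights.append({t: tf[t] * math.log(N / df[t]) for t in tf})
    return weights

docs = [["jaguar", "car", "car"], ["car", "engine"], ["jaguar", "cat"]]
print(tfidf(docs)[0])   # "car" gets weight 2*log(3/2); "jaguar" gets 1*log(3/2)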

Page 30: Term Normalization

Combine different words into a single representation

Stemming/morphological analysis
 – bought, buy, buys -> buy

General word categories
 – $23.45, 5.30 Yen -> MONEY
 – 1984, 10,000 -> DATE, NUM
 – PERSON
 – ORGANIZATION
 (Covered in Information Extraction segment)

Generalize with lexical hierarchies
 – WordNet, MeSH
 (Covered later in the semester)

Page 31: What Do People Do In Practice?

1. Feature selection
 infrequent term removal
  infrequent across the whole collection (i.e. DF)
  seen in a single document
 most frequent term removal (i.e. stop words)

2. Normalization:
 1. Stemming (often)
 2. Word classes (sometimes)

3. Feature weighting: TF.IDF or IDF

4. Dimensionality reduction (sometimes)
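For comparison, an off-the-shelf way to get much of this pipeline: a sketch using scikit-learn, which is not covered in these slides (note that its idf is a smoothed variant of log(N/ni)):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["The jaguar is a big cat.", "The car has a big engine.", "Jaguar makes cars."]

# Lowercasing, English stop word removal, DF-based term filtering (min_df),
# and tf.idf weighting in one step; the result is a sparse document-term matrix.
vectorizer = TfidfVectorizer(lowercase=True, stop_words="english", min_df=1)
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())
print(X.shape)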

Page 32: Weka

Java-based tool for large-scale machine-learning problems
Tailored towards text analysis
http://weka.sourceforge.net/wekadoc/

Page 33: Weka Input Format

Expects a particular input file format
 Called ARFF: Attribute-Relation File Format
 Consists of a Header and a Data section

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)

Page 34: WEKA File Format: ARFF

(Slide adapted from Eibe Frank's)

@relation heart-disease-simplified

@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}

@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...

Notes:
 numeric attributes (e.g. age), nominal attributes (e.g. sex)
 "?" marks a missing value
 Other attribute types: String, Date

http://weka.sourceforge.net/wekadoc/index.php/en:ARFF_(3.4.6)

Page 35: WEKA Sparse File Format

Value 0 is not represented explicitly
Same header (i.e. @relation and @attribute tags)
Only the @data section is different

Instead of

@data
0, X, 0, Y, "class A"
0, 0, W, 0, "class B"

we have

@data
{1 X, 3 Y, 4 "class A"}
{2 W, 4 "class B"}

This saves LOTS of space for text applications. Why?
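Why: a bag-of-words instance has one attribute per vocabulary term, and almost all of them are 0 for any given document, so listing only the non-zero (index value) pairs is far shorter. A tiny Python sketch of the encoding (the helper is hypothetical, not part of Weka):

def to_sparse_arff_line(values):
    """Encode one instance as a sparse ARFF data line: only non-zero
    attribute values are written, as '{index value, ...}'."""
    pairs = [f"{i} {v}" for i, v in enumerate(values) if v != 0]
    return "{" + ", ".join(pairs) + "}"

# A bag-of-words vector over a 10-term vocabulary, plus a class label.
instance = [0, 3, 0, 0, 1, 0, 0, 0, 0, 0, "spam"]
print(to_sparse_arff_line(instance))   # {1 3, 4 1, 10 spam}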

Page 36: Next Time

Wed: Guest lecture by Peter Jackson
 Pure and Applied Research in NLP: The Good, the Bad, and the Lucky

Following week:
 Text Categorization Algorithms
 How to use Weka