Feature Selection
Advanced Statistical Methods in NLP
Ling 572, January 24, 2012


Page 2: Roadmap

Feature representations:
• Features in attribute-value matrices
• Motivation: text classification

Managing features:
• General approaches
• Feature selection techniques

Feature scoring measures; alternative feature weighting

Chi-squared feature selection

Pages 3–7: Representing Input: Attribute-Value Matrix

             f1 (Currency)   f2 (Country)   …      fm (Date)   Label
  x1 = Doc1       1               1          0.3       0        Spam
  x2 = Doc2       1               1          1.75      1        Spam
  …
  xn = Doc4       0               0          0         2        NotSpam

Choosing features:
• Define features, e.g. with feature templates
• Instantiate features
• Perform dimensionality reduction

Weighting features: increase/decrease feature importance
• Global feature weighting: weight a whole column
• Local feature weighting: weight an individual cell, conditioned on context

Pages 8–12: Feature Selection Example

Task: text classification

Feature template definition: Word – just one template

Feature instantiation: words from the training (and test?) data

Feature selection: stopword removal – remove the top K (~100) highest-frequency words; words like: the, a, have, is, to, for, …

Feature weighting: apply tf*idf feature weighting (tf = term frequency; idf = inverse document frequency); a toy sketch of this pipeline follows below.
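A minimal Python sketch of the selection step (the corpus and K = 3 are illustrative stand-ins; a real setup would use K ≈ 100 on a full corpus):

```python
from collections import Counter

# Toy labeled corpus -- purely illustrative.
docs = [
    ("send us your bank account and currency details", "Spam"),
    ("the meeting is moved to friday", "NotSpam"),
    ("win prizes in foreign currency today", "Spam"),
    ("the agenda for the friday meeting is attached", "NotSpam"),
]

# Feature instantiation: one 'Word' template, instantiated from the data.
tokens = [w for text, _ in docs for w in text.split()]

# Feature selection: stopword removal -- drop the top-K most frequent words.
K = 3
stopwords = {w for w, _ in Counter(tokens).most_common(K)}
vocab = sorted(set(tokens) - stopwords)

print("removed:", stopwords)
print("kept features:", vocab)
```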

Pages 13–19: The Curse of Dimensionality

Think of the instances as vectors of features: # of features = # of dimensions.

The number of features is potentially enormous: the # of words in a corpus continues to increase with corpus size.

High dimensionality is problematic:
• Leads to data sparseness: hard to create a valid model; hard to predict and generalize (think kNN)
• Leads to high computational cost
• Leads to difficulty with estimation/learning: more dimensions → more samples needed to learn the model

Pages 20–23: Breaking the Curse

Dimensionality reduction: produce a representation with fewer dimensions, but with comparable performance.

More formally: given an original feature set r, create a new set r' with |r'| < |r| and comparable performance.

Functionally, many ML algorithms do not scale well:
• Expensive: training and testing cost
• Poor prediction: overfitting, sparseness

Pages 24–26: Dimensionality Reduction

Given an initial feature set r, create a feature set r' s.t. |r'| < |r|.

Approaches:
• r' the same for all classes (aka global) vs. r' different for each class (aka local)
• Feature selection/filtering vs. feature mapping (aka extraction)

Pages 27–30: Feature Selection

Feature selection: r' is a subset of r. How can we pick features?

Extrinsic ‘wrapper’ approaches: for each subset of features, build and evaluate a classifier for some task; pick the subset of features with the best performance (a greedy sketch follows below).

Intrinsic ‘filtering’ methods: use some intrinsic (statistical?) measure; pick the features with the highest scores.
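The exhaustive wrapper search is exponential, so in practice it is approximated. Below is a sketch of greedy forward selection; the `evaluate` function is a hypothetical stand-in for training and testing a classifier on a feature subset:

```python
def forward_selection(features, evaluate, max_size):
    """Greedy approximation of the wrapper approach: instead of trying all
    2^|r| subsets, repeatedly add the single feature that most improves
    held-out performance, stopping when nothing helps."""
    selected, best_score = [], float("-inf")
    while len(selected) < max_size:
        scored = [(evaluate(selected + [f]), f)
                  for f in features if f not in selected]
        if not scored:
            break
        score, best = max(scored)
        if score <= best_score:   # no remaining feature improves performance
            break
        selected.append(best)
        best_score = score
    return selected, best_score

# Dummy evaluator for demonstration: rewards two 'useful' features, with a
# small penalty per feature (a real one would train a classifier).
useful = {"currency", "winner"}
evaluate = lambda subset: len(set(subset) & useful) - 0.01 * len(subset)
print(forward_selection(["the", "currency", "winner", "is"], evaluate, 3))
```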

Pages 31–35: Feature Selection

Wrapper approach:
Pros: easy to understand and implement; clear relationship between the selected features and task performance.
Cons: computationally intractable – 2^|r| subsets × (training + testing); specific to the task and classifier; ad hoc.

Filtering approach:
Pros: theoretical basis; less task- and classifier-specific.
Cons: doesn't always boost task performance.

Pages 36–40: Feature Mapping

Feature mapping (extraction) approaches: the features in r' represent combinations/transformations of the features in r.

Example: many words are near-synonyms, but are treated as unrelated. Map them to a new concept representing all of them: big, large, huge, gigantic, enormous → the concept of ‘bigness’.

Examples:
• Term classes, e.g. class-based n-grams, derived from term clusters
• Dimensions in Latent Semantic Analysis (LSA/LSI): the result of Singular Value Decomposition (SVD) on the term-document matrix; produces the ‘closest’ rank-r' approximation of the original matrix (a sketch follows below)
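A small numpy sketch of the LSA idea (the matrix and k are toy values):

```python
import numpy as np

# Toy term-document count matrix (rows: terms, columns: documents).
A = np.array([[2, 1, 0, 0],    # big
              [1, 2, 0, 0],    # large
              [0, 0, 2, 1],    # meeting
              [0, 0, 1, 2]],   # agenda
             dtype=float)

# SVD: A = U S V^T. Keeping the top-k singular values yields the closest
# rank-k approximation of A, merging correlated terms into one dimension.
U, S, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = (U[:, :k] * S[:k]) @ Vt[:k, :]

# Documents re-represented in the k-dimensional latent space:
doc_vectors = (np.diag(S[:k]) @ Vt[:k, :]).T   # one k-dim row per document
print(np.round(A_k, 2))
print(np.round(doc_vectors, 2))
```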

Pages 41–43: Feature Mapping

Pros:
• Data-driven
• Theoretical basis – guarantees on matrix similarity
• Not bound by the initial feature space

Cons:
• Some ad hoc factors, e.g. the # of dimensions
• The resulting feature space can be hard to interpret

Pages 44–46: Feature Filtering

Filtering approaches: apply some scoring method to the features to rank their informativeness or importance w.r.t. some class.

Fairly fast and classifier-independent.

Many different measures: mutual information, information gain, chi-squared, etc.

Page 47: Feature Scoring Measures

Pages 48–50: Basic Notation, Distributions

Assume a binary representation of terms and classes.

tk: a term in T; ci: a class in C

P(tk): the proportion of documents in which tk appears
P(ci): the proportion of documents of class ci

Binary, so we have P(¬tk) = 1 − P(tk) and P(¬ci) = 1 − P(ci).

Pages 51–57: Setting Up

A 2×2 contingency table of document counts for term tk and class ci:

          ¬ci   ci
   ¬tk     a    b
    tk     c    d

i.e., a = # docs with neither tk nor ci; b = # docs in ci without tk; c = # docs with tk but not ci; d = # docs with both tk and ci.

Pages 58–60: Feature Selection Functions

Question: what makes a good feature?

Perspective: the best features are those that are most DIFFERENTLY distributed across classes.

I.e., the best features are those that most effectively differentiate between the classes.

Pages 61–65: Term Selection Functions: DF

Document frequency (DF): the number of documents in which tk appears.

Applying DF: remove terms with DF below some threshold (a sketch follows below).

Intuition: very rare terms won't help with categorization, or are not useful globally.

Pros: easy to implement, scalable.
Cons: ad hoc; low-DF terms can be ‘topical’, i.e. informative about particular classes.
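A minimal sketch of DF thresholding (corpus and threshold are illustrative):

```python
from collections import Counter

# Hypothetical tokenized corpus: one list of words per document.
docs = [["the", "currency", "rate"], ["the", "meeting"],
        ["currency", "exchange"], ["the", "agenda"]]

# Document frequency: the number of documents containing each term.
df = Counter(term for doc in docs for term in set(doc))

# Feature selection: keep only terms at or above the DF threshold.
threshold = 2
kept = sorted(t for t, n in df.items() if n >= threshold)
print(kept)   # ['currency', 'the']
```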

Pages 66–68: Term Selection Functions: MI

Pointwise mutual information (MI):

    MI(t, c) = log [ P(t, c) / (P(t) P(c)) ]

MI = 0 if t and c are independent.

Issue: MI can be heavily influenced by the marginals, which makes it problematic to compare terms of differing frequencies.
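Computing MI from the 2×2 counts of the contingency table above (the count values are made up):

```python
import math

def pmi(n_a, n_b, n_c, n_d):
    """Pointwise mutual information of term t and class c, from the 2x2
    document counts: n_a=#(!t,!c), n_b=#(!t,c), n_c=#(t,!c), n_d=#(t,c)."""
    N = n_a + n_b + n_c + n_d
    p_tc = n_d / N                 # P(t, c)
    p_t = (n_c + n_d) / N          # P(t), the term's marginal
    p_c = (n_b + n_d) / N          # P(c), the class's marginal
    return math.log2(p_tc / (p_t * p_c))

print(round(pmi(40, 25, 10, 25), 3))   # positive: t and c are associated
print(round(pmi(40, 40, 10, 10), 3))   # 0.0: t and c are independent
```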

Page 69: Term Selection Functions: IG

Information gain (IG). Intuition: transmitting Y, how many bits can we save if we know Xi?

    IG(Y, Xi) = H(Y) − H(Y | Xi)
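A sketch of IG for a binary term and class, again from the 2×2 counts (toy values):

```python
import math

def H(ps):
    """Entropy in bits of a distribution given as probabilities."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def info_gain(n_a, n_b, n_c, n_d):
    """IG(Y, Xi) = H(Y) - H(Y|Xi) for class Y and binary term Xi, from
    the counts n_a=#(!t,!c), n_b=#(!t,c), n_c=#(t,!c), n_d=#(t,c)."""
    N = n_a + n_b + n_c + n_d
    p_c = (n_b + n_d) / N                                   # P(class)
    p_t = (n_c + n_d) / N                                   # P(term)
    h_y = H([p_c, 1 - p_c])
    h_y_t = H([n_d / (n_c + n_d), n_c / (n_c + n_d)])       # H(Y | t)
    h_y_not_t = H([n_b / (n_a + n_b), n_a / (n_a + n_b)])   # H(Y | !t)
    return h_y - (p_t * h_y_t + (1 - p_t) * h_y_not_t)

print(round(info_gain(40, 10, 10, 40), 3))   # 0.278 bits saved
```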

Page 70: Information Gain: Derivation

IG(Y, Xi) = H(Y) − H(Y | Xi)
          = −Σy P(y) log P(y) + Σx P(Xi = x) Σy P(y | x) log P(y | x)

From F. Xia, ‘11

Pages 71–74: More Feature Selection

GSS coefficient:

    GSS(t, c) = P(t, c) P(¬t, ¬c) − P(t, ¬c) P(¬t, c)

NGL coefficient (N: # of docs):

    NGL(t, c) = √N · [P(t, c) P(¬t, ¬c) − P(t, ¬c) P(¬t, c)] / √(P(t) P(¬t) P(c) P(¬c))

Chi-square:

    χ²(t, c) = N · [P(t, c) P(¬t, ¬c) − P(t, ¬c) P(¬t, c)]² / (P(t) P(¬t) P(c) P(¬c))

From F. Xia, ‘11

Pages 75–76: More Term Selection

Relevancy score.

Odds ratio:

    OR(t, c) = [P(t | c) (1 − P(t | ¬c))] / [(1 − P(t | c)) P(t | ¬c)]

From F. Xia, ‘11

Page 77: Global Selection

The previous measures compute class-specific scores. What if you want to filter across ALL classes? Compute an aggregate measure across classes (a sketch follows below):

    Sum:     f_sum(tk) = Σi f(tk, ci)
    Average: f_avg(tk) = Σi P(ci) f(tk, ci)
    Max:     f_max(tk) = maxi f(tk, ci)

From F. Xia, ‘11
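A sketch of the three aggregations over per-class scores (all values hypothetical):

```python
# scores: {term: {class: f(term, class)}}; priors: {class: P(class)}.
scores = {"currency": {"Spam": 0.9, "NotSpam": 0.1},
          "meeting":  {"Spam": 0.2, "NotSpam": 0.6}}
priors = {"Spam": 0.3, "NotSpam": 0.7}

f_sum = {t: sum(s.values()) for t, s in scores.items()}               # Σi f(t, ci)
f_avg = {t: sum(priors[c] * v for c, v in s.items())                  # Σi P(ci) f(t, ci)
         for t, s in scores.items()}
f_max = {t: max(s.values()) for t, s in scores.items()}               # maxi f(t, ci)
print(f_sum, f_avg, f_max, sep="\n")
```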

Page 78: What's the Best?

Answer: it depends on the classifier, the type of data, …

According to (Yang and Pedersen, 1997): {OR, NGL, GSS} > {χ²max, IGsum} > {#avg} >> {MI}, on text classification tasks, using kNN.

From F. Xia, ‘11

Page 79: Feature Weighting

For text classification, typical weights include:
• Binary: weights in {0, 1}
• Term frequency (tf): # of occurrences of tk in document di
• Inverse document frequency (idf): with dfk = # of docs in which tk appears and N = # of docs,

    idf = log(N / (1 + dfk))

• tfidf = tf * idf (a sketch follows below)
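A short sketch of these weights using the slide's idf definition (the corpus is a toy):

```python
import math
from collections import Counter

# Hypothetical tokenized corpus.
docs = [["currency", "rate", "currency"], ["meeting", "agenda"],
        ["currency", "exchange"], ["agenda", "item"]]
N = len(docs)

# df_k: the number of documents in which term k appears.
df = Counter(t for doc in docs for t in set(doc))

def tfidf(doc):
    """tf*idf weights for one document, with idf = log(N / (1 + df_k))."""
    tf = Counter(doc)   # raw term frequencies within this document
    return {t: tf[t] * math.log(N / (1 + df[t])) for t in tf}

print(tfidf(docs[0]))   # 'currency' has tf=2 but df=2; 'rate' has tf=1, df=1
```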

Page 80: Chi-Square

Tests for the presence/absence of a relation between random variables: a bivariate analysis testing 2 random variables.

Can test the strength of the relationship; (strictly speaking) doesn't test its direction.

Page 81: Chi-Square Example

Can gender predict shoe choice?
A: male/female (features); B: shoe choice (classes: {sandal, sneaker, …})

            sandal   sneaker   leather shoe   boot   other
    Male       6        17          13          9      5
    Female    13         5           7         16      9

Due to F. Xia

Pages 82–83: Comparing Distributions

Observed distribution (O):

            sandal   sneaker   leather shoe   boot   other
    Male       6        17          13          9      5
    Female    13         5           7         16      9

Expected distribution (E):

            sandal   sneaker   leather shoe   boot   other   Total
    Male      9.5       11          10         12.5     7      50
    Female    9.5       11          10         12.5     7      50
    Total    19         22          20         25      14     100

Due to F. Xia

Page 84: Computing Chi-Square

Expected value for a cell = row_total × column_total / table_total

    X² = Σ (O − E)² / E = (6 − 9.5)²/9.5 + (17 − 11)²/11 + … = 14.027

Page 85: Calculating X²

• Tabulate the contingency table of observed values, O
• Compute the row and column totals
• Compute the table of expected values given the row/column totals, assuming no association
• Compute X² (a sketch follows below)
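The whole procedure in a few lines with scipy (assuming scipy is installed); this reproduces the shoe example:

```python
from scipy.stats import chi2_contingency

# Observed shoe-choice counts from the slides (rows: male, female).
O = [[6, 17, 13, 9, 5],
     [13, 5, 7, 16, 9]]

# correction=False gives the plain Pearson chi-square (no Yates correction).
chi2, p, dof, expected = chi2_contingency(O, correction=False)
print(round(chi2, 3))   # 14.027, as computed on the slide
print(dof)              # (2-1) * (5-1) = 4
print(round(p, 4))      # if below the significance level, reject independence
print(expected)         # the E table: each row is 9.5, 11, 10, 12.5, 7
```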

Page 86: For a 2×2 Table

O:

           ¬ci   ci
   ¬tk      a    b
    tk      c    d

E:

           ¬ci              ci              Total
   ¬tk     (a+b)(a+c)/N     (a+b)(b+d)/N    a+b
    tk     (c+d)(a+c)/N     (c+d)(b+d)/N    c+d
   Total   a+c              b+d             N
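For the 2×2 case, summing (O − E)²/E over the four cells collapses to a closed form; a sketch, with the cell labels from the table above (counts are toy values):

```python
def chi2_2x2(a, b, c, d):
    """Pearson chi-square for a 2x2 table, with a=#(!t,!c), b=#(!t,c),
    c=#(t,!c), d=#(t,c). Algebraically equal to summing (O-E)^2 / E
    over the four cells."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

print(chi2_2x2(40, 10, 10, 40))   # 36.0: strong association
print(chi2_2x2(40, 40, 10, 10))   # 0.0: ad = bc, so t and c look independent
```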

Page 87: X² Test

Tests whether random variables are independent.
Null hypothesis: the 2 R.V.s are independent.

• Compute the X² statistic
• Compute the degrees of freedom: df = (# rows − 1)(# cols − 1); shoe example: df = (2 − 1)(5 − 1) = 4
• Look up the probability of the X² statistic value in a X² table
• If the probability is low – below some significance level – we can reject the null hypothesis

Page 88: Requirements for the X² Test

• Events are assumed independent and identically distributed
• Outcomes must be mutually exclusive
• Use raw frequencies, not percentages
• Sufficient expected values per cell: > 5