feature engineering - university of colorado boulder ...jbg/teaching/data_digging/...bamboo...

75
Feature Engineering Digging into Data: Jordan Boyd-Graber University of Maryland April 7, 2014 Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 1 / 47

Upload: buikiet

Post on 07-Mar-2018

215 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Feature Engineering

Digging into Data: Jordan Boyd-Graber

University of Maryland

April 7, 2014

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 1 / 47

Page 2: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Roadmap

Getting good labels

Feature engineering… Quiz Bowl Dataset… TV Tropes Dataset

How to split your dataset

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 2 / 47

Page 3: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 3 / 47

Page 4: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Where do labeled data come from?

For supervised classification, we’ve assumed that our data are alreadyavailable

Not always the case

This comes from annotation

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 4 / 47

Page 5: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Examples of annotation

Whether an e-mail is spam or not

Whether a document is relevant to a court case (e-Discovery)

Which meaning the noun “break” has… A time where you’re not working… A stroke of luck… A fracture or other discontinuity… A change in how things are done

Whether an image has a van or not

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 5 / 47

Page 6: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Why do we annotate?

We manually annotate texts for several reasons

to understand the nature of text (e.g., what % of sentences in news articlesare opinions?)

to establish the level of human performance (e.g., how well can peopleassign POS tags?)

to evaluate a computer model for some phenomenon (e.g., how often doesmy tagger or parser Þnd the correct answer?)

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 6 / 47

Page 7: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

The process of annotation

Develop a set of annotations

Define each of the annotations

Have annotations annotate the same data

See if they agree (more on this later)… If not, go back to Step 1… Why not?

∆ Bad annotators?∆ Bad definitions?∆ Unexpected data?

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 7 / 47

Page 8: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Who does the annotation?

Undergrads

Grad students

Crowdsourcing… Scammers… Diverse population

∆ Worldwide∆ Bored office workers∆ Individuals at home

… Equity issues

Users… Reviews… Blog categories… Metadata… Often noisy

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 8 / 47

Page 9: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Why is it important to have agreement?

Think about what happens to a classifier if it has inconsistent data (samedata, different annotations)

… For an SVM: there’s separating hyperplane… For a decision tree: decreases information gain of all the features

Your classifier is only as good as the data it gets

If your annotators only agree on 40% of the data, your accuracy will be lessthan 40%

Common problem: disagreement is undetected because each item is onlyannotated once

Resulting complaint: machine learning sucks

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 9 / 47

Page 10: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Why is it important to have agreement?

Think about what happens to a classifier if it has inconsistent data (samedata, different annotations)… For an SVM: there’s separating hyperplane… For a decision tree: decreases information gain of all the features

Your classifier is only as good as the data it gets

If your annotators only agree on 40% of the data, your accuracy will be lessthan 40%

Common problem: disagreement is undetected because each item is onlyannotated once

Resulting complaint: machine learning sucks

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 9 / 47

Page 11: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Annotation Tools

WordFreak (for text)

LabelMe (for images)

OpenAnnotation (an XML framework)

Bamboo (visualization and annotation for humanists)

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 10 / 47

Page 12: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 11 / 47

Page 13: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

What does agreement mean?

Simple answer: how often do two annotators give the same answer

More complicated: above, adjusting for chance agreement

Most important for class-imbalanced data

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 12 / 47

Page 14: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Computing Agreement

=P

a

�P

c

1�P

c

(1)

P

a

: Probability of coders agreeing

P

c

: Probability of coders agreeing by chance

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 13 / 47

Page 15: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 16: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 17: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Probability of agreement

P

a

= 15+2050 = 0.7

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 18: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Probability of agreement

P

a

= 15+2050 = 0.7

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 19: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Probability of agreement

P

a

= 15+2050 = 0.7

Chance agreementA says yes with probability .5

B says yes with probability .6

The probability that both of them say yes (assuming independence) is .3; theprobability both say no is .2. The probability of chance agreement is thenP

c

= 0.2+0.3.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 20: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Probability of agreement

P

a

= 15+2050 = 0.7

Chance agreementA says yes with probability .5

B says yes with probability .6

The probability that both of them say yes (assuming independence) is .3; theprobability both say no is .2. The probability of chance agreement is thenP

c

= 0.2+0.3.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 21: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Probability of agreement

P

a

= 15+2050 = 0.7

Chance agreementA says yes with probability .5

B says yes with probability .6

The probability that both of them say yes (assuming independence) is .3; theprobability both say no is .2. The probability of chance agreement is thenP

c

= 0.2+0.3.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 22: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Agreement example

Annotator BAnnotator A Y N

Y 20 5 25N 10 15 25

30 20

Agreement:

=.7� .51� .5

= .4 (2)

Typically, you want above 0.7 agreement.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 14 / 47

Page 23: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 15 / 47

Page 24: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Humans doing Incremental Classification

Game called “quiz bowl”

Two teams play each other… Moderator reads a question… When a team knows the

answer, they signal (“buzz” in)… If right, they get points;

otherwise, rest of the questionis read to the other team

Hundreds of teams in the US alone

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 16 / 47

Page 25: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Humans doing Incremental Classification

Game called “quiz bowl”

Two teams play each other… Moderator reads a question… When a team knows the

answer, they signal (“buzz” in)… If right, they get points;

otherwise, rest of the questionis read to the other team

Hundreds of teams in the US alone

Example . . .

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 16 / 47

Page 26: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 27: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 28: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of heat capacity, so

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 29: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of heat capacity, so Debye adjusted the theory for low temperatures. Hissummation convention automatically sums repeated indices in tensor products.His name is attached to the A and B coefficients

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 30: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of heat capacity, so Debye adjusted the theory for low temperatures. Hissummation convention automatically sums repeated indices in tensor products.His name is attached to the A and B coefficients for spontaneous and stimulatedemission, the subject of one of his multiple groundbreaking 1905 papers. Hefurther developed the model of statistics sent to him by

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 31: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of heat capacity, so Debye adjusted the theory for low temperatures. Hissummation convention automatically sums repeated indices in tensor products.His name is attached to the A and B coefficients for spontaneous and stimulatedemission, the subject of one of his multiple groundbreaking 1905 papers. Hefurther developed the model of statistics sent to him by Bose to describe particleswith integer spin. For 10 points, who is this German physicist best known forformulating the

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 32: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of heat capacity, so Debye adjusted the theory for low temperatures. Hissummation convention automatically sums repeated indices in tensor products.His name is attached to the A and B coefficients for spontaneous and stimulatedemission, the subject of one of his multiple groundbreaking 1905 papers. Hefurther developed the model of statistics sent to him by Bose to describe particleswith integer spin. For 10 points, who is this German physicist best known forformulating the special and general theories of relativity?

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 33: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with no movingparts. He did not take interaction with neighbors into account when formulating histheory of heat capacity, so Debye adjusted the theory for low temperatures. Hissummation convention automatically sums repeated indices in tensor products.His name is attached to the A and B coefficients for spontaneous and stimulatedemission, the subject of one of his multiple groundbreaking 1905 papers. Hefurther developed the model of statistics sent to him by Bose to describe particleswith integer spin. For 10 points, who is this German physicist best known forformulating the special and general theories of relativity?

Albert Einstein

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 17 / 47

Page 34: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Humans doing Incremental Classification

This is not Jeopardy

There are buzzers, but players canonly buzz at the end of a question

Doesn’t discriminate knowledge

Quiz bowl questions are pyramidal

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 18 / 47

Page 35: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Research Question: How do we know if a guess is correct?

Turn (question, guess) into features

Treat it as a binary classification problem

What features help us do this well?

Subject of HW3

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 19 / 47

Page 36: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Research Question: How do we know if a guess is correct?

Turn (question, guess) into features

Treat it as a binary classification problem

What features help us do this well?

Subject of HW3

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 19 / 47

Page 37: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Provided Dataset

text: the clues revealed so far

page: a guess at the answer

answer: the actual answer (closest Wikipedia page)

body_score: IR measure of how good a match the text is

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 20 / 47

Page 38: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 21 / 47

Page 39: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Baseline

What if we always say that the answer is wrong?

Performance: 0.54

Every feature should do better than this (otherwise, it’s useless)

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 22 / 47

Page 40: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Page Name

The title of wikipedia pages often have disambiguation in parentheses1 Paris (mythology)2 Paris (song)3 Paris (genus)4 Paris (band)

Feature is 1 if the page has disambiguator in the text… “This band performed . . . ”, Paris (band)! True

… “This band performed . . . ”, Paris (mythology)! False

Slight improvement: 0.58

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 23 / 47

Page 41: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Page Name

The title of wikipedia pages often have disambiguation in parentheses1 Paris (mythology)2 Paris (song)3 Paris (genus)4 Paris (band)

Feature is 1 if the page has disambiguator in the text… “This band performed . . . ”, Paris (band)! True

… “This band performed . . . ”, Paris (mythology)! False

Slight improvement: 0.58

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 23 / 47

Page 42: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Links

The more morelinks a Wikipediapage has, the morepopular it is

Popularity is oftena sign of a wronganswer

By itself, doesn’t doso well: 0.56

But improves if wetake the log of thevalue: 0.61

0.0

0.1

0.2

0.3

0.4

−2 0 2 4log_links

density

corr

False

True

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 24 / 47

Page 43: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Score

We can see howsimilar the text of aWikipedia page is

Higher, the better

This feature alonegives accuracy of0.75

0.000

0.005

0.010

0.015

0.020

0 100 200 300body_score

density

corr

False

True

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 25 / 47

Page 44: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Length

The more text wesee, the moreconfident weshould be

By itself, doesn’t doso well: 0.56

But whencombined with theIR score, doesgreat: 0.82 (bestso far)

0.0000

0.0005

0.0010

0.0015

0 200 400 600 800 1000 1200obs_len

density

corr

False

True

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 26 / 47

Page 45: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Others . . .

Tournament the question was used in

The type of thing the answer is

Try your own, be creative!

Last year’s feature engineering assignment

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 27 / 47

Page 46: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 28 / 47

Page 47: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 29 / 47

Page 48: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

TV Tropes

Social media site

Catalog of “tropes”

Functionally like Wikipedia, but . . .… Less formal… No notability requirement… Focused on popular culture

Absent-Minded Professor“Doc” Emmett Brown from Back to

the Future.

The drunk mathematician inStrangers on a Train becomes aplot point, because of hisforgetfulness, Guy is suspected of amurder he didn’t commit.

The Muppet Show: Dr. BunsenHoneydew.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 30 / 47

Page 49: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Spoilers

What makes neat is that the dataset is annotated by users for spoilers.

A spoiler: “A published piece of information that divulges a surprise, such asa plot twist in a movie.”

Spoiler

Han Solo arriving just in time to saveLuke from Vader and buy Luke thevital seconds needed to send theproton torpedos into the Death Star’sthermal exhaust port.

Leia, after finding out that despite her(feigned) cooperation, Tarkin intendsto destroy Alderaan anyway.

Luke rushes to the farm, only to find italready raided and his relatives deadharkens to an equally distressingscene in The Searchers.

Not a spoiler

Diving into the garbage chute getsthem out of the firefight, but the droidshave to save them from the compacter.

They do some pretty evil things withthat Death Star, but we never hearmuch of how they affect the rest of theGalaxy. A deleted scene betweenLuke and Biggs explores thissomewhat.

Luke enters Leia’s cell in aStormtrooper uniform, and she calmlystarts some banter.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 31 / 47

Page 50: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

The dataset

Downloaded the pages associated with a show. Took complete sentencesfrom the text and split them into ones with spoilers and those without

Created a balanced dataset (50% spoilers, 50% not)

Split into training, development, and test shows

… Why is this important?

I’ll show results using SVM; similar results apply to other classifiers

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 32 / 47

Page 51: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

The dataset

Downloaded the pages associated with a show. Took complete sentencesfrom the text and split them into ones with spoilers and those without

Created a balanced dataset (50% spoilers, 50% not)

Split into training, development, and test shows… Why is this important?

I’ll show results using SVM; similar results apply to other classifiers

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 32 / 47

Page 52: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

The dataset

Downloaded the pages associated with a show. Took complete sentencesfrom the text and split them into ones with spoilers and those without

Created a balanced dataset (50% spoilers, 50% not)

Split into training, development, and test shows… Why is this important?

I’ll show results using SVM; similar results apply to other classifiers

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 32 / 47

Page 53: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 33 / 47

Page 54: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 1: The obvious

Take every sentence, and split on on-characters.

Input: “These aren’t the droids you’re looking for.”

FeaturesThese:1 aren:1 t:1 the:1 droids:1you:1 re:1 looking:1 for:1

False TrueFalse 56 34True 583 605

Accuracy: 0.517What’s wrong with this?

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 34 / 47

Page 55: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 1: The obvious

Take every sentence, and split on on-characters.

Input: “These aren’t the droids you’re looking for.”

FeaturesThese:1 aren:1 t:1 the:1 droids:1you:1 re:1 looking:1 for:1

False TrueFalse 56 34True 583 605

Accuracy: 0.517

What’s wrong with this?

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 34 / 47

Page 56: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 1: The obvious

Take every sentence, and split on on-characters.

Input: “These aren’t the droids you’re looking for.”

FeaturesThese:1 aren:1 t:1 the:1 droids:1you:1 re:1 looking:1 for:1

False TrueFalse 56 34True 583 605

Accuracy: 0.517What’s wrong with this?

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 34 / 47

Page 57: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 2: Normalization

Normalize the words… Lowercase everything… Stem the words (not always a good idea!)

Input: “These aren’t the droids you’re looking for.”

Featuresthese:1 are:1 t:1 the:1 droid:1you:1 re:1 look:1 for:1

False TrueFalse 52 27True 587 612

Accuracy: 0.520

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 35 / 47

Page 58: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 2: Normalization

Normalize the words… Lowercase everything… Stem the words (not always a good idea!)

Input: “These aren’t the droids you’re looking for.”

Featuresthese:1 are:1 t:1 the:1 droid:1you:1 re:1 look:1 for:1

False TrueFalse 52 27True 587 612

Accuracy: 0.520

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 35 / 47

Page 59: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 3: Remove Usless Features

Use a “stoplist”

Remove features that appear in > 10% of observations (and aren’t correlatedwith label)

Input: “These aren’t the droids you’re looking for.”

Featuresdroid:1 look:1

False TrueFalse 59 20True 578 621

Accuracy: 0.532

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 36 / 47

Page 60: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 3: Remove Usless Features

Use a “stoplist”

Remove features that appear in > 10% of observations (and aren’t correlatedwith label)

Input: “These aren’t the droids you’re looking for.”

Featuresdroid:1 look:1

False TrueFalse 59 20True 578 621

Accuracy: 0.532

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 36 / 47

Page 61: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 4: Add Useful Features

Use bigrams (“these_are”) instead of unigrams (“these”, “are”)

Creates a lot of features!

Input: “These aren’t the droids you’re looking for.”

Featuresthese_are:1 aren_t:1 t_the:1the_droids:1 you_re:1 re_looking:1looking_for:1

False TrueFalse 203 104True 436 535

Accuracy: 0.578

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 37 / 47

Page 62: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 4: Add Useful Features

Use bigrams (“these_are”) instead of unigrams (“these”, “are”)

Creates a lot of features!

Input: “These aren’t the droids you’re looking for.”

Featuresthese_are:1 aren_t:1 t_the:1the_droids:1 you_re:1 re_looking:1looking_for:1

False TrueFalse 203 104True 436 535

Accuracy: 0.578

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 37 / 47

Page 63: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 5: Prune (Again)

Not all bigrams appear often

SVM has to search a long time and might not get to the right answer

Helps to prune features

Input: “These aren’t the droids you’re looking for.”

Featuresthese_are:1 the_droids:1re_looking:1 looking_for:1

False TrueFalse 410 276True 229 363

Accuracy: 0.605

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 38 / 47

Page 64: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Step 5: Prune (Again)

Not all bigrams appear often

SVM has to search a long time and might not get to the right answer

Helps to prune features

Input: “These aren’t the droids you’re looking for.”

Featuresthese_are:1 the_droids:1re_looking:1 looking_for:1

False TrueFalse 410 276True 229 363

Accuracy: 0.605

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 38 / 47

Page 65: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

How do you find new features?

Make predictions on the development set.

Look at contingency table; where are the errors?

What do you miss?

Error analysis!

What feature would the classifier need to get this right?

What features are confusing the classifier?… If it never appears in the development set, it isn’t useful… If it doesn’t appear often, it isn’t useful

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 39 / 47

Page 66: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

How do you find new features?

Make predictions on the development set.

Look at contingency table; where are the errors?

What do you miss? Error analysis!

What feature would the classifier need to get this right?

What features are confusing the classifier?… If it never appears in the development set, it isn’t useful… If it doesn’t appear often, it isn’t useful

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 39 / 47

Page 67: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

How do you know something is a good feature?

Make a contingency table for that feature (should give you good informationgain)

Throw it into your classifier (accuracy should improve)

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 40 / 47

Page 68: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Homework 3

I’ve given you quiz bowl questions

And test data (no labels)

Only have small number of features (should get you around 81%)… For these features, it doesn’t matter (much) which classifier you use

Your job: add additional features and see how they do

Be creative! Find new and interesting data, extract useful things from thesedata.

Last year: best students wrote a paper with me:Jordan Boyd-Graber, Kimberly Glasgow, and Jackie Sauter Zajac. SpoilerAlert: Machine Learning Approaches to Detect Social Media Posts withRevelatory Information. ASIST 2013: The 76th Annual Meeting of the

American Society for Information Science and Technology, 2013.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 41 / 47

Page 69: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Homework 3

I’ve given you quiz bowl questions

And test data (no labels)

Only have small number of features (should get you around 81%)… For these features, it doesn’t matter (much) which classifier you use

Your job: add additional features and see how they do

Be creative! Find new and interesting data, extract useful things from thesedata.

Last year: best students wrote a paper with me:Jordan Boyd-Graber, Kimberly Glasgow, and Jackie Sauter Zajac. SpoilerAlert: Machine Learning Approaches to Detect Social Media Posts withRevelatory Information. ASIST 2013: The 76th Annual Meeting of the

American Society for Information Science and Technology, 2013.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 41 / 47

Page 70: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 42 / 47

Page 71: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Intrinsic vs. Extrinsic Evaluation

We’ve focused on intrinsic evaluation… Correctly predicting spoilers… Assigning words/documents to correct category… Detecting whether an image has a cow in it

More realistic: extrinsic evaluation… Number of spoilers seen by social media user… Number of relevant documents returned by IR system… Throughput of automatic cow milking system

Bottom line: extrinsic evaluations are harder, but they’re more often the thingyou care about.

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 43 / 47

Page 72: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Convincing Results

Give baseline performance… Most frequent class… Random guessing… Current “best practice”

Give qualitative results… Examples that were right / wrong… Error analysis… Tell a story

Give “blue sky” bounds… Oracle results for pipeline systems… Human ability

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 44 / 47

Page 73: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Outline

1 Annotation: Getting Labels

2 Agreement

3 Quiz Bowl

4 Features for Quiz Bowl

5 TV Tropes

6 Features for TV Tropes

7 Evaluation

8 Wrapup

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 45 / 47

Page 74: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

Lifecycle of Project

Starting with no labels

Building classification scheme

Feature engineering

Evaluation

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 46 / 47

Page 75: Feature Engineering - University of Colorado Boulder ...jbg/teaching/DATA_DIGGING/...Bamboo (visualization and annotation for humanists) Digging into Data: Jordan Boyd-Graber (UMD)

RTextTools

library(RTextTools)

train.df <- read.csv("train/train.csv")

train.df$sentence <- as.character(train.df$sentence)

dev.df <- read.csv("dev/dev.csv")

dev.df$sentence <- as.character(dev.df$sentence)

train.df <- train.df[1:1000,]

dev.df <- dev.df[1:100,]

data <- rbind(train.df, dev.df)

dev_size <- dim(dev.df)[1]

total_size <- dim(data)[1]

matrix <- create_matrix(cbind(data$sentence, data$trope),

language="english", removeNumbers=TRUE, stemWords=FALSE,

weighting=weightTfIdf)

container <- create_container(matrix, data$spoiler, trainSize=1:dev_size,

testSize=(1+dev_size):total_size, virgin=FALSE)

models <- train_models(container, algorithms=c("MAXENT","SVM"))

results <- classify_models(container, models)

Digging into Data: Jordan Boyd-Graber (UMD) Feature Engineering April 7, 2014 47 / 47