annotation and feature engineeringjbg/teaching/csci_3022/11b.pdfannotation and feature engineering...

23
Annotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES, SPOILERS, AND TRIVIA Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 1 of 13

Upload: others

Post on 31-Jul-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Annotation and FeatureEngineering

Introduction to Data Science AlgorithmsJordan Boyd-Graber and Michael PaulHOUSES, SPOILERS, AND TRIVIA

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 1 of 13

Page 2: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Humans doing Incremental Classification

• Game called “quiz bowl”

• Two teams play each other

◦ Moderator reads a question◦ When a team knows the

answer, they signal (“buzz” in)◦ If right, they get points;

otherwise, rest of the questionis read to the other team

• Hundreds of teams in the USalone

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 2 of 13

Page 3: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Humans doing Incremental Classification

• Game called “quiz bowl”

• Two teams play each other

◦ Moderator reads a question◦ When a team knows the

answer, they signal (“buzz” in)◦ If right, they get points;

otherwise, rest of the questionis read to the other team

• Hundreds of teams in the USalone

• Example . . .

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 2 of 13

Page 4: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 5: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 6: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of heat capacity, so

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 7: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of heat capacity, so Debye adjusted the theory forlow temperatures. His summation convention automatically sums repeatedindices in tensor products. His name is attached to the A and B coefficients

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 8: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of heat capacity, so Debye adjusted the theory forlow temperatures. His summation convention automatically sums repeatedindices in tensor products. His name is attached to the A and B coefficientsfor spontaneous and stimulated emission, the subject of one of his multiplegroundbreaking 1905 papers. He further developed the model of statisticssent to him by

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 9: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of heat capacity, so Debye adjusted the theory forlow temperatures. His summation convention automatically sums repeatedindices in tensor products. His name is attached to the A and B coefficientsfor spontaneous and stimulated emission, the subject of one of his multiplegroundbreaking 1905 papers. He further developed the model of statisticssent to him by Bose to describe particles with integer spin. For 10 points,who is this German physicist best known for formulating the

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 10: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of heat capacity, so Debye adjusted the theory forlow temperatures. His summation convention automatically sums repeatedindices in tensor products. His name is attached to the A and B coefficientsfor spontaneous and stimulated emission, the subject of one of his multiplegroundbreaking 1905 papers. He further developed the model of statisticssent to him by Bose to describe particles with integer spin. For 10 points,who is this German physicist best known for formulating the special andgeneral theories of relativity?

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 11: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Sample Question 1

With Leo Szilard, he invented a doubly-eponymous refrigerator with nomoving parts. He did not take interaction with neighbors into account whenformulating his theory of heat capacity, so Debye adjusted the theory forlow temperatures. His summation convention automatically sums repeatedindices in tensor products. His name is attached to the A and B coefficientsfor spontaneous and stimulated emission, the subject of one of his multiplegroundbreaking 1905 papers. He further developed the model of statisticssent to him by Bose to describe particles with integer spin. For 10 points,who is this German physicist best known for formulating the special andgeneral theories of relativity?

Albert Einstein

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 3 of 13

Page 12: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Humans doing Incremental Classification

• This is not Jeopardy

• There are buzzers, but players canonly buzz at the end of a question

• Doesn’t discriminate knowledge

• Quiz bowl questions are pyramidal

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 4 of 13

Page 13: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Research Question: How do we know if a guess is correct?

• Turn (question, guess) into features

• Treat it as a binary classification problem

• What features help us do this well?

• Subject of HW3

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 5 of 13

Page 14: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Research Question: How do we know if a guess is correct?

• Turn (question, guess) into features

• Treat it as a binary classification problem

• What features help us do this well?

• Subject of HW3

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 5 of 13

Page 15: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Provided Dataset

• text: the clues revealed so far

• page: a guess at the answer

• answer: the actual answer (closest Wikipedia page)

• body_score: IR measure of how good a match the text is

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 6 of 13

Page 16: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Baseline

• What if we always say that the answer is wrong?

• Performance: 0.54

• Every feature should do better than this (otherwise, it’s useless)

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 7 of 13

Page 17: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Page Name

• The title of wikipedia pages often have disambiguation in parentheses

1 Paris (mythology)2 Paris (song)3 Paris (genus)4 Paris (band)

• Feature is 1 if the page has disambiguator in the text

◦ “This band performed . . . ”, Paris (band)→ True◦ “This band performed . . . ”, Paris (mythology)→ False

• Slight improvement: 0.58

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 8 of 13

Page 18: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Page Name

• The title of wikipedia pages often have disambiguation in parentheses

1 Paris (mythology)2 Paris (song)3 Paris (genus)4 Paris (band)

• Feature is 1 if the page has disambiguator in the text

◦ “This band performed . . . ”, Paris (band)→ True◦ “This band performed . . . ”, Paris (mythology)→ False

• Slight improvement: 0.58

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 8 of 13

Page 19: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Links

• The more morelinks a Wikipediapage has, the morepopular it is

• Popularity is oftena sign of a wronganswer

• By itself, doesn’tdo so well: 0.56

• But improves if wetake the log of thevalue: 0.61

0.0

0.1

0.2

0.3

0.4

−2 0 2 4log_links

dens

ity

corr

False

True

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 9 of 13

Page 20: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Score

• We can see howsimilar the text of aWikipedia page is

• Higher, the better

• This feature alonegives accuracy of0.75

0.000

0.005

0.010

0.015

0.020

0 100 200 300body_score

dens

ity

corr

False

True

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 10 of 13

Page 21: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Length

• The more text wesee, the moreconfident weshould be

• By itself, doesn’tdo so well: 0.56

• But whencombined with theIR score, doesgreat: 0.82 (bestso far)

0.0000

0.0005

0.0010

0.0015

0 200 400 600 800 1000 1200obs_len

dens

ity

corr

False

True

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 11 of 13

Page 22: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Others . . .

• Tournament the question was used in

• The type of thing the answer is

• Try your own, be creative!

• Last year’s feature engineering assignment

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 12 of 13

Page 23: Annotation and Feature Engineeringjbg/teaching/CSCI_3022/11b.pdfAnnotation and Feature Engineering Introduction to Data Science Algorithms Jordan Boyd-Graber and Michael Paul HOUSES,

Introduction to Data Science Algorithms | Boyd-Graber and Paul Annotation and Feature Engineering | 13 of 13