Zero-shot Entity Extraction from Web Pages
ACL
June 23, 2014
Panupong Pasupat and Percy Liang
![Page 3: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/3.jpg)
Focus: Entity Extraction
hiking trails
hiking trails near Baltimore
Avalon Super Loop
Patapsco Valley State Park
Gunpowder Falls State Park
Union Mills Hike
Greenbury Point
...
What are the longest near Baltimore?
Data Source
Applications: question answering / semantic parsing / taxonomy construction / ontology expansion / knowledge base population / ...
1
![Page 4: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/4.jpg)
Semi-Structured Data on the Web
2
![Page 7: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/7.jpg)
Challenge: Long Tail of Categories
person location organization
airport battleship acid pitcher
settlement headgear metaphor haircut
poker hand biome enzyme superstition
tutorials at ACL 2014
dishes at Pu Pu Hot Pot
Stanford computer science professors
We want to generalize to unseen categories
3
![Page 9: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/9.jpg)
Relevant Approaches
Bootstrapping from Seed Examples:
seeds
Avalon Super Loop
Hilton Area
System
answers
Avalon Super Loop
Hilton Area
Wildlands Loop
...
web pages
Use seed examples to specify the entity category
... but we might not have seeds (e.g. in question answering)
[Wang and Cohen, 2009; Google Sets; Sarmento et al. 2007; ...]
4
![Page 10: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/10.jpg)
Our Work
Input: query (hiking trails near Baltimore) + web page → System → answers: Avalon Super Loop, Hilton Area, Wildlands Loop, ...
Use a natural language query to specify the entity category
5
![Page 11: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/11.jpg)
Outline
1. Setup
• Problem Setup
• Dataset
2. Approach
3. Results
6
![Page 15: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/15.jpg)
Problem Setup
Input:
• query x
hiking trails near Baltimore
• web page w
Output:
• list of entities y
[Avalon Super Loop, Patapsco Valley State Park, ...]
7
![Page 16: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/16.jpg)
Dataset
We created the OpenWeb dataset with diverse queries and webpages.
airlines of italy
natural causes of global warming
lsu football coaches
bf3 submachine guns
badminton tournaments
foods high in dha
technical colleges in south carolina
songs on glee season 5
singers who use auto tune
san francisco radio stations
8
![Page 20: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/20.jpg)
Query Generation
Breadth-first search on Google Suggest
"list of" → Suggest → "list of Indian movies", ... → Template Extraction → "list of movies", "list of movies", "list of Indian", ...
[Berant et al., 2013]
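The breadth-first expansion named on this slide can be sketched as follows. This is a minimal illustration, not the authors' crawler: `bfs_queries`, `toy_suggest`, and the toy suggestion table are hypothetical stand-ins, and no real suggestion service is called.

```python
from collections import deque


def bfs_queries(seed, suggest, max_queries=100):
    """Expand `seed` breadth-first through `suggest`, returning new queries in visit order."""
    seen = {seed}
    queue = deque([seed])
    found = []
    while queue and len(found) < max_queries:
        prefix = queue.popleft()
        for suggestion in suggest(prefix):
            if suggestion in seen:
                continue
            seen.add(suggestion)
            found.append(suggestion)
            queue.append(suggestion)
            if len(found) >= max_queries:
                break
    return found


# Toy stand-in for a suggestion service (hypothetical data, not real suggestion output).
TOY_SUGGESTIONS = {
    "list of": ["list of indian movies", "list of airlines"],
    "list of airlines": ["list of airlines of italy"],
}


def toy_suggest(prefix):
    return TOY_SUGGESTIONS.get(prefix, [])


queries = bfs_queries("list of", toy_suggest)
```

Passing the suggestion function as a parameter keeps the traversal logic independent of whichever service supplies the completions.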
9
![Page 22: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/22.jpg)
Dataset Annotation
Annotate the first, second, and last entities matching the query using Amazon Mechanical Turk.
airlines of italy
Annotation
First: Air Dolomiti
Second: Air Europe
Last: Wind Jet
10
![Page 23: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/23.jpg)
Dataset Statistics
2773 examples
2269 unique queries
894 unique headwords ← long tail!
1483 unique web domains ← long tail!
(≠ wrapper induction)
11
![Page 24: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/24.jpg)
Outline
1. Setup
2. Approach
• Extraction Predicate
• Framework
• Modeling
• Features
3. Results
12
![Page 25: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/25.jpg)
Extraction Predicate
How can we choose what to extract from a web page w?
html
  head
  body
    table
      tr
        td td td td
    h1
    table
      tr
        th th
      tr
        td td
      ...
      tr
        td td
number of possible entity lists ≈ 2^(number of nodes)
13
![Page 27: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/27.jpg)
Extraction Predicate
Idea: Entities usually share the same tag and tree level
z = /html[1]/body[1]/table[2]/tr/td[1]
Captures structures such as table columns, list entries, headers of the same level, ...
Each web page has ≈ 8500 extraction predicates z
[Sahuguet and Azavant, 1999; Liu et al., 2000; Crescenzi et al., 2001]
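To make the predicate concrete, here is a small sketch of executing a path like the one above on a toy page. It assumes a well-formed XML page and uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath; the `execute` helper and the cell contents are hypothetical, not the authors' implementation.

```python
import xml.etree.ElementTree as ET

# Toy, well-formed page mirroring the tree on the slide (cell contents are made up).
PAGE = """<html><head/><body>
<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>
<h1>Trails</h1>
<table>
  <tr><th>Name</th><th>Notes</th></tr>
  <tr><td>Avalon Super Loop</td><td>-</td></tr>
  <tr><td>Patapsco Valley State Park</td><td>-</td></tr>
</table>
</body></html>"""


def execute(root, predicate):
    """Run an absolute path such as /html[1]/body[1]/table[2]/tr/td[1] on `root`."""
    steps = predicate.strip("/").split("/")
    # ElementTree paths are relative to the root element, so drop the leading html step.
    if steps[0].split("[")[0] != root.tag:
        return []
    relative = "./" + "/".join(steps[1:])
    return [(node.text or "").strip() for node in root.findall(relative)]


root = ET.fromstring(PAGE)
entities = execute(root, "/html[1]/body[1]/table[2]/tr/td[1]")
```

Steps without an index (`tr`) match every child with that tag, while indexed steps (`td[1]`) pick one child per parent, which is how a single path selects a whole table column. Real web pages are rarely well-formed XML, so a practical system would need an HTML parser and a fuller XPath engine.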
14
![Page 32: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/32.jpg)
Framework
x (query): hiking trails near Baltimore
w (web page): html → head ..., body ...
Generation → Z (|Z| ≈ 8500)
Model → z = /html[1]/body[1]/table[2]/tr/td[1]
Execution → y = [Avalon Super Loop, Patapsco Valley State Park, ...]
A graphical model with latent extraction predicate z
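The three stages (Generation, Model, Execution) compose as in the sketch below. Everything here is a toy under stated assumptions: the page is a plain dictionary from predicate to entity list, and `toy_score` is a made-up scoring rule standing in for the learned model.

```python
def extract(query, page, generate, score, execute):
    """Generation -> Model -> Execution, returning the chosen predicate z and its list y."""
    candidates = generate(page)                                   # Generation: candidate set Z
    best = max(candidates, key=lambda z: score(query, page, z))   # Model: pick z
    return best, execute(page, best)                              # Execution: y


# Hypothetical toy page: each predicate maps directly to the list it would extract.
toy_page = {
    "/html[1]/body[1]/table[2]/tr/td[1]": ["Avalon Super Loop", "Patapsco Valley State Park"],
    "/html[1]/body[1]/ul[1]/li": ["Home", "Contact"],
}


def toy_generate(page):
    return list(page)


def toy_execute(page, z):
    return page[z]


def toy_score(query, page, z):
    # Made-up rule for illustration only: prefer lists whose entries have more words.
    entries = page[z]
    return sum(len(e.split()) for e in entries) / len(entries)


z, y = extract("hiking trails near Baltimore", toy_page, toy_generate, toy_score, toy_execute)
```

Only the middle stage is learned; generation and execution are deterministic functions of the page.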
15
![Page 34: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/34.jpg)
Modeling
Let x be a query and w be a web page.
Define a log-linear distribution over the extraction predicates z ∈ Z:
$p_\theta(z \mid x, w) \propto \exp\{\theta^\top \phi(x, w, z)\}$
• θ is a parameter vector
• φ(x, w, z) is a feature vector
• Find θ that maximizes the log-likelihood of the training data using AdaGrad [Duchi et al., 2010]
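As a rough illustration of the log-linear model, the sketch below normalizes exponentiated scores over a candidate set and applies one generic AdaGrad update in the ascent direction. The feature names, weights, and candidate labels are hypothetical, and the gradient computation for the actual training objective is not shown.

```python
import math


def predicate_distribution(theta, features):
    """Softmax over candidates: p(z | x, w) proportional to exp(theta . phi(x, w, z)).

    `features` maps each candidate z to a sparse feature dict phi(x, w, z).
    """
    scores = {
        z: sum(theta.get(name, 0.0) * value for name, value in phi.items())
        for z, phi in features.items()
    }
    shift = max(scores.values())  # subtract the max for numerical stability
    weights = {z: math.exp(s - shift) for z, s in scores.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


def adagrad_ascent_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step in the ascent direction (the objective is maximized)."""
    for name, g in grad.items():
        accum[name] = accum.get(name, 0.0) + g * g
        theta[name] = theta.get(name, 0.0) + lr * g / (math.sqrt(accum[name]) + eps)
    return theta


# Hypothetical weights and features, for illustration only.
theta = {"same_tag": 1.0, "is_navigation": -2.0}
features = {
    "z_table_column": {"same_tag": 1.0},
    "z_nav_links": {"same_tag": 1.0, "is_navigation": 1.0},
}
probs = predicate_distribution(theta, features)
```

Because the distribution is normalized over the candidates of a single page, scores are only compared within a page, never across pages.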
16
![Page 36: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/36.jpg)
Features
$p_\theta(z \mid x, w) \propto \exp\{\theta^\top \phi(x, w, z)\}$
Structural Features: context
17
![Page 37: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/37.jpg)
Features
$p_\theta(z \mid x, w) \propto \exp\{\theta^\top \phi(x, w, z)\}$
Denotation Features: content
Query: hiking trails near Baltimore
Good list: Avalon Super Loop, Patapsco Valley State Park, Gunpowder Falls State Park, Rachel Carson Conservation Park, Union Mills Hike, ...
Bad list: Home, About Baltimore Tour, Pricing, Contact, Online Support, ...
17
![Page 39: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/39.jpg)
Defining Features on Lists
List 1 (good): George Washington, John Adams, Thomas Jefferson, James Madison, ... (39 more) ..., Barack Obama
List 2 (bad): John Adams, John Adams, John Adams, John Adams, John Adams, John Adams, ... (100 more) ..., John Adams
List 3 (bad): Blog, Photos and Video, Briefing Room, In the White House, Mobile Apps, Contact Us
identity: List 1 diverse, List 2 identical, List 3 diverse
18
![Page 40: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/40.jpg)
Defining Features on Lists
List 1 (good): NNP NNP, NNP NNP, NNP NNP, NNP NNP, ... (39 more) ..., NNP NNP
List 2 (bad): NNP NNP, NNP NNP, NNP NNP, NNP NNP, NNP NNP, NNP NNP, ... (100 more) ..., NNP NNP
List 3 (bad): NN, NNS CC NNP, NN NN, IN DT NNP NNP, NNP NNPS, NN PRP
identity: List 1 diverse, List 2 identical, List 3 diverse
POS: List 1 identical, List 2 identical, List 3 diverse
18
![Page 44: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/44.jpg)
Defining Features on Lists
Avalon Super Loop → 3
Patapsco Valley State Park → 4
Gunpowder Falls State Park → 4
Union Mills Hike → 3
Greenbury Point → 2
Histogram over the abstract tokens 2, 3, 4
Aggregate features: Entropy, Majority, MajorityRatio, Single, Mean, Variance
1. Abstraction
Map list elements into abstract tokens
2. Aggregation
Define features using the histogram of the abstract tokens
Use this method for both structural and denotation features
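The two steps can be sketched in a few lines. The abstraction below is the word count, which matches the numbers shown for the example list; the feature names come from the slide, but their exact definitions here (natural-log entropy, population variance, and so on) are assumptions.

```python
import math
from collections import Counter


def word_count(text):
    """Abstraction matching the slide's example: number of whitespace-separated words."""
    return len(text.split())


def aggregate(tokens):
    """Summarize a non-empty list of abstract tokens via its histogram."""
    histogram = Counter(tokens)
    n = len(tokens)
    majority, majority_count = histogram.most_common(1)[0]
    features = {
        "Entropy": -sum((c / n) * math.log(c / n) for c in histogram.values()),
        "Majority": majority,
        "MajorityRatio": majority_count / n,
        "Single": len(histogram) == 1,
    }
    # Mean and Variance only make sense for numeric abstract tokens.
    if all(isinstance(t, (int, float)) for t in tokens):
        mean = sum(tokens) / n
        features["Mean"] = mean
        features["Variance"] = sum((t - mean) ** 2 for t in tokens) / n
    return features


trails = [
    "Avalon Super Loop",
    "Patapsco Valley State Park",
    "Gunpowder Falls State Park",
    "Union Mills Hike",
    "Greenbury Point",
]
tokens = [word_count(t) for t in trails]
features = aggregate(tokens)
```

Because the features depend only on the histogram, lists of any length map to a feature vector of the same shape.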
19
![Page 45: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/45.jpg)
Outline
1. Setup
2. Approach
3. Results
• Main Results
• Error Analysis
• Feature Analysis
20
![Page 47: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/47.jpg)
Main Results
[Bar chart, y-axis: Accuracy]
Baseline (most frequent extraction predicates): 10.3
Accuracy: 40.5
Accuracy @ 5: 55.8
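For readers unfamiliar with the "accuracy at k" metric, a minimal sketch follows. It assumes each example comes with its candidates already ranked by model score and marked correct or incorrect; the function name and the toy data are hypothetical.

```python
def accuracy_at_k(ranked_correctness, k=1):
    """Fraction of examples with at least one correct candidate among the top k.

    `ranked_correctness` holds one list of booleans per example, ordered by model score.
    """
    if not ranked_correctness:
        return 0.0
    hits = sum(any(flags[:k]) for flags in ranked_correctness)
    return hits / len(ranked_correctness)


# Toy data: four examples, three ranked candidates each.
examples = [
    [True, False, False],
    [False, True, False],
    [False, False, False],
    [False, False, True],
]
top1 = accuracy_at_k(examples, k=1)
top2 = accuracy_at_k(examples, k=2)
```

Plain accuracy is the k = 1 case, so accuracy at a larger k can never be lower than plain accuracy.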
21
![Page 48: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/48.jpg)
Error Analysis
Correct: 40.5%
Coverage Errors: 33.4%
Ranking Errors: 26.1%
22
![Page 49: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/49.jpg)
Examples of Correct Predictions
Query: disney channel movies
/html[1]/body/div[2]/div/div/div[3]/div[1]/div/div/div/div/b
23
![Page 50: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/50.jpg)
Examples of Correct Predictions
Query: universities in canada
/html[1]/body/div/div/div/div/div/div/div/a/text
24
![Page 51: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/51.jpg)
Examples of Correct Predictions
Query: nobel prize winners
/html[1]/body/div/div[2]/div/div/div/h6/a/text
25
![Page 53: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/53.jpg)
Error Analysis
Coverage Errors (33.4%): No extraction predicate z produces an entity list y matching the annotation
26
![Page 54: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/54.jpg)
Examples of Coverage Errors
Query: companies named after a person
/html/body/div[3]/div[3]/div[4]/ul/li/a
Need richer extraction predicates!
27
![Page 55: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/55.jpg)
Examples of Coverage Errors
Query: hedge funds in new york
/html/body/div[3]/div[3]/div[4]/.../table/tbody/tr/td[2]/a
Need compositionality!
28
![Page 57: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/57.jpg)
Error Analysis
Ranking Errors (26.1%): The system finds a list y matching the annotation, but it does not have the highest model score.
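The two error types can be told apart mechanically, as in the sketch below. It assumes that a list "matches the annotation" when its first, second, and last elements equal the annotated ones; that matching rule, the function names, and the toy candidates are assumptions made for illustration.

```python
def matches(entities, annotation):
    """Assumed matching rule: first, second, and last elements equal the annotation."""
    first, second, last = annotation
    return (
        len(entities) >= 2
        and entities[0] == first
        and entities[1] == second
        and entities[-1] == last
    )


def classify(ranked_lists, annotation):
    """Label one example given its candidate lists ordered by model score."""
    if not any(matches(y, annotation) for y in ranked_lists):
        return "coverage error"  # no candidate predicate yields a matching list
    if matches(ranked_lists[0], annotation):
        return "correct"
    return "ranking error"  # a matching list exists but is not ranked first


annotation = ("Air Dolomiti", "Air Europe", "Wind Jet")
# Toy candidates; the middle element of the second list is a placeholder.
ranked = [
    ["Home", "About", "Contact"],
    ["Air Dolomiti", "Air Europe", "(other airline)", "Wind Jet"],
]
label = classify(ranked, annotation)
```

Coverage errors bound what any ranking model could achieve, since no reordering of the candidates can recover a list that was never generated.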
29
![Page 58: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/58.jpg)
Examples of Ranking Errors
Query: doctors at emory
/html/body/div[3]/div[4]/table/tbody/tr/td[2]
30
![Page 60: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/60.jpg)
Augmenting Denotation Features
Observation: Entities of different categories have different linguistic properties.
mayors of Chicago: Rahm Emanuel, Richard M. Daley, Eugene Sawyer, ...
universities in Chicago: Aurora University, DePaul University, Illinois Institute of Technology, ...
Experiment: Augment denotation features with the query category.
POS majority = NNP NNP → (POS majority = NNP NNP, query category = people)
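One plausible reading of this augmentation is a feature conjunction: each denotation feature is paired with the query category so that the model can learn category-specific weights. The sketch below assumes that reading; the function name and the choice to return only the conjoined features are assumptions, and how the query category is obtained is left out.

```python
def augment_with_category(features, query_category):
    """Pair every denotation feature with the query category (assumed conjunction scheme)."""
    category = f"query category = {query_category}"
    return {(name, category): value for name, value in features.items()}


denotation = {"POS majority = NNP NNP": 1.0}
augmented = augment_with_category(denotation, "people")
```

With conjoined features, "POS majority = NNP NNP" can receive a high weight for people queries and a different weight for other categories.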
31
![Page 61: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/61.jpg)
Augmenting Denotation Features
[Bar chart, y-axis: Accuracy (dev)]
Denotation: 19.8
Augmented Denotation: 25
32
![Page 63: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/63.jpg)
Augmenting Denotation Features
[Bar chart, y-axis: Accuracy (dev)]
Structural + Denotation (default): 41.1
Structural + Augmented Denotation: 41.7
Hypothesis: Structural features have high influence when the web page comes from a Web search result.
33
![Page 66: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/66.jpg)
Augmenting Denotation Features
Hypothesis: Structural features have high influence when the web page comes from a Web search result.
hiking trails near Baltimore
Verify the hypothesis: Concatenate a random web page
• Creates noise: entity lists with high structural feature scores might not be the correct list
34
![Page 67: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/67.jpg)
Augmenting Denotation Features
[Bar chart, y-axis: Accuracy (stitched)]
Structural + Denotation (default): 19.3
Structural + Augmented Denotation: 29.2
35
![Page 68: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/68.jpg)
Summary
Input: query (hiking trails near Baltimore) + web page → System → answers: Avalon Super Loop, Hilton Area, Wildlands Loop, ...
A framework for extracting entities from a natural language query and a single web page
36
![Page 71: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/71.jpg)
Summary
• Focus on the long tail of entity categories (e.g. tutorials at ACL)
• Consider both structural and denotation features
• Handle lists of different sizes with abstraction and aggregation
37
![Page 72: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/72.jpg)
Future Work
• Model relationship between entities and category strings
• Compositionality in natural language
38
![Page 73: Zero-shot Entity Extraction from Web Pagespliang/papers/extraction-acl2014-talk.pdf · Zero-shot Entity Extraction from Web Pages ACL June 23, 2014 Panupong Pasupat and Percy Liang](https://reader036.vdocuments.mx/reader036/viewer/2022062919/5ee14d8dad6a402d666c3b80/html5/thumbnails/73.jpg)
Download code and dataset:
http://nlp.stanford.edu/software/web-entity-extractor-ACL2014
Thank you!
39