![Page 1: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/1.jpg)
Harvesting Relational Tables from Lists on the Web
Hazem Elmeleegy, Purdue University
Jayant Madhavan and Alon Halevy, Google Inc.
![Page 2: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/2.jpg)
Outline
Introduction
The ListExtract Approach
Experiments
Conclusion
![Page 3: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/3.jpg)
Lists on the Web
![Page 4: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/4.jpg)
Lists on the Web
![Page 5: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/5.jpg)
Lists on the Web
![Page 6: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/6.jpg)
Lists on the Web
• Our Goal: Extract tabular data from all such lists in an unsupervised and domain-independent manner.
• Not the typical wrapper generation problem.
![Page 7: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/7.jpg)
Cartoons Example
A period (“.”) is used both as a delimiter and to terminate abbreviations
A slash (“/”) is used both as a delimiter and as part of the text
The slash (“/”) delimiter is missing (along with the production year)
• Easy for Humans
• Confusing for Machines
![Page 8: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/8.jpg)
Key Contributions
Developed the ListExtract system, which extracts tables from lists in an unsupervised and domain-independent manner
Introduced the use of external sources of information, such as a large corpus of tables collected from the Web and a language model, to guide the splitting decisions
Conducted a large-scale experimental study which suggests that tens of millions of high-quality lists can be exploited on the Web
![Page 9: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/9.jpg)
Outline
Introduction
The ListExtract Approach
Experiments
Conclusion
![Page 10: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/10.jpg)
ListExtract Approach
Independent Splitting Phase
• Splitting Lines into Records
Alignment Phase
• Deciding the Number of Columns
• Re-Splitting Long Records
• Aligning Short Records (Null Insertion)
Refinement Phase
• Detecting Inconsistent Fields
• Re-Splitting Detected Field Streaks
• Re-Aligning Detected Field Streaks (Null Insertion)
![Page 11: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/11.jpg)
Intermediate Outputs (Independent Splitting Phase)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay
…
17 || Popeye the Sailor || Meets || Sinbad the Sailor || Fletcher || 1936
![Page 12: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/12.jpg)
Intermediate Outputs (Re-Splitting Long Records)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay
…
17. Popeye the Sailor Meets || Sinbad the Sailor || Fletcher || 1936
Number of Columns = 4
![Page 13: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/13.jpg)
Intermediate Outputs (Alignment Phase)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4. Duck Dodgers in the 24 1/2th Century (Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay
… || … || … || …
17. Popeye the Sailor Meets || Sinbad the Sailor || Fletcher || 1936
![Page 14: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/14.jpg)
Final Output (Refinement Phase)
1 || What’s Opera Doc || Warner Bros || 1957
2 || Duck Amuck || Warner Bros || 1953
3 || The Band Concert || Disney || 1935
4 || Duck Dodgers in the 24 1/2th Century || Warner Bros || 1953
5 || One Froggy Evening || Warner Bros || 1956
6 || Gertie the Dinosaur || McCay
… || … || … || …
17 || Popeye the Sailor Meets Sinbad the Sailor || (Fletcher) || 1936
![Page 15: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/15.jpg)
ListExtract Approach
[Pipeline diagram repeated from page 10]
![Page 16: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/16.jpg)
Line Splitting Algorithm
Input: 3. The Band Concert (Disney /1935)
Pre-processing (removing delimiters): 3 The Band Concert Disney 1935
Subsequence || FQ Score
The Band Concert || 0.92 √
The Band || 0.89
Disney || 0.82 √
Band Concert || 0.65
Disney 1935 || 0.51
1935 || 0.34 √
... || …
3 || 0.15 √
Band Concert Disney || 0.12
3 The Band || 0.07
Concert Disney 1935 || 0.03
3 The Band Concert Disney 1935 || 0.01
Output: 3 || The Band Concert || Disney || 1935
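The selection shown on this slide can be read as a greedy procedure: rank every contiguous token subsequence by FQ score, then repeatedly accept the best one that does not overlap a field already chosen. The sketch below follows that reading only; the function name `split_line`, the `SLIDE_SCORES` table, and the rule that unlisted subsequences score 0 are assumptions made for illustration, not details confirmed by the slides.

```python
from typing import Dict, List, Tuple

# FQ scores copied from the slide. Assumption: any subsequence that is
# not listed here scores 0.
SLIDE_SCORES: Dict[str, float] = {
    "The Band Concert": 0.92,
    "The Band": 0.89,
    "Disney": 0.82,
    "Band Concert": 0.65,
    "Disney 1935": 0.51,
    "1935": 0.34,
    "3": 0.15,
    "Band Concert Disney": 0.12,
    "3 The Band": 0.07,
    "Concert Disney 1935": 0.03,
    "3 The Band Concert Disney 1935": 0.01,
}


def split_line(tokens: List[str], fq: Dict[str, float]) -> List[str]:
    """Greedily pick the highest-scoring non-overlapping subsequences."""
    n = len(tokens)
    # Enumerate every contiguous subsequence as (score, start, end).
    candidates: List[Tuple[float, int, int]] = []
    for i in range(n):
        for j in range(i + 1, n + 1):
            text = " ".join(tokens[i:j])
            candidates.append((fq.get(text, 0.0), i, j))
    # Highest score first; ties are broken by position for determinism.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))

    taken = [False] * n
    fields: List[Tuple[int, int]] = []
    for _score, i, j in candidates:
        if not any(taken[i:j]):
            fields.append((i, j))
            for k in range(i, j):
                taken[k] = True
        if all(taken):
            break
    fields.sort()  # restore left-to-right order
    return [" ".join(tokens[i:j]) for i, j in fields]


print(split_line("3 The Band Concert Disney 1935".split(), SLIDE_SCORES))
# ['3', 'The Band Concert', 'Disney', '1935']
```

With the slide's scores, the four accepted candidates are exactly the four rows marked with a check.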
![Page 17: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/17.jpg)
Field Quality (FQ) Score
Linear combination of multiple score components
Each component corresponds to a source of evidence
Score Components
1. Data Type
Regular expressions to capture different data types (e.g. dates, emails, currencies, etc.)
Score: 1 if match found, 0 otherwise
2. Table Corpus
Check if the candidate sequence exists as a field in the table corpus
Score: 1 if exists, 0 otherwise
3. Language Model
Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text
Score: a combination of the probabilities capturing both the likelihood and unlikelihood
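As a rough illustration of how the three components could be combined, here is a minimal sketch. The regular expressions, the equal weights, the lower-casing of corpus lookups, and the idea of passing the language-model component in as a callable are all assumptions; the slides do not give the actual patterns, weights, or language-model formula.

```python
import re
from typing import Callable, Set, Tuple

# Hypothetical patterns standing in for the data-type recognizers.
TYPE_PATTERNS = [
    re.compile(r"\d{4}"),                      # four-digit year
    re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),   # email address
    re.compile(r"[$€£]\d+(\.\d{2})?"),         # simple currency amount
]


def type_score(candidate: str) -> float:
    """1 if any data-type pattern matches the whole candidate, else 0."""
    return 1.0 if any(p.fullmatch(candidate) for p in TYPE_PATTERNS) else 0.0


def corpus_score(candidate: str, corpus_fields: Set[str]) -> float:
    """1 if the candidate appears as a field in the table corpus, else 0."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0


def fq_score(
    candidate: str,
    corpus_fields: Set[str],
    lm_score: Callable[[str], float],
    weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3),
) -> float:
    """Linear combination of the three evidence sources (weights assumed)."""
    w_type, w_corpus, w_lm = weights
    return (
        w_type * type_score(candidate)
        + w_corpus * corpus_score(candidate, corpus_fields)
        + w_lm * lm_score(candidate)
    )


# Usage with a toy corpus and a constant stand-in for the language model.
toy_corpus = {"disney", "warner bros"}
print(fq_score("Disney", toy_corpus, lambda s: 0.5))
```

Keeping each component as its own function makes it easy to add a fourth source of evidence later without touching the others.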
![Page 18: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/18.jpg)
ListExtract Approach
[Pipeline diagram repeated from page 10]
Decide on the Number of Columns: majority voting across all records
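Majority voting over record lengths takes only a few lines. In the sketch below, the function name `decide_num_columns` and the tie-breaking rule (prefer the larger count) are assumptions; the sample records reuse the field splits from the cartoons example.

```python
from collections import Counter
from typing import List

# Independently split records from the cartoons example.
CARTOON_RECORDS: List[List[str]] = [
    ["1", "What's Opera Doc", "Warner Bros", "1957"],
    ["2", "Duck Amuck", "Warner Bros", "1953"],
    ["3", "The Band Concert", "Disney", "1935"],
    ["4. Duck Dodgers in the 24 1/2th Century (Warner Bros", "1953"],
    ["5", "One Froggy Evening", "Warner Bros", "1956"],
    ["6", "Gertie the Dinosaur", "McCay"],
    ["17", "Popeye the Sailor", "Meets", "Sinbad the Sailor", "Fletcher", "1936"],
]


def decide_num_columns(records: List[List[str]]) -> int:
    """Return the most common number of fields across all records."""
    if not records:
        raise ValueError("need at least one record")
    counts = Counter(len(record) for record in records)
    top = max(counts.values())
    # Assumption: ties are broken in favour of the larger column count.
    return max(length for length, votes in counts.items() if votes == top)


print(decide_num_columns(CARTOON_RECORDS))
# 4
```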
![Page 19: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/19.jpg)
ListExtract Approach
[Pipeline diagram repeated from page 10]
![Page 20: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/20.jpg)
Re-Splitting Long Records
Input: 3. The Band Concert (Disney /1935)
Pre-processing (removing delimiters): 3 The Band Concert Disney 1935
Maximum Number of Output Fields = 3
Subsequence || FQ Score
The Band Concert || 0.92
The Band || 0.89
Disney || 0.82
Band Concert || 0.65
Disney 1935 || 0.51
1935 || 0.34
... || …
3 || 0.15
Band Concert Disney || 0.12
3 The Band || 0.07
Concert Disney 1935 || 0.03
3 The Band Concert Disney 1935 || 0.01
Output: 3 The Band Concert Disney / 1935
![Page 21: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/21.jpg)
ListExtract Approach
[Pipeline diagram repeated from page 10]
![Page 22: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/22.jpg)
Aligning Short Records (Null Insertion)
Independently Split Records:
... ... ...
... ... ... ...
... ...
... ... ... ...
... ... ...
... ... ... ...
... ... ... ...
Avg. FQ Score (one per record): 0.88, 0.79, 0.49, 0.62, 0.73, 0.92, 0.86
![Page 23: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/23.jpg)
Aligning Short Records (Null Insertion)
1- Sorting 2- Iterative Alignment
Independently Split Records, Avg. FQ Score (in slide order): 0.92, 0.86, 0.79, 0.62, 0.88, 0.73, 0.49
Output Table: short records are padded with NULL fields (e.g. "… … NULL" and "… … NULL NULL")
![Page 24: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/24.jpg)
Aligning Short Records (Null Insertion)
To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm.
[NW, J. of Molecular Biology, 1970]
The two sequences:
Sequence #1: Table columns
Sequence #2: Fields of a short record
Design a Field-to-Field Consistency (F2FC) Score.
Use the average F2FC Score as the similarity measure for the alignment algorithm.
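The alignment step can be sketched as a restricted Needleman-Wunsch-style dynamic program in which gaps (NULLs) are only ever inserted into the short record, never into the column sequence. The gap penalty of 0, the toy `toy_f2fc` function, and all names below are assumptions for illustration; the real F2FC score combines several evidence sources.

```python
from typing import Callable, List, Optional

Similarity = Callable[[str, List[str]], float]


def align_record(
    fields: List[str],
    columns: List[List[str]],
    sim: Similarity,
    gap: float = 0.0,
) -> List[Optional[str]]:
    """Place the fields, in order, into the columns; None marks a NULL."""
    k, m = len(columns), len(fields)
    if m > k:
        raise ValueError("record has more fields than the table has columns")
    neg = float("-inf")
    # best[i][j]: best score using the first i columns and first j fields.
    best = [[neg] * (m + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for i in range(1, k + 1):
        for j in range(0, min(i, m) + 1):
            skip = best[i - 1][j] + gap  # column i-1 receives a NULL
            match = neg
            if j > 0:
                match = best[i - 1][j - 1] + sim(fields[j - 1], columns[i - 1])
            best[i][j] = max(skip, match)
    # Trace back from the full alignment to recover the field positions.
    aligned: List[Optional[str]] = [None] * k
    i, j = k, m
    while i > 0:
        if j > 0 and best[i][j] == best[i - 1][j - 1] + sim(fields[j - 1], columns[i - 1]):
            aligned[i - 1] = fields[j - 1]
            j -= 1
        i -= 1
    return aligned


def toy_f2fc(a: str, b: str) -> float:
    """Toy stand-in for F2FC: 1 if both or neither are all digits."""
    return 1.0 if a.isdigit() == b.isdigit() else 0.0


def column_similarity(field: str, column: List[str]) -> float:
    """Average toy F2FC between a field and the values already in a column."""
    return sum(toy_f2fc(field, value) for value in column) / len(column)


table_columns = [
    ["1", "2", "3"],
    ["What's Opera Doc", "Duck Amuck", "The Band Concert"],
    ["Warner Bros", "Warner Bros", "Disney"],
    ["1957", "1953", "1935"],
]
print(align_record(["6", "Gertie the Dinosaur", "McCay"], table_columns, column_similarity))
# ['6', 'Gertie the Dinosaur', 'McCay', None]
```

In this toy run the NULL lands in the year column, which is the only placement where every field matches the data type of its column.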
![Page 25: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/25.jpg)
Field-to-Field Consistency (F2FC) Score
Linear combination of multiple score components
Each component corresponds to a source of evidence
Score Components
1. Data Type
Check if data types are consistent
2. Table Corpus
Check if two fields co-occur in the same column in a table in the corpus
3. Syntax
Measure the consistency of the syntax of the two fields (e.g. length, % of upper/lower case letters, digits, spaces, etc.)
4. Delimiters
Measure the consistency between the delimiters on both sides of the two fields
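To make the four components concrete, the following sketch scores a pair of fields. Everything specific in it is an assumption: the `Field` record, the three coarse data types, the character-profile syntax measure, the representation of the corpus as a set of value pairs, and the equal weights.

```python
import re
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class Field:
    text: str
    left: str = ""   # delimiter found just before the field
    right: str = ""  # delimiter found just after the field


def data_type(text: str) -> str:
    """Very coarse type detection (hypothetical categories)."""
    if re.fullmatch(r"\d+", text):
        return "integer"
    if re.fullmatch(r"\d+\.\d+", text):
        return "decimal"
    return "text"


def char_profile(text: str) -> Tuple[float, float, float, float]:
    """Fractions of upper-case, lower-case, digit and space characters."""
    n = max(len(text), 1)
    return (
        sum(c.isupper() for c in text) / n,
        sum(c.islower() for c in text) / n,
        sum(c.isdigit() for c in text) / n,
        sum(c.isspace() for c in text) / n,
    )


def syntax_score(a: str, b: str) -> float:
    """Average of character-profile similarity and length similarity."""
    pa, pb = char_profile(a), char_profile(b)
    profile_sim = 1.0 - sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)
    length_sim = min(len(a), len(b)) / max(len(a), len(b), 1)
    return (profile_sim + length_sim) / 2


def f2fc(
    a: Field,
    b: Field,
    same_column_pairs: Set[FrozenSet[str]],
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Linear combination of the four evidence sources (weights assumed)."""
    type_part = 1.0 if data_type(a.text) == data_type(b.text) else 0.0
    corpus_part = 1.0 if frozenset((a.text, b.text)) in same_column_pairs else 0.0
    syntax_part = syntax_score(a.text, b.text)
    delim_part = ((a.left == b.left) + (a.right == b.right)) / 2
    w = weights
    return w[0] * type_part + w[1] * corpus_part + w[2] * syntax_part + w[3] * delim_part


year_a = Field("1957", left="/", right=")")
year_b = Field("1953", left="/", right=")")
studio = Field("Disney", left="(", right="/")
print(f2fc(year_a, year_b, set()), f2fc(year_a, studio, set()))
```

Two production years with matching delimiters score far higher than a year paired with a studio name, which is the behaviour the alignment step relies on.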
![Page 26: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/26.jpg)
ListExtract Approach
[Pipeline diagram repeated from page 10]
![Page 27: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/27.jpg)
Refinement Phase
… … … … … …
… … … … … …
… … … … … …
… … … … … …
… … … … … …
… … … … … …
Output Table
![Page 28: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/28.jpg)
Refinement Phase
… X … … … …
… … … … … X
… … X X X …
… … … … … …
… X X … … …
… … … … X …
Detect Inconsistent Fields
Output Table
![Page 29: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/29.jpg)
Refinement Phase
… … … … … …
… … … … … …
… … X X X …
… … … … … …
… X X … … …
… … … … … …
Detect Inconsistent Fields
Consider streaks only
Output Table
![Page 30: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/30.jpg)
Refinement Phase
… … … … … …
… … … … … …
… … … …
… … … … … …
… … … … …
… … … … … …
Detect Inconsistent Fields
Consider streaks only
Re-merge
Output Table
![Page 31: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/31.jpg)
Refinement Phase
… … … … … …
… … … … … …
… … √ √ √ …
… … … … … …
… √ √ … … …
… … … … … …
Detect Inconsistent Fields
Consider streaks only
Re-merge
Re-split (and re-align if needed)
Use extended FQ score
Output Table
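The "consider streaks only" and "re-merge" steps can be sketched as follows. Judging from the grids on these slides, isolated inconsistent fields are dropped and only runs of two or more adjacent ones are kept, so the sketch uses a minimum streak length of 2; that threshold, like the function names, is an assumption drawn from the pictures rather than a stated rule.

```python
from typing import List, Sequence, Tuple


def find_streaks(flags: Sequence[bool], min_len: int = 2) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs, end exclusive, of runs of True."""
    streaks: List[Tuple[int, int]] = []
    start = None
    # A trailing False closes any run that reaches the end of the row.
    for idx, bad in enumerate(list(flags) + [False]):
        if bad and start is None:
            start = idx
        elif not bad and start is not None:
            if idx - start >= min_len:
                streaks.append((start, idx))
            start = None
    return streaks


def remerge(row: List[str], streak: Tuple[int, int]) -> List[str]:
    """Join the fields of one streak back into a single field."""
    start, end = streak
    return row[:start] + [" ".join(row[start:end])] + row[end:]


# Inconsistency flags matching the X marks in the slide's 6x6 grid.
GRID = [
    [False, True, False, False, False, False],
    [False, False, False, False, False, True],
    [False, False, True, True, True, False],
    [False, False, False, False, False, False],
    [False, True, True, False, False, False],
    [False, False, False, False, True, False],
]
print([find_streaks(row) for row in GRID])
# [[], [], [(2, 5)], [], [(1, 3)], []]
```

After re-merging, each streak is re-split (and re-aligned if needed) using the extended FQ score.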
![Page 32: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/32.jpg)
Field Quality (FQ) Score [Revisited]
Linear combination of multiple score components
Each component corresponds to a source of evidence
Score Components
1. Data Type
2. Table Corpus
3. Language Model
4. List Support
• Favors candidates which are more consistent with the columns spanned by the streak
![Page 33: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/33.jpg)
ListExtract Approach
[Pipeline diagram repeated from page 10]
![Page 34: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/34.jpg)
Table Extraction (TE) Score
Average FQ Score for all fields in the extracted table
Used to compare and rank the extracted tables based on their extraction quality
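A minimal sketch of the TE score and the ranking it enables is shown below. Skipping NULL cells (represented as `None`) when averaging is an assumption, since the slide does not say how empty fields are treated; the function names are likewise hypothetical.

```python
from typing import Dict, List, Optional

ScoreTable = List[List[Optional[float]]]


def te_score(table_fq: ScoreTable) -> float:
    """Average FQ score over all non-NULL fields of an extracted table."""
    scores = [s for row in table_fq for s in row if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


def rank_tables(tables: Dict[str, ScoreTable]) -> List[str]:
    """Table names ordered from highest to lowest TE score."""
    return sorted(tables, key=lambda name: te_score(tables[name]), reverse=True)


example = {
    "cartoons": [[0.9, 0.8], [0.7, None]],
    "noisy": [[0.2, 0.4], [0.3, 0.1]],
}
print(rank_tables(example))
# ['cartoons', 'noisy']
```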
![Page 35: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/35.jpg)
Outline
Introduction
The ListExtract Approach
Experiments
Conclusion
![Page 36: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/36.jpg)
Overall Performance for WLists and TDLists
[Plot: F-measure (y-axis, 0.5 to 1) against top percentage of extracted tables (x-axis, 20 to 100), with one curve each for WLists and TDLists]
WLists: A set of 20 manually collected HTML lists spanning 20 different domains.
TDLists: A set of 100 lists derived from randomly selected HTML tables.
![Page 37: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/37.jpg)
Effect of the Refinement Phase (WLists)
[Plot: F-measure (y-axis, 0.5 to 1) against top percentage of extracted tables (x-axis, 20 to 100), comparing Refinement and No Refinement]
![Page 38: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/38.jpg)
Large-Scale Experiment
A crawl of 100K web pages
100K extracted lists
32K lists after filtering
11K extracted tables with multiple columns
[Plot: number of extracted tables (y-axis, 0 to 11,000) against Table Extraction Score (x-axis, 0 to 1); annotated points: (0.65, ~1,000 tables) and (0.45, ~10,300 tables)]
![Page 39: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/39.jpg)
Outline
Introduction
The ListExtract Approach
Experiments
Conclusion
![Page 40: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/40.jpg)
Conclusion
Our work is a continuation of the efforts to extract structured data from the Web.
Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions.
Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.
![Page 41: Harvesting Relational Tables from Lists on the Web](https://reader036.vdocuments.mx/reader036/viewer/2022070419/56815b13550346895dc8be74/html5/thumbnails/41.jpg)
Thank you
Questions?