Record-Boundary Discovery in Web Documents
D.W. Embley, Y. Jiang, Y.-K. Ng
Data-Extraction Group*
Department of Computer Science
Brigham Young University
Provo, UT, USA
*Funded in part by Novell, Inc., Ancestry.com, Inc., and Faneuil Research.
Record-Boundary Discovery
Larger Goal: Information Extraction
<html><head><title>The Salt Lake Tribune … </title></head>
<body bgcolor="#FFFFFF">
<h1 align="left">Domestic Cars</h1>
…
<hr>
<h4> '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! <b>Asking only $11,995.</b> #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4>
<hr>
<h4> '85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4>
<hr>
…
</body></html>
#####
#####
#####
Year Make Model PhoneNr
Desired Objective
Query the Web Like a Database
Example: Get the year, make, model, and price for 1987 or later cars that are red or white.

| Year | Make | Model | Price |
|---|---|---|---|
| 97 | CHEVY | Cavalier | 11,995 |
| 94 | DODGE | | 4,995 |
| 94 | DODGE | Intrepid | 10,000 |
| 91 | FORD | Taurus | 3,500 |
| 90 | FORD | Probe | |
| 88 | FORD | Escort | 1,000 |

Approach and Limitations
Automatic Ontology-Based Wrapper Generation for a page of unstructured records, rich in data and narrow in ontological breadth
[Figure: system architecture. Application Ontology → Ontology Parser → Constant/Keyword Matching Rules; Record-Level Objects, Relationships, and Constraints; Database Scheme. Web Page → Record Extractor → Unstructured Records → Constant/Keyword Recognizer → Data-Record Table → Database-Instance Generator → Populated Database.]
Application Ontology: Object-Relationship Model Instance
Car [-> object];
Car [0..1] has Model [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Year [1..*];
Car [0..1] has Price [1..*];
Car [0..1] has Mileage [1..*];
PhoneNr [1..*] is for Car [0..1];
PhoneNr [0..1] has Extension [1..*];
Car [0..*] has Feature [1..*];
[Figure: object-relationship diagram of the constraints above. Car is linked to Year, Make, Model, Price, Mileage, Feature, and PhoneNr; PhoneNr is linked to Extension.]
Application Ontology: Data Frames
Make matches [10] case insensitive
  constant
    { extract "chev"; },
    { extract "chevy"; },
    { extract "dodge"; },
    …
end;
Model matches [16] case insensitive
  constant
    { extract "88"; context "\bolds\S*\s*88\b"; },
    …
end;
Mileage matches [7] case insensitive
  constant
    { extract "[1-9]\d{0,2}k"; substitute "k" -> ",000"; },
    …
  keyword "\bmiles\b", "\bmi\b", "\bmi.\b";
end;
...
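The data-frame entries above can be read as regular-expression rules. The following is a minimal Python sketch of that reading, not the presenters' implementation: the class `ConstantRule`, the function `apply_rule`, and the sample strings are hypothetical, and the treatment of `context` (extract only inside a context match) is an assumption.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ConstantRule:
    """One constant entry of a data frame (hypothetical representation)."""
    extract: str
    context: Optional[str] = None
    substitute: Optional[Tuple[str, str]] = None


def apply_rule(rule: ConstantRule, text: str) -> List[str]:
    """Return the constants a rule extracts from text, matching case-insensitively."""
    flags = re.IGNORECASE
    # Assumption: when a context is given, extraction only looks inside context matches.
    if rule.context is None:
        regions = [text]
    else:
        regions = [m.group(0) for m in re.finditer(rule.context, text, flags)]
    values = []
    for region in regions:
        for match in re.finditer(rule.extract, region, flags):
            value = match.group(0)
            if rule.substitute is not None:
                old, new = rule.substitute
                # Apply the textual substitution, e.g. "k" -> ",000".
                value = re.sub(re.escape(old), new, value, flags=flags)
            values.append(value)
    return values


mileage = ConstantRule(extract=r"[1-9]\d{0,2}k", substitute=("k", ",000"))
model_88 = ConstantRule(extract=r"88", context=r"\bolds\S*\s*88\b")

print(apply_rule(mileage, "runs great, 72K, offer"))     # ['72,000']
print(apply_rule(model_88, "'88 Oldsmobile 88, clean"))  # ['88']
```

In the second call the leading year '88 is not extracted as a Model, because it does not fall inside a match of the context expression.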
Ontology Parser
[Figure: Application Ontology → Ontology Parser → Constant/Keyword Matching Rules; Record-Level Objects, Relationships, and Constraints; Database Scheme.]
Constant/Keyword Matching Rules:
Make : chevy
…
KEYWORD(Mileage) : \bmiles\b
...
Database Scheme:
create table Car ( Car integer, Year varchar(2), … );
create table CarFeature ( Car integer, Feature varchar(10));
...
Record-Level Objects, Relationships, and Constraints:
Object: Car;
...
Car: Year [0..1];
Car: Make [0..1];
…
CarFeature: Car [0..*] has Feature [1..*];
Record Extractor
[Figure: Web Page → Record Extractor → Unstructured Records.]
Web Page:
<html>
…
<h4> '97 CHEVY Cavalier, Red, 5 spd, … </h4>
<hr>
<h4> '85 DODGE Daytona, needs paint, … </h4>
<hr>
….
</html>
Unstructured Records:
…
#####
'97 CHEVY Cavalier, Red, 5 spd, …
#####
'85 DODGE Daytona, needs paint, …
#####
...
Record Extractor: High Fan-Out Heuristic
<html><head><title>The Salt Lake Tribune … </title></head>
<body bgcolor="#FFFFFF">
<h1 align="left">Domestic Cars</h1>
…
<hr>
<h4> '97 CHEVY Cavalier, Red, … </h4>
<hr>
<h4> '85 DODGE Daytona, needs … </h4>
<hr>
…
</body></html>
[Figure: tag tree of the page. html has children head (containing title) and body; the tags under body are h1, …, hr, h4, b, hr, h4, ...]
Candidate Separator Tags: the tags under body, the node with the highest fan-out.
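As an illustration of the high fan-out idea, here is a small Python sketch that builds a tag tree with the standard-library `html.parser` module and picks the node with the most children. It is a sketch under assumptions, not the presenters' code: the names `Node`, `TagTreeBuilder`, `subtree_tags`, and `highest_fan_out`, the short `VOID_TAGS` set, and the choice to count every tag in the winning subtree as a candidate are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags with no closing tag; a partial list that is enough for this example.
VOID_TAGS = {"hr", "br", "img", "input", "meta", "link"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TagTreeBuilder(HTMLParser):
    """Builds a simple tag tree; void tags such as <hr> become leaf nodes."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root
        self.nodes = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.nodes.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag name, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def subtree_tags(node):
    """Yield the tag of every descendant of node."""
    for child in node.children:
        yield child.tag
        yield from subtree_tags(child)


def highest_fan_out(html):
    """Return the tag of the node with the most children and its subtree tag counts."""
    builder = TagTreeBuilder()
    builder.feed(html)
    best = max(builder.nodes, key=lambda n: len(n.children))
    return best.tag, Counter(subtree_tags(best))


page = (
    "<html><head><title>Cars</title></head><body><h1>Domestic Cars</h1>"
    "<hr><h4>'97 CHEVY Cavalier <b>Asking only $11,995.</b></h4>"
    "<hr><h4>'85 DODGE Daytona</h4><hr></body></html>"
)
tag, counts = highest_fan_out(page)
print(tag, dict(counts))
```

For this shortened page the node with the highest fan-out is `body`, with six children.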
Record Extractor: Record-Separator Heuristics
IT: Identifiable “html separator” Tags
HT: Highest-count Tags
SD: Standard Deviation
OM: Ontological Match
RP: Repeating-tag Patterns
IT: Identifiable “html separator” Tags
<h1 align="left">Domestic Cars</h1>
<hr>
<h4> '97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! <b>Asking only $11,995.</b> #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888 </h4>
<hr>
<h4> '85 DODGE Daytona, needs paint, runs great. Offer. 262-7557 </h4>
<hr>
<h4> '96 FORD Taurus GL Only $8900 WOW! Lowbook Sales 474-3335 </h4>
<hr>
Identifiable separator tags, in ranked order: hr, tr, td, a, table, p, br, h4, h1, strong, b, i
HT: Highest-count Tags

| Tag | Count |
|---|---|
| hr | 4 |
| h4 | 3 |
| b | 1 |

SD: Standard Deviation

| Tag | Interval lengths (characters) | σ |
|---|---|---|
| hr | 159, 63, 62 | 45.5 |
| h4 | 159, 63 | 48.0 |

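The two standard deviations on this slide can be reproduced with the population standard deviation of the listed character counts. The short Python check below assumes that reading; ranking the tag with the smaller deviation first is also an assumption, although it agrees with the SD ranks shown later in the consensus example (hr first, h4 second).

```python
from statistics import pstdev

# Character counts between consecutive occurrences of each tag, from the slide.
intervals = {"hr": [159, 63, 62], "h4": [159, 63]}

# Population standard deviation of the interval lengths for each tag.
sigma = {tag: pstdev(lengths) for tag, lengths in intervals.items()}

# Assumption: more regular spacing (smaller deviation) ranks higher.
ranking = sorted(sigma, key=sigma.get)

for tag in ranking:
    print(f"{tag}: sigma = {sigma[tag]:.1f}")
# hr: sigma = 45.5
# h4: sigma = 48.0
```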
OM: Ontological Match
Record Estimator: average of count of Year, Make, and Model = 3.
Closest candidate separator count: h4 = 3, hr = 4, b = 1.
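The ontological-match step can be sketched in a few lines of Python: estimate the number of records from the ontology's field counts, then rank candidate tags by how close their counts are to that estimate. The function name `rank_by_ontological_match` is hypothetical, and the per-field counts of 3 are read off the three ads in the example rather than stated on the slide.

```python
def rank_by_ontological_match(field_counts, tag_counts):
    """Rank tags by how close their counts are to the estimated record count."""
    # Record estimate: average number of occurrences of the chosen fields.
    estimate = sum(field_counts.values()) / len(field_counts)
    ranking = sorted(tag_counts, key=lambda tag: abs(tag_counts[tag] - estimate))
    return estimate, ranking


estimate, ranking = rank_by_ontological_match(
    {"Year": 3, "Make": 3, "Model": 3},   # one of each per ad in the example
    {"hr": 4, "h4": 3, "b": 1},           # candidate separator counts
)
print(estimate, ranking)  # 3.0 ['h4', 'hr', 'b']
```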
RP: Repeating-tag Patterns
3 pairs: <hr><h4>
Of the tags in the repeating pattern, h4 is closest with 3, then hr with 4.
Record Extractor: Consensus Heuristic
Certainty is a generalization of $C(E_1) + C(E_2) - C(E_1)\,C(E_2)$, where $C$ denotes certainty and $E_i$ is the evidence for an observation.
Our certainties are based on observations from 10 different sites for 2 different applications (car ads and obituaries).

Correct Tag Rank

| Heuristic | Rank 1 | Rank 2 | Rank 3 | Rank 4 |
|---|---|---|---|---|
| IT | 96.0% | 4.0% | 0% | 0% |
| HT | 49.0% | 32.5% | 16.5% | 2.0% |
| SD | 65.5% | 22.5% | 12.0% | 0% |
| OM | 84.5% | 12.5% | 2.0% | 1.0% |
| RP | 77.5% | 12.5% | 9.0% | 1.0% |

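Record Extractor: Example Consensus Heuristic

| Tag | IT rank | HT rank | SD rank | OM rank | RP rank | Computed Certainty Factor |
|---|---|---|---|---|---|---|
| hr | 1 | 1 | 1 | 2 | 2 | .994 |
| h4 | 2 | 2 | 2 | 1 | 1 | .983 |
| b | 3 | 3 | - | 3 | - | .182 |

e.g., b: $0 + .165 + .02 - (0)(.165) - (0)(.02) - (.165)(.02) + (0)(.165)(.02) = .1817$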
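The worked value for b can be checked with a short Python sketch. It assumes the n-way generalization of the pairwise rule is $1 - \prod_i (1 - C_i)$, which expands to the same inclusion-exclusion sum; the names `CERTAINTY` and `combined_certainty` are hypothetical, and a heuristic that does not rank a tag (shown as "-") is simply skipped.

```python
from math import prod

# Percentages from the "Correct Tag Rank" table, as fractions, indexed by rank 1..4.
CERTAINTY = {
    "IT": [0.960, 0.040, 0.000, 0.000],
    "HT": [0.490, 0.325, 0.165, 0.020],
    "SD": [0.655, 0.225, 0.120, 0.000],
    "OM": [0.845, 0.125, 0.020, 0.010],
    "RP": [0.775, 0.125, 0.090, 0.010],
}


def combined_certainty(ranks):
    """Combine per-heuristic certainties as 1 - product of (1 - C_i)."""
    factors = [
        CERTAINTY[heuristic][rank - 1]
        for heuristic, rank in ranks.items()
        if rank is not None  # heuristic did not rank this tag
    ]
    return 1.0 - prod(1.0 - c for c in factors)


ranks = {
    "hr": {"IT": 1, "HT": 1, "SD": 1, "OM": 2, "RP": 2},
    "h4": {"IT": 2, "HT": 2, "SD": 2, "OM": 1, "RP": 1},
    "b": {"IT": 3, "HT": 3, "SD": None, "OM": 3, "RP": None},
}

print(round(combined_certainty(ranks["b"]), 4))  # 0.1817
ordering = sorted(ranks, key=lambda t: combined_certainty(ranks[t]), reverse=True)
print(ordering)  # ['hr', 'h4', 'b']
```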
Record Extractor: Results
4 different applications (car ads, job ads, obituaries, university courses) with 5 new/different sites for each application

| Heuristic | Success Rate |
|---|---|
| IT | 95% |
| HT | 45% |
| SD | 65% |
| OM | 80% |
| RP | 75% |
| Consensus | 100% |

Constant/Keyword Recognizer
[Figure: Unstructured Records and Constant/Keyword Matching Rules → Constant/Keyword Recognizer → Data-Record Table.]
Unstructured record:
'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her. Previous owner heart broken! Asking only $11,995. #1415 JERRY SEINER MIDVALE, 566-3800 or 566-3888
Data-Record Table:
Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
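The table rows use 1-based, inclusive character positions. The Python sketch below shows one way such rows could be produced from regular-expression rules; it is illustrative only. The `RULES` list and the `recognize` function are hypothetical, and the patterns are simplified stand-ins rather than the ontology's actual data frames.

```python
import re

# Hypothetical, simplified matching rules: (descriptor, pattern with one capture group).
RULES = [
    ("Year", r"'(\d{2})\b"),
    ("Make", r"(chev)"),
    ("Make", r"(chevy)"),
    ("Model", r"\b(cavalier)\b"),
    ("Mileage", r"\b([1-9]\d{0,2},\d{3})\b"),
    ("KEYWORD(Mileage)", r"\b(miles)\b"),
]


def recognize(record):
    """Return Descriptor|String|start|end rows with 1-based inclusive positions."""
    rows = []
    for descriptor, pattern in RULES:
        for m in re.finditer(pattern, record, re.IGNORECASE):
            # Python offsets are 0-based with an exclusive end, so shift the start by one.
            rows.append((m.start(1) + 1, m.end(1), descriptor, m.group(1)))
    rows.sort()
    return [f"{d}|{s}|{start}|{end}" for start, end, d, s in rows]


record = "'97 CHEVY Cavalier, Red, 5 spd, only 7,000 miles on her."
for row in recognize(record):
    print(row)
# Year|97|2|3
# Make|CHEV|5|8
# Make|CHEVY|5|9
# Model|Cavalier|11|18
# Mileage|7,000|38|42
# KEYWORD(Mileage)|miles|44|48
```

Overlapping matches such as CHEV and CHEVY are both kept at this stage.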
Database-Instance Generator
[Figure: Data-Record Table and Record-Level Objects, Relationships, and Constraints → Database-Instance Generator.]
Heuristics
• Keyword proximity
• Subsumed and overlapping constants
• Functional relationships
• Nonfunctional relationships
• First occurrence without constraint violation
Descriptor/String/Position(start/end)
Year|97|2|3
Make|CHEV|5|8
Make|CHEVY|5|9
Model|Cavalier|11|18
Feature|Red|21|23
Feature|5 spd|26|30
Mileage|7,000|38|42
KEYWORD(Mileage)|miles|44|48
Price|11,995|100|105
Mileage|11,995|100|105
PhoneNr|566-3800|136|143
PhoneNr|566-3888|148|155
[Figure annotation, keyword proximity: Mileage 7,000 (ends at 42) is 2 positions from KEYWORD(Mileage) miles (starts at 44); Mileage 11,995 (starts at 100) is 52 positions from the end of miles (48).]
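Two of the listed heuristics, subsumed constants and keyword proximity, can be illustrated on rows from the data-record table. The Python sketch below is one possible reading, not the presenters' algorithm: the names `Entry`, `drop_subsumed`, `distance`, and `closest_to_keyword` are hypothetical, and the distance is taken as the gap between the end of one span and the start of the other.

```python
from typing import List, NamedTuple, Optional


class Entry(NamedTuple):
    descriptor: str
    string: str
    start: int
    end: int


def drop_subsumed(entries: List[Entry]) -> List[Entry]:
    """Drop a constant whose span lies inside a longer constant of the same descriptor."""
    kept = []
    for e in entries:
        subsumed = any(
            o is not e
            and o.descriptor == e.descriptor
            and o.start <= e.start
            and e.end <= o.end
            and (o.end - o.start) > (e.end - e.start)
            for o in entries
        )
        if not subsumed:
            kept.append(e)
    return kept


def distance(a: Entry, b: Entry) -> int:
    """Gap in character positions between two spans (0 if they overlap)."""
    if a.end < b.start:
        return b.start - a.end
    if b.end < a.start:
        return a.start - b.end
    return 0


def closest_to_keyword(entries: List[Entry], descriptor: str) -> Optional[Entry]:
    """Pick the candidate constant nearest to any keyword for the descriptor."""
    keywords = [e for e in entries if e.descriptor == f"KEYWORD({descriptor})"]
    candidates = [e for e in entries if e.descriptor == descriptor]
    if not keywords or not candidates:
        return None
    return min(candidates, key=lambda c: min(distance(c, k) for k in keywords))


table = [
    Entry("Make", "CHEV", 5, 8),
    Entry("Make", "CHEVY", 5, 9),
    Entry("Mileage", "7,000", 38, 42),
    Entry("KEYWORD(Mileage)", "miles", 44, 48),
    Entry("Price", "11,995", 100, 105),
    Entry("Mileage", "11,995", 100, 105),
]

makes = [e.string for e in drop_subsumed(table) if e.descriptor == "Make"]
mileage = closest_to_keyword(table, "Mileage")
print(makes, mileage.string)  # ['CHEVY'] 7,000
```

With 7,000 chosen as the Mileage, the remaining 11,995 candidate is left to Price.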
Database-Instance Generator
[Figure: Data-Record Table; Record-Level Objects, Relationships, and Constraints; and Database Scheme → Database-Instance Generator → Populated Database.]
insert into Car values(1001, "97", "CHEVY", "Cavalier", "7,000", "11,995", "566-3800")
insert into CarFeature values(1001, "Red")
insert into CarFeature values(1001, "5 spd")
Recall & Precision
$\text{Recall} = \dfrac{C}{N}$ (of facts available, how many did we find?)
$\text{Precision} = \dfrac{C}{C + I}$ (of facts retrieved, how many were relevant?)
N = number of facts in source
C = number of facts declared correctly
I = number of facts declared incorrectly
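As a small worked example of the two measures, the Python sketch below computes recall and precision from N, C, and I. The counts used are hypothetical and chosen only to show the arithmetic; they are not taken from the experiments.

```python
def recall(correct: int, in_source: int) -> float:
    """Recall = C / N: share of the facts in the source that were found."""
    return correct / in_source


def precision(correct: int, incorrect: int) -> float:
    """Precision = C / (C + I): share of the declared facts that are correct."""
    return correct / (correct + incorrect)


# Hypothetical counts: N = 50 facts in the source, C = 41 correct, I = 1 incorrect.
n, c, i = 50, 41, 1
print(f"recall = {recall(c, n):.2f}, precision = {precision(c, i):.2f}")
# recall = 0.82, precision = 0.98
```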
Results: Car Ads
Salt Lake Tribune
Training set for tuning ontology: 100
Test set: 116

| Attribute | Recall % | Precision % |
|---|---|---|
| Year | 100 | 100 |
| Make | 97 | 100 |
| Model | 82 | 100 |
| Mileage | 90 | 100 |
| Price | 100 | 100 |
| PhoneNr | 94 | 100 |
| Extension | 50 | 100 |
| Feature | 91 | 99 |

Car Ads: Comments
• Unbounded sets
  – missed: MERC, Town Car, 98 Royale
  – could use lexicon of makes and models
• Unspecified variation in lexical patterns
  – missed: 5 speed (instead of 5 spd), p.l (instead of p.l.)
  – could adjust lexical patterns
• Misidentification of attributes
  – classified AUTO in AUTO SALES as automatic transmission
  – could adjust exceptions in lexical patterns
• Typographical errors
  – “Chrystler”, “DODG ENeon”, “I-15566-2441”
  – could look for spelling variations and common typos
Results: Computer Job Ads
Los Angeles Times
Training set for tuning ontology: 50
Test set: 50

| Attribute | Recall % | Precision % |
|---|---|---|
| Degree | 100 | 100 |
| Skill | 74 | 100 |
| Email | 91 | 83 |
| Fax | 100 | 100 |
| Voice | 79 | 92 |

Results: Obituaries
Arizona Daily Star
Training set for tuning ontology: ~ 24
Test set: 90

| Attribute | Recall % | Precision % |
|---|---|---|
| DeceasedName* | 100 | 100 |
| Age | 86 | 98 |
| BirthDate | 96 | 96 |
| DeathDate | 84 | 99 |
| FuneralDate | 96 | 93 |
| FuneralAddress | 82 | 82 |
| FuneralTime | 92 | 87 |
| … | | |
| Relationship | 92 | 97 |
| RelativeName* | 95 | 74 |

*partial or full name
Cautions
• Ontology Creation and Tuning
  – Regular expressions (tool for experimentation)
  – Category specialization and cultural localization
• Record Separation
  – Web page has multiple records satisfying an ontology
  – (HTML) record separator exists
• Attribute-Value Pair Generation
  – Context-sensitive recognizable/categorizable constants
  – Topic switches within records
Conclusions
• Given an ontology and a Web page with multiple records, it is possible to extract and structure the data automatically.
• Record Separation Results: 100%
• Recall and Precision Results
  – Car Ads: ~ 94% recall and ~ 99% precision
  – Job Ads: ~ 84% recall and ~ 98% precision
  – Obituaries: ~ 90% recall and ~ 95% precision (except names: ~ 73% precision)
• Future Work
  – Find and categorize pages of interest.
  – Relax restrictions for record separation.
  – Strengthen heuristics for extraction.
  – Add richer conversions and additional constraints to data frames.
http://www.deg.byu.edu/