Conceptual-Model-Based Web Data Extraction by Example
Yuanqiu (Joe) Zhou
Data Extraction Group
Brigham Young University
Sponsored by NSF
Motivation
Data-rich Websites in abundance
Conceptual-Model-Based Methodology is resilient
“By Example” approach is user-friendly
“By Example” Approach
Web users specify desired information by creating a form
Users collect sample pages on the Web
An ontology generator learns the task by analyzing the form and the sample pages
Interactions may be needed to improve or complete the ontology
Architecture
Data Frame Libraries
User Created Form GUI
Sample Pages
Ontology Generator
Extraction Ontology
Extraction Engine
Target Pages
Populated Database
Sample Web Page and User Created Form
Digital Camera
Brand: Canon
Model: PowerShot G2
CCD Resolution: 4.0
Image Resolution: 2272 x 1074
Optical Zoom: 3
Digital Zoom: 2
Extraction Ontology
Relationship Set and Constraints
Extraction Patterns
Keywords
Context Expressions
Relationship Set and Constraints
Primary Object Name
Other Objects’ Names
Participation Constraints
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
DigitalCamera [0:1] has Model [1:*];
DigitalCamera [0:1] has CCDResolution [1:*];
DigitalCamera [0:1] has ImageResolution [1:*];
DigitalCamera [0:1] has OpticalZoom [1:*];
DigitalCamera [0:1] has DigitalZoom [1:*];
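The declarations on this slide follow a regular shape: one primary object set, plus one relationship set per form field with fixed participation constraints. A minimal Python sketch of how such declarations might be derived from the labels of a user-created form is given here. The function names are hypothetical, and the assumption that the ontology generator produces these lines mechanically from the form labels is ours, not something the slides state.

```python
import re


def object_set_name(label):
    """Turn a form label such as 'CCD Resolution' into 'CCDResolution'."""
    return "".join(re.findall(r"[A-Za-z0-9]+", label))


def relationship_declarations(primary, fields):
    """Build one declaration per form field, using the constraints shown on the slide."""
    primary_name = object_set_name(primary)
    lines = [f"{primary_name} [-> object];"]
    for field in fields:
        lines.append(f"{primary_name} [0:1] has {object_set_name(field)} [1:*];")
    return lines


# Field labels taken from the sample form in this deck.
form_fields = [
    "Brand",
    "Model",
    "CCD Resolution",
    "Image Resolution",
    "Optical Zoom",
    "Digital Zoom",
]
declarations = relationship_declarations("Digital Camera", form_fields)
print("\n".join(declarations))
```

Running the sketch prints the same seven declarations that appear on the slide.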
Extraction Patterns
From Data Frame Libraries
Data Frame Libraries: Lexicons, Synonym Dictionary, Regular Expressions
Extraction Patterns: Lexicons for Brand and Model; Regular Expressions for numbers and image resolution
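A lexicon-based pattern can be read as a list of literal alternatives. The following Python sketch illustrates the idea with the five brand names used in this deck's Brand rule. The names `BRAND_LEXICON` and `find_brands` are hypothetical, and the actual data frame library format is not shown in the slides.

```python
import re

# Brand names copied from the Brand rule in this deck; a real lexicon
# would presumably be much larger.
BRAND_LEXICON = ["Nikon", "Canon", "Olympus", "Minolta", "Sony"]

# One alternation with word boundaries, equivalent to trying each
# "\bName\b" pattern in turn.
BRAND_PATTERN = re.compile(
    r"\b(?:" + "|".join(re.escape(name) for name in BRAND_LEXICON) + r")\b"
)


def find_brands(text):
    """Return every lexicon entry that occurs in the text, in order of appearance."""
    return BRAND_PATTERN.findall(text)


sample = "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD"
print(find_brands(sample))
```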
Keywords
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
Features a high-quality 4.0 Megapixel Resolution CCD
The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD
3 effective megapixel
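The role of keywords in the CCDResolution rule can be sketched in Python: the extract pattern proposes candidate numbers, and a candidate is accepted only if one of the keywords occurs nearby. The 30-character window, the nearest-keyword tie-breaking, and the case-insensitive keyword matching are assumptions made for this sketch; the slides do not say how the extraction engine weighs keywords.

```python
import re

# Patterns copied from the CCDResolution rule.
EXTRACT = re.compile(r"\b\d(\.\d{1,2})?\b")
KEYWORDS = [
    re.compile(pattern, re.IGNORECASE)  # assumption: keywords ignore case
    for pattern in (r"\bMegapixel\b", r"\bCCD\b", r"\bResolution\b")
]
WINDOW = 30  # assumption: maximum character gap between value and keyword


def ccd_resolution(text):
    """Return the candidate number closest to a keyword, or None."""
    keyword_spans = [m.span() for k in KEYWORDS for m in k.finditer(text)]
    best = None  # (gap, value)
    for candidate in EXTRACT.finditer(text):
        for start, end in keyword_spans:
            if start >= candidate.end():
                gap = start - candidate.end()  # keyword follows the value
            else:
                gap = candidate.start() - end  # keyword precedes the value
            if 0 <= gap <= WINDOW and (best is None or gap < best[0]):
                best = (gap, candidate.group())
    return best[1] if best else None


for sentence in (
    "Features a high-quality 4.0 Megapixel Resolution CCD",
    "The new Nikon Coolpix 995 offers a boasting 3.34 Megapixel CCD",
    "3 effective megapixel",
):
    print(ccd_resolution(sentence))
```

Note that "995" is not a candidate, because the extract pattern allows only one digit before the optional decimal part.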
Context Expressions
3.5x optical zoom (2.5x digital)
a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom
optical 3X /digital 6X zoom
OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
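One way to read the OpticalZoom rule above: the context expression locates a zoom factor such as "3.5x", the extract pattern keeps only the numeric part, and the keyword separates optical from digital zoom. The Python sketch below follows that reading. Picking the context match nearest to the keyword and matching case-insensitively (so that "3X" and "Optical" are accepted) are assumptions of the sketch, not behavior documented in the slides.

```python
import re

# Patterns copied from the OpticalZoom rule.
CONTEXT = re.compile(r"\b\d(\.\d)?(x)\b", re.IGNORECASE)  # assumption: ignore case
EXTRACT = re.compile(r"\b\d(\.\d)?")
KEYWORD = re.compile(r"\boptical\b", re.IGNORECASE)  # assumption: ignore case


def optical_zoom(text):
    """Return the zoom factor whose context lies closest to the keyword, or None."""
    keyword_spans = [m.span() for m in KEYWORD.finditer(text)]
    best = None  # (gap, value)
    for ctx in CONTEXT.finditer(text):
        value = EXTRACT.search(ctx.group())  # numeric part inside the context
        if value is None:
            continue
        for start, end in keyword_spans:
            if start >= ctx.end():
                gap = start - ctx.end()  # keyword follows the context
            else:
                gap = ctx.start() - end  # keyword precedes the context
            if gap >= 0 and (best is None or gap < best[0]):
                best = (gap, value.group())
    return best[1] if best else None


for phrase in (
    "3.5x optical zoom (2.5x digital)",
    "a superior 4x Optical Zoom Nikkor lens, plus 4x stepless digital zoom",
    "optical 3X /digital 6X zoom",
):
    print(optical_zoom(phrase))
```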
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
  { extract "\bCanon\b"; },
  { extract "\bOlympus\b"; },
  { extract "\bMinolta\b"; },
  { extract "\bSony\b"; };
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
  { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant { extract "\b\d"; context "\b\d(x)\b"; };
  keyword "\boptical\b";
end;
Extraction Ontology
DigitalCamera [-> object];
DigitalCamera [0:1] has Brand [1:*];
Brand matches [10]
  constant { extract "\bNikon\b"; },
  { extract "\bCanon\b"; },
  { extract "\bOlympus\b"; },
  { extract "\bMinolta\b"; },
  { extract "\bSony\b"; };
end;
DigitalCamera [0:1] has CCDResolution [1:*];
CCDResolution matches [20]
  constant { extract "\b\d(\.\d{1,2})?\b"; };
  keyword "\bMegapixel\b", "\bCCD\b", "\bResolution\b";
end;
DigitalCamera [0:1] has ImageResolution [1:*];
ImageResolution matches [20]
  constant { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; },
  { extract "\b\d{4}(\s)?(x)(\s)?\d{4}\b"; };
  keyword "\bResolution\b", "\bImage\b";
end;
DigitalCamera [0:1] has OpticalZoom [1:*];
OpticalZoom matches [10]
  constant { extract "\b\d(\.\d)?"; context "\b\d(\.\d)?(x)\b"; };
  keyword "\boptical\b";
end;
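Two versions of the OpticalZoom rule appear in this deck: one with the context expression "\b\d(x)\b" and one with "\b\d(\.\d)?(x)\b". The Python sketch below compares them on one of the sample phrases. The helper `zoom_values` is hypothetical, case-insensitive context matching is an assumption, and the second call uses the extract pattern with the optional decimal group as written on the Context Expressions slide.

```python
import re


def zoom_values(context_pattern, extract_pattern, text):
    """Apply the extract pattern inside every match of the context pattern."""
    context = re.compile(context_pattern, re.IGNORECASE)  # assumption: ignore case
    extract = re.compile(extract_pattern)
    values = []
    for ctx in context.finditer(text):
        found = extract.search(ctx.group())
        if found:
            values.append(found.group())
    return values


text = "3.5x optical zoom (2.5x digital)"

# Single-digit context: only the trailing "5x" of each factor matches.
initial = zoom_values(r"\b\d(x)\b", r"\b\d", text)

# Context with an optional decimal part: the whole factor matches.
refined = zoom_values(r"\b\d(\.\d)?(x)\b", r"\b\d(\.\d)?", text)

print(initial)  # ['5', '5']
print(refined)  # ['3.5', '2.5']
```

Under these assumptions the single-digit rule silently returns a wrong value for decimal zoom factors, which is one plausible reason for the change between the two versions of the rule.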
Results (Same Site)
Results (Different Site)
Summary and Future Work
The example indicates that the approach is feasible
Some open questions need to be explored