web data extraction como2010
Post on 20-Oct-2014
![Page 3: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/3.jpg)
Web data extraction
WEB (HTML pages, layout) → WRAPPER → Corporate EDP apps (structured data, databases, XML)
Goal: Make web contents accessible to electronic data processing
Wrappers: HTML → select, extract, annotate → XML
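The select, extract, annotate pipeline of a wrapper can be illustrated with a minimal sketch. It uses only the Python standard library; the HTML snippet, the `price` and `location` class names, and the output element names are assumptions made for illustration and do not reflect how Lixto builds wrappers.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class ListingParser(HTMLParser):
    """Select <span> elements by class and extract their text."""

    def __init__(self):
        super().__init__()
        self.current = None   # class of the span being read, if any
        self.records = []     # one dict per <li> listing

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li":
            self.records.append({})
        elif tag == "span" and attrs.get("class") in ("price", "location"):
            self.current = attrs["class"]

    def handle_data(self, data):
        if self.current and self.records:
            self.records[-1][self.current] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.current = None


def wrap(html):
    """Annotate the extracted values as XML elements."""
    parser = ListingParser()
    parser.feed(html)
    root = ET.Element("listings")
    for record in parser.records:
        item = ET.SubElement(root, "listing")
        for name, value in record.items():
            ET.SubElement(item, name).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical input page fragment.
HTML = """
<ul>
  <li><span class="price">1,200</span> <span class="location">George Street</span></li>
  <li><span class="price">950</span> <span class="location">High Street</span></li>
</ul>
"""
print(wrap(HTML))
```

A hand-written wrapper like this is tied to one page layout, which is why maintaining thousands of them is costly.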
![Page 4: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/4.jpg)
![Page 5: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/5.jpg)
Lixto Visual Developer (VD)
Navigation Steps
Mozilla Web Browser
Extraction Configuration
![Page 6: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/6.jpg)
Need for Automatic Extraction Technology
Example: Real Estate UK
• 17,000 sites
• Many not covered by aggregators
• We do have a list of all homepages (Yellow Pages UK)
• Manual or semi-automatic wrapping too expensive:
  - wrapper construction
  - testing
  - keeping track of changes
• No tool or method can do it fully automatically.
Other domains: hospitals, restaurants, schools, travel agents, airlines, pharmaceutical companies, and retail companies such as supermarket chains…
![Page 7: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/7.jpg)
Need for Automatic Extraction Technology (2)
All search engine providers need it! Many work on it.
Keywords: Vertical search, object search, semantic search.
Raghu Ramakrishnan, Yahoo!, March 2009: “no one really has done this successfully at scale yet”
Alon Halevy, Google, Feb. 2009: “Current technologies are not good enough yet to provide what search engines really need. […] any successful approach would probably need a combination of knowledge and learning.”
![Page 8: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/8.jpg)
The Blackbox we want to construct
BLACKBOX
Input: URL from an application domain with thousands of websites
Output: application-relevant structured data (XML or RDF)
To achieve this, we plan to combine a host of annotators with a new knowledge-based approach.
![Page 9: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/9.jpg)
Real estate
Restaurants
Relationship to SeCo & Webdam
Q: Find apartments in Milan whose prices are average in quarters where restaurant quality > average.
Results
Web service A
Web service
![Page 10: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/10.jpg)
How to achieve it?
Rationale: Combine existing and new “low level” annotators with “high level” AI and reasoning.
Low level annotators:
- Bottom-up page analysis
- ML-based entity recognizers
- NLP & ontological text annotation
- Web page classification & analysis
- Basic link analysis
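As a rough illustration of what a low-level annotator produces, the sketch below tags text spans with labels using regular expressions. The three patterns and label names are hypothetical stand-ins; real entity recognizers would rely on machine learning and linguistic resources rather than hand-written patterns.

```python
import re

# Hypothetical annotator patterns, chosen only for this example.
ANNOTATORS = {
    "price": re.compile(r"(?:£|GBP\s?)\d[\d,]*"),
    "uk_postcode_area": re.compile(r"\b[A-Z]{1,2}\d{1,2}\b"),
    "bedrooms": re.compile(r"\b\d+\s+bed(?:room)?s?\b", re.IGNORECASE),
}


def annotate(text):
    """Return (label, start, end, matched_text) tuples sorted by position."""
    spans = []
    for label, pattern in ANNOTATORS.items():
        for match in pattern.finditer(text):
            spans.append((label, match.start(), match.end(), match.group()))
    return sorted(spans, key=lambda span: span[1])


for span in annotate("3 bedroom flat, George Street, OX1, £1,200 pcm"):
    print(span)
```

Each annotation is a plain fact about a span of the page, which is the kind of input a higher-level reasoning layer can work with.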
![Page 11: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/11.jpg)
High level reasoning:
- Goal oriented
- Conceptual domain objects
- Conceptual interaction elements
- High-level object ontology
- Domain knowledge
![Page 12: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/12.jpg)
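[Figure: DOM tree fragment of a property search form with numbered nodes. Nodes shown: <table>113, <tr>115, <td>119 (“I’m interested in”), <table>124 (radio buttons), <tr>125, <tr>126, <td>129 (“Buying”), <td>130 (“Renting”), <tr>134, <td>135 (“Maximum price”), <select>136, <option>137, <option>138, <td>139 (“GBP”), <td>140 (“EUR”).]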
![Page 13: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/13.jpg)
Bottom-up (low-level) annotation
[Figure: low-level annotations attached to page regions. Concepts: Monochromatic Rectangle, Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox. Relations: ISA, Occurs in. Node identifiers 105 and 127; bounding box [(02873,227)(03900,417)].]
![Page 14: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/14.jpg)
Top-down reasoning
Property Search Facility
Property List
Single Property Description
Specially highlighted property
part-of m1
![Page 15: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/15.jpg)
Bottom-up processing, top-down reasoning
[Figure: the bottom-up annotations (Monochromatic Rectangle, Geographic search facility, Postcode input field, Active map, Price search facility, Geo-Price-Searchbox with bounding box [(02873,227)(03900,417)]) shown together with the top-down concepts (Property Search Facility, Property List, Single Property Description, Specially highlighted property; part-of m1). The two levels are connected by the Phenomenology mapping.]
![Page 16: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/16.jpg)
Datalog for Web-Object Reasoning
table(T) & occurs_in(T,areaselection) & occurs_in(T,priceselection) → goodtable(T).
goodtable(T) & child(Parent,T) → containsgoodtable(Parent).
goodtable(T) & ¬containsgoodtable(T) → propertysearchmask(T).
If a table contains an area selection input field and a price selection field, both of which are not simultaneously contained in a smaller table, then this table is the property search mask.
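The three rules can be traced by hand on a toy page. The sketch below evaluates them directly in Python instead of using a Datalog engine. The node names and page structure are hypothetical, and two readings are assumed: `occurs_in` means "occurs anywhere inside the table", and "contains a good table" is taken over all descendants rather than direct children only, to match the English statement of the rule.

```python
# Hypothetical page: node -> parent (None for the root).
PARENT = {
    "table113": None,
    "table124": "table113",
    "select136": "table124",
    "input150": "table124",
    "img160": "table113",
}
TABLES = {"table113", "table124"}
# Bottom-up annotations: which node is which kind of input field.
ANNOTATION = {"input150": "areaselection", "select136": "priceselection"}


def ancestors(node):
    """Yield all proper ancestors of a node."""
    node = PARENT[node]
    while node is not None:
        yield node
        node = PARENT[node]


def occurs_in(table, kind):
    """occurs_in(T, kind): a node annotated `kind` lies inside table T."""
    return any(
        ANNOTATION.get(node) == kind and table in ancestors(node)
        for node in PARENT
    )


def good_tables():
    """table(T) & occurs_in(T, area) & occurs_in(T, price) -> goodtable(T)."""
    return {
        t for t in TABLES
        if occurs_in(t, "areaselection") and occurs_in(t, "priceselection")
    }


def property_search_masks():
    """goodtable(T) & not containsgoodtable(T) -> propertysearchmask(T)."""
    good = good_tables()
    contains_good = {a for t in good for a in ancestors(t)}
    return good - contains_good


print(sorted(good_tables()))
print(sorted(property_search_masks()))
```

Both tables contain the two fields, but only the inner one is reported as the search mask, because the outer table contains a smaller good table.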
![Page 17: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/17.jpg)
Crucial steps
• WP1 data model (KRR model)
• WP2 low & intermediate level annotation
• WP3 High level ontology and Rules (top down) + mapping HL to Int. Level: Phenomenology
• WP4 Access, interaction, & navigation
• WP5 Compilation; Learning XPath expressions
• WP6 Highly parallel execution on clouds
• WP7 General methodology
![Page 18: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/18.jpg)
Knowledge base (WP1)
• General web knowledge: HTML, CSS, script handling
• Domain-specific knowledge: rules, constraints, tasks, ontology
• Site-specific knowledge
Analysis phase (input: WWW)
• Factual knowledge extraction
• Bottom-up property extraction (WP2)
• Top-down pattern perception (WP3)
• Access, interaction, navigation (WP4)
Compilation phase (WP5)
• Extraction program build
• Optimization, parallelization
Runtime phase (WP6)
• Highly parallel extraction
• Maximize speed
• Improve consistency
• Use of elastic framework (cloud computing)
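The runtime phase relies on the fact that extraction from different pages is independent, so it can be spread over many workers. A minimal sketch of that idea follows, using a local thread pool as a stand-in for a cloud framework. The `extract` function, the URLs and the page contents are placeholders invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def extract(page):
    """Stand-in for a compiled extraction program run on one page."""
    url, html = page
    return {"url": url, "length": len(html)}


# Hypothetical pre-fetched pages; a real run would fetch them from the web.
PAGES = [
    ("http://example.org/a", "<html>flat A</html>"),
    ("http://example.org/b", "<html>house B</html>"),
    ("http://example.org/c", "<html>studio C</html>"),
]


def run_parallel(pages, workers=4):
    """Extract from all pages concurrently; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(extract, pages))


for record in run_parallel(PAGES):
    print(record)
```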
![Page 19: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/19.jpg)
![Page 20: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/20.jpg)
The Data Model
Datalog is good but does not suffice. On top of it:
• Need for object creation
• Need for ontological reasoning
• Need for probabilistic reasoning
• Need for default reasoning
![Page 21: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/21.jpg)
Object creation in Datalog+
table(T1) & table(T2) & sameColor(T1,T2) & isNeighbourRight(T1,T2) → ∃X (tablebox(X) & contains(X,T1) & contains(X,T2)).
Example: two adjacent tables.
PRODUCT: Toshiba Protégé cx, Dell 25416, Dell 23233, Acer 78987
PRICE: 480, 360, 470, 390
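The existential quantifier in the rule head is what creates a new object: for every matching pair of tables, a fresh identifier is invented for the enclosing table box. The sketch below applies the rule once to a set of facts. The `box1`, `box2`, … naming scheme and the direction of `isNeighbourRight` are assumptions made for illustration.

```python
from itertools import count


def create_tableboxes(tables, same_color, neighbour_right):
    """Apply the rule once: for each matching pair (T1, T2), invent a new X."""
    fresh = count(1)
    facts = []
    for t1 in tables:
        for t2 in tables:
            if (t1, t2) in same_color and (t1, t2) in neighbour_right:
                x = f"box{next(fresh)}"   # fresh object identifier
                facts.append(("tablebox", x))
                facts.append(("contains", x, t1))
                facts.append(("contains", x, t2))
    return facts


tables = ["T1", "T2"]
same_color = {("T1", "T2"), ("T2", "T1")}
# Assumed reading: T2 is the right neighbour of T1.
neighbour_right = {("T1", "T2")}

for fact in create_tableboxes(tables, same_color, neighbour_right):
    print(fact)
```

Applying such rules repeatedly can keep inventing new objects, which is where termination and decidability become a concern.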
![Page 22: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/22.jpg)
In the example above, T1 is the PRODUCT table and T2 is the PRICE table.
![Page 23: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/23.jpg)
![Page 24: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/24.jpg)
Deduction in Datalog+ undecidable (TGDs)
![Page 25: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/25.jpg)
Datalog±: require guardedness of rule bodies. Decidable, linear-time data complexity.
![Page 26: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/26.jpg)
Datalog±
Family of languages.
Incorporates ontological reasoning (>DL-LITE)
Further research is needed to extend it into an ideal language for web objects.
Transitivity:
containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3)
![Page 27: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/27.jpg)
The transitivity rule containedin(T1,T2), containedin(T2,T3) → containedin(T1,T3) is unguarded!
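A rule body is guarded when a single body atom contains all variables that occur in the body. This syntactic check is easy to sketch. The representation of atoms as `(predicate, arguments)` pairs, and the convention that variables start with an upper-case letter, are assumptions of this example.

```python
def variables(atom):
    """Variables are the arguments that start with an upper-case letter."""
    _, args = atom
    return {a for a in args if a[:1].isupper()}


def is_guarded(body):
    """A rule body is guarded if one atom contains every body variable."""
    all_vars = set().union(*(variables(atom) for atom in body))
    return any(variables(atom) == all_vars for atom in body)


transitivity = [("containedin", ("T1", "T2")), ("containedin", ("T2", "T3"))]
good_table = [("goodtable", ("T",)), ("child", ("Parent", "T"))]

print(is_guarded(transitivity))   # False: no atom holds T1, T2 and T3 together
print(is_guarded(good_table))     # True: child(Parent, T) is the guard
```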
![Page 28: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/28.jpg)
DL-LITE

| DL-LITE | Datalog±[ ,;Lin] |
|---|---|
| Professor ⊑ ∃TeachesTo | Professor(x) → ∃y TeachesTo(x,y) |
| ∃TeachesTo⁻ ⊑ Student | TeachesTo(x,y) → Student(y) |
| HasTutor⁻ ⊑ TeachesTo | HasTutor(x,y) → TeachesTo(y,x) |
| funct(HasTutor) | HasTutor(x,y) & HasTutor(x,y’) & Neq(y,y’) → ⊥ (always innocuous!) |
| Professor ⊑ ¬Student | Professor(x) & Student(x) → ⊥ |

Fragments: DL-Lite_core, DL-Lite_R, DL-Lite_F
![Page 29: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/29.jpg)
![Page 30: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/30.jpg)
Low & Intermediate Level Annotation
We will use various existing tools and techniques (rather than re-invent the wheel):
• Named entity recognizers
• Machine learning
• Computational linguistics
• Page layout analysis
• PDF extraction
![Page 31: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/31.jpg)
![Page 32: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/32.jpg)
![Page 33: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/33.jpg)
![Page 34: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/34.jpg)
Extraction from PDF
Tamir Hassan
![Page 35: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/35.jpg)
![Page 36: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/36.jpg)
Navigation & Interaction
![Page 37: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/37.jpg)
![Page 38: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/38.jpg)
![Page 39: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/39.jpg)
OXPath
• Extension of XPath
• Facilitates querying web forms and retrieving returned data
• Simulates a user filling out web forms
• Highly parallelizable (geared towards cloud computing)
• Navigation and collecting data across multiple pages
![Page 40: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/40.jpg)
Result Extraction
..../next-field::*/{“Renting”}/.../{...}/.../{“Submit”}
Atomic results regardless of presentation (list, table, etc.)
/<XQ>
![Page 41: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/41.jpg)
Result Extraction
<XQ>: For each atomic result A
Let
    price = A/.../.../text()
    description = A/.../.../../text()
    type = A/.../.../../text()
    ........
Return
    <rental area="Oxford">
        <price> 1,200 </price>
        <bedrooms> 3 </bedrooms>
        <bathrooms> 1 </bathrooms>
        <type> Flat </type>
        <location> George Street, OX1 </location>
        <description> ... </description>
        <otherInfo> Furnished; Long let - more than six months </otherInfo>
        ...
    </rental>
(Callouts on the slide mark the price, description, type, other info, bathrooms and location fields of a result.)
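The "for each atomic result, bind fields, return a record" pattern can be imitated with the XPath subset in Python's standard library. This is a sketch of the pattern only, not OXPath or XQuery. The result page markup, the class names and the `FIELDS` list are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical result page; element names and classes are assumptions.
RESULT_PAGE = """
<results>
  <div class="result">
    <span class="price">1,200</span>
    <span class="bedrooms">3</span>
    <span class="type">Flat</span>
    <span class="location">George Street, OX1</span>
  </div>
  <div class="result">
    <span class="price">950</span>
    <span class="bedrooms">2</span>
    <span class="type">House</span>
    <span class="location">High Street, OX1</span>
  </div>
</results>
"""

FIELDS = ["price", "bedrooms", "type", "location"]


def extract_rentals(page, area):
    """For each atomic result A, bind the fields and build a <rental> element."""
    root = ET.fromstring(page)
    rentals = []
    for a in root.findall(".//div[@class='result']"):
        rental = ET.Element("rental", area=area)
        for field in FIELDS:
            value = a.findtext(f"span[@class='{field}']")
            if value is not None:
                ET.SubElement(rental, field).text = value.strip()
        rentals.append(rental)
    return rentals


for rental in extract_rentals(RESULT_PAGE, "Oxford"):
    print(ET.tostring(rental, encoding="unicode"))
```

The relative paths are evaluated against each atomic result, so the same bindings apply whether results are laid out as a list or as table rows, provided the atomic results have been identified first.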
![Page 42: Web Data Extraction Como2010](https://reader035.vdocuments.mx/reader035/viewer/2022062613/5444f015afaf9fef2b8b45f1/html5/thumbnails/42.jpg)