Post on 08-Jan-2016
Data Extraction and Integration from Imprecise Web Sources
Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti
Università degli Studi Roma Tre
(Creative Commons License, see last slide)
Data-intensive websites
[Figure: Website, Database, Template1, Template2, Template3]
Flint goal
[Figure: Flint; target …StockQuote with attributes Last, Min, Max, Volume, 52high, Open]
System architecture
WebSearch [WIDM08]
Data Extraction
Data Integration
The Web
Novel contribution
• Unsupervised
• Automatic
• Scalable
• No knowledge available
Data Extraction
RoadRunner [Vldb01]
ExAlg [Sigmod03]
TurboWrapper [Vldb07]
• Unsupervised
• Automatic
• Scalable
• Uncertain Data
• No labels available
• No corpus available
Data Integration
WebTables [Vldb08]
Cimple [Vldb07]
MetaQuerier [Cidr05]
PayGo [Cidr07]
Data Extraction
AAPL, GOOG, MSFT, INTC, …
128.09, 439.54, 34.89, 112.37, …
127.81, 439.25, 32.13, 111.01, …
132.43, 443.82, 33.67, 114.32, …
0.50%, -0.38%, 1.23%, 3.92%, -1.65%, …
Add AAPL to Your Portfolio, Add GOOG to Your Portfolio, Add MSFT to Your Portfolio,
Add INTC to Your Portfolio, …
…
Data Extraction
HTML fragments taken from two pages belonging to the same website:
XPath rule                      Page 1       Page 2        Note
/html/body/table/tr[1]/td[2]    1,132,228    1,735,857
/html/body/table/tr[2]/td[2]    $20.66       $414.58
/html/body/table/tr[3]/td[2]    $11.70       $247.30
/html/body/table/tr[4]/td[2]    $20.72       $414.06
/html/body/table/tr[5]/td[2]    $0.02        99,494,200    Extraction error!
/html/body/table/tr[6]/td[2]    4,732,600    null          ?
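The misalignment above can be reproduced with a minimal sketch. It assumes two well-formed pages in which one row is missing from the second page; the row labels, the page contents, and the helper `apply_rule` are hypothetical and are not part of Flint. Paths are written relative to the `<html>` root because Python's `xml.etree.ElementTree` supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Two hypothetical pages from the same site. The row labels are invented for
# illustration; page B lacks the "Change" row, so later rows shift up by one.
PAGE_A = """<html><body><table>
<tr><td>Volume</td><td>1,132,228</td></tr>
<tr><td>Last</td><td>$20.66</td></tr>
<tr><td>Change</td><td>$0.02</td></tr>
<tr><td>Avg Volume</td><td>4,732,600</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
<tr><td>Volume</td><td>1,735,857</td></tr>
<tr><td>Last</td><td>$414.58</td></tr>
<tr><td>Avg Volume</td><td>99,494,200</td></tr>
</table></body></html>"""


def apply_rule(page_xml, rule):
    """Apply a positional path (relative to <html>) and return the cell text."""
    root = ET.fromstring(page_xml)
    node = root.find(rule)
    return node.text if node is not None else None


# The same positional rule is applied to both pages.
for rule in ("./body/table/tr[1]/td[2]",
             "./body/table/tr[3]/td[2]",
             "./body/table/tr[4]/td[2]"):
    print(rule, "->", apply_rule(PAGE_A, rule), ",", apply_rule(PAGE_B, rule))
```

With these pages the rule for the third row pairs a dollar amount with a volume, and the rule for the fourth row finds nothing on the second page, which is the kind of imprecision the slide points at.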
Data Integration
[Figure: columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS)]
Data Integration
[Figure: columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5, t=0.5]
Data Integration
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5, t=0.5; scores 1.0, 1.0, 1.0]
Data Integration
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5, t=0.5]
Data Integration
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); scores 0.6, 1.0, 1.0]
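A minimal sketch of threshold-based column matching, assuming that the t labels are matching thresholds and that the scores are column similarities. The similarity measure below (one minus the mean symmetric relative difference) is an assumption chosen for illustration; it is not the measure used by Flint and does not reproduce the 0.6 shown on the slide. The functions `similarity` and `matches` are hypothetical names.

```python
# Column values as read from the slide figure, aligned by stock (AA, GO, MS).
MAX = [10, 33, 16]
MIN = [4, 25, 10]
PRICE = [6, 26, 12]


def similarity(col_a, col_b):
    """Assumed measure: 1 minus the mean symmetric relative difference."""
    diffs = []
    for a, b in zip(col_a, col_b):
        total = abs(a) + abs(b)
        # Two zeros are treated as identical values.
        diffs.append(0.0 if total == 0 else abs(a - b) / (total / 2))
    return max(0.0, 1.0 - sum(diffs) / len(diffs))


def matches(col_a, col_b, threshold):
    """Two columns match when their similarity reaches the threshold."""
    return similarity(col_a, col_b) >= threshold


print(similarity(MAX, MAX))        # identical columns
print(similarity(PRICE, MAX))      # similar but different columns
print(matches(PRICE, MAX, 0.5), matches(PRICE, MAX, 0.7))
```

Under this assumed measure the price column is accepted as a match for max at a threshold of 0.5 and rejected at 0.7, which illustrates why the choice of threshold matters when two attributes carry close but distinct values.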
Data Integration
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); scores ?, 1.0, 1.0]
Data Integration
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); score 1.0; labels t=0.7, t=0.7]
Data Integration
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); labels t=0.7, t=0.7]
Wrapper Refinement
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); a column min/max (10, null, 10) marked "? ?"; scores 0.3 (weak), 0.3 (weak), 0.0, 0.0]
Wrapper Refinement
[Figure labels: matching value; nearby template tokens; t=0.7, t=0.7]
//td[contains(text(),'Open')]/../td[2]
//td[contains(text(),'Open')]/../../tr[5]/td[1]
//td[contains(text(),'Open')]/../../tr[5]/td[2]
//td[contains(text(),'High')]/../td[2]
…
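Rules of the form //td[contains(text(),'Max')]/../td[2] locate a value through a nearby template token instead of a fixed row position. The sketch below imitates that behaviour in plain Python, since `xml.etree.ElementTree` does not implement `contains()`. The pages, the Open value 7, and the helper `extract_by_label` are hypothetical; only the Max and Min values come from the slide figure.

```python
import xml.etree.ElementTree as ET

# Hypothetical pages: page B has no "Open" row, so row positions differ.
PAGE_A = """<html><body><table>
<tr><td>Open</td><td>7</td></tr>
<tr><td>Max</td><td>10</td></tr>
<tr><td>Min</td><td>4</td></tr>
</table></body></html>"""

PAGE_B = """<html><body><table>
<tr><td>Max</td><td>33</td></tr>
<tr><td>Min</td><td>25</td></tr>
</table></body></html>"""


def extract_by_label(page_xml, label):
    """Return the second cell of the first row in which some cell contains label."""
    root = ET.fromstring(page_xml)
    for row in root.iter("tr"):
        cells = row.findall("td")
        if len(cells) >= 2 and any(label in (c.text or "") for c in cells):
            return cells[1].text
    return None


for label in ("Max", "Min"):
    print(label, "->", extract_by_label(PAGE_A, label), ",",
          extract_by_label(PAGE_B, label))
```

Because the lookup is anchored on the label text, both pages yield the Max and Min values even though the rows sit at different positions, whereas a positional rule for the second row would mix Max from one page with Min from the other.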
Wrapper Refinement
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); a column min/max (10, null, 10) replaced by columns max (10, 33, 16) and min (4, 25, 10); scores 1.0, 1.0; labels t=0.7, t=0.7]
//td[contains(text(),'Max')]/../td[2]
//td[contains(text(),'Min')]/../td[2]
Wrapper Refinement
[Figure: two sets of columns max (10, 33, 16), min (4, 25, 10), stock (AA, GO, MS); labels t=0.5, t=0.5; a further set of columns price (6, 26, 12), min (4, 25, 10), stock (AA, GO, MS); a column min/max (10, null, 10) and the refined columns max (10, 33, 16), min (4, 25, 10)]
Experimental Results (100 websites for each domain)
Soccer domain (45,714 pages)
Attribute |m|
• Name 90
• Birth Date 61
• Height 54
• Nationality 48
• Club 43
• Position 43
• Weight 34
• League 14
Videogame domain (49,262 pages)
Attribute |m|
• Title 86
• Publisher 59
• Developer 45
• Genre 28
• ESRB rating 40
• Release Date 9
• Platform 9
• # Players 6
Finance domain (57,623 pages)
Attribute |m|
• Stock Symbol 84
• Price Change 73
• % Change 73
• Volume 52
• Day Low 43
• Day High 41
• Last Price 29
• Open Price 24
Demo
• Found Websites
• Integrated Data
the end!
http://flint.dia.uniroma3.it
License
• This work is licensed under the Creative Commons Attribution-ShareAlike License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/1.0/ or send a letter to Creative Commons, 559 Nathan Abbott Way, Stanford, California 94305, USA.