Integration of Friendly Data Islands on the Web. Information Extraction.

Upload: shaeleigh-fuller

Post on 03-Jan-2016


Page 1: Integration of Friendly Data Islands on the Web. Information Extraction

Integration of Friendly Data Islands on the Web.

Information Extraction.

Page 2: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 3: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 4: Integration of Friendly Data Islands on the Web. Information Extraction

The theory

• A wrapper is a building block that provides an ad-hoc, message-based API to an app
• Wrappers interface apps at one or more layers but, more often than not, they must deal with the user interface or the data layer

Application layers:
User Interface
Controller
Business Logic
Data Access Layer
Data Layer

Page 5: Integration of Friendly Data Islands on the Web. Information Extraction

The problem

The Da Vinci Code

Buy

Dan Brown
Doubleday, 2006
15.95 €

Robert Langdon is a Harvard Professor of Symbology…

Page 6: Integration of Friendly Data Islands on the Web. Information Extraction

Features of current web documents

• Trillions of documents
• Generated on demand by software applications
• Change continuously
• Require navigation from search forms
• Written in telegraphic language
• Formatted according to HTML templates

Page 7: Integration of Friendly Data Islands on the Web. Information Extraction

The solution

Page 8: Integration of Friendly Data Islands on the Web. Information Extraction

Wrapping in a nutshell

• Goals
  – Endow data islands with APIs
  – Ease implementing software applications
• Implications
  – Form filling
  – Navigation
  – Info extraction
  – “Ontologisation”

Page 9: Integration of Friendly Data Islands on the Web. Information Extraction

Look out!

• Information extraction has driven most research efforts
• Few wrapping systems are complete
• Wrapping is usually mistaken for information extraction
• This talk is about engineering information extraction for enabling information integration

Page 10: Integration of Friendly Data Islands on the Web. Information Extraction

How IE works

Document + Extraction rules → Information extractor → Attributes

Attributes: The Da Vinci Code; Dan Brown; Doubleday; 2006; 15.95 €; Robert Langdon…

Attributes + Templating/Ontologisation rules → Templates or Ontology instances

Template example:
Message ID: MUC-0001
Message Template: Court resolution
Date of Event: April 30, 2007
Charge: Terrorist attack
Perpetrator: Salahuddin Amin
Perpetrator: Anthony Garcia
Perpetrator: Waheed Mahmood
Perpetrator: Omar Khyam
…

Ontology instances: P1, A1, B1

Page 11: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Side by side comparison
• Conclusions

Page 12: Integration of Friendly Data Islands on the Web. Information Extraction

Running example

Page 13: Integration of Friendly Data Islands on the Web. Information Extraction

Running example

<!-- Sample #1 -->
<html><body>
  <b>Book name:</b> Ontologies <br/>
  <b>Reviews:</b> <br/>
  <ul>
    <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li>
    <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li>
  </ul>
</body></html>

<!-- Sample #2 -->
<html><body>
  <b>Book name:</b> SPARQL in action <br/>
  <b>Reviews:</b> <br/>
  <ul>
    <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li>
  </ul>
</body></html>

<!-- Sample #3 -->
<html><body>
  <b>Book name:</b> W4F explained <br/>
  <b>Reviews:</b> <br/>
  <ul> </ul>
</body></html>

Page 14: Integration of Friendly Data Islands on the Web. Information Extraction

Kinds of extraction rules

• Regular expressions
• First-order logic rules
• Pointers into DOM tree
• Context-free grammars
• Tag trees

Page 15: Integration of Friendly Data Islands on the Web. Information Extraction

Regular expressions

TSIMMIS

[Root, get("page.html"), "#"]
[BookReview, Root, "<body>#</body>"]
[BookName, BookReview, "</b>#<br/>"]
[Tmp, Root, "<ul>#</ul>"]
[Reviews, Tmp, "split(Tmp, '<li>')"]
[ReviewerNames, Reviews, "Reviewer:</b>#<br/>"]
[Ratings, Reviews, "Rating:</b>#<br/>"]
[Text, Reviews, "Text:</b>#</li>"]

RoadRunner

$FileName<html><body> <b>Book name:</b> $BookTitle <br/> <b>Reviews:</b> <br/> <ul> (( <li> <b>Reviewer:</b> $ReviewerName <br/> <b>Rating:</b> $Rating <br/> <b>Text:</b> $Text </li> )+)? </ul></body></html>
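The RoadRunner-style template above can be read as an ordinary regular expression with capture groups. The following is a minimal Python sketch under that reading, not the syntax of TSIMMIS or RoadRunner themselves; the names BOOK, REVIEW and extract are hypothetical, and the page is Sample #1 of the running example without its leading comment.

```python
import re

# Sample #1 of the running example (leading comment omitted).
SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> "
    "</ul></body></html>"
)
# Sample #3: a book without reviews.
SAMPLE_3 = (
    "<html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> "
    "<ul> </ul></body></html>"
)

# One pattern for the single-valued part of the template ...
BOOK = re.compile(r"<b>Book name:</b>\s*(?P<title>.*?)\s*<br/>", re.S)
# ... and one for the repeated (optional) <li> record.
REVIEW = re.compile(
    r"<li>\s*<b>Reviewer:</b>\s*(?P<reviewer>.*?)\s*<br/>"
    r"\s*<b>Rating:</b>\s*(?P<rating>.*?)\s*<br/>"
    r"\s*<b>Text:</b>\s*(?P<text>.*?)\s*</li>",
    re.S,
)


def extract(page):
    """Return the book title and the list of review records found in page."""
    title = BOOK.search(page).group("title")
    reviews = [m.groupdict() for m in REVIEW.finditer(page)]
    return {"title": title, "reviews": reviews}


if __name__ == "__main__":
    print(extract(SAMPLE_1))
    print(extract(SAMPLE_3))
```

Splitting the template into a page-level pattern and a record-level pattern plays the role of the "(( … )+)?" group: finditer returns zero, one or many records.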

Page 16: Integration of Friendly Data Islands on the Web. Information Extraction

First-order logic rules

SRV

bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>").
reviewerName(X) :- prev(X, "name:</b>"), next(X, "<br/>"), !bookTitle(X).
rating(X) :- isNatural(X), length(X, 1), inList(X).
text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
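Rules of this kind can be read as boolean tests on a token and its neighbours. The sketch below is one possible Python reading of the bookTitle, rating and text rules, not SRV's actual engine; tokenize, the rule functions and the interpretation of inList as "between <ul> and </ul>" are assumptions made for illustration, and reviewerName is left out.

```python
import re

# Sample #1 of the running example (leading comment omitted).
SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> "
    "</ul></body></html>"
)


def tokenize(page):
    """Split a page into tag tokens and text tokens, dropping whitespace."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def book_title(tokens, i):
    # bookTitle(X) :- prev(X, "Book name:</b>"), next(X, "<br/>").
    return tokens[max(i - 2, 0):i] == ["Book name:", "</b>"] and tokens[i + 1:i + 2] == ["<br/>"]


def rating(tokens, i):
    # rating(X) :- isNatural(X), length(X, 1), inList(X).
    in_list = "<ul>" in tokens[:i] and "</ul>" in tokens[i + 1:]
    return tokens[i].isdigit() and len(tokens[i]) == 1 and in_list


def text(tokens, i):
    # text(X) :- prev(X, "Text:</b>"), next(X, "</li>").
    return tokens[max(i - 2, 0):i] == ["Text:", "</b>"] and tokens[i + 1:i + 2] == ["</li>"]


RULES = {"bookTitle": book_title, "rating": rating, "text": text}


def extract(page):
    """Apply every rule to every token position and collect the matches."""
    tokens = tokenize(page)
    return {
        name: [tokens[i] for i in range(len(tokens)) if rule(tokens, i)]
        for name, rule in RULES.items()
    }


if __name__ == "__main__":
    print(extract(SAMPLE_1))
```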

Page 17: Integration of Friendly Data Islands on the Web. Information Extraction

Pointer into the DOM tree

WebOQL

select x’.Text, y’.Text, y’’’’.Text, y’’’’’’’.Text
from x, y in browse("page.html")
where x.Text = "Book name:" and y.Text = "Reviewer:"

Page 18: Integration of Friendly Data Islands on the Web. Information Extraction

Context-free grammars

Minerva

Page ::= $FileName <html><body> Review </body></html>
Review ::= <b>Book name:</b> $BookName <br/> <b>Reviews:</b> <br/> <ul> (<li> Reviewer Rating Text </li>)* </ul>
Reviewer ::= <b>Reviewer:</b> $Reviewer <br/>
Rating ::= <b>Rating:</b> $Rating <br/>
Text ::= <b>Text:</b> $Text

Page 19: Integration of Friendly Data Islands on the Web. Information Extraction

Tag trees

DEPTA

Tag tree of a record: li with children b, br, b, br, b
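To make the tag-tree view concrete, the sketch below lists the child tags of every <li> in Sample #1 using Python's standard html.parser module. It only illustrates how each record reduces to the same small tree; it is not DEPTA's algorithm, it ignores nested lists, and the names RecordTagTrees and tag_trees are hypothetical.

```python
from html.parser import HTMLParser

# Sample #1 of the running example (leading comment omitted).
SAMPLE_1 = (
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> "
    "</ul></body></html>"
)


class RecordTagTrees(HTMLParser):
    """Collects, for every <li>, the tags of its direct children."""

    def __init__(self):
        super().__init__()
        self.trees = []    # one list of child tags per <li>
        self.depth = None  # nesting depth inside the current <li>, None outside

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.trees.append([])
            self.depth = 0
        elif self.depth is not None:
            if self.depth == 0:          # direct child of the <li>
                self.trees[-1].append(tag)
            self.depth += 1

    def handle_endtag(self, tag):
        if self.depth is None:
            return
        if tag == "li":
            self.depth = None
        else:
            self.depth -= 1


def tag_trees(page):
    """Return the child-tag list of each <li> record in page."""
    parser = RecordTagTrees()
    parser.feed(page)
    parser.close()
    return parser.trees


if __name__ == "__main__":
    print(tag_trees(SAMPLE_1))
```

Both records yield the same child sequence, which is the regularity a tag-tree technique looks for when it groups sibling subtrees into data records.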

Page 20: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 21: Integration of Friendly Data Islands on the Web. Information Extraction

Classification

• Hand-crafted
• Supervised induction
• Little-supervised induction
• Unsupervised induction

Page 22: Integration of Friendly Data Islands on the Web. Information Extraction

Hand-crafted

The pattern to extract the title is “…”

• Techniques
  – Natural intelligence
• Systems
  – TSIMMIS
  – Minerva
  – WebOQL
  – W4F
  – XWrap

Page 23: Integration of Friendly Data Islands on the Web. Information Extraction

Supervised induction

• Techniques
  – Bottom-up ILP
  – Top-down ILP
  – Ad-hoc algorithms
• Systems
  – SRV
  – RAPIER
  – WIEN
  – WHISK
  – NoDoSE
  – SoftMealy
  – STALKER
  – DEByE

Workflow: Raw documents → Labelled documents → Automated induction

Page 24: Integration of Friendly Data Islands on the Web. Information Extraction

Little-supervised induction

• Techniques
  – String alignment
  – Tree alignment
• Systems
  – OLERA
  – Thresher

Workflow: Raw document → Record and attribute labelling → Automated induction

Page 25: Integration of Friendly Data Islands on the Web. Information Extraction

Unsupervised induction

• Techniques
  – String alignment
  – Tree alignment
  – Statistical roles
• Systems
  – DeLa
  – RoadRunner
  – EXALG
  – DEPTA
  – IEPAD

Workflow: Raw documents → Automated induction → Pattern interpretation

Page 26: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 27: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
  – RoadRunner
  – SRV
• Conclusions

Page 28: Integration of Friendly Data Islands on the Web. Information Extraction

Token matching

<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/>… </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/>… </li> </ul></body></html>

<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/>… </li> </ul></body></html>

String mismatch

$1

Page 29: Integration of Friendly Data Islands on the Web. Information Extraction

...and matching…

Tag match

$1<html>

Page 30: Integration of Friendly Data Islands on the Web. Information Extraction

...and matching…

Tag match

$1<html><body>

Page 31: Integration of Friendly Data Islands on the Web. Information Extraction

...and matching…

Tag match, string match, …

$1<html><body> <b>Book name:</b>

Page 32: Integration of Friendly Data Islands on the Web. Information Extraction

...and matching…

String mismatch, tag match

$1<html><body> <b>Book name:</b> $2 <br/>

Page 33: Integration of Friendly Data Islands on the Web. Information Extraction

...and matching…

$1<html><body> <b>Book name:</b> $2 <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

Page 34: Integration of Friendly Data Islands on the Web. Information Extraction

Stop: lists and optionals

Tag mismatch

$1<html><body> <b>Book name:</b> $2 <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>
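The token-by-token matching walked through on these slides can be sketched in a few lines of Python. This is a simplified illustration, not RoadRunner itself: it turns every string mismatch into a numbered field and simply stops at the first tag mismatch, where the real system would go on to look for lists and optionals. The names tokenize, is_tag and match_prefix are hypothetical, and treating the leading comment as a string token is an assumption made so that it becomes $1 as on the slides.

```python
import re

SAMPLE_1 = (
    "<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> "
    "</ul></body></html>"
)
SAMPLE_2 = (
    "<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> "
    "</ul></body></html>"
)

TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|[^<]+", re.S)


def tokenize(page):
    """Split a page into comments, tags and text strings, dropping whitespace."""
    return [t.strip() for t in TOKEN.findall(page) if t.strip()]


def is_tag(token):
    """Comments are treated as strings, everything else in <...> as a tag."""
    return token.startswith("<") and not token.startswith("<!--")


def match_prefix(page_a, page_b):
    """Generalise two pages token by token until the first tag mismatch."""
    wrapper, fields = [], 0
    for a, b in zip(tokenize(page_a), tokenize(page_b)):
        if a == b:                                # tag match or string match
            wrapper.append(a)
        elif not is_tag(a) and not is_tag(b):     # string mismatch: new field
            fields += 1
            wrapper.append(f"${fields}")
        else:                                     # tag mismatch: stop here
            break
    return wrapper


if __name__ == "__main__":
    print(" ".join(match_prefix(SAMPLE_1, SAMPLE_2)))
```

On the two samples the loop stops right after the first </li>, where Sample #1 opens a second <li> and Sample #2 closes the <ul>.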

Page 35: Integration of Friendly Data Islands on the Web. Information Extraction

Stop: lists and optionals

$1<html><body> <b>Book name:</b> $2 <br/> <ul> <li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>

Page 36: Integration of Friendly Data Islands on the Web. Information Extraction

Stop: lists and optionals

$1<html><body> <b>Book name:</b> $2 <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+

Page 37: Integration of Friendly Data Islands on the Web. Information Extraction

…and matching finishes

$1<html><body> <b>Book name:</b> $2 <br/> <ul> (<li> <b>Reviewer:</b> $3 <br/> <b>Rating:</b> $4 <br/> <b>Text:</b> $5 </li>)+ </ul></body></html>

Page 38: Integration of Friendly Data Islands on the Web. Information Extraction

Just union-free grammars!

Page 39: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
  – RoadRunner
  – SRV
• Conclusions

Page 40: Integration of Friendly Data Islands on the Web. Information Extraction

Exercise

• Support predicates: next(x,y), previous(x,y)
• Try to explain isCorD(X)

abcabdabbbcaabda

Page 41: Integration of Friendly Data Islands on the Web. Information Extraction

Exercise

• Support predicates: next(x,y), previous(x,y)
• Now, try to explain isCorDorE(X)

abcabdabeebbcaabdaee

Page 42: Integration of Friendly Data Islands on the Web. Information Extraction

Define target predicates

Target Predicates

title: #PCDATA.
reviewer: #PCDATA.
rating: #PCDATA.
text: #PCDATA.

Page 43: Integration of Friendly Data Islands on the Web. Information Extraction

Instantiate target predicates

<!-- Sample #1 --><html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> <li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> </ul></body></html>

<!-- Sample #2 --><html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> <li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> </ul></body></html>

<!-- Sample #3 --><html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> <ul> </ul></body></html>

Page 44: Integration of Friendly Data Islands on the Web. Information Extraction

Instantiate target predicates

Positive Samples

title("Ontologies").
title("SPARQL in action").
title("W4F explained").
reviewer("John Doe").
reviewer("Alan Wohl").
reviewer("Dan Smith").
rating("7").
rating("8").
rating("9").
text("blah, blah").
text("yeah, yeah").
text("cough, cough").

Negative Samples

!title("Book name:").
!reviewer("Book name:").
!rating("Book name:").
!text("Book name:").
!title("Reviews:").
!reviewer("Reviews:").
!rating("Reviews:").
!text("Reviews:").
!title("Reviewer:").
!reviewer("Reviewer:").
!rating("Reviewer:").
!text("Reviewer:").
!title("Rating:").
!reviewer("Rating:").
!rating("Rating:").

Page 45: Integration of Friendly Data Islands on the Web. Information Extraction

Define support predicates

Support Predicates

prev: #PCDATA, #PCDATA.
next: #PCDATA, #PCDATA.
length: #PCDATA, #PCDATA.
isNatural: #PCDATA.

Page 46: Integration of Friendly Data Islands on the Web. Information Extraction

Instantiate support predicates

On Positive Samples

prev("Ontologies", "</b>").
next("Ontologies", "<br/>").
length("Ontologies", 10).
!isNatural("Ontologies").
prev("SPARQL in action", "</b>").
next("SPARQL in action", "<br/>").
length("SPARQL in action", 16).
!isNatural("SPARQL in action").
prev("W4F explained", "</b>").
next("W4F explained", "<br/>").
length("W4F explained", 13).
!isNatural("W4F explained").

On Negative Samples

prev("Book name:", "<b>").
next("Book name:", "</b>").
length("Book name:", 10).
!isNatural("Book name:").
prev("Reviews:", "<b>").
next("Reviews:", "</b>").
!isNatural("Reviews:").
prev("Reviewer:", "<b>").
next("Reviewer:", "</b>").
!isNatural("Reviewer:").
prev("Rating:", "<b>").
next("Rating:", "</b>").
!isNatural("Rating:").
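Facts like the ones listed on this slide can be derived mechanically from a tokenised page. The sketch below is one way to do it in Python, assuming a simple tag/text tokenizer; tokenize and support_facts are hypothetical names, and the page is Sample #3 of the running example without its leading comment.

```python
import re

# Sample #3 of the running example (leading comment omitted).
SAMPLE_3 = (
    "<html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> "
    "<ul> </ul></body></html>"
)


def tokenize(page):
    """Split a page into tag tokens and text tokens, dropping whitespace."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def support_facts(page):
    """Instantiate prev, next, length and isNatural for every text token."""
    tokens = tokenize(page)
    facts = []
    for i, token in enumerate(tokens):
        if token.startswith("<"):
            continue  # only text tokens are candidate values
        facts.append({
            "x": token,
            "prev": tokens[i - 1] if i > 0 else None,
            "next": tokens[i + 1] if i + 1 < len(tokens) else None,
            "length": len(token),
            "isNatural": token.isdigit(),
        })
    return facts


if __name__ == "__main__":
    for fact in support_facts(SAMPLE_3):
        print(fact)
```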

Page 47: Integration of Friendly Data Islands on the Web. Information Extraction

Top-down induction

title(X) :- . (3, 14)
title(X) :- prev(X, X). (0, 0)
title(X) :- !prev(X, X). (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- !prev(X, Y). (?, ?)
title(X) :- next(X, X). (0, 0)
title(X) :- !next(X, X). (3, 14)
title(X) :- next(X, Y). (3, 14)
title(X) :- !next(X, Y). (?, ?)
title(X) :- length(X, X). (0, 0)
title(X) :- prev(X, "<b>"). (0, 5)
title(X) :- !prev(X, "<b>"). (3, 9)
title(X) :- prev(X, "</b>"). (3, 9)
title(X) :- !prev(X, "</b>"). (0, 5)
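The pairs in parentheses read as (positive, negative) strings covered by each candidate rule. The Python sketch below recomputes a few of them for rules that test the previous token. It assumes that coverage is counted over the distinct text strings of the three samples, with the three titles as positives and every other string as a negative; this assumption, and the names occurrences and coverage, are not taken from the slides.

```python
import re

SAMPLES = [
    "<html><body> <b>Book name:</b> Ontologies <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> John Doe <br/> <b>Rating:</b> 7 <br/> <b>Text:</b> blah, blah </li> "
    "<li> <b>Reviewer:</b> Alan Wohl <br/> <b>Rating:</b> 8 <br/> <b>Text:</b> yeah, yeah </li> "
    "</ul></body></html>",
    "<html><body> <b>Book name:</b> SPARQL in action <br/> <b>Reviews:</b> <br/> <ul> "
    "<li> <b>Reviewer:</b> Dan Smith <br/> <b>Rating:</b> 9 <br/> <b>Text:</b> cough, cough </li> "
    "</ul></body></html>",
    "<html><body> <b>Book name:</b> W4F explained <br/> <b>Reviews:</b> <br/> "
    "<ul> </ul></body></html>",
]
TITLES = {"Ontologies", "SPARQL in action", "W4F explained"}  # positive samples


def tokenize(page):
    """Split a page into tag tokens and text tokens, dropping whitespace."""
    return [t.strip() for t in re.findall(r"<[^>]+>|[^<]+", page) if t.strip()]


def occurrences():
    """Map each distinct text string to the set of tokens seen right before it."""
    prevs = {}
    for page in SAMPLES:
        tokens = tokenize(page)
        for i, token in enumerate(tokens):
            if not token.startswith("<"):
                prevs.setdefault(token, set()).add(tokens[i - 1] if i else None)
    return prevs


def coverage(condition):
    """Count (positive, negative) strings whose prev-set satisfies condition."""
    prevs = occurrences()
    p = sum(1 for s, ps in prevs.items() if s in TITLES and condition(ps))
    n = sum(1 for s, ps in prevs.items() if s not in TITLES and condition(ps))
    return p, n


if __name__ == "__main__":
    print("title(X) :- .                ", coverage(lambda ps: True))
    print('title(X) :- prev(X, "<b>").  ', coverage(lambda ps: "<b>" in ps))
    print('title(X) :- !prev(X, "<b>"). ', coverage(lambda ps: "<b>" not in ps))
    print('title(X) :- prev(X, "</b>"). ', coverage(lambda ps: "</b>" in ps))
```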

Page 48: Integration of Friendly Data Islands on the Web. Information Extraction

Rule selection

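$$\mathit{Gain} = t \cdot \left( \ln\frac{p_1}{p_1 + n_1} - \ln\frac{p_0}{p_0 + n_0} \right)$$

$p_0$ = # positive bindings of R
$n_0$ = # negative bindings of R
$p_1$ = # positive bindings of R&A
$n_1$ = # negative bindings of R&A
$t$ = # positive bindings of both R and R&A

The factor $t$ is the combined covering, the first logarithm is the new covering, and the second logarithm is the old covering.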
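As a worked instance of rule selection, the sketch below evaluates the gain of extending the empty rule title(X) :- . with coverage (3, 14) by the atom prev(X, "</b>") with coverage (3, 9). It assumes the usual FOIL-style form Gain = t · (ln(p1/(p1+n1)) − ln(p0/(p0+n0))) with natural logarithms; the function name gain and the convention of returning 0 when the extended rule covers no positives are assumptions.

```python
import math


def gain(p0, n0, p1, n1, t):
    """FOIL-style gain of extending rule R (p0, n0) with atom A to R&A (p1, n1).

    t is the number of positive bindings covered by both R and R&A.
    """
    if p1 == 0:
        # Assumed convention: an extension that keeps no positives gains nothing.
        return 0.0
    new_covering = math.log(p1 / (p1 + n1))
    old_covering = math.log(p0 / (p0 + n0))
    return t * (new_covering - old_covering)


if __name__ == "__main__":
    # title(X) :- .  covers (3, 14); adding prev(X, "</b>") covers (3, 9).
    print(gain(p0=3, n0=14, p1=3, n1=9, t=3))
    # Adding prev(X, "<b>") covers (0, 5): no positives survive.
    print(gain(p0=3, n0=14, p1=0, n1=5, t=0))
```

An atom that leaves the coverage unchanged, such as prev(X, Y) at (3, 14), gets a gain of zero, so the search prefers atoms that discard negatives while keeping positives.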

Page 49: Integration of Friendly Data Islands on the Web. Information Extraction

Induction goes on…

title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), X = Y. (?, ?)
title(X) :- prev(X, Y), X != Y. (?, ?)
title(X) :- prev(X, Y), prev(X, X). (?, ?)
title(X) :- prev(X, Y), !prev(X, X). (?, ?)
title(X) :- prev(X, Y), prev(X, Z). (?, ?)
title(X) :- prev(X, Y), !prev(X, Z). (?, ?)
title(X) :- prev(X, Y), prev(Y, X). (?, ?)

Page 50: Integration of Friendly Data Islands on the Web. Information Extraction

…and on…

title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = "</b>". (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(X, Z). (?, ?)
title(X) :- prev(X, Y), Y = "</b>", !prev(X, Z). (?, ?)

Page 51: Integration of Friendly Data Islands on the Web. Information Extraction

…and eventually finishes

title(X) :- . (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- prev(X, Y), Y = "</b>". (?, ?)
title(X) :- prev(X, Y), Y = "</b>", prev(Y, "Book name:"). (3, 0)

Page 52: Integration of Friendly Data Islands on the Web. Information Extraction

Optimisations

• Intelligent predicates
  – Nonsense atoms
  – Nonsense atom combinations
  – Non-bindable variables
• Instantiated target predicates
• Statistical analysis of constants
• Keep track of non-instantiable predicates

Page 53: Integration of Friendly Data Islands on the Web. Information Extraction

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

Page 54: Integration of Friendly Data Islands on the Web. Information Extraction

That's quite clear!

• Information extraction enables information integration

Page 55: Integration of Friendly Data Islands on the Web. Information Extraction

Research challenges

• Information extraction
  – Efficient rule generation
  – Maintaining rules automatically
  – Non-union-free grammars (unsupervised)
• Ontologisation rules
  – Everything is a challenge

Page 56: Integration of Friendly Data Islands on the Web. Information Extraction

Thanks!

Drop by our web site at http://www.tdg-seville.info