integration of friendly data islands on the web. information extraction

Integration of Friendly Data Islands on the Web.

Information Extraction.

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Conclusions

The theory

• A wrapper is a building block that provides an ad-hoc, message-based API to an app

• Wrappers interface apps at one or more layers, but, more often than not, they must deal with the user interface or the data layer

User Interface

Controller

Business Logic

Data Access Layer

Data Layer

The problem

The Da Vinci Code

Buy

Dan Brown

Doubleday, 2006

15.95 €

Robert Langdon is a Harvard Professor of Symbology…

Features of current web documents

• Trillions of documents
• Generated on demand by software applications
• Change continuously
• Require navigation from search forms
• Written in telegraphic language
• Formatted according to HTML templates

The solution

Wrapping in a nutshell

• Goals
– Endow data islands with APIs
– Ease implementing software applications

• Implications
– Form filling
– Navigation
– Info extraction
– “Ontologisation”

Look out!

• Information extraction has driven most research efforts
• Few wrapping systems are complete
• Wrapping is usually mistaken for information extraction
• This talk is about engineering information extraction for enabling information integration

How IE works

Information extractor

Document

Extraction rules

Attributes

The Da Vinci Code

Dan Brown

15.95 €

2006

Robert Langdon…

Doubleday

Templates

Message ID: MUC-0001
Message Template: Court resolution
Date of Event: April 30, 2007
Charge: Terrorist attack
Perpetrator: Salahuddin Amin
Perpetrator: Anthony Garcia
Perpetrator: Waheed Mahmood
Perpetrator: Omar Khyam
…

The Da Vinci Code

Dan Brown

15.95 €

2006

P1

Robert Langdon…

Doubleday

A1

B1

Ontology instances

Templating/ Ontologisation rules

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
• Side by side comparison
• Conclusions

Running example

<!-- Sample #1 -->
<html><body>
  Book name: Ontologies
  Reviews:
  <ul>
    <li> Reviewer: John Doe Rating: 7 Text: blah, blah </li>
    <li> Reviewer: Alan Wohl Rating: 8 Text: yeah, yeah </li>
  </ul>
</body></html>

<!-- Sample #2 -->
<html><body>
  Book name: SPARQL in action
  Reviews:
  <ul>
    <li> Reviewer: Dan Smith Rating: 9 Text: cough, cough </li>
  </ul>
</body></html>

<!-- Sample #3 -->
<html><body>
  Book name: W4F explained
  Reviews:
  <ul>
  </ul>
</body></html>

Kinds of extraction rules

• Regular expressions
• First-order logic rules
• Pointers into DOM tree
• Context-free grammars
• Tag trees

Regular expressions

TSIMMIS

[Root, get("page.html"), "#"]
[BookReview, Root, "<body>#</body>"]
[BookName, BookReview, "# "]
[Tmp, Root, "<ul>#</ul>"]
[Reviews, Tmp, "split(Tmp, '<li>')"]
[ReviewerNames, Reviews, "Reviewer:# "]
[Ratings, Reviews, "Rating:# "]
[Text, Reviews, "Text:# "]
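The rules above chain pattern matches: each one takes the text bound by an earlier rule and keeps what lies in the position of "#". A minimal Python sketch of that idea follows. It is not TSIMMIS syntax; the helper `between` is a hypothetical name, and the right-hand delimiters (the next label instead of a single space) are an assumption made so that multi-word values such as "John Doe" survive.

```python
import re

# Sample #1 from the running example, flattened to one line.
PAGE = (
    "<html><body> Book name: Ontologies Reviews: <ul> "
    "<li> Reviewer: John Doe Rating: 7 Text: blah, blah </li> "
    "<li> Reviewer: Alan Wohl Rating: 8 Text: yeah, yeah </li> "
    "</ul></body></html>"
)

def between(text, left, right):
    """Return the shortest text found between `left` and `right` (the '#' slot)."""
    match = re.search(re.escape(left) + r"(.*?)" + re.escape(right), text, re.S)
    return match.group(1).strip() if match else None

# Each assignment mirrors one rule: [Variable, Source, Pattern].
book_review = between(PAGE, "<body>", "</body>")
book_name = between(book_review, "Book name:", "Reviews:")
tmp = between(book_review, "<ul>", "</ul>")
reviews = [chunk for chunk in tmp.split("<li>") if chunk.strip()]
reviewer_names = [between(r, "Reviewer:", "Rating:") for r in reviews]
ratings = [between(r, "Rating:", "Text:") for r in reviews]
texts = [between(r, "Text:", "</li>") for r in reviews]
```

Because every rule depends on literal delimiters, a small change in the page template is enough to break the whole chain.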

RoadRunner

$FileName
<html><body>
  Book name: $BookTitle
  Reviews:
  <ul>
    (( <li> Reviewer: $ReviewerName Rating: $Rating Text: $Text </li> )+)?
  </ul>
</body></html>
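The wrapper above is a regular expression over the page with named fields, where "(( … )+)?" marks an optional list. As a rough illustration, it can be transcribed into Python's `re` syntax; the group names and the `extract` function are assumptions made for this sketch, `$FileName` is left out, and "(…+)?" is written as the equivalent "(…)*".

```python
import re

SAMPLE_2 = ("<html><body> Book name: SPARQL in action Reviews: <ul> "
            "<li> Reviewer: Dan Smith Rating: 9 Text: cough, cough </li> "
            "</ul></body></html>")
SAMPLE_3 = "<html><body> Book name: W4F explained Reviews: <ul> </ul></body></html>"

# Page-level wrapper: the title field plus a possibly empty list of reviews.
WRAPPER = re.compile(
    r"<html><body>\s*Book name:\s*(?P<title>.*?)\s*Reviews:\s*<ul>\s*"
    r"(?P<reviews>(?:<li>.*?</li>\s*)*)"
    r"</ul></body></html>",
    re.S,
)
# Record-level wrapper: the three fields inside one list item.
REVIEW = re.compile(
    r"<li>\s*Reviewer:\s*(?P<reviewer>.*?)\s*Rating:\s*(?P<rating>.*?)"
    r"\s*Text:\s*(?P<text>.*?)\s*</li>",
    re.S,
)

def extract(page):
    """Apply the wrapper to a page; return None when the page does not match."""
    match = WRAPPER.search(page)
    if match is None:
        return None
    reviews = [r.groupdict() for r in REVIEW.finditer(match.group("reviews"))]
    return {"title": match.group("title"), "reviews": reviews}
```

Calling `extract(SAMPLE_3)` yields an empty review list, which is the case the optional part of the wrapper exists for.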

First-order logic rules

SRV

bookTitle(X) :- prev(X, "Book name:"), next(X, " ").
reviewerName(X) :- prev(X, "name:"), next(X, " "), !bookTitle(X).
rating(X) :- isNatural(X), length(X, 1), inList(X).
text(X) :- prev(X, "Text:"), next(X, "</li>").
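To see how such rules select attributes, the sketch below evaluates Python counterparts of them over a token list. The tokenisation of Sample #1 (tags, labels and values as separate tokens) is an assumption, the function names are hypothetical, and the rules are adapted rather than copied: `inList` is dropped and the reviewer rule tests the "Reviewer:" label directly.

```python
# Assumed tokenisation of Sample #1: tags, labels and values are separate tokens.
TOKENS = [
    "Book name:", "Ontologies", "Reviews:", "<ul>",
    "<li>", "Reviewer:", "John Doe", "Rating:", "7", "Text:", "blah, blah", "</li>",
    "<li>", "Reviewer:", "Alan Wohl", "Rating:", "8", "Text:", "yeah, yeah", "</li>",
    "</ul>",
]

# Support predicates, expressed as lookups on the token at index i.
def prev(i):
    return TOKENS[i - 1] if i > 0 else ""

def nxt(i):
    return TOKENS[i + 1] if i + 1 < len(TOKENS) else ""

def is_natural(token):
    return token.isdigit()

# Target predicates: each is a conjunction of support predicates.
def book_title(i):
    return prev(i) == "Book name:"

def reviewer_name(i):
    return prev(i) == "Reviewer:"

def rating(i):
    return is_natural(TOKENS[i]) and len(TOKENS[i]) == 1

def text(i):
    return prev(i) == "Text:" and nxt(i) == "</li>"

def extract(rule):
    """Return every token that satisfies the given target predicate."""
    return [TOKENS[i] for i in range(len(TOKENS)) if rule(i)]
```

For instance, `extract(rating)` keeps the tokens that are one-digit naturals, regardless of where they appear.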

Pointer into the DOM tree

WebOQL

select x'.Text, y'.Text, y''''.Text, y'''''''.Text
from x, y in browse("page.html")
where x.Text = "Book name:" and y.Text = "Reviewer:"
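The query above locates nodes in the document tree and then steps from them to the nodes that hold the data. The same pointer idea can be sketched with path expressions from Python's standard `xml.etree.ElementTree` module, which is a substitute for WebOQL and not its syntax. The sketch assumes the page is well-formed XML, which holds for the running example but not for most real HTML.

```python
import xml.etree.ElementTree as ET

# Sample #1 from the running example; it happens to be well-formed XML.
PAGE = (
    "<html><body> Book name: Ontologies Reviews: <ul> "
    "<li> Reviewer: John Doe Rating: 7 Text: blah, blah </li> "
    "<li> Reviewer: Alan Wohl Rating: 8 Text: yeah, yeah </li> "
    "</ul></body></html>"
)

root = ET.fromstring(PAGE)

# Pointer 1: the text node directly under <body>, which holds the book name.
body_text = root.find("body").text.strip()

# Pointer 2: every <li> under <body>/<ul>, one per review.
items = [li.text.strip() for li in root.findall("body/ul/li")]
```

The pointers only reach text nodes; since the sample puts several attributes in one text node, a string-level rule is still needed to split them.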

Context-free grammars

Minerva

Page ::= $FileName <html><body> Review </body></html>
Review ::= Book name: $BookName Reviews: <ul> (<li> Reviewer Rating Text </li>)* </ul>
Reviewer ::= Reviewer: $Reviewer
Rating ::= Rating: $Rating
Text ::= Text: $Text
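A grammar of this kind maps naturally onto a recursive-descent parser with one method per production. The sketch below is an illustration under assumptions, not Minerva itself: the tokeniser treats tags and the five labels as terminals, `$FileName` is omitted, and the class and function names are hypothetical.

```python
import re

# Terminals of the grammar: tags and the fixed labels.
TERMINALS = re.compile(r"(</?\w+>|Book name:|Reviews:|Reviewer:|Rating:|Text:)")

def tokenize(page):
    """Split a page into terminals and the field text between them."""
    return [t.strip() for t in TERMINALS.split(page) if t.strip()]

class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, terminal):
        if self.peek() != terminal:
            raise SyntaxError(f"expected {terminal!r}, found {self.peek()!r}")
        self.pos += 1

    def field(self):
        # A $Field in the grammar: consume one token and return it.
        value = self.peek()
        self.pos += 1
        return value

    def labelled(self, label):
        # Reviewer, Rating and Text all have the shape "Label: $Field".
        self.expect(label)
        return self.field()

    def page(self):
        self.expect("<html>"); self.expect("<body>")
        review = self.review()
        self.expect("</body>"); self.expect("</html>")
        return review

    def review(self):
        name = self.labelled("Book name:")
        self.expect("Reviews:"); self.expect("<ul>")
        reviews = []
        while self.peek() == "<li>":  # the (...)* repetition
            self.expect("<li>")
            reviews.append({"reviewer": self.labelled("Reviewer:"),
                            "rating": self.labelled("Rating:"),
                            "text": self.labelled("Text:")})
            self.expect("</li>")
        self.expect("</ul>")
        return {"book_name": name, "reviews": reviews}

def parse(page):
    return Parser(tokenize(page)).page()
```

A page that deviates from the grammar raises `SyntaxError`, which makes template changes visible instead of silently returning wrong data.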

Tag trees

DEPTA

[Figure: tag tree rooted at an li node, with b and br child nodes]

Classification

• Hand-crafted
• Supervised induction
• Little-supervised induction
• Unsupervised induction

Hand-crafted

The pattern to extract the title is

“…”

• Techniques
– Natural intelligence

• Systems
– TSIMMIS
– Minerva
– WebOQL
– W4F
– XWrap

Supervised induction

• Techniques
– Bottom-up ILP
– Top-down ILP
– Ad-hoc algorithms

• Systems
– SRV
– RAPIER
– WIEN
– WHISK
– NoDoSE
– SoftMealy
– STALKER
– DEByE

Raw documents

Labelled documents

Automated induction

Little-supervised induction

• Techniques
– String alignment
– Tree alignment

• Systems
– OLERA
– Thresher

Raw document

Record and attribute labelling

Automated induction

Unsupervised induction

• Techniques
– String alignment
– Tree alignment
– Statistical roles

• Systems
– DeLa
– RoadRunner
– EXALG
– DEPTA
– IEPAD

Raw documents

Automated induction

Pattern interpretation

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
– RoadRunner
– SRV
• Conclusions

Token matching

<!-- Sample #1 -->
<html><body>
  Book name: Ontologies
  Reviews:
  <ul>
    <li> Reviewer: John Doe Rating: 7 … </li>
    <li> Reviewer: Alan Wohl Rating: 8 … </li>
  </ul>
</body></html>

<!-- Sample #2 -->
<html><body>
  Book name: SPARQL in action
  Reviews:
  <ul>
    <li> Reviewer: Dan Smith Rating: 9 … </li>
  </ul>
</body></html>


String mismatch

$1

…and matching…




Tag match

$1<html>

Tag match

$1<html><body>

Tag match, string match, …

$1<html><body> Book name:

String mismatch, tag match

$1<html><body> Book name: $2

…

$1<html><body> Book name: $2 <ul> <li> Reviewer: $3 Rating: $4 Text: $5 </li>
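The matching steps so far can be sketched as a token-by-token comparison: equal tokens are kept, string mismatches become fields, and the first tag mismatch stops the process. This is a simplified illustration, not RoadRunner's algorithm. The tokeniser (tags and colon-terminated labels as tokens), the `#PCDATA` field marker and the function names are assumptions, and the sample comments that produce `$1` are left out.

```python
import re

SAMPLE_1 = ("<html><body> Book name: Ontologies Reviews: <ul> "
            "<li> Reviewer: John Doe Rating: 7 Text: blah, blah </li> "
            "<li> Reviewer: Alan Wohl Rating: 8 Text: yeah, yeah </li> "
            "</ul></body></html>")
SAMPLE_2 = ("<html><body> Book name: SPARQL in action Reviews: <ul> "
            "<li> Reviewer: Dan Smith Rating: 9 Text: cough, cough </li> "
            "</ul></body></html>")

# Assumption: tags and colon-terminated labels are tokens of their own;
# whatever text lies between them is a single string token.
TOKEN = re.compile(r"(<[^>]+>|[A-Z][a-z]+(?: [a-z]+)*:)")

def tokenize(page):
    return [t.strip() for t in TOKEN.split(page) if t.strip()]

def is_tag(token):
    return token.startswith("<")

def match(wrapper, sample):
    """Generalise `wrapper` against `sample` up to the first tag mismatch.

    Returns the generalised prefix and the position of the tag mismatch,
    or None as position when no tag mismatch was found.
    """
    result = []
    for position, (w, s) in enumerate(zip(wrapper, sample)):
        if w == s:
            result.append(w)          # tag match or string match
        elif is_tag(w) or is_tag(s):
            return result, position   # tag mismatch: list or optional needed
        else:
            result.append("#PCDATA")  # string mismatch: generalise to a field
    return result, None

wrapper, stop = match(tokenize(SAMPLE_1), tokenize(SAMPLE_2))
```

Here the comparison stops where Sample #1 opens a second `<li>` while Sample #2 closes the `<ul>`; resolving that mismatch into a repeated or optional block is not attempted in this sketch.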

Stop: lists and optionals

Tag mismatch

$1<html><body> Book name: $2 <ul> (<li> Reviewer: $3 Rating: $4 Text: $5 </li>)+

…and matching finishes

$1<html><body> Book name: $2 <ul> (<li> Reviewer: $3 Rating: $4 Text: $5 </li>)+ </ul></body></html>

Just union-free grammars!

Roadmap

• Introduction
• What extraction rules are
• Generating extraction rules
• A couple of systems
– RoadRunner
– SRV
• Conclusions

Exercise

• Support predicates: next(x,y), previous(x,y)
• Try to explain isCorD(X)

abcabdabbbcaabda

Exercise

• Support predicates: next(x,y), previous(x,y)
• Now, try to explain isCorDorE(X)

abcabdabeebbcaabdaee

Target Predicates

Define target predicates

title: #PCDATA.
reviewer: #PCDATA.
rating: #PCDATA.
text: #PCDATA.

Instantiate target predicates

Positive Samples

title("Ontologies").
title("SPARQL in action").
title("W4F explained").
reviewer("John Doe").
reviewer("Alan Wohl").
reviewer("Dan Smith").
rating("7").
rating("8").
rating("9").
text("blah, blah").
text("yeah, yeah").
text("cough, cough").

Negative Samples

!title("Book name:").
!reviewer("Book name:").
!rating("Book name:").
!text("Book name:").
!title("Reviews:").
!reviewer("Reviews:").
!rating("Reviews:").
!text("Reviews:").
!title("Reviewer:").
!reviewer("Reviewer:").
!rating("Reviewer:").
!text("Reviewer:").
!title("Rating:").
!reviewer("Rating:").
!rating("Rating:").
…

Support Predicates

Define support predicates

prev: #PCDATA, #PCDATA.
next: #PCDATA, #PCDATA.
length: #PCDATA, #PCDATA.
isNatural: #PCDATA.

Instantiate support predicates

On Positive Samples

prev("Ontologies", "").
next("Ontologies", " ").
length("Ontologies", 10).
!isNatural("Ontologies").
prev("SPARQL in action", "").
next("SPARQL in action", " ").
length("SPARQL in action", 16).
!isNatural("SPARQL in action").
prev("W4F explained", "").
next("W4F explained", " ").
length("W4F explained", 13).
!isNatural("W4F explained").
…

On Negative Samples

prev("Book name:", "").
next("Book name:", "").
length("Book name:", 10).
!isNatural("Book name:").
prev("Reviews:", "").
next("Reviews:", "").
!isNatural("Reviews:").
prev("Reviewer:", "").
next("Reviewer:", "").
!isNatural("Reviewer:").
prev("Rating:", "").
next("Rating:", "").
!isNatural("Rating:").
…

Top-down induction

title(X) :- . (3, 14)
title(X) :- prev(X, X). (0, 0)
title(X) :- !prev(X, X). (3, 14)
title(X) :- prev(X, Y). (3, 14)
title(X) :- !prev(X, Y). (?, ?)
title(X) :- next(X, X). (0, 0)
title(X) :- !next(X, X). (3, 14)
title(X) :- next(X, Y). (3, 14)
title(X) :- !next(X, Y). (?, ?)
title(X) :- length(X, X). (0, 0)
title(X) :- prev(X, ""). (0, 5)
title(X) :- !prev(X, ""). (3, 9)
title(X) :- prev(X, ""). (3, 9)
title(X) :- !prev(X, ""). (0, 5)
…

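Rule selection

$$\mathrm{Gain} = t \cdot \left( \ln\frac{p_1}{p_1 + n_1} - \ln\frac{p_0}{p_0 + n_0} \right)$$

p0 = # positive bindings of R

n0 = # negative bindings of R

p1 = # positive bindings of R&A

n1 = # negative bindings of R&A

t = # positive bindings of both R and R&A

The factor t is the combined covering; the first logarithm measures the new covering (R&A) and the second the old covering (R).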
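As a quick numeric check of the gain, the sketch below evaluates it for the empty rule with counts (3, 14) refined into a candidate with counts (3, 9), both taken from the top-down induction slide. The function name and the convention of returning 0 when the refinement covers no positive sample are assumptions.

```python
import math

def gain(p0, n0, p1, n1, t):
    """Gain of refining rule R (p0, n0) into R&A (p1, n1).

    t is the number of positive bindings covered by both R and R&A.
    """
    if p1 == 0:
        # Assumption: a refinement that covers no positive sample is worth nothing.
        return 0.0
    new_covering = math.log(p1 / (p1 + n1))
    old_covering = math.log(p0 / (p0 + n0))
    return t * (new_covering - old_covering)

# Empty rule (3, 14) refined into a candidate covering (3, 9).
g = gain(p0=3, n0=14, p1=3, n1=9, t=3)
```

The value is positive because the candidate keeps all three positives while dropping five negatives; a candidate that reached (3, 0) would score higher still.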

Induction goes on…

title(X) :- . (3, 14)

title(X) :- prev(X, Y), X = Y. (?, ?)
title(X) :- prev(X, Y), X != Y. (?, ?)
title(X) :- prev(X, Y), prev(X, X). (?, ?)
title(X) :- prev(X, Y), !prev(X, X). (?, ?)
title(X) :- prev(X, Y), prev(X, Z). (?, ?)
title(X) :- prev(X, Y), !prev(X, Z). (?, ?)
title(X) :- prev(X, Y), prev(Y, X). (?, ?)
…

…and on…

title(X) :- . (3, 14)

title(X) :- prev(X, Y), Y = "". (?, ?)
title(X) :- prev(X, Y), Y = "", prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "", !prev(X, X). (?, ?)
title(X) :- prev(X, Y), Y = "", prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "", !prev(Y, Y). (?, ?)
title(X) :- prev(X, Y), Y = "", prev(X, Z). (?, ?)
title(X) :- prev(X, Y), Y = "", !prev(X, Z). (?, ?)
…

…and eventually finishes

title(X) :- . (3, 14)

title(X) :- prev(X, Y), Y = "". (?, ?)

title(X) :- prev(X, Y), Y = "", prev(Y, "Book name:"). (3, 0)
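The search shown on these slides starts from the empty rule and keeps adding the literal with the highest gain until no negative sample is covered. A much reduced sketch of that loop follows. It is not SRV: the token list is a hypothetical tokenisation of Sample #2, candidate literals only compare `prev` or `next` against constants (no chained variables such as Y), and all names are assumptions.

```python
import math

# Hypothetical tokenisation of Sample #2; index 1 is the only positive sample.
TOKENS = ["Book name:", "SPARQL in action", "Reviews:", "<ul>", "<li>",
          "Reviewer:", "Dan Smith", "Rating:", "9", "Text:", "cough, cough",
          "</li>", "</ul>"]
POSITIVES = {1}

def prev(i):
    return TOKENS[i - 1] if i > 0 else ""

def next_(i):
    return TOKENS[i + 1] if i + 1 < len(TOKENS) else ""

SUPPORT = {"prev": prev, "next": next_}
CONSTANTS = sorted(set(TOKENS) | {""})
# Candidate literals have the form predicate(X, constant).
CANDIDATES = [(name, c) for name in ("prev", "next") for c in CONSTANTS]

def covered(rule):
    """Token indices that satisfy every literal in the rule body."""
    return [i for i in range(len(TOKENS))
            if all(SUPPORT[name](i) == c for name, c in rule)]

def counts(rule):
    """Return (positives covered, negatives covered) for a rule."""
    hits = covered(rule)
    p = sum(1 for i in hits if i in POSITIVES)
    return p, len(hits) - p

def gain(rule, literal):
    p0, n0 = counts(rule)
    p1, n1 = counts(rule + [literal])
    if p1 == 0:
        return 0.0
    # Adding a literal only narrows the rule, so t equals p1 here.
    return p1 * (math.log(p1 / (p1 + n1)) - math.log(p0 / (p0 + n0)))

def induce():
    rule = []  # empty body: covers every token
    while counts(rule)[1] > 0:
        best = max(CANDIDATES, key=lambda lit: gain(rule, lit))
        if gain(rule, best) <= 0:
            break  # no literal improves the rule
        rule.append(best)
    return rule

rule = induce()
```

With a single document one literal is enough to exclude every negative token; real inputs need several refinement rounds, which is where the optimisations pay off.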

Optimisations

• Intelligent predicates
– Non-sense atoms
– Non-sense atom combinations
– Non-bindable variables

• Instantiated target predicates
• Statistical analysis of constants
• Keep track of non-instantiable predicates

That's quite clear!

• Information extraction enables information integration

Research challenges

• Information extraction
– Efficient rule generation
– Maintaining rules automatically
– Non-union-free grammars (unsupervised)

• Ontologisation rules
– Everything is a challenge

Thanks!

Drop by our web site at http://www.tdg-seville.info
