TRANSCRIPT
1
University of Economics Prague
Information extraction from web pages using extraction ontologies
Martin Labský, KEG Seminar, 28th November 2006
2
Agenda
Purpose
Knowledge sources
Extraction ontology
Finding attribute candidates
Instance parsing
Wrapper induction
Ex demo
Discussion
3
Purpose
Extract objects from documents
– object = instance of a class from an ontology
– document = text, possibly with formatting
Objects
– belong to known, well-defined class(es)
– classes consist of attributes, axioms, constraints
Documents
– may come in collections of arbitrary sizes
– structured, semi-structured, or free text
– extraction should improve if documents contain some formatting (e.g. HTML) and this formatting is similar within or across documents
Examples
– product catalogues (e.g. detailed product descriptions)
– weather forecast sites (e.g. forecasts for the next day)
– restaurant descriptions (cuisine, opening hours etc.)
– contact information
– financial news
4
Knowledge sources
Why
– for some attributes of a class, manual extraction knowledge is easier to obtain than training data, and vice versa for others (experience from the bicycle product IE task)
– allow people to experiment with manually encoded patterns alone, so they can quickly investigate whether an IE task is feasible; if so, training data can be added for the attributes that require it
1. Knowledge entered manually by an expert
– the only mandatory source
– class definitions + extraction evidence
2. Training data
– sample attribute values or sample instances
– possibly coupled with referring documents
– used to induce typical content and context of extractable items, cardinalities and orderings of class attributes, ...
3. Common formatting structure
– of observed instances
– in a single document, or across documents from the same source
5
Extraction ontology
Attribute data types
– assigned manually
Cardinality ranges
– assigned manually; cardinality probability estimates could be trained
Patterns for content and typical context of attributes
– regular grammars at the level of words, lemmas, POS tags or word types (uppercase, capital, number, alphanumeric etc.)
– phrase lists
– attribute value lengths
– the above equipped with probability estimates
– assigned manually or induced from training data
For numeric attributes
– units
– estimated probability distributions (e.g. tables, Gaussian)
– assigned manually or trained
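The trained distribution for a numeric attribute can be used to score observed values. A minimal sketch of the Gaussian case, with an illustrative class name and parameters that are not part of Ex:

```java
// Sketch: scoring a numeric attribute value against a trained Gaussian.
// The class name and parameters are illustrative, not Ex's actual API.
public class GaussianValueModel {
    private final double mean, stdDev;

    public GaussianValueModel(double mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** Density of the Gaussian at the observed value. */
    public double density(double value) {
        double z = (value - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        // E.g. monitor screen sizes centered around 19 inches.
        GaussianValueModel size = new GaussianValueModel(19.0, 3.0);
        System.out.println(size.density(19.0) > size.density(30.0)); // a typical value scores higher
    }
}
```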
Sample extraction ontologies: contacts_en.xml or monitors.eol.xml
– see class and attribute definitions, data types
– ECMAScript axioms
– regular pattern language
– pattern precision and recall parameters
Sample instances: monitors.tsv and *.html
6
Finding attribute candidates (1)
Preprocessing
– document is tokenized
– parsed into a light-weight DOM (if HTML)
Matching of attributes' regular patterns
– content and context patterns are matched
– each pattern has:
Pattern precision
– estimates how often the pattern actually identifies a value of the attribute in question
– P(attribute | pattern)
Pattern recall
– estimates how many values of the attribute in question satisfy the pattern
– P(pattern | attribute)
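The two estimates above are simple ratios of annotated counts. A minimal sketch in Java, with illustrative names (in Ex these values are stored as pattern parameters in the ontology):

```java
// Sketch: estimating a pattern's precision and recall from annotated counts.
// Names are illustrative; Ex stores these as parameters in the ontology.
public class PatternStats {
    /** P(attribute | pattern): of all pattern matches, how many were true values. */
    public static double precision(int matchesThatAreValues, int totalMatches) {
        return (double) matchesThatAreValues / totalMatches;
    }

    /** P(pattern | attribute): of all true values, how many the pattern matched. */
    public static double recall(int matchesThatAreValues, int totalValues) {
        return (double) matchesThatAreValues / totalValues;
    }

    public static void main(String[] args) {
        // A context pattern fired 40 times; 30 hits were real phone numbers,
        // out of 60 phone numbers in the training data.
        System.out.println(precision(30, 40)); // 0.75
        System.out.println(recall(30, 60));    // 0.5
    }
}
```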
7
Finding attribute candidates (2)
Create a new attribute candidate (AC) wherever at least one pattern matches
An AC for attribute A is scored by an estimate of P(A | patterns), where patterns = the matched state of all patterns known for A
– an independence assumption is made for all patterns E, F from the set Φ of known patterns for attribute A
– AC score computation: (for the derivation see ex.pdf)
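The exact score formula is derived in ex.pdf; as a hedged illustration only, the independence assumption can be realized as a naive-Bayes-style log-odds combination, where each matched pattern contributes its likelihood ratio and each unmatched pattern the complementary ratio. All numbers and names below are illustrative, not Ex's actual formula:

```java
// Illustrative sketch of combining pattern evidence for an AC under an
// independence assumption (naive Bayes log-odds). The real Ex formula is
// derived in ex.pdf; this is only a stand-in with made-up probabilities.
public class AcScore {

    public static double combine(double priorA, double[] pMatchGivenA,
                                 double[] pMatchGivenNotA, boolean[] matched) {
        double logOdds = Math.log(priorA / (1 - priorA));
        for (int i = 0; i < matched.length; i++) {
            double num = matched[i] ? pMatchGivenA[i]    : 1 - pMatchGivenA[i];
            double den = matched[i] ? pMatchGivenNotA[i] : 1 - pMatchGivenNotA[i];
            logOdds += Math.log(num / den);  // each pattern contributes independently
        }
        return 1.0 / (1.0 + Math.exp(-logOdds)); // back to P(A | patterns)
    }

    public static void main(String[] args) {
        // Two patterns known for A; only the first one matched here.
        double p = combine(0.1,
                new double[]{0.9, 0.8},
                new double[]{0.1, 0.2},
                new boolean[]{true, false});
        System.out.println(p); // evidence shifts the probability above the 0.1 prior
    }
}
```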
8
Finding attribute candidates (3)
Most attributes can occur independently or as part of their containing class
– each attribute is equipped with an estimate of P(engaged | A), e.g. 0.75
Three ways of explaining an AC:
– part of an instance; the AC score is then computed as: P(A | patterns) * P(engaged | A)
– standalone: P(A | patterns) * (1 - P(engaged | A))
– mistake: 1 - P(A | patterns)
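The three explanations translate directly into code. A minimal sketch using the slide's own quantities (class and method names are illustrative):

```java
// Sketch of the three ways to explain an AC, using the probabilities from the
// slide: P(A|patterns) from pattern matching and P(engaged|A), e.g. 0.75.
public class AcExplanation {
    public static double engaged(double pA, double pEngaged)    { return pA * pEngaged; }
    public static double standalone(double pA, double pEngaged) { return pA * (1 - pEngaged); }
    public static double mistake(double pA)                     { return 1 - pA; }

    public static void main(String[] args) {
        double pA = 0.8, pEngaged = 0.75;
        System.out.println(engaged(pA, pEngaged));    // ≈ 0.6
        System.out.println(standalone(pA, pEngaged)); // ≈ 0.2
        System.out.println(mistake(pA));              // ≈ 0.2
    }
}
```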
9
Finding attribute candidates (4)
ACs naturally overlap; they form a lattice within the document:
[Figure: AC lattice starting from an initial null state; each node carries the AC's ID and the indices of its start and end tokens, and each arc is weighted by log(AC standalone score). The best path score here is -0.5754.]
If we wanted just standalone attributes, extraction could stop here.
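Finding the best path through such a lattice amounts to choosing a non-overlapping subset of token spans with maximal total score. A hedged sketch using weighted-interval-scheduling dynamic programming; in Ex the arc weights are log standalone scores combined with null-state arcs, whereas here each AC simply carries an illustrative positive gain:

```java
import java.util.*;

// Hedged sketch of picking the best path through a lattice of overlapping ACs.
// Each AC occupies a token span [start, end] and carries a path gain; we select
// a non-overlapping subset with maximum total gain via weighted interval
// scheduling. Names and scores are illustrative, not Ex's actual lattice.
public class AcLattice {
    public record Ac(int start, int end, double gain) {}

    public static double bestPathScore(List<Ac> acs) {
        List<Ac> byEnd = new ArrayList<>(acs);
        byEnd.sort(Comparator.comparingInt(Ac::end));
        // best[i] = best total gain using only the first i ACs (ordered by end token)
        double[] best = new double[byEnd.size() + 1];
        for (int i = 1; i <= byEnd.size(); i++) {
            Ac ac = byEnd.get(i - 1);
            double prev = 0;  // best gain among ACs ending before ac starts
            for (int j = i - 1; j >= 1; j--) {
                if (byEnd.get(j - 1).end() < ac.start()) { prev = best[j]; break; }
            }
            best[i] = Math.max(best[i - 1], prev + ac.gain());
        }
        return best[byEnd.size()];
    }

    public static void main(String[] args) {
        // The middle AC overlaps both others and is skipped on the best path.
        List<Ac> acs = List.of(new Ac(0, 2, 5.0), new Ac(1, 3, 4.0), new Ac(3, 5, 3.0));
        System.out.println(bestPathScore(acs)); // prints 8.0
    }
}
```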
10
Parsing instances (1)
Initially, each AC is converted into a singleton instance candidate IC = {AC}; nested ACs are supported
Then, iteratively, the most promising ICs are expanded: neighboring ACs are added to them
Expansion is possible only if no constraints are violated (e.g. max cardinality reached, or an ECMAScript axiom fails; selective axiom evaluation)
IC scoring
– so far, IC score = log(AC engaged score) + penalties for skipped ACs (orphans) within the IC's span + fixed penalties for the IC crossing formatting blocks
– we need to incorporate ASAP:
– likelihood of the IC's attribute cardinalities and ordering
– learnable formatting-block-crossing penalties
11
Parsing instances (2)
Simplified IC parsing algorithm
1. Create a set ICs_singletons = { {AC}, {AC}, ... } of singleton ICs, each containing just 1 AC.
2. Enrich ICs_singletons by adding ICs with 2 or more contained attribute values (still referred to as singletons since they have a single containing root attribute).
3. Create a set of instance candidates ICs_valid = {}.
4. Create a queue of instance candidates ICs_work = {}. Keep ICs_work sorted by IC score, with a max size of K (heap).
5. Add the content of ICs_singletons to ICs_work.
6. Pick the best-scoring IC_best from ICs_work.
7. Set the beam area BA = span of the document fragment (e.g. HTML element) containing IC_best.
8. While expanding IC_best:
   1. If BA does not contain more ACs, expand BA to the parent BA.
   2. Within BA, try adding to the IC those IC_near_singleton which are singletons and are closest to the IC: IC_new = IC + IC_near_singleton.
   3. If IC_new does not violate integrity constraints (e.g. max cardinality already reached in the IC, or axiom failure):
      – add IC_new to ICs_work
      – if IC_new is valid, add it to ICs_valid
   4. Break if:
      – a large portion of ICs_near_singleton was refused due to integrity constraints, or
      – BA is too large or too high in the formatting block tree.
9. Remove IC_best from the document; if ICs_work is not empty, go to step 6.
10. Return ICs_valid.
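The loop above can be sketched in heavily simplified form. In this hypothetical version, ICs are reduced to sets of AC ids with an additive score, the beam areas, axioms and formatting block tree collapse into a single max-cardinality constraint, and the work queue cap K and break conditions are omitted; none of the names below are Ex's actual API:

```java
import java.util.*;

// Heavily simplified, hypothetical sketch of the IC parsing loop above.
// Every polled IC is treated as valid, and integrity constraints are reduced
// to a max-cardinality check. This is not Ex's actual implementation.
public class IcParser {
    static final int MAX_CARD = 3;  // stand-in for the real integrity constraints

    record Ic(Set<Integer> acs, double score) {}

    public static List<Ic> parse(List<Ic> singletons) {
        List<Ic> valid = new ArrayList<>();                        // ICs_valid
        PriorityQueue<Ic> work =                                   // ICs_work, best first
                new PriorityQueue<>(Comparator.comparingDouble(Ic::score).reversed());
        work.addAll(singletons);                                   // step 5
        Set<Set<Integer>> seen = new HashSet<>();
        while (!work.isEmpty()) {
            Ic best = work.poll();                                 // step 6: IC_best
            valid.add(best);
            for (Ic s : singletons) {                              // step 8: expand
                Set<Integer> merged = new TreeSet<>(best.acs());
                merged.addAll(s.acs());
                boolean grown = merged.size() > best.acs().size();
                boolean okConstraints = merged.size() <= MAX_CARD; // step 8.3
                if (grown && okConstraints && seen.add(merged)) {
                    work.add(new Ic(merged, best.score() + s.score()));
                }
            }
        }
        return valid;                                              // step 10
    }

    public static void main(String[] args) {
        List<Ic> singletons = List.of(
                new Ic(Set.of(0), 1.0), new Ic(Set.of(1), 0.8), new Ic(Set.of(2), 0.5));
        System.out.println(parse(singletons).size()); // all non-empty AC combinations
    }
}
```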
12
Parsing instances (3)
[Figure, repeated across slides 12-18 with the parse state growing step by step: tokens a b c d e f g h i j k l m n ... carry AC matches AX, AY, AZ plus garbage, inside an HTML block structure (TABLE > TR > TD, TD; A). The target class C has the constraints: X card=1 (may contain Y), Y card=1..n, Z card=1..n. Starting from singleton ICs {AX}, {AY}, {AZ}, the parser gradually builds larger ICs such as {AX AY}, {AX [AY]} (with AY nested in AX), {AX AY AZ} and {AX [AY] AZ}.]
19
Parsing instances (4)
From the instance parser, we get a set of valid ICs
– similar to ACs, these may overlap
– valid ICs form a lattice within the analyzed document
20
Parsing instances (5)
Since we want to extract both valid instances and standalone attributes, we merge the AC lattice and the valid IC lattice:
– ICs which interfere with other ICs and leave parts of them unexplained are penalized relative to the unexplained parts of the interfering ICs
21
Parsing instances (6)
The best path is found through the merged lattice
– this should be the sequence of standalone ACs and valid ICs which best explains the document content
22
Wrapper induction (1)
During IC parsing, we search for common formatting patterns which encapsulate part of the ICs being generated
E.g. a person's first name and last name (if extracted as separate attributes) could regularly be contained in the formatting pattern:
– TR[1..n] { TD[0] {person.firstname} TD[1] {person.lastname} }
A formatting pattern is defined as the first block area (HTML tag) containing the whole IC, plus the paths from that area to each of the IC's attributes.
If "reliable" formatting patterns are found, we add them to the context patterns of the respective attributes. For each such attribute A, we then:
– boost/lower the scores of all ACs of A,
– create new ACs for A where the formatting patterns match and no ACs existed before,
– rescore all ICs which contain rescored ACs,
– add new singleton ICs for the newly added ACs.
23
Wrapper induction (2)
Formatting pattern induction process
Segment all ICs from the parser's queue (not only the valid ones) into clusters of ICs with the same attributes populated
– e.g. {firstname: Varel, lastname: Fristensky} and {firstname: Karel, lastname: Nemec} would fit into one cluster.
For each cluster, build an IC lattice over the document, and find the best path of non-overlapping ICs.
For the ICs on the best path, compute the counts of each distinct formatting pattern. For each formatting pattern FP, estimate
– precision(FP) = C(FP, instance from cluster) / C(instance from cluster)
– recall(FP) = C(FP, instance from cluster) / C(FP)
where C() denotes observed counts.
We induce a new pattern if precision(FP), recall(FP) and C(FP, instance from cluster) reach configurable thresholds.
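The acceptance test is a direct translation of the two count ratios above. A minimal sketch with illustrative names and threshold values (the actual thresholds are configurable in Ex):

```java
// Sketch of the formatting pattern acceptance test, following the slide's
// definitions: precision(FP) = C(FP, inst) / C(inst), recall(FP) = C(FP, inst)
// / C(FP). Class name and threshold values are illustrative.
public class FpInduction {
    /** precision(FP) = C(FP, instance from cluster) / C(instance from cluster) */
    public static double precision(int cFpInst, int cInst) {
        return (double) cFpInst / cInst;
    }

    /** recall(FP) = C(FP, instance from cluster) / C(FP) */
    public static double recall(int cFpInst, int cFp) {
        return (double) cFpInst / cFp;
    }

    /** Induce a new pattern only if all three statistics reach their thresholds. */
    public static boolean induce(int cFpInst, int cFp, int cInst,
                                 double minPrec, double minRec, int minCount) {
        return precision(cFpInst, cInst) >= minPrec
                && recall(cFpInst, cFp) >= minRec
                && cFpInst >= minCount;
    }

    public static void main(String[] args) {
        // The pattern wrapped 18 of 20 cluster instances and occurred 22 times total.
        System.out.println(induce(18, 22, 20, 0.8, 0.7, 5)); // prints true
    }
}
```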
24
Wrapper induction (3)
Plugging wrapper generation into the instance parsing algorithm
– in the current implementation, formatting patterns are only induced once, for singleton ICs
Parallel parsing of multiple documents
– documents from the same source (e.g. website) often share formatting patterns; we expect a measurable improvement over the single-document extraction approach
– to be implemented
More experiments needed
25
Ex demo
Command line version; GUI available
– the GUI of the Information Extraction Toolkit exists as a separate project, ready to accommodate other IE engines
Simple API to enable usage in 3rd-party systems
Everything written in Java
– however, it may connect to lemmatizers / POS taggers / other tools written in arbitrary languages
– Ex: ~26,000 lines of code
– Information Extraction Toolkit: ~2,500 lines of code
26
Discussion
Thank you.