TRANSCRIPT
1
University of Economics Prague
Information extraction from web pages using extraction ontologies
Martin Labský, KEG Seminar, 28th November 2006
2
Agenda
Purpose
Knowledge sources
Extraction ontology
Finding attribute candidates
Instance parsing
Wrapper induction
Ex demo
Discussion
3
Purpose
Extract objects from documents
– object = instance of a class from an ontology
– document = text, possibly with formatting
Objects
– belong to known, well-defined class(es)
– classes consist of attributes, axioms, constraints
Documents
– may come in collections of arbitrary sizes
– structured, semi-structured, or free text
– extraction should improve if documents contain some formatting (e.g. HTML) and this formatting is similar within or across documents
Examples
– product catalogues (e.g. detailed product descriptions)
– weather forecast sites (e.g. forecasts for the next day)
– restaurant descriptions (cuisine, opening hours etc.)
– contact information
– financial news
4
Knowledge sources
Why
– for some attributes of a class, manual extraction knowledge is easier to obtain than training data, and vice versa for others (experience from the bicycle product IE task)
– allow people to experiment with manually encoded patterns alone, so they can quickly investigate whether an IE task is feasible; if so, training data can be added for the attributes that require it
1. Knowledge entered manually by an expert
– the only mandatory source
– class definitions + extraction evidence
2. Training data
– sample attribute values or sample instances
– possibly coupled with referring documents
– used to induce typical content and context of extractable items, cardinalities and orderings of class attributes, ...
3. Common formatting structure
– of observed instances
– in a single document, or across documents from the same source
5
Extraction ontology
Attribute data types
– assigned manually
Cardinality ranges
– assigned manually; cardinality probability estimates could be trained
Patterns for content and typical context of attributes
– regular grammars at the level of words, lemmas, POS tags or word types (uppercase, capital, number, alphanumeric etc.)
– phrase lists
– attribute value lengths
– the above equipped with probability estimates
– assigned manually or induced from training data
For numeric attributes
– units
– estimated probability distributions (e.g. tables, Gaussian)
– assigned manually or trained
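The trained distribution for a numeric attribute can be used to score observed values. A minimal sketch of the Gaussian case, with an illustrative class name and parameters that are not part of Ex:

```java
// Sketch: scoring a numeric attribute value against a trained Gaussian.
// The class name and parameters are illustrative, not Ex's actual API.
public class GaussianValueModel {
    private final double mean, stdDev;

    public GaussianValueModel(double mean, double stdDev) {
        this.mean = mean;
        this.stdDev = stdDev;
    }

    /** Density of the Gaussian at the observed value. */
    public double density(double value) {
        double z = (value - mean) / stdDev;
        return Math.exp(-0.5 * z * z) / (stdDev * Math.sqrt(2 * Math.PI));
    }

    public static void main(String[] args) {
        // E.g. monitor screen sizes centered around 19 inches.
        GaussianValueModel size = new GaussianValueModel(19.0, 3.0);
        System.out.println(size.density(19.0) > size.density(30.0)); // a typical value scores higher
    }
}
```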
Sample extraction ontologies: contacts_en.xml or monitors.eol.xml
– see class and attribute definitions, data types
– ECMAScript axioms
– regular pattern language
– pattern precision and recall parameters
Sample instances: monitors.tsv and *.html
6
Finding attribute candidates (1)
Preprocessing
– document is tokenized
– parsed into a light-weight DOM (if HTML)
Matching of attributes' regular patterns
– content and context patterns are matched
– each pattern has:
Pattern precision
– estimates how often the pattern actually identifies a value of the attribute in question
– P(attribute | pattern)
Pattern recall
– estimates how many values of the attribute in question satisfy the pattern
– P(pattern | attribute)
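The two estimates above are simple ratios of annotated counts. A minimal sketch in Java, with illustrative names (in Ex these values are stored as pattern parameters in the ontology):

```java
// Sketch: estimating a pattern's precision and recall from annotated counts.
// Names are illustrative; Ex stores these as parameters in the ontology.
public class PatternStats {
    /** P(attribute | pattern): of all pattern matches, how many were true values. */
    public static double precision(int matchesThatAreValues, int totalMatches) {
        return (double) matchesThatAreValues / totalMatches;
    }

    /** P(pattern | attribute): of all true values, how many the pattern matched. */
    public static double recall(int matchesThatAreValues, int totalValues) {
        return (double) matchesThatAreValues / totalValues;
    }

    public static void main(String[] args) {
        // A context pattern fired 40 times; 30 hits were real phone numbers,
        // out of 60 phone numbers in the training data.
        System.out.println(precision(30, 40)); // 0.75
        System.out.println(recall(30, 60));    // 0.5
    }
}
```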
7
Finding attribute candidates (2)
Create a new attribute candidate (AC) wherever at least one pattern matches
An AC for attribute A is scored by an estimate of P(A | patterns), where patterns = the matched state of all patterns known for A
– an independence assumption is made for all patterns E, F from the set Φ of known patterns for attribute A
– AC score computation: (for the derivation see ex.pdf)
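The exact score formula is derived in ex.pdf; as a hedged illustration only, the independence assumption can be realized as a naive-Bayes-style log-odds combination, where each matched pattern contributes its likelihood ratio and each unmatched pattern the complementary ratio. All numbers and names below are illustrative, not Ex's actual formula:

```java
// Illustrative sketch of combining pattern evidence for an AC under an
// independence assumption (naive Bayes log-odds). The real Ex formula is
// derived in ex.pdf; this is only a stand-in with made-up probabilities.
public class AcScore {

    public static double combine(double priorA, double[] pMatchGivenA,
                                 double[] pMatchGivenNotA, boolean[] matched) {
        double logOdds = Math.log(priorA / (1 - priorA));
        for (int i = 0; i < matched.length; i++) {
            double num = matched[i] ? pMatchGivenA[i]    : 1 - pMatchGivenA[i];
            double den = matched[i] ? pMatchGivenNotA[i] : 1 - pMatchGivenNotA[i];
            logOdds += Math.log(num / den);  // each pattern contributes independently
        }
        return 1.0 / (1.0 + Math.exp(-logOdds)); // back to P(A | patterns)
    }

    public static void main(String[] args) {
        // Two patterns known for A; only the first one matched here.
        double p = combine(0.1,
                new double[]{0.9, 0.8},
                new double[]{0.1, 0.2},
                new boolean[]{true, false});
        System.out.println(p); // evidence shifts the probability above the 0.1 prior
    }
}
```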
8
Finding attribute candidates (3)
Most attributes can occur independently or as part of their containing class
– each attribute is equipped with an estimate of P(engaged | A), e.g. 0.75
Three ways of explaining an AC:
– part of an instance; the AC score is then computed as: P(A | patterns) * P(engaged | A)
– standalone: P(A | patterns) * (1 - P(engaged | A))
– mistake: 1 - P(A | patterns)
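The three explanations translate directly into code. A minimal sketch using the slide's own quantities (class and method names are illustrative):

```java
// Sketch of the three ways to explain an AC, using the probabilities from the
// slide: P(A|patterns) from pattern matching and P(engaged|A), e.g. 0.75.
public class AcExplanation {
    public static double engaged(double pA, double pEngaged)    { return pA * pEngaged; }
    public static double standalone(double pA, double pEngaged) { return pA * (1 - pEngaged); }
    public static double mistake(double pA)                     { return 1 - pA; }

    public static void main(String[] args) {
        double pA = 0.8, pEngaged = 0.75;
        System.out.println(engaged(pA, pEngaged));    // ≈ 0.6
        System.out.println(standalone(pA, pEngaged)); // ≈ 0.2
        System.out.println(mistake(pA));              // ≈ 0.2
    }
}
```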
9
Finding attribute candidates (4)
ACs naturally overlap; they form a lattice within the document:
[Figure: AC lattice starting from an initial null state; each node carries the AC's ID and the indices of its start and end tokens, and each arc is weighted by log(AC standalone score). The best path score here is -0.5754.]
If we wanted just standalone attributes, extraction could stop here.
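Finding the best path through such a lattice amounts to choosing a non-overlapping subset of token spans with maximal total score. A hedged sketch using weighted-interval-scheduling dynamic programming; in Ex the arc weights are log standalone scores combined with null-state arcs, whereas here each AC simply carries an illustrative positive gain:

```java
import java.util.*;

// Hedged sketch of picking the best path through a lattice of overlapping ACs.
// Each AC occupies a token span [start, end] and carries a path gain; we select
// a non-overlapping subset with maximum total gain via weighted interval
// scheduling. Names and scores are illustrative, not Ex's actual lattice.
public class AcLattice {
    public record Ac(int start, int end, double gain) {}

    public static double bestPathScore(List<Ac> acs) {
        List<Ac> byEnd = new ArrayList<>(acs);
        byEnd.sort(Comparator.comparingInt(Ac::end));
        // best[i] = best total gain using only the first i ACs (ordered by end token)
        double[] best = new double[byEnd.size() + 1];
        for (int i = 1; i <= byEnd.size(); i++) {
            Ac ac = byEnd.get(i - 1);
            double prev = 0;  // best gain among ACs ending before ac starts
            for (int j = i - 1; j >= 1; j--) {
                if (byEnd.get(j - 1).end() < ac.start()) { prev = best[j]; break; }
            }
            best[i] = Math.max(best[i - 1], prev + ac.gain());
        }
        return best[byEnd.size()];
    }

    public static void main(String[] args) {
        // The middle AC overlaps both others and is skipped on the best path.
        List<Ac> acs = List.of(new Ac(0, 2, 5.0), new Ac(1, 3, 4.0), new Ac(3, 5, 3.0));
        System.out.println(bestPathScore(acs)); // prints 8.0
    }
}
```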
10
Parsing instances (1)
Initially, each AC is converted into a singleton instance candidate IC = {AC}; nested ACs are supported
Then, iteratively, the most promising ICs are expanded: neighboring ACs are added to them
Expansion is possible only if no constraints are violated (e.g. max cardinality reached, or an ECMAScript axiom fails; selective axiom evaluation)
IC scoring
– so far, IC score = log(AC engaged score) + penalties for skipped ACs (orphans) within the IC's span + fixed penalties for the IC crossing formatting blocks
– we need to incorporate ASAP:
– likelihood of the IC's attribute cardinalities and ordering
– learnable formatting-block-crossing penalties
11
Parsing instances (2)
Simplified IC parsing algorithm
1. Create a set ICs_singletons = { {AC}, {AC}, ... } of singleton ICs, each containing just 1 AC.
2. Enrich ICs_singletons by adding ICs with 2 or more contained attribute values (still referred to as singletons since they have a single containing root attribute).
3. Create a set of instance candidates ICs_valid = {}.
4. Create a queue of instance candidates ICs_work = {}. Keep ICs_work sorted by IC score, with a max size of K (heap).
5. Add the content of ICs_singletons to ICs_work.
6. Pick the best-scoring IC_best from ICs_work.
7. Set the beam area BA = span of the document fragment (e.g. HTML element) containing IC_best.
8. While expanding IC_best:
   1. If BA does not contain more ACs, expand BA to the parent BA.
   2. Within BA, try adding to the IC those IC_near_singleton which are singletons and are closest to the IC: IC_new = IC + IC_near_singleton.
   3. If IC_new does not violate integrity constraints (e.g. max cardinality already reached in the IC, or axiom failure):
      – add IC_new to ICs_work
      – if IC_new is valid, add it to ICs_valid
   4. Break if:
      – a large portion of ICs_near_singleton was refused due to integrity constraints, or
      – BA is too large or too high in the formatting block tree.
9. Remove IC_best from the document; if ICs_work is not empty, go to step 6.
10. Return ICs_valid.
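The loop above can be sketched in heavily simplified form. In this hypothetical version, ICs are reduced to sets of AC ids with an additive score, the beam areas, axioms and formatting block tree collapse into a single max-cardinality constraint, and the work queue cap K and break conditions are omitted; none of the names below are Ex's actual API:

```java
import java.util.*;

// Heavily simplified, hypothetical sketch of the IC parsing loop above.
// Every polled IC is treated as valid, and integrity constraints are reduced
// to a max-cardinality check. This is not Ex's actual implementation.
public class IcParser {
    static final int MAX_CARD = 3;  // stand-in for the real integrity constraints

    record Ic(Set<Integer> acs, double score) {}

    public static List<Ic> parse(List<Ic> singletons) {
        List<Ic> valid = new ArrayList<>();                        // ICs_valid
        PriorityQueue<Ic> work =                                   // ICs_work, best first
                new PriorityQueue<>(Comparator.comparingDouble(Ic::score).reversed());
        work.addAll(singletons);                                   // step 5
        Set<Set<Integer>> seen = new HashSet<>();
        while (!work.isEmpty()) {
            Ic best = work.poll();                                 // step 6: IC_best
            valid.add(best);
            for (Ic s : singletons) {                              // step 8: expand
                Set<Integer> merged = new TreeSet<>(best.acs());
                merged.addAll(s.acs());
                boolean grown = merged.size() > best.acs().size();
                boolean okConstraints = merged.size() <= MAX_CARD; // step 8.3
                if (grown && okConstraints && seen.add(merged)) {
                    work.add(new Ic(merged, best.score() + s.score()));
                }
            }
        }
        return valid;                                              // step 10
    }

    public static void main(String[] args) {
        List<Ic> singletons = List.of(
                new Ic(Set.of(0), 1.0), new Ic(Set.of(1), 0.8), new Ic(Set.of(2), 0.5));
        System.out.println(parse(singletons).size()); // all non-empty AC combinations
    }
}
```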
12
Parsing instances (3)
[Figure, repeated across slides 12-18 with the parse state growing step by step: tokens a b c d e f g h i j k l m n ... carry AC matches AX, AY, AZ plus garbage, inside an HTML block structure (TABLE > TR > TD, TD; A). The target class C has the constraints: X card=1 (may contain Y), Y card=1..n, Z card=1..n. Starting from singleton ICs {AX}, {AY}, {AZ}, the parser gradually builds larger ICs such as {AX AY}, {AX [AY]} (with AY nested in AX), {AX AY AZ} and {AX [AY] AZ}.]
19
Parsing instances (4)
From the instance parser, we get a set of valid ICs
– similar to ACs, these may overlap
– valid ICs form a lattice within the analyzed document
20
Parsing instances (5)
Since we want to extract both valid instances and standalone attributes, we merge the AC lattice and the valid IC lattice:
– ICs which interfere with other ICs and leave parts of them unexplained are penalized relative to the unexplained parts of the interfering ICs
21
Parsing instances (6)
The best path is found through the merged lattice
– this should be the sequence of standalone ACs and valid ICs which best explains the document content
22
Wrapper induction (1)
During IC parsing, we search for common formatting patterns which encapsulate part of the ICs being generated
E.g. a person's first name and last name (if extracted as separate attributes) could regularly be contained in the formatting pattern:
– TR[1..n] { TD[0] {person.firstname} TD[1] {person.lastname} }
A formatting pattern is defined as the first block area (HTML tag) containing the whole IC, plus the paths from that area to each of the IC's attributes.
If "reliable" formatting patterns are found, we add them to the context patterns of the respective attributes. For each such attribute A, we then:
– boost/lower the scores of all ACs of A,
– create new ACs for A where the formatting patterns match and no ACs existed before,
– rescore all ICs which contain rescored ACs,
– add new singleton ICs for the newly added ACs.
23
Wrapper induction (2)
Formatting pattern induction process
Segment all ICs from the parser's queue (not only the valid ones) into clusters of ICs with the same attributes populated
– e.g. {firstname: Varel, lastname: Fristensky} and {firstname: Karel, lastname: Nemec} would fit into one cluster.
For each cluster, build an IC lattice over the document, and find the best path of non-overlapping ICs.
For the ICs on the best path, compute the counts of each distinct formatting pattern. For each formatting pattern FP, estimate
– precision(FP) = C(FP, instance from cluster) / C(instance from cluster)
– recall(FP) = C(FP, instance from cluster) / C(FP)
where C() denotes observed counts.
We induce a new pattern if precision(FP), recall(FP) and C(FP, instance from cluster) reach configurable thresholds.
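The acceptance test is a direct translation of the two count ratios above. A minimal sketch with illustrative names and threshold values (the actual thresholds are configurable in Ex):

```java
// Sketch of the formatting pattern acceptance test, following the slide's
// definitions: precision(FP) = C(FP, inst) / C(inst), recall(FP) = C(FP, inst)
// / C(FP). Class name and threshold values are illustrative.
public class FpInduction {
    /** precision(FP) = C(FP, instance from cluster) / C(instance from cluster) */
    public static double precision(int cFpInst, int cInst) {
        return (double) cFpInst / cInst;
    }

    /** recall(FP) = C(FP, instance from cluster) / C(FP) */
    public static double recall(int cFpInst, int cFp) {
        return (double) cFpInst / cFp;
    }

    /** Induce a new pattern only if all three statistics reach their thresholds. */
    public static boolean induce(int cFpInst, int cFp, int cInst,
                                 double minPrec, double minRec, int minCount) {
        return precision(cFpInst, cInst) >= minPrec
                && recall(cFpInst, cFp) >= minRec
                && cFpInst >= minCount;
    }

    public static void main(String[] args) {
        // The pattern wrapped 18 of 20 cluster instances and occurred 22 times total.
        System.out.println(induce(18, 22, 20, 0.8, 0.7, 5)); // prints true
    }
}
```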
24
Wrapper induction (3)
Plugging wrapper generation into the instance parsing algorithm
– in the current implementation, formatting patterns are only induced once, for singleton ICs
Parallel parsing of multiple documents
– documents from the same source (e.g. website) often share formatting patterns; we expect a measurable improvement over the single-document extraction approach
– to be implemented
More experiments needed
25
Ex demo
Command line version; GUI available
– the GUI of the Information Extraction Toolkit exists as a separate project, ready to accommodate other IE engines
Simple API to enable usage in 3rd-party systems
Everything written in Java
– however, it may connect to lemmatizers / POS taggers / other tools written in arbitrary languages
– Ex: ~26,000 lines of code
– Information Extraction Toolkit: ~2,500 lines of code
26
Discussion
Thank you.