Transcript
Page 1: Annotation Free  Information Extraction

Annotation Free Information Extraction

Chia-Hui Chang Department of Computer Science & Information Engineering

National Central University

10/4/2002

Page 2: Annotation Free  Information Extraction

IEPAD: Information Extraction based on Pattern Discovery

C.H. Chang, National Central University. WWW10

Page 3: Annotation Free  Information Extraction

Semi-structured Information Extraction

Information Extraction (IE)
  Input: HTML pages
  Output: A set of records

Page 4: Annotation Free  Information Extraction

Pattern Discovery based IE

Motivation
  Display of multiple records often forms a repeated pattern
  The occurrences of the pattern are spaced regularly and adjacently

Now the problem becomes ...
  Find regular and adjacent repeats in a string

Page 5: Annotation Free  Information Extraction

IEPAD Architecture

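[Figure: IEPAD architecture. The Pattern Generator takes an HTML page and produces patterns; users choose an extraction rule in the Pattern Viewer; the Extractor applies the extraction rule to HTML pages and outputs the extraction results.]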

Page 6: Annotation Free  Information Extraction

The Pattern Generator

Components
  Token translator
  PAT tree constructor
  Pattern validator
  Rule composer

[Figure: pattern generator pipeline. HTML Page → Token Translator → A Token String → PAT Tree Constructor → PAT Trees and Maximal Repeats → Validator → Advanced Patterns → Rule Composer → Extraction Rules]

Page 7: Annotation Free  Information Extraction

1. Web Page Translation

Encoding of HTML source
  Rule 1: Each tag is encoded as a token
  Rule 2: Any text between two tags is translated to a special token called TEXT (denoted by an underscore)

HTML example:

<B>Congo</B><I>242</I><BR>
<B>Egypt</B><I>20</I><BR>

Encoded token string:

T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)
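The two encoding rules can be sketched in a few lines of Python. This is only an illustration under simple assumptions: tags are found with a regular expression rather than a real HTML parser, attributes are dropped, and whitespace-only text between tags produces no TEXT token. The function name `encode` is hypothetical and not part of IEPAD.

```python
import re

# Assumption: a tag is "<", an optional "/", a name, then anything up to ">".
TAG_RE = re.compile(r"<\s*(/?)\s*([A-Za-z][A-Za-z0-9]*)[^>]*>")

def encode(html):
    """Translate HTML into a token list: one token per tag, '_' per text run."""
    tokens, pos = [], 0
    for m in TAG_RE.finditer(html):
        if html[pos:m.start()].strip():      # Rule 2: text between two tags
            tokens.append("_")
        # Rule 1: each tag becomes a token (name upper-cased, attributes dropped)
        tokens.append("<" + m.group(1) + m.group(2).upper() + ">")
        pos = m.end()
    if html[pos:].strip():                   # trailing text after the last tag
        tokens.append("_")
    return tokens

page = "<B>Congo</B><I>242</I><BR>\n<B>Egypt</B><I>20</I><BR>"
tokens = encode(page)
print("".join(f"T({t})" for t in tokens))
```

Both records encode to the same seven-token sequence, which is what makes the repeat discoverable in the next step.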

Page 8: Annotation Free  Information Extraction

Various Encoding Schemes

Figure 2. Tag classification

Block-level tags
  Headings:         H1~H6
  Text containers:  P, PRE, BLOCKQUOTE, ADDRESS
  Lists:            UL, OL, LI, DL, DIR, MENU
  Others:           DIV, CENTER, FORM, HR, TABLE, BR

Text-level tags
  Logical markup:   EM, STRONG, DFN, CODE, SAMP, KBD, VAR, CITE
  Physical markup:  TT, I, B, U, STRIKE, BIG, SMALL, SUB, SUP, FONT
  Special markup:   A, BASEFONT, IMG, APPLET, PARAM, MAP, AREA

Page 9: Annotation Free  Information Extraction

2. PAT Tree Construction

PAT tree: binary suffix tree
  A Patricia tree constructed over all possible suffix strings of a text

Example

Token codes:
  T(<B>)   000
  T(</B>)  001
  T(<I>)   010
  T(</I>)  011
  T(<BR>)  100
  T(_)     110

Token string:
  T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)T(<B>)T(_)T(</B>)T(<I>)T(_)T(</I>)T(<BR>)

Binary string:
  000110001010110011100000110001010110011100

Indexing positions:
  suffix 1  000110001010110011100000110001010110011100$
  suffix 2  110001010110011100000110001010110011100$
  suffix 3  001010110011100000110001010110011100$
  suffix 4  010110011100000110001010110011100$
  suffix 5  110011100000110001010110011100$
  suffix 6  011100000110001010110011100$
  suffix 7  100000110001010110011100$
  suffix 8  000110001010110011100$
  suffix 9  110001010110011100$
  suffix10  001010110011100$
  suffix11  010110011100$
  suffix12  110011100$
  suffix13  011100$
  suffix14  100$
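The binary string and the fourteen indexed suffixes above can be reproduced with a short sketch. It uses the 3-bit token codes from the slide and takes one suffix per token boundary. The helper names `to_bits` and `token_suffixes` are assumptions made for this illustration, and the sketch only lists the suffixes; it does not build the Patricia tree itself.

```python
# 3-bit codes for each token, as given on the slide.
CODES = {"<B>": "000", "</B>": "001", "<I>": "010",
         "</I>": "011", "<BR>": "100", "_": "110"}

def to_bits(tokens):
    """Concatenate the binary code of every token."""
    return "".join(CODES[t] for t in tokens)

def token_suffixes(bits, width=3):
    """Suffixes that start at a token boundary, each terminated by '$'."""
    return [bits[i:] + "$" for i in range(0, len(bits), width)]

record = ["<B>", "_", "</B>", "<I>", "_", "</I>", "<BR>"]
bits = to_bits(record * 2)          # the two encoded records
suffixes = token_suffixes(bits)
for number, suffix in enumerate(suffixes, start=1):
    print(f"suffix {number:2d}  {suffix}")
```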

Page 10: Annotation Free  Information Extraction

The Constructed PAT Tree

[Figure 3. The PAT tree for the Congo code]

Page 11: Annotation Free  Information Extraction

Definition of Maximal Repeats

Let α occur in S at positions p1, p2, p3, …, pk

α is left maximal if there exists at least one (i, j) pair such that S[pi-1] ≠ S[pj-1]

α is right maximal if there exists at least one (i, j) pair such that S[pi+|α|] ≠ S[pj+|α|]

α is a maximal repeat if it is both left maximal and right maximal

Page 12: Annotation Free  Information Extraction

Finding Maximal Repeats

Definition:
  Let's call character S[pi-1] the left character of suffix pi
  A node ν is left diverse if at least two leaves in ν's subtree have different left characters

Lemma:
  The path label α of an internal node ν in a PAT tree is a maximal repeat if and only if ν is left diverse
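The definition of a maximal repeat can be checked directly on a small token string. The sketch below deliberately swaps the PAT tree for a brute-force enumeration of all substrings, which is far slower but makes the left-maximal and right-maximal conditions explicit. The function name and the boundary sentinels are assumptions made for this illustration.

```python
from collections import defaultdict

def maximal_repeats(tokens):
    """Map each maximal repeat (a tuple of tokens) to its 0-based positions."""
    n = len(tokens)
    occurrences = defaultdict(list)
    for i in range(n):
        for j in range(i + 1, n + 1):
            occurrences[tuple(tokens[i:j])].append(i)

    result = {}
    for alpha, positions in occurrences.items():
        if len(positions) < 2:               # a repeat needs two occurrences
            continue
        # Left/right neighbours; string boundaries count as unique sentinels.
        left = {tokens[p - 1] if p > 0 else ("^", p) for p in positions}
        right = {tokens[p + len(alpha)] if p + len(alpha) < n else ("$", p)
                 for p in positions}
        if len(left) > 1 and len(right) > 1:  # left maximal and right maximal
            result[alpha] = positions
    return result

repeats = maximal_repeats(list("adcwbdadcxbadcxbdadcb"))
print(repeats[("a", "d", "c")])
```

On this string "adc" is a maximal repeat, while "ad" is not, because every occurrence of "ad" is followed by the same token "c".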

Page 13: Annotation Free  Information Extraction

3. Pattern Validator

Suppose the occurrences of a maximal repeat α are ordered by position such that p1 < p2 < p3 < … < pk, where pi denotes the position of each suffix in the encoded token sequence.

Characteristics of a pattern

Regularity: variance coefficient

$$V(\alpha) = \frac{\mathrm{StdDev}\{\, p_{i+1} - p_i \mid 1 \le i < k \,\}}{\mathrm{Mean}\{\, p_{i+1} - p_i \mid 1 \le i < k \,\}}$$

Adjacency: density

$$D(\alpha) = \frac{(k-1) \cdot |\alpha|}{p_k - p_1}$$

Page 14: Annotation Free  Information Extraction

Pattern Validator (Cont.)

Basic screening: for each maximal repeat α, compute V(α) and D(α)
  a) check the pattern's variance: V(α) < 0.5
  b) check the pattern's density: 0.25 < D(α) < 1.5

[Flowchart: if V(α) < 0.5 does not hold, discard the pattern; otherwise, if 0.25 < D(α) < 1.5 does not hold, discard the pattern; otherwise keep the pattern.]
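A minimal sketch of the screening step follows. It assumes V is the standard deviation of the gaps between consecutive occurrences divided by their mean, and D is (k-1)·|α| divided by (pk - p1). The population standard deviation is an assumption here, since the slides do not say which estimator is used, and the function names are hypothetical.

```python
from statistics import mean, pstdev

def variance_coefficient(positions):
    """V: spread of the gaps between consecutive occurrences, relative to their mean."""
    gaps = [b - a for a, b in zip(positions, positions[1:])]
    return pstdev(gaps) / mean(gaps)

def density(positions, length):
    """D: fraction of the span from the first to the last occurrence covered by the pattern."""
    k = len(positions)
    return (k - 1) * length / (positions[-1] - positions[0])

def passes_screening(positions, length, max_v=0.5, min_d=0.25, max_d=1.5):
    """Apply the two threshold checks from the slide to one maximal repeat."""
    if len(positions) < 2:
        return False
    return (variance_coefficient(positions) < max_v
            and min_d < density(positions, length) < max_d)

# The 7-token record of the Congo example occurs at positions 1 and 8.
print(variance_coefficient([1, 8]), density([1, 8], 7))
# "adc" in "adcwbdadcxbadcxbdadcb" occurs at 0, 6, 11 and 17.
print(passes_screening([0, 6, 11, 17], 3))
```

A density below 1, as for "adc", means the pattern covers only part of each record, which is the case the rule composer addresses.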

Page 15: Annotation Free  Information Extraction

4. Rule Composer

Occurrence partition
  Flexible variance threshold control
Multiple string alignment
  Increase density of a pattern

Page 16: Annotation Free  Information Extraction

Occurrence Partition

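Problem
  Some patterns are divided into several blocks
  Ex: Lycos, Excite, with a large regularity value (variance coefficient)
Solution
  Clustering of the occurrences of such a pattern

[Flowchart: cluster the occurrences; if V(α) < 0.1 does not hold, discard; otherwise go on to check the density.]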

Page 17: Annotation Free  Information Extraction

Multiple String Alignment

Problem
  Patterns with density less than 1 can extract only part of the information
Solution
  Align k-1 substrings among the k occurrences
  A natural generalization of alignment for two strings, which can be solved in O(n*m) by dynamic programming, where n and m are the string lengths.
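The two-string case mentioned above can be sketched with the usual O(n*m) dynamic program. This is a plain global alignment with unit costs for a mismatch and for a gap. The cost model and the function name `align` are assumptions made for this illustration, and the sketch covers only the pairwise case, not the multiple alignment used by the rule composer.

```python
def align(a, b, gap="-"):
    """Globally align strings a and b; returns the two gapped rows."""
    n, m = len(a), len(b)
    # cost[i][j] = minimum cost of aligning a[:i] with b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),
                             cost[i - 1][j] + 1,
                             cost[i][j - 1] + 1)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append(gap); i -= 1
        else:
            row_a.append(gap); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b))

top, bottom = align("adcwbd", "adcxb")
print(top)
print(bottom)
```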

Page 18: Annotation Free  Information Extraction

Multiple String Alignment (Cont.)

Suppose "adc" is the discovered pattern for token string "adcwbdadcxbadcxbdadcb"

If we have the following multiple alignment for strings "adcwbd", "adcxb" and "adcxbd":

a d c w b d
a d c x b -
a d c x b d

The extraction pattern can be generalized as "adc[w|x]b[d|-]"
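The generalization step can be sketched as a column-by-column scan of the aligned rows: a column with a single symbol is copied, and a column with several symbols becomes a bracketed list of alternatives. The function name `generalize` is hypothetical, and each token is assumed to be a single character, as in the slide's example.

```python
def generalize(aligned_rows):
    """Turn equally long aligned rows into a pattern with [x|y] alternatives."""
    pattern = []
    for column in zip(*aligned_rows):
        symbols = list(dict.fromkeys(column))    # distinct symbols, first-seen order
        if len(symbols) == 1:
            pattern.append(symbols[0])
        else:
            pattern.append("[" + "|".join(symbols) + "]")
    return "".join(pattern)

rule = generalize(["adcwbd", "adcxb-", "adcxbd"])
print(rule)
```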

Page 19: Annotation Free  Information Extraction

Pattern Viewer
  Java-application based GUI
  Web based GUI
    http://www.csie.ncu.edu.tw/~chia/WebIEPAD/

Page 20: Annotation Free  Information Extraction

The Extractor

Matching the pattern against the encoded token string
  Knuth-Morris-Pratt's algorithm
  Boyer-Moore's algorithm
Alternatives in a rule
  Matching the longest pattern
What is extracted?
  The whole record
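As a rough illustration of the extractor, the sketch below converts a rule such as "adc[w|x]b[d|-]" into a Python regular expression and scans the token string with it. It swaps in Python's `re` engine instead of Knuth-Morris-Pratt or Boyer-Moore, and it assumes single-character tokens and that "-" inside brackets means "no token". Non-empty alternatives are tried first so that the longest match wins. The function name `rule_to_regex` is hypothetical.

```python
import re

def rule_to_regex(rule):
    """Compile an extraction rule with [x|y] alternatives into a regex."""
    parts = []
    for piece in re.findall(r"\[[^\]]*\]|.", rule):
        if piece.startswith("["):
            alternatives = piece[1:-1].split("|")
            # "-" stands for an absent token; longer alternatives go first.
            alternatives = sorted(
                ("" if a == "-" else re.escape(a) for a in alternatives),
                key=len, reverse=True)
            parts.append("(?:" + "|".join(alternatives) + ")")
        else:
            parts.append(re.escape(piece))
    return re.compile("".join(parts))

records = rule_to_regex("adc[w|x]b[d|-]").findall("adcwbdadcxbadcxbdadcb")
print(records)
```

The final occurrence "adcb" is not matched, because the rule was generalized from only the first three occurrences.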

Page 21: Annotation Free  Information Extraction

Experiment Setup

Fourteen sources: search engines
Performance measures
  Number of patterns
  Retrieval rate and accuracy rate
Parameters
  Encoding scheme
  Threshold control

Page 22: Annotation Free  Information Extraction

Translation

Table 2. Size of translated sequences and number of patterns

Encoding Scheme   Length of Sequence   No. of Patterns
All Tag           1128                 7.9
No Physical       873                  6.5
No Special        796                  5.7
Block-Level       514                  4.4

Average page length is 22.7 KB

Page 23: Annotation Free  Information Extraction

Accuracy and Retrieval Rate

Table 5. The performance of multiple string alignment

Search Engine    Retrieval Rate   Accuracy Rate   Matching Percentage
AltaVista        1.00             1.00            0.91
Cora             1.00             1.00            0.97
Excite           1.00             0.97            1.00
Galaxy           1.00             0.95            0.99
Hotbot           0.97             0.86            0.88
Infoseek         0.98             0.94            0.87
Lycos            0.94             0.63            0.94
Magellan         1.00             1.00            0.76
Metacrawler      0.90             0.96            0.78
NorthernLight    0.95             0.96            0.90
Openfind         0.83             0.90            0.66
Savvysearch      1.00             0.95            0.97
Stpt.com         0.99             1.00            0.95
Webcrawler       0.98             0.98            0.98
Average          0.97             0.94            0.90

Page 24: Annotation Free  Information Extraction

Problems

Guarantees a high retrieval rate rather than a high accuracy rate
  A generalized rule can extract more than the desired data
Currently applicable only when there are several records in a Web page

Page 25: Annotation Free  Information Extraction

ROADRUNNER: Towards Automatic Data Extraction from Large Web Sites

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo

VLDB2001

Page 26: Annotation Free  Information Extraction

Observations

1. Wrapper generator works by using additional information. (labeled samples)

2. Wrapper induction system has some a priori knowledge about the page organization.

3. Finally, systems generate wrapper by examining one HTML page at a time.

Page 27: Annotation Free  Information Extraction

ROADRUNNER new perspective

1. Doesn't rely on any interaction with the user. (Completely automatic)

2. No a priori knowledge.
   The HTML schema will be inferred along with the wrapper.
   Can handle any nested structures.

3. Works with two HTML pages at a time. (Based on the study of similarities and dissimilarities between the pages)

Page 28: Annotation Free  Information Extraction
Page 29: Annotation Free  Information Extraction

Theoretical Background
  Site generation = Encoding of database content
  Data extraction = Decoding
  The problem is based on a close correspondence between nested types and union-free regular expressions (UFRE).

Page 30: Annotation Free  Information Extraction

Delimiter
  #PCDATA : maps to a string
  + : maps to (possibly nested) lists; the iterator
  ? : maps to nullable fields; optional patterns

Finding the schema and extracting the data = finding the minimal UFRE.

Page 31: Annotation Free  Information Extraction

Matching Technique
  It is based on a matching technique called ACME (Align, Collapse under Mismatch, and Extract).
  HTML → XHTML → tokens
  The matching algorithm works on two objects:
    A list of tokens, called the sample
    A wrapper (one UFRE)
  Matching is done by solving mismatches between the wrapper and the sample.

Page 32: Annotation Free  Information Extraction
Page 33: Annotation Free  Information Extraction

Mismatches

1. String mismatches:
   May be due only to different values of a database field.
   These mismatches are used to discover fields. (#PCDATA)
   Ex: 'John Smith' and 'Paul Jones' at token 4

2. Tag mismatches:
   Optional patterns
   Iterative patterns
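The simpler of the two cases, string mismatches, can be sketched as follows. The function handles only wrappers and samples of equal length whose mismatches are all between text tokens, and it raises an error on a tag mismatch, which in RoadRunner would instead trigger the search for optionals or iterators. The token lists and the function name are illustrative assumptions, not the paper's running example reproduced in full.

```python
def generalize_strings(wrapper, sample):
    """Replace every string mismatch between wrapper and sample by #PCDATA."""
    if len(wrapper) != len(sample):
        raise ValueError("this sketch only handles token lists of equal length")
    generalized = []
    for w, s in zip(wrapper, sample):
        if w == s:
            generalized.append(w)
        elif not w.startswith("<") and not s.startswith("<"):
            generalized.append("#PCDATA")        # string mismatch: a data field
        else:
            raise ValueError(f"tag mismatch between {w!r} and {s!r}")
    return generalized

# Hypothetical token lists; the mismatch falls on token 4 (counting from 1).
wrapper = ["<HTML>", "Books of:", "<B>", "John Smith", "</B>"]
sample = ["<HTML>", "Books of:", "<B>", "Paul Jones", "</B>"]
new_wrapper = generalize_strings(wrapper, sample)
print(new_wrapper)
```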

Page 34: Annotation Free  Information Extraction

Discovering Optionals
  Strategy: look for repeated patterns as a first step and then, if this attempt fails, try to identify an optional pattern.
  Two steps:
  1. Optional Pattern Location by Cross-Search
     Mismatch at token 6: <UL> and <IMG…/>
     Assume the optional pattern is located on the wrapper or on the sample.
  2. Wrapper Generalization
     ( <IMG src=…/> )?

Page 35: Annotation Free  Information Extraction

Discovering Iterators

1. Square Location by Terminal-Tag Search:
   Both the wrapper and the sample contain at least one occurrence of the square.
   Terminal tag = the token at the position before the mismatch
     In this example it is </LI>
   Test which is the initial tag of the square:
     </UL> ~ </LI> vs. <LI> ~ </LI>
   Finally, we can infer that the sample contains one candidate occurrence of the square at tokens 20-25.

Page 36: Annotation Free  Information Extraction

Discovering Iterators (cont'd)

2. Square Matching:
   Try to match the candidate square occurrence (tokens 20-25).
   Backwards: match tokens 25 and 19, then move to 24 and 18, and so on.

3. Wrapper Generalization:
   If we denote the newly found square by s, we replace the repeated pattern by (s)+

Page 37: Annotation Free  Information Extraction

More Complex Example
  First mismatch at token 15 (external mismatch)
  Find iterators:
    Terminal tag = </LI>
    Candidate square is found: <LI> ~ </LI> at tokens 15-28
    Backward match: second mismatch at tokens 23 and 9 (internal mismatch)
    Solve the mismatch recursively

Page 38: Annotation Free  Information Extraction

Recursively solve mismatch
  Internal mismatch at tokens 23 and 9
    Solve it in the same way as an external mismatch.
    But instead of comparing one wrapper and one sample, compare two different portions of the same object.
  Terminal tag = <B>
  Candidate square is </B> ~ <B>, tokens 23-18
  Backward match: mismatch at tokens 20 and 26
  Tokens 20-22 are found to be an optional pattern.

Page 39: Annotation Free  Information Extraction
Page 40: Annotation Free  Information Extraction

Matching as an AND-OR tree
  Finding one solution to match(w,s) corresponds to finding one visit of the AND-OR tree.
  (i) match(w,s) must solve all external mismatches encountered during the parsing (AND node)
  (ii) A mismatch is solved by introducing either one field, one iterator, or one optional (OR)
  (iii) The search may be done either on the wrapper or on the sample (OR)
  (iv) There are various candidates for iterators and optionals (OR)
  (v) Discovering iterators may require recursively solving several internal mismatches (AND)

Page 41: Annotation Free  Information Extraction

AND-OR tree

Page 42: Annotation Free  Information Extraction

Experimental Results

Page 43: Annotation Free  Information Extraction

Experimental Results (con’t)

Page 44: Annotation Free  Information Extraction

Extracting Structured Data from Web Pages

Arvind Arasu, Hector Garcia-Molina

ACM SIGMOD 2003

Page 45: Annotation Free  Information Extraction

Cue
  Keywords: schema, template
  Web pages belonging to the same site are generated by encoding data of the same schema with a common template
  => pages are created from a common template by plugging in values

Page 46: Annotation Free  Information Extraction

Figuration

Page 47: Annotation Free  Information Extraction

Goal and Challenge
  Previous IE techniques rely on human heuristics, e.g. wrappers
    Time consuming and error-prone
    Optional attributes are ignored
  Goal: to deduce the template without human input
  Challenge:
    No obvious way of differentiating which text is template and which is data
    The schema of the data in pages isn't flat but more complex and semi-structured

Page 48: Annotation Free  Information Extraction

Model, Problem Formulation

  Structured Data
  Model of Page Creation
  Optionals and Disjunctions
  Problem Statement
  Miscellaneous Terminology, Definition

Page 49: Annotation Free  Information Extraction

Structured Data
  Token: a token is some basic unit of text
  Structured data: any set of data values conforming to a common schema or type
  Define "type":
    1. Basic type (β): a string of tokens, e.g. <html>, text
    2. Ordered list type (tuple constructor of order n), e.g. <T1, T2, …, Tn>, where T1, T2, …, Tn are types
    3. Set type (set constructor), e.g. {T}, where T is a type

Page 50: Annotation Free  Information Extraction

Define term value and example
  Define "instance":
    1. An instance of the basic type β is any string of tokens
    2. An instance of type <T1, T2, …, Tn> is a tuple of the form <i1, i2, …, in>, where the attributes i1, i2, …, in are instances of types T1, T2, …, Tn
    3. An instance of type {T} is any set of elements {e1, e2, …, em} such that each ei is an instance of type T
  Instance → Value; String → token
  Example:
    Schema S1 = <β, {<β, β>τ3}τ2, β>τ1
    Values x1 = <t, {<f1, l1>, <f2, l2>}, c>, x2 = <t, {<f0, l0>}, c>

Page 51: Annotation Free  Information Extraction

Model of Page Creation

Definition: A template T for a schema S (written TS) is defined as a function that maps each type constructor τ of S into an ordered set of strings T(τ), such that:
  if τ is a tuple constructor of order n, T(τ) is an ordered set of n+1 strings <Cτ1, …, Cτ(n+1)>
  if τ is a set constructor, T(τ) is a single string Sτ

λ(T, x): the encoding of a value x that is an instance of a sub-schema of S

Page 52: Annotation Free  Information Extraction

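Encoding of a value x of schema S

1. If x is of basic type β, then λ(T, x) = x

2. If x = <x1, x2, …, xn>τt is a tuple, then
   λ(T, x) = C1 λ(T, x1) C2 … λ(T, xn) Cn+1, where T(τt) = <C1, …, Cn+1>

3. If x = {e1, e2, …, em}τs is a set, then
   λ(T, x) = λ(T, e1) S λ(T, e2) S … S λ(T, em), where T(τs) = S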

Page 53: Annotation Free  Information Extraction

Example of Schema S1

Schema: S1 = <β, {<β, β>τ3}τ2, β>τ1

Template T1:
  T1(τ1) = <A, B, C, D>
  T1(τ2) = H
  T1(τ3) = <E, F, G>

Value: x1 = <t, {<f1, l1>, <f2, l2>}, c>

Encoding:
  λ(T1, x1) = λ(T1, <t, {<f1, l1>, <f2, l2>}, c>), with T1(τ1) = <A, B, C, D>
    String: A t B λ(T1, {<f1, l1>, <f2, l2>}) C c D
  λ(T1, {<f1, l1>, <f2, l2>}), with T1(τ2) = H
    Substring: λ(T1, <f1, l1>) H λ(T1, <f2, l2>)
  λ(T1, <f1, l1>), with T1(τ3) = <E, F, G>
    Substring: E f1 F l1 G
  String: A t B E f1 F l1 G H E f2 F l2 G C c D

Regular expression: A · B (E · F · G)* C · D, where each · is a data value and H separates consecutive repetitions
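The encoding λ(T, x) for schema S1 can be reproduced with a short recursive function. The sketch assumes that basic values are Python strings, that tuples and sets carry the name of their type constructor, and that the template is a dictionary from constructor names to separator strings. The class and function names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Tup:
    tau: str        # name of the tuple constructor
    items: list     # the n attribute values

@dataclass
class SetVal:
    tau: str        # name of the set constructor
    elems: list     # the elements, in page order

def encode_value(template, x):
    """Compute the page string lambda(T, x) for value x under template T."""
    if isinstance(x, str):                       # case 1: basic type
        return x
    if isinstance(x, Tup):                       # case 2: C1 x1 C2 ... xn Cn+1
        seps = template[x.tau]
        out = seps[0]
        for item, sep in zip(x.items, seps[1:]):
            out += encode_value(template, item) + sep
        return out
    # case 3: set elements joined by the single separator string
    return template[x.tau].join(encode_value(template, e) for e in x.elems)

# Template strings for the three type constructors of schema S1.
T1 = {"tau1": ["A", "B", "C", "D"], "tau2": "H", "tau3": ["E", "F", "G"]}
x1 = Tup("tau1", ["t",
                  SetVal("tau2", [Tup("tau3", ["f1", "l1"]),
                                  Tup("tau3", ["f2", "l2"])]),
                  "c"])
page = encode_value(T1, x1)
print(page)
```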

Page 54: Annotation Free  Information Extraction

Optionals and Disjunctions

Optional: if T is a type, the optional type (T)? ≡ {T}τ, with the constraint |τ| = 0 or 1

Disjunction: if T1 and T2 are types, the disjunction type (T1 | T2) ≡ <{T1}τ1, {T2}τ2>τ, with the constraint |τ1| + |τ2| = 1

Page 55: Annotation Free  Information Extraction

Problem Statement

EXTRACT problem: given a set of n pages pi = λ(T, xi) (1 ≤ i ≤ n), created from some unknown template T and values {x1, …, xn}, deduce the template T and the values from the set of pages alone

Page 56: Annotation Free  Information Extraction

Example of correct solution of EXTRACT (cont.)

Pe = {pe1, pe2, pe3, pe4}

Page 57: Annotation Free  Information Extraction

Example of correct solution of EXTRACT (cont.)

Se = <β, {<β, β, β>τe3}τe2>τe1

xe1 = <Database, {<John, 7, ...>}>
pe1 = λ(TSe, xe1)

xe2 = <Data Mining, {<Jeff, 2, ...>, <Jane, 6, ...>}>
pe2 = λ(TSe, xe2)

In general, pei = λ(TSe, xei)

Page 58: Annotation Free  Information Extraction

Miscellaneous Terminology, Definition

An occurrence of a token in template is called a template-token

An occurrence of a token in value is called a value-token

An occurrence of a token in page is called a page-token

Two page-tokens in Pe have the same role iff they have been generated by the same template-token

Page 59: Annotation Free  Information Extraction

Overview Approach - EXALG

(ECGM)

Page 60: Annotation Free  Information Extraction

EXALG - ECGM - FINDEQ (step 2)

The module used to compute equivalence classes (ε): sets of tokens having the same frequency of occurrence in every page of Pe

  Ex. εe1: { <html>, <body>, Book, Reviews, <ol>, </ol>, </body>, </html> }
  Ex. εe3: { <li>, Reviewer, Rating, Text, </li> }

EXALG retains only the equivalence classes that are large and frequently occurring (LFEQs)
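The grouping performed by FINDEQ can be sketched by computing, for every token, its vector of occurrence counts across the pages and then grouping tokens with identical vectors. The toy pages, the function names, and the reading of the thresholds as "at least SIZETHRES tokens" and "present in at least SUPTHRES pages" are assumptions made for this illustration, not EXALG's actual implementation.

```python
from collections import defaultdict

def equivalence_classes(pages):
    """Group tokens by occurrence vector; pages are lists of tokens."""
    vocabulary = set().union(*pages)
    classes = defaultdict(list)
    for token in sorted(vocabulary):
        vector = tuple(page.count(token) for page in pages)
        classes[vector].append(token)
    return dict(classes)

def large_and_frequent(classes, size_thres, sup_thres):
    """Keep classes with enough tokens (size) and enough pages (support)."""
    kept = {}
    for vector, tokens in classes.items():
        support = sum(1 for count in vector if count > 0)
        if len(tokens) >= size_thres and support >= sup_thres:
            kept[vector] = tokens
    return kept

# Three toy pages with 1, 2 and 0 reviews respectively.
pages = [
    "<html> Book <ol> <li> Reviewer </li> </ol> </html>".split(),
    "<html> Book <ol> <li> Reviewer </li> <li> Reviewer </li> </ol> </html>".split(),
    "<html> Book <ol> </ol> </html>".split(),
]
classes = equivalence_classes(pages)
lfeqs = large_and_frequent(classes, size_thres=3, sup_thres=3)
print(classes)
```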

Page 61: Annotation Free  Information Extraction

EXALG - ECGM - HANDINV (step 3)

The module used to detect and remove invalid LFEQs: those that are not formed by tokens associated with a type constructor

Page 62: Annotation Free  Information Extraction

DIFFFORM (step 1) and DIFFEQ (step 4)

The modules used to add more tokens to LFEQs by differentiating roles
  Ex. Name has multiple roles: one occurs in Book Name and the other occurs in Reviewer Name
Differentiating the multiple roles:
  The occurrences are on different paths from the root in the HTML parse tree (DIFFFORM)
  The occurrences are in different positions with respect to the LFEQ εe1 (DIFFEQ)
dtoken: ex. Name5 and Name14
  regard NameA and NameB as different tokens

Page 63: Annotation Free  Information Extraction

Review ECGM

1. DIFFFORM: find dtokens from paths in the HTML parse tree
2. FINDEQ: find LFEQs
3. HANDINV: detect and remove invalid LFEQs
4. DIFFEQ: find dtokens from positions in valid LFEQs

Page 64: Annotation Free  Information Extraction

Example After ECGM Process
  εe1: { <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html> }
    size 8 → 13
  εe3: { <li>, <b>, Reviewer, Name, </b>, <b>, Rating, </b>, <b>, Text, </b>, </li> }
    size 5 → 12
  Position: empty and non-empty

Page 65: Annotation Free  Information Extraction

Construct Schema from ECGM

Construct schema S' from εe1
  The 1st non-empty position is of basic type β
  The 2nd non-empty position holds εe3, generated by the set type constructor τe3
  → T(τe1) = <C11, C12, C13>, S' = <β, {S''}τe2>τe1
  → T(τe2) = S'' = <C31, C32, C33, C34>
  → T(τe3) = <C31, C32, C33, C34>, <β, β, β>τe3

S' = <β, {<β, β, β>τe3}τe2>τe1

Page 66: Annotation Free  Information Extraction

Equivalence Classes (Cont.)
  Pages P = {p1, …, pn}, pi = λ(TS, xi)
  TS = {τ1, …, τk}: type constructors
  Definition: all tokens of an equivalence class have the same occurrence vector
    ex. εe1: <1,1,1,1>; εe3: <1,2,1,0>
  Observation 1: tokens associated with the same type constructor τj in T that have unique roles occur in the same equivalence class (used to decide whether an EQ class is valid or not)
  Support of a token: number of pages that contain it
  Size of an EQ class: number of tokens in it

Page 67: Annotation Free  Information Extraction

Equivalence Classes (Cont.)
  Observation 2: for real pages, an equivalence class of large size and support is usually valid
  Properties of an EQ class <t1, …, tm>:
    Ordered
    Nested: the span of all occurrences of εi is either within some fixed Position_p of another class or doesn't overlap it
  Observation 3: a valid equivalence class is ordered, and a pair of valid equivalence classes is nested

Page 68: Annotation Free  Information Extraction

Handling Invalid Equivalence Classes
  Detect the existence of invalid LFEQs using violations of the ordered and nested properties
  If there are any, discard some of the LFEQs and break others into smaller LFEQs

Differentiating roles of tokens
  By path: tokens with different roles are on different paths of the HTML parse tree
  By position: tokens with different roles are located at different (non-empty) positions

Page 69: Annotation Free  Information Extraction

Equivalence Class Generation Module

OUTPUT: a set of LFEQs of dtokens, and the pages represented as strings of dtokens

FINDEQ: 2 parameters are used to decide LFEQs (SIZETHRES, SUPTHRES)

On the running example:
  SIZETHRES = SUPTHRES = 3
  Number of iterations = 2; εe1 and εe3 are found

Page 70: Annotation Free  Information Extraction

Building Template and Extracting Values

Input to this module is {ε1, ε2, …, εm}

The ANALYSIS stage consists of 2 modules: CONSTTEMP and EXVAL

CONSTTEMP, for εi = {d1, d2, …, dl}
  Starts from the basic ε1 = { <html>, <body>, …, </body>, </html> }
  Recursively constructs a template Tεi corresponding to εi, and a template Tεi,p corresponding to each non-empty position p of εi
  Checks whether the set of strings PosString(εi, p) corresponding to position p has some recognizable pattern

Page 71: Annotation Free  Information Extraction

Example

In the running example, PosString(εe1+, 6) is a string of dtokens for every occurrence of εe1+, which matches Pattern 5 of the table; PosString(εe1+, 10) is always a string of 0 or more occurrences of εe3+, which matches Pattern 1.

εe1: { <html>, <body>, <b>, Book, Name, </b>, <b>, Reviews, </b>, <ol>, </ol>, </body>, </html> }

Page 72: Annotation Free  Information Extraction

Assumptions

The 4 assumptions:
  (A1) A large number of tokens occurring in the template have unique roles
  (A2) The EQ class derived from a type constructor is recognized as an LFEQ
  (A3) Irregularity in the encoded data that leads to invalid EQ classes
  (A4) The separators are around data values. In this model, the strings associated with type constructors are non-empty

Page 73: Annotation Free  Information Extraction

Evaluation

Leaf attribute Am in schema Sm
  Correct: the set of values of Am in the page is equal to the set of extracted values Ae in the page
  Partially correct: the set of values of Am in the page is not equal to the set of extracted values Ae in the page, but each is part of a value of Ae
  Incorrect: neither correct nor partially correct

Page 74: Annotation Free  Information Extraction

Result
  For 18 (or 40%) of the input collections, the system correctly extracted all the attributes
  Around 80% of the attributes were extracted correctly
  Normalized average
  Input size <= 10
  Parameter = 3

Page 75: Annotation Free  Information Extraction

Conclusion
  EXALG uses 2 novel concepts, equivalence classes and differentiating roles, to discover the template
  The impact of a failed assumption is limited to a few attributes
  Future work:
    Develop techniques for crawling, indexing, and providing querying support for the structured pages on the web
    Develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template

Page 76: Annotation Free  Information Extraction

References

C.H. Chang and S.C. Lui. IEPAD: Information Extraction based on Pattern Discovery. WWW2001, pp. 681-688.

Valter Crescenzi, Giansalvatore Mecca, Paolo Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB2001, pp. 109-118.

Arvind Arasu, Hector Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD2003, pp. 337-348.

