
Stochastic models for semi-structured document mining

P. Gallinari
Collaboration with G. Wisniewski – L. Denoyer – F. Maes

LIP6
University Pierre et Marie Curie, France

2006-04-27 - LIPN - P. Gallinari 2

Outline

- Context
- Generative tree models
- 3 problems: classification, clustering, document mapping
- Experiments
- Conclusion and future work
- XML Document Mining Challenge


Context - Machine learning in the structured domain

- Model, classify, cluster structured data. Domains: chemistry, biology, XML, etc. Models: discriminant (e.g. kernels), generative (e.g. tree densities).
- Predict structured outputs. Domains: natural language parsing, taxonomies, etc. Models: relational learning, large-margin extensions.
- Learn to associate structured representations, aka tree mapping. Domains: databases, semi-structured data.


Context - Machine learning in the structured domain

Structure only vs. structure + content. The central issue is complexity:

- Representation space (#words, #tags, #relations)
- Search space for structured outputs (idem)

Large corpora need simple and approximate methods.


Context - XML semi-structured documents

(Figure: an XML tree for an article, with nodes such as <article>, <hdr>, <bdy>, <sec>, <st>, <p>, <fig>, <fgc>, and text leaves.)


Document model

A document is a pair $d = (s^d, t^d)$: $s^d$ its structure tree and $t^d$ its content.

$P(D = d \mid \Theta) = P(S = s^d, T = t^d \mid \Theta) = P(S = s^d \mid \Theta)\, P(T = t^d \mid S = s^d, \Theta)$

$P(S = s^d \mid \Theta)$ is the structural probability, $P(T = t^d \mid S = s^d, \Theta)$ the content probability.


Document Model: Structure

Belief networks over the document tree. Example tree: a Document with a Title ("Document title") and a Body; the Body contains a first Section (Title "Section title", "This section contains two paragraphs", then "First paragraph" and "Second paragraph") and a second Section ("The second section contains no paragraphs").

Three structural models, illustrated on the tree Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph)):

- Independent labels: $P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d)$
- Parent conditioning: $P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d \mid \text{label}(\text{parent}(n_i^d)))$
- Parent and preceding sibling: $P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d \mid \text{label}(\text{parent}(n_i^d)), \text{label}(\text{previous}(n_i^d)))$
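As a concrete illustration of the parent-conditioned structural model, here is a minimal sketch (not the authors' implementation); the tree encoding and the conditional probability values are invented:

```python
# Sketch: structural probability of a labeled tree under the
# parent-conditioned model  P(s) = prod_i P(s_i | label(parent(n_i))).
# A tree is (label, [children]); the CPT values below are hypothetical.

def structural_prob(tree, cpt, parent_label="ROOT"):
    label, children = tree
    p = cpt[(label, parent_label)]  # P(label | parent label)
    for child in children:
        p *= structural_prob(child, cpt, label)
    return p

# Example tree: Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph))
tree = ("Document", [
    ("Intro", []),
    ("Section", [("Paragraph", []), ("Paragraph", [])]),
    ("Section", [("Paragraph", [])]),
])

cpt = {  # hypothetical P(label | parent label)
    ("Document", "ROOT"): 1.0,
    ("Intro", "Document"): 0.2,
    ("Section", "Document"): 0.4,
    ("Paragraph", "Section"): 0.5,
}

p = structural_prob(tree, cpt)  # 1.0 * 0.2 * 0.4**2 * 0.5**3 = 0.004
```

The same recursion, with richer conditioning, gives the parent-and-sibling variant.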


Document Model: Content

One content model per node, with a first-order dependency: a local generative model for each label.

$t^d = (t_1^d, \ldots, t_{|s^d|}^d)$

$P(t^d \mid s^d, \theta) = \prod_{i=1}^{|s^d|} P(t_i^d \mid s_i^d, \theta), \qquad P(t_i^d \mid s_i^d, \theta) = P(t_i^d \mid \theta_{s_i^d})$
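A minimal sketch of the per-label content model, with one smoothed unigram model per label; the training texts and the add-one smoothing are illustrative assumptions, not the setup used in the experiments:

```python
import math
from collections import Counter

# Sketch: one local generative (unigram) content model theta_l per label.
# P(t^d | s^d) factorizes over nodes: each node's text is scored by the
# model of its label. Training texts below are invented.

def train_unigram(texts):
    counts = Counter(w for t in texts for w in t.split())
    total, vocab = sum(counts.values()), len(counts)
    # add-one smoothing so unseen words keep a non-zero probability
    return lambda w: (counts[w] + 1) / (total + vocab + 1)

def content_log_prob(nodes, models):
    """log P(t^d | s^d) = sum_i log P(t_i | theta_{s_i})."""
    return sum(math.log(models[label](w))
               for label, text in nodes for w in text.split())

models = {
    "Title": train_unigram(["document title", "section title"]),
    "Paragraph": train_unigram(["first paragraph", "second paragraph"]),
}
doc = [("Title", "document title"), ("Paragraph", "first paragraph")]
lp = content_log_prob(doc, models)  # finite negative log-probability
```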


Final network

Example tree: Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph)), with node contents T1 = "This document is an example of a tree-structured document" (Intro), T2 = "This is the first section of the document" (first Section), T3 = "The first paragraph", T4 = "The second paragraph", T5 = "The second section" (second Section), T6 = "The third paragraph".

$P(d) = P(\text{Intro} \mid \text{Document}) \, P(\text{Section} \mid \text{Document})^2 \, P(\text{Paragraph} \mid \text{Section})^3 \times P(T_1 \mid \text{Intro}) \, P(T_2 \mid \text{Section}) \, P(T_3 \mid \text{Paragraph}) \, P(T_4 \mid \text{Paragraph}) \, P(T_5 \mid \text{Section}) \, P(T_6 \mid \text{Paragraph})$


Different learning techniques

Likelihood maximization:

$L = \sum_{d \in TRAIN} \log P(d \mid \Theta) = \sum_{d \in TRAIN} \log P(s^d \mid \Theta) + \sum_{d \in TRAIN} \sum_{i=1}^{|s^d|} \log P(t_i^d \mid s_i^d, \Theta) = L_{structure} + L_{content}$

Discriminant learning: logistic function, error minimization. The class posterior is a logistic function of the log-likelihood ratio:

$P(c \mid x) = \frac{e^{g(x)}}{1 + e^{g(x)}}, \qquad g(x) = \log \frac{P(x \mid c)}{P(x \mid \bar{c})}$


Fisher Kernel

Fisher score: $U_X = \nabla_\theta \log P(X \mid \theta)$

Hypothesis: the gradient of the log-likelihood is informative about how much a feature "participates" in the generation of an example.

Fisher kernel: $K(X, Y) = K(U_X, U_Y)$
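For a simple unigram model the Fisher score has a closed form, which makes the idea easy to sketch; the parameters and counts below are invented, and the information-matrix normalization of the full Fisher kernel is omitted:

```python
# Sketch of the Fisher-kernel idea for a unigram model with parameters
# theta_w: since log P(X|theta) = sum_w n_w(X) log theta_w, the Fisher
# score is U_X[w] = d log P / d theta_w = n_w(X) / theta_w.

def fisher_score(word_counts, theta):
    return {w: word_counts.get(w, 0) / p for w, p in theta.items()}

def fisher_kernel(ux, uy):
    # plain dot product of the two score vectors
    return sum(ux[w] * uy[w] for w in ux)

theta = {"xml": 0.5, "tree": 0.3, "model": 0.2}  # hypothetical parameters
ux = fisher_score({"xml": 2, "tree": 1}, theta)  # U_X
uy = fisher_score({"xml": 1, "model": 1}, theta)  # U_Y
k = fisher_kernel(ux, uy)  # (2/0.5)*(1/0.5) = 8.0
```

The score vectors can then feed any kernel classifier, e.g. an SVM, which is how the Fisher kernel is used for categorization here.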


Use with the model

$U_d = \nabla_\Theta \left( \log P(s^d \mid \Theta_s) + \log P(t^d \mid s^d, \Theta_t) \right) = \nabla_{\Theta_s} \log P(s^d \mid \Theta_s) + \sum_{l \in \Lambda} \sum_{i : s_i^d = l} \nabla_{\Theta_{t_l}} \log P(t_i^d \mid s_i^d, \Theta_{t_l})$

The Fisher vector is the concatenation of a sub-vector corresponding to the gradient on the structure model and one sub-vector per label $l_1, \ldots, l_{|\Lambda|}$ corresponding to the gradient for the nodes with that label:

$U_d = \left( \nabla_{\Theta_s} \log P(s^d \mid \Theta_s), \; \nabla_{\Theta_{t_{l_1}}} \log P(t^d \mid s^d, \Theta_t), \; \ldots, \; \nabla_{\Theta_{t_{l_{|\Lambda|}}}} \log P(t^d \mid s^d, \Theta_t) \right)$


Remark

Fisher kernels have a very large number of parameters. On INEX: with flat models, 200,000 parameters; with structure models, 20 million parameters.


Conclusion about this family of generative models

A natural setting for modeling semi-structured multimedia documents:

- Structural probability (belief network)
- Content probability (local generative model)

Learning with maximum likelihood or cross-entropy; discriminant learning and Fisher kernel.


Classification

One model for each category. 3 XML corpora + 1 multimedia corpus:

- INEX: 12,000 articles from 18 IEEE journals
- WebKB: web pages (8K pages), 7 topics (course, department, ...)
- WIPO: XML documents of patents, categories of patents
- NetProtect (European project): 100,000 web pages, pornographic or not


Categorization: Generative models

Corpus | Model     | F1 macro | F1 micro
INEX   | NB        | 0.605    | 0.59
INEX   | Structure | 0.622    | 0.619
WebKB  | NB        | 0.706    | 0.801
WebKB  | Structure | 0.743    | 0.827
WIPO   | NB        | 0.565    | 0.662
WIPO   | Structure | 0.604    | 0.677


Discriminant models

INEX                  | F1 macro | F1 micro
NB                    | 0.605    | 0.59
Structure model       | 0.622    | 0.619
SVM TF-IDF            | 0.564    | 0.534
Fisher kernel         | 0.668    | 0.661
Discriminant learning | 0.600    | 0.575

WIPO                  | F1 macro | F1 micro
NB                    | 0.565    | 0.662
Structure model       | 0.604    | 0.677
SVM TF-IDF            | 0.71     | 0.822
Fisher kernel         | 0.715    | 0.862

WebKB                 | F1 macro | F1 micro
NB                    | 0.706    | 0.801
Structure model       | 0.743    | 0.827
SVM TF-IDF            | 0.651    | 0.737
Fisher kernel         | 0.738    | 0.823
Discriminant learning | 0.792    | 0.868


Multimedia model

Model                              | Macro average recall | Micro average recall
NB                                 | 89.9 [89.2; 90.4]    | 88.4 [87.7; 89]
Structure model with text          | 92.5 [91.9; 93]      | 92.9 [92.3; 93.3]
Structure model with pictures      | 83 [82.2; 83.7]      | 82.7 [81.9; 83.4]
Structure model, text and pictures | 93.6 [93.1; 94]      | 94.7 [94.2; 95.1]

Example document: "Director Ang Lee Takes Risks with Mean Green 'Hulk'" - LOS ANGELES (Reuters) - Taiwan-born director Ang Lee, perhaps best known for his Oscar-winning "Crouching Tiger, Hidden Dragon," is taking a big risk with the splashy summer popcorn flick ... For loyal comic book fans who may think Lee's "Hulk" will be too touchy-feely, think again. "This is a drama, a family drama," said Lee, "but with big action." His slumping shoulders twitch and he laughs ... FAMILY DRAMA, BIG ACTION


Classification: conclusion

The structure model is able to handle structure and content information, and both structure and content carry class information. It also applies to multimedia categorization.

Not in this talk: categorization of parts of documents; categorization of trees (structure only).


Clustering

The usual goal is to find groups of similar documents (in a thematic sense). The task is different for structured documents: what does "similar documents" mean? Same structure? Same content? Both? This is an open question.


Clustering

Mixture model:

$P(d \mid \Theta) = \sum_{i=1}^{|C|} \alpha_{c_i} \, P(s^d \mid c_i, \Theta)$

EM algorithm (CEM). Used on the structure (only) with the INEX corpus.
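A minimal sketch of the CEM (hard EM) loop: each document is reduced to its bag of node labels, and each cluster keeps a smoothed label multinomial as a crude stand-in for the belief-network structural probability. The data, initialization, and uniform mixing proportions are invented assumptions:

```python
import math
from collections import Counter

# Sketch: classification EM (CEM) for clustering documents by structure.
# Each document is a bag of node labels; each cluster c has a smoothed
# multinomial over labels standing in for P(s^d | c).

def cem(docs, k, iters=10):
    assign = [i % k for i in range(len(docs))]  # deterministic init
    labels = sorted({l for d in docs for l in d})
    for _ in range(iters):
        # M step: per-cluster label multinomial with add-one smoothing
        models = []
        for c in range(k):
            counts = Counter()
            for d, a in zip(docs, assign):
                if a == c:
                    counts.update(d)
            total = sum(counts.values())
            models.append({l: (counts[l] + 1) / (total + len(labels))
                           for l in labels})
        # C step (hard assignment): most likely cluster per document
        assign = [max(range(k),
                      key=lambda c: sum(math.log(models[c][l]) for l in d))
                  for d in docs]
    return assign

docs = [["sec", "p", "p"], ["sec", "p"], ["fig", "fgc"], ["fig", "fgc", "fgc"]]
groups = cem(docs, k=2)  # -> [0, 0, 1, 1]
```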


Different models

Example tree: Document(Intro, Section(Paragraph, Paragraph), Section(Paragraph)). Under the children-given-parent model its structural probability is

$P(s^d) = P(\text{Intro};\text{Section};\text{Section} \mid \text{Document}) \, P(\text{Paragraph};\text{Paragraph} \mid \text{Section}) \, P(\text{Paragraph} \mid \text{Section})$

The candidate models:

- Naive Bayes: $P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d)$
- Parent: $P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d \mid \text{label}(\text{parent}(n_i^d)))$
- Parent and preceding sibling: $P(s^d) = \prod_{i=1}^{|s^d|} P(s_i^d \mid \text{label}(\text{parent}(n_i^d)), \text{label}(\text{previous}(n_i^d)))$


The grammar model

Production probabilities are estimated from three example trees (Tree 1, Tree 2, Tree 3) over the labels A, B, C, E:

$P(A, C \mid A) = 3/5, \quad P(B \mid A) = 2/5, \quad P(E, E, A \mid B) = 2/2, \quad P(B, C \mid C) = 1/1$
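Estimating these production probabilities from a set of trees can be sketched as follows; the toy trees are invented and do not reproduce the slide's counts:

```python
from collections import Counter

# Sketch: grammar-model production probabilities from trees.
# A tree is (label, [children]); P(children-labels | parent-label) is the
# relative frequency of that production among the parent label's expansions.

def productions(tree, out):
    label, children = tree
    if children:
        out[(label, tuple(c[0] for c in children))] += 1
        for c in children:
            productions(c, out)

def estimate(trees):
    counts = Counter()
    for t in trees:
        productions(t, counts)
    parent_totals = Counter()
    for (parent, _), n in counts.items():
        parent_totals[parent] += n
    return {rule: n / parent_totals[rule[0]] for rule, n in counts.items()}

t1 = ("A", [("A", [("B", [])]), ("C", [])])
t2 = ("A", [("B", [])])
probs = estimate([t1, t2])
# -> P(A,C | A) = 1/3, P(B | A) = 2/3
```

Each estimated rule with its probability corresponds to one element declaration of an inferred DTD-like grammar, as the next slide shows.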


Grammar model and DTD

From the same trees, the grammar rules with their probabilities:

A -> A C [3/5]
A -> B [2/5]
B -> E E A [1]
C -> B C [1]

and the corresponding DTD:

<!DOCTYPE A [
  <!ELEMENT A (A,C)>
  <!ELEMENT A (B)>
  <!ELEMENT B (E,E,A)>
  <!ELEMENT C (B,C)>
]>


Clustering results

(Plot: average micro entropy (%) against the number of clusters (5 to 40), for the Naive Bayes, Parent, Parent and Grandparent, Parent and Sibling, and Grammar models.)

Example of DTDs

Clustering: conclusions

Mixture model of belief networks; among the different models, the grammar model is better, and it is able to compute a kind of DTD. Clustering of XML documents remains an ill-defined problem.


Structural heterogeneity

<Restaurant>
  <Nom>L’olivier</Nom>
  <Description>Ce joli restaurant localisé près du métro Jaurès, au 19 du boulevard de la vilette, perdu dans le 19ème arrondissement de Paris propose une cuisine italienne, notamment des pâtes fraîches au 3 fromages.</Description>
</Restaurant>

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>

<Restaurant>
  <Nom>Tokyo Bar</Nom>
  <Adresse><Ville>Paris</Ville><Arrd>19</Arrd><Rue>Bolivar</Rue><Num>127</Num></Adresse>
  <Plat>Sushi</Plat>
  <Plat>Sashimi</Plat>
</Restaurant>

Problem: querying heterogeneous XML databases or collections, storage, etc. requires knowing the correspondence between the structured representations.


Document mapping problem

Problem: learn from examples how to map heterogeneous sources onto a predefined target schema, preserving the document semantics. Sources: semi-structured, HTML, PDF, flat text, etc. This is a labeled tree mapping problem.

Example: the source document

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>65 rue des pyrénées, Paris, 19ème, FRANCE</Adresse>
  <Spécialités>Canard à l’orange, Lapin au miel</Spécialités>
</Restaurant>

is mapped onto the target schema:

<Restaurant>
  <Nom>La cantine</Nom>
  <Adresse>
    <Ville>Paris</Ville>
    <Arrd>19</Arrd>
    <Rue>pyrénées</Rue>
    <Num>65</Num>
  </Adresse>
  <Plat>Canard à l’orange</Plat>
  <Plat>Lapin au miel</Plat>
</Restaurant>


Document mapping problem

Central issue: complexity. Large collections; large feature space ($10^3$ to $10^6$); large, exponential search space.

Approach: learn generative models of XML target documents from a training set, then decode unknown sources according to the learned model.


Learning the correspondence via examples

Why use ML for structure matching?

- Multiple sources: variability, documents do not follow the schema, collection growth, etc.
- Web sources: DTDs and schemas are often unknown or do not exist.


Learning correspondence

Data-centered view (Doan et al.): multiple independent classifier combination; centralized (mediator) or P2P; 1:1 or m:n transformations.

Document-centered view:
- Document conversion (Xerox): rendering format (HTML, PDF, etc.) -> XML predefined DTD format
- Information retrieval (LIP6): content and structure queries (e.g. INEX)


Problem formulation

Given $S_T$ a target format and $d_{in}(d)$ an input document, find the most probable target document:

$d_{S_T}(d) = \arg\max_{d' \in S_T} P(d' \mid d_{in})$

Decoding is performed with a learned transformation model.


General restructuration model

With source document $(s^d, t^d)$ and target document $(s^{d'}, t^{d'})$:

$d' = \arg\max_{d'} P(s^{d'} \mid s^d, \Theta_1) \, P(t^{d'} \mid s^{d'}, t^d, \Theta_2)$


Instance 1: Label mapping

A subtask of structure mapping: the tree structure remains unchanged; learn to automatically label the nodes.

$d'_{final} = \arg\max_{s'^d_1, \ldots, s'^d_{|d|} \in \Lambda} P(s'^d_1, \ldots, s'^d_{|d|} \mid \Theta') \, P(t^d_1, \ldots, t^d_{|d|} \mid s'^d_1, \ldots, s'^d_{|d|}, \Theta')$

(Figure: labeled training trees Tree 1, Tree 2, Tree 3 over the labels A, B, C, E, and a test tree of the same shape whose node labels "?" must be predicted.)


Document structure model

Example tree: Document(Title, Section, Section), where the first Section contains two Paragraphs (one with an Italic child) and the second contains a Paragraph and a FootNote. The observed productions are (Title, Section, Section), (Paragraph, Paragraph) and (Paragraph, FootNote).

$P(s \mid \theta) = \prod_{n \in \text{nodes}(d)} P(\text{childrentags}(n) \mid \text{tag}(n), \theta)$


PCFG model

Same example as the grammar model (Tree 1, Tree 2, Tree 3 over the labels A, B, C, E):

$P(A, C \mid A) = 3/5, \quad P(B \mid A) = 2/5, \quad P(E, E, A \mid B) = 2/2, \quad P(B, C \mid C) = 1/1$

with rules A -> A C [3/5], A -> B [2/5], B -> E E A [1], C -> B C [1].


Instance 2: plain text structuring

(Figure: a flat text, "Structuration automatique / F. Maes / P. Gallinari / Problématique / etc.", to be automatically segmented and structured.)


Stochastic model

$d = (c, s)$, with $s = (s_e, s_i)$


Sub-optimal approach

Segmentation and structuration are performed sequentially: first segmentation, then structure extraction.


Models

- Segmentation: HMM
- Structure: the document structure model (example tree: Document(Intro, Section, Section) with Paragraph leaves)


Instance 3: HTML to XML

Hypotheses. Input document: HTML tags are mostly for visualization, so remove the tags and keep only the segmentation (leaves). Transformation: the leaves are the same in the HTML and XML documents; in the target document model, a node label depends only on its local context (content, left sibling, father).

Problem representation

Model and training

Probability of the target tree; the document model is a maximum-entropy conditional model learned from a training set of target documents:

$P(d_T(d_{in})) = P(d_1, \ldots, d_{|d_T|}) = \prod_{i=1}^{|d_T|} P(n_i \mid c_i, sib(n_i), father(n_i))$

where $c_i$ is the content, $sib(n_i)$ the left sibling and $father(n_i)$ the father of node $n_i$.
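The local conditional model can be sketched with a softmax over hand-set feature weights and greedy left-to-right decoding; the tags, features, and weights below are invented, the father feature is omitted for brevity, and the real model is a trained maximum-entropy model:

```python
import math

# Sketch: predict each node's tag from its local context (content word,
# left-sibling tag) with a softmax over feature weights.

TAGS = ["title", "paragraph"]
WEIGHTS = {  # (feature, tag) -> weight; all values hypothetical
    (("word", "introduction"), "title"): 2.0,
    (("word", "text"), "paragraph"): 1.5,
    (("prev", "title"), "paragraph"): 1.0,
}

def p_tag(word, prev_tag):
    feats = [("word", word), ("prev", prev_tag)]
    scores = {t: sum(WEIGHTS.get((f, t), 0.0) for f in feats) for t in TAGS}
    z = sum(math.exp(s) for s in scores.values())
    return {t: math.exp(s) / z for t, s in scores.items()}

def label_leaves(words):
    """Greedy left-to-right labeling: prod_i P(n_i | c_i, sib(n_i))."""
    prev, out = "none", []
    for w in words:
        probs = p_tag(w, prev)
        prev = max(probs, key=probs.get)
        out.append(prev)
    return out

tags = label_leaves(["introduction", "text", "text"])
# -> ["title", "paragraph", "paragraph"]
```

Greedy decoding is only one option; the next slide discusses exact and approximate decoding of the full tree.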


Decoding

Solve $d_{S_T}(d) = \arg\max_{d' \in S_T} P(d' \mid d_{in})$.

- Exact dynamic programming decoding: $O(|\text{leaf nodes}|^3 \cdot |\text{tags}|)$
- Approximate solution with LASO (Hal Daumé, ICML 2005): $O(|\text{leaf nodes}| \cdot |\text{tags}| \cdot |\text{tree nodes}|)$


Experiments

INEX corpus, IEEE collection (XML): 12,000 documents (training: 7,800; test: 4,200); ≈ 5,000,000 content nodes; 139 tags; mean document depth ≈ 7; vocabulary ≈ 22,000 words. Test corpus: the Transactions On ... series, unlabeled documents (tags removed).


Instance 1: Label mapping - results

         | Naive model | Struct + Content | Structure | Content
5 tags   | 79.3%       | 86.5%            | 72.9%     | 58%
139 tags | 9.5%        | 65.3%            | 49.7%     | 27.8%


Instance 1: IR adapted measure

(Plots for 5 tags and 139 tags: % of documents against % of correctly labeled nodes, for Structure, Structure and Content, Content with all nodes, and Content without empty nodes.)


Instance 2: plain text structuring - results

Model       | Structuration (internal nodes) | Segmentation (leaves) | Labeling
HMM + TMM   | 22.8%                          | 24.6%                 | 91.5%
Exact + TMM | 31.2%                          | 75.7%                 | 92.8%

Extreme structuration instance; Exact + TMM: degraded version of HTML document structuration.


Instance 3: HTML to XML

- IEEE collection / INEX corpus: 12K documents; average 500 leaf nodes, 200 internal nodes; 139 tags
- Movie DB: 10K movie descriptions (IMDB); average 100 leaf nodes, 35 internal nodes; 28 tags
- Shakespeare: 39 plays; few documents, but on average 4,100 leaf nodes, 850 internal nodes; 21 tags
- Mini-Shakespeare: 60 randomly chosen scenes from the plays; 85 leaf nodes, 20 internal nodes; 7 tags


Performances


Conclusion

Document restructuration is a new problem: a tree transformation problem of high complexity (content + structure), with many different instances. Our approach is based on generative models of target documents.


XML Document Mining Challenge 2006

INEX-Delos and Pascal networks of excellence. Three tasks: classification, clustering, document mapping. 3 XML corpora: IEEE collection, IMDB (movie descriptions), Wikipedia in 4 languages. Deadline: June 2006.

Web site: http://xmlmining.lip6.fr
Email: [email protected]