Incremental and Dynamic Text Mining
Graph structure discovery and visualization
DUBOIS Vincent, QUAFAFOU Mohamed — {dubois;quafafou}@irin.univ-nantes.fr
IRIN (Universite de Nantes)
Abstract. This paper tackles the problem of knowledge discovery in text collections and the dynamic display of the discovered knowledge. We claim that these two problems are deeply interleaved, and should be considered together. The contribution of this paper is fourfold: (1) a description of the properties needed for a high-level representation of concept relations in text, (2) a stochastic measure for a fast evaluation of dependencies between concepts, (3) a visualization algorithm to display dynamic structures, and (4) a deep integration of discovery and knowledge visualization, i.e. the placement of nodes and edges automatically guides the discovery of the knowledge to be displayed. The resulting program has been tested using two data sets from the specific domains of molecular biology and WWW howtos.
1 Introduction
2 Graph structure discovery
2.1 Graph structure semantic
Given a set of concepts, we want to extract the dependencies between them. The most natural way to display such information is a directed graph: each concept is a node, and an edge between two nodes denotes a dependence.
Binary mutual entropy Dependence between concepts is a high-level, abstract relation, and it is not easy to express. A first approximation is the mutual entropy measure between two concepts a and b, noted H(a; b) and defined using the conditional entropy:

H(a; b) = H(a) − H(a|b) = H(b) − H(b|a)

This measure succeeds in describing one-to-one relations between concepts, but it has several flaws:
– The mutual entropy measure is symmetric, but dependence is not.
– If a concept depends on a set of concepts, it does not necessarily depend on each one separately (e.g. the xor table). Thus, the mutual entropy measure fails to describe such dependencies.
The main advantage of the mutual entropy measure is that it is fast and easy to use, but it lacks the required descriptive finesse.
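The xor flaw can be checked numerically. The sketch below is our own (the paper provides no code); `mutual_entropy` is a name we introduce for an estimate of H(a; b) from paired samples:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discrete sample."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_entropy(xs, ys):
    """H(x) - H(x|y), estimated from paired samples via H(x) + H(y) - H(x, y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# a = xor(b1, b2): a is independent of b1 (and of b2) taken alone
b1 = [0, 0, 1, 1]
b2 = [0, 1, 0, 1]
a = [x ^ y for x, y in zip(b1, b2)]
print(mutual_entropy(a, b1))                  # 0.0 -- the pairwise measure sees nothing
print(mutual_entropy(a, list(zip(b1, b2))))   # 1.0 -- the pair (b1, b2) fully determines a
```

The binary measure reports zero dependence between a and either parent alone, even though a is a deterministic function of the pair.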
Bayesian networks The Bayesian network semantics is rich enough to describe the dependence relation, but the related algorithms search for the smallest graph describing the data with good enough accuracy. As a consequence, a link between two nodes is always a dependence, but not all dependencies are described in the Bayesian network. Moreover, Bayesian networks are required to be directed acyclic graphs, whereas dependence cycles are common (if a depends on b, it is common to find that b depends on a).
Thus, we need a formalism with a local description of the dependence (no global property on the graph such as acyclicity), but one that involves more than two concepts at once (otherwise it misses some large dependencies).
n-ary mutual entropy The mutual entropy measure can be used with more than two concepts at once. We can say that a concept a depends on a set B of n concepts by using:

H(a; B) = H(a) − H(a | b1, . . . , bn)
But this leads to a hyper-graph, which is not easy to reason about or to display. We propose to flatten it in the following way:

a depends on b iff there exists a minimal set B containing b such that a depends on B.

Thus, given a concept a, we search for all sets B such that both H(a; B) and |B| are small.
Size penalty One property of conditional entropy is that H(A|B ∪ {b}) ≤ H(A|B): adding a concept to B never increases the conditional entropy, so larger sets always look at least as good. We therefore need to complete this measure with a size penalty. If we code A using the information in B, the message length (per instance) is given by the conditional entropy H(A|B). It is logical to add to this length the encoding length of the function used to predict A values from B values. This requires coding 2^|B| binary values (one for each possible combination of the concepts in B). With l instances to code, the structure cost per instance is 2^|B|/l. The resulting evaluation function is:

f(A, B) = H(A) − H(A|B) − 2^|B|/l (1)
The optimization process searches for the best sets of concepts B for each A. B is considered minimal because, if a concept in B adds no information, then a subset of B would have the same conditional entropy with a lower structural cost, and thus a better score f.
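As an illustration (our own sketch, not the authors' code), equation (1) can be evaluated on a toy sample; `entropy`, `conditional_entropy` and `score` are names we introduce:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discrete sample."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(a_vals, b_rows):
    """H(A|B) = H(A, B) - H(B), from parallel samples."""
    return entropy(list(zip(a_vals, b_rows))) - entropy(b_rows)

def score(a_vals, b_rows, l):
    """f(A, B) = H(A) - H(A|B) - 2^|B| / l  (equation 1).
    b_rows: tuples giving the values of the |B| candidate parents."""
    size = len(b_rows[0]) if b_rows else 0
    return entropy(a_vals) - conditional_entropy(a_vals, b_rows) - 2 ** size / l

# toy data: a is fully determined by the pair (b1, b2)
b_rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
a_vals = [x ^ y for x, y in b_rows]
print(score(a_vals, b_rows, l=len(a_vals)))  # 1.0 - 0.0 - 4/40 = 0.9
```

The functional dependency gives the full H(A) = 1 bit of predictive gain, reduced by the structural cost 2^2/40 of the two-parent set.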
Definitions Given a concept A, we call partial parent sets the at most n sets of concepts B with the best positive measures f(A, B).

We call the parent set of concept A the union of all its partial parent sets.

We say that a concept B is one of the parents of concept A when B is in the parent set of A.
2.2 Network evaluation
Entropy approximation Entropy computation is costly, so we use a stochastic approximation of it in order to avoid the full computation. The approximation is based on differences between instances (i.e. the set of concepts that are present in one (and only one) of a pair of instances).
Fano’s inequality gives :
H(Pe) + Pe · log(|Ξ| − 1) ≥ H(X|Y) (2)

where Pe is the probability of error when predicting X using Y, and |Ξ| is the number of different possible values of X. In our context, X is a binary attribute (concept present or absent), so we have:
H(Pe) ≥ H(X |Y ) (3)
So, the conditional entropy is approximated by the entropy of the prediction error. We now express this error using sets of equal attributes (see the annex for the complete proof):
Pe = 1/2 − 1/2 · Σ_Y P(Y) · √(2 P(X=|Y) − 1) (4)

The Σ term is the mean value of √(2 P(X=|Y) − 1). We approximate this mean of square roots by the square root of the mean value:

Pe ≃ 1/2 − 1/2 · √(2 P(X=|Y=) − 1) (5)
We note that P(X=|Y=) = P(X=, Y=)/P(Y=). Thus, the information required to compute our approximation of the conditional entropy H(X|Y) is the count of identical attribute sets in pairs of instances. This count is much easier to compute than the full set of entropy measures, and can be stored in a more efficient way. Moreover, only a small sampling of all possible pairs is necessary. More information on this method can be found in [?].
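A minimal sketch of this pair-sampling estimate (our own code; the function name and the dictionary data layout are assumptions) combines equation (5) with the upper bound H(Pe) of equation (3):

```python
import math
import random

def approx_conditional_entropy(rows, x, ys, n_pairs=2000, seed=0):
    """Stochastic approximation of H(X|Y): estimate P(X=|Y=) by sampling
    pairs of instances, derive Pe from equation (5), and return H(Pe).
    rows: list of dicts (instance -> concept values); x: target key; ys: parent keys."""
    rng = random.Random(seed)
    y_equal = xy_equal = 0
    for _ in range(n_pairs):
        r1, r2 = rng.choice(rows), rng.choice(rows)
        if all(r1[y] == r2[y] for y in ys):   # the Y attributes agree on this pair
            y_equal += 1
            if r1[x] == r2[x]:                # ... and X agrees too
                xy_equal += 1
    if y_equal == 0:
        return 1.0  # no evidence: worst case for a binary attribute
    p = xy_equal / y_equal                    # estimate of P(X= | Y=)
    pe = 0.5 - 0.5 * math.sqrt(max(2 * p - 1, 0.0))
    if pe in (0.0, 1.0):
        return 0.0
    return -pe * math.log2(pe) - (1 - pe) * math.log2(1 - pe)
```

When X is a deterministic function of Y, every pair with equal Y values also has equal X values, so p = 1, Pe = 0, and the approximation returns 0, as expected for a functional dependency.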
2.3 Methods and Algorithms
General method The general method consists in searching the space of possible parent sets for the best one, for each node. It is described in algorithm (1).
Algorithm 1 Graph structure extraction
Input: C, the set of all concepts c_i
Output: G, the best graph according to our metric; G(w_i) is the list of the n best partial parent sets for c_i
1: for all nodes w_i do
2:   for all possible parent sets P ⊂ C \ {c_i} do
3:     if P is a good enough parent set then
4:       Add P to G(w_i)
5:       if the size of G(w_i) exceeds n then
6:         Remove the worst partial parent set from G(w_i)
7:       end if
8:     end if
9:   end for
10: end for
Pruning methods An exhaustive search for the best parent sets among all possible subsets is not realistic: a naive implementation of algorithm (1) leads to exponential complexity. Fortunately, it is easy to improve this loop using some pruning heuristics. The note we compute has two parts: the first is the ability of the model to predict the data, while the second is the size cost of the structure. While the first part (H(a) − H(a|B)) is bounded (by H(a)), the second grows exponentially (2^|B|/l). As a retained note must be greater than 0 (the note of the empty set), there is a size limit on the structure (2^|B|/l < H(a)). If the size of a parent set exceeds this limit, it will have a negative score, even if it defines a functional dependency (H(a|B) = 0 = Pe). This property provides an efficient way to ensure that the complexity does not go beyond a selected degree, by adjusting the structural cost via l.
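The size limit 2^|B|/l < H(a) can be computed directly; a minimal sketch in Python, with names of our own:

```python
import math

def max_parent_set_size(h_a, l):
    """Largest |B| that can still score positively: we need 2^|B| / l < H(a),
    i.e. roughly |B| <= floor(log2(l * H(a))), then back off if the bound is hit."""
    if h_a <= 0:
        return 0
    size = int(math.floor(math.log2(l * h_a)))
    # back off while 2^size / l does not actually stay below H(a)
    while size > 0 and 2 ** size / l >= h_a:
        size -= 1
    return size

print(max_parent_set_size(h_a=1.0, l=1000))  # 9: 2^9/1000 = 0.512 < 1, but 2^10/1000 > 1
```

For a binary concept (H(a) ≤ 1 bit) and 1000 instances, no parent set larger than 9 concepts can ever score positively, so the search loop can discard such candidates without evaluating them.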
An additional pruning method based on properties of the metric uses the note of the best parent set found so far: the better the score already found, the smaller the size of any possibly better parent set. Taking advantage of this property requires evaluating the smallest sets first. We thus limit the number of unnecessary score computations.
Restricted candidate parent sets For high-dimensional data sets, the previous pruning methods are not sufficient: complexity remains low only at the cost of the sensibility of the metric, since the relative weight of the structure cost is forced high. It is not possible to perform an exhaustive search in very high dimensions. The only way to restrict the size of the search space is to limit the number of possible parents we consider, using some heuristic [?]. Many heuristics are possible, each with its own bias. We choose to build a heuristic that produces graphs that are easy to visualize. The main motivation for this heuristic is to make the extracted knowledge understandable: the extracted knowledge is represented by a graph, and its visualization plays a crucial role in understanding it. Therefore, we search for a graph maximizing both the MML-like metric and visualization criteria.
3 Visualization constrains discovery
3.1 Graph structure visualization
Graph structure visualization is a complex problem. We consider two different approaches. The first is to define an appearance criterion and try to maximize it; its main advantage is that we know which properties our structure has, and its main drawback is that computing such a metric is costly. The second approach is to use construction methods that implicitly imply some nice visualization features; we have no control over the appearance criteria, but we do not need to compute them. To display our graph, we chose the second approach: a specific SOM (Self-Organizing Map) produces the position of each node.
SOM structure Self-Organizing Maps (SOMs) were introduced by Kohonen [?]. A SOM is an unsupervised, discrete approximation method for a distribution. It is mainly used to display data items and to perform classification and clustering.

Let us define a simple SOM, where I and O are the input and output spaces. The SOM consists of an array of reference vectors m_i ∈ I, the index i ranging over O. The reference vector array is usually two-dimensional.
Learning the distribution is performed using the following algorithm (2):
Algorithm 2 SOM algorithm
1: repeat
2:   get x(t) ∈ I according to the distribution
3:   c ← argmin_i ||x(t) − m_i||
4:   m_i(t+1) = m_i(t) + h_{c,i}(t) [x(t) − m_i(t)]
5: until convergence
A SOM variant is characterized by h_{c,i}(t), called the neighborhood function. It is necessary that lim inf h_{c,i} = 0 to ensure the convergence of the network (otherwise, m_i(t) diverges). Usually, h_{c,i} is larger when c and i are closer in the array.
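A minimal sketch of one iteration of algorithm (2) on a 1-D array of reference vectors, assuming a Gaussian neighborhood that shrinks over time (our own illustration; the paper does not fix a particular h_{c,i}):

```python
import math
import random

def som_step(refs, x, t, sigma0=2.0, lr0=0.5, tau=1000.0):
    """One SOM iteration. refs: list of [x, y] reference vectors in [0,1]^2;
    x: a sample from the input distribution. The neighborhood h_{c,i}(t)
    decays with t, satisfying the lim inf h_{c,i} = 0 convergence condition."""
    # winner c = argmin_i ||x - m_i||
    c = min(range(len(refs)),
            key=lambda i: (refs[i][0] - x[0]) ** 2 + (refs[i][1] - x[1]) ** 2)
    sigma = sigma0 * math.exp(-t / tau)   # neighborhood width shrinks over time
    lr = lr0 * math.exp(-t / tau)         # learning rate decays too
    for i, m in enumerate(refs):
        h = lr * math.exp(-((i - c) ** 2) / (2 * sigma ** 2 + 1e-12))
        m[0] += h * (x[0] - m[0])         # m_i(t+1) = m_i(t) + h [x(t) - m_i(t)]
        m[1] += h * (x[1] - m[1])

rng = random.Random(0)
refs = [[rng.random(), rng.random()] for _ in range(10)]
for t in range(5000):
    som_step(refs, [rng.random(), rng.random()], t)  # uniform samples in [0,1]^2
```

After training, the reference vectors approximate the uniform distribution, and indices that are close in the array tend to end up close in the plane.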
SOMs have been widely used on text data to perform unsupervised text classification. WebSOM [?] is a typical use of such a method: using SOMs, it performs word clustering and text classification according to these clusters. In this process, SOMs are applied directly to the text and the final result is a text classifier [?,?]: the SOM tends to approach the original text distribution. This kind of method is well suited for text classification (e.g. [?]), but is not intended for text analysis.
In this article, we chose a different approach: knowledge is extracted and represented as a network, and the SOM approach is used to display it. SOMs have already been successfully used to display graphs [?]. In our case, we use the graph instead of the array: I represents the visualization space, and O the index set of the nodes. The neighborhood function is built using the distance between nodes in the graph (in hop count). The SOM then computes the position of each node in the visualization space. This method relies on the following SOM properties to ensure some appearance criteria:
Distribution approximation: The reference vectors are a discrete approximation of the distribution. We chose a uniform distribution on the visualization space; the reference vectors (the nodes) thus tend to be distributed uniformly in the visualization space.

Metric preservation: The input and output space metrics tend to be compatible. In other words, if two nodes are close in the graph, they will be close in the visualization space.

The distribution approximation property ensures that the distance between any two neighboring nodes is approximately the same, and that the nodes tend to use all the available space (avoiding the creation of holes). Neighbors in the graph will likely be neighbors in the visualization space, thanks to the metric preservation property; this reduces the number of crossing links in the drawing. Using both properties, we have enough clues to believe that the resulting graphs will be nice-looking. This method does not require computing any appearance metric; it relies only on the construction properties.
However, we do not think that this method is efficient enough. In particular, it requires having the final graph to display: it is not possible to view a result before the full graph has been extracted.
Dynamic SOM structure SOM graph visualization methods suppose that the graph to display is given, and find optimal node positions. What happens if we do not freeze it? Let us assume that at each iteration the graph may be slightly different (i.e. the structure is dynamic). Unchanged parts of the graph are not directly affected by the modifications, and converge to better positions. As soon as no more changes occur, normal convergence starts, but the unchanged parts have already begun the convergence process. Thus, it is reasonable to begin iterating before the exact structure is known: in the worst case, only the recently touched parts of the graph need the full number of iterations. Let us make an additional assumption: the mean number of links grows over time. A sparse graph is easier to display, so at the beginning a few iteration steps produce reasonable positions. As the graph grows, convergence slows down. It is reasonable to think that the initial steps lead to an approximate placement of the structure, and are more effective than randomly placing nodes in the following steps.
The question is now how to implement the dynamic structure. The structure is not directly involved in algorithm (2): only the neighborhood function h_{c,i} is affected by a structural change. Let us rewrite the neighborhood function to extract the structure-dependent part s(c, i): h_{c,i}(t) = s(c, i) · h′_{c,i}(t). h′ does not depend on the structure. We assume s is bounded (it depends only on finite variables). Then the dynamic structure does not affect convergence.
Because we want all equally interesting links, we extended the extracted structure to allow multiple parent sets for a single node. We now need to explain how to handle this structure for visualization purposes. Given any nodes a and b such that b is an element of one of a's partial parent sets, we assume, for visualization only, that b is a parent of a. Thus, we do not display all the information available (the exact parent set membership information is lost). We accept losing this information because displaying such details would require hyper-graph techniques.
One more adaptation is required to use SOMs to display our structure, because the graph we extract from text is a directed graph, whereas the SOM display method was defined for undirected graphs. Two ways of adapting the algorithm are possible: the first is to ignore link direction, and the second is to replace the graph distance (hop count) by a directed graph distance. Although the difference seems small, the choice may have dramatic effects on convergence speed. Consider a graph with only one directed link a → b. It is possible to go from a to b, but not from b to a: b is in the neighborhood of a, but a is not in the neighborhood of b. If a and b are selected alternately, b moves towards a, whereas a flies away from b (trying to find some empty space); a and b then run around the visualization space. This, of course, does not speed up convergence. It is important to consider the graph without directions in the neighborhood function to avoid this problem.
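The undirected hop-count distance used by the neighborhood function can be computed by breadth-first search after symmetrizing the links; this sketch is ours, with hypothetical names:

```python
from collections import deque

def hop_distances(adj, start):
    """Hop-count distances from `start`, ignoring edge direction, by BFS.
    adj: dict node -> set of nodes it points to (parent links)."""
    # symmetrize: treat a -> b and b -> a the same for the neighborhood function
    und = {n: set() for n in adj}
    for a, outs in adj.items():
        for b in outs:
            und[a].add(b)
            und.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for m in und[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist

print(hop_distances({'a': {'b'}, 'b': set(), 'c': {'b'}}, 'a'))  # {'a': 0, 'b': 1, 'c': 2}
```

With the directed distance, c would be unreachable from a; symmetrizing avoids the oscillation described above.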
Visualization guided graph extraction Information on the structure and node positions is useful for graph extraction (as a heuristic): it gives a reasonable bias for parent selection, and using distance as the heuristic results in a nice-looking graph.
Algorithm 3 Visualization guided graph extraction
1: repeat
2:   get x(t) ∈ [0, 1] × [0, 1]
3:   j ← argmin_i ||x(t) − m_i||
4:   update the C_j parent set list using the neighbors of m_j as candidates
5:   move C_j and its neighbors toward x(t)
6: until convergence
Algorithm (3) shows how to use node positions as a heuristic for selecting possible parents, and how the current graph is taken into account when placing nodes in the visualization space. The architecture we chose to implement this algorithm is presented in figure 1. In this figure, the boxes represent information and the arrows represent processes. The information flow starts from the textual data and ends up in a viewing box. Let us describe each of the information boxes:

Textual data: this is the knowledge source. For now, all data is supposed to be available at the beginning. We present in the next section the changes required when the data is dynamic.
DATA: this is the preprocessed version of the textual data. The text is a list of sentences, and each sentence is a set of words; this representation is also known as the bag of words. No direct access is made to this part of the DATA. It may be stored on disk, as it is only required when new data arrives (to compute differences).

[Fig. 1. General architecture — information boxes: Textual data, DATA, Cache, Stat, Note, Graph structure, Node position, Viewing; processes: Parsing, differences count, MML-like metric, Level-wise search, SOM, Candidate selection.]
Cache: for each attribute set, it retains how many times the set occurred during the computation of data line differences. Note evaluation relies on this basic statistical information. This structure is accessed often, and ought to be optimized and kept in memory.

Stat: statistics on single attributes. They are required to compute attribute encoding lengths. They can be built incrementally (i.e. without rereading DATA).

Note: the note evaluates a set of possible parents for a given node. The better the note, the more likely the set of partial parents will be chosen. Many note computations are required; however, caching note results is hard (if not impossible), because the number of possible note computations is exponential and we know no heuristic to guess whether a given result will be required later. Moreover, when new data arrives, all the notes become obsolete.

Graph structure: the graph structure gives, for each node, its current partial parent list. The union of all partial parent sets (i.e. the parent set) is computed for each node, because it is required to perform the SOM-like step.

Node position: the positions of each node of the graph in the visualization space. Distances between nodes are computed using this information, which then allows selecting parent candidates.

Viewing: the viewing information consists of both the graph structure and the node positions. It is the final step of the visualization process. As convergence is a dynamic process, the visualized graph is not static. Moreover, if new data is processed, new results are displayed as soon as they are available.
Information management is done by the processes shown on the arrows of the architecture graph. First, Parsing is applied to the raw text data; it only consists in finding sentences and words. It is possible to enhance the data by applying linguistic or statistical filters. Then, differences between data lines are computed and counted to fill the cache structure, and some basic statistics are evaluated. Using this information, the metric described in equation (1) is computed in a level-wise search (algorithm (1)), in order to find the best possible structure. Using this structure, the SOM algorithm (4) finds the best node positions in the visualization space. These positions in turn restrict the number of possible parent sets to explore, and the loop goes on.
3.2 Incremental and dynamic Properties
Dynamic properties One interesting property of our approach is the ability to provide a graphical result at any iteration step. Thus, the current state of the extraction is always available to the user: the progress of the extraction is displayed on screen in real time. This allows direct parameter tuning.

Allowing the parameters to be altered during the extraction process requires some changes to algorithm (3), leading to the following one:
Algorithm 4 Visualization guided graph extraction (dynamic version)
1: repeat
2:   get x(t) ∈ [0, 1] × [0, 1]
3:   j ← argmin_i ||x(t) − m_i||
4:   update the notes of the C_j parent sets
5:   update the C_j parent set list using the neighbors of C_j as candidates
6:   move C_j and its neighbors toward x(t)
7: until convergence
In fact, the only difference is an additional step that updates the already computed notes. Note that, to avoid recomputing every note at each parameter modification, updating is done just before the scores are used. This means that just after a parameter change, most of the notes are erroneous; of course, this does not affect the computation much, as we correct them before use. Another key advantage of the dynamic result display is that the exploration of the extracted knowledge is not delayed.
Incremental properties The data used to compute the graph note may easily be extended: new data usually becomes available during processing. An incremental property is necessary to handle such data on the fly.

When new data arrives, it is handled as follows (see figure 1):

1. Incoming textual data is parsed and goes into DATA.
2. Stat and Cache are upgraded according to the new data.
3. If required, new nodes are added to the graph structure and placed randomly in the visualization space.
4. The SOM neighborhood function is raised (this is equivalent to a temperature increase in simulated annealing methods).
No additional action is needed. As iterations occur, the parent set notes are updated according to the new metric. The already computed positions are reused, avoiding a restart from scratch. Thus, the algorithm presented here is incremental.
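The incremental steps above can be sketched in code; this is our own illustration, not the authors' implementation, and all names (`ingest`, the container layout) are hypothetical:

```python
import random
from collections import Counter

def ingest(new_lines, data, cache, graph, positions, rng=random.Random(0)):
    """Hedged sketch of the incremental steps of section 3.2 (names are ours).
    new_lines: parsed sentences, each a frozenset of concept ids (step 1 done).
    cache: Counter of attribute-difference sets seen in sampled instance pairs."""
    for line in new_lines:
        # step 2: upgrade the Cache with differences against stored lines
        for old in rng.sample(data, min(len(data), 50)):
            cache[frozenset(line ^ old)] += 1   # symmetric difference of the pair
        data.append(line)
        # step 3: new concepts get a node, placed randomly in the visualization space
        for concept in line:
            if concept not in graph:
                graph[concept] = []             # no partial parent sets yet
                positions[concept] = [rng.random(), rng.random()]
    # step 4 (raising the SOM neighborhood, a "reheating") is left to the SOM loop

data, cache, graph, positions = [], Counter(), {}, {}
ingest([frozenset({'gene', 'plant'})], data, cache, graph, positions)
ingest([frozenset({'gene', 'dna'})], data, cache, graph, positions)
```

Existing positions and counts are never discarded: each new batch only adds difference counts, nodes, and random initial placements, which is what makes the process incremental.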
4 Experimental results
4.1 Results on corpus
First, we apply our approach to discover and visualize knowledge from two different texts. In this first step, we do not consider any dynamic aspect of the data: we assume the files are complete and available. The main objective of these two tests is to show what kind of knowledge it is possible to extract from text using unsupervised probabilistic network structure learning, and how to display the extracted knowledge. Before looking at the graphs, we have two important remarks:
– The task performed here is done without any prior domain knowledge or any linguistic tool. It is almost language independent, and could have been carried out on most web-available text documents. However, to interpret the results, we humans require language and/or domain knowledge.
– The graphs we produce are quite large (and we could generate even larger ones). To present them in this article, we had to reduce the scale, so the resulting graphs may not be easy to explore (tiny font). We tried to “zoom” in on some interesting parts, but it is easier to zoom on the actual electronic file. Moreover, it is not possible to show in a paper document how the graph evolves and reacts to parameter tuning.
Let us now present the text data we focused on :
WWW-HOWTO This how-to is included in many Linux distributions. It deals with WWW access and serving. It is a typical technical document, using specific words. It would not be easy to analyze with natural language text processing, because such processing would lack a specialized thesaurus. According to a word counting utility, its length is more than 14000 words, which is enough to perform a statistical analysis.
In figure (4.4), we show the result of our unsupervised analysis method. By following the links, it is possible to recognize familiar computer domain word associations, such as ftp, downloaded, network, man, hosts, virtual, machine. We can also distinguish some connected components such as computer, domain, enable, restart, message, send. Thus, it is possible to extract reduced lexical fields automatically.
DNA Corpus At the opposite end from the WWW-HOWTO file stands the DNA corpus. This corpus has been presented and studied in the International Journal of Corpus Linguistics ([?]). The initial study was based on a linguistic collocational network, in order to extract emergent uses of new word patterns in the scientific sublanguage associated with molecular biology applied to parasitic plants. Automated probabilistic network construction may be useful to linguists, even if the relation between collocational networks and probabilistic ones is not yet clear.
Fig. 2. DNA Corpus
Figure 2 shows the result we obtained for the DNA corpus. We have no particular knowledge of molecular biology, and chose some understandable samples among the extracted patterns. It appears that some of the associations are lexical fields (e.g. deletion, evolutionary) while others (hybridizes, intact) reveal the application domain. Even if these links are different from collocational links, the distance (in hop count) between collocates in the probabilistic network we produce is low (e.g. gene and plant are collocates, and they are linked via remains).
deletion ← gene contains kb similar evolutionary substitutions
evolutionary ← gene plant involved necessary deletion remains phylogenetic organisms hemiparasite properly introns
hybridizes ← chloroplast intact genes codons families stop lack leaf relate altered
intact ← functional species genes templates hybridizes leaf relate barley altered
remains ← gene plant plastids living involved hemiparasite reduction
sites ← protein site gel size homology sequencing
study ← trnas trna sequence isolated single content tests
Characterizing the linguistic relation between a word and its parents requires a domain expert. Being able to get a relation without knowing its nature was one of our requirements in this study. Interpreting the relation is another research topic, and a possible extension of this work.
4.2 Convergence problem
In figure (3), we plot the note augmentation per 1000 iteration steps on the WWW-HOWTO file. As an experimental result, we see that the curve may be bounded by a decreasing exponential. This is in accordance with the convergence property of the presented algorithm: the note is strictly growing, because a parent set may be replaced only by a better parent set, and the note is bounded, because there is only a finite number of possible graphs. The growing and bounded note sequence therefore converges, and the difference between consecutive notes tends to 0, as suggested by figure (3). We ensure graph note convergence, but what about graph convergence? The note converges and depends only on finite variables, so from a given iteration step onward the note is constant. The structure changes only if a better parent set has been found, which is not possible while the note is constant. Then no better parent set may be found, and the structure does not evolve.
Fig. 3. note augmentation per 1000 iteration step (WWW-HOWTO)
4.3 Effect of Initial position on result
The objective of this experiment (on the Mail-HOWTO data set) is to study the dependence between the initial positions and the graph note after numerous (100000) iteration steps. We ran our program using the same data set and three different initial positions of the nodes in the visualization space. Results are shown in figure (4), where each curve represents the note evolution for a given initialization of the word positions in the visualization space. Although the individual curves behave somewhat stochastically, it appears that the distance between them is always lower than 1 unit. As the values are around 20, the relative error is 5 percent. Thus, the dependence of the curve evolution on the initial positions is low.
Fig. 4. Note evolution for three different initial node positions (Mail-HOWTO)
4.4 Effect of new data insertion
In this experiment, the initial data set was cut into 4 segments. Every 50000 iteration steps, a new segment was transmitted to the program. The resulting note curve is given in figure (5). The most important effect of each new addition is to invalidate the previous evaluation function. This has two immediate effects:
– the current structure is re-evaluated using the new notation function. Thus, the note may increase or decrease very fast: it decreases if the graph is less useful given the new data set, and increases if the new data fit the already built model. One interpretation of a note decrease is that the method learned on a restricted set of data, fit it too closely, and is no longer able to generalize. This problem is also called over-fitting.
– the speed of the convergence process grows: unless the new data set can already be predicted using the current model, the model quality may increase by taking the new data into account.
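The re-scoring effect can be sketched as follows. The notation function below is a hypothetical counting-based dependence score, not the paper's stochastic measure, and the word segments are invented: the structure stays fixed, but adding a segment changes the empirical counts, so the same graph receives a new note, which may jump before optimization resumes.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical notation function (not the paper's stochastic measure):
# each edge (a, b) of the structure is scored by the pointwise mutual
# dependence of a and b in the data seen so far.
def note(edges, sentences):
    n = len(sentences)
    single, pair = Counter(), Counter()
    for s in sentences:
        single.update(s)
        pair.update(combinations(sorted(s), 2))
    total = 0.0
    for a, b in edges:
        pab = pair[tuple(sorted((a, b)))] / n
        pa, pb = single[a] / n, single[b] / n
        if pab > 0:
            total += pab * math.log(pab / (pa * pb))
    return total

edges = [("mail", "server"), ("smtp", "server")]
seg1 = [{"mail", "server"}, {"smtp", "server"},
        {"mail", "smtp", "server"}, {"howto"}]
seg2 = [{"server"}, {"mail"}, {"smtp"}]  # new segment with weaker links

before = note(edges, seg1)
after = note(edges, seg1 + seg2)  # same structure, new counts
print(f"note before = {before:.3f}, after = {after:.3f}")
```

Here the second segment breaks the co-occurrence pattern, so the unchanged structure scores lower under the new counts, the "note decrease" case described above.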
Fig. 5. Four-part segmented text (WWW-HOWTO), one part every 50000 iteration steps
In order to evaluate the change in the model before and after a data addition, we split the WWW-HOWTO into two parts, and captured the graph before and after the insertion of the second part. The result is shown in figure (6). No parameter was changed between these two snapshots; however, the second graph is sparser than the first one. This is a direct consequence of the change in the graph metric: on the one hand, some links no longer hold; on the other hand, some new links appear.
4.5 Influence of data order
This data set has been built by cutting the original file (Mail-HOWTO) into three segments: A1, A2 and A3. We present these data to our program in two different orders:
Fig. 6. WWW-HOWTO analysis (evolution)
order 1: segment A1, A2 and then A3.
order 2: segment A3, A2 and then A1.
In both experiments, we sent a new data segment every 10000 iteration steps. Figure (7) presents the note curves at each step for order 1 and order 2.
In the first period (0-10000), the note of the graph discovered with order 1 is lower than the note of the graph extracted with order 2. We deduce from this observation that segment A1 is harder to model than segment A3.
In the second period, the curves associated with both order 1 and order 2 grow rapidly. Consequently, segment A2 seems to be the easiest to model. Networks that fit segment A1 or A3 well get an even better note when A2 is added; the structure of A2 certainly encompasses those of A1 and A3.
In the third period, order 1 becomes better than order 2, but the difference between the two curves is roughly one unit. As we found a similar difference between curves that differ only in the initial node positions, we cannot say that, in this case, the data order had a great influence on the final result.
5 Conclusion
In this paper, we presented data structures and algorithms to perform text mining using a probabilistic network framework. We take advantage of methods for automated structure learning from data in Bayesian networks to extract interesting patterns in texts, without any prior language knowledge or syntactic analysis. The proposed
Fig. 7. Two different orders on the same data
method is robust, general and unsupervised, and thus may be applied to almost any kind of textual data.
The work described in this paper is being adapted as a module for an automated web browser, in order to read automatically retrieved web documents and present an analysis to the end-user. We take advantage of the heuristic used here to perform visualization efficiently (the word positions on screen are computed during the graph optimization). We have obtained encouraging results in this direction.
Proofs
Error estimation using differences. We claim

Pe = 1/2 − 1/2 . Σ_Y P(Y) . √(1 − 2.q(Y)), where q(Y) = 2.P(X = 0|Y).P(X = 1|Y)

Proof: by definition of the error probability,

Pe = Σ_Y P(Y) . min(P(X = 0|Y), P(X = 1|Y))

We use min(p, 1 − p) = (1 − |2p − 1|)/2 and 2p(1 − p) = q ↔ p = 1/2 ± 1/2.√(1 − 2q), i.e. |2p − 1| = √(1 − 2q). Hence

Pe = 1/2 . Σ_Y P(Y) . (1 − |2.P(X = 0|Y) − 1|) = 1/2 − 1/2 . Σ_Y P(Y) . √(1 − 2.q(Y))