Incremental and Dynamic Text Mining
Graph structure discovery and visualization
DUBOIS Vincent, QUAFAFOU Mohamed — {dubois;quafafou}@irin.univ-nantes.fr
IRIN (Universite de Nantes)
Abstract. This paper tackles the problem of knowledge discovery in text collections and the dynamic display of the discovered knowledge. We claim that these two problems are deeply interleaved, and should be considered together. The contribution of this paper is fourfold: (1) a description of the properties needed for a high-level representation of concept relations in text, (2) a stochastic measure for a fast evaluation of dependencies between concepts, (3) a visualization algorithm to display dynamic structures, and (4) a deep integration of discovery and knowledge visualization, i.e. the placement of nodes and edges automatically guides the discovery of the knowledge to be displayed. The resulting program has been tested using two data sets from the specific domains of molecular biology and WWW howtos.
1 Introduction
2 Graph structure discovery
2.1 Graph structure semantic
Given a set of concepts, we want to extract the dependencies between them. The most natural way to display such information is a directed graph: each concept is a node, and an edge between two nodes denotes a dependence.
Binary mutual entropy Dependence between concepts is a high-level, abstract relation, and it is not easy to express. A first approximation is the mutual entropy measure between two concepts a and b, noted H(a; b) and defined using the conditional entropy:

H(a; b) = H(a) − H(a|b) = H(b) − H(b|a)

This measure succeeds in describing one-to-one relations between concepts, but it has several flaws:
– The mutual entropy measure is symmetric, but dependence is not.
– If a concept depends on a set of concepts, it does not necessarily depend on each one separately (e.g. the xor table). Thus, the mutual entropy measure fails to describe such dependencies.
The main advantage of the mutual entropy measure is that it is fast and easy to use, but it lacks the required descriptive finesse.
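The xor flaw can be checked numerically. The sketch below is our own (the paper provides no code); `mutual_entropy` is a name we introduce for an estimate of H(a; b) from paired samples:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discrete sample."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mutual_entropy(xs, ys):
    """H(x) - H(x|y), estimated from paired samples via H(x) + H(y) - H(x, y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

# a = xor(b1, b2): a is independent of b1 (and of b2) taken alone
b1 = [0, 0, 1, 1]
b2 = [0, 1, 0, 1]
a = [x ^ y for x, y in zip(b1, b2)]
print(mutual_entropy(a, b1))                  # 0.0 -- the pairwise measure sees nothing
print(mutual_entropy(a, list(zip(b1, b2))))   # 1.0 -- the pair (b1, b2) fully determines a
```

The binary measure reports zero dependence between a and either parent alone, even though a is a deterministic function of the pair.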
Bayesian networks The Bayesian network semantics is rich enough to describe the dependence relation, but the related algorithms search for the smallest graph describing the data with good enough accuracy. As a consequence, a link between two nodes is always a dependence, but not all dependencies are described in the Bayesian network. Moreover, Bayesian networks are required to be directed acyclic graphs, whereas dependence cycles are common (if a depends on b, it is common to find that b depends on a).
Thus, we need a formalism with a local description of the dependence (no global property on the graph such as acyclicity), but one that involves more than two concepts at once (otherwise it misses some large dependencies).
n-ary mutual entropy The mutual entropy measure can be used with more than two concepts at once. We can say that a concept a depends on a set B of n concepts by using:

H(a; B) = H(a) − H(a | b1, . . . , bn)
But this leads to a hyper-graph, which is not easy to reason about or to display. We propose to flatten it in the following way:

a depends on b iff there exists a minimal set B containing b such that a depends on B.

Thus, given a concept a, we search for all sets B such that both H(a; B) and |B| are small.
Size penalty One property of conditional entropy is that H(A|B ∪ {b}) ≤ H(A|B): adding a concept to B never increases the conditional entropy, so larger sets always look at least as good. We therefore need to complete this measure with a size penalty. If we code A using the information in B, the message length (per instance) is given by the conditional entropy H(A|B). It is logical to add to this length the encoding length of the function used to predict A values from B values. This requires coding 2^|B| binary values (one for each possible combination of the concepts in B). With l instances to code, the structure cost per instance is 2^|B|/l. The resulting evaluation function is:

f(A, B) = H(A) − H(A|B) − 2^|B|/l (1)
The optimization process searches for the best sets of concepts B for each A. B is considered minimal because, if a concept in B adds no information, then a subset of B would have the same conditional entropy with a lower structural cost, and thus a better score f.
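As an illustration (our own sketch, not the authors' code), equation (1) can be evaluated on a toy sample; `entropy`, `conditional_entropy` and `score` are names we introduce:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a discrete sample."""
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(a_vals, b_rows):
    """H(A|B) = H(A, B) - H(B), from parallel samples."""
    return entropy(list(zip(a_vals, b_rows))) - entropy(b_rows)

def score(a_vals, b_rows, l):
    """f(A, B) = H(A) - H(A|B) - 2^|B| / l  (equation 1).
    b_rows: tuples giving the values of the |B| candidate parents."""
    size = len(b_rows[0]) if b_rows else 0
    return entropy(a_vals) - conditional_entropy(a_vals, b_rows) - 2 ** size / l

# toy data: a is fully determined by the pair (b1, b2)
b_rows = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
a_vals = [x ^ y for x, y in b_rows]
print(score(a_vals, b_rows, l=len(a_vals)))  # 1.0 - 0.0 - 4/40 = 0.9
```

The functional dependency gives the full H(A) = 1 bit of predictive gain, reduced by the structural cost 2^2/40 of the two-parent set.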
Definitions Given a concept A, we call partial parent sets the at most n sets of concepts B with the best positive measures f(A, B).

We call the parent set of concept A the union of all its partial parent sets.

We say that a concept B is one of the parents of concept A when B is in the parent set of A.
2.2 Network evaluation
Entropy approximation Entropy computation is costly, so we use a stochastic approximation of it in order to avoid the full computation. The approximation is based on differences between instances (i.e. the set of concepts that are present in one (and only one) of a pair of instances).
Fano’s inequality gives :
H(Pe) + Pe · log(|Ξ| − 1) ≥ H(X|Y) (2)

where Pe is the probability of error when predicting X using Y, and |Ξ| is the number of different possible values of X. In our context, X is a binary attribute (concept present or absent), so we have:
H(Pe) ≥ H(X |Y ) (3)
So, the conditional entropy is approximated by the entropy of the prediction error. We now express this error using sets of equal attributes (see the annex for the complete proof):
Pe = 1/2 − 1/2 · Σ_Y P(Y) · √(2 P(X=|Y) − 1) (4)

The Σ term is the mean value of √(2 P(X=|Y) − 1). We approximate this mean of square roots by the square root of the mean value:

Pe ≃ 1/2 − 1/2 · √(2 P(X=|Y=) − 1) (5)
We note that P(X=|Y=) = P(X=, Y=)/P(Y=). Thus, the information required to compute our approximation of the conditional entropy H(X|Y) is the count of identical attribute sets in pairs of instances. This count is much easier to compute than the full set of entropy measures, and can be stored in a more efficient way. Moreover, only a small sampling of all possible pairs is necessary. More information on this method can be found in [?].
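A minimal sketch of this pair-sampling estimate (our own code; the function name and the dictionary data layout are assumptions) combines equation (5) with the upper bound H(Pe) of equation (3):

```python
import math
import random

def approx_conditional_entropy(rows, x, ys, n_pairs=2000, seed=0):
    """Stochastic approximation of H(X|Y): estimate P(X=|Y=) by sampling
    pairs of instances, derive Pe from equation (5), and return H(Pe).
    rows: list of dicts (instance -> concept values); x: target key; ys: parent keys."""
    rng = random.Random(seed)
    y_equal = xy_equal = 0
    for _ in range(n_pairs):
        r1, r2 = rng.choice(rows), rng.choice(rows)
        if all(r1[y] == r2[y] for y in ys):   # the Y attributes agree on this pair
            y_equal += 1
            if r1[x] == r2[x]:                # ... and X agrees too
                xy_equal += 1
    if y_equal == 0:
        return 1.0  # no evidence: worst case for a binary attribute
    p = xy_equal / y_equal                    # estimate of P(X= | Y=)
    pe = 0.5 - 0.5 * math.sqrt(max(2 * p - 1, 0.0))
    if pe in (0.0, 1.0):
        return 0.0
    return -pe * math.log2(pe) - (1 - pe) * math.log2(1 - pe)
```

When X is a deterministic function of Y, every pair with equal Y values also has equal X values, so p = 1, Pe = 0, and the approximation returns 0, as expected for a functional dependency.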
2.3 Methods and Algorithms
General method The general method consists in searching the space of possible parent sets for the best one, for each node. It is described in algorithm (1).
Algorithm 1 Graph structure extraction
Input: C, the set of all concepts c_i
Output: G, the best graph according to our metric; G(w_i) is the list of the n best partial parent sets for c_i
1: for all nodes w_i do
2:   for all possible parent sets P ⊂ C \ {c_i} do
3:     if P is a good enough parent set then
4:       Add P to G(w_i)
5:       if the size of G(w_i) exceeds n then
6:         Remove the worst partial parent set from G(w_i)
7:       end if
8:     end if
9:   end for
10: end for
Pruning methods An exhaustive search for the best parent sets among all possible subsets is not realistic: a naive implementation of algorithm (1) leads to exponential complexity. Fortunately, it is easy to improve this loop using some pruning heuristics. The note we compute has two parts: the first is the ability of the model to predict the data, while the second is the size cost of the structure. While the first part (H(a) − H(a|B)) is bounded (by H(a)), the second grows exponentially (2^|B|/l). As a retained note must be greater than 0 (the note of the empty set), there is a size limit on the structure (2^|B|/l < H(a)). If the size of a parent set exceeds this limit, it will have a negative score, even if it defines a functional dependency (H(a|B) = 0 = Pe). This property provides an efficient way to ensure that the complexity does not go beyond a selected degree, by adjusting the structural cost via l.
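The size limit 2^|B|/l < H(a) can be computed directly; a minimal sketch in Python, with names of our own:

```python
import math

def max_parent_set_size(h_a, l):
    """Largest |B| that can still score positively: we need 2^|B| / l < H(a),
    i.e. roughly |B| <= floor(log2(l * H(a))), then back off if the bound is hit."""
    if h_a <= 0:
        return 0
    size = int(math.floor(math.log2(l * h_a)))
    # back off while 2^size / l does not actually stay below H(a)
    while size > 0 and 2 ** size / l >= h_a:
        size -= 1
    return size

print(max_parent_set_size(h_a=1.0, l=1000))  # 9: 2^9/1000 = 0.512 < 1, but 2^10/1000 > 1
```

For a binary concept (H(a) ≤ 1 bit) and 1000 instances, no parent set larger than 9 concepts can ever score positively, so the search loop can discard such candidates without evaluating them.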
An additional pruning method based on properties of the metric uses the note of the best parent set found so far: the better the score already found, the smaller the size of any possibly better parent set. Taking advantage of this property requires evaluating the smallest sets first. We thus limit the number of unnecessary score computations.
Restricted candidate parent sets For high-dimensional data sets, the previous pruning methods are not sufficient: complexity remains low only at the cost of the sensibility of the metric, since the relative weight of the structure cost is forced high. It is not possible to perform an exhaustive search in very high dimensions. The only way to restrict the size of the search space is to limit the number of possible parents we consider, using some heuristic [?]. Many heuristics are possible, each with its own bias. We choose to build a heuristic that produces graphs that are easy to visualize. The main motivation for this heuristic is to make the extracted knowledge understandable: the extracted knowledge is represented by a graph, and its visualization plays a crucial role in understanding it. Therefore, we search for a graph maximizing both the MML-like metric and visualization criteria.
3 Visualization constrains discovery
3.1 Graph structure visualization
Graph structure visualization is a complex problem. We consider two different approaches. The first is to define an appearance criterion and try to maximize it; its main advantage is that we know which properties our structure has, and its main drawback is that computing such a metric is costly. The second approach is to use construction methods that implicitly imply some nice visualization features; we have no control over the appearance criteria, but we do not need to compute them. To display our graph, we chose the second approach: a specific SOM (Self-Organizing Map) produces the position of each node.
SOM structure Self-Organizing Maps (SOMs) were introduced by Kohonen [?]. A SOM is an unsupervised, discrete approximation method for a distribution. It is mainly used to display data items and to perform classification and clustering.

Let us define a simple SOM, where I and O are the input and output spaces. The SOM consists of an array of reference vectors m_i ∈ I, the index i ranging over O. The reference vector array is usually two-dimensional.
Learning the distribution is performed using the following algorithm (2):
Algorithm 2 SOM algorithm
1: repeat
2:   get x(t) ∈ I according to the distribution
3:   c ← argmin_i ||x(t) − m_i||
4:   m_i(t+1) = m_i(t) + h_{c,i}(t) [x(t) − m_i(t)]
5: until convergence
A SOM variant is characterized by h_{c,i}(t), called the neighborhood function. It is necessary that lim inf h_{c,i} = 0 to ensure the convergence of the network (otherwise, m_i(t) diverges). Usually, h_{c,i} is larger when c and i are closer in the array.
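A minimal sketch of one iteration of algorithm (2) on a 1-D array of reference vectors, assuming a Gaussian neighborhood that shrinks over time (our own illustration; the paper does not fix a particular h_{c,i}):

```python
import math
import random

def som_step(refs, x, t, sigma0=2.0, lr0=0.5, tau=1000.0):
    """One SOM iteration. refs: list of [x, y] reference vectors in [0,1]^2;
    x: a sample from the input distribution. The neighborhood h_{c,i}(t)
    decays with t, satisfying the lim inf h_{c,i} = 0 convergence condition."""
    # winner c = argmin_i ||x - m_i||
    c = min(range(len(refs)),
            key=lambda i: (refs[i][0] - x[0]) ** 2 + (refs[i][1] - x[1]) ** 2)
    sigma = sigma0 * math.exp(-t / tau)   # neighborhood width shrinks over time
    lr = lr0 * math.exp(-t / tau)         # learning rate decays too
    for i, m in enumerate(refs):
        h = lr * math.exp(-((i - c) ** 2) / (2 * sigma ** 2 + 1e-12))
        m[0] += h * (x[0] - m[0])         # m_i(t+1) = m_i(t) + h [x(t) - m_i(t)]
        m[1] += h * (x[1] - m[1])

rng = random.Random(0)
refs = [[rng.random(), rng.random()] for _ in range(10)]
for t in range(5000):
    som_step(refs, [rng.random(), rng.random()], t)  # uniform samples in [0,1]^2
```

After training, the reference vectors approximate the uniform distribution, and indices that are close in the array tend to end up close in the plane.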
SOMs have been widely used on text data to perform unsupervised text classification. WebSOM [?] is a typical use of such a method: using SOMs, it performs word clustering and text classification according to these clusters. In this process, SOMs are applied directly to the text and the final result is a text classifier [?,?]: the SOM tends to approach the original text distribution. This kind of method is well suited for text classification (e.g. [?]), but is not intended for text analysis.
In this article, we chose a different approach: knowledge is extracted and represented as a network, and the SOM approach is used to display it. SOMs have already been successfully used to display graphs [?]. In our case, we use the graph instead of the array: I represents the visualization space, and O the index set of the nodes. The neighborhood function is built using the distance between nodes in the graph (in hop count). The SOM then computes the position of each node in the visualization space. This method relies on the following SOM properties to ensure some appearance criteria:
Distribution approximation: The reference vectors are a discrete approximation of the distribution. We chose a uniform distribution on the visualization space; the reference vectors (the nodes) thus tend to be distributed uniformly in the visualization space.

Metric preservation: The input and output space metrics tend to be compatible. In other words, if two nodes are close in the graph, they will be close in the visualization space.

The distribution approximation property ensures that the distance between any two neighboring nodes is approximately the same, and that the nodes tend to use all the available space (avoiding the creation of holes). Neighbors in the graph will likely be neighbors in the visualization space, thanks to the metric preservation property; this reduces the number of crossing links in the drawing. Using both properties, we have enough clues to believe that the resulting graphs will be nice-looking. This method does not require computing any appearance metric; it relies only on the construction properties.
However, we do not think that this method is efficient enough. In particular, it requires having the final graph to display: it is not possible to view a result before the full graph has been extracted.
Dynamic SOM structure SOM graph visualization methods suppose that the graph to display is given, and find optimal node positions. What happens if we do not freeze it? Let us assume that at each iteration the graph may be slightly different (i.e. the structure is dynamic). Unchanged parts of the graph are not directly affected by the modifications, and converge to better positions. As soon as no more changes occur, normal convergence starts, but the unchanged parts have already begun the convergence process. Thus, it is reasonable to begin iterating before the exact structure is known: in the worst case, only the recently touched parts of the graph need the full number of iterations. Let us make an additional assumption: the mean number of links grows over time. A sparse graph is easier to display, so at the beginning a few iteration steps produce reasonable positions. As the graph grows, convergence slows down. It is reasonable to think that the initial steps lead to an approximate placement of the structure, and are more effective than randomly placing nodes in the following steps.
The question is now how to implement the dynamic structure. The structure is not directly involved in algorithm (2): only the neighborhood function h_{c,i} is affected by a structural change. Let us rewrite the neighborhood function to extract the structure-dependent part s(c, i): h_{c,i}(t) = s(c, i) · h′_{c,i}(t). h′ does not depend on the structure. We assume s is bounded (it depends only on finite variables). Then the dynamic structure does not affect convergence.
Because we want all equally interesting links, we extended the extracted structure to allow multiple parent sets for a single node. We now need to explain how to handle this structure for visualization purposes. Given any nodes a and b such that b is an element of one of a's partial parent sets, we assume, for visualization only, that b is a parent of a. Thus, we do not display all the information available (the exact parent set membership information is lost). We accept losing this information because displaying such details would require hyper-graph techniques.
One more adaptation is required to use SOMs to display our structure, because the graph we extract from text is a directed graph, whereas the SOM display method was defined for undirected graphs. Two ways of adapting the algorithm are possible: the first is to ignore link direction, and the second is to replace the graph distance (hop count) by a directed graph distance. Although the difference seems small, the choice may have dramatic effects on convergence speed. Consider a graph with only one directed link a → b. It is possible to go from a to b, but not from b to a: b is in the neighborhood of a, but a is not in the neighborhood of b. If a and b are selected alternately, b moves towards a, whereas a flies away from b (trying to find some empty space); a and b then run around the visualization space. This, of course, does not speed up convergence. It is important to consider the graph without directions in the neighborhood function to avoid this problem.
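The undirected hop-count distance used by the neighborhood function can be computed by breadth-first search after symmetrizing the links; this sketch is ours, with hypothetical names:

```python
from collections import deque

def hop_distances(adj, start):
    """Hop-count distances from `start`, ignoring edge direction, by BFS.
    adj: dict node -> set of nodes it points to (parent links)."""
    # symmetrize: treat a -> b and b -> a the same for the neighborhood function
    und = {n: set() for n in adj}
    for a, outs in adj.items():
        for b in outs:
            und[a].add(b)
            und.setdefault(b, set()).add(a)
    dist = {start: 0}
    queue = deque([start])
    while queue:
        n = queue.popleft()
        for m in und[n]:
            if m not in dist:
                dist[m] = dist[n] + 1
                queue.append(m)
    return dist

print(hop_distances({'a': {'b'}, 'b': set(), 'c': {'b'}}, 'a'))  # {'a': 0, 'b': 1, 'c': 2}
```

With the directed distance, c would be unreachable from a; symmetrizing avoids the oscillation described above.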
Visualization guided graph extraction Information on the structure and node positions is useful for graph extraction (as a heuristic): it gives a reasonable bias for parent selection, and using distance as the heuristic results in a nice-looking graph.
Algorithm 3 Visualization guided graph extraction
1: repeat
2:   get x(t) ∈ [0, 1] × [0, 1]
3:   j ← argmin_i ||x(t) − m_i||
4:   update the C_j parent set list using the neighbors of m_j as candidates
5:   move C_j and its neighbors toward x(t)
6: until convergence
Algorithm (3) shows how to use node positions as a heuristic for selecting possible parents, and how the current graph is taken into account when placing nodes in the visualization space. The architecture we chose to implement this algorithm is presented in figure 1. In this figure, the boxes represent information and the arrows represent processes. The information flow starts from the textual data and ends up in a viewing box. Let us describe each of the information boxes:

Textual data: this is the knowledge source. For now, all data is supposed to be available at the beginning. We present in the next section the changes required when the data is dynamic.
DATA: this is the preprocessed version of the textual data. The text is a list of sentences, and each sentence is a set of words; this representation is also known as the bag of words. No direct access is made to this part of the DATA. It may be stored on disk, as it is only required when new data arrives (to compute differences).

[Fig. 1. General architecture — information boxes: Textual data, DATA, Cache, Stat, Note, Graph structure, Node position, Viewing; processes: Parsing, differences count, MML-like metric, Level-wise search, SOM, Candidate selection.]
Cache: for each attribute set, it retains how many times the set occurred during the computation of data line differences. Note evaluation relies on this basic statistical information. This structure is accessed often, and ought to be optimized and kept in memory.

Stat: statistics on single attributes. They are required to compute attribute encoding lengths. They can be built incrementally (i.e. without rereading DATA).

Note: the note evaluates a set of possible parents for a given node. The better the note, the more likely the set of partial parents will be chosen. Many note computations are required; however, caching note results is hard (if not impossible), because the number of possible note computations is exponential and we know no heuristic to guess whether a given result will be required later. Moreover, when new data arrives, all the notes become obsolete.

Graph structure: the graph structure gives, for each node, its current partial parent list. The union of all partial parent sets (i.e. the parent set) is computed for each node, because it is required to perform the SOM-like step.

Node position: the positions of each node of the graph in the visualization space. Distances between nodes are computed using this information, which then allows selecting parent candidates.

Viewing: the viewing information consists of both the graph structure and the node positions. It is the final step of the visualization process. As convergence is a dynamic process, the visualized graph is not static. Moreover, if new data is processed, new results are displayed as soon as they are available.
Information management is done by the processes shown on the arrows of the architecture graph. First, Parsing is applied to the raw text data; it only consists in finding sentences and words. It is possible to enhance the data by applying linguistic or statistical filters. Then, differences between data lines are computed and counted to fill the cache structure, and some basic statistics are evaluated. Using this information, the metric described in equation (1) is computed in a level-wise search (algorithm (1)), in order to find the best possible structure. Using this structure, the SOM algorithm (4) finds the best node positions in the visualization space. These positions in turn restrict the number of possible parent sets to explore, and the loop goes on.
3.2 Incremental and dynamic Properties
Dynamic properties One interesting property of our approach is the ability to provide a graphical result at any iteration step. Thus, the current state of the extraction is always available to the user: the progress of the extraction is displayed on screen in real time. This allows direct parameter tuning.

Allowing the parameters to be altered during the extraction process requires some changes to algorithm (3), leading to the following one:
Algorithm 4 Visualization guided graph extraction (dynamic version)
1: repeat
2:   get x(t) ∈ [0, 1] × [0, 1]
3:   j ← argmin_i ||x(t) − m_i||
4:   update the notes of the C_j parent sets
5:   update the C_j parent set list using the neighbors of C_j as candidates
6:   move C_j and its neighbors toward x(t)
7: until convergence
In fact, the only difference is an additional step that updates the already computed notes. Note that, to avoid recomputing every note at each parameter modification, updating is done just before the scores are used. This means that just after a parameter change, most of the notes are erroneous; of course, this does not affect the computation much, as we correct them before use. Another key advantage of the dynamic result display is that the exploration of the extracted knowledge is not delayed.
Incremental properties The data used to compute the graph note may easily be extended: new data usually becomes available during processing. An incremental property is necessary to handle such data on the fly.

When new data arrives, it is handled as follows (see figure 1):

1. Incoming textual data is parsed and goes into DATA.
2. Stat and Cache are upgraded according to the new data.
3. If required, new nodes are added to the graph structure and placed randomly in the visualization space.
4. The SOM neighborhood function is raised (this is equivalent to a temperature increase in simulated annealing methods).
No additional action is needed. As iterations occur, the parent set notes are updated according to the new metric. The already computed positions are reused, avoiding a restart from scratch. Thus, the algorithm presented here is incremental.
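The incremental steps above can be sketched in code; this is our own illustration, not the authors' implementation, and all names (`ingest`, the container layout) are hypothetical:

```python
import random
from collections import Counter

def ingest(new_lines, data, cache, graph, positions, rng=random.Random(0)):
    """Hedged sketch of the incremental steps of section 3.2 (names are ours).
    new_lines: parsed sentences, each a frozenset of concept ids (step 1 done).
    cache: Counter of attribute-difference sets seen in sampled instance pairs."""
    for line in new_lines:
        # step 2: upgrade the Cache with differences against stored lines
        for old in rng.sample(data, min(len(data), 50)):
            cache[frozenset(line ^ old)] += 1   # symmetric difference of the pair
        data.append(line)
        # step 3: new concepts get a node, placed randomly in the visualization space
        for concept in line:
            if concept not in graph:
                graph[concept] = []             # no partial parent sets yet
                positions[concept] = [rng.random(), rng.random()]
    # step 4 (raising the SOM neighborhood, a "reheating") is left to the SOM loop

data, cache, graph, positions = [], Counter(), {}, {}
ingest([frozenset({'gene', 'plant'})], data, cache, graph, positions)
ingest([frozenset({'gene', 'dna'})], data, cache, graph, positions)
```

Existing positions and counts are never discarded: each new batch only adds difference counts, nodes, and random initial placements, which is what makes the process incremental.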
4 Experimental results
4.1 Results on corpus
First, we apply our approach to discover and visualize knowledge from two different texts. In this first step, we do not consider any dynamic aspect of the data: we assume the files are complete and available. The main objective of these two tests is to show what kind of knowledge it is possible to extract from text using unsupervised probabilistic network structure learning, and how to display the extracted knowledge. Before looking at the graphs, we have two important remarks:
– The task performed here is done without any prior domain knowledge or any linguistic tool. It is almost language independent, and could have been carried out on most web-available text documents. However, to interpret the results, we humans require language and/or domain knowledge.
– The graphs we produce are quite large (and we could generate even larger ones). To present them in this article, we had to reduce the scale, so the resulting graphs may not be easy to explore (tiny font). We tried to “zoom” in on some interesting parts, but it is easier to zoom on the actual electronic file. Moreover, it is not possible to show in a paper document how the graph evolves and reacts to parameter tuning.
Let us now present the text data we focused on :
WWW-HOWTO This how-to is included in many Linux distributions. It deals with WWW access and serving. It is a typical technical document, using specific words. It would not be easy to analyze with natural language text processing, because such processing would lack a specialized thesaurus. According to a word counting utility, its length is more than 14000 words, which is enough to perform a statistical analysis.
In figure (4.4), we show the result of our unsupervised analysis method. By following the links, it is possible to recognize familiar computer domain word associations, such as ftp, downloaded, network, man, hosts, virtual, machine. We can also distinguish some connected components such as computer, domain, enable, restart, message, send. Thus, it is possible to extract reduced lexical fields automatically.
DNA Corpus At the opposite end from the WWW-HOWTO file stands the DNA corpus. This corpus has been presented and studied in the International Journal of Corpus Linguistics ([?]). The initial study was based on a linguistic collocational network, in order to extract emergent uses of new word patterns in the scientific sublanguage associated with molecular biology applied to parasitic plants. Automated probabilistic network construction may be useful to linguists, even if the relation between collocational networks and probabilistic ones is not yet clear.
Fig. 2. DNA Corpus
Figure 2 shows the result we obtained for the DNA corpus. We have no particular knowledge of molecular biology, and chose some understandable samples among the extracted patterns. It appears that some of the associations are lexical fields (e.g. deletion, evolutionary) while others (hybridizes, intact) reveal the application domain. Even if these links are different from collocational links, the distance (in hop count) between collocates in the probabilistic network we produce is low (e.g. gene and plant are collocates, and they are linked via remains).
deletion ← gene contains kb similar evolutionary substitutions
evolutionary ← gene plant involved necessary deletion remains phylogenetic organisms hemiparasite properly introns
hybridizes ← chloroplast intact genes codons families stop lack leaf relate altered
intact ← functional species genes templates hybridizes leaf relate barley altered
remains ← gene plant plastids living involved hemiparasite reduction
sites ← protein site gel size homology sequencing
study ← trnas trna sequence isolated single content tests
Characterizing the linguistic relation between a word and its parents requires a domain expert. Being able to get a relation without knowing its nature was one of our requirements in this study. Interpreting the relation is another research topic, and a possible extension of this work.
4.2 Convergence problem
In figure (3), we plot the note augmentation per 1000 iteration steps on the WWW-HOWTO file. As an experimental result, we see that the curve may be bounded by a decreasing exponential. This is in accordance with the convergence property of the presented algorithm: the note is strictly growing, because a parent set may be replaced only by a better parent set, and the note is bounded, because there is only a finite number of possible graphs. The growing and bounded note sequence therefore converges, and the difference between consecutive notes tends to 0, as suggested by figure (3). We ensure graph note convergence, but what about graph convergence? The note converges and depends only on finite variables, so from a given iteration step onward the note is constant. The structure changes only if a better parent set has been found, which is not possible while the note is constant. Then no better parent set may be found, and the structure does not evolve.
Fig. 3. note augmentation per 1000 iteration step (WWW-HOWTO)
4.3 Effect of Initial position on result
The objective of this experiment (on the Mail-HOWTO data set) is to study the dependence between the initial positions and the graph note after numerous (100000) iteration steps. We ran our program using the same data set and three different initial positions of the nodes in the visualization space. Results are shown in figure (4), where each curve represents the note evolution for a given initialization of the word positions in the visualization space. Although the individual curves behave somewhat stochastically, it appears that the distance between them is always lower than 1 unit. As the values are around 20, the relative error is 5 percent. Thus, the dependence of the curve evolution on the initial positions is low.
Fig. 4. Note evolution for three different initial node positions (Mail-HOWTO)
4.4 Effect of new data insertion
In this experiment, the initial data set was cut into 4 segments. Every 50000 iteration steps, a new segment was transmitted to the program. The resulting note curve is given in figure (5). The most important effect of each new addition is to invalidate the previous evaluation function. This has two immediate effects:
– the current structure is re-evaluated using the new notation function. Thus, the note may increase or decrease very fast: it decreases if the graph is less useful given the new data set, and increases if the new data fit the already built model. One interpretation of a note decrease is that the method learned on a restricted set of data, fit it too closely, and is no longer able to generalize. This problem is also called over-fitting.
– the speed of the convergence process grows: unless the new data set can already be predicted using the current model, the model quality may increase by taking the new data into account.
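The re-scoring effect can be sketched as follows. The notation function below is a hypothetical counting-based dependence score, not the paper's stochastic measure, and the word segments are invented: the structure stays fixed, but adding a segment changes the empirical counts, so the same graph receives a new note, which may jump before optimization resumes.

```python
from collections import Counter
from itertools import combinations
import math

# Hypothetical notation function (not the paper's stochastic measure):
# each edge (a, b) of the structure is scored by the pointwise mutual
# dependence of a and b in the data seen so far.
def note(edges, sentences):
    n = len(sentences)
    single, pair = Counter(), Counter()
    for s in sentences:
        single.update(s)
        pair.update(combinations(sorted(s), 2))
    total = 0.0
    for a, b in edges:
        pab = pair[tuple(sorted((a, b)))] / n
        pa, pb = single[a] / n, single[b] / n
        if pab > 0:
            total += pab * math.log(pab / (pa * pb))
    return total

edges = [("mail", "server"), ("smtp", "server")]
seg1 = [{"mail", "server"}, {"smtp", "server"},
        {"mail", "smtp", "server"}, {"howto"}]
seg2 = [{"server"}, {"mail"}, {"smtp"}]  # new segment with weaker links

before = note(edges, seg1)
after = note(edges, seg1 + seg2)  # same structure, new counts
print(f"note before = {before:.3f}, after = {after:.3f}")
```

Here the second segment breaks the co-occurrence pattern, so the unchanged structure scores lower under the new counts, the "note decrease" case described above.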
Fig. 5. Four-part segmented text (WWW-HOWTO), one part every 50000 iteration steps
In order to evaluate the change in the model before and after a data addition, we split the WWW-HOWTO into two parts, and captured the graph before and after the insertion of the second part. The result is shown in figure (6). No parameter was changed between these two snapshots; however, the second graph is sparser than the first one. This is a direct consequence of the change in the graph metric: on the one hand, some links no longer hold; on the other hand, some new links appear.
4.5 Influence of data order
This data set has been built by cutting the original file (Mail-HOWTO) into three segments: A1, A2 and A3. We present these data to our program in two different orders:
Fig. 6. WWW-HOWTO analysis (evolution)
order 1: segment A1, A2 and then A3.
order 2: segment A3, A2 and then A1.
In both experiments, we sent a new data segment every 10000 iteration steps. Figure (7) presents the note curves at each step for order 1 and order 2.
In the first period (0-10000), the note of the graph discovered with order 1 is lower than the note of the graph extracted with order 2. We deduce from this observation that segment A1 is harder to model than segment A3.
In the second period, the curves associated with both order 1 and order 2 grow rapidly. Consequently, segment A2 seems to be the easiest to model. Networks that fit segment A1 or A3 well get an even better note when A2 is added; the structure of A2 certainly encompasses those of A1 and A3.
In the third period, order 1 becomes better than order 2, but the difference between the two curves is roughly one unit. As we found a similar difference between curves that differ only in the initial node positions, we cannot say that, in this case, the data order had a great influence on the final result.
5 Conclusion
In this paper, we presented data structures and algorithms to perform text mining using a probabilistic network framework. We take advantage of methods for automated structure learning from data in Bayesian networks to extract interesting patterns in texts, without any prior language knowledge or syntactic analysis. The proposed
Fig. 7. Two different orders on the same data
method is robust, general and unsupervised, and thus may be applied to almost any kind of textual data.
The work described in this paper is being adapted as a module for an automated web browser, in order to read automatically retrieved web documents and present an analysis to the end-user. We take advantage of the heuristic used here to perform visualization efficiently (the word positions on screen are computed during the graph optimization). We have obtained encouraging results in this direction.
Proofs
Error estimation using differences. We claim

Pe = 1/2 − 1/2 . Σ_Y P(Y) . √(1 − 2.q(Y)), where q(Y) = 2.P(X = 0|Y).P(X = 1|Y)

Proof: by definition of the error probability,

Pe = Σ_Y P(Y) . min(P(X = 0|Y), P(X = 1|Y))

We use min(p, 1 − p) = (1 − |2p − 1|)/2 and 2p(1 − p) = q ↔ p = 1/2 ± 1/2.√(1 − 2q), i.e. |2p − 1| = √(1 − 2q). Hence

Pe = 1/2 . Σ_Y P(Y) . (1 − |2.P(X = 0|Y) − 1|) = 1/2 − 1/2 . Σ_Y P(Y) . √(1 − 2.q(Y))