TRANSCRIPT
EUROCAST’01
Marta E. Zorrilla, José L. Crespo and Eduardo Mora
Department of Applied Mathematics and Computer Science
University of Cantabria
An Online Information Retrieval System by means of Artificial Neural Networks
Introduction I

What is an ‘Information Retrieval System’?
• structured field search
• full-text search
General process

[Diagram] Documents → Document transfer → Indexing and Storing → Indexes → Search Interface → Relevance classification → Documents
Indexing and storing

[Diagram] Original documents → Text extraction → ‘Pure text’ files → Filtering → Files to index → Indexation → Storing → Indexes
Filtering applies:
• Stopwords
• Stemming
• Thesaurus
• List of terms
The original documents are kept in a documents database, with links from the indexes back to the documents.
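The filtering stage above (stopword removal plus stemming) can be sketched in Python; the stopword list and the suffix-stripping rules here are illustrative assumptions, not the ones used in the system.

```python
# Sketch of the filtering stage: stopword removal and a naive suffix
# stemmer. Stopword list and suffix rules are illustrative assumptions.
import re

STOPWORDS = {"the", "a", "of", "and", "to", "in"}
SUFFIXES = ("ing", "ed", "es", "s")

def extract_terms(text):
    """Lowercase, tokenise, drop stopwords, strip common suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    terms = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        for suf in SUFFIXES:
            # only strip when a reasonable stem remains
            if tok.endswith(suf) and len(tok) > len(suf) + 2:
                tok = tok[: -len(suf)]
                break
        terms.append(tok)
    return terms

print(extract_terms("The plants growing in the garden"))
# → ['plant', 'grow', 'garden']
```

A real system would use a proper stemmer for the target language rather than this toy suffix list.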
Classification

Classification of Information Retrieval Systems:
• Free dictionary
  • Clustering: statistics, self-organising ANN
  • Latent Semantic Indexing
• Pre-established dictionary
Term representation: in words, in n-grams
Index structures: inverse indexes, vectorial representation
Inverse index

Each word in the ordered word list has entries in two structures:
• Text indexes: one record (WORD, d, p, s, w) per appearance, where d = document code, p = paragraph number, s = sentence number, w = word position in the sentence
• Dictionary: one record (WORD, D, A) per word, where D = number of documents and A = number of appearances
Example: ‘Plant’ has D = 2, A = 3 (three appearances spread over two documents), with a link to the text files.
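A minimal Python sketch of this inverse index: a postings list of (d, p, s, w) tuples per term, plus the dictionary counters D and A. Paragraph and sentence splitting are deliberately simplified (newlines and full stops).

```python
# Sketch of the inverse index: per-term postings (document code,
# paragraph, sentence, word position) and the (D, A) dictionary.
# Tokenisation is simplified for illustration.
from collections import defaultdict

def build_index(docs):
    postings = defaultdict(list)          # term -> [(d, p, s, w), ...]
    for d, doc in enumerate(docs):
        for p, para in enumerate(doc.split("\n")):
            for s, sent in enumerate(para.split(".")):
                for w, term in enumerate(sent.lower().split()):
                    postings[term].append((d, p, s, w))
    # D = number of documents containing the term, A = total appearances
    dictionary = {t: (len({e[0] for e in entries}), len(entries))
                  for t, entries in postings.items()}
    return postings, dictionary

postings, dictionary = build_index(["the plant grows", "a plant. the plant"])
print(dictionary["plant"])   # → (2, 3), matching the D = 2, A = 3 example
```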
Self-organising ANN

[Diagram: inputs, best matching unit (BMU), neighbourhood radius]
• Kohonen’s topological map
• Fritzke’s growing topological maps
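One Kohonen training step can be sketched as follows: find the best matching unit (BMU) for an input vector, then pull the BMU and its grid neighbours towards the input. Map size, learning rate and neighbourhood radius are illustrative choices, and the Gaussian neighbourhood is one common variant.

```python
# One Kohonen SOM training step on a small topological map.
# Map size, learning rate and neighbourhood radius are illustrative.
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 4, 4, 3
weights = rng.random((rows, cols, dim))   # one weight vector per map unit

def train_step(x, lr=0.5, radius=1.0):
    # BMU: the unit whose weight vector is closest to the input
    dists = np.linalg.norm(weights - x, axis=2)
    bi, bj = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighbourhood measured on the map grid, centred on the BMU
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    h = np.exp(-((ii - bi) ** 2 + (jj - bj) ** 2) / (2 * radius ** 2))
    # move the BMU and its neighbours towards the input
    weights[:] = weights + lr * h[:, :, None] * (x - weights)
    return bi, bj

x = np.array([0.2, 0.8, 0.5])
for _ in range(20):
    bmu = train_step(x)   # the BMU's weights converge towards x
```

In full training the learning rate and radius would decay over time; Fritzke's growing variants instead insert new units where the map's accumulated error is largest.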
Clustering statistics

[Dendrogram of merge distances from 0.14 to 0.99; cutting it yields 7 clusters]
LSI

A is the m x n term-by-document matrix: rows T_1 … T_m are the terms, columns D_1 … D_n are the documents, and entry f_ij is the frequency of term T_i in document D_j.

Singular Value Decomposition: A = U Σ V^t, with U (m x r) holding the term vectors, Σ (r x r) the singular values and V^t (r x n) the document vectors. Keeping only the k largest singular values gives the rank-k approximation A_k = U_k Σ_k V_k^t.

Folding new vectors into the k-dimensional space:
• Query: q̂ = q^t U_k Σ_k^{-1}
• New documents: d̂ = d^t U_k Σ_k^{-1}
• New terms: t̂ = t^t V_k Σ_k^{-1}
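The decomposition and the query fold-in can be sketched with NumPy; the term-by-document matrix values and the choice k = 2 are made up for illustration.

```python
# LSI sketch: SVD of a small term-by-document matrix, rank-k
# truncation, and folding a query in via q_hat = q^t U_k S_k^{-1}.
# Matrix values and k are illustrative.
import numpy as np

A = np.array([[2., 0., 1.],          # rows = terms T1..T4
              [0., 1., 1.],          # columns = documents D1..D3
              [1., 1., 0.],
              [0., 2., 1.]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk, Vk = U[:, :k], np.diag(s[:k]), Vt[:k, :].T   # Vk rows = doc vectors

q = np.array([1., 0., 1., 0.])        # query as a term-frequency vector
q_hat = q @ Uk @ np.linalg.inv(Sk)    # query folded into the k-space

# rank documents by cosine similarity to the projected query
sims = (Vk @ q_hat) / (np.linalg.norm(Vk, axis=1) * np.linalg.norm(q_hat))
ranking = np.argsort(-sims)           # document indices, most similar first
```

New documents and new terms fold in the same way, using U_k for documents and V_k for terms as in the formulas above.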
ANN for classification

• Competitive networks (e.g. self-organising): one processor in the output layer gives a non-null response
• Radial basis networks: a continuous response, generally in one layer
• Multilayer perceptrons: similar to radial networks, except in the activation function and the operations made at the connections
Proposal

Information Retrieval System: a neural network whose input layer receives dictionary words in binary representation (w1, w2, w3, w4, …, wn) and whose output layer has one processor per document (doc1, doc2, doc3, doc4, …, doc n).
• Dictionary: COES, a Spanish dictionary developed by Santiago Rodriguez and Jesús Carretero
• Documents: Spanish Civil Code articles
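A toy sketch of the proposed architecture: binary word codes at the input, one output unit per document, and one tanh hidden layer. The sizes match the toy problem used in the tests below (14 dictionary words, 10 documents); plain gradient descent on the mean squared error stands in here for the conjugate-gradient and quasi-Newton optimisers actually used.

```python
# Toy sketch of the proposed word-to-document network (14 inputs,
# 5 hidden tanh units, 10 outputs). Plain gradient descent replaces
# the CG / quasi-Newton optimisers used in the talk.
import numpy as np

rng = np.random.default_rng(1)
n_words, n_hidden, n_docs = 14, 5, 10

W1 = rng.normal(0.0, 0.5, (n_words, n_hidden))
W2 = rng.normal(0.0, 0.5, (n_hidden, n_docs))

X = np.eye(n_words)[:n_docs]          # toy binary word representations
Y = np.eye(n_docs)                    # target: one document per input

def forward(X):
    H = np.tanh(X @ W1)
    return H, H @ W2

def mse():
    _, out = forward(X)
    return float(((out - Y) ** 2).mean())

err_before = mse()
for _ in range(1000):
    H, out = forward(X)
    err = (out - Y) / len(X)
    gW2 = H.T @ err                              # output-layer gradient
    gW1 = X.T @ ((err @ W2.T) * (1 - H ** 2))    # hidden layer, tanh derivative
    W2 -= 0.1 * gW2
    W1 -= 0.1 * gW1
err_after = mse()
print(err_before, err_after)                     # the error drops with training
```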
Test and Results I

Neural network with Radial Basis Functions
Error function: mean squared error, entropy
Nº documents: 93; Nº words in dictionary: 140
Results:
• The error function tends to spurious minima (the gradient is essentially zero)
• The neural network needs a processor for each word in the dictionary, i.e. the network isn’t compact
Conclusion: the ordinary RBF approach is not appropriate; a change of approach or a change of network is needed. We present another network: the MLP.
Test and Results II

Multilayer Perceptron with tanh activation function
Error function: mean squared error, entropy
Nº documents: 10; Nº words in dictionary: 14
Architecture: 10x5x10; 10x7x10; 10x10x10
Optimisation methods: Conjugate Gradient, Quasi-Newton with linear and parabolic minimisation.
Results:
• A 10x5x10 architecture can learn the training set
• The optimisation method can be of decisive importance
• The same method, in different programs, offers different results
• The error function does not make much of a difference
Conclusion: in order to gain insight into the optimisation process, we programmed the network ourselves.
Results (after programming the network ourselves):
• A 10x5x10 architecture almost learns the training set; with 10x10x10 learning is perfect
• Quasi-Newton with parabolic minimisation is the most efficient method
• Mean squared error offers better results than entropy
• Sorting the training set by number of occurrences, or scaling the output between 0 and 1, doesn’t offer better results
Test and Results III

Future:
Growing output layer with more documents. It will be necessary to increase the number of hidden neurons when the error becomes high.

[Plot: mean squared error (0 to 0.5) against the number of hidden neurons (5 to 10), for Quasi-Newton (Golden), Quasi-Newton (Brent), CG (Golden) and CG (Brent)]
Conclusions

• What is an Information Retrieval System
• How does it work?
• Classification of IRS
• Proposal: a neural network in which each output-layer processor represents a document and the input layer receives words in binary representation
• Results: promising results of the MLP on a toy problem