EUROCAST’01. Marta E. Zorrilla, José L. Crespo and Eduardo Mora, Department of Applied Mathematics and Computer Science, University of Cantabria. An Online Information Retrieval Systems by means of Artificial Neural Networks


Page 1: An Online Information Retrieval Systems by means of Artificial Neural Networks

EUROCAST’01

Marta E. Zorrilla, José L. Crespo and Eduardo Mora

Department of Applied Mathematics and Computer Science

University of Cantabria

Page 2: Introduction I

What are ‘Information Retrieval Systems’?

• Structured field search
• Full-text search

Page 3: General process

Diagram components: Documents; Indexing and storing; Indexes; Search interface; Relevance classification; Document transfer

Page 4: Indexing and storing

Diagram data stores: Original documents; ‘Pure text’ files; Files to index; Indexes; Documents database

Diagram steps: Text extraction; Filtering; Indexing; Storing; Link to documents

Filtering uses:
• Stopwords
• Stemming
• Thesaurus
• List of terms
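The filtering step on this slide can be illustrated with a minimal sketch. The stopword list and the suffix-stripping stemmer below are hypothetical stand-ins chosen for illustration; the slides do not say which resources or stemming rules the authors used, and a real system would rely on language-specific tools.

```python
import re

# Hypothetical stopword list and suffixes, chosen only for illustration.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}
SUFFIXES = ("ing", "es", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def filter_text(text):
    """Lowercase, tokenise, drop stopwords, and stem the remaining words."""
    tokens = re.findall(r"[a-záéíóúüñ]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]


terms = filter_text("The indexing of plants and documents")
print(terms)
```

The list returned by `filter_text` plays the role of the "files to index": only content-bearing, normalised terms reach the indexing step.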

Page 5: Classification of Information Retrieval Systems

Diagram labels:
• Free dictionary
• Pre-established dictionary
• Inverse indexes (in words, in n-grams)
• Vectorial representation
• Clustering
• Latent Semantic Indexing
• Statistics
• Self-organising ANN

Page 6: Inverse index

Diagram labels: Text files; Text indexes; Ordered word list; Dictionary

Record layout: WORD d p s w
d = document code
p = paragraph number
s = sentence number
w = word in the sentence

Record layout: WORD D A
D = number of documents
A = number of appearances

Example: the word "Plant" appears in three WORD d p s w rows and in one WORD D A row, "Plant 2 3".
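The two record layouts on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: the nested-list document format and the sample collection are assumptions, with the sample chosen so that the dictionary entry for "plant" comes out as D = 2, A = 3, as in the slide's example.

```python
from collections import defaultdict


def build_index(documents):
    """Build occurrence records (d, p, s, w) and a dictionary (D, A) per word.

    `documents` maps a document code to a list of paragraphs; each paragraph
    is a list of sentences; each sentence is a list of words.
    """
    occurrences = defaultdict(list)
    for d, paragraphs in documents.items():
        for p, sentences in enumerate(paragraphs, start=1):
            for s, words in enumerate(sentences, start=1):
                for w, word in enumerate(words, start=1):
                    occurrences[word.lower()].append((d, p, s, w))

    dictionary = {}
    for word, rows in occurrences.items():
        n_documents = len({row[0] for row in rows})  # D
        n_appearances = len(rows)                    # A
        dictionary[word] = (n_documents, n_appearances)
    return dict(occurrences), dictionary


# Hypothetical two-document collection.
docs = {
    1: [[["the", "plant", "grows"], ["water", "the", "plant"]]],
    2: [[["a", "power", "plant"]]],
}
occurrences, dictionary = build_index(docs)
print(dictionary["plant"])   # (D, A)
print(occurrences["plant"])  # one (d, p, s, w) row per occurrence
```

A full-text query for a word then reduces to a dictionary lookup followed by reading its occurrence rows.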

Page 7: Self-organising ANN

• Kohonen’s topological map
• Fritzke’s growing topological maps

Diagram labels: inputs; bmu (best matching unit); neighbourhood radius
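One training step of a Kohonen-style map can be sketched as below, to show how the inputs, the bmu and the neighbourhood radius interact. The 3x3 grid, the Gaussian neighbourhood function and the learning rate are assumptions made for illustration; the slides do not specify them, and Fritzke's growing maps (which add units during training) are not covered here.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the grid position of the unit whose weights are closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))


def train_step(weights, x, learning_rate, radius):
    """Move the bmu and its grid neighbours within `radius` towards x."""
    bmu = best_matching_unit(weights, x)
    for pos, w in weights.items():
        grid_distance = math.dist(pos, bmu)
        if grid_distance <= radius:
            # Gaussian neighbourhood: units nearer the bmu move more.
            influence = math.exp(-grid_distance ** 2 / (2 * radius ** 2))
            weights[pos] = [
                wi + learning_rate * influence * (xi - wi)
                for wi, xi in zip(w, x)
            ]
    return bmu


random.seed(0)
# 3x3 map of units with two-dimensional weight vectors.
weights = {(r, c): [random.random(), random.random()]
           for r in range(3) for c in range(3)}
x = [0.9, 0.1]
before = math.dist(weights[best_matching_unit(weights, x)], x)
bmu = train_step(weights, x, learning_rate=0.5, radius=1.0)
after = math.dist(weights[bmu], x)
print(bmu, before, after)
```

With a learning rate of 0.5 the bmu moves halfway towards the input, while its grid neighbours move by a smaller fraction.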

Page 8: Clustering statistics

Figure: points marked (*) along a "Distances" axis, grouped into 7 clusters.

Distances: 0.14, 0.17, 0.23, 0.27, 0.31, 0.38, 0.42, 0.47, 0.57, 0.61, 0.69, 0.73, 0.79, 0.84, 0.89, 0.91, 0.96, 0.99

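Page 9: LSI

Term-document matrix ($m$ terms $\times$ $n$ documents), where $f_{ij}$ is the entry for term $T_i$ in document $D_j$:

$$
A =
\begin{array}{c|ccccc}
 & D_1 & D_2 & D_3 & \cdots & D_n \\ \hline
T_1 & f_{11} & f_{12} & f_{13} & \cdots & f_{1n} \\
T_2 & f_{21} & f_{22} & f_{23} & \cdots & f_{2n} \\
T_3 & f_{31} & f_{32} & f_{33} & \cdots & f_{3n} \\
\vdots & \vdots & \vdots & \vdots & & \vdots \\
T_{m-1} & f_{m-1,1} & f_{m-1,2} & f_{m-1,3} & \cdots & f_{m-1,n} \\
T_m & f_{m1} & f_{m2} & f_{m3} & \cdots & f_{mn}
\end{array}
$$

Singular Value Decomposition:

$$
\underbrace{A}_{m \times n} = \underbrace{U}_{m \times r}\;\underbrace{\Sigma}_{r \times r}\;\underbrace{V^{T}}_{r \times n},
\qquad
A_k = U_k \Sigma_k V_k^{T}
$$

The rows of $U$ are the term vectors and the columns of $V^{T}$ are the document vectors; $A_k$ keeps only the first $k$ dimensions.

Query:

$$\hat{q} = q^{T} U_k \Sigma_k^{-1}$$

New documents:

$$\hat{d} = d^{T} U_k \Sigma_k^{-1}$$

New terms:

$$\hat{t} = t^{T} V_k \Sigma_k^{-1}$$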
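The query projection on this slide can be tried out on a small hand-made example. The 3-term by 2-document matrix below is hypothetical and chosen so that its SVD can be written by hand; ranking the documents by cosine similarity is a common choice in LSI but is an assumption here, since the slides do not name the similarity measure.

```python
import math


def fold_in(vector, basis, singular_values):
    """Project a vector into the k-dimensional space: v^T B_k Sigma_k^{-1}.

    `basis` is U_k (for queries and new documents) or V_k (for new terms),
    stored as a list of rows with k columns.
    """
    k = len(singular_values)
    return [
        sum(v * row[j] for v, row in zip(vector, basis)) / singular_values[j]
        for j in range(k)
    ]


def cosine(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


# Hypothetical 3-term x 2-document matrix A = [[2, 0], [0, 1], [0, 0]].
# Its SVD is simple enough to write by hand: U_k and V_k are made of unit
# vectors and the singular values are 2 and 1.
U_k = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
sigma_k = [2.0, 1.0]
V_k = [[1.0, 0.0], [0.0, 1.0]]  # row j is the vector of document j

query = [1.0, 1.0, 0.0]  # the query uses terms T1 and T2
q_hat = fold_in(query, U_k, sigma_k)
scores = [cosine(q_hat, doc) for doc in V_k]
print(q_hat, scores)
```

The same `fold_in` function covers new documents (with `U_k`) and new terms (with `V_k`), mirroring the three formulas above.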

Page 10: ANN for classification

• Competitive networks (e.g. self-organising): one processor in the output layer with non-null response
• Radial basis networks: a continuous response, generally in one layer
• Multilayer perceptrons: similar to radial basis networks, except in the activation function and the operations performed at the connections

Page 11: Proposal

Information Retrieval System

Network diagram:
• Inputs w1, w2, w3, w4, ..., wn: dictionary word in binary representation
• Outputs doc1, doc2, doc3, doc4, ..., doc n: documents in output layer

COES: Spanish dictionary developed by Santiago Rodriguez and Jesús Carretero

Documents: Spanish Civil Code Articles
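A possible way to build training pairs for such a network is sketched below. It assumes that "binary representation" means the binary code of the word's position in the dictionary, that the input layer has 10 units, and that each output is 1 when the document contains the word and 0 otherwise. None of these details is confirmed by the slides, and the words and documents are invented for illustration.

```python
def encode_word(index, n_bits):
    """Binary code of a dictionary position, most significant bit first."""
    if not 0 <= index < 2 ** n_bits:
        raise ValueError("index does not fit in n_bits")
    return [(index >> bit) & 1 for bit in reversed(range(n_bits))]


def target_vector(word, documents):
    """One output per document: 1 if the document contains the word, else 0."""
    return [1 if word in doc else 0 for doc in documents]


# Hypothetical dictionary and documents (each document is a set of words).
dictionary = ["contract", "heir", "property", "tenant"]
documents = [{"property", "heir"}, {"contract", "tenant"}, {"property"}]

n_bits = 10  # assumed input width
training_set = [
    (encode_word(i, n_bits), target_vector(word, documents))
    for i, word in enumerate(dictionary)
]
for inputs, targets in training_set:
    print(inputs, targets)
```

With this encoding the input layer grows with the logarithm of the dictionary size, while the output layer needs one processor per document.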

Page 12: Test and Results I

Neural network with Radial Basis Functions

Error function: mean squared error, entropy

Number of documents: 93. Number of words in dictionary: 140

Results:
• The error function tends towards false minima (the gradient is essentially zero)
• The neural network needs a processor for each word in the dictionary, i.e. the network is not compact

Conclusion: the ordinary RBF approach is not appropriate; a change of approach or of network is needed. We present another network: the MLP.

Page 13: Test and Results II

Multilayer Perceptron with tanh activation function

Error function: mean squared error, entropy

Number of documents: 10. Number of words in dictionary: 14

Architecture: 10x5x10; 10x7x10; 10x10x10

Optimisation methods: Conjugate Gradient, Quasi-Newton with linear and parabolic minimisation

Results:
• A 10x5x10 architecture can learn the training set
• The optimisation method can be decisive
• The same method, in different programs, gives different results
• The error function does not make much of a difference

Conclusion: to gain insight into the optimisation process, we programmed the network ourselves

Results with our own implementation:
• A 10x5x10 architecture almost learns the training set; a 10x10x10 architecture learns it perfectly
• Quasi-Newton with parabolic minimisation is the most efficient method
• Mean squared error gives better results than entropy
• Sorting the training set by number of occurrences or scaling the output between 0 and 1 does not give better results
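The forward pass of a 10x5x10 perceptron with tanh units, together with the two error functions named on this slide, can be sketched as follows. This is an illustration under assumptions: the weight layout (bias first), the random initialisation, the sample input and targets, and the mapping of tanh outputs to (0, 1) before computing the entropy error are all hypothetical choices, not details given in the slides.

```python
import math
import random


def forward(x, w_hidden, w_out):
    """Forward pass of a one-hidden-layer perceptron with tanh units.

    Each weight row holds one bias followed by one weight per input.
    """
    hidden = [math.tanh(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in w_hidden]
    return [math.tanh(row[0] + sum(w * h for w, h in zip(row[1:], hidden)))
            for row in w_out]


def mean_squared_error(outputs, targets):
    return sum((o - t) ** 2 for o, t in zip(outputs, targets)) / len(targets)


def cross_entropy(outputs, targets, eps=1e-12):
    """Entropy error for targets in {0, 1}; tanh outputs are mapped to (0, 1)."""
    total = 0.0
    for o, t in zip(outputs, targets):
        p = min(max((o + 1) / 2, eps), 1 - eps)
        total -= t * math.log(p) + (1 - t) * math.log(1 - p)
    return total / len(targets)


def random_layer(n_units, n_inputs, rng):
    """Small random weights: one bias plus n_inputs weights per unit."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs + 1)]
            for _ in range(n_units)]


rng = random.Random(0)
w_hidden = random_layer(5, 10, rng)  # 10 inputs -> 5 hidden units
w_out = random_layer(10, 5, rng)     # 5 hidden units -> 10 outputs

x = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]        # hypothetical binary word code
targets = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]  # hypothetical document targets
outputs = forward(x, w_hidden, w_out)
print(mean_squared_error(outputs, targets), cross_entropy(outputs, targets))
```

Training would then minimise one of these error functions over all weights with conjugate gradient or a quasi-Newton method; that optimisation loop is left out of the sketch.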

Page 14: Test and Results III

Figure: mean squared error (y axis, 0 to 0.5) against number of hidden neurons (x axis, 5 to 10). Series: Quasi-Newton (Golden), Quasi-Newton (Brent), Conjugate gradient (Golden), Conjugate gradient (Brent).

Future:

Growing the output layer with more documents. The number of hidden neurons will need to be increased when the error becomes high.
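The "Golden" label in the figure legend presumably refers to golden-section line minimisation, the one-dimensional search that conjugate gradient and quasi-Newton methods run along each search direction; that reading is an assumption. A minimal sketch of golden-section search follows, using a hypothetical one-dimensional error curve in place of the network error along a search direction.

```python
import math

INV_PHI = (math.sqrt(5) - 1) / 2  # 1 / golden ratio, about 0.618


def golden_section_search(f, a, b, tol=1e-6):
    """Minimise a unimodal function f on [a, b] by golden-section search."""
    c = b - INV_PHI * (b - a)
    d = a + INV_PHI * (b - a)
    fc, fd = f(c), f(d)
    while b - a > tol:
        if fc < fd:
            # The minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - INV_PHI * (b - a)
            fc = f(c)
        else:
            # The minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + INV_PHI * (b - a)
            fd = f(d)
    return (a + b) / 2


def line_error(step):
    """Hypothetical error along a search direction, minimal at step 0.3."""
    return (step - 0.3) ** 2 + 1.0


best_step = golden_section_search(line_error, 0.0, 1.0)
print(best_step)
```

Each iteration costs one new function evaluation and shrinks the bracket by a constant factor; Brent's method usually needs fewer evaluations on smooth functions because it adds parabolic interpolation.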

Page 15: Conclusions

• What is an Information Retrieval System?
• How does it work?
• Classification of IRS
• Proposal: a neural network in which each output layer processor represents a document and the input layer receives words in binary representation
• Results: promising results of the MLP in a toy problem