Self-organizing maps applied to information retrieval of dissertations and theses from BDTD-UFPE

Bruno Pinheiro, Renato Correa

Post on 23-Feb-2016

Outline

• Information Retrieval Systems (IRS)
• IRS + SOM
• Related Works
• Document Collection
• System Architecture
• Methodology
• Results

Information Retrieval Systems (IRS)

• Indexing, searching, and classifying textual documents.
• Users’ information needs.
• Matching users’ queries against the system’s vocabulary.

IRS + SOM

[Diagram: Self-Organizing Maps + Information Retrieval System]

IRS + SOM

• Navigation interface built through document maps
• Document maps
  – Self-Organizing Map trained with document vectors

Related Works

• First works (1991-1995)
  – Lin / Merkl
• Great projects (1996-2000)
  – Arizona Digital Library, WEBSOM, SOMLib
• Diversification (2001-2005)
  – LiGHtSOM, GHSOM, H2SOM
• Convergence (2006)

Document Collection

• UFPE Digital Library of Theses and Dissertations (BDTD-UFPE)
  – Offers the full text of all theses and dissertations produced in the university’s graduate programs.
  – Approximately 6000 documents.
  – Linked to the Brazilian BDTD and to the NDLTD (Networked Digital Library of Theses and Dissertations).

System Architecture

[Diagram: processing stages, each with its output]
Document Acquisition → Documents’ content
Document Indexing → Inverted Index
Document Representation → Document Vectors
Dimensionality Reduction → Reduced Vectors
Volume Reduction → Prototype Vectors
Construction of Document Map → Document Map

Methodology

• Document acquisition
  – Harvesting through the OAI-PMH protocol
  – XML files containing the documents’ metadata
  – Data extraction with the Java library JColtrane
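
The acquisition step can be illustrated with a minimal sketch. It is not the authors' JColtrane code: it uses Python's standard XML parser instead, and the sample response, its identifier, and the function name `extract_records` are hypothetical. Only the three namespace URIs come from the OAI-PMH and Dublin Core conventions.

```python
import xml.etree.ElementTree as ET

# Namespaces used by OAI-PMH responses that carry Dublin Core metadata.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical ListRecords response, trimmed to a single record.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:123</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis title</dc:title>
          <dc:description>A sample abstract.</dc:description>
          <dc:subject>Computer Science</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def extract_records(xml_text):
    """Return one dict per record with its identifier and Dublin Core fields."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext(
            "oai:header/oai:identifier", default="", namespaces=NS
        )
        dc = record.find("oai:metadata/oai_dc:dc", NS)
        fields = {}
        if dc is not None:
            for child in dc:
                # Drop the "{namespace}" prefix and keep only the element name.
                name = child.tag.split("}", 1)[-1]
                fields.setdefault(name, []).append((child.text or "").strip())
        records.append({"identifier": identifier, "fields": fields})
    return records
```

Each field is stored as a list because Dublin Core elements such as `dc:subject` may repeat within one record.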

Methodology

• Indexing
  – Java library Lucene.
  – Stemming, plus elimination of digits and stopwords.
  – Inverted index built following the vector space model.
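
The indexing steps listed above can be sketched in a few lines. This is a stand-in for the Lucene pipeline, not a reproduction of it: the stopword list is a tiny illustrative sample, `naive_stem` is a toy suffix stripper rather than a real stemmer, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; a real system would use a full list
# for the language of the collection.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "for"}


def naive_stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, keep letter-only tokens (drops digits), remove stopwords, stem."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOPWORDS]


def build_inverted_index(docs):
    """Map each term to {doc_id: term frequency in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in analyze(text):
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


docs = {1: "Indexing of theses in 2016", 2: "Maps and indexing"}
index = build_inverted_index(docs)
```

Storing the term frequency next to each document identifier is what lets the index feed the vector space model directly.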

Methodology

• Document representation
  – Documents are represented by vectors indexed by term; each value is a function of the term’s frequency of occurrence in the document.
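
A minimal sketch of this representation follows. The slides do not say which function of term frequency was used, so the default weight `1 + ln(tf)` is an assumption, as are the function names.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Assign each distinct term a fixed position in the vector."""
    terms = sorted({t for doc in tokenized_docs for t in doc})
    return {term: i for i, term in enumerate(terms)}


def document_vector(tokens, vocabulary, weight=lambda tf: 1.0 + math.log(tf)):
    """Build a dense vector whose entries are a function of term frequency.

    The default weight (1 + ln tf) is an assumption of this sketch.
    """
    vector = [0.0] * len(vocabulary)
    for term, tf in Counter(tokens).items():
        if term in vocabulary:
            vector[vocabulary[term]] = weight(tf)
    return vector


docs = [["som", "map", "som"], ["index", "map"]]
vocabulary = build_vocabulary(docs)
vectors = [document_vector(doc, vocabulary) for doc in docs]
```

Passing `weight=float` yields raw term counts instead of the log-scaled values.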

Methodology

• Dimensionality reduction
  – Feature selection based on word frequency
  – Stopword elimination
  – Final dimensionality: 13095 terms
• Volume reduction
  – Not used.
  – Volume: 4781 documents
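
Frequency-based feature selection might look like the sketch below. The slides give neither the exact frequency measure nor the cut-offs, so document frequency and the thresholds `min_df` and `max_df_ratio` are hypothetical choices made for illustration.

```python
from collections import Counter


def select_terms(tokenized_docs, min_df=2, max_df_ratio=0.5):
    """Keep terms whose document frequency lies in [min_df, max_df_ratio * N].

    Both thresholds are hypothetical; the slides do not state the cut-offs.
    """
    n_docs = len(tokenized_docs)
    # Count each term once per document it occurs in.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    return sorted(
        term for term, count in df.items()
        if min_df <= count <= max_df_ratio * n_docs
    )


docs = [
    ["som", "map", "thesis"],
    ["som", "index", "thesis"],
    ["map", "thesis", "rare"],
    ["thesis", "lucene"],
]
kept = select_terms(docs)
```

Terms that occur in almost every document carry little discriminating power, and terms that occur once or twice add dimensions without helping to group documents, which is why both ends are trimmed here.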

Methodology

• Document map construction
  – Single stage
  – somtoolbox functions for MATLAB
  – Document vectors normalized before training
  – SOM with rectangular structure (10 x 12) and hexagonal neighborhood
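
The two preparatory steps on this slide can be sketched as follows. This is plain Python rather than somtoolbox code. Unit-length normalization is an assumption (the slides do not name the normalization), and so is the reading of 10 x 12 as 10 rows by 12 columns.

```python
import math


def normalize(vector):
    """Scale a vector to unit Euclidean length; zero vectors stay unchanged."""
    norm = math.sqrt(sum(v * v for v in vector))
    return list(vector) if norm == 0 else [v / norm for v in vector]


def hex_grid_coords(rows, cols):
    """Planar coordinates of map units on a hexagonal lattice.

    Odd rows are shifted by half a unit and rows are sqrt(3)/2 apart, so
    every interior unit has six neighbors at distance 1.
    """
    coords = []
    for r in range(rows):
        for c in range(cols):
            x = c + (0.5 if r % 2 else 0.0)
            y = r * math.sqrt(3) / 2
            coords.append((x, y))
    return coords


coords = hex_grid_coords(10, 12)  # 120 map units
```

The coordinates matter only for the neighborhood function: distances between units are measured on this grid, not in the term space.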

Methodology

• Document map construction
  – Weights initialized linearly along the two greatest eigenvectors
  – Batch-type SOM algorithm with dot-product metric
  – Gaussian neighborhood function
  – Neighborhood size decreasing linearly with the number of epochs
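
One epoch of a batch SOM with dot-product matching and a Gaussian neighborhood can be sketched as below. This is an illustrative pure-Python version, not the somtoolbox implementation. Renormalizing the prototypes to unit length after each update is an assumption of the sketch, and the eigenvector-based initialization is not shown.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit_length(v):
    norm = math.sqrt(dot(v, v))
    return list(v) if norm == 0 else [x / norm for x in v]


def batch_epoch(data, prototypes, coords, radius):
    """One batch SOM epoch over unit-length data and prototype vectors.

    coords holds the grid position of each prototype; radius is the width
    of the Gaussian neighborhood on the grid.
    """
    # 1. Best-matching unit (BMU) of each document: largest dot product.
    bmus = [
        max(range(len(prototypes)), key=lambda i: dot(x, prototypes[i]))
        for x in data
    ]
    dim = len(data[0])
    updated = []
    for pos in coords:
        weighted_sum = [0.0] * dim
        for x, b in zip(data, bmus):
            # 2. Gaussian weight from the grid distance to the document's BMU.
            d2 = sum((p - q) ** 2 for p, q in zip(pos, coords[b]))
            h = math.exp(-d2 / (2.0 * radius ** 2))
            for k in range(dim):
                weighted_sum[k] += h * x[k]
        # 3. Renormalize (assumption). Dividing by the sum of the weights
        #    first would give the same direction, so that step is skipped.
        updated.append(unit_length(weighted_sum))
    return updated


data = [unit_length(v) for v in ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9])]
prototypes = [[1.0, 0.0], [0.0, 1.0]]
new_prototypes = batch_epoch(data, prototypes, [(0.0, 0.0), (1.0, 0.0)], radius=0.5)
```

Unlike the sequential algorithm, the batch form processes the whole collection before changing any prototype, so the result does not depend on the order of the documents.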

Methodology

• Document map construction
  – Parameters
    • Number of epochs
      – Rough phase: 10 epochs
      – Fine-tuning phase: 10 epochs
    • Neighborhood size
      – Rough phase
        » Initial: [(number of units along the largest dimension)/2] + 1
        » Final: 2
      – Fine-tuning phase
        » Initial: 2
        » Final: 0.8
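
The neighborhood schedule implied by these parameters can be worked out with a short sketch. Reading the square brackets as the integer part, and assuming the final value is reached exactly at the last epoch, are both assumptions. The function names are hypothetical.

```python
def initial_rough_radius(rows, cols):
    """[(largest map dimension) / 2] + 1, brackets read as integer part."""
    return max(rows, cols) // 2 + 1


def linear_schedule(start, end, epochs):
    """Neighborhood size per epoch, decreasing linearly from start to end."""
    if epochs == 1:
        return [float(start)]
    step = (end - start) / (epochs - 1)
    return [start + step * t for t in range(epochs)]


# 10 x 12 map: the largest dimension has 12 units, so the rough phase
# starts at 12 / 2 + 1 = 7 and shrinks to 2 over 10 epochs.
rough = linear_schedule(initial_rough_radius(10, 12), 2, 10)
# The fine-tuning phase continues from 2 down to 0.8 over 10 more epochs.
fine = linear_schedule(2, 0.8, 10)
```

The rough phase orders the map globally with a wide neighborhood, while the fine-tuning phase ends with a width below 1, so each unit is then influenced mostly by its own documents.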

Methodology

• User interface construction
  – Documents are mapped to the node with the closest model vector in terms of cosine distance
  – Each map node is labeled according to category
    • Knowledge areas (CHLA, CBS, TCEN)
    • Graduate programs
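
Mapping and labeling can be sketched as follows. The slides say only that nodes are labeled according to category, so the majority vote used here is an assumption, and the vectors and category assignments are made-up toy data.

```python
import math
from collections import Counter, defaultdict


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 0.0 if norm_a == 0 or norm_b == 0 else dot / (norm_a * norm_b)


def closest_node(vector, prototypes):
    """Index of the prototype with the highest cosine similarity,
    which is the one at the lowest cosine distance."""
    return max(
        range(len(prototypes)),
        key=lambda i: cosine_similarity(vector, prototypes[i]),
    )


def label_nodes(vectors, categories, prototypes):
    """Label each node with the most frequent category of its documents.

    Majority voting is an assumption of this sketch. Nodes that receive
    no documents get no label.
    """
    votes = defaultdict(Counter)
    for vector, category in zip(vectors, categories):
        votes[closest_node(vector, prototypes)][category] += 1
    return {node: counts.most_common(1)[0][0] for node, counts in votes.items()}


prototypes = [[1.0, 0.0], [0.0, 1.0]]
vectors = [[2.0, 0.1], [3.0, 0.2], [1.0, 0.5], [0.1, 4.0]]
categories = ["CBS", "CBS", "TCEN", "CHLA"]
labels = label_nodes(vectors, categories, prototypes)
```

A new document or query can then be classified by looking up the label of its closest node.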

Results

Categories   Accuracy   F1 micro   F1 macro   Topographic error
3            0.96       0.96       0.96       0.01
61           0.66       0.66       0.44       0.01
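
The relationship between the three quality measures in the table can be checked with a small sketch. The labels below are toy data, not the authors' results. In single-label classification every error counts once as a false positive and once as a false negative, so micro-averaged F1 equals accuracy, while macro-averaged F1 weighs every category equally and drops when small categories are classified poorly.

```python
from collections import Counter


def f1_scores(true_labels, predicted_labels):
    """Return (accuracy, micro-F1, macro-F1) for single-label classification."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(true_labels, predicted_labels):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1
            fn[truth] += 1

    def f1(t, p, n):
        return 0.0 if t == 0 else 2 * t / (2 * t + p + n)

    classes = set(true_labels) | set(predicted_labels)
    accuracy = sum(tp.values()) / len(true_labels)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return accuracy, micro, macro


# Toy example: a large category "A" and a small category "B".
true_labels = ["A"] * 8 + ["B"] * 2
predicted = ["A"] * 8 + ["A", "B"]
accuracy, micro, macro = f1_scores(true_labels, predicted)
```

In this toy case accuracy and micro-F1 are both 0.9, while macro-F1 is lower because half of the small category is misclassified.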

Results

[Figures: Knowledge Areas; Graduate Programs]

Acknowledgement

Questions?

Bruno Pinheiro, Renato Correa