Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Mining massive document collections by the WEBSOM method
Presenter: Yu-hui Huang
Authors: Krista Lagus, Samuel Kaski*, Teuvo Kohonen
Information Sciences 2004
National Yunlin University of Science and Technology
Outline
Motivation
Objective
Methodology
Experimental
Conclusion
Personal Comments
Motivation
It would be of great help for browsing an encyclopaedia or a digital library, if the items could be preordered according to their contents.
The main problem with multidimensional scaling (MDS) methods is that all items must be known before the mapping is computed. The computation is also heavy, and can become infeasible, for any sizable collection of items.
Objective
Searching can start from the map units that best match the search expression; further relevant results can then be found through the pointers stored at the same or neighboring map units.
Methodology
The Batch Map version of the SOM:
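As an illustration only, one Batch Map iteration can be sketched as follows: every input is assigned to its best-matching unit, and each model vector is then replaced by a neighborhood-weighted mean of the inputs. The Gaussian neighborhood, the map coordinates in `grid`, and the names `batch_map_step` and `sigma` are assumptions made for this sketch, not details taken from the slides.

```python
import numpy as np

def batch_map_step(X, M, grid, sigma):
    """One Batch Map iteration (illustrative sketch).

    X: (n, d) inputs; M: (k, d) model vectors;
    grid: (k, 2) map coordinates of the units; sigma: neighborhood width.
    """
    # Squared distances between every input and every model vector.
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    winners = d2.argmin(axis=1)  # best-matching unit of each input

    # Gaussian neighborhood between all pairs of map units (assumed form).
    g2 = ((grid[:, None, :] - grid[None, :, :]) ** 2).sum(axis=2)
    H = np.exp(-g2 / (2.0 * sigma ** 2))

    # Input j contributes to unit i with weight H[i, winner of j].
    W = H[:, winners]  # shape (k, n)
    return (W @ X) / W.sum(axis=1, keepdims=True)
```

With a narrow neighborhood each unit moves to the mean of the inputs it won; with a very wide one all units move toward the global mean.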
Methodology-Encoding
Vector space method
Methods for dimensionality reduction
  Latent semantic indexing
  Random projection
  Word clustering
docs  t1  t2  t3
D1    1   0   1
D2    1   0   0
D3    0   1   1
D4    1   0   0
D5    1   1   1
D6    1   1   0
D7    0   1   0
D8    0   1   0
D9    0   0   1
D10   0   1   1
D11   1   0   1
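The document-by-term table above can be read directly as vector space encodings. The sketch below, under stated assumptions, compares documents by cosine similarity and applies a simple random projection. The Gaussian random matrix, the unit-length normalization of each term's random vector, and the names `cosine` and `random_projection` are illustrative choices, not details from the slides.

```python
import numpy as np

# Binary document-by-term matrix from the table (rows D1..D11, columns t1..t3).
docs = np.array([
    [1, 0, 1], [1, 0, 0], [0, 1, 1], [1, 0, 0], [1, 1, 1], [1, 1, 0],
    [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 1, 1], [1, 0, 1],
], dtype=float)

def cosine(a, b):
    """Cosine similarity between two document vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def random_projection(X, out_dim, seed=0):
    """Map each row of X to out_dim dimensions with a random matrix."""
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(X.shape[1], out_dim))
    # Assumption: each term's random vector is scaled to unit length.
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    return X @ R

projected = random_projection(docs, out_dim=2)
```

Here `cosine(docs[0], docs[10])` compares D1 with D11, which share the same terms, while D2 and D7 share none. With only three terms the projection saves nothing; it becomes useful when the vocabulary has many thousands of dimensions.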
Methodology-Encoding
Weighting of words
IDF-based weights
Entropy over topical document classes
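Both weighting schemes can be sketched in a few lines. The IDF form log(N / df) is the standard one. The entropy weight shown here, maximum entropy minus the entropy of the word's distribution over classes, is one common formulation and is an assumption of this sketch; the slides do not give the exact formula. The function names and the counts in the usage note are hypothetical.

```python
import math

def idf_weights(doc_freq, n_docs):
    """IDF weight log(N / df) for each term.

    doc_freq maps a term to the number of documents that contain it.
    """
    return {t: math.log(n_docs / df) for t, df in doc_freq.items()}

def entropy_weight(class_counts, class_sizes):
    """Weight a word by how unevenly it is spread over topical classes.

    class_counts[g]: occurrences of the word in class g.
    class_sizes[g]:  total number of words in class g.
    Returns log(number of classes) minus the entropy of the word's
    size-normalized distribution over classes (assumed formulation).
    """
    rel = [c / s for c, s in zip(class_counts, class_sizes)]
    total = sum(rel)
    if total == 0:
        return 0.0  # the word never occurs, so it carries no weight
    probs = [r / total for r in rel if r > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.log(len(class_counts)) - entropy
```

A word spread evenly over all classes, such as `entropy_weight([5, 5], [10, 10])`, gets a weight near zero, while a word confined to a single class gets the maximum weight.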
Methodology-Fast
Rapid initialization by increasing the map size
Faster computation of the final state of the SOM
  Addressing old winners
  Initial best matching units
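The idea of addressing old winners can be sketched as a restricted search: instead of comparing an input against every map unit, only the units around its previous winner are examined. The square search window, the row-major unit indexing, and the name `local_winner_search` are assumptions made for this example.

```python
import numpy as np

def local_winner_search(x, M, grid_shape, old_winner, radius=1):
    """Find the best-matching unit near the previous winner.

    x: (d,) input; M: (rows * cols, d) model vectors in row-major order;
    old_winner: index of the unit that won for x on the previous round.
    """
    rows, cols = grid_shape
    r0, c0 = divmod(old_winner, cols)
    best, best_d = old_winner, np.inf
    # Scan only a (2 * radius + 1)-wide window, clipped to the map borders.
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            i = r * cols + c
            d = float(np.sum((x - M[i]) ** 2))
            if d < best_d:
                best, best_d = i, d
    return best
```

The cost per input then depends on the window size rather than on the number of map units, which matters most for very large maps. The result is only the best unit within the window, so the search relies on the map being roughly ordered already.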
Methodology-Fast
Additional computational shortcuts
  Parallelized Batch Map algorithm
  Saving memory by reducing representation accuracy
  Utilizing the sparsity of the vectors
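One way to exploit sparse document vectors, sketched here under stated assumptions, is to expand the squared distance as ||x||^2 - 2 x·m + ||m||^2 so that the dot product touches only the nonzero entries of x and the model norm is computed once per unit. The index/value representation and the name `sparse_sq_dist` are illustrative.

```python
import numpy as np

def sparse_sq_dist(x_idx, x_val, m, m_sq_norm):
    """Squared Euclidean distance between a sparse input and a dense model.

    x_idx: indices of the nonzero components of x;
    x_val: the corresponding values;
    m: dense model vector; m_sq_norm: precomputed ||m||^2.
    """
    dot = float(np.dot(x_val, m[x_idx]))  # only nonzero terms contribute
    return float(np.dot(x_val, x_val)) - 2.0 * dot + m_sq_norm

m = np.array([1.0, 0.0, 2.0, 0.0, 3.0])
m_sq_norm = float(m @ m)                 # computed once, reused for every input
x_idx = np.array([2, 4])                 # x = [0, 0, 1, 0, 2] in dense form
x_val = np.array([1.0, 2.0])
distance = sparse_sq_dist(x_idx, x_val, m, m_sq_norm)
```

The work per comparison grows with the number of nonzero terms in the document rather than with the full vocabulary size.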
Methodology-Fast
Performance evaluation of the new methods
Numerical comparison with the traditional SOM algorithm
Comparison of the computational complexity
Experimental
Largest experiment: nearly 7 million patent abstracts
Experimental
Experiment on the Britannica collection
  Preprocessing and document encoding
  Construction of the map
  Obtaining descriptive labels for text clusters and map regions
  Exploration of the map
Experimental
Exploration of the map
Conclusion
The WEBSOM method has been shown to be robust for organizing large and varied collections onto meaningfully ordered document maps.
The developed computational speedups enable the creation of very large maps.
Personal Comments
Advantage …
Drawback …
Application: search engines, and retrieval over large document collections such as an encyclopaedia or a digital library.