On the Use of Self-Organizing Maps to Classify Journal Articles

Jason Fong
University of California, Los Angeles
Computer Science
[email protected]

Abstract
The feasibility of using self-organizing maps to classify a collection of science journal articles is considered. In particular, the effect of using only short article abstracts is examined. In addition, the use of multiple self-organizing maps over successive time periods to gain insight into the evolution of a field of science is considered.

1 Introduction
It is often useful to be able to create a visualization of the relationships between different areas of a particular field of research. This can allow one to spot trends and find areas that might be interesting to investigate further. Such a visualization can also be useful in examining the evolution of a field of research and how it changes over time. One possible method of constructing this visualization is to create a two-dimensional map of the articles in which related articles are located spatially close to each other. A common unsupervised learning approach for constructing such a map is to use self-organizing maps. Many prior works [4],[5],[6],[7] show that self-organizing maps are effective in classifying a collection of text documents and building two-dimensional maps.

The document collection considered here is a collection of 22,732 articles from the Virtual Journal of Nanoscale Science & Technology (http://www.vjnano.org). These are articles from various scientific journals with topics relating to nanotechnology. For reasons that will be discussed in Section 3, only the article abstracts are used to build the self-organizing maps. While previous studies have used full document texts to build self-organizing maps, this study explores the possibility of using only the much shorter article abstracts to build self-organizing maps on a collection of science journal articles.

2 Prior Work
In [4], Kohonen created a self-organizing map (SOM) using over a million documents from 80 different Usenet newsgroups. These documents contain newsgroup postings covering a variety of very different topics. The resulting SOM was successful in grouping together newsgroup postings with similar topics. In [5], Kohonen et al. used similar methods to create a SOM using patent documents. In [7], Merkl and Rauber created a SOM using the 1990 edition of the CIA Factbook. This is a collection of 245 text documents with information on the various countries and regions in the world. The full texts of these documents were used to construct a SOM. The resulting map placed many geographically close countries close together. There were also groupings of items related for other reasons, such as the communist countries (as of 1990) or the various oceans of the world. In [6], Merkl created a self-organizing map using the manual pages from the NIH Class Library. The NIH Class Library is a library for the C++ programming language for storing and retrieving arbitrarily complex data structures. The resulting SOM successfully grouped together related functions. This demonstrates that maps can be successfully created for documents of a very technical nature.

These prior studies have shown that self-organizing maps can be very successful for classifying text documents in a number of different fields. All of them created SOMs based on sizeable bodies of text. This study differs in that it focuses on nanotechnology journal articles and attempts to construct SOMs using only the smaller amount of text available in the article abstracts.

3 Source Documents
The documents used to create the self-organizing maps in this study are articles from the Virtual Journal of Nanoscale Science & Technology (VJN). The collection consists of 22,732 articles drawn from 52 different journals and published from January 2000 through September 2005. The articles are taken from various science journals that publish work concerning nanotechnology, and they cover the following subject areas:

- Advances in Fabrication and Processing
- Structural Properties
- Electronic Structure and Transport
- Nanomagnetism and Spintronics
- Imaging Science and Technology
- Optical Properties and Quantum Optics
- Micro and Nano Electromechanical Systems (MEMS / NEMS)
- Carbon Nanotubes, C60, and Related Studies
- Quantum Coherence, Computing, and Information Storage
- Supramolecular and Biochemical Assembly
- Organic-Inorganic Hybrid Nanostructures
- Surface and Interface Properties
- Chemical Synthesis Methods

Only the abstract text of each article is used in training the self-organizing map. This is done for two reasons. The first reason is that the abstract text is much shorter than the full article text. For a large collection of documents, this saves disk storage and reduces the computation time needed to select terms and construct the document representations (more on this in Section 4). The collection of documents used in this study is relatively small and would not run into excessive difficulties if the full text of the documents were used. However, if the process used in this study is to be extended to larger document collections, it would be useful to know if building a SOM using only article abstracts can still produce useful results. The second reason to only use abstracts is that they can be more easily obtained than full article texts. In the case of the VJN articles, the full article text is available on the VJN website. However, for a general application to articles outside of the VJN, full article texts are not always easily available without charge.

4 Document Representation
The SOMs considered in this study operate on neurons with weights expressed as floating point numbers. Thus, a numeric representation of each document is needed in order to construct the self-organizing map. A common approach is to select a set of terms that are relevant to the collection of documents and to represent each document as a vector of numeric values that correspond to the importance of each term in the document. For the construction of the self-organizing map, the set of terms was chosen by selecting the 200 terms that appear in the largest number of different documents. A term is considered to be a group of characters separated by whitespace, excluding the following:

- a single letter
- a single letter followed by numbers (probably a variable in an equation)
- mathematical formulas
- numbers
- common words such as the, it, and, is, etc.
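The exclusion rules and the document-frequency ranking can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop-word list, the punctuation trimming, and the regular expressions are hypothetical stand-ins, since the study does not publish its exact rules, and stemming is left out for brevity.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the study does not publish the one it used.
STOP_WORDS = {"the", "it", "and", "is", "a", "an", "of", "in", "to", "for"}

def normalize(token):
    """Lower-case a token and trim surrounding punctuation (an assumed cleanup step)."""
    return token.strip(".,;:!?()[]{}\"'").lower()

def is_term(token):
    """Apply the exclusion rules to one normalized token."""
    if len(token) <= 1:                               # empty string or a single letter
        return False
    if re.fullmatch(r"[a-z]\d+", token):              # single letter followed by numbers
        return False
    if not re.fullmatch(r"[a-z]+(-[a-z]+)*", token):  # numbers, formulas, stray symbols
        return False
    return token not in STOP_WORDS                    # common words

def extract_terms(text):
    """Split on whitespace and keep only the tokens that qualify as terms."""
    return [tok for tok in map(normalize, text.split()) if is_term(tok)]

def top_terms_by_document_frequency(abstracts, n_terms=200):
    """Return the n_terms terms that occur in the largest number of abstracts."""
    doc_freq = Counter()
    for text in abstracts:
        doc_freq.update(set(extract_terms(text)))     # count each term once per document
    return [term for term, _ in doc_freq.most_common(n_terms)]
```

Counting each term once per abstract (via `set`) is what makes the ranking a document frequency rather than a raw occurrence count.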

Before selecting the 200 terms, the remaining terms were stemmed in order to combine multiple forms of a term into a single term. For example, the words walk, walks, and walking would all be stemmed to the same term walk. The stemming was performed using the snowball stemmer included in Tsearch2 [2], a full-text search extension for the PostgreSQL database management system. This stemmer uses the stemming algorithm by Porter [1]. It was chosen because the VJN data was already stored in a PostgreSQL database, which made the Tsearch2 extension the most convenient method for stemming the terms.

The number of terms was set to 200 because terms ranked below the top 200 appear in fewer than 5% of the documents. This cutoff is somewhat arbitrary, but it still resulted in groupings that appeared reasonable.

With the 200 terms selected, the importance of each term in a document was determined using the term frequency-inverse document frequency (tf-idf) method [9]. This is computed as follows:

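$$\mathrm{tf}_i = \frac{n_i}{\sum_k n_k}$$

$$\mathrm{idf}_i = \log \frac{|D|}{|d_i|}$$

$$\mathrm{tfidf}_i = \mathrm{tf}_i \cdot \mathrm{idf}_i$$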

The term tf is the term frequency and measures how important a term is to a particular document by measuring how often the term appears in the document relative to the total number of terms in the document. The tf value increases for terms that make up a greater fraction of the total terms in a document. $n_i$ is the number of times the term i appears in the document. $\sum_k n_k$ is the total number of terms in the document.

The term idf is the inverse document frequency and measures how important a term is to identifying a specific document within the entire document collection. The idf value is greater for terms that appear in fewer documents, since such a term will match a smaller subset of the collection. $|D|$ is the total number of documents in the collection. $|d_i|$ is the number of documents in which the term i appears.

Each input document is represented by a vector of the tf-idf values for each of the 200 terms. Each element of this vector represents how important a term is to a document. This vector can be imagined as representing a point in a 200-dimensional space. Points that are spatially close to each other are likely to represent documents with similar content since the tf-idf values for their terms are similar.
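The tf-idf definitions above translate directly into code. The following is a minimal sketch rather than the implementation used in the study: it assumes each document has already been reduced to a list of stemmed terms, and it assumes the natural logarithm, since the base of the logarithm is not stated.

```python
import math
from collections import Counter

def tfidf_vectors(documents, terms):
    """Build one tf-idf vector per document.

    documents: list of term lists (one list of tokens per document)
    terms:     the selected vocabulary (200 terms in the study)
    """
    n_docs = len(documents)                       # |D|
    doc_sets = [set(doc) for doc in documents]
    # |d_i|: number of documents in which term i appears
    doc_freq = {t: sum(1 for s in doc_sets if t in s) for t in terms}

    vectors = []
    for doc in documents:
        counts = Counter(doc)                     # n_i for every term in this document
        total = len(doc)                          # sum over k of n_k
        vector = []
        for t in terms:
            if total == 0 or doc_freq[t] == 0:
                vector.append(0.0)                # term or document carries no information
                continue
            tf = counts[t] / total
            idf = math.log(n_docs / doc_freq[t])  # natural log is an assumption
            vector.append(tf * idf)
        vectors.append(vector)
    return vectors
```

Note that a term appearing in every document receives an idf of zero, so it contributes nothing to any document vector.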


5 Self-Organizing Maps
Self-organizing maps have become a common tool for applying unsupervised learning to the classification of text documents. The SOMs used in this study consist of a square array of neurons. Each neuron i in the array has a weight vector assigned to it. This weight vector contains 200 elements, with one element corresponding to each of the 200 terms used to represent the input documents. The activation of each of these neurons is calculated as the Euclidean distance between the input vector and the weight vector of the neuron. After these values are calculated for all of the neurons with respect to a particular input document, a single neuron is selected as the winner. The winner neuron c is the neuron that is most similar to the input vector. This is determined by finding the neuron that has the minimum Euclidean distance from the input vector of the input document:

$$c : \lVert x(t) - m_c(t) \rVert = \min_i \lVert x(t) - m_i(t) \rVert$$

The winner neuron c in this case is the neuron with the lowest activation as calculated above. $x(t)$ is the tf-idf vector of the input document, $m_i(t)$ is the weight vector of neuron i, and $m_c(t)$ is the weight vector of the winner neuron.

After the winner neuron is selected, its weight vector is adjusted to move it a bit closer to the input vector. This makes the neuron more similar to the input document and more likely to be selected as the winner the next time the same document is presented to the SOM. In addition, a number of neurons around the winner are also adjusted so that similar documents will tend to be drawn toward locations near the winner neuron. The number of neighboring neurons affected in this way decreases with time and is controlled by the following neighborhood function:

$$h_{ci}(t) = \exp\left( -\frac{\lVert r_c - r_i \rVert^2}{2\sigma^2(t)} \right)$$

This neighborhood function is a Gaussian that scales down the weight adjustments with distance from the winner neuron. The reduction is greater for neurons at a greater distance from the winner neuron. $r_i$ is the two-dimensional vector for the location of a neuron i in the two-dimensional SOM. $r_c$ is the two-dimensional vector for the location of the winner neuron. $\lVert r_c - r_i \rVert$ is the distance between neuron i and the winner neuron. The usual process is to initially set the size of this adaptation neighborhood to a wide area and then to gradually reduce this area until only the winner neuron is adapted. This leads to an initial formation of large clusters followed by finer adjustments toward the end of the training iterations. This reduction in the size of the adaptation area is determined by the time-varying parameter $\sigma(t)$.
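To make the shrinking neighborhood concrete, the sketch below evaluates the Gaussian for a pair of grid positions and pairs it with one possible schedule for the width parameter. The exponential decay and the default values in `sigma_at` are hypothetical; the study does not state which schedule was used.

```python
import math

def neighborhood(r_c, r_i, sigma):
    """Gaussian neighborhood value for neuron position r_i given winner position r_c."""
    dist_sq = (r_c[0] - r_i[0]) ** 2 + (r_c[1] - r_i[1]) ** 2
    return math.exp(-dist_sq / (2.0 * sigma ** 2))

def sigma_at(t, sigma_start=4.0, sigma_end=0.5, n_steps=1000):
    """Hypothetical schedule: decay the width exponentially from sigma_start to sigma_end."""
    fraction = min(t, n_steps) / n_steps
    return sigma_start * (sigma_end / sigma_start) ** fraction
```

With a wide neighborhood a neuron two cells from the winner still receives a large share of the adjustment; once the width has shrunk, the same neuron is left almost untouched.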

As each document is presented to the SOM, the affected neurons calculate their new weights according to the following formula:

$$m_i(t+1) = m_i(t) + \alpha(t)\, h_{ci}(t)\, [x(t) - m_i(t)]$$

$m_i(t)$ is the current value of the weight vector of neuron i. $m_i(t+1)$ is the new value of the weight vector of neuron i. $[x(t) - m_i(t)]$ is the difference between the input vector for the input document and the neuron's weight vector. $\alpha(t)$ is the time-varying learning rate. This learning rate decreases with time so that finer adjustments are made to the neuron weights as the training process progresses. $h_{ci}(t)$ is the previously discussed neighborhood function that controls the size of the area affected by the weight adjustment.

The result of an adjustment to a neuron's weight vector is to move the neuron a bit closer to the position of the input document vector. This makes the winner neuron more similar to the input document so that the next time the same document or similar documents are presented, the neuron will be more likely to win. The neighboring neurons are also made a bit more similar (but less so than the winner neuron) so that they are more likely to recognize the document or similar documents. The end result is that neurons recognizing similar documents are located in spatially close positions.
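A single presentation of a document, covering winner selection and the weight update, might look like the following sketch. It is not the Joone implementation used in the study; the dictionary-of-lists layout and the function name `train_step` are assumptions made for illustration, and the caller is expected to supply the current learning rate and neighborhood width.

```python
import math

def train_step(x, weights, alpha, sigma):
    """Present one input vector and update every neuron's weights in place.

    x:       input vector (tf-idf values of one document)
    weights: dict mapping grid position (row, col) -> weight vector (list of floats)
    alpha:   current learning rate
    sigma:   current neighborhood width
    Returns the grid position of the winner neuron.
    """
    def distance(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    # Winner: the neuron whose weight vector is closest to the input.
    winner = min(weights, key=lambda pos: distance(x, weights[pos]))

    for pos, m in weights.items():
        dist_sq = (pos[0] - winner[0]) ** 2 + (pos[1] - winner[1]) ** 2
        h = math.exp(-dist_sq / (2.0 * sigma ** 2))   # Gaussian neighborhood value
        # m_i(t+1) = m_i(t) + alpha * h * (x - m_i(t))
        weights[pos] = [m_j + alpha * h * (x_j - m_j) for m_j, x_j in zip(m, x)]
    return winner
```

Because the neighborhood value is exactly 1 at the winner and falls off with grid distance, the winner always receives the largest adjustment.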

6 Constructing a Self-Organizing Map
The self-organizing map was constructed using the Java Object Oriented Neural Engine [3]. This is a toolkit for building models of neural networks using the Java programming language. The network was modeled as follows:

[Figure: network model, consisting of an Input Neuron Layer connected through a Kohonen Synapse to a Winner-Take-All Neuron Layer]

The first layer of neurons is an input layer consisting of 200 neurons. Each of these neurons corresponds to one of the 200 document description terms. This input layer receives values from a file containing the tf-idf values for each term in each document.

The second (and last) layer is the winner-take-all (WTA) layer consisting of a rectangular array of neurons. The arrangement of these neurons corresponds to points in the resulting two-dimensional map. The response of this layer is that the neuron with the lowest input value is selected as the winner and outputs a value of 1. All other neurons output a value of 0.

The input layer and the winner-take-all layer are connected by a Kohonen synapse. This synapse handles the algorithm for a self-organizing map by adjusting the weight vector assignments for each of the neurons in the WTA layer. When a document is presented to the SOM, the input layer presents the Kohonen synapse with an input vector (the tf-idf values for each of the 200 terms). The Kohonen synapse then calculates the Euclidean distance between the input vector and each of the current weight vectors of the neurons in the WTA layer. These Euclidean distances are sent to the WTA layer, which selects the winner and informs the Kohonen synapse. The Kohonen synapse then adjusts the weights of the winner neuron and some number of neighboring neurons according to the SOM algorithm described in Section 5.

In this study the problem space can be visualized as a 200-dimensional space. The tf-idf values of the input documents and the weight vector values of the WTA array neurons can be visualized as occupying a point in this 200-dimensional space that corresponds with their vector value. The weight adjustments of the WTA array neurons can be visualized as a process that moves a WTA array neuron closer to the position of the input document.

A single iteration of the SOM training is complete when each document has been presented to the SOM and the weights of the neurons in the WTA layer have been adjusted accordingly. The training iterations are repeated until the weights of the WTA neurons change by a negligible amount.

A number of different SOMs were created using this process. In order to observe a map of the entire document collection, two maps were trained with the entire document collection. One used an 8x8 array of neurons in the WTA layer, and the other used a 5x5 array. In order to explore the possibility of using SOMs to observe the change of a research field over time, additional maps were created with subsets of the document collection. Each subset contained the articles from one half of a year. Twelve more maps were created in this way, one for each half-year from 2000 to 2005.
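The half-year subsets can be produced with a simple grouping step. The sketch below assumes each article is available as a (publication date, abstract) pair and that the first half of a year runs from January through June; both the data layout and the function names are hypothetical.

```python
from collections import defaultdict
from datetime import date

def half_year_key(published):
    """Map a publication date to (year, half), where half is 1 for Jan-Jun and 2 for Jul-Dec."""
    return (published.year, 1 if published.month <= 6 else 2)

def split_by_half_year(articles):
    """Group abstracts by half-year. articles is an iterable of (date, abstract) pairs."""
    buckets = defaultdict(list)
    for published, abstract in articles:
        buckets[half_year_key(published)].append(abstract)
    return dict(buckets)
```

Each resulting bucket would then be turned into tf-idf vectors and used to train its own map.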


7 Evaluation Methodology
Even though the document collection is relatively small at 22,732 documents, it is still too large to check every document classification manually. However, a less stringent verification can be performed using ordinary human judgment and some familiarity with topics in the nanotechnology field. The top terms in each SOM group can be examined to verify that they are reasonable for the nanotechnology field. A small random sample of articles in each group can also be checked to confirm that the article titles and abstracts fit the assigned group. The full article texts are also available, so those can be examined to further confirm that an article has been properly classified.

After the self-organizing maps were trained, they were used to create tables with key terms for each group. The key terms were chosen by selecting the terms that occur most often in a group, and also the terms that occur in the most documents in a group. The resulting tables are somewhat cluttered and difficult to understand. In order to fit the tables on a page and make them easier to understand, the tables shown in the results in Section 8 are abbreviated versions of these tables. Most of the terms appear in both the "appears most often" list and the "appears in the most documents" list, so the two lists can usually be combined. The abbreviated tables are created by first selecting the top 5 terms from the "appears most often" list. If these do not include the top two terms from the "appears in the most documents" list, then those two terms replace the 4th and 5th terms from the "appears most often" list. This process attempts to strike some balance between terms that appear very often in a few documents and terms that occur in many documents.

In the resulting maps, some cells contain many documents while other cells contain only a few. The cells that contain many documents are likely to be actual classification groups, while the cells that contain only a few documents are likely the result of documents that were not strongly associated with any of the larger groups. In order to more easily distinguish the larger groups, the table cells with larger groups of articles are in bold. Also included is a count of the number of occurrences of each term in a group.
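The rule for abbreviating a cell's term list can be written out as a short function. This is one reading of the rule, not the study's code: the text only describes the case where both top document-frequency terms are missing, so the handling of a single missing term (it takes the last free slot) is an assumption, as is the function name.

```python
def abbreviate_terms(most_often, most_documents, size=5, must_include=2):
    """Combine the two ranked term lists of one map cell into a short list.

    most_often:     terms ranked by total occurrences in the group
    most_documents: terms ranked by number of documents in the group containing them
    """
    selected = list(most_often[:size])
    top_docs = list(most_documents[:must_include])
    missing = [t for t in top_docs if t not in selected]

    # Slots that may be overwritten, last slot first; a slot already holding
    # one of the top document-frequency terms is never overwritten.
    replaceable = [i for i in range(len(selected) - 1, -1, -1)
                   if selected[i] not in top_docs]

    # Fill from the end so that two missing terms land in the 4th and 5th slots in rank order.
    for term, slot in zip(reversed(missing), replaceable):
        selected[slot] = term
    return selected
```

When both lists agree on their leading terms, which the text says is the usual case, the function simply returns the top five most frequent terms.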

8 Results
Even though using only the article abstracts gives a relatively small amount of text to describe each document, the resulting self-organizing maps appear to be effective at classifying the documents. This is not an exhaustive analysis, since there are too many documents to completely verify each of them. There are also some documents that seem to be out of place in some classifications, but this is to be expected since self-organizing maps are not perfect in their classifications.

The following is an 8x8 self-organizing map of the entire document collection:

nanotub(1801) carbon(1476) wall(920) electron(762) singl(707)

imperfect(3) detector(2) copi(2) alon(1) scheme(1)

micro(704) quantum(114) time(113) structur(112) temperatur(108)

fullerit(2) action(1) irradi(1) undergo(1) soften(1)

sampl(1047) measur(275) temperatur(246) field(212) tip(205)

laser(1255) quantum(319) puls(259) optic(235) temperatur(230)

state(1741) entangl(604) quantum(590) local(450) qubit(243)

mode(1450) photon(254) frequenc(248) crystal(199) structur(176)

fulleren(3) obtain(3) character(2) symmetri(2) microscopi(1)

phase(1602) transit(390) temperatur(332) structur(261) quantum(224)

analyz(4) use(3) fit(2) nano(2) langevin(2)

current(1640) voltag(375) electron(340) quantum(279) field(270)

reson(1435) frequenc(393) quantum(263) electron(222) field(221)

photon(10) pair(4) entangl(4) scheme(4) state(2)

polar(1089) spin(482) electron(219) quantum(219) field(193)

drug(4) system(2) deliveri(2) therapeut(2) nanoparticl(1)

dot(4712) quantum(3561) electron(1374) state(1011) energi(820)

copi(7) provid(5) distil(3) entangl(3) suffic(2)

atom(1392) electron(283) structur(272) cluster(227) surfac(218)

free(1) tini(1) float(1) field(1) driven(1)

emiss(1564) field(704) electron(386) current(327) nanotub(290)

electron(2592) quantum(2161) energi(1821) structur(1803) field(1621)

entangl(3) present(2) thermal(1) mirror(1) antiferromagnet(1)

cell(4) mechan(3) mechanotransd(2) receptor(2) vertebr(2)

bound(2) distil(2) entangl(2) protocol(2) bipartit(1)

devic(1104) electron(237) fabric(181) gate(174) base(160)

synthesi(2) jet(1) boron(1) arc(1) scandium(1)

conduct(1705) electron(551) temperatur(455) quantum(383) transport(292)

bind(3) spacer(2) mesh(2) ion(2) network(2)

paint(1) substrat(1) glass(1) inexpens(1) circuitri(1)

oper(461) quantum(290) state(248) entangl(173) qubit(168)

pattern(666) surfac(124) substrat(123) fabric(112) process(102)

photon(2112) crystal(1382) structur(518) optic(462) dimension(415)

layer(1961) thick(440) structur(401) quantum(265) dot(254)

defect(3) perfect(2) combin(2) object(2) problem(2)

charg(1173) electron(345) quantum(277) state(233) energi(191)

correl(606) quantum(236) electron(210) state(180) system(168)

signal(1) flash(1) light(1) ring(1) molecul(1)

nanodevic(1) properti(1) retain(1) gallium(1) macroscop(1)

spin(4681) electron(1152) magnet(851) quantum(836) field(672)

worldwid(1) quantum(1) mark(1) commun(1) world(1)

interfac(767) structur(157) layer(150) electron(141) surfac(121)

beam(624) electron(240) fabric(147) optic(141) structur(139)

thermal(626) conduct(299) temperatur(223) measur(104) effect(95)

xe(5) load(3) indent(3) valu(3) express(3)

si(1839) ge(363) surfac(319) structur(250) layer(239)

particl(1573) size(337) magnet(244) nanoparticl(193) interact(180)

forc(1121) measur(273) tip(246) atom(234) microscop(178)

silicon(870) structur(158) high(144) oxid(132) electron(130)

tunnel(1355) electron(463) quantum(296) junction(288) barrier(264)

molecul(1527) electron(351) singl(314) structur(286) surfac(259)

assembl(790) self(690) self-assembl(590) structur(285) surfac(189)

nanotub(2457) carbon(1008) electron(314) wall(288) singl(277)

band(1038) gap(876) photon(497) structur(395) crystal(255)

tip(4) virus(4) icosahedr(3) rna(3) genom(3)

wave(709) function(170) electron(153) quantum(120) field(100)

growth(1159) surfac(288) temperatur(249) deposit(207) substrat(206)

lattic(646) structur(155) crystal(136) electron(109) dimension(108)

surfac(2192) energi(376) structur(273) atom(268) electron(258)

nm(1422) diamet(392) structur(298) nanowir(285) electron(264)

magnet(3738) field(1759) spin(582) electron(473) effect(471)

film(2473) thin(611) deposit(487) temperatur(459) substrat(358)

vortex(4) vortex-antivortex(3) type(3) antivortex(3) superconductor(2)

scatter(999) electron(362) quantum(246) effect(198) well(148)


The validity of this map was checked by randomly sampling documents from the groups and confirming that the sampled documents are reasonably related to one another and that they match the terms identified for the group. For the most part this held true. The following is a small sample of article titles found within the groups:

This group appears to be about carbon nanotubes:

- Low temperature burnable carbon nanotube paste component for carbon nanotube field
- Screen printed carbon nanotube field emitter array for lighting source application
- Density functional theory calculations of energy-loss carbon near-edge spectra of small diameter armchair and zigzag nanotubes: Core-hole, curvature, and momentum-transfer orientation effects
- Spindt tip composed of carbon nanotubes
- Carbon Nanotube Single-Electron Transistors at Room Temperature
- Structural Determination of Isolated Single-Wall Carbon Nanotubes by Resonant Raman
- Single-Molecule Torsional Pendulum
- Radial-breathing-like phonon modes of double-walled carbon nanotubes
- Theoretical study of the adsorption of H2 on (3,3) carbon nanotubes
- Adhesion between single-walled carbon nanotubes

This group appears to be about quantum dots:

- Quantum dots in magnetic fields: Thermal response of broken-symmetry phases
- On the nature of quantum dash structures
- Temporal variation in photoluminescence from single InGaN quantum dots
- Effect of carrier hopping and relaxing on photoluminescence line shape in self-organized InAs quantum dot heterostructures
- GaAs buffer layer morphology and lateral distributions of InGaAs quantum dots
- Self-assembled quantum-dot molecules by molecular-beam epitaxy
- Growth of high optical quality InAs quantum dots in InAlGaAs/InP double heterostructures
- Incompressible states in double quantum dots
- Growth and magnetic properties of self-assembled (In, Mn)As quantum dots
- Fine structure of trions and excitons in single GaAs quantum dots




A self-organizing map of 5x5 neurons was also created to explore the effect of smaller map sizes. The smaller map gives a coarser-grained classification of the documents, since there are fewer cells available and some merging of classifications will likely occur. However, smaller maps can be built more quickly, since fewer neurons need to be updated at each learning iteration.

The following is a single 5x5 self-organizing map of the entire document collection:

cell(4) mechan(3) reveal(3) receptor(2) organ(2)

electron(5297) quantum(3734) temperatur(3070) effect(3068) structur(2901)

forc(1457) tip(430) measur(423) atom(405) microscop(287)


surfac(3579) layer(1935) structur(1793) growth(1437) substrat(1144)

state(1856) quantum(920) entangl(832) local(790) qubit(455)

free(1) tini(1) float(1) field(1) liquid(1)

film(3074) thin(782) deposit(597) temperatur(579) surfac(526)

bind(3) spacer(2) mesh(2) ion(2) network(2)

laser(1559) quantum(394) optic(364) puls(340) temperatur(291)


spin(5700) electron(1448) magnet(1102) polar(1077) quantum(1009)

particl(1789) size(379) magnet(275) nanoparticl(237) interact(199)

signal(1) flash(1) light(1) ring(1) molecul(1)

si(2029) ge(408) surfac(384) layer(310) structur(288)

worldwid(1) quantum(1) mark(1) commun(1) world(1)

photon(2970) crystal(1913) structur(937) optic(715) dimension(672)

paint(1) substrat(1) glass(1) inexpens(1) circuitri(1)

nanodevic(1) properti(1) retain(1) gallium(1) macroscop(1)

emiss(1877) field(765) electron(501) current(390) quantum(347)

drug(4) system(2) deliveri(2) therapeut(2) nanoparticl(1)

magnet(4979) field(2459) spin(838) electron(773) temperatur(741)

phase(2165) transit(569) temperatur(470) structur(422) system(354)

mode(1824) frequenc(328) photon(300) reson(276) optic(264)

molecul(1876) electron(520) singl(407) structur(395) surfac(343)


The large groups found in the 5x5 map also appear as large groups in the 8x8 map. However, some of the large groups in the 8x8 map do not appear in the 5x5 map. Those missing groups have probably been merged into more dominant groups in the 5x5 map. This is to be expected since the 5x5 map can support fewer groups. An interesting result is that the small groups of documents in the 5x5 map also appear as the same small groups in the 8x8 map. This suggests that those small groups are not just spurious results of the SOM building process. Those small groups could signify a new emerging area of study that does not fit with the existing areas. However, the small groups could also be established areas of study that just happen to have very few articles published.

The time to construct the different sizes of SOMs appears to be roughly proportional to the number of neurons in the WTA layer. This is expected: as each document is presented to the SOM, the Euclidean distance between the input and the weight vector of every neuron in the WTA layer must be calculated, so the time to complete the calculations for the entire WTA layer is roughly proportional to the number of neurons in the layer. This holds true for the observed construction times. The 8x8 SOM took approximately 5 hours and 45 minutes to construct on a 2.4 GHz Intel Pentium 4, and the 5x5 SOM took approximately 2 hours. The ratio of these times is a bit more than the 64:25 ratio of neurons in the 8x8 SOM versus the 5x5 SOM. This is reasonable considering that the 5x5 SOM has a greater proportion of neurons on the edges of the map. Such neurons have fewer neighbors, so the time to adjust the neighbors' weights would be lower.

Even though the 5x5 map is smaller and coarser grained in detail, it still appears to be useful. Many of the large groups, as well as the small groups, appear in both the 8x8 and 5x5 SOMs. As can be seen from the sample maps, the smaller 5x5 map is easier to grasp mentally since there are far fewer groups and terms to consider. However, this ease of quick understanding comes at the cost of a loss of fine detail, as can be seen from the greater variety of groups in the 8x8 map.
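The comparison of construction times against neuron counts can be checked with a short calculation, using the approximate times reported for the two maps (5 hours 45 minutes is 345 minutes, and 2 hours is 120 minutes):

$$\frac{t_{8 \times 8}}{t_{5 \times 5}} \approx \frac{345\ \text{min}}{120\ \text{min}} \approx 2.88, \qquad \frac{N_{8 \times 8}}{N_{5 \times 5}} = \frac{64}{25} = 2.56$$

The observed time ratio is therefore roughly 12% larger than the ratio of neuron counts, which is consistent with construction time being roughly, but not exactly, proportional to the number of neurons.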

With the ease of quick understanding in mind, more 5x5 maps were created for subsets of the entire document collection. The subsets were for halves of each year from 2000 to 2005. A total of twelve 5x5 maps were created for the subsets. (For the sake of brevity, these maps are not included in this paper.) These half-year maps can be used to analyze the change of the nanotechnology field over time.

The resulting maps revealed some changes in the groups over time. Some groups remained roughly the same throughout the different time periods. Examples of these are the group involving quantum dots and the group involving magnetic fields. Some other groups showed changes over the time periods. An interesting example of this is the group on carbon nanotubes. The first time period analyzed is the first half of 2000. In this time period, there is a fairly large number of articles in a group where "carbon" and "nanotube" are the top two terms. The last time period analyzed is the second half of 2005. In this time period, there is no group where "carbon" and "nanotube" are the top terms. Instead, "carbon" and "nanotube" are found together in multiple other groups, but as lower ranked terms.

One possible explanation for this is that the year 2000 was still relatively soon after the discovery of nanotubes, so there were many articles that focused on carbon nanotubes. By 2005, other areas of nanotechnology may have emerged and spurred many articles focused on those areas. The dispersion of "nanotube" and "carbon" into various other cells could suggest that carbon nanotubes were receiving less exclusive attention. Another possible explanation is that carbon nanotubes became so common in nanotechnology research that the importance of the "carbon" and "nanotube" terms decreased in the tf-idf measure. If a term appears in many documents, then the idf part decreases in value. Thus, if "carbon" and "nanotube" appeared in many documents, their tf-idf values would decrease and those terms might be less likely to be dominant in a classification group.

Determining which explanation is correct will take more than a self-organizing map. However, whether or not either is correct is not the focus of this study. What is important is that the use of the self-organizing maps highlighted an aspect of the collection of nanotechnology articles that may warrant closer examination by other means.
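The effect of a term becoming widespread can be illustrated with the idf definition from Section 4. The numbers here are hypothetical and are not taken from the VJN collection; the natural logarithm is assumed. For a collection of $|D| = 1000$ documents, a term that appears in 50 documents and a term that appears in 500 documents have

$$\mathrm{idf} = \ln \frac{1000}{50} = \ln 20 \approx 3.00 \qquad \text{and} \qquad \mathrm{idf} = \ln \frac{1000}{500} = \ln 2 \approx 0.69$$

For the same term frequency within a document, the more widespread term therefore receives a tf-idf value more than four times smaller, which makes it less likely to dominate the terms of a group.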


9 Future Work
In [6], Merkl discusses a method of using hierarchical maps to build large SOMs. This approach has the advantage of being faster than building the entire map as one large map, and the hierarchical structure helps to focus finer grained analysis on documents that are more likely to be related. This approach could be attempted on the VJN article maps in order to further classify the large groups into more specifically defined groups.

The ability to recognize categories in nanotechnology research could possibly be improved by using terms that are known to be important to the field of nanotechnology. Generating a list of such terms is not trivial, however, so this study used a generic approach using all terms in the documents. Such a list of nanotechnology terms could potentially improve the accuracy of the identification of subfields in nanotechnology since the terms used to construct the SOMs would be related to the field. In [8], Lagus and Kaski suggest some methods for term selection and labeling of map regions that may be useful.

10 Conclusions
This study demonstrated that self-organizing maps can be successfully used to classify documents from a collection of journal articles. In addition, this classification can be successful even if only a relatively short abstract text is available for each article. The maps do not need to be very large in order to begin to provide useful results. A small 5x5 map contained many of the details available in a larger 8x8 map. However, the larger maps contain a greater number of distinct classifications than the smaller maps, because classifications are merged together in the smaller maps. In addition, subsets of documents taken over different time periods were shown to be useful in gaining some insight into the evolution of a field of research.


References
[1] M. F. Porter, An Algorithm for Suffix Stripping, Readings in Information Retrieval, San Francisco, Morgan Kaufmann, 1997, pp. 313-316

[2] Tsearch2 - full text extension for PostgreSQL, http://www.sai.msu.su/~megera/postgres/gist/tsearch/V2

[3] Joone - Java Object Oriented Neural Engine, http://www.jooneworld.com

[4] T. Kohonen, Self-organization of very large document collections: State of the art, Proc of the Int'l Conf on Artificial Neural Networks (ICANN'98), Skövde, Sweden, 1998

[5] T. Kohonen, S. Kaski, K. Lagus, J. Salojärvi, J. Honkela, V. Paatero, A. Saarela, Self organization of a massive document collection, IEEE Transactions on Neural Networks, Vol. 11, No. 3, pp. 574-585, 2000

[6] D. Merkl, Exploration of text collections with hierarchical feature maps, Proc Int'l ACM SIGIR Conf on R&D in Information Retrieval (SIGIR'97), Philadelphia, PA, 1997

[7] D. Merkl, A. Rauber, Document classification with unsupervised artificial neural networks, Soft Computing in Information Retrieval: Techniques and Applications, Vol. 50, pp. 102-121, Heidelberg: Physica Verlag, 2000

[8] K. Lagus, S. Kaski, Keyword selection method for characterizing text document maps, Proc of the Int'l Conf on Artificial Neural Networks (ICANN'99), Edinburgh, UK, 1999

[9] G. Salton, C. Buckley, Term-weighting approaches in automatic text retrieval, Information Processing & Management, Vol. 24, Iss. 5, pp. 513-523, 1988