Comparing Words, Stems, and Roots as Index Terms in an Arabic Information Retrieval System

Ibrahim A. Al-Kharashi King Abdulaziz City for Science and Technology, General Directorate for Information Services, P.O. Box 6086, Riyadh 11442, Saudi Arabia

Martha W. Evens Department of Computer Science, Illinois Institute of Technology, 10 West 31st Street, Chicago, IL 60676

The Micro-AIRS System, a microcomputer system for Arabic Information Retrieval, was designed as an experimental system to investigate indexing and retrieval processes for Arabic bibliographic data. A series of experiments were performed using 29 queries against a base of 355 Arabic bibliographic records, covering computer and information science from the bibliographic databank at King Abdulaziz City for Science and Technology. These experiments revealed that using roots and using stems as index terms gives better retrieval results than using words. The root performs as well as or better than the stem at low recall levels and definitely better at high recall levels. Several different binary similarity coefficients were tried: the cosine, Dice, and Jaccard coefficients. All three led to exactly the same document rankings for every query. The experiments were run on an IBM/AT-compatible microcomputer. Micro-AIRS is written in Turbo C, Version 2.0.

Introduction

The Problem

Techniques for storing, maintaining, and retrieving from English bibliographic databases have been studied, implemented, and tested for the last three decades, but we do not know how well these techniques will work on Arabic data. Experimentation with retrieval systems in Arabic language environments has been very limited. Arabization of available information retrieval systems has dealt mostly with internal representation of the Arabic data and translation of menus and system messages to Arabic. The problems of working with the Arabic language have not been confronted directly.

Received May 26, 1992; revised February 9, 1994; accepted February 9, 1994.

© 1994 John Wiley & Sons, Inc.

In principle, there are two approaches to developing an Arabized computer application. The first approach is to develop the application from scratch, bearing in mind the characteristics of the Arabic language. The second approach is based on building an I/O interface to existing application software built for non-Arabic languages. The first approach is costly and time consuming; the second approach is easy to implement at the price of abandoning some Arabic language characteristics. The second approach has been adopted to Arabize two well known retrieval system software packages, STAIRS (Salton & McGill, 1983) and ISIS (UNESCO, 1989). The Arabization effort, however, is limited to the internal representation of the text and the translation of the menus and messages to Arabic (Al-Gasimi, 1987).

The aim of our work is to study the problems and difficulties of applying indexing and retrieval algorithms to Arabic data. In particular, we explore the problems of storing and displaying bilingual bibliographic data, selection of index terms, ranking of Arabic records, and stemming algorithms for Arabic index terms. Special effort will be devoted to the study of the effect of stemming algorithms on the performance of the information retrieval system.

Stemming in information retrieval systems designed for use with English text is usually confined to suffix removal. The motive for the use of stemming is obvious: term stemming can increase the number of retrieved documents, since the stem of a term represents a broader notion than the original term itself. Several stemming algorithms have been used in experimental environments (Lovins, 1968; Porter, 1980; Salton, 1971). Experiments using word stems as indexing terms show different results. While the implementation of suffix removal algorithms in the SMART system (Salton, 1971) shows improvement in retrieval effectiveness, Harman’s (1987) experiments show less improvement and sometimes even degradation. Further research (Harman, 1991) suggests that in an online system, stemming should be applied differentially (to some queries but not others) under user control, depending on results obtained for particular queries.

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 45(8):548-560, 1994 CCC 0002-8231/94/080548-13

The basic goal of our research is to try to find the best way to solve this problem for documents in Arabic. The morphological structure of the Arabic language makes the stemming problem much more complex. We compare three alternative choices for index terms: the word itself, the stem, and the root, with the goal of finding out which of these three alternatives gives the best results. We have also examined alternative choices of a similarity coefficient, comparing the effects of using the familiar cosine measure, and the Dice and Jaccard coefficients.
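As a minimal sketch of the three binary coefficients compared here, the following Python computes them over sets of index terms and ranks a toy collection. The function names, document ids, and sample terms are hypothetical and are not taken from Micro-AIRS. Dice and Jaccard always rank identically, because J = D / (2 - D) is an increasing function of D; agreement with the cosine is not guaranteed in general and depends on the data.

```python
import math

def binary_similarities(query_terms, doc_terms):
    """Return (cosine, Dice, Jaccard) for two sets of index terms."""
    q, d = set(query_terms), set(doc_terms)
    if not q or not d:
        return 0.0, 0.0, 0.0
    common = len(q & d)
    cosine = common / math.sqrt(len(q) * len(d))
    dice = 2 * common / (len(q) + len(d))
    jaccard = common / len(q | d)
    return cosine, dice, jaccard

def rank(query_terms, docs, which):
    """Rank document ids by one coefficient (0 = cosine, 1 = Dice, 2 = Jaccard)."""
    scored = [(binary_similarities(query_terms, terms)[which], doc_id)
              for doc_id, terms in docs.items()]
    # Highest score first; ties are broken by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical toy collection: document id -> index terms.
docs = {
    1: ["computer", "system"],
    2: ["computer", "network", "design"],
    3: ["library"],
}
query = ["computer", "system"]
rankings = [rank(query, docs, which) for which in range(3)]
```

With binary weights every matching term counts equally, so the coefficients differ only in how they normalize the number of shared terms by the sizes of the two term sets.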

Background

King Abdulaziz City for Science and Technology, KACST, was established in Saudi Arabia in 1977 as a research and development institution. KACST is responsible for the formulation of national science and technology policies and for the coordination and promotion of applied scientific research. It sponsors and supports research activities across a broad spectrum of scientific and technological fields.

KACST also provides a wide variety of information support services through the General Directorate of Information Systems, GDIS. Such services include access to national and international databases, maintenance of a specialized library and a national database, and operation of a computer network connecting the computers of major research institutions in the Gulf States.

The national database holds over 70,000 bibliographic records covering a wide range of science and technology. The collection includes master’s and doctoral theses, technical reports, books, articles, measurements and standards, statistics, and proceedings of conferences and scientific seminars. The collection has an online catalogue. This catalogue is divided into two databases: an Arabic database which contains about 23,800 records, and a non-Arabic database. Sample Arabic and English database records are shown in Figures 1 and 2, respectively. Each record in the database is composed of 36 fields. The Arabic records are classified; that is, a subject area for the document is given in the record. However, due to the short supply of Arabic indexers and abstracters, only a few document records contain abstracts or index terms.

Plan of Research

To achieve our goals, we built a microcomputer-based Arabic Information Retrieval System, Micro-AIRS, targeted for the IBM/PC and compatible microcomputers. The system was implemented using the Turbo C compiler, Version 2.0. A few routines, however, were coded in assembly language.

Processing the Arabic Language

Special characteristics of the Arabic language make it difficult to deal with, especially when using a system designed for Roman characters (Tayli & Al-Salamah, 1990). Among these characteristics are the right to left orientation, the fact that vowels may be included or dropped, and the morphological structure.

The Arabic language belongs to the Semitic language group. These languages have a common grammatical system based on a root-and-pattern structure. Most Arabic words are morphologically derived from a short list of productive roots. The root is the bare verb form; it can be triliteral, quadriliteral, or pentaliteral. According to Hegazi and Elsharkawi (1985) there are about 1200 roots.

A stem is a combination of a root and derivational morphemes to which one or more affixes can be added. A triliteral bare verb generates 14 verb forms, whereas a quadriliteral bare verb generates three verb forms.

Arabic words are classified into three main categories: nouns, verbs, and particles. All verbs and many nouns are derived from root verbs. Some of the root letters may be deleted or modified during morphological derivation. Also, a word may change its inflectional form when preceded by certain prefixes or prepositions or followed by certain suffixes. Some nouns, known as “solid nouns,” have no verb origins. Particles can be found in the form of prefixes and/or suffixes attached to verbs or nouns. Some particles can be found in isolated form. Particles include preposition particles, negative particles, answer particles, interrogative particles, conjunction particles, and so forth. Affixes can be added to the beginning, the end, and the middle of a word.

Affixes fall into four categories: particles, pronouns, inflectional morphemes, and derivational morphemes. It is very common to find a verb, subject, and object contained in a single word. Yahya (1989) counted 120 different forms of nouns resulting from adding affixes to the basic naked noun, and 1440 different forms of verbs resulting from adding affixes to the basic naked verb.

INTRNL CNTL NO: 8903003 114
CATEGORY: COMPUTING AND CONTROL ENGINEERING
DOCUMENT TYPE: CONFERENCE PROCEEDING
TITLE: Computer virus prevention and containment on mainframes
AUTHORS: Dowry, Ghannam M. Al-
AFFILIATION: Industrial Security Planning and Support Services Department, ARAMCO, Dhahran, SA
SOURCE TITLE: Proceedings of the 11th National Computer Conference, Dhahran, March 4-7, 1989: Computers and Productivity
VOLUME: I
PAGINATION: 48-60
NO. OF REFER: 73
PUBLICATN DATE: 1989/01/01
PUBLISHER INF.: King Fahd University of Petroleum and Minerals, Dhahran, SA
TEXT LANGUAGE: ENGLISH
ABSTRACT: The nature and anatomy of the computer virus is outlined. Basic prevention, detection and correction techniques for reducing the damages caused by viruses are presented. Vaccines or filters, encryption, access control software, test to production control procedures, personnel selection and review control and physical access control methods are detailed with examples. The paper presents measures to be adopted by the industry to make the computer systems less inviting to attacks from viruses.
DESCRIPTORS: Computer software; Computer viruses; Computer security; Mainframe computers; Data processing
STORAGE MEDIA: PAPER COPY
AVAILABILITY: KACST. Source

FIG. 2. Sample English database record.

For the purposes of our experiment we used a word-stem-root dictionary developed by hand for each index term. Now that the system is being enhanced for actual use at KACST we plan to add automatic morphological analysis. Several morphological analysis algorithms have been suggested and/or implemented. Hegazi and Elsharkawi (1985) describe a computer-aided morphological hierarchy system for a vowelized Arabic text. They based their work on both the morphological rules and the phonetic rules of the language. The main disadvantage of this method is that the phonetic analysis requires a fully vowelized text, which rarely appears in today’s applications. Gheith and Aboul-Ela (1989) present a computer-based syntax analyzer which is based on a morphological analyzer that separates the linguistic model from the processing algorithm. In another study, Gheith and El-Sadany (1987) describe a morphological analyzer that can detect the root and the morphological structure of a given vowelized Arabic word with a triliteral root. Al-Fedaghi and Anzi (1989) present a simple but slow mathematical method to generate the root and the pattern of a given Arabic word. Hilal (1985) gives a more comprehensive theoretical approach while Thalouth and Al-Dannan (1987) give a more practical approach to the analysis of an unvowelized Arabic text. The principal phase in all these algorithms is the isolation of any suffixes and/or prefixes from the word before proceeding to deeper analyses.
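The affix-isolation phase common to these analyzers can be illustrated with a rough sketch. The prefix and suffix lists below are hypothetical, deliberately short, and are not taken from any of the cited algorithms; the sketch also ignores infixes and the letter changes that occur during derivation, so it should be read as an illustration of the idea only.

```python
# Hypothetical affix lists for illustration only; they are not taken from
# any of the cited analyzers and are far from complete.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ات", "ون", "ين", "ان", "ها", "ية", "ة", "ه", "ي"]
MIN_CORE = 3  # keep at least three letters, the length of a triliteral root

def isolate_affixes(word):
    """Split an unvowelized word into (prefix, core, suffix)."""
    prefix = suffix = ""
    for p in PREFIXES:  # longer candidates are listed before shorter ones
        if word.startswith(p) and len(word) - len(p) >= MIN_CORE:
            prefix, word = p, word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= MIN_CORE:
            suffix, word = s, word[:-len(s)]
            break
    return prefix, word, suffix

# "the libraries": definite article + core + feminine plural ending.
parts = isolate_affixes("المكتبات")
```

The remaining core is only a candidate for the deeper analysis; recovering the stem or root from it still requires pattern matching or a dictionary such as the hand-built one used in the experiments.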

Representation of the Arabic Language

The representation of the Arabic language has been a major concern for the designers of Arabic systems. The representation involves the internal representation of the stored data as well as the external representation, which is used in displaying text on the screen or the printer.

The General Assembly of the Arabic Standardization and Metrology Organization has approved many standards for Arabic text representation. The seven-bit coded Arabic character set for information interchange, ASMO-449, was adopted in October 1982 to represent Arabic characters along with some graphical and control characters. Although this code was intended for pure Arabic language applications only, some applications use it to handle bilingual text by using some special characters to indicate that the text is changing from one language to another. For bilingual applications the organization adopted an eight-bit coded Arabic-Roman character set for information interchange known as ASMO-708.

Both ASMO-449 and ASMO-708 include 32 Arabic alphabetic characters. The set of displayable Arabic shapes, however, is much larger than the set of coded Arabic characters. This is because an Arabic character changes its shape depending on whether it is at the beginning, the middle, or the end of the word, or isolated. The majority of the Arabic characters have two distinguishable shapes, and a few characters have one, three, or four distinguishable shapes. To determine the correct shape of a given Arabic character a contextual analysis algorithm is needed. Previous work has provided a fast and efficient algorithm (Al-Kharashi, 1989; 1990b) which was implemented in Micro-AIRS as a basic function used by the I/O interface.
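The idea behind contextual analysis can be sketched as follows. This is not the algorithm of Al-Kharashi (1989; 1990b), which is not described here; it is a simplified illustration that assigns each letter one of four positional classes from its neighbors. The letter sets are assumptions of the sketch, and a font with fewer glyphs per letter would map several classes to the same shape.

```python
# Letters that join to the preceding letter but never to the following one
# (an assumed list for this sketch).
RIGHT_JOINING_ONLY = set("اأإآدذرزوؤةى")
NON_JOINING = set("ء")  # the hamza stands alone

def positional_forms(word):
    """Return the positional class of each letter in an unvowelized word."""
    forms = []
    for i, ch in enumerate(word):
        prev = word[i - 1] if i > 0 else None
        nxt = word[i + 1] if i + 1 < len(word) else None
        # A letter connects backward only if the previous letter joins forward.
        joins_prev = (prev is not None and ch not in NON_JOINING
                      and prev not in RIGHT_JOINING_ONLY
                      and prev not in NON_JOINING)
        # A letter connects forward only if it is itself able to join forward.
        joins_next = (nxt is not None and nxt not in NON_JOINING
                      and ch not in RIGHT_JOINING_ONLY
                      and ch not in NON_JOINING)
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms

forms = positional_forms("كتاب")  # "book"
```

In the example the alif takes its final form and, because it does not join forward, the last letter is drawn in isolated form even though it sits at the end of the word.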

Displaying the Arabic Text

There are two available approaches to displaying Arabic shapes on the PC. The first approach uses the graphic video mode, while the other uses the alphanumeric video mode. Using the graphic screen allows the display of an unlimited number of fonts with flexible sizes and the vowels at the correct positions. Unfortunately, using a graphic screen will slow down the system as its complexity increases. Also, as the font size increases, the amount of displayable information decreases.

One of the easiest ways to speed up the I/O routines and make the screen hold more information is by using the alphanumeric video mode. On the original MDA and CGA display adapters, the only fonts that could be displayed in the alphanumeric video mode were those defined in a table located in ROM on the adapter. To display different fonts, the ROM must be replaced with one that holds the new font definitions. Recent adapters, such as EGAs and VGAs, all have alphanumeric character generators that use character definition tables located in predesignated areas of RAM. This table can be accessed and modified by means of software.

For Micro-AIRS, a small number of I/O routines have been designed and implemented to allow it to accept and display an Arabic/English text. A previous system (Al-Kharashi, 1989), which uses a graphic screen to display vowelized Arabic text, has been modified to display both Arabic and English text in the same screen line. The new system uses the text screen instead of the graphic screen to display text. To achieve this, a whole new font table was created by the first author. The font shapes that represent the English ASCII characters are kept without any change. The last 128 font shapes, which represent some graphical and foreign shapes, have been replaced by the shapes of Arabic characters and vowels. The Micro-AIRS fonts are shown in Figure 3. A brief glimpse of this interface and the basic input/output routines will be provided during the discussion of the Micro-AIRS system structure below.

FIG. 3. The Micro-AIRS fonts.

The Structure of the Micro-AIRS System

Basically, Micro-AIRS consists of three main conceptual components: namely, a User Interface, a Command Processor, and a Database Handler. The description of each component of the system follows.

User Interface

FIG. 4. (a) Arabic Micro-AIRS user interface. (b) Equivalent English interface.

The real effectiveness of a computer system is measured by its usability by people other than computer professionals. This leads to the need for an effective human-computer interface. Menu-driven systems are one of the most successful and widely used system design techniques. The advantages of using menu-driven systems have been discussed by Shneiderman (1987) and by Galambos et al. (1985). They include: reducing the training and memorizing effort, simplifying entry of choices, structuring the user’s task, and allowing the user to become acquainted with the range of possibilities that the system offers. Micro-AIRS adapted the menu system that is used by Borland’s interactive compiler products such as Turbo C, Version 2.0 and Turbo Pascal, Version 5.0 (Borland, 1988b; 1988c). Thus, our interface is composed of two components, a permanent menu and pull-down menus. The permanent menu displays the options of the main menu. Left and right arrow keys are used to move through the items, causing them to be highlighted one at a time. The highlighted item can be selected by pressing the (ENTER) key. A pull-down menu, on the other hand, is displayed when an item from the permanent menu, or an item from the current pull-down menu, is selected. Pull-down menu items are listed vertically, and the user can move through the items by using the up and down arrow keys. For large lists of items in one menu, more elaborate scrolling capabilities are provided. The screen in the Micro-AIRS user interface is divided into three areas as shown in Figure 4 and described as follows:

• System response area: This area is used by the system to display its response to a user command such as DISPLAY, SEARCH, or SORT.

• System status area: The bottom line of the screen shows the system status (running, waiting, or idle), error messages, names of active databases, and so forth.

• System menu area: The top line of the screen lists all available submenus/commands in the system. An individual entry can be activated by highlighting it using left/right arrow keys and then pressing the (ENTER) key. There are eight basic items in the main menu. When activated, each item in the main menu will pull up another menu. An item in the second level menu, in turn, could trigger another menu or select a basic item.

Command Processor

This module accepts a user request, validates it, and then processes it. Since all user commands are entered through a menu-driven system, a great deal of this module is devoted to interpreting the SEARCH and DISPLAY commands. Commands are categorized into eight groups, namely FILE, EDIT, SEARCH, DISPLAY, SORT, PRINT, UTILITIES, and HELP. Each group is represented by an entry in the system menu area. When a group is selected, it displays a list of related commands, as shown in Figure 4.

The DISPLAY command allows the user to access the text of a database record directly, or to choose to display a document from a previously retrieved set. The user then can display the next or the previous document, the last or the first document, or jump backward or forward a given distance from the current document.

Micro-AIRS allows the user to search the database using three retrieval methods, using words, stems, or roots, one at a time. To switch from one retrieval method to another, the user should use the SEARCH/RETRIEVAL-METHOD command to select the desired method. The system will then close the current keyword and posting files associated with the current method and open the files for the selected method. The system makes available a full set of Boolean and distance operators. The ADJ operator specifies that two words must appear next to each other in the document and in the proper word order. The FLD operator specifies that two words must appear in the same field. Another option allows the user to specify that two index terms must be separated by exactly n words. The truncation symbol “:” can be suffixed to a query term to widen the search results. The truncation symbol indicates whether the term is to be searched as a complete term or as a fragment of a larger term. The truncation option is limited to the word-retrieval method only, since the stem and root retrieval methods achieve a better effect than truncation. Parentheses can be used to construct long and complex queries.

A query can be submitted interactively using the appropriate chain of menus, or by giving the name of a pre-edited query file which can contain one or more queries.
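The distance operators and the truncation symbol can be illustrated with a small sketch over tokenized fields. The record layout, the field names, and the function names are hypothetical; the sketch also assumes that the "exactly n words" option respects word order, which is not stated above, and it uses English tokens only for readability.

```python
def positions(tokens, term):
    """Token positions matching a term; a trailing ':' requests a prefix match."""
    if term.endswith(":"):
        fragment = term[:-1]
        return [i for i, t in enumerate(tokens) if t.startswith(fragment)]
    return [i for i, t in enumerate(tokens) if t == term]

def adj(record, a, b, gap=0):
    """True if b follows a with exactly `gap` words between them in some field.

    gap=0 corresponds to the ADJ operator (adjacent, in order).
    """
    for tokens in record.values():
        pos_b = set(positions(tokens, b))
        if any(p + gap + 1 in pos_b for p in positions(tokens, a)):
            return True
    return False

def fld(record, a, b):
    """True if both terms occur in the same field (the FLD operator)."""
    return any(positions(tokens, a) and positions(tokens, b)
               for tokens in record.values())

# Hypothetical record: field name -> list of extracted words.
record = {
    "title": ["arabic", "information", "retrieval", "system"],
    "category": ["computer", "science"],
}
```

For example, `adj(record, "information", "retrieval")` holds, while `adj(record, "arabic", "retrieval")` holds only with `gap=1`.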

File Handler

This module is responsible for accessing and updating the data file. The three most basic operations are creating a new database, building a searchable database out of pure text files, and searching through a database.

Each Micro-AIRS database consists of five basic files: the database definition file, the data and the index files, and the keyword and posting files (Al-Kharashi, 1990a). Descriptions of each file are provided in the next section along with a discussion of the indexing process.

The Micro-AIRS Indexing Process

Indexing Strategies

The indexing and the data organization are the two major factors that influence the effectiveness and the efficiency of an information retrieval system. The indexing process deals with the selection of appropriate terms capable of representing the content of a given bibliographic record. Experimental information retrieval systems use different indexing methods. Frequency-based indexing methods measure the importance of a given term by its frequency in individual documents as well as its frequency in the whole collection. Frequency factors can also be used during the retrieval process as term weights to enhance the precision of the system, that is, to present the user with a set of records that closely match the query, in decreasing order of similarity. Binary weighting schemes, however, can be used instead. In this case all indexable terms are assigned the same weight. Although Micro-AIRS stores the frequency of occurrence of every valid indexable term in the collection, these values were not used as a measure for selecting significant terms.

There are two reasons for not using the frequency of the word as a valid measurement tool during the indexing and the retrieval of the data. The first reason is related to Luhn’s (1958) observation. Since Luhn’s law has not been verified with an Arabic text, it is not realistic to use it as a solid base for indexing Arabic data. The second reason involves the type of collection that was used to test the system. Salton (1971) concluded that the effectiveness of the content analysis depends on the length of the textual data available. Content analysis works better with larger textual data. In our collection, every record contains a short title with no abstract, except for very few records. A word seldom occurs more than once or twice in the same document. Hence, a frequency-based measurement has no significance for our data.

Data Description

As was described earlier, the Arabic collection contains about 23,800 records covering a wide range of science and technology fields. This data was originally contained in a single sequential file that occupies about seven million bytes of disk space. Each record in the data file is represented as a sequential list of bibliographic fields (e.g., title, author, journal title, abstract, and so forth). The text of each record is terminated with an end-of-record mark. Every field starts with a three character field identifier followed by a space and then the text for that field. The field text is terminated with an end-of-line mark (i.e., carriage return and line feed characters).
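A minimal sketch of reading this sequential layout follows. The text above does not say which character serves as the end-of-record mark, so the `EOR` value below is an assumption, and the field identifiers `TTL` and `AUT` in the sample are hypothetical.

```python
EOR = "\x1e"  # assumed end-of-record mark; the actual mark is not specified above

def parse_records(text):
    """Yield one {field_id: field_text} dict per record in a sequential data file."""
    for chunk in text.split(EOR):
        record = {}
        for line in chunk.split("\r\n"):
            # A field line is a three character identifier, a space, then the text.
            if len(line) < 4 or line[3] != " ":
                continue  # skip blank or malformed lines
            record[line[:3]] = line[4:]
        if record:
            yield record

# Hypothetical two-record sample.
sample = ("TTL Computer systems\r\nAUT Example Author\r\n" + EOR +
          "TTL Machine translation\r\n" + EOR)
records = list(parse_records(sample))
```

The sketch keeps only the last occurrence of a repeated field identifier; a real loader would need a policy for repeated fields.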

The evaluation of the system requires some initial manual tasks, mainly relevance judgments. These tasks need experts in the subject area that the system covers. Because the collection has wide coverage, we needed to choose a subset that covers a specific area where we would find help in performing the manual tasks. A single record from the original data is contained in one or more sets. The computer and information science set, with 355 records, was found to be the most suitable set for testing and evaluating the system. With this set it is easy to find people who are able to create queries and perform the relevance judgments.

The text of the selected set contained a few typing and spelling mistakes. To reduce the effect of these mistakes on the evaluation process, they were corrected before the final indexing process was carried out. The majority of these mistakes were easily detected after all keywords from the record texts were extracted and sorted. The VATE editor (Al-Kharashi, 1989) was then used for the simple editing and correction processes.

Database Definition Table

The database definition table controls the behavior of the system during editing, indexing, and retrieval. Every field in the database has an entry in this file and holds the following information: the full name of the field, the abbreviated name of the field, and the field attributes.

The process of selecting index terms goes through many phases starting with plain text and ending with a list of useful, accessible index terms. The indexing process accepts plain text defined by the index and record file, and extracts all words from every indexable field in all database records. In the database definition table we define the category, the title, and the abstract fields as indexable and searchable fields. The extraction of a word from a bilingual text is certainly more difficult than the extraction of a word from a unilingual text. With the use of the character attribute file, the word extraction process was able to distinguish between numeric, control, English, and Arabic data. The length of any extracted term is limited to 25 bytes. If the original term exceeds the 25 byte limit, the term will be truncated at the 25th byte and the remainder of the term will be skipped.

After the extraction of all index terms from the record file, the indexed keyword list is then sorted. A general purpose sorting program such as the DOS sort utility or other commercial sorting programs will not work with our data. This is because these utilities are usually intended for textual data and the indexed keyword list contains binary coded values in the document number entries.
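The extraction step can be sketched as a scan over an 8-bit encoded field. The contents of the character attribute file are not given here, so the byte ranges in `char_class` are assumptions: ASCII letters are treated as English, ASCII digits as numeric, bytes in the upper half of the code table as Arabic, and everything else as a separator. Only the 25-byte limit comes from the text.

```python
MAX_TERM_BYTES = 25  # limit stated in the text

def char_class(b):
    """Classify one byte; these ranges stand in for the character attribute file."""
    if 0x30 <= b <= 0x39:
        return "numeric"
    if 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A:
        return "english"
    if b >= 0x80:
        return "arabic"
    return "separator"  # control characters, spaces, punctuation

def extract_terms(data):
    """Return (class, term) pairs for runs of same-class bytes, cut at 25 bytes."""
    terms, start, current = [], 0, "separator"
    for i, b in enumerate(data + b" "):  # the sentinel space flushes the last run
        cls = char_class(b)
        if cls != current:
            if current != "separator":
                end = min(i, start + MAX_TERM_BYTES)  # skip bytes past the limit
                terms.append((current, data[start:end]))
            start, current = i, cls
    return terms

terms = extract_terms(b"IBM 355 \xc7\xe4\xd9")
```

Splitting on every change of class means a mixed token such as a product name followed by digits yields two terms; whether Micro-AIRS does the same is not stated.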

IBM PCs and compatibles are built on the Intel family

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE-September 1994 553


TABLE 1. The list of queries.

Number  English meaning
1       Computer systems
2       Computer and languages
3       Arabic programming languages
4       Computer and architecture design
5       Natural language processing
6       Computer and drawing
7       Computer and language learning
8       Computer and industries and industrial information
9       Computers in military field
10      Computer Arabization
11      Parallel programming
12      Morphological analysis
13      Computers and (indexing or classification or documentation)
14      Computers in Saudi Arabia
15      Computers and (Quran or Hadeeth)
16      Computers and children
17      Computers and phonetics
18      Computers and agriculture
19      Computer networks and communication
20      Computer and design
21      Computers in education
22      Arabic terminology
23      Thesauri and information retrieval
24      Computers and information security
25      Computers in libraries
26      Machine translation
27      Computers and managements
28      Personal computers
29      Terminology databanks

microprocessors. The Intel CPUs store an integer value, which occupies two bytes, with the least significant byte at the lower location and the most significant byte at the higher location. This structure causes a general purpose sorting program intended for textual data to mis-sort any data with numerical data coded in binary format. To overcome this problem, a general purpose sorting program, MERGESORT.C, was developed. MERGESORT uses the Turbo C built-in quick sort function, QSORT (Borland, 1988a), for internal sorting.
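The mis-sort is easy to reproduce. The following is a minimal sketch, not code from MERGESORT.C: it assumes two-byte unsigned document numbers stored in little-endian order (least significant byte first, as on Intel CPUs) and compares a byte-by-byte ordering, which is what a text-oriented sort would apply, with a numeric ordering. The names `pack_le16` and `doc_numbers` are illustrative.

```python
import struct

def pack_le16(n):
    """Pack an unsigned 16-bit integer in little-endian byte order."""
    return struct.pack("<H", n)

doc_numbers = [1, 2, 255, 256, 257, 300]

# A text-oriented sort compares the stored bytes from left to right,
# so the low-order byte dominates the comparison.
bytewise = sorted(doc_numbers, key=pack_le16)

# A binary-aware sort decodes the field before comparing.
numeric = sorted(doc_numbers)

print(bytewise)  # [256, 1, 257, 2, 300, 255]
print(numeric)   # [1, 2, 255, 256, 257, 300]
```

Document 256 is stored as the bytes 00 01 and therefore sorts ahead of document 1 (bytes 01 00), which is why a sort routine that understands the binary field is needed.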


TABLE 2. The binary similarity coefficients.

Cosine: $\dfrac{|Q \cap D|}{\sqrt{|Q|} \cdot \sqrt{|D|}}$

Dice: $\dfrac{2\,|Q \cap D|}{|Q| + |D|}$

Jaccard: $\dfrac{|Q \cap D|}{|Q| + |D| - |Q \cap D|}$

$|D|$, number of terms in the document text; $|Q|$, number of terms in the query text; $|Q \cap D|$, number of terms in both document and query.

Creation of the Word-Stem-Root Dictionary

A comprehensive Arabic dictionary is not available in machine-readable form. So we created a small word-stem-root dictionary, using the words in the collection. The dictionary is used during the indexing and the retrieval process to identify the stem or the root of a given word and also to identify the stop words. For every keyword a corresponding stem and root structure was created. From 355 bibliographic records we obtained 1,126 words, 725 stems, and 526 roots.
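A dictionary of this kind can be pictured as a simple lookup table. The sketch below is only an illustration: the entries, the stop words, and the function names `index_term` and `distinct_terms` are hypothetical and are not taken from the Micro-AIRS dictionary. It shows how four surface words collapse into two stems and one root.

```python
# Hypothetical dictionary entries: word -> (stem, root). The Arabic forms
# are illustrative and are not taken from the Micro-AIRS dictionary.
DICTIONARY = {
    "الكتاب": ("كتاب", "كتب"),     # "the book"
    "كتابان": ("كتاب", "كتب"),     # "two books"
    "المكتبات": ("مكتبة", "كتب"),  # "the libraries"
    "مكتبة": ("مكتبة", "كتب"),     # "library"
}
STOP_WORDS = {"في", "من"}  # "in", "from"

def index_term(word, method):
    """Return the index term for a word, or None for a stop word."""
    if word in STOP_WORDS:
        return None
    if method == "word":
        return word
    # Words missing from the dictionary fall back to themselves.
    stem, root = DICTIONARY.get(word, (word, word))
    return stem if method == "stem" else root

def distinct_terms(words, method):
    """Collect the distinct index terms produced for a list of words."""
    terms = (index_term(w, method) for w in words)
    return {t for t in terms if t is not None}

words = ["الكتاب", "كتابان", "المكتبات", "مكتبة", "في"]
for method in ("word", "stem", "root"):
    print(method, len(distinct_terms(words, method)))
```

The same shrinkage is visible in the collection itself, where 1,126 words reduce to 725 stems and 526 roots.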

Evaluation of the Micro-AIRS System

The major purpose of this work is to study the effect of using words, stems, or roots as index terms on the performance of an Arabic information-retrieval system.

Relevance Judgments

Information-retrieval systems are usually evaluated in terms of two measures, recall and precision. Recall is defined as the proportion of the documents in the collection relevant to the query that are actually retrieved. Precision is defined as the proportion of the documents retrieved that are actually relevant. Perfect recall (a value of 1.0) occurs when the system finds all the items in the collection that are relevant to the query. Perfect precision (also a value of 1.0) occurs when all the documents retrieved are relevant.
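As a worked example of the two definitions, the sketch below computes recall and precision from three counts. The function names are illustrative, and returning 0.0 when nothing is retrieved is a convention assumed here, not one stated in the paper. The sample counts are those reported for query 1 with word indexing in Table 6 (15 relevant records in the collection, 5 retrieved, 4 of them relevant).

```python
def recall(relevant_retrieved, relevant_in_collection):
    """Share of the relevant documents in the collection that were retrieved."""
    return relevant_retrieved / relevant_in_collection

def precision(relevant_retrieved, retrieved):
    """Share of the retrieved documents that are relevant."""
    # Assumed convention: precision is 0.0 when nothing was retrieved.
    return relevant_retrieved / retrieved if retrieved else 0.0

# Query 1, word indexing (Table 6): Rc. = 15, Ret. = 5, Rel. = 4.
print(round(recall(4, 15), 4))    # 0.2667
print(round(precision(4, 5), 4))  # 0.8
```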

Both measures depend on knowing what documents are relevant to each query, so the first step in evaluation is making relevance judgments. For large collections, sampling techniques are used (Salton, 1975), but our collection was small enough that we could carry out this task manually.

We asked graduate students in computer science, who were also native speakers of Arabic, to make up 60 queries that they might themselves use in their own research. Ten queries were removed because they were essentially duplicates of other queries, asking for the same information. The 355 database records were divided into three sets. Each set was handed to one of the students along with a computer-based relevance judgment support system designed and implemented by the first author. This system allowed the judge to browse through the records of the set and the list of queries at the same time in two different windows on the same screen. If the user judged that the displayed record was relevant to the displayed query, he simply marked a box on the screen. After the completion of the relevance judgment task, the judgments for the three sets were grouped together in a relevance judgment matrix.

Out of the 50 queries, only the 29 queries shown in Table 1 were found to have one or more relevant documents in the collection. Thus, only these 29 queries were used for the system evaluation.

Similarity Measurements

Similarity coefficients have several important applications in an information-retrieval system (Salton, 1989). Their most important function is in ranking retrieved documents in order to present to the user first the documents most relevant to the query. There are three common normalized similarity coefficients (each with two versions, one for binary and one for weighted terms): the cosine, Dice, and Jaccard coefficients (van Rijsbergen, 1979; Salton, 1989). We decided to try all three binary coefficient measurement methods and select the one that performs the document ranking best. Table 2 shows the formulas for the cosine, Dice, and Jaccard binary coefficients.

The calculations of the similarity measurements between a query and a document require information about the number of terms in the document text and in the query text and the number of terms that appear jointly in the query and the document text. When the root or the stem is used instead of the word, it becomes more likely that several words collapse into one common stem or root. Hence, the number of unique terms is reduced and the similarity coefficient is increased.
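The three binary coefficients of Table 2 need only the three counts named above. The following sketch is an illustration rather than the Micro-AIRS implementation; the function name and the placeholder terms are assumptions. With one query term, three document terms, and one shared term, it yields values consistent, to within rounding, with the first row of Table 3.

```python
import math

def binary_coefficients(query_terms, doc_terms):
    """Return (cosine, Dice, Jaccard) for two collections of index terms."""
    q, d = set(query_terms), set(doc_terms)
    common = len(q & d)
    if common == 0:
        # Also covers an empty query or document, avoiding division by zero.
        return 0.0, 0.0, 0.0
    cosine = common / (math.sqrt(len(q)) * math.sqrt(len(d)))
    dice = 2 * common / (len(q) + len(d))
    jaccard = common / (len(q) + len(d) - common)
    return cosine, dice, jaccard

# Placeholder terms: |Q| = 1, |D| = 3, |Q n D| = 1.
cosine, dice, jaccard = binary_coefficients(["a"], ["a", "b", "c"])
print(f"{cosine:.4f} {dice:.4f} {jaccard:.4f}")  # 0.5774 0.5000 0.3333
```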

In determining the order in which documents should be presented to the user, however, the actual value of the coefficient does not matter; it is only the relative values that make a difference. The ranking process showed that all three binary similarity coefficients produced exactly the same rankings for all queries.
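For the Dice and Jaccard coefficients this agreement follows from the formulas themselves. Writing $c = |Q \cap D|$ and $s = |Q| + |D|$ (symbols introduced here only for brevity), the Jaccard value is a strictly increasing function of the Dice value:

```latex
\[
\mathrm{Dice} = \frac{2c}{s}, \qquad
\mathrm{Jaccard} = \frac{c}{s - c}
  = \frac{\tfrac{1}{2}\,s\,\mathrm{Dice}}{s - \tfrac{1}{2}\,s\,\mathrm{Dice}}
  = \frac{\mathrm{Dice}}{2 - \mathrm{Dice}} .
\]
```

Because $x / (2 - x)$ increases with $x$ on $[0, 1]$, any two documents are ordered the same way by both coefficients. No such identity links the cosine coefficient to the other two, so its agreement with them is an empirical result of these experiments.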

The actual values are shown for all three similarity methods combined with all three retrieval methods for three example queries in Tables 3, 4, and 5. The results obtained for all the other queries were the same. Clearly, it does not matter what coefficient is used and we are free to use whichever one is the cheapest to compute on the target hardware.

TABLE 3. Ranking of the result of query number 20 using roots.

Document  Relevance         Similarity coefficient values
number    indicator   Rank  Cosine   Dice     Jaccard
19                    1     0.5774   0.5000   0.3334
18        *           2     0.4715   0.3637   0.2223
216       *           3     0.4715   0.3637   0.2223
281       *           4     0.4715   0.3637   0.2223
325                   5     0.4083   0.2858   0.1667
212                   6     0.2133   0.0870   0.0455

TABLE 4. Ranking of the result of query number 22 using words.

Document  Relevance         Similarity coefficient values
number    indicator   Rank  Cosine   Dice     Jaccard
279       *           1     0.5346   0.4445   0.2858
263       *           2     0.4473   0.3334   0.2000
254       *           3     0.4265   0.3077   0.1819
253       *           4     0.3780   0.2500   0.1429
273       *           5     0.3652   0.2353   0.1334

Results of Processing the Queries

The processing of our 29 queries on the Micro-AIRS system using words, stems, and roots produces the results shown in Table 6. The system performance using each of the three indexing methods can be categorized into six groups as follows:

(1) The system failed to retrieve any document with the use of any retrieval method in response to query 4.

(2) The three methods perform equally in response to queries 9, 11, and 16.

(3) The word-retrieval method performs as well as or better than the other methods at most recall levels in queries 13, 20, 25, 26, and 29. It is not able, however, to retrieve any of the relevant documents for the following queries: 2, 4, 5, 7, 12, 15, 17, 18, and 23.

(4) The stem retrieval method outperforms the other methods in response to some queries, including 1 and 7.

(5) The root-retrieval method outperforms the other methods in most of the queries, most strikingly in 3, 8, 12, 15, 17, 18, 19, 22, 23, 27, and 28.

(6) The stem- and the root-retrieval methods perform equally in response to queries 2, 5, and 6.

TABLE 5. Ranking of the result of query number 21 using stems.

Document  Relevance         Similarity coefficient values
number    indicator   Rank  Cosine   Dice     Jaccard
328       *           1     0.5774   0.5774   0.4000
73        *           2     0.5774   0.5000   0.3334
230       *           3     0.4365   0.4000   0.2500
52        *           4     0.4365   0.4000   0.2500
13                    5     0.4365   0.4000   0.2500
12                    6     0.3850   0.3334   0.2000
276                   7     0.3652   0.3077   0.1819
255                   8     0.3652   0.3077   0.1819
11                    9     0.3652   0.3077   0.1819
120       *           10    0.3334   0.2667   0.1539
224       *           11    0.2218   0.0938   0.0492

TABLE 6. The results of the processing of the 29 queries.

                  Word              Stem              Root
Query  Rc.   Ret.  Rel.  Fall  Ret.  Rel.  Fall  Ret.  Rel.  Fall
1      15    5     4     1     15    10    5     22    10    12
2      5     0     0     0     85    4     81    85    4     81
3      10    1     1     0     5     5     0     7     7     0
4      2     0     0     0     0     0     0     0     0     0
5      5     0     0     0     4     2     2     4     2     2
6      3     1     1     0     3     3     0     3     3     0
7      3     0     0     0     7     3     4     25    3     22
8      11    8     8     0     11    10    1     13    11    2
9      5     3     3     0     3     3     0     3     3     0
10     13    3     3     0     9     5     4     47    11    36
11     1     1     1     0     1     1     0     1     1     0
12     1     0     0     0     0     0     0     1     1     0
13     9     5     5     0     6     5     1     6     5     1
14     15    10    9     1     17    13    4     17    12    5
15     2     0     0     0     1     1     0     2     2     0
16     1     1     1     0     1     1     0     1     1     0
17     4     0     0     0     2     2     0     3     3     0
18     4     0     0     0     0     0     0     3     2     1
19     5     2     2     0     3     3     0     6     5     1
20     4     2     2     0     6     3     3     6     3     3
21     9     3     3     0     11    6     5     52    7     45
22     26    5     5     0     5     5     0     14    14    0
23     2     0     0     0     1     1     0     2     2     0
24     7     4     4     0     4     4     0     7     4     3
25     7     3     3     0     4     3     1     7     4     3
26     1     1     1     0     2     1     1     3     1     2
27     14    4     1     3     10    4     6     12    6     6
28     11    1     1     0     8     8     0     9     9     0
29     1     2     1     1     2     1     1     3     1     2

Rc., number of relevant records in the collection; Ret., number of retrieved records using the method; Rel., number of relevant records actually retrieved; Fall, number of irrelevant records actually retrieved.

Table 7 and Figure 5 show the differences in average retrieval values. It is clear from the table and the figure that the root-retrieval method was able to retrieve more documents per query than the other two retrieval methods. The weakness of the root-retrieval method is the number of irrelevant documents it retrieves along with the relevant documents. The stem-retrieval method retrieved fewer irrelevant documents yet a reasonable number of relevant documents. Finally, the word-retrieval method retrieved the fewest relevant and irrelevant documents.

TABLE 7. Average retrieval of the 29 queries.

Method  Retrieved  Relevant  Irrelevant
Word    2.24       2.03      0.21
Stem    7.79       3.69      4.10
Root    12.55      4.72      7.83

FIG. 5. Average retrieval of the 29 queries (Word, Stem, Root).
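The averages in Table 7 follow directly from the per-query counts in Table 6. The sketch below re-derives them as a consistency check; the tuple layout and the names `TABLE_6` and `averages` are illustrative choices, and the figures are those read from Table 6.

```python
# Rows of Table 6: (query, Rc., word, stem, root), where each method
# holds its (Ret., Rel., Fall) counts.
TABLE_6 = [
    (1, 15, (5, 4, 1), (15, 10, 5), (22, 10, 12)),
    (2, 5, (0, 0, 0), (85, 4, 81), (85, 4, 81)),
    (3, 10, (1, 1, 0), (5, 5, 0), (7, 7, 0)),
    (4, 2, (0, 0, 0), (0, 0, 0), (0, 0, 0)),
    (5, 5, (0, 0, 0), (4, 2, 2), (4, 2, 2)),
    (6, 3, (1, 1, 0), (3, 3, 0), (3, 3, 0)),
    (7, 3, (0, 0, 0), (7, 3, 4), (25, 3, 22)),
    (8, 11, (8, 8, 0), (11, 10, 1), (13, 11, 2)),
    (9, 5, (3, 3, 0), (3, 3, 0), (3, 3, 0)),
    (10, 13, (3, 3, 0), (9, 5, 4), (47, 11, 36)),
    (11, 1, (1, 1, 0), (1, 1, 0), (1, 1, 0)),
    (12, 1, (0, 0, 0), (0, 0, 0), (1, 1, 0)),
    (13, 9, (5, 5, 0), (6, 5, 1), (6, 5, 1)),
    (14, 15, (10, 9, 1), (17, 13, 4), (17, 12, 5)),
    (15, 2, (0, 0, 0), (1, 1, 0), (2, 2, 0)),
    (16, 1, (1, 1, 0), (1, 1, 0), (1, 1, 0)),
    (17, 4, (0, 0, 0), (2, 2, 0), (3, 3, 0)),
    (18, 4, (0, 0, 0), (0, 0, 0), (3, 2, 1)),
    (19, 5, (2, 2, 0), (3, 3, 0), (6, 5, 1)),
    (20, 4, (2, 2, 0), (6, 3, 3), (6, 3, 3)),
    (21, 9, (3, 3, 0), (11, 6, 5), (52, 7, 45)),
    (22, 26, (5, 5, 0), (5, 5, 0), (14, 14, 0)),
    (23, 2, (0, 0, 0), (1, 1, 0), (2, 2, 0)),
    (24, 7, (4, 4, 0), (4, 4, 0), (7, 4, 3)),
    (25, 7, (3, 3, 0), (4, 3, 1), (7, 4, 3)),
    (26, 1, (1, 1, 0), (2, 1, 1), (3, 1, 2)),
    (27, 14, (4, 1, 3), (10, 4, 6), (12, 6, 6)),
    (28, 11, (1, 1, 0), (8, 8, 0), (9, 9, 0)),
    (29, 1, (2, 1, 1), (2, 1, 1), (3, 1, 2)),
]

METHOD_COLUMN = {"word": 2, "stem": 3, "root": 4}

def averages(method):
    """Average retrieved, relevant, and irrelevant counts per query."""
    col = METHOD_COLUMN[method]
    n = len(TABLE_6)
    return tuple(round(sum(row[col][i] for row in TABLE_6) / n, 2)
                 for i in range(3))

for method in METHOD_COLUMN:
    print(method, averages(method))
```

Run as written, the three printed tuples agree with the Word, Stem, and Root rows of Table 7.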

From the previous detailed and averaged retrieval data, we cannot draw a precise conclusion about the effectiveness of the system. In the following sections we will discuss and present the standard evaluation of the information-retrieval system based on the recall and precision measures.

Recall-Precision Measurements

The recall and precision values produced for a given query reveal the behavior of the system only under that query and for those calculated recall and precision values. Notice that the rough recall-precision values could contain many precision values for a single recall value. Furthermore, some precision values may not be defined at certain recall levels. Different smoothing algorithms are in common use for precision averaging (Keen, 1972). In Micro-AIRS we used the smoothing algorithm summarized below.

(1) Divide the recall values into 10 levels: $0.0 \le r_{0.1} < 0.1$, $0.1 \le r_{0.2} < 0.2$, ..., $0.9 \le r_{1.0} \le 1.0$.

(2) Assign the largest precision value of a level to that level.

(3) Assign the largest precision value found in the table to the first level.

(4) Starting from the tenth region, we start removing all sawtooth lines by assigning the current level precision value to the next level if its precision value is lower than the current one.

(5) To assure that the precision will drop gradually from a certain precision value to a zero value, we assign any level with a zero precision to half of the precision value of the previous level.
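The five steps can be sketched as follows. This is one reading of the description, not the Micro-AIRS code: it assumes that rough precision is computed after every retrieved document, that "the next level" in step 4 means the next lower recall level, and that "the previous level" in step 5 means the next lower recall level as well. The function name and the input format (a ranked list of relevance flags plus the number of relevant records in the collection) are illustrative.

```python
def smooth_precision(relevance_flags, total_relevant):
    """Return 10 smoothed precision values for recall levels 0.1 .. 1.0.

    relevance_flags: ranked retrieval output, True for a relevant document.
    total_relevant: relevant records in the collection (must be > 0).
    """
    levels = [0.0] * 10
    found = 0
    # Steps 1-2: keep the largest rough precision seen in each recall band.
    # Integer arithmetic avoids floating-point trouble at band boundaries.
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            found += 1
        band = min(10 * found // total_relevant, 9)
        levels[band] = max(levels[band], found / rank)
    # Step 3: the first level takes the largest precision in the table.
    levels[0] = max(levels)
    # Step 4: walk down from the tenth level and lift any lower-recall
    # level whose precision is below that of the level above it.
    for i in range(9, 0, -1):
        if levels[i - 1] < levels[i]:
            levels[i - 1] = levels[i]
    # Step 5: a level still at zero takes half of the previous level.
    for i in range(1, 10):
        if levels[i] == 0.0:
            levels[i] = levels[i - 1] / 2
    return levels

# Ranking of query 21 using stems (Table 5); Rc. = 9 in Table 6.
flags = [True, True, True, True, False, False, False, False, False, True, True]
smoothed = smooth_precision(flags, 9)
print([round(p, 4) for p in smoothed])
```

Under these assumptions the example yields a curve that never rises as recall increases: 1.0 for the first five levels, 6/11 for the next two, and then successive halvings for the recall levels the query never reached.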

Table 8 and Figure 6 show the averaged recall-precision values with the zero-smoothing process. The summaries provided by the average recall-precision values suggest that the root-retrieval method outperforms both the word- and the stem-retrieval methods. They also suggest that the stem-retrieval method outperforms the word-retrieval method.

FIG. 6. Average recall-precision graph after zero-smoothing process (precision against recall levels 0.10 to 1.00 for the Root, Stem, and Word methods).


TABLE 8. Average recall-precision table with zero-smoothing process.

                Precision
Recall  Word    Stem    Root
0.10    0.7143  0.8739  0.9308
0.20    0.6357  0.8356  0.8998
0.30    0.5643  0.7772  0.8541
0.40    0.4946  0.7510  0.8442
0.50    0.4241  0.6952  0.8085
0.60    0.3353  0.5721  0.6946
0.70    0.2569  0.4457  0.5860
0.80    0.1999  0.3545  0.5047
0.90    0.1714  0.2935  0.4290
1.00    0.1571  0.2467  0.3912

TABLE 10. Wilcoxon signed-rank test for word vs. stem.

        Favoring  Favoring          Norm. dev.  One-sided
Recall  word      stem      NDF     Z           probability
.10     4.00      32.00     8.00    1.9604      .0250
.20     8.00      70.00     12.00   2.4318      .0075
.30     21.00     115.00    16.00   2.4303      .0075
.40     21.00     132.00    17.00   2.6273      .0043
.50     23.00     148.00    18.00   2.7219      .0033
.60     14.00     157.00    18.00   3.1139      .0009
.70     15.00     175.00    19.00   3.2194      .0006
.80     8.00      182.00    19.00   3.5011      .0002
.90     8.00      182.00    19.00   3.5011      .0002
1.00    8.00      182.00    19.00   3.5011      .0002

Statistical Analysis

To draw accurate conclusions about the effectiveness of the system using word-, stem-, and root-retrieval methods, and to determine the significance of the results shown in Table 8 and Figure 6, we use two nonparametric statistical tests, the sign test and the Wilcoxon signed-rank test. In this analysis we compared each pair of retrieval methods separately. Thus essentially we looked at the results of three experiments: the word-stem experiment, the word-root experiment, and the stem-root experiment.

The null hypothesis and the alternative hypothesis used for the word-stem experiment are:

H0: The word-retrieval and the stem-retrieval methods give the same results.

H1: The stem-retrieval method is better than the word-retrieval method.

The null hypothesis and the alternative hypothesis used for the word-root experiment are:

H2: The word-retrieval and the root-retrieval methods give the same results.

H3: The root-retrieval method is better than the word-retrieval method.

The null hypothesis and the alternative hypothesis used for the stem-root experiment are:

H4: The stem-retrieval and the root-retrieval methods give the same results.

H5: The root-retrieval method is better than the stem-retrieval method.

The test results are shown in Tables 9-14. The statistical results support H1 and H3, that is, they confirm the superiority of root- and stem-retrieval methods over the word-retrieval method with alpha = .03 using the Wilcoxon signed-rank test.
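The paper does not spell out how the normal deviates were computed. The sketch below uses the usual large-sample normal approximations without a continuity correction, which is an assumption; it also assumes that the NDF column is the number of non-tied pairs and that the "favoring" columns of the Wilcoxon tables are rank sums. Under those assumptions it gives 1.4142 for the sign test and 1.9604 for the Wilcoxon test at recall level .10 of the word vs. stem comparison, the values listed in Tables 9 and 10. The function names are illustrative.

```python
import math

def sign_test_z(favoring_a, favoring_b):
    """Normal deviate for the sign test; tied pairs are left out."""
    n = favoring_a + favoring_b
    if n == 0:
        return 0.0
    return (favoring_b - favoring_a) / math.sqrt(n)

def wilcoxon_z(rank_sum_b, n):
    """Normal deviate for the Wilcoxon signed-rank test on n non-tied pairs."""
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (rank_sum_b - mean) / sd

def one_sided_p(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Word vs. stem at recall level .10.
print(round(sign_test_z(2, 6), 4))  # 1.4142
z = wilcoxon_z(32, 8)
print(round(z, 4))                  # 1.9604
print(round(one_sided_p(z), 4))     # 0.025
```

Probabilities computed this way can differ in the last digits from some printed values, so the sketch is a way to check the normal deviates rather than a reconstruction of the original calculation.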

When we compare the stem- and the root-retrieval methods, the results are not so clear. The one-sided probability values at the lower recall levels (up to .5) of Tables 13 and 14 comparing stem- and root-retrieval methods show that the root-retrieval method does not perform significantly better than the stem method. At higher recall levels, however, the root-retrieval method performs better than the stem-retrieval method.

TABLE 9. Sign test for word vs. stem.

          Favoring  Favoring         Norm. dev.  One-sided
Recall    word      stem      Tied   Z           probability
.10       2         6         21     1.4142      .0793
.20       3         9         17     1.7321      .0418
.30       5         11        13     1.5000      .0668
.40       5         12        12     1.6977      .0446
.50       5         13        11     1.8856      .0294
.60       4         14        11     2.3570      .0091
.70       4         15        10     2.5236      .0059
.80       3         16        10     2.9824      .0014
.90       3         16        10     2.9824      .0014
1.00      3         16        10     2.9824      .0014
Combined  37        128       125    7.0843      .0000

TABLE 11. Sign test for word vs. root.

          Favoring  Favoring         Norm. dev.  One-sided
Recall    word      root      Tied   Z           probability
.10       2         8         19     1.8974      .0287
.20       4         12        13     2.0000      .0228
.30       5         15        9      2.2361      .0073
.40       5         16        8      2.4004      .0082
.50       3         18        8      3.2733      .0005
.60       2         19        8      3.7097      .0000
.70       2         20        7      3.8376      .0000
.80       1         21        7      4.2640      .0000
.90       1         21        7      4.2640      .0000
1.00      1         21        7      4.2640      .0000
Combined  26        171       93     10.3308     .0000

TABLE 12. Wilcoxon signed-rank test for word vs. root.

        Favoring  Favoring          Norm. dev.  One-sided
Recall  word      root      NDF     Z           probability
.10     5.00      50.00     10.00   2.2934      .0110
.20     15.00     121.00    16.00   2.7406      .0031
.30     29.50     180.50    20.00   2.8186      .0020
.40     26.00     205.00    21.00   3.1108      .0009
.50     16.50     214.50    21.00   3.4410      .0003
.60     9.00      222.00    21.00   3.7017      .0000
.70     9.00      244.00    22.00   3.8147      .0000
.80     2.00      251.00    22.00   4.0420      .0000
.90     2.00      251.00    22.00   4.0420      .0000
1.00    2.00      251.00    22.00   4.0420      .0000

Conclusions

Summary

Micro-AIRS was designed as an experimental system to investigate indexing and retrieval processes for Arabic bibliographic data. During the design and implementation of the system, we dealt with the following problems:

(1) Accessing, processing, and displaying Arabic/English text.

(2) Indexing and sorting Arabic terms.

(3) Indexing and retrieval of Arabic data using different types of index terms: words, stems, and roots.

(4) Ranking documents using different binary similarity coefficients.

This research reveals the superiority of root- and stem-retrieval methods over word-retrieval methods for Arabic data. The root performs as well as or better than the stem at the low recall levels, and definitely better at high recall levels. We also found that the document ranking process produced exactly the same results when using different binary similarity coefficients, so a single simple coefficient can be used.

TABLE 13. Sign test for stem vs. root.

          Favoring  Favoring         Norm. dev.  One-sided
Recall    stem      root      Tied   Z           probability
.10       3         2         24     -.4472      .3300
.20       5         4         20     -.3333      .3707
.30       5         5         19     .0000       .5000
.40       5         6         18     .3015       .3821
.50       4         7         18     .9045       .1841
.60       4         11        14     1.8074      .0351
.70       5         12        12     1.6977      .0446
.80       5         12        12     1.6977      .0446
.90       5         12        12     1.6977      .0446
1.00      4         13        12     2.1828      .0146
Combined  45        84        161    3.4338      .0003

TABLE 14. Wilcoxon signed-rank test for stem vs. root.

        Favoring  Favoring          Norm. dev.  One-sided
Recall  stem      root      NDF     Z           probability
.10     6.00      9.00      5.00    .4045       .3409
.20     18.00     27.00     9.00    .5331       .2982
.30     18.00     37.00     10.00   .9683       .1660
.40     20.00     46.00     11.00   1.1558      .1230
.50     18.00     48.00     11.00   1.3337      .0918
.60     22.00     98.00     15.00   2.1583      .0154
.70     33.00     120.00    17.00   2.0592      .0197
.80     30.00     123.00    17.00   2.2012      .0131
.90     34.00     119.00    17.00   2.0119      .0217
1.00    24.00     129.00    17.00   2.4853      .0064

These results were obtained in a system where each document was accurately classified as to subject area. Also, the part of the collection involved in the experiments, the set containing all 355 computer science documents in the database, was carefully proofread to eliminate spelling errors. Of most concern, most documents in the collection were represented by titles only, not by abstracts. Clearly, further experiments are needed.

Future Research

In an operational system, the word-stem-root dictionary should be replaced by a morphology algorithm that finds stems and roots as mentioned previously.

By using stems and roots for indexing and retrieval we were able to retrieve most of the relevant documents in the collection. The failure to retrieve some or all relevant documents for certain queries (see Table 6) was due to the use of related words (e.g., synonyms). We believe that the use of an interactive thesaurus will be helpful in retrieving more relevant documents. For a discussion of the use of such a thesaurus in English see Fox (1980) and Wang, Vandendorpe, and Evens (1985). Research on this problem is being carried forward at Illinois Institute of Technology using a database of Arabic documents with abstracts.

The current system allows the user to use only one type of index term at any given time. To reduce the number of irrelevant documents, the user should have the ability to impose the retrieval method over individual words of a query. For example, the search argument “A and (B or C)” could be expressed as “root:A and (stem:B or word:C).”
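One way such a mixed query might be evaluated is sketched below. Everything in it is hypothetical: Micro-AIRS has no such feature, the query is written as a nested tuple instead of being parsed from text, and the three small inverted indexes hold invented document numbers.

```python
# Hypothetical postings: one inverted index per kind of index term.
INDEXES = {
    "word": {"C": {3}},
    "stem": {"B": {2, 5}},
    "root": {"A": {1, 2, 3, 4}},
}

def evaluate(node):
    """Evaluate a query tree and return the set of matching documents."""
    op = node[0]
    if op == "term":                     # ("term", method, text)
        _, method, text = node
        return set(INDEXES[method].get(text, set()))
    left, right = evaluate(node[1]), evaluate(node[2])
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    raise ValueError("unknown operator: " + op)

# root:A and (stem:B or word:C)
query = ("and",
         ("term", "root", "A"),
         ("or", ("term", "stem", "B"), ("term", "word", "C")))
result = sorted(evaluate(query))
print(result)  # [2, 3]
```

Keeping a separate index per term type lets each query term choose between the broad matching of roots and the narrow matching of words, which is the trade-off the proposal is meant to expose to the user.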

Using a binary ranking process fails in some cases to put the most relevant documents at the top of the retrieved list. A weighted ranking process should be investigated for Arabic documents using a database where all documents have abstracts, or better still, where all documents are available online in full-text form. The first author is planning a large-scale test of the effectiveness of the system at KACST using a large collection of documents with abstracts and a large number of test queries collected from actual users.

REFERENCES

Borland International. (1988a). Turbo C, version 2.0: Reference guide. Scotts Valley, CA.

Borland International. (1988b). Turbo C, version 2.0: User's guide. Scotts Valley, CA.

Borland International. (1988c). Turbo Pascal, version 5.0: User's guide. Scotts Valley, CA.

Al-Fedaghi, S. S., & Al-Anzi, F. S. (1989, March). A new algorithm to generate Arabic root-pattern forms. In Proceedings of the 11th National Computer Conference and Exhibition (pp. 391-400). Dhahran, Saudi Arabia: King Fahd University of Petroleum and Minerals.

Fox, E. (1980). Lexical relations: Enhancing effectiveness of information retrieval. SIGIR Forum, 15, 6-35.

Galambos, J. A., Sebrechts, M., Wikler, E., & Black, J. (1985). A diagrammatic language for instruction of a menu-based word processing system. In S. Williams (Ed.), Humans and machines (pp. 11-44). Norwood, NJ: Ablex.

Al-Gasimi, M. (1987, April). Arabization of the MINISIS system. In Proceedings of the First King Saud University Symposium on Computer Arabization (pp. 13-26). Riyadh, Saudi Arabia: King Saud University.

Gheith, M., & El-Sadany, T. (1987, April). Arabic morphological analyzer on a personal computer. In Proceedings of the First King Saud University Symposium on Computer Arabization (pp. 55-65). Riyadh, Saudi Arabia: King Saud University.

Gheith, M., & Abdul-Ela, M. (1989, March). A computer based Arabic syntax analyzer. In Proceedings of the 11th National Computer Conference and Exhibition (pp. 352-360). Dhahran, Saudi Arabia: King Fahd University of Petroleum and Minerals.

Harman, D. (1987, June). A failure analysis on the limitation of suffixing in online environments. In Proceedings of the 10th Annual International ACM SIGIR Conference. New York: Association for Computing Machinery.

Harman, D. (1991). How effective is suffixing? Journal of the American Society for Information Science, 42, 7-15.

Hegazi, M., & Elsharkawi, A. A. (1985, April). An approach to a computerized lexical analyzer of natural Arabic. Computer Processing of the Arabic Language. Workshop papers (Vol. 1). Kuwait: Kuwait Institute for Scientific Research (KISR).

Hilal, Y. (1985, April). Morphological analysis of Arabic speech. Computer Processing of the Arabic Language. Workshop papers (Vol. 1). Kuwait.

Keen, E. M. (1972). Prospects for classification suggested by evaluation tests carried out 1957-1970. In A. Maltby (Ed.), Classification in the 1970's (pp. 193-210). Hamden, CT: Linnet Books.

Al-Kharashi, I. A. (1989). VATE: A vowelized Arabic text editor. Ph.D. qualifying project, Illinois Institute of Technology, Chicago, IL.

Al-Kharashi, I. A. (1990a, October). Micro-AIRS: A microcomputer based Arabic information retrieval system, design, implementation and evaluation. In The 12th National Computer Conference (Vol. 2) (pp. 515-529). Riyadh, Saudi Arabia: King Saud University.

Al-Kharashi, I. A. (1990b, October). An efficient contextual analysis algorithm for Arabic text handling. In The 12th National Computer Conference (Vol. 2) (pp. 465-473). Riyadh, Saudi Arabia: King Saud University.

Lovins, J. B. (1968). Development of a stemming algorithm. Mechanical Translation and Computational Linguistics, 11, 22-31.

Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of Research and Development, 2, 159-165.

Porter, M. F. (1980). An algorithm for suffix stripping. Program, 14, 130-137.

Salton, G. (Ed.) (1971). The SMART retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice Hall.

Salton, G. (1975). A theory of indexing. Regional Conference Series in Applied Mathematics, No. 18. Philadelphia: Society for Industrial and Applied Mathematics.

Salton, G. (1989). Automatic text processing: The transformation, analysis, and retrieval of information by computer. Reading, MA: Addison-Wesley.

Salton, G., & McGill, M. J. (1983). Introduction to modern information retrieval. New York: McGraw-Hill.

Shneiderman, B. (1987). Designing the user interface: Strategies for effective human-computer interaction. Reading, MA: Addison-Wesley.

Tayli, M., & Al-Salamah, A. I. (1990). Building a bilingual microcomputer system. Communications of the ACM, 33, 495-504.

Thalouth, B., & Al-Dannan, A. (1987). A comprehensive Arabic morphological analyzer/generator. IBM Kuwait Scientific Center.

UNESCO (1989). Mini-micro CDS/ISIS. Paris.

van Rijsbergen, C. J. (1979). Information retrieval (2nd ed.). London: Butterworths.

Wang, Y. C., Vandendorpe, J., & Evens, M. (1985). A microcomputer based information retrieval system supporting stroke diagnosis. Journal of the American Society for Information Science, 36, 15-27.

Yahya, A. H. (1989, October). On the complexity of the initial stage of Arabic text processing. Paper presented at the First Great Lakes Computer Conference, Kalamazoo, MI.
