Integrating Natural Language Processing and Information Retrieval in a Troubleshooting Help Desk




Peter G. Anick, Digital Equipment Corporation

Digital Equipment Corporation provides telephone and electronic support to its customers through a worldwide network of support centers and field offices. In addition to troubleshooting customer problems, Digital's service and support specialists share their knowledge and experiences with other specialists by writing up their problems and solutions and making them available in a worldwide on-line information base. By sharing this knowledge among a large body of physically distributed support engineers, customer problems that have been previously encountered can be resolved much more quickly, thereby increasing customer satisfaction and decreasing Digital's cost of providing service.

Over the past decade, the range of products supported by Digital's specialists has grown steadily and now includes non-Digital products as well. This growth has been matched by a massive increase in the size of the on-line information base, which has surged from less than 10,000 articles into the hundreds of thousands.

As the size and importance of the database have grown, the customer support centers have improved their software's data management and information retrieval capabilities. In the early 1980s, articles were

AS THE SIZE AND IMPORTANCE OF DEC'S ON-LINE DATABASE HAVE GROWN, ITS CUSTOMER SUPPORT SOFTWARE HAS EVOLVED TO HANDLE INCREASED DATA MANAGEMENT AND INFORMATION RETRIEVAL NEEDS. NATURAL LANGUAGE PROCESSING PLAYS AN IMPORTANT ROLE.

indexed with a controlled vocabulary and retrieved using Boolean expressions over this vocabulary. By the mid-1980s, however, this system was replaced by a full-text system, Stars, which proved much more effective. In a full-text system, all the words in a text are available as keywords for querying. Users do not have to learn a controlled vocabulary, and they can express their queries in natural language rather than the syntax of Boolean expressions.

Stars did not try to "understand" the natural language queries, nor did it simply match the query's tokens with articles containing those strings. Instead, it truncated words to remove suffixes, eliminated noise words like "a" and "of," added synonyms, and converted the results into Boolean expressions for matching against articles in the database. It could also remove terms from a query to broaden it if the original search resulted in no hits.
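The processing steps just described can be sketched as a small pipeline. This is a hypothetical reconstruction for illustration only, not Stars' actual code; the stoplist, suffix list, and synonym table are invented:

```python
# Sketch of a Stars-style query normalizer (hypothetical reconstruction).
# The stoplist, suffix list, and synonym table below are illustrative only.

STOP_WORDS = {"a", "an", "of", "the", "to"}
SUFFIXES = ["ing", "ed", "es", "s"]          # crude truncation, not real stemming
SYNONYMS = {"error": ["fault"], "disk": ["drive"]}

def truncate(word):
    """Remove the first matching suffix, mimicking naive word truncation."""
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)]
    return word

def to_boolean(query):
    """Convert a natural language query into a Boolean search expression:
    drop noise words, truncate suffixes, OR in synonyms, AND the groups."""
    terms = [truncate(w) for w in query.lower().split() if w not in STOP_WORDS]
    groups = []
    for t in terms:
        alts = [t] + SYNONYMS.get(t, [])
        groups.append("(" + " OR ".join(alts) + ")" if len(alts) > 1 else t)
    return " AND ".join(groups)
```

For example, `to_boolean("printing of disk errors")` yields `print AND (disk OR drive) AND (error OR fault)`, showing how a single misplaced synonym or truncation would silently reshape the search.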

On the whole, this approach to information retrieval has worked well for specialists hunting down problem-solving information. However, the rapid growth in the size of databases has contributed to an increase in the number of false hits retrieved during searches, as well as an increase in the need to reformulate queries to guide the search. Particularly frustrating were cases in which a specialist knew an article existed but couldn't easily craft a query that would retrieve it. Problems like this led Digital's Stars development group to contact us at the company's AI Technology Center to explore the applicability of AI technologies to this critical component of customer service and support. Intrigued by the possibilities for exploiting expert


systems, intelligent user/system dialogues, blackboard-based integration of knowledge sources, and natural language processing, we eagerly accepted the challenge and christened our collaboration the AI-Stars project.

Interviewing the users

Information retrieval research in the 1980s had already pursued a number of AI-influenced approaches. The Rubric system, for example, cast queries as rules for building up evidence about a document's relevance.[1] I3R used a blackboard architecture to orchestrate a dialogue with the user, elaborate the user's information needs, and find appropriate documents.[2] Natural language processing was used to automate the role of an expert intermediary in interactive document retrieval.[3] Statistical methods had also emerged as attractive alternatives to Boolean and controlled vocabulary approaches; for example, Wide Area Information Servers[4] drew on the vector space model, in which queries and documents are represented as weighted term vectors, allowing for the computation of statistical similarity metrics between them.[5] Such systems work best with long queries, and it is possible to use the text of one article as a query to find others similar to it.

One of the conclusions researchers had come to after exploring this wide range of approaches was that no one retrieval method is likely to meet the needs of all users.[6] Not only do the methods tend to retrieve different sets of relevant documents, but users themselves come with different task orientations, interaction preferences, skill levels, and so on. We therefore began our AI-Stars investigation by interviewing members of our user population and watching them as they performed their daily work of computer troubleshooting.

We discovered many different styles of on-line searching. Some specialists liked to enter a broad (one- or two-word) query and then, based on the size of the resulting set, further restrict the search expression. Others started with longer queries of perhaps half a dozen terms and removed or added related terms if the initial query appeared to miss its mark. Users tended to understand that their natural language expression was transformed into a Boolean expression and, on some occasions, would resort to using optional explicit Boolean

operators, including negation. The object of their task, to respond as quickly as possible to a specific customer inquiry, meant that they were primarily interested in "precision" searching: finding the one article that addressed a particular problem, as opposed to all the articles that might be even peripherally related.

For the most part, users reported liking the Stars system and appeared to be effective at using it. However, their confidence was shaken when they had trouble locating an article that they knew to exist; and they

WE FELT OBLIGED TO EXPLORE SOLUTIONS THAT WOULD MINIMIZE THE USER'S NEED TO DO ANYTHING MORE THAN TYPE IN A SIMPLE NATURAL LANGUAGE QUERY.

had little way of knowing, when a search failed, whether that meant there was no article that met their need or whether further searching was likely to be fruitful. Users also reported some trouble coming to grips with the internal heuristics used by Stars to convert natural language queries into search expressions. Indeed, each of Stars' internal query-processing strategies could in some cases lead to poor results and confuse users:

- Stemming sometimes reduced unrelated words to the same stem.
- Some apparent noise words were in fact content words in certain contexts.
- Automatic use of synonyms sometimes broadened the query in inappropriate ways.
- Users intended certain combinations of words in their queries to be contiguous phrases rather than independent search terms.
- The strategy for automatically broadening a query with no hits did not always do the "right" thing.
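The first pitfall is easy to demonstrate with a crude suffix stripper (the rules below are invented for illustration and are not Stars' stemmer): two unrelated words can collapse to the same index key.

```python
# Illustration of over-stemming: a crude suffix stripper conflates
# unrelated words to the same stem. Rules are illustrative only.

def crude_stem(word):
    """Strip the first matching suffix if enough of the word remains."""
    for suf in ("ation", "ing", "er", "s"):
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

# "wander" (to roam) and "wands" (magic wands) are unrelated,
# yet both reduce to the stem "wand" and would match each other's articles.
```

A query about a "wandering" process would thus retrieve articles mentioning "wands," exactly the kind of false hit the specialists complained about.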

Perhaps reacting to these kinds of experiences, the users we interviewed were wary of many of the "enhancements" we initially proposed. Intelligent automated query reformulation and statistical nearest-neighbor matching methods were perceived as potentially making the system even more difficult to understand. They argued that users should be able to readily comprehend and easily override any "intelligence" built into the system. Furthermore, their practical need to perform searches quickly argued against a user interface that engaged in interactive dialogues to elicit long and "accurate" statements of users' information needs. They liked the convenience of launching searches with short natural language queries.

Users did volunteer suggestions for improvements, however. They wanted accurate stemming, which behaved according to their linguistic intuitions and did not, for example, accidentally conflate technical terms with similar natural language words. They wanted the system to match phrases, so that "C library" would be interpreted as a single expression rather than the combination of two potentially noncontiguous terms. They wanted more control over the system's default behavior with respect to noise words, synonyms, and query broadening. They wanted the system to recognize certain expressions such as version numbers, so as to retrieve articles containing variant forms or specializations (for example, a query containing "V5.4" should match an article containing "version 5.4-1a"). Help in reformulating queries would be valuable as well, so long as the user remained firmly in control of reformulation.
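The version-number request can be sketched with a small pattern generator. The regular expression below is my own illustration of the idea (matching variant forms and trailing specializations), not the recognizer AI-Stars actually used:

```python
import re

def version_pattern(query_term):
    """Build a regex matching variant forms and specializations of a
    version number, so 'V5.4' also matches 'version 5.4-1a'.
    Illustrative sketch, not the AI-Stars special-expression grammar."""
    m = re.fullmatch(r"[vV](\d+(?:\.\d+)*)", query_term)
    if not m:
        return None
    nums = re.escape(m.group(1))
    # accept 'V5.4', 'v5.4', or 'version 5.4', plus '-1a'-style suffixes
    return re.compile(r"(?:[vV]|[vV]ersion\s+)" + nums + r"(?:[-.][0-9a-zA-Z]+)*")

pat = version_pattern("V5.4")
```

With this pattern, a query term "V5.4" would hit an article containing "version 5.4-1a" but not one mentioning only "version 5.5".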

These interviews dampened our enthusiasm for a number of possible AI approaches; we concluded that our initial efforts should focus on improving the linguistic performance of the existing Stars system. While some of the users' requested functionality could be achieved through extensions to the user query language, such as proximity operators and wildcarding, we felt obliged to explore solutions that would minimize the user's need to do anything more than type in a simple natural language query. We therefore decided to evaluate the feasibility of incorporating some of the basic building blocks of a standard natural language processing system, namely a computational lexicon, morphological analyzer, and parser.

IEEE EXPERT



Developing a prototype

To implement our first prototype, we adapted a set of tools we had developed for a machine-aided translation system. This included a 17,000-word English dictionary containing information about each word's grammatical category and stem forms. Its morphological analyzer contained a set of rules encoding English orthography and inflectional suffixation. Given a string, it proposed a set of potential stems that could then be verified against the lexicon. The parser was a simple bottom-up chart parser,[7] which, given a set of syntactic rules and an English string, generated a table, or chart, by applying the grammar to the tokens in the input string and storing all the well-formed substrings. The chart can be thought of as a two-dimensional array, in which the input tokens span single columns along the topmost row. Larger constituents recognized by the parser span multiple columns directly below the tokens or phrases from which they are composed. For natural language processing, this data structure conveniently represents competing ambiguous interpretations as edges in different rows that span the same columns.
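The chart described above can be sketched as a set of labeled column spans. This is a minimal illustrative data structure, not the AI-Stars implementation:

```python
# Minimal sketch of a parse chart as labeled column spans.
# Illustrative only; the real chart parser applied grammar rules.

class Chart:
    def __init__(self, tokens):
        self.tokens = tokens
        # one edge per token: (start_col, end_col, label)
        self.edges = [(i, i + 1, tok) for i, tok in enumerate(tokens)]

    def add_edge(self, start, end, label):
        """Record a larger constituent spanning columns [start, end)."""
        self.edges.append((start, end, label))

    def spanning(self, start, end):
        """All interpretations covering exactly this span; competing
        ambiguous analyses show up as multiple labels here."""
        return [lab for s, e, lab in self.edges if s == start and e == end]

chart = Chart(["C", "library", "error"])
chart.add_edge(0, 2, "C library")   # a phrase constituent over columns 0-1
```

Here the phrase edge "C library" sits below the token edges for "C" and "library," spanning both of their columns, just as the article describes.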

Our original intent was to use the chart parser to recognize potential noun phrases in user queries. However, many user queries consist simply of a string of nouns, without punctuation or function words to indicate noun phrase boundaries. We decided that a purely syntactic approach to recognizing noun phrases in user queries was impractical, since any sequence of two or more nouns in a query would be interpreted as a potential noun phrase. Instead, we adopted the strategy of storing useful noun phrases in the lexicon and constraining the parser to recognize only those phrases. The parser was also used to recognize special expressions, such as variants of operating systems' version numbers.

One common technique in implementing an efficient text retrieval system is to construct inverted indexes for descriptors, facilitating rapid access to the objects associated with each descriptor.[8] For our prototype, we chose to index articles by the citation forms of the words and phrases contained in them (that is, the string you would typically use to look up a word in a traditional dictionary). This required running a morphological and phrasal analysis


Figure 1. The query input window and query reformulation workspace.

over the entire text base at load time, but had the effect of reducing all inflectional variants of a word or phrase to a single index entry. By indexing on the citation form, we were also collapsing across the word's part of speech. That is, our index did not distinguish between "program" as a noun and as a verb, for example. However, since noun and verb senses of English words are often semantically related, we felt that any attempt to disambiguate words into grammatical categories at index time would have little tangible benefit for subsequent information retrieval.
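Load-time indexing by citation form can be sketched as follows. The citation-form lookup table and sample articles are invented for illustration; the real system derived citation forms with its morphological analyzer:

```python
# Sketch of load-time indexing by citation form. The lookup table and
# articles are illustrative; real analysis used a morphological analyzer.

CITATION_FORM = {"programs": "program", "programming": "program",
                 "crashed": "crash", "crashes": "crash"}

def citation(word):
    """Reduce a surface form to its citation (dictionary headword) form."""
    return CITATION_FORM.get(word.lower(), word.lower())

def build_index(articles):
    """Map each citation form to the set of article IDs containing it,
    so all inflectional variants share a single index entry."""
    index = {}
    for art_id, text in articles.items():
        for word in text.split():
            index.setdefault(citation(word), set()).add(art_id)
    return index

index = build_index({1: "programs crashed", 2: "programming guide"})
```

Note how "programs" and "programming" fall under one entry, "program," collapsing across both inflection and part of speech, as the text describes.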

At query time, the AI-Stars prototype parsed the user's query, as described above, to construct a chart containing the citation forms of words, phrases, and special expressions. It then applied heuristics to determine which of the identified terms should be included in the actual search expression to be matched against the article index. Function words like "the" and "of" were marked inactive. Likewise, any term whose chart column span was exceeded by the span of another term was labeled inactive, thus eliminating any individual terms that were part of a larger phrase. The prototype then applied an algorithm to the chart to construct a Boolean query from the remaining active terms. Roughly speaking, the algorithm ORed together terms that appeared in the same column on the chart, and then ANDed together these disjuncts.[9] AI-Stars ran the resulting query expression against the database index, and displayed the size of the matching article set to the user, who could then examine the list or modify the query and rerun it.
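The OR-within-column, AND-across-columns rule can be sketched directly. The tile representation (term, start column) is a simplification of the chart for illustration:

```python
# Sketch of the chart-to-Boolean rule: OR together active terms that
# share a column, then AND the resulting disjunctions. The flat
# (term, start_column) tile representation is a simplification.

def chart_to_boolean(active_tiles):
    """active_tiles: list of (term, start_column) pairs."""
    columns = {}
    for term, col in active_tiles:
        columns.setdefault(col, []).append(term)
    disjuncts = []
    for col in sorted(columns):
        terms = columns[col]
        disjuncts.append("(" + " OR ".join(terms) + ")" if len(terms) > 1
                         else terms[0])
    return " AND ".join(disjuncts)
```

For example, tiles for BACKUP, saveset, and two variants of a version number in the same column yield `BACKUP AND saveset AND (v5.0 OR version 5.0)`.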

Reformulating queries

Adding linguistic components helped correct a number of shortcomings in the existing Stars engine, but we still had to address our users' second directive: to make the system's search behavior easier to interrogate and manipulate. Our recent experience with generating Boolean expressions from the underlying 2D chart data structure led us to craft a visual representation of the chart to serve as a user window into the internals of the Stars query interpreter. Figure 1 shows the visual display we developed; specifically, the query input window with the query reformulation workspace. Each term in the query is depicted as a tile in a 2D layout, with active terms in reverse video to visually distinguish them

DECEMBER 1993



Figure 2. The thesaurus window, query input window, and modified query reformulation workspace.

from inactive terms. Each tile displays the term's citation form and the number of article postings for that term. Through the display, the results of all the default behaviors of the query interpreter are available at a glance: the morphological analysis, phrase matching, special expression recognition, and selection of active terms. The article postings provide feedback on the discriminative role played by each term in the query. For most configurations of the tiles, even the Boolean interpretation is visually manifest; users can apply the rule of thumb that tiles in the same column are ORed and those in different columns are ANDed. The tiles in Figure 1 can thus be interpreted as the corresponding Boolean query.

But the visual display is not only a window into the system's internals. It also serves as a direct manipulation workspace for performing iterative query refinement:

- Clicking on a tile changes it from active to inactive, and vice versa.
- Dragging a tile to a new location changes its Boolean relationship with the other tiles.
- Stretching a tile across multiple columns effectively ORs the term with the AND of terms in the columns it spans.
- Selecting a tile option can pull up a thesaurus window for that term (see Figure 2).

The thesaurus provides a way to access related terms that can help broaden or narrow a search. To minimize user effort in adding thesaurus terms to the query, we divided the window into four categories of relationships:

- Selecting a phrase substitutes the phrase for the selected term in the workspace: it is placed as an active tile in the same column(s) as the previously selected workspace term, which is made inactive.
- Selecting a synonym from the thesaurus window broadens the query by adding the item as an active tile to the same column(s) as the previously selected workspace term, without changing the term's activation.
- Selecting a conceptually related term narrows the query by appending the item as an active tile after the last occupied column of the chart (visually establishing a new column).
- Selecting a compound term has the same effect as adding a synonym. Compound terms are superstrings of the term that contain at least one nonalphabetic character. They are common in computer science domains, where many terms, such as error messages and facility names, are composed of mnemonic strings embedded in longer strings, using nonalphabetic characters as delimiters.

Selected thesaurus terms are automatically placed on the chart in a position (relative to



the initial term) that makes sense for their relationship to that term, so the user does not have to decide where to place the terms.

Since each default thesaurus operation has an immediate visual manifestation, the user can adjust the query if the default is not what is desired. Figure 2 shows how the query reformulation workspace from Figure 1 would look after clicking on the tiles for "copy," "BACKUP," and "saveset," and adding the phrase "scratch tape" from the thesaurus. This new tile configuration would translate into the query

(("BACKUP" AND "saveset") OR "BACKUP saveset") AND "scratch tape" AND ("v5.0" OR "version 5.0")

In lieu of interacting directly with the thesaurus, a user can opt to have prespecified categories of related terms (such as synonyms) added to the query automatically. In addition, other terms that might be useful in query reformulation can be included in the display as inactive tiles. In Figure 2, for example, an inactive tile has been added by default to generalize the version number ("version 5.0" → "version 5"). In this way, the ability to broaden the query according to this useful domain-specific generalization is only a mouse click away.

By making it easy to iteratively permute and rerun a query, we were hoping to overcome the traditional brittleness of all-or-none responses to single Boolean queries, without sacrificing the almost "surgical" precision of the Boolean model. Likewise, we believed that giving the user a convenient way to grasp how the various component terms contribute to the result would improve the user's sense of whether to continue trying to refine a search or to assume that no relevant article exists.

From prototype to product

Evaluating an information retrieval system has always been problematic.[10] Accurately calculating precision and recall for a given document collection and a given set of queries requires a full understanding of the collection's contents and some way to determine the subjective relevance of each document to each query. Furthermore, different task orientations, user backgrounds, and the nature of the information stored in the collection might need to be considered.

Given the additional problem that our prospective users were always too busy to extensively test a prototype, we had to make do with a series of short demonstrations followed by informal user evaluations. In general, users were pleased with the linguistic enhancements underlying the query interpreter, especially the handling of phrases and special expressions. Initial reactions to our slide presentation about the query reformulation workspace ranged widely; some users were intrigued and enthusiastic, others were baffled and skeptical. While we do

THERE WAS NO WAY TO UPDATE THE ARTICLE INDEXES WITHOUT REAPPLYING THE ANALYZER TO ALL THE TEXTS, A TIME-CONSUMING PROCESS.

not know how much of the latter response was due to a general unfamiliarity with direct-manipulation graphical user interfaces, we were pleased to find that most users interviewed, including some of the initially most dubious critics, felt at ease with the workspace after about five minutes of supervised interaction.

Encouraged by this feedback, the Stars development team decided to include the workspace in the next major release of the Stars product. Also slated for inclusion were a number of major architectural changes and functional enhancements, many designed specifically to support the customer support centers. The centers wanted to evolve their corporate network of loosely coupled Stars databases into a true worldwide distributed database, from which users could retrieve any piece of information in an identical manner from any site. New features included full client/server computing, user-defined semistructured information objects, data replication, hyperinformation linking, and logical database partitioning. As we prepared to reimplement the linguistic features we had prototyped, we had to consider how they would integrate into this new architecture. In


addition, our experience implementing the prototype had made us aware of several weaknesses in our original design.

A dynamic lexicon. One of the first problems we encountered was the need to support a dynamic rather than static lexicon. The constant influx of new textual information from a variety of sources, including corporate publications, internal bulletin boards, and external support documentation, guaranteed that lexical knowledge (the words and phrases that comprise the application's working vocabulary) would also remain highly dynamic. Indeed, each new computing product or technology is likely to introduce a wealth of new phrases, endow existing words with new meanings or new grammatical categories (for instance, a noun might be adopted as a verb), and even coin altogether novel words (such as "widget" and "laptop").

Compared to a static lexicon, an evolving lexicon has significant implications for any information retrieval system. Since we cannot assume that all required vocabulary is known before the first article is loaded into the system, the system must include some mechanism for ensuring that the lexicon and the article database always remain synchronized. Not only must the lexicon be updated if new vocabulary is used in an article, but, if article index keys are based on linguistic analysis, then reindexing might be necessary whenever the linguistic information changes. Secondly, a dynamic lexicon is likely to require many of the same administrative functions that apply to a dynamic textual database, such as data distribution, security, activity logging, data subsetting, transactional consistency, and so on. We were therefore interested in exploring whether an information retrieval system could itself be an effective repository of lexical information, allowing the lexicon to share all the administrative facilities already provided.

As we began to add new words and phrases to our prototype's lexicon, we were forced to confront the disadvantage of indexing articles by their citation forms: There was no way to update the article indexes without reapplying the parser/morphological analyzer to all the texts, a time-consuming process. What was worse, a new article might be added while the previous index transaction was still in progress



and, conversely, new lexicon entries might be added while an article index update was already in progress to accommodate previous lexicon modifications. We did not want to block either lexicon or article updates, so we initially developed a scheme that time-stamped the updates. This facilitated the coordination of lexicon and index updates, and gave users the illusion of a consistent system by making new lexicon entries "invisible" until all article reindexing had completed. From a user's perspective, however, it made no sense that, after updating the lexicon, it took some indeterminate amount of time before the update was reflected in the system's search behavior. Therefore, we began to look into other possibilities.

We recognized that if we indexed on surface forms (the strings just as they appear in the text) in addition to citation forms, we could take advantage of a much more efficient scheme for updating article indexes. That is, whenever a new word was to be added to the lexicon, we could use our morphological generator to generate a list of all the inflectional variants of the word. We could then do a query on the OR of those terms, and the resulting article set would be the precise set that should be indexed by the newly added citation form.
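This incremental update scheme can be sketched as follows. The variant generator and index data are simplistic stand-ins for illustration; the real system used its morphological generator:

```python
# Sketch of incremental index update for a newly added word: generate
# its inflectional variants, OR together their posting sets from the
# surface-form index, and record that set under the new citation form.
# The variant generator and data below are simplistic, for illustration.

def inflections(stem):
    """Crude stand-in for a morphological generator."""
    return {stem, stem + "s", stem + "ed", stem + "ing"}

def index_new_word(stem, surface_index, citation_index):
    """The union of the variants' posting sets is exactly the article
    set the new citation form should index -- no reparsing required."""
    postings = set()
    for variant in inflections(stem):
        postings |= surface_index.get(variant, set())
    citation_index[stem] = postings
    return postings

surface_index = {"reboot": {1}, "reboots": {2}, "rebooted": {2, 3}}
citation_index = {}
index_new_word("reboot", surface_index, citation_index)
```

Adding "reboot" to the lexicon thus indexes articles 1, 2, and 3 under the new citation form without touching the article texts.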

Likewise, for phrases, we could do a query on the AND of the phrase components to identify the articles that potentially contained the phrase. However, without information specifying the locations of phrase components within the articles, it was impossible to compute the precise set without reparsing each article in the set to check for a confirmed occurrence of the phrase. We thus chose to build a concordance index, which contains not only the set of articles that contain each word but also the location (token number) of each occurrence of the word. Word adjacency can be tested by comparing token numbers in the index, enabling AI-Stars to find phrases in articles without reparsing.
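The adjacency test over a concordance index can be sketched as follows, with illustrative sample data:

```python
# Sketch of phrase detection over a concordance index that records each
# word's token positions per article. Sample data is illustrative only.

concordance = {
    "scratch": {7: [3, 19]},          # article 7, token numbers 3 and 19
    "tape":    {7: [4], 9: [12]},     # articles 7 and 9
}

def articles_with_phrase(words):
    """An article contains the phrase iff some occurrence of each word
    sits at consecutive token numbers -- no reparsing needed."""
    first, rest = words[0], words[1:]
    result = set()
    for art, positions in concordance.get(first, {}).items():
        for pos in positions:
            if all(pos + k + 1 in concordance.get(w, {}).get(art, [])
                   for k, w in enumerate(rest)):
                result.add(art)
                break
    return result
```

Here "scratch tape" is confirmed only in article 7 (tokens 3 and 4 are adjacent); article 9 contains "tape" but not the phrase, so the AND of the components alone would have over-retrieved.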

We also decided to try indexing only on surface strings, and using query-time morphological analysis and generation as well as phrase detection (via concordance adjacency checking) to simulate the matching of articles by citation forms of words and phrases. This approach has the architectural advantage of functionally isolating linguistic processing from article indexing. If indexes no longer depend on linguistic


processing, there is no longer a need to coordinate updates of the lexicon with updates to the article indexes. As soon as a new word or phrase is added to the lexicon, it can be used in a user query to access articles.

The disadvantage of this approach is that it moves more of the computational burden to query time. However, we have found that the major performance cost (the disk I/Os required to fetch the index entries for each inflectional variant) can be effectively mitigated by using a B-tree index that sorts

IF INDEXES NO LONGER DEPEND ON LINGUISTIC PROCESSING, THERE IS NO LONGER A NEED TO COORDINATE UPDATES OF THE LEXICON WITH UPDATES TO THE ARTICLE INDEXES.

the surface string keys according to a case-insensitive alphabetical sort. Since most inflectional variants of a word (in English) share the same prefix string, they will tend to cluster on the same disk block, making it likely that successive disk fetches will not be required to access the index entries for a set of inflectional variants.

Implementing a lexicon on top of an information retrieval engine. As mentioned earlier, treating the lexicon as a dynamic rather than a static data structure requires support for the full range of activities implied by a dynamic database. These include not only Add, Modify, and Delete functions, but also a host of other features that might be required in the real-world environment in which the lexicon must operate. Digital's customer support lexicon is managed by a dedicated group of database administrators, whose responsibility also includes administrative oversight of the contents of the on-line textual databases. Activity logging and reporting, security, and data organization, subsetting, and distribution are some of the functions that not only must be provided for the text database but are desirable for lexical data as well.

We initially prototyped the AI-Stars lexicon on top of a standard relational database management system. However, as we added features and extended the system into a wide-area distributed client/server implementation, we began to explore the possibility of layering the natural language processing functions directly on top of the AI-Stars Collection Services, the system's storage and indexing facilities. Several aspects of the system made this particularly attractive:

• In our distributed environment, Collection Services provide for the replication of document collections at multiple network sites, for both performance and reliability (availability) reasons. Updates to document collections are automatically propagated to all replication sites. By using the same facilities for lexical data, lexicon updates can be handled in an identical fashion.

• Collection Services support the construction of virtual databases (so-called derived collections), defined by applying a query filter to the union of one or more other collections. For textual databases, this provides a mechanism to logically categorize the data in multiple ways, as well as to provide restricted views of data for security reasons. For lexical data, this capability can be used to maintain sublanguage vocabularies or vocabularies involving security restrictions. Collections also provide a natural vehicle for providing separate lexicons for different natural languages.

• Lexical entries are semistructured objects. They can include free text data, such as definitions and usage examples. Layering the lexicon on the information retrieval system lets users query lexical data in exactly the same manner as any other semistructured information object in the AI-Stars database.
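A derived collection can be pictured as a query filter applied lazily, at query time, to the union of its base collections. The following sketch is purely illustrative; the class and field names are our assumptions, not the Collection Services API:

```python
# Illustrative sketch of "derived collections": a virtual collection defined
# by a query filter over the union of base collections. All names here are
# hypothetical stand-ins for the Collection Services facilities described.
class Collection:
    def __init__(self, docs):             # docs: list of dicts (semistructured)
        self.docs = docs
    def all(self):
        return list(self.docs)

class DerivedCollection:
    """Virtual database: a predicate applied to the union of other collections."""
    def __init__(self, bases, predicate):
        self.bases = bases
        self.predicate = predicate
    def all(self):                        # evaluated lazily at query time
        return [d for b in self.bases for d in b.all() if self.predicate(d)]

vms = Collection([{"id": 1, "lang": "en", "secure": False},
                  {"id": 2, "lang": "en", "secure": True}])
unix = Collection([{"id": 3, "lang": "fr", "secure": False}])

# A restricted view for security: only non-secure articles from both bases.
public = DerivedCollection([vms, unix], lambda d: not d["secure"])
print([d["id"] for d in public.all()])    # -> [1, 3]
```

The same mechanism, pointed at lexical entries instead of articles, would yield a sublanguage vocabulary or a per-language lexicon as just another filtered view.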

IEEE EXPERT

In spite of these important advantages, it was not clear to us at first whether performance (again, measured in disk I/Os) would be competitive with other approaches. Retrieving individual words is intrinsically less efficient in a concordance-based inverted indexing system, since each stem lookup retrieves not the set of words but a set of object identification tags that must in turn be used to access the word objects. Fortunately, however, this indirection (the use of intermediate sets of object IDs) can be exploited to recognize phrases. Rather than use an on-disk discrimination net, for example, we can perform phrase recognition using set intersection over the concordance index, testing whether contiguous words in a query are members of the same phrase (or phrases). Since linguistic analysis at query time must include both word and phrase recognition, the query-time cost of using our concordance-indexing scheme turns out, on the whole, to be equivalent to that of alternative approaches. We therefore chose to reimplement the lexicon on top of the AI-Stars Collection Services.
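A hypothetical miniature of the set-intersection test: each word's concordance entry lists the IDs of the phrase objects it appears in, and contiguous query words that share an ID belong to the same phrase. The data and names are illustrative, not the Stars index layout:

```python
# Phrase recognition by set intersection over a concordance index,
# instead of an on-disk discrimination net. Each word's index entry
# holds the IDs of the phrase objects it participates in.
phrase_objects = {101: "load host", 102: "patch file", 103: "server node"}
concordance = {
    "load":   {101},
    "host":   {101},
    "patch":  {102},
    "file":   {102},
    "server": {103},
    "node":   {103},
}

def recognize_phrases(query_words):
    """Test whether contiguous query words are members of the same phrase."""
    found = []
    for w1, w2 in zip(query_words, query_words[1:]):
        shared = concordance.get(w1, set()) & concordance.get(w2, set())
        found.extend(phrase_objects[pid] for pid in shared)
    return found

print(recognize_phrases(["copy", "patch", "file", "to", "load", "host"]))
# -> ['patch file', 'load host']
```

Because the intersection reuses the ID sets the word lookup has already fetched, phrase recognition adds little beyond the I/O the query must pay anyway.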

Extending the query reformulation workspace. Another piece of code that underwent significant redesign was the query reformulation workspace. We recognized that the workspace interface, while originally designed for manipulating natural language queries, generalized nicely to the case of structured queries, which range over the values of structured data fields, such as author and date. For manipulating structured data, we needed to extend the chart workspace functionality to allow for the nesting of subcharts. That is, we allow a tile to correspond to either a value term, such as a word in a natural language query, or a full <field relational-operator value> restriction expression.

A user can expand any tile of the second type into its own chart window to manipulate its value content. But in addition, the Boolean relationships among multiple restriction expressions are now available for direct manipulation in the top-level query reformulation workspace. This structured-query interface is illustrated in Figure 3, in which the tile representing the restriction for the All Text field has been expanded in its own subquery window. The user can manipulate the tiles in either window (or both) and then rerun the query.

When other engineering groups expressed interest in experimenting with this interface for other applications, we decided to repackage the operations on the chart data structure, which originally had been tied directly to the natural language parser. As our chart now had at least two clients (the parser and the direct manipulation interface), we decided to implement it as an abstract data type with its own callable interface. The abstract interface supports the creation, movement, and deletion of tiles. The operational semantics of tiles is implemented via callbacks associated with annotations on the tiles. A chart is itself defined as a subtype of tile, thereby allowing charts to be nested (as needed for the structured query interface described earlier).

Figure 3. The structured query interface for a query containing multiple restrictions. The subquery window shows the workspace for the value of the All Text field.
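In outline, the tile/chart subtype relationship and the callback annotations might look like the sketch below. Every class, method, and annotation name is an assumption for illustration; the article does not give the actual callable interface:

```python
# Sketch of the chart abstract data type: tiles carry callback annotations
# (their operational semantics), and Chart is a subtype of Tile so charts
# can nest, as the structured-query interface requires.
class Tile:
    def __init__(self, label, on_select=None):
        self.label = label
        self.annotations = {"on_select": on_select}  # semantics via callbacks
    def select(self):
        cb = self.annotations.get("on_select")
        return cb(self) if cb else None

class Chart(Tile):            # a chart is itself a tile, so charts nest
    def __init__(self, label):
        super().__init__(label, on_select=lambda t: t.expand())
        self.tiles = []
    def add(self, tile):
        self.tiles.append(tile)
        return tile
    def expand(self):         # open a subchart window (here: list contents)
        return [t.label for t in self.tiles]

query = Chart("query")
query.add(Tile("crash"))
sub = query.add(Chart("All Text"))   # a nested restriction subchart
sub.add(Tile("io$m-abort"))
print(query.expand())        # -> ['crash', 'All Text']
print(sub.select())          # -> ['io$m-abort']
```

Making the chart an abstract data type with callback-driven semantics is what lets the parser and the direct-manipulation interface share one structure without knowing about each other.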

Using the morphological analyzer for technical terms. Among the most useful terms in some computer troubleshooting queries are technical symbols such as io$m-abort and sys$unwind. These terms, identifying such things as function library entry points and error messages, tend to have a fixed number of systematic syntactic variants, such as io$v-abort, $unwind-s, $unwind_g, and so on. Just as with inflectional variants of English words, users typically want to match on all variants; they also tend to be inconsistent about which variant they use in their queries. We have therefore found it convenient to treat such terms as we do inflectional variants, adding rules to our morphological analyzer/generator to cover technical symbols that undergo regular patterns of variation.
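A variant-generation rule for such symbols might be written as follows. The particular pattern and suffix set are our illustrative assumptions, not the production rule set used in AI-Stars:

```python
import re

# Hypothetical variant-generation rule for VMS-style technical symbols,
# in the spirit of the morphological rules described in the text.
# The infix letters and suffixes below are assumptions for illustration.
def symbol_variants(symbol):
    """Expand a symbol like io$m-abort into its systematic variants so that
    a query on any one form can match the others."""
    variants = {symbol}
    m = re.match(r"(\w+)\$(?:([a-z])-)?(\w+)$", symbol)
    if not m:
        return variants
    prefix, _infix, stem = m.groups()
    for i in ("m", "v"):                    # infix alternation: io$m-, io$v-
        variants.add(f"{prefix}${i}-{stem}")
    for suffix in ("_s", "_g"):             # calling-standard suffixes
        variants.add(f"${stem}{suffix}")
    return variants

print(sorted(symbol_variants("io$m-abort")))
# -> ['$abort_g', '$abort_s', 'io$m-abort', 'io$v-abort']
```

At query (or generation) time, any one surface form expands to the full set, mirroring how inflectional variants of ordinary words are matched.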

Populating the lexicon. Given the requirement for a dynamic lexicon, we designed an interactive interface to make it easy for database administrators to add new vocabulary, phrases, and thesaurus entries. However, manually identifying the thousands of new words and phrases that appear over time in on-line technical databases would be economically infeasible. We have therefore constructed a number of tools, based on techniques developed in the field of corpus linguistics,11 to assist in the (semi)automatic population of our lexical knowledge base.

With each word in the lexicon, we store its uninflected form, its part of speech, inflectional paradigm, and orthographic properties, such as whether its final consonant gets doubled. Based on our assumption that in a large enough text corpus, words that inflect are likely to appear in several of their inflected forms, we developed the following algorithm for extracting information about unknown words in a corpus:

(1) For each surface string in the corpus, test whether it is already in the lexicon, using morphological analysis. Omit known surface strings from further consideration.

(2) Since we generally do not need to store inflectional information for proper nouns in the lexicon, eliminate them from the data using the simple heuristic that proper nouns are usually capitalized in English.

Figure 4. Examples of acquired phrases:
("VMS system" ((S "VMS") (L "system" noun)))
("LTA device" ((S "LTA") (L "device" noun)))
("load host" ((S "load") (L "host" noun)))
("patch file" ((S "patch") (L "file" noun)))
("server node" ((S "server") (L "node" noun)))

(3) Use the morphological guesser to generate a set of candidate lexical entries for each of the remaining (unknown and noncapitalized) strings. Apply abductive reasoning to choose a "best guess" explanation for each unknown surface form in the corpus. That is, choose the lexical entry that accounts for the greatest number of different surface forms that are actually found in the text. For example, if we found both "instantiate" and "instantiates" in a corpus, the algorithm would propose that "instantiate" is either the singular form of a noun whose plural is "instantiates" or the infinitive of a verb whose third person singular form is "instantiates." However, if it finds "instantiated" in the text as well, then it prefers the verb interpretation, as this interpretation would account for three unknown strings in the corpus, whereas the noun explanation covers only two. Many English verbs are also nouns, so it is useful to supplement morphological data with syntactic collocational data to test whether a word is a noun in addition to a verb. A simple heuristic for this is to test for the appearance of a determiner (for example, "the" or "a") directly before the potential noun somewhere in the corpus.

(4) Use heuristics to break ties. For words that have more than one candidate explanation with the same degree of corpus coverage, or for words that do not appear to have any inflectional variants in the corpus, we apply a set of suffix heuristics. The suffix -ly typically indicates an adverb, whereas -ion indicates a noun. For words for which no suffix heuristic applies, we classify the word as a noun by default.
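Steps 3 and 4 can be sketched as a scoring procedure over candidate paradigms. The paradigm generator below is a deliberately simplified assumption (real paradigms, with consonant doubling and the like, are richer), and the function names are ours:

```python
# Sketch of the abductive "best guess" classifier (steps 3 and 4): each
# candidate lexical entry is scored by how many distinct surface forms
# in the corpus it would account for; ties fall to suffix heuristics.
def candidate_entries(stem):
    """Toy morphological guesser: hypothesize a noun and a verb paradigm."""
    root = stem[:-1] if stem.endswith("e") else stem   # drop final e for -ed/-ing
    return {
        ("noun", stem): {stem, stem + "s"},
        ("verb", stem): {stem, stem + "s", root + "ed", root + "ing"},
    }

def best_guess(stem, corpus_forms):
    scored = []
    for (pos, lemma), paradigm in candidate_entries(stem).items():
        covered = paradigm & corpus_forms          # surface forms explained
        scored.append((len(covered), pos, lemma))
    scored.sort(reverse=True)
    best_count = scored[0][0]
    tied = [s for s in scored if s[0] == best_count]
    if len(tied) > 1:                              # step 4: suffix heuristics
        if stem.endswith("ly"):
            return "adverb"
        if stem.endswith("ion"):
            return "noun"
        return "noun"                              # default classification
    return tied[0][1]

corpus = {"instantiate", "instantiates", "instantiated"}
print(best_guess("instantiate", corpus))           # -> verb
```

With only "instantiate" and "instantiates" in the corpus, both paradigms explain two forms, the tie-breaker fires, and the word defaults to a noun, exactly the behavior the example in step 3 describes.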

In English, a language with relatively little morphological inflection, the vast majority of new unknown words are nouns, and this classification is often made on the basis of the default case in step 4. However, we have found this algorithm particularly useful for languages such as French, in which a great deal of morphological inflection exists. We are currently experimenting with it to populate a French lexicon from scratch.

Extracting phrases. Even with the best parsers, syntactic analysis of large text corpora can be computationally expensive, slow, and error prone. Given that our goal was to extract possible noun phrases in an expedient manner, we eschewed full parsing in favor of a more streamlined, if less accurate, approach. We apply a three-pass algorithm that

(1) uses an ordered set of local syntactic context rules (along with the morphological analyzer) to heuristically tag each word in the corpus with a single unambiguous part of speech,

(2) uses a finite-state network that encodes a set of noun phrase recognition rules to identify certain sequences of parts of speech as potential noun phrases, and

(3) applies a filter to the result of step 2 to remove a known set of false alarms. These include candidate phrases that a person in the loop has already explicitly rejected.

This algorithm has been quite effective, generating more than 15,000 noun phrases from a 100-Mbyte corpus, with an acceptably low percentage of false alarms (Figure 4 gives examples of acquired phrases). Most false alarms are due to errors made in part-of-speech tagging in step 1. These knowledge acquisition procedures make it possible to bring up new databases for use in Stars without an excessive burden of hand-tailoring the lexicon. Since the typical help desk environment simply does not have the resources for manual administration of linguistic databases, the ability to automate much of the acquisition process is critical. One area where manual intervention is still largely required is in specifying thesaurus relationships among words. While we are pursuing research in automatically extracting thesaurus relationships from corpora, this is still in the very early stages.12 Fortunately, syntactically related words generally are also semantically related. Thus, we could use the database of noun phrases as a source for generating related terms. That is, words that appear together in phrasal relationships can be made available via the thesaurus as related terms.

THE NEW VERSION OF STARS is in the early stages of field testing at several of Digital's customer support centers. Users have indicated that, subjectively at least, the new search engine allows them to retrieve articles with greater confidence and precision. Because they are adjusting to a number of new facilities in the tool (including the switch from a character cell interface to a window-based one), many have not yet begun to take full advantage of the query reformulation facilities. Thus it is still premature to speculate how much more effective users will be once they are comfortable with the range of new options. We are continuing to alter the user interface as users request. We are also looking into the possibility of integrating other retrieval models, where appropriate. For example, if the user finds an article that is closely related to but that does not completely fulfill some information need, it might be useful to allow a vector space or probabilistic search to test for other similar articles in the database.

We view our present work as part of an evolutionary approach to incorporating even more sophisticated linguistic processing. For example, a system that tried word sense disambiguation, concept identification, or the generation of syntactic variants of larger phrasal units could also use a graphical device like the query reformulation workspace to make its inferences public and

malleable. At the same time, our experience integrating even modest amounts of natural language processing in an information retrieval system has made us aware of the subtle relationships that can exist between components. More sophisticated use of natural language processing in information retrieval will have to be accompanied by continued practical research in system architecture and performance.

Acknowledgments

The evolution of the ideas embodied in AI-Stars and its implementation have been a group effort. Jeffrey Robbins was the principal designer and implementor of the original Stars system. Past and current members of the AI-Stars team who have contributed to the work cited here include Bryan Alvey, Suzanne Artemieff, Jeff Brennen, Rex Flynn, David Hanssen, Jong Kim, Jim Moore, Mayank Prakash, and Clark Wright. I also wish to acknowledge Jim Miller and Ralph Swick of Digital's Cambridge Research Lab for their collaboration on the chart abstract data type, as well as the many database administrators and specialists who have offered their insights, critiques, and support.

References

1. R.M. Tong and D.G. Shapiro, "Experimental Investigations of Uncertainty in a Rule-Based System for Information Retrieval," Int'l J. Man-Machine Studies, Vol. 22, No. 3, 1985, pp. 265-282.

2. W.B. Croft and R.T. Thompson, "I3R: A New Approach to the Design of Document Retrieval Systems," J. Am. Soc. for Information Science, Vol. 38, 1987, pp. 389-404.

3. G. Guida and C. Tasso, "IR-NLI: An Expert Natural Language Interface to On-Line Databases," Proc. ACL Conf. Applied Natural Language Processing, Assoc. for Computational Linguistics, Morristown, N.J., 1983.

4. B. Kahle, "An Information System for Corporate Users: Wide-Area Information Servers," Tech. Report TMC-199, Thinking Machines, Cambridge, Mass.

5. G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

6. N.J. Belkin and W.B. Croft, "Retrieval Techniques," in Ann. Rev. Information Science and Technology, Vol. 22, Martha Williams, ed., Elsevier Science Publishers, New York, 1987, pp. 109-145.

7. M. Kay, "Algorithm Schemata and Data Structures in Syntactic Processing," Tech. Report CSL-80-12, Xerox Palo Alto Research Center, Palo Alto, Calif., 1980.

8. G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.

9. P.G. Anick et al., "A Direct Manipulation Interface for Boolean Information Retrieval."

10. Tech. Report TR91-1206, Dept. of Computer Science, Cornell Univ., Ithaca, N.Y., 1991.

11. Computational Linguistics, special issue on using large corpora, Vol. 19, No. 1, 1993.

12. J. Pustejovsky, S. Bergler, and P. Anick, "Lexical Semantic Techniques for Corpus Analysis," Computational Linguistics, Vol. 19, No. 2, 1993, pp. 331-358.


DECEMBER 1993

expert system interfaces, full-text information retrieval, and machine-aided translation. He holds an MS from the University of Wisconsin-Madison and is pursuing a PhD at Brandeis University. He is a member of the Association for Computational Linguistics and the ACM SIGIR. He can be reached at DEC, 111 Locke Dr., LM02-1/D12, Marlboro, MA 01752; e-mail, [email protected]
