aspects of swedish morphology and semantics from the

Aspects of Swedish Morphology and Semantics from the Perspective of Mono- and Cross-language Information

Retrieval.

Turid Hedlund*, Ari Pirkola and Kalervo Järvelin Dept. of Information Studies

University of Tampere P.O. Box 607

FIN-33101 Tampere, Finland

*and Swedish School of Economics and Business Administration Library

P.O. Box 479 FIN-00101 Helsinki, Finland

Corresponding author: Turid Hedlund Swedish School of Economics and Business Administration Library P.O. Box 479 FIN-00101 Helsinki, Finland phone: +358-9-43133378 fax: +358-9-43133425 email: [email protected]

Abstract

This paper analyzes the features of Swedish language from the viewpoint of mono-

and cross-language information retrieval (CLIR). The study was motivated by the fact

that Swedish is known poorly from IR perspective. This paper shows that Swedish

has unique features, in particular gender features, the use of fogemorphemes in the

formation of compound words, and a high frequency of homographic words.

Especially in dictionary-based CLIR, correct word normalization and compound

splitting are essential. It was shown in this study, however, that publicly available

morphological analysis tools used for normalization and compound splitting have

pitfalls that might decrease the effevtiveness of IR and CLIR. A comparative study

was performed to test the degree of lexical ambiguity in Swedish, Finnish and

English. The results suggest that part-of-speech tagging might be useful in Swedish

IR due to the high frequency of homographic words.

Keywords: Text retrieval, Cross-language information retrieval, Swedish language,

Natural language processing

2

1 Introduction

Our society is depending on written communication, which today often is created and

stored in digital form. We have an increasing amount of full text material in different

languages available through the Internet and other information suppliers. Information

retrieval (IR) from operational full text databases or text retrieval is characterized by

the lack of controlled vocabularies. A relatively new research area is cross-language

information retrieval (CLIR). It is a process of selecting and ranking documents in a

language different from the query language.

Text retrieval, involves the use of natural language in the form of textual descriptions

of a topic. The variation in word forms and homographic and polysemous expressions

causes disambiguation problems in text retrieval, and the effect increases in CLIR.

We can express the same information content in many different ways depending on

who is the writer and for whom or for what purpose a text is written.

Full text retrieval as a response to a query in natural language involves methods

developed in computational linguistics (Grishman 1986) and natural language

processing (NLP) (Strzalkowski 1994; 1995). Increasing recall and precision depends

on these applications to match non-identical expressions belonging to the same

concept. Computational linguistics offers research in language analysis and

knowledge representation (syntactic and semantic knowledge), and NLP applications

rely strongly on statistical text analysis and machine readable texts and dictionaries

(Boguraev & Briscoe 1989; Guthrie, Pustejovsky, Wilks & Slator 1996). Analytic

tools relevant for IR purposes involve morphological analysis programs, part of

3

speech taggers, syntactic parsers, and truncation algorithms (stemmers) (Harman

1991; Karlsson 1992; Koskenniemi 1983; Porter 1980). Because English has been the main language for IR system development, much

research on IR and NLP involves English. However, IR systems for small languages

like Finnish and Swedish and other Nordic languages cannot be developed properly

without studying their special features. Although Spanish and Chinese have rendered

special tracks in the TREC Conferences1 and even Finnish (Alkula & Honkela 1992)

has been explored the results cannot be applied on linguistically quite different

languages.

This paper analyzes Swedish as document and query language for IR. Swedish is

spoken as a native language by 8 - 9 million people, mainly in Sweden but also by a

minority population in Finland. However, due to close relationships between the

Nordic countries and the other Scandinavian languages the number of people who

speak Swedish and can understand it is much larger (Teleman, Hellberg & Andersson

1999). Approximately, about 20 million people have a basic knowledge of Swedish.

Since the Scandinavian languages Swedish, Danish and Norwegian are close, results

of scientific research on one language can help to create a deeper understanding of

the same phenomena in the other Scandinavian languages (Teleman et al. 1999).

Thus, a careful generalization of the results in this study can be made to all three

languages.

So far we can report very few analyses of the Scandinavian languages concerning IR,

only a small Norwegian study (Fjeldvig & Golden 1988). Swedish and the

1 The fourth, fifth and sixth Text REtrieval Conferences, 1998-1998. URL: http://trec.nist.gov/

4

Scandinavian languages have unique and novel features which affect IR. We will

identify and present these features in the Swedish language relevant to IR on

morphologic, syntactic as well as on semantic level. The problem areas studied are a)

morphological features such as inflection, derivation, gender and compound words, b)

semantical features as homonymy, polysemy and hyponymy. The viewpoints for the

analysis are full text IR, database indexing, query formulation and CLIR. Publicly

available tools to overcome some of the problems have some pitfalls, which we

analyze. We believe that these questions deserve serious research efforts.

The rest of this paper is organized as follows. Section 2 summarizes the linguistic

problems of IR. Section 3 considers IR and natural language processing. In Section 4

the main linguistic features of Swedish are analyzed from the perspective of IR. In

Section 5 the use of NLP tools in Swedish IR and CLIR is discussed. Section 6

presents concluding remarks.

2 Linguistic problems in IR

An ideal case in searching would be that a search would give all the relevant

documents of a database, ranked in an order of descending relevance, and none of the

irrelevant documents. In a typical case, however, search results often include many

irrelevant documents, and many of the relevant documents are not retrieved. This is

primarily caused by linguistic phenomena. The basic linguistic problems of IR can be

classified as follows: (1) the selection of alternative concepts and search keys, (2) the

morphological variation of search keys, (3) referred and omitted search keys, (4)

search key ambiguity, and (5) multilinguality.

5

The first problem, the selection of alternative concepts and search keys, is related to

the fact that different documents which consider the same topic may use, and often do

use, different concepts and expressions. Typically only part of the relevant documents

are retrieved, as users are not able to produce exhaustively alternative concepts for the

concepts of requests and are not able to expand the queries using synonyms. The

problem of query expansion has been investigated in several studies (Efthimiadis

1996; Kekäläinen 1999). In IR terms, the search concepts which describe a given

aspect of a request and which thus are alternative concepts are either in hierarchical or

associative relations to each other. In linguistic terms, these relationships include

hyponymy, a hierachical subordinate relation (animal - horse ) meronymy, a

relationship between the whole and its parts (hand – finger), and other sense relations,

such as antonymy ( hot – cold) (for associative relations). In IR terms, the search keys

which represent a given concept are equivalents. In linguistic terms, equivalence

covers synonymy (Information Retrieval – IR).

The second problem, the morphological variation of search keys is associated with

the matching process. In morphologically simple languages, such as English, word

form variation can be handled by quite simple means, e.g. for nouns, by recognizing

the plural forms (-s) (Harman 1991; Porter 1980). However several European

languages are morphologically much more complex (e.g., German, French, Italian and

the Scandinavian languages). Documents are not retrieved if the search key and its

occurrence in a database index (the index term) are not identical in form. Thus a

search key given in a base form does not match with the inflected forms of the key.

The same problem also concerns derivationally related keys. Moreover, in a database

6

index, search key occurrences may be embedded in compound words. All these forms

may, however, reflect the information need behind the query. In this study a

compound is defined as a word formed by two or more components that are spelled

together. The term phrase is used for the case where the constituents are spelled

separately.

The third problem, referred and omitted search keys covers anaphoric and elliptical

keys. In the sentences: “Where is Mary? She is in the garden.”, the pronoun she and

the noun Mary are corefferential, and the pronoun is an anaphor. An example of

omitted search key or ellipsis is found in the sentences: “I have a dog. You have one

too.” In certain cases in proximity searching a considerable improvement in retrieval

performance is achieved if anaphors and ellipses are resolved (Pirkola & Järvelin

1996a; 1996b). The fourth problem, search key ambiguity, covers polysemy, a word

with more than one sense or subsenses ( star – a star in the sky or star - a famous

person) and homonymy, different words with the same form (bank – a financial

institution or bank – the side of a river). Due to the ambiguity of search keys the

intended senses of search keys are not always present in retrieved documents. In other

words, despite successful matching irrelevant documents are retrieved.

The fifth problem, multilinguality, concerns multilingual and monolingual document

collections from where documents can be retrieved using more than one language.

Cross-language retrieval is concerned in particular with the issues associated with the

cross-lingual equivalence of words (Ballesteros & Croft 1997; Davis 1998; Hull

1998; Oard & Dorr 1996; Pirkola 1998).

3 IR and natural language processing

7

Natural language processing involves linguistic methods and the analysis can take

place on different levels of a language, i.e., morphological, syntactic and semantic

levels. On a morphological level the structure of words is analyzed. Recognizing

different word forms as variants of the same basic word affects both indexing (word

weights) and retrieval (matching). Syntactic analysis determines the structure of

phrases and sentences, while semantic analysis investigates the meaning or sense of

words and sentences.

Commonly used methods in document indexing are word form normalization

(Koskenniemi 1983; Pirkola 1999) and stemming (Harman 1991). A stemmer

removes affixes from the word forms and the output is a common root, not necessarily

a real word. A similar type of process is normalization, but in this case the output is

the base form, a real word. Due to stemming and normalization three kinds of benefits

may be gained (Harman 1991; Järvelin 1995; Alkula & Honkela 1992). 1) A user

does not need to worry about truncation and inflection, because different forms of the

key are automatically conflated into the same form. 2) Stemming and normalization

result in storage savings. 3) Stemming and normalization may improve retrieval

performance, especially recall since a larger number of potentially relevant

documents are retrieved. However, no significant improvement in performance was

found by Harman (1991) in her experiment with simple stemmers for the English

language. For inflectionally more complex languages the results are not nexessarily

the same.

Swedish is rich in compound words, and is thus confronted with the problem of

embedded search keys. Splitting the compounds into their components allows the use

8

of the component words as separate search keys. For instance, the decomposition of

the compound hustak (roof of the house) gives the expansion keys hus (house) and tak

(roof). If the compound hustak was truncated and used as a key, documents including

the word tak would not be found.

In IR part-of-speech (POS) tagging may be used to identify central words (word

classes, especially nouns) and phrases of a sentence. In CLIR part-of-speech tagging

is useful in matching the source language keys with translation dictionary entry

words.

A syntactic parser (program) determines the structure of a sentence according to a

particular grammar (Grishman 1986). The parsing procedure may involve the

assignment of a tree structure to the input sentence. Linguistic transformation, such as

transforming active sentences into passive can be of potential value for IR. Syntactic

analysis can be used as a basis for further analysis, e.g., anaphor resolution.

Word sense disambiguation is an NLP method which aims at finding correct senses

for word occurrences. It has been studied intensively in IR and other fields. The

methods used in the studies include dictionaries (Dagan et al. 1991; Guthrie et al.

1991; Krovetz and Croft 1989), knowledge bases (Hirst 1987), statistical methods

(Brown et al. 1991; Schütze & Pedersen 1995), multiple knowledge sources (McRoy

1992), thesauri (Voorhees 1993) and pseudo-words (Sanderson 1994; Sanderson

1997). Most IR studies have reported no or only slight improvements in retrieval

performance due to word sense disambiguation (Krovetz & Croft 1992; Sanderson

1994; Sanderson 1997; Vorhees 1993).

9

4 Linguistic features of Swedish from IR perspective

In Swedish we can identify the following features that are likely to affect information

retrieval: 1) a fairly rich morphology with 2) gender features (gender uter and gender

neuter2) and 3) high frequency of compound and derivative word forms, 4) common

noun phrases are less frequent in Swedish than for example in English 5) a high

frequency of homographic word forms.

Morphology. The inflectional and derivational morphology is more complex than

that of English. Nouns can be divided into five declination types according to the

plural suffixes they take, i.e. -or, -ar, -er(r), -n, and ø(no suffix). Genitive forms are

formed by the suffix -s. Definite forms of nouns are expressed using the articles den,

det (sing.) de (pl.), and the definite suffixes -en(-n) (gender uter) and -et (-t) (gender

neuter). Due to the fairly rich inflectional morphology, nouns have several

inflectional forms and even the stems change (umlaut) (Table 1). Therefore simple

indexing and matching methods are insufficient for Swedish IR. Inflection interferes

with the weighting of index terms, and in matching documents are lost due to

inflected words.

2 Gender is an inalienable property of a Swedish noun. A noun is defined as gender “uter” or gender “neuter” according to the inflectional suffix it takes in definite form singular (uter: -en, -n, neuter: -et, -t). Homonyms often have diferent gender, for example: plan –en (plan), plan –et (plane).

10

Base form Definite form sing. Plural Definite form pl. Flick|a (girl) Flicka-n Flick-or Flickor-na Bro (bridge) Bro-n Bro-ar Broar-na Bok (book) Bok-en Böck-er Böcker-na Genitive sing. Genitive def. sing. Genitive plural Genitive def. pl. Flicka-s Flickan-s Flickor-s Flickorna-s Bro-s Bron-s Broar-s Broarna-s Bok-s Boken-s Böcker-s Böckerna-s

Table 1 Examples of inflected noun forms

Adjectives are congruent with their headwords in gender and number (Table 2).

Irregularities to the general inflectional rules exist in nouns as well as in adjectives

and verbs. In Swedish strong verbs have vowel change. For a comprehensive

description of inflection we refer the reader to books on Swedish morphology, e.g.,

(Hellberg 1978; Kiefer 1970; Malmgren 1994; Teleman et al. 1999).

Indef. form sing. Definite form sing. Indef. form plural Definite form pl. en vit hare (uter) (a white rabbit)

den vita haren (the white rabbit)

vita harar (white rabbits)

de vita hararna (the white rabbits)

ett vitt hus(neuter) (a white house)

det vita huset (the white house)

vita hus (white houses)

de vita husen (the white houses)

Table 2. Examples of adjective inflection

For the problem of inflectional forms, morphological analysis programs should be

used for word normalization at the indexing and searching stages. We shall consider

morphological analyzers and word form normalization in Section 5.

Gender. Swedish nouns have two grammatical genders, gender uter and gender

neuter (Teleman et al. 1999) The separation of gender in Swedish has positive effect

11

on the effectiveness of anaphor resolution algorithms, as in the algorithm for pronoun

resolution in Swedish, constructed by Fraurud (1988). The gender based resolution is

based on the fact that the pronouns den (uter), det (neuter) (meaning it) should be in

agreement with their correlates. Homographic nouns often possess different genders

(Teleman, Hellberg & Andersson 1999), thus the use of gender is helpful also in the

resolution of homographs.

Derivatives. New words are formed by derivation and compounding. Derivatives and

their stem words may belong to different part-of-speech categories. The possibility to

change part-of speech class of a word by derivation enables for example the

derivation of a noun (an actor or an action) from a verb. Such words are in many

cases lexicalized. Words belonging to different part-of-speech categories have

different derivational suffixes, and the inflection of derived and underived words is

similar. Examples of noun suffixes are -are –het and -ning, e.g., målare (painter),

skönhet (beauty) and förkortning (abbreviation). Examples of adjective suffixes are

–bar and -aktig, e.g., mätbar (measurable) and livaktig (vivid). The suffixes –era and

-na, e.g., spionera (spy), vakna (wake up) are examples of verb suffixes (Malmgren

1994).

Derivation is a common method to form antonyms.They are often formed by prefixes,

i.e., o-, a-, in-, il-, im-, ir- , e.g., lycklig / olycklig (happy / unhappy), symmetrisk /

asymmetrisk, tolerant / intolerant, legal / illegal, produktiv / improduktiv, rationell /

irrationell. Derivational prefixes are, as in the examples above in many cases of

foreign origin (Malmgren 1994).

12

Derivatives can enter into compounds as a stem and the rules for derivation must

precede those of compounding, for example:

lära - lärare - lärarförbundet (teach - teacher - teacher association)

fängsla - fängelse - fängelsestraff (arrest - prison - penalty of imprisonment)

antik - antikvitet - antikvitetsaffär (antique - antiquity - antique shop)

In a normalization process for IR, derivatives may be related back to the original

word/stem. This increases ambiguity and sometimes makes it impossible to find the

right translation alternative for CLIR, because in many cases the derived forms are

lexicalized and have separate entries in a translation dictionary.

Compounds and Phrases. Swedish is characterized by high frequency of compound

words. Phrases are not so common as in English. Compounds are easier to deal with

than phrases, because there is no need for identification. On the other hand for

effective IR and CLIR, compounds need to be decomposed. Especially in CLIR this is

often necessary, because for many compounds dictionaries include only the

components of compounds.

A typical feature of Swedish is also the use of fogemorphemes in compound word

formation. Fogemorphemes are elements used to join the constituents of compounds.

The fogemorpheme types are as follows:

- (omission) flicknamn (maiden name)

-s rättsfall (legal case)

-e flickebarn (female child)

-a gästabud (feast, banquet)

-u gatubelysning (street lighting)

13

-o människokärlek (love of mankind)

Sometimes the component preceding the fogemorpheme is a stem, e.g., gat(a)

sometimes a base form, e.g., rätt. Proper weighting of index terms and effective

matching require that fogemorphemes could be handled and correct base forms of

component words could be identified.

Many nominal compounds are lexicalized so that it is not always possible to derive

their meaning from the meanings of component words, e.g., jordgubbe (strawberry).

In many compound words, however, the last component is a hyperonym of the full

compound being thus a valuable key (Pirkola 1999). For exemple, compound words

denoting to different types of schools have the component skola (school) as their last

component. In IR the decomposition of compounds would often be helpful in cases

like this. The last component may also be used as an ellipsis later in the text.

Particle verbs (compound verbs where one component is a particle) are special forms

of compounds. There are four types of particle verbs: always tightly compounded,

e.g., påminna (to remind), always loosely compounded (except in past participle),

e.g., tycka om / omtyckt (to like), both loosely and tightly compounded, but with no

change in sense, e.g., omtala / tala om (speak about), and with a change in sense, e.g.,

avbryta / bryta av (interrupt / break) (Malmgren 1994). From IR point of view, some

particle verbs are similar to nominal compounds and some similar to nominal phrases.

Tightly compounded particle verbs are mostly lexicalized, but with the other forms

we can see the benefits of identifying phrases and the benefits of splitting compounds

into components.

14

Homonymy and polysemy. Homonymy and polysemy are two different forms of

lexical ambiguity. Homonyms are different lexemes spelled similarly. A polysemous

word is a word that has several subsenses which are related with one another

(Teleman et al. 1999). For example the word stjärna (star) may denote to a famous

person or a star in the sky. Due to lexical ambiguity the intended senses of keys are

not always implemented in the retrieval results, and irrelevant documents are

retrieved. In CLIR the negative effects of lexical ambiguity on retrieval performance

are often more severe than in monolingual retrieval, because in CLIR both source and

target language keys may be ambiguous.

Homographs are very common in Swedish. According to Karlsson (1994) around

65% of Swedish words in running text are homographs. In Finnish the percentage is

only 15% and in English around 50%. For example, the Swedish form för can be

used as a preposition, verb, noun, and adverb. Since Swedish is a language rich in

homographic word forms it is possible that the disambiguation of homonyms or the

analysis of the the meaning of a word from its context would have a positive effect on

retrieval results, contrary to the findings for English, e.g., (Sanderson 1994).

5 Natural language analysis programs for Swedish

Normalization. SWETWOL is a morphological analyzer developed to analyze

Swedish words (Karlsson 1992). The program incorporates a full description of

Swedish morphology, and morphological descriptions are assigned. Normalization

15

returns all the derivational and inflectional forms of a word to the base form by

recognizing / deleting inflectional and derivational affixes.

Let us consider some examples of Swedish word normalization using SWETWOL

(web-version, used in October 1999). Table 4 shows that most sample words are

correctly normalized. The noun behandling (treatment) is normalized to the verb

behandla (to treat). In CLIR this may be a problem, because dictionaries typically

give nouns and verbs as separate entries. It would benefit the translation if the

normalization output would give both forms. In monolingual IR this problem is less

severe since both document and query words are treated similarly.

Input word Normalization to base form

SWETWOL analysis

avgöras avgöra (decide) V PASS INF (verb, passive, infinitive)

utsläpp

utsläppa (dumping) utsläpp (outlet)

V ACT IMP (verb, active, imperative) N NEU INDEF SG/PL NOM (noun, neuter, indefinite, sg./pl. nominative)

behandling (treatment, usage, management)

behandla (treat, deal with) VDER/-ning N UTR INDEF SG NOM (verb derivation /-ning, noun, uter, indefinite, sg. nominative)

väsentligt (essentially, mainly)

väsentlig (essential, fundamental, main )

A NEU INDEF SG NOM (adjective, neuter, sg. nominative)

operation (operation) operation (operation)

N UTR INDEF SG NOM (noun, uter, indefinite, sg. nominative)

Table 4. Normalization for Swedish

Compound splitting. Table 5 presents some examples of compound splitting by

SWETWOL. In the first example “skogsindustrin” (forest industry) the components

skog (forest) and industrin (the industry) are joined with the fogemorpheme “s”. The

16

latter component industrin is normalized to the base form industri, which is a

hyperonym of skogsindustri and therefore often a valuable search key. The former

component skogs has retained the “s”, and is not normalized to the base form skog. In

monolingual IR this has the effect that the matching process only concerns the

genitive form skogs and compounds with the first component skogs. The compounds

with skog- (the fogemorpheme “s” not used ) like skogbevuxen (wooded), skoglös

(treeless) are not found.

For CLIR this is still more puzzling since the translation alternatives are several for

almost every word. If the compound word kärnkraftverk (nuclear power plant)

(example 2 in table 5) is found in the dictionary, we will get an unambiguous

translation. If the compound is split, we will have a situation where the last

component is normalized to base form but the other components are retained in the

form they exist in the compound word. In this case we have an example of omission

(See Section 4 on fogemorphemes), where kärna is transformed to kärn.

For the word toppmötet (meeting, summit) there are two outputs topp and möte and

topp, mö and te (example 3 in table 5). The latter case provides an example of a

semantically false coordination.

17

Input word Compound splitting SWETWOL analysis skogsindustrin (the forest industry)

skogs#industri (forest # industry)

<N> # N UTR DEF SG NOM (noun # noun, uter, definite, sg. nominative)

kärnkraftverken (the nuclear power plant)

"kärn#kraft#verk" (kärn-a = kernel, seed, grain, nuclear etc. kraft = force, strength, etc. verk = work(s), labor, plant, mill etc.)

<N> # <N> # N NEU DEF PL NOM (noun # noun # noun, neuter, definite, pl. nominative)

toppmöte (summit, (top-level) meeting)

"topp#möte" (topp= done!, agreed!, it's a bargain! top crest, summit, pinnacle, peak, apex möte= meeting, encounter, appointment, date, gathering, assembly, conference) "topp#mö#te" (mö= maid, maiden te= tea)

<N> # N NEU DEF SG NOM (noun#noun, neuter, def. sg. nominative) <N> # <N> # N NEU DEF SG NOM (noun#noun, neuter, def. sg. nominative)

Table 5. Examples of compound splitting in Swedish

Lexical ambiguity. Homograph disambiguation might be useful in Swedish IR due to

the high frequency of homographic expressions in Swedish. We performed a test in

which 35 Finnish test requests used in IR and CLIR research were translated into

English and Swedish. We picked out 60 Finnish, English and Swedish equivalent

words representing different part-of speech classes, and studied the degree of

ambiguity in the three languages. We used off-the-shelf dictionaries3 to count the

3 Suomi-ruotsi suursanakirja [Finnish-Swedish comprehensive dictionary] / Knut Cannelin, Lauri Hirvensalo, Nils Hedlund 3rd ed. Porvoo : WSOY, 1976 (150.000 words) Ruotsi-suomi-suursanakirja [Swedish-Finnish comprehensive dictionary] / Lea Lampén, Porvoo : WSOY, 1973 (70.000 words) Norstedts engelsk-svenska ordbok [Norstedt’s English-Swedish dictionary] /Britt-Marie Berglund...[et al.] Stockholm: Norstedts 1983 (60.000 words and phrases)

18

number of homographic word forms and translation alternatives. By translation

alternatives is meant alternative translations (including polysemy) and translation

examples to a word in similar dictionaries that are used for CLIR-research. The test

with homographic base form words in Table 6 show an average of 0,22 homographic

expressions for Swedish, 0,07 for English and no homographic expressions in the

Finnish test sample. The number of translation alternatives for the same test sample is

shown in Table 7. The number of translation alternatives is depending on the

dictionary used, in this case standard dictionaries of approximately the same size.

Similar translation tools would have to be used for CLIR.

Average number of homographic base form wordsFinnish English Swedish

Nouns 0 0,05 0,Verbs 0 0,15 0,4Adjectives+Adverbs 0 0 0Average all POS 0 0,07 0,

25

22

Table 6. Average number of homographic base form source words in dictionaries

Average number of transl. alternatives for base form wordsFin-Swe Eng-Swe Swe-Fin

Nouns 20,8 8,1 8,7Verbs 43,9 36,4 19,3Adjectives + Adverbs 5,7 7 5Average all POS 23,4 17 11,0

Table 7. Average number of translation alternatives for base form words

19

To test the effect of normalization we used the test sample, this time using the

inflected word form of the original test requests. The (TWOL)4 morphological

analyzers were used in the experiments. In this case (Table 8), we can see a clear

difference in the languages. For Finnish nouns we can see a small increase in

translation alternatives, and a higher for verbs and adjectives + adverbs. English

shows practically no difference, but in Swedish we have a considerable increase in

translation alternatives for all part-of-speech classes.

Average number of translation alternatives of inflected word formsafter morphological analysis, the increase in percent to words in base form (Table 7)

Fin-Swe % increase Eng-Swe % increase Swe-Fin % increaseNouns 22,9 10,1 8,1 0 33,6 286,2Verbs 50,5 15,0 36,4 0 34,7 79,8Adjectives+Adverbs 13,9 143,9 8,9 29,0 18,1 262,0Average all POS 29,1 56,3 17,8 9,7 28,8 209,3

Table 8. Average number of translation alternatives after normalization

A general cautiousnes is needed about the prospects of word sense disambiguation

strategies and the hope that they should significantly improve retrieval performance

(Lewis & Sparck Jones 1996). But also in this case we have no research result for

Swedish.

4 SWETWOL, Copyright © Lingsoft Ab 1995. http://www.lingsoft.fi/cgi-pub/swetwol ENGTWOL Copyright © Atro Voutilainen, Juha Heikkilä, Timo Järvinen and Lingsoft, Inc. 1993-1995. http://www.lingsoft.fi/cgi-pub/engtwol FINTWOL Copyright © Kimmo Koskenniemi & Lingsoft Oy 1995 - 1997. http://www.lingsoft.fi/cgi-pub/fintwol

20

One method to disambiguate word senses is the use of a concordance, where the

context of the word is listed along with the word. Homonymous words can be

disambiguated by establishing their co-occurrence or collocablility. For example, as

in

English in Swedish the word “bank” means both financial institution and reef or sand

bank. Bank in the sense of financial institution is probably occurring with words like

“lån” (credit), “pengar” (money), “ekonomi” (economy), while the word “bank”

(reef) occurs with “jord” (earth), “flod” (river) etc.

Consider one of our test requests:

“Skuldkrisen i Sydamerika. Hur har skuldsättningsproblemet utvecklats?”

“South American debt crisis. How has the debt problem developed?”

From a dictionary we can establish the following meanings for the Swedish words

skuld, kris and skuldsättning:

skuld debt, amount due, liabilities, fault, blame, guilt

kris crisis, depression

skuldsättning getting into debt

If we can establish that the words skuld and skuldsättning are usually co-occurring,

we can establish that the right translation would be “debt crisis”, not “guilt crisis” or

“guilt depression”. However, the method with collocating words for disambiguation

purposes is not easy to implement in operational IR-systems, and specially not if we

only have a relatively short query to work with. This is the case for dictionary-based

methods in CLIR.

21

Part-of-speech tagging might be useful in Swedish IR because of the high frequency

of homographic word forms. The words in the query and in the documents are part-of-

speech marked and in the matching process the part-of-speech marks for the search

keys and the document index terms are acknowledged. With this method, for

example, the preposition för (because) could be separated from the verb för(a)

(bring). In CLIR part-of-speech tagging of source language words has been used

successfully for finding correct translation equivalents in the dictionary (Ballesteros

& Croft 1998).

With CLIR-research in mind we tested part-of speech-marking, using the same test

sample of 60 words in Finnish, English and Swedish. The effect on the number of

translation alternatives is shown in Table 9. We can se a small decrease in the number

of translation alternatives for English nouns, no decrease for Finnish nouns and a

larger for Swedish nouns. The results agree with earlier research by Leppänen (1995)

(Finnish) and Sanderson (1994, 1997) (English). From the test we come to the

conclusion that part-of-speech-marking might be effective in CLIR for Swedish.

Average number of translation alternatives after POS-markingthe decrease in percent to normalization without POS-marking (Table 8)

Fin-Swe % decrease Eng-Swe% decrease Swe-Fin % decreaseNouns 22,9 0 7,2 -11,1 21,4 -36,3Verbs 45,5 -9,9 31,9 -12,4 29,8 -14,1Adjectives+Adverbs 5,0 -64,0 4,8 -46,1 7,5 -58,6Average all POS 24,5 -24,6 14,6 -23,2 19,5 -36,3

Table 9. Average numberof translation alternatives after POS-marking

22

6 Conclusions The present situation with Swedish natural language IR is that it is lacking research.

Neither has there been any CLIR research in Swedish. A Norwegian study (Fjeldvig

& Golden 1988) done ten years ago with a fairly small test database is the only

research result that might give some leads on how the Swedish language behaves in

similar research situations. The Norwegian study shows the effectiveness of automatic

normalization and truncation but an improvement in the results for compound

splitting was not shown.

We have described the properties of Swedish language and pointed out a number of

research problems for Swedish IR. We also argued that they deserve serious research

attention, because 1) the Nordic countries have a fairly large population, 2) they are

technically advanced countries with lots of IR-related activity, and because 3) the

features of Swedish differ from those of many other languages, such as English and

Finnish. The research results for other languages do not necessarily apply to Swedish

due to its unique features, e.g., the frequency of homographs, the use of

fogemorphemes in compound words, and particle verbs.

We have also shown that the publicly available morphological analysis tools for word

normalization and compound splitting have pitfalls that might decrease the

effectiveness of IR and CLIR. The identification of base forms and the analysis of

23

compounds, in particular may present problems for IR. Therefore the morphological

analysis programs cannot be utilized directly but need to be tuned for IR applications.

References

Alkula, R., Honkela, T. (1992). Tekstin tallennus- ja hakumenetelmien kehittäminen

suomen kielen tulkintaohjelmien avulla. FULLTEXT-projektin loppuraportti. [

Linguistic processing and retrieval techniques in Finnish fulltext databases. Final

report of the FULLTEXT project.] VTT julkaisuja - publikationer 765. Espoo: VTT.

Ballesteros, L. & Croft, W. B. (1997). Phrasal translation and query expansion

techniques for cross-language information retrieval. In Proceedings of the 20th

ACM/SIGIR Conference, (pp. 84-91).

Ballesteros, L., Croft, W. B. (1998). Resolving ambiguity for cross-language retrieval.

In Proceedings of the 21stAnnual International ACM/SIGIR Conference, (pp. 64-71).

Boguraev, B. & Briscoe, T. (Eds.) (1989). Computational lexicography for natural

language processing. London: Longman.

Brown, P.F., Della Pietra, S.A., Della Pietra, V.J. & Mercer, R.L. (1991). Word-sense

disambiguation using statistical methods. In Proceedings of the 29th Annual Meeting

of the Association for Computational Linguistics, (pp. 264-270).

24

Dagan, I., Itai, A. & Schwall, U. (1991). Two languages are more informative than

one. In Proceedings of the 29th Annual Meeting of the Association for Computational

Linguistics, (pp. 130-137).

Davis, M. & Ogden, W. C. (1997). QUILT: Implementing a large-scale cross-

language text retrieval system. In Proceedings of the 20th Annual International

ACM/SIGIR Conference, (pp. 92-98).

Efthimiadis, E.N. (1996) Query expansion. In Annual Review of Information Science

and Technology, 31, (pp. 121-187).

Fjeldvig, T., Golden, A. (1988) Experiments with language-based aids in information

retrieval systems. Nordic Journal of Linguistics, 11(1-2), 33-46.

Fraurud, K. (1988). Pronoun resolution in unrestricted text. Nordic Journal of

Linguistics, 11(1-2), 47-68.

Grefenstette, G. (1998). The problem of cross-language information retrieval. In G.

Grefenstette (Ed.) Cross-Language Information Retrieval, (pp. 1-9). Boston: Kluwer

Academic Publishers.

Grishman, R. (1986). Computational linguistics: an introduction. Cambridge:

Cambridge University Press.

25

Guthrie, J. A., Guthrie, L., Wilks, Y. & Aidinejad, H. (1991). Subject-dependent co-

occurrence and word sense disambiguation. In Proceedings of the 29th Annual

Meeting of the Association for Computational Linguistics, (pp. 146-152).

Guthrie, L., Pustejovsky, J., Wilks, Y. & Aidinejad, H. (1996). The role of lexicons in

natural language processing. Communications of the ACM, 39(1), 63-72.

Harman, D. (1991). How effective is suffixing? Journal of the American Society for

Information Science, 42(1), 7-15.

Hellberg, S. (1978). The morphology of present-day Swedish: Word-inflection, word-

formation, basic dictionary. Stockholm: Almqvist & Wiksell.

Hirst, G. (1981) Anaphora in natural language understanding: A survey. Lecture

Notes in Computer Science, 119. Berlin: Springer-Verlag.

Hull, D. (1998). A weighted Boolean model for cross-language text retrieval. In G.

Grefenstette (Ed.) Cross-Language Information Retrieval, (pp. 119-136). Boston:

Kluwer Academic Publishers.

Järvelin, K. (1995). Tekstitiedonhaku tietokannoista. Johdatus periaatteisiin ja

menetelmiin. [Text retrieval from databases. An introduction to principles and

methods. ] Espoo: Suomen ATK-kustannus.

26

Karlsson, F. (1992). SWETWOL: A Comprehensive Morphological Analyser for

Swedish. Nordic Journal of Linguistics, 15, 1-45.

Karlsson, F. (1994). Yleinen kielitiede. [General linguistics. ] Helsinki:

Yliopistopaino.

Kekäläinen, J. (1999). The effects of query complexity, expansion and structure on

retrieval performance in probabilistic text retrieval. Ph.D. Thesis, University of

Tampere.

Kiefer, F. (1970). Swedish morphology. Stockholm: Skriptor.

Koskenniemi K. (1983). Two-Level Morphology: A General Computational Model for

Word-Form Recognition and Production. Ph.D. Thesis. University of Helsinki.

Krovetz, R. & Croft, W.B. (1989). Word sense disambiguation using machine-

readable dictionaries. In Proceedings of the 12th Annual International ACM SIGIR

Conference on Research and Development in Information Retrieval, (pp. 127-136).

Krovetz, R. & Croft, W.B. (1992). Lexical ambiguity and information retrieval. ACM

Transactions on Information Systems, 10(2), 115-141.

Leppänen, E. (1995). Homografien disambiguoinnin vaikutukset ja toteuttaminen

tekstihaussa. [The effects of disambiguation of homographs and its implementation

in text retrieval] Master’s Thesis. University of Tampere.

27

Lewis, D., Sparck Jones, K. (1996). Natural language processing for information

retrieval. Communications of the ACM, 39(1), 92-101.

Malmgren, S. G. (1994). Svensk lexikologi. Ord, ordbildning, ordböcker och

orddatabaser. [Swedish lexicology. Words, word formation, dictionaries and word

databases. ] Lund: Studentlitteratur.

McRoy, S.W. (1992). Using multiple knowledge sources for word sense

disambiguation. Computational Linguistics, 18(1), 1-30.

Oard, D. & Dorr, B. (1996). A survey of multilingual text retrieval. Technical Report

UMIACS-TR-96-19. University of Maryland, Institute for Advanced Computer

Studies.

Pirkola, Ari: (1998). The effects of query structure and dictionary setups in

dictionary-based cross-language information retrieval. In Proceedings of the 21st

Annual International ACM/SIGIR Conference, (pp. 55-63).

Pirkola, A. (1999). Studies on linguistic problems and methods in text retrieval. Ph.D.

Thesis, University of Tampere.

Pirkola, A. & Järvelin, K. (1996a). The effect of anaphor and ellipsis resolution on

proximity searching in a text database. Information Processing & Management,

32(2), 199-216.

28

Pirkola, A. & Järvelin, K. (1996b). Recall and precision effects of anaphor and

ellipsis resolution in proximity searching in a text database. In P. Ingwersen & N. O.

Pors (Eds.) Proceeding CoLIS2, Second International Conference on Conceptions of

Library and Information Science: Integration in Perspective. Copenhagen, October

13-16, 1996. Copenhagen: The Royal School of Librarianship, (pp. 459-475).

Porter M.F. (1980) An algorithm for suffix stripping. Program, 14, 130-137.

Sanderson, M. (1994). Word sense disambiguation and information retrieval. In

Proceedings of the 17th Annual International ACM SIGIR Conference on Research

and Development in Information Retrieval, (pp. 142-151).

Sanderson, M. (1997). Word sense disambiguation and information retrieval. Ph.D.

Thesis University of Glasgow, Department of Computing Science.

Schütze, H. & Pedersen, J.O. (1995). Information retrieval based on word senses. In

Proceedings of the Symposium on Document Analysis and Information Retrieval, (pp.

161-175).

Strzalkowski, T. (1994) Robust text processing in automated information retrieval. In

Proceedings of the 4th Conference on Applied Natural Language Processing, 168-

173. Stuttgart, Germany: Association for Computational Linguistics. Reprinted in K.

Sparck Jones & P. Willet (Eds.). Readings in Information Retrieval (1997). San

Francisco, CA: Morgan Kaufmann Publishers.

29

Strzalkowski, T. (1995). Natural language information retrieval. Information

Processing & Management, 31(3), 397-417.

Teleman, Hellberg, & Andersson, E. (1999) Svenska Akademiens grammatik 1-4

[Grammar of the Swedish Academy 1-4]. Stockholm: Svenska Akademien.

Vorhees, E.M. (1993). Using WordNet to disambiguate word senses for text retrieval.

In Proceedings of the 16th Annual International ACM SIGIR Conference on Research

and Development in Information Retrieval, (pp. 171-180).