



Workshop Language and Speech Infrastructure in the Balkan Countries 2005 - Borovets, Bulgaria

INTERNATIONAL WORKSHOP

LANGUAGE AND SPEECH

INFRASTRUCTURE

FOR INFORMATION ACCESS

IN THE BALKAN COUNTRIES

PROCEEDINGS

Edited by

Stelios Piperidis and Elena Paskaleva

Borovets, Bulgaria

25 September 2005


International Workshop

Language and Speech Infrastructure for Information Access in the Balkan Countries

PROCEEDINGS

Borovets, Bulgaria

25 September 2005

ISBN 954-91743-2-8

Designed and printed by INCOMA Ltd., Shoumen, Bulgaria


ORGANISERS AND SPONSORS

The Workshop on “Language and Speech Infrastructure for Information Access in the Balkan Countries” is organised by

Stelios Piperidis, Institute for Language and Speech Processing, GR
Elena Paskaleva, Bulgarian Academy of Sciences, BG

The Workshop on “Language and Speech Infrastructure for Information Access in the Balkan Countries” is supported by

Fifth International Conference on Recent Advances in Natural Language Processing, RANLP-2005

The team behind the Workshop on “Language and Speech Infrastructure for Information Access in the Balkan Countries”

Stelios Piperidis, Institute for Language and Speech Processing, GR
Maria Gavrilidou, Institute for Language and Speech Processing, GR
Katerina Pastra, Institute for Language and Speech Processing, GR
Elena Paskaleva, Bulgarian Academy of Sciences, BG
Galia Angelova, Bulgarian Academy of Sciences, BG


Programme Committee

Galia Angelova, Bulgarian Academy of Sciences, BG
Kalina Bontcheva, University of Sheffield, UK
Dan Cristea, "Alexandru Ioan Cuza" University of Iasi, RO
Tomaz Erjavec, Jožef Stefan Institute, SL
Bojana Gajic, Norwegian University of Science and Technology, NO
Maria Gavrilidou, Institute for Language and Speech Processing, GR
Florentina Hristea, University of Bucharest, RO
Vangelis Karkaletsis, NCSR Demokritos, GR
Rada Mihalcea, University of North Texas, US
Ruslan Mitkov, University of Wolverhampton, UK
Ivan Obradovic, University of Belgrade, SR
Kemal Oflazer, Sabanci University, TR
Petya Osenova, Sofia University "St. Kliment Ohridski", BG
Harris Papageorgiou, Institute for Language and Speech Processing, GR
Elena Paskaleva, Bulgarian Academy of Sciences, BG
Katerina Pastra, Institute for Language and Speech Processing, GR
Bojan Petek, University of Ljubljana, SL
Stelios Piperidis, Institute for Language and Speech Processing, GR
Kiril Simov, Bulgarian Academy of Sciences, BG
Sofia Stamou, Computer Technology Institute, GR
Dusko Vitas, University of Belgrade, SR


PREFACE

The emerging convergence of internet and media technologies, the abundance of interesting and useful archived and contemporary content, and the ever increasing need for access to this content have set new challenges and opportunities for human language technologies (HLT). Useful results from the application of HLT to information access already exist for a number of widely used languages, while only first attempts have been made for less widely used languages.

At the same time, the enlargement process of the European Union and the forthcoming accession of a number of Balkan countries set new requirements for technologically enhanced access to information generated and consumed in the Balkan Peninsula.

The current workshop brings together researchers working on a broad range of human language technologies for processing single-medium and/or multimedia content. The focus is on issues relevant to the speech and language infrastructure in the Balkans. Papers covering 6 Balkan languages and a couple of western European languages were submitted by authors from 10 countries, including the Balkans, other European Union countries and the USA. 12 papers were retained for oral presentation, covering all aspects from infrastructural and general language resource descriptions to multilingual lexical knowledge elicitation methods and techniques, syntactic parsing, information retrieval experiments, information extraction, question answering and machine translation.

The workshop clearly shows that HLT in the Balkan area is advancing at high speed. A sound infrastructure is in place for most Balkan languages as far as written language processing and the associated applications are concerned. Spoken language and multimedia content processing are still taking their first steps, and considerable effort and resources need to be invested to ensure enhanced access to content.

We would like to thank the organizers of the Fifth International Conference on Recent Advances in Natural Language Processing (RANLP-2005) for agreeing to host this workshop as one of its satellite events. We would also like to thank the researchers who submitted papers and the members of the Programme Committee for their help with the reviewing process. Last but not least, we would like to thank Galia Angelova for her support with the organization of this workshop.

Workshops like this one aim to serve as a regular meeting place for the Balkan countries, giving researchers, especially younger ones, a platform for presentations, contacts and cooperation, and a forum for publicizing and promoting research results and establishing a communication channel.

Stelios Piperidis and Elena Paskaleva
September 2005


Table of Contents

Some Aspects of Negation Processing in Electronic Health Records
    Svetla Boytcheva, Albena Strupchanska, Elena Paskaleva and Dimitar Charaktchiev ... 1

Towards a Greek Dependency Corpus
    Elina Desipri, Prokopis Prokopidis, Maria Koutsombogera, Xaris Papageorgiou and Stelios Piperidis ... 9

Building Multilingual Terminological Resources
    Maria Gavrilidou, Penny Labropoulou, Monica Monachini, Stelios Piperidis and Claudia Soria ... 15

An Algorithm for the Semiautomatic Generation of WordNet Type Synsets with Special Reference to Romanian
    Florentina Hristea and Cristina Vata ... 23

Resources for Processing Bulgarian and Serbian – a Brief Overview of Completeness, Compatibility, and Similarities
    Svetla Koeva, Cvetana Krstev, Ivan Obradovic and Duško Vitas ... 31

Dictionary, Statistical and Web Knowledge in Shallow Parsing Procedures for Inflectional Languages
    Preslav Nakov and Elena Paskaleva ... 39

Infrastructure for Bulgarian Question Answering: Implications for the Language Resources and Tools
    Petya Osenova and Kiril Simov ... 47

Experimenting in Information Retrieval Systems
    Veno Pachovski ... 53

BULTRA (English-Bulgarian Machine Translation System) - Basic Modules and Development Prospects
    Elena Paskaleva and Tanya Netcheva ... 60

The Globe: A 3D Representation of Linguistic Knowledge and Knowledge about the World Together
    Kamenka Staykova and Sergey Varbanov ... 68

Valence Specifics of Bulgarian VP Feature-Structures in regard to HPSG Universal Constraints
    Tzvetomira Venkova ... 75

Designing a PAROLE/SIMPLE German-English-Romanian Lexicon
    Monica Gavrila, Walther v. Hahn and Cristina Vertan ... 82


Some Aspects of Negation Processing in Electronic Health Records

Svetla Boytcheva1, Albena Strupchanska2, Elena Paskaleva2 and Dimitar Tcharaktchiev3

1Department of Information Technologies, Faculty of Mathematics & Informatics, Sofia University "St. Kl. Ohridski", 5 James Bauchier Str., 1164 Sofia

[email protected]
2Linguistic Modelling Department, Central Laboratory for Parallel Processing,

Bulgarian Academy of Sciences, 25A Acad. G. Bonchev Str., 1113 Sofia {albena, hellen}@lml.bas.bg

3University Hospital of Endocrinology, Nephrology and Gerontology, Medical University, 6 Dame Gruev Str., Sofia

[email protected]

Abstract

This paper discusses a hybrid approach to negation processing in Electronic Health Records (EHRs) in Bulgarian. The rich temporal structure and the specific mixture of medical terminology in both Bulgarian and Latin do not allow the application of standard language processing techniques. The problem is aggravated by the frequent use of specific abbreviations, analyses, and clinical test data. Various expressions of negation occur often in EHRs, which raises many difficulties for language processing, especially for semantic analysis. We therefore propose an approach that combines information extraction with deep semantic analysis and allows us to determine the negation and its scope and to treat it appropriately. We present MEHR, a prototype system for the automatic recognition of medical terms and some facts in Bulgarian EHRs. This is the first step towards filling in a template concerning the patient's status. Automatic extraction of all the facts needed to describe the patient's status in full is our ultimate goal; in this paper, however, we focus on the proper treatment of negation.

1. Introduction

Automatic generation of a Patient's Chronicle (symptoms and diagnoses) from EHRs is a very challenging and ambitious task. It requires recognition of medical terminology, deep semantic analysis at domain-relevant points, processing of temporal structure, and discourse analysis. Even a partial solution for the generation of a patient's chronicle requires intensive linguistic and domain knowledge as well as the application of different language processing techniques. For languages other than English such knowledge resources are missing, so in attempting to process EHRs in Bulgarian we have to rely on limited resources and shallow processing techniques. Because negated observations occur frequently in important parts of EHRs, a crucial element in the correct recognition of a patient's symptoms and diagnoses is determining whether a particular symptom/diagnosis is present or absent in the patient. In this paper we focus mainly on negation: its usage, classification and proper treatment. The paper is organized as follows: Section 2 presents related work. Section 3 describes the system architecture and its main modules. Section 4 focuses on the problems of negation treatment and describes the most frequently used types of negation in EHRs. Section 5 presents the main steps of the system's work, illustrated by an example. Section 6 contains a brief discussion of problems and unsolved cases. Section 7 outlines further work and conclusions.

2. Related work

Several language processing systems, which extract and codify information from electronic health records in English, have been developed. A good overview, comparison and evaluation of such systems can be found in [1, 2]. We will briefly mention the main language techniques and resources which have been

Page 10: PROCEEDINGS 2005.pdf · 2018-03-26 · in recall while incurring only a small loss in precision. Leroy et al. [3] developed a shallow parser that captures relations between noun phrases

Workshop Language and Speech Infrastructure in the Balkan Countries 2005 - Borovets, Bulgaria2

used. Some systems try to parse the whole sentence (LSP, Ménélas); others process only large segments of the sentence or local phrases (RECIT, SPRUS); the MedLEE system uses all of the above methods in succession. The systems use different amounts of syntactic, semantic, and domain knowledge, and their combinations vary considerably. Some include knowledge about sentence structure; others rely on semantic patterns/frames. It appears that methods based on the analysis of phrases rather than complete sentences show a substantial increase in recall while incurring only a small loss in precision. Leroy et al. [3] developed a shallow parser that captures relations between noun phrases in medical abstracts. The parser has a syntactic basis and extracts relations between all noun phrases (NPs) regardless of their type. It uses the AZ Noun Phraser with adjusted settings to extract medical nouns and NPs. The parser searches the texts for templates based on English closed-class words, i.e. prepositions (by, of, in), negation, conjunctions (and, or), and auxiliary or modal verbs. The extracted relations can contain up to five arguments: relation negation, left-hand side (LHS), connector modifier, connector, and right-hand side (RHS). For instance: NOT: Hsp90 (LHS), inhibit (connector), receptor function (RHS). The parser recognizes two types of negation: negation that precedes a verb phrase and negation that is part of a noun phrase. Note that negation is only recognized, not processed further: if a template contains a negation marker, the template is ignored. Another attempt at negation recognition in medical texts was made by Mutalik et al. [4], who created the Negfinder program, consisting of a lexer and a parser.
The lexer identifies many negation signals and classifies them according to properties such as whether they generally precede or succeed the negated concept and whether they can negate multiple concepts. A single token corresponds to each class. Taking into account the token and some grammar rules, the parser associates the negation signal with a single concept or with multiple concepts preceding or succeeding it. The parser relies on a restricted subset of context-free grammars

and partially parses the sentence, focusing on occurrences of concepts matched to the UMLS, negation signals, negation terminators, and sentence terminators; it treats most other words as fillers. The output of Negfinder shows that a large proportion of negations can be detected by a simple strategy, but the reliability of detection depends on the accuracy of concept recognition, which is hampered by composite concepts and homonyms. Negfinder achieves 91% precision and 96% recall. The authors conclude that "in most cases, errors made by Negfinder are easily correctable by syntactic methods and involve minor modifications of the lexer or the parser. However, in some cases semantic methods may be required, such as better characterization of temporary composite concepts using noun phrase detection combined with a rich semantic model of the domain."

Chapman et al. [5] propose a simpler algorithm (NegEx) than the one used in Negfinder, which can still detect a large portion of negations. The algorithm relies on a set of negation phrases and regular expressions to negate UMLS terms. Although it can easily be extended, the algorithm has lower recall, since it limits the number of words between the negation phrase and the UMLS term to five. It performs with 84% precision and 78% recall. NegEx has even lower precision when the negation phrase is "not". The authors of [6] attempted to improve the precision for "not" by using Naïve Bayes and Decision Tree machine learning algorithms. They analyzed a sample of sentences which NegEx inaccurately negated; the result of their analysis could be summarized in a simple rule whose addition to NegEx increases precision, although the analysis was restricted to this specific negation phrase.
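The core NegEx idea described above (a list of negation phrases plus a five-token window before the target term) can be sketched in a few lines. This is our illustration, not Chapman et al.'s code; the phrase list is an invented English stand-in.

```python
import re

# Minimal NegEx-style sketch: a term counts as negated if a negation phrase
# precedes it within a window of at most five tokens.
NEGATION_PHRASES = ["no", "not", "without", "denies", "absence of"]
WINDOW = 5  # maximum gap between negation phrase and term

def is_negated(sentence, term):
    tokens = sentence.lower().split()
    term_tokens = term.lower().split()
    for i in range(len(tokens) - len(term_tokens) + 1):
        if tokens[i:i + len(term_tokens)] == term_tokens:
            window = " ".join(tokens[max(0, i - WINDOW):i])
            return any(re.search(r"\b" + re.escape(p) + r"\b", window)
                       for p in NEGATION_PHRASES)
    return False

print(is_negated("patient denies chest pain", "chest pain"))   # True
print(is_negated("patient reports chest pain", "chest pain"))  # False
```

The window limit is exactly what costs NegEx recall: a negation phrase more than five tokens before the term is simply not seen.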

3. System Architecture

The main goal of MEHR (Maintaining Electronic Health Records) is to extract from EHRs in natural language (NL) all the information required for automatic generation of the Patient's Chronicle (symptoms and diagnoses). We propose an approach that combines information extraction with deep semantic analysis and allows us to determine the


negation, its scope, and to treat it appropriately. MEHR is a prototype system for the automatic recognition of medical terms and some facts in Bulgarian EHRs. This is the first step towards filling in a template concerning a patient's status. Automatic extraction of all the facts needed to describe the patient's status in full is our ultimate goal; in this paper, however, we focus on the proper treatment of negation. The MEHR architecture is strongly influenced by the specific structure of EHRs. The information in Bulgarian EHRs is organized into 11 ordered, predefined topics: Personal data, Anamnesis, Status, Examinations, Consultations, Debate, Treatment, Treatment results, Recommendations, Working abilities, Diagnosis. The average length of an EHR is about 2-3 pages. The MEHR system works in two modes: filling symptom scenario templates and filling diagnosis scenario templates. The current version of MEHR does not treat temporal structure and discourse, so it cannot determine relations between a diagnosis and the symptoms that caused it. The system uses the following resources: a lexicon, grammar rules, negation rules, a terminology bank, a shallow ontology of body parts, and templates. The MEHR architecture is shown in Fig. 1. The system consists of the following modules:

- A&C: Annotation and Chunking
- Post-Processing Module
- Negation Treatment Module
- Extractor
- Filling Scenario Templates Module

Each EHR is split into its topics, which are then sent as input to MEHR; the system processes one EHR topic at a time. The A&C module (programmed in Perl) annotates the text and extracts chunks from it. The annotation process is based on morphological analysis. It uses a lexicon of 50,000 lexemes containing all their wordforms. Words in the text are matched against lexicon entries, and for each word the module finds the basic form (lexeme) with its lexical and grammatical features. A chunk is a sequence of words that forms a syntactic group. We have defined a nominal chunk as an adektive followed by a noun, where an adektive is a word that

syntactically behaves as an adjective and agrees with the succeeding noun. Adektives describe attributes of the succeeding noun. Chunks are recognized by rules (regular expressions) that take into account the morphological features of words and their mutual positions. The module recognizes mostly nominal chunks (NPs).
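The nominal-chunk rule can be sketched as follows. This is a minimal illustration with invented tags and English tokens; the actual A&C module is written in Perl and operates on full Bulgarian morphological analyses.

```python
# Sketch of the nominal-chunk rule: an "adektive" (tag "A") followed by a
# noun (tag "N") forms an NP chunk. Tags and tokens are illustrative only.
def chunk_nps(tagged_tokens):
    """tagged_tokens: list of (word, tag) pairs; returns NP chunks."""
    chunks = []
    i = 0
    while i < len(tagged_tokens) - 1:
        word, tag = tagged_tokens[i]
        next_word, next_tag = tagged_tokens[i + 1]
        if tag == "A" and next_tag == "N":
            chunks.append((word, next_word))
            i += 2  # consume both members of the chunk
        else:
            i += 1
    return chunks

tagged = [("vesicular", "A"), ("breathing", "N"),
          ("without", "PREP"), ("rales", "N")]
print(chunk_nps(tagged))  # [('vesicular', 'breathing')]
```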

The output of the A&C module is a tagged text with information about the recognized nominal chunks. The lexicon used by A&C is expanded with medical terminology and words frequently used in EHRs. The Post-Processing Module determines some VP chunks, using the lexicon, verb frames for domain-important verbs, and grammar rules. The Negation Treatment Module inserts markers for negated phrases into the text and determines the scope of negation using negation rules; more detailed information about this process can be found in the next section. The output of the negation module serves as input to the Extractor, which determines the patient's symptoms or diagnoses with the help of the Terminology Bank, the Ontology for Diagnoses, the Shallow Ontology of Body Parts, and the Frozen Phrase Templates. In the current prototype, diagnosis extraction is in the initial stage of implementation. The Filling Scenario Templates Module tries to fill all obligatory fields in the Patient's Chronicle Template, and some optional fields when additional information is available.

4. Negation Treatment

We treat negation in the context of a sentence, so we briefly describe its semantics in this context. In Bulgarian, as in other languages, it is possible either to negate the whole statement or to negate individual components, i.e. arguments of the sentence predicate. In the former case the negation is general; in the latter it is partial. Negation is expressed by specific lexical means: particles that form a complex negated predicate, pronouns and adverbs, as well as verbs with the general semantics of "absence", "lack", "inhibit", etc. Below we give a general classification of negation in the Bulgarian language and discuss the different expressions of negation in EHRs.


4.1. Expressions of negation in Bulgarian

4.1.1. Surface markers of negation

General negation, i.e. negation of the verb action, is expressed by the negative particle не (not), which is considered part of the verb complex. The particles нито (neither) and ни (nor) merely repeat and intensify the negation. However, their use together with the negation itself shifts the scope of the negation from general to partial, i.e. to negation of the arguments of the main predicate. Compare the semantics of the following sentences: "He doesn't drink." vs. "He doesn't drink either wine or whiskey." The intensifying particles can also be used independently in a positive sentence, where they are semantically related to the succeeding word: "Neither drugs, nor any treatments can help him." Another way of negating a verb action in Bulgarian is to precede it with the preposition без (without) followed by the particle да (in fact this sequence corresponds to a double conjunction). Since the semantics is the same, MEHR treats не and без да in the same way.

Partial negation. The scope of partial negation extends to any other part of the predicate-argument structure, and to their attributes, but not to the predicate itself. The negation is again expressed by lexical means, which can negate:

- the presence/existence of some attributes of an object without the attribute being explicitly mentioned. The lexical means of expression are (i) negative adverbs and pronouns: никъде, никога, никой (nowhere, never, nobody), etc. In contrast to English, negative adverbs and pronouns always presuppose general negation, as they are always used together with negation of the predicate; (ii) prepositions with the meaning of absence, e.g. без (without). These prepositions negate the presence/existence of an attribute expressed by a noun; in this case the attribute is mentioned explicitly, e.g. "breathing without crepitation".

- the presence of an argument of some predicate, with focus on the absence itself and with the argument mentioned explicitly. The means of expression here are lexemes whose meaning incorporates the negation, e.g. няма (there is not, does not exist), the negation of има (there is, exists); липсва (missing, absent), the negation of "be available"; отрича (deny), the negation of "confess". Words that have a negative meaning but a positive form we call inherent negatives. In this kind of negation the negative particle не (not) is missing at the surface level.

4.1.2. Negation of complex syntactic units

- In coordinate sentences, the negative particle не (not) precedes each negated predicate, i.e. the number of simple sentences equals the number of negative particles не. The list of negated verbs is separated by commas or by the coordinating conjunction и (and); a contrastive conjunction но (but) is also possible. Example: "No changes were found and the patient has not been in hospital."

- Homogeneous parts of a sentence (members of a coordination): when the negative particle не (not) or без (without) precedes a chain of homogeneous sentence parts, the negation spreads to the whole syntactic group, not only to the nearest element. Example: без хирзутизъм, аменорея (without hirsutism, amenorrhea).

Fig. 1 MEHR Architecture


- Heterogeneous parts of a complex sentence: when the negative particle не precedes a chain of heterogeneous sentence parts, the negation spreads only to the element nearest to it.
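The two coordination rules (homogeneous chain: negate every element; heterogeneous chain: negate only the nearest element) can be sketched as a tiny scope-assignment function. This is our illustration with English glosses; chunk recognition is assumed to have already happened.

```python
# Sketch of the coordination scope rules: elements are the chunks following
# a negation marker, in surface order.
def assign_scope(elements, homogeneous):
    if homogeneous:
        # "without hirsutism, amenorrhea": negation spreads to every element
        return [("NEG", e) for e in elements]
    # heterogeneous chain: only the nearest element is negated
    return [("NEG", elements[0])] + [("POS", e) for e in elements[1:]]

print(assign_scope(["hirsutism", "amenorrhea"], homogeneous=True))
```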

4.2. Negation in EHRs and its treatment

Expressions of negation occur mostly in the Anamnesis, Status and Debate sections of EHRs. These sections contain crucial information about the disease, the patient's status, and the course of treatment. Anamnesis and Status contain descriptions of diseases with their symptoms (the symptom complex of a disease); the clinical chronicle is in the Debate (reasons for the disease, the patient's problems, treatments and results). Since these sections contain information that is very important for the patient's chronicle, correct interpretation of negation is needed.

We have analyzed negations in EHRs using surface markers of negation and a concordancer. The most frequently used negation markers are shown in Table 1.

Negation marker            Occurrences
не (no)                    350
без (without, missing)     200
отрича (deny)              35
липсва (missing, absent)   30

Total words in the sample: 55,000

Table 1. Most frequent negation markers
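A tally of this kind, behind the counts in Table 1, can be sketched in a few lines. The markers are shown transliterated and the sample text is invented; the authors' actual corpus totalled about 55,000 words.

```python
import re
from collections import Counter

# Count surface negation markers in a text sample (markers transliterated:
# ne, bez, otricha, lipsva). Purely illustrative, not the authors' data.
MARKERS = {"ne", "bez", "otricha", "lipsva"}

def negation_marker_counts(text):
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(t for t in tokens if t in MARKERS)
    return counts, len(tokens)

counts, total = negation_marker_counts("bez promeni ne se ustanovyavat")
print(dict(counts), total)
```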

Roughly, we can categorize negation into two types: direct negation and distant negation.

4.2.1. Direct negation
The negative marker directly precedes a verb chunk or a noun (nominal) chunk. In the first case, MEHR interprets the negation as the absence of the action or state expressed by the verb (e.g. "does not see"). In the second case, the negation is interpreted as negation of the whole nominal chunk, without specifying whether it negates the existence of the object or of its attribute. Usually the noun of the nominal chunk is the subject of the negation; however, in some cases (domain terms) the noun carries less semantic weight than the adektive.

Examples of direct negation: [Bulgarian examples, garbled in the source text]

In such cases we consider the whole noun chunk as the scope of the negation, without further deep processing. In the future we plan to separate noun chunks into sub-chunks and examine the most frequent sub-chunks. We expect the most frequent ones to be more meaningful, making them potential candidates for determining the negation scope within the chunks themselves.

MEHR extends the definition of a chunk to a compound chunk, combining nouns, noun chunks or both when a conjunction stands between them.

[Bulgarian example of a compound chunk]

For determining the scope, the system treats the compound chunk as a conjunction of simple ones:

[the same example split into two simple chunks]

In cases where the negation marker is followed by a bare noun: if the noun belongs to our terminology bank or body-parts ontology, the negation scope is the noun itself; otherwise the negation scope is unresolved. To solve the latter case we would have to recognize not only chunks but whole noun phrases (mostly nouns connected by prepositions).
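Taken together, the direct-negation rules of this subsection can be sketched as one dispatch function. This is our reconstruction with invented English examples; the term-bank and body-part entries are placeholders, not the system's actual resources.

```python
# Direct negation: marker + verb chunk -> absent action; marker + noun chunk
# -> whole chunk negated; compound chunk -> split at conjunctions, negate
# each part; bare noun -> negated only if known to the term bank/ontology.
TERM_BANK = {"edema", "cyanosis"}   # placeholder terminology entries
BODY_PARTS = {"liver", "spleen"}    # placeholder ontology entries

def direct_scope(kind, words):
    if kind in ("VP", "NP"):
        return [("NEG", " ".join(words))]
    if kind == "NP_COMPOUND":
        parts, current = [], []
        for w in words:
            if w in ("and", "or"):  # conjunction splits the compound chunk
                parts.append(" ".join(current))
                current = []
            else:
                current.append(w)
        parts.append(" ".join(current))
        return [("NEG", p) for p in parts]
    if kind == "NOUN":
        noun = words[0]
        if noun in TERM_BANK | BODY_PARTS:
            return [("NEG", noun)]
        return [("UNRESOLVED", noun)]

print(direct_scope("NP_COMPOUND", ["enlarged", "liver", "and", "spleen"]))
```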

4.2.2. Distant negation
Distant negation covers cases where the negative marker is at some distance from the subject of the negation, which hampers correct interpretation. At the surface syntactic level the negation is attached to one object, but at the semantic level it relates to another (shifting) or spreads to another (distribution). There is a relationship between the distant objects, and the treatment of the negation strictly depends on it. Since the negative particle не (not) is the most frequent negation marker in the EHRs, the following classification of relationships between distant negation objects relates to it. At present MEHR takes into account the


relationship between two objects, which we will call A and B by convention.

Predicate-argument relationship. B fills some valency role of A. In this case we believe there is a distribution of the negation over A and B. Examples:

[Bulgarian example sentences]

The treatment of the negation depends strictly on the semantics of the negated verb. For the most frequent verbs in EHRs we have defined templates that describe the verbs and their valency roles:

[three Bulgarian verb templates of the form X <verb> Y Z]

Such templates are also defined for all the inherent negatives (няма, липсва, отрича, and others). In total we have 18 verb templates.

Modal relationship. A explains B by means of a modal verb; B is either a verb or a noun derived from a verb. At the surface level the modal verb is negated, but semantically the action/state of the main verb is negated. MEHR treats this situation by applying a rule that uses a list of Bulgarian modal verbs. The negation scope is defined as B, with a certainty factor depending on the semantics of the modal verb. [Bulgarian example]
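The modal rule might look like this as code. Everything here is illustrative: the modal verbs are English stand-ins and the certainty factors are invented, not values from the paper.

```python
# Modal relationship: a negated modal verb A transfers the negation to its
# complement B, with a certainty factor reflecting the modal's semantics.
MODAL_CERTAINTY = {"can": 0.9, "wants to": 0.6, "tries to": 0.5}  # invented

def modal_negation(modal, complement):
    if modal in MODAL_CERTAINTY:
        return ("NEG", complement, MODAL_CERTAINTY[modal])
    return None  # not a known modal: the rule does not apply

# "cannot walk" -> the walking itself is negated, with high certainty
print(modal_negation("can", "walk"))
```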

Copula relationship. A is a semantically empty verb used as a syntactic link to B. In the restricted domain these empty verbs can be enumerated, and MEHR uses a list of them. [Bulgarian example]

Anaphora. Since MEHR tries to determine the negation scope using shallow processing only, it does not recognize negation in the case of anaphora between A and B. ������������ ������� � ������������ ������� �� � ���������

It is important to mention that surface negation markers that have the same interpretation are grouped into clusters. With each cluster we associate a semantic treatment. In the example above, all markers belong to the same cluster.

5. Extractor

A specially implemented Splitting module separates each EHR into 11 parts corresponding to the 11 topics mentioned above. We will focus on those that are important for MEHR processing. Anamnesis, Status and Debate contain information about the patient's symptoms (see Fig. 2). Anamnesis, Debate, Consultations and Diagnosis contain information about the patient's diagnoses. The Status part contains all current symptoms of the patient. After processing these parts, templates related to symptoms are filled in. They have the following structure: (body part - sympt1, sympt2, ...). Body parts are recognized using a terminology bank and a shallow ontology.

After careful analysis of the EHRs we found that some phrases have a special meaning and can serve as clues for the recognition of symptoms, key events, etc. We call these phrases "frozen phrases". To distinguish symptoms, we have defined templates for symptom frozen phrases. An example of such a template is:

��������� {�� ����� ��, ����������� ��, � ��������� ��, ��� ������ �}

The first part of the template is a verb that means "entering the hospital for treatment", and the second part is a list of synonym phrases after which symptoms are expected.
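A regex-based sketch of such a frozen-phrase template, with hypothetical English phrases standing in for the Bulgarian originals (the admission verb and the synonym list here are illustrative assumptions):

```python
import re

# Template: an admission verb followed by one of several synonym
# phrases, after which symptoms are expected (hypothetical phrases).
FROZEN_TEMPLATE = re.compile(
    r"(admitted)\s+(complaining of|with complaints of|because of)\s+(.*)")

def symptoms_after_frozen_phrase(sentence):
    """Return the text following the frozen phrase, where symptoms
    are expected, or None if the template does not match."""
    m = FROZEN_TEMPLATE.search(sentence)
    return m.group(3) if m else None

found = symptoms_after_frozen_phrase(
    "The patient was admitted complaining of headache and nausea.")
# 'headache and nausea.'
```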

The analysis also shows that certain phrases often occur after negation markers and serve as a conjunction between the markers and the negated objects, e.g. ��� ������ ��, ����� ��, �� ������� ���. We call these phrases stop/empty phrases, and for each such phrase we have a template as well.

6. Example

Let us assume that the MEHR system receives as input the following text from the Status part of a patient's EHR (negation words are in bold).

TEXT: ��� ��� �������� ����������.��������� ����� � ��������������� ����� �� �� �����������������. �������� ��� ����������������, ������� ����� ���������� �������. ���������������: ��������, ���� ������,��� ������ �������.

The A&C module annotates the given text and extracts chunks from it, using the following resources: rules presented as regular expressions for the recognition of nominal chunks, grammar rules, and a lexicon extended with medical terminology. The annotations of the input text are presented below, sentence by sentence:

ANNOTATION Sentence 1: ���{���.N+F:s,���.V+IPF+T:R1s} ���{���.PREP} ��������{�������.A+GR:sf,������.V+PF+T+NSE:Psf} ����������{����������.N+F:s}. Sentence 2: ���������{���������.A:sf} �����{�����.N+F:s} �{�.CONJ} ���������{���������.A+GR:p} ������{������.A:p} �����{�����.N+M:p} ��{��.PC} ��{��.PC,����.PRO+RFL:SA} ��������{��������.V+PF+T+NSE:R3s:I2s:E2s:E3s} ���������{�������.V+PF+T+NSE:Pp}. Sentence 3: ��������{�������.N+M:p} ���{���.PREP} ��������{�������.A+GR:p,������.V+PF+T+NSE:Pp} ��������{��������.N+M:p} �������{�������.V+IPF+I:R3p} �����{����.N+M:p} �{�.CONJ} ���������{���������.A:p} �������{�������.V+PF+T+NSE:R3s:I2s:E2s:E3s,�������.N+F:p} Sentence 4: ��������{��������.A+GR:sf} �������{�������.N+F:s} -{gb}��������{��������.A+GR:sf}, ����{����.A+GR:p} ������{���.N+M:p}, ���{���.PREP} ������{�����.A:sf} �������{�������.N+F:s}.

Listed below are the nominal chunks found by the A&C module and some VP chunks found by the Post Processing module.

FOUND CHUNKS: ��������_�������������������_��������������_������_�������������_�����������������_���������������_�����������_������������_�������

After that, the negation treatment module inserts <NEG> markers into the text for negated phrases and determines the scope of each negation using negation rules presented as regular expressions. The scope of the negation is marked with an initial marker <NEG> and a final marker </NEG>.
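Such regex-based scope marking can be sketched as below. The markers and the single pattern are hypothetical English stand-ins; the actual MEHR rules are Bulgarian and considerably richer than one expression.

```python
import re

# One illustrative scope rule: a negation word followed by a phrase,
# bounded by punctuation or the end of the sentence (assumption; the
# real rule set distinguishes many more cases).
NEG_PATTERN = re.compile(r"\b(no|without|not)\b\s+([\w ]+?)(?=[,.;]|$)")

def mark_negation(text):
    """Wrap each detected negation scope in <NEG>...</NEG> markers."""
    return NEG_PATTERN.sub(r"<NEG>\1 \2</NEG>", text)

marked = mark_negation("Lungs clear, no rales and wheezes.")
# 'Lungs clear, <NEG>no rales and wheezes</NEG>.'
```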

The extractor determines the patient's symptoms or diagnoses with the help of the Terminology Bank and Ontology for Diagnoses, the Shallow Ontology for Body Parts, and the Frozen Phrase Templates.

INTERNAL INTERPRETATION
Sentence 1: <Body Part = Noun> PREP <STATUS = NP chunk>.
Sentence 2: <Body Part = NP Chunk> & <Body Part = NP Chunk> <NEG> <Frozen phrase> <STATUS = ADJ> </NEG>.
Sentence 3: <Body Part = Noun> - PREP <STATUS1 = NP chunk> & <NEG> <STATUS2 = Noun> & <STATUS3 = NP Chunk> </NEG>
Sentence 4: Template: <Body Part = NP Chunk>: <STATUS1> & <STATUS2> & <NEG> <STATUS3 = NP Chunk> </NEG>

The Filling Scenario Templates module tries to fill all obligatory fields in the Patient's Chronicle Template, and some of the optional fields when additional information is available. For instance, the obligatory fields include information such as the patient's name, age, sex, address, list of previous diagnoses, etc. The optional fields can include status information for body parts whose data are not directly related to the current diagnosis, or are not important for its recognition. However, the status of some body parts is very important for diagnosis recognition, which is why those are marked in the scenario template as obligatory fields.

The scenario templates are represented in the MEHR system in XML format. The following table shows a part of the scenario template for a patient with a diabetes diagnosis. These data present information about body parts and their status. Data recognized as negated in the input text are marked in the Status field with the marker NEG. If there is information concerning the presence or absence of several symptoms for one and the same body part, a separate field is allocated in the scenario template for each "body part - status" pair.

RESULT Body part Status��� ��������

������������������� ����� NEG �������

������������������ �����������

NEG ����������������

�������� ��� ����������������

�������� NEG ������������� NEG ���������

�������.�������� ������� ���������������� ������� ���� ������
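Since the scenario templates are stored in XML, one "body part - status" field could be serialized as sketched below. The element names, the example body part, and the status value are hypothetical; the paper does not give the actual MEHR schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical serialization of one "body part - status" pair;
# negated statuses carry the NEG marker, as in the table above.
def add_status_field(template, body_part, status, negated=False):
    field = ET.SubElement(template, "field")
    ET.SubElement(field, "bodypart").text = body_part
    st = ET.SubElement(field, "status")
    st.text = ("NEG " if negated else "") + status
    return field

template = ET.Element("scenario")
add_status_field(template, "lungs", "rales", negated=True)
xml_text = ET.tostring(template, encoding="unicode")
# '<scenario><field><bodypart>lungs</bodypart><status>NEG rales</status></field></scenario>'
```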

7. Discussion

The presented system has been tested on about 70 EHRs containing 55,000 words in total. Despite the relatively low number of negation markers in the processed EHRs, their correct interpretation is crucial because:

- negation markers appear more often in those parts of the EHRs (Anamnesis, Status, Debate) that contain important information about the patient's status;

- a negation marker very often precedes a coordination phrase, so one marker negates not just a single item but a whole sequence of items.

The results of the analysis show that about 57% of negations were recognized correctly, 28% were recognized incorrectly, and about 15% were not recognized at all. In most cases incorrect recognition actually means an inaccurate interpretation of the negation scope. Careful analysis of the EHRs and the choice of frozen phrase templates have the highest influence on the interpretation process. In the example below, the first two items could be interpreted correctly by a suitable template, but the third one needs deeper semantic analysis.

Example: ����� ���������� �� �����������

����� ���������� �� ����� ������������

����� ���������� �� �������

8. Conclusion and further work

We have proposed an approach to the treatment of negation in EHRs that uses shallow processing (chunking only) in combination with deep semantic analysis at certain points. The choice of proper templates for the recognition and interpretation of negation influences the system's performance.

As further work, we plan to refine the chunking algorithm, enlarge the number of templates, and expand the knowledge resources of the system (lexicon, ontology, etc.).

References
[1] Friedman C., Hripcsak G., Shablinsky I. An evaluation of natural language processing methodologies. In Chute C.G., ed. Proceedings 1998 AMIA Annual Symposium. Philadelphia: Hanley & Belfus, 1998, pp. 855-859.
[2] Friedman, C. and Hripcsak, G. Natural language processing and its future in medicine. Academic Medicine. 1999;74(8), pp. 890-895.
[3] Leroy, G., Chen, H., and Martinez, J.D. A Shallow Parser Based on Closed-class Words to Capture Relations in Biomedical Text. Journal of Biomedical Informatics (JBI), vol. 36, pp. 145-158, June 2003.
[4] Mutalik P.G., Deshpande A., Nadkarni P.M. Use of general-purpose negation detection to augment concept indexing of medical documents. J Am Med Inform Assoc 2001; 8:598-609.
[5] Chapman, W.W., Bridewell, W., et al. (2001). A Simple Algorithm for Identifying Negated Findings and Diseases in Discharge Summaries. Journal of Biomedical Informatics 34(5): 301-310.
[6] Goldin, Ilya M., Chapman, Wendy W. Learning to Detect Negation with 'Not' in Medical Texts. Proc. Workshop on Text Analysis and Search for Bioinformatics at the 26th ACM SIGIR Conference (SIGIR-2003). Eds. Eric Brown, William Hersh and Alfonso Valencia.



Towards a Greek Dependency Corpus

Elina Desipri*, Prokopis Prokopidis*†, Maria Koutsombogera*, Harris Papageorgiou*, Stelios Piperidis*†

*Institute for Language and Speech Processing, Artemidos 6 & Epidavrou, 15125 Maroussi, Athens
†National Technical University of Athens, Iroon Polytexneiou 9, 15780 Zografou, Athens
{elina,prokopis,mkouts,xaris,stelios}@ilsp.gr

Abstract

In this paper we present work in progress on the construction of a resource that we provisionally call the Greek Dependency Treebank (GDT). Our approach involves annotation at the levels of syntax and semantics and is envisioned to result in an annotated reference corpus for the Greek language, compliant with existing resources for other languages. Taking into account multi-layered annotation schemes designed to provide deeper representations of structure and meaning, we describe the methodology followed at both the syntactic and the semantic layer, report on the annotation process and the problems faced, and conclude with comments on future work and exploitation of the resulting resource.

1 Introduction

Treebanks are widely recognized as important resources for the acquisition of linguistic knowledge. Moreover, they serve as training and testing material for several NLP applications. In recent years, corpora annotated with various types of semantic relations, like predicate-argument structure, co-reference links and pragmatic information, have come to be considered one of the key elements for many NLP tasks in which semantic representation is needed, including information extraction, question answering, summarization, etc. Large and medium-scale treebanks are already in place (Böhmová et al. 03; Kingsbury & Palmer 02; Simov et al. 03) while others are under development (Oflazer et al. 03).

Taking these facts into consideration, the work presented in this paper aims at the creation of a corpus for the Greek language, annotated at the levels of both syntax and semantics. Our approach incorporates and combines insights from recent work in the field, especially from the Prague Dependency Treebank (Böhmová et al. 03) and PropBank (Kingsbury & Palmer 02) annotation efforts.

The rest of the paper is structured as follows: sections 2 and 3 present the data selected and their preprocessing respectively. Section 4 reports on the syntactic representation, while section 5 discusses the semantic layer of our corpus. Section 6 puts our goal into perspective by focusing on further work and exploitation of the resulting resource.

2 Corpus Description

Our annotation corpus comprises texts that were collected in the framework of national and EU-funded research projects aiming at multilingual, multimedia information extraction. While building the annotation collection, we tried to satisfy the needs of these projects (by selecting texts from particular domains of interest) and, at the same time, to create resources that would constitute the first part of a reference corpus for Modern Greek, annotated at multiple levels. The main domains covered at this stage are manual transcripts of European parliamentary sessions, and web documents pertaining to the financial, health, and travel domains.

Each annotation file corresponds either to the full text of a web document or to a randomly extracted segment (30-60 sentences long in most cases) from parliamentary sessions. The total size of the currently annotated resource amounts to 70 KWs.

3 Data preparation

Annotators working on this collection were presented with data that had been preprocessed via an existing pipeline of shallow processing tools for Greek. This processing infrastructure (Figure 1) is based on both machine learning algorithms and rule-based approaches, together with language resources adapted to the needs of specific processing stages. Specifically, our processing tools include text handling (tokenization and sentence boundary recognition), part-of-speech tagging, lemmatization, chunk and clause recognition, and head identification modules. Development and performance information for these modules is given in (Papageorgiou et al. 02), while their use in the preparation of this particular annotated resource is detailed in the next section.

4 Syntactic representation in GDT

The first level of manual annotation focuses on surface syntax. At this level, we have opted for a dependency-based representation instead of a constituency-based one. Dependency analyses represent sentences as graphs where each word corresponds to a node in the graph. Sentences are prototypically headed by the verb of the main clause, which can have zero or more dependents. Words are direct dependents of their heads without any intermediate phrasal nodes. Arcs between heads and dependents are labeled according to the kind of relation (like subject and object) between the respective words, although it is common practice to assign the label to the dependent node, together with any other word-specific information like POS tags and lemmas.
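A labeled dependency representation of this kind can be sketched as a flat list of nodes, each carrying its head index and relation label. The field and label names below are illustrative assumptions, not the GDT file format; the sentence is the English gloss of the paper's Ex1.

```python
from dataclasses import dataclass
from typing import Optional

# Each word is a node: no intermediate phrasal nodes, and the relation
# label is stored on the dependent together with POS tag and lemma.
@dataclass
class Node:
    form: str
    lemma: str
    pos: str
    head: Optional[int]   # index of the head word; None for the root verb
    label: str            # e.g. "Pred", "Obj", "IObj"

# "They gave the culprit the chance (to escape)", simplified
sentence = [
    Node("gave", "give", "V", None, "Pred"),
    Node("culprit", "culprit", "N", 0, "IObj"),
    Node("chance", "chance", "N", 0, "Obj"),
]

dependents_of_pred = [n.form for n in sentence if n.head == 0]
# ['culprit', 'chance']
```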

Figure 1: Preprocessing Pipeline

Parsers and annotation efforts for the identification of dependencies between words (or between constituents) exist for a number of languages including Danish, Dutch, English, Slovene and Turkish, while probably the most well-known dependency annotation project is the Prague Dependency Treebank for the Czech language, developed by the Institute of Formal and Applied Linguistics (Böhmová et al. 03).

A dependency-based representation was chosen on the basis that it allows for more intuitive descriptions of a number of phenomena, including long-distance dependencies, as well as structures specific to free-word-order languages like Greek. At the same time, dependency representations seem to be more theory-neutral, since they are based on well-known traditional-grammar relations between words that annotators are usually quite familiar with.

Moreover, while available constituency-based approaches tend to focus on specific constructions, traditional textbook grammars describing the full range of Greek language phenomena are more compatible with dependency-based descriptions. The set of labels in our annotation schema is a derivative of the PDT labels, adapted to cater for Greek language structures. We compiled guidelines for the main syntactic structures of Greek after an initial study of randomly extracted selections from our corpus and of exemplary sentences from Greek textbook grammars. For illustration purposes, let us examine a sample sentence from our corpus:

Ex1. ������/give-3pl ���/the ��������/chance ���/to+the ������/culprit ��/to �������/escape
'They gave the culprit the chance to escape'

The basic node of the corresponding graph in Figure 2 is the verb "������", which is assigned the label Predicate and is attached to an artificial root node. Two words, "��������" and "������", are annotated as dependents of the Pred node and are assigned the labels Object and IndirectObject respectively. In a constituency-based scheme that does not allow discontinuous constituents, one would have to use some mechanism involving empty nodes and indices to describe the connection between the subordinate clause "�� �������" and the noun "��������" it depends on in Ex1. In many dependency-based schemas, however, the creation of non-projective trees with crossing arcs is allowed. Thus an intuitive and theory-independent representation of similar non-projective constructions is attainable.

For the initial generation of the dependency graphs that the annotators then correct, the following procedure is undertaken. After POS tagging and lemmatization, a pattern grammar compiled into finite-state transducers recognizes chunk and clause boundaries, while a head identification module based on simple heuristics takes care of spotting the heads of these structures and assigning labeled dependency links between the head words of each chunk and clause and the rest of the words inside their limits. Furthermore, the head identification module tries to assign dependency links between heads of different chunks or clauses inside the limits of the sentence. The output is thus a dependency graph that the annotators further enrich by providing the missing links of unattached words to their heads, and/or by correcting the automatically generated labeled dependencies.

Nine groups of 30 students in a postgraduate NLP course were each given a portion of equal size of the 70KW corpus to correct. All annotators used TrEd, an open-source tool developed in the framework of the PDT project (Pajas 05) that is suitable for the annotation of dependency graphs (Figure 2).

Figure 2: A sample dependency graph in TrEd

We plan to use this annotated resource for the development of an automatic dependency parser for Greek, following recent advances in the parsing literature (Klein & Manning 04; Yamada & Matsumoto 03).

5 Semantic representation in GDT

Following the syntactic annotation level, the next phase deals with the enrichment of the resource via Semantic Role Labeling and event type assignment. These subtasks are strongly related to each other; however, their outcomes could be independently exploited by various types of applications.

Semantic Role Labeling (SRL) is defined as the recognition and labeling of the arguments of a target predicate. Given a sentence, the task consists in identifying, extracting and labeling the arguments that fill the semantic roles of the predicates identified in the sentence. The approach adopted is data-driven, aiming at the automatic extraction of relational data; it does not try to justify a specific theory on empirical grounds. In this respect, we incorporate and combine insights from recent work in the field, especially from PropBank (Kingsbury & Palmer 02) and the Tectogrammatical Level of the PDT (Hajičová et al. 00).

5.1 Lexicon – Frames

Semantic Role Labeling relies heavily on the underlying predicate-argument structure. Recognition of this structure is often hindered by the existence of multiple different syntactic realizations for the same set of arguments.

Thus, in order to ensure consistent annotation of argument roles, a lexicon with semantic information for verb predicates was built prior to the SRL annotation phase of the GDT. The frameset descriptions in this resource were meant to serve as guidelines for the actual labeling procedure carried out by the annotators involved. Selection of the verbal predicates was determined by a) the frequency of the verbs in the corpus collection from which the annotation material was extracted, and b) analysis of the data with respect to further goals, i.e. extraction of facts for an end-user in an information extraction setting. This selection process yielded a list of approximately 800 verbs that we expect to denote events of interest. The next step concerned examining sentences from our corpus containing the target verbs. These instances were grouped into one or more major senses; each sense was turned into a single frameset, that is, a corresponding set of semantic roles, as illustrated in the following table.

"��������" (sense: submit)

Example: �� µ��� ����������� ������ ������������ (the members will submit the report to the Commission)

argument  label
0         ACT    µ��� (members)
1         THE    ������ (report)
2         ADDR   �������� (Commission)

Table 1: Frameset description of the verb ��������



Each role acquires a respective label from the list in Table 2. All possible syntactic realizations of a sense are grouped under the same frameset. Sense discrimination corresponds to the distinction of framesets and was based on both syntax and semantics. Specifically, we distinguished framesets taking into account a) the number of semantic roles (two verb meanings are distinguished as different framesets if they take a different number of arguments), b) the semantic role labels (same number of arguments but different labels), and c) the verb class and subclass (based on an adaptation of Levin's verb classes (Levin 93) for Greek). It should be noted that we intend for synonymous predicates to share similar role labels.

Label  Role
ACT    Actor
THE    Theme
PAT    Patient
BNF    Benefactive
EXP    Experiencer
ADDR   Addressee
RCP    Recipient
ATTR   Attribute
LOC    Location
TMP    Time
MNR    Manner
INSTR  Instrument
EXT    Extent
ENP    End Point
STP    Start Point
CAU    Cause
PNC    Purpose
SRC    Source
DST    Destination

Table 2: Argument labels
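The three frameset-distinction criteria (argument count, role labels, verb class/subclass) can be sketched as a comparison function. The dictionary representation and the example framesets are illustrative assumptions, not the actual GDT lexicon format.

```python
# Two senses belong to the same frameset only if they agree on all
# three criteria described above (the label check subsumes the count
# check, which is kept explicit to mirror criteria a-c).
def same_frameset(fs_a, fs_b):
    return (len(fs_a["roles"]) == len(fs_b["roles"])       # criterion a
            and fs_a["roles"] == fs_b["roles"]             # criterion b
            and fs_a["verb_class"] == fs_b["verb_class"])  # criterion c

# Hypothetical framesets for two senses of a verb
submit_sense = {"roles": ["ACT", "THE", "ADDR"], "verb_class": "send"}
yield_sense = {"roles": ["ACT", "THE"], "verb_class": "give"}
distinct = not same_frameset(submit_sense, yield_sense)
# True: different argument counts, labels and classes
```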

Each frameset is complemented by a set of comprehensive examples, extracted from the corpus, that illustrate the respective predicate-argument structure described in the frameset. The frameset resource was produced by a group of 3 linguists. As regards framing rates, framing each verbal predicate required approximately 10 minutes, although longer framing times were needed for highly polysemous verbs.

5.2 Corpus annotation

Based on the manually annotated syntactic dependencies assigned by the annotators, a new version of the resource was automatically generated. In this version, a new label is attached to the dependents of verbal elements, depicting their semantic relation to their head. It should be stressed that only verbal predicates are being targeted, keeping nominalizations for a later stage.

Preprocessing for this phase is implemented as a procedure run over the dependency-annotated data. It mainly involves assigning semantic roles to nodes annotated as Sb, Obj or IObj at the syntactic level, according to rules resulting from the analysis of the dependency-annotated data. Functional and auxiliary words remain attached to the head but are not assigned any role. The output of this phase is then manually corrected, using the TrEd tool, as in the previous phase.
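This rule-based preprocessing step can be sketched as follows. The default syntactic-to-semantic mapping and the functional POS set shown here are illustrative assumptions, not the actual rules derived from the GDT data.

```python
# Hypothetical default mapping from syntactic labels to semantic roles;
# the real rules were induced from the dependency-annotated data.
DEFAULT_ROLE = {"Sb": "ACT", "Obj": "THE", "IObj": "ADDR"}
FUNCTIONAL_POS = {"PART", "AUX", "CONJ"}

def assign_default_role(syntactic_label, pos):
    """Propose a semantic role for a dependent of a verbal predicate;
    functional/auxiliary words stay attached but receive no role."""
    if pos in FUNCTIONAL_POS:
        return None
    return DEFAULT_ROLE.get(syntactic_label)

role = assign_default_role("IObj", "N")
# 'ADDR' (to be manually corrected in TrEd where wrong)
```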

The annotation process is a two-pass procedure. The annotator team worked on the data for a period of 4 weeks. Apart from the frameset descriptions, the annotators were provided with guidelines describing the adopted annotation schema, in order to further ensure the consistency of the annotation process. These guidelines were enriched with indicative examples concerning the handling of several problematic cases, such as null subjects, passive and ergative constructions, alternations, and the distinction between similar labels, e.g. Recipient and Addressee.

Specifically, the annotators were asked to correct the automatically generated labels and to assign labels to all arguments attached to the verbal predicates of each sentence. One of the major issues encountered during the SRL annotation is the handling of null elements that were not annotated in the previous phase. We decided to introduce new nodes to restore only the null subjects, in order to fill important semantic roles such as actor and theme (e.g. in passive and ergative constructions).

Apart from arguments filling semantic roles, adverbial modifiers were also annotated during this phase. A list of the semantic labels assigned to these modifiers is provided in Table 3. These adjuncts, although annotated throughout the corpus, are not included in the frame files.

Label  Role
LOC    Location
TMP    Time
MNR    Manner
INSTR  Instrument
EXT    Extent
ENP    End Point
STP    Start Point
CAU    Cause
PNC    Purpose
SRC    Source
DST    Destination

Table 3: Adjunct labels



A sample of the semantic role labeling annotation is presented in figure 3.

It should be noted that the use of a syntactically annotated corpus as the basis of the semantic annotation proved to be very helpful, as most annotators became comfortable with the process within 2 days of work. The first pass was then checked and corrected according to the modifications that resulted from the problems encountered during the process.

The resulting resource will serve as training material for the development of a system that automatically recognizes the arguments of the predicates in a sentence and labels them with their semantic roles.

Figure 3: A sample of semantic role labeling

6 Future Work

The next step of the semantic annotation of the GDT concerns event type annotation and is still in progress. Our approach will be based upon the guidelines released by the LDC1 in the framework of the ACE project. Given that our corpus pertains to specific domains, we will not tag all the events of each sentence but only those that are indicative of the domains of interest, namely EU politics, travel and health. We thus plan to modify the ACE set of event types and subtypes in order to meet our needs. To this end, event annotation will be limited to a constrained set of types and subtypes. Our ultimate goal is to identify events that could be considered facts that clearly describe a specific domain of interest.

Specifically, the task involves spotting the verbal predicates that indicate "important" events in a sentence and assigning a type and subtype to each one of them. When identifying events, two basic concepts need to be clarified: the event extent and the event trigger. For our purposes, we define the event extent as the sentence within which the event is expressed, and the event trigger as the single-word verbal predicate that clearly expresses the event occurrence.

Apart from the annotation layers described above, we plan to enrich the GDT with coreference links concerning events and their participants as well. As regards event coreference, we will only deal with event identity relations.

Finally, apart from the annotation extensions described above, our goal in the following months is to compile a 300K-word corpus annotated with syntactic and semantic role information.

Acknowledgements

Work described in this paper was partly supported by the research project "Multimedia Content Management System" (MUSE), funded in the framework of Axis 3, Measure 3.3 of the Concerted Programme for Electronic Business of the General Secretariat for Research and Technology of the Greek Ministry of Development.

References

(Böhmová et al. 03) A. Böhmová, J. Hajič, E. Hajičová, and B. Hladká. The Prague Dependency Treebank: A Three-Level Annotation Scenario. In A. Abeillé (ed.) Treebanks: Building and Using Parsed Corpora. Kluwer, 2003.

(Carreras & Màrquez 04) X. Carreras and L. Màrquez. Introduction to the CoNLL-2004 Shared Task: Semantic Role Labeling. In CoNLL: Conference on Natural Language Learning, 2004.

1 http://www.ldc.upenn.edu/Projects/ACE/Annotation/2005Tasks.html



(Hajičová et al. 00) E. Hajičová, J. Panevová, and P. Sgall. A Manual for Tectogrammatic Tagging of the Prague Dependency Treebank. UFAL/CKL Technical Report TR-2000-09, Charles University, Czech Republic, 2000.

(Kingsbury & Palmer 02) P. Kingsbury and M. Palmer. From Treebank to PropBank. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, 2002.

(Klein & Manning 04) D. Klein and C. Manning. Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency. In Proceedings of the 42nd Annual Meeting of the ACL, 2004.

(LDC 05) LDC 2005. Automatic Content Extraction, http://www.ldc.upenn.edu/Projects/ACE/Annotation/2005Tasks.html

(Levin 93) B. Levin. English Verb Classes and Alternations: A Preliminary Investigation. University of Chicago Press, 1993.

(Oflazer et al. 03) K. Oflazer, B. Say, D. Hakkani-Tür, and G. Tür. Building a Turkish Treebank, Invited chapter in A. Abeillé (ed.) Treebanks: Building and Using Parsed Corpora. Kluwer, 2003.

(Pajas 05) P. Pajas. Tree Editor TrEd. http://ckl.mff.cuni.cz/pajas/tred/, 2005.

(Papageorgiou et al. 02) H. Papageorgiou, P. Prokopidis, I. Demiros, V. Giouli, A. Konstantinidis, and S. Piperidis. Multi-level XML-based Corpus Annotation. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, Las Palmas, 2002.

(Simov et al. 03) K. Simov, P. Osenova, S. Kolkovska, E. Balabanova, and D. Doikoff. Language resources for the creation of a Bulgarian Treebank. In Workshop on Balkan Language Resources and Tools, Thessaloniki, 2003 (satellite event to the Balkan Conference on Informatics - BCI 2003).

(Stamou et al. 03) S. Stamou, V. Andrikopoulos, and D. Christodoulakis. Towards Developing a Semantically Annotated Treebank Corpus for Greek. In Proceedings of The Second Workshop on Treebanks and Linguistic Theories (TLT2003), 14-15, Växjö, 2003.

(Yamada & Matsumoto 03) H. Yamada and Y. Matsumoto. Statistical Dependency Analysis with Support Vector Machines. In Proceedings of the 8th International Workshop on Parsing Technologies, 2003.



Building Multilingual Terminological Resources

Maria Gavrilidou*, Penny Labropoulou*, Monica Monachini+, Stelios Piperidis*, Claudia Soria+

* ILSP, Artemidos 6 - 15125 Athens
{maria, penny, spip}@ilsp.gr

+ CNR-ILC, Via Moruzzi 1 - 56124 Pisa
{monica.monachini,claudia.soria}@ilc.cnr.it

Abstract

Availability of Language Resources (LRs) for the development of Human Language Technology (HLT) applications is recognized as a critical issue with political and economic impact, as well as implications for cultural identity. This paper reports on the experience gained during the INTERA European project in the production of multilingual language resources (MLRs), namely parallel corpora and terminological lexicons, for less widely available languages, i.e. those languages that suffer from poor representation over the Internet and from scarce computational resources, but are nonetheless requested by the market. It discusses the procedure followed within the project, focuses on the problems faced which had an impact on the initial goals, presents the necessary modifications that resulted from these problems, evaluates the market needs as attested by various surveys, and describes the methodology proposed for the efficient production of MLRs.

1 Introduction

LRs are central components for the development of HLT applications. The availability of adequate LRs for as many languages as possible therefore plays a critical role in the development of a truly multilingual information society. It is well known that producing LRs is an extremely long and expensive process. For most western languages, and English in particular, this is partly mitigated by the availability of large and often free raw data, with good representativeness and significant size, together with the availability of language processing tools, which lower the cost and time of LRs production. There are plenty of languages, however, for which this picture is far from true: many languages suffer from poor representation and scarcity of raw material, not to mention the lack of robust processing tools. Many terms have been used for these languages, including "less spoken", "less used", "minor" languages, etc., all having a slightly

pejorative aspect. Since the amount of digitised language material reflects neither the number of speakers nor the "importance" of a language (all languages are important!), a more accurate term, we believe, is "less widely available in the digital world" (Gavrilidou et al. 03; Gavrilidou et al. 04). The notion of "less widely available in the digital world" is by no means to be interpreted as a synonym of "less widely spoken languages"; in fact, many of the languages to which the former concept seems to apply, such as eastern European and Balkan languages, actually appear among the forty most widely spoken languages in the world (http://www.globallanguages.com). For instance, Russian, Polish, Ukrainian, Romanian and Serbo-Croatian rank 8th, 22nd, 23rd, 32nd and 37th respectively.

This concept was developed in response to a survey conducted in the framework of the INTERA project, aimed at identifying user needs and expectations concerning LRs. Although western European languages were confirmed as being in very high demand, the survey clearly demonstrates that there is also increasing demand for Balkan and eastern European languages. Nevertheless, these languages still seem to be rather underrepresented as regards digital content, although there is an increase in Balkan LRs production - a tendency encouraged by national and EU activities as well - as attested by surveys conducted in the framework of other European projects (e.g. ENABLER, www.enabler-network.org). Despite their limited availability, these languages are highly requested by the digital content market, thus making the issue of resource production even more crucial.

This paper presents the results of the INTERA project (Integrated European language data Repository Area, an eContent programme; http://www.elda.org/rubrique22.html), which had a twofold task: on the one hand, it aimed at building an integrated European LRs area by connecting


existing data centers, and on the other, at producing multilingual parallel corpora and terminological lexica for some of the "less available" languages. This endeavour was additionally interpreted as a case study for the development of a production model for MLRs that is up to date, compliant with existing standards, and viable and attractive for digital content producers. More specifically, in this paper we focus on the second axis of the project, presenting the experience gained during the INTERA project in the production of a model for multilingual parallel corpora and terminologies extracted from them.

2 LRs production specifications

2.1 User needs

The identification of user needs and requirements has relied heavily on the results of a number of previous initiatives to roadmap the state of the art in MLRs, in combination with new initiatives undertaken in the framework of the project and targeted at the eContent world. The surveys conducted in the framework of the ENABLER project (Maegaard et al. 03; Gavrilidou & Desipri 03; Calzolari et al. 04) provided insights into the existence and availability of different types of LRs, language demand, domains of interest, standards employed, etc. Other surveys aiming at determining the needs of users with respect to available and potentially available LRs (http://www.elra.info/), or surveys available over the Internet through the sites of international organizations such as LISA and IDC or consultancy firms (http://www.globalsight.com; LISA 01; LISA/AIIM 01; LISA/OSCAR 03), shed light on the availability of resources and relevant tools.

The information elicited from these surveys was coupled with a study of the activities of eContent professionals as regards LRs, conducted in the framework of INTERA (Gavrilidou et al. 04). The main areas of the study concerned the types of LRs eContent professionals are interested in, the domains and languages of interest, and, most importantly, policies concerning the way they acquire, use and exploit LRs and tools.

2.2 Specifications

Specifications as regards the INTERA MLRs production have been based on the following factors:

• the study of the target group (eContent professionals), in order to cater for user needs;
• the intended application of automatic terminology extraction.

2.2.1 Text and term selection specifications

As regards the composition of the textual and terminological resources, the following specifications have been defined in the framework of the project:

• domains: eContent users are more interested in specialized domains than in general LRs; in this respect, the survey results showed health/medicine, tourism, education, law, the automotive industry and telecommunications to be the prevailing domains. In the framework of INTERA, we focused on the domains of tourism, health, education and law, which correspond to the predominant digital activities (eTourism, eHealth, eLearning, eGovernment and eCommerce).

• languages: the focus being on languages less widely available over the Internet, the selected set was Bulgarian, Greek, Serbian and Slovene, combined with English.

• standards: the use of standards is appreciated by eContent professionals, since it permits reusability and interoperability, and has therefore been a crucial criterion in the selection of corpus annotation practices and terminology encoding.

• terminologies: cross-language equivalents are considered essential information, while definitions, conceptual relations, a domain code indication, and reliability codes are seen as desirable.

An important issue concerning the corpus composition, in relation to the degree of textual equivalence across languages, is raised by the intended application of terminology extraction. Data availability, representativeness and size, together with the availability of language processing tools, are crucial factors to be taken into account in a realistic production model, since they are variables with strong repercussions on the task and process of corpus production and terminology extraction. Different possible scenarios were foreseen for the production of multilingual resources, mainly as regards the composition of the corpus, all of them having an impact on the task of term extraction. The quality of the multilingual terminological resources extracted from a corpus differs dramatically depending on the composition of this corpus and, more specifically, on the following points: whether a corpus is parallel, i.e. built up from exact translations of the same text in all the corpus languages, or comparable, i.e. set up from pairs of texts in the same domain; and whether there is a unique pivot language available or


not. The possible scenarios envisaged can be summarized as follows:

• in the "ideal" scenario, the extraction task can be performed working on parallel specialized texts with a pivot language for which NLP tools and resources are available. In this situation, truly multilingual terminology can be the goal, where terms will be interconnected.

• in the "worst" scenario, there are no truly parallel texts but sets of bilingual parallel texts without any pivot language. In this case, term extraction would have to rely only on statistical procedures for term recognition. The greatest risk of this solution would be to produce a list of candidate terms with a high amount of noise or silence(1), and where human involvement would necessarily be massive.

• in the "mid-way" scenario, where sets of comparable corpora are available, the resulting terminological lexicon is not a truly multilingual homogeneous one, but a set of terminological lexicons in different languages, where the terms are likely not to be the same throughout the lexicons.

2.2.2 Text processing specifications

The specifications for text processing have been based on the requirements of the intended application, i.e. the extraction of terminology, and involve the following tasks:

• automatic alignment of the texts at sentence level, coupled with human validation by language experts;
• structural annotation: segmentation at sentence level, and metadata information that will be required for the distribution and re-use of the corpus;
• linguistic processing: below-Part-of-Speech tagging (i.e. grammatical category and morphological features, e.g. gender, number, case, etc. for nouns) and lemmatization.

2.2.3 Text encoding specifications

To ensure re-usability of the collected and processed material, compliance with the following internationally accredited standards was decided:

• alignment conforms to the XML-compliant TMX standard (Translation Memory eXchange, http://www.lisa.org/tmx/) (see sample text in Figure 1);

• for the external annotation, the IMDI metadata schema (IMDI 03, http://www.mpi.nl/world/ISLE/schemas/schemasframe.html) has been selected;

• the internal structural annotation adheres to the XCES standard, i.e. the XML version of the Corpus Encoding Standard (XCES, http://www.cs.vassar.edu/XCES/, and CES, http://www.cs.vassar.edu/CES/CES10.html);

• the linguistic annotation (see Figure 2) also adheres to the XCES standard, which incorporates the EAGLES guidelines for annotation at the morphosyntactic level (http://www.ilc.cnr.it/EAGLES96/home.htl).

(1) Noise and silence are commonly used in terminology as the complements of precision and recall respectively.

3 Corpus production

3.1 Text collection

Previous surveys identifying existing LRs, as well as a search over the Internet, attested the scarcity of available resources in the selected languages and domains. The identification process of potential sources had:

• to cover a variety of sources of interest to the eContent society;
• to cater for clearance of Intellectual Property Rights (IPR) issues.

The ideal candidates, in this respect, mainly consist of texts available over the Internet, provided by organizations that wish to make their own material available in more than one language. A more careful investigation of web texts, however, showed the following shortcomings as regards the project desiderata:

• although the Internet is rapidly becoming multilingual, it is not yet parallel, i.e. most sites are still monolingual, with languages other than English increasingly entering the Internet;

• when a site includes multiple language versions, this still does not mean that it is parallel, i.e. that it includes exactly equivalent texts across languages: a closer inspection of seemingly parallel web texts showed that only sporadic parts of them were truly parallel, with free translation putting an extra burden on the processing;

• most multilingual sites include language versions only in the more widely spoken languages (i.e. English, Spanish, French, etc.), thus limiting the possibility of finding truly parallel raw data in all the languages involved in the project;

• further problems arise from the fact that texts may contain large parts of foreign language material.


<?xml version="1.0" encoding="utf-8" standalone="yes" ?>
<tmx version="1.3">
 <header creationtool="Align2TMX" creationtoolversion="1.3" segtype="sentence"
         datatype="plaintext" o-tmf="Align2TMX 1.0" adminlang="EN" srclang="EN" />
 <body>
  <tu>
   <prop type="Domain">Files: 01BNG-EN-id.xml 01BNG-SR-id.xml</prop>
   <tuv xml:lang="EN" creationid="n17" creationdate="20040601T000000Z">
    <seg>The designations employed in this publication and the presentation of the material do not imply on the part of UNICEF the expression of any opinion whatsoever concerning the legal status of the country or territory, or of its authorities, or the delimitations of its frontiers.</seg>
   </tuv>
   <tuv xml:lang="SR" creationid="n17" creationdate="20040601T000000Z">
    <seg>Odrednice upotrebljene u ovoj publikaciji i prezentacija materijala ne odražavaju odnos UNICEF-a prema legalnom statusu države, prema njenoj teritoriji, pitanjima razgraničenja ili vladajućim strukturama.</seg>
   </tuv>
  </tu>
 </body>
</tmx>

Figure 1: Sample Serbian-English aligned text
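By way of illustration, aligned segment pairs in such a TMX file can be read back with a standard XML parser. The following sketch is ours, not part of the INTERA toolchain; the helper name and the embedded miniature sample are assumptions for demonstration only:

```python
import xml.etree.ElementTree as ET

# xml:lang is in the reserved XML namespace
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

TMX_SAMPLE = """<tmx version="1.3"><body>
<tu>
  <tuv xml:lang="EN"><seg>Designations employed in this publication ...</seg></tuv>
  <tuv xml:lang="SR"><seg>Odrednice upotrebljene u ovoj publikaciji ...</seg></tuv>
</tu>
</body></tmx>"""

def read_pairs(tmx_text, src="EN", tgt="SR"):
    """Yield (source, target) sentence pairs from a TMX document."""
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.iter("tuv")}
        if src in segs and tgt in segs:
            yield segs[src], segs[tgt]

pairs = list(read_pairs(TMX_SAMPLE))
print(pairs)
```

Because every translation unit carries language-tagged variants, pair extraction reduces to grouping `<tuv>` elements by their `xml:lang` attribute.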

Given the above observations, cooperation with data centers with expertise in the area of LRs for the specific languages was sought, in order to ensure content quality, both for the corpus construction tasks (selection, encoding and validation processes, especially as regards alignment validation and linguistic processing) and for term validation.

3.2 Corpus composition

The project aimed at the construction of a multilingual parallel corpus of 12 million words in total. The actual scenario we were confronted with was similar to the third one described above: the final collection is a comparable multilingual corpus rather than a parallel multilingual corpus. Instead of having the same texts in all languages, pairs of text collections belonging to the same domains were produced. English, the pivot language, always represents one member of the pair, while the other is Greek, Bulgarian, Slovene, or Serbian (see Table 1).

Table 1: Domain coverage per language. Languages: Greek, Bulgarian, Serbian, Slovene; domains: Law, Health, Education, Tourism, Environment, Finance, Politics.

3.3 Text processing

The texts are aligned, structurally annotated at the sentence level, below-PoS tagged and lemmatized (see the sample linguistically annotated text in Figure 2).

<par>
 <s id="n1">
  <tok id="tok_1_1"> <orth>Varšavska</orth> <base>varšavski</base>
   <ctag>A+PosQ+Top+PGgr</ctag> <msd>Ar________-_--t-</msd> </tok>
  <tok id="tok_1_2"> <orth>deklaracija</orth> <base>deklaracija</base>
   <ctag>N</ctag> <msd>N-_-___------</msd> </tok>
  <tok id="tok_1_3"> <orth>za</orth> <base>za</base>
   <ctag>PREP+p2+p4</ctag> <msd>Sp____s</msd> </tok>
  <tok id="tok_1_4"> <orth>Evropu</orth> <base>Evropa</base>
   <ctag>N+NProp</ctag> <msd>Np_-___------</msd> </tok>
  <tok id="tok_1_5"> <orth>bez</orth> <base>bez</base>
   <ctag>PREP</ctag> <msd>Sp____-</msd> </tok>
  <tok id="tok_1_6"> <orth>duvana</orth> <base>duvan</base>
   <ctag>N</ctag> <msd>N-_-___------</msd> </tok>
  <tok id="tok_1_7"> <orth>,</orth> <base>,</base>
   <ctag>PUNCT</ctag> <msd>F</msd> </tok>
  <tok id="tok_1_8"> <orth>2002</orth> <base>2002</base>
   <ctag>NUM</ctag> <msd>Xn__</msd> </tok>
  <tok id="tok_1_9"> <orth>.</orth> <base>.</base>
   <ctag>PUNCT</ctag> <msd>F</msd> </tok>
 </s>
</par>

Figure 2: Sample of Serbian linguistically annotated text
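For illustration, the surface form, lemma and coarse tag of each token in such an annotated file can be pulled out with a few lines of standard XML handling; the helper below is a sketch of ours (not an INTERA tool), using element names as they appear in Figure 2:

```python
import xml.etree.ElementTree as ET

SAMPLE = """<par><s id="n1">
 <tok id="tok_1_5"><orth>bez</orth><base>bez</base><ctag>PREP</ctag><msd>Sp____-</msd></tok>
 <tok id="tok_1_6"><orth>duvana</orth><base>duvan</base><ctag>N</ctag><msd>N-_-___------</msd></tok>
</s></par>"""

def tokens(xml_text):
    """Return (surface form, lemma, coarse tag) triples for each <tok>."""
    root = ET.fromstring(xml_text)
    return [(t.findtext("orth"), t.findtext("base"), t.findtext("ctag"))
            for t in root.iter("tok")]

print(tokens(SAMPLE))
# [('bez', 'bez', 'PREP'), ('duvana', 'duvan', 'N')]
```

The `<base>` elements are what the lemma-frequency comparison of Section 4.2 operates on.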


The most important point of this process is the harmonization of the different tagsets used, so that the subsequent processing stage of term extraction can deploy the same tools for all languages. In conformance with the methodology adopted in the project, i.e. re-using existing material wherever possible with the least possible intervention, so as to ensure time and cost efficiency, it was decided to re-use only existing tools for each language, without making any modifications to the tools themselves, but only converting their output. Therefore, the last stage of the procedure is the harmonization of the output with regard to the morphosyntactic tags employed by each tagger, where all tagsets are mapped to one (the "INTERA tagset"), based on the EAGLES guidelines. This task mainly involved re-ordering and mapping of tags to a common tagset, while no theoretical assumptions have been forced upon any of the tagsets: where possible, grammatical features and values corresponding to the same notion are mapped to the same common tag (e.g. nominative case is represented by the same symbol, elicited from the EAGLES tagset), while language-specific and/or tagger-specific features and values form an extension to the core tagset, following the EAGLES principles.
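Such a mapping can be sketched as a simple lookup table from (tagger, tag) pairs to common tags. The tagger names and tag symbols below are invented for illustration; the actual INTERA tagset is not reproduced here:

```python
# Illustrative only: these tagger-specific tags and common tags are made up;
# the real INTERA tagset follows the EAGLES guidelines.
TAG_MAP = {
    ("serbian_tagger", "N+NProp"): "Np",   # proper noun
    ("serbian_tagger", "PREP"):    "Sp",   # preposition
    ("greek_tagger",   "NoPr"):    "Np",   # same notion, same common tag
}

def harmonize(tagger, tag):
    """Map a tagger-specific tag to the common tagset; unmapped tags are
    kept as a tagger-prefixed extension, as EAGLES allows extensions."""
    return TAG_MAP.get((tagger, tag), f"{tagger}:{tag}")

print(harmonize("serbian_tagger", "N+NProp"))  # Np
print(harmonize("serbian_tagger", "NUM"))      # serbian_tagger:NUM
```

The point of the table-driven design is exactly what the text describes: the taggers themselves are untouched, and only their output is converted.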

4 Terminological resources production

4.1 Theoretical issues

Terminology can be considered the surface realization of relevant domain concepts (Cabré 92; Sager 90). Candidate terminological expressions are identified either by hand or in a semi-automatic manner. Semi-automatic procedures for terminology extraction usually consist of shallow techniques that range from stochastic methods to more sophisticated syntactic approaches (Jacquemin 01; Bourigault et al. 01). All of them, however, converge in identifying terms mostly on statistical grounds, on the basis of their relative frequency in a corpus, possibly augmenting this measure with filters capturing the domain specificity of a term. Although not theoretically correct (as the status of "termhood" is in principle independent of the number of occurrences, and a hapax might well be a term), this practice is rooted in computerized terminology, where computer-aided text analysis and the possibility of processing large amounts of information have changed the bases of terminology compilation, especially with respect to the way the appropriateness of terms is conceived and the degree of human intervention in the whole process. In this particular context, we adopted a hybrid approach to

terminology extraction from parallel texts, combining statistical and symbolic techniques.

The size of the available data is important for determining the coverage of the terminological resource, since more data mean more terms. However, it is also important for the quality of the terminological resource, as the automatic procedure needs an amount of data reaching a level of statistical relevance to yield high-quality results. Unfortunately, the available data differed dramatically in size both across domains and across languages. The richest domain is law (129 MB), followed at a distance by education (20 MB) and health (14 MB). This difference among domains has an obvious consequence for the overall number of terms that can be made available as a result of the extraction process. In other words, the resulting domain-specific terminologies differ considerably in size and hence in term coverage.

The task of automatic term extraction was organized around three main phases:

1. automatic extraction of terms from the English components of the parallel corpora - the English language is defined as the "pivot language";
2. automatic identification of candidate translators in the target languages;
3. manual verification of the candidate translators found with the automatic procedure.

4.2 Extraction of English candidate terms

The objective of the first step is the identification of terms for a given sub-language; it is assumed that these terms represent those that are most probably particular to a specific domain. Under this assumption, the terms identified will represent the candidate terms for a specialized (domain-specific) lexicon. Candidate single terms are extracted by comparing the relative frequency of lemmas (i.e. the sum of the frequencies of all inflected types of a given word) inside each domain- and language-specific sub-corpus against a lemma-based frequency lexicon of the British National Corpus, which was used as a reference corpus.

In more detail, the comparison between the frequency distributions of terms in the general lexicon and those of the different domain-specific lexicons was performed by adopting a mathematical function evaluating the distance of the frequency of domain-specific terms from the frequency expected on the basis of the general lexicon. We compared the lists generated by adopting several different mathematical


formulae, among which are the following:

d1 = fr(specialized lexicon) - fr(general lexicon)
d2 = fr(specialized lexicon) / fr(general lexicon)
d3 = log( fr(specialized lexicon) / fr(general lexicon) )

where fr represents the relative frequency of a term inside the lexicon.
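The three distance measures can be sketched as follows. This is an illustration under assumptions, not the project's implementation: the toy frequency dictionaries stand in for a domain sub-corpus and the BNC reference lexicon, and the smoothing constant for unseen lemmas is our addition:

```python
import math

def relative_freq(counts):
    """Turn absolute lemma counts into relative frequencies."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def score_terms(domain_counts, general_counts, measure="d3"):
    """Rank lemmas of a domain sub-corpus against a general reference
    lexicon (standing in for the BNC frequency list), using d1/d2/d3."""
    dom = relative_freq(domain_counts)
    gen = relative_freq(general_counts)
    scores = {}
    for lemma, fr_dom in dom.items():
        fr_gen = gen.get(lemma, 1e-9)  # smoothing for unseen lemmas (our choice)
        if measure == "d1":
            scores[lemma] = fr_dom - fr_gen
        elif measure == "d2":
            scores[lemma] = fr_dom / fr_gen
        else:  # d3
            scores[lemma] = math.log(fr_dom / fr_gen)
    return sorted(scores, key=scores.get, reverse=True)

domain = {"treaty": 40, "the": 300, "directive": 25}        # toy law sub-corpus
general = {"the": 60000, "treaty": 5, "directive": 2, "cat": 50}  # toy reference
print(score_terms(domain, general)[:2])  # domain-specific lemmas rank first
```

Note how the ratio-based measures (d2, d3) push function words like "the" to the bottom even when they dominate the domain counts, which is the behaviour the comparison against a general lexicon is meant to achieve.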

4.3 Extraction of candidate translators

Once candidate terms have been identified for English, we turn to the task of automatically identifying candidate translators in the target languages. To this end, we exploited the structuring information available in the parallel corpora from which the terminology was to be extracted. Since the sentences in the target language texts are aligned to those of the pivot language, it is easy to select a suitable search space for any candidate term. The algorithm for the extraction of candidate translators consists of the following steps:

a. selection of the source region set from the pivot language corpus;
b. extraction of the target region set from the target language corpus;
c. extraction of lemmas from the target region set;
d. ordering of the lemmas contained in the target region set according to a ranking function;
e. selection of candidates.

Given an English candidate term t, the target region set inside the target language corpus is easily identified: each region of the English corpus containing the term t is uniquely associated with a region of the target language corpus. Then, the lemmas from the target region set are extracted, filtering out lemmas belonging to "non-significant grammatical categories" (e.g. conjunctions, prepositions). It was observed that the target language lemmas can be classified on the basis of their "probability" of being a translation of a given term by means of simple frequency analyses. This classification is obtained through the synthesis of a ranking function. Several hypotheses were considered, all of them aiming at highlighting the statistical "idiosyncrasies" of the translating lemma. The best performing measure is the following:

f(l) = r(l) - q(l) * |I|

where r(l) is the number of regions of the target region set containing at least one occurrence of lemma l, q(l) is the ratio between the number of regions containing lemma l and the total number of regions in the corpus, and |I| is the total number of regions of the target region set.
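The ranking function above can be sketched directly. In this illustration (ours, not the project's code), the target corpus is a list of regions, each represented as a set of lemmas, and the aligned target region set I is given as a list of region indices; the Serbian lemmas are invented for the example:

```python
def rank_candidates(target_regions, target_index_set, stoplemmas=frozenset()):
    """Rank target-language lemmas as candidate translations of a term,
    scoring each lemma l with f(l) = r(l) - q(l) * |I|.

    target_regions:    all regions of the target corpus, each a set of lemmas
    target_index_set:  indices of the regions aligned to source regions
                       containing the English term (the target region set I)
    """
    n_corpus = len(target_regions)
    region_set = [target_regions[i] for i in target_index_set]
    lemmas = set().union(*region_set) - stoplemmas  # drop non-significant categories
    scores = {}
    for l in lemmas:
        r = sum(l in reg for reg in region_set)                  # r(l)
        q = sum(l in reg for reg in target_regions) / n_corpus   # q(l)
        scores[l] = r - q * len(region_set)                      # f(l)
    return sorted(scores.items(), key=lambda kv: -kv[1])

corpus = [{"ugovor", "evropa"}, {"ugovor", "zdravlje"},
          {"evropa", "pravo"}, {"ugovor", "pravo"}, {"zdravlje"}]
# Suppose regions 0, 1 and 3 are aligned to English sentences with the term:
print(rank_candidates(corpus, [0, 1, 3])[0])  # best-scoring candidate
```

The subtracted term q(l)*|I| penalizes lemmas that are frequent everywhere in the corpus, so only lemmas concentrated inside the target region set score highly.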

4.4 Term validation and encoding

Following the task of term extraction, human experts, all native speakers of the selected languages, validated the automatically extracted candidate term lists.

Since compliance with a reliable standard is a prerequisite for ensuring reusability and exchangeability of data, the TMF family of formats was taken as the reference model for encoding the terminological entries. TMF stands for Terminological Markup Framework (ISO 16642 2001), an international standard designed in the framework of the ISO initiatives to support the creation and use of computer applications for terminological data and the exchange of such data between different applications. Being a meta-model for terminology markup, TMF allows for the specification of user-defined markup languages (called TMLs). A TML makes it possible to design the encoding format of a terminological collection according to particular needs. In designing the INTERA TML, we tried to harmonize user needs with realistic considerations when dealing with under-represented languages.

The terminological collection is organised into eight packages, each of them corresponding to a particular domain (see Table 2). For each domain, there are as many archives as there are combinations of languages that yielded valid terms. For instance, if for a given domain there are valid terms for Greek, Serbian and Bulgarian, there will be as many files as there are intersections of terms in the tuples English-Greek, English-Serbian, English-Bulgarian, English-Greek-Serbian-Bulgarian, English-Greek-Serbian, English-Greek-Bulgarian and English-Serbian-Bulgarian. Moreover, separate archives are available for single- and multi-word terms.

A quadrilingual terminology is feasible only for Law. Education yields a trilingual terminology, and Health a bilingual one. The other domains yield monolingual terminologies. Table 2 summarizes the distribution of terms over domains and languages.

Domain         Greek    Bulg    Serb    Slov     Eng   Σ dom.
Law             1232     279    1436    2052    5042    10041
Law-Politics             426                     424      850
Politics                  39                      39       78
Education       1707      81     232            1679     3699
Environment      182                             166      348
Health           518             201             604     1323
Tourism          524                             480     1004
Finance                           14              14       14
Σ lang.         4163     825    1883    2052    8448    17357

Table 2: Term distribution over domains and languages


5 Conclusions

In this paper, we have described the methodology followed in the construction of MLRs (i.e. corpora and terminologies); this task has been interpreted as a test application in the process of defining a model for the production of LRs. The approach to MLRs production adopted and described in this paper cannot be seen as a standard practice, nor is it to be considered a recommended practice in LRs building. In fact, there is no such practice for all purposes; there are only better solutions under certain conditions. We claim, however, that the procedure adopted is a viable and fruitful one given the following conditions:

• available data are sparse;
• no reference corpora are available for the target languages, but many NLP tools and reference corpora are available for one language (the pivot).

Moreover, this experience taught us some lessons about the more general task of building resources for languages suffering from a scarcity of widely available data and processing tools. Thus, besides the actual production of the resources, a parallel result has been the identification of gaps and shortcomings in the process usually employed by LRs producers (or users who might wish to create their own LRs), and the suggestion of ways of remedying them. At a general level, the production methodology is heavily influenced by the following factors:

• Lack of compatibility among the resources themselves: this means, for instance, not only enforcing compatibility in data encoding and representation, but also ensuring that the resources are compatible from the point of view of the additional, linguistic and non-linguistic, information that is added to the raw data. Once again, compliance with agreed-upon standards is recommended, as well as harmonization among the different tagsets used in the various resources. Ideally, all resources should use the same convention for linguistic annotation; when this is not possible, it is recommended that a harmonized tagset be used, or that conversion procedures from the proprietary tagset to a common, standardized one be provided.

• Lack of integration among computer tools working at different levels of analysis: in order to enhance the LRs production effort, the re-use of existing tools is considered crucial. It is true that an increasing number of tools are available for text processing; however, these are oriented mainly towards the "major" languages. Moreover, information concerning the existence, availability and operation of existing tools is not easy to locate - a gap that the other pillar of INTERA tries to remedy through the building of an integrated European LRs area. Additionally, tools must be enhanced in two directions: improvement of the tools themselves (e.g. more robust alignment techniques) and interoperability of all relevant tools currently used at different phases of processing.

• Interoperability is closely related to the issue of standards. The promotion and deployment of existing standards, as well as the creation of new standards where these are lacking, is important to ensure the viability and re-use of LRs, given the cost of their production.

• The particular configuration of resources available: the methodology to be adopted for the production of multilingual terminological resources must be carefully adjusted to the idiosyncratic situation to be handled, where by situation we mean the type of languages, the quantity and quality of resources, and the purposes for which the resource is being built.

References

(Bourigault et al. 01) D. Bourigault, C. Jacquemin, M.-C. L'Homme (eds.), Recent Advances in Computational Terminology, Amsterdam & Philadelphia, John Benjamins, 2001.

(Cabré 92) M.T. Cabré, Terminology. Theory, Methods and Applications, Amsterdam & Philadelphia, John Benjamins, 1992.

(Calzolari et al. 04) N. Calzolari, K. Choukri, M. Gavrilidou, B. Maegaard, P. Baroni, H. Fersoe, A. Lenci, V. Mapelli, M. Monachini, S. Piperidis, ENABLER Thematic Network of National Projects: Technical, Strategic and Political Issues of LRs, in LREC-2004 Proceedings, Lisbon, 2004.

(Gavrilidou & Desipri 03) M. Gavrilidou, E. Desipri, Final Version of the Survey, ENABLER Deliverable 2.1, 2003.

(Gavrilidou et al. 03) M. Gavrilidou, E. Desipri, P. Labropoulou, S. Piperidis, M. Monachini, C. Soria, Technical Specifications for the Selection and Encoding of Multilingual Resources, INTERA Deliverable D5.1, 2003.

(Gavrilidou et al. 04) M. Gavrilidou, V. Giouli, E. Desipri, P. Labropoulou, M. Monachini, C. Soria, E. Picchi, P. Ruffolo, E. Sassolini, Report on the Multilingual Resources Production, INTERA Deliverable D5.2, 2004.

(IMDI 03) IMDI, Metadata Elements for Session Descriptions, Version 3.0.4, Sept. 2003.

(Jacquemin 01) C. Jacquemin, Spotting and Discovering Terms through Natural Language Processing, Cambridge, MA and London, The MIT Press, 2001.

(Krauwer 98) S. Krauwer, ELSNET and ELRA: A Common Past and a Common Future, in ELRA Newsletter, Vol. 3, No. 2, 1998.

(LISA 01) LISA, The LISA Globalization Strategies Awareness Survey, 2001.

(LISA/AIIM 01) LISA/AIIM, The Black Hole in the Internet: LISA/AIIM Globalization Survey, 2001.

(LISA/OSCAR 03) LISA/OSCAR, Translation Memory Survey, 2003.

(Maegaard et al. 03) B. Maegaard, K. Choukri, V. Mapelli, M. Nikkou, C. Povlsen, Language Resources - Industrial Needs, ENABLER Deliverable D4.2, Copenhagen, 2003.

(Sager 90) J.C. Sager, A Practical Course in Terminology Processing, Amsterdam & Philadelphia, John Benjamins, 1990.


An Algorithm for the Semiautomatic Generation of WordNet Type Synsets with Special Reference to Romanian

Florentina Hristea
Faculty of Mathematics and Computer Science
University of Bucharest, Academiei 14
[email protected]; [email protected]

Cristina Vaţă
Wylog Romania

Bd. Unirii
[email protected]

Abstract

As its authors note [Miller et al., 90], WordNet (WN) is a lexical knowledge base, first developed for English and then adopted for several European languages, which was created as a machine-readable dictionary based on psycholinguistic principles. The present paper is an attempt to discuss the semiautomatic generation of WNs for languages other than English, a topic of great interest since the existence of such WNs will create the appropriate infrastructure for advanced Information Technology systems. Extending the algorithmic approach introduced in [Nikolov and Petrova, 01], we propose a semiautomatic method based on heuristics for the generation of WN type synsets. The focus is on noun synsets, but comments concerning the possibility of extending the proposed method to adjectives and verbs are also made. The target language for performing tests will be Romanian. Our approach to WN generation relies on so-called "class methods", namely it uses as knowledge sources individual entries coming from bilingual dictionaries and WN synsets, but at the same time demonstrates the need to combine such methods with structural ones.

1. Introduction

WN has been recognized as a valuable resource in the human language technology and knowledge processing communities. The human language research community has encouraged the development of WNs for languages other than English, at the same time concentrating on the possibility of automatically generating such huge lexical databases. The main reason for this is the desire and the necessity to create a uniform ontological infrastructure across languages. This can be achieved since, while concepts are language dependent, the basic set of relations that link the concepts remains the same. This means that the inference algorithms for extracting information remain the same. The existence of such a uniform ontological infrastructure across languages will therefore simplify machine translation from one language to another and will facilitate the use of the same reasoning schemes and algorithms developed in conjunction with the American WN.

The present study concentrates on the important and up-to-date topic of automatic generation of WNs for languages other than English. The approach to WN generation consists of a semiautomatic method based on heuristics which belongs to the so-called "class methods" [Atserias et al., 97]. It therefore uses individual entries coming from bilingual dictionaries and WN synsets as knowledge sources, and hence the success of our method depends directly on the availability of comprehensive bilingual dictionaries in electronic format.

The basic translation algorithm (Algorithm 2.1 of the present paper) will be using so-called "elementary sets", a concept introduced in [Nikolov and Petrova, 00]. Algorithm 2.1, which is described in [Nikolov and Petrova, 01], will be further completed by Algorithm 2.2, proposed in [Hristea, 02], which performs a backtracking action (step 1) in order to obtain as final output the foreign synset corresponding to the given English one. It should be noted that the Bulgarian authors who first describe Algorithm 2.1 [Nikolov and Petrova, 01], having as output a sorted list of elementary sets, make no comment whatsoever as to how they obtain the final foreign synset, in their case the final Bulgarian noun synset. One can easily assume that it is manually obtained by linguists using the output of Algorithm 2.1. It was the concern of [Hristea, 02] to automate the process of creating a foreign WN type synset to the largest extent possible, and our comments concerning output obtained in the case of the Romanian language will be made within this type of framework1.

Since the same Bulgarian authors [Nikolov and Petrova, 01] do not specify what evaluation function has been used, additional comments will be made here, taking into account the mentioned Romanian output, with respect to the type of evaluation function that was used in the translation process.

Finally, to the credit of the mentioned authors, who are only concerned with obtaining "a core of Bulgarian noun synsets", it turns out that their algorithm can be extended (more or less successfully) to the general case of any foreign language (not just Bulgarian). Additionally, it is our belief that Algorithm 2.1 can be successfully used in the case of all other (three) parts of speech that WN deals with, provided that it is modified accordingly. Such modifications should take into account the typical semantic relations implemented in WN with regard to each part of speech, thus combining the class method initially used in the case of nouns with a structural approach to WN generation (see the various enrichment techniques proposed in the present paper).

2. The Translation Algorithm

The algorithm for translating a given English synset into the corresponding synset in a language other than English will be using so-called "elementary sets" or e-sets, a concept introduced in [Nikolov and Petrova, 00]. An e-set corresponds to a monosemous reading (sense) of a word and can be defined as follows:

Definition 2.1
An e-set relative to a word is the set of synonyms corresponding to a specific monosemous reading (sense) of that word.

Let us denote by EW any English word and by FW any foreign word, namely a word of a language other than English. Let eword of sequence (1) be an EW, while fword1, fword2 and fword3 of the same sequence are its corresponding translation equivalents (according to the appropriate bilingual dictionary):

1 And using version 1.7.1 of the WN database.

eword  fword1; fword2, fword3        (1)

In order to distinguish among fword1, fword2 and fword3, two different separators are used in standard paper dictionaries. A semicolon separates different meanings of a given word. A comma separates synonyms which refer to one and the same meaning of the word. (In this case fword2 and fword3 are synonyms.) This is the form of bilingual dictionary which will be used by the programs implementing the proposed translation algorithm. In the above example the involved e-sets are

{fword1} and {fword2, fword3}.

The computer programs which implement the translation algorithm will generate the list of all e-sets of FWs corresponding to the meanings of all EWs occurring in a given English synset. The foreign synset corresponding to the studied English one is formed of one or more of the generated e-sets (which can be adjoined). The "candidates" for inclusion in the foreign synset are labeled e-sets, namely those e-sets which contain labeled words.
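To make the dictionary format concrete, e-set extraction from one entry can be sketched in Python (the authors' own implementation is in Prolog and is not shown in the paper; the two-space separator between headword and translations is an assumption about the file layout):

```python
# A minimal sketch of e-set extraction from one bilingual dictionary
# entry, following the separator convention described above:
# ";" separates meanings, "," separates synonyms of one meaning.

def parse_entry(line: str):
    """Split a dictionary line into (eword, list of e-sets)."""
    headword, _, translations = line.partition("  ")
    e_sets = [
        {w.strip() for w in meaning.split(",")}
        for meaning in translations.split(";")
    ]
    return headword.strip(), e_sets

eword, e_sets = parse_entry("eword  fword1; fword2, fword3")
# e_sets == [{"fword1"}, {"fword2", "fword3"}]
```

Each resulting set is one e-set, i.e. the synonyms of a single monosemous reading, matching sequence (1) above.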

In order to label the FWs belonging to the generated e-sets, we have decided to first label the EWs belonging to the English synset. These EWs will be labeled with integer numbers ranging from 1 to n (where n is the size of the synset, namely the number of words it contains), in the order of their occurrence. After labeling the EWs of the original synset, the FWs of the generated e-sets are looked up in the corresponding bilingual dictionary. Each time an EW of the given synset represents the translation, according to the dictionary, of a FW, the corresponding FW receives the label of that EW. If any word of a foreign e-set can be translated into a word of the English synset using the bilingual dictionary, the whole foreign e-set is moved to the "list of candidates". As noted in [Nikolov and Petrova, 01], when completed, this list of candidates is the most important preliminary result. The appropriate foreign synset must be a compilation of some e-sets belonging to this list. Various evaluating functions which sort the extracted e-sets and outline the most adequate ones have been developed. In order to define such evaluating functions let us refer to the following concepts:

Definition 2.2
The label of an e-set represents the number of labels assigned to the words belonging to that e-set.


Definition 2.3
An e-set is unlabeled if it contains no labeled words.

Any word can have one or more labels assigned to it (as well as no label at all). The most common evaluating function which has been proposed [Nikolov and Petrova, 01] takes as argument an e-set and has a value given by the very label of that e-set. A variant of this evaluating function divides the number representing the label of the e-set by the size of the same e-set.

As far as we are concerned, we have taken into consideration the evaluation function which is defined below.

Each EW belonging to the given English synset will have a label (represented by an integer number from 1 to n, where n is the size of the synset) and the labeling of the FWs belonging to the e-sets is performed according to this label. The labels of the foreign words which differ from the label of the corresponding EW will be considered as representing two points, while the others represent just one point. The value of the evaluation function relative to a specific e-set is given by the total number of points corresponding to that e-set divided by its size.
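One plausible reading of this evaluation function can be sketched in Python. The data layout is an assumption (the paper does not show its Prolog code): `labels` maps each foreign word to the set of integer labels it received during back-translation, and `eword_label` is the label of the English word whose dictionary entry produced the e-set.

```python
# A hedged sketch of the evaluation function described above, under
# one plausible reading of the text: a label differing from the label
# of the originating EW counts two points, a matching label counts one
# point; the score is the total number of points divided by the size
# of the e-set.

def evaluate(e_set, labels, eword_label):
    points = 0
    for fw in e_set:
        for label in labels.get(fw, set()):
            points += 2 if label != eword_label else 1
    return points / len(e_set)

# Illustration: a foreign word that back-translates to both EWs of a
# two-word synset carries labels {1, 2}; scoring an e-set that came
# from the entry of EW number 2:
score = evaluate({"personificare"}, {"personificare": {1, 2}}, 2)
# 2 points (label 1 differs) + 1 point (label 2 matches), size 1
```

Unlabeled words contribute no points, so an e-set containing only unlabeled words scores 0, consistent with unlabeled e-sets being removed in step 4 of Algorithm 2.1 below.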

Having defined all necessary concepts, one can now state the algorithm for generating the foreign e-sets corresponding to a given English synset:

Algorithm 2.1

Input: The file containing the English synsets and the two files representing the two bilingual dictionaries (for instance, the English-Romanian and the Romanian-English dictionary respectively).

1. Create (by consulting the appropriate bilingual dictionary) the e-sets corresponding to each word of the given English synset.

2. Label the English words belonging to the given English synset.

3. Label each of the e-sets generated in Step 1.

4. Remove all unlabeled e-sets.
5. Evaluate the e-sets (using the assigned labels and an evaluating function).

Output: The sorted list of e-sets corresponding to the given English synset.

The translations in the foreign language of the words occurring in the English synset are extracted from the bilingual dictionary as follows:

eword_1  meaning_{1,1}; meaning_{1,2}; … ; meaning_{1,m_1}
………
eword_n  meaning_{n,1}; meaning_{n,2}; … ; meaning_{n,m_n}

The set of e-sets generated by Algorithm 2.1 is of the following form:

{{meaning_{i,j}} | 1 ≤ i ≤ n, 1 ≤ j ≤ m_i}.

The foreign synset will be generated using this set.

In the automatic generation of the foreign synset corresponding to a given English synset we shall also take into account

Remark 2.1
Of all possible meanings of a word, only one refers to a specific concept (to which a synset corresponds).

Using the sorted list of e-sets generated by Algorithm 2.1 (namely the evaluated e-sets), the meaning (elementary set) evaluated with the highest value will be chosen for each English word. Let this meaning, corresponding to eword_j, be meaning_{j,i_j}. The foreign synset will be generated using the e-sets obtained by means of Algorithm 2.1, taking into account Remark 2.1, and according to

Algorithm 2.2

Input: The sorted list of e-sets generated by Algorithm 2.1 corresponding to the given English synset, [eword_1, eword_2, ..., eword_n].

1. Compute the foreign synset as having the following form:
{meaning_{1,i_1}} ∪ {meaning_{2,i_2}} ∪ … ∪ {meaning_{n,i_n}}, 1 ≤ i_j ≤ m_j, ∀ j = 1, …, n.
2. Delete words occurring in more than one e-set from this union, such that each word will occur just once.

Output: The foreign synset corresponding to the given English synset.
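The two steps of Algorithm 2.2 can be sketched in Python as follows (the paper's implementation is in Prolog; the input layout, a mapping from each English word to its e-sets sorted by descending score, is an assumption, and the Romanian words are toy data echoing the paper's personification/incarnation example):

```python
# A sketch of Algorithm 2.2: union the top-scoring e-set of each
# English word (step 1), then keep only one occurrence of each foreign
# word (step 2).

def algorithm_2_2(sorted_esets):
    foreign_synset = []
    for eword, esets in sorted_esets.items():
        if not esets:
            continue
        for fw in sorted(esets[0]):       # top-scoring e-set for this EW
            if fw not in foreign_synset:  # step 2: deduplicate
                foreign_synset.append(fw)
    return foreign_synset

synset = algorithm_2_2({
    "personification": [{"personificare", "intruchipare"}],
    "incarnation": [{"personificare"},
                    {"incarnare", "intruchipare", "intrupare"}],
})
# synset == ['intruchipare', 'personificare']
```

Sorting each e-set before appending just makes the output order deterministic; the algorithm itself only requires that each word occur once.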

It has now become obvious that our approach to WN generation belongs to the class of semiautomatic methods based on heuristics. As is well known [Atserias et al., 97], such heuristics can belong to two main categories: one in which the corresponding heuristics rely on information found in the bilingual dictionaries and the structure of WN, another containing heuristics that rely on the genus information extracted from the monolingual dictionary. Obviously, the heuristic which is used here belongs to the first mentioned category, since our generation method does not use monolingual resources (with the exception of WN itself) but relies solely on bilingual dictionaries (in electronic format).

An example

English synset: {personification, incarnation}

Gloss: the act of attributing human characteristics to abstract ideas etc.

e-sets:

eword            e-set                                  score
incarnation      {personificare}                        2.0
incarnation      {incarnare, intruchipare, intrupare}   1.3333334
personification  {personificare, intruchipare}          1.0

Proposed Romanian synset(s):

• {personificare, intruchipare}

As noted in [Nikolov and Petrova, 01], the greatest advantage of Algorithm 2.1 is the ability to create synsets which may include foreign words that would not be extracted from the input resource at the first step of the work. Thus, even if a foreign word occurs in the English-Romanian dictionary, for instance, but is missing from the Romanian-English one, there is still a big chance for this word to be included in the final resulting synset. (The only necessary condition for this is the presence in the list of candidates of an e-set which includes that word.) This is a very important fact considering how incomplete bilingual dictionaries usually are. As the above mentioned authors point out, this algorithm does not represent a simple mirror translation.

3. Noun Synsets

Algorithms 2.1 and 2.2 have been implemented in Prolog and tested by us, with fairly good results, in the case of Romanian nouns. In order to test the algorithms, we have used fragments of bilingual dictionaries in electronic format. When working with a semantic network like WN, the richness of the bilingual dictionaries which are used is of the essence. Due to the imperfection of existing Romanian-English and English-Romanian dictionaries in electronic format, and in order to ensure the most accurate testing possible, we have generated our own fragments of electronic bilingual dictionaries, using some of the most complete existing paper ones [Leviţchi, 73], [Leviţchi et al., 74]. We have randomly chosen 200 English noun synsets for which we have automatically generated the corresponding Romanian ones. Since most English synsets contain two words, our data sample was chosen according to the same pattern. Thus, out of the 200 considered English synsets, 179 contained two English nouns, 4 synsets contained 3 English nouns and 17 synsets contained more than 3 English nouns (between 4 and 7 words). The number of e-sets involved in the experiment was 616. Several English synsets containing just one noun have been subsequently taken into consideration. All tests performed at this stage have used the original WN 1.7.1 in its Prolog-readable format, since this version of WN includes a larger number of synsets consisting of a unique word.

The generated Romanian synsets were validated by Romanian linguists using the latest bilingual dictionaries and the corresponding gloss indicated in the American WordNet.

When testing the translation algorithm relative to Romanian nouns, we have noticed that, in several cases, Algorithm 2.2 has generated more than one Romanian synset corresponding to the given English one. This was the case when Algorithm 2.1 had as output a list of e-sets (corresponding to different meanings of the same word) that had been evaluated with the same value. Each such e-set then represented a candidate and led to a different Romanian or, in general, foreign synset. In such cases the correct foreign synset will be chosen from the list of synsets generated by Algorithm 2.2 according to the gloss of the given English synset. The computer program implementing Algorithm 2.2 must therefore provide the gloss as output as well, since it is necessary in the validation performed by linguists. Future work should probably take into consideration a wider range of evaluation functions as well.

When performing tests for Romanian nouns it turned out that, besides the cases when the result was correct, in most other cases the algorithm had generated several Romanian synsets, among which the correct one could be found. In those cases when the English synsets did not have correct Romanian counterparts, it was mostly because of wrong or missing data in the bilingual dictionaries. Special problems occurred in the case of English synsets containing a single polysemous noun, the presented algorithm being unable to decide among meanings. In order to facilitate the experiment, when choosing our sample of English synsets a necessary step was that of removing the synsets with proper names, compounds and collocations. These should be dealt with separately and with a more significant contribution on the part of the linguists.

Obviously, when using Algorithms 2.1 and 2.2 for specific languages, various difficulties will occur according to what is typical of each language at the morphological and derivational level. In [Hristea, Th., 03] the linguist concludes by noticing that the main difficulties which occurred when automatically translating English synsets into Romanian ones were generated by collocations, by loan translation, and by the fact that the polysemy of many English words is greatly superior to that of the corresponding Romanian words.

A special case is that of English synsets containing a single polysemous noun, a situation in which Algorithm 2.1, or any other algorithm of the same type, will not be able to distinguish among various meanings. Such synsets should be subject to further investigation. This type of difficulty has suggested to us the enrichment technique which is proposed in §4 with respect to nouns (see Algorithm 4.1), as well as other possible enrichments concerning the other parts of speech that WN deals with (see §5). Such synset enrichments will take into account the WN structure, proving the necessity of combining class methods with structural ones.

In spite of the mentioned difficulties, we consider the presented translation algorithms appropriate for building a core of synsets corresponding to all four parts of speech involved in WN. This can be done in more or less any language other than English, provided that good bilingual dictionaries in electronic format exist for the specific target language under consideration. The most important issue here is probably the fact that Algorithms 2.1 and 2.2 do not depend on the part of speech involved in the translation.

4. Noun Synsets Revisited

Having as starting point the linguist's remark [Hristea, Th., 03] that "most problems occurred when translating English synsets that contain a single word, the algorithm often being unable to decide among meanings", we now revisit the English synsets containing a unique polysemous noun.

It again becomes obvious that, in order to correctly distinguish among the existing meanings when performing automatic translation, the proposed algorithm should take into account the WN structure as well. This leads us to taking into consideration the semantic relation of hypernymy, as suggested in [Sima and Vaţă, 04]. Hypernymy is one of the basic semantic relations implemented in WN, which corresponds to the isa relation and according to which nouns are structured as hierarchies. When automatically translating noun synsets containing a unique polysemous noun, we have therefore directed our attention towards the involved synset's hypernym, denoting the mother concept of the one that the synset under investigation refers to.

In the Prolog version of the WN database that we have been using, the hypernymy relation is expressed in the following form:

hyp(synset_id, synset_id).

The hyp operator specifies that the second synset is a hypernym of the first synset. This relation holds for nouns and verbs. The reflexive operator, hyponym, implies that the first synset is a hyponym of the second one.
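Reading these hyp facts from the Prolog database file can be sketched as follows (a Python sketch; the second synset_id in each illustrative fact below is invented for the example, not real WN 1.7.1 data):

```python
# A minimal sketch of parsing Prolog-format hyp facts such as
# "hyp(102450394,102449847)." and recording, for each synset, the
# first hypernym listed (the "first hypernym" used for enrichment).
import re

HYP_FACT = re.compile(r"hyp\((\d+),(\d+)\)\.")

def first_hypernyms(lines):
    """Map each synset_id to the first hypernym synset_id seen."""
    first = {}
    for line in lines:
        m = HYP_FACT.match(line.strip())
        if m:
            first.setdefault(m.group(1), m.group(2))
    return first

hyps = first_hypernyms(["hyp(102450394,102449847).",
                        "hyp(102450394,103001627)."])
# hyps["102450394"] == "102449847"
```

`setdefault` keeps only the first fact per synset, which mirrors the "first hypernym" choice justified in Remark 4.1 below (first by order of synset_ids).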

We have been trying to estimate to what extent synset enrichment performed by means of hypernyms increases the chances of correctly translating English synsets that contain a unique polysemous noun. The total number of such synsets existing in version 1.7.1 of WN is 13448. Taking this figure and these synsets into account as input, a random selection has been generated, as follows:

1. We have randomly selected (using the rand() function implemented in Perl 5.8.0) 135 synsets containing a unique polysemous noun. These selected synsets represent 1/100 of the total input.


2. We have added to this data sample all other synsets having the same contents as those generated at the previous step. (Example: If, in step 1, the synset [bearing], having the synset_id 102450394, has been selected, then, in step 2, all synsets containing the unique polysemous noun bearing will be included in the data sample. In this case only one such synset exists, namely the one having the synset_id 111640712.) After performing step 2 of this simulation, we have obtained 257 noun synsets which will be used in our estimation and which represent 1.91% of the original input.

3. Each of the 257 noun synsets obtained in step 2 has been enriched by adding all nouns of the first hypernym2 synset.

Remark 4.1
The choice of the "first hypernym" as being the most significant to be used for performing enrichment has been made according to the following simulation: there exist 199 synsets containing a unique polysemous noun and having multiple hypernyms. (This represents only 1.4% of the total number of synsets under investigation.) We have randomly selected 20 such synsets (representing approximately 10% of the synsets having multiple hypernyms) and have come to the conclusion that, in 17 cases out of 20 (namely in 85% of all cases), one can retain only the first hypernym, which can be considered the most significant.

Algorithm 2.1 has been used for performing the automatic translation into Romanian of all synsets containing a unique polysemous noun, as well as of the 257 enriched synsets obtained as previously described. In the first case (non-enriched synsets), the algorithm failed to produce a correct translation in 60.75% of cases. When dealing with enriched synsets, the same algorithm failed in only 31% of cases. This shows that the proposed enrichment technique decreases failure in the automatic generation of unique polysemous noun synsets by approximately 50%, a result which encourages us to reformulate Algorithm 2.1, corresponding to this part of speech, as follows [Sima and Vaţă, 04]:

2 Here “first hypernym” refers to the ordering existingamong synset_ids.

Algorithm 4.1

Input: The English synset which is to be translated, the file wn_s.pl (where an s operator is present for every word sense in WN), the file corresponding to the hyp operator, and the two files representing the two bilingual dictionaries.

1. If the given English synset consists of just one noun, find out (by consulting the wn_s.pl WN file) whether this noun is polysemous. If not, STOP and use Algorithm 2.1 for performing the translation.

2. Use the hyp operator file in order to find the first hypernym of the given synset. If such a hypernym exists3, do:

2.1. Enrich the given English synset containing a unique polysemous noun with all nouns of the synset representing its first hypernym.

2.2. Delete one of the occurrences of the noun of the original synset if this word also exists in the synset used to perform the enrichment. The newly resulting (enriched) synset is the one to be translated.

3. Create (by consulting the appropriate bilingual dictionary) the e-sets corresponding to each word of the English synset to be translated.

4. Label the English words belonging to the given English synset.

5. Label each of the e-sets generated in step 3.

6. Remove all unlabeled e-sets.
7. Evaluate the e-sets (using the assigned labels and an evaluating function).
8. Sort the obtained list of evaluated e-sets (according to their scores and to the English words they correspond to), in ascending or descending order.

3 In WN 1.7.1 all noun synsets have hypernyms. The only exceptions are the top level synsets. Among these synsets, there is one containing a unique polysemous noun, namely the synset {state}, having as gloss (the way something is with respect to its main attributes; "the current state of knowledge"; "his state of health"; "in a weak financial state").


9. Choose the e-set corresponding to the noun in the original English synset which is evaluated with the highest score and present it as output. STOP.

Output: The foreign synset corresponding to the original English synset.

Remark 4.2
Since the hypernym synset has been used only for specifying the meaning of the polysemous noun occurring in the English synset to be translated, no e-sets corresponding to nouns of the hypernym synset will be selected when forming the foreign synset that represents the translation. Using Algorithm 2.2 is therefore not necessary in this case. Algorithm 4.1 has as output the final result, namely the foreign synset representing the translation in the target language of the given English one (see also the following example).
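The control flow of Algorithm 4.1 can be condensed into a runnable Python sketch on toy data. Everything below is illustrative: the synset_ids, the hypernym contents, and the `translate_sorted` stand-in for the Algorithm 2.1 machinery (which would return, per English word, its e-sets sorted by descending score) are all hypothetical, not real WN 1.7.1 data.

```python
# A condensed sketch of Algorithm 4.1: enrich a single-polysemous-noun
# synset with its first hypernym's nouns (steps 1-2.2), run the
# Algorithm 2.1 machinery on the enriched synset, and keep only the
# top-scoring e-set of the original noun (step 9).

def algorithm_4_1(synset_id, synset, polysemous, first_hyp, synsets,
                  translate_sorted):
    # Step 1: enrichment applies only to a single polysemous noun.
    if len(synset) != 1 or synset[0] not in polysemous:
        return None  # fall back to plain Algorithm 2.1
    noun = synset[0]
    # Steps 2-2.2: append the first hypernym's nouns, skipping any
    # duplicate occurrence of the original noun.
    hypernym_nouns = synsets[first_hyp[synset_id]]
    enriched = [noun] + [w for w in hypernym_nouns if w != noun]
    # Steps 3-9: score e-sets over the enriched synset, then keep only
    # the top-scoring e-set of the original noun as the output synset.
    return translate_sorted(enriched)[noun][0]

# Toy run; the lambda is a trivial stand-in for Algorithm 2.1.
result = algorithm_4_1(
    "100001", ["bearing"], {"bearing"},
    {"100001": "100002"}, {"100002": ["comportment", "presence"]},
    lambda enriched: {w: [{w.upper()}] for w in enriched},
)
# result == {"BEARING"}
```

As Remark 4.2 notes, no e-sets of the hypernym nouns survive into the output; the hypernym words only steer the scoring toward the intended sense of the original noun.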

5. Related work

In order to decrease failure in the automatic translation of English synsets containing a unique polysemous noun, as well as to minimize the involved human effort, an enrichment step has been included in the existing translation Algorithm 2.1. This improves the automatic translation of such noun synsets by approximately 50%.

Let us note that a corresponding enrichment step can be included in the general translation algorithm in order for it to perform the generation of adjective synsets as well, in languages other than English. In this case we recommend [Hristea, 02] using a strategy which consists in enriching the given synset with new adjectives that suggest the meaning of the one occurring in this synset. The new adjectives are obtained using the similarity relation that typically exists in WN among adjective synsets. In the Prolog version of the WN database, this similarity is expressed in the following form:

sim(synset_id, synset_id).

The sim operator specifies that the second synset is similar in meaning to the first synset. This means that the second synset is a satellite of the first synset, which is the cluster head. This relation only holds for adjective synsets contained in adjective clusters (and therefore our translation method only refers to descriptive adjectives, which have this organization in WN).

Thus, in order to enrich the given synset with new words, the adjectives occurring in the first position within synsets semantically linked to the original one via the similarity relation will be chosen. These words will be appended to the original synset, starting from the second position. This idea was inspired by the way in which adjective clusters are organized and structured in WN. At this point one therefore again feels the need to combine the presented class method with a structural one (namely one that takes advantage of the WN structure). A demo concerning automatically generated Romanian adjective synsets can be seen at

http://phobos.cs.unibuc.ro/roric/adjsynsets.html and
http://phobos.cs.unibuc.ro/roric/enrichadjsynsets.html
respectively.

In the case of verb synsets, it is equally recommended [Hristea, 03] that the similarity in meaning of various verb synsets should be used. In the Prolog version of the WN database, this similarity is expressed in the following form:

vgp(synset_id, synset_id).

The vgp operator specifies verb synsets that are similar in meaning and that should be grouped together when displayed in response to a grouped synset search. This relation only holds for verb synsets.

6. Final remarks

The proposed approach to WN generation is a combination of automatic and manual methods. The manual method relies on human experts, while the automatic one relies strongly on bilingual dictionaries and represents a combination of class methods with structural ones.

Using the proposed class method (which is language independent and irrespective of part of speech) is sufficient in order to automatically generate the synsets of the target language (which will be manually validated).

The significance of the manual effort involved in quality assurance primarily depends on the existence of appropriate tools. The involved human effort can be greatly reduced in the case of those languages for which correct and complete bilingual dictionaries in electronic format exist. Thus, in most cases when the obtained results are not the best possible, it is mainly because of the imperfection of existing bilingual dictionaries.


No matter what language is taken into consideration, linguistic difficulties that cannot be overcome will always exist. Additionally, we should note that most problems occur when translating English synsets that contain a single polysemous word, Algorithm 2.1 being unable to decide among meanings. Such synsets are subject to further investigation, and the enrichment step taking into account the WN structure is added to Algorithm 2.1 whenever possible. In the case of Romanian nouns this improves the automatic translation of such synsets by approximately 50%. Our improved method reinforces the necessity of combining class methods with structural ones when dealing with this type of task.

Acknowledgements

This study was begun by Florentina Hristea within the framework of the BALRIC-LING project, funded by the European Commission (IST-2000-26454), and was concluded during a Fulbright grant at the Cognitive Science Laboratory of Princeton University, in 2004. The author would like to thank both the European Commission and the Romanian - U.S. Fulbright Commission for the importance they have attached to the presented topic, as well as for having offered their full support.

References

[Atserias et. al., 97] Atserias, J., Climent, S., Farreres, X., Rigau, G., Rodriguez, H.: "Combining Multiple Methods for the Automatic Construction of Multi-lingual WordNets"; in: Recent Advances in Natural Language Processing II. Selected papers from RANLP '97. Edited by Nicolas Nicolov and Ruslan Mitkov; John Benjamins Publishing Company, Amsterdam/Philadelphia (1997), 327-338.

[Fellbaum, 98] Fellbaum, C. (Ed.): "WordNet: An Electronic Lexical Database"; The MIT Press, Cambridge/London/England (1998).

[Harabagiu, 99] Harabagiu, S.: "Lexical Acquisition for a Romanian WordNet"; Proc. EUROLAN '99, Iasi, Romania (1999).

[Hristea, Th., 03] Hristea, Th.: "Some Linguistic Comments Concerning the Obtained Output"; in: Building Awareness in Language Technology. Edited by Florentina Hristea and Marius Popescu; Editura Universitatii din Bucuresti, Bucharest (2003), 153-157.

[Hristea, 02] Hristea, F.: "On the Semiautomatic Generation of WordNet Type Synsets and Clusters"; Journal of Universal Computer Science, vol. 8, no. 12 (2002), 1047-1064.

[Hristea, 03] Hristea, F.: "On the Semiautomatic Generation of Verb Synsets in Languages other than English"; Annals of the University of Bucharest, anul LII (2003), 75-86.

[Hristea and Popescu, 03] Hristea, F., Popescu, M. (Eds.): "Building Awareness in Language Technology"; Editura Universitatii din Bucuresti, Bucharest (2003).

[Leviţchi, 73] Leviţchi, L.: "Dictionar roman-englez" (3rd edition); Editura Stiintifica, Bucharest (1973).

[Leviţchi et. al., 74] Leviţchi, L., Bantas, A., Nicolescu, A.: "Dictionar englez-roman"; Editura Academiei Romane, Bucharest (1974).

[Miller et. al., 90] Miller, G.A., Beckwith, R., Fellbaum, C., Gross, D., Miller, K.J.: "Introduction to WordNet: an on-line lexical database"; International Journal of Lexicography, 3, 4 (1990), 235-244.

[Nikolov and Petrova, 00] Nikolov, T., Petrova, K.: "Building and Evaluating a Core of Bulgarian WordNet for Nouns"; OntoLex '2000 Report, Sozopol, Bulgaria (2000).

[Nikolov and Petrova, 01] Nikolov, T., Petrova, K.: "Towards Building Bulgarian WordNet"; Proc. RANLP '01, INCOMA Ltd., Tzigov Chark, Bulgaria (2001), 199-203.

[Sima and Vaţă, 04] Sima, C., Vaţă, C.: "On the Semiautomatic Generation of Romanian Noun Synsets"; Annals of the University of Bucharest, Anul LIII, No. 1 (2004), 125-136.


Resources for Processing Bulgarian and Serbian – a Brief Overview of Completeness, Compatibility, and Similarities

Svetla Koeva, Cvetana Krstev, Ivan Obradović, Duško Vitas

Department of Computational Linguistics – IBL, BAS
52 Shipchenski prohod, Bl. 17, Sofia
[email protected]

Faculty of Philology, University of Belgrade
Studentski trg 3, 11000 Belgrade
[email protected]

Faculty of Geology and Mining, University of Belgrade
Đušina 7, 11000 Belgrade
[email protected]

Faculty of Mathematics, University of Belgrade
Studentski trg 16, 11000 Belgrade
[email protected]

Abstract
Some important and extensive language resources with similar theoretical background and structure have been developed for Bulgarian and Serbian. Some of them were developed as part of a concerted action (wordnet); others were developed independently. A brief overview of these resources is presented in this paper, with emphasis on the similarities and differences in the information presented in them. Special attention is given to similar problems encountered in the course of their development.

1. Introduction

Bulgarian and Serbian, as Slavonic languages, show similarities in their lexicons and grammatical structures. At present, equivalent language resources have been developed for both languages; moreover, the formal approaches to the organization of those resources are very similar. The goal of this paper is to briefly present these similarities while describing the integration between the electronic dictionaries and the lexical-semantic databases (wordnets) of both languages.

The Bulgarian and Serbian wordnets were initially developed in the framework of the project BalkaNet – a multilingual semantic network for the Balkan languages – which aimed at the creation of a semantic and lexical network of the Balkan languages [Stamou, 2002] with a view to their integration into the global WordNet [Fellbaum, 1998; Miller, 1990] – an extensive network of synonym sets and the semantic relations existing between them in different languages, enabling cross-references between equivalent sets of words with the same meaning [Vossen, 1999].

The common origin of Bulgarian and Serbian, the equivalent types of existing electronic resources, and the application approaches used offer not only a very good basis for comparative research but furthermore presuppose successful implementation in such different application areas as cross-lingual information and knowledge management, cross-lingual content management and text data mining, cross-lingual information retrieval and information extraction, multilingual summarization, multilingual language generation, etc.

2. Electronic dictionaries

2.1 Bulgarian Grammatical Dictionary
The grammatical information included in the Bulgarian Grammatical Dictionary (BGD) is divided into three types [Koeva, 1998]: category information, which describes lemmas and indicates the clustering of words into grammatical classes (Noun, Verb, Adjective, Pronoun, Numeral, and Other); paradigmatic information, which also characterizes lemmas and shows the grouping of words into grammatical subclasses, e.g. Personal, Transitive, and Perfective for verbs, Common and Proper for nouns, etc.; and grammatical information, which determines the formation of word forms and shows the classification of words into grammatical types according to their inflection, conjugation, and sound and accent alternations. The BGD is a list of lemmas where each entry is associated with a label [Koeva, 2004a]. The label itself represents the grammatical class and subclass to which the respective lemma belongs and contains a unique number that shows the grammatical type. All words in the language that belong to the same grammatical class and subclass and have an identical set of endings and sound/stress alternations are associated with one and the same label. Each label is connected with the corresponding formal description of endings and alternations. The inflectional engine used is equivalent to a stack automaton. Despite the existence of some


CATEGORY     | BULGARIAN                                                         | SERBIAN
Gender       | masculine, feminine, neuter                                       | masculine, feminine, neuter
Number       | singular, plural, counting form                                   | singular, plural, paucal
Case         | vocative                                                          | nominative, genitive, dative, accusative, vocative, instrumental, locative
Definiteness | definite, indefinite, definite – full form, definite – short form | –
Animateness  | –                                                                 | animate, inanimate

Table 1. Grammatical features of nouns

differences in the format, the BGD represents a kind of DELAS dictionary [Courtois, 1990; Courtois et al., 1990; Silberztein, 1990], and it is compiled into a Finite-State Transducer.

2.2. Serbian Morphological Dictionary
Electronic dictionaries of Serbian consist of morphological dictionaries of the general lexicon, dictionaries of proper names, and the Serbian wordnet. The system of e-dictionaries of simple words in Serbian has been developed according to the LADL model and is described, with other types of dictionaries, in [Vitas & al., 2003]. Following this model, the system is based on a dictionary of lemmas named DELAS. A dictionary of all inflectional forms, named DELAF, is automatically generated on the basis of morphological information attached to lemmas in DELAS. The most important piece of information accompanying a DELAS lemma is the inflectional class it belongs to, which enables the generation of all inflected forms of a lemma with the accompanying grammatical information. The information on the inflectional class is expressed by a code, e.g. N600 or V651. The information attached to a lemma in the DELAS dictionary relates to all forms of that lemma, whereas the morphological information attached to an inflected form in the DELAF dictionary is characteristic of that form only.
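The DELAS-to-DELAF expansion described above can be sketched as follows. The inflectional class code N661 and its ending table are invented for illustration; real LADL-style class descriptions are far richer (alternations, accents, multiple stems).

```python
# Toy inflectional-class table: ending -> grammatical code for that form.
# The class name "N661" and the codes (ms1n = masc. sing. nominative, ...)
# are hypothetical placeholders.
INFLECTION_CLASSES = {
    "N661": {"": "ms1n", "a": "ms2g", "u": "ms3d", "i": "mp1n"},
}

def delaf_entries(delas_line):
    """Expand one DELAS entry ("lemma,CODE") into DELAF-style lines
    of the form "form,lemma.N:code"."""
    lemma, code = delas_line.split(",")
    out = []
    for ending, gram in INFLECTION_CLASSES[code].items():
        out.append(f"{lemma + ending},{lemma}.N:{gram}")
    return out

for entry in delaf_entries("prozor,N661"):
    print(entry)
```

The key point is that DELAF never has to be edited by hand: it is wholly derived from the lemma list and the class descriptions.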

3. Morphological information in electronic dictionaries of Bulgarian and Serbian

With respect to PoS (parts of speech), the Princeton WordNet (PWN) and other wordnets that use PWN as a model consist of nouns, verbs, adjectives and adverbs.

3.1. Nouns
Nouns in Bulgarian and Serbian are characterized by their inflectional categories (Table 1).

Bulgarian nouns are divided into grammatical subclasses with respect to their type (Common, Proper, Singularia tantum, Pluralia tantum) and Gender. The category Gender is with Bulgarian nouns a lexical-semantic category, which means that a given noun does not possess different word forms expressing masculine, feminine and neuter, although noun lemmas can be grammatically classified into the three classes: стол (chair) – masculine, маса (table) – feminine, and куче (dog) – neuter.

The category Case has lost its morphological realization in the system of Bulgarian nouns; only the vocative, as opposed to the nominative, is kept with some proper and common nouns (masculine and feminine) denoting persons. Some concrete nouns also allow the potential formation of a vocative in metaphorical usage.

The category Definiteness is realized by means of indefinite and definite forms, the latter adding a definite morpheme at the end of the word. A special feature of Bulgarian is the existence of two definite morphemes for the masculine, distinguishing the syntactic function of subject from all others.

Bulgarian masculine non-animate nouns after counting numerals and quantifiers are used in the plural in a counting form: пет учебника (five textbooks), десет бора (ten pine-trees).

In Serbian, nouns are morphologically realized in seven cases. The category Gender is in Serbian an inflectional category: for instance, papa (pope) is masculine but its plural form pape is feminine. Besides the two main values of number, singular and plural, Serbian nouns also have the so-called "paucal" form, which represents a synthetic category of number and gender used with small numbers (two, three, four): jedan lep zec (one pretty rabbit), dva lepa zeca (two pretty rabbits), pet lepih zečeva (five pretty rabbits). Animateness is also an inflectional category for masculine gender nouns: the form of the accusative


CATEGORY     | BULGARIAN                                                         | SERBIAN
Person       | first, second, third                                              | first, second, third
Number       | singular, plural                                                  | singular, plural
Tense        | present, aorist, imperfect                                        | present, aorist, imperfect, future
Mood         | indicative, imperative                                            | infinitive, imperative
Participles  | present active, aorist active, imperfect active, past passive     | past active, past passive
Voice        | active, passive                                                   | active, passive
Definiteness | definite, indefinite, definite – full form, definite – short form | –
Gender       | masculine, feminine, neuter                                       | masculine, feminine, neuter
Gerund       | past active                                                       | present active, past active

Table 2. Grammatical features of verbs

case is equal to the genitive case for animate nouns and to the nominative case for inanimate nouns.

Noun lemmas in the Serbian DELAS dictionary are supplied with markers which sometimes determine the noun in a more precise manner. For example, pluralia tantum are marked with the marker +PT, as in pantalone, denoting the concept lexicalized in PWN as {trousers:1, pants:1} and in the Bulgarian wordnet as {:1, :1}. The markers +MG and +FG are used to mark the natural male and female gender (or sex), which does not necessarily match the grammatical gender and which is important for agreement. This is the case, for example, with the noun izbeglica (refugee), which denotes persons of both male and female sex. This noun is inflected as a noun of feminine gender, agrees with an adjective as a noun of feminine or masculine gender in the singular (za svakog (m) izbeglicu (f) – for every refugee) and as a noun of feminine gender in the plural, and can agree with a relative pronoun in the plural both as a noun of feminine gender (Izbeglice (f) koje (f) su juče stigle (f) su izjavile (f)… – The refugees that arrived yesterday said…) and as a noun of masculine gender (UNHCR će pružiti pomoć za izbeglice (f) koji (m) žele da se integrišu u lokalnu sredinu – UNHCR will provide help for refugees that want to integrate into the local society). Finally, the marker +Pl marks a noun in the singular which denotes a natural plural: braća (brothers) is inflected as a noun of feminine gender in the singular and agrees both with the singular and the plural: Njena (s) braća (s) su (p) dolazila (s) svaki dan (her brothers came every day).
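A code field carrying markers of the kind just described (e.g. N600+PT, or a class code followed by +MG+FG) splits mechanically into an inflectional class and its marker list. The entry syntax below is assumed from the description in the text, not taken from the actual dictionaries.

```python
def split_codes(code_field):
    """Split a DELAS-style code field into (inflectional class, markers).
    E.g. "N600+PT" -> ("N600", ["PT"])."""
    parts = code_field.split("+")
    return parts[0], parts[1:]

print(split_codes("N600+PT"))      # pluralia tantum noun
print(split_codes("N601+MG+FG"))   # hypothetical class with both gender markers
```

Downstream tools can then treat the class as input to the inflectional engine and the markers as semantic/agreement features.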

3.2. Verbs
Verbs in Bulgarian and Serbian are characterized by inflectional categories and their values (see Table 2).

Bulgarian verbs are classified into subclasses with respect to Transitivity (transitive and intransitive), Perfectiveness (perfective and imperfective), and Personality (personal, third-personal and impersonal), while Serbian verbs are classified according to the first two features only.

Verb lemmas in Serbian are characterized by the following markers: for aspect, imperfective +Imperf and perfective +Perf; for reflexiveness, reflexive +Ref and irreflexive +Iref; and for transitivity, transitive +Tr and intransitive +It.

Many verbs in the two languages can be both imperfective and perfective, such as адресирам and adresirati, which denote the concept lexicalized in PWN as {address:3, direct:12}. Many formally identical verbs can express both a reflexive and an irreflexive meaning, such as {topiti:1a}, lexicalized in PWN as {melt:1, run:39, melt down:1} and in the Bulgarian wordnet as {:1, :1, :1, :1, :1, :1}, and topiti se, lexicalized in PWN as {dissolve:9, thaw:1, unfreeze:1, unthaw:1, dethaw:1, melt:2} and in the Bulgarian wordnet as {:1, :1, :1, :1, :1, :1}. Lexical reflexivity in both languages is expressed by the lexical particle se (and si for Bulgarian). Cognate verbs can also express either a transitive or an intransitive meaning, such as {svirati:1b}, denoting the concept lexicalized in PWN as {play:3} and in the Bulgarian wordnet as {:3}, as an intransitive verb (The band played all night long), or {svirati:1a}, denoting the concept lexicalized in PWN as {play:7} and in the Bulgarian wordnet as {:1}, as a transitive verb (He plays the flute). Synsets that contain the same verbs, in one case as reflexive and in the other as irreflexive, are often


CATEGORY     | BULGARIAN                                                         | SERBIAN
Gender       | masculine, feminine, neuter                                       | masculine, feminine, neuter
Number       | singular, plural                                                  | singular, plural, paucal
Case         | –                                                                 | nominative, genitive, dative, accusative, vocative, instrumental, locative
Definiteness | definite, indefinite, definite – full form, definite – short form | definite, indefinite
Comparison   | positive, comparative, superlative                                | positive, comparative, superlative
Animateness  | –                                                                 | animate, inanimate

Table 3. Grammatical categories with adjectives

linked by the cause/caused relation. The transitive and intransitive forms have to have separate meanings in PWN. This is not the case with aspect: perfective verbs which are not generated by prefixation should be in the same synset as the imperfective verb.

The Bulgarian imperative has two inflected forms, 2nd person singular and plural, whereas in Serbian three inflected forms are realized: 2nd person singular, and 1st and 2nd person plural.

Bulgarian participles are specified for aspect and are declined according to number, gender and definiteness. Serbian participles are specified for aspect and are declined according to number (singular or plural) and gender (masculine, feminine, or neuter).

Participles in both languages are used to form compound tenses, in both the active and the passive voice: perfect, pluperfect, future (past and perfect), and conditional.

The infinitive in Serbian and the gerunds in both languages are indeclinable.

3.3. Adjectives
The categories realized with adjectives in Bulgarian and Serbian are similar as well. The main differences are observed with the categories Case and Animateness. Adjectives are characterized by their morphological categories and their members (shown in Table 3).

3.4. Adverbs
Adverbs in traditional Bulgarian and Serbian grammars are considered indeclinable word types, although many of them have comparison: for example, бързо, brzo (rapidly); по-бързо, brže (more rapidly); and най-бързо, najbrže (most rapidly). These are usually treated as separate lemmas, but a paradigmatic analysis within the scope of the morphological category Comparison is also acceptable. There is another degree of comparison for adjectives and adverbs in Serbian, realized by adding the prefix po- to the superlative: ponajbrže, which relativizes the superlative and denotes, in this case, the fastest way among the slow ways.

4. Language resources integration

4.1. Bulgarian resources
There are three large Bulgarian resources: the Bulgarian WordNet (BulNet), which covers approximately one third of the general Bulgarian lexicon [Koeva & al., 2004]; the BGD, encoding lemmas and the corresponding inflection types; and the Bulgarian Frame Lexicon, encoding the arguments of verbs and their semantic features [Koeva, 2004c]. The combination of these resources results in their mutual enhancement, expansion and reliable validation.

The Bulgarian WordNet models nouns, verbs, adjectives, and (occasionally) adverbs, and already contains 24 405 synonym sets (as of 1 September 2005) comprising 51 584 literals (a ratio of 2.11). Following the standards accepted in the BalkaNet project, the structure of the Bulgarian and Serbian databases is organized in an XML file. Every synset encodes the equivalence relation between several literals (at least one has to be present), each having a unique meaning (specified in the SENSE tag value), belonging to one and the same part of speech (specified in the POS tag value), and expressing the same lexical meaning (defined in the DEF tag value). Each synset is related to the corresponding synset in the English WordNet 2.0 via its identification number ID. The synsets common to the Balkan languages are encoded in the tag Base Concepts (BCS). There has to be at least one language-internal relation (there can be more) between a synset and another synset in the monolingual database. There can also be several optional tags encoding usage, stylistic, morphological or syntactical features, and a stamp


marking the person who worked out the particular synset, as well as the time it was last edited.
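A single synset record of the kind just described can be sketched as follows. The element names follow the tags named in the text (ID, POS, SYNONYM/LITERAL with SENSE and LNOTE, DEF, BCS, ILR), but the exact nesting, the identifier ENG20-02512053-n, and the label N3 are assumptions for illustration, not taken from the actual database.

```python
import xml.etree.ElementTree as ET

# Build one hypothetical BulNet-style synset record.
synset = ET.Element("SYNSET")
ET.SubElement(synset, "ID").text = "ENG20-02512053-n"  # invented PWN 2.0 link
ET.SubElement(synset, "POS").text = "n"
synonym = ET.SubElement(synset, "SYNONYM")
literal = ET.SubElement(synonym, "LITERAL")
literal.text = "котка"                                 # Bulgarian literal ("cat")
ET.SubElement(literal, "SENSE").text = "1"
ET.SubElement(literal, "LNOTE").text = "N3"            # hypothetical BGD label
ET.SubElement(synset, "DEF").text = "feline mammal kept as a pet"
ET.SubElement(synset, "BCS").text = "1"
ilr = ET.SubElement(synset, "ILR")                     # language-internal relation
ilr.set("type", "hypernym")
ilr.text = "ENG20-02507525-n"                          # invented target id

print(ET.tostring(synset, encoding="unicode"))
```

The point of the sketch is only the division of labour among the tags: ID ties the synset to PWN, SENSE disambiguates each literal, and LNOTE is where the grammatical dictionaries plug in.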

In order to merge the language data existing in BulNet and the BGD, it was decided to assign an additional grammatical note to each literal, thus linking it with the BGD lemma's label [Koeva, 2004b]. All labels for BGD entry forms that are found in BulNet have been entered as values of the LNOTE grammatical tag in the XML format. Most of the literals which were not recognized are either specialized terms that have no place in a grammatical dictionary of the common lexis (often written in Latin) or compounds. The contradictory cases, where two or more labels were associated with one and the same literal, were resolved manually.

The grammatical specifications used in the Bulgarian Frame Lexicon are identical to those in the BGD and BulNet. Thus the Bulgarian WordNet is fully exportable together with the syntactic information available from the Frame Lexicon.

4.2. Serbian resources
The Serbian wordnet at present covers approximately one fifth of the Serbian general lexicon, but it is constantly being developed [Krstev & al., 2004a]. In the course of its development it has been enriched with information pertaining to the inflection of literals that are simple words. A software tool specially designed for this purpose enables the automatic transfer of all information on the inflectional class of a literal from the morphological dictionary into the wordnet, where it becomes the content of the <LNOTE> element for that literal (the <LNOTE> element is part of the content of the <LITERAL> element) [Krstev & al., 2004b]. The program allows the user to alter the automatically assigned class in cases when different choices are possible.

Inflection is of great importance for the Serbian language, given that the generation of inflected forms is not straightforward. This is best illustrated by the existence of a large number of homographic lemmas: for example, deka can be a synonym for a blanket, a unit of measurement (short for decagram), or a hypocoristic for grandfather1. In the first two cases the nouns are inanimate and of feminine gender, with the same inflection – they belong to one and the same inflectional class. In the third case the noun is animate, of masculine gender in

1 All three lemmas are accented in a different manner, but that is not obvious from the written text.

the singular and of feminine gender in the plural, and belongs to a different inflectional class.

4.3. Problems to be solved
Literals that are simple words can appear in the Bulgarian and Serbian wordnets without being lemmas in the morphological dictionary. Such is the case with animal and plant species, which appear as nouns in the plural – the singular denotes just one member of the species. For example, the Serbian wordnet contains the synset {Felidae:1, porodica Felidae:1, mačke:X}, where the value N603+Zool:p has been assigned to the <LNOTE> element for the last literal – which means that the literal belongs to the N603 inflectional class (a fleeting "a" appears in the genitive plural), is marked as an animal (+Zool), and is always used in the plural (:p). The corresponding Bulgarian synset is {:1, :1, :1, :1, Felidae:1, Felidae:1}, and the English one is {Felidae:1, family Felidae:1}, which belongs to the hierarchical branch that starts with {group:1, grouping:1}.

group:1, grouping:1
  biological group:1
    taxonomic group:1, taxonomic category:1, taxon:1
      family:6
        mammal family:1
          Felidae:1, family Felidae:1

On the other hand, mačka (cat) from the synset {životinja iz roda mačaka:1, mačka:1b} (corresponding to the PWN synset {feline:1, felid:1} and the Bulgarian synset {:1}), which belongs to the hierarchical branch that starts with {organism:1, being:2} and is in a holo_member relation with the former synset, has N603+Zool as the content of the <LNOTE> element, which means that the noun can appear both in the singular and the plural. This synset belongs to the hierarchical tree branch:

organism:1, being:2
  animal:1, animate being:1, beast:1, …
    chordate:1
      vertebrate:1, craniate:1
        mammal:1
          placental:1, placental mammal:1, …
            carnivore:1
              feline:1, felid:1
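An <LNOTE> value such as N603+Zool:p can be decoded mechanically into its inflectional class, markers, and number restriction. The value syntax below is inferred from the examples in the text, so the sketch is illustrative rather than a description of the actual tool.

```python
import re

# class code, then zero or more "+Marker" parts, then an optional
# ":s" / ":p" number restriction (singular-only / plural-only).
LNOTE = re.compile(r"(?P<cls>[A-Z]+\d+)(?P<markers>(?:\+\w+)*)(?::(?P<num>[sp]))?$")

def parse_lnote(value):
    """Decode an <LNOTE> value into (inflectional class, markers, number)."""
    m = LNOTE.match(value)
    if not m:
        raise ValueError(f"unrecognized LNOTE: {value!r}")
    markers = [x for x in m.group("markers").split("+") if x]
    return m.group("cls"), markers, m.group("num")

print(parse_lnote("N603+Zool:p"))  # class N603, marked as animal, plural only
print(parse_lnote("N603+Zool"))    # same class, both numbers allowed
```

A consumer of the wordnet can thus recover, per literal, exactly the information the morphological dictionary contributed.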

Many literals in the Bulgarian and Serbian wordnets, as in other wordnets, are not simple words but compounds. There are 12 636 compound literals out of 51 584 in BulNet (24.49 %) and, respectively, 3 081


such literals out of the 16 621 existing in the Serbian WordNet (18.53 %). The majority of them fall into one of the following categories:

1. Adjective(s) + noun, for example {konusni presek:1, kupasti presek:1} (corresponding to {conic section:1, conic:1} in PWN and to { :1} in the Bulgarian wordnet), or {konjska trka:1} (corresponding to {horse race:1} in PWN and to { :2, :1} in the Bulgarian wordnet).

2. Noun phrases where the noun is supplemented with a prepositional phrase: for example, {pobeda na poene:1} (corresponding to {decision:3} in PWN and { :1} in the Bulgarian wordnet), or {daska za peglanje:1} (corresponding to {ironing board:1} in PWN and { :1} in the Bulgarian wordnet).

3. Coordinate noun phrases (just a few), such as muž i žena in {bračni par:1, muž i žena:1} (corresponding to {marriage:2, married couple:1, man and wife:1} in PWN and { :1, :1, :1, :1} in the Bulgarian wordnet).

4. Verb phrases in which the verb is supplemented by a noun phrase, such as {:3, }, corresponding to {live:2} in PWN and {živeti život:1, voditi život:1} in the Serbian WordNet.

5. Genitive phrases in Serbian, such as {deljenje akcija:1} (corresponding to {split:9, stock split:1, split up:1} in PWN and {:1} in the Bulgarian wordnet), or {izraz lica:1} (corresponding to {countenance:1, visage:2} in PWN and {:2, :1} in the Bulgarian wordnet).

6. Noun-noun subordinate phrases in Serbian, which are the rarest: for example {biljka penjačica:1} (corresponding to {vine:1} in PWN and {:1, :1} in the Bulgarian wordnet).

Compounds have their own inflectional rules: for example, in the second and fifth cases only the head noun is inflected, whereas in the third and sixth cases both nouns are inflected. In the fourth case only the verb is inflected, in Serbian as well as in Bulgarian (in this particular case). In the first case the noun is inflected and the adjective(s) agree with the noun. A precise description of this type of inflection remains to be elaborated in accordance with the solution proposed in [Savary, 2005]. This is why the <LNOTE> elements for compounds in the Bulgarian and Serbian wordnets still remain empty. In the Bulgarian and Serbian wordnets, as

in PWN, there are a lot of Latin names for species that are in practice uninflected.
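The compound-inflection rules summarized above can be sketched as follows. The toy genitive function covers only regular a-stem feminine nouns and stands in for a full description of the kind proposed in [Savary, 2005]; the category numbers follow the list above.

```python
def genitive_noun(word):
    """Toy rule: regular a-stem feminine genitive singular (daska -> daske).
    Everything else is left unchanged - a placeholder, not real morphology."""
    return word[:-1] + "e" if word.endswith("a") else word

def genitive_compound(words, category):
    """Inflect a compound literal according to its category:
    2 and 5 inflect only the head noun, 3 and 6 inflect both nouns."""
    if category in (2, 5):          # head noun + frozen complement
        return [genitive_noun(words[0])] + words[1:]
    if category in (3, 6):          # both coordinated/subordinated nouns inflect
        return [genitive_noun(w) for w in words]
    raise NotImplementedError("other categories need fuller rules")

print(" ".join(genitive_compound(["daska", "za", "peglanje"], 2)))  # daske za peglanje
```

Encoding which component inflects is exactly the information a future <LNOTE> scheme for compounds would have to carry.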

5. Mirroring of PWN concepts and structure to Bulgarian and Serbian

The BalkaNet project adopted the Princeton WordNet structure and concepts as the model for the development of wordnets for five Balkan languages and Czech. However, the development of these wordnets showed that mirroring the PWN synsets and the relations among them to the Balkan languages is neither the simplest nor the most appropriate solution. Its rationale could be found principally in the necessity of obtaining a coherent multilingual lexical database. The problems encountered were many. We will illustrate some of them with examples related to Serbian and Bulgarian.

The simplest problem was the absence of specific PWN concepts in Serbian and/or Bulgarian. An example is the PWN concept defined as "an actor situated in the audience whose acting is rehearsed but seems spontaneous to the audience" and lexicalized as the synset {plant:4}. Although synsets for this concept have been introduced in both the Serbian and the Bulgarian wordnets, the lexicalizations in Serbian {glumac iz publike:1} and Bulgarian {:1, :1} in fact do not adequately represent the original PWN concept.

Conversely, the problem of the absence from PWN of Serbian and/or Bulgarian concepts, as well as of concepts from other BalkaNet languages, was also encountered. The solution to this problem was sought within the project in the introduction of language specific and Balkan specific concepts. Initially, a set of concepts not present in PWN was defined for each language, with appropriate synsets and an English definition attached.

At this stage 316 Serbian specific concepts were defined: 259 nouns, 9 verbs and 47 adjectives. There were 336 concepts defined for Bulgarian, 309 for Greek, 545 for Romanian, 332 for Turkish and 226 for Czech. The English definition attached to the appropriate synsets enabled the mutual comparison of language specific concepts and the extraction of concepts common to two or more languages, such as two oriental sweets common to Bulgarian, Greek, Romanian, Serbian and Turkish (Fig. 1), defined in all five initial sets of language specific concepts for these languages, and nonexistent in PWN.


Every language specific concept became a Balkan specific concept. These concepts were incorporated into the appropriate BalkaNet wordnets, and common concepts were linked via a BILI (BalkaNet ILI) index.

Figure 1. Two Balkan specific concepts common to five languages (e.g. Romanian cataif and halva, Turkish kadayıf and helva).

The initial set of Balkan specific common concepts consisted mainly of concepts reflecting the cultural specifics of the Balkans (many of them pertaining to family relations, religion, the socialist heritage, etc.). The Serbian wordnet presently contains 538 Balkan specific and 55 Serbian specific concepts, and the Bulgarian one 444 Balkan specific and 42 Bulgarian specific concepts.

There are other specific features of Bulgarian and Serbian that are of a linguistic nature and that prevent a strict one-to-one mapping with PWN. For example, only a very small number of possessive and relative adjectives can be found in PWN, whereas the initial set of language specific concepts for Bulgarian contained a number of relative adjectives, most of them having an equivalent in Serbian. For example, the relative adjective {:1} defined in Bulgarian as " " (of or related to steel) has the Serbian equivalent {čelični:1} with exactly the same definition "koji se odnosi na čelik". Another example is {:1}, defined in Bulgarian as " " (of or related to a soldier and army service), which has the Serbian equivalent {vojnički:1} with practically the same definition "koji se odnosi na vojnika ili njegovu službu". Another group of concepts specific both to Bulgarian and Serbian (but also to some other BalkaNet languages) are those lexicalized by nouns resulting from gender motion. Some of them were accepted as Balkan specific concepts. For example, {omladinac:1}, defined as "član, pripadnik omladinske organizacije" (a member of the youth organization), has its female gender counterpart lexicalized by a noun derived by gender motion, {omladinka:1}, defined as "devojka, član omladinske organizacije" (a girl, member of the youth organization). Both concepts, also related by gender motion, exist in Bulgarian: {:1} and {:1}. For some concepts which exist in PWN, such as {politician:2, politico:1, pol:1, political leader:1}, which have their Serbian and Bulgarian equivalents in {političar:1} and {:1}, there is no corresponding concept in PWN related to the female gender, whereas such a concept, again lexicalized by a noun derived by gender motion, exists in Serbian: {političarka:1}. In order to describe the relations between concepts in the aforementioned cases, specific relations, more specific than the derived relation already existing in PWN, were introduced in the Serbian wordnet, namely derived-pos and derived-gender. However, all these relations are in general inadequate, since they link synsets rather than literals, whereas the relation of derivation can only pertain to literals.

Among many other language specific features we mention here also concepts related to young animals, which do not exist in PWN, such as {čavče:1, čavčić:1}, a young čavka (jackdaw), or {jare:1, jarence:1, kozlić:1}, a young koza (goat). Related to these concepts are concepts denoting the birth of a young animal, lexicalized by appropriate verbs. Such concepts exist in Serbian for a number of various species, with a counterpart in PWN for only a few of them. An example is {ojariti se:1}, defined as "give birth to a goat". The same features appear in Bulgarian, although the equivalent examples are not yet included in BulNet.

A specific problem is posed by concepts lexicalized by nouns originating from regular derivation which alters neither the PoS nor the gender, such as diminutives and augmentatives [Vitas & Krstev, 2005]. There are several possible approaches to these nouns:

• treat them as denoting specific concepts and define appropriate synsets;

• include them in the synset with the noun they were derived from;

• omit their explicit mention, and instead let the flexion-derivation description encompass these phenomena as well.

The first approach is mandatory if the diminutive or augmentative acquires a special meaning: for example, the diminutive glavica from glava (head) is used in Serbian for the concept lexicalized in English as {head cabbage:1, head cabbage plant:1, Brassica oleracea capitata:1}, whereas the augmentative glasina from glas (voice) is used for the concept lexicalized in English as {rumor:1, rumour:1, hearsay:1} and in Bulgarian as {:2, :1, :1}, defined as “gossip (usually a mixture of truth and untruth) passed”. On the other hand, if the third approach is accepted, the question arises whether it is possible to apply the same approach to other regular phenomena (gender motion and possessive adjectives).

6. Conclusions

For both languages, the importance of including inflectional information in the wordnet has been recognized and, consequently, it was added to the wordnets of the respective languages. However, a lot of work still remains to be done, particularly on the inflectional description of compound words. The first results obtained by comparing the extensive and powerful resources already developed promise their successful use in many NLP applications.

References

[Courtois, 1990] Courtois, B. (1990). Le dictionnaire DELAS. In Dictionnaires électroniques du français, Langue française n° 87 (pp. 11-22). Larousse: Paris.

[Courtois et al., 1990] Courtois, B., Silberztein, M. (Eds.) (1990). Dictionnaires électroniques du français, Langue française n° 87 (127 pages). Larousse: Paris.

[Fellbaum, 1998] Fellbaum, C. (ed.). WordNet: An Electronic Lexical Database. Cambridge, Mass.: MIT Press, 1998.

[Koeva, 1998] S. Koeva. Bulgarian Grammar Dictionary. Description of the linguistic data organization concept. In: Bulgarian language, 1998, 6, 49-58.

[Koeva et al., 2004] S. Koeva, T. Tinchev and S. Mihov. Bulgarian Wordnet – Structure and Validation. In: Romanian Journal of Information Science and Technology, Volume 7, No. 1-2, 2004: 61-78.

[Koeva, 2004a] S. Koeva. Modern language technologies – applications and perspectives. In: Lows of/for language, Hejzal, Sofia, 2004, 111-157.

[Koeva, 2004b] S. Koeva. Validating Bulgarian WordNet using grammatical information. In: Proceedings of the Joint International Conference of the Association for Literary and Linguistic Computing and the Association for Computers and the Humanities, Göteborg University, 2004, 80-82.

[Koeva, 2004c] Koeva, S. Theoretical model for a formal representation of syntactic frames. Scripta & e-Scripta, Vol. 2, Sofia, 2004: 9-26.

[Krstev et al., 2004a] C. Krstev, G. Pavlović-Lažetić, D. Vitas, I. Obradović, S. Stamou, K. Oflazer, K. Pala, D. Christodoulakis, D. Cristea, D. Tufis, S. Koeva, G. Totkov, D. Dutoit, M. Grigoriadou (2004a). Using Textual and Lexical Resources in Developing Serbian Wordnet. In: Romanian Journal of Information Science and Technology, Volume 7, Numbers 1-2, 2004, pp. 147-161.

[Krstev et al., 2004b] C. Krstev, D. Vitas, R. Stanković, I. Obradović, G. Pavlović-Lažetić (2004b). Combining Heterogeneous Lexical Resources. In: Proceedings of the Fourth International Conference on Language Resources and Evaluation, Lisbon, Portugal, May 2004, vol. 4, pp. 1103-1106, ARTIPOL - Artes Tipograficas, Lda, Portugal.

[Miller, 1990] Miller, G. A. Introduction to WordNet: An On-Line Lexical Database. In: International Journal of Lexicography, Miller G.A., Beckwith R., Fellbaum C., Gross D., Miller K.J., Vol. 3, No. 4, 1990, 235-244.

[Savary, 2005] A. Savary (2005). Towards a Formalism for the Computational Morphology of Multi-Word Units. In: Proceedings of the 2nd Language & Technology Conference, April 21-23, 2005, Poznań, Poland, ed. Zygmunt Vetulani, pp. 305-309, Wydawnictwo Poznańskie Sp. z o.o., Poznań.

[Silberztein, 1990] Silberztein, M. Le dictionnaire DELAC. In: Dictionnaires électroniques du français, Langue française n° 87 (pp. 73-83). Larousse: Paris.

[Stamou, 2002] Stamou, S., K. Oflazer, K. Pala, D. Christodoulakis, D. Cristea, D. Tufis, S. Koeva, G. Totkov, D. Dutoit, M. Grigoriadou. BALKANET: A Multilingual Semantic Network for the Balkan Languages. In: Proceedings of the International Wordnet Conference, Mysore, India, 21-25 January 2002, 12-14.

[Vitas et al., 2003] D. Vitas, C. Krstev, I. Obradović, Lj. Popović, G. Pavlović-Lažetić (2003). An Overview of Resources and Basic Tools for the Processing of Serbian Written Texts. In: Workshop on Balkan Language Resources and Tools, November 21, Thessaloniki, Greece.

[Vitas & Krstev, 2005] Duško Vitas, Cvetana Krstev (2005). Derivational Morphology in an E-Dictionary of Serbian. In: Proceedings of the 2nd Language & Technology Conference, April 21-23, 2005, Poznań, Poland, ed. Zygmunt Vetulani, pp. 139-143, Wydawnictwo Poznańskie Sp. z o.o., Poznań, 2005.

[Vossen, 1999] Vossen, P. (ed.). EuroWordNet: A Multilingual Database with Lexical Semantic Networks for European Languages. Kluwer Academic Publishers, Dordrecht, 1999.


Dictionary, Statistical and Web Knowledge in Shallow Parsing Procedures for Inflectional Languages

Preslav Nakov
EECS Department, CS Division
University of California at Berkeley
Berkeley, CA
[email protected]

Elena Paskaleva
Linguistic Modelling Department
Institute for Parallel Processing, BAS
25A, Acad. G. Bontchev St.
Sofia
[email protected]

Abstract

The paper studies the potential of using diagnostic word endings for shallow noun phrase identification in Bulgarian. The problem is addressed from a machine-learning perspective, as a 9-way classification task using a large morphological dictionary, raw text and Web frequencies. The evaluation shows 93% per-word precision and 98% accuracy for the nominal chunking.

1 Introduction

Unlike other European languages (e.g. English), Slavonic languages are highly inflective, and thus word endings¹ provide a lot of information about the grammatical and/or syntactic function of the underlying wordform. They also demonstrate well-developed derivational mechanisms, with many distinct derivational forms per lemma built by means of agglutinative suffixes, e.g. veroyat-en (probable), veroyat-nost (probability), veroyat-nosten (probabilistic), etc., which in many cases give the impression of bearing high semantic content. Academician Shcherba offers the following famous example in Russian: “Glokaya kuzdra xteko bodlanula kuzdrenka.”, a counterpart of Chomsky's sentence “Colorless green ideas sleep furiously.”. In the former case, meaningless lexical entities with well-formed grammatical endings produce a unique semantic reading about some kind of female animal that does something in a particular way to her child (compare to “Belaya gusynya bolno podtolknula gusenka.” – ‘The white goose painfully pushed the gosling.’), while in the latter, meaningful lexical units with a perfect grammatical structure, but wrong semantics, yield a nonsensical expression.

The high informativeness of word endings is the basis for many applications of unsupervised statistical language processing methods, especially in the absence of organised linguistic knowledge, e.g. dictionaries, grammars, etc. Because ending guessing operates on the word surface, it is efficient for procedures that are shallow and require some guessing. This is especially true for texts from a specific domain with a low degree of normativeness, where using lexical bases directly is not feasible because of new words, typos, terms, dynamic expressions, etc.

In previous research, we studied the automatic extraction of diagnostic word endings for Slavonic languages, aimed at determining grammatical, morphological or semantic properties of the underlying word (Nakov & Paskaleva 04). Using a scoring function previously proposed by (Mikheev 97) for part-of-speech (POS) guessing, we learned ending guessing rules from a large morphological dictionary of Bulgarian in order to predict POS, gender, number, definiteness (definite article) and semantics. The evaluation demonstrated coverage close to 100% and precision within 97-99%.

The observed success of guessing encouraged us to look for other applications of this methodology, where ending rules serve as an additional knowledge source for real tasks. One such example, which we present below, is the task of nominal chunking in Bulgarian: the nominal wordform formation mechanism is highly productive, and the newly created wordforms are typically nouns, which makes the task a perfect candidate application.

¹ Below we use the term ending to refer to the final morphological word formants, regardless of their actual grammatical/syntactic/semantic function.

2 Related Work

Chunking and shallow parsing. The idea of chunking as an approach to parsing was proposed by (Abney 91). (Ramshaw & Marcus 95) used transformation-based learning to train a text chunker. (Sang 02) applies memory-based learning and weighted majority voting for base phrase recognition, (Molina & Pla 02) use an HMM, (Zhang et al. 02) adopt a generalized version of Winnow, (Megyesi 02) retrains POS taggers, (Dejean 02) learns top-down rules, and (Osborne 02) uses noisy data. There has been a lot of recent research on text chunking: CoNLL had it as a shared task in 2000 (and a related task, on clause identification, in 2001), and the Journal of Machine Learning Research had a special issue on shallow parsing in 2002 (Hammerton et al. 02).

POS guessing. (Kupiec 92) uses pre-specified suffixes and performs statistical learning for POS guessing. The XEROX tagger comes with a list of built-in ending guessing rules (Cutting et al. 92). Very influential is the work of (Brill 97), who induces more linguistically motivated rules exploiting both a tagged corpus and a lexicon: he looks not only at affixes, but also checks the word's POS class in a lexicon. (Mikheev 97) proposes a similar approach, but learns the rules from raw, as opposed to tagged, text. (Daciuk 99) speeds up the process by means of finite state transducers.

3 Nominal Chunking Procedure

We start with some definitions. A chunk represents a shallow level of syntax, generally corresponding to a low level of non-recursive constituents. A nominal chunk (below just chunk) is a sequence of two or more words which form a syntactic group. As we will see below, it can typically be found easily in analysed text using formal syntactic and morphological characteristics that restrict the possible word sequences. Each nominal chunk has a head, followed by a tail. The head is a complex entity, which includes a base (the first word), followed by optional extensions. Both the base and the extensions share some common characteristics with the head. There can be one or more “nodi” between the head's elements. The nodi are words that share no common characteristics with the head elements, but are connected to them via a strong cohesion: immediate following, unambiguous reading, etc. Consider the following nominal chunk:

prekrasnata  mu   napisana  otdavna   kniga
magnificent  his  written   long-ago  book

‘the magnificent book of his that he wrote long ago’

Here the head is prekrasnata mu napisana otdavna and the tail is kniga. The base is prekrasnata, napisana is an extension, and mu and otdavna are nodi. The first nodus mu is a short possessive pronoun which, according to the rules of Bulgarian grammar, always immediately follows the head (which in turn should be definite), despite the fact that it references (both syntactically and semantically) the tail kniga. The second nodus otdavna is a temporal adverb, which should immediately follow or immediately precede an adjectival. Below we use the term adjectival to refer to a syntactic category that acts as an adjective. The following syntactic categories can be adjectivals: adjectives, pronouns, numerals and participles.

The different head elements (the base and its extensions) need to agree in gender and number both between themselves and with the tail. A summary of these agreement requirements is shown in Table 1. There are several additional non-agreement requirements (e.g. on the number and kinds of pronouns, on the number of numerals, etc.), which we omit here.

Each row in Table 1 defines a rule for the automatic identification of nominal chunks. For example, the second row says that we combine a sequence of words into a chunk if the first one is a singular feminine adjectival, followed by zero or more singular feminine adjectivals, ordinal numerals or pronouns², and finally by a singular feminine noun. In addition, there can be one or more optional nodi, which are restricted by additional rules, omitted here.

BASE            EXTENSION                     TAIL
lexical   gram. lexical           gram.       lexical  gram.
Aa        sm    Aa/NU+ORD/PRO     sm          N+M      s
Aa        sf    Aa/NU+ORD/PRO     sf          N+F      s
Aa        sn    Aa/NU+ORD/PRO     sn          N+N      s
Aa        p     Aa/NU+any/PRO     p/any/p     N+any    p
NU+ORD    sm    Aa/PRO            sm          N+M      s
NU+ORD    sf    Aa/PRO            sf          N+F      s
NU+ORD    sn    Aa/PRO            sn          N+N      s
NU+ORD    p     Aa/NU+CAR/PRO     p/any/p     N+any    p
NU+CAR    any   Aa/NU+ORD/PRO     p           N+any    p
PRO       sm    Aa/NU+ORD/PRO     sm          N+M      s
PRO       sf    Aa/NU+ORD/PRO     sf          N+F      s
PRO       sn    Aa/NU+ORD/PRO     sn          N+N      s
PRO       p     Aa/NU+CAR/PRO     p/any/p     N+any    p

Table 1: Base-extension-tail agreement. Lexical annotations: ‘Aa’ stands for adjectival (see the text for the definition of adjectival), ‘N+M’ for masculine noun, ‘N+F’ for feminine noun, ‘N+N’ for neuter noun, ‘N+any’ for any noun, ‘NU+ORD’ for ordinal numeral, ‘NU+CAR’ for cardinal numeral, ‘NU+any’ for any numeral, ‘PRO’ for pronoun. Grammatical annotations: ‘s’ stands for singular, ‘p’ for plural, ‘m’ for masculine, ‘f’ for feminine, ‘n’ for neuter, ‘any’ for no restrictions.

We have developed a nominal chunk extraction tool, which uses Table 1 (together with some additional restrictions) as a guide for checking agreement between the different chunk elements. The annotations are obtained by looking up the words in a large morphological dictionary of Bulgarian, created at the Linguistic Modeling Department, Institute for Parallel Processing, Bulgarian Academy of Sciences (Paskaleva 03). The dictionary is in DELAF format (Silberztein 93), and at the time of our experiments it contained 908,525 wordforms (60,137 lemmata). Each entry consists of a wordform and a corresponding lemma, followed by morphological and grammatical information. There can be multiple entries for the same word, in case of multiple homographs. Table 2 shows the dictionary entries for the words in the nominal chunk prekrasnata mu napisana otdavna kniga.

Of course, the above dictionary is incomplete, as new words are constantly added to Bulgarian through derivation, inflection, borrowing from foreign languages, misspellings, etc. Thus we need a guessing mechanism that would allow us to decide whether a particular out-of-vocabulary word meets a given set of restrictions, as listed in Table 1.

We address the guessing problem as a machine learning task, trying to learn diagnostic word endings that can predict a suitable set of word classes. Unlike our previous work (Nakov & Paskaleva 04), where we learned separate sets of rules to predict POS (adjective, adverb, noun, numeral, verb), definiteness (definite, indefinite, none), gender (feminine, masculine, neuter, none), number (singular, plural, none) and semantics (human, animate, none), here we learn rules that directly predict the combinations of syntactic and lexical features that are tested in Table 1. We did not try to predict combinations that involve pronouns (PRO), as they represent a closed-class syntactic category. We also chose to rule out the possibility that an unknown word could be a numeral (NU) because, despite being an open-class syntactic category, the frequent numerals are exhaustively enumerated in our dictionary. As a result, we reduced our task to a 9-way classification into the following classes: Aa:p, Aa:sf, Aa:sm, Aa:sn, N+F:s, N+M:s, N+N:s, N+any:p and other. Note that the first eight classes are non-intersecting (Bulgarian distinguishes gender in singular forms only) and that the last class covers everything else. Note also that gender is a lexical annotation for the nouns, but a grammatical one for the adjectivals.

We believe that targeting these 9 categories directly is a better idea than separately guessing POS, number and gender, as some of the combinations do not make sense and could complicate training. For example, we do not want to try to predict things like Aa:pm, i.e. a plural masculine adjectival, as plural adjectivals have the same form for all genders. Similarly, it does not make sense to talk about gender or number for some POS, such as verbs.

² There are additional restrictions, e.g. on the kinds and types of pronouns, which are not shown.
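The base-extension-tail matching against the rows of Table 1 can be sketched in a few lines of Python. This is a minimal illustration with our own data layout and only two of the thirteen rows encoded; it is not the authors' implementation, and nodi are ignored.

```python
# Minimal sketch of Table 1 matching. An annotation is a
# (lexical, grammatical) pair; each rule row gives a base pattern,
# the allowed extension patterns, and a tail pattern.

RULES = [
    (("Aa", "sm"), [("Aa", "sm"), ("NU+ORD", "sm"), ("PRO", "sm")], ("N+M", "s")),
    (("Aa", "sf"), [("Aa", "sf"), ("NU+ORD", "sf"), ("PRO", "sf")], ("N+F", "s")),
    # ... remaining rows of Table 1
]

def matches(annotation, pattern):
    """An annotation satisfies a pattern when every field is equal;
    'any' in the pattern matches everything."""
    return all(p in ("any", a) for a, p in zip(annotation, pattern))

def is_chunk(annotations):
    """True if the sequence forms a nominal chunk under some rule:
    a base, zero or more extensions, and a tail."""
    if len(annotations) < 2:
        return False
    base, *middle, tail = annotations
    return any(
        matches(base, b) and matches(tail, t)
        and all(any(matches(m, e) for e in exts) for m in middle)
        for b, exts, t in RULES
    )

# 'prekrasnata kniga': singular feminine adjectival + feminine noun
print(is_chunk([("Aa", "sf"), ("N+F", "s")]))  # True
print(is_chunk([("Aa", "sm"), ("N+F", "s")]))  # False (gender clash)
```

Extensions are checked independently against the allowed alternatives of the matched row, mirroring the "zero or more" reading of the table.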

4 Ending Guessing Rules

4.1 Rule Scoring and Selection

word         lemma      lex.      gr.
prekrasnata  prekrasna  A+GR      sfd
mu           negov      PRO+POS   S
mu           toi        PRO+PER   SD
mu           to         PRO+PER   S
napisana     napixa     V+PF+T    Psf
otdavna      otdavna    ADV+TM    —
kniga        kniga      N+F       s

Table 2: Dictionary annotations for the words in the nominal chunk “prekrasnata mu napisana otdavna kniga”. Only the first annotation of mu is possible in this context. Lexical annotations: ‘PRO+PER’ stands for personal pronoun, ‘PRO+POS’ for possessive pronoun, ‘ADV+TM’ for temporal adverb, ‘N+F’ for feminine noun, ‘A+GR’ for an adjective that can have comparative and superlative forms, ‘V+PF+T’ for perfect transitive verb. Grammatical annotations: ‘s’ stands for singular, ‘f’ for feminine, ‘d’ for definite, ‘S’ for short form, ‘D’ for dative form, ‘P’ for passive participle.

Our rules are similar to the ones proposed by (Mikheev 97), who uses a dictionary to build POS prediction rules with four parts: deletion (–), addition (+), checking against the dictionary (?) and POS assignment (→). Generally speaking, each rule operates either on the beginning or on the ending of the target wordform. For example, the following rule says that if an unknown word ends in -ied, this ending should be stripped, -y should be appended, a test should be performed of whether the newly created word is in the dictionary and annotated as (VB VBP) there, and if so, (JJ VBD VBN) should be predicted for the original word:

e[-ied +y ?(VB VBP) → (JJ VBD VBN)]
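Such a rule can be applied mechanically: strip the deletion part, append the addition part, look the result up, and emit the prediction if the required tags are present. The sketch below, including the one-entry toy dictionary, is our own illustration of this mechanism:

```python
# Applying a Mikheev-style rule [-ied +y ?(VB VBP) -> (JJ VBD VBN)].
# The dictionary is a toy example for illustration only.

DICTIONARY = {"specify": {"VB", "VBP"}}

def apply_rule(word, delete, add, required_tags, prediction):
    """Return `prediction` if the rule fires on `word`, else None."""
    if not word.endswith(delete):
        return None
    stem = word[: len(word) - len(delete)] + add   # e.g. specif + y
    if not required_tags <= DICTIONARY.get(stem, set()):
        return None  # stem missing from dictionary, or tagged differently
    return prediction

print(apply_rule("specified", "ied", "y", {"VB", "VBP"},
                 ("JJ", "VBD", "VBN")))  # ('JJ', 'VBD', 'VBN')
```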

All rule elements are optional, except for the POS assignment. This means that a rule can just add and/or remove letters, without looking in the dictionary (although it could potentially benefit if it did). When both removal and addition are used, one can account for mutations in the word stem. In fact, Mikheev uses the following restricted types of rules: Prefix (prefix deletion and dictionary lookup), Suffix0 (suffix deletion and dictionary lookup), Suffix1 (suffix deletion with mutation in the last letter and dictionary lookup), and Ending (suffix deletion). There are separate ending guessing rules for hyphenated, capitalised and all other words.

Given a dictionary, a scan through the wordforms is performed, during which all possible rules are collected and scored, and those above some threshold³ are selected. Finally, rule merging is applied to rules with identical preconditions but different predictions: the new rule predicts the union of the predictions of the original rules, which results in higher ambiguity but potentially allows the rule to pass above the threshold after rescoring.

³ We used a threshold of 0.5, as our previous experiments had found it to perform best (Nakov & Paskaleva 04).

like rules: we do not treat the hyphenated orcapitalised wordforms in any special way and welimit ourselves to ending rules without dictionarylookup and single-class predictions4.The intuition behind the Mikheev’s rule score is

that a good guessing rule should be unambiguous(predicts a particular class without or with onlyvery few exceptions), frequent (based on a largenumber of occurrences) and long (the longer therule the better its prediction). These criteria arecombined in the following scoring function:

score = p − t_{(n−1),(1−α)/2} · sqrt( p(1−p) / n ) / (1 + log l)

where:

• l is the rule length;

• x is the number of successful rule guesses;

• n is the total number of training instances compatible with the rule;

• p = (x + 0.5)/(n + 1) is a smoothed version of the maximum likelihood estimate, which ensures that neither p nor (1−p) can be zero;

• sqrt(p(1−p)/n) is an estimate of the dispersion;

• t_{(n−1),(1−α)/2} is the coefficient of the t-distribution with n−1 degrees of freedom and confidence level α.
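A sketch of this scoring function in Python follows. To stay within the standard library we substitute the fixed value 1.96 (the large-n normal approximation at α = 0.95) for the exact t coefficient; this substitution is our simplification, not the paper's.

```python
import math

def rule_score(x, n, l, t=1.96):
    """Mikheev-style rule score.
    x: successful guesses, n: compatible training instances,
    l: rule length; t approximates t_{(n-1),(1-alpha)/2}."""
    p = (x + 0.5) / (n + 1)                  # smoothed estimate, never 0 or 1
    dispersion = math.sqrt(p * (1 - p) / n)  # estimated dispersion
    return p - t * dispersion / (1 + math.log(l))

# A long, frequent, nearly unambiguous ending scores close to 1,
# while a short, rare, ambiguous one is heavily penalised.
print(rule_score(x=6593, n=6594, l=4))
print(rule_score(x=3, n=5, l=1))
```

Dividing the subtracted dispersion term by (1 + log l) is what rewards longer rules: the penalty shrinks as the ending grows.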

4.2 Rule Cleansing

In our experiments, we used Mikheev's scoring function to learn ending guessing rules predicting our 9 classes. Each potential ending is scored, and the ones above some threshold are selected. In case multiple rules are applicable to a particular word, the longest one is chosen. This allows us to clean the rule set, as described in (Nakov & Paskaleva 04). The idea is as follows. Suppose there is a rule “-vaha”, which was met 6,593 times as a verb and once as a noun (i.e. it is 99.98% correct). Now, suppose there is another one, “-vaha”, which was met 1,498 times, always as a verb. Further, let there also be rules like “-avaha”, “-kvaha”, etc., all making the same prediction. Then we do not need to keep them all: we only need “-vaha”. Removing redundancies of this kind leads to a dramatic drop in the number of rules, without altering any potential system decision. In the experiments below, we always applied this rule cleansing procedure, and we report the reduced number of rules only.

⁴ But, as we will see below, because of systematic homonymy we would potentially benefit a lot if we allowed rules predicting multiple classes.
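The cleansing step can be sketched as follows: because the longest applicable ending always wins, a rule is redundant when the next-longest ending rule that is its suffix makes the same prediction. The rule set and class labels below are illustrative only, not the authors' code.

```python
def cleanse(rules):
    """Drop rules whose removal cannot change any decision.
    rules: dict mapping ending -> predicted class."""
    kept = {}
    for ending, pred in rules.items():
        # the longest proper suffix that is itself a rule would
        # fire instead if this rule were removed
        backoff = next(
            (rules[ending[i:]] for i in range(1, len(ending))
             if ending[i:] in rules),
            None,
        )
        if backoff != pred:        # keep only decision-changing rules
            kept[ending] = pred
    return kept

rules = {"vaha": "V", "avaha": "V", "kvaha": "V", "ha": "OTHER"}
print(cleanse(rules))  # {'vaha': 'V', 'ha': 'OTHER'}
```

Here “-avaha” and “-kvaha” are dropped because “-vaha” already makes the same prediction, while “-vaha” itself survives because its own backoff “-ha” predicts differently.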

5 Experiments and Discussion

We ran three types of guessing experiments, estimating the frequencies of the dictionary words:

• uniformly;
• from raw text;
• using search engine page hits.

We collected 25 MB of various Bulgarian texts as follows:

• newspapers: 21,118 KB;
• legal: 1,990 KB;
• prose: 1,713 KB;
• religion: 393 KB.

We use this collection to find hapax words for testing (see below). We also use it to estimate the frequencies of the dictionary words. For example, the word izbori (‘elections’) is met 1,069 times; so, when we extract all possible word endings (-i, -ri, -ori, -bori, -zbori and izbori), we pretend that each has been observed 1,069 times.

In the last set of experiments, we used a search engine to estimate the word frequencies: we tried Google and MSN Search. While Google supposedly has a larger index, it rounds the number of page hits when it is too high⁵. MSN Search has a smaller overall index, but always returns unrounded numbers and has been found to provide better page hit estimations for English (Nakov & Hearst 05). It was unclear, though, whether this would be the case for a language like Bulgarian, which is under-represented on the Web, so we decided to try both search engines. Finally, we experimented both with and without limiting the queries to Bulgarian pages only.

⁵ Probably because, under high loads, Google samples from its index rather than performing exact calculations.
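The frequency-weighted ending extraction described above can be sketched as follows (the function is our own illustration):

```python
from collections import Counter

def ending_counts(word_freqs):
    """Credit every suffix of each word (including the whole word)
    with that word's estimated corpus frequency."""
    counts = Counter()
    for word, freq in word_freqs.items():
        for i in range(len(word)):
            counts[word[i:]] += freq
    return counts

# 'izbori' seen 1,069 times yields six endings, each counted 1,069 times
counts = ending_counts({"izbori": 1069})
print(counts["i"], counts["ori"], counts["izbori"])  # 1069 1069 1069
```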

Class     P        C        F
A:p       94.55%   98.24%   96.36%
A:sf      94.13%   98.26%   96.15%
A:sm      95.32%   98.09%   96.68%
A:sn      93.26%   94.60%   93.92%
N+F:s     90.32%   84.92%   87.54%
N+M:s     92.68%   83.54%   87.87%
N+N:s     93.47%   63.25%   75.44%
N+any:p   92.02%   80.74%   86.01%
OTHER     90.65%   94.27%   92.43%
Overall   92.91%   100.00%  96.32%

Table 3: Frequency 1 for all: 18,949 rules.

Class     P        C        F
A:p       95.32%   97.86%   96.57%
A:sf      96.05%   97.28%   96.66%
A:sm      95.77%   97.47%   96.61%
A:sn      93.30%   94.84%   94.06%
N+F:s     86.50%   86.93%   86.71%
N+M:s     91.74%   82.28%   86.75%
N+N:s     90.65%   64.80%   75.57%
N+any:p   90.99%   82.19%   86.37%
OTHER     89.97%   94.02%   91.95%
Overall   92.76%   99.86%   96.18%

Table 4: 25 MB of raw text: 25,007 rules.

5.1 Single-Word Experiments

In order to estimate the quality of the learned ending guessing rules, we first tested their performance on individual words in isolation. We followed the suggestion in (Baayen & Sproat 95), (Dermatas & Kokkinakis 95) and (Mikheev 97) that unknown words have a distribution similar to that of hapax words (words met only once in a text corpus). We extracted 66,481 of them from the 25 MB text collection, but only 38,042 were in our morphological dictionary, the rest being mostly misspellings, concatenations, etc. We then split the dictionary into two parts: we used these 38,042 words for testing and the remaining 870,483 wordforms for training. We extracted the ending guessing rules by looking at all endings of all wordforms from the training part of the dictionary. We then applied the rules thus learned (each time preferring the longest one that is compatible with the target word) to the testing part of the dictionary, and we measured the precision P (in what % of the cases the predicted class matched the hypothesised one) and the coverage C (in what % of the cases there was a rule compatible with the target wordform). We also calculated a kind of F-measure, which is normally defined as 2PR/(P + R), where R is the recall (the proportion of proposed instances out of all that had to be found). Precision, recall and F-measure are defined in the information retrieval community in terms of positive and negative documents for a given query, i.e. with respect to a single class, but here we have many classes. While we can define both an overall and a class-specific precision, it makes more sense to talk about recall with respect to a particular class, but about coverage when the measure is over all classes. So, we redefined the F-measure as 2PC/(P + C).

The results of the experiments are shown in Tables 3, 4, 5, 6, 7 and 8. We can see that the best overall results are achieved when we assume a uniform frequency of 1 for every dictionary word, but the differences between the experiments are very subtle and not statistically significant. In all tables, the major problem is the low coverage for N+N:s (neuter singular noun), which is due to the systematic homonymy in Bulgarian of that form with the corresponding adverb form.

Class     P        C        F
A:p       95.82%   97.40%   96.60%
A:sf      96.31%   97.25%   96.78%
A:sm      96.08%   97.44%   96.76%
A:sn      95.29%   92.01%   93.62%
N+F:s     86.04%   87.68%   86.85%
N+M:s     91.10%   83.12%   86.93%
N+N:s     88.59%   65.75%   75.48%
N+any:p   89.98%   83.02%   86.36%
OTHER     89.45%   94.45%   91.89%
Overall   92.74%   99.92%   96.10%

Table 5: Google (Bulgarian): 28,414 rules.

Class     P        C        F
A:p       95.79%   97.47%   96.62%
A:sf      95.85%   97.36%   96.60%
A:sm      95.94%   97.32%   96.63%
A:sn      94.97%   92.89%   93.92%
N+F:s     87.07%   86.10%   86.58%
N+M:s     89.86%   84.49%   87.09%
N+N:s     88.64%   66.11%   75.73%
N+any:p   90.24%   82.96%   86.45%
OTHER     89.98%   94.19%   92.04%
Overall   92.77%   100.00%  96.25%

Table 6: Google (any language): 27,183 rules.

Class     P        C        F
A:p       95.75%   97.45%   96.59%
A:sf      96.09%   97.34%   96.71%
A:sm      95.85%   97.29%   96.57%
A:sn      91.94%   97.70%   94.73%
N+F:s     85.76%   87.49%   86.62%
N+M:s     90.99%   82.80%   86.70%
N+N:s     88.64%   66.11%   75.73%
N+any:p   89.76%   83.23%   86.37%
OTHER     91.04%   92.81%   91.92%
Overall   92.79%   99.89%   96.21%

Table 7: MSN (Bulgarian): 30,328 rules.

Class     P        C        F
A:p       95.93%   97.45%   96.69%
A:sf      96.45%   97.08%   96.77%
A:sm      95.99%   97.11%   96.55%
A:sn      95.00%   92.98%   93.98%
N+F:s     85.94%   87.10%   86.52%
N+M:s     89.82%   84.58%   87.12%
N+N:s     88.54%   66.35%   75.85%
N+any:p   89.71%   83.56%   86.53%
OTHER     90.14%   94.16%   92.11%
Overall   92.81%   99.93%   96.24%

Table 8: MSN (any language): 30,805 rules.
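As a quick sanity check, the redefined measure F = 2PC/(P + C) reproduces the ‘Overall’ row of Table 3 (P = 92.91%, C = 100.00%, F = 96.32%):

```python
def f_measure(p, c):
    """Redefined F-measure, with coverage C in place of recall."""
    return 2 * p * c / (p + c)

print(round(f_measure(92.91, 100.0), 2))  # 96.32
```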

5.2 Nominal Chunking Experiments

In a second experiment, we tested the accuracy of the ending guessing rules when applied to the task of nominal chunking. Here we chose a 1,000 words long medical diagnosis text containing 112 nominal chunks of length two or longer (our gold standard). We ran three types of experiments: (1) no guessing, using the dictionary only; (2) guessing the class for all open-class words; (3) guessing the class for the unknown words only. For experiment types (2) and (3), we tried a uniform frequency for each dictionary word, estimation from raw text, and using a search engine.


The results are given in Table 9. In the exper-iment ‘no guess’, 58 chunks were found, of which56 were correct, i.e. we have a precision of 56/58(96.55%) and a recall of 56/112 (50.00%). In theexperiment ‘one only’, 96 chunks were found, 90 ofwhich were correct, yielding a precision of 90/96(93.75%) and a recall of 90/112 (80.36%). In theexperiment ‘dict.+one’, 111 nominal chunks wereidentified, of which 105 were correct. This yieldsa precision of 105/111 (94.59%) and a recall of105/112 (93.75%). In the remaining experiments‘dict.+XXX’, 4 of the predictions made by theending predictions in ‘dict.+one’ were now cor-rect, i.e. we had precision of 109/111 (98.20%)and a recall of 109/112 (97.32%).Looking at Table 9, we can observe the inter-

esting fact that using ending guessing rules onlyis much better than using the dictionary only:while by doing so we lose a little bit on preci-sion (96.55% vs. 93.75%), we win a lot on recall(50.00% vs. 80.36%). We believe, the main rea-son for that is the kind of text we had chosen:a specialised medical text, with a large propor-tion of out-of-vocabulary words: 18.5%. Contraryto the results in the single-word experiments sec-tion, here using a uniform weight for the dictio-nary words is a bad idea, compared to using es-timates from text or from a search engine. Notsurprisingly, MSN, (which gives exact estimates)looks better compared to Google, but requiring aBulgarian language filter is even more important.There are three kinds of errors: (a) part of

the chunk has been wrongly annotated as other,which makes it unusable by the rules in Table1; (b) part of the chunk has been wrongly anno-tated as a wrong class, different from other; and(c) elements that are neighboring to the chunkhave been annotated as potential chunk elements,which caused the chunking rules to extend thechunk beyound its true boundaries.A sample error of type (c) is

“e{PC,V+AUX:R3s} ustanovena{A:sf} krvna{A:sf} zahar{N+F:s}”, ‘has been detected blood sugar’. Here the chunk wrongly included the adjective ustanovena (‘detected’), which is part of the preceding verbal chunk. Note, however, that this inclusion does not break the informativeness of the nominal chunk, as it is semantically connected to it. The impact of such kinds of errors is very application-dependent. For example, they would not hurt an information extraction system that relies on the chunker.

Finally, we performed an evaluation over a

1,000 words long legal text. The results are shown in Table 10. This is a very different kind of text: almost all words are known, and using the dictionary only correctly identifies the boundaries of all 113 nominal chunks of length at least two that are present in that text (if we allow the rules in Table 1 to apply in case of ambiguous homonyms). Using ending rules only misses 26 of the chunks: 10 because of a missing word annotation, and 16 because of a wrong annotation. In 13 out of these 16 cases, there was an ambiguity and the ending rules had chosen the wrong homonym. For example, in “politiqeski.A:sm celi.A:p” (‘political objectives’), politiqeski was wrongly annotated as a singular masculine adjectival (here it should be plural, but there is a homonym with the proposed annotation) and celi should be a plural feminine noun (but there is an adjectival homonym: the plural of the adjective cl ‘whole’).

This last experiment shows the potential benefit of allowing multi-class predicting rules, which would be especially effective because of systematic homonymy. For example, all adjectives ending in -ski have the same form for singular masculine and for plural (and we need to distinguish between these two, according to Table 1). An even worse example of systematic homonymy is given by the fact that Bulgarian uses the singular neuter adjective form also as an adverb.

6 Future Work

While the results above are very encouraging, there is much more to be done. First, we would like to perform more experiments with different kinds of texts and different proportions of unknown words. Second, it would be interesting to try to guess the different elements of the classes separately, i.e. independently guess the POS, number and gender by using three separate classifiers. Third, we would like to extend the syntax of the nominal chunks, e.g. by adding rules for more complex coordinate constructions (currently we only allow coordination between the head elements). Finally, we want to try similar kinds of experiments for other languages with highly inflectional morphology, such as Russian.

Acknowledgements. We thank the anonymous reviewers for the useful suggestions.


Method            P       R       F
no guess          96.55%  50.00%  65.88%
one only          93.75%  80.36%  86.54%
text only         93.94%  82.30%  87.74%
Google:BG only    94.00%  83.19%  88.26%
Google:any only   91.92%  80.53%  85.85%
MSN:BG only       97.00%  85.84%  91.08%
MSN:any only      93.94%  82.30%  87.74%
dict.+one         94.59%  93.75%  94.17%
dict.+text        98.20%  97.32%  97.76%
dict.+Google:BG   98.20%  97.32%  97.76%
dict.+Google:any  98.20%  97.32%  97.76%
dict.+MSN:BG      98.20%  97.32%  97.76%
dict.+MSN:any     98.20%  97.32%  97.76%

Table 9: Nominal chunking accuracy: evaluation over a 1,000 words long medical text. ‘one’ means training assuming each dictionary word has a frequency of 1. ‘text’ uses frequencies from a 25 MB raw text corpus. ‘Google:BG’ and ‘MSN:BG’ use Google and MSN Search limited to Bulgarian pages only. ‘Google:any’ and ‘MSN:any’ use no language filter. We perform add-1 smoothing to make sure that we never give a weight of 0 to a dictionary word. Experiments ‘dict.+XXX’ perform guessing only in case the word has not been found in the dictionary.
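The precision, recall and F-measure values in Table 9 follow directly from the chunk counts reported in the text. As a minimal sanity check, here is a small sketch (the function name is ours) reproducing the ‘dict.+one’ row:

```python
def prf(found: int, correct: int, gold: int) -> tuple[float, float, float]:
    """Precision, recall and F-measure from chunk counts."""
    p = correct / found      # fraction of proposed chunks that are correct
    r = correct / gold       # fraction of gold-standard chunks that were found
    f = 2 * p * r / (p + r)  # harmonic mean of precision and recall
    return p, r, f

# 'dict.+one' on the medical text: 111 chunks proposed, 105 correct,
# 112 nominal chunks in the gold standard.
p, r, f = prf(found=111, correct=105, gold=112)
print(f"P={p:.2%} R={r:.2%} F={f:.2%}")  # P=94.59% R=93.75% F=94.17%
```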


Method     P        R        F
no guess   100.0%   100.0%   100.0%
one only   84.47%   76.99%   80.56%
dict.+one  100.0%   100.0%   100.0%

Table 10: Nominal chunking accuracy: evaluation over a 1,000 words long legal text.



Infrastructure for Bulgarian Question Answering: Implications for the Language Resources and Tools∗

Petya Osenova, Kiril Simov
BulTreeBank Project

http://www.BulTreeBank.org
Linguistic Modelling Laboratory, Bulgarian Academy of Sciences

Acad. G. Bonchev St. 25A, 1113
[email protected], [email protected]

Abstract

This paper describes the creation of an infrastructure for Bulgarian Question Answering (QA) and Information Retrieval (IR). The infrastructure consists of a corpus, a set of 20 topics and a set of 100 questions. The topics and the questions have parallel translations in English as well. Additionally, the topics have at least one relevant document in the corpus, and the questions have at least one correct answer in the corpus. The questions are of various types, reflecting the ability of the potential systems to find answers of different types (person, location, time, measure, etc.) and in different contexts. Also, we imposed additional criteria on the answer support, such as variability of correctness, in order to facilitate answer detection. This range of degrees of difficulty of the QA task requires various kinds of language resources and tools for the adequate analysis of the questions and, respectively, successful answer detection. In this paper we first describe the steps in the creation of the infrastructure, then we discuss the sources of emerging problems which hamper the work of automatic IR systems. This is done in tight connection with the types of language resources and tools which are necessary to solve the task.

1 Introduction

The Cross-Language Evaluation Forum (CLEF) is a European initiative for the creation of an infrastructure for Information Retrieval (IR) and Question Answering (QA) evaluation within a multi-lingual environment. The infrastructure includes corpora in several languages, a set of topics for IR tasks and a set of questions for QA tasks. There are also tracks on image and spoken document retrieval. CLEF is concerned with two tasks: (1) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and cross-language contexts, and (2) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes (see http://clef.isti.cnr.it/). CLEF contains eight evaluation tracks in 2005.

∗The work reported here is partially supported by: the BulTreeBank project, which is funded by the Volkswagen Stiftung, Federal Republic of Germany, under the Programme “Cooperation with Natural and Engineering Scientists in Central and Eastern Europe”, contract I/76 887; and a special grant by Microsoft Research Ltd, Cambridge, United Kingdom.

Bulgarian was included in the CLEF initiative in 2004 as a source language. This means that for the 2004 campaign we had to translate into Bulgarian the topics for IR and the questions for QA for the other languages included in the initiative. To be a source language means that Bulgarian can participate in bilingual and multilingual tasks where the topics and questions are in Bulgarian, but the appropriate documents and answers are searched for in a different language. We also participated in the QA track for 2004 with a Bulgarian-English system, together with the Question Answering Group of the TCC Division (Cognitive and Communication Technologies) at ITC-irst, the Centre for Scientific and Technological Research, in Trento, Italy. We analyzed the questions in Bulgarian and translated the analyses into English. Then the group in Trento evaluated the translated analyses over the English corpus. For details see (Osenova et al. 2005).

For the 2005 campaign Bulgarian was included as a target language, too. Thus, now the systems could search for documents and answers in a Bulgarian corpus. We have prepared Bulgarian data for two tasks: (1) “Mono-, Bi- and Multilingual Document Retrieval on News Collections” (this year topics were constructed for five languages: Bulgarian, English, French, Hungarian, and Portuguese) and (2) “Multiple Language Question Answering” (nine languages were included: Bulgarian, Dutch, English, Finnish, French, German, Italian, Portuguese and Spanish). This year, for the first time, Bulgarian participated not only as a source language, but also with its own target collection. It was involved in monolingual tasks only. The organization of the resources included: the preparation of a corpus, the compilation of 100 questions with at least one correct answer in the corpus, the translation of these questions into English, the translation into Bulgarian of the 800 questions made by the other 8 groups, the selection of another set of 100 questions from the set of 800 in parallel with their validation against the Bulgarian corpus, and the assessment of the systems which participated in the track. Concerning the topics, we compiled 20 topics,


validated other groups’ topics against our corpus and finally assessed the results for the final set of topics, which was returned by the retrieval systems. The creation of a similar infrastructure and its integration with CLEF is discussed in (Santos and Rocha 2005), among others.

In this paper we first present and describe the elements of the Infrastructure for Bulgarian QA and IR (corpus, topics and questions), and then we discuss in detail what support is needed from the surrounding context in order to find correct topics or answers. The last section concludes the paper.

2 Corpus

CLEF corpora have to be comparable, in the sense that they have to cover the same time span and to be based on articles from popular newspapers. This requirement is necessary in order to ensure that the corpora will include articles on similar topics, which facilitates cross-lingual retrieval and question answering. The time span for the other languages is 1994-1995. Unfortunately, for Bulgarian this time span was a problem, because it is hard to find newspaper corpora for it. Thus, it was decided to change the time span for Bulgarian (and for Hungarian) to 2002. This decision caused some problems when the topics and questions were created, as we describe below.

The Bulgarian corpus consists of the electronic

issues of two newspapers, ”Sega” and ”Standard”, from the year 2002. They had to conform to the specially designed DTD structure of the CLEF consortium. The DTD includes the following information: each newspaper issue consists of documents (element DOC), where each document is an article in the issue. Each document has a document identification (element DOCID) and a document number (element DOCNO), which coincide. (The identification contains the name of the newspaper, the date of the issue and a unique number within the issue: SEGA20020102-031. In this way the identification of an article is unique for the whole corpus.) The elements TITLE, AUTHOR, DATE and TEXT are obligatory; they contain the title of the article (if any), the author of the article (if any), the date of the issue, and the body of the article. The body of the article is divided into paragraphs (element P). Additionally, a document can have a RUBRIC element containing the general rubric of the article (Bulgaria, Abroad, Economics, etc.), one or more LEAD elements for one or more subtitles or other leading text, and a PICTURE element for the text connected to a picture. The size of the corpus is over 60 000 articles and about

14 000 000 running tokens. The corpus is freely available from ELRA.

The conversion of the original electronic form into the required format was done semi-automatically with the help of the CLaRK system (Simov et al. 2001), http://www.bultreebank.org/clark/index.html.
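The document structure described above can be illustrated with a minimal example in the CLEF format. The element names and the identifier scheme come from the description; the title, author and paragraph text below are invented placeholders:

```python
import xml.etree.ElementTree as ET

# A minimal document following the described CLEF DTD. The DOCID reuses
# the scheme from the text (newspaper name + date + unique number);
# the article content is a placeholder, not a real corpus document.
doc = """<DOC>
  <DOCID>SEGA20020102-031</DOCID>
  <DOCNO>SEGA20020102-031</DOCNO>
  <TITLE>Placeholder title</TITLE>
  <AUTHOR>Placeholder author</AUTHOR>
  <DATE>02.01.2002</DATE>
  <TEXT>
    <P>First paragraph of the article body.</P>
    <P>Second paragraph of the article body.</P>
  </TEXT>
</DOC>"""

root = ET.fromstring(doc)
print(root.findtext("DOCID"))               # SEGA20020102-031
print(len(root.find("TEXT").findall("P")))  # 2
```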

3 Topics

Each topic is a description of the information to be retrieved from the corpus. The topics conform to the following structure: a title, a description and a narrative part. The first gives the general idea of the topic, the second serves as an instruction to the contestants on what to search for, and the third describes in more detail what kinds of documents are relevant and which are not with respect to their content. For example:

<EN-title> Illegal immigrants in EU</EN-title>

<EN-desc> Find documents about the number, nationality or jobs of immigrants who stay illegally in some EU country. </EN-desc>

<EN-narr> Relevant documents must contain information about illegal immigrants in the EU. Documents about illegal immigrants in other countries, as well as about legal immigrants in the EU, are irrelevant. </EN-narr>

Topics should be chosen in such a way that they would be trackable in all multilingual corpora. However, recall that the choice of topics had to face the mismatch between the time span of the Bulgarian corpus (2002) (also true for the Hungarian corpus), on the one hand, and that of the other language corpora (1994, 1995), on the other. In order to overcome the discrepancy, we suggested topics general and cross-cultural enough to be accessible in the 1994-1995 corpora. On the other hand, we wanted to suggest not only topics of an international nature, but also topics which are typical for Bulgarian culture. However, this was not acceptable in the final selection of topics for all groups. Our suggestions were as follows: International and European: earthquakes, products of Philips, footballers’ contusions, hostages of terrorists, the harm from smoking, the European Union and closing down electric power plants, the environment and the destruction of weapons of war (rockets), cars imported into the USA, etc.; National: Bulgaria in NATO, Bulgarian music abroad, the Pope’s visits to Bulgaria, etc. During the final selection of a joint set of topics the national ones were excluded, but many of our other suggestions were sustained. The


idea of the consortium was to find the best set suitable for every participant. For that reason, before deciding on it, all topics were assessed cross-linguistically. Thus our group assessed the French, English, Hungarian and Portuguese topics against the Bulgarian corpus. As a result, some of the topics were rejected (mostly those that had no hit in any of the corpora) or generalized to allow more relevant documents. For example, the initial ‘Alternative Medicine Costs’ became ‘Alternative Medicine’. Under the first title there was no relevant document in our corpus, but with respect to the changed one we found 29 documents.

In the creation of the topics we were also guided by

(Mandl and Womser-Hacker). In this work the dependency between the linguistic information in the topics and the performance of IR systems was viewed from a statistical point of view. For example, the authors noticed that the presence of more named entities significantly facilitated the systems’ performance.

4 Questions

The elaboration of the questions by each group was kept more independent in comparison to the topics. Each group had to make 100 questions. The general criterion was that these questions should be close to the topics from the previous contests (2000-2004), but there was some room for new ones as well. The questions were divided into two main types: (1) definition questions of the type ‘What/Who is X?’ (with two answer types: PERSON and ORGANIZATION) and (2) factoid questions (with six answer types: PERSON, TIME, LOCATION, ORGANIZATION, MEASURE, OTHER). The latter type was further divided into two subtypes: without temporal restriction and with temporal restriction (DATE, PERIOD and EVENT). Each of these 100 questions should have at least one known answer in the corpus. For the final set of Bulgarian questions (these 100 questions plus 100 selected from the questions of the other groups) we also had to include an answer type NIL. It was needed to ensure a better test of the QA systems. These NIL questions should not address something that does not exist in the corpora, but rather something that exists, but is not the answer to this question. For example, there are a lot of documents about Diego Maradona, but if the question is: How much does Diego Maradona weigh?, then the answer should be: NIL, because there is no such information in the corpus.

The answer to all questions should be as exact as

possible, i.e. without displaying redundant information. At the same time the answer should be semantically complete, not partial. Some of the problems we faced were as follows: the names of popular organizations do not always have an adequate Cyrillic equivalent, which means that they become other-language (mostly English) dependent and easily trackable, or they have several variants (Cyrillic and Latin); in this case it is not always easy to decide which is more popular. The questions with temporal restrictions were more problematic, because we had to concentrate on themes oriented to 1994-1995 or before. Even when we obeyed this rule, it turned out that there were some problems with information that was not the same for the same topic in the different corpora. For example, there was a discrepancy about the period when Jimmy Carter was president of the USA, or about when exactly the library in Alexandria burned. But such discrepancies are beyond our task.

During the validation phase, the answers had to be marked with four categories: right, wrong, inexact and not supported (by the context, i.e. accidental or non-motivated). The most problematic category for validation was ‘inexact’. We can presume that ‘inexact’ has to comprise at least the head of a construction, but sometimes it is difficult to judge what the head is, or whether the information in the answer is sufficient. Some of these problems are discussed below.

5 Topic and Answer Support

It turned out that the task of selecting relevant documents with respect to a given topic is not so trivial. Partly, this is due to the possibility of different interpretations of the narrative part, in the direction of overgeneration or undergeneration. For example, the topic ‘Brain-Drain Impact’ has as a requirement: ‘The countries involved should be named.’ First of all, a country might be named indirectly, as in the document we found: it is not said that ‘brains flee Bulgaria’, but that ‘Bulgarian brains flee’. Moreover, the article does not mention a specific country to which the brains escape. Does that make the document irrelevant? Then the topic ‘NATO Expansion’ faced some obscurity as to whether it includes additional initiatives under this expansion as a final goal, or not. The narrative part says only: ‘Any discussion in favor or against expanding NATO eastwards to include countries formerly part of, allied with or under the political influence of the USSR’. Thus, two things seem to be important: (1) how to specify the narrative part with pre-defined criteria and (2) how explicit we have to be.

Then, considering the results from the retrieval systems, we can summarize that the problems have two main sources: (1) keyword ambiguity (‘alternative’ as part of ‘alternative medicine’, but also in the political genre) and (2) weak inference mechanisms.

As was described in the previous section, each

type of question has a corresponding abstract answer type, but when the answer is in a real context, there exists a scale with respect to answer acceptability, and the concrete answer must be mapped against this scale. A change of the context can change the answer’s grade on the scale. In this section we will give some examples of answers supported by different contexts.

We consider the text as consisting of two types of

information: ontological classes and relations, and world facts. The ontological part generally determines the topic and the domain of the text. We call the corresponding ”minimal” part of the ontology implied by the text the ontology of the text. The world facts represent an instantiation of the ontology in the text. Both types of information are called uniformly the ‘semantic content of the text’. Both components of the semantic content are connected to the syntactic structure of the text. Any (partial) explication of the semantic content of a text will be called a semantic annotation of the text[1]. The semantic content of a question includes some required, but underspecified element(s) which has (have) to be specialized by the answer in such a way that the specialization of the semantic content of the question has to be true with respect to the actual world.

We consider a textual element a to be a supported

answer to a given question q in the text t if and only if the semantic content of the question with the addition of the semantic annotation of the textual element a is true in the world[2].

Although the above definition is quite vague, it gives

some idea of the support that an answer receives from the text in which it is found. The semantic annotation of the answer comprises all the concepts applicable to the textual element of the answer and also all relations in which the element participates as an argument[3]. Of course, if we had the complete semantic annotation of the corpus and the question, it would be relatively easy to find a correct answer to the question in the corpus, if one exists. Unfortunately, such an explication of the semantic annotation of the text is not feasible with the current NLP technology. Thus

[1] Defined in this way, the semantic annotation could also contain some pragmatic information and actual world knowledge.

[2] The world such as it is described by the corpus.

[3] We consider the case when the answer denotes a relation to be a concept.

we are forced to search for an answer using partial semantic annotations. In order to give an idea of the complexity necessary in some cases, we would like to mention that the context which has to be explored can vary from a phrase (one NP) to a clause, a sentence, a paragraph, the whole article or even the whole issue. The required knowledge can be linguistic relations, discourse relations, world knowledge, or the results of inferences over the semantic annotation.

Here are some examples of dependencies within different contexts and a description of the properties necessary to interpret the relations:

Relations within NP

The Bulgarian nominal phrase is very rich in its structure. We will consider the following models:

NP :- NP NP

This model is important for two kinds of questions: definition questions for people and questions for measurement. The first type of question is represented by the abstract question “Koj e Ime-na-chovek?” (Who is Name-of-a-Person?): Koj e Nikolaj Hajtov? (Who is Nikolaj Hajtov?). As discussed in (Tanev 2003), some of the possible patterns that can help us find the answer to the question are ”NP Name” and ”Name is NP”, where Name is the name from the question and the NP constitutes the answer. The first pattern is of the type we consider here. The other one, and some more patterns, are presented below. Although it is a very simple pattern, the quality of the answer extraction depends on the quality of the grammar for the nominal phrase. The first NP can be quite complicated and recursive. Here are some examples:

(1) [NP klasikyt] [NP Nikolaj Hajtov]
    [NP the classic] [NP Nikolaj Hajtov]

(2) [NP golemiya bylgarski pisatel] [NP Nikolaj Hajtov]
    [NP the great Bulgarian writer] [NP Nikolaj Hajtov]

(3) [NP zhiviyat klasik na bylgarskata literatura] [NP Nikolaj Hajtov]
    [NP the living classic of the Bulgarian literature] [NP Nikolaj Hajtov]

(4) [NP predsedatel na syyuza na pisatelite i zhiv klasik na bylgarskata literatura] [NP Nikolaj Hajtov]
    [NP chair of the union of the writers and living classic of the Bulgarian literature] [NP Nikolaj Hajtov]
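A minimal sketch of the ”NP Name” pattern over pre-chunked input might look as follows; the chunk representation and the helper name are our own, and the (recursive) recognition of the first NP is assumed to be done by an external grammar:

```python
# Sketch of the "NP Name" definition pattern: given chunked text as
# (label, text) pairs, propose the NP immediately preceding the queried
# name as a candidate definition. The chunker producing the pairs is
# assumed to handle the complex, recursive first NP correctly.
def definition_candidate(chunks, name):
    for (lab1, txt1), (lab2, txt2) in zip(chunks, chunks[1:]):
        if lab1 == "NP" and lab2 == "NP" and txt2 == name:
            return txt1
    return None

chunks = [("NP", "klasikyt"), ("NP", "Nikolaj Hajtov")]
print(definition_candidate(chunks, "Nikolaj Hajtov"))  # klasikyt
```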

As can be seen from the examples, the first NP can comprise a head noun and modifiers of different kinds: adjectives, prepositional phrases. It can also exemplify coordination. Thus, in order to process such answers, the system needs to recognize the first NP correctly. This step is hard for a base NP chunker (being nonrecursive), but when it is combined with semantic information and a named-entity module, the task is solvable. A characteristic of the first NP is that the head noun denotes a human (see examples 3 and 4 above). If such nouns are mapped to ontological characteristics, the work of the tool is facilitated.

Another usage of this recursive NP model concerns

measurement questions, such as: “Kolko e prihodyt na ”Grijnpijs” za 1999 g.?” (How much is the income of Greenpeace for 1999?). The answers to such questions have the following format: ”number”, ”noun for number”, ”noun for measurement”. For example, ”[NP 300 miliona] [NP dolara]” (300 million dollars). The NPs are relatively easy to recognize, but their composition remains unrecognized in many cases, and the systems return partial answers like ‘300 million’ or only ‘300’. However, without the complete measurement information such an answer is not quite correct and is discarded.
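The composition requirement can be sketched with a pattern that only accepts an answer when the number, the optional scale noun and the unit noun are all present; the transliterated word lists below are illustrative, not from the paper:

```python
import re

# Accept a measurement answer only when it is complete: a number, an
# optional scale noun, and a unit noun. The scale and unit word lists
# are illustrative examples, not an exhaustive lexicon.
SCALE = r"(?:hilyadi|miliona|miliarda)"
UNIT = r"(?:dolara|evro|leva)"
MEASURE = re.compile(rf"\d[\d.,]*(?:\s+{SCALE})?\s+{UNIT}")

m = MEASURE.search("Prihodyt e 300 miliona dolara za 1999 g.")
print(m.group(0))  # 300 miliona dolara

# A bare number with no unit noun is rejected rather than returned
# partially, mirroring the 'inexact answer' validation above.
print(MEASURE.search("Prihodyt e 300."))  # None
```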

NP :- (Parenthetical NP) | (NP Parenthetical)

Such NPs are relevant for definition questions about

the extensions of acronyms: Kakvo e BMW? (What is BMW?). Very often the answers are presented in the form of an NP which is the full name of the organization, with the corresponding acronym given as a parenthetical expression in brackets, or the opposite. In this case two gazetteers, of acronyms and of the corresponding organization names, would be of help. Additionally, we have to rely on opportunistic methods as well, because it is not possible to have all the new occurrences in pre-compiled repositories. The case with the extension as a parenthesis is easier to handle than the opposite case. Recall the problems with defining the boundaries of a complex name.

Problems arise when there are longer names of organizations with embedded PPs or with adjacent PPs which are not part of them. The systems often return some NP, but they suggest either the dependent NP as an answer instead of the head one, or an NP which is part of a PP which does not modify the head NP. An example of the first case is the answer to the question: What is FARC? The system answered ‘Colombia’ instead of answering ‘Revolutionary Armed Forces of Colombia’ or at least ‘Revolutionary Armed Forces’. An example of the second case is the answer to the question: What is CFOR? It was ‘Bosnia’ instead of ‘command forces’ (in Bosnia).
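The easier direction, a full name followed by its acronym in brackets, can be approximated with a single pattern; a real system would also need the opposite order and the gazetteers mentioned above. The example sentence is invented:

```python
import re

# "NP (ACRONYM)": a run of capitalized words followed by an all-capitals
# acronym in brackets. This deliberately ignores the harder cases with
# embedded PPs discussed in the text.
PAREN = re.compile(r"((?:[A-Z][a-z]+\s+)*[A-Z][a-z]+)\s*\(([A-Z]{2,})\)")

text = "The summit was hosted by the World Health Organization (WHO)."
for full_name, acronym in PAREN.findall(text):
    print(acronym, "=", full_name)  # WHO = World Health Organization
```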

Another interesting case is when the first NP has the form AP NP, where AP is a relational adjective connecting the noun with another noun, like: italianski (Italian) → Italy, ruski (Russian) → Russia, etc. In this case the answer to questions like “Ot koya strana e FIAT?” (Where does FIAT come from?) or “Na koya strana e prezident Boris Yelcin?” (Of which country is Boris Yelcin the president?) is encoded within the adjective. This means that we should have interrelated lexicons, in order to derive the necessary information even when it is only indirectly present in the text. Note that this does not hold only within NPs. For example, the answer to the question “Who was Michael Jackson married to?” could be ‘Michael Jackson’s ex-wife Debby’. Of course, here the relation is more complex, because there is a relation not only between ‘marry’ and ‘wife’, but also a temporal mapping between ‘was married’ and ‘ex-wife’.
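The interrelated-lexicon idea for relational adjectives can be sketched as a simple mapping; the entries and the helper function are illustrative only, and lemmatization of the adjective is assumed to happen beforehand:

```python
# Toy interlinked lexicon: Bulgarian relational adjectives (in
# transliteration) mapped to the country names they encode. Only the
# example adjectives from the text plus one more are included.
ADJ_TO_COUNTRY = {
    "italianski": "Italy",
    "ruski": "Russia",
    "bylgarski": "Bulgaria",
}

def country_from_np(tokens):
    """Answer a 'which country' question from an AP NP phrase."""
    for tok in tokens:
        if tok.lower() in ADJ_TO_COUNTRY:
            return ADJ_TO_COUNTRY[tok.lower()]
    return None

print(country_from_np(["ruski", "prezident", "Boris", "Yelcin"]))  # Russia
```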

NP :- NP RelClause

Here the main relations are expressed via the following relative pronoun. It is a kind of local coreference. Let us consider the example: ‘Mr Murdoch, who is the owner of several newspapers’. We can trace who Murdoch is through the relative clause. However, sometimes it might be tricky, because in complex NPs we do not know whether the relative clause modifies the head NP or the dependent one. For example, in the phrase ‘the refugee camp in the city, which is the biggest in the country’, we cannot know whether the camp or the city is the biggest in the country.

Relations within a clause (Sentence)
In order to derive the relevant information, we very often need relations among paraphrases of the same event. This idea was discussed in (Dagan and Glickman 2004), (Kouylekov and Magnini 2005) and (Lin and Pantel 2001), among others. For that task, however, the corpus should be annotated with verb frames and the grammatical roles of their arguments. Additionally, lists of possible adjuncts are also needed, because they are mapped as answer types to questions about time, measure, location and manner. Thus we have to go beyond argument structure annotation. The ideal lexical repository should include relations between semantic units, such as: if something is a location, you can measure

Workshop Language and Speech Infrastructure in the Balkan Countries 2005 - Borovets, Bulgaria

distance to it; if something is an artefact, you can measure its cost, etc. Also, the classical entailment example - if you write something, then you are its author - can be derived from a good explanatory dictionary that has been properly parsed.

Discourse relations
These are necessary when the required information cannot be accessed locally. When some popular politician is discussed in a newspaper, it might be the case that he is addressed only by his name, not his title: 'Yaser Arafat' instead of 'the Palestinian leader Yaser Arafat'. In such cases we need to navigate through a wider context, and then marked coreferential relations become a must: Yaser Arafat is mentioned in one sentence, in the next one he is referred to as 'the Palestinian leader', and finally as 'he'. Here we can rely on anaphora resolution tools and on some gathered encyclopedic knowledge.

World knowledge
We usually rely on our world knowledge when there is more specific information in the questions and more general information in the candidate answers. For example, to the question 'Who is Diego Armando Maradona?' we found answers only about 'Diego Maradona' or 'Maradona'. In this case we can be sure that all these names belong to the same person. However, there are trickier cases, like the two Bushes, father and son. If the marker 'junior' or 'senior' is not there, then we have to rely on other supporting markers like temporal information or events that are connected with the one or the other.

6 Conclusions

In this paper we described our infrastructure designed for the Bulgarian Question Answering track and the Bulgarian Information Retrieval track. At the same time, we commented on the problems that came out when validating the answers and topics in their real context. We tried to show what difficulties retrieval and QA systems face. Also, we connected these difficulties to language variability, which could become manageable with an adequate set of language resources and an inference mechanism at hand.

For the next evaluation campaigns we are planning to extend the criteria for classifying the questions with a description of the context of the corresponding answers. In this way the system could evaluate not only the performance with respect to the answer type, but also the recognition of the supporting contexts.

In our view, such an explicit description of the context will also improve the development of language resources and tools for Bulgarian, because it will direct these developments to concrete problems. In our own question answering system we are planning to extend the current resources (see (Simov et al. 2004)) with resources for shallow semantic annotation, including noun and adjective semantic types and verb argument structure. For some work in this direction see (Simov and Osenova 2004).

References

Ido Dagan and Oren Glickman. Probabilistic Textual Entailment: Generic Applied Modeling of Language Variability. Learning Methods for Text Understanding and Mining Workshop. (2004) Available at: http://www.cs.biu.ac.il/glikmao/Publications

Milen Kouylekov and Bernardo Magnini. Recognizing Textual Entailment with Tree Edit Distance Algorithms. PASCAL Challenges Workshop. (2005) Available at: http://www.kouylekov.net/Publications.html

Dekang Lin and Patrick Pantel. Discovery of Inference Rules for Question Answering. In: Natural Language Engineering 7(4):343-360. (2001)

Thomas Mandl and Christa Womser-Hacker. Linguistic and Statistical Analysis of the CLEF Topics. In Proc. of Advances in Cross-Language Information Retrieval, Third Workshop of the Cross-Language Evaluation Forum, CLEF. (2002)

Petya Osenova, Alexander Simov, Kiril Simov, Hristo Tanev, and Milen Kouylekov. Bulgarian-English Question Answering: Adaptation of Language Resources. In (Peters, Clough, Gonzalo, Jones, Kluck, and Magnini eds.) Fifth Workshop of the Cross-Language Evaluation Forum (CLEF 2004), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany. (2005)

Diana Santos and Paulo Rocha. The key to the first CLEF in Portuguese: Topics, questions and answers in CHAVE. In (Peters, Clough, Gonzalo, Jones, Kluck, and Magnini eds.) Fifth Workshop of the Cross-Language Evaluation Forum (CLEF 2004), Lecture Notes in Computer Science (LNCS), Springer, Heidelberg, Germany. (2005)

Kiril Simov, Zdravko Peev, Milen Kouylekov, Alexander Simov, Marin Dimitrov, and Atanas Kiryakov. CLaRK — an XML-based System for Corpora Development. Proceedings of the Corpus Linguistics 2001 Conference. pp. 558-560. (2001)

Kiril Simov and Petya Osenova. A Treebank-Driven Approach to Semantic Lexicons Creation. In: Proceedings of TLT04, Tuebingen, Germany. (2004)

Kiril Simov, Petya Osenova, Sia Kolkovska, Elisaveta Balabanova, Dimitar Doikoff. A Language Resources Infrastructure for Bulgarian. In: Proceedings of LREC 2004, Lisbon, Portugal. pp. 1685-1688. (2004)

Hristo Tanev. Socrates: A Question Answering Prototype for Bulgarian. In Proceedings of RANLP 2003, Borovets, Bulgaria. pp. 377-386. (2003)


Experimenting in Information Retrieval Systems

Veno Pachovski

Faculty of Civil Engineering, Department of Mathematics, "Sts. Cyril and Methodius", Partizanski Odredi bb, 1000 Skopje

[email protected]

Abstract
The paper describes ongoing experimentation performed within a working model of an Information Retrieval (IR) system [1]. The system contains over 10,000 documents, and a retrieval function is being tested [2]. For that purpose, a series of queries (over 30) was entered, and from the resulting lists of documents (sorted by relevance), in almost all cases the first fifty were selected for evaluation.

The primary aim of the experiment was to see how well the implemented function performs, i.e. how satisfied the experts are with the selected documents and their order. For that purpose, an evaluation interface to the IR system was created. Finally, the results were tested using kappa statistics.

The paper gives a short description of the interface, discusses some of the problems the experts faced, and presents the evaluation methodology. Some conclusions for future testing are also drawn.

The work is part of IR research dealing with Cyrillic documents in the native language, with the purpose of improving the quality of retrieval in national IR systems. Another series of experiments, with an improved version of the function, is forthcoming.

1. Introduction
The efficiency of a retrieval function in an IR system has been tested on a collection of over 10,000 documents. More than thirty queries were generated, and for each of them the first 50 documents (if available) were selected for rating.

The rating of the documents was performed by two testers with average computer knowledge. They placed them in one of the following categories: relevant (5), fairly relevant (3) and not relevant (1).

The testers were required to create "mental images" of the purpose of each test (a kind of information need) according to the key words of a query, and then to rate the documents accordingly.

Their rates were gathered, compared and analyzed with kappa statistics.

2. The interface and the process of rating
The interface was created very carefully, with much consideration for the work of the testers, and was made as user-friendly as possible.

Standard IR features were implemented. Query words within the document were marked in different colours, so the testers could judge the relevancy from the context. When the testers thought the document was too long or could not see the context clearly, they could refer to an abstract of the document, containing only the sentences with the query words.

For this experiment, grids are a good choice as they allow visual following of the progress of testing (a black arrow on the left margin of the grid points to the current data record). Selected fields are displayed to the right of every grid, which also helps following the changes in data. Query words are displayed at the top, in the middle of the form, in order to remind the tester about the information need (mental image) being judged. Rates are displayed near the test buttons.

The process of actual testing began with the tester choosing the test and reading the query. Then, the tester created a mental image, having in mind what information need could be satisfied by the given query. Here, the testers were simply given instructions on how to create mental images and were not given any additional explanations. This resulted in a very divergent approach to judging the relevance of the given documents, but did not affect the interface at all. The testers were required to write down their mental images as a kind of reminder and for further analysis.

The rating itself is very simple. The documents can be placed in one of three categories: relevant (5), fairly relevant (3) and not relevant (1). Having decided the rate, the tester simply clicks on the appropriate button (green, light green or red), and goes to the next document by clicking the next button.


Picture 1: The testing interface

So, the actual process of rating is reduced to a "two-click" process.

It must be noted that when the grading button is clicked, the rate is not yet stored, although it is shown in the appropriate field of the database. The grade can still be changed; it is finally entered in the database when the user moves on to the next document.

If the tester is hesitant, a "short" version of the document can be obtained by pressing the appropriate button. The "short" version of a document consists of a subset of sentences which contain words from the query, ordered by their occurrence within the document. This is also an attempt to create an automatic abstract.
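The construction of the "short" version described above can be sketched as a simple extractive step. This is an illustrative sketch, not the system's code; the sentence splitting is naive and the function name is invented:

```python
import re

def short_version(document, query_words):
    """Return the subset of sentences that contain at least one
    query word, in their order of occurrence within the document."""
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r'(?<=[.!?])\s+', document)
    terms = {w.lower() for w in query_words}
    return [s for s in sentences
            if terms & {w.lower() for w in re.findall(r'\w+', s)}]

doc = ("Parliament met in November. The weather was fine. "
       "Stability was discussed in Parliament.")
# Keeps the first and third sentences, dropping the one
# without query words.
print(short_version(doc, ["Parliament", "stability"]))
```

A real implementation would need a proper sentence splitter and, for an inflected language, stemming of both the query words and the document words before matching.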

There is another important point that should be noted. Whenever the tester activates the testing interface, a log file is created. The log file records the time and date when the test was accessed and the grades given. If the tester hesitated and changed the grade several times, that is also recorded. The data collected prove that the experiment was performed correctly, with genuine results. They also provide a basis for further study.

There was no record of a program crash.

3. Mental images and information need
When users access the search interface, they have only a vague idea of what they are looking for, i.e. there is an information need, but it is not very clear. This means that users look for some documents, but cannot say exactly what they are supposed to be about.

Users very rarely have the chance, the concentration or enough time to look through the whole list of



documents the system suggests. If the list is too long, users usually enter another query with some words added. Some words are usually removed if the system returns an empty list. Recall plays an important part here: if it is too high, users in most cases automatically change the query words. Each observed document also helps users to further define their information need.

The testing requires the testers to formulate an artificial information need and to make sense of words that sometimes do not connect at all. Testers look at the query words and imagine the kind of information that could be gathered or retrieved by them, i.e. what the words imply or could mean.

When the idea is clear, somewhat visualised or linked with a situation from the real world, that picture is called a mental image. In a way, it is an extension or expansion of the original query, because testers, besides creating the picture, have to write it down as well.

Mental images are very important for this kind of experimenting. They depend not only on education, but also on the language skills, the IQ of the testers and their knowledge of the world.

That is why the testers were required to write down their mental images, and it turned out that they had various, sometimes completely different, mental images. This fact can be used in analyzing the quality of the grades the testers gave when evaluating how successful the retrieval was in general.

In this experiment, both testers were female, with comprehensive reading and writing skills. The first tester is a 52-year-old English teacher with long experience in translating from the native language to English and vice versa; she has been using various text editors for the last 15 years and is also an experienced Internet user.

The second tester is in her mid-thirties, a professor of philosophy, with extensive experience in writing magazine articles, short stories and poetry. She is currently finalizing her MA in philosophy and has been using computers for about 10 years.

The testers do not know each other and their only connection is the author of this paper. Their mental images were created independently.

Considering that one of the testers is a skilful Internet user who regularly searches Internet databases and dictionaries, and the other studies language on a semiotic level, it was easy for them to understand the concept of information need. Explaining how and why mental images need to be created took longer, since they were afraid that the experiment might fail. They were assured that any outcome was welcome and advised to grade the documents freely.

Some of the most interesting examples of agreement or disagreement in mental images are shown in Table 1.

The testers performed the ratings consecutively: the first one worked in May, the other in June/July 2005. Afterwards, they were interested to know how their data would be analysed. This fact, as well as their experience, will help in further experimenting.

An important conclusion at this stage is that in further experiments mental images should be created and agreed upon together and later followed individually. This may somewhat eliminate the factor of chance when rating the documents.

Query words | The first tester | The second tester | Agreement
Corruption (test no. 4) | Corruption in public institutions; fight against corruption | Corruption in the economy, politics, judiciary, education | Quite good
prime minister (test no. 9) | Branko Crvenkovski; Ljubco Georgievski | Generally: privileges, obligations, actions | Totally opposite - one of the testers concentrates on current politicians
dnevnik sdsm police (test no. 14) | News and comments in the daily Dnevnik related to the Social Democratic Alliance of Macedonia (SDSM) and the police | Articles in the daily Dnevnik or news programs about SDSM and the police | Excellent
Parliament November discuss stability (test no. 27) | Discussions in the Parliament regarding the stability of Macedonia ('November' was not crucial for the rating) | The elections and the position of the Parliament | Quite opposite
reforms company data issue (test no. 28) | Issues related to data about the companies being privatized; reforms in company management | Data, issues and opinions related to the reforms | Superficially close

Table 1: Some of the mental images created by the testers

4. Some comments about kappa statistics and the treatment of the results
There is wide disagreement about the usefulness of kappa statistics to assess tester agreement. At the least, it can be said that (1) kappa statistics should not be viewed as the unequivocal standard or default way to quantify agreement; (2) one should be concerned about using a statistic that is the source of so much controversy; and (3) one should consider alternatives and make an informed choice.

One can distinguish between two possible uses of kappa: as a way to test tester independence (i.e. as a test statistic), and as a way to quantify the level of agreement (i.e., as an effect-size measure). The first use involves testing the null hypothesis that there is no more agreement than might occur by chance given random guessing; that is, one makes a qualitative, "yes or no" decision about whether testers are independent or not. Kappa is appropriate for this purpose (although to know that testers are not independent is not very informative; testers are dependent by definition, inasmuch as they are rating the same cases).

It is the second use of kappa, quantifying actual levels of agreement, that is the source of concern. Kappa's calculation uses a term called the proportion of chance (or expected) agreement. This is interpreted as the proportion of times testers would agree by chance alone.

As a test statistic, kappa can verify that agreement exceeds chance levels. But as a measure of the level of agreement, kappa is not "chance-corrected"; indeed, in the absence of some explicit model of tester decision-making, it is by no means clear how chance affects the decisions of actual testers and how one might correct for it.
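The kappa computation itself, for two testers and the three categories used here, can be sketched directly from a confusion matrix. This is a minimal illustration of the standard Cohen's kappa formula, not the Excel calculation from the experiment; the matrices are invented:

```python
# Cohen's kappa for two raters, computed from a confusion matrix.
# matrix[i][j] = number of documents rated category i by tester 1
# and category j by tester 2 (categories 5, 3, 1 in this paper).

def cohen_kappa(matrix):
    n = sum(sum(row) for row in matrix)
    # Observed agreement: proportion of documents on the diagonal.
    observed = sum(matrix[i][i] for i in range(len(matrix))) / n
    # Expected (chance) agreement from the marginal totals.
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(col) for col in zip(*matrix)]
    expected = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (observed - expected) / (1 - expected)

# Perfect agreement gives kappa = 1.0.
assert cohen_kappa([[10, 0, 0], [0, 10, 0], [0, 0, 10]]) == 1.0
```

Note that the expected-agreement term is where the controversy lies: it models the testers as guessing independently according to their marginal rating frequencies, which real testers do not do.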

Considering the large amount of data, an interface for some raw analysis of the data was created. There, the rates from every test are aligned together and the results are presented in two windows. One contains the result matrices (3x3) from which the kappa statistics are calculated, and the other contains the rates prepared for entering into another program for further statistical analyses. After the form is exited, two TXT files contain the same information presented in those windows.

The data from the matrices were entered in an Excel file, and the kappa statistics were calculated for each test. The results are presented in the following diagrams:

Picture 3: A screenshot from Excel - calculation of kappa statistics for test no. 29



Chart 1: Total number of documents per list and number of tester agreements


Chart 2: Calculated Kappa statistics per list


Chart 3: Kappa statistics compared to the percentage of agreement between testers

The results are not easy to interpret, but with the help of the Excel trend line function (a moving average of degree 2, applied in Chart 4), a certain parallelism can be noticed between the general agreement of the two testers (expressed as a percentage of agreements) and the resulting agreement calculated


by the kappa statistics. So, in this case, a reasonable conclusion could be that the kappa statistics follows the general agreement between testers.

Finally, the agreement of the testers, grouped by kappa categories, is represented in Chart 5.


Chart 4: Chart 3 with trend lines added, for better comparison

(Chart 5 axes: y - number of tests, 0 to 16; x - kappa categories: Poor, Slight, Fair, Moderate, Substantial, Almost perfect)

Chart 5: Number of tests per kappa analysis (the agreement according to kappa statistics)

What is left for further analysis is whether there is a connection between the similarities or differences expressed in the mental images and the closeness or divergence of the grades, expressed in terms of kappa statistics, and how mental images affect the statistics.

5. Conclusions and some considerations for future experimenting
The interface performed very well, and the testers found it easy to use with minimal training. Much more time was spent explaining how to create mental images and how to grade than how to use the interface.

The mental images created something of a problem, considering that one of the testers is a skilful Internet user who regularly searches Internet databases and dictionaries, while the other studies language on a semiotic level. However, after a couple of meetings spent in explanations, that problem was successfully solved.

For the next series of experiments, the starting point can be that mental images are created and agreed upon together in advance, and then followed individually. This could eliminate one of the factors of confusion in rating the documents.

The calculation of kappa statistics was performed in Excel, but can easily be implemented in the second interface created for primary analysis of data. Nevertheless, the TXT files with the results of the analysis are useful as a way to transfer the grades to another application for further statistical analysis.


The presentation of the results in Charts 1-5 gives a clear picture of the size and the nature of the experiment.

As the trend lines in Chart 4 show (a moving average of degree 2), a certain parallelism can be noticed between the testers' percentage of agreement and the agreement calculated by the kappa statistics; a reasonable conclusion is that the kappa statistics follow the general agreement between the testers.

References
No particular references for this kind of experimenting could be found. The starting point was a consultation with prof. Galia Angelova from BAN and Mr. Preslav Nakov.

The details about kappa statistics and its calculation for the case of two testers and three categories were provided by test files and manuals of statistical applications available on-line (StatsDirect and CrossTab). For additional reading about kappa statistics, Cohen (1960) or Kraemer (1982) are recommended, but could not be obtained.

[1] A model for an information retrieval system for "small scale" organizations, MathIND, MII, Mathematics and Informatics for Industry, International Conference, Thessaloniki, Greece, 14-16 April, 2003 (with prof. d-r Margita Kon-Popovska)

[2] A proposition for a method of calculating relevancy in an Information Retrieval System, VI Balkan Conference on Operational Research, A Challenge for Scientific and Business Collaboration, Thessaloniki, Greece, 22-25 May, 2002


BULTRA (English-Bulgarian Machine Translation System): Basic Modules and Development Prospects

Elena Paskaleva, Tanya Netcheva

Linguistic Modeling Department, Central Laboratory for Parallel Processing, Bulgarian Academy of Sciences, 25A Acad. G. Bonchev Str., 1113 Sofia [email protected]

Interoperability Department, National Defense and Staff College, 82 Evlogi Georgiev Blvd., 1000 Sofia

[email protected]

ABSTRACT

This article presents, for demonstration and discussion, the first operational Bulgarian machine translation system, currently functioning for corporate needs in the field of military texts. The system modules are discussed, mainly from the point of view of the linguistic knowledge used in them. The basic modules and resources of the system, and the possibilities for its further use, are considered as well.

1. The story

Studies on Machine Translation (MT) in a country typically begin in the research field, responding mainly to practical needs, as was the case with the first English-Russian MT. The complexity of the task predetermined the development of particular relations between research and industry for every single language. In most cases the research work was years ahead of industry, and very often it remained the only field in which MT systems and tools were developed in the country.

The situation in Bulgaria was no different. Here the first trial experiment was conducted in 1965, and it was more an illustration of the computer's abilities for research purposes (Paskaleva 2000).

In the following years the MT research transformed naturally into CL activities oriented towards the Bulgarian language, for which there were no developed basic NLP modules. As a result, the MT research was moved to the periphery of research activity, and was predominantly reduced to various multilingual tools and resources.

The presented machine translation system came about as a result of the increasing volume of documentation in English that required translation into Bulgarian for the needs of the armed forces some twenty years ago. At that time there were no software translation tools available in Bulgaria that could help translators in their work, which naturally led to the idea of creating machine translation (MT) tools.

In 1986 Assoc. Prof. Ivan Ivanov, Ph.D., and Stojan Dimitrov, Ph.D., started work on the development of the basic language technologies involved in MT systems. During a two-year preliminary preparation phase, a large-scale survey and observations of MT results in Japan, the USA and the Soviet Union were made. Structural and statistical studies of both the Bulgarian and English languages started.

The initial research base, the experimental corpus on which the basic translation rules were formulated, had a volume of 20 000 representative word-forms from the two languages and comprised illustrative text excerpts of typical problems for English-Bulgarian translation. The sources of knowledge for its compilation were not corpus-driven but introspective: English explanatory dictionaries, Bulgarian explanatory dictionaries, bilingual dictionaries, frequency and combinatory dictionaries


(Benson 1986) and many other lexicographical tools. The sources used for generating the grammatical rules for analysis, transfer, and synthesis were various university grammars of English and Bulgarian.

In 1999 the two researchers established their own company, and in 2000 they finalized BULTRA (BULgarian TRAnslator), the first English-Bulgarian industrial MT tool. The aim of BULTRA was to provide translators with a good environment for editing and reviewing MT output. The team involved in the creation of the system consisted of computer specialists, with the occasional help of linguistic consultants at a practical level.

Hence, linguistic expert knowledge was not used in the initial stages of this MT project; it was limited to tuning and refining the already compiled tools. No profound theoretical linguistic studies or expert linguistic knowledge were involved. Nevertheless, we can try to account for the system in terms of the adopted linguistic levels and operations.

2. Interface of the system: pre- and post-editing facilities

Created by software experts, the system makes maximum use of a user-friendly interface, which makes it possible to carry out:

• The preliminary guidance of the translation process, and reduction of word-sense ambiguity, by choosing the subject domain and the appropriate thematic glossary (see Fig. 1);

Figure 1: Choice of the desired thematic glossary.


• The dynamic editing of the translation results, with possible choice and correction of the translation equivalent, realized through a window presenting the whole spectrum of translation equivalents of the chosen word (see Fig. 2);

Figure 2: Correction of the choice of a translation equivalent.

• The adding of translation equivalents to the database, through creating a new translation equivalent or correcting the available translations (see Fig. 3);

• Enriching the synthesis phase, through choosing the inflectional class of a new or unknown word or creating a new class (see Fig. 4).


Figure 3: Adding a new translation equivalent.

Figure 4: Choice of the inflectional class.


3. Lexical resources

Currently the system operates with a database of 84,000 lemmas and their 280,000 wordforms in the common lexicon, excluding the special glossaries. The considerable number of topical glossaries makes it possible to resolve that part of semantic ambiguity in which the choice of the appropriate word sense is determined by the subject domain. Moreover, the system is flexible and lets users create their own databases of words and expressions (see Fig. 3).

Because of the rather shallow level of analysis and the lack of real parsing procedures, the choice of the correct meaning, and consequently of the adequate translation equivalent, is made on a frequency basis over the English meanings, oriented to their translation into Bulgarian. In addition, users can create a personal translation database, additional to the common one, containing their preferred translations.
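A minimal sketch of this frequency-based choice, with a user's personal database taking priority over the common one. The data, function names, and scores below are illustrative assumptions, not BULTRA's actual internals:

```python
# Sketch (not the actual BULTRA implementation): choosing a translation
# equivalent by frequency, with a user's personal database taking priority.
# The lexicon contents and frequency scores are invented for illustration.

COMMON_DB = {
    # English lemma -> Bulgarian equivalents with observed relative frequency
    "bank": [("банка", 0.7), ("бряг", 0.3)],
}

def choose_equivalent(lemma, user_db=None):
    """Return the preferred Bulgarian translation for an English lemma."""
    if user_db and lemma in user_db:      # personal preferences win
        return user_db[lemma]
    candidates = COMMON_DB.get(lemma, [])
    if not candidates:
        return None                       # unknown word: left untranslated
    # pick the equivalent with the highest observed frequency
    return max(candidates, key=lambda pair: pair[1])[0]

print(choose_equivalent("bank"))                    # банка (most frequent)
print(choose_equivalent("bank", {"bank": "бряг"}))  # user override wins
```

A thematic glossary would simply be consulted before `COMMON_DB`, in the same priority chain as the personal database.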

Other attempts at avoiding ambiguity transfer the specific difficulties from deeper language levels to the shallow level, where the only correlation between elements is collocation. They draw on linguistic means from different levels (morphological, syntactic, and semantic), developed not as a specific structural representation but as a huge flat base of about 130,000 English-Bulgarian correlations. These represent, in a way, the Bulgarian translation of each meaning of an English word.

4. Morphosyntax - parameters of linguistic knowledge in the English lexical record

In terms of traditional NLP description, we can speak about morphosyntax only in the lemmatization part, i.e., in the identification of the wordform-lemma relation. The second type of information related to the dictionary unit, the properly grammatical one, cannot be classified in the standard terms (see below). Because of their extremely wide linguistic nature, the components of that knowledge are unfolded in a linear, non-structured, flat presentation: a long list of notations of at most two symbols, one for every tag combination of all possible grammatical meanings. The main reason for diluting the linguistic layers within the complex characteristic of the English lexical elements is that the English analysis is oriented directly and predominantly to translation.

The above-mentioned shallow level of the MT transfer procedure between the two languages determines the type of analysis of every linguistic unit, which is highly dependent on the unit's concrete translation into Bulgarian. In this way, all the grammatical meanings and patterns assigned to an English lexical unit are directly oriented to the translation.

The lexical base contains two types of lexical units. The basic building material is English wordforms, which comprise a single-word lexicon.

Since the demonstrated MT system goes beyond word-to-word translation not only in its ambitions but also in its results, an additional transfer instrument to Bulgarian is the set of English phrase structures and their Bulgarian translations, comprising a multiword lexicon of about 8,000 entries.

1. Single-word units and their tag system.

The specific tag system in the English section of this part of the lexicon is a linear display of several types of linguistic knowledge, namely:

• Morphosyntactic class. The classification follows the traditional one, but is additionally developed through POS subclasses. These subclasses characterize the behavior of the lexical unit from different aspects, such as purely classificational designations for POS of heterogeneous morphosyntactic nature (types of pronouns and numerals, common/proper for nouns, etc.).

• Localization of the wordform in the lexical paradigm, i.e., the wordform's own grammatical meaning.

• Semantic information: animateness and countability for nouns, used for detecting the right filler of a verb's valency.

• Syntactic information: verbal frame data, including the number of verbal valences and the conditions for their saturation. These conditions are formulated through the linear position of the verbal arguments in the phrase and through some selective morphosyntactic and semantic properties.

• Translation information: the Bulgarian translation equivalents assigned to the lemma. In the case of translation ambiguity, the choice of the right translation is made by the second component of the system, the rich register of multiword units, where the proper transfer to translation actually takes place. This is realized by means of phrase-equivalent tables (see below).
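Purely as an illustration, the bundle of information listed above for a single-word lexical record could be sketched as a data structure. The field names and the sample values (including the Bulgarian translation) are our assumptions, not BULTRA's internal format:

```python
# Illustrative sketch of a single-word lexical record; not BULTRA's format.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class VerbFrame:
    arity: int             # number of verbal valences
    positions: List[str]   # linear position of each argument in the phrase
    constraints: List[str] # selective morphosyntactic/semantic properties

@dataclass
class LexicalRecord:
    wordform: str
    pos: str                         # morphosyntactic class (+ POS subclass)
    paradigm_slot: str               # wordform's own grammatical meaning
    semantics: Tuple[str, ...] = ()  # e.g. animateness, countability
    frame: Optional[VerbFrame] = None
    translations: List[str] = field(default_factory=list)

# A hypothetical record for the wordform "sent" (translation is illustrative)
sent = LexicalRecord(
    wordform="sent", pos="V", paradigm_slot="past",
    frame=VerbFrame(arity=2,
                    positions=["after-verb", "after-first-object"],
                    constraints=["animate recipient", "inanimate theme"]),
    translations=["изпратя"],
)
print(sent.pos, sent.paradigm_slot)
```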

2. Multiword lexicon – phrase-to-phrase translation.

The tables of English-Bulgarian phrasal equivalents comprise, in their English part, about 8,000 multiword strings whose components are linked by different degrees of cohesion. The English part includes fixed phraseological unities (to and fro), idioms proper (wet blanket), as well as regular phrasal combinations of high frequency (machine translation, first aid).

The crucial part of the transfer phase is to work out the discrepancies between the grammatical systems of the two languages. To the general transformations, typical of every translated pair of languages, specific English-Bulgarian ones are added, due to the essential differences in the principles and resources of:

• The morphological system, e.g., the English verb paradigm of 4 members vs. the Bulgarian one of 54 members for synthetic formation alone.

• The analytical expression of grammatical categories: the systems of analytical verb tenses.

• The representation of quantificational relations: determination and deixis.

• Simple sentence formation: the English fixed word order vs. the partially fixed Bulgarian one (fixed only for the elements of the phrase constituents, but free for the constituent order in the sentence).

• Complex sentence formation and the shaping of intersentence connections.

The interlanguage transformation in the translation process is reduced to rearranging elements, adding new elements, and eliminating irrelevant ones.

The phrasal English-Bulgarian equivalents are produced on the basis of elaborate generated models of the English phrase (SVO type), where the central position is assigned to the verb nucleus, then consecutively to the subject of the verbal action and to its complements.

See the translation:

They     sent     him      a     telegram.
Pr-nom   V-past   Pr-obj   Det   Noun

Pr-nom   Pron-dat   V-past   Noun
[They to-him sent telegram]

The disappearance of the English determiner and the position change of the verbal valences are described in the lexical records of the English lexemes a and send, respectively. The verb-frame description of send contains not only the verb's actants (two direct objects) but also their correlation with the two actants of the respective Bulgarian verb (one direct and one indirect object).

The synthesis rule indicates that the pronoun actant, if expressed in its short form, can precede the verb.
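A hedged sketch of two of the transfer operations on this example: eliminating the English determiner and moving the short-form dative pronoun before the verb. The tag names, rule encoding, and Bulgarian forms below are our illustrative assumptions, not the system's actual rule format:

```python
# Sketch of two transfer operations for "They sent him a telegram":
# determiner elimination and clitic (short-form pronoun) movement.
# Tags, rules, and Bulgarian word choices are invented for illustration.

tagged = [("They", "Pron-nom"), ("sent", "V-past"),
          ("him", "Pron-dat"), ("a", "Det"), ("telegram", "Noun")]

LEX = {"They": "те", "sent": "изпратиха", "him": "му", "telegram": "телеграма"}

def transfer(tokens):
    # eliminate irrelevant elements: the indefinite article has no equivalent
    tokens = [(w, t) for w, t in tokens if t != "Det"]
    # rearrange: a short-form pronoun actant precedes the verb
    out = []
    for w, t in tokens:
        if t == "Pron-dat" and out and out[-1][1] == "V-past":
            out.insert(-1, (w, t))   # clitic climbs before the verb
        else:
            out.append((w, t))
    return " ".join(LEX[w] for w, _ in out)

print(transfer(tagged))   # -> те му изпратиха телеграма
```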


In the analysis process, the verb nucleus is identified first. The next step combines the verb kernel with the identified nominal groups, which are checked against the saturation conditions of the verbal valences.

5. Generation

The synthesis section works on the already chosen translation equivalents and shapes them according to the rules of independent generation of Bulgarian wordforms. This section contains tables of Bulgarian word formation, the so-called inflectional classes. The system interface can assign these classes to new Bulgarian words (see Fig. 4).

The function of syntactic generation, the ordering of the translation equivalents, is accomplished in the already mentioned unfolded records of English words, where the right order of components is assigned to every translation (see the translation of English sent discussed above).

6. System applications – BULTRA versions

Besides BULTRA PERSONAL, with its constantly updated common database, there exists a version called BULTRA PROFESSIONAL. It comprises the common database plus 69 thematic glossaries with specialized terminology in domains such as economics, informatics, law, medicine, architecture, several military databases, astronomy, physics, and navigation, as well as very specific thematic glossaries in shipbuilding, internal combustion engines, metallurgy, mineralogy, plastics, farm technology, the food industry, and many others.

All the information for these professional glossaries was gathered from consultants in the respective fields and from dictionaries available on the Bulgarian market.

In 2002 BULTRA LAN and BULTRA WEB SERVER were launched on the market.

The LAN version is intended for corporate networks and contains all lexical databases; it was installed, for example, in the National Parliament administration. The WEB version, which requires a BULTRA subscription, can be downloaded from the product web site.

So far, BULTRA is the only MT software on the Bulgarian market that serves the needs of corporate clients. Other MT systems that include Bulgarian exist within some experimental systems for multilanguage translation (see the experimental system for word-to-word translation at http://www.tranexp.com/win/WordTran.htm, where Bulgarian is in the company of 32 other languages). Presently, the WEBTRANS system (cf. http://webtrance.skycode.com) has been in operation, until recently only for on-line web-based translation; it is not used at the corporate level.

Currently the system is in use at the Military Medical Academy, the National Defense and Staff College, the Ministry of Defense, the shipbuilding plant in Bourgas, the cement mills in Zlatna Panega, and at a few universities. The program performs best in domains for which specialized lexical databases of words and phrases have been created.

7. Conclusion

The MT system discussed above, the first and so far the only one in Bulgaria operating mainly for corporate clients, could hardly be extended and improved through a change of its general linguistic design.

On the other hand, its considerable resources, especially the unprecedented base of English-Bulgarian interlanguage correlations, as well as its concept of a correcting and editing interface, could become good grounds for constructing a modern translation memory system, which is a question of present interest in the


process of the acquis communautaire, as well as in the field of military interoperability, which is the birthplace of this system.



The Globe: A 3D Representation of Linguistic Knowledge and Knowledge about the World Together

Kamenka Staykova and Sergey Varbanov Institute of Information Technologies Institute of Mathematics and Informatics

Bulgarian Academy of Sciences [email protected] [email protected]

Abstract

This paper describes a three-dimensional ontological model called The Globe, which is built on the basis of the GUM ontology. GUM (the Generalized Upper Model) is a linguistically motivated ontology which implements both the theoretical postulates of Systemic Functional Grammar and the requirements of an applied ontology working within KPML, an environment for multilingual natural language generation. The paper presents an attempt at drawing the Bulgarian variant of the Globe through an analysis of the most frequent Bulgarian verbs.

1. Introduction: Ontologies

In many fields of modern science, ontologies are used to represent different kinds of knowledge. The definition given by (Gruber 1992), "Ontology is an explicit specification of a shared conceptualization", is widely accepted, but there are still different views of and approaches to ontologies and their uses. "The relation between the words in language and the entities to which they refer is often complex… Nevertheless, defining words and defining the entities to which words refer is often done in the same way by referring to a general conceptual type that classifies them." (Vossen 2003) The Globe model presented here is a new attempt in this direction.

2. Natural Language Generation: Both Knowledge about the World and an Organization of Linguistic Knowledge Are Needed

Each Natural Language Generation (NLG) task, or the generation of any utterance, begins with a definition of what is "given":

• whether or when to start generation: an impulse to begin;

• what is to be expressed: the content of the utterance, usually composed of knowledge concerning the World;

• how to say it: mainly linguistic knowledge of the particular language.

Most NLG systems (see Bateman and Zock 2001) have at their disposal some means (procedures or knowledge structures) to determine the answers to these questions. The very nature of any NLG task requires both a formal representation of World (or Domain) knowledge and an organization of a wide range of linguistic knowledge.

“The notion of ‘grammaticality’, central to formal approaches of language is by no means sufficient and many other factors, such as social, discourse, and pragmatic constraints, have to be taken into account in order to achieve successful communication. Many of these latter factors have traditionally been at the heart of functional theories of


language [e.g. Halliday 1994], and this is one of the reasons why such theories have been used widely by the NLG community.” (Bateman and Zock 2003, p. 287).

Systemic Functional Grammar (SFG) (Halliday 1994) is among the theoretical frameworks well suited to the needs of NLG. The CLAUSE is recognized as playing a special role on the boundary between bigger chunks of speech and smaller combinations of words. A central role within SFG CLAUSE analysis is given to the notion of PROCESS. The different PROCESS types are linguistically motivated, and three main groups are outlined: PROCESSES of the Physical World (Material), PROCESSES of Abstract Relations (Relational), and PROCESSES of the World of Consciousness (Mental). In harmony with the innate functional style of distinguishing features rather than posing constraints, three further groups of PROCESSES are defined on the borderlines of the main types: Existential, Verbal, and Behavioral PROCESSES. Each new PROCESS type shares characteristics of two of the main PROCESS types. In this way all of them can be ordered, and "they form a circle, not a line" (Halliday 1994, p. 107). It is even mentioned (on the same page) that a sphere would be a more accurate image for the types of PROCESSES than a circle, but a sphere is also a metaphor much more difficult to handle, so Halliday prefers the image of the circle. The model can be seen as an attempt to combine, in a formal representation, the World notions and their appearance within language as linguistic elements.

The theoretical SFG has been adopted within a number of computationally implemented generators, among which is KPML (Bateman 2001).

3. KPML, Generalized Upper Model and The Globe

KPML is a multilingual environment for Natural Language Generation. "KPML offers a robust, mature platform for large-scale grammar engineering that is particularly oriented to multilingual grammar development and generation. Grammars have been developed using KPML for a variety of languages including English, German, Dutch, Chinese, Spanish, Russian, Bulgarian, and Czech. The KPML system is an ongoing development drawing on over a decade of experience in large-scale grammar development work for natural language generation." (Bateman 1997)

The Generalized Upper Model (GUM) is the operational ontology of KPML and contains the classification of PROCESSES given by (Halliday 1994, pp. 106-161). "The Generalized Upper Model is a computationally implemented, general, task and domain independent, `linguistically motivated ontology' intended for organizing information for expression in natural language." (Bateman et al. 1995, p. 13)

GUM consists of two hierarchies: the first contains all the concepts (top entity: UM-THING) and the second all the relations (top entity: UM-RELATION). "The top node of the concept hierarchy, UM-THING, corresponds to the most general abstract entity posited in the semantics of the Upper Model. There are three major subtypes of UM-THING:

- As a configuration of elements all participating in some activity or state of affairs. This is represented by the concept CONFIGURATION.


- As a single, "stand-alone" object or conceptual item. This is represented by the concept ELEMENT.

- As a complex situation where various activities or configurations are connected by some relation to form a sequence. This is represented by the concept SEQUENCE.“ (Bateman et al. 1995, p. 13)
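Purely for illustration, the quoted three-way split could be sketched as a tiny concept hierarchy. Only the concept names come from the text; the flat-dictionary encoding and traversal function are our own:

```python
# Minimal sketch of the top of the GUM concept hierarchy as described above.
# Only the names are from the text; the encoding is an illustrative choice.
GUM = {
    "UM-THING": ["CONFIGURATION", "ELEMENT", "SEQUENCE"],
    "CONFIGURATION": ["Doing&Happening", "Being&Having", "Saying&Sensing"],
}

def subtypes(concept, hierarchy=GUM):
    """Transitively collect all subtypes of a concept (breadth-first-ish)."""
    direct = hierarchy.get(concept, [])
    result = list(direct)
    for child in direct:
        result += subtypes(child, hierarchy)
    return result

print(subtypes("UM-THING"))
# ['CONFIGURATION', 'ELEMENT', 'SEQUENCE',
#  'Doing&Happening', 'Being&Having', 'Saying&Sensing']
```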

Picture 1 UM-THING hierarchy of GUM

The branch of CONFIGURATION corresponds to the level of the CLAUSE in speech, so from here on we concentrate on this part of the ontology. The following three pictures present the three main groups of CONFIGURATIONS as defined within the GUM ontology.

Picture 2 Doing&Happening

Picture 3 Being&Having

Picture 4 Saying&Sensing

We build a three-dimensional model out of the three hierarchies shown above and call it the Globe. The root nodes (Doing&Happening, Being&Having, and Saying&Sensing) present quite general notions and form the kernel of the Globe. This is suggested by Halliday and shown in a circle diagram (Halliday 1994, p. 108) illustrating the order of PROCESS types, discussed in the previous section. Let us imagine that these three notions are points very close to the Globe's centre, directing their inheritor nodes in different directions in space. If we used colours to depict these three points, we would choose a quite pale colour, almost white, in keeping with their general nature. Following Halliday's idea, we order the next level of notions, given in Pictures 2, 3, and 4, on the central light-grey zone. They are successors of the three most general notions and follow logically in this way: D&H_Directed-Action, D&H_Nondirected-Action, B&H_Existence, B&H_Relating, S&S_Internal-Processing, S&S_External-Processing.

These notions are areas on the surface of the next layer of the Globe. The areas consist of points, each point presenting a particular CONFIGURATION. The colour of these points should be somewhat more saturated than that of the first layer. The six areas are distinguishable


from each other and inherit their main features from the three bulbs in the heart of the Globe. The borders between areas can contain CONFIGURATIONS of mixed nature. The notions represented on this layer are still high-level general nodes of the GUM ontology, and the corresponding CONFIGURATIONS express quite general meanings. In the GUM documentation, each area (ontological class) is associated with a general-meaning CONFIGURATION, which serves as a label or key CONFIGURATION. Picture 5 below shows part of the surface of this layer with the typical CONFIGURATIONS expressing External-Processing and Internal-Processing in Bulgarian: казвам /say/ and усещам /sense/.

Picture 5 Second layer of Globe surface

Let us now consider a third layer of the Globe, which contains areas corresponding to the notions in the right, grey parts of Pictures 2, 3, and 4 above. The CONFIGURATIONS presented here as points on the surface are more particular in meaning, and their colour is more saturated. All the CONFIGURATIONS (points) are grouped into different areas, which are inheritors of those in the previous layer. The three following pictures present the Globe surface with the Bulgarian key CONFIGURATIONS for each area.
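The geometric idea of the layers, as we read it, can be sketched: a notion's hierarchy depth controls both its distance from the centre and its colour saturation. The depth scale and the formulas below are our assumptions, not part of the paper's formal model:

```python
# Sketch of the Globe's geometry under our own assumptions: the deeper
# (more specific) a GUM notion, the farther from the centre it sits and
# the more saturated its colour. Depth scale and formulas are invented.
import math

MAX_DEPTH = 3   # kernel, second layer, third layer (assumed)

def place(depth, theta, phi):
    """Map a notion at a given hierarchy depth to a 3D point + saturation."""
    r = depth / MAX_DEPTH                 # radius grows with specificity
    x = r * math.sin(phi) * math.cos(theta)
    y = r * math.sin(phi) * math.sin(theta)
    z = r * math.cos(phi)
    saturation = depth / MAX_DEPTH        # pale near the centre, vivid outside
    return (x, y, z), saturation

# A second-layer notion sitting on the "equator" of its sphere
point, sat = place(1, 0.0, math.pi / 2)
print(point, sat)
```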

Picture 6 Doing&Happening

Picture 7 Being&Having

Picture 8 Saying&Sensing


The key CONFIGURATIONS in Bulgarian shown in Pictures 6, 7, and 8 come from a theoretical SFG analysis of Bulgarian. The authors of this paper were interested in how fluent Bulgarian speech relates to the Globe model presented above. We set up an experiment to answer the question: Is the Globe model adequate to the most frequently used CONFIGURATIONS in Bulgarian?

4. Some Investigations to Build The Globe in Bulgarian

The starting point of the experiment is the frequency list prepared over the corpus of the BulTreeBank project (http://www.bultreebank.org/Resources.html). The frequency list contains the 100,000 Cyrillic tokens with the highest number of occurrences in the corpus. Our exercise proceeded in the following steps. First, specify the most frequently used Bulgarian verbs. Second, determine which CONFIGURATIONS of the Globe model each particular verb could realize, starting the conceptualization from the top of the frequency list. For the first step, we sifted out of the frequency list the first 500 tokens that could have been used as verb forms. After a normalization procedure, the 500 verb forms yielded 279 verbs in citation form (1st person singular present tense, the Bulgarian counterpart of the infinitive). Perfective and imperfective verb forms were combined into one and the same entry, since they are considered to present the same meaning, so our list of the most frequent Bulgarian verbs counted 238 entries, with corresponding frequency equal to the sum of the frequencies of

the occurring verb forms. The top 10 verbs are as follows:

съм /to be/ 2 414 948
мога /can/ 287 051
имам /have/ 279 929
нямам /have not/ 169 919
трябва /must, have to/ 164 047
кажа, казвам /say/ 139 618
бия /beat/ 105 082
предложа, предлагам /offer, suggest/ 83 701
зная /know/ 66 986
стана, ставам /happen, become, get up/ 65 671
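The two preprocessing steps (lemmatizing the verb-form tokens, then merging perfective/imperfective pairs and summing their frequencies) could be sketched as follows. The tiny lemma and aspect tables are illustrative stand-ins, with invented per-form counts chosen so that the кажа/казвам pair sums to its reported 139 618 occurrences:

```python
# Sketch of the verb-list preprocessing described above: lemmatize each
# verb form, fold aspect pairs into one entry, and sum the frequencies.
# The lookup tables and per-form counts are illustrative stand-ins.

freq_list = [("каза", 90_000), ("казвам", 49_618), ("зная", 66_986)]

LEMMA = {"каза": "кажа", "казвам": "казвам", "зная": "зная"}  # form -> lemma
ASPECT_PAIR = {"казвам": "кажа"}  # imperfective folded into perfective entry

def merge_verbs(tokens):
    counts = {}
    for form, n in tokens:
        lemma = LEMMA[form]
        lemma = ASPECT_PAIR.get(lemma, lemma)  # aspect pairs share one entry
        counts[lemma] = counts.get(lemma, 0) + n
    return counts

print(merge_verbs(freq_list))   # {'кажа': 139618, 'зная': 66986}
```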

The second and substantial step of our experiment was the conceptualization of the verbs against the CONFIGURATIONS of the Globe. It was performed with the help of the Bulgarian Interpreting Dictionary (Andreichin et al. 1955), where all the meanings of each particular verb were checked. Below we give the results for the top 10 verbs.

съм /to be/: This is the leader, with far more occurrences than the second verb in the frequency list (2 414 948 vs. 287 051). Although a number of these occurrences are in its role as an auxiliary verb, it remains the most frequent verb. From a functional point of view, this is because съм /to be/ is used to realize several types of CONFIGURATIONS (see Picture 7 above): Relating-Intensive-Identity, Relating-Intensive-Ascription, Relating-Generalized-Positioning, Existence.

мога /can/: an auxiliary verb; it has no reflection within the presented picture of the Globe.

имам /have/: The impersonal form има is the most frequently used (158 464 occurrences vs. 121 465 occurrences of all other forms), which suggests that the meanings има, съществува /there is, exists/ and има, намира се някъде /there is somewhere/ are widely used. These meanings correspond to the CONFIGURATIONS Existence and


Relating-Generalized-Positioning, respectively. The first meaning of the verb in the Dictionary is possess, and it leads to a bunch of CONFIGURATIONS under Generalized-Possession (see Picture 7 above).

нямам /have not/: The form няма /has not/ is used almost three times more than all the other forms of the verb (128 435 vs. 41 484). This happens mainly because the form няма quite often acts as an auxiliary verb expressing the negative simple future, being in fact няма /will not/. The main meaning of the verb according to the Bulgarian Interpreting Dictionary is possess not, which is realized by CONFIGURATIONS under Generalized-Possession (see Picture 7 above).

трябва /must, have to/: an auxiliary verb; it is not depicted on the picture of the Globe.

кажа, казвам /say/: This is probably the most typical verb for building a verbal CONFIGURATION. In Picture 8 above it is chosen as the label of the External-Processing-Saying area of CONFIGURATIONS. All its meanings support this classification except one, казвам се /to be called/, which belongs to one of the branches of the Relating-Intensive CONFIGURATION.

бия /beat/: The presence of this verb in the list of the top 10 most frequent verbs is definitely a result of the very simple disambiguation procedure applied during the experiment. The rule is to assign each of the pretenders an even part of the number of occurrences of an ambiguous verb form. In Bulgarian, most of the past forms of the verb бия /beat/ coincide with some of the past forms of the verb съм /to be/ or with the forms of the particle би /would/: бих, би, бихме, бихте, биха, etc. In this way the frequency score of the verb бия /beat/ becomes much higher than the score of its

substantially used verb forms. For completeness of our analysis: most of the meanings of бия /beat/ could be realized by the Dispositive-Material-Action CONFIGURATION, while the meanings бия, пулсирам /pulsate/ and бия се /fight/ could be realized by CONFIGURATIONS of Dispositive-Material-Action.

предложа, предлагам /offer, suggest/: The CONFIGURATIONS which express the meanings of this verb lie within the area of External-Processing and could be either External-Processing-Saying or External-Processing-Behaving (see Picture 8). On the Globe model we should depict the notion corresponding to this verb nearer to the centre, in respect of its more general nature. Its precursor or relative is дам, давам избор /give a choice/, and further in this direction дам, давам /give/, with its typical CONFIGURATION with three participants.

зная /know/: This is the verb which realizes one of the key CONFIGURATIONS (think, know, believe) within the Cognition area (see Picture 8).

стана, ставам /happen, become, get up/: When used in its impersonal variant, the verb leads to CONFIGURATIONS of Nondirected-Happenings (see Picture 6). When used personally, it involves different types of participants, and the related CONFIGURATIONS are those of Nondirected-Doings for стана, ставам /get up/ (to be depicted on Picture 8) and Intensive for стана, ставам /become/ (to be depicted on Picture 7).
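The even-split disambiguation rule described for бия /beat/ above could be sketched as follows (a minimal sketch; the occurrence count and the exact candidate set are invented for illustration):

```python
# Sketch of the experiment's simple disambiguation rule: an ambiguous
# wordform's occurrences are split evenly among all candidate lemmas.
def split_ambiguous(form_count, candidate_lemmas, scores):
    share = form_count / len(candidate_lemmas)
    for lemma in candidate_lemmas:
        scores[lemma] = scores.get(lemma, 0) + share
    return scores

# e.g. a form like "би" could belong to бия /beat/, съм /to be/,
# or the conditional particle би /would/ (count is invented)
scores = split_ambiguous(9_000, ["бия", "съм", "би"], {})
print(scores)   # each of the three pretenders gets 3000.0
```

This is exactly why бия /beat/ is over-counted: it collects an even share of every form it merely happens to coincide with.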

In conclusion, it can be underlined that the most frequent verbs in our list which are not auxiliary verbs occupy key positions within the Globe model. Most of the CONFIGURATIONS worked out during the experiment can be placed in the Being&Having and Saying&Sensing areas. It is indicative of the


nature of the Doing&Happening area that only two CONFIGURATIONS fell there. The CONFIGURATIONS of this area express PROCESSES of the Physical World and are, in a sense, more concrete than the others. Of course, the extract of verbs presented here is very small, but on the other hand these are the most frequent ones. If we proceed in the same way with the next verbs in our frequency list, our Globe will turn out to be a formal semantic model of the World. Its points, the CONFIGURATIONS, also present linguistic information, which is substantial when the performed task involves speech, as in an NLG task. Each notion of the Globe possesses a semantic meaning which is easily understood from the position of its point (CONFIGURATION) within the Globe: near the centre the meanings are more general. All positions of the points are theoretically grounded in SFG.

5. Future Work

The next step in our research is the implementation of the Globe in computational form and its enrichment with Bulgarian notions of a particular Domain, which will experimentally provide a clearer understanding of the practical significance of the approach. The Globe itself could be an interesting research object from different points of view:

• multilinguality: juxtaposition of Globes built for different natural languages;

• knowledge representation: the Globe could probably be used by the Ontology Engineering community in their applications (agents, the Semantic Web);

• psychology: the Globe presents a model of the inner and outer World and could probably be compared to other models of the brain.

The authors are open to further comments and discussion.

6. References

(Bateman et al. 1995) Bateman, J.A., R. Henschel, F. Rinaldi. Generalized Upper Model 2.0: Documentation. Darmstadt, Germany, 1995. http://www.fb10.uni-bremen.de/anglistik/langpro/webspace/jb/gum/index.htm

(Bateman 1997) Bateman, J.A. "Enabling technology for multilingual natural language generation: the KPML development environment". Journal of Natural Language Engineering, 1997, 3(1): 15-55.

(Bateman and Zock 2001) Bateman, J.A. and M. Zock. The B-to-Z of Natural Language Generation Systems: An Almost Complete List. France, 2001. http://www.purl.org/net/nlg-list

(Bateman and Zock 2003) Bateman, J.A. and M. Zock. Natural Language Generation. In R. Mitkov (ed.), "The Oxford Handbook of Computational Linguistics", Oxford University Press, 2003, pp. 284-304.

(Gruber 1992) Gruber, T.R. "Ontolingua: a Mechanism to Support Portable Ontologies". Report KSL, Stanford University, 1992, pp. 61-66.

(Halliday 1994) M.A.K.Halliday An Introduction to Functional Grammar, Second Edition, Edward Arnold, London, 1994

(Vossen 2003) Piek Vossen Ontologies, In R. Mitkov (ed.) "The Oxford Handbook of Computational Linguistics", Oxford University Press, 2003, pp 264-482.

(�ndreichin at al. 1955) �.���������, �.��������, ��.�����, �.������, ��.�����,��.�������, ��.�������, ��������� ��������������, �� ����� � ��������, �����, 1955

Workshop Language and Speech Infrastructure in the Balkan Countries 2005 - Borovets, Bulgaria, pp. 75-81

[The paper printed on pages 75-81 of the proceedings could not be recovered from this extraction: the PDF's embedded font failed to decode, and the text survives only as replacement characters.]

�� � �� � �������� � �� � ���������� � ����� � ��� � �������%��������# � �� � ������������ � �� �(! � ������ � 2����� � �����% � �� � �������� � �� � �� � ����� � ����� � ��$��#����� � ��� � ������������� � ������ � !�"������ ������� ���� ���#���� � ������� ��� ��������%�������� � ��#����� � ��� � ��&������C���� � ���������������������������������$��#�����������������������

������ � !�"���� ������� ����������# � �� � ���������� � 2� � �#��� � �� � �� � ������ � ���������� � ��

$��#����� �(!% �������������� ����������� � ������� ��������������� ���������� �$��#������(!��� � ������&� � �� � ������ � #�������% � ��� � !�" � ������������� � 2� � �#��� � �� � �� � ������� � �� � !�"�������� � ������ � �� � #����% � �� � �����)���#��#$��#�����)-�#����)"��������������������������� � ����� � �� � ��������# � �������% � ��� � ��������������% � �� � ��� ��������� � ��� � ��������� � !�"���������������-�#������

2���������31������%�0������@������%�"���66>;�1�:(���%��������������%�3�/2�1�#��������%�����66=;�@�������������3����������$��#�����%��������O���1�#���������������666;�������#�������������������� !�"��2�;���������

/��*%�3�/2�!������������$����%�(��?88�;�K��������*������<����������� !�"%�=F"��?88?��$�*������� �� ��� �66>; � PQRSTUVW% � X�% � YZ[\]QW% � ^�% � _V`aVW% � b�

GHIJKLKMKM�NHOPQJRST�KUTS%�^cS��d_Vef]�PV]Q`9%�gQhUR�$�����% � I� � ��� � ������% � <� � �6>?; � 2����������� � 2�; �0����

E�����������������*��$$������E�������%�$�����%�I��+��.%�3����%�1���66D;�9�������(������'����$�����%�<�������-�01-9��(�$3������%�B���6�=;��%���������������������� �#�;A������3������%�B��6>8��E��������E����������������'��$,���V�����������K��������)(���������%�A���66>;�1��$��#�����������������3���������

@������#��!�������i�0���������W��������������66>%����=)D?�K���% � /� � �66>; �B��� �3�������� � �� � 2������� � 2�;E�$���� � �� �/��*%

3�/2�!����%�����?�6)?7=�0�����% ��� ��6>?; � 2��������!��)���� �!������ � ����������i �'(�����

F��������������%��:8)����"������������6>�;�"�����%�"�%�����%�-�%�!�����%�"�%�������#%�2��*��*-�

3������#%�A1;� �����������������!����"������#%�I��������#%�2��?888;�4�������������4���������������3�/2��!����"��j��% � k� � �6�:; � 0����*� � ������������ � �������l� ���� � � *m������

��l�������)������n������2�;�$!�I%�N222%�����o�%��6�:%�����7=)6?�"�$�-��66:;�XJQLQYTSQ�MQ�RHIJ��NHOPQJRST�SM��KUTS-�Y��� �������% � -� � ��� � B�������%� ��� �66D; � ������#��������� � ��� � (!

������������"������2�;1����������*��$���B�����������-������)�?� �������% � -� � ��� � B�������%� ���66:; �/��������# � 1�N� � �� � "����

(�����3����&���2�;�*��$������/��*-�:7�'�F4�-�������))D=� �������% � -� � ��� � B�������%� ��� �666; �!������ � (! � ��� � ����� � B!

������������������"����;��������������*�%?=�))DD���3������#�A��������% �(��66?; �!������������������*���������� ��������% �=�����

�����������Z�����$�����������%�����o���A�������6>�;�p\qrQW%�s��XJQLLQYTSQ�N[OPQJRS[P[�\U])���pQqtW\%A����%�@��K��?888;�F�A����*������+�����������(���%���A����*��$��

"���8������'��������������!�K��K����������%��O���#������������A�����������6=8;�pZe\haUVW\%�u�� [_`TMTYKOMT\Y�RHaU�_Q�I�Gbc�

^cS�`\�Pvw%�gQhUR�B������%�0���6>7;�F����������:(���%����1$�����%�1�����!���%�2���%�������% � !� � ?88�; � �� � ���*��)(�� � 1#���� � �� � $��#����� � +1�

!�")������������.��2�;�����������(�������(�89�F-������$��!������66D;�_V`aVW%�b��bHOPQJRST�RTMYQSRTR��_rQWSUW�!������%�3��������#%�2���6>=;�4����$������<������%���A�������$������-

B���-�3�/2�!������%�3��������#%�2���66:;�/��*%�3�/2%�����������!����� �67?; � _QxQW% � Y�GHIJKLKMKM � NHOPQJRST � KUTS�

GTMYQSRTR�gQhUR%w\Zt\�U�UctZqeWQ�<����% � A�)/� � �66:; � 3���� � �������� � ��� � ( � A����� � �� � ��

/��#��#��������$��������2��;�"FF:���-���� �����<����%�3���6>7;�K�)������%�0��������������������8������&�:D�):�����#������?88D;����%��������:(���%��1�8��$��4�����������%�3�/2������ � � � �� � ?88?; � !�")���� � ��������� � ������ � �� � $��#�����

+$����$���.��2�;�d1�E���,�#����'������d%�?88?%������D�)�:?�(�����% � ��� � ?88?; � 1 � /���� � "������ � A��� � ��� � ����������

<��#����������3����3��*����������<1F?1"4�:4'1���%�D7�)D6��

Page 90: PROCEEDINGS 2005.pdf · 2018-03-26 · in recall while incurring only a small loss in precision. Leroy et al. [3] developed a shallow parser that captures relations between noun phrases

Workshop Language and Speech Infrastructure in the Balkan Countries 2005 - Borovets, Bulgaria82

Designing a PAROLE/SIMPLE German-English-Romanian Lexicon

Monica Gavrila, Walther von Hahn and Cristina Vertan
Hamburg University, Natural Language Systems Department

Vogt-Kölln Strasse 30, Hamburg

{gavrila,vhahn,cri}@nats.informatik.uni-hamburg.de

Abstract

In this paper we describe G.E.R.L., a multilingual German-English-Romanian lexicon based on the PAROLE/SIMPLE standard. Several particularities of the Romanian language imposed slight modifications of the initial standard. We describe these particularities as well as the envisaged solutions.

1. Introduction

A large amount of lexical material was produced during the past 20 years. Unfortunately, in the absence of a standard, each application produced and used its own lexicon, following a specific model and encoded in a particular format, according to the particularities of the language, the system functionality, and the available physical resources (Handke 95). Reusable lexical resources would reduce the cost of developing NLP (Natural Language Processing) applications, and they are of particular interest for languages with a smaller electronic visibility, for which such resources are even more rare.

The standardization efforts were conducted in two directions: the development of standard models for mono- and multilingual lexicon design. The results of these activities were the PAROLE/SIMPLE model (Ruimy et al. 98) and the follow-up project MILE (Calzolari et al. 03). These models are valuable platforms for the development of further resources; however, the problem of reusing old resources is still a matter of research (Vertan & v. Hahn 02).

PAROLE/SIMPLE lexicons were created for several European languages, and the model claims to be general enough to cover a broad spectrum of linguistic phenomena. In this paper we show that for the realization of a Romanian lexicon and its connection to a multilingual environment, changes to the original model are required.

2. The Choice of the Lexicon Model

PAROLE/SIMPLE is a standard model for developing monolingual lexicons, which also offers the possibility of linking these lexicons to a multilingual one. The specification follows 4 layers: 3 for the monolingual lexicons (morphological, syntactic, semantic) and one for multilingual connections.

The PAROLE model of morphological data was conceived to represent variations in form and grammatical properties of words (Guimier and Ognowski 98). The word forms are grouped into Morphological Units (MUs). An MU corresponds either to a closed-class word (adverb, preposition, etc.) or to a paradigm of forms related by inflectional operations (conjugation, change of gender or number, etc.). The notion of an MU often corresponds to that of a "lemma" in traditional dictionaries. Each MU is linked to one or more combinations of morphological features (CombMF), in which the important features are recorded: gender, number, tense, etc.
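The grouping of word forms into MUs linked to CombMF feature bundles can be sketched as a simple data structure. This is an illustration only; the class and field names below are our own, not part of the PAROLE DTD:

```python
from dataclasses import dataclass, field

@dataclass
class CombMF:
    """One combination of morphological features (illustrative subset)."""
    gender: str = ""
    number: str = ""
    tense: str = ""

@dataclass
class MorphologicalUnit:
    """An MU: roughly a lemma together with its admissible feature combinations."""
    lemma: str
    pos: str
    combinations: list = field(default_factory=list)

# A paradigm-type MU: forms related by an inflectional change of number.
chair = MorphologicalUnit("chair", "noun")
chair.combinations.append(CombMF(gender="neuter", number="singular"))
chair.combinations.append(CombMF(gender="neuter", number="plural"))
```

A closed-class MU (e.g. a preposition) would simply carry no inflectional combinations.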

The syntactic layer of PAROLE deals with sub-categorization, characteristics of the lexical unit when associated with a sub-categorization frame, control, diathesis alternation, constraints on the syntactic context, syntactic compounds, etc. A Syntactic Unit (SynU) is equivalent to a syntactic entry or reading and has one base description and possibly several derived descriptions.

SIMPLE is a follow-up of the PAROLE project (http://www.ilc.pi.cnr.it), and it adds a semantic layer to the subset of the existing morphological and syntactic layers developed by PAROLE. The PAROLE lexicons were developed for 12 European languages, among them English and German.


MILE was a follow-up project (of PAROLE/SIMPLE) aiming to define a multilingual standard model for lexicons. We first intended to develop G.E.R.L. according to this model. However, the available documentation gives no indication about the morphological unit, and direct contacts with researchers involved in the MILE development could not help in this respect; we therefore turned back to the PAROLE/SIMPLE model.

Other possible choices of a lexicon model would have been MULTEXT-EAST and WordNet.

MULTEXT-EAST (http://nl.ijs.si/ME/) covers a large number of mainly Central and East European languages (including Romanian). It consists of morpho-syntactic specifications (grammar and vocabulary for morpho-syntactic descriptions, MSDs), morpho-syntactic lexicons and a morpho-syntactically annotated corpus.

WordNet (http://www.globalwordnet.org/) exists for several languages, including Romanian (BalkaNet), English (WordNet 2.0, EuroWordNet), and German (EuroWordNet). As mentioned on the Princeton WordNet and EuroWordNet websites1: "WordNet® is an on-line lexical reference system [...]. English nouns, verbs, adjectives and adverbs are organized into synonym sets, each representing one underlying lexical concept. Different relations link the synonym sets." "The word-nets are linked to an Inter-Lingual-Index. Via this index, the languages are interconnected so that it is possible to go from the words in one language to similar words in any other language."

One of the problems encountered in using MULTEXT-EAST and WordNet is that they do not contain all the information required by the G.E.R.L. specification.

MULTEXT-EAST contains only morpho-syntactic lexicons, but G.E.R.L. also needs semantics, multilingual information, and some additional morphological features (e.g. morphological segmentation for nouns), and PAROLE/SIMPLE covers these areas. Moreover, PAROLE/SIMPLE and MULTEXT-EAST can, with some modifications, be mapped onto each other (at least for the morphological and syntactic layers).

1 http://wordnet.princeton.edu/w3wn.html (Princeton WordNet), http://www.illc.uva.nl/EuroWordNet/ (EuroWordNet).

In WordNet, the existing information and relations between synsets are not sufficient for the goal of the lexicon; e.g. more morphological information is needed, more (technical) words have to be introduced, etc. More information on BalkaNet and BalkaNet for Romanian can be found in (Cristea et al. 04; Tufis et al. 04a; Tufis et al. 04b).

3. Short Overview of the Romanian Language

In this section we describe the particularities of the Romanian language that are relevant for the design of a multilingual lexicon involving this language.

Romanian belongs to the Romance language family; its grammar and basic vocabulary are derived from Latin. Due to the influence of the Slavonic languages, particular morphological and syntactical phenomena are encountered.

Romanian is a highly inflected language. There are 3 genders (masculine/feminine/neuter) and 5 cases (nominative/accusative/genitive/dative/vocative). In contrast to all other Romance languages, the definite article is concatenated at the end of the word:
e.g. scaunul (Rom.) = the chair (Engl.)

casa (Rom.) = the house (Engl.)
This holds for all the inflected forms containing the definite article. The indefinite article behaves as in all other Romance languages (e.g. un scaun (Rom.) = a chair (Engl.)). For one noun there are approximately 5 inflected forms. The inflection is quite irregular, and therefore for NLP applications it is preferable to have a full-form lexicon instead of implementing morphological processes and taking into account a long list of exceptions.
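The full-form approach can be illustrated with a minimal lookup table that reuses the noun examples above. This is a hypothetical sketch, not the actual G.E.R.L. data format:

```python
# Minimal full-form lexicon: each inflected form maps directly to its
# lemma and features, so no morphological rules (and no exception lists)
# are needed at lookup time; the irregularity lives in the table itself.
FULL_FORM_LEXICON = {
    "scaun":   {"lemma": "scaun", "number": "singular", "article": "none"},
    "scaunul": {"lemma": "scaun", "number": "singular", "article": "definite"},
    "casa":    {"lemma": "casa",  "number": "singular", "article": "definite"},
}

def lookup(form: str):
    """Direct lookup of an inflected surface form."""
    return FULL_FORM_LEXICON.get(form)
```

A form absent from the table (e.g. `lookup("scaune")`) simply returns `None`, signalling an out-of-lexicon word rather than triggering a morphological analysis.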

Adjectives agree in gender, number and case with the corresponding noun. Both orders, (adjective, noun) and (noun, adjective), are possible, with different semantics: the common use is (noun, adjective), whereas positioning the adjective first implies an emphasis on the adjective. The definite article is added to the word with the main semantic role, e.g.:
Fata frumoasa = the beautiful girl
Frumoasa fata = this beautiful girl

There are 7 tenses and 4 (personal) moods for the verbs, not all of which have a one-to-one correspondence in other languages: e.g. mersei (Rom. indicative, "simple perfect") = I went somewhere, sometimes, in the near past.

Further information about the Romanian language can be found in (Hristea & Moroianu 03; Ionescu 03).

4. G.E.R.L.: Multilingual German-English-Romanian Lexicon

4.1. G.E.R.L. Specification

G.E.R.L. is aimed to be a PAROLE/SIMPLE-compatible lexicon. Therefore its structure mainly follows the DTD specified in the PAROLE/SIMPLE project (http://gilc.ub.es/DTD-ALL/index.html). It is intended to serve educational purposes, student NLP projects and the development of machine translation prototypes or other small NLP applications. The lexicon was used mainly for student projects, and the vocabulary contains many technical words.

The specification of G.E.R.L. contains only part of the features described in (Guimier & Ognowski 98). The following information is handled:

• Morphology:
  - Noun: type, gender, number, case, morphological segmentation, definiteness (for Romanian)
  - Verb: type, mode, tense, voice, number, compounding particle (for German and English)
  - Pronoun: type, person, gender, number, case
  - Adjective: gender, number, case, comparison degree
  - Article: gender, number, case, type
  - Adverb: comparison degree
  - Numeral: type
  - Preposition
  - Conjunction
  - Verb particle (for German and English; verbs in Romanian do not have particles)
• Syntax:
  - Specific case requirements for prepositions
  - Main/subordinate clauses introduced by specific conjunctions or verbs
  - Personal/impersonal verbs
  - Transitive/intransitive verbs
  - Mass nouns, singular-only nouns, uncountable nouns
• Semantics:
  - Synonymy (following the PAROLE/SIMPLE model)
  - Thematic roles for verbs
  - Collocations
  - Indication whether a word is a foreign word (useful mainly for machine translation)

G.E.R.L. is a full-form multilingual lexicon. It contains the three monolingual lexicons, each having the structure described above, and a multilingual layer connecting the entries. In the case of a compound word, we require that all its components already be in the dictionary. The part of speech of the compound word is that of the main word in the compound. When no one-to-one correspondence between entries is possible, a local translation is given.
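The two constraints on compound words stated above (all components must already be in the dictionary, and the compound inherits the part of speech of its main word) can be sketched as a small validation helper. This is a hypothetical illustration; the function and the sample words are our own, not part of the G.E.R.L. software:

```python
def add_compound(lexicon: dict, compound: str, components: list, main: str) -> None:
    """Add a compound entry only if every component already exists;
    the compound's part of speech is copied from its main word."""
    missing = [c for c in components if c not in lexicon]
    if missing:
        raise ValueError(f"components not yet in the dictionary: {missing}")
    lexicon[compound] = {"pos": lexicon[main]["pos"], "components": components}

# Illustrative mini-lexicon (the Romanian words serve only as an example).
lexicon = {"limba": {"pos": "Noun"}, "romana": {"pos": "Adjective"}}
add_compound(lexicon, "limba romana", ["limba", "romana"], main="limba")
# The compound's part of speech is that of its main word, "limba" (Noun).
```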

4.2. G.E.R.L. Architecture

G.E.R.L. is composed of 4 layers: morphological, syntactic, semantic and multilingual. The first three layers have main units uniquely identified via an Id. The structure can be viewed in Figure 1.

The morphological layer: The main unit of this layer is the Morphological Unit (MU). From the original 4 types of MUs (of the PAROLE/SIMPLE model), 3 were kept:

1. Simple MU (MuS): for simple lexeme entries

2. Compound MU (MuC): for compound lexemes

3. Affix MU (MuAff): for affixes. This type is used, together with the Derivation tag, to describe which nouns have affixes.

The PoS2 is specified with the attribute gramcat3; most of the other types with the attribute gramsubcat4. The lexeme is specified with a newly introduced tag in the MuS and MuC: entry. The other parts of the morphological layer, as well as the syntactic layer, follow the PAROLE/SIMPLE model. In order to respect the lexicon specification and to fit the Romanian language, some new grammatical features were introduced in the CombMF tag (e.g. the degree of comparison for adjectives and adverbs, and an article attribute to record the definiteness of a noun or adjective in Romanian).

2 Part of speech.
3 Grammatical category.
4 Grammatical sub-category.

Figure 1: The G.E.R.L. Architecture

The semantic layer: The main unit is the Semantic Unit (SemU). Synonyms are specified as a "synonymy relation" between SemUs.

To specify collocations, a new tag was introduced. We also introduced an attribute of the MuS/MuC tag specifying whether a word is a foreign lexeme or not. In the original PAROLE/SIMPLE model, SemUs and SynUs were connected; because of the collocation tag, this connection no longer exists. Instead, the connections are made as follows: MuS/MuC with SynU, and MuS/MuC with SemU.

As each MuS/MuC has a corresponding SemU, the synonymy relation is specified as follows:

.........
<MuS gramcat="Noun" subgramcat="common"
     id="Nou_0001" synulist="EMPTY"
     semulist="SemU_xx" foreign="NO">
.........
</MuS>
.........
<SemU id="SemU_xx" comment="no semantic information"
      example="" collocationlist="">
  <RweightValSemU comment=""
      targetlist="wordID1 wordID2 collocationID3 ..."
      semR="SYN" />
</SemU>
<RSemU id="SYN" comment="synonymy relation"
       sstype="SYNONYMY" />
.........

In this way it is stated that the word with the id Nou_0001 has as synonyms the words wordID1 and wordID2, and the collocation collocationID3.
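Reading such a synonymy specification back out of the XML can be sketched with Python's standard ElementTree module. The fragment below hard-codes a miniature lexicon in the format shown above; it is an illustration only, not part of the G.E.R.L. tooling:

```python
import xml.etree.ElementTree as ET

# A miniature semantic layer in the format shown above.
SAMPLE = """
<ParoleSemant>
  <SemU id="SemU_xx" comment="no semantic information"
        example="" collocationlist="">
    <RweightValSemU comment="" targetlist="wordID1 wordID2 collocationID3"
        semR="SYN" />
  </SemU>
  <RSemU id="SYN" comment="synonymy relation" sstype="SYNONYMY" />
</ParoleSemant>
"""

def synonyms(xml_text: str, semu_id: str) -> list:
    """Collect the targets of every SYN relation attached to the given SemU."""
    root = ET.fromstring(xml_text)
    targets = []
    for semu in root.iter("SemU"):
        if semu.get("id") != semu_id:
            continue
        for rel in semu.iter("RweightValSemU"):
            if rel.get("semR") == "SYN":
                targets.extend(rel.get("targetlist", "").split())
    return targets
```

Calling `synonyms(SAMPLE, "SemU_xx")` yields the three target ids from the `targetlist` attribute, mixing word ids and collocation ids exactly as the model allows.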

The multilingual layer also underwent some changes, due to the specifications of the lexicon. In the original PAROLE/SIMPLE model, only links at the "unit" level were possible; we additionally introduced the CorrespGap tag, which marks a lexical gap. Multilingual connections are made at the following levels: MUs (morphological units) and collocations. Connections at the Syntactic and Semantic Unit levels were for the moment not considered. The multilingual link is not always bi-directional.
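The behaviour of these multilingual links, including their direction-dependence and the CorrespGap marker, can be illustrated with a small directed mapping between unit ids. This is a sketch under our own naming; only Nou_0001 is taken from the sample entry, the other ids are hypothetical:

```python
GAP = "CorrespGap"  # stands in for the CorrespGap tag marking a lexical gap

# Directed links between morphological-unit ids: a link from language A
# to language B need not have a counterpart in the opposite direction.
links = {
    ("ro", "en"): {"Nou_0001": "battery_N", "Nou_0002": "battery_N",
                   "Unit_0099": GAP},
    ("en", "ro"): {"battery_N": "Nou_0001"},
}

def translate(src: str, tgt: str, unit_id: str):
    """Follow one directed link; GAP means the target language lacks an equivalent."""
    return links.get((src, tgt), {}).get(unit_id)

# Several source entries may map to one target entry, while the reverse
# link singles out only one of them: the mapping is not bi-directional.
```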

The following paragraphs give some examples that illustrate the specifications described above:

The word battery (Engl.) has the following correspondents in Romanian:
- (o) baterie (N/A, indefinite article / no article)
- bateria (N/A, definite article)
- (unei) baterii (D/G, indefinite article / no article)
- bateriei (D/G, definite article)

This means that in G.E.R.L., 4 Romanian entries correspond to 1 English entry. There are cases where it happens the other way round, e.g. 1 Romanian entry corresponds to 2 English entries: mame (Rom.) – mother or mothers (Eng.):
unei mame (Rom.) D, indefinite article, singular – to a mother (Eng.)
unor mame (Rom.) D, indefinite article, plural – to some mothers (Eng.)

Below, we present the entry in the Romanian lexicon for the word "baterie" (Eng. battery). The bold tags are defined directly in the lexicon XML file. The others are created using the software for managing lexicon entries.

<?xml version="1.0" encoding="UTF-8"?>
<LesParole>
<Parole>
<ParoleMorpho>

<MuS gramcat="Noun" subgramcat="common"
     id="Nou_0001" synulist="EMPTY"
     semulist="EMPTY" foreign="NO">
  <Entry>baterie</Entry>
  <Gmu inp="Nn-feminine-singular-nominative-no article" />
</MuS>

<GInp id="Nn-feminine-singular-nominative-no article">
  <CombMF CifcombMF="Nn_feminine_singular_nominative_no article" />
</GInp>
<CombMF id="Nn_feminine_singular_nominative_no article"
        gender="feminine" number="singular"
        case="nominative" article="no article" />

<MuS gramcat="Noun" subgramcat="common"
     id="Nou_0002" synulist="EMPTY"
     semulist="EMPTY" foreign="NO">
  <Entry>baterie</Entry>
  <Gmu inp="Nn-feminine-singular-accusative-no article" />
</MuS>
............
<MuS gramcat="Noun" subgramcat="common"
     id="Nou_0003" synulist="EMPTY"
     semulist="EMPTY" foreign="NO">
  <Entry>baterie</Entry>
  <Gmu inp="Nn-feminine-singular-nominative-indefinite" />
</MuS>
............
<MuS gramcat="Noun" subgramcat="common"
     id="Nou_0004" synulist="EMPTY"
     semulist="EMPTY" foreign="NO">
  <Entry>baterie</Entry>
  <Gmu inp="Nn-feminine-singular-accusative-indefinite" />
</MuS>
............
</ParoleMorpho>

<ParoleSyntaxe>
  <SynU id="EMPTY" comment="no syntactical information"
        example="" description="EMPTY" />
  <Description id="EMPTY" comment="" example="" />
</ParoleSyntaxe>

<ParoleSemant>
  <SemU id="EMPTY" comment="no semantic information"
        example="" collocationlist="" />
  <RSemU id="SYN" comment="synonymy relation" sstype="SYNONYMY" />
  <SemanticRole id="SR_agent" example="" comment="" name="agent" />
............
</ParoleSemant>

</Parole>

<ParoleMultilingue langue1="Romanian" langue2="English" />

<ParoleMultilingue langue1="Romanian" langue2="German" />

</LesParole>

Examples:
O baterie este … (Rom.) = A battery is … (Eng.) – N, indefinite article
… pe o baterie (Rom.) = … on a battery (Eng.) – Ac, indefinite article
… pe baterie (Rom.) = … on a battery (Eng.) – Ac, without article

We mention that, apart from the newly introduced tag names, all the others are taken from the PAROLE/SIMPLE model.

A GUI manipulating the presented structure was built in order to allow lexicon creation and management. A sample of the graphical user interface is presented in Figure 2.


Figure 2: Snapshot of the GUI interface of the G.E.R.L. management tool

5. Conclusions and Further Work

In this paper we presented a multilingual German-English-Romanian lexicon following the PAROLE/SIMPLE model. Working with the Romanian language showed that there are still linguistic features not covered by the model; we updated the model in order to make it compatible with Romanian. G.E.R.L. is the first such multilingual dictionary, given that PAROLE/SIMPLE lexicons had already been developed for German and English and are available under certain restrictions. We are now working on filling the Romanian lexicon with a relevant number of entries, as well as on realising the multilingual connections. Further work addresses mainly the integration of G.E.R.L. into machine translation applications. We are also interested in the integration of G.E.R.L. with other PAROLE/SIMPLE lexicons.

References

(Calzolari et al. 03) Nicoletta Calzolari, Francesca Bertagna, Alessandro Lenci and Monica Monachini. Standards and Best Practice for Multilingual Computational Lexicons & MILE (the Multilingual ISLE Lexical Entry). Deliverable D2.2-D3.2, ISLE Computational Lexicon Working Group, to be retrieved at http://www.ilc.cnr.it/EAGLES96/isle/clwg_doc/ISLE_D2.1-D3.1.zip, 2003

(Cristea et al. 04) Dan Cristea, Catalin Mihaila, Corina Forascu, Diana Trandabat, Maria Husarciuc, Gabriela Haja, Oana Postolache. Mapping Princeton WordNet Synsets Onto Romanian WordNet Synsets. Romanian Journal of Information Science and Technology, Volume 7, Numbers 1-2, pages 125-145, 2004

(Erjavec 04) Tomaz Erjavec. MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora, to be retrieved at http://nl.ijs.si/ME/V3/doc/bib/mte-lrec2004.pdf, 2004

(Guimier & Ognowski 98) Emilie Guimier and Antoine Ognowski. LE-PAROLE Reports on the Morphological and Syntactic Layers, to be retrieved at http://www.ub.es/gilcub/SIMPLE/reports/parole/, 1998

(Handke 95) Jürgen Handke. The Structure of the Lexicon: Human versus Machine. Mouton de Gruyter, Berlin-New York, 1995

(Hristea & Moroianu 03) Theodor Hristea and Cristian Moroianu. Generarea formelor flexionare substantivale si adjectivale in limba româna. In Building Awareness in Language Technology, Papers of the Romanian Regional Information Centre for Human Language Technology, Fl. Hristea and M. Popescu (Eds.), pages 443-460. Editura Universitatii din Bucuresti, 2003

(Ionescu 03) Emil Ionescu. Premise ale unui dictionar morfologic al limbii române. In Building Awareness in Language Technology, Papers of the Romanian Regional Information Centre for Human Language Technology, Fl. Hristea and M. Popescu (Eds.), pages 461-468. Editura Universitatii din Bucuresti, 2003

(Ruimy et al. 98) Nida Ruimy, Ornella Corazzari, Elisabetta Gola, Antonietta Spanu, Nicoletta Calzolari and Antonio Zampolli. The European LE-PAROLE Project and the Italian Lexical Instantiation. Proceedings of the ALLC/ACH 1998, Lajos Kossuth University, Debrecen, Hungary, pages 149-153, 5-10 July 1998

(Tufis et al. 04a) Dan Tufis, Eduard Barbu, Verginica Barbu Mititelu, Radu Ion, Luigi Bozianu. The Romanian WordNet. Romanian Journal of Information Science and Technology, Volume 7, Numbers 1-2, pages 107-124, 2004

(Tufis et al. 04b) D. Tufis, D. Cristea, S. Stamou. BalkaNet: Aims, Methods, Results and Perspectives. A General Overview. Romanian Journal of Information Science and Technology, Volume 7, Numbers 1-2, pages 9-43, 2004

(Vertan & von Hahn 02) Cristina Vertan and Walther v. Hahn. Towards a Generic Architecture for Lexicon Management. In Proceedings of the Workshop on International Standards of Terminology and Language Resources Management, LREC, pages 45-49. Las Palmas de Gran Canaria, 28 May 2002
