Basic Design of the architecture and methodologies (second round)

Document Number: D2.2
Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Project URL: http://www.lsi.upc.es/~nlp/meaning/meaning.html
Availability: Public
Authors: German Rigau (UPV/EHU), Bernardo Magnini (ITC-irst), Eneko Agirre (UPV/EHU), John Carroll (Sussex), Piek Vossen (Irion), Jordi Atserias (UPC)

INFORMATION SOCIETY TECHNOLOGIES



Work Package 2 - D2.2 (Version: Draft)

Project ref.: IST-2001-34460
Project Acronym: MEANING
Project full title: Developing Multilingual Web-scale Language Technologies
Security (Distribution level): Public
Contractual date of delivery: August 2003
Actual date of delivery: March 25, 2004
Document Number: D2.2
Type: Report
Status & version: Draft
Number of pages: 75
WP contributing to the deliverable: Work Package 2
WP/Task responsible: German Rigau (UPV/EHU)
Authors: German Rigau (UPV/EHU), Bernardo Magnini (ITC-irst), Eneko Agirre (UPV/EHU), John Carroll (Sussex), Piek Vossen (Irion), Jordi Atserias (UPC)
Other contributors:
Reviewer:
EC Project Officer: Evangelia Markidou
Keywords: NLP, Lexical Knowledge Representation, Acquisition, WSD
Abstract: This document aims to produce a detailed design of the development phase for the second Meaning round (months 13 to 24). It revises the overall methodology of Meaning, including the standard protocols, formats, procedures and evaluation criteria, and the design of the Multilingual Central Repository database.

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies


Contents

1 Executive Summary

2 The second round of Meaning

3 Design of WP3 Language Processors and Infrastructure
  3.1 Linguistic Processors
  3.2 Lexical Resources
  3.3 Corpora
  3.4 Current plans for LP1 and LP2
      3.4.1 Basque
      3.4.2 Catalan
      3.4.3 Spanish
      3.4.4 English
      3.4.5 Italian

4 Design of WP4 (Knowledge) Integration
  4.1 The Multilingual Central Repository
  4.2 Content of Mcr0
  4.3 Content of Mcr1
      4.3.1 The Meaning Inter-Lingual-Index
      4.3.2 The EuroWordNet Base Concepts
      4.3.3 The EuroWordNet Top Concept Ontology
      4.3.4 Suggested Upper Merged Ontology (Sumo)
      4.3.5 eXtended WordNet
      4.3.6 WordNet 2.0
      4.3.7 Improved Selectional Preferences acquired from BNC
      4.3.8 Parsed SemCor
      4.3.9 Instances
  4.4 Initial plans for Mcr2
  4.5 Cross-checking
  4.6 Mcr example

5 Design of WP5 Acquisition
  5.1 Experiment 5.A – Multilingual acquisition for predicates
      5.1.1 Current Status
      5.1.2 Introduction and Background
      5.1.3 Source Data
      5.1.4 Experiments
      5.1.5 Evaluation and discussion
  5.2 Experiment 5.B – Collocation
      5.2.1 Current Status
  5.3 Experiment 5.C – Acquisition of domain information for named entities
      5.3.1 Current Status
      5.3.2 Introduction and Background
      5.3.3 Source Data
      5.3.4 Experiments
      5.3.5 Evaluation and Discussion
  5.4 Experiment 5.D – Topic Signatures
      5.4.1 Current Status
      5.4.2 Introduction and Background
      5.4.3 Source Data
      5.4.4 Experiments
      5.4.5 Evaluation
  5.5 Experiment 5.E – Sense Examples
      5.5.1 Current Status
      5.5.2 Introduction and Background
      5.5.3 Source Data
      5.5.4 Experiments
      5.5.5 Evaluation
  5.6 Experiment 5.F – Lexical knowledge from MRDs
      5.6.1 Current Status
  5.7 Experiment 5.G – Improved Selectional Preferences
      5.7.1 Current Status
      5.7.2 Introduction
      5.7.3 Use of a weighting factor to counter the effect of sample size on TCMs
      5.7.4 Automatic WSD of the training data used for selectional preference acquisition
      5.7.5 Protomodels
  5.8 Experiment 5.H – Clustering WordNet Word Senses
      5.8.1 Current Status
      5.8.2 Introduction
      5.8.3 Data sources
      5.8.4 Experiment
      5.8.5 Evaluation
  5.9 Experiment 5.I – Multiwords: phrasal verbs
      5.9.1 Current Status
      5.9.2 Experiment
  5.10 Experiment 5.J – New Senses
      5.10.1 Current Status
  5.11 Experiment 5.K – Ranking Senses automatically
      5.11.1 Current Status
      5.11.2 Introduction and background

6 Design of WP6 Word Sense Disambiguation
  6.1 Introduction
  6.2 Experiment 6.A – English all-word WSD system
      6.2.1 Current Status
      6.2.2 Introduction, Background and Goals
      6.2.3 Source Data and Tools
      6.2.4 Design and Architecture
      6.2.5 Steps/Experiments
      6.2.6 Evaluation and Discussion
      6.2.7 Extensions of WSD1 (future work for WSD2)
  6.3 Experiment 6.B – High Precision English WSD for bootstrapping
      6.3.1 Current Status
  6.4 Experiment 6.C – High quality sense examples
      6.4.1 Current Status
  6.5 Experiment 6.D – Transductive Support Vector Machines
      6.5.1 Current Status
  6.6 Experiment 6.E – All-words WSD systems for the rest of languages
      6.6.1 Current Status
      6.6.2 Current Status
      6.6.3 Introduction and Background
      6.6.4 Source Data
      6.6.5 Experiment
      6.6.6 Evaluation
  6.7 Experiment 6.F – More informed features
      6.7.1 Current Status
      6.7.2 Introduction and Background
      6.7.3 Source Data
      6.7.4 The Experiment
  6.8 Experiment 6.G – Unsupervised WSD
      6.8.1 Introduction and Background
  6.9 Experiment 6.H – Bootstrapping
      6.9.1 Introduction and Background
      6.9.2 Source Data
      6.9.3 The Experiments
  6.10 Experiment 6.I – Effect of Sense Clusters
      6.10.1 Current Status
  6.11 Experiment 6.J – Semantic Class Classifiers
      6.11.1 Current Status
  6.12 Experiment 6.K – Effect of Ranking Senses Automatically
  6.13 Experiment 6.L – Disambiguating WN Glosses
      6.13.1 Introduction and Background
      6.13.2 Source Data
      6.13.3 Experiments
      6.13.4 Evaluation

7 Design of WP7 Evaluation and Assessment
  7.1 Output of Linguistic Processors
  7.2 Acquired/Ported Lexical Information
  7.3 Word Sense Disambiguation


1 Executive Summary

This document is the result of a planned revision of deliverable D2.1 (Basic Design of architecture and methodologies [1]). This report aims to produce a detailed design of the development phase for the second Meaning round (months 13 to 24). It revises the overall methodology of Meaning, including the standard protocols, formats, procedures and evaluation criteria, and the design of the Multilingual Central Repository (Mcr) database.

In this report the consortium revises the basic requirements for all modules involved in the project. Meaning is a long and very complex project with multiple interdependencies between workpackages. This deliverable tries to identify the current requirements needed for successfully developing the whole project. For each task all requirements must be revised, e.g. the requirements for the Language Processors and infrastructure (WP3). For the second round, this deliverable also describes the information flow and content data for uploading and porting information to the Multilingual Central Repository (Mcr) (WP4), the experiments for the acquisition process (WP5) and word sense disambiguation (WP6), and a revision of the evaluation criteria needed for measuring the quality of the tools and resources produced by Meaning (WP7). In particular, this report provides the main guidelines:

• To identify the current requirements for the Language Processors and infrastructure (WP3) to be used in the second round of Meaning.

• To define the timing, information flow and content data of the acquisition (WP5), word sense disambiguation (WP6), and uploading and porting cycles (WP4) for the second round.

• To revise the main functionality of the Mcr (WP4) including:

– The content to be represented into the Mcr (WP4).

– The process for uploading the data acquired from one wordnet to the Mcr.

– The process for porting the knowledge stored in the Mcr to the respective wordnets.

• To define a set of large-scale knowledge acquisition experiments (WP5) for the second round.

• To define a set of large-scale word sense disambiguation experiments (WP6) for the second round.

• To revise the assessment and evaluation criteria to be used (WP7) for this round.

After this summary, section 2 provides a general overview of the second round of the Meaning project. Section 3 is devoted to WP3 (Linguistic Processors and Infrastructure),

[1] http://www.lsi.upc.es/~nlp/meaning/documentation/D2.1.pdf.gz


presenting the requirements and the plans for the Language Processors and infrastructure for the second round. Section 4 focuses on WP4 (Knowledge Integration) and the Mcr. Sections 5 and 6 focus respectively on the design of WP5 (Large-scale Knowledge Acquisition) and WP6 (Word Sense Disambiguation). Finally, section 7 presents a revision of the design criteria of WP7 (Evaluation and Assessment).

2 The second round of Meaning

After a major revision of the whole architecture of the project, no major changes have been planned for the second round of Meaning. However, some minor changes in the Project Planning have been devised.

As explained in the Technical Annex, the duration of the Meaning project is 36 months. The work is subdivided into three main phases:

1. Analysis (User and Software requirements, overall architectural design definition;months 1-6, 13-18, 25-27).

2. Development (three releases of the software tools and resources; months 7-33).

3. Assessment and Validation (months 10-36).

The first outcomes of this Analysis phase are:

• the User Requirements Report (deliverable D1.1 [2])

• the Basic Design of the architecture and methodologies (deliverable D2.1 [3])

However, since this is a long research project, partial revisions of the Basic Design of the architecture and methodologies were planned after each round. That is, if necessary, two more deliverables, D2.2 and D2.3 (not included in the Technical Annex), have been planned to be produced at months 18 and 27.

The Development phase is central to the project, and aims to develop all the software tools and resources to produce the final Meaning outcomes.

The development is organised in three consecutive cycles involving workpackages WP3-7. This report summarizes the design for the second cycle of Meaning. Four workpackages, from WP3 to WP6, will perform the three consecutive acquisition, word sense disambiguation, uploading and porting processes, while WP7 is devoted to assessing and evaluating the tools developed, the process carried out and the resources produced.

WP3 is devoted to developing the Linguistic Processors (LPs) for each language involved in the project. The current design of LP1 is extensively described in section 3 of this document.

Figure 1 summarises the Meaning data flow. Each development cycle consists of:

[2] http://www.lsi.upc.es/~nlp/meaning/documentation/D1.1.pdf.gz
[3] http://www.lsi.upc.es/~nlp/meaning/documentation/D2.1.pdf.gz


• WP6 (WSD): Word Sense Disambiguation systems (WSD0, WSD1, WSD2) using the local wordnets and the enriched knowledge ported from the Multilingual Central Repository. WSD1 is extensively described in section 6 of this document.

• WP5 (Acquisition): Local acquisition of knowledge using specially designed tools and resources, corpora and wordnets (ACQ0, ACQ1, ACQ2). ACQ1 is extensively described in section 5 of this document.

• WP4 (Knowledge Integration): Uploading the acquired knowledge from each language into the Multilingual Central Repository and porting it to the local wordnets (PORT0, PORT1, PORT2). PORT1 is extensively described in section 4 of this document.

After each cycle, WP7 is devoted to the evaluation and assessment of the software tools and resources produced in Meaning. WP7 is revised in section 7 of this document.

Following deliverable D2.1 (Basic Design of architecture and methodologies [4]), the knowledge acquired locally is uploaded and ported across the rest of the languages via the EuroWordNet Ili, maintaining compatibility among them. Meaning will perform three consecutive processes for uploading and porting the knowledge acquired from each language to the respective local wordnets: PORT0, PORT1, PORT2. The knowledge acquired from each language during the three cycles will be consistently uploaded into the Mcr, guaranteeing the integrity of all the data produced by the project. After each Meaning cycle, all the knowledge acquired and integrated into the Mcr will then be distributed across the local wordnets.

During the first Meaning round, WSD0 and ACQ0 started simultaneously using the knowledge already present in the local wordnets. The knowledge acquired during this phase was uploaded into the Mcr and ported (PORT0) to the local wordnets. The next cycles of WSDi and ACQi start simultaneously using the knowledge acquired in the previous phase. That is, each WSDi+1 will benefit from ACQi, and each ACQi+1 from WSDi.

The first cycle consisted of independent acquisition and WSD using the knowledge available in the local wordnets (ACQ0, WSD0) and the first porting of those results (PORT0).

Now, in the second cycle of Meaning (see figure 1), ACQ1 will use the knowledge placed into the Multilingual Central Repository (Mcr0) (see Deliverable D4.1 PORT0 [5]) and the sense-tagged corpora provided by WSD0 (see Deliverable D6.1 WSD0 [6]); likewise, WSD1 will use the knowledge from the Multilingual Central Repository (Mcr0) and the analysed corpora provided by ACQ0 (see Deliverable D5.1 ACQ0 [7]).

In the third and final cycle, Meaning will have the results from a complete sequence of Acquisition and WSD: ACQ2 over the results from WSD1 (resulting from ACQ0), and WSD2 over the results from ACQ1 (resulting from WSD0), together with the corresponding PORTings.

[4] http://www.lsi.upc.es/~nlp/meaning/documentation/D2.1.pdf.gz
[5] http://www.lsi.upc.es/~nlp/meaning/documentation/D4.1.pdf.gz
[6] http://www.lsi.upc.es/~nlp/meaning/documentation/D6.1.pdf.gz
[7] http://www.lsi.upc.es/~nlp/meaning/documentation/D5.1.pdf.gz
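The bootstrapping scheme described above, in which each WSDi+1 builds on ACQi and each ACQi+1 on WSDi, can be sketched in a few lines of Python. This is an illustrative sketch only: the function name and the input labels are invented for the example and are not part of the Meaning software.

```python
# Sketch of the Meaning development-cycle dependencies described in the text.
# ACQ/WSD/PORT names follow the document; the wiring below is an assumption
# made for illustration, not project code.

def meaning_cycles(n_cycles=3):
    """Return, for each cycle i, the inputs each process consumes."""
    schedule = []
    for i in range(n_cycles):
        if i == 0:
            # First cycle: independent acquisition and WSD from local wordnets.
            acq_inputs = ["local wordnets"]
            wsd_inputs = ["local wordnets"]
        else:
            # Later cycles consume the previous Mcr release plus the corpora
            # produced by the complementary process in the previous cycle.
            acq_inputs = [f"Mcr{i-1}", f"WSD{i-1} sense-tagged corpora"]
            wsd_inputs = [f"Mcr{i-1}", f"ACQ{i-1} analysed corpora"]
        schedule.append({
            "cycle": i,
            f"ACQ{i}": acq_inputs,
            f"WSD{i}": wsd_inputs,
            f"PORT{i}": [f"ACQ{i} output uploaded to Mcr, ported to local wordnets"],
        })
    return schedule

for cycle in meaning_cycles():
    print(cycle)
```

Running the sketch for three cycles reproduces the schedule described in the text: cycle 0 starts from the local wordnets alone, while cycles 1 and 2 consume the previous Mcr release and the corpora produced by the complementary process.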


Figure 1: MEANING data flow

Major milestones and output results during this development phase concern the delivery of the software tools and resources, whose three version submissions will be at months 12, 24 and 33.

The end of this phase will coincide with the start of the validation activities on the demonstration (WP8).

The main outcomes for the second round are:

• Second release of Linguistic Processors (English, Italian, Spanish, Catalan and Basque) (reported in deliverable D3.2)

• ACQ1: Second acquisition process (reported in deliverable D5.2)

• WSD1: Second Word Sense Disambiguation process (reported in deliverable D6.2)

• PORT1: Second upload and porting processes (reported in deliverable D4.2)


3 Design of WP3 Language Processors and Infrastructure

The aim of workpackage WP3 “Linguistic Processors and Infrastructure” is to analyze the situation of each partner of the Meaning project with respect to the availability of Linguistic Processors, Lexical Resources and Corpora, and to make a plan for the improvement of the currently available resources and the development of further resources useful in the acquisition (WP5) and disambiguation (WP6) phases of the project.

Deliverable D3.1 reported the situation of each partner at the beginning of the project (month 9), providing an extensive description of all tools and resources available at the different sites for the languages involved in the project, together with detailed plans for their development in the following two phases, now scheduled at T21 (LP1) and T30 (LP2). We now analyze the situation for T21 (LP1), after the first phase of the project, focusing on the improvements of linguistic processors and resources and their use within the second cycle of acquisition (ACQ1) and word sense disambiguation (WSD1). Building on the experience gained in the first phase of the project, the development plans drawn up at the beginning have been checked and updated in order to meet any new requirements that emerged later.

On the basis of the plan originally presented in D3.1, the next sections describe the situation of each Meaning language for LP1 with respect to that plan and discuss the plans for LP2 on the basis of the experience gained up to LP0. Details concerning how tools, linguistic resources and corpora have been used by each partner will be provided in Deliverable D3.2.

3.1 Linguistic Processors

The tools available at the partners’ sites may differ across the languages considered in Meaning and deal with tasks of varying complexity, ranging from tools for basic text processing to more advanced processing.

The Linguistic Processors available for LP1 are described in Table 1. Some tools were already available at LP0 and have been improved; moreover, some new tools have been developed. It is important to note that tokenizers, morphological analyzers, and Part of Speech taggers are available for all the languages involved in the project.

3.2 Lexical Resources

The main lexical resource adopted in the Meaning project is WordNet. WordNets for different languages have been developed by the partners of the Consortium and, to maintain compatibility among them, a Multilingual Central Repository has been created (WP4). The knowledge acquired from each language will be consistently uploaded to the Multilingual Central Repository and ported over to the local wordnets involved in the project. The current figures and volumes of the local wordnets will be reported in Working Paper WP4.4


Languages covered: Basque, Catalan, English, Italian, Spanish.

Language Identifier: LangId
Tokenizer: Tokenizer, UPC, Sussex LEX-Tokenizer, CL-Tokenizer, TokenPro
Sentence Splitter: Splitter, SentencePro
Morphological Analyzer: Lemati, maco+, Sussex, FSTAN, MorphoPro
Word Aligner: KNOWA
Key Concept Extractor: KX
Chunker: Sussex (NG), Italian Chunker
NE Recognizer: Eihera, NERC, LearningPinocchio, NERD
Multiword Identifier: Sussex
PoS Tagger: Euslem, relax, Sussex, SSI Tagger, LemmaPro, TreeTagger
Parser: Zatiak, TaCat, Sussex (RASP)
Text Classifier: Sailka, KB-TC, Categorizer, ML-TC

Table 1: Linguistic processors currently available in the Meaning project. New processors with respect to LP0 are in bold.

and deliverable D4.2.
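The upload/port cycle that keeps the local wordnets compatible can be illustrated with a minimal sketch: acquired relations are keyed by Ili record, so they can be redistributed to every local wordnet. All identifiers below (the Ili record id, the synset name, the relation tuple) are invented examples; the real Mcr is a database, not a Python dictionary.

```python
# Illustrative sketch of the Mcr upload/port round trip described in the text.
# All identifiers are invented examples.

mcr = {}  # Ili record id -> set of acquired relations

def upload(ili_id, relation):
    """Upload a relation acquired from one wordnet into the Mcr."""
    mcr.setdefault(ili_id, set()).add(relation)

def port(local_wordnet):
    """Port Mcr knowledge to a local wordnet via its synset-to-Ili mapping."""
    return {synset: mcr.get(ili_id, set())
            for synset, ili_id in local_wordnet.items()}

# Knowledge acquired from one language, keyed by Ili record:
upload("ili-00001740-n", ("domain", "transport"))

# ...then ported to another local wordnet through the same record:
spanish_wn = {"vehículo%n#1": "ili-00001740-n"}
print(port(spanish_wn))
```

Because every local wordnet maps its synsets to the same flat record list, knowledge uploaded once becomes available to all languages in the same porting step.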

3.3 Corpora

Corpora play a crucial role within the Meaning project as they represent an important source of lexical information, useful both in the acquisition and in the disambiguation tasks. Deliverable D3.1 reported the corpora available at the Consortium at the beginning of the project. These corpora are progressively being automatically annotated at different levels.

Moreover, at LP1 ITC-irst will make available to the Meaning Consortium MultiSemCor, a parallel English/Italian corpus aligned at word level and morphosyntactically and semantically annotated according to WordNet senses. At present MultiSemCor is composed of 232 parallel texts (corresponding to about 464,000 tokens) and is being used for lexical acquisition (WP5) and word sense disambiguation (WP6). All the details about this corpus will be presented in Working Paper 3.2.


3.4 Current plans for LP1 and LP2

The plan originally devised for LP1 regarding the development of linguistic processors and resources has been fulfilled. The original plans for LP1 and LP2 have been analyzed on the basis of the experience gained in the first cycle of the project, and the Consortium has decided that no changes are required to meet the needs of the second cycle of acquisition (ACQ1) and word sense disambiguation (WSD1). Therefore, the plans for LP1 and LP2 remain the original ones. In the following we report the situation for each language.

3.4.1 Basque

• Language identifier: improvements done for 6 languages; English documentation added.

• Tokenizer: slight changes introduced

• Lemmatizer: improvement in the disambiguation of proper names and guessing; new version of lexicon.

• POS tagger: improvement in the disambiguation of proper names and guessing; new version of lexicon.

• Text classifier: complete evaluation.

• Multiwords recognizer: wider coverage including new units in the lexicon.

• Chunker-parser: Enhanced. Integration in the batch process including postpositions.

• NE recognizer: first version of Eihera finished. Integration in the batch process.

Other performed tasks

• Chunker-parser: development of a treebank (almost finished).

• Integration using XML: almost finished for the tokenizer, lemmatizer and POS tagger.

3.4.2 Catalan

• Morphological analyzer: improvement of the tool, lexicon extension and debugging.

• Sentence splitter: partial improvement.

• NE Recognizer: integration postponed to LP2.

• POS tagger: bootstrapping done; hand-tagged corpus almost finished.


• Text Classification system: not yet developed due to lack of training corpora.

Other performed tasks

• improvement in the speed of linguistic processors.

• morphological analyzer ported to C++.

• fast HMM tagger developed.

3.4.3 Spanish

• Sentence splitter: improvement partially done.

• NE Recognizer: integration postponed to LP2.

• Text Classification system: integration postponed to LP2.

Other performed tasks:

• improvement in the speed of linguistic processors.

• morphological analyzer ported to C++.

• fast HMM tagger developed.

3.4.4 English

• XML input/output added to RASP system.

• Multiword recognizer: data extracted and integrated.

• NE Recognizer: maximum entropy model-based, trained on CoNLL data, and integrated.

3.4.5 Italian

• NE Recognizer: porting to Italian done.

• Multiwords Recognizer: extraction of a list of 181,938 multiword expressions (77,984 bigrams and 103,954 trigrams) from the “La Repubblica” corpus (years 1999-2000, 38 million words).

• Chunker: development of a first version.


• Meaning corpus: improvement of the size of the Micro-balanced component (last version composed of 21,310,540 tokens); the whole corpus annotated up to the morphosyntactic level (96,837,555 tokens for the Macro-balanced component and 21,310,540 tokens for the Micro-balanced).

Other performed tasks:

• New Sentence Splitter and POS tagger.

• New version of WordNet Domains under construction (reorganization of the hierarchy, mapping of all the domains to the corresponding DDC codes). First release scheduled for LP2.

4 Design of WP4 (Knowledge) Integration

The initial design of WP4 for the first Meaning round was presented in deliverable D2.1 (Basic Design of architecture and methodologies [8]). This design remains valid for the second round.

4.1 The Multilingual Central Repository

Meaning designed the Multilingual Central Repository (Mcr) to act as a multilingual interface for integrating and distributing all the knowledge acquired in the project [Atserias et al., 2004]. The Mcr follows the model proposed by the EuroWordNet project (ewn): a multilingual lexical database with wordnets for several European languages.

In summary, the ewn architecture included the Inter-Lingual-Index (Ili), a Domain Ontology (Do) and a Top Concept Ontology (Tco) [Vossen, 1998]. The Ili consists of a flat list of records that interconnect synsets across wordnets. During the ewn project, around 1,000 Ili-records were selected as Base Concepts and connected to the Tco.

Using the Ili, the wordnets in the Mcr are interconnected, so that it is possible to go from word meanings in one language or wordnet to their equivalents in other languages or wordnets.

The main purpose of the Tco is to provide a common ontological framework for all the wordnets. It consists of 63 basic semantic distinctions that classify the set of Ili-records connected to the WordNet Base Concepts (Bc), which represent the most important concepts in the different wordnets.
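As an illustration, the Ili lookup described above can be sketched as a pair of tables. This is a minimal sketch, not the real Mcr schema; the record id and synset names below reuse the vaso example from section 4.6 for concreteness:

```python
# Minimal sketch of ILI-based interlingual lookup (illustrative data,
# not the real MCR schema).

# Each local wordnet maps its own synsets to an ILI record (here keyed by a
# WordNet 1.6 offset-pos string); the ILI acts as the shared pivot.
local_to_ili = {
    ("es", "vaso_1"): "02755829-n",
    ("en", "drinking_glass"): "02755829-n",
    ("it", "bicchiere_1"): "02755829-n",
}

# Invert the table per ILI record to find equivalents in other wordnets.
ili_to_locals = {}
for (lang, synset), ili in local_to_ili.items():
    ili_to_locals.setdefault(ili, []).append((lang, synset))

def equivalents(lang, synset):
    """Return the synsets in other wordnets linked through the same ILI record."""
    ili = local_to_ili[(lang, synset)]
    return [(l, s) for (l, s) in ili_to_locals[ili] if l != lang]

print(equivalents("es", "vaso_1"))
# [('en', 'drinking_glass'), ('it', 'bicchiere_1')]
```

The pivot design means each wordnet only maintains links to the Ili, not to every other wordnet.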

The current version of the Mcr uses Princeton WordNet 1.6 as Ili. Initially, most of the knowledge uploaded into the first version of the Mcr was derived from WordNet 1.6 (i.e. automatic selectional preferences acquired from SemCor and the BNC); the Italian WordNet and the MultiWordNet Domains, both developed at IRST, also use WordNet

[8] http://www.lsi.upc.es/~nlp/meaning/documentation/D2.1.pdf.gz


1.6 as Ili [Bentivogli et al., 2002; Magnini and Cavaglia, 2000]. This option tried to minimise side effects with other European initiatives (Balkanet, EuroTerm, etc.) and wordnet developments around the Global WordNet Association. However, the Ili for the Spanish, Catalan and Basque wordnets was WordNet 1.5 [Atserias et al., 1997; Benítez et al., 1998], as was the EuroWordNet Tco and the associated Base Concepts. In order to maintain compatibility between wordnets of different languages and versions, past and new, the Mcr uses technology for the automatic alignment of different large-scale and complex semantic networks [Daude et al., 1999; Daude et al., 2000; Daude et al., 2001]. In that way, the Ili has been connected to WordNet versions 1.6, 1.7.1 and 2.0.

The current Mcr software includes system modules for:

• Uploading the data acquired from one language to the Mcr.

• Porting the knowledge stored in the Mcr to the local wordnets.

• Checking the integrity of the data stored in the Mcr.

In that way, the Mcr produced by Meaning will constitute a natural multilingual large-scale linguistic resource for a number of semantic processes that need large amounts of linguistic knowledge to be effective tools (e.g. Web ontologies). The fact that word senses will be linked to concepts in the Mcr will allow for the appropriate representation and storage of the acquired knowledge.

4.2 Content of Mcr0

After PORT0, the first porting process performed in the first cycle, the Mcr included the following large-scale resources:

• ILI

– Aligned to WordNet 1.6 [Fellbaum, 1998]

– EuroWordNet Base Concepts [Vossen, 1998]

– EuroWordNet Top Concept Ontology [Vossen, 1998]

– MultiwordNet WordNet Domains version 070501 [Magnini and Cavaglia, 2000]

• Local wordnets

– English WordNet 1.5, 1.6, 1.7.1 [Fellbaum, 1998]

– Basque wordnet [Agirre et al., 2002]

– Italian wordnet [Bentivogli et al., 2002]

– Catalan wordnet [Benítez et al., 1998]

– Spanish wordnet [Atserias et al., 1997]


• Large collections of semantic preferences

– Acquired from SemCor [Agirre and Martinez, 2001; Agirre and Martinez, 2002]

– Acquired from BNC [McCarthy, 2001]

• Instances

– Named Entities [Alfonseca and Manandhar, 2002]

See deliverable D2.1 (Basic Design of architecture and methodologies [9]) for an extended summary of each of these components.

4.3 Content of the Mcr1

For the second release of the Mcr, we plan to upload several new large-scale semantic resources:

• Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]

• eXtended WordNet [Mihalcea and Moldovan, 2001]

• WordNet 2.0 [Fellbaum, 1998]

• Improved Selectional Preferences acquired from BNC [McCarthy, 2001]

• Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]

• Named Entities from Sumo [Niles and Pease, 2001]

• Named Entities from MultiWordNet [Bentivogli et al., 2002]

Working Paper WP4.4 (UPLOAD1) will provide further analysis and details of all these resources after they have been uploaded into the Mcr.

After PORT1, the final content of Mcr1 should include:

• ILI

– Aligned to WordNet 1.6 [Fellbaum, 1998]

– EuroWordNet Base Concepts [Vossen, 1998]

– EuroWordNet Top Concept Ontology [Vossen, 1998]

– MultiwordNet WordNet Domains version 070501 [Magnini and Cavaglia, 2000]

– Suggested Upper Merged Ontology (Sumo) [Niles and Pease, 2001]

• Local wordnets

[9] http://www.lsi.upc.es/~nlp/meaning/documentation/D2.1.pdf.gz


– English WordNet 1.5, 1.6, 1.7, 1.7.1, 2.0 [Fellbaum, 1998]

– eXtended WordNet [Mihalcea and Moldovan, 2001]

– Basque wordnet [Agirre et al., 2002]

– Italian wordnet [Bentivogli et al., 2002]

– Catalan wordnet [Benítez et al., 1998]

– Spanish wordnet [Atserias et al., 1997]

• Large collections of semantic preferences

– Direct dependencies from Parsed SemCor [Agirre and Martinez, 2001]

– Acquired from SemCor [Agirre and Martinez, 2001; Agirre and Martinez, 2002]

– Acquired from BNC (2nd release) [McCarthy, 2001]

• Instances

– Named Entities [Alfonseca and Manandhar, 2002]

– Named Entities [Niles and Pease, 2001]

– Named Entities [Bentivogli et al., 2002]

In the next subsections, we provide a short description of each new semantic resource (for those resources not mentioned here, please consult deliverable D2.1, Basic Design of architecture and methodologies [10]).

4.3.1 The Meaning Inter–Lingual–Index

For Mcr1, Meaning will continue using WordNet 1.6 as Ili, because most of the knowledge currently uploaded into the Mcr uses that version of WordNet. However, we may consider upgrading the Ili version in Mcr2: the current version of WordNet is now 2.0, and the Balkanet project uses WordNet 1.7 as Ili.

In order to upload all the new semantic resources consistently into the Mcr, we will need to produce new mappings between WordNet versions [Daude et al., 2003a; Daude et al., 2003b]. In particular, uploading WordNet 2.0 requires a direct mapping from version 1.6 to 2.0; the same applies to the eXtended WordNet, for which we will need a new direct mapping between WordNet 1.6 and WordNet 1.7.
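Since these mappings are essentially synset-to-synset tables, a direct 1.6-to-2.0 mapping can be obtained by composing intermediate tables. The sketch below uses invented offsets; the real alignment tables are many-to-many and carry confidence scores:

```python
# Toy sketch of composing WordNet version mappings. The offsets are invented
# for illustration; real mappings are many-to-many with confidence scores.

map_16_to_17 = {"02755829-n": ["02876657-n"], "04195626-n": ["04401088-n"]}
map_17_to_20 = {"02876657-n": ["02948072-n"], "04401088-n": ["04468005-n"]}

def compose(m1, m2):
    """Chain two synset mappings, keeping every reachable target."""
    out = {}
    for src, mids in m1.items():
        targets = []
        for mid in mids:
            targets.extend(m2.get(mid, []))
        if targets:
            out[src] = sorted(set(targets))
    return out

map_16_to_20 = compose(map_16_to_17, map_17_to_20)
```

A synset with no surviving target (e.g. one removed between versions) simply drops out of the composed table, which is where manual revision of the alignment becomes necessary.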

[10] http://www.lsi.upc.es/~nlp/meaning/documentation/D2.1.pdf.gz


4.3.2 The EuroWordNet Base Concepts

The EuroWordNet Base Concepts were selected manually to cover the most important concepts of the languages involved in the project [Vossen, 1998]. The main characteristic of the Base Concepts is their importance in the wordnets. According to our pragmatic point of view, a concept is important if it is widely used, either directly or as a reference for other widely used concepts. Importance is thus reflected in the ability of a concept to function as an anchor to attach other concepts or properties. This anchoring capability was defined in terms of three operational criteria that can be automatically applied to the available resources:

• the number of relations (general or limited to hyponymy);

• a high position of the concept in the hierarchy;

• being widely used across several languages.

The procedure for selecting the EuroWordNet Base Concepts and the Top Ontology is discussed in [Vossen, 1998]. The final set of common Base Concepts totalled 1,030 WordNet 1.5 synsets.

In the first Meaning cycle, the Base Concepts from Wn1.5 were mapped to Wn1.6. After a manual revision and expansion to all Wn1.6 top beginners, the resulting Bc for Wn1.6 totalled 1,601 Ili-records. In that way, the new version of the Bc covers the complete hierarchy of Ili-records.

However, the definition of Base Concepts in EuroWordNet could not use the sense frequency information currently available in the Princeton WordNet [11].

For future rounds, we suggest devising a simple and fully automatic method to derive the Base Concepts from the information contained in the Mcr, following the three operational criteria defined above.

The resulting new Base Concepts can then be used to consistently attach new ontological properties defined in the Top Concept Ontology.
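As a sketch of such an automatic method, the three operational criteria can be combined into a single score. The scoring function, relation counts, depths and language-coverage figures below are all invented for illustration:

```python
# Hypothetical scorer for automatic Base Concept selection, following the
# three operational criteria: many relations, high position in the hierarchy,
# and wide use across languages. All numbers below are invented.

def base_concept_score(n_relations, depth, n_languages, max_depth=16):
    # More relations and more languages raise the score;
    # greater depth (lower position) lowers it.
    position = (max_depth - depth) / max_depth  # 1.0 at the top of the hierarchy
    return n_relations * position * n_languages

candidates = {
    "entity": (120, 0, 5),   # (relations, depth, languages using the concept)
    "food":   (45, 4, 5),
    "lentil": (2, 9, 2),
}
ranked = sorted(candidates, key=lambda c: base_concept_score(*candidates[c]),
                reverse=True)
print(ranked)  # most "anchor-like" concepts first
```

The multiplicative combination is only one possible choice; a real method would calibrate the weighting of the three criteria against the manually selected 1,601 Bc as a gold standard.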

4.3.3 The EuroWordNet Top Concept Ontology

The EuroWordNet project only performed a complete validation of the consistency of the Top Concept Ontology at the Base Concept level.

Assuming (as the builders of Sumo and WordNet Domains have done) that the ontological properties have been correctly assigned to particular synsets, and that WordNet defines coherent ontological subsumption chains across taxonomies, an automatic process can consistently inherit all the properties through the whole hierarchy of WordNet, regardless of the ontology they come from.

In Meaning we have performed an automatic expansion of the Top Concept Ontology properties assigned to the Base Concepts. That is, we enriched the complete Ili structure

[11] WordNet has contained sense frequency information derived from SemCor since version 1.6.


with features coming from the Bc by inheriting the Top Concept features along the hyponymy relation.

This way, once properties are exported to the Ili and inherited through the whole WordNet hierarchy, every concept in a wordnet ends up with a set of semantic features, as in the following example.

lentil_1
   WD: gastronomy
   SF: food
   SUMO: FruitOrVegetable
   TCO: Comestible; Plant

In order to keep the inheritance process consistent, we used the following basic incompatibilities among Top Concept Ontology properties (further propagated to their daughter concepts), which were defined in the EuroWordNet project:

• substance - object

• plant - animal - human - creature

• natural - artifact

• solid - liquid - gas

As the classification of WordNet is not always consistent with the Top Concept Ontology, these incompatibilities impeded a fully automatic top-down propagation of the Top Concept Ontology properties. The resulting semi-automatic process left a number of synsets with non-compatible information. Specifically:

• Sticking to the Top Concept Ontology and according to the set of incompatibilities, some Top Concept Ontology properties assigned by hand appeared to be incompatible with (a) inherited information, (b) information assigned via equivalence to the Semantic Files (Lexicographical Files from WordNet), and/or (c) other Top Concept Ontology properties assigned by hand.

• Top Concept Ontology properties, either original or inherited, may be incompatible with other ontologies currently uploaded into the Mcr.
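The propagation-with-checking step can be sketched as follows. The hierarchy and feature assignments below are toy data (with one conflict planted deliberately), not the real Mcr contents:

```python
# Sketch of top-down inheritance of Top Concept Ontology features through
# hyponymy, flagging the basic EuroWordNet incompatibilities. Toy data only.

INCOMPATIBLE = [
    {"Substance", "Object"},
    {"Plant", "Animal", "Human", "Creature"},
    {"Natural", "Artifact"},
    {"Solid", "Liquid", "Gas"},
]

hyponyms = {"entity": ["object", "substance"], "object": ["artifact_1"]}
own_features = {
    "object": {"Object"},
    "substance": {"Substance"},
    "artifact_1": {"Artifact", "Substance"},  # planted conflict with inherited Object
}

def propagate(synset, inherited=frozenset(), conflicts=None):
    """Inherit features downward; record synsets with incompatible feature sets."""
    if conflicts is None:
        conflicts = []
    features = set(inherited) | own_features.get(synset, set())
    for group in INCOMPATIBLE:
        if len(features & group) > 1:
            conflicts.append((synset, features & group))
    for child in hyponyms.get(synset, []):
        propagate(child, frozenset(features), conflicts)
    return conflicts

print(propagate("entity"))
# reports artifact_1 as carrying the incompatible Substance/Object pair
```

In the real semi-automatic process, each flagged synset must then be inspected to decide which of the error sources listed below is responsible.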

By examining a subset of synsets, we identified at least the following main sources of errors:

• Erroneous hand-made Top Concept Ontology mappings

• Erroneous statements of equivalence between Top Concept Ontology properties and Semantic Files

• Erroneous ISA links in WordNet, which cause erroneous inheritance


• Multiple inheritance within WordNet, which can cause incompatibilities in the inheritance of properties [Guarino and Welty, 2000]

The following example shows incompatible information: a 3rdOrderEntity cannot coexist with properties only attributable to Events:

00660718 process_1
   MWD: factotum
   WN16SF: act
   SUMO: IntentionalProcess
   EWNTO: 3rdOrderEntity; Cause; Mental; Purpose

In future Meaning rounds, we suggest researching a new and consistent Top Concept Ontology for the Mcr.

4.3.4 Suggested Upper Merged Ontology (Sumo)

Sumo [12] [Niles and Pease, 2001] is being created as part of the IEEE Standard Upper Ontology Working Group. The goal of this Working Group is to develop a standard upper ontology that will promote data interoperability, information search and retrieval, automated inference, and natural language processing. There is a complete set of mappings from WordNet 1.6 synsets to Sumo: nouns, verbs, adjectives, and adverbs.

Sumo consists of a set of concepts, relations, and axioms that formalize an upper ontology. An upper ontology is limited to concepts that are meta, generic, abstract or philosophical, and hence general enough to address (at a high level) a broad range of domain areas. Concepts specific to particular domains are not included in the upper ontology, but such an ontology does provide a structure upon which ontologies for specific domains (e.g. medicine, finance, engineering) can be constructed.

The current version of Sumo consists of 1,019 terms (all of them connected to WordNet 1.6 synsets), 4,181 axioms and 822 rules.

We think that further investigation is needed to compare Sumo and the EuroWordNet Top Ontology. For instance, the typology of processes in Sumo was inspired by Beth Levin's well-received work "Verb Classes and Alternations". Among other things, this work attempts to classify over 3,000 English verbs into 48 "semantically coherent verb classes". Some of the verb classes relate to static predicates in the ontology rather than to processes, and some classes are syntactically motivated, e.g. the class of verbs that take predicative complements.

Further, the ontology also formally defines its types and the relations between them in the form of axioms.

4.3.5 eXtended WordNet

In the eXtended WordNet [13] [Mihalcea and Moldovan, 2001], the WordNet glosses are syntactically parsed, transformed into logic forms, and the content words are semantically

[12] http://ontology.teknowledge.com/
[13] http://xwn.hlt.utdallas.edu/


disambiguated. The key idea of the eXtended WordNet project is to exploit the rich information contained in the definitional glosses, which is now used primarily by humans to identify correctly the meaning of words. In the first released version of the eXtended WordNet, XWN 0.1, the glosses of WordNet 1.7 are parsed, transformed into logic forms, and the senses of the words are disambiguated. Since they are derived from an automatic process, the disambiguated words in the glosses carry a confidence label indicating the quality of the annotation (gold, silver or normal).

In order to upload the eXtended WordNet coherently into the Mcr, we also need to upload WordNet 1.7 (not included in Mcr0) and provide a new mapping between WordNet 1.6 and WordNet 1.7 [14].

We think that further investigation is also needed with respect to these resources, for instance, trying to automatically derive disambiguated semantic relations between synset glosses [Gangemi et al., 2003].

4.3.6 WordNet 2.0

WordNet 2.0 includes more than 42,000 new links between nouns and verbs that are morphologically related, a topical organization for many areas that classifies synsets by category, region, or usage, gloss and synset fixes, and new terminology, mostly in the terrorism domain.

In this version, the Princeton team has added links for derivational morphology between nouns and verbs.

Some synsets have also been organized into topical domains. Domains are always noun synsets; however, synsets from every syntactic category may be classified. Each domain is further classified as a CATEGORY, REGION, or USAGE.

4.3.7 Improved Selectional Preferences acquired from BNC

The Tree Cut Models (tcms) that we used in the first round of Meaning acquisition for learning selectional preferences from unannotated text often suffered from an overly high level of generalisation, that is, classes very high in the WordNet hierarchy are used to represent the preferences. Prototypical classes, such as food as the direct object of eat, are sometimes hidden by a selectional preference for a class further up the hyponymy hierarchy, such as entity. This is partly because of the polysemy of the training data and partly because the tcm method, using the minimum description length principle, covers all the data rather than looking for prototypical classes. We are investigating three possibilities for acquiring more specific, accurate and intuitive models.
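One standard alternative to an MDL tree cut is to score candidate classes by a Resnik-style selectional association, which rewards a large lift over a class's prior probability and so surfaces prototypical classes such as food. This is not the tcm method itself, and the probabilities below are invented for illustration:

```python
# Toy sketch of scoring candidate classes for the direct object of "eat".
# Selectional association (cf. Resnik) favours prototypical classes such as
# food over very general ones such as entity. Probabilities are invented.

import math

prior = {"entity": 0.50, "physical_object": 0.30, "food": 0.02}  # p(c)
cond  = {"entity": 0.55, "physical_object": 0.35, "food": 0.30}  # p(c | obj-of eat)

def association(c):
    # Contribution of class c to the KL divergence between the two distributions:
    # high when the class is much more likely as an object of "eat" than in general.
    return cond[c] * math.log(cond[c] / prior[c])

best = max(prior, key=association)
print(best)  # "food": large lift over its prior despite low absolute frequency
```

The contrast with the tcm behaviour is visible in the numbers: entity has the highest conditional probability, yet food wins because its probability as an object of eat is fifteen times its prior.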

4.3.8 Parsed SemCor

SemCor will also be parsed, using a new version of Minipar [Lin, 1998]. All the syntactic dependencies between head synsets will be captured and uploaded into Mcr1. This resource will allow direct comparisons between word instances and the generalised Selectional Preferences (captured from SemCor and the BNC).

[14] http://www.lsi.upc.es/~nlp/tools/mapping.html

Unlike the first upload, the syntactic dependencies captured by this process will not be generalised.

4.3.9 Instances

The Mcr1 will contain three different sources of instances:

• Named Entities from the work of [Alfonseca and Manandhar, 2002]

• Named Entities from Sumo [Niles and Pease, 2001]

• Named Entities from MultiWordNet [Bentivogli et al., 2002]

For future versions of the Mcr, we suggest providing a new ontology of Named Entities to support and cover the formal criteria followed by the three approaches. This would be very useful when comparing Named Entities derived using different language processors.

4.4 Initial plans for Mcr2

VerbNet [15] is a verb lexicon with syntactic and semantic information for English verbs, using Levin verb classes to systematically construct lexical entries. For each syntactic frame in a verb class, there is an associated set of semantic predicates. Many of these semantic components are cross-linguistic: the lexical items in each language form natural groupings based on the presence or absence of semantic components and the ability to occur or not within particular syntactic frames. The English entries are mapped directly onto English WordNet senses. We hope that this new resource will provide further structure and consistency to the automatically acquired selectional preferences.

For Mcr2 (the last version of the Mcr) we also plan to provide an Ontology for Named Entities and an automatically constructed set of Base Concepts. The consortium will also request the new set of Base Concepts developed for WN1.7 by Balkanet.

In ACQ1 we plan to obtain a large set of sense examples acquired automatically from the web. These examples will be obtained by querying Google: for each word sense in WordNet, a program builds a complex query including sets of monosemous synonymous relatives. Using this approach, large collections of text can be obtained, representing hundreds of examples per word sense. Using this large-scale resource we plan to generate topic signatures for every word sense in WordNet, which will be uploaded into Mcr2.
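A minimal sketch of the two steps, building a query from monosemous relatives and turning the retrieved snippets into a topic signature, is shown below. The query format, relatives, snippets and stopword list are all invented; the real system queries Google:

```python
# Sketch of the monosemous-relatives approach: build a web query for one word
# sense and turn retrieved snippets into a topic signature (word weights).
# Query syntax, relatives and snippets are invented for illustration.

from collections import Counter

def build_query(target, monosemous_relatives):
    # Retrieve contexts for one sense of the target word via unambiguous synonyms.
    return "{} AND ({})".format(target, " OR ".join(monosemous_relatives))

def topic_signature(snippets, stopwords=frozenset({"the", "a", "of", "in"})):
    # Relative frequency of content words over all retrieved snippets.
    counts = Counter(w for s in snippets for w in s.lower().split()
                     if w not in stopwords)
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

q = build_query("glass", ["drinking_glass", "tumbler"])
sig = topic_signature(["a glass of water", "drink water in the glass"])
```

In practice, the signature would also be weighted against signatures of the word's other senses (e.g. by chi-square or tf-idf), so that only sense-discriminating words survive.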

The consortium also plans to upload a new parsed version of SemCor. The corpus will be parsed with a new version of RASP able to process XML files.

We will also study how to port the last version of the Mcr to other formats, including DAML+OIL, RDF, etc., which will be of use to the Semantic Web community.

[15] http://www.cis.upenn.edu/old/verbnet/home.html


4.5 Cross-checking

We think that further investigation is also needed with respect to the resources currently uploaded into the Mcr, for instance, trying to automatically derive disambiguated relations between synset glosses. In fact, after uploading all these new resources to the Mcr, a new set of complex integration and porting processes must be studied. This will allow the design of sophisticated strategies and metarules for subsequent portings.

Obviously, by integrating all these large-scale semantic resources into a single platform, complete cross-checking research can be performed. For instance, we can improve both the Sumo labels and the WordNet Domains by simply merging and comparing them.

Synset     Word        SUMO       Domain
00536235n  blow        Breathing  anatomy
00005052v  blow        Breathing  medicine
00003142v  exhale      Breathing  medicine
00899001a  exhaled     Breathing  factotum
00263355a  exhaling    Breathing  factotum
00536039n  expiration  Breathing  anatomy
02849508a  expiratory  Breathing  anatomy
00003142v  expire      Breathing  medicine
02579534a  inhalant    Breathing  anatomy
00536863n  inhalation  Breathing  anatomy
00003763v  inhale      Breathing  medicine
00898664a  inhaled     Breathing  factotum
00263512a  inhaling    Breathing  factotum
00537041n  pant        Breathing  anatomy
00004002v  pant        Breathing  medicine
00535106n  panting     Breathing  anatomy
00264603a  panting     Breathing  factotum
00411482r  pantingly   Breathing  factotum

Table 2: Sumo vs. Domain labels

To illustrate how we can detect errors and inconsistencies between different types of knowledge, the example in Table 2 shows that, systematically, the nouns corresponding to the Sumo process Breathing have been labelled with the ANATOMY domain, some verbs with MEDICINE, and some adjectives with FACTOTUM, when in fact all these senses correspond to different parts of speech of the same Breathing concept.
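A minimal sketch of this cross-check over the rows of Table 2 (abridged here): group senses by Sumo label and flag labels that map to several WordNet Domains:

```python
# Sketch of the cross-check behind Table 2: group senses by SUMO label and
# flag cases where one concept carries several WordNet Domains across POS.
# Rows are taken from Table 2 (abridged).

from collections import defaultdict

rows = [  # (synset, word, sumo, domain)
    ("00536235n", "blow", "Breathing", "anatomy"),
    ("00005052v", "blow", "Breathing", "medicine"),
    ("00899001a", "exhaled", "Breathing", "factotum"),
]

by_sumo = defaultdict(set)
for _, _, sumo, domain in rows:
    by_sumo[sumo].add(domain)

# SUMO labels attached to more than one domain are candidate inconsistencies.
inconsistent = {s: d for s, d in by_sumo.items() if len(d) > 1}
print(inconsistent)
```

Whether a flagged case is an error or a legitimate POS-dependent distinction still requires manual inspection, as the discussion of Figure 2 shows.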

In order to illustrate the kind of problems we need to face when merging all these semantic resources into a single, common platform, consider the example shown in Figure 2. The act playing#1, which is a kind of musical performance#1, is connected by derivational morphological relations to three senses of the verb play. The verb play#3 is connected by a domain relation to the noun music#1, and the verb play#7 is connected


to music#1 and music#3. However, play#6, also related to the musical domain, is not connected by a domain relation to either music#1 or music#3. All three senses of the verb play have the WN Domain MUSIC label and the Sumo music label; however, each verb sense of play behaves differently in the category relations it is assigned. Should the noun playing also be connected by a category relation to both music#1 and music#3? Should this connection be made explicit? Regarding WN Domain labels, why do the musical senses of the verb play and the noun music not also have the FREE_TIME label, as the noun playing does? With respect to Sumo, why do they have different types? Furthermore, the eXtended WordNet, being the result of an automatic process, also contains wrong disambiguations (play#4 belonging to the THEATRE domain). We think that having all these different sources of knowledge uploaded and integrated into the same framework will allow us to systematically improve all these misleading inconsistencies.

The content of Figure 2, flattened here from the original diagram, is as follows:

00093905n playing: the act of playing a musical instrument
   DOMAIN: free_time music; SUMO: &%RecreationOrExercise+

00092967n musical performance: the act of performing music
   DOMAIN: free_time music; SUMO: &%RecreationOrExercise+

01675975v play#3: play on an instrument; "The band played all night long"
   DOMAIN: music; SUMO: &%music+

01677078v play#7: perform music on a musical instrument; "He plays the flute"; "Can you play on this old recorder?"
   DOMAIN: music; SUMO: &%music+

01675975v play#6 spiel#1: re-play (as a melody); "Play it again, Sam"; "She played the third movement very beautifully"
   DOMAIN: music; SUMO: &%music+

06591368n music#1: an artistic form of auditory communication incorporating instrumental or vocal tones in a structured and continuous manner
   DOMAIN: music; SUMO: &%music+

00515842n music#3: musical activity (singing or whistling etc.); "his music was his central interest"
   DOMAIN: music; SUMO: &%music+

01670298v act#3 play#4 represent#10: play a role or part; "Gielgud played Hamlet"
   DOMAIN: theatre; SUMO: &%Pretender+

RELATED-TO links connect playing to the verb senses of play; CATEGORY links connect the verb senses to the music synsets; a GLOSS link marks the disambiguation of a gloss word as play#4.

Figure 2: Example of the noun playing

4.6 Mcr example

When all this knowledge is coherently uploaded into the Mcr, a full range of new possibilities appears for improving both Acquisition and WSD. We will illustrate these new capabilities with a simple example: the Spanish noun vaso has three possible senses.


The first one is connected to the same Ili record as the English synset <drinking_glass, glass>. This Ili record, belonging to the Semantic File ARTIFACT, has no specific WordNet Domain (FACTOTUM). However, the EuroWordNet Top Ontology provides further clues about its meaning: it has the properties Form-Object, Origin-Artifact, Function-Container and Function-Instrument. The Sumo type for this synset is also Artifact. Valuable information also comes from the disambiguated glosses included in the eXtended WordNet. This gloss has two 'silver' words [16] (glass, container) and three 'normal' words (the rest). For instance, hold#VBG#8 corresponds to "contain or hold; have within: 'The jar carries wine'; 'The canteen holds fresh water'; 'This can contains water'". Further, from the Selectional Preferences acquired from SemCor, we know that the typical things somebody does with this kind of vaso are, for instance, the Spanish equivalents of <polish, shine, smooth, smoothen> or <beautify, embellish, prettify>. WordNet 2.0 also provides a new morphological derivational relation: to glass#v#4. Finally, we must add that this also holds for the rest of the connected languages.

vaso_1 02755829-n

SF: 06-NOUN.ARTIFACT

DOMAIN: FACTOTUM

SUMO: &%Artifact+

TO: 1stOrderEntity-Form-Object

TO: 1stOrderEntity-Origin-Artifact

TO: 1stOrderEntity-Function-Container

TO: 1stOrderEntity-Function-Instrument

EN: drinking_glass glass

IT: bicchiere

BA: edontzi baso edalontzi

CA: got vas

02755829-n drinking_glass glass:

GLOSS: a glass container for holding liquids while drinking

eXtended WordNet:

GLOSS: a glass#NN#2 container#NN#1 for hold#VBG#8 liquid#NNS#1 while drink#VBG#1

DOBJ SemCor

02755829 00849393 0.0074 polish shine smooth smoothen

[16] High confidence.


02755829 00201878 0.0013 beautify embellish prettify

02755829 00826635 0.0010 get_hold_of take

02755829 00140937 0.0001 ameliorate amend better improve meliorate

02755829 00083947 0.0000 alter change

WN2.0

RELATED TO: glass#v#4 (put in a glass container)

The second sense of vaso is the equivalent translation of <vessel, vas>. This Ili record, belonging to the Semantic File BODY, is assigned a different WordNet Domain (ANATOMY). The EuroWordNet Top Ontology in this case provides the properties Form-Substance-Solid, Origin-Natural-Living, Composition-Part and Function-Container. The Sumo label provides the properties and axioms assigned to BodyVessel. This gloss has two 'gold' words [17] (tube and circulate) and one 'silver' word (body fluid); the last word is monosemous. From the Selectional Preferences acquired from SemCor, we know that the typical events applied to this kind of vaso are, for instance, the Spanish equivalents of <inject, shoot> or <administer, dispense>. In this case, there are no new relations coming from WordNet 2.0. As before, this knowledge can also be ported to the rest of the connected languages.

vaso_2 04195626-n

SF: 08-NOUN.BODY

DOMAIN: ANATOMY

SUMO: &%BodyVessel+

TO: 1stOrderEntity-Form-Substance-Solid

TO: 1stOrderEntity-Origin-Natural-Living

TO: 1stOrderEntity-Composition-Part

TO: 1stOrderEntity-Function-Container

EN: vessel vas

IT: vaso dotto canale

BA: hodi baso

CA: vas

04195626-n vessel vas:

GLOSS: a tube in which a body fluid circulates

eXtended WordNet:

GLOSS: a tube#NN#4 in which a body_fluid#NN#1 circulate#VBZ#4

[17] Hand corrected.


DOBJ SemCor

04195626 01781222 0.0334 be occur

04195626 00058757 0.0072 inject shoot

04195626 01357963 0.0068 follow travel_along

04195626 00055849 0.0045 administer dispense

04195626 01012352 0.0022 block close_up impede jam obstruct occlude

04195626 00054862 0.0021 care_for treat

04195626 01670590 0.0017 hinder impede

04195626 00401762 0.0011 cognize know

04195626 01253107 0.0005 go locomote move travel

04195626 01669882 0.0003 keep prevent

SUBJ SemCor

04195626 01831830 0.0133 stop terminate

04195626 01357963 0.0127 follow travel_along

04195626 01830886 0.0043 discontinue

04195626 01779664 0.0008 cease end finish terminate

04195626 01832078 0.0003 continue go_along go_on keep keep_on proceed

04195626 01253107 0.0002 go locomote move travel

04195626 01520167 0.0002 transfer

04195626 01505951 0.0002 give

04195626 01590833 0.0002 furnish provide render supply

04195626 01612822 0.0001 act move

04195626 01775973 0.0000 be
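For concreteness, each DOBJ/SUBJ row above can be read as (noun synset offset, verb synset offset, association weight, verb variants). A minimal Python sketch of a parser for this listing format follows; the field interpretation is our reading of the rows shown here, not an official MCR format specification:

```python
from typing import List, NamedTuple

class Preference(NamedTuple):
    noun_synset: str   # WordNet 1.6 offset of the noun sense (here, vaso)
    verb_synset: str   # WordNet 1.6 offset of the predicate synset
    weight: float      # association strength estimated from SemCor
    lemmas: List[str]  # variant lemma forms of the verb synset

def parse_preference_line(line: str) -> Preference:
    """Parse one row such as '04195626 00058757 0.0072 inject shoot'."""
    fields = line.split()
    return Preference(fields[0], fields[1], float(fields[2]), fields[3:])

rows = [
    "04195626 00058757 0.0072 inject shoot",
    "04195626 00055849 0.0045 administer dispense",
]
prefs = [parse_preference_line(r) for r in rows]
```

Sorting such records by weight gives the ranked list of typical events shown in the tables.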

The last sense of vaso is the equivalent translation of <glassful, glass>. This ILI record belongs to the Semantic File QUANTITY and has been assigned a different WordNet Domain (FACTOTUM-NUMBER). In this case, the EuroWordNet Top Ontology provides the properties Composition-Part, SituationType-Static and SituationComponent-Quantity. The Sumo label provides the properties and axioms assigned to ConstantQuantity. This gloss has only one 'silver' word from the eXtended WordNet (quantity); the other two have the label 'normal'. From the selectional preferences acquired from SemCor, we know that the typical events applied to this kind of vaso include, for instance, the corresponding equivalent translations to Spanish of <drink, imbibe> or <consume, have, ingest, take, take_in>. In this case, WordNet 2.0 provides no further relations. As before, we must add that this knowledge can also be ported to the rest of the connected languages.

vaso_3 09914390-n


SF: 23-NOUN.QUANTITY

DOMAIN: NUMBER

SUMO: &%ConstantQuantity+

TO: 1stOrderEntity-Composition-Part

TO: 2ndOrderEntity-SituationType-Static

TO: 2ndOrderEntity-SituationComponent-Quantity

EN: glassful glass

IT: bicchierata bicchiere

BA: basocada

CA: got vas

09914390-n glassful glass:

GLOSS: the quantity a glass will hold

eXtended WordNet:

GLOSS: the quantity#NN#1 a glass#NN#2 will hold#VB#1

OBJ SemCor

09914390 00795711 0.0026 drink imbibe

09914390 01530096 0.0009 accept have take

09914390 00786286 0.0009 consume have ingest take take_in

09914390 01513874 0.0001 acquire get

As we can see, we can consistently add a large set of explicit knowledge about each sense of vaso that can be used to better differentiate and characterize their particular meanings. We expect to devise appropriate ways to exploit this unique resource in the next rounds.
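As a minimal sketch, the enriched entries listed above (vaso_2, vaso_3) can be represented as records such as the following; the class and field names are illustrative, not the actual MCR schema:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class SenseRecord:
    """One enriched sense entry, mirroring the vaso listings above."""
    variant: str                        # e.g. "vaso_2"
    ili: str                            # ILI record, e.g. "04195626-n"
    semantic_file: str                  # e.g. "08-NOUN.BODY"
    domain: str                         # WordNet Domain, e.g. "ANATOMY"
    sumo: str                           # Sumo label, e.g. "BodyVessel"
    top_ontology: List[str]             # EuroWordNet Top Ontology features
    translations: Dict[str, List[str]]  # language code -> variants
    gloss: str = ""

vaso_2 = SenseRecord(
    variant="vaso_2", ili="04195626-n",
    semantic_file="08-NOUN.BODY", domain="ANATOMY", sumo="BodyVessel",
    top_ontology=["Form-Substance-Solid", "Origin-Natural-Living",
                  "Composition-Part", "Function-Container"],
    translations={"EN": ["vessel", "vas"], "IT": ["vaso", "dotto", "canale"],
                  "BA": ["hodi", "baso"], "CA": ["vas"]},
    gloss="a tube in which a body fluid circulates",
)
```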

5 Design of WP5 Acquisition

The initial design of WP5 in the first Meaning cycle comprised five experiments covering the proposed ACQ0 topics and some of ACQ1. Most of the experiments gave promising results. For this new round, the consortium has decided to continue all the experiments, and to add new ones to cover the topics originally envisaged for ACQ1 and ACQ2.

The ACQ1 phase of Meaning will continue to focus on acquiring lexical information from large text collections (corpora currently available) rather than collecting data from the web. However, some experiments (for instance, Experiment D below) are processing large amounts of data acquired directly from the web.

Below we describe eleven experiments (named A–K) that cover the topics proposed in the Meaning technical annex for ACQ0 and ACQ1, and that also start to address topics in ACQ2.

• ACQ0:

– Subcategorization frequencies (experiment A)

– Topic signatures (experiment D)

– Terminology/collocations (experiment B)

– Domain Information (experiment C)

• ACQ1 (using WSD0)

– New senses (experiment J)

– Coarser-grained sense clusters (experiment H)

– Selectional preferences (experiment G)

• ACQ2 (using WSD1)

– Specific lexico-semantic relations (experiment I)

– Thematic-role assignments for nominalizations

– Diathesis alternations

In summary, the consortium plans to perform the following experiments for ACQ1:

Experiment 5.A Investigating multilingual acquisition for verbal predicates.

Experiment 5.B Detecting new noun-noun and adjective-noun collocations.

Experiment 5.C Acquiring domain information for named entities.

Experiment 5.D Acquiring topic signatures from large corpora and the web.

Experiment 5.E Acquiring examples of words used in particular senses from large text collections.

Experiment 5.F Acquisition of lexical knowledge from MRDs.

Experiment 5.G Acquiring improved selectional preferences.

Experiment 5.H Acquisition of sense clusters.

Experiment 5.I Acquiring multiwords: phrasal verbs.

Experiment 5.J Acquiring new senses.

Experiment 5.K Acquiring sense frequencies.


5.1 Experiment 5.A – Multilingual acquisition for predicates

5.1.1 Current Status

• Initial Design: Deliverable D2.1 Section 5.1

• Current reports: Working Papers WP5.2a, WP5.2b

• Summary: Deliverable D5.1 Section 2

• Status: Active

5.1.2 Introduction and Background

Obtaining large, explicit lexicons rich enough for practical NLP tasks has proved difficult. Methods for automatic lexical acquisition have been investigated for a number of different types of lexical information, including collocations [Dunning, 1993; Justeson and Katz, 1995], word senses [Schutze, 1992; Lin and Pantel, 1994], prepositional phrase attachment ambiguity [Hindle and Rooth, 1993], selectional preferences [Resnik, 1993; Ribas, 1995; Li and Abe, 1998; McCarthy, 2001; Agirre and Martinez, 2001; Agirre and Martinez, 2002], subcategorization frames (SCFs) [Brent, 1991; Brent, 1993b; Ushioda et al., 1993; Brent, 1993a; Briscoe and Carroll, 1997; Carroll and Rooth, 1998; Gahl, 1998; Lapata, 1993; Zarkar and Zeman, 2000; Korhonen, 2002] and diathesis alternations [Lapata, 1993; Lapata, 2001; Walde, 2000; McCarthy, 2001]. However, many of these methods are still experimental and need further research before they can successfully be applied to large-scale acquisition.

Being a multidimensional problem, predicate knowledge is one of the most complex types of information to acquire. Predicates (verbs and their corresponding nominalizations) are essential for the development of robust and accurate parsing technology capable of recovering predicate-argument relations and logical forms. Without this knowledge, resolving most structural ambiguities of sentences is difficult, and understanding (producing representations at a semantic level) is impossible.

Moreover, predicate-argument knowledge has been shown to vary across corpus type (written vs. spoken), corpus genre (e.g. financial news vs. balanced text), and discourse type (single sentences vs. connected discourse) [Carroll and Rooth, 1998; Roland et al., 2000; Roland and Jurafsky, 1998]. [Roland and Jurafsky, 2002] have shown that much of this variation is caused by the effects of different corpus genres on verb sense and the effect of verb sense on predicate-argument associations.

A full account of predicate information requires specifying the number and type of arguments, the predicate sense in question, the semantic representation of the particular predicate-argument structure, the mapping between the syntactic and semantic levels of representation, semantic selectional restrictions/preferences on participants, control of omitted participants, and possible diathesis alternations. Unfortunately, all these kinds of knowledge are interdependent.


The methodology we are using in Meaning provides a possible shortcut to this never-ending cycle by means of three consecutive acquisition (ACQ) and word sense disambiguation (WSD) rounds (for instance, WSD0 helping the acquisition of selectional preferences in ACQ1, see also section ??). We anticipate that progress towards a solution to one facet will help the rest.

One approach to the acquisition of predicate-argument associations is syntax-driven. Following a bottom-up approach, from syntax to semantics, if we can identify specific associations between subcategorisation frames (SCFs) and predicates, we can gather information from corpus data about the head lemmas which occur in argument slots of SCFs and use this information as input to selectional preference acquisition [McCarthy, 2001; Walde, 2000]. Selectional preferences are an important part of predicate information, since they can be used to aid anaphora resolution [Ge et al., 1998], WSD [Ribas, 1995; Resnik, 1997; McCarthy et al., 2001] and the automatic identification of diathesis alternations from corpus data [Walde, 2000; Lapata, 1993; Stevenson and Merlo, 1998; McCarthy, 2001].

However, [Korhonen, 2002] showed that in terms of SCF distributions, individual verbs correlate more closely with syntactically similar verbs, and even more closely with semantically similar verbs, than with all verbs in general. Moreover, her results show that verb semantic generalizations can successfully be used to guide and structure the acquisition of SCFs from corpus data.

Thus, it is possible to devise alternative acquisition schemes going top-down from semantics to syntax. If we identify specific associations between participants and predicates (selectional preferences), we can also use corpus data to gather information about their particular syntactic behaviour with respect to a predicate, helping the acquisition of SCFs, diathesis alternations, etc. However, this new approach requires that we work directly at the sense level, with predicates and associations to participants semantically disambiguated.

Furthermore, in a multilingual semantic scenario, it seems possible to devise ways to acquire some predicate-argument knowledge from a particular language using a bottom-up approach, and then, in a top-down fashion, to acquire some knowledge in another language.

Two different and complementary dimensions can help to minimize the WSD problem: multilingualism and domains. Although working in parallel with comparable corpora in several languages will increase the complexity of the process, we believe that language translation discrepancies among word forms can help the selection of the correct word senses [Habash and Dorr, 2002]. Moreover, a further reduction of the search space among sense candidates can be obtained by processing domain corpora [Gale et al., 1992].

This experiment is mainly designed to test the validity of this new approach for the acquisition of predicate knowledge (SCFs, selectional restrictions, diathesis alternations, etc.). We will first focus on object and subject positions of a set of verbal predicates. However, in a complementary experiment we will deal with the PP attachment problem. This problem is especially hard for languages like Basque, which has free word order. Our current Basque parser makes attachment decisions based on heuristics. We are devising an experiment where we will transfer attachment information coming from English parsed data, and the attachment decisions will be based on this transferred information.


5.1.3 Source Data

Summarizing, the experiment aims to study new ways of restricting the search space when performing acquisition tasks, in order to obtain more accurate knowledge for some languages and to balance the coverage of such knowledge across languages. Thus, this experiment can also be seen as a common framework to study productive paths to exploit appropriately:

• available semantic knowledge (wordnets, Semantic Files, WordNet Domains, EuroWordNet Top Ontology, Base Concepts, Sumo, etc.)

• cross language discrepancies/agreements through the Ili

• available comparable domain corpora (fixing also the time period)

We are focusing only on the FINANCE and SPORTS domains. For that purpose, all languages use (where possible) news articles from similar time periods (January, February and March 2000) and similar domains (FINANCE and SPORTS). In the previous round we empirically proved the feasibility of obtaining large-scale semantic patterns for any language based only on shallow parsing and some basic semantic generalizations. Being an exploratory experiment, we performed only a qualitative evaluation. We compared several semantic patterns coming from translation-equivalent verbs selected from different languages and domains [Atserias et al., 2003].

5.1.4 Experiments

The first goal of this experiment is to study how the current technology and the knowledge available for one language can help large-scale acquisition tasks, mainly subcategorization frames (SCFs) and selectional restrictions or preferences (SRs), for other languages. Obviously, this is a general goal that exceeds the scope of a short experiment. Hopefully, each language/partner will devise throughout the project their own methods, tools and resources for making this goal feasible. A major outcome of this experiment should be the design of a complete methodology for acquiring knowledge of predicate-participant associations for the languages involved in the project, taking into account that this process is not performed as an isolated endeavour, but aided by the presence of knowledge and resources from other languages.

Thus, the current goal of this experiment is to study whether, by using currently available technology and resources (taking into account coverage, quality, cross-language discrepancies/agreements, domain corpora, etc.), we can improve large-scale acquisition of predicate knowledge. Initially, the experiment will be performed on a small number of predicates (around 10 verbs). Although a major goal of this experiment should be to provide a common framework to compare and use lexical knowledge across languages, we should obtain another fruitful side effect: a common platform to compare the current capabilities of the different Meaning Linguistic Processors. As each Meaning group has its own idiosyncratic language, background, methods, tools and resources (e.g. there are no full parsers for Basque, Catalan, Italian or Spanish, no treebanks for Basque, Catalan or Spanish, and no large SCF lexicons ready to use for Catalan, Italian or Spanish), our first goal must be to build a common framework for comparing data and results. The construction of this common framework will allow the experiment to start properly.

Although other approaches are possible (for instance, starting from raw data [Brent, 1991; Brent, 1993b] or parsed data [Briscoe and Carroll, 1997; Carroll and Rooth, 1998; Zarkar and Zeman, 2000; McCarthy, 2001; Korhonen, 2002]), in this experiment we will perform POS tagging, named entity recognition and classification (NERC) and syntactic chunking [Abney, 1991]. Basically, chunks are non-recursive cores of major phrases, e.g. NPs, PPs, verb groups and so forth. Essentially, chunking factors sentence structure into pieces, allowing later generalizations over slot heads and prepositions.
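As a rough illustration of how chunked output can feed such generalizations, the sketch below pairs each verb-group head with the head of the immediately following NP chunk. The chunk labels and the adjacency heuristic are simplifications for illustration only, not the project's actual chunker:

```python
from typing import List, Tuple

def object_heads(chunks: List[Tuple[str, str]]) -> List[Tuple[str, str]]:
    """Given a chunked sentence as (chunk_type, head_lemma) pairs,
    pair each verb-group (VG) head with the head of the NP chunk that
    immediately follows it: a crude direct-object heuristic."""
    pairs = []
    for i, (ctype, head) in enumerate(chunks):
        if ctype == "VG" and i + 1 < len(chunks) and chunks[i + 1][0] == "NP":
            pairs.append((head, chunks[i + 1][1]))
    return pairs

# One chunked sentence: "The nurse injected the drug into a vessel"
chunks = [("NP", "nurse"), ("VG", "inject"), ("NP", "drug"),
          ("PP", "into"), ("NP", "vessel")]
```

Pairs collected this way over a corpus are the raw counts behind slot-head generalizations such as the SemCor tables in section 4.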

5.1.5 Evaluation and discussion

Since this is a preliminary and exploratory experiment (with necessarily many simplifications), a qualitative evaluation is more suitable than a quantitative one. We will compare the semantic patterns coming from the translation-equivalent verbs selected from each language/domain combination (and with other equivalent resources available). We will look in particular at overlapping patterns.

If our hypothesis is correct, we will be able to acquire similar semantic patterns from several languages with the technology and resources we currently have available. If this experiment is successful, each partner must devise their own large-scale process for acquiring new semantic predicate-argument associations from the rest of the languages. This will take place in the next cycle of the project.

5.2 Experiment 5.B – Collocation

5.2.1 Current Status

• Initial Design: Deliverable D2.1 Section 5.2

• Current reports: Working Paper WP5.3

• Summary: D5.1 Section 3

• Status: Suspended (effort reallocated to experiment 5.D)

5.3 Experiment 5.C – Acquisition of domain information for named entities

5.3.1 Current Status

• Initial Design: Deliverable D2.1 Section 5.3


• Current reports: Working Paper WP5.4

• Summary: D5.1 Section 4

• Status: Active

5.3.2 Introduction and Background

The aim of this experiment is to automatically acquire domain information for named entities (NEs). For instance, we want to know not only that "Kasparov" is an entity of kind Person, but also that this entity is highly related to the domain of chess, because "Kasparov" is a famous chess master. This information will then be used both to enrich the Meaning Mcr and to improve the performance of the domain-based WSD system, which currently can consider domains only for those words included in WordNet. The result of the experiment will be a large repository of domain-annotated named entities.

The acquisition methodology is based on two steps: first, NEs need to be recognized and classified with respect to a set of predefined categories (typically Person, Organization, Date, Location, Measure); then such named entities are classified with respect to a predefined set of domain labels. As for evaluation, a direct evaluation of the acquired NE-domain pairs is still under discussion, since there are no available gold standards to be used. An interesting approximation can be obtained by considering nouns with characteristics similar to NEs, which are domain-tagged in WordNet Domains. Some experiments in this direction will be performed. An indirect evaluation in a WSD task also presents some open issues, since it would require a collection that is both sense-tagged and contains a relevant number of NEs. Currently we are considering SemCor as a candidate for indirect evaluation.

For the first task we will use the NERD system [Magnini et al., 2002a]. For the second task we will experiment with two alternative approaches: "term categorization", a supervised technique for the creation of domain-specific lexicons described in [Avancini et al., 2003], and "domain vectors", an unsupervised technique based on the domain categorization of the context in which the named entity occurs.
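The "domain vectors" idea can be sketched as accumulating domain votes from the words surrounding a named entity. The toy word-to-domain lexicon below stands in for WordNet Domains, and the Factotum label is used as a fallback; everything here is illustrative, not the project's implementation:

```python
from collections import Counter
from typing import Dict, List

# Toy word -> domain lexicon standing in for WordNet Domains.
WORD_DOMAINS: Dict[str, str] = {
    "checkmate": "PLAY", "pawn": "PLAY", "tournament": "PLAY",
    "stock": "ECONOMY", "shares": "ECONOMY",
}

def domain_vector(context: List[str]) -> Counter:
    """Accumulate domain evidence from the context of a named entity."""
    votes = Counter()
    for w in context:
        d = WORD_DOMAINS.get(w.lower())
        if d:
            votes[d] += 1
    return votes

def best_domain(context: List[str]) -> str:
    """Assign the majority domain, falling back to FACTOTUM."""
    votes = domain_vector(context)
    return votes.most_common(1)[0][0] if votes else "FACTOTUM"
```

A real system would weight the votes (e.g. by the domain relevance of each context word) rather than count them uniformly.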

5.3.3 Source Data

The design of the experiment is based on the following resources:

• A repository of domain labels to annotate named entities. We have used the domain labels from WordNet Domains [Magnini and Cavaglia, 2000], freely available at http://wndomains.itc.it. In addition, WordNet Domains has been used for a number of preliminary experiments whose aim is to check the feasibility of acquiring domain information for NEs on the basis of the behavior of nouns for which a domain annotation is already known.


• A collection of documents containing named entities. For the experiments we have used a subset of the Reuters Corpus Volume 1 (RCV1), a set of documents made available by Reuters18 for text categorization experimentation, consisting of 806,812 news stories produced by Reuters between 20-Aug-1996 and 19-Aug-1997. The stories cover the range of content typical of a large English-language international newswire. They vary from a few hundred to several thousand words in length.

5.3.4 Experiments

In the Term Categorization task proposed in [Avancini et al., 2003], WordNet Domains can be used as a gold standard containing domain-labeled terms. Such a gold standard can be used for learning and testing a supervised classifier: for each category (i.e. a domain), positive and negative examples (i.e. terms) can be provided for learning, and the system's performance can be tested on new terms. Each term is represented by a "bag of documents" feature vector (i.e. the list of the documents containing the term in the corpus). This methodology has been proposed as a general schema for the acquisition of domain-specific terminology. The goal of the experiment is the application of Term Categorization to the acquisition of domain information for NEs not present in WordNet.

As a preliminary investigation, we implemented a Term Categorization system and evaluated it on nouns using WordNet Domains as a gold standard. As the learning device we adopted AdaBoost.MHKR [Sebastiani et al., 2000], a more efficient variant of the AdaBoost.MHR algorithm proposed in [Schapire and Singer, 2000]. Both algorithms are implementations of boosting, a method for supervised learning which has successfully been applied to many different domains. As measures of effectiveness, both micro-averaging and macro-averaging have been used [Lewis, 1991], grounded on precision and recall.
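The difference between the two averaging schemes can be made concrete: micro-averaging pools the true-positive, false-positive and false-negative counts over all categories before computing precision and recall, while macro-averaging computes the scores per category and then averages them. A small self-contained sketch:

```python
from typing import Dict, Tuple

def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def micro_macro(per_category: Dict[str, Tuple[int, int, int]]):
    """per_category maps each domain to (tp, fp, fn) counts.
    Micro-averaging pools the counts; macro-averaging averages the
    per-category scores."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    micro = prf(tp, fp, fn)
    scores = [prf(*c) for c in per_category.values()]
    n = len(scores)
    macro = tuple(sum(s[i] for s in scores) / n for i in range(3))
    return micro, macro

# Hypothetical counts for two domains.
counts = {"ECONOMY": (8, 2, 2), "PLAY": (1, 0, 9)}
micro, macro = micro_macro(counts)
```

Note how the rare PLAY category barely affects the micro scores but pulls macro recall down sharply, which is why both averages are reported in Tables 3 and 4.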

We will train and test the system using different sets of terms contained in one month of the Reuters corpus (November 1996, comprising 61,022 news stories). The subset of terms used in the experiment is included in WordNet Domains, and their domain annotations do not contain the Factotum domain (this is a significant difference with respect to the experiment described in [Avancini et al., 2003]).

5.3.5 Evaluation and Discussion

In order to test the effectiveness of the system for domain acquisition for NEs, we will restrict our attention to nouns, which seem semantically closest to NEs. We will consider different settings with different ranges of term frequency in the collections.

5.4 Experiment 5.D – Topic Signatures

5.4.1 Current Status

• Initial Design: Deliverable D2.1 Section 5.4

18 http://www.reuters.com/


Terms used              # train  # test  Prec. (micro)  Recall (micro)  F1 (micro)
All terms (freq ≥ 1)      9080    4923     0.8            0.054           0.1
All nouns (freq ≥ 1)      7929    3964     0.809633       0.06153         0.114369
All nouns (freq ≥ 10)     3237    1619     0.821516       0.130384        0.2205
All nouns (freq ≥ 30)     1811     906     0.812325       0.186136        0.302872
All nouns (freq ≥ 60)     1189     595     0.818792       0.224678        0.352601
Monodom (freq ≥ 1)        5203    2602     0.756757       0.021538        0.041885
Monodom (freq ≥ 10)       1799     899     0.758065       0.052455        0.098121
Monodom (freq ≥ 30)        877     439     0.702128       0.075688        0.136646
Monodom (freq ≥ 60)        534     267     0.672727       0.141221        0.233438

Table 3: Micro-averaged F1 results.

Terms used              # train  # test  Prec. (macro)  Recall (macro)  F1 (macro)
All terms (freq ≥ 1)      9080    4923     0.85           0.023           0.045
All nouns (freq ≥ 1)      7929    3964     0.87315        0.025969        0.050438
All nouns (freq ≥ 10)     3237    1619     0.866964       0.043374        0.082615
All nouns (freq ≥ 30)     1811     906     0.897793       0.051622        0.09763
All nouns (freq ≥ 60)     1189     595     0.906126       0.062565        0.117048
Monodom (freq ≥ 1)        5203    2602     0.909205       0.012431        0.024527
Monodom (freq ≥ 10)       1799     899     0.853348       0.025032        0.048636
Monodom (freq ≥ 30)        877     439     0.927158       0.079149        0.145847
Monodom (freq ≥ 60)        534     267     0.98839        0.107307        0.192591

Table 4: Macro-averaged F1 results.

• Current reports: Working Paper WP5.5

• Summary: D5.1 Section 5

• Status: Active

5.4.2 Introduction and Background

The Meaning topic signature task will extract topic signatures for all word senses in the Multilingual Central Repository (Mcr, linked to WordNet version 1.6). This is a sub-task of WP5 (Acquisition), and topic signatures will be used in further acquisition tasks. Topic signatures also have potential uses in Word Sense Disambiguation (WP6) and Cross-Lingual Information Retrieval and Question Answering (WP8).

Topic signatures aim to associate a topical vector with each word sense. The dimensions of this topical vector are the words in the vocabulary, and the vector elements contain weights which are intended to capture the relatedness of the words to the target word sense. In other words, each word sense is associated with a set of related words with associated weights. For instance, the first sense of church, glossed as "a group of Christians; any group professing Christian doctrine or belief", might have the following topic signature:

church(1177.83) catholic(700.28) orthodox(462.17) roman(353.04) religion(252.61) byzantine(229.15) protestant(214.35) rome(212.15) western(169.71) established(161.26) coptic(148.83) ...

We can build such lists from sense-tagged corpora by simply observing which words co-occur distinctively with each sense. The problem is that sense-tagged corpora are scarce. Alternatively, we can try to associate a number of documents from existing corpora with each sense and then analyze the occurrences of words in those documents.
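This document-based construction can be sketched as weighting each word by how much more frequent it is in the sense-associated documents than in a background collection. The smoothed log-ratio weighting below is a toy stand-in for the weighting actually used in the project:

```python
from collections import Counter
from math import log
from typing import List

def topic_signature(sense_docs: List[str], background_docs: List[str],
                    smoothing: float = 1.0) -> Counter:
    """Weight each word by its (smoothed, frequency-scaled) log relative
    frequency in sense-associated documents versus a background
    collection. Illustrative only."""
    fg = Counter(w for d in sense_docs for w in d.lower().split())
    bg = Counter(w for d in background_docs for w in d.lower().split())
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    sig = Counter()
    for w, c in fg.items():
        p_fg = (c + smoothing) / (fg_total + smoothing)
        p_bg = (bg.get(w, 0) + smoothing) / (bg_total + smoothing)
        sig[w] = c * log(p_fg / p_bg)
    return sig
```

Words that are frequent in the sense documents but rare in the background (like "catholic" for the first sense of church) get the highest weights.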

In this experiment we have followed two strategies to build topic signatures:

1. Using monolingual corpora and monosemous relatives. We have followed the monosemous relatives method, inspired by [Leacock et al., 1998]. This method uses monosemous synonyms or hyponyms to construct the queries. For instance, the first sense of channel in WordNet has a monosemous synonym "transmission channel". All the occurrences of "transmission channel" in any corpus can be taken to refer to the first sense of channel. In our case we have used the following kinds of relations in order to get the monosemous relatives: hypernyms, direct and indirect hyponyms, and siblings. The advantages of this method are that it is simple, that it does not need error-prone analysis of the glosses, and that it can be used for languages whose wordnets do not include glosses.

2. Using a second language. This method assumes that the mappings between senses and words differ across languages. We used Chinese, a language very distant from English, as the second language in our experiments, since the more distant two languages are, the more likely it is that senses are lexicalised differently [Resnik and Yarowsky, 1999]. The idea is that first we translate an English ambiguous word w into Chinese, using an English-Chinese lexicon; each sense of w maps to a distinct set of Chinese word translations. Then we use these sets of Chinese words to retrieve their contexts from large amounts of Chinese text, constructing a corpus in Chinese for each set. Finally, we segment the Chinese corpora and then translate them back to English word by word, using a Chinese-English lexicon. Our method takes advantage of the large amount of Chinese text available in corpora and on the Web. It could be applied to other language pairs, as long as they are distant and bilingual lexicons are available, e.g. Spanish-Chinese, Italian-Japanese, etc.
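The query-construction step of the monosemous relatives strategy can be sketched as follows; the relatives inventory is a hard-coded toy stand-in for a real WordNet lookup, with sense identifiers invented for illustration:

```python
from typing import Dict, List

# Toy inventory: sense id -> monosemous relatives, closest first
# (synonyms, then hyponyms, then siblings). Illustrative only.
RELATIVES: Dict[str, List[str]] = {
    "channel%1": ["transmission channel"],
    "channel%2": ["groove", "furrow"],
}

def build_queries(sense: str, max_queries: int = 5) -> List[str]:
    """Turn each monosemous relative into a quoted search query, so
    every snippet retrieved can be assumed to exemplify this sense."""
    return ['"%s"' % rel for rel in RELATIVES.get(sense, [])][:max_queries]
```

The resulting quoted strings are exactly the kind of queries fed to a search engine in the experiments of section 5.4.4.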

Topic signatures for words have been successfully used in summarization tasks [Lin and Hovy, 2000], and [Agirre et al., 2000; Agirre et al., 2001] have shown that it is possible to obtain good-quality topic signatures for word senses.

In this second round Meaning will apply past research (which was tested on a limited number of words) to a vast number of word senses, i.e. all word senses of all parts of speech for a number of Meaning languages. This will allow us to apply topic signatures extensively to the following areas in Meaning:

• Word Sense Disambiguation (cf. Experiment WP6.H in this report).

• Clustering word senses and synsets (cf. Experiment WP5.H in this report).

• Classifying new concepts in the WordNet hierarchy (see [Alfonseca and Manandhar, 2002]), which will be explored in the next round.

5.4.3 Source Data

Depending on the approach taken to build the topic signatures, we use the following data:

1. The basic source is the web, used as a huge corpus. After constructing the topic signatures we use the British National Corpus (BNC) for filtering. The source for word senses is the Multilingual Central Repository (Mcr, linked to WordNet version 1.6). The information for each word sense in the Mcr is used to build queries which are fed into Google.

2. To build the topic signatures, we need an English-Chinese lexicon, a Chinese-English lexicon, and a large amount of Chinese text, from Chinese corpora or from the Web. In our experiments, we adopt the Yahoo Student English-Chinese Lexicon (http://cn.yahoo.com/dictionary) and the LDC Chinese-English Translation Lexicon as the bilingual lexicons, retrieving Chinese text from the Chinese Gigaword Corpus and the People's Daily Newswire Website (http://www.people.com.cn).

5.4.4 Experiments

On the one hand, we will use monosemous queries to build topic signatures from the web for all English noun senses in the Mcr. In order to build the queries we use the monosemous relatives method. For each monosemous noun in the Mcr (WN1.6) we build a query and retrieve examples with Google. For each query we set a maximum of 1000 "snippets". The snippets are processed to extract meaningful sentences.

On the other hand, we have devised further experiments to evaluate the quality of alternative ways of constructing topic signatures, and the utility of the topic signatures in other NLP tasks (see below).

In short, we will perform the following experiments:

a. Build publicly available topic signatures for all WordNet 1.6 noun senses (that is, the English part of the Mcr). These topic signatures are based on the monolingual approach.

b. Compare similarity measures based on topic signatures to other, hierarchy-based similarity measures (also for the monolingual approach).


c. Evaluate the usefulness of the topic signatures in Word Sense Disambiguation (for the bilingual approach).

d. Use topic signatures to cluster word senses (also for the monolingual approach, cf. Experiment 5.H).

5.4.5 Evaluation

Evaluating automatically acquired semantic and world knowledge is not an easy task. There is no gold standard for topic signatures, and hand evaluation is arbitrary. Therefore the usefulness of the topic signatures is evaluated, rather than their quality directly. The evaluation tasks are (named to correspond to the experiments above):

b. Compare similarity measures based on topic signatures to other, hierarchy-based similarity measures. The correlation of a variety of similarity measures based on topic signatures will be compared to the usual hierarchy-based approaches.

c. Evaluate the usefulness of the Topic Signatures on Word Sense Disambiguation (for the bilingual approach).

d. The quality of the clusters produced by topic signatures will be compared across the clustering strategies (cf. Experiment 5.H).

5.5 Experiment 5.E – Sense Examples

5.5.1 Current Status

• Initial Design: Deliverable D2.1 Section 5.5

• Current reports: Working Paper WP5.6

• Summary: D5.1 Section 6

• Status: Active

5.5.2 Introduction and Background

A promising current line of research in WSD uses semantically annotated corpora to train Machine Learning (ML) algorithms to decide which word sense to choose in which contexts. Supervised WSD systems are data hungry and suffer from the "knowledge acquisition bottleneck". These approaches are termed "supervised" because they learn from previously sense-annotated data, and therefore require a large amount of human intervention to annotate the training data. Although ML classifiers are undeniably effective, they will not be feasible on a large scale until we can obtain reliable training data without manual annotation.

Some recent work focuses on reducing the acquisition cost and the need for supervision in corpus-based methods for WSD. [Leacock et al., 1998; Mihalcea and Moldovan, 1999; Agirre and Martinez, 2000] automatically generate arbitrarily large corpora for unsupervised WSD training, using the knowledge contained in WordNet to formulate search engine queries over large text collections or the Web.

[Leacock et al., 1998] used a system called AutoTrain to collect monosemous relatives for a set of nouns from a 30-million-word corpus of the San Jose Mercury News. The sampling process retrieves the "closest" relatives first. For example, suppose that the system is asked to retrieve 100 examples for each sense of the noun court. The system first looks for monosemous synonyms of the sense (e.g. tribunal), and when complete, for daughter collocations (immediate hyponyms) which contain the target word as the head (e.g. superior court), and tallies the number of examples in the corpus for each. If the corpus has 100 or more examples for these relatives, it retrieves a sampling of them. If there are not enough examples, the remainder are inspected in the following order: all other daughters; hyponym collocations that contain the target; all other hyponyms; hypernyms; and finally, sisters. AutoTrain takes as broad a sampling as possible across the corpus and never takes more than one example from an article. Preliminary tests showed that performance declined when using local closed-class and part-of-speech cues obtained from the monosemous relatives. This is not surprising, as many of the relatives are collocations whose local syntax is quite different from that of the polysemous word in its typical usage. Prior probabilities for the senses were taken from the manually tagged materials.
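The back-off order used by AutoTrain can be sketched as a quota-filling loop over tiers of relatives, where closer tiers are exhausted before more distant ones are consulted. The tier contents and corpus counts below are hypothetical, not taken from the San Jose Mercury News corpus:

```python
# Sketch of AutoTrain's back-off order for sampling examples of one
# sense. Each tier is a list of monosemous relatives; the sampler
# exhausts closer tiers before falling back to more distant ones.
def sample_examples(tiers, corpus_counts, quota=100):
    """Fill a quota of examples, preferring earlier (closer) tiers."""
    sampled = {}
    remaining = quota
    for tier_name, relatives in tiers:
        if remaining <= 0:
            break
        for rel in relatives:
            take = min(corpus_counts.get(rel, 0), remaining)
            if take:
                sampled[rel] = take
                remaining -= take
    return sampled

tiers = [
    ("synonyms",              ["tribunal"]),
    ("daughter collocations", ["superior court", "kangaroo court"]),
    ("other hyponyms",        ["appellate court"]),
    ("hypernyms",             ["assembly"]),
]
counts = {"tribunal": 40, "superior court": 35, "kangaroo court": 10,
          "appellate court": 50, "assembly": 200}

# "appellate court" tops up the last 15 examples; the hypernym tier
# is never reached because the quota is already filled.
print(sample_examples(tiers, counts, quota=100))
```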

Although this method achieves high accuracy for WSD with respect to manually tagged materials, its applicability for a particular word is limited by the existence of monosemous relatives and by the number of instances of these monosemous relatives in the corpus. Restricting the semantic relations to synonyms, direct hyponyms and direct hypernyms, they found that about 64% of the words in WordNet have monosemous relatives in the corpus. The quality of the acquired data was evaluated indirectly, comparing the results of a WSD system for 14 nouns when trained on monosemous relatives and on manually tagged training materials.

The work of [Mihalcea and Moldovan, 1999] tries to overcome these limitations (1) by using the word definitions provided by glosses, and (2) by using the web as a very large corpus. They use the Altavista search engine to create complex search queries using boolean operators (AND, OR, NOT and NEAR) to increase the quality of the information retrieved. For each sense of a word, their system forms search queries in ascending order of preference following four different methods. The first method uses the monosemous synonyms. The second parses the gloss, detecting noun and verb phrases, each of which constitutes a search query. The third, after parsing the gloss, replaces the stop-words by the NEAR operator and concatenates the words from the current synset using the AND operator. The fourth also parses the gloss, keeping the head phrase, and combines it with the words from the synset using the AND operator.

Their approach was tested on 20 polysemous words, retaining only a maximum of 10 examples for each sense of a word from the top-ranked documents, leading to an accuracy of 91%. Using this method for these words, they obtained thirty times more examples than appear in SemCor.

[Agirre and Martinez, 2000] implemented the previously described method of [Mihalcea and Moldovan, 1999] for obtaining training data for 13 words (8 nouns and 5 verbs), and tested on examples from SemCor. Only a few words obtained better results than random, and for one particular word the error rate reached 100%.

Agirre and Martínez suggest that one possible explanation of this apparent disagreement with respect to [Mihalcea and Moldovan, 1999] could be that the acquired examples, although correct in themselves, provide systematically misleading features (for instance, as suggested by Leacock et al. when using a large set of local closed-class and part-of-speech features). Besides, all words were trained with equal numbers of examples. Clearly, further work is needed to analyse the source of errors and devise ways to overcome these contradictory results.

In order to test the feasibility of this approach, the Meaning consortium has developed and released a new tool: the first version of ExRetriever, a flexible system to perform sense queries on large corpora. ExRetriever automatically characterizes each synset of a word as a query (using mainly synonyms, hyponyms and the words of the definitions), and then uses these queries to obtain sense examples (sentences) automatically from a large text collection. The current implementation of ExRetriever directly accesses the content of the Mcr. The system also uses SWISH-E to index large collections of text such as SemCor or the BNC. ExRetriever has been designed to be easily ported to other lexical knowledge bases and corpora, including the possibility of querying search engines such as Google.

5.5.3 Source Data

Using ExRetriever, several experiments can be carried out comparing different query construction strategies. We have selected eight words from the Senseval-2 lexical sample task, and SemCor annotated data will be used as the gold standard for the evaluation of the experiments.

Within the Meaning project, both direct and indirect evaluation experiments of ExRetriever performance have been designed. However, for this round only direct evaluation on SemCor will be performed.

In future rounds, ExRetriever could use large text collections in other languages, including the Spanish, Catalan and English EFE news corpora, the Spanish and Catalan "El Periodico", English SemCor and DSO, etc. (see Working Paper 3.1 for a description of the Meaning corpora).

Although the corpora could be raw text, in order to perform PoS filtering the corpus must be preprocessed to obtain PoS-tagged text (including some stemming or lemmatization). Filtering is necessary after querying in order to remove examples whose PoS does not correspond to the particular sense the program is asking for.

For ACQ2, ExRetriever will not use local corpora but the web, querying the Internet directly using a web search engine such as Google.


5.5.4 Experiments

Although this approach seems promising, it remains unclear which is the best strategy for building sense queries from a large-scale knowledge base like WordNet. We will use ExRetriever to explore the trade-off between coverage (collecting large quantities of sense examples) and accuracy (making queries more precise and restrictive, and therefore less productive).

The first experiments will be performed using large-scale corpora stored locally. This will allow us to perform controlled tests and comparisons more conveniently. Later, when we have a clearer view of the knowledge to be used (e.g. regarding PoS, monosemous relatives only, synonyms, direct hypernyms, direct hyponyms, involved relations, etc.), query construction (e.g. including or not AND-NOTs with characterizations of the other sense queries), the complete query process (e.g. union set of queries, incremental construction, etc.), and post-processing (e.g. using PoS, syntactic or domain filtering), the system will be adapted to other languages (using the Mcr) and corpora (El Periodico, EFE).

Using ExRetriever on SemCor we can perform detailed micro-analysis of the available data. That is, we can easily make many adjustments when building queries and appropriately filtering out unwanted examples, balancing the trade-off between coverage (we want to obtain all the examples of a particular sense occurring in a corpus) and precision (we want only those corresponding to the particular sense).
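The coverage/precision trade-off can be made concrete by scoring each sense query against the gold annotations. A sketch, with hypothetical sentence identifiers standing in for real SemCor data:

```python
def precision_coverage(retrieved, gold):
    """retrieved: set of sentence ids returned by a sense query;
    gold: set of sentence ids actually annotated with that sense.
    Precision is the fraction of retrieved examples carrying the
    intended sense; coverage is recall over the gold occurrences."""
    if not retrieved:
        return 0.0, 0.0
    correct = len(retrieved & gold)
    precision = correct / len(retrieved)
    coverage = correct / len(gold) if gold else 0.0
    return precision, coverage

# Hypothetical: a restrictive query retrieves 4 sentences, 3 of which
# are correct, out of 10 gold occurrences of the sense in the corpus.
p, c = precision_coverage({1, 2, 3, 4}, set(range(2, 12)))
print(p, c)   # → 0.75 0.3
```

Making the query more restrictive would typically push precision up and coverage down; plotting the two per strategy exposes the trade-off directly.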

5.5.5 Evaluation

Although we can perform direct and indirect evaluation of the performance of the first ExRetriever prototype, only direct evaluation on SemCor will be performed in ACQ1.

Although restrictive queries are likely to produce poor coverage on SemCor due to its small size (around 250,000 words), this is the only sense-tagged resource providing quantities of examples for all words. Thus, using SemCor we can perform direct testing of the different strategies for all words.

Another very promising line of research will follow [Widdows, 2003]. This work presents a theoretically motivated method for removing unwanted meanings directly from the original query in vector models. Irrelevance in vector spaces is modelled using orthogonality. Using this approach, query vector negation removes not only unwanted strings but unwanted meanings. The method is applied to standard IR systems, processing queries such as "play NOT game". This work presents an algebra to operate with word vectors rather than words. Following this approach, it seems that most of the errors produced by substituting the target word with its relatives can be avoided. Furthermore, using this approach, we can also use other sense-tagged corpora for direct comparisons of ExRetriever. Although the DSO corpus only provides sense-tagged data for 141 words (nouns and verbs), there are examples in large quantities (of the order of thousands). In this case, queries need not include substitutive relatives, only query restrictions (located in the gloss, ported from other acquisition phases, etc.) over the polysemous target word.
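The negation operation of [Widdows, 2003] can be illustrated directly: the query vector is projected onto the subspace orthogonal to the vector of the unwanted term. The three-dimensional vectors below are hypothetical toy embeddings, not taken from any real model:

```python
# Sketch of Widdows-style vector negation: "play NOT game" is the
# component of the vector for "play" orthogonal to the vector for
# "game", so items close to the "game" meaning score low against it.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def negate(a, b):
    """Project a onto the subspace orthogonal to b: a - (a.b / b.b) b."""
    scale = dot(a, b) / dot(b, b)
    return [x - scale * y for x, y in zip(a, b)]

play = [0.9, 0.8, 0.1]   # hypothetical axes: (game-ish, music-ish, other)
game = [1.0, 0.0, 0.0]

play_not_game = negate(play, game)
print(play_not_game)                          # → [0.0, 0.8, 0.1]
print(abs(dot(play_not_game, game)) < 1e-9)   # → True (orthogonal to "game")
```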

When evaluating on a controlled corpus, indirect evaluation will also be performed using supervised WSD systems (see Section ??). Once a sense-tagged corpus is acquired using ExRetriever, we will use several Machine Learning algorithms to perform cross-comparisons with respect to other sense-tagged resources (SemCor, DSO and resources produced for Senseval).

5.6 Experiment 5.F – Lexical knowledge from MRDs

5.6.1 Current Status

• Initial Design: To be integrated in Deliverable D2.3

• Status: New experiment

EHU will acquire semantic relations from a Basque monolingual dictionary, and UPC from Spanish and English monolingual dictionaries. No Working Paper is expected for this round.

5.7 Experiment 5.G – Improved Selectional Preferences

5.7.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.2 (experiment WP6.A3a.v0)

• Current reports: Working Paper WP6.2b

• Summary: D6.1

• Status: Active

5.7.2 Introduction

The Tree Cut Models (tcms) that we used in the first round of Meaning acquisition for learning selectional preferences from unannotated text often suffered from an overly high level of generalisation; that is, classes which are very high in the WordNet hierarchy are used to represent the preferences. Prototypical classes such as food as the direct object of eat are sometimes hidden by a selectional preference for a class further up the hyponym hierarchy, such as entity. This is partly because of the polysemy of the training data and partly because the tcm method, using the minimum description length principle, covers all the data rather than looking for prototypical classes. We are investigating three possibilities to acquire more specific, accurate and intuitive models:

5.7.3 Use of a weighting factor to counter the effect of sample size on TCMs

We are investigating the introduction of a weighting factor to counter the effect of data size on the tcms. We are using the weighting factor proposed by [Wagner, 2002] and experimenting with several values of the constant used in the weighting factor. We compare the results obtained with this weighting to those obtained with the unweighted tcms on the task of wsd of the polysemous nouns within SemCor. For this experiment we work with nouns occurring in the direct object slot only, and for training we use data from the direct object grammatical relations in the BNC, obtained using the RASP parser.

5.7.4 Automatic WSD of the training data used for selectional preference acquisition

We are performing collaborative experiments with partners in the Meaning consortium to ascertain whether automatic wsd of the input data for selectional preference acquisition can improve the accuracy of the selectional preferences acquired, and help to combat some of the problems of over-generalisation caused by polysemy. It is not possible to use wsd as a method of evaluation for the acquired selectional preferences because of the lack of adequate training data for the wsd systems (EHU require hand-labelled training data) and of testing data to evaluate the selectional preference models. Whilst we could divide SemCor into test and training data, we are aware that this resource has been utilised in the production of the domain labels by IRST. We require test data involving a specified slot of a grammatical relation, and the Senseval-2 lexical sample would not provide enough instances of verb and argument in a specified relationship for training the wsd systems and testing the selectional preference models, because only a handful of verbs occur more than once with disambiguated direct objects in the gold standard. We have therefore designed a pseudo-disambiguation experiment where the task is for the selectional preferences to determine which of two arguments is the one genuinely attested in the corpus data. A sample of test verbs is used. These include verbs being used for other experiments as well as some drawn from the Senseval-2 English lexical sample:

play, encounter, meet, take on, draw, equalize/equalise, coach, train, lose, win, buy, wear, build, teach, feed, ask, borrow, fix, construct, start, open, spend

The parsed BNC data involving verb-direct object grammatical relations for this sample of verbs is divided into 10 even-sized portions for 10-fold cross-validation. The aim is for EHU and IRST to independently tag as much of this data as possible with WordNet 1.6 senses, aiming for good accuracy. Only the current sentence of context is given. For evaluation, for each <direct object, verb> tuple, a pseudo-object is obtained at random, subject to the constraints that the pseudo-object:

• occurs with the same frequency in the direct object slot tuples of the BNC

• has not been attested in the BNC tuples
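The construction of test pairs under these two constraints can be sketched as follows; the frequency table and attested tuples below are hypothetical stand-ins for the parsed BNC data:

```python
import random

def pseudo_object(verb, obj, slot_freq, attested, rng):
    """Pick a confounder noun with the same direct-object-slot
    frequency as obj that never occurs as the object of verb."""
    freq = slot_freq[obj]
    candidates = [n for n, f in slot_freq.items()
                  if f == freq and (verb, n) not in attested and n != obj]
    return rng.choice(candidates) if candidates else None

# Hypothetical BNC-style slot frequencies and attested tuples
slot_freq = {"match": 120, "opponent": 120, "sonata": 120, "door": 87}
attested = {("play", "match"), ("play", "sonata"), ("open", "door")}

rng = random.Random(0)
# "sonata" is excluded (attested with "play"); only "opponent" matches
# both constraints, so it becomes the pseudo-object.
print(pseudo_object("play", "match", slot_freq, attested, rng))
```

The selectional preference model is then scored on how often it prefers the genuine object over the pseudo-object.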

For this experiment we do not remove genuine tuples from the test data if they appear in the training data, as is done for some smoothing experiments [Clark and Weir, 2002]. Clark reports that in doing this, class-based systems are at a disadvantage when compared with systems using bigrams directly [Keller and Lapata, 2003], since the direct use of bigrams necessitates that seen tuples are not removed from the training data. It is also true that a proportion of data in a separate test set will have occurred previously, and that performance should reflect this. It is important not to compare our results with those obtained using different training and test sets.

In this experiment, the selectional preference models are acquired from the training portion (9/10) of the BNC data, and are then applied to the test portion (1/10) on the task of finding the attested direct objects under 10-fold cross-validation.

5.7.5 Protomodels

We are exploring methods to acquire selectional preferences which, instead of covering all the noun senses in WordNet, just give a probability distribution over a portion of "prototypical classes", where that portion can be disambiguated and where the disambiguation is performed using a ratio of types in a class, rather than tokens. We refer to these as protomodels for ease of reference. We are acquiring these for verbs, rather than verb classes, and initially for the direct object grammatical relation. Our protomodels comprise classes within the noun hierarchy which have the highest proportion of types occurring in the data (for each argument head), rather than using the number of tokens, or frequency, as is used for the tcms. This will allow less frequent, but potentially informative, tokens to have some bearing on the models acquired. We then use only the frequency data of tokens which can be disambiguated with reference to these classes to populate these selected classes.
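The type-ratio criterion can be sketched as follows; the class memberships and attested heads below are hypothetical, standing in for WordNet classes and RASP-extracted direct objects:

```python
# Sketch of the type-ratio idea behind protomodels: score each
# candidate class by the proportion of its member nouns (types) that
# are attested as argument heads, ignoring token frequencies.
def type_ratio(class_members, attested_heads):
    """Fraction of a class's noun types seen in the argument slot."""
    members = set(class_members)
    return len(members & attested_heads) / len(members)

classes = {
    "food":   ["bread", "cake", "soup", "stew"],
    "entity": ["bread", "cake", "soup", "stew", "rock", "idea",
               "door", "car", "law", "dog"],
}
# Heads seen as direct object of "eat" (types, not token counts)
heads = {"bread", "cake", "soup"}

scores = {c: type_ratio(m, heads) for c, m in classes.items()}
# "food" wins (3 of 4 types attested) over the over-general "entity"
# class (3 of 10) that token-based TCMs tend to select.
print(scores)
```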

For the evaluation of these protomodels we will use two tasks: the first is wsd of the polysemous direct objects within SemCor, to be reported in working paper WP6.10 (experiment 6.G), and the second is the pseudo-disambiguation task described in section 5.7.4 above, to be reported in working paper WP5.8.

5.8 Experiment 5.H – Clustering WordNet Word Senses

5.8.1 Current Status

• Initial Design: Described below

• Status: New experiment

5.8.2 Introduction

There is considerable literature on what makes word senses distinct, but there is no general consensus on which criteria should be followed. From a practical point of view, the need to make two senses distinct will depend on the target application. This is evident, for instance, in Machine Translation, where some word senses will get the same translation (both the television and communication senses of channel are translated as kanal in Basque) while others will not (the groove sense of channel is translated as zirrikitu in Basque), depending on the target and source languages.

In this experiment we explore a set of automatic methods to hierarchically cluster the word senses in WordNet.


5.8.3 Data sources

The clustering methods that we will examine are based on the following information sources:

• Similarity matrix for word senses based on the confusion matrix of all systems that participated in Senseval-2.

• Similarity matrix for word senses produced by [Chugur and Gonzalo, 2002] using translation equivalences in a number of languages.

• Similarity matrix based on the Topic Signatures for each word sense. The topic signatures were constructed based on the occurrence contexts of the word senses, which can be extracted from hand-tagged data or automatically constructed from the Web (see WP5D.a in this round, and also [Agirre et al., 2000; Agirre et al., 2001]).

5.8.4 Experiment

In order to construct the hierarchical clusters we will use Cluto [Karypis, 2001], a general clustering environment that can take as input either a similarity matrix or the context of occurrence of each word sense in the form of a vector. The similarity matrices from the previous section are fed into Cluto, which outputs the clusters.

5.8.5 Evaluation

The gold standard is based on the manual grouping of word senses provided in Senseval-2. This gold standard is used to compute purity and entropy values for the clustering results. The quality of a clustering solution is measured using two different metrics that look at the gold-standard labels of the word senses assigned to each cluster [Zhao and Karypis, 2001].
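The two metrics can be sketched as follows, assuming clusters are given as lists of sense identifiers and the Senseval-2 groupings as a sense-to-group mapping (all identifiers below are hypothetical):

```python
import math
from collections import Counter

def purity_entropy(clusters, gold):
    """clusters: list of lists of item ids; gold: item id -> gold group.
    Returns (purity, entropy) weighted by cluster size, following the
    usual Zhao & Karypis style definitions: purity rewards clusters
    dominated by one gold group, entropy penalises mixed clusters."""
    n = sum(len(c) for c in clusters)
    q = len(set(gold.values()))          # number of gold groups
    purity = entropy = 0.0
    for c in clusters:
        counts = Counter(gold[i] for i in c)
        purity += max(counts.values()) / n
        h = -sum((f / len(c)) * math.log(f / len(c)) for f in counts.values())
        entropy += (len(c) / n) * (h / math.log(q)) if q > 1 else 0.0
    return purity, entropy

# Hypothetical: 4 senses of a word, gold-grouped {s1,s2} and {s3,s4}
gold = {"s1": "g1", "s2": "g1", "s3": "g2", "s4": "g2"}
print(purity_entropy([["s1", "s2"], ["s3", "s4"]], gold))  # perfect: (1.0, 0.0)
print(purity_entropy([["s1", "s3"], ["s2", "s4"]], gold))  # worst:   (0.5, 1.0)
```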

Some of the nouns in Senseval-2 have trivial clustering solutions, e.g. when all the word senses form a single cluster, or all clusters are formed by a single word sense. Twenty nouns have non-trivial clusters and can therefore be used for evaluation.

5.9 Experiment 5.I – Multiwords: phrasal verbs

5.9.1 Current Status

• Initial Design: Described below

• Status: New experiment


5.9.2 Experiment

We are investigating whether we can automatically detect compositionality in phrasal verbs output by the RASP parser. Not every verb modified by a particle is a genuine multiword unit; it may instead be a fully compositional verb modified by an adverbial, for example fly up. We assume that candidates which are less compositional are more likely to be genuine phrasal verbs that warrant an entry in a lexicon. For this experiment, we randomly select a sample of phrasal candidates from our parser output, subject to the constraint that we have an even split between three frequency ranges. We obtain human judgements of the compositionality of these verbs from three native speakers, using an ordinal scale for compositionality. We seek to demonstrate that there is significant agreement in rank order between the human judgements. We then use the average ranks for each item as a gold standard and compare various measures aimed at automatically detecting non-compositionality, using a thesaurus acquired from the parsed BNC with a distributional similarity measure. We compare these non-compositionality measures by looking for correlation with the gold standard derived from the human judgements. We also contrast the correlation of this gold standard with some statistics (χ2 and the log-likelihood ratio) commonly used for detecting multiwords and collocations. We additionally look at the relation between the human judgements and the appearance of the candidates in gold-standard resources, namely WordNet and the anlt lexicon [Grover et al., 1993], on the premise that non-compositional phrasals are more likely to be listed as multiwords in man-made resources. The working paper WP5.10 (experiment 5.I) will report the results of this experiment.

5.10 Experiment 5.J – New Senses

5.10.1 Current Status

• Initial Design: To be integrated in deliverable D2.3

• Status: New experiment

Three possibilities have been identified:

1. Classifying new terms using Topic Signatures [Alfonseca and Manandhar, 2002]

2. Classifying new terms using web directories [Santamaria et al., 2003]

3. Training Machine Learning models for semantic classes rather than word classes (see experiment WP6.J Semantic Class Classifiers).

EHU will explore the first approach in ACQ2, IRST the second, and UPC the third. A completely new design of this experiment is expected for the next round.


[Plot omitted: recall (0 to 100) against precision (0 to 100) for the series "using HTD", "without HTD" and "First Sense".]

Figure 3: The first sense heuristic compared with Senseval-2 results

5.11 Experiment 5.K – Ranking Senses automatically

5.11.1 Current Status

• Initial Design: Described below

• Status: New experiment

5.11.2 Introduction and background

The first sense heuristic, which is often used as a baseline for supervised wsd systems, frequently outperforms wsd systems which take context into account. This is largely because of the skewed frequency distribution of the senses for most words, as shown by the results of Senseval-2 in Figure 3, where the first sense heuristic was obtained using the sense-tagged data from SemCor [Palmer et al., 2001]. Whilst a first sense heuristic based on a sense-tagged corpus such as SemCor is clearly useful, there is a strong case for obtaining a first, or predominant, sense from untagged corpus data, so that one can tune it to the genre or domain at hand.

We are developing a method for automatically ranking WordNet senses. We experiment with nouns only at this stage. We rank noun senses using the "nearest neighbours" in a thesaurus acquired from automatically parsed text, based on the method of [Lin, 1998]. This provides the nearest k neighbours to each target noun, along with the distributional similarity score between the target noun and its neighbour. We then use the WordNet similarity package [Patwardhan and Pedersen, 2003] to give us a semantic similarity measure (hereafter referred to as the WordNet similarity measure) to weight the contribution that each neighbour makes to the various senses of the target word. This package provides an implementation of a host of WordNet similarity measures, and we are experimenting with as many of these as possible.
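The weighting scheme can be sketched as follows; the neighbour list, similarity scores and sense identifiers below are hypothetical stand-ins for the Lin-style thesaurus and the WordNet similarity package output:

```python
def rank_senses(senses, neighbours, wnss):
    """senses: sense ids of the target noun.
    neighbours: list of (neighbour_word, distributional_similarity).
    wnss: (sense, neighbour_word) -> WordNet similarity score.
    Each neighbour votes for every sense, weighted by its own
    distributional similarity and by how close it is to that sense
    relative to the other senses of the target word."""
    scores = {}
    for sense in senses:
        total = 0.0
        for word, dss in neighbours:
            norm = sum(wnss.get((s, word), 0.0) for s in senses)
            if norm > 0:
                total += dss * wnss.get((sense, word), 0.0) / norm
        scores[sense] = total
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical scores for "star": celebrity vs. celestial-body sense,
# with neighbours drawn from a thesaurus over balanced BNC text.
senses = ["star_celebrity", "star_celestial"]
neighbours = [("actor", 0.25), ("planet", 0.18), ("singer", 0.21)]
wnss = {("star_celebrity", "actor"): 0.9, ("star_celestial", "actor"): 0.1,
        ("star_celebrity", "planet"): 0.1, ("star_celestial", "planet"): 0.8,
        ("star_celebrity", "singer"): 0.8, ("star_celestial", "singer"): 0.2}
print(rank_senses(senses, neighbours, wnss))
```

With neighbours drawn instead from a domain corpus (say, sports newswire), the same computation would promote a different sense, which is the motivation for acquiring the ranking from untagged text.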

We are investigating the extent to which the predominant sense acquired from parsed BNC data agrees with the data in SemCor, using evaluation metrics for measuring the accuracy of finding the predominant sense in SemCor, and a measure of how well the acquired predominant sense performs at disambiguating the SemCor tokens, given that it is only a heuristic and does not take context into account. As well as being useful for determining the top-ranking senses of a word, we hope that our method will be good for identifying infrequent and potentially redundant senses. This will be particularly useful when applying the method to domain-specific text, rather than balanced text like that in the BNC and SemCor. At this stage we are investigating the use of a threshold which filters a constant percentage of sense tokens for identifying rare senses. We quantify how many of the filtered noun senses do not occur in SemCor, and the percentage of tokens that would be filtered erroneously from SemCor if these senses were indeed removed.

We are performing preliminary experiments with two domain-specific sections of the Reuters corpus (Sports and Finance). We determine the extent to which the predominant sense of a sample of words in these two corpora differs from that acquired from the balanced BNC data, and seek to demonstrate that the predominant senses are intuitive, given these domains.

The working paper WP5.11 (experiment 5.K) will report the results of these experiments.

6 Design of WP6 Word Sense Disambiguation

6.1 Introduction

The initial design of WP6 included, for the first Meaning cycle, six experiments covering most of the WSD0 topics and some of WSD1. All of them are ongoing. For this new round, the consortium decided to continue all the experiments, adding new experiments to cover topics of WSD1 and WSD2.

This work package has two main objectives. On the one hand, we will develop state-of-the-art all-words WSD systems for all languages, which will act as baselines. The systems will be based on currently existing tagged corpora for English, and on unsupervised knowledge-based methods for languages other than English.

On the other hand, several experiments have been designed to explore how to improve current WSD technology. Some experiments explore algorithms and more informed features in order to improve the accuracy of supervised WSD systems. Other experiments seek to break the acquisition bottleneck, using a combination of automatically acquired examples, or supplementing labelled data with large amounts of unlabelled data via bootstrapping techniques or transductive algorithms.

Below we describe the design of a set of twelve experiments (A – L).

Experiment 6.A All-words WSD systems for English


Experiment 6.B High Precision English WSD for bootstrapping

Experiment 6.C WSD based on automatic acquisition of high quality sense examples from large text collections.

Experiment 6.D Transductive approach using Support Vector Machines on labelled and unlabelled data.

Experiment 6.E All-words WSD systems for the rest of languages

Experiment 6.F Contribution of linguistically more informed features in supervised WSD

Experiment 6.G Unsupervised WSD

Experiment 6.H Bootstrapping

Experiment 6.I Effect of Sense Clusters

Experiment 6.J Semantic Class Classifiers

Experiment 6.K Effect of Ranking Senses Automatically

Experiment 6.L Disambiguating WN Glosses

Experiment A plans to evaluate different systems to produce a baseline state-of-the-art all-words system for English. This system will be evaluated in the Senseval-3 English all-words task. Experiment E evaluates the portability of those systems to the other languages in Meaning.

As experiments B, C and D explore different ways to bootstrap supervised WSD into unsupervised or minimally supervised systems, they have been integrated into a common Experiment H (Bootstrapping). Experiment B aims to produce high-precision systems that could feed supervised systems. Experiment C tries to automatically acquire examples for word senses and train supervised systems with them. Experiment D adds unlabelled data to existing training data to test whether performance improves.

Experiment F will examine the features used by present supervised WSD systems. It will especially analyze the contribution of syntactic and semantic features.

After examining the results of experiment A, the baseline system for WSD0 in English will be deployed in this round. Experiment E should provide results for languages other than English. Experiments B, C and D should provide preliminary results on breaking the acquisition bottleneck, and experiment F should provide clues about the contribution of additional features. Experiment G is devoted to studying different unsupervised WSD methods. Experiments I and J test, in the WSD scenario, the results of experiments WP5.H (Sense Clusters) and WP5.K (Ranking Senses Automatically).

The last cycle of WSD will produce all-words systems which improve on the baseline system deployed in WSD1. Experiments B to F will provide clues on the best way to accomplish this.

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies

Page 52: Basic Design of the architecture and methodologies (second round)

Work Package 2-D2.2 Version: DraftBasic Design of the architecture and methodologies (second round) Page : 51

6.2 Experiment 6.A – English all-words WSD system

6.2.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.2

• Current reports: Working Papers WP6.2a, WP6.2b

• Summary: D6.1 Section 4.1

• Status: Active

6.2.2 Introduction, Background and Goals

This report presents the work carried out to date in the natural extension and completion of the work carried out in experiment A of WSD0 (see WP-6.2). Since the consortium decided to participate as a single Meaning group in the English All-Words task of Senseval-3, we have been working on designing and constructing an improved all-words prototype for on-line WSD.

Improvements will be made at several levels, including:

• Linguistic processing of text

• Feature codification

• Combination of several word/class-experts (coming from different ML algorithms, feature sets, etc.)

• Unsupervised classification of words not covered by the Machine Learning components (due to the scarcity of training data)

• Mapping between WN-1.6 and WN-1.7

• On-line processing of texts

The prototype constructed will be the basis for the Meaning WSD system that will participate in the Senseval-3 English all-words task, and for further developments in WSD2.

Participant groups: TALP, UPV/EHU, IRST, Sussex.
Experiment Coordination: TALP (Lluís Màrquez)

6.2.3 Source Data and Tools

• SemCor corpus: as the basis for training data

• Mcr0 (English): to obtain extra examples and richer features

• Senseval-2 all-words corpus: for evaluation


• Machine Learning algorithms to train WSD classifiers: MaxEnt, DL, AdaBoost, SVM (see WP-6.2), and Naive Bayes.

• A pipeline of linguistic analyzers: including segmentation, POS tagging, lemmatisation, detection of named entities, detection of multiword expressions and phrasal verbs, and parsing of syntactic relations.

• A feature extractor: working on the output of the pipeline of linguistic analyzers

• This experiment is connected with experiment J (semantic class classifiers): the classifiers learnt in experiment J may be included in the ensemble of classifiers to combine. It is also connected with experiment F, since the features designed in that experiment may be reused here.

6.2.4 Design and Architecture

To achieve a competent all-words prototype, effort must be devoted to several issues:

1. Linguistic processing of text. A pipeline consisting of a set of the best linguistic processors has been constructed. Its architecture is described in the following section.

2. Feature codification. The feature extractor, which will work on the output of the pipeline of linguistic processors, will include features from the Mcr (i.e. the Sumo ontology, domains, etc.) and also those developed in experiment F. The feature extractor is already developed.

3. A mix of the best word/class-experts coming from different ML algorithms: SVM, MaxEnt, DL, AdaBoost, Naive Bayes and cosine vector similarities. These systems are being trained on SemCor 1.6. This mix will be obtained as a result of combination schemes like majority voting, weighted voting, pairwise voting, etc.

4. Unsupervised, knowledge-based systems for words not covered by the Machine Learning components will be integrated. See the unsupervised WSD processors developed in experiments G and E.
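The combination schemes mentioned in point 3 can be illustrated with a minimal sketch of majority and weighted voting over word-expert outputs. The classifier names, sense labels and weights below are invented for illustration; they are not actual system outputs.

```python
from collections import defaultdict

def majority_vote(predictions):
    """Pick the sense predicted by most classifiers (ties broken arbitrarily).

    `predictions` maps classifier name -> predicted sense label.
    """
    counts = defaultdict(int)
    for sense in predictions.values():
        counts[sense] += 1
    return max(counts, key=counts.get)

def weighted_vote(predictions, weights):
    """Weight each classifier's vote, e.g. by its held-out accuracy."""
    scores = defaultdict(float)
    for clf, sense in predictions.items():
        scores[sense] += weights.get(clf, 1.0)
    return max(scores, key=scores.get)

# Hypothetical word-expert outputs for one occurrence of "bank"
preds = {"SVM": "bank%1", "MaxEnt": "bank%1", "DL": "bank%2",
         "AdaBoost": "bank%2", "NB": "bank%1"}
print(majority_vote(preds))                                # bank%1 (3 votes to 2)
print(weighted_vote(preds, {"DL": 2.0, "AdaBoost": 2.0}))  # bank%2 (4.0 to 3.0)
```

Pairwise voting and meta-learning would replace the simple score accumulation with per-pair decisions or a learned combiner, but the interface stays the same.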

The architecture of the prototype is as follows: the input of the pipeline will be raw text. Once this text has been processed by the linguistic processors, it serves as the input for the feature extractor. These features will be used to feed our ML algorithms and build the classification models. The SemCor corpus will be used to train the prototype and build the models. The Mcr will be used as the source to obtain extra examples and richer features. Finally, the Senseval-2 all-words corpus will be used to test the system.

The linguistic processors pipeline architecture consists of the following parts:

• Tokenizer Segments the text.


• Tagger In charge of POS tagging. We use the SVMTagger described in [Gimenez and Marquez, 2003], which uses the WSJ tag set.

• Tag Converter Translates the WSJ POS tags into WordNet tags.

• Lemmatiser Provides the lemma. The tool used for this task is the WordNet-based wnmorph software.

• NER Detects named entities. We use the abionet system described in [Carreras et al., 2002a].

• Multiword and Phrasal Verb Detection Based on lists extracted from WordNet 1.7.

• Minipar Syntactic relations coming from an improved version of Minipar will be integrated into the pipeline.
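The chaining of these stages can be sketched as follows. The stage bodies are crude placeholders (whitespace tokenization, a constant tagger), not the behaviour of the actual SVMTagger or wnmorph tools; only the composition pattern is the point.

```python
def tokenize(text):
    # Placeholder for the real tokenizer: naive whitespace segmentation.
    return [{"form": w} for w in text.split()]

def pos_tag(tokens):
    # Placeholder for the tagger: tag everything as a singular noun.
    for tok in tokens:
        tok["wsj_pos"] = "NN"
    return tokens

def convert_tags(tokens):
    # WSJ tag set -> WordNet POS (n, v, a, r); partial illustrative map.
    wsj_to_wn = {"NN": "n", "NNS": "n", "VB": "v", "JJ": "a", "RB": "r"}
    for tok in tokens:
        tok["wn_pos"] = wsj_to_wn.get(tok["wsj_pos"], "n")
    return tokens

def run_pipeline(text, stages):
    # Each stage consumes the previous stage's output.
    data = text
    for stage in stages:
        data = stage(data)
    return data

tokens = run_pipeline("banks lend money", [tokenize, pos_tag, convert_tags])
print(tokens[0])  # {'form': 'banks', 'wsj_pos': 'NN', 'wn_pos': 'n'}
```

The lemmatiser, NER, multiword detection and Minipar stages would slot into the same list, each enriching the token dictionaries in turn.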

6.2.5 Steps/Experiments

1. Finish the experiment on exploiting hypo/hypernymy relations to obtain more training examples from SemCor (see section 2.7 of WP6.2). The remaining work is to test examples coming from the hypernymy relation and from the glosses and definitions of the English WordNet, and to devise the best aggregation scheme. The results of this part will produce a Working Paper, and will allow us to decide whether or not to include the extra examples for the final learning of the all-words prototype. Current status: Completed, with positive results in a restricted framework but negative ones when the scope is widened.

2. Development and testing of the pipeline of linguistic processors and feature extractor, with a description of tools and formats. A comparison between the UPV/EHU and TALP linguistic modules will be performed in order to decide the best options. Modules tested will be: tokenizers, POS taggers, lemmatisers, named entity detectors, multiword detectors, phrasal verb detectors, etc. External parsers will also be tested. Current status: Almost completed, some decisions remain to be taken.

3. Feature codification of training examples from SemCor and other sources. The feature codification module will include some extra features from Mcr0 and also those developed in experiment F. Current status: Almost completed.

4. Retraining of the basic WSD classifiers on this new set of examples: DL, Naive Bayes, MaxEnt, AdaBoost, SVM. Current status: Ongoing.

5. Simple combinations of the basic WSD classifiers (majority voting, weighted voting, pairwise voting, etc.) and evaluation of the whole system on the Senseval-2 all-words corpus. Current status: Pending.


6. Inclusion of a knowledge-based module for default classification of words not covered by the ML classification modules. This is related to the unsupervised WSD processors developed in experiments G and E. Current status: Pending.

7. Including in the ensemble, if available, the classifiers of experiment J (semantic-class experts), and exploring simple ways of combination: cascade filtering, voting, etc.

8. More complex combination strategies involving meta-learning.

Note: experiments 7 and 8 are extensions that will be delayed until next MEANING cycle.

6.2.6 Evaluation and Discussion

• The evaluation during development will be done on the Senseval-2 all-words corpus and compared to the CNTS-Antwerp system, which is based on a similar architecture. Local classifiers may be optimized on small sets of selected words (lexical samples).

• The evaluation of the final Meaning-WSD1 all-words system will be done in the Senseval-3 “English all-words” task in March 2004.

Calendar:

• Mid-late February 2004: training of all Machine Learning and knowledge-based modules and selection of combination techniques (meta-learning strategies, etc.).

• March-April 2004: SENSEVAL-3 competition.

6.2.7 Extensions of WSD1 (future work for WSD2)

Perform the sense disambiguation in context (not word by word as independent classification problems), taking into account the global coherence of the text sense taggings. For that, we need the local classifiers to provide probabilities/confidence values over the possible tags for each word, and a sequence-tagging scheme for finding the best (most probable/confident) path among all possibilities. This may be done by using a generative model (HMM, ME, CRF), relaxation labelling [Padro, 1998], “inference with classifiers” [Carreras et al., 2002b], etc. It would also be very interesting to explore all-words WSD in connection with the semantic parsing level [Atserias et al., 2001]. In this case, a two-level interacting WSD-parsing architecture could be trained at the same time by using on-line learning with global feedback [Carreras and Marquez, 2003a; Carreras and Marquez, 2003b]. Again, relaxation labelling is another alternative.

6.3 Experiment 6.B – High Precision English WSD for bootstrapping

6.3.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.3


• Current Design: Integrated with experiment WP6.H

• Current reports: not available in WSD0

• Summary: D6.1 Section 4.3

In WSD0 no new experiments were carried out on high precision WSD. However, our current baseline systems were summarised in D6.1 Section 4.3 (supervised WSD, unsupervised WSD and unsupervised Selectional Preferences).

In WSD1 we will use our current implementation of high precision WSD for English to acquire more accurate Selectional Preferences (see WP5 ACQ Experiment G).

Experiments B, C and D have now been integrated together in a major experiment H (Bootstrapping).

6.4 Experiment 6.C – High quality sense examples

6.4.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.4

• Current Design: Integrated with experiment WP6.H

• Current reports: Working Paper WP6.4

• Summary: D6.1 Section 4.4

The goal of this experiment is to evaluate up to which point we can automatically acquire examples for word senses and train accurate supervised WSD systems on them. Experiments B, C and D have now been integrated in a major experiment WP6.H (Bootstrapping).

6.5 Experiment 6.D – Transductive Support Vector Machines

6.5.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.5

• Current Design: Integrated with experiment WP6.H

• Current reports: Working Paper WP6.5

• Summary: D6.1 Section 4.5

In this experiment we tested whether the addition of unlabelled data to existing training data contributes to improving the performance of purely supervised systems using Transductive Support Vector Machines. Looking at the results of the experiments of this work, we can conclude that, at this moment, the use of unlabelled examples with the transductive approach is not useful for WSD. However, there are other bootstrapping techniques for taking advantage of the unlabelled data that need further investigation. Experiments B, C and D have now been integrated in a major experiment H (Bootstrapping).

6.6 Experiment 6.E – All-words WSD systems for the rest of languages

6.6.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.6

• Current reports: Working Paper WP6.6

• Summary: D6.1 Section 4.6

• Status: Active


6.6.3 Introduction and Background

The Meaning consortium planned a set of experiments in order to evaluate the portability of WSD technologies among the languages considered in the project. In particular, the aim of Experiment 6.E is to verify the performance of a WSD system based on domain information in an “all words” task for Italian, Catalan, Spanish and Basque. The domain-based system is Domain Driven Disambiguation (DDD) [Magnini et al., 2002b], developed at ITC-irst, which is based on the information included in WordNet Domains. In order to perform the evaluation, a gold standard for each of the languages of the experiment is required. Since such a gold standard for the “all words” task is currently available only for Italian (i.e. the MultiSemCor corpus), we started with this language, postponing the evaluation on the other languages to the second round of the Meaning project.

6.6.4 Source Data

In the experiment we used WordNet Domains [Magnini and Cavaglia, 2000], freely available for research purposes at http://wndomains.itc.it, and already included in the Meaning MCR.


In addition, as test set we have used MultiSemCor [Bentivogli and Pianta, 2002]. The MultiSemCor project aims at creating a semantically annotated corpus by exploiting information contained in an already annotated corpus. The main hypothesis underlying this methodology is that, given a text and its translation into another language, the translation preserves to a large extent the meaning of the source language text. This means that if one of the two texts is already semantically tagged, and if we can align the parallel texts at word level, it should be possible to transfer the word sense annotation from the tagged text to its translation. The final result of the project will be the MultiSemCor corpus, an Italian corpus annotated with PoS, lemma and word sense, but also an aligned English/Italian parallel corpus lexically annotated with a shared inventory of word senses.

6.6.5 Experiment

In order to test the DDD system we used three texts (br-b13-ita-gs, br-f03-ita-gs, br-p12-ita-gs) in MultiSemCor. These three texts have been manually aligned and sense annotated by lexicographers, so they can be considered a gold standard. There are a total of 1087 instances to disambiguate (109 adjectives, 571 nouns, 86 adverbs, 312 verbs).

6.6.6 Evaluation

In Table 5 we report the results. In the last row we report the performance of the system on the same three parallel English texts from MultiSemCor. The difference in performance (better for English) is likely due to the fact that the relevance thresholds calculated on the English BNC corpus were also used for Italian.

                precision  recall  F1    coverage
a               0.77       0.65    0.71  0.84
n               0.67       0.40    0.50  0.60
r               0.79       0.77    0.78  0.98
v               0.54       0.20    0.29  0.37
Total           0.68       0.40    0.50  0.58
English SemCor  0.84       0.53    0.65  0.63

Table 5: Performance of DDD on Italian and English SemCor
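For reference, the four figures reported in Table 5 relate to each other as follows; the sketch below uses toy counts (not the actual DDD tallies) to show the standard partial-coverage WSD metrics.

```python
def wsd_scores(correct, answered, total):
    """Standard partial-coverage WSD metrics.

    precision = correct / answered,  recall = correct / total,
    coverage = answered / total,     F1 = harmonic mean of P and R.
    """
    precision = correct / answered
    recall = correct / total
    f1 = 2 * precision * recall / (precision + recall)
    coverage = answered / total
    return precision, recall, f1, coverage

# Toy figures: 4 correct out of 5 attempted, 10 instances in total
p, r, f1, cov = wsd_scores(correct=4, answered=5, total=10)
print(round(p, 2), round(r, 2), round(f1, 2), round(cov, 2))  # 0.8 0.4 0.53 0.5
```

A system that abstains on hard instances (low coverage) can thus keep precision high while recall drops, which is exactly the trade-off visible in the verb row of the table.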

6.7 Experiment 6.F – More informed features

6.7.1 Current Status

• Initial Design: Deliverable D2.1 Section 6.7

• Current reports: not available in WSD0

• Summary: D6.1 Section 4.7

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies

Page 59: Basic Design of the architecture and methodologies (second round)

Work Package 2-D2.2 Version: DraftBasic Design of the architecture and methodologies (second round) Page : 58

• Status: Active

6.7.2 Introduction and Background

Many current Natural Language Processing (NLP) systems rely on linguistic knowledge acquired from tagged text via Machine Learning (ML) methods. The main problem facing such systems is the sparse data problem. Both in NLP in general and in Word Sense Disambiguation (WSD), most events occur rarely, even when large quantities of training data are available. Besides, fine-grained analysis of the context requires a representation with many features, some of them very rare, but which can be very informative. Therefore, the estimation of rarely-occurring features is crucial to achieve high precision. In WSD, if all occurrences of a feature for a given word occur with the same sense, Maximum Likelihood Estimation (MLE) would give a 0 probability to the other senses of the word given the feature, which is a severe underestimation.

Smoothing is the technique that tries to estimate the probability distribution that approximates the one we expect to find in held-out data. Apart from improving the performance of ML methods, another motivation for a better estimation of features is that it can be a way to detect good features for bootstrapping, even for low amounts of training data.

In this work, we test the smoothing method proposed by Yarowsky in his PhD thesis [Yarowsky, 1995] for the WSD task. We implement an approach that relies on grouping features with the same raw frequency for the target word (or target PoS, for back-off), and on interpolation curves for the observed frequencies. The impact of several smoothing strategies is also presented.
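As a minimal illustration of why smoothing matters here, the sketch below contrasts MLE with simple linear interpolation against the sense prior. This is not Yarowsky's frequency-grouping method itself, and the sense labels, counts and interpolation weight are invented.

```python
from collections import Counter

def mle(feature_counts):
    """Maximum Likelihood Estimation: unseen senses get probability 0."""
    total = sum(feature_counts.values())
    return {s: c / total for s, c in feature_counts.items()}

def interpolate(feature_counts, prior, lam=0.8):
    """Linear interpolation with the sense prior.

    Senses never observed with this feature get (1 - lam) * prior
    instead of the MLE's severe zero.
    """
    total = sum(feature_counts.values())
    return {s: lam * (feature_counts.get(s, 0) / total) + (1 - lam) * p
            for s, p in prior.items()}

# Toy data: the feature "money" observed 3 times, always with sense bank%1
prior = {"bank%1": 0.6, "bank%2": 0.4}   # overall sense distribution
feat = Counter({"bank%1": 3})
print(mle(feat).get("bank%2", 0.0))                 # 0.0 (underestimation)
print(round(interpolate(feat, prior)["bank%2"], 2)) # 0.08 (smoothed)
```

The frequency-grouping approach goes further: instead of a fixed lambda, it estimates how trustworthy a count of c really is by pooling all features observed c times with the target word.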

6.7.3 Source Data

The experiments will be performed using the Senseval-2 English Lexical-Sample data [Edmonds and Cotton, 2001]. This will allow us to compare our results with the systems in the competition and with other recent works that have experimented on this dataset. The corpus consists of 73 target words (nouns, verbs, and adjectives), with 4,328 testing instances and approximately twice as many training instances. We will use the training part with cross-validation to estimate the parameters for the ML methods, when needed, and to obtain the smoothed frequencies for the features. The systems will be trained on the training part and tested on the testing data.

6.7.4 The Experiment

Four ML methods from the literature will be applied with six smoothing strategies. The methods are Naive Bayes (NB), Support Vector Machines (SVM), Decision Lists (DL), and the Vector Cosine Model (Vector). The smoothing techniques range from no smoothing at all to the combination of different approaches. Not all combinations are possible, because some methods cannot work without some smoothing (NB). Some other methods do not require complex smoothing techniques, but their application was tested (Vector). The combination of the different learners will also be tested.

6.8 Experiment 6.G – Unsupervised WSD

6.8.1 Introduction and Background

The Tree Cut Models (tcms) that we used for unsupervised wsd in the first round of MEANING [McCarthy and Carroll, 2003] have typically suffered from an overly high level of generalisation; that is, classes which are very high in the WordNet hierarchy are used to represent the preferences. Prototypical classes such as food as the direct object of eat are sometimes hidden by a selectional preference for a class further up the hyponym hierarchy, such as entity. This is partly because of the polysemy of the training data and partly because the tcm method, using the minimum description length principle, covers all the data rather than looking for prototypical classes. The over-generalisation has an effect on wsd, since the conditional probability for different senses will be the same if they fall under the same node in the cut. How well they are distinguished then depends on the marginal probabilities calculated from the data for the grammatical relation irrespective of the verb, since these are used in a ratio with the conditional probability to estimate the probability of a sense.

In work package 5.G, we describe a new type of selectional preference model, a protomodel (for prototypical), which aims to cover only a portion of the data, where that portion can be disambiguated and where the disambiguation is performed using a ratio of types in a class, rather than tokens. The aim of these models is to reduce the impact of noise, atypical arguments and polysemy by only covering data which can be disambiguated with reference to the other arguments. The effect of polysemy is also lessened by using types, rather than tokens, to define the classes that represent the probability distribution for the verb and grammatical relation. Thus frequent, but polysemous, items will be prevented from giving rise to erroneous classes and overly-general preferences. We are investigating the benefits of these more specific models on disambiguating polysemous nouns within SemCor. We acquire tcms and protomodels for the direct object slots of all the verbs in SemCor using the parsed BNC for training data. We then use these to disambiguate the polysemous nouns acting as direct objects. We test these models on the task of disambiguating the nouns acting as direct objects within the Senseval-2 English all-words data.
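To make the over-generalisation problem concrete, here is a toy sketch of scoring noun senses against a cut. The preference masses, sense labels and hypernym chains are all invented; a real model would derive them from WordNet and the parsed BNC.

```python
# Invented preference models over WordNet classes for the direct object of
# "eat": a prototypical cut versus an over-generalised one.
specific_cut = {"food": 0.7, "entity": 0.3}
general_cut = {"entity": 1.0}

# Invented hypernym chains, ordered from specific to general
hypernyms = {
    "chicken%food": ["food", "entity"],
    "chicken%animal": ["animal", "entity"],
}

def score_sense(sense, cut):
    """Score a sense by the first (most specific) class above it in the cut."""
    for cls in hypernyms[sense]:
        if cls in cut:
            return cut[cls]
    return 0.0

best = max(hypernyms, key=lambda s: score_sense(s, specific_cut))
print(best)  # chicken%food: 0.7 via "food" beats 0.3 via "entity"

# Under the over-generalised cut both senses fall under "entity" and tie,
# so the cut alone cannot distinguish them.
print(score_sense("chicken%food", general_cut) ==
      score_sense("chicken%animal", general_cut))  # True
```

The protomodels aim to keep cuts like the first one by only modelling the disambiguable, prototypical portion of the data.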

The working paper WP6.10 (experiment 6.G) will report the results of these experiments.

6.9 Experiment 6.H – Bootstrapping

6.9.1 Introduction and Background

This experiment tries to organize into a common experiment all the efforts related to bootstrapping WSD. In D2.1 these efforts were dispersed across the following experiments:


Experiment B High Precision English WSD for bootstrapping

Experiment C WSD based on automatic acquisition of high quality sense examples from large text collections.

Experiment D Transductive approach using Support Vector Machines on labelled and unlabelled data.

These efforts have continued in the current round, where we have focused on the following:

Experiment H a) The effect of bias on an automatically built word sense corpus

Experiment H b) Word Sense Classification Using Topic Signatures Automatically Acquired Based on a Second Language

Experiment H c) High Precision Word Sense Disambiguation Based on Maximum Entropy Probability Models

The results of recent WSD exercises (e.g. Senseval-2) show clearly that WSD methods based on hand-tagged examples are the best performing. However, one of the main drawbacks of supervised WSD is the acquisition bottleneck, as the systems need large amounts of hand-tagged data, which is costly to obtain. In order to overcome this, different research lines are being pursued:

a. the automatic acquisition of training examples. We will apply the “monosemous relatives” method [Leacock et al., 1998] on the Web in order to test the validity of this source of knowledge (cf. Experiment 5.D).

b. the automatic acquisition of English topic signatures, using very large text collections in a second language (Chinese is used at this moment), either retrieved from the Web or from available text corpora, together with bilingual dictionaries (cf. Experiment 5.D).

c. devising and testing bootstrapping algorithms that use existing training data to tag unseen instances with high precision, in a similar fashion to co-training [Blum and Mitchell, 1998].
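The “monosemous relatives” idea in (a) can be sketched as follows. The sense labels, relative lists and mini-corpus are invented, and a real system would query the Web rather than scan a list of sentences; only the harvesting and substitution logic is shown.

```python
# Hypothetical sense inventory: each sense of "church" lists monosemous
# relatives (synonyms/hyponyms with a single WordNet sense).
mono_relatives = {
    "church%building": ["kirk", "abbey"],
    "church%institution": ["christian_church"],
}

def harvest(corpus_sentences, relatives, target="church"):
    """Label any sentence containing a monosemous relative with that
    relative's sense, substituting the target word for the relative
    (substitution is one of the parameters studied in the experiment)."""
    examples = []
    for sense, rels in relatives.items():
        for rel in rels:
            for sent in corpus_sentences:
                if rel in sent:
                    examples.append((sent.replace(rel, target), sense))
    return examples

corpus = ["the abbey was built in 1200", "the christian_church split in 1054"]
for text, sense in harvest(corpus, mono_relatives):
    print(sense, "->", text)
```

Since "abbey" and "christian_church" each have a single sense, the sentences they occur in become (noisy) training examples for the corresponding sense of "church"; the bias parameter mentioned later controls how many such examples per sense are kept.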

6.9.2 Source Data

Each of the experiments uses different but related data:

a. For training, we will use the all-words hand-tagged SemCor corpus and the Senseval lexical-sample training corpus. The Senseval lexical-sample testing corpus was used for testing. The target word-set was formed by the 29 nouns in the English lexical-sample task. These hand-tagged examples were complemented with the automatically acquired corpus.


b. The Web, available text corpora, and bilingual dictionaries, alongside the interest and plant hand-tagged corpora.

c. The interest [Bruce and Wiebe, 1994] and DSO [Ng and Lee, 1996] corpora have been used.

6.9.3 The Experiments

We will go through each experiment in turn:

a. The goal of this experiment is to evaluate up to which point we can automatically acquire examples for word senses and train accurate supervised WSD systems on them. The method is based on the monosemous relatives of the target words, and we will study some parameters that affect the quality of the acquired corpus: the distribution of the number of training instances per word sense (bias), the substitution or not of the monosemous relative for the target word, and the type of features used for disambiguation (local vs. topical). The automatically acquired corpus will be tested separately, and in combination with Senseval and SemCor. The evaluation will be performed using Decision Lists [Yarowsky, 1994] and Support Vector Machines [Joachims, 1999].

b. Topic signatures will be used to train the so-called “context-group discrimination” algorithm, and the system will be tested on hand-tagged corpora.

c. In this experiment a new algorithm called re-training will be tried on the task of tagging new instances with high precision, and compared to the co-training method.
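A minimal sketch of the Decision List learner used for evaluation in (a): each (feature, sense) pair is weighted by a smoothed log-odds ratio and the strongest matching rule decides. The training pairs, sense labels and smoothing constant below are invented.

```python
import math
from collections import defaultdict

def train_dl(examples, alpha=0.1):
    """Build a decision list: weight(feature, sense) is a log-odds ratio
    with simple additive smoothing; `examples` is (features, sense) pairs."""
    counts = defaultdict(lambda: defaultdict(float))
    senses = set(s for _, s in examples)
    for feats, sense in examples:
        for f in feats:
            counts[f][sense] += 1
    rules = []
    for f, per_sense in counts.items():
        total = sum(per_sense.values())
        for s in senses:
            c = per_sense.get(s, 0.0)
            weight = math.log((c + alpha) / (total - c + alpha))
            rules.append((weight, f, s))
    rules.sort(reverse=True)            # strongest evidence first
    return rules

def classify(rules, feats, default):
    # The first rule whose feature is present decides; back off otherwise.
    for weight, f, s in rules:
        if f in feats:
            return s
    return default

train = [({"money", "loan"}, "bank%1"), ({"money"}, "bank%1"),
         ({"river"}, "bank%2")]
rules = train_dl(train)
print(classify(rules, {"river", "water"}, default="bank%1"))  # bank%2
```

Restricting predictions to rules above a weight threshold turns the same list into the kind of high-precision, partial-coverage tagger that re-training and co-training need.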

6.10 Experiment 6.I – Effect of Sense Clusters

6.10.1 Current Status

• Initial Design: To be integrated in Deliverable D2.3

• Status: New experiment

We also plan to investigate the effect in WSD of the sense clusters acquired in Experiment WP5.H “Coarse-grained sense clusters”. The sense clusters acquired using Topic Signatures can be tested indirectly in the WSD task: they could be used to binarize the multiclass problem according to data-based clusters, and they could also be used to return multiple word sense tags instead of a single word sense.
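One simple way to use such clusters for coarse-grained scoring can be sketched as follows; the clusters and sense labels are invented for illustration.

```python
# Hypothetical coarse clusters over the senses of one word
clusters = [{"line%cord", "line%wire"}, {"line%queue"}]

def same_cluster(pred, gold):
    """Coarse-grained scoring: a prediction counts as correct when it
    falls in the same data-derived cluster as the gold sense."""
    return any(pred in c and gold in c for c in clusters)

print(same_cluster("line%cord", "line%wire"))   # True: fine senses conflated
print(same_cluster("line%cord", "line%queue"))  # False: different clusters
```

Returning the whole cluster instead of a single sense follows the same membership test, and the binarized one-cluster-vs-rest classifiers simply use these sets as their class labels.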

6.11 Experiment 6.J – Semantic Class Classifiers

6.11.1 Current Status

• Initial Design: To be integrated in Deliverable D2.3


• Status: New experiment

Instead of learning word experts from SemCor (classifiers that learn models to distinguish word senses), we plan to learn multiple semantic class experts (classifiers that learn to distinguish semantic classes). Those semantic classes will be different partitions/views of the annotated sense examples of SemCor. Finally, we also plan to combine all (word and semantic class) classifiers into a single one. In that way, the classifiers will be experts in particular semantic classes: Semantic File, EuroWordNet Top Ontology or Sumo descriptors, MultiWordNet Domains, etc. We hope that this approach will produce more robust, higher-level classifiers.
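The relabelling step behind this idea can be sketched as follows; the sense keys and their mapping to WordNet Semantic Files are invented (a Top Ontology or domains partition would be analogous).

```python
# Hypothetical mapping from fine WordNet senses to one coarse partition
# (here, WordNet Semantic Files).
sense_to_class = {
    "bank%1:14:00": "noun.group",
    "bank%1:17:01": "noun.object",
    "money%1:21:00": "noun.possession",
}

def relabel(semcor_examples):
    """Turn sense-tagged training examples into class-tagged ones, so a
    single classifier can pool evidence across many different words."""
    return [(feats, sense_to_class[sense])
            for feats, sense in semcor_examples
            if sense in sense_to_class]

examples = [({"ctx": "loan"}, "bank%1:14:00"),
            ({"ctx": "river"}, "bank%1:17:01")]
print(relabel(examples))
```

Because every annotated occurrence in SemCor contributes to some class, the class experts see far more training data per label than any individual word expert does, which is the hoped-for source of robustness.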

6.12 Experiment 6.K – Effect of Ranking Senses Automatically

This experiment uses the predominant senses automatically acquired from the parsed data of the BNC and investigates the WSD performance on the Senseval-2 English All-Words data. Evaluation of the WSD accuracy of this automatically acquired heuristic is also reported for disambiguating the nouns within SemCor. The working paper WP5.11 (experiment 5.K) will report the results of these experiments.

6.13 Experiment 6.L – Disambiguating WN Glosses

6.13.1 Introduction and Background

One of the main future improvements in WordNet is a planned hand-tagging of the WordNet glosses with their WordNet senses.

Sense disambiguation of definitions in any lexical resource is an important objective in the language engineering community. The first significant disambiguation of dictionary definitions and creation of a hierarchy took place 25 years ago in the groundbreaking work of Robert Amsler (see [Rigau, 1998] for an extended survey on acquiring lexical knowledge from MRDs).

In the eXtended WordNet21 [Mihalcea and Moldovan, 2001] the WordNet glosses are syntactically parsed, transformed into logic forms, and the content words are semantically disambiguated. The key idea of the eXtended WordNet project is to exploit the rich information contained in the definitional glosses, which is now used primarily by humans to identify correctly the meaning of words. In the first version of the eXtended WordNet released, XWN 0.1, the glosses of WordNet 1.7 are parsed, transformed into logic forms, and the senses of the words are disambiguated. Being derived from an automatic process, the disambiguated words included in the glosses are assigned a confidence label indicating the quality of the annotation (gold, silver or normal).

The OntoWordNet project aims to achieve a formal specification of WordNet. As an intermediate step, they also apply an automatic WSD system to the WordNet glosses [Gangemi et al., 2003]. In this case, they also use a set of heuristics, but in an iterative process.

21http://xwn.hlt.utdallas.edu/

6.13.2 Source Data

The goal of eXtended WordNet, an ongoing project at the Human Language Technology Research Institute, University of Texas at Dallas, is to develop a WSD tool that takes as input the current or future versions of WordNet and automatically generates an eXtended WordNet that provides several important enhancements intended to remedy the present limitations of WordNet. In the eXtended WordNet the WordNet glosses are syntactically parsed, transformed into logic forms, and content words are semantically disambiguated.

The eXtended WordNet project [Harabagiu et al., 1999] aims to transform the WordNet glosses into a format that allows the derivation of additional semantic and logic relations. The first version of the eXtended WordNet, used in this experiment, is based on WordNet 1.7. The eXtended WordNet comprises three parts: part-of-speech tagging and parsing, logic form transformation, and semantic disambiguation. In this experiment we will only use the word sense disambiguation information.

POS    Variants  Synsets
noun   107930    74488
verb   10806     12754
adj    21365     18523
adv    4583      3612
Total  144684    109377

Table 6: WordNet 1.7 Statistics

WordNet 1.7 contains a total of 109,367 glosses, divided into noun, verb, adjective and adverb glosses (see Table 6). In order to be consistent with the logic form transformation and the parse trees, the examples and the comments in parentheses were removed from each gloss. This resulted in 564,748 disambiguated open class words. For disambiguating these open class words they used both manual and automatic annotation. Automatic annotation was done using two programs: one specially designed to disambiguate the WordNet glosses, called XWN WSD, and an in-house system for WSD of open text. A vote between the two systems was performed, and a precision of 90% was estimated for the words tagged with the same sense by both systems. The precision of annotation was classified as “gold” for manually checked words, “silver” for the words automatically tagged with the same sense by both disambiguation systems, and “normal” for the rest of the words automatically annotated by the XWN WSD system (around 70% precision). Word forms corresponding to the verbs “to be” and “to have” were not disambiguated automatically. Table 7 presents the number of open class words in each category for the sets of glosses corresponding to each part of speech for XWN 1.0.

IST-2001-34460 - MEANING - Developing Multilingual Web-scale Language Technologies


Work Package 2-D2.2, Version: Draft
Basic Design of the architecture and methodologies (second round), Page 64

POS     Open class words    Gold     Silver    Normal
noun    440758              12171    271763    216510
verb    44469               2688     14299     27390
adj     70748               353      23825     46589
adv     8516                1007     3853      3651
Total   564491              16219    313740    294140

Table 7: Disambiguated words in each category

6.13.3 Experiments

Our main goal is to build a new WSD system based initially on the main heuristics of [Mihalcea and Moldovan, 2001; Novischi, 2002; Gangemi et al., 2003]. We plan to improve the system's performance by exploiting the current content of the MCR together with these new, more sophisticated heuristics.

First, we will preprocess the WordNet glosses in the following steps:

• Splitting definitions and examples

• Tokenization

• Part-of-speech tagging using Brill's tagger [Ide and Veronis, 1995]

• Multiword identification
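As a rough illustration of the first step: a WordNet gloss marks its example sentences with double quotes, so splitting definitions from examples (and stripping the parenthesized comments, as done for the eXtended WordNet) can be sketched as below. The gloss layout is a simplifying assumption, and this is not the project's preprocessing code:

```python
import re

# Simplified sketch of the "splitting definitions and examples" step.
# It assumes the usual WordNet gloss layout: a definition, optionally
# followed by double-quoted example sentences, with comments given in
# parentheses. Illustrative only, not the MEANING preprocessing code.
def split_gloss(gloss):
    # Example sentences are the double-quoted fragments of the gloss
    examples = re.findall(r'"([^"]*)"', gloss)
    # The definition is everything before the first quoted example
    definition = gloss.split('"')[0].rstrip('; ')
    # Drop parenthesized comments, as in the eXtended WordNet glosses
    definition = re.sub(r'\([^)]*\)', '', definition).strip()
    return definition, examples

gloss = 'a financial institution (often with branches); "he cashed a check at the bank"'
print(split_gloss(gloss))
# ('a financial institution', ['he cashed a check at the bank'])
```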

Most of the success of the heuristics used in our own WSD system relies on very accurate part-of-speech tagging and multiword identification.

6.13.4 Evaluation

We will use the semantic annotations of the eXtended WordNet as our gold standard. Obviously, the most reliable annotations to compare our results against are those labelled “gold”. We will study the performance of our system, providing different views of the results depending on the part of speech of the synset, the part of speech of the word to be disambiguated, the confidence of the annotation (normal, silver, gold), etc.
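The planned breakdown of results by confidence label can be sketched as follows; the tuple layout is a hypothetical representation of one disambiguated gloss word, not the eXtended WordNet file format:

```python
from collections import defaultdict

# Hedged sketch of the planned evaluation: precision of a system's
# sense assignments against the eXtended WordNet annotations, broken
# down by the confidence label of the gold annotation. The item layout
# (system_sense, xwn_sense, label) is illustrative only.
def precision_by_confidence(items):
    """items: iterable of (system_sense, xwn_sense, label) tuples."""
    correct, total = defaultdict(int), defaultdict(int)
    for system_sense, xwn_sense, label in items:
        total[label] += 1
        if system_sense == xwn_sense:   # system agrees with XWN annotation
            correct[label] += 1
    return {lab: correct[lab] / total[lab] for lab in total}

items = [("s1", "s1", "gold"), ("s2", "s1", "gold"), ("s3", "s3", "silver")]
print(precision_by_confidence(items))  # {'gold': 0.5, 'silver': 1.0}
```

The same grouping can of course be keyed on part of speech instead of the confidence label to produce the other views mentioned above.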

We also plan to participate in task 12 of SENSEVAL-3, “Word-Sense Disambiguation of WordNet Glosses”22, to be held in March 2004.

7 Design of WP7 Evaluation and Assessment

Evaluation and assessment in the Meaning project covers three main areas:

22http://www.clres.com/SensWNDisamb.html


• evaluation of the output of the linguistic processors;

• evaluation of lexical information acquired from corpora, and ported from one language to another via the Multilingual Central Repository; and

• evaluation of word sense disambiguation accuracy.

We discuss each of these areas separately below.

7.1 Output of Linguistic Processors

We will use the techniques described in the Year 1 deliverable, D2.1.

7.2 Acquired/Ported Lexical Information

We will use the techniques described in the Year 1 deliverable, D2.1.

7.3 Word Sense Disambiguation

EHU, IRST, Sussex and UPC will all be heavily involved in the Senseval-3 Word Sense Disambiguation (WSD) exercise (due to take place in March 2004), both as participants and as task coordinators. We give below the descriptions of the tasks that the project partners will participate in (information taken from <http://www.senseval.org/senseval3/tasks.html>).

01. English all words

As we did for Senseval-2, we will tag approximately 5,000 words of coherent Penn Treebank text with WordNet 1.7 tags. We will tag all of the predicating words and the head words of their arguments, and as many adjectives and adverbs as we can. We will do double-blind tagging with adjudication.

02. Italian all words

In addition to the lexical sample task, we propose an “all words” task for Italian. Each participant will be provided with a relatively small set extracted from the Italian Treebank, consisting of about 5,000 words. The sentences can be provided with POS tagging and syntactic dependency-based tagging (functional annotation). The content words (nouns, verbs, and adjectives) will be semantically tagged according to the sense repository of ItalWordNet. Bernardo Magnini (IRST) is joint coordinator for this task.


03. Basque lexical sample

We propose a “Lexical-Sample” task for Basque in order to evaluate supervised and semi-supervised learning systems for WSD. Each participant will be provided with a relatively small set of labelled examples (two-thirds of 75 + 15 ∗ senses + 7 ∗ multiwords) and a comparatively very large set of unlabelled examples (ten times more, when possible) for around 40 words. The test set will comprise the remaining one-third of 75 + 15 ∗ senses + 7 ∗ multiwords. We target two types of participants: supervised systems (not using unlabelled data) and semi-supervised systems (those taking profit from the unlabelled data), but unsupervised systems can also participate, of course. The sense inventory will be manually linked to WordNet 1.6 (automatic links to WordNet 1.7 will also be provided). This task will be coordinated with the other lexical-sample tasks (Catalan, English, Italian, Romanian, Spanish) in order to share around 10 of the target words. Eneko Agirre (EHU) is coordinator for this task.
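For concreteness, the split sizes implied by this formula can be worked through for a hypothetical word with 4 senses and 2 multiwords (the illustrative numbers and the integer rounding are our assumptions; the task description only fixes the two-thirds/one-third proportions):

```python
# Worked example of the lexical-sample data split described above: the
# labelled pool for one word is 75 + 15*senses + 7*multiwords examples,
# two-thirds of which go to training and one-third to testing. Integer
# rounding is an assumption made for the sake of the illustration.
def labelled_split(senses, multiwords):
    pool = 75 + 15 * senses + 7 * multiwords
    train = 2 * pool // 3           # two-thirds, rounded down
    test = pool - train             # remaining one-third
    return pool, train, test

print(labelled_split(4, 2))  # (149, 99, 50)
```

The Catalan, Italian and Spanish tasks below use the same scheme without the multiword term.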

04. Catalan lexical sample

We propose a “Lexical-Sample” task for Catalan in order to evaluate supervised and semi-supervised learning systems for WSD. Each participant will be provided with a relatively small set of labelled examples (two-thirds of 75 + 15 ∗ senses) and a comparatively very large set of unlabelled examples (ten times more, when possible) for around 45 words. The test set will comprise the remaining one-third of 75 + 15 ∗ senses. We target two types of participants: supervised systems (not using unlabelled data) and semi-supervised systems (those taking profit from the unlabelled data), but unsupervised systems can also participate, of course. The sense inventory, which is specially developed for the task, will be manually linked to WordNet 1.6 (automatic links to WordNet 1.7 will also be provided). This task will be coordinated with the other lexical-sample tasks (Basque, English, Italian, Romanian, Spanish) in order to share around 10 of the target words. Lluís Marquez (UPC) is coordinator for this task.

06. English lexical sample

The goal of this task is to create a framework for the evaluation of systems that perform Word Sense Disambiguation. The data will be collected via the Open Mind Word Expert (OMWE) interface. To ensure reliability, we collect at least two tags per item, and conduct inter-tagger agreement and replicability tests. Previously performed evaluations have proved the high quality and usefulness of the OMWE data. By the time Senseval-3 takes place, we estimate we will have enough data for at least 150 ambiguous nouns, adjectives, verbs, and adverbs. Part of the test data will be created by lexicographers from the Department of Linguistics at UNT. Another part of the test data will be extracted from the sense-tagged corpus collected over the Web. We will also provide sense maps to enable both fine-grained and coarse-grained evaluations. It is anticipated that the English lexical sample task will also include a set of test items drawn from current Web pages.


09. Italian lexical sample

We propose a “Lexical-Sample” task for Italian in order to evaluate supervised and semi-supervised learning systems for WSD. Each participant will be provided with a relatively small set of labelled examples (two-thirds of 75 + 15 ∗ senses) and a comparatively very large set of unlabelled examples (ten times more, when possible) for around 45 words. The test set will comprise the remaining one-third of 75 + 15 ∗ senses. We target two types of participants: supervised systems (not using unlabelled data) and semi-supervised systems (those taking profit from the unlabelled data), but unsupervised systems can also participate, of course. The sense inventory, which is specially developed for the task, will be manually linked to WordNet 1.6 (automatic links to WordNet 1.7 will also be provided). This task will be coordinated with the other lexical-sample tasks (Basque, English, Catalan, Romanian, Spanish) in order to share around 10 of the target words. Bernardo Magnini (IRST) is joint coordinator for this task.

08. Spanish lexical sample

We propose a “Lexical-Sample” task for Spanish in order to evaluate supervised and semi-supervised learning systems for WSD. Each participant will be provided with a relatively small set of labelled examples (two-thirds of 75 + 15 ∗ senses) and a comparatively very large set of unlabelled examples (ten times more, when possible) for around 45 words. The test set will comprise the remaining one-third of 75 + 15 ∗ senses. We target two types of participants: supervised systems (not using unlabelled data) and semi-supervised systems (those taking profit from the unlabelled data), but unsupervised systems can also participate, of course. The sense inventory, which is specially developed for the task, will be manually linked to WordNet 1.6 (automatic links to WordNet 1.7 will also be provided). This task will be coordinated with the other lexical-sample tasks (Basque, Catalan, English, Italian, Romanian) in order to share around 10 of the target words. Lluís Marquez (UPC) is coordinator for this task.

12. Word-Sense Disambiguation of WordNet Glosses

In connection with WordNet 2.0 [Fellbaum, 1998] and eXtended WordNet (XWN, [Mihalcea and Moldovan, 2001]), a large number of the WordNet glosses are being hand-tagged. Each content word (noun, verb, adjective, and adverb) is being labelled with its WordNet senses. This manual effort is time-consuming and energy-intensive. The Senseval-3 task is to perform this tagging automatically, using all hand-tagged glosses from XWN as the test set, with the hand-tagging also serving as the gold standard for evaluation. The task will be performed as an “all-words” task, except that no context will be provided. However, it is expected that participants will make use of additional WordNet information (the synset, the WordNet hierarchy, and other WordNet relations) in their disambiguation. This task is intended to promote the exploitation of ordinary dictionary definitions in machine-readable dictionaries.


References

[Abney, 1991] S. Abney. Parsing by chunks. In R. C. Berwick, S. P. Abney, and Carol Tenny, editors, Principle-Based Parsing: Computation and Psycholinguistics, Boston, MA, 1991. Kluwer Academic Publishers.

[Agirre and Martinez, 2000] E. Agirre and D. Martinez. Exploring automatic word sense disambiguation with decision lists and the web. In Proceedings of the COLING workshop on Semantic Annotation and Intelligent Annotation, Luxembourg, 2000.

[Agirre and Martinez, 2001] E. Agirre and D. Martinez. Learning class-to-class selectional preferences. In Proceedings of CoNLL'01, Toulouse, France, 2001.

[Agirre and Martinez, 2002] E. Agirre and D. Martinez. Integrating selectional preferences in WordNet. In Proceedings of the First International WordNet Conference, Mysore, India, 21-25 January 2002.

[Agirre et al., 2000] E. Agirre, O. Ansa, D. Martinez, and E. Hovy. Enriching very large ontologies with topic signatures. In Proceedings of the ECAI'00 workshop on Ontology Learning, Berlin, Germany, 2000.

[Agirre et al., 2001] E. Agirre, O. Ansa, D. Martínez, and E. Hovy. Enriching WordNet concepts with topic signatures. In Proceedings of the NAACL workshop on WordNet and Other Lexical Resources: Applications, Extensions and Customizations, Pittsburgh, 2001.

[Agirre et al., 2002] E. Agirre, O. Ansa, X. Arregi, J.M. Arriola, A. Diaz de Ilarraza, E. Pociello, and L. Uria. Methodological issues in the building of the Basque WordNet: quantitative and qualitative analysis. In Proceedings of the First International WordNet Conference, Mysore, India, 21-25 January 2002.

[Alfonseca and Manandhar, 2002] E. Alfonseca and S. Manandhar. An unsupervised method for general named entity recognition and automated concept discovery. In Proceedings of the 1st International Conference on General WordNet, Mysore, India, 2002.

[Atserias et al., 1997] J. Atserias, S. Climent, X. Farreres, G. Rigau, and H. Rodríguez. Combining multiple methods for the automatic construction of multilingual wordnets. In Proceedings of RANLP'97, pages 143–149, Bulgaria, 1997.

[Atserias et al., 2001] J. Atserias, L. Padro, and G. Rigau. Integrating multiple knowledge sources for robust semantic parsing. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'01), Tzigov Chark, Bulgaria, 2001.

[Atserias et al., 2003] Jordi Atserias, Mauro Castillo, Francis Real, Horacio Rodríguez, and German Rigau. Exploring large-scale acquisition of multilingual semantic models for predicates. In Proceedings of SEPLN'03, pages 39–46, Alcalá de Henares, Spain, September 2003. ISSN 1136-5948.

[Atserias et al., 2004] Jordi Atserias, Luís Villarejo, German Rigau, Eneko Agirre, John Carroll, Bernardo Magnini, and Piek Vossen. The MEANING multilingual central repository. In Proceedings of the Second International Global WordNet Conference (GWC'04), Brno, Czech Republic, January 2004. ISBN 80-210-3302-9.

[Avancini et al., 2003] H. Avancini, A. Lavelli, B. Magnini, F. Sebastiani, and R. Zanoli. Expanding domain-specific lexicons by term categorization. In Proceedings of the 18th ACM Symposium on Applied Computing, Special Track on Information Access and Retrieval Systems (SAC'03), Melbourne, Florida, 2003.

[Benítez et al., 1998] L. Benítez, S. Cervell, G. Escudero, M. López, G. Rigau, and M. Taulé. Methods and tools for building the Catalan WordNet. In Proceedings of the ELRA Workshop on Language Resources for European Minority Languages, First International Conference on Language Resources & Evaluation, Granada, Spain, 1998.

[Bentivogli and Pianta, 2002] L. Bentivogli and E. Pianta. Opportunistic semantic tagging. In Proceedings of the Third International Conference on Language Resources and Evaluation (LREC 2002), Las Palmas, Canary Islands, 2002.

[Bentivogli et al., 2002] L. Bentivogli, E. Pianta, and C. Girardi. MultiWordNet: developing an aligned multilingual database. In Proceedings of the First International Conference on Global WordNet, Mysore, India, 2002.

[Blum and Mitchell, 1998] A. Blum and T. Mitchell. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory (COLT-98), pages 92–100, Madison, Wisconsin, 1998.

[Brent, 1991] M. Brent. Automatic acquisition of subcategorization frames from untagged text. In Proceedings of the 29th Annual Meeting of the Association for Computational Linguistics (ACL'91), Berkeley, CA, 1991.

[Brent, 1993a] M. Brent. Automatic acquisition of a large subcategorization dictionary from corpora. In Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics (ACL'93), Columbus, Ohio, 1993.

[Brent, 1993b] M. Brent. From grammar to lexicon: unsupervised learning of lexical syntax. Computational Linguistics, 19(2):243–262, 1993.

[Briscoe and Carroll, 1997] T. Briscoe and J. Carroll. Automatic extraction of subcategorization from corpora. In Proceedings of the 5th Conference on Applied Natural Language Processing, pages 356–363, Washington DC, USA, 1997.


[Bruce and Wiebe, 1994] R. Bruce and J. Wiebe. Word sense disambiguation using decomposable models. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), pages 139–145, Las Cruces, US, 1994.

[Carreras and Marquez, 2003a] X. Carreras and L. Marquez. Online learning via global feedback for phrase recognition. In Proceedings of the 17th Annual Conference on Neural Information Processing Systems (NIPS'03), Vancouver, Canada, 2003.

[Carreras and Marquez, 2003b] X. Carreras and L. Marquez. Phrase recognition by filtering and ranking with perceptrons. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'03), Borovets, Bulgaria, 2003.

[Carreras et al., 2002a] X. Carreras, L. Marquez, and L. Padro. Named entity extraction using AdaBoost. In Proceedings of the Sixth Conference on Natural Language Learning (CoNLL'02 Shared Task Contribution), Taipei, Taiwan, 2002.

[Carreras et al., 2002b] X. Carreras, L. Marquez, V. Punyakanok, and D. Roth. Learning and inference for clause identification. In Proceedings of the 13th European Conference on Machine Learning (ECML'02), Helsinki, Finland, 2002.

[Carroll and Rooth, 1998] G. Carroll and M. Rooth. Valence induction with a head-lexicalized PCFG. In Proceedings of the 3rd Conference on Empirical Methods in Natural Language Processing (EMNLP 3), Granada, 1998.

[Chugur and Gonzalo, 2002] I. Chugur and J. Gonzalo. A study of polysemy and sense proximity in the Senseval-2 test suite. In Proceedings of the ACL'02 workshop on Word Sense Disambiguation: Recent Successes and Future Directions, Philadelphia, PA, USA, 2002.

[Clark and Weir, 2002] S. Clark and D. Weir. Class-based probability estimation using a semantic hierarchy. Computational Linguistics, 28(2):187–206, 2002.

[Daude et al., 1999] J. Daude, L. Padro, and G. Rigau. Mapping Multilingual Hierarchies Using Relaxation Labeling. In Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC'99), Maryland, US, 1999.

[Daude et al., 2000] J. Daude, L. Padro, and G. Rigau. Mapping WordNets Using Structural Information. In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL'2000), Hong Kong, 2000.

[Daude et al., 2001] J. Daude, L. Padro, and G. Rigau. A complete WN1.5 to WN1.6 mapping. In Proceedings of the NAACL Workshop “WordNet and Other Lexical Resources: Applications, Extensions and Customizations”, Pittsburgh, PA, United States, 2001.


[Daude et al., 2003a] J. Daude, L. Padro, and G. Rigau. Making WordNet Mappings Robust. In Proceedings of the 19th Congreso de la Sociedad Española para el Procesamiento del Lenguaje Natural (SEPLN'03), Universidad de Alcalá de Henares, Madrid, Spain, 2003.

[Daude et al., 2003b] J. Daude, L. Padro, and G. Rigau. Validation and Tuning of WordNet Mapping Techniques. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'03), Borovets, Bulgaria, 2003.

[Dunning, 1993] T. Dunning. Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19(1):61–74, 1993.

[Edmonds and Cotton, 2001] P. Edmonds and S. Cotton. Senseval-2: Overview. In Proceedings of the 2nd International Workshop on Evaluating Word Sense Disambiguation Systems (SENSEVAL-2), Toulouse, France, 2001.

[Fellbaum, 1998] C. Fellbaum, editor. WordNet: An Electronic Lexical Database. The MIT Press, 1998.

[Gahl, 1998] S. Gahl. Automatic extraction of subcorpora based on subcategorization frames from a part-of-speech tagged corpus. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING/ACL'98), Montreal, Canada, 1998.

[Gale et al., 1992] W. Gale, K. Church, and D. Yarowsky. One sense per discourse. In Proceedings of the DARPA Speech and Natural Language Workshop, Harriman, NY, 1992.

[Gangemi et al., 2003] A. Gangemi, R. Navigli, and P. Velardi. Axiomatizing WordNet glosses in the OntoWordNet project. In Proceedings of the 2nd International Semantic Web Conference Workshop on Human Language Technology for the Semantic Web and Web Services, Sanibel Island, Florida, 2003.

[Ge et al., 1998] N. Ge, J. Hale, and E. Charniak. A statistical approach to anaphora resolution. In Proceedings of the Sixth ACL/SIGDAT Workshop on Very Large Corpora, pages 161–171, 1998.

[Gimenez and Marquez, 2003] J. Gimenez and L. Marquez. Fast and accurate part-of-speech tagging: the SVM approach revisited. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP'03), Borovets, Bulgaria, 2003.

[Grover et al., 1993] C. Grover, J. Carroll, and T. Briscoe. The Alvey natural language tools grammar. Technical Report TR 284, Computer Laboratory, University of Cambridge, 1993.


[Guarino and Welty, 2000] Nicola Guarino and Christopher A. Welty. A formal ontology of properties. In Proceedings of the ECAI'2000 Workshop on Knowledge Acquisition, Modeling and Management, pages 97–112, 2000.

[Habash and Dorr, 2002] N. Habash and B. Dorr. Handling translation divergences: combining statistical and symbolic techniques in generation-heavy machine translation. In Proceedings of the Fifth Conference of the Association for Machine Translation in the Americas (AMTA-2002), Tiburon, CA, 2002.

[Harabagiu et al., 1999] S. Harabagiu, G. Miller, and D. Moldovan. WordNet 2 - a morphologically and semantically enhanced resource. In Proceedings of the ACL Workshop on Standardizing Lexical Resources (SIGLEX'99), Maryland, MD, 1999.

[Hindle and Rooth, 1993] D. Hindle and M. Rooth. Structural ambiguity and lexical relations. Computational Linguistics, 19(2):103–120, 1993.

[Ide and Veronis, 1995] N. Ide and J. Veronis. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4), 1995.

[Joachims, 1999] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.

[Justeson and Katz, 1995] J. Justeson and S. Katz. Principled disambiguation: discriminating adjective senses with modified nouns. Computational Linguistics, 21(1):1–28, 1995.

[Karypis, 2001] G. Karypis. CLUTO: a clustering toolkit. Technical report, Department of Computer Science, University of Minnesota, Minneapolis, 2001.

[Keller and Lapata, 2003] F. Keller and M. Lapata. Using the web to obtain frequencies for unseen bigrams. Computational Linguistics, 29(3):459–484, 2003.

[Korhonen, 2002] A. Korhonen. Subcategorization acquisition. PhD thesis, University ofCambridge, 2002.

[Lapata, 1999] M. Lapata. Acquiring lexical generalizations from corpora: a case study for diathesis alternations. In Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (ACL'99), College Park, Maryland, 1999.

[Lapata, 2001] M. Lapata. The Acquisition and Modeling of Lexical Knowledge: A Corpus-based Investigation of Systematic Polysemy. PhD thesis, University of Edinburgh, 2001.

[Leacock et al., 1998] C. Leacock, M. Chodorow, and G. Miller. Using Corpus Statistics and WordNet Relations for Sense Identification. Computational Linguistics, 24(1):147–166, 1998.


[Lewis, 1991] D. Lewis. Evaluating Text Categorization. In Proceedings of the Speech and Natural Language Workshop, pages 312–318. Morgan Kaufmann, 1991.

[Li and Abe, 1998] H. Li and N. Abe. Generalizing case frames using a thesaurus and the MDL principle. Computational Linguistics, 24(2):217–244, 1998.

[Lin and Hovy, 2000] C. Lin and E. Hovy. The automated acquisition of topic signatures for text summarization. In Proceedings of the International Conference on Computational Linguistics (COLING'00), Strasbourg, France, 2000.

[Lin and Pantel, 2002] D. Lin and P. Pantel. Concept Discovery from Text. In Proceedings of the 19th International Conference on Computational Linguistics (COLING'02), Taipei, Taiwan, 2002.

[Lin, 1998] D. Lin. Automatic retrieval and clustering of similar words. In Proceedings ofCOLING-ACL’1998, Montreal, Canada, 1998.

[Magnini and Cavaglia, 2000] B. Magnini and G. Cavaglia. Integrating subject field codes into WordNet. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC'2000), Athens, Greece, 2000.

[Magnini et al., 2002a] B. Magnini, M. Negri, H. Tanev, and R. Prevete. A WordNet-BasedApproach to Named Entities Recognition. In Proceedings of the SemaNet’02 workshopon Building and Using Semantic Networks, Taipei, Taiwan, 2002.

[Magnini et al., 2002b] B. Magnini, C. Strapparava, G. Pezzulo, and A. Gliozzo. The roleof domain information in word sense disambiguation. Natural Language Engineering,4(8), 2002.

[McCarthy and Carroll, 2003] D. McCarthy and J. Carroll. Disambiguating nouns, verbs and adjectives using automatically acquired selectional preferences. Computational Linguistics, 29(4):639–654, 2003.

[McCarthy et al., 2001] D. McCarthy, J. Carroll, and J. Preiss. Disambiguating noun and verb senses using automatically acquired selectional preferences. In Proceedings of the SENSEVAL-2 Workshop at ACL/EACL'01, Toulouse, France, 2001.

[McCarthy, 2001] D. McCarthy. Lexical Acquisition at the Syntax-Semantics Interface: Diathesis Alternations, Subcategorization Frames and Selectional Preferences. PhD thesis, University of Sussex, 2001.

[Mihalcea and Moldovan, 1999] R. Mihalcea and D. Moldovan. An Automatic Method for Generating Sense Tagged Corpora. In Proceedings of the 16th National Conference on Artificial Intelligence. AAAI Press, 1999.

[Mihalcea and Moldovan, 2001] R. Mihalcea and D. Moldovan. eXtended WordNet: progress report. In Proceedings of the NAACL Workshop on WordNet and Other Lexical Resources, Pittsburgh, PA, 2001.


[Ng and Lee, 1996] H. Ng and H. Lee. Integrating Multiple Knowledge Sources to Disambiguate Word Sense: An Exemplar-based Approach. In Proceedings of the 34th Annual Meeting of the Association for Computational Linguistics. ACL, 1996.

[Niles and Pease, 2001] I. Niles and A. Pease. Towards a standard upper ontology. In Chris Welty and Barry Smith, editors, Proceedings of the 2nd International Conference on Formal Ontology in Information Systems (FOIS-2001), pages 17–19, 2001.

[Novischi, 2002] A. Novischi. Accurate semantic annotations via pattern matching. In Florida Artificial Intelligence Research Society Conference (FLAIRS'02), Pensacola, Florida, May 2002.

[Padro, 1998] L. Padro. A Hybrid Environment for Syntax-Semantic Tagging. PhD thesis, Departament de LSI, Universitat Politecnica de Catalunya, 1998.

[Palmer et al., 2001] M. Palmer, C. Fellbaum, S. Cotton, L. Delfs, and H. Trang Dang. English tasks: all-words and verb lexical sample. In Proceedings of the SENSEVAL-2 Workshop, in conjunction with ACL'2001/EACL'2001, Toulouse, France, 2001.

[Patwardhan and Pedersen, 2003] S. Patwardhan and T. Pedersen. The CPAN WordNet::Similarity package. Technical report, http://search.cpan.org/author/SID/WordNet-Similarity-0.03/, 2003.

[Resnik and Yarowsky, 1999] P. Resnik and D. Yarowsky. Distinguishing systems and distinguishing senses: new evaluation methods for word sense disambiguation. Natural Language Engineering, 5(2):113–134, 1999.

[Resnik, 1993] P. Resnik. Selection and Information: A Class-Based Approach to Lexical Relationships. PhD thesis, University of Pennsylvania, 1993.

[Resnik, 1997] P. Resnik. Selectional preference and sense disambiguation. In Proceedings of the ACL SIGLEX Workshop on Tagging Text with Lexical Semantics: Why, What, and How?, Washington, D.C., 1997.

[Ribas, 1995] F. Ribas. On Acquiring Appropriate Selectional Restrictions from Corpora Using a Semantic Taxonomy. PhD thesis, Software Department (LSI), Technical University of Catalonia (UPC), Barcelona, Spain, 1995.

[Rigau, 1998] G. Rigau. Automatic Acquisition of Lexical Knowledge from MRDs. PhD thesis, Departament de LSI, Universitat Politecnica de Catalunya, 1998.

[Roland and Jurafsky, 1998] D. Roland and D. Jurafsky. How verb subcategorization frequencies are affected by corpus choice. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics (COLING/ACL'98), Montreal, Canada, 1998.


[Roland and Jurafsky, 2002] D. Roland and D. Jurafsky. Verb sense and verb subcategorization probabilities. In S. Stevenson and P. Merlo, editors, The Lexical Basis of Sentence Processing: Formal, Computational, and Experimental Issues. John Benjamins, Amsterdam, 2002.

[Roland et al., 2000] D. Roland, D. Jurafsky, L. Menn, S. Gahl, E. Elder, and C. Riddoch. Verb subcategorization frequency differences between business-news and balanced corpora. In Proceedings of the ACL Workshop on Comparing Corpora, Hong Kong, China, 2000.

[Santamaria et al., 2003] C. Santamaria, J. Gonzalo, and F. Verdejo. Automatic association of web directories with word senses. Computational Linguistics, Special Issue on the Web as a Corpus, 29(3):485–502, 2003.

[Schapire and Singer, 2000] R. Schapire and Y. Singer. BoosTexter: a boosting-based system for text categorization. Machine Learning, 39(2/3):135–168, 2000.

[Schutze, 1992] H. Schutze. Dimensions of meaning. In Proceedings of Supercomputing, Los Alamitos, California, 1992.

[Sebastiani et al., 2000] F. Sebastiani, A. Sperduti, and N. Valdambrini. An improved boosting algorithm and its application to automated text categorization. In Arvin Agah, Jamie Callan, and Elke Rundensteiner, editors, Proceedings of CIKM-00, 9th ACM International Conference on Information and Knowledge Management, pages 78–85, Washington, US, 2000.

[Stevenson and Merlo, 1998] S. Stevenson and P. Merlo. Automatic verb classification using distributions of grammatical features. In Proceedings of the 9th Conference of the European Chapter of the Association for Computational Linguistics (EACL'98), Bergen, Norway, 1998.

[Ushioda et al., 1993] A. Ushioda, D. Evans, T. Gibson, and A. Waibel. The automatic acquisition of frequencies of verb subcategorization frames from tagged corpora. In Proceedings of the ACL Workshop on the Acquisition of Lexical Knowledge from Text, Columbus, Ohio, 1993.

[Vossen, 1998] P. Vossen, editor. EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Academic Publishers, 1998.

[Wagner, 2002] A. Wagner. Learning thematic role relations for wordnets. In Proceedings of the ESSLLI-2002 Workshop on Machine Learning Approaches in Computational Linguistics, Trento, Italy, 2002.

[Walde, 2000] S. Walde. Clustering Verbs Semantically According to their Alternation Behaviour. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-00), pages 747–753, Saarbrucken, Germany, August 2000.


[Widdows, 2003] D. Widdows. Orthogonal negation in vector spaces for modelling word meanings and document retrieval. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL'2003), Sapporo, Japan, 2003.

[Yarowsky, 1994] D. Yarowsky. Decision lists for lexical ambiguity resolution: application to accent restoration in Spanish and French. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL'94), 1994.

[Sarkar and Zeman, 2000] A. Sarkar and D. Zeman. Automatic extraction of subcategorization frames for Czech. In Proceedings of the 18th International Conference on Computational Linguistics (COLING'00), Saarbrucken, Germany, 2000.

[Zhao and Karypis, 2001] Y. Zhao and G. Karypis. Criterion functions for document clustering: experiments and analysis. Technical report, Department of Computer Science, University of Minnesota, Minneapolis, 2001.
