1

LEXUS: A flexible web-based Lexicon Tool

Interacting with ISO Data Category Registry

Peter Wittenburg, Marc Kemps-Snijders
MPI for Psycholinguistics

2

Outline

• background – problem
• MPI motivation = NLP motivation
• playing LEGO
• ISO TC37/SC4
• Data Categories
• Lexical Markup Framework

• LEXUS Tool (Mark)
• Demo (Mark)

• Outlook

3

Background

7 DOBES teams and 12 different lexica (structures, purposes)

Simple spreadsheet:
Tuvan orthography, Tuvan appendix, German orthography, Russian orthography, Russian appendix, Xakas orthography, Tofa orthography

A little more complex, including 1:N relations:
stem orthography, sense *, lexical sub-entry *, sense nr
sense, gram cat, gram subcat, Engl. Transl., example *
orthography, Engl. Transl., [T|pr] nr

Small part of a complex lexicon structure; at top level there are 4 different entry types (only one is shown):
entry-type = [stem|idiom|lexical word], head, outer-body-L*, headword
citation form, homograph no, phonetic form
inner-body-L, grammar
sense number, variety, meaning, etymology, table, example*, comment*, picture/photo*, housekeeping*
gloss, word-level-gloss, reversal, definition, encyclopedic info, scientific name, semantic domain, semantic index, thesaurus, semantic relation*, cross-ref*

4

Problem

• have to use one archival lexicon representation format based on XML
• have to build one archival exploitation framework
• however, we receive lexica
  • in various character encodings
  • in all sorts of formats (various XML, SBX, CHAT, even Word)
  • in various structures
  • with different terminologies (lexical attributes, values)

• how to do cross-lexical searches?
• how to do lexical merging, linking and comparison?
• how to solve lexicon-corpus interaction?
• etc.

• in NLP the same problems
  • lack of standards
  • lack of re-usability
  • lack of interoperability

• you knew this already, didn't you?

5

Why not play LEGO?

• a concrete lexicon schema is basically seen as lexical attributes grouped together with others and embedded in a tree structure.

sense nr

sense

gram cat

engl trans

examples ortho

engl trans

gloss

1:N

1:1

data categories (lexical attributes, linguistic concepts)

components (sub-schemas)
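The LEGO idea on this slide can be sketched in a few lines of Python. This is only an illustration, not the LEXUS data model: the class names `DataCategory` and `Component`, the `max_occurs` field and the exact nesting of the attributes are assumptions made here; only the attribute names come from the slide.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union


@dataclass
class DataCategory:
    """A leaf: one lexical attribute / linguistic concept."""
    name: str


@dataclass
class Component:
    """A sub-schema: a named group of data categories and nested components."""
    name: str
    max_occurs: Optional[int] = 1  # 1 means 1:1, None means 1:N
    children: List[Union["Component", DataCategory]] = field(default_factory=list)


def outline(node, depth=0):
    """Return the schema tree as a list of indented text lines."""
    if isinstance(node, DataCategory):
        return ["  " * depth + node.name]
    label = node.name + (" (1:N)" if node.max_occurs is None else " (1:1)")
    lines = ["  " * depth + label]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical arrangement of the attributes named on the slide.
examples = Component("examples", max_occurs=None,
                     children=[DataCategory("ortho"), DataCategory("engl trans")])
sense = Component("sense", max_occurs=None,
                  children=[DataCategory("sense nr"), DataCategory("gram cat"),
                            DataCategory("engl trans"), DataCategory("gloss"),
                            examples])

print("\n".join(outline(sense)))
```

Because a component may contain other components, the same two building blocks are enough to describe both the simple spreadsheet-like lexica and the deeply nested ones.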

6

What else: Relations

• actually, component association is a relation of a special type

• need various types of relations between attributes and units in value strings
• each relation can be associated with features, i.e. relations can be seen as components in their own right

bank: breite Sitzgelegenheit (something broad to sit on)

sitzgelegenheit: etwas um zu sitzen (something to sit on)

schmal: gegenteil zu breit (contrary to broad)
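A relation between a unit inside a value string and another entry, carrying its own features, might be modelled roughly as follows. The `Relation` class, its field names and the feature values ("defined-by", "antonym") are hypothetical; which units are actually linked on the slide is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Relation:
    source_entry: str   # entry whose value string contains the unit
    source_unit: str    # the unit (word) inside that value string
    target_entry: str   # entry the unit points to
    features: Dict[str, str] = field(default_factory=dict)


# Definitions taken from the slide.
definitions = {
    "bank": "breite Sitzgelegenheit",
    "sitzgelegenheit": "etwas um zu sitzen",
    "schmal": "gegenteil zu breit",
}

# Hypothetical links with hypothetical feature values.
relations = [
    Relation("bank", "Sitzgelegenheit", "sitzgelegenheit", {"type": "defined-by"}),
    Relation("bank", "breite", "schmal", {"type": "antonym"}),
]


def is_valid(rel, lexicon):
    """True if the unit occurs in the source value string and the target exists."""
    value = lexicon.get(rel.source_entry, "")
    return rel.source_unit in value.split() and rel.target_entry in lexicon


for rel in relations:
    print(rel.source_entry, "->", rel.target_entry, rel.features, is_valid(rel, definitions))
```

Treating the relation as an object with its own feature set is what lets it behave like a component rather than a bare pointer.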

7

What else: Inheritance

b’ang: common attributes + particular attributes

boeb’ang: common attributes + particular attributes

goeb’ang: common attributes + particular attributes

just one example to reduce typing
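One way to avoid retyping the common attributes is to let an entry inherit from another and override only what differs. The sketch below assumes that boeb’ang and goeb’ang inherit from b’ang; the slide does not state the direction of inheritance, and the attribute names and values are placeholders, not real lexical data.

```python
def resolve(entry_id, lexicon):
    """Merge an entry's own attributes over those of its parent chain."""
    entry = lexicon[entry_id]
    parent = entry.get("inherits")
    merged = dict(resolve(parent, lexicon)) if parent else {}
    merged.update(entry["attributes"])
    return merged


# Placeholder values only; the real attribute values are not in the slides.
lexicon = {
    "b’ang": {
        "inherits": None,
        "attributes": {"common attribute": "shared value",
                       "particular attribute": "value for b’ang"},
    },
    "boeb’ang": {
        "inherits": "b’ang",
        "attributes": {"particular attribute": "value for boeb’ang"},
    },
    "goeb’ang": {
        "inherits": "b’ang",
        "attributes": {"particular attribute": "value for goeb’ang"},
    },
}

print(resolve("boeb’ang", lexicon))
```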

8

What else: conditions (operations)

• there are probably better examples around
• general form: if value(X) then modify constraints(Y), etc.

head

outer-body-L

lexemtype

if lexemtype = “stem | idiom | lexical word”: meaning, sense nr, etc.

if lexemtype = “auxil | inflect affix”: meaning effect, categorial effect, sense nr, etc.

just one example from DOBES
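The rule "if value(X) then modify constraints(Y)" can be sketched as a small lookup table. The grouping of attributes per lexemtype below is one reading of the slide, and the names `RULES`, `allowed_attributes` and `violations` are invented for this illustration.

```python
# Each rule: (set of lexemtype values, attributes allowed when the rule fires).
RULES = [
    ({"stem", "idiom", "lexical word"}, ["meaning", "sense nr"]),
    ({"auxil", "inflect affix"}, ["meaning effect", "categorial effect", "sense nr"]),
]


def allowed_attributes(lexemtype):
    """Return the attributes permitted for the given lexemtype value."""
    for values, attributes in RULES:
        if lexemtype in values:
            return attributes
    raise ValueError(f"no rule for lexemtype {lexemtype!r}")


def violations(entry):
    """List the attributes of an entry that its lexemtype does not permit."""
    allowed = set(allowed_attributes(entry["lexemtype"]))
    return sorted(k for k in entry if k != "lexemtype" and k not in allowed)


print(allowed_attributes("idiom"))
print(violations({"lexemtype": "auxil", "meaning": "x", "sense nr": "1"}))
```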

9

ISO TC37/SC4 – the solution?

• ISO TC37/SC4 is about standardization in language resource (LR) management

• central is the data category registry (DCR)
  • basically a flat list of linguistic concepts
  • will contain is_a relations that are part of the concept definition, e.g. “transitive_verb” is_a “verb”
  • with proper definitions and conceptual space (value range)
  • request for filling the DCR (metadata, morphology, syntax, …)

• looking for abstract models (frameworks)
  • for lexica
  • for annotation structures
  • for semantic annotations
  • for syntactic annotations
  • …
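A flat registry with is_a links stored as part of each concept can be queried with a short loop. Only the pair "transitive_verb" is_a "verb" comes from the slide; the other entry and the function names are made up for this sketch, and the real DCR is of course not a Python dictionary.

```python
# Flat list of concepts; each maps to its is_a parent (None for a top concept).
IS_A = {
    "verb": None,
    "transitive_verb": "verb",
    "intransitive_verb": "verb",  # hypothetical extra entry
}


def ancestors(concept):
    """Return the chain of is_a parents of a concept, nearest first."""
    chain = []
    parent = IS_A[concept]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def is_a(concept, other):
    """True if concept equals other or reaches it through is_a links."""
    return concept == other or other in ancestors(concept)


print(is_a("transitive_verb", "verb"))
```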

10

Underlying Model

Data element concept Conceptual domain

Data element Value domain

Complex datcat Set of Simple datcats

/Gender/: /masculine/, /feminine/, /neuter/

m, f, n: implemented as an XML attribute named ‘gen’

XML schema declaration

<w lemme="vert" gen="f">verte</w>

XML object, list of values

Dutch system is different

complex datcats, simple datcats
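The mapping from the XML attribute value back to the simple data category can be illustrated with Python's standard `xml.etree.ElementTree`. The `GENDER` dictionary layout and the function name `gender_of` are assumptions for this sketch; the element, the attribute name `gen` and the values m, f, n are taken from the slide.

```python
import xml.etree.ElementTree as ET

# Complex data category /Gender/ with its value domain and XML realisation.
GENDER = {
    "name": "Gender",
    "xml_attribute": "gen",
    "values": {"m": "masculine", "f": "feminine", "n": "neuter"},
}


def gender_of(xml_text):
    """Read the 'gen' attribute and map it to a simple data category."""
    element = ET.fromstring(xml_text)
    code = element.get(GENDER["xml_attribute"])
    if code not in GENDER["values"]:
        raise ValueError(f"{code!r} is outside the value domain of /Gender/")
    return GENDER["values"][code]


print(gender_of('<w lemme="vert" gen="f">verte</w>'))
```

A language with a different gender system would need a different value domain for the same complex data category, which is the point of the "Dutch system is different" note.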

11

Lexical Markup Framework

General Model

Metamodel

Data category selection

Lexical model

12

Core Model

Metamodel
• Made of lexical layers

Lexical layers
• Made of lexical components (or components)

Lexical DB

1..1

Global Info

1..1

Lexical Entry

0..n

1..1

0..n

Form

1..1

0..n

1..1

Sense

• basis for modeling purposes is UML
• there will be an XML-schema based instantiation
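As a rough illustration of the core model, the classes below mirror the boxes in the diagram. The multiplicities are read as: one Global Info per Lexical DB, and 0..n Lexical Entries, Forms and Senses; that reading, and all field names, are assumptions of this sketch and not part of the LMF draft text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Form:
    orthography: str  # hypothetical field


@dataclass
class Sense:
    gloss: str  # hypothetical field


@dataclass
class LexicalEntry:
    forms: List[Form] = field(default_factory=list)    # 0..n
    senses: List[Sense] = field(default_factory=list)  # 0..n


@dataclass
class GlobalInfo:
    language: str  # hypothetical field


@dataclass
class LexicalDB:
    global_info: GlobalInfo                                   # 1..1
    entries: List[LexicalEntry] = field(default_factory=list)  # 0..n


db = LexicalDB(GlobalInfo(language="German"))
db.entries.append(LexicalEntry(forms=[Form("bank")],
                               senses=[Sense("something broad to sit on")]))
print(db)
```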

13

Extended Model

Lexical DB

1..1

Global Info

1..1

Lexical Entry

0..n

1..1

0..n

Form

1..1

0..n

1..1

Sense

1..1

1..n

Morphology

0..1

1..1

Inflexion

1..1

Paradigm

0..n

/lemma/, /POS/, /gender/, /key form/

/orthography/, /gender/, /number/, /tense/, /person/, /mood/

/orthography/, /variant for/

/identifier/

14

Proposed Extensions

Lexical Entry

1..1

Form

1..1

0..n

1..1

1..1

Sense

Syntactic family

Semantic formula

Semantic argument

Construct set

Syntactic construct

Syntactic position

0..n1..1 1..1

1..1 1..1

1..1

0..n

1..10..n

1..10..n

Syntactic family

Syntactic construct

Semantic frame

still ongoing discussions

15

What will LMF be?

• descriptions of the general model (metamodel + DCS)
• DCs have to be ISO 11179/12620/… compliant
• Core model including component building, relations, conditions, inheritance
• Extension mechanism
• Proposed but not normative extensions (morphology, syntax, …)
• XML-schema based instantiation

• currently version 5 of the Draft Proposal: ISO/TC 37/SC 4 N130 Rev.5, dated 2005-03-19, working draft of ISO WD 24613:2005

• web-site: http://www.tc37sc4.org/

16

Goal LEXUS

To provide a framework capable of handling diverse lexicon structures and formats.

LEXUS is based upon the Lexical Markup Framework (LMF) within ISO TC37/SC4, which defines a blueprint for such a flexible framework. LEXUS is the first test and reference implementation of LMF.

Increase interoperability by offering well-accepted data categories (ISO, GOLD, Shoebox MDF)

Workshop ‘Lexical Databases and digital tools’, Nijmegen, April 2004

17

Current Status

• supports full LMF core model

• allows for flexible creation of structures and content

• supports use of well-accepted Data Category Registries (ISO 12620, Shoebox MDF)

• allows for dynamic editing of structures and content

• supports use of multimedia content

• import of existing lexica (Shoebox, CHAT)

• export (Shoebox/LMF XML)

• customizable layout

18

Current Status

• user authentication

• personal workspace for creating and editing lexica

• merging facilities

• simple and advanced search

19

Current Status (Technical)

• Implemented in Java using Open Source components

• Uses Spring to ‘wire’ the application
  • Modular approach avoiding ‘hard’ links

• Uses Hibernate as the persistence framework
  • Allows use of multiple databases (Postgres, MySQL, …)

• Uses Tomcat as servlet container

20

Users must authenticate before logging on to the application.

Logging onto the application

21

Each user has his/her own personal workspace where private lexica are stored

User workspace

22

New lexica may be created…

Lexicon creation

23

Lexicon import

New lexica may be imported from a lexical resource…
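Shoebox files are plain text in which each field starts with a backslash marker, and in the MDF convention a record begins with `\lx`. A minimal reader for that layout might look like the sketch below; it is not the LEXUS importer, the function name is invented, the sample records are made up for illustration, and continuation lines are simply ignored.

```python
def parse_shoebox(text, record_marker="lx"):
    """Split backslash-marked text into records of (marker, value) pairs."""
    records, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("\\"):
            continue  # this sketch skips blank and continuation lines
        marker, _, value = line[1:].partition(" ")
        if marker == record_marker:
            current = []          # a new record starts here
            records.append(current)
        if current is not None:   # fields before the first record are dropped
            current.append((marker, value.strip()))
    return records


# Invented sample using the MDF markers \lx (lexeme), \ps (part of speech), \ge (gloss).
SAMPLE = r"""
\lx bank
\ps n
\ge something broad to sit on

\lx schmal
\ge contrary to broad
"""

for record in parse_shoebox(SAMPLE):
    print(record)
```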

24

Lexicon structure

The LMF core model can be identified in this simple structure. Components and data categories are distinguished by different icons. All may be dynamically created or modified.

25

Lexicon structure

Representation of a more complex structure. Selecting a node in the tree shows the content of a component or data category, which may then be modified.

26

Data category selection

Data categories can easily be selected from data category registries.

27

Lexical entry overview

Overview of lexical entries. Selecting a lexical entry reveals its details.

28

Lexical entry details

Details of a lexical entry. Modifications to the entry structure are bound by the schema definition, e.g. cardinality.
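A cardinality check of the kind mentioned here could look as follows. The "min..max" notation is the one used in the LMF diagrams; the function names and the idea of checking before adding an instance are assumptions, since the slides do not describe how LEXUS enforces the constraint.

```python
def parse_cardinality(text):
    """Turn '0..n' or '1..1' into (min, max); max is None for 'n' (unbounded)."""
    low, high = text.split("..")
    return int(low), (None if high == "n" else int(high))


def can_add(current_count, cardinality):
    """True if one more instance still fits within the cardinality."""
    _, high = parse_cardinality(cardinality)
    return high is None or current_count + 1 <= high


def can_remove(current_count, cardinality):
    """True if removing one instance keeps the minimum satisfied."""
    low, _ = parse_cardinality(cardinality)
    return current_count - 1 >= low


print(can_add(1, "1..1"), can_add(3, "0..n"), can_remove(1, "1..1"))
```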

29

Lexical entry details

Attribute values can be easily modified. Various value types are supported (text, video, audio, image or file).

30

Lexical entry details

Example of uploading a video file.

31

Lexical entry details

Viewing multimedia content.

32

Alternative entry view

Alternative views are provided which may be customized in look and feel.

33

Synchronization of lexica

Personal Workspace

Main Lexicon

Lexica may be copied to and modified in the personal workspace.

34

Synchronization of lexica

Personal Workspace

Main Lexicon

Lexica may be synchronized with the main lexicon.

35

Synchronization of lexica

When synchronizing lexica, the user is notified of structural changes and is in full control of the synchronization process.
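The notify-and-confirm behaviour can be pictured as a diff of the two schema structures followed by a per-change decision. The slides do not describe the actual algorithm, so everything below (path strings, function names, the `approve` callback) is a hypothetical sketch.

```python
def structural_changes(main_paths, personal_paths):
    """Compare two schemas given as collections of component paths."""
    main, personal = set(main_paths), set(personal_paths)
    return {"added": sorted(personal - main), "removed": sorted(main - personal)}


def synchronize(main_paths, personal_paths, approve):
    """Apply only the changes the user approves; approve(kind, path) -> bool."""
    changes = structural_changes(main_paths, personal_paths)
    result = set(main_paths)
    for path in changes["added"]:
        if approve("add", path):
            result.add(path)
    for path in changes["removed"]:
        if approve("remove", path):
            result.discard(path)
    return sorted(result)


main = ["entry/form", "entry/sense"]
personal = ["entry/form", "entry/sense", "entry/sense/example"]

print(structural_changes(main, personal))
print(synchronize(main, personal, approve=lambda kind, path: True))
```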

36

Future directions

• Support for various types of relations

• Import of data from other sources

• Support for other Data Category Registries, e.g. GOLD

• Integration with MPI archive

• Integration with exploitation tools (ELAN, ANNEX)

• Miscellaneous user requests

37

References

• ISO (2004): Lexical Markup Framework. ISO document in progress.
• N. Ide, A. Lenci, N. Calzolari (2003): RDF Instantiation of ISLE/MILE Lexical Entries. LDC Workshop. Philadelphia.
• P. Wittenburg, W. Peters, S. Drude (2002): Analysis of Lexical Structures from Field Linguistics and Language Engineering. LREC 2002 Conference. Las Palmas, May.
• P. Wittenburg (2001): Lexical Structures. MPI Technical Report. MPI Nijmegen.
• J. Bell, S. Bird (2000): A Preliminary Study of the Structure of Lexicon Entries. Workshop on Web-Based Language Documentation and Description. Philadelphia.
• N. Ide, A. Kilgarriff, L. Romary (2000): A Formal Model of Dictionary Structure and Content. Euralex. Stuttgart.

38

Example lexical structure

Example lexical structure used in the TEOP project within DOBES

Stem orthography

Sense *

Lexical subentry

Sense nr

sense

Gram cat

Gram subcat

Engl. Transl.

Example *

orthography

Engl. Transl.

[T/pr] nr

* sign stands for 1:n relations of sub-structures
