1
LEXUS: A flexible web-based Lexicon Tool
Interacting with ISO Data Category Registry
Peter Wittenburg, Marc Kemps-Snijders
MPI for Psycholinguistics
2
Outline
• background – problem
• MPI motivation = NLP motivation
• playing LEGO
• ISO TC37/SC4
• Data Categories
• Lexical Markup Framework
• LEXUS Tool Mark
• Demo Mark
• Outlook
3
Background
7 DOBES teams and 12 different lexica (structures, purposes)
simple spreadsheet:
Tuvan orthography, Tuvan appendix, German orthography, Russian orthography, Russian appendix, Xakas orthography, Tofa orthography
a little more complex, incl. 1:N relations:
stem orthography, sense *, lexical sub-entry *, sense nr, sense, gram cat, gram subcat, Engl. Transl., example *, orthography, Engl. Transl., [T|pr] nr
small part of a complex lexicon structure; at top level 4 different entry types (only one is shown):
entry-type = [stem|idiom|lexical word], head, outer-body-L*, headword, citation form, homograph no, phonetic form, inner-body-L, grammar, sense number, variety, meaning, etymology, table, example*, comment*, picture/photo*, housekeeping*, gloss, word-level-gloss, reversal, definition, encyclopedic info, scientific name, semantic domain, semantic index, thesaurus, semantic relation*, cross-ref*
4
Problem
• have to use one archival lexicon representation format based on XML
• have to build one archival exploitation framework
• however, we receive lexica
  • with different character encodings
  • in all sorts of formats (var. XML, SBX, CHAT, even Word)
  • in various structures
  • with different terminologies (lexical attributes, values)
• how to do cross-lexical searches?
• how to do lexical merging, linking and comparison?
• how to solve lexicon-corpus interaction?
• etc.
• in NLP the same problems
  • lack of standards
  • lack of re-usability
  • lack of interoperability
• you knew this already, didn't you?
5
Why not play LEGO?
• a concrete lexicon schema is basically seen as lexical attributes grouped together with others and embedded in a tree structure.
[Figure: data categories (lexical attributes, linguistic concepts) such as sense nr, sense, gram cat, engl trans, examples, ortho and gloss are grouped into components (sub-schemas), which are connected by 1:1 and 1:N relations]
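The LEGO idea can be sketched in a few lines of Python. This is only an illustration, not the LEXUS implementation: the class names `DataCategory` and `Component`, the `multiple` flag and the example schema are assumptions; the attribute names are taken from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class DataCategory:
    """A lexical attribute (linguistic concept), e.g. 'gram cat'."""
    name: str


@dataclass
class Component:
    """A sub-schema: a named group of data categories and nested components."""
    name: str
    multiple: bool = False  # True for a 1:N relation to the parent, False for 1:1
    children: List[Union["Component", DataCategory]] = field(default_factory=list)


def outline(node, depth=0):
    """Return the schema tree as a list of indented text lines."""
    if isinstance(node, DataCategory):
        return ["  " * depth + node.name]
    marker = " *" if node.multiple else ""
    lines = ["  " * depth + node.name + marker]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical schema assembled from attribute names used on the slides.
example = Component("example", multiple=True, children=[
    DataCategory("ortho"), DataCategory("engl trans")])
sense = Component("sense", multiple=True, children=[
    DataCategory("sense nr"), DataCategory("gram cat"),
    DataCategory("engl trans"), DataCategory("gloss"), example])
entry = Component("entry", children=[DataCategory("stem orthography"), sense])

print("\n".join(outline(entry)))
```

Components marked with * can occur several times under their parent; everything else occurs once.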
6
What else: Relations
• actually, component association is a relation of a special type
• need various types of relations between attributes and units in value strings
• each relation can be associated with features, i.e. relations can be seen as components in their own right
[Figure: three entries whose value strings refer to each other]
bank: breite Sitzgelegenheit (something broad to sit on)
sitzgelegenheit: etwas um zu sitzen (something to sit on)
schmal: gegenteil zu breit (contrary to broad)
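A relation between a unit in a value string and another entry could be represented roughly as follows. This is a sketch under assumptions: the `Relation` class, the relation kind `refers_to`, the attribute name `definition` and the feature shown are invented; only the three entries and their value strings come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Relation:
    """A relation that carries its own features, like a small component."""
    kind: str                 # hypothetical relation type name
    source_entry: str         # entry that holds the value string
    source_attribute: str     # attribute whose value contains the unit
    unit: str                 # the unit inside the value string
    target_entry: str         # entry the unit points to
    features: Dict[str, str] = field(default_factory=dict)


lexicon = {
    "bank": {"definition": "breite Sitzgelegenheit"},
    "sitzgelegenheit": {"definition": "etwas um zu sitzen"},
    "schmal": {"definition": "gegenteil zu breit"},
}

relations = [
    Relation("refers_to", "bank", "definition", "Sitzgelegenheit",
             "sitzgelegenheit", features={"match": "case-insensitive"}),
]


def is_consistent(relation: Relation, lexicon: Dict[str, Dict[str, str]]) -> bool:
    """Check that the unit occurs in the source value and the target entry exists."""
    value = lexicon[relation.source_entry][relation.source_attribute]
    return relation.unit in value and relation.target_entry in lexicon
```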
7
What else: Inheritance
[Figure: the entries b’ang, boeb’ang and goeb’ang each consist of common attributes and particular attributes]
just one example to reduce typing
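One way to read this slide: the common attributes are typed once and shared, and each entry adds only its particular attributes. The sketch below assumes that reading; the `Entry` class and all attribute names and values are placeholders, only the three headwords come from the slide.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Entry:
    headword: str
    common: Dict[str, str]                                    # shared, typed once
    particular: Dict[str, str] = field(default_factory=dict)  # entry-specific

    def attributes(self) -> Dict[str, str]:
        """Merge inherited and particular attributes; particular values win."""
        return {**self.common, **self.particular}


# Placeholder attributes, invented for illustration only.
shared = {"attribute A": "shared value"}
entries = [
    Entry("b’ang", shared),
    Entry("boeb’ang", shared, {"attribute B": "value for boeb’ang"}),
    Entry("goeb’ang", shared, {"attribute A": "overriding value"}),
]

for e in entries:
    print(e.headword, e.attributes())
```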
8
What else: conditions (operations)
• probably better examples around
• if value(X) then modify constraints(Y) etc.
[Figure: under head and outer-body-L, the value of lexemtype decides which attributes follow; just one example from DOBES]
conditions: if lexemtype = “stem | idiom | lexical word”; if lexemtype = “auxil | inflect affix”
attribute groups: meaning, sense nr, etc.; meaning effect, categorial effect, sense nr, etc.
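Such a condition can be sketched as a rule table keyed on the value of lexemtype. The pairing of conditions with attribute groups below is an assumption read off the order on the slide, and the function names are invented.

```python
from typing import Dict, FrozenSet, List

# Assumed pairing: the first condition goes with the first attribute group,
# the second condition with the second group.
RULES: Dict[FrozenSet[str], List[str]] = {
    frozenset({"stem", "idiom", "lexical word"}): ["meaning", "sense nr"],
    frozenset({"auxil", "inflect affix"}): ["meaning effect", "categorial effect", "sense nr"],
}


def allowed_attributes(lexemtype: str) -> List[str]:
    """Return the attributes that the value of lexemtype permits."""
    for values, attributes in RULES.items():
        if lexemtype in values:
            return attributes
    raise ValueError(f"no rule for lexemtype {lexemtype!r}")


def violations(entry: Dict[str, str]) -> List[str]:
    """Return the attributes in `entry` that the active rule does not allow."""
    allowed = set(allowed_attributes(entry["lexemtype"])) | {"lexemtype"}
    return sorted(name for name in entry if name not in allowed)
```

For example, an entry with lexemtype "stem" that also carries a "categorial effect" attribute would be reported by `violations`.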
9
ISO TC37/SC4 – the solution?
• ISO TC37/SC4 is about standardization in LR Management
• central is the data category registry (DCR)
  • basically a flat list of linguistic concepts
  • will contain is_a relations that are part of the concept definition, e.g. “transitive_verb” is_a “verb”
  • with proper definitions and conceptual space (value range)
  • request for filling the DCR (metadata, morphology, syntax, …)
• looking for abstract models (frameworks)
  • for lexica
  • for annotation structures
  • for semantic annotations
  • for syntactic annotations
  • …
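A flat registry with is_a links could look roughly like this. Only “transitive_verb” is_a “verb” comes from the slide; the other entries and the function name are invented for illustration, and the real DCR is not modelled here.

```python
from typing import Dict, Optional

# Flat list of concepts; each maps to its direct is_a parent (or None).
IS_A: Dict[str, Optional[str]] = {
    "part_of_speech": None,
    "verb": "part_of_speech",        # invented for illustration
    "transitive_verb": "verb",       # from the slide
    "noun": "part_of_speech",        # invented for illustration
}


def is_a(category: str, ancestor: str) -> bool:
    """True if `category` equals `ancestor` or reaches it by following is_a links."""
    current: Optional[str] = category
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A[current]
    return False
```

The list itself stays flat; the hierarchy exists only through the is_a links stored with each concept.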
10
Underlying Model
complex datcats | simple datcats
Data element concept | Conceptual domain
Data element | Value domain
Complex datcat | Set of simple datcats
/Gender/ | /masculine/ /feminine/ /neuter/
Implemented as an XML attribute named ‘gen’ | m, f, n
XML object | List of values
XML schema declaration
<w lemme="vert" gen="f">verte</w>
Dutch system is different
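The mapping between the value domain (m, f, n) and the conceptual domain (/masculine/, /feminine/, /neuter/) can be illustrated with the XML example from the slide. The function name `gender_of` and the error handling are assumptions of this sketch.

```python
import xml.etree.ElementTree as ET

# Value domain of the XML attribute 'gen' mapped to the simple data
# categories that make up the complex data category /Gender/.
GENDER = {"m": "masculine", "f": "feminine", "n": "neuter"}


def gender_of(xml_fragment: str) -> str:
    """Parse a <w> element and return the simple datcat named by its 'gen' attribute."""
    element = ET.fromstring(xml_fragment)
    code = element.get("gen")
    if code not in GENDER:
        raise ValueError(f"'gen' value {code!r} is outside the value domain")
    return "/" + GENDER[code] + "/"


print(gender_of('<w lemme="vert" gen="f">verte</w>'))  # prints /feminine/
```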
12
Core Model
Metamodel
• Made of lexical layers
Lexical layers
• Made of lexical components (or components)
[UML diagram: Lexical DB is associated with Global Info and Lexical Entry; Lexical Entry is associated with Form and Sense; the associations carry 1..1 and 0..n multiplicities]
• basis for modeling purposes is UML
• there will be an XML-schema based instantiation
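The core model can be mirrored in a few Python dataclasses. This is a rough sketch, not the normative LMF model: the field names are invented, and the multiplicities (one Global Info per database, any number of entries, forms and senses) are an assumed reading of the diagram.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GlobalInfo:
    data_categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class Form:
    data_categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class Sense:
    data_categories: Dict[str, str] = field(default_factory=dict)


@dataclass
class LexicalEntry:
    forms: List[Form] = field(default_factory=list)    # assumed 0..n
    senses: List[Sense] = field(default_factory=list)  # assumed 0..n


@dataclass
class LexicalDB:
    global_info: GlobalInfo                             # exactly one
    entries: List[LexicalEntry] = field(default_factory=list)


# Illustrative content only.
db = LexicalDB(GlobalInfo({"language": "French"}))
db.entries.append(LexicalEntry(
    forms=[Form({"orthography": "verte", "gender": "feminine"})],
    senses=[Sense({"gloss": "green"})],
))
```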
13
Extended Model
[UML diagram: the core model (Lexical DB, Global Info, Lexical Entry, Form, Sense) extended with Morphology, Inflexion and Paradigm; multiplicities shown are 1..1, 0..1, 0..n and 1..n]
Data categories shown:
/lemma/ /POS/ /gender/ /key form/
/orthography/ /gender/ /number/ /tense/ /person/ /mood/
/orthography/ /variant for/
/identifier/
14
Proposed Extensions
[UML diagram: Lexical Entry, Form and Sense extended with Syntactic family, Syntactic construct, Construct set, Syntactic position, Semantic frame, Semantic formula and Semantic argument; multiplicities shown are 1..1 and 0..n]
still ongoing discussions
15
What will LMF be?
• descriptions of the general model (metamodel + DCS)
• DC have to be ISO 11179/12620/… compliant
• Core model including component building, relations, conditions, inheritance
• Extension mechanism
• Proposed but not normative extensions (morphology, syntax, …)
• XML-schema based instantiation
• currently version 5 of the Draft Proposal: ISO/TC 37/SC 4 N130 Rev.5, Date: 2005-03-19, Working draft of ISO WD 24613:2005
• web-site: http://www.tc37sc4.org/
16
Goal LEXUS
To provide a framework capable of handling diverse lexicon structures and formats.
LEXUS is based upon the Lexical Markup Framework (LMF) within ISO TC37/SC4, which defines a blueprint for such a flexible framework.
LEXUS is the first test and reference implementation of LMF.
Increase interoperability by offering well-accepted data categories (ISO, GOLD, Shoebox MDF)
Workshop ‘Lexical Databases and digital tools’, Nijmegen, April 2004
17
Current Status
• supports full LMF core model
• allows for flexible creation of structures and content
• supports use of well-accepted Data Category Registries (ISO 12620, Shoebox MDF)
• allows for dynamic editing of structures and content
• supports use of multimedia content
• import of existing lexica (Shoebox, CHAT)
• export (Shoebox, LMF XML)
• customizable layout
18
Current Status
• user authentication
• personal workspace for creating and editing lexica
• merging facilities
• simple and advanced search
19
Current Status (Technical)
• Implemented in Java, using Open Source components
• Uses Spring to ‘wire’ the application
  • Modular approach avoiding ‘hard’ links
• Uses Hibernate as the persistence framework
  • Allows use of multiple databases (Postgres, MySQL, …)
• Uses Tomcat as Servlet Container
20
Users must authenticate before logging onto the application.
Logging onto the application
21
Each user has his/her own personal workspace where private lexica are stored
User workspace
22
New lexica may be created…
Lexicon creation
23
Lexicon import
New lexica may be imported from a lexical resource…
24
Lexicon structure
The LMF core model can be identified in this simple structure. Components and data categories can be identified using different icons. All may be dynamically created or modified.
25
Lexicon structure
Representation of a more complex structure. By selecting a node in the Tree the content of a component or datacategory is shown and may be modified.
26
Data category selection
Data categories can easily be selected from data category registries.
27
Lexical entry overview
Overview of lexical entries. By selecting a lexical entry the details will be revealed.
28
Lexical entry details
Details of a lexical entry. Modifications to the entry structure are constrained by the schema definition, e.g. by cardinality.
29
Lexical entry details
Attribute values can be easily modified. Various value types are supported (text, video, audio, image or file).
30
Lexical entry details
Example of uploading a video file.
31
Lexical entry details
Viewing multimedia content.
32
Alternative entry view
Alternative views are provided which may be customized in look and feel.
33
Synchronization of lexica
Personal Workspace
Main Lexicon
Lexica may be copied to and modified in the personal workspace
34
Synchronization of lexica
Personal Workspace
Main Lexicon
Lexica may be synchronized with the main lexicon
35
Synchronization of lexica
When synchronizing lexica, the user is notified of structural changes and is in total control of the synchronization process.
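How such a user-controlled synchronization might work can be sketched as a comparison of two schemas. This is a guess at the general idea, not a description of LEXUS internals: representing a schema as a set of attribute paths, and the function names `structural_changes` and `synchronize`, are assumptions.

```python
from typing import Dict, List, Set


def structural_changes(main: Set[str], workspace: Set[str]) -> Dict[str, List[str]]:
    """Compare two lexicon schemas given as sets of attribute paths."""
    return {
        "added": sorted(workspace - main),
        "removed": sorted(main - workspace),
    }


def synchronize(main: Set[str], workspace: Set[str], approved: Set[str]) -> Set[str]:
    """Apply only the changes the user approved; the rest of main is left untouched."""
    changes = structural_changes(main, workspace)
    result = set(main)
    result |= {path for path in changes["added"] if path in approved}
    result -= {path for path in changes["removed"] if path in approved}
    return result


# Hypothetical schemas for illustration.
main = {"entry/form/orthography", "entry/sense/gloss"}
workspace = {"entry/form/orthography", "entry/sense/definition"}
print(structural_changes(main, workspace))
```

The user sees the list of changes first and then passes the approved subset to `synchronize`.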
36
Future directions
• Support for various types of relations
• Import of data from other sources
• Support for other Data Category Registries, e.g. GOLD
• Integration with MPI archive
• Integration with exploitation tools (ELAN, ANNEX)
• Miscellaneous user requests
38
Example lexical structure
Example lexical structure used in the TEOP project within DOBES
Stem orthography
Sense *
Lexical subentry
Sense nr
sense
Gram cat
Gram subcat
Engl. Transl.
Example *
orthography
Engl. Transl.
[T/pr] nr
* sign stands for 1:n relations of sub-structures