
HAL Id: inria-00525421

https://hal.inria.fr/inria-00525421

Submitted on 18 Aug 2021

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

A Metamodel to Represent Terminology Data Collections

Laurent Romary

To cite this version: Laurent Romary. A Metamodel to Represent Terminology Data Collections. Open Forum 2003 on Metadata Registries, Jan 2003, Santa Fe, United States. inria-00525421

A Metamodel to Represent Terminology Data Collections

Open Forum 2003 on Metadata Registries Terminology and Ontologies Track

20-24 January 2003

Laurent Romary

Laboratoire Loria-INRIA

Summary

From terminologies to ontologies (and back…)

Experience gained in TC37/SC3 while working on ISO 16642 (Terminological Markup Framework)

Abstracting away from XML structures

Paving the way for future work within ISO TC37/SC4

The central role played by the metadata registry

Relation between TC37/SC4, ISO 11179 and W3C

TC37/SC1: Principles and methods

TC37/SC2: Terminography and Lexicography

TC37/SC3: Computer applications for terminology

TC37/SC4: Language resource management

General context

Designing a platform for representing terminological data

ISO TC37/SC3 context (computer applications in terminology)

Competition between two formats (i.e. two DTDs)

Design of ISO 16642: TMF - Terminological Markup Framework

European IST/Salt project

Working on the interoperability of lex-term formats

The ecology of lex-term data

[Diagram. Components: legacy terminological databases; client's lex-term banks; on-line access; editors; (distributed) terminological and lexical DB; MT system; MT lexicon; other term banks; external resources. Labelled operations: create and update; import; query and publish; export/import and merge; interchange.]

Objectives of ISO 16642

Providing a platform to:

Describe existing data structures

How does a client's information relate to one's own terminological database?

Design company-specific environments

E.g. to integrate lexicographic information related to MT

Identify ways of mapping these structures to industrial standards

E.g. export data in TBX


A family of formats

TMF

TML1 TML2 TML3 TMLi …

GMT

TMF - Terminological Markup Framework

TML - Terminological Markup Language

GMT - Generic Mapping Tool

General principles

Expressing constraints for representing computerized terminologies

What is the underlying structure of computerized terminologies?

Which data categories are used and under what conditions?

Maintaining interoperability between representations

Providing a conceptual tool for comparing two given formats

Designing a TML

DCR - Data Category Registry

DCS - Data Category Selection

GMT - Generic Mapping Tool

Meta-model

DCS:

• DCR subset

• Application-dependent data categories

Data Category Registry (Cf. ISO 12620)

Dialect:

• Expansion trees

• Styles + Vocabularies

Interoperability conditions

Terminological Markup Language (TML)

GMT

Meta-model

Terminological Data Collection (TDC)
    Global Information (GI)
    Complementary Information (CI)
    Terminological Entry (TE) *
        Language Section (LS) *
            Term Section (TS) *
                Term Component Section (TCS) *

(* marks a level that can occur several times within its parent)
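As a rough illustration of the nesting above, the levels of the meta-model can be written down as plain Python dataclasses. This is only a sketch: the class and field names are invented for this example and are not taken from ISO 16642, and features are simplified to one value per data category. The sample values come from the entry used later in these slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TermComponentSection:
    """TCS: description of one component of a term."""
    features: Dict[str, str] = field(default_factory=dict)


@dataclass
class TermSection:
    """TS: one term plus the data categories attached to it."""
    features: Dict[str, str] = field(default_factory=dict)
    components: List[TermComponentSection] = field(default_factory=list)


@dataclass
class LanguageSection:
    """LS: everything said about the concept in one language."""
    features: Dict[str, str] = field(default_factory=dict)
    terms: List[TermSection] = field(default_factory=list)


@dataclass
class TerminologicalEntry:
    """TE: one concept-level entry."""
    features: Dict[str, str] = field(default_factory=dict)
    languages: List[LanguageSection] = field(default_factory=list)


@dataclass
class TerminologicalDataCollection:
    """TDC: the whole collection, with its GI and CI blocks."""
    global_information: Dict[str, str] = field(default_factory=dict)
    complementary_information: Dict[str, str] = field(default_factory=dict)
    entries: List[TerminologicalEntry] = field(default_factory=list)


def count_terms(tdc: TerminologicalDataCollection) -> int:
    """Count the term sections found in all entries and languages."""
    return sum(len(ls.terms) for te in tdc.entries for ls in te.languages)


entry = TerminologicalEntry(
    features={"id": "ID67", "subjectField": "manufacturing"},
    languages=[
        LanguageSection(
            features={"lang": "en"},
            terms=[TermSection(features={"term": "alpha smoothing factor",
                                         "termType": "fullForm"})],
        ),
        LanguageSection(
            features={"lang": "hu"},
            terms=[TermSection(features={"term": "Alfa ..."})],
        ),
    ],
)
tdc = TerminologicalDataCollection(entries=[entry])
print(count_terms(tdc))
```

Keeping the features of each level in a dictionary mirrors the idea that the structure is fixed while the data categories attached to each level are chosen per application.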

Data categories

Existing background: ISO 12620: Computer applications for terminology - data categories

Around 300 entries:

Term, Part of speech, Preferred term, Animacy (Animate, Inanimate)

Abbreviated form for, Broader concept generic, …

Towards a formal description of data categories: RDF model of data category

Editing, on-line browsing, TML modeling

Basic attributes (inspired by ISO 11179)

Identification of the data category (ID, name, definition etc.)

Values (Character data, Integer, picklist etc.)

Locations of the data category in relation to the meta-model

Administrative fields to maintain one's own specification
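The basic attributes listed above could be held in a small record such as the one below. This is a hedged sketch, not the RDF model mentioned on the slide: the class `DataCategory`, its field names, the definition text and the picklist value "abbreviation" are all assumptions made for illustration, and administrative fields are left out.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Levels of the meta-model a data category may be attached to.
LEVELS = ("TDC", "GI", "CI", "TE", "LS", "TS", "TCS")


@dataclass(frozen=True)
class DataCategory:
    identifier: str                        # identification: ID
    name: str                              # identification: name
    definition: str                        # identification: definition
    value_type: str                        # "string", "integer" or "picklist"
    picklist: FrozenSet[str] = frozenset() # allowed values when value_type is "picklist"
    levels: FrozenSet[str] = frozenset()   # locations in the meta-model

    def accepts(self, value: str, level: str) -> bool:
        """Check a value against the value type and the allowed levels."""
        if level not in self.levels:
            return False
        if self.value_type == "picklist":
            return value in self.picklist
        if self.value_type == "integer":
            try:
                int(value)
            except ValueError:
                return False
        return True


# Illustrative entry; the definition and the second picklist value are assumed.
term_type = DataCategory(
    identifier="termType",
    name="term type",
    definition="Classification of a term by its form (illustrative wording).",
    value_type="picklist",
    picklist=frozenset({"fullForm", "abbreviation"}),
    levels=frozenset({"TS"}),
)

print(term_type.accepts("fullForm", "TS"))
print(term_type.accepts("fullForm", "LS"))
```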

Putting 16642 to work: decomposition of a terminological entry

TBX representation

<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>

Identifying the structural skeleton

termEntry (TE): id='ID67' [attribute]; subjectField='manufacturing' [typedElement]; definition='A value…' [typedElement]
    langSet (LS): lang='hu' [attribute]
        tig (TS): term='…' [element]
    langSet (LS): lang='en' [attribute]
        tig (TS): term='alpha smoothing factor' [element]; termType='fullForm' [typedElement]

TE - Terminological Entry

LS - Language Section

TS - Term Section

TMF information model

TE: id='ID67'; subjectField='manufacturing'; definition='A value…'
    LS: lang='hu'
        TS: term='…'
    LS: lang='en'
        TS: term='alpha smoothing factor'; termType='fullForm'

GMT representation

<struct type="TE">
  <feat type="id">ID67</feat>
  <feat type="subjectField">manufacturing</feat>
  <feat type="definition">A value between 0 and 1 used in ...</feat>
  <struct type="LS">
    <feat type="lang">en</feat>
    <struct type="TS">
      <feat type="term">alpha smoothing factor</feat>
      <feat type="termType">fullForm</feat>
    </struct>
  </struct>
  <struct type="LS">
    <feat type="lang">hu</feat>
    <struct type="TS">
      <feat type="term">Alfa ...</feat>
    </struct>
  </struct>
</struct>
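The correspondence between the TBX entry and the GMT form above can be sketched with Python's standard `xml.etree.ElementTree`. The mapping tables `LEVEL_OF` and `ATTRIBUTE_DATCATS` and the function `to_gmt` are assumptions written for this one sample entry; they are not a general TBX converter and are not part of ISO 16642.

```python
import xml.etree.ElementTree as ET

TBX = """<termEntry id='ID67'>
  <descrip type='subjectField'>manufacturing</descrip>
  <descrip type='definition'>A value between 0 and 1 used in ...</descrip>
  <langSet lang='en'>
    <tig>
      <term>alpha smoothing factor</term>
      <termNote type='termType'>fullForm</termNote>
    </tig>
  </langSet>
  <langSet lang='hu'>
    <tig>
      <term>Alfa ...</term>
    </tig>
  </langSet>
</termEntry>"""

# Assumed mapping from TBX element names to meta-model levels.
LEVEL_OF = {"termEntry": "TE", "langSet": "LS", "tig": "TS"}
# Attributes that carry a data category directly (style: attribute).
ATTRIBUTE_DATCATS = {"id", "lang"}


def add_feat(parent: ET.Element, datcat: str, value: str) -> None:
    """Append <feat type="datcat">value</feat> to parent."""
    feat = ET.SubElement(parent, "feat", {"type": datcat})
    feat.text = value


def to_gmt(node: ET.Element) -> ET.Element:
    """Turn one structural TBX node into a GMT <struct>, recursively."""
    struct = ET.Element("struct", {"type": LEVEL_OF[node.tag]})
    for name, value in node.attrib.items():
        if name in ATTRIBUTE_DATCATS:
            add_feat(struct, name, value)
    for child in node:
        if child.tag in LEVEL_OF:          # structural node: recurse
            struct.append(to_gmt(child))
        elif "type" in child.attrib:       # typed element, e.g. <descrip type='definition'>
            add_feat(struct, child.attrib["type"], child.text or "")
        else:                              # plain element, e.g. <term>
            add_feat(struct, child.tag, child.text or "")
    return struct


gmt = to_gmt(ET.fromstring(TBX))
if hasattr(ET, "indent"):                  # pretty-printing helper, Python 3.9+
    ET.indent(gmt)
print(ET.tostring(gmt, encoding="unicode"))
```

The three branches in `to_gmt` correspond to the three annotations used in the structural skeleton: attribute, typedElement and element.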


Styles and vocabularies

Implementing a DatCat

Definitions:

'style': the way a given DatCat is implemented as an XML object

'vocabulary': the symbols needed to express the implementation of a given DatCat in its associated style

E.g.: DatCat: /definition/

Style = Element

Vocabulary = ["def"]

<def>pencil whose casing …</def>
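A minimal sketch of how a style and its vocabulary might drive XML generation, using Python's standard `xml.etree.ElementTree`. The style names follow the annotations used earlier in the slides (attribute, element, typedElement); the function `implement` and the shape of each vocabulary list are assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def implement(parent: ET.Element, style: str, vocabulary: list, value: str) -> None:
    """Attach one data category value to parent according to its style."""
    if style == "element":
        (tag,) = vocabulary                       # e.g. ["def"]
        child = ET.SubElement(parent, tag)
        child.text = value
    elif style == "attribute":
        (name,) = vocabulary                      # e.g. ["id"]
        parent.set(name, value)
    elif style == "typedElement":
        tag, type_attr, type_value = vocabulary   # e.g. ["descrip", "type", "subjectField"]
        child = ET.SubElement(parent, tag, {type_attr: type_value})
        child.text = value
    else:
        raise ValueError(f"unknown style: {style}")


entry = ET.Element("termEntry")
implement(entry, "attribute", ["id"], "ID67")
implement(entry, "element", ["def"], "pencil whose casing …")
implement(entry, "typedElement", ["descrip", "type", "subjectField"], "manufacturing")
print(ET.tostring(entry, encoding="unicode"))
```

Changing only the vocabulary (for instance `["definition"]` instead of `["def"]`) yields a different XML surface for the same data category, which is the point of separating the two notions.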


From an information model point of view…

Modeling Information Units

                  Type                           Instance
Model             Data Category Specification    Feature structures
Implementation    Schema fragments               XML fragments

Model to Implementation: Styles (vocab+anchors)

Modeling Structure

                  Type                           Instance
Model             Meta-Model (Fixed by 16642)    Structural skeleton
Implementation    XML Schema fragments           XML outline

Model to Implementation: Expansion trees

Going further

Data categories as metadata for language resources in the context of TC37 *(/SC2 + /SC3 + /SC4)

Goals of ISO TC 37/SC 4

TC37/SC4 - Language Resource Management

Prepare international standards/guidelines for effective language resource management in mono- and multi-lingual applications

Develop principles and methods for creating, coding, processing and managing language resources

written corpora, lexical databases, spoken language corpora, etc.

Platform for designing and implementing linguistic resource formats and processes

Multi-layer annotation of linguistic resources

Exchange of information between NLP modules

TC37/SC4 overall rationale

WG1 Basic descriptors and mechanisms for language resources

WG2 Representation schemes

WG3 Multilingual text representation

WG4 Lexical databases

WG5 Workflow of language resource management

Why is metadata central?

Problem: We will never agree on one single format for one single purpose

Good reasons for that: various theoretical backgrounds, various levels of processing, various application contexts etc.

Standardization should provide description/mapping means between formats

Objective: defining interoperability principles within processing levels

– Morpho-syntax, Syntax, Semantics, Lexica, etc.

Metadata for content description

Author: ‘Salinas’

"¿Tú sabes lo que eres de mí?

¿Sabes tú el nombre?

No es el que todos te llaman,

esa palabra usada

que se dicen las gentes,

Auteur: ‘Salinas’

"¿Tú sabes lo que eres de mí?

¿Sabes tú el nombre?

No es el que todos te llaman,

esa palabra usada

que se dicen las gentes,

/auteur/

Author=/auteur/

Metadata registry

Metadata for structural description

Author: ‘Salinas’

<p> "¿Tú sabes lo que eres de mí?

¿Sabes tú el nombre?

No es el que todos te llaman,

esa palabra usada

que se dicen las gentes,

</p>

Auteur: ‘Salinas’

<para> "¿Tú sabes lo que eres de mí?

¿Sabes tú el nombre?

No es el que todos te llaman,

esa palabra usada

que se dicen las gentes,

</para>

/paragraphe/

<p>=/paragraphe/

Metadata registry
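The role of the registry in the two examples above can be sketched as a lookup that relates each format's local label to a shared data category. The slides only state Author=/auteur/ and <p>=/paragraphe/; the entries for "Auteur" and "para", the names `FORMAT_A` and `FORMAT_B`, and the function `translate` are assumptions made for illustration.

```python
# Local vocabularies of two formats, each pointing to registry entries.
FORMAT_A = {"Author": "/auteur/", "p": "/paragraphe/"}
FORMAT_B = {"Auteur": "/auteur/", "para": "/paragraphe/"}   # assumed entries


def translate(label: str, source: dict, target: dict) -> str:
    """Map a label of the source format to the target format via the registry."""
    datcat = source[label]                 # KeyError if the label is not registered
    for local_name, registered in target.items():
        if registered == datcat:
            return local_name
    raise KeyError(f"{datcat} has no counterpart in the target format")


print(translate("Author", FORMAT_A, FORMAT_B))
print(translate("para", FORMAT_B, FORMAT_A))
```

Neither format needs to know the other's vocabulary; both only need to be described against the same registry.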


Multiple uses of data categories

Data category selection

Meta model

Documentation

Meta-data

XML schemas

XSL filters

An MDR for TC37

[Diagram. Labels: Data Category Registry (core resource); DCR board (sc2-sc3-sc4), harmonization role; committees (Terminology, Language coding, Meta-data for lang. res., Morphosyntax), selection role; 12620-2 view, 12620-3 view, 12620-j view, …; Part 1, Part 2, Part 3, Part 4, Part i; example data categories /French/ and /Gender/.]

Several issues

Understanding our relation with other initiatives

Issues - relation to ISO 11179

ISO 11179               Data categories          Example
Data element concept    Complex datcat           /Gender/
Conceptual domain       Set of simple datcats    /masculine/, /feminine/, /neuter/
Data element            XML object               Implemented as an XML attribute named 'gen' (XML schema declaration)
Value domain            List of values           m, f, n

<w lemme="vert" gen="f">verte</w>
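The /Gender/ example can be sketched as a lookup from the values found in an XML instance back to simple data categories. The pairing of m, f, n with /masculine/, /feminine/, /neuter/ is implied rather than spelled out on the slide, and the dictionary layout and the function `resolve` are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Complex data category, its XML implementation and its list of values.
GENDER = {
    "datcat": "/Gender/",
    "attribute": "gen",
    "values": {"m": "/masculine/", "f": "/feminine/", "n": "/neuter/"},
}


def resolve(element: ET.Element, spec: dict = GENDER):
    """Return (complex datcat, simple datcat) for the element, or None if absent."""
    raw = element.get(spec["attribute"])
    if raw is None:
        return None
    if raw not in spec["values"]:
        raise ValueError(f"{raw!r} is not an allowed value for {spec['datcat']}")
    return spec["datcat"], spec["values"][raw]


word = ET.fromstring('<w lemme="vert" gen="f">verte</w>')
print(resolve(word))
```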


Issues

Data categories for language resources

Containers and Value

/Gender/ /Masculine/, /Feminine/, /Neuter/

Value meanings as administered items

Associating DatCats with views

Contexts?

Restrictions on applicability

/Gender/ applies to fr/en/de, but not to ja

Styles and vocabularies

Hierarchies of data categories

Classification system

Issues - relation to W3C

What we need to represent:

ISO 11179 features

TC 37 registry

Specific format (XML)

Data Category

TC 37/SC 4 standard (e.g. POS annotation)

What W3C (SemWeb) format we could use:

RDFS: to express how features combine

RDFS: specific constraints for LR

RDF: to represent elementary entries

OWL: to relate levels in MM, properties, relations

XML schema: to control instances of the format

Perspective

Implementing a data category registry: a priority for TC37/SC4

Common background for a variety of future standards

Specificities related to committee activities (e.g. experts, votes)

Towards a real ontology of linguistic objects

Collaboration with the ISO 11179 community is essential


For More Information

Laurent Romary

Laboratoire Loria-INRIA

[email protected]