Principles of ISOcat, a Data Category Registry


Page 1: Principles of ISOcat, a Data Category Registry

www.isocat.org
Lexicon Tools and Lexicon Standards

Principles of ISOcat, a Data Category Registry

Marc Kemps-Snijders (a), Menzo Windhouwer (a), Sue Ellen Wright (b)
(a) Max Planck Institute for Psycholinguistics, (b) Kent State University
marc.kemps-snijders@mpi.nl, [email protected], [email protected]

4/8/2010

Page 2: Principles of ISOcat, a Data Category Registry

Outline

• What are Data Categories?
• How can you use Data Categories?
• What is a Data Category Registry?
• How can you use a Data Category Registry?
• Future work

Page 3: Principles of ISOcat, a Data Category Registry

ISO 12620:2009

• Terminology and other content and language resources -- Specification of data categories and management of a Data Category Registry for language resources
  – An ISO TC 37/SC 3 standard (see [1])
  – Successor to ISO 12620:1999, which contained a hardcoded list of Data Categories

Page 4: Principles of ISOcat, a Data Category Registry

What is a Data Category?

• The result of the specification of a given data field
  – A data category is an elementary descriptor in a linguistic structure or an annotation scheme.
• Specification consists of 3 main parts:
  – Administrative part
    • Administration and identification
  – Descriptive part
    • Documentation in various working languages
  – Linguistic part
    • Conceptual domain(s) for various object languages

Page 5: Principles of ISOcat, a Data Category Registry

Data category example

• Data category: /Grammatical gender/
  – Administrative part:
    • Identifier: grammaticalGender
    • PID: http://www.isocat.org/datcat/DC-1297
  – Descriptive part:
    • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
    • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
  – Linguistic part:
    • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
    • French conceptual domain: /masculine/, /feminine/
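
The three-part specification can be pictured as a small record type. The sketch below is only an illustration of the slide's example and not the ISOcat data model: the class name, field names and the choice of dictionaries are assumptions, and the value names follow the /masculine/, /feminine/, /neuter/ values used for grammaticalGender in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Hypothetical in-memory view of a data category specification."""
    # Administrative part: administration and identification
    identifier: str
    pid: str
    # Descriptive part: definitions keyed by working language
    definitions: dict[str, str] = field(default_factory=dict)
    # Linguistic part: conceptual domains keyed by profile or object language
    conceptual_domains: dict[str, list[str]] = field(default_factory=dict)


gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

# The French domain narrows the profile-wide Morphosyntax domain.
french_values = set(gender.conceptual_domains["French"])
print(french_values <= set(gender.conceptual_domains["Morphosyntax"]))
```

Keeping the domains in a mapping makes it easy to add further object languages without changing the record itself.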

Page 6: Principles of ISOcat, a Data Category Registry

Mandatory parts of the specification

• For each data category:
  – a mnemonic identifier
  – an English definition
  – an English name
• For complex data categories:
  – a conceptual domain
• For standardization candidates:
  – a profile
  – a justification
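
The mandatory parts listed above lend themselves to a simple completeness check. The following is a minimal sketch under assumptions: the dictionary keys and the function name are invented for illustration and do not correspond to an ISOcat API.

```python
def missing_mandatory_parts(spec: dict, *, complex_dc: bool = False,
                            candidate: bool = False) -> list[str]:
    """Return the names of mandatory parts that are absent or empty."""
    # Required for every data category
    required = ["identifier", "english_definition", "english_name"]
    # Additionally required for complex data categories
    if complex_dc:
        required.append("conceptual_domain")
    # Additionally required for standardization candidates
    if candidate:
        required += ["profile", "justification"]
    return [key for key in required if not spec.get(key)]


draft = {
    "identifier": "grammaticalGender",
    "english_name": "grammatical gender",
    "english_definition": "Category based on ...",  # abbreviated placeholder
}
print(missing_mandatory_parts(draft, complex_dc=True))
# ['conceptual_domain']
```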

Page 7: Principles of ISOcat, a Data Category Registry

Data Category types

complex:
  • open: writtenForm (string)
  • closed: grammaticalGender (string), with values masculine, feminine, neuter
  • constrained: email (string), Constraint: .+@.+
simple:
  • neuter, masculine, feminine (the values of the closed data category)
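
The difference between the three complex types shows up when a value is checked against them. This is a minimal sketch, not ISOcat behaviour: the function name and parameters are hypothetical, and treating the constraint as a regular expression that must match the whole value is an assumption.

```python
import re


def is_valid_value(value: str, kind: str, *, allowed=(), pattern=None) -> bool:
    """Check a string value against an open, closed or constrained domain."""
    if kind == "open":
        return True                      # any string is acceptable
    if kind == "closed":
        return value in allowed          # must be one of the simple DCs
    if kind == "constrained":
        return re.fullmatch(pattern, value) is not None
    raise ValueError(f"unknown data category type: {kind}")


print(is_valid_value("nihongo", "open"))                           # True
print(is_valid_value("neuter", "closed",
                     allowed=("masculine", "feminine", "neuter")))  # True
print(is_valid_value("not-an-address", "constrained",
                     pattern=r".+@.+"))                             # False
```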

Page 8: Principles of ISOcat, a Data Category Registry

Data Category relationships

• Value domain membership
• Subsumption relationships between simple data categories (legacy)
• Relationships between complex data categories are not stored in the DCR

[Diagram: partOfSpeech (string) has the value pronoun; personalPronoun is subsumed by pronoun]

www.isocat.org

Lexicon Tools and Lexicon Standards 9

No ontological relationships?

• Rationale: – Relation types and modeling strategies for a given

data category may differ from application to application;

– Motivation to agree on relation and modeling strategies will be stronger at individual application level;

– Integration of multiple relation structures in DCR itself could lead to endless ontological clutter.

4/8/2010

Page 10: Principles of ISOcat, a Data Category Registry

Usage of is-a relationships between simple DCs

Data category               Morphosyntax   Terminology
/partOfSpeech/              X              X
/adjective/                 X              X
  /ordinalAdjective/        X
  /participleAdjective/     X
  /qualifierAdjective/      X
/adposition/                X              X
  /circumposition/          X
  /preposition/             X
  /postposition/            X

Page 11: Principles of ISOcat, a Data Category Registry

How can you use Data Categories?

[Figure: two data models that reference the same data categories]
• An LMF (ISO 24613:2008) compliant (schema for a) lexicon: Lexicon, Lexical Entry, Form, Sense, Word Form, Lemma (multiplicities 0..*, 0..*, 1..*, 1..*), annotated with partOfSpeech, writtenForm, writtenForm, grammaticalGender, lexicalType
• A (schema for a) typological database: Language, BWO, genders, annotated with grammaticalGender, wordOrder
• Shared semantics!

Page 12: Principles of ISOcat, a Data Category Registry

How?

• A (TC 37) meta model which is instantiated with a domain/application specific data category selection into a data model
• A proprietary data model with a related data category selection

Page 13: Principles of ISOcat, a Data Category Registry

How?

<lmf:lexicon xml:lang="ja" alphabet="ipa">
  <lmf:entry>
    <lmf:lemma>
      <lmf:writtenForm>nihongo</…>
      …
    </…>
    …
  </…>
  …
</…>

Page 14: Principles of ISOcat, a Data Category Registry

Referencing Data Categories

• Each Data Category should be uniquely identifiable
  – Ambiguity: different domains use the same term but mean different 'things'
  – Semantic rot: even in the same domain the meaning of a term changes over time
  – Persistence: for archived resources Data Category references should still be resolvable and point to the specification as it was at/close to time of creation
• Persistent IDentifiers
  – ISO/DIS 24619 Language resource management -- Persistent identification and access in language technology applications
  – ISOcat uses 'cool URIs' (see [6])
    • http://www.isocat.org/datcat/DC-1297 (/grammaticalGender/)
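
Because the PID is a plain URI, a tool can recognise ISOcat references with simple string handling. The sketch below is an illustration only: the pattern is derived from the single PID shown on the slide, other PID shapes are not covered, and the function name is hypothetical.

```python
import re
from typing import Optional

# Pattern derived from the one PID shown in the slides (an assumption).
PID_PATTERN = re.compile(r"^http://www\.isocat\.org/datcat/DC-(\d+)$")


def datcat_number(pid: str) -> Optional[int]:
    """Return the numeric key of an ISOcat PID, or None if it does not match."""
    match = PID_PATTERN.match(pid)
    return int(match.group(1)) if match else None


print(datcat_number("http://www.isocat.org/datcat/DC-1297"))  # 1297
print(datcat_number("http://example.org/gender"))             # None
```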

Page 15: Principles of ISOcat, a Data Category Registry

Where do you put these references?

• In a schema:

<rng:attribute name="alphabet" dcr:datcat="http://www.isocat.org/datcat/…">
  <rng:value dcr:datcat="http://www.isocat.org/datcat/…">ipa</…>
  …
</…>
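
Once references are embedded in a schema, a tool can collect them to find out which data categories the schema relies on. The sketch below uses Python's standard `xml.etree.ElementTree` and rests on assumptions: the `dcr` namespace URI is assumed, the PIDs `DC-0000` and `DC-0001` are placeholders standing in for the elided ones on the slide, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# The Relax NG namespace is standard; the dcr namespace URI is an assumption.
RNG_NS = "http://relaxng.org/ns/structure/1.0"
DCR_NS = "http://www.isocat.org/ns/dcr"

# Completed version of the slide fragment, with placeholder PIDs.
SCHEMA = f"""
<rng:attribute xmlns:rng="{RNG_NS}" xmlns:dcr="{DCR_NS}"
               name="alphabet"
               dcr:datcat="http://www.isocat.org/datcat/DC-0000">
  <rng:value dcr:datcat="http://www.isocat.org/datcat/DC-0001">ipa</rng:value>
</rng:attribute>
"""


def datcat_references(xml_text: str) -> list[tuple[str, str]]:
    """List (local element name, PID) for every element with dcr:datcat."""
    root = ET.fromstring(xml_text)
    refs = []
    for element in root.iter():
        pid = element.get(f"{{{DCR_NS}}}datcat")
        if pid is not None:
            local_name = element.tag.split("}", 1)[-1]
            refs.append((local_name, pid))
    return refs


for name, pid in datcat_references(SCHEMA):
    print(name, pid)
```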

Page 16: Principles of ISOcat, a Data Category Registry

ISO TC 37 standards using Data Categories

• Terminological Markup Framework (TMF; ISO 16642)
• Lexical Markup Framework (LMF; ISO 24613)
• TermBase eXchange (TBX; ISO 30042)
• Morpho-syntactic Annotation Framework (MAF; ISO 24611)
• Linguistic Annotation Framework (LAF; ISO 24612)

• Meta models which can be instantiated into a specific data model with data categories
• However, some still refer to ISO 12620:1999 Data Categories and some don't support all types (see [3])

Page 17: Principles of ISOcat, a Data Category Registry

Other uses of Data Categories

• CLARIN Component Metadata Infrastructure (CMDI)
• ISO 12620:2009 provides a small XML vocabulary, DC Reference (see [4]), which provides elements and attributes to embed Data Category references in arbitrary XML documents
  – Including: XML Schema, Relax NG, TEI/ISO feature structures, …
• The references can be used in URI based 'mappings':
  – Including: ODD, RDF-based vocabularies (OWL, SKOS), …

Page 18: Principles of ISOcat, a Data Category Registry

What is a Data Category Registry?

• A (coherent) set of Data Categories, in our case for linguistic resources
• A system to manage this set:
  – Create and edit Data Categories
  – Share Data Categories, e.g., resolve PID references
  – Standardize Data Categories
• Grass roots approach

Page 19: Principles of ISOcat, a Data Category Registry

Standardize Data Categories

[Figure: standardization workflow. Labels, in reading order: Submission group; Data Category Registry Board (Validation); Thematic Domain Group (Evaluation); Stewardship group; Decision Group; outcomes: rejected, rejected, Publication]

Page 20: Principles of ISOcat, a Data Category Registry

Thematic Domain Groups

TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 5: Machine Readable Dictionary
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics
TDG 14: Source Identification

• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of judges (assigned by SC P members)
• A number of expert members (up to 50%)
• TDGs are constituted at the TC37/SC plenary
• New TDGs need to be proposed by a SC
  1. Translation
  2. Sign language
  3. Audio

Page 21: Principles of ISOcat, a Data Category Registry

How can you use a Data Category Registry?

• You can:
  – Find Data Categories relevant for your resources and embed references to them so the semantics of (parts of) your resources are made explicit
    • This can be supported by tools you use, e.g., ELAN, LEXUS and the CMDI Component Editor directly interact with ISOcat
  – Interact with Data Category owners to improve (the coverage of) their Data Categories
  – Create (together with others) new Data Categories and/or selections needed for your resources and share those
  – Submit (your) Data Categories for standardization
  – Free of charge
  – Grass roots approach

Page 22: Principles of ISOcat, a Data Category Registry

ISOcat

• Reference implementation of ISO 12620:2009
• The TC 37 Data Category Registry

Page 23: Principles of ISOcat, a Data Category Registry

Future work

• Finish first complete version of ISOcat:
  – Standardization process
• Cleanup of the current set of Data Categories
  – TDGs clean up their profiles
  – Standardize first sets of Data Categories
• Interaction with other TC 37 standards:
  – Migration from ISO 12620:1999
  – Full support for all types of Data Categories

Page 24: Principles of ISOcat, a Data Category Registry

More future work

• Additional Data Category types
  – Container Data Categories
    • Complex and Simple only cover 'leaves' and their values
  – Data Category Concepts
    • Basic building blocks for knowledge bases
• Relation Registries (RR)
  – Store (your) (semantic) relationships between Data Categories
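
The idea of keeping relationships outside the DCR can be sketched as a small store of typed links between PIDs. Everything in this example is hypothetical: the slides describe the Relation Registry only as future work, so the class, its methods, the relation name `sameAs` and the placeholder PID `DC-0002` are assumptions made for illustration.

```python
from collections import defaultdict


class RelationRegistry:
    """Hypothetical store for typed relations between data category PIDs."""

    def __init__(self) -> None:
        # Maps (subject PID, relation name) to the set of object PIDs.
        self._relations = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        """Record that `subject` stands in `relation` to `obj`."""
        self._relations[(subject, relation)].add(obj)

    def objects(self, subject: str, relation: str) -> set[str]:
        """Return a copy of all objects linked from `subject` by `relation`."""
        return set(self._relations.get((subject, relation), set()))


BASE = "http://www.isocat.org/datcat/"
rr = RelationRegistry()
# Link a placeholder data category to /grammaticalGender/ (DC-1297).
rr.add(BASE + "DC-0002", "sameAs", BASE + "DC-1297")
print(rr.objects(BASE + "DC-0002", "sameAs"))
```

Keeping the links in a separate structure means each application can maintain its own relations without changing the data category specifications themselves.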

Page 25: Principles of ISOcat, a Data Category Registry

Container Data Categories

• Use the administrative and descriptive parts to manage standardization and describe the containers (components/tables/classes/objects/inner nodes…) of a meta/data model in the DCR
• But the relationships between components and complex data categories wouldn't be stored in the DCR (maybe in the RR)

Page 26: Principles of ISOcat, a Data Category Registry

Data Category Concepts

• ISO 12620:2009:
  – 3.1.3 data category (DC): result of the specification of a given data field
    • EXAMPLE /partOfSpeech/, /grammaticalGender/, /grammaticalNumber/; the values associated with these items (for example, /noun/, /verb/, /feminine/, /plural/, etc.) are also data categories according to this International Standard, but values of this type are not viewed as data element concepts (3.1.4) in the ISO/IEC 11179 family of standards.
    • NOTE 2 A data category corresponds closely, but not identically, to a data element concept in ISO/IEC 11179.
• DCR Guidelines for the definition
  – Definitions shall follow the rules outlined for intensional definitions in ISO 704
    • They should begin with the superordinate concept, either immediately above or at a higher level of the data category concept being defined;
    • They should list critical and delimiting characteristic(s) that distinguish the concept from other related concepts.
• So DCs and concepts are related; maybe this relationship should become clear in the DCR? Maybe the concept descriptions should be in the DCR, and the ontological relationships in the RR

Page 27: Principles of ISOcat, a Data Category Registry

Possible full model

[Figure: two panels, Data model and Knowledge base. Labels: lexicon; language, alphabet, entry; lemma; writtenForm; japanese, ipa]

Page 28: Principles of ISOcat, a Data Category Registry

GOLD in ISOcat

• Map GOLD concepts to DC types:
  – Some to closed DCs with simple DC hierarchies
    • For example: /formUnit/ with simple DCs /Foot/, /Grapheme/, … and /Segment/ which is the parent of /Consonant/, /Vowel/, …
  – Some to simple DCs (as they can't have values)
    • Could be candidates for the proposed new data category concept type?
  – Ontological relations could be stored in the (to be built) RR
    • The combination of the DCR + RR should result in (part of) the ontological structure

Page 29: Principles of ISOcat, a Data Category Registry

Registry network

[Figure: registry network in three layers]
• Linguistic resources: MPI archive, TDS database, resource
• Data category registries: MPI DCR, ISO DCR
• Relation registries: Typological Database System RR, MPI RR

Page 30: Principles of ISOcat, a Data Category Registry

Thank you for your attention!

Visit www.isocat.org

Questions? www.isocat.org/forum/

[email protected]

Page 31: Principles of ISOcat, a Data Category Registry

References

[1] ISO 12620, Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources.
[2] http://www.isocat.org/manual/DCRGuidelines.pdf
[3] M.A. Windhouwer, S.E. Wright, M. Kemps-Snijders. Referencing ISOcat data categories. In proceedings of the LREC 2010 LRT standards workshop. Malta, May 18, 2010.
[4] http://www.isocat.org/12620/
[5] http://www.isocat.org/rest/help.html
[6] Tim Berners-Lee, Cool URIs don't change, 1998.