Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry. Menzo Windhouwer, The Language Archive – DANS (tla.mpi.nl, menzo.windhouwer@dans.knaw.nl). eHg - New Trends in e-Humanities, 28 March 2013. www.isocat.org

DESCRIPTION

New Trends in e-Humanities, KNAW e-Humanities Group, Amsterdam, March 28, 2013

TRANSCRIPT

Page 1: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

www.isocat.org

eHg - New Trends in e-Humanities 1

Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

28 March 2013

Menzo Windhouwer
The Language Archive – DANS

menzo.windhouwer@dans.knaw.nl

Page 2: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

The Language Archive

• Founded in September 2011
• Supported by MPG, BBAW and KNAW (DANS)
• Grown out of the Technical Group at the MPI for Psycholinguistics
• Since the 1990s: challenge of archiving digital data
• 2000 – 2016: Volkswagen Foundation DOBES project on Endangered Languages
• Active in many European infrastructure projects: CLARIN, EUDAT, DASISH, …

Page 3: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Language Archiving Technology

• Full lifecycle support
  – Core: resources
  – Key: metadata
  – ‘New’: CMDI, ISOcat, AV recognition, …
• Archive size:
  – 70 TB of resources
  – 22,000 hours of AV recordings
  – 75,000 sessions (metadata)
  – 5 million annotated segments
  – 50 lexica
• My focus: Knowledge Systems
  – LEXUS, an online lexicon tool
  – ISOcat and companions

Page 4: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Typological Database Nijmegen

TOP NOTION tds:Noun
GROUPS {
  NOTION tdn:GrammaticalDistinctions
  LABEL "Grammatical distinctions for nouns."
  GROUPS {
    NOTION tdn:AgentNouns
    LABEL "Agent nouns."
    DESCRIPTION "Nouns can function as the agent of a clause."
    LINK TO CONCEPT agentRole
    GROUPS {
      NOTION tdn:v098_plusAffix
      LABEL "Agent nouns formed by verb stem plus affix."
      LINK TO CONCEPTS (agentRole, verbalMorphology, boundAffix)
      DESCRIPTION <p>Agent nouns are formed by a verb stem plus an affix, e.g. English <qv>walk-er</qv>.</p>
      NOTE AUTHOR IS "TDS" TYPE IS "original TDN label" "AGENT NOUNS ARE VERB STEM PLUS AFFIX"
      IS FIELD v098;
      ...

Notes: TDN is not archived in TLA, but curated in TDS, a previous project I worked on, and now archived at DANS; also, this is not a TDN punchcard.

Explicit semantics!

Page 5: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

DOBES corpora

Explicit semantics!

Shared semantics!

Page 6: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Oxford English Dictionary

Source: http://www.oxford-royale.co.uk/news/2010/12/04/new-online-edition-of-oxford-english-dictionary.html

Page 7: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Terminology Community of Practice

• Community started out on paper (A5 fiches), just like the OED
• 1980s - 1990s: projects to standardize data category names, i.e., the names of the ‘fields’ on the fiches or in the files/database records
• ISO 12620:1999 Data Categories, a companion standard to ISO 12200 Machine-readable terminology interchange format (MARTIF)

Page 8: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

ISO 12620:1999

Page 9: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Towards a Data Category Registry

• Problems with ISO 12620:1999, a hardcoded list of data categories
  – Not easily extensible
  – Ordering heavily debated
  – Outdated and limited in range at the moment of release
• Developments
  – In the SALT project an interchange model (TBX) based on MARTIF/data categories was created, which was widely adopted
  – ISO 11179 Metadata Registries was released, which describes the standardization of data element concepts for metadata
  – ISO released Annex ST Standards as databases, which describes an ISO procedure to standardize registry entries
  – In the LIRICS project a pilot Data Category Registry, SYNTAX, was created

Page 10: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

ISO 12620:2009

• Terminology and other language and content resources -- Specification of data categories and management of a Data Category Registry for language resources
  – A data model for data category specifications inspired by ISO 11179
  – A procedure to standardize data category specifications compliant with Annex ST
  – Each data category gets a unique Persistent Identifier (PID)
  – The Max Planck Institute for Psycholinguistics is appointed as the Registration Authority of the ISO/TC 37 DCR
• In use by a growing number of ISO TC 37 standards
  – Lexical Markup Framework (LMF)
  – Linguistic Annotation Framework (LAF)
  – Morpho-syntactic Annotation Framework (MAF)
  – …
  – could be more, e.g., Feature System Declarations (FSD)

Page 11: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Example Data Category specification

• Data category: /Grammatical gender/
  – Administrative part:
    • Identifier: grammaticalGender
    • PID: http://www.isocat.org/datcat/DC-1297
  – Descriptive part:
    • English definition: Category based on (depending on languages) the natural distinction between sex and formal criteria.
    • French definition: Catégorie fondée (selon la langue) sur la distinction naturelle entre les sexes ou d'autres critères formels.
  – Linguistic part:
    • Morphosyntax conceptual domain: /masculine/, /feminine/, /neuter/
    • French conceptual domain: /masculine/, /feminine/

[Diagrams: Figure 4 - The administration part; Figure 5 - The description part; Figure 6 - The linguistic part. Labelled elements include DCR, Data Category, Global Information, Administration Information Section, Administration Record, Registration Group, Submission Group, Stewardship Group, Decision Group, Description Section, Language Section, Name Section, Definition Section, Example Section, Explanation Section, Data Element Name Section, Linguistic Section, Conceptual Domain, Value Domain and Change Section.]
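
The three parts of the /Grammatical gender/ example can be mirrored in a small record type. The Python sketch below is only an illustration: the class and field names (`DataCategory`, `definitions`, `conceptual_domains`, `allowed_values`) are assumptions made for this example and are not the ISOcat schema or API; the values are the ones shown on the slide.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataCategory:
    """Illustrative record for a data category specification (hypothetical names)."""

    # Administrative part
    identifier: str
    pid: str
    # Descriptive part: language code -> definition
    definitions: dict[str, str] = field(default_factory=dict)
    # Linguistic part: profile or language name -> permitted values
    conceptual_domains: dict[str, list[str]] = field(default_factory=dict)

    def allowed_values(self, domain: str) -> list[str]:
        """Return the values of one conceptual domain, or an empty list."""
        return self.conceptual_domains.get(domain, [])


grammatical_gender = DataCategory(
    identifier="grammaticalGender",
    pid="http://www.isocat.org/datcat/DC-1297",
    definitions={
        "en": "Category based on (depending on languages) the natural "
              "distinction between sex and formal criteria.",
        "fr": "Catégorie fondée (selon la langue) sur la distinction "
              "naturelle entre les sexes ou d'autres critères formels.",
    },
    conceptual_domains={
        "Morphosyntax": ["masculine", "feminine", "neuter"],
        "French": ["masculine", "feminine"],
    },
)

print(grammatical_gender.allowed_values("French"))
```

Keeping the conceptual domains keyed by profile or language makes the point of the example visible: the same data category can permit /neuter/ in the general Morphosyntax domain while the French domain does not.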

Page 12: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Standardization procedure

[Flow diagram. Labelled elements, in order: Submission group; Data Category Registry Board (Validation); Thematic Domain Group (Evaluation); Stewardship group; Decision Group; Publication. Two arrows are labelled "rejected".]

Page 13: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Thematic Domain Groups

TDG 1: Metadata
TDG 2: Morphosyntax
TDG 3: Semantic Content Representation
TDG 4: Syntax
TDG 6: Language Resource Ontology
TDG 7: Lexicography
TDG 8: Language Codes
TDG 9: Terminology
TDG 11: Multilingual Information Management
TDG 12: Lexical Resources
TDG 13: Lexical Semantics

• TDGs are the owners and guardians of a coherent subset of the DCR
• TDGs own one or more profiles
• Each TDG has a chair
• A number of members assigned by SC P members
• A number of expert members invited by the chair (up to 50%)
• TDGs are constituted at the TC37/SC plenary
• New TDGs need to be proposed by an SC

1. Translation
2. (Sign language)

Page 14: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

ISOcat - the ISO TC 37/DCR

• A (coherent) set of Data Categories, in our case for linguistic resources
• A system to manage this set:
  – Create and edit Data Categories
  – Share Data Categories, e.g., resolve PID references
  – Standardize Data Categories
• An API for tools to access the DCR
• Grass roots approach
  – Anyone can access the DCR and use or create the data categories (s)he needs

Page 15: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Referring to ISOcat data categories

• PIDs of data categories can easily be embedded in XML documents

<lmf:LexicalEntry>
  <tei:f name="partOfSpeech"
         dcr:datcat="http://www.isocat.org/datcat/DC-1345"
         fVal="commonNoun"
         dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm"
           dcr:datcat="http://www.isocat.org/datcat/DC-1836"
           fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>

• Embedding in other formats is also possible, e.g., via comments
• Preferably annotate schemas, so a whole range of resources is annotated in one go
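
A tool that consumes such a document can collect the embedded PIDs with a standard XML parser. The Python sketch below is an illustration under stated assumptions: the slide shows only the prefixes `lmf`, `tei` and `dcr`, so the namespace URIs bound to them here are assumed (the `lmf` one is purely hypothetical), and the function name `datcat_references` is made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace URIs are assumptions for this sketch; the slide shows only prefixes.
NS = {
    "lmf": "http://example.org/ns/lmf",   # hypothetical
    "tei": "http://www.tei-c.org/ns/1.0",
    "dcr": "http://www.isocat.org/ns/dcr",
}

DOC = f"""
<lmf:LexicalEntry xmlns:lmf="{NS['lmf']}" xmlns:tei="{NS['tei']}" xmlns:dcr="{NS['dcr']}">
  <tei:f name="partOfSpeech"
         dcr:datcat="http://www.isocat.org/datcat/DC-1345"
         fVal="commonNoun"
         dcr:valueDatcat="http://www.isocat.org/datcat/DC-1256"/>
  <lmf:Lemma type="Form">
    <tei:f name="writtenForm"
           dcr:datcat="http://www.isocat.org/datcat/DC-1836"
           fVal="clergyman"/>
  </lmf:Lemma>
</lmf:LexicalEntry>
"""


def datcat_references(xml_text):
    """Map each feature name to the (datcat, valueDatcat) PIDs it carries."""
    root = ET.fromstring(xml_text)
    # ElementTree addresses namespaced names as "{namespace-uri}local-name".
    datcat_attr = f"{{{NS['dcr']}}}datcat"
    value_attr = f"{{{NS['dcr']}}}valueDatcat"
    refs = {}
    for feature in root.iter(f"{{{NS['tei']}}}f"):
        refs[feature.get("name")] = (feature.get(datcat_attr), feature.get(value_attr))
    return refs


refs = datcat_references(DOC)
for name, (datcat, value_datcat) in refs.items():
    print(name, datcat, value_datcat)
```

A feature without a `dcr:valueDatcat` attribute, such as `writtenForm`, simply yields `None` for the value PID.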

Page 16: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

A glimpse of ISOcat

Page 17: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Collaboration in ISOcat

• Registered users can contact each other via mediated email
  – Ask the owner if a data category can be adapted a little to your needs
• Registered users can start up a group and invite other users to join
  – Work together on a set of data categories
  – Interact via a public and/or private forum
• A group can submit data categories for ISO standardization

Page 18: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Component MetaData Infrastructure

• CMDI is developed by CLARIN and on its way to standardization by ISO TC 37
  – Limitations of existing metadata schemas (DC/OLAC, IMDI, TEI header):
    • Inflexible: too many (IMDI) or too few (OLAC) metadata elements
    • Limited interoperability (both semantic and syntactic)
    • Problematic (unfamiliar) terminology for some sub-communities
    • Limited support for LT tool & service descriptions
  – The idea is to address this by:
    • Explicitly defined schema & semantics
    • User/project/community defined components

Page 19: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

CMDI architecture

[Architecture diagram. Labelled elements: OAI-PMH Data provider; OAI-PMH Service provider; Local metadata repository; Joint metadata repository; metadata modeler; metadata user; metadata creator; component registry & editor; metadata editor; metadata curator (twice); metadata catalogue; Relation Registry; search & semantic mapping; DATA; ISOcat.]

Page 20: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Athens Core

• Bootstrapped the selection of Metadata data categories in ISOcat
  – Based on existing metadata standards, e.g., DC, OLAC, IMDI, TEI
  – Many translations in European languages
• Users add the data categories they need to the Metadata profile and use them in CMDI

Page 21: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

CMDI architecture

[Architecture diagram. Labelled elements: OAI-PMH Data provider; OAI-PMH Service provider; Local metadata repository; Joint metadata repository; metadata modeler; metadata user; metadata creator; metadata editor; metadata curator (twice); Relation Registry; search & semantic mapping; DATA; ISOcat; metadata catalogue; component registry & editor.]

Page 22: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

CMDI architecture

[Architecture diagram. Labelled elements: OAI-PMH Data provider; OAI-PMH Service provider; Local metadata repository; Joint metadata repository; metadata modeler; metadata user; metadata creator; metadata editor; metadata curator (twice); Relation Registry; search & semantic mapping; DATA; ISOcat; metadata catalogues (VLO, MI); component registry & editor.]

Page 23: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

CMDI (intermediate) results

• Diverse metadata profiles
  – Centers or projects create specific ones, but reuse components where possible
• Shared and explicit semantics help to overcome
  – Terminological differences
  – Differences in structure
• Future
  – Get more context sensitive
    • e.g. documentation language vs. speaker language
  – Crosswalks
    • equivalent metadata data categories are easily introduced due to the open nature of ISOcat
  – User specific relationships
    • e.g. theory specific differences can be more important to one user than another
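
How shared, explicit semantics bridge differences in terminology and structure can be sketched in a few lines: two profiles name and nest their elements differently, but elements that point at the same data category can be paired. Everything in this Python sketch is hypothetical (the profile contents, the element paths, the `example.org` PIDs and the function name `crosswalk`); it is not the CMDI or ISOcat API.

```python
from collections import defaultdict

# Hypothetical profiles: element path -> data category PID (all values invented).
PROFILE_A = {
    "Session/Title": "http://example.org/datcat/title",
    "Session/Content/Language": "http://example.org/datcat/languageName",
}
PROFILE_B = {
    "resource/name": "http://example.org/datcat/title",
    "resource/lang": "http://example.org/datcat/languageName",
    "resource/size": "http://example.org/datcat/size",
}


def crosswalk(source, target):
    """Pair element paths from two profiles that point at the same data category."""
    by_pid = defaultdict(list)
    for path, pid in target.items():
        by_pid[pid].append(path)
    # Keep only source elements whose data category also occurs in the target.
    return {path: by_pid[pid] for path, pid in source.items() if pid in by_pid}


mapping = crosswalk(PROFILE_A, PROFILE_B)
print(mapping)
```

Elements whose data categories are merely equivalent rather than identical would need an extra lookup of such relations, which is the kind of information a relation registry could supply.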

Page 24: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Metadata TDG

• Standardization efforts of the Metadata TDG stalled
  – Large overlap with the work/people at the Athens-Core meetings
    • Community level agreement is maybe enough
  – Activity motivation should not depend on only one person, the TDG chair
    • The need for explicit and shared semantics is not clear enough yet … more evangelization needed
  – Unfamiliarity with the work
    • Terminologists are more used to this kind of review work
    • Online review vs. old ISO ‘paper’ process
  – Members have little time; it is difficult to sync schedules
    • TDG experts tend to be senior scientists
    • Continuous process vs. sporadic bursts of activity
  – Unpaid work
    • Project funding vs. wide acceptance in the community
    • However, a project might bootstrap a thematic domain
• The same problems hold for other TDGs
  – Current tendency to tie data category (selection) standardization to a new/revised standard, e.g., MAF and TBX
  – Redesign of the standardization process is coming up
    • ISO is not actively supporting Annex ST Standards as Databases anymore

Page 25: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Community efforts

• LMF-related: UBY, RELISH/GOLD
• Sign Language
• CLARIN
  – CMDI, Athens Core
  – CLARIN-NL/VL
    • Call 1 – 4 projects created CMDI and annotated resources/schemas
    • ISOcat content coordinator: Ineke Schuurman
      – Tutorials, guidelines (do’s and don’ts) and feedback
• Better community support in ISOcat
  – Views, e.g., CLARIN-NL/VL
  – Recommended by, e.g., DC-4949
  – …

Page 26: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Conclusions and future work

• Communities can already create a coherent view on ISOcat
  – the CMDI use case shows potential
  – maybe funder support is needed to bootstrap specific domains
• The standardized core will take (a long) time
  – like all standardization work
• Next to metadata also content
  – explicit semantics would be profitable even when not shared and/or used for resource discovery
  – tools that support ISOcat will make it easier to create such resources
• Companion registries:
  – relations between data categories (RELcat)
  – annotated schemas for language resources (SCHEMAcat)
  – interaction with the CLARIN vocabulary service (CLAVAS)
• Data categories vs. concepts

Page 27: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Detour: ISOcat and LOD/Semantic Web

• Archives and infrastructures look at the resources as they are, i.e., in general no conversions to triples
• However, ISOcat data categories can easily be used in RDF resources:

partOfSpeech dcr:datcat <http://www.isocat.org/datcat/DC-396> ;
    rdfs:label "part of speech"@en ;
    rdfs:comment "A category assigned to a word based on its grammatical and semantic properties."@en .

• The Relation Registry, which is a triple store, will in general support lightweight, semi-formal ontologies

M. Windhouwer, S.E. Wright. Linking to linguistic data categories in ISOcat. LDL 2012.
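
The RDF statement on this slide follows a fixed pattern, so it can be generated for any data category. The Python sketch below uses only string formatting and is an illustration under stated assumptions: the function names are made up, the `dcr` namespace URI is assumed, the `rdfs` URI is the standard RDF Schema namespace, and the default `:` prefix on `example.org` is hypothetical. It is added so that the subject, which the slide writes without a prefix, becomes a valid Turtle prefixed name.

```python
def escape_literal(text):
    """Escape backslashes and double quotes for a Turtle string literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def datcat_turtle(local_name, pid, label, comment, lang="en"):
    """Render one resource annotated with a data category PID as Turtle."""
    return "\n".join([
        "@prefix : <http://example.org/vocab#> .",              # hypothetical vocabulary
        "@prefix dcr: <http://www.isocat.org/ns/dcr.rdf#> .",   # assumed namespace URI
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f":{local_name} dcr:datcat <{pid}> ;",
        f'    rdfs:label "{escape_literal(label)}"@{lang} ;',
        f'    rdfs:comment "{escape_literal(comment)}"@{lang} .',
    ])


turtle = datcat_turtle(
    "partOfSpeech",
    "http://www.isocat.org/datcat/DC-396",
    "part of speech",
    "A category assigned to a word based on its grammatical and semantic properties.",
)
print(turtle)
```

For anything beyond a sketch, an RDF library would be a safer way to serialize triples than hand-built strings.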

Page 28: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

Thank you for your attention!

Visit www.isocat.org

Questions? www.isocat.org/forum/

[email protected]

Acknowledgements
Thanks to everyone at TLA, Sue Ellen Wright, Ineke Schuurman, Marc Kemps-Snijders, CLARIN-NL, CLARIN, ISO TC 37

Page 29: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

A whole litter of cats!

[Diagram relating the registries. Labelled elements: Data Category Registry - ISOcat; Relation Registry - RELcat; Schema Registry - SCHEMAcat; Concept Registry; Linguistic knowledge base; Linguistic resource (schema); Data categories; Containers; Concepts; Relation.]

Page 30: Collaboratively Defining Widely Accepted Linguistic Data Categories in the ISOcat Data Category Registry

ISO 11179: concepts vs. data elements/categories

ISO 12620 Data Categories