Customizing the IMDI metadata schema for endangered languages
Heidi Johnson (AILLA)
Arienne Dwyer (DOBES)
Introduction
IMDI: International Standards for Language Engineering Metadata Initiative
DOBES: Volkswagen Foundation’s Documentation of Endangered Languages initiative
AILLA: the Archive of the Indigenous Languages of Latin America
Types of resources
Audio and video recordings in various digital formats
Annotation text files, e.g. transcriptions and translations
Standalone texts, e.g. dictionaries, poetry
Wide range of genres: from verbal art to scholarly analyses
Bundles of resources
Session (IMDI, 2001): the resources resulting from a linguistic elicitation session, i.e. recordings and annotations.
This models only one kind of resource production: a recording session.
Collections will include a greater variety of resources, in sets of related materials.
Types of bundles
Canonical bundle: the original session. A digitized recording, in different formats, and some textual annotation files, also in different formats.
Minimal bundle: a single file. Examples: dictionary, poem, recording of uninterpretable chants.
Meta-bundle: a bundle containing other bundles. Example: a book about a set of annotated recordings.
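The three bundle types can be sketched as one recursive structure. This is only an illustration: the `Bundle` class, its field names, the file names, and the shape-based `kind()` rule are assumptions made here, not part of the IMDI schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bundle:
    """Hypothetical model: a bundle holds files and, optionally, other bundles."""
    name: str
    files: List[str] = field(default_factory=list)
    children: List["Bundle"] = field(default_factory=list)

    def kind(self) -> str:
        # Assumed rule of thumb (not defined by IMDI): classify by shape.
        if self.children:
            return "meta-bundle"
        if len(self.files) == 1:
            return "minimal"
        return "canonical"

    def all_files(self) -> List[str]:
        # Own files first, then the files of nested bundles, depth-first.
        found = list(self.files)
        for child in self.children:
            found.extend(child.all_files())
        return found


# Placeholder file names, for illustration only.
session = Bundle("session-01", ["story.wav", "story.mp3", "story.txt"])
poem = Bundle("poem-01", ["poem.txt"])
book = Bundle("book-01", ["book.pdf"], children=[session, poem])

for b in (session, poem, book):
    print(b.name, b.kind(), len(b.all_files()))
```

Allowing a bundle to contain bundles is what lets a single structure cover the session, the single file, and the book about a set of recordings.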
Bundle elements
Current:
– Name of bundle
– Date and place of production
Proposed:
– Resource relations
– Date archived
– Last modified
Major subschemas
Project
Collector
Content
Participants
Resources
References
The Content Subschema
Genre is the top-level category:
– Interaction: conversation, interview …
– Explanation: description, recipe …
– Performance: narrative, poem, oratory …
– Teaching: primer, textbook …
– Analysis: grammar, dictionary …
Other Content categories
Modality: speech, writing, gesture
Communication context:
– Interactivity
– Planning
– Involvement
Languages
Task
Description
Keys
AILLA’s Content Keys
Register: a characterization of how the discourse reflects the social context. Example: honorific speech
Style: about poetic and stylistic effects. Examples: parallelism, metered verse.
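A minimal sketch of how a Content record might carry a Genre plus AILLA's Register and Style keys as free-form key/value pairs. The `Content` class, its field names, and the `GENRES` table are hypothetical; the table lists only the example subgenres named on the slides and is not exhaustive.

```python
from dataclasses import dataclass, field
from typing import Dict

# Top-level genres with the example subgenres given above (lists are open-ended).
GENRES = {
    "Interaction": ("conversation", "interview"),
    "Explanation": ("description", "recipe"),
    "Performance": ("narrative", "poem", "oratory"),
    "Teaching": ("primer", "textbook"),
    "Analysis": ("grammar", "dictionary"),
}


@dataclass
class Content:
    """Hypothetical Content record; not the IMDI element layout."""
    genre: str
    subgenre: str
    modality: str
    keys: Dict[str, str] = field(default_factory=dict)  # local Key/Value pairs

    def __post_init__(self) -> None:
        # Only the top-level genre is checked, since subgenre lists are open-ended.
        if self.genre not in GENRES:
            raise ValueError(f"unknown genre: {self.genre!r}")


speech = Content(
    genre="Performance",
    subgenre="oratory",
    modality="speech",
    keys={"Register": "honorific speech", "Style": "parallelism"},
)
print(speech.keys["Register"])
```

Keeping Register and Style in a plain dictionary mirrors the idea that archive-specific keys can be added without changing the fixed part of the schema.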
The Project subschema
Current elements:
– Name: a nickname or acronym
– Title: official title
– ID: a unique identifier
– Contact information
Proposed element:
– Funder: name of funding organization
The Collector subschema
AILLA renames this Depositor, since this is the individual we have to keep track of (e.g. for Level 3 access permission). When the Depositor is not also the Collector, Collector can be listed under Participants.
The Participants subschema
Type: functional role, e.g. creator
Role: family relationship
Name/Full name
Language(s)
Ethnic group, age, sex
Education
Anonymous: True if participant’s Full name is reserved; False otherwise
AILLA additions to Participants
Origin: Place (country, region, etc) of origin of the creator of the primary resource in the bundle (e.g. the speaker whose voice is recorded).
Occupation: Can be relevant in assessing accuracy of some kinds of data.
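One way to read the Anonymous flag is as a switch that withholds Full name from public output. The sketch below assumes that reading; the `Participant` class, the `public_record` helper, and all sample values are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Participant:
    """Hypothetical Participant record with the AILLA additions."""
    type: str                          # functional role, e.g. "creator"
    name: str                          # short name shown in listings
    full_name: str                     # reserved when anonymous is True
    role: Optional[str] = None         # family relationship
    languages: List[str] = field(default_factory=list)
    anonymous: bool = False
    origin: Optional[str] = None       # AILLA addition
    occupation: Optional[str] = None   # AILLA addition

    def public_record(self) -> Dict[str, object]:
        # Assumed behaviour: drop the full name when the participant is anonymous.
        record: Dict[str, object] = {
            "type": self.type,
            "name": self.name,
            "languages": list(self.languages),
            "origin": self.origin,
            "occupation": self.occupation,
        }
        if not self.anonymous:
            record["full_name"] = self.full_name
        return record


# Placeholder values, for illustration only.
speaker = Participant(
    type="creator",
    name="Speaker A",
    full_name="Example Full Name",
    languages=["Quechua"],
    anonymous=True,
    origin="Peru",
)
print(speaker.public_record())
```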
The Resources subschema
Resources contains information about formats and provenance of files in a bundle.
Media Files: audio, video, etc.
Annotation Files: text files.
Proposal: call them all Media Files, to reduce redundancy in the database. (All have URL, size, etc. elements.)
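The proposal to treat every file as a Media File could look like the sketch below: one record type with the shared elements, and a field that distinguishes recordings from annotations. The `MediaFile` class, the `kind` field, and the example URLs are assumptions for illustration, not the database design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MediaFile:
    """Hypothetical unified record for recordings and annotation files."""
    url: str
    size_bytes: int
    format: str
    kind: str  # "recording" or "annotation"; assumed discriminator


def annotations(files: List[MediaFile]) -> List[MediaFile]:
    # Select the text annotation files from a mixed list.
    return [f for f in files if f.kind == "annotation"]


# Placeholder URLs and sizes, for illustration only.
files = [
    MediaFile("https://example.org/story.wav", 52_000_000, "audio/x-wav", "recording"),
    MediaFile("https://example.org/story.txt", 14_000, "text/plain", "annotation"),
]
print([f.url for f in annotations(files)])
```

With a single type, the shared elements (URL, size, format) are defined once instead of being repeated for each kind of file.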
Text resources
Current elements:
– Type: type of annotation, e.g. phonetic transcription.
– Content encoding: annotation encoding scheme, e.g. EUROTYP.
– Character encoding: character set(s) used in a text file.
Text resources 2
Proposed elements:
– Transcription type
– Translation (aka Glossing) type
– Software: used to produce transcriptions, translations, other annotations (e.g. Shoebox)
Describe Annotator in Participants (along with Translator, etc.)
Proposed subschema
Place: composed of several elements:
– Continent
– Country
– Region
– Subregion (address)
Repeated at least twice, in Bundle and in Participants (Origin).
Might also be useful in the Language subschema.
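Because Place recurs in Bundle and in Participants (Origin), it can be sketched as one small type that both reuse. The class names `Place`, `BundleInfo`, and `ParticipantInfo`, the `label` helper, and the sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    """Hypothetical shared Place subschema."""
    continent: str
    country: str
    region: str = ""
    subregion: str = ""  # address-level detail

    def label(self) -> str:
        # Most specific part first; empty parts are skipped.
        parts = (self.subregion, self.region, self.country, self.continent)
        return ", ".join(p for p in parts if p)


@dataclass
class BundleInfo:
    name: str
    place: Place  # place of production


@dataclass
class ParticipantInfo:
    name: str
    origin: Optional[Place] = None  # AILLA's Origin element


# Placeholder values, for illustration only.
cusco = Place("South America", "Peru", "Cusco")
bundle = BundleInfo("session-01", place=cusco)
speaker = ParticipantInfo("Speaker A", origin=cusco)
print(bundle.place.label())
```

Defining Place once keeps the Bundle and Participants descriptions consistent, and the same type could be attached to a Language record if that proves useful.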
Conclusion
IMDI schema is a flexible tool.
Customization through Key/Value pairs allows local modifications.
Most of the proposed changes are terminological, moving from the DOBES in-house terminology to more general usage.