The Rosetta Project Digital Language Archive Laura Buszard-Welcher The Long Now Foundation / University of California, Berkeley

Post on 15-Jan-2016


Page 1: The Rosetta Project   Digital Language Archive

The Rosetta Project Digital Language Archive

Laura Buszard-Welcher

The Long Now Foundation /

University of California, Berkeley

Page 2: The Rosetta Project   Digital Language Archive

The Rosetta Project Archive

• A public, Web-based, digital archive of language documentation

• Part of the National Science Digital Library (NSF program for dissemination of educational STEM resources)

• Over 95,000 pages of resources on over 2,300 languages

• Over 3,000 wordlists (Swadesh lists, 500- to 1,500-term lists)

• New! Audio files

Page 3: The Rosetta Project   Digital Language Archive

Project Goals: Resources

• We are a digital language archive with comprehensive, global scope: we can and do accept digital resources on any language, dialect, family, or subgroup.

• We promote linguistic diversity by broadly disseminating resources on languages with small numbers of speakers, contributing to the effort to document and disseminate resources on endangered languages.

• Comprehensive scope both requires and builds communities: global networks of linguists, speakers, educators

Page 4: The Rosetta Project   Digital Language Archive

Project Goals: Interoperability and Resource Discovery

• Supporting metadata standardization and interoperability (OLAC participating archive and individuals, E-MELD, GOLD, LSA Conversation on Endangered Language Archiving)

• Promoting resource discovery through open archive search: we serve oai_dc, nsdl_dc, olac_dc metadata
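
As a rough illustration of what serving these formats means for a harvester, the sketch below builds one request URL per metadata format. It assumes the metadata is exposed through a standard OAI-PMH endpoint, which the slide implies but does not state. The base URL is a placeholder, since the slides do not give the archive's actual OAI address, and the function name is made up for this example.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: the slides do not give the archive's OAI base URL.
BASE_URL = "http://example.org/oai"

# The three metadata formats named on the slide.
FORMATS = ("oai_dc", "nsdl_dc", "olac_dc")


def list_records_url(base_url: str, metadata_prefix: str) -> str:
    """Build an OAI-PMH ListRecords request for one metadata format."""
    if metadata_prefix not in FORMATS:
        raise ValueError(f"unsupported metadataPrefix: {metadata_prefix}")
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


if __name__ == "__main__":
    for prefix in FORMATS:
        print(list_records_url(BASE_URL, prefix))
```

A harvester would fetch each URL and parse the XML response. Offering several prefixes lets general-purpose aggregators (oai_dc), NSDL, and OLAC each request the records in the vocabulary they understand.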

Page 5: The Rosetta Project   Digital Language Archive

Project Goals: Developing tools for collaborative linguistic research

• Endangered Language Query Room

• DOCS (Digital Online Curation Services)

• LangGator

• Wordlist tool (collaboration with MPI-EVA)

• New Rosetta V2.0 Website

Page 6: The Rosetta Project   Digital Language Archive

Site Infrastructure

• Plone 2.1 content management system, running in the Zope Application Server

• Open source, leverages worldwide developer communities

• Lots of “plug in” modules for functionality expansion

– CMF Bibliography AT, Plone Board, etc.

• Heavily modified infrastructure (language node design) and user interface

Page 7: The Rosetta Project   Digital Language Archive

Nodal Architecture

• Languages, language families, family subgroups, dialects all represented by nodes.

• A node is a content aggregation page

• Nodes and parent-child relationships each have unique IDs

• The system currently represents Ethnologue language relationships, but has the flexibility to be agnostic about them and to represent relationships from various theoretical perspectives
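
The nodal design described on this slide can be pictured with a small data-structure sketch. This is not the Plone implementation. All class, field, and scheme names here are invented for illustration, and the "flat-demo" scheme is a made-up alternative classification. The point it shows is that giving each parent-child relationship its own ID and a scheme label lets several classifications coexist over the same nodes.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Node:
    node_id: int
    name: str
    kind: str  # "family", "subgroup", "language", or "dialect"


@dataclass
class Relationship:
    rel_id: int
    parent_id: int
    child_id: int
    scheme: str  # which classification asserts this link (hypothetical label)


@dataclass
class NodeGraph:
    nodes: dict = field(default_factory=dict)
    relationships: list = field(default_factory=list)
    # One shared counter, so node IDs and relationship IDs never collide.
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def add_node(self, name: str, kind: str) -> Node:
        node = Node(next(self._ids), name, kind)
        self.nodes[node.node_id] = node
        return node

    def link(self, parent: Node, child: Node, scheme: str) -> Relationship:
        rel = Relationship(next(self._ids), parent.node_id, child.node_id, scheme)
        self.relationships.append(rel)
        return rel

    def children(self, parent: Node, scheme: str) -> list:
        """Children of a node according to one classification scheme."""
        return [self.nodes[r.child_id] for r in self.relationships
                if r.parent_id == parent.node_id and r.scheme == scheme]


if __name__ == "__main__":
    g = NodeGraph()
    ie = g.add_node("Indo-European", "family")
    germanic = g.add_node("Germanic", "subgroup")
    english = g.add_node("English", "language")
    g.link(ie, germanic, "ethnologue")
    g.link(germanic, english, "ethnologue")
    g.link(ie, english, "flat-demo")  # made-up alternative view
    print([n.name for n in g.children(ie, "ethnologue")])
    print([n.name for n in g.children(ie, "flat-demo")])
```

Because the tree is stored as relationships and not inside the nodes, switching theoretical perspective only changes which relationships are read. The node pages and their aggregated content stay the same.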

Page 8: The Rosetta Project   Digital Language Archive

Node Pages

• Accessible from a variety of browse and search pages

– Browse by language name, family, country, data type

– Quick search, advanced search

• Node page organization

– Node metadata

– Descriptive Resources

– Navigation: classification tree

– Links to people functions, LINGUIST List people search

– External links: searches

Page 9: The Rosetta Project   Digital Language Archive

Content

• In-house collection, vetting

– Primary focus of collection

– Rosetta descriptive categories

• Special collections

– Endangered Language Fund (ELF) Digital Archives

– Alan Lomax Audio Collection

– Future collections that come in through DOCS

• Future development

– Uploaded, peer-reviewed resources

– Collaborative content areas (bulletin boards, wiki)

Page 10: The Rosetta Project   Digital Language Archive

Scanning

• Historically, the primary focus of in-house collection

• Rosetta serves over 95,000 images from a variety of published resources

• Excerpts in data categories (see following slides)

• Public domain resources can be scanned in their entirety

Page 11: The Rosetta Project   Digital Language Archive

Categories of Collection (1)

Ethnologue metadata: General information from www.ethnologue.com about language affiliation, where spoken, number of speakers, dialects, alternate language names.

General description: General description of the language. Origin and current distribution of language, number of speakers, family, typology, history, etc.

Maps: Maps of the geographic distribution of a language and its relationship to other languages in the region.

Orthography: Writing system(s) of the language with any accompanying guide to pronunciation, use, etc.

Phonology: A description of the basic sound units in a language (phonemes) and how they combine to form utterances.

Page 12: The Rosetta Project   Digital Language Archive

Categories of Collection (2)

Grammar: How a language combines the smallest units of meaning (morphemes) to create words and words to create sentences.

Core Word Lists: A common word list of 100 or 200 terms typically collected in linguistic fieldwork (“Swadesh Lists”), often used for comparative purposes.

Numbers: A description of the numbering system(s) in a language with a list of basic terms.

Parallel Texts: A common text with translation for each language. Initially Genesis Chapters 1-3 (a commonly collected text). Now also the UN Declaration of Human Rights.

Glossed Texts: Transcribed indigenous texts with word glosses, free translations and grammatical markup.

Page 13: The Rosetta Project   Digital Language Archive

Resource Pages• Accessed from node pages

• Bibliographic metadata

• Links to other resources

• Resource bundles

• Associated resource files

– Scanned images

– OCR’ed live text files

– Annotated text files

– Audio/video files

– User comments

Page 14: The Rosetta Project   Digital Language Archive

Community Functions

• Goal: build a network of linguists, speakers, educators

• People:

– Member pages

– Regional and language curators

• Collaborative content:

– Discussions (nodes, resources)

– Resource upload

– Vetting by volunteer language/family experts

– In the future? Wiki documents (unvetted, but resources produced may go through higher vetting levels)

Page 15: The Rosetta Project   Digital Language Archive

Member Gallery

• Central access to member search and browse

• Central access to language forums

• Highlighted members

Page 16: The Rosetta Project   Digital Language Archive

Member Profile Page

• User-defined content area

• List of recent uploads

• Lists of recent forum postings

Page 17: The Rosetta Project   Digital Language Archive

Audio Digitization

• Alan Lomax language audio collection (mostly reel-to-reel, some cassette)

• Edirol external digitizer (96 kHz sample rate, 24 bit depth)

• Sound Forge 7.0, uncompressed .wav

• Now accepting audio deposits (on a limited basis)

• We archive and serve digital resources, not physical media
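
To give a sense of the storage these digitization settings imply, the sketch below works out the size of uncompressed audio at 96 kHz and 24 bits. The slide does not say whether the transfers are mono or stereo, so the channel count is an assumption left as a parameter, and the function name is invented for this example.

```python
def wav_bytes(seconds: float, sample_rate: int = 96_000,
              bit_depth: int = 24, channels: int = 2) -> int:
    """Size of the raw PCM audio data, excluding the small WAV header."""
    bytes_per_sample = bit_depth // 8  # 24 bits -> 3 bytes
    return int(seconds * sample_rate * bytes_per_sample * channels)


if __name__ == "__main__":
    one_hour = 60 * 60
    for ch in (1, 2):  # channel count is an assumption, not from the slide
        size = wav_bytes(one_hour, channels=ch)
        print(f"{ch} channel(s): {size / 1e9:.2f} GB per hour")
```

At these settings an hour of tape comes to roughly 1.04 GB in mono or 2.07 GB in stereo, which helps explain why audio deposits are accepted only on a limited basis.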

Page 18: The Rosetta Project   Digital Language Archive

Rosetta Depositor Consent Form• Prompted by special collections (ELF, Alan Lomax Audio)

• Intended to work on paper, or in digital form

• Inspired by AILLA’s graded access system

• Encourages depositors to see archiving as a kind of publication: assumes dissemination of some or all of resources

– “In general, we encourage all depositors to make their resources freely available, and to consider archiving with us as a form of publication. If you feel the need to place an extreme form of restriction on the resource, then our project may not be the most suitable place to archive your resource. We reserve the right to archive only those resources that we deem appropriate to our project, with respect to both content and access.”

Page 19: The Rosetta Project   Digital Language Archive

Level 1: Open access to recordings

Users have full access to recordings after agreeing to our Terms and Conditions. For this level, we assume that depositors have already gained permission for public access from the speakers or authors of the resource. Level 1 access may be applied to the entire deposit, or to parts of the deposit. If portions of the deposit are to be restricted, attach a detailed description that clearly identifies them, and designate one of the following access levels (2-5) for each restricted portion.

Page 20: The Rosetta Project   Digital Language Archive

Level 2: Access limited by password

Users may access recordings only if they know a password that you create. This type of access allows you to keep resources private, or provide access to others by sharing the password with them. Access limited by passwords must be renegotiated with The Rosetta Project every five years, at which time depositors may continue use of a previous password, choose a new password, or select another access level (Rosetta will contact the depositor at the appropriate time). If not renegotiated, access to the resource changes to open access (Level 1).

Page 21: The Rosetta Project   Digital Language Archive

Level 3: Access protected by a time limit

Users may not access the resources until after a specified date. Although we encourage all depositors to make their resources freely available, we understand that some depositors may want to restrict access to resources for a few years (normally five or less) while preparing a publication, such as a dissertation. After the date you specify, access to the resource changes to open access (Level 1).

Page 22: The Rosetta Project   Digital Language Archive

Levels 4 and 5: Designated Controllers

Level 4. The depositor controls access to the resource. The Rosetta Project will provide contact information, and the user will have to contact the depositor directly for permission, and the depositor then will write to The Rosetta Project. If permission is granted, The Rosetta Project will give the user access to the resource.

Level 5. The depositor designates another person or organization to control the resource. The Rosetta Project will contact the controller on the user’s behalf. If permission is granted, The Rosetta Project will give the user access to the resource (please attach controller’s contact information).
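
The five access levels and their drift toward open access can be summarized in a short sketch. It is an illustration of the rules as worded on the slides, not the archive's software. The names are invented, and the exact boundary handling (for instance, treating the five-year password term as expiring on the anniversary date) is an assumption.

```python
from datetime import date
from enum import IntEnum
from typing import Optional


class Access(IntEnum):
    OPEN = 1
    PASSWORD = 2
    TIME_LIMIT = 3
    DEPOSITOR_CONTROLLED = 4
    DESIGNATED_CONTROLLER = 5


def add_years(d: date, years: int) -> date:
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # 29 February landing in a non-leap year
        return d.replace(year=d.year + years, day=28)


def effective_access(level: Access, today: date,
                     last_negotiated: Optional[date] = None,
                     open_after: Optional[date] = None) -> Access:
    """Return the access level that applies today under the stated rules."""
    if level == Access.PASSWORD:
        if last_negotiated is None:
            raise ValueError("Level 2 needs the date of the last negotiation")
        # Assumption: the term lapses exactly five years after negotiation.
        if today >= add_years(last_negotiated, 5):
            return Access.OPEN
    elif level == Access.TIME_LIMIT:
        if open_after is None:
            raise ValueError("Level 3 needs the depositor's specified date")
        # "Until after a specified date": open from the following day.
        if today > open_after:
            return Access.OPEN
    # Levels 4 and 5 stay as they are; permission is granted case by case.
    return level


if __name__ == "__main__":
    print(effective_access(Access.PASSWORD, date(2012, 1, 1),
                           last_negotiated=date(2006, 6, 1)).name)
    print(effective_access(Access.TIME_LIMIT, date(2010, 1, 1),
                           open_after=date(2011, 1, 1)).name)
```

Only Levels 2 and 3 change on their own with the passage of time. For Levels 4 and 5 the sketch returns the level unchanged, because access there depends on a person granting permission and not on a date.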

Page 23: The Rosetta Project   Digital Language Archive

Depositor/Controller Responsibilities

Note: for Levels 2, 3, 4, and 5, the depositor must ensure that the appropriate contact information is up to date. If contact information is not up to date, or documented good faith attempts made by the Rosetta archive or its users to obtain access are not answered, then determinations of permission to access and use the resource revert to the curator of the archive.

Page 24: The Rosetta Project   Digital Language Archive

The Archivist in the Driver’s Seat

• Archiving and serving digital resources is a valuable (and expensive) service

• Some archives also provide digitization services

• For these reasons, archives can be expected to set conditions on what they will archive

• Rosetta’s consent forms are intended to ensure that:

– The majority of our resources are publicly accessible on the Web (all are available for listening in person)

– Archivist is never at the mercy of extreme access restrictions

– All access conditions work toward open access (Level 1)

Page 25: The Rosetta Project   Digital Language Archive

URLs

• Electronic Metastructure for Endangered Language Data (E-MELD) http://www.emeld.org (School of Best Practice, FIELD Tool).

• Endangered Language Query Rooms http://rosettaproject.org:8080/emeldbase/.

• The Ethnologue http://www.ethnologue.com.

• General Ontology for Linguistic Description (GOLD) http://www.linguistics-ontology.org or http://emeld.org/school/workroom/terminology/

• LINGUIST List http://www.linguistlist.org

• National Science Digital Library (NSDL) http://nsdl.org

• ODIN www.csufresno.edu/odin

• Open Language Archives Community (OLAC) http://www.language-archives.org.

• The Rosetta Project, http://www.rosettaproject.org/live. A preview of the new Web site is available at http://preview.rosettaproject.org.