laura welcher - the rosetta project and the language commons

The Rosetta Project: Building a 10,000 Year Library of All Human Language

Upload: longnow

Post on 13-Jun-2015


TRANSCRIPT

Page 1: Laura Welcher - The Rosetta Project and The Language Commons

The Rosetta Project:

Building a 10,000 Year Library of All Human Language

Page 2: Laura Welcher - The Rosetta Project and The Language Commons

A bit of background…

Page 3: Laura Welcher - The Rosetta Project and The Language Commons

The 10,000 Year Clock

Danny Hillis

“I want to build a clock that ticks once a year. The century hand advances once every one hundred years, and the cuckoo comes out on the millennium.”

Page 4: Laura Welcher - The Rosetta Project and The Language Commons

Prototype 1

Page 5: Laura Welcher - The Rosetta Project and The Language Commons

Clock Mountain

Page 6: Laura Welcher - The Rosetta Project and The Language Commons
Page 7: Laura Welcher - The Rosetta Project and The Language Commons

The 10,000 Year Library

Stewart Brand

“The Clock dramatizes the scope of historic time past and to come but offers no content. The Library is all content, especially past content with future significance…The value could lie in providing civilizations with a wisdom line: slow, robust, apparently inefficient.”

- from “Clock/Library” in The Clock of the Long Now

Page 8: Laura Welcher - The Rosetta Project and The Language Commons

Library Projects: A Responsibility Record

Page 9: Laura Welcher - The Rosetta Project and The Language Commons

Library Projects: All Species

Page 10: Laura Welcher - The Rosetta Project and The Language Commons

Library Projects: Time & Bits

…We risk creating a Digital Dark Age – a void in the continuity of cultural record – because the formats and hardware which we entrust with our data are unlikely to outlast even the next ten years, much less our own lives.

- Danny Hillis

Page 11: Laura Welcher - The Rosetta Project and The Language Commons

Strategies That Improve Data Longevity

•  For starters, expand your scope: aim for at least 500 years (can you do better than paper?)

•  Use it or lose it – unused data dies

•  Provide access – promotes use, reuse, LOCKSS

•  Consider saving everything (e.g. Internet Archive)

•  Move it or lose it (“Movage”)

•  Consider atoms over bits (analog)

Page 12: Laura Welcher - The Rosetta Project and The Language Commons

Library Projects: Long Server

Page 13: Laura Welcher - The Rosetta Project and The Language Commons

Library Projects: The Rosetta Project

•  Thousands of years ago we stored information on stone tablets – some of these are still around.

•  Hundreds of years ago we stored information in books – print on acid-free paper can reliably be preserved for 500 years.

•  Now we store information digitally, using hardware, software and encodings that are highly ephemeral.

Page 14: Laura Welcher - The Rosetta Project and The Language Commons

The Rosetta Disk (One Possible Solution)

Page 15: Laura Welcher - The Rosetta Project and The Language Commons

Microscopic Analog Data Storage

Page 16: Laura Welcher - The Rosetta Project and The Language Commons

Microetched Pages

Page 17: Laura Welcher - The Rosetta Project and The Language Commons

Human Eye Readable Side

Page 18: Laura Welcher - The Rosetta Project and The Language Commons

Parallel Content in Multiple languages

•  The Rosetta Stone includes a decree of the divine cult of King Ptolemy V, carved in 196 BC

•  The same text is written in three different forms: Egyptian Hieroglyphs, Demotic (Early Egyptian Script preceding Coptic), and Ancient Greek

•  Working back from the Greek and the somewhat known Demotic, scholars were able to decipher the Hieroglyphs – thereby unlocking records of an entire ancient civilization

Page 19: Laura Welcher - The Rosetta Project and The Language Commons

Rosetta Disk Goal - Parallel content for all languages

•  Vocabulary

•  Sound Structure

•  Word and Sentence Structure

•  Parallel texts

•  Other texts

•  Maps

•  Writing Systems

•  Ethnographic Information

•  Numbering Systems

•  Color Systems

Page 20: Laura Welcher - The Rosetta Project and The Language Commons

Building the Collection: Book Scanning

Page 21: Laura Welcher - The Rosetta Project and The Language Commons

Building the Collection: Swadesh Wordlists

Page 22: Laura Welcher - The Rosetta Project and The Language Commons

Audio Digitization

Page 23: Laura Welcher - The Rosetta Project and The Language Commons

Google Earth Interface

Page 24: Laura Welcher - The Rosetta Project and The Language Commons

“Born Digital” Materials

Endangered Language Documentation Project

Page 25: Laura Welcher - The Rosetta Project and The Language Commons

6 First Edition Disks

•  Brewster Kahle, Internet Archive

•  Charles Butcher, Lazy 8 Foundation – now in the permanent special collection of the University of Colorado Boulder Library

•  William Lidwell, author of Universal Principles of Design

•  Oliver Wilke – Oliver Wilke Stiftung für Sprachen

•  One is held by an anonymous donor, and one is in the Long Now Museum

Page 26: Laura Welcher - The Rosetta Project and The Language Commons

02004 Rosetta European Space Agency Mission

Page 27: Laura Welcher - The Rosetta Project and The Language Commons

Rosetta Disk Museum Edition

In August 02009 we presented the prototype of the Rosetta Disk Museum Edition to Secretary Wayne Clough for the Smithsonian.

Page 28: Laura Welcher - The Rosetta Project and The Language Commons

Endangered Languages

“The coming century will see either the death or doom of 90% of mankind’s languages”

- Michael Krauss

Page 29: Laura Welcher - The Rosetta Project and The Language Commons

Top Ten Languages by Native Speakers (Millions)

[Bar chart; axis from 0 to 900 million native speakers; bars for Mandarin, Spanish, English, Bengali, Hindi, Portuguese, Russian, Japanese, German, and Javanese]

Data: The Ethnologue (02009) available at www.ethnologue.com

Page 30: Laura Welcher - The Rosetta Project and The Language Commons

Language Distribution

Half the world population speaks one of 10 languages (<1% of all languages)

Most everyone else speaks one of 300 languages (4%)

5% of the world speaks one of 6,500 languages (95%)

[Chart; axis labels: 1 Billion, 100 Million, 10 Thousand; axis title: Number of Languages]
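
The percentages in parentheses on this slide read most naturally as each tier's share of the total number of languages. A minimal sketch of that arithmetic follows, assuming the total is simply the sum of the three tier sizes (10 + 300 + 6,500); the slide does not state a total, so that figure is an assumption made for illustration.

```python
# Tier sizes come from the slide; the total is an assumption (their sum).
tiers = {
    "top 10": 10,
    "next 300": 300,
    "long tail": 6_500,
}
total = sum(tiers.values())  # 6,810 under this assumption


def share_of_languages(count: int, total: int) -> float:
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


shares = {name: share_of_languages(n, total) for name, n in tiers.items()}
for name, pct in shares.items():
    print(f"{name}: {pct}% of languages")
```

Under this assumption the ten largest languages are well under 1% of all languages, while the long tail accounts for roughly 95% of them.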

Page 31: Laura Welcher - The Rosetta Project and The Language Commons

Why does it matter?

Page 32: Laura Welcher - The Rosetta Project and The Language Commons

Languages are...

Great Works of Art!

Page 33: Laura Welcher - The Rosetta Project and The Language Commons

Languages are...

Great Libraries!

Page 34: Laura Welcher - The Rosetta Project and The Language Commons

Languages are “How to” guides for Living on Planet Earth

Page 35: Laura Welcher - The Rosetta Project and The Language Commons

Languages Provide a window into our minds

Page 36: Laura Welcher - The Rosetta Project and The Language Commons

Freedom of Language - an inalienable human right

Individually you have:

•  The right to be recognized as a member of a language community

•  The right to use your language in private and in public

•  The right to use your own name

•  The right to interrelate and associate with your native speech community

•  The right to maintain and develop your own culture

Page 37: Laura Welcher - The Rosetta Project and The Language Commons

Freedom of Language - an inalienable human right

Collectively your speech community has:

•  The right for your own language and culture to be taught

•  The right of access to cultural services

•  The right to an equitable presence of your language and culture in the communications media

•  The right to receive attention in your own language from government bodies and in socioeconomic relations

From the Universal Declaration of Linguistic Rights, Barcelona, June 1996

Page 38: Laura Welcher - The Rosetta Project and The Language Commons

Rosetta Project: Long Now, Here & Now

Page 39: Laura Welcher - The Rosetta Project and The Language Commons

Open Digital Collection on All Human Languages

Page 40: Laura Welcher - The Rosetta Project and The Language Commons

Rosetta Special Collection in the Internet Archive

Page 41: Laura Welcher - The Rosetta Project and The Language Commons

Rosetta Language Base – Linguistic Metastructure

•  Freebase: over 10,000 languages and linguistic entities linked by language family relationship

•  All data is linked to other kinds of data in Freebase

•  We have rectified ~1500 Wikipedia pages about human languages to our data set

Page 42: Laura Welcher - The Rosetta Project and The Language Commons

Rosetta Prototype Wiki

Page 43: Laura Welcher - The Rosetta Project and The Language Commons

New Initiative: The Language Commons

Page 44: Laura Welcher - The Rosetta Project and The Language Commons

The Language Commons Working Group

Page 45: Laura Welcher - The Rosetta Project and The Language Commons

Language Commons Goals:

•  To scale the amount of open language data (PD/CCZero to GPL to CCNC-BY to MIT/BSD)

•  To seek the participation of holders of language data including publishers, corporations, and authors (including web authors), funders of research that generates language data, and the institutes, researchers, and projects who are themselves creating and/or curating language data.

•  To build open and available language data resources to further research, development, and global access to knowledge

•  To help preserve and promote endangered languages

Page 46: Laura Welcher - The Rosetta Project and The Language Commons

Language Commons Participants

•  Translate.org, Meedan.net, Miro Project, Rosetta Project / Long Now Foundation, the Kamusi Project, Rosetta Foundation (translation service organization in Ireland), Fostering Language Resources Network (FLaReNet), European Language Resources Association (ELRA), The Berkman Center for Internet and Society

•  Bibliotheca Alexandrina, IBM Watson Language Group, Center for Research in Computational Linguistics, King Abdullah’s Initiative for Arabic Content, International Development Research Center (Canada)

•  Saint Louis University, University of Melbourne, University of Michigan, Vassar, Universitat d’Alacant, University of Edinburgh, University of Pittsburgh, University of Pennsylvania, Eastern Michigan University, Tufts University

Page 47: Laura Welcher - The Rosetta Project and The Language Commons
Page 48: Laura Welcher - The Rosetta Project and The Language Commons

Language Distribution

Half the world population speaks one of 10 languages (<1% of all languages)

Most everyone else speaks one of 300 languages (4%)

5% of the world speaks one of 6,500 languages (95%)

[Chart; axis labels: 1 Billion, 100 Million, 10 Thousand; axis title: Number of Languages]

Page 49: Laura Welcher - The Rosetta Project and The Language Commons

Want to use your language in the digital domain?

1.  Is there a writing system for your language?

a.  Yes. Continue to (2).

b.  No. But you can still talk on your mobile phone, and post YouTube videos of yourself and your friends. Note you will need to type alphanumeric text (or use voice commands) in another more widely used language.

Page 50: Laura Welcher - The Rosetta Project and The Language Commons

Want to use your language in the digital domain?

2.  Is there a unique identifier (ISO 639 code) for your language?

a.  Yes. Continue to (3).

b.  No. Bummer. Go back to (1).

Page 51: Laura Welcher - The Rosetta Project and The Language Commons

Want to use your language in the digital domain?

3.  Is your writing system in Unicode?

a.  Yes. Congratulations! Your script is now supported in the essential architecture of the digital domain.

b.  No. Bummer. Either create one by adapting a supported script, or build a proposal to get your script/unique characters supported in Unicode (contact the Script Encoding Initiative for help on this), then go back to (1).

Page 52: Laura Welcher - The Rosetta Project and The Language Commons

Want to use your language in the digital domain?

4.  Do you have a large corpus of natural texts – written and spoken?

a.  Yes. Congratulations! You must be a speaker of a very economically powerful language. You continue to grow these corpora as you interact online every day (email, internet searches, SMS texts, depending somewhat on which ones you use) – and the services based on them keep getting better for you – natural language search, machine translation, speech recognition, etc.

b.  No. Bummer. Go back to (3). You and billions of others are in the same circumstance. Many give up and simply use a mainstream language in the digital domain.
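
The four questions above form a simple ordered checklist, which can be sketched in code. The version below is only an illustration: the `LanguageStatus` record, its field names, and both helper functions are hypothetical and do not come from the talk. The second helper is a rough stand-in for question 3, using Python's standard `unicodedata` module to test whether every character in a sample is an assigned, non-private-use code point.

```python
import unicodedata
from dataclasses import dataclass


@dataclass
class LanguageStatus:
    """Hypothetical record of the four yes/no answers from the checklist."""
    has_writing_system: bool
    has_iso_639_code: bool
    script_in_unicode: bool
    has_large_corpus: bool


def first_blocker(status: LanguageStatus) -> int:
    """Return the number of the first question answered "No", or 0 if none."""
    answers = [
        status.has_writing_system,
        status.has_iso_639_code,
        status.script_in_unicode,
        status.has_large_corpus,
    ]
    for step, answer in enumerate(answers, start=1):
        if not answer:
            return step
    return 0


def all_characters_encoded(sample: str) -> bool:
    """Rough check for question 3: no unassigned (Cn) or private-use (Co) characters."""
    return all(unicodedata.category(ch) not in ("Cn", "Co") for ch in sample)


# A language with a script and an ISO 639 code whose script is not yet in Unicode.
status = LanguageStatus(True, True, False, False)
print(first_blocker(status))
```

The character check depends on the Unicode version bundled with the Python interpreter, so a recently encoded script may still be reported as unassigned on an older installation.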

Page 53: Laura Welcher - The Rosetta Project and The Language Commons

The Growing Linguistic Digital Divide

“There are hundreds of seriously under-documented languages that remain very much alive with hundreds of thousands to tens of millions of speakers each. The speakers of these languages number collectively in the billions, and as linguistic technology grows in importance, they find themselves on the far side of an increasingly large digital divide.”

- NSF Proposal “Seeding The Language Commons”

Page 54: Laura Welcher - The Rosetta Project and The Language Commons

Enabling Top 300 Languages as well as The Long Tail

•  We have substantial machine readable corpora for only about 20-30 of the world’s 6,900 languages. [Bird and Abney, 2010]

•  There is a commercial motivation in enabling the 300 most widely spoken languages – if digital services and devices work for this group, that is 95% of humanity.

•  There is no comparable commercial motivation for the other 6,500 or so (the long tail), but these languages can be documented and enabled by non-profit/academic/philanthropic efforts.

•  The Long Tail can benefit from development of the 300 (and vice versa, if we are building better algorithms that can work with less data).

Page 55: Laura Welcher - The Rosetta Project and The Language Commons

What we want to build…

Page 56: Laura Welcher - The Rosetta Project and The Language Commons

Proposal: Build an Encyclopedia of Human Language

An aggregation and discovery portal for information and resources on all 6,900 human languages.

For use by: • language speakers • educators • researchers • general public

The Language Commons

Page 57: Laura Welcher - The Rosetta Project and The Language Commons

Why an Encyclopedia of human language?

•  To create the go-to place for information and resources on any and all human languages – for education, for research, for preservation

•  To provide resources on lower density languages in case of crisis or emergency

•  To take action in the face of impending language loss

•  To act as testament for the genius of human cultural and linguistic diversity, and stand for freedom of language as a basic human right

•  To provide a forward path for the use of the world’s languages in the digital domain (by building a massive repository of open linguistic corpora)

Page 58: Laura Welcher - The Rosetta Project and The Language Commons

Basic Design Principles:

•  Comprehensive – One page (minimum) for every human language

•  Extensible – includes language families, subgroups, languages, dialects, maybe even unique/noteworthy idiolects

•  Flexible – multiple navigation options, suited for a variety of users and user views: by language taxonomy, by alternate taxonomy, or by other grouping (such as linguistic area or geography), with robust search by language name, alternate names, and ISO 639 code

•  Open – open content, open contribution – the world should build it

•  Visible – the site should be easily discoverable and references to it ubiquitous
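
One way to picture the "one page per language" and "search by language name, alternate names, ISO 639 code" principles is a small lookup index. The sketch below is an assumption-laden illustration, not the project's design: the `LanguagePage` schema, the `build_index` and `search` helpers, and the two sample entries are all hypothetical and chosen only to show the idea.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LanguagePage:
    """One encyclopedia page (hypothetical schema, not from the talk)."""
    name: str
    iso_639_3: str
    alternate_names: List[str] = field(default_factory=list)
    parent: Optional[str] = None  # enclosing family or subgroup, if any


def build_index(pages: List[LanguagePage]) -> Dict[str, List[LanguagePage]]:
    """Map every name, alternate name, and ISO 639 code to its page(s)."""
    index: Dict[str, List[LanguagePage]] = {}
    for page in pages:
        for key in (page.name, page.iso_639_3, *page.alternate_names):
            index.setdefault(key.casefold(), []).append(page)
    return index


def search(index: Dict[str, List[LanguagePage]], query: str) -> List[LanguagePage]:
    """Case-insensitive exact-match lookup; returns an empty list on a miss."""
    return index.get(query.strip().casefold(), [])


# Illustrative entries only.
pages = [
    LanguagePage("German", "deu", ["Deutsch"], parent="West Germanic"),
    LanguagePage("Javanese", "jav", ["Basa Jawa"], parent="Malayo-Polynesian"),
]
index = build_index(pages)
print([p.name for p in search(index, "DEU")])
```

Each key maps to a list of pages because alternate names can collide across languages; a fuller version would also need fuzzy matching and navigation by taxonomy, which this sketch leaves out.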

Page 59: Laura Welcher - The Rosetta Project and The Language Commons

Model: WikiLanguage

Page 60: Laura Welcher - The Rosetta Project and The Language Commons

Model: The Encyclopedia of Life

Page 61: Laura Welcher - The Rosetta Project and The Language Commons

Where will the Data Come From?

Page 62: Laura Welcher - The Rosetta Project and The Language Commons

Where will the Data Come From?

Page 63: Laura Welcher - The Rosetta Project and The Language Commons

Where will the Data Come From?

Global Lives Project World Premiere, February 02010, Yerba Buena Center for the Arts, San Francisco, California

Page 64: Laura Welcher - The Rosetta Project and The Language Commons

Where will the Data Come From?

Photo by Erik Hersman

You! Everyone has a language and can help document it.

Page 65: Laura Welcher - The Rosetta Project and The Language Commons

Language Commons: What we’ve done this year

•  Established a special collection at the Internet Archive, built an uploader, and have accessioned several major corpora from working group participants

•  Declaration of purpose, Identity

•  Written grants, most notably to NSF for “Seeding the Language Commons: Software for Large Scale Transcription and Translation of Oral Literature”

•  Participants have made presentations about The Language Commons all over the world (Long Now presented at Wikimania in Gdansk last summer)

Page 66: Laura Welcher - The Rosetta Project and The Language Commons

How Long Now is Helping

•  Long Now has offered to be the umbrella organization for The Language Commons, as a project closely related to the aims and goals of The Rosetta Project.

•  We are looking towards integrating the two digital collections – so that Rosetta’s parallel collection can seed the Language Commons.

•  The Language Commons collection would continue to serve as a source for future Rosetta Disks and other Long Now data preservation projects.

Page 67: Laura Welcher - The Rosetta Project and The Language Commons

Language Commons: How YOU can help!

•  Please tell other people about The Language Commons – Tweet, Facebook, write blog posts or articles about the need for an open Language Commons.

•  We need serious funding to build the Encyclopedia of Human Language – and we are working on this! But if you have any leads or suggestions please let us know.

•  Consider a generous contribution of open language data.

Page 68: Laura Welcher - The Rosetta Project and The Language Commons

Thank you!
