Laura Welcher - The Rosetta Project and the Language Commons
TRANSCRIPT
The Rosetta Project:
Building a 10,000 Year Library of All Human Language
A bit of background…
The 10,000 Year Clock
Danny Hillis
“I want to build a clock that ticks once a year. The century hand advances once every one hundred years, and the cuckoo comes out on the millennium.”
Prototype 1
Clock Mountain
The 10,000 Year Library
Stewart Brand
“The Clock dramatizes the scope of historic time past and to come but offers no content. The Library is all content, especially past content with future significance…The value could lie in providing civilizations with a wisdom line: slow, robust, apparently inefficient.”
-from “Clock/Library” in The Clock of the Long Now
Library Projects: A Responsibility Record
Library Projects: All Species
…We risk creating a Digital Dark Age – a void in the continuity of cultural record – because the formats and hardware which we entrust with our data are unlikely to outlast even the next ten years, much less our own lives.
- Danny Hillis
Library Projects: Time & Bits
Strategies That Improve Data Longevity
• For starters, expand your scope: aim for at least 500 years (can you do better than paper?)
• Use it or lose it – unused data dies
• Provide access – promotes use, reuse, LOCKSS
• Consider saving everything (e.g. Internet Archive)
• Move it or lose it (“Movage”)
• Consider atoms over bits (analog)
Library Projects: Long Server
Library Projects: The Rosetta Project
• Thousands of years ago we stored information on stone tablets – some of these are still around.
• Hundreds of years ago we stored information in books – print on acid-free paper can reliably be preserved for 500 years.
• Now we store information digitally, using hardware, software and encodings that are highly ephemeral.
The Rosetta Disk (One Possible Solution)
Microscopic Analog Data Storage
Microetched Pages
Human Eye Readable Side
Parallel Content in Multiple languages
• The Rosetta Stone carries a decree establishing the divine cult of King Ptolemy V, carved in 196 BC
• The same text is written in three different forms: Egyptian Hieroglyphs, Demotic (Early Egyptian Script preceding Coptic), and Ancient Greek
• Working back from the Greek and the somewhat known Demotic, scholars were able to decipher the Hieroglyphs – thereby unlocking the records of an entire ancient civilization
Rosetta Disk Goal - Parallel content for all languages
• Vocabulary
• Sound Structure
• Word and Sentence Structure
• Parallel texts
• Other texts
• Maps
• Writing Systems
• Ethnographic Information
• Numbering Systems
• Color Systems
Building the Collection: Book Scanning
Building the Collection: Swadesh Wordlists
Audio Digitization
Google Earth Interface
“Born Digital” Materials
Endangered Language Documentation Project
6 First Edition Disks
• Brewster Kahle, Internet Archive
• Charles Butcher, Lazy 8 Foundation – now in the permanent special collection of the University of Colorado Boulder Library
• William Lidwell, author of Universal Principles of Design
• Oliver Wilke – Oliver Wilke Stiftung für Sprachen
• One is held by an anonymous donor, and one is in the Long Now Museum
02004 Rosetta European Space Agency Mission
Rosetta Disk Museum Edition
In August 02009 we presented the prototype of the Rosetta Disk Museum Edition to Secretary Wayne Clough for the Smithsonian.
Endangered Languages
“The coming century will see either the death or doom of 90% of mankind’s languages”
- Michael Krauss
Top Ten Languages by Native Speakers (Millions)
[Bar chart; axis from 0 to 900 million. Languages shown: Javanese, German, Japanese, Russian, Portuguese, Hindi, Bengali, English, Spanish, Mandarin]
Data: The Ethnologue (02009), available at www.ethnologue.com
Language Distribution
Half the world population speaks one of 10 languages (<1% of all languages)
Almost everyone else speaks one of 300 languages (4%)
5% of the world speaks one of 6,500 languages (95%)
[Chart; axis labels: 1 Billion, 100 Million, 10 Thousand; Number of Languages]
Why does it matter?
Languages are...
Great Works of Art
Languages are...
Great Libraries
Languages are “How to” guides for Living on Planet Earth
Languages Provide a window into our minds
Freedom of Language - an inalienable human right
Individually you have:
• The right to be recognized as a member of a language community
• The right to use your language in private and in public
• The right to use your own name
• The right to interrelate and associate with your native speech community
• The right to maintain and develop your own culture
Freedom of Language - an inalienable human right
Collectively your speech community has:
• The right for your own language and culture to be taught
• The right of access to cultural services
• The right to an equitable presence of your language and culture in the communications media
• The right to receive attention in your own language from government bodies and in socioeconomic relations
From the Universal Declaration on Linguistic Rights, Barcelona, June 1996
Rosetta Project: Long Now, Here & Now
Open Digital Collection on All Human Languages
Rosetta Special Collection in the Internet Archive
Rosetta Language Base – Linguistic Metastructure
• Freebase: over 10,000 languages and linguistic entities linked by language family relationship
• All data is linked to other kinds of data in Freebase
• We have reconciled ~1500 Wikipedia pages about human languages with our data set
Rosetta Prototype Wiki
New Initiative: The Language Commons
The Language Commons Working Group
Language Commons Goals:
• To scale the amount of open language data (PD/CCZero to GPL to CCNC-BY to MIT/BSD)
• To seek the participation of holders of language data including publishers, corporations, and authors (including web authors), funders of research that generates language data, and the institutes, researchers, and projects who are themselves creating and/or curating language data.
• To build open and available language data resources to further research, development, and global access to knowledge
• To help preserve and promote endangered languages
Language Commons Participants
• Translate.org, Meedan.net, Miro Project, Rosetta Project / Long Now Foundation, the Kamusi Project, Rosetta Foundation (translation service organization in Ireland), Fostering Language Resources Network (FLaReNet), European Language Resources Association (ELRA), The Berkman Center for Internet and Society
• Bibliotheca Alexandrina, IBM Watson Language Group, Center for Research in Computational Linguistics, King Abdullah’s Initiative for Arabic Content, International Development Research Center (Canada)
• Saint Louis University, University of Melbourne, University of Michigan, Vassar, Universitat d’Alacant, University of Edinburgh, University of Pittsburgh, University of Pennsylvania, Eastern Michigan University, Tufts University
Want to use your language in the digital domain?
1. Is there a writing system for your language?
a. Yes. Continue to (2).
b. No. But you can still talk on your mobile phone, and post YouTube videos of yourself and your friends. Note that you will need to type alphanumeric text (or use voice commands) in another, more widely used language.
2. Is there a unique identifier (ISO 639 code) for your language?
a. Yes. Continue to (3).
b. No. Bummer. Go back to (1).
3. Is your writing system in Unicode?
a. Yes. Congratulations! Your script is now supported in the essential architecture of the digital domain.
b. No. Bummer. Either create one by adapting a supported script, build a proposal to get your script or unique characters supported in Unicode (contact the Script Encoding Initiative for help with this), or go back to (1).
4. Do you have a large corpus of natural texts – written and spoken?
a. Yes. Congratulations! You must be a speaker of a very economically powerful language. You continue to grow these corpora as you interact online every day (email, internet searches, SMS texts, depending somewhat on which ones you use) – and the services based on them keep getting better for you – natural language search, machine translation, speech recognition, etc.
b. No. Bummer. Go back to (3). You and billions of others are in the same circumstance. Many give up and simply use a mainstream language in the digital domain.
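The four questions above can be read as a simple checklist. The following is a minimal Python sketch of that reading, not part of the original talk: the `LanguageStatus` record, its field names, and the `first_blocker` function are all hypothetical names chosen for illustration. It reports the first question a language community would have to answer "No" to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LanguageStatus:
    """Hypothetical record of the four checklist answers for one language."""
    has_writing_system: bool   # question 1
    has_iso_639_code: bool     # question 2
    script_in_unicode: bool    # question 3
    has_large_corpus: bool     # question 4


def first_blocker(status: LanguageStatus) -> Optional[int]:
    """Return the number of the first question answered 'No', or None if all are 'Yes'."""
    answers = [
        status.has_writing_system,
        status.has_iso_639_code,
        status.script_in_unicode,
        status.has_large_corpus,
    ]
    for number, answer in enumerate(answers, start=1):
        if not answer:
            return number
    return None


# Illustrative values only: a language with a Unicode script but no large corpus.
example = LanguageStatus(True, True, True, False)
print(first_blocker(example))
```

Under this reading, most of the world's languages would stop at question 4, which is the gap the talk goes on to describe.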
The Growing Linguistic Digital Divide
“There are hundreds of seriously under-documented languages that remain very much alive with hundreds of thousands to tens of millions of speakers each. The speakers of these languages number collectively in the billions, and as linguistic technology grows in importance, they find themselves on the far side of an increasingly large digital divide.”
- NSF Proposal “Seeding The Language Commons”
Enabling Top 300 Languages as well as The Long Tail
• We have substantial machine readable corpora for only about 20-30 of the world’s 6,900 languages. [Bird and Abney, 2010]
• There is a commercial motivation in enabling the 300 most widely spoken languages – if digital services and devices work for this group, that is 95% of humanity.
• The other 6,500 or so – the long tail – has no commercial motivation, but these languages can be documented and enabled by non-profit/academic/philanthropic efforts.
• The Long Tail can benefit from development of the 300 (and vice versa – if we are building better algorithms that can work with less data).
What we want to build…
Proposal: Build an Encyclopedia of Human Language
An aggregation and discovery portal for information and resources on all 6,900 human languages.
For use by:
• language speakers
• educators
• researchers
• general public
The Language Commons
Why an Encyclopedia of human language?
• To create the go-to place for information and resources on any and all human languages – for education, for research, for preservation
• To provide resources on lower density languages in case of crisis or emergency
• To take action in the face of impending language loss
• To act as a testament to the genius of human cultural and linguistic diversity, and stand for freedom of language as a basic human right
• To provide a forward path for the use of the world’s languages in the digital domain (by building a massive repository of open linguistic corpora)
Basic Design Principles:
• Comprehensive – One page (minimum) for every human language
• Extensible – includes language families, subgroups, languages, dialects, maybe even unique/noteworthy idiolects
• Flexible – multiple navigation options suited to a variety of users and user views: by language taxonomy, by alternate taxonomy, by other groupings such as linguistic area or geography, with robust search by language name, alternate names, and ISO 639 code
• Open – open content, open contribution – the world should build it
• Visible – the site should be easily discoverable and references to it ubiquitous
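As an illustration of the "robust search by language name, alternate names, ISO 639 code" principle above, here is a minimal Python sketch. It is not the project's design: `LanguagePage`, `build_index`, and `search` are hypothetical names, and the two sample entries are illustrative data only. Every name and code of a page is folded to a case-insensitive key that points back to that page.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LanguagePage:
    """Hypothetical encyclopedia page for one language."""
    iso_code: str
    name: str
    alternate_names: List[str] = field(default_factory=list)


def build_index(pages: List[LanguagePage]) -> Dict[str, List[LanguagePage]]:
    """Map every case-folded name, alternate name, and ISO code to its pages."""
    index: Dict[str, List[LanguagePage]] = {}
    for page in pages:
        # A set avoids indexing the same page twice under one key.
        keys = {k.casefold() for k in (page.iso_code, page.name, *page.alternate_names)}
        for key in keys:
            index.setdefault(key, []).append(page)
    return index


def search(index: Dict[str, List[LanguagePage]], query: str) -> List[LanguagePage]:
    """Return the pages whose name, alternate name, or code matches the query exactly."""
    return index.get(query.strip().casefold(), [])


# Illustrative sample data, using ISO 639-3 codes.
pages = [
    LanguagePage("deu", "German", ["Deutsch"]),
    LanguagePage("jav", "Javanese"),
]
index = build_index(pages)
print([p.name for p in search(index, "Deutsch")])
```

A list of pages per key is used because different languages can share a name; a real portal would also need fuzzy matching, which this sketch leaves out.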
Model: WikiLanguage
Model: The Encyclopedia of Life
Where will the Data Come From?
Global Lives Project World Premiere, February 02010, Yerba Buena Center for the Arts, San Francisco, California
Where will the Data Come From?
Photo by Erik Hersman
You! Everyone has a language and can help document it.
Language Commons: What we’ve done this year
• Established a special collection at the Internet Archive, built an uploader, and have accessioned several major corpora from working group participants
• Declaration of purpose, Identity
• Written grants, most notably to NSF for “Seeding the Language Commons: Software for Large Scale Transcription and Translation of Oral Literature”
• Participants have made presentations about The Language Commons all over the world (Long Now presented at Wikimania in Gdansk last summer)
How Long Now is Helping
• Long Now has offered to be the umbrella organization for The Language Commons, as a project closely related to the aims and goals of The Rosetta Project.
• We are looking towards integrating the two digital collections – so that Rosetta’s parallel collection can seed the Language Commons.
• The Language Commons collection would continue to serve as a source for future Rosetta Disks and other Long Now data preservation projects.
Language Commons: How YOU can help
• Please tell other people about The Language Commons – Tweet, Facebook, write blog posts or articles about the need for an open Language Commons.
• We need serious funding to build the Encyclopedia of Human Language – and we are working on this! But if you have any leads or suggestions please let us know.
• Consider a generous contribution of open language data.
Thank you!