wikibhasha by dr a kumaran

TRANSCRIPT

Page 1: Wikibhasha by  Dr A Kumaran

WikiBhasha

Dr A Kumaran, Microsoft Research

Feb 2011

From Digital Inclusion to Digital Democracy…

Page 2: Wikibhasha by  Dr A Kumaran

“Egypt will not become liberal democracy overnight; it has a long and difficult journey… Becoming an open and empowered democracy needs a slow process of popular self-education, and it will not come easily or naturally.”

Hindustan Times Editorial, Feb 14, 2011

Page 3: Wikibhasha by  Dr A Kumaran

Agenda

• Language Technology Research (15 Min)

• “Digital Inclusion vs. Digital Democracy” (5 Min)

• WikiBhasha (5 Min)

Page 4: Wikibhasha by  Dr A Kumaran

Language Technology Research

From Classical to Statistical…

Page 5: Wikibhasha by  Dr A Kumaran

Technology & Language...

• Library of Alexandria: 3rd Cen. BC, ~O(KB-MB). Scripts/Media for preservation/…

• Print Era: Post-Gutenberg, 15th Cen. AD, ~O(GB)? Printing tech./coordination & distribution/…

• Electronics & Computing Era: Late 20th Cen., ~O(TB-PB). Electronic standards/Multimedia/Storage tech./…

Page 6: Wikibhasha by  Dr A Kumaran

Internet = Information Deluge

• Information produced per year: 5 Exabytes

– Equivalent to about 500,000 Libraries of Congress

– ~800 MB per person, or a 30-foot-high stack of books
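As a quick consistency check on the two figures above, dividing the yearly total by the per-person share gives the population the slide implicitly assumes. This is only a sketch, and it assumes decimal units (1 EB = 10^18 bytes, 1 MB = 10^6 bytes), which the slide does not state.

```python
# Assumption: decimal units; the slide does not say which convention it uses.
EXABYTE = 10**18
MEGABYTE = 10**6

total_bytes_per_year = 5 * EXABYTE   # "5 Exabytes" per year
bytes_per_person = 800 * MEGABYTE    # "~800 MB per person"

# Population implied by the two figures taken together.
implied_population = total_bytes_per_year / bytes_per_person
print(f"{implied_population:.2e} people")
```

The result is on the order of six billion people, so the two numbers on the slide agree with each other.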

Page 7: Wikibhasha by  Dr A Kumaran

Need of the hour: Language Technology Research!

• Language Technology is primarily concerned with processing Natural Language data

– Search
– Information Extraction
– Machine Translation
– Language Understanding
– …

• Needs “Computational Linguistics” Research!

Page 8: Wikibhasha by  Dr A Kumaran

Linguistics & Computational Linguistics

• Linguistics is the scientific study of languages

• Computational Linguistics studies Computational Models for languages

– Lexicography: Letter-level
– Morphology & Phonology: Word-level
– Syntax: Sentence-level
– Semantics: …
– Pragmatics: …

Page 9: Wikibhasha by  Dr A Kumaran

Classical vs Statistical Approaches

• Started as classical & analytical study (~4-5 decades)
– Based on deep Linguistic Knowledge
– Still not perfect after half-a-century of research!

• Evolving into a Statistical discipline (last ~1 decade)
– Collect large [appropriate] linguistic corpora
– Study “patterns”

Page 10: Wikibhasha by  Dr A Kumaran

Ex #1: Which Language? (In general, Language Identification)

• Is a document in English or Finnish or Tamil?

[Chart: distribution of “Length of words” for each language; axis from 0 to 25]

Near-perfect identification!
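The word-length idea can be sketched in a few lines. This is a minimal illustration, not the method behind the slide's chart: the two toy training sentences, the function names, and the choice of L1 distance between histograms are all assumptions made here, and a real system would build its profiles from large corpora.

```python
from collections import Counter

def length_profile(text):
    """Return the normalized histogram of word lengths in `text`."""
    lengths = [len(word) for word in text.split()]
    counts = Counter(lengths)
    return {length: n / len(lengths) for length, n in counts.items()}

def l1_distance(p, q):
    """Sum of absolute differences between two histograms."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

# Toy training samples (hypothetical; far too small for real use).
SAMPLES = {
    "English": "the cat sat on the mat and the dog ran to the park in the sun",
    "Finnish": "kissa istui matolla ja koira juoksi puistoon auringonpaisteessa",
}
PROFILES = {lang: length_profile(text) for lang, text in SAMPLES.items()}

def identify(text):
    """Pick the language whose word-length profile is closest to the text's."""
    profile = length_profile(text)
    return min(PROFILES, key=lambda lang: l1_distance(profile, PROFILES[lang]))

print(identify("a bird flew over the old red barn"))
print(identify("lintu lensi vanhan punaisen ladon yli"))
```

Even this crude feature separates the two samples, because the Finnish words are much longer on average than the English ones.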

Page 11: Wikibhasha by  Dr A Kumaran

Ex #2: Which is Right? (In general, Grammar Checking & Modeling)

• Are these correct sentences?
– S1: “Yesterday evening, I will have tea”
– S2: “I enjoy my tea with biscuits”
– S3: “I enjoy my tea with motor oil”

• “Normal use of words”
– P(S1): 0.01
– P(S2): 0.75
– P(S3): 0.05
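One common way to score “normal use of words” is a bigram language model. The sketch below is an assumption about how such scores could be produced, not the model behind the slide's numbers: the four-sentence corpus is invented for illustration, and add-one smoothing is used so that unseen word pairs get a small non-zero probability.

```python
from collections import Counter

# Hypothetical toy corpus; a real model is estimated from millions of sentences.
CORPUS = [
    "i enjoy my tea with biscuits",
    "i enjoy my tea with milk",
    "i enjoy my coffee with biscuits",
    "the car needs motor oil",
]

def tokens(sentence):
    """Lowercase, split on whitespace, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

context_counts = Counter()  # how often each word appears as a left context
bigram_counts = Counter()   # how often each (previous, current) pair appears
for sentence in CORPUS:
    toks = tokens(sentence)
    context_counts.update(toks[:-1])
    bigram_counts.update(zip(toks, toks[1:]))
vocab = {w for sentence in CORPUS for w in tokens(sentence)}

def sentence_probability(sentence):
    """Product of add-one-smoothed bigram probabilities over the sentence."""
    prob = 1.0
    toks = tokens(sentence)
    for prev, word in zip(toks, toks[1:]):
        prob *= (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))
    return prob

p_biscuits = sentence_probability("I enjoy my tea with biscuits")
p_motor_oil = sentence_probability("I enjoy my tea with motor oil")
print(p_biscuits > p_motor_oil)
```

The absolute values differ from the slide's illustrative probabilities, but the ordering is the point: the sentence built from familiar word pairs scores higher than the one containing the unseen pair “with motor”.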

Page 12: Wikibhasha by  Dr A Kumaran

Ex #3: What are you reading? (In general, Document Classification)

• Is the following document from a “Novel” or “News”?

• “Feature Vectors” tunable for specific purpose

[Chart: frequencies of a “Specific set of words” (CAN, COULD, MAY, MIGHT, MUST, WILL)]
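A feature vector over a specific set of words can be sketched as below. The modal verbs come from the slide; everything else is an assumption made for illustration: the two one-sentence training texts, the function names, and the use of cosine similarity to pick the closest genre.

```python
import math
import re

MODALS = ["can", "could", "may", "might", "must", "will"]

def features(text):
    """Count how often each modal verb occurs in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(modal) for modal in MODALS]

def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical one-sentence "training sets"; real profiles need whole corpora.
TRAINING = {
    "News": "the minister will visit the city and the bank may cut rates "
            "and the court will rule today",
    "Novel": "she thought she could stay and he might return "
             "if only she could forget",
}
GENRE_VECTORS = {genre: features(text) for genre, text in TRAINING.items()}

def classify(text):
    """Return the genre whose modal-verb vector is most similar to the text's."""
    vector = features(text)
    return max(GENRE_VECTORS, key=lambda genre: cosine(vector, GENRE_VECTORS[genre]))

print(classify("the government will announce the budget and prices may rise"))
print(classify("he could not sleep and wondered what might happen"))
```

Swapping in a different word list retunes the same machinery for a different classification task, which is what “tunable for specific purpose” suggests.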

Page 13: Wikibhasha by  Dr A Kumaran

Ex #4: Search

d1: Human machine interface for ABC computer applications
d2: A survey of user opinion of computer system response time
d3: Human-Machine interface system: System design
d4: System and human system engineering testing of EPS
d5: Relation of user perceived response time to error measurement
d6: The generation of random, binary, ordered trees
d7: The intersection graph of paths in trees
d8: Graph minors IV: Widths of trees and well-quasi ordering
d9: Graph minors: A survey

[Figure: documents d1–d6 and the Query Term plotted as vectors over the terms System, graph, and Human, compared by Cosine Similarity]
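Ranking by cosine similarity can be sketched with plain term-count vectors over the nine titles. The query “graph minors” is a hypothetical choice (the slide does not say which query term it used), and raw counts without stop-word removal or weighting are a simplification.

```python
import math
import re
from collections import Counter

DOCS = {
    "d1": "Human machine interface for ABC computer applications",
    "d2": "A survey of user opinion of computer system response time",
    "d3": "Human-Machine interface system: System design",
    "d4": "System and human system engineering testing of EPS",
    "d5": "Relation of user perceived response time to error measurement",
    "d6": "The generation of random, binary, ordered trees",
    "d7": "The intersection graph of paths in trees",
    "d8": "Graph minors IV: Widths of trees and well-quasi ordering",
    "d9": "Graph minors: A survey",
}

def term_vector(text):
    """Bag-of-words vector: lowercase alphabetic tokens with their counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(count * v[term] for term, count in u.items())
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def search(query):
    """Return (doc_id, score) pairs sorted from most to least similar."""
    q = term_vector(query)
    scores = {doc_id: cosine(q, term_vector(text)) for doc_id, text in DOCS.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = search("graph minors")  # hypothetical query
for doc_id, score in ranking[:3]:
    print(doc_id, round(score, 3))
```

Shorter documents that share both query terms rank first, because cosine similarity normalizes for document length.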

Page 14: Wikibhasha by  Dr A Kumaran

Ex #5: Statistical MT

Parallel corpora:
– President visits Chennai / ஜனாதிபதி சென்னை செல்கிறார்
– President inaugurates Tamil Conference / ஜனாதிபதி தமிழ்மாநாட்டை துவக்குகிறார்

Statistical Models (learned word correspondences):
– President – ஜனாதிபதி
– visits – செல்கிறார்
– Chennai – சென்னை
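One standard way to learn word correspondences from sentence pairs is IBM Model 1, trained with expectation-maximization. The slide does not name the model it has in mind, so treat this as an assumed stand-in; the two sentence pairs are the ones from the slide, and the function names are invented here.

```python
# Sentence pairs from the slide: (English, Tamil).
PARALLEL = [
    ("President visits Chennai", "ஜனாதிபதி சென்னை செல்கிறார்"),
    ("President inaugurates Tamil Conference", "ஜனாதிபதி தமிழ்மாநாட்டை துவக்குகிறார்"),
]
pairs = [(en.split(), ta.split()) for en, ta in PARALLEL]

def train_ibm_model1(pairs, iterations=10):
    """Estimate t[e][f], the probability of target word f given source word e."""
    target_vocab = {f for _, fs in pairs for f in fs}
    uniform = 1.0 / len(target_vocab)
    # Start with a uniform table over every co-occurring word pair.
    t = {e: {f: uniform for f in fs} for es, fs in pairs for e in es}
    for es, fs in pairs:
        for e in es:
            for f in fs:
                t[e].setdefault(f, uniform)
    for _ in range(iterations):
        count = {e: {f: 0.0 for f in t[e]} for e in t}
        total = {e: 0.0 for e in t}
        # E-step: share each target word among the source words of its sentence.
        for es, fs in pairs:
            for f in fs:
                norm = sum(t[e][f] for e in es)
                for e in es:
                    fraction = t[e][f] / norm
                    count[e][f] += fraction
                    total[e] += fraction
        # M-step: renormalize the expected counts into probabilities.
        t = {e: {f: c / total[e] for f, c in count[e].items()} for e in count}
    return t

def best_translation(t, source_word):
    """Most probable target word for a source word under the learned table."""
    return max(t[source_word], key=t[source_word].get)

table = train_ibm_model1(pairs)
print(best_translation(table, "President"))
```

With only two sentence pairs, the model can pin down “President” because it is the only word that recurs. “visits” and “Chennai” always appear together, so the data cannot tell them apart, which is a small-scale view of why large parallel corpora matter.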

Page 15: Wikibhasha by  Dr A Kumaran

For most Language Technologies…

• Statistical approaches EXIST, and are proven to be very successful!

Data are Critical!

• Theorem: Data drives Research & Technology!

Page 16: Wikibhasha by  Dr A Kumaran

Where are the data?

• Axiom: Web = Language data
– Read “Wikipedia = Language Data”

[Chart: Wikipedia Content by Language. Languages shown: English, French, Japanese, Dutch, Spanish, Russian, Finnish, Esperanto, Slovak, Romanian, Ukrainian, Danish, Hebrew, Slovenian, Serbian, Korean, Arabic, Croatian, Volapük, Greek, Newar, Persian, Vietnamese, Basque, Hindi]

• Corollary: “Technology will be skewed, similarly”

• Empirically, true!

Page 17: Wikibhasha by  Dr A Kumaran

Digital Inclusion to Digital Democracy…

• Technology has developed in any/all languages with LARGE digital presence

• It is in every community’s interest to create and preserve its asset: Its Language

Page 18: Wikibhasha by  Dr A Kumaran

WikiBhasha

Research Project on Crowd-sourcing to explore collaborative data creation for Computational Linguistic research

(first focus: parallel data)

Page 19: Wikibhasha by  Dr A Kumaran

Content Creation by Infusion…

[Diagram: WikiBABEL on Wikipedia carries an Article to the Target Wikipedia, supported by a Machine Translation System, a Collaborative Translation Cache, and Linguistic Resources]

• Rough content using Machine Translation

• Appropriate community correction to create value…
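The two-step flow (machine draft, then community correction) can be sketched as below. Every name in it is hypothetical: `rough_translate` is a word-by-word glossary lookup standing in for a real MT system, `TranslationCache` is an invented stand-in for the collaborative translation cache, and none of this is taken from the WikiBhasha code.

```python
from dataclasses import dataclass, field

@dataclass
class TranslationCache:
    """Hypothetical store of (source, corrected target) sentence pairs."""
    pairs: list = field(default_factory=list)

    def add(self, source, target):
        self.pairs.append((source, target))

def rough_translate(sentence, glossary):
    """Stand-in for an MT system: word-by-word glossary lookup."""
    return " ".join(glossary.get(word.lower(), word) for word in sentence.split())

def infuse(sentence, glossary, correct, cache):
    """Draft with MT, let a contributor correct it, and keep the pair as data."""
    draft = rough_translate(sentence, glossary)
    final = correct(draft)
    cache.add(sentence, final)
    return draft, final

GLOSSARY = {"president": "ஜனாதிபதி", "visits": "செல்கிறார்", "chennai": "சென்னை"}
cache = TranslationCache()
source = "President visits Chennai"

# The "contributor" here fixes word order: the verb comes last in Tamil.
draft, final = infuse(source, GLOSSARY, lambda _draft: "ஜனாதிபதி சென்னை செல்கிறார்", cache)
print(draft)
print(final)
```

Each corrected sentence both improves the target article and adds one more pair to the parallel data that statistical models need.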

Page 20: Wikibhasha by  Dr A Kumaran

WikiBhasha V1.0

Published at the WikiSym 2008 conference; adopted for some products at Microsoft

• Holds a set of Wikipedia articles
• In-progress data, hosted locally
• Per-sentence edits only

• Little traction with Wikipedians

Page 21: Wikibhasha by  Dr A Kumaran

WikiBhasha V2.0: Design Objectives

• #1: Focus users on their purpose (say, Wikipedia)
– Content Creation, and not Translation

• #2: In-site Solution
– WikiBABEL to stay on Wikipedia for the session
– Submit any/all contributions

• #3: Generic components, but specifically purposed
– Vendor Neutrality
– Componentized Architecture
– …

Page 22: Wikibhasha by  Dr A Kumaran

WikiBhasha: User View

[Diagram: the User works through the WikiBABEL UX of WikiBhasha 2.0, which connects to Wikipedia through its APIs and to Cloud Services (Community, CTF, Dictionary)]

• Designed WikiBhasha as a thin edit layer
– Stays on Wikipedia
– User contribution submitted to Wikipedia

Page 23: Wikibhasha by  Dr A Kumaran

WikiBhasha: Developer View

[Architecture diagram; components listed by layer]

WikiBhasha UI/UX/Integration Components Layer
– GUI Components (Wikipedia-specific UI and Workflow)
– WikiBABEL [Edit]

WikiBhasha CORE Components (WikiBABEL-CORE)
– Source/Target Wiki System Interface (Wiki APIs for Content Pull/Push, Content & User Management, …)
– User Management (Authentication, User Credentials Management, User Preferences/Skills, Contributions Tracking, …)
– Linguistic Resources (Mono-/Bi-lingual Dictionaries, Thesauri, …)
– Lang. Technology Components (Machine Translation, Transliteration, Summarization, …)
– Content Management (Content Discovery, Versions, Tagging, Notification Lists, …)
– User-Experience (Linguistically Aware, Wiki-site Aware Workflow Engine)
– Contextual Help (Domain-specific, Context-specific, User-Contribution Aware Help, …)
– Communication (Message Boards, Email/Alert Mechanisms, Wikis, …)
– User-Interface (Generic UI Components, Scratch Pad, …)

Cloud Services Layer
– 3rd Party Linguistic Services
– CTF

MediaWiki Layer
– MediaWiki Software
– MediaWiki Extensions
– Wikipedia

• WikiBhasha designed to be modular & extendible
– Open-sourced, so community can contribute/enhance
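The “content pull” side of a wiki system interface can be illustrated with the MediaWiki Action API, which serves page text through `action=query` with `prop=revisions`. This is a sketch of how such a request URL could be built, not WikiBhasha's own code; the helper name `revision_query_url` and the example title are assumptions, and no network call is made.

```python
from urllib.parse import urlencode

def revision_query_url(api_endpoint, title):
    """Build a MediaWiki Action API URL asking for a page's current wikitext."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    return f"{api_endpoint}?{urlencode(params)}"

# Hypothetical example: the source article to be pulled from English Wikipedia.
url = revision_query_url("https://en.wikipedia.org/w/api.php", "Chennai")
print(url)
```

Keeping this kind of call behind a single interface is what lets the same core work against different source and target wikis.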

Page 24: Wikibhasha by  Dr A Kumaran

WikiBhasha: A Community Project

• WikiBhasha is available as a Bookmarklet/Wikipedia user-script
– Please contribute to your Wikipedia!

• WikiBhasha source code available as a MediaWiki Extension
– http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WikiBhasha
– Please enhance it!

Page 25: Wikibhasha by  Dr A Kumaran

WikiBhasha Release

• Released & Open-sourced in 10/’10
• Announced jointly by MSR and WMF

Page 26: Wikibhasha by  Dr A Kumaran

WikiBhasha Release

• Covered in 20+ languages/countries across the world

Page 27: Wikibhasha by  Dr A Kumaran

WikiBhasha Release

• ~500K Visits & ~100K Unique Visitors
• Visits from 50+ countries
– Primarily from Europe (and Eastern Europe)

• Many “casual visitors” who may become “contributors”!

[Chart: visits by country, axis from 0 to 400,000. US (30%), Germany (14%), India (11%), Poland (8%), Russian Federation (5%), UK (5%), Others (37%)]

Page 28: Wikibhasha by  Dr A Kumaran

Community Program

• Being conducted in 5 demographics
– Allahabad & Banaras
– Cairo
– Delhi, …

• Objectives
– Interaction with Wikipedians & Language Enthusiasts
– To study community adoption, user experience, data creation, and ultimately, technology development…

Page 29: Wikibhasha by  Dr A Kumaran

Back to “Digital Democracy”

Communities to Research…

Page 30: Wikibhasha by  Dr A Kumaran

Languages: Communities & Technology

• Research requires Data
– Participatory Internet provides the data needed!
– Digital “haves and have-nots”

• For many languages of the world
– Digital Inclusion is a necessary first step
– Digital Democracy is a process in which the communities may have to take active part in…

Page 31: Wikibhasha by  Dr A Kumaran

Thank you!

http://research.microsoft.com/en-us/groups/mls

http://www.wikibhasha.org