wikibhasha by dr a kumaran
WikiBhasha
Dr A Kumaran, Microsoft Research
Feb 2011
From Digital Inclusion to Digital Democracy…
“Egypt will not become liberal democracy overnight;
it has a long and difficult journey…Becoming an open and empowered
democracy needs a slow process of popular self-education, and it will not come easily or naturally.”
Hindustan Times Editorial, Feb 14, 2011
Agenda
• Language Technology Research (15 Min)
• “Digital Inclusion vs. Digital Democracy” (5 Min)
• WikiBhasha (5 Min)
Language Technology Research
From Classical to Statistical…
Technology & Language...
• Library of Alexandria: 3rd Cen. BC; ~O(KB-MB); Scripts/Media for preservation/…
• Print Era (Post-Gutenberg): 15th Cen. AD; ~O(GB)?; Printing tech./coordination & distribution/…
• Electronics & Computing Era: Late 20th Cen.; ~O(TB-PB); Electronic standards/Multimedia/Storage tech./…
Internet = Information Deluge
• Information produced per year: 5 Exabytes
…500,000 Libraries of Congress
~800 MB per person => 30’ high stack
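The three figures on this slide can be cross-checked with simple arithmetic. The sketch below assumes decimal units (1 EB = 10^18 bytes), which the slide does not specify, and derives only what the quoted numbers imply.

```python
# Figures quoted on the slide. Decimal units are an assumption.
EXABYTE = 10**18
TERABYTE = 10**12
MEGABYTE = 10**6

info_per_year = 5 * EXABYTE          # "5 Exabytes"
libraries_of_congress = 500_000      # "500,000 Libraries of Congress"
per_person = 800 * MEGABYTE          # "~800 MB per person"

# Size of one "Library of Congress" implied by the first two figures.
bytes_per_library = info_per_year // libraries_of_congress

# Number of people implied by dividing the yearly total by the per-person share.
implied_people = info_per_year / per_person

print(bytes_per_library / TERABYTE, "TB per library")
print(implied_people / 1e9, "billion people")
```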
Need of the hour: Language Technology Research!
• Language Technology is primarily concerned with processing Natural Language data
– Search
– Information Extraction
– Machine Translation
– Language Understanding
– …
• Needs “Computational Linguistics” Research!
Linguistics & Computational Linguistics
Linguistics is the scientific study of languages
Computational Linguistics studies Computational Models for languages
Lexicography - Letter-level
Morphology & Phonology - Word-level
Syntax - Sentence-level
Semantics - …
Pragmatics - …
Classical vs Statistical Approaches
• Started as classical & analytical study (~4-5 decades)
– Based on deep Linguistic Knowledge
– Still not perfect after half-a-century of research!
• Evolving into a Statistical discipline (last ~1 decade)
– Collect large [appropriate] linguistic corpora
– Study “patterns”
Ex #1: Which Language? (In general, Language Identification)
• Is a document in English or Finnish or Tamil?
[Chart for the “Length of words” feature; axis 0 to 25]
Near-perfect identification!
IKONE-2007
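A minimal sketch of the “Length of words” idea: build a word-length histogram per language and assign a new text to the closest profile. The sample sentences, the L1 distance, and the function names are illustrative assumptions, not the setup behind the slide's chart, and a real system would train on far larger corpora.

```python
import re
from collections import Counter

def word_length_profile(text):
    """Relative frequency of each word length in the text."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    counts = Counter(len(w) for w in words)
    total = sum(counts.values())
    return {length: n / total for length, n in counts.items()}

def l1_distance(p, q):
    """Sum of absolute differences between two length profiles."""
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def identify(text, profiles):
    """Return the language whose profile is closest to the text's profile."""
    target = word_length_profile(text)
    return min(profiles, key=lambda lang: l1_distance(target, profiles[lang]))

# Toy training text (made up for illustration).
profiles = {
    "English": word_length_profile(
        "the cat sat on the mat and the dog ran to the big red house by the sea"),
    "Finnish": word_length_profile(
        "suomalainen kirjallisuus kehittyi voimakkaasti yhdeksännellätoista "
        "vuosisadalla kansallisromantiikan vaikutuksesta"),
}

guess_en = identify("we went to see a film in the old town", profiles)
guess_fi = identify("yliopiston kirjastossa opiskelijat lukevat tenttikirjoja", profiles)
print(guess_en, guess_fi)
```

Exact-length matching on a handful of words is brittle; with realistic corpora the histograms become smooth and the comparison far more reliable.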
Ex #2: Which is Right? (In general, Grammar Checking & Modeling)
• Are these correct sentences?
– S1: “Yesterday evening, I will have tea”
– S2: “I enjoy my tea with biscuits”
– S3: “I enjoy my tea with motor oil”
• “Normal use of words”
– P(S1): 0.01
– P(S2): 0.75
– P(S3): 0.05
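One common way to score “normal use of words” is a bigram language model. The sketch below is an assumption about how such scores could be produced: the four-sentence corpus is made up, add-one smoothing is one choice among many, and the resulting numbers are not the probabilities shown on the slide. It only shows that S2 scores higher than S3.

```python
import math
from collections import Counter

# Toy corpus (hypothetical); a real model would use millions of sentences.
CORPUS = [
    "i enjoy my tea with biscuits",
    "i enjoy my coffee with milk",
    "she drinks tea with biscuits",
    "the car needs motor oil",
]

def tokens(sentence):
    """Lower-case, split on spaces, and add sentence boundary markers."""
    return ["<s>"] + sentence.lower().split() + ["</s>"]

context_counts = Counter()
bigram_counts = Counter()
for sentence in CORPUS:
    t = tokens(sentence)
    context_counts.update(t[:-1])
    bigram_counts.update(zip(t, t[1:]))
vocab = {w for sentence in CORPUS for w in tokens(sentence)}

def log_prob(sentence):
    """Log-probability of a sentence under an add-one smoothed bigram model."""
    t = tokens(sentence)
    return sum(
        math.log((bigram_counts[(a, b)] + 1) / (context_counts[a] + len(vocab)))
        for a, b in zip(t, t[1:])
    )

S2 = "I enjoy my tea with biscuits"
S3 = "I enjoy my tea with motor oil"
print(log_prob(S2), log_prob(S3))
```

A bigram model of this size cannot catch the tense clash in S1, which depends on words several positions apart.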
Ex #3: What are you reading? (In general, Document Classification)
• Is the following document from a “Novel” or “News”?
“Feature Vectors” tunable for specific purpose
[Chart over the words CAN, COULD, MAY, MIGHT, MUST, WILL]
“Specific set of words”
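A feature vector over a “specific set of words” can be sketched as the relative frequency of each modal verb, with a document assigned to the nearest class vector. The training snippets below are invented for illustration, and nearest-centroid classification is an assumed choice rather than the method behind the slide.

```python
import math
import re

MODALS = ["can", "could", "may", "might", "must", "will"]

def modal_features(text):
    """Relative frequency of each modal verb among all words in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return [words.count(m) / len(words) for m in MODALS]

# Toy class examples (made up for illustration).
CLASSES = {
    "News": modal_features(
        "The minister said the bill will pass and the council will vote on "
        "Monday. Officials said the plan may change."),
    "Novel": modal_features(
        "She thought he could hear her. He might have stayed if he could, "
        "but she could not ask."),
}

def classify(text):
    """Return the class whose feature vector is nearest in Euclidean distance."""
    features = modal_features(text)
    return min(CLASSES, key=lambda label: math.dist(features, CLASSES[label]))

label_a = classify("The company will announce results and the board will meet")
label_b = classify("If only she could see him, he might understand")
print(label_a, label_b)
```

Swapping `MODALS` for another word list retunes the same features for a different classification purpose.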
Ex #4: Search
d1: Human machine interface for ABC computer applications
d2: A survey of user opinion of computer system response time
d3: Human-Machine interface system: System design
d4: System and human system engineering testing of EPS
d5: Relation of user perceived response time to error measurement
d6: The generation of random, binary, ordered trees
d7: The intersection graph of paths in trees
d8: Graph minors IV: Widths of trees and well-quasi ordering
d9: Graph minors: A survey
[Figure: documents d1-d6 and a query plotted against the terms “System”, “graph”, “Human”; labels: Query, Term, Cosine Similarity]
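The ranking idea in this example can be sketched with raw term counts and cosine similarity over the nine titles. The tokenizer, the absence of stop-word removal or term weighting, and the query string are simplifying assumptions.

```python
import math
import re
from collections import Counter

DOCS = {
    "d1": "Human machine interface for ABC computer applications",
    "d2": "A survey of user opinion of computer system response time",
    "d3": "Human-Machine interface system: System design",
    "d4": "System and human system engineering testing of EPS",
    "d5": "Relation of user perceived response time to error measurement",
    "d6": "The generation of random, binary, ordered trees",
    "d7": "The intersection graph of paths in trees",
    "d8": "Graph minors IV: Widths of trees and well-quasi ordering",
    "d9": "Graph minors: A survey",
}

def vectorize(text):
    """Bag-of-words term counts for a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(u, v):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(c * c for c in u.values())) * \
           math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0

def search(query):
    """Return (document, score) pairs sorted by decreasing similarity."""
    q = vectorize(query)
    scored = [(name, cosine(q, vectorize(text))) for name, text in DOCS.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

ranking = search("graph trees")
for name, score in ranking[:4]:
    print(name, round(score, 3))
```

Shorter documents that contain both query terms rank highest because cosine similarity normalizes by document length.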
Ex #5: Statistical MT
Word correspondences:
President – ஜனாதிபதி
visits – செல்கிறார்
Chennai – சென்னை
[Diagram labels: Parallel corpora; Statistical Models]
Parallel corpora:
President visits Chennai / ஜனாதிபதி சென்னை செல்கிறார்
President inaugurates Tamil Conference / ஜனாதிபதி தமிழ் மாநாட்டை துவக்குகிறார்
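Word correspondences like these can be learned from parallel sentences with expectation-maximization. The sketch below uses IBM Model 1 as an assumed stand-in for the “Statistical Models” box (the slide does not name a model) and trains on the slide's two sentence pairs, lower-cased and split on spaces.

```python
from collections import defaultdict

# The two sentence pairs from the slide (English, Tamil).
PAIRS = [
    ("president visits chennai", "ஜனாதிபதி சென்னை செல்கிறார்"),
    ("president inaugurates tamil conference",
     "ஜனாதிபதி தமிழ் மாநாட்டை துவக்குகிறார்"),
]
corpus = [(e.split(), f.split()) for e, f in PAIRS]

def train_model1(corpus, iterations=10):
    """Estimate t(f | e) with IBM Model 1 EM (no NULL word, for brevity)."""
    target_vocab = {f for _, fs in corpus for f in fs}
    t = defaultdict(lambda: 1.0 / len(target_vocab))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        for es, fs in corpus:
            for f in fs:
                # E-step: share each target word among the source words.
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    delta = t[(f, e)] / norm
                    count[(f, e)] += delta
                    total[e] += delta
        # M-step: renormalize the expected counts per source word.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t

def best_translation(t, english_word):
    """Most probable target word for an English word under t."""
    candidates = {f: p for (f, e), p in t.items() if e == english_word}
    return max(candidates, key=candidates.get)

t = train_model1(corpus)
print(best_translation(t, "president"))
```

With only these two pairs, “president” is pinned down because it occurs in both sentences, while “visits” and “chennai” always co-occur with the same two Tamil words and stay ambiguous. That is the practical reason larger parallel corpora matter.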
For most Language Technologies…
• Statistical approaches EXIST, and are proven to be very successful!
Data are Critical!
• Theorem: Data drives Research & Technology!
• Axiom: Web = Language data
– Read “Wikipedia = Language Data”
Where are the data?
[Chart: Wikipedia Content by Language. Languages shown: English, French, Japanese, Dutch, Spanish, Russian, Finnish, Esperanto, Slovak, Romanian, Ukrainian, Danish, Hebrew, Slovenian, Serbian, Korean, Arabic, Croatian, Volapük, Greek, Newar, Persian, Vietnamese, Basque, Hindi]
• Corollary: “Technology will be skewed, similarly”
• Empirically, true!
• Technology has developed in any/all languages with LARGE digital presence
• It is in every community’s interest, to create and preserve its asset: Its Language
Digital Inclusion to Digital democracy…
WikiBhasha
Research Project on Crowd-sourcing to explore collaborative data creation for Computational Linguistic research (first focus: parallel data)
Content Creation by Infusion…
[Diagram labels: Machine Translation System; Collaborative Translation Cache; Linguistic Resources; WikiBABEL on Wikipedia; Article to Target Wikipedia]
• Rough content using Machine Translation
• Appropriate community correction to create value…
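The flow of rough machine-translated content followed by community correction can be sketched as a cache that prefers human corrections over machine drafts. Everything named here (`CollaborativeTranslationCache`, `stub_mt`, the method names) is hypothetical and is not taken from the WikiBhasha code base.

```python
from typing import Callable, Dict, Tuple

class CollaborativeTranslationCache:
    """Hypothetical cache: community corrections override machine drafts."""

    def __init__(self, machine_translate: Callable[[str, str], str]):
        # Any translation service can be plugged in as a plain callable.
        self._machine_translate = machine_translate
        self._corrections: Dict[Tuple[str, str], str] = {}

    def draft(self, sentence: str, target_lang: str) -> str:
        """Return a stored correction if one exists, else a machine draft."""
        key = (sentence, target_lang)
        if key in self._corrections:
            return self._corrections[key]
        return self._machine_translate(sentence, target_lang)

    def correct(self, sentence: str, target_lang: str, corrected: str) -> None:
        """Store a contributor's corrected translation for later reuse."""
        self._corrections[(sentence, target_lang)] = corrected

def stub_mt(sentence: str, target_lang: str) -> str:
    """Stand-in for a real machine translation service."""
    return f"[{target_lang} draft] {sentence}"

cache = CollaborativeTranslationCache(stub_mt)
before = cache.draft("President visits Chennai", "ta")
cache.correct("President visits Chennai", "ta", "ஜனாதிபதி சென்னை செல்கிறார்")
after = cache.draft("President visits Chennai", "ta")
print(before, "->", after)
```

The stored (source, correction) pairs are also parallel data, which matches the project's stated first focus.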
WikiBhasha V1.0
Published in WikiSYM 2008 Conference; Adopted for some products in Microsoft
• Hold a set of Wikipedia articles
• In-progress data, hosted locally
• Per-sentence edits only
• Little traction with Wikipedians
WikiBhasha V2.0: Design Objectives
• #1: Focus users on their purpose (say, Wikipedia)
– Content Creation, and not Translation
• #2: In-site Solution
– WikiBABEL to stay on Wikipedia for the session
– Submit any/all contributions
• #3: Generic components, but specifically purposed
– Vendor Neutrality
– Componentized Architecture
– …
WikiBhasha: User View
[Diagram labels: WikiBABEL UX; WikiBhasha 2.0; User Community; CTF; Dictionary; Cloud Services; API’s; Wikipedia]
• Designed WikiBhasha as a thin edit layer
– Stays on Wikipedia
– User contribution submitted to Wikipedia
WikiBhasha: Developer View
• WikiBhasha designed to be modular & extendible
– Open-sourced, so community can contribute/enhance
[Architecture diagram; recoverable labels follow]
Layers: Cloud Services Layer; WikiBhasha CORE Components (WikiBABEL-CORE); WikiBhasha UI/UX/Integration Components Layer; MediaWiki Layer
Components:
– Source/Target Wiki System Interface (Wiki API’s for Content Pull/Push, Content & User Management, …)
– GUI Components (Wikipedia-specific UI and Workflow)
– WikiBABEL [Edit]
– User Management (Authentication, User Credentials Management, User Preferences/Skills, Contributions Tracking, …)
– Linguistic Resources (Mono-/Bi-lingual Dictionaries, Thesauri, …)
– Lang. Technology Components (Machine Translation, Transliteration, Summarization …)
– Content Management (Content Discovery, Versions, Tagging, Notification Lists, …)
– User-Experience (Linguistically Aware, Wiki-site Aware Workflow Engine)
– Contextual Help (Domain-specific, Context-specific, User-Contribution Aware Help…)
– Communication (Message Boards, Email/Alert Mechanisms, Wikis, …)
– User-Interface (Generic UI Components, Scratch Pad, …)
Other labels: 3rd Party Linguistic Services; MediaWiki Software; MediaWiki Extensions; Wikipedia; CTF
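The “Wiki API’s for Content Pull” item in the developer view can be illustrated with the public MediaWiki Action API. The sketch below only builds request URLs and makes no network call; whether WikiBhasha issues these exact requests is an assumption, and the helper names are hypothetical.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def content_pull_url(lang, title):
    """URL that asks a Wikipedia edition for the wikitext of one article."""
    params = {
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "rvslots": "main",
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

def langlink_url(lang, title, target_lang):
    """URL that asks whether the article already exists in the target language."""
    params = {
        "action": "query",
        "prop": "langlinks",
        "lllang": target_lang,
        "titles": title,
        "format": "json",
    }
    return f"https://{lang}.wikipedia.org/w/api.php?{urlencode(params)}"

pull = content_pull_url("en", "Chennai")
link = langlink_url("en", "Chennai", "ta")
print(pull)
print(parse_qs(urlparse(link).query)["lllang"])
```

Content push (saving an edit) additionally needs an authenticated session and an edit token, which is omitted here.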
WikiBhasha: A Community Project
• WikiBhasha is available as a Bookmarklet/Wikipedia user-script
– Please contribute to your Wikipedia!
• WikiBhasha source code available as a MediaWiki Extension
– http://svn.wikimedia.org/viewvc/mediawiki/trunk/extensions/WikiBhasha
– Please enhance it!
WikiBhasha Release
• Released & Open-sourced in 10/’10
• Announced jointly by MSR and WMF
• Covered in 20+ languages/countries across the world
WikiBhasha Release
• ~500K Visits & ~100K Unique Visitors
• Visits from 50+ countries
– Primarily from Europe (and Eastern Europe)
• Many “casual visitors” who may become “contributors”!
[Chart: visits by country; axis 0 to 400,000. Shares: US (30%), Germany (14%), India (11%), Poland (8%), Russian Federation (5%), UK (5%), Others (37%)]
WikiBhasha Release
Community Program
• Being conducted in 5 demographics
– Allahabad & Banaras
– Cairo
– Delhi, …
• Objectives
– Interaction with Wikipedians & Language Enthusiasts
– To study community adoption, user experience, data creation, and ultimately, technology development…
Back to “Digital Democracy”
Communities to Research…
Languages: Communities & Technology
• Research requires Data
– Participatory Internet provides the data needed!
– Digital “haves and have-nots”
• For many languages of the world
– Digital Inclusion is a necessary first step
– Digital Democracy is a process in which the communities may have to take an active part…
Thank you!
http://research.microsoft.com/en-us/groups/mls
http://www.wikibhasha.org