Post on 18-Dec-2015


Multilingual Information Access in a Digital Library

Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology
Hyderabad, India

IIIT Hyderabad - http://dli.iiit.ac.in

Context

Digital Library of India
- 155,000 English books
- 145,000 books in other languages

Literate population
- 20% of India understands English
- 80% cannot


Multilingual Access to Information

Retrieve a book
- By metadata
- By keyword / content
- Cross Lingual Information Retrieval

Read a book
- Help understand sentences in a language
- Help understand sentences across languages
- Machine Translation


Approaches to Multilingual Access

Cross Lingual Retrieval
- Translate Query to Document Language
- Translate Document to Query Language

Machine Translation
- Knowledge Based Approaches
- Corpus Based Approaches
- Hybrid Approaches
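
As a rough illustration of the first retrieval approach (translating the query into the document language), the sketch below expands each query term through a bilingual dictionary and ranks documents by how many translated terms they contain. The dictionary entries, the document texts and the function names are all invented for illustration; the slides do not describe the actual implementation.

```python
from collections import Counter

# Hypothetical bilingual dictionary (romanized Hindi -> English).
# Entries are illustrative only.
BILINGUAL_DICT = {
    "pustak": ["book", "volume"],
    "itihaas": ["history"],
    "bhaarat": ["india"],
}

# Hypothetical document collection in the document language (English).
DOCUMENTS = {
    "doc1": "a history of india in one volume",
    "doc2": "a book of short stories",
    "doc3": "introduction to number theory",
}


def translate_query(query, dictionary):
    """Replace each query term with all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged, which lets
    proper names pass through untranslated.
    """
    translated = []
    for term in query.lower().split():
        translated.extend(dictionary.get(term, [term]))
    return translated


def search(query, dictionary, documents):
    """Rank documents by the number of translated query terms they contain."""
    terms = set(translate_query(query, dictionary))
    scores = Counter()
    for doc_id, text in documents.items():
        overlap = terms & set(text.split())
        if overlap:
            scores[doc_id] = len(overlap)
    # most_common() returns documents in descending score order.
    return [doc_id for doc_id, _ in scores.most_common()]


print(search("bhaarat itihaas pustak", BILINGUAL_DICT, DOCUMENTS))
```

The second approach, translating documents into the query language, would apply the same kind of lookup to every document at indexing time instead of to the query at search time.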


Challenges in Multilingual Access

Corpus Based Approaches
- Unavailability of Parallel Corpus for pairs of languages
- Unavailability of Computational Linguistics Resources

Dictionary Based Approaches
- Unavailability of multiple bilingual dictionaries


Resources

Universal Dictionary
- Conceived and implemented by Michael Shamos at CMU, USA

ITRANS
- A transcription scheme and associated tool built by IISc, IIIT and CMU

Corpus
- Data Entry by TTD and DLI project
- TIDES project
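
To give a feel for what a transcription scheme such as ITRANS does, the sketch below converts a few ASCII sequences into Devanagari. It covers only a tiny, hand-picked subset of the symbol table, and the function names are hypothetical; it is not the tool mentioned on the slide.

```python
# Tiny illustrative subset of an ITRANS-style table (ASCII -> Devanagari).
# The full scheme has many more symbols, including multi-letter consonants.
CONSONANTS = {
    "k": "\u0915",  # ka
    "d": "\u0926",  # da
    "n": "\u0928",  # na
    "m": "\u092e",  # ma
    "r": "\u0930",  # ra
    "l": "\u0932",  # la
    "h": "\u0939",  # ha
}
# Vowels written on their own (word-initial or after another vowel).
INDEPENDENT_VOWELS = {"a": "\u0905", "aa": "\u0906", "i": "\u0907", "ii": "\u0908"}
# Vowel signs attached to a consonant; "a" is inherent and adds nothing.
VOWEL_SIGNS = {"a": "", "aa": "\u093e", "i": "\u093f", "ii": "\u0940"}
VIRAMA = "\u094d"  # suppresses the inherent vowel of a consonant


def match_vowel(text, pos):
    """Return the longest vowel token starting at pos, or None."""
    for length in (2, 1):
        candidate = text[pos:pos + length]
        if candidate in VOWEL_SIGNS:
            return candidate
    return None


def itrans_to_devanagari(text):
    """Transliterate text written in the subset above into Devanagari."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            i += 1
            vowel = match_vowel(text, i)
            if vowel is None:
                # No vowel follows, so mark the consonant as vowel-less.
                out.append(VIRAMA)
            else:
                out.append(VOWEL_SIGNS[vowel])
                i += len(vowel)
        else:
            vowel = match_vowel(text, i)
            if vowel is None:
                # Unknown characters (spaces, punctuation) pass through.
                out.append(ch)
                i += 1
            else:
                out.append(INDEPENDENT_VOWELS[vowel])
                i += len(vowel)
    return "".join(out)


print(itrans_to_devanagari("hindii"))
```

Because the input side is plain ASCII, text in this form can be typed and stored with ordinary tools and converted to the native script for display.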


Universal Dictionary


How are we doing it

Cross Lingual Search (Identify Information)
- Dictionary lookup
- User feedback based
- Lucene Search Engine

Machine Translation (Understand Information)
- Corpus based technique (EBMT)
- Dictionary based word-word lookup
- Good-enough translation vs Perfect translation
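
The combination of a corpus based technique with a word-word dictionary fallback might look roughly like the sketch below. It is a simplified stand-in, not the system described here: it only reuses stored phrase pairs verbatim, whereas real EBMT systems match and recombine fragments more flexibly. The example base, the dictionary entries and the function name are assumptions made for illustration.

```python
# Hypothetical example base of aligned phrases (romanized Hindi -> English).
EXAMPLES = {
    ("yah", "pustak"): "this book",
    ("bahut", "achchhii"): "very good",
}

# Hypothetical word-level dictionary used when no stored example matches.
WORD_DICT = {"yah": "this", "pustak": "book", "hai": "is", "puraanii": "old"}


def translate(sentence, examples, word_dict):
    """Translate by reusing stored example phrases, else word by word."""
    words = sentence.lower().split()
    max_len = max((len(key) for key in examples), default=1)
    out = []
    i = 0
    while i < len(words):
        # Try the longest stored phrase (two or more words) starting at i.
        for n in range(min(max_len, len(words) - i), 1, -1):
            phrase = tuple(words[i:i + n])
            if phrase in examples:
                out.append(examples[phrase])
                i += n
                break
        else:
            # No example phrase matched: fall back to word-word lookup and
            # keep the source word when the dictionary has no entry.
            out.append(word_dict.get(words[i], words[i]))
            i += 1
    return " ".join(out)


print(translate("yah pustak bahut achchhii hai", EXAMPLES, WORD_DICT))
```

The output keeps the source word order, so it is not fluent English, but a reader can still follow it. That is the trade-off the slide calls good-enough translation versus perfect translation.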


Cross Lingual Retrieval


Cross Lingual Retrieval


Reading Assistant System


Reading Assistant


Status Today

- CLIR for 6 languages
- MT for 3 languages
- Shakti (a knowledge based MT system)
- Parallel Corpus for Hindi-English
- UDICT
  - About 40 Foreign Languages
  - 6 Indian Languages


What more is needed?

UDICT
- Improving coverage of existing languages
- Adding new languages

Machine Translation
- Corpus acquisition
- State-of-the-art techniques applied to Indian languages
- Multi-way parallel corpus development

Textual format for the books
- Books are currently in image formats
- OCR should be developed for textual content

Thank You

Questions?