multilingual information access in a digital library vamshi ambati, rohini u, pramod, n balakrishnan...
TRANSCRIPT
Multilingual Information Access in a Digital Library
Vamshi Ambati, Rohini U, Pramod, N Balakrishnan and Raj Reddy
International Institute of Information Technology
Hyderabad, India
IIIT Hyderabad - http://dli.iiit.ac.in

Context
  Digital Library of India
    155,000 English books
    145,000 Other language books
  Population of literates
    20% of India understand English
    80% cannot

Multilingual Access to Information
  Retrieve a book
    By metadata
    By keyword / content
    Cross Lingual Information Retrieval
  Read a book
    Help understand sentences in a language
    Help understand sentences across languages
    Machine Translation

Approaches to Multilingual Access
  Cross Lingual Retrieval
    Translate Query to Document Language
    Translate Document to Query Language
  Machine Translation
    Knowledge Based Approaches
    Corpus Based Approaches
    Hybrid Approaches
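
The first retrieval approach, translating the query into the document language, can be sketched in a few lines. This is only an illustration under stated assumptions: the toy dictionary, the romanized Hindi entries, the sample documents, and the function names `build_index`, `translate_query`, and `search` are all hypothetical and are not taken from the DLI system.

```python
from collections import defaultdict

# Hypothetical toy bilingual dictionary (romanized Hindi -> English).
BILINGUAL_DICT = {
    "pustak": ["book"],
    "itihas": ["history"],
    "bharat": ["india"],
}

# Hypothetical document collection in the document language (English).
DOCUMENTS = {
    1: "a history of india",
    2: "the book of gardening",
    3: "modern physics",
}


def build_index(docs):
    """Build a simple inverted index: token -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index


def translate_query(query, dictionary):
    """Replace each query word with its dictionary translations.

    Words missing from the dictionary are passed through unchanged.
    """
    terms = []
    for word in query.lower().split():
        terms.extend(dictionary.get(word, [word]))
    return terms


def search(query, dictionary, index):
    """Translate the query, then rank documents by number of matching terms."""
    scores = defaultdict(int)
    for term in translate_query(query, dictionary):
        for doc_id in index.get(term, ()):
            scores[doc_id] += 1
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

For example, `search("bharat itihas", BILINGUAL_DICT, build_index(DOCUMENTS))` returns `[1]`. The second approach, translating documents into the query language, would instead run translation over `DOCUMENTS` before indexing and leave the query untouched.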

Challenges in Multilingual Access
  Corpus Based Approaches
    Unavailability of Parallel Corpus for pairs of languages
    Unavailability of Computational Linguistics Resources
  Dictionary Based Approaches
    Unavailability of multiple bilingual dictionaries

Resources
  Universal Dictionary
    Conceived and implemented by Michael Shamos at CMU, USA
  ITRANS
    A transcription scheme and associated tool built by IISc, IIIT and CMU
  Corpus
    Data Entry by TTD and DLI project
    TIDES project

Universal Dictionary

How are we doing it
  Cross Lingual Search (Identify Information)
    Dictionary lookup
    User feedback based
    Lucene Search Engine
  Machine Translation (Understand Information)
    Corpus based technique (EBMT)
    Dictionary based word-word lookup
    Good-enough translation vs Perfect translation
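
The "good-enough" translation idea can be illustrated with a rough sketch that first tries an exact match against stored example translations (a much-reduced stand-in for EBMT, which normally matches and recombines fragments) and otherwise falls back to word-for-word dictionary lookup. The example pairs, dictionary entries, and the `translate` function are hypothetical and only meant to show the fallback order, not the actual system.

```python
# Hypothetical example base: source sentences with known translations.
EXAMPLES = {
    "yah pustak hai": "this is a book",
}

# Hypothetical word-to-word dictionary (romanized Hindi -> English).
WORD_DICT = {
    "yah": "this",
    "pustak": "book",
    "hai": "is",
    "meri": "my",
}


def translate(sentence):
    """Return a rough translation of `sentence`.

    1. If the normalized sentence is in the example base, reuse its translation.
    2. Otherwise gloss it word by word, keeping source word order and
       leaving unknown words unchanged.
    """
    key = " ".join(sentence.lower().split())
    if key in EXAMPLES:
        return EXAMPLES[key]
    return " ".join(WORD_DICT.get(word, word) for word in key.split())
```

Here `translate("yah pustak hai")` reuses the stored example and gives `"this is a book"`, while `translate("yah meri pustak hai")` falls back to the gloss `"this my book is"`: not fluent English, but often enough for a reader to follow the sentence.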

Cross Lingual Retrieval

Cross Lingual Retrieval

Reading Assistant System

Reading Assistant

Status Today
  CLIR for 6 languages
  MT for 3 languages
  Shakti (a knowledge based MT system)
  Parallel Corpus for Hindi-Eng
  UDICT
    About 40 Foreign Languages
    6 Indian Languages

What more is needed?
  UDICT
    Improving coverage of existing languages
    Adding new languages
  Machine Translation
    Corpus acquisition
    State-of-the-art techniques applied to Indian Languages
    Multi-way parallel corpus development
  Textual format for the books
    Books are currently in image formats
    OCR should be developed for textual content