tutorial-cum-workshop building a machine translation...

Tutorial-cum-Workshop on

Building a Machine Translation System for Less Resourced Languages

Anil Kumar Singh, IIT (BHU), Varanasi, [email protected]

Abstract

Machine translation, besides being a technological challenge, is an enormously enabling technology for people and for languages. The last few decades, particularly the last two, have seen the creation of such systems for many language pairs. However, there are still many ‘less resourced’ languages for which there are no machine translation systems. This is particularly true for a linguistically diverse region like South Asia. Machine translation, when using a rule-based approach (which could be said to be more suitable for less resourced languages), also involves the creation of several core kinds of language resources (dictionaries, corpora etc.) as well as core tools (morphological analyzer, part-of-speech tagger etc.). Learning rule-based machine translation is, therefore, a very good way to learn the basics of language resource creation and Natural Language Processing. Doing it for less resourced languages has its own peculiarities.

In the proposed tutorial-cum-workshop, we will take a specific language pair and see how we can build a machine translation system for it from scratch. We will try to use a hands-on approach for this learning. The focus will be mainly on the rule-based approach, but we will briefly introduce the statistical approach.

Topical Outline

• Introduction: What is machine translation (MT)?

• The two (or three) approaches: Rule-based or statistical (or hybrid)?

• Rule-based approach: A set of NLP problems

• Examples: Some rule-based MT systems

• A pipeline of tools

• A collection of language resources

• Tokenization

• Morphological analysis: Data creation

• Morphological analysis: Building the tool

• POS tagger: Corpus annotation

• POS tagger: Building the tool

• Chunker: Corpus annotation

• Chunker: Building the tool

• Transfer grammar and the reordering tool

• Bilingual dictionary and lexical substitution

• Word sense disambiguation

• Morphological generation

• Named entity recognition and transliteration

• Handling tense, aspect and modality

• Hurdles for less resourced languages

• Aligned parallel corpus and the statistical approach

• Summing Up: Some concluding remarks
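The pipeline named in the outline above can be pictured with a very small sketch. The code below is only an illustration under loud assumptions: the language pair (English to romanized Hindi), the four-word lexicon, the single reordering rule and the single generation rule are all hypothetical stand-ins invented for this example, not resources or tools from the tutorial itself.

```python
# Toy rule-based MT pipeline (English -> romanized Hindi).
# Every lexicon entry and rule here is a hypothetical stand-in for a real resource.

# Morphological lexicon: surface form -> (root, POS tag, features)
MORPH_LEXICON = {
    "the": ("the", "DET", {}),
    "boy": ("boy", "NOUN", {"num": "sg"}),
    "eats": ("eat", "VERB", {"tense": "pres", "num": "sg"}),
    "mango": ("mango", "NOUN", {"num": "sg"}),
}

# Bilingual dictionary: source root -> target root
BILINGUAL_DICT = {"boy": "ladka", "eat": "kha", "mango": "aam"}


def tokenize(sentence):
    """Lowercase, strip full stops and split on whitespace."""
    return sentence.lower().replace(".", " ").split()


def analyze(tokens):
    """Morphological analysis plus POS tagging by lexicon lookup."""
    return [MORPH_LEXICON.get(t, (t, "UNK", {})) for t in tokens]


def reorder(analyses):
    """Transfer rule: drop determiners and move verbs to the end (SVO -> SOV)."""
    kept = [a for a in analyses if a[1] != "DET"]
    non_verbs = [a for a in kept if a[1] != "VERB"]
    verbs = [a for a in kept if a[1] == "VERB"]
    return non_verbs + verbs


def substitute(analyses):
    """Lexical substitution; unknown roots are passed through unchanged."""
    return [(BILINGUAL_DICT.get(root, root), pos, feats)
            for root, pos, feats in analyses]


def generate(analyses):
    """Morphological generation with one toy rule (present tense, masc. sg.)."""
    words = []
    for root, pos, feats in analyses:
        if pos == "VERB" and feats.get("tense") == "pres":
            words.append(root + "ta hai")
        else:
            words.append(root)
    return " ".join(words)


def translate(sentence):
    return generate(substitute(reorder(analyze(tokenize(sentence)))))


print(translate("The boy eats mango."))  # ladka aam khata hai
```

Each function corresponds to one stage in the outline; in a real system every stage would be backed by a much larger resource (a morphological lexicon, an annotated corpus, a transfer grammar) rather than a hand-written dictionary.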

Description of the Proposer

The proposer is a researcher and a teacher who has been working in the area of NLP (specifically machine translation) for the last twelve years. He works as an Assistant Professor in the Department of Computer Science and Engineering at IIT (BHU), Varanasi, India. He has published on various topics in NLP and has organized a couple of research workshops and a couple of introductory workshops on NLP, including a regional version of ICON (regICON-2015). He did his PhD in Computational Linguistics from IIIT, Hyderabad, India. He is interested as much in development and implementation as in research. He is the creator of Sanchay, a collection of tools and APIs familiar to some researchers in India. He spent one year as a post-doctoral researcher at LIMSI-CNRS, Orsay, France, working on machine translation quality estimation (MTQE). The work there resulted in some publications and in the development of a tool called Questimate (along with a companion tool called LatticeFst). He has been associated over the years with machine translation and related activities (particularly resource creation) in several different capacities. He also happens to be a professional translator and has translated from a wide variety of domains (from computers to poetry). At present, he is also involved in building machine translation systems for Bhojpuri, Maithili and Magahi, which are all less resourced languages. The proposed tutorial-cum-workshop is based to a large extent on his own experiences.

Tutorial Title: Domain Adaptation: Principles, Applications and Systems

Himanshu Sharad Bhatt & Manjira Sinha

Xerox Research Center India, Bengaluru

Himanshu.bhatt,[email protected]

ABSTRACT

Machine learning based algorithms require training data to learn from past examples and predict on new instances. It is generally observed that the more training examples there are, the better the performance. However, the performance of these algorithms degrades severely if the algorithm is applied to the same task in a different domain, due to the change in data distribution. For example, an algorithm trained for classifying sentiments for retail products may not yield optimum performance when directly applied to categorizing sentiments in a financial domain. To retain its performance, the algorithm has to be re-trained from scratch each time on labelled instances from the test domain. The scarcity of labelled instances from the test domain and the time and effort required for data labelling are among the biggest challenges to deployment in real life.

Domain adaptation (DA) has gained a lot of attention due to its effectiveness in making machine learning algorithms re-usable in related scenarios with limited supervision. Domain adaptation is related to the concept of transfer learning. The main goal in DA is to efficiently adapt models trained on labelled data from one domain to categorize, with high accuracy, the data in another domain with a different data distribution. Continuing the above example of sentiment categorization, a good domain adaptation algorithm will modify the sentiment classifier learnt in the retail domain so that it predicts sentiments in the financial domain with accuracy comparable to what would have been achieved by using labelled data from the financial domain. Due to these benefits of cost-efficiency and re-usability, a large body of research is currently being conducted in domain adaptation and transfer learning [1-10].
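One simple way to picture the change in data distribution between domains is to re-weight labelled source-domain examples by how much they resemble unlabelled target-domain text. The sketch below is a deliberately crude illustration of that idea, using a smoothed unigram likelihood ratio; it is not the method of any of the cited papers, and the function names and the tiny retail/finance sentences in the usage note are invented for this example. Documents are assumed to be non-empty, whitespace-tokenized strings.

```python
import math
from collections import Counter


def unigram_model(docs, vocab):
    """Add-one smoothed unigram probabilities over a fixed vocabulary."""
    counts = Counter(w for d in docs for w in d.split())
    total = sum(counts[w] for w in vocab) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}


def importance_weights(source_docs, target_docs):
    """Return one weight per source document.

    The weight is the geometric mean, over the words of the document, of
    p_target(word) / p_source(word). Values above 1 mean the document looks
    more like target-domain text than a typical source document does.
    """
    vocab = {w for d in source_docs + target_docs for w in d.split()}
    p_src = unigram_model(source_docs, vocab)
    p_tgt = unigram_model(target_docs, vocab)
    weights = []
    for doc in source_docs:
        words = doc.split()  # assumed non-empty
        avg_log_ratio = sum(math.log(p_tgt[w] / p_src[w]) for w in words) / len(words)
        weights.append(math.exp(avg_log_ratio))
    return weights
```

For instance, with source (retail) sentences such as "poor returns on this product" and "cheap plastic broke quickly", and target (finance) sentences such as "poor returns on this fund", the first source sentence receives a larger weight than the second, because its vocabulary overlaps more with the target domain. A weighted classifier could then pay more attention to such examples during training.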

Expected outcome for attendees: The tutorial will introduce attendees to transfer learning (domain adaptation) and show how existing solutions for a specific task, such as sentiment categorization or conversation labelling, can be made re-usable and adaptable across domains with limited supervision. We expect that, after the session, attendees will appreciate the need for domain adaptation and identify applications of domain adaptation in their areas.

Tutorial outline:

This will be a half day tutorial.

The first part of the tutorial will focus on the theory and technical aspects of DA. The second half will focus on case-studies for attendees to appreciate how this works in real-life settings. The following is an outline of the topics that will be discussed.

Introduction to Domain Adaptation

Need for domain adaptation

Supervised and unsupervised domain adaptation

Online domain adaptation

Applications (10 min)

Case study on cross-domain sentiment categorization

The workshop will help attendees realize the potential of re-usable algorithms that can be deployed with limited supervision.

Description of the Proposer

Himanshu S. Bhatt has been a Research Scientist at Xerox Research Centre India since February 2014, where he is a member of the Text and Graph Analytics Group and leads a project on efficient scaling of machine learning based solutions/offerings across different domains and industries. He received his PhD from IIIT-Delhi, India in 2014, where he worked on varied machine learning paradigms such as online learning, co-training, transfer learning, clustering, re-ranking, as well as genetic and memetic algorithms. His PhD dissertation was recognized as one of the best doctoral theses by the Indian National Academy of Engineering (INAE) and the Indian Unit of Pattern Recognition and Artificial Intelligence (IUPRAI) in 2014. Himanshu has over 20 publications in refereed journals, book chapters, and conferences. He is a recipient of the IBM PhD fellowship 2011-13 and two best poster awards at IEEE conferences.

Manjira Sinha joined Xerox in May 2015. She is a part of the Text and Graph Analytics group at XRCI. She is currently working in two major areas: analysing social media data for urban infrastructure and domain adaptation for text categorization. Manjira has submitted her Ph.D. thesis at the Indian Institute of Technology Kharagpur. Her areas of interest are Language Comprehension and Psycholinguistics, Natural Language Processing, Assistive Technology and Human Computer Interaction.

References:

1. J. Blitzer, M. Dredze, and F. Pereira. 2007. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. In Proceedings of Association for Computational Linguistics, pages 187–205.

2. M. Chen, K. Q. Weinberger, and J. Blitzer. 2011. Co-training for domain adaptation. In Proceedings of Advances in Neural Information Processing Systems, pages 2456–2464.

3. W. Dai, G.-R. Xue, Q. Yang, and Y. Yu. 2007. Co-clustering based classification for out-of-domain documents. In Proceedings of International Conference on Knowledge Discovery and Data Mining, pages 210–219.

4. Y.-S. Ji, J.-J. Chen, G. Niu, L. Shang, and X.-Y. Dai. 2011. Transfer learning via multi-view principal component analysis. Journal of Computer Science and Technology, 26(1):81–98.

5. J. Jiang and C. Zhai. 2007. Instance weighting for domain adaptation in NLP. In Proceedings of Association for Computational Linguistics, volume 7, pages 264–271.

6. C. Luo, Y. Ji, X. Dai, and J. Chen. 2012. Active learning with transfer learning. In Proceedings of Association for Computational Linguistics Student Research Workshop, pages 13–18. Association for Computational Linguistics.

7. S. J. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359.

8. S. J. Pan, X. Ni, J.-T. Sun, Q. Yang, and Z. Chen. 2010. Cross-domain sentiment classification via spectral feature alignment. In Proceedings of International Conference on World Wide Web, pages 751–760.

9. S. J. Pan, I. W. Tsang, J. T. Kwok, and Q. Yang. 2011. Domain adaptation via transfer component analysis. IEEE Transactions on Neural Networks, 22(2):199–210.

10. P. Zhao and S. C. H. Hoi. 2010. OTL: A Framework of Online Transfer Learning. In Proceedings of International Conference on Machine Learning.

Proposal for Tutorial on “Translation & Transliteration between Related Languages”

Proposed by:

Mitesh Khapra, Researcher, IBM India Research Laboratory

Anoop Kunchukuttan, Research Scholar, Center for Indian Language Technology, IIT Bombay

Under the direction of: Prof. Pushpak Bhattacharyya, IIT Patna

Abstract

Universal techniques for Machine Translation/Transliteration (MT/MX) have proven to be challenging to develop. However, a large chunk of MT/MX requirements is among related languages owing to government, business and sociocultural communication needs (e.g. India, European Union). The proposed tutorial will discuss how the relatedness among languages can be leveraged to improve translation/transliteration quality, achieve better generalization, share linguistic resources, and reduce resource requirements. This tutorial is aimed at Machine Translation/Transliteration researchers and developers. The tutorial will also be relevant for researchers interested in multilingual computation, especially involving Indian languages.

We introduce notions of relatedness useful for MT/MX, and principles for leveraging relatedness. We explore how vocabulary shared between related languages can help MT. Then, we move beyond bilingual MT/MX and present how pivot-based and multi-source methods incorporate knowledge from multiple languages, and handle language pairs lacking parallel corpora. We present approaches to multilingual word alignment, which show improvement over bilingual alignment. Finally, we discuss sharing of language resources (data & rules) among related languages, as well as among groups of related languages, thus introducing the notion of a ‘language group’ being an apt level of system abstraction for building MT/MX systems.

Proposed Duration: Half Day

Tutorial Outline

1. Introduction: 15 slides, 30 minutes ([1],[2],[3])
   • Motivation
   • Brief introduction to Language Typology
   • Useful notions of language relatedness
   • Principles for leveraging relatedness

2. Taking advantage of orthographic similarity and cognates: 15 slides, 40 min
   • Transliteration & Cognate Mining ([4],[5],[8],[28])
   • Integrating transliteration & translation in decoder ([7])
   • Transliteration of OOV words ([9])
   • Character-level translation ([6])

3. Multilingual word alignment: 10 slides, 20 min ([19],[20],[21])

4. Multilingual phrase alignment: 20 slides, 40 minutes
   • Use of assisting & bridge languages ([12],[13],[27])
   • Pivot-based Methods ([10],[11],[14],[15])
   • Multi-source translation ([16],[17],[18])
   • Combining pivot-based SMT and transliteration methods

5. Sharing language resources: 8 slides, 15 min ([22],[23],[24])
   • Sharing among related languages
   • Sharing for translation between two groups of related languages

6. Conclusion & Future Directions: 4-5 slides, 10 min

7. Tools & Resources: 10 min ([24],[25],[26],[27])
   • Moses transliteration & transliteration mining system
   • System combination tools
   • Transliteration and Script conversion with the Indic NLP library

8. Question and Answer session: 10 min

Expected Duration: 180 minutes
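As a rough illustration of the pivot-based methods listed in the outline, the sketch below composes a source-to-pivot and a pivot-to-target phrase table by marginalizing over pivot phrases, i.e. p(t|s) = Σ_p p(p|s) · p(t|p). This is a commonly described triangulation scheme, shown here under the simplifying assumption that the target phrase is independent of the source phrase given the pivot. The function name, the nested-dictionary table layout and the placeholder phrases `s1`, `p1`, `t1` etc. are assumptions made for this example and do not reflect any particular toolkit's format.

```python
from collections import defaultdict


def triangulate(src_to_pivot, pivot_to_tgt):
    """Compose two phrase tables through a pivot language.

    Both tables are nested dicts: table[x][y] = p(y | x).
    Returns a table with p(t | s) = sum over pivots p of p(p | s) * p(t | p).
    Source phrases with no pivot phrase found in pivot_to_tgt are omitted.
    """
    src_to_tgt = defaultdict(lambda: defaultdict(float))
    for s, pivots in src_to_pivot.items():
        for p, prob_p_given_s in pivots.items():
            for t, prob_t_given_p in pivot_to_tgt.get(p, {}).items():
                src_to_tgt[s][t] += prob_p_given_s * prob_t_given_p
    return {s: dict(ts) for s, ts in src_to_tgt.items()}


# Placeholder phrases and made-up probabilities, purely for illustration.
src_to_pivot = {"s1": {"p1": 0.7, "p2": 0.3}}
pivot_to_tgt = {"p1": {"t1": 0.9, "t2": 0.1}, "p2": {"t2": 1.0}}
table = triangulate(src_to_pivot, pivot_to_tgt)
```

Here `table["s1"]["t1"]` is 0.7 × 0.9 and `table["s1"]["t2"]` is 0.7 × 0.1 + 0.3 × 1.0, so the two entries still sum to one. This is how a language pair with no direct parallel corpus can still obtain a phrase table, provided both sides share a pivot language.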

Proposer Profiles:

Mitesh Khapra
Researcher, IBM India Research Laboratory, Bangalore.
[email protected]

Mitesh Khapra obtained his Ph.D. from the Indian Institute of Technology, Bombay in the area of Natural Language Processing with a focus on reusing resources for multilingual computation. His areas of interest include Statistical Machine Translation, Text Analytics, Crowdsourcing, Argument Mining and Deep Learning. He is currently working as a researcher at IBM Research India where he is focusing on mining arguments from large unstructured text. He has co-authored papers in top NLP and ML conferences such as ACL, NAACL, EMNLP, AAAI and NIPS. To view the complete publication list and presenter profile, please visit: http://dblp.uni-trier.de/pers/hd/k/Khapra:Mitesh_M=

Anoop Kunchukuttan
Ph.D Scholar, Center for Indian Language Technology, Dept of Computer Science & Engineering, IIT Bombay
[email protected]

Anoop Kunchukuttan is a research scholar at the Indian Institute of Technology Bombay. He is advised by Prof. Pushpak Bhattacharyya on his research work involving machine translation and transliteration among related languages. He has also investigated other NLP problems such as multiword extraction, grammar correction, crowdsourcing and information extraction. He has co-authored papers in top NLP conferences such as ACL, NAACL, CONLL, LREC and ICON. Prior to joining the Ph.D program, Anoop received his Masters degree in Computer Science and Engineering from IIT Bombay in 2008. He has worked in the software industry for 4.5 years, during which he led the development of large scale systems for information extraction and retrieval over medical text. To view the complete publication list and presenter profile, please visit: www.cse.iitb.ac.in/~anoopk

References:

1. Subbarao, Karumuri V. South Asian languages: a syntactic typology. Cambridge University Press, 2012.

2. Halvor Eifring, Bøyesen Rolf Theil. Linguistics for students of Asian and African languages. Institutt for østeuropeiske og orientalske studier. 2004.

3. Preslav Nakov, Hwee Tou Ng. Improving statistical machine translation for a resource-poor language using related resource-rich languages. Journal of Artificial Intelligence Research. 2012.

4. Greg Kondrak. Cognates and word alignment in bitexts. MT Summit. 2005.

5. Greg Kondrak, Daniel Marcu and Kevin Knight. Cognates can improve statistical translation models. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology. 2003.

6. Jörg Tiedemann. Character-based PSMT for closely related languages. In Proceedings of the 13th Annual Conference of the European Association for Machine Translation, EAMT. 2009.

7. Nadir Durrani, Hassan Sajjad, Hieu Hoang and Philipp Koehn. Integrating an unsupervised transliteration model into statistical machine translation. EACL. 2014.

8. Hassan Sajjad, Alexander Fraser, and Helmut Schmid. A statistical model for unsupervised and semi-supervised transliteration mining. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. 2012.

9. Anoop Kunchukuttan, Ratish Puduppully, and Pushpak Bhattacharyya. BrahmiNet: A transliteration and script conversion system for languages of the Indian subcontinent. Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations. 2015.

10. Hua Wu and Haifeng Wang. Pivot language approach for phrase-based statistical machine translation. Machine Translation. 2007.

11. Michael Paul, Andrew Finch, and Eiichiro Sumita. How to choose the best pivot language for automatic translation of low-resource languages. ACM Transactions on Asian Language Information Processing (TALIP). 2013.

12. Raj Dabre, Fabien Cromieres, Sadao Kurohashi, and Pushpak Bhattacharyya. Leveraging small multilingual corpora for SMT using many pivot languages. Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2015.

13. Mitesh Khapra, A. Kumaran and Pushpak Bhattacharyya. Everybody loves a rich cousin: An empirical study of transliteration through bridge languages. Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics. 2010.

14. Akiva Miura, Graham Neubig, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura. Improving Pivot Translation by Remembering the Pivot. Association for Computational Linguistics. 2015.

15. N. Bertoldi, M. Barbaiani, M. Federico, R. Cattoni. Phrase-based statistical machine translation with pivot languages. IWSLT. 2008.

16. Franz Och and Hermann Ney. Statistical multi-source translation. In Proceedings of MT Summit VIII. Machine Translation in the Information Age, MT Summit. 2001.

17. Schroeder, J., Cohn, T., and Koehn, P. Word lattices for multi-source translation. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics. 2009.

18. Evgeny Matusov, Nicola Ueffing, and Hermann Ney. Computing Consensus Translation for Multiple Machine Translation Systems Using Enhanced Hypothesis Alignment. EACL. 2006.

19. Haifeng Wang, Hua Wu, and Zhanyi Liu. Word alignment for languages with scarce resources using bilingual corpora of other language pairs. COLING-ACL. 2006.

20. Kumar, S., Och, F. J., Macherey, W. Improving word alignment with bridge languages. In Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. 2007.

21. Östling, Robert. Bayesian word alignment for massively parallel texts. 14th Conference of the European Chapter of the Association for Computational Linguistics. 2014.

22. Sinha, R., Sivaraman, K., Agrawal, A., Jain, R., Srivastava, R., and Jain, A. ANGLABHARTI: a multilingual machine aided translation project on translation from English to Indian languages. In IEEE International Conference on Systems, Man and Cybernetics. 1995.

23. Anoop Kunchukuttan, Abhijit Mishra, Rajen Chatterjee, Ritesh Shah, Pushpak Bhattacharyya. Sata-Anuvadak: Tackling Multiway Translation of Indian Languages. Language Resources and Evaluation Conference. 2014.

24. Nadir Durrani, Barry Haddow, Philipp Koehn, Kenneth Heafield. Edinburgh’s phrase-based machine translation systems for WMT-14. Proceedings of the ACL 2014 Ninth Workshop on Statistical Machine Translation. 2014.

25. David Steele and Lucia Specia. WA-Continuum: Visualising Word Alignments across Multiple Parallel Sentences Simultaneously. ACL-IJCNLP. 2015.

26. Kenneth Heafield, Alon Lavie. Combining Machine Translation Output with Open Source: The Carnegie Mellon Multi-Engine Machine Translation Scheme. The Prague Bulletin of Mathematical Linguistics. 2010.

27. A. Kumaran, Mitesh M. Khapra, and Pushpak Bhattacharyya. Compositional Machine Transliteration. ACM Transactions on Asian Language Information Processing. 2010.

28. Raghavendra Udupa, Mitesh M Khapra. Transliteration equivalence using canonical correlation analysis. Advances in Information Retrieval. 2010.

Format of Submission of Tutorial/Workshop Proposal to ICON-2015 Conference

1. Title of the Tutorial/Workshop: Universal POS Tagging and Dependency Relations

2. Proposer’s Name:
   1. Hima Bindu Maringanti
   2. Mojgan Seraji

3. Proposer’s Affiliation and contact details:
   1. Professor, North Orissa University, [email protected], mob: 9861569765
   2. Ph.D., Department of Computational Linguistics, Uppsala University, Sweden

4. Abstract of the Tutorial/Workshop: Natural language is the primary medium of human communication, and the languages spoken on the planet are numerous. Not only linguists but also computational experts often wonder at the intricate structure and morphology of new languages, that is, languages they do not speak or that are unknown to them, and seek techniques to process, understand or inter-translate them. While linguists focus on the nuances of such new languages and try to map their morphemes/tokens onto the semantically equivalent units of known language(s) (problem reduction), computational linguists invent or discover techniques that can be used effectively to process the new language(s). Morphological analysis of text includes part-of-speech tagging as a major step and conflict resolution as a complex step at a later stage. Taking this phase of NLP a step further, Universal POS tagging (Nivre et al.) was introduced, in which the tagging is independent of the language or its genre. Universal Dependencies (Nivre et al.) are also emerging, making the NLP task challenging as well as thrilling; they enable linguists from all over the world to align and compare various languages and understand their similarities and differences (Conceptual Dependencies of knowledge representation). Universal Dependencies is a project whose goal is to facilitate multilingual parser development, cross-lingual learning and research on parsing from the perspective of language structure and usage. UD, as it is called, evolved from Stanford dependencies (de Marneffe et al., 2006, 2008, 2014), Google universal POS tags (Petrov et al., 2012), and the Interset interlingua for morpho-syntactic tagsets (Zeman, 2008).

The concept of universally standardized and accepted cross-lingual tagsets and relation labels goes far in developing parallel corpora and in understanding both the similarities and the dissimilarities between languages. They not only help in universal annotation, but also allow extensions based upon the specific requirements of individual languages, since one cannot always find an equivalent morpheme in another language. Ease of machine translation is an important advantage of this research, enabling the understanding of multiple languages through parallel corpora alignment, using either treebanks or context-sensitive rules. Presently around 35 languages have been annotated, so there is considerable scope for expanding this project by populating the corpus, annotating and validating. The theme is a recent one for research, and hence many researchers and teams may join hands to further this project, which aims at Universal Understanding and Peace.
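To make the idea of a language-independent tagset concrete, the sketch below maps a few language-specific tags onto a coarse universal inventory. It is only an illustration: the table is a small, partial subset of a Penn Treebank to universal-tag mapping in the spirit of the Google universal POS tags (Petrov et al., 2012), and the function name and fallback tag `X` are choices made for this example. The Universal Dependencies tag inventory differs in places (for instance, it has a separate PROPN tag for proper nouns), so this should not be read as the UD mapping.

```python
# Illustrative, partial mapping from Penn Treebank tags to coarse universal tags.
# A real mapping table covers the full source tagset.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBZ": "VERB",
    "JJ": "ADJ",
    "RB": "ADV",
    "PRP": "PRON",
    "DT": "DET",
    "IN": "ADP",
    "CD": "NUM",
    "CC": "CONJ",
}


def to_universal(tagged_tokens, mapping=PTB_TO_UNIVERSAL, fallback="X"):
    """Convert (word, language-specific tag) pairs to (word, universal tag) pairs.

    Tags missing from the mapping are replaced by the fallback category.
    """
    return [(word, mapping.get(tag, fallback)) for word, tag in tagged_tokens]


tagged = [("The", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]
universal = to_universal(tagged)
# universal == [("The", "DET"), ("dog", "NOUN"), ("barks", "VERB"), ("loudly", "ADV")]
```

Once corpora in different languages are projected onto the same coarse tags in this way, they can be compared or used together for cross-lingual learning, which is the motivation the abstract describes.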

The workshop could run for half a day, starting with a lecture on the theme, followed by hands-on experience on a sample corpus: annotating (for beginners) and/or validating (for senior participants). A post-workshop panel discussion is likely to follow naturally, so an extra slot may be reserved for the purpose.

5. Description of the proposer(s):

1. Hima Bindu Maringanti, Professor of Computer Science with 22 years of experience in teaching at the UG/PG levels of Engineering and around 50 publications, presently researching AI and NLP, Psycho- and Neurolinguistics, Emotion Modeling and Cognitive Science. Recently visited the Computational Linguistics department of Uppsala University for a month as a visiting researcher, where the contact person was Prof. Joakim Nivre, and worked on annotating and validating the Cairo CICLing Corpus in some Indian languages.

My Brief Profile

• B.Tech from Osmania University, Hyderabad, in Electronics & Communication Engineering

• M.Tech from Indian School of Mines, Dhanbad, in Computer Science & Engineering

• Ph.D from IIIT, Allahabad, in the field of Emotional Intelligence

• A total of 20 years of teaching experience in various institutes of repute, including OEC, ITER, CVREC, IIITA, JAYPEE University, and presently Professor at NOU, Baripada.

• Was a research associate with BARC, Bombay at ISM, Dhanbad. Selected and nominated as Senior Scientist by NCST, Mumbai.

• Life member of ISTE, CSI and executive member of IEEE and ACM. Executive board member of International Institute of Information Science, Florida, USA. Editorial board member of JoC, Global Science and Technology Forum, Singapore. TPC (technical programme committee) member of a number of national/international conferences.

• Has around 50 research publications in various journals, ISBN serial numbers and conference proceedings.

• Successfully completed one AICTE-funded RPS project, “e-counseling: A psyche-monitoring system”, of 12.65 lakhs, at JIIT, Noida. Presently the Co-coordinator of the “Capacity Building of SC/ST students in IT Tools” project of DeitY (Department of Electronics and Information Technology), of 2 crores, at North Orissa University, Baripada, Odisha.

• Areas of interest and research include Heuristic algorithms, Natural Language Processing, Cognitive Science, Emotion Modeling and Affective computing.

• Hobbies include classical music, blog writing, women empowerment and child development, sports, student counseling, choreography and ballet designing.

2. Mojgan Seraji, Ph.D., is a faculty member at the Computational Linguistics department of Uppsala University, Sweden, whose research is on Persian treebanks and who is an active contributor to the UD project. Profile: http://uppsala.academia.edu/MojganSeraji

Technical Programme Committee

1. Prof. Sanghamitra Mohanty, [email protected], Odisha, India
2. Prof. Joakim Nivre, Computational Linguistics department, Uppsala University, Sweden
3. Prof. Sudeshna Sarkar, IIT, Kharagpur
4. Prof. Jörg Tiedemann, Computational Linguistics department, Uppsala University, Sweden
5. Dr. R.K. Balabantray, IIIT, Bhubaneswar
6. Dr. Divakar Yadav, Department of Computer Science, JIIT University, Noida
7. Dr. Deepak Garg, Thapar University
