
International Journal of Computer Engineering and Technology (IJCET), ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online), Volume 4, Issue 4, July-August (2013), © IAEME

ENGLISH TO PUNJABI MACHINE TRANSLATION SYSTEM USING HYBRID APPROACH OF WORD SENSE DISAMBIGUATION AND MACHINE TRANSLATION

Gurleen Kaur Sidhu (1), Navjot Kaur (2)

(1) Department of Computer Science and Engineering, Sri Guru Granth Sahib World University, Fatehgarh Sahib, Punjab 140406, India
(2) Department of Computer Science and Engineering, Punjabi University, Patiala, Punjab 140406, India

ABSTRACT

Machine Translation and Word Sense Disambiguation are among the most popular applications of Natural Language Processing. Machine Translation is an inexpensive way to understand another language during conversation, whereas Word Sense Disambiguation helps to find the correct meaning of a word in the context in which it is used. In our system we use a hybrid approach that disambiguates words in order to obtain better machine translation results. We found the Conditional Random Field algorithm with a decision list using direct mapping to be the easiest method, with the best results, for solving the disambiguation problem. In our system, the Conditional Random Field divides the data into categories and calculates the frequency of words with respect to each category. The meaning of the sentence is then related to the category with the maximum frequency in the sentence. The accuracy of our system for correct sentences is 81.2%, based on the tested sentences only.

Keywords: Conditional Random Field, Machine Translation, Natural Language, Word Sense Disambiguation, Hybrid Approach.

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET)
ISSN 0976 – 6367 (Print), ISSN 0976 – 6375 (Online)
Volume 4, Issue 4, July-August (2013), pp. 350-357
© IAEME: www.iaeme.com/ijcet.asp
Journal Impact Factor (2013): 6.1302 (Calculated by GISI), www.jifactor.com

I. INTRODUCTION

During automatic translation of sentences, words may be given an incorrect sense in the target text. The process of assigning the correct sense according to context is known as Word Sense Disambiguation. Many applications and online sites give the meaning of an input text, but they are not able to disambiguate the meanings. We try to solve this problem using a hybrid approach of word sense disambiguation and machine translation. Machine translation and word sense disambiguation are among the most popular applications of natural language processing. Natural language processing is the processing of data presented in natural language, such as the data available on the Internet in blogs, websites, social sites, and business sites. More information about the history and an overview of applications is given in Fig. 1.

Fig. 1. Introduction

The Literature Survey reviews the techniques previously used on different languages. The Methodology section explains the proposed technique, which combines various sub-techniques or algorithms of Word Sense Disambiguation and Machine Translation. Results and Discussion covers the advantages and disadvantages of the system. The Conclusion explains how beneficial the proposed system is and also discusses its accuracy. Future Work gives directions for further work in this field.

II. LITERATURE SURVEY

A review of English grammar is given in Fig. 2, which briefly introduces six parts of speech and their subtypes. The remaining two parts are the preposition and the article. Articles distinguish between vowels and consonants; 'a' is used to mark the singular.


Fig. 2. Review of Parts of Speech in English

Review of research papers on techniques:

[1] A hybrid (statistical + rules) transliteration system for person names: from a person name written in Punjabi (Gurmukhi script), the system produces its English (Roman script) transliteration. Experiments have shown that the performance is sufficiently high; the overall accuracy of the system is 95.23%. The reasons for wrong answers on named entities are multiple transliterations, wrong input of words, character gap, and the one-to-multi mapping problem.

[2] Natural language processing is a multidisciplinary field at the intersection of linguistics, psycholinguistics, computer science and engineering, machine learning, and statistics. The paper also gives reasons for the growing popularity of natural language processing: as business grows, more people move from one country to another and help counters are established everywhere, so natural language must be processed to convey the proper message.

[6] Machine translation translates the source text into the target text with or without human assistance. Machine translation has various approaches:
o Direct translation: the text is translated directly, word for word.
o Transfer-based translation: done with proper knowledge of the rules of the language into which we want to translate.
o Interlingua-based translation: an intermediate representation is used to convert into the target language.
o Corpus-based translation: uses a parallel corpus of source and target text.
o Hybrid translation: built by combining the approaches above.

[7] Nancy Ide (1998) describes the various applications in which word sense disambiguation can be used.

[11] A supervised learning method for word sense disambiguation based on cosine similarity. The researchers extract two sets of features, one of which is the set of words that occur frequently in the text. The cosine similarity algorithm uses the inner product of two vectors: after each context is converted to a vector of words, cosine similarity measures the similarity between a new context and each existing context in the training corpus.

[12] The researchers work on Shahmukhi to Gurmukhi transliteration and try to remove the ambiguity problem. Two different approaches are used for word sense disambiguation: state sequence representation as a Hidden Markov Model, and an N-gram model in which a small window of size -5+ is used. The accuracy of word sense disambiguation using both approaches is reported to be more than 92%.
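The context-similarity idea summarised for [11] can be illustrated with a minimal sketch. It is not the implementation from [11]: the bag-of-words vectors, the toy training contexts, and the names `to_vector`, `cosine_similarity`, and `closest_sense` are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def to_vector(context):
    """Turn a context (a string) into a bag-of-words count vector."""
    return Counter(context.lower().split())


def cosine_similarity(a, b):
    """Inner product of two count vectors divided by the product of their norms."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_sense(new_context, training):
    """Return the sense label of the training context most similar to new_context."""
    new_vec = to_vector(new_context)
    best = max(training, key=lambda item: cosine_similarity(new_vec, to_vector(item[0])))
    return best[1]


# Toy training corpus invented for this example: (context, sense label) pairs.
TRAINING = [
    ("he deposited money in the bank account", "financial"),
    ("they sat on the bank of the river", "river"),
]

print(closest_sense("the river bank was muddy", TRAINING))
```

With these toy contexts, the new context shares more words with the second training context, so the sketch selects the "river" sense. A real system would need a much larger training corpus and a more careful choice of features.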


III. METHODOLOGY

Fig 3. Flow Chart for Proposed System

Algorithm for the proposed English to Punjabi Machine Translation System:

Step 1: START. Input the text.

Step 2: Check whether the text is present.
   o If present, move to Step 3.
   o Otherwise, display the message "please enter the text first".

Step 3: ANALYSIS of the sentence
   o TOKENIZATION: split the sentence on white space and count the words.
     Repeat the next two steps for every token.
   o PREPROCESSING (divided into two subparts)
      - Text normalization (optional): apply the proposed algorithm for American to British English.
      - Sentence differentiation: apply rules to check whether the sentence is simple or compound.
   o PART OF SPEECH TAGGING (implemented by direct mapping)
   After analysing the sentence, move on to Step 4.

Step 4: SYNTHESIS of the sentence
   o DIRECT MAPPING (WORD + POS)
      - If the entry is present, fetch the meaning and move on to REORDER.
      - Otherwise, apply the HYBRID APPROACH FOR WSD to the sentence:
         . If (WORD + POS) has multiple categories, increase the counter of each of those categories (repeat this step for all tokens).
         . Find the category that contains the ambiguous word and has the maximum frequency, and assign that meaning to the ambiguous word.
         . Fetch the meaning and move on to REORDER.
   o REORDER according to the target text.

Step 5: TRANSLATION ENGINE
   o OUTPUT: after reordering, combine the words into a sentence and display it.

Step 6: END.
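The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the part-of-speech table `POS_TABLE`, the dictionary `DIRECT_MAP` with its placeholder glosses, and the function names are all hypothetical, and preprocessing, disambiguation, and reordering are left out.

```python
def tokenize(sentence):
    """Step 3: split the sentence on white space."""
    return sentence.split()


# Hypothetical part-of-speech table used for direct-mapping POS tagging.
POS_TABLE = {"boy": "NOUN", "drinks": "VERB", "water": "NOUN", "the": "ARTICLE"}

# Hypothetical (word, POS) -> target-language gloss table. The glosses are
# placeholders, not real Punjabi output.
DIRECT_MAP = {
    ("boy", "NOUN"): "PA(boy)",
    ("drinks", "VERB"): "PA(drinks)",
    ("water", "NOUN"): "PA(water)",
}


def translate(sentence):
    """Steps 2 to 5, without preprocessing, WSD, or reordering."""
    # Step 2: check that some text was entered.
    if not sentence.strip():
        return "please enter the text first"
    # Step 3: tokenization and POS tagging by table lookup.
    tokens = tokenize(sentence.lower())
    tagged = [(token, POS_TABLE.get(token, "UNKNOWN")) for token in tokens]
    # Step 4: direct mapping on (word, POS). Tokens with no entry are simply
    # skipped in this sketch.
    meanings = [DIRECT_MAP[pair] for pair in tagged if pair in DIRECT_MAP]
    # Step 5: combine the words into an output sentence.
    return " ".join(meanings)


print(translate("The boy drinks water"))
print(translate("   "))
```

The first call prints the three placeholder glosses in source order, and the second prints the prompt from Step 2. Keying the table on the (word, POS) pair rather than on the word alone mirrors the "WORD + POS" lookup in Step 4.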

IV. RESULTS AND DISCUSSION

• First case: the general case is explained with two main examples, shown in the figures below and discussed according to their results. Here a simple sentence that is correct in format is entered as input, and our system shows better output than the previous one.

Fig.4: Correct and incorrect Sentence with discussion

• Random words used in a sentence: the system gives their meanings if they are present in the database but does not generate a sentence.

Fig. 5 shows the error given by our system when the input sentence is incorrectly formed. The system checks whether the sentence formation is incorrect and, if so, gives the message "try again".

Fig. 5: System gives Error


Fig.7 : Lack of Word Sense disambiguation

Fig.8: Remove ambiguity of Words

Our system uses the Conditional Random Field to remove the ambiguity of words. In the figure above, the input sentence is 'we visited the bank and that was situated at the bank'. The word 'bank' is ambiguous here. First we check for a conjunction, so that the meanings of the words are fetched separately for each sub-part. In the first sub-part there is no specific category, so the sentence relates to the general category and we fetch the meaning that is most commonly used, which is the financial bank. Then we resolve the second part, which contains the word 'situated', belonging to the geography category. We fetch both meanings of 'bank', but we apply the condition that the meaning whose category has the maximum frequency in the sentence is used. So we use the meaning of 'bank' related to the geography category for the second part. Then we reorder the sentence with respect to the POS and generate the target sentence, as displayed in Fig. 8.
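The handling of 'bank' described above can be sketched as follows. This is a simplified illustration, not the authors' code and not a trained Conditional Random Field: the category lexicon `CATEGORY_WORDS`, the sense labels in `SENSES`, the fallback in `DEFAULT_SENSE`, and the function names are assumptions made for the example.

```python
from collections import Counter

# Hypothetical lexicon: the category each context word counts towards.
CATEGORY_WORDS = {
    "situated": "geography",
    "river": "geography",
    "money": "finance",
    "loan": "finance",
}

# Hypothetical senses of the ambiguous word, keyed by category.
SENSES = {"bank": {"finance": "financial bank", "geography": "river bank"}}

# Sense used when no category is indicated (the most commonly used meaning).
DEFAULT_SENSE = {"bank": "financial bank"}


def split_clauses(sentence, conjunctions=("and", "but", "or")):
    """Split a compound sentence into sub-parts at conjunction words."""
    clauses, current = [], []
    for token in sentence.lower().split():
        if token in conjunctions:
            if current:
                clauses.append(current)
            current = []
        else:
            current.append(token)
    if current:
        clauses.append(current)
    return clauses


def disambiguate(word, clause):
    """Pick the sense whose category has the highest count in the clause."""
    counts = Counter(CATEGORY_WORDS[t] for t in clause if t in CATEGORY_WORDS)
    candidates = {c: n for c, n in counts.items() if c in SENSES[word]}
    if not candidates:
        return DEFAULT_SENSE[word]
    best_category = max(candidates, key=candidates.get)
    return SENSES[word][best_category]


sentence = "we visited the bank and that was situated at the bank"
for clause in split_clauses(sentence):
    if "bank" in clause:
        print(" ".join(clause), "->", disambiguate("bank", clause))
```

For the example sentence, the first sub-part has no category words and falls back to the general (financial) sense, while 'situated' in the second sub-part selects the geography sense.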

Causes of inaccurate results: character gap, wrong input, and words not present in the database.


V. CONCLUSION

We conclude that Machine Translation and Word Sense Disambiguation are among the most popular applications of Natural Language Processing. Machine Translation is an inexpensive way to understand another language during conversation, whereas Word Sense Disambiguation helps to find the correct meaning of a word in the context in which it is used. From the Literature Survey, we learned the basic structure and the various sub-parts of the parts of speech of both languages, English and Punjabi, as well as the techniques previously implemented by different researchers. In our system we use a hybrid approach that disambiguates words in order to obtain better machine translation results. We found the Conditional Random Field algorithm with a decision list using direct mapping to be the easiest method, with the best results, for solving the disambiguation problem. The accuracy of our system is given below:

Fig.9: Accuracy table for testing the system

VI. FUTURE WORK

• More techniques can be combined with this system for greater accuracy.

• More data can be used.

• Categories can be further classified into sub-parts.

• Parts of speech can be explored further with sub-categories.

VII. ACKNOWLEDGEMENTS

As part of my course, I took up "English to Punjabi Machine Translation System using Hybrid Approach of Word Sense Disambiguation and Machine Translation" as my thesis topic. I am very thankful to Mrs. Navjot Kaur, Assistant Professor, Punjabi University, Patiala, for her valuable support in my work. She provided all the relevant material I needed to complete my thesis and gave help and time whenever asked. Last but not least, a word of thanks to the authors of all the books and papers that I consulted during my thesis work and while preparing the report. Finally, thanks to the Almighty for not letting me down in times of crisis and for showing me the silver lining in the dark clouds.


VIII. REFERENCES

JOURNAL

[1]. Kamal Deep, Dr. Vishal Goyal, Hybrid Approach for Punjabi to English Transliteration System, International Journal of Computer Applications (0975 – 8887), Volume 28, No. 1, August 2011.

[2]. Fabio Ciravegna, Recent Advances in Natural Language Processing, IEEE Computer Society, 2003.

[4]. J. Hutchins, An Introduction to Machine Translation, Academic Press, 1992.

[7]. Nancy Ide, Jean Veronis, Introduction to the Special Issue on Word Sense Disambiguation: The State of the Art, 1998.

[8]. Pushpak Bhattacharyya, CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 25 – Knowledge Based and Supervised WSD), IIT Bombay, 6th March, 2012, p. 24.

[9]. Pushpak Bhattacharyya, CS460/626: Natural Language Processing/Speech, NLP and the Web (Lecture 25 – Knowledge Based and Supervised WSD), IIT Bombay, 6th March, 2012, p. 35.

[10]. Durgesh D Rao, Machine Translation, pp. 61-70, July 1998.

[13]. Kamaljeet Kaur Batra, G S Lehal, Rule Based Machine Translation of Noun Phrases from Punjabi to English, IJCSI International Journal of Computer Science Issues, Vol. 7, Issue 5, September 2010.

[14]. P. Tamilselvi, S. K. Srivatsa, Case Based Word Sense Disambiguation Using Optimal Features, 2011 International Conference on Information Communication and Management, IPCSIT vol. 16, Singapore, 2011.

BOOKS

[15]. Wren & Martin, English Grammar and Composition, S. Chand Publication.

THESIS

[6]. R. Harshawardhan, Rule Based Machine Translation System for English to Malayalam Language, Centre for Excellence in Computational Engineering and Networking, December 2011.

[28]. Kamal Deep, Dr. Vishal Goyal, Hybrid Approach for Punjabi to English Transliteration System, Punjabi University, Patiala, September 2011.

PROCEEDING PAPER

[3]. Available: http://en.wikipedia.org/wiki/Natural_language_processing

[11]. M. Nameh, S. M. Fakhrahmad, M. Zolghadri Jahromi, A New Approach to Word Sense Disambiguation Based on Context Similarity, Proceedings of the World Congress on Engineering 2011, Vol. I, pp. 456-459.

[12]. Tejinder Singh Saini, Gurpreet Singh Lehal, Word Disambiguation in Shahmukhi to Gurmukhi Transliteration, Proceedings of the 9th Workshop on Asian Language Resources, Chiang Mai, Thailand, November 12 and 13, 2011, pp. 79–87.

[26]. Available at: http://en.wikipedia.org/wiki/Machine_translation

[27]. Available at: http://en.wikipedia.org/wiki/Word-sense_disambiguation