Post on 27-Dec-2015

Automatic Translation of Nominal Compound into Hindi

Prashant Mathur

IIIT Hyderabad

Soma Paul

IIIT Hyderabad

OUTLINE

What is a Nominal Compound (NC)?
Translation variation of English NC into Hindi
Motivation
Approach
Results
Future Work
Bibliography

2Prashant Mathur

Nominal Compound

A construct of two or more nouns, in which the rightmost noun is the head and the preceding nouns are its modifiers.

Oil pump: a device used to pump oil

Customer satisfaction indices: indices that indicate the satisfaction rate of customers

Two-word nominal compounds are the object of study here.

Frequency of NCs in English corpora (Baldwin et al. 2004)

Corpus     Words    NC frequency
BNC        84M      2.6%
Reuters    108M     3.9%

Variation in translating English NC into Hindi

As nominal compound: ‘Hindu texts’ hindU SastroM, ‘milk production’ dugdha utpAdana

As genitive construction: ‘rice husk’ cAval kI bhUsI, ‘room temperature’ kamare ka tApamAna

As one word: ‘cow dung’ gobar

As adjective-noun construction: ‘nature cure’ prAkratik cikitsA, ‘hill camel’ pahARI UMTa

As other syntactic phrase: ‘wax work’ mom par kalAkArI (‘work on wax’), ‘body pain’ SarIr meM dard (‘pain in body’)

Others: ‘hand luggage’ haat meM le jaaye jaane vaale saamaan

Motivation

Issues in translation:
1. Choice of the appropriate target lexeme during lexical substitution
2. Selection of the right target construct type

NCs as a class occur frequently in a corpus, but each individual compound occurs only a few times.

NCs are too varied to be precompiled into an exhaustive list of translated candidates.

Therefore …

NCs have to be handled on the fly, which makes translating NCs from English into Hindi a challenging NLP task.

With Google Translate

Tested on the same dataset that was used to evaluate our system:

Translation formation       Precision
Overall                     45%
Eng NC → Hindi NC           29%
Eng NC → Hindi genitive     10%
Others                      6%

Approach

1. Translation template generation
2. Extraction of NCs from the English corpus
3. Sense disambiguation of components
4. Lexical substitution of the component nouns using a bilingual dictionary
5. Preparing translation candidates
6. Corpus search of translation candidates and their ranking

Translation Template Generation

We surveyed 50,000 sentences of parallel corpora and found the following construction types.

Construction type                  No. of occurrences   Percentage
Nominal compound                   3959                 42.9%
Genitive                           1976                 21.4%
Long phrases                       581                  6.284%
Adjective-noun phrase              557                  6.024%
Single word                        766                  8.285%
Transliterated nominal compound    1208                 13.065%
None                               199                  2.152%

Some Templates

A total of 44 templates were formed; some of them are shown below.

Nominal compound:  H1 H2
Genitive:          H1 kA H2, H1 ke H2, H1 kI H2
Long phrases:      H1 pe H2, H1 meM H2, H1 par H2, H1 ke xvArA H2, H1 se prApwa H2
Adjective:         H1-ikA H2
Single word:       H1

Extraction

1. Tree-Tagger is a POS tagger that also outputs the lemma of each word.

Word    Tree-Tagger output (word_POS tag_lemma)
rods    rods_NNS_rod

2. As stated earlier, we consider only noun-noun formations as nominal compounds.
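The extraction step can be illustrated with a minimal sketch. It assumes the tagger output has already been joined into `word_POS_lemma` strings as in the `rods_NNS_rod` example, that common nouns carry tags starting with `NN`, and that lemmas rather than surface forms are collected; the function name `extract_noun_noun` is hypothetical and not taken from the slides.

```python
def extract_noun_noun(tagged_tokens):
    """Return (modifier, head) lemma pairs for runs of exactly two adjacent nouns."""
    parsed = []
    for tok in tagged_tokens:
        # rsplit keeps any underscores inside the word itself intact
        word, pos, lemma = tok.rsplit("_", 2)
        parsed.append((word, pos, lemma))

    def is_noun(i):
        # Assumption: common-noun tags start with "NN" (NN, NNS)
        return 0 <= i < len(parsed) and parsed[i][1].startswith("NN")

    pairs = []
    for i in range(len(parsed) - 1):
        # Keep only two-word compounds: no noun directly before or after the pair
        if is_noun(i) and is_noun(i + 1) and not is_noun(i - 1) and not is_noun(i + 2):
            pairs.append((parsed[i][2], parsed[i + 1][2]))
    return pairs


tagged = [
    "Campaigns_NNS_campaign", "for_IN_for", "road_NN_road",
    "safety_NN_safety", "are_VBP_be", "organized_VVN_organize",
]
print(extract_noun_noun(tagged))
```

Longer noun sequences such as "customer satisfaction indices" are skipped by this sketch, in line with the restriction to two-word compounds.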

Step 3: Sense Disambiguation of Components

Purpose: to reduce the number of translation candidates.

Example: "Campaigns for road safety are organized to keep everyone safer on the Indian roads"

Noun component   No. of WN senses   Sense selected   Synset
Road             2                  #1               <road, route>
Safety           6                  #2               <safety, refuge>

Tool: WordNet::SenseRelate by Ted Pedersen. It achieves 80% accuracy on NC disambiguation.

Lexical Substitution

How do we now translate the selected senses into Hindi? There is no direct WordNet mapping from English to Hindi, so we use an alternative method.

Step 4: Lexical Substitution

Acquire all possible translations for all the words within a synset.

Road:    path, maarg, saDak, raastaa
Route:   maarg, saDak, raastaa
Safety:  ahAnikArakatA, suraksita sthAna, suraksA, salAmatI, suraksA sAdhana
Refuge:  ASraya sthAna, ASraya, sahArA, SaraNa, CipanA

Step 4: Lexical Substitution (contd.)

Select the Hindi words that are common translations of all English words in a synset, if there are any.

<road, route>: the selected words are maarg, saDak, raastaa.

<safety, refuge>: the two words share no translation, so all words are selected.
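The selection rule for Step 4 can be sketched as follows. This is only an illustration under the assumption that the dictionary lookups are available as plain lists; the function name `select_translations` and the dictionary layout are hypothetical.

```python
def select_translations(synset_translations):
    """Pick Hindi translations shared by every English word of a synset.

    synset_translations maps each English word of the synset to its list of
    Hindi translations. If no translation is shared by all words, every
    translation is kept (in first-seen order).
    """
    lists = list(synset_translations.values())
    common = [w for w in lists[0] if all(w in other for other in lists[1:])]
    if common:
        return common
    merged = []
    for words in lists:
        for w in words:
            if w not in merged:
                merged.append(w)
    return merged


road_synset = {
    "road": ["path", "maarg", "saDak", "raastaa"],
    "route": ["maarg", "saDak", "raastaa"],
}
safety_synset = {
    "safety": ["ahAnikArakatA", "suraksita sthAna", "suraksA", "salAmatI", "suraksA sAdhana"],
    "refuge": ["ASraya sthAna", "ASraya", "sahArA", "SaraNa", "CipanA"],
}
print(select_translations(road_synset))
print(len(select_translations(safety_synset)))
```

For the <road, route> synset the intersection is non-empty, so only the shared words survive; for <safety, refuge> the fallback keeps all ten translations.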

Step 5: Preparing Translation Candidates

For "road safety", the candidates generated from the templates are:

mArga para surakRA,
mArga surakRA,
saDak para surakRA,
saDak kI surakRA
...
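Candidate preparation amounts to filling every template with every pair of component translations. The sketch below assumes templates are written as whitespace-separated tokens with `H1` and `H2` as slots, and uses only a small subset of the 44 templates; the names `TEMPLATES` and `make_candidates` are hypothetical.

```python
from itertools import product

# A small illustrative subset of the templates, not the full set of 44
TEMPLATES = ["H1 H2", "H1 kA H2", "H1 ke H2", "H1 kI H2", "H1 par H2", "H1 meM H2"]


def make_candidates(h1_words, h2_words, templates=TEMPLATES):
    """Return (candidate string, template) pairs for all fillings of the slots."""
    candidates = []
    for template, h1, h2 in product(templates, h1_words, h2_words):
        # Replace the slots token by token so only whole "H1"/"H2" tokens are filled
        tokens = [h1 if t == "H1" else h2 if t == "H2" else t for t in template.split()]
        candidates.append((" ".join(tokens), template))
    return candidates


candidates = make_candidates(["mArga", "saDak"], ["surakRA"])
for text, template in candidates[:4]:
    print(text, "<-", template)
```

Keeping the template next to each candidate is a design choice of this sketch: the ranking step needs to know which template produced a candidate.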

Step 6: Corpus Search

Hindi corpus (raw): 28 million words
Indexed
Search: pattern match

Example

election time: cunAva ke samaya
temple community: maMxira kA samAja
marriage customs: vivAha kI praWA

But we did not find any translation for:

road safety: Ф
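A pattern-match search with count-based ordering can be sketched as below. It assumes the corpus is a whitespace-tokenised string in the same transliteration as the candidates; the toy corpus uses placeholder tokens (`x`, `y`, `z`) and is made up for illustration, and the function names are hypothetical. The slides mention an indexed corpus; this sketch scans the text directly instead.

```python
import re
from collections import Counter


def count_candidates(corpus_text, candidates):
    """Count whole-token occurrences of each candidate string in the corpus."""
    counts = Counter()
    for cand in candidates:
        # (?<!\S) and (?!\S) require whitespace or a text boundary around the match
        pattern = r"(?<!\S)" + re.escape(cand) + r"(?!\S)"
        counts[cand] = len(re.findall(pattern, corpus_text))
    return counts


def rank_by_count(counts):
    """Order the candidates that were found, most frequent first."""
    return [cand for cand, n in counts.most_common() if n > 0]


toy_corpus = "x cunAva ke samaya y cunAva ke samaya z cunAva kA samaya"
found = rank_by_count(count_candidates(
    toy_corpus, ["cunAva samaya", "cunAva kA samaya", "cunAva ke samaya"]))
missing = rank_by_count(count_candidates(
    toy_corpus, ["mArga surakRA", "mArga kI surakRA"]))
print(found)
print(missing)
```

An empty result, as for the "road safety" candidates here, corresponds to the Ф case in the example.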

CTQ (Corpus-based Translation Quality)

CTQ rates a given translation candidate on both
1. the fully specified translation, and
2. its parts, in the context of the translation template in question.

\[
\mathrm{CTQ}(w_1^H, w_2^H, t) = \alpha\, P(w_1^H, w_2^H, t) + \beta\, P(w_1^H, t)\, P(w_2^H, t)\, P(t)
\]

Here $t$ is the translation template used, and $w_1^H$, $w_2^H$ are the translations of the components of the NC.

We set $\alpha = 1$, $\beta = 0$ if $P(w_1^H, w_2^H, t) > 0$ (we did not experiment with varying the constants $\alpha$ and $\beta$).

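CTQ (contd.)

Example: for "road safety", $P(w_1^H, w_2^H, t) = 0$, so the score is computed from the parts.

road: mArga, mArga ke, mArga meM, saDaka, saDaka par, …
safety: surakRA, ke surakRA, meM surakRA, … and so on

\[
P(\text{mArga}, \text{meM}) \cdot P(\text{meM}, \text{surakRA}) \cdot P(\text{meM}) = (2.28 \times 10^{-5}) (9.14 \times 10^{-6}) (0.286) \approx 6 \times 10^{-11}
\]

\[
P(\text{mArga}, \text{kI}) \cdot P(\text{kI}, \text{surakRA}) \cdot P(\text{kI}) = (1.35 \times 10^{-5}) (3.82857143 \times 10^{-5}) (0.228) \approx 1.18 \times 10^{-10}
\]

Higher probability for "mArga kI surakRA".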
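The CTQ arithmetic for the "road safety" example can be reproduced with a short sketch. The slides state only that α = 1 and β = 0 when the full candidate is attested; the back-off weight used when it is not attested is an assumption here (set to 1 via the hypothetical parameter `beta_backoff`), as is the function name `ctq`.

```python
def ctq(p_full, p_w1_t, p_w2_t, p_t, beta_backoff=1.0):
    """Corpus-based Translation Quality score for one candidate.

    p_full : P(w1, w2, t), probability of the fully specified translation
    p_w1_t : P(w1, t), first component in the context of template t
    p_w2_t : P(w2, t), second component in the context of template t
    p_t    : P(t), probability of the template itself
    """
    if p_full > 0:
        alpha, beta = 1.0, 0.0            # stated in the slides
    else:
        alpha, beta = 1.0, beta_backoff   # assumption: back off to the parts
    return alpha * p_full + beta * p_w1_t * p_w2_t * p_t


# Probabilities taken from the worked example for "road safety"
score_meM = ctq(0.0, 2.28e-5, 9.14e-6, 0.286)
score_kI = ctq(0.0, 1.35e-5, 3.82857143e-5, 0.228)
best = "mArga kI surakRA" if score_kI > score_meM else "mArga meM surakRA"
print(score_meM, score_kI, best)
```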

Ranking

Baseline ranking: count-based ranking

A stronger ranking measure: CTQ (borrowed from Baldwin and Tanaka (2004))

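Results

[Bar chart: recall and precision (%) of the baseline and CTQ rankings for three settings: Dictionary, 1st Sense + Dict, and WSD + Dict. Series: Baseline Recall, Baseline Precision, CTQ Recall, CTQ Precision. Data labels in extraction order: 14, 50, 24, 46.1, 24.6, 53.6, 19, 56.2, 28, 54.1, 28.5, 62.1.]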

Results (contd.)

Measure taken to improve recall: use the genitive as the default construct when no translation is found for an NC.

Motivation: we conducted an experiment on development data to verify whether the NCs for which no translation was found during corpus search can legitimately be translated as a genitive construct.

We found that the heuristic works in 59% of cases.
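The genitive default can be sketched as a simple fallback around the ranked candidate list. How the system chooses among the genitive markers kA, ke and kI is not described in the slides, so the marker is passed in as a parameter here; the function name `translate_nc` and its signature are hypothetical.

```python
def translate_nc(ranked_candidates, h1, h2, genitive_marker="kA"):
    """Return the top-ranked candidate, or a genitive construct if none was found.

    genitive_marker is an assumption of this sketch: the caller decides
    which of kA / ke / kI is appropriate for the head noun.
    """
    if ranked_candidates:
        return ranked_candidates[0]
    return f"{h1} {genitive_marker} {h2}"


print(translate_nc(["cunAva ke samaya"], "cunAva", "samaya"))
print(translate_nc([], "mArga", "surakRA", genitive_marker="kI"))
```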

Results

[Bar chart: recall and precision (%). Data labels in extraction order: 24.8, 54, 44.5, 57.]

Using the genitive as the default construct where the system fails to produce a translation.

Related works

Similar approaches (searching for translation templates in a corpus) were adopted in:

Bungum and Oepen (2009) for Norwegian-to-English nominal compound translation
Tanaka and Baldwin (2004) for English-to-Japanese nominal compound translation and vice versa

Conclusion

Novelty of our approach: using a WSD tool on the source language to select the correct sense of the nominal components.

The result: the number of possible translation candidates to be searched in the target-language corpus is significantly reduced.

Future Work

Multinary NC translation
Using semantic features provided in UW-Dictionary
Varying α and β in the ranking technique to produce more effective results

Bibliography

Baldwin, Timothy and Takaaki Tanaka. Translation by Machine of Complex Nominals: Getting it Right.

Tanaka, Takaaki and Timothy Baldwin. Translation Selection for Japanese-English Noun-Noun Compounds.

Rackow, Ido Dagan and Ulrike Schwall. Automatic Translation of Noun Compounds.

Bungum and Oepen. Norwegian to English Nominal Compound Translation.
