parsing aligned parallel corpus by projecting syntactic relations from annotated source corpus

8
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 301–308, Sydney, July 2006. c 2006 Association for Computational Linguistics Parsing Aligned Parallel Corpus by Projecting Syntactic Relations from Annotated Source Corpus Shailly Goyal Niladri Chatterjee Department of Mathematics Indian Institute of Technology Delhi Hauz Khas, New Delhi - 110 016, India {shailly goyal, niladri iitd}@yahoo.com Abstract Example-based parsing has already been proposed in literature. In particular, at- tempts are being made to develop tech- niques for language pairs where the source and target languages are different, e.g. Direct Projection Algorithm (Hwa et al., 2005). This enables one to develop parsed corpus for target languages having fewer linguistic tools with the help of a resource- rich source language. The DPA algo- rithm works on the assumption of Di- rect Correspondence which simply means that the relation between two words of the source language sentence can be pro- jected directly between the correspond- ing words of the parallel target language sentence. However, we find that this as- sumption does not hold good all the time. This leads to wrong parsed structure of the target language sentence. As a solution we propose an algorithm called pseudo DPA (pDPA) that can work even if Direct Correspondence assumption is not guaran- teed. The proposed algorithm works in a recursive manner by considering the em- bedded phrase structures from outermost level to the innermost. The present work discusses the pDPA algorithm, and illus- trates it with respect to English-Hindi lan- guage pair. Link Grammar based pars- ing has been considered as the underlying parsing scheme for this work. 1 Introduction Example-based approaches for developing parsers have already been proposed in literature. These approaches either use examples from the same lan- guage, e.g., (Bod et al., 2003; Streiter, 2002), or they try to imitate the parse of a given sentence using the parse of the corresponding sentence in some other language (Hwa et al., 2005; Yarowsky and Ngai, 2001). In particular, Hwa et al. (2005) have proposed a scheme called direct projection algorithm (DPA) which assumes that the relation between two words in the source language sen- tence is preserved across the corresponding words in the parallel target language. This is called Di- rect Correspondence Assumption (DCA). However, with respect to Indian languages we observed that the DCA does not hold good all the time. In order to overcome the difficulty, in this work, we propose an algorithm based on a vari- ation of the DCA, which we call pseudo Direct Correspondence Assumption (pDCA). Through pDCA the syntactic knowledge can be transferred even if not all syntactic relations may be projected directly from the source language to the target lan- guage in toto. Further, the proposed algorithm projects the relations between phrases instead of projecting relations between words. Keeping in line with (Hwa et al., 2005), we call this algorithm as pseudo Direct Projection Algorithm (pDPA). The present work discusses the proposed pars- ing scheme for a new (target) language with the help of a parser that is already available for a language (source) and using word-aligned paral- lel corpus of the two languages under considera- tion. We propose that the syntactic relationships between the chunks of the input sentence T (of the target language) are given depending upon the relationships of the corresponding chunks in the translation S of T . Along with the parsed struc- ture of the input, the system also outputs the con- stituent structure (phrases) of the given input sen- 301

Upload: iitd

Post on 10-Dec-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions, pages 301–308,Sydney, July 2006.c©2006 Association for Computational Linguistics

Parsing Aligned Parallel Corpus by Projecting Syntactic Relations fromAnnotated Source Corpus

Shailly Goyal Niladri ChatterjeeDepartment of Mathematics

Indian Institute of Technology DelhiHauz Khas, New Delhi - 110 016, India

{shailly goyal, niladri iitd}@yahoo.com

Abstract

Example-based parsing has already beenproposed in literature. In particular, at-tempts are being made to develop tech-niques for language pairs where the sourceand target languages are different, e.g.Direct Projection Algorithm (Hwa et al.,2005). This enables one to develop parsedcorpus for target languages having fewerlinguistic tools with the help of a resource-rich source language. The DPA algo-rithm works on the assumption of Di-rect Correspondence which simply meansthat the relation between two words ofthe source language sentence can be pro-jected directly between the correspond-ing words of the parallel target languagesentence. However, we find that this as-sumption does not hold good all the time.This leads to wrong parsed structure of thetarget language sentence. As a solutionwe propose an algorithm called pseudoDPA (pDPA) that can work even if DirectCorrespondence assumption is not guaran-teed. The proposed algorithm works in arecursive manner by considering the em-bedded phrase structures from outermostlevel to the innermost. The present workdiscusses the pDPA algorithm, and illus-trates it with respect to English-Hindi lan-guage pair. Link Grammar based pars-ing has been considered as the underlyingparsing scheme for this work.

1 Introduction

Example-based approaches for developing parsershave already been proposed in literature. These

approaches either use examples from the same lan-guage, e.g., (Bod et al., 2003; Streiter, 2002), orthey try to imitate the parse of a given sentenceusing the parse of the corresponding sentence insome other language (Hwa et al., 2005; Yarowskyand Ngai, 2001). In particular, Hwa et al. (2005)have proposed a scheme called direct projectionalgorithm (DPA) which assumes that the relationbetween two words in the source language sen-tence is preserved across the corresponding wordsin the parallel target language. This is called Di-rect Correspondence Assumption (DCA).

However, with respect to Indian languages weobserved that the DCA does not hold good all thetime. In order to overcome the difficulty, in thiswork, we propose an algorithm based on a vari-ation of the DCA, which we call pseudo DirectCorrespondence Assumption (pDCA). ThroughpDCA the syntactic knowledge can be transferredeven if not all syntactic relations may be projecteddirectly from the source language to the target lan-guage in toto. Further, the proposed algorithmprojects the relations between phrases instead ofprojecting relations between words. Keeping inline with (Hwa et al., 2005), we call this algorithmas pseudo Direct Projection Algorithm (pDPA).

The present work discusses the proposed pars-ing scheme for a new (target) language with thehelp of a parser that is already available for alanguage (source) and using word-aligned paral-lel corpus of the two languages under considera-tion. We propose that the syntactic relationshipsbetween the chunks of the input sentence T (ofthe target language) are given depending upon therelationships of the corresponding chunks in thetranslation S of T . Along with the parsed struc-ture of the input, the system also outputs the con-stituent structure (phrases) of the given input sen-

301

tence.In this work, we first discuss the proposed

scheme in a general framework. We illustrate thescheme with respect to parsing of Hindi sentencesusing the Link Grammar (LG) based parser for En-glish and the experimental results are discussed.Before that in the following section we discussLink Grammar briefly.

2 Link Grammar and Phrases

Link grammar (LG) is a theory of syntax whichbuilds simple relations between pairs of words,rather than constructing constituents in tree-likehierarchy. For example, in an SVO language likeEnglish, the verb forms a subject link (S-) to someword on its left, and an object link (O+) with someword on its right. Nouns make the subject link(S+) to some word (verb) on its right, or objectlink (O-) to some word on its left.

The English Link Grammar Parser (Sleator andTemperley, 1991) is a syntactic parser of Englishbased on LG. Given a sentence, the system as-signs to it a syntactic structure, which consists ofa set of labeled links connecting pairs of words.The parser also produces a “constituent” represen-tation of a sentence (showing noun phrases, verbphrases, etc.). It is a dictionary-based system inwhich each word in the dictionary is associatedwith a set of links. Most of the links have someassociated suffixes to provide various information(e.g., gender (m/f), number (s/p)), describingsome properties of the underlying word. The En-glish link parser lists total of 107 links. Table1 gives a list of some important links of EnglishLG along with the information about the words ontheir left/right and some suffixes.

Link Word in Left Word in Right SuffixesA Premodifier Noun -D Determiners Nouns s/m,c/uJ Preposition Object of the prepo-

sitions/p

M Noun Post-nominal Modi-fier

p/v/g/a

MV Verbs/adjectives Modifying phrase p/a/i/l/x

O Transitive verb Direct or indirect ob-ject

s/p

P Forms of “be” Complement of “be” p/v/g/aPP Forms of “have” Past participle -S Subject Finite verb s/p, i, g

Table 1: Some English Links and Their Suffixes

As an example, consider the syntactic struc-

ture and constituent representation of the sentencegiven below.

+--------Ss--------+| +----Jp---+ |

+--Ds-+-Mp-+ +-Dmc-+ +-Pa-+| | | | | | |

the teacher of the boys is good

(S (NP (NP The teacher)(PP of (NP the boys)))

(VP is)(ADJP good).)

It may be noted that in the phrase structure ofthe above sentence, verb phrase as obtained fromthe phrase parser has been modified to some ex-tent. The algorithm discussed in this work as-sumes verb phrases as the main verb along withall the auxiliary verbs.

For ease of presentation and understanding, weclassify phrase relations as Inter-Phrase and Intra-phrase relations. Since the phrases are often em-bedded, different levels of phrase relations are ob-tained. From the outermost level to the innermost,we call them as “first level”, “second level” of re-lations and so on. One should note that an ith levelIntra-phrase relation may become Inter-phrase re-lation at a higher level.

As an example, consider the parsing and phrasestructure of the English sentence given above.In the first level the Inter-phrase relations (cor-responding to the phrases “the teacher ofthe boys”, “is” and “good”) are Ss and Paand the remaining links are Intra-phrase relations.In the second level the only Inter-phrase rela-tionship is Mp (connecting “the teacher” and“the boys”), and the Intra-phrase relations areDs, Jp and Dmc. In third and the last level, Jp isthe Inter-phrase relationship and Dmc is the Intra-phrase relation (corresponding to “of” and “theboys”).

The algorithm proposed in Section 4 usespDCA to first establish the relations of the tar-get language corresponding to the first-level Inter-phrase relations of the source language sentence.Then recursively it assigns the relations corre-sponding to the inner level relations.

3 DCA vis-a-vis pDCA

Direct Correspondence Assumption (DCA) statesthat the relation between words in source languagesentence can be projected as the relations betweencorresponding words in the (literal) translation inthe target language. Direct Projection Algorithm

302

(DPA), which is based on DCA, is a straightfor-ward projection procedure in which the dependen-cies in an English sentence are projected to thesentence’s translation, using the word-level align-ments as a bridge. DPA also uses some monolin-gual knowledge specific to the projected-to lan-guage. This knowledge is applied in the form ofPost-Projection transformation.

However with respect to many language pairssyntactic relationships between the words cannotalways be imitated to project a parse structurefrom source language to target language. For il-lustration consider the sentence given in Figure 1.We try to project the links from English to Hindiin Figure 1(a) and Hindi to Bangla in Figure 1(b).For Hindi sentence, links are given as discussed byGoyal and Chatterjee (2005a; 2005b).

(a)

(b)

Figure 1: Failure of DCA

We observe that in the parse structure of the tar-get language sentences, neither all relations arecorrect nor the parse tree is complete. Thus, weobserve that DPA leads to, if not wrong, a veryshallow parse structure. Further, Figure 1(b) sug-gests that DCA fails not only for languages be-longing to different families (English-Hindi), butalso for languages belonging to the same family(Hindi-Bangla).

Hence it is necessary that the parsing algo-rithm should be able to differentiate between thelinks which can be projected directly and thelinks which cannot. Further it needs to identifythe chunks of the target language sentence thatcannot be linked even after projecting the links

from the source language sentence. Thus we pro-pose pseudo Direct Correspondence Assumption(pDCA) where not all relations can be projecteddirectly. The projection algorithm needs to takecare of the following three categories of links:

Category 1: Relationship between two chunksin the source language can be projected to the tar-get language with minor or no changes (for ex-ample, subject-verb, object-verb relationships inthe above illustration). It may be noted that sinceexcept for some suffix differences (due to mor-phological variations), the relation is same in thesource and the target language.

Category 2: Relationship between two chunksin the source language can be projected to thetarget language with major changes. For ex-ample, in the English sentence given in Figure2(a), the relationship between the girl and inthe white dress is Mp, i.e. “nominal mod-ifier (preposition phrase)”. In the correspondingphrases ladkii and safed kapde waalii of Hindi,although the relationship is same, i.e., “nominalmodifier”, the type of nominal modifier is chang-ing to waalaa/waale/waalii-adjective. If the dis-tinction between the types of nominal modifiers isnot maintained, the parsing will be very shallow.Hence the modification in the link is necessary.

Category 3: Relationship between two chunksin the target language is either entirely differentor can not be captured from the relationship be-tween the corresponding chunk(s) in the sourcelanguage. For example, the relationship betweenthe main verb and the auxiliary verb of the Hindisentence in Figure 2(a) can not be defined us-ing the English parsing. Such phrases should beparsed independently.

The proposed algorithm is based on the above-described concept of pDCA which gives the parsestructure of the sentences given in Fig. 2.

While working with Indian languages, we foundthat outermost Inter-phrase relations usually be-long to Category 1, and remaining relations be-long to Category 2. Generally an innermost Intra-phrase relation (like verb phrase) belongs to Cate-gory 3. Thus, outermost Inter-phrase relations canusually be projected to target language directly, in-nermost Intra-phrase relations for the target lan-guage which are independent of the source lan-guage should be decided on the basis of languagespecific study and remaining relationship should

303

(a)

(b)

Figure 2: Parsing Using pDCA

be modified before projection from source to tar-get language.

4 The Proposed Algorithm

DPA (Hwa et al., 2005) discusses projection pro-cedure for five different cases of word align-ment of source-target language: one-to-one, one-to-none, one-to-many, many-to-one and many-to-many. As discussed earlier, DPA is not sufficientfor many cases. For example, in case of one-to-many alignment, the proposed solution is to firstcreate a new empty word that is set as head ofall multiply aligned words in target language sen-tence, and then the relation is projected accord-ingly. But, in such cases, relations between thesemultiply-aligned words can not be given, and thusthe resulting parsing becomes shallow. The pro-posed algorithm (pDPA) overcomes these short-comings as well.

The pDPA works in the following way. It re-cursively identifies the phrases of the target lan-guage sentence, and assigns the links betweenthe two phrases/words of the target language sen-tence by using the links between the correspond-ing phrases/words in the source language sen-tence. It may be noted that link between phrasesmeans link between the head words of the corre-sponding phrases. Assignment of links starts fromthe outermost level phrases. Syntactic relationsbetween the constituents of the target languagephrase(s) for which the syntactic structure doesnot correspond with the corresponding phrase(s)

in the target language are given independently. Alist of link rules is maintained which keeps the in-formation about modification(s) required in a linkwhile projecting from the source language to thetarget language. These rules are limited to closedcategory words, to parts of speech projected fromsource language, or to easily enumerated lexicalcategories.

Figure 3 describes the algorithm. The algorithmtakes an input sentence (T ) and the parsing and theconstituent structure of its parallel sentence (S).Further S and T are assumed to be word-aligned.Initially, S and T are passed to the module Project-From(), which identifies the constituent phrases ofS and the relations between them. Then each setof phrases and relations is passed to the moduleParseFrom(). ParseFrom() module takes as inputtwo source phrases/words, relation between them,and corresponding target phrases. It projects thecorresponding relations in the target language sen-tence T . ParseFromSpecial() module is requiredif the relation between phrases of source languagecan not be projected so directly to the target lan-guage. Module Parse() assigns links between theconstituent words of the target language phrases∈ P . Notations used in the algorithm are as fol-lows:

• By T ′ ∼ S′ we mean that T ′ is aligned withS′, T ′ and S′ being some text in the targetand source language, respectively.

• Given a language, the head of a phrase is usu-ally defined as the keyword of the phrase. Forexample, for a verb phrase, the head word isthe main verb.

• P is the exhaustive set of target languagephrases for which Intra-phrase relations areindependent of the corresponding source lan-guage phrases.

• Rule list R is the list of source-target lan-guage specific rules which specifies the mod-ifications in the source language relations tobe projected appropriately in the target lan-guage.

• Given the parse and constituent structure of atext S, Ψij = 〈Si, Sj , L〉, where L is the re-lation between the constituent phrases/wordsSi and Sj of S. Ψ′

ij = 〈Ti, Tj〉, Ti ∼ Si andTj ∼ Sj . Further, Φij = 〈Ψij ,Ψ′

ij〉.

304

ProjectFrom(S′, T ′): // S′ is a source// language sentence or phrase, T ′ ∼ S′

{IF T ′ ∈ PTHEN Parse(T ′);ELSE

S′ = {S1, S2, . . . , Sn}; // Sis are//constituent phrases/words of S′

T ′ = {T1, T2, . . . , Tn} // Ti ∼ Si

Find all Ψij = 〈Si, Sj , L〉 from S′ andcorresponding Ψ′

ij = {Ti, Tj} from T ′;Φij = 〈Ψij , Ψ′

ij〉For all i, j, push (S , Φij);While !empty(S )

Φ = pop(S );IF L /∈ LTHEN ParseFrom(Φ);ELSE ParseFromSpecial(Φ);

}Parse(T ′): // T ′ is a target language phrase{Assign links between constituent words of T ′

using target language specific rules;}ParseFrom(Φ): // Φ = 〈Ψ,Ψ′〉;

// Ψ = 〈S1, S2, L〉; Ψ′ = 〈T1, T2〉;{IF T1 6= {φ} & T2 6= {φ} THEN

Find head words t1 ∈ T1 and t2 ∈ T2;Assign relation L′ between t1 and t2; // L′

//is target language link corresponding//to L identified using R

IF T1 is a phrase and not already parsedTHEN ProjectFrom(S1, T1);IF T2 is a phrase and not already parsedTHEN ProjectFrom(S2, T2);}ParseFromSpecial(Φ): // Φ = 〈Ψ,Ψ′〉;

// Ψ = 〈S1, S2, L〉; Ψ′ = 〈T1, T2〉;{Use target language specific rules to identify ifthe relation between T1 and T2 is given by L′;IF true THEN ParseFrom(Φ);ELSE

Assign required relations using rules;IF T1 is a phrase and not already parsedTHEN ProjectFrom(S1, T1);IF T2 is a phrase and not already parsedTHEN ProjectFrom(S2, T2);

}Figure 3: pseudo Direct Projection Algorithm

• S is a stack of Φijs.

• L is the set of source language relationswhose occurrence in parse of some S′ maylead to different structure of T ′, where T ′ ∼S′.

In the following sections we discuss in detail thescheme for parsing Hindi sentences using parsestructure of the corresponding English sentence.Along with the parse structure of the input, thephrase structure is also obtained.

5 Case study: English to Hindi

Prior requirements for developing a parsingscheme for the target language using the proposedalgorithm are: development of target languagelinks, word alignment technique, phrase identifi-cation procedure, creation of rule set R, morpho-logical analysis, development of ParseFromSpe-cial() module. In this section we discuss these de-tails for adapting a parser for Hindi using EnglishLG based parser.

Hindi Links. Goyal and Chatterjee (2005a;2005b) have developed links for Hindi Link Gram-mar along with their suffixes. Some of the Hindilinks are briefly discussed in the Table 2. It maybe noted that due to the free word order of Hindi,direction can not be specified for some links, i.e.,for such links “Word in Left” and “Word in Right”(second and third column of Table 2) shall be readas “Word on one side” and “Word on the otherside”, respectively.

Link Word in Left Word in Right DirectedS Subject Main verb NOSN ne Main verb NOO Object Main verb NOJ noun/pronoun postposition YESMV verb modifier Main verb NOMA Adjective Noun YESME aa-e-ii form of

verbNoun YES

MW waalaa/waale/waalii

Noun YES

PT taa-te-tii form ofverb

declension ofverb honaa

YES

D Determiner Head noun YES

Table 2: Some Hindi Links

Word Alignment. The algorithm requires thatthe source and target language sentences areword aligned. Some English-Hindi word align-ment algorithms have already been developed, e.g.

305

(Aswani and Gaizauskas, 2005). However, for thecurrent implementation alignment has been donemanually with the help of an online English-Hindidictionary1.

Identification of Phrases and Head Words.

Verb Phrases. Corresponding to any main verbvi present in the Hindi sentence, a verb phrase isformed by considering all the auxiliary verbs fol-lowing it. A list of Hindi auxiliary verbs, alongwith the linkage requirements has been main-tained. This list is used to identify and link verbphrases. Main verb of the verb phrase is consid-ered to be the head word.

Noun and Postposition2 Phrases. English NPis translated in Hindi as either NP or PP3. Also,English PP can be translated as either NP or PP. Ifthe Hindi noun is followed by any postposition,then that postposition is attached with the nounto get a PP. In this case the postposition is con-sidered as the head. Hindi NP corresponding tosome English NP is the maximal span of the words(in Hindi sentence) aligned with the words in thecorresponding English NP. The Hindi noun whoseEnglish translation is involved in establishing theInter-phrase link is the head word. Note that if thelast word (noun) in this Hindi NP is followed byany postposition (resulting in some PP), then thatpostposition is also included in the NP concerned .In this case the postposition is the head of the NP.The system maintains a list of Hindi postpositionsto identify Hindi PPs.

For example, consider the translation pair thelady in the room had cooked thefood∼ kamre (room) mein (in) baiThii huii (-)aurat (lady) ne (-) khaanaa (food) banaayaa(cooked) thaa (-).

The phrase structure of the English sen-tence is (NP1 (NP2 the lady) (PP1

in (NP3 the room))) (V P1 hadcooked) (NP4 the food).

Here, some of the Hindi phrases are as follows:kamre mein and aurat ne are identified as Hindi

PP corresponding to English PP1 and NP2. Thewords mein and ne are considered as their headwords, respectively. Since the maximal span of

1www.sanskrit.gde.to/hindi/dict/eng-hin-itrans.html2In Hindi prepositions are used immediately after the

noun. Thus, we refer to them as “postposition”.3PP for English is preposition phrase and for Hindi it

stands for postposition phrase.

translation of words of English NP1 is kamre meinbaiThii huii aurat which is followed by postposi-tion ne, the Hindi phrase corresponding to NP1

is kamre mein baiThii huii aurat ne with ne asthe head word. As huii and thii, which followthe verbs baiThii4 and banaayaa respectively, arepresent in the auxiliary verb list, Hindi VPs areobtained as baiThii huii and banaayaa thaa (cor-responding to V P1).

Phrase Set P . Hindi verb phrase and postposi-tion phrases are linked independent of the corre-sponding phrases in the English sentence. Thus,P = {V P, PP}.

Rule List R. Below we enlist some of the rulesdefined for parsing Hindi sentences using the En-glish links (E-links) of the parallel English sen-tences. Note that these rules are dependent on thetarget language.

Corresponding to E-link S: If the Hindi subject isfollowed by ne, then the subject makes a Jn linkwith ne, and ne makes an SN link with the verb.

Corresponding to E-link O: If the Hindi object isfollowed by ko, then the object makes a Jk linkwith ko, and ko makes an OK link with the verb.

Corresponding to E-links M, MX: English NPsmay have preposition phrase, present participle,past participle or adjective as postnominal modi-fiers which are translated as prenominal modifiers,or as relative clause in Hindi. The structure ofpostnominal modifier, however, may not be pre-served in the Hindi sentence. If the sentence is notcomplex, then the corresponding Hindi link maybe one of MA (adjective), MP (postposition phrase),MT (present participle), ME (past participle), or MW(waalaa/waale/waalii-adjective). An appropriatelink is to be assigned in Hindi sentence after iden-tification of the structure of the nominal modifier.These cases are handled in the module ParseFrom-Special(). The segment of the module that handlesEnglish Mp link is given in Figure 4.

Further, since morphological information ofHindi words can not be always extracted using cor-responding English sentence, a morphological an-alyzer is required to extract the information5. Forthe current implementation, morphological infor-

4We observe that English PP as postnominal modifier maybe translated as verbal prenominal modifier in Hindi and insuch cases some unaligned word is effectively a verb.

5For Hindi, some work is being carried out in this direc-tion, e.g., http://ccat.sas.upenn.edu/plc/ tamilweb/hindi.html

306

ParseFromSpecial(Φ): // Φ = 〈Ψ, Ψ′〉;// Ψ = 〈S1, S2, L〉; Ψ′ = 〈T1, T2〉;

{IF L = Mp THEN //S1 and S2 are NP and PP, resp.

IF T2 is followed by some verb, v, not aligned withany word in S THEN

T3 = VP corresponding to v;Parse(T3);Find head word t1 ∈ T1;Assign MT/ME link between v and t1;Assign MVp link between postposition (in T2)and v;ProjectFrom(S1, T1); ProjectFrom(S2, T2);

ELSEParseFrom(Φ);

ELSECheck for other cases of L;

}

Figure 4: ParseFromSpecial() for ‘Mp’ Link

mation is being extracted using some rules in sim-pler cases, and manually for more complex cases.

5.1 Illustration with an ExampleConsider the English sentence (S) the girlin the room drew a picture, its parsedand constituent structure as given in Figure 5. Fur-ther, the corresponding Hindi sentence (T ), andthe word-alignment is also given.

Figure 5: An Example

The step-by-step parsing of the sentence as perthe pDPA is given below.ProjectFrom(S, T ):

S = {S1, S2, S3}, where S1, S2, S3 are thephrases the girl in the room, drew anda picture, respectively. From the definition ofHindi phrases, corresponding Ti’s are identified as“kamre mein baithii laDkii ne”, “banaayaa” and“ek chitr”. From the parse structure of S, Φ’s areobtained as Φ12 = 〈〈S1, S2,Ss〉, 〈T1, T2〉〉 andΦ23 = 〈〈S2, S3,Os〉, 〈T2, T3〉〉. These Φ’s arepushed in the stack S and further processing isdone one-by-one for each of them. We show thefurther process for the Φ12.

Since Ss /∈ L , ParseFrom(Φ12) is executed.ParseFrom(Φ12):

The algorithm identifies t1 = ne, t2 = banaayaa.The Hindi link corresponding to Ss will be SN.The module ProjectFrom(S1, T1) is then called.ProjectFrom(S1, T1):

S1 = {S11, S12}, where S11 and S12 are thegirl and in the room, respectively. Corre-sponding T11 and T12 are ladkii ne and kamremein. Thus, Φ = 〈〈S11, S12,Mp〉, 〈T11, T12〉〉.Since L = Mp ∈ L , ParseFromSpecial(Φ) iscalled.ParseFromSpecial(Φ): (Refer to Figure 4)

Since T2 is followed by an unaligned verbbaithii, the algorithm finds T3 as baithii, andt1 as ne. It assigns ME link between baithiiand ne. Further, MVp link is assigned betweenmein and baithii. Then ProjectFrom(S11, T11) andProjectFrom(S12, T12) are called. Since both T11

and T12 ∈ S , J and Jn links are assigned be-tween constituent words of T11 and T12, respec-tively, using Hindi-specific rules.

Similarly, Φ23 is parsed.The final parse and phrase structure of the sen-

tence are obtained as given in Figure 6.

Figure 6: Parsing of Example Sentence

6 Experimental Results

Currently the system can handle the followingtypes of phrases in different simple sentences.

Noun Phrase. There can be four basic elementsof an English NP6: determiner, pre-modifier, noun(essential), post-modifier. The system can han-dle any combination of the following: adjective,noun, present participle or past participle as pre-modifier, and adjective, present participle, pastparticiple or preposition phrase as post-modifier.Note that some of these cases may be translated ascomplex sentence in Hindi (e.g., (book on thetable ∼ jo kitaab mej par rakhii hai). We areworking upon such cases.

6Pronouns as NPs are simple.

307

Verb Phrase. The system can handle all the fouraspects (indefinite, continuous, perfect and perfectcontinuous) for all three tenses. Other cases ofVPs (e.g., modals, passives, compound verbs) canbe handled easily by just identifying and puttingthe corresponding auxiliary verbs and their link-ing requirements in the auxiliary verb list.

Since the system is not fully automated yet, wecould not test our system on a large corpus. Thesystem has been tested on about 200 sentencesfollowing the specific phrase structures mentionedabove. These sentences have been taken randomlyfrom translation books, stories books and adver-tisement materials. These sentences were manu-ally parsed and a total of 1347 links were obtained.These links were compared with the system’s out-put. Table 3 summarizes the findings.

Correct Links : 1254Links with wrong suffix : 47Wrong links : 22Links missing : 31

Table 3: Experimental Results

After analyzing the results, we found that

• For some links, suffixes were wrong. Thiswas due to insufficiency of rules identifyingmorphological information.

• Due to incompleteness of some cases ofParseFromSpecial() module, some wronglinks were assigned. Also, some links whichshould not have been projected, were pro-jected in the Hindi sentence. We are workingtowards exploring these cases in detail.

• Some links were found missing in the pars-ing since corresponding sentence structuresare yet to be considered in the scheme.

7 Concluding Remarks

The present work focuses on development of Ex-ample based parsing scheme for a pair of lan-guages in general, and for English to Hindi in par-ticular.

Although the current work is motivated by(Hwa et al., 2005), the algorithm proposed hereinprovides a more generalized version of the projec-tion algorithm by making use of some target lan-guage specific rules while projecting links. This

provide more flexibility in the projection algo-rithm. The flexibility comes from the fact that un-like DPA the algorithm can project links from thesource language to the target language even if thetranslations are not literal. Use of rules at the pro-jection level gives more robust parsing and reducesthe need of post-editing. The proposed schemeshould work for other target languages also pro-vided the relevant rules can be identified. Fur-ther, since LG can be converted to DependencyGrammar (DG) (Sleator and Temperley, 1991),this work can be easily extended for languages forwhich DG implementation is available.

At present, we have focused on developingparsing scheme for simple sentences. Work has tobe done to parse complex sentences. Once a size-able parsed corpus is generated, it can be used fordeveloping the parser for a target language usingbootstrapping. We are currently working on theselines for developing a Hindi parser.

ReferencesNiraj Aswani and Robert Gaizauskas. 2005. A hy-

brid approach to aligning sentences and words inEnglish-Hindi parallel corpora. In ACL 2005 Work-shop on Building and Using Parallel Texts: Data-driven machine translation and Beyond.

Rens Bod, Remko Scha, and Khalil Sima’an, editors.2003. Data-Oriented Parsing. Stanford: CSLI Pub-lications.

Shailly Goyal and Niladri Chatterjee. 2005a. Study ofHindi noun phrase morphology for developing a linkgrammar based parser. Language in India, 5.

Shailly Goyal and Niladri Chatterjee. 2005b. Towardsdeveloping a link grammar based parser for Hindi.In Proc. of Workshop on Morphology, Bombay.

Rebecca Hwa, Philip Resnik, Amy Weinberg, ClaraCabezas, and Okan Kolak. 2005. Bootstrap-ping parsers via syntactic projection across paralleltexts. Natural Language Engineering, 11(3):311–325, September.

Daniel Sleator and Davy Temperley. 1991. ParsingEnglish with a link grammar. Computer Sciencetechnical report CMU-CS-91-196, Carnegie MellonUniversity, October.

Oliver Streiter. 2002. Abduction, induction andmemorizing in corpus-based parsing. In ESSLLI-2002 Workshop on Machine Learning Approachesin Computational Linguistics,, Trento, Italy.

David Yarowsky and Grace Ngai. 2001. Inducing mul-tilingual POS taggers and NP bracketers via robustprojection across aligned corpora. In NAACL-2001,pages 200–207.

308