shodhganga.inflibnet.ac.inshodhganga.inflibnet.ac.in/bitstream/10603/21171/5/5.doc · web...

4. PREPROCESSING

4.1 INTRODUCTION

In Chapter 2, the research work reviewed information retrieval and cross language information retrieval. Based on the literature review, a framework was proposed in Chapter 3. In this chapter, the pre-processing of the user's query is explained. The pre-processing stage accepts the query in Telugu and processes it using the grammar rules and the ontology to arrive at an intermediate English construct. This construct is then given to the search engine in the post-processing stage. The grammar rule structure and the ontological model are also explained. Finally, case studies show how the input Telugu query is converted to the output English intermediate constructs.

4.2 METHODOLOGY OF PROPOSED PRE-PROCESSING

The major objective of this research work (pre-processing stage) is to convert the

user query in Telugu into the relevant English constructs. There are three distinct

components that contribute to the success of the pre-processing. Figure 4.1 shows the

overall process of query pre-processing.

Figure 4.1 Overall process of query pre-processing

4.2.1 Tokenizer

The user gives the query to the system. The tokenizer divides text into a

structure of tokens. All contiguous strings of alphabetic characters are part of one token.

Figure 4.2 shows the tokenizer component in pre-processing.

Figure 4.2 Tokenization component

Tokens are separated by whitespace characters, such as a space or line break,

or by punctuation characters. Figure 4.3 explains the working process of a sample user

given Telugu query.

Figure 4.3 Tokenizer process

The steps in the tokenization of the user query are given below.


Segmenting Text into Words: Boundary identification is a fairly trivial task, since the majority of Telugu language characters are bound by explicit structures. A simple program can replace white spaces with word boundaries and cut off leading and trailing quotation marks, parentheses and punctuation. Figure 4.4 shows a sample text segmentation process.

Figure 4.4 Simple Telugu sentence tokenization

Handling Abbreviations: In Telugu a period is directly attached to the previous word. However, when a period follows an abbreviation it is an integral part of that abbreviation and should be tokenized together with it. Figure 4.5 shows a sample sentence with abbreviations.

Figure 4.5 Tokenizer example

Numerical and special expressions are difficult to handle in Telugu. They can confuse a tokenizer because they usually involve rather complex alphanumeric and punctuation syntax. For this reason, the blank spaces between the words are used as token boundaries. Figure 4.6 shows a sample tokenization of a query with a special expression.

Example:
Telugu query: సచిన్ ఆడుతున్న మ్యాచ్ (Sachin playing match)
Tokenized terms: సచిన్ (Sachin), ఆడుతున్న (playing), మ్యాచ్ (match)

Example:
Telugu query: దీని ధర ఎంత (How Much)
Tokenized terms: దీని, ధర, ఎంత

Figure 4.6 Tokenizer example for special expressions
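The tokenization steps above can be sketched in a few lines of Python. This is a minimal illustration, not the tokenizer used in this work: the function name `tokenize`, the punctuation set `EDGE_PUNCT` and the abbreviation list `ABBREVIATIONS` (including the sample entry "డా.", Dr.) are assumptions made for the example.

```python
import string

# Assumed set of punctuation trimmed from token edges (ASCII plus curly quotes).
EDGE_PUNCT = string.punctuation + "“”‘’"

# Hypothetical abbreviation list; the entries are illustrative only.
ABBREVIATIONS = {"డా.", "Dr."}


def tokenize(query: str) -> list[str]:
    """Split a query on whitespace and trim punctuation from token edges.

    A chunk found in ABBREVIATIONS keeps its trailing period, because the
    period is an integral part of the abbreviation.
    """
    tokens = []
    for chunk in query.split():
        if chunk in ABBREVIATIONS:
            tokens.append(chunk)
            continue
        token = chunk.strip(EDGE_PUNCT)
        if token:  # skip chunks that were punctuation only
            tokens.append(token)
    return tokens


if __name__ == "__main__":
    for token in tokenize("సచిన్ ఆడుతున్న మ్యాచ్"):
        print(token)
```

Splitting on whitespace first and trimming afterwards keeps punctuation inside a token intact, which matters for numerical expressions, and removes only the leading and trailing marks.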

4.2.2 Language Grammar Rules

The tokens are sent to the language grammar rule component to process. The

detailed flow of the grammar structure is explained in Appendix 1. In this sub section,

the essence is explained briefly.

The essence of Telugu grammar is as follows.

It follows the Subject, Object and Verb (SOV) pattern.

There are three persons (first, second and third), a two-way distinction in number, namely singular (sg.) and plural (pl.), and a three-way distinction in gender, namely masculine, feminine and neuter.

The feminine singular belongs to the neuter and the feminine plural belongs to the human.

Apart from the three tenses, namely past, present and future, Telugu has one more special tense, the future habitual.

Example:
Telugu query: రక్షక భటులని పిలవండి! (Call the security person)
Tokenized terms: రక్షక (the security), భటులని (person), పిలవండి! (Call)

Figure 4.7 shows the language grammar rules component.

Figure 4.7 Language Grammar rules component

The grammar rules are used to pre-process the text. The idea is to identify the appropriate word sense in the text, which helps to avoid the issue of out of vocabulary text. If the user query is a complex one, the reordered sentence is sent to the morphological analyzer to identify the tense of the verb and the inflections added to it. Telugu verbs inflect for tense, person, gender and number; nouns inflect for plural, oblique, case and postpositions. Figure 4.8 explains the working process for a sample user given Telugu query.

Figure 4.8 Grammar rules component process

The structure of verbal complexity is unique, and capturing this complexity in a machine analyzable and generatable format is a challenging task. Inflections of Telugu verbs include finite, non-finite, adjectival, adverbial and conditional markers. The verbs are classified into a number of paradigms based on their inflections.

For computational purposes, 37 verb paradigms are identified for Telugu, each with 160 inflections, and 67 noun paradigms, each with 117 sets of inflected forms. Based on the nature of the inflections, the root words are classified into groups. An example is shown in Table 4.1.

Table 4.1 Sample Telugu sentence order

Sentence | దినేష్ పనికి వెళ్లతాడు.
Words | దినేష్ | పనికి | వెళ్లతాడు
Transliteration | Dinesh | paniki | veḷtāḍu
Gloss | Dinesh | to work | goes
Parts | Subject | Object | Verb
Converted | Dinesh goes to work.
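The reordering in Table 4.1 can be illustrated with a short sketch. It assumes the words have already been glossed and tagged with their parts; the function name `reorder_sov_to_svo` and the tag strings are hypothetical and chosen only for this example.

```python
# Target English order; the role names follow the "Parts" row of Table 4.1.
SVO_ORDER = ("Subject", "Verb", "Object")


def reorder_sov_to_svo(tagged_gloss: list[tuple[str, str]]) -> str:
    """Rearrange (gloss, role) pairs from Telugu SOV order into English SVO order."""
    by_role = {role: gloss for gloss, role in tagged_gloss}
    words = [by_role[role] for role in SVO_ORDER if role in by_role]
    return " ".join(words) + "."


if __name__ == "__main__":
    gloss = [("Dinesh", "Subject"), ("to work", "Object"), ("goes", "Verb")]
    print(reorder_sov_to_svo(gloss))  # Dinesh goes to work.
```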

Telugu pronouns include personal pronouns (the persons speaking, the persons spoken to, or the persons or things spoken about), demonstrative pronouns, reflexive pronouns (in which the object of a verb is acted on by the verb's subject), interrogative pronouns, indefinite pronouns, demonstrative adjectives, interrogative adjectives, possessive adjectives, pronouns referring to numbers and distributive pronouns.

Telugu uses postpositions to mark words in different cases. With the use of postpositions, there are eight possible cases (vibhakti), as shown in Table 4.2.

A noun in Telugu carries markings of gender, number, person and case. Case markers are identified in three noun distinctions: human male/female, singular/plural and non-human. A noun denoting a human male ends with the inflection “-du”, and one denoting a human female ends with “-di”.

Number marking on nouns distinguishes singular and plural. For a large number of nouns the plural inflection is “-lu”, while for some nouns of the human male category the plural suffix alternant is “-ru”. Gender, number and person marking on nouns is explicit only in the first and second person, in both singular and plural. Telugu uses a wide variety of case markers and postposition suffixes, which express grammatical case relations such as nominative, accusative, dative, instrumental, genitive, comitative, vocative and causal.

Table 4.2 Postpositions for Telugu sentence order

Telugu | English | Significance | Usual suffixes | Transliteration of suffixes
Panchami Vibhakti (పంచమీ విభక్తి) | Ablative of motion from | Motion from an animate/inanimate object | వలనన్, కంటెన్, పట్టి | valanan, kaMTen, paTTi
Dviteeya Vibhakti (ద్వితీయా విభక్తి) | Accusative | Object of action | నిన్, నున్, లన్, కూర్చి, గురించి | nin, nun, lan, kUrci, guriMci
Chaturthi Vibhakti (చతుర్థీ విభక్తి) | Dative | Object to whom action is performed; object for whom action is performed | కొఱకున్, కై | korakun, kai
Shashthi Vibhakti (షష్ఠీ విభక్తి) | Genitive | Possessive | కిన్, కున్, యొక్క, లోన్, లోపలన్ | kin, kun, yokka, lOn, lOpalan
Truteeya Vibhakti (తృతీయా విభక్తి) | Instrumental, Social | Means by which action is done (Instrumental); association, or means by which action is done (Social) | చేతన్, చేన్, తోడన్, తోన్ | cEtan, cEn, tODan, tOn
Saptami Vibhakti (సప్తమీ విభక్తి) | Locative | Place in which; on the person of (animate); in the presence of | అందున్, నన్ | aMdun, nan
Prathama Vibhakti (ప్రథమా విభక్తి) | Nominative | Subject of sentence | డు, ము, వు, లు | Du, mu, vu, lu
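As an illustration of how the suffixes in Table 4.2 might be used, the sketch below detects the case of a transliterated word by longest suffix match. The suffix lists are copied from the transliteration column of the table; the function name `detect_case`, the longest-match strategy and the sample inputs are assumptions made for the example, not the procedure of this work.

```python
# Transliterated case suffixes, copied from Table 4.2.
CASE_SUFFIXES = {
    "Nominative": ["Du", "mu", "vu", "lu"],
    "Accusative": ["nin", "nun", "lan", "kUrci", "guriMci"],
    "Instrumental/Social": ["cEtan", "cEn", "tODan", "tOn"],
    "Dative": ["korakun", "kai"],
    "Ablative": ["valanan", "kaMTen", "paTTi"],
    "Genitive": ["kin", "kun", "yokka", "lOn", "lOpalan"],
    "Locative": ["aMdun", "nan"],
}


def detect_case(word: str):
    """Return (case, suffix) for the longest matching suffix, or None.

    Longest match is needed because some suffixes end in others,
    for example "valanan" ends in "nan" and "korakun" ends in "kun".
    """
    best = None
    for case, suffixes in CASE_SUFFIXES.items():
        for suffix in suffixes:
            if word.endswith(suffix) and len(word) > len(suffix):
                if best is None or len(suffix) > len(best[1]):
                    best = (case, suffix)
    return best


if __name__ == "__main__":
    # "stem" is a placeholder root used only to exercise the lookup.
    for sample in ("rAmuDu", "stemvalanan", "stemkorakun", "xyz"):
        print(sample, detect_case(sample))
```

A suffix match alone is ambiguous for short endings such as "lu" or "mu", so a real analyzer would also need the paradigm of the root word.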

A verb in a Telugu sentence is a finite or non-finite verb, which occurs according to situations such as rising pitch (meaning a question), level pitch and falling pitch (meaning a command). In Telugu all verbs have finite and non-finite forms.

A finite form is one that can stand as the main verb of a sentence and occur before a final pause (full stop); a non-finite form cannot stand as a main verb and rarely occurs before a final pause. There are eight finite rules for the Telugu verb, arranged in three verbal structures: stem or inflection root, tense-mode suffix and personal suffix. These rules are shown in Table 4.3 for the verb “ఆట్లాడు” (playing) with the root word “ఆట్ల” (play).

Table 4.3 Finite verb rules

Type | Structure | Rule | Example
Inflection or stem root | (Rule 1) Imperative | Singular –du; Plural –andi | atla –du; atla –andi
Tense–mode suffix | (Rule 2) Admonitive or abusive | kAlu (to burn), kUlu (to fall), pagulu (to break) | Due to semantic restrictions, many verbs cannot occur in this mode
 | (Rule 3) Obligative (in all persons) | -Ali | atlad –Ali (I, we, you) (singular, plural)
Personal suffix(es) | (Rule 4) Habitual-future or non-past | -ta- | atla – ta – Am (we shall play); atla – ta – Adu (he shall play); atla – tun – di (she will play); atla – ta – Anu (I shall play); atla – ta – Ava (you will play); atla – ta – Ay (they play); atla – ta – Aru (they will play)
 | (Rule 5) Past tense | -din- | atla – din – Anu (I played); atla – din – Ava (you played (singular)); atla – din – Aru (you played (plural)); atla – din – Am (we played); atla – din – Adu (he played); atla – din – di (she/it played); atla – din – Aru (they played)
 | (Rule 6) Hortative | -da- | atla – da – tAm (let us play, or we shall play)
 | (Rule 7) Negative tense | -data- | atla – data – va (you (do, did, and shall) not play); atla – data – Du (he (does, did, and shall) not play); atla – data – nu (I (do, did, and shall) not play); atla – data – m (we (do, did, and shall) not play); atla – data – ru (they (do, did, and shall) not play); atla – data – du (she/it (do, did, and shall) not play)
 | (Rule 8) Negative imperative or prohibitive | -Ak- | atla – Ak – andi (you (plural) don’t play); atla – Ak – u (you (singular) don’t play)
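The three-part structure of the finite forms in Table 4.3 (stem, tense-mode suffix, personal suffix) lends itself to a simple lookup. The sketch below analyzes a form that is already segmented with dashes, as written in the table. The dictionaries cover only Rules 4 to 7 and the personal suffixes listed under Rule 5; the function name `analyze_finite` and the output layout are assumptions made for the example.

```python
# Tense-mode suffixes taken from Rules 4 to 7 of Table 4.3.
TENSE_MODE = {
    "ta": "habitual-future or non-past (Rule 4)",
    "din": "past (Rule 5)",
    "da": "hortative (Rule 6)",
    "data": "negative (Rule 7)",
}

# Personal suffixes as listed under Rule 5 of Table 4.3.
PERSON = {
    "Anu": "I",
    "Am": "we",
    "Ava": "you (singular)",
    "Adu": "he",
    "di": "she/it",
    "Aru": "you (plural) or they",
}


def analyze_finite(form: str) -> dict:
    """Analyze a pre-segmented form such as 'atla – din – Anu'."""
    # Accept both the en dash used in the table and a plain hyphen.
    parts = [part.strip() for part in form.replace("–", "-").split("-")]
    if len(parts) != 3:
        raise ValueError("expected the form 'stem - tense suffix - personal suffix'")
    stem, tense, person = parts
    return {
        "stem": stem,
        "tense_mode": TENSE_MODE.get(tense, "unknown"),
        "person": PERSON.get(person, "unknown"),
    }


if __name__ == "__main__":
    print(analyze_finite("atla – din – Anu"))
    print(analyze_finite("atla – ta – Am"))
```

Segmenting an unsegmented surface form is the harder problem and is not attempted here.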

In the same way, there are ten non-finite verb rules, which may be arranged into two structural types, bound and unbound. These rules are shown in Table 4.4.

Table 4.4 Non-finite verb rules

Type | Structure | Rule | Example
Bound type | (Rule 9) Present | -ta-un- | atladu – ta – unnAnu (I am playing); atladu – ta – un – nA (even playing (now)); atladu – ta – un – tE (if playing); atladu – ta – un – na (that playing)
Unbound type | (Rule 10) Concessive | -dinA | atla – dinA (even though played)
 | (Rule 11) Conditional | -itE | atla – itE (if played)
 | (Rule 12) Present participle | -dutu | atla – dutU (playing)
 | (Rule 13) Past participle | -di | atla – di (having played)
 | (Rule 14) Infinitive | -ta | atla – ta (to play)
 | (Rule 15) Past adjective | -dina | atla – dina (that played)
 | (Rule 16) Negative adjective | -dani | atla – dani (not played)
 | (Rule 17) Negative participle | -aku | atla – aku (not playing)
 | (Rule 18) Habitual adjective | -dE | atla – dE (that plays)

The subject, object, verb and inflection are identified using the above grammar

rules.

4.2.3 Bilingual Ontology

The terms are looked up in the ontology to find the equivalent English terms. The bilingual ontology for information retrieval is constructed based on the English Telugu language vocabulary relationships. In this research work, the ontology is a key element for the pre-processing of the query and the post-processing of the results. The block diagram of the bilingual ontology component is shown in Figure 4.9.

Figure 4.9 Ontology Component

An ontology may take a variety of forms, but it necessarily includes a vocabulary of terms and some specification of their meaning. It includes definitions and an indication of how the concepts are inter-related, which collectively impose a structure on the domain and constrain the possible interpretations of the terms. Figure 4.10 illustrates the workflow of the bilingual ontology component in the pre-processing stage of the CLIR system and also shows the connecting relationships of ontology terms.

Figure 4.10 Process flow of bilingual ontology component

Firstly, the English terms are mapped to Telugu terms that come from the Telugu English bilingual dictionary. The English Telugu ontology may contain terms that do not appear in the original Telugu English bilingual dictionary, or vice versa, so the number of terms in both versions is compared. The terms that do not appear in both languages are considered Out Of Vocabulary (OOV) terms. The result of the alignment is a term list, which is treated as the basis for extending the ontology. Each Telugu term in the list is considered a seed term, which is used to search for Telugu synonyms online.

Secondly, the search engine is used to retrieve results in Telugu for each Telugu

term, which are assumed to contain candidate Telugu synonyms. Thirdly, Telugu

translations of terms are extracted from the retrieved results using sequential

application of the following: a) linguistic rules, which provide the text segments

potentially containing translations; b) mutual information filtering, which refines the

candidate translations. Fourthly, the frequencies of each English term and Telugu

translation in the results retrieved by search engine are calculated; and term weights

are computed using these frequencies.
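The chapter does not give the formulas used for mutual information filtering or for term weighting. The sketch below is therefore only one plausible reading: it assumes pointwise mutual information (PMI) computed from co-occurrence counts in the retrieved results, and a simple relative-frequency weight. The function names, the threshold and the sample counts are all hypothetical.

```python
import math


def pmi(pair_count: int, term_count: int, candidate_count: int, total: int) -> float:
    """Pointwise mutual information: log2( p(x, y) / (p(x) * p(y)) )."""
    if min(pair_count, term_count, candidate_count, total) <= 0:
        return float("-inf")
    return math.log2((pair_count * total) / (term_count * candidate_count))


def filter_candidates(candidates: dict, term_count: int, total: int, threshold: float = 0.0) -> list:
    """Keep candidates whose PMI with the seed term exceeds the threshold.

    `candidates` maps a candidate string to (pair_count, candidate_count).
    """
    kept = []
    for candidate, (pair_count, candidate_count) in candidates.items():
        if pmi(pair_count, term_count, candidate_count, total) > threshold:
            kept.append(candidate)
    return kept


def term_weight(term_freq: int, total_freq: int) -> float:
    """Assumed weighting: frequency of the term relative to all counted terms."""
    return term_freq / total_freq if total_freq else 0.0


if __name__ == "__main__":
    # Made-up counts over 100 retrieved snippets, for illustration only.
    sample = {"candidate_a": (10, 20), "candidate_b": (2, 40)}
    print(filter_candidates(sample, term_count=20, total=100))
    print(term_weight(5, 20))
```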

Finally, the aligned term pairs, the English translations, the term weights and the ontology entry terms are merged according to the ontology hierarchy, forming the Telugu English bilingual ontology. The order in which the suggestions are displayed is shown in Figure 4.11: the meaning, relationship terms and related terms are expanded in that order and shown to the users.

Figure 4.11 Ontology Relationship Hierarchies (diagram nodes: Query, Meaning, Relationship, Related and Relevant, numbered 1 to 11)

In the ontology of this research work, the terms are organized into four types of records: meaning, related, relevant and supplementary concept records. All of them are used in the day to day life of the users. A sample structure of the ontology is shown in Figure 4.12.

Figure 4.12 Sample ontology structure
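One way to picture an ontology entry with the four record types is a small data structure. The sketch below is an assumption about a possible layout, not the storage format of this work: the class names `OntologyEntry` and `BilingualOntology`, the field names and the suggestion order (meaning, then related, relevant and supplementary) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyEntry:
    """A Telugu term, its English equivalent and the four record types."""
    telugu: str
    english: str
    meaning: list = field(default_factory=list)
    related: list = field(default_factory=list)
    relevant: list = field(default_factory=list)
    supplementary: list = field(default_factory=list)


class BilingualOntology:
    """Minimal in-memory lookup keyed by the Telugu term."""

    def __init__(self) -> None:
        self._entries = {}

    def add(self, entry: OntologyEntry) -> None:
        self._entries[entry.telugu] = entry

    def lookup(self, telugu_term: str) -> Optional[OntologyEntry]:
        # None signals an out of vocabulary term.
        return self._entries.get(telugu_term)

    def suggestions(self, telugu_term: str) -> list:
        entry = self.lookup(telugu_term)
        if entry is None:
            return []
        return entry.meaning + entry.related + entry.relevant + entry.supplementary


if __name__ == "__main__":
    ontology = BilingualOntology()
    ontology.add(OntologyEntry(telugu="దేశం", english="Country", related=["Regions"]))
    print(ontology.lookup("దేశం").english)
    print(ontology.lookup("unknown-term"))
```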

4.2.4 OOV Component

The terms that are not available in the ontology are considered out of vocabulary terms. These terms are handled by the Out of Vocabulary component. The block diagram of the OOV component is shown in Figure 4.13.

Node labels of Figure 4.12 (Sample ontology structure): Sports: ఆటలు; Clubs: క్లబ్స్; Competitions: పోటీలు; Personal: వ్యక్తిగత; Tournaments: తౌర్నమెంత్స్; Family: గ్రూప్; Fav team: జట్టు; Group: గుంపు; Location: ప్రాంతం; Players: ఆటగాలు; Cricket: క్రికెట్; Football: ఫుట్‌బాల్; Tennis: టెన్నిస్; Audience: ప్రేక్షకులు; Umpire: అంపైర్; Regions: ప్రదేశం ప్రాంతాలు; Country: దేశం. Relationship labels: has, is a, sub class of, not.

Figure 4.13 Out of Vocabulary Component

The out of vocabulary processing system transliterates the term into the target language. This helps to avoid the issue of out of vocabulary text. The terms are then rearranged and the query is converted into the target language. Once the pre-processing of the query is done, the converted query is sent, along with the user given query, to the web for results related to the queries.
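Transliteration of an out of vocabulary term can be sketched as a character-level mapping. The example below covers only a handful of Telugu letters, enough to romanize మంచి as "manchi"; the mapping tables, the romanization choices and the function name `transliterate` are assumptions made for illustration and are not the transliteration scheme of this work.

```python
# Tiny illustrative subset of the Telugu script (Unicode block U+0C00 to U+0C7F).
CONSONANTS = {
    "\u0c2e": "m",   # మ
    "\u0c1a": "ch",  # చ
    "\u0c28": "n",   # న
    "\u0c32": "l",   # ల
}
VOWEL_SIGNS = {
    "\u0c3e": "a",   # ా
    "\u0c3f": "i",   # ి
    "\u0c41": "u",   # ు
    "\u0c4b": "o",   # ో
}
VIRAMA = "\u0c4d"    # ్ suppresses the inherent vowel
ANUSVARA = "\u0c02"  # ం romanized here as "n"


def transliterate(word: str) -> str:
    """Romanize a Telugu word; unmapped characters are passed through."""
    out = []
    i = 0
    while i < len(word):
        ch = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if ch in CONSONANTS:
            out.append(CONSONANTS[ch])
            if nxt in VOWEL_SIGNS:
                out.append(VOWEL_SIGNS[nxt])  # explicit vowel sign replaces "a"
                i += 1
            elif nxt == VIRAMA:
                i += 1                        # no vowel after this consonant
            else:
                out.append("a")               # inherent vowel
        elif ch == ANUSVARA:
            out.append("n")
        else:
            out.append(ch)
        i += 1
    return "".join(out)


if __name__ == "__main__":
    print(transliterate("మంచి"))
```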

Case 1 shows how a user given query is processed and converted into English by the pre-processing system. The step by step process is given below:

Step 1: The user enters the query “చెన్నైలో మంచి భోజనశాల” (good hotel in Chennai).

Step 2: The tokenizer splits the query into tokens: చెన్నైలో (token 1), మంచి (token 2), భోజనశాల (token 3).

Step 3: The grammar rules are applied to the tokens. The tokens are first checked for inflections; if an inflection is found, the equivalent grammar rule is used to identify the subject, object and verb. In the above tokens, “లో” is the inflection; it is attached to the subject “చెన్నై”, the verb is “మంచి” and the object is “భోజనశాల”.

Step 4: Once the terms (subject, object, verb and inflection) are identified, the ontology is searched for equivalent terms. Here, the terms చెన్నై (Chennai) and భోజనశాల (hotel) are found in the ontology and the inflection లో (in) is taken from the inflection table, but the term మంచి (good) is not available in the ontology.

Step 5: The terms that are not available in the ontology are sent to the OOV component, which transliterates them literally.

Step 6: Once the terms are converted, the query is constructed in English using the subject, object, verb and inflection. Here, the above query is constructed as “manchi hotel in Chennai”.

Step 7: The query is sent to the next stage for results.
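The steps of Case 1 can be traced with a toy pipeline. Everything in the sketch is a stand-in: the dictionaries hold only the terms of this one query, `OOV_STUB` replaces the real transliteration component with a fixed lookup, and the reconstruction rule (plain terms first, then the inflected term preceded by its English preposition) is an assumption chosen to reproduce the output of Step 6, not a general grammar rule.

```python
# Toy resources containing only the terms of Case 1.
ONTOLOGY = {"చెన్నై": "Chennai", "భోజనశాల": "hotel"}
INFLECTIONS = {"లో": "in"}
OOV_STUB = {"మంచి": "manchi"}  # stand-in for the OOV transliteration component


def split_inflection(token: str):
    """Separate a known inflection suffix from the token, if present."""
    for suffix, english in INFLECTIONS.items():
        if token.endswith(suffix) and len(token) > len(suffix):
            return token[: -len(suffix)], english
    return token, None


def translate_term(term: str) -> str:
    """Ontology lookup first; fall back to the OOV stub, then to the term itself."""
    if term in ONTOLOGY:
        return ONTOLOGY[term]
    return OOV_STUB.get(term, term)


def preprocess(query: str) -> str:
    plain, inflected = [], []
    for token in query.split():                 # Step 2: tokenize
        stem, preposition = split_inflection(token)   # Step 3: find inflection
        english = translate_term(stem)          # Steps 4 and 5: lookup or OOV
        if preposition:
            inflected.append(f"{preposition} {english}")
        else:
            plain.append(english)
    return " ".join(plain + inflected)          # Step 6: reconstruct


if __name__ == "__main__":
    print(preprocess("చెన్నైలో మంచి భోజనశాల"))
```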

The flow chart for the pre-processing system is shown in Figure 4.14.

Figure 4.14 Flow chart for the pre-processing stage (boxes: Start; User enters the query in Telugu; Tokenize the user query into tokens; Language grammar rules to identify subject, object and verb; Rule identification based on the inflection and verb; Inflection table lookup; Lookup into the ontology for equivalent terms, with Yes and No branches; Query reconstruction into source language; Transliteration; Stop)

4.3 CONCLUSION

The user given query is processed in the pre-processing stage and converted into the target language using the language grammar rules and the ontology. The grammar rules play a major role in identifying the terms (subject, object and verb) and the rule used to convert the query. With the help of the ontology the terms are easily looked up, and the terms that are not available in the ontology are transliterated by the OOV component. Thus the Telugu query is converted into its English equivalent; the query is then processed by the search engine and the relevant results are retrieved.
