
IEEE Proceedings of 4th International Conference on Intelligent Human Computer Interaction, Kharagpur, India, December 27-29, 2012

Identification of Nominal Multiword Expressions in Bengali Using CRF

Tanmoy Chakraborty
Department of Computer Science and Engineering
Indian Institute of Technology, Kharagpur
India
Email: its [email protected]

Abstract—One of the key issues in both natural language understanding and generation is the appropriate processing of Multiword Expressions (MWEs). MWEs pose a serious problem for precise language processing due to their idiosyncratic nature and their diversity in lexical, syntactic and semantic properties. The semantics of an MWE may be expressed transparently or opaquely by combining the semantics of its constituents. This paper deals with the identification of Nominal Multiword Expressions in Bengali text using the Conditional Random Field (CRF) machine learning technique. Bengali is a highly agglutinative and morphologically rich language, so features such as surrounding words, POS tags, prefix, suffix and length prove very effective when running the CRF tool for the identification of Nominal MWEs. Compared to the statistical system built for compound noun MWE identification in Bengali, our proposed system shows higher accuracy in terms of precision, recall and F-score. We also show that identifying Reduplicated MWEs (RMWEs) and using them as a feature yields a reasonable improvement over the earlier system.

Index Terms—Multiword Expressions, Bengali, CRF, Reduplications

I. INTRODUCTION

Over the past two decades or so, Multiword Expressions (MWEs) have attracted an increasing amount of interest in the fields of Computational Linguistics and Natural Language Processing. The term MWE is used to refer to various types of linguistic units and expressions, including idioms (kick the bucket, "to die"), noun compounds (village community), phrasal verbs (find out, "search"), other habitual collocations such as conjunctions (as well as), institutionalized phrases (many thanks), etc. While there is no universally agreed definition of MWE as yet, most researchers use the term to refer to those frequently occurring phrasal units which are subject to a certain level of semantic opaqueness, or non-compositionality. Sag et al. (2002) [1] defined them as "idiosyncratic interpretations that cross word boundaries (or spaces)".

The identification of MWEs in several languages has started with a concentration on compound nouns and noun-verb combinations, sometimes on idioms and phrases, but not much on combined MWEs. The reason may be that the combined identification of MWEs is difficult in any language. MWE identification is treated as a special issue of semantics in which the individual components of an expression often fail to keep their meanings intact within the actual meaning of that expression. This opaqueness in meaning may be partial or total depending on the degree of compositionality of the whole expression [2]. MWEs span a continuum from complete compositionality (aka "institutionalized phrases", e.g., many thanks, which decompose into simplex senses and generally display high syntactic variability) to partial compositionality (e.g., light house, where partial meaning is identified from the components), then to idiosyncratic compositionality (e.g., spill the beans, "to reveal", which are decomposable but coerce their parts into taking semantics unavailable outside the MWE and undergo a certain degree of syntactic variation), and finally to complete non-compositionality (e.g., hot dog, where no decompositional analysis is possible and the MWE is semantically impenetrable). A number of research activities regarding MWEs have been carried out in various languages such as English, German and many other European languages. Various statistical co-occurrence measures such as Mutual Information [3], Log-Likelihood [4] and Salience [5] have been suggested for the identification of MWEs.

For Indian languages, considerable work has been done on compound noun MWE extraction [6], complex predicate extraction [7], a clustering based approach [8] and a classification based approach for Noun-Verb collocations [9]. In Bengali, works on automated extraction of MWEs are limited in number. One method of automatic extraction of Noun-Verb MWEs in Bengali [10] has been carried out using morphological evidence and a significance function. The authors classified Bengali MWEs based on their morpho-syntactic flexibility and proposed a statistical approach for extracting verbal compounds from a medium-sized corpus. Chakraborty and Bandyopadhyay (2010) [11] attempted to extract noun-noun bigram MWEs from a Bengali corpus using a statistical approach.

In this experiment, we have tried to build a standard lexicon of Bengali Nominal MWEs so that it can help to develop proper training samples for a machine learning approach as well as a gold standard to evaluate our system. For the first time in the Bengali language, we introduce CRF to tag MWEs using the information of morphological and phraseological markers and the dependencies between a candidate phrase and its contextual tokens. Besides this, we incorporate the information of reduplicated MWEs into the feature set and draw the conclusion that it improves the performance of the CRF model significantly. Finally, we add a post-processing step based on the heuristic that the constituents of an MWE always belong to a single chunk. Though it reasonably improves the precision, the recall drops because of the inability of the Bengali shallow parser to chunk raw text correctly.

Section II describes the classification of Nominal MWEs in Bengali, Section III gives a very brief overview of the Conditional Random Field model, Section IV gives a detailed description of the experimental methodology, Section V illustrates the evaluation, Section VI shows the improvement obtained using RMWEs, and the conclusion is drawn in Section VII.

II. NOMINAL MULTIWORD EXPRESSIONS IN BENGALI

A compound noun or nominal compound consists of more than one free morpheme and, when it acts as an MWE, the components sometimes lose their individual literal meanings and the whole behaves like a single semantic unit. Compound noun MWEs can occur in open, closed or hyphenated forms and satisfy semantic non-compositionality, statistical co-occurrence or literal phenomena [6], etc. Agarwal et al. (2004) [10] classified Bengali MWEs into three main classes that consist of twelve different fine-grained subclasses. However, we have classified Bengali Nominal MWEs into eight different subclasses based on their morpho-syntactic flexibility. The classification is as follows:

Named-Entities (NE): Names of people (Rabindranath Thakur, "Rabindranath Tagore"), locations (Bharat-barsa, "India"), organizations (Paschim Banga Siksha Samsad, "West Bengal Board of Education") etc., where inflection can be added to the last word only.

Idiomatic Compound Nouns: These are unproductive and idiomatic in nature, and inflection can be added only to the last word. This type is formed by a hidden conjunction between the components or by the loss of inflection from the first component (maa-baba, "mother and father").

Idioms: These are also compound nouns with idiosyncratic meaning, but the first noun is generally in possessive form (taser ghar, "fragile"). Sometimes the individual components may not carry any significant meaning and may not be part of the dictionary (gadai laskari chal, "indolent habit"). For these, no inflection is allowed, even on the last word.

Numbers: These are highly productive, impenetrable and allow slight syntactic variation such as inflection. Inflection can be added only to the last component (soya sat ghanta, "seven hours and fifteen minutes").

Relational Noun Compounds: These are mainly kinship terms and bigrams in nature. Inflection can be added to the last word (pistuto bhai, "maternal cousin").

Conventionalized Phrases: Sometimes called "institutionalized phrases". They are not idiomatic; rather, a particular word combination has come to be used to refer to a given object. They are productive but have unexpectedly low frequency and, in doing so, contrastively highlight the statistical idiomaticity of the target expression (bibaha barshiki, "marriage anniversary").

Simile Terms: These are analogy terms in Bengali and are sometimes similar to idioms, except that they are semi-productive (hater panch, "remaining resource").

Reduplicated Terms: Reduplications are non-productive and are tagged as noun phrases. They are further classified as onomatopoeic expressions (khat khat, "knocking"), complete reduplication (bara-bara, "big big"), partial reduplication (thakur-thukur, "God"), semantic reduplication (matha-mundu, "head") and correlative reduplication (maramari, "fighting") [12].

Identification of reduplication has already been carried out using clues from Bengali morphological patterns [12]. A number of research activities on Bengali Named Entity (NE) detection have been carried out [13], but the lack of a standard tool to detect NEs prevents us from incorporating one into the present system. In this experiment, we mainly focus on the extraction of the above-mentioned Nominal MWEs in Bengali.

III. CONDITIONAL RANDOM FIELD (CRF)

Conditional Random Field (CRF) is a probabilistic model for segmenting and labeling sequence data [14]. A CRF is an undirected graphical model that encodes a conditional probability distribution with a given set of features. For a given observation sequence X = (X_1, X_2, ..., X_n) and its corresponding label sequence Y = (Y_1, Y_2, ..., Y_n), a linear-chain CRF defines the conditional probability as:

P(Y|X) = \frac{1}{Z_X} \exp\Big(\sum_i \sum_j \lambda_j f_j(y_{i-1}, y_i, X, i)\Big)    (1)

where Z_X is a normalization factor that makes the probabilities of all state sequences sum to 1, f_j is a feature function, and \lambda_j is a learned weight associated with f_j. A maximum entropy learning algorithm can be used to train the CRF. For a given observation sequence, the most probable label sequence is determined by

Y^* = \arg\max_Y P(Y|X)    (2)

where Y^* can be computed efficiently using the Viterbi algorithm. An N-best list of label sequences can also be obtained using a modified Viterbi algorithm and A* search. The main advantage of CRF comes from the fact that it relaxes the assumption of conditional independence of the observed data often made in generative approaches, an assumption that can be too restrictive for a considerable number of object classes.
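For concreteness, the following minimal Python sketch illustrates equations (1) and (2) for a toy linear-chain model over the BIO tags used later in this paper. It is an illustration only, not the CRF++ implementation actually used; the two feature functions and their weights are invented for the example, whereas in practice the weights \lambda_j are learned from annotated training data.

```python
import math
from itertools import product

LABELS = ["B-MWE", "I-MWE", "O"]  # BIO tags used later in the paper

# Hypothetical feature functions f_j(y_prev, y, X, i) -> {0,1} with hand-set weights lambda_j;
# the real system uses the much richer feature set described in Section IV.
def f_noun_begins(y_prev, y, X, i):
    return 1.0 if X[i][1] == "NN" and y == "B-MWE" else 0.0

def f_inside_follows_begin(y_prev, y, X, i):
    return 1.0 if y_prev == "B-MWE" and y == "I-MWE" else 0.0

FEATURES = [(f_noun_begins, 1.2), (f_inside_follows_begin, 0.8)]

def score(y_seq, X):
    """Unnormalized log-score: sum_i sum_j lambda_j * f_j(y_{i-1}, y_i, X, i)."""
    total = 0.0
    for i in range(len(X)):
        y_prev = y_seq[i - 1] if i > 0 else "O"   # dummy start label
        total += sum(lam * f(y_prev, y_seq[i], X, i) for f, lam in FEATURES)
    return total

def conditional_probability(y_seq, X):
    """P(Y|X) from equation (1): exp(score) normalized by Z_X over all label sequences."""
    z_x = sum(math.exp(score(list(y), X)) for y in product(LABELS, repeat=len(X)))
    return math.exp(score(y_seq, X)) / z_x

def viterbi(X):
    """Equation (2): argmax_Y P(Y|X); Z_X cancels, so it suffices to maximize the score."""
    best = [{y: (sum(lam * f("O", y, X, 0) for f, lam in FEATURES), None) for y in LABELS}]
    for i in range(1, len(X)):
        best.append({})
        for y in LABELS:
            candidates = [(best[i - 1][yp][0] +
                           sum(lam * f(yp, y, X, i) for f, lam in FEATURES), yp)
                          for yp in LABELS]
            best[i][y] = max(candidates)
    path = [max(LABELS, key=lambda y: best[-1][y][0])]
    for i in range(len(X) - 1, 0, -1):            # follow the backpointers
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

# X is a sequence of (token, POS) observations.
X = [("taser", "NN"), ("ghar", "NN"), ("bhenge", "VB")]
best_y = viterbi(X)
print(best_y, round(conditional_probability(best_y, X), 4))
```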

IV. EXPERIMENTAL METHODOLOGY

The system architecture of the proposed model is shown in Figure 1. The process begins with preprocessing of the crawled corpus, which is very scattered and unformatted. The cleaned corpus is then fed into the CRF model for the training and testing phases; before that, we annotate the MWEs in the cleaned corpus. CRF labels the candidate phrases as MWEs or not using the statistics learned from the training dataset. Finally, we use a post-processing step to filter some false positive terms from the output of the CRF model. We report the results both before and after the post-processing step in the evaluation phase.

[Figure 1 depicts the pipeline: document collection and preprocessing produce a cleaned document; feature extraction feeds training and test data into the CRF model, whose labeling output is post-processed to give the final result.]

Fig. 1. Proposed system architecture

A. Corpus Acquisition and Candidate Extraction

Resource acquisition is one of the main obstacles when working with electronically resource-constrained languages like Bengali. Our system uses a large number of Bengali articles written by the noted Indian Nobel laureate Rabindranath Tagore(1), and 150 articles each by Sarat Chandra Chottopadhyay and a group of other Bengali authors(2). The statistics of the entire dataset are tabulated in Table I. As the order of the documents within the sequence is not of major importance, we merged all the articles into a single raw corpus.

The actual motivation for choosing the literature domain in the present task was to develop useful statistics and to support further work on stylometry analysis. In literature, the use of MWEs is heavy compared to other domains such as tourism or scientific documents, because the semantic versatility of MWEs often helps the writer express a viewpoint appropriately. Especially in Bengali literature, idiomatic expressions and relational terms are used quite frequently by writers. Our crawled corpus was so scattered and unformatted that we applied basic semi-automatic pre-processing techniques to make it suitable for parsing. Parsing with the Bengali shallow parser(3) was done to identify the POS, chunk, root, inflection and other morphological information of each token. Some tokens are misspelled due to typographic or phonetic errors, so the shallow parser is not able to detect their actual root and inflection properly. The shallow parser also confuses some of the nominal tags, such as common noun (NN) and proper noun (NNP), because of the continuous coinage of new terms for describing new concepts. To identify all Nominal MWEs present in the documents, we have taken both of these tags.

(1) http://www.rabindra-rachanabali.nltr.org
(2) http://banglalibrary.evergreenbangla.com/
(3) http://ltrc.iiit.ac.in/analyzer/bengali

TABLE I
STATISTICS OF THE USED DATASET

Authors                      | # documents | # tokens  | # unique tokens
Rabindranath Tagore          | 150         | 6,862,580 | 4,978,672
Sarat Chandra Chottopadhyay  | 150         | 4,083,417 | 2,987,450
Others                       | 150         | 3,818,216 | 2,657,813

B. Annotation Agreement

Three annotators, identified as A1, A2 and A3 (linguistic experts working on our project), were engaged to carry out the annotation. They were asked to divide all extracted phrases into three classes; the definition of each class was also provided with examples:

Class 1: Valid Nominal MWEs (M): phrases which show total non-compositionality and whose meanings are hard to predict from their constituents (e.g., hater panch, "remaining resource").

Class 2: Valid N-N semantic collocations but not MWEs (S): phrases which are partially or totally compositional, sometimes act as institutionalized phrases, and show statistical idiomaticity (e.g., bibaha barsiki, "marriage anniversary").

Class 3: Invalid candidates (E): phrases enlisted due to errors in parsing such as POS, chunk or inflection errors (e.g., granthagar tayri, "build library").

The candidates in Class 3 were filtered out initially; they account for 53.90% of the candidates. The remaining 46.10% of the candidates (5628 phrases) were annotated and labeled as "M" (MWEs) or "S" (semantically collocated phrases) and fed into the evaluation phase.

The annotation agreement is measured using the standard Cohen's kappa coefficient (κ) [15]. It is a statistical measure of inter-rater agreement for qualitative (categorical) items; it measures the agreement between two raters who separately classify items into mutually exclusive categories. MWEs are words or strings of words that are selected by the annotators, and the agreement is computed between the sets of text spans selected by the two annotators for each of the expressions. In addition to kappa (κ), we employed another strategy to calculate the agreement between annotators: the measure of agreement on set-valued items (MASI) [16], which is used for measuring agreement in semantic and pragmatic annotation. MASI is a distance between sets whose value is 1 for identical sets and 0 for disjoint sets. For sets A and B it is defined as MASI = J * M, where the Jaccard metric (J) is:

J = \frac{|A \cap B|}{|A \cup B|}    (3)

and monotonicity (M) is defined as:

M = \begin{cases} 1, & A = B \\ 2/3, & A \subset B \text{ or } B \subset A \\ 1/3, & A \cap B \neq \emptyset,\ A - B \neq \emptyset \text{ and } B - A \neq \emptyset \\ 0, & A \cap B = \emptyset \end{cases}    (4)
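To make equations (3) and (4) concrete, here is a small Python sketch of the MASI computation over two annotators' sets of MWE spans; the example spans are invented for illustration.

```python
def jaccard(a, b):
    """Equation (3): |A ∩ B| / |A ∪ B|."""
    return len(a & b) / len(a | b) if a | b else 1.0

def monotonicity(a, b):
    """Equation (4): 1 for identical sets, 2/3 for a proper subset,
    1/3 for partial overlap, 0 for disjoint sets."""
    if a == b:
        return 1.0
    if a < b or b < a:          # proper subset in either direction
        return 2 / 3
    if a & b:                   # some overlap, but neither contains the other
        return 1 / 3
    return 0.0

def masi(a, b):
    """MASI = J * M; 1 for identical sets, 0 for disjoint sets."""
    return jaccard(a, b) * monotonicity(a, b)

# Hypothetical (start, end) spans marked as MWEs by annotators A1 and A2.
a1 = {(3, 5), (10, 12), (20, 23)}
a2 = {(3, 5), (10, 12)}
print(masi(a1, a2))   # 2/3 * 2/3 ≈ 0.44: A2's set is a proper subset of A1's
```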

Table II illustrates the agreement statistics using the two measures. Among the full-agreement MWEs, 50% are used for training, 25% are used for the development dataset and the rest of the candidates are kept for the testing phase.

TABLE II
INTER-ANNOTATION AGREEMENT (PAIR-WISE AGREEMENT, %, OVER THE 5628 CANDIDATES)

Measure | A1-A2 | A1-A3 | A2-A3 | Avg.
KAPPA   | 87.23 | 86.14 | 88.78 | 87.38
MASI    | 87.17 | 87.02 | 89.02 | 87.73

C. MWE Extraction Using CRF

The process of MWE extraction using CRF requires feature selection, preprocessing (which includes arranging tokens or words into sentences with other notations), creation of a model file after training, and testing on another corpus. For the current work, the C++ based CRF++ 0.53 package(4), readily available as open source for segmenting or labeling sequential data, is used. The following subsections explain the overall process in detail.

1) Feature Selection: Feature selection is important in CRF. The various features used in the system are:

F = {W_{i-m}, ..., W_{i-1}, W_i, W_{i+1}, ..., W_{i+n}, |prefix| <= n, |suffix| <= n, surrounding POS tags, word length, word frequency, acceptable prefix, acceptable suffix}

Surrounding words as feature: The preceding and following words of a particular word can be used as features, since the preceding and following words influence the present word.

Word suffixes and prefixes as feature: The suffix and prefix play an important role in Bengali POS tagging. A maximum of n characters of every word is considered for the suffix and prefix; for words with length less than n, NIL is substituted in the respective fields. These prefix or suffix characters are considered regardless of whether they are meaningful or not.

Surrounding POS tags: MWEs can be combinations of noun-noun, verb-noun or adjective-noun POS patterns, so the POS tags of the surrounding words are considered.

Length of the word: The length feature is set to 1 if the word is longer than 3 characters, otherwise it is set to 0. Very short words are rarely proper nouns.

Word frequency: A range of frequencies is set: words occurring fewer than 100 times get the value 0, words occurring at least 100 but fewer than 400 times get the value 1, and so on. Word frequency is used as a feature since MWEs are rare in occurrence.

Acceptable prefix: Eight prefixes have been manually identified for Bengali, and this list of prefixes is used as one feature. A binary notation is used: a '1' is set if the word contains one of the acceptable prefixes, otherwise a '0'.

Acceptable suffix: Twenty suffixes have been manually identified for Bengali, and this list of suffixes is used as one feature. A binary notation is used: a '1' is set if the word contains one of the acceptable suffixes, otherwise a '0'.

(4) http://crfpp.sourceforge.net
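The per-token features above can be assembled as in the following sketch. The thresholds follow the description in this subsection, but the affix lists, the exact column layout and the third frequency bin are illustrative assumptions, not the paper's actual lists.

```python
# Placeholder affix lists; the paper uses 8 manually identified Bengali prefixes and 20 suffixes.
ACCEPTABLE_PREFIXES = {"a", "be", "niro"}
ACCEPTABLE_SUFFIXES = {"ta", "ti", "gulo", "ra"}

def token_features(token, pos, freq, n=4):
    """Build the feature columns for one token, as described in Section IV-C.1."""
    prefix = token[:n] if len(token) >= n else "NIL"      # first n characters, else NIL
    suffix = token[-n:] if len(token) >= n else "NIL"     # last n characters, else NIL
    length_flag = 1 if len(token) > 3 else 0              # very short words are rarely proper nouns
    freq_bin = 0 if freq < 100 else 1 if freq < 400 else 2  # coarse frequency bins (third bin assumed)
    has_prefix = int(any(token.startswith(p) for p in ACCEPTABLE_PREFIXES))
    has_suffix = int(any(token.endswith(s) for s in ACCEPTABLE_SUFFIXES))
    return [token, pos, prefix, suffix, length_flag, freq_bin, has_prefix, has_suffix]

print(token_features("rabindranath", "NNP", freq=37))
```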

2) Feature Extraction: The input file is a preprocessed Bengali text document. The training and test files consist of multiple tokens, and each token consists of multiple (but a fixed number of) columns that are referenced by a template file. Each token must be represented on one line, with the columns separated by white space (spaces or tab characters); a sequence of tokens forms a sentence. The template file specifies the feature selection. Before training and testing with the CRF, the input document is converted into a multi-token file with fixed columns, and the template file defines the feature combination and selection. Two standard files of multiple tokens with fixed columns are created: one for training and one for testing. In the training file, the last column is tagged with the identified MWEs, marking "B-MWE" for the beginning of an MWE, "I-MWE" for the rest of the MWE, and "O" for tokens which are not part of an MWE. In the test file we can either use the same tagging for comparison or simply "O" for all tokens, regardless of whether they are MWEs or not.
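For illustration, a short sketch (reusing the hypothetical feature columns from the previous snippet) writes such a token-per-line, tab-separated file with "B-MWE"/"I-MWE"/"O" in the last column and a blank line between sentences, which is the layout CRF++ expects.

```python
def write_crf_file(path, sentences):
    """sentences: list of sentences; each token is (feature_columns, label),
    where label is 'B-MWE', 'I-MWE' or 'O'."""
    with open(path, "w", encoding="utf-8") as out:
        for sentence in sentences:
            for columns, label in sentence:
                out.write("\t".join(str(c) for c in columns) + "\t" + label + "\n")
            out.write("\n")   # blank line separates sentences

# Hypothetical training fragment: "taser ghar" tagged as an MWE, "bhenge" outside it.
train = [[
    (["taser", "NN", "tase", "aser", 1, 0, 0, 0], "B-MWE"),
    (["ghar", "NN", "ghar", "ghar", 1, 1, 0, 0], "I-MWE"),
    (["bhenge", "VB", "bhen", "enge", 1, 1, 0, 0], "O"),
]]
write_crf_file("train.data", train)
```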

TABLE III
NOTATION USED IN THE RESULT SECTION

Notation    | Meaning
W[-i,+j]    | Words spanning from the ith left position to the jth right position
POS[-i,+j]  | POS tags of the words spanning from the ith left position to the jth right position
Pre         | Prefix of the word
Suf         | Suffix of the word

3) Post-processing: The phrases tagged by the CRF as MWEs are fed to a further post-processing step. To find correct Nominal MWEs, our first intuition was that all the terms present in an MWE should together form a single nominal chunk. After inspecting the output on the development set, we saw that the CRF tagged a few phrases as MWEs whose constituent terms belong to multiple chunks in the parsed corpus. In the post-processing step, we prune all tagged MWEs whose constituents belong to multiple chunks. Note that we rely entirely on the chunking information produced by the Bengali shallow parser. In the evaluation phase, we will see that the recall drops noticeably after post-processing due to wrong chunking by the parser, though a considerable amount of precision is gained by this pruning.
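A sketch of this chunk-consistency filter, assuming the shallow parser output has been reduced to (token, chunk id, BIO label) triples; the data structure is our assumption, not the paper's actual representation.

```python
def prune_multichunk_mwes(tagged_tokens):
    """tagged_tokens: list of (token, chunk_id, label) triples with BIO MWE labels.
    Re-labels as 'O' any MWE whose constituents do not all share one chunk id."""
    result = list(tagged_tokens)
    i = 0
    while i < len(result):
        if result[i][2] == "B-MWE":
            j = i + 1
            while j < len(result) and result[j][2] == "I-MWE":
                j += 1
            chunk_ids = {result[k][1] for k in range(i, j)}
            if len(chunk_ids) > 1:          # MWE spans multiple chunks: discard the tagging
                for k in range(i, j):
                    result[k] = (result[k][0], result[k][1], "O")
            i = j
        else:
            i += 1
    return result
```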

V. EVALUATION

A. Evaluation Metrics

In order to evaluate our system, we use standard Information Retrieval based metrics: Precision, Recall and F-score. For our present task, they are defined below.

TABLE IV
FEW RESULTS OF THE FEATURE TUNING EXPERIMENT OVER THE DEVELOPMENT SET
(every combination also includes Length, word frequency, acceptable prefix and acceptable suffix)

Features                                      | Precision | Recall | F-score
W[-3,+3], POS[-3,+3], |Pre| <= 4, |Suf| <= 4  | 60.28     | 84.58  | 70.39
W[-4,+4], POS[-4,+4], |Pre| <= 5, |Suf| <= 5  | 58.29     | 78.59  | 66.93
W[-4,+3], POS[-4,+3], |Pre| <= 4, |Suf| <= 4  | 57.89     | 75.65  | 65.59
W[-3,+4], POS[-5,+4], |Pre| <= 4, |Suf| <= 4  | 51.30     | 72.26  | 60.00
W[-2,+2], POS[-2,+2], |Pre| <= 3, |Suf| <= 3  | 45.62     | 68.98  | 54.92
W[-4,+4], POS[-4,+4], |Pre| <= 4, |Suf| <= 4  | 52.32     | 77.62  | 62.51
W[-5,+4], POS[-4,+3], |Pre| <= 5, |Suf| <= 5  | 38.69     | 49.63  | 43.48
W[-5,+5], POS[-5,+5], |Pre| <= 6, |Suf| <= 6  | 37.78     | 48.23  | 42.37
W[-2,+3], POS[-2,+3], |Pre| <= 4, |Suf| <= 4  | 47.60     | 65.30  | 55.06
W[-3,+2], POS[-3,+2], |Pre| <= 4, |Suf| <= 4  | 48.77     | 62.23  | 54.68

The precision of the system is defined as the number of correctly tagged MWEs as a ratio of the total number of MWEs tagged by the system:

Precision(P) = \frac{\text{number of correct taggings}}{\text{total number of MWEs tagged by the system}}    (5)

The recall of the system is defined as its accuracy with respect to all correct MWEs in a given document:

Recall(R) = \frac{\text{number of correct taggings}}{\text{total number of actual MWEs in the document}}    (6)

The F-score (F1) is a trade-off between Precision and Recall, defined as their harmonic mean (when equal weight is given to both):

F\text{-}Score = \frac{2 \times Precision \times Recall}{Precision + Recall}    (7)
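Equations (5)-(7) amount to a span-level comparison of the system output against the gold standard; a brief sketch, assuming both are available as sets of (start, end) MWE spans:

```python
def precision_recall_fscore(system_spans, gold_spans):
    """system_spans, gold_spans: sets of (start, end) MWE spans.
    Returns (precision, recall, F-score) per equations (5)-(7)."""
    correct = len(system_spans & gold_spans)              # correctly tagged MWEs
    precision = correct / len(system_spans) if system_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Toy example: 3 of 4 system spans are correct, 5 gold spans in the document.
print(precision_recall_fscore({(0, 2), (5, 7), (9, 11), (14, 15)},
                              {(0, 2), (5, 7), (9, 11), (20, 22), (30, 32)}))
# -> (0.75, 0.6, 0.666...)
```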

B. Best Feature Selection

In order to select the best features, we varied the feature parameters extensively while changing the combinations of the suggested features. We started the combinations from four words preceding and four words following a given word, the POS tags of the previous four and following four words, prefixes and suffixes of up to four characters, word length, frequency, and the acceptable prefixes and suffixes. From these experiments we were able to find the best feature selection for the CRF. Table III explains the notation used below, and a few of the tuned feature sets with their corresponding performance are reported in Table IV. The best combination over the development dataset is as follows:

F = {W_{i-3}, W_{i-2}, W_{i-1}, W_i, W_{i+1}, W_{i+2}, W_{i+3}, POS tags of the current word and the 3 preceding and following words, |prefix| <= 4, |suffix| <= 4, word length, word frequency, acceptable prefix, acceptable suffix}
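In CRF++ such a window-based combination is expressed through a feature template file. The sketch below generates unigram templates for the word and POS columns of a [-3,+3] window; the column indices (word in column 0, POS in column 1) are our assumed token-file layout, not prescribed by the paper, and the remaining features (prefix, suffix, length, frequency, affix flags) would get analogous single-column templates.

```python
def write_template(path, word_col=0, pos_col=1, window=3):
    """Emit CRF++ unigram templates for word and POS features over a [-window, +window] context."""
    lines, uid = [], 0
    for offset in range(-window, window + 1):
        lines.append(f"U{uid:02d}:%x[{offset},{word_col}]")   # surrounding words
        uid += 1
    for offset in range(-window, window + 1):
        lines.append(f"U{uid:02d}:%x[{offset},{pos_col}]")    # surrounding POS tags
        uid += 1
    lines.append("B")   # bigram template over the output labels
    with open(path, "w", encoding="utf-8") as out:
        out.write("\n".join(lines) + "\n")

write_template("template")
```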

C. Performance Analysis

We have compared our results with the results obtained from the statistical measures of Chakraborty and Bandyopadhyay [11]. The parameter of the statistical method is set to 0.5 (candidates with a combined statistical score >= 0.5 are labeled as MWEs). It is worth noting that the statistical method was reported for Noun-Noun bigram MWEs, but we extend it to n-grams to make it comparable with the proposed algorithm. Table V shows that precision, recall and F-score all improve significantly in the CRF based approach. The reason is quite obvious: the statistical method carries no signature of the contextual words, which actually play an important role in any token labeling task. Moreover, the hidden interdependencies between the candidate phrase and the surrounding words act as a significant clue for deciding whether it occurs naturally in that context or merely by chance; this information is captured very effectively by the CRF based model. Furthermore, we evaluate our system before and after the post-processing step. As shown in Table V, a small percentage of recall is lost after post-filtering while precision increases reasonably, which in turn increases the F-score of the system. While searching for the probable cause of the decrease in recall, we noticed that some of the true MWEs present in the documents are parsed wrongly by the shallow parser. Because of the morphological peculiarities of the Bengali language, the shallow parser splits some valid chunks into distinct chunks; we found that 1.5% of recall is lost for this reason.

TABLE V
COMPARISON OF OUR RESULTS (CRF+PP: WITH POST-PROCESSING, CRF-PP: WITHOUT POST-PROCESSING) WITH THE STATISTICAL SIMILARITY MEASURE (STAT) ON THE TEST SET

System  | Precision (%) | Recall (%) | F-score (%)
STAT    | 46.23         | 78.56      | 58.21
CRF-PP  | 60.88         | 80.69      | 69.39
CRF+PP  | 65.72         | 78.90      | 71.70

VI. PERFORMANCE IMPROVEMENT USING REDUPLICATED MWES

Nongmeikapam and Bandyopadhyay (2010) [17] mentioned that prior tagging of reduplicated phrases can improve the processing of MWE tagging by CRF, and showed a significant improvement in accuracy for the Manipuri language. We have applied the same idea to tag Reduplicated MWEs (RMWEs) in Bengali and use them as a feature of the CRF model. Chakraborty and Bandyopadhyay (2010) [12] have already extracted all types of reduplicated MWEs from a Bengali corpus using clues from morphological patterns. All types of reduplication, namely onomatopoeic, complete, partial and correlative reduplication, can be identified by this system; for semantic reduplication, however, a standard dictionary is needed to handle the synonymy and antonymy patterns between the constituents. We set up exactly the same experiment as discussed in their paper [12] and extracted the RMWEs. The outputs of this phase are marked with "B-RMWE" for the beginning and "I-RMWE" for the rest of the RMWE, and "O" for non-RMWE tokens. This output is placed as a new column in the multi-token file for both the training and testing phases of the CRF. The CRF toolkit is run again and compared with the previous output. The results are shown in Table VI and signify an improvement in performance compared to the previous model.
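As a rough illustration, the sketch below flags complete (bara bara) and partial (thakur thukur) reduplications from surface form alone and emits the B-RMWE/I-RMWE/O column described above. The real system of [12] relies on richer Bengali morphological clues; the character-overlap heuristic and its threshold here are our own assumptions.

```python
def tag_reduplications(tokens, min_overlap=2):
    """Naively mark adjacent token pairs that look like complete or partial
    reduplications with B-RMWE/I-RMWE; all other tokens get O."""
    labels = ["O"] * len(tokens)
    for i in range(len(tokens) - 1):
        w1, w2 = tokens[i], tokens[i + 1]
        complete = w1 == w2                                       # e.g. bara bara
        shared_chars = sum(a == b for a, b in zip(w1, w2))        # position-wise matches
        partial = (not complete and len(w1) == len(w2)
                   and w1[0] == w2[0] and shared_chars >= min_overlap)  # e.g. thakur thukur
        if complete or partial:
            labels[i], labels[i + 1] = "B-RMWE", "I-RMWE"
    return labels

print(tag_reduplications(["bara", "bara", "kotha"]))      # complete reduplication
print(tag_reduplications(["thakur", "thukur", "kore"]))   # partial reduplication
```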

TABLE VI
RESULTS ON THE TEST SET USING REDUPLICATED MWES AS A FEATURE OF CRF

System  | Precision (%) | Recall (%) | F-score (%)
CRF-PP  | 62.40         | 82.95      | 71.22
CRF+PP  | 67.98         | 80.01      | 73.01

VII. CONCLUSION

In this experiment, we have incorporated CRF to identify Nominal MWEs in a Bengali corpus. We used various morphological features and tuned our system to obtain a better feature set. We included reduplicated MWEs as a feature of the CRF, which showed a reasonable improvement in terms of Precision and F-score. We have also seen that the limitations of the basic morphological tools (e.g., the shallow parser) can significantly degrade the overall performance of the system. Besides experimenting with MWEs, we are developing a preliminary version of a Bengali dependency parser, and we plan to include clausal dependency information from this parser along with the existing feature set. We also plan to handle other types of MWEs in Bengali, such as Verbal MWEs and Prepositional MWEs. Moreover, we will try to use a hybrid approach in the post-processing step to prune false positive candidates and make the final list rich with all relevant MWEs.

REFERENCES

[1] I. Sag, T. Baldwin, F. Bond, A. Copestake and D. Flickinger, "Multiword Expressions: A Pain in the Neck for NLP", In Proceedings of the Conference on Intelligent Text Processing and Computational Linguistics (CICLING), pp. 1-15, 2002.

[2] T. Chakraborty, S. Pal, T. Mondal, T. Saikh, and S. Bandyopadhyay, "Shared task system description: Measuring the Compositionality of Bigrams using Statistical Methodologies", In Proceedings of the Workshop on Distributional Semantics and Compositionality (DiSCo), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA, pp. 38-42, June 24, 2011.

[3] K. W. Church and P. Hanks, "Word Association Norms, Mutual Information and Lexicography", Computational Linguistics, 16(1), pp. 22-29, 1990.

[4] T. Dunning, "Accurate Methods for the Statistics of Surprise and Coincidence", Computational Linguistics, pp. 61-74, 1993.

[5] A. Kilgarriff and J. Rosenzweig, "Framework and results for English SENSEVAL", Computers and the Humanities, Senseval Special Issue, 34(1-2), pp. 15-48, 2000.

[6] F. A. Kunchukuttan and O. P. Damani, "A System for Compound Noun Multiword Expression Extraction for Hindi", In Proceedings of the 6th International Conference on Natural Language Processing (ICON), pp. 20-29, 2008.

[7] D. Das, S. Pal, T. Mondal, T. Chakraborty, and S. Bandyopadhyay, "Automatic Extraction of Complex Predicates in Bengali", In Proceedings of the Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), The 23rd International Conference on Computational Linguistics (COLING), Beijing, China, pp. 37-45, August 28, 2010.

[8] T. Chakraborty, D. Das, and S. Bandyopadhyay, "Semantic Clustering: an Attempt to Extract Multiword Expressions in Bengali", In Proceedings of the Workshop on Multiword Expressions: from Parsing and Generation to the Real World (MWE 2011), The 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), Portland, Oregon, USA, pp. 8-11, June 23, 2011.

[9] S. Venkatapathy and A. Joshi, "Measuring the relative compositionality of verb-noun (V-N) collocations by integrating features", In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), Association for Computational Linguistics, pp. 899-906, 2005.

[10] A. Agarwal, B. Ray, M. Choudhury, S. Sarkar, and A. Basu, "Automatic Extraction of Multiword Expressions in Bengali: An Approach for a Miserly Resource Scenario", In Proceedings of the International Conference on Natural Language Processing (ICON), pp. 165-174, 2004.

[11] T. Chakraborty and S. Bandyopadhyay, "Identification of Noun-noun (N-N) collocation as multiword expression in Bengali corpus", In Proceedings of the 8th International Conference on Natural Language Processing (ICON), India, 2010.

[12] T. Chakraborty and S. Bandyopadhyay, "Identification of Reduplication in Bengali Corpus and their Semantic Analysis: A Rule Based Approach", In Proceedings of the 23rd International Conference on Computational Linguistics (COLING 2010), Workshop on Multiword Expressions: from Theory to Applications (MWE 2010), Beijing, China, pp. 72-75, 2010.

[13] A. Ekbal, R. Haque, and S. Bandyopadhyay, "Maximum Entropy Based Bengali Part of Speech Tagging", In Advances in Natural Language Processing and Applications, Research in Computing Science, pp. 67-78, 2008.

[14] C. Zhang, H. Wang, Y. Liu, D. Wu, Y. Liao, and B. Wang, "Automatic Keyword Extraction from Documents Using Conditional Random Fields", Journal of Computational Information Systems, 4(3), pp. 1169-1180, 2008.

[15] J. Cohen, "A coefficient of agreement for nominal scales", Educational and Psychological Measurement, vol. 20, pp. 37-46, 1960.

[16] R. J. Passonneau, "Measuring agreement on set-valued items (MASI) for semantic and pragmatic annotation", In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), 2006.

[17] K. Nongmeikapam and S. Bandyopadhyay, "Identification of MWEs Using CRF in Manipuri and Improvement Using Reduplicated MWEs", In Proceedings of the 8th International Conference on Natural Language Processing (ICON), India, 2010.