Using Mutual Information Technique in Cross Language Information Retrieval Syandra Sari Mirna Adriani Trisakti University University of Indonesia [email protected] [email protected] ICADL 2008 Bali, 3 Sept 2008


Using Mutual Information Technique in Cross Language Information Retrieval

Syandra Sari, Trisakti University

Mirna Adriani, University of Indonesia

[email protected] [email protected]

ICADL 2008, Bali, 3 Sept 2008

Syandra Sari 2

Outline

1. Introduction

2. CLIR

3. Related Research

4. Our Work

5. Mutual Information & QE

6. Experiment

7. Result and Analysis

8. Conclusion

Introduction

• The explosive growth of the World Wide Web

• A multilingual world on the Internet

→ These stimulated the CLIR research area

CLIR

• The challenge in this area is to overcome the language barrier between the query and the document collection.

• CLIR needs to transform queries and documents into a common representation so that monolingual IR techniques can be applied.

CLIR

Two approaches in CLIR:

• translate the query into the language of the documents, or

• translate the documents into the language of the query.

CLIR

Techniques for the translation process:

1. Machine translation

2. Bilingual dictionary

3. Parallel or comparable corpus

Related Research

• Yiming Yang et al. (1998):

– Created a corpus-based term-equivalence matrix extracted automatically from bilingual corpora

– English-Chinese CLIR and English-Japanese CLIR

• Lavrenko et al. (2002):

– Applied a language model for CLIR using a parallel corpus

– Chinese-English CLIR

• Martin Braschler (2004):

– Used similarity thesauri for query translation

– CLIR for several European languages (English, Italian, German, French, Spanish)

CLIR for Indonesian Language

• In our earlier study (Hayurani et al., 2006):

– Indonesian-English CLIR

– Machine translation → 84.82% of monolingual performance

– Bilingual dictionary → 51.98% of monolingual performance

– Parallel corpus (using bilingual dictionary) → 8% of monolingual performance

Our Work

• Our work performs the translation process for Indonesian-English CLIR using

– Parallel corpus

• Pseudo-translation

• Mutual information technique

– Dictionary

– Machine translation

• It aims to evaluate the language resources and tools available for the Indonesian-English pair

Mutual Information

• In monolingual IR, mutual information has been used for finding word associations (Church, 1990).

• In our work:

• It measures the degree of association between an Indonesian word and an English word.

• The word pair with the highest mutual information value is considered the best word pair.

Mutual Information

• The mutual information of two points (words), x and y, is defined as:

$$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$$

• P(x) is the occurrence probability of word x

• P(y) is the occurrence probability of word y

• P(x,y) is the probability that the words x and y occur together

• I(x,y) is the mutual information value
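As a small worked illustration of this definition, with made-up probabilities that are not taken from the experiments: suppose P(x) = 0.01, P(y) = 0.02 and P(x,y) = 0.005. The pair then co-occurs 25 times more often than independence would predict:

$$I(x, y) = \log_2 \frac{0.005}{0.01 \times 0.02} = \log_2 25 \approx 4.64$$

If the two words were independent, P(x,y) would equal P(x)P(y) and the value would be log2(1) = 0. Positive values therefore indicate association, and negative values indicate that the words co-occur less often than chance.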

Mutual Information

• In (Church, 1990) and (Myung-Gill, 1999), the mutual information value is computed from word co-occurrence statistics and can be defined as follows:

$$I(x, y) = \log_2 \frac{N \cdot f(x, y)}{f(x)\,f(y)}$$

• f(x) is the number of documents containing x in a corpus;

• f(y) is the number of documents containing y in the corpus;

• f(x,y) is the number of documents containing both x and y in the corpus;

• N is the number of items or words in the corpus
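The count-based formula above can be sketched in a few lines of Python. This is only an illustration on a hypothetical four-document corpus: the function name `mutual_information`, the helper `doc_freq` and the data are assumptions, and N is taken here to be the number of documents (the "items"), which is one possible reading of the slide.

```python
import math

def mutual_information(f_x, f_y, f_xy, n):
    """I(x, y) = log2(N * f(x, y) / (f(x) * f(y))).

    Returns -inf when the two words never co-occur, since log2(0) is undefined.
    """
    if f_xy == 0:
        return float("-inf")
    return math.log2(n * f_xy / (f_x * f_y))

# Hypothetical toy corpus: each document is represented as a set of words.
docs = [
    {"stock", "market", "price"},
    {"stock", "market"},
    {"market", "garden"},
    {"garden", "flower"},
]

def doc_freq(*words):
    """Number of documents that contain all of the given words."""
    return sum(1 for d in docs if all(w in d for w in words))

n = len(docs)  # assumption: N counts documents, not word tokens
mi_stock_market = mutual_information(
    doc_freq("stock"), doc_freq("market"), doc_freq("stock", "market"), n)
mi_market_garden = mutual_information(
    doc_freq("market"), doc_freq("garden"), doc_freq("market", "garden"), n)
mi_stock_garden = mutual_information(
    doc_freq("stock"), doc_freq("garden"), doc_freq("stock", "garden"), n)

print(mi_stock_market, mi_market_garden, mi_stock_garden)
```

In this toy data "stock" and "market" share two of four documents and get a positive score, "market" and "garden" share only one and get a negative score, and "stock" and "garden" never co-occur.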

Mutual Information

• We adapted the formula for the Indonesian-English parallel corpus:

$$I(x, y) = \log_2 \left( \frac{D(x, y)}{(D(x) + 1)\,(D(y) + 1)} \cdot (N_{\text{Indonesian}} + N_{\text{English}}) \right)$$

Mutual Information

• D(x) is the number of Indonesian documents that contain Indonesian word x (exclusive);

• D(y) is the number of English documents that contain English word y (exclusive);

• D(x,y) is the number of Indonesian-English document pairs that contain Indonesian word x and English word y;

• N_Indonesian is the number of items or words in the Indonesian corpus;

• N_English is the number of items or words in the English corpus
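To make the adapted formula concrete, here is a minimal Python sketch that builds a ranked bilingual word list from a tiny, hypothetical parallel corpus. The function `rank_translations` and the data are assumptions for illustration. D(x) and D(y) are computed as plain document frequencies (the "exclusive" qualifier on the slide is not modelled), and N_Indonesian and N_English are taken as token counts.

```python
import math
from collections import defaultdict

# Hypothetical aligned document pairs: (Indonesian tokens, English tokens).
parallel_corpus = [
    (["merek", "visa"], ["trademark", "visa"]),
    (["merek", "bahaya"], ["trademark", "danger"]),
    (["bahaya"], ["danger"]),
    (["visa"], ["visa"]),
]

def rank_translations(corpus, top_k=1):
    """For each Indonesian word, return its top_k English words by adapted MI."""
    d_x = defaultdict(int)   # Indonesian documents containing x
    d_y = defaultdict(int)   # English documents containing y
    d_xy = defaultdict(int)  # document pairs containing both x and y
    n_id = n_en = 0          # assumption: N counts word tokens on each side
    for id_tokens, en_tokens in corpus:
        n_id += len(id_tokens)
        n_en += len(en_tokens)
        id_words, en_words = set(id_tokens), set(en_tokens)
        for x in id_words:
            d_x[x] += 1
        for y in en_words:
            d_y[y] += 1
        for x in id_words:
            for y in en_words:
                d_xy[(x, y)] += 1

    candidates = defaultdict(list)
    for (x, y), count in d_xy.items():
        # I(x, y) = log2( D(x,y) / ((D(x)+1) * (D(y)+1)) * (N_id + N_en) )
        score = math.log2(count / ((d_x[x] + 1) * (d_y[y] + 1)) * (n_id + n_en))
        candidates[x].append((score, y))

    # Highest score first; the first-ranked word is treated as the translation.
    return {x: [y for _, y in sorted(c, reverse=True)[:top_k]]
            for x, c in candidates.items()}

bilingual_list = rank_translations(parallel_corpus)
print(bilingual_list)
```

Only word pairs that co-occur in at least one aligned document pair are scored, so the logarithm is always defined. Passing a larger `top_k` gives a ranked candidate list per Indonesian word instead of a single best word.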

Query Expansion

• Query expansion is the process of adding to the query words found in a certain number of the top-ranked English documents retrieved

• We used a language model formula to choose the best words to add to the query
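The slides do not give the language model formula used to pick expansion words, so the following Python sketch substitutes a common stand-in: each candidate word w from the top-ranked documents is scored by P(w|F) * log(P(w|F) / P(w|C)), where F is the set of feedback documents and C is the whole collection. The function `expand_query`, its parameters and the toy data are all hypothetical.

```python
import math
from collections import Counter

def expand_query(query_terms, top_docs, collection_docs, n_terms=2):
    """Append the n_terms best-scoring words from top_docs to the query.

    Assumes top_docs is a subset of collection_docs, so every feedback word
    has a non-zero collection probability.
    """
    feedback = Counter(w for doc in top_docs for w in doc)
    collection = Counter(w for doc in collection_docs for w in doc)
    f_total = sum(feedback.values())
    c_total = sum(collection.values())

    scores = {}
    for word, tf in feedback.items():
        if word in query_terms:
            continue  # do not re-add words already in the query
        p_f = tf / f_total                 # P(w | feedback documents)
        p_c = collection[word] / c_total   # P(w | collection)
        scores[word] = p_f * math.log(p_f / p_c)

    # Highest score first; ties are broken alphabetically for determinism.
    best = sorted(scores, key=lambda w: (-scores[w], w))[:n_terms]
    return list(query_terms) + best

# Hypothetical collection; the first two documents play the top-ranked results.
collection = [
    ["nestle", "brand", "chocolate", "market"],
    ["nestle", "brand", "coffee", "market"],
    ["football", "market", "season"],
    ["football", "season", "coach"],
]
expanded = expand_query(["nestle"], collection[:2], collection, n_terms=2)
print(expanded)
```

Words that are frequent in the feedback documents but rare in the rest of the collection score highest, which is why "brand" outranks the more widespread "market" in this toy example.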

Experiment (1)

• Query

– 50 Indonesian queries from CLEF 2006

• Indonesian: Merek Nestle. Merek-merek apa yang dipasarkan oleh Nestle di seluruh dunia

• English: Nestlé Brands. What brands are marketed by Nestlé around the world

• Collection

– 169,478 English documents from CLEF 2006 (Glasgow Herald 1995, Los Angeles Times 1994)

Experiment (2)

Building Indonesian-English Parallel Corpus

[Diagram: building the parallel corpus. Elements shown: English corpus, Machine Translation, Indonesian corpus, Indonesian-English parallel corpus.]

Experiment (3)

[Diagram: retrieval pipeline. Elements shown: Indonesian queries; building the bilingual word list using mutual information and the translation process, with the Indonesian-English parallel corpus as input; English queries; IRS; relevant documents.]

Experiment (4)

Queries were also translated using:

• Bilingual dictionary

• Machine translations:

– Toggletext

– Transtool

• Pseudo-translation based on parallel corpus

Result and Analysis

• The Mean Average Precision (MAP) of

– monolingual English queries,

– the Indonesian queries translated using

• bilingual dictionary

• Toggletext machine translation

• Transtool machine translation

Technique                      MAP
Monolingual                    0.3242
Dictionary                     0.1685 (-48.02%)
M. Translation (Toggletext)    0.2750 (-15.18%)
M. Translation (Transtool)     0.2529 (-22.00%)
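The percentages in parentheses in the MAP table read as the change relative to the monolingual run. A short Python check of that arithmetic, assuming the relation (MAP - baseline) / baseline and using only the values reported in the table (the name `relative_change` is illustrative):

```python
def relative_change(map_value, baseline):
    """Percentage change of a run's MAP relative to the baseline MAP."""
    return (map_value - baseline) / baseline * 100.0

monolingual = 0.3242
runs = {
    "Dictionary": 0.1685,
    "M. Translation (Toggletext)": 0.2750,
    "M. Translation (Transtool)": 0.2529,
}

changes = {name: round(relative_change(value, monolingual), 2)
           for name, value in runs.items()}
for name, change in changes.items():
    print(f"{name}: {change:+.2f}%")
```

Recomputing from the four-decimal MAP values reproduces the reported figures to within about 0.01 percentage points; the small differences are presumably because the slide's percentages were derived from unrounded MAP scores.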

Result and Analysis

The Mean Average Precision (MAP) of the Indonesian queries translated using the parallel corpus: pseudo-translation and mutual information technique

Technique                                                    MAP
Parallel-Pseudo Translation (PT)                             0.2245 (-30.75%)
Parallel-Mutual Information (MI)                             0.1085 (-66.53%)
Parallel-Mutual Information with query expansion (MI-QE)     0.1357 (-58.14%)

Result and Analysis

• English word taken from the highest MI value (first rank)

• 110 of 260 words

Indonesian word    English word from MI (first rank)
merek              trademark
keadaan            circumstance
pengangguran       unemployment
visa               visa
bahaya             danger

Result and Analysis

• English word taken from the 2nd to 5th MI value

• 51 words

Indonesian word    English words from MI (1st to 5th rank)                Correct English word    Rank
africa             afrikan, african, kwazulu, AFRICA, gatsha              AFRICA                  4
perawatan          aftercare, surgery, TREATMENT, carefree, arthritis     TREATMENT               3
bijaksana          discreet, PRUDENT, wiser, wisdom, indiscreet           PRUDENT                 2
main               romp, overplay, PLAY, playground, nut                  PLAY                    3

Result and Analysis

• Examples of Indonesian words that received wrong English words as translations

• 44 words

Indonesian word    English words from MI (first to fifth rank)
pengaruh           relentless, unsuspected, mindful, favor, abscond
produksi           latch, lug, lubricant, rapacity, sewage
universitas        conniption, petal, ulna, grantee, packager

Result and Analysis

• 55 Indonesian words did not get the best or correct English word because of the stemming process, for example:

1. Some Indonesian words get an antonym in English.

Example: “adil” was translated into “unjustice” using MI (“unjustice” is the antonym of “justice”).

2. Some Indonesian words get a related word in English.

Example: “matahari” was translated into “sunlight” and “solar” using MI (“sunlight” and “solar” are related to “sun”).

Result and Analysis

3. Some Indonesian words don’t get the best translation.

Example: “putri” was translated into “daughter” using MI; the best English translation for “putri” is “princess”.

4. Some Indonesian words get an English word that is the translation of another variant of the Indonesian word.

Example: “acara” was translated into “lawyer” using MI. “Lawyer” means “pengacara” in Indonesian, and “pengacara” is one of the variants of the word “acara”.

Result and Analysis

• There are 9 phrases, but only 3 of them get a correct English translation

Indonesian phrase       In English        Using Mutual Information
bahan bakar             fuel              firewood explosive
energi atom             atomic energy     atomic energy
gerhana matahari        solar eclipse     solar eclipse
gempa bumi              earthquake        quake
lintas alam             cross country     undergo straightaway
perdana menteri         prime minister    morihiro
sidang pengadilan       trial             unjustice convocation
uang sekolah            tuition           tuition
undang-undang dasar     constitution      basic invitation

Conclusion

• We find that mutual information can rank candidate words in a good order by their mutual information value to find the best translation; however, it sometimes gives an incorrect translation.

• Machine translation is the best translation method so far.

• Based on the results of our evaluation, there is still room for improvement in exploiting the parallel corpus, which we will explore in future work.

Future Research

• Explore better techniques than mutual information for finding bilingual word pairs.

• Apply a better query expansion technique to improve the results.
