Iterative Translation Disambiguation for Cross-Language Information Retrieval


Page 1: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Intelligent Database Systems Lab

國立雲林科技大學 National Yunlin University of Science and Technology

Iterative Translation Disambiguation for Cross-Language Information Retrieval

Advisor : Dr. Hsu

Presenter : Yu-San Hsieh

Author : Christof Monz and Bonnie J. Dorr

2005.SIGIR.520-527

Page 2: Iterative Translation Disambiguation for Cross-Language Information Retrieval

2

Intelligent Database Systems Lab

N.Y.U.S.T.

I. M.

Outline

─ Motivation
─ Objective
─ Approach
─ Experiment Result
─ Introduction
─ Experiment
─ Conclusions

Page 3: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Motivation

Many words or phrases in one language can be translated into another language in a number of ways, so translation ambiguity is very common, and it reduces the effectiveness of information retrieval.

Example: Penalty (English) can be translated as Elfmeter (soccer) or Strafe (punishment).

Page 4: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Objective

Find a proper distribution of translation probabilities that resolves the translation ambiguity problem.

Page 5: Iterative Translation Disambiguation for Cross-Language Information Retrieval

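Approach

Find a proper distribution of translation probabilities.

Computing term weights

─ Initialization step
─ Iteration step
─ Normalization step
─ All term weights are collected in a vector
─ Iteration stop

[Figure: the query terms trade, union, europe and their candidate translations: trade → gewerbe, geschaeft, handel; union → union, gewerkschaft; europe → europa]

[Equation: worked example (ex) of a term weight w_T(t | s_i); the numeric terms are garbled in this transcript]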
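
The four steps listed above can be sketched in Python. The slide names the steps but their formulas did not survive in this transcript, so the update rule below is an assumption: each candidate keeps its previous weight and gains support from the candidates of the other query terms, scaled by an association score. The function name `disambiguate`, the `threshold` and `max_iter` parameters, and all association values are hypothetical.

```python
def disambiguate(candidates, assoc, threshold=1e-3, max_iter=100):
    """candidates[i] lists the candidate translations of source term i.
    assoc(t, u) returns a non-negative association strength between
    two target-language terms (assumed symmetric)."""
    # Initialization step: uniform weights over each term's candidates.
    weights = [{t: 1.0 / len(ts) for t in ts} for ts in candidates]
    for _ in range(max_iter):
        new_weights = []
        for i, ts in enumerate(candidates):
            updated = {}
            for t in ts:
                # Iteration step (assumed form): previous weight plus
                # support from candidates of the other source terms.
                support = sum(
                    assoc(t, u) * weights[j][u]
                    for j, us in enumerate(candidates) if j != i
                    for u in us
                )
                updated[t] = weights[i][t] + support
            # Normalization step: weights of one source term sum to 1.
            total = sum(updated.values())
            new_weights.append({t: w / total for t, w in updated.items()})
        # Treat all term weights as one vector and measure its change.
        change = sum(
            abs(new_weights[i][t] - weights[i][t])
            for i, ts in enumerate(candidates) for t in ts
        )
        weights = new_weights
        # Iteration stop: the weights have (almost) stopped changing.
        if change < threshold:
            break
    return weights


# Hypothetical association scores for the "trade union europe" example.
ASSOC = {
    frozenset(("handel", "europa")): 2.0,
    frozenset(("handel", "union")): 1.0,
    frozenset(("gewerkschaft", "europa")): 2.0,
    frozenset(("geschaeft", "union")): 0.2,
}
weights = disambiguate(
    [["gewerbe", "geschaeft", "handel"], ["union", "gewerkschaft"], ["europa"]],
    lambda t, u: ASSOC.get(frozenset((t, u)), 0.0),
)
```

With these made-up scores, `handel` ends up as the highest-weighted translation of "trade" because it is associated with candidates of both other query terms.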

Page 6: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Approach

Measuring association strength

─ Pointwise mutual information
─ Dice coefficient
─ Log likelihood ratio
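
The slide only names the three measures. As a rough illustration, the sketch below uses their standard textbook definitions over co-occurrence counts; the exact variants used in the paper may differ. The argument names (`f_xy` for the joint count, `f_x` and `f_y` for the individual counts, `n` for the number of observation units) are assumptions.

```python
import math


def pmi(f_xy, f_x, f_y, n):
    """Pointwise mutual information: log2(P(x,y) / (P(x) * P(y)))."""
    if f_xy == 0:
        return float("-inf")
    return math.log2((f_xy * n) / (f_x * f_y))


def dice(f_xy, f_x, f_y):
    """Dice coefficient: 2 * f(x,y) / (f(x) + f(y))."""
    return 2.0 * f_xy / (f_x + f_y)


def log_likelihood_ratio(f_xy, f_x, f_y, n):
    """Log likelihood ratio over the 2x2 contingency table of x and y:
    2 * sum(observed * ln(observed / expected))."""
    observed = [f_xy, f_x - f_xy, f_y - f_xy, n - f_x - f_y + f_xy]
    rows = [f_x, n - f_x]
    cols = [f_y, n - f_y]
    expected = [rows[r] * cols[c] / n for r in (0, 1) for c in (0, 1)]
    return 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )
```

When two terms co-occur exactly as often as chance predicts (for instance `f_xy=10, f_x=100, f_y=100, n=1000`), pointwise mutual information and the log likelihood ratio are both zero.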

Page 7: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Experiment Result

[Figure: differences from the baseline for individual queries (topics); labels: baseline, improve]

Page 8: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Introduction

Two techniques for cross-language retrieval

─ Translate the document collection into the query language and apply monolingual retrieval
─ Translate the query into the target language and retrieve with the translated query

Three approaches may be used to produce the translations

─ Machine translation system
─ Dictionary
─ Parallel corpus used to estimate the translation probabilities

Page 9: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Introduction

A word in one language can be translated into another language in a number of ways.

─ Penalty (English) => Elfmeter (soccer) or Strafe (punishment)

Page 10: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Introduction

One approach to the word-selection problem is to use co-occurrences between terms.

Problem (with a larger number of terms)

─ Data sparseness
    Use very large corpora for counting co-occurrence frequencies
    Use internet search engines
    Smoothing
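
To make the co-occurrence idea concrete, here is a minimal sketch that counts how often pairs of terms appear in the same sentence. Counting at sentence level, the whitespace tokenization, and the three-sentence toy corpus are all assumptions for illustration; the slide does not say how co-occurrences were counted.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sentences):
    """Count term frequencies and sentence-level co-occurrences."""
    term_freq = Counter()
    pair_freq = Counter()
    for sentence in sentences:
        # Each term counts once per sentence; pairs are stored as
        # alphabetically sorted tuples so (a, b) and (b, a) coincide.
        terms = sorted(set(sentence.lower().split()))
        term_freq.update(terms)
        pair_freq.update(combinations(terms, 2))
    return term_freq, pair_freq


# Hypothetical toy corpus in the target language.
corpus = [
    "der handel in europa waechst",
    "die gewerkschaft in europa streikt",
    "handel und gewerkschaft in europa",
]
term_freq, pair_freq = cooccurrence_counts(corpus)
```

Pairs that never occur together, such as `("gewerbe", "handel")` here, simply get a count of zero. With many candidate terms and a small corpus most pairs look like this, which is the data-sparseness problem the slide refers to.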

Page 11: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Experiment

Test data

─ CLEF 2003 English-to-German bilingual data
─ 56 topics selected (title, description, narrative)

Morphological normalization

─ Source-language words (from the topics) are normalized to match entries in the bilingual dictionary
─ De-compounding: 5-grams
─ Assign weights to 5-gram substrings
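
The 5-gram step can be illustrated with a short sketch, assuming the 5-grams are overlapping character substrings of each word. The helper name `char_ngrams` is hypothetical, and the slide does not say how the substring weights are assigned, so that part is left out.

```python
def char_ngrams(word, n=5):
    """Return the overlapping character n-grams of a word.
    Words no longer than n are returned whole."""
    word = word.lower()
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


grams = char_ngrams("Gewerkschaft")
```

For a compound such as "Gewerkschaft", the substrings let a query match documents that contain the same word parts without an explicit compound splitter.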

Page 12: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Experiment

Retrieval model

─ Lnu.ltc weighting scheme
─ Weighted document similarity

Statistical significance

─ Bootstrap method
    Bootstrap samples
    One-tailed significance testing (comparing two retrieval methods)
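
A one-tailed bootstrap comparison of two retrieval methods might look like the sketch below. It uses a common paired "shift" variant: per-query differences are centered to simulate the null hypothesis and then resampled with replacement. The slide does not give the exact procedure, so this variant, the function name, and the per-query scores in the usage lines are all assumptions.

```python
import random


def bootstrap_one_tailed(baseline, improved, n_resamples=10000, seed=0):
    """Estimate the p-value for 'improved is better than baseline'
    from paired per-query scores such as average precision."""
    if len(baseline) != len(improved):
        raise ValueError("both methods need scores for the same queries")
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(baseline, improved)]
    n = len(diffs)
    observed = sum(diffs) / n
    # Center the differences so the resampling population has mean 0,
    # which is what the null hypothesis (no improvement) predicts.
    centered = [d - observed for d in diffs]
    hits = 0
    for _ in range(n_resamples):
        # One bootstrap sample: n differences drawn with replacement.
        sample = rng.choices(centered, k=n)
        if sum(sample) / n >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical per-query average precision for two methods.
baseline_ap = [0.20, 0.35, 0.10, 0.50, 0.42, 0.31, 0.27, 0.15]
improved_ap = [0.30, 0.41, 0.19, 0.58, 0.50, 0.40, 0.35, 0.26]
p_value = bootstrap_one_tailed(baseline_ap, improved_ap, n_resamples=2000)
```

A small p-value means that a mean improvement as large as the observed one rarely arises by chance under the null hypothesis.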

Page 13: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Experiment

Problems found in the experiments

─ The individual average precision of the log likelihood ratio decreases for a number of queries.

Unknown words

The original word from the source language is included in the target-language query.

Example: Women’s Conference Beijing

─ "Women" is treated as a proper noun
─ Normalized form "Woman": not found
─ "Women" is assigned weight = 1

Result

1. "Women" dominates document similarity.
2. Most top-ranked documents contain "Women" as the only matching term.
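
The pass-through behaviour described above can be sketched as follows. The function name `build_target_query` and every translation weight are hypothetical; the only point taken from the slide is that a word without a dictionary entry is copied into the target-language query with weight 1.

```python
def build_target_query(source_terms, translation_weights):
    """Map each source term to weighted target-language terms.
    Terms without any translation are kept as-is with weight 1."""
    query = {}
    for term in source_terms:
        candidates = translation_weights.get(term)
        if candidates:
            query.update(candidates)
        else:
            # Unknown word: the original source word enters the query.
            query[term] = 1.0
    return query


# Hypothetical disambiguation output; "women" has no dictionary entry.
translation_weights = {
    "conference": {"konferenz": 0.6, "tagung": 0.4},
    "beijing": {"peking": 1.0},
}
query = build_target_query(["women", "conference", "beijing"], translation_weights)
```

Because the untranslated word carries the full weight of 1 while ambiguous terms split their weight across several candidates, it can contribute more to document similarity than any single translation of an ambiguous term.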

Page 14: Iterative Translation Disambiguation for Cross-Language Information Retrieval

Conclusions

Our approach improves retrieval effectiveness compared to a baseline that uses bilingual dictionary lookup.

Experimental results show that the log likelihood ratio has the strongest positive impact.

Page 15: Iterative Translation Disambiguation for Cross-Language Information Retrieval

My opinion

Advantage: It only requires a bilingual dictionary and a monolingual corpus in the target language.

Disadvantage: Unknown words

Apply