Ferhan Ture dissertation defense, May 24th, 2013


Page 1
Page 2

“Searching to Translate” and “Translating to Search”: When Information Retrieval Meets Machine Translation

Ferhan Ture
Dissertation defense, May 24th, 2013

Department of Computer Science
University of Maryland at College Park

Page 3

Motivation

• Fact 1: People want to access information

e.g., web pages, videos, restaurants, products, …

• Fact 2: Lots of data out there… but also lots of noise, redundancy, different languages

• Goal: Find ways to efficiently and effectively
  - Search complex, noisy data
  - Deliver content in appropriate form

3

multi-lingual text → user’s native language

forum posts → clustered summaries

Page 4

(The definition below, after stemming and stopword removal:) retriev ir find materi (usual document unstructur natur (usual text satisfi need larg collect (usual store comput work assum materi collect document written natur languag need form queri rang word entir document typic approach ir repres document vector weight term term mean word stem pre-determin list word .g. `` '' `` '' `` '' may remov set term found creat nois search process document score relat queri score queri term independ aggreg term-docu score

Information Retrieval

4

Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers). In our work, we assume that the material is a collection of documents written in natural language, and the information need is provided in the form of a query, ranging from a few words to an entire document. A typical approach in IR is to represent each document as a vector of weighted terms, where a term usually means either a word or its stem. A pre-determined list of stop words (e.g., ``the'', ``an'', ``my'') may be removed from the set of terms, since they have been found to create noise in the search process. Documents are scored, relative to the query, usually by scoring each query term independently and aggregating these term-document scores.

queri:11.69, ir:11.39, vector:7.93, document:7.09, nois:6.92, stem:6.56, score:5.68, weight:5.67, word:5.46, materi:5.42, search:5.06, term:5.03, text:4.87, comput:4.73, need:4.61, collect:4.48, natur:4.36, languag:4.12, find:3.92, repres:3.58
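The weighted-term representation described above can be sketched in a few lines of Python. This is a toy illustration, not the thesis's actual (Ivory-based) preprocessing: the stopword list and tokenizer are stand-ins, and no stemmer is applied.

```python
import math
import re

# toy stopword list; a real system would use a fuller list plus a stemmer
STOPWORDS = {"the", "an", "my", "a", "of", "in", "on", "is", "that"}

def terms(text):
    # crude tokenization + stopword removal (no stemming here)
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def tfidf_vector(doc, collection):
    """Represent `doc` as a dict of term -> tf-idf weight."""
    n = len(collection)
    toks = terms(doc)
    vec = {}
    for t in set(toks):
        tf = toks.count(t)
        df = sum(1 for d in collection if t in terms(d))
        vec[t] = tf * math.log((n + 1) / (df + 1))  # smoothed idf
    return vec

docs = ["the cat sat on the mat", "the dog sat", "a dog ran"]
v = tfidf_vector(docs[0], docs)
```

Terms that occur in fewer documents (like "cat") receive higher weights than terms spread across the collection (like "sat"), which is exactly the effect visible in the weight list above.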

Page 5

Cross-Language Information Retrieval

5

Information Retrieval (IR) bzw. Informationsrückgewinnung, gelegentlich ungenau Informationsbeschaffung, ist ein Fachgebiet, das sich mit computergestütztem Suchen nach komplexen Inhalten (also z. B. keine Einzelwörter) beschäftigt und in die Bereiche Informationswissenschaft, Informatik und Computerlinguistik fällt. Wie aus der Wortbedeutung von retrieval (deutsch Abruf, Wiederherstellung) hervorgeht, sind komplexe Texte oder Bilddaten, die in großen Datenbanken gespeichert werden, für Außenstehende zunächst nicht zugänglich oder abrufbar. Beim Information Retrieval geht es darum bestehende Informationen aufzufinden, nicht neue Strukturen zu entdecken (wie beim Knowledge Discovery in Databases, zu dem das Data Mining und Text Mining gehören).

(In English: Information retrieval (IR), occasionally imprecisely called information gathering, is a field concerned with computer-supported search for complex content (i.e., not single words) and falls within information science, computer science, and computational linguistics. As the meaning of retrieval (German: Abruf, Wiederherstellung) suggests, complex text or image data stored in large databases are initially not accessible or retrievable to outsiders. Information retrieval is about finding existing information, not discovering new structures (as in knowledge discovery in databases, which includes data mining and text mining).)


Page 6

Machine Translation

6

Maschinelle Übersetzung (MT) ist, um Text in einer Ausgangssprache in entsprechenden Text in der Zielsprache geschrieben übersetzen.

Machine translation (MT) is the task of translating text written in a source language into corresponding text in a target language.

Page 7

Motivation

• Fact 1: People want to access information

e.g., web pages, videos, restaurants, products, …

• Fact 2: Lots of data out there… but also lots of noise, redundancy, different languages

• Goal: Find ways to efficiently and effectively
  - Search complex, noisy data
  - Deliver content in appropriate form

7

multi-lingual text → user’s native language

MT, Cross-language IR

Page 8

Outline

• Introduction

• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12)

• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR’12; Ture et al., COLING’12; Ture and Lin, SIGIR’13)

• Conclusions

8

Page 9

Extracting Parallel Text from the Web

9

[Pipeline diagram]
Phase 1: source collection F and target collection E → Preprocess → doc vectors (F and E) → Signature Generation → signatures (F and E) → Sliding Window Algorithm → cross-lingual document pairs
Phase 2: candidate sentence pairs → 2-step Parallel Text Classifier → aligned bilingual sentence pairs (F-E parallel text)

Page 10

Pairwise Similarity

• Pairwise similarity: finding similar pairs of documents in a large collection

• Challenges
  - quadratic search space
  - measuring similarity effectively and efficiently

• Focus on recall and scalability

10

Page 11

[Pipeline diagram]
Ne English articles → Preprocess → Ne English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → Signature generation (Locality-Sensitive Hashing) → Ne signatures (e.g., [0111000010...]) → Sliding window algorithm → Similar article pairs

Page 12

Locality-Sensitive Hashing (Ravichandran et al., 2005)

• LSH(vector) = signature: faster similarity computation, such that similarity(vector pair) ≈ similarity(signature pair)
  - e.g., ~20 times faster than computing (cosine) similarity from vectors, with similarity error ≈ 0.03

• Sliding window algorithm: approximate similarity search based on LSH, with linear run-time

12
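The signature trick can be sketched with random-hyperplane LSH for cosine similarity: each bit records which side of a random hyperplane the vector falls on, and the Hamming distance between signatures estimates the angle between vectors. The vocabulary and vectors below are made-up stand-ins, not data from the thesis.

```python
import math
import random

def lsh_signature(vec, hyperplanes):
    # one bit per random hyperplane: the sign of the dot product
    bits = []
    for h in hyperplanes:
        dot = sum(w * h.get(t, 0.0) for t, w in vec.items())
        bits.append(1 if dot >= 0 else 0)
    return bits

def approx_cosine(sig_a, sig_b):
    # Hamming distance h over D bits estimates the angle: theta ~ pi * h / D
    d = len(sig_a)
    h = sum(a != b for a, b in zip(sig_a, sig_b))
    return math.cos(math.pi * h / d)

random.seed(0)
vocab = ["nobel", "prize", "book", "award", "laureate"]
planes = [{t: random.gauss(0, 1) for t in vocab} for _ in range(1000)]

va = {"nobel": 0.324, "prize": 0.227, "book": 0.01}
vb = {"nobel": 0.3, "prize": 0.2, "award": 0.05}
sa, sb = lsh_signature(va, planes), lsh_signature(vb, planes)
est = approx_cosine(sa, sb)  # close to the true cosine (~0.99 here)
```

Comparing bit signatures is much cheaper than comparing sparse float vectors, which is the "~20 times faster, error ≈ 0.03" trade-off on the slide.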

Page 13

Sliding window algorithm: generating tables (MapReduce)

Each signature, e.g. (1,11011011101), (2,01110000101), (3,10101010000), is permuted under permutations p1 … pQ (Map) and the permuted signatures are sorted (Reduce), producing sorted lists list1 … listQ, i.e., tables table1 … tableQ.

Page 14

Sliding window algorithm: detecting similar pairs

Each table table1 … tableQ is scanned with a sliding window (Map), so that every signature is compared only against its neighbors within the window.

14

Page 15

Sliding window algorithm: example (# bits = 11, # tables = 2, window size = 2)

Signatures: (1,11011011101), (2,01110000101), (3,10101010000)

Two permutations p1, p2 are applied (Map); the permuted signatures are sorted into table1 and table2 (Reduce), and Hamming distances are computed within each window:
table1: Distance(3,2) = 7 ✗, Distance(2,1) = 5 ✓
table2: Distance(2,3) = 7 ✗, Distance(3,1) = 6 ✓
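The permute-sort-slide procedure can be sketched as follows. The permutation seed and the Hamming-distance cutoff of 6 are illustrative choices for this toy example, not the thesis's settings; the three 11-bit signatures are the ones from the slide.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def sliding_window_pairs(sigs, num_tables=2, window=2, max_dist=6, seed=1):
    """sigs: dict id -> bit tuple. Returns pairs within max_dist that land
    in the same window of some permuted, sorted table."""
    rng = random.Random(seed)
    d = len(next(iter(sigs.values())))
    found = set()
    for _ in range(num_tables):
        perm = list(range(d))
        rng.shuffle(perm)
        # permute bits, then sort signatures lexicographically (one "table")
        table = sorted(sigs, key=lambda i: tuple(sigs[i][p] for p in perm))
        for pos, i in enumerate(table):
            for j in table[pos + 1 : pos + window]:  # neighbors in the window
                if hamming(sigs[i], sigs[j]) <= max_dist:
                    found.add(tuple(sorted((i, j))))
    return found

sigs = {1: (1,1,0,1,1,0,1,1,1,0,1),
        2: (0,1,1,1,0,0,0,0,1,0,1),
        3: (1,0,1,0,1,0,1,0,0,0,0)}
pairs = sliding_window_pairs(sigs)
```

Only signatures that sort near each other under some permutation are ever compared, which is what makes the run-time linear rather than quadratic; more tables raise recall at higher cost.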

Page 16

16

Cross-lingual Pairwise Similarity

Two ways to compare a German Doc A against an English Doc B:
- MT: translate Doc A into English with MT, then build doc vector vA and compare against the English doc vector vB
- CLIR: build the German doc vector vA, then translate the vector with CLIR techniques and compare against vB

Page 17

17

MT vs. CLIR for Pairwise Similarity

[Histogram of similarity scores for clir-neg, clir-pos, mt-neg, and mt-pos pairs: similarity values are low overall, but positive and negative pairs are clearly separated.]

MT slightly better than CLIR, but 600 times slower!

Page 18

Locality-Sensitive Hashing for Pairwise Similarity

[Same pipeline as before: Ne English articles → Preprocess → Ne English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → Signature generation → Ne signatures (e.g., [0111000010...]) → Sliding window algorithm → Similar article pairs]

Page 19

Locality-Sensitive Hashing for Cross-Lingual Pairwise Similarity

[Pipeline: Nf German articles → CLIR Translate → English document vectors, merged with the Ne English articles’ vectors into Ne+Nf English document vectors (e.g., <nobel=0.324, prize=0.227, book=0.01, …>) → Signature generation → signatures (e.g., [0111000010...]) → Sliding window algorithm → Similar article pairs]

Page 20

Evaluation

• Experiments with De/Es/Cs/Ar/Zh/Tr to En Wikipedia

• Collection: 3.44m En + 1.47m De Wikipedia articles

• Task: for each German Wikipedia article, find {all English articles s.t. cosine similarity > 0.30}

20

Parameters: # bits (D) = 1000, # tables (Q) = 100-1500, window size (B) = 100-2000

Page 21

Scalability

21

Page 22

Evaluation: two sources of error

- Ground truth: brute-force approach over document vectors → similar article pairs.
- Upper bound: brute-force approach over signatures → similar article pairs (error introduced by signature generation).
- Algorithm output: signatures → sliding window algorithm → similar article pairs (additional error introduced by the windowed search).

22

Page 23

Evaluation

23

95% recall at 39% cost
99% recall at 70% cost

95% recall at 40% cost
99% recall at 62% cost

100% recall, no savings = no free lunch!

Page 24

Outline

• Introduction

• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12)

• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR’12; Ture et al., COLING’12; Ture and Lin, SIGIR’13)

• Conclusions

24

Page 25

Phase 2: Extracting Parallel Text

Approach:
1. Generate candidate sentence pairs from each document pair
2. Classify each candidate as ‘parallel’ or ‘not parallel’

Challenge: 10s of millions of document pairs ≈ 100s of billions of sentence pairs

Solution: a 2-step classification approach
1. a simple classifier efficiently filters out irrelevant pairs
2. a complex classifier effectively classifies the remaining pairs

25

Page 26

Parallel Text (Bitext) Classifier

• cosine similarity of the two sentences
• sentence length ratio: the ratio of the lengths of the two sentences
• word translation ratio: the ratio of words in the source (target) sentence with a translation in the target (source) sentence

26
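The three classifier features can be sketched as below. This is a toy illustration: the dictionary, tokenization, and the way cosine similarity is computed over a shared (translated) vocabulary are stand-in assumptions, not the thesis's exact feature extraction.

```python
import math

def features(src_toks, tgt_toks, translations):
    """The three slide features: cosine similarity, sentence length ratio,
    and word translation ratio. `translations` maps a source word to the
    set of its known target-language translations."""
    # project source tokens into the target vocabulary via the dictionary
    projected = [t for w in src_toks for t in translations.get(w, ())]

    def counts(toks):
        c = {}
        for t in toks:
            c[t] = c.get(t, 0) + 1
        return c

    a, b = counts(projected), counts(tgt_toks)
    dot = sum(a[t] * b.get(t, 0) for t in a)
    denom = (math.sqrt(sum(v * v for v in a.values())) *
             math.sqrt(sum(v * v for v in b.values()))) or 1.0
    cosine = dot / denom
    length_ratio = len(src_toks) / max(len(tgt_toks), 1)
    translated = sum(1 for w in src_toks
                     if translations.get(w, set()) & set(tgt_toks))
    trans_ratio = translated / max(len(src_toks), 1)
    return cosine, length_ratio, trans_ratio

dic = {"maternity": {"maternité"}, "leave": {"congé", "laisser"}}
cos_sim, lr, tr = features(["maternity", "leave"],
                           ["congé", "de", "maternité"], dic)
```

The simple classifier can threshold on cheap features like these; the complex classifier then applies a richer model to the survivors.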

Page 27

Bitext Extraction Algorithm

[MapReduce pipeline: cross-lingual document pairs → sentence detection + tf-idf on each source and target document → sentences and sentence vectors → cartesian product (MAP) → candidate sentence pairs → simple classification (REDUCE) → bitext S1 → complex classification → bitext S2]

Running times: candidate generation 2.4 hours, shuffle & sort 1.3 hours, simple classification 4.1 hours, complex classification 0.5 hours
Sentence-pair counts: 400 billion → 214 billion → 132 billion

27

Page 28

Extracting Bitext from Wikipedia

Size                      English  German     Spanish    Chinese     Arabic      Czech      Turkish
Documents                 4.0m     1.42m      0.99m      0.59m       0.25m       0.26m      0.23m
Similar doc pairs         -        35.9m      51.5m      14.8m       5.4m        9.1m       17.1m
Sentences                 ~90m     42.3m      19.9m      5.5m        2.6m        5.1m       3.5m
Candidate sentence pairs  -        530b       356b       62b         48b         101b       142b
S1                        -        292m       178m       63m         7m          203m       69m
S2                        -        0.2-3.3m   0.9-3.3m   50k-290k    130-320k    0.5-1.6m   8-250k
Baseline training data    -        2.1m       2.1m       303k        3.4m        0.78m      53k
Dev/Test set              -        WMT-11/12  WMT-11/12  NIST-06/08  NIST-06/08  WMT-11/12  held-out
Baseline BLEU             -        24.50      33.44      25.38       63.15       23.11      27.22

Page 29

Evaluation on MT

Page 30

Evaluation on MT

Page 31

Conclusions (Part I)

31

• Summary
  - Scalable approach to extract parallel text from a comparable corpus
  - Improvements over a state-of-the-art MT baseline
  - General algorithm applicable to any data format

• Future work
  - Domain adaptation
  - Experimenting with larger web collections

Page 32

Outline

• Introduction

• Searching to Translate (IR → MT)
  - Cross-Lingual Pairwise Document Similarity (Ture et al., SIGIR’11)
  - Extracting Parallel Text From Comparable Corpora (Ture and Lin, NAACL’12)

• Translating to Search (MT → IR)
  - Context-Sensitive Query Translation (Ture et al., SIGIR’12; Ture et al., COLING’12; Ture and Lin, SIGIR’13)

• Conclusions

32

Page 33

Cross-Language Information Retrieval

• Information Retrieval (IR): given an information need, find relevant material (query → ranked documents).

• Cross-language IR (CLIR): query and documents in different languages
  - “Why does China want to import technology to build Maglev Railway?” ➡ relevant information in Chinese documents
  - “Maternal Leave in Europe” ➡ relevant information in French, Spanish, German, etc.

33

Page 34

Machine Translation for CLIR

[Statistical MT system: a sentence-aligned parallel corpus feeds a token aligner, which produces token alignments and token translation probabilities; a grammar extractor builds the translation grammar; the decoder, using the translation grammar and a language model, maps the query “maternal leave in Europe” to n-best translations and the 1-best translation “congé de maternité en Europe”.]

34

Page 35

Token-based CLIR

• Token translation formula: token-based probabilities, estimated from aligned sentence pairs such as
  … most leave their children in … / … la plupart laisse leurs enfants …
  … aim of extending maternity leave to … / … l’objectif de l’extension des congés de maternité à …

35
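The token translation formula itself did not survive the transcript. In the probabilistic structured queries style that token-based CLIR typically follows, it can be written as follows (a hedged reconstruction: e is a query-language term, f a document-language term, and Pr(f | e) is estimated from the token alignments):

```latex
\mathrm{tf}(e, D) \;=\; \sum_{f} \Pr(f \mid e)\,\mathrm{tf}(f, D),
\qquad
\mathrm{df}(e) \;=\; \sum_{f} \Pr(f \mid e)\,\mathrm{df}(f)
```

Each query term's frequency statistics are projected into the document language by weighting the statistics of its candidate translations, so ranking can proceed with an ordinary monolingual scoring function.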

Page 36

Token-based CLIR

36

Maternal leave in Europe
1. laisser (Eng. forget) 49%
2. congé (Eng. time off) 17%
3. quitter (Eng. quit) 9%
4. partir (Eng. disappear) 7%

Page 37

Document Retrieval

• How to score a document, given a query?

Query q1 “maternal leave in Europe” translates to [maternité : 0.74, maternel : 0.26]; each document d1 contributes tf(maternité), tf(maternel), df(maternité), df(maternel), …

37
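Scoring with these projected statistics can be sketched as a toy tf-idf scorer; the counts, document frequencies, and probabilities below are invented for illustration, and the thesis's actual ranking function may differ.

```python
import math

def clir_score(query_terms, doc_tf, df, n_docs, trans_prob):
    """Score one target-language document for a source-language query by
    projecting tf and df through the translation distribution Pr(f|e)."""
    score = 0.0
    for e in query_terms:
        dist = trans_prob.get(e, {})
        # projected term frequency and document frequency for query term e
        tf_e = sum(p * doc_tf.get(f, 0) for f, p in dist.items())
        df_e = sum(p * df.get(f, 0) for f, p in dist.items())
        if tf_e > 0 and df_e > 0:
            score += tf_e * math.log(n_docs / df_e)  # tf-idf contribution
    return score

trans = {"maternal": {"maternité": 0.74, "maternel": 0.26}}
doc = {"maternité": 3, "congé": 1}
dfs = {"maternité": 10, "maternel": 40, "congé": 25}
s = clir_score(["maternal"], doc, dfs, n_docs=1000, trans_prob=trans)
```

Because the whole translation distribution is used rather than a single 1-best translation, ambiguity is preserved at scoring time.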

Page 38

Token-based CLIR

38

Maternal leave in Europe
1. laisser (Eng. forget) 49%
2. congé (Eng. time off) 17%
3. quitter (Eng. quit) 9%
4. partir (Eng. disappear) 7%

Page 39

Token-based CLIR

39

Maternal leave in Europe
1. laisser (Eng. forget) 49%
2. congé (Eng. time off) 17%
3. quitter (Eng. quit) 9%
4. partir (Eng. disappear) 7%

Page 40

Context-Sensitive CLIR

40

This talk: MT for context-sensitive CLIR

Maternal leave in Europe
1. laisser (Eng. forget) 49% → 12%
2. congé (Eng. time off) 17% → 70%
3. quitter (Eng. quit) 9% → 6%
4. partir (Eng. disappear) 7% → 5%

Page 41

Previous approach: MT as black box
Our approach: Looking inside the box

[Inside the statistical MT system: sentence-aligned parallel corpus → token aligner → token alignments and token translation probabilities; grammar extractor → translation grammar; decoder (with language model) → n-best derivations and the 1-best translation “congé de maternité en Europe” for the query “maternal leave in Europe”.]

41

Page 42

MT for Context-Sensitive CLIR

[The MT pipeline: sentence-aligned parallel corpus → token aligner → token alignments and token translation probabilities; grammar extractor → translation grammar; decoder (with language model) → n-best translations and the 1-best translation “congé de maternité en Europe” for the query “maternal leave in Europe”.]

42

Page 43

CLIR from translation grammar

• Token translation formula: grammar-based probabilities

Synchronous Context-Free Grammar (SCFG) [Chiang, 2007]:
S → [X : X] , 1.0
X → [X1 leave in europe : congé de X1 en europe] , 0.9
X → [maternal : maternité] , 0.9
X → [X1 leave : congé de X1] , 0.74
X → [leave : congé] , 0.17
X → [leave : laisser] , 0.49
...

Synchronous hierarchical derivation: “maternal leave in Europe” ↔ “congé de maternité en Europe”

43
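The grammar-based distribution can be written out as follows (a hedged reconstruction consistent with the slide, not the thesis's exact notation: G(q) is the set of grammar rules matching query q, w(r) a rule's score, and c_r(e, f) the number of times query word e is aligned to target word f within rule r):

```latex
\Pr_{\mathrm{SCFG}}(f \mid e) \;=\;
\frac{\sum_{r \in G(q)} w(r)\, c_r(e, f)}
     {\sum_{f'} \sum_{r \in G(q)} w(r)\, c_r(e, f')}
```

Because only rules that actually match the query contribute, the surrounding query words ("maternal", "in Europe") shift probability mass toward the contextually appropriate translation (here, congé over laisser).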

Page 44

MT for Context-Sensitive CLIR

[Same pipeline diagram as before.]

44

Page 45

MT for Context-Sensitive CLIR

[Same pipeline diagram as before.]

45

Page 46

CLIR from n-best derivations

• Token translation formula: translation-based probabilities

t(k): { kth best derivation , score(t(k)|s) }
t(1): { “maternal leave in Europe” → “congé de maternité en Europe”, via maternal → maternité and leave in Europe → congé de … en Europe , 0.8 }
t(2): { the same translation via a different derivation, grouping “maternal leave” → “congé de maternité” and “in Europe” → “en Europe” , 0.11 }
...

46

Page 47

MT for Context-Sensitive CLIR

47

[Spectrum from ambiguity preserved to context sensitivity, following the MT pipeline from sentence-aligned bitext: token-based (Prtoken, from token alignments) → grammar-based (PrSCFG, from the translation grammar) → translation-based (Prnbest, from n-best derivations) → 1-best MT (the 1-best translation).]

Page 48

Combining Evidence

• For best results, we compute an interpolated probability distribution:

leave →    Prtoken  PrSCFG  Prnbest  Printerp (weights 35% / 40% / 25%)
laisser    0.72     0.14    0.09     0.33
congé      0.10     0.70    0.90     0.54
quitter    0.09     0.06    0.11     0.08

48
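The interpolation can be sketched numerically. The assignment of the weights to the three models (35% token, 40% grammar, 25% n-best) is inferred from the garbled figure, but with that assignment the arithmetic reproduces the slide's interpolated values.

```python
def interpolate(dists, weights):
    """Printerp = sum_i lambda_i * Pr_i, with the lambdas summing to one."""
    assert abs(sum(weights) - 1.0) < 1e-9
    out = {}
    for dist, lam in zip(dists, weights):
        for word, p in dist.items():
            out[word] = out.get(word, 0.0) + lam * p
    return out

pr_token = {"laisser": 0.72, "congé": 0.10, "quitter": 0.09}
pr_scfg  = {"laisser": 0.14, "congé": 0.70, "quitter": 0.06}
pr_nbest = {"laisser": 0.09, "congé": 0.90, "quitter": 0.11}

pr = interpolate([pr_token, pr_scfg, pr_nbest], [0.35, 0.40, 0.25])
# reproduces the slide: laisser 0.33, congé 0.54, quitter ≈ 0.08
```

With weights (1.0, 0.0, 0.0), `interpolate` reduces to the token-based distribution, which is exactly the degenerate case shown on the next slide.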

Page 49

Combining Evidence

• With weights 100% / 0% / 0%, the interpolation reduces to the token-based distribution:

leave →    Prtoken  PrSCFG  Prnbest  Printerp (weights 100% / 0% / 0%)
laisser    0.72     0.14    0.09     0.72
congé      0.10     0.70    0.90     0.10
quitter    0.09     0.06    0.11     0.09

49

Page 50

Combining Evidence

50

• For best results, we compute an interpolated probability distribution.

Page 51

Experiments

• Three tasks:
  1. TREC 2002 English-Arabic CLIR task: 50 English queries and 383,872 Arabic documents
  2. NTCIR-8 English-Chinese ACLIA task: 73 English queries and 388,859 Chinese documents
  3. CLEF 2006 English-French CLIR task: 50 English queries and 177,452 French documents

• Implementation
  - cdec MT system [Dyer et al., 2010]
  - using Hiero-style grammars, GIZA++ for token alignments

51

Page 52

Comparison of Models: English-Arabic TREC 2002, English-French CLEF 2006, English-Chinese NTCIR-8

[Charts comparing Token-based, Grammar-based, Translation-based (10-best), 1-best MT, and the best interpolation]

52

Page 53

53

Comparison of Models: Overview

Page 54

Comparison of Models

[Bar chart: Mean Average Precision (MAP), 0.00-0.30, for English-Chinese, English-Arabic, and English-French, comparing Token-based, Grammar-based, Translation-based, 1-best MT, and Interpolated models.]

54

Interpolated is significantly better than token-based and 1-best MT in all three cases.

Page 55

Conclusions (Part II)

• Summary
  - A novel framework for context-sensitive and ambiguity-preserving CLIR
  - Interpolation of the proposed models works best
  - Significant improvements in MAP for three tasks

• Future work
  - Robust parameter optimization
  - Document vs. query translation with MT

55

Page 56

Contributions

[Overview diagram: comparable corpora + Token-based CLIR → Bitext Extraction → extracted bitext, which joins the baseline bitext in the MT pipeline → MT Translation Model → CLIR Translation Model]

Page 57

Contributions

[Same diagram, highlighting bitext extraction: higher BLEU for 5 language pairs]

Page 58

Contributions

[Same diagram, highlighting the context-sensitive CLIR translation model]

Page 59

Contributions

[Same diagram: context-sensitive CLIR yields higher MAP for 3 language pairs]

Page 60

Contributions

[Same diagram, combining both directions: higher MAP for 3 language pairs and higher BLEU for 5 language pairs]

Page 61

Contributions

[Same diagram, with the loop closed: the CLIR Translation Model feeds bitext extraction to produce more bitext, giving higher BLEU after an additional iteration]

Page 62

Contributions

• LSH-based MapReduce approach to pairwise similarity
• Exploration of the parameter space for the sliding window algorithm
• MapReduce algorithm to generate candidate sentence pairs
• 2-step classification approach to bitext extraction: bitext from Wikipedia improves over a state-of-the-art MT baseline
• Set of techniques for context-sensitive CLIR using MT: combination of evidence works best
• Framework for better integration of MT and IR
• Bootstrapping approach to show feasibility
• All code and data released as part of the Ivory project (www.ivory.cc)

62

Page 63

Thank you!