An Effective Approach for Searching Closest Sentence Translations from The Web Ju Fan, Guoliang Li, and Lizhu Zhou Database Research Group, Tsinghua University DASFAA 2011 – Apr. 23, Hong Kong Database Research Group

Post on 15-Mar-2016

Page 1: An Effective Approach for Searching Closest Sentence Translations from The Web

An Effective Approach for Searching Closest Sentence Translations from The Web

Ju Fan, Guoliang Li, and Lizhu Zhou

Database Research Group, Tsinghua University

DASFAA 2011 – Apr. 23, Hong Kong

Page 2: An Effective Approach for Searching Closest Sentence Translations from The Web

Outline

• Introduction
• Overview of Our Approach
• Phrase-Based Similarity Model
• Phrase Selection
• Experiments
• Conclusion

04/24/23 2SCST@DASFAA 2011

Page 4: An Effective Approach for Searching Closest Sentence Translations from The Web

Background

• Parallel sentences on the Web
  ▪ Sentences with a well-translated counterpart
  ▪ An English-to-Chinese example:
    Obama said he hopes to get Congress to approve it next year
    奥巴马总统说他争取让希望国会明年批准该协议。 -- blog.hjenglish.com
• A rich source for translation
• Commercial systems

Page 5: An Effective Approach for Searching Closest Sentence Translations from The Web

Background

Sentence-Level Translation Aid

• Web → Parallel Sentence Discovery and Extraction → Parallel Sentence Database: Sen 1 (E-C), Sen 2 (E-C), Sen 3 (E-C), ……, Sen n (E-C)
  ▪ Parallel sentences, e.g., "The result is good" / 结果很好
• Query Sentence (English) → Sentence Matching → Closest Sentences with Translation
• Research issue (Sentence Matching): an effective similarity model between sentences in the source language (e.g., English sentences)

Page 6: An Effective Approach for Searching Closest Sentence Translations from The Web

Motivation

• Existing approaches:
  ▪ Word-based, e.g., translation model, edit distance, …: cannot capture the order of words
  ▪ Gram-based, e.g., N-gram, V-gram: don't consider the syntactic information
  ▪ All subsequences of a sentence: too expensive
• We propose a phrase-based similarity model
  1. Syntactic information
  2. Frequency information
  3. Lengths of phrases

Page 8: An Effective Approach for Searching Closest Sentence Translations from The Web

Problem Definition

• Data: A Database of Parallel Sentences
  ▪ Sentence1: English - Chinese
  ▪ Sentence2: English - Chinese
  ▪ Sentence3: English - Chinese
• Query: Query Sentence (English)
• Answer: Sentences with their translations
• Translator

Page 9: An Effective Approach for Searching Closest Sentence Translations from The Web

Phrase-Based Sentence Matching

• Offline: Parallel Sentences → Phrase Selection → Phrase Database
• Online: query q (Phrase f1, Phrase f2, ……, Phrase fn) and sentence s (Phrase f'1, Phrase f'2, ……, Phrase f'n) → Similarity Model

Page 11: An Effective Approach for Searching Closest Sentence Translations from The Web

Phrase-Based Similarity Model

[Framework diagram: offline, Parallel Sentences → Phrase Selection → Phrase Database; online, the phrases of q (f1, f2, ……, fn) and of s (f'1, f'2, ……, f'n) feed the Similarity Model]

Page 12: An Effective Approach for Searching Closest Sentence Translations from The Web

Similarity Model

$$\mathrm{sim}(q, s) = \sum_{f \in F_q \cap F_s} \phi(q, f) \, w(f) \, \phi(s, f)$$

• Query sentence $q$, with phrase set $F_q = \{f_1, f_2, f_3, \ldots, f_m\}$
• A sentence in the DB, $s$, with phrase set $F_s = \{f'_1, f'_2, f'_3, \ldots, f'_n\}$
• Shared phrases: $f \in F_q \cap F_s$
• $\phi(q, f)$: syntactic importance of $f$ to $q$
• $\phi(s, f)$: syntactic importance of $f$ to $s$
• $w(f)$: weight of $f$ (IDF)
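As a rough illustration of the similarity model on this slide, the Python sketch below sums, over the phrases shared by q and s, the product of the two syntactic importances and the phrase weight. It assumes that w(f) multiplies each term of the sum and that w(f) is a plain log(N / df) IDF; the function names and the toy values are hypothetical and not taken from the talk.

```python
import math

def idf_weights(phrase_sets):
    """w(f) as a plain IDF, log(N / df(f)). The exact IDF variant is an assumption."""
    n = len(phrase_sets)
    df = {}
    for phrases in phrase_sets:
        for f in set(phrases):
            df[f] = df.get(f, 0) + 1
    return {f: math.log(n / c) for f, c in df.items()}

def sim(phi_q, phi_s, w):
    """phi_q, phi_s: dicts mapping each phrase to its syntactic importance."""
    shared = phi_q.keys() & phi_s.keys()  # F_q ∩ F_s
    return sum(phi_q[f] * w.get(f, 0.0) * phi_s[f] for f in shared)

# Toy values (hypothetical): only "he eat apple" is shared by q and s.
phi_q = {"he eat apple": 0.5, "he": 1.0}
phi_s = {"he eat apple": 0.4, "pencil": 1.0}
w = {"he eat apple": 2.0, "he": 0.1, "pencil": 1.5}
score = sim(phi_q, phi_s, w)  # 0.5 * 2.0 * 0.4
```

Phrases that occur in only one of the two sentences contribute nothing, so the score depends entirely on which phrases are selected and how they are weighted.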

Page 13: An Effective Approach for Searching Closest Sentence Translations from The Web

Syntactic Importance of Phrases

$$\phi(q, f) = \prod_{m} \alpha_m \prod_{g} \beta_g$$

• $\alpha_m$: syntactic weight of matched term $m$
• $\beta_g$: penalty (constant) for term $g$ in a gap
• Example: sentence $q$ = "He has eaten an apple", phrase $f$ = "he eaten apple", gap terms "has" and "an"
• Dependency tree: the root "eaten" has weight $\alpha_0$; "he", "apple", and "has" each have weight $d \cdot \alpha_0$; "an" has weight $d^2 \cdot \alpha_0$, where $d$ is a decay factor
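The product formula on this slide can be sketched in Python on its own example. The values alpha0 = 1.0, d = 0.5, and beta = 0.9 are hypothetical. The depths are read off the slide's dependency tree by hand rather than produced by a parser. Treating every unmatched term between the first and last matched term as a gap term is also an assumption.

```python
def phi(sentence, depths, phrase, alpha0=1.0, d=0.5, beta=0.9):
    """Syntactic importance of `phrase` to `sentence`.

    sentence: list of terms; depths: dependency-tree depth of each term
    (root = 0); phrase: terms that occur in the sentence in the same order.
    """
    positions, start = [], 0
    for term in phrase:
        idx = sentence.index(term, start)  # ValueError if the phrase does not match
        positions.append(idx)
        start = idx + 1
    score = 1.0
    for i in positions:
        score *= alpha0 * d ** depths[i]  # alpha_m decays with depth in the tree
    matched = set(positions)
    gap_terms = [i for i in range(positions[0], positions[-1] + 1) if i not in matched]
    return score * beta ** len(gap_terms)  # constant penalty per gap term

sentence = ["he", "has", "eaten", "an", "apple"]
depths = [1, 1, 0, 2, 1]  # "eaten" is the root; "an" sits two levels down
value = phi(sentence, depths, ["he", "eaten", "apple"])
# matched: 0.5 * 1.0 * 0.5; gap terms "has" and "an": 0.9 ** 2
```

With these hypothetical constants, a phrase whose terms sit near the root and leave few gaps scores higher than one built from deep or widely separated terms.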

Page 14: An Effective Approach for Searching Closest Sentence Translations from The Web

Features of the Similarity Model

• More General
  ▪ Subsumes Jaccard, Cosine similarity, …
• Syntactic Information
  ▪ Weight of matched terms
  ▪ Weight of terms in the gap
• Frequency Information
  ▪ Weight of phrases

Page 16: An Effective Approach for Searching Closest Sentence Translations from The Web

High-Quality Phrase Selection

[Framework diagram: offline, Parallel Sentences → Phrase Selection → Phrase Database; online, the phrases of q (f1, f2, ……, fn) and of s (f'1, f'2, ……, f'n) feed the Similarity Model]

Page 17: An Effective Approach for Searching Closest Sentence Translations from The Web

High-Quality Phrase

• Extend grams by allowing discontinuous terms
  ▪ Example: sentence q = "He has eaten an apple", phrase "he eaten apple" (terms linked by syntactic relationships)
• A heuristic for selecting phrases
  ▪ Gap constraint: syntactic relationship of discontinuous terms
  ▪ Frequency constraint: infrequent (large IDF)
    ◦ Frequency: # of sentences in the DB having it
  ▪ Maximum constraint: 1) not a prefix; 2) max. length

Page 18: An Effective Approach for Searching Closest Sentence Translations from The Web

Phrase Selection

• Selecting phrases with gap and maximum constraints
  ▪ Sentence s: "He ate a red apple"; terms: he, eat, red, apple
  ▪ Sentence graph: 1) sequential relationship; 2) syntactic relationship
• Longest path from a node = a phrase satisfying
  ▪ Gap constraint
  ▪ Maximum constraint
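One way to picture the longest-path idea is the Python sketch below. It builds a sentence graph over the terms of the example sentence and takes, from each start term, a longest forward path of at most `max_len` terms. The talk gives no edge list, so several things here are assumptions: the syntactic edges are written by hand, edges point forward in word order, ties go to the earliest successor, and `max_len` stands in for the maximum-length constraint.

```python
from functools import lru_cache

def longest_phrases(terms, syntactic_edges, max_len=4):
    n = len(terms)
    succ = {i: set() for i in range(n)}
    for i in range(n - 1):
        succ[i].add(i + 1)  # sequential relationship
    for a, b in syntactic_edges:
        succ[min(a, b)].add(max(a, b))  # syntactic relationship, pointing forward

    @lru_cache(maxsize=None)
    def best(i, budget):
        """Longest path (tuple of term indices) from i using at most `budget` terms."""
        if budget == 1:
            return (i,)
        tails = [best(j, budget - 1) for j in sorted(succ[i])]
        return (i,) + max(tails, key=len, default=())

    return [" ".join(terms[k] for k in best(i, max_len)) for i in range(n)]

terms = ["he", "eat", "red", "apple"]  # "He ate a red apple", as on the slide
edges = [(0, 1), (1, 3), (2, 3)]       # he-eat, eat-apple, red-apple (hand-written)
phrases = longest_phrases(terms, edges)
```

Because each path is the longest one from its start term, none of the returned phrases is a proper prefix of a longer path from the same term, which is one reading of the "not a prefix" condition.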

Page 19: An Effective Approach for Searching Closest Sentence Translations from The Web

Phrase Selection

• Select phrases with frequency constraint (Threshold = 2)
• Sentences in the DB: "He has an apple", "He ate a red apple", "He has a pencil", "He has"
• Use a frequency trie
• Prune frequent phrases

[Figure: frequency trie. Each node is labeled with its ID and count, and "#" marks an end node. Nodes shown: N0(8) root; N1(4) "he"; N2(3) "have"; N27(1) "pencil"; N4(1) "apple"; N9(1) "eat"; N11(1) "red"; N15(1) "apple"; N13(1) "apple"; end nodes N28(0), N5(0), N14(0), N29(0); ……]
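A frequency trie of this kind can be sketched in a few lines of Python. The sketch inserts the term sequences of the four example sentences, with stop words dropped and verbs lemmatized by hand, and records how many sequences pass through each node. Treating a prefix as frequent when its count exceeds the threshold is an assumption, as are all class and function names.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.count = 0  # number of inserted sequences passing through this node

def build_trie(sequences):
    root = TrieNode()
    for seq in sequences:
        node = root
        for term in seq:
            node = node.children.setdefault(term, TrieNode())
            node.count += 1
    return root

def prefix_counts(root):
    """Map every prefix stored in the trie to its count."""
    counts, stack = {}, [((), root)]
    while stack:
        prefix, node = stack.pop()
        for term, child in node.children.items():
            path = prefix + (term,)
            counts[path] = child.count
            stack.append((path, child))
    return counts

def prune_frequent(counts, threshold):
    """Keep only prefixes whose count does not exceed the threshold (assumed rule)."""
    return {p: c for p, c in counts.items() if c <= threshold}

db = [["he", "have", "apple"], ["he", "eat", "red", "apple"],
      ["he", "have", "pencil"], ["he", "have"]]
counts = prefix_counts(build_trie(db))
kept = prune_frequent(counts, threshold=2)
```

On this toy input the counts for "he" (4) and "he have" (3) agree with nodes N1 and N2 in the figure, and both are pruned at threshold 2. The slide's trie also contains nodes that this simplified insertion does not produce.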

Page 21: An Effective Approach for Searching Closest Sentence Translations from The Web

Experiment Setup

• Data Sets
  ▪ DI: 520,899 parallel sentences from ICIBA
  ▪ DC: 800,000 parallel sentences from CNKI
• Baseline Methods
  ▪ Jaccard Coefficient, Edit Distance, Cosine Similarity
  ▪ Translation Model Methods (TM)
  ▪ Cosine Similarity with VGRAM

Page 22: An Effective Approach for Searching Closest Sentence Translations from The Web

Experiment Setup

• Evaluation Metrics
  ▪ BLEU
    ◦ A well-known metric for machine translation
    ◦ Example:
      q: He has eaten an apple (reference translation: 他吃了一个苹果)
      s: He has a pencil (translation: 他有一支铅笔)
      BLEU compares the translation with the reference translation
  ▪ Precision
    ◦ A user study to label whether the translations are useful
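To make the BLEU example concrete, here is a small Python sketch of sentence-level BLEU with a single reference, uniform n-gram weights, and the standard brevity penalty. Tokenizing the Chinese translations character by character is an assumption, as is the choice of n-gram order. The talk does not state which BLEU configuration was used.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU: geometric mean of clipped n-gram precisions times a brevity penalty."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        if total == 0 or clipped == 0:
            return 0.0  # no smoothing in this sketch
        log_sum += math.log(clipped / total)
    c, r = len(candidate), len(reference)
    bp = 1.0 if c >= r else math.exp(1 - r / c)  # penalize short candidates
    return bp * math.exp(log_sum / max_n)

reference = list("他吃了一个苹果")  # reference translation of q
candidate = list("他有一支铅笔")    # translation attached to the retrieved s
unigram_bleu = bleu(candidate, reference, max_n=1)
```

The two translations share only the characters 他 and 一, so the unigram precision is 2/6. They share no bigram, which drives the unsmoothed score to zero for any `max_n` of 2 or more.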

Page 23: An Effective Approach for Searching Closest Sentence Translations from The Web

Effects of Phrase Selection

[Figures: effect of max. length on DI; effect of freq. threshold on DC]

Page 24: An Effective Approach for Searching Closest Sentence Translations from The Web

Comparison with Similarity Models

[Figure: comparison on the DI data set]

Page 25: An Effective Approach for Searching Closest Sentence Translations from The Web

Comparison with Existing Methods

[Figure: comparison on the DC data set]

Page 26: An Effective Approach for Searching Closest Sentence Translations from The Web

User Studies

• Methods used in commercial systems

[Figure: comparison on the DI data set]

Page 28: An Effective Approach for Searching Closest Sentence Translations from The Web

Conclusion

• Searching closest sentence translations from the Web
• A phrase-based sentence similarity model
• High-quality phrase selection methods
• Extensive experiments and user studies

Page 29: An Effective Approach for Searching Closest Sentence Translations from The Web

Thanks

My Homepage: http://dbgroup.cs.tsinghua.edu/fanju

Page 30: An Effective Approach for Searching Closest Sentence Translations from The Web

Frequency Constraint

• Index structures
  ▪ Phrase → Sentence
• Frequent phrases → large inverted index
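The phrase-to-sentence index can be illustrated with a short Python sketch: an inverted index maps each phrase to the IDs of the sentences that contain it, and a query collects the sentences sharing at least one phrase with it. The sentence IDs, phrases, and function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(sentence_phrases):
    """sentence_phrases: dict mapping a sentence ID to its selected phrases."""
    index = defaultdict(set)
    for sid, phrases in sentence_phrases.items():
        for f in phrases:
            index[f].add(sid)  # posting list of phrase f
    return index

def candidates(index, query_phrases):
    """IDs of sentences that share at least one phrase with the query."""
    found = set()
    for f in query_phrases:
        found |= index.get(f, set())
    return found

db = {
    1: ["he", "he have apple"],
    2: ["he", "he eat red apple"],
    3: ["he", "he have pencil"],
}
index = build_inverted_index(db)
hits = candidates(index, ["he have apple", "eat apple"])
longest_posting_list = max(index, key=lambda f: len(index[f]))
```

In this toy index the frequent phrase "he" has the longest posting list, which hints at why frequent phrases inflate the inverted index.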