Post on 05-Jan-2016

Page 1: Using Pivot/Bridge Languages

Using Pivot/Bridge Languages

Matthias Eck

Page 2: Using Pivot/Bridge Languages

General Problem

Resources are available between languages A and B… and between languages B and C… but not C and A

How to train translation models between C and A?

[Diagram: languages A, B and C; resources link A–B and B–C, but not C–A]

Page 3: Using Pivot/Bridge Languages

1st paper

Multipath Translation Lexicon Induction via Bridge Languages

Gideon S. Mann and David Yarowsky NAACL 2001

Method for inducing translation lexicons based on transduction models of cognate pairs via bridge languages

Page 4: Using Pivot/Bridge Languages

Lexicon via Cognate pairs

Lexicon: Mapping of word in source language to words in target language

Here: Lexicon is built between arbitrary languages using models of cognate pairs and cognate distance

Page 5: Using Pivot/Bridge Languages

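General idea

[Diagram: a dictionary links the source language (English) to a bridge language; a cognate model links the bridge language to the target language. Romance family shown: Spanish, Portuguese, Italian, French, Romanian.]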

Page 6: Using Pivot/Bridge Languages

Cognate pairs can make up a significant portion of the lexicon if the languages are close and in the same family

Translation pairs

English   French
nephew    neveu     typical cognate pair
father    père      historically related, but now distant
water     eau       not related

Page 7: Using Pivot/Bridge Languages

Cognate string edit distance

Obvious condition for a good distance D:

If $(s,c)$ is a cognate pair and $(s,n)$ is a noncognate pair, with $s \in S$ and $c, n \in T$,

Then $D(s,c) < D(s,n)$

So we choose

$$\hat{t} = \arg\min_{t \in T} D(s,t)$$

…as the translation for s

Page 8: Using Pivot/Bridge Languages

Used distance measures

L: Levenshtein distance
  Minimum sum of the costs of edit operations required to transform one string into another
  Deletion, Substitution, Insertion – traditional cost 1

S: Stochastic transducers
  Probabilistic costs for each possible edit operation

H: Hidden Markov Model
  Each character has separate edit operation parameters
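A minimal sketch of measure L together with the argmin selection rule, assuming unit costs for all three edit operations. The function names `levenshtein` and `closest_target` are illustrative and not taken from the paper.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertion, deletion and substitution each cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if characters match)
            ))
        prev = curr
    return prev[-1]


def closest_target(source_word: str, target_words: list) -> str:
    """Choose the target word t that minimizes D(s, t)."""
    return min(target_words, key=lambda t: levenshtein(source_word, t))
```

For example, `closest_target("nephew", ["eau", "neveu", "pere"])` picks `"neveu"`, since the other two candidates are further away in edit distance.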

Page 9: Using Pivot/Bridge Languages

Distance Measures

Variants of Levenshtein distance:
  L-V: vowel substitution costs only 0.5
  L-S/L-A: Filter probabilities obtained by S into 3 classes: 0.5, 0.75, 1
    L-S: Each pair separately trained
    L-A: Collectively trained for all Romance languages

Limitation
  Method cannot discover translation pairs that have no surface-form relationship

Assumed cognate pairs: Levenshtein edit distance < 3
  Few false positives
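A sketch of the L-V variant and the cognate filter, under stated assumptions: vowels are the five plain ASCII vowels, a vowel-to-vowel substitution costs 0.5, and every other operation costs 1. The names `edit_distance`, `vowel_cost` and `assumed_cognates` are hypothetical; the trained L-S/L-A cost classes are not modelled here.

```python
VOWELS = set("aeiou")  # assumption: plain ASCII vowels only, no accented forms


def unit_cost(x: str, y: str) -> float:
    """Substitution cost of plain Levenshtein distance."""
    return 0.0 if x == y else 1.0


def vowel_cost(x: str, y: str) -> float:
    """L-V style cost: substituting one vowel for another costs only 0.5."""
    if x == y:
        return 0.0
    return 0.5 if (x in VOWELS and y in VOWELS) else 1.0


def edit_distance(a: str, b: str, sub_cost=unit_cost) -> float:
    """Edit distance with pluggable substitution cost; insert/delete cost 1."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,                    # deletion
                            curr[j - 1] + 1.0,                # insertion
                            prev[j - 1] + sub_cost(ca, cb)))  # substitution
        prev = curr
    return prev[-1]


def assumed_cognates(pairs, threshold: float = 3.0):
    """Keep word pairs whose plain Levenshtein distance is below the threshold."""
    return [(s, t) for s, t in pairs if edit_distance(s, t) < threshold]
```

With these costs, `edit_distance("casa", "cosa", vowel_cost)` is 0.5 while the plain distance is 1.0.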

Page 10: Using Pivot/Bridge Languages

Intra Family Translation Lexicon Induction

Family: Romance languages
Available: dictionary (English/Bridge language)

General evaluation algorithm:
1. Select 100 word pairs from dictionary for testing
2. For adaptive metrics: Select hypothesized word pairs (Edit distance < 3) as cognate pairs and train on them
3. For each word in source language select closest word from the 100 target words

Page 11: Using Pivot/Bridge Languages

Results

Source Languages: Spanish, French, Italian, Romanian

Target Language: Portuguese

1000 word pairs in dictionary for Spanish/Portuguese, 900 for other language pairs

Page 12: Using Pivot/Bridge Languages

Results

Pure Levenshtein distance works surprisingly well
S gives boost on French-Portuguese
  Reason could be that Spanish-Portuguese are closer than French-Portuguese
L-S usually best

Page 13: Using Pivot/Bridge Languages

Consonant-to-consonant

Consonant-to-consonant edit operations

Most probable for French – Portuguese

French   Portuguese
n        m
c        g
p        f
g        n
b        v
p        f
x        s
s        c
c        q
g        v
t        d

Page 14: Using Pivot/Bridge Languages

Analysis

Page 15: Using Pivot/Bridge Languages

Analysis - Example

Page 16: Using Pivot/Bridge Languages

Multiple bridge languages

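[Diagram: a dictionary links the source language (English) to several bridge languages; cognate models link the bridge languages to the target language. Slavic family shown: Czech, Ukrainian, Russian, Polish, Serbian.]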

Page 17: Using Pivot/Bridge Languages

Translation Lexicon Induction

Algorithm (One or more bridge languages)

For each word s ∈ S
  For each bridge language B
    Translate s → b ∈ B
    ∀ t ∈ T, calculate D(b,t)
    Rank t by D(b,t)
  Score t using information from all bridges
  Select highest scored t
  Map s → t
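The loop above can be sketched as follows. The slide does not say how the per-bridge rankings are combined, so this sketch assumes a sum of reciprocal ranks; plain Levenshtein distance stands in for D, and the data layout (`bridge_dicts` mapping a bridge name to a source-to-bridge dictionary) is hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Unit-cost edit distance, used here as the distance D(b, t)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def induce_lexicon(source_words, bridge_dicts, target_words):
    """Map each source word to the target word best supported by all bridges."""
    lexicon = {}
    for s in source_words:
        scores = {}
        for bridge_dict in bridge_dicts.values():
            b = bridge_dict.get(s)  # translate s -> b via the dictionary
            if b is None:
                continue            # this bridge has no entry for s
            # rank all target words by their distance to the bridge word
            ranked = sorted(target_words, key=lambda t: levenshtein(b, t))
            for rank, t in enumerate(ranked):
                # assumption: combine bridges by summing reciprocal ranks
                scores[t] = scores.get(t, 0.0) + 1.0 / (1 + rank)
        if scores:
            lexicon[s] = max(scores, key=scores.get)  # highest scored t
    return lexicon
```

A toy call with English as source, Spanish and Italian as bridges, and Portuguese as target:

```python
bridges = {
    "es": {"night": "noche", "water": "agua"},
    "it": {"night": "notte", "water": "acqua"},
}
print(induce_lexicon(["night", "water"], bridges, ["noite", "agua", "casa"]))
```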

Page 18: Using Pivot/Bridge Languages

Results

One bridge language, but multiple paths

Page 19: Using Pivot/Bridge Languages

Examples

Page 20: Using Pivot/Bridge Languages

Different Pathways

English to Portuguese (via Romance languages)

English to Norwegian (via Germanic languages)

English to Ukrainian (via Slavic languages)

Portuguese to English (via Germanic languages, French)

Page 21: Using Pivot/Bridge Languages

Results

Page 22: Using Pivot/Bridge Languages

2nd Paper

Inducing Translation Lexicons via Diverse Similarity Measures and Bridge Languages

Charles Schafer and David Yarowsky COLING 2002

Improves results of first paper by introducing additional similarity scores between candidate translations

Page 23: Using Pivot/Bridge Languages

Basic Idea

Decompose:

P(English|Serbian) = P(English|Czech) x P(Czech|Serbian)

For any language L close to Czech: P(English|L) = P(English|Czech) x P(Czech|L)

P(Czech|L), as presented so far, was estimated using similarity on cognate pairs
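The product form can be read as one term of a sum over bridge words. As a sketch, assuming (this is not stated on the slide) that the English word $e$ depends on the source word $s$ only through the Czech bridge word $c$:

$$P(e \mid s) = \sum_{c} P(e \mid c, s)\, P(c \mid s) \approx \sum_{c} P(e \mid c)\, P(c \mid s)$$

When a single bridge word dominates the sum, this reduces to the product P(English|Czech) x P(Czech|L) shown above.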

Page 24: Using Pivot/Bridge Languages

Covered Languages

[Diagram: English is linked to two bridge languages. Czech bridges to Polish, Slovak, Ukrainian, Bulgarian, Serbian and Slovene; Hindi bridges to Nepali, Bengali, Marathi, Gujarati and Punjabi.]

Page 25: Using Pivot/Bridge Languages

Resources

Serbian – Czech – English
  Czech – English dictionary: 171k word pairs
  Corpora: English: 192M words, Serbian: 12M (News data from web)

Gujarati – Hindi – English
  Hindi – English dictionary: 74k word pairs
  Corpora: Gujarati: 2M

Page 26: Using Pivot/Bridge Languages

Problem with Cognate Pairs

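[Diagram: the Serbian word "prazan" is matched against the Czech words "prizen", "pazen" and "prazdny"; the English translations shown are "favor", "grace", "patronage" (not correct) and "blank", "empty" (correct).]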

Page 27: Using Pivot/Bridge Languages

Idea

Introduce additional similarity models
  Weighted Levenshtein Similarity
  Context Similarity
  Date distributional Similarity
  Relative frequency Similarity
  Burstiness Similarity and Inverse Document Frequency
  Use of Additional Bridge Languages

Combine them with weighted string distance

Page 28: Using Pivot/Bridge Languages

Weighted Levenshtein Similarity

1st iteration: vowel cluster operations have half the cost of single consonant substitutions, insertions and deletions

dist(vowel+, vowel+)

Use highest weighted of the top 2000 to re-estimate edit weights

Some high probability substitutions:

Page 29: Using Pivot/Bridge Languages

Context Similarity

Compare narrow and wide contexts for candidates
Context: bag of words (Narrow: radius 1 / Wide: radius 10)

1. Calculate Context on Source Language (Serbian)
2. Translate to English using current estimations
3. Compare with English Contexts via Cosine Similarity
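The three steps can be sketched as below. This is an illustration, not the authors' implementation: the helper names are hypothetical, and splitting a source count evenly over all of a word's dictionary translations is an assumption.

```python
import math
from collections import Counter


def context_vector(tokens, word, radius):
    """Bag-of-words counts of tokens within `radius` positions of `word`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - radius):i]
            right = tokens[i + 1:i + 1 + radius]
            counts.update(left + right)
    return counts


def translate_vector(vec, lexicon):
    """Project a source-language context vector into English.

    Assumption: each count is split evenly over the word's translations;
    words without a lexicon entry are dropped.
    """
    out = Counter()
    for w, c in vec.items():
        translations = lexicon.get(w, [])
        for e in translations:
            out[e] += c / len(translations)
    return out


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0
```

Each English candidate would then be scored by `cosine(translated_source_context, candidate_context)`, once with radius 1 and once with radius 10.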

Page 30: Using Pivot/Bridge Languages

Context Similarity - Example

Nezavisnost
  pravo: 2
  suvereniteti: 3
  deklaracije: 3
  pokrajina: 4

Context in Serbian Corpus with frequencies

Page 31: Using Pivot/Bridge Languages

Context Similarity - Example

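Nezavisnost
  pravo: 2, suvereniteti: 3, deklaracije: 3, pokrajina: 4

Translate with Initial Lexicon

[Figure: the Serbian context counts are mapped to the English context words justice, majesty, sovereignty, declaration, country, ornamental, with the weights 2, 1.5, 1.5, 1.5, 4, 1.5 as listed on the slide]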

Page 32: Using Pivot/Bridge Languages

Context Similarity - Example

Context of Candidates in English Corpus

[Figure: context count vectors in the English corpus for the candidates expression, religion, Independence and Freedom, over the context words justice, majesty, sovereignty, declaration, country, ornamental]

Page 33: Using Pivot/Bridge Languages

Context Similarity - Example

[Figure: cosine similarity (COS) between the translated Serbian context vector of Nezavisnost and the English context vector of each candidate: expression, religion, Independence, Freedom]

Cosine Similarity finds correct candidate (Independence)

Page 34: Using Pivot/Bridge Languages

Date distributional Similarity

News Data: Events are reported in parallel in multiple languages

(+/- 2 days)

Construct term frequency vectors over time and compare candidates
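One way to sketch this comparison, under stated assumptions: each corpus is a list of `(date, tokens)` documents, the "+/- 2 days" tolerance is modelled by spreading each count over neighbouring days, and cosine similarity is used to compare the vectors. None of these choices is confirmed by the slide.

```python
import math
from collections import Counter
from datetime import date, timedelta


def date_vector(dated_docs, word, window_days=0):
    """Term frequency of `word` per day, spread over +/- window_days."""
    vec = Counter()
    for day, tokens in dated_docs:
        n = tokens.count(word)
        if n:
            for offset in range(-window_days, window_days + 1):
                vec[day + timedelta(days=offset)] += n
    return vec


def cosine(u, v):
    """Cosine similarity between two sparse date-indexed vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data: the event is reported one day later in the English news.
serbian = [(date(2002, 1, 1), ["nezavisnost", "pravo"])]
english = [(date(2002, 1, 2), ["independence"]),
           (date(2002, 2, 1), ["religion"])]

src = date_vector(serbian, "nezavisnost", window_days=2)
sim_independence = cosine(src, date_vector(english, "independence"))
sim_religion = cosine(src, date_vector(english, "religion"))
```

In this toy case the correct candidate gets a positive similarity, while the candidate that appears a month later gets zero.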

Page 35: Using Pivot/Bridge Languages

Date distributional Similarity

Page 36: Using Pivot/Bridge Languages

Relative Frequencies

Word and translation are likely to have similar relative frequencies

Modest frequency variations are expected

Useful to rule out pairings with several orders of magnitude difference in relative frequency

Ratio of logs of frequencies correlates well with translational compatibility
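A sketch of a score built on the ratio of log relative frequencies. The exact formula is not given on the slide; arranging the ratio so that it lies in [0, 1] with 1 for identical frequencies is an assumption, as are the function names.

```python
import math


def relative_frequency(count: int, corpus_size: int) -> float:
    """Occurrences of a word divided by the number of tokens in its corpus."""
    return count / corpus_size


def rf_similarity(rf_a: float, rf_b: float) -> float:
    """Ratio of log relative frequencies, smaller magnitude over larger.

    Expects 0 < rf <= 1, so both logs are non-positive.
    """
    la, lb = abs(math.log(rf_a)), abs(math.log(rf_b))
    if la == lb:
        return 1.0  # identical frequencies, including the rf == 1 edge case
    return min(la, lb) / max(la, lb)
```

A pair whose relative frequencies differ by one order of magnitude scores higher than a pair that differs by four, which is the behaviour needed to rule out badly mismatched candidates.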

Page 37: Using Pivot/Bridge Languages

Relative Frequency Similarity

Correct translation “laud” has higher RF Score than higher ranked incorrect candidates “calibre”, “quarter” and “class”

Page 38: Using Pivot/Bridge Languages

Burstiness Similarity

Define Burstiness to measure differences

Page 39: Using Pivot/Bridge Languages

Burstiness Similarity

Burstiness matches better for correct translations “laud” and “praise”

Page 40: Using Pivot/Bridge Languages

Combine the different measures

1. Weighted Levenshtein distance to get initial candidate pairs

2. Calculate 8 similarity measures
   Weighted Levenshtein
   Wide bag-of-words context similarity
   Narrow bag-of-words context similarity
   Local News date distribution similarity
   All News date distribution similarity
   IDF similarity
   Burstiness similarity

Page 41: Using Pivot/Bridge Languages

Combine the different measures

3. Integrate similarity measures into a single similarity function:
   1. POS Similarity
      Bias in favor of compatible parts of speech (N, V, ADJ)
      Penalty for non-matching candidates
   2. Sort candidates for each score in decreasing order
      Assign Ranks 0,1,… and normalize by count
   3. Scoring: Similarity models have associated weights
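Steps 2 and 3 can be sketched as a weighted sum of normalized ranks. The slide only says that the similarity models have associated weights, so the weighted sum is an assumption; the POS step is left out, and the measure names and weights below are made up for illustration.

```python
def normalized_ranks(scores):
    """Rank candidates by decreasing score (0 = best), divided by their count."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    n = len(ordered)
    return {cand: rank / n for rank, cand in enumerate(ordered)}


def combine(measure_scores, weights):
    """Order candidates by the weighted sum of their normalized ranks.

    Lower combined value means better; the best candidate comes first.
    """
    combined = {}
    for name, scores in measure_scores.items():
        for cand, r in normalized_ranks(scores).items():
            combined[cand] = combined.get(cand, 0.0) + weights[name] * r
    return sorted(combined, key=combined.get)


# Hypothetical scores for three English candidates under two measures.
measures = {
    "levenshtein": {"independence": 0.9, "freedom": 0.7, "religion": 0.2},
    "context": {"independence": 0.8, "religion": 0.5, "freedom": 0.1},
}
ranking = combine(measures, {"levenshtein": 1.0, "context": 2.0})
```

Working on ranks rather than raw scores keeps measures with very different value ranges comparable before the weights are applied.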

Page 42: Using Pivot/Bridge Languages

Weight Allocation

Page 43: Using Pivot/Bridge Languages

Evaluation

3 Evaluation Criteria
  Exact Match Accuracy
  Percentage of correct English in the top k ranks
  Median Position of the per word highest ranked correct translation
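A sketch of the three criteria, assuming that exact match means the top-ranked candidate is correct, that positions are counted from 1, and that the median is taken over words for which some correct translation appears in the candidate list. The data layout (`predictions` as ranked lists, `gold` as sets) is hypothetical.

```python
from statistics import median


def best_correct_position(ranked, gold):
    """1-based position of the highest-ranked correct translation, or None."""
    for pos, cand in enumerate(ranked, start=1):
        if cand in gold:
            return pos
    return None


def evaluate(predictions, gold, k=10):
    """Compute exact-match accuracy, top-k accuracy and median position."""
    positions = [best_correct_position(predictions[w], gold[w])
                 for w in predictions]
    n = len(positions)
    found = [p for p in positions if p is not None]
    return {
        "exact_match": sum(p == 1 for p in positions) / n,
        "top_k": sum(p is not None and p <= k for p in positions) / n,
        "median_position": median(found) if found else None,
    }
```

For instance, with ranked candidates `{"a": ["x", "y"], "b": ["y", "x", "z"]}` and gold sets `{"a": {"x"}, "b": {"z"}}`, the best correct positions are 1 and 3.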

Page 44: Using Pivot/Bridge Languages

Results

Page 45: Using Pivot/Bridge Languages

Results

Improvements with second bridge language

Page 46: Using Pivot/Bridge Languages

Additional Bridge Language Work

Interlingua based Statistical Machine Translation
Manuel Kauers, Stephan Vogel, Christian Fügen, Alex Waibel
ICSLP 2002

Paper covers SMT from Text to a structured Interlingua format (IF)

An English/IF corpus is available… but how can other languages be translated into IF?

[Diagram: English and IF]

Page 47: Using Pivot/Bridge Languages

Generalized problem

Assume we have translation models F to E and G to F… but we want G to E

Decompose:

Because:

[Diagram: languages E, F and G; models exist for F → E and G → F, but not for G → E]

Page 48: Using Pivot/Bridge Languages

And just translating…

Experiments done during PF-STAR project 2003/2004

Training data: 48k lines of BTEC data
Test data: 506 lines, Test set for CSTAR 2003

Translating Chinese → Italian
Also via a bridge language: Chinese → English → Italian

            Ch → It          Ch → En → It
ITC-IRST    0.1769/4.5251    0.1695/4.4343
CMU/UKA     0.2030/4.8210    0.2238/4.9453