Learning Bilingual Lexicons from Monolingual Corpora
Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein
Computer Science Division, University of California, Berkeley

Post on 23-Feb-2016

Page 1: Learning Bilingual Lexicons from Monolingual Corpora

Learning Bilingual Lexicons from Monolingual Corpora

Aria Haghighi, Percy Liang, Taylor Berg-Kirkpatrick and Dan Klein

Computer Science Division, University of California, Berkeley

Page 2: Learning Bilingual Lexicons from Monolingual Corpora

Standard MT Approach

Source Text

Target Text

Need (lots of) parallel sentences
May not always be available

Need (lots of) sentences

Page 3: Learning Bilingual Lexicons from Monolingual Corpora

MT from Monotext

Source Text

Target Text

This talk: translation w/o parallel text?
Koehn and Knight (2002) & Fung (1995)

Need (lots of) sentences

Page 4: Learning Bilingual Lexicons from Monolingual Corpora

Task: Lexicon Induction

Source Text

Target Text

Source Words s: state, world, name, nation

Target Words t: estado, política, mundo, nombre

Matching m

Page 5: Learning Bilingual Lexicons from Monolingual Corpora

Data Representation

state (Source Text)

Orthographic Features
Features: #st, tat, te#
Values: 1.0, 1.0, 1.0

Context Features
Features: world, politics, society
Values: 20.0, 5.0, 10.0

Page 6: Learning Bilingual Lexicons from Monolingual Corpora

Data Representation

state (Source Text)

Orthographic Features
Features: #st, tat, te#
Values: 1.0, 1.0, 1.0

Context Features
Features: world, politics, society
Values: 5.0, 20.0, 10.0

estado (Target Text)

Orthographic Features
Features: #es, sta, do#
Values: 1.0, 1.0, 1.0

Context Features
Features: mundo, politica, sociedad
Values: 10.0, 17.0, 6.0
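As a rough illustration of the two feature families on this slide, here is a minimal Python sketch. It assumes that orthographic features are counts of character trigrams over the word padded with '#', and that context features are co-occurrence counts inside a fixed token window. The function names, the trigram length, and the window size are illustrative choices, not details taken from the talk.

```python
from collections import Counter

def orthographic_features(word, n=3):
    """Count character n-grams of the word padded with '#' boundary markers."""
    padded = f"#{word}#"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def context_features(target, sentences, window=2):
    """Count tokens that occur within +/- `window` positions of `target`."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            counts.update(t for j, t in enumerate(tokens[lo:hi], start=lo) if j != i)
    return counts

ortho = orthographic_features("state")
context = context_features(
    "state", ["the state of the world", "world politics and the state"]
)
```

For "state" the trigram extractor yields all five trigrams of "#state#", which include the three shown on the slide (#st, tat, te#).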

Page 7: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space

[Figure: PCA shown separately in the source space and in the target space]

Page 8: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space

[Figure: points 1, 2, 3 in the source and target spaces, with PCA shown in each space]

Page 9: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space

[Figure: points 1, 2, 3 in the source and target spaces, with CCA shown in each space]

Page 10: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space, Canonical Space

[Figure: points 1, 2, 3 in the source and target spaces and in the canonical space]
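The projection into a shared canonical space can be sketched with the textbook CCA recipe: whiten each view, then take an SVD of the whitened cross-covariance. The NumPy code below is a minimal sketch under that assumption; the function names and the small ridge term `reg` are illustrative additions and are not taken from the talk.

```python
import numpy as np

def inv_sqrt(cov):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(cov)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def cca(x, y, dim, reg=1e-6):
    """Return projections (a, b) and the top `dim` canonical correlations.

    x: (n, dx) source feature vectors; y: (n, dy) target feature vectors,
    where row i of x and row i of y describe a matched word pair.
    """
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    n = x.shape[0]
    cxx = x.T @ x / n + reg * np.eye(x.shape[1])  # ridge term is an assumption
    cyy = y.T @ y / n + reg * np.eye(y.shape[1])
    cxy = x.T @ y / n
    wx, wy = inv_sqrt(cxx), inv_sqrt(cyy)
    u, s, vt = np.linalg.svd(wx @ cxy @ wy)
    return wx @ u[:, :dim], wy @ vt.T[:, :dim], s[:dim]

# Toy data: both views are noisy linear images of one shared latent variable.
rng = np.random.default_rng(0)
z = rng.normal(size=(500, 1))
x = z @ np.array([[1.0, 0.5, -1.0]]) + 0.1 * rng.normal(size=(500, 3))
y = z @ np.array([[2.0, -1.0]]) + 0.1 * rng.normal(size=(500, 2))
a, b, corr = cca(x, y, dim=1)
```

Projecting with `a` and `b` puts matched source and target vectors close together along the first canonical direction, which is the property the matching step relies on.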

Page 11: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space, Canonical Space

[Figure: point 2 in the source space, the target space, and the canonical space]

Page 12: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

Source Words s

Target Words t

Matching m

Page 13: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

state, estado

Source Space, Target Space, Canonical Space

Page 14: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

Source Words s: state, world, name, nation

Target Words t: estado, nombre, politica, mundo

Matching m

Page 15: Learning Bilingual Lexicons from Monolingual Corpora

Learning: EM?

E-Step: Obtain posterior over matching

M-Step: Maximize CCA Parameters

Page 16: Learning Bilingual Lexicons from Monolingual Corpora

Learning: EM?

[Figure: candidate matchings labeled 0.2, 0.15, 0.30, 0.10, 0.30, ...]

Getting expectations over matchings is #P-hard!
See John DeNero’s paper “The Complexity of Phrase Alignment Problems”
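To make the cost concrete, the sketch below computes edge marginals by brute force. It assumes, purely for illustration, that each perfect matching is weighted by the product of its edge weights; under that assumption the normalizer is the matrix permanent, and the loop visits all n! matchings. The function name and the toy weights are hypothetical.

```python
import itertools
import math

def edge_marginals(weights):
    """P(source i matched to target j) under P(matching) proportional to the
    product of the matching's edge weights. Enumerates all n! matchings."""
    n = len(weights)
    marginals = [[0.0] * n for _ in range(n)]
    total = 0.0  # equals the permanent of `weights`
    for perm in itertools.permutations(range(n)):
        score = math.prod(weights[i][perm[i]] for i in range(n))
        total += score
        for i in range(n):
            marginals[i][perm[i]] += score
    return [[m / total for m in row] for row in marginals]

marginals = edge_marginals([[2.0, 1.0], [1.0, 2.0]])
```

With 2,000 words per side the enumeration is hopeless, which motivates replacing the expectation with a single best matching.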

Page 17: Learning Bilingual Lexicons from Monolingual Corpora

Inference: Hard EM

Hard E-Step: Find bipartite matching

M-Step: Solve CCA
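The alternation on this slide can be sketched as a short loop. The code below is a minimal sketch, not the authors' implementation: the hard E-step uses SciPy's `linear_sum_assignment` to find a maximum-weight matching, and the M-step is passed in as a function so that a CCA solver can be plugged in. To keep the example small, the demo uses a least-squares linear map as a stand-in model (not CCA); `fit_linear`, `score_linear`, and the threshold of 1.0 are hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hard_e_step(weights):
    """Max-weight partial bipartite matching as (source, target) index pairs.

    Negative weights are clipped to zero before solving the assignment, so a
    pair is kept only when matching it beats leaving both words unmatched.
    """
    rows, cols = linear_sum_assignment(np.maximum(weights, 0.0), maximize=True)
    return [(int(i), int(j)) for i, j in zip(rows, cols) if weights[i, j] > 0]

def hard_em(src, tgt, seed_pairs, fit, score, iterations=5):
    """Alternate between fitting on the current matching and re-matching."""
    pairs = list(seed_pairs)
    for _ in range(iterations):
        if not pairs:
            break
        src_idx, tgt_idx = map(list, zip(*pairs))
        params = fit(src[src_idx], tgt[tgt_idx])      # M-step on matched pairs
        pairs = hard_e_step(score(params, src, tgt))  # hard E-step
    return pairs

# Stand-in model for the demo only: a least-squares map from source to target.
def fit_linear(src_matched, tgt_matched):
    return np.linalg.lstsq(src_matched, tgt_matched, rcond=None)[0]

def score_linear(w, src, tgt):
    diff = (src @ w)[:, None, :] - tgt[None, :, :]
    return 1.0 - (diff ** 2).sum(axis=2)  # positive only for close pairs

rng = np.random.default_rng(0)
src = rng.normal(size=(30, 4))
perm = rng.permutation(30)
w_true = rng.normal(size=(4, 4))
tgt = src[perm] @ w_true + 0.01 * rng.normal(size=(30, 4))
true_pairs = [(int(perm[j]), j) for j in range(30)]
learned = hard_em(src, tgt, true_pairs[:10], fit_linear, score_linear)
```

Starting from 10 seed pairs, the loop grows the matching to cover the remaining words, mirroring how a seed lexicon bootstraps the full lexicon.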

Page 18: Learning Bilingual Lexicons from Monolingual Corpora

Experimental Setup

Nouns only (for now)

Seed lexicon – 100 translation pairs

Induce lexicon between top 2k source and target word-types

Evaluation: Precision and Recall against lexicon obtained from Wiktionary

Report p0.33, precision at recall 0.33
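The p0.33 metric can be sketched as follows. This is a minimal sketch under stated assumptions: proposed pairs are ranked by confidence, recall is measured against the number of source words that have a gold entry, and the score is the precision of the shortest ranked prefix that reaches the target recall. The slide does not spell out these details, so the function name and the recall denominator are illustrative.

```python
def precision_at_recall(ranked_pairs, gold, target_recall=0.33):
    """ranked_pairs: (source, target) pairs, most confident first.
    gold: dict mapping a source word to its set of acceptable translations.
    Returns None if the ranked list never reaches the target recall."""
    correct = 0
    for proposed, (source, target) in enumerate(ranked_pairs, start=1):
        if target in gold.get(source, ()):
            correct += 1
        if correct / len(gold) >= target_recall:
            return correct / proposed
    return None

gold = {"state": {"estado"}, "world": {"mundo"}, "name": {"nombre"}}
ranked = [("state", "estado"), ("nation", "nombre"), ("world", "mundo")]
p_low = precision_at_recall(ranked, gold, target_recall=0.33)
p_mid = precision_at_recall(ranked, gold, target_recall=0.6)
p_full = precision_at_recall(ranked, gold, target_recall=1.0)
```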

Page 19: Learning Bilingual Lexicons from Monolingual Corpora

Feature Experiments

Baseline: Edit Distance

4k EN-ES Wikipedia Articles

[Bar chart, precision axis 0-100: Edit Dist 61.1]
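For reference, an edit-distance baseline can be sketched in a few lines. The slide does not say how distances are turned into a lexicon, so the code below makes an assumption: each source word is paired with the target word that has the smallest length-normalized Levenshtein distance. The function names and the normalization are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                       # delete char_a
                curr[j - 1] + 1,                   # insert char_b
                prev[j - 1] + (char_a != char_b),  # substitute or match
            ))
        prev = curr
    return prev[-1]

def nearest_by_edit_distance(source_words, target_words):
    """Map each source word to its closest target word (normalized distance)."""
    def normalized(s, t):
        return edit_distance(s, t) / max(len(s), len(t))
    return {s: min(target_words, key=lambda t: normalized(s, t))
            for s in source_words}

lexicon = nearest_by_edit_distance(
    ["state", "name"], ["estado", "nombre", "mundo"]
)
```

A baseline of this kind can only recover pairs that look alike, which is why it struggles on non-cognates.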

Page 20: Learning Bilingual Lexicons from Monolingual Corpora

Feature Experiments

MCCA: Only orthographic features

4k EN-ES Wikipedia Articles

[Bar chart, precision axis 0-100: Edit Dist 61.1, Ortho 80.1]

Page 21: Learning Bilingual Lexicons from Monolingual Corpora

Feature Experiments

MCCA: Only context features

4k EN-ES Wikipedia Articles

[Bar chart, precision axis 0-100: Edit Dist 61.1, Ortho 80.1, Context 80.2]

Page 22: Learning Bilingual Lexicons from Monolingual Corpora

Feature Experiments

MCCA: Orthographic and context features

4k EN-ES Wikipedia Articles

[Bar chart, precision axis 0-100: Edit Dist 61.1, Ortho 80.1, Context 80.2, MCCA 89.0]

Page 23: Learning Bilingual Lexicons from Monolingual Corpora

Feature Experiments

[Plot: precision vs. recall]

Page 24: Learning Bilingual Lexicons from Monolingual Corpora

Feature Experiments

[Plot: precision vs. recall]

Page 25: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Identical Corpora

100k EN-ES Europarl Sentences

[Bar chart, precision axis 0-100: Identical 93.8]

Page 26: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Comparable Corpora

4k EN-ES Wikipedia Articles

[Bar chart, precision axis 0-100: Identical 93.8, Wiki 89.0]

Page 27: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Unrelated Corpora

100k English and Spanish Gigaword

[Bar chart, precision axis 0-100: Identical 93.8, Wiki 89.0, Unrelated 68.3]

Page 28: Learning Bilingual Lexicons from Monolingual Corpora

Seed Lexicon Source

Automatic Seed
Use edit distance to induce seed lexicon as in Koehn & Knight (2002)

4k EN-ES Wikipedia Articles

[Bar chart, precision axis 0-100: Auto Seed 91.8, Gold Seed 93.8]

Page 29: Learning Bilingual Lexicons from Monolingual Corpora

Analysis

Page 30: Learning Bilingual Lexicons from Monolingual Corpora

Analysis

Top Non-Cognates

Page 31: Learning Bilingual Lexicons from Monolingual Corpora

Analysis

Interesting Mistakes

Page 32: Learning Bilingual Lexicons from Monolingual Corpora

Language Variation

Page 33: Learning Bilingual Lexicons from Monolingual Corpora

Language Variation

Page 34: Learning Bilingual Lexicons from Monolingual Corpora

Analysis

Orthography Features

Context Features

Page 35: Learning Bilingual Lexicons from Monolingual Corpora

Summary

Learned bilingual lexicon from monotext
Matching + CCA model
Possible even from unaligned corpora
Possible for non-related languages
High-precision, but much left to do!

Page 36: Learning Bilingual Lexicons from Monolingual Corpora

Thank you!

http://nlp.cs.berkeley.edu

Page 37: Learning Bilingual Lexicons from Monolingual Corpora
Page 38: Learning Bilingual Lexicons from Monolingual Corpora

Error Analysis

Top 100 errors
21 correct translations not in gold
30 were semantically related
15 were orthographically related (coast, costas)
30 were seemingly random

Page 39: Learning Bilingual Lexicons from Monolingual Corpora

BLEU Experiment

On English-French only 1k parallel sentences
Without lexicon BLEU: 13.61
With lexicon BLEU: 15.22

Page 40: Learning Bilingual Lexicons from Monolingual Corpora

More Numbers

Page 41: Learning Bilingual Lexicons from Monolingual Corpora
Page 42: Learning Bilingual Lexicons from Monolingual Corpora

Conclusion

Three cases of unsupervised learning in NLP

Unsupervised systems can be competitive with supervised systems

Future problems
Document summarization
Building MindNet-like resources
Discourse Analysis

Page 43: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

state, estado

Source Space, Target Space, Latent Space

Orthographic Features
Features: #st, tat, te#
Values: 1.0, 1.0, 1.0

Context Features
Features: world, politics, society
Values: 5.0, 20.0, 10.0

Generate Matched Words

Page 44: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

state, estado

Source Space, Target Space, Latent Space

Orthographic Features
Features: #st, tat, te#
Values: 1.0, 1.0, 1.0

Context Features
Features: world, politics, society
Values: 5.0, 20.0, 10.0

Generate Matched Words

state

Page 45: Learning Bilingual Lexicons from Monolingual Corpora

Translation Lexicon Induction

Source Text

Target Text

Source Words s: state, world, name

Target Words t: estado, nombre, mundo

Matching m

Page 46: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

For each matched word pair:

For each unmatched source word:

For each unmatched target word:

Page 47: Learning Bilingual Lexicons from Monolingual Corpora

Results: Accuracy

Page 48: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Disjoint Sentences

[Bar chart: Parallel, Wiki, Disjoint]

Page 49: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Unrelated

[Bar chart: Parallel, Wiki, Unrelated]

Page 50: Learning Bilingual Lexicons from Monolingual Corpora

Machine Translation

Source Text

Target Text

Page 51: Learning Bilingual Lexicons from Monolingual Corpora

Machine Translation

Source Text

Target Text

Page 52: Learning Bilingual Lexicons from Monolingual Corpora

Machine Translation

Source Word    Target Word    P(T | S)
state          estado         0.98
world          mundo          0.97
name           nombre         0.99

Source Text

Target Text

What are we generating?

Page 53: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space, Canonical Space

[Figure: points 1, 2, 3 in the source and target spaces under PCA and CCA, and in the canonical space]

Page 54: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Unrelated Corpora

[Bar chart: Parallel, Wiki; one series is labeled Best F1]

Page 55: Learning Bilingual Lexicons from Monolingual Corpora

Inference: EM?

E-Step: Compute matching posteriors P(m | s, t)

M-Step: Estimate

Page 56: Learning Bilingual Lexicons from Monolingual Corpora

Data Representation

state (Source Text)

Orthographic Features
Features: #st, tat, te#
Values: 1.0, 1.0, 1.0

Context Features
Features: world, politics, society
Values: 5.0, 20.0, 10.0

estado (Target Text)

Orthographic Features
Features: #es, sta, do#
Values: 1.0, 1.0, 1.0

Context Features
Features: mundo, politica, sociedad
Values: 10.0, 17.0, 6.0

What are we generating?

Page 57: Learning Bilingual Lexicons from Monolingual Corpora

Language Variation

Page 58: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

state, estado

Source Space, Target Space, Latent Space

Generate matched word vectors

Page 59: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

Source Words s: state, world, name, nation

Target Words t: estado, nombre, política, mundo

Matching m

Generate unmatched word vectors

Page 60: Learning Bilingual Lexicons from Monolingual Corpora

Results: Example Matches

Page 61: Learning Bilingual Lexicons from Monolingual Corpora

Results: Examples

Top Non-Cognates

Interesting Mistakes

Page 62: Learning Bilingual Lexicons from Monolingual Corpora

Canonical Correlation Analysis

Source Space, Target Space, Canonical Space

[Figure: points 1, 2, 3 in the source and target spaces under PCA and CCA, and in the canonical space]

Page 63: Learning Bilingual Lexicons from Monolingual Corpora

Generative Model

Source Words s: state, world, name, nation

Target Words t: estado, nombre, política, mundo

Matching m

Page 64: Learning Bilingual Lexicons from Monolingual Corpora

Corpus Variation

Identical Corpora

100k EN-ES Europarl Sentences

[Plot: precision vs. recall; p0.33 = 89.0]