Word and Phrase Alignment
Presenters: Marta Tatu and Mithun Balakrishna
Translating Collocations for Bilingual Lexicons: A Statistical Approach
Frank Smadja, Kathleen R. McKeown, and Vasileios Hatzivassiloglou (CL-1996)
Overview – Champollion
- Translates collocations from English into French using an aligned corpus (Hansards)
- The translation is constructed incrementally, adding one word at a time
- Correlation method: the Dice coefficient
- Accuracy between 65% and 78%
The Similarity Measure: the Dice coefficient (Dice, 1945)

Dice(X, Y) = 2 · p(X, Y) / (p(X) + p(Y))

where p(X, Y) is the joint probability and p(X), p(Y) are the marginal probabilities of X and Y.

If the probabilities are estimated using maximum likelihood, then

Dice(X, Y) = 2 · f_XY / (f_X + f_Y)

where f_X, f_Y, and f_XY are the absolute frequencies of appearance of “1”s for X and Y.
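As a quick illustration, the maximum-likelihood form can be computed directly from the three frequencies (a minimal sketch; the example counts are made up):

```python
def dice(f_x: int, f_y: int, f_xy: int) -> float:
    """Dice coefficient from absolute frequencies: the maximum-likelihood
    estimate 2 * f_XY / (f_X + f_Y)."""
    if f_x + f_y == 0:
        return 0.0
    return 2.0 * f_xy / (f_x + f_y)

# Example with made-up counts: two words appearing in 60 and 50 sentences,
# co-occurring in 40 of them.
score = dice(f_x=60, f_y=50, f_xy=40)  # 2*40 / (60+50) ~ 0.727
```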
Algorithm - Preprocessing
- Source and target language sentences must be aligned (Gale and Church, 1991)
- A list of collocations to be translated must be provided (Xtract; Smadja, 1993)
Algorithm 1/3
1. Champollion identifies a set S of k words highly correlated with the source collocation
   - The target collocation is in the powerset of S
   - These words have a Dice measure of at least Td (= 0.10) and appear at least Tf (= 5) times
2. Form all pairs of words from S
3. Evaluate the correlation between each pair and the source collocation (Dice)
Algorithm 2/3
4. Keep pairs that score above the threshold Td
5. Construct 3-word elements containing one of the highly correlated pairs plus a member of S
6. …
7. Repeat until, for some n ≤ k, no n-word element scores above the threshold
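The iterative growth of candidate word sets (steps 1 through 7) can be sketched as a level-wise search. The lookup tables `freq` and `cooc` are hypothetical stand-ins for counts gathered from the aligned corpus: how many target sentences contain all the words of a candidate, and how many of those align with a source sentence containing the collocation.

```python
def dice(f_x, f_y, f_xy):
    return 2.0 * f_xy / (f_x + f_y) if f_x + f_y else 0.0

def grow_candidates(S, f_source, freq, cooc, td=0.10):
    """Level-wise candidate growth in the style of steps 2-7 (a sketch).
    Returns every (score, word-set) candidate scoring at least td."""
    candidates = []
    level = {frozenset([w]) for w in S}          # 1-word elements
    while level:
        scored = [(dice(freq[c], f_source, cooc[c]), c) for c in level]
        survivors = [(s, c) for s, c in scored if s >= td]
        if not survivors:
            break                                # step 7: no n-word set clears td
        candidates.extend(survivors)
        # step 5 generalized: extend each surviving n-set by one member of S
        level = {c | {w} for _, c in survivors for w in S if w not in c}
    return candidates

# Toy counts (hypothetical): the source collocation occurs in 10 sentences.
S = {"premier", "ministre"}
freq = {frozenset({"premier"}): 12, frozenset({"ministre"}): 11,
        frozenset({"premier", "ministre"}): 9}
cooc = {frozenset({"premier"}): 9, frozenset({"ministre"}): 9,
        frozenset({"premier", "ministre"}): 9}
best_score, best = max(grow_candidates(S, 10, freq, cooc))
```

In this toy run, the two-word set scores highest, so the full pair is kept as the translation candidate.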
Algorithm 3/3
8. Champollion selects the best translation among the top candidates
9. In case of ties, the longer collocation is preferred
10. For multiword translations, determine whether the selected translation is a flexible or a rigid collocation (or a single word)
    - Are the words used consistently in the same order and at the same distance?
Experimental Setup
- DB1 = 3.5×10^6 words (8 months of 1986)
- DB2 = 8.5×10^6 words (1986 and 1987)
- C1 = 300 collocations from DB1 of mid-range frequency
- C2 = 300 collocations from 1987
- C3 = 300 collocations from 1988
- Three fluent bilingual speakers
- Canadian French vs. continental French
Results
Future Work
- Translating the closed-class words
- Tools for the target language
- Separating corpus-dependent translations from general ones
- Handling low-frequency collocations
- Analysis of the effects of thresholds
- Incorporating the length of the translation into the score
- Using nonparallel corpora
Comments
A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora
Pascal Fung (ACL-1995)
Goal of the Paper
- Create a bilingual lexicon of nouns and proper nouns
- From unaligned, noisy parallel texts of Asian/Indo-European language pairs
- Using a pattern matching method
Introduction
- Previous research relied on sentence-aligned, parallel texts
- Alignment is not always practical:
  - Unclear sentence boundaries in corpora
  - Noisy text segments present in only one language
- Two main steps:
  1. Find a small bilingual primary lexicon
  2. Compute a better secondary lexicon from these partially aligned texts
Algorithm
1. Tag the English half of the parallel text
   - Nouns and proper nouns (they have consistent translations over the entire text)
   - The English part is tagged with a modified POS tagger
   - Translations are found for nouns, plural nouns, and proper nouns only
2. Positional difference vectors
   - Correspondence between a word and its translated counterpart, both in their frequency and in their positions
   - The correspondence need not be linear
   - Calculation: p is the position vector of a word; V is the positional difference vector, with V[i-1] = p[i] - p[i-1]
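The vector calculation above is a one-liner; a minimal sketch:

```python
def positional_difference(p):
    """Positional difference vector: V[i-1] = p[i] - p[i-1], where p is the
    (sorted) list of positions at which a word occurs in the text."""
    return [p[i] - p[i - 1] for i in range(1, len(p))]

# A word occurring at positions 3, 10, and 24 in the text:
v = positional_difference([3, 10, 24])  # [7, 14]
```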
3. Match pairs of positional difference vectors, producing scores
   - Dynamic Time Warping (Fung & McKeown, 1994): for non-identical vectors, trace the correspondence between all points in V1 and V2, with no penalty for deletions and insertions
   - Statistical filters
Dynamic Time Warping
- Given V1 and V2, which point in V1 corresponds to which point in V2?
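A textbook DTW recurrence over two positional-difference vectors might look like the following sketch. The absolute difference as local cost is an assumption for illustration; the paper's exact scoring may differ.

```python
def dtw(v1, v2):
    """Dynamic time warping: align every point of v1 with some point of v2
    (and vice versa), minimizing the summed local cost |v1[i] - v2[j]|."""
    n, m = len(v1), len(v2)
    inf = float("inf")
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(v1[i - 1] - v2[j - 1])
            # a step may advance in v1, in v2, or in both: no extra penalty
            # for insertions/deletions, matching the setup on this slide
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]
```

Identical vectors score 0; similar vectors of different lengths still align cheaply because insertions and deletions carry no fixed penalty.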
5. Finding anchor points and eliminating noise
   - For every word pair selected, run DTW to obtain a DTW score and a DTW path
   - Plot the DTW paths of all such word pairs
   - Keep the highly reliable points and discard the rest
   - A point (i, j) is noise if …
6. Finding low-frequency bilingual word pairs
   - Non-linear segment binary vectors: V1[i] = 1 if the word occurs in the ith segment
   - Binary vector correlation measure
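The segment binary vectors can be sketched as below. Dice over the "1" slots is used here as one possible binary correlation measure; the paper's exact measure is not spelled out on this slide, so treat it as an assumption.

```python
def segment_vector(positions, boundaries):
    """Binary segment vector: slot i is 1 if the word occurs anywhere in
    segment i, where segments are the half-open intervals between
    consecutive boundary offsets (e.g. the anchor points from step 5)."""
    vec = [0] * (len(boundaries) - 1)
    for p in positions:
        for i in range(len(vec)):
            if boundaries[i] <= p < boundaries[i + 1]:
                vec[i] = 1
    return vec

def binary_dice(v1, v2):
    """Dice coefficient over the '1' slots of two binary vectors."""
    both = sum(a & b for a, b in zip(v1, v2))
    total = sum(v1) + sum(v2)
    return 2.0 * both / total if total else 0.0

v_en = segment_vector([2, 15], [0, 10, 20, 30])  # [1, 1, 0]
v_zh = segment_vector([4], [0, 10, 20, 30])      # [1, 0, 0]
score = binary_dice(v_en, v_zh)                  # 2*1 / (2+1) ~ 0.667
```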
Results
Comments
Automated Dictionary Extraction for “Knowledge-Free” Example-Based Translation
Ralf D. Brown (TMI-1997)
Goal of the Paper
- Extract a bilingual dictionary using an aligned bilingual corpus
- Perform tests comparing the performance of PanEBMT using:
  - the Collins Spanish-English dictionary + WordNet English root/synonym list
  - various automatically extracted bilingual dictionaries
Introduction
Extracting Bilingual Dictionary
- Extracted from the corpus using a correspondence table and a threshold schema
- Correspondence table:
  - A two-dimensional array, indexed by source-language words and by target-language words
  - The entries for the cross-product of the words in each sentence pair are incremented
- Biased toward language pairs with similar word order
- Threshold setting:
  - A step function: unreachably high for co-occurrence < MIN, constant otherwise
  - A sliding scale: starts at 1.0 for co-occurrence = 1 and slides smoothly down to the MIN threshold value
- Filtering: a symmetric threshold and an asymmetric threshold
- Any element of the correspondence table that fails both tests is set to zero
- The non-zero elements are added to the dictionary
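Table-building plus thresholded filtering might be sketched as follows. Only the table construction is spelled out on the slides; the 1/x decay of the sliding scale and the exact form of the two directional tests are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_table(sentence_pairs):
    """Correspondence table: for each aligned sentence pair, increment the
    entry of every (source word, target word) in the cross-product."""
    table = defaultdict(Counter)
    f_src, f_tgt = Counter(), Counter()
    for src, tgt in sentence_pairs:
        for s in set(src):
            f_src[s] += 1
            for t in set(tgt):
                table[s][t] += 1
        for t in set(tgt):
            f_tgt[t] += 1
    return table, f_src, f_tgt

def sliding_threshold(cooc, min_t=0.5):
    """Sliding scale: 1.0 at co-occurrence 1, decaying smoothly toward the
    MIN threshold (a 1/x decay is assumed here for illustration)."""
    return max(min_t, 1.0 / cooc)

def extract_dictionary(sentence_pairs, min_t=0.5):
    table, f_src, f_tgt = build_table(sentence_pairs)
    entries = []
    for s, row in table.items():
        for t, n in row.items():
            # keep the pair unless it fails the test in both directions
            ok_src = n / f_src[s] >= sliding_threshold(f_src[s], min_t)
            ok_tgt = n / f_tgt[t] >= sliding_threshold(f_tgt[t], min_t)
            if ok_src or ok_tgt:
                entries.append((s, t))
    return entries

pairs = [(["the", "house"], ["la", "maison"]),
         (["the", "cat"], ["le", "chat"])]
d = extract_dictionary(pairs)
```

With such tiny toy data nearly every co-occurring pair survives; on a real corpus the differing sentence frequencies make the two directional ratios far more selective.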
Extracting Bilingual Dictionary - Results
Extracting Bilingual Dictionary - Errors
- High-frequency, error-ridden terms:
  - Shortlist the high-frequency words (all words which appear in at least 20% of the source sentences)
  - Shortlist the sentence pairs containing exactly one or two high-frequency words
  - Results in zero error for 7 of the 16 words
  - Merge with the results from the first pass
Experimental Setup
- Manually created tokenization: 47 equivalence classes, 880 words and translations of each word
- Two test texts:
  - 275 UN corpus sentences (in-domain)
  - 253 newswire sentences (out-of-domain)
Results
Comments
Extracting Paraphrases from a Parallel Corpus
Regina Barzilay and Kathleen R. McKeown (ACL-2001)
Overview
- Corpus-based unsupervised learning algorithm for paraphrase extraction
  - Lexical paraphrases, single- and multi-word: (refuse, say no)
  - Morpho-syntactic paraphrases: (king’s son, son of the king), (start to talk, start talking)
- Underlying assumption: phrases which appear in similar contexts are paraphrases
Data
- Multiple English translations of literary texts written by foreign authors: Madame Bovary, Fairy Tales, Twenty Thousand Leagues Under the Sea, etc.
- 11 translations
Preprocessing
- Sentence alignment:
  - Translations of the same source contain a number of identical words
  - 42% of the words in corresponding sentences are identical, on average
  - Dynamic programming (Gale & Church, 1991)
  - 94.5% correct alignments (on 127 sentences)
- POS tagger and chunker (NP and VP)
Algorithm – Bootstrapping
- Co-training method: DLCoTrain (Collins & Singer, 1999)
- Similar contexts surrounding two phrases → the phrases are paraphrases
- Good paraphrase-predicting contexts → new paraphrases
1. Analyze the contexts surrounding identical words in aligned sentence pairs
2. Use these contexts to learn new paraphrases
Feature Extraction
- Paraphrase features:
  - Lexical: the tokens of each phrase in the paraphrase pair
  - Syntactic: POS tags
- Contextual features: the left and right syntactic contexts surrounding the paraphrase (POS n-grams)
  - tried to comfort her: left1 = “VB1 TO2”, right1 = “PRP$3”
  - tried to console her: left2 = “VB1 TO2”, right2 = “PRP$3”
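Extracting the left/right POS n-gram contexts around a phrase can be sketched as below. This is a simplification: the numeric suffixes in tags like “VB1 TO2” additionally record equality constraints across the sentence pair, which this sketch omits.

```python
def context_features(pos_tags, start, end, n=2):
    """Left/right POS n-gram contexts around the phrase spanning
    pos_tags[start:end] in a tagged sentence (a sketch)."""
    left = tuple(pos_tags[max(0, start - n):start])
    right = tuple(pos_tags[end:end + n])
    return left, right

# "tried to comfort her" -> POS sequence VB TO VB PRP$; phrase = "comfort"
tags = ["VB", "TO", "VB", "PRP$"]
left, right = context_features(tags, 2, 3)
# left matches the slide's left1 = "VB1 TO2"; right matches right1 = "PRP$3"
```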
Algorithm
- Initialization:
  - Identical words are the seeds (positive paraphrasing examples)
  - Negatives are created by pairing each word with all the other words in the sentence
- Training of the context classifier:
  - Record contexts of length ≤ 3 around the positive and negative paraphrases
  - Identify the strong predictors based on their strength and frequency
- Keep the most frequent k = 10 contexts with a strength > 95%
- Training of the paraphrasing classifier:
  - Using the context rules extracted previously, derive new pairs of paraphrases
- Stop when no more paraphrases are discovered
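The bootstrapping loop can be sketched as follows. `extract_contexts` and `match_contexts` are hypothetical callbacks standing in for the corpus scans described above: the first returns per-context (hit, total) counts over the known paraphrases, the second returns the pairs selected by a rule set.

```python
def cotrain(seeds, extract_contexts, match_contexts, k=10, strength=0.95):
    """DLCoTrain-style alternation (a sketch): learn strong context rules
    from the known paraphrases, then use those rules to harvest new
    paraphrases; stop when a pass discovers nothing new."""
    known = set(seeds)
    while True:
        # ctx: {rule: (times it surrounded a known paraphrase, times seen)}
        ctx = extract_contexts(known)
        rules = sorted((c for c, (hit, tot) in ctx.items() if hit / tot > strength),
                       key=lambda c: -ctx[c][1])[:k]
        new = match_contexts(rules) - known
        if not new:
            return known
        known |= new

# Toy run: one seed pair and one perfectly reliable context rule.
seeds = {("comfort", "console")}
rules_db = {"VB TO _ PRP$": (20, 20)}
learned = cotrain(seeds,
                  extract_contexts=lambda known: rules_db,
                  match_contexts=lambda rules:
                      {("comfort", "console"), ("refuse", "say no")}
                      if rules else set())
```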
Results
- 9483 paraphrases and 25 morpho-syntactic rules extracted
- Of 500 checked: 86.5% correct paraphrases without context, 91.6% with context
- 69% recall, evaluated on 50 sentences
Future Work
- Extract paraphrases from comparable corpora (news reports about the same event)
- Improve the context representation
Comments
Thank You!