#5 predicting machine translation quality
TRANSCRIPT
Predicting machine translation quality
I am @bittlingmayer. My company is @SignalNLabs.
interests: translation quality, translation crowdsourcing, transliteration, browser translation integrations, topic classification, automatic source-side correction
previously @Google, @Adobe, @Cerner
Ciao!
Today’s topics
◉ Why translation quality?
◉ What is the problem?
◉ Our data model
◉ Our learning infra
1. Quality estimation?
sentence-level quality
good machine translation vs bad

2. Quality evaluation?
corpus-level quality given reference translations
machine translation vs human translation
Why quality?
Why is predicting quality useful?
Machine translation should not be a gamble.
Optimisation Function
$4.50: 1M chars by machine
$10,000: 1M chars at 5¢/word by human

Perfect Prediction == Perfect Translation
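The slide's numbers make the business case concrete. A minimal sketch of the cost arithmetic, using the per-million-character prices from the slide; the routing fraction is an illustrative value, not a figure from the talk:

```python
# Illustrative cost model from the slide: translating 1M characters costs
# ~$4.50 by machine vs ~$10,000 by human (at 5 cents/word). A quality
# predictor lets us route only the text predicted "bad" to humans.

MACHINE_COST_PER_M_CHARS = 4.50
HUMAN_COST_PER_M_CHARS = 10_000.0

def blended_cost(m_chars: float, human_fraction: float) -> float:
    """Cost of translating m_chars million characters when a predictor
    sends human_fraction of the text to humans and the rest to MT."""
    return m_chars * (human_fraction * HUMAN_COST_PER_M_CHARS
                      + (1 - human_fraction) * MACHINE_COST_PER_M_CHARS)

# Routing 10% of 1M characters to humans:
cost = blended_cost(1.0, 0.10)  # 0.9 * 4.50 + 0.1 * 10000 ≈ 1004.05
```

With a good predictor, the blended cost sits far closer to the machine price than the human price.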
Reinforcement Learning
[Diagram: translator and predictor in a loop. The translator takes actions (translations); the predictor returns rewards (scores, rankings) based on the state.]
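The reinforcement-learning framing on the slide can be sketched as a single step of the loop. All names and the stand-in implementations below are illustrative, not the actual systems:

```python
# Minimal sketch of the slide's RL framing: the translator is the agent,
# its translations are the actions, and the quality predictor supplies
# the reward (scores or rankings).

def predictor(source: str, translation: str) -> float:
    """Stand-in reward model returning a score in [0, 1]; a real
    predictor would be a trained quality-estimation model."""
    return 1.0 if translation else 0.0

def translator(source: str) -> str:
    """Stand-in policy; a real system proposes candidate translations."""
    return source.upper()

source = "hello world"              # state
action = translator(source)         # action: a translation
reward = predictor(source, action)  # reward from the quality predictor
```

A perfect predictor would let the translator improve until its output is indistinguishable from a perfect translation, which is the point of "Perfect Prediction == Perfect Translation".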
What’s the problem?
Is it really harder than self-driving cars?
Language is hard.
Context.
Data are dirty.
Bridging.
What is solvable?
[Chart: payoff vs effort. Problem classes: bad input (50% of errors); context/customisation (like a human; like Search, FB, Maps...); source-side ambiguity (ideally interactive); bad output.]
What is quality?
Can we quantify the quality of a translation?
What is sentence-level quality?
[Chart: two axes, Accuracy and Fluency. Quality bands: Misleading, Low Quality, Good Enough, Human Quality.]
Recall vs Precision vs Accuracy
[Venn diagram: actual bad vs predicted bad.]

Trivial 90% Accuracy Example
[Venn diagram: actual bad vs predicted bad: 100%.]
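Why accuracy alone is the wrong metric here: if only a small fraction of sentences are actually bad, a trivial classifier scores high accuracy while catching nothing. A small self-contained sketch, with made-up example data:

```python
# If 10% of sentences are bad, predicting "good" for everything gives
# 90% accuracy but 0% recall on the bad class, the class we care about.

def recall(actual, predicted):
    bad = [i for i, a in enumerate(actual) if a == "bad"]
    hit = [i for i in bad if predicted[i] == "bad"]
    return len(hit) / len(bad) if bad else 0.0

def precision(actual, predicted):
    pred_bad = [i for i, p in enumerate(predicted) if p == "bad"]
    hit = [i for i in pred_bad if actual[i] == "bad"]
    return len(hit) / len(pred_bad) if pred_bad else 0.0

def accuracy(actual, predicted):
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)

actual = ["bad"] + ["good"] * 9   # 10% of sentences are actually bad
trivial = ["good"] * 10           # trivial classifier: always "good"

acc = accuracy(actual, trivial)   # 0.9
rec = recall(actual, trivial)     # 0.0
```

This is why the evaluation must report recall and precision on the bad class, not just accuracy.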
How does quality vary?
[Matrix: translation direction, from {English, top languages, other} to {English, top languages, other}.]
How does quality vary?
Quality also varies by domain, roughly from easiest to hardest:
◉ Wikipedia
◉ news
◉ dialogues, film subtitles, Coursera, Medium
◉ “everyday” reviews, customer service
◉ your children’s WhatsApp messages
◉ my WhatsApp messages
Other concepts of quality?
How do we solve it?
With data and features
What is our data model?
pair   source               target                 score
en-zh  Hello                您好                   1.0
en-zh  The car is driving.  The car is driving.    0.0
en-ru  The car is driving.  Автомобиль вождения.   0.3
...    ...                  ...                    ...
What is our data model?

pair   source               target                 src_length_bytes  ...  trg_spam_prob  score
en-zh  Hello                您好                   5                 ...  0.5            1.0
en-zh  The car is driving.  The car is driving.    19                ...  0.2            0.0
en-ru  The car is driving.  Автомобиль вождения.   19                ...  0.1            0.3
...    ...                  ...                    ...               ...  ...            ...
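One row of this data model can be sketched as a plain dictionary. The function name and the placeholder spam-probability value are illustrative; only `src_length_bytes`, `trg_spam_prob`, and the hand-assigned score come from the table above:

```python
# Sketch of one row in the data model: a (language pair, source, target)
# triple plus engineered signals and a hand-assigned quality score.

def make_row(pair: str, source: str, target: str, score: float) -> dict:
    return {
        "pair": pair,
        "source": source,
        "target": target,
        "src_length_bytes": len(source.encode("utf-8")),
        # ... 10-1000 engineered signals go here ...
        "trg_spam_prob": 0.0,  # placeholder: a real value comes from a model
        "score": score,        # hand-scored by a linguist
    }

row = make_row("en-zh", "Hello", "您好", 1.0)
# row["src_length_bytes"] == 5
```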
◉ 10-1000 features: signals engineered by us
◉ 1000-10M rows: sentences* hand-scored by linguists
◉ language-agnostic: language is just another feature
Human scores
Evaluate many translations by hand
Human Evaluation Score Types
◉ Labels: good/bad, 0.0-1.0, multilabels, word-level labels
◉ Ranking: rank multiple systems
◉ Post-Edit: to comprehensible, to human quality
Requires a smaller dataset and budget: $0.001/row @ 5x redundancy
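Taking the slide's figure at face value, the labelling budget for the dataset sizes mentioned earlier is simple arithmetic. A sketch, assuming $0.001 per row already includes the 5x redundancy:

```python
# Back-of-the-envelope labelling budget at the slide's quoted rate.
COST_PER_ROW = 0.001  # $, assumed to include 5x redundancy

def label_budget(rows: int) -> float:
    """Total annotation cost in dollars for a given number of rows."""
    return rows * COST_PER_ROW

small = label_budget(1_000)       # ~$1 for the smallest dataset
large = label_budget(10_000_000)  # ~$10,000 for the largest
```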
QuEst baseline features
quest.dcs.shef.ac.uk/quest_files/features_blackbox_baseline_17
◉ number of tokens in the source sentence
◉ number of tokens in the target sentence
◉ average source token length
◉ LM probability of source sentence
◉ LM probability of target sentence
◉ number of occurrences of the target word within the target hypothesis (averaged for all words in the hypothesis: type/token ratio)
◉ average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.2)
◉ average number of translations per source word in the sentence (as given by IBM 1 table thresholded such that prob(t|s) > 0.01) weighted by the inverse frequency of each word in the source corpus
◉ percentage of unigrams in quartile 1 of frequency (lower-frequency words) in a corpus of the source language (SMT training corpus)
◉ percentage of unigrams in quartile 4 of frequency (higher-frequency words) in a corpus of the source language
◉ percentage of bigrams in quartile 1 of frequency of source words in a corpus of the source language
◉ percentage of bigrams in quartile 4 of frequency of source words in a corpus of the source language
◉ percentage of trigrams in quartile 1 of frequency of source words in a corpus of the source language
◉ percentage of trigrams in quartile 4 of frequency of source words in a corpus of the source language
◉ percentage of unigrams in the source sentence seen in a corpus (SMT training corpus)
◉ number of punctuation marks in the source sentence
◉ number of punctuation marks in the target sentence
In short:
◉ number of tokens
◉ length
◉ LM probability
◉ number of occurrences of the target word within the target hypothesis
◉ average number of translations per source word in the sentence
◉ …
◉ percentage of unigrams in quartile 1 of frequency (lower frequency words)
◉ … percentage of unigrams in quartile n of frequency (higher frequency words)
◉ …
◉ percentage of trigrams in quartile 1 of frequency of source words
◉ … percentage of trigrams in quartile n of frequency of source words
◉ number of punctuation marks
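Several of these baseline features can be computed directly from the sentence pair. A simplified sketch (the real feature set also needs LM probabilities, IBM-1 tables, and corpus frequency quartiles, which are omitted here; the function and key names are mine):

```python
import string

# A few of the QuEst baseline black-box features, computed from the
# sentence pair alone.

def baseline_features(source: str, target: str) -> dict:
    src_tokens = source.split()
    trg_tokens = target.split()
    return {
        "src_num_tokens": len(src_tokens),
        "trg_num_tokens": len(trg_tokens),
        "src_avg_token_length": sum(map(len, src_tokens)) / len(src_tokens),
        # type/token ratio over the target hypothesis
        "trg_type_token_ratio": len(set(trg_tokens)) / len(trg_tokens),
        "src_num_punct": sum(ch in string.punctuation for ch in source),
        "trg_num_punct": sum(ch in string.punctuation for ch in target),
    }

feats = baseline_features("The car is driving.", "Автомобиль вождения.")
# feats["src_num_tokens"] == 4, feats["src_num_punct"] == 1
```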
bad input signals

Example: Russian typed in Latin transliteration.
input: vot tak narod ho4et napisat'
Возможно, вы имели в виду: вот так народ хочет написать ("Did you mean: this is how people want to write")

human        vot tak narod ho4et napisat'   vot tak narod ho4et napisat'
search       вот так народ хочет написать   That's how people want to write
translation  Вот так народ хочет написать.  So people want to write.
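One cheap bad-input signal for this example is a script check: Russian typed in Latin transliteration has almost no Cyrillic characters. A sketch under that assumption; the function name and threshold-free design are mine:

```python
# Bad-input signal sketch: fraction of alphabetic characters outside the
# expected script. Russian typed as "vot tak narod ho4et napisat'" is all
# Latin, which a ru->en engine cannot translate sensibly.

def script_mismatch(source: str) -> float:
    """Fraction of alphabetic characters that are not Cyrillic."""
    letters = [ch for ch in source if ch.isalpha()]
    if not letters:
        return 0.0
    cyrillic = [ch for ch in letters if "\u0400" <= ch <= "\u04FF"]
    return 1.0 - len(cyrillic) / len(letters)

bad = script_mismatch("vot tak narod ho4et napisat'")  # 1.0: all Latin
ok = script_mismatch("вот так народ хочет написать")   # 0.0: all Cyrillic
```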
bad output signals
ambiguity signals
translation signals
                 Google       Microsoft        Wiktionary                        ...
Merry Christmas  Krismasi!    Krismasi Njema!  heri ya Krismasi, Krismasi njema  ...
eat apples       kula mapera  kula apples      ∅                                 ...
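One signal this comparison suggests: "kula apples" leaves the English word "apples" untranslated, so it overlaps suspiciously with the source. A toy heuristic sketch; the engine outputs are copied from the slide, the scoring function is mine:

```python
# Translation-signal sketch: fraction of source words copied verbatim
# into the translation. For distant language pairs like en-sw, a high
# value suggests the engine failed to translate part of the sentence.

def source_overlap(source: str, translation: str) -> float:
    src = set(source.lower().split())
    trg = set(translation.lower().split())
    return len(src & trg) / len(src) if src else 0.0

google = source_overlap("eat apples", "kula mapera")     # 0.0: fully translated
microsoft = source_overlap("eat apples", "kula apples")  # 0.5: "apples" copied
```

Disagreement between engines, and against dictionary sources like Wiktionary, is itself another signal.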
lexical signals (Polish: sygnały leksykalne)
char signals (Polish: sygnały znaków)
syntactic signals
parse tree to sequence conversion
sequence to sequence learning
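"Parse tree to sequence conversion" can be illustrated by linearising a constituency tree into a bracketed token stream that a sequence-to-sequence model can consume. A minimal sketch; the tree and function are illustrative:

```python
# Linearise a (label, children...) tuple tree into a bracketed sequence,
# so syntactic signals can feed a standard seq2seq model.

def tree_to_sequence(tree) -> list:
    """Depth-first linearisation; string nodes are leaf words."""
    if isinstance(tree, str):
        return [tree]
    label, *children = tree
    seq = ["(" + label]
    for child in children:
        seq.extend(tree_to_sequence(child))
    seq.append(")")
    return seq

tree = ("S", ("NP", "The", "car"), ("VP", "is", "driving"))
seq = tree_to_sequence(tree)
# ['(S', '(NP', 'The', 'car', ')', '(VP', 'is', 'driving', ')', ')']
```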
cross-lingual signals
outside signals
context/customisation signals
Other signals?
50-99+% accuracy
Depends on the benchmark! ;-)
1000-10M rows
10-1000 features
Data augmentation?
Can we use parallel corpora?

source (Korean)        target (Basque)                          score
받아 들여지는 사실       Onartutako gertaerak                      1.0
조언 및 제안            Aholkuak eta iradokizunak                 1.0
향후 계획에 대해 물어    Etorkizuneko egitasmoei buruz galdetzea   1.0
...                    ...                                      ...

Every aligned pair from a parallel corpus is assumed to be a correct human translation, i.e. score 1.0.
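The augmentation step above can be sketched in a few lines. The pairs echo the slide's Korean-Basque example; the function name and row schema are mine:

```python
# Data augmentation from parallel corpora: every aligned pair enters the
# training set with score 1.0, since it is assumed to be a correct human
# translation.

def augment(pairs, lang_pair):
    return [
        {"pair": lang_pair, "source": src, "target": trg, "score": 1.0}
        for src, trg in pairs
    ]

corpus = [
    ("받아 들여지는 사실", "Onartutako gertaerak"),
    ("조언 및 제안", "Aholkuak eta iradokizunak"),
]
rows = augment(corpus, "ko-eu")
# every row gets score 1.0
```

This cheaply adds positive examples, though it provides no negative (low-score) rows, which still require hand scoring.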
What is our learning infra?
H2O.ai deeplearning
Do we need deep learning?
Why doesn’t deep learning work for translation?
Want to learn more?
The real experts
◉ Dr. Lucia Specia
◉ quest.dcs.shef.ac.uk
◉ statmt.org/wmt15/quality-estimation-task.html
ACL 2016 will be held in Berlin in August.
Reading