Seeing the Wood for the Trees in MT Evaluation: An LSP Success Story from RWS
TRANSCRIPT
John Tinsley, CEO and Co-founder
LocWorld, Silicon Valley, 16th October 2015
The Ensemble Architecture™
[Architecture diagram: input, optionally with client TM/terminology, passes through a patent input classifier and language- and domain-specific components built from training data (Chinese pre-ordering rules, a Spanish med-device entity recognizer, a Korean pharma tokenizer, Japanese script normalisation, German compounding rules) into multiple engines (Moses and RBMT); multi-output combination and statistical post-editing then produce the final output.]
Combining linguistics, statistics, and MT expertise
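Abstractly, this ensemble behaves like a configurable pipeline: classify the input, apply language- and domain-specific pre-processing, translate with several engines, combine the outputs, and post-edit. A minimal sketch of that shape in Python (every name and stand-in component below is an illustrative assumption, not Iconic's actual code):

```python
# Minimal sketch of a multi-engine "ensemble" MT pipeline.
# Every component here is an illustrative stand-in, not Iconic's code.
from typing import Callable, List

Step = Callable[[str], str]

def make_pipeline(preprocessors: List[Step],
                  engines: List[Step],
                  combine: Callable[[List[str]], str],
                  postedit: Step) -> Step:
    """Wire pre-processing, engines, output combination, and post-editing."""
    def translate(source: str) -> str:
        for pre in preprocessors:                      # e.g. pre-ordering rules,
            source = pre(source)                       # tokenisation, normalisation
        candidates = [run(source) for run in engines]  # e.g. Moses, RBMT
        return postedit(combine(candidates))           # multi-output combination
    return translate                                   # + statistical post-editing

# An input classifier routes each segment to the pipeline customised
# for its domain (stand-ins throughout).
def classify(source: str) -> str:
    return "patent"

pipelines = {
    "patent": make_pipeline(
        preprocessors=[str.strip],
        engines=[lambda s: s, lambda s: s.lower()],
        combine=lambda candidates: candidates[0],
        postedit=lambda s: s,
    ),
}

source = "An apparatus comprising a sensor unit"
print(pipelines[classify(source)](source))
```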
Wait! Let’s take a step back
Why? Improve translator productivity
What? How much faster does MT make them?
How? Measure gains in speed
When? Perpetually
Lots of different ways to do evaluation
– automatic scores
• BLEU, METEOR, GTM, TER (see the sketch below)
– fluency, adequacy, comparative ranking
– task-based evaluation
• error analysis, post-edit productivity
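As a concrete illustration of the automatic scores above, a minimal sketch using the open-source sacrebleu library (sacrebleu and the toy data are assumptions for illustration; it is one of several implementations of BLEU and TER, not necessarily the tool used in this project):

```python
# Sketch: corpus-level BLEU and TER with the open-source sacrebleu library.
# The two-segment corpus below is toy data for illustration.
from sacrebleu.metrics import BLEU, TER

hypotheses = ["the device comprises a sensor", "the method includes a step"]
references = ["the device comprises a sensor unit", "the method comprises a step"]

# Both metrics take the hypotheses plus a list of reference streams.
print(BLEU().corpus_score(hypotheses, [references]))  # higher is better
print(TER().corpus_score(hypotheses, [references]))   # lower is better
```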
Different metrics, different intelligence
– what does each type of metric tell us?
– which ones are usable at which stage of evaluation?
e.g. can we really use automatic scores to assess productivity?
e.g. does productivity delta really tell us how good the output is?
MT Evaluation – where do we start!?
Problem: Large Chinese-to-English patent translation project. Challenging content and language.
Question: What efficiencies, if any, can machine translation add to the workflow of RWS translators?
How we applied different types of MT evaluation at different stages in the process, at various go/no-go points, to help RWS assess whether MT is viable for this project.
Client Case Study – RWS
- UK-headquartered public company
- Founded 1958
- 9th largest LSP (CSA 2013 report)
- Leader in specialist IP translations
Step 1: Baseline and Customisation
Can we improve our baseline engines through customisation?
[Chart: BLEU and TER scores (scale 0 to 0.8) comparing the Iconic Baseline engine against the Iconic Customised engine.]
What next?
How good is the output relative to the task, i.e. post-editing?
- fluency/adequacy is not going to tell us
- let's start with segment-level TER
- Huge improvement
- Intuitively, the scores reflect the improvement well, but on their own they don't really tell us anything
- Let’s dig deeper
Translation Edit Rate: correlates well with practical evaluations
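For context, TER is defined as the minimum number of edits needed to turn the MT output into the reference, normalised by reference length (a standard definition, added here; it was not spelled out on the slide):

TER = (insertions + deletions + substitutions + shifts) / reference word count

So TER = 0 means the MT output already matches the reference, and lower scores mean less post-editing effort.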
If we look deeper, what can we learn?
INTELLIGENCE
• Proportion of full matches (i.e. big savings)
• Proportion of close matches (i.e. faster than fuzzy matches)
• Proportion of poor matches
ACTIONABLE INFORMATION
• Type of sentence with high/low matches
• Weaknesses and gaps
• Segments to compare and analyse in translation memory
Step 2: Segment-level automatic analysis
[Chart: distribution of segment-level TER scores, plotted by segment length.]
This represents a 24% potential productivity gain
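One way such a segment-level breakdown could be produced is sketched below, again with sacrebleu; the full/close/poor thresholds and the toy data are illustrative assumptions, not the values used in the study:

```python
# Sketch: bucket segment-level TER scores into full / close / poor matches.
# Thresholds (10 / 40) and the toy data are illustrative assumptions.
from collections import Counter
from sacrebleu.metrics import TER

ter = TER()

def bucket(score: float) -> str:
    if score <= 10:      # near-perfect output: big savings
        return "full match"
    if score <= 40:      # light post-editing: faster than fuzzy matches
        return "close match"
    return "poor match"  # heavy post-editing or retranslation

hypotheses = ["the device comprises a sensor",
              "a step of applying heat to it"]
references = ["the device comprises a sensor",
              "a heating step is then applied"]

counts = Counter()
for hyp, ref in zip(hypotheses, references):
    counts[bucket(ter.sentence_score(hyp, [ref]).score)] += 1

total = sum(counts.values())
for name in ("full match", "close match", "poor match"):
    print(f"{name}: {counts[name] / total:.0%}")
```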
Step 3: Productivity testing
With MT experience and previous MT integration, productivity testing can be run in the production environment. In this case, we used the TAUS Dynamic Quality Framework.
Beware the variables!
• Translators: different experience, speed, perceptions of MT
– 24 translators: senior, staff, and interns
• Test sets: not representative; particularly difficult
– 2 test sets, comprising 5 documents, and cross-fold validation
• Environment and task: inexperience and unfamiliarity
– Training materials, videos, and "dummy" segments
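For concreteness, a minimal sketch of how per-translator gains can be derived from timing logs (the record format and the numbers are hypothetical; DQF and similar tools export comparable data):

```python
# Sketch: per-translator productivity gain from timing logs.
# Records are hypothetical: (translator, condition, words, minutes), where
# "HT" = translate from scratch and "PE" = post-edit MT output.
from collections import defaultdict

log = [
    ("senior_1", "HT", 1200, 150), ("senior_1", "PE", 1300, 130),
    ("intern_1", "HT", 900, 160),  ("intern_1", "PE", 1100, 150),
]

totals = defaultdict(lambda: {"HT": [0, 0], "PE": [0, 0]})
for translator, cond, words, minutes in log:
    totals[translator][cond][0] += words
    totals[translator][cond][1] += minutes

for translator, conds in totals.items():
    ht_speed = conds["HT"][0] / conds["HT"][1]  # words per minute, scratch
    pe_speed = conds["PE"][0] / conds["PE"][1]  # words per minute, post-edit
    print(f"{translator}: {pe_speed / ht_speed - 1:+.0%} gain")
```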
Step 3: Productivity testing
Findings and Learnings
Overall average: 25% productivity gain
– What it tells us: correlates with TER
By Translator Profile: Experienced: 22%; Staff: 23%; Interns: 30%
– What it tells us: rollout with junior staff for more immediate impact on the bottom line?
By Test Set: Test set 1.1: 25%; Test set 1.2: 35%; Test set 2.1: 6%; Test set 2.2: 35%
– What it tells us: don't be over-concerned by outliers; use data to facilitate source content profiling?
Look out for anomalies
– segments with long timings (well below the average words/minute ratio)
– sentences that don't change much from MT to post-edit
– segments with unusually short timings
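A minimal sketch of how such anomalies might be flagged automatically (the thresholds and the segment record format are illustrative assumptions):

```python
# Sketch: flag anomalous segments in productivity-test logs.
# Thresholds and record format are illustrative assumptions.
def flag_anomalies(segments, mean_wpm):
    """segments: dicts with 'words', 'minutes', 'mt', and 'post_edit' keys."""
    for seg in segments:
        wpm = seg["words"] / seg["minutes"]
        if wpm < 0.25 * mean_wpm:
            yield seg, "unusually long timing (possible interruption)"
        elif wpm > 4 * mean_wpm:
            yield seg, "unusually short timing (possible skim/accept)"
        elif seg["mt"] == seg["post_edit"]:
            yield seg, "no change from MT to post-edit: verify quality"

segments = [
    {"words": 20, "minutes": 30.0, "mt": "a", "post_edit": "b"},
    {"words": 25, "minutes": 0.2,  "mt": "c", "post_edit": "d"},
    {"words": 18, "minutes": 2.0,  "mt": "e", "post_edit": "e"},
]
for seg, reason in flag_anomalies(segments, mean_wpm=10.0):
    print(reason, seg)
```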
In this case, the next step is production rollout to validate these findings in the actual translator workflow over an extended period.
Warnings, Tips, and Next Steps
Now would be the right time to do fluency/adequacy evaluation if you need to verify that post-editing is producing at least similar-quality output.
We need to marry data that we know from operations with data we produce during MT evaluations to create business intelligence
Let’s look at how we can find that out and what it means…
Making the business case for MT
KNOWNS
• Revenue from translation
• Costs (internal, outsourced)
• Variations of this information across content and languages
UNKNOWNS
• MT performance
• Cost of MT
• Variations of this information across content and languages
Calculating the ROI on MT
Parameters
– Per word rate (LSP): €0.10
– Vendor rate: €0.08
– Productivity gain: ?
– Project word count: 5,000,000
– MT cost (per word): ?

MT Weighted Word Count: ?

No Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: €400,000
– MT Cost: €0
– Gross Profit: €100,000
– Gross Profit Margin: 20.0%

With Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: ?
– MT Cost: ?
– Gross Profit: ?
– Gross Profit Margin: ?

Gross Profit Increase when using MT: ???%
**These numbers are for illustrative purposes only and not related to the case study
Calculating the ROI – plugging in the numbers
Parameters
– Per word rate (LSP): €0.10
– Vendor rate: €0.08
– Productivity gain: 25%
– Project word count: 5,000,000
– MT cost (per word): €0.008

MT Weighted Word Count: 3,750,000

No Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: €400,000
– MT Cost: €0
– Gross Profit: €100,000
– Gross Profit Margin: 20.0%

With Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: €300,000
– MT Cost: €40,000
– Gross Profit: €160,000
– Gross Profit Margin: 32%

Gross Profit Increase when using MT: 60%
**These numbers are for illustrative purposes only and not related to the case study
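The arithmetic behind these illustrative figures can be reproduced in a few lines; a minimal sketch of the same calculation (values copied from the table above, not production pricing logic):

```python
# Sketch: reproduce the illustrative ROI calculation above.
rate_lsp = 0.10     # € per word charged by the LSP
rate_vendor = 0.08  # € per word paid to the vendor
gain = 0.25         # measured post-editing productivity gain
words = 5_000_000   # project word count
mt_rate = 0.008     # € per word for machine translation

# The productivity gain discounts the word count the vendor is paid for.
weighted_words = words * (1 - gain)           # 3,750,000

revenue = words * rate_lsp                    # €500,000 either way
profit_no_mt = revenue - words * rate_vendor  # €100,000 (20.0% margin)
profit_mt = revenue - weighted_words * rate_vendor - words * mt_rate  # €160,000

print(f"Margin with MT: {profit_mt / revenue:.0%}")                  # 32%
print(f"Gross profit increase: {profit_mt / profit_no_mt - 1:.0%}")  # 60%
```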