Seeing the Wood for the Trees in MT Evaluation: An LSP Success Story from RWS
TRANSCRIPT
John Tinsley, CEO and Co-founder
LocWorld, Silicon Valley, 16th October 2015
The Ensemble Architecture™
[Architecture diagram: input, optionally with client TM/terminology, passes through a patent input classifier and language- and domain-specific components built from training data (Chinese pre-ordering rules, a Spanish med-device entity recognizer, a Korean pharma tokenizer, Japanese script normalisation, German compounding rules) into multiple engines (Moses and RBMT); multi-output combination and statistical post-editing then produce the final output.]
Combining linguistics, statistics, and MT expertise
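Abstractly, this ensemble behaves like a configurable pipeline: classify the input, apply language- and domain-specific pre-processing, translate with several engines, combine the outputs, and post-edit. A minimal sketch of that shape in Python (every name and stand-in component below is an illustrative assumption, not Iconic's actual code):

```python
# Minimal sketch of a multi-engine "ensemble" MT pipeline.
# Every component here is an illustrative stand-in, not Iconic's code.
from typing import Callable, List

Step = Callable[[str], str]

def make_pipeline(preprocessors: List[Step],
                  engines: List[Step],
                  combine: Callable[[List[str]], str],
                  postedit: Step) -> Step:
    """Wire pre-processing, engines, output combination, and post-editing."""
    def translate(source: str) -> str:
        for pre in preprocessors:                      # e.g. pre-ordering rules,
            source = pre(source)                       # tokenisation, normalisation
        candidates = [run(source) for run in engines]  # e.g. Moses, RBMT
        return postedit(combine(candidates))           # multi-output combination
    return translate                                   # + statistical post-editing

# An input classifier routes each segment to the pipeline customised
# for its domain (stand-ins throughout).
def classify(source: str) -> str:
    return "patent"

pipelines = {
    "patent": make_pipeline(
        preprocessors=[str.strip],
        engines=[lambda s: s, lambda s: s.lower()],
        combine=lambda candidates: candidates[0],
        postedit=lambda s: s,
    ),
}

source = "An apparatus comprising a sensor unit"
print(pipelines[classify(source)](source))
```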
Wait! Let’s take a step back
Why? Improve translator productivity
What? How much faster does MT make them?
How? Measure gains in speed
When? Perpetually
Lots of different ways to do evaluation
– automatic scores
• BLEU, METEOR, GTM, TER (see the sketch below)
– fluency, adequacy, comparative ranking
– task-based evaluation
• error analysis, post-edit productivity
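As a concrete illustration of the automatic scores above, a minimal sketch using the open-source sacrebleu library (sacrebleu and the toy data are assumptions for illustration; it is one of several implementations of BLEU and TER, not necessarily the tool used in this project):

```python
# Sketch: corpus-level BLEU and TER with the open-source sacrebleu library.
# The two-segment corpus below is toy data for illustration.
from sacrebleu.metrics import BLEU, TER

hypotheses = ["the device comprises a sensor", "the method includes a step"]
references = ["the device comprises a sensor unit", "the method comprises a step"]

# Both metrics take the hypotheses plus a list of reference streams.
print(BLEU().corpus_score(hypotheses, [references]))  # higher is better
print(TER().corpus_score(hypotheses, [references]))   # lower is better
```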
Different metrics, different intelligence
– what does each type of metric tell us?
– which ones are usable at which stage of evaluation?
e.g. can we really use automatic scores to assess productivity?
e.g. does productivity delta really tell us how good the output is?
MT Evaluation – where do we start!?
Problem: Large Chinese-to-English patent translation project. Challenging content and language.
Question: What efficiencies, if any, can machine translation add to the workflow of RWS translators?
How we applied different types of MT evaluation at different stages in the process, at various go/no-go points, to help RWS assess whether MT is viable for this project.
Client Case Study – RWS
- UK-headquartered public company
- Founded 1958
- 9th largest LSP (CSA 2013 report)
- Leader in specialist IP translations
Step 1: Baseline and Customisation
Can we improve our baseline engines through customisation?
[Chart: BLEU and TER scores (scale 0 to 0.8) comparing the Iconic Baseline engine against the Iconic Customised engine.]
What next?
How good is the output relative to the task, i.e. post-editing?
- fluency/adequacy is not going to tell us
- let's start with segment-level TER
- Huge improvement
- Intuitively, the scores reflect the improvement well, but on their own they don't really tell us anything
- Let’s dig deeper
Translation Edit Rate: correlates well with practical evaluations
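For context, TER is defined as the minimum number of edits needed to turn the MT output into the reference, normalised by reference length (a standard definition, added here; it was not spelled out on the slide):

TER = (insertions + deletions + substitutions + shifts) / reference word count

So TER = 0 means the MT output already matches the reference, and lower scores mean less post-editing effort.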
If we look deeper, what can we learn?
INTELLIGENCE
• Proportion of full matches (i.e. big savings)
• Proportion of close matches (i.e. faster than fuzzy matches)
• Proportion of poor matches
ACTIONABLE INFORMATION
• Type of sentence with high/low matches
• Weaknesses and gaps
• Segments to compare and analyse in translation memory
Step 2: Segment-level automatic analysis
[Chart: distribution of segment-level TER scores, plotted by segment length.]
This represents a 24% potential productivity gain
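One way such a segment-level breakdown could be produced is sketched below, again with sacrebleu; the full/close/poor thresholds and the toy data are illustrative assumptions, not the values used in the study:

```python
# Sketch: bucket segment-level TER scores into full / close / poor matches.
# Thresholds (10 / 40) and the toy data are illustrative assumptions.
from collections import Counter
from sacrebleu.metrics import TER

ter = TER()

def bucket(score: float) -> str:
    if score <= 10:      # near-perfect output: big savings
        return "full match"
    if score <= 40:      # light post-editing: faster than fuzzy matches
        return "close match"
    return "poor match"  # heavy post-editing or retranslation

hypotheses = ["the device comprises a sensor",
              "a step of applying heat to it"]
references = ["the device comprises a sensor",
              "a heating step is then applied"]

counts = Counter()
for hyp, ref in zip(hypotheses, references):
    counts[bucket(ter.sentence_score(hyp, [ref]).score)] += 1

total = sum(counts.values())
for name in ("full match", "close match", "poor match"):
    print(f"{name}: {counts[name] / total:.0%}")
```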
Step 3: Productivity testing
With MT experience and previous MT integration, productivity testing can be run in the production environment. In this case, we used the TAUS Dynamic Quality Framework.
Beware the variables!
• Translators: different experience, speed, perceptions of MT
– 24 translators: senior, staff, and interns
• Test sets: not representative; particularly difficult
– 2 test sets, comprising 5 documents, and cross-fold validation
• Environment and task: inexperience and unfamiliarity
– Training materials, videos, and "dummy" segments
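For concreteness, a minimal sketch of how per-translator gains can be derived from timing logs (the record format and the numbers are hypothetical; DQF and similar tools export comparable data):

```python
# Sketch: per-translator productivity gain from timing logs.
# Records are hypothetical: (translator, condition, words, minutes), where
# "HT" = translate from scratch and "PE" = post-edit MT output.
from collections import defaultdict

log = [
    ("senior_1", "HT", 1200, 150), ("senior_1", "PE", 1300, 130),
    ("intern_1", "HT", 900, 160),  ("intern_1", "PE", 1100, 150),
]

totals = defaultdict(lambda: {"HT": [0, 0], "PE": [0, 0]})
for translator, cond, words, minutes in log:
    totals[translator][cond][0] += words
    totals[translator][cond][1] += minutes

for translator, conds in totals.items():
    ht_speed = conds["HT"][0] / conds["HT"][1]  # words per minute, scratch
    pe_speed = conds["PE"][0] / conds["PE"][1]  # words per minute, post-edit
    print(f"{translator}: {pe_speed / ht_speed - 1:+.0%} gain")
```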
Step 3: Productivity testing
Findings and Learnings
Overall average: 25% productivity gain
– What it tells us: correlates with TER
By Translator Profile: Experienced: 22%; Staff: 23%; Interns: 30%
– What it tells us: rollout with junior staff for more immediate impact on the bottom line?
By Test Set: Test set 1.1: 25%; Test set 1.2: 35%; Test set 2.1: 6%; Test set 2.2: 35%
– What it tells us: don't be over-concerned by outliers; use data to facilitate source content profiling?
Look out for anomalies
– segments with long timings (well below the average words/minute ratio)
– sentences that don't change much from MT to post-edit
– segments with unusually short timings
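A minimal sketch of how such anomalies might be flagged automatically (the thresholds and the segment record format are illustrative assumptions):

```python
# Sketch: flag anomalous segments in productivity-test logs.
# Thresholds and record format are illustrative assumptions.
def flag_anomalies(segments, mean_wpm):
    """segments: dicts with 'words', 'minutes', 'mt', and 'post_edit' keys."""
    for seg in segments:
        wpm = seg["words"] / seg["minutes"]
        if wpm < 0.25 * mean_wpm:
            yield seg, "unusually long timing (possible interruption)"
        elif wpm > 4 * mean_wpm:
            yield seg, "unusually short timing (possible skim/accept)"
        elif seg["mt"] == seg["post_edit"]:
            yield seg, "no change from MT to post-edit: verify quality"

segments = [
    {"words": 20, "minutes": 30.0, "mt": "a", "post_edit": "b"},
    {"words": 25, "minutes": 0.2,  "mt": "c", "post_edit": "d"},
    {"words": 18, "minutes": 2.0,  "mt": "e", "post_edit": "e"},
]
for seg, reason in flag_anomalies(segments, mean_wpm=10.0):
    print(reason, seg)
```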
In this case, the next step is production rollout to validate these findings in the actual translator workflow over an extended period.
Warnings, Tips, and Next Steps
Now would be the right time to do fluency/adequacy evaluation if you need to verify that post-editing is producing at least similar-quality output.
We need to marry data that we know from operations with data we produce during MT evaluations to create business intelligence
Let’s look at how we can find that out and what it means…
Making the business case for MT
KNOWNS
• Revenue from translation
• Costs (internal, outsourced)
• Variations of this information across content and languages
UNKNOWNS
• MT performance
• Cost of MT
• Variations of this information across content and languages
Calculating the ROI on MT
Parameters
– Per word rate (LSP): €0.10
– Vendor rate: €0.08
– Productivity gain: ?
– Project word count: 5,000,000
– MT cost (per word): ?

MT Weighted Word Count: ?

No Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: €400,000
– MT Cost: €0
– Gross Profit: €100,000
– Gross Profit Margin: 20.0%

With Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: ?
– MT Cost: ?
– Gross Profit: ?
– Gross Profit Margin: ?

Gross Profit Increase when using MT: ???%
**These numbers are for illustrative purposes only and not related to the case study
Calculating the ROI – plugging in the numbers
Parameters
– Per word rate (LSP): €0.10
– Vendor rate: €0.08
– Productivity gain: 25%
– Project word count: 5,000,000
– MT cost (per word): €0.008

MT Weighted Word Count: 3,750,000

No Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: €400,000
– MT Cost: €0
– Gross Profit: €100,000
– Gross Profit Margin: 20.0%

With Machine Translation
– LSP Revenue: €500,000
– Vendor Cost: €300,000
– MT Cost: €40,000
– Gross Profit: €160,000
– Gross Profit Margin: 32%

Gross Profit Increase when using MT: 60%
**These numbers are for illustrative purposes only and not related to the case study
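The arithmetic behind these illustrative figures can be reproduced in a few lines; a minimal sketch of the same calculation (values copied from the table above, not production pricing logic):

```python
# Sketch: reproduce the illustrative ROI calculation above.
rate_lsp = 0.10     # € per word charged by the LSP
rate_vendor = 0.08  # € per word paid to the vendor
gain = 0.25         # measured post-editing productivity gain
words = 5_000_000   # project word count
mt_rate = 0.008     # € per word for machine translation

# The productivity gain discounts the word count the vendor is paid for.
weighted_words = words * (1 - gain)           # 3,750,000

revenue = words * rate_lsp                    # €500,000 either way
profit_no_mt = revenue - words * rate_vendor  # €100,000 (20.0% margin)
profit_mt = revenue - weighted_words * rate_vendor - words * mt_rate  # €160,000

print(f"Margin with MT: {profit_mt / revenue:.0%}")                  # 32%
print(f"Gross profit increase: {profit_mt / profit_no_mt - 1:.0%}")  # 60%
```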