ICIC 2014: High Volume, High Quality Patent Translation across Multiple Domains

Dion Wiggins, Chief Executive Officer, [email protected]
Copyright © 2014, Asia Online Pte Ltd

Upload: dr-haxel-congress-and-event-management-gmbh

Post on 11-Jun-2015


DESCRIPTION

Due to their complexity and technical variance, patents are some of the most difficult documents to translate, whether by machines or by humans. Language Studio™ is used by several leading patent providers to translate in excess of 2 billion words of patent content every day, in 20 different domains and writing styles, from languages such as Chinese, Japanese, Korean, German and others. Extensive research and experimentation over the last 8 years has gone into developing unique approaches to patent translation processing. These include the automated detection of IPC class groupings and of document sections such as Title, Claim, Abstract and Description, each of which has its own writing style and domain preferences. This presentation will explore the complexities of patent translation and present some novel approaches to addressing the challenges of this unique domain.

TRANSCRIPT

Page 1: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Copyright © 2014, Asia Online Pte Ltd

High Volume, High Quality Patent Translation across Multiple Domains

Dion Wiggins, Chief Executive Officer, [email protected]

Page 2: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Language Studio™ is a language processing platform, not just a translation tool

• We currently support 534 language pairs

• Our very first customer was LexisNexis Univentio in 2008
  – Our first commercial engine translated Japanese patents into English

• Not all customers are in the patent space, but patents are the most complex content that we have ever encountered

Page 3: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Collectively our customers are translating more than 2 billion words per day

• One single customer is translating more than 1 billion words a day of patent content

• Our highest rate of throughput required by a customer (government) to date is 600 million words per minute
  – Yes, we can support this volume if you can provide the hardware: approx. 25K CPU cores
  – Currently being designed and architected ahead of deployment
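As a rough sanity check, the per-core rate implied by these figures can be worked out directly. The sketch below uses only the numbers quoted on the slide; the variable names are illustrative, and the daily figure is a lower bound ("more than 2 billion").

```python
# Figures quoted on the slide.
words_per_minute = 600_000_000   # peak throughput requested by one customer
cpu_cores = 25_000               # approximate hardware estimate ("approx. 25K")

words_per_core_per_minute = words_per_minute / cpu_cores
words_per_core_per_second = words_per_core_per_minute / 60

# Average rate implied by "more than 2 billion words per day" (a lower bound).
words_per_day = 2_000_000_000
avg_words_per_second = words_per_day / (24 * 60 * 60)

print(f"{words_per_core_per_minute:,.0f} words per core per minute")
print(f"{words_per_core_per_second:,.0f} words per core per second")
print(f"{avg_words_per_second:,.0f} words per second on average")
```

In other words, the requested peak works out to about 400 words per second on each core, while the current collective load averages a little over 23,000 words per second in total.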

Page 4: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Equivalent of 20 million four-drawer filing cabinets filled with text.

• The volume of data is expected to increase by 20 times by 2020.

Page 6: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

A method of distilling a polymerizable vinyl compound selected from the group consisting of acrolein, methacrolein, acrylic acid, methacrylic acid, hydroxyethyl acrylate, hydroxyethyl methacrylate, hydroxypropyl acrylate, hydroxypropyl methacrylate, glycidyl acrylate and glycidyl methacrylate, the method comprising distilling the polymerizable vinyl compound in the presence of a polymerization inhibitor using a distillation tower having perforated trays without downcomers and wherein the temperature of the inner wall of the tower is maintained at a temperature sufficient to prevent the condensation of the vapor being distilled, whereby the polymerizable vinyl compound is distilled without the formation of polymer.

Page 7: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Translate 13 million historical patents from Japanese to English and also translate all new Japanese patents going forward. Follow this with the same task in many other languages.

It would take a human translator 152,257 years to translate all existing Japanese patents into English and would cost US$ 40 billion.

Page 8: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Quality requires an understanding of the data

There is no exception to this rule

Page 9: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Structured XML
  – Header
    • Language
    • IPC
    • …
  – Sections
    • Title
    • Claim
    • Abstract
    • Description
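To make the structure above concrete, here is a minimal sketch of reading the header fields and the four sections from a patent record with Python's standard `xml.etree.ElementTree`. The element names (`patent`, `header`, `language`, `ipc` and so on) and the sample record are assumptions made for illustration; each patent office uses its own schema.

```python
import xml.etree.ElementTree as ET

# Made-up sample record; real office formats differ.
SAMPLE = """<patent>
  <header>
    <language>ja</language>
    <ipc>C07C 51/44</ipc>
  </header>
  <title>Method of distilling a polymerizable vinyl compound</title>
  <abstract>A vinyl compound is distilled without polymer formation.</abstract>
  <claim>A method of distilling a polymerizable vinyl compound.</claim>
  <description>The tower wall is kept warm to prevent condensation.</description>
</patent>"""

SECTION_TAGS = ("title", "claim", "abstract", "description")

def read_patent(xml_text):
    """Return (header dict, list of (section, text) pairs in document order)."""
    root = ET.fromstring(xml_text)
    header = {
        "language": root.findtext("header/language"),
        "ipc": root.findtext("header/ipc"),
    }
    segments = [(el.tag, el.text.strip()) for el in root if el.tag in SECTION_TAGS]
    return header, segments

header, segments = read_patent(SAMPLE)
print(header)
for section, text in segments:
    print(section, "->", text)
```

Keeping the section name next to each piece of text is what later allows a different writing style to be applied per section.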

Page 10: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Writing Style Changes
  – Between domains of knowledge
  – Between sections of the patent document
• Multiple Classes Of Data
  – Formulas
    • Detection
    • Transformation
    • Protection
  – Reference Numbers
    • Break fluency of translation
    • Not part of the text; metadata
  – Numbers + Units
  – Dates
  – Patent Numbers

Page 11: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Content Formatting
  – Broken sentences
  – Wrong encoding
  – OCR
• Different data formats
  – USPTO, EPO, WP and many others have their own formats
  – Changes in format at different offices
• Quality of Learning Data
  – Spelling errors
  – Poor quality human translations
  – Words glued together
  – OCR
    • We were told the data provided wasn’t OCRed, but…

Page 12: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Gaps in Data
  – Many terms are not in the learning data
• Tricks By Authors
  – Changing writing mechanism
    • e.g. switching to Katakana when there is a perfectly good Kanji term
• Bilingual Data
  – Matching patent documents between various patent office formats
  – Matching sentences
  – Removing poor quality translations
  – Fixing “broken data”

Page 13: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Sentence Length
  – The longest patent sentence we have seen so far is 4,500 words
• Throughput Requirements
  – Front File
    • Translated and published within X hours of being published by Y patent office
  – Back File
    • All patents going back to X within 3 months
  – This is millions of documents

Page 15: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Unique Customization and Quality Improvement Plan
• Clean Data Strategy
• One Engine, Multiple Writing Styles
  – Writing Styles By
    • Content Domain
    • Document Section
  – Sentence by sentence domain switching
• Hybrid
  – Rules + Syntax + Statistics
• Multiple Translations
  – Only the best will do
• Ongoing Improvement
  – Driven by Quality and Measurement

Page 17: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

[Diagram: engine customization process. Stage labels: Original Translation Sources, Language Pair Foundation Data, Domain Foundation Data, Data Collections, Data Cleaning, Data Preparation, Training, Diagnostics and Fine Tuning, Translate, Quality Assurance]

Page 18: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

[Diagram: Asia Online Foundation Data (Language Pair Foundation, Domain Foundation) + Sub-Domain Specific Data (Client Data, Manufactured Data) = Custom Engine]

Page 19: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Definition
  – Domain
  – Target Audience
  – Preferred Writing Style
  – Glossaries, Non-Translatable Terms, Preferred Capitalization
  – Special Formatting Requirements
  – Quality Requirements
• Data Gathering
  – Source data in domain
  – Bilingual data to support domain
  – Monolingual data to support domain
• Data Analysis
  – Gap analysis
  – High frequency terms
  – Term extraction
• Data Generation
  – Supporting grammar structures
  – Source Data Analysis
• Cleaning of Data
• Tuning and Test Set Preparation
• Diagnostic Engine
  – Fine tuning

Provided by client and gathered from third parties.

Page 20: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Data Preparation
  – Language ID
  – Encoding ID
  – Class Definition
  – Rule Definition
  – Writing Style Definition
  – Data Alignment
  – Data Cleaning & Repair
  – Gap Analysis
  – Word segmentation
  – De-compounding
  – Data Manufacturing
  – Spelling Correction
  – Domain detection
  – Syntax parsing
  – Reordering rules
  – Data structuring rules
  – Language Normalization
  – Term Normalization

Page 21: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Engine Training
  – 5 major categories
    • Leverage IPC
    • Override option for user to bypass IPC logic
  – 4 writing styles
    • Title, Claim, Abstract, Description
  – 20 different sub-engines
    • 5 categories x 4 styles
  – Tuning/testing data for each of the 20 sub-engines
  – Integration of 20 sub-engines into a single engine

Page 22: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Runtime Translation
  – Pre-Translation Corrections
  – Domain detection
  – Syntax parsing
  – Reordering rules
  – Data structuring rules
  – Statistical translation
  – Multi-candidate translations
  – Class extraction and processing

Page 24: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• There is no magic in MT; human effort is required.

• The quality of the output and its suitability for purpose are in direct proportion to the amount of human effort.

• Without human direction, MT will cost more in the long term and is more likely to fail.

Page 25: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Source
  – The entire body of data in the back file
• Target
  – Every USPTO patent published from 1976 to the present
• Bilingual Data
  – USPTO, EPO, etc. matching documents

Page 26: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• This is the actual format from one customer

Page 29: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Dirty Data SMT Model

• Data
  – Gathered from as many sources as possible.
  – Domain of knowledge does not matter.
  – Data quality is not important.
  – Data quantity is important.
• Theory
  – Good data will be more statistically relevant.

Clean Data SMT Model

• Data
  – Gathered from a small number of trusted quality sources.
  – Domain of knowledge must match target.
  – Data quality is very important.
  – Data quantity is less important.
• Theory
  – Bad or undesirable patterns cannot be learned if they don’t exist in the data.

Page 30: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

English Source | Human Translation | Google Translation | Google Context
I went to the bank | Fui al banco | Fui al banco | Bank as in finance
I went to the bank to deposit money | Fui al banco para depositar dinero | Fui al banco a depositar el dinero | Bank as in finance
I went to the bank of the turn in my car | Fui en coche a la inclinación de la vuelta | Fui a la orilla de la vuelta en mi coche | Bank as in river bank
I put my car into the bank of the turn | Puse mi coche en la inclinación de la vuelta. | Pongo mi coche en el banco de la vuelta | Bank as in finance
I swam to the bank of the river | Nadé en la orilla del río | Nadé hasta la orilla del río | Bank as in river bank
I banked my money | Deposité mi dinero | Yo depositado mi dinero | Banked as in finance
I banked my car into the turn | Incliné mi coche en la vuelta | Yo depositado mi coche en la vuelta | Banked as in finance
I banked my plane into a steep dive | Incliné mi avión en para una zambullida. | Yo depositado en mi avión en picada | Banked as in finance

Issue: The above examples show that Google is biased towards the banking and finance domain.

Cause: There is much more multilingual banking and finance data available to learn from than there is aeronautical or water sports data.

Page 31: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

[Chart: initial customization improvement on a Dirty Data SMT Baseline compared with a Language Studio™ Clean Data SMT Foundation. Labels: Client Data, Manufactured Data, Initial Customization, Improvement, “20% Required for Noticeable Improvement”, “< 0.1%”]

Page 33: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Language Studio™ provides tools and processes for normalization of terminology

• Benefits include cost reductions, faster deliverables, higher customer satisfaction and happier post editors

Page 34: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Translation quality can be greatly improved by performing 3 similar but different cross-references of data.

1. All Source Data to be Translated against Bilingual Data

Goal: Identify words in the source data to be translated that are not in the bilingual data.

Benefit: Ensures all words in the data to be translated are known and will be translated correctly.

Action: Human translate or locate word lists from industry sources and directories and add to bilingual data.

2. Monolingual Target Language Data against Bilingual Data

Goal: Identify words in the monolingual target language data that are not in the bilingual data.

Benefit: Ensures all words in the monolingual target language data are known, ensuring that data to be translated in future but not yet known will be translated better.

Action: Human translate or locate word lists from industry sources and directories and add to bilingual data.

3. Bilingual Data against Monolingual Target Language Data

Goal: Identify words in the bilingual data that are missing or low frequency in the monolingual target language data.

Benefit: Ensures that there is enough grammatical representation of the words, phrases and terminology in the monolingual target language data. This delivers greater fluency in translation output.

Action: Generate monolingual target language data using Language Studio™ Pro Crawl and Generate Tools and add to monolingual data.
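The three cross-references on this slide amount to vocabulary comparisons, which can be sketched with plain set and counter operations. This is only an illustration under simplifying assumptions: words are split with a basic regular expression, the `min_count` threshold for "low frequency" is arbitrary, and the function name and toy data are not from the presentation.

```python
from collections import Counter
import re

def tokens(sentences):
    """Lower-cased words from an iterable of sentences."""
    return [w for s in sentences for w in re.findall(r"\w+", s.lower())]

def gap_analysis(to_translate, bilingual_pairs, monolingual_target, min_count=2):
    bi_source = set(tokens(src for src, _ in bilingual_pairs))
    bi_target = set(tokens(tgt for _, tgt in bilingual_pairs))
    mono_counts = Counter(tokens(monolingual_target))
    return {
        # 1. Source words to be translated that the bilingual data has never seen.
        "source_not_in_bilingual": set(tokens(to_translate)) - bi_source,
        # 2. Target-language words with no counterpart in the bilingual data.
        "monolingual_not_in_bilingual": set(mono_counts) - bi_target,
        # 3. Bilingual target words that are missing or rare in the monolingual data.
        "bilingual_rare_in_monolingual": {w for w in bi_target if mono_counts[w] < min_count},
    }

gaps = gap_analysis(
    to_translate=["der Kraftstoff wird destilliert"],
    bilingual_pairs=[("der Kraftstoff", "the fuel"),
                     ("wird geprüft", "is tested"),
                     ("die Kolonne", "the column")],
    monolingual_target=["the fuel is distilled", "the fuel is tested", "the tower is tested"],
)
for name, words in gaps.items():
    print(name, sorted(words))
```

The first two result sets are candidates for human term translation; the third points at terms for which more monolingual text should be gathered or generated.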

Page 35: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Gruppenmasterdatenverarbeitungsvorrichtungssynchronisationsinformation

Leistungswirkungsgradindexmarkierungsberechnungseinrichtung

Schwenkmotorbetriebsdrehmomentbegrenzungswertberechnungsschritt

Differenzialmechanismusumschaltbedingungsänderungseinrichtung

Kraftstoffverbrauchsratenprioritätsmodusauswahlschalter

Reproduktionsunmöglichkeitsgegenmaßnahmeneinrichtung

Telefonbuchdatenübertragungsprotokollverbindungsabschnitts

Leistungswirkungsgradindexmarkierungsberechnungseinrichtung

Bezugspunktsolldrehungsgeschwindigkeitsfestlegungsabschnitt

Höhenstandsaufnahmedifferenzdrucksondenresonanzverstimmung

Maschinenrotationspumpenkapazitätsbefehlwandlungsabschnitt

Brennkraftmaschinenausgangsdrehmomenterfassungseinrichtung

Telefonbuchdatenübertragungsprotokollverbindungsabschnitt

übermaßwankwinkelauftrittstendenzbeurteilungseinrichtung

Unterstützungsdrehmomentbegrenzungswertberechnungsschritt

Personenwahrscheinlichkeitsberechnungsverarbeitungsroutine

Positionsaktualisierungsinformationsübertragungszeitpunkt

Automatikgetriebehydraulikfluidtemperaturerfassungseinheit

Leistungswirkungsgradindexmarkierungsberechungseinrichtung

Octadecylaminodimethyltrimethoxysilylpropylammoniumchlorid

Katalysatorverschlechterungsbeurteilungseinrichtung

Kraftstoffverbrauchsprioritätsmodusauswahlschalter
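Compounds like these are rarely present in training data as whole words, which is why de-compounding appears in the data preparation list. The sketch below shows one simple way to split them: a recursive, longest-prefix-first dictionary lookup that allows an optional linking "s" between parts. It is a toy illustration, not the Language Studio™ method; the tiny lexicon and the single linking element are assumptions.

```python
# Tiny illustrative lexicon; a real system would use a large vocabulary.
LEXICON = {
    "kraftstoff", "verbrauch", "priorität", "modus", "auswahl", "schalter",
    "katalysator", "verschlechterung", "beurteilung", "einrichtung",
}
LINKERS = ("", "s")   # assumption: only the linking "s" (Fugen-s) is handled

def split_compound(word, lexicon=LEXICON):
    """Return lexicon parts that cover `word`, or None if no full split exists."""
    word = word.lower()
    if not word:
        return []
    # Try longer prefixes first so that long known words are preferred.
    for end in range(len(word), 0, -1):
        head = word[:end]
        if head not in lexicon:
            continue
        for linker in LINKERS:
            if word[end:].startswith(linker):
                rest = split_compound(word[end + len(linker):], lexicon)
                if rest is not None:
                    return [head] + rest
    return None

print(split_compound("Kraftstoffverbrauchsprioritätsmodusauswahlschalter"))
print(split_compound("Katalysatorverschlechterungsbeurteilungseinrichtung"))
```

Words that cannot be fully covered by the lexicon return `None`, which makes them natural candidates for the gap analysis described earlier.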

Page 37: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Generic MT from Google, Bing, etc. offers unknown productivity gains and sometimes productivity loss due to lack of control.

• Competitors offer < 20-40% productivity gains due to domain centric and “dirty data SMT” customization model.

• Language Studio™:
  – Targets of 150-300%+ productivity gains with granular sub-domain “clean data SMT” approach.
  – Provides complete control of writing style and terminology, and is mapped to the target audience, reducing editing effort.

[Diagram: customization hierarchy, from least to most specific: Language Pair, Top-Level Domain, then Engines/Sub-Domains by Client, Product and Target Audience / Purpose. Example: EN-ES; Automotive; clients Honda and Toyota; products Cars and Motorbikes; purposes Marketing, Service Reports, User Manuals and Engineering Service Manuals.]

[Customization Level and Typical Productivity Gain: Generic (Google/Bing Quality Level): ????; Domain (Typical Competitor Quality Level): < 20-40%; Engines/Sub-Domains levels: 50%+, 90%+, 150-300%+]

Page 38: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Translated text can be stylized based on the style of the Monolingual data.

[Diagram: Bilingual Data (millions of ES-EN sentence pairs, e.g. newspaper articles) supplies the possible vocabulary. Monolingual Data supplies writing style and grammar. With business news as monolingual data (The Economist, New York Times, Forbes), the EN output is text written in the style of business news. With children’s books as monolingual data (Harry Potter, Rupert the Bear, Famous Five), the EN output is text written in the style of children’s books.]

Page 39: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Spanish Original Before Translation:

Se necesitó una gran maniobra política muy prudente a fin de facilitar una cita de los dos enemigos históricos.

Business News After Translation:

Significant amounts of cautious political maneuvering were required in order to facilitate a rendezvous between the two bitter historical opponents.

Children’s Books After Translation:

A lot of care was taken to not upset others when organizing the meeting between the two long time enemies.

Page 40: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• 5 different main categories
  – Tests were performed on more granular categories, but they did not have much impact for the effort
  – Categories automatically detected using the IPC data
    • IPCs within various ranges are mapped into 1 of 5 categories

• 4 writing styles determined by the XML identifiers for the Title, Claims, Abstract and Description sections.

• Language Studio is configured to recognize a sentence header and change style for every sentence based on the header.

• This permits 20 writing styles within a single engine.
  – Changes the use of bilingual and monolingual data as required per style
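The routing described above can be sketched as a lookup from IPC code to category plus a per-sentence header naming the category and section. The slides do not say which IPC ranges map to which category or what the sentence header looks like, so the five category names, the grouping by IPC section letter (A to H) and the `[category/section]` header format below are all hypothetical.

```python
from itertools import product

# Assumption: the eight IPC sections (A-H) grouped into five made-up categories.
IPC_SECTION_TO_CATEGORY = {
    "A": "life_sciences",
    "B": "mechanical", "D": "mechanical", "E": "mechanical", "F": "mechanical",
    "C": "chemistry",
    "G": "physics",
    "H": "electrical",
}
STYLES = ("title", "claim", "abstract", "description")

def category_for(ipc_code, override=None):
    """Pick a category from the IPC code unless the user overrides it."""
    if override is not None:
        return override
    return IPC_SECTION_TO_CATEGORY[ipc_code.strip()[0].upper()]

def tag_sentence(sentence, ipc_code, section, override=None):
    """Prefix a sentence with a hypothetical header naming its category and style."""
    if section not in STYLES:
        raise ValueError(f"unknown section: {section}")
    return f"[{category_for(ipc_code, override)}/{section}] {sentence}"

# Every category/style combination: 5 x 4 = 20 writing styles.
SUB_ENGINES = list(product(sorted(set(IPC_SECTION_TO_CATEGORY.values())), STYLES))

print(len(SUB_ENGINES))
print(tag_sentence("A method of distilling a vinyl compound.", "C07C 51/44", "claim"))
```

The `override` argument mirrors the option mentioned earlier for a user to bypass the IPC logic.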

Page 42: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

Hybrid Rules and SMT Engine Model (EN to ES)

Pre-Processing Rules → Statistical Machine Translation → Post-Processing Rules

Pre-Processing Rules:
• Sentence Segmentation
• Word Segmentation
• Phrase Reordering
• Dates and Numbers
• Patterns, Formulas etc.
• Pre-Normalization
• Spell Checking
• Custom Runtime Glossary
• Pre-Formatting

Post-Processing Rules:
• Capitalization
• Post-Formatting
• Grammar Checking
• Post-Normalization
• XML Tag Reinsertion
• Currency Conversion
• Cross Referencing
• Other custom post processing

Hybrid Rules and Corrective Statistical Engine Model (EN to ES)

Translation Rules → Statistical Correction of Rules Errors
• Statistical Smoothing
• “Automated Post Editing”

This is more of a Band-Aid approach as the core MT is still a traditional Rules Based MT Engine.

Page 43: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Problem
  – Reference numbers break translation fluency
• Solution
  – Use JavaScript rules
  – Remove from translation, recording its original position
  – Track the movement position of the word associated with the reference number and reinsert after translation

Example:
However, malware on electronic device 103 must still make requests of resource 106 if it is to carry out malicious activities.

Example translation with word position tracking:
Apartments are in very good condition, well equipped and furnished to a very good standard.
los apartamentos están en |0-2,0, 0=0 0=1 1=2 2=3 | muy buenas condiciones |3-5,0, 0=0 1=1 2=2 | , |6-6,0, | bien equipados y amueblados |7-10,0, 0=0 1=1 2=2 3=3 | a un nivel muy bueno |11-15,0, 0=0 1=1 2=3 3=4 4=2 | . |16-16,0, |
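The remove, track and reinsert idea can be illustrated in a few lines. The slides say the production rules are written in JavaScript; the sketch below uses Python purely for illustration. The rule that a reference numeral is any 2 to 4 digit token, the toy Spanish output and the hand-written alignment are all assumptions, since a real engine would supply the word alignment itself.

```python
import re

REF_NUM = re.compile(r"^\d{2,4}$")   # assumption: reference numerals are 2-4 digit tokens

def strip_reference_numbers(tokens):
    """Remove reference numerals; return (clean_tokens, {anchor_index: numeral})."""
    clean, anchors = [], {}
    for tok in tokens:
        if REF_NUM.match(tok) and clean:
            anchors[len(clean) - 1] = tok   # numeral belongs to the preceding word
        else:
            clean.append(tok)
    return clean, anchors

def reinsert_reference_numbers(target_tokens, anchors, alignment):
    """`alignment` maps a source token index to the target index it moved to."""
    out = list(target_tokens)
    moved = sorted(((alignment[src], num) for src, num in anchors.items()), reverse=True)
    # Insert from right to left so earlier insert positions stay valid.
    for tgt_idx, numeral in moved:
        out.insert(tgt_idx + 1, numeral)
    return out

source = "the device 103 requests resource 106".split()
clean, anchors = strip_reference_numbers(source)

# Toy translation and alignment, written by hand for this example.
target = "el dispositivo solicita el recurso".split()
alignment = {0: 0, 1: 1, 2: 2, 3: 4}

print(" ".join(clean))
print(" ".join(reinsert_reference_numbers(target, anchors, alignment)))
```

Because the numerals are absent during translation, the engine sees a fluent sentence, and the alignment makes sure each numeral follows its word even if the word order changes.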

Page 44: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Problem
  – A data element that has an infinite number of possible values, or is highly variable, is something that statistics will not handle well
• Solution
  – Use JavaScript rules
  – Associate the data element with the class and store data on a Session object
  – Substitute the data element with the class identifier
  – Translate with the class; all data of the class will be treated the same
  – After translation, merge the data element back into the class using word tracking information

Before class substitution:
The above-identified U.S. patent application Ser. No. 13/155,881, filed Jun. 8, 2011 provides further details of searching by image.

After class substitution:
The above-identified @PATENTNOPREFIX@ @PATENTNO@, filed @DATE@ provides further details of searching by image.
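A minimal sketch of this substitute-and-restore cycle, written in Python rather than the JavaScript rules mentioned on the slide, might look as follows. The three regular expressions are assumptions tuned to this one example, and a plain dictionary stands in for the Session object. Restoring values in order per class is a simplification; the slide describes using word tracking information for the merge.

```python
import re

# Hypothetical patterns, written only to cover the example sentence.
CLASS_PATTERNS = [
    ("PATENTNOPREFIX", re.compile(r"U\.S\. patent application Ser\. No\.")),
    ("PATENTNO", re.compile(r"\b\d{2}/\d{3},\d{3}\b")),
    ("DATE", re.compile(
        r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\.? \d{1,2}, \d{4}\b")),
]

def protect(text):
    """Replace class instances with @CLASS@ identifiers; return (text, session)."""
    session = {}   # stands in for the "Session object" on the slide
    for name, pattern in CLASS_PATTERNS:
        def store(match, name=name):
            session.setdefault(name, []).append(match.group(0))
            return f"@{name}@"
        text = pattern.sub(store, text)
    return text, session

def restore(text, session):
    """Put stored values back, in order, wherever their class identifier appears."""
    queues = {name: list(values) for name, values in session.items()}
    return re.sub(r"@([A-Z]+)@", lambda m: queues[m.group(1)].pop(0), text)

source = ("The above-identified U.S. patent application Ser. No. 13/155,881, "
          "filed Jun. 8, 2011 provides further details of searching by image.")
protected, session = protect(source)
print(protected)
print(restore(protected, session))
```

The engine then only ever needs to learn how `@DATE@` or `@PATENTNO@` behaves in a sentence, not every individual date or number.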

Page 49: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Problem
  – Sometimes it is not possible to predict the best approach to deliver the best quality
• Solution
  – Perform multiple approaches and score them

• Language Studio supports multiple ordering and restructuring formats for a single segment of data.

• Each can be evaluated independently using a number of scoring metrics, and the best quality translation result is returned
  – Scores for Segment Level Confidence, Language Model, Source Matching, TM Matching, Terminology Confidence
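Selecting among several candidate translations can be sketched as a weighted combination of the metrics named above. How Language Studio actually combines its scores is not stated in the slides; the weights, the 0 to 1 score range and the sample values below are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    text: str
    scores: dict   # metric name -> value, assumed to lie in [0, 1]

# Hypothetical weights; a real system would tune these.
WEIGHTS = {
    "segment_confidence": 0.3,
    "language_model": 0.3,
    "source_matching": 0.2,
    "tm_matching": 0.1,
    "terminology_confidence": 0.1,
}

def combined_score(candidate, weights=WEIGHTS):
    """Weighted sum of the candidate's metric scores (missing metrics count as 0)."""
    return sum(w * candidate.scores.get(metric, 0.0) for metric, w in weights.items())

def best_candidate(candidates, weights=WEIGHTS):
    return max(candidates, key=lambda c: combined_score(c, weights))

candidates = [
    Candidate("translation from reordering A", {
        "segment_confidence": 0.8, "language_model": 0.6, "source_matching": 0.7,
        "tm_matching": 0.5, "terminology_confidence": 0.9}),
    Candidate("translation from reordering B", {
        "segment_confidence": 0.7, "language_model": 0.9, "source_matching": 0.8,
        "tm_matching": 0.5, "terminology_confidence": 0.9}),
]
best = best_candidate(candidates)
print(best.text, round(combined_score(best), 2))
```

A weighted sum is only one possible design; the point is that each candidate is scored independently and the best one is returned.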

Page 51: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

1. Customize: Create a new custom engine using foundation data and your own language assets.

2. Measure: Measure the quality of the engine for rating and future improvement comparisons.

3. Improve: Provide corrective feedback, removing potential for translation errors.

4. Manage: Manage translation projects while generating corrective data for quality improvement.

Page 52: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Exception handling
  – Long sentences
  – Bad sentences
  – Bugbears
• New Data
  – Integrate quickly as it is produced by various patent offices
  – Data produced regularly
• Hire Specialists
  – People who understand the engine and know how to refine it, to work on data and rules
• Outsource Term Translation
  – Find a specialist who can translate terms from Gap Analysis

Page 53: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Coined by Laura Rossi from LexisNexis
  – A nasty or bad word that should never be in the translation output
• Previous solution
  – Find in the phrase table data
    • Remove
    • Re-binarize
  – Find in the training data
    • Remove
  – Very time consuming
• Language Studio Solution
  – Bad word list
  – Can be updated any time
  – Translation engine decoder will ignore any data that has a bad word in it
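The contrast with the previous solution is that filtering happens at decode time, so the list can change without touching the phrase table or retraining. A toy sketch of that idea follows; the phrase table layout and the choice to check only the target side are assumptions, and the list entry is a harmless placeholder.

```python
BAD_WORDS = {"darn"}   # placeholder entry; the list can be edited at any time

def allowed(phrase_pair, bad_words):
    """True if the target side contains no word from the bad word list."""
    _, target = phrase_pair
    return not any(word in bad_words for word in target.lower().split())

def usable_entries(phrase_table, bad_words):
    """Entries the decoder may use; anything containing a bad word is skipped."""
    return [pair for pair in phrase_table if allowed(pair, bad_words)]

# Toy phrase table of (source phrase, target phrase) pairs.
phrase_table = [
    ("maldito tornillo", "darn screw"),
    ("tornillo", "screw"),
    ("tornillo defectuoso", "defective screw"),
]

print(usable_entries(phrase_table, BAD_WORDS))
print(usable_entries(phrase_table, BAD_WORDS | {"defective"}))
```

The second call shows the effect of updating the list: the underlying data is unchanged, yet the newly listed word can no longer reach the output.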

Page 54: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Training data can often have gaps in coverage and an excess of data in other areas.
• Gaps in coverage reduce translation quality.
• Gaps can quickly be filled via post editing the machine translated output and submitting the data back to the system for further learning.
• Many gaps can be filled with monolingual data only.
• Further gaps can be identified and resolved by analyzing the text that is to be translated for high frequency terms and unknown words.
• In some cases incorrect data may be statistically more relevant. Post editing will raise the relevance of the correct grammar.

[Chart: Example of Training Data. Axis: Data Volume. Labels: Sufficient Data Threshold, Data Shortfall, Gaps in Topic Coverage, Post Edited Feedback and Generated Data to Fill Gaps]

More initial data provided for training results in greater vocabulary and grammatical coverage above the Sufficient Data Threshold and less post editing feedback required.

Page 55: ICIC 2014 High volume, High Quality Patent Translation across Multiple Domains

• Document and Proximity Translations
  – All existing translation platforms translate at a sentence level only.
  – By leveraging information in the document or in near proximity to the current sentence, higher quality translations are possible.
• Immediate Quality Updates
  – Updates to engine quality within 60 minutes of making edits.
  – Updates to engine quality by learning automatically from external sources.
• Improved Slavic language support
  – Generation of inflected forms
  – Deeper grammatical and syntactical analysis
