TAUS MACHINE TRANSLATION SHOWCASE Vancouver, Canada
The Simplified Guide to Getting Started in SMT
Wednesday, 29 October 2014
Tom Hoar, Precision Translation Tools
The research within the project MosesCore leading to these results has received funding from the European Union 7th Framework Programme, grant agreement no 288487
PTTools
• Software vendor - founded Feb 2010
– Adobe : Photoshop
– PTTools : DoMT
• DoMT brand
– DoMT Desktop: organize and manage training corpora, models and custom workflows.
– DoMT Server: automation solution
• Customer education
Who We Are
Current State
• Who has not heard of SMT?
• Requires powerful, expensive hardware
• Huge translation memories
• Complicated processes
• Dearth of skilled personnel
Current SMT
Then vs Now
| | 2007 | 2014 |
| Hardware | 50 CPUs in private cloud | One 24-CPU machine |
| Mega corpus | 2 weeks | 36 hours |
| Cost | US $100K++ | US $1,500 |

| | 1992 | 2014 |
| Computer | SGI @ $100K | Dell @ $5,000 |
| Software | Eclipse Alias @ $25K | Adobe CS Cloud $1,500 |
| Graphic Production | $300 per hour | $30++ per hour |
Business Models
• Where is the work done?
• Who does the work?
• Outsourced
– Free
– For Fee
• Insourced
– Enterprise Server
– Desktop Application
Reality 2014
• Inexpensive capable hardware exists
• Translation memories within reach
• Processes migrating to software
• Training available for existing personnel
Is Academic Moses Enough?
“There are considerable amounts of additional functionality... that are not included in Moses that are essential in order to offer a strong and innovative commercial MT platform.”
– Philipp Koehn, Professor, University of Edinburgh
(http://kv-emptypages.blogspot.com/2013/09/understanding-mt-customization.html)
“Simple Guide”
Manage Corpora
• Acquire
– Translation memory archives
– Public corpora
– Convert docs
– Recycle post-edited MT
• Process
– Transform/filter
– Curate/categorize
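The transform/filter step can be sketched in Python. This is a minimal illustration, not how DoMT does it: the function name `filter_pairs` and the thresholds (80 tokens, length ratio 3.0) are assumptions chosen only for the example.

```python
def filter_pairs(pairs, max_tokens=80, max_ratio=3.0):
    """Keep segment pairs that are non-empty, not too long, and not duplicated.

    The thresholds are illustrative assumptions, not values from the talk.
    """
    seen = set()
    kept = []
    for source, target in pairs:
        source, target = source.strip(), target.strip()
        if not source or not target:
            continue  # drop pairs with an empty side
        n_src, n_tgt = len(source.split()), len(target.split())
        if n_src > max_tokens or n_tgt > max_tokens:
            continue  # drop very long segments
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue  # drop pairs whose lengths are badly mismatched
        if (source, target) in seen:
            continue  # drop exact duplicates
        seen.add((source, target))
        kept.append((source, target))
    return kept
```

Token counts here come from whitespace splitting, so the length checks would need a different tokenizer for languages written without spaces, such as Chinese.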
Manage SMT Models
• Train Translation models
• Train Language model
• Tune SMT model
• Evaluate SMT model
• Deploy SMT engine
• Versioning
Produce MT
• Manual
– Import/export TMX
– Import/export XLIFF
– Doc-to-doc support
• Automation
– TMS Integration
– CAT Integration
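As a rough idea of what a TMX import involves, the sketch below reads source/target pairs from a TMX string with Python's standard library. The function name `read_tmx_pairs` is hypothetical, and the exact-match handling of language codes is a simplifying assumption (real archives mix codes such as "en" and "en-US").

```python
import xml.etree.ElementTree as ET

# Fully qualified name of the xml:lang attribute as ElementTree exposes it.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Return (source, target) text pairs from a TMX document string."""
    root = ET.fromstring(tmx_text)
    pairs = []
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.iter("tuv"):
            # TMX 1.4 uses xml:lang; older files may use a plain "lang" attribute.
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() concatenates all text in the segment,
                # including any text held inside inline elements.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang in segs and tgt_lang in segs:
            pairs.append((segs[src_lang], segs[tgt_lang]))
    return pairs
```

Calling `read_tmx_pairs(tmx_text, "en", "es")` returns only the translation units that carry both languages; units missing either side are skipped.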
SMT Specialists
• Computational linguists are scientists who specialize in language and computing to create and advance the science.
• Specialists are localization engineers who review the data and select tools to prepare a training corpus that minimizes post-editing in commercial production.
Human Resources
Specialist’s Required Skills
• Organization skills (e.g. managing TMs)
• Observant of patterns
• Willingness to learn
• Regular expressions – helpful
• Programming skills – unnecessary
• Computational linguists – unnecessary
• System Administrator – unnecessary
Observant of Patterns
<ut>{\cs6\f1\cf6\lang1024 </ut> <span class="small-text"> <ut>} </ut>Copyright © 1997-2009 &nbsp;\ n \ n
• Archived TMX content
– RTF
– HTML & XML-escaped HTML
– XML
– Broken programmer’s markup
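Once such patterns are spotted, a few regular expressions handle most of them. The sketch below cleans the sample segment shown above; the function name `clean_segment` and the specific patterns are assumptions for illustration, and a real archive would need patterns tuned to its own debris.

```python
import html
import re

UT_RE = re.compile(r"<ut>.*?</ut>", re.DOTALL)  # <ut> wrappers and their RTF payload
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")       # remaining HTML/XML tags
ESCAPED_NL_RE = re.compile(r"\\\s*n")           # literal backslash-n sequences
SPACE_RE = re.compile(r"\s+")                   # runs of whitespace, incl. U+00A0


def clean_segment(text):
    """Strip markup debris from one archived TMX segment (illustrative only)."""
    text = UT_RE.sub(" ", text)
    text = html.unescape(text)       # turns &nbsp; and XML-escaped tags into characters
    text = TAG_RE.sub(" ", text)
    text = ESCAPED_NL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()
```

The backslash-n pattern is deliberately aggressive and would also damage legitimate text such as Windows paths, which is why reviewing the data before choosing patterns matters.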
Use Cases
• Large LSP
– Extensive MT experience
– CSA Top 10
• 2 Medium LSPs
– Post-editing experience
– In-house localization engineers
• Freelance Translator
– United Nations contractor
– Technically savvy
Welocalize
• Work: Software localization
• Hardware: Virtual machines for pilot
• SMT models: EN-ES, EN-DE, EN-ZH, EN-RU
• Corpus: All corpora < 500,000 segment pairs
• Training: 3-month pilot
• Results: “Approached outsourcing vendors”
– Zero-edit measure: 25-45%
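The slides do not define the zero-edit measure. Assuming it means the share of MT segments that the post-editor accepted without changes, it could be computed as below; the function name `zero_edit_rate` and the whitespace-insensitive comparison are both assumptions.

```python
def zero_edit_rate(mt_segments, post_edited_segments):
    """Percentage of MT segments left unchanged by the post-editor.

    Assumed definition: a segment counts as zero-edit when the MT output and
    the post-edited text are identical after trimming surrounding whitespace.
    """
    if len(mt_segments) != len(post_edited_segments):
        raise ValueError("segment lists must be the same length")
    if not mt_segments:
        return 0.0
    unchanged = sum(
        1
        for mt, pe in zip(mt_segments, post_edited_segments)
        if mt.strip() == pe.strip()
    )
    return 100.0 * unchanged / len(mt_segments)
```

Under this definition, a result of 25-45% would mean that roughly one quarter to just under one half of the segments needed no post-editing at all.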
EQHO Communications
• Work: Software localization
• Hardware: $1,500 new 6-core computer
• SMT model: EN <-> European language
• Corpus: ~130,000 segment pairs
• Training: 3-month pilot
• Results: BLEU scores of 80 to 85
– Zero-edit measure: 23-43%
Mid-sized European LSP
• Work: Financial and regulatory reports
• SMT model: EN <-> European language
• Corpus: ~800,000 segment pairs (25 years)
• Training: 20 hours of tutorials over 2 months
• Homework: Categorize TMs for 4+ months
• Results: BLEU scores rose from the low 50s to the mid-80s
Freelance Translator
• Work: United Nations environmental reports
• Hardware: $1,500 new 6-core computer
• SMT model: EN <-> European language
• Corpus: ~250,000 segment pairs (25 years)
• Training: 40 hours of tutorials over 2 months
• Results: BLEU scores of 75 to 85
– Zero-edit measure: averaged 35%
Conclusion
• Regardless of business model
– Manage Corpora
– Generate Models
– Produce MT
– Publish Results
• Re-purpose existing staff with training
• Rightsourcing