taus open source machine translation showcase, paris, manuel herranz, pangeanic, 4 june 2012

Post on 03-Jul-2015


DESCRIPTION

This presentation is a part of the MosesCore project that encourages the development and usage of open source machine translation tools, notably the Moses statistical MT toolkit. MosesCore is supported by the European Commission Grant Number 288487 under the 7th Framework Programme. For the latest updates, follow us on Twitter - #MosesCore

TRANSCRIPT

TAUS OPEN SOURCE MACHINE TRANSLATION SHOWCASE

PangeaMT: Putting Open Standards to Work
16:00-16:15, Monday 4 June

Manuel Herranz, Pangeanic

PangeaMT – putting open standards to work… well

Manuel Herranz, #manuelhrrnz #pangeanic, E: mherranz@pangea.com.mt, pangeanic

MACHINE TRANSLATION

Make my day,

IS NOT

IS

become a post-editor

Chomsky: Imagine that if in the future large enough amounts of data existed, they could be processed by computers with enough computing power

rule-based systems, IBM licenses, many linked to patent EN/RU & Intel

First statistical papers

1st Open source SMT

Translation industry appropriating Moses: http://euromatrixplus.net/moses

DIY SMT

http://t.co/HDTboxQ

PEAK of Cold War and information control. Products & information directed to consumers / users / citizens

BEGINNING of data resources. Internet. Accessibility to information

Content generated BY USERS / CONSUMERS / CITIZENS, multilingual, free information exchange across the world

Types of LSPs (Ben Sargent – TMS Inspiration Days April ‘11 – Krakow)

a) developers of a system, who build it for their own use and for their clients,

b) buyers of systems, who do not want the headache of starting from scratch and prefer to buy ready-built solutions, and finally

c) those who prefer the mix & match approach, buying some good solutions outside and building interfaces and what they know works best for their business. The trend is towards unification.

2007/08


2009/10

2011/12

• DIY SMT
• Empower Users
• Glossary
• Automated re-training
• Transfer architecture and know-how to users
• Compatibility with commercial formats (ttx, sdlxliff, itd)

2007 and before

• RB tests with commercial software
• Insufficiently good output
• Only internal production

• EU Post-Editing Award
• V1: Small data sets (2-5M words), automotive & electronics
• (ES), then Fr/It/De in other fields

• Division born
• 00's of engine trials and language combinations

• Open-Source to commercial

• TMX / XLIFF workflows

As of May 2009: 487 billion gigabytes, or 1,000,000,000 x 487,000,000,000 = 4.87 x 10^20 bytes

Estimates: up 50% a year (Oracle); doubles every 11 hours (IBM)
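The data-volume figure above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal gigabytes (10^9 bytes) and treating the 50% yearly estimate as simple compound growth; the `project` helper and the five-year horizon are illustrative and not part of the original slides.

```python
# Check the slide's arithmetic: 487 billion gigabytes expressed in bytes.
BYTES_PER_GB = 10**9        # assumption: decimal gigabytes
gigabytes = 487 * 10**9     # 487 billion GB (May 2009 figure from the slide)
total_bytes = gigabytes * BYTES_PER_GB


def project(volume, yearly_growth, years):
    """Compound a volume by a fixed yearly growth rate (illustrative only)."""
    return volume * (1 + yearly_growth) ** years


if __name__ == "__main__":
    print(f"{total_bytes:.2e} bytes in May 2009")
    print(f"{project(total_bytes, 0.5, 5):.2e} bytes after 5 years at +50% a year")
```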

OBJECTIVES = CHALLENGES 2007 - 2010

Turn academic development (Moses) into a commercial application.

Provide high-quality MT for post-editing to save time and cost. Not Google-type broad translation, but domain-specific and user-centric.

Lower the entry level for MT. Bring affordability and user control / empowerment to MT. Bring it to the user, take it away from the programmer.

How? By fostering open-standard geared translation automation strategies.

Use only community-based open standards (OASIS / ISO: XLIFF, TMX, XML). NO proprietary formats (technology independence), so USERS are not "locked in" to buying and updating expensive software.
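One practical consequence of sticking to open formats is that translation data can be read with nothing but a standard XML parser. The sketch below is an illustration under that assumption, not PangeaMT's code: the function name `read_tmx_pairs` and the simplified `SAMPLE` document are hypothetical, and only Python's standard library is used.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_tmx_pairs(tmx_text, src_lang, tgt_lang):
    """Yield (source, target) segment pairs from TMX markup.

    Language codes are compared case-insensitively ("en-GB" == "EN-GB").
    """
    root = ET.fromstring(tmx_text)
    for tu in root.iter("tu"):
        segs = {}
        for tuv in tu.findall("tuv"):
            lang = (tuv.get(XML_LANG) or tuv.get("lang") or "").lower()
            seg = tuv.find("seg")
            if seg is not None:
                # itertext() also collects text nested inside inline tags.
                segs[lang] = "".join(seg.itertext()).strip()
        if src_lang.lower() in segs and tgt_lang.lower() in segs:
            yield segs[src_lang.lower()], segs[tgt_lang.lower()]


# Simplified sample: a real TMX header carries more attributes.
SAMPLE = """<tmx version="1.4">
  <header srclang="en-GB"/>
  <body>
    <tu>
      <tuv xml:lang="EN-GB"><seg>Hello world.</seg></tuv>
      <tuv xml:lang="ES-ES"><seg>Hola mundo.</seg></tuv>
    </tu>
  </body>
</tmx>"""

if __name__ == "__main__":
    for src, tgt in read_tmx_pairs(SAMPLE, "en-GB", "es-ES"):
        print(src, "|", tgt)
```

Because the pairs come out as plain strings, they can be handed to any cleaning or training step without depending on a vendor tool.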

DIY SMT June 2011 http://t.co/HDTboxQ


The rush for data

We soon realised that there was a rush to gather data, but that other resources around the data were also necessary

cleaning

More cleaning


<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>A system for recovering the methane that is emitted from the manure so that

it does not leak into the atmosphere.</seg>

</tuv>

<tuv xml:lang="FR-FR">

<seg>Système permettant de r€ pérer le méthane qui se dégage de l'engrais naturel

d'origine animale de sorte qu'il ne se dissipe pas dans l'atm sphère.</seg>

</tuv>

<tuv xml:lang="EN-US">

<seg>On 22nd May we decided not to join the group.</seg>

<tuv xml:lang="DE-DE">

<seg>Am 22. </seg>

<tu srclang="en-GB">

<tuv xml:lang="EN-GB">

<seg>The President of the United States visited Costa Rica.</seg>

</tuv>

<tuv xml:lang="ES-ES">

<seg>El Presidente de los Estados Unidos, el señor Obama y su esposa la señora

Michelle, visitaron Costa Rica el pasado sábado.</seg>

</tuv>


<tuv xml:lang="JP">

<seg>同書は「通訳・翻訳キャリアガイド」の2011-2012年度版。

英字新聞のジャパンタイムズ社が強みとするジャーナリスティックな視点で、通訳や翻訳という仕事が持つ魅力ややりがい、プロに要求されるスキルおよび意識の持ち方などを紹介。また通訳者・翻訳者になるための道すじから、実際の仕事の現場にいたるまで、今日の通訳・翻訳業界の実像を包括的に紹介。</seg>

<tuv xml:lang="EN-US">

<seg>It is a journalistic point of view and strengths of the English-

language newspaper Japan Times. It includes a description of the exciting and

rewarding work of translation and interpretation, as well as the introduction of

consciousness and how to acquire the required professional skills. The road to

becoming a translator and interpreter also down to the actual work site, a

comprehensive guide to interpreting the reality of today's translation industry.

</seg>
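The translation units above show three typical defects: encoding damage, a truncated target, and a target that says more than its source. A minimal sketch of how such pairs might be flagged automatically follows. The function names, the suspicious-character set and the ratio thresholds are assumptions chosen for illustration, not the cleaning rules actually used in PangeaMT.

```python
import re

# Characters that rarely belong in running text and often signal encoding
# damage (assumption: tune this set per language pair and domain).
SUSPICIOUS = re.compile("[€¤\ufffd]")


def length_ratio(source, target):
    """Character-length ratio target/source; 0.0 when the source is empty."""
    return len(target) / len(source) if source else 0.0


def flag_pair(source, target, min_ratio=0.5, max_ratio=1.6):
    """Return a list of reasons why a translation unit looks unreliable."""
    reasons = []
    if not source.strip() or not target.strip():
        reasons.append("empty segment")
    ratio = length_ratio(source, target)
    if ratio < min_ratio:
        reasons.append("target too short")
    elif ratio > max_ratio:
        reasons.append("target too long")
    if SUSPICIOUS.search(source) or SUSPICIOUS.search(target):
        reasons.append("suspicious characters")
    return reasons


if __name__ == "__main__":
    print(flag_pair("On 22nd May we decided not to join the group.", "Am 22."))
```

Character-length thresholds like these only make sense between languages with similar scripts. A Japanese-English pair such as the last example would need its own limits, and a loose translation of normal length would still require human review.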

Domain        Translation    MT+PE

Automotive    400 wph        900 wph

Marketing     250 wph        450 wph

Software      350 wph        1,000 wph
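The words-per-hour figures above translate directly into speed-up factors and hours saved. The sketch below works through that arithmetic. It assumes the "Translation" column means translating from scratch and "MT+PE" means machine translation followed by post-editing; the 10,000-word job size is an illustrative value, not a figure from the slides.

```python
# Words per hour from the slide: (translation, MT + post-editing).
THROUGHPUT_WPH = {
    "Automotive": (400, 900),
    "Marketing": (250, 450),
    "Software": (350, 1000),
}


def speedup(translation_wph, mt_pe_wph):
    """How many times faster MT + post-editing is than plain translation."""
    return mt_pe_wph / translation_wph


def hours_saved(words, translation_wph, mt_pe_wph):
    """Hours saved on a job of `words` words when moving to MT + post-editing."""
    return words / translation_wph - words / mt_pe_wph


if __name__ == "__main__":
    for domain, (tr, pe) in THROUGHPUT_WPH.items():
        print(f"{domain}: {speedup(tr, pe):.2f}x faster, "
              f"{hours_saved(10_000, tr, pe):.1f} h saved per 10,000 words")
```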

Domains are managed at TM and at engine level

I created this engine with medical and pharma TMX files and added environmental TMs to boost coverage. The client deals with plant-based natural drugs / ayurveda.

Tag-based TM selection
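Tag-based selection can be pictured as filtering a catalogue of translation memories by domain tag before an engine is trained. The following is a sketch under that assumption only: the `TranslationMemory` class, the `select_tms` function, the file names and the tags are all hypothetical and do not describe PangeaMT's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class TranslationMemory:
    """A TM file plus the domain tags assigned to it (hypothetical structure)."""
    path: str
    tags: set = field(default_factory=set)


def select_tms(tms, wanted_tags):
    """Return the TMs that carry at least one of the wanted domain tags."""
    wanted = set(wanted_tags)
    return [tm for tm in tms if tm.tags & wanted]


# Illustrative catalogue; file names and tags are made up for this example.
CATALOGUE = [
    TranslationMemory("medical.tmx", {"medical"}),
    TranslationMemory("pharma.tmx", {"pharma", "medical"}),
    TranslationMemory("environment.tmx", {"environmental"}),
    TranslationMemory("automotive.tmx", {"automotive"}),
]

if __name__ == "__main__":
    training_set = select_tms(CATALOGUE, {"medical", "pharma", "environmental"})
    print([tm.path for tm in training_set])
```

Keeping tags on the TM rather than hard-coding file lists per engine means the same memory can feed several engines, as in the medical plus environmental combination described above.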

[Timeline figure: years 2010 to 2018, labelled "User empowerment"]

• MT acceptance growth (still)

• Translator engagement challenge (being solved particularly with in-house translators & economic climate)

• Need for data is being addressed – still more work to be done.

• The difference will be made by data handling and MT techniques (hybrid, combination, syntax, re-ordering, etc.)

• Users and practitioners now can build their own systems, A TREND BEING FOLLOWED BY OTHER PLAYERS.

Until 2011/12

YEAR 2016: 000's of customized MT systems

In 5 years... after 2017… where?

Tech. not the realm of a few providers

Ubiquitous MT 2009
