a ndrejs v asiļjevs c hairman of the b oard [email protected]
DESCRIPTION
d ata is c ore. s. a ndrejs v asiļjevs c hairman of the b oard [email protected]. LOCALIZATION WORLD PARIS, JUNE 5, 2012. L anguage technology developer Localization service provider Leadership in smaller languages Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania) - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/1.jpg)
sandrejs vasiļjevs
chairman of the [email protected]
data is core
LOCALIZATION WORLD PARIS, JUNE 5, 2012
![Page 2: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/2.jpg)
• Language technology developer
• Localization service provider
• Leadership in smaller languages
• Offices in Riga (Latvia), Tallinn (Estonia) and Vilnius (Lithuania)
• 135 employees
• Strong R&D team
• 9 PhDs and candidates
![Page 3: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/3.jpg)
MTmachine translation
machine translation
![Page 4: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/4.jpg)
INNOVATIONd i s r u p ti v e
d i s r u p ti v e
![Page 5: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/5.jpg)
rule-based MT
statistical MT
• High quality translation in specialized domains• Require highly qualified
linguists, researchers and software developers• Time and resource consuming• Difficult to evolve
• Translation and linguistic knowledge is derived from data• Relatively easy and quick to develop• Requires huge amounts of parallel and monolingual data• Translation quality inconsistent and can differ dramatically from
domain to domain
MT paradigms
![Page 6: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/6.jpg)
CHALLENGE
![Page 7: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/7.jpg)
15largest
languages
50%
![Page 8: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/8.jpg)
domains
IT Aerospace
Agriculture Automotive
Chemistry Coal and mining industries
Communications Culture
Defence Education
Electronics Energy
Finance Food technology
Government affairs Legal
Life sciences Logistics
Marketing Mechanical engineering
Medicine Pharmaceuticals
Religion Social affairs
Trade
![Page 9: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/9.jpg)
one size fits all
?
![Page 10: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/10.jpg)
DATA
![Page 11: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/11.jpg)
![Page 12: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/12.jpg)
The total body of European Union law applicable in the EU Member States
JRC-Acquis http://langtech.jrc.it/JRC-Acquis.html
![Page 13: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/13.jpg)
The DGT Multilingual Translation Memory of the Acquis Communautaire
DGT-TMhttp://langtech.jrc.it/DGT-TM.html
![Page 14: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/14.jpg)
Parallel data collected from the Web by University of Uppsala
90 languages, 3800 language
2,7B parallel units
Opushttp://opus.lingfil.uu.se
![Page 15: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/15.jpg)
open European language resource infrastructure
http://www.meta-net.eu
![Page 16: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/16.jpg)
Data for SMT training
![Page 17: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/17.jpg)
PLATFORM
![Page 18: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/18.jpg)
Moses toolkit
[ttable-file]0 0 5 /.../unfactored/model/phrase-table.0-0.gz% ls steps/1/LM_toy_tokenize.1* | catsteps/1/LM_toy_tokenize.1steps/1/LM_toy_tokenize.1.DONEsteps/1/LM_toy_tokenize.1.INFOsteps/1/LM_toy_tokenize.1.STDERRsteps/1/LM_toy_tokenize.1.STDERR.digeststeps/1/LM_toy_tokenize.1.STDOUT% train-model.perl \--corpus factored-corpus/proj-syndicate \--root-dir unfactored \--f de --e en \--lm 0:3:factored-corpus/surface.lm:0% moses -f moses.ini -lmodel-file "0 0 3 ../lm/europarl.srilm.gz“use-berkeley = truealignment-symmetrization-method = berkeleyberkeley-train = $moses-script-dir/ems/support/berkeley-train.shberkeley-process = $moses-script-dir/ems/support/berkeley-process.shberkeley-jar = /your/path/to/berkeleyaligner-2.1/berkeleyaligner.jarberkeley-java-options = "-server -mx30000m -ea"berkeley-training-options = "-Main.iters 5 5 -EMWordAligner.numThreads 8"berkeley-process-options = "-EMWordAligner.numThreads 8"berkeley-posterior = 0.5tokenizein: raw-stemout: tokenized-stemdefault-name: corpus/tokpass-unless: input-tokenizer output-tokenizertemplate-if: input-tokenizer IN.$input-extension OUT.$input-extensiontemplate-if: output-tokenizer IN.$output-extension OUT.$output-extensionparallelizable: yesworking-dir = /home/pkoehn/experimentwmt10-data = $working-dir/data
![Page 19: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/19.jpg)
buildyour ownMT engine
![Page 20: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/20.jpg)
Tilde / CoordinatorLATVIA
University of EdinburghUK
Uppsala UniversitySWEDEN
Copehagen UniversityDENMARK
University of ZagrebCROATIA
MoraviaCZECH REPUBLIC
SemLabNETHERLANDS
![Page 21: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/21.jpg)
• Cloud-based self-service MT factory
• Repository of parallel and monolingual corpora for MT generation
• Automated training of SMT systems from specified collections of data
• Users can specify particular training data collections and build customised MT engines from these collections
• Users can also use LetsMT! platform for tailoring MT system to their needs from their non-public data
![Page 22: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/22.jpg)
• Stores SMT training data• Supports different formats –
TMX, XLIFF, PDF, DOC, plain text
• Converts to unified format• Performs format
conversions and alignmentResourceRepository
![Page 23: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/23.jpg)
• Put users in control of their data
• Fully public or fully private should not be the only choice
• Data can be used for MT generation without exposing it
• Empower users to create custom MT engines from their data
user-driven machine translation
![Page 24: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/24.jpg)
• Integration with CAT tools• Integration in web pages • Integration in web browsers• API-level integration
integration
![Page 25: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/25.jpg)
Integration of MT in SDL Trados
![Page 26: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/26.jpg)
Training UsingSharing of training data
Giza++Moses SMT toolkit
SMT Resource Repository
SMT Multi-Model Repository
(trained SMT models)
Proc
esin
g, E
valu
ation
...
Upl
oad
Anon
ymou
sac
cess
Auth
entic
ated
acce
ss
System management, user authentication, access rights control ...
Web page
Web service
Web pagetranslation widget
CAT tools
Web browserPlug-ins
SMT Resource Directory
SMT System Directory
Moses decoder
![Page 27: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/27.jpg)
![Page 28: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/28.jpg)
![Page 29: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/29.jpg)
use caseFORTERA
![Page 30: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/30.jpg)
EVALUATION
![Page 31: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/31.jpg)
• Keyboard-monitoring of post-editing (O´Brien, 2005)
• Productivity of MS Office localization (Schmidtke, 2008)
5-10% productivity gain for SP, FR, DE
• Adobe(Flournoy and Duran, 2009)
22%-51% productivity increase for RU, SP, FR
• Autodesk Moses SMT system (Plitt and Masselot, 2010)
74% average productivity increase for FR, IT, DE, SP
Previous Work
![Page 32: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/32.jpg)
Evaluation at Tilde
• Latvian: About 1,6 M native speakers Highly inflectional - ~22M possible
word forms in total Official EU language
• Tilde English – Latvian MT system
• IT Software Localization Domain
• Evaluation of translators’ productivity
![Page 33: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/33.jpg)
English-Latvian data
Bilingual corpus Parallel unitsLocalization TM 1 290 KDGT-TM 1 060 KOPUS EMEA 970 KFiction 660 KDictionary data 510 KWeb corpus 900 KTotal 5 370 K
Monolingual corpus WordsLatvian side of parallel corpus
60 M
News (web) 250 MFiction 9 MTotal, Latvian 319 M
![Page 34: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/34.jpg)
MT Integration into Localization Workflow
Evaluate original / assign Translator and Editor
Analyze against TMs
Translateusing translation suggestions for TMs
and MT
Evaluate translation quality / Edit
Fix errors
Ready translation
MT translate new sentences
![Page 35: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/35.jpg)
• Key interest of localization industry is to increase productivity of translation process while maintaining required quality level
• Productivity was measured as the translation output of an average translator in words per hour
• 5 translators participated in evaluation including both experienced and new translatorsEvaluation of
Productivity
![Page 36: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/36.jpg)
• Performed by human editors as part of their regular QA process
• Result of translation process was evaluated, editors did not know was or was not MT applied to assist translator
• Comparison to reference is not part of this evaluation
• Tilde standard QA assessment form was used covering the following text quality areas:
Accuracy
Spelling and grammar
Style
Terminology
Evaluation of Quality
![Page 37: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/37.jpg)
QA Grades
Error Score (sum of weighted errors)
Resulting Quality Evaluation
0…9 Superior
10…29 Good
30…49 Mediocre
50…69 Poor
>70 Very poor
Tilde Localization QA assessment applied in the evaluation
![Page 38: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/38.jpg)
Evaluation data
►54 documents in IT domain►950-1050 adjusted words in
each document►Each document was split in
half:
►the first part was translated using suggestions from TM only
►the second half was translated using suggestions from both TM and MT
![Page 39: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/39.jpg)
%
productivity32.9%*
* Skadiņš R., Puriņš M., Skadiņa I., Vasiļjevs A., Evaluation of SMT in localization to under-resourced inflected language, in Proceedings of the 15th International Conference of the European Association for Machine Translation EAMT 2011, p. 35-40, May 30-31, 2011, Leuven, Belgium
Latvian
![Page 40: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/40.jpg)
Evaluation at Moravia
► IT Localization domain►Systems trained on the
LetsMT platform►English - Czech translation
25.1% productivity increase
Error score increase from 19 to 27, still at the GOOD grade (<30)
►English – Polish translation 28.5% productivity
increase Error score increase from
16.8 to 23.6, still at the GOOD grade (<30)
![Page 41: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/41.jpg)
%
productivity
25%
*For Czech and Polish formal evaluation was done by MoraviaForor Slovak productivity increase was estimated by Fortera
28.5%
Slovak* Polish
25.1%
Czech
![Page 42: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/42.jpg)
MORE DATA
![Page 43: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/43.jpg)
corpora collection tools
comparability metrics
named entity recognition tools
terminology extraction tools
ACCURAT TOOLKIT
![Page 44: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/44.jpg)
use caseAUTOMOTIVE
MANUFACTURER
![Page 45: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/45.jpg)
very smalltranslation memories(just 3500 sentences)
noin-domain corpora in target languages
nomoney for expensive developments
?
![Page 46: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/46.jpg)
Terminology extraction
Web crawling parallel
monolingual
Parallel data extraction from comparable corpora
data collection workflow
![Page 47: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/47.jpg)
TMs
Terminology glossary
Parallel phrases
Parallel Named Entities
Monolingual target language corpus
Resulting data
![Page 48: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/48.jpg)
General domain data as a basis
Domain specific language model
Impose domain specific terminology, named entity translations
Add linguistic knowledge atop of statistical components
SMT Training
![Page 49: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/49.jpg)
right data &right tools
![Page 50: a ndrejs v asiļjevs c hairman of the b oard andrejs@tilde.com](https://reader036.vdocuments.mx/reader036/viewer/2022062815/5681695d550346895de11297/html5/thumbnails/50.jpg)
tilde.comtechnologies
for smaller
languages
The research within the LetsMT! project leading to these results has received funding from the ICT Policy Support Programme (ICT PSP), Theme 5 – Multilingual web, grant agreement no 250456