Machine Translation Combination - Tilde Research Seminar


Machine translation combination

Matīss Rikters

Supervisor: Dr. Dat., prof. Inguna Skadiņa

Tilde research seminar

Riga, 4 March 2016

First presentation

http://tbs/dev/ai/ResearchSeminar/MSMT.ppt

http://www.slideshare.net/matissrikters/msmt-47423148

The first presentation covered:

MT approaches

Types of hybrid MT

A literature review of multi-system hybrid MT

Planned experiments, other completed work and planned work

Contents

Hybrid machine translation

Multi-system hybrid MT

Simple machine translation combination

Combining full translations

Combining translated chunks

Linguistically motivated machine translation combination

Other work

Future plans

Hybrid machine translation

Statistical rule generation: the rules of the RBMT system are generated from training corpora

Multi-pass processing: the data is processed sequentially, first with RBMT, then with SMT

Multi-system hybrid MT: several MT systems are run in parallel
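A minimal sketch of the multi-system idea, assuming each MT system can be wrapped as a plain function from a source sentence to a translation. The function `translate_with_all` and the two stand-in "systems" are hypothetical and only illustrate running the systems in parallel; real systems would call their web APIs.

```python
from concurrent.futures import ThreadPoolExecutor


def translate_with_all(sentence, systems):
    """Send one sentence to every MT system in parallel.

    `systems` maps a system name to a callable that takes the source
    sentence and returns a translation (hypothetical interface).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(systems))) as pool:
        futures = {name: pool.submit(fn, sentence) for name, fn in systems.items()}
        # Collect every result; the dictionary keeps the system names as keys.
        return {name: future.result() for name, future in futures.items()}


# Stand-in "systems" so the sketch runs offline.
systems = {
    "system_a": lambda s: s.upper(),
    "system_b": lambda s: s[::-1],
}
translations = translate_with_all("labdien", systems)
```

Threads fit here because calls to online APIs spend most of their time waiting on the network.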

Multi-system hybrid MT

Related work:

SMT + RBMT (Ahsan and Kolachina, 2010)

Confusion networks (Barrault, 2010)

+ a neural network model (Freitag et al., 2015)

SMT + EBMT + TM + NE (Santanu et al., 2014)

Recursive sentence decomposition (Mellebeek et al., 2006)

Machine translation combination

Combining full translations:

Translate the full sentence with several MT systems

Select the best translation

Combining translated chunks:

Split the sentence into chunks

The top-level subtrees of the sentence's syntax tree are used as chunks

Translate each chunk with several MT systems

Select the best chunks and combine them
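The chunk-level steps can be sketched as one small function. This is an illustration under assumptions, not the presented implementation: the splitter, the translators and the scoring function are all placeholders passed in by the caller, and a lower score is assumed to mean a better candidate.

```python
def combine_chunks(sentence, split, systems, score):
    """Translate chunk by chunk and keep the best-scoring candidate for each.

    split   -- callable: sentence -> list of source chunks
    systems -- dict: system name -> callable(chunk) -> translation
    score   -- callable: translation -> number, lower is better
    All three are placeholders supplied by the caller.
    """
    best_chunks = []
    for chunk in split(sentence):
        candidates = [translate(chunk) for translate in systems.values()]
        best_chunks.append(min(candidates, key=score))
    # Recombine the selected chunks in their original order.
    return " ".join(best_chunks)


# Toy stand-ins: split on commas, two fake "systems", shorter output wins.
result = combine_chunks(
    "one, two",
    split=lambda s: [part.strip() for part in s.split(",")],
    systems={"a": lambda c: c + " (a)", "b": lambda c: c.upper()},
    score=len,
)
```

Combining full translations is the special case where `split` returns the whole sentence as a single chunk.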

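Combining full translations

Selecting the best translation:

KenLM (Heafield, 2011) calculates probabilities based on the observed entry with the longest matching history $w_f^n$:

$$p(w_n \mid w_1^{n-1}) = p(w_n \mid w_f^{n-1}) \prod_{i=1}^{f-1} b(w_i^{n-1})$$

where the probability $p(w_n \mid w_f^{n-1})$ and the backoff penalties $b(w_i^{n-1})$ are given by an already-estimated language model. Perplexity is then calculated using this probability:

$$b^{-\frac{1}{N} \sum_{i=1}^{N} \log_b q(x_i)}$$

where, given an unknown probability distribution $p$ and a proposed probability model $q$, the model is evaluated by how well it predicts a separate test sample $x_1, x_2, \ldots, x_N$ drawn from $p$. In the perplexity formula $b$ is the base of the logarithm, not the backoff penalty.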
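The backoff lookup and the perplexity computation can be illustrated with a toy model. The sketch below assumes a bigram model stored in log10 space. All probabilities and backoff weights are made-up numbers, sentence boundary tokens and unknown words are ignored, and the function names are hypothetical, so this is not KenLM's implementation.

```python
# Toy bigram model in log10 space; every number is invented for illustration.
LOG_PROB = {
    ("the",): -1.0, ("cat",): -2.0, ("sat",): -2.0,
    ("the", "cat"): -0.5,
}
BACKOFF = {("the",): -0.3, ("cat",): -0.2, ("sat",): 0.0}


def log10_prob(word, history):
    """Shorten the history until an observed n-gram is found."""
    penalty = 0.0
    history = list(history)
    while True:
        ngram = tuple(history) + (word,)
        if ngram in LOG_PROB:
            return penalty + LOG_PROB[ngram]
        if not history:
            raise KeyError(word)  # unknown words are not handled here
        # Pay the backoff weight of the context that is being abandoned.
        penalty += BACKOFF.get(tuple(history), 0.0)
        history = history[1:]


def perplexity(words, order=2):
    """Perplexity with base 10: 10 ** (-(sum of log10 probabilities) / N)."""
    total = 0.0
    for i, word in enumerate(words):
        history = words[max(0, i - order + 1):i]
        total += log10_prob(word, history)
    return 10 ** (-total / len(words))


sentence_perplexity = perplexity(["the", "cat", "sat"])
```

In this toy model "cat sat" is not an observed bigram, so "sat" is scored as the backoff weight of "cat" plus the unigram log probability of "sat".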

Combining full translations

Selecting the best translation:

A 5-gram language model was trained with KenLM on the JRC-Acquis corpus v. 2.2 (Steinberger, 2006): 1.4 million Latvian legal-domain sentences

Sentences were scored against the language model with the KenLM query program

Test data:

1581 randomly selected sentences from the JRC-Acquis corpus

The ACCURAT balanced evaluation corpus of 512 general-domain sentences (Skadiņš et al., 2010) was also tried, but the results were worse

Combining full translations

                                          Share of selected translations
System                           BLEU     Google     Bing       LetsMT     Equal
Google Translate                 16.92    100 %      -          -          -
Bing Translator                  17.16    -          100 %      -          -
LetsMT                           28.27    -          -          100 %      -
Hybrid Google + Bing             17.28    50.09 %    45.03 %    -          4.88 %
Hybrid Google + LetsMT           22.89    46.17 %    -          48.39 %    5.44 %
Hybrid LetsMT + Bing             22.83    -          45.35 %    49.84 %    4.81 %
Hybrid Google + Bing + LetsMT    21.08    28.93 %    34.31 %    33.98 %    2.78 %

May 2015
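A hedged sketch of how the selection step and the selection shares reported above could be computed. It assumes the `kenlm` Python module is installed when `make_kenlm_scorer` is used. The helpers `select_best` and `selection_shares` are hypothetical. Counting tied perplexities as "equal" is an assumption about what the "equal" ("Vienādi") column measures.

```python
from collections import Counter


def make_kenlm_scorer(model_path):
    """Return a perplexity function backed by a KenLM model file."""
    import kenlm  # third-party module, imported lazily so the rest runs without it
    return kenlm.Model(model_path).perplexity


def select_best(candidates, perplexity):
    """Pick the translation with the lowest perplexity.

    candidates -- dict: system name -> translated sentence
    Returns (label, translation); the label is "equal" when the best score is tied.
    """
    scores = {name: perplexity(text) for name, text in candidates.items()}
    best = min(scores.values())
    winners = [name for name, value in scores.items() if value == best]
    label = winners[0] if len(winners) == 1 else "equal"
    return label, candidates[winners[0]]


def selection_shares(labels):
    """Percentage of sentences for which each label was selected."""
    counts = Counter(labels)
    return {name: 100.0 * n / len(labels) for name, n in counts.items()}


# Offline demo with sentence length standing in for perplexity.
picked = select_best({"google": "aa", "bing": "bbb"}, len)
tied = select_best({"google": "aa", "bing": "bb"}, len)
shares = selection_shares(["google", "bing", "google", "equal"])
```

With a real model the scorer would be created once, for example `make_kenlm_scorer("lm.binary")` with a hypothetical file name, and reused for every sentence.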

Combining translated chunks

Syntactic analysis:

Berkeley Parser (Petrov et al., 2006)

Sentences are split into chunks at the top-level subtrees of the syntax tree

Selecting the best chunk:

A 5-gram language model trained with KenLM on the JRC-Acquis corpus: 1.4 million Latvian legal-domain sentences

Sentences were scored against the language model with the KenLM query program

Test data:

1581 randomly selected sentences from the JRC-Acquis corpus

The ACCURAT balanced evaluation corpus of 512 general-domain sentences was also tried, but the results were worse
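To make "top-level subtrees as chunks" concrete, here is a small sketch that reads a bracketed, Penn Treebank style parse and returns the words under each top-level subtree. The parse string is hand-written for illustration and is not actual parser output. The helper names are hypothetical.

```python
def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        pos += 1                                  # skip "("
        label = ""
        if tokens[pos] not in ("(", ")"):         # the root label may be empty
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])      # a leaf word
                pos += 1
        pos += 1                                  # skip ")"
        return (label, children)

    return read()


def leaves(node):
    """All words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    return [word for child in node[1] for word in leaves(child)]


def top_level_chunks(tree):
    # Descend through single-child wrapper nodes such as ROOT.
    while len(tree[1]) == 1 and not isinstance(tree[1][0], str):
        tree = tree[1][0]
    return [" ".join(leaves(child)) for child in tree[1]]


parse = ("(ROOT (S (ADVP (RB Recently)) (NP (EX there)) "
         "(VP (VBZ has) (VP (VBN been) (NP (DT an) (JJ increased) (NN interest)))) (. .)))")
chunks = top_level_chunks(parse_tree(parse))
```

Splitting only at the top level can produce very uneven chunks, for example a single adverb next to a whole verb phrase.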

Combining translated chunks

                                 BLEU               Share of selected translations
System                           MSMT     SyMHyT    Google    Bing     LetsMT
Google Translate                 18.09              100%      -        -
Bing Translator                  18.87              -         100%     -
LetsMT                           30.28              -         -        100%
Hybrid Google + Bing             18.73    21.27     74%       26%      -
Hybrid Google + LetsMT           24.50    26.24     25%       -        75%
Hybrid LetsMT + Bing             24.66    26.63     -         24%      76%
Hybrid Google + Bing + LetsMT    22.69    24.72     17%       18%      65%

For the single systems only one BLEU score is reported.

September 2015

Linguistically motivated machine translation combination

Smarter splitting of sentences into chunks:

The sentence tree is traversed bottom-up, from right to left

A word is added to the current chunk if:

The chunk does not yet contain too many words (sentence word count / 4)

The word is only one character long or contains no alphabetic characters

The current chunk starts with a genitive phrase ("of ")

Otherwise a new chunk is started

If this produces too many chunks, the process is repeated, allowing more than (sentence word count / 4) words per chunk

Changes in the MT APIs:

The Hugo.lv API is used for now in place of the API of the LetsMT Tilde office system

The Yandex API was added
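A rough sketch of these rules, simplified to work on a flat token list rather than on the syntax tree. It assumes that any one of the three conditions is enough to extend the current chunk. It also assumes that "too many chunks" means more than `max_chunks`, a hypothetical threshold. It is not the presented implementation and will not reproduce its chunks.

```python
def chunk_words(words, max_chunks=4):
    """Group tokens into chunks, scanning the sentence from right to left."""
    limit = max(1, len(words) // 4)              # sentence word count / 4
    while True:
        chunks, current = [], []
        for word in reversed(words):
            symbol = len(word) == 1 or not any(ch.isalpha() for ch in word)
            after_of = bool(current) and current[0].lower() == "of"
            if not current or len(current) < limit or symbol or after_of:
                current.insert(0, word)          # extend the current chunk
            else:
                chunks.insert(0, current)        # close it and start a new one
                current = [word]
        if current:
            chunks.insert(0, current)
        if len(chunks) <= max_chunks or limit >= len(words):
            return [" ".join(chunk) for chunk in chunks]
        limit += 1                               # too many chunks: relax the limit and retry


sentence = ("Recently there has been an increased interest in the automated "
            "discovery of equivalent expressions in different languages .")
chunks = chunk_words(sentence.split())
```

Whatever the limit, joining the chunks with spaces gives back the original token sequence, which is a useful sanity check for any chunker of this kind.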

Linguistically motivated machine translation combination

Selecting the best translation:

6-gram and 12-gram language models were trained with KenLM on:

The JRC-Acquis corpus v. 2.2: 1.4 million Latvian legal-domain sentences

The DGT-Translation Memory corpus (Steinberger, 2011): 3.1 million Latvian legal-domain sentences

Sentences were scored against the language model with the KenLM query program

Test data:

1581 randomly selected sentences from the JRC-Acquis corpus

The ACCURAT balanced evaluation corpus: 512 general-domain sentences

Linguistically motivated machine translation combination

Sentence chunks with SyMHyT:

Recently
there
has been an increased interest in the automated discovery of equivalent expressions in different languages
.

Sentence chunks with ChunkMT:

Recently there has been an increased interest
in the automated discovery of equivalent expressions
in different languages .

Linguistically motivated machine translation combination

System                              BLEU     Equal     Bing      Google    Hugo      Yandex
BLEU (single system)                -        -         17.43     17.73     17.14     16.04
MSMT - Google + Bing                17.70    7.25%     43.85%    48.90%    -         -
MSMT - Google + Bing + LetsMT       17.63    3.55%     33.71%    30.76%    31.98%    -
SyMHyT - Google + Bing              17.95    4.11%     19.46%    76.43%    -         -
SyMHyT - Google + Bing + LetsMT     17.30    3.88%     15.23%    19.48%    61.41%    -
ChunkMT - Google + Bing             18.29    22.75%    39.10%    38.15%    -         -
ChunkMT - all four                  19.21    7.36%     30.01%    19.47%    32.25%    10.91%

January 2016

Publications

Matīss Rikters. "Multi-system machine translation using online APIs for English-Latvian." ACL-IJCNLP 2015

Matīss Rikters and Inguna Skadiņa. "Syntax-based multi-system machine translation." LREC 2016

Work in progress

Matīss Rikters and Inguna Skadiņa. "Combining machine translated sentence chunks from multiple MT systems." Submitted to CICLING 2016

Matīss Rikters. "K-translate - interactive multi-system machine translation." Submitted to Baltic DB & IS 2016

Matīss Rikters and Pēteris Ņikiforovs. "iEMS - an interactive experiment management system for the Moses SMT toolkit." To be submitted to EAMT 2016

Matīss Rikters. "Recent research in Multi-System Machine Translation." To be submitted to the Baltic Journal of Modern Computing

Work in progress

K-translate - interactive multi-system machine translation

Roughly the same ChunkMT, wrapped in a visual interface

Draws the syntax tree with the chunks highlighted in colour

Shows which MT system each chunk was selected from

Shows the confidence score of each selection

Offers translation with online APIs or with translations entered by the user

Resources will be provided for translation between English, French, Latvian and German

Runs in a web browser

Code available:

http://ej.uz/MSMT

http://ej.uz/SyMHyT

http://ej.uz/chunker

Future plans

Further improvements to splitting sentences into chunks

Introduce dedicated handling of multi-word expressions in the hybrid MT solution and give them more attention

Other kinds of language models:

POS tag + lemma

Recurrent Neural Network Language Model (Mikolov et al., 2010)

Continuous Space Language Model (Schwenk et al., 2006)

Character-Aware Neural Language Model (Kim et al., 2015)

Selecting the best candidate with MT quality estimation:

QuEst++ (Specia et al., 2015)

SHEF-NN (Shah et al., 2015)

Further ideas

References

Ahsan, A., and P. Kolachina. "Coupling Statistical Machine Translation with Rule-based Transfer and Generation." AMTA - The Ninth Conference of the Association for Machine Translation in the Americas. Denver, Colorado (2010).

Barrault, Loïc. "MANY: Open source machine translation system combination." The Prague Bulletin of Mathematical Linguistics 93 (2010): 147-155.

Santanu, Pal, et al. "USAAR-DCU Hybrid Machine Translation System for ICON 2014." The Eleventh International Conference on Natural Language Processing, 2014.

Mellebeek, Bart, et al. "Multi-engine machine translation by recursive sentence decomposition." (2006).

Heafield, Kenneth. "KenLM: Faster and smaller language model queries." Proceedings of the Sixth Workshop on Statistical Machine Translation. Association for Computational Linguistics, 2011.

Steinberger, Ralf, et al. "The JRC-Acquis: A multilingual aligned parallel corpus with 20+ languages." arXiv preprint cs/0609058 (2006).

Petrov, Slav, et al. "Learning accurate, compact, and interpretable tree annotation." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.

Steinberger, Ralf, et al. "Dgt-tm: A freely available translation memory in 22 languages." arXiv preprint arXiv:1309.5226 (2013).

Skadiņš, Raivis, Kārlis Goba, and Valters Šics. "Improving SMT for Baltic Languages with Factored Models." Proceedings of the Fourth International Conference Baltic HLT 2010, Frontiers in Artificial Intelligence and Applications, Vol. 219 (2010): 125-132.

Mikolov, Tomas, et al. "Recurrent neural network based language model." INTERSPEECH. Vol. 2. 2010.

Schwenk, Holger, Daniel Déchelotte, and Jean-Luc Gauvain. "Continuous space language models for statistical machine translation." Proceedings of the COLING/ACL on Main conference poster sessions. Association for Computational Linguistics, 2006.

Kim, Yoon, et al. "Character-aware neural language models." arXiv preprint arXiv:1508.06615 (2015).

Specia, Lucia, G. Paetzold, and Carolina Scarton. "Multi-level Translation Quality Prediction with QuEst++." 53rd Annual Meeting of the Association for Computational Linguistics and Seventh International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing: System Demonstrations. 2015.

Shah, Kashif, et al. "SHEF-NN: Translation Quality Estimation with Neural Networks." Proceedings of the Tenth Workshop on Statistical Machine Translation. 2015.

Thank you!

Questions?

[Figure: workflow for combining full translations. Box labels: sentence tokenization; translation with online MT APIs (Google Translate, Bing Translator, LetsMT); selection of the best translation; output of the translation]

[Figure: workflow for combining translated chunks. Box labels: sentence tokenization; syntactic analysis; splitting sentences into chunks; translation with online MT APIs (Google Translate, Bing Translator, LetsMT); selection of the best chunks; recombination of sentences; output of the translations]

[Figure: chunking flowchart. Box labels: sentence syntax tree; tree data structure; chunk list; tree data structure with marked chunks; traverse the tree/subtree; chunk of the current tree/subtree; fvs < tvs / 4; fvs > 1; add to the chunk list; merge with the last chunk in the list; fvs = 1; genitive phrase; non-alphabetic. Legend: fvs is the number of words in the chunk, tvs is the number of words in the sentence]

[Figure: K-translate page structure. Labels: Start page; Translate with online systems; Input translations to combine; Input translated chunks; Settings; Translation results; Input source sentence]