Exploiting Multilingual Corpora for Machine Translation
Andreas Eisele, Saarland University & DFKI
Arona, September 2005 JRC Enlargement and Integration Workshop
Exploiting parallel corpora in up to 20 languages
Exploiting Multilingual Corpora 2 [email protected]
Overview
Multilingual/MT Projects & Tools at DFKI
MT-Related Activities at Saarland University
Work in the PTOLEMAIOS Project
Plans for Near-Term Future
Multilingual Projects at DFKI
Main LT Application Areas:
Multilingual Natural Communication
Multilingual Document Production
Crosslingual Information and Knowledge Management
Multilingual Natural Communication
NL Dialogue Systems (DISCO, COSMA, Interprice)
Speech Dialogue Processing (Verbmobil, Interprice)
Robust Speech Parsing (Verbmobil, Interprice)
Automatic Processing and Answering of Email (COSMA, ICC, XtraMind)
Natural Speech Synthesis (Mary, Interprice)
Sample Application Areas: e-commerce (product search, CRM)
Application Projects with Interprice, AOL Europe and spin-off company XtraMind Technologies
Multilingual Document Production
Terminology Checking (DiET, FLAG, WHITEBOARD, SKATE)
Grammar and Style Checking (LATESLAV, FLAG, SKATE)
Controlled Language Checking (FLAG, WHITEBOARD, SKATE)
Automatic XML Tagging (WHITEBOARD)
Consistency Control (BiLD, WHITEBOARD)
Sample Application Areas: multilingual document production, web-content production
Application Project with SAP
Spin-Off company
Crosslingual Information and Knowledge Management
Crosslingual Content Management (TWENTYONE, MUCHMORE)
Crosslingual Information Retrieval (TWENTYONE, MULINEX, MIETTA, MUCHMORE)
Crosslingual Multimedia Retrieval (POP-EYE, OLIVE, MUMIS, DIRECT INFO)
Crosslingual Information Extraction (PARADIME, WHITEBOARD, DIRECT INFO)
Crosslingual Text Mining, Terminology Extraction (GETESS, AIRFORCE, WIPO)
Multilingual Summarization (MULINEX, MUCHMORE, MUSI)
Multilingual Language Generation (TG/2, TEMSIS, MIETTA)
Sample Application Areas: multilingual and crosslingual search, tourism information on the web, up-to-date air quality reporting, information management for mega-events (world championships, Olympic Games), phonetic trademark search, term extraction from patent translations
Application Projects with German Telekom, ESG, Dresdner Bank, law firm Boehmert&Boehmert, feasibility study on terminology extraction with WIPO (via acrolinx), …
Multilingual Resources at DFKI
POS-tagger TnT (T.Brants) and Chunkie can be trained for arbitrary languages
Middleware HoG for multilingual robust shallow and HPSG-based deep analysis (mapping into RMRSs)
Morphologies from MMorph project exist for German, English, French, Spanish, Italian
Morphologies are encoded as FS transducers, usable for error-tolerant analysis and generation
Adding more languages is very easy (as done for Arabic with A.Soudi)
Uniform handling of all EU languages would be extremely convenient, but linguistic resources are currently lacking
Multilingual Projects at DFKI
Main LT Application Areas:
Multilingual Natural Communication
Multilingual Document Production
Crosslingual Information and Knowledge Management
Topic emerging since 2005: Machine Translation
Machine Translation at DFKI
Topics in Compass (Digital Olympics 2006): Multi-Engine Machine Translation, Speech Technologies, Multilingual Content Management, Cross-lingual Information Retrieval and Multilingual Question Answering
Open LOGOS
LOGOS MT® = one of the largest and most powerful among the commercial MT engines
DFKI turned LOGOS MT into an open source product (in cooperation with GlobalWare AG)
Plans for integrated, hybrid MT from rule-based and stochastic engines (code name: EuroMatrix)
MT Activities at Saarland University
Guiding principle: Start with method that works today, improve it by adding linguistic functionality as appropriate
Starting point: Phrase-based SMT (Köhn, Och, Marcu, HLT-NAACL 2003)
Conceptually, phrase-based SMT is an intermediate step between translation memory (TM) and MT: it combines TM’s ability to learn from examples with the compositionality of MT
Among best approaches in ongoing DARPA evaluation campaign
Easy to deploy (thanks to tools by F.J. Och and P. Köhn)
Conceptually very simple, hence a good candidate to enrich models with linguistic sophistication
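The way phrase-based SMT learns from examples can be illustrated with a minimal sketch of phrase-pair extraction from a word-aligned sentence pair. This is a simplified version of the usual consistency criterion, not the code of the tools named above: the function name `extract_phrases`, the `max_len` parameter and the toy sentence pair are assumptions made for illustration, and the extension of phrases over unaligned words is left out.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return phrase pairs that are consistent with a word alignment.

    src, tgt  : lists of tokens
    alignment : set of (src_index, tgt_index) links
    max_len   : hypothetical cap on phrase length (in tokens)
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency check: no word inside the target span may be
            # linked to a source word outside the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2)
                   for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Toy example (made up for illustration).
src = "das Haus ist klein".split()
tgt = "the house is small".split()
alignment = {(0, 0), (1, 1), (2, 2), (3, 3)}
phrases = extract_phrases(src, tgt, alignment)

# A sentence pair with crossing links.
crossing = extract_phrases(["a", "b"], ["x", "y"], {(0, 1), (1, 0)})
```

The extracted pairs range from single words up to the whole sentence, which is what places the approach between a translation memory (long stored segments) and compositional MT (small reusable units).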
MT Activities at Saarland University
April ’05: participation in ACL shared task on statistical machine translation with a multi-engine approach, {Finnish, French, German, Spanish} → English
May ’05: participation in DARPA MT evaluation with baseline phrase-based SMT system (Chinese → English)
Project seminar on empirical MT: students learned to turn parallel corpora into SMT systems (based on EuroParl corpus, but also Welsh ↔ English and Arabic ↔ English)
Diploma thesis on corpus-based MT via RMRS alignment
Experience: Using parallel corpora for MT quickly yields very promising results!
We should have more language pairs and more data…
Crawling of UN document repository, collection of 6-way parallel {Arabic, Chinese, English, French, Russian, Spanish} corpus (+ some German)
The PTOLEMAIOS project
Assumptions:
Advanced language technology for truly multilingual applications is a key challenge for computational linguistics
Treebanking and supervised learning have been successful for English (and some other languages), but may not be feasible for “smaller” languages
Parallel corpora can be used to transfer knowledge about linguistic relations across languages or to induce linguistic knowledge from data
Word alignments derived from simple models (GIZA++) can help to support this process
“Parallel-Text-based Optimization for Language learning - Exploiting Multilingual Alignment for the Induction Of Syntactic grammars”
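To make the notion of word alignments from simple models more concrete, here is a minimal sketch of IBM Model 1 trained with EM, the simplest of the model family that GIZA++ implements. It does not call GIZA++; the function names `train_ibm1` and `align`, the iteration count and the three-sentence toy corpus are assumptions for illustration, and the NULL word is omitted.

```python
from collections import defaultdict


def train_ibm1(bitext, iterations=10):
    """Estimate lexical translation probabilities t(f|e) with EM."""
    src_vocab = {f for fs, _ in bitext for f in fs}
    # Uniform initialisation over the source vocabulary.
    t = defaultdict(lambda: 1.0 / len(src_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # expected counts for (f, e)
        total = defaultdict(float)   # expected counts for e
        for fs, es in bitext:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalise expected counts per target word e.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t


def align(fs, es, t):
    """Link each source word to its most probable target word."""
    return [(i, max(range(len(es)), key=lambda j: t[(f, es[j])]))
            for i, f in enumerate(fs)]


# Toy corpus (made up for illustration).
bitext = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]
t = train_ibm1(bitext)
links = align("das Haus".split(), "the house".split(), t)
```

Even on three sentences, co-occurrence statistics are enough to separate "das"/"the" from "Haus"/"house"; links of this kind are the raw material for transferring or inducing linguistic knowledge across languages.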
PTOLEMAIOS
Funding: Emmy-Noether fellowship from DFG, P.I. Jonas Kuhn
Expected Duration: April 2005 – March 2009
Original Goal:
Induce grammars from parallel corpora (and evaluate them in isolation)
Revised Goal (since August ’05):
Evaluate grammars wrt. impact on MT performance
First Steps:
Use GIZA++-derived word alignment as filter to speed up parsing, several papers on suitable parsing algorithms
Use of LinearB’s SMT decoder on phrase-aligned EuroParl corpus
Planned Steps:
Explore the usefulness of syntactic analyses for phrase-based SMT
word-based and syntax-based partial analyses are offered to decoder
decoder can exploit syntax if useful, fall back to plain PBSMT if not
optimal weight of syntactic dependencies can be determined empirically
Work on more languages (UN corpus in 6 languages, AC corpus)
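The planned step of determining the weight of syntactic information empirically can be sketched as a grid search over a single weight in a linear score combination. This is a simplified illustration, not the project's method: the feature names `pbsmt` and `syntax`, the per-hypothesis `quality` value, the candidate weights and the toy n-best lists are all hypothetical, and real systems typically tune against a corpus-level metric.

```python
def rerank(nbest, weight):
    """Pick the hypothesis maximising pbsmt score + weight * syntax score."""
    return max(nbest, key=lambda h: h["pbsmt"] + weight * h["syntax"])


def tune_weight(dev_set, candidates):
    """Grid search: return the weight with the highest total dev-set quality."""
    def total_quality(w):
        return sum(rerank(nbest, w)["quality"] for nbest in dev_set)
    return max(candidates, key=total_quality)


# Made-up n-best lists for two development sentences.
dev_set = [
    [{"pbsmt": -2.0, "syntax": -1.0, "quality": 0.3},
     {"pbsmt": -2.5, "syntax": 0.5, "quality": 0.6}],
    [{"pbsmt": -1.0, "syntax": -0.2, "quality": 0.7},
     {"pbsmt": -1.8, "syntax": 0.0, "quality": 0.4}],
]
candidates = [0.0, 0.25, 0.5, 1.0, 2.0]
best = tune_weight(dev_set, candidates)
```

With weight 0.0 the syntax score is ignored and the choice falls back to the plain phrase-based ranking; a positive weight is kept only if it improves quality on held-out data.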
EuroMatrix: current situation
(joint work with Philipp Köhn and Chris Callison-Burch, Edinburgh)
MT systems per language pair (data taken from J.Hutchins’ Compendium of Translation Software, 10th Edition)
EuroMatrix: current situation
Most language pairs remain uncovered
EuroMatrix: SMT for many languages
EuroParl Corpus has been constructed to build statistical MT systems
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT summit X, September 2005
EuroMatrix: SMT for many languages
Multilingual corpora can be aligned across all languages…
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT summit X, September 2005
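One simple way to align a multilingual corpus across all languages is to join bitexts that share a pivot language. The sketch below assumes every language is sentence-aligned with the same pivot (English here) and that identical pivot sentences identify the same segment; the function name `pivot_align`, the data layout and the toy sentences are assumptions for illustration, and a real corpus would more likely be joined on sentence IDs than on sentence text.

```python
from itertools import combinations


def pivot_align(bitexts):
    """Join bitexts that share a pivot language into direct language pairs.

    bitexts maps a language code to a dict {pivot_sentence: translation}.
    Returns a dict {(lang_a, lang_b): [(sentence_a, sentence_b), ...]}.
    """
    result = {}
    for a, b in combinations(sorted(bitexts), 2):
        # Only segments translated into both languages can be paired.
        shared = bitexts[a].keys() & bitexts[b].keys()
        result[(a, b)] = [(bitexts[a][s], bitexts[b][s])
                          for s in sorted(shared)]
    return result


# Toy data (made up for illustration), keyed by the English pivot sentence.
bitexts = {
    "de": {"Thank you.": "Danke.",
           "The session is open.": "Die Sitzung ist eröffnet."},
    "fr": {"Thank you.": "Merci.",
           "The session is open.": "La séance est ouverte."},
    "es": {"Thank you.": "Gracias."},
}
pairs = pivot_align(bitexts)
```

For n languages this yields n(n-1)/2 unordered pairs, each usable in both translation directions, so a single multilingual corpus can feed SMT systems for many language pairs at once.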
EuroMatrix: SMT for many languages
SMT systems derived from the corpora vary in quality
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT summit X, September 2005
EuroMatrix: SMT for many languages
Difficulty of translation into and from a given language may differ widely…
Source: “Europarl: A Parallel Corpus for Statistical Machine Translation”, Philipp Köhn, MT summit X, September 2005
EuroMatrix
Ideas:
For language pairs where rule-based MT and SMT based on parallel corpora exist, they should be integrated to exploit complementary strengths of both approaches
Parallel corpora can then be used in two ways:
feeding the SMT sub-system
fine-tuning the integrated setup
For language pairs where only monolingual resources (lexicons, morphologies, taggers,…) and parallel corpora exist, transfer rules operating on linguistic representations should be derived from data
We need a generic framework that allows us to plug and play with different approaches (an open source MT toolbox)
Development of MT systems needs an open evaluation campaign, in the style of DARPA MTeval / ACL shared task
Conclusion
Machine translation performance can be enabled/boosted by parallel corpora
Current work just scratches the surface of what can be done
SMT systems for the languages of new member states should soon emerge from AC corpus
More parallel data for these languages would be desirable (100 million words would be much better than 10 million!)
It would be very helpful to cooperate with teams from “new” countries for morphologies, taggers, parsers,…