Dutch Parallel Corpus A Multilingual Annotated Corpus
Lieve Macken
Language & Translation Technology Team
University College Ghent
Dutch Parallel Corpus
• Annotated, sentence-aligned corpus
• 10 million words
• Dutch - English / Dutch - French
• Linguistic annotations
  – PoS & lemma
  – Shallow syntactic analysis
• Quality control
• May 2006 - September 2009
Users and applications
• Fundamental research
  – Translation studies / contrastive linguistics
  – Corpus linguistics
• Support applications
  – Translation support (CAT)
  – Didactic support (CALL)
• HLT applications
  – Machine Translation / Terminology Extraction
  – Training and test data
Fundamental Research
• High-quality data
• Balanced by translation direction
Contrastive Linguistics
Translation Studies
Translation product Translation process
Language systems Translation strategies
Parallel & comparable corpus
Dutch texts
Dutch translations
English & French translations
English & French texts
Language Learning - CorpusCall
• Computer Assisted Language Learning
  – Reference samples
  – Learning activity
• Key Words in Context
  – Authentic language usage
• Example: Nederlex
  – Electronic reading platform for French students learning Dutch
  – Development of the reading platform: FUNDP, Namur
  – Compilation of the parallel corpus: REBECA project (K.U.Leuven Campus Kortrijk)
Nederlex
Full text corpora as Translator’s aid
• Computer Assisted Translation
  – To identify more appropriate target-language (TL) equivalents and idiomatic expressions
  – Extension to bilingual dictionaries
  – Words in context
• Example: TransSearch (Canadian Hansards)
  – Simard & Macklovitch 2005
Machine Translation
• Data-driven development of MT systems
  – Example-Based MT & Statistical MT
• P. Koehn 2005: 110 SMT systems trained on the Europarl corpus
  – Example output Finnish-English:
we know very well that the current treaties are not enough and that in future , it is necessary to develop a better structure for the union and , therefore perustuslaillisempi structure , which also expressed more clearly what the member states and the union is concerned .
Large corpora are useful …
• Number-crunching applications
• Statistical analysis
• Automatic analysis
• No human intervention
… but less adequate for:
• Applications involving quality at all levels
• Applications involving human analysis
• Educational applications
DPC requirements
1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability
Design: translation directions
• Language pairs & translation directions
• Balanced with respect to language pair and translation direction
  – Min. 2 million words per translation direction
[Figure: translation directions between EN, NL and FR]
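The balance requirement above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical inventory of samples given as (source language, target language, word count); the directions and figures shown are invented for illustration and are not DPC statistics.

```python
from collections import Counter

MIN_WORDS = 2_000_000  # minimum per translation direction, as stated on the slide


def words_per_direction(samples):
    """Sum word counts per (source, target) translation direction."""
    totals = Counter()
    for source, target, n_words in samples:
        totals[(source, target)] += n_words
    return totals


def under_target(totals, minimum=MIN_WORDS):
    """Return the directions whose total is still below the minimum."""
    return sorted(d for d, n in totals.items() if n < minimum)


# Hypothetical inventory (invented numbers): (source, target, words)
samples = [
    ("NL", "EN", 1_500_000),
    ("NL", "EN", 700_000),
    ("EN", "NL", 1_200_000),
    ("NL", "FR", 2_100_000),
    ("FR", "NL", 900_000),
]

totals = words_per_direction(samples)
print(under_target(totals))
```

A report like this would show, during collection, which translation directions still need more text.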
Design: text types
• Commercial publishers
  – Fictional & non-fictional literature, e.g. novels, essays
  – Journalistic texts, e.g. news articles
• Institutions
  – Instructive texts, e.g. user manuals
  – Administrative texts, e.g. meeting minutes
  – External communication, e.g. promotion material, newsletters
Text providers
• Quality
  – Published material
  – Professional translation division
• Copyright clearance
  – License agreements
  – Collaboration with Dutch Agency of HLT
50 Text providers
Text type: Providers
• Administrative texts: European Parliament, Europarl, Melexis, Flemish government, Speeches Kok, Balkenende, FOD Sociale Zekerheid, …
• External communication: BMM, Bosch, Barco, NMBS Holding, Arcelor Mittal, Fédération du tourisme de la province de Namur, Westtoer, …
• Literature: Ons Erfdeel, Lannoo, Vlaams Fonds der Letteren, Nijgh & Van Ditmar, Le Dilettante, …
• Journalistic texts: Roularta, The Independent, The Guardian / The Observer, De Standaard, De Morgen, Campuskrant, ING, Fortis, …
• Instructive texts: IBM, Bosch, DNS, Eli Lilly, …
Linguistic Annotation
• Structure
  – Paragraphs, sentences, words
• Alignment
  – Sentence alignment
    • Vanilla Aligner
    • Microsoft Bilingual Aligner
    • Melamed’s GMA Aligner
  – (Sub-sentential alignment)
Linguistic Annotation
• Structure
  – Paragraphs, sentences, words
• Alignment
  – Sentence alignment
  – (Sub-sentential alignment)
• Linguistic annotation
  – Lemma
  – PoS
Corpus Representation
• Text mark-up
  – TEI
• Encoding
  – UTF-8
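To make the representation concrete, here is a minimal sketch of one annotated sentence serialised as TEI-style XML in UTF-8. The element names `s` and `w` come from TEI; the attribute names (`n`, `lemma`, `pos`), the tag values and the example sentence are assumptions for illustration and not the DPC's actual schema or tagset.

```python
import xml.etree.ElementTree as ET


def sentence_element(sent_id, tokens):
    """Build an <s> element holding one <w> per token.

    tokens is a list of (word form, lemma, PoS tag) triples.
    Attribute names are illustrative, not the DPC schema.
    """
    s = ET.Element("s", {"n": sent_id})
    for form, lemma, pos in tokens:
        w = ET.SubElement(s, "w", {"lemma": lemma, "pos": pos})
        w.text = form
    return s


# Invented Dutch example with a non-ASCII character
tokens = [
    ("De", "de", "DET"),
    ("financiële", "financieel", "ADJ"),
    ("markten", "markt", "N"),
]

s = sentence_element("1", tokens)
xml_bytes = ET.tostring(s, encoding="utf-8")  # serialised as UTF-8 bytes
print(xml_bytes.decode("utf-8"))
```

Keeping lemma and PoS as attributes on each word leaves the running text recoverable by simply concatenating the element contents.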
Quality control
• Manually checked
  – 10% of whole corpus
• Spot checking
  – Based on error analysis of manually verified data
• Automatic control procedures
  – e.g. automatic comparison of output from different alignment programs
Alignment merge
[Figure: sentences 1-5 of the text in language 1 and the text in language 2, aligned by two aligners (AL1, AL2), followed by a manual check]
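The slides do not spell out how the merge works. One plausible reading, sketched below under that assumption, is to accept the links on which two aligners agree and route the rest to manual checking. The link representation (tuples of source and target sentence numbers) and the sample data are invented for illustration.

```python
def merge_alignments(al1, al2):
    """Compare two aligner outputs.

    Each alignment is a list of links; a link is a pair
    (source sentence numbers, target sentence numbers).
    Returns (links both aligners agree on, links needing manual check).
    """
    links1, links2 = set(al1), set(al2)
    agreed = sorted(links1 & links2)    # proposed by both aligners
    to_check = sorted(links1 ^ links2)  # proposed by only one aligner
    return agreed, to_check


# Invented outputs of two aligners for a five-sentence text pair
al1 = [((1,), (1,)), ((2,), (2, 3)), ((3,), (4,)), ((4, 5), (5,))]
al2 = [((1,), (1,)), ((2,), (2,)), ((3,), (3, 4)), ((4, 5), (5,))]

agreed, to_check = merge_alignments(al1, al2)
print("agreed:", agreed)
print("manual check:", to_check)
```

With this scheme, human effort is concentrated on the sentences where the aligners disagree.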
External validation
• Formal validation by CST (Centre for Language Technology, Copenhagen)
• Suitability test by Xplanation
Corpus exploitation
• Web search interface
  – Parallel KWIC concordance
  – Simple queries
  – Extended queries
    • Pattern matching & annotation labels
• Full text resource
  – Data-driven automatic learning (e.g. SMT)
  – Two monolingual XML files + alignment file
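A parallel KWIC concordance shows each hit of a keyword in its source-side context next to the aligned translation. The sketch below illustrates the idea on in-memory sentence pairs; it is not the DPC web interface, and the function name, context width and example sentences are assumptions made for illustration.

```python
import re


def parallel_kwic(pairs, keyword, width=30):
    """Return (left context, keyword, right context, aligned target) per hit.

    pairs is a list of (source sentence, target sentence) tuples.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(keyword), re.IGNORECASE)
    hits = []
    for source, target in pairs:
        for m in pattern.finditer(source):
            left = source[max(0, m.start() - width):m.start()]
            right = source[m.end():m.end() + width]
            hits.append((left, m.group(0), right, target))
    return hits


# Invented sentence pairs (Dutch source, English translation)
pairs = [
    ("Het corpus bevat tien miljoen woorden.",
     "The corpus contains ten million words."),
    ("De teksten zijn op zinsniveau gealigneerd.",
     "The texts are aligned at sentence level."),
    ("Elk corpus heeft metadata.",
     "Each corpus has metadata."),
]

hits = parallel_kwic(pairs, "corpus")
for left, kw, right, target in hits:
    print(f"{left:>30} [{kw}] {right:<30} | {target}")
```

An extended query over annotation labels would match on lemma or PoS attributes instead of the surface string.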
Metadata
• Additional filter to retrieve samples
  – Text-related data
    • Language, text type, domain and keywords
  – Translation-related data
    • Source language, target language
  – Annotation-related data
    • Quality label
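Using metadata as a filter amounts to selecting the samples whose fields match the requested values. The following is a minimal sketch; the `Sample` record, its field names and the values are hypothetical and only mirror the metadata categories listed above.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    # Hypothetical record; fields mirror the metadata categories on the slide
    sample_id: str
    text_type: str
    source_lang: str
    target_lang: str
    quality: str


def filter_samples(samples, **criteria):
    """Keep the samples whose fields equal every given criterion."""
    return [
        s for s in samples
        if all(getattr(s, field) == value for field, value in criteria.items())
    ]


# Invented sample records
samples = [
    Sample("s1", "journalistic", "NL", "EN", "manually verified"),
    Sample("s2", "instructive", "EN", "NL", "spot checked"),
    Sample("s3", "journalistic", "NL", "FR", "spot checked"),
]

selected = filter_samples(samples, text_type="journalistic", source_lang="NL")
print([s.sample_id for s in selected])
```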
Availability
• Via Dutch Agency for Human Language Technologies (TST-centrale)
DPC objectives
• Quality control
• Level of annotation
  – Sentence alignment
  – PoS, lemma
• Balanced composition
  – Translation direction
  – Text types
• Availability
  – Via Dutch Agency for Human Language Technologies (TST-centrale)
DPC Team
• K.U. Leuven campus Kortrijk
  – Prof. Dr. Piet Desmet
  – Dr. Hans Paulussen
  – Lic. Maribel Montero Perez
• University College Ghent - School of Translation Studies
  – Prof. Dr. Willy Vandeweghe
  – Dra. Lieve Macken
  – Orphée Declercq
Questions?
www.kuleuven-kortrijk.be/dpc