Dutch Parallel Corpus A Multilingual Annotated Corpus Lieve Macken Language & Translation Technology Team University College Ghent

Upload: clyde-norris

Post on 27-Dec-2015


TRANSCRIPT

Page 1: Dutch Parallel Corpus A Multilingual Annotated Corpus Lieve Macken Language & Translation Technology Team University College Ghent

Dutch Parallel Corpus A Multilingual Annotated Corpus

Lieve Macken
Language & Translation Technology Team

University College Ghent

Page 2

Dutch Parallel Corpus

• Annotated, sentence-aligned corpus
• 10 million words
• Dutch–English / Dutch–French
• Linguistic annotations
  – PoS & lemma
  – Shallow syntactic analysis
• Quality control
• May 2006 – September 2009

Page 3

Users and applications

• Fundamental research
  – Translation studies / contrastive linguistics
  – Corpus linguistics
• Support applications
  – Translation support (CAT)
  – Didactic support (CALL)
• HLT applications
  – Machine translation / terminology extraction
  – Training and test data

Page 4

Fundamental Research

• High-quality data
• Balanced by translation direction

[Diagram labels: Contrastive Linguistics; Translation Studies; Translation product; Translation process; Language systems; Translation strategies]

Page 5

Parallel & comparable corpus

Dutch texts

Dutch translations

English & French translations

English & French texts

Page 6

Language Learning - CorpusCall

• Computer Assisted Language Learning
  – Reference samples
  – Learning activity
• Key Words in Context
  – Authentic language usage
• Example: Nederlex
  – Electronic reading platform for French students learning Dutch
  – Development of reading platform: FUNDP, Namur
  – Compilation of parallel corpus: REBECA project (K.U.Leuven Campus Kortrijk)

Page 7

Nederlex

Page 8

Full text corpora as Translator’s aid

• Computer Assisted Translation
  – To identify more appropriate TL equivalents and idiomatic expressions
  – Extension to bilingual dictionaries
  – Words in context
• Example: TransSearch (Canadian Hansards)
  – Simard & Macklovitch 2005

Page 9

Machine Translation

• Data-driven development of MT systems
  – Example-Based MT & Statistical MT
• P. Koehn 2005: 110 SMT systems trained on the Europarl corpus
  – Example output Finnish-English:

we know very well that the current treaties are not enough and that in future , it is necessary to develop a better structure for the union and , therefore perustuslaillisempi structure , which also expressed more clearly what the member states and the union is concerned .

Page 10

Large corpora are useful …

• Number crunching applications
• Statistical analysis
• Automatic analysis
• No human intervention

Page 11

… but less adequate for:

• Applications involving quality at all levels
• Applications involving human analysis
• Educational applications

Page 12

DPC requirements

1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability

Page 13

DPC requirements

1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability

Page 14

Design: translation directions

• Language pairs & translation directions
• Balanced with respect to language pair and translation direction
  – Min. 2 million words per translation direction

[Diagram rows: EN NL; EN NL FR; NL FR (arrows between the languages not preserved in the transcript)]
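The balance criterion above (at least 2 million words per translation direction) lends itself to a mechanical check. The following is a minimal sketch, not the DPC tooling: the record layout (`source`, `target`, `words`) and the word counts in the example are invented for illustration.

```python
from collections import Counter

MIN_WORDS_PER_DIRECTION = 2_000_000  # threshold stated on the slide


def words_per_direction(documents):
    """Sum word counts per (source, target) translation direction."""
    totals = Counter()
    for doc in documents:
        totals[(doc["source"], doc["target"])] += doc["words"]
    return totals


def underfilled_directions(totals, minimum=MIN_WORDS_PER_DIRECTION):
    """Return the directions whose total is still below the minimum."""
    return {direction: n for direction, n in totals.items() if n < minimum}


# Hypothetical document records; the figures are made up.
documents = [
    {"source": "NL", "target": "EN", "words": 1_200_000},
    {"source": "NL", "target": "EN", "words": 900_000},
    {"source": "EN", "target": "NL", "words": 1_500_000},
]

totals = words_per_direction(documents)
missing = underfilled_directions(totals)
```

In this made-up example only the EN to NL direction falls short of the threshold, so it is the only entry in `missing`.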

Page 15

Design: text types

• Commercial publishers
  – Fictional & non-fictional literature, e.g. novels, essays
  – Journalistic texts, e.g. news articles
• Institutions
  – Instructive texts, e.g. user manuals
  – Administrative texts, e.g. meeting minutes
  – External communication, e.g. promotion material, newsletters

Page 16

Text providers

• Quality
  – Published material
  – Professional translation division
• Copyright clearance
  – License agreements
  – Collaboration with Dutch Agency of HLT

Page 17

50 Text providers

Text type | Provider
Administrative texts | European Parliament, Europarl, Melexis, Flemish government, speeches (Kok, Balkenende), FOD Sociale Zekerheid, …
External communication | BMM, Bosch, Barco, NMBS Holding, Arcelor Mittal, Fédération du tourisme de la province de Namur, Westtoer, …
Literature | Ons Erfdeel, Lannoo, Vlaams Fonds der Letteren, Nijgh & Van Ditmar, Le Dilettante, …
Journalistic texts | Roularta, The Independent, The Guardian / The Observer, De Standaard, De Morgen, Campuskrant, ING, Fortis, …
Instructive texts | IBM, Bosch, DNS, Eli Lilly, …

Page 18

DPC requirements

1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability

Page 19

Linguistic Annotation

• Structure
  – Paragraphs, sentences, words

Page 20

Linguistic Annotation

• Structure
  – Paragraphs, sentences, words
• Alignment
  – Sentence alignment
    • Vanilla Aligner
    • Microsoft Bilingual Aligner
    • Melamed’s GMA Aligner
  – (Sub-sentential alignment)

Page 21

Linguistic Annotation

• Structure
  – Paragraphs, sentences, words
• Alignment
  – Sentence alignment
  – (Sub-sentential alignment)
• Linguistic annotation
  – Lemma
  – PoS

Page 22

Corpus Representation

• Text mark-up
  – TEI
• Encoding
  – UTF-8

Page 23

DPC requirements

1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability

Page 24

Quality control

• Manually checked
  – 10% of whole corpus
• Spot checking
  – Based on error analysis of manually verified data
• Automatic control procedures
  – e.g. automatic comparison of output from different alignment programs
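Selecting the 10% portion for manual checking could be scripted so that the selection is reproducible. The sketch below is only one possible reading: it assumes the portion is drawn at random per document, which the slides do not state (the 10% may well be measured in words or chosen by text type). The document identifiers are hypothetical.

```python
import random


def select_for_manual_check(doc_ids, fraction=0.10, seed=0):
    """Draw a reproducible random sample of documents for manual verification."""
    ids = sorted(doc_ids)  # fixed order, so the seed alone determines the sample
    if not ids:
        return []
    size = max(1, round(len(ids) * fraction))
    return sorted(random.Random(seed).sample(ids, size))


# Hypothetical identifiers for 100 documents.
doc_ids = [f"doc-{i:03d}" for i in range(100)]
to_check = select_for_manual_check(doc_ids)
```

Fixing the seed means the same documents are selected on every run, which makes it easier to track which part of the corpus has already been verified by hand.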

Page 25

Alignment merge

[Diagram: Text language 1 and Text language 2, sentences 1–5, aligned by AL1 and AL2; labels: manual check, alignment merge]
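One plausible reading of the alignment merge is that links on which two aligners agree are accepted automatically, while the remaining links go to a manual check. The sketch below illustrates that idea only; it is an assumption, not the DPC procedure. The link format, a pair of tuples of sentence numbers, is hypothetical, as are the two example outputs.

```python
def merge_alignments(al1, al2):
    """Split two aligners' outputs into agreed links and links needing review.

    Each alignment is an iterable of (source_ids, target_ids) pairs, where
    both elements are tuples of sentence numbers (a hypothetical format).
    """
    links1, links2 = set(al1), set(al2)
    agreed = sorted(links1 & links2)    # proposed by both aligners
    disputed = sorted(links1 ^ links2)  # proposed by only one of them
    return agreed, disputed


# Made-up outputs of two aligners (AL1, AL2) for a five-sentence text.
al1 = [((1,), (1,)), ((2,), (2, 3)), ((3,), (4,)), ((4, 5), (5,))]
al2 = [((1,), (1,)), ((2,), (2,)), ((3,), (3, 4)), ((4, 5), (5,))]

agreed, disputed = merge_alignments(al1, al2)
```

Here the 1:1 link for sentence 1 and the 2:1 link for sentences 4 and 5 are agreed; the four conflicting links around sentences 2 and 3 would be queued for manual checking.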

Page 26

Quality control

• Manually checked
  – 10% of whole corpus
• Spot checking
  – Based on error analysis of manually verified data
• Automatic control procedures
  – e.g. automatic comparison of output from different alignment programs

Page 27

External validation

• Formal validation by CST (Centre for Language Technology, Copenhagen)
• Suitability test by Xplanation

Page 28

DPC requirements

1) Corpus design
2) Linguistic annotation
3) Quality control
4) Corpus exploitation & availability

Page 29

Corpus exploitation

• Web search interface
  – Parallel KWIC concordance
  – Simple queries
  – Extended queries
    • Pattern matching & annotation labels
• Full text resource
  – Data-driven automatic learning (e.g. SMT)
  – Two monolingual XML files + alignment file
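To make the parallel KWIC concordance concrete, here is a minimal sketch that searches the source side of aligned sentence pairs and shows each hit in context next to its aligned translation. It does not reproduce the DPC web interface or file formats: the in-memory list of pairs, the function name, and the example sentences are all invented for illustration.

```python
import re


def parallel_kwic(pairs, keyword, width=30):
    """Return (left, match, right, translation) rows for each hit of `keyword`
    in the source side of aligned sentence pairs."""
    pattern = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    rows = []
    for source, target in pairs:
        for m in pattern.finditer(source):
            left = source[max(0, m.start() - width):m.start()]
            right = source[m.end():m.end() + width]
            rows.append((left, m.group(0), right, target))
    return rows


# Made-up Dutch-English sentence pairs.
pairs = [
    ("Het corpus bevat tien miljoen woorden.", "The corpus contains ten million words."),
    ("Elk corpus heeft metadata.", "Every corpus has metadata."),
    ("De tekst is vertaald.", "The text has been translated."),
]

for left, hit, right, target in parallel_kwic(pairs, "corpus"):
    # Right-align the left context so the keyword lines up in one column.
    print(f"{left:>30} [{hit}] {right:<30} | {target}")
```

Working on sentence pairs keeps the sketch independent of how the two monolingual files and the alignment file are actually encoded.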

Page 30

Metadata

• Additional filter to retrieve samples
  – Text-related data
    • Language, text type, domain and keywords
  – Translation-related data
    • Source language, target language
  – Annotation-related data
    • Quality label
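Using metadata as an additional filter can be pictured as matching each sample's metadata record against a set of required values. The sketch below assumes one flat dictionary per sample; the field names (`text_type`, `source_lang`, `target_lang`, `quality`) and the sample values are hypothetical and do not reflect the actual DPC metadata scheme.

```python
def filter_samples(samples, **criteria):
    """Keep the samples whose metadata match every given key/value criterion."""
    return [
        sample for sample in samples
        if all(sample.get(key) == value for key, value in criteria.items())
    ]


# Hypothetical metadata records.
samples = [
    {"id": "s1", "text_type": "journalistic", "source_lang": "NL",
     "target_lang": "EN", "quality": "manually verified"},
    {"id": "s2", "text_type": "instructive", "source_lang": "EN",
     "target_lang": "NL", "quality": "automatic"},
    {"id": "s3", "text_type": "journalistic", "source_lang": "NL",
     "target_lang": "FR", "quality": "automatic"},
]

journalistic_from_dutch = filter_samples(
    samples, text_type="journalistic", source_lang="NL"
)
```

Combining text-related, translation-related and annotation-related fields in one call mirrors the three groups of metadata listed above.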

Page 31

Availability

• Via Dutch Agency for Human Language Technologies (TST-centrale)

Page 32

DPC objectives

• Quality control
• Level of annotation
  – Sentence alignment
  – PoS, lemma
• Balanced composition
  – Translation direction
  – Text types
• Availability
  – Via Dutch Agency for Human Language Technologies (TST-centrale)

Page 33

DPC Team

• K.U.Leuven Campus Kortrijk
  Prof. Dr. Piet Desmet
  Dr. Hans Paulussen
  Lic. Maribel Montero Perez
• University College Ghent - School of Translation Studies
  Prof. Dr. Willy Vandeweghe
  Dra. Lieve Macken
  Orphée Declercq

Page 34

Questions?

www.kuleuven-kortrijk.be/dpc