Creating parallel and comparable corpora for work in domain specific areas of language Belinda Maia FLUP

Post on 19-Dec-2015


Page 1

Creating parallel and comparable corpora for work in domain specific areas of language

Belinda Maia

FLUP

Page 2

Parallel corpora - definition

• “A parallel corpus is a collection of texts, each of which is translated into one or more other languages than the original. The simplest case is where two languages only are involved: one of the corpora is an exact translation of the other. ....... The direction of the translation may not even be known”.

Page 3

Parallel corpora - uses

• “Parallel corpora are objects of interest at present because of the opportunity offered to align original and translation and gain insights into the nature of translation. From this work it is hoped that tools to aid translation will be devised. Probabilistic machine translation systems can moreover be trained on such corpora”.

Page 4

Comparable corpora - definition

• “A comparable corpus is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora”.

Page 5

Comparable corpora - uses

• “The possibilities of a comparable corpus are to compare different languages or varieties in similar circumstances of communication, but avoiding the inevitable distortion introduced by the translations of a parallel corpus”.

Page 6

Quotations from:

• EAGLES - Expert Advisory Group on Language Engineering Standards

• Guidelines – 1996 – at:

• http://www.ilc.pi.cnr.it/EAGLES96/browse.html

Page 7

Parallel corpora - alignment & annotation

• Most common form of alignment = at sentence level

• E.g. Text aligners:

– WORDSMITH – recognizes full stops only

– WinAlign – TRADOS – recognizes a certain amount of formatting, paragraphs, numbers, tagging

• Ongoing research to align at:

– term/word level

– tag level
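Sentence-level alignment on full stops can be sketched in a few lines. This is a minimal illustration only, not the algorithm of WordSmith or WinAlign: the function names are hypothetical, and it assumes the translation keeps the same number of sentences in the same order.

```python
def split_sentences(text):
    """Split on full stops only (a deliberately naive rule)."""
    parts = [s.strip() for s in text.split(".")]
    return [s + "." for s in parts if s]


def align_one_to_one(source, target):
    """Pair the n-th source sentence with the n-th target sentence."""
    src = split_sentences(source)
    tgt = split_sentences(target)
    if len(src) != len(tgt):
        # A real aligner would try 1:2 or 2:1 matches here instead of giving up.
        raise ValueError(f"sentence counts differ: {len(src)} vs {len(tgt)}")
    return list(zip(src, tgt))


english = "The corpus is small. It contains legal texts."
portuguese = "O corpus é pequeno. Contém textos jurídicos."
pairs = align_one_to_one(english, portuguese)
for en, pt in pairs:
    print(en, "|", pt)
```

Splitting on full stops alone breaks on abbreviations such as "e.g.", which is one reason aligners that also use paragraphs, numbers and formatting tend to be more robust.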

Page 8

Parallel corpora - alignment & annotation problems

• Different linguistic theories = different annotation schemes

– E.g. Morphological, syntactic or semantic?

• Different languages = different annotation schemes

– E.g. English / Portuguese / Polish / Finnish / Chinese

• Different languages = different types of alignment

– E.g. English / Hebrew / Chinese

Page 9

Parallel corpora - professional uses

• Translation memories – aligned collections of repetitive texts in special domains

– Provide previous translations for translator to consult / copy

– Allow economy in translation process

– Provide material for probabilistic machine translation

– E.g. EU translation services, Canadian Hansard
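How a translation memory offers a previous translation for a new segment can be sketched as a fuzzy lookup. This is an assumption-laden illustration: the memory contents, the function name `best_match` and the 0.7 threshold are invented for the example, and character-level similarity from Python's standard `difflib` stands in for whatever matching a commercial tool actually uses.

```python
from difflib import SequenceMatcher


def best_match(memory, segment, threshold=0.7):
    """Return (source, target, score) for the closest stored segment, or None."""
    best_score, best_pair = 0.0, None
    for source, target in memory:
        score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
        if score > best_score:
            best_score, best_pair = score, (source, target)
    if best_pair is not None and best_score >= threshold:
        return best_pair[0], best_pair[1], best_score
    return None


# Hypothetical aligned segments from a repetitive technical domain.
memory = [
    ("Press the green button to start the machine.",
     "Prima o botão verde para ligar a máquina."),
    ("Switch off the machine before cleaning.",
     "Desligue a máquina antes de limpar."),
]

hit = best_match(memory, "Press the red button to start the machine.")
if hit:
    source, target, score = hit
    print(f"{score:.2f}  {target}")
```

The translator still has to edit the suggestion ("verde" must become the word for red), which is why the quality of what goes into the memory matters so much.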

Page 10

Translation memories – requirements

• “Garbage in = garbage out!”

• Original > good quality – hence

– Emphasis on: good editing and proof reading > controlled language

– E.g. EU documentation – training people to edit English documents written by non-native speakers

• Translation > good quality – but certain parallel relationship to the original

• Therefore: tendency to homogeneity

– (e.g. Eurospeak)

Page 11

Parallel corpora - academic uses

• For studying the translation process

• For studying translation solutions

• E.g.

– INTERSECT – French/English (Brighton)

– English-Norwegian Parallel Corpus Project (Oslo)

– COMPARA/DISPARA – Portuguese/English – online at http://www.portugues.mct.pt/

• For terminology extraction

Page 12

Parallel corpora - requirements

• Theory should allow for any original + translation - warts and all!

– Much literary criticism of translation thrives on the ‘warts’!

– Useful for study of errors, translationese etc

• Practical applications require quality:

– Contrastive linguistics

– Pedagogical applications

– Terminology extraction

Page 13

Comparable corpora – Perceived needs

• Texts as:

– Examples of ‘natural’ original text in the source language culture

– E.g.

• Legal texts written according to local conventions

• Socially conventional texts: e.g. the ‘deaths column’ and advertisements for houses and jobs.

• Academic / scientific texts – different cultural conventions

Page 14

Comparable corpora – Advantages

• Availability

– More texts

– Greater variety

• Versatility - applications for research in:

– Discourse analysis

– Pragmatics

– Information retrieval

– Knowledge engineering

Page 15

What makes texts / corpora COMPARABLE?

Page 16

EAGLES - quotes

• “A comparable corpus is one which selects similar texts in more than one language or variety”.

Similar - in more than one language

AND/OR

Similar - in variety

“...similar circumstances of communication..”

Page 17

Similarity – Form/content?

• Form

– Size, no. of words, sentences, paragraphs

– Length of texts

– Format - .txt, .doc, .html, .xml

• Content

– General language

– Specialised domains
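The formal measures listed above (words, sentences, paragraphs) are easy to compute, so similarity of form can be checked before any linguistic analysis. The sketch below is one rough way to do it; the function names, the regular-expression tokenisation and the sample texts are all assumptions made for illustration.

```python
import re


def form_profile(text):
    """Count paragraphs, sentences and words with simple regex heuristics."""
    paragraphs = [p for p in re.split(r"\n\s*\n", text) if p.strip()]
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return {
        "paragraphs": len(paragraphs),
        "sentences": len(sentences),
        "words": len(words),
    }


def size_ratio(profile_a, profile_b, key="words"):
    """Smaller count divided by larger count: 1.0 means identical size."""
    a, b = profile_a[key], profile_b[key]
    return min(a, b) / max(a, b) if max(a, b) else 1.0


english = ("Tribology studies friction. It also covers wear.\n\n"
           "Lubrication is a third topic.")
portuguese = ("A tribologia estuda o atrito. Abrange também o desgaste.\n\n"
              "A lubrificação é um terceiro tema.")

en_profile = form_profile(english)
pt_profile = form_profile(portuguese)
print(en_profile, pt_profile, size_ratio(en_profile, pt_profile))
```

Word counts differ across languages for the same content, so a ratio below 1.0 does not by itself mean the texts are not comparable.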

Page 18

Similarity - Structure/Function?

• Structure

– Formal, carefully constructed texts – e.g. Legal texts

– Informal, loosely organized discourse – e.g. transcriptions of conversation

• Function

– Social

– Cultural

Page 19

Similarity - Register?

• Register

– Field – situation, subject matter etc

– Tenor – interpersonal relationships

• e.g. formal/informal, politeness, etc

– Mode

• Spoken: e.g. speech, formal dialogue, conversation

• Written: e.g. book, essay, instruction manual

• Multimedia: e.g. Encarta, films

Page 20

Similarity - Dialect?

• Dialect

– Geographical

• e.g. urban/rural areas, developed/developing countries

– Temporal

• e.g. historical periods, different age groups

– Social

• e.g. social classes, educational backgrounds

Page 21

Comparability in Very Large Corpora

• Very Large Corpora comparable if:

– similar in size

– constructed according to same criteria – e.g. quantity and quality of text types

• Consider:

– British National Corpus

– Mannheimer Corpora

Page 22

Comparability in newspaper corpora

• Newspaper corpora vary according to:

– Type: ‘quality’/‘popular’, general/specialised content

– Time: same day/month/year > ‘concurrent’ corpora

• Consider:

– CETEMPúblico - Portuguese

– Reuters Corpus - English

Page 23

Comparability in literary corpora

• Period:

– Medieval, 18th Century, Post-war

• School:

– Romanticism, Realism, Post-modernism

• Genre:

– Novel, science fiction, drama, poetry

Page 24

Comparability in technical and scientific corpora - form

• Pamphlets

• Manuals

• Textbooks

• Articles and papers

• Dissertations, theses

Page 25

Comparability in technical and scientific corpora - content

• Everyday information

• Encyclopedic information

• Instructions

• Education

• Expert-to-expert communication

Page 26

Constructing comparable corpora - general language

• Where does one start?

• Very large comparable corpora in 2 or more languages = mega-proposition!

• Carefully selected annotated general corpora – like ICAME corpora (Brown, LOB etc) = a possibility + limitations

Page 27

Using comparable corpora - general language

• Advantages:

– Comparative and contrastive research at all levels

– Particularly useful for lexicographical research and search for syntactic patterns

• Disadvantages:

– Difficult to manage for more delicate analysis

– Unnecessary for certain types of research

Page 28

Constructing comparable corpora – Newspaper texts

• Newspaper corpora

– Relatively easy to acquire

– A wide variety of fields

– Similarity in

• tenor

• mode

Page 29

Using comparable corpora – Newspaper texts

Concurrent corpora > extraction of similar news items > e.g.

– War reports

– Politics – election campaigns

– Football during the World Cup

OR > styles of journalism > comparing individual journalists etc.
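Extracting similar news items from concurrent corpora could start by grouping articles by publication date and topic, then keeping the groups that contain more than one language. The sketch below assumes a tiny hand-made bilingual keyword list (`TOPIC_KEYWORDS`) and invented headlines. A real project would need much richer resources for topic detection.

```python
import re
from collections import defaultdict
from datetime import date

# Hypothetical bilingual keyword lists, for illustration only.
TOPIC_KEYWORDS = {
    "election": {"election", "vote", "eleições", "voto"},
    "football": {"football", "goal", "futebol", "golo"},
}


def detect_topic(text):
    """Return the first topic whose keywords appear in the text, else None."""
    tokens = set(re.findall(r"\w+", text.lower()))
    for topic, keywords in TOPIC_KEYWORDS.items():
        if tokens & keywords:
            return topic
    return None


def pair_concurrent(articles):
    """Group by (date, topic); keep only groups covering two or more languages."""
    buckets = defaultdict(lambda: defaultdict(list))
    for art in articles:
        topic = detect_topic(art["text"])
        if topic is not None:
            buckets[(art["date"], topic)][art["lang"]].append(art["text"])
    return {key: dict(langs) for key, langs in buckets.items() if len(langs) > 1}


articles = [
    {"lang": "en", "date": date(2002, 6, 5),
     "text": "Late goal decides World Cup football match"},
    {"lang": "pt", "date": date(2002, 6, 5),
     "text": "Golo tardio decide jogo de futebol do Mundial"},
    {"lang": "en", "date": date(2002, 6, 6),
     "text": "Election campaign enters final week"},
]

matched = pair_concurrent(articles)
for (day, topic), by_lang in matched.items():
    print(day, topic, sorted(by_lang))
```

The election article is dropped here because no item in the other language appeared on the same day, which is the sense in which the corpora must be concurrent.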

Page 30

Constructing comparable corpora – general language + restricted text type

• General subject texts of similar text type – e.g. Encyclopedia entries, tourism pamphlets

• Literary texts of similar period, school or genre

• Technical and scientific texts with similar form or function e.g. textbooks

Page 31

Using comparable corpora – general language + restricted text type

• Discourse analysis

• Pragmatics

• Genre analysis

• Sociolinguistic analysis

Page 32

Constructing comparable corpora – specialized language

• Special domains at various levels – e.g.

– Geography > population geography > ethnic minorities

– Engineering > mechanical engineering > tribology

– Medicine > oncology > breast cancer

Page 33

Using comparable corpora – specialized language

• Genre analysis

• Terminology extraction

• Information retrieval

• Web browsing technology

• Knowledge engineering

Page 34

All corpora construction

• Must establish:

– Overall general policy in relation to:

• Form – computational structure

• Content of sub-corpora

• Availability to general / restricted public

– Specific objectives of sub-corpora

Page 35

All corpora construction

• Must take into account:

– Copyright restrictions

– Effect of external factors on the text

• Idiosyncrasies of individual author

• Characteristics of writing in specific cultural / social situation

• Homogenising effect of internationalisation

– Eurospeak

– Anglicisation of scientific terminology

Page 36

Linguateca - Porto More immediate objectives

• To construct comparable and parallel corpora in Portuguese and English using:

– Texts in special domains already being investigated

– Adding corpora from special domains as and when the opportunity arises

• To construct the necessary computational framework for using the corpora for research

• To make these corpora as widely available as the respective copyright situation permits

Page 37

Linguateca - Porto Longer-term objectives

• To extend the notion of comparability to:

– genre-specific corpora

– restricted general language corpora

• To construct integrated networks of comparable corpora

• To extend these objectives to other languages

• To contribute to similar projects elsewhere

Page 38

Bibliography

• Bourigault, Didier, Christian Jacquemin, & Marie-Claude L’Homme. (Eds.) 2001. Recent Advances in Computational Terminology. Amsterdam & Philadelphia: John Benjamins Publishing Co.

• Charlet, J., M. Zacklad, G. Kassel, & D. Bourigault. 2001. Ingénierie des connaissances. Paris: Éditions Eyrolles.

• Veronis, Jean (Ed). 2000. Parallel Text Processing – Alignment and Use of Translation Corpora. Dordrecht: Kluwer Academic Publishers.