Corpus design II See G Kennedy, Introduction to Corpus Linguistics, Ch 2 CF Meyer, English Corpus Linguistics, Ch. 3

Post on 19-Dec-2015

Corpus design II

See G. Kennedy, Introduction to Corpus Linguistics, Ch. 2

C. F. Meyer, English Corpus Linguistics, Ch. 3

Issues in corpus design

• General purpose vs specialized
• Dynamic (monitor) vs static
• Representativeness and balance
• Size
• Collection, permission
• Text capture and markup
• Storage and access
• Organizations

Collecting samples of speech

• Aim to collect natural samples
• Cannot tape-record surreptitiously
  – Early corpora were collected in this way, with permission sought afterwards
  – Nowadays regarded as unethical, perhaps even illegal
• “Observer’s paradox”: the presence of the recorder affects behaviour
• Can be overcome (somewhat) by recording lots of material and sampling from the middle
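
Sampling from the middle of a recording can be sketched as follows. This is only an illustration: the function name `middle_sample` and the default of discarding a quarter of the utterances at each end are assumptions, not figures from Kennedy or Meyer.

```python
def middle_sample(utterances, skip_fraction=0.25):
    """Return the central part of a recording.

    Drops `skip_fraction` of the utterances at the start (when speakers
    are most aware of the recorder) and the same share at the end.
    The 0.25 default is an arbitrary, illustrative choice.
    """
    if not 0 <= skip_fraction < 0.5:
        raise ValueError("skip_fraction must be in [0, 0.5)")
    n = len(utterances)
    skip = int(n * skip_fraction)
    return utterances[skip:n - skip]


recording = [f"utterance {i}" for i in range(1, 9)]
print(middle_sample(recording))
# ['utterance 3', 'utterance 4', 'utterance 5', 'utterance 6']
```

In practice the cut-off would more likely be chosen by elapsed time than by a fixed share of utterances.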

Collecting written samples

• Much easier to obtain, but beware the important issue of permission
  – Copyrighted material cannot be freely stored and distributed
  – “Fair use” law allows use of up to 2,000 words for private research
  – Corpus samples are often >2,000 words, and often distributed widely, sometimes for profit (or at least at a price to cover/recoup costs)
  – Copyright laws may differ between countries

Permission

• Obtaining copyright permission can be quite onerous
  – Waiting for a reply to a request is time-consuming: do you go ahead and include the text (ie start work on annotation and mark-up), or wait?
  – Going ahead is a big risk, eg the English-Norwegian Parallel Corpus contains copyrighted material and can only be used by U Oslo researchers, on site!

Text capture

• Easiest if text is already machine-readable, though there may still be some issues with mark-up
  – eg machine-readable text (MRT) obtained from publishers may have print formatting information embedded in it
  – Text captured from an online source may have HTML mark-up
• If text exists in printed form, scanning is a possibility
  – OCR is generally very good quality, but text must still be carefully checked
  – Issue of how to deal with printing effects such as hyphenation, headers and footers, footnotes
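
Two of the clean-up steps above, removing HTML mark-up and rejoining words hyphenated at a line break, can be sketched with the Python standard library. The helper names `strip_html` and `join_hyphenated` are illustrative assumptions. The hyphen rule is deliberately naive, so its output would still need the careful checking mentioned above.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text content of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup):
    """Return the text of `markup` with all tags removed."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


def join_hyphenated(text):
    """Rejoin words split across a line break: 'lin-\\nguistics' -> 'linguistics'.

    Caveat: a genuine compound that happens to break at its hyphen
    (eg 'time-\\nconsuming') is wrongly joined too, so results need review.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


raw = "<p>Corpus <b>lin-\nguistics</b> notes</p>"
print(join_hyphenated(strip_html(raw)))  # Corpus linguistics notes
```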

Text capture: re-keying

• If OCR is not suitable/available
  – eg hand-written texts, or medium is not flat
• Re-keying is the only option
• Highly expensive, time-consuming and error-prone
• With manuscripts, there may be an issue of “keyboarder correction”
  – Example of Learner English corpus of handwritten essays: important not to correct “errors”
  – PhD student collected handwritten essays by (Arabic) learners of English for error analysis: first task was to “type them in”

Handwritten text

• Are these capital Ts?
• Is this crossed out?
• Is this a v or a t?
• Is this depend or depond?
• etc.
• What does this say?
• Compared to these?

Mark-up

• Issues like this can be overcome by mark-up
• Annotate the text to show explicitly where there is anything special
  – Doubtful text
  – Incorrect text (mark-up can show what was probably meant)
  – Extraneous material

• This is also an important issue in computer storage of ancient manuscripts

• More detail later
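
As a rough illustration of such mark-up, the sketch below uses TEI-style elements: `<unclear>` for doubtful text, `<choice>` with `<sic>` and `<corr>` for an error and its probable intended form, and `<del>` for crossed-out material. The sample sentence and the `reading` helper are invented for illustration. Real TEI documents also carry a namespace and a header, both omitted here.

```python
import xml.etree.ElementTree as ET

# Invented sample sentence; the element names follow TEI conventions.
SAMPLE = (
    "<s>It will <choice><sic>depond</sic><corr>depend</corr></choice> "
    "on <del>the the</del> <unclear>various</unclear> factors.</s>"
)


def reading(xml_text, prefer="sic"):
    """Flatten the mark-up to plain text.

    prefer="sic"  keeps what the writer actually wrote;
    prefer="corr" keeps the probable intended form.
    Crossed-out (<del>) text is dropped in both readings (an assumption).
    """
    drop = "corr" if prefer == "sic" else "sic"

    def walk(elem):
        if elem.tag in (drop, "del"):
            return ""
        out = elem.text or ""
        for child in elem:
            out += walk(child) + (child.tail or "")
        return out

    # Collapse the double spaces left behind by dropped elements.
    return " ".join(walk(ET.fromstring(xml_text)).split())


print(reading(SAMPLE, prefer="sic"))   # It will depond on various factors.
print(reading(SAMPLE, prefer="corr"))  # It will depend on various factors.

# Doubtful passages stay findable for later checking.
print([e.text for e in ET.fromstring(SAMPLE).iter("unclear")])  # ['various']
```

Because the learner's original form survives inside `<sic>`, the same file can serve both error analysis and ordinary searching.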

Speech corpora

• For speech, “corpus” usually means transcribed speech data

• Many issues surrounding transcription of speech

• Some of them similar to issues with handwriting

• Others particular to speech

Transcribing speech

• Not just a matter of typing in what was said, though this is of course a major element
  – And may not be straightforward
  – How much “correction” to do in transcription
  – eg of hesitations, false starts, and other speech phenomena
• Speech corpora usually encode information about paralinguistic and non-linguistic features
  – Speed of delivery, pauses
  – Loudness (whispering, shouting, singing)
  – Coughs and other non-speech sounds which may be meaningful (grunt, tutting, hesitation noises)
  – Even outside noises if relevant (eg passing siren, music, animals), as they might “contribute” to the discussion

Transcribing speech

• Some conventions have emerged, eg …
• Vocalized pauses: use phonetic symbols or conventional spelling
  – or uh, ah, erm, uhuh (!)
• How to transcribe contractions like gotta, gonna, sorta, …
  – Notice how some are completely conventional, eg can’t, won’t
• How (and whether) to transcribe partially uttered words and repetitions
• How to represent unintelligible speech
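
Whatever conventions a project adopts, they need to be applied consistently, and a small script can help enforce that. The sketch below is a hypothetical scheme, not an established standard. The pause table, the trailing hyphen for partially uttered words, and `xxx` for unintelligible speech are all assumptions made for illustration.

```python
# Hypothetical project conventions (assumptions, not a published standard):
# variant spellings of vocalized pauses map onto one canonical form each.
FILLED_PAUSES = {"uh": "uh", "er": "uh", "ah": "uh", "um": "um", "erm": "um"}
UNINTELLIGIBLE = "xxx"  # placeholder the transcriber types for unclear speech


def classify(utterance):
    """Tag each token of a transcribed utterance with a coarse type."""
    result = []
    for raw in utterance.split():
        token = raw.strip(".,?!")
        if not token:
            continue
        key = token.lower()
        if key in FILLED_PAUSES:
            result.append(("PAUSE", FILLED_PAUSES[key]))
        elif key == UNINTELLIGIBLE:
            result.append(("UNINTELLIGIBLE", token))
        elif token.endswith("-"):
            # Partially uttered word, transcribed with a trailing hyphen.
            result.append(("PARTIAL", token))
        else:
            # Contractions such as "gonna" are kept exactly as transcribed.
            result.append(("WORD", token))
    return result


for tagged in classify("Erm I was gon- gonna go xxx."):
    print(tagged)
```

Keeping the tag separate from the spelling means hesitations and false starts can be included in or excluded from a search without being deleted from the transcript.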

Storage

• Where will the data be kept, and who will have access?
  – If corpus is for public distribution, will it be by license, or freely available?
  – If by license, distribute online (with password) or on CD?
• Nowadays, fortunately, size is not such an issue, though
  – Big corpora have to be distributed on multiple CDs
  – Downloading from a website can take hours
• Note that it is not only the corpus data that must be distributed:
  – Many corpora have associated software packages to facilitate exploration
  – For speech corpora, original recordings may be available

Access

• Efficient access to corpus data comes hand-in-hand with corpus structure

• No good having structured corpus if that structure can’t be used to delimit searches

• Best if corpus is cross-indexed on all searchable criteria, ie all details that are encoded in headers
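
Cross-indexing on header details can be pictured as an inverted index from each (field, value) pair to the texts that carry it. The sketch below makes that concrete. The header fields (`mode`, `genre`, `year`) and the three sample texts are invented for illustration and do not come from any particular corpus.

```python
from collections import defaultdict

# Invented header records: one dict per text in the corpus.
HEADERS = [
    {"id": "t1", "mode": "spoken", "genre": "conversation", "year": 1995},
    {"id": "t2", "mode": "written", "genre": "news", "year": 1995},
    {"id": "t3", "mode": "spoken", "genre": "lecture", "year": 2001},
]


def build_index(headers, fields):
    """Map every (field, value) pair to the set of text ids carrying it."""
    index = defaultdict(set)
    for header in headers:
        for field in fields:
            index[(field, header[field])].add(header["id"])
    return index


def select(index, **criteria):
    """Return the ids of texts that satisfy all the given header criteria."""
    matches = [index.get((field, value), set()) for field, value in criteria.items()]
    return set.intersection(*matches) if matches else set()


index = build_index(HEADERS, ["mode", "genre", "year"])
print(sorted(select(index, mode="spoken")))             # ['t1', 't3']
print(sorted(select(index, mode="spoken", year=1995)))  # ['t1']
```

A search can then be limited to the selected texts only, which is the point of encoding the structure in headers in the first place.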

Organizations

• Several organizations, often based in universities, have their own corpus material, and are also very active in issues surrounding Corpus Linguistics
• “corpora” mailing list http://nora.hd.uib.no/corpora/
• ELRA European Language Resources Association http://www.elra.info/
• LDC Linguistic Data Consortium http://www.ldc.upenn.edu/
• TEI Text Encoding Initiative http://www.tei-c.org/

ELRA (European Language Resources Association)

• aims to make available the language resources for language engineering and to evaluate language engineering technologies
• active in identification, distribution, collection, validation, standardisation, improvement
• promotes the production of language resources
• supports the infrastructure to perform evaluation campaigns
  – Mainly through ELDA (Evaluation and Language Resources Distribution Agency) http://www.elda.org/
• http://www.elra.info/

LDC (Linguistic Data Consortium)

• Based at U Penn

• supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards

• http://www.ldc.upenn.edu/

TEI (Text Encoding Initiative)

• collectively develops and maintains a standard for the representation of texts in digital form

• chief deliverable is a set of Guidelines which specify encoding methods for machine-readable texts, chiefly in the humanities, social sciences and linguistics

• http://www.tei-c.org/index.xml