corpus creation for lexicography

Corpus Creation for Lexicography Adam Kilgarriff, Michael Rundell Lexicography MasterClass, UK Elaine Ui Dhonnchadha ITE (Linguistics Institute of Ireland)

Page 1: Corpus Creation for Lexicography

Corpus Creation for Lexicography

Adam Kilgarriff, Michael Rundell
Lexicography MasterClass, UK

Elaine Ui Dhonnchadha
ITE (Linguistics Institute of Ireland)

Page 2: Corpus Creation for Lexicography

Kilgarriff: Asialex June 2005 2

Tasks

Design
Collection
Encoding

Page 3: Corpus Creation for Lexicography


The project

A New English-Irish Dictionary (NEID)
  Authoritative, general purpose
  For academics, translators, students, secretaries

One-year ‘set-up’ phase
  Limited time, limited budget
  Many tasks, including corpus development

Funded by the Irish and UK governments
  Lead contractor: LexMasterClass
  Subcontractor: ITE

Page 4: Corpus Creation for Lexicography


Languages

English
Irish

Page 5: Corpus Creation for Lexicography


The Irish language

A Celtic language
Long literary tradition
  Irish-Latin dictionary from 9th century
Main language of Ireland until 1850-1900
  English took over (British imperialist policies)
62,000 speakers as main language
Gaeltacht: Irish-speaking areas
Three dialects

Page 6: Corpus Creation for Lexicography


Gaeltacht areas

Page 7: Corpus Creation for Lexicography


Design: English

Source language for NEID
Very large resource wanted
  E.g. for word sketches (see Friday talk)
Three language varieties
  Irish (Hiberno-English)
  British
  American

Page 8: Corpus Creation for Lexicography


American
  100M words
  Journalistic text available
British
  100M words
  British National Corpus (BNC)
  Model balanced corpus
    Spoken conversation (10%)
    Books, newspapers, magazines
    Popular, academic, technical

Page 9: Corpus Creation for Lexicography


Hiberno-English

25M words
Goal: balanced like the BNC, except
  No budget for spoken corpus collection
  New category: web
Dates: since independence (1922)
  Emphasis on current language

Page 10: Corpus Creation for Lexicography


Design: Irish

30M words
Starting point: BNC-like
Native speakers
  Native speakers’ language “better”
  Many texts written by non-native speakers
  Record status where possible
    Newspapers, websites: no info available
Dialect
  Record where possible
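The "record where possible" policy can be pictured as a per-text metadata record in which native-speaker status and dialect are optional. This is only a sketch: the class name, field names and sample entries are hypothetical and do not describe the project's actual metadata scheme.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TextRecord:
    """Hypothetical per-text metadata; None means 'not recorded'."""
    title: str
    category: str                          # e.g. "Newspapers", "Websites"
    native_speaker: Optional[bool] = None  # unknown for many newspaper/web texts
    dialect: Optional[str] = None          # recorded only where known


def known_native(records: List[TextRecord]) -> List[TextRecord]:
    """Keep only texts positively recorded as written by native speakers."""
    return [r for r in records if r.native_speaker is True]


# Illustrative entries only, not real corpus texts.
records = [
    TextRecord("sample novel", "Books-imaginative", native_speaker=True, dialect="Munster"),
    TextRecord("sample news article", "Newspapers"),  # no information available
    TextRecord("sample textbook", "Books-informative", native_speaker=False),
]

native_texts = known_native(records)
```

Keeping "unknown" distinct from "non-native" lets later queries separate texts with no information from texts known to be by non-native writers.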

Page 11: Corpus Creation for Lexicography


“High quality Irish”

Smaller than 150 years ago
Many documents are translations
Learners’ errors, inelegant prose
Samuel Johnson: “writers of the first reputation”
Con
  Who judges?
  Risk of literary or backward-looking bias
  Lexicographers need a corpus to translate “boot the computer” as well as “the babbling brook”
  Trench and the OED: “an historian, not a critic”
  Will a quality filter limit corpus breadth (and size)?

Page 12: Corpus Creation for Lexicography


Quality: outcome

Wide range of text types wanted
Particular effort to gather native-speaker non-translations
Period for corpus: 1883-present
  Most earlier texts: literary
  Most text types: usually recent

Page 13: Corpus Creation for Lexicography


Text category        Irish            Hiberno-English
                     Words: actual    Words: actual
Books-imaginative     7,600,000        6,000,000
Books-informative     8,400,000        7,000,000
Newspapers            4,500,000        5,300,000
Periodicals           2,600,000          700,000
Official/Govt         1,200,000        1,000,000
Broadcast               400,000                0
Websites              5,500,000        5,000,000
TOTALS               30,200,000       25,000,000

Page 14: Corpus Creation for Lexicography


Collection

Use existing
Ask publishers
Web

Page 15: Corpus Creation for Lexicography


Use existing

Irish: PAROLE corpus (8M words, ITE)
English
  British: BNC
  American: LDC Gigaword – wds journalism
  Limerick Corpus of Spoken English
  Northern Ireland Corpus of Transcribed Speech

Page 16: Corpus Creation for Lexicography


Ask publishers

The junkmail problem
  Appeals to national pride
  Charm and persistence
  Team member who knows them all

Page 17: Corpus Creation for Lexicography


Web

Fast becoming the usual place to look
  Kilgarriff and Grefenstette, CL 2003
Preliminary experiments
  At least 15M words of Irish out there
Hiberno-English
  English as found on sites where Irish was found

Page 18: Corpus Creation for Lexicography


Web issues

Formats
  Conversion from PDF etc. needed
Character representation
  Not many pages “do the right thing”
Navigational material: “click here”
Lists
Mixed languages
Duplication
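Two of the web issues above, navigational material and duplication, can be sketched with a simple paragraph filter. This is a minimal illustration under stated assumptions, not the project's pipeline: the phrase list, the five-word threshold and the function names are all hypothetical, and format conversion, character encoding and mixed-language detection are not addressed.

```python
import hashlib
import re

# Hypothetical phrases that mark navigational text.
NAV_PHRASES = ("click here", "back to top")


def normalise(text):
    """Collapse whitespace and lower-case, so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def is_navigational(paragraph, min_words=5):
    """Treat very short paragraphs, or ones with a navigation phrase, as non-text."""
    norm = normalise(paragraph)
    return len(norm.split()) < min_words or any(p in norm for p in NAV_PHRASES)


def clean_pages(pages):
    """Return running-text paragraphs, dropping navigation and exact duplicates."""
    seen = set()
    kept = []
    for page in pages:
        for para in page.split("\n"):
            if not para.strip() or is_navigational(para):
                continue
            digest = hashlib.sha1(normalise(para).encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # same paragraph already collected from another page
            seen.add(digest)
            kept.append(para.strip())
    return kept


pages = [
    "Click here for more\nThis is a full sentence of running text.",
    "This is a full sentence of running text.\nAnother distinct paragraph with enough words here.",
]
cleaned = clean_pages(pages)
```

Hashing normalised paragraphs only catches exact repeats; near-duplicates would need a fuzzier comparison.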

Page 19: Corpus Creation for Lexicography


Text category        Irish                            Hiberno-English
                     Words: actual    Words: target   Words: actual    Words: target
Books-imaginative     7,600,000        9,000,000       6,000,000        7,500,000
Books-informative     8,400,000        6,000,000       7,000,000        5,000,000
Newspapers            4,500,000        4,500,000       5,300,000        3,750,000
Periodicals           2,600,000        2,500,000         700,000        2,250,000
Official/Govt         1,200,000        1,500,000       1,000,000        1,000,000
Broadcast               400,000        1,000,000               0          750,000
Websites              5,500,000        5,500,000       5,000,000        4,750,000
TOTALS               30,200,000       30,000,000      25,000,000       25,000,000
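The gap between collected and planned material can be read off the table with a few lines of arithmetic. The figures below are copied from the table; the variable and function names are illustrative only.

```python
# Each row: (Irish actual, Irish target, Hiberno-English actual, Hiberno-English target)
WORDS = {
    "Books-imaginative": (7_600_000, 9_000_000, 6_000_000, 7_500_000),
    "Books-informative": (8_400_000, 6_000_000, 7_000_000, 5_000_000),
    "Newspapers":        (4_500_000, 4_500_000, 5_300_000, 3_750_000),
    "Periodicals":       (2_600_000, 2_500_000,   700_000, 2_250_000),
    "Official/Govt":     (1_200_000, 1_500_000, 1_000_000, 1_000_000),
    "Broadcast":         (  400_000, 1_000_000,         0,   750_000),
    "Websites":          (5_500_000, 5_500_000, 5_000_000, 4_750_000),
}


def gaps(actual_idx, target_idx):
    """Actual minus target per category: negative = shortfall, positive = surplus."""
    return {cat: row[actual_idx] - row[target_idx] for cat, row in WORDS.items()}


irish_gaps = gaps(0, 1)
hiberno_gaps = gaps(2, 3)

for cat in WORDS:
    print(f"{cat:<18} {irish_gaps[cat]:>+12,} {hiberno_gaps[cat]:>+12,}")
```

For Irish, imaginative books fall 1.4M words short of target while informative books exceed it by 2.4M. In both corpora the overall totals are met because surpluses in some categories offset shortfalls in others, most visibly broadcast and Hiberno-English periodicals.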

Page 20: Corpus Creation for Lexicography


Encoding

Clean-up
Linguistic processing
Delivery formalism

Page 21: Corpus Creation for Lexicography


Clean-up

Deletion of:
  Title pages, tables of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …
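One item on this list, page headers and footers, lends itself to a simple heuristic: a line that recurs on many pages of the same document is probably not running text. The sketch below assumes each document arrives as a list of pages, each a list of lines; the function name and the 50% threshold are hypothetical, and the other items (tables, crosswords, listings) would need their own treatment.

```python
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.5):
    """Drop lines that occur on at least min_fraction of pages (and at least 2)."""
    counts = Counter()
    for lines in pages:
        # Count each distinct non-empty line once per page.
        counts.update({line.strip() for line in lines if line.strip()})
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {line for line, n in counts.items() if n >= threshold}
    return [[line for line in lines if line.strip() not in repeated] for lines in pages]


# Illustrative pages with a made-up running header.
pages = [
    ["Sample Gazette", "First story text."],
    ["Sample Gazette", "Second story text."],
    ["Sample Gazette", "Third story text."],
]
body = strip_repeated_lines(pages)
```

Headers that vary from page to page, such as those carrying a page number, are not caught by exact matching and would need a pattern-based rule.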

Page 22: Corpus Creation for Lexicography


Linguistic processing

Lemmatize
  give, giving, gives, given, gave => give (verb)
Part-of-speech tagging
  bank (verb) or bank (noun)?
English: existing tools used
Irish: tools developed from scratch
  Elaine Ui Dhonnchadha: thesis work
  Finite state methods, constraint grammar
  Separate talk
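The two tasks can be illustrated with a toy lookup table: lemmatization maps every inflected form to one headword, and a form with more than one possible part of speech is what a tagger has to resolve from context. This is a deliberately simplified sketch with a hand-made lexicon; it is not the finite-state or constraint-grammar approach used for Irish, nor any of the existing English tools.

```python
# Toy lexicon: word form -> list of possible (lemma, part of speech) analyses.
LEXICON = {
    "give":   [("give", "verb")],
    "giving": [("give", "verb")],
    "gives":  [("give", "verb")],
    "given":  [("give", "verb")],
    "gave":   [("give", "verb")],
    "bank":   [("bank", "noun"), ("bank", "verb")],
}


def analyse(word):
    """Return all analyses for a form; unknown words get a placeholder analysis."""
    key = word.lower()
    return LEXICON.get(key, [(key, "unknown")])


def is_ambiguous(word):
    """True when a form has more than one possible part of speech."""
    return len({pos for _, pos in analyse(word)}) > 1


lemmas = {form: analyse(form)[0][0] for form in ("give", "giving", "gives", "given", "gave")}
```

Here every form of "give" reduces to the same lemma, while "bank" comes back with two analyses and needs its context to decide between noun and verb.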

Page 23: Corpus Creation for Lexicography


Delivery formalism

Both
  XML Corpus Encoding Standards (XCES)
    For longevity, interchange format
And
  Loaded into Word Sketch Engine
    Corpus query tool optimised for lexicography, linguistic research
    Good for searching on grammar, text type etc.
    Friday talk
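To show what an XML interchange form of the annotated text might look like, the sketch below wraps tokens with their lemma and part of speech in XML using Python's standard library. The element and attribute names (`text`, `s`, `w`, `lemma`, `pos`, `category`) are placeholders chosen for illustration; they are not taken from the XCES schema, nor from the project's delivered files.

```python
import xml.etree.ElementTree as ET


def encode_text(text_id, category, tokens):
    """Serialise one sentence of (word, lemma, pos) triples as an XML string.

    Element and attribute names are hypothetical, not the XCES schema.
    """
    root = ET.Element("text", id=text_id, category=category)
    sentence = ET.SubElement(root, "s")
    for word, lemma, pos in tokens:
        w = ET.SubElement(sentence, "w", lemma=lemma, pos=pos)
        w.text = word
    return ET.tostring(root, encoding="unicode")


xml_string = encode_text(
    "sample-001",
    "Newspapers",
    [("She", "she", "pronoun"), ("gave", "give", "verb")],
)
print(xml_string)
```

Storing text type as an attribute on the text element and lemma and part of speech on each token is what makes searches by grammar and text type possible once the data is loaded into a query tool.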

Page 24: Corpus Creation for Lexicography


Conclusion

Large corpora for high-quality lexicography
Developed in one year, modest budget
Design, collection and encoding
Delivered in a convenient form for the lexicographer

Thank you