Corpus Creation for Lexicography
Adam Kilgarriff, Michael Rundell
Lexicography MasterClass, UK
Elaine Ui Dhonnchadha
ITE (Linguistics Institute of Ireland)
Kilgarriff: Asialex June 2005 2
Tasks
  Design
  Collection
  Encoding
The project
  A New English-Irish Dictionary
    Authoritative, general purpose
    Academics, translators, students, secretaries
  One year ‘set-up’ phase
    Limited time, limited budget
    Many tasks, including corpus development
  Irish and UK Government funded
    Lead contractor: LexMasterClass
    Subcontractor: ITE
Languages
  English
  Irish
The Irish language
  A Celtic language
  Long literary tradition
    Irish-Latin dictionary from 9th century
  Main language of Ireland until 1850-1900
    English took over (British imperialist policies)
  62,000 speakers as main language
  Gaeltacht: Irish-speaking areas
  Three dialects
Gaeltacht areas
Design: English
  Source language for NEID
  Very large resource wanted
    E.g. for word sketches, see Friday talk
  Three language varieties
    Irish (Hiberno-English)
    British
    American
  American
    100M words
    Journalistic text available
  British
    100M words
    British National Corpus (BNC)
    Model balanced corpus
      Spoken conversation (10%)
      Books, newspapers, magazines
      Popular, academic, technical
Hiberno-English
  25M words
  Goal: balanced like BNC, except
    No budget for spoken corpus collection
    New category: web
  Dates: since independence (1922)
    Emphasis on current language
Design: Irish
  30M words
  Starting point: BNC-like
  Native speakers
    Native speakers’ language “better”
    Many texts written by non-native speakers
    Record status where possible
      Newspapers, websites: no info available
  Dialect
    Record where possible
“High quality Irish”
  Smaller than 150 years ago
  Many documents are translations
  Learners’ errors, inelegant prose
  Samuel Johnson: “writers of the first reputation”
  Con
    Who judges?
    Risk of literary or backward-looking bias
    Lexicographers need a corpus to translate
      “Boot the computer” as well as “the babbling brook”
    Trench and the OED: “an historian, not a critic”
    Will a quality filter limit corpus breadth (and size)?
Quality: outcome
  Wide range of text types wanted
  Particular effort to gather native speaker non-translations
  Period for corpus: 1883-present
    Most earlier texts: literary
    Most text types: usually recent
Text category              Irish   Hiberno-English
                   Words: actual     Words: actual
Books-imaginative      7,600,000         6,000,000
Books-informative      8,400,000         7,000,000
Newspapers             4,500,000         5,300,000
Periodicals            2,600,000           700,000
Official/Govt          1,200,000         1,000,000
Broadcast                400,000                 0
Websites               5,500,000         5,000,000
TOTALS                30,200,000        25,000,000
Collection
  Use existing
  Ask publishers
  Web
Use existing
  Irish: PAROLE corpus (8M words, ITE)
  English
    British: BNC
    American: LDC Gigaword – wds journalism
    Limerick Corpus of Spoken English
    Northern Ireland Corpus of Transcribed Speech
Ask publishers
  The junkmail problem
    Appeals to national pride
    Charm and persistence
    Team member who knows them all
Web
  Fast becoming the usual place to look
    Kilgarriff and Grefenstette, CL 2003
  Preliminary experiments
    At least 15M words of Irish out there
  Hiberno-English
    English as found on sites where Irish was found
Web issues
  Formats
    Conversion from PDF etc. needed
  Character representation
    Not many pages “do the right thing”
  Navigational material: “click here”
  Lists
  Mixed languages
  Duplication
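Two of the web issues listed here, navigational material and duplication, can be illustrated with a minimal Python sketch. This is not the project's actual pipeline: the phrase list `NAV_PHRASES` and the helper names are hypothetical, and exact-match hashing only catches copies that differ in case or whitespace.

```python
import hashlib
import re

# Hypothetical list of navigational phrases; a real project would compile its own.
NAV_PHRASES = ("click here", "home", "back to top")


def normalise(text):
    """Lower-case and collapse whitespace so trivially different copies compare equal."""
    return re.sub(r"\s+", " ", text).strip().lower()


def clean_page(text):
    """Drop lines that consist only of a navigational phrase."""
    kept = [line for line in text.splitlines()
            if normalise(line) not in NAV_PHRASES]
    return "\n".join(kept)


def deduplicate(pages):
    """Keep the first copy of each page, comparing hashes of the normalised text."""
    seen = set()
    unique = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique
```

Near-duplicates, mixed-language pages and list-heavy pages would need further heuristics that this sketch does not attempt.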
Text category              Irish         Irish  Hiberno-English  Hiberno-English
                   Words: actual Words: target    Words: actual    Words: target
Books-imaginative      7,600,000     9,000,000        6,000,000        7,500,000
Books-informative      8,400,000     6,000,000        7,000,000        5,000,000
Newspapers             4,500,000     4,500,000        5,300,000        3,750,000
Periodicals            2,600,000     2,500,000          700,000        2,250,000
Official/Govt          1,200,000     1,500,000        1,000,000        1,000,000
Broadcast                400,000     1,000,000                0          750,000
Websites               5,500,000     5,500,000        5,000,000        4,750,000
TOTALS                30,200,000    30,000,000       25,000,000       25,000,000
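The actual and target figures can be compared category by category. The short Python sketch below does that arithmetic; the numbers are copied from the table, while the dictionary and function names are illustrative only.

```python
# Word counts copied from the table above, as (actual, target) pairs.
IRISH = {
    "Books-imaginative": (7_600_000, 9_000_000),
    "Books-informative": (8_400_000, 6_000_000),
    "Newspapers": (4_500_000, 4_500_000),
    "Periodicals": (2_600_000, 2_500_000),
    "Official/Govt": (1_200_000, 1_500_000),
    "Broadcast": (400_000, 1_000_000),
    "Websites": (5_500_000, 5_500_000),
}
HIBERNO_ENGLISH = {
    "Books-imaginative": (6_000_000, 7_500_000),
    "Books-informative": (7_000_000, 5_000_000),
    "Newspapers": (5_300_000, 3_750_000),
    "Periodicals": (700_000, 2_250_000),
    "Official/Govt": (1_000_000, 1_000_000),
    "Broadcast": (0, 750_000),
    "Websites": (5_000_000, 4_750_000),
}


def differences(counts):
    """Return actual minus target for each text category."""
    return {category: actual - target
            for category, (actual, target) in counts.items()}


def total_actual(counts):
    """Sum the actual word counts over all categories."""
    return sum(actual for actual, _ in counts.values())


if __name__ == "__main__":
    for name, counts in (("Irish", IRISH), ("Hiberno-English", HIBERNO_ENGLISH)):
        print(name, "total actual:", f"{total_actual(counts):,}")
        for category, diff in differences(counts).items():
            print(f"  {category:<18}{diff:>+12,}")
```

Negative differences mark categories that fell short of target (for example, Irish broadcast material is 600,000 words under), and positive ones mark categories that compensated.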
Encoding
  Clean-up
  Linguistic processing
  Delivery formalism
Clean-up
  Deletion of:
    Title pages, table of contents, tables, figures, footnotes, endnotes, page headers and footers, crosswords, TV listings, sports results, team listings …
Linguistic processing
  Lemmatize
    give giving gives given gave => give (verb)
  Part-of-speech tagging
    bank (verb) or bank (noun)?
  English: existing tools used
  Irish: tools developed from scratch
    Elaine Ui Dhonnchadha: thesis work
    Finite state methods, constraint grammar
    Separate talk
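What lemmatization and part-of-speech tagging deliver can be shown with a toy lookup in Python. This only illustrates the output; it is not the finite state or constraint grammar tooling used in the project, and the table `LEMMAS` and function `lemmatize` are hypothetical.

```python
# Toy lookup table mapping (word form, part of speech) to a lemma.
# The entries are the examples from the slide; a real lemmatizer covers the whole lexicon.
LEMMAS = {
    ("give", "verb"): "give",
    ("giving", "verb"): "give",
    ("gives", "verb"): "give",
    ("given", "verb"): "give",
    ("gave", "verb"): "give",
    ("bank", "verb"): "bank",
    ("bank", "noun"): "bank",
}


def lemmatize(form, pos):
    """Return (lemma, pos); unknown forms fall back to the lower-cased form itself."""
    key = (form.lower(), pos)
    return (LEMMAS.get(key, key[0]), pos)
```

Returning the part of speech alongside the lemma keeps bank (verb) and bank (noun) distinct, so a query tool can search for one without the other.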
Delivery formalism
  Both
    XML Corpus Encoding Standard (XCES)
      For longevity, interchange format
  And
    Loaded into Word Sketch Engine
      Corpus query tool optimised for lexicography, linguistic research
      Good for searching on grammar, text type etc.
      Friday talk
Conclusion
  Large corpora for high-quality lexicography
  Developed in one year, modest budget
  Design, collection and encoding
  Delivered in a convenient form for the lexicographer
Thank you