Comparable Corpora BootCaT (CCBC) (or: In Praise of BootCaT)

Adam Kilgarriff, Jan Pomikalek, Avinesh PVS
Lexical Computing Ltd.

Work supported by EU FP7 Project PRESEMT


Just-in-time corpora

Krista Varantola

Translators, terminologists

In-domain terminology: Domain dictionaries

• Don’t exist

• Out of date

• Not accessible

Collect in-domain web pages

Instant corpus


BootCaT (Bootstrapping Corpora and Terms)

Baroni and Bernardini 2004

User: input ‘seed terms’

Send 3-at-a-time to a search engine
• Returns search hits page

Retrieve those pages

A corpus!
• Cleaning, deduplicating, linguistic processing

Extract terms
• Can use extracted terms as seeds, iterate
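
The "3-at-a-time" step above can be sketched as follows. This is only an illustration of combining seed terms into queries, not BootCaT's own code: the function name `build_queries`, its parameters, and the choice to shuffle and cap the tuples are assumptions. Sending the queries to a search engine and retrieving the pages is left out.

```python
import itertools
import random


def build_queries(seeds, tuple_size=3, max_queries=10, rng_seed=0):
    """Combine seed terms into search queries, tuple_size terms at a time.

    Hypothetical helper: names and defaults are illustrative assumptions.
    """
    combos = list(itertools.combinations(seeds, tuple_size))
    # Shuffle with a fixed seed so the selection is reproducible.
    random.Random(rng_seed).shuffle(combos)
    # Quote multi-word seeds so a search engine treats them as phrases.
    return [
        " ".join(f'"{term}"' if " " in term else term for term in combo)
        for combo in combos[:max_queries]
    ]


seeds = ["magma", "tephra", "volcanic ash", "rhyolitic", "seismographs"]
queries = build_queries(seeds, max_queries=4)
for query in queries:
    print(query)
```

Terms extracted from the resulting corpus could be passed back in as `seeds` for the next iteration.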


Very successful

Widely used

More implementations

SkE has WebBootCaT, web front end

Secret: piggybacks on search engines

They do the donkey-work
• on-domain, text-rich pages, no spam, …


Also use for

General language corpus
• Long list of general seed words
• Pioneer: Sharoff
• LCL: Corpus Factory

‘Varieties of Learner English’
• General English, same queries except
• Region=UK, US, Canada, Aus, China, Japan, Korea

Validation under way


Sketch Engine


Corpus query tool, since 2003

Widely used by lexicographers

Commercial
• OUP, CUP, Collins, Macmillan, Le Robert, Cornelsen, Shogakukan

National dictionary projects
• Bulgaria, Czech Republic, Estonia, Netherlands, Slovakia, Slovenia

Universities
• Linguistics, language research, NLP, language teaching


44 languages and counting

Large corpora ready-to-use for

Arabic Bengali Bulgarian Chinese Czech Croatian Danish Dutch English Estonian Finnish French German Greek Gujarati Hebrew Hindi Indonesian Irish Italian Japanese Korean Latin Malay Malayalam Norwegian Persian Polish Portuguese Romanian Russian Serbian Setswana Slovak Slovene Spanish Swahili Swedish Tamil Telugu Thai Turkish Urdu Vietnamese


Handles large corpora
• Largest to date: 8 billion words

Fast

Web-based: no software to install

Build ‘instant corpora’ from the web

Load your own corpus
• Quota of space on SkE server

Word sketches
• One-page, automatic accounts of a word’s grammatical and collocational behaviour

Free 30-day trial: sketchengine.co.uk


WebBootCaT

BootCaT integrated in SkE

BootCaT a corpus
• Clean, de-dupe, POS-tag, then load into Sketch Engine
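
As a rough illustration of the de-dupe step, the sketch below drops pages whose text is identical after whitespace and case normalisation. The slides do not say how WebBootCaT cleans or de-duplicates pages, so the function names and the exact-match hashing approach are assumptions; near-duplicate detection, boilerplate removal and POS-tagging are not covered.

```python
import hashlib
import re


def normalise(text):
    """Collapse whitespace and lowercase so trivial differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def dedupe(pages):
    """Keep the first copy of each page; drop later exact duplicates.

    Hypothetical helper: exact-match only, an assumption for illustration.
    """
    seen = set()
    kept = []
    for page in pages:
        digest = hashlib.sha1(normalise(page).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(page)
    return kept


pages = ["Magma  rises.\n", "magma rises.", "Tephra falls."]
unique_pages = dedupe(pages)
print(len(unique_pages))
```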


Observation

Specialist domain, L1

Specialist domain, L2

Matching terminology


Going multilingual

Translate seeds

English: volcanology volcanologist "volcanic eruption" seismographs Eyjafjallajokull geodic "deformation monitoring" tephra magma stratigraphic tephrochronology geochronological "volcanic ash" ablation rhyolitic

French: vulcanologue volcanologie "éruption volcanique" sismographes Eyjafjallajokull "surveillance de la déformation" géodiques tephra magma téphrochronologie stratigraphique géochronologiques "de cendres volcaniques" ablation rhyolitiques

Thanks again Google

BootCaT for French


CCBC

Input: L1, L1 seeds, L2

Choose dictionary
• Google as default
• Google dictionary (25 lg pairs, limited API)
• Google translate (1225 lg pairs, only 1 transl)

Option: edit translations

BootCaT 2 corpora

Bilingual word sketches
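
The seed-translation step with optional user edits might look like the sketch below. It assumes the chosen dictionary is available as a plain mapping from an L1 term to a list of L2 terms; the function name `translate_seeds` and the toy entries (taken from the volcanology seeds shown earlier) are illustrative, and no Google API is called.

```python
def translate_seeds(seeds, dictionary, edits=None):
    """Translate L1 seeds to L2, letting user edits override the dictionary.

    Hypothetical helper; returns (translated seeds, seeds with no translation).
    """
    edits = edits or {}
    translated, missing = [], []
    for seed in seeds:
        options = edits.get(seed) or dictionary.get(seed)
        if options:
            translated.extend(options)
        else:
            missing.append(seed)
    return translated, missing


# Toy English-French dictionary (an assumption, standing in for the real one).
dictionary = {
    "volcanology": ["volcanologie"],
    "volcanic eruption": ["éruption volcanique"],
    "volcanic ash": ["de cendres volcaniques"],
    "seismographs": ["sismographes"],
}
# A user edit replacing an awkward automatic translation.
edits = {"volcanic ash": ["cendres volcaniques"]}

seeds = ["volcanology", "volcanic eruption", "volcanic ash",
         "seismographs", "tephrochronology"]
translated, missing = translate_seeds(seeds, dictionary, edits)
print(translated)
print(missing)
```

The translated list would then seed the L2 BootCaT run, while the original seeds drive the L1 run.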


Bilingual word sketches (very first pass)

For L1 nodeword n
  For each of its translations n1, n2, …
    • For each collocate c in the word sketch for n
      • For each of its translations c1, c2, …
        • Does ci occur as collocate in word sketch for ni?
        • If yes: output <c; ni, ci>
        • Add L1 and L2 example sentences
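
The nested loop above can be sketched in a few lines. This is a minimal reading of the slide, not the Sketch Engine implementation: word sketches are reduced to plain collocate lists or sets (grammatical relations already discarded), the dictionary is a plain mapping, all data are invented toy values, and the example-sentence step is omitted.

```python
def bilingual_sketch(node, sketch_l1, sketch_l2, dictionary):
    """Return triples (c, ni, ci) where ci is a collocate of ni in L2.

    Hypothetical helper; sketch_l1 maps an L1 word to its collocates,
    sketch_l2 maps an L2 word to a set of its collocates.
    """
    triples = []
    for node_tr in dictionary.get(node, []):              # n1, n2, ...
        l2_collocates = sketch_l2.get(node_tr, set())
        for colloc in sketch_l1.get(node, []):            # each collocate c
            for colloc_tr in dictionary.get(colloc, []):  # c1, c2, ...
                if colloc_tr in l2_collocates:
                    triples.append((colloc, node_tr, colloc_tr))
    return triples


# Invented toy data for illustration only.
dictionary = {
    "eruption": ["éruption"],
    "volcanic": ["volcanique"],
    "violent": ["violent", "violente"],
    "party": ["fête"],
}
sketch_l1 = {"eruption": ["volcanic", "violent", "party"]}
sketch_l2 = {"éruption": {"volcanique", "violente", "cutanée"}}

triples = bilingual_sketch("eruption", sketch_l1, sketch_l2, dictionary)
for triple in triples:
    print(triple)
```

What counts as being "in a word sketch" depends on the salience and frequency thresholds applied when the collocate lists are built, which this sketch takes as given.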


Notes

Grammatical relations
• Used to find collocations
• Then thrown away

Thresholds: what is “in a word sketch”

Which dictionary
• Issue: as for seeds

Live (just)
