GDEX: Automatically finding good dictionary examples in a corpus
Kivik 2013 Kilgarriff: GDEX 1
Users appreciate examples
Paper: space constraints
Electronic: no space constraints
  Give lots of examples
Constraint: cost of selection, editing
Project
Macmillan English Dictionary
  Already had 1000 collocation boxes
  Average 8 per box
New electronic version
  All 8000 collocations need examples
  Authentic; from corpus
Old method
Lexicographer
  Gets concordance for collocation
  Reads through until they find a good example
  Cut, paste, edit
New method
Lexicographer
  Gets sorted concordance
    20 best examples in spreadsheet
  Less reading through
  Tick the first good one, edit
What makes a good example?
Readable
  EFL users
Informative
  Typical, for the collocation
  Gives context which helps user understand the target word/phrase
Readability
70 years research
Not just (or mainly) EFL
  Educational theory
    Teaching children to read
  Instruction manuals
    Early work: US military
  Publishing
    People like newspapers and magazines that they find easy to read
Readability tests
Flesch-Kincaid Reading Ease test
  1948
  Average sentence length, average word length
  In some word processing software
Many similar measures
Recent work
  Training data for different reading levels
  Language modelling
  Tailored readability according to domain, L1
Target levels
  US grades
  Now, increasingly: Common European Framework
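As an illustration of a test built on average sentence length and average word length, here is a minimal Python sketch of the standard Flesch Reading Ease formula (higher score = easier text). The syllable counter is a rough assumption (it counts vowel groups) and is not a real syllabifier; the function names are illustrative only.

```python
import re


def count_syllables(word: str) -> int:
    """Rough estimate (assumption): one syllable per group of vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    easy = "The cat sat on the mat."
    hard = "Institutional considerations necessitate comprehensive evaluation."
    print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

Short sentences made of short words score high; long, polysyllabic sentences score low, which is the behaviour the slide describes.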
GDEX
Get concordance for collocation
For each sentence
  Score it
Sort
Show best ones to lexicographer
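The score, sort and show steps can be sketched in a few lines of Python. This is only an outline under stated assumptions: `best_examples` and `toy_score` are hypothetical names, and the toy scorer (1.0 for sentences of 10-26 words, else 0.0) stands in for the real combined score.

```python
from typing import Callable, List, Tuple


def best_examples(
    sentences: List[str],
    score: Callable[[str], float],
    n: int = 20,
) -> List[Tuple[float, str]]:
    """Score every concordance sentence, sort best-first, keep the top n."""
    scored = [(score(s), s) for s in sentences]
    # Stable sort: sentences with equal scores keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:n]


def toy_score(sentence: str) -> float:
    """Placeholder scorer (assumption): reward sentences of 10-26 words."""
    return 1.0 if 10 <= len(sentence.split()) <= 26 else 0.0


if __name__ == "__main__":
    concordance = [
        "Too short.",
        "This sentence has exactly ten words in it, you see.",
    ]
    for value, sentence in best_examples(concordance, toy_score, n=20):
        print(value, sentence)
```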
GDEX heuristics
Sentence length (10-26 words)
Mostly common words is good
  Rare words are bad
Sentences
  Start with capital, end with one of .!?
No [, ], <, >, http, \
Not much other punctuation, numbers
Not too many capitals
Typicality: third collocate is a plus
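A few of these heuristics can be written as small Python checks. This is a sketch, not the GDEX implementation: only the 10-26 word range and the banned strings come from the slide. `COMMON_WORDS` is a tiny hypothetical stand-in for a real frequency list, all function names are illustrative, and the third-collocate check is left out because it needs collocation data.

```python
import re

# Hypothetical stand-in for a real word-frequency list (assumption).
COMMON_WORDS = {
    "the", "a", "an", "of", "to", "in", "for", "on", "at", "and", "she", "he",
    "it", "made", "new", "plan", "strong", "case", "meeting", "today",
}

# Strings the slide says a good example should not contain.
BANNED_STRINGS = ("[", "]", "<", ">", "http", "\\")


def length_ok(sentence: str) -> bool:
    """Sentence length between 10 and 26 words (whitespace tokens)."""
    return 10 <= len(sentence.split()) <= 26


def looks_like_sentence(sentence: str) -> bool:
    """Starts with a capital letter and ends with one of . ! ?"""
    s = sentence.strip()
    return bool(s) and s[0].isupper() and s[-1] in ".!?"


def has_banned_string(sentence: str) -> bool:
    return any(b in sentence for b in BANNED_STRINGS)


def common_word_ratio(sentence: str) -> float:
    """Share of words found in the (hypothetical) common-word list."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    return sum(w in COMMON_WORDS for w in words) / len(words)


def capital_ratio(sentence: str) -> float:
    """Share of words after the first that start with a capital letter."""
    rest = sentence.split()[1:]
    if not rest:
        return 0.0
    return sum(w[0].isupper() for w in rest) / len(rest)


def heuristic_scores(sentence: str) -> dict:
    """Collect the individual heuristic results for one sentence."""
    return {
        "length_ok": length_ok(sentence),
        "looks_like_sentence": looks_like_sentence(sentence),
        "no_banned_string": not has_banned_string(sentence),
        "common_word_ratio": common_word_ratio(sentence),
        "capital_ratio": capital_ratio(sentence),
    }


if __name__ == "__main__":
    print(heuristic_scores("She made a strong case for the new plan at the meeting today."))
    print(heuristic_scores("see http://example.com [link] 1234"))
```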
Weighting
For each sentence
  Score on each heuristic
  Weight scores
  Add together weighted scores
How to set weights?
  Two students:
    Manually judged 1000 “good examples”
  Weights set so system makes same choices as students
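The weighted sum, and one possible way of choosing weights that reproduce human judgements, can be sketched as follows. The slides do not say how the weights were actually fitted; the grid search, the 0.5 threshold and every name below are assumptions for illustration only.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Scores = Dict[str, float]


def weighted_score(scores: Scores, weights: Dict[str, float]) -> float:
    """Multiply each heuristic score by its weight and add the results."""
    return sum(weights[name] * value for name, value in scores.items())


def agreement(
    weights: Dict[str, float],
    judged: List[Tuple[Scores, bool]],
    threshold: float = 0.5,
) -> float:
    """Fraction of judged sentences where the system agrees with the human label."""
    hits = sum(
        (weighted_score(scores, weights) >= threshold) == label
        for scores, label in judged
    )
    return hits / len(judged)


def fit_weights(
    judged: List[Tuple[Scores, bool]],
    names: Sequence[str],
    grid: Sequence[float] = (0.0, 0.25, 0.5, 0.75, 1.0),
) -> Tuple[Dict[str, float], float]:
    """Exhaustive grid search (assumption): keep the first weights with best agreement."""
    best_weights: Dict[str, float] = {}
    best_agreement = -1.0
    for combo in product(grid, repeat=len(names)):
        candidate = dict(zip(names, combo))
        value = agreement(candidate, judged)
        if value > best_agreement:
            best_weights, best_agreement = candidate, value
    return best_weights, best_agreement


if __name__ == "__main__":
    # Toy judgements (invented for the demo): "good" tracks the common-word score.
    judged = [
        ({"length": 1.0, "common": 1.0}, True),
        ({"length": 1.0, "common": 0.0}, False),
        ({"length": 0.0, "common": 1.0}, True),
        ({"length": 0.0, "common": 0.0}, False),
    ]
    print(fit_weights(judged, ["length", "common"]))
```

With real data the grid search would be replaced by whatever fitting procedure suits the number of heuristics; the point is only that weights are chosen to match the judges.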
Was it successful?
Did it save lexicographer time?
  Definitely (says project manager)
Rough guess
  Average number of corpus lines to read until you find a good one:
    Unsorted: 20
    Sorted: 5
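To put the rough guess in perspective, it can be multiplied out over the 8000 collocations mentioned earlier in the talk. This is back-of-envelope arithmetic on the slide's own figures, not a measured result.

```python
# Figures taken from the slides; the totals are only as good as the rough guess.
COLLOCATIONS = 8000
LINES_READ_UNSORTED = 20
LINES_READ_SORTED = 5

unsorted_total = COLLOCATIONS * LINES_READ_UNSORTED
sorted_total = COLLOCATIONS * LINES_READ_SORTED
lines_saved = unsorted_total - sorted_total
fraction_saved = lines_saved / unsorted_total

print(f"unsorted: {unsorted_total} lines, sorted: {sorted_total} lines")
print(f"saved: {lines_saved} lines ({fraction_saved:.0%} less reading)")
```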
Corpus choice
Started with BNC, but
  Too old
  Not enough examples
    If no good examples in corpus, GDEX can’t help
Changed to UKWaC
  20 times bigger; from web; contemporary
  Better
  Most web junk filtered out
Usually a good example in top twenty
GDEX and TALC
TALC
  Teaching and Language Corpora
  Goal: bring corpora into language teaching
Usual problem
  Concordances are tough for learners to read
Way forward
  GDEX examples
  Half way between dictionary and corpus
GDEX: Models for use
More examples for dictionaries
  Speed up, as with MED
  or
  Fully automatic “more examples”
Corpus query tool
  Option in the Sketch Engine
  Only show concordances with high scores
Automatic collocations dictionary
  http://forbetterenglish.com
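The "only show concordances with high scores" option amounts to a threshold filter, as opposed to a top-N cut. A minimal Python sketch follows; it does not use or reproduce any Sketch Engine interface, and `filter_concordance`, `toy_score` and the threshold value are hypothetical.

```python
from typing import Callable, List


def filter_concordance(
    lines: List[str],
    score: Callable[[str], float],
    min_score: float,
) -> List[str]:
    """Keep only concordance lines whose score reaches the threshold, in corpus order."""
    return [line for line in lines if score(line) >= min_score]


def toy_score(line: str) -> float:
    """Placeholder scorer (assumption): penalise lines containing web debris."""
    return 0.0 if "http" in line or "<" in line else 1.0


if __name__ == "__main__":
    lines = [
        "Click here <a href> to see more",
        "They reached a firm decision after a long and difficult debate.",
    ]
    print(filter_concordance(lines, toy_score, min_score=0.5))
```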
Recent developments
Configurable GDEX
  For other languages
  Interface to help set up
Commonest string
  Between ‘bare collocate’ and example