A large list of confusion sets for spellchecking assessed against a corpus of real-word errors
Jenny Pedler, Roger Mitton
LREC 2010
Some real-word errors
The sand-eel is the principle food for many birds and animals.
Our teacher tort us to spell.
Henley Regatta comes near the top of the English social calender.
Spellchecker-induced real-word errors
The Wine Bar Company is opening a chain of brassieres.
The nightwatchman threw the switch and eliminated the backyard.
The original Cupertinos
"reinforcing bilateral and multilateral Cupertino"
"South Asian Association for regional Cupertino"
Confusion sets
{cite, sight, site}
{form, from}
{passed, past}
{peace, piece}
{principal, principle}
{quiet, quite, quit}
{their, there, they're}
{weather, whether}
{you're, your}
The confusion-set approach has been demonstrated to work with
(a) a short list of confusion sets,
(b) artificial test data.
To assess its potential for real, unrestricted text, we need:
(1) a realistically-sized list of confusion sets,
(2) a corpus of running text containing genuine real-word errors.
A list of confusion sets
• Tuned string-to-string edit-distance
• ~6000 sets
• Headword (confusables)
– wright (right, write)
– right (rite, write)
– write (right, rite, writ)
• Inflected forms
• Proper nouns
• Usage errors – e.g. <fewer, less>
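A minimal sketch of how a headword-to-confusables list could be held in memory, using only the three entries shown above. The names CONFUSABLES, candidates and is_symmetric are hypothetical, and the real list's storage format is not described here.

```python
# Hypothetical in-memory form of the list: headword -> its confusables.
# Only the three entries from the slide are included; the real list has ~6000.
CONFUSABLES = {
    "wright": ("right", "write"),
    "right": ("rite", "write"),
    "write": ("right", "rite", "writ"),
}


def candidates(word):
    """Return the alternatives to consider for `word`, or () if it is not a headword."""
    return CONFUSABLES.get(word.lower(), ())


def is_symmetric(a, b):
    """True if each word lists the other among its confusables."""
    return b in candidates(a) and a in candidates(b)
```

In these entries the relation is not symmetric: "wright" lists "right", but "right" does not list "wright".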
A corpus of real-word errors
Sentences 675
Words 12024
Total errors (tokens) 833
Distinct errors (types) 428
Distinct error/target pairs 495
e.g. quit|quiet, quit|quite (one error type, two error/target pairs)
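The three counts above (tokens, types, distinct pairs) can be derived from a list of (error, target) occurrences. This is an illustrative sketch; the function name profile and the toy data are assumptions, not the authors' tooling.

```python
def profile(pairs):
    """Return (tokens, types, distinct_pairs) for a list of (error, target) tuples."""
    tokens = len(pairs)                            # every occurrence counts
    types = len({error for error, _ in pairs})     # distinct error strings
    distinct_pairs = len(set(pairs))               # distinct error/target pairs
    return tokens, types, distinct_pairs


# Toy data: "quit" written once for "quiet" and twice for "quite".
toy = [("quit", "quiet"), ("quit", "quite"), ("quit", "quite")]
```

For the toy data, profile(toy) gives 3 tokens, 1 type and 2 distinct pairs, which is why the corpus has more distinct pairs (495) than distinct errors (428).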
Corpus mark-up example
The collation of the information was <ERR targ = really> relay </ERR> <ERR targ = quite> quit </ERR> easy to do.
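A sketch of reading this mark-up with a regular expression. The pattern is written against the one example sentence only; whether the rest of the corpus uses exactly the same spacing and attribute form is an assumption, and the names ERR_TAG, extract_pairs and corrected are hypothetical.

```python
import re

# Matches <ERR targ = TARGET> ERROR </ERR> as it appears in the example sentence.
ERR_TAG = re.compile(r"<ERR\s+targ\s*=\s*([^>]+?)\s*>\s*(.*?)\s*</ERR>")


def extract_pairs(sentence):
    """Return (error, target) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ERR_TAG.finditer(sentence)]


def corrected(sentence):
    """Replace each marked error with its target word."""
    return ERR_TAG.sub(lambda m: m.group(1), sentence)


sample = (
    "The collation of the information was <ERR targ = really> relay </ERR> "
    "<ERR targ = quite> quit </ERR> easy to do."
)
```

On the sample, extract_pairs yields ("relay", "really") and ("quit", "quite"), and corrected returns the sentence with both targets substituted.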
Corpus profile: Frequent errors
Error|target pair    Frequency
there|their          35
form|from            20
to|too               19
their|there          19
a|an                 18
its|it's             17
your|you're          15
weather|whether      12
cant|can't           10
collage|college      9
Corpus profile: Homophone errors
Homophone set            N. Occs
there, their, they're    38
to, too, two             23
its, it's                17
your, you're             15
weather, whether         12
herd, heard              5
witch, which             4
hear, here               3
wile, while              3
14% of distinct error/target pairs
Corpus profile: Simple errors
Error type                          N. errors   % errors
Omission (e.g. ether, either)       142         29%
Substitution (e.g. vary, very)      104         21%
Insertion (e.g. bellow, below)      56          11%
Transposition (e.g. dose, does)     12          2%
All simple                          314         63%
All error pairs                     495         100%
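The four simple error types can be sketched as a single-edit classifier over an error/target pair. This is an illustration of the categories in the table, not the authors' procedure; classify_simple and _one_deleted are hypothetical names.

```python
def _one_deleted(longer, shorter):
    """True if removing exactly one character from `longer` gives `shorter`."""
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def classify_simple(error, target):
    """Return the simple error type for an error/target pair, or None."""
    le, lt = len(error), len(target)
    if le == lt:
        diffs = [i for i in range(le) if error[i] != target[i]]
        if len(diffs) == 1:
            return "substitution"
        if (len(diffs) == 2 and diffs[1] == diffs[0] + 1
                and error[diffs[0]] == target[diffs[1]]
                and error[diffs[1]] == target[diffs[0]]):
            return "transposition"       # two adjacent letters swapped
        return None
    if lt - le == 1 and _one_deleted(target, error):
        return "omission"                # a letter of the target is missing
    if le - lt == 1 and _one_deleted(error, target):
        return "insertion"               # the error has one extra letter
    return None
```

Pairs that differ by more than one edit, such as shod (should), return None and fall outside the 63% of simple errors.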
How would our list cope with our corpus?
                                                                           Types   Tokens
Detectable and correctable, e.g. shod (should)                             44%     58%
Detectable but not correctable, e.g. martial (material)                    16%     12%
Not detectable (inflection error), e.g. friend (friends), take (taken)     23%     17%
Not detectable (other), e.g. pads (passed)                                 17%     13%
Total (100%)                                                               495     833
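A sketch of the coverage check behind this table: an error is detectable if it is a headword in the list, and correctable if its target is among that headword's confusables. TOY_LIST and coverage are hypothetical; the confusables given for "martial" are assumed for illustration, and the split of non-detectable errors into inflection and other is not modelled.

```python
# Toy stand-in for the list: headword -> set of confusables.
# "shod" -> "should" follows the slide's example; the entries under
# "martial" are assumptions made only for this illustration.
TOY_LIST = {
    "shod": {"should"},
    "martial": {"marital", "marshal"},
}


def coverage(error, target, confusables):
    """Classify an error/target pair by what the list could do with it."""
    if error not in confusables:
        return "not detectable"
    if target in confusables[error]:
        return "detectable and correctable"
    return "detectable but not correctable"
```

With the toy list, shod (should) is detectable and correctable, martial (material) is detectable but not correctable, and pads (passed) is not detectable.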
Non-detectable/non-correctable
Error not a headword (“non-detectable”)    Target not a candidate (“non-correctable”)
Pair         Frequency                     Pair               Frequency
a, an        17                            an, a              4
the, they    4                             cause, because     3
is, his      2                             as, has            2
is, it       2                             easy, easily       2
i, it        2                             for, from          2
u, your      2                             in, is             2
                                           mouths, months     2
                                           none, non          2
                                           no, know           2
Using the list for spellchecking
• Rules based on surrounding context
• May be unreliable
– 25% errors have another error within 2 words
– 9% are another real-word error
• Syntax-based methods
– Easiest to implement
– Shown to have good performance
Syntax-based rules: potential
Tagsets                                                          Types   Tokens
Distinct, e.g. bellow (NN1, VVB, VVI), below (AV0, PRP)          58%     68%
Overlapping, e.g. pray (VVB, VVI, AV0), prey (NN1, VVB, VVI)     31%     25%
Matching, e.g. confirm (VVI, VVB), conform (VVI, VVB)            11%     7%
Total errors (= 100%)                                            299     580
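The three tagset relations can be sketched as a set comparison between the part-of-speech tags of an error and those of its target. The function name tagset_relation is hypothetical; the tags are the ones given in the examples.

```python
def tagset_relation(tags_a, tags_b):
    """Compare two words' possible part-of-speech tags."""
    a, b = set(tags_a), set(tags_b)
    if a == b:
        return "matching"       # syntax alone cannot separate the words
    if a & b:
        return "overlapping"    # syntax helps only in some contexts
    return "distinct"           # a tag mismatch can always flag the error


EXAMPLES = {
    ("bellow", "below"): (("NN1", "VVB", "VVI"), ("AV0", "PRP")),
    ("pray", "prey"): (("VVB", "VVI", "AV0"), ("NN1", "VVB", "VVI")),
    ("confirm", "conform"): (("VVI", "VVB"), ("VVI", "VVB")),
}
```

Applied to EXAMPLES, the three pairs come out as distinct, overlapping and matching respectively, in line with the table.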