A large list of confusion sets for spellchecking assessed against a corpus of real-word errors

Jenny Pedler, Roger Mitton

LREC 2010

Some real-word errors

The sand-eel is the principle food for many birds and animals.

Our teacher tort us to spell.

Henley Regatta comes near the top of the English social calender.

Spellchecker-induced real-word errors

The Wine Bar Company is opening a chain of brassieres.

The nightwatchman threw the switch and eliminated the backyard.

Cupertino, California

... to encourage cooperation and ...

Cupertino

co-operation

....

The original Cupertinos

"reinforcing bilateral and multilateral Cupertino"

"South Asian Association for regional Cupertino"

Confusion sets

{cite, sight, site}

{form, from}

{passed, past}

{peace, piece}

{principal, principle}

{quiet, quite, quit}

{their, there, they're}

{weather, whether}

{you're, your}

He had quiet a young girl staying with him of 17 named Ethel Monticue.

quite?

quit?
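The lookup step behind the example above can be sketched as follows. This is only an illustration, assuming the nine sets listed on the earlier slide; the function names `candidates` and `flag_confusables` are hypothetical, and choosing between the alternatives (quite? quit?) is left to whatever context rules are applied afterwards.

```python
# Confusion sets copied from the list shown earlier in the slides.
CONFUSION_SETS = [
    {"cite", "sight", "site"},
    {"form", "from"},
    {"passed", "past"},
    {"peace", "piece"},
    {"principal", "principle"},
    {"quiet", "quite", "quit"},
    {"their", "there", "they're"},
    {"weather", "whether"},
    {"you're", "your"},
]


def candidates(word):
    """Return the other members of every confusion set containing `word`."""
    word = word.lower()
    found = set()
    for confusion_set in CONFUSION_SETS:
        if word in confusion_set:
            found |= confusion_set - {word}
    return sorted(found)


def flag_confusables(sentence):
    """List (token, alternatives) for each token that has confusables."""
    flagged = []
    for token in sentence.split():
        token = token.strip(".,;:!?")  # crude punctuation handling (assumption)
        alternatives = candidates(token)
        if alternatives:
            flagged.append((token, alternatives))
    return flagged


print(flag_confusables("He had quiet a young girl staying with him"))
# [('quiet', ['quit', 'quite'])]
```

Every occurrence of a set member is flagged, correct or not; the sketch only narrows down where a decision is needed.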

The confusion-set approach has been demonstrated to work with

(a) a short list of confusion sets,

(b) artificial test data.

To assess its potential for real, unrestricted text, we need:

(1) a realistically-sized list of confusion sets,

(2) a corpus of running text containing genuine real-word errors.

A list of confusion sets

• Tuned string-to-string edit-distance

• ~ 6000 sets

• Headword (confusables)
– wright (right, write)
– right (rite, write)
– write (right, rite, writ)

• Inflected forms

• Proper nouns

• Usage errors – e.g. <fewer, less>
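How a headword's confusables might be collected from a dictionary can be sketched with an ordinary edit distance. The slides do not say how the distance was tuned, so the code below swaps in the plain optimal string alignment distance (insert, delete, substitute, adjacent transposition, each at cost 1) as a stand-in; `build_confusables`, the threshold `max_distance` and the toy dictionary are assumptions for illustration. An untuned distance like this would not pair homophones such as right and write, which the real list does.

```python
def osa_distance(a, b):
    """Optimal string alignment distance with unit costs."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "form" vs "from".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def build_confusables(dictionary, max_distance=1):
    """Map each headword to the dictionary words within `max_distance`."""
    words = sorted(dictionary)
    return {
        head: [w for w in words if w != head and osa_distance(head, w) <= max_distance]
        for head in words
    }


# Toy dictionary (assumption), not the dictionary used for the real list.
toy = ["below", "bellow", "form", "from", "quiet", "quit", "quite"]
sets = build_confusables(toy)
print(sets["quit"])   # ['quiet', 'quite']
print(sets["form"])   # ['from']
```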

A corpus of real-word errors

Sentences 675

Words 12024

Total errors (tokens) 833

Distinct errors (types) 428

Distinct error/target pairs 495

quit quiet

quit quite

The collation of the information was <ERR targ = really> relay </ERR> <ERR targ = quite> quit </ERR> easy to do.

Corpus mark-up example
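A minimal sketch of how the three error counts in the corpus table (tokens, types, error/target pairs) could be derived from this mark-up. The regular expression is an assumption based only on the single example shown, and the second sample sentence is invented here to show one error type (quit) with two different targets.

```python
import re

# Matches <ERR targ = TARGET> ERROR </ERR>, tolerating the spacing in the example.
ERR_PATTERN = re.compile(r"<ERR\s+targ\s*=\s*([^>]+?)\s*>\s*(.*?)\s*</ERR>")


def extract_errors(text):
    """Return a list of (error, target) pairs in order of appearance."""
    return [(m.group(2), m.group(1)) for m in ERR_PATTERN.finditer(text)]


sample = (
    "The collation of the information was <ERR targ = really> relay </ERR> "
    "<ERR targ = quite> quit </ERR> easy to do. "
    # Invented sentence (not from the corpus), for illustration only:
    "It was <ERR targ = quiet> quit </ERR> in the room."
)

pairs = extract_errors(sample)
tokens = len(pairs)                                # every marked error
types = len({error for error, _ in pairs})         # distinct error words
distinct_pairs = len(set(pairs))                   # distinct error/target pairs

print(pairs)
print(tokens, types, distinct_pairs)  # 3 2 3
```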

Corpus profile: Frequent errors

Error|target pair Frequency

there|their 35

form|from 20

to|too 19

their|there 19

a|an 18

its|it's 17

your|you're 15

weather|whether 12

cant|can't 10

collage|college 9

Corpus profile: Homophone errors

Homophone set N. Occs

there, their, they're 38

to, too, two 23

its, it's 17

your, you're 15

weather, whether 12

herd, heard 5

witch, which 4

hear, here 3

wile, while 3

14% of distinct error/target pairs

Corpus profile: Simple errors

Error Type N. Errors % Errors

Omission (e.g. ether, either) 142 29%

Substitution (e.g. vary, very) 104 21%

Insertion (e.g. bellow, below) 56 11%

Transposition (e.g. dose, does) 12 2%

All simple 314 63%

All error pairs 495 100%
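The four simple error types can be told apart mechanically. The sketch below assumes each example is written as (error, target), as in the table, and that a "simple" error is exactly one single-letter operation; the function name `classify_simple_error` is hypothetical.

```python
def classify_simple_error(error, target):
    """Return the simple error type for an (error, target) pair, or None."""
    if len(error) + 1 == len(target):
        # The error is the target with one letter left out.
        if any(target[:i] + target[i + 1:] == error for i in range(len(target))):
            return "omission"
    elif len(error) == len(target) + 1:
        # The error is the target with one extra letter.
        if any(error[:i] + error[i + 1:] == target for i in range(len(error))):
            return "insertion"
    elif len(error) == len(target):
        diffs = [i for i in range(len(error)) if error[i] != target[i]]
        if len(diffs) == 1:
            return "substitution"
        if (
            len(diffs) == 2
            and diffs[1] == diffs[0] + 1
            and error[diffs[0]] == target[diffs[1]]
            and error[diffs[1]] == target[diffs[0]]
        ):
            return "transposition"
    return None  # not a simple (single-operation) error


for error, target in [("ether", "either"), ("vary", "very"),
                      ("bellow", "below"), ("dose", "does"),
                      ("pads", "passed")]:
    print(error, target, classify_simple_error(error, target))
```

The last pair, pads (passed), differs by more than one operation and so falls outside the simple category.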

How would our list cope with our corpus?

Category Types Tokens

Detectable and correctable, e.g. shod (should) 44% 58%

Detectable but not correctable, e.g. martial (material) 16% 12%

Not detectable (inflection error), e.g. friend (friends), take (taken) 23% 17%

Not detectable (other), e.g. pads (passed) 17% 13%

Total (100%) 495 833

Non-detectable/non-correctable

Error not a headword (“non-detectable”)

Target not a candidate (“non-correctable”)

Pair Frequency    Pair Frequency

a, an 17    an, a 4

the, they 4    cause, because 3

is, his 2    as, has 2

is, it 2    easy, easily 2

i, it 2    for, from 2

u, your 2    in, is 2

mouths, months 2

none, non 2

no, know 2
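The two definitions above (error not a headword, target not a candidate) amount to a two-step lookup, sketched here. The list excerpt reuses the three headwords shown on the earlier slide; the test pairs are hypothetical and chosen only to exercise each outcome. The sketch does not separate inflection errors from other non-detectable errors.

```python
# Excerpt of the list in headword -> confusables form, from the earlier slide.
CONFUSABLES = {
    "wright": ["right", "write"],
    "right": ["rite", "write"],
    "write": ["right", "rite", "writ"],
}


def assess(error, target, confusables):
    """Classify an error/target pair against a headword -> confusables list."""
    if error not in confusables:
        return "not detectable"                  # error is not a headword
    if target not in confusables[error]:
        return "detectable but not correctable"  # target is not a candidate
    return "detectable and correctable"


# Hypothetical error/target pairs, for illustration only.
print(assess("right", "write", CONFUSABLES))  # detectable and correctable
print(assess("right", "writ", CONFUSABLES))   # detectable but not correctable
print(assess("rite", "right", CONFUSABLES))   # not detectable (in this excerpt)
```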

Using the list for spellchecking

• Rules based on surrounding context

• May be unreliable
– 25% of errors have another error within 2 words
– for 9%, the nearby error is another real-word error

• Syntax-based methods
– Easiest to implement
– Shown to have good performance

Syntax-based rules: potential

Tagsets Types Tokens

Distinct, e.g. bellow (NN1, VVB, VVI), below (AV0, PRP) 58% 68%

? Overlapping, e.g. pray (VVB, VVI, AV0), prey (NN1, VVB, VVI) 31% 25%

Matching, e.g. confirm (VVI, VVB), conform (VVI, VVB) 11% 7%

Total errors (=100%) 299 580
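The three tagset categories can be expressed as simple set comparisons. This sketch assumes that "matching" means identical tagsets and "overlapping" means at least one shared tag (including the case where one tagset contains the other); the tags are those shown in the table, and `tagset_relation` is a hypothetical name.

```python
def tagset_relation(error_tags, target_tags):
    """Compare the part-of-speech tagsets of an error and its target."""
    error_set, target_set = set(error_tags), set(target_tags)
    if error_set == target_set:
        return "matching"     # syntax alone cannot separate the two words
    if error_set & target_set:
        return "overlapping"  # syntax helps only in some contexts
    return "distinct"         # a tag mismatch always signals the error


examples = {
    ("bellow", "below"): (["NN1", "VVB", "VVI"], ["AV0", "PRP"]),
    ("pray", "prey"): (["VVB", "VVI", "AV0"], ["NN1", "VVB", "VVI"]),
    ("confirm", "conform"): (["VVI", "VVB"], ["VVI", "VVB"]),
}

for (error, target), (error_tags, target_tags) in examples.items():
    print(error, target, tagset_relation(error_tags, target_tags))
```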

Resources available for download

www.dcs.bbk.ac.uk/~jenny/resources.html