a program for correcting spelling errors

8

Click here to load reader

Upload: charles-r-blair

Post on 14-Sep-2016

226 views

Category:

Documents


4 download

TRANSCRIPT

Page 1: A program for correcting spelling errors

INFOI~MATION AN'D CONTROL ~, 60-67 (1960)

A Program for Correcting Spelling Errors

CHARLES R. BLAII~

Department of Defense, Washington 25, D. C.

A program using a simple, heuristic procedure for associating "similar" spellings is able to correct misspelled words. Given only a vocabulary of properly spelled words, the computer can correct most (including unanticipated) misspellings without human assistance.

Apart from practical applications, the process is interesting as an example of an unusual form of pattern recognition.

It is tempting to assume that English spelling is too irrational to be explained to a computer. If we limit ourselves to algorithms, perhaps this is true; yet if we give the machine an extensive vocabulary, it can be programmed to recognize as misspelled any word that is not in this list. Even this procedure will not detect all errors, for some misspellings are correct spellings of different words (e.g., advice can become advise). Since such errors can only be detected through context, I avoid this troublesome prospect by considering them as usage rather than spelling errors, and so outside the scope of my title.

Having discovered a word that is not in its vocabulary, what should the program do next? Obviously, it could maintain a dictionary which associates every misspelled word with its correctly spelled equivalent. But this auxiliary dictionary is potentially several times longer than the already sizable vocabulary of correctly spelled words. Unless the basia vocabulary is extremely limited, maintenance of the auxiliary dictionary is impracticable.

Any hope of programming customary orthographic "rules" is de- stroyed at first glance; for, while a machine could easily put " 'i' before 'e' except after ' c ' . . . ," it would have difficulty recognizing " . . . and when pronounced 'a' as in neighbor and weigh." Such coding difficulties, the numerous exceptions, and the lack of rules to cover many spelling errors make this approach unpromising.

If a spelling error is correctable without reference to the context in

6O

Page 2: A program for correcting spelling errors

A PROGRAM FOR CORRECTING SPELLING ERRORS 61

which it appears, then the misspelling must be sufficiently "close" to the correct spelling to permit unique association. Thus, if a machine is given a suitable criterion for computing the "similarity" of words, it can "correct" a spelling error by substituting the "most similar" correctly spelled word for the misspelling. In pattern recognition terms, a mis- spelled word is a pattern that is approximately equivalent to its correct version. Recognizing erroneous spellings requires devising some means of dividing all spellings into equivalence classes and giving the name of the class to each of its members.

How is "similarity" to be measured? One immediately thinks of ad hoc rules (e.g., if all other letters are the same, a word containing "ie" is very similar to a word containing "el") ;I but programming them intro- duces the same difficulties that arise in programming orthographic "rules."

One approach to associating "similar" words is exemplified by the Soundex method, which files names according to a code based on their pronunciation. To form the code, the initial letter of the surname is followed by a 3-digit number which is constructed by ignoring vowels and assigning the same digit to similar sounding consonants in their order of occurrence3 The filing clerk can then select the proper individual from the section of the file specified b y this code, on the basis of given names or other identifying information.

Although widely and successfully used by human clerks, Soundex is not readily adaptable as a machine process for correcting spelling errors. To be sure, the code construction could easily be programmed, but the fact that it associates correct spellings of different words means that an additional distinguishing criterion is required. It seemed more efficient to search for a single "similarity" measurement which normally would uniquely associate a misspelling with its correct equivalent.

An abbreviation is a particular type of "misspelling" which retains enough "similarity" to the original word to permit unique association. Unique association implies that the abbreviation retains the meaningful "kernel" of the word. A spelling error, to be recognizable without using context, must also contain the meaningful "kernel." Thus, we are led

1 An extensive collection of such rules is given in: Searching Aids for Alphabetic and Soundex Files. Remington Rand Management Controls Division, New York, n.d.

This s ta tement is slightly oversimplified. For further details see: Soundex. Remington Rand, New York, n.d.

Page 3: A program for correcting spelling errors

62 BLAIR

~o assume that two words are "s imi lar" if their abbreviations are identi- cal.

An r-letter abbreviation of an n-letter word can be produced by delet- ing those n - r letters which are least important in the identification of the word. The problem of producing an adequate abbreviation is, in application, tha t of deciding which letters in a word are the least impor- tant in determining its meaning. Information theorists assume that the information conveyed by a "message" is inversely proportional to its a priori probability of occurrence. One can apply this idea by eliminating the n - r letters in the order of their expected frequency; we tried this but found that even better results can be obtained by using the "fre- quency" of their occurrence as errors. An empirically constructed ap- proximation of the latter function is given in Table I. The inadequacy of this technique is soon revealed by encounters with abbreviations such as "xpnn" for exponent. Clearly weight must also be given to the position of the letter in the word. The first letter is of greatest importance, and, all other things being equal, the last letter is second in importance, followed by the second letter, the next to last letter, etc. Tha t is, if we reorder the letters in this fashion, the desirability of rejecting a letter in a given position is an increasing, monotonic function of the new posi- tion. An empirically constructed approximation of this function is given

TABLE I THE LOGARITHI~ OF THE DESIRABILITY OF DELETING A LETTER AS A FUNCTION

oF ITS NAME

Letter Score Letter Score

A 5 N 3 B 1 O 4 C 5 P 3 D 0 Q 0 E 7 tt 4 F 1 S 5 G 2 T 3 H 5 U 4 I 6 V 1 J 0 W 1 K 1 X 0 L 5 Y 2 M 1 Z 1

Page 4: A program for correcting spelling errors

A PROGRAM FOR CORRECTING SPELLING ERRORS 63

TABLE I I THE LOGARITHM OF THE DESIRABILITY OF DELETING A LETTER AS A FUNCTION

OF ITS POSITION

Position Score Position Score

1 0 9 5 2 1 10 5 3 2 11 6 4 3 12 6 5 4 13 6 6 4 14 6 7 5 15 6 8 5 16 up 7

in T a b l e I I . B y assuming t h a t the n a m e and pos i t ion of a l e t t e r inde- p e n d e n t l y de t e rmine the des i r ab i l i ty of re jec t ing it, one can fo rm an r - ]e t te r a b b r e v i a t i o n b y de le t ing the n - r l e t t e r s which have the l a rges t p r o d u c t 2 A l t h o u g h the a s sumpt ion of independence is no t s t r i c t l y t rue , i t is suff iciently accu ra t e for our purposes . M o r e ref ined resul ts could be o b t a i n e d b y s tor ing the larger t ab l e r equ i red for d e p e n d e n t var iab les .

Before i t is a sked to correct misspe l led words, t he mach ine m u s t compu te a n d s tore a shor t (we used J le t te rs ) a b b r e v i a t i o n of each cor rec t ly spel led word in the vocabu la ry . These a bb re v i a t i ons are t hen assoc ia ted wi th the i r comple te spel l ings a n d sor ted. T h e mach ine can now correct misspel l ings in a n y t ex t which con ta ins on ly those words in i ts vocabu la ry . R e a d i n g the words in order , i t forms the i r a bb re v i a t i ons a n d selects all iden t i ca l a b b r e v i a t i o n s of cor rec t ly spel led words. Nor - m a l l y th is process gives ~ un ique answer and the spel l ing assoc ia ted

EXAMPLE

A B S O R B E N T A B S O R B A N T 5 1 5 4 4 1 7 3 3. Letter score 5 1 5 4 4 1 5 3 3 0 2 4 5 5 5 4 3 1 Position score 0 2 4 5 5 5 4 3 i 5 3 9 9 9 6 11 6 4 Sum of scores 5 3 9 9 9 6 9 6 4

* * * * * Delete * * * * * A B B T Abbreviation A B B T

wi th t he a b b r e v i a t i o n is t h e n used for o u t p u t (see example) . W h e n an a b b r e v i a t i o n coincides w i t h more t h a n one v o c a b u l a r y en t ry , t he p ro-

3 To minimize time and storage requirements, 3-bit logarithms are added to compute the "product." The crudity of our estimates justifies no higher precision.

Page 5: A program for correcting spelling errors

64 BLAIR

gram compares longer abbreviations of this input word with longer abbreviations of the vocabulary entries it matched until a unique one has been selected. Of course, it is possible that a misspelling will be so extreme that its abbreviation will not appear in the vocabulary. When this happens the machine can do no more than indicate that this word was unidentifiable.

The association of common misspellings with their correctly spelled equivalents is illustrated in Table III. The program correctly identified 89 of the 117 misspelled words (3 required longer abbreviations) while incorrectly identifying only 2. 4 In this ease, the vocabulary of correctly spelled words was limited to the 117 words to be corrected; a larger vocabulary would increase the number of incorrect identifications. Be- fore condemning the machine's performance, test yourself by covering the correctly spelled column and see how well you compare. Unless you are an exceptional speller, it will be an illuminating--and humbl ingJ experience.

The two types of deficiency are easily detectable and correctable. A word that has been incorrectly identified by the program is virtually always conspicuous because it does not fit the context and a word not identified at all is made apparent by the blank space left in the output. These errors arise either because the word was not in the original vocabu- lary or because the misspelling was so extreme that it gave rise to a different abbreviation. The first type of error can be corrected by simply adding the new word to the vocabulary at the next updating run. The second type requires a certain amount of "cheating." A special voeabu- lary updating is used in which the correct spelling of this word and the abbreviation of the particular misspelling are placed in association in the vocabulary. Although inelegant, this procedure is quite efficient in allowing for peculiar exceptions and words that are too short to permit deleting all incorrect letters while maintaining the selected length of abbreviation.

Since this heuristic process was specifically designed for the type of spelling errors normally made by people, it is considerably less effective in eorreeting other types of errors. It would, for example, have little utility in correcting the output of a malfunctioning machine; fortunately, however, we have other means of dealing with these. Similarly, it is not

4 Interferred became intercede and philipinoes became Philippines. Neither of these errors would have occurred if 5 letter abbreviations had been used.

Page 6: A program for correcting spelling errors

T A B L E I I I EXAMPLES OF ASSOCIATING INCORRECT SPELLINGS WITH THEIR CORRECT

EQUIVALENTS BY ~ABBREVIATION ~

(F rom: H u t c h i n s o n , L. I . (1956). "Standard Handbook for Secretaries," 7th Edn . pp. 133-134. McGraw-Hi l l , N e w York . R e p r i n t e d b y permiss ion . )

Cor rec t spel l ing A bbrev i a t i ons Inco r r ec t spel l ing

A B S O R B E N T A B S O R P T I O N A C C O M M O D A T E A C Q U I E S C E A N A L Y Z E A N T A R C T I C A S I N I N E A S S I S T A N C E A U X I L I A R Y B A N A N A B A N K R U P T C Y B R E T H R E N B R I T A I N B U O Y A N C Y C A T E G O R Y C H A U F F E U R C H I M N E Y S C O L I S E U M COLOSSAL C O M M I T M E N T C O M M I T T E E C O N C E D E C O N S C I E N T I O U S C O N S E N S U S C O N T R O V E R S Y

A B B T A B O N A M D T ACQC ANYZ ANTC ASNN ASTN AUXY BANA BAKY BRTN BRTN BUYY CATY C F F R CMYS C O U M COAL COMT COMM COND CONS CONS COVY

C O R R U G A T E D C Y N I C A L D E U C E D E V E L O P D I G N I T A R Y D I S A P P O I N T D R A S T I C A L L Y E C S T A S Y E M B A R R A S S E X A G G E R A T E E X I S T E N C E E X T E N S I O N F E B R U A R Y F I E R Y F I L I P I N O S F L A M M A B L E F O R T H R I G H T F O R T Y FULFILL GNAWING GOVERNMENT

C O G D C Y N L D U C E D V O P D G R Y D I N T D R T Y E C T Y E M B S E X G T E X T N E X T N FBRY FIRY FNOS FMMB FOGT FOTY F U F L G N W G G O V T

G R A M M A R G R M R H E A R T R E N D I N G H D N G H E M O R R H A G E H I N D R A N C E H Y G I E N E I D I O S Y N C R A S Y I N C E N S E I N C I D E N T A L L Y I N F A L L I B L E I N O C U L A T E

H M G E H N D N H Y G N I D Y Y I N N S I N D Y I N F B I N O T

= A B B T A B B N

= A M D T AQUS A N Z E

= A N T C = A S N N = A S T N = A U X Y = B A N A -- B A K Y = B R T N = B R T N

B O Y Y = C A T Y = C F F R

C H M S = COUM = COAL = C O M T = COMM -- C O N D = CONS = CONS = COVY = C O G D

S Y N L = D U C E -- D V O P = D G R Y = D I N T = D R T Y = E C T Y = E M B S = E X G T = E X T N = E X T N = F B R Y = F I R Y

P H N S F L M B

= F O G T = F O T Y = FUFL

KNWG -- G O V T -- G R M R = H D N G = H M G E = H N D N = H Y G N = I D Y Y = I N N S = I N D Y = I N F B

I N N T

A B S O R B A N T A B S O R B T I O N A C C O M O D A T E A Q U I E S E A N A L I Z E A N T A R T I C A S S I N I N E A S S I S T E N C E A U X I L L A R Y B A N A N N A B A N K R U P C Y B R E T H E R E N B R I T I A N B O U Y A N C Y C A T A G O R E Y CHAUFFUER CHIMNIES COLOSIUM COLLOSAL COMMITTMENT COMMITEE CONSEDE CONSCIENTOUS CONCENSUS CONTROVERCY CORRIGATED SYNICAL DUECE DEVELLOPE DIGNATARY DISAPOINT DRASTICLY ECSTACY E M B A R A S S E X A G E R A T E E X I S T A N C E E X T E N T I O N F E B U A R Y F I R E Y P H I L I P I N O E S F L A M A B L E F O R T R I G H T F O U R T Y F U L L F I L K N A W I N G G O V E R M E N T G R A M M E R H E A R T R E N D E R I N G HEMORRAGE HINDERENCE HYGEINE IDIOCYNCRACY INSENSE INCIDENTLY INFALABLE INNOCULATE

65

Page 7: A program for correcting spelling errors

TABLE III--Continued

Correct spelling Abbreviat ions Incorrect spelling

I N S I S T E N C E I N T E R C E D E I N T E R F E R E D J E O P A R D I Z E KIMONO LICENSE LIQUEFY M A I N T E N A N C E M A N A G E M E N T M A N E U V E R MORTGAGED N I C K E L N I N E T Y N I N T H NOWADAYS OCCASIONALLY OCCURRENCE P A M P H L E T P E R M I S S I B L E P E R S E V E R A N C E P E R S U A D E P H I L I P P I N E S P I T T S B U R G H PLAGIARISM PLAYWRIGHT P R A I R I E P R E C E D I N G P R E C I P I C E P R E F E R A B L E PRESUMPTUOUS P R I V I L E G E P R O P E L L E R PSYCHOLOGICAL PUB LICLY P U R S U E R QRUESTIONNAIRE

E C I P I E N T R E L E V A N T RENOWN R E P E L RHAPSODY R H O D O D E N D R O N RHUBARB R H Y T t l M SACRILEGIOUS

INTN = INTN I N T D = I N T D I N F D INTD JODZ JPDS KMNO KMNA LINS LINC LQFY = LQFY MANN MANN M M N T = MMNT MAVR -- MAVR MOGD = MOGD N I K L = N I K L N N T H = N N T H NWDY = NWDY OCNY = OCNY OCNC = OCNC PAMT P H M T PRMB = PRMB PRVN = PRVN P R D E P U R D PHNS = PHNS PBGH PTBG PLGM = PLGM PWGT PLWT P R R E = P R R E P R D G = P R D G P R P C = P R P C P R F B = P R F B PRMS = PRMS PRVG = PRVG PROR = PROR PSYL = PSYL PUBY = PUBY P U R R P R U R QRUTR = QUTR

P N T R P N T RVNT = RVNT RNWN R N U N REPL R P L L R H D Y RADY R D D N -- R D D N RHBB RUBB R H Y M RYTM SAGS = SAGS SFTY = SFTY SCRS SIRS SEZE SIZE SPTE = SPTE SHRD = SHRD SIMR = SIMR SNTY = SNTY SOVR = SOVR SPMN SPMT SUNG = SUNG SUUS = SUUS TRFB -- TRFB UNPD = UNPD USGE = USGE VGTB = VGTB WDDY = WDDY WERD WIRD

SAFETY SCISSORS SEIZE SEPARATE S H E P H E R D SIMILAR S I N C E R I T Y SOUVENIR S P E C I M E N SUING S U R R E P T I T I O U S T R A N S F E R A B L E U N P A R A L L E L E D USAGE VEGETABLE WEDNESDAY W E I R D

INSISTANCE I N T E R S E D E I N T E R F E R R E D J E P R O D I S E KIMONA L ISE N CE LIQUIFY M A I N T A I N A N C E MANAGMENT MA N U V E U R MORTGAUGED N I C K L E N I N T Y N I N E T H NOWDAYS OCASSIONALY OCCURENCE P H A M P L E T PERMISSABLE P E R S E V E R E N C E PU RSU A D E P H I L L I P I N E S P ITTSBURG PLAIGARISM PLAYWRITE P R A R I E P R E C E E D I N G P R E S I P I C E P R E F E R R A B L E PRESUMPTOUS P R I V E L E G E P R O P E L L O R PSYCOLOGICAL PUBLICALLY P E R S U E R QRUESTIONAIRE

E S I P I E N T R E V E L E N T RE N O U N R E P E L L RAPHSODY R H O D O D R E N D O N RUI tBARB R Y T H M SACRELIGIOUS SAFTY SISSERS SIEZE S E P E R A T E S H E P E R D SIMILIAR S I N C E R E T Y SOUVINER S P E C I M E N T SUEING SUREPTITOUS T R A N S F E R R A B L E U N P A R A L E L L E D USEAGE VEGATABLE WEDENSDAY W I E R D

66

Page 8: A program for correcting spelling errors

A PROGRAM FOR CORRECTING SPELLING ERRORS 67

difficult to construct "misspellings" that the process will fail to correct, but it is surprisingly difficult to select such errors from the writings of people.

RECEIVED: August 31, 1959. Revised Sept. 22, 1959

BIBLIOGRAPHY

GLANTZ, m. T., (1957). On the recognition of information with a digital cora- puter. J. Assoc. Computing Mech. 4, 178-188.

KoRoL~v, L. N., (1958). Coding and code compression. J. Assoc. Computing Mach. 5, 328-330.

M~LL:~I~, G. A. and FRIEDMAN, E. A. (1957). The reconstruction of mutilated English texts. Information and Control, 1, 38-55.