

2004 IEEE International Conference on Systems, Man and Cybernetics

A Chinese OCR Spelling Check Approach Based on Statistical Language Models

Li Zhuang, Ta Bao, Xiaoyan Zhu
DCST, Tsinghua University, Beijing, China
State Key Laboratory of Intelligent Technology and Systems, Beijing, China

Chunheng Wang
Fujitsu R&D Center Co. Ltd., Beijing, China

Satoshi Naoi
Fujitsu Laboratories Ltd., Kawasaki, Japan

Abstract - This paper describes an effective spelling check approach for Chinese OCR with a new multi-knowledge based statistical language model. This language model combines the conventional n-gram language model and the new LSA (Latent Semantic Analysis) language model, so both local information (syntax) and global information (semantic) are utilized. Furthermore, Chinese similar characters are used in the Viterbi search process to expand the candidate list in order to add more possible correct results. With our approach, the best recognition accuracy rate increases from 79.3% to 91.9%, which means a 60.9% error reduction.

Keywords: OCR spelling check, post-processing, multi-knowledge based language model, n-gram, LSA, Chinese similar characters.

1 Introduction

OCR (Optical Character Recognition) means that a computer analyses character images automatically to extract the text information, and an OCR engine is usually an ICR (Independent Character Recognition) engine: it recognizes the characters in every position of an image and gives all the possible candidate results. In the ICR recognition process, only image information is used, and the results contain many incorrect words. So spelling correction approaches, which play the most important role in OCR post-processing, are required.

Familiar spelling check approaches are often based on language knowledge, and mainly include rule-based methods and statistic-based methods. Rule-based methods use rule sets, which describe some exact dictionary knowledge such as word or character frequency, part-of-speech information [7] and some other syntactic or morphological features [16] of a language, to detect dubious areas and generate candidate word lists. This kind of method achieves significant success in some special domains, but it is difficult to deal with open natural language. On the other hand, statistic-based methods often use a language model that is achieved by using



some language knowledge and analysing a huge number of language phenomena on a large corpus [8][12][14][15], so more context information is utilized, and this kind of method is suitable for general domains. Moreover, by using language models, the candidate list can be automatically adjusted; thus we can get the correct results more exactly and quickly. Sometimes, rule-based methods and statistic-based methods are used together to achieve better performance [7].

According to the development of natural language processing, using more knowledge from the language itself is required. That is to say, semantic information must be introduced into language models. Since the most widely used language model for OCR spelling check is currently the n-gram model, in which no semantic knowledge is involved, in this paper we construct a multi-knowledge based statistical language model with some semantic information introduced by using the LSA language model [3], and apply it to a Chinese OCR spelling check task. Furthermore, we add some Chinese similar characters information in the Viterbi search process to expand the candidate list in order to add more possible correct results. The experiment results show that after using the language model and the Chinese similar characters information, the recognition accuracy rate increases from 79.3% to 91.9%, which means a 60.9% error reduction.

In section 2 of this paper, we will review some related works, including the n-gram language model and the LSA language model. In section 3, we will introduce the multi-knowledge based language model and the Chinese similar characters information we use in OCR post-processing. Experiment results and the conclusion are in sections 4 and 5, respectively.

2 Related works

2.1 N-gram language model

Statistical language models always use the product of conditional probabilities to compute the appearance probability of a sentence. Suppose s denotes a sentence, w_1 w_2 ... w_t is the word sequence of s, and h_i is all the history information before w_i; then the appearance



    probability of sentence s is

$$P(s) = P(w_1 w_2 \dots w_t) = \prod_{i=1}^{t} P(w_i \mid h_i) \quad (1)$$

The most widely adopted language model is the n-gram language model [13]. It supposes that the word w_i only has relations with the n - 1 immediately preceding words, so formula (1) can be rewritten as

$$P(s) = \prod_{i=1}^{t} P(w_i \mid h_i) = \prod_{i=1}^{t} P(w_i \mid w_{i-n+1}, \dots, w_{i-1}) \quad (2)$$

In n-gram models, n is often constrained to 2 or 3, which means bigram and trigram, respectively. The n-gram language model only considers the sequence of a word string, but not the meanings of the words. That is to say, it involves syntax information only, without semantic information.
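To make formulas (1) and (2) concrete, the following minimal sketch (ours, not the paper's code) scores a sentence with a bigram model (n = 2) estimated by plain maximum-likelihood counts; the toy corpus and function names are illustrative only, and a real system would add smoothing for unseen n-grams.

```python
from collections import defaultdict

def train_bigram(corpus):
    """Count unigrams and bigrams from tokenized sentences (MLE, no smoothing)."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]
        for i, w in enumerate(tokens):
            unigrams[w] += 1
            if i > 0:
                bigrams[(tokens[i - 1], w)] += 1
    return unigrams, bigrams

def sentence_prob(sent, unigrams, bigrams):
    """P(s) = prod_i P(w_i | w_{i-1}), i.e. formula (2) with n = 2."""
    tokens = ["<s>"] + sent + ["</s>"]
    p = 1.0
    for prev, w in zip(tokens, tokens[1:]):
        if unigrams[prev] == 0:
            return 0.0
        p *= bigrams[(prev, w)] / unigrams[prev]
    return p

corpus = [["the", "cat", "sat"], ["the", "cat", "ran"]]
uni, bi = train_bigram(corpus)
print(sentence_prob(["the", "cat", "sat"], uni, bi))  # 0.5
```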

2.2 LSA language model

The LSA (Latent Semantic Analysis) method was presented long ago [6], but it has been combined with statistical language models only in recent years and is now a hot topic.

When considering the occurrence probability of a word based on semantic meaning, the content should be understood first and then used to forecast the occurrence probability, which means using the degree of how congruous the word and the history information are. The occurrence probabilities of a word in different documents are not the same, because the type of a document restricts the use of words in it. Some words that have relations always occur in the same type of documents. The LSA language model is a model that studies the relations between words and types of documents. These kinds of relations can be captured by the model, and that shows the meaning of latent semantic analysis.

The LSA language model analyses the training data by constructing the relations between words and types of documents. First, the following matrix is generated:

$$W = [w_{ij}]_{M \times N} \quad (3)$$

Each row of W corresponds to a word, and each column of W corresponds to a document in the training corpus. The value at the crossing of a row and a column is the number of occurrences of the corresponding word in the corresponding document. For example, w_{ij} = t means word i appears t times in document j.

The size of matrix W is M x N, and it is rather huge for a common training corpus, so the SVD (Singular Value Decomposition) method is used to resolve this problem.

For any W, there exists the following SVD:

$$W \approx \hat{W} = U S V^{T} \quad (4)$$

$$\hat{W} = \sum_{k=1}^{R} s_k u_k v_k^{T} \quad (5)$$

where \hat{W} is the matrix reconstructed from the R maximal singular values of W, U = [u_1, u_2, \dots, u_R], V = [v_1, v_2, \dots, v_R], and S = \operatorname{diag}(s_1, s_2, \dots, s_R). \hat{W} is the best rank-R approximation to W for any unitarily invariant norm:

$$\| W - \hat{W} \| = \min_{\operatorname{rank}(A) = R} \| W - A \| \quad (6)$$

where \| \cdot \| refers to the L2 norm [4]. So we can use \hat{W} to substitute W, and the sizes of U, S, V are much smaller than W.

When we consider W = (US)V^T, namely treat V^T as a group of orthonormal vectors of the R-dimensional LSA space, this decomposition is a projection of the row vectors of W onto the space. Each row of US presents the coordinate of the corresponding row vector of W in the space, which is the coordinate of the corresponding word in the space, too. In the same way, when we consider W = U(SV^T), this decomposition is a projection of the column vectors of W onto the R-dimensional space constructed by the orthonormal vectors of U, and each column of SV^T presents the coordinate of the corresponding document in the space. After SVD, every word has a corresponding coordinate in the space, and every document also has one. Thus the distance between two words or two documents can be calculated easily [1][2].

In the R-dimensional space, the history information h_{q-1} (the q - 1 words that have appeared) can be used to construct a vector d_{q-1} (an M-dimensional vector), where each dimension is a value that indicates the number of times the corresponding word appears in the history. The corresponding coordinate of this vector in the LSA space is

$$\tilde{d}_{q-1} = d_{q-1}^{T} U \quad (7)$$

where \tilde{d}_{q-1} presents the history in the LSA space. We can get the occurrence probabilities of words after the history by calculating the distance between the words and \tilde{d}_{q-1}. Here the following formula is used:

$$K(w_q, \tilde{d}_{q-1}) = \cos(u_q S, \tilde{d}_{q-1} S) \quad (8)$$
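The steps from the count matrix of formula (3) to the similarity of formula (8) can be sketched as follows. This is an illustration under our assumptions, not the authors' implementation; the toy matrix, the rank R, and the helper names are made up.

```python
import numpy as np

# Toy word-document count matrix W (M words x N documents), formula (3).
W = np.array([[2., 0., 1.],
              [0., 3., 0.],
              [1., 1., 0.],
              [0., 0., 2.]])

R = 2  # truncation rank, formulas (4)-(6)
U_full, s_full, Vt_full = np.linalg.svd(W, full_matrices=False)
U, S = U_full[:, :R], np.diag(s_full[:R])  # keep the R largest singular values

def project_history(d):
    """Formula (7): coordinates of a history count vector in the LSA space."""
    return d @ U  # d is an M-dimensional count vector

def similarity(word_index, d_tilde):
    """Formula (8): cosine of the scaled word vector u_q S and scaled history."""
    a, b = U[word_index] @ S, d_tilde @ S
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

d = np.array([1., 0., 1., 0.])       # counts of the words seen in the history
d_tilde = project_history(d)
print([round(similarity(i, d_tilde), 3) for i in range(W.shape[0])])
```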




A distance close to 1 means the word has a strong relation with the history, and a distance close to 0 means the word has a weak relation with the history. The occurrence probability of every word can be calculated by normalizing over the sum of the distances between all the words and the history. Furthermore, the distribution function of the LSA language model can be presented as [5][9]:

$$P_{LSA}(w_q \mid h_{q-1}) = \frac{K(w_q, \tilde{d}_{q-1}) - \min_i K(w_i, \tilde{d}_{q-1})}{\sum_{w_j} \left[ K(w_j, \tilde{d}_{q-1}) - \min_i K(w_i, \tilde{d}_{q-1}) \right]} \quad (9)$$

where h_{q-1} presents the history information of w_q. In practical terms, formula (9) can be modified as

$$P_{LSA}(w_q \mid h_{q-1}) = \frac{K(w_q, \tilde{d}_{q-1}) - \min_i K(w_i, \tilde{d}_{q-1}) + \epsilon}{\sum_{w_j} \left[ K(w_j, \tilde{d}_{q-1}) - \min_i K(w_i, \tilde{d}_{q-1}) + \epsilon \right]}$$

where ε is a very small number added to avoid the phenomenon that for some w_q, K(w_q, \tilde{d}_{q-1}) - min_i K(w_i, \tilde{d}_{q-1}) = 0.
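Continuing the sketch above, the min-shifted, ε-smoothed normalization of formula (9) might look like this; the similarity values and ε are placeholders.

```python
import numpy as np

def lsa_probs(similarities, eps=1e-6):
    """Formula (9), modified form: shift by the minimum similarity so all
    masses are non-negative, add a small eps so no word gets probability 0,
    then normalize to a distribution over the vocabulary."""
    k = np.asarray(similarities, dtype=float)
    shifted = k - k.min() + eps
    return shifted / shifted.sum()

print(lsa_probs([0.9, 0.1, -0.3]))  # highest similarity -> highest probability
```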

3 The language models used in the OCR post-processing

3.1 Overview of our OCR system framework

Our OCR system framework can be shown as Fig. 1. On the left of the dash-dotted line is an OCR engine without post-processing, and the post-processing module is shown as the right part of the line. The kernel of the module is the statistical language model, which utilizes both local information (syntax) and global information (semantic).

Figure 1: Our OCR system framework with the statistical language model in post-processing.

3.2 A multi-knowledge based language model

The perplexity of the LSA language model is much higher than that of the n-gram language model. This fact indicates that though the idea of the LSA language model is appealing, the performance of a language model using only the global history information, without the local history information, is rather poor. That is because thousands of kinds of words can express the same meaning using different sentence structures. So we build a multi-knowledge based language model that combines the LSA language model, which uses the global information, and the conventional n-gram language model, which uses the local information.

The following formula [3] is used to calculate the distribution function P_c of the combined language model:


$$P_c(w_q \mid h_{q-1}) = \frac{P_n(w_q \mid w_{q-n+1}, \dots, w_{q-1}) \, P_{LSA}(\tilde{d}_{q-1} \mid w_q)}{\sum_{w_i} P_n(w_i \mid w_{q-n+1}, \dots, w_{q-1}) \, P_{LSA}(\tilde{d}_{q-1} \mid w_i)} \quad (10)$$

where P_n is the distribution function of the n-gram language model, and P_{LSA} is the distribution function of the LSA language model. In application, the importance of the different language models should be considered. Because the local history information is more important than the global history information, the n-gram language model is primary and



the LSA language model is secondary. So formula (10) is modified as follows, with the Bayes formula used:

$$P_c(w_q \mid h_{q-1}) = \frac{P_n(w_q \mid w_{q-n+1}, \dots, w_{q-1}) \, \left[ P_{LSA}(\tilde{d}_{q-1} \mid w_q) \right]^{\lambda}}{\sum_{w_i} P_n(w_i \mid w_{q-n+1}, \dots, w_{q-1}) \, \left[ P_{LSA}(\tilde{d}_{q-1} \mid w_i) \right]^{\lambda}} \quad (11)$$

where the parameter λ is used to control the weight of the two language models in the combination. It is trained via minimizing the perplexity of the model on a held-out corpus.
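A minimal sketch of the weighted combination in formula (11), assuming (as in our reconstruction above) that λ enters as an exponent on the LSA term in a renormalized product; the candidate lists and probability values are placeholders, and λ = 0.88 is the value reported in section 4.1.

```python
def combined_probs(p_ngram, p_lsa, lam=0.88):
    """Renormalized product of the n-gram term and the LSA term weighted by
    the exponent lam, over the current candidate list (formula (11), as
    reconstructed here)."""
    scores = {w: p_ngram[w] * (p_lsa[w] ** lam) for w in p_ngram}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}

p_ngram = {"bank": 0.6, "tank": 0.4}   # local (syntax) evidence
p_lsa = {"bank": 0.2, "tank": 0.8}     # global (semantic) evidence
print(combined_probs(p_ngram, p_lsa))
```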

The computational complexity of training an LSA model is very high. In order to improve the efficiency, word clustering [11] is used. Then the conditional probability can be calculated with the following formula:

$$P(w_q \mid \tilde{d}_{q-1}) = \sum_{k=1}^{K} P(w_q \mid C_k) \, P(C_k \mid \tilde{d}_{q-1}) \quad (12)$$

where C_k denotes the kth class. In our application, the C-means clustering algorithm is used first to coarsely partition the words into a few classes, and then the bottom-up clustering algorithm is used in every class to get the final clustering results. With word clustering used, the computational cost decreases significantly.
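The two-stage clustering might be sketched as follows. This is our illustration with off-the-shelf algorithms, not the authors' code (which follows [11]): k-means stands in for the coarse C-means partition, and agglomerative clustering refines each class bottom-up; the word vectors and class counts are placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

def two_stage_clusters(word_vectors, n_coarse=10, n_fine_per_class=10):
    """Coarsely partition words with k-means, then run bottom-up
    (agglomerative) clustering inside each coarse class."""
    coarse = KMeans(n_clusters=n_coarse, n_init=10).fit_predict(word_vectors)
    labels = np.empty(len(word_vectors), dtype=int)
    next_id = 0
    for c in range(n_coarse):
        idx = np.where(coarse == c)[0]
        k = min(n_fine_per_class, len(idx))
        if k <= 1:                      # tiny class: keep it as one cluster
            labels[idx] = next_id
            next_id += 1
            continue
        fine = AgglomerativeClustering(n_clusters=k).fit_predict(word_vectors[idx])
        labels[idx] = fine + next_id
        next_id += k
    return labels

vecs = np.random.rand(200, 8)          # placeholder word vectors
print(two_stage_clusters(vecs, n_coarse=4, n_fine_per_class=5)[:20])
```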

3.3 Chinese similar characters information

In Chinese, some characters are usually recognized as other characters because their shapes are very similar. Such characters are called Chinese similar characters. Because an OCR engine depends mainly on image information, and the similar characters often produce confusion, in our application the Chinese similar characters information is introduced as well. Here an MLE (Maximum Likelihood Estimation) based method [10] is used to generate the sets of similar characters. Let F denote the first choice in the candidate list given by the ICR engine, and C denote the correct character; then the approach of similar characters generation can be described as follows (a code sketch follows the list):

(1) Record n(F, C) in the training corpus. n(F, C) is the number of times that F is the first choice in the candidate list while the right character is C.

(2) Calculate the confusion probability P(C|F) = n(F, C)/n(F), where n(F) is the number of times that F is the first choice in the candidate list given by the ICR engine.

(3) Sort all possible C according to the probability P(C|F).

(4) Save the first d possible characters as the similar characters of F, and save their confusion probabilities.

The Chinese similar characters are used in the Viterbi search process, where all the d + 1 candidates, which include the first choice given by the ICR engine and its d similar characters, are used to expand the candidate list and increase the chance that the correct choice appears in the list.
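Steps (1) through (4) can be sketched as follows; this is a minimal illustration, and the (F, C) training pairs are placeholders.

```python
from collections import Counter, defaultdict

def build_confusion_sets(pairs, d=5):
    """pairs: iterable of (F, C), where F is the ICR engine's first choice and
    C is the correct character.  Returns, for each F, the top-d similar
    characters with their confusion probabilities P(C|F) = n(F,C)/n(F)."""
    n_fc = Counter(pairs)                       # step (1): n(F, C)
    n_f = Counter(f for f, _ in pairs)          # n(F)
    sets = defaultdict(list)
    for (f, c), n in n_fc.items():
        sets[f].append((c, n / n_f[f]))         # step (2): P(C|F)
    return {f: sorted(cs, key=lambda x: -x[1])[:d]  # steps (3)-(4)
            for f, cs in sets.items()}

pairs = [("未", "末"), ("未", "末"), ("未", "未"), ("己", "已")]
print(build_confusion_sets(pairs, d=5))
```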

4 Experiment and discussion

4.1 Experiment

The training corpus used here involves many domains and contains about 100 million characters. An n-gram model and an LSA model are built respectively, and then the multi-knowledge based language model (λ = 0.88, 100 word classes) and the conventional n-gram language model are compared on the test corpus, which contains 1018 characters.

The results of the experiments, one only for the language models and the other with the Chinese similar characters used together, on the test corpus are shown in Tab. 1 and Tab. 2, respectively. Trigram and trigram+LSA (described in 3.2) are chosen for the language models test. In the experiments only for the language models test, the perplexities of the trigram language model and the trigram+LSA language model are 185.3 and 183.9, respectively. We use the approach in 3.3 to generate the Chinese similar characters and let d = 5. That is to say, all the 6 characters for each candidate are added to the Viterbi search process.

The best recognition accuracy rate increases from 79.3% to 91.9%, and the recognition error rate decreases by about 60.9%. That indicates the errors of OCR results can be efficiently reduced by using the language model and the similar characters information together.
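The 60.9% figure follows directly from the two accuracy rates: the error rate falls from 100% - 79.3% = 20.7% to 100% - 91.9% = 8.1%, so the relative error reduction is

$$\frac{20.7 - 8.1}{20.7} = \frac{12.6}{20.7} \approx 60.9\%.$$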

4.2 Discussion

The results of the experiments seem to show that the conventional n-gram language model outperforms our multi-knowledge based language model. We believe this is due to the fact that the semantic information is more sensitive to the content of the candidate list than the syntax information. So the candidate list is required to include more possible correct characters when the semantic information is used in a language model. From the results above, we can see that after adding the similar characters information, the absolute increment of the recognition accuracy rate of the trigram language model is 0.7%, while for the trigram+LSA language model the absolute increment is 1.2%. That shows the validity of using similar characters to expand the candidate list, especially when using the semantic information. It can be expected that the multi-knowledge based language model will outperform the conventional n-gram language model if the candidate list includes more possible correct characters.

5 Conclusion

In our approach for Chinese OCR spelling check, we construct a multi-knowledge based language model in which the conventional n-gram language model and the LSA language model are combined, so that both local (syntax) and global (semantic) information are utilized.


References

[11] Sven Martin, Jorg Liermann, Hermann Ney, "Algorithms for bigram and trigram word clustering", Speech Communication, vol. 24, 1998, pp. 19-37.

[12] Surapant Meknavin, "Combining trigram and winnow in Thai OCR error correction", Proceedings of the Seventeenth International Conference on Computational Linguistics (COLING-ACL '98), Montreal, Canada, August 1998, pp. 836-842.

[13] Gerasimos Potamianos, Frederick Jelinek, "A study of n-gram and decision tree letter language modeling methods", Speech Communication, vol. 24, 1998, pp. 171-192.

[14] X. Tong, D. A. Evans, "A statistical approach to automatic OCR error correction in context", Proceedings of the Fourth Workshop on Very Large Corpora (WVLC-4).

[15] X. Tong, ChengXiang Zhai, et al., "OCR correction and query expansion for retrieval on OCR data: CLARIT TREC-5 Confusion Track Report", Proceedings of the Fifth Text Retrieval Conference (TREC-5), NIST Special Publication 500-238, 1996, pp. 341-345.

[16] Pornchai Tummarattananont, "Improvement of Thai OCR error correction using overlapping constraints to correction suggestion", The Fifth Symposium on Natural Language Processing 2002 and Oriental COCOSDA Workshop 2002.
