recognition of the script in serbian documents using frequency occurrence and co-occurrence analysis

15
Hindawi Publishing Corporation e Scientific World Journal Volume 2013, Article ID 896328, 14 pages http://dx.doi.org/10.1155/2013/896328 Research Article Recognition of the Script in Serbian Documents Using Frequency Occurrence and Co-Occurrence Analysis Darko BrodiT, 1 Zoran N. MilivojeviT, 2 and Hedomir A. Maluckov 1 1 Technical Faculty in Bor, University of Belgrade, Vojske Jugoslavije 12, 19210 Bor, Serbia 2 Technical College Niˇ s, Aleksandra Medvedeva 20, 18000 Niˇ s, Serbia Correspondence should be addressed to Darko Brodi´ c; [email protected] Received 18 July 2013; Accepted 21 August 2013 Academic Editors: S. Bourennane, C. Fossati, and J. Marot Copyright © 2013 Darko Brodi´ c et al. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Any document in Serbian language can be written in two different scripts: Latin or Cyrillic. Although characteristics of these scripts are similar, some of their statistical measures are quite different. e paper proposed a method for the extraction of certain script from document according to the occurrence and co-occurrence of the script types. First, each letter is modeled with the certain script type according to characteristics concerning its position in baseline area. en, the frequency analysis of the script types occurrence is performed. Due to diversity of Latin and Cyrillic script, the occurrence of modeled letters shows substantial statistics dissimilarity. Furthermore, the co-occurrence matrix is computed. e analysis of the co-occurrence matrix draws a strong margin as a criteria to distinguish and recognize the certain script. e proposed method is analyzed on the case of a database which includes different types of printed and web documents. e experiments gave encouraging results. 1. Introduction Cryptography studies the problems concerning the conver- sion of information from a readable to some other state. It deals with information which is changing from one to another state. e initial information represents a plain text. When the information becomes encrypted, it is referred as a cipher text [1]. A substitution cipher is a method of encoding. According to it, the units of plain text are replaced with cipher text [2]. ey can be single letters, pairs of letters, triplets of letters, mixtures of the above, and so forth. In our application the encryption function is not needed to be injective [3] due to nature of further statistical analysis. It does not matter if it will encrypt two different plain texts into the same cipher text, because decryption of the cipher text is not considered. Hence, the cryptography is used only as a basis for modeling and analyzing documents written in Serbian language. Serbian language represents the European minority language. However, it is distinct due to its capability to be written in Latin and Cyrillic script, interchangeably. According to the baseline characteristics [4], each letter in the text file is replaced with the cipher which is taken from the set of four counterparts only. e basic idea is to distinguish the script (Latin or Cyrillic) according to statistical analysis of the cipher text. It is accomplished with frequency analysis concerning occurrence [5] as well as with the method using statistical measures extracted from gray-level co-occurrence matrix [6]. e letter frequency distribution is a function which assigns each letter a frequency of its occurrence in the text sample [7]. e gray-level co-occurrence matrix (GLCM) have used for the extraction of features needed for texture classification [8]. Nevertheless, it can be exploited for a letter co-occurrence in a text document [9]. At the final stage, the experiment is made on a custom oriented database containing text from printed and Web documents. e rest of the paper is organized as follows. Section 2 describes the full procedure of the proposed algorithm. Section 3 defines the experiment. Section 4 presents the results from experiment and discusses them. Section 5 makes a conclusion. 2. Proposed Algorithm e proposed algorithm converts document written in Latin and Cyrillic script which represent the plain text into cipher

Upload: bg

Post on 03-Feb-2023

0 views

Category:

Documents


0 download

TRANSCRIPT

Hindawi Publishing CorporationThe Scientific World JournalVolume 2013 Article ID 896328 14 pageshttpdxdoiorg1011552013896328

Research ArticleRecognition of the Script in Serbian Documents UsingFrequency Occurrence and Co-Occurrence Analysis

Darko BrodiT1 Zoran N MilivojeviT2 and Hedomir A Maluckov1

1 Technical Faculty in Bor University of Belgrade Vojske Jugoslavije 12 19210 Bor Serbia2 Technical College Nis Aleksandra Medvedeva 20 18000 Nis Serbia

Correspondence should be addressed to Darko Brodic dbrodictfboracrs

Received 18 July 2013 Accepted 21 August 2013

Academic Editors S Bourennane C Fossati and J Marot

Copyright copy 2013 Darko Brodic et al This is an open access article distributed under the Creative Commons Attribution Licensewhich permits unrestricted use distribution and reproduction in any medium provided the original work is properly cited

Any document in Serbian language can be written in two different scripts Latin or Cyrillic Although characteristics of these scriptsare similar some of their statistical measures are quite different The paper proposed a method for the extraction of certain scriptfrom document according to the occurrence and co-occurrence of the script types First each letter is modeled with the certainscript type according to characteristics concerning its position in baseline area Then the frequency analysis of the script typesoccurrence is performed Due to diversity of Latin and Cyrillic script the occurrence of modeled letters shows substantial statisticsdissimilarity Furthermore the co-occurrence matrix is computedThe analysis of the co-occurrence matrix draws a strong marginas a criteria to distinguish and recognize the certain script The proposed method is analyzed on the case of a database whichincludes different types of printed and web documents The experiments gave encouraging results

1 Introduction

Cryptography studies the problems concerning the conver-sion of information from a readable to some other stateIt deals with information which is changing from one toanother state The initial information represents a plain textWhen the information becomes encrypted it is referred as acipher text [1] A substitution cipher is a method of encodingAccording to it the units of plain text are replaced withcipher text [2] They can be single letters pairs of letterstriplets of letters mixtures of the above and so forth Inour application the encryption function is not needed to beinjective [3] due to nature of further statistical analysis Itdoes not matter if it will encrypt two different plain textsinto the same cipher text because decryption of the ciphertext is not considered Hence the cryptography is used onlyas a basis for modeling and analyzing documents written inSerbian language Serbian language represents the Europeanminority language However it is distinct due to its capabilityto be written in Latin and Cyrillic script interchangeablyAccording to the baseline characteristics [4] each letter in thetext file is replaced with the cipher which is taken from theset of four counterparts only The basic idea is to distinguish

the script (Latin or Cyrillic) according to statistical analysisof the cipher text It is accomplished with frequency analysisconcerning occurrence [5] as well as with the method usingstatistical measures extracted from gray-level co-occurrencematrix [6] The letter frequency distribution is a functionwhich assigns each letter a frequency of its occurrence inthe text sample [7] The gray-level co-occurrence matrix(GLCM) have used for the extraction of features needed fortexture classification [8] Nevertheless it can be exploited fora letter co-occurrence in a text document [9] At the finalstage the experiment is made on a custom oriented databasecontaining text from printed and Web documents

The rest of the paper is organized as follows Section 2describes the full procedure of the proposed algorithmSection 3 defines the experiment Section 4 presents theresults from experiment and discusses them Section 5makesa conclusion

2 Proposed Algorithm

The proposed algorithm converts document written in Latinand Cyrillic script which represent the plain text into cipher

2 The Scientific World Journal

Originaldocument

Latindocument

Cyrillicdocument

Encription Encription

Cipher Latindocument

Cipher Cyrillicdocument

Frequencyoccurrence

analysis

Criteria fordecision making

Correctrecognitionof the script

Co-occurrenceanalysis

Frequencyoccurrence

analysis

Co-occurrenceanalysis

Figure 1 The flow of proposed algorithm

text according to predefined encryption based on text linestructure definition Then the equivalent cipher texts aresubjected to the frequency and co-occurrence analysis Theresults of frequency analysis indicated a substantial differencebetween cipher texts obtained from Latin and Cyrillic textSimilarly co-occurrence analysis shows obvious quantitativedisparity in some measures This draws a strong margin as acriterion in order to distinguish and recognize a certain scripttype (Figure 1)

21 Text Line Structure Text in printed and Web documentsis defined as well-formed text type It is characterized bystrong regularity in shape The distances between the textlines are adequate to be split up The words are formedregularly with similar distance Their inter word spacing isdecent as well However in certain script the letters or signshave different position according to its baseline It is shownin Figure 2

From Figure 2 four virtual lines can be defined [4]

(i) The top-line(ii) The upper-line(iii) The base-line and(iv) The bottom-line

Table 1 Definition of script types according to the baselinecharacteristics

Script example Type of script Designationa Short Sb Ascender Aj Descender Dlj Full F

Accordingly a text line can be considered as beingcomposed of three vertical zones [4]

(i) The upper zone(ii) The middle zone and(iii) The lower zoneEach text line has at least a middle zone The upper zone

depends on capital letters and letters with ascenders whilethe lower zone depends on letters with descenders Only afew letters occupy the upper and lower zone

22 Encryption Two different sets are producedThey are119860119871

and 119860119862for Latin and Cyrillic alphabet respectively

119860119871

= ABC Z a b c z

119860119862

= A24 d a 3 5 e

(1)

Each of them consists of 60 elements that is letters whichare valid for Serbian language Furthermore both sets119860

119871and

119860119862are mapped into set 119862

119891119871

119860119871

997891997888rarr 119862

119891119862

119860119862

997891997888rarr 119862

(2)

These mappings are achieved in accordance with the textline area definitionThe structure of text line allows definitionof following script types [4]

(i) Full letter (F) where letter is present in all three zones(ii) Ascender letter (A) where character parts are present

in the upper and middle zones(iii) Descender letter (D) where character parts are

present in the lower and middle zones and(iv) Short letter (S) where character parts are present in

the middle zone onlyAccordingly all letters will be replaced with the cipher

from the following set

119862 = SAD F (3)

All letters can reach certain position which belongs to set119862 with a unique designation according to Table 1

It should be noted that above mappings are surjectiveSerbian language contains 30 letters Each letter in Latin

has a corresponding equivalent letter in Cyrillic Table 2shows Latin and Cyrillic letters as well as theirs designationaccording to Table 1

Statistical analysis of the letters and theirs correspondingtype for Latin and Cyrillic scripts is shown in Table 3

The Scientific World Journal 3

Ascender (A) Full (F)

Short (S) Descender (D)

Upper zone

Middle zoneLower zone

Top-lineUpper-line

Base-lineBottom-line

Figure 2 Definitions of the script characteristics

Table 2 Serbian Latin and equivalent Cyrillic alphabet according to the script types

Alphabet Latin Script types Latin Script types Cyrillic Script types Cyrillic Script types1 A A a S 0 A 1 S2 B A b A 2 A 3 A3 C A c S 4 F 5 D4 C A c A 6 A 7 A5 C A c A 8 A 9 S6 D A d A F S7 ETH A đ A lt A = F8 Dz A dz A gt F D9 E A e S A A S10 F A f A B A C F11 G A g D D A E S12 H A h A H A I S13 I A i S J A K S14 J A j D N A O D15 K A k A P A Q S16 L A l A R A S S17 Lj F lj F T A U S18 M A m S V A W S19 N A n S X A Y S20 Nj F nj D Z A [ S21 O A o S A ] S22 P A p D ^ A _ S23 R A r S P A a D24 S A s S b A c S25 S A s A d A e S26 T A t A f A g S27 U A u S h A i D28 V A v S j A k S29 Z A z S n A o S30 Z A z A p A q S

23 Frequency Analysis of the Occurrence In the proposedalgorithm all letters from certain script has been substitutedwith equivalent members of the set 119862 according to Table 2These circumstances for Latin document are shown in Fig-ure 3

Figure 3(b) shows the cipher text which is obtained fromLatin documents according to modeling given in Table 2Figures 3(c)ndash3(f) shows a subset of cipher text with eachelement of set 119862 that is S A D and F respectively Statisticalanalysis of the cipher text shows following 2217 elements ofS 598 elements of A 261 elements of D and 8 elements of F

types Accordingly distribution of 119862 set elements for Latindocument is shown in Figure 4

Currently the same Latin document is converted intoCyrillic one Similarly as in Latin document all letters fromCyrillic document are exchanged with the equivalent mem-bers of the set 119862 according to Table 2 These circumstancesfor Cyrillic document are shown in Figure 5

Figure 5(b) shows the cipher text which is obtained fromCyrillic documents according to modeling given in Table 2Figures 5(c)ndash5(f) shows a subset of cipher text with eachelement of set 119862 that is S A D and F respectively

4 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 3 Application of the proposed algorithm (a) Original Latin text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 5

Table 3 Statistical analysis of Latin and Cyrillic script types

Script Type of letters Occurrence of script types Distribution of script typesS A D F S () A () D () F ()

Latin Capital letters 0 28 0 2 0 9333 0 667Latin Small letters 12 13 4 1 4000 4333 1333 334Cyrillic Capital letters 0 27 0 3 0 9000 0 1000Cyrillic Small letters 21 2 5 2 7000 667 1666 667

S A D F0

500

1000

1500

2000

2500

Type of scripts

No

Figure 4 Distribution of 119862 setrsquos elements in cipher text obtainedfrom Latin document

Statistical analysis of the Cyrillic document image showsfollowing 2516 elements of S 53 elements of A 445 elementsof D and 26 elements of F types It should be noted that thesum of all 119862 set elements in Latin and Cyrillic document isnot quite identical It is valid due to difference in definition ofletters in two scripts In the Cyrillic script each letter is givenone and only one sign However in Latin script letters suchas dz lj and nj are represented by two letters Distribution of119862 set elements for Cyrillic document is presented in Figure 6

According to Figures 4 and 6 the comparison chart isdrawn It is shown below in Figure 7

Quantification of the script type appearance in a docu-ment written in Latin and Cyrillic is shown in Table 4

It is obvious that the Latin document compared toCyrillicone has slightly smaller number of short (S) descender (D)and full (F) letters Nonetheless the crucial margin is seen inascender (A) letters Hence it can be a measure of confidencefor detection of the script in a document given in Serbianlanguage

24 Co-Occurrence Analysis Let I be the gray scale imagewhich is under consideration It has 119872 row and 119873 columnswhile119879 is the total number of gray levelsThe spatial relation-ship of gray levels in the image I is expressed by the grayscaleco-occurrencematrix (GLCM)C [6 10] HenceC is a matrixthat describes the frequency of one gray level appearing ina specified spatial linear relationship with another gray levelwithin the area of investigation [11] In order to compute aco-occurrence matrixC we considered a central pixel 119868(119909 119910)

with a neighborhood defined by the window of interest Thiswindow is defined by two parameters inter-pixel distance (119889)

and orientation (120579) Typically the choice of 119889 is 1 (one pixel)

while the value of 120579 depends on the neighborhood Becauseof that each pixel has 8 neighbors given at following angles120579 = 0

∘ 45∘ 90∘ 135∘ 180∘ 225∘ 270∘ 315∘ However thecase of neighbors at 120579 = 0

∘ or at 120579 = 180∘ is similar to the

GLCM definition [12] So the choice may fall to 4 neighborspixels at 120579 = 0

∘ 45∘ 90∘ and 135∘ that is horizontal rightdiagonal vertical and left diagonal [13] These orientationsrefer to 4-adjacent pixels at (119909+119889 119910) (119909 119910minus119889) (119909minus119889 119910) and(119909 119910+119889) where119889 is 1 For each pixel of the neighborhood it iscounted the number of times a pixel pair appears specified bythe distance and orientation parameters The (119894 119895)th entry ofC represents the number of occasions a pixel with an intensity119894 is adjacent to a pixel with an intensity 119895 Hence for the givenimage I the co-occurrence matrix C is defined as [14]

119862 (119894 119895) =

119872

sum

119909=1

119873

sum

119910=1

1if 119868 (119909 119910) = 119894

119868 (119909 + Δ119909 119910 + Δ119910) = 119895

0 otherwise

(4)

where 119894 and 119895 are the image intensity values of the image119909 and 119910 are the spatial positions in the image I The offset(Δ119909 Δ119910) is specifying the distance between the pixel-of-interest and its neighbor It depends on the direction 120579 that isused and the distance 119889 at which the matrix is computedThesquarematrixC is of the order119873 Using a statistical approachlike GLCMprovides a valuable information about the relativeposition of the neighboring pixels in an image [12] In orderto normalize matrix C matrix P is calculated as [10]

119875 (119894 119895) =119862 (119894 119895)

sum119873

119894=1sum119873

119895=1119862 (119894 119895)

(5)

The normalized co-occurrence matrix P is obtained bydividing each element of C by the total number of co-occurrence pairs in C

To illustrate the computing of GLCM a four gray levelimage I is usedThe window parameters are 119889 = 1 and 120579 = 0

(horizontal) Initial matrix I is shown in Figure 8The procedure of calculating co-occurrence matrix for

grayscale matrix I (119889 = 1 and 120579 = 0∘) [12] is given in Figure 9

In order to GLCM be applied in our case set 119862 is mappedinto set 119862

119873by bijective function as

119891119862

119862 997891997888rarr 119862119873

(6)

where 119862119873

= 0 1 2 3 Furthermore the neighborhood isgiven as 2-connected (119909 minus 119889 and 119909 + 119889 around 119909 where119889 = 1) According to that the same document in Latin andCyrillic script is converted into cipher text It is shown belowin Figure 10

6 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 5 Application of the proposed algorithm (a) Original Cyrillic text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

2 The Scientific World Journal

Originaldocument

Latindocument

Cyrillicdocument

Encription Encription

Cipher Latindocument

Cipher Cyrillicdocument

Frequencyoccurrence

analysis

Criteria fordecision making

Correctrecognitionof the script

Co-occurrenceanalysis

Frequencyoccurrence

analysis

Co-occurrenceanalysis

Figure 1 The flow of proposed algorithm

text according to predefined encryption based on text linestructure definition Then the equivalent cipher texts aresubjected to the frequency and co-occurrence analysis Theresults of frequency analysis indicated a substantial differencebetween cipher texts obtained from Latin and Cyrillic textSimilarly co-occurrence analysis shows obvious quantitativedisparity in some measures This draws a strong margin as acriterion in order to distinguish and recognize a certain scripttype (Figure 1)

21 Text Line Structure Text in printed and Web documentsis defined as well-formed text type It is characterized bystrong regularity in shape The distances between the textlines are adequate to be split up The words are formedregularly with similar distance Their inter word spacing isdecent as well However in certain script the letters or signshave different position according to its baseline It is shownin Figure 2

From Figure 2 four virtual lines can be defined [4]

(i) The top-line(ii) The upper-line(iii) The base-line and(iv) The bottom-line

Table 1 Definition of script types according to the baselinecharacteristics

Script example Type of script Designationa Short Sb Ascender Aj Descender Dlj Full F

Accordingly a text line can be considered as beingcomposed of three vertical zones [4]

(i) The upper zone(ii) The middle zone and(iii) The lower zoneEach text line has at least a middle zone The upper zone

depends on capital letters and letters with ascenders whilethe lower zone depends on letters with descenders Only afew letters occupy the upper and lower zone

22 Encryption Two different sets are producedThey are119860119871

and 119860119862for Latin and Cyrillic alphabet respectively

119860119871

= ABC Z a b c z

119860119862

= A24 d a 3 5 e

(1)

Each of them consists of 60 elements that is letters whichare valid for Serbian language Furthermore both sets119860

119871and

119860119862are mapped into set 119862

119891119871

119860119871

997891997888rarr 119862

119891119862

119860119862

997891997888rarr 119862

(2)

These mappings are achieved in accordance with the textline area definitionThe structure of text line allows definitionof following script types [4]

(i) Full letter (F) where letter is present in all three zones(ii) Ascender letter (A) where character parts are present

in the upper and middle zones(iii) Descender letter (D) where character parts are

present in the lower and middle zones and(iv) Short letter (S) where character parts are present in

the middle zone onlyAccordingly all letters will be replaced with the cipher

from the following set

119862 = SAD F (3)

All letters can reach certain position which belongs to set119862 with a unique designation according to Table 1

It should be noted that above mappings are surjectiveSerbian language contains 30 letters Each letter in Latin

has a corresponding equivalent letter in Cyrillic Table 2shows Latin and Cyrillic letters as well as theirs designationaccording to Table 1

Statistical analysis of the letters and theirs correspondingtype for Latin and Cyrillic scripts is shown in Table 3

The Scientific World Journal 3

Ascender (A) Full (F)

Short (S) Descender (D)

Upper zone

Middle zoneLower zone

Top-lineUpper-line

Base-lineBottom-line

Figure 2 Definitions of the script characteristics

Table 2 Serbian Latin and equivalent Cyrillic alphabet according to the script types

Alphabet Latin Script types Latin Script types Cyrillic Script types Cyrillic Script types1 A A a S 0 A 1 S2 B A b A 2 A 3 A3 C A c S 4 F 5 D4 C A c A 6 A 7 A5 C A c A 8 A 9 S6 D A d A F S7 ETH A đ A lt A = F8 Dz A dz A gt F D9 E A e S A A S10 F A f A B A C F11 G A g D D A E S12 H A h A H A I S13 I A i S J A K S14 J A j D N A O D15 K A k A P A Q S16 L A l A R A S S17 Lj F lj F T A U S18 M A m S V A W S19 N A n S X A Y S20 Nj F nj D Z A [ S21 O A o S A ] S22 P A p D ^ A _ S23 R A r S P A a D24 S A s S b A c S25 S A s A d A e S26 T A t A f A g S27 U A u S h A i D28 V A v S j A k S29 Z A z S n A o S30 Z A z A p A q S

23 Frequency Analysis of the Occurrence In the proposedalgorithm all letters from certain script has been substitutedwith equivalent members of the set 119862 according to Table 2These circumstances for Latin document are shown in Fig-ure 3

Figure 3(b) shows the cipher text which is obtained fromLatin documents according to modeling given in Table 2Figures 3(c)ndash3(f) shows a subset of cipher text with eachelement of set 119862 that is S A D and F respectively Statisticalanalysis of the cipher text shows following 2217 elements ofS 598 elements of A 261 elements of D and 8 elements of F

types Accordingly distribution of 119862 set elements for Latindocument is shown in Figure 4

Currently the same Latin document is converted intoCyrillic one Similarly as in Latin document all letters fromCyrillic document are exchanged with the equivalent mem-bers of the set 119862 according to Table 2 These circumstancesfor Cyrillic document are shown in Figure 5

Figure 5(b) shows the cipher text which is obtained fromCyrillic documents according to modeling given in Table 2Figures 5(c)ndash5(f) shows a subset of cipher text with eachelement of set 119862 that is S A D and F respectively

4 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 3 Application of the proposed algorithm (a) Original Latin text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 5

Table 3 Statistical analysis of Latin and Cyrillic script types

Script Type of letters Occurrence of script types Distribution of script typesS A D F S () A () D () F ()

Latin Capital letters 0 28 0 2 0 9333 0 667Latin Small letters 12 13 4 1 4000 4333 1333 334Cyrillic Capital letters 0 27 0 3 0 9000 0 1000Cyrillic Small letters 21 2 5 2 7000 667 1666 667

S A D F0

500

1000

1500

2000

2500

Type of scripts

No

Figure 4 Distribution of 119862 setrsquos elements in cipher text obtainedfrom Latin document

Statistical analysis of the Cyrillic document image showsfollowing 2516 elements of S 53 elements of A 445 elementsof D and 26 elements of F types It should be noted that thesum of all 119862 set elements in Latin and Cyrillic document isnot quite identical It is valid due to difference in definition ofletters in two scripts In the Cyrillic script each letter is givenone and only one sign However in Latin script letters suchas dz lj and nj are represented by two letters Distribution of119862 set elements for Cyrillic document is presented in Figure 6

According to Figures 4 and 6 the comparison chart isdrawn It is shown below in Figure 7

Quantification of the script type appearance in a docu-ment written in Latin and Cyrillic is shown in Table 4

It is obvious that the Latin document compared toCyrillicone has slightly smaller number of short (S) descender (D)and full (F) letters Nonetheless the crucial margin is seen inascender (A) letters Hence it can be a measure of confidencefor detection of the script in a document given in Serbianlanguage

24 Co-Occurrence Analysis Let I be the gray scale imagewhich is under consideration It has 119872 row and 119873 columnswhile119879 is the total number of gray levelsThe spatial relation-ship of gray levels in the image I is expressed by the grayscaleco-occurrencematrix (GLCM)C [6 10] HenceC is a matrixthat describes the frequency of one gray level appearing ina specified spatial linear relationship with another gray levelwithin the area of investigation [11] In order to compute aco-occurrence matrixC we considered a central pixel 119868(119909 119910)

with a neighborhood defined by the window of interest Thiswindow is defined by two parameters inter-pixel distance (119889)

and orientation (120579) Typically the choice of 119889 is 1 (one pixel)

while the value of 120579 depends on the neighborhood Becauseof that each pixel has 8 neighbors given at following angles120579 = 0

∘ 45∘ 90∘ 135∘ 180∘ 225∘ 270∘ 315∘ However thecase of neighbors at 120579 = 0

∘ or at 120579 = 180∘ is similar to the

GLCM definition [12] So the choice may fall to 4 neighborspixels at 120579 = 0

∘ 45∘ 90∘ and 135∘ that is horizontal rightdiagonal vertical and left diagonal [13] These orientationsrefer to 4-adjacent pixels at (119909+119889 119910) (119909 119910minus119889) (119909minus119889 119910) and(119909 119910+119889) where119889 is 1 For each pixel of the neighborhood it iscounted the number of times a pixel pair appears specified bythe distance and orientation parameters The (119894 119895)th entry ofC represents the number of occasions a pixel with an intensity119894 is adjacent to a pixel with an intensity 119895 Hence for the givenimage I the co-occurrence matrix C is defined as [14]

119862 (119894 119895) =

119872

sum

119909=1

119873

sum

119910=1

1if 119868 (119909 119910) = 119894

119868 (119909 + Δ119909 119910 + Δ119910) = 119895

0 otherwise

(4)

where 119894 and 119895 are the image intensity values of the image119909 and 119910 are the spatial positions in the image I The offset(Δ119909 Δ119910) is specifying the distance between the pixel-of-interest and its neighbor It depends on the direction 120579 that isused and the distance 119889 at which the matrix is computedThesquarematrixC is of the order119873 Using a statistical approachlike GLCMprovides a valuable information about the relativeposition of the neighboring pixels in an image [12] In orderto normalize matrix C matrix P is calculated as [10]

119875 (119894 119895) =119862 (119894 119895)

sum119873

119894=1sum119873

119895=1119862 (119894 119895)

(5)

The normalized co-occurrence matrix P is obtained bydividing each element of C by the total number of co-occurrence pairs in C

To illustrate the computing of GLCM a four gray levelimage I is usedThe window parameters are 119889 = 1 and 120579 = 0

(horizontal) Initial matrix I is shown in Figure 8The procedure of calculating co-occurrence matrix for

grayscale matrix I (119889 = 1 and 120579 = 0∘) [12] is given in Figure 9

In order to GLCM be applied in our case set 119862 is mappedinto set 119862

119873by bijective function as

119891119862

119862 997891997888rarr 119862119873

(6)

where 119862119873

= 0 1 2 3 Furthermore the neighborhood isgiven as 2-connected (119909 minus 119889 and 119909 + 119889 around 119909 where119889 = 1) According to that the same document in Latin andCyrillic script is converted into cipher text It is shown belowin Figure 10

6 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 5 Application of the proposed algorithm (a) Original Cyrillic text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 3

Ascender (A) Full (F)

Short (S) Descender (D)

Upper zone

Middle zoneLower zone

Top-lineUpper-line

Base-lineBottom-line

Figure 2 Definitions of the script characteristics

Table 2 Serbian Latin and equivalent Cyrillic alphabet according to the script types

Alphabet Latin Script types Latin Script types Cyrillic Script types Cyrillic Script types1 A A a S 0 A 1 S2 B A b A 2 A 3 A3 C A c S 4 F 5 D4 C A c A 6 A 7 A5 C A c A 8 A 9 S6 D A d A F S7 ETH A đ A lt A = F8 Dz A dz A gt F D9 E A e S A A S10 F A f A B A C F11 G A g D D A E S12 H A h A H A I S13 I A i S J A K S14 J A j D N A O D15 K A k A P A Q S16 L A l A R A S S17 Lj F lj F T A U S18 M A m S V A W S19 N A n S X A Y S20 Nj F nj D Z A [ S21 O A o S A ] S22 P A p D ^ A _ S23 R A r S P A a D24 S A s S b A c S25 S A s A d A e S26 T A t A f A g S27 U A u S h A i D28 V A v S j A k S29 Z A z S n A o S30 Z A z A p A q S

23 Frequency Analysis of the Occurrence In the proposedalgorithm all letters from certain script has been substitutedwith equivalent members of the set 119862 according to Table 2These circumstances for Latin document are shown in Fig-ure 3

Figure 3(b) shows the cipher text which is obtained fromLatin documents according to modeling given in Table 2Figures 3(c)ndash3(f) shows a subset of cipher text with eachelement of set 119862 that is S A D and F respectively Statisticalanalysis of the cipher text shows following 2217 elements ofS 598 elements of A 261 elements of D and 8 elements of F

types Accordingly distribution of 119862 set elements for Latindocument is shown in Figure 4

Currently the same Latin document is converted intoCyrillic one Similarly as in Latin document all letters fromCyrillic document are exchanged with the equivalent mem-bers of the set 119862 according to Table 2 These circumstancesfor Cyrillic document are shown in Figure 5

Figure 5(b) shows the cipher text which is obtained fromCyrillic documents according to modeling given in Table 2Figures 5(c)ndash5(f) shows a subset of cipher text with eachelement of set 119862 that is S A D and F respectively

4 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 3 Application of the proposed algorithm (a) Original Latin text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 5

Table 3 Statistical analysis of Latin and Cyrillic script types

Script Type of letters Occurrence of script types Distribution of script typesS A D F S () A () D () F ()

Latin Capital letters 0 28 0 2 0 9333 0 667Latin Small letters 12 13 4 1 4000 4333 1333 334Cyrillic Capital letters 0 27 0 3 0 9000 0 1000Cyrillic Small letters 21 2 5 2 7000 667 1666 667

S A D F0

500

1000

1500

2000

2500

Type of scripts

No

Figure 4 Distribution of 119862 setrsquos elements in cipher text obtainedfrom Latin document

Statistical analysis of the Cyrillic document image showsfollowing 2516 elements of S 53 elements of A 445 elementsof D and 26 elements of F types It should be noted that thesum of all 119862 set elements in Latin and Cyrillic document isnot quite identical It is valid due to difference in definition ofletters in two scripts In the Cyrillic script each letter is givenone and only one sign However in Latin script letters suchas dz lj and nj are represented by two letters Distribution of119862 set elements for Cyrillic document is presented in Figure 6

According to Figures 4 and 6 the comparison chart isdrawn It is shown below in Figure 7

Quantification of the script type appearance in a docu-ment written in Latin and Cyrillic is shown in Table 4

It is obvious that the Latin document compared toCyrillicone has slightly smaller number of short (S) descender (D)and full (F) letters Nonetheless the crucial margin is seen inascender (A) letters Hence it can be a measure of confidencefor detection of the script in a document given in Serbianlanguage

24 Co-Occurrence Analysis Let I be the gray scale imagewhich is under consideration It has 119872 row and 119873 columnswhile119879 is the total number of gray levelsThe spatial relation-ship of gray levels in the image I is expressed by the grayscaleco-occurrencematrix (GLCM)C [6 10] HenceC is a matrixthat describes the frequency of one gray level appearing ina specified spatial linear relationship with another gray levelwithin the area of investigation [11] In order to compute aco-occurrence matrixC we considered a central pixel 119868(119909 119910)

with a neighborhood defined by the window of interest Thiswindow is defined by two parameters inter-pixel distance (119889)

and orientation (120579) Typically the choice of 119889 is 1 (one pixel)

while the value of 120579 depends on the neighborhood Becauseof that each pixel has 8 neighbors given at following angles120579 = 0

∘ 45∘ 90∘ 135∘ 180∘ 225∘ 270∘ 315∘ However thecase of neighbors at 120579 = 0

∘ or at 120579 = 180∘ is similar to the

GLCM definition [12] So the choice may fall to 4 neighborspixels at 120579 = 0

∘ 45∘ 90∘ and 135∘ that is horizontal rightdiagonal vertical and left diagonal [13] These orientationsrefer to 4-adjacent pixels at (119909+119889 119910) (119909 119910minus119889) (119909minus119889 119910) and(119909 119910+119889) where119889 is 1 For each pixel of the neighborhood it iscounted the number of times a pixel pair appears specified bythe distance and orientation parameters The (119894 119895)th entry ofC represents the number of occasions a pixel with an intensity119894 is adjacent to a pixel with an intensity 119895 Hence for the givenimage I the co-occurrence matrix C is defined as [14]

119862 (119894 119895) =

119872

sum

119909=1

119873

sum

119910=1

1if 119868 (119909 119910) = 119894

119868 (119909 + Δ119909 119910 + Δ119910) = 119895

0 otherwise

(4)

where 119894 and 119895 are the image intensity values of the image119909 and 119910 are the spatial positions in the image I The offset(Δ119909 Δ119910) is specifying the distance between the pixel-of-interest and its neighbor It depends on the direction 120579 that isused and the distance 119889 at which the matrix is computedThesquarematrixC is of the order119873 Using a statistical approachlike GLCMprovides a valuable information about the relativeposition of the neighboring pixels in an image [12] In orderto normalize matrix C matrix P is calculated as [10]

119875 (119894 119895) =119862 (119894 119895)

sum119873

119894=1sum119873

119895=1119862 (119894 119895)

(5)

The normalized co-occurrence matrix P is obtained bydividing each element of C by the total number of co-occurrence pairs in C

To illustrate the computing of GLCM a four gray levelimage I is usedThe window parameters are 119889 = 1 and 120579 = 0

(horizontal) Initial matrix I is shown in Figure 8The procedure of calculating co-occurrence matrix for

grayscale matrix I (119889 = 1 and 120579 = 0∘) [12] is given in Figure 9

In order to GLCM be applied in our case set 119862 is mappedinto set 119862

119873by bijective function as

119891119862

119862 997891997888rarr 119862119873

(6)

where 119862119873

= 0 1 2 3 Furthermore the neighborhood isgiven as 2-connected (119909 minus 119889 and 119909 + 119889 around 119909 where119889 = 1) According to that the same document in Latin andCyrillic script is converted into cipher text It is shown belowin Figure 10

6 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 5 Application of the proposed algorithm (a) Original Cyrillic text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

4 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 3 Application of the proposed algorithm (a) Original Latin text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 5

Table 3 Statistical analysis of Latin and Cyrillic script types

Script Type of letters Occurrence of script types Distribution of script typesS A D F S () A () D () F ()

Latin Capital letters 0 28 0 2 0 9333 0 667Latin Small letters 12 13 4 1 4000 4333 1333 334Cyrillic Capital letters 0 27 0 3 0 9000 0 1000Cyrillic Small letters 21 2 5 2 7000 667 1666 667

S A D F0

500

1000

1500

2000

2500

Type of scripts

No

Figure 4 Distribution of 119862 setrsquos elements in cipher text obtainedfrom Latin document

Statistical analysis of the Cyrillic document image showsfollowing 2516 elements of S 53 elements of A 445 elementsof D and 26 elements of F types It should be noted that thesum of all 119862 set elements in Latin and Cyrillic document isnot quite identical It is valid due to difference in definition ofletters in two scripts In the Cyrillic script each letter is givenone and only one sign However in Latin script letters suchas dz lj and nj are represented by two letters Distribution of119862 set elements for Cyrillic document is presented in Figure 6

According to Figures 4 and 6 the comparison chart isdrawn It is shown below in Figure 7

Quantification of the script type appearance in a docu-ment written in Latin and Cyrillic is shown in Table 4

It is obvious that the Latin document compared toCyrillicone has slightly smaller number of short (S) descender (D)and full (F) letters Nonetheless the crucial margin is seen inascender (A) letters Hence it can be a measure of confidencefor detection of the script in a document given in Serbianlanguage

24 Co-Occurrence Analysis Let I be the gray scale imagewhich is under consideration It has 119872 row and 119873 columnswhile119879 is the total number of gray levelsThe spatial relation-ship of gray levels in the image I is expressed by the grayscaleco-occurrencematrix (GLCM)C [6 10] HenceC is a matrixthat describes the frequency of one gray level appearing ina specified spatial linear relationship with another gray levelwithin the area of investigation [11] In order to compute aco-occurrence matrixC we considered a central pixel 119868(119909 119910)

with a neighborhood defined by the window of interest Thiswindow is defined by two parameters inter-pixel distance (119889)

and orientation (120579) Typically the choice of 119889 is 1 (one pixel)

while the value of 120579 depends on the neighborhood Becauseof that each pixel has 8 neighbors given at following angles120579 = 0

∘ 45∘ 90∘ 135∘ 180∘ 225∘ 270∘ 315∘ However thecase of neighbors at 120579 = 0

∘ or at 120579 = 180∘ is similar to the

GLCM definition [12] So the choice may fall to 4 neighborspixels at 120579 = 0

∘ 45∘ 90∘ and 135∘ that is horizontal rightdiagonal vertical and left diagonal [13] These orientationsrefer to 4-adjacent pixels at (119909+119889 119910) (119909 119910minus119889) (119909minus119889 119910) and(119909 119910+119889) where119889 is 1 For each pixel of the neighborhood it iscounted the number of times a pixel pair appears specified bythe distance and orientation parameters The (119894 119895)th entry ofC represents the number of occasions a pixel with an intensity119894 is adjacent to a pixel with an intensity 119895 Hence for the givenimage I the co-occurrence matrix C is defined as [14]

119862 (119894 119895) =

119872

sum

119909=1

119873

sum

119910=1

1if 119868 (119909 119910) = 119894

119868 (119909 + Δ119909 119910 + Δ119910) = 119895

0 otherwise

(4)

where 119894 and 119895 are the image intensity values of the image119909 and 119910 are the spatial positions in the image I The offset(Δ119909 Δ119910) is specifying the distance between the pixel-of-interest and its neighbor It depends on the direction 120579 that isused and the distance 119889 at which the matrix is computedThesquarematrixC is of the order119873 Using a statistical approachlike GLCMprovides a valuable information about the relativeposition of the neighboring pixels in an image [12] In orderto normalize matrix C matrix P is calculated as [10]

119875 (119894 119895) =119862 (119894 119895)

sum119873

119894=1sum119873

119895=1119862 (119894 119895)

(5)

The normalized co-occurrence matrix P is obtained bydividing each element of C by the total number of co-occurrence pairs in C

To illustrate the computing of GLCM a four gray levelimage I is usedThe window parameters are 119889 = 1 and 120579 = 0

(horizontal) Initial matrix I is shown in Figure 8The procedure of calculating co-occurrence matrix for

grayscale matrix I (119889 = 1 and 120579 = 0∘) [12] is given in Figure 9

In order to GLCM be applied in our case set 119862 is mappedinto set 119862

119873by bijective function as

119891119862

119862 997891997888rarr 119862119873

(6)

where 119862119873

= 0 1 2 3 Furthermore the neighborhood isgiven as 2-connected (119909 minus 119889 and 119909 + 119889 around 119909 where119889 = 1) According to that the same document in Latin andCyrillic script is converted into cipher text It is shown belowin Figure 10

6 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 5 Application of the proposed algorithm (a) Original Cyrillic text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 5

Table 3 Statistical analysis of Latin and Cyrillic script types

Script Type of letters Occurrence of script types Distribution of script typesS A D F S () A () D () F ()

Latin Capital letters 0 28 0 2 0 9333 0 667Latin Small letters 12 13 4 1 4000 4333 1333 334Cyrillic Capital letters 0 27 0 3 0 9000 0 1000Cyrillic Small letters 21 2 5 2 7000 667 1666 667

S A D F0

500

1000

1500

2000

2500

Type of scripts

No

Figure 4 Distribution of 119862 setrsquos elements in cipher text obtainedfrom Latin document

Statistical analysis of the Cyrillic document image showsfollowing 2516 elements of S 53 elements of A 445 elementsof D and 26 elements of F types It should be noted that thesum of all 119862 set elements in Latin and Cyrillic document isnot quite identical It is valid due to difference in definition ofletters in two scripts In the Cyrillic script each letter is givenone and only one sign However in Latin script letters suchas dz lj and nj are represented by two letters Distribution of119862 set elements for Cyrillic document is presented in Figure 6

According to Figures 4 and 6 the comparison chart isdrawn It is shown below in Figure 7

Quantification of the script type appearance in a docu-ment written in Latin and Cyrillic is shown in Table 4

It is obvious that the Latin document compared toCyrillicone has slightly smaller number of short (S) descender (D)and full (F) letters Nonetheless the crucial margin is seen inascender (A) letters Hence it can be a measure of confidencefor detection of the script in a document given in Serbianlanguage

24 Co-Occurrence Analysis Let I be the gray scale imagewhich is under consideration It has 119872 row and 119873 columnswhile119879 is the total number of gray levelsThe spatial relation-ship of gray levels in the image I is expressed by the grayscaleco-occurrencematrix (GLCM)C [6 10] HenceC is a matrixthat describes the frequency of one gray level appearing ina specified spatial linear relationship with another gray levelwithin the area of investigation [11] In order to compute aco-occurrence matrixC we considered a central pixel 119868(119909 119910)

with a neighborhood defined by the window of interest Thiswindow is defined by two parameters inter-pixel distance (119889)

and orientation (120579) Typically the choice of 119889 is 1 (one pixel)

while the value of 120579 depends on the neighborhood Becauseof that each pixel has 8 neighbors given at following angles120579 = 0

∘ 45∘ 90∘ 135∘ 180∘ 225∘ 270∘ 315∘ However thecase of neighbors at 120579 = 0

∘ or at 120579 = 180∘ is similar to the

GLCM definition [12] So the choice may fall to 4 neighborspixels at 120579 = 0

∘ 45∘ 90∘ and 135∘ that is horizontal rightdiagonal vertical and left diagonal [13] These orientationsrefer to 4-adjacent pixels at (119909+119889 119910) (119909 119910minus119889) (119909minus119889 119910) and(119909 119910+119889) where119889 is 1 For each pixel of the neighborhood it iscounted the number of times a pixel pair appears specified bythe distance and orientation parameters The (119894 119895)th entry ofC represents the number of occasions a pixel with an intensity119894 is adjacent to a pixel with an intensity 119895 Hence for the givenimage I the co-occurrence matrix C is defined as [14]

119862 (119894 119895) =

119872

sum

119909=1

119873

sum

119910=1

1if 119868 (119909 119910) = 119894

119868 (119909 + Δ119909 119910 + Δ119910) = 119895

0 otherwise

(4)

where 119894 and 119895 are the image intensity values of the image119909 and 119910 are the spatial positions in the image I The offset(Δ119909 Δ119910) is specifying the distance between the pixel-of-interest and its neighbor It depends on the direction 120579 that isused and the distance 119889 at which the matrix is computedThesquarematrixC is of the order119873 Using a statistical approachlike GLCMprovides a valuable information about the relativeposition of the neighboring pixels in an image [12] In orderto normalize matrix C matrix P is calculated as [10]

119875 (119894 119895) =119862 (119894 119895)

sum119873

119894=1sum119873

119895=1119862 (119894 119895)

(5)

The normalized co-occurrence matrix P is obtained bydividing each element of C by the total number of co-occurrence pairs in C

To illustrate the computing of GLCM a four gray levelimage I is usedThe window parameters are 119889 = 1 and 120579 = 0

(horizontal) Initial matrix I is shown in Figure 8The procedure of calculating co-occurrence matrix for

grayscale matrix I (119889 = 1 and 120579 = 0∘) [12] is given in Figure 9

In order to GLCM be applied in our case set 119862 is mappedinto set 119862

119873by bijective function as

119891119862

119862 997891997888rarr 119862119873

(6)

where 119862119873

= 0 1 2 3 Furthermore the neighborhood isgiven as 2-connected (119909 minus 119889 and 119909 + 119889 around 119909 where119889 = 1) According to that the same document in Latin andCyrillic script is converted into cipher text It is shown belowin Figure 10

6 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 5 Application of the proposed algorithm (a) Original Cyrillic text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

6 The Scientific World Journal

(a) (b)

(c) (d)

(e) (f)

Figure 5 Application of the proposed algorithm (a) Original Cyrillic text (b) Cipher text according to set 119862 (c) Only ldquoSrdquo text (d) Only ldquoFrdquotext (e) Only ldquoArdquo text (f) Only ldquoDrdquo text

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 7

Table 4 Percentage of script type occurrence in document

Type of script (TOS) Latin Cyrillic 119909 timesS 7188 8276 087A 1939 174 1114D 846 1464 057F 027 086 031It is obvious that the Latin document compared to Cyrillic one has slightly smaller number of short (S) descender (D) and full (F) letters Nonetheless thecrucial margin is seen in ascender (A) letters Hence it can be a measure of confidence for detection of the script in a document given in Serbian language

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

No

Figure 6 Distribution of 119862 sets elements in cipher text obtainedfrom Cyrillic document

S A D F0

500

1000

1500

2000

2500

3000

Type of scripts

LatinCyrillic

No

Figure 7 Comparison between distributions of 119862 set elements inLatin and equivalent Cyrillic document

1 0 0 20 1 1 12 2 1 10 0 0 00 1 0 10 0 2 30 0 1 1

I =

Figure 8 Initial 4-level grayscale matrix I (featuring 119872 = 7 119873 = 4and 119871 = 4)

Table 5 Normalized cooccurrence matrix

(a) For cipher text obtained from the Latin text

03722 02212 00623 0004802220 00343 00072 000623 00072 00016 000048 0 0 0

(b) For cipher text obtained from Cyrillic text

05863 00327 01326 0006400391 00104 00144 001262 00200 00224 0001600072 0 00008 0

To evaluate these cipher documents GLCM method isemployed Nevertheless various statistic measures obtainedfrom the co-occurrence matrix is introduced The primarygoal is to characterize the cipher text Five descriptors can beused to describe the image [15]

(i) Uniformity (UNI)(ii) Entropy (ENT)(iii) Maximum probability (MAX)(iv) Dissimilarity (DIS) and(v) Contrast (CON)

Uniformity (UNI) which is sometimes called angularsecondmoment (ASM) or energy (ENG)measures the imagehomogeneity It receives the highest value when GLCM hasfew entries of large magnitude In contrast it is low when allentries are nearly equalThe equation of the uniformity is [15]

UNI =

119873

sum

119894=1

119873

sum

119895=1

119875(119894 119895)2 (7)

Entropy (ENT) measures the disorder or the complexityof the image The highest value is found when the values of119875(119894 119895) are allocated quite uniformly throughout the matrixThis happens when the image has no pairs of gray level withparticular preference over othersThe equation of the entropyis [15 16]

ENT = minus

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot log119875 (119894 119895) (8)

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

8 The Scientific World Journal

j

i 0 1 2 3 0 6 4 2 0 6 4 2 0 621 421 221 0

1 2 4 0 0 2 4 0 0 221 421 0 02 0 1 1 1 0 1 1 1 0 121 121 1213 0 0 0 0 0 0 0 0 0 0 0 0

(a) (b) (c)

≫ ≫C = P =

Figure 9 Co-occurrence matrix for grayscale matrix I (119889 = 1 and 120579 = 0∘) (a)The number of occasions a pixel with an intensity 119868 is adjacent

to a pixel with intensity 119895 (b) Co-occurrence matrix C (c) Normalized matrix P

(a) (b)

(c) (d)

Figure 10 Document conversion (a) Original Cyrillic text (b) Cipher text obtained fromCyrillic text according to set 119862119873 (c) Original Latin

text (equivalent to Cyrillic one) (d) Cipher text obtained from Latin text according to set 119862119873

Maximum probability (MAX) extracts the most probabledifference between gray scale value in pixels It is defined as[15]

MAX = max 119875 (119894 119895) (9)Dissimilarity (DIS) is a measure of the variation in gray

level pairs of the image It depends on distance from the

diagonal weighted by its probability The equation of thedissimilarity is [15]

DIS =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot1003816100381610038161003816119894 minus 119895

1003816100381610038161003816 (10)

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 9

(a) (b)

(c) (d)

Figure 11 Custom-oriented database (a) Printed document in Latin (b) Printed document in Cyrillic (c) Web document in Latin (d) Webdocument in Cyrillic

Contrast (CON) or inertia is a measure of the intensitycontrast between a pixel and its neighbor over the entireimage Hence it shows the amount of local variations presentin the image If the image is constant then the contrast willbe equal 0 The highest value of contrast is obtained whenthe image has random intensity and the pixel intensity and

neighbor intensity are very different The equation of thecontrast is [15 16]

CON =

119873

sum

119894=1

119873

sum

119895=1

119875 (119894 119895) sdot (119894 minus 119895)2 (11)

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

10 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

100

01

02

03

04

05

06

07

08

09

DocumentsShortDescendingFull

x-ti

mes

(a)

0

2

4

6

8

10

12

14

16

18

Ascending

x-ti

mes

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

Figure 12 The ratio of the script type occurrence (a) short descending and full (b) ascending

Table 6 Cooccurrence descriptors for Latin and Cyrillic cipher text

Serbian language Latin Cyrillic CharacterizationUniformity (energy) 02459 03811 Latin lt CyrillicEntropy minus16298 minus14363 Latin gt CyrillicMaximum probability 03722 05863 Latin lt CyrillicDissimilarity 07356 06669 Latin gt CyrillicContrast 10423 12660 Latin lt CyrillicFrom the above results it is clear that co-occurrence descriptors can fully characterize the difference between Latin andCyrillic scriptThismeans that frequencyanalysis of the occurrence can be supplemented with additional attributes in order to define a strong margin as a criterion to distinguish a certain script

A brief look at the normalized co-occurrence matrix Pfor the same document written in Latin and Cyrillic scripts(text representing the excerpt of the first four paragraphsfrom a document given in Figure 10) shows quite a differentcharacterization The test results are given in Table 5

Furthermore the calculation of five co-occurrencedescriptors shows the values given in Table 6

3 Experiments

For the sake of the experiment a custom-oriented databaseis created It consists of 10 documents These documentsrepresent excerpts from printed and web documents writtenin Serbian language The documents are created in bothscripts Latin and Cyrillic Printed documents are createdfrom PDF documents while web documents are extractedfrom web news The total length of documents given inthe database is approx 75000 letter characters per script(approx 40 pages) The length of printed documents is from2273 to 15840 letter characters Web documents are smallercompared to printed documents Their length is from 1231 to2502 letter characters It should be noted that all documentshave more than 1000 letter characters The example of the

printed and web document from the database is shown inFigure 11

4 Results and Discussion

According to the proposed algorithm all documents fromthe database are converted into equivalent cipher texts andsubjected to the frequency and co-occurrence analysis Firstthe frequency analysis of the script type occurrence in Latinas well as in Cyrillic documents is examined (Table 7) Theobtained results for each document are given in Table 8

The final processing of the results is based on cumulativemeasures like sum average max and min of script typeoccurrence in the database According to that the criteria areestablished All these are shown in Table 9

From cumulative results given in Table 10 some criteriacan be established It can be noted that the biggest marginbetween results are seen in the ratio of ascending letters Thisratio has the value of at least 8 Hence it is the strongestpoint of qualitative characterization and recognition of thecertain script Furthermore the smaller number of short anddescending scripts are common in Latin compared to Cyrillicdocuments At the and full letters are quite rare in a Latindocument However its characterization in criteria form is

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 11

Table 7 Frequency analysis of the script type occurrence in documents from database

Printed documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 2243 2764 11396 13593 1510 1914 2217 2516 2069 2542A 906 53 4060 306 693 65 598 53 897 64D 183 468 724 1933 82 286 261 445 151 461F 0 7 0 8 0 8 8 26 0 12

Web documents

Type of script Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic

S 1486 1799 1358 1682 783 996 1657 2078 1328 1637A 598 48 636 46 408 48 750 62 588 68D 99 304 75 292 58 174 134 344 99 284F 0 7 0 9 0 13 0 18 0 9The above results are further processed in order to calculate the ratio of script type occurrence between Latin and Cyrillic document Complete results aregiven in Table 8

Table 8 The ratio of script type occurrence between Latin and Cyrillic documents

Type of script Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10S 081 084 079 088 081 083 081 079 080 081A 1709 1327 1066 1128 1402 1246 1383 850 1210 865D 039 037 029 059 033 033 026 033 039 035F 000 000 000 031 000 000 000 000 000 000The results are presented in Figure 12

Table 9 The ratio of script type occurrence measures

Type of script sum Ratio CriteriaLatin Cyrillic Average Max Min

S 26047 31521 082 088 079 gt075A 10134 813 1221 1709 850 gt8D 1866 4991 036 059 026 lt06F 8 117 003 031 000

Table 10 GLCM five descriptors of the script type co-occurrence in documents from database

Printed documentsDoc 1 Doc 2 Doc 3 Doc 4 Doc 5

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02885 04725 02473 04167 02557 04120 02759 04545 02707 04498Entropy minus15191 minus11774 minus16379 minus13079 minus16047 minus12999 minus15675 minus11650 minus15847 minus11799Max probability 04655 06636 03952 06139 04120 06098 04439 06457 04349 06405Dissimilarity 06847 05933 07469 06592 07502 06427 07064 06041 07117 06217Contrast 10324 11790 11106 12859 11258 12261 10577 11449 10630 11949

Web documentsDoc 6 Doc 7 Doc 8 Doc 9 Doc 10

Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin Cyrillic Latin CyrillicUniformity 02447 03714 02754 03817 02533 05005 02252 03147 02522 03325Entropy minus16524 minus13738 minus15725 minus13412 minus15990 minus10779 minus16778 minus15650 minus16144 minus15059Max probability 03964 05650 04409 05753 03972 06844 03195 05154 04016 05318Dissimilarity 07723 07320 06912 07209 07294 05686 08317 07667 07256 07416Contrast 11862 13869 10287 13681 10459 11158 12122 14220 10641 13716The above results are further processed in order to calculate the ratio of script type co-occurrence in between Latin and Cyrillic document These results areshown in Table 11

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

12 The Scientific World Journal

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

00102030405060708

DocumentsUniformity

x-ti

mes

02025

03035

04045

05055

Uni

form

ity (e

nerg

y)

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(a)

Entro

py

0

05

1

15

minus17

minus16

minus15

minus14

minus13

minus12

1

minus11

minus09

x-ti

mes

EntropyDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

(b)

03504

04505

05506

06507

Max

imum

pro

babi

lity

00102030405060708

x-ti

mes

Maximum probabilityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(c)

05506

06507

07508

08509

095

Diss

imila

rity

0

02

04

06

08

1

12

x-ti

mes

DissimilarityDocuments

Latin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(d)

Figure 13 Continued

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World Journal 13

1

11

12

13

14

15

Con

tras

t

0

02

04

06

08

1

Contrast

x-ti

mes

DocumentsLatin script Cyrillic script

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

Documents

Doc

1

Doc

2

Doc

3

Doc

4

Doc

5

Doc

6

Doc

7

Doc

8

Doc

9

Doc

10

(e)

Figure 13 Illustrations of co-occurrence descriptors in Latin and Cyrillic text (left) and its ratio (right) (a) Uniformity (b) Entropy (c)Maximum probability (d) Dissimilarity (e) Contrast

Table 11 The ratio of the co-occurrence descriptors between Latin and Cyrillic documents

Doc 1 Doc 2 Doc 3 Doc 4 Doc 5 Doc 6 Doc 7 Doc 8 Doc 9 Doc 10Uniformity 061 059 062 061 060 066 072 051 072 076Entropy 129 125 123 135 134 120 117 148 107 107Max probability 070 064 068 069 068 070 077 058 062 076Dissimilarity 115 113 117 117 114 106 096 128 108 098Contrast 088 086 092 092 089 086 075 094 085 078The final processing of the above results is based on cumulative measures like average max and min of script type co-occurrence in the database Accordingto that certain criteria are established All these are shown in Table 12

Table 12 The ratio of script type co-occurrence descriptors

Latin Cyrillic Ratio CriteriaMin Max Average Min Max Average Max Min Average

Uniformity 023 029 026 031 050 041 076 051 064 03Entropy minus168 minus152 minus160 minus157 minus108 minus130 148 107 125 Max probability 032 047 042 052 068 060 077 058 068 05Dissimilarity 068 083 074 057 077 067 128 096 111 Contrast 103 121 112 112 142 127 094 075 086

quite problematic due to their absence in Latin documentsfrom time to time

Furthermore the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents isexamined according to GLCM method The obtained resultsfor each document are given in Table 10

The co-occurrence descriptor for Latin and Cyrillic textand its ratio is presented in Figure 13

From the above results some criteria can be establishedIt is clear that uniformity and maximum probability receivethe most distinct values in Latin and Cyrillic text Hencethese descriptors are suitable for qualitative characterizationof Latin and Cyrillic text as well as for creating criteria todistinguish a certain script type From the above results themargin criteria should be uniformity of 03 and maximumprobability of 05 These values of both descriptors representthe strong margin in qualifying the script in certain Serbiantext If we accompany them with the criteria obtained from

frequency analysis of the script type occurrence then the fullcriteria of decision making can be established This will leadto correct recognition of the script in Serbian text

5 Conclusion

The paper proposed the algorithm for recognition of exactscript in Serbian document Documents in Serbian languagecan be written in two different scripts Latin or CyrillicThe proposed algorithm converts document written in Latinand Cyrillic script into cipher text This way all alphabeticcharacters are exchanged with only four different encryptedsigns according to predefined encryption based on text linestructure definition Such ciphers texts are then subjected tothe frequency and co-occurrence analysis According to theobtained results a criteria for recognition of the certain scriptis proposed The proposed method is applied to the custom-oriented database which includes different types of printed

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

14 The Scientific World Journal

and web documents The experiment shows encouragingresults Possible applications can be seen in the area of webpage recognition

Future work will be toward the recognition of relatedlanguages as well as different languages written in the samescript

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this article (ldquoRecognition of theScript in Serbian Documents using Frequency Occurrenceand Co-occurrence Analysisrdquo by Darko Brodic Zoran NMilivojevic Cedomir A Maluckov)

References

[1] C Smith ldquoBasic cryptanalysis techniquesrdquo Tech Rep SANSInstitute 2001

[2] J Hoffstein J Pipher and J H Silverman An Introductionto Mathematical Cryptography Springer New York NY USA2008

[3] K Ruohonen ldquoMathematical cryptologyrdquo Lecture Notes 2010[4] A Zramdini and R Ingold ldquoOptical font recognition using

typographical featuresrdquo IEEE Transactions on Pattern Analysisand Machine Intelligence vol 20 no 8 pp 877ndash882 1998

[5] P Quaresma ldquoFrequency analysis of the Portuguese languagerdquoTR 2008003 University of Coimbra Portugal 2008

[6] R M Haralick K Shanmugam and I Dinstein ldquoTexturalfeatures for image classificationrdquo IEEE Transactions on SystemsMan and Cybernetics vol 3 no 6 pp 610ndash621 1973

[7] G Jaskiewicz ldquoAnalysis of letter frequency distribution inthe voynich manuscriptrdquo in Proceedings of the InternationalWorkshop (CSampP rsquo11) pp 250ndash261 Putusk Poland 2011

[8] S Cagnoni Evolutionary Image Analysis and Signal Processingvol 213 of Studies in Computational Intelligence SpringerBerlin Germany 2009

[9] R Nosaka C H Suryanto and K Fukui ldquoRotation invariantco-occurrence among adjacent LBPsrdquo in Proceedings of the 11thInternational Conference on Computer Vision (ACCV rsquo11) vol7088 of Lecture Notes in Computer Science pp 82ndash91 DaejeonRepublic of Korea 2011

[10] D A Clausi ldquoAn analysis of co-occurrence texture statistics as afunction of grey level quantizationrdquoCanadian Journal of RemoteSensing vol 28 no 1 pp 45ndash62 2002

[11] A Baraldi and F Parmiggiani ldquoInvestigation of the texturalcharacteristics associated with gray level cooccurrence matrixstatistical parametersrdquo IEEE Transactions on Geoscience andRemote Sensing vol 33 no 2 pp 293ndash304 1995

[12] F R Renzetti and L Zortea ldquoUse of a gray level co-occurrencematrix to characterize duplex stainless steel phases microstruc-turerdquo Frattura ed Integrita Strutturale vol 16 no 5 pp 43ndash512011

[13] R Susomboon D Raicu J Furst and T B Johnson ldquoA Co-occurrence texture semi-invariance to direction distance andpatient sizerdquo in Medical Imaging 2008 Image Processing vol6914 of Procedings of SPIE San Diego Calif USA February2008

[14] A Eleyan and H Demirel ldquoCo-occurrence matrix and itsstatistical features as a new approach for face recognitionrdquo

Turkish Journal of Electrical Engineering and Computer Sciencesvol 19 no 1 pp 97ndash107 2011

[15] D A Clausi and Y Zhao ldquoRapid extraction of image textureby co-occurrence using a hybrid data structurerdquoComputers andGeosciences vol 28 no 6 pp 763ndash774 2002

[16] R Jobanputra and D A Clausi ldquoPreserving boundaries forimage texture segmentation using grey level co-occurringprobabilitiesrdquo Pattern Recognition vol 39 no 2 pp 234ndash2452006

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Submit your manuscripts athttpwwwhindawicom

VLSI Design

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

RotatingMachinery

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

Journal ofEngineeringVolume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Shock and Vibration

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Mechanical Engineering

Advances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Civil EngineeringAdvances in

Acoustics and VibrationAdvances in

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Electrical and Computer Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Distributed Sensor Networks

International Journal of

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

SensorsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Modelling amp Simulation in EngineeringHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Active and Passive Electronic Components

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Chemical EngineeringInternational Journal of

Control Scienceand Engineering

Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Antennas andPropagation

International Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Navigation and Observation

International Journal of

Advances inOptoElectronics

Hindawi Publishing Corporation httpwwwhindawicom

Volume 2014

RoboticsJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014