zipf's law and writings on lis abstra ct comprising about...

6
~ \ ' ,I Malaysian Jour"al of Library & Information Science, VoL3, no.2, July 1998:93-98 ZIPF'S LAW AND WRITINGS ON LIS B K SeD;Khong Wye Keen; Lee Soo Hoon; Lim Bee Ling; Mohd Rafae Abdullall; Ting Chang Nguan; Wee Siu Hiang MLIS Programnle, Faculty of ComputerScience and Information Technology, Ulllversity of Malaya 50603 Kuala LUlllpur, Malaysia E-mail: [email protected] ABSTRA CT Presentsthe results of a study conductedto find out the validity of Zip.fs law related to the word length and the frequency of its uses in the case of library and. information science literature. The results obtainedfrom the analysis of six different samples obey Zip.fs law in all the cases with small deviations. The result provided by the sample comprising about 5,800 words fits the law best with just one deviation. The main exceptionisfound to be one-letter word.S'. KeY"'ords: Zipf's law; Library and information science literature; Bibliometrics. DEFINITIONS very closely relatedto the frequency of its usage - the greater the frequency, the shorterthe word (Zipf, 1935). As far as literary writings are concerned such as fictions, short stories, and poems this law of Zipf holds good. Does this law apply to. technical writings as well? The question arose becausetechnical writing differs from literary or ordinary writing in a number of ways. In technical writing, more often tItan not, eachterm represents a particular concept which is used again and again whenever the author refers to tllat concept thus leadingto the increase in the frequencyof its use. For example, an astronomer writing about the moon will be obliged to use the word moon again and again whenever referring to the concept. On the other hand a poet writing about the same subject can use the words moon. Alpha-numeric expression- an expression involving alphabet/s and number/~, e.g. 2nd. Alpha-symbolic expression - an expres- sion involving aIphabet/s and symbol/s, e.g. au=. Extra-textual references ,- References ap- pearing outside the text in the foon of footnotesor citations at the end of the text. Intra-textual references - Referencesfi- guring within the text. INTRODUCTION George Kingsley Zipf, a noted linguist, tried to examine the field of linguistics from the scientific point of view and discovered three different interestinglaws. One of them is that the length of a word is

Upload: others

Post on 19-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ZIPF'S LAW AND WRITINGS ON LIS ABSTRA CT comprising about ...eprints.rclis.org/5824/1/pdf.pdf · technical writing, we may find tables, charts, formulas, symbols, etc. which are usually

~

\ ' ,I

Malaysian Jour"al of Library & Information Science, VoL3, no.2, July 1998:93-98

ZIPF'S LAW AND WRITINGS ON LIS

B K SeD; Khong Wye Keen; Lee Soo Hoon; Lim Bee Ling;Mohd Rafae Abdullall; Ting Chang Nguan; Wee Siu Hiang

MLIS Programnle, Faculty of Computer Science andInformation Technology, Ulllversity of Malaya

50603 Kuala LUlllpur, MalaysiaE-mail: [email protected]

ABSTRA CT

Presents the results of a study conducted to find out the validity of Zip.fs law related tothe word length and the frequency of its uses in the case of library and. informationscience literature. The results obtained from the analysis of six different samples obeyZip.fs law in all the cases with small deviations. The result provided by the samplecomprising about 5,800 words fits the law best with just one deviation. The mainexception is found to be one-letter word.S'.

KeY"'ords: Zipf's law; Library and information science literature; Bibliometrics.

DEFINITIONS very closely related to the frequency of itsusage - the greater the frequency, the

shorter the word (Zipf, 1935).

As far as literary writings are concernedsuch as fictions, short stories, and poemsthis law of Zipf holds good. Does this lawapply to. technical writings as well? Thequestion arose because technical writingdiffers from literary or ordinary writing ina number of ways. In technical writing,more often tItan not, each term representsa particular concept which is used againand again whenever the author refers totllat concept thus leading to the increase inthe frequency of its use. For example, anastronomer writing about the moon will beobliged to use the word moon again andagain whenever referring to the concept.On the other hand a poet writing about thesame subject can use the words moon.

Alpha-numeric expression - an expressioninvolving alphabet/s and number/~, e.g.2nd.Alpha-symbolic expression - an expres-sion involving aIphabet/s and symbol/s,e.g. au=.Extra-textual references ,- References ap-pearing outside the text in the foon offootnotes or citations at the end of the text.Intra-textual references - References fi-guring within the text.

INTRODUCTION

George Kingsley Zipf, a noted linguist,tried to examine the field of linguisticsfrom the scientific point of view anddiscovered three different interesting laws.One of them is that the length of a word is

Page 2: ZIPF'S LAW AND WRITINGS ON LIS ABSTRA CT comprising about ...eprints.rclis.org/5824/1/pdf.pdf · technical writing, we may find tables, charts, formulas, symbols, etc. which are usually

~ ,.,.I

Zipfs Law and Writings in LIS

pattern of an author. For the study, it hasbeen assumed that the changes in thefrequency of the use of words during theeditorial process is minimal and does notaffect the overall result. The length of thesix articles in tenus of words vary be-tween 862 to 5772 words.

PrO2ram

program Character_Count;

canst{maximwn length ofMAXLENGl1I = 20;

words}

var

Each word length was measured in termsof letters. This was done to avoid theproblem of hyphenated words which attimes give rise 'to different counts. Forexample, the word length of 'on-line' isseven and 'online' is six when counted interms of characters. But, in both the cases,the word length is six when counted interms of letters. The manual process wasobviously cwnbersome, laborious andtime consuming. Hence, an alternative.method (detailed below) amenable tocomputerised analysis was tried.

string[255];text. ,

integer,auay [O..MAXLENGnI] ofinteger;

Ininfile.1CO1lllt

beginfor i:=O to MAXlength do { Initialise}

coWlt[i] := O~writeln( 'ChrCoWlt - Character CoWlt')~

write('Input file? ')~

readln(ln)~assign(infile, In)~reset(outfile)~

\\Tite('Output file? ');

readIn(In);assign( outfile, In);re\\Tite(outfile)~Each article was scanned with an optical

character recognition (OCR) software i.e.OmniPage Pro. The resultant file wassaved in Microsoft: Word for Windowsfonnat. The file so generated was checkedwith the original article for accuracy. Theportions of the article as detailed abovewere removed. Punctuation marks weretllen converted into spaces, and subse-quently all spaces were converted into linebreaks. This resulted in a pJJre word listwhich was saved in a text file.

read1n(infile, In); {Analyse}while not eof(infile), dobegin

coWlt[length(In)] := coWlt[lenglli(In)] + 1;

readIn(infile, In)end;

for i:=O to MAXLENGnI do {Output to file}writeln(outfile, i, chr(9), count[i];

close( outfile)~

close( infile)~

writeh1('Completed. ')end.The text file w~ run through the program

written in Turbo-Pascal to count thefrequencies according to the number ofletters in the words. The program took thetext file as its input and produced thecorresponding frequencies as its output.

RESULTS

The results of the study are presented inTable 1. The percentage of one-letter wordvaries from 0.67 to 2.23. This low percentage

95

outfile

Page 3: ZIPF'S LAW AND WRITINGS ON LIS ABSTRA CT comprising about ...eprints.rclis.org/5824/1/pdf.pdf · technical writing, we may find tables, charts, formulas, symbols, etc. which are usually

Sen, B.K. et aL

intra- and extra-textual references, tables,figures, and appendices.

Luna, Cynthia, sailor's friend, orb ofnight, and so on to break the monotony orto rhyme with another word. This apart, intechnical writing, we may find tables,charts, formulas, symbols, etc. which areusually not part of an ordinary literarywriting. As library and informationscience (LIS) literature falls under thecategory of technical writing, it was in-tended to study the validity of Zipf sfinding in this particular field.

Another object of the study was to find outthe optimum sample size in tefnlS of thenumber of words required for the veri-fication of this particular law. In the caseof the law related to rank and frequency(Zipf 1949), the optimum sample size isfound to be at least 5,000 words (Wyllys

1981).

METHODOLOGY

To conduct the study, six papers of vary-ing length (Teh and Wong, 1996; Kade-mani and Kalyane, 1996; Nor, 1996;Panigrahi and Panda, 1996; Parvatha-mma, 1996; Zainab and Nor, 1996) werechosen from the Malaysian Journal ofLibrary and Information Science (July1996) .

The portions of an article which wereexcluded from our study comprised thenames of the authors, author affiliations,abstract, keywords, alpha-numeric ex-pressions like 2nd and FIO, alpha-sym-bolic expressions like au=, and su=,abbreviations such as FDT and ISO,numbers written with digits, serial num-bering, formulas, pW1ctuation marks,

The rationale behind the exclusion of thenames of authors and author affiliationsis obvious because they do not representthe author's style of writing and as suchcannot be used for word counting. Thekeywords are soDletnnes chosen by con-sulting a thesaurus, where the author haslittle choice and soDletnnes'they are addedby the editor.' Hence, it was felt safe toexclude theDl. An abstract is the con-densed version of an article and does notnecessarily represent the noffi1al style ofwriting of an author. Moreover, SODle-times the abstract is prepared by SODleoneother than the author. Therefore, it wasnot considered. Alpha-numeric as well asalpha-syntbolic expressions, abbrevia-tions, numbers written with digits, serialnuDlbering with a, b, c, etc., and foffi1u-las are not words, hence excluded. Thereferences coDlprise certain fixed eleDlentssuch as author, year, title of th~ article,and o~er bibliographical details which arenot the creation of the author. Tables andfigures were to be excluded for the ease ofsorting. Moreover, at times a table Dlaycontain numerous YESes, NOes, etc.representing the answers of respondents.which are again not representative of thestyle of the author. This is the case withappendices as well. In the articlesanalysed for this study one appendix listedthe papers of a famous scientist, severalappendices listed CD ROM databases,and so on. Thus, only the textual part ofthe article with the above exceptions wasconsidered since that part seeDled to beDlOSt relevant for judging the word use

Page 4: ZIPF'S LAW AND WRITINGS ON LIS ABSTRA CT comprising about ...eprints.rclis.org/5824/1/pdf.pdf · technical writing, we may find tables, charts, formulas, symbols, etc. which are usually

Qh Sen. B

.X.

et oL

0)u=0)c='

uu0~0~u=0)='

CO

~~~~=0)~"0J..0~-I0)-.Q~~

-0:-:""

CID

0 =

~

,3

~

~

t'~.

=

... Z

:i~

=

.1.=

~

N

IC

-""-j~

~

I

"'~~

I

~

Q)

G'~

.=

'""C-

Q)=

~=

Q)

~o-UQ

) ...

... Q

)r;--

~

~

~

I

c~.1

=-

~~

=

:s ~

I

a'~~

~~

1""~

fJo

~

~bD .

U~

~~

=~

~

~~

I =

U

0 ~ ~

...~

~ ~

~tin-e'

~

.=

-,-,

~~

~:S

Yo

C'~

~

~~

~r.- ~

~bJ)

~~

.-"',-,=

=~

~

~

I=

V

9'

ac~

~

~~

~r~

~1~

1~1~

1~1~

1~1~

1~1~

1~1~

1~loolololo

~1~

1~1~

1~1

001

~1~

1M

I~

I~

I~

lol

~1o

lolo

l~

lolo

l~1

~

. .

. .~

~~

N~

~~

N~

OO

OO

OO

O.~

~N

O

. .

. .

. .

. .

. .

. .

. .

'0~

,-

,- ,-

,- 00

~

\C

~

~

M

~

0 0

0 0

0 0

0 0

-

~I~

I~I~I~I~

I~I~

'~1~

1~1~

1~1~

1~lo

~r---v)~

NN

Nff')--

~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

'6,-,~

~~

~Ir)f"-.\D

~":t"N

'-'OO

OO

OoooO

al~I§lsl~

I~I~

I~I~

I~I~

I~I~

I~I~

lololololol§1

01~

1~1~

IN

I~

I~

I~I~

IM

loo

l~

I~

lo

loo

1°1°

1°1°

1o

l~

N

.

. .M

~~

O~

OO

~~

~~

OO

OO

OO

O

co--i!::::!::::~~

o.;~r-.:"')M

co--iooooooOO

OO

~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~IN

I~I~

'NI~

ININ

I~I

IIII

I~~~~~N~~~O~~OO~N~OO~OOOOO

~I

~I

~I

~I

~I

~I

~I

~I

~ ~

I~

~ ~

I~

I~

I~

~I

~ ~

1~

lo!

0 N

00

~

0 0

~

0 ~

N

~

N

~

0

0 0

0 0

0 0

01

- -

- ,-

,- -

~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I

~1~

1~1~

1~1~

1~1§1~

1~1~

1~lool~

lolol

~1~

1~1~

151~

1~1~

1~1~

1~1~

1~1~

I~I~

I~I~

I~I~

I~I

~

~

Ir\ 0\ t'-- \0 N"1"

0 0

0 00

000 ~

I

~1~

1§1~1~

1~1~

1~1~

1~1~

1~1~

lololoINlololol~

~IN

I~I~

I~I~

I~lool~

I~I~

I~I~

I~I~

I~I~

I~I~

I~I~

olol~

ololol~

0101019. .;..::.\D0-0--fP

o.'-"

~.;..::.\D0-0--"§~~~'-"~~r-\D0-0-..-.c.

~'-"~I

f--.;..::.\D0-0--0b

I

~.;..:: \D

\D

0-0-0-0-

~~

.S

6b

~"§I

Po.

Z'-"I

p..N

~

Page 5: ZIPF'S LAW AND WRITINGS ON LIS ABSTRA CT comprising about ...eprints.rclis.org/5824/1/pdf.pdf · technical writing, we may find tables, charts, formulas, symbols, etc. which are usually

Zipfs Law and Writings in LIS

was expected since in English language thereare only two one-letter words, i.e., 'a' and 'I'that are used quite frequently. In technicalwriting, the use of the word 'I' is not verycommon. In the six papers analysed for tilestudy, we have not encountered the word 'I'even once. Hence, the percentage given abovereflects only the occurrence of 'a'. At anyrate, this low percentage does not followZipf's law.

starts falling below 1 with thirteen-letterwords. The number of words comprising 15to 18 letters is negligibly small. Wordscomprising more than eighteen letters werenot encountered. If we ignore tile case of one-letter words, tllen the findings obey Zipfslaw quite well.

As to the optimum sample size it may beobserved from Fig. I that the decrease is mostuniform in tile case of the third chart wherethe sample size is reasonably large (5772words). WitII lesser saniple size certaindeviations are observed. The fIfth chart withtile least sample size (862 words) also showmore or less uniform decrease. Still, it is feltthat for better results a sample size of about5,000 words may be considered adequate.

However, the situation changes from two-letter words onwards. The percentage ofoccurrence varies from 15 to 20 in the case oftwo-letter words, which remains more or lessconstant aroWld 18% in the case of three-letter words. After this in all cases the fall isgradual as well as uniform as can be seenfrom Fig. 1. The percentage of occurrence

Fig. 1 - Frequency Distribution of Word Lengths of Six Sample Articles

1200

1000

800

600

400

200

0

1st 2nd 3rd 4th 5th 6th

97

Page 6: ZIPF'S LAW AND WRITINGS ON LIS ABSTRA CT comprising about ...eprints.rclis.org/5824/1/pdf.pdf · technical writing, we may find tables, charts, formulas, symbols, etc. which are usually

lit~

Sen, B.X. et aL

Parvatllamma, N.1996. The coverage of In-dian literature in social science biblio-graphic databases on CD-ROM. Malay-sian Journal of Library and InformationScience Vol. 1, no. 1:85-92.

CONCLUSION

This study indicates that the LIS writingsalso follow the Zipf's law when only tiletextual part of the writing is consideredomitting alpha-nwneric and alpha-symbolicexpressions, abbreviations, headings of illus-trations, intra-textual references, words figu-ring within tables, keywords. However, justfrom this study it is not possible to generalisethat the law will be applicable to all types oftechnical writings, for which separate studiesneed to be Wldertaken.

Teh, K. H. and S. F. Wong. 1996. Deve-loping a CDS/ISIS-based online catalo-guing and infonnation retrieval interfacesfor use in small libraries. MalaysianJournal of Library and InformationScience Vol. 1, no. 1: 1-20.

Wyllys, R. E. 1981: Empirical and theoreticalbases of Zipfs law. Library Trends Vol.30, no.l: 53

ACKNOWLEDGEMENT

The authors are profoundly thankful to Prof.Mashkuri Yaacob, Dean, Faculty of Compu-ter Science and Infonnation Teclmology,University of Malaya, for providing all faci-lities in the preparation of the paper.

Zainab, A. N. and Nor Eliza. M. Z.1996.Introducing MAKLUM the generalreference expert adviser developed for auniversity library. Malaysian Journal ofLibrary and Information Science Vol. 1,no. 1: 93-107REFERENCES

Zipf, G. K.1935. The psycho-biology of lan-guage: an introduction to dynamic philo-logy. Houghton Mitnin, Boston~ RiversidePress, Cambridge. p. v.

Kademani, B. S. and V. L. Kalyanc. 1996.Outstandingly cited and most significantpublications of R Chidambaram, a nu-clear physicist. Malaysian Journal ofLibrary and Information Science Vol. 1,no. 1: 21-36. Zipf, G. K. 1949. Human behavior and the

principle of least effort: an introductionto human ecology. Reading, Mass.:Addison-Wesley Press.

Nor Ehzan, N. 1996. The use of CD-ROMdatabases by Malaysian postgraduate stu-dents in Leeds. Malaysian Journal ofLibrary and Information Science Vol. 1,no. 1: 37-55.

Panigrahi, C. and K. C. Panda. 1996. Read-ing interests and infonnation sources ofschool going children: a case study of twoEnglish medium schools of Rourkela,India. Malaysian Journal of Library andInformation Science Vol. I, no. 1: 57-65.

98