

Information Theory and Other Quantitative Factors in

Code Design for Document Card Systems*

Eugene Garfield

Institute for Scientific Information
325 Chestnut Street

Philadelphia, Pennsylvania 19106

In the past ten years, the field of information retrieval has witnessed the development of many new systems, devices, and theories. In particular, two opposing "schools" of thought on card indexing systems have developed. One claims that the term card (unit term) or "collating" system is the most desirable. The other advocates the document card (unit record) or "scanning" system. Dr. Whaley has noted many of the advantages and disadvantages of collating and scanning systems, and I am glad to adopt his terminology and agree with most of his comments.1 For the record, however, I wish to remind the proponents of term card systems that theirs was no new finding. Costello2 says Batten3 anticipated Taube4 by 15 years. Batten was anticipated by at least another 35 years.

One term card system began at the turn of the century at Johns Hopkins Hospital. Subsequently, it went through all the evolutionary stages which clearly demonstrate the inherent similarities between term card and document card systems. This does not mean that the rediscovery of the term card system was an insignificant development. After all, many useful ideas and inventions are rediscovered and we are grateful for these discoveries. However, when appropriate, our precursors ought to be given credit. Even the ten column posting card was anticipated by Paul Otlet, founder of the modern documentation movement.5 Indeed, long ago, the term card system was used in several medical institutions, including Johns Hopkins Hospital and the Mayo Clinic.

Texts on medical records management demonstrate such systems.6 These consist of one 3 x 5 card for each disease (term). Each card then lists the case history document numbers for all patients so diagnosed. Ultimately, the number of case history numbers grew larger

Reprinted with permission from the Journal of Chemical Documentation 1:70, 1961. Copyright by the American Chemical Society.

* Presented at the American Documentation Institute Annual Meeting, 22 October 1959, Lehigh University, Bethlehem, Pa.



and the time required to make any correlations between two diagnostic term cards increased to ridiculous, exponential proportions. Somewhere along the line it was decided that the document card system should be employed. At Johns Hopkins and Mayo, Hollerith cards were in use as early as the 1920's. The School of Public Health at Johns Hopkins was one of the earliest users of punched-card machines. Their equipment is still of early vintage. At Johns Hopkins, even the IBM card finally became a problem as the volume of patients grew into the hundreds of thousands. The "vicious circle" was continued when it was decided to use duplicate sets of cards, i.e., rotated files, not unlike the system used at the Chemical-Biological Coordination Center (CBCC) several years ago.7 Finally, this semi-collating, semi-scanning system was abandoned because of the high cost of storing millions of cards. The entire file was tabulated on printed sheets and the punched-cards thrown out. This printed index arrangement is very similar to the original term card arrangement. However, in a separate section, the equivalent of the document card is also printed. Thus, one is able to do a search by both methods. Depending upon the individual search either one or both may be used. Pre-coordinations were made where appropriate before printing the index.

The Mayo Clinic long ago attacked the space problem in another fashion. The storage density of the IBM card was increased by a system of binary coding.8 These IBM methods, I believe, are still used there.

The binary coding utilizes all of the 4,096 combinations possible in a 12-position punched-card column. It is understandable that a group of statisticians would discover this method. After all, statisticians work with probability data constantly. However, it is interesting that many people, including the statisticians, have been clever in finding ways of increasing the number of codes that can be crammed on a card (Wise,9 Mooers,10 et al.). However, the problem of how many times each was used was not considered as important.

This aspect first troubled me while working with the IBM 101 at the Welch Medical Library Indexing Project.11 Some readers may recall the experimental 101 system we demonstrated in 1953 using five digit decimal codes, randomly strung along the first sixty columns of an IBM card.12 For each subject heading or descriptor there was one five digit decimal number. Each card contained 12 such numbers. The details are described in the final report of the project. To use the same code length for all descriptors regardless of their frequency was rather inefficient in terms of space utilization, input time and searching cost. Obviously, others have arrived at similar conclusions because their coding systems intuitively employ a statistical approach. It is surprising, however, how many extant systems still do not make provisions for "normal distribution." A good example is the CBCC system, and the same is true of Uniterm,13 Zatocoding14 and others. To reiterate: they all use the same amount of coding space for each descriptor, regardless of its



frequency of use.

Working with the CBCC system, and utilizing Heumann's statistical data15 on about 25,000 chemical compounds coded with this system, it was possible to design a code which significantly reduced card space and the time and cost of searching. For the moment it is sufficient to state briefly that the statistical information available on the CBCC file was used to construct a normal distribution curve giving the frequency of use of each alphanumerical code. One then arbitrarily breaks into the frequency curves in various sections to determine the space allocations for the descriptors. If a descriptor, such as benzene, occurs in half the chemicals and the code for uranium occurs rarely, why devote the same amount of space to both? Obviously, as Wiswesser,16 Steidle17 and many others have found, it is quite sufficient to assign permanent card locations to frequently occurring codes. On the other hand, descriptors which occur infrequently can be assigned some coding configuration which requires, relatively, a great deal of card space. This will be of little consequence since it will crop up so rarely. These "rare" birds are treated as a class and codes are used that permit many combinations in a larger space. The Mayo system is one example; another is the Zator system, as applied by Schultz.18 Indeed, one of the primary shortcomings of Mooers' Zator system is the indiscriminate, i.e., random assignment of an equal number of code symbols regardless of actual occurrence in the file.19 This results in excess noise, i.e., false drops. Incidentally, I wish to

point out that I am well aware of Mooers' early attempt in American Documentation to set Wise straight on the folly of a superimposed coding scheme for the now defunct Rapid Selector.9,10 However, to use probability theory is one thing; to use information theory is something else. We all readily can visualize methods of utilizing card space that will grossly take advantage of the facts revealed by a statistical analysis of the use made of a particular descriptor dictionary or subject heading list. The theoretician, however, wants precise quantitative criteria for allocating code space to individual descriptors or groups of descriptors. Here is where Information Theory comes to the rescue. The design of the most efficient coding system does not depend upon the meaning of terms. The terms, by themselves, have no informational value. Rather, it is the frequency of use of a particular descriptor which determines its informational content. One can only measure the amount of information in the word benzene when transmitting it in English text. As a code or term in a document collection dictionary, the word has no value. It is only significant in so far as it occurs with a particular frequency. If half of the chemicals coded contain benzene then the knowledge that a particular chemical contains benzene reduces the remaining choices to one half.
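The allocation strategy described above, permanent card locations for frequent descriptors and a shared combinatorial field for the "rare birds," can be sketched in a few lines. The descriptor list and the 10% threshold below are illustrative assumptions, not figures from the paper:

```python
from math import comb

# Hypothetical descriptor frequencies (fraction of documents containing each).
# The split threshold (10%) is an illustrative assumption.
frequencies = {
    "benzene": 0.50, "hydroxyl": 0.30, "ketone": 0.22,
    "chloride": 0.04, "uranium": 0.001, "selenium": 0.0005,
}

# Frequent descriptors each get a dedicated punch position (one hole each);
# rare descriptors share a wider field coded as multi-hole combinations.
frequent = [d for d, f in frequencies.items() if f >= 0.10]
rare     = [d for d, f in frequencies.items() if f <  0.10]

print("direct positions needed:", len(frequent))
# A 12-row column punched 4 holes at a time yields C(12,4) = 495 distinct
# patterns, enough to cover hundreds of rarely used descriptors in one field.
print("rare-code capacity of one combinatorial column:", comb(12, 4))
```

The rare codes cost more card space per occurrence, but, as Garfield notes, that cost is incurred so seldom that the file as a whole shrinks.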

Having cleared the cobwebs on what the real "coding" problem is in documentation systems, it is then relatively simple to apply Shannon's basic formula for measuring informational content.20 I might mention that it is difficult, at first, to



think of the card searching problem as a transmission problem. However, if you think in terms of magnetic tape systems (Univac) or paper tape systems such as the Western Reserve Scanner, it is easier to see an analogy between "transmission" and searching.

The information content of a document file is neither the number of descriptors used, nor the number of documents which the various combinations of descriptors constitute. The information content of a document collection is a function of the probabilities of the descriptors in the dictionary. H, the familiar thermodynamic entropy function, and Shannon's measure of information, is equal to the sum of the individual probabilities multiplied by the logarithm of the individual probabilities, i.e., H = −(P1 log P1 + P2 log P2 + … + Pn log Pn).
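Shannon's formula is easy to compute directly. A minimal sketch, with invented probability values for illustration, shows Garfield's central point: H depends only on the frequency distribution of the descriptors, not on the number of documents:

```python
from math import log2

def entropy(probabilities):
    """Shannon's H = -(p1 log p1 + ... + pn log pn), in bits."""
    return -sum(p * log2(p) for p in probabilities if p > 0)

# Four equally likely descriptors carry the maximum information...
uniform = entropy([0.25] * 4)
print(uniform)  # 2.0 bits

# ...while a skewed dictionary (one dominant descriptor) carries less,
# which is exactly the saving a frequency-aware code can exploit.
skewed = entropy([0.7, 0.1, 0.1, 0.1])
print(skewed < uniform)  # True
```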

From this we are able to draw many interesting conclusions. For example, a document collection of 1,000 documents may contain no more information than a document collection of one million documents. This fact accounts for the intuitive decision of the Patent Office to use a "composite" card, which in certain cases is quite justifiable.21 It also can be shown that the informational equality in two such files can be changed readily if the depth of indexing is altered. Indeed, if the informational content remains constant during such a growth one must either conclude that unnecessary cards remain in the file, new sub-dividing terms are required, or noise is present during a search. This situation is illustrated perfectly by our experience in coding steroid chemicals using the Patent Office

code. In many instances a dozen different steroids were coded exactly alike. If the code dictionary is not changed, it is properly concluded that it is more economical to "composite" the 12 cards into one. However, one could increase the specificity of the coding. From the point of view of the Patent Office, with emphasis on the generic approach, the former conclusion, compositing, may appear simplest. From the point of view of the research chemist the latter approach, more specificity in coding, is more desirable. Taube's paper at the ICSI Conference implies that a term card system for the same steroid file could be used as readily as the Patent Office document card system.22 This has a theoretical validity in view of the fact that in both systems no attention whatsoever is devoted to the frequency of occurrence of the various codes. (The Patent Office uses one punched hole position for each descriptor and the Uniterm system uses a 4 digit document number for each descriptor.) Indeed, from a tabulation of the coding done by the Patent Office of over 2,500 U.S. patents, involving about 35,000 codes, it is no coincidence to find that seven descriptors account for over 9,200 codes, 16 additional account for another 9,100, the next 52 another 9,400 and all the remaining 359 descriptors 6,800.23 Deciding the relative merits of working with a term card involving 1,500 document numbers (the highest frequency code) or the time to run 2,500 cards through a machine with a speed varying (according to price) from 500 to 2,000 cards per minute is meaningless. This becomes particularly ludicrous if one



then considers the time required to find those chemicals containing both a 3-Hydroxy steroid code and a 17-Hydroxy steroid code, each of which occurs with almost equal frequency (1,200 occurrences). Instead of matching numbers on Uniterm cards by eye, one can speed this up by "collating" on an IBM machine at speeds comparable to the sorting operation. Using a Ramac system or a high speed computer this can be speeded further.24 The point is that each system, according to the circumstances, has advantages and for this reason, in certain cases, I have used a combination of both, even going so far as to maintain two independent systems. This is commonly done, but not admitted, in many installations.
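The skew in the Patent Office tabulation quoted above can be checked arithmetically; this sketch uses only the group counts printed in the text:

```python
# Frequency tabulation quoted for the Patent Office steroid file:
# (number of descriptors, code assignments attributed to that group).
groups = [(7, 9200), (16, 9100), (52, 9400), (359, 6800)]

total_descriptors = sum(n for n, _ in groups)   # 434 descriptors
total_codes = sum(c for _, c in groups)         # 34,500 of "about 35,000"

# The 23 most frequent descriptors, about 5% of the dictionary,
# account for more than half of all code assignments.
top23_share = (9200 + 9100) / total_codes
print(total_descriptors, total_codes, round(top23_share, 2))  # 434 34500 0.53
```

This is precisely the kind of distribution that equal-space-per-descriptor schemes waste, and that a frequency-weighted code exploits.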

Returning to the discussion of the now measurable quantity H of an information file, to explain how this measure of information is determined and used, I must resort to basic Information Theory. For that I have paraphrased Shannon's own words, to which I refer those who are not yet familiar with Information Theory.20

Information Theory is concerned with the discovery of mathematical laws governing systems designed to communicate or manipulate information. It sets up quantitative measures of information and the capacity to transmit, store and process information. Information is interpreted to include the messages occurring in standard communication media, computers, and even the nerve networks of animals. The signals or messages need not be meaningful in any ordinary sense. Information Theory is quite different from classical communication engineering theory, which deals with the devices used, not with that which is communicated.

I submit that most of the polemics concerning devices, i.e., term card vs. document card systems, have kept us in the dark ages of conventional engineering theory. Relatively speaking, we have paid little attention to the nature of the information itself. This led to the failure to design really efficient searching devices; anyone who rents an IBM machine knows this. The measure of information, H, is important because it determines the saving in transmission time that is possible, by proper encoding, due to the statistics of the message source. Consider a model language in which there are only four letters: A, B, C, and D. These letters have probabilities 1/2, 1/4, 1/8 and 1/8. In a long text, A will occur 1/2 the time, B one quarter, and C and D each 1/8. Suppose this language is to be encoded into binary digits, 0 or 1, as in a pulse system with two types of pulse. The most direct code is: A = 00, B = 01, C = 10, and D = 11. This code requires 2 binary digits per letter. However, a better code can be constructed, with A = 0, B = 10, C = 110 and D = 111. The number of binary digits used in this code is smaller on the average. It will equal 1/2 (1) + 1/4 (2) + 1/8 (3) + 1/8 (3) = 1 3/4, where the first term derives from letter A, the second from B, etc. This is just the value of H found if the probability functions are calculated.
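The model-language example can be verified directly. This sketch reproduces the two codes from the text and confirms that the variable-length code's average cost equals H:

```python
from math import log2

# Shannon's four-letter model language and the two codes described above.
probs    = {"A": 1/2, "B": 1/4, "C": 1/8, "D": 1/8}
fixed    = {"A": "00", "B": "01", "C": "10", "D": "11"}
variable = {"A": "0",  "B": "10", "C": "110", "D": "111"}

def avg_length(code):
    # Expected number of binary digits per letter under the given code.
    return sum(probs[s] * len(code[s]) for s in probs)

H = -sum(p * log2(p) for p in probs.values())
print(avg_length(fixed))     # 2.0 binary digits per letter
print(avg_length(variable))  # 1.75, exactly H for this source
print(H)                     # 1.75
```

Matching frequent symbols to short codes is the same principle Garfield applies to descriptors: frequent descriptors deserve cheap card space.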

The result verified for this special case holds generally: if the information rate of the message is H bits per letter, it is possible to encode it



into binary digits using, on the average, only H binary digits per letter of text. There is no method of encoding which uses less than this amount if the original message is to be recovered without noise. An average of 1 1/4 bits is possible if the message is allowed to be noisy, i.e., not a completely faithful rendition of the original message.

Before we can consider how information is to be measured it is necessary to clarify the precise meaning of "information" to the communication engineer. In general, messages to be transmitted have "meaning," but this has no bearing on the problem of transmitting the information. It is as difficult to transmit nonsense words or syllables as meaningful text (more so in fact). The significant point is that one particular message is chosen from a set of possible messages. What must be transmitted is a specification of the particular message chosen by the information source. The original message can be reconstructed at the receiving point only if such an unambiguous specification is transmitted. Thus "information" is associated with the notion of a choice from a set of possibilities. Furthermore, these choices occur with certain probabilities; some messages are more frequent than others.

The simplest type of choice is from two possibilities, each with probability 1/2, as when a coin is tossed. It is convenient, but not necessary, to use as the basic unit the binary digit or bit. If there are N possibilities, all equally likely, the amount of information is given by log2 N. If the probabilities are not equal, the formula is more complicated. When the choices have probabilities P1, P2, …, Pn, the amount of information H is given by the equation above. An information source produces a message which consists not of a single choice but of a sequence of choices, for example, the letters of a printed text or the elementary words or sounds of speech. In these cases, by an application of a generalized formula for H, the rate of production of information can be calculated. This "information" rate for English text is roughly one bit per letter, when statistical structure out to sentence length is considered (see Bell System Tech. J., October 194925 or the "Encyclopedia Britannica" article on Information Theory26).

The problem of applying information theory to documentation, I believe, is to be solved in properly defining the information source, which is the totality of descriptors assigned in any file. The next problem is defining the language units, i.e., the descriptors and/or their components. A classification number, e.g., has built into it much more information than a Uniterm. Each facet of the class number must be taken into consideration when measuring the information content of a classification system. It is then necessary to determine the probabilities of the units involved.

I will further hazard the statement that in the design of a document card of the IBM type the most efficient space utilization will be obtained when the informational content of all card fields approaches equality. For example, in the case of the steroid file mentioned above, a card of four basic fields could be designed in which about 25% of the



information was contained in each. The first "field" would consist of one column of 12 punches. The twelve most frequently occurring codes would be assigned to each of the twelve locations. The next eighteen codes would be accommodated in another column divided into six sections, each of which could accommodate three different mutually exclusive codes. You cannot have a steroid which is both an 11-keto and an 11-hydroxy compound. In actual punched-card application I suspect that one would continue to use the first five columns, at least, for direct codes covering the first 60 most frequently occurring descriptors. If not, another field could be used to accommodate the next 28 codes, dividing one or more columns into 4 sections, each containing 3 punches. To accommodate the remaining 359 codes in one field would be quite simple by using all the 495 combinations (binary) of four hole punching patterns possible. The number of columns in the field would depend upon the average number of such codes possible in a single compound. Specific characteristics of existing equipment may modify this decision.
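The field capacities in this design follow from simple counting; a sketch using the figures given above:

```python
from math import comb

# Capacities of the card fields described in the text.
direct_field = 12            # one column, 12 punch rows: 12 direct codes
sectioned    = 6 * 3         # one column in six sections of three mutually
                             # exclusive codes: 18 more descriptors
four_hole    = comb(12, 4)   # all 4-of-12 punching patterns in a column

print(direct_field, sectioned, four_hole)  # 12 18 495
# 495 four-hole combinations comfortably cover the remaining 359 codes.
assert four_hole >= 359
```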

The preceding example of applying measures of information content to the design of an IBM card has been very brief and may not be entirely clear to those not familiar with IBM machines. It is important, at this point, to make clear the similarity between this simple code for an IBM card and a similar code that can be used for a variety of document card or scanning card systems. Let us take up a brief discussion of the qualitative aspects of document card systems, particularly as they relate to coding.

By document card systems, as contrasted to term card systems, we mean systems wherein all descriptors, or codes for descriptors, are retained together in the particular storage medium involved. Thus, in a punched-card document card system, i.e., McBee, E-Z Sort, IBM, Remington Rand, Underwood-Samas, etc., the holes or perforations are used to encode descriptors assigned to individual documents.27 In a limited sense, the card is the document. Indeed, if the coding were sufficiently elaborate and detailed the card could be the document. The original Luhn Scanner employed an IBM card in which semantically factored words were stretched across the card to form an encoded telegraphic style message.28 The IBM card employed was the standard 80 column card with a total of 960 punching positions.

Punched-card document card systems have their counterparts in film (Filmorex29 and Minicard30) where again all the descriptor codes are assembled together on a single piece of unitized film. The coding patterns may or may not be exactly of the type found on punched-cards. However, black or white spots correspond to perforations or the lack of perforations. The film-card (microfiche) may also contain a micro image of the original document. Similarly, an IBM card could contain the same micro image in a microfilm insert (Filmsort).31 Similarly, the Magnacard32 is the magnetic analog of a punched card. In this case information is coded as magnetized spots on magnetic tape.

The unit-card characteristic common to punched-cards, film cards, and magnetic cards is not only found in document-card systems. The same information found on Magnacards can be stored on continuous magnetic tape. This is done on Univac and the IBM 700 series computers. The mechanisms employed to scan the "card" (sections of tape) are naturally somewhat different. Similarly, the defunct Rapid Selector was a continuous series of Filmorex cards strung out on one reel of film.33 In the Benson-Lehner Flip system, the Rapid Selector system is partially revived.34 A compromise between Filmorex and the Rapid Selector was suggested in the AMFIS system by Avakian.35 The serial counterpart of perforated cards can be found in the Flexowriter tape used at Western Reserve, where each document is represented by a series of codes exactly in the fashion of the Luhn scanner.36 This is no different from teletype tape except for the number of channels involved and the selector circuitry.

The Zator card is another version of the punched card.37 The coding method employed has no basic dependence upon the card. It can be used with any type of document card system. Superimposition of codes is employed to make more efficient use of space. I mentioned earlier some of the limitations of Zator coding theory.
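The false-drop behavior of superimposed coding, the "noise" criticized earlier, can be illustrated with a small simulation. The field width, the number of positions per descriptor, and the descriptor names below are all illustrative assumptions, not Mooers' actual parameters:

```python
import random

# Superimposed coding sketch: each descriptor maps to k random positions
# in a fixed-width field; a document's code is the union (OR) of the
# patterns of its descriptors.
FIELD, K = 40, 4
rng = random.Random(1)  # seeded for reproducibility
pattern = {d: frozenset(rng.sample(range(FIELD), K))
           for d in ["steroid", "ketone", "hydroxyl", "chloride", "benzene"]}

def encode(descriptors):
    bits = set()
    for d in descriptors:
        bits |= pattern[d]
    return bits

def matches(card_bits, query):
    # A card "matches" when all of the query descriptor's positions are
    # set. Overlap among superimposed patterns can set those positions by
    # accident, producing a false drop.
    return pattern[query] <= card_bits

card = encode(["steroid", "ketone", "hydroxyl"])
print(matches(card, "ketone"))   # True: genuinely coded
print(matches(card, "benzene"))  # may be True even though benzene was
                                 # never coded: this is the excess noise
```

Weighting code-space assignments by descriptor frequency, as Garfield proposes, reduces how often the overlapping patterns of common descriptors collide.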

There are, obviously, many factors to consider in evaluating document card systems. Cost is one factor, but I believe its relative importance has been overly stressed by Taube and others.38 Document card systems are not inherently expensive, nor are small collections of

manual punched-cards. Dr. Whaley has covered more than adequately many other factors which may favor the document-card or scanning card system.1 He particularly stressed the need, sometimes, to retain relationships between various descriptors. He did not stress adequately the advantages in terms of input convenience and cost, where it is equally advantageous to keep codes together. Preparing a single IBM card is simpler than posting a dozen or more document numbers to individual term cards. It is also simpler than duplicating the same card a dozen times, each to be filed in twelve different file locations.

At the present time, punching a really efficient IBM card is difficult because the IBM machines are not designed for retrieval purposes exclusively. However, in my own experience, preparing elaborately punched cards is not an insurmountable obstacle. Key-punching costs are not considered major problems when a file is used repeatedly. Another factor to consider is searching time for large files. This can be cut down by converting to speedier machines, if time is a problem.

The major criticism of existing document-card systems is the need to operate in a "scanning" sense, i.e., each card or each unit of tape or file must physically pass by a scanning unit. When there are large volumes of records involved very high speeds may be required. This is not only costly, but it will be obvious that there is a limit to the speeds we can reach in mechanically transporting cards, film, etc. It is phenomenal how fast some sorting and scanning devices do work, and possibly these speeds will satisfy



most requirements for a long time. However, these speeds are generally available only at a relatively high price. IBM machine rentals are higher in proportion to the speed at which they work, presumably because of greater maintenance and engineering cost. IBM tabulator rentals also vary according to the speed at which they are operated.

An ideal document card system would be one in which the basic advantages are retained: unit record input and storage, logical capabilities, etc. However, one would like to eliminate the need to scan the entire document file in a physical sense, i.e., by passing cards through a sorter, magnetic tape past a reading head, or film by a photoelectric cell. I believe such a system is possible and required, particularly if we are to achieve the ultimate in access time. Such a system would be a truly random-access system and not a term card system using so-called random access. Systems such as RAMAC or AMFIS do not appear to be as energy consuming as high speed tape readers or sorters on punched cards, but their mechanical characteristics would seem to be limiting. It is comparable to solving the problem of sorting at high speeds by using a dozen sorters all at once. Similarly, to use the equivalent of a dozen magnetic tape readers is no fundamental solution. In the ideal, the file will remain completely stationary and the scanning mechanism will be able to identify the existence of desired codes by scanning in a non-mechanical fashion. An approach in this direction is seen in the Bell Telephone system of routing long distance calls by use of special

punched cards. Verner W. Clapp once asked me why you couldn't wave a flashlight at a file and have it throw out the answers. This is not impossible. I have been exploring a similar principle utilizing electromagnetic phenomena which I have called Radio Retrieval.

In conclusion, I have tried to show the fundamental similarities between so-called term card and document card systems by tracing the cyclical evolution of a term card system into a document card system, then into a semi-document card system employing collating methods, and finally back to a term card printed index arrangement. I maintain that the differences between term and document card systems are basically illusory. You will find vigorous proponents for each system depending upon the circumstances. If one had no indexing system at all in the first place, any system is an improvement. Once a system is adopted, thereby improving access to documents, a proposal to merely change the mechanics will not usually excite people.

An area of research which requires more fundamental work is in coding. No matter what system is used, the same amount of information is produced if one uses the same code dictionary and code frequencies.

The Patent Office Steroid Code would be, theoretically, equally efficient with a term card system as in its present document card system. From a practical point of view, it would not. Using Information Theory the coding space required in a document card system can be reduced considerably. It is possible that similar efficiencies are possible



in designing term card systems, but these are not yet apparent and may be difficult to find. In other words, term card systems are inherently inefficient because they seemingly cannot take advantage of the variations in code frequencies which are inherent to all information systems. According to Keckley, "there is a central tendency for 90% of the activity to be concentrated within 25% of the classifications."39 This appears to be well substantiated in the coding of 2,500 steroid compounds from the literature. Furthermore, term card space requirements may increase exponentially as the size of the collection grows. A collection of 1,000 documents requires less than 7 bits per descriptor assignment, a collection of 10,000 about 12 bits per descriptor assignment, 100,000 16 bits, and 1,000,000 20 bits.

Mooers deserves credit for recognizing the value of Information Theory for retrieval theory.40 However, it is just as inefficient to use five punched holes for every descriptor on a document card as it is to use a five digit document number on a term card. By proper application of descriptor probabilities Information Theory can make Zatocoding even more powerful.

It has been shown that one can quantitatively measure the amount of information in a document collection by the Shannon formula H = −(P1 log P1 + P2 log P2 + … + Pn log Pn).

As a result of this expression, it is concluded that the size of a document collection is no realistic measure of its "information content." Indeed, two collections of entirely different size contain the "same" information if they use exactly the same code or dictionary with the same percentage distribution of descriptors. Thus, in this sense the Library of Congress Subject Catalog contains no more information than the local Public Library Catalog. This may sound startling or ridiculous to librarians. However, as long as the local Library uses the LC Subject Heading Authority List, it may even contain more information because it may add further refinements to the existing LC dictionary or use it with varying frequency assignments. A special library is of more use to its clientele than is the Library of Congress. To alter the information content of a collection one must index in greater depth, not index more documents. This point is most important in industry.
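The claim can be checked directly from the formula: H depends only on the percentage distribution of descriptors, not on the number of postings. In the sketch below the second "collection" is ten times the size of the first but keeps the same proportions (the terms and counts are invented for illustration), and the two entropies coincide.

```python
from math import log2

def entropy(counts):
    """Shannon entropy H = -sum p_i log2 p_i of a descriptor usage profile."""
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values() if c)

# Hypothetical posting counts: same dictionary, same percentage distribution,
# collections of entirely different size.
small = {"anatomy": 50, "pathology": 30, "therapy": 20}
large = {term: 10 * c for term, c in small.items()}

print(f"H(small) = {entropy(small):.4f} bits")
print(f"H(large) = {entropy(large):.4f} bits")
```

Deeper indexing, by contrast, changes the distribution itself (new descriptors, new frequencies) and therefore changes H.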

Analysis of the Patent Office steroid code frequencies illustrates in a simple case how Information Theory may be put to use.41 A brief summary and review of Shannon's Information Theory has been presented to show that the past preoccupation of documentalists with devices is comparable to the earlier preoccupation of communication engineers with machines rather than the information they were transmitting. The main problem in applying information theory in documentation is in defining the "information source" and the "channel." A completely successful retrieval system must combine the advantages of both term and document card systems in such a way that all inertial characteristics of existing systems are removed.



REFERENCES

1. Whaley F R. The manipulation of nonconventional indexing systems. American Documentation 12:101-107, 1961. Presented at the American Documentation Institute Annual Meeting, 22 October 1959, Lehigh University, Bethlehem, Pa.

2. Costello J C. Uniterm indexing principles, problems and solutions. American Documentation 12:20, 1961.

3. Batten W E. Specialized files for patent searching. Punched cards: their application to science and industry. (Casey R S & Perry J W, eds.) New York: Reinhold, 1951, p. 169-181.

4. Taube M & Associates. Studies in coordinate indexing. Vol. 1. Washington, D.C.: Documentation, Inc., 1953.

5. Otlet P. Traité de documentation: le livre sur le livre: théorie et pratique. Brussels: Editiones Mundaneum, Palais Mondial, 1934, p. 388.

6. Cristina S X. History of writing and records. Hospital Management. Part 1, 72:111, 1958. Part 2, 86:82, 1958.

7. Beard R L & Heumann K F. The Chemical-Biological Coordination Center: an experiment in documentation. Science 116:553, 1952.

8. Berkson J. A system of codification of medical diagnoses for application to punched cards with a plan of operation. American Journal of Public Health 26:606, 1936.

9. Wise C S & Perry J W. Multiple coding and the rapid selector. American Documentation 3:223, 1952.

10. Mooers C N. Coding, information retrieval and the rapid selector. American Documentation 1:225, 1950.

11. Himwich W A, Garfield E, Field H G, Whittock J M & Larkey S V. Final report on machine methods for information searching: Welch Medical Library Indexing Project. Baltimore: Johns Hopkins University, 1955.

12. Garfield E. Preliminary report on the mechanical analysis of information by use of the 101 statistical punch card machines. American Documentation 5:7, 1954.

13. The uniterm system of indexing: operating manual. Washington, D.C.: Documentation, Inc., 1955.

14. Mooers C N. Zatocoding applied to mechanical organization of knowledge. American Documentation 2:20, 1951.

15. Heumann K F & Dale E. Statistical survey of chemical structure. Presented at the American Chemical Society Meeting, 12 September 1955, Minneapolis, Minn.

16. Wiswesser W J. A line-formula chemical notation. New York: Thomas Crowell Co., 1954.

17. Steidle W. Possibilities of mechanical documentation in organic chemistry. Pharmazeutische Industrie 19:88, 1957.

18. Schultz C K. An application of random codes for literature searching. Punched cards: their application to science and industry. (Casey R S, Perry J W, Berry M M & Kent A, eds.) New York: Reinhold, 1958, p. 232-247.

19. Mooers C N. Zatocoding and developments in information retrieval. ASLIB Proceedings 8:3, 1956.

20. Shannon C E & Weaver W. The mathematical theory of communication. Urbana: University of Illinois Press, 1949.

21. Bailey M E, Lanham B E & Leibowitz J. Mechanized searching in the US Patent Office. Journal of the Patent Office Society 35:566, 1953.

22. Taube M. The COMAC: an efficient punched card collating system for the storage and retrieval of information. Proceedings of the International Conference on Scientific Information. Vol. 2. Washington, D.C.: National Academy of Sciences/National Research Council, 1959, p. 1245-1254.

23. Ball N T. Searching patents by machine. American Documentation 6:88, 1955.

24. Nolan J J. Information storage and retrieval using a large scale random access memory. American Documentation 10:27, 1959.

25. Shannon C E. Communication theory of secrecy systems. Bell System Technical Journal 28:656, 1949.

26. Shannon C E. Information theory. Encyclopaedia Britannica. 14th edition. Vol. 12. Chicago: Encyclopedia Britannica, Inc., 1958, p. 350.

27. Casey R S, Perry J W, Berry M M & Kent A, eds. Punched cards: their application to science and industry. New York: Reinhold, 1958.

28. Luhn H P. The IBM universal card scanner for punched cards information searching systems. Emerging solutions for mechanizing the storage and retrieval of information: studies in coordinate indexing, Vol. 5. (Taube M, ed.) Washington, D.C.: Documentation, Inc., 1959, p. 112-140.

29. Samain J. Filmorex: une nouvelle technique de classement et de sélection des documents et des informations. Paris, 1952.

30. Taube M. The Minicard system: a case study in the application of storage and retrieval theory. The mechanization of data retrieval: studies in coordinate indexing, Vol. 5. (Taube M, ed.) Washington, D.C.: Documentation, Inc., 1957, p. 55-100.

31. Rees T H. Commercially available equipment and supplies. Punched cards: their application to science and industry. (Casey R S, Perry J W, Berry M M & Kent A, eds.) New York: Reinhold, 1958, p. 30-90.

32. Hayes R M. The Magnacard system. Presented at the International Conference on Information Processing, June 1959, Paris, France.

33. Shaw R R. The rapid selector. Journal of Documentation 5:164, 1949.

34. Introduction to the FLIP (film library instantaneous presentation). American Documentation 8:330, 1957.

35. Avakian E & Garfield E. AMFIS - the automatic microfilm information system. Special Libraries 48:145, 1957.

36. Perry J W. The Western Reserve University searching selector. Tools for machine literature searching. (Perry J W & Kent A, eds.) New York: Interscience, 1958, p. 489-579.

37. Mooers C N. Scientific information retrieval systems for machine operation: case studies in design. Zator Technical Bulletin #66. Boston: Zator Company, 1951.

38. Taube M. Studies in coordinate indexing. 5 volumes. Washington, D.C.: Documentation, Inc., 1953-1959.

39. Unidentified reference. [I would appreciate hearing from any reader who is able to supply this reference.]

40. Mooers C N. Information retrieval viewed as temporal signaling. International Congress of Mathematicians, Harvard University, August 30-September 6, 1950. Proceedings. Providence, R.I.: American Mathematical Society, 1952, p. 572-579.

41. Andrews D D, Frome J, Keller H R, Leibowitz J & Pfeffer H. Recent advances in patent office searching: steroid compounds and ILAS. Advances in documentation and library science. Vol. 2. Information systems in documentation. (Shera J H, ed.) New York: Interscience, 1957, p. 447-477.
