the theory of digital handling of non-numerical information and its implications to machine...


  • 8/18/2019 The Theory of Digital Handling of Non-numerical Information and its implications to machine economics

    1/18

    Zator Technical Bulletin

    Number 48

    THE THEORY OF DIGITAL HANDLING OF NON-NUMERICAL INFORMATION

    AND ITS IMPLICATIONS TO MACHINE ECONOMICS

    by

    Calvin N. Mooers

    Copyright 1950 Zator Company, 79 Milk Street, Boston 9, Mass.


    Zator Technical Bulletin Number 48

    THE THEORY OF DIGITAL HANDLING OF NON-NUMERICAL INFORMATION

    AND ITS IMPLICATIONS TO MACHINE ECONOMICS¹

    Calvin N. Mooers, Zator Company²

    The problem considered is the recall from storage of items of non-numerical information. An example is the library problem for the selection of technical abstracts by subject specification from a listing of such abstracts. There are now several digital machine methods for dealing with this important problem, and the comparative success and machine complexity of each is intimately connected with the principle of digital coding employed. Each information item must be characterized for selection by a set of descriptive terms or "descriptors". Independence of the descriptors in the selective process of an item is most important. The different methods can be distinguished by the manner in which they face or dodge the descriptor problem. The systems now in use and which are considered are (1) alphabetical sorting, (2) numerical code with sorting, (3) Dewey decimal coding, (4) method of exclusive subfields, (5) unit card system, (6) Microfilm Rapid Selector coding (Department of Agriculture system), (7) Microfilm Rapid Selector coding (AEC revision), and (8) superimposed random coding (Zatocoding). The principles of each method are sketched and the implications of the coding with respect to efficient searching and economical machine cost are examined.

    I Introduction
    II The Foundations
    III The Alphabetical Index
    IV Numerical Code and Sorting
    V Dewey Decimal Classification
    VI Method of Exclusive Subfields
    VII Unit Card System
    VIII Microfilm Rapid Selector
    IX Atomic Energy Commission Joint Project
    X Zatocoding
    XI Epilog

    I - INTRODUCTION

    The problem under discussion here is machine searching and retrieval of information from storage according to specification by subject. An example is the library problem of selection of technical abstracts from a listing of such abstracts. It should not be necessary to dwell upon the importance of information retrieval before a scientific group such as this, for all of us have known frustration from the operation of our libraries -- all libraries, without exception.

    1 A paper presented before the Association for Computing Machinery at their Rutgers University Conference on March 29, 1950.

    2 79 Milk Street, Boston 9, Massachusetts.

    Unlike the problem of "random access" to mathematical tables, where the location of the desired tabular information is known in advance once the independent variable is specified, in the library retrieval situation the position and the very existence of the desired information must be discovered.

    The retrieval problem is a digital problem, for in fact all human communication is digital. Information retrieval is a non-numerical problem in part because most human communication is verbal, but more importantly because most ideas or concepts cannot be mapped into a Euclidean 3-space, or higher space. While there are scale readings for the representation of some information, these are relatively few and unimportant. Spatial and metrical concepts do not apply to most information, at least not at the simpler levels. Yet, though the information retrieval problem is non-numerical, there does not seem to be any alternative to the use of digital techniques for its solution. Digital information retrieval systems employing machines are already operating, and their degree of success seems to indicate that this is the direction of progress.

    The intent of this paper is to bring to the attention of members of the Association (1) that objective and scientific analysis and design can be applied to information retrieval problems, (2) that some very worth-while answers are already available, and (3) that the responsibility for the future development of this important field will fall upon the members of this Association.

    II - THE FOUNDATIONS

    In spite of the poor record to date, information retrieval can be treated scientifically when the problem is accurately stated and the applicable parameters are defined and used. Unfortunately, most of the reasoning applying to the mathematical operation of computing machines does not apply.

    The closest analogy to our problem is that of looking up numerical values in a table. However, the retrieval requirements transform this familiar operation into a distorted situation where one might think of a table in which some values are repeated, some are left out, and many inconsistent values are given for the same argument. There is no indication where any of the values are located, but there is the requirement that every value must be found.


    An information retrieval system must be evaluated by at least two criteria: (1) can the system do the job, and (2) how expensive is the solution in terms of the machine, storage capacity, or time? It will be seen that most systems are actually incapable of information retrieval, a conclusion that has been reached empirically by many scientists, but which does not seem to have been seriously realized by those charged with organizing information, who currently are the librarians. When the system can do the job, the next question concerns its economics.

    It is necessary at this point to make a request of my audience: In order to approach this slippery problem with any hope of success or efficiency of thought, it will be necessary for us to put aside almost all the ideas, doctrines, and symbolic or metaphysical superstructure about libraries and library methods that we have learned or otherwise picked up in the past. It can be said, and demonstrated, that almost everything that the librarians hold dear in classification is absolutely wrong for information retrieval. It is my hope to develop the details of this assertion in subsequent papers. However, for the moment, let us put aside all preconceptions and examine the consequences.

    What sort of a precise foundation can we put under the study of information retrieval? To this question, a useful sequence of thought and argument might be sketched in the following fashion: An item of information will be considered to be a single report, scientific paper, or unitary piece of data. In a moment we will see that it is impossible to achieve retrieval by any system of ordering of the physical items themselves; and the logical obverse of this fact is that retrieval requires either some form of scanning in entirety of all information items in the collection, or some manner of dealing with symbolic abbreviations of the content of the items. Scanning in entirety is humanly impracticable and technically undesirable. Therefore, attention must be directed to methods of symbolic description.

    Let there be an artificial language without synonyms whose vocabulary is a fixed list of statements or ideas, which we shall call alternatively "attributes" or "descriptors". In the simplest case, this language is given no further algebra (or internal topology or grammar if you will) than the logical product³ of the descriptors. More complicated algebras have been used quite successfully in chemistry, but it is possible to go very far with this simple algebra. Upon looking at an apple, one could apply the descriptors "fruit" and "red". Each is applied separately, and the apple, being a red fruit, is characterized by this logical product which describes those things that are both red and a fruit.

    3 George Boole. The mathematical analysis of logic. Cambridge, 1847.


    Any information item can have a set of assertions made about it in terms of descriptors chosen from the vocabulary. Thus, the item can be represented by a complex of these descriptors. There can be only a finite number of descriptors in this complex. Where the set $((a_i))$ is the vocabulary, then the $j$th information item can be symbolically represented by $C_j(a_1, a_2, a_3, \ldots, a_n)$.

    The requirement of information retrieval, of finding information whose location or very existence is a priori unknown, is that it be possible by some efficient technique to specify a selection of complexes by means of any set or combination of descriptors chosen in any way from the vocabulary $((a_i))$. There must be complete independence in the choice and use of descriptors.
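
    In present-day terms, this requirement can be sketched in a few lines of Python. The sketch is an illustration only, under the assumption that each complex is held as a set of descriptors; the item names, the sample vocabulary, and the function `select` are hypothetical and are not taken from any system discussed here.

```python
# Hypothetical illustration: each item is a "complex" of descriptors,
# modeled as a frozenset drawn from a fixed vocabulary.
VOCABULARY = {"fruit", "red", "green", "vegetable", "round"}

collection = {
    "item-1": frozenset({"fruit", "red", "round"}),
    "item-2": frozenset({"fruit", "green"}),
    "item-3": frozenset({"vegetable", "red"}),
}


def select(collection, query):
    """Return the ids of all items whose complex contains every query descriptor.

    The query may be any combination of vocabulary descriptors, in any order;
    this is the independence of descriptors the text asks for.
    """
    query = frozenset(query)
    unknown = query - VOCABULARY
    if unknown:
        raise ValueError(f"descriptors not in vocabulary: {sorted(unknown)}")
    return sorted(item for item, complex_ in collection.items() if query <= complex_)


print(select(collection, {"fruit", "red"}))  # ['item-1']
print(select(collection, {"red"}))           # ['item-1', 'item-3']
```

    The sketch scans every complex in the collection, which corresponds to dealing with symbolic abbreviations of the items rather than with the items themselves; it says nothing about the cost of doing so by machine.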

    This is the information retrieval problem as the user of information sees it. This is how the user insists upon specifying his information. Unfortunately, the basic organization and working philosophy of our libraries is concerned with putting away information -- its listing, shelving and storage -- with very little thought to use. Such a philosophy is incompatible with the requirements of information retrieval as I have stated them here, and we shall see why this is so in the study of library systems which follows.

    The different methods proposed and in use for information retrieval can be searchingly criticised in accord with the manner in which they meet or dodge the paramount necessity of complete freedom and independence in the use and choice of descriptors.

    We shall now restrict our field by requiring (1) that a system be capable of dealing with collections of information that may become very large (certainly larger than 10 items), (2) that, in general, at least five attributes or descriptors are necessary for adequate description of the item, (3) that a request for retrieval will take no less than three descriptors operating in conjunction, and (4) that the system is in some way dependent upon a machine in its operation. For the sake of standard nomenclature, we will say that a card (or cards) is assigned to each item of information; that the card bears the citation, "address", or location of the actual document; and that the card also carries in some manner a symbolic designation of the descriptors applicable to the item. In our study, with emphasis upon computing machines and allied techniques, these symbolic designations for selection will be digital, thus allowing machine selection. Techniques which can be carried out on cards can certainly be extended to photographic film strips, magnetic tapes, or other memory or storage devices. In general I will not discuss such extensions. With these ground rules, we shall now look over some current methods and proposals for information retrieval.


    III - THE ALPHABETICAL INDEX

    The standard alphabetical index is one of the simplest methods for information retrieval. A card bears in ordinary language the digital verbal statement of the applicable descriptors, i.e. a list of written words. As has been formulated here, the usual problem of synonyms has been eliminated by the use of the standard vocabulary of attributes. Tabulating machines can be used to sort the cards into a unique alphabetical order providing the cards are punched corresponding to the descriptions. Retrieval presumably consists in going to the one place in the linearly ordered alphabetical file (according to the descriptors specifying the selection) and there finding directly the cards bearing these descriptors.

    Such ease of finding is virtually never the case. It would require, for an item described by five descriptors, that there would be factorial five (120) cards in the file, each card with the descriptors listed in a different order. Such ridiculous multiplicity of cards would be intolerable, as any library user could testify. Consequently all these combinations are never formed. Familiar cross-references are necessary, and there is a serious loss of utility in retrieval. Therefore, I preclude any system having a maze of cross-references as being incapable of handling the multiple descriptor situation.
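
    The count of 120 cards can be checked directly. The following Python sketch is only an illustration; the five descriptor words are hypothetical and stand in for any five entries of the vocabulary.

```python
from itertools import permutations
from math import factorial

# Hypothetical descriptors for a single item.
descriptors = ["alloy", "corrosion", "fatigue", "steel", "welding"]

# One index card per ordering of the descriptors, so that a searcher who
# starts from any descriptor sequence finds a card filed under it.
cards = [" / ".join(order) for order in permutations(descriptors)]

print(len(cards))                   # 120
print(factorial(len(descriptors)))  # 120
print(cards[0])                     # alloy / corrosion / fatigue / steel / welding
```

    With ten descriptors per item the same count is 3,628,800 cards, which shows how quickly the multiplication of storage grows.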

    It can be definitely said that the alphabetical index (either on cards or listed in a book) does not meet the fundamental requirement of a complete independence in the use of descriptors in retrieval, and it can never meet this requirement without inordinate multiplication of the storage requirements, or by the use of an unacceptable system of cross-referencing. For this reason the alphabetical index is incapable of information retrieval in the sense here under discussion.

    IV - NUMERICAL CODE AND SORTING

    The Patent Office classes and subclasses⁴ seem to be one of the best examples of this technique, though it is not fully expanded so as to differentiate down to a single patent. The Dysonian system⁵ of ciphering organic chemical compounds also seems to fall in this category. For the purpose of the study here, this method can be thought of as the alphabetical technique described above, with a translation of letters into numbers. I know of no system in this category which actually displays an independent use of descriptors.

    4 Manual of Classification of Patents, 1947, U.S. Department of Commerce.

    5 G. Malcolm Dyson. A new notation and enumeration system for organic compounds. (2nd ed.) London, Longmans Green. 1949.


    V - DEWEY DECIMAL CLASSIFICATION

    This system of numerical classification was devised by Melvil Dewey in 1873, and now is in widespread use for classification of library books in the United States. On the Continent an expanded version of the Dewey system (more decimal places) is known as the Universal Decimal Classification. The first empirical comment that can be made of the Dewey system is that the librarians who use it for putting away books on the library shelves never themselves use the Dewey decimals directly for information retrieval. If this statement provokes any disagreement, I suggest that you try asking a librarian to bring out to you all books in class 512.2, or any other class, or that you actually try to use the classification schedule, and nothing more, to find your information. In fact, the librarians themselves have indexed the Dewey schedule so they can find the subjects listed in it!

    Far more pertinent to our study here is the theoretical basis of this decimal classification. While there is an elaborate dogma of postulates, I believe we can quickly cut through to the core of the matter by the following reasoning. A basic assumption of the system is that each information item in the universe can be mapped into a single (and it is believed unique) point on the real line interval from 0 to 1. This is the librarian's ideal of "pin-pointing" the information. It is further believed, by proper attention to the construction of the classification schedule, that this mapping can be made topologically continuous. What this means is that about any point on the real line interval there is a neighborhood, or small segment of the line, in which all points are associated with the information items having a close conceptual similarity to each other. This mapping of ideas onto the line "groups" the ideas, or classifies them, according to the beliefs of the proponents of the Dewey decimal system. Moreover, they believe that the mapping is such that the neighborhood about a given point (with the neighborhood assumed to be connected, and not broken into segments) must contain all the conceptually similar points, with emphasis on the "all".

    From ordinary experience with libraries organized by the Dewey system, we know that these beliefs do not correspond to fact. The mapping of idea complexes onto the real line interval is not topologically continuous in a library. Books written in German, for instance, are scattered throughout the shelves. This inability to set up such a mapping is not due to any lack of skill or patience or lack of revision of the classification schedules. It is due to a fundamental property of information itself as compared to the decimal technique.

    The same difficulty that precludes the mathematical definition of a continuous transformation from a space of two or more dimensions onto a single-dimensional line element⁶ precludes the attainment of the Dewey decimal idea. Idea complexes, in so far as they contain more than one independent descriptor, are multi-dimensional. The Dewey transformation from idea space onto the one-dimensional 0-1 interval is logically impossible.

    The postulates of the Dewey system are incompatible among themselves, and the system can never be readjusted so as to perform the task set for it. Practically, and from the retrieval standpoint, the Dewey classification does and must scatter widely the information bearing on an arbitrary idea complex. Therefore, since it misses its only goal, it is really incapable of information retrieval as we have formulated it.

    The amazing thing about the continued usage and growth of the Dewey decimal system, and the U.D.C., is the tyranny that it has exerted over scientists who really should have known better. For at least thirty or forty years mathematicians have had the critical tools available that would have demolished the Dewey postulates, and we can only wonder why it has not been done before this time.

    VI - METHOD OF EXCLUSIVE SUBFIELDS

    This is the descriptive name of the most prevalent method of information retrieval using punched cards. It is a tantalizing method in that it almost makes the grade to give a very efficient solution to the problem. Near-success of this method has often been mistaken for real success, and therefore this method has attracted a great deal of attention as a competent punched card method for solution of the retrieval problem.

    By the method of exclusive subfields, each information item or report is given one punched card. In selection the whole collection is scanned by some mechanical device. The digital coding area of each card is partitioned into a standard set of subfields, each of which can contain the digital representation of a single descriptor.

    The fundamental difficulty of this system is due to the indeterminacy in allocation or placement of descriptors among the various exclusive subfields. Either there must be a standard placement for the different classes of descriptors, or a machine selection device must be capable of trying all possible subfield locations.

    6 L. E. J. Brouwer. Beweis der Invarianz der Dimensionenzahl. Math. Ann. vol. 70 (1911) pp. 161-165.

    7 For citations refer to: Lorna Ferris, Kanardy Taylor, and J. W. Perry, Bibliography on the uses of punched cards, procurable from the American Chemical Society.


    Such a machine is necessarily complicated and wasteful. Alternatively, if the descriptors are placed in a standard location, there arises a subtle rigidity that has in general been little understood, though the effects have been deplored. These effects have led to the general opinion that any punch card cannot carry enough punches, because it has always been impossible to organize a system using punched cards which could actually handle the full range of information of a large file. The effect is real, and it is serious. For a complete discussion of this indeterminacy of subfields, reference is made to the author's earlier paper.⁹

    Approximate solutions to this problem, when attempted, invariably lead to restrictions upon the use of descriptors; they can no longer be used in an independent fashion.

    Therefore, the punched card system with mutually exclusive subfields is not found to meet the requirements of information retrieval as we have formulated them.
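
    One way to picture the indeterminacy of subfields is to count the placements that a selector comparing cards position by position would have to try when no standard placement is enforced. The Python sketch below is only an illustration under that assumption; the four-subfield layout, the descriptor words, and the function names are hypothetical.

```python
from itertools import permutations

N_SUBFIELDS = 4  # hypothetical card layout: four exclusive subfields


def matches_fixed_template(card, template):
    """Compare subfield by subfield; None in the template means "don't care"."""
    return all(t is None or t == c for c, t in zip(card, template))


def templates_for(query, n_subfields=N_SUBFIELDS):
    """Every distinct placement of the query descriptors among the subfields."""
    padded = list(query) + [None] * (n_subfields - len(query))
    return set(permutations(padded))


card = ("steel", "fatigue", "welding", "corrosion")
query = ("welding", "steel")

templates = templates_for(query)
print(len(templates))  # 12 placements for two descriptors in four subfields
print(any(matches_fixed_template(card, t) for t in templates))  # True
```

    Fixing a standard subfield for each class of descriptor reduces the search to a single template, but only by giving up the free placement of descriptors.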

    VII - UNIT CARD SYSTEM

    This ingenious technique has a solution to the problem of indeterminacy of subfields, but it achieves this goal at the cost of a separate card for every descriptor of every item in the collection. The unit card system can first be criticised for the excessive demands it makes on the storage system. Experiments with this method, using office tabulating machinery, have been conducted at the U. S. Patent Office, the Chemical-Biological Coordination Center, and at other places.

    Each card has two symbols: a document number punched at one end, and a single descriptor punched at the other end. There are as many cards for each item as there are applicable descriptors. A patent specification might take as many as 20 to 50 cards for as many descriptors. The cards are alphabetized into serial order according to the descriptors and the document number, and are then stored in this order.

    Selection is made upon a concurrence of descriptors such as "a, b, and c". To do so, one goes to the file under descriptor "a", takes out the cards, doing in turn the same for "b" and "c". There will be many cards in each collection. The cards for descriptors "a" and "b" are then placed in the two feed positions of a collating machine, and coincidences between the document numbers are looked for.

    8 J. E. Holmstrom. The Royal Society Scientific Information Conference, London. 1948. p. 264.

    9 C. N. Mooers. Zatocoding for punched cards, Zator Technical Bulletin No. 30. Zator Company, Boston. 1950.


    Pairs of cards having coincidences are set aside, and these in turn are then collated against the cards of the collection "c". Cards having triple coincidences represent the desired documents.
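
    The collation procedure can be sketched in Python as a merge-style pass over sorted lists of document numbers. This is an illustration only: the sample cards, the dictionary standing in for the stored file, and the function `collate` are assumptions made for the sketch, not a description of any collating machine.

```python
def collate(left, right):
    """Walk two sorted lists of document numbers in step and return the
    numbers present in both (the "coincidences")."""
    i = j = 0
    coincidences = []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            coincidences.append(left[i])
            i += 1
            j += 1
        elif left[i] < right[j]:
            i += 1
        else:
            j += 1
    return coincidences


# Hypothetical unit cards: one (descriptor, document number) pair per card.
unit_cards = [
    ("b", 12), ("a", 7), ("c", 20), ("a", 3), ("c", 12),
    ("a", 12), ("b", 15), ("a", 20), ("c", 1), ("b", 7),
]

# Store the cards in serial order by descriptor, then by document number.
file_by_descriptor = {}
for descriptor, document in sorted(unit_cards):
    file_by_descriptor.setdefault(descriptor, []).append(document)

pairs = collate(file_by_descriptor["a"], file_by_descriptor["b"])
triples = collate(pairs, file_by_descriptor["c"])
print(pairs)    # [7, 12]
print(triples)  # [12]
```

    Each additional descriptor in the request costs one more pass through the collator, and the length of each pass grows with the number of cards filed under the descriptors involved.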

    It is one of the hopes of proponents of this method that the time-consuming collation process can be held down by using very narrow and precise descriptors. If this could be done, the unit card system would be an excellent solution to information retrieval.

    For my part, I have at least two objections to the unit card method, besides the matter of the great overload on storage. The first is that in all my experience working with retrieval systems, I have found that the descriptors must be broad, not precise. This seems fundamental to the whole retrieval situation, and enters in several different ways. The second point is that large-scale mechanization of the unit card system gives rise to difficulties with respect to collation and the ease of insertion of new items in the storage system. This is inherently a matter of the use of machines, and it involves the sequential sorting problem. In particular, readjustment of the record would become most difficult if the system went beyond the use of cards into the use of a magnetic or film record. I bring this up because a unit card system applied to more than one million items must run into an enormous collection of cards, and in a project of this size it would be desirable to completely mechanize the process by the use of a film or tape record.

    It can be concluded, though with some serious reservations, that the unit card system is the first system considered that can actually meet the requirements set for information retrieval.

    VIII - MICROFILM RAPID SELECTOR

    This is the machine constructed for the Department of Agriculture by Engineering Research Associates along the lines of an earlier though similar machine by V. Bush. It is a very interesting device. As an electronic machine, it can be criticised for being at least an order of magnitude too slow in its speed of scanning. It is slow by a factor of 100 as compared to the internal processes of the BINAC. Its present low rate of scanning of only 10,000 items per minute makes its present cost difficult to justify when compared to the speeds of about 1,000 items per minute that can be attained in a comparable selection situation when sorting cards by a simple hand operated machine. It has been suggested, however, that succeeding versions of the machine would be cheaper than the $75,000 cost of the first model.

    The full mechanical details of the machine are to be found in a report¹⁰ by Engineering Research Associates. We will consider here only those mechanical details which have an impact on the theory of information retrieval. In contrast to the method of mutually exclusive subfields already discussed, this machine might be said to operate on the method of "alternative subfields".¹¹

    The record medium is a reel of film. Each frame of the 35 mm film is split in two, with one half carrying a photo reduction of a typewritten abstract of a document and the citation. The other half of the frame carries six subfields for digital designation of the descriptors. Each subfield has 35 binary positions that can either be blackened or left clear. With five binary positions per decimal digit, a subfield can represent any seven-digit number, giving ten million different codes.

    The machine serially scans all the frames at a rate of about 180 per second, and a selective photocell arrangement scrutinizes in series each subfield of every frame as it passes. The machine selects when the specified seven-digit code pattern is found in any subfield of a frame. Upon selection, a micro-flash lamp fires and a photographic copy is made of the moving abstract. The machine selects according to only a single seven-digit code, and there is no way to use two codes for selection on a combination.
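
    The figures quoted for the film record, and the rule of selection on a single code, can be put together in a short Python sketch. It is an illustration only; the frame contents, the code numbers, and the function `scan` are hypothetical, and the sketch models the selection logic alone, not the optics or the film transport.

```python
BITS_PER_SUBFIELD = 35
BITS_PER_DIGIT = 5
SUBFIELDS_PER_FRAME = 6
FRAMES_PER_SECOND = 180

digits_per_code = BITS_PER_SUBFIELD // BITS_PER_DIGIT
print(digits_per_code)         # 7
print(10 ** digits_per_code)   # 10000000 different codes
print(FRAMES_PER_SECOND * 60)  # 10800 frames per minute


def scan(frames, wanted_code):
    """Pass over every frame in order and select a frame when the single
    wanted code appears in any one of its subfields."""
    selected = []
    for abstract, subfields in frames:
        assert len(subfields) <= SUBFIELDS_PER_FRAME
        if wanted_code in subfields:
            selected.append(abstract)
    return selected


# Hypothetical film: (abstract, codes in no particular order).
film = [
    ("abstract 1", (1234567, 7654321)),
    ("abstract 2", (1111111,)),
    ("abstract 3", (2222222, 1234567, 3333333)),
]

print(scan(film, 1234567))  # ['abstract 1', 'abstract 3']
```

    Because `scan` accepts only one code, a request on a combination of descriptors cannot be expressed in it, which mirrors the limitation described above.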

    The method for coding, the assignment of verbal meanings to the seven-digit numbers, is unusual. In explaining the method by which he does this, Ralph R. Shaw, the Librarian of the Department of Agriculture, at the meeting of the American Chemical Society last fall, pointed out there was apparently no foreseeable unanimity about schemes for classification. Instead of waiting for problematic future agreements in this field, he said it was his aim to avoid conflict and to apply what was already in wide use and acceptance. He bases his coding system upon those large indexes that are already in operation for the different scientific fields. For instance, in chemistry, he uses Chemical Abstracts. Given the index, he takes a numbering machine, and starting with the number 0000001 at "AAA", he numbers each line and entry of the entire index. Sub-entries are numbered serially after the main entries, all in order. An abstract is given up to six of such codes, one in each subfield, and with no particular order for the sequence of codes on the frame.

    10 Anon. Report for the Microfilm Rapid Selector (Engineering Research Associates, Inc.), No. 97313, U. S. Department of Commerce. 1949.

    11 The selector described by J. Samain (pp. 265-266, 680-685, The Royal Society Scientific Information Conference, London, 1948) also makes use of alternative subfields, but with a different method of coding.


    To perform a machine selection, it is first necessary to look in the index to find the single word entry and corresponding code that will give the selection desired. The one code is entered in the machine. The machine then copies out those abstracts that have this code in one of their subfields.

    In effect, with this method of coding, the mechanism does only the equivalent of looking up page numbers and making a photographic copy of the abstract. There is still the severe load upon the human operator who must select the single index entry to define the selection.

    All the difficulties that apply to the index method must apply here, irrespective of the machine operation in the copying stage. In terms of our formulation of the problem, the descriptors are not independent. Each code really represents a complex of descriptors, and the choice of descriptor combinations is restricted to those that are already listed in the index. Unusual configurations of descriptors, possibly not of importance when the original index was constructed, are impossible to find, due to the coding method chosen. Thus in the strict sense in which we are using the term, the technique is incapable of efficient information retrieval.

    IX - ATOMIC ENERGY COMMISSION JOINT PROJECT

    These difficulties of the Microfilm Rapid Selector have been realized, and there is now underway a joint project between the Department of Agriculture and the Atomic Energy Commission for the revision of the present machine. The direction of these improvements is not clear at this time, though there may be an effort to set up the machine and its coding so that statements in the form of the propositional calculus can be put into the coding and selection¹². In this field there is a considerable amount of work that might be adapted to the electrical or electronic realization of such logical relations, of which the work of Shannon might be mentioned¹³. Shannon showed mathematically how a large range of propositional functions in the logical calculus could be set up by realizable configurations of relays and contactors.

    With respect to this tentative line of approach, but without reference to the AEC work, I might make some additional remarks. By setting up an array of ideas in a logical structure symbolized by a polynomial in the propositional calculus, one is in effect imposing a grammar on the ideas. The two concepts are in a way equivalent. It has been my experience, from experimenting with modes of description for machine selection, that unless the atomic ideas exist in a very well-determined structure, the grammar can cause trouble by imposing a "point of view". For instance, "I eat a banana" and "The banana was eaten by me" mean exactly the same thing, though their form is quite different because of the differing points of view. More complex situations are even more difficult.

    12 Mortimer Taube, A.E.C. Personal communication.

    13 C. E. Shannon. A symbolic analysis of relay and switching circuits. Trans. Amer. Inst. Electr. Engr. v. 57 pp. 713-723 (1938).

    In chemistry the structures are quite determinate (at least up to a certain point) and then grammar can often be used to advantage, though even so it can be overdone.

    X - ZATOCODING

    Zatocoding is a new method of coding that I am very much interested in, since I have been concerned with its mathematical formulation. It is inherently a principle of coding rather than any specific machine embodiment. It can be applied with a number of different digital machines: electronic scanners, tabulating machinery, and even with such simple hand-sorted punched cards as the one I am showing in the Zator Company exhibit at Rutgers here today. Zatocoding can be briefly characterized as the coding technique which uses the superimposition of random subject codes in a single coding field.

    Zatocoding is a system of coding which was designed primarily for information retrieval, and it has revealed the need for some drastic changes in the conventional library postulates or doctrines. For instance, in Zatocoding the unwanted bulk of the material in the file is rejected according to statistical rules, rather than by the principles of the "exactness" implied in ordinary library systems.

    Zatocoding is able to combine the best feature of the unit card system (the independence of attributes) with the best feature of the method of exclusive subfields (the use of a single card per information item). Yet Zatocoding is able to leave behind the most serious disadvantages of both methods: respectively, the many cards per information item in the unit card system, and the indeterminacy of subfields in the method of exclusive subfields.

    The Zatocoding method is as follows: To each information item there is delegated a single card which has a field for carrying punches. Other carriers of digital information, such as film, could be used quite as well. Each information item is characterized by a set of attributes, which we can consider as having been written out on the face of the card. There are as many cards as there are information items in the collection. The set of all the attributes used in the whole collection forms a "vocabulary" of descriptive terms. Codes are assigned to the attributes in the vocabulary by starting at the top of the list and giving the first attribute a random pattern of punches ranging over the field. The second attribute is given a second pattern also ranging over the field, and generated randomly and independently of the first. And so on, for each attribute in turn.

    A card is coded by finding those patterns assigned to the attributes written on the card, and by punching those patterns into the single field of the card one on top of the other, in superimposition. Mathematically speaking, the patterns are combined by Boolean addition in the single coordinate system of the field.
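    As an illustration, the code assignment and card punching just described can be sketched in Python. The field size of 40 positions, the four punches per pattern, the sample vocabulary, and the function names are all assumptions chosen for the example; the text fixes none of them.

```python
import random

FIELD_SIZE = 40       # assumed number of punch positions in the single field
MARKS_PER_CODE = 4    # assumed number of punches in each attribute pattern

def assign_codes(vocabulary, field_size=FIELD_SIZE, marks=MARKS_PER_CODE, seed=0):
    """Give each attribute its own random pattern, drawn independently of the others."""
    rng = random.Random(seed)
    return {attr: frozenset(rng.sample(range(field_size), marks))
            for attr in vocabulary}

def punch_card(attributes, codes):
    """Superimpose the patterns of a card's attributes (Boolean addition, i.e. OR)."""
    field = set()
    for attr in attributes:
        field |= codes[attr]
    return frozenset(field)

# Hypothetical vocabulary; the terms are borrowed from examples in the text.
vocabulary = ["airdromes", "airfoils", "large", "red", "apples"]
codes = assign_codes(vocabulary)
card = punch_card(["large", "red", "apples"], codes)
print(sorted(card))   # punched positions on the card
```

    Because patterns may overlap, the card can carry fewer punches than the sum of the punches in its individual patterns.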

    The feature of randomness of the codes in Zatocoding is very important and merits additional discussion. It is not sufficient for the cards, as punched out with the several codes, merely to give the appearance of randomness in the dictionary sense: "without definite aim, direction, rule, or method". This is not enough. What is required for successful operation of the statistical selection process in Zatocoding is (1) that the code patterns, taken individually, have a mathematically random scatter of punches ranging over the field, and (2) that the patterns considered with respect to the list of attributes exhibit a mathematical randomness from one to another.

    The reason for such a stringent requirement on randomness is that in selection the statistical rejection of the unwanted cards operates by means of the differing patterns of the selected cards as compared to the rejected cards. The only way to guarantee code patterns that differ as much as possible among themselves is to produce them randomly with respect to each other. When this is done, the required randomness will prevail no matter how the attributes may be arranged in alphabetical lists, or classified by subject. Thus "airdromes" and "airfoils" have entirely different Zatocoding patterns, in spite of their alphabetical or subject contiguity, and in selection a strong statistical discrimination is exerted between them.

    To return to the pack of cards, in Zatocoding each card is punched with a set of random patterns in superimposition in the field, with these patterns individually representing the attributes descriptive of the subject content of the card. A Zatocoding selection is defined by a single or multiple set of attributes, typically two or three in combination. To carry out a selection, the Zatocoding patterns corresponding to the several selecting attributes are combined also by Boolean addition, to give the total selective pattern S. Then if C is a typical pattern on a card, Zatocoding selection occurs when the pattern C includes S, or in the notation of Boolean algebra, when C'S = 0, with the apostrophe denoting complementation with respect to the whole field of the card.
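    The inclusion test C'S = 0 maps directly onto bitwise operations. The following Python sketch treats each pattern as an integer bit mask; the 40-position field and the hand-written patterns for "large", "red", "apples", and "green" are hypothetical values made up for the illustration.

```python
FIELD_SIZE = 40                      # assumed number of positions in the field
FULL_FIELD = (1 << FIELD_SIZE) - 1   # mask with every position of the field set

def to_mask(positions):
    """Turn a collection of punch positions into an integer bit mask."""
    mask = 0
    for p in positions:
        mask |= 1 << p
    return mask

def is_selected(card, selector):
    """Return True when card pattern C includes selector pattern S (C'S = 0)."""
    complement = FULL_FIELD & ~card      # C': positions not punched on the card
    return (complement & selector) == 0

# Hypothetical attribute patterns, written out by hand for the illustration.
large = to_mask([1, 7, 20, 33])
red = to_mask([4, 7, 15, 38])
apples = to_mask([2, 11, 20, 29])
green = to_mask([0, 9, 18, 27])

selector = large | red               # Boolean addition of the selecting patterns
card_a = large | red | apples        # bears both selecting attributes
card_b = large | green | apples      # lacks "red"

print(is_selected(card_a, selector))   # True
print(is_selected(card_b, selector))   # False
```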

    14 See Garrett Birkhoff and S. MacLane, A survey of modern algebra, Macmillan Co., New York 1944, pp. 311-332.


    Those cards are selected which contain each and every one of the selector attributes. If the patterns of these attributes have been punched out on a particular card, the inclusion relation must hold with this card, and it must be selected irrespective of the other patterns on the card. In this respect, selection is according to the logical product of the selector attributes; e.g. all cards bearing punches for "large", "red", and "apples" simultaneously will come out when these attributes are placed in the selector.

    While all cards fitting the selector prescription must be selected by the inclusion principle, the strict converse does not hold with respect to the exclusion of the unwanted cards. This peculiarity of Zatocoding comes from the superimposition of many code patterns in the single field of the card. There is an intermingling and overlapping of the individual patterns. Because of this overlapping, there is a finite statistical possibility that patterns having no intellectual connection with the desired patterns can combine to simulate the configuration of punches in the cards having the desired patterns. Such cards do select out, and are called "extra cards". However, and this is important, the relative frequency of such extra cards with respect to the entire collection is under strict statistical control. Typically, in a selection on two patterns, the frequency will be 0.001 or less. Generally, where S is the total number of positions in the selector pattern, the average ratio of extra cards is always less than (1/2)^S, and often very much less. 9
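    A small simulation gives a feel for how rare extra cards are. This is only a sketch under assumed parameters: a 40-position field, four punches per pattern, a vocabulary of 200 attributes, and five attributes per card, so that no card has more than half its field punched. None of these figures comes from the text. The code counts cards that pass the inclusion test without actually bearing both selecting attributes and compares their frequency with (1/2)^S.

```python
import random

def random_code(rng, field_size, marks):
    """One random pattern: a set of distinct punch positions."""
    return frozenset(rng.sample(range(field_size), marks))

def extra_card_rate(n_cards=20000, field_size=40, marks=4,
                    vocab_size=200, attrs_per_card=5, seed=1):
    """Estimate the frequency of extra cards for a two-attribute selection."""
    rng = random.Random(seed)
    codes = [random_code(rng, field_size, marks) for _ in range(vocab_size)]
    wanted = (0, 1)                      # select on the first two attributes
    selector = codes[0] | codes[1]       # total selective pattern
    extras = 0
    for _ in range(n_cards):
        attrs = rng.sample(range(vocab_size), attrs_per_card)
        card = frozenset().union(*(codes[a] for a in attrs))
        truly_wanted = all(w in attrs for w in wanted)
        if selector <= card and not truly_wanted:
            extras += 1                  # selected by overlap alone
    return extras / n_cards, 0.5 ** len(selector)

rate, bound = extra_card_rate()
print(f"extra-card rate {rate:.5f}, bound (1/2)^S = {bound:.5f}")
```

    With more attributes per card, or more punches per pattern, the field fills up and the extra-card rate rises, which is why the density of punches has to be kept under control in the design of the system.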

    One might say that selection of cards by Zatocoding is according to the logical product of the selector attributes plus "epsilon", where epsilon can be made as small as desired by design of the system. While Zatocoding selection is not exact from a perfectionist's standpoint, it is a good engineering solution to a problem, particularly when epsilon can easily be brought to 10”̂ or less if ever required.

    Zatocoding, by accepting the existence of the inconsequential epsilon, accomplishes these things:

    1. There is no indeterminacy of subfields for the location of an attribute on a card, because all the codes are in a single coordinate frame.

    2. To find an attribute, a selector mechanism need search only in one location on the card.

    3. Attributes are used entirely independently, both in selection and in making up the card. Such independence, in conjunction with good statistical control of extras, is gained through the use of random code assignments.


    4. A feature of great practical importance is the enormous increase in the size of the usable vocabulary 9 as compared to coding methods such as the "method of exclusive subfields".

    5. Selection by Zatocoding automatically is made according to the logical product of the selector attributes, the natural way for defining a selection.

    6. Zatocoding leads to extremely simple structures in the selective machines, a matter of great importance to this group of machine designers and builders.

    Because all patterns with Zatocoding are carried in the single coordinate frame of the field, the selector mechanism does not require a "subfield shifter" so that it can look into several different subfields on the card for individual patterns. Because of pattern inclusion selection, a very simple digital pattern recognition scheme is possible, the simplest being represented by a mask and a photoelectric arrangement in an optical system.

    Simple structure quite generally means high-speed operation. This is true of Zatocoding. The exhibit I have here at Rutgers shows (as you perhaps have tried for yourself) that cards can be sorted at a rate of around 800 per minute with a strictly mechanical device, without recourse to electronics. This is very favorable when compared with other systems, such as tabulating machines. On a frequency basis (binary digit field positions scanned per second), this simple selector is operating at about 530 digits per second.
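    The two figures quoted for the mechanical selector can be checked against each other with a line or two of arithmetic. The field size that comes out is an inference from those figures, not a number stated in the text.

```python
cards_per_minute = 800      # sorting rate quoted for the mechanical selector
digits_per_second = 530     # scanning rate quoted for the same device

cards_per_second = cards_per_minute / 60
# Field positions scanned per card, implied by the two quoted rates.
positions_per_card = digits_per_second / cards_per_second

print(f"{cards_per_second:.1f} cards per second")
print(f"about {positions_per_card:.0f} field positions per card")
```

    The quoted rates are thus consistent with a coding field of roughly 40 positions per card.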

    Because of the extreme simplicity possible with Zatocoding, it is possible to envisage an electronic selector method for scanning a record at a rate of 10° digits per second using essentially our present technology. This represents a scanning speed of approximately a million coded fields per second. At this rate a subject search of all the volumes in the Library of Congress would take less than ten seconds, and a search of all known scientific papers for all time would take less than five minutes.
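    The search times quoted above imply upper limits on the size of the collections being imagined. The short calculation below uses only the rate of a million coded fields per second and assumes one coded field per item; the helper name is hypothetical.

```python
FIELDS_PER_SECOND = 1_000_000   # "approximately a million coded fields per second"

def items_searchable(seconds, rate=FIELDS_PER_SECOND):
    """Number of coded fields (one per item) that can be scanned in the given time."""
    return int(seconds * rate)

print(items_searchable(10))        # ten seconds: 10000000 items
print(items_searchable(5 * 60))    # five minutes: 300000000 items
```

    On these figures, "less than ten seconds" corresponds to a collection of under ten million items, and "less than five minutes" to one of under three hundred million.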

    XI - EPILOG

    The conclusions with respect to information retrieval and machine economics of various systems are already clear and need not be dwelt upon here. The systems, as discussed, represent a graded series of techniques ordered with increasing sophistication of approach. At the same time, there is a progressive lifting of the intellectual load on the user of the retrieval system. For instance, a user, having decided upon a subject, need give much less intellectual attention to sorting a pack of punched cards than he need give to the job of using a highly cross-referenced card index system. This is all to the good, for it should be the purpose of machines to remove the load of mere drudgery from the minds of human beings.

    15 For a description, see Zator Technical Bulletin No. *K).

    With respect to large-scale systems for the machine retrieval of information, I believe that these things can be asserted: Information retrieval, in the useful sense defined here, is now possible with known systems and mechanisms. Large-scale high-speed machines are being built, and greatly improved machines will come from further application of known techniques.

    Machine builders and applied mathematicians have now taken the lead without waiting for librarians to come to grips with or formulate the solution to the large-scale machine retrieval problem. They will soon be in the position of telling the librarians and documentalists what the fundamental operational requirements of retrieval are, of developing the theories that apply, and they are well on the way to the production of useful machines to do the job.

    Library science has been largely stalled for two millennia with an organization principle which came from Aristotle, and storage principles of the Ptolemaic librarians of Alexandria. Now we can hope that the initially successful departures into new methods and machines for information retrieval will continue and expand. We can also hope that effort will at last be guided by the principles of engineering and the scientific method instead of the outworn metaphysics which has too long held sway through actual default on the part of the scientists.

    * * * *
