
Computers and the Humanities 22 (1988) 285--291. © 1988 by Kluwer Academic Publishers.

Unseen Users, Unknown Systems: Computer Design for a Scholar's Dictionary

Richard L. Venezky

Department of Educational Studies, University of Delaware, Newark, DE 19716, U.S.A.

Abstract: The Dictionary of Old English computing systems have provided access since the 1970s to a database of approximately three million running words. These systems, designed for a variety of machines and written in a variety of languages, have until recently been planned with computing center billing algorithms in mind. With personal workstations emphasis has shifted to building more elegant user interfaces and to providing the entire DOE database to editors around the world. While the shift from sequential files to random access files and the provision of extensive development tools have changed some of the design process, error checking and protection of the database against accidental intrusion have remained as central issues.

Key Words: dictionary, lexicography, concording, lemmatizing, editing, CD-ROM, database.

Designing software systems for a large scholarly dictionary is like preparing a spaceship for a journey to a distant galaxy in outer space. In both cases the duration of the project will span many technological generations and perhaps more than one human generation. Needs perceived at the beginning may change as new capabilities become available and as new challenges are encountered.

Richard L. Venezky is Unidel Professor of Educational Studies and professor of Computer and Information Sciences at the University of Delaware. He was formerly professor and chair of Computer Sciences at the University of Wisconsin. His research interests include writing systems, literacy, knowledge representation, and computer-assisted instruction. Among his recent publications are The Subtle Danger: Reflections on the Literacy Abilities of America's Young Adults (Princeton, NJ: ETS, 1987), and "Steps Towards a Modern History of American Reading Instruction" (Review of Research in Education, 1986, vol. 13, 129--70).

Design constraints of one era may become irrelevant in another. Even the users of the two systems may change in unpredictable ways from beginning to end of the journey. The Oxford English Dictionary, for example, began when the Hoe rotary press, the telegraph, and the typewriter were still relatively new, and was finished at a time when the linotype machine, the radio, and the automobile were commonplace.

The Dictionary of Old English, which I will discuss here, was begun in earnest in 1970 and produced its first microfiche fascicle (the letter 'D') last year. At its outset it was supported mainly by the Canada Council, and more recently by the Social Sciences and Humanities Research Council of Canada. With the cooperation of patient and solvent sponsors and with the continued productivity of the staff, the project might conclude by the year 2000. For a scholar's dictionary of the scope projected for the DOE, with an anticipated 40,000+ entries and perhaps 6,000 printed pages, thirty years is not an alarmingly long period, but in a world where equipment, software tools, and processing techniques change at a rapid pace, thirty years is closer to infinity than it is to stability. This project has required continual development of computing systems, from those that generated the earliest concordances to projected ones that will retrieve data from a CD-ROM database, but all have emphasized, and will continue to emphasize, secure handling of large files and the need to interact smoothly with non-programmers. Future systems must operate in cultural and technological contexts in which various design decisions have already been taken. The lessons to be derived for new projects may support similar future journeys into the unknown.


An Overview of the DOE

Plans for a new dictionary of Old English were first discussed at an international conference at the University of Toronto in March 1969 (Cameron, Frank, & Leyerle, 1970). Later in the spring of that year, through invitation of an International Advisory Committee appointed by the Old English Group of the Modern Language Association, Christopher Ball of Lincoln College, Oxford, and Angus Cameron of the Centre for Medieval Studies, University of Toronto, became editors of the new dictionary. At a second conference, held at the University of Toronto in September 1970, initial plans for the Dictionary were presented and work began in earnest (Frank & Cameron, 1973). Between the first and second Toronto meetings I was asked to develop a plan for using computers to generate and merge concordances and for eventual on-line editing of the dictionary. Subsequently, as computer consultant to the Dictionary, I was able to draw upon others to provide valuable technical assistance.

The basic task which we faced in 1970 was to obtain well-edited versions of the extant Old English texts, representing more than three million running words; to transfer these to machine-readable form; and to generate concordances from them. From this point a variety of options were available, but all led eventually to a single concordance which at one stage or another had to be converted from a raw concordance, based upon textual spellings, to a lemmatized one based upon the entry forms of head words selected by the editors. Each record in the merged concordance corresponds to a slip in the traditional language of lexicography, that is, a textual citation for a word, including sufficient syntactic context to reveal the word's usage, plus whatever identifying information is required to locate the citation in an accessible edition. Dictionary entries are generated through shuffling, organizing, and analyzing the slips for a headword, reinforced and aided by references to auxiliary materials: other dictionaries, word studies, Latin parallels, grammars, and the like. The needs of the DOE, therefore, could be summarized in three categories: data input, slip generation, and entry composition.

The World of 1970

In 1970, when the DOE first opened shop, software development in North America was in its Late Pleistocene, dominated by Fortran and Cobol; structured programming (Dahl, Dijkstra & Hoare, 1972) was still more a rumor than a reality. An upstart operating system called Unix was barely a year old and just on its way to the PDP-11. Wirth's definition of Pascal was a year from publication (Wirth, 1971) and C would not debut for two more years (Ritchie & Thompson, 1978). At the University of Wisconsin, where the processing for the DOE was being done, the computing center workhorse was a Univac 1108, running some version of the operating system Exec-8.

LEXICO

When we began planning for the DOE processing, we already had a crude concordance program and some experience in processing languages like Old English which had mixed character sets (Venezky, 1971). Others with whose work we were then familiar also had experience in these same areas (e.g., Busa, 1964; Walker, 1967). We knew, furthermore, that software design had to be driven as much by the local computing center billing algorithm as by the desire to incorporate proven software utilities. For example, at the time, a concordance that generated only 916 keywords, 1558 entries, and 110 pages of output cost $13.31, an amount that if extrapolated linearly to our full corpus would have been far beyond our means. Furthermore, only about one-half of this amount was absorbed by CPU time; I/O and page costs also represented major proportions of the total costs. As we began to project the processing procedures and costs for inputting, correcting, and concording approximately 3,000 texts, it became apparent that new processing techniques were needed, particularly if non-programmers were to interact with the computing system. Under a grant from the National Science Foundation, we began to explore procedures for storing and processing texts, and for building user-tolerable systems. The result of this effort was a system called LEXICO which was completed in 1975 and which served most of the DOE's needs for the next six years (Venezky, Relles, & Price, 1977).

LEXICO is a text processing system which was implemented in Fortran on a Univac 1110 and which provides the following capabilities:


1. Forming and maintaining a collection of texts, including entering new texts into a collection, deleting and editing texts already in a collection, and specifying collection parameters to reduce repetition of control statements;

2. Concording individual texts;

3. Classifying words in a concordance by headword (baseform, lemma);

4. Generating slips for a dictionary file or a base concordance.

LEXICO was designed for users who knew little about computers or processing costs. For example, when the user asked that a text be concorded, the system responded with a request for a processing priority: immediately (and therefore very expensive), overnight (and therefore cheaper), or convenience (and therefore very cheap). LEXICO also had an extensive set of on-line aids, including "Explain error", "Explain question", "Example", "Help", "Menu", "Cost", and "Comment". The "Cost" request resulted in a display of the costs incurred so far in the current on-line session. Some of the other requests had multi-level responses: the first request resulted in a brief summary response, the second in a more detailed response, and the third in a patient, step-by-step explanation.

Although large disk files were available on the Univac 1110, LEXICO was built around sequential files. On-line storage was far too expensive to allow the entire database to be on-line at the same time. Texts were grouped into collections and stored on magnetic tape. Running a job required communication with the operator to locate and mount tapes, a process which for batched jobs was never error free. Nevertheless, LEXICO proved a generally reliable and easy-to-learn system, although it is difficult to evaluate precisely its utility since expert assistance was continually available. Technically, the concording scheme in LEXICO, using hash coding techniques originally developed in the 1950s and 1960s (Morris, 1968), was probably its most efficiently designed component.
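As a rough illustration of what hash coding contributes to concording, the sketch below stores each spelling in a fixed-size scatter table with linear probing and records the positions at which it occurs. It is not LEXICO's Fortran implementation; the class name, the table size, the hash function, and the sample tokens are all assumptions made for the example.

```python
class ScatterTable:
    """Fixed-size open-addressing hash table with linear probing (illustrative)."""

    def __init__(self, size=101):
        self.size = size
        self.keys = [None] * size
        self.values = [None] * size

    def _slot(self, key):
        """Return the slot holding key, or the first empty slot on its probe path."""
        h = 0
        for ch in key:                      # simple polynomial string hash
            h = (h * 31 + ord(ch)) % self.size
        for step in range(self.size):
            i = (h + step) % self.size
            if self.keys[i] is None or self.keys[i] == key:
                return i
        raise RuntimeError("table full")

    def add_occurrence(self, key, position):
        """Record that spelling `key` occurs at token index `position`."""
        i = self._slot(key)
        if self.keys[i] is None:
            self.keys[i] = key
            self.values[i] = []
        self.values[i].append(position)

    def lookup(self, key):
        """Return all recorded positions for `key` (empty list if absent)."""
        i = self._slot(key)
        return self.values[i] if self.keys[i] == key else []


# Toy token stream, not a DOE citation.
tokens = "se cyning and se biscop".split()
table = ScatterTable()
for pos, tok in enumerate(tokens):
    table.add_occurrence(tok, pos)
```

Looking up a spelling costs a hash computation plus a short probe, rather than a sort or a scan of the word list, which is the property that mattered when CPU and I/O were billed separately.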

From a historical perspective, it is worth reflecting on the amount of time that was invested in the early 1970s in natural language processing due to the limitations of the 6-bit codes that were used then. Capital letters, requiring at least two characters each, caused minor irritations in searching and sorting. The 1970s operating systems lacked standard functions for translating characters, comparing files, and searching for character strings as they exist today in systems like Unix (e.g., tr, comm, grep, awk, lex). Most natural language software projects had to develop their own primitives for these operations (e.g., Kay, 1967; Bratley, Lusignan, & Ouellette, 1974). Considerable time was often spent, perhaps unwisely, in minimizing bits and bytes to reduce I/O and CPU costs.

The DOE database consists of approximately 25 million characters, including storage overhead, distributed across 3,000+ texts (or text-like units). Within LEXICO, texts were grouped in collections, generally on the basis of genre and proximity within the DOE text listing (Cameron, 1973). Since editing, concording, and lemmatizing all required specification of the encoding scheme of the text, along with parameters unique to each task, LEXICO was designed with an inheritance property across collection, text, and task. System defaults were assigned to every collection but could be overridden at any level by user-specified values. Texts inherited parameter values from collections and tasks inherited parameter values from the texts they applied to. For example, square brackets might be the collection default for delimiting citation references; several texts in the collection might, however, use angle brackets for these delimiters. When such texts were defined as having angle brackets, this definition overrode the collection default.
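The inheritance of parameters described above can be sketched with layered lookups. This is a minimal illustration in Python, not LEXICO's design: the parameter names `citation_delimiters` and `encoding`, and the function `resolve`, are hypothetical, and only the square-bracket versus angle-bracket example comes from the text.

```python
from collections import ChainMap

# Hypothetical parameter names; the article gives only the bracket example.
SYSTEM_DEFAULTS = {"citation_delimiters": "[]", "encoding": "default"}


def resolve(collection=None, text=None, task=None):
    """Look up parameters task-first, then text, collection, system defaults."""
    layers = [task or {}, text or {}, collection or {}, SYSTEM_DEFAULTS]
    return ChainMap(*layers)


collection = {"citation_delimiters": "[]"}       # collection-level default
text_override = {"citation_delimiters": "<>"}    # one text uses angle brackets

plain_text = resolve(collection=collection)
angle_text = resolve(collection=collection, text=text_override)
```

A value set at a narrower level shadows the wider one, and anything left unset falls through to the collection and then to the system defaults, so control statements need not be repeated for every text or task.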

Besides its storage, editing, and concording facilities, LEXICO also had an elaborate lemmatizing system. Respelling rules and base-type rules could be specified as collection parameters. During lemmatizing, respelling rules were applied to each word in the word list generated by a concordance. Then base-type rules were applied, converting text forms to base forms. Finally, in the Cleanup phase, unmatched text forms (types) could be assigned to base forms, undesirable types (e.g., numbers) eliminated, homographs separated, and incorrectly assigned words changed. Then, a base concordance could be formed and output either on slips (4" × 6"), a full-page listing, or a magnetic tape.
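The order of operations in lemmatizing (respelling rules, then base-type rules, then cleanup of whatever remains) might be sketched as follows. The rule tables and function names are invented for illustration, since the article does not give LEXICO's rule syntax, and the removal of numbers is done inline here for brevity although the text places it in the later Cleanup phase.

```python
import re

# Hypothetical rules; LEXICO's actual rule notation is not given in the article.
RESPELLING_RULES = [(r"th", "þ"), (r"uu", "w")]
BASE_TYPE_RULES = {"cyninges": "cyning", "cyninge": "cyning", "cyning": "cyning"}


def respell(word):
    """Apply each respelling rule in order to normalize a text form."""
    for pattern, replacement in RESPELLING_RULES:
        word = re.sub(pattern, replacement, word)
    return word


def lemmatize(word_list):
    """Return (assigned, unmatched) after respelling and base-type lookup."""
    assigned, unmatched = {}, []
    for word in word_list:
        normalized = respell(word)
        if normalized.isdigit():
            continue  # undesirable type (a number): dropped
        base = BASE_TYPE_RULES.get(normalized)
        if base is None:
            unmatched.append(normalized)  # left for manual assignment in cleanup
        else:
            assigned[word] = base
    return assigned, unmatched


assigned, unmatched = lemmatize(["cyninges", "cyninge", "42", "thrym"])
```

The unmatched list is what an editor would work through by hand, assigning each remaining type to a base form or separating homographs.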

The entire DOE collection was made available in LEXICO format to interested scholars and distributed in North America through the DOE office in Toronto and elsewhere through the Oxford Text Archive. The system achieved its goals of providing functions needed for this stage in the Dictionary's development, at a cost that could be tolerated. From the collection tapes two microfiche concordances to the entire collection have been generated: one for the bulk of the corpus, minus the most frequent words, and one with these words alone (Venezky & Healey, 1980; Venezky & Butler, 1985). Both concordances were generated by reconcording all of the OE texts and then merging the separate concordance outputs. This latter process, carried out at the University of Delaware on a Burroughs 7800, used ALGOL programs and a standard sort package. The first microfiche concordance was supported by the National Endowment for the Humanities, supplemented with institutional cost-sharing by the University of Delaware and generous amounts of time and labor by the DOE staff. The second was supported primarily by the DOE itself, with contributions of computer time by the University of Delaware. These concordances have proven invaluable to scholars throughout the world and have been far more in demand than the tape archives.

Specifications for an Editing System

As the concording and lemmatizing system described above was nearing completion, studies were begun on the hardware and software needed for the editing phase of the dictionary. The primary requirements for editing were the ability to display full dictionary pages in a typographical form similar to what would be printed, to retrieve citations from the text corpus in real time, and to enter and edit dictionary entries. Added later were other requirements, such as lemmatizing slips on-line and interactive assignment of slips to sense categories.

These criteria were developed at a time when microcomputer kits were just becoming popular, but before good benchmark data were generally available for gauging throughput, processing speeds, etc. At this time Dijkstra's A Discipline of Programming (Dijkstra, 1976) had just been published. The Ethernet protocol had been described (Metcalfe & Boggs, 1976), but as with personal workstations, performance data were quite limited. In general, we (and others) tended to underestimate the horsepower required in a system that could process large files in real time, maintain communications over a packet-switching network, and drive bit-mapped displays with complex typographic configurations.

The system finally selected is a network of five Xerox 1108 (Dandelion) workstations, with a large file server, a print server, and a communications server. The text corpus resides on-line, along with other large databases referenced by the editors, including a complete concordance to the texts, the short-title list, the headword list, the frequency list, and an index to OE word studies (Cameron, Kingsmill, & Amos, 1983). The basic configuration before the donation by Xerox of four workstations and a file server is described by Healey (1985).

Although production work has continued on the new configuration over the past year and a half, we are only now completing one of the main tools that we designed for editing, the lexicographer's desktop. This system will duplicate visually and functionally the work space that lexicographers normally adopt when working with books, card slips, paper and pencil. Icons will represent the various slip piles and reference works that are essential to DOE editing: headword list, concordance, texts, frequency list, bibliography of word studies, and short-title list. Data from any of these can be browsed by an appropriate mouse-clicking protocol. Menu selections will allow scrolling, selection and transfer of data across windows, and all of the other editing conveniences popularized by the Macintosh. Since the keyboard on the Dandelion workstation is mapped by software functions into characters, we can switch easily among different character sets: Old English, modern English, Latin, etc., and can mark diacritics for any of them.

When a session with the lexicographer's desktop is initiated, windows for the various data types appear on the screen. Typically, an editor enters a headword, representing an entry in progress. All of the attested spellings for that headword are then displayed in the spellings window. If the editor selects a particular spelling (i.e., moves the cursor to the spelling and presses a button on the mouse), the first slip for that spelling appears in the slips window. The editor might browse through further slips (which are ordered by genre and text), select portions of a citation and copy them to the entry window, or assign citations to semantic categories. This latter function is achieved by clicking on the target category in the schema window, or by entering a new category in the schema and then clicking on it. Other editorial functions are achieved similarly.

The AI orientation of the Dandelion offers many advantages for the design of user interfaces, but it also presents challenges in the management of overlapping windows and other screen aesthetics. The current database management scheme is quite slow, although we have yet to determine where the major delays originate. Nevertheless, the what-you-see-is-what-you-get facility is a quantum leap from the alphanumeric displays of the 1970s generation of computers.

A Portable Editor

Since the number of Anglo-Saxon scholars throughout the world is relatively small and the Dictionary has always intended to call upon all of the talent that might be willing and able to assist, particularly in writing entries, a project was initiated several years ago to design and implement a portable editing machine, that is, a microcomputer version of the lexicographer's desktop described above. Our goal was to have a microcomputer that could be shipped with ease anywhere in the world where mail was delivered safely. "Foreign" editors would receive the machine, editing software, and a file of slips for each entry they volunteered to do. The actual system, which is now being field-tested, has been implemented on a Zenith Z-286 workstation, which is compatible with the IBM PC/AT. The SCO Operating System (a Xenix system, generally compatible with Berkeley 4.2 Unix) is used through a special arrangement with SCO, Inc. Startup funds for this project were provided by the Connaught Fund of the University of Toronto.

The portable editor has two major capabilities: schema development and entry composition. For schema development an editor may receive on diskettes as many as 100,000 slip images for a particular headword. The portable editor allows sense categories to be entered and edited and slips to be assigned to senses. In general, editors will start with a small number of sense categories that they will assign slips to, and then begin to subdivide categories until a full schema is developed. Thus, slips assigned to a superordinate category will be reassigned to subordinate categories and the process repeated as many times as necessary for the full definition to be developed. Using a split screen showing the sense categories on the left and the slips on the right, an editor can assign slips rapidly to categories through a "browse" mode.

Also provided by the system is a search facility, which allows retrieval of all slips containing a designated affix, word or phrase (i.e., collocation). Such searches can be constrained by position relative to the headword, and can also yield a listing of all words which occur in any designated relationship to the headword, with their frequencies of occurrence. Thus, it is possible to obtain a list of all words that occur just before or just after the keyword, sorted by frequency of occurrence. An editor can then browse all the slips for any of these words and assign them to sense categories.
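A positional search of the kind just described (all words immediately before or after the headword, ranked by frequency) can be sketched in a few lines. The function name, the `offset` parameter, and the toy slips below are assumptions for illustration; they are not taken from the portable editor or from DOE citations.

```python
from collections import Counter


def neighbors(slips, headword, offset):
    """Count words at a fixed offset from the headword (-1 before, +1 after)."""
    counts = Counter()
    for slip in slips:
        words = slip.lower().split()
        for i, w in enumerate(words):
            j = i + offset
            if w == headword and 0 <= j < len(words):
                counts[words[j]] += 1
    return counts.most_common()  # sorted by descending frequency


# Toy slips (citation contexts) for the headword "cyning"; not real citations.
slips = [
    "se goda cyning com",
    "se cyning com ham",
    "thaes cyning waes god",
]
before = neighbors(slips, "cyning", -1)
after = neighbors(slips, "cyning", +1)
```

Here `after` ranks "com" first because it follows the headword in two of the three slips; an editor could then pull up just those slips and assign them to a sense category.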

As slips are read and assigned to categories, an editor can add notes for future reference and markers that can be objects of special searches. For example, slips that an editor wants to cite in the dictionary entry can be marked as such. In the portable editor search mode, it is possible to locate all slips that have been marked for citing. The composition facilities are based on 14 fields of data that compose the standard DOE entry structure. An editor can copy a schema into one of these fields and request insertion of designated slips as exemplars of sense categories. Editing is currently done with vi, a standard Unix editing system, but a structured editor that has been designed will be implemented in the next version of the system.

Implementation Techniques

The main challenge of the portable editor was to develop a database structuring technique that was both manageable for a 20MB hard disk and not excessively slow. The various search capabilities can generate monstrously large files which if not pruned periodically could eventually deadlock the system. After consideration of storage constraints and programming ease, we decided to operate with a single version of the slips, contained in a master file that was indexed by position through a master index of pointers. Each sense file and each search-generated file is represented by a bit map in which each bit position stands for the slip in that position in the master file. With this scheme, a file of 100,000 slips, which is the largest we anticipate, requires 12.5K for each derived file. (This limit does not hold, of course, for storing the full contents of each slip, i.e., the master file, and its index.)

Storage of 75 derived files (e.g., sense types, searches) requires less than 1MB for a file of 100,000 slips with this scheme, and 75 derived files is beyond what we anticipate, even for a complicated entry. With data compression, even less space would be required, but processing time would be increased. For backing up the system, we dump the schema definition and the bit maps for all categories in the schema and all search files. Restoring after system failure requires only the reloading of the original master file plus the most recent backup.
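The bit-map scheme and its storage arithmetic can be checked with a short sketch. The class and method names are hypothetical and the code is in Python rather than whatever language the portable editor was written in; only the figures (100,000 slips, 75 derived files) come from the text.

```python
class SlipBitmap:
    """One derived file: bit i is set when slip i of the master file belongs."""

    def __init__(self, n_slips):
        self.n_slips = n_slips
        self.bits = bytearray((n_slips + 7) // 8)  # one bit per slip

    def add(self, slip_no):
        self.bits[slip_no // 8] |= 1 << (slip_no % 8)

    def remove(self, slip_no):
        self.bits[slip_no // 8] &= ~(1 << (slip_no % 8)) & 0xFF

    def __contains__(self, slip_no):
        return bool(self.bits[slip_no // 8] & (1 << (slip_no % 8)))

    def members(self):
        """List the master-file positions of all slips in this derived file."""
        return [i for i in range(self.n_slips) if i in self]


N_SLIPS = 100_000
sense_a = SlipBitmap(N_SLIPS)
for slip_no in (0, 17, 99_999):
    sense_a.add(slip_no)

bytes_per_file = len(sense_a.bits)         # 100,000 bits / 8 = 12,500 bytes
bytes_for_75_files = 75 * bytes_per_file   # 937,500 bytes, under 1MB
```

Because each derived file is only a fixed-size array of bits over the master file, reassigning a slip from a superordinate to a subordinate category is a matter of clearing one bit and setting another, and a backup need only write out these small arrays alongside the schema definition.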

Although the current portable editor is based on the assumption of separate slip files generated from the DOE master files in Toronto, we are also working with Reteaco Inc. in Toronto to develop a CD-ROM version of the entire DOE text corpus. With the Reteaco retrieval scheme, which utilizes inverted indexes, an editor can not only generate in real time the entire set of slips for any group of text spellings, but can also access any amount of the context for a citation. Thus, with the addition of a CD-ROM player, we can give to each editor (foreign or local) the entire DOE corpus on-line, including the frequency lists, word studies, and headword lists. We have not yet worked out the full set of procedures that we will use with this system, but the power that it brings to editors anywhere in the world is far beyond what we imagined, even two years ago.
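How an inverted index lets an editor generate slips for a group of spellings, with as much or as little context as wanted, can be sketched generically. This is not the Reteaco scheme, whose details the article does not give; the function names, the text identifiers "T1" and "T2", and the toy corpus are assumptions.

```python
from collections import defaultdict


def build_inverted_index(corpus):
    """corpus: {text_id: text}. Returns spelling -> list of (text_id, position)."""
    index = defaultdict(list)
    for text_id, text in corpus.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].append((text_id, pos))
    return index


def slips_for(corpus, index, spellings, context=2):
    """Generate a slip (citation plus context) for each hit of each spelling."""
    slips = []
    for spelling in spellings:
        for text_id, pos in index.get(spelling, []):
            words = corpus[text_id].split()
            lo, hi = max(0, pos - context), pos + context + 1
            slips.append((text_id, pos, " ".join(words[lo:hi])))
    return slips


# Toy corpus with two hypothetical text identifiers; not DOE texts.
corpus = {
    "T1": "se cyning com ham",
    "T2": "tha com se kyning eft",
}
index = build_inverted_index(corpus)
narrow = slips_for(corpus, index, ["cyning", "kyning"], context=1)
wide = slips_for(corpus, index, ["cyning"], context=10)
```

The index is built once, which suits a read-only medium, and the width of context is chosen at retrieval time rather than being fixed when the slips are generated.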

Looking Back, Looking Forward

In 17 years we have created a database for one machine, a Univac 1108, and then moved it across a variety of other systems without major pain. We are finally ready to condense this database, and a number of associated ones, onto a storage medium that will be accessible from any future system without code conversions. Our editors learned a simple, command-based language for LEXICO, and then moved without undue suffering to the menu-driven Interlisp-D interface on the Dandelion and even to a less than beautiful menu system on the Zenith Z-286 microcomputer. Editors who cut their baby teeth on shoe boxes full of printed slips tend not to react violently to pull-down menus and function key selections, as long as the selection procedures are rational, unambiguous, and reasonably consistent.

Our success in moving the database across machines was due to an initial decision to code everything that might be needed for the finished entries, but to code simply and flexibly, eschewing machine dependence whenever possible. This was a critical decision and one which every similar project should consider making. Expediency is tempting, especially in the face of staggering amounts of data and uncertain funding. However, in the world of rapidly changing technology, the race is not to the swiftest but to the most adaptable.

On the management of large databases over longer periods of time, our experiences may not be applicable to those dictionary projects that have truly large databases. Our corpus is relatively small, especially compared to the OED, and we have limited concern about future additions. Nevertheless, error checking and process verification remained central concerns across all of our systems and apply equally to databases of other magnitudes. Most of our systems were developed especially for the DOE and were implemented in quasi-professional environments. As talented as our various staffs were (and still are), the potential always exists for subtle, destructive side effects.

In retrospect, we might have worried less about saving bits and bytes in 1970 and spent more time on error checking and transportability, but the culture of the time demanded otherwise. The realities of funding for scholarly dictionaries have given us limited leisure for exploring new approaches to lexicography and testing more intelligent algorithms for searching and composing. Nevertheless, the lexicographer's desktop and the portable editor are major advancements for entry composition and from them we hope to design even more powerful systems in the future. With the database on a CD-ROM, we will be giving to editors anywhere in the world, for less than $10,000 each, what $3 million would not have allowed when we started.

Lexicographers in the future will work at even more powerful workstations than those we now employ, with immediate, on-line access to slip files, competitive and retrospective dictionary entries, semantic classification schemes, encyclopedias, and other reference works, all of which are now appearing on CD-ROM. With hypertext concepts to facilitate linkages across information sources and smart software to assist in searching and analyzing, editors will find their work loads focusing more on conceptualizing and organizing and less and less on gathering and sifting. Perhaps dictionaries will also change to take advantage of this increased linkage ability, with the alphabetized list relegated to the entry location mechanism. Entries could then be constructed around semantic concepts, showing interrelationships of terms, and auxiliary information (exemplars, location of term within a semantic field, etc.) could be readily available.

REFERENCES

Bratley, Paul, Serge Lusignan, and Francine Ouellette. "JEUDEMO: A Text-Handling System". In Computers in the Humanities. Ed. J. L. Mitchell. Minneapolis: University of Minnesota Press, 1974, pp. 234--49.

Busa, R. "An Inventory of Fifteen Million Words". In Literary Data Processing Conference. Ed. Jess Bessinger, et al. New York: IBM Corp, 1964, pp. 64--78.

Cameron, Angus. "A List of Old English Texts". In A Plan for the Dictionary of Old English. Ed. Roberta Frank and Angus Cameron. Toronto: University of Toronto Press, 1973, pp. 25--306.

Cameron, Angus, Roberta Frank, and John Leyerle, eds. Computers and Old English Concordances. Toronto: University of Toronto Press, 1970.

Cameron, Angus, Allison Kingsmill, and Ashley Crandell Amos, comps. Old English Word Studies: A Preliminary Author and Word Index. Toronto Old English Series 8. Toronto: University of Toronto Press, 1983.

Dahl, Ole Johan, Edsger W. Dijkstra, and C. A. R. Hoare. Structured Programming. New York: Academic Press, 1972.

Dijkstra, Edsger W. A Discipline of Programming. Englewood Cliffs, N.J.: Prentice-Hall, 1976.

Frank, Roberta, and Angus Cameron, eds. A Plan for the Dictionary of Old English. Toronto: University of Toronto Press, 1973.

Healey, Antonette diPaolo. "The Dictionary of Old English and the Final Design of its Computer System". Computers and the Humanities, 19 (1985), 245--49.

Kay, Martin. Standards for Encoding Linguistic Data. Report P-3575. The RAND Corporation, Santa Monica, CA, 1967.

Ritchie, Dennis M., and Ken Thompson. "The UNIX Time-Sharing System". The Bell System Technical Journal, 57, No. 6, Part 2 (1978), 1905--29.

Venezky, Richard L. BIBCON: An 1108 Program for Producing Concordances to Prose, Poetry and Bibliographic References. Computer Science Technical Report # 113. University of Wisconsin, Madison, WI, 1971.

Venezky, Richard L. "Computational Aids to Dictionary Compilation". In A Plan for the Dictionary of Old English. Ed. Roberta Frank and Angus Cameron. Toronto: University of Toronto Press, 1973, pp. 307--27.

Venezky, Richard L., and Sharon Butler. A Microfiche Concordance to Old English: The High Frequency Words. Publications of the Dictionary of Old English 2. Toronto: Pontifical Institute of Medieval Studies, 1985.

Venezky, Richard L., and Antonette diPaolo Healey. A Microfiche Concordance to Old English. Publications of the Dictionary of Old English 1. Toronto: Pontifical Institute of Medieval Studies, 1980.

Venezky, Richard L., Nathan N. Relles, and Lynne A. Price. "Man-Machine Integration in a Lexical Processing System". Cahiers de Lexicologie, 30 (1977), 17--49.

Walker, Donald E. "Safari: An On-Line Text Processing System". Proceedings of the American Documentation Institute, 4 (1967), 144--47.

Wirth, Niklaus. "The Programming Language PASCAL". Acta Informatica, 1 (1971), 35--63.
