
An Electronic Edition of Don Quixote for Humanities Scholars

Shueh-Cheng Hu* — Richard Furuta* — Eduardo Urbina**

* Center for the Study of Digital Libraries and Department of Computer Science, Texas A&M University

** Department of Modern & Classical Languages, Texas A&M University

ABSTRACT. A number of steps are required to create comprehensive, flexible electronic editions of ancient documents, especially those whose visual quality has degraded and those available in multiple versions. Besides acquisition and display of materials, support is required for interrelating variants, and interfaces are needed to allow deriving and justifying new editions. Furthermore, lower-level mechanisms are needed to allow the maintenance of discovered and specified relationships. Although solutions can be found in other applications for handling some portions of the specific steps, there are still challenges in meeting the steps' new requirements and in integrating the individual steps seamlessly. This paper presents an overview of a work in progress to create an integrated system for processing printed documents with degraded visual quality and multi-variant contents. Additionally, the implementation of the three modules that enable creation of new editions, provide underlying database support, and customize hypertext documents for readers is described in detail.


KEYWORDS: Multi-variant documents, variorum edition, user interface, data entities, hypertext


Document numérique. Volume 3, no. 1-2/1999, pages 1 to X


1. INTRODUCTION

For several years, we have been building a digital archive based around the works of Miguel de Cervantes Saavedra (1547–1616), the author of Don Quixote de la Mancha. Don Quixote, often recognized as the first modern novel, illustrates many of the issues involved in the creation of electronic editions. No manuscript of the work has survived, only a number of editions published during Cervantes' lifetime. The available images of these editions are duplicated from archival microfilms and are of typical quality for many rare and important documents received from distant libraries; while readable, they reflect severe compromises in image quality. Unfortunately, better copies are unlikely to become available, whether because of administrative or budgetary restrictions or because of the age and rarity of the original texts; the holders of rare printed texts usually are unwilling to subject them to additional scanning or copying.

Figure 1. Textual image of a page from the 1605 princeps edition of Don Quixote

Figure 1 shows the image of one page, converted from a 35mm microfilm, of the 1605 princeps edition of Don Quixote. Obviously, the quality of this facsimile image presents significant challenges for content acquisition as well as presentation. As Figure 1 shows, unacceptable visual quality results from background noise, shadows along the spine and edges, distortions introduced into the image in copying, written annotations, and library stamps. Further challenges to automatic processing are caused by the unusual letter forms used in classical Spanish and by typesetting conventions no longer used today. The effect of document image quality on the accuracy of optical character recognition (OCR) is significant, especially when the image quality degrades below specific thresholds [GAR 98, CAS 90]. A poor recognition rate increases the acquisition cost dramatically because of the heavy manual post-processing work it requires, such as proofreading and correction.

As a separate factor, important texts in the humanities often have been preserved in multiple early versions. Even when the author's original manuscripts are no longer available, insights into the work can be gained by comparing different impressions of the first edition and editions prepared while the author was still alive. For example, there are eleven significant versions¹ of Don Quixote de la Mancha, including the princeps (the earliest printed edition, published in Madrid in 1605). Many variances (discrepant text segments in different versions that derive from an identical original counterpart) exist among the different copies of the same edition.² Choosing one variance over another is the job of a Cervantes scholar, and different scholars may, of course, make different choices based on their interpretation of the raw material. Our acquisition process supports the specification of such choices, allowing each choice to be categorized and associated with a justification.

Our project team, composed of researchers from the areas of Computer Science and Spanish Literature, is seeking to produce a unified, electronic variorum edition of Don Quixote. The virtual variorum edition will contain multiple copies of the 11 significant early editions of Don Quixote; annotation of the variances present among the editions to allow their comparison; derivative editions, generated as the result of scholarly analysis of the variances and bearing supporting reasoning; and scholarly commentary by experts. The reader of a virtual variorum edition will be able to customize the text presentation, perhaps selecting different interpretations for different applications, as well as annotate the results. Furthermore, all components in the virtual variorum edition will be interlinked, allowing easy traversal among the representations. To provide raw materials for the virtual variorum edition, we have obtained microfilmed copies of the text from multiple collections (the Spanish National Library, the Hispanic Society of America, the British Museum, Yale University, Harvard University, and others still in progress), which we are converting to digital images.

1. These editions are, for volume 1: Madrid 1605 (princeps); Madrid 1605, 2nd ed.; Valencia 1605; Brussels 1607; Madrid 1608, 3rd ed.; Madrid 1637 (combined edition); Madrid 1647 (combined edition). For volume 2, they are: Madrid 1615 (princeps); Brussels 1616; Madrid 1637 (combined edition); Madrid 1647 (combined edition).

2. According to an analysis done manually, there are 20 significant variances in chapter one alone, excluding variances in the use of punctuation marks. There are a total of 126 chapters in Don Quixote.


[Figure 2 diagram. Processing steps: digitization, image enhancement, contents recognition, editing by scholars, Web-based publication. Major data entities: papers or microfilms, digital textual images, cleaned textual images, multiple plain texts, edited texts, hypertext documents.]

Figure 2. General process of digitizing degraded printed texts with multi-variant contents

In this paper, we report on initial experiences based on work with the early editions of Don Quixote de la Mancha. These experiences have led to the development of an architecture that allows economical conversion to digital form of ancient documents with degraded visual quality and multi-variant contents.

2. PREVIOUS WORK AND PROPOSED SOLUTIONS

Some effort has been devoted to research on the acquisition, preparation, and presentation of digital library material [BRE 96, SEA 97, LEC 98]. Issues about how to deal with texts with degraded image quality and multi-variant contents also have been discussed [GLA 98, BLA 95]. To offer better solutions for processing printed documents with degraded visual quality and/or multi-variant contents, methods presented in the prior work can be extended. In addition, new methods need to be developed to meet new requirements. An important requirement is to retain linkage among the related data entities that are generated during different processing steps. For instance, variances are identified during a collation phase and editing justifications are specified during a later editing phase. A relationship exists between a variance and the editing justifications that resulted from its analysis. It is important, therefore, to retain the correct relationship among data entities that might be modified or removed after generation. For instance, an update to a variance should lead to a corresponding review of the related editing justifications, or an inconsistency might be incurred. If there is no mechanism for retaining relationships among data entities, it will be difficult to maintain the consistency of related data entities and to present the inter-related data entities to readers correctly. To overcome this drawback, appropriate integration among the individual steps is necessary: data entities, as well as the relationships among them, must be retained and controlled in the integrated environment.

3. OVERVIEW OF THE PROPOSED SYSTEM

[Figure 3 diagram. Numbered modules: (1) textual image preprocessor; (2) contents recognition and manual correction; (3) the MVED, editor for multi-variant documents; (4) HDEMS, hypermedia-based data entity management system; (5) customized hypertext generator (CGI program). Other labels: microfilms of source texts; source texts and enhanced textual images; editing works (corrections, commentary, etc.); editor; reader; reader's request; unified text in HTML; Web browser; the Internet.]

Figure 3. Architecture of the system for digitizing degraded printed texts with multi-variant contents

Although there might be differences among the procedures for handling various texts, a procedure for digitizing printed texts with poor visual quality and multi-variant contents can be generalized, as shown in Figure 2. To perform digitization work that follows this procedure, a system, illustrated in Figure 3, is proposed. A number of functional modules in the system are necessary in the procedure; they are described as follows. (1) After digitization, the document image enhancement module reconstructs raw images to obtain sufficient quality for Web presentation. Additionally, reconstructed images improve the accuracy of the later contents recognition process. Reconstructed document images also are important for later editing work because scholars need them to verify correct conversion from document images to the corresponding plain texts and to obtain additional clues that are not present in the plain text form. (2) To cope with the degraded visual quality of available facsimile images, a custom recognition module is required. The recognition module is expected to be more tolerant of noise in document images by making use of known features of the printed text, such as the lexicon and other higher-level linguistic information in the processed text. (3) An editing tool (the MVED) is under development to support the simultaneous editing of texts with multiple versions, which is not supported by general editing tools. The editing tool needs to facilitate the work of locating, analyzing, and classifying variances among different versions, and of comparing, contrasting, and linking versions that could be in different formats (either facsimile image or plain text). (4) A Hypermedia-based Data Entity Management System (HDEMS) is required for controlling access to data entities, recording derivation relationships among data entities, and keeping related data entities in a consistent state. In this paper, data entities are called related if there exist relationships among them.


[Figure 4 diagram. For each edition (edition_1, edition_2, edition_3, ..., edition_N): microfilms, raw textual image, cleaned textual image, full page in ASCII, text segment. Across editions: variance list, extracted text paragraphs, editing works.]

Figure 4. Flow chart of data entities during the process of digitizing printed texts

Inter-relationships among data entities are created for two major reasons: format mutation and contents derivation. As Figure 4 shows, a large volume of various kinds of data entities is generated by different functional modules during the process of digitizing degraded printed texts with multi-variant contents. Thus, a framework like the HDEMS is necessary to prevent inconsistencies among related data entities during the digitization process. (5) A module for customizing hypertext documents is necessary to serve readers with various backgrounds and different interests. Through adaptable documents, readers can view different levels of material, including editing rationale, variances among multiple versions, and original facsimile images, besides the edited texts.

Our early activities have focused on modules (3), (4), and (5). These modules are described in the following three sections.


4. THE MULTI-VARIANT CONTENTS EDITOR (MVED)

Scholars need to prepare various editions based on their analysis of the raw material, interpreted through their knowledge and perceptions. Supporting the editing of documents with multiple versions imposes requirements that cannot be met by current text editing tools. These requirements include the capability of locating, analyzing, and classifying variances among different versions, and of comparing and contrasting versions that could be in different formats (either facsimile image or plain text).

4.1. Structural Overview

To fulfill these requirements, the major components of the MVED include: (1) a collator, for comparing multiple versions of the same text to locate all textual variances among versions. The outcome of a collation session includes the offset and length of all variances found during the session. The offset and length data are used by the text synchronization mechanism to facilitate the analysis of textual variances; (2) a mechanism for text and image synchronization, which facilitates comparing and contrasting multiple texts and their raw images. Since analyzing the variances in those texts and making editing decisions require repeated comparison and contrast among the multiple sources, an efficient mechanism for performing intensive comparison and contrast work among multiple text/image sources is important; (3) a variance classification mechanism, which allows scholars to classify variances interactively, based on their knowledge and analysis of the variances found by the collator.
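The collator's output, an offset and a length for every variance relative to a pre-selected base text, can be illustrated with a minimal sketch. It assumes word-level comparison and uses Python's standard `difflib.SequenceMatcher` as a stand-in for whatever comparison algorithm the MVED actually uses; the function names and the dictionary layout are hypothetical, not taken from the MVED.

```python
import re
from difflib import SequenceMatcher


def tokenize(text):
    """Split text into words, remembering each word's character offset."""
    return [(m.group(), m.start()) for m in re.finditer(r"\S+", text)]


def char_span(tokens, lo, hi, text_length):
    """Return (offset, length) in characters covered by tokens[lo:hi]."""
    if lo == hi:
        # Nothing on this side (an insertion or omission): zero-length span.
        offset = tokens[lo][1] if lo < len(tokens) else text_length
        return offset, 0
    start = tokens[lo][1]
    last_word, last_start = tokens[hi - 1]
    return start, last_start + len(last_word) - start


def collate(base, other):
    """List the variances between a base text and one other version."""
    base_tokens, other_tokens = tokenize(base), tokenize(other)
    matcher = SequenceMatcher(
        None,
        [word for word, _ in base_tokens],
        [word for word, _ in other_tokens],
        autojunk=False,
    )
    variances = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        b_off, b_len = char_span(base_tokens, i1, i2, len(base))
        o_off, o_len = char_span(other_tokens, j1, j2, len(other))
        variances.append({
            "base": (b_off, b_len),      # offset and length in the base text
            "other": (o_off, o_len),     # offset and length in the other text
            "readings": (base[b_off:b_off + b_len],
                         other[o_off:o_off + o_len]),
        })
    return variances


def collate_all(base, others):
    """Compare every version in `others` (name -> text) to the base text."""
    return {name: collate(base, text) for name, text in others.items()}
```

Because each variance carries offsets into both texts, a viewer can jump to the same variance in every collated version, which is the information the synchronization mechanism needs.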

4.2. Interactions with the MVED

Figure 5. Full variance list in the MVED

The MVED is implemented in Java and is deployed on Windows NT. The MVED can find all textual variances in a collection of up to 32 texts with multi-variant contents by comparing each to a pre-selected base text. After collation, editors can select a specific set of variant contents in all collated texts with a mouse click. When such a selection is made, both the plain text and facsimile representations are synchronized on the display. With the support of the text/image synchronization mechanism, editors are able to focus on comparing and contrasting the contents of documents with minimum cognitive overhead.

Figure 5 shows the MVED's main window, displaying a list of variances found in a collation session in which three different texts are analyzed. The three texts correspond to chapter one in three versions of Don Quixote, denoted as Madrid1605, Madrid1608, and Madrid1637, respectively. Figure 6 shows an alternative, compact form of the same variance list. In either the full or the compact variance list, clicking any item among a set of variances will enable text synchronization; i.e., the items of the same set of variances in the other texts will be located at the same time. As seen in Figure 7, a document viewer shows a segment of text, with the variance highlighted. If desired, a facsimile image, also synchronized, can be displayed. Figure 8 shows three synchronized viewers on a particular variance: concluìan, concluîan, and concluian in this example. An editor also may select any region of text in any of the viewers and synchronize the other viewers with it.

Since facsimile images are the sources of all data entities generated during the digitizing process, access to those images is important during editing. As Figure 7 shows, there is a scrollable image display area within each document viewer window. The synchronization mechanism between a text and its facsimile image counterpart allows users to locate corresponding regions. In addition to this text-to-image synchronization, clicking on facsimile images and the choice list located in the top row of each document viewer window allow users to perform image-to-text synchronization.

Figure 6. Compact variance list in the MVED

Whenever an editor decides to make a correction and/or commentary on a particular set of variances, pressing the Edit Var button leads to an editing dialog, as Figure 9 shows. Through the editing dialog, the editor is able to select the correct contents or make a correction directly, classify the edited variance into one of the five pre-determined categories³, give commentary about the editing transaction, and cite supporting material. Pressing the Save button in the editing dialog will activate the HDEMS to store the contents of the editing transaction in the underlying database.
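What one editing transaction might carry can be sketched as a small record type. The five category names come from the paper's footnote; the class and field names are assumptions made for illustration and are not the MVED's actual data model.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List


class Category(Enum):
    """The five pre-determined variance categories named in the paper."""
    PRINTING_ERROR = "printing error"
    TYPOGRAPHICAL_ERROR = "typographical error"
    SPELLING_VARIANT = "spelling variant"
    SUBSTANTIAL_CERTAIN = "substantial-certain"
    SUBSTANTIAL_UNCERTAIN = "substantial-uncertain"


@dataclass
class EditingTransaction:
    """One editor's decision about one variance (field names hypothetical)."""
    variance_id: int
    editor: str
    chosen_reading: str          # a selected variant or a direct correction
    category: Category
    commentary: str = ""
    citations: List[str] = field(default_factory=list)

    def __post_init__(self):
        # Accept the category as plain text; unknown names raise ValueError.
        if not isinstance(self.category, Category):
            self.category = Category(self.category)
        if not self.chosen_reading:
            raise ValueError("an editing transaction needs a chosen reading")
```

Restricting the category to an enumeration keeps every stored transaction inside the agreed classification, so later queries by category do not have to cope with free-text labels.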

4.3. Implementation Issues

Due to the limits of available OCR techniques, it is not feasible to perform real-time synchronization between texts and their facsimile image counterparts. The feasible solution is to collect the coordinates of every line in the facsimile images with a supporting tool. These pre-collected coordinates are used by the text-image synchronization mechanism to locate, approximately, the region of an image from which the corresponding paragraph of plain text derives.
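A lookup over pre-collected line coordinates could work roughly as follows. This is a sketch under stated assumptions: each line is recorded with the character offset at which it starts in the plain text and with the pixel rows of its upper and lower edges in the facsimile. The `LineBox` type and both function names are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LineBox:
    text_offset: int  # character offset where the line starts in the plain text
    top: int          # pixel row of the line's upper edge in the facsimile
    bottom: int       # pixel row of the line's lower edge


def text_to_image(lines: List[LineBox], char_offset: int) -> LineBox:
    """Find the image line holding char_offset (lines sorted by text_offset)."""
    starts = [line.text_offset for line in lines]
    index = bisect_right(starts, char_offset) - 1
    return lines[max(index, 0)]


def image_to_text(lines: List[LineBox], y: int) -> int:
    """Return the text offset of the line nearest to a clicked pixel row."""
    def distance(line: LineBox) -> int:
        if line.top <= y <= line.bottom:
            return 0
        return min(abs(y - line.top), abs(y - line.bottom))
    return min(lines, key=distance).text_offset
```

Both directions are approximate by design: the lookup resolves to a whole line, never to a single character, which matches the paper's statement that the region is located approximately.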

3. The five categories (printing errors, typographical errors, spelling variants, substantial-certain, and substantial-uncertain) were developed by a domain expert after examination of the texts and existing scholarly practices.


Figure 7. Highlighted variance in the version of Madrid 1605

5. THE HYPERMEDIA-BASED DATA ENTITY MANAGEMENT SYSTEM

Besides enforcing consistency control, two other objectives of the HDEMS are (1) providing common functionality, such as navigating among related data entities, to different upper-layer applications through a uniform interface, which increases modularity; and (2) facilitating efficient project management, such as being able to keep track of relationships among data entities.

5.1. Architecture of the HDEMS

Figure 10 shows the structure of the HDEMS and the relationship between the HDEMS and other applications. As Figure 10 shows, data entities can be stored in two places: either in the underlying database or in the file system. However, for the sake of efficient access control and verification, both the metadata and the relationships of data entities, which are accessed frequently, are stored in the underlying database. All upper-layer applications access data entities through the application programming interface (API) of the HDEMS.
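One way to picture consistency control over derivation relationships is a graph walk that flags every entity derived, directly or transitively, from an updated one. The sketch below is an assumption about how such verification could be organized, not a description of the HDEMS API; the class name, method names, and entity identifiers are hypothetical.

```python
from collections import deque


class DerivationGraph:
    """Tracks which data entities were derived from which others."""

    def __init__(self):
        self._derived = {}         # entity id -> set of ids derived from it
        self.needs_review = set()  # entities flagged by earlier updates

    def record(self, source, derived):
        """Register that `derived` was produced from `source`."""
        self._derived.setdefault(source, set()).add(derived)

    def update(self, entity):
        """Flag everything derived from `entity`; return the flagged ids."""
        flagged = set()
        queue = deque(self._derived.get(entity, ()))
        while queue:
            current = queue.popleft()
            if current in flagged:
                continue
            flagged.add(current)
            queue.extend(self._derived.get(current, ()))
        self.needs_review |= flagged
        return sorted(flagged)
```

With such a structure, updating a variance would put the editing transactions derived from it on a review list, which is the kind of inconsistency the paper says must be caught.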


Figure 8. Three synchronized documents with texts and images

5.2. Underlying Database Design

We chose to place data entities, the relationships among them, and their metadata in a relational database for the following reasons. First of all, there are a large number of instances of each type of entity, and there exists a regular pattern of relationships among specific types of entities. Second, with the features (metadata) of entities stored in a structured format, it is easier to locate particular entities precisely, by using metadata to specify the entities of interest. Third, the relationships among entities can be modeled by the entity-relationship (E-R) model [CHE 76, OZK 90, OZK 86], which maps naturally onto relational databases.

Figure 9. Variance editing dialog interface

The first step in designing the underlying database is identifying all necessary entities, along with their attributes. There are two types of data entities generated during the process of creating electronic variorum editions: single-edition entities and cross-edition entities. Single-edition entities are those whose contents come from one edition only. In our system, the identified single-edition data entities include (1) raw image: the class for the images obtained from digitizing the microfilms that record the original texts, (2) enhanced image: the class for the enhanced images that will result in more efficient automatic contents recognition, (3) full page in ASCII: the class for the plain texts obtained from the facsimile images, (4) text paragraph within a page: a paragraph of text within one page, used for recording a variant string belonging to a particular variance, and (5) text segment: a run of contiguous full pages in ASCII, which is the basis for text collation to find textual variances among different editions of the same text. In contrast to single-edition entities, the contents of cross-edition entities come from multiple editions. The identified cross-edition entities are (1) collation session: specifies one set of text segments on which textual collation was performed to find the variances among them, (2) variance: includes all information about a particular variance found during a collation session, and (3) editing transaction: records a contribution from the editors.

[Figure 10 diagram. Upper-layer applications: project administration utilities, material presentation applications, contents recognition utilities, editor for multi-variant texts (MVED). Middle layer: HDEMS API (access control, navigation, rippled verification). Storage: local file system and underlying relational database, holding data entities (texts and images) and their metadata.]

Figure 10. The architecture of the HDEMS

After identifying the entities and their attributes, the entity-relationship (E-R) model is used to model the relationships among the data entities. The derivational relationships among the data entities are modeled in two ways, depending on the type of the relationship. A one-to-many relationship is represented by means of primary key (PK) and foreign key (FK) pairing. The requirements of the system determine which types of relationships are defined among entities. For example, the one-to-many relationship between the raw image class and the enhanced image class reflects that more than one image could be produced by enhancing the same raw image. A many-to-many relationship, also known as an association, needs to be represented as a table of its own. These associative tables are used by the HDEMS API to perform navigation; upper-layer applications are not aware of the existence of these tables.
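The two modeling patterns can be made concrete with a small schema sketch. The paper's system runs on Oracle 8; the example below uses Python's built-in `sqlite3` module only so that it is self-contained. All table and column names are assumptions, as is the choice of collation sessions and text segments to illustrate a many-to-many association.

```python
import sqlite3

SCHEMA = """
CREATE TABLE raw_image (
    id      INTEGER PRIMARY KEY,
    edition TEXT NOT NULL,
    page    INTEGER NOT NULL
);
-- One-to-many: each enhanced image points at its raw image (PK-FK pairing).
CREATE TABLE enhanced_image (
    id           INTEGER PRIMARY KEY,
    raw_image_id INTEGER NOT NULL REFERENCES raw_image(id),
    method       TEXT
);
CREATE TABLE text_segment (
    id      INTEGER PRIMARY KEY,
    edition TEXT NOT NULL
);
CREATE TABLE collation_session (
    id INTEGER PRIMARY KEY
);
-- Many-to-many: an associative table links sessions and text segments.
CREATE TABLE session_segment (
    session_id INTEGER NOT NULL REFERENCES collation_session(id),
    segment_id INTEGER NOT NULL REFERENCES text_segment(id),
    PRIMARY KEY (session_id, segment_id)
);
"""


def open_db():
    """Create an in-memory database with foreign keys enforced."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


def enhanced_for(conn, raw_image_id):
    """Navigate one-to-many: ids of images enhanced from one raw image."""
    rows = conn.execute(
        "SELECT id FROM enhanced_image WHERE raw_image_id = ? ORDER BY id",
        (raw_image_id,),
    )
    return [row[0] for row in rows]


def editions_in(conn, session_id):
    """Navigate many-to-many through the associative table."""
    rows = conn.execute(
        "SELECT t.edition FROM text_segment AS t "
        "JOIN session_segment AS s ON s.segment_id = t.id "
        "WHERE s.session_id = ? ORDER BY t.id",
        (session_id,),
    )
    return [row[0] for row in rows]
```

Functions such as `editions_in` play the role the paper assigns to the HDEMS API: callers ask for related entities and never see the associative table itself.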

5.3. Implementation

The relational database management system used in this work is Oracle 8. Tables in the database store various data entities, their metadata, and relationships. To access data entities in the database, the MVED uses JDBC, a standard relational database interface for Java programs. Another module needing access to the data entities is the unified document composer, discussed later, which accesses them by using “oraperl”, a package for connecting Perl programs and the Oracle database.


Figure 11. Options and unified hypertext document in a browser

6. THE READER’S INTERFACE

Even with raw facsimile images, texts, and editing transactions stored in the system, the whole system will not be complete without an interface through which readers are able to access the raw material and editing records. The two goals of the reader's interface in our system are (1) to provide higher availability (allowing more readers to access the contents of the system) and (2) to allow readers to customize unified documents, i.e., to generate any combination of raw material and editing records, based upon their preferences. To achieve these goals, a Web-based reader's interface has been developed, which uses standard Web browsers to accept a reader's input and to display unified documents, generated by a CGI program based on the reader's input. As the left-side frame in Figure 11 shows, through the composition options, readers can select the sources of material. The unified document, shown in the right-side frame in Figure 11, will be created based on the selected options.

The CGI program is responsible for composing a hypertext document dynamically. Upon receiving a request from the reader, the hypertext document composer performs a sequence of operations to generate a customized unified document. These operations include (1) parsing the arguments given by the Web browser to determine the reader's requirements. (2) Matching the reader's requirements with available material by sending queries to the underlying data-entity database. (3) Fetching the best-matched material from the data-entity database. If the system cannot find the requested data entities, the best alternative data entities will be sent for display. For example, if a reader wants to see the variances categorized as printing errors by Editor-A but there is no editing transaction classified as “printing error” by Editor-A, the system will search for a variance identified as resulting from printing errors by any other editor for display. The system will use the corresponding variance in the base text as the default replacement if no appropriate editing classifications exist in the database. (4) Associating hyperlinks with inter-relationships among data entities, for example, associating hyperlinks with editing records that contain detailed information about raw material, editing commentaries, and participating editors. Thus, in addition to browsing unified documents, readers also can investigate the edited material in detail through these hyperlinks. As Figure 11 shows, in the unified document, each variance is color-coded, enlarged, and associated with a hyperlink represented by a magnifying glass icon. Clicking an icon will lead the reader to another browser window where all data entities related to the associated variance are displayed or can be reached through other hyperlinks; see Figure 12. If more than one editing transaction has been made on the same variance by different editors, the reader can select any one of the editing transactions by using the pull-down menu in the upper-side frame. Besides editing transactions, readers also can view the facsimile images that contain the variances of interest in the lower-side frame. (5) Composing a unified document by concatenating the fetched data entities in order of occurrence. The unified hypertext document will be sent back to the reader's browser through the Web server.
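The composer's sequence of operations, including the fallback rule of step (3), can be sketched compactly. The paper's composer is a Perl CGI program backed by Oracle; the Python sketch below replaces the database with in-memory dictionaries, and the query parameter names (`editor`, `category`), the record layout, and the `/variance` link target are all hypothetical.

```python
from html import escape
from urllib.parse import parse_qs


def parse_request(query_string):
    """Step 1: read the reader's options (parameter names are hypothetical)."""
    args = parse_qs(query_string)
    return args.get("editor", [""])[0], args.get("category", [""])[0]


def choose_reading(variance, editor, category):
    """Steps 2-3: pick the best-matching reading, with the paper's fallbacks."""
    edits = [e for e in variance["edits"] if e["category"] == category]
    for edit in edits:
        if edit["editor"] == editor:
            return edit["reading"]   # exact match: this editor, this category
    if edits:
        return edits[0]["reading"]   # same category, by any other editor
    return variance["base"]          # default: the reading of the base text


def compose(parts, editor, category):
    """Steps 4-5: hyperlink each variance, concatenate in order of occurrence.

    `parts` mixes plain strings with variance dictionaries.
    """
    html = []
    for part in parts:
        if isinstance(part, str):
            html.append(escape(part))
        else:
            reading = escape(choose_reading(part, editor, category))
            html.append('<a href="/variance?id=%d">%s</a>' % (part["id"], reading))
    return "".join(html)
```

A request such as `editor=Editor-A&category=printing+error` would be parsed once, after which `compose` is applied to the stored sequence of text runs and variances.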

7. CONCLUSIONS

Figure 12. Browser window displaying details of a variance

Because of their rarity, access to ancient documents usually is restricted, which hinders both the research work of domain experts and the awareness of rare documents among general readers. Besides presenting an overview of an integrated system for digitizing ancient documents with degraded visual quality and multi-variant contents, this paper details three modules that are under development. First, this paper describes how an editing tool for documents with multi-variant contents (MVED) can locate variances, facilitate comparison and contrast of multiple versions of the same document, and allow scholars to correct and comment on particular parts of the document. With its embedded synchronization mechanism, the MVED can locate a set of variant contents in every collated text, as well as the corresponding facsimile image, based on a particular segment of text. Following the MVED, this paper describes how derivational relationships among related data entities can be stored in a relational database, so that fewer inconsistencies result from frequent updates to the linkage structure of related entities and to the contents of entities. The third module presented in detail is a Web-based reader's interface. The pervasiveness of Web browsers increases the availability of both raw and edited material. Through the Web-based interface, readers can customize the composition of unified texts, as well as traverse among all related data entities through hyperlinks.

Integrating the above modules forms a generic solution for creating electronic editions of ancient textual documents. The integrated system benefits scholars by associating raw materials with their editing rationale closely and precisely. The association also allows dynamic links among raw materials and support for multiple scholarly interpretations. This facilitates serving readers with different backgrounds, interests, and goals. In addition, readers are not limited by the decisions made by editors; readers can create customized editions at will. More broadly, we believe that the concepts, working procedures, and software techniques will be helpful in handling issues of mid-to-large scale processing of ancient documents for a wide variety of additional collections.

8. REFERENCES

[BLA 95] BLAKE N. F., ROBINSON P., SOLOPOVA E., The Canterbury Tales Project, 1995, http://www.shef.ac.uk/uni/projects/ctp/index.html.

[BRE 96] BREWER A., DING W., HAHN K., KOMLODI A., “The Role of Intermediary Services in Emerging Digital Libraries”, DL96: Proceedings of Digital Libraries, Bethesda MD, USA, 1996, ACM, p. 29-35.

[CAS 90] CASEY R. G., YONG S. Y., “Image Analysis Applications”, chapter 1, p. 1-36, Marcel Dekker, Inc., 1990.

[CHE 76] CHEN P., “The Entity Relationship Model: A Basis for the Enterprise View of Data”, ACM Trans. on Database Systems, vol. 1, num. 1, 1976, p. 9-36.

[GAR 98] GARRIS M. D., JANET S., KLEIN W. W., “Impact of Image Quality on Machine Print Optical Character Recognition”, report, January 1998, National Institute of Standards and Technology.

[GLA 98] GLADNEY H. M., MINTZER F., SCHIATTARELLA F., BESCOS J., TREU M., “Digital Access to Antiquities”, Communications of ACM, vol. 41, num. 4, 1998, p. 49-57.

[LEC 98] LECOLINET E., ROLE F., LIKFORMAN-SULEM L., LEBRAVE J.-L., ROBERT L., “An Integrated Reading and Editing Environment for Scholarly Research on Literary Works and their Handwritten Sources”, Proceedings of the Third ACM International Conference on Digital Libraries, DL ’98, Pittsburgh, PA, June 1998, ACM, p. 144-151.

[OZK 86] OZKARAHAN E., Database Machines and Database Management, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, USA, 1986.

[OZK 90] OZKARAHAN E., Database Management: Concepts, Design, and Practice, Prentice-Hall, Inc., Englewood Cliffs, New Jersey, USA, 1990.

[SEA 97] SEAMAN D., “The User Community as Responsibility and Resource: Building a Sustainable Digital Library”, D-Lib Magazine, 1997, http://www.dlib.org/dlib/july97/07seaman.html.