DIEPER LB-5632

Deliverable 13
Survey of current methodology in image capturing and document management

Date: 08-07-99
Reference: D13final/public/ABC, 29 pages
Produced by: ABC Datenservice GmbH for UNIGOE
Workpackage 4, supervised by UBG
Distribution list: All DIEPER partners
Contact person: Reinhard Ecker
Address: Am Wasserturm 6, D-60435 Frankfurt am Main
Phone: +49 69 954031-30
Fax: +49 69 954031-12
E-mail: [email protected]


DIEPER Project: Deliverable 13
Survey of current methodology in image capturing and document management

Version: Final    Date: 08.07.1999

Document history

Versions

Version  Date      Author    Comments
1        20/12/98  R. Ecker  Preliminary Draft of D13
2        16/02/99  R. Ecker  Draft 2 of D13
3        09/07/99  R. Ecker  Final version of D13

Updates

Chapter  Description of modifications  Version

1. Introduction
2. Scanning
   Kind of printed materials
   Kind of intended use and further processing of the digitised materials
   Kind of intended access to the digitised materials
   Image processing
   Image compression
   Versions of image files for different applications
3. Indexing
   Categories of indexing
   Document identifier
   Document structure
4. Methods of full text + meta data capturing
   Manual text capturing
   Text capturing by OCR / ICR
   Download of catalogue data
5. Document storage
   Document storage formats
   Digital master file
   Application file formats
   Self-describing image files
   Storage media
6. Document management
   Electronic archiving and document management systems
   Basic functions of archiving and document management systems
   Document storage
   Document retrieval
   Document visualisation and reproduction
   Maintenance and administration
   Existing archiving and document management systems for digital libraries
   Online library catalogue software systems
   Local solutions
7. Relevant Standards
8. References, URLs etc.
Appendix: Dieper Questionnaire

1. Introduction

This document describes the current status of image capturing and document management methods, especially with respect to library documents.

Some years ago several libraries started to retro-digitise printed materials, e.g. books or periodicals, and to distribute these digital documents via the Internet or other networks to users all over the world. We are now at the beginning of a development which will, as we hope, make all important information available immediately from anywhere and at any time.

This new kind of information access has already met with enough enthusiasm from users to act as a catalyst for starting additional projects. It is expected that information behaviour will be influenced considerably by direct access to digitised documents.

Libraries, in our tradition one of the significant groups of conventional information providers, will identify and use this chance to take on a leading role in the digital information society as well.

The goal of the DIEPER project is to advance these developments with respect to the digitisation, indexing and presentation of scientific periodicals.

This report gives an overview of the current state of methodology for the digitisation of printed library materials and for the electronic storage and administration of digital documents. In addition, a short overview is given of indexing and of the capturing of full text and meta data. (Deliverable 16, which is to be prepared later, will give more details on these items.)

A list of relevant standards and a technical glossary of relevant terms are added, together with some references.

The appendix to this report presents the results of a survey (“Dieper Questionnaire”) investigating the current methodology in image capturing and document management at the project partners and selected European libraries.

2. Scanning

To make a printed document available via the Internet, it has to be converted into an electronic format.

One of the first difficult issues that must be addressed in any digital conversion project concerns the selection of appropriate formats and technologies for storage, display and distribution of the material.

Another difficult question is which file format (images, PDF, SGML, HTML, etc.) should be used to deliver the content.

A further question is whether to store and deliver the materials as images or as text. Given the technology available to web browsers, the most accurate way to replicate completely the originally published material, which is full of special characters, foreign languages, mathematical symbols, charts and pictures, is with scanned images.

In addition, by using Optical Character Recognition (OCR) software, a corresponding text file can be built that allows the user to search the full text of the journals in the database. These (uncorrected) OCR text files should, however, not be made available to users.

A distinction should be made between coded and non-coded information:

                  Coded information       Non-coded information
Files             Text                    Image
Capturing         Manual input, OCR/ICR   Scanning
Editing           Text editor             Pixel editor
Direct retrieval  Yes                     No

Basic scanning parameters

• Kind of the original document (printed text on paper, printed image, photograph, colour, microfilm, microfiche, ...)

• Size of the original document (micro form, <A 4, >A 4 <A3, <A 3, ..., >A0)

• Scanning resolution (100 dpi, 300 dpi, 400 dpi, 600 dpi, ...)

• Image depth (pixel information: 1 bit, 8 bit, 12 bit, 3 x 12 bit, ...)

• Intended exploitation of the digital materials

• File size of the digital materials
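The parameters above are linked: page size, scanning resolution and image depth together determine the uncompressed file size. The following is a minimal sketch of that arithmetic, assuming an ISO A4 page (210 x 297 mm) as the original; the function name is illustrative and does not come from the report.

```python
import math

MM_PER_INCH = 25.4


def uncompressed_size_bytes(width_mm, height_mm, dpi, bits_per_pixel):
    """Estimate the raw (uncompressed) size of one scanned page in bytes."""
    width_px = round(width_mm / MM_PER_INCH * dpi)
    height_px = round(height_mm / MM_PER_INCH * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return math.ceil(total_bits / 8)


if __name__ == "__main__":
    # A4 page (assumed) at some of the resolutions and image depths listed above.
    for dpi, depth in [(300, 1), (600, 1), (300, 8), (300, 24)]:
        size = uncompressed_size_bytes(210, 297, dpi, depth)
        print(f"A4, {dpi} dpi, {depth} bit: {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the pixel count, which is why resolution and image depth have to be weighed against the intended use and the acceptable file size.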

Criteria for the definition of scanning parameters

The definition of scanning parameters depends on the

• kind of the printed materials

• kind of intended use

• kind of intended access

Kind of printed materials

Paper based materials

• Bound volumes

• Single sheets of paper

• Maps

• Library catalogue cards

• etc.

• Single-sided or double-sided

• Usual book format (~ A 4, A 5 ..)

• Small size

• Large size (A 0 or larger)

• Text

• Graphics

• Halftone

• Colour

Microfilm based materials

• Microfilm

• Microfiche

• Slides

• Professional reprofilm

Other materials

• 3-D objects

• etc.

Kind of intended use and further processing of the digitised materials

Presentation on the screen

Reproduction by a printer

• Local print of a small number of document pages

• Reprint of the complete document

• Reprint in professional quality

• Reprint of coloured posters in an optimum true colour quality

• etc.

Further processing of documents

• Automatic OCR conversion to full text

• Automatic vectorisation of graphical information

• Automatic analysis of the document type

• Production of a CD-ROM

• etc.

Kind of intended access to the digitised materials

• Via Internet/Intranet

• Local access within the premises of the library

Categories of scanners

• Flat bed scanners

• Flat bed scanners with automatic feeders

• Camera scanners

• Specialised book scanners

• Microfilm scanners

• Other specialised Scanners (x-ray, 3 D objects, ...)

• Black/white scanners

• Greyscale scanners

• Colour scanners

• Digital resolution (dpi)

Image processing

• Clipping: Separation of double pages into single pages (book scanners)

• Despeckle: Cleaning the image of dirt (e.g. marks caused by mould) by deleting single spots and grey background

• Deskew: Rotating the image so that it is aligned to the vertical

• Black border removal: Removing the black border or any black background

• Contrast enhancement: Enhancement and addition (tracing) of lines in the document

• Level reducing: Reducing the pixel information (number of grey or colour levels)

• Resolution reducing: Reducing the image resolution (pixel resolution)

• Scaling to original size: Scaling the image file to the original size of the paper document
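Two of the steps above, level reducing and despeckle, can be sketched on a small image held as nested lists. This is only an illustration under simple assumptions: a fixed threshold of 128 and a "speckle" defined as a black pixel with no black pixel among its eight neighbours. Production scanning software uses more elaborate filters; the function names are hypothetical.

```python
def reduce_levels(grey, threshold=128):
    """Level reducing: map 8-bit grey values to 1 bit (1 = black, 0 = white)."""
    return [[1 if value < threshold else 0 for value in row] for row in grey]


def despeckle(bitmap):
    """Delete black pixels that have no black pixel among their 8 neighbours."""
    height = len(bitmap)
    width = len(bitmap[0]) if height else 0
    cleaned = [row[:] for row in bitmap]
    for y in range(height):
        for x in range(width):
            if bitmap[y][x] != 1:
                continue
            black_neighbours = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dy == 0 and dx == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        black_neighbours += bitmap[ny][nx]
            if black_neighbours == 0:
                cleaned[y][x] = 0  # isolated spot: treat as dirt
    return cleaned
```

The isolated-pixel rule is deliberately conservative; it would also remove genuine single-pixel marks such as very small dots, so the threshold for "dirt" has to be tuned to the scanning resolution.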

Image compression

Image compression reduces the size of the image file. There is a limit to the size of file that a user can be expected to download over network links. Because of that limitation, it may not be possible to offer very high resolution colour or greyscale images.

Lossless compression

• CCITT Group 4 (T.6)

• LZW

• etc.

Lossy compression

• JPEG

• Wavelet

• Fractal compression (synchronous or asynchronous)

• etc.
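The difference between the two families is that a lossless method reproduces the original pixels exactly after decompression. The sketch below demonstrates this property with a plain run-length code on one row of a 1-bit image. It is a deliberately simplified stand-in, not an implementation of CCITT Group 4 or LZW, although Group 4 also exploits the long black and white runs typical of scanned text.

```python
def rle_encode(row):
    """Encode a row of 1-bit pixels as (value, run length) pairs."""
    runs = []
    for pixel in row:
        if runs and runs[-1][0] == pixel:
            runs[-1][1] += 1
        else:
            runs.append([pixel, 1])
    return [(value, length) for value, length in runs]


def rle_decode(runs):
    """Rebuild the original row from (value, run length) pairs."""
    row = []
    for value, length in runs:
        row.extend([value] * length)
    return row


if __name__ == "__main__":
    scan_line = [0] * 20 + [1] * 5 + [0] * 15  # white, black stroke, white
    encoded = rle_encode(scan_line)
    print(encoded)
    print(rle_decode(encoded) == scan_line)  # lossless round trip
```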

Versions of image files for different applications

• Digital master file

• Archive file

• Screen presentation

• Gallery / Thumbnail

• Local print

• Download format

The following tables give an overview of the different formats and recommended parameters.

Text, Line-graphics

Scanning            (300)/400/600 dpi        1 Bit
Storage             TIFF/CCITT G4            1 Bit
Viewing             70-120 dpi GIF           1-4 Bit
Gallery/Thumbnails  15 dpi GIF               1 Bit
Download            300/400/600 dpi PDF      1 Bit

If an OCR conversion is intended, scanning should be done with a resolution of 600 dpi.

Grey-scale graphics, Photographs

Scanning            300 dpi                       8 Bit
Storage             TIFF uncompressed             8 Bit
Viewing             512x768 to 1024x1536 JPEG     4 Bit
Gallery/Thumbnails  ~ 100x150 JPEG                4 Bit
Download            2048x3072 JPEG                8 Bit

Manuscripts

Scanning            300 dpi                       8 Bit
Storage             TIFF uncompressed             8 Bit
Viewing             512x768 to 1024x1536 JPEG     1-4 Bit
Gallery/Thumbnails  ~ 100x150 JPEG                < 8 Bit
Download            2048x3072 JPEG                8 Bit

Colour graphics

Scanning            200-300 dpi                   3x8 Bit
Storage             TIFF uncompressed             3x8 Bit
Viewing             512x768 to 1024x1536 JPEG     3x8 Bit
Gallery/Thumbnails  ~ 100x150 JPEG                8 Bit
Download            2048x3072 JPEG                3x8 Bit

2-D Representation of 3-D Objects

Scanning            200-300 dpi                   3x8 Bit
Storage             TIFF uncompressed             3x8 Bit
Viewing             512x768 to 1024x1536 JPEG     3x8 Bit
Gallery/Thumbnails  ~ 100x150 JPEG                8 Bit
Download            2048x3072 JPEG                3x8 Bit

Acceptable compression for JPEG-files

Grey scale: maximum 10:1
Colour: maximum 15:1
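As a worked example of these limits, the sketch below computes the smallest JPEG file size that still respects the maximum ratio, using the 2048x3072 download size from the tables above. The function and dictionary names are illustrative assumptions, and the result is only a lower bound on file size, not a prediction of what a particular JPEG encoder will produce.

```python
# Maximum acceptable JPEG compression ratios as stated in the report.
MAX_JPEG_RATIO = {"greyscale": 10, "colour": 15}


def smallest_acceptable_size(raw_bytes, image_type):
    """Smallest JPEG size in bytes that stays within the acceptable ratio."""
    ratio = MAX_JPEG_RATIO[image_type]
    return (raw_bytes + ratio - 1) // ratio  # integer ceiling division


if __name__ == "__main__":
    grey_raw = 2048 * 3072          # 8 bit per pixel = 1 byte per pixel
    colour_raw = 2048 * 3072 * 3    # 3 x 8 bit per pixel
    print(smallest_acceptable_size(grey_raw, "greyscale"))
    print(smallest_acceptable_size(colour_raw, "colour"))
```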

3. Indexing

For more details see Deliverable 16.

Categories of indexing

• Bibliographic indexing

• Document identifier

• Document structure (SGML or XML representation of the document structure)

• Full text (complete full text or in part)

Bibliographic indexing

• Bibliographic data

• Catalogue data sets

• Dublin Core data sets

• Storage of bibliographic data in the TIFF-Header

• Storage of bibliographic data in the TEI-Header

• etc.

Document identifier

• DOI

• SICI

• URL

• PURL

• URN

• etc.

Document structure

• SGML/XML representation

• Ebind format

• Hyperlinks text ↔ Image page

SGML is widely used for describing the structure of catalogue records. SGML is an international standard for the formal definition of electronic text; it is thus a structure-driven meta language. HTML, for instance, is an application of SGML. The structure of a set of SGML documents is described in a single definition document referred to as the DTD, the Document Type Definition. HTML corresponds to a specific DTD, and so does the Netscape browser (an HTML viewer), which has its own DTD called Mozilla, a superset of HTML 2.0.
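To make the idea of a structural representation with links between text and image pages more concrete, the following sketch parses a small XML description of a volume. The element names, attribute names, article title and file names are all invented for illustration; the report does not prescribe a DTD, and Python's standard library parser checks well-formedness only, not conformance to a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical structure description: one article with links to page images.
SAMPLE = """
<volume id="V030">
  <article title="Sample article">
    <page number="172" image="00000172.tif"/>
    <page number="173" image="00000173.tif"/>
  </article>
</volume>
"""


def page_images(xml_text):
    """Return (article title, page number, image file) for every page element."""
    root = ET.fromstring(xml_text.strip())
    links = []
    for article in root.iter("article"):
        for page in article.iter("page"):
            links.append(
                (article.get("title"), int(page.get("number")), page.get("image"))
            )
    return links


if __name__ == "__main__":
    for title, number, image in page_images(SAMPLE):
        print(f"{title}, page {number}: {image}")
```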

Full text (complete full text or in part)

• Complete documents

• Special parts of the document

• Summary, abstract

• Tables of contents

• Indexes

• Key words

Full text formats

• HTML

• ASCII

• .DOC

• .XLS

• TEX

• etc.

4. Methods of full text + meta data capturing

For more details see Deliverable 16.

Manual text capturing

This is often the cheapest way to convert scanned images to full text, particularly for handprint, old prints in “Fraktur” or material of low printing quality. The cost is approximately 1 EURO per 1,000 characters for “double keying” if the work is done in the Far East. The accuracy ranges from 99% (Fraktur) to 99.85% (typewritten text).
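The figures above translate into a simple estimate of cost and residual errors. The sketch below assumes a page of 3,000 characters, which is an illustrative value and not taken from the report; the function names are likewise hypothetical.

```python
def keying_cost_euro(characters, euro_per_1000=1.0):
    """Cost of double keying at the quoted rate of about 1 EURO per 1,000 characters."""
    return characters / 1000 * euro_per_1000


def expected_errors(characters, accuracy):
    """Expected number of wrong characters for a given accuracy (e.g. 0.99)."""
    return characters * (1 - accuracy)


if __name__ == "__main__":
    page = 3000  # assumed number of characters on one page
    print(f"Cost per page: {keying_cost_euro(page):.2f} EURO")
    print(f"Errors at 99% (Fraktur): {expected_errors(page, 0.99):.1f}")
    print(f"Errors at 99.85% (typewritten): {expected_errors(page, 0.9985):.1f}")
```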

Text capturing by OCR / ICR

Optical Character Recognition is a method for the automatic conversion of scanned text pages to full text. Mid-range software (such as FineReader, Omnipage etc.) can convert well-printed and well-scanned material with an accuracy of 99.8% and a speed of up to 1,000 characters per minute. The price of a PC licence for this software is less than 500 EUROs.

It is recommended to check the text against special dictionaries (language, topic).
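A dictionary check of this kind can be sketched as follows: every word of the OCR output that is not found in the word list is flagged for manual review. This is a minimal illustration with a hypothetical function name and a tiny invented word list; real products combine language and subject dictionaries with further heuristics.

```python
import string


def suspicious_words(text, dictionary):
    """Return the words of an OCR text that do not occur in the dictionary."""
    flagged = []
    for token in text.split():
        word = token.strip(string.punctuation).lower()
        if word and word not in dictionary:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    word_list = {"the", "library", "scans", "old", "journals"}
    ocr_output = "The 1ibrary scans old joumals."
    print(suspicious_words(ocr_output, word_list))
```

Typical OCR confusions such as "1" for "l" or "m" for "rn" produce strings that are not dictionary words, which is why this simple check catches a useful share of errors.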

“Intelligent” Character Recognition systems run interactive quality checks based on document structure analysis, syntax and semantic rules.

Special software exists for handprint and Fraktur, but it must be stated that these products are not competitive with manual capturing in the Far East.

Software was recently developed by our own company for the automatic analysis and structured text conversion of tables of contents. This Toccata software will be offered to the DIEPER partners for testing free of charge.

Download of catalogue data

This is of course the easiest and cheapest way to “capture” bibliographic data of documents. Usually the catalogue data are downloaded as structured ASCII files.

5. Document storage

Document storage formats

The document format is mainly important regarding three issues:
- size of a document, which relates to the size of the required hard disk and to the transfer time
- recommended exchange format as a common format between partners
- whether or not a special viewer is needed to display and print the document when retrieved

Tag(ged) Image File Format (TIFF)

TIFF is a widely supported format within the library community. The latest TIFF version is 6.0.

TIFF limitations: There are no provisions in TIFF for storing vector graphics and text annotation (although such items could be easily constructed using TIFF extensions). TIFF uses 4-byte integer file offsets to store image data, with the consequence that a TIFF file cannot have more than 4 Gigabytes of compressed raster data. This is not a problem for DIEPER, since this limit is far from being reached within a single document. An average document is assumed to have 10 pages, each with a compressed size of 100 KB. This makes the average size of a requested article roughly 1 MB.

TIFF strengths: TIFF is primarily designed for raster data interchange. Its main strengths are a highly flexible and platform-independent format which is supported by numerous image processing applications. Supported compression algorithms are: raw uncompressed, PackBits, LZW (Lempel-Ziv-Welch), CCITT Group 3 & 4 and JPEG compression.

Regarding transfer time for an average document: suppose that an end-user is connected through a dial-up connection (28,800 BPS). A 1 MB document then requires roughly 5 to 10 minutes to download. This seems to be accepted by end-users when compared to classical postal delivery.
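The numbers in this section can be checked with a few lines of arithmetic. In the sketch below the `efficiency` parameter is an assumption introduced here to model protocol overhead and line quality; the report itself only states the line rate and the resulting range of 5 to 10 minutes.

```python
TIFF_OFFSET_LIMIT = 2 ** 32  # 4-byte offsets address at most 4 GiB


def transfer_seconds(size_bytes, line_rate_bps, efficiency=1.0):
    """Time to transfer a file over a line, given the usable share of its rate."""
    return size_bytes * 8 / (line_rate_bps * efficiency)


if __name__ == "__main__":
    page_bytes = 100 * 1000            # 100 KB per compressed page
    document_bytes = 10 * page_bytes   # average 10-page document, about 1 MB
    print(f"Pages of this size below the TIFF limit: {TIFF_OFFSET_LIMIT // page_bytes}")
    for efficiency in (1.0, 0.5):
        minutes = transfer_seconds(document_bytes, 28_800, efficiency) / 60
        print(f"Efficiency {efficiency:.0%}: {minutes:.1f} minutes")
```

At the full line rate the transfer takes a little under 5 minutes; at half the rate it takes a little over 9 minutes, which is consistent with the range quoted above.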

Portable Document Format (PDF)

PDF is a file format used to represent a document independently of the application software, hardware, and operating system that were used to create it. A PDF file contains a PDF document and other supporting data. A PDF document contains one or more pages. Each page in the document may contain any combination of text, graphics, and images in a device- and resolution-independent format. This is the page description. A PDF document may also contain information possible only in an electronic representation, such as hypertext links.

PDF limitations: Printing a PDF document requires installing the article's embedded fonts on the end-user's machine and several steps in order to convert the file to PostScript format. Pages are not necessarily stored in sequential order in the PDF file.

PDF strengths: PDF is primarily a portable format. To reduce file size, PDF supports a number of industry-standard compression filters: JPEG compression, CCITT Group 3 & 4, LZW. PDF viewers are available free of charge from Adobe and exist for all platforms: UNIX, Macintosh and PC environments. Support for hyperlinks is also a helpful feature. Editing a PDF document requires professional tools, which may help to guarantee the authenticity of the original document. This contrasts with HTML, for example, which presents the content as free text that may easily be modified by a novice end-user.

PNG

The PNG format provides a portable, legally unencumbered, well-compressed, well-specified standard for lossless bitmapped image files.

Although the initial motivation for developing PNG was to replace GIF, the design provides some useful new features not available in GIF, with minimal cost to developers.

GIF features retained in PNG include:

- Indexed-color images of up to 256 colors.
- Streamability: files can be read and written serially, thus allowing the file format to be used as a communications protocol for on-the-fly generation and display of images.
- Progressive display: a suitably prepared image file can be displayed as it is received over a communications link, yielding a low-resolution image very quickly followed by gradual improvement of detail.
- Transparency: portions of the image can be marked as transparent, creating the effect of a non-rectangular image.
- Ancillary information: textual comments and other data can be stored within the image file.
- Complete hardware and platform independence.
- Effective, 100% lossless compression.

Important new features of PNG, not available in GIF, include:

- Truecolor images of up to 48 bits per pixel.
- Grayscale images of up to 16 bits per pixel.
- Full alpha channel (general transparency masks).
- Image gamma information, which supports automatic display of images with correct brightness/contrast regardless of the machines used to originate and display the image.
- Reliable, straightforward detection of file corruption.
- Faster initial presentation in progressive display mode.
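The corruption detection mentioned in the list rests on two mechanisms of the PNG format: a fixed 8-byte signature at the start of the file and a CRC-32 checksum at the end of every chunk, computed over the chunk type and data. The sketch below verifies both using only the Python standard library. The helper that builds a 1x1 test image exists only to make the example self-contained, and the function names are illustrative.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def make_chunk(chunk_type, data):
    """Build one PNG chunk: length, type, data, CRC-32 over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def tiny_png():
    """Return a 1x1 8-bit greyscale PNG built by hand (test data only)."""
    # IHDR: width, height, bit depth, colour type, compression, filter, interlace
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    pixels = zlib.compress(b"\x00\x80")  # filter byte 0 followed by one grey pixel
    return (PNG_SIGNATURE + make_chunk(b"IHDR", ihdr)
            + make_chunk(b"IDAT", pixels) + make_chunk(b"IEND", b""))


def png_is_intact(data):
    """Check the PNG signature and the CRC of every chunk up to IEND."""
    if data[:8] != PNG_SIGNATURE:
        return False
    pos = 8
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        chunk_type = data[pos + 4:pos + 8]
        body = data[pos + 8:pos + 8 + length]
        crc_bytes = data[pos + 8 + length:pos + 12 + length]
        if len(body) != length or len(crc_bytes) != 4:
            return False  # file is truncated
        stored_crc = struct.unpack(">I", crc_bytes)[0]
        if zlib.crc32(chunk_type + body) & 0xFFFFFFFF != stored_crc:
            return False  # chunk content was altered
        pos += 12 + length
        if chunk_type == b"IEND":
            return pos == len(data)
    return False
```

A check like this detects damage to the stored bytes; it does not decode the image, so it says nothing about whether the picture content is what was intended.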

GIF

Highly compressed image format. The limitation to 8 bits (256 colours) may cause colour inhomogeneity. Suitable for screen presentation and for images in one colour.

JPEG

Compressed image format. Colour depth from 8 to 24 bits. Lossy compression. It is possible to define the degree of information loss. Suitable for grey scale and colour images of low or medium quality.

Digital master file

The image file which results from scanning is the so-called digital master file, of the highest quality (resolution, image depth). This is the archive version, which should be stored in a standardised format with lossless compression on long-life media (WORM, CD-R, tape) in a secure place and should not be used for daily access.

The preferred file format is Tiff. Printed text which has been scanned with 1 bit image depth is losslessly compressed using the CCITT G4 (T.6) algorithm. Greyscale or colour images can be stored as uncompressed Tiff. Text files and other file formats are converted to “Tiff” in certain cases.

Alternative formats may be PNG and (only for grey scale and colour) the GIF format.

The formats described below are mostly derivatives of this master.

Application file formats

Archive file

This file is stored in the archive system (magnetic disc, RAID system, jukebox) for access by the user. Resolution and image quality depend on the kind of use and of user access. Formats can be Tiff, JPEG, GIF, PNG etc. File compression, even with loss of information, is possible.

In addition, application formats will be prepared for screen presentation (e.g. GIF or JPEG, 75 – 100 dpi), for download and for local printing (e.g. PostScript or PDF, 300 dpi). These application formats may also be stored in the archive system or can be prepared on the fly.

Screen presentation

This file can either be an additional format derived from the digital master and stored in the archive system, or a temporary file produced “on the fly” from the archive file. The same also applies to the formats below.

As this file is to be displayed on the screen, the resolution corresponds to screen quality (approx. 72 – 150 dpi). Lossy file compression up to a ratio of 10:1 (b/w) and 15:1 (colour) is possible.

Preferred file formats are JPEG or GIF.

Gallery / Thumbnail

This is a reduced version of the screen presentation file. The goal is to give the user an overview and to help locate a specific image among a number of images.

The resolution is normally around 15 dpi. Extremely high and lossy compression is possible.

Preferred file formats are JPEG or GIF.

Local print

This file will be delivered to the user’s local printer if the print button was clicked. The resolution depends on the file size and the requirements for sufficient printing quality. Formats may be PDF, JPEG, GIF etc.

Download format

This file will be delivered to the user’s PC for local storage. The preferred format is PDF. The resolution should be a reasonable compromise between file size and quality.
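Since the application formats are derived from the digital master, their pixel dimensions follow directly from the ratio of target resolution to master resolution. The sketch below assumes an A4 master scanned at 600 dpi (4961 x 7016 pixels) and uses resolutions named in this chapter; the dictionary of derivatives and the function name are illustrative assumptions.

```python
# Target resolutions in dpi, taken from the ranges given in the text.
DERIVATIVES = {"screen": 100, "thumbnail": 15, "print/download": 300}


def scaled_dimensions(width_px, height_px, master_dpi, target_dpi):
    """Pixel dimensions of a derivative with the same physical size as the master."""
    width = max(1, round(width_px * target_dpi / master_dpi))
    height = max(1, round(height_px * target_dpi / master_dpi))
    return width, height


if __name__ == "__main__":
    master = (4961, 7016, 600)  # assumed A4 master at 600 dpi
    for name, dpi in DERIVATIVES.items():
        print(name, scaled_dimensions(*master, dpi))
```

Whether such derivatives are stored permanently or computed on request is the storage-versus-processing trade-off described above.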

Self-describing image files

Self-describing documents consist of two parts: the body, which contains the data, and the header, with attributes describing the document and its format.

The working group “Technik”, established by the Deutsche Forschungsgemeinschaft for the preparation of its retro-digitisation programme for libraries, defined five Tiff header fields in addition to the existing standard fields. These additional categories are described below.

For further details see: http://wwww.sub.uni-goettingen.de/ebene_2/vdf/einstieg.htm

[Figure: an image file (e.g. Tiff) consisting of a header, which holds format information and attributes, and a body]

Additional Tiff header categories

Category name     Category no.  Content                                  Sample
DocumentName      269           Characters 1-3: library short name       UBG 12345678
                                Characters 4-#: catalogue number
ImageDescription  270           Bibliographic data, file structure       |PERIO|Journal of Mathematics|1874|Berlin|JourMath_33355577_V030_I005|
PageName          285           Page count                               00000172
(Scan)Software    305           Name and version number of the           SRZ Proscan, Version 2.0
                                scanning software
Artist            315           Name (or short name) of the library      Universitätsbibliothek Graz
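The header categories in the table can be handled as plain strings once they have been read from the file. The sketch below splits the two structured fields using the sample values from the table. It does not read a Tiff file itself; the function names are hypothetical, and the meaning of the individual ImageDescription fields is left open because the table does not define them.

```python
# Tiff tag numbers and names as listed in the table above.
HEADER_CATEGORIES = {
    269: "DocumentName",
    270: "ImageDescription",
    285: "PageName",
    305: "Software",
    315: "Artist",
}


def parse_document_name(value):
    """Split DocumentName into library short name (characters 1-3) and catalogue number."""
    return value[:3], value[3:].strip()


def parse_image_description(value):
    """Split the pipe-delimited ImageDescription into its fields."""
    return value.strip().strip("|").split("|")


if __name__ == "__main__":
    print(parse_document_name("UBG 12345678"))
    sample = "|PERIO|Journal of Mathematics|1874|Berlin|JourMath_33355577_V030_I005|"
    print(parse_image_description(sample))
```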

Storage media

Two alternative storage technologies exist for digital archive systems:

• Magnetic storage

- Magnetic disc

- Magnetic tape

• Optical storage

- Optical disc

- Optical tape, card etc.

- Holographic solid storage

Magnetic discs are preferred for data that are in permanent access. RAID systems contain several magnetic discs in one array. For large archives (> 100 Gbytes), optical disc jukeboxes are often used as a cheap mass storage device. The main disadvantage of optical disc jukeboxes is that the disc has to be inserted mechanically into the optical drive. This process takes several seconds.

Magnetic tapes are sometimes used as back-up media for the master file.

There exist three categories of optical disc storage media:

• Rewritable optical discs

• WORM (Write Once Read Many)

• CD-R, DVD-R

Rewritable optical discs apply magneto-optic technology. WORMs are storage media for the permanent and irreversible archiving of image data. CD-R and DVD-R are special versions of WORMs using the CD-ROM / DVD-ROM formats.

Optical discs are seen as the ideal media for long-term storage of image data. The expected physical durability of optical storage media is in the range of 100 years, which is much longer than the lifetime of the hardware and software for recording the media.


6. Document management

Document management is usually understood as the data base supported management of any kind of dynamic electronic document during its full life cycle, from first production to the permanent archiving of the final version. Document management systems are widely used for business documents.

As digital library documents are, in contrast to business documents, static, the term “document management” does not really fit.

It is suggested to understand “document management” in the context of retro-digitised library material as the administration of image files in a non-hierarchic system.

Electronic archiving and document management systems

Electronic archiving and document management systems are used for the permanent storage and electronic provision of images, text and meta data.

Electronic archive systems administer documents and single pieces of information in data bases using a data base management system.

Furthermore, efficient electronic archive systems usually offer functions to manage jukeboxes for the storage (online, nearline or offline) of huge data collections on optical storage media.

Basic functions of archiving and document management systems

Archiving and document management systems should cover the following functionality.

• Image capturing and indexing

• Document storage

• Document retrieval

• Document visualisation and reproduction

• Maintenance and administration

Image capturing and Indexing

This covers scanning of analogue material (usually in paper form), import of image files and data, indexing (manual, semi-automatic or fully automatic) and preparation of protocols.

Scanning

• Preparatory steps

• Inhouse scanning by the library itself

• External scanning

• Image processing and manipulation

• Image file compression


Import of image files and data

Automatic or semi-automatic import of image data from other sources. The data stream is processed according to defined rules. Every import must be traceable and free of data loss. It should be possible to stop and resume the process at every stage.

During this process it is possible to change image formats and index data.
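The import requirements described here (no data loss, traceability, stopping and continuing at any stage) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the `*.tif` file pattern, the JSON journal file and the `limit` parameter (used here to simulate an interrupted run) are all hypothetical and not part of any system discussed in this survey.

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 checksum of a file, read in blocks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def import_images(source_dir: Path, archive_dir: Path, journal_path: Path, limit=None) -> list:
    """Copy image files into the archive, verify each copy and journal the progress.

    The journal (a hypothetical JSON file) records every completed file with its
    checksum, so an interrupted import can continue where it stopped.
    """
    done = json.loads(journal_path.read_text()) if journal_path.exists() else {}
    archive_dir.mkdir(parents=True, exist_ok=True)
    imported = []
    for source in sorted(source_dir.glob("*.tif")):  # assumption: TIFF master files
        if limit is not None and len(imported) >= limit:
            break  # simulates stopping the process part-way
        if source.name in done:
            continue  # already imported in an earlier run
        target = archive_dir / source.name
        shutil.copy2(source, target)
        checksum = sha256_of(source)
        if sha256_of(target) != checksum:
            target.unlink()
            raise IOError(f"copy of {source.name} does not match the original")
        done[source.name] = checksum
        journal_path.write_text(json.dumps(done, indent=2))
        imported.append(source.name)
    return imported
```

Calling `import_images` a second time with the same journal file skips everything that was already copied and verified, which is one simple way to make an import both restartable and auditable.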

Indexing

Manual, semi-automatic or automatic indexing is possible. Index files can be copied from existing data bases.

Semi-automatic indexing is possible, for instance, by preparing bar codes offline and reading the bar code information (document identifier) during scanning.

The automatic methods apply OCR and document analysis technology.

Index data are linked to the image files via their addresses or IDs.

A special kind of index information is the representation of the document structure. The index terms (tables of contents, headlines, categories, pagination, index, key words, ...) of the document elements are hyperlinked to the corresponding image pages.
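How index terms might be linked to page images via IDs can be shown with a small data structure. The sketch below is an assumption-laden illustration: the class names, the eight-digit image file naming (borrowed from the PageName sample) and the sample entries are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class StructureEntry:
    """One element of the document structure, linked to a page image by page number."""
    kind: str   # e.g. "table of contents", "headline", "index"
    label: str  # the index term shown to the user
    page: int   # page count of the image the term points to


@dataclass
class DigitisedDocument:
    document_id: str
    entries: List[StructureEntry] = field(default_factory=list)

    def image_file(self, page: int) -> str:
        # Assumption: one TIFF per page, named by its eight-digit page count.
        return f"{self.document_id}/{page:08d}.tif"

    def add_entry(self, kind: str, label: str, page: int) -> None:
        self.entries.append(StructureEntry(kind, label, page))

    def find(self, term: str) -> List[str]:
        """Return the image files of all entries whose label contains the term."""
        term = term.lower()
        return [self.image_file(e.page) for e in self.entries if term in e.label.lower()]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    doc = DigitisedDocument("JourMath_33355577_V030_I005")
    doc.add_entry("table of contents", "Contents", 3)
    doc.add_entry("headline", "On elliptic functions", 172)
    print(doc.find("elliptic"))
```

In a web-based system the returned file names would typically be turned into hyperlinks, so that a click on a table-of-contents entry opens the corresponding page image.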

Document storage

Storage of image files in different formats, index data, document structure information and document classification.

Document retrieval

Multiple retrieval and navigation tools (catalogue data, SGML/XML based tables of contents, indexes, lists of illustrations, full text, thumbnails, etc.).

Document management systems for digital libraries must use web technology to provide access via Internet and Intranet.

Document visualisation and reproduction

Standard viewers and browsers should be included, together with tools for the conversion of image formats and scaling, and export tools for download, printing-on-demand and storage on CD-R.

Maintenance and administration

This includes

• set up and maintenance of the data base system

• set up and maintenance of user categories

• user administration


• statistics

• system administration and operating

Existing archiving and document management systems for digital libraries

Agora

Agora is the first complete digital library system containing all modules necessary for daily library business as described above.

The main components are:

• Electronic Document Management and Administration System

• Batch import tools (archiving, meta data, utilization of the TIFF tags, etc.)

• Unique procedures to handle any kind of heterogeneous pagination systems

• Multiple retrieval and navigation tools (catalogue data, SGML/XML based tables of contents, indexes, lists of illustrations, full text, thumbnails, etc.)

• Tools for the conversion of image formats and scaling

• Export tools for printing-on-demand and storage on CD-R

• Internet/Intranet server, easy to use HTML-templates

The Digital Library System was developed by the Satz-Rechen-Zentrum SRZ in collaboration with the Göttinger Digitalisierungszentrum for the storage of and access to digital and digitized documents, including their structural, bibliographic and content meta data.

IBM Digital Library

Xerox

Die Digitale Bibliothek NRW

This is mainly a common access system to several data bases and electronic document repositories.

Bieblis

Bieblis is an electronic system for archiving of documents (images, text files etc.) and Internet access to these documents. It is an integrated component of the IBIS library system of the University Library of Bielefeld.

Online library catalogue software systems

Several OPAC software systems offer functions for direct access to images from the catalogue data entries.


Local solutions

In addition to the systems mentioned, several local solutions have been developed by libraries and universities.

7. Relevant Standards

Standard                         Description                                                    Year
CD DA (Digital audio), Red Book  Physical specification of the CD-ROM                           1980
CD ROM, Yellow Book              Continuation of Red Book                                       1983
CD-I (Interactive), Green Book   Complete multimedia system, ISO 9660                           1986
CD-R (Recordable), Orange Book                                                                  1995
CD-RW (Rewritable), Orange Book                                                                 1995
UDF                              Universal Disc Format (DVD compatible, but not ISO 9660        1996
                                 compatible). Developed by the Optical Storage Technology
                                 Association; based on ISO 13346. UDF will replace ISO 9660.
ISO 9660                         Standard for the CD file formats. Predecessor: High Sierra     1987
                                 standard (HSF). Disc size: 120 mm
HSF                              High Sierra Standard                                           1986
ISO 9171-1                       Specification of the disc format (5.25” = 130 mm)              1990
ISO 9171-2                       Specification of the writing format                            1990
ISO 10089                        Specification of the disc format for MOs 130 mm                1991
ISO 10090                        Specification of the disc format for ROMs and MOs 80 mm        1992
ISO 10091                        Specification of the disc format for WORM 130 mm               1995
ISO 10149                        Specification of the disc format for CD-ROM 120 mm             1995
ISO 10885                        Specification of the disc format for 14” WORM (Kodak)          1993
ISO 11560                        Specification of the disc format for MO-WORM 130 mm            1992
ISO 11694-4                      Specification of the file structure for optical storage cards  1996
ISO 12654                        Hardware independent storage format (draft)                    1996
ISO 13403                        Specification of the disc format for 12” WORM (CCS)            1995
ISO 13481                        Specification of 1 GB discs 130 mm                             1993
ISO 13549                        Specification of 1.3 GB discs 130 mm                           1993
ISO 13490-1/-2                   Specification of the file structure for ROMs and WORMs         1995
ISO 13614                        Specification of the disc format for 12” WORM (SSF)            1995
ISO 13482                        Specification of 2 GB discs 130 mm                             1995
ISO 14517                        Specification of 2.6 GB discs 130 mm                           1996
ISO 15525                        Life expectancy of CD ROM                                      Draft
ISO 8879                         SGML                                                           1986

8. Technical Glossary and Acronyms

Access Provider see Internet Service Provider

Alpha A value representing a pixel's degree of transparency. The more transparent a pixel, the less it hides the background against which the image is presented. In PNG, alpha is really the degree of opacity: zero alpha represents a completely transparent pixel, maximum alpha represents a completely opaque pixel. But most people refer to alpha as providing transparency information, not opacity information, and we continue that custom here.

APDU Application Protocol Data Unit. A unit of information transferred between a client and a server. It is used in Z39.50 to define the data exchanged between a Z39.50 origin and a Z39.50 target.

API Application Programming Interface

BC Berne Convention

Bib-1 Bib stands for Bibliographic. Denotes the set of attributes that can be searched using Z39.50

Bit depth The number of bits per palette index (in indexed-colour PNGs) or per sample (in other colour types). This is the same value that appears in IHDR.

Bits per Second Measure for the speed of data transfer through communication media.

Browser Software for the interpretation and presentation of HTML documents.

Byte Eight bits; also called an octet.

CANTATE Computer Access to Notation and Text in Music Libraries. R&D project funded within the EU libraries programme

CAS Current Awareness Service

CCITT Comité Consultatif International Téléphonique et Télégraphique (now ITU-T)

CERN European centre for high energy physics (Geneva, Switzerland).

CGI Common Gateway Interface. A technique that allows a Web server to interface to external applications such as databases.

Channel The set of all samples of the same kind within an image; for example, all the blue samples in a true colour image. (The term "component" is also used, but not in this specification.) A sample is the intersection of a channel and a pixel.

Chromaticity A pair of values x,y that precisely specify the hue, though not the absolute brightness, of a perceived colour.

Chunk A section of a PNG file. Each chunk has a type indicated by its chunk type name. Most types of chunks also include some data. The format and meaning of the data within the chunk are determined by the type name.

CI data Coded information; e.g. text files.

Composite As a verb, to form an image by merging a foreground image and a background image, using transparency information to determine where the background should be visible. The foreground image is said to be "composited against" the background.

CRC Cyclic Redundancy Check. A CRC is a type of check value designed to catch most transmission errors. A decoder calculates the CRC for the received data and compares it to the CRC that the encoder calculated, which is appended to the data. A mismatch indicates that the data was corrupted in transit.

Critical chunk A chunk that must be understood and processed by the decoder in order to produce a meaningful image from a PNG file.

DBMS Database Management System

DECOMATE Delivery of Copyright Materials to End-users. R&D project funded within the EU libraries programme

Datastream A sequence of bytes. This term is used rather than "file" to describe a byte sequence that is only a portion of a file. We also use it to emphasise that a PNG image might be generated and consumed "on the fly", never appearing in a stored file at all.

Deflate The name of the compression algorithm used in standard PNG files, as well as in zip, gzip, pkzip, and other compression programs. Deflate is a member of the LZ77 family of compression methods.

Document Provider Organisation which provides on line access to primary electronic material

Document Server Server from which documents are transmitted electronically and securely to the end user.

DOI Digital Object Identifier

Download The transfer of data (documents) from a server to a local computer

DTD Document Type Definition. A DTD describes an SGML document. For example, Mozilla is known to be Netscape's DTD

DVD Digital Versatile Disc

DVD ROM Read Only DVD

DVD-R Recordable DVD (WORM technology)

DVD RAM Erasable DVD

ECMS Electronic Copyright Management System

ECUP, ECUP+ European Copyright User Platform.

EDD Electronic Document Delivery. Generic term which involves the identification of the user, the searching of bibliographical references and the requesting of documents (SOD or online delivery).

EDIFACT Electronic Data Interchange For Administration, Commerce and Transport

EDIL Electronic Document Interchange between Libraries. European project terminated on Dec. 31, 1995.

ELITE-Project Electronic Library Teleservices. European project (1996-1997).

ELITE Service Provider Organisation which manages the LEAS

EUROPAGATE European project that aims at interoperability of bibliographic catalogue systems using the search and retrieve protocols (Z39.50 / SR)

FASTDOC Fast Document Ordering and Delivery. R&D project funded within the EU libraries programme (1994-1996)


Firewall See Security Firewall

FTP a) File Transfer Protocol. Defines how files are transferred from one computer to another. A reliable file transfer protocol used on top of the TCP/IP stack.

b) Software for transferring files using the File Transfer Protocol.

Filter A transformation applied to image data in hopes of improving its compressibility. PNG uses only lossless (reversible) filter algorithms.

GEDI Group on Electronic Document Interchange. GEDI is a TLV format that defines how to encapsulate a TIFF article.

GIF Graphics Interchange Format. File format for graphics, developed by CompuServe, Inc. GIF allows inline graphics to be included in HTML documents.

See also XBM.

Greyscale An image representation in which each pixel is represented by a single sample value representing overall luminance (on a scale from black to white). PNG also permits an alpha sample to be stored for each pixel of a greyscale image.

GUI Graphical User Interface

GURL Golden URL, technique used in the WebDOC project in order to point to online documents

HTML HyperText Markup Language. The language that describes the content of Web pages. HTML is derived from ISO/SGML using a specific DTD.

See also SGML.

HTTP HyperText Transfer Protocol.

HTTP / httpd HyperText Transfer Protocol Daemon. This is the World Wide Web server.

Hyperlink See Link

Hypermedia See Hypertext

Hypertext A document which contains links to other documents.

IAB See Internet Architecture Board

IAS Individual Article Supply

ICR Intelligent Character Recognition

IDF International DOI Foundation

IETF See Internet Engineering Task Force

IMPRIMATUR Intellectual Multimedia Property Right Model and Terminology for Universal Reference (EC Project)

Indexed colour An image representation in which each pixel is represented by a single sample that is an index into a palette or lookup table. The selected palette entry defines the actual colour of the pixel.

Inline-Image Graphic as part of a hypertext document.

See also Linked Image.

Internet a) In general, a number of single networks which operate together like one big network.

b) The worldwide network of networks.


Internet Architecture Board (IAB) Committee for standardisation and other important decisions for the Internet.

Internet Engineering Task Force (IETF) Committee for the analysis and clearing of technical problems with respect to the Internet. The members of IETF report to the Internet Architecture Board (IAB).

Internet-Resources Any kind of information accessible via Internet.

Internet Service Provider (ISP) An organisation which offers Internet connections.

IPR Intellectual Property Right

ISDN Integrated Services Digital Network. The natural evolution of the PSTN towards a fully digital network. Allows data transfer speed up to 64 KBPS for a basic rate interface (BRI).

ISP See Internet Service Provider

JPEG Joint Photographic Experts Group. Image compression standard.

Link Reference to another document. If the link is used, the corresponding document will be loaded.

Loss-less compression Any method of data compression that guarantees the original data can be reconstructed exactly, bit-for-bit.

Lossy compression Any method of data compression that reconstructs the original data approximately, rather than exactly.

Luminance Perceived brightness, or greyscale level, of a colour. Luminance and chromaticity together fully define a perceived colour.

LZW Lempel-Ziv-Welch image file compression algorithm

MARC Machine Readable Catalogue. MARC is an exchange format used to import/export bibliographic records, e.g. UNIMARC, USMARC

Metadata Data to describe the documents (as e.g. libraries catalogue data). Metadata should guarantee a unique identification key for each document.

Activities for standardisation:
− Dublin Core Set
− Warwick Framework
− PURL-Concept of OCLC

MIME Multipurpose Internet Mail Extensions. Enhancement to SMTP / 7-bit limitation. Handles all types of content through E-mail (e.g. images)

MUMLIB Multimedia Methodology in Libraries

MURIEL Multimedia Education System for Librarians Introducing Remote Interactive Processing of Electronic Documents

NCI data Non-coded information; e.g. image files

NIST National Institute of Standards and Technology

NLC National Library of Canada. Implementor of CanSearch, Z39.50 origin

OCLC On-line Computer Library Centre, Dublin, Ohio.

OCR Optical Character Recognition

ONE OPAC Network in Europe. European Project that aims at investigating and evaluating Z39.50 implementations and search and retrieval APIs.

OPAC On-line Public Access Catalogue


OSTA Optical Storage Technology Association. World-wide association of optical storage systems producers (represents > 70% of the market). Specification of the UDF based on ISO 13346 (1996)

Palette The set of colours available in an indexed-colour image. In PNG, a palette is an array of colours defined by red, green, and blue samples.

Pixel The information stored for a single grid point in the image. The complete image is a rectangular array of pixels.

PDF Portable Document Format. PDF is the native format of Adobe Acrobat.

Perl Practical Extraction and Report Language. An interpreted language. Mostly used to implement CGI scripts for Web servers.

PIN-code Personal Identification Number code usually used in order to authenticate an end-user

PNG Portable Network Graphics. A standard format for lossless bitmapped image files. The intention was to replace GIF.

PNG editor A program that modifies a PNG file and preserves ancillary information, including chunks that it does not recognise. Such a program must obey the rules given in Chunk Ordering Rules

PSTN Public Switched Telephone Network. Just to denote the plain old telephone infrastructure. Allows data transfer speed up to 28,800 BPS

PURL Persistent URL.

RAID Redundant Array of Inexpensive Discs. Hard disc systems that store data at different security levels. A five-level system was defined at the University of California, Berkeley, in 1987. The RAID 7 architecture (7 levels) gives several hosts access to one array system.

RAMA/CHIO Remote Access to Museum Archives is a European project managed by Telis which aimed at interconnected museum archives in Europe / Cultural Heritage Information On-line is a North-American initiative identical to RAMA.

RPN Reverse Polish Notation. A basic query language used in Z39.50 in order to issue search requests

RRO Reproduction Rights Organisation (IFRRO, VG Wort, CLA, CCC, etc.)

Scanline One horizontal row of pixels within an image.

Secure Location Identifier Unique identifier which protects the document from unauthorised access.

Service-Provider See Internet Service Provider

SGML Standard Generalised Mark-up Language. SGML is an ISO standard widely adopted by publishing professionals

Shell-Account see Dialup-Account

SMTP Simple Mail Transfer Protocol. Internet E-mail

SSL Secure Socket Layer. Low level packet encryption mechanism. Current version is SSL 3.0

STM Scientific, Technical and Medical publications (publishers)

SUTRS Simple Unstructured Text Record Syntax. A Z39.50 record syntax that allows an origin to retrieve a result set of bibliographic records

Tags a) In HTML: tags define the structure and presentation of documents.

b) In TIFF: categories for the description of the TIFF file.

TCP/IP Transmission Control Protocol / Internet Protocol. Denotes the stack of software required for a machine to connect to the Internet. This covers layers 3 and 4 of the 7-layer OSI model.

TEI Text Encoding Initiative. Guidelines for Electronic Text Encoding and Interchange (final version 1994). Based on SGML.

TIFF Tag(ged) Image File Format (Aldus Corp.). Graphics file format. Latest revision is 6.0.

TIFF Header Contains the file description of a TIFF file.

Truecolour An image representation in which pixel colours are defined by storing three samples for each pixel, representing red, green, and blue intensities respectively. PNG also permits an alpha sample to be stored for each pixel of a truecolour image.

UNIMARC UNIversal MARC. Widely supported bibliographic MARC records

URL Uniform Resource Locator. A Web reference of a document (or object), e.g. http://www.telis-sc.fr denotes the URL of Telis S&C Web server.

VAN EYCK Visual Arts Network for the Exchange of Cultural Knowledge (EC project)

W3C An organisation jointly founded by MIT and CERN to manage the development of the WWW

WAIS See Wide Area Information Server

WAN See Wide Area Network

WebDOC Project where Pica is co-operating with publishers in order to provide a document delivery on the Web (http://www.pica.nl)

White point The chromaticity of a computer display's nominal white value.

WIPO World Intellectual Property Organisation. Belongs to the UN. WIPO is seated in Geneva.

WORM Write Once Read Many

WORM disc Disc, optical disc, applying WORM technology

World Wide Web A Hypertext based system for retrieval and access to Internet resources

WWW see World Wide Web

X.400 CCITT Messaging System

XBM X-Bitmap file format. Standard format for the storage of bit-map graphics under X-Windows. See also GIF

X-Windows System A network based window system which was originally developed by the Massachusetts Institute of Technology (MIT). X-Windows (also called “X”) is mainly used for UNIX computers.

YAZ Yet Another Z39.50 implementation (http://www.indexdata.dk/yaz)

Z39.50 ANSI and ISO standard. Z39.50 is a search and retrieval protocol widely accepted within the libraries community in the USA and Europe

zlib A particular format for data that has been compressed using deflate-style compression.
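Several of the glossary entries above (CRC, Deflate, zlib, Loss-less compression) describe mechanisms that a few lines of Python can make concrete. The sketch below uses the standard `zlib` module; the function names and the choice to append the CRC as four big-endian bytes are assumptions made for this illustration and do not reproduce the PNG chunk layout.

```python
import zlib


def append_crc(data: bytes) -> bytes:
    """Encoder side: compute a CRC-32 over the data and append it (4 bytes, big-endian)."""
    return data + zlib.crc32(data).to_bytes(4, "big")


def check_crc(packet: bytes) -> bytes:
    """Decoder side: recompute the CRC and compare it with the appended value."""
    data, received = packet[:-4], int.from_bytes(packet[-4:], "big")
    if zlib.crc32(data) != received:
        raise ValueError("CRC mismatch: the data was corrupted in transit")
    return data


def roundtrip_deflate(data: bytes) -> bytes:
    """Compress with deflate (zlib format) and decompress again."""
    return zlib.decompress(zlib.compress(data))


if __name__ == "__main__":
    scanline = bytes([0, 0, 255, 255] * 64)  # hypothetical sample image data
    # Lossless compression: the original bytes come back exactly.
    print(roundtrip_deflate(scanline) == scanline)
    # An intact packet passes the check.
    print(check_crc(append_crc(scanline)) == scanline)
```

Changing even a single byte of the packet before calling `check_crc` makes the comparison fail, which is the behaviour the CRC entry describes.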


9. References, URLs etc.

• GIF Info: http://www.geocities.co.jp/SiliconValley/3453/gif_info/index_en.html

• JPEG Home Page: http://www.jpeg.org/public/jpeghomepage.htm

• PNG Homepage: http://www.cdrom.com/pub/png/

• PDF: http://www.

• TIFF Revision 6.0: http://www.jgd.fhg.de/icib/it/defacto/company/aldus/read.html

• Wavelet Digest Home Page: http://www.wavelet.org/wavelet/index.html

• Report of the technical DFG working group “Verteilte digitale Forschungsbibliotheken”: http://wwww.sub.uni-goettingen.de/ebene_2/vdf/einstieg.htm

• Glossary of Digital Age Terms: http://strategy.gemconsult.com/resources/glossary/index.htm