Gerhard Schneider – Rechenzentrum der Universität Freiburg: Aspects of Long Term Preservation of...
Post on 19-Dec-2015
Aspects of Long Term Preservation of Digital Libraries
Gerhard Schneider
Computing Centre & CS Department
University of Freiburg
Storage on Paper
• Longevity of the media
– paper lasts for centuries, no special care required
– except perhaps: acid in paper, water from burst pipes, fire, etc.
• Longevity of the description language
– except perhaps: Old English or the old German alphabet
– abstract terms: decoding is possible, as related information is available
– how about old Assyrian writings?
• Loss of information is a well-known phenomenon
– loss of old information is not so relevant to current society
• 5th book of Aristotle
– loss of new information is more or less impossible through the distribution of knowledge to many places
• thanks to Gutenberg
Storage on Paper
• Accessibility of printed information
– no special device is needed, except perhaps glasses
– no technical knowledge is required: “hands on”
• Outsourcing of the handling of knowledge distribution to publishers
– economically very successful - so successful that we can no longer afford to buy the books we wrote
• long term storage of information has been centralised in libraries
– high running costs
• library building, maintenance, staff required to manage books
– cost of storage may by far exceed the cost of acquisition
Storage on Paper
• If you don’t live close to the library, accessing information can be very difficult (3rd world countries)
• a rather costly machinery has been set up to ease the problem
– long distance inter-library loans
• staff intensive, cost of transportation
• photocopies of articles vs. copyright
• now: scanning articles and delivery via fax (sic!) or email
• the user is charged a nominal fee – nominal w.r.t. the cost of operation, not w.r.t. the user’s own budget
• Information is produced electronically
– most features are lost when the information is brought to paper
• It is only natural that scientists are asking for electronic libraries - given all the benefits
Electronic storage
• There are a few pitfalls when it comes to digital storage
– can you still read your old 5 1/4" floppies?
• Do you still have a device to read them?
– Well known problem in other areas:
• record players are rare these days.
• And if so, is there still anything on them?
– Magnets can erase information, and each information bit is a little magnet interfering with the others
– well known phenomenon also in other areas of magnetic storage
• music cassettes, tape recorders, video tapes
• Solution: digitally stored information can be copied to new media without any loss!
– The problem old fashioned industry is now facing w.r.t. CD-writers
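The claim that a digital copy loses nothing can be checked mechanically. The following is a minimal sketch, assuming the information lives in ordinary files; the function names `sha256_of` and `migrate` are hypothetical and not taken from any archive product. It copies a file to the new medium and compares checksums of both copies.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source: Path, target: Path) -> str:
    """Copy source to target and fail loudly if the copy is not bit-identical."""
    before = sha256_of(source)
    shutil.copyfile(source, target)
    after = sha256_of(target)
    if before != after:
        raise IOError(f"copy of {source} is not bit-identical")
    return after
```

Storing the returned digest next to the file would let a later migration confirm that nothing changed in between.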
Electronic storage
• Thus in principle we have a solution to the media problem:
– keep converting
– conversion can be done in a fully automated way, using robots
– the technology is available in most computer centres and used for automated backup and archive.
– Typical archive software recycles tapes which have been overused and copies the information onto new tapes, ejecting the old tapes.
• Interpretation of the contents
– bits carry no real information, interpretation by software is required before it can be presented to the human eye/ear
• New problem: convert the software that was used to generate the information.
– Well known problem: word processors can’t read old files
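That bits carry no meaning without an interpreter can be shown with four bytes. This is an illustrative sketch only; the byte string and the three chosen readings are assumptions for the example.

```python
import struct

raw = b"ABCD"  # four bytes with no inherent meaning

as_text = raw.decode("ascii")               # read as ASCII characters
as_uint32 = struct.unpack(">I", raw)[0]     # read as one big-endian 32-bit integer
as_uint16_pair = struct.unpack(">HH", raw)  # read as two big-endian 16-bit integers

print(as_text)         # ABCD
print(as_uint32)       # 1094861636
print(as_uint16_pair)  # (16706, 17220)
```

All three results are equally valid readings of the same bits; only the software that wrote them knows which one was meant.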
Format issues
• What do the bits mean?
– Simple, but good example: TeX
• Information and control commands are stored in plain ASCII
• The functionality of the control commands is exactly described in the TeX manual
• So, if you sit on an island with nothing but the bits and the TeX manual, you can find out what the paper is supposed to look like
– Try this with MS-Word – or, even better, MS-Powerpoint
• Putting data into electronic libraries only makes sense if the format is 100% specified
– Whether this description has to be in the document or in an accompanying file is of secondary interest.
– Keep it simple: the original document should be understandable even if additional structure information gets lost (or is difficult to retrieve)
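The island scenario can be sketched in a few lines: because a TeX source is plain ASCII, a reader with nothing but the bits can list the control words to look up in the manual. The helper name `control_words` is hypothetical, and the pattern deliberately covers only control words made of letters, not single-character control symbols.

```python
import re


def control_words(source: bytes) -> list[str]:
    """List the distinct TeX control words in a plain-ASCII source, in order."""
    text = source.decode("ascii")  # raises UnicodeDecodeError if not plain ASCII
    found = re.findall(r"\\[A-Za-z]+", text)
    return list(dict.fromkeys(found))  # drop repeats, keep first-seen order


sample = rb"\documentclass{article}\begin{document}Hello \emph{world}\end{document}"
print(control_words(sample))
```

Nothing comparable works on a binary word-processor file, where even separating text from control data requires the vendor's software.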
Format issues
• Example: the Kodak imaging software in MS-Windows allows the annotation of TIFF-files
– Can only be read with the Kodak software
– Or the annotation can be added permanently to the document, thus making it visible (and not removable) to any other TIFF reader.
• Text formats are precise – i.e. we know what has been typed
• Image formats are different, as information is lost during the scanning process
– By the lens itself
– By the sensor (300 dpi means that only 300 dots per inch are stored)
– By the storage format (i.e. do we get back what we stored?)
• Lossy vs faithful
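To make the sensor limit concrete, here is a small worked calculation of how many dots a scan actually stores. It is a sketch under stated assumptions: the page size (A4, about 8.27 x 11.69 inches) and the 1 bit per pixel depth are example values, and `scan_size` is a hypothetical helper.

```python
def scan_size(width_in: float, height_in: float, dpi: int, bits_per_pixel: int):
    """Return (pixels wide, pixels high, uncompressed bytes) for a scan."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    return width_px, height_px, (total_bits + 7) // 8  # round up to whole bytes


# A4 page, black and white, at 300 dpi: about 1.1 MB before compression.
print(scan_size(8.27, 11.69, 300, 1))
```

Everything finer than one dot in 1/300 of an inch is gone before any storage format is chosen.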
Format issues
• What does “lossy” mean?
– We do not get back all the information that we stored – sounds scary
– Did we see it in the first place? Is Fax lossy? (fax = 100 dpi or 200 dpi)
– Analog recording vs. CD vs. MP3
• CD is a lossy process, but what is really lost?
• MP3 is good enough, even for the young generation.
– What we lose depends on the algorithm
• Doctor’s scare: “vital Xray data is lost” – completely wrong
– Why lossy? Keeping the original information needs too much space and does not give any gain in knowledge.
• In addition “writing things down” is already a lossy process.
• “lossy” does not imply that we lose more and more information over time
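The last point can be illustrated with the simplest lossy step there is, quantization. This is a toy sketch, not how MP3 or any real codec works: samples are rounded down to a multiple of a step size, and the assumption is that later copies of the stored result are exact digital copies.

```python
def quantize(samples, step):
    """Lossy step: round every sample down to a multiple of `step`."""
    return [(s // step) * step for s in samples]


original = [3, 18, 47, 100, 255]
once = quantize(original, 16)
twice = quantize(once, 16)

print(once)           # [0, 16, 32, 96, 240]
print(once == twice)  # True
```

Detail is lost exactly once, when the data is encoded; repeating the step, or copying the result, loses nothing further.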
Complexity of storage
• When it comes to electronic media, we tend to ask for overkill, forgetting that we cannot do anything like that on paper
• When moving to the paperless office (my office does!!)
– after having solved the format issue in favour of RTF and TIFF, how do we store the documents?
– We use the filesystem and nothing else as it pretty well reflects the current structure of an office.
– Thus we are independent of the operating system and the management software
• all I need are long filenames and a tree structure, possibly access rights
• Thus we can get quite far before running into another really hard problem
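With long filenames and a tree structure as the only index, retrieval reduces to walking the tree. A minimal sketch, assuming documents are stored as RTF or TIFF files; the function name `find_documents` and the suffix list are illustrative choices.

```python
from pathlib import Path

ARCHIVE_SUFFIXES = {".rtf", ".tif", ".tiff"}


def find_documents(root: Path, keyword: str) -> list[Path]:
    """Return archive files under root whose relative path contains keyword."""
    needle = keyword.lower()
    hits = [
        path.relative_to(root)
        for path in root.rglob("*")
        if path.is_file()
        and path.suffix.lower() in ARCHIVE_SUFFIXES
        and needle in path.relative_to(root).as_posix().lower()
    ]
    return sorted(hits)
```

The search needs no database or document management system, so the archive stays readable on any operating system that can mount the files.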
Software issues
• In a multimedia environment, it may not be enough to convert the media, the software has to be recompiled
– a standard job in science, whenever a new computer architecture appears, just recompile and run.
– Most scientific software has little sophisticated I/O
– what happens if the software is intimately married to the underlying operating system
• like Word to Windows???
– Can we really afford to store our information in proprietary systems?
• i.e. systems which we cannot look into?
• Use system-independent data storage
– even if a loss of information occurs
• don’t put this information in in the first place....
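One way to read "system-independent data storage" is to fix the byte layout explicitly instead of relying on whatever the current machine uses. The sketch below assumes a made-up record layout (a 32-bit count followed by 64-bit IEEE 754 doubles, all little-endian); the function names are hypothetical.

```python
import struct


def pack_readings(values):
    """Encode floats as a count followed by little-endian IEEE 754 doubles."""
    return struct.pack(f"<I{len(values)}d", len(values), *values)


def unpack_readings(blob):
    """Decode a blob written by pack_readings on any architecture."""
    (count,) = struct.unpack_from("<I", blob, 0)
    return list(struct.unpack_from(f"<{count}d", blob, 4))


blob = pack_readings([1.5, -2.25, 3.141592653589793])
print(len(blob))              # 28
print(unpack_readings(blob))  # [1.5, -2.25, 3.141592653589793]
```

The `<` prefix pins byte order and field sizes, so the same file decodes identically regardless of the processor or operating system that reads it.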
Live documents
• After all we want live documents
– query and retrieval
• how many libraries are locked into old fashioned systems because their data cannot be converted?
– hyperlinks
– computer games
– simulation
• upgrades to new versions on new operating systems are upward compatible - hopefully
• a manufacturer may decide NOT to move to a new platform
– make as much money as possible and vanish
• reimplementation may not be a solution
– incompatibilities, copyright issues, errors become historic features
Solution
• Why not specify the programming environment along the lines of the file format discussion?
• Use Java!
– Port the Java engine to a new environment and you are “done”
• Unfortunately:
– Users like their own programming environment
– Environments are made for performance (data bases)
– And not for long term storage
• So we have to face the real world
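The idea behind porting the engine rather than the programs can be sketched with a toy stack machine. This is an illustration of the principle only and has nothing to do with the real Java virtual machine or its bytecode; the opcodes `PUSH`, `ADD` and `MUL` are invented for the example.

```python
def run(program):
    """Execute a list of (opcode, *operands) tuples on a tiny stack machine."""
    stack = []
    for opcode, *operands in program:
        if opcode == "PUSH":
            stack.append(operands[0])
        elif opcode == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown opcode: {opcode}")
    return stack.pop()


# The stored "document" is the program; only run() must be ported.
program = [("PUSH", 2), ("PUSH", 3), ("ADD",), ("PUSH", 4), ("MUL",)]
print(run(program))  # 20
```

Every archived program keeps working as long as one small, fully specified interpreter is rewritten for each new platform.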
Solutions??
• Keep a museum of running machinery?
• Emulation?? (Idea of Rothenberg, RAND Corp)
• during a phase of transition emulators are typically available
• Example: Lots of games were available for the C64 and are still kept (collected) in libraries, without a working environment
• emulators are available: CCS64 v 1.09 runs under Windows
Emulators under Windows
Sinclair ZX Spectrum
Atari emulator
Even emulators for modern PalmPilots
Even more emulators
• Even for Sony’s Playstation, there is an emulator under Win98
• There is a Palm emulator for the Gameboy, running in the Windows emulator of the Palm, which runs…
What about Windows?
• Running NT under Linux on an Intel machine
• Or:
• Running Linux under NT on an Intel machine
• Or:
• Running NT under Windows XP
What about other hardware?
• Emulate Windows on other hardware (Macintosh):
Observation
• Many software developers use emulators to cross-compile applications for new environments
• Thus emulators do exist in most environments
• Can we obtain them from the manufacturers?
– Copyright issues
– company secrets
– maybe enforce a deposit of software emulators in a safe??
• For later use?
Using emulators
• Emulators typically store an environment in one special file
• Application example (tested) for VMWARE
– install Windows 98 in a VMWARE box
• keep the resulting file as a reference installation
– install one computer game (or one programme setup) under a copy of the reference installation
– store the resulting file in a digital library with the name of the game as metadata
– to play the game, start your computer (either NT, Win2k or Linux), start VMWARE with that specific file and .... Play!
• the file can be exchanged between operating systems
– to convert the file from one storage medium to another, use the standard process
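The library side of this workflow, storing the resulting file with the name of the game as metadata, can be sketched as follows. This is a minimal sketch under assumptions: the function name `ingest`, the JSON sidecar and its field names are hypothetical, and the code does not call the emulator itself.

```python
import hashlib
import json
import shutil
from pathlib import Path


def ingest(image: Path, library: Path, title: str) -> Path:
    """Copy an emulator image into the library and write a metadata sidecar."""
    library.mkdir(parents=True, exist_ok=True)
    stored = library / image.name
    shutil.copyfile(image, stored)
    record = {
        "title": title,
        "file": stored.name,
        # Whole-file read keeps the sketch short; a real image of several
        # gigabytes would be hashed in chunks instead.
        "sha256": hashlib.sha256(stored.read_bytes()).hexdigest(),
    }
    sidecar = stored.with_name(stored.name + ".json")
    sidecar.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return sidecar
```

Because the sidecar is plain text, the catalogue entry stays readable even if the software that manages the library is itself lost.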
Using emulators
• At some stage, the PC technology will die. Very likely there will be an emulator for the old fashioned PC on the new hardware, at least for a limited time.
• During this time, set up a scheme to use that emulator to run your favourite operating system and install your favourite emulator under the emulated environment.
• If this works, continue to use all the old files
• If it fails, some development has to be carried out
– money on such projects is wisely spent: one local solution is a solution for the whole world
• Performance is not an issue!
Performance of emulators
• Machines get faster
– VMWARE loses roughly a factor of 2, so on an 800 MHz machine it appears as if the original code were running on a 350 MHz machine
• we will thus keep even the original “feeling” of the software
• For some time, before machines get faster
• Experience: a whole server setup can be run under emulators
– VMWARE even has network and USB connection
– a complete digital library system, when installed under VMWARE, can be kept in one (huge) file and preserved for the future, at least for a limited time
• which is better than losing it right away
– The hardest part is to convince a sysadmin not to use the real machine
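The speed figures above follow from a single division. As a worked example, using only the numbers on the slide and two hypothetical helper names:

```python
def effective_mhz(host_mhz: float, slowdown: float) -> float:
    """Apparent clock speed of the emulated machine."""
    return host_mhz / slowdown


def implied_slowdown(host_mhz: float, apparent_mhz: float) -> float:
    """Slowdown factor implied by a host speed and an apparent speed."""
    return host_mhz / apparent_mhz


print(effective_mhz(800, 2))                 # 400.0
print(round(implied_slowdown(800, 350), 2))  # 2.29
```

An exact factor of 2 would give 400 MHz, so the quoted 350 MHz corresponds to a slowdown of about 2.3; either way the emulated machine is in the range of hardware that the original software was written for.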
Using emulated environments even today
• A typical “library loan” requires the retrieval of the software and the handing over to the customer
– customer may lose parts of the software (diskette, documentation)
– customer may have problems with the installation and the librarian cannot help, since a computer expert is required
• using the emulated version means the retrieval of a file from a digital library (electronic storage) and its installation (i.e. a copy process) on the library computer (which has an emulator installed)
– no manpower involved, instant service to the customer
• it suffices to have one reference installation in the world
– libraries could trade the files, provided they own the copyright of the “computer game”
Summary
• Emulators may be the only way to preserve a complex software environment
– a “living” environment in contrast to a “dead” environment like a book (text or image)
• Digital libraries themselves are complex software environments, which depend on hardware and operating systems
• This is a current Ph.D. project at the University of Freiburg.
– How far can we go?
– Apparently very far…..