Digitisation and Access to Archival Collections: A Case Study of the Sofia Municipal Government (1878 – 1879)

Maria Nisheva-Pavlova, Pavel Pavlov
Faculty of Mathematics and Informatics, Sofia University, and Institute of Mathematics and Informatics, Bulgarian Academy of Sciences

Nikolay Markov, Maya Nedeva
General Department of Archives at the Council of Ministers of the Republic of Bulgaria

Post on 20-Dec-2015


Introduction

The paper presents an ongoing project aimed at developing a methodology and corresponding software tools for building environments that provide semantics-oriented, web-based access to digitised archival collections.

We assume that these collections are heterogeneous, i.e. they may include diverse types of materials (official handwritten, typewritten or printed documents, letters, photographs, newspapers, maps etc.), and that the texts of the documents within them may be written in different languages.

The practical experiments have been performed on a collection of archival documents from the period of the organization of the Sofia Municipal Government (1878 – 1879).

International encoding standards as well as Semantic Web methods and technologies have been used.

The main difference from other similar projects is the exploration of the idea that the use of proper general-purpose and domain-specific ontologies can minimize the resources necessary for the development of tools for adequate, semantics-oriented access to heterogeneous (including distributed) multilingual archival collections.

Main objectives of the project

To define suitable metadata to accompany digitised documents from archival collections, in accordance with international standards, traditional Bulgarian archival experience and the needs of the target groups of users.

To study the various aspects of creating an appropriate ontology for the collection (e.g. the scope of the ontology, the corresponding linguistic problems etc.).

To explore the needs of the typical users of the archival collection (experts in various domains and the general public) in order to give proper kinds of access to it; in particular, to provide versatile, user-friendly access to the collection based on the semantics of its content.

To develop a framework (intended for users who are professional archivists) for the application of Semantic Web methods and technologies to digitised collections of archival documents.

Representation of the archival documents

The archival collection consists of approximately 980 original handwritten documents from the period around and after the end of the Russo-Turkish war (1877 – 1878). This is the period when the building of the foundations of the Bulgarian state and municipal institutions was initiated and the basic rules of the contemporary Bulgarian language had yet to be drawn up.

Thus the documents within the collection are of great scientific, historical and social value and are of interest to archivists, historians, linguists etc.

For these reasons we consider it expedient to include in the electronic version of our collection not only digital images of the chosen archival documents but also structured electronic transcriptions of their full texts, together with proper descriptions of the collection as a whole, of its parts (known as archival units) and of every individual document in it.

Description of the structural parts of the archival collection

These descriptions have been prepared in conformity with the traditional practice of Bulgarian archivists.

The structure of Bulgarian archives consists of four levels of hierarchy: archival funds, inventory lists, archival units and individual documents. The descriptions at all levels have been structured and accompanied by proper sets of metadata according to the requirements of the EAD encoding scheme.
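The four-level hierarchy can be pictured as a simple tree. The following Python sketch is illustrative only: the class name `ArchivalNode`, the English level labels and the sample identifiers are assumptions made for this example and are not part of the project's software.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

# The four levels named in the text, from top to bottom (English labels assumed).
LEVELS = ["archival fund", "inventory list", "archival unit", "document"]


@dataclass
class ArchivalNode:
    """One node of the hierarchy (hypothetical model, not the project's schema)."""
    identifier: str
    level: str
    children: List["ArchivalNode"] = field(default_factory=list)

    def add(self, child: "ArchivalNode") -> "ArchivalNode":
        # A child must sit exactly one level below its parent.
        expected = LEVELS.index(self.level) + 1
        if expected >= len(LEVELS) or child.level != LEVELS[expected]:
            raise ValueError(f"{child.level!r} cannot be placed under {self.level!r}")
        self.children.append(child)
        return child

    def walk(self, depth: int = 0) -> Iterator[str]:
        # Depth-first traversal, mirroring top-down browsing of the collection.
        yield "  " * depth + f"{self.level}: {self.identifier}"
        for child in self.children:
            yield from child.walk(depth + 1)


# All identifiers below are placeholders chosen for the example.
fund = ArchivalNode("fund-1", "archival fund")
inventory = fund.add(ArchivalNode("inventory-1", "inventory list"))
unit = inventory.add(ArchivalNode("unit-1", "archival unit"))
unit.add(ArchivalNode("doc-1", "document"))
outline = list(fund.walk())
```

Enforcing the level order in `add` keeps the tree aligned with the four levels; a description record (for instance an EAD fragment) could then be attached to each node.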

For example, the EAD-compliant description of an archival fund contains data about the type of the fund, the dates (starting and final years) of creation of its documents, its logical structure and physical extent, the genre(s) and language(s) of its documents, the materials, technologies and methods used to create the documents and other materials in it, as well as brief information about the administrative history of the corresponding corporate body, the history of the fund etc.

Part of the description of archival fund 1K according to EAD standard
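As a rough illustration of the kind of record such a description involves, the following Python sketch assembles a minimal EAD-style fragment with the standard library. The element names follow EAD 2002 (`archdesc`, `did`, `unitid`, `unittitle`, `unitdate`, `physdesc`/`extent`, `langmaterial`, `bioghist`). The function name and all values are assumptions: the fund number 1K is the one named in the caption, the years are borrowed from the title of the case study, and everything else is a placeholder. The project's actual records may be structured differently.

```python
import xml.etree.ElementTree as ET


def build_fund_description(unit_id, title, start_year, end_year, extent, lang_code, history):
    """Return a minimal EAD-style <archdesc> element for an archival fund (sketch)."""
    archdesc = ET.Element("archdesc", level="fonds")
    did = ET.SubElement(archdesc, "did")
    ET.SubElement(did, "unitid").text = unit_id
    ET.SubElement(did, "unittitle").text = title
    # The 'normal' attribute carries the machine-readable start/end years.
    date = ET.SubElement(did, "unitdate", normal=f"{start_year}/{end_year}")
    date.text = f"{start_year}-{end_year}"
    physdesc = ET.SubElement(did, "physdesc")
    ET.SubElement(physdesc, "extent").text = extent
    langmaterial = ET.SubElement(did, "langmaterial")
    ET.SubElement(langmaterial, "language", langcode=lang_code).text = lang_code
    # The administrative history of the corporate body goes into <bioghist>.
    bioghist = ET.SubElement(archdesc, "bioghist")
    ET.SubElement(bioghist, "p").text = history
    return archdesc


# Placeholder values for illustration only.
fund = build_fund_description(
    unit_id="1K",
    title="Title placeholder",
    start_year=1878,
    end_year=1879,
    extent="Extent placeholder",
    lang_code="bul",  # ISO 639-2 code for Bulgarian
    history="Administrative history placeholder.",
)
ead_xml = ET.tostring(fund, encoding="unicode")
```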

Representation of electronic transcriptions of full texts of archival documents

We maintain two different digital forms of each original archival document: its digital image and an electronic transcription of its full text (in XML format).

The digital images of the original documents are intended mainly for visualization purposes, while the electronic transcriptions of the documents and their EAD-encoded descriptions will be used to support various types of search and document retrieval.

For the representation of the structured electronic transcriptions of the full texts of archival documents we use the TEI standard.

The structure and the contents of the various kinds of documents within the collection (instructions, orders, reports, records of sessions, letters, requests, petitions etc.) were explored and a generalized model of these documents was created. A proper set of elements and attributes from the TEI document type definition was adopted to describe this model.

Part of the <teiHeader> element of the electronic transcription of an archival document

TEI-conformant representation of the text of an archival document
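To give an idea of how a TEI-encoded transcription can feed search tools, the following Python sketch parses a small sample and extracts a few fields. The sample document is hypothetical (its wording, date and `type` value are placeholders, not archive content), the TEI P5 namespace is an assumption about the TEI version, and `summarise` is a name introduced for this example.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}  # TEI P5 namespace (assumed version)

# Hypothetical sample transcription; all values are placeholders.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample order</title></titleStmt>
      <publicationStmt><p>Unpublished sample.</p></publicationStmt>
      <sourceDesc><p>Transcribed from a handwritten original.</p></sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation><date when="1878-05-01">1 May 1878</date></creation>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="order">
        <p>First paragraph of the sample text.</p>
        <p>Second paragraph of the sample text.</p>
      </div>
    </body>
  </text>
</TEI>"""


def summarise(tei_xml):
    """Extract fields that a search tool might index from one transcription."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        "tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title", namespaces=TEI_NS
    )
    date = root.find("tei:teiHeader/tei:profileDesc/tei:creation/tei:date", TEI_NS)
    div = root.find("tei:text/tei:body/tei:div", TEI_NS)
    paragraphs = [p.text for p in div.findall("tei:p", TEI_NS)]
    return {
        "title": title,
        "date": date.get("when"),   # machine-readable creation date
        "kind": div.get("type"),    # kind of document (order, report, letter ...)
        "paragraphs": paragraphs,
    }


record = summarise(SAMPLE)
```

The `date` and `kind` fields correspond to the sort of element values that chronological or document-kind searches could rely on.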

Access to the collection

The user can switch between two types of interface to the chosen collection:

The first one is based on the principles of the “standard” archivist’s view of an archival collection.

The second type of on-line access to the collection may be described as semantics-oriented.

The homepage providing various types of access to the collection

The interface oriented to the standard archivist’s point of view allows the user to browse the hierarchical structure of the collection as a whole.

At the archival fund and inventory list levels the user has access to the EAD-encoded description of the corresponding unit (in XML format) and to a properly visualized form of the same metadata (in PDF format).

Interface to the collection supporting the standard archivist’s view (at archival fund level)

The user interface at archival unit level allows one to browse five different forms of each document in the corresponding archival unit: the EAD-encoded description of the document (in XML format), a visualization of this description (in PDF format), the TEI-encoded electronic transcription of the full text of the document (in XML format), a visualization of the electronic transcription (in PDF format) and a digital image of the original document (again in PDF format).

Interface to the collection supporting the standard archivist’s view (at archival unit level)

The other type of access to the archival collection is based on the use of explicitly represented knowledge describing different aspects of the semantics of the collection as a whole and of its structural parts.

A set of access tools (often called “finding aids”) realizing various types of document search and retrieval (chronological, oriented to the kinds of documents within the collection, subject-oriented etc.) is under development for the purpose.

The search engines of most of these tools use the values of the corresponding elements of the TEI-encoded versions of the archival documents.

In particular, subject-oriented document retrieval is based on the semantic annotation of the documents. The semantic annotation consists of appropriate words and phrases (chosen from a subject ontology) which describe the content of the document.

In our project we use a subject ontology (covering the main types of municipal activities) developed especially for the purpose. This ontology has been prepared using Protégé/OWL.

Interface to the collection supporting subject-oriented document retrieval

The topics shown in the last screenshot belong to a subset of the concepts at the highest two levels of the ontology, selected on the basis of assumptions about typical requests for information given the characteristics of the historical period.

Our future plans include generating the list of searchable topics automatically, using the results of a preliminary examination of the professional needs of the main groups of potential users.

The semantic annotations of the documents within the collection contain terms from all levels of the same ontology. When the user chooses a topic from the list shown in the last figure, the corresponding access tool finds all documents whose semantic annotations contain terms matching the user query (i.e. identical to the term chosen by the user or semantically related to it).
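The matching step can be sketched in a few lines of Python. This is a toy model under stated assumptions: the subject hierarchy, the document identifiers and their annotations are invented for the example, and "semantically related" is interpreted narrowly as "a narrower term in the hierarchy", which may be only one of the relations the actual tool considers.

```python
from typing import Dict, List, Set

# Toy subject hierarchy: each term maps to its narrower terms (hypothetical content).
NARROWER: Dict[str, List[str]] = {
    "municipal activities": ["public works", "finance"],
    "public works": ["street lighting", "water supply"],
    "finance": ["taxes"],
}

# Semantic annotations per document (hypothetical identifiers and terms).
ANNOTATIONS: Dict[str, Set[str]] = {
    "doc-1": {"street lighting"},
    "doc-2": {"taxes", "water supply"},
    "doc-3": {"finance"},
}


def expand(term: str) -> Set[str]:
    """Return the term together with all of its narrower terms."""
    found: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(NARROWER.get(current, []))
    return found


def retrieve(term: str) -> List[str]:
    """Find documents annotated with the chosen term or one of its narrower terms."""
    wanted = expand(term)
    return sorted(doc for doc, terms in ANNOTATIONS.items() if terms & wanted)
```

With this data, choosing the top-level topic "public works" returns the documents annotated with "street lighting" or "water supply", even though neither annotation contains the chosen term itself.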

A tool for searching the full texts of the document transcriptions is provided as well. In its implementation we intend to use some of our earlier results concerning the development of tools for ontology-driven search in collections of digitised manuscripts.

Conclusion

The most valuable expected results of our project can be formulated as follows:

A methodology for applying international standards, ontological knowledge and Semantic Web technologies to the development of software tools providing semantics-oriented access to heterogeneous multilingual collections of archival documents.

A model and a prototype of a website which gives users an interface supporting various types of access to a chosen archival collection.

The main advantage of our approach is the use of ontological knowledge describing the semantics of the individual documents in the archival collection as well as the semantics of the collection as a whole and of its structural parts. It allows users with different profiles to study and analyze the documents within the collection from multiple points of view using a single environment.

Acknowledgements

This work has been funded by the EC FP6 Project “Knowledge Transfer for Digitisation of Cultural and Scientific Heritage in Bulgaria” (KT-DigiCULT-BG), coordinated by the Institute of Mathematics and Informatics of the Bulgarian Academy of Sciences. The authors are thankful to Dr. Matthew Driscoll from Copenhagen University, Denmark, for his useful advice concerning the TEI encoding of the electronic copies of archival documents.