
SOFTWARE—PRACTICE AND EXPERIENCE
Softw. Pract. Exper. 2004; 34:1051–1064 (DOI: 10.1002/spe.603)

Mapping of bibliographical standards into XML

Mirjana Jakšić∗,†

Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Trg D. Obradovića 4, Novi Sad 21000, Serbia and Montenegro

SUMMARY

The most popular bibliographical standards, which prescribe the exchange of bibliographical data in machine readable form, are MARC (Machine Readable Cataloguing) and UNIMARC (Universal Machine Readable Cataloguing). This paper presents two schemas, both written in the XML schema language: the UNIMARC format schema and the XML bibliographical records schema. The instance of the UNIMARC format schema is an XML document of UNIMARC format. This schema provides easier updating and adjusting of the UNIMARC format at the national or regional level. It was also the basis for the development of the XML editor for UNIMARC format description. The instance of the XML bibliographical records schema is an XML bibliographic record. This schema provides very thorough control of the contents and structure of bibliographical records, whether they were directly created as XML records or converted to XML from existing UNIMARC records. The validation process is implemented in the Java environment, with the Sun Multi-Schema XML Validator (MSV) package. The author is aware of the inevitable question of redundant information contained in these two schemas and proposes some possible solutions. Copyright © 2004 John Wiley & Sons, Ltd.

KEY WORDS: XML; UNIMARC; XML Schema language; bibliographic records; validation

INTRODUCTION

XML is used extensively in all fields where the exchange of structured documents in electronic form is required. The exchange of data in machine readable form has been performed for a very long time in the field of librarianship. Several standards which prescribe the format of such information have been developed, of which MARC (Machine Readable Cataloguing) and UNIMARC (Universal Machine Readable Cataloguing) are the best known. The MARC standard was developed by the Library of Congress [1], while UNIMARC [2,3] represents the international MARC format, developed by IFLA (International Federation of Library Associations and Institutions). It was created to solve the problems that arose when attempting to adapt the original MARC standard for application at the national level in different countries.

∗Correspondence to: Mirjana Jakšić, Department of Mathematics and Informatics, Faculty of Science, University of Novi Sad, Trg D. Obradovića 4, Novi Sad 21000, Serbia and Montenegro.
†E-mail: [email protected]

Published online 7 June 2004
Copyright © 2004 John Wiley & Sons, Ltd.

Received 11 August 2003
Revised 20 January 2004
Accepted 20 January 2004

With the development of computer technologies, the need to improve and modernize the library service has become evident. As a result, the possibilities for applying new technologies, such as XML, in the exchange and control of bibliographical data are now under consideration. It is no accident that XML is favoured for representing bibliographical data. Library standards for creating bibliographical records have a hierarchical structure, and XML is particularly well-suited to this type of structure. When working with any type of standard, it is also very important to make sure that the numerous rules which the standard prescribes are strictly followed. In XML this condition is fulfilled through the process of validation. Validation is the process of checking the correctness of an XML document against a given document definition: either a standard document type definition (DTD) or a schema written in a schema language. Validation is performed easily with available programme packages or tools, which provide precise information if an error occurs.

The idea of converting bibliographical data into XML is not new. The basis of the Medlane Project of Stanford University Medical Center [4] is a Java client/server application which performs the conversion of MARC records into XML documents. The validation of XML records has been performed with reference to DTDs. In 2002 a new result of this project was published: XOBIS (XML Organic Bibliographic Information Schema) [5], a schema written in the RELAX NG schema language. XOBIS could be regarded as a member of a set of schemas, each created to conform to a different area of information management. Such a superstructure of independent and coordinated schemas would lead to the integration of library systems. In comparison with the other projects referenced below, and with the approach reported in this paper, XOBIS goes beyond simply translating MARC's existing structure into XML: it aims to replace MARC.

A system for handling MARC data in an XML environment that consists of three levels has been developed as part of the MARC XML project of the US Library of Congress [6,7]. The basis of the system consists of MARC records and a tool that performs the conversion of those records into XML records. On the next level, an XML schema in the RELAX NG schema language is defined. This schema describes the XML bibliographical records. Based on the schema, software tools for processing XML MARC records are developed on the third level.

In [8] the role of XML in library information systems is discussed at three levels: the transportation of bibliographic data, the representation of complex validation rules, and finally, the description of services through which such data can be exploited. The idea of separating transport from validation is a different approach from that presented in this paper.

The use of XSLT for XML MARC record conversion is presented in [9], together with an XSL stylesheet for converting MARC21 XML into UKMARC XML. The argument in favour of this approach was that XML, XSL and XSLT were considered more practical, since IT staff who know these technologies are readily available and could apply their general knowledge to the conversion techniques using the identified rules. However, the conclusion states that there are still many reasons why it is too early for this approach to be used in practice.

The development of the BISIS library software system began in 1993 at the University of Novi Sad [10]. In version 3.0, an authentic text server, which indexes and retrieves bibliographical records in UNIMARC format, was created. Support for the Unicode standard was implemented consistently throughout the entire BISIS system, version 3.0. As part of the further development of the BISIS system, this paper describes the mapping of the UNIMARC format into the XML Schema language and the conversion of UNIMARC bibliographical records into XML [11].

There is no literature currently available which describes the complete mapping of the UNIMARC format into any of the schema languages. Why should a schema of UNIMARC format be created at all? An instance of this schema is an XML document of UNIMARC format. When would such a schema be used? One possible usage, important at the national or regional level in any country, is updating the UNIMARC format, for which there are two reasons. The first reason is updating UNIMARC for the purpose of monitoring format supplements that are published, and the other reason is adjusting UNIMARC to national or regional use, according to the instructions given for block 9. In every library, there is a need to define UNIMARC subsets which are used for processing the different types of publications: monographs, serial publications, audio and video records, etc. This schema would provide the opportunity to check whether the chosen set of fields and subfields is correct with respect to the format definition.

This paper describes how a UNIMARC format schema can be formed. Owing to the similarity of concepts which exist in the MARC format, corresponding schemas of the MARC format can also be formed.

The solution for mapping UNIMARC into XML presented in this paper differs from other approaches mentioned above in the following ways.

• An effort is made to keep the mapping of UNIMARC records (i.e. of the UNIMARC format) into XML documents as close as possible to the original, both in structure and in the labelling of concepts.

• In cases where one subfield keeps more than one piece of information, a new hierarchy level within the subfield is made. This solution provides easier searching and management of XML records.

• For representing data from code lists or those defined through patterns, the best possible tools which the XML Schema language offers are used.

• For each existing indicator, each possible value is precisely represented, and in the UNIMARC format schema the meaning of each value is also represented.

SCHEMA OF THE UNIMARC FORMAT

Part of the UNIMARC format schema is graphically shown in Figure 1, in the Design view defined by the XML Spy editor. The root element is UNIMARCformat, which consists of sub-elements representing blocks, the names of which are of the form b∗, where ∗ ∈ [0…9]. A block property is the description of a block, which is mapped into the desc attribute of the string type. This attribute is defined in the attributes group AGblock. Figure 1 shows an element that corresponds to block 2 and several of its fields. The elements marked with broken lines are not mandatory according to the format definition. To their attribute minOccurs, therefore, the value 0 is assigned, and these elements can be left out of a schema instance. Elements marked with full lines are mandatory.

Figure 1. Part of the UNIMARC format schema.

Each block contains several elements which represent fields, the names of which are of the form f∗∗∗, where ∗ ∈ [0…9]. The three figures appearing in the name of an element which represents a field correspond to the three-figure code of the field in the UNIMARC format. The properties of a field are its description, whether it is mandatory and whether it is repeatable. These properties are mapped into the attributes desc (string type), mandatory (Boolean type) and repeatable (Boolean type). These three attributes belong to the following four groups of attributes: AGfield, AGfield1, AGfield 0, AGfield10. Exactly one of these groups is assigned to each of the elements depending on its properties. The groups are defined in the following manner.

• AGfield: none of the values of any of the attributes are fixed. It is assigned to the fields that are not mandatory and can be repeated. Such fields can be declared as mandatory and/or repeatable in a concrete subset.

• AGfield1: the value of the mandatory attribute is fixed to 1. It is assigned to the fields that are mandatory and repeatable. In a concrete subset, such a field can be declared as unrepeatable.

• AGfield 0: the value of the repeatable attribute is fixed to 0. It is assigned to the fields that are neither mandatory nor repeatable. Such a field can be declared as mandatory in a concrete subset.

• AGfield10: the value of the mandatory attribute is fixed to 1, and that of the repeatable attribute to 0. It is assigned to the fields that are mandatory but not repeatable. The properties of such a field cannot be altered.
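The effect of the four groups on subset definitions can be illustrated with a minimal Python sketch. It is not taken from the schema itself: the dictionary layout and the function name subset_declaration_allowed are hypothetical, and the key AGfield0 stands for the group printed as "AGfield 0" in the text, whose exact spelling is uncertain. A value of None means that the format does not fix the attribute.

```python
# Hypothetical model of the four attribute groups described above.
# None = the attribute is not fixed; True/False = fixed to 1/0 by the format.
ATTRIBUTE_GROUPS = {
    "AGfield":   {"mandatory": None, "repeatable": None},
    "AGfield1":  {"mandatory": True, "repeatable": None},
    "AGfield0":  {"mandatory": None, "repeatable": False},  # "AGfield 0" in the text
    "AGfield10": {"mandatory": True, "repeatable": False},
}


def subset_declaration_allowed(group, mandatory, repeatable):
    """Return True if a subset may declare a field of this group
    with the requested mandatory/repeatable values."""
    fixed = ATTRIBUTE_GROUPS[group]
    requested = {"mandatory": mandatory, "repeatable": repeatable}
    # A requested value is acceptable when the attribute is not fixed,
    # or when it agrees with the fixed value.
    return all(
        fixed[name] is None or fixed[name] == requested[name]
        for name in requested
    )
```

Under these assumptions, a field of the third group may be declared mandatory in a subset but may not be declared repeatable, while a field of group AGfield10 admits only the combination mandatory and not repeatable.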

Figure 1 shows field 207 which has the second indicator and two subfields defined. To this field the attributes group AGfield 0 is assigned; it is therefore neither mandatory nor repeatable according to the definition. In a particular subset this field could be declared as mandatory.

Each field represents a sequence of elements that correspond to indicators and subfields. A field can have the first and/or the second indicator defined (Figure 2). Elements that correspond to the indicators are named ind∗, where ∗ ∈ {1, 2}. An indicator has a desc attribute of string type defined, which contains information on the meaning of the indicator. An indicator is a sequence of two or more elements, the names of which are ic∗, where ∗ ∈ {0, 1, 2, 3}; one such element represents one allowed value of the indicator and has two attributes defined: the attribute code of byte type, which contains the allowed value of the indicator to which the element belongs; and the attribute desc of string type, which contains the explanation of the meaning of that value. Only four different structure cases of the allowed indicator values have been observed in the UNIMARC format: 0 or 1; 0 or 1 or 2; 1 or 2; 0 or 1 or 2 or 3. As a consequence of this observation, the schema has one data type defined globally for each case, thus assigning to each element of the indicator a corresponding type. If a new case for possible indicator values appears in an update of the UNIMARC format, the schema must be changed. For this reason, both schemas are divided into three schema files, two of which are the libraries of types for indicators and for the code lists, so that such an update does not affect the main schema file.

Figure 2. The first indicator of the field 207.

Elements that represent subfields have names of the form sf∗, with ∗ being one alpha-numerical character corresponding to the subfield identifier in the UNIMARC format. An analysis of UNIMARC has identified several types of subfield content: numerical or textual content; content defined in one of the general code lists (such as lists for languages or countries); content defined by a specific code list relative only to the subfield in question; content which is distributed in the subfield segments, which are likewise defined by code lists applicable only to that particular segment; and content which may be one of the fields which are secondary by format definition (this applies only to block 4).

The subfields that have content distributed in several segments, with each segment containing data from the code list specific to that segment, are mapped in the following manner (Figure 3). An element of such a subfield is a sequence of elements that represent segments. The name of a segment element is of the form ssfield∗, where ∗ marks the end position of that segment in a subfield. Each element of a segment has two attributes defined, startPos and endPos of byte type, which contain data on the start and end positions of a segment in the subfield. Since these data are set by the UNIMARC format, the values of these attributes are fixed to those predefined values. For example, Figure 3 shows subfield a of field 100, the content of which is distributed in several segments. The start and the end position of segment ssfield21 is 21, which means that the segment contains exactly one character. Based on its sub-elements inCode 0 and inCode 1, it is evident that the allowed characters are 0 and 1. The elements that correspond to the code list values are mapped in the same manner as indicator codes. They have two attributes defined: code of string type, which contains the code itself, and the attribute desc of string type, which contains the description of the code meaning. Both attributes are mandatory.

Figure 3. The subfield segment and its content.
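The role of the start and end positions can be illustrated with a short Python sketch that cuts the content of a segmented subfield into its named segments. The layout used is that of subfield a of field 105, as it can be read off the segment names in Figure 10; the start positions other than 0 are inferred from the segment contents shown there and are an assumption, as are the names SEGMENTS_105A and split_into_segments.

```python
# Assumed segment layout of field 105, subfield a: (name, startPos, endPos).
# Each name ends with the end position of the segment, as described above.
SEGMENTS_105A = [
    ("ssfield3", 0, 3),
    ("ssfield7", 4, 7),
    ("ssfield8", 8, 8),
    ("ssfield9", 9, 9),
    ("ssfield10", 10, 10),
    ("ssfield11", 11, 11),
    ("ssfield12", 12, 12),
]


def split_into_segments(content, layout):
    """Cut fixed-position subfield content into named segments."""
    expected_length = layout[-1][2] + 1  # last endPos is zero-based
    if len(content) != expected_length:
        raise ValueError(
            f"expected {expected_length} characters, got {len(content)}"
        )
    # endPos is inclusive, hence the +1 in the slice.
    return {name: content[start:end + 1] for name, start, end in layout}
```

Applied to the content ac######000ay, the sketch yields ac## for ssfield3, #### for ssfield7 and one character for each of the remaining five segments.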

The following attributes are defined for a subfield element.

• desc of string type, which represents the description of the subfield. This attribute is mandatory.

• length of byte type, which represents the maximum length of the subfield contents; it is not mandatory, since this information is not given in the format definition for every subfield.

• Group of attributes AGsfield∗, which includes the following attributes:

– mandatory of Boolean type, which states whether the subfield is mandatory or not; this attribute is mandatory;

– repeatable of Boolean type, which states whether the subfield is repeatable or not; this attribute is mandatory.

As is the case with field elements and their attribute groups AGfield∗, four attribute groups appear here as well, with exactly the same meaning: AGsfield, AGsfield 1, AGsfield 0, AGsfield10.

• extCode of byte type, which represents the identification number of the general code list, if the contents of that subfield are controlled in this manner. This attribute is not mandatory.

Figure 4. The language code list.

General code lists, such as those for languages and countries, are defined globally with the ext element, which is a sequence of elements corresponding to the following code lists: codeLanguage, codeCountry, etc. (Figure 4). For each of these elements an attribute extId of int type, which contains the identification number of the code list, is defined. Each element of a code list is a sequence of sub-elements corresponding to the individual codes in the list. For each sub-element an attribute code of the string type, which has the value of the code itself, is defined. The content of a sub-element is a description of the code meaning. For example, the sub-element l scc of the element codeLanguage refers to the Cyrillic alphabet of the Serbian language. The identity constraints of the schema language ensure that the values of a code list are unique and that the values entered for the extCode attribute of subfields cannot refer to a non-existent code list.
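The two guarantees obtained from the identity constraints can be restated procedurally. The following Python sketch is only an illustration of what is being checked, not of how the schema expresses it. The identifier 1 for the language code list follows Figure 5; the identifier 2 and the contents of the country list are hypothetical, as are all function and variable names.

```python
from collections import Counter

# Toy code lists keyed by their identification number (extId).
CODE_LISTS = {
    1: {"name": "codeLanguage", "codes": ["eng", "fre", "scc"]},
    2: {"name": "codeCountry", "codes": ["fr", "gb"]},  # hypothetical list
}


def duplicate_codes(codes):
    """Return the codes that occur more than once (uniqueness check)."""
    return sorted(code for code, count in Counter(codes).items() if count > 1)


def unknown_ext_codes(ext_codes, code_lists):
    """Return extCode values that do not identify an existing code list
    (referential check)."""
    return sorted(set(ext_codes) - set(code_lists))
```

A code list with a repeated code and a subfield whose extCode names a missing list would both be reported, which corresponds to the two situations that the schema rejects.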

Each field of block 4 may include either subfield 1, which is repeatable, or a set of other subfields. This concept has been mapped into the choice element in the schema. The content of subfield 1 can only be one of the fields that are listed as secondary in the format definition. This concept has also been mapped into the choice element.

XML DOCUMENT OF UNIMARC FORMAT

Figure 5 shows part of an XML document of UNIMARC format as an instance of the schema described in the second section of this paper. This document is valid with relation to the schema. This means that the fields that appear in this document together with their indicators, subfields and code list values are in full compliance with the definitions given in the UNIMARC format. For example, Figure 5 shows an element that corresponds to block 2, with sub-elements corresponding to fields 200 and 205. Field 200 has the first indicator defined. Among other subfields, field 200 also includes subfield z, the content of which is taken from the language code list and which has the attribute extCode listed. Its value is 1.

<b2 desc="DESCRIPTIVE INFORMATION BLOCK">
  <f200 mandatory="1" repeatable="0" desc="TITLE AND STATEMENT OF RESPONSIBILITY">
    <ind1 desc="Title Significance Indicator">
      <ic0 code="0" desc="Title is not significant"/>
      <ic1 code="1" desc="Title is significant"/>
    </ind1>
    <sfa desc="Title Proper" mandatory="1" repeatable="1"/>
    <sfb desc="General Material Designation" mandatory="1" repeatable="1"/>
    <sfc desc="Title Proper by Another Author" mandatory="1" repeatable="1"/>
    <sfd desc="Parallel Title Proper" mandatory="1" repeatable="1"/>
    <sfe desc="Other Title Information" mandatory="1" repeatable="1"/>
    <sff desc="First Statement of Responsibility" mandatory="1" repeatable="1"/>
    <sfg desc="Subsequent Statement of Responsibility" mandatory="1" repeatable="1"/>
    <sfh desc="Number of a Part" mandatory="1" repeatable="1"/>
    <sfi desc="Name of a Part" mandatory="1" repeatable="1"/>
    <sfv desc="Volume Designation" mandatory="1" repeatable="0"/>
    <sfz desc="Language of Parallel Title Proper" extCode="1" mandatory="1" repeatable="1"/>
    <sf5 desc="Institution to Which Field Applies" mandatory="1" repeatable="0"/>
  </f200>
  <f205 mandatory="1" repeatable="1" desc="EDITION STATEMENT">
    <sfa desc="Edition Statement" mandatory="1" repeatable="0"/>
    <sfb desc="Issue Statement" mandatory="1" repeatable="1"/>
    <sfd desc="Parallel Edition Statement" mandatory="1" repeatable="1"/>
    <sff desc="Statement of Responsibility Relating to Edition" mandatory="1" repeatable="1"/>
    <sfg desc="Subsequent Statement of Responsibility" mandatory="1" repeatable="1"/>
  </f205>
  ...
</b2>

Figure 5. XML document of UNIMARC format.

The schema described in the second section of this paper allows the selection of the correct UNIMARC format subsets for different levels of processing in an institution, as well as those used in different institutions, regions and countries. The appropriate subset is selected in the form of an XML document and afterwards validated with respect to the schema. In addition to selecting the subsets, it is possible to update the UNIMARC format at the national and regional level by changing the schema. After validation, further changes can also be made to the XML document.

To efficiently perform the functions that are described, an XML editor for UNIMARC format description in the Java environment has been implemented [12]. This software tool allows the creation of a valid XML document of UNIMARC format. Validation is implemented with the Sun package MSV and performed with respect to the schema. The editor recognizes all concepts that exist in the schema (field, subfield, indicator, etc.) and the relations among them. Thus, it is resistant to schema changes caused by format updating, except in cases when a new concept is added to the format. In all other cases, when changes affect existing concepts, it is enough to update the schema. Moreover, the user interface of the editor is clear and easy to work with. The navigation tree that allows quick browsing through the document is particularly useful. The implementation leaves room for a multilingual user interface, so that an XML document of UNIMARC format could be produced in any language. The editor described can also be used for training librarians in using the UNIMARC format.

Figure 6. The field 105 and its subfield a.

In this schema, the RecordLabel concept has not been studied. It will be added later, analogously to the fields with the content distributed in segments. Since the concepts of the UNIMARC/Authorities format are very similar, a schema of this format could be formed analogously to this schema.

SCHEMA OF XML BIBLIOGRAPHICAL RECORDS

Similar to the schema of the UNIMARC format, a schema of bibliographical records in the UNIMARC format was created. An instance of this schema is an XML bibliographical record. The structures of these two schemas are almost alike, with the exception that the elements relating to the blocks have been excluded; as a result, the elements representing fields are direct children of the root element. This was done since the blocks in the UNIMARC records are not directly visible either. The same rules for naming the tags have been followed. Block 4, i.e. subfield 1, which appears in all of its fields, was mapped on the same principle as in the schema of the format.


105 CODED DATA FIELD: TEXTUAL MATERIAL, MONOGRAPHIC (Not repeatable)
Indicators: blank
Subfield codes
$a Monograph Coded Data
$a/0–3 Illustration codes
  a = illustrations
  b = maps
  c = portraits
  d = charts
  e = plans
  f = plates
  g = music
  h = facsimiles
  i = coats of arms
  j = genealogical tables
  k = forms
  l = samples
  m = sound recordings
  n = transparencies
  o = illuminations
  y = no illustrations
      To be used only once, i.e. y###.
  # = (blank), value position not needed

Figure 7. Definition of the first segment of the subfield a of the field 105.

Figure 6 illustrates the mapping of field 105. This field contains one subfield with content divided into seven segments. What may be written in each of the segments is precisely stated in the UNIMARC format definition. Figure 7 shows the definition of the first segment, taken from [3], which is shown in Figure 6. The pattern with which the restriction of string type was performed for the purpose of describing the rule shown in Figure 7 has the following meaning: segment ssfield3 must contain four characters, in positions 0 to 3, which represent the codes for the type of illustrations. The codes permitted are the letters of the English alphabet from a to o. Should a publication contain fewer than four types of illustrations, the rest of the segment is filled with # characters, i.e. blanks. The use of code y is also permitted if a publication includes no illustrations whatsoever, provided that the code is used only in the manner y###.
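The rule can be written as a regular expression. The pattern below is a reconstruction from the description above and not the pattern used in the schema, which is visible only in Figure 6. It is checked here with Python's re module, and the names ILLUSTRATION_CODES and is_valid_illustration_segment are hypothetical. The sketch assumes that codes are left-aligned and that at least one code is present unless the value is y###; whether a completely blank segment is permitted is not stated in the text, so it is rejected here.

```python
import re

# One to four codes from a to o, padded on the right with '#' to four
# characters, or the special value 'y###' (no illustrations).
ILLUSTRATION_CODES = re.compile(
    r"[a-o]{4}|[a-o]{3}#|[a-o]{2}##|[a-o]###|y###"
)


def is_valid_illustration_segment(value):
    """Return True if value satisfies the reconstructed rule for ssfield3."""
    return ILLUSTRATION_CODES.fullmatch(value) is not None
```

With this pattern the values ac## and y### are accepted, while a#c# (a blank between codes) and ay## (y combined with another code) are rejected.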

Data types that represent general code lists and indicators are mapped into simple types that are restrictions of built-in types such as byte (for indicators) or token (for the general code lists). Suitable types are then simply assigned to the indicator elements and to the subfield elements that contain data from the general code lists. Figure 8 shows field 101 and the first indicator within that field. The type of this indicator is Tind02, which is defined as an enumerated type with the set of allowed values {0, 1, 2}. The type that represents the language code list has been mapped in the same way and is assigned to each subfield that has this content, which is the case for all the subfields of this field.
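The enumerated indicator types can be pictured as named sets of allowed values. In the Python sketch below, the four sets are the four cases observed in the UNIMARC format and Tind02 is the type name given above; the other three type names are hypothetical and merely follow the same naming pattern, as does the function name.

```python
# Named sets of allowed indicator values. Only the name Tind02 comes from
# the text; Tind01, Tind12 and Tind03 are assumed by analogy.
INDICATOR_TYPES = {
    "Tind01": {0, 1},
    "Tind02": {0, 1, 2},
    "Tind12": {1, 2},
    "Tind03": {0, 1, 2, 3},
}


def indicator_value_allowed(type_name, text):
    """Return True if the textual indicator value belongs to the
    enumeration associated with type_name."""
    try:
        value = int(text)
    except ValueError:
        return False  # not a number at all, e.g. a blank marker
    return value in INDICATOR_TYPES[type_name]
```

For the first indicator of field 101, the value 1 would be accepted and the value 3 rejected.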

The corresponding type has been simply assigned to any subfield with the contents that can be described with one of the built-in types, or with the restriction of such a type.


Figure 8. Field 101 and its first indicator.

1011#[email protected]##$aac######000ay@

Figure 9. Fields 101 and 105 with the data from one UNIMARC record.

The difference between the format schema and the record schema is also illustrated by the fact that the attributes of the elements that correspond to the fields, subfields, subfield segments and indicators are not defined in the record schema. The mandatory and repeatable properties are mapped directly into the attributes minOccurs and maxOccurs. The length of data in a subfield is likewise mapped directly into the length facet. The property relating to the type of global code list is expressed by assigning to the subfield the data type that represents that particular global code list.
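The mapping of the two Boolean properties onto occurrence constraints is mechanical and can be sketched in a few lines of Python. The function name is hypothetical, and the use of the value unbounded for repeatable elements is an assumption based on the usual XML Schema convention; the text does not state which upper bound the record schema uses.

```python
def occurrence_constraints(mandatory, repeatable):
    """Map the mandatory/repeatable properties of a field or subfield
    onto XML Schema occurrence attributes."""
    return {
        # A mandatory element must occur at least once.
        "minOccurs": "1" if mandatory else "0",
        # A repeatable element has no upper bound (assumed convention).
        "maxOccurs": "unbounded" if repeatable else "1",
    }
```

Field 200 in Figure 5, for instance, is mandatory and not repeatable, which under this sketch corresponds to minOccurs="1" and maxOccurs="1".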

Figure 9 shows field 101 and field 105 with the information taken from the UNIMARC record given in [3], and Figure 10 shows what these fields look like in the corresponding XML record.

<f101>
  <ind1>1</ind1>
  <sfa>eng</sfa>
  <sfc>fre</sfc>
</f101>
...
<f105>
  <sfa>
    <ssfield3>ac##</ssfield3>
    <ssfield7>####</ssfield7>
    <ssfield8>0</ssfield8>
    <ssfield9>0</ssfield9>
    <ssfield10>0</ssfield10>
    <ssfield11>a</ssfield11>
    <ssfield12>y</ssfield12>
  </sfa>
</f105>

Figure 10. Fields 101 and 105 in the XML record.

By creating a schema of bibliographical records, very thorough control of the contents and structure of bibliographical records in the UNIMARC format is provided, and validating an XML bibliographical record against this schema is sufficient to exercise that control. In the general case, the validation process for an XML document checks whether the document is well-formed and whether it follows all the rules stated in the DTD with relation to structure and contents. Control of XML bibliographical records implies checking the following.

• That all of the mandatory fields are stated in the record, that within the stated fields all the mandatory subfields are stated and also that all the indicators have been stated in the fields for which they are defined.

• That the fields and subfields stated in the record actually exist as defined in the UNIMARC format.

• That the fields from block 4 and subfield 1 in those fields are truly secondary according to the format definition if they actually appear in the record.

• That if one or both indicators appear within a field of the record, they are actually defined as indicators for that field according to the UNIMARC format.

• That the values written in the indicators and the subfields are correct as defined in the format.
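In the approach described here these checks are carried out by schema validation. Purely as an illustration of what is being checked, the following Python sketch performs several of them by hand against a heavily reduced, hypothetical format description. The root element name record, the FORMAT table and the choice of mandatory subfields are assumptions; the check for block 4 is omitted.

```python
import xml.etree.ElementTree as ET

# Hypothetical, reduced format description: per field, whether it is
# mandatory, the allowed values of its indicators, and its subfields
# (True = mandatory subfield).
FORMAT = {
    "f101": {"mandatory": False,
             "indicators": {"ind1": {"0", "1", "2"}},
             "subfields": {"sfa": True, "sfc": False}},
    "f200": {"mandatory": True,
             "indicators": {"ind1": {"0", "1"}},
             "subfields": {"sfa": True, "sff": False}},
}


def check_record(xml_text):
    """Return a list of messages describing violations found in a record."""
    errors = []
    record = ET.fromstring(xml_text)
    present = {field.tag for field in record}
    for name, spec in FORMAT.items():
        if spec["mandatory"] and name not in present:
            errors.append(f"missing mandatory field {name}")
    for field in record:
        spec = FORMAT.get(field.tag)
        if spec is None:
            errors.append(f"unknown field {field.tag}")
            continue
        seen = set()
        for child in field:
            seen.add(child.tag)
            if child.tag in spec["indicators"]:
                if (child.text or "") not in spec["indicators"][child.tag]:
                    errors.append(
                        f"{field.tag}/{child.tag}: value {child.text!r} not allowed")
            elif child.tag not in spec["subfields"]:
                errors.append(f"{field.tag}: unknown element {child.tag}")
        for name in spec["indicators"]:
            if name not in seen:
                errors.append(f"{field.tag}: missing indicator {name}")
        for name, mandatory in spec["subfields"].items():
            if mandatory and name not in seen:
                errors.append(f"{field.tag}: missing mandatory subfield {name}")
    return errors
```

A record that satisfies the reduced description produces an empty list, while a record lacking field 200, or carrying an undefined subfield or a disallowed indicator value, produces one message per violation.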

Even if the systems that perform the exchange do not use XML, it is quite obvious why it is very convenient to exchange bibliographical records using this technology. It would be sufficient to implement a conversion between an XML document and some other data format, validate the XML record and then re-convert it into the incoming format.

VALIDATION OF BIBLIOGRAPHICAL RECORDS

The conversion of UNIMARC bibliographical records into XML documents was implemented in the BISIS system, version 3.0. It is possible to convert other formats of bibliographical records into XML documents in a similar manner. The conversion was implemented in the Java environment.
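A conversion of this kind can be sketched as follows. This is not the BISIS implementation; it is a minimal illustration that assumes the element naming visible in Figure 10 (f followed by the field tag, sf followed by the subfield code, ind1 and ind2 for indicators) and assumes that absent indicators are simply omitted. The class and method names are hypothetical, and subfield segments such as those of field 105 are not handled.

```java
final class FieldToXmlSketch {
    /** Escapes the characters that are not allowed in XML element content. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /**
     * Serialises one field. Each entry of subfields is a pair {code, value};
     * an indicator passed as null is left out of the output.
     */
    static String fieldToXml(String tag, String ind1, String ind2, String[][] subfields) {
        StringBuilder xml = new StringBuilder();
        xml.append("<f").append(tag).append('>');
        if (ind1 != null) {
            xml.append("<ind1>").append(escape(ind1)).append("</ind1>");
        }
        if (ind2 != null) {
            xml.append("<ind2>").append(escape(ind2)).append("</ind2>");
        }
        for (String[] subfield : subfields) {
            String name = "sf" + subfield[0];
            xml.append('<').append(name).append('>')
               .append(escape(subfield[1]))
               .append("</").append(name).append('>');
        }
        xml.append("</f").append(tag).append('>');
        return xml.toString();
    }

    public static void main(String[] args) {
        String[][] subfields = { {"a", "eng"}, {"c", "fre"} };
        System.out.println(fieldToXml("101", "1", null, subfields));
    }
}
```

For the field 101 data of Figure 10, the call in main yields the same elements as the figure, written on a single line.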


In the general case it is not necessary to develop a specific application to validate individual XML documents against a schema. Specialized XML editors are available which generally allow the validation of individual documents. However, specialized systems that need to process several tens or hundreds of thousands of XML documents, as is the case with bibliographical records, require a special application that provides an adequate user interface and ensures the validation of several documents at the same time. Sun Microsystems has developed the Sun Multi-Schema XML Validator (MSV) package. Validation carried out with it gives very precise messages regarding all the existing errors if the XML document being examined is not valid.
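Validating a large set of records against one schema can be sketched as below. The sketch uses the JDK's javax.xml.validation API as a stand-in for MSV, and the class name, method names and sample schema are hypothetical. The schema is compiled once and reused; a new Validator is created per record because, according to the JDK documentation, a Schema object is thread-safe while a Validator is not.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

final class BatchValidationSketch {
    // Hypothetical one-field schema, used only to make the sketch runnable.
    static final String SAMPLE_XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
      + " <xs:element name='f101'><xs:complexType><xs:sequence>"
      + "  <xs:element name='sfa' maxOccurs='unbounded'>"
      + "   <xs:simpleType><xs:restriction base='xs:string'>"
      + "    <xs:length value='3'/>"
      + "   </xs:restriction></xs:simpleType>"
      + "  </xs:element>"
      + " </xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    /** Compiles the schema once so that it can be shared by all validations. */
    static Schema compile(String xsd) {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(xsd)));
        } catch (SAXException e) {
            throw new IllegalArgumentException("schema could not be compiled", e);
        }
    }

    /** Maps each record identifier to true (valid) or false (invalid). */
    static Map<String, Boolean> validateAll(Schema schema, Map<String, String> records) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : records.entrySet()) {
            Validator validator = schema.newValidator(); // one validator per record
            boolean valid;
            try {
                validator.validate(new StreamSource(new StringReader(entry.getValue())));
                valid = true;
            } catch (SAXException e) {
                valid = false;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            results.put(entry.getKey(), valid);
        }
        return results;
    }

    public static void main(String[] args) {
        Map<String, String> records = new LinkedHashMap<>();
        records.put("record-1", "<f101><sfa>eng</sfa></f101>");
        records.put("record-2", "<f101><sfa>english</sfa></f101>");
        System.out.println(validateAll(compile(SAMPLE_XSD), records));
    }
}
```

Compiling the schema once matters at the scale mentioned above, since schema compilation would otherwise be repeated for every one of the tens or hundreds of thousands of records.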

The validation is implemented in the Java environment with the available MSV package. This application has become one of the modules of the BISIS software system, where it was the basis for the validation of existing bibliographical records, of each newly created record and of records taken from other library software systems.

James (Java MARC Events), now called MARC4J, is a Java API developed by Peters [13] and used within the Library of Congress project. It provides conversion between MARC records and XML, and also validation of MARC XML files against a schema written in the RELAX NG schema language. Within this API, validation is performed with MSV. This API is similar to the application described in this paper, and both are schema-dependent. Of course, MARC4J represents the more complete solution.

A very interesting and completely different approach to the validation process is presented in [14]. It describes the Schematron, a structure-based validation language defined by Jelliffe. This language consists of tree patterns, defined as XPath expressions.
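The idea of a rule expressed as an XPath pattern can be sketched with the JDK's javax.xml.xpath API. This is not Schematron itself, only a minimal illustration of its underlying idea. The rule is hypothetical: it assumes that a field 101 whose first indicator is 1 must also contain subfield c, and it uses the element names of Figure 10 inside an assumed record root element.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

final class PatternRuleSketch {
    // Hypothetical rule: there is no f101 with ind1 = '1' that lacks an sfc child.
    static final String RULE = "not(//f101[ind1 = '1'][not(sfc)])";

    /** Evaluates the XPath rule as a boolean over the given XML text. */
    static boolean satisfies(String xml, String rule) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return (Boolean) xpath.evaluate(
                rule, new InputSource(new StringReader(xml)), XPathConstants.BOOLEAN);
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("rule could not be evaluated", e);
        }
    }

    public static void main(String[] args) {
        String record = "<record><f101><ind1>1</ind1><sfa>eng</sfa></f101></record>";
        System.out.println(satisfies(record, RULE));
    }
}
```

A constraint of this kind, where the presence of one element depends on the value of another, is the sort of rule that pattern-based validation expresses directly.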

CONCLUSION

This paper has outlined the conversion of the UNIMARC bibliographical format and of UNIMARC bibliographical records into XML. It has shown that these standards can be mapped entirely into the XML schema language. In a similar fashion, it is possible to convert the MARC format and MARC bibliographical records. This result was the basis for the realization of an XML editor for the UNIMARC format description.

Analogous to an editor for an XML document of the UNIMARC format, it is possible to implement an editor for XML bibliographical records which could be widely applied in creating, modifying and validating XML bibliographical records. Its special advantage could be that users would not need to be acquainted with XML in order to work with it.

A schema of XML bibliographical records and the editor for creating, updating and validating such records would naturally lead to the possibility of developing a system for the exchange of bibliographical records in which different library information systems could participate. It would be sufficient for them to have implemented software for the conversion of their own records into XML bibliographical records in accordance with a schema accepted on the system level.

Based on the above, the following questions are inevitable.

Is it possible to create a single schema which would describe both the UNIMARC format and bibliographical records?

One possible solution would be to create a schema of the UNIMARC format and afterwards to develop an application that would, on the basis of that schema and special features of records, automatically generate the schema of the records. Another solution might be to create one super-schema that would include the common elements of both schemas and later on to create specific sub-schemas for format and records.

Which method would provide the most efficient storage and retrieval of XML bibliographical records?

This problem was thoroughly examined in [4]. There is, however, no conclusive answer at present. New technological solutions for storing and retrieving XML documents appear daily.

ACKNOWLEDGEMENT

This paper is part of the scientific research project ‘XML technologies and cooperative information systems’, no. 1875, supported by the Ministry of Science, Technologies and Development, Republic of Serbia.

REFERENCES

1. MARC standards. Library of Congress, Network Development and MARC Standards Office. http://www.loc.gov/marc/ [17 February 2003].

2. IFLA Universal Bibliographic Control and International MARC Programme. UNIMARC Manual: Bibliographic Format. New Providence: London, 1994.

3. UNIMARC concise bibliographic format. http://www.ifla.org/VI/3/p1996-1/concise2.pdf [1 October 2002].

4. Clarke KS. Medlane/XMLMARC Update: From MARC to XML Database, Lane Medical Library, Stanford University Medical Center. http://elane.stanford.edu/laneauth/medlane 2001.html [20 February 2003].

5. Miller DR, Clarke KS. XOBIS: The XML Organic Bibliographic Information Schema, Lane Medical Library, Stanford University. http://elane.stanford.edu/laneauth/XOBIS.pdf [20 February 2003].

6. MARC XML–MARC 21 XML Schema. Official Web Site. http://www.loc.gov/standards/marcxml/ [17 February 2003].

7. Implementation Guidelines for the Open Archive Initiative Protocol for Metadata Harvesting—An XML Schema to represent MARC Records, Protocol version 2.0. http://www.openarchives.org/OAI/2.0/guidelines-oai marc.htm [17 February 2003].

8. Carvalho J, Cordeiro MI. XML and bibliographic data: The TVS (Transport, Validation and Services) model. Proceedings of the 68th IFLA Council and General Conference, Glasgow, 2002. http://www.ifla.org/IV/ifla68/papers/075-095e.pdf [3 February 2004].

9. Hough J, Bull R, Young B. Using XSLT for XML MARC Record Conversion, Crossnet Systems Limited, 2000. http://www.crxnet.com/one2/xslt marcx report.pdf [3 February 2004].

10. Milosavljevic B, Konjovic Z. Modelling and implementation of a system for bibliographic records interchange. Novi Sad Journal of Mathematics 2001; 31(1):159–166.

11. Zeremski M. Modelling of UNIMARC format in XML technology. Masters Thesis, Novi Sad, 2002 (in Serbian). http://diglib.ns.ac.yu/ndltd/docs/set1/ndltd64/ZeremskiMMagistarskaTeza.pdf [15 January 2003].

12. Mijic V. XML editor for UNIMARC format description. Masters Thesis, Novi Sad, 2003 (in Serbian). http://diglib.ns.ac.yu/ndltd/docs/set1/ndltd133/MijicVMagistarskaTeza.pdf [12 June 2003].

13. Peters B. MARC4J: A basic tutorial. http://marc4j.tigris.org/doc [3 February 2004].

14. Jelliffe R. The Schematron—An XML Structure Validation Language using Patterns in Trees. http://www.ascc.net/xml/resource/schematron [3 February 2004].
