XML - an opportunity for <meaningful> data standards in the geosciences

Computers & Geosciences 27 (2001) 839–849

Simon W. Houlding*

Geoscience Modeling Consultant, 8625 Saffron Place, Burnaby, BC, Canada V5A 4H9

Received 16 June 1999; received in revised form 1 January 2000; accepted 20 June 2000

Dataset available from server at http://www.iamg.org/CGEditor/index.htm

*Tel.: +1-604-420-0811; fax: +1-604-420-6840. E-mail address: [email protected] (S.W. Houlding).

0098-3004/01/$ - see front matter © 2001 Elsevier Science Ltd. All rights reserved. PII: S0098-3004(00)00145-X

Abstract

Extensible markup language (XML) is a recently introduced meta-language standard on the Web. It provides the rules for development of metadata (markup) standards for information transfer in specific fields. XML allows development of markup languages that describe what information is rather than how it should be presented. This allows computer applications to process the information in intelligent ways. In contrast, hypertext markup language (HTML), which fuelled the initial growth of the Web, is a metadata standard concerned exclusively with presentation of information. Besides its potential for revolutionizing Web activities, XML provides an opportunity for development of meaningful data standards in specific application fields. The rapid endorsement of XML by science, industry and e-commerce has already spawned new metadata standards in such fields as mathematics, chemistry, astronomy, multimedia and Web micro-payments. Development of XML-based data standards in the geosciences would significantly reduce the effort currently wasted on manipulating and reformatting data between different computer platforms and applications and would ensure compatibility with the new generation of Web browsers. This paper explores the evolution, benefits and status of XML and related standards in the more general context of Web activities and uses this as a platform for discussion of its potential for development of data standards in the geosciences. Some of the advantages of XML are illustrated by a simple, browser-compatible demonstration of XML functionality applied to a borehole log dataset. The XML dataset and the associated stylesheet and schema declarations are available for FTP download. © 2001 Elsevier Science Ltd. All rights reserved.

Keywords: XML; Metadata; Standard; Geoscience; Internet

1. Introduction

The rapid expansion of the Web has been fuelled in large part by the successful introduction of hypertext markup language (HTML). Unfortunately, the very success of HTML has created new problems for the Web. Network traffic volume on the supposedly speed-of-light Internet is now such that it frequently moves at a crawl and, although nearly every possible kind of information is available somewhere on-line, it is becoming increasingly difficult to find the piece one needs. Both problems arise from the nature of HTML. Despite being the most successful electronic-publishing language invented, HTML is superficial in its concern with information presentation as opposed to information content. HTML merely describes how a Web browser should arrange text and images on a page; it provides no useful information about the content itself. HTML's concern with presentation makes it relatively easy to learn, but has inherent costs. The familiar phrase "what you see is what you get" ironically highlights the problem; in fact, with HTML, "what you see is all you've got!"



To work effectively with information, computers need to be told exactly what the information is, how it is related and how to deal with it. Extensible markup language (XML) is a new meta-language designed to do just that, to make information self-describing to the computer. This apparently simple change in how computers operate on and communicate information has the potential to dramatically increase the capacity and efficiency of the Web and to extend it beyond information delivery to many other kinds of human activity. The XML standard was completed in early 1998 by the W3C (World Wide Web Consortium) and is already spreading rapidly through certain science disciplines and industries ranging from manufacturing to medicine.

XML is based on the concept of metadata, i.e. data about data. Metadata is a description of the characteristics of data that have been collected for a specific purpose. XML employs metadata tags to describe what information is, not (like HTML) what it should look like. For example, HTML would tag the components of an order document for a shirt as boldface, paragraph, row and column; in contrast, an XML implementation tags them as price, size, quantity and color. Computer applications can then recognize the document as a customer order and take appropriate action: display it in appropriate ways to management or production, put it through an accounting system, and issue delivery instructions.
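The shirt order can be made concrete with a short sketch. The tag names, the values and the two helper functions below are hypothetical and chosen only for illustration; the parser is the ElementTree module from the Python standard library. The point is that one document, tagged by meaning, can feed both an accounting step and a production step.

```python
import xml.etree.ElementTree as ET

# Hypothetical order document; the tag names are illustrative, not a standard.
ORDER_XML = """<order>
  <item>shirt</item>
  <price>24.50</price>
  <size>M</size>
  <quantity>3</quantity>
  <color>blue</color>
</order>"""


def order_total(xml_text):
    """Accounting view: quantity times price, both located by tag name."""
    order = ET.fromstring(xml_text)
    price = float(order.findtext("price"))
    quantity = int(order.findtext("quantity"))
    return price * quantity


def picking_note(xml_text):
    """Production view: a one-line instruction built from the same document."""
    order = ET.fromstring(xml_text)
    return "{} x {} {}, size {}".format(
        order.findtext("quantity"),
        order.findtext("color"),
        order.findtext("item"),
        order.findtext("size"),
    )


print(order_total(ORDER_XML))
print(picking_note(ORDER_XML))
```

Neither function needs to know where on a page the price or the size would be displayed, which is precisely the information an HTML rendering of the same order would carry instead.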

Another advantage of XML is its reliance on the new Unicode standard, a character-encoding system that supports intermingling of text in the world's major languages. In HTML a document is generally in one particular language, whether English, Japanese or Arabic. Applications that cannot read the characters of that language cannot do anything with the document. But applications that read XML properly can deal with any combination of any of these character sets. Thus, XML will enable exchange of information not only between different computer systems but also across national and cultural boundaries.

XML has implications that extend well beyond the Web and browser applications. It has already gained significant acceptance within e-commerce, industry and certain science disciplines as a data standard for interfacing between computer applications. This is largely because the XML standard includes specifications of how an XML document should be parsed and represented within a computer (any computer, irrespective of type or operating system). This internal representation of an XML document, called the document object model (DOM), allows a single document to be accessed in the same way by different applications running on different computer platforms. XML parsers (the software that creates the DOM) are now readily available for incorporation into application software in most programming languages.
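As a minimal sketch of what access through the DOM looks like, the fragment below uses xml.dom.minidom, the DOM implementation shipped with Python. The document content is borrowed from the borehole example given later in this paper and is otherwise an assumption. The names documentElement, getElementsByTagName, firstChild and parentNode come from the W3C DOM interface, so DOM bindings in other languages expose equivalent calls.

```python
from xml.dom.minidom import parseString

# Small illustrative document; tag names follow the borehole example
# used later in the paper.
DOC = """<?xml version="1.0"?>
<BOREHOLE>
  <IDENTITY>
    <NAME>1008</NAME>
  </IDENTITY>
</BOREHOLE>"""

dom = parseString(DOC)                                # the parser builds the DOM
root = dom.documentElement                            # root element of the tree
name_element = dom.getElementsByTagName("NAME")[0]    # lookup by tag name
borehole_name = name_element.firstChild.data          # text held by <NAME>
parent_tag = name_element.parentNode.tagName          # navigate upwards

print(root.tagName, parent_tag, borehole_name)
```

The application never scans the raw text; it asks the parsed tree for elements by name and follows parent and child references.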

All of these advantages combine to provide an ideal platform for development of long-awaited and meaningful data standards in the geosciences. The benefits in terms of efficiencies resulting from being able to move information freely between computer applications require little elaboration. The additional benefits of browser compatibility, being able to display a single dataset (a borehole log or map for example) in different ways for different audiences, and the ability to conduct meaningful and efficient information searches are added incentives.

This paper traces the evolution of XML, discusses the benefits and advantages and summarizes the current status of XML and related standards. This provides a platform for discussion of XML, and its potential for development of new data standards, in a geoscience context. The paper concludes with a demonstration of XML based on a simple borehole log example. The references include links to a number of useful XML-related websites for readers who wish to follow up with their own XML research or to acquire XML-related software.

2. XML evolution

XML has evolved from standard generalized markup language (SGML), a meta-language (language about languages) which in turn evolved within the printing and publishing industries. HTML is an implementation of SGML developed specifically for presentation and linking of documents on the Web. It has been so successful that its limitations are beginning to restrict Web growth. Hence the need for a more generalized approach like XML.

2.1. SGML and the publishing industry

For generations, printers and editors scribbled notes on manuscripts to instruct typesetters. This "markup" evolved on its own until the mid-1980s, when it became an International Organization for Standardization (ISO) approved standard for creation of new markup languages.

SGML has since proved useful in many large publishing applications where it is used to define the structure of electronic documents. HTML was defined using SGML when the need for a simple markup language arose on the Web. The problem with SGML is that it is too general and full of features designed to minimize keystrokes in an era when every byte had to be accounted for. It is more complex than Web browsers and average users can cope with.


2.2. HTML and the World Wide Web

HTML is an implementation of SGML designed to provide Web authors with a relatively simple and efficient means of publishing documents for Web distribution. The SGML declaration for HTML is implicit among Web implementations.

In HTML documents, tags define the start and end of documents, headings, paragraphs, lists, hypertext links, etc. HTML elements are generally identified in a document as a start tag, which gives the element name and attributes, followed by the content, followed by an end tag. Start tags are delimited by < and >, and end tags are delimited by </ and >. For example

    <H1>This is a Heading</H1>
    <P>This is a paragraph</P>

The content of an element is a sequence of characters (text) and nested elements. Some elements, such as anchors, cannot be nested. The content model for a tag defines the syntax permitted for the content. HTML is designed to be flexible in that the closing tags of some elements may be omitted when they are clearly implied by the context and tags and their attributes are case insensitive.

HTML has become the lingua franca for publishing hypertext on the Web. It is a non-proprietary format that can be created and processed by a wide range of tools, from simple plain text editors to sophisticated WYSIWYG authoring tools and Web browsers. In terms of what it was originally designed to do and its acceptance by the Web community, HTML has been highly successful. However, it tells the computer nothing about the content of a document other than how it should be displayed.

This is extremely wasteful in terms of computer processing. Client-side computers are reduced to platforms for document display, and server-side computers are required to endlessly produce and communicate documents to feed the demand. It is also wasteful in terms of Web search efficiency. With HTML a search engine cannot distinguish between references to a book by Benjamin Franklin and a book about Benjamin Franklin, which is why the results of a Web search are invariably cluttered with many useless and inappropriate links.

2.3. Separation of content from style

The solution is simple: use tags that say what the information is, not how it looks, and separate the content of a document from its presentation (or style). XML does exactly this: it allows use of tags that are descriptive of the contents of a document and it separates the description of structure and content from information concerning presentation. The former is in the document, while the latter is in a stylesheet that the document links to. This makes it much easier to have, and to change, a common presentation across a set of documents, or to have different presentations of the same information for different audiences. Only one stylesheet is required to render many XML documents; conversely, a single XML document may be rendered in many ways by different stylesheets.
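The separation can be sketched without any stylesheet language at all. In the fragment below two ordinary Python functions stand in for two stylesheets; this is a deliberate simplification and not XSL. The FROM and TO tags, the interval values and the function names are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# One content document; FROM/TO and the values are hypothetical.
LOG_XML = """<BOREHOLE>
  <NAME>1008</NAME>
  <INTERVAL><FROM>0.0</FROM><TO>12.5</TO><TEXT>overburden</TEXT></INTERVAL>
  <INTERVAL><FROM>12.5</FROM><TO>40.0</TO><TEXT>andesite</TEXT></INTERVAL>
</BOREHOLE>"""


def render_summary(root):
    """Presentation 1: a one-line summary of the whole log."""
    intervals = root.findall("INTERVAL")
    depth = max(float(i.findtext("TO")) for i in intervals)
    return "Borehole {}: {} intervals, depth {}".format(
        root.findtext("NAME"), len(intervals), depth)


def render_table(root):
    """Presentation 2: one text row per logged interval."""
    rows = []
    for interval in root.findall("INTERVAL"):
        rows.append("{:>6} {:>6}  {}".format(
            interval.findtext("FROM"),
            interval.findtext("TO"),
            interval.findtext("TEXT")))
    return "\n".join(rows)


root = ET.fromstring(LOG_XML)
print(render_summary(root))
print(render_table(root))
```

The content is written once; changing how it looks means swapping the rendering function, never editing the data.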

2.4. XML and the next-generation Web

XML was created by removing frills from SGML to arrive at a more streamlined, digestible meta-language. XML consists of simple rules that allow a markup language (tag-set) to be created from scratch. The rules ensure that a single compact program, called a parser, can process any conforming language.

Whereas HTML is a tag-set built from the SGML meta-language, XML is not a tag-set at all; rather, it is a more easily used form of meta-language, derived directly from SGML. XML does not provide a set of tags to use, as HTML does; instead it provides rules for building tag-sets that suit information requirements. For example, with XML a computer application can readily distinguish between <author>Benjamin Franklin</author> and <subject>Benjamin Franklin</subject>.

Another key difference is that an XML-compliant browser has no hard-coded knowledge of the tag-set in a document that it will be expected to display. Instead, the XML document includes a pointer to a stylesheet, a file that accompanies the document and defines how the content of the tags should be rendered. The XML document may (optionally) also contain a pointer to a document type definition (DTD), a declaration of allowable tags, dependencies and content type (more on this later).

The XML rules can be summarized as follows:

* every XML document must have a root element (tag) that encloses the contents;
* every start tag must have a closing tag;
* tags must nest cleanly;
* empty tags have a different form to make it clear that these are tags with no closing tag;
* all attribute values must be in quotation marks;
* tags are case sensitive and must match;
* XML documents should begin with a declaration at the top to signal what they are.
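The rules listed above can be checked mechanically by handing a document to a parser. The sketch below uses the ElementTree parser from the Python standard library, which rejects anything that is not well-formed; the sample documents and the helper name is_well_formed are invented for the illustration.

```python
import xml.etree.ElementTree as ET

DECL = '<?xml version="1.0"?>'  # declaration placed at the top of each sample


def is_well_formed(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Illustrative samples, one per rule.
SAMPLES = {
    "closed and nested": DECL + "<log><name>1008</name></log>",
    "empty-element form": DECL + "<log><break/></log>",
    "missing end tag": DECL + "<log><name>1008</log>",
    "overlapping tags": DECL + "<log><a><b></a></b></log>",
    "unquoted attribute": DECL + "<log depth=12></log>",
    "case mismatch": DECL + "<Log></log>",
    "two root elements": DECL + "<log></log><log></log>",
}

results = {label: is_well_formed(text) for label, text in SAMPLES.items()}
for label, ok in results.items():
    print(label, ok)
```

Only the first two samples pass. An HTML browser would typically tolerate several of the others, which is the difference in strictness discussed next.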

An XML document that conforms to these rules, as determined by an XML parser, is classified as well-formed. The rules are significantly stricter than those implemented within HTML, i.e. current HTML documents do not satisfy the XML rules. This does not mean that HTML will be replaced by XML, since HTML is still useful for presentation purposes. HTML will remain in its current form but will converge towards XML-conforming HTML.

The nesting rule automatically forces a certain simplicity on every XML document, which takes on the structure known in computer science as a tree. As with a genealogical tree, each graphic and bit of text in the document represents a parent, child or sibling of some other element; relationships are unambiguous. Trees cannot represent every kind of information, but they can represent most kinds that computers are required to understand. A tree representation of information, moreover, makes it extremely convenient for programmers to generate software for accessing the information. For example, the XML data for a portion of a borehole log document might be

    <BOREHOLE>
      ...
      <IDENTITY>
        <NAME>1008</NAME>
        <PROPERTY>Las Estrellas(Norte)</PROPERTY>
        <DATE>November 13 1998</DATE>
      </IDENTITY>
      ...
    </BOREHOLE>

The graphical tree representation of this information is shown in Fig. 1.

This tree representation of XML content, which is generated within a computer by an XML parser, is called the DOM. It is a key component of the XML standard and provides an efficient basis for software applications to access and manipulate the XML content in standard ways through XML tag references. In a software context, the DOM is the application programming interface (API) for an XML document.
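The parent and child relationships in the borehole fragment can be made visible with a few lines of code. This is a minimal sketch: it uses Python's ElementTree rather than a W3C DOM implementation, it omits the elided parts of the example, and the function name outline is an assumption.

```python
import xml.etree.ElementTree as ET

# The borehole fragment from the text, without its elided portions.
BOREHOLE_XML = """<BOREHOLE>
  <IDENTITY>
    <NAME>1008</NAME>
    <PROPERTY>Las Estrellas(Norte)</PROPERTY>
    <DATE>November 13 1998</DATE>
  </IDENTITY>
</BOREHOLE>"""


def outline(element, depth=0):
    """Return the tree as indented lines, each parent before its children."""
    text = (element.text or "").strip()
    label = element.tag + (": " + text if text else "")
    lines = ["  " * depth + label]
    for child in element:          # iterating an element yields its children
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(BOREHOLE_XML)
print("\n".join(outline(root)))
```

The recursion mirrors the tree: BOREHOLE is the root, IDENTITY is its child, and NAME, PROPERTY and DATE are siblings under IDENTITY.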

In a browser context, the DOM is a platform- and language-neutral interface that allows software scripts to dynamically access and update the content, structure and style of documents. The document can be further processed and the results of that processing can be incorporated back into the presented page. Thus the content of an XML document can be manipulated (formatted, re-calculated, sorted, etc.) within a client-side browser to suit varying presentation requirements.
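The kind of manipulation meant here, sorting and re-calculating, can be sketched outside a browser. The fragment below uses Python's ElementTree as a stand-in for a browser script working on the DOM; the FROM, TO and LENGTH tags and the values are assumptions made for the illustration.

```python
import xml.etree.ElementTree as ET

# Intervals deliberately stored out of depth order; tags are hypothetical.
LOG_XML = """<BOREHOLE>
  <INTERVAL><FROM>12.5</FROM><TO>40.0</TO></INTERVAL>
  <INTERVAL><FROM>0.0</FROM><TO>12.5</TO></INTERVAL>
</BOREHOLE>"""

root = ET.fromstring(LOG_XML)
intervals = root.findall("INTERVAL")

# Sort: detach the INTERVAL children and re-attach them by starting depth.
intervals.sort(key=lambda node: float(node.findtext("FROM")))
for node in intervals:
    root.remove(node)
for node in intervals:
    root.append(node)

# Re-calculate: add a derived LENGTH child to every interval.
for node in intervals:
    length = float(node.findtext("TO")) - float(node.findtext("FROM"))
    ET.SubElement(node, "LENGTH").text = str(length)

starts = [node.findtext("FROM") for node in root.findall("INTERVAL")]
lengths = [node.findtext("LENGTH") for node in root.findall("INTERVAL")]
print(starts, lengths)
```

Both operations run on the parsed tree held by the client; no further request to a server is involved.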

3. XML advantages

The combination of more efficient distribution of processing, more accurate searching and more flexible linking will revolutionize the structure of the Web and make possible completely new ways of accessing information. Users will find this new Web faster, more powerful and more useful than the Web of today.

Of equal importance, the XML DOM will standardize the way in which information is passed between computer applications, both on the Web and beyond.

3.1. Reduced web traffic

As XML spreads, the Web will become noticeably more responsive. At present, client-side computers connected to the Web, whether they are powerful desktops or handheld devices, cannot do much more than get an HTML form, fill it out and then swap it back and forth with a Web server until a task is completed. The structural and semantic information that can be added with XML allows these client-side devices to do much more processing themselves, without recourse to a Web server. All of the information required for a particular client operation can be dispatched by the server as a single dataset to be processed and presented in different ways by the client computer. This will significantly reduce both network traffic and the load on Web servers.

3.2. More efficient Web search functionality

As more of the information on the Net is labeled with field-specific XML tags, it will become easier to find exactly what is needed. Librarians determined a long time ago that the way to find information quickly is to look not at the information itself but rather at much smaller, more focused sets of data that point to useful sources; hence the library card catalogue, a metadata approach.

From the outset, part of the XML project has been to create a standard for metadata itself. Resource description framework (RDF) will do for Web data what catalogue cards do for library books. RDF integrates a variety of web-based metadata activities including sitemaps, content ratings, channel definitions, search engine data collection (web crawling), digital library collections and distributed authoring, using XML as an interchange syntax. Deployed across the Web, RDF metadata will make searching much faster and more accurate than it is at present.

Fig. 1. Tree representation of XML metadata tags and contents.

3.3. More efficient Web linking

Hyperlinks will also do more when powered by XML. A standard for XML-based hypertext, named extensible linking language (XLL), will provide a choice from a list of multiple destinations. Other kinds of hyperlinks will allow insertion of linked text or images within a displayed document, instead of forcing a move to a new document.

Perhaps most useful, XLL will enable authors to use indirect links that point to entries in a central database rather than to the linked documents themselves. When a document's address changes, the author will be able to update all the links that point to it by editing just one database record. This will significantly reduce the familiar "404 File Not Found" error that signals a broken hyperlink.

3.4. Meaningful data standards

Both on the Web and beyond, the greatest impact of XML is likely to be in data transfer efficiencies achieved through development of new data standards. These will be based on the ready availability of XML parsers for all software languages and computer platforms, and the ease with which the information in the XML DOM can be accessed.

XML allows anyone to design a new, custom-built markup language, but designing languages is a challenge that cannot be undertaken lightly. The design is just the beginning: the meanings of tags are not going to be obvious to others unless accompanied by declarations that explain them, nor to computers unless given software to process them.

What XML does is simple but effective. It lays down ground rules that strip away a layer of programming detail so that users with similar interests can concentrate on the hard part: agreeing on how they want to represent the information they commonly exchange. This is not an easy problem to solve, but it is not a new one either.

Such agreements will be made, because the proliferation of incompatible computer systems and applications has imposed delays, costs and confusion on nearly every area of human activity. Users want to share information and ideas and do business without all having to use the same computer platform and software; field-specific interchange languages go a long way toward making that possible.

Before drafting a new markup language with XML, designers must agree on three things: which tags will be allowed, their content type and how tagged elements may nest within one another. These declarations are typically codified in a DTD. An XML document that conforms to a DTD, as determined by an XML parser, is classified as "valid". The XML standard does not compel language designers to use a DTD, but most new markup languages will probably have them, because they make it much easier for programmers to write software applications that understand the tags and do intelligent things with the content. A DTD, in effect, becomes the metadata standard (or schema) for a particular field of activity. HTML, for instance, has a DTD that is incorporated into all Web browsers.

4. XML and related standards

The W3C was founded in October 1994 to lead the Web to its full potential by developing common protocols that promote its evolution and ensure its interoperability. It is funded by member organizations, and is vendor neutral, working with the global community to produce specifications and reference software that is made freely available throughout the world.

4.1. Extensible markup language (XML)

The current W3C recommendations are XML 1.0, February 1998. As announced by W3C

    XML is primarily intended to meet the requirements of large-scale Web content providers for industry-specific markup, vendor-neutral data exchange, media-independent publishing, one-on-one marketing, workflow management in collaborative authoring environments, and the processing of Web documents by intelligent clients. It is also expected to find use in certain metadata applications. XML is fully internationalized for both European and Asian languages, with all conforming processors required to support the Unicode character set in both its UTF-8 and UTF-16 encodings. The language is designed for the quickest possible client-side processing consistent with its primary purpose as an electronic publishing and data interchange format.
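The Unicode requirement in the announcement can be illustrated with a short sketch. The document below mixes English, Japanese and Arabic text; the strings and tag names are illustrative only. The same content is serialized once as UTF-8 and once as UTF-16, and the parser (ElementTree from the Python standard library) is expected to recover identical text from both.

```python
import xml.etree.ElementTree as ET

# Illustrative content mixing three scripts in one document.
BODY = "<note><en>borehole</en><ja>ボーリング孔</ja><ar>بئر</ar></note>"

# Serialize the same document in the two encodings named by the W3C.
utf8_bytes = ('<?xml version="1.0" encoding="UTF-8"?>' + BODY).encode("utf-8")
utf16_bytes = ('<?xml version="1.0" encoding="UTF-16"?>' + BODY).encode("utf-16")

from_utf8 = ET.fromstring(utf8_bytes)
from_utf16 = ET.fromstring(utf16_bytes)

texts_utf8 = [child.text for child in from_utf8]
texts_utf16 = [child.text for child in from_utf16]
print(texts_utf8 == texts_utf16)
```

The byte streams differ, but the application sees the same three strings either way, which is what allows exchange across language boundaries.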

XML includes recommendations for the DOM, DTD and XLL. However, the DTD and XLL specifications in particular are still evolving and subject to change.

4.2. Extensible stylesheet language (XSL)

XSL is a language for expressing stylesheets and is itself an implementation of XML. As defined by W3C in the current working draft, it consists of two parts:

* a language for transforming XML documents;
* an XML vocabulary for specifying formatting semantics.

An XSL stylesheet specifies the browser presentation of XML documents by describing how an instance is transformed into an XML document that includes the formatting vocabulary. Thus XSL includes appropriate tags for selection, evaluation, formatting and presentation of XML content.

4.3. Emerging XML-based data standards

There is already a rapidly developing body of XML tag-sets and applications. MathML is a tag-set that allows integration and display of mathematical expressions in a browser document. Similarly, chemML allows use of molecular symbols. W3C announced in February 1999 that it has initiated the process to define extensible 3D (X3D), a next-generation 3D standard for virtual reality modeling language (VRML) that includes integration with XML.

Microsoft is using XML to create the channel definition format for managing its browser-based subscription news channels. There are also specialized XML tag-sets in existence for astronomy, astronomical markup language (ASL), and for DNA sequencing, bioinformatic markup language (BML).

The synchronized multimedia integration language (SMIL) is already a W3C recommendation. This employs an XML implementation to coordinate, integrate and synchronize digital files in different media (video, audio, images and text) into a multimedia presentation, with tags that control what to play, when and for how long.

In the context of e-commerce, W3C has a working draft for a Common Markup for Web Micropayment Systems. This specification provides an extensible way to embed in a Web page all the information necessary to initialize a micropayment (amounts, currencies, payment systems, etc.).

4.4. Available XML software applications

Microsoft has released Internet Explorer 5 (beta) as an XML-compliant browser, with its own interpretation of the W3C working drafts for XSL and DTD, i.e. these implementations are still subject to change. Netscape has announced imminent release of its own XML-compliant browser.

A variety of shareware and commercial XML document editors, XSL stylesheet editors and DTD editors are already available. XML parsers are also available as shareware for incorporation into application software. IBM recently announced the availability of the first XML-powered search engine.

Major database software vendors are implementing XML interfaces, as evidenced by Oracle's stated strategy of delivering a platform for software developers to build and deploy scalable Web applications that exploit XML. More recently, Oracle announced a complete infrastructure, based on XML, for the exchange and management of information associated with all aspects of e-commerce.

5. Data standards in the geosciences

5.1. Current focus on GIS standards

The principal focus in the development of data standards in the geosciences has been on GIS datasets, involving both raster and vector data types. As emphasized by Albrecht (1999), there is a plethora of organizations concerned with GIS standards and a corresponding number of proposed standards, i.e. there is no real standard in the strict meaning of the term. With no apologies for the ever-growing propensity for acronyms, the list of proposed standards includes DIGEST, GDF, SAIF, SDTS, CEN/TC287, ISO/TC211, OGIS, SQL3-MM, GRIB and BUFR. As noted by Huber and Schneider (1999), many of these standardization efforts are dominated by efforts to prolong the life of legacy systems rather than to ensure the interoperability of GIS datasets.

On the brighter side, much useful work, based on object-oriented approaches, has been accomplished by these standardization efforts in terms of establishing the data types and data dependencies inherent to GIS datasets. XML provides a relatively simple means of leveraging these accomplishments to achieve meaningful data standards.

Although a good place to start, GIS datasets are only part of the story; they are generally limited to 2D (at best 2.5D), and 3D datasets such as borehole logs and geophysical surveys must also be provided for. Anyone familiar with computer applications in the geosciences is only too painfully aware of the scarcity of appropriate data standards. Considerably more time is expended on manipulating and reformatting data between applications than on processing the applications themselves.

Despite the volume of data processed, there have as yet, with the exceptions discussed in Section 5.4, been no concerted efforts to develop comprehensive metadata standards in the geosciences. This applies to all of the basic data types dealt with, from vector maps, DTMs and raster images to borehole and well logs and geophysical surveys. As stated by Strand (1995) in a GIS context

    Until definition and standards are adopted, the definition and development of GIS applications will remain an admixture of art, science and perspiration.


5.2. XML simplification of the development process

Emergence of the XML standard, and its enthusiastic reception by science, industry and commerce, provides an ideal opportunity to rectify this situation. This comes about not necessarily because of the advantages of Web compatibility (see Section 5.3), but rather because of the ready availability of XML parsers (APIs) for interfacing with XML content in a standard way. As a result, XML simplifies the task of developing a new metadata language to the following steps:

* agreement on tag names, e.g. <BOREHOLE>, <INTERVAL>, <SAMPLE>;

* formulation of tag dependencies, e.g. a BOREHOLE must have only one IDENTITY, but may have zero or many downhole INTERVALs, each of which may have zero or many DATA or TEXT elements containing sample values and observations;

* specification of tag contents, e.g. a <NAME> tag contains text, a <DISTANCE> tag contains a fixed value, a <BOREHOLE> tag only contains other tags.
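As a rough illustration of these three steps, the sketch below reads a small borehole document with the XML parser in Python's standard library. The document is hypothetical: the tag names follow the examples used in this paper, but the values and exact layout are invented, and Python stands in for whatever parser API an application would actually use.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: tag names follow the examples in the text,
# but the values and layout are invented for illustration.
BOREHOLE_XML = """
<BOREHOLE>
  <IDENTITY>
    <NAME>BH-01</NAME>
  </IDENTITY>
  <INTERVAL>
    <FROM>0.0</FROM>
    <TO>2.5</TO>
    <DATA>1.2</DATA>
    <TEXT>sandstone</TEXT>
  </INTERVAL>
  <INTERVAL>
    <FROM>2.5</FROM>
    <TO>6.0</TO>
    <DATA>3.4</DATA>
  </INTERVAL>
</BOREHOLE>
"""


def read_borehole(xml_text):
    """Return the borehole name and a list of interval records."""
    root = ET.fromstring(xml_text)

    # Tag dependency: a BOREHOLE must have exactly one IDENTITY.
    identities = root.findall("IDENTITY")
    if len(identities) != 1:
        raise ValueError("a BOREHOLE must have exactly one IDENTITY")
    name = identities[0].findtext("NAME")

    intervals = []
    for interval in root.findall("INTERVAL"):  # zero or many INTERVALs
        intervals.append({
            # Tag contents: depths and DATA are assumed to be numeric.
            "from": float(interval.findtext("FROM")),
            "to": float(interval.findtext("TO")),
            "data": [float(d.text) for d in interval.findall("DATA")],
            # TEXT elements hold free-text observations.
            "text": [t.text for t in interval.findall("TEXT")],
        })
    return name, intervals


name, intervals = read_borehole(BOREHOLE_XML)
```

The application code never deals with file layout or delimiters; it only asks the parser for elements by their agreed tag names, which is the simplification the steps above are meant to achieve.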

Once these specifications have been accepted and published, any application can access the contents of a conforming document (or dataset) by incorporating an XML parser. The specifications should also be declared in a DTD or schema, firstly so that documents can be validated, and secondly so that the DTD can be made available to the appropriate applications to ensure conformance.

As stated earlier, one of XML’s advantages is that it separates content from presentation (style). This eliminates from the development process any concern with how the content is presented and allows development of standards to focus on the principal objective of ensuring the interoperability of content between applications. Initially at least, concern with processing, presentation and development of stylesheets (or their equivalent) then becomes the responsibility of individual applications and audiences. This separation of concern with content from concerns with processing and presentation is a necessary simplifying step in the successful development and introduction of meaningful standards.

5.3. Advantages of Web compatibility

Just as HTML fuelled the first-generation Web, XML is set to fuel the next generation. The Web is already a primary means of communicating and disseminating information. The advantages of compatibility between geoscience datasets and the Web appear obvious and are likely to become even more so. The ability to perform an intelligent search on available datasets is, on its own, a significant advantage. The current lack of appropriate graphical display functionality for geoscience datasets will be alleviated by the imminent release of X3D, an XML implementation of VRML. And for those concerned with security of information, the Web already makes better provision than most proprietary software applications.

5.4. Existing metadata implementations

Several ongoing attempts to develop metadata standards in the geosciences are discussed briefly below.

Starting in 1996, the Federal Geographic Data Committee (FGDC) in the US has developed a ‘‘Content Standard for Digital Geospatial Metadata’’ based on SGML:

The objectives of the standard are to provide a common set of terminology and definitions for the documentation of digital geospatial data. The standard establishes the names of data elements and compound elements (groups of data elements) to be used for these purposes, the definitions of these compound elements and data elements, and information about the values that are to be provided for the data elements.

The standard was developed from the perspective of defining the information required by a prospective user to determine the availability of a set of geospatial data, to determine the fitness of the set of geospatial data for an intended use, to determine the means of accessing the set of geospatial data, and to successfully transfer the set of geospatial data. As such, the standard establishes the names of data elements (tags) and compound elements to be used for these purposes, the definitions of these data elements and compound elements, and information about the values that are to be provided for the data elements. The standard does not specify the means by which this information is organized in a computer system or in a data transfer, nor the means by which this information is transmitted, communicated, or presented to the user.

The standard is limited to GIS datasets and, based on available discussion, appears overly complex (it employs a total of 334 different tags). The complaints regarding complexity are likely due at least in part to its use of SGML and dependence on SGML software tools.

The Australia and New Zealand Land Information Commission (ANZLIC) has adopted a simpler approach, originally based on SGML and now compatible with XML:

ANZLIC, through its Metadata Working Group, is actively pursuing an objective to implement a distributed national directory system to form a foundation for the Australian and New Zealand Spatial Data Infrastructures. The various State and Commonwealth jurisdictions are currently collecting metadata, as per the ANZLIC metadata standard (1996), to provide an extensive national picture of available spatial data which is available through a distributed directory of the Australian Spatial Data Directory and accessible over the Internet, managed jointly by all the ANZLIC jurisdictions.

The US approach, developed by the Federal Geographic Data Committee (FGDC), specifies the structure and expected content of some 220 elements (tags) which are intended to describe digital geospatial datasets adequately for all purposes. The ANZLIC approach is deliberately less ambitious than what has been attempted in the US. Arguments advanced in support of the more modest objective rely on experience to date with the creation of high-level directories in Australia.

Users need a level of detail, clarity and accuracy in the metadata sufficient for them to judge whether or not to make further inquiries of the contact organisation responsible for a dataset. Maintaining a comprehensive directory, however, imposes a significant burden on custodians. Experience indicates that a balance needs to be struck between these two factors.

The ANZLIC standard is primarily concerned with information regarding dataset accessibility and quality rather than the dataset itself. In summary, both of the metadata implementations discussed above are, by design, closer in function to information retrieval systems than to operational data standards.

6. XML borehole data demonstration

A small borehole dataset is employed for a simple demonstration of XML functionality in a Web browser. The demonstration is performed with Microsoft’s Internet Explorer 5 (beta) and includes both an XSL stylesheet and a DTD (or schema in Microsoft terminology). It should be noted that certain features of the current Internet Explorer 5 implementation of XML and related standards are not yet fully compliant with W3C recommendations due to the evolutionary status of the standards. However, the principles of XML are adequately demonstrated.

The XML dataset is included in file borehole.xml. A color-coded presentation of portions of the dataset is shown in Fig. 3. This is one of the presentation options provided by the stylesheet, which in reality is several stylesheets in one. Both the stylesheet and the DTD are referenced at the head of the XML document.

The XML dataset was compiled with Microsoft’s XMLNotepad editor. A tree representation of the XML tag dependencies is shown in Fig. 2. This is a graphical representation of the DOM produced by an XML parser.

Fig. 2. Tree representation of XML tags and contents in XML borehole data demonstration file with Microsoft XMLNotepad editor.

Fig. 3. Optional color-coded presentation of XML tags, attributes and contents of XML borehole data demonstration file with Microsoft Internet Explorer 5 browser.

Features demonstrated by this simple XML implementation include the following:

* Variable stylesheet formatting of XML content: For example, the Lithology values in the interval records shown in Fig. 4 are color-coded according to value and the interval records are highlighted according to a cut-off test applied to the Lead values.

* Stylesheet formatting and ordering of XML content based on data values: For example, the interval records shown in Fig. 4 can be dynamically re-ordered and sorted in ascending order by clicking on one of the column headers.

* Stylesheet computation of new values based on document content: For example, the $Equivalent values included in the interval records shown in Fig. 4 are obtained by combining the Lead and Zinc values in a mathematical expression.

* Dynamic reformatting via script manipulation of the stylesheet DOM: For example, the stylesheet accommodates client-side presentation of the content either in a color-coded XML listing format (refer Fig. 3) or in a simple borehole log format (refer Fig. 4); the stylesheet also accommodates optional display of Collar and Survey information.
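The value-based operations in this list (color-coding, cut-off highlighting, sorting and computing a combined value) can be sketched outside the browser as well. The Python fragment below is only a stand-in for the XSL stylesheet and is not taken from borehole.xsl: the records, the color map, the cut-off and the weights in the combined expression are all assumptions, since the actual expression is not stated here.

```python
# Hypothetical interval records; all values are invented for illustration.
intervals = [
    {"from": 0.0, "to": 2.5, "lithology": "shale", "lead": 1.0, "zinc": 2.0},
    {"from": 2.5, "to": 6.0, "lithology": "ore", "lead": 4.0, "zinc": 6.0},
    {"from": 6.0, "to": 9.0, "lithology": "shale", "lead": 0.5, "zinc": 1.0},
]

# Assumed presentation parameters (not taken from the paper).
LEAD_CUTOFF = 2.0                                   # highlight threshold
LEAD_WEIGHT = 1.0                                   # weights of the assumed
ZINC_WEIGHT = 0.5                                   # combined expression
LITHOLOGY_COLORS = {"shale": "gray", "ore": "red"}  # color-coding by value


def present(records, sort_key="from"):
    """Return display rows sorted in ascending order of sort_key.

    The input records are left untouched; presentation attributes are
    added to copies, mirroring the separation of content and style.
    """
    rows = []
    for rec in sorted(records, key=lambda r: r[sort_key]):
        rows.append({
            **rec,
            # New value computed from document content.
            "equivalent": LEAD_WEIGHT * rec["lead"] + ZINC_WEIGHT * rec["zinc"],
            # Cut-off test applied to the Lead value.
            "highlight": rec["lead"] >= LEAD_CUTOFF,
            # Color chosen according to the Lithology value.
            "color": LITHOLOGY_COLORS.get(rec["lithology"], "black"),
        })
    return rows


rows_by_depth = present(intervals)
rows_by_lead = present(intervals, sort_key="lead")
```

Calling `present` with a different `sort_key` plays the role of clicking a column header: the underlying records stay the same and only the derived view changes.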

Fig. 4. Optional borehole log presentation of contents of XML borehole data demonstration file with Microsoft Internet Explorer 5 browser.

The stylesheet for the demonstration is included in file borehole.xsl. It employs XSL tags to wrap HTML presentation and style information around the XML content prior to browser display (see Harold (1998), Pardi (1999)). Stylesheet access to the XML document DOM is achieved through JavaScript code embedded directly in the stylesheet. Dynamic reformatting of the browser presentation is achieved by JavaScript code embedded in the resulting HTML document by the stylesheet. This code accesses the stylesheet DOM to reset the values of display flags and sort criteria. It should be noted that XSL is itself an implementation of XML and therefore has its own accessible DOM.

The schema (DTD) for the demonstration is included in file borehole-schema.xml. It specifies tag dependencies and content requirements. As an example of how the schema works: a <BOREHOLE> element (tag) has no content other than other elements, one of which must be an <IDENTITY> element which must include one each of <NAME>, <PROPERTY> and <DATE> elements, each of which contains text. In contrast, a <BOREHOLE> element may contain many <INTERVAL> elements, each of which must contain one each of <FROM> and <TO> elements and may optionally contain zero or many <DATA> elements (containing fixed values) and zero or many <TEXT> elements (containing text).
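To make these rules concrete, the sketch below checks a document against them by hand in Python. It is not the schema in borehole-schema.xml and is no substitute for a validating parser; it simply restates the dependencies described above as code, and it assumes that the "fixed values" in <DATA> elements are numeric.

```python
import xml.etree.ElementTree as ET


def check_borehole(xml_text):
    """Return a list of rule violations (empty if the document conforms)."""
    root = ET.fromstring(xml_text)
    if root.tag != "BOREHOLE":
        return ["root element must be BOREHOLE"]

    errors = []

    # A BOREHOLE must contain exactly one IDENTITY ...
    identities = root.findall("IDENTITY")
    if len(identities) != 1:
        errors.append("BOREHOLE needs exactly one IDENTITY")
    # ... which holds one each of NAME, PROPERTY and DATE.
    for identity in identities:
        for tag in ("NAME", "PROPERTY", "DATE"):
            if len(identity.findall(tag)) != 1:
                errors.append(f"IDENTITY needs exactly one {tag}")

    # Zero or many INTERVALs, each with one FROM and one TO.
    for i, interval in enumerate(root.findall("INTERVAL")):
        for tag in ("FROM", "TO"):
            if len(interval.findall(tag)) != 1:
                errors.append(f"INTERVAL {i} needs exactly one {tag}")
        # DATA elements are optional; values are assumed to be numeric.
        for data in interval.findall("DATA"):
            try:
                float(data.text)
            except (TypeError, ValueError):
                errors.append(f"INTERVAL {i} has a non-numeric DATA value")
    return errors
```

A DTD or schema expresses the same constraints declaratively, so that any conforming parser can enforce them without application-specific code of this kind.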

7. Conclusions

XML is a new metadata standard that will increase the efficiency of the Web by reducing network traffic and allowing intelligent searches. Perhaps of greater importance is its potential for creating data standards for e-commerce, industry and science that will allow efficient movement of information between computer platforms and applications.

XML presents an opportunity for development of meaningful data standards in the geosciences that builds upon previous achievements in the field. For GIS datasets in particular, much of the development in terms of establishing data dependencies and data types has already been achieved. The benefits to the geosciences in terms of efficient, unrestricted movement of information between computer applications require little elaboration. The additional benefits of browser compatibility, being able to display a single document or dataset in different ways for different audiences, and the ability to conduct meaningful and efficient information searches on the Web are added incentives. XML tools that facilitate these objectives are now readily available. The challenge will be to establish an appropriate body within the geosciences to coordinate development of data standards.

References

Albrecht, J., 1999. Geospatial information standards. A comparative study of approaches in the standardisation of geospatial information. Computers & Geosciences 25 (1), 9–24.

Harold, E.R., 1998. XML Extensible Markup Language. IDG Books Worldwide Inc., Foster City, CA, 426pp.

Huber, M., Schneider, D., 1999. Spatial data standards in view of models of space and the functions operating on them. Computers & Geosciences 25 (1), 25–38.

Pardi, W.J., 1999. XML in Action. Microsoft Press, Redmond WA, 329pp.

Strand, E., 1995. GIS application transfer rooted in trees. GIS World 6 (2), 28–30.

Further reading

ANZLIC. SGML/XML Document Type Definition (DTD) for geospatial metadata in Australasia, http://www.environment.gov.au/net/anzmeta/.

Bosak, J., Bray, T., 1999. XML and the second-generation Web. Scientific American 5, http://www.sciam.com/1999/0599issue/0599bosak.html.

FGDC. Content Standard for Digital Geospatial Metadata, http://www.fgdc.gov/metadata/contstan.html.

Microsoft. Internet Explorer 5 download, http://www.microsoft.com/windows/ie/download/windows.htm.

Microsoft. XML Developer’s Guide, http://msdn.microsoft.com/xml/XMLGuide/default.asp.

Microsoft. XSL Developer’s Guide, http://msdn.microsoft.com/xml/XSLGuide/default.asp.

The XML Files (XML tutorials), http://www.webdeveloper.com/xml/.

World Wide Web Consortium (W3C). Extensible Markup Language (XML) 1.0, http://www.w3.org/TR/REC-xml.html.

XML.COM (XML news reports), http://www.xml.com/xml/pub.

XML for Fun and Profit (XML overview), http://etext.virginia.edu/helpsheets/xml-basic.html.

XML Resource (XML-related links), http://www.kric.ac.kr/~wslee/xml/xml_resource.html.

XML Repository (XML-related links), http://xmlrepository.com/.

XML Software (reviews and downloads), http://www.xmlsoftware.com/.
