if we're not there yet, how far do we have to go ? web metadata at the university of melbourne

22
Young & Hughes, DC-ANZ 2005 1 If we’re not there yet, how far do we have to go ? A review of web metadata at The University of Melbourne Eve Young, Metadata Coordinator Information Acquisition and Organisation Section Information Division Baden Hughes, Research Fellow Department of Computer Science and Software Engineering The University of Melbourne

Upload: baden-hughes

Post on 06-May-2015

1.217 views

Category:

Technology


2 download

DESCRIPTION

Paper at DC-ANZ 2005 (May 2005, Melbourne)

TRANSCRIPT

Page 1: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 1

If we’re not there yet, how far do we have to go ?

A review of web metadata at The University of Melbourne

Eve Young, Metadata CoordinatorInformation Acquisition and Organisation Section

Information Division

Baden Hughes, Research FellowDepartment of Computer Science and Software

Engineering

The University of Melbourne

Page 2: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 2

Overview

� Background � Web publishing policies circa 1999, 2001� Research projects

� Towards standardization� Dublin Core� UniMelb administrative metadata

� Broad scale compliance analysis� UniMelb web environment� DC Metadata on the UniMelb Web� UniMelb Metadata on the UniMelb Web

� Reflections and challenges for the future

Page 3: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 3

Before Metadata on UniM site

� Existing standard (1999) not widely adopted� 9 metadata tags

� expiryDate, maintainer, authoriser, author, description, keywords, lastModified, distribution, contentType

� Operational and implementation issues� Difficulty finding information� Suspected non-compliance

� Investigate and analyze � Manual research

Page 4: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 4

Expiry Tag Analysis

� Expiry tag functionality important� Analysis into non-compliance (608 pages)� Only 27% of pages audited were compliant� Of the remainder of pages reviewed, 441

had no date, or NA as value

Page 5: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 5

A to Z Index: Compliance Audit

� Audit of metadata on 78 web pages� Highest compliance 84.6% (content type)� Lowest 11.5 % (expiry date)� More unknown than known maintainers� Default value tags had high degree of

compliance� Page specific tags (keywords) had lowest

Page 6: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 6

Metadata Working Group

� Advise on implementation of a uniform approach to the creation of metadata

� Membership drew on expertise from across the university - academics, IT, web, metadata, and library

� Reviewed metadata standards, DC, IMS, AGLS� Metadata use in large information –rich

organizations, eg, Aust Govt, UK Government, UNSW libraries

Page 7: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 7

UniMelb metadata standard

� 19 elements (meta tags) to describe and manage a resource

� 2003 revised standard endorsed by Information Strategy Committee.

� Requirement on all University of Melbourne web pages

Page 8: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 8

Why Dublin Core (besides this being a DC-ANZ conference) ?

� ISO 15836� 15 elements, simple� International consensus� Well supported� Offers semantic interoperability� Extensible� Easy to implement in our environment

Page 9: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 9

University of Melbourne DC Metadata Elements

DC.TitleDC.CreatorDC.SubjectDC.DescriptionDC.PublisherDC.Contributor

DC.RightsDC.DateDC.Date.ModifiedDC.LanguageDC.FormatDC.Identifier

Page 10: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 10

University of Melbourne Administrative Metadata ElementsUM.Creator.EmailUM.Authoriser.NameUM.Authoriser.TitleUM.Maintainer.NameUM.Maintainer.DepartmentUM.Maintainer.EmailUM.Date.ReviewDue

Page 11: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 11

Broad Scale Compliance Analysis

� Full crawl of the University of Melbourne web presence in March 2005

� Used was the Internet Archive's Heritrix suite (an open-source, extensible, web-scale, archival-quality web crawler)

� Total 57Gb of data was retrieved from www.unimelb.edu.au and its associated sub-domains over a period of 146 hours

� 1.4 million documents were retrieved

Page 12: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 12

The UniMelb Web Environment

Format Demographics of UniMelb Web

text/html

image/jpeg

image/gif

application/pdf

text/plain

application/msword

application/msexcel

application/mspowerpoint

application/postscript

others

Page 13: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 13

Observations� HTML is no longer the dominant format

� UniMelb’s metadata creation processes primarily oriented at creating Dublin Core-extended metadata as simple HTML meta tags

� Pure HTML content in fact is no longer dominant format� Web-accessibility of “non-native” document types

� Many MIME Types are not addressed by the UniMelb guidelines for metadata creation but which do offer some potential for restricted metadata inclusion

� Emerging document types such as XML and RDF do not easily allow for the embedding of metadata internal to the resource.

� The emergence of dynamic documents� Analysis of “All Other” categories shows many (~38%) of these

documents are dynamic, generated server side on demand by PHP, ASP, JSP etc.

� No thought currently given to inclusion of metadata in automatically generated documents of this type

Page 14: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 14

DC Metadata on the UniMelb WebUsage of DC Elements

0.010.020.030.040.050.060.070.080.090.0D

C.T

itle

DC

.Cre

ator

DC

.Sub

ject

and

DC

.Des

crip

tion

DC

.Pub

lishe

r

Dc.

Con

tribu

tor

DC

.Rig

hts

DC

.Dat

e

DC

.Dat

eMod

ified

DC

.Lan

guag

e

DC

.For

mat

DC

.Iden

tifie

r

Ove

rall

Ave

rage

DC Metadata Element

% C

over

age

% HTML Pages withMetatdata in <HEAD>

%HTML Pages withMetadata in <BODY>element

Total % HTML Pagescontaining Metadata ineither <HEAD> or<BODY>

Page 15: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 15

Observations

� Alignment with broad Dublin Core norms� These figures are generally in line with the

findings of broad scale Dublin Core-oriented metadata communities � OAI (Ward, 2003) � OLAC (Hughes, 2004)

Page 16: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 16

UM Metadata on the UniMelb WebUsage of UM Elements

0.0

10.0

20.0

30.0

40.0

50.0

60.0

70.0

80.0

UM.Date

.Rev

iewDue

UM.Crea

tor.E

mail

UM.Auth

orise

r.Title

UM.Main

taine

r.Nam

e

UM.Main

taine

r.Email

Overa

ll Ave

rage

UM Metadata Element

% C

over

age

% HTML Pages withMetatdata in <HEAD>

%HTML Pages with Metadatain <BODY> element

Total % HTML Pagescontaining Metadata in either<HEAD> or <BODY>

Page 17: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 17

Observations� Differences between core Dublin Core and institutional metadata

� institutional metadata is more regularly contributed, despite the automatic creation of some DC by content creation applications

� Correlation with manual inspection statistics� these experiments suggest trends detected in earlier focused

studies such as Zajacek (2002a, 2002b) are valid. � Differences between metadata included in <HEAD> vs <BODY>

elements� for institutional metadata, there is a strong tendency to include

metadata in the <BODY> elements where it is immediately visible on the page rather than in the <HEAD> elements which may reflects the emphasis of the training materials

Page 18: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 18

Reflections and Challenges 1� % coverage of HTML sources is relatively low, but it does

account for a large number of documents (650K total in this survey)� Many documents are non-compliant for identifiable reasons – eg

exclusion of metadata in template based pages such as those within the learning management system

� External search engines like are not using meta tag information any more but perform full text indexing (see Richardson, 2004)� Benefit to general web searchers of institutional metadata creation is

almost zero � May still retain currency for other administrative purposes eg the

authorization of web content publication. � Need to distinguish between the institutions need for web content

management, and how metadata facilitates this goal, and decoupling from web search experience in general.

Page 19: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 19

Reflections and Challenges 2� Potential impact of institution wide Content Management System

� Existing metadata standards failed to address distributed content creation (or underestimated the pervasive effect of “publish to web” type technologies to all staff),

� Opportunity to increase compliance with new generation tools andpractices.

� Revisiting motivation for web metadata: search assistance or administrative processes ?� Changes to work practices required for web publishing authorisation

� “Compliance audit” service � for run time verification of metadata compliance, with a

“watermarking” service which automatically imprimaturs compliantpages in the absence of manual inspection.

� Require the formalisation of University of Melbourne metadata as a true Dublin Core application profile and an associated formal schema, and the creation of controlled vocabularies for extensions.

Page 20: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 20

Reflections and Challenges 3

� Large number of pages which will be updated only at an irregularinterval� Substantially increasing the coverage of institutional metadata in the

short to medium term may require the deployment of an automated metadata creation service such as DCdot (Powell, 2000) or an augmentation service such as OLACdot (Hughes, 2005).

� Early experiments with DCdot show significant promise, but need to be more carefully evaluated in light of recent research in the area (Greenberg, 2005).

� Training of critical importance� Significant effort was invested in training key personnel, and the

propagation of the institutional standards and training notes online, only a small number of face to face classes have been held.

Page 21: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 21

Conclusion� UniMelb was identified as one of the leading universities with

regard to metadata implementation (Ivanova, 2004)� Empirical evidence suggests that The University of Melbourne

still faces significant challenges� Compliance in the age of moving standards - over a 2 year

period the evolution of external standards, web content creationtools, and web content demography is significant

� Strong basis for institutional metadata was formed by the adoption of Dublin Core � the disparate content creation environment and rapidly changing

composition of web content has induced a less than satisfactory application of these standards.

� Automated metadata creation and assessment, forming a significant component of future work may address this problem inpart

Page 22: If We're Not There Yet, How Far Do We Have To Go ? Web Metadata at The University of Melbourne

Young & Hughes, DC-ANZ 2005 22

Questions / Comments

� http://eprints.unimelb.edu.au/archive/00000983� Eve Young

Metadata CoordinatorInformation Acquisition and Organisation SectionInformation [email protected]

� Baden HughesResearch FellowDepartment of Computer Science and Software Engineering [email protected]