Metadata Quality Evaluation: Experience from the Open Language Archives Community

Baden Hughes
Department of Computer Science and Software Engineering
University of Melbourne
[email protected]

Paper at ICADL2004 (December 2004, Shanghai)


Page 1: Metadata Quality Evaluation: Experience from the Open Language Archives Community

Page 2: Presentation Overview

- Introduction
- OLAC Community Background
- Motivation
- Algorithm Design
- Implementation
- Demo
- Evaluation
- Future Directions
- Conclusion

Page 3: Introduction

- Distributed metadata creation practices unfortunately result in highly variable metadata quality
- A lack of extant metadata quality evaluation tools and methodologies makes this variation difficult to address
- Our contribution is a suite of metadata quality evaluation tools within a specific OAI sub-domain, the Open Language Archives Community
- A significant feature of these tools is the ability to assess metadata quality against actual community practice and external best practice standards

Page 4: Open Language Archives Community (OLAC)

- An open consortium of 29 linguistic data archives cataloguing 27K language-related objects
- OLAC metadata is a Dublin Core application profile covering domain-specific areas: language, linguistic type, subject language, linguistic subject, linguistic role
- Based on the OAI architecture: a two-tiered data provider and service provider model linked by the OAI Protocol for Metadata Harvesting (OAI-PMH)
- A number of OLAC innovations have motivated the development of OAI services: static repositories, virtual service providers, personal metadata creation and management tools
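For concreteness, a hypothetical OLAC record illustrating the application profile: Dublin Core elements refined with `xsi:type` and `olac:code` attributes drawn from OLAC controlled vocabularies. The title, language code, and contributor below are invented for illustration, and the namespace URIs shown are those of a later OLAC schema revision.

```xml
<olac:olac xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:olac="http://www.language-archives.org/OLAC/1.1/"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <dc:title>Example wordlist</dc:title>
  <dc:subject xsi:type="olac:language" olac:code="x-sil-ABC"/>
  <dc:type xsi:type="olac:linguistic-type" olac:code="lexicon"/>
  <dc:contributor xsi:type="olac:role" olac:code="compiler">A. Linguist</dc:contributor>
</olac:olac>
```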

Page 5: Motivation

- To establish infrastructural support for ongoing metadata quality evaluation
- Validation tools for higher-layer interoperability such as OAI work well for conformance checking
- At a community level, we generally lack tools which provide qualitative analyses in both semantic and syntactic modes
- Differentiating factors of our work: establishing a common baseline; assisting individual data providers directly; assessing use of OLAC controlled vocabularies (CVs)

Page 6: Algorithm Design #1

- The basic objective is to generate a score for each metadata record based on Dublin Core and OLAC best practice recommendations
- Code Existence Score: the number of elements containing code attributes, divided by the number of elements in the record whose type is associated with a controlled vocabulary
- Element Absence Penalty: the number of core elements absent from the record, divided by the total number of core elements
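A minimal sketch of the two per-record scores, assuming a record is a list of element dictionaries. The element names, the core set, and the CV-governed set below are illustrative assumptions, not the published OLAC implementation.

```python
# Illustrative sketch of the two per-record scores. The sets below
# are assumptions for demonstration, not OLAC's actual definitions.
CV_ELEMENTS = {"subject", "language", "type", "contributor"}   # elements governed by a CV (assumed)
CORE_ELEMENTS = {"title", "description", "date",
                 "identifier", "creator", "subject"}           # assumed core element set

def code_existence_score(record):
    """Fraction of CV-governed elements in the record carrying a code attribute."""
    cv_slots = [e for e in record if e["name"] in CV_ELEMENTS]
    if not cv_slots:
        return 1.0  # nothing to encode; treat as fully coded
    coded = sum(1 for e in cv_slots if e.get("code"))
    return coded / len(cv_slots)

def element_absence_penalty(record):
    """Fraction of core elements missing from the record."""
    present = {e["name"] for e in record}
    return len(CORE_ELEMENTS - present) / len(CORE_ELEMENTS)

record = [
    {"name": "title"},
    {"name": "subject", "code": "x-sil-YLE"},
    {"name": "type"},  # CV-governed but lacks a code attribute
]
print(code_existence_score(record))    # 1 of 2 CV slots coded -> 0.5
print(element_absence_penalty(record)) # 4 of 6 core elements absent -> ~0.667
```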

Page 7: Algorithm Design #2

- Per Metadata Record Weighted Aggregate: an arbitrary maximum multiplied by the weighted product of the Code Existence Score and the Element Absence Penalty
- Derivative metrics: archive diversity, metadata quality score, core elements per record, core element usage, code usage, code and element usage, "star rating"
- Using these metrics, we compute a score for each metadata record in an archive, for each archive in total, and for the whole community
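The slides describe the aggregate as an arbitrary maximum scaled by a weighted product of the two scores, but do not publish the constants, so the maximum, the weights, and the star-rating mapping below are all assumptions.

```python
# Hedged sketch of the Per Metadata Record Weighted Aggregate:
# a code-existence reward combined with an element-absence penalty,
# scaled to an arbitrary maximum. Constants are assumptions; the
# presentation does not give the actual values.
MAX_SCORE = 10.0               # arbitrary maximum (assumed)
W_CODE, W_ABSENCE = 1.0, 1.0   # assumed weights

def weighted_aggregate(code_score, absence_penalty):
    """Combine the two per-record scores into one aggregate."""
    return MAX_SCORE * (W_CODE * code_score) * (1.0 - W_ABSENCE * absence_penalty)

def star_rating(aggregate, max_score=MAX_SCORE, stars=5):
    """Map an aggregate onto a coarse 0-5 'star rating' (assumed mapping)."""
    return round(stars * aggregate / max_score)

score = weighted_aggregate(0.5, 4 / 6)
print(score)              # 10 * 0.5 * (1 - 4/6) ≈ 1.67
print(star_rating(score)) # -> 1
```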

Page 8: Implementation

- Live service at http://www.language-archives.org/tools/reports/archiveReportCard.php
- The metadata quality evaluation suite is installed in the service layer, on top of the OLAC Harvester and Aggregator
- Based on Apache, MySQL, and PHP; runs on Windows/Mac/Linux
- All codebase components are open source, licensed under the GPL, and available from SourceForge: http://sf.net/projects/olac
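The harvester pulls records from data providers over OAI-PMH. A minimal sketch of building a ListRecords request such as a harvester would issue; the base URL is illustrative, and `metadataPrefix=olac` is assumed to be the prefix an OLAC repository advertises.

```python
# Sketch of an OAI-PMH ListRecords request URL. The base URL is
# illustrative; the olac metadataPrefix is an assumption about the
# target repository.
from urllib.parse import urlencode

def list_records_url(base_url, metadata_prefix="olac", resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords"}
    if resumption_token:
        # Per OAI-PMH, a resumptionToken request omits metadataPrefix.
        params["resumptionToken"] = resumption_token
    else:
        params["metadataPrefix"] = metadata_prefix
    return base_url + "?" + urlencode(params)

print(list_records_url("http://example.org/oai"))
# -> http://example.org/oai?verb=ListRecords&metadataPrefix=olac
```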

Page 9: Demo

- Metadata quality report on all OLAC data providers [Live] [Local]
- Metadata quality report for a single OLAC data provider (PARADISEC) [Live] [Local]

Page 10: Evaluation #1

- Creating a data provider ranking system was not a primary goal of the work reported here
- Per data provider:
  - Apparently no systematic correlation between archive size and overall metadata quality
  - A positive correlation between archive size and the average number of elements per metadata record
- Community-wide:
  - Additional evidence supporting earlier work on the most common metadata elements
  - 4 distinct classes: subject; title, description, date, identifier, creator; format, type, contributor, publisher, isPartOf; all others (including OLAC CVs)

Page 11: Evaluation #2

- Qualitatively-based archive clustering:
  - 3 distinct groups of archives based on the Per Metadata Record Weighted Aggregate
  - Characterised by metadata creation technique, size, number of elements used, and application of OLAC controlled vocabularies
- Use of OLAC CVs:
  - Subject: OLAC CV used 56% of the time, for language identification where the DC recommendation of ISO 639-2 is too coarse
  - Contributor: OLAC CV used 78% of the time, for distinct roles in the linguistic data creation/curation process
  - Type: OLAC CV used 33% of the time, surprising given the domain requirement for differentiating linguistic data types

Page 12: Future Directions

- Algorithm improvements, particularly weighting in proportion to the size of a data provider
- A longitudinal study of metadata evolution, including qualitative aspects (commenced, and retrofitted to January 2002)
- New services based on quality attributes: the OLAC Search Engine uses metadata quality as a ranking scheme for result sets
- New metrics which reflect other values of the OLAC community, e.g. online data, use of CVs

Page 13: Conclusions

- Reported the design and deployment of scalable, dynamic metadata quality evaluation infrastructure
- A distinct contribution in the absence of comparable services and models; our code is open for the community to experiment with
- Allows more accurate identification of leverage points for metadata enrichment effort
- Promotes better practice in metadata development and management
- Ultimately enables better search and retrieval experiences for end users

Page 14: Acknowledgements

- National Science Foundation Grants #9910603 (International Standards in Language Engineering) and #0094934 (Querying Linguistic Databases)
- Amol Kamat, Steven Bird and Gary Simons
- ICADL Program Committee and Reviewers