refactoring metadata:

May 2-6, 2004 Copyright © 2004 Baden Hughes

Refactoring Metadata: Finding architectural compatibility through structural comparisons

Baden Hughes
Department of Computer Science and Software Engineering
The University of Melbourne


Agenda

• Motivation for Refactoring Metadata
• Setting the Context
• Identifying Points of Comparison
• Goals for Structural Comparison
• Methods for Structural Comparison
• Refactoring in Practice
• Principles for Robust Instances
• Conclusion


Motivations for Refactoring Metadata

• The need to address the problem of metadata volatility is conceptually juxtaposed with the motivation for metadata creation

• XML technologies have become pervasive within the metadata domain

• Different communities = different standards = different degrees of maturity in metadata adoption resulting in metadata being (highly?) variable even within an organization

• Systematically determining similarity and difference is the key to effective refactoring of metadata

• Automatically determining similarity and difference is the key to efficient refactoring of metadata


Setting the Context

• XML-based metadata from natural language engineering and digital libraries

• Wide variety of
  – traditions of metadata development
  – technologies for metadata implementation
  – objects described by metadata
  – granularity of metadata descriptions

• Motivated by interoperability analysis

• Seeking to leverage processes not dissimilar to database schema comparisons


Identifying Points of Comparison

• Robust instances require both syntactic and semantic analysis

• Points of comparison
  – XML Document Instance
  – DTDs
  – Schemata
  – Namespaces
  – RDF Instances
  – Ontologies

• Likely that different methods are required for each different input


Goals of Structural Comparison

• While validation of XML based metadata does contribute to the quality of metadata, it does not necessarily assist in determining architectural compatibility

• Systematic, iterative evaluation of metadata architectures can contribute to maturity of XML based metadata

• Quantifying the degree of syntactic and semantic similarity is an important first step in the refactoring process – it may in fact demonstrate viability


Methods for Structural Comparison

• Different methods for structural comparison depending on the input
  – XML documents: trees
  – DTDs: regexps and feature structures
  – XML Namespaces: feature structures
  – XML Schemas: regexps and graph matching
  – RDF Instances: graph matching
  – Ontologies: feature structures and graph matching


Tree Matching

• Common conception of an XML document as a tree structure

• Tree matching is a widely used IE/IR technique for structured data, and is applicable to XML based metadata

• Tree matching is largely derived from pattern matching, and is largely independent of syntactic or semantic constraints

• While tree matching can provide basic information about the similarity of two documents, a deeper analysis is required to establish architectural compatibility
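
As a rough illustration of the basic similarity information tree matching can give, the sketch below compares two XML documents by the set of root-to-node tag paths they contain and reports the Jaccard overlap. The path-set measure, the function names and the two sample records are assumptions made for this example; the slides do not specify a particular tree matching algorithm.

```python
import xml.etree.ElementTree as ET


def element_paths(xml_text):
    """Return the set of root-to-node tag paths found in an XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def path_similarity(xml_a, xml_b):
    """Jaccard overlap of the two documents' path sets (1.0 = same shape)."""
    a, b = element_paths(xml_a), element_paths(xml_b)
    return len(a & b) / len(a | b)


# Hypothetical metadata records, invented for illustration only.
record_a = "<record><title>A</title><creator>X</creator></record>"
record_b = "<record><title>B</title><date>2004</date></record>"

print(path_similarity(record_a, record_b))
```

A score like this ignores element content, ordering and repetition, which is one reason a deeper analysis is needed before drawing conclusions about architectural compatibility.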


Regular Expression Matching

• DTD syntax is derived from regular expressions

• Well known evaluation methodologies for regexps are applicable to DTDs

• In contrast to pure syntactic comparison, regexp matching allows the discovery of the legal constituents of syntactic structures

• Regexp evaluation is a highly efficient exercise even on large metadata collections, and is widely implemented in common programming languages
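
To make the link between DTD content models and regular expressions concrete, here is a minimal sketch that rewrites a simple content model as a Python regex and checks which child sequences it accepts. The helper names and the two content models are hypothetical, and the translation only covers element names, sequences, choices and the `?`, `*`, `+` indicators (not `#PCDATA`, `EMPTY` or `ANY`).

```python
import re


def content_model_to_regex(model):
    """Translate a simple DTD content model into a compiled Python regex."""
    model = re.sub(r"\s+", "", model)
    # Wrap every element name as a token with a trailing space delimiter.
    model = re.sub(
        r"[A-Za-z_][\w.\-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        model,
    )
    # In a DTD a comma means sequence, which is plain concatenation in a regex.
    return re.compile(model.replace(",", ""))


def accepts(model, children):
    """True if the list of child element names is legal under the model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


# Two hypothetical content models for a 'record' element.
model_a = "(title, creator*, date?)"
model_b = "(title, (creator | contributor)+)"

children = ["title", "creator", "creator"]
print(accepts(model_a, children), accepts(model_b, children))
```

Running the same child sequences against both models shows where the two DTDs agree on legal constituents and where they diverge; for instance, a record with only a `title` is legal under `model_a` but not under `model_b`.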


Feature Structure Matching

• Typed feature structures are widely used for deriving controlled vocabularies – XML attribute instances are typically able to be reduced to typed feature structures for comparison

• Evaluation of the semantic content of feature structures is well grounded in formal logic

• Feature structure comparisons can also reveal syntactic constraints expressed as dimensionality of feature matrices
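
One way to picture the comparison is unification: two feature structures are compatible if they can be merged without any feature taking conflicting values. The sketch below uses untyped nested dictionaries as a simplification of typed feature structures; the `unify` helper and the sample attribute sets are assumptions for illustration, not the method used in the talk.

```python
def unify(fs_a, fs_b):
    """Unify two feature structures (nested dicts); return None on conflict."""
    result = dict(fs_a)
    for feature, value_b in fs_b.items():
        if feature not in result:
            # Feature only present in one structure: carry it over.
            result[feature] = value_b
            continue
        value_a = result[feature]
        if isinstance(value_a, dict) and isinstance(value_b, dict):
            merged = unify(value_a, value_b)
            if merged is None:
                return None
            result[feature] = merged
        elif value_a != value_b:
            # Same feature, different atomic values: the structures clash.
            return None
    return result


# Hypothetical attribute sets reduced to feature structures.
resource_a = {"type": "text", "language": {"code": "en"}}
resource_b = {"language": {"code": "en", "script": "Latn"}}
resource_c = {"type": "sound"}

print(unify(resource_a, resource_b))
print(unify(resource_a, resource_c))
```

A successful unification suggests the two descriptions are compatible and shows the combined set of features; a failure pinpoints a feature whose values would have to be reconciled during refactoring.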


Graph Matching

• Rich XML representations such as RDF can be construed as a series of arcs and nodes, allowing the adoption of graph theory techniques for the determination of isomorphism

• Finding the minimum and maximum common subgraphs is a technique which can be used to determine architectural compatibility in the syntactic domain

• Graph matching is primarily syntactic, although it can also be applied to semantic analysis on sources such as ontologies
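
A small sketch of the arcs-and-nodes view: each RDF statement is treated as a labelled arc (subject, predicate, object), and two instances are compared by the arcs they share. This assumes every node is identified by a URI or literal with no blank nodes, in which case the largest common subgraph is simply the intersection of the two arc sets; the general maximum common subgraph problem is NP-hard and needs a proper graph matching algorithm. The prefixes and triples below are invented for the example.

```python
def common_subgraph(triples_a, triples_b):
    """Arcs present in both graphs (valid when node labels are unique)."""
    return set(triples_a) & set(triples_b)


def arc_overlap(triples_a, triples_b):
    """Share of all arcs that the two graphs have in common (0.0 to 1.0)."""
    a, b = set(triples_a), set(triples_b)
    if not a | b:
        return 1.0  # two empty graphs are trivially identical
    return len(a & b) / len(a | b)


# Hypothetical RDF instances as (subject, predicate, object) triples.
graph_a = {
    ("ex:doc1", "dc:title", "Lexicon"),
    ("ex:doc1", "dc:creator", "ex:person1"),
    ("ex:person1", "foaf:name", "A. Author"),
}
graph_b = {
    ("ex:doc1", "dc:title", "Lexicon"),
    ("ex:doc1", "dc:creator", "ex:person1"),
    ("ex:doc1", "dc:date", "2004"),
}

print(sorted(common_subgraph(graph_a, graph_b)))
print(arc_overlap(graph_a, graph_b))
```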


Refactoring in Practice

• XML Documents
• DTDs
• Namespaces
• Schemata
• RDF instances
• Ontologies
• See http://www.cs.mu.oz.au/~badenh/projects/metadata-comparison for demo materials


Principles for Robust Instances

• Both syntactic and semantic analysis are required

• Initiate comparisons at the highest level, and proceed downwards – higher level incompatibilities are more complex to resolve

• Quantifying degree of similarity is extremely important as it impacts directly on the complexity of refactoring processes

• Accurately identified commonalities at both syntactic and semantic levels can be leveraged efficiently


Conclusion

• Adopting and permuting a range of techniques for structural comparison from a variety of other disciplines can lead to efficient methods for metadata structural analysis and consequently refactoring

• Large scale metadata management requires an automated approach to both syntactic and semantic evaluation in order to contribute to ROI


Acknowledgements

• National Science Foundation Grant Number 9910603 (International Standards in Language Engineering)

• National Science Foundation Grant Number 0317826 (Querying Linguistic Databases)
