Talk at DAMA/Metadata 2004 (2-6 May 2004, Los Angeles)
May 2-6, 2004 Copyright © 2004 Baden Hughes 1
Refactoring Metadata: Finding architectural compatibility through structural comparisons
Baden Hughes
Department of Computer Science and Software Engineering
The University of Melbourne
Agenda
• Motivation for Refactoring Metadata
• Setting the Context
• Identifying Points of Comparison
• Goals for Structural Comparison
• Methods for Structural Comparison
• Refactoring in Practice
• Principles for Robust Instances
• Conclusion
Motivations for Refactoring Metadata
• The need to address the problem of metadata volatility is conceptually juxtaposed with the motivation for metadata creation
• XML technologies have become pervasive within the metadata domain
• Different communities = different standards = different degrees of maturity in metadata adoption resulting in metadata being (highly?) variable even within an organization
• Systematically determining similarity and difference is the key to effective refactoring of metadata
• Automatically determining similarity and difference is the key to efficient refactoring of metadata
Setting the Context
• XML-based metadata from natural language engineering and digital libraries
• Wide variety of
  – traditions of metadata development
  – technologies for metadata implementation
  – objects described by metadata
  – granularity of metadata descriptions
• Motivated by interoperability analysis
• Seeking to leverage processes not dissimilar to database schema comparisons
Identifying Points of Comparison
• Robust instances require both syntactic and semantic analysis
• Points of comparison
  – XML Document Instances
  – DTDs
  – Schemata
  – Namespaces
  – RDF Instances
  – Ontologies
• Likely that different methods are required for each different input
Goals of Structural Comparison
• While validation of XML-based metadata does contribute to the quality of metadata, it does not necessarily assist in determining architectural compatibility
• Systematic, iterative evaluation of metadata architectures can contribute to maturity of XML-based metadata
• Quantifying the degree of syntactic and semantic similarity is an important first step in the refactoring process – it may in fact demonstrate viability
Methods for Structural Comparison
• Different methods for structural comparison depending on the input
  – XML documents: trees
  – DTDs: regexps and feature structures
  – XML Namespaces: feature structures
  – XML Schemas: regexps and graph matching
  – RDF Instances: graph matching
  – Ontologies: feature structures and graph matching
Tree Matching
• Common conception of an XML document as a tree structure
• Tree matching is a widely used IE/IR technique for structured data, and is applicable to XML-based metadata
• Tree matching is largely derived from pattern matching, and is largely independent of syntactic or semantic constraints
• While tree matching can provide basic information about the similarity of two documents, for architectural compatibility a deeper analysis is required
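A minimal sketch of the "basic information about similarity" a tree comparison can give. It is not the method used in the talk: it swaps full tree matching for a simpler path-set comparison, scoring two documents by the overlap of their root-to-node tag paths (Jaccard similarity). The function names and the sample records are hypothetical.

```python
import xml.etree.ElementTree as ET


def tag_paths(elem, prefix=""):
    """Yield the root-to-node tag path of every element in the tree."""
    path = prefix + "/" + elem.tag
    yield path
    for child in elem:
        yield from tag_paths(child, path)


def path_similarity(xml_a, xml_b):
    """Jaccard overlap of the tag-path sets of two XML documents (0.0 to 1.0)."""
    paths_a = set(tag_paths(ET.fromstring(xml_a)))
    paths_b = set(tag_paths(ET.fromstring(xml_b)))
    return len(paths_a & paths_b) / len(paths_a | paths_b)


# Hypothetical metadata records that differ in one child element.
record_a = "<record><title>A</title><creator>B</creator><date>2004</date></record>"
record_b = "<record><title>C</title><creator>D</creator><subject>E</subject></record>"

# 3 shared paths out of 5 distinct paths
print(path_similarity(record_a, record_b))
```

The score ignores text content, attributes, and sibling order, which is why a deeper analysis is still needed for architectural compatibility.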
Regular Expression Matching
• DTD syntax is derived from regular expressions
• Well known evaluation methodologies for regexps are applicable to DTDs
• In contrast to pure syntactic comparison, regexp matching allows the discovery of the legal constituents of syntactic structures
• Regexp evaluation is a highly efficient exercise even on large metadata collections, and widely implemented in common programming languages
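To illustrate how closely a DTD content model follows regular expression syntax, the sketch below translates an element-content model into a Python regex and tests sequences of child element names against it. It assumes element-only content models (no #PCDATA, ANY, or EMPTY); the function names and the two sample models are hypothetical.

```python
import re

# XML-style element names, including an optional namespace prefix.
NAME = re.compile(r"[A-Za-z_:][\w.:\-]*")


def content_model_to_regex(model):
    """Compile a DTD element-content model into a regex over child names.

    Each child name is matched with a trailing space, so 'creator' can
    never match a prefix of a longer name such as 'creators'.
    """
    compact = "".join(model.split())  # drop all whitespace
    pattern = NAME.sub(lambda m: "(?:" + re.escape(m.group(0)) + " )", compact)
    # A comma means "followed by", which is plain concatenation in a regex.
    return re.compile(pattern.replace(",", ""))


def children_conform(model, child_names):
    """Return True if the sequence of child names is legal under the model."""
    sequence = "".join(name + " " for name in child_names)
    return content_model_to_regex(model).fullmatch(sequence) is not None


# Two hypothetical content models for a <record> element.
model_a = "(title, creator+, date?)"
model_b = "(title, (creator | contributor)*, date)"

children = ["title", "creator"]
print(children_conform(model_a, children), children_conform(model_b, children))
```

Running the same child sequences through both models shows which constituents are legal under one DTD but not the other.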
Feature Structure Matching
• Typed feature structures are widely used for deriving controlled vocabularies – XML attribute instances are typically able to be reduced to typed feature structures for comparison
• Evaluation of the semantic content of feature structures is well grounded in formal logic
• Feature structure comparisons can also reveal syntactic constraints expressed as dimensionality of feature matrices
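A rough sketch of comparing attribute sets as feature structures. It assumes untyped feature structures represented as nested dicts and uses plain unification as the comparison: two structures are compatible if they unify without a conflicting value. The type hierarchy that typed feature structures would add is omitted, and the `unify` function and sample elements are hypothetical.

```python
import xml.etree.ElementTree as ET


def unify(fs_a, fs_b):
    """Unify two feature structures (nested dicts); return None on conflict."""
    result = dict(fs_a)
    for feature, value_b in fs_b.items():
        if feature not in result:
            result[feature] = value_b
            continue
        value_a = result[feature]
        if isinstance(value_a, dict) and isinstance(value_b, dict):
            merged = unify(value_a, value_b)
            if merged is None:
                return None
            result[feature] = merged
        elif value_a != value_b:
            return None  # same feature, incompatible values
    return result


# XML attributes reduce directly to flat feature structures.
lang_a = ET.fromstring('<language code="en" scheme="ISO639"/>').attrib
lang_b = ET.fromstring('<language code="en" script="Latn"/>').attrib
lang_c = ET.fromstring('<language code="fr"/>').attrib

print(unify(lang_a, lang_b))  # compatible: features are merged
print(unify(lang_a, lang_c))  # conflict on 'code'
```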
Graph Matching
• Rich XML representations such as RDF can be construed as a series of arcs and nodes, allowing the adoption of graph theory techniques for the determination of isomorphism
• Finding the minimum and maximum common subgraphs is a technique which can be used to determine architectural compatibility in the syntactic domain
• Graph matching is primarily syntactic, although it can also be applied to semantic analysis on sources such as ontologies
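As an illustration of the isomorphism test, the sketch below treats an RDF-like graph as a set of (subject, predicate, object) triples and checks by brute force whether two graphs have the same shape once node names are ignored. It assumes very small graphs, since it tries every node permutation; a practical tool would need a proper graph matching algorithm. The `isomorphic` function and the sample triples are hypothetical.

```python
from itertools import permutations


def isomorphic(edges_a, edges_b):
    """Return True if some renaming of nodes maps graph A exactly onto graph B.

    Edge labels (predicates) must be preserved; node names are ignored.
    """
    nodes_a = sorted({n for s, _, o in edges_a for n in (s, o)})
    nodes_b = sorted({n for s, _, o in edges_b for n in (s, o)})
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    target = set(edges_b)
    for candidate in permutations(nodes_b):
        mapping = dict(zip(nodes_a, candidate))
        if {(mapping[s], p, mapping[o]) for s, p, o in edges_a} == target:
            return True
    return False


# Hypothetical descriptions: A and B share a shape, C chains its arcs.
graph_a = {("doc1", "dc:creator", "p1"), ("doc1", "dc:title", "t1")}
graph_b = {("x", "dc:creator", "y"), ("x", "dc:title", "z")}
graph_c = {("x", "dc:creator", "y"), ("y", "dc:title", "z")}

print(isomorphic(graph_a, graph_b), isomorphic(graph_a, graph_c))
```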
Refactoring in Practice
• XML Documents
• DTDs
• Namespaces
• Schemata
• RDF instances
• Ontologies
• See http://www.cs.mu.oz.au/~badenh/projects/metadata-comparison for demo materials
Principles for Robust Instances
• Both syntactic and semantic analysis are required
• Initiate comparisons at the highest level, and proceed downwards – higher level incompatibilities are more complex to resolve
• Quantifying degree of similarity is extremely important as it impacts directly on the complexity of refactoring processes
• Accurately identified commonalities at both syntactic and semantic levels can be leveraged efficiently
Conclusion
• Adopting and permuting a range of techniques for structural comparison from a variety of other disciplines can lead to efficient methods for metadata structural analysis and consequently refactoring
• Large scale metadata management requires an automated approach to both syntactic and semantic evaluation in order to contribute to ROI
Acknowledgements
• National Science Foundation Grant Number 9910603 (International Standards in Language Engineering)
• National Science Foundation Grant Number 0317826 (Querying Linguistic Databases)