DCMI Metadata Provenance
Metadata Provenance: Two motivating scenarios for metametadata
Kai Eckert, Mannheim University Library
Michael Panzer, OCLC
DCMI Metadata Provenance F2F Meeting and Workshop
October 20th, 2010, Pittsburgh, PA, USA
Metametadata
Provenance information outside of existing data models
"Transparent"
Potential use cases:
  Whenever you have lots of legacy data in a model that does not support provenance.
  Whenever new applications require information that cannot be expressed in the existing data model.
Need for Metametadata
Metadata are also data, so we need additional data about them: metametadata.
Metadata about a whole metadata record, not for single statements:
  Who created this metadata record?
  When was this record created?
  …
Metadata Provenance
Statements about (single) statements
Often proposed, but with only vague instructions on how to implement it.
Needed if metadata records are created by combining single statements from different sources.
Needed for storing arbitrary additional information about single statements that cannot easily be represented in the metadata format.
Metametadata vs. Model based provenance
Simple statement: Peter knows Paul.
Provenance information: This statement is made by Mary.
[Figure: Peter "Knows" Paul is the statement; on the metalevel, Mary "says" this statement.]
Data model extension
[Figure: Peter "Has Relation" Relation; the Relation node "Has Type" Knows Relation, "Has Object" Paul and "Has Creator" Mary.]
Simple statement: Peter knows Paul.
Provenance information: This statement is made by Mary.
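The difference between the two representations can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: triples are plain tuples, the identifier `relation1` and the helper `knows_plain` are made up for this sketch, and no RDF library is involved.

```python
# Base statement as a (subject, predicate, object) tuple.
statement = ("Peter", "knows", "Paul")

# Metalevel approach: the data stays untouched; provenance is kept in a
# separate mapping keyed by the statement itself.
data = {statement}
metalevel = {statement: {"says": "Mary"}}

# Data model extension: the relation becomes a node of its own
# ("relation1" is a hypothetical identifier), so the original
# "Peter knows Paul" triple no longer exists as such.
extended = {
    ("Peter", "hasRelation", "relation1"),
    ("relation1", "hasType", "KnowsRelation"),
    ("relation1", "hasObject", "Paul"),
    ("relation1", "hasCreator", "Mary"),
}


def knows_plain(triples, who):
    """Lookup that an application unaware of provenance would run."""
    return sorted(o for s, p, o in triples if s == who and p == "knows")


print(knows_plain(data, "Peter"))      # metalevel variant: statement is found
print(knows_plain(extended, "Peter"))  # extended model: nothing is found
```

The sketch shows the trade-off: with the metalevel approach existing lookups keep working, while the extended model requires every consumer to understand the new relation node.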
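[Figure: both representations side by side: the data model extension (Peter "Has Relation" Relation; Relation "Has Type" Knows Relation, "Has Object" Paul, "Has Creator" Mary) and the metalevel statement (Mary "says" that Peter "Knows" Paul).]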
Implementation in RDF
This should not be limited to RDF!
But it is a good example, and RDF currently has a high impact.
RDF provides no satisfying answer on how to express provenance information.
Different possible implementations, e.g.:
  Reification
  Named Graphs
  Extended data models
  ...
RDF Reification
RDF supports statements about statements by means of reification, literally "objectification" (actually a "subjectification"...).
"The book is written by Goethe" is said by Kai.
How it is done in RDF:
  ex:someID rdf:type rdf:Statement .
  ex:someID rdf:subject "The book" .        # Subject
  ex:someID rdf:predicate ex:isWrittenBy .  # Predicate
  ex:someID rdf:object "Goethe" .           # Object
  ex:someID ex:isSaidBy "Kai" .
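The reification pattern above can be generated mechanically. The following Python sketch assumes triples are plain tuples of strings; the function name `reify` is chosen for this illustration and is not part of any RDF library.

```python
def reify(statement_id, subject, predicate, obj):
    """Return the four reification triples that describe one statement."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", subject),
        (statement_id, "rdf:predicate", predicate),
        (statement_id, "rdf:object", obj),
    ]


# "The book is written by Goethe" is said by Kai.
triples = reify("ex:someID", '"The book"', "ex:isWrittenBy", '"Goethe"')
triples.append(("ex:someID", "ex:isSaidBy", '"Kai"'))

for s, p, o in triples:
    print(f"{s} {p} {o} .")
```

Any further information about the statement is attached to the same identifier, here `ex:someID`.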
Simplified Presentation
Based on Notation 3 (RDF/N3)
     Subject    Predicate     Object
 1   ex:p123    rdf:type      ex:person
 2   ex:p123    ex:hasName    "Kai Eckert"
 3   ex:p123    ex:worksFor   ex:unima
Example 1: A simple RDF example
Identification of statements by the line number:
 4   #1         dc:creator    "Kai Eckert"
The subject of a statement is a reference to another statement. With this notation, we imply a reification.
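To make the implied reification explicit, the line-number notation can be expanded mechanically. The sketch below is one possible reading of the notation, not a definitive one: the function `expand` and the blank node naming scheme `_:stmtN` are assumptions made for this illustration, and a `#n` reference is assumed to point at an ordinary statement.

```python
def expand(lines):
    """Expand numbered lines; a '#n' subject refers to the statement on line n."""
    numbered = {}
    for line in lines:
        number, subject, predicate, obj = line.split(maxsplit=3)
        numbered[int(number)] = (subject, predicate, obj)

    triples = []
    for number in sorted(numbered):
        subject, predicate, obj = numbered[number]
        if not subject.startswith("#"):
            triples.append((subject, predicate, obj))  # ordinary statement
            continue
        target = int(subject[1:])
        node = f"_:stmt{target}"  # hypothetical name for the reified statement
        s, p, o = numbered[target]
        reification = [
            (node, "rdf:type", "rdf:Statement"),
            (node, "rdf:subject", s),
            (node, "rdf:predicate", p),
            (node, "rdf:object", o),
        ]
        for triple in reification:
            if triple not in triples:  # reify each statement only once
                triples.append(triple)
        triples.append((node, predicate, obj))
    return triples


example = [
    "1 ex:p123 rdf:type ex:person",
    '2 ex:p123 ex:hasName "Kai Eckert"',
    "3 ex:p123 ex:worksFor ex:unima",
    '4 #1 dc:creator "Kai Eckert"',
]
result = expand(example)
for triple in result:
    print(triple)
```

The three ordinary statements are kept unchanged; line 4 adds the four reification triples for line 1 plus the `dc:creator` annotation.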
Scenario 1: Crosswalks
Crosswalks define rules for how metadata from one schema are represented in a different schema.
Problems:
  Loss of information
  Erroneous crosswalks
MARC field → Dublin Core element
260$c (Date of publication, distribution, etc.) → Date.Created
522 (Geographic Coverage Note) → Coverage.Spatial
300$a (Physical Description) → Format.Extent
Possibilities for Metametadata
Storage of additional information that would be lost in the target format.
Identification of Crosswalks with version and the specific rule for every generated statement.
Which statements are generated by a specific rule?
Which rule is responsible for a specific (erroneous) statement?
Which data in the originating format was used to generate a specific statement?
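A crosswalk that records this kind of metametadata while it runs could look roughly like the following Python sketch. The rule numbers, the crosswalk id, the sample record and the function `apply_crosswalk` are all invented for illustration; only the three field mappings are taken from the crosswalk table.

```python
# Hypothetical rule table: rule number -> (MARC field, Dublin Core element).
CROSSWALK_ID = 3  # made-up crosswalk identifier
RULES = {
    1: ("260$c", "Date.Created"),
    2: ("522", "Coverage.Spatial"),
    3: ("300$a", "Format.Extent"),
}


def apply_crosswalk(doc_id, marc_record):
    """Map MARC fields to Dublin Core and record which rule produced what."""
    statements = []
    for rule_id, (marc_field, dc_element) in RULES.items():
        if marc_field not in marc_record:
            continue
        statement = (doc_id, dc_element, marc_record[marc_field])
        provenance = {
            "rule": rule_id,
            "crosswalk": CROSSWALK_ID,
            "origin": f"MARC:{marc_field}",
        }
        statements.append((statement, provenance))
    return statements


record = {"260$c": "2010", "300$a": "23 p."}  # invented sample record
generated = apply_crosswalk("ex:docbase/doc1", record)
for statement, provenance in generated:
    print(statement, provenance)
```

Because rule, crosswalk and origin travel with every generated statement, the three questions above can later be answered by simple lookups.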
Example 1: Crosswalk Data
     Subject            Predicate      Object
 1   ex:docbase/doc1    dc:title       "Example title"
 2   #1                 ex:rule        16
 3   #1                 ex:crosswalk   3
 4   #1                 ex:origin      MARC:245
 5   ex:docbase/doc2    dc:title       "About finding a title"
 6   #5                 ex:rule        16
 7   #5                 ex:crosswalk   3
 8   #5                 ex:origin      MARC:245
 9   ex:docbase/doc3    dc:title       "Lorem ipsum dolor"
10   #9                 ex:rule        18
11   #9                 ex:crosswalk   3
12   #9                 ex:origin      MARC:245
13   #9                 ex:origin      MARC:246
14   ex:docbase/doc4    dc:title       "Consetetur Sadipscing"
15   #14                ex:rule        19
16   #14                ex:crosswalk   6
17   #14                ex:origin      xml:/record/description
Example 4: Resulting RDF statements with additional Metametadata
Crosswalk Updates
Which statements are generated by a given rule and need to be regenerated after an update?
SELECT ?document ?field ?value WHERE {
  ?t rdf:subject ?document .
  ?t rdf:predicate ?field .
  ?t rdf:object ?value .
  ?t ex:rule 16 .
  ?t ex:crosswalk 3 .
}

document          field                             value
ex:docbase/doc1   http://www.example.org/dc#title   "Example title"
ex:docbase/doc2   http://www.example.org/dc#title   "About finding a title"
Crosswalk Debugging
Which rule is responsible for a given statement and what was the original data?
SELECT ?crosswalk ?rule ?origin WHERE {
  ?t rdf:subject <ex:docbase/doc1> .
  ?t rdf:predicate dc:title .
  ?t rdf:object "Example title" .
  ?t ex:rule ?rule .
  ?t ex:crosswalk ?crosswalk .
  ?t ex:origin ?origin .
}

crosswalk   rule   origin
3           16     "MARC:245"
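Both crosswalk queries can be mirrored in plain Python to show what they compute. This is a sketch without a SPARQL engine: the dictionary layout and the function names `generated_by` and `explain` are assumptions for this illustration, while the data is taken from the crosswalk example table.

```python
# Line number of each statement -> (statement, metametadata).
STATEMENTS = {
    1: (("ex:docbase/doc1", "dc:title", "Example title"),
        {"rule": 16, "crosswalk": 3, "origin": ["MARC:245"]}),
    5: (("ex:docbase/doc2", "dc:title", "About finding a title"),
        {"rule": 16, "crosswalk": 3, "origin": ["MARC:245"]}),
    9: (("ex:docbase/doc3", "dc:title", "Lorem ipsum dolor"),
        {"rule": 18, "crosswalk": 3, "origin": ["MARC:245", "MARC:246"]}),
    14: (("ex:docbase/doc4", "dc:title", "Consetetur Sadipscing"),
         {"rule": 19, "crosswalk": 6, "origin": ["xml:/record/description"]}),
}


def generated_by(rule, crosswalk):
    """Crosswalk update: statements that must be regenerated for a rule."""
    return [
        statement
        for statement, meta in STATEMENTS.values()
        if meta["rule"] == rule and meta["crosswalk"] == crosswalk
    ]


def explain(statement):
    """Crosswalk debugging: rule, crosswalk and origin behind a statement."""
    return [meta for stmt, meta in STATEMENTS.values() if stmt == statement]


print(generated_by(16, 3))
print(explain(("ex:docbase/doc1", "dc:title", "Example title")))
```

The first call corresponds to the update query and yields the statements for doc1 and doc2; the second corresponds to the debugging query and yields rule 16 of crosswalk 3 with origin MARC:245.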
Scenario 2: Different Sources for Metadata
Manual indexing is costly.
Many documents are not indexed at all or not searchable:
  Journal articles
  Externally owned documents
  Working papers
  Webpages
New sources for metadata?
New ways for document indexing
Automatic processes
Tagging
(Automatic) mapping of metadata from external sources
Problem: Lack of quality
How do you integrate these data from different sources without compromising the retrieval quality?
Possibilities for Metametadata
Storage of the source of single statements.
Storage of further source-specific information:
  Weighting for automatically generated subject headings.
  Number of users who tagged a document with a given tag.
  The original subject heading in case of an automatic translation or mapping.
Can we use this additional information to improve document retrieval?
Example 2: Subject indexing
     Subject                 Predicate     Object
 1   ex:docbase/doc1         dc:subject    ex:thes/sub20
 2   #1                      ex:source     ex:sources/autoindex1
 3   #1                      ex:rank       0.55
 4   ex:docbase/doc1         dc:subject    ex:thes/sub30
 5   #4                      ex:source     ex:sources/autoindex1
 6   #4                      ex:rank       0.8
 7   ex:docbase/doc1         dc:subject    ex:thes/sub30
 8   #7                      ex:source     ex:sources/pfeffer
 9   #7                      ex:rank       1.0
10   ex:docbase/doc1         dc:subject    ex:thes/sub40
11   #10                     ex:source     ex:sources/pfeffer
12   #10                     ex:rank       1.0
13   ex:sources/autoindex1   ex:type       ex:types/auto
14   ex:sources/pfeffer      ex:type       ex:types/manual
Example 7: Subject assignments by different sources
Backward compatibility
While there are four assignments for subject headings, the statement "ex:docbase/doc1 dc:subject ex:thes/sub30" is still one statement, regardless of the number of times you put it into your RDF store.
Important for applications that access the RDF data but do not handle RDF reification.
Your metadata remains valid; in particular, there are no duplicates.
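The point can be illustrated by treating an RDF store as a set of triples. The following Python sketch rests on that assumption only; the helper `add_assignment` and the node names `_:s4` and `_:s7` are made up for this example.

```python
store = set()  # an RDF graph is a set of triples, so duplicates collapse


def add_assignment(store, node, document, subject, source, rank):
    """Add the plain statement plus its reified form with metametadata."""
    store.add((document, "dc:subject", subject))  # plain statement
    store.add((node, "rdf:type", "rdf:Statement"))
    store.add((node, "rdf:subject", document))
    store.add((node, "rdf:predicate", "dc:subject"))
    store.add((node, "rdf:object", subject))
    store.add((node, "ex:source", source))
    store.add((node, "ex:rank", rank))


# The same subject heading is assigned by two different sources.
add_assignment(store, "_:s4", "ex:docbase/doc1", "ex:thes/sub30",
               "ex:sources/autoindex1", 0.8)
add_assignment(store, "_:s7", "ex:docbase/doc1", "ex:thes/sub30",
               "ex:sources/pfeffer", 1.0)

# What an application sees that ignores reification.
plain = [t for t in store if t[1] == "dc:subject"]
# The reified statements that carry the provenance.
reified = {t[0] for t in store if t[1] == "rdf:type"}

print(len(plain), len(reified))
```

The plain statement exists once, while each assignment keeps its own reified node with source and rank.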
Separating the sources
Which statements are made by a specific source (here: Pfeffer)?
SELECT ?document ?value WHERE {
  ?t rdf:subject ?document .
  ?t rdf:predicate dc:subject .
  ?t rdf:object ?value .
  ?t ex:source <ex:sources/pfeffer> .
}

document          value
ex:docbase/doc1   ex:thes/sub30
ex:docbase/doc1   ex:thes/sub40
Extended queries
Use all manually created subject headings.
Use all subject headings with a rank > 0.7.
SELECT DISTINCT ?document ?subject WHERE {
  ?t rdf:subject ?document .
  ?t rdf:predicate dc:subject .
  ?t rdf:object ?subject .
  ?t ex:source ?source .
  ?source ex:type ?type .
  ?t ex:rank ?rank .
  FILTER ( ?type = <ex:types/manual> || ?rank > 0.7 )
}

document          subject
ex:docbase/doc1   ex:thes/sub30
ex:docbase/doc1   ex:thes/sub40
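The filter logic of the extended query can be traced in plain Python. The sketch below is not a SPARQL implementation: the flat list layout and the function `trusted_subjects` are assumptions for this illustration, while the assignments and source types come from the subject indexing example.

```python
# (document, subject heading, source, rank) from the subject indexing example.
ASSIGNMENTS = [
    ("ex:docbase/doc1", "ex:thes/sub20", "ex:sources/autoindex1", 0.55),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/autoindex1", 0.8),
    ("ex:docbase/doc1", "ex:thes/sub30", "ex:sources/pfeffer", 1.0),
    ("ex:docbase/doc1", "ex:thes/sub40", "ex:sources/pfeffer", 1.0),
]
SOURCE_TYPES = {
    "ex:sources/autoindex1": "ex:types/auto",
    "ex:sources/pfeffer": "ex:types/manual",
}


def trusted_subjects(min_rank=0.7):
    """Keep headings that are manual or ranked above min_rank, without repeats."""
    result = []
    for document, subject, source, rank in ASSIGNMENTS:
        is_manual = SOURCE_TYPES[source] == "ex:types/manual"
        if (is_manual or rank > min_rank) and (document, subject) not in result:
            result.append((document, subject))  # mirrors SELECT DISTINCT
    return result


for document, subject in trusted_subjects():
    print(document, subject)
```

With the default threshold, sub20 is dropped because it comes from the automatic source with rank 0.55, and sub30 appears once although two sources assigned it.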
Conclusion
Many applications in the library field can be realized with metametadata.
No change to the underlying data models is needed.
But:
  Reification is not well accepted in the community.
  Named graphs are not (yet) part of the RDF standard.
  ...
Existing approaches are usable, but users need more guidance on how to implement them.
Metametadata is not always the appropriate solution (metalevel complexity vs. data model complexity).