Metadata Quality Assurance Framework
Part II. The implementation begins
Péter Király, [email protected]
Göttingen, Geiststraße 10, GWDG meeting room, 20/05/2016
Oberseminar Datenmanagement, Cloud und e-Infrastructure
Gesellschaft für wissenschaftliche Datenverarbeitung mbH Göttingen
Metadata Quality Assurance Framework
Why is data quality important?
„Fitness for purpose”
no metadata → no access to data → no data usage
More explanation: Data on the Web Best Practices, W3C Working Draft, 17 December 2015, http://www.w3.org/TR/2015/WD-dwbp-20151217/
What is it good for?
Improve the metadata
Improve the metadata schema and its documentation
Propagate „good practice”
Improve services: „good” data is ranked higher in the search result list
Specifically for GWDG:
  Could be built into current and planned data management / data archiving tools
Project principles
Full transparency
Open source, open data (CC0)
Minimum viable product
„Release early. Release often. And listen to your customers” (Eric S. Raymond)
„Eat your own dog food”
Getting real: https://gettingreal.37signals.com/
Measurements
Schema-independent structural features: existence, cardinality, uniqueness
Use case scenarios („fit for purpose”): requirements of the most important functions
Problem catalog: known metadata problems
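The three structural features can be illustrated with a small sketch. It assumes records are plain dictionaries that map a field name to a list of values; that representation, the function name `structural_features`, and the sample records are hypothetical and are not taken from the framework itself.

```python
from collections import Counter

def structural_features(records, fields):
    """Return per-field existence ratio, mean cardinality and uniqueness ratio."""
    result = {}
    for field in fields:
        present = 0          # records in which the field has at least one value
        total_values = 0     # number of field instances over all records
        values = Counter()   # how often each distinct value occurs
        for record in records:
            instances = record.get(field, [])
            if instances:
                present += 1
            total_values += len(instances)
            values.update(instances)
        result[field] = {
            "existence": present / len(records),
            "cardinality": total_values / len(records),
            "uniqueness": len(values) / total_values if total_values else 0.0,
        }
    return result

# Hypothetical sample records (field name -> list of values).
records = [
    {"dc:title": ["Map of Göttingen"], "dc:subject": ["maps", "cities"]},
    {"dc:title": ["Map of Göttingen"]},
    {"dc:title": ["Portrait"], "dc:subject": ["portraits"]},
    {"dc:subject": ["maps"]},
]
features = structural_features(records, ["dc:title", "dc:subject"])
for field, scores in features.items():
    print(field, scores)
```

None of the three scores needs any knowledge of the schema, which is what makes this group of measurements schema-independent.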
Europeana Data Quality Committee
Online collaboration:
  Use case documents
  Problem catalog
  Tickets
  Discussion forum
  #EuropeanaDataQuality
Bi-weekly teleconference
Bi-yearly face-to-face meeting
Topics:
  Usage scenarios
  Metadata profiles
  Schema modification
  Measuring
  Event model
Discovery scenarios and their metadata requirements
1. Basic retrieval with high precision and recall
2. Cross-language recall
3. Entity-based facets
4. Date-based facets
5. Improved language facets
6. Browse by subjects and resource types
7. Browse by agents
8. Browse/Search by Event
9. Entity-based knowledge cards and pages
10. Categorised similar items
11. Spatial search, browse, and map display
12. Entity-based autocompletion
13. Diversification of results
14. Hierarchical search and facets
Credit: the document was initiated by Tim Hill, Europeana’s search engineer
Discovery scenarios and their metadata requirements - 3. Entity-based facets
Scenario: As a user, ... I want to be able to filter by whether a person is the subject of a book, or its author, engraver, printer etc.
Metadata analysis: In each case the underlying requirement is that the relevant EDM fields for objects be populated by identifying URIs rather than free text. These URIs need to be related, at a minimum, to a label for each of the supported languages.
Measurement rules:
  The relevant field values should be resolvable URIs
  Each URI should have labels in multiple languages
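The two measurement rules could be sketched as follows. This is only an illustration under stated assumptions: the URI check is purely syntactic (no network request is made, so it does not prove that a URI is resolvable), and `labels_by_uri` is a hypothetical lookup table standing in for whatever entity data is available. The example.org URIs and the threshold of two languages are made up for the example.

```python
from urllib.parse import urlparse

def looks_like_uri(value):
    """Syntactic check only: an http(s) URI with a host; nothing is resolved."""
    parsed = urlparse(value)
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)

def check_entity_field(values, labels_by_uri, min_languages=2):
    """Return one report per field value.

    labels_by_uri is a hypothetical lookup: URI -> {language code: label}.
    """
    report = []
    for value in values:
        is_uri = looks_like_uri(value)
        languages = sorted(labels_by_uri.get(value, {})) if is_uri else []
        report.append({
            "value": value,
            "is_uri": is_uri,
            "languages": languages,
            "passes": is_uri and len(languages) >= min_languages,
        })
    return report

# Hypothetical entity labels.
labels_by_uri = {
    "http://example.org/agent/1": {"en": "Albrecht Dürer", "de": "Albrecht Dürer", "fr": "Albrecht Dürer"},
    "http://example.org/agent/2": {"en": "Unknown engraver"},
}
report = check_entity_field(
    ["http://example.org/agent/1", "http://example.org/agent/2", "Dürer, Albrecht"],
    labels_by_uri,
)
for row in report:
    print(row)
```

A free-text value such as "Dürer, Albrecht" fails the first rule, while a URI with a label in only one language fails the second.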
Discovery scenarios and their metadata requirements – 4. Date-based facets
Scenario: I want to be able to filter my results by a variety of timespans, e.g.:
  Date of creation
  Date of publication
  Date as subject
Metadata analysis: Dates should be fully and consistently normalised to follow the XSD date-time data types. Dates expressed in styles like “490 avant J.C” that are inherently language dependent should be avoided as they’re very difficult to normalise (e.g. this should be represented as “-0490”^^xsd:gYear).
Measurement rules:
  Field values should conform to one of the XSD date-time data types
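A rough way to test this rule is to match each value against the lexical form of a few XSD types. The sketch below is deliberately simplified and is an assumption about how such a check might look: it covers only four types, ignores time zones, and does not validate the calendar (month 13 would pass). The name `xsd_date_type` is hypothetical.

```python
import re

# Simplified lexical patterns; time zones and calendar validity are not checked.
XSD_PATTERNS = {
    "xsd:gYear": re.compile(r"-?\d{4,}"),
    "xsd:gYearMonth": re.compile(r"-?\d{4,}-\d{2}"),
    "xsd:date": re.compile(r"-?\d{4,}-\d{2}-\d{2}"),
    "xsd:dateTime": re.compile(r"-?\d{4,}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?"),
}

def xsd_date_type(value):
    """Return the name of the first XSD type whose pattern matches the whole value, else None."""
    for name, pattern in XSD_PATTERNS.items():
        if pattern.fullmatch(value):
            return name
    return None

samples = ["-0490", "1848-03", "2016-05-20", "490 avant J.C", "c. 1900"]
detected = {value: xsd_date_type(value) for value in samples}
for value, type_name in detected.items():
    print(repr(value), "->", type_name)
```

The language-dependent form "490 avant J.C" matches none of the patterns, whereas its normalised form "-0490" is recognised as an `xsd:gYear`.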
Problem catalog
Title contents same as description contents
Systematic use of the same title
Bad string: "empty" (and variants)
Shelfmarks and other identifiers in fields
Creator not an agent name
Absurd geographical location
Subject field used as description field
Unicode U+FFFD (�)
Very short description field
Credit: the document was initiated by Tim Hill, Europeana’s search engineer
Problem catalog
Description: Title contents same as description contents
Example: /2023702/35D943DF60D779EC9EF31F5DF...
Motivation: Distorts search weightings
Checking Method: Field comparison
Notes: Record display: creator concatenated onto title
Metadata Scenario: Basic Retrieval
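The "field comparison" checking method might look like the sketch below. The field names `dc:title` and `dc:description`, the dictionary representation of a record, and the normalisation step (lower-casing and collapsing whitespace) are assumptions made for the illustration.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())

def same_title_and_description(record):
    """True if any title value equals any description value after normalisation."""
    titles = {normalise(t) for t in record.get("dc:title", [])}
    descriptions = {normalise(d) for d in record.get("dc:description", [])}
    return bool(titles & descriptions)

# Hypothetical records.
duplicated = {
    "dc:title": ["View of the  Harbour"],
    "dc:description": ["view of the harbour"],
}
distinct = {
    "dc:title": ["View of the Harbour"],
    "dc:description": ["Etching showing ships at anchor."],
}
print(same_title_and_description(duplicated))
print(same_title_and_description(distinct))
```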
Problem catalog – proposed basis of implementation
Shapes Constraint Language (SHACL), https://www.w3.org/TR/shacl/
SHACL (Shapes Constraint Language) is a language for describing and constraining the contents of RDF graphs. SHACL groups these descriptions and constraints into "shapes", which specify conditions that apply at a given RDF node. Shapes provide a high-level vocabulary to identify predicates and their associated cardinalities, datatypes and other constraints.
sh:equals, sh:notEquals
sh:hasValue
sh:in
sh:lessThan, sh:lessThanOrEquals
sh:minCount, sh:maxCount
sh:minLength, sh:maxLength
sh:pattern
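To give a feel for how such constraints could express entries of the problem catalog, here is a minimal sketch in plain Python. It is not a SHACL engine and does not process RDF: it only borrows the constraint names (`minCount`, `maxCount`, `minLength`, `maxLength`, `pattern`) as dictionary keys. The `validate` function, the shape, and the sample records are hypothetical.

```python
import re

def validate(record, shape):
    """Check a record against per-field constraints and return violation messages."""
    violations = []
    for field, rules in shape.items():
        values = record.get(field, [])
        if len(values) < rules.get("minCount", 0):
            violations.append(f"{field}: fewer than {rules['minCount']} value(s)")
        if "maxCount" in rules and len(values) > rules["maxCount"]:
            violations.append(f"{field}: more than {rules['maxCount']} value(s)")
        for value in values:
            if len(value) < rules.get("minLength", 0):
                violations.append(f"{field}: value shorter than {rules['minLength']} characters")
            if "maxLength" in rules and len(value) > rules["maxLength"]:
                violations.append(f"{field}: value longer than {rules['maxLength']} characters")
            # The pattern may match anywhere in the value, hence re.search.
            if "pattern" in rules and not re.search(rules["pattern"], value):
                violations.append(f"{field}: value does not match {rules['pattern']!r}")
    return violations

# Hypothetical shape covering two catalog entries:
# the bad string "empty" and a very short description.
shape = {
    "dc:title": {"minCount": 1, "maxCount": 1, "pattern": r"^(?!\s*empty\s*$)"},
    "dc:description": {"minCount": 1, "minLength": 20},
}
bad_record = {"dc:title": ["empty"], "dc:description": ["Short"]}
good_record = {
    "dc:title": ["Map of Göttingen"],
    "dc:description": ["Hand-coloured map of the town, printed in 1750."],
}
bad_violations = validate(bad_record, shape)
good_violations = validate(good_record, shape)
print(bad_violations)
print(good_violations)
```

Keeping the constraints declarative means that new catalog entries can be added as data rather than as new code.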
Field frequency / main
Field frequency per collections / all
Field frequency per collections / >0%
Field frequency per collections / =100%
Field cardinality – overview
Field cardinality – histogram
Field cardinality – an outlier
Multilinguality
@ = language notation in RDF
resource notation
no language
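The three categories above (language-tagged literal, resource, literal without language) could be counted as in the following sketch. The value representation is an assumption: each field value is a dictionary with either a `resource` key or a `value` key plus an optional `lang` key. The real data format may differ, and the counter keys `_resource` and `_no_language` are made up for the example.

```python
from collections import Counter

def language_frequency(records, field):
    """Count values of one field per language tag, plus resources and untagged literals."""
    counts = Counter()
    for record in records:
        for value in record.get(field, []):
            if "resource" in value:
                counts["_resource"] += 1
            elif value.get("lang"):
                counts[value["lang"]] += 1
            else:
                counts["_no_language"] += 1
    return counts

# Hypothetical records.
records = [
    {"dc:subject": [{"value": "maps", "lang": "en"}, {"value": "Karten", "lang": "de"}]},
    {"dc:subject": [{"resource": "http://example.org/concept/maps"}]},
    {"dc:subject": [{"value": "cartes"}, {"value": "charts", "lang": "en"}]},
]
frequency = language_frequency(records, "dc:subject")
print(frequency.most_common())
```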
Language frequency / barchart
Language frequency / barchart
Language frequency / Treemap
Language frequency / Treemap with resources
Language frequency / Treemap + interaction + table
Entropy – term uniqueness / main
Entropy – term uniqueness / collection
Entropy – term uniqueness / field value
Entropy – term uniqueness / terms
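The slides do not state how the entropy score is calculated. As one possible reading, and purely as an assumption, the sketch below computes the Shannon entropy of the term distribution inside a single field value: a value that repeats one term scores 0, while a value made of distinct terms scores higher. The whitespace tokenisation and the function name `term_entropy` are hypothetical.

```python
from collections import Counter
from math import log2

def term_entropy(text):
    """Shannon entropy (in bits) of the term distribution of one field value."""
    terms = text.lower().split()
    if not terms:
        return 0.0
    counts = Counter(terms)
    total = len(terms)
    # H = sum over terms of p * log2(1 / p), where p is the relative frequency of the term.
    return sum((n / total) * log2(total / n) for n in counts.values())

repetitive = term_entropy("map map map map")
varied = term_entropy("coloured map of Göttingen")
print(repetitive, varied)
```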
Problem catalog – Long subject
Problem catalog – Long subject – example (not so long...)
Conclusion: we have to refine the definition of „long”
Problem catalog – same title and description
Problem catalog – same title and description – example
Record view – functionality matrix
Other elements of the record view
Further steps
Building completeness measurements into Europeana’s ingestion tool
Including usage statistics (log files, Google Analytics API)
Human evaluation of metadata quality
Measuring timeliness (changes of scores over time)
Machine learning:
  Classification/clustering of records
  Statistical relevancy of measurements
Göttingen use case: proposed SUB project „Shared Print Study”
Göttingen use case: incorporating into research data management tool
Cooperation with other projects
Architectural overview
[Architecture diagram. Processing components: OAI-PMH client (PHP); Apache Spark (Java); analysis with Spark (Scala); analysis with R; web interface (PHP, d3.js). Storage and indexing: Hadoop File System, Apache Solr, Apache Cassandra. Exchanged artefacts: JSON files, CSV files, image files. The legend distinguishes the recent workflow from the planned workflow.]
Articles, reports, presentations
Follow me
Project plan and blog: http://pkiraly.github.io
Site: http://144.76.218.178/europeana-qa/
Software development:
  https://github.com/pkiraly/europeana-qa-spark: Europeana Metadata Quality Assurance Toolkit
  https://github.com/pkiraly/europeana-qa-r: Europeana Metadata Quality Assurance Toolkit
@kiru, https://www.linkedin.com/in/peterkiraly