A Methodology for Assessment of Linked Data Quality, by Anisa Rula and Amrapali Zaveri

Post on 02-Jul-2015

DESCRIPTION

"Methodology for Assessment of Linked Data Quality: A Framework" at Workshop on Linked Data Quality Paper: https://dl.dropboxusercontent.com/u/2265375/LDQ/ldq2014_submission_3.pdf

TRANSCRIPT

Page 1: LDQ 2014 DQ Methodology

A Methodology for Assessment of Linked Data Quality

Anisa Rula

Amrapali Zaveri

Page 2: LDQ 2014 DQ Methodology

Outline
➢ Linked Data Quality
  ○ Current State
  ○ Limitations
➢ Quality Assessment Methodology
  ○ 3 phases, 6 steps
➢ Conclusion
  ○ Future Work

Page 3: LDQ 2014 DQ Methodology

Linked Data Quality
● ca. 50 billion facts in the Linked Data Cloud
● But what about the quality?
● Data is only as good as its quality!

Page 4: LDQ 2014 DQ Methodology

Linked Data Quality
➢ 30 approaches, 18 dimensions, 69 metrics*
➢ 12 tools
  ○ Automated
  ○ Semi-automated
➢ No generalized methodology
➢ Not taking into account the actual use case/user requirements
➢ Only assessment, no improvement
* http://www.semantic-web-journal.net/content/quality-assessment-linked-data-survey

Page 5: LDQ 2014 DQ Methodology

Quality Assessment Methodology for Linked Data

➢ 3 phases
➢ 6 steps

Page 6: LDQ 2014 DQ Methodology

Phase I: Requirement Analysis
Step I: Use Case Analysis
- Description that best illustrates the intended usage of the dataset(s)
Two types of users
➢ Consumers
➢ Potential consumers

Page 7: LDQ 2014 DQ Methodology

Phase II: Quality Assessment
Step II: Identification of quality issues
➢ Based on the use case
➢ Checklist-based approach
➢ Yes - 1, No - 0
➢ List of quality dimensions

Page 8: LDQ 2014 DQ Methodology

Phase II: Quality Assessment
Step III: Statistics and Low-level Analysis
➢ Generic statistics
➢ Example
  ○ Interlinking degree
  ○ Blank nodes
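The two example statistics can be sketched in plain Python over triples held as tuples of strings. This is an illustration under stated assumptions, not the authors' tooling: blank nodes are recognised by the `_:` label prefix, and the interlinking degree is approximated as the number of triples whose object is an http(s) IRI outside the dataset's own namespace. The slide does not define either measure precisely, and the function names and sample data are hypothetical.

```python
def blank_node_count(triples):
    """Count distinct blank nodes (labels starting with '_:') in subject or object position."""
    nodes = set()
    for s, _, o in triples:
        for term in (s, o):
            if term.startswith("_:"):
                nodes.add(term)
    return len(nodes)


def external_link_count(triples, local_prefix):
    """Count triples whose object is an http(s) IRI that does not start with local_prefix.

    Simplification: terms are plain strings, so a literal that happens to
    look like a URL would also be counted.
    """
    return sum(
        1
        for _, _, o in triples
        if o.startswith(("http://", "https://")) and not o.startswith(local_prefix)
    )


# Hypothetical sample data.
triples = [
    ("http://example.org/a", "http://www.w3.org/2002/07/owl#sameAs", "http://dbpedia.org/resource/A"),
    ("http://example.org/a", "http://example.org/knows", "_:b1"),
    ("_:b1", "http://example.org/name", "Alice"),
    ("http://example.org/a", "http://example.org/knows", "http://example.org/b"),
]

print(blank_node_count(triples))
print(external_link_count(triples, "http://example.org/"))
```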

Page 9: LDQ 2014 DQ Methodology

Phase II: Quality Assessment
Step IV: Advanced Analysis
➢ High-level metrics
➢ Example
  ○ Accuracy
  ○ Completeness
➢ Requires (i) input and (ii) target dataset

Page 10: LDQ 2014 DQ Methodology

Data Quality Score
➢ Ratio
  ○ DQscore = 1 - (V/T)
    ■ V - total no. of instances that violate a DQ rule
    ■ T - total no. of relevant instances
    ■ for each property
  ○ DQweightedscore = (DQscore * wi / W)
    ■ wi - weight
    ■ W - sum of all weighted factors of the properties
    ■ for quality of overall properties
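A worked example may make the two scores concrete. The sketch below computes DQscore = 1 - (V/T) per property and then combines the per-property scores. It assumes that the weighted score sums DQscore * wi over all properties and that W is the sum of the weights; the slide shows only a single term, so the summation is an interpretation. Property names, counts, and weights are hypothetical.

```python
def dq_score(violations, total):
    """Per-property score: 1 - V/T, where V = violating and T = relevant instances."""
    if total <= 0:
        raise ValueError("total number of relevant instances must be positive")
    return 1 - violations / total


def dq_weighted_score(scores, weights):
    """Overall score: sum of DQscore_i * w_i over properties, divided by W.

    Assumption: W is the sum of all weights w_i.
    """
    total_weight = sum(weights[prop] for prop in scores)
    return sum(scores[prop] * weights[prop] for prop in scores) / total_weight


# Hypothetical per-property counts: (V, T).
counts = {"birthDate": (10, 100), "label": (50, 200)}
weights = {"birthDate": 2, "label": 1}

scores = {prop: dq_score(v, t) for prop, (v, t) in counts.items()}
overall = dq_weighted_score(scores, weights)
print(scores, overall)
```

With these numbers the per-property scores are 0.9 and 0.75, and the weighted score is (0.9 * 2 + 0.75 * 1) / 3 = 0.85, up to floating-point rounding.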

Page 11: LDQ 2014 DQ Methodology

Phase III: Quality Improvement
Step V: Root Cause Analysis
➢ Analyze the cause of each quality issue
➢ Helps the user interpret the results
➢ Detect whether the problem occurs in the original dataset
➢ If the original dataset is unavailable, analyze the available dataset to determine the cause

Page 12: LDQ 2014 DQ Methodology

Phase III: Quality Improvement
Step VI: Fixing Quality Problems
➢ Semi-automatic
  ○ Consistency
  ○ Completeness
  ○ Syntactic validity
➢ Crowdsourcing*
  ○ Semantic accuracy
  ○ Datatypes
  ○ Interlinks

* Acosta et al., Crowdsourcing Linked Data Quality Assessment. ISWC 2013.

Page 13: LDQ 2014 DQ Methodology

Conclusion and Future Work
➢ Assessment methodology - 3 phases, 6 steps
➢ Focus on use case
➢ Improvement phase

Future Work
➢ Application to an actual use case
➢ Build a tool

Page 14: LDQ 2014 DQ Methodology

Questions Suggestions Comments

Thank you

@AnisaRula @amrapaliz