Adoption of the Linked Data Best Practices in Different Topical Domains
DESCRIPTION
Slides from the presentation of the following paper: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track, pp. 245-260, Riva del Garda, Italy, October 2014.
Paper URL: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/SchmachtenbergBizerPaulheim-AdoptionOfLinkedDataBestPractices.pdf
Abstract: The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying to a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 State of the LOD Cloud report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among others, we find that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata is still rarely provided by the data sources.
TRANSCRIPT
Schmachtenberg, Bizer, Paulheim: Adoption of the Linked Data Best Practices, 23.10.2014 Slide 1
Max Schmachtenberg, Christian Bizer, Heiko Paulheim
Adoption of the Linked Data Best Practices in Different Topical Domains
The Linked Data Best Practices
1. Linking Best Practices
• Set RDF links pointing at instances in other data sources.
2. Vocabulary Best Practices
• Reuse terms from widely-used vocabularies.
• Make definitions of proprietary terms dereferenceable.
• Link vocabulary terms to terms in other vocabularies.
3. Metadata Best Practices
• Publish machine-readable provenance and licensing metadata.
• Publish metadata about alternative access methods (SPARQL, dumps).
Central idea of Linked Data: Ease data discovery and integration by complying with a set of best practices.
State of the LOD Cloud Report - 2011
http://lod-cloud.net/state/
Based on information provided by dataset publishers via the datahub.io catalog
LOD Cloud - 2011
Consists of 295 datasets.
Outline
1. Methodology
2. Adoption of the Linking Best Practices
3. Adoption of the Vocabulary Best Practices
4. Adoption of the Metadata Best Practices
5. Conclusions (in Relation to Schema.org)
Goal: Update the State of the LOD Cloud report and LOD Cloud itself to 2014.
1. Methodology: Crawl of the Linked Data Web
Crawler: LDSpider, Crawl Date: April 2014
Seeds: 560,000 seed URIs from
1. Example URIs in datahub.io catalog
2. URIs from BTC2012 dataset
3. URIs from datasets advertised on [email protected] mailing list
Crawled Data Corpus
• 900,000 documents containing
• 8,038,000 resources
• 1014 datasets
• 77 datasets prevent crawling via robots.txt
Figure: Distribution by dataset (red line: documents, blue line: resources)
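How a crawler decides that a dataset "prevents crawling via robots.txt" can be sketched with Python's standard library. This is only an illustration and not LDSpider's actual logic: the user agent string `ldspider`, the example URL, and the two robots.txt bodies are assumptions made up for the sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt bodies (not taken from any real dataset).
BLOCKING = "User-agent: *\nDisallow: /\n"
OPEN = "User-agent: *\nDisallow:\n"


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "http://example.org/resource/1"  # placeholder URI
    print(may_crawl(BLOCKING, "ldspider", url))
    print(may_crawl(OPEN, "ldspider", url))
```

A dataset whose robots.txt behaves like `BLOCKING` would fall into the group of 77 that a polite crawler cannot retrieve.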
Categorization by Topical Domain
Used categorization from datahub.io for existing datasets.
Manually categorized remaining datasets.
Added new category Social Networking
Growth without new category Social Networking: 94 %
LODstats (http://stats.lod2.eu/) discovered similar number of datasets: 1048
2. Adoption of the Linking Best Practices
Data publishers should set RDF links as:
1. Discoverability depends on being linked.
2. RDF links ease data integration.
Degrees
56% of all datasets set RDF links pointing to other datasets.
• The remaining 44% are either only the target of RDF links from other datasets or are isolated.
Datasets with Top In- and Outdegrees:
Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
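The degree statistics on this slide can be illustrated with a minimal sketch, assuming the crawl has already been reduced to dataset-level link pairs. The dataset names `a` to `d` and the function names are hypothetical, and the toy numbers are unrelated to the 56% figure.

```python
from collections import defaultdict


def dataset_degrees(links, all_datasets):
    """Map each dataset to (indegree, outdegree), counting distinct neighbors.

    links: iterable of (source_dataset, target_dataset) pairs.
    all_datasets: every dataset in the corpus, including isolated ones.
    """
    out_neighbors = defaultdict(set)
    in_neighbors = defaultdict(set)
    for source, target in links:
        if source == target:
            continue  # links inside one dataset are not inter-dataset RDF links
        out_neighbors[source].add(target)
        in_neighbors[target].add(source)
    return {
        d: (len(in_neighbors.get(d, ())), len(out_neighbors.get(d, ())))
        for d in all_datasets
    }


def share_with_outlinks(degrees):
    """Fraction of datasets that set at least one link to another dataset."""
    linking = sum(1 for _, outdegree in degrees.values() if outdegree > 0)
    return linking / len(degrees)


if __name__ == "__main__":
    toy_links = [("a", "b"), ("a", "c"), ("b", "c"), ("a", "b")]
    degrees = dataset_degrees(toy_links, {"a", "b", "c", "d"})
    print(degrees)
    print(share_with_outlinks(degrees))
```

Counting distinct neighbor datasets rather than individual triples is a design choice of this sketch; a dataset that sets many links to a single target still has outdegree 1.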
“Crawlable” LOD Cloud 2014
Degree Distributions
Dotted line: Social Networking (status.net, etc.)
Solid line: Cross-Domain datasets (DBpedia, etc.)
Largest Strongly Connected Component: 36% (377 datasets)
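A strongly connected component is a set of datasets in which every dataset can reach every other one by following RDF links. The sketch below finds the largest such component with Kosaraju's algorithm on a toy link graph. It is not the authors' tooling, and the graph with nodes `a` to `d` is invented for illustration.

```python
def strongly_connected_components(graph):
    """Return the strongly connected components of a directed graph.

    graph: dict mapping each node to an iterable of successor nodes.
    """
    nodes = set(graph)
    for targets in graph.values():
        nodes.update(targets)

    # Pass 1: depth-first search, recording nodes in order of completion.
    visited, order = set(), []
    for start in nodes:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph.get(start, ())))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                order.append(node)
                stack.pop()

    # Reverse every edge.
    reverse = {n: [] for n in nodes}
    for source, targets in graph.items():
        for target in targets:
            reverse[target].append(source)

    # Pass 2: explore the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, todo = set(), [start]
        while todo:
            node = todo.pop()
            component.add(node)
            for prev in reverse[node]:
                if prev not in assigned:
                    assigned.add(prev)
                    todo.append(prev)
        components.append(component)
    return components


if __name__ == "__main__":
    toy_graph = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
    largest = max(strongly_connected_components(toy_graph), key=len)
    print(sorted(largest), len(largest) / 4)
```

In the toy graph, `d` is linked to but sets no links itself, so it stays outside the largest component, much like datasets that are only the target of RDF links.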
Conclusion concerning Linking Best Practices
Some datasets put a lot of effort into linking.
Many datasets only link to a small number of other datasets or do not set RDF links at all.
Similar situation as in 2011.
3. Adoption of the Vocabulary Best Practices
Goal: Help applications understand the data by
1. Reusing terms from widely-used vocabularies.
2. Making definitions of proprietary terms dereferenceable.
3. Linking vocabulary terms to terms in other vocabularies.
Widely-Used and Proprietary Vocabularies
Strong agreement on some vocabularies.
Proprietary vocabularies are used in addition to common ones, as data is often very specific.
Widely-Used Vocabularies
Proprietary Vocabularies
Dereferenceability of Term URIs and Vocabulary Linking
28% of the proprietary vocabularies provide dereferenceable URIs.
21% set RDF links to other vocabularies (8% in 2011).
• Popular linking predicates: rdfs:range, rdfs:subClassOf
4. Adoption of the Metadata Best Practices
1. Publish machine-readable provenance information.
2. Publish machine-readable licensing information.
3. Publish metadata about alternative access methods (SPARQL endpoints, RDF dumps).
Provenance and Licensing Metadata
37% of the datasets provide provenance information.
• Dublin Core is used more than W3C Prov.
10% provide machine-readable licensing information.
• Most used predicates: dc:license, cc:license
Dataset Level Metadata (VoID)
15% of the datasets publish VoID descriptions.
Via these descriptions, it is possible to discover SPARQL endpoints and dumps for about 10% of the data sources.
Conclusion concerning Metadata Best Practices
Applications cannot rely on the availability of metadata, as only a small fraction of all data sources publish such data.
The Government and Library domains are positive exceptions.
Similarly low numbers as in 2011.
“Full” LOD Cloud Diagram
570 datasets
• 374 from datahub.io
• 196 from our crawl
http://lod-cloud.net/
Growth of the “Full” LOD Cloud Diagram
2011: 295 datasets
2014: 570 datasets (+ 93 %)
http://lod-cloud.net/
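The growth figure can be checked with simple arithmetic. The short sketch below recomputes it from the dataset counts given on the slides; the function name is an illustrative choice.

```python
def relative_growth(before: int, after: int) -> float:
    """Relative increase from before to after, as a fraction of before."""
    return (after - before) / before


if __name__ == "__main__":
    datasets_2011 = 295
    datasets_2014 = 374 + 196  # datahub.io entries plus datasets found by the crawl
    print(datasets_2014)                                           # 570
    print(f"{relative_growth(datasets_2011, datasets_2014):.0%}")  # 93%
```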
Comparison of Linked Data and Schema.org
Schema.org
1. does not expect data publishers to set data links.
2. relies on marking up data in HTML pages.
3. Strong application pull by Google, Microsoft, Yahoo!
Adoption
WebDataCommons, 2013*: 463,000 websites (PLDs) provide Microdata annotations.
Google, 2014**: 5 million websites provide Schema.org data.
Orders of magnitude more Schema.org data sources.
* WebDataCommons extracts Microdata, RDFa, Microformat data from the CommonCrawl (2.2 billion HTML pages from 12.8 million PLDs).
** Guha in LDOW2014 Keynote
Schema.org Topical Focus
Different topics compared to Linked Data.
Class / Property Distribution
Only a small set of classes / properties is actually used.
Less variety compared to Linked Data.
Microdata 2012
Shallowness of the Schema.org Data
Product Names
• AppleMacBook Air MC968/A 11.6-Inch Laptop
• Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7
JobPostings
• More specific properties like skills are hardly used.
• 57% of all hiringOrganizations are strings, not instances.
schema:Product schema:JobPosting
Conclusion
Linked Data | Schema.org
~ 1,000 sources | > 460,000 sources
covers wider range of specific topics (government, libraries, science) | topics focused on search engines (products, organizations)
contains more complex data structures | very simple and shallow data structures
partial ontology agreement | strong ontology agreement
identity resolution eased by RDF links | identity resolution often requires value parsing
Thank you.
References
Report: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/
Catalog: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/
Acknowledgement
This work was supported by