Adoption of the Linked Data Best Practices in Different Topical Domains


Upload: chris-bizer

Post on 27-Jun-2015

DESCRIPTION

Slides from the presentation of the following paper: Max Schmachtenberg, Christian Bizer, Heiko Paulheim: Adoption of the Linked Data Best Practices in Different Topical Domains. 13th International Semantic Web Conference (ISWC2014) - RDB Track, pp. 245-260, Riva del Garda, Italy, October 2014.

Paper URL: http://dws.informatik.uni-mannheim.de/fileadmin/lehrstuehle/ki/pub/SchmachtenbergBizerPaulheim-AdoptionOfLinkedDataBestPractices.pdf

Abstract: The central idea of Linked Data is that data publishers support applications in discovering and integrating data by complying to a set of best practices in the areas of linking, vocabulary usage, and metadata provision. In 2011, the State of the LOD Cloud report analyzed the adoption of these best practices by linked datasets within different topical domains. The report was based on information that was provided by the dataset publishers themselves via the datahub.io Linked Data catalog. In this paper, we revisit and update the findings of the 2011 State of the LOD Cloud report based on a crawl of the Web of Linked Data conducted in April 2014. We analyze how the adoption of the different best practices has changed and present an overview of the linkage relationships between datasets in the form of an updated LOD cloud diagram, this time not based on information from dataset providers, but on data that can actually be retrieved by a Linked Data crawler. Among others, we find that the number of linked datasets has approximately doubled between 2011 and 2014, that there is increased agreement on common vocabularies for describing certain types of entities, and that provenance and license metadata is still rarely provided by the data sources.

TRANSCRIPT

Page 1: Adoption of the Linked Data Best Practices in Different Topical Domains

Schmachtenberg, Bizer, Paulheim: Adoption of the Linked Data Best Practices, 23.10.2014 Slide 1

Max Schmachtenberg, Christian Bizer, Heiko Paulheim

Adoption of the Linked Data Best Practices in Different Topical Domains

Page 2: Adoption of the Linked Data Best Practices in Different Topical Domains

The Linked Data Best Practices

1. Linking Best Practices
• Set RDF links pointing at instances in other data sources.

2. Vocabulary Best Practices
• Reuse terms from widely-used vocabularies.
• Make definitions of proprietary terms dereferencable.
• Link vocabulary terms to terms in other vocabularies.

3. Metadata Best Practices
• Publish machine-readable provenance and licensing metadata.
• Publish metadata about alternative access methods (SPARQL, dumps).

Central idea of Linked Data: Ease data discovery and integration by complying with a set of best practices.

Page 3: Adoption of the Linked Data Best Practices in Different Topical Domains

State of the LOD Cloud Report - 2011

http://lod-cloud.net/state/

Based on information provided by dataset publishers via the datahub.io catalog.

Page 4: Adoption of the Linked Data Best Practices in Different Topical Domains

LOD Cloud - 2011

Consists of 295 datasets.

Page 5: Adoption of the Linked Data Best Practices in Different Topical Domains

Outline

1. Methodology

2. Adoption of the Linking Best Practices

3. Adoption of the Vocabulary Best Practices

4. Adoption of the Metadata Best Practices

5. Conclusions (in Relation to Schema.org)

Goal: Update the State of the LOD Cloud report and LOD Cloud itself to 2014.

Page 6: Adoption of the Linked Data Best Practices in Different Topical Domains

1. Methodology: Crawl of the Linked Data Web

Crawler: LDSpider, Crawl Date: April 2014

Seeds: 560,000 seed URIs from
1. Example URIs in datahub.io catalog
2. URIs from BTC2012 dataset
3. URIs from datasets advertised on [email protected] mailing list

Crawled Data Corpus
• 900,000 documents containing 8,038,000 resources
• 1014 datasets
• 77 datasets prevent crawling via robots.txt

Figure: Distribution by dataset (red line: documents, blue line: resources)
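
The robots.txt exclusion mentioned above can be illustrated with a minimal sketch. It uses Python's standard urllib.robotparser on in-memory rules; the robots.txt content, the user agent string "ldspider" and the example.org URL are hypothetical and are not taken from the crawl. It does not show how LDSpider itself handles robots.txt.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules allow user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file content as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules: the first site excludes all crawlers, the second excludes none.
closed_rules = "User-agent: *\nDisallow: /\n"
open_rules = "User-agent: *\nDisallow:\n"

url = "http://example.org/resource/1"  # hypothetical resource URI
can_crawl_closed = may_crawl(closed_rules, "ldspider", url)
can_crawl_open = may_crawl(open_rules, "ldspider", url)
```

A dataset whose server answers with rules like closed_rules would end up among the datasets that cannot be crawled.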

Page 7: Adoption of the Linked Data Best Practices in Different Topical Domains

Categorization by Topical Domain

Used categorization from datahub.io for existing datasets.

Manually categorized remaining datasets.

Added new category Social Networking

Growth without new category Social Networking: 94 %

LODstats (http://stats.lod2.eu/) discovered similar number of datasets: 1048

Page 8: Adoption of the Linked Data Best Practices in Different Topical Domains

2. Adoption of the Linking Best Practices

Data publishers should set RDF links because:
1. Discoverability depends on being linked.
2. RDF links ease data integration.

Page 9: Adoption of the Linked Data Best Practices in Different Topical Domains

Degrees

56% of all datasets set RDF links pointing to other datasets.
• The remaining 44% are either only the target of RDF links from other datasets or are isolated.

Datasets with Top In- and Outdegrees:

Most widely used linking predicates: owl:sameAs, rdfs:seeAlso, foaf:knows
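
As a minimal sketch of how dataset-level in- and outdegrees and the share of linking datasets could be computed, assume the links are available as (source dataset, target dataset) pairs. The dataset names and links below are made up for illustration and do not come from the crawl.

```python
from collections import defaultdict

# Hypothetical dataset-level links: (source dataset, target dataset).
links = [
    ("a", "dbpedia"),
    ("b", "dbpedia"),
    ("dbpedia", "geonames"),
    ("a", "geonames"),
]
datasets = {"a", "b", "c", "dbpedia", "geonames"}


def degrees(datasets, links):
    """Count distinct linked datasets per dataset, ignoring self-links."""
    out_targets = defaultdict(set)
    in_sources = defaultdict(set)
    for source, target in links:
        if source != target:
            out_targets[source].add(target)
            in_sources[target].add(source)
    outdegree = {d: len(out_targets[d]) for d in datasets}
    indegree = {d: len(in_sources[d]) for d in datasets}
    return outdegree, indegree


outdegree, indegree = degrees(datasets, links)
# Share of datasets that set at least one RDF link to another dataset.
share_linking = sum(1 for d in datasets if outdegree[d] > 0) / len(datasets)
```

In this toy graph, dataset "c" is isolated and "geonames" is only a link target, which mirrors the two cases named for the non-linking datasets.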

Page 10: Adoption of the Linked Data Best Practices in Different Topical Domains

“Crawlable” LOD Cloud 2014

Page 11: Adoption of the Linked Data Best Practices in Different Topical Domains

Degree Distributions

Dotted line: Social Networking (status.net, etc.)

Solid line: Cross-Domain datasets (DBpedia, etc.)

Largest Strongly Connected Component: 36% (377 datasets)
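
A strongly connected component is a set of datasets in which every dataset can reach every other one by following RDF links. The sketch below finds the largest such component with Kosaraju's two-pass algorithm on a small made-up graph. The slides do not say which algorithm the authors used, so the choice of algorithm and all node names are assumptions.

```python
def largest_scc(nodes, edges):
    """Return the largest strongly connected component (Kosaraju's algorithm)."""
    graph = {n: [] for n in nodes}
    reverse = {n: [] for n in nodes}
    for source, target in edges:
        graph[source].append(target)
        reverse[target].append(source)

    # First pass: record nodes in order of depth-first completion.
    visited, order = set(), []

    def visit(node):
        visited.add(node)
        for neighbour in graph[node]:
            if neighbour not in visited:
                visit(neighbour)
        order.append(node)

    for node in nodes:
        if node not in visited:
            visit(node)

    # Second pass: walk the reversed graph in reverse completion order.
    assigned, best = set(), set()
    for node in reversed(order):
        if node in assigned:
            continue
        component, stack = set(), [node]
        assigned.add(node)
        while stack:
            current = stack.pop()
            component.add(current)
            for neighbour in reverse[current]:
                if neighbour not in assigned:
                    assigned.add(neighbour)
                    stack.append(neighbour)
        if len(component) > len(best):
            best = component
    return best


# Hypothetical link graph: a, b and c link in a cycle; d is only a target; e only links out.
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("e", "a")]
largest = largest_scc(nodes, edges)
share = len(largest) / len(nodes)
```

The recursive first pass is fine for a toy graph. For a graph with long link chains an iterative traversal would avoid Python's recursion limit.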

Page 12: Adoption of the Linked Data Best Practices in Different Topical Domains

Conclusion concerning Linking Best Practices

Some datasets put a lot of effort into linking.

Many datasets only link to a small number of other datasets or do not set RDF links at all.

Similar situation as in 2011.

Page 13: Adoption of the Linked Data Best Practices in Different Topical Domains

3. Adoption of the Vocabulary Best Practices

Goal: Help applications understand the data by
1. Reusing terms from widely-used vocabularies.
2. Making definitions of proprietary terms dereferencable.
3. Linking vocabulary terms to terms in other vocabularies.

Page 14: Adoption of the Linked Data Best Practices in Different Topical Domains

Widely-Used and Proprietary Vocabularies

Strong agreement on some vocabularies.

Proprietary vocabularies are used in addition to common ones, as data is often very specific.

Widely-Used Vocabularies

Proprietary Vocabularies
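
One way to separate widely-used from proprietary vocabularies is to count how many datasets use terms from each vocabulary namespace. The sketch below treats a vocabulary used by exactly one dataset as proprietary. That threshold, the namespace-splitting heuristic, and the dataset names and example.org vocabulary are assumptions for illustration. The FOAF and Dublin Core namespaces are the real ones.

```python
from collections import Counter


def namespace(term_uri: str) -> str:
    """Heuristic: the namespace ends at '#' or, failing that, at the last '/'."""
    if "#" in term_uri:
        return term_uri.rsplit("#", 1)[0] + "#"
    return term_uri.rsplit("/", 1)[0] + "/"


# Hypothetical term usage per dataset.
usage = {
    "dataset-a": {"http://xmlns.com/foaf/0.1/name", "http://example.org/vocab-a#score"},
    "dataset-b": {"http://xmlns.com/foaf/0.1/knows", "http://purl.org/dc/terms/title"},
    "dataset-c": {"http://purl.org/dc/terms/creator", "http://xmlns.com/foaf/0.1/Person"},
}

# Count each vocabulary once per dataset, however many of its terms the dataset uses.
datasets_per_vocabulary = Counter()
for terms in usage.values():
    for ns in {namespace(term) for term in terms}:
        datasets_per_vocabulary[ns] += 1

# Assumption: a vocabulary used by a single dataset counts as proprietary.
proprietary = sorted(ns for ns, count in datasets_per_vocabulary.items() if count == 1)
```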

Page 15: Adoption of the Linked Data Best Practices in Different Topical Domains

Dereferencability of Term URIs and Vocabulary Linking

28% of the proprietary vocabularies provide dereferencable URIs.

21% set RDF links to other vocabularies (8% in 2011).
• Popular linking predicates: rdfs:range, rdfs:subClassOf

Page 16: Adoption of the Linked Data Best Practices in Different Topical Domains

4. Adoption of the Metadata Best Practices

1. Publish machine-readable provenance information.
2. Publish machine-readable licensing information.
3. Publish metadata about alternative access methods (SPARQL endpoints, RDF dumps).

Page 17: Adoption of the Linked Data Best Practices in Different Topical Domains

Provenance and Licensing Metadata

37% of the datasets provide provenance information.
• Dublin Core is used more than W3C Prov.

10% provide machine-readable licensing information.
• Most used predicates: dc:license, cc:license
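
A share like the licensing figure above could be obtained by checking each dataset's triples for known license predicates. The following is a minimal sketch under assumptions: the slide gives only the prefixed names, so the full predicate URIs are my expansion (Dublin Core Terms and Creative Commons namespaces), and the datasets and triples are hypothetical.

```python
# Assumed expansions of dc:license and cc:license; the slide shows only the prefixed names.
LICENSE_PREDICATES = {
    "http://purl.org/dc/terms/license",
    "http://creativecommons.org/ns#license",
}

# Hypothetical (subject, predicate, object) triples per dataset.
triples_by_dataset = {
    "dataset-a": [
        ("http://example.org/a", "http://purl.org/dc/terms/license",
         "http://creativecommons.org/licenses/by/4.0/"),
    ],
    "dataset-b": [
        ("http://example.org/b", "http://purl.org/dc/terms/title", "Dataset B"),
    ],
    "dataset-c": [
        ("http://example.org/c", "http://xmlns.com/foaf/0.1/name", "Dataset C"),
    ],
    "dataset-d": [
        ("http://example.org/d", "http://creativecommons.org/ns#license",
         "http://creativecommons.org/publicdomain/zero/1.0/"),
    ],
}


def provides_license(triples) -> bool:
    """True if any triple uses one of the license predicates."""
    return any(predicate in LICENSE_PREDICATES for _, predicate, _ in triples)


licensed = sorted(name for name, triples in triples_by_dataset.items()
                  if provides_license(triples))
license_share = len(licensed) / len(triples_by_dataset)
```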

Page 18: Adoption of the Linked Data Best Practices in Different Topical Domains

Dataset Level Metadata (VoID)

15% of the datasets publish VoID descriptions.

Via these descriptions, it is possible to discover SPARQL endpoints and dumps for about 10% of the data sources.

Page 19: Adoption of the Linked Data Best Practices in Different Topical Domains

Conclusion concerning Metadata Best Practices

Applications cannot rely on the availability of metadata, as only a small fraction of all data sources publish such data.

The Government and Library domains are positive exceptions.

Similarly low numbers as in 2011.

Page 20: Adoption of the Linked Data Best Practices in Different Topical Domains

“Full” LOD Cloud Diagram

570 datasets
• 374 from datahub.io
• 196 from our crawl

http://lod-cloud.net/

Page 21: Adoption of the Linked Data Best Practices in Different Topical Domains

Growth of the “Full” LOD Cloud Diagram

2011: 295 datasets

2014: 570 datasets (+ 93 %)

http://lod-cloud.net/
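
The growth figure can be checked with a few lines of arithmetic. All numbers are taken from the slides (295 datasets in 2011; 374 datasets from datahub.io plus 196 from the crawl in 2014). Only the variable names are mine.

```python
datasets_2011 = 295
from_datahub = 374
from_crawl = 196

# The 2014 "Full" LOD cloud combines both sources.
datasets_2014 = from_datahub + from_crawl

# Relative growth between 2011 and 2014, in percent.
growth_percent = (datasets_2014 - datasets_2011) / datasets_2011 * 100

print(f"{datasets_2014} datasets, growth {growth_percent:.0f} %")
```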

Page 22: Adoption of the Linked Data Best Practices in Different Topical Domains

Comparison of Linked Data and Schema.org

Schema.org
1. does not expect data publishers to set data links.
2. relies on marking up data in HTML pages.
3. Strong application pull by Google, Microsoft, Yahoo!

Page 23: Adoption of the Linked Data Best Practices in Different Topical Domains

Adoption

WebDataCommons, 2013*: 463,000 websites (PLDs) provide Microdata annotations.

Google, 2014**: 5 million websites provide Schema.org data.

Orders of magnitude more Schema.org data sources.

* WebDataCommons extracts Microdata, RDFa, Microformat data from the CommonCrawl (2.2 billion HTML pages from 12.8 million PLDs).

** Guha in LDOW2014 Keynote

Page 24: Adoption of the Linked Data Best Practices in Different Topical Domains

Schema.org Topical Focus

Different topics compared to Linked Data.

Page 25: Adoption of the Linked Data Best Practices in Different Topical Domains

Class / Property Distribution

Only a small set of classes / properties is actually used.

Less variety compared to Linked Data.

Microdata 2012

Page 26: Adoption of the Linked Data Best Practices in Different Topical Domains

Shallowness of the Schema.org Data

Product Names
• AppleMacBook Air MC968/A 11.6-Inch Laptop
• Apple MacBook Air 11-in, Intel Core i5 1.60GHz, 64 GB, Lion 10.7

JobPostings
• More specific properties like skills are hardly used.
• 57% of all hiringOrganizations are strings not instances.

schema:Product schema:JobPosting
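
The difference between a string value and an instance can be made concrete with a small sketch. The job postings below are hypothetical and use a JSON-like Python structure rather than actual Microdata markup. A dict stands for an Organization instance with its own properties, a plain string for a bare name.

```python
# Hypothetical job postings; none of these values come from the analysed data.
postings = [
    {"title": "Engineer", "hiringOrganization": "ACME Corp"},
    {"title": "Analyst",
     "hiringOrganization": {"@type": "Organization", "name": "Example Ltd"}},
    {"title": "Clerk", "hiringOrganization": "Example Ltd."},
]

values = [p["hiringOrganization"] for p in postings if "hiringOrganization" in p]

# Share of hiringOrganization values that are plain strings rather than instances.
string_share = sum(isinstance(v, str) for v in values) / len(values)
```

With string values such as "Example Ltd." and "Example Ltd", deciding whether two postings refer to the same organization requires parsing and comparing the text.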

Page 27: Adoption of the Linked Data Best Practices in Different Topical Domains

Conclusion

Linked Data
• ~ 1,000 sources
• covers wider range of specific topics (government, libraries, science)
• contains more complex data structures
• partial ontology agreement
• identity resolution eased by RDF links

Schema.org
• > 460,000 sources
• topics focused on search engines (products, organizations)
• very simple and shallow data structures
• strong ontology agreement
• identity resolution often requires value parsing

Page 28: Adoption of the Linked Data Best Practices in Different Topical Domains

Thank you.

References

Report: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/state/

Catalog: http://linkeddatacatalog.dws.informatik.uni-mannheim.de/

Acknowledgement

This work was supported by