Journal Club Report “Linked Data – The Story So Far” Denise Warzel Feb. 2012

Journal Club Report

“Linked Data – The Story So Far”

Denise Warzel

Feb. 2012

Linked Data – The Story So Far

• Authors:

• Christian Bizer, Freie Universität Berlin, Germany

• Tom Heath, Talis Information Ltd, United Kingdom

• Tim Berners-Lee, Massachusetts Institute of Technology, USA

• International Journal on Semantic Web and Information Systems (2009)

• Volume: 5, Issue: 3, Publisher: IGI Global, Pages: 1-22, DOI: 10.4018/jswis.2009081901

• Follow on to Berners-Lee talk:

• http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html

• These authors are thought leaders for Linked Data

• Good overview of the state of the art in Linked Data

• Linked Data uses the technologies of the Semantic Web

• Describes the various technologies and tools

• caBIG has not really taken advantage of the Semantic Web

• Hope to stimulate discussion and consideration of this approach for publishing data

Paper Overview

• Section 1: Introduction

• Section 2: Overview of key features of Linked Data

• Section 3: Activities and outputs of the Linking Open Data Project

• Section 4: State of the art in publishing Linked Data

• Section 5: Overview of the Linked Data applications

• Section 6: Compares Linked Data to other technologies for publishing structured data on the Web

• Section 7: Ongoing research challenges

1. Introduction

• “Linked Data refers to a set of best practices for publishing and connecting structured data on the Web”

• Similar to principles for publishing web documents

• Web documents are linked via hypertext links

• Web browsers allow navigation from one document to another

• Search engines index the documents and infer potential relevance to users' search queries

• These kinds of links do not describe the nature of the link between two documents; HTML is not expressive enough, so the relationship is implicit

• Traditionally data on the web is made available as raw dumps

• CSV

• XML

• HTML Tables

1. Introduction

• Web has evolved to include links to documents and data

• Linked Data principles are for publishing data on the web

• New way for sharing data

• Publishing and connecting structured data

• Creates a “web of data”

• Mechanism similar to hypertext links in web documents

• Connecting data from diverse domains such as people, companies, books, scientific publications, films, music, television and radio programmes, genes, proteins, drugs and clinical trials, online communities, statistical and scientific data

• Generic Linked Data Browsers and Search engines will allow users to crawl the web of data by following links between data sources

• start browsing in one data source and navigate links to other data sources

Linked Data differs from Web 2.0 Mashups

Mashups rely on fixed data sources put together to present new or related information

Figure: mashup diagram (labels: Portal, Pre-processing, Web 2.0)

Linked Data differs from Web 2.0 Mashups

• In Linked Data new information automatically becomes available as users deliver new data sources to the web that are linked to existing data sources

• Users will start in one data source and by following links may end up in an entirely different data source

• Heavier lines indicate more links between the two datasets

• Much of the linked open data cloud was generated by putting wrappers around existing databases

• Some data comes from the info boxes seen on the right-hand side of Wikipedia articles

Linked Open Data Cloud

2. What is Linked Data?

• Linked Data relies on documents containing data in Resource Description Framework (RDF) format, which makes it machine-readable

• You can build links between data from different sources, in different geographical locations, from different organizations

• Using this format you can make typed statements about the links between things

• “Typed” means that certain RDF terms carry specific meanings, so each link conveys an explicit meaning that RDF query languages can interpret

• RDF provides a generic, graph-based data model with which to structure and link data on the web

• RDF Triples consist of 3 parts:

• Subject - Predicate - Object

• The Predicate specifies how the Subject and Object are linked

• Explicit linkage using explicit language

Linked Data Principles

• “Rules” for publishing linked data (Berners-Lee 2006)

1. Use Uniform Resource Identifiers (URIs) as “names” for things. (all three parts of the RDF Triple are represented by a URI)

2. Use the Hypertext Transfer Protocol (HTTP) so that people can look up the URIs (“names”)

3. When someone looks up a URI provide useful information using the standards (RDF, SPARQL)

4. Include links to other URIs so that they can discover more things

RDF Triples

PersonA (subject, URI) --Knows (predicate, URI)--> PersonB (object, URI)

The predicate specifies how the subject and object are related.

Using URIs, A, B and the predicate can be in different data sets on the web.
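
The triple on this slide can be sketched in a few lines of Python. This is only an illustration: the two person URIs are hypothetical, while the predicate is the FOAF vocabulary term for "knows". It shows that every part of the statement is a URI and prints the statement in N-Triples form.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; here every part is identified by a URI."""
    subject: str
    predicate: str
    object: str


# Hypothetical URIs for the two people; they could live in different data sets.
PERSON_A = "http://example.org/people#personA"
PERSON_B = "http://example.com/staff#personB"
# FOAF vocabulary term for "knows".
KNOWS = "http://xmlns.com/foaf/0.1/knows"


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.object}> ."


statement = Triple(PERSON_A, KNOWS, PERSON_B)
print(to_ntriples(statement))
```

Because the subject and object use different host names, the sketch also mirrors the point that the parts of a triple need not come from the same data set.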

Linked Data – another example

PersonC (subject, URI) --Author (predicate, URI)--> Scientific Article D (object, URI)

PersonA (subject, URI) --Knows (predicate, URI)--> PersonB (object, URI)

Callout on the predicate URIs: “RDF Vocabulary”

Linked Data

Inferencing allows implicit relationships to be discovered

Figure: two RDF triples, each with its subject, predicate and object identified by a URI

Figure derived from text in the article. Blue text added for emphasis.

“RDF Triples” http://www.rdfabout.com/quickintro.xpd

    @prefix : <http://www.example.org/#> .
    :john a :Person .
    :john :hasMother :susan .
    :john :hasFather :richard .
    :richard :hasBrother :luke .

• RDF/XML representation of the above:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:ns="http://www.example.org/#">
      <ns:Person rdf:about="http://www.example.org/#john">
        <ns:hasMother rdf:resource="http://www.example.org/#susan" />
        <ns:hasFather>
          <rdf:Description rdf:about="http://www.example.org/#richard">
            <ns:hasBrother rdf:resource="http://www.example.org/#luke" />
          </rdf:Description>
        </ns:hasFather>
      </ns:Person>
    </rdf:RDF>

“RDF Vocabulary”
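
As a rough illustration of how explicit triples support inferencing, the sketch below applies one hand-written rule to the family triples from this slide. The rule and the `hasUncle` term are assumptions made for the example; they are not part of the slide's data or of any standard vocabulary.

```python
EX = "http://www.example.org/#"

# The relationship triples from the slide, written with full URIs.
triples = {
    (EX + "john", EX + "hasMother", EX + "susan"),
    (EX + "john", EX + "hasFather", EX + "richard"),
    (EX + "richard", EX + "hasBrother", EX + "luke"),
}


def infer_uncles(graph):
    """Rule: if X hasFather F and F hasBrother B, then X hasUncle B."""
    inferred = set()
    for child, p1, father in graph:
        if p1 != EX + "hasFather":
            continue
        for s2, p2, brother in graph:
            if s2 == father and p2 == EX + "hasBrother":
                # hasUncle is a hypothetical term introduced for this sketch.
                inferred.add((child, EX + "hasUncle", brother))
    return inferred


new_facts = infer_uncles(triples)
print(new_facts)
```

The relationship between john and luke is never stated in the data; it only becomes visible once the two explicit statements are combined.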

The Linked Data Technology Stack

• Three key technologies: URIs, HTTP, RDF

• URIs identify resources (subject, predicate, object)

• HTTP protocol retrieves information about resources

• RDF statements describe resources

• Builds on the classic document Web architecture

• Linked Data can contain any type of data

• Anyone can publish data

• Data publishers are unconstrained in choice of RDF vocabularies (the terms used for predicates)

• URIs are connected by RDF links, creating a global data graph that spans data sources, thus enabling the discovery of new data sources

The Linked Data Technology Stack

• From application development perspective the Web of Data has the following characteristics:

• Data is strictly separated from formatting and presentational aspects

• Data is self-describing

• If an application consuming Linked Data encounters data described with an unfamiliar RDF vocabulary (predicates), the application can dereference the URIs that identify the vocabulary terms in order to find their meaning.

• The use of HTTP as a standard data access mechanism and RDF as a standard data model (subject/predicate/object) simplifies access compared to web APIs which rely on heterogeneous data models and access interfaces

• Data is Open

• Unlike mashups, applications do not have to be implemented against a fixed set of data sources, but can discover new data sources at run-time by following RDF links

About dereferencing URIs: the act of retrieving information about a resource identified by a URI

• When the Subject URI “…/data#DIG” is dereferenced (looked up), the dig.csail.mit.edu server answers with an RDF description of the resource identified by “…/data#DIG”: a set of RDF statements saying that it is the “MIT Decentralized Information Group” (as opposed to delivering an HTML page)

• When the Object URI “…Berners-Lee/card#i” is dereferenced, the W3C server provides a set of RDF statements describing the resource identified by “…Berners-Lee/card#i”

• When the Predicate URI “…foaf/member” is dereferenced, the XMLNS server returns RDF statements providing a definition of “…foaf/0.1/member”.

The RDF Triple

• This statement connects the subject “…film/77” in the Linked Movie Database (data.linkedmdb.org) with the description provided by DBpedia.org. The predicate states that the URI “…film/77” and the URI “…Pulp_Fiction_%28film%29” refer to “the same thing”.

• This Predicate is represented by the URI “…owl#sameAs”, which is a Web Ontology Language (OWL) term. OWL supplies the RDF vocabulary term (predicate) in this RDF statement, just as FOAF was the basis of the RDF vocabulary term “member” in the prior slide.

3. The Linking Open Data Project http://www.w3.org/wiki/SweoIG/TaskForces/CommunityProjects/LinkingOpenData

- Bootstrap the “Web of Data” [cloud] by identifying existing datasets that are available under open licenses, converting them to RDF according to Linked Data principles and publishing them on the web

- Initial involvement by BBC, Thomson Reuters and the Library of Congress

- Anyone can participate by following the Linked Data principles

- DBpedia (Wikipedia info boxes) and Geonames (RDF about geographical locations around the world) are two key hubs in this web of data, which contains content about:

• Geographic locations

• People

• Companies

• Books

• Scientific Publications (PubMed)

• Films

• Music

• Television and Radio programmes

• Genes (GeneID, GeneOntology)

• Proteins (UniProt)

• Drugs

• Clinical Trials

• Online communities

• Statistical data

• Census results

• Reviews

4. Publishing Linked Data on the Web

• Three Basic Steps to publishing data that becomes part of the “Linked Data Cloud”:

1. Assign HTTP URIs to the entities to be described in the data set and provide servers for dereferencing these URIs into RDF statements

2. Make RDF statements (links) about other data sources on the web so clients can navigate the Web of Data as a whole by following the links

3. Provide metadata about published data so clients can assess the quality of published data (dataset metadata)

• Choosing URIs

• URIs are used for the subject, predicate and object (bears repeating)

• Different parties may publish information about the same real-world entity using different URIs; these URIs are “aliases” for the real-world entity

• E.g. Berlin

• http://dbpedia.org/resource/Berlin

• http://sws.geonames.org/2950159

• URI aliases allow different information providers to speak about the same entity

• Information providers set owl:sameAs links to URI aliases they know about

• Publishers must choose the URI schema for the Subject and Object, and an RDF vocabulary for the predicates
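
A small sketch of how a consumer might use owl:sameAs links to recognise URI aliases, using the two Berlin URIs from this slide. Treating owl:sameAs as symmetric and transitive and grouping with a union-find structure is one possible approach chosen for the illustration, not something prescribed by the paper.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

same_as_links = [
    ("http://dbpedia.org/resource/Berlin", OWL_SAME_AS, "http://sws.geonames.org/2950159"),
]


def alias_groups(links):
    """Group URIs connected by owl:sameAs links (treated as symmetric and transitive)."""
    parent = {}

    def find(u):
        parent.setdefault(u, u)
        while parent[u] != u:
            parent[u] = parent[parent[u]]  # path halving keeps chains short
            u = parent[u]
        return u

    for s, p, o in links:
        if p == OWL_SAME_AS:
            parent[find(s)] = find(o)

    groups = {}
    for uri in list(parent):
        groups.setdefault(find(uri), set()).add(uri)
    return list(groups.values())


groups = alias_groups(same_as_links)
print(groups)
```

Each resulting group is a set of URIs that different publishers use for the same real-world entity.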

Choosing URIs, how does it work?

• By convention, two HTTP URI patterns help client software distinguish between URIs that identify real-world entities and URIs that identify the web documents (e.g. RDF) describing them

• 303 URIs - http://www.example.com/id/alice

• “303 See Other”: the server answers the request with a 303 redirect to the URI of a document that describes the entity; content negotiation selects the RDF document

• Hash URIs - http://www.example.com/about#alice

• Used for resources that are not HTML documents

• Clients know to strip off the fragment (the part after the hash) before retrieving the URI

• Through content negotiation, information in the HTTP header specifies what kinds of media/formats the client can accept, which allows the server to select the RDF document

Content Negotiation
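
The client side of the hash-URI and content-negotiation steps can be sketched with Python's standard library. The request is only prepared, not sent, and the media-type preferences in the Accept header are an assumption for the example.

```python
from urllib.parse import urldefrag
from urllib.request import Request


def build_rdf_request(uri: str) -> Request:
    """Prepare (but do not send) an HTTP GET that asks for RDF.

    The fragment after '#' is stripped because it is never sent to the server.
    """
    document_url, _fragment = urldefrag(uri)
    # The Accept header drives content negotiation; RDF/XML is preferred here,
    # with HTML as a lower-priority fallback.
    return Request(document_url, headers={"Accept": "application/rdf+xml, text/html;q=0.5"})


req = build_rdf_request("http://www.example.com/about#alice")
print(req.full_url)
print(req.get_header("Accept"))
```

Passing such a request to `urllib.request.urlopen` would perform the lookup; that function follows a “303 See Other” redirect automatically, which covers the 303 URI pattern as well.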

Choosing RDF Vocabularies for Linked Data

• Different communities have specific preferences for publishing data on the web

• Good practice is to reuse terms from well-known RDF vocabularies

• FOAF (Friend of a Friend)

• SIOC (Semantically-Interlinked Online Communities)

• SKOS (Simple Knowledge Organization System)

• DOAP (Description of a Project)

• vCard - electronic business cards, has become an ontology

• Dublin Core

• OAI-ORE - Open Archives Initiative Object Reuse and Exchange

• GoodRelations - professional Web ontology for e-commerce

• If new terminology is developed, it should be self-describing, using URIs to identify the terms, and be dereferenceable

• Common serialization format for linked data is RDF/XML, or Notation3 and Turtle if human inspection of RDF data is required

• These practices make it easier for clients to retrieve and process linked data

Page 23: Journal Club Report “Linked Data – The Story So Far” Denise Warzel Feb. 2012

Link Generation

• Build data source including RDF links (statements) to point to other URIs in other data sources that are part of the Web of Data

• Choose a URI naming schema for the Subject and Object

• Publications - ISBN and ISSN numbers

• Finance – ISIN identifiers

• Products - EAN and EPC codes

• Life Sciences – identification schemata for genes, molecules and chemical substances (HUGO, UNIPROT, GENEID)

• Use automated or semi-automated approaches to generating RDF statements

• This approach was used to generate links between data sources in the LOD cloud

• When the link source and target data sets both already support one of these schemata, the implicit relationships between their entities can be made explicit as RDF links (same RDF Vocabulary)
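
A minimal sketch of the shared-identifier case, assuming two hypothetical book data sets that both record ISBNs. All URIs and ISBN values are placeholders, and emitting owl:sameAs for a match is an assumption made for the example.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical records: entity URI -> ISBN, in two separate data sets.
library_a = {
    "http://a.example.org/book/1": "9780000000001",
    "http://a.example.org/book/2": "9780000000002",
}
library_b = {
    "http://b.example.org/item/x": "9780000000002",
    "http://b.example.org/item/y": "9780000000003",
}


def links_from_shared_ids(source, target):
    """Emit an owl:sameAs link for every pair of entities with the same identifier."""
    by_id = {}
    for uri, ident in target.items():
        by_id.setdefault(ident, []).append(uri)
    return [
        (s_uri, OWL_SAME_AS, t_uri)
        for s_uri, ident in source.items()
        for t_uri in by_id.get(ident, [])
    ]


links = links_from_shared_ids(library_a, library_b)
print(links)
```

Only the one book whose ISBN appears in both data sets is linked; the other records stay unconnected.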

Link Generation

• When no shared schema exists, RDF statements can be generated based on the similarity of entities in both data sets

• Examples of mechanisms to generate links

• Similarity computations (Elmagarmid et al., 2007; Raimond et al., 2008)

• Such as, in music, comparing the names of artists as well as the titles of albums and songs

• Duplicate detection (Winkler, 2006)

• Ontology matching (Euzenat & Shvaiko, 2007)

• RDF link generation frameworks are available

• Silk Framework lets data publishers set links between their data source and other data sources

• LinQL Framework is an extension of SQL that integrates querying with link discovery methods
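
When no shared identifier exists, a similarity computation can propose links. The sketch below compares artist names with Python's `difflib`; the data, the 0.9 threshold and the choice of string measure are all assumptions for illustration, and it does not reproduce how Silk or LinQL work.

```python
from difflib import SequenceMatcher

OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"

# Hypothetical artist names keyed by entity URI in two music data sets.
artists_a = {
    "http://a.example.org/artist/1": "The Beatles",
    "http://a.example.org/artist/2": "Miles Davis",
}
artists_b = {
    "http://b.example.org/band/11": "the beatles",
    "http://b.example.org/band/12": "Nina Simone",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidate_links(source, target, threshold=0.9):
    """Propose a link for every pair whose names are at least `threshold` similar."""
    links = []
    for s_uri, s_name in source.items():
        for t_uri, t_name in target.items():
            if similarity(s_name, t_name) >= threshold:
                links.append((s_uri, OWL_SAME_AS, t_uri))
    return links


proposed = candidate_links(artists_a, artists_b)
print(proposed)
```

In practice such candidates would usually be checked against further evidence, such as album or song titles, before being published as links.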

Metadata

• Several types of metadata increase utility for data consumers

• Creator, creation date, creation method

• Provenance information using Dublin Core terms or the Semantic Web Publishing vocabulary

• Open Provenance Model provides terms for describing transformation workflows (wasGeneratedAt, wasGeneratedBy, wasEncodedBy, etc.)

• Methods are being designed to provide evidence (possibly expressed as RDF statements) for how RDF links change over time

• Technical metadata

• Information about the data set and its interlinkage relationships with other data sets

• Vocabulary of Interlinked Datasets (VoID) defines terms and best practices to categorize and provide statistical meta-information about datasets

What about Metadata about the data?
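
As an illustration of dataset metadata, the sketch below assembles a small Turtle description using VoID and Dublin Core terms. The dataset, creator and endpoint URIs and all values are hypothetical; only the vocabulary terms (`void:Dataset`, `void:sparqlEndpoint`, `void:triples`, `dcterms:title`, `dcterms:creator`, `dcterms:created`) are taken from the published vocabularies.

```python
def void_description(dataset_uri, title, creator_uri, created, endpoint, triple_count):
    """Return a Turtle description of a data set using VoID and Dublin Core terms."""
    return "\n".join([
        "@prefix void: <http://rdfs.org/ns/void#> .",
        "@prefix dcterms: <http://purl.org/dc/terms/> .",
        "@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .",
        "",
        f"<{dataset_uri}> a void:Dataset ;",
        f'    dcterms:title "{title}" ;',
        f"    dcterms:creator <{creator_uri}> ;",
        f'    dcterms:created "{created}"^^xsd:date ;',
        f"    void:sparqlEndpoint <{endpoint}> ;",
        f"    void:triples {triple_count} .",
    ])


# All argument values below are placeholders for the example.
turtle = void_description(
    "http://example.org/dataset/demo",
    "Demo data set",
    "http://example.org/people#publisher",
    "2012-02-01",
    "http://example.org/sparql",
    1200,
)
print(turtle)
```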

Publishing Tools

• Serve RDF Content or Provide Linked Data views over non-RDF legacy data sources

• Views shield publishers from dealing with technical details

• Supports dereferencing URIs into RDF descriptions

• D2R Server

• Publisher defines a mapping between the relational db schema and the target RDF vocabulary

• D2R publishes a Linked Data view over the db that supports SPARQL queries

• Virtuoso Universal Server

• Serves RDF data via a Linked Data interface and a SPARQL endpoint

• RDF is stored directly or created on the fly from non-RDF relational dbs based on a publisher-supplied mapping

• Talis Platform

• Software as a Service (SaaS) accessed over HTTP

• Native RDF storage accessible through a SPARQL endpoint and REST APIs
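
The idea of a Linked Data view over a relational database can be sketched with an in-memory SQLite table. The table, the URI prefix and the column-to-predicate mapping are hypothetical, and this is not the mapping language or behaviour of D2R Server or Virtuoso; it only shows rows becoming triples under a publisher-supplied mapping.

```python
import sqlite3

BASE = "http://example.org/resource/"          # hypothetical URI prefix
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"   # FOAF vocabulary term

# Hypothetical mapping: table column -> RDF predicate.
COLUMN_TO_PREDICATE = {"name": FOAF_NAME}


def rows_to_triples(conn):
    """Turn each row of the person table into triples, one per mapped column."""
    triples = []
    for person_id, name in conn.execute("SELECT id, name FROM person ORDER BY id"):
        subject = f"{BASE}person/{person_id}"
        triples.append((subject, COLUMN_TO_PREDICATE["name"], name))
    return triples


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO person (id, name) VALUES (?, ?)", [(1, "Alice"), (2, "Bob")])
view = rows_to_triples(conn)
print(view)
```

The primary key becomes part of the subject URI, so each row gets a stable identifier that other data sets can link to.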

Publishing Tools

• Pubby

• An extension to any RDF store that supports SPARQL

• Rewrites URI requests into SPARQL DESCRIBE queries against the RDF store [dereferences URIs]

• Also provides an HTML view over the datastore and handles the 303 URI redirects and content negotiation [serving the format that the client/user agent can accept, e.g. PNG vs GIF to a web browser that only accepts one type of format]

• Triplify

• Extends existing Web applications with a Linked Data front end

• Based on SQL query templates, Triplify serves a Linked Data and JSON view over the database

• SparqPlug

• Enables the extraction of Linked Data from legacy HTML documents

• Serializes the HTML as RDF and allows users to define SPARQL queries that transform elements into an RDF graph of their choice
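
The rewriting step described for Pubby can be illustrated with a short function that turns a requested path into a SPARQL DESCRIBE query. The function name, its parameters and the validation rule are assumptions for this sketch; it does not show Pubby's actual code or configuration.

```python
def describe_query(request_path: str, dataset_base: str) -> str:
    """Map a requested path to a SPARQL DESCRIBE query for the matching resource URI."""
    resource_uri = dataset_base.rstrip("/") + "/" + request_path.lstrip("/")
    if any(ch in resource_uri for ch in '<>" '):
        raise ValueError("characters not allowed inside a SPARQL IRI reference")
    return f"DESCRIBE <{resource_uri}>"


query = describe_query("/resource/Berlin", "http://dbpedia.org")
print(query)
```

The resulting query string would then be sent to the SPARQL endpoint of the underlying RDF store, and the answer returned to the client in the negotiated format.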

Publishing Tools

• OAI2LOD Server

• Linked Data wrapper for document servers that support the Open Archives OAI-PMH protocol

• SIOC Exporters

• (SIOC - Semantically-Interlinked Online Communities)

• Linked Data wrappers for several blogging engines, content management systems and discussion forums such as WordPress, Drupal and phpBB

5. Linked Data Applications

• Applications that exploit the Web of Data

• Three categories:

• Linked Data browsers

• Linked Data search engines and indexes

• Domain specific Linked Data applications

Linked Data Browsers

• Linked Data Browsers: Disco, Tabulator, Marbles, FOAFNaut and Fenfire

• Navigate between data sources by following links expressed as RDF triples

• E.g. view DBpedia’s RDF description of Birmingham (UK), follow a “birthplace” predicate link to a description of the comedian Tony Hancock, who was born in the city, and from there to BBC broadcasts in which Hancock “starredIn”

• Traverse RDF links rather than HTML links

• E.g. Disco hyperdata browser

• Traverse the web and expose pieces of it in a controlled way; “outline mode” to discover and highlight a pattern of interest and then query for other similar patterns in the web of data

• E.g. Tabulator

• Results can be analyzed with conventional presentation methods such as faceted browsers, maps, timelines, etc.

• Track provenance of data while merging data about the same thing from different sources

• E.g. Tabulator, Marbles, FOAFNaut and Fenfire

Linked Data Browsers – Marbles

Figure 3. The Marbles Linked Data browser displaying data about Tim Berners-Lee. The colored dots indicate the data sources from which data was merged.

http://www5.wiwiss.fu-berlin.de/marbles?uri=http%3A%2F%2Fwww.w3.org%2FPeople%2FBerners-Lee%2Fcard%23i (couldn’t reproduce the display above)

Linked Data Search Engines and Indexes

• Browsers navigate Linked Data, Search Engines are where the navigation process begins

• Provide query capabilities over aggregated data

• Two categories:

1. Human-oriented search engines

2. Application-oriented indexes

1. Human Oriented Search Engines

• Falcons and SWSE (Semantic Web Search Engine)

• Keyword-based search with an interaction style similar to Google and Yahoo

• Search box into which keywords are entered, returns a list of results

• Provide a detailed interface to the user that exploits the underlying structure of the data rather than simply providing a link from the search result to the source documents

• Show a summary of the entity the user selects from the results list, alongside structured data from the linked entities

Human Oriented Search Engine http://ws.nju.edu.cn/falcons/objectsearch/

Object Search for “tim berners lee”

Ontology search for “blood”

Human Oriented Search Engine

SWSE: http://swse.deri.org/

Linked Data Search Engines and Indexes cont.

2. Application-oriented Indexes

• Swoogle, Sindice, Watson (DARQ - federated query)

• APIs through which Linked Data applications can discover RDF documents on the Web that reference URIs or keywords

• Applications can query these indexes and receive pointers to relevant documents that can be processed by the application itself

• Sindice is oriented towards providing access to RDF documents containing instance data

• Swoogle and Watson are oriented towards finding ontologies that provide coverage of certain concepts relevant to the query

Swoogle: http://swoogle.umbc.edu/

Search term “blood” finds Ontologies

Sindice: http://sindice.org

Sindice: search term “blood”, finds Documents

Linked Data Domain Specific Applications

• Domain-specific Applications

• Revyu, DBPedia Mobile, Talis Aspire, BBC Programmes and Music, and DERI Pipes

• Domain specific functionality by providing “mashup” of data from various Linked Data sources

• Revyu

• Generic review and rating site

• Consumes Linked Data from the web about films that are reviewed on Revyu to enhance the experience of site users

• Pulls matching information from other sources such as DBpedia and shows it in the human-oriented pages of the site while maintaining the references to the URIs that may be used by Linked Data-aware applications

• Information such as director’s name, film posters, books and publications, and external data sets to enhance user profiles with FOAF data

Linked Data Domain Specific Applications – cont.

• DBpedia Mobile

• Location-aware browser for the iPhone or other mobile devices

• Oriented towards tourists exploring a city

• Based on GPS location, provides a location-centric mashup of nearby locations from DBpedia, associated reviews from Revyu, and photos from the Flickr photo-sharing API

Linked Data Domain Specific Applications – DBpedia Mobile

Figure 5. DBpedia Mobile displaying information about Berlin

Linked Data Domain Specific Applications

• Talis Aspire

• Resource List Management for university lecturers and students

• User creates lists through a web interface; the application produces and stores RDF triples

• Items in one list are linked to corresponding items on other lists at other institutions, building a web of scholarly data through non-[linked-data] users

• BBC Programmes and Music

• Uses DBpedia and MusicBrainz to connect content about BBC radio and television topics and to augment the content with additional data

• DERI Pipes

• Similar to Yahoo Pipes

• Provides a mashup platform enabling data sources to be plugged together to form new feeds of data

• May contain identifier consolidation, schema mapping, RDFS or OWL reasoning and data transformations expressed using SPARQL CONSTRUCT operations or XSLT templates

Linked Data Domain Specific Applications – DERI Pipes

A workbench to produce an output stream of data from 3 sources: http://pipes.deri.org

Figure 6. DERI pipes workflow integrating data about Tim Berners-Lee from three data sources.

6. Related Developments (in Research and Practice)

• Microformats

• Aim is to extend HTML pages to include structured data

• Defines a set of formats embedded into HTML via class attributes

• 2 major differences between Microformats and Linked Data RDF serialization

1. Linked Data is not limited in the vocabularies that can be used; Microformats are restricted to a small set of vocabularies closely managed by a specific community

2. Data items in HTML via Microformats do not have their own identifiers, which prevents assertions across documents and web sites, whereas Linked Data URIs are global identifiers that can be used to represent relationships

6. Related Developments (in Research and Practice)

• Web APIs

• Amazon, eBay, Yahoo! and Google

• 1,309 Web APIs and 3,966 mashups based on these APIs

• APIs are accessed using a wide range of mechanisms and retrieved data is represented using various content formats

• Difference is that Linked Data is committed to a small set of standard technologies: URIs as identifiers, HTTP as access mechanisms, and RDF as content format

• Single set of technologies results in ability to use generic data browsers and search engines

• Most Web APIs don’t assign unique identifiers to data items therefore can’t be linked across different data sources

• And Mashups are implemented against a set of fixed data sources

• By contrast, Linked Data can take advantage of connecting different data silos into a single global information space

6. Related Developments (in Research and Practice)

• Dataspaces

• System architecture around which research on reference reconciliation, schema matching and mapping, data lineage, data quality and information extraction is unified

• Offer ‘best-effort’ answers

• Semantic cohesion is increased over time by different parties providing mappings

• Web of Data seen as a realization of dataspaces concept on a global scale

6. Related Developments (in Research and Practice)

• Semantic Web

• Evolution of human-readable documents to contain more machine-readable semantic information seen as the seeds for what is known as Semantic Web

• A web of data that can be processed directly or indirectly by machines

• Web of Data is the goal of Semantic Web

• Linked Data provides the means by which to reach this goal

• Building the Web of Data with Linked Data as the foundation may facilitate the reality of intelligent agents and other promises of a more sophisticated Semantic Web vision

7. Research Challenges

User Interfaces and Interaction Paradigms

• Challenge is linking and presenting dynamically linked and potentially unexpected information to a user without unacceptable cognitive overload

• Example: traditional browsers allow user to move forward and backward in a document-centric resource

• Similar navigation is expected in a Linked Data browser but instead of within a document, it is moving forward and backwards between entities [of different types], changing the focal point of the application

• Will need to be able to add/remove resources from view

• Sindice gives an indication of how such functionality could be delivered, but data sources in the thousands and millions will be a research challenge

7. Research Challenges

Application Architectures

• Semantic Web Client Library and SQUIN have demonstrated that queries can be answered by relying on runtime link traversal

• Potential problem with scalability of on-the-fly link traversal and federated query

• Widespread crawling and caching may become the norm
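
Runtime link traversal can be sketched as a breadth-first walk that dereferences each newly discovered URI. In the sketch below the Web is simulated by a dictionary and all URIs are hypothetical; a lookup budget stands in for the scalability concern. It does not show how the Semantic Web Client Library or SQUIN are implemented.

```python
from collections import deque

FOAF_KNOWS = "http://xmlns.com/foaf/0.1/knows"

# Hypothetical stand-in for the Web: URI -> triples returned when it is dereferenced.
WEB = {
    "http://a.example.org/alice": [
        ("http://a.example.org/alice", FOAF_KNOWS, "http://b.example.org/bob"),
    ],
    "http://b.example.org/bob": [
        ("http://b.example.org/bob", FOAF_KNOWS, "http://c.example.org/carol"),
    ],
    "http://c.example.org/carol": [],
}


def dereference(uri):
    """Simulated lookup; a real client would issue an HTTP GET here."""
    return WEB.get(uri, [])


def traverse(start, max_lookups=10):
    """Breadth-first link traversal, collecting triples until the lookup budget is spent."""
    seen, queue, collected = {start}, deque([start]), []
    lookups = 0
    while queue and lookups < max_lookups:
        uri = queue.popleft()
        lookups += 1
        for s, p, o in dereference(uri):
            collected.append((s, p, o))
            if o.startswith("http") and o not in seen:
                seen.add(o)
                queue.append(o)
    return collected, lookups


collected, lookups = traverse("http://a.example.org/alice")
print(lookups, collected)
```

Every hop costs one lookup, which is why answering queries purely by traversal becomes expensive as the graph grows and why crawling and caching the results is attractive.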

Schema Mapping and Data Fusion

• Retrieving data from distributed sources presents a challenge for how to display the information to the user

• Most browsers just present the pieces of information alongside each other

• Requires mapping of terms from different vocabularies and fusing data about the same entity

• Data sources can publish correspondences between their local terminology and terminology of related data sources

• Use terms from the W3C recommendations RDF Schema and OWL, such as owl:equivalentClass, owl:equivalentProperty, rdfs:subClassOf and rdfs:subPropertyOf, to publish correspondences
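
A consumer could use published correspondences to bring data from different vocabularies onto one set of terms. The sketch below handles only owl:equivalentProperty and rewrites predicates in one direction; the local vocabulary term and the data are hypothetical.

```python
OWL_EQUIVALENT_PROPERTY = "http://www.w3.org/2002/07/owl#equivalentProperty"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical local vocabulary term mapped to the FOAF term foaf:name.
correspondences = [
    ("http://example.org/vocab#fullName", OWL_EQUIVALENT_PROPERTY, FOAF_NAME),
]

data = [
    ("http://example.org/people#alice", "http://example.org/vocab#fullName", "Alice"),
    ("http://example.com/staff#bob", FOAF_NAME, "Bob"),
]


def normalize_predicates(triples, mappings):
    """Rewrite predicates to the target term of each published correspondence."""
    rewrite = {s: o for s, p, o in mappings if p == OWL_EQUIVALENT_PROPERTY}
    return [(s, rewrite.get(p, p), o) for s, p, o in triples]


normalized = normalize_predicates(data, correspondences)
print(normalized)
```

After normalization both names are stated with the same predicate, so an application can query or display them uniformly.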

7. Research Challenges

Schema Mapping and Data Fusion (cont.)

• These can be too coarse-grained to perform transformations

• Need more fine grained schema mappings that support transitive mappings and combining partial mappings

• Several alignment languages have been presented (Haslhofer 2008 and Euzenat, Scharffe & Zimmermann 2007) as well as the Rule Interchange Format (RIF)

• Data fusion is the process of integrating multiple data items representing the same real world object

• Main challenge is resolution of data conflicts where multiple sources provide different values for the same property of an object

• What distinguishes data fusion for Linked Data is the scarceness and uncertainty of the quality-related meta-information needed to resolve inconsistencies

• DERI Pipes and the KnoFuss architecture are two prototypical systems for fusing Linked Data
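
One very simple conflict-resolution strategy is to let sources vote on the value of a property. The sketch below uses made-up claims and majority voting purely as an illustration; it is not the method used by DERI Pipes or KnoFuss, and real fusion would also weigh the quality and provenance of each source.

```python
from collections import Counter

# Hypothetical claims: (source, entity URI, property, value).
claims = [
    ("source-a", "http://example.org/city#X", "population", 3400000),
    ("source-b", "http://example.org/city#X", "population", 3400000),
    ("source-c", "http://example.org/city#X", "population", 3500000),
]


def fuse_by_vote(claims):
    """Pick, for each (entity, property), the value asserted by the most sources."""
    votes = {}
    for _source, entity, prop, value in claims:
        votes.setdefault((entity, prop), Counter())[value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


fused = fuse_by_vote(claims)
print(fused)
```

When two values receive the same number of votes, this sketch keeps the one that was seen first, which shows why additional quality information is needed to resolve such cases properly.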

7. Research Challenges

Link Maintenance

• Content of Linked Data sources changes

• New entities are added; outdated ones are changed or removed

• RDF links are updated only sporadically, leaving dead links pointing to URIs no longer maintained

• Can lead to a large number of unnecessary HTTP requests by client applications

• Proposed solutions range from recalculating links at regular intervals (Silk or LinQL), through data sources publishing update feeds or subscriptions, to central registries such as Ping the Semantic Web that keep track of new or changed data items
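
The update-feed idea can be sketched as follows: a link source applies a target's feed to its picture of which URIs are still published, then separates live links from dead ones without issuing HTTP requests. The feed format (pairs of action and URI), the example URIs and the function names are assumptions for illustration; this is not the Silk, LinQL or Ping the Semantic Web interface.

```python
def apply_update_feed(known_uris, feed):
    """Return the set of published URIs after applying (action, uri) feed entries."""
    uris = set(known_uris)
    for action, uri in feed:
        if action == "added":
            uris.add(uri)
        elif action == "removed":
            uris.discard(uri)
    return uris


def split_links(links, published_uris):
    """Partition (subject, predicate, target) links by whether the target still exists."""
    live, dead = [], []
    for link in links:
        target = link[2]
        (live if target in published_uris else dead).append(link)
    return live, dead


# Made-up links from one data source into another.
links = [
    ("http://a.example/id/alice", "owl:sameAs", "http://b.example/id/alice"),
    ("http://a.example/id/bob", "owl:sameAs", "http://b.example/id/bob"),
]
known = {"http://b.example/id/alice", "http://b.example/id/bob"}
feed = [
    ("removed", "http://b.example/id/bob"),
    ("added", "http://b.example/id/carol"),
]
current = apply_update_feed(known, feed)
live, dead = split_links(links, current)
```

Links in `dead` could then be dropped or recalculated, and newly added URIs such as the one for carol become candidates for fresh link generation.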

Licensing

• Applications consuming data need to access specifications of the terms for reusing and republishing the data

• Creative Commons is a framework for open licensing underpinned by the notion of copyright; others say copyright law doesn’t apply to data

• On that view, providers should adopt the Open Data Commons Public Domain Dedication and License

• Further research is needed for interfaces where attribution is required for reuse

7. Research Challenges

Trust, Quality and Relevance

• This is the challenge of ensuring that the data most relevant to the user is identified

• Content-based, context-based and rating-based techniques that can be used are described in (Bizer & Cyganiak, 2009; Heath, 2008a)

• PageRank and other algorithms will be important for coarse-grained measures of popularity – as a proxy for relevance or quality – but need to be adapted for the Web of Data

• How interfaces should represent this information is a significant research challenge

• An “Oh, yeah?” button

• WIQA and InferenceWeb may be able to contribute to this area by providing explanations of information quality and of the inference processes used to derive query results

7. Research Challenges

Privacy

• Opportunity to violate privacy when aggregating data from so many distinct sources

• Will likely require technical and legal means together with greater user awareness of what data to provide in which context

• Research is being done by Weitzner (2007) and by the TAMI project on information accountability (Weitzner et al., 2008)

Conclusions

• Linked Data is being adopted by an increasing number of data providers

• Has the potential to enable a revolution in how data is accessed and utilized, just as the Web brought a revolution in how documents were accessed and utilized

• Developers building mashups face the challenge of scaling beyond a fixed set of data sources, whereas Linked Data applications operate on top of an unbounded set of data sources via standardized access mechanisms

Denise’s take:

• There is potential to publish data sources in RDF alongside the terminologies, leveraging the knowledge encoded in the ontologies used to annotate caBIG data

• The technology is still very techie oriented, not accessible to the average Joe

• Whereas most people can learn HTML or wiki markup pretty easily to publish data and attach a file

• Have to learn RDF, OWL and various vocabularies for expressing predicates

• Learn to use the link generation tools, set them up

• Set up a server to dereference URIs

• ….

• Jim will tell us more!

EXTRA Slides

• Uses HTTP redirect status code: “303 See Other”

• Solution proposed by W3C's Technical Architecture Group in its httpRange-14 resolution [httpRange]

• Indicates that the URI does not identify a web document

• Directs you to a document about the thing you asked about

• Avoids ambiguity about the real-world object and the resource that represents it

• http://www.example.com/id/exampleinc • Example Inc., the company

• http://www.example.com/id/bob • Bob, the person

• http://www.example.com/id/alice • Alice, the person

Source: http://www.w3.org/TR/cooluris/

[303 URIs for Linked Data]
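
The 303 pattern above can be sketched as a small routing function. This is an illustration rather than a real server: the /people/ and /data/ document paths and the simple Accept-header check are assumptions chosen for the example, and real content negotiation is more involved.

```python
def respond(path, accept="text/html"):
    """Return (status, headers) for a request path, following the 303 pattern.

    URIs under /id/ name real-world things, so they are never answered with
    200; the client is redirected to a document about the thing instead.
    """
    if path.startswith("/id/"):
        name = path[len("/id/"):]
        # Hypothetical layout: RDF documents under /data/, HTML under /people/.
        prefix = "/data/" if "application/rdf+xml" in accept else "/people/"
        return 303, {"Location": prefix + name, "Vary": "Accept"}
    if path.startswith(("/data/", "/people/")):
        # These paths identify web documents and can be served directly.
        return 200, {}
    return 404, {}


html_redirect = respond("/id/alice")
rdf_redirect = respond("/id/alice", accept="application/rdf+xml")
```

A browser asking for /id/alice is sent to an HTML page about Alice, while an RDF-aware client is sent to a data document, and neither response claims that Alice herself is a web document.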

[Relationship between a resource identifier and its documents]

[303 URIs for Linked Data]

Source: http://www.w3.org/TR/cooluris/

[Hash URIs for Linked Data]

• Hash URI

• Used for non-document resources

• Clients strip off the fragment (the part after the hash) before retrieving the URI

• A URI with a hash cannot be retrieved directly and therefore does not necessarily identify a web document

• http://www.example.com/about#exampleinc • Example Inc., the company

• http://www.example.com/about#bob • Bob, the person

• http://www.example.com/about#alice • Alice, the person

Source: http://www.w3.org/TR/cooluris/
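
The fragment-stripping step can be shown with Python's standard library. This is a small sketch using the example URIs from the slide; the helper name is made up.

```python
from urllib.parse import urldefrag


def document_to_fetch(uri):
    """Split a hash URI into the document a client retrieves and the fragment."""
    document, fragment = urldefrag(uri)
    return document, fragment


hash_uris = [
    "http://www.example.com/about#exampleinc",
    "http://www.example.com/about#bob",
    "http://www.example.com/about#alice",
]
# All three identifiers resolve to the same retrievable document.
documents = {document_to_fetch(uri)[0] for uri in hash_uris}
```

Because the fragment never reaches the server, one request for the /about document yields the descriptions of all three resources.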

[Hash URIs for Linked Data]

Source: http://www.w3.org/TR/cooluris/

Linked Data Applications – Search Engines and Indexes cont.

• The document web and the data web form one connected, navigable space

• A user may run a query over HTML documents, follow a link into the Web of Data, then follow another link into a different HTML document

• The search engines might be expected to perform more sophisticated queries but so far that has not proven to be the case [2009] with the exception of Tabulator’s style of query-by-example and faceted browsing

• SWSE provides access to the underlying data store via SPARQL, but this is suitable for application developers with knowledge of the language rather than a user asking specific questions through a human interface