linked data for librarians

Linked Data Fundamentals. Trevor Thornton, Senior Applications Developer, NYPL Labs, The New York Public Library

Upload: trevorthornton

Post on 17-Dec-2014


Page 1: Linked data for librarians

Linked Data Fundamentals

Trevor Thornton

Senior Applications Developer, NYPL Labs

The New York Public Library

Page 2: Linked data for librarians

Linked Data

Data published on the Web in accordance with principles

designed to facilitate linkages between resources

The potential for linked data in libraries:

• Eliminates data silos - makes data accessible on the Web

and promotes sharing and re-use

• Promotes discovery of related resources through links

(to common people, subjects, etc.)

• Supports cooperative description

(‘open world assumption’)

Page 3: Linked data for librarians

Key aspects of linked data

• Based on the core Web technologies (HTTP, URIs)

• Uses a simple data structure based on atomic statements

about resources (RDF)

• Can be interpreted by machines (semantic data)

• Focus on connecting resources, rather than simply

describing them (though it can do both)

Page 4: Linked data for librarians

HTTP (Hypertext Transfer Protocol)

The foundation of data communication for the Web

HTTP request

HTTP response

Client/User agent (e.g. web browser)

Web server

Page 5: Linked data for librarians

URI (Uniform Resource Identifier)

Globally unique identifier for a resource on a computer

or a network.

HTTP URIs identify resources on the Web.

http://www.yourdomain.org/something

Page 6: Linked data for librarians

URI vs. URL

URLs (Uniform Resource Locators) are a subset of URIs

that, in addition to identifying a resource, provide a means of

locating it.

A URI does not necessarily point to a document;

a URL does.

A URI can identify a real-world object.

Page 7: Linked data for librarians

The Semantic Web

Proposed by Tim Berners-Lee, with James Hendler and Ora Lassila, in a 2001 article in Scientific American

“The Semantic Web is not a separate Web but an extension of the current one, in

which information is given well-defined meaning, better enabling computers and

people to work in cooperation…

In the near future, these developments will usher in significant new functionality

as machines become much better able to process and ‘understand’ the data that

they merely display at present.”

Page 8: Linked data for librarians

The Linked Data Principles

Tim Berners-Lee, 2006

1. Use URIs as names for things.

2. Use HTTP URIs so that people can look up those names.

3. When someone looks up a URI, provide useful

information, using the standards (RDF, SPARQL).

4. Include links to other URIs so that they can discover

more things.

Page 9: Linked data for librarians

RDF (Resource Description Framework)

A framework for describing Web resources.

A Web resource is anything that can be retrieved or identified

on the Web via a URI.

RDF descriptions are based on simple

subject-predicate-object expressions called “triples”.

Page 10: Linked data for librarians

The RDF Triple

Subject - the resource being described

Predicate - a property of that resource

Object - the value of the property

Subject and predicate are defined using URIs.

Object can either be a URI or a literal value

(text, number, date, etc.)

subject → predicate → object

Page 11: Linked data for librarians

Here is some metadata…

Robert Moses Papers

CREATOR:

Moses, Robert, 1888-1981

EXTENT:

142 linear feet

REPOSITORY:

The New York Public Library. Manuscripts and Archives Division.

Page 12: Linked data for librarians

Here are some triples

http://archives.nypl.org/mss/2071 (Robert Moses Papers)
→ http://purl.org/dc/terms/creator (creator)
→ http://viaf.org/viaf/52866196 (Moses, Robert, 1888-1981)

http://archives.nypl.org/mss/2071 (Robert Moses Papers)
→ http://purl.org/dc/terms/extent (extent)
→ ‘142 linear feet’

http://archives.nypl.org/mss/2071 (Robert Moses Papers)
→ http://purl.org/archival/vocab/arch#heldBy (repository)
→ http://data.nypl.org/org_units/mss (NYPL Manuscripts & Archives)
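As a rough illustration of the triples above, each statement can be held in a small Python structure. This is only a sketch: the `Triple` type and the `is_uri` helper are hypothetical names, and telling URIs from literals by their prefix is a simplifying assumption, not how RDF libraries model terms.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # URI of the resource being described
    predicate: str  # URI of the property
    obj: str        # URI of another resource, or a literal value


PAPERS = "http://archives.nypl.org/mss/2071"  # Robert Moses Papers

triples = [
    Triple(PAPERS, "http://purl.org/dc/terms/creator",
           "http://viaf.org/viaf/52866196"),
    Triple(PAPERS, "http://purl.org/dc/terms/extent", "142 linear feet"),
    Triple(PAPERS, "http://purl.org/archival/vocab/arch#heldBy",
           "http://data.nypl.org/org_units/mss"),
]


def is_uri(value: str) -> bool:
    # Assumption for this sketch: anything starting with http(s):// is a URI.
    return value.startswith(("http://", "https://"))


for t in triples:
    kind = "URI" if is_uri(t.obj) else "literal"
    print(f"{t.predicate} -> {t.obj} ({kind})")
```

Only the extent statement has a literal object; the other two objects are URIs that can themselves be described by further triples.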

Page 13: Linked data for librarians

A set of related triples = a graph

http://archives.nypl.org/mss/2071
→ http://purl.org/dc/terms/creator → http://viaf.org/viaf/52866196
→ http://purl.org/dc/terms/extent → ‘142 linear feet’
→ http://purl.org/archival/vocab/arch#heldBy → http://data.nypl.org/org_units/mss

Page 14: Linked data for librarians

This is another graph

http://www.worldcat.org/oclc/834874
→ http://purl.org/dc/terms/creator → http://viaf.org/viaf/44312399
→ http://purl.org/dc/terms/subject → http://viaf.org/viaf/52866196

Page 15: Linked data for librarians

Put the graphs together to make a new graph

http://archives.nypl.org/mss/2071 (Robert Moses Papers)
→ http://purl.org/dc/terms/creator → http://viaf.org/viaf/52866196
→ http://purl.org/dc/terms/extent → ‘142 linear feet’
→ http://purl.org/archival/vocab/arch#heldBy → http://data.nypl.org/org_units/mss

http://www.worldcat.org/oclc/834874 (The Power Broker)
→ http://purl.org/dc/terms/creator → http://viaf.org/viaf/44312399
→ http://purl.org/dc/terms/subject → http://viaf.org/viaf/52866196

The two graphs connect through the shared node http://viaf.org/viaf/52866196.
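One way to picture the merge is to treat each graph as a set of (subject, predicate, object) tuples, so that combining graphs is a set union. The sketch below assumes that representation; the `match` helper is a hypothetical, much simplified stand-in for a triple-pattern query, and the repository URI is the one given on the triples slide.

```python
DCT = "http://purl.org/dc/terms/"
MOSES = "http://viaf.org/viaf/52866196"
CARO = "http://viaf.org/viaf/44312399"
PAPERS = "http://archives.nypl.org/mss/2071"
BOOK = "http://www.worldcat.org/oclc/834874"

archive_graph = {
    (PAPERS, DCT + "creator", MOSES),
    (PAPERS, DCT + "extent", "142 linear feet"),
    (PAPERS, "http://purl.org/archival/vocab/arch#heldBy",
     "http://data.nypl.org/org_units/mss"),
}

book_graph = {
    (BOOK, DCT + "creator", CARO),
    (BOOK, DCT + "subject", MOSES),
}

# Merging two graphs is a set union: shared URIs become shared nodes.
merged = archive_graph | book_graph


def match(graph, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Which resources point at Robert Moses, in any role?
linked_to_moses = sorted({s for s, _, _ in match(merged, o=MOSES)})
print(linked_to_moses)
```

Neither source graph alone relates the papers to the book; the connection appears only after the union, because both use the same VIAF URI.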

Page 16: Linked data for librarians

RDF serialization formats

‘Serialization’ = to record one or more RDF graphs in a machine-readable file. There are 2 basic options:

RDF in a standalone text file:

• RDF/XML

• N3 (Notation 3)

• Turtle (Terse RDF Triple Language)

• N-Triples

RDF embedded in HTML:

• RDFa (RDF in attributes)

Page 17: Linked data for librarians

Basic triples in N-Triples

<http://archives.nypl.org/mss/2071> <http://purl.org/dc/terms/creator> <http://viaf.org/viaf/52866196> .

<http://archives.nypl.org/mss/2071> <http://purl.org/dc/terms/extent> "142 linear feet" .

<http://archives.nypl.org/mss/2071> <http://purl.org/archival/vocab/arch#heldBy> <http://data.nypl.org/org_units/mss> .

N-Triples is the most basic expression of RDF.
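Because N-Triples writes one complete statement per line, a serializer can be very small. The following is a minimal sketch rather than a full implementation: `nt_term` and `to_ntriples` are hypothetical helpers, URIs are recognised by prefix only, and language tags, datatypes and blank nodes are ignored.

```python
def nt_term(value: str) -> str:
    """Format an object as an N-Triples term (simplified)."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    # Literals are double-quoted; escape backslashes, quotes and newlines.
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
    )
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"<{s}> <{p}> {nt_term(o)} ." for s, p, o in triples)


example = [
    ("http://archives.nypl.org/mss/2071",
     "http://purl.org/dc/terms/creator",
     "http://viaf.org/viaf/52866196"),
    ("http://archives.nypl.org/mss/2071",
     "http://purl.org/dc/terms/extent",
     "142 linear feet"),
]

print(to_ntriples(example))
```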

Page 18: Linked data for librarians

Basic triples in N3/Turtle

@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix arch: <http://purl.org/archival/vocab/arch#> .

<http://archives.nypl.org/mss/2071>
    dcterms:creator <http://viaf.org/viaf/52866196> ;
    dcterms:extent "142 linear feet" ;
    arch:heldBy <http://data.nypl.org/org_units/mss> .

Statements about the same resource are grouped together. Property URIs are shortened using prefixes (‘QNames’).
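The prefix mechanism amounts to replacing a known namespace with a short label. A small sketch of that idea follows; `shorten` is a hypothetical helper, and it assumes the remainder of the URI is a valid local name, which real Turtle writers have to check.

```python
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "arch": "http://purl.org/archival/vocab/arch#",
}


def shorten(uri: str, prefixes=PREFIXES) -> str:
    """Return prefix:localname if a namespace matches, else the full <uri>."""
    for prefix, namespace in prefixes.items():
        if uri.startswith(namespace):
            return f"{prefix}:{uri[len(namespace):]}"
    return f"<{uri}>"


print(shorten("http://purl.org/dc/terms/creator"))
print(shorten("http://purl.org/archival/vocab/arch#heldBy"))
print(shorten("http://viaf.org/viaf/52866196"))
```

URIs with no declared prefix, such as the VIAF identifier, stay in angle brackets.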

Page 19: Linked data for librarians

Basic triples in RDF/XML

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:arch="http://purl.org/archival/vocab/arch#">
  <rdf:Description rdf:about="http://archives.nypl.org/mss/2071">
    <dcterms:creator rdf:resource="http://viaf.org/viaf/52866196" />
    <dcterms:extent>142 linear feet</dcterms:extent>
    <arch:heldBy rdf:resource="http://data.nypl.org/org_units/mss" />
  </rdf:Description>
</rdf:RDF>

Page 20: Linked data for librarians

RDFa (RDF in Attributes)

RDFa allows RDF data to be embedded within HTML.

Rendered HTML:

The Power Broker, by Robert Caro, is a biography of Robert Moses.

HTML code:

<div about="http://www.worldcat.org/oclc/834874"
     prefix="dcterms: http://purl.org/dc/terms/">
  The Power Broker, by <span property="dcterms:creator"
  resource="http://viaf.org/viaf/44312399">Robert Caro</span>, is a biography of
  <span property="dcterms:subject"
  resource="http://viaf.org/viaf/52866196">Robert Moses</span>.
</div>

Page 21: Linked data for librarians

RDF Ontologies/vocabularies

• Define categories of things and the relationships that they

can have to each other

• Provide the semantics that allow data to be interpreted

by machines

• Establish rules of inference – what can be assumed to

be true based on what is asserted by a triple

Page 22: Linked data for librarians

RDFS (RDF Schema)

A basic vocabulary for ontology development.

RDFS defines RDF classes and properties.

Class: a category of resources; a resource in such a

category is said to be an instance of the class

Property: a relation between a subject and object in a triple

Page 23: Linked data for librarians

Classes and subClasses

The subClassOf property (used in defining a class) allows a

broad class to serve as the basis of a more specific class.

Defining a class (A) as a subClassOf another class (B)

means that any instance of A can be inferred to also be an

instance of B.

Class B

Class A

Page 24: Linked data for librarians

A simple Class/subClass example

Based on these class definitions:

‘Dog’ is a Class

‘Poodle’ is a Class

‘Poodle’ is a subClassOf ‘Dog’

And the statement:

Fido is a Poodle.

It can be inferred that:

Fido is a Dog.
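The Fido inference can be sketched in a few lines of Python. This is an illustration only: plain strings stand in for class URIs, `all_classes` is a hypothetical helper, and a real RDFS reasoner would work over triples rather than dictionaries.

```python
def all_classes(cls, subclass_of):
    """Return cls plus every class reachable through subClassOf links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # A class may have several superclasses, so follow each of them.
            stack.extend(subclass_of.get(current, ()))
    return seen


# 'Poodle' is a subClassOf 'Dog'
subclass_of = {"Poodle": {"Dog"}}

# Fido is a Poodle.
instance_of = {"Fido": "Poodle"}

fido_classes = all_classes(instance_of["Fido"], subclass_of)
print(sorted(fido_classes))  # ['Dog', 'Poodle']
```

Only "Fido is a Poodle" was asserted; "Fido is a Dog" follows from the class definitions.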

Page 25: Linked data for librarians

RDFS Properties

The predicates in RDF triples are properties.

Properties themselves have two important properties:

domain: asserts that the subject of the triple is an instance

of a specific class

range: asserts that the object of the triple is an instance of

a specific class
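Domain and range can be read as inference rules: using a property tells you something about the types of its subject and object. The sketch below assumes a made-up vocabulary (`ex:owns`, `ex:Person`, `ex:Dog`) and a hypothetical `infer_types` helper; none of these come from a published ontology.

```python
RDF_TYPE = "rdf:type"

# Hypothetical property definition: ex:owns has domain ex:Person, range ex:Dog.
schema = {
    "ex:owns": {"domain": "ex:Person", "range": "ex:Dog"},
}

asserted = [("ex:alice", "ex:owns", "ex:fido")]


def infer_types(triples, schema):
    """Derive rdf:type statements implied by domain and range definitions."""
    inferred = set()
    for s, p, o in triples:
        rule = schema.get(p, {})
        if "domain" in rule:
            inferred.add((s, RDF_TYPE, rule["domain"]))
        if "range" in rule:
            inferred.add((o, RDF_TYPE, rule["range"]))
    return inferred


for triple in sorted(infer_types(asserted, schema)):
    print(triple)
```

Nothing stated the type of either resource directly; both type statements are derived from the single asserted triple.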

Page 26: Linked data for librarians

OWL (Web Ontology Language)

Provides an extended set of properties used in

ontology/vocabulary definitions (used in conjunction with

RDFS)

• Equivalence/disjointness

• Advanced property definitions

• Restrictions and cardinality

owl:sameAs: A property that asserts that two resources are

the same (i.e. two URIs refer to the same thing)
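One practical use of owl:sameAs is folding statements made about different URIs onto a single identifier. The sketch below is a simplified assumption of how that might look: `merge_same_as` is a hypothetical helper, the `example.org` URIs are invented for illustration, and full OWL semantics are not implemented.

```python
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
MOSES_VIAF = "http://viaf.org/viaf/52866196"
MOSES_LOCAL = "http://example.org/people/robert-moses"  # hypothetical URI


def merge_same_as(triples):
    """Rewrite triples so URIs linked by owl:sameAs share one identifier."""
    canonical = {}

    def find(uri):
        # Follow the chain of replacements to its final identifier.
        while uri in canonical:
            uri = canonical[uri]
        return uri

    for s, p, o in triples:
        if p == SAME_AS:
            a, b = find(s), find(o)
            if a != b:
                canonical[a] = b  # the object's identifier is kept
    return {(find(s), p, find(o)) for s, p, o in triples if p != SAME_AS}


data = [
    (MOSES_LOCAL, SAME_AS, MOSES_VIAF),
    ("http://example.org/photos/1", "http://purl.org/dc/terms/subject", MOSES_LOCAL),
]

print(merge_same_as(data))
```

After the rewrite, the photo's subject is the VIAF URI, so it connects to any other graph that uses that identifier.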

Page 27: Linked data for librarians

SKOS (Simple Knowledge Organization System)

Defines classes and properties to support the use of

thesauri, classification schemes, subject heading systems

and taxonomies in RDF

• Classes: skos:ConceptScheme, skos:Concept

• Properties: skos:broader, skos:narrower, skos:related,

skos:prefLabel, skos:altLabel

Page 28: Linked data for librarians

Library of Congress Linked Data Service (id.loc.gov)

• Provides URIs for LC controlled vocabularies, thesauri,

language codes, classification schemes

• Most terms defined using SKOS + RDF representation

of MADS (where applicable)

• Complete vocabularies available as free downloads

Page 29: Linked data for librarians

FOAF (Friend of a Friend)

• Provides a vocabulary for describing people and their

relationships to each other and to the things they

make and do

• Originally intended for web-based social networks,

FOAF has gained wider acceptance in describing

historical figures and their relationships

• Classes: Agent, Person, Organization, Group

• Properties: knows, name, based_near

Page 30: Linked data for librarians

VIAF (Virtual International Authority File)

• Clusters names in authority files from numerous national

libraries and other agencies

• Named entities vs. just names

• OCLC is actively establishing links between VIAF and

Wikipedia, building an invaluable resource for

libraries/archives/museums to provide context for their

collections

Page 31: Linked data for librarians

Dublin Core Metadata Initiative

• Terms for general use in describing resources

• Properties relating to simple and qualified Dublin Core

elements

• Classes for general material types (Text, Image,

PhysicalObject, etc.)

• Classes for other resources referenced by DCMI

properties (FileFormat, RightsStatement,

ProvenanceStatement, etc.)

Page 32: Linked data for librarians

Schema.org

• Cooperative project between Bing, Google and Yahoo to

provide a mechanism to describe web content via

standardized vocabularies

• Structured data is included in HTML content via microdata

(similar to RDFa)

• Basis of Google Knowledge Graph

• OCLC now provides Schema.org linked data for all

records in WorldCat

Page 33: Linked data for librarians

DBpedia

• Crowd-sourced community effort to extract structured

information from Wikipedia

• Enables sophisticated queries against Wikipedia

• Makes Wikipedia data freely available for re-use

Page 34: Linked data for librarians

Other useful/notable linked data sources

Vocabularies/ontologies

• Bibliographic ontology

• Archival ontology

• Relationship ontology

Data sources

• GeoNames, Europeana, MusicBrainz, data.gov,

nytimes.com, BBC, Project Gutenberg…

Page 35: Linked data for librarians

The obligatory linked data cloud slide

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 36: Linked data for librarians

Technical things to know a little about

• Triplestore – a database for storing RDF data

• SPARQL (SPARQL Protocol and RDF Query Language)

The primary query language for RDF data (analogous to

SQL for relational databases)

• SPARQL endpoint – Web service that provides direct

access to RDF data stores via SPARQL queries

• HTTP content negotiation – process for delivering

content (data) in different formats (e.g. RDF vs. HTML)

based on the HTTP request (e.g. its Accept header)
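Content negotiation is driven by the client: the same URI is requested, but the Accept header names the preferred format. The sketch below only builds such a request with Python's standard library and does not send it; `rdf_request` is a hypothetical helper, and whether a given server honours `text/turtle` is an assumption that would need checking.

```python
from urllib.request import Request


def rdf_request(uri: str, media_type: str = "text/turtle") -> Request:
    """Build (but do not send) a GET request asking for an RDF format."""
    return Request(uri, headers={"Accept": media_type})


req = rdf_request("http://viaf.org/viaf/52866196")

# Same URI a browser would use; only the Accept header differs.
print(req.get_method(), req.full_url)
print("Accept:", req.get_header("Accept"))
```

A browser requesting the same URI would typically send an Accept header preferring HTML and receive a web page instead.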

Page 37: Linked data for librarians

Linked data attribution

A growing concern in the linked data community is the need

to include attribution with data in order to determine whether

or not it can/should be trusted.

• RDF reification – allows source attribution to be associated with an

RDF triple

• Named graphs – Extension of RDF that allows attribution and other

metadata to be associated with RDF descriptions

• Quad stores – Similar to triplestores but with an additional element

that connects the triple with its source
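The quad idea can be sketched by adding a fourth element to each statement that names where it came from. The graph names below (`http://example.org/graphs/...`) are invented for illustration, and `triples_by_source` is a hypothetical helper, not part of any quad store's API.

```python
from collections import defaultdict

DCT = "http://purl.org/dc/terms/"
MOSES = "http://viaf.org/viaf/52866196"

# Hypothetical graph names identifying the source of each statement.
NYPL_GRAPH = "http://example.org/graphs/nypl-archives"
WORLDCAT_GRAPH = "http://example.org/graphs/worldcat"

quads = [
    ("http://archives.nypl.org/mss/2071", DCT + "creator", MOSES, NYPL_GRAPH),
    ("http://archives.nypl.org/mss/2071", DCT + "extent", "142 linear feet", NYPL_GRAPH),
    ("http://www.worldcat.org/oclc/834874", DCT + "subject", MOSES, WORLDCAT_GRAPH),
]


def triples_by_source(quads):
    """Group statements by the graph (source) they were taken from."""
    grouped = defaultdict(list)
    for s, p, o, source in quads:
        grouped[source].append((s, p, o))
    return dict(grouped)


sources = triples_by_source(quads)
for source, triples in sources.items():
    print(source, len(triples))
```

Keeping the source alongside each statement lets a consumer decide which statements to trust, or drop everything from one source, without losing the merged view.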

Page 38: Linked data for librarians

Linked Open Data

Linked data that is freely usable, reusable, and

redistributable — subject, at most, to attribution and ‘share

alike’ requirements

Page 39: Linked data for librarians

Open data licensing

Creative Commons (CC): a nonprofit organization that enables the sharing and use of creativity and knowledge through free legal tools.

CC provides alternatives to “all rights reserved” copyright.

Page 40: Linked data for librarians

Creative Commons Licenses

OPEN DATA (:

• Attribution (CC BY): Allows distribution and reuse in any way as long as you get credit

• Attribution-ShareAlike (CC BY-SA): Allows distribution and reuse in any way as long as you get credit and derivative works are released under the same license

NOT OPEN DATA ):

• Attribution-NoDerivs (CC BY-ND): Requires that the original is used unchanged and in whole, with credit to you

• Attribution-NonCommercial (CC BY-NC): Allows distribution and reuse in any way, for non-commercial purposes only, as long as you get credit

• Attribution-NonCommercial-ShareAlike (CC BY-NC-SA): Allows distribution and reuse in any way, for non-commercial purposes only, as long as you get credit and derivative works are released under the same license

• Attribution-NonCommercial-NoDerivs (CC BY-NC-ND): Only permits use as-is, for non-commercial purposes, and with credit to you – the most restrictive CC license available

Page 41: Linked data for librarians

CC0 (‘CC Zero’)

• Allows creators to waive all rights to work and to place it

as completely as possible into the public domain.

• Designed to make it as clear as is legally possible that any

use of your content is allowed

• Quickly becoming the preferred license for open data

Page 42: Linked data for librarians

LC Bibliographic Framework Initiative

• Developing a new bibliographic framework (to replace

MARC) based on linked data principles

• First draft of the Bibliographic Framework (BIBFRAME)

model published in November 2012

Page 43: Linked data for librarians

LC Bibliographic Framework Initiative