Linked Data and LOCAH, UKSG 2011
DESCRIPTION
An introduction to Linked Data and to the Linked Open Copac and Archives Hub project.
TRANSCRIPT
How to Become a First Class Citizen of the Web
Linked Data and the LOCAH project
Jane Stevenson & Adrian Stevenson
Remit
This session will give a brief overview of the concepts behind Linked Data and will explain how we are applying these ideas to archival and bibliographic data.
Archives Hub: merged catalogue of archival descriptions from 200 institutions across the UK
Copac: merged catalogue of bibliographic records from libraries across the UK
Introduction
The goal of Linked Data is to enable people to share structured data on the Web as easily as they can share documents today.
[The creation of] a space where people and organizations can post and consume data about anything.
Bizer/Cyganiak/Heath Linked Data Tutorial, linkeddata.org
In essence, it marks a shift in thinking from publishing data in human-readable HTML documents to machine-readable documents. That means machines can do a little more of the thinking work for us.
http://www.linkeddatatools.com/semantic-web-basics
Linked Data encourages open data, open licences and reuse.
…but Linked Data does not have to be open.
Core questions
Is it achievable?
Will it bring substantial benefits?
“It is the unexpected re-use of information which is the value added by the web”
What is Linked Data?
4 ‘rules’ for the web of data:
Use URIs as names for things
Use HTTP URIs so that people can look up those names.
When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL)
Include links to other URIs, so that they can discover more things.
http://www.w3.org/DesignIssues/LinkedData.html
Giving Things identifiers
We can make statements about things and establish relationships by assigning identifiers to them.
Jane Stevenson = http://archiveshub.ac.uk/janefoaf.rdf
Manchester = http://dbpedia.org/resource/manchester
English = http://lexvo.org/id/iso639-3/eng
URIs
Uniform Resource Identifiers (URIs) are identifiers for entities (people, places, subjects, records, institutions).
They identify resources, and ideally allow you to access representations of those resources.
Think not of locations, but of identifiers!
For Linked Data you use HTTP URIs
Jane Stevenson = http://archiveshub.ac.uk/janefoaf.rdf
Manchester = http://dbpedia.org/resource/manchester
English = http://lexvo.org/id/iso639-3/eng
Entities and Relationships
[Diagram: Repository —providesAccessTo→ Archival Resource; Archival Resource —accessProvidedBy→ Repository]
Subject: Archival Resource
Predicate: accessProvidedBy
Object: Repository
Subject > Predicate > Object
A triple statement
ArchivalResource: http://data.archiveshub.ac.uk/id/findingaid/gb-106-7esp
<accessProvidedBy>
Repository: http://data.archiveshub.ac.uk/id/repository/gb106
HTTP URIs
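The triple above can be sketched as plain data. This is a minimal illustration: the subject and object URIs are taken from the slides, but the full predicate URI is a hypothetical expansion of accessProvidedBy, not a documented LOCAH term URI.

```python
# A triple is just (subject, predicate, object), each named by an HTTP URI.
# Subject and object URIs are from the slides; the predicate URI below is a
# hypothetical expansion of accessProvidedBy for illustration only.
triple = (
    "http://data.archiveshub.ac.uk/id/findingaid/gb-106-7esp",  # subject
    "http://data.archiveshub.ac.uk/terms/accessProvidedBy",     # predicate (hypothetical)
    "http://data.archiveshub.ac.uk/id/repository/gb106",        # object
)

def to_ntriples(s, p, o):
    """Serialise one triple in N-Triples syntax."""
    return f"<{s}> <{p}> <{o}> ."

print(to_ntriples(*triple))
```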
[Diagram: an RDF graph — an Archival Resource is heldAt a Repository and describedBy a Finding Aid; the Finding Aid is encodedAs an EAD document, which has a Title]
An RDF Graph
So...?
If something is identified, it can be linked to
We can then take items from one dataset and link them to items from other datasets
[Diagram: linked datasets — BBC, VIAF, DBPedia, Archives Hub, Copac, GeoNames — with linked items such as BBC:Cranford, VIAF:Dickens, DBPedia:Gaskell, Hub:Gaskell, Copac:Cranford, GeoNames:Manchester, DBPedia:Dickens, Hub:Dickens]
The Linking benefits of Linked Data
The Web of ‘Documents’
Global information space (for humans)
Document paradigm
Hyperlinks
Search engines index documents and infer relevance
Implicit relationships between documents
Lack of semantics
The Web of Linked Data
Global data space (for humans and machines)
Making connections between entities across domains (people, books, films, music, genes, medicines, health, statistics...)
LD is not about searching for specific documents or visiting particular websites; it is about things – identifying and connecting them.
Closely aligned to the general architecture of the Web
From one thing…to the same thing
<sameAs>
http://dbpedia.org/resource/manchester
http://sws.geonames.org/2643123
http://data.archiveshub.ac.uk/id/concept/ncarules/manchester
Are they the same?
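Because sameAs is symmetric and transitive, identity links can be followed mechanically. A small sketch using the three URIs above, grouping URIs into identity sets; whether such a grouping is actually correct is exactly the question the slide asks.

```python
from collections import defaultdict

# sameAs pairs built from the three URIs on the slide. The closure treats
# sameAs as symmetric and transitive; the modelling question is whether
# these URIs really co-refer.
same_as = [
    ("http://dbpedia.org/resource/manchester",
     "http://sws.geonames.org/2643123"),
    ("http://sws.geonames.org/2643123",
     "http://data.archiveshub.ac.uk/id/concept/ncarules/manchester"),
]

def same_as_closure(pairs):
    """Group URIs into identity sets (connected components over sameAs links)."""
    neighbours = defaultdict(set)
    for a, b in pairs:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, groups = set(), []
    for start in neighbours:
        if start in seen:
            continue
        group, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node])
        seen |= group
        groups.append(group)
    return groups

groups = same_as_closure(same_as)
print(len(groups), len(groups[0]))
```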
Vocabularies & Ontologies
Vocabulary: set of terms
Ontology: organisation of terms – hierarchy, relationships
Two different databases: one for films, one for actors. To collaborate using their current databases, the owners of each site would have to agree on a common data format for sharing information that both could understand, such as a common film and actor unique ID scheme of their own invention.
Problems of data integration: information exchange across independently designed systems
Shared vocabularies
Need ‘film title’; ‘actor name’; ‘actor birthdate’, etc. to mean the same thing to each
Use the same vocabulary
Query both databases. No need for transformations, mappings or contracts.
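A toy sketch of the point above, with invented film and actor records: because both datasets use the same property names, they can be joined directly, with no transformation or mapping layer between them.

```python
# Two independent datasets that happen to share a vocabulary. Because both
# use the same property names ("film_title", "actor_name"), they can be
# queried together without any mapping. The data is invented for illustration.
films = [
    {"film_title": "Great Expectations", "actor_name": "John Mills"},
]
actors = [
    {"actor_name": "John Mills", "actor_birthdate": "1908-02-22"},
]

def join_on(records_a, records_b, key):
    """Join two record sets on a shared property name."""
    index = {r[key]: r for r in records_b}
    return [{**a, **index[a[key]]} for a in records_a if a[key] in index]

merged = join_on(films, actors, "actor_name")
print(merged[0]["film_title"], merged[0]["actor_birthdate"])
```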
Vocabularies in Linked Data
Common vocabulary to describe the data, e.g. ‘film-title’ means the same thing
Adopt the same ontologies for expressing meaning
Use semantics to link data
Want to avoid transformation, mapping, contracts between data providers
[Diagram: shared use of vocabularies — Hub RDF and Copac RDF both draw on DC, FOAF and SKOS; Copac also uses bibo; shared properties include dcterms:title and dcterms:identifier]
Shared use of vocabularies
Ontologies
Many widely used ontologies
Use others as far as possible
Use your own where necessary
Dublin Core
Friend of a Friend (FOAF)
Simple Knowledge Organisation System (SKOS)
Bibo
OpenCyc
Linked Data on the Hub & Copac
Linked Open Copac and Archives Hub: Locah
JISC funded project
August 2010 – July 2011
Mimas
UKOLN
Eduserv
What is LOCAH doing?
Part 1: Exposing the Linked Data
Part 2: Creating a prototype visualisation
Part 3: Reporting on opportunities and barriers
How are we exposing the Data?
1. Model our ‘things’ into RDF
2. Transform the existing data into RDF/XML
3. Enhance the data
4. Load the RDF/XML into a triple store
5. Create Linked Data Views
6. Document the process, opportunities and barriers on LOCAH Blog
1. Modelling ‘things’ into RDF
Hub data is in ‘Encoded Archival Description’ (EAD) XML form
Copac data is in ‘Metadata Object Description Schema’ (MODS) XML form
Take a step back from the data format
Think about your ‘things’
What is the EAD document “saying” about “things in the world”?
What questions do we want to answer about those “things”?
http://www.loc.gov/ead/ http://www.loc.gov/standards/mods/
1. Modelling ‘things’ into RDF
Need to decide on patterns for URIs we generate
Following guidance from W3C ‘Cool URIs for the Semantic Web’ and UK Cabinet Office ‘Designing URI Sets for the UK Public Sector’
http://data.archiveshub.ac.uk/id/findingaid/gb1086skinner ‘thing’ URI
… is HTTP 303 ‘See Other’ redirected to …
http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner document URI
… which is then content negotiated to …
http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.html
http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.rdf
http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.turtle
http://data.archiveshub.ac.uk/doc/findingaid/gb1086skinner.json
http://www.w3.org/TR/cooluris/
http://www.cabinetoffice.gov.uk/resource-library/designing-uri-sets-uk-public-sector
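The id/ → doc/ pattern described above can be sketched as two small functions: a 303 See Other redirect from the ‘thing’ URI to the document URI, then content negotiation to a concrete representation. The Accept-header table here is a deliberate simplification of real content negotiation.

```python
# Sketch of the 'thing' URI -> document URI -> representation chain.
# The URI is from the slides; the mapping table is a simplification.
REPRESENTATIONS = {
    "text/html": ".html",
    "application/rdf+xml": ".rdf",
    "text/turtle": ".turtle",
    "application/json": ".json",
}

def see_other(thing_uri):
    """Target of the HTTP 303 'See Other' redirect: swap /id/ for /doc/."""
    return thing_uri.replace("/id/", "/doc/", 1)

def negotiate(doc_uri, accept):
    """Pick a concrete representation for the document URI (default: HTML)."""
    return doc_uri + REPRESENTATIONS.get(accept, ".html")

thing = "http://data.archiveshub.ac.uk/id/findingaid/gb1086skinner"
doc = see_other(thing)
print(doc)
print(negotiate(doc, "text/turtle"))
```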
1. Modelling ‘things’ into RDF
Using existing RDF vocabularies: DC, SKOS, FOAF, BIBO, WGS84 Geo, Lexvo, ORE, LODE, Event and Time Ontologies
Define additional RDF terms where required:
hub:ArchivalResource
copac:BibliographicResource
hub:maintenanceAgency
copac:Creator
It can be hard to know where to look for vocabs and ontologies
Decide on licence – CC BY-NC 2.0, CC0, ODC PDD
[Diagram: Archives Hub model — entities include ArchivalResource, Finding Aid, EAD Document, Biographical History, Agent (Family, Person, Organisation), Repository (an Agent), Place, Concept, Genre, Function, Book, Language, Level, ConceptScheme, PostcodeUnit, Object, Extent, Creation, Birth, Death and TemporalEntity; relationships include maintainedBy/maintains, origination, associatedWith, accessProvidedBy/providesAccessTo, topic/page, hasPart/partOf, encodedAs/encodes, administeredBy/administers, hasBiogHist/isBiogHistFor, foaf:focus, inScheme, representedBy, level, language, extent, participates in, at time and product of]
Archives Hub Model (as at 14/2/2011)
Copac Model (as at November 2010)
Feedback Requested!
We would like feedback on the model
Appreciate this will be easier when the data is available
Via blog:
http://blogs.ukoln.ac.uk/locah/2010/09/28/model-a-first-cut/
http://blogs.ukoln.ac.uk/locah/2010/11/08/some-more-things-some-extensions-to-the-hub-model/
http://blogs.ukoln.ac.uk/locah/2010/10/07/modelling-copac-data/
Via email, twitter, in person
2. Transforming into RDF/XML
Transform EAD and MODS to RDF/XML based on our models
Hub: created an XSLT stylesheet and used the Saxon parser
http://saxon.sourceforge.net/
Saxon runs the XSLT against a set of EAD files and creates a set of RDF/XML files
Copac: created in-house Java transformation program
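A toy stand-in for the transformation step, using Python's standard library rather than the project's XSLT/Saxon pipeline: it pulls a title and identifier out of a minimal EAD-like fragment and emits one RDF/XML statement. The element names and the output are simplified for illustration; the real Hub stylesheet is far richer.

```python
import xml.etree.ElementTree as ET

# A minimal EAD-like fragment (simplified; real EAD is much larger).
ead = """<ead>
  <eadheader><eadid countrycode="gb">gb-106-7esp</eadid></eadheader>
  <archdesc><did><unittitle>Example Papers</unittitle></did></archdesc>
</ead>"""

root = ET.fromstring(ead)
eadid = root.findtext(".//eadid")      # identifier used in the URI
title = root.findtext(".//unittitle")  # becomes dcterms:title

# Emit one RDF/XML description (namespace declarations omitted for brevity).
rdf = (
    '<rdf:Description rdf:about='
    f'"http://data.archiveshub.ac.uk/id/findingaid/{eadid}">\n'
    f'  <dcterms:title>{title}</dcterms:title>\n'
    '</rdf:Description>'
)
print(rdf)
```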
3. Enhancing our data
Language – lexvo.org
Time periods – reference.data.gov.uk
Geolocation – UK Postcodes URIs and Ordnance Survey URIs
Names – Virtual International Authority File (matches and links widely-used authority files) – http://viaf.org/
Names (and subjects) – DBPedia
Subjects – Library of Congress Subject Headings
4. Load RDF/XML into triple store
Using the Talis Platform triple store
RDF/XML is HTTP POSTed
We’re using the Pynappl Python client for the Talis Platform
http://code.google.com/p/pynappl/
Store provides us with a SPARQL query interface
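Step 4 above can be sketched as an HTTP POST of RDF/XML. The endpoint URL below is a placeholder (not a real Talis Platform URL), and the request is only constructed, never sent:

```python
import urllib.request

# Placeholder RDF/XML payload.
rdf_xml = b"<rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'/>"

# Construct (but do not send) the POST request. The endpoint URL is a
# placeholder; a real store would have its own metabox/update URL.
request = urllib.request.Request(
    "http://example.org/store/meta",
    data=rdf_xml,
    headers={"Content-Type": "application/rdf+xml"},
    method="POST",
)
print(request.method, request.get_header("Content-type"))
```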
5. Create Linked Data Views
Expose ‘bounded’ descriptions from the triple store over the Web
Make available as documents in both human-readable HTML and RDF formats (also JSON, Turtle, CSV)
Using the Paget ‘Linked Data Publishing Framework’
http://code.google.com/p/paget/
PHP scripts query the SPARQL endpoint
http://data.archiveshub.ac.uk/id/archivalresource/gb1086skinner
http://data.archiveshub.ac.uk/
Can I access the Locah Linked Data?
Will be releasing the Hub data very soon!
Copac data will follow approx 1 month later
Release will include Linked Data views, SPARQL endpoint details, example queries and supporting documentation
Reporting on opportunities and barriers
Locah Blog (tags: ‘opportunities’ ‘barriers’)
Feed into #JiscEXPO programme evidence gathering
More at:
http://blogs.ukoln.ac.uk/locah/2010/09/22/creating-linked-data-more-reflections-from-the-coal-face/
http://blogs.ukoln.ac.uk/locah/2010/12/01/assessing-linked-data
Creating the Visualisation Prototype
Based on researcher use cases
Data queried from the SPARQL endpoint
Use tools such as Simile, Many Eyes, Google Charts
For the first Hub visualisation, using Timemap – Google Maps and Simile
http://code.google.com/p/timemap/
Visualisation Prototype Using Timemap – Google Maps and Simile
http://code.google.com/p/timemap/
Early stages with this
Will give location and ‘extent’ of archive.
Will link through to Archives Hub
Sir Ernest Henry Shackleton
Archives related to Shackleton:
VIAF URL: http://viaf.org/viaf/12338195/
Biographical History:Ernest Henry Shackleton was born on 15 February 1874 in Kilkea, Ireland, one of six children of Anglo-Irish parents. The family moved from their farm to Dublin, where his father, Henry studied medicine. On qualifying in 1884, Henry took up a practice in south London, and between 1887 and 1890, Ernest was educated at Dulwich College. On leaving school, he entered the merchant service, serving in the square-rigged ship Hoghton Tower until 1894 when he transferred to tramp steamers. In 1896, he qualified as first mate, and two years later, was certified as master, joining the Union Castle line in 1899. [more]
http://archiveshub.ac.uk/data/gb15sirernesthenryshackleton
Books related to Shackleton:
The challenges
The learning process
Model the data, not the description
The description is one of the entities
Understand the importance of URIs
Think about your world before others
…but external links are important
Try to get to grips with terminology
Names
6947115KNAPPF: F Knapp associated with record 6947115
/id/agent/6947115KNAPPF
<copac:isCreatorOf rdf:resource="http://data.copac.ac.uk/id/mods/6947115"/>
6947115KNAPPF → <isCreatorOf> → 6947115
Index terms (names, subjects, places)
‘AssociatedWith’ as the relationship
Benefits of structured index terms
Use /person/ and /organisation/ in the URI
Distinguish /person/pilkington (the person) from /organisation/pilkington (the organisation)
Distinguish /place/reading/ from /subject/reading/
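The disambiguation rule above can be sketched as a URI-minting function; the base URL and path scheme here are illustrative, not the project's documented pattern.

```python
# Put the entity type in the URI path so that 'pilkington' the person and
# 'pilkington' the organisation, or 'reading' the place and 'reading' the
# subject, get distinct identifiers. Base URL and scheme are illustrative.
BASE = "http://data.archiveshub.ac.uk/id"

def mint_uri(entity_type, name):
    """Build a type-qualified URI for an index term."""
    return f"{BASE}/{entity_type}/{name.lower()}"

print(mint_uri("person", "Pilkington"))
print(mint_uri("organisation", "Pilkington"))
```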
Problems with source data
EAD very permissive: whole range of finding aids
Copac more consistent but still wide variety
Hub EAD: We limited the tags we worked with
Large files (around 5Mb) tend to need splitting up
Duplication of data
“So statements which relate things in the two documents must be repeated in each. This clearly is against the first rule of data storage: don't store the same data in two different places: you will have problems keeping it consistent.” (Tim Berners-Lee, www.w3.org/DesignIssues/LinkedData.html)
Archival Inheritance
“Do not repeat information at a lower level of description that has already been given at a higher level.” ISAD(G)
Many elements do not apply to ‘child’ descriptions
Simple rule of inheritance not always appropriate
LD does assert hierarchical relationships but no requirement to follow these links
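The ISAD(G) inheritance rule above can be sketched as a lookup that walks up the hierarchy until an ancestor supplies the missing value; the record structure is invented for illustration.

```python
# Invented fonds > series > item hierarchy. A value omitted at a lower level
# of description is found at the nearest ancestor that supplies it.
fonds = {"access_conditions": "Open for consultation", "parent": None}
series = {"parent": fonds}
item = {"title": "Letter, 1914", "parent": series}

def inherited(record, field):
    """Return the field from this record or its nearest ancestor, else None."""
    while record is not None:
        if field in record:
            return record[field]
        record = record.get("parent")
    return None

print(inherited(item, "access_conditions"))
```

Note that, as the slide says, a simple rule like this is not always appropriate: some elements genuinely do not apply to child descriptions.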
Copac
Larger community: more potential vocabularies/documentation/support/confusion/inconsistencies
Merged catalogues: a unique scenario
‘Creator’ and ‘Others’ (editor, authors, illustrator)
Learning from Hub / Doing what is appropriate
Usually not right or wrong answers
Copac model
Groundwork done with Archives Hub. Then had to decide what we wanted to say about the data
Challenges over what a ‘record’ is – ‘Bleak House’ from each contributor? or one merged record?
In many ways simpler than archival data; but also can decide to create a simpler model
Copac Model
Copac specification
Hard to start but proved to be very crucial
Very iterative process between spec and RDF output
Important to establish the structure of the spec (we used tabs for each ‘entity’)
Copac specification
Copac decisions
Where to create Copac URIs – copac:creator, copac:contributor, copac:heldBy
When to create URIs: Title = literal; Publication place = URI
How to deal with problematic/ambiguous data: Date? = productionDate
Issues
Risks
Can you rely on data sources long-term?
Persistence of persistent URIs?
New technologies
Investment of time – unsure of benefits
Licensing issues
Provenance
Track which data comes from our sources: URIs identify your entities
Linked Data tends towards disassembling
Copac/Hub as trusted sources…is DBPedia (for example) as reliable?
Contributors may want data to be identified
Issues around administrative/biographical history
Benefits of trust?
Users may want to know where data is from
Licensing
Nature of Linked Data: each triple as a piece of data
‘Ownership’ of data?
Data often already freely available (M2M interfaces)
Licensing
Public Domain Licences: simple, explicit, and permit widest possible reuse. Waive all rights to the data
BL: British National Bibliography uses a public domain licence
Limit commercial uses?
Build in community norms: attribution, share alike - to reinforce desire for acknowledgement
Legal situation?
Thank You
Sections of this presentation adapted from materials created by other members of the LOCAH Project
This presentation is available under Creative Commons Non Commercial-Share Alike:
http://creativecommons.org/licenses/by-nc/2.0/uk/
Attribution and CC licence