Porting Library Vocabularies to the Semantic Web - IFLA 2010


Upload: bernard-vatant

Post on 11-May-2015


DESCRIPTION

Presentation at IFLA 2010

TRANSCRIPT

Page 1: Porting Library Vocabularies to the Semantic Web - IFLA 2010

making sense of content TM

2010-08-15 Session 149

Information Technology, Cataloguing, Classification and Indexing with Knowledge Management

Porting library vocabularies to the Semantic Web, and back

A win-win round trip

[email protected]


Page 2

Summary

1. Libraries and the Web have a twenty-year affair of love and hate, which should now come of age.

2. The role of vocabularies is critical in this affair.

3. The Linked Data architecture should leverage proven heritage vocabularies instead of reinventing them.

4. Specific features of library vocabularies make them more or less portable and useful to the Semantic Web.

5. To-do list and guidelines for vocabulary audit and publication.

6. Feedback from Semantic Web tools: helping vocabulary management.

Page 3

Libraries and the Web, a love and hate story

Libraries have been around for thousands of years

The Web is barely in its twenties

Webbies have been claiming the Web was bound to become the Universal Library
- Bottom line: traditional libraries are obsolete

Librarians have been claiming the Web is a mess and will never improve
- Bottom line: keep using libraries for serious stuff

But they look at each other with fascination
- The Web: if only we could be as efficient as libraries at classification and indexing
- Libraries: if only we could scale to the size of the Web, and be as user-friendly

They are bound to be married at the end of the day!

Page 4

What is there? Everything!

In libraries
- Resources (aka records) and catalogues
- Authority lists (aka descriptions of « real-world entities »)
- Subject headings, thesauri, classification schemes (aka vocabularies)
- Metadata linking resources to entities and subjects

On the Web
- Resources (aka web pages) without a catalogue
- Descriptions of real-world entities (without authority lists)
  - Examples: Wikipedia, Facebook
- Profusion of vocabularies, but without general schemes
- Often called « taxonomies », handcrafted for user navigation
- More and more metadata based on the RDF family of standards

Vocabularies are the missing pieces of the Semantic Web
- Libraries are the natural providers!

Page 5

Web vs Libraries (1.0 view)

| | Library | The Web |
|---|---|---|
| scope | local, focused | global |
| size | manageable | unknown, organic growth |
| content | controlled | unknown, organic growth |
| type of content | controlled | unknown |
| methods | polished, proven | experimental, wild |
| organization of content | native, vocabulary-based | local, quality unknown |
| classification | native, vocabulary-based | local, quality unknown |
| search and retrieval | vocabulary-based | search algorithms |

Page 6

Attempts at organizing the Web #fail

Directories
- Could not cope with the scale of Web growth
- Were often built by amateurs in classification and vocabulary management
- Were biased by the commercial use of the Web

Vocabularies
- Open Directory categories
- Wikipedia categories
- Globally messy, organic growth

Metadata in html <head>
- Spammed, not in sync with the content
- Ignored by most search engines now

Bottom line: The Web is not and will never be a Global Library

Page 7

What the Web is good at

Creating representations of « things »
- Wikipedia pages
- Facebook pages
- Pages for products, species, places …

Providing standard identifiers (URIs) associated with an access protocol (HTTP)
- The identity of things is encapsulated in resource URIs

Linking things together
- Via the HTTP protocol, hypertext, etc.

The Semantic Web is just an extension of the Web
- Leveraging all the above features
- Making the semantics of URIs and descriptions explicit
- Allowing better, less ambiguous access to resources

Page 8

The Semantic Web in perspective

INTERNET (ca. 1970)
- Network of identified, connected and addressable computers
  - Similar to the library infrastructure level: buildings, rooms, shelves …
  - Technical support: IP addresses

WEB 1.0 (ca. 1990)
- Network of identified, connected and addressable resources
  - Similar to the library resources level: books, documents …
  - Technical support: URLs, HTTP

Semantic Web (ca. 2010)
- Network of identified, connected and addressable concepts
  - Similar to the library vocabulary level: thesaurus, classification, authority lists
  - Technical support: URIs, RDF, content negotiation

Page 9

Vocabularies as Core Data

Definition of « Core Data » (Hannemann & Kett, 2010)
- Stable and reliable
- Persistent nodes with a strict, transparent policy: data provenance, no deletions, versioning
- Maintained or backed by trusted public organizations
- Standards-based

Library vocabularies are just that!
- Or at least they should play this role

Vocabularies can be used on the Web as in libraries
- Despite the difference in scope and size

Based on shared metadata standards
- That’s where the Semantic Web comes in

Page 10

A roadmap for Semantic Web migration

Audit
- Choose the vocabularies which are worth publishing
- Sort out terms, concepts and things
- Make explicit the semantics of your specific syntactic constructs
- Check the actual transitivity of your hierarchies
- Figure out the translation into vocabularies/ontologies popular on the Web

Make ready for publication
- Package by domains
- Define a strict URI policy, including versioning
- Map to other vocabularies

Integration
- Expose and promote your Vocabulary as a Service
  - Using dereferenceable URIs and SPARQL endpoints
- Use Semantic Web software for vocabulary management
  - To ensure native standard conformance and logical consistency

Page 11

Semantic audit of vocabularies?

Publication for the Semantic Web can be a painful process

Not only technically (formats etc) but conceptually

Making semantics explicit often shows clearly where you’ve gone wrong

But it’s a healthy process anyway

Better to do most of the audit before any publication on the Web

But publishing early can trigger useful Web community feedback …

Page 12

Which vocabularies are worth it?

Named entities, authority files
- A growing number of entities already defined in the Linked Data Cloud …
  - 3.4 million « things » identified and described at dbpedia.org
  - Over 7 million « features » identified and described at geonames.org
  - http://www.freebase.com/view/people/views/person 1,687,119 entries and counting
  - http://www.freebase.com/view/en/victor_hugo
  - http://dbpedia.org/resource/Victor_Hugo
- Consider whether duplicate efforts are worth it
- Should you throw away your yet-another-Victor-Hugo entry?
- No, but link to other descriptions in the Cloud (based on http URIs) and keep existing identifiers for retro-compatibility

Taxonomies, subject headings, classifications
- That’s where library heritage is strong and the Web is weak
- Such vocabularies can structure the web of data as they do for libraries
- Their publication should be a priority!
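
As a minimal sketch of the "link rather than throw away" advice above, the following Python builds N-Triples statements that keep a legacy identifier on a local authority entry and link that entry to an existing description in the Cloud. The local URI, the legacy identifier value, and the choice of `owl:sameAs` and `dcterms:identifier` as properties are assumptions made for illustration; the slides do not prescribe them, and a weaker mapping property may be more appropriate in practice.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
DCTERMS_IDENTIFIER = "http://purl.org/dc/terms/identifier"


def link_triples(local_uri, legacy_id, external_uris):
    """Return N-Triples lines linking a local entry to external descriptions."""
    # Keep the legacy identifier as a plain literal for retro-compatibility.
    lines = [f'<{local_uri}> <{DCTERMS_IDENTIFIER}> "{legacy_id}" .']
    # One link per external description of the same entity.
    for ext in external_uris:
        lines.append(f"<{local_uri}> <{OWL_SAME_AS}> <{ext}> .")
    return lines


triples = link_triples(
    "http://example.org/authorities/victor_hugo",  # hypothetical local URI
    "AUTH-000123",                                 # hypothetical legacy identifier
    ["http://dbpedia.org/resource/Victor_Hugo"],
)
print("\n".join(triples))
```

The local entry stays the authoritative record; the extra statements only declare where other descriptions of the same entity live.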

Page 13

Sort out Terms, Concepts and Things

Terms are denotations for concepts
- In a given language
- If possible qualified by vocabulary specialists

Concepts are specific representations of « things »
- In a certain view of the world
- With a specific functional purpose in mind

Things are ... just things
- What users care about at the end of the day (people, places, products …)

Terms, Concepts and Things should all be first-class citizens in the Semantic Web
- Switching from a term-centric to a concept-centric view
  - That’s what SKOS and ISO 25964 … are all about
- Does not mean that terms and terminology are out of the picture!
- They simply need to be defined and managed at a different level

Page 14

Building on SKOS and extensions

[Diagram: a Term (skos-xl:Label) denotes a Concept (skos:Concept), attached via skos-xl:prefLabel; a Concept represents a Thing (owl:Thing), attached via foaf:focus *. Examples of Thing classes: geo:Feature, foaf:Person, time:TemporalEntity.]

* Under discussion
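
The three-level model on this slide can be sketched in a few lines of Python that emit N-Triples: a term as a SKOS-XL label resource, a concept pointing to it, and a link from the concept to the thing it represents. All `example.org` URIs and the sample label are hypothetical, `foaf:focus` is used only because the slide names it (and flags it as under discussion), and literals are written without escaping to keep the sketch short.

```python
from dataclasses import dataclass

SKOS = "http://www.w3.org/2004/02/skos/core#"
SKOSXL = "http://www.w3.org/2008/05/skos-xl#"
FOAF = "http://xmlns.com/foaf/0.1/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class Term:
    uri: str      # the term is a resource of its own, not just a string
    literal: str
    lang: str


@dataclass
class Concept:
    uri: str
    pref_term: Term
    focus_uri: str  # the "thing" this concept represents


def to_ntriples(concept):
    """Serialize one concept, its preferred term and its focus as N-Triples."""
    term = concept.pref_term
    return [
        f"<{concept.uri}> <{RDF_TYPE}> <{SKOS}Concept> .",
        f"<{term.uri}> <{RDF_TYPE}> <{SKOSXL}Label> .",
        f'<{term.uri}> <{SKOSXL}literalForm> "{term.literal}"@{term.lang} .',
        f"<{concept.uri}> <{SKOSXL}prefLabel> <{term.uri}> .",
        f"<{concept.uri}> <{FOAF}focus> <{concept.focus_uri}> .",
    ]


hugo = Concept(
    uri="http://example.org/concepts/victor_hugo",          # hypothetical
    pref_term=Term("http://example.org/terms/victor_hugo_fr", "Hugo, Victor", "fr"),
    focus_uri="http://dbpedia.org/resource/Victor_Hugo",
)
lines = to_ntriples(hugo)
print("\n".join(lines))
```

Because the term has its own URI, it can carry its own metadata (language, qualification by a specialist) without touching the concept.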

Page 15

Make the semantics of your syntax explicit

<http://id.loc.gov/authorities/sh00000562> a skos:Concept ;
    skos:prefLabel "Environmental justice--Religious aspects--Buddhism, [Christianity, etc.]" .

What does such an aggregation mean?
- Does « -- » have the same semantics in all subject headings in LCSH?
- If yes, which one?
- Same questions for [ ]

How does this concept link to its components?
- Currently it does not, although they are defined elsewhere in LCSH
  - http://id.loc.gov/authorities/sh97002483 : Environmental justice
  - http://id.loc.gov/authorities/sh00000564 : Environmental justice--Religious aspects
  - http://id.loc.gov/authorities/sh85017454 : Buddhism

Making the link between the above concepts explicit would definitely add value!
- To do: figure out how (using flavors of skos:semanticRelation)
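
One possible way to approach this to-do item is sketched below: split the pre-coordinated label on « -- », derive candidate component labels, and look them up among headings already published. The lookup table contains only the three headings cited on the slide. The property name `ex:hasComponent` is purely hypothetical, since the slide leaves the choice of relation open, and the handling of the bracketed part is an assumption about what [ ] means.

```python
import re

# The three component headings cited on the slide.
KNOWN_HEADINGS = {
    "Environmental justice": "http://id.loc.gov/authorities/sh97002483",
    "Environmental justice--Religious aspects": "http://id.loc.gov/authorities/sh00000564",
    "Buddhism": "http://id.loc.gov/authorities/sh85017454",
}


def candidate_components(label):
    """Return candidate component labels for a '--'-delimited heading."""
    segments = label.split("--")
    candidates = []
    # Every proper prefix of the heading may itself be a heading.
    for i in range(1, len(segments)):
        candidates.append("--".join(segments[:i]))
    # Each segment on its own, with any bracketed part dropped (assumption).
    for seg in segments:
        cleaned = re.sub(r"\[.*?\]", "", seg).strip(" ,")
        if cleaned and cleaned not in candidates:
            candidates.append(cleaned)
    return candidates


def component_links(heading_uri, label, known=KNOWN_HEADINGS):
    """Return (subject, property, object) links to components found in `known`."""
    return [
        (heading_uri, "ex:hasComponent", known[c])  # hypothetical property
        for c in candidate_components(label)
        if c in known
    ]


links = component_links(
    "http://id.loc.gov/authorities/sh00000562",
    "Environmental justice--Religious aspects--Buddhism, [Christianity, etc.]",
)
for link in links:
    print(link)
```

String splitting only proposes candidates; whether each link is semantically right is exactly the audit question raised above.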

Page 16

Make sense of hierarchies

From LCSH hierarchy
Auxiliary sciences of history
. Civilization
.. Learning and Scholarship
... Humanities
.... Philosophy
..... Psychology
...... Attention
....... Listening
........ Eavesdropping
......... Wiretapping

Kind of semantic drift all the way down
- Every local relation makes sense, globally it’s weird if transitivity applies
- But most automatic systems will rely on transitivity as a default feature

Either fix it, or specify that the hierarchy is not transitive
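
The effect of assuming transitivity can be shown with a short Python sketch over the chain above. It stores each direct "broader" link and then walks upward, which is what a system treating the hierarchy as transitive effectively does. The single-parent dictionary is a simplification for illustration, not a claim about how LCSH is structured. In SKOS terms, `skos:broader` is not declared transitive, while `skos:broaderTransitive` is.

```python
# Direct "broader" links taken from the hierarchy on the slide
# (one parent per concept, as a simplification).
BROADER = {
    "Wiretapping": "Eavesdropping",
    "Eavesdropping": "Listening",
    "Listening": "Attention",
    "Attention": "Psychology",
    "Psychology": "Philosophy",
    "Philosophy": "Humanities",
    "Humanities": "Learning and Scholarship",
    "Learning and Scholarship": "Civilization",
    "Civilization": "Auxiliary sciences of history",
}


def ancestors(concept, broader=BROADER):
    """Return all ancestors reached by following direct links transitively."""
    chain = []
    seen = {concept}
    current = concept
    while current in broader:
        current = broader[current]
        if current in seen:  # guard against accidental cycles
            break
        seen.add(current)
        chain.append(current)
    return chain


inferred = ancestors("Wiretapping")
print(inferred[0])    # the direct, locally sensible parent
print(inferred[-1])   # what a transitive reading also implies
```

Auditing then amounts to reviewing the inferred ancestors, not just the direct links, and deciding whether each one is acceptable.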

Page 17

Consider how the vocabulary will be used

The Web is an open world
- Whatever is not explicitly forbidden is allowed
- Whereas in closed library practice, whatever is not explicitly allowed is forbidden
- So be prepared for all sorts of misuse if you leave room for interpretation

Package by domains
- Web users will be happy to integrate hundreds of concepts rather than millions
- Small, focused vocabularies are more reusable than general ones
- The wider the scope, the more room for ambiguity!

Package by versions
- Versioning at vocabulary level and/or at concept level (open issue)
  - Should a concept keep the same URI in successive versions?
  - When has a concept changed enough to be replaced by a different one?
- Never delete a concept, deprecate it if necessary
  - Using e.g. dcterms:isReplacedBy
- Concepts have a life cycle, but cool URIs don’t change
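
A minimal sketch of the "deprecate, never delete" rule: deprecated concept URIs stay resolvable, and a consumer follows the replacement links to reach the current concept. The mapping below stands in for `dcterms:isReplacedBy` statements; the URIs are hypothetical and the chain-following policy is an assumption, not something the slides specify.

```python
# Each entry stands for: <old> dcterms:isReplacedBy <new>  (hypothetical URIs)
REPLACED_BY = {
    "http://example.org/vocab/c1": "http://example.org/vocab/c7",
    "http://example.org/vocab/c7": "http://example.org/vocab/c9",
}


def resolve_current(uri, replaced_by=REPLACED_BY):
    """Follow replacement links until a non-deprecated concept is reached."""
    seen = set()
    while uri in replaced_by:
        if uri in seen:
            # A cycle means the deprecation data itself is inconsistent.
            raise ValueError(f"replacement cycle at {uri}")
        seen.add(uri)
        uri = replaced_by[uri]
    return uri


print(resolve_current("http://example.org/vocab/c1"))
print(resolve_current("http://example.org/vocab/c2"))  # never deprecated
```

Old records that still cite a deprecated URI keep working, which is the practical meaning of "cool URIs don’t change".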

Page 18

Mapping to other vocabularies

A most important but still open area
- Dealing with hard notions like identity, similarity, sameness …

Tools to help with alignment are emerging
- See e.g. ONAGUI http://sourceforge.net/projects/onagui/
- Work in progress (TAE project)
  - More background in ontology mapping than vocabulary mapping

Mapping at the concept level seems to make sense
- SKOS provides basic vocabulary for simple mapping
- But no provision for mapping a simple concept to an aggregate
- In particular no boolean operators (Actor AND Musician vs Actor OR Musician)

Alignment of « things » is a contentious area
- See various debates on use and abuse of owl:sameAs
  - http://events.linkeddata.org/ldow2010/papers/ldow2010_paper09.pdf
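
The limitation on aggregates can be illustrated with a small Python sketch of concept-level mappings held as plain triples. The SKOS mapping properties used (`skos:exactMatch`, `skos:broadMatch`) exist in SKOS; the `local:` and `other:` concept names are hypothetical. Note how a combined concept can only be mapped to each target separately, so the data cannot say whether the intended meaning is AND or OR.

```python
# (source concept, SKOS mapping property, target concept); names are hypothetical.
MAPPINGS = [
    ("local:Comedians", "skos:exactMatch", "other:Comedian"),
    ("local:ActorMusicians", "skos:broadMatch", "other:Actor"),
    ("local:ActorMusicians", "skos:broadMatch", "other:Musician"),
]


def mapped_targets(concept, mappings=MAPPINGS):
    """Group the mapping targets of one concept by mapping property."""
    result = {}
    for source, relation, target in mappings:
        if source == concept:
            result.setdefault(relation, []).append(target)
    return result


print(mapped_targets("local:Comedians"))
# Two independent broadMatch links: nothing states "Actor AND Musician".
print(mapped_targets("local:ActorMusicians"))
```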

Page 19

Ready to publish?

Follow recommended (best) practices
- See e.g. http://www.w3.org/TR/swbp-vocab-pub/

Provide usable packaging (see above, particularly if the vocabulary is large)
- A 500 MB dump in one single SKOS file is not the most manageable form!

Choose and expose a clear licensing policy
- Using e.g. the Creative Commons license model

Make all concept URIs dereferenceable
- Using content negotiation: one description for machines, one for humans

Publish a vocabulary description
- For humans and for machines (using e.g. SIOC)

Publish your own use of the vocabulary
- In the metadata of your records
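
Content negotiation for a concept URI can be sketched as a function that reads the HTTP `Accept` header and picks one representation: HTML for humans, an RDF serialization for machines. The list of supported media types, the HTML default, and the simplified header parsing are assumptions for illustration; a production server would normally rely on its web framework for this and might also redirect to a separate document URL.

```python
# Representations this hypothetical vocabulary server can produce.
SUPPORTED = ("text/html", "application/rdf+xml", "text/turtle")


def parse_accept(header):
    """Parse an Accept header into (media_type, q, position), best first."""
    prefs = []
    for position, part in enumerate(header.split(",")):
        fields = [f.strip() for f in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in fields[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_type, q, position))
    # Highest q first; earlier position in the header breaks ties.
    return sorted(prefs, key=lambda p: (-p[1], p[2]))


def choose_representation(accept_header, default="text/html"):
    """Return the media type to serve for a concept URI."""
    for media_type, q, _ in parse_accept(accept_header):
        if q <= 0:
            continue
        if media_type in SUPPORTED:
            return media_type
        if media_type == "*/*":
            return default
    return default


print(choose_representation("text/turtle"))
print(choose_representation("text/html,application/xhtml+xml,*/*;q=0.8"))
```

The same URI therefore serves a browser and a Linked Data client, each getting the description it can use.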

Page 20

Spread the word, join the community!

The Web is an open collaborative enterprise …

Push what you’ve done outside the library world
- Join linked data and Semantic Web forums and working groups
- Be ready to respond to feedback

The Semantic Web will be a Social Web
- Watch how your vocabulary is adopted, or not, by social applications
- Facebook, Twitter and the like …

And of course look out for further developments at LLD XG
- http://www.w3.org/2005/Incubator/lld/

Page 21

Put Semantic Web software in the back-office

Publication of linked data can be done from any information system
- It can be dealt with as yet another publication issue …

But it’s simpler if semantic formats are handled natively in the back-office
- Supporting import/export of vocabularies in the system in standard formats
  - Native RDF and SKOS integration
- Supporting semantic queries in the back-office
- Supporting sanity checking, inference rules

… And making librarians natively at ease with semantic technologies!
- Not the least part of it

The technology is mature, software is on the market …
- Time to think about it!
- So, just ask

Page 22

Thanks for your attention

Questions?