Unlocking Taxonomic Literature II Using Linked Open Data
DESCRIPTION
The Smithsonian Libraries has digitized Taxonomic Literature II, an essential research tool for botanists. This presentation, with audio, starts with a description of Linked Data and a history of TL-2, then covers some of the methods and challenges we are encountering as we convert it to a digital version and Linked Open Data.
TRANSCRIPT
Joel Richard, Smithsonian Libraries
Unlocking Taxonomic Literature II
using Linked Open Data
• What is Linked Open Data / The Semantic Web?
• Where can I see LOD in use?
• What is Taxonomic Literature II?
• How is it being converted to LOD?
• Did we encounter any challenges?
Agenda
Linked data (from Wikipedia, the free encyclopedia)
A method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies … [and] extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.
What is Linked Open Data?
http://en.wikipedia.org/wiki/Linked_Open_Data
What is the Semantic Web?
Semantic Web (from Wikipedia, the free encyclopedia)
A movement led by the World Wide Web Consortium… to promote common data formats on the Web.
By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web dominated by unstructured and semi-structured documents into a "web of data".
"The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."
http://en.wikipedia.org/wiki/Semantic_Web
Five Stars of Linked Open Data
★ Available on the web (in any format) but with an open license, to be Open Data.
★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).
★★★ As (2) plus a non-proprietary format (e.g. CSV instead of Microsoft Excel).
★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.
★★★★★ All the above, plus: link your data to other people’s data to provide context.
What is Linked Open Data?
http://www.w3.org/DesignIssues/LinkedData.html
What is Linked Open Data?
Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/
What is Linked Open Data?
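Diagram: statements about Charles Darwin drawn as a graph. Each statement is a triple:
Identifier (subject) | Predicate (verb/relationship) | Identifier / Value (object)
Charles Darwin | Born On | “Feb 12, 1809”
Charles Darwin | Born In | Shrewsbury
Charles Darwin | Type | Person
Charles Darwin | Author Of | On the Origin of Species
Shrewsbury | Type | City
Shrewsbury | Is In | England
England | Type | Country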
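The subject, predicate, object pattern on this slide can be sketched in a few lines of Python. This is only an illustration: the plain strings stand in for real identifiers, and the helper `objects_of` is a hypothetical name, not part of any library.

```python
# Each statement is a (subject, predicate, object) triple, as in the diagram.
triples = [
    ("Charles Darwin", "Born On", "Feb 12, 1809"),
    ("Charles Darwin", "Born In", "Shrewsbury"),
    ("Charles Darwin", "Type", "Person"),
    ("Charles Darwin", "Author Of", "On the Origin of Species"),
    ("Shrewsbury", "Type", "City"),
    ("Shrewsbury", "Is In", "England"),
    ("England", "Type", "Country"),
]


def objects_of(subject, predicate, statements):
    """Return every object linked to `subject` by `predicate`."""
    return [o for s, p, o in statements if s == subject and p == predicate]


# Follow two links in a row: where was Darwin born, and where is that place?
birthplace = objects_of("Charles Darwin", "Born In", triples)[0]
country = objects_of(birthplace, "Is In", triples)[0]
print(birthplace, "is in", country)
```

Because the object of one statement can be the subject of another, following links is just repeated lookup, which is what makes the data a graph rather than a table.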
Tim Berners-Lee outlined four principles for linked open data:
1. Use URIs to denote things.
2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.
3. Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF and SPARQL.
4. Include links to other related things (using their URIs) when publishing data on the Web.
What is Linked Open Data?
http://www.w3.org/DesignIssues/LinkedData.html
http://5stardata.info/
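Principles 2 and 3 can be illustrated with a small sketch of what "dereferencing" looks like from a program. It assumes the server supports HTTP content negotiation and will return Turtle when asked for `text/turtle`; the function name `build_lookup` is hypothetical. The request is only built here, not sent, so the sketch runs without network access.

```python
from urllib.request import Request


def build_lookup(uri, media_type="text/turtle"):
    """Build a GET request that asks for a machine-readable description."""
    # The Accept header tells the server which format the client prefers.
    return Request(uri, headers={"Accept": media_type})


req = build_lookup("http://dbpedia.org/resource/Charles_Darwin")
print(req.get_method(), req.full_url, req.get_header("Accept"))

# To actually dereference the URI (requires network access):
#   from urllib.request import urlopen
#   body = urlopen(req).read()
```

A browser asking for HTML and a script asking for RDF can use the same URI; only the Accept header differs.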
What is Linked Open Data?
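Diagram: the same graph with DBpedia URIs as identifiers. Each statement is a triple:
Identifier | Predicate | Identifier / Value
http://dbpedia.org/resource/Charles_Darwin | Born On | “Feb 12, 1809”
http://dbpedia.org/resource/Charles_Darwin | Born In | http://dbpedia.org/resource/Shrewsbury
http://dbpedia.org/resource/Charles_Darwin | Type | Person
http://dbpedia.org/resource/Charles_Darwin | Author Of | http://dbpedia.org/resource/On_the_Origin_of_Species
http://dbpedia.org/resource/Shrewsbury | Type | City
http://dbpedia.org/resource/Shrewsbury | Is In | http://dbpedia.org/resource/United_Kingdom
http://dbpedia.org/resource/United_Kingdom | Type | Country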
What is Linked Open Data?
Predicate Vocabularies
• Dublin Core – General Metadata for Discovery
• SKOS – Simple Knowledge Organization System
• BIBO – Bibliographic Ontology
• BIO – Biographical
• FOAF – Friend of a Friend
• Events…
• Geographic…
• Many others!
• OWL – Web Ontology Language
What is Linked Open Data?
Mondeca Labs
Linked Open Vocabularies (LOV)
Vocabulary of a Friend (VOAF)
A vocabulary for describing other vocabularies
http://labs.mondeca.com/dataset/lov
What is Linked Open Data?
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .
<http://dbpedia.org/resource/Charles_Darwin>
    rdf:type <http://xmlns.com/foaf/0.1/Person> ;
    rdf:type <http://dbpedia.org/ontology/Scientist> ;
    foaf:name "Charles Darwin" ;
    foaf:depiction "http://upload.wikimedia.org/…/Charles_Darwin_seated_crop.jpg" ;
    dbpedia-owl:field <http://dbpedia.org/resource/Natural_history> ;
    dbpprop:placeOfBirth "Mount House, Shrewsbury, Shropshire, England" ;
    dbpedia-owl:birthDate "1809-02-12" ;
    dbpedia-owl:birthPlace <http://dbpedia.org/resource/Shrewsbury> ;
    dbpedia-owl:deathDate "1882-04-19" ;
    dbpedia-owl:deathPlace <http://dbpedia.org/resource/Down_House> ;
    dbpprop:awards <http://dbpedia.org/resource/Royal_Medal> .
What is Linked Open Data?
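The `@prefix` lines in the example are shorthand: a name such as `foaf:name` stands for a full URI. A minimal sketch of that expansion follows, using the four prefixes from the slide; the function name `expand` is hypothetical, and a real RDF parser does considerably more than this.

```python
# Prefix table copied from the @prefix declarations on the slide.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
    "dbpprop": "http://dbpedia.org/property/",
}


def expand(prefixed_name, prefixes=PREFIXES):
    """Turn a prefixed name such as 'foaf:name' into its full URI."""
    prefix, _, local_part = prefixed_name.partition(":")
    # An unknown prefix raises KeyError, which is the desired failure here.
    return prefixes[prefix] + local_part


for name in ("rdf:type", "foaf:name", "dbpedia-owl:birthDate"):
    print(name, "->", expand(name))
```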
Benefits of Linked Open Data
• Disambiguation
• Connecting Relevant Content
• More visibility via Search
• Enrichment of your data
• Easier reuse of data
Linked Open Data in Use
Google Knowledge Graph
Linked Open Data in Use
Library of Congress: Linked Data Services http://id.loc.gov/
Schema.org http://www.schema.org
Data.gov / Semantic http://www.data.gov/semantic
Linked Data.org http://linkeddata.org/
Stephen Dale: Linked Data in Action http://www.slideshare.net/stephendale/linked-data-in-action-4487244
Other LOD Examples and Information
Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types. (Stafleu et al.)
Essential Reference Tool for Botanists
Authors and their Publications from 1753 to 1940
It is a “database in book form.”
Taxonomic Literature II
Scanned the pages.
Uploaded to the Internet Archive.
Hired contractor for OCR and correction (99.97% accuracy.)
Received XML dataset from Contractor.
Verified and Imported to SQL Server Database.
Built a website to search the data.
Taxonomic Literature II
Taxonomic Literature II
First...what does 99.97% accuracy mean?
Taxonomic Literature II
~12,000 Errors
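The jump from 99.97% to roughly 12,000 errors is simple arithmetic. The sketch below works backwards from the two numbers on the slides, assuming the accuracy figure is measured per character (the slides do not say which unit was used); `implied_total` is a hypothetical helper name.

```python
from decimal import Decimal


def implied_total(errors, accuracy_percent):
    """Units of text implied by an error count and an accuracy percentage."""
    # Decimal avoids floating-point noise in 100 - 99.97.
    error_rate = (Decimal(100) - Decimal(accuracy_percent)) / Decimal(100)
    return int(Decimal(errors) / error_rate)


total = implied_total(12_000, "99.97")
print(f"{total:,} units of text at 99.97% accuracy leave about 12,000 errors")
```

Under that assumption the corpus is on the order of 40 million characters, so even a very small error rate leaves thousands of mistakes to find.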
1. Select Identifiers for our data
http://library.si.edu/digital-library/tl-2/author/darwin
http://library.si.edu/digital-library/tl-2/title/origin_of_species
http://library.si.edu/digital-library/tl-2/title/1313
2. Choose vocabularies for predicates (harder than it sounds)
OWL, FOAF, Dublin Core, OpenGraph, SIOC, SKOS, BIBO, etc.
3. Create Links to other data sources on the web.
Taxonomic Literature II
Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/author/darwin>
tl2:creator <http://library.si.edu/tl2/title/1313>
owl:sameAs <http://viaf.org/viaf/27063124>
<http://library.si.edu/tl2/title/1313>
dc:creator <http://library.si.edu/tl2/author/darwin>
owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
owl:sameAs <http://www.worldcat.org/oclc/425919213>
Select Identifiers
Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/author/darwin>
rdf:type <http://xmlns.com/foaf/0.1/Person>
foaf:lastName “Darwin”
foaf:familyName “Darwin”
foaf:firstName “Charles”
foaf:givenName “Charles”
foaf:name “Darwin, Charles Robert”
skos:prefLabel “Darwin, Charles Robert”
bio:birth “1809”
bio:death “1882”
skos:definition “British evolutionary biologist”
tl2:personAbbreviation “Darwin”
Select Identifiers: Authors
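To make the author description concrete, here is a minimal sketch that writes a few of these statements out as N-Triples, a line-based W3C serialization of RDF. It is an illustration only, not the project's export code: the `statements` list covers just part of the slide, and `to_ntriple` is a hypothetical helper that handles only plain string literals.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
FOAF = "http://xmlns.com/foaf/0.1/"
SKOS = "http://www.w3.org/2004/02/skos/core#"

author = "http://library.si.edu/tl2/author/darwin"

# Objects are tagged as either a URI or a literal string.
statements = [
    (author, RDF_TYPE, ("uri", FOAF + "Person")),
    (author, FOAF + "familyName", ("literal", "Darwin")),
    (author, FOAF + "givenName", ("literal", "Charles")),
    (author, FOAF + "name", ("literal", "Darwin, Charles Robert")),
    (author, SKOS + "prefLabel", ("literal", "Darwin, Charles Robert")),
]


def to_ntriple(subject, predicate, obj):
    """Format one statement as a single N-Triples line."""
    kind, value = obj
    if kind == "uri":
        rendered = f"<{value}>"
    else:
        # Escape backslashes and double quotes inside the literal.
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        rendered = f'"{escaped}"'
    return f"<{subject}> <{predicate}> {rendered} ."


for statement in statements:
    print(to_ntriple(*statement))
```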
Taxonomic Literature II as Linked Data
<http://library.si.edu/tl2/book/1313>
rdf:type <http://purl.org/ontology/bibo/Book>
tl2:titleNumber “1313”
tl2:titleAbbreviation “Origin sp.”
tl2:shortTitle “On the origin of species”
dc:title “On the origin of species by means of natural selection, or the preservation of favoured races in the...”
dc:publisher “John Murray”
event:place “London”
dc:created “1859”
Select Vocabularies: Publications
Taxonomic Literature II as Linked Data
Linking: Author Names
Used a combination of OpenRefine and LODRefine as well as custom code.
Results: Mixed
• Matched 15-20% of the names in our sample set
• Some names weren’t high in the list and required a human touch
Conclusion: Computer code needs to be improved with the aim of minimizing the amount of staff or volunteer time spent matching names.
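One way to picture the "human touch" problem is a matcher that accepts only confident matches and flags the rest for review. The sketch below uses Python's standard `difflib` for string similarity; it is not the OpenRefine/LODRefine workflow described above, and the candidate list, the 0.85 threshold, and the name `best_match` are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def best_match(name, candidates, threshold=0.85):
    """Return (best_candidate, needs_review) for one author name."""
    scored = sorted(
        (
            (SequenceMatcher(None, name.lower(), c.lower()).ratio(), c)
            for c in candidates
        ),
        reverse=True,
    )
    if not scored:
        return None, True  # nothing to compare against: a person must decide
    score, candidate = scored[0]
    # Below the threshold the top hit is only a suggestion for a reviewer.
    return candidate, score < threshold


# Hypothetical candidate names, as might come back from an external source.
candidates = ["Darwin, Erasmus", "Darwin, Charles Robert", "Darwin, Francis"]
print(best_match("Darwin, Charles Robert", candidates))
print(best_match("Zzz", candidates))
```

Raising the threshold trades fewer wrong links for more names sent to staff or volunteers, which is the balance the conclusion above is about.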
Taxonomic Literature II as Linked Data
Charles Darwin (from dbpedia.org)
Taxonomic Literature II as Linked Data
Linking: Herbaria
Used computer code to split the herbarium names and identify them in data provided by the Biodiversity Collections Index.
Results: Good
• Matched 95+% of the herbarium names in all of TL-2
• Careful attention to “A” which is an herbarium, but also starts some sentences in the HERBARIUM and TYPES blocks
Conclusion: These will be added to TL-2 when it launches as LOD.
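The “A” problem can be sketched as follows. This is a guess at the general shape of such code, not the project's actual parser: the set of known codes is a tiny illustrative subset, the sample sentences are invented, and the rule used (treat “A” as the English article when a lowercase word follows it) is an assumption.

```python
import re

# Illustrative subset of herbarium codes; a real run would load the full list.
KNOWN_CODES = {"A", "BM", "K", "MO", "US"}


def find_herbarium_codes(text, known_codes=KNOWN_CODES):
    """Return known herbarium codes found in `text`, in order of appearance."""
    found = []
    for match in re.finditer(r"\b[A-Z]{1,6}\b", text):
        code = match.group()
        if code not in known_codes:
            continue
        # "A" followed by a lowercase word reads as the article, not the code.
        if code == "A" and re.match(r"\s+[a-z]", text[match.end():]):
            continue
        found.append(code)
    return found


print(find_herbarium_codes("A few specimens at BM, K and MO."))
print(find_herbarium_codes("Herbarium: A, with duplicates at US."))
```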
Taxonomic Literature II
Missouri Botanical Garden Herbarium (From the Biodiversity Collections Index)
Lsid: urn:lsid:biocol.org:col:15859
Name: Missouri Botanical Garden Herbarium
Code: MO
Kind: Herbarium
Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009.
Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . .
Size: 6,150,000
Founded Year: 1859
Web Site: http://www.mobot.org/
Location Street: P.O. Box 299
Location City: Saint Louis
Location State: Missouri
Location Postcode: 63166-0299
Location Country Iso: US
http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859
Taxonomic Literature II as LOD
How are we going to store all this?
We’re using Drupal, which automatically embeds some Linked Open Data elements in the webpage.
Probably not a good idea for very large datasets.
TL-2 = 10,000 authors + 37,000 titles (about 400,000 triples, but growing)
TL-2 and LOD Challenges
Performance of Drupal Import:
Feeds Import: 7 hours for 35,000 “Records” or Drupal nodes
Other options? Still searching…
Our linked data set will grow to at least 600-700k Drupal nodes.
Is Drupal the best way to do this?
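To see why the import speed matters, the measured rate can be projected onto the expected dataset size. The sketch assumes import time scales linearly with the number of nodes, which the slides do not claim and which may not hold in practice; `hours_at_same_rate` is a hypothetical helper.

```python
def hours_at_same_rate(target_nodes, sample_nodes=35_000, sample_hours=7):
    """Project import time for `target_nodes` from the measured sample."""
    nodes_per_hour = sample_nodes / sample_hours  # 5,000 nodes per hour
    return target_nodes / nodes_per_hour


for target in (600_000, 700_000):
    hours = hours_at_same_rate(target)
    print(f"{target:,} nodes: about {hours:.0f} hours ({hours / 24:.1f} days)")
```

At the measured rate, the full dataset would take 120 to 140 hours to import, which is five days or more of continuous running.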
Challenges
• Errors in the Corrected OCR
• Challenges in Parsing Citations
• The 80/20 rule: manually making the connections that automated means cannot
• Finding suitable sources of data to link to. (DBpedia? VIAF? EOL? Others?)
Summary
• This data may already exist online.
• It may also not always be as accurate as needed for science.
• We are in a position to be the authoritative source for this information.
• Linked Data allows it to be easily reused and shared.
Closing: something fun
One example of reuse
Ryan Schenk http://synynyms.com/
Thank You!
Unlocking Taxonomic Literature II using Linked Open Data
Joel [email protected]/staff/joel-richard
Special thanks to
The International Association for Plant Taxonomy, for giving us permission to scan and digitize TL-2 and place it online.
For his advice and support, Dr. Laurence Dorr, Botanist and Curator, Department of Botany, Smithsonian National Museum of Natural History.
This project was partially funded by the Atherton Seidell Endowment Fund of the Smithsonian Institution.