Joel Richard, Smithsonian Libraries Unlocking Taxonomic Literature II using Linked Open Data


Post on 27-May-2015


DESCRIPTION

The Smithsonian Libraries has digitized Taxonomic Literature II, an essential research tool for botanists. This presentation, with audio, starts with a description of Linked Data and a history of TL-2, then covers some of the methods and challenges we are encountering as we convert it to a digital version and Linked Open Data.

TRANSCRIPT

Page 1: Unlocking Taxonomic Literature II using Linked Open Data

Joel Richard, Smithsonian Libraries

Unlocking Taxonomic Literature II using Linked Open Data

Page 2: Unlocking Taxonomic Literature II using Linked Open Data

Agenda

• What is Linked Open Data / The Semantic Web?

• Where can I see LOD in use?

• What is Taxonomic Literature II?

• How is it being converted to LOD?

• Did we encounter any challenges?

Page 3: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Linked data
From Wikipedia, the free encyclopedia

A method of publishing structured data so that it can be interlinked and become more useful. It builds upon standard Web technologies … [and] extends them to share information in a way that can be read automatically by computers. This enables data from different sources to be connected and queried.

http://en.wikipedia.org/wiki/Linked_Open_Data

Page 4: Unlocking Taxonomic Literature II using Linked Open Data

What is the Semantic Web?

Semantic Web
From Wikipedia, the free encyclopedia

A movement led by the World Wide Web Consortium… to promote common data formats on the Web.

By encouraging the inclusion of semantic content in web pages, the Semantic Web aims at converting the current web dominated by unstructured and semi-structured documents into a "web of data".

"The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries."

http://en.wikipedia.org/wiki/Semantic_Web

Page 5: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Five Stars of Linked Open Data

★ Available on the web (in any format) but with an open license, to be Open Data.

★★ Available as machine-readable structured data (e.g. Excel instead of an image scan of a table).

★★★ As (2), plus a non-proprietary format (e.g. CSV instead of Microsoft Excel).

★★★★ All the above, plus: use open standards from W3C (RDF and SPARQL) to identify things, so that people can point at your stuff.

★★★★★ All the above, plus: link your data to other people’s data to provide context.

http://www.w3.org/DesignIssues/LinkedData.html

Page 6: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Page 7: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Identifier (subject) | Predicate (verb/relationship) | Identifier / Value (object)
Charles Darwin | Born On | “Feb 12, 1809”
Charles Darwin | Born In | Shrewsbury
Charles Darwin | Type | Person
Charles Darwin | Author Of | On the Origin of Species
Shrewsbury | Type | City
Shrewsbury | Is In | England
England | Type | Country
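The subject / predicate / object statements on this slide can be sketched in a few lines of Python. This is only an illustration of the triple model, not code from the TL-2 project; the `Triple` type and the `match` and `birth_country` helpers are hypothetical names introduced here.

```python
from typing import List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The statements from the Charles Darwin example, one tuple per arrow.
GRAPH = [
    Triple("Charles Darwin", "Type", "Person"),
    Triple("Charles Darwin", "Born On", "Feb 12, 1809"),
    Triple("Charles Darwin", "Born In", "Shrewsbury"),
    Triple("Charles Darwin", "Author Of", "On the Origin of Species"),
    Triple("Shrewsbury", "Type", "City"),
    Triple("Shrewsbury", "Is In", "England"),
    Triple("England", "Type", "Country"),
]


def match(graph: List[Triple],
          subject: Optional[str] = None,
          predicate: Optional[str] = None,
          obj: Optional[str] = None) -> List[Triple]:
    """Return every triple that agrees with all of the fields that were given."""
    return [
        t for t in graph
        if (subject is None or t.subject == subject)
        and (predicate is None or t.predicate == predicate)
        and (obj is None or t.obj == obj)
    ]


def birth_country(graph: List[Triple], person: str) -> Optional[str]:
    """Follow 'Born In' and then 'Is In' to reach the place that contains the birthplace."""
    for born in match(graph, subject=person, predicate="Born In"):
        for located in match(graph, subject=born.obj, predicate="Is In"):
            return located.obj
    return None


print(birth_country(GRAPH, "Charles Darwin"))
```

Chaining two lookups like this is the kind of question that becomes possible once the statements are connected instead of sitting in separate records.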

Page 8: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Tim Berners-Lee outlined four principles for linked open data:

1. Use URIs to denote things.

2. Use HTTP URIs so that these things can be referred to and looked up ("dereferenced") by people and user agents.

3. Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF, SPARQL.

4. Include links to other related things (using their URIs) when publishing data on the Web.

http://www.w3.org/DesignIssues/LinkedData.html

http://5stardata.info/
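Principles 2 and 3 amount to an ordinary HTTP GET in which the client says which data format it would like. The sketch below only builds such a request and sends nothing. It assumes the server supports content negotiation for `text/turtle`, which varies by publisher, and `build_rdf_request` is a hypothetical helper name.

```python
import urllib.request


def build_rdf_request(uri: str, media_type: str = "text/turtle") -> urllib.request.Request:
    """Prepare a GET request that asks for an RDF serialization instead of HTML."""
    return urllib.request.Request(uri, headers={"Accept": media_type})


request = build_rdf_request("http://dbpedia.org/resource/Charles_Darwin")

# Sending it would be: urllib.request.urlopen(request)
# That needs network access, and the response format depends on the server.
print(request.get_method(), request.full_url, request.get_header("Accept"))
```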

Page 9: Unlocking Taxonomic Literature II using Linked Open Data

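What is Linked Open Data?

Identifier | Predicate | Identifier / Value
http://dbpedia.org/resource/Charles_Darwin | Born On | “Feb 12, 1809”
http://dbpedia.org/resource/Charles_Darwin | Born In | http://dbpedia.org/resource/Shrewsbury
http://dbpedia.org/resource/Charles_Darwin | Type | Person
http://dbpedia.org/resource/Charles_Darwin | Author Of | http://dbpedia.org/resource/On_the_Origin_of_Species
http://dbpedia.org/resource/Shrewsbury | Type | City
http://dbpedia.org/resource/Shrewsbury | Is In | http://dbpedia.org/resource/United_Kingdom
http://dbpedia.org/resource/United_Kingdom | Type | Country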

Page 10: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Predicate Vocabularies

• Dublin Core – General Metadata for Discovery
• SKOS – Simple Knowledge Organization System
• BIBO – Bibliographic Ontology
• BIO – Biographical
• FOAF – Friend of a Friend
• Events…
• Geographic…
• Many others!
• OWL – Web Ontology Language
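In practice each vocabulary is a namespace URI, and a predicate such as `foaf:name` is shorthand for that namespace plus a local name. The sketch below expands the shorthand. The namespace URIs are the ones these vocabularies commonly publish, listed here from memory as an assumption to check against each vocabulary's own documentation; `PREFIXES` and `expand` are hypothetical names.

```python
# Prefix to namespace URI. Dublin Core has two common namespaces, so both are listed.
PREFIXES = {
    "dc": "http://purl.org/dc/elements/1.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "bibo": "http://purl.org/ontology/bibo/",
    "bio": "http://purl.org/vocab/bio/0.1/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "owl": "http://www.w3.org/2002/07/owl#",
}


def expand(curie: str, prefixes: dict = PREFIXES) -> str:
    """Turn a prefixed name like 'foaf:name' into its full predicate URI."""
    prefix, separator, local_name = curie.partition(":")
    if not separator or prefix not in prefixes:
        raise KeyError(f"unknown prefix in {curie!r}")
    return prefixes[prefix] + local_name


print(expand("foaf:name"))
print(expand("skos:prefLabel"))
```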

Page 11: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Mondeca Labs

Linked Open Vocabularies (LOV)

Vocabulary of a Friend (VOAF)

A vocabulary for describing other vocabularies

http://labs.mondeca.com/dataset/lov

Page 12: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix dbpedia-owl: <http://dbpedia.org/ontology/> .
@prefix dbpprop: <http://dbpedia.org/property/> .

<http://dbpedia.org/resource/Charles_Darwin>
    rdf:type <http://xmlns.com/foaf/0.1/Person> ;
    rdf:type <http://dbpedia.org/ontology/Scientist> ;
    foaf:name "Charles Darwin" ;
    foaf:depiction "http://upload.wikimedia.org/…/Charles_Darwin_seated_crop.jpg" ;
    dbpedia-owl:field <http://dbpedia.org/resource/Natural_history> ;
    dbpprop:placeOfBirth "Mount House, Shrewsbury, Shropshire, England" ;
    dbpedia-owl:birthDate "1809-02-12" ;
    dbpedia-owl:birthPlace <http://dbpedia.org/resource/Shrewsbury> ;
    dbpedia-owl:deathDate "1882-04-19" ;
    dbpedia-owl:deathPlace <http://dbpedia.org/resource/Down_House> ;
    dbpprop:awards <http://dbpedia.org/resource/Royal_Medal> .
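To show how statements like these can be produced from program data, here is a minimal sketch that writes N-Triples, a simpler line-per-statement relative of the Turtle shown on this slide. It is an illustration, not the project's exporter: `term` and `to_ntriples` are hypothetical helpers, and treating every string that starts with `http://` or `https://` as a URI is a simplifying assumption.

```python
from typing import Iterable, Tuple


def term(value: str) -> str:
    """Render a URI as <...> and anything else as a quoted, escaped literal."""
    if value.startswith(("http://", "https://")):
        return f"<{value}>"
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(subject: str, properties: Iterable[Tuple[str, str]]) -> str:
    """Emit one 'subject predicate object .' line per (predicate, object) pair."""
    return "\n".join(
        f"<{subject}> <{predicate}> {term(obj)} ."
        for predicate, obj in properties
    )


DARWIN = "http://dbpedia.org/resource/Charles_Darwin"
output = to_ntriples(DARWIN, [
    ("http://www.w3.org/1999/02/22-rdf-syntax-ns#type", "http://xmlns.com/foaf/0.1/Person"),
    ("http://xmlns.com/foaf/0.1/name", "Charles Darwin"),
    ("http://dbpedia.org/ontology/birthDate", "1809-02-12"),
])
print(output)
```

A production pipeline would more likely use an RDF library, which also handles datatypes, language tags and full escaping.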

Page 13: Unlocking Taxonomic Literature II using Linked Open Data

What is Linked Open Data?

Benefits of Linked Open Data

• Disambiguation

• Connecting Relevant Content

• More visibility via Search

• Enrichment of your data

• Easier reuse of data

Page 14: Unlocking Taxonomic Literature II using Linked Open Data

Linked Open Data in Use

Google Knowledge Graph

Page 15: Unlocking Taxonomic Literature II using Linked Open Data

Linked Open Data in Use

Google Knowledge Graph

Page 16: Unlocking Taxonomic Literature II using Linked Open Data

Linked Open Data in Use

Page 17: Unlocking Taxonomic Literature II using Linked Open Data

Other LOD Examples and Information

Library of Congress: Linked Data Services
http://id.loc.gov/

Schema.org
http://www.schema.org

Data.gov / Semantic
http://www.data.gov/semantic

Linked Data.org
http://linkeddata.org/

Stephen Dale: Linked Data in Action
http://www.slideshare.net/stephendale/linked-data-in-action-4487244

Page 18: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Taxonomic Literature: A selective guide to botanical publications and collections with dates, commentaries and types. (Stafleu et al.)

Essential Reference Tool for Botanists

Authors and their Publications from 1753 to 1940

It is a “database in book form.”

Page 19: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Page 20: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Page 21: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Page 22: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Page 23: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Page 24: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

1. Scanned the pages.

2. Uploaded to the Internet Archive.

3. Hired contractor for OCR and correction (99.97% accuracy).

4. Received XML dataset from contractor.

5. Verified and imported to SQL Server database.

6. Built a website to search the data.

Page 25: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Page 26: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

First... what does 99.97% accuracy mean?

~12,000 Errors
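The two figures on this slide imply a rough size for the corrected text. The arithmetic below is a back-of-the-envelope sketch; it assumes accuracy is measured per character, which the slide does not state, and `implied_total` is a hypothetical helper.

```python
from fractions import Fraction


def implied_total(errors: int, accuracy: str) -> int:
    """Total units checked, given the error count and the accuracy as a decimal string."""
    error_rate = 1 - Fraction(accuracy)  # 1 - 0.9997 = 3/10000, computed exactly
    return int(errors / error_rate)


total = implied_total(12_000, "0.9997")
print(f"{total:,} characters, if the accuracy is per character")
```

In other words, even a 0.03% error rate leaves about 12,000 mistakes when the text runs to roughly 40 million characters.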

Page 27: Unlocking Taxonomic Literature II using Linked Open Data

1. Select Identifiers for our data

http://library.si.edu/digital-library/tl-2/author/darwin

http://library.si.edu/digital-library/tl-2/title/origin_of_species

http://library.si.edu/digital-library/tl-2/title/1313

2. Choose vocabularies for predicates (harder than it sounds)

OWL, FOAF, DublinCore, OpenGraph, SIOC, SKOS, BIBO, etc.

3. Create Links to other data sources on the web.

Taxonomic Literature II
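Step 1 can be sketched as a pair of small functions that mint identifiers in the shape shown on this slide. The base URL is copied from the slide, but `slugify`, `author_uri` and `title_uri` are hypothetical helpers, and the slug rules (ASCII only, lowercase, underscores) are an assumption and not the project's actual scheme.

```python
import re
import unicodedata

BASE = "http://library.si.edu/digital-library/tl-2"  # base path as shown on the slide


def slugify(text: str) -> str:
    """Lowercase, strip accents, and join the remaining words with underscores."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "_", ascii_text.lower()).strip("_")


def author_uri(name: str) -> str:
    """Readable identifier built from an author's name."""
    return f"{BASE}/author/{slugify(name)}"


def title_uri(title_number: int) -> str:
    """Identifier built from the TL-2 title number."""
    return f"{BASE}/title/{title_number}"


print(author_uri("Darwin"))
print(title_uri(1313))
```

Name-based slugs are easy to read but collide when two authors share a surname, which is one reason the numeric form may be the safer identifier.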

Page 28: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as Linked Data

Select Identifiers

<http://library.si.edu/tl2/author/darwin>
    tl2:creator <http://library.si.edu/tl2/title/1313>
    owl:sameAs <http://viaf.org/viaf/27063124>

<http://library.si.edu/tl2/title/1313>
    dc:creator <http://library.si.edu/tl2/author/darwin>
    owl:sameAs <http://www.archive.org/details/originofspecies00darwuoft>
    owl:sameAs <http://www.worldcat.org/oclc/425919213>

Page 29: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as Linked Data

Select Identifiers: Authors

<http://library.si.edu/tl2/author/darwin>
    rdf:type <http://xmlns.com/foaf/0.1/Person>
    foaf:lastName "Darwin"
    foaf:familyName "Darwin"
    foaf:firstName "Charles"
    foaf:givenName "Charles"
    foaf:name "Darwin, Charles Robert"
    skos:prefLabel "Darwin, Charles Robert"
    bio:birth "1809"
    bio:death "1882"
    skos:definition "British evolutionary biologist"
    tl2:personAbbreviation "Darwin"

Page 30: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as Linked Data

Select Vocabularies: Publications

<http://library.si.edu/tl2/book/1313>
    rdf:type <http://purl.org/ontology/bibo/Book>
    tl2:titleNumber "1313"
    tl2:titleAbbreviation "Origin sp."
    tl2:shortTitle "On the origin of species"
    dc:title "On the origin of species by means of natural selection, or the preservation of favoured races in the..."
    dc:publisher "John Murray"
    event:place "London"
    dc:created "1859"

Page 31: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as Linked Data

Linking: Author Names

Used a combination of OpenRefine and LODRefine as well as custom code.

Results: Mixed

• Matched 15 - 20% of the names in our sample set
• Some names weren’t high in the list and required a human touch

Conclusion: Computer code needs to be improved with the aim of minimizing the amount of staff or volunteer time spent matching names.
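The split between automatic matches and names that need a human can be sketched as a ranking step followed by a threshold. This is not the OpenRefine / LODRefine workflow described above; it swaps in Python's standard `difflib` similarity as a stand-in, and the `normalize`, `rank_candidates` and `triage` helpers and the 0.9 cut-off are all assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def normalize(name: str) -> str:
    """Lowercase and drop commas and periods so punctuation does not affect the score."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def rank_candidates(name: str, candidates: List[str]) -> List[Tuple[float, str]]:
    """Score each candidate against the name and sort the best match first."""
    scored = [
        (SequenceMatcher(None, normalize(name), normalize(candidate)).ratio(), candidate)
        for candidate in candidates
    ]
    return sorted(scored, reverse=True)


def triage(name: str, candidates: List[str], accept: float = 0.9):
    """Accept the top candidate automatically only when it scores at or above the cut-off."""
    ranked = rank_candidates(name, candidates)
    if ranked and ranked[0][0] >= accept:
        return ("auto", ranked[0][1])
    return ("review", [candidate for _, candidate in ranked])


print(triage("Darwin, Charles Robert", ["Darwin, Erasmus", "Darwin, Charles Robert"]))
```

Anything returned as "review" keeps its ranked candidate list, so a volunteer starts from the most likely match.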

Page 32: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as Linked Data

Charles Darwin (from dbpedia.org)

Page 33: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as Linked Data

Linking: Herbaria

Used computer code to split the herbarium names and identify them in data provided by the Biodiversity Collections Index.

Results: Good

• Matched 95+% of the herbarium names in all of TL-2
• Careful attention to “A” which is an herbarium, but also starts some sentences in the HERBARIUM and TYPES blocks

Conclusion: These will be added to TL-2 when it launches as LOD.
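The “A” problem can be sketched with a small tokenizer: pull out runs of capital letters, keep the ones that are known herbarium codes, and skip a lone “A” when it reads like the English article. The code list is a tiny hand-picked sample and the lowercase-word rule is an assumption; the project's actual parsing code is not shown in the slides.

```python
import re
from typing import List, Set

# Tiny illustrative sample of herbarium codes, not the Biodiversity Collections Index data.
KNOWN_CODES = {"A", "BM", "K", "MO", "US"}


def find_herbarium_codes(text: str, known: Set[str] = KNOWN_CODES) -> List[str]:
    """Return known herbarium codes in order of appearance, skipping 'A' used as an article."""
    found = []
    for match in re.finditer(r"\b[A-Z]{1,6}\b", text):
        code = match.group()
        if code not in known:
            continue
        if code == "A":
            following = text[match.end():match.end() + 2]
            # "A" followed by a space and a lowercase letter looks like a sentence opener.
            if re.match(r" [a-z]", following):
                continue
        found.append(code)
    return found


print(find_herbarium_codes("A, K, MO. A few specimens at BM."))
```

The heuristic would still be fooled by wording such as "A and K", so ambiguous cases are worth flagging for review.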

Page 34: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II

Missouri Botanical Garden Herbarium (From the Biodiversity Collections Index)

Lsid: urn:lsid:biocol.org:col:15859
Name: Missouri Botanical Garden Herbarium
Code: MO
Kind: Herbarium
Taxon Scope: Herbarium collection limited to vascular plants (5.6 million specimens) and bryophytes (500,000 specimens), Jan. 2009.
Geo Scope: Worldwide; phanerogams strong in Central America (especially Costa Rica, Nicaragua, and Panama), tropical South America. . .
Size: 6,150,000
Founded Year: 1859
Web Site: http://www.mobot.org/
Location Street: P.O. Box 299
Location City: Saint Louis
Location State: Missouri
Location Postcode: 63166-0299
Location Country Iso: US

http://www.biodiversitycollectionsindex.org/urn:lsid:biocol.org:col:15859

Page 35: Unlocking Taxonomic Literature II using Linked Open Data

Taxonomic Literature II as LOD

How are we going to store all this?

We’re using Drupal, which automatically embeds some Linked Open Data elements in the webpage.

Probably not a good idea for very large datasets.

TL-2 = 10,000 authors + 37,000 titles (about 400,000 triples, but growing)

Page 36: Unlocking Taxonomic Literature II using Linked Open Data

TL-2 and LOD Challenges

Performance of Drupal Import:

Feeds Import: 7 Hours for 35,000 “Records” or Drupal Nodes

Other options? Still searching…

Our linked data set will grow to at least 600-700k Drupal nodes.

Is Drupal the best way to do this?
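The import figures on this slide can be turned into a rough projection. The sketch assumes import time scales linearly with the number of nodes, which may not hold in practice; the variable and function names are illustrative.

```python
IMPORT_HOURS = 7          # measured Feeds import time from the slide
IMPORTED_NODES = 35_000   # nodes imported in that time

seconds_per_node = IMPORT_HOURS * 3600 / IMPORTED_NODES


def projected_hours(target_nodes: int) -> float:
    """Estimated import time in hours at the measured per-node rate."""
    return target_nodes * seconds_per_node / 3600


print(f"{seconds_per_node:.2f} seconds per node")
for target in (600_000, 700_000):
    print(f"{target:,} nodes: about {projected_hours(target):.0f} hours")
```

At the measured rate the full data set would take roughly five to six days of continuous importing.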

Page 37: Unlocking Taxonomic Literature II using Linked Open Data

Challenges

• Errors in the Corrected OCR

• Challenges in Parsing Citations

• The 80/20 rule: manually making the connections that automated means cannot make

• Finding suitable sources of data to link to. (DBpedia? VIAF? EOL? Others?)

Page 38: Unlocking Taxonomic Literature II using Linked Open Data

Summary

• This data may already exist online.

• It may also not always be as accurate as needed for science.

• We are in a position to be the authoritative source for this information.

• Linked Data allows it to be easily reused and shared.

Page 39: Unlocking Taxonomic Literature II using Linked Open Data

Closing: something fun

One example of reuse

Ryan Schenk http://synynyms.com/

Page 40: Unlocking Taxonomic Literature II using Linked Open Data

Closing: something fun

One example of reuse

Ryan Schenk http://synynyms.com/

Page 41: Unlocking Taxonomic Literature II using Linked Open Data

Thank You!

Unlocking Taxonomic Literature II using Linked Open Data

Joel [email protected]/staff/joel-richard

Special thanks to

The International Association for Plant Taxonomy, for giving us permission to scan and digitize TL-2 and place it online.

For his advice and support, Dr. Laurence Dorr, Botanist and Curator, Department of Botany, Smithsonian National Museum of Natural History.

This project was partially funded by the Atherton Seidell Endowment Fund of the Smithsonian Institution.