linked open communism - c4l13
Linked Data
Linked Open Communism:
Presented at code4lib2013
by Corey A Harper
2013-02-13
Better discovery through data dis- and re-aggregation
How I learned to shut up about linked data
AND BUILD SOMETHING!!
--- or ---
Linked Data
Metadata as a Graph
Typed things, named by URIs
The relationships between those things, also built on URIs
Ease of integration *across* data sources: merging graphs
Refine
ViewShare
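A minimal sketch of the "merging graphs" point from the Linked Data slide. It uses plain Python sets of (subject, predicate, object) tuples instead of RDFLib so that it runs with no dependencies; the example.org URIs, the sample person, and both sample graphs are hypothetical.

```python
# A graph is modelled as a set of (subject, predicate, object) tuples.
# Every example.org URI below is hypothetical.
DCT_CREATOR = "http://purl.org/dc/terms/creator"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
PERSON = "http://example.org/entity/jane-doe"

archive_graph = {
    ("http://example.org/collection/1", DCT_CREATOR, PERSON),
    (PERSON, RDFS_LABEL, "Doe, Jane"),
}
library_graph = {
    ("http://example.org/bib/42", DCT_CREATOR, PERSON),
    (PERSON, RDFS_LABEL, "Doe, Jane"),
}

# Merging is set union: shared URIs line up and duplicate triples collapse.
merged = archive_graph | library_graph


def resources_by_creator(graph, creator_uri):
    """Return every subject linked to creator_uri by dcterms:creator."""
    return sorted(s for s, p, o in graph if p == DCT_CREATOR and o == creator_uri)
```

Because both sources name the person with the same URI, one lookup on the merged graph returns the archival collection and the bibliographic record together.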
Context
Narrative
Storytelling
Context
The archive's story,
The library's story,
but also
Users' stories
Adding context through recombinant metadata
Backing Away from Evangelism...
Image NOT used by permission. Probably a violation of several copyrights & trademarks.
Aside on metaphors
Image by Jonestown Institute via Wikimedia Commons
http://en.wikipedia.org/wiki/File:Jonestown_entrance.jpg
Aside on metaphors
Image by Joe Mabel via Wikimedia Commons.
http://en.wikipedia.org/wiki/File:Furthur_05.jpg
Premise
Context is so central
And yet our Controlled Vocabs
Are nearly gone
Because the interfaces to them
were broken
The Death of Browse
Next-Gen Discovery Systems don't make use of Authority Control
Browse was/is broken as a UI Design
Rich data in Authorities, disconnected from narrative, context, search
Richer Authority type data outside libraries...
Linked Data Based UI Design
For Boutique Collections
Public Domain image of Paulette Goddard via Wikimedia Commons.
http://en.wikipedia.org/wiki/File:Paulette_Goddard-publicity.JPG
A research leave
Initial Scope
Public Domain image via Wikimedia Commons.
http://en.wikipedia.org/wiki/File:Symbol-hammer-and-sickle.svg
Linked Open Communism
Dis-aggregate EAD records into Collections & Components
Create a broad set of resource types
Extract key entities from EAD: People, Places, Topics, Corporate Bodies
Incorporate additional data about entities
Put this in Blacklight
Load MARC & other data
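A rough sketch of the dis-aggregation and entity-extraction steps listed above, using only the standard library XML parser. The sample EAD fragment is hypothetical and has no namespace, and the mapping of EAD tags to the four entity types is an assumption made for illustration, not the project's actual mapping.

```python
import xml.etree.ElementTree as ET

# Tiny hypothetical EAD fragment (no namespace), for illustration only.
SAMPLE_EAD = """
<ead>
  <archdesc level="collection">
    <did><unittitle>Example Labor Union Records</unittitle></did>
    <controlaccess>
      <persname>Doe, Jane</persname>
      <corpname>Example Workers Union</corpname>
      <geogname>Gastonia (N.C.)</geogname>
      <subject>Strikes and lockouts</subject>
    </controlaccess>
    <dsc>
      <c01 level="series"><did><unittitle>Correspondence</unittitle></did>
        <c02 level="file"><did><unittitle>Letters from Jane Doe</unittitle></did></c02>
      </c01>
    </dsc>
  </archdesc>
</ead>
"""

# Assumed mapping from EAD access-point tags to the entity types on the slide.
ENTITY_TAGS = {"persname": "Person", "geogname": "Place",
               "subject": "Topic", "corpname": "Corporate Body"}
# EAD components are <c> or numbered <c01> .. <c12>.
COMPONENT_TAGS = {"c"} | {"c%02d" % n for n in range(1, 13)}


def disaggregate(ead_xml):
    """Split one EAD document into a collection, its components, and its entities."""
    root = ET.fromstring(ead_xml)
    archdesc = root.find("archdesc")
    collection = {"type": "Collection",
                  "title": archdesc.findtext("did/unittitle")}
    components = [
        {"type": "Component", "level": el.get("level"),
         "title": el.findtext("did/unittitle")}
        for el in archdesc.iter() if el.tag in COMPONENT_TAGS
    ]
    entities = [
        {"type": ENTITY_TAGS[el.tag], "label": el.text.strip()}
        for el in archdesc.iter() if el.tag in ENTITY_TAGS and el.text
    ]
    return collection, components, entities
```

Each returned dict is one candidate record, which is how a single finding aid can fan out into many index entries.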
Technology Stack - UI
Vanilla Blacklight
Minor SOLR Index Tweaks / Additions
Minor View Hacks
pre-beta
Only on localhost right now
Technology Stack - Support Tools
Gadget!
Technology Stack - Backend
Python & RDFLib
4Store & HTTP4Store
Sunburnt
FuzzyWuzzy
(Lots of other Python modules....)
FuzzyWuzzy & SeatGeek!
Fuzzy Wuzzy: Awesome Library from SeatGeek
https://github.com/seatgeek/fuzzywuzzy
http://seatgeek.com/blog/dev/fuzzywuzzy-fuzzy-string-matching-in-python
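To show the kind of fuzzy string matching this slide refers to, here is a dependency-free sketch. It does not import fuzzywuzzy; the standard library's difflib.SequenceMatcher stands in for it, and the token sorting, the 0-100 scale, and the threshold of 85 are assumptions chosen for illustration. Scores will not be identical to fuzzywuzzy's.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lowercase, replace punctuation with spaces, and sort tokens so word order is ignored."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in label.lower())
    return " ".join(sorted(cleaned.split()))


def score(a, b):
    """Similarity from 0 to 100 between two labels after normalization."""
    return round(100 * SequenceMatcher(None, normalize(a), normalize(b)).ratio())


def best_match(query, candidates, threshold=85):
    """Return (candidate, score) for the best candidate at or above threshold, else None."""
    scored = [(c, score(query, c)) for c in candidates]
    best = max(scored, key=lambda pair: pair[1], default=None)
    return best if best and best[1] >= threshold else None
```

Sorting tokens before comparing lets an inverted heading such as "Goddard, Paulette" line up with a natural-order label such as "Paulette Goddard".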
Data Flow
Object Oriented Python
Classes: Collections, Components, Entities
Class methods:
makeGraph
makeSolr
to4store
output (turtle, rdf/xml, etc)
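A hypothetical sketch of how one such class could be shaped. The method names come from the slide; everything else (constructor arguments, Solr field names, the example.org vocabulary URI) is assumed, and this is not the code from the project repository. to4store is left out because it needs a running server, and output only sketches N-Triples.

```python
class Entity:
    """One extracted entity (person, place, topic, or corporate body)."""

    RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
    RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"

    def __init__(self, uri, kind, label):
        self.uri, self.kind, self.label = uri, kind, label

    def makeGraph(self):
        """Return the entity as a set of (subject, predicate, object) triples."""
        return {
            # The example.org vocabulary namespace is a placeholder.
            (self.uri, self.RDF_TYPE, "http://example.org/vocab/" + self.kind),
            (self.uri, self.RDFS_LABEL, self.label),
        }

    def makeSolr(self):
        """Return a flat dict shaped like a Solr document (field names are assumed)."""
        return {"id": self.uri, "type_s": self.kind, "label_t": self.label}

    def output(self, fmt="ntriples"):
        """Serialize the triples; only N-Triples is sketched, without literal escaping."""
        if fmt != "ntriples":
            raise ValueError("only 'ntriples' is implemented in this sketch")
        lines = []
        for s, p, o in sorted(self.makeGraph()):
            obj = "<%s>" % o if o.startswith("http") else '"%s"' % o
            lines.append("<%s> <%s> %s ." % (s, p, obj))
        return "\n".join(lines)
```

Keeping the graph view and the Solr view as two methods on the same object means both the triple store and the search index are derived from one parsed record.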
Performance Benchmarks
EAD -> SOLR:
~26 hrs to parse 1600 EAD, push 385k records to SOLR
DBPedia matching
X-ref label variants for entities against 9.4 million DBPedia labels (labels-en.ttl).
Should be using Hadoop
Other ideas?
Re-solr-izing entities: ~10 minutes
Pulls local copy of dbpedia data from 4store
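One possible answer to "Other ideas?" for the DBPedia matching step: load the labels once into a dictionary keyed by lowercased label, then look up each heading variant directly instead of comparing pairwise. This is a sketch under assumptions, not the talk's method: it assumes the labels file holds one rdfs:label triple per line with an @en tag, and the variant rule (inverting "Last, First") and the two sample lines are illustrative only.

```python
import re

# One rdfs:label triple per line, English-tagged (assumed file layout).
LABEL_LINE = re.compile(
    r'^<(?P<uri>[^>]+)>\s+<http://www\.w3\.org/2000/01/rdf-schema#label>\s+'
    r'"(?P<label>(?:[^"\\]|\\.)*)"@en\s*\.\s*$'
)

# Illustrative sample lines in the assumed layout.
SAMPLE = [
    '<http://dbpedia.org/resource/Paulette_Goddard> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Paulette Goddard"@en .',
    '<http://dbpedia.org/resource/Loray_Mill_Strike> '
    '<http://www.w3.org/2000/01/rdf-schema#label> "Loray Mill Strike"@en .',
]


def build_label_index(lines):
    """Map lowercased label -> URI; lines that do not match the pattern are skipped."""
    index = {}
    for line in lines:
        m = LABEL_LINE.match(line)
        if m:
            index.setdefault(m.group("label").lower(), m.group("uri"))
    return index


def label_variants(heading):
    """Yield the heading as given, then a 'First Last' inversion if it contains a comma."""
    heading = heading.strip().rstrip(".")
    yield heading
    if "," in heading:
        last, first = heading.split(",", 1)
        yield "%s %s" % (first.strip(), last.strip())


def match_entity(heading, index):
    """Return the URI for the first variant found in the index, else None."""
    for variant in label_variants(heading):
        uri = index.get(variant.lower())
        if uri:
            return uri
    return None
```

Exact lookup on variants is fast but misses headings whose wording differs from the DBPedia label, which is where fuzzy scoring or manual overrides would still be needed.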
4Store
Provenance-ish
Naming of sub-graphs
Default context is everything
First EAD cut produced ~4m triples
Easy to delete whole graphs, or individual triples
SPARQL-able: good for stats
992 DBPedia links for 6331 Entities
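A sketch of how a statistic like the link count above might be asked of a SPARQL endpoint with named graphs. Everything specific here is an assumption: the endpoint URL, the graph URI, and the use of owl:sameAs as the linking predicate. The code only builds the query and the request URL; it does not contact a server, and whether the aggregate syntax is accepted depends on the store version.

```python
from urllib.parse import urlencode

# Assumed local endpoint; adjust to wherever the store's HTTP server runs.
ENDPOINT = "http://localhost:8000/sparql/"
# Assumed linking predicate between local entities and DBPedia resources.
OWL_SAMEAS = "http://www.w3.org/2002/07/owl#sameAs"


def count_links_query(graph_uri, predicate=OWL_SAMEAS):
    """Build a query counting distinct subjects in one named graph that carry the predicate."""
    return (
        "SELECT (COUNT(DISTINCT ?s) AS ?linked) WHERE { "
        "GRAPH <%s> { ?s <%s> ?o } }" % (graph_uri, predicate)
    )


def query_url(query):
    """Return a GET URL that passes the query in the standard 'query' parameter."""
    return ENDPOINT + "?" + urlencode({"query": query})
```

Scoping the pattern with GRAPH is what makes the named sub-graphs useful for stats: the same query can be pointed at one load, one source, or one entity set.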
https://github.com/chrpr/ead2rdf2solr
Image by wallygrom via flickr
http://www.flickr.com/photos/33037982@N04/3669790240/
Future Steps: Code to Incorporate
Components: Inheritance of access points
fuzzywuzzy string match to unittitle
matched about 10%
Extend to cross-EAD match via 4Store
VIAF, id.loc.gov, FAST reconciliation
Override configs for DBPedia matching
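The access-point inheritance item in the list above could look roughly like this: a component inherits a collection-level access point when that access point closely matches some part of the component's unittitle. This is a sketch under assumptions; difflib from the standard library stands in for fuzzywuzzy, and the sliding-window scoring, the threshold of 90, and the sample strings are all hypothetical.

```python
from difflib import SequenceMatcher


def partial_score(needle, haystack):
    """Best 0-100 similarity between the shorter string and any equal-length window of the longer."""
    needle, haystack = needle.lower(), haystack.lower()
    if len(needle) > len(haystack):
        needle, haystack = haystack, needle
    if not needle:
        return 0
    windows = range(len(haystack) - len(needle) + 1)
    best = max(
        SequenceMatcher(None, needle, haystack[i:i + len(needle)]).ratio()
        for i in windows
    )
    return round(100 * best)


def inherit_access_points(collection_access_points, unittitle, threshold=90):
    """Return the collection-level access points that appear (approximately) in the unittitle."""
    return [ap for ap in collection_access_points
            if partial_score(ap, unittitle) >= threshold]
```

Inverted headings such as "Doe, Jane" would need to be put in natural order before a comparison like this can find them in a title.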
DBPedia Override Examples
Germany. |t Treaties, etc. |g Soviet Union, |d 1939 Aug. 23.
http://dbpedia.org/page/Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union
Textile Workers' Strike, Gastonia, N.C., 1929.
http://dbpedia.org/page/Loray_Mill_Strike
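A minimal sketch of how override configs for the two examples on this slide might be applied. The headings and URLs are the ones shown on the slide; storing them in a Python dict and the resolve helper are assumptions made for illustration.

```python
# Hand-curated overrides: heading exactly as it appears in the data -> DBPedia page.
OVERRIDES = {
    "Germany. |t Treaties, etc. |g Soviet Union, |d 1939 Aug. 23.":
        "http://dbpedia.org/page/Treaty_of_Non-Aggression_between_Germany_and_the_Soviet_Union",
    "Textile Workers' Strike, Gastonia, N.C., 1929.":
        "http://dbpedia.org/page/Loray_Mill_Strike",
}


def resolve(heading, automatic_match):
    """Prefer a curated override; otherwise fall back to the supplied automatic matcher."""
    if heading in OVERRIDES:
        return OVERRIDES[heading]
    return automatic_match(heading)
```

Checking the override table first means a curated link always wins, and headings no string matcher could plausibly resolve still get connected.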
Further Development: Next Steps
EAC-CPF reconciliation, record creation
Possible relationship to Hydra?
Annotation Interface, DBP Overrides
SOLR Relevancy Ranking
SOLR-Marc Modifications
Update mechanism
Test with other Datasets (NYPL/NYU/METRO project)
Thanks!
[email protected]
@chrpr