GOKb and Refine (Kuali Days 2013)
• The problem space
• Tool Selection
• Enhancements
• Open Refine in Action
• Features and Limitations
• The Data Journey
• So What?
Kristin Antelman (North Carolina State University)
David Kay (Sero Consulting, UK)
Problem Space / Domain Requirement
• Unstructured messy data
  – Critical data is largely poorly controlled text strings (titles, publishers)
  – Data is sloppy: duplicate rows, blank rows, multiple values in single column, incorrectly formatted dates
  – Standards and identifiers exist but have poor (or incorrect) adoption
• Bad data
  – Titles associated with wrong identifiers
  – Data is out of date (has changed)
  – Key data is missing
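The "sloppy data" problems listed above can be illustrated with a minimal Python sketch. It is not part of GOKb or Open Refine; the sample rows, the column names (`title`, `publisher`, `issn`, `start`) and the accepted date formats are all assumptions made for illustration.

```python
from datetime import datetime

# Hypothetical sample rows; column names and values are invented for illustration.
ROWS = [
    {"title": "Journal of Examples ", "publisher": "Acme",
     "issn": "1234-5678; 8765-4321", "start": "01/02/2010"},
    {"title": "Journal of Examples ", "publisher": "Acme",
     "issn": "1234-5678; 8765-4321", "start": "01/02/2010"},   # duplicate row
    {"title": "", "publisher": "", "issn": "", "start": ""},     # blank row
    {"title": "Annals of Samples", "publisher": "Acme",
     "issn": "1111-2222", "start": "2011-03-04"},
]

# Assumed input formats: ISO dates and day/month/year.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalise_date(text):
    """Return an ISO date string, or the stripped input if no format matches."""
    text = text.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return text  # leave unrecognised values for manual review


def clean(rows):
    """Trim cells, drop blank and duplicate rows, split multi-valued ISSNs."""
    seen = set()
    cleaned = []
    for row in rows:
        row = {key: value.strip() for key, value in row.items()}
        if not any(row.values()):
            continue  # blank row
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            continue  # duplicate row
        seen.add(fingerprint)
        # Multiple values in a single column become a list of values.
        row["issn"] = [part.strip() for part in row["issn"].split(";") if part.strip()]
        row["start"] = normalise_date(row["start"])
        cleaned.append(row)
    return cleaned


cleaned_rows = clean(ROWS)
```

Wrong identifiers, stale data and missing key data are not addressed here; those need an authority source to check against, not just tidying.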
Library Book Lifecycle
Library E-Content Lifecycle
The Data Improvement Workflow
[Diagram: Publisher Source Data, Open Refine, GOKb Database, Kuali OLE and Library, connected by two Ingest steps and an API]
From Vision to Implementation
July 2012 to October 2013
• Straw Man
• Feasibility Study
• Iterative Development
Lucas van Valckenborch (1535 or later–1597) [Public domain], via Wikimedia Commons
Aspiration
Tools Selection
Feasibility Study
Knowledge Integration – Summer 2012
Options
• Open Rules
• Drools
• DIY
• Google Refine
Considerations
• Open
• Performance
• Rule Syntax & Interface *
• Rule Management *
• Rule Precedence Support
• Auditing
• Deployment *
Open Rules
Drools Expert
DIY
Critical Factors
• Geared to the main objective
• Suited to the expected user skill sets
• Ease of deployment
• Scales in the ways we need
• An open platform for integration and extensions
• Supported by an active community
Selection of Google Refine
:= Open Refine
Open Refine Extensions
GOKb Open Refine Extensions in the current release (September 2013)
• Server side management
  – Projects
  – Check-out, Check-in
  – Rules
• Refine UI extensions geared to GOKb expectations
  – Pre-edit checks – e.g. New file? White space?
  – Authority validation – e.g. Organisations
  – Feedback panel – Errors and Warnings
  – Access to Quick Resolutions involving stored transformations
  – Pre-processing impact assessment – what this will do to the database
  – Update options – Incremental and Replacement
• Post-ingest support within GOKb
  – Audit trail, To do checklist
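The pre-edit checks and the Errors and Warnings feedback panel listed above might be pictured with the following Python sketch. It is not the GOKb extension code; the required column names and the message wording are hypothetical, and the real checks run inside the Refine UI rather than in a standalone script.

```python
# Hypothetical required columns; the actual GOKb rules are not shown in the slides.
REQUIRED_COLUMNS = ("title", "publisher")


def pre_edit_check(rows):
    """Return (errors, warnings) as lists of human-readable messages."""
    errors, warnings = [], []
    for number, row in enumerate(rows, start=1):
        # Errors: a required value is absent or only white space.
        for column in REQUIRED_COLUMNS:
            if not row.get(column, "").strip():
                errors.append(f"row {number}: '{column}' is missing")
        # Warnings: a value carries leading or trailing white space.
        for column, value in row.items():
            if value != value.strip():
                warnings.append(
                    f"row {number}: '{column}' has leading or trailing white space"
                )
    return errors, warnings


sample = [
    {"title": " Journal of Examples ", "publisher": "Acme"},
    {"title": "Annals of Samples", "publisher": ""},
]
errors, warnings = pre_edit_check(sample)
```

Separating errors from warnings mirrors the feedback panel: here a warning is something a stored transformation could resolve quickly, while an error needs a decision from the editor.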
GOKb Open Refine Screencast
Why Open Refine is a good fit for us (and may be for you as well)
• Extensible
• Supports collaboration/shared workspace
• Supports users at multiple levels of expertise
  – Cross between a spreadsheet and a database for novices
  – GREL, JSON scripting
  – API calls to external data sets
• But sometimes it’s not the right tool….
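As a rough illustration of the "GREL, JSON scripting" level mentioned above, the Python sketch below builds a one-step operation list in the shape Open Refine uses when operation history is extracted as JSON. The column name `title` is hypothetical and the field layout reflects our understanding of the `core/text-transform` operation; check it against a history exported from your own Refine version before reusing it.

```python
import json


def text_transform(column, grel):
    """Build one text-transform step that applies a GREL expression to a column."""
    return {
        "op": "core/text-transform",
        "description": (
            f"Text transform on cells in column {column} "
            f"using expression grel:{grel}"
        ),
        "engineConfig": {"facets": [], "mode": "row-based"},
        "columnName": column,
        "expression": f"grel:{grel}",
        "onError": "keep-original",
        "repeat": False,
        "repeatCount": 10,
    }


# value.trim() is a standard GREL call that strips surrounding white space.
operations = [text_transform("title", "value.trim()")]
operations_json = json.dumps(operations, indent=2)
print(operations_json)
```

Because a stored transformation is plain JSON, it can be saved, shared and re-applied to the next file from the same source, which is what makes reuse and automation practical.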
Round Trip Data Journey
[Diagram: OpenRefine, GOKb Database and Target Applications (e.g. OLE), linked by an Ingest step, two APIs and Routes 1 to 4]
• Route 1 – New project
• Route 2 – CRED user edits
• Route 3 – Update project
• Route 4 – CRED Delta ingest
What’s Next? The Round Trip
RESTful APIs Supporting JSON
So what? Or … why might you be interested?
The Application
• Data cleansing / enhancement
• Reuse … Automation
• Managing distributed activity
• Leveraging Refine and Excel user skills
• Note – GOKb Extensions are Open Source
The Meta Challenge
• Kuali software and the evolving ecosystem
• Tool selection
• An example of community innovation
Open Refine Resources
• Tutorials, FAQs and the Open Refine wiki: http://openrefine.org/documentation
• About GREL: https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions-
• Common formulas for editing with GREL: https://github.com/OpenRefine/OpenRefine/wiki/Recipes
• Step-by-step tutorials: http://www.davidhuynh.net/spaces/nicar2011/tutorial.pdf, http://freeyourmetadata.org
• Book by the freeyourmetadata authors: http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book
• GOKb guidance on Open Refine: https://wiki.kuali.org/display/OLE/OpenRefine
• Twitter: @OpenRefine