GOKb and Refine (Kuali Days 2013)


Upload: gokb-project

Post on 20-Aug-2015


Page 1: GOKb and Refine (Kuali Days 2013)

• The problem space
• Tool Selection
• Enhancements
• Open Refine in Action
• Features and Limitations
• The Data Journey
• So What?

Kristin Antelman (North Carolina State University)

David Kay (Sero Consulting, UK)

Page 2: GOKb and Refine (Kuali Days 2013)

Problem Space / Domain Requirement

• Unstructured messy data
– Critical data is largely poorly controlled text strings (titles, publishers)
– Data is sloppy: duplicate rows, blank rows, multiple values in single column, incorrectly formatted dates
– Standards and identifiers exist but have poor (or incorrect) adoption

• Bad data
– Titles associated with wrong identifiers
– Data is out of date (has changed)
– Key data is missing
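
The "sloppy data" problems listed above can be made concrete with a small sketch. This is an illustration only, not part of the GOKb tooling: the sample rows, the column names, the `;` separator for multi-valued cells, and the two accepted date formats are all assumptions made for the example.

```python
import csv
import io
from datetime import datetime

# Hypothetical publisher title list showing the problems named on the slide:
# a duplicate row, a blank row, two values in one cell, inconsistent dates.
RAW = """title,publisher,issn,start_date
Journal of Examples ,Example Press,1234-5678,01/02/2010
Journal of Examples ,Example Press,1234-5678,01/02/2010
,,,
Annals of Samples,Sample House; Sample Group,8765-4321,2011-03-15
"""

# Assumed input formats; real source files may need others.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def normalise_date(text):
    """Return an ISO date string, or None if no assumed format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(raw_csv):
    """Trim cells, drop blank and duplicate rows, split multi-valued
    publisher cells, and normalise dates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        row = {key: (value or "").strip() for key, value in row.items()}
        if not any(row.values()):
            continue  # blank row
        fingerprint = tuple(row.values())
        if fingerprint in seen:
            continue  # exact duplicate of an earlier row
        seen.add(fingerprint)
        row["publisher"] = [p.strip() for p in row["publisher"].split(";") if p.strip()]
        row["start_date"] = normalise_date(row["start_date"])
        cleaned.append(row)
    return cleaned


rows = clean(RAW)
```

The second group of problems (wrong identifiers, stale or missing data) cannot be fixed by mechanical clean-up like this, because it needs an external source of truth to check against.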

Page 3: GOKb and Refine (Kuali Days 2013)

Problem Space / Domain Requirement

Page 4: GOKb and Refine (Kuali Days 2013)

Library Book Lifecycle


Page 5: GOKb and Refine (Kuali Days 2013)

Library E-Content Lifecycle


Page 6: GOKb and Refine (Kuali Days 2013)

The Data Improvement Workflow

[Diagram. Labels: Publisher Source Data, Ingest, Open Refine, Ingest, GOKb Database, API, Kuali OLE, Library]

Page 7: GOKb and Refine (Kuali Days 2013)

From Vision to Implementation

July 2012 to October 2013

• Straw Man

• Feasibility Study

• Iterative Development

Page 8: GOKb and Refine (Kuali Days 2013)

Lucas van Valckenborch (1535 or later–1597) [Public domain], via Wikimedia Commons

Aspiration

Page 9: GOKb and Refine (Kuali Days 2013)
Page 10: GOKb and Refine (Kuali Days 2013)
Page 11: GOKb and Refine (Kuali Days 2013)
Page 12: GOKb and Refine (Kuali Days 2013)

Tools Selection

Page 13: GOKb and Refine (Kuali Days 2013)

Feasibility Study
Knowledge Integration – Summer 2012

Options
• Open Rules
• Drools
• DIY
• Google Refine

Considerations
• Open
• Performance
• Rule Syntax & Interface *
• Rule Management *
• Rule Precedence Support
• Auditing
• Deployment *

Page 14: GOKb and Refine (Kuali Days 2013)

Open Rules

Drools Expert

DIY

Page 15: GOKb and Refine (Kuali Days 2013)

Critical Factors
• Geared to the main objective
• Suited to the expected user skill sets
• Ease of deployment
• Scales in the ways we need
• An open platform for integration and extensions
• Supported by an active community

Selection of Google Refine (since renamed Open Refine)

Page 16: GOKb and Refine (Kuali Days 2013)

Open Refine Extensions

Page 17: GOKb and Refine (Kuali Days 2013)

GOKb Open Refine Extensions in the current release (September 2013)

• Server side management
– Projects
– Check-out, Check-in
– Rules

• Refine UI extensions geared to GOKb expectations
– Pre-edit checks – e.g. New file? White space?
– Authority validation – e.g. Organisations
– Feedback panel – Errors and Warnings
– Access to Quick Resolutions involving stored transformations
– Pre-processing impact assessment – what this will do to the database
– Update options – Incremental and Replacement

• Post-ingest support within GOKb
– Audit trail, To do checklist
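
The pre-edit checks, authority validation, and errors-and-warnings feedback listed above could look roughly like the following. This is a minimal sketch and does not reproduce the GOKb extension code: the function name `pre_edit_checks`, the column names, and the in-memory `KNOWN_ORGS` authority list are hypothetical.

```python
# Hypothetical authority list of organisation names. In this sketch it is a
# plain in-memory set; how GOKb actually serves authority data is not shown.
KNOWN_ORGS = {"Example Press", "Sample House"}


def pre_edit_checks(rows, known_orgs):
    """Return (errors, warnings), each a list of (row_number, message).

    Warnings flag issues that can be fixed automatically (white space);
    errors flag issues that need a decision (missing or unrecognised values).
    """
    errors, warnings = [], []
    for number, row in enumerate(rows, start=1):
        # White space check on every cell.
        for column, value in row.items():
            if value != value.strip():
                warnings.append((number, f"{column}: leading or trailing white space"))
        # Required value check.
        if not row.get("title", "").strip():
            errors.append((number, "title: missing value"))
        # Authority validation of the publisher name.
        publisher = row.get("publisher", "").strip()
        if publisher and publisher not in known_orgs:
            errors.append((number, f"publisher: '{publisher}' not in authority list"))
    return errors, warnings


sample = [
    {"title": "Journal of Examples ", "publisher": "Example Press"},
    {"title": "", "publisher": "Unknown Org"},
]
errors, warnings = pre_edit_checks(sample, KNOWN_ORGS)
```

Splitting the results into errors and warnings mirrors the feedback panel described on the slide: warnings are candidates for a stored quick resolution, while errors need a person to decide.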

Page 18: GOKb and Refine (Kuali Days 2013)

GOKb Open Refine Screencast

Page 19: GOKb and Refine (Kuali Days 2013)

Why Open Refine is a good fit for us (and may be for you as well)

• Extensible
• Supports collaboration/shared workspace
• Supports users at multiple levels of expertise
– Cross between a spreadsheet and a database for novices
– GREL, JSON scripting
– API calls to external data sets

• But sometimes it’s not the right tool…

Page 20: GOKb and Refine (Kuali Days 2013)
Page 21: GOKb and Refine (Kuali Days 2013)
Page 22: GOKb and Refine (Kuali Days 2013)

Round Trip Data Journey

Page 23: GOKb and Refine (Kuali Days 2013)

What’s Next? The Round Trip

[Diagram. Labels: OpenRefine, GOKb Database, Target Applications (e.g. OLE), Ingest, API, API, RESTful APIs Supporting JSON, Routes 1 to 4]

Route 1 – New project
Route 2 – CRED user edits
Route 3 – Update project
Route 4 – CRED Delta ingest
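
To illustrate what a delta ingest over a JSON API might carry, here is a minimal sketch that compares two snapshots of title records and serialises only the differences. The record layout, the use of an ISSN-like string as the key, and the `added` / `changed` / `removed` payload shape are assumptions for the example, not the GOKb API.

```python
import json


def compute_delta(previous, current):
    """Compare two snapshots keyed by identifier.

    Returns a dict with records that are new, records whose content
    changed, and identifiers that no longer appear.
    """
    added = {key: rec for key, rec in current.items() if key not in previous}
    changed = {
        key: rec
        for key, rec in current.items()
        if key in previous and previous[key] != rec
    }
    removed = sorted(key for key in previous if key not in current)
    return {"added": added, "changed": changed, "removed": removed}


# Hypothetical snapshots: the earlier ingest and the updated file.
previous = {
    "1234-5678": {"title": "Journal of Examples", "publisher": "Example Press"},
    "8765-4321": {"title": "Annals of Samples", "publisher": "Sample House"},
}
current = {
    "1234-5678": {"title": "Journal of Examples", "publisher": "Example Press Ltd"},
    "1111-2222": {"title": "Review of Tests", "publisher": "Test Books"},
}

delta = compute_delta(previous, current)
# JSON body that a client could send to a (hypothetical) ingest endpoint.
payload = json.dumps(delta, indent=2, sort_keys=True)
```

A replacement update would send `current` whole instead. An incremental update sends only `delta`, which keeps the payload small and makes the effect on the database easier to review before ingest.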

Page 24: GOKb and Refine (Kuali Days 2013)

So what? Or … why might you be interested?

The Application
• Data cleansing / enhancement
• Reuse … Automation
• Managing distributed activity
• Leveraging Refine and Excel user skills
• Note – GOKb Extensions are Open Source

The Meta Challenge
• Kuali software and the evolving ecosystem
• Tool selection
• An example of community innovation

Page 25: GOKb and Refine (Kuali Days 2013)

Open Refine Resources

Tutorials, FAQs and the Open Refine wiki
http://openrefine.org/documentation

About GREL
https://github.com/OpenRefine/OpenRefine/wiki/Understanding-Expressions-

Common formulas for editing with GREL
https://github.com/OpenRefine/OpenRefine/wiki/Recipes

Step-by-step tutorials
http://www.davidhuynh.net/spaces/nicar2011/tutorial.pdf
http://freeyourmetadata.org

Book by the freeyourmetadata authors
http://www.packtpub.com/openrefine-guide-for-data-analysis-and-linking-dataset-to-the-web/book

GOKb guidance on Open Refine
https://wiki.kuali.org/display/OLE/OpenRefine

Twitter @OpenRefine