tdwg14 fp-kurator-ludaescher
DESCRIPTION
Workflow Support for Continuous Data Quality Control in a FilteredPush Network
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song
Presentation given at TDWG 2014, Jönköping, Sweden
TRANSCRIPT
1
Workflow Support for Continuous Data Quality Control in a FilteredPush Network
J. Hanken, D. Lowery, B. Ludäscher, J. Macklin, T. McPhillips, P. Morris, B. Morris, T. Song
2
Problem: Data & Metadata Quality
• Collections & occurrence data
  … is all over the map
  … literally (off the map!)
• DQ Issues, e.g., …
  – Lat/Long transposition, coordinate & projection issues
  – Scientific Names (spelling errors, other)
  – Data entry/creation, “fuzzy” data, naming issues, bit rot, data conversions and transformations, schema mappings, … (you name it)
• Related Projects:
  – Filtered-Push
  – Kurator
3
What problems are we trying to solve?
• Detect and flag data quality issues
• Repair if possible
  – … ask human curators as needed
• Keep track of provenance
  – automatic repairs
  – human curators’ edits
• Employ workflow (semi-)automation
  – Scientific workflow systems:
    • Kepler/COMAD, Restflow, Galaxy, Biovel/Taverna, Argo, VisTrails, …
  – Related technologies
    • Akka parallel execution platform
    • Script-based automation (e.g. Python) and digital notebooks (iPython)
4
Data Curation Workflow
Dou, L., G. Cao, P.J. Morris, R.A. Morris, B. Ludäscher, J.A. Macklin, J. Hanken. 2012. Kurator: A Kepler Package for Data Curation Workflows. Procedia Computer Science 9:1614-1619. doi:10.1016/j.procs.2012.04.177
5
Customers of Curation Workflows
• Collection Managers
  – … who are managing the collections databases
  – Can run curation workflows periodically
    • … in the presence of new data and/or new curation services
• (Biodiversity) Researchers
  – To perform an analysis in the presence of (partially) dirty data, researchers need to
    • Clean or fix dirty data
    • Throw out unfixable data
  – Reporting back to the collection managers (cf. FPush)
6
Filtered Push
http://xkcd.com/386/
(1) Kvetch about data
(2) Push to interested parties
(3) Human Filter
(4) Change data in databases
(5) Store all assertions
Source: Paul J. Morris
7
Akka curation workflow on FP2, working on DW spreadsheet reports
Symbiota Instance & DB
Symbiota Instance
Source: Paul J. Morris
8
Overall Dataflow
[Diagram labels: Access Point; Symbiota Portal Access Point; Akka Kurator Workflows; FilteredPush Node; Occurrence Records; Quality Control Annotations; Quality Control Workflow; Quality Controlled Data Set]
Source: Paul J. Morris
9
Example Curation Workflow …
• Load Dataset
• Scientific Name Validation
• Georeference Validation
• Collection Date Validation
• [Create Annotations into FPush Network]
• Output results
  – translate to spreadsheet
  – with provenance!
some steps of a larger workflow
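The steps listed on this slide can be sketched as a chain of small functions, each taking a record and appending a provenance entry. This is only an illustrative sketch: the helper `make_presence_check`, the record field names (borrowed from Darwin Core terms) and the trivial "field is present" checks are assumptions for illustration, not the Kurator or FilteredPush implementation.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Step = Callable[[Record], Record]


def make_presence_check(step_name: str, field: str) -> Step:
    """Build a placeholder step that only checks that a field has a value."""
    def step(record: Record) -> Record:
        outcome = "present" if record.get(field) not in (None, "") else "missing"
        # Every step records what it looked at and what it concluded.
        record["provenance"].append((step_name, field, outcome))
        return record
    return step


# Hypothetical stand-ins for the three validation steps on the slide.
STEPS: List[Step] = [
    make_presence_check("ScientificNameValidation", "scientificName"),
    make_presence_check("GeoreferenceValidation", "decimalLatitude"),
    make_presence_check("CollectionDateValidation", "eventDate"),
]


def run_workflow(records: List[Record], steps: List[Step]) -> List[Record]:
    """Run every step over every record, in order, keeping provenance."""
    results = []
    for raw in records:
        record = dict(raw, provenance=[])  # copy so the input is untouched
        for step in steps:
            record = step(record)
        results.append(record)
    return results
```

Each output record carries its own provenance list, which is what a later "translate to spreadsheet" step would write out next to the data.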
10
… Curation Workflow Output …
11
… close up …
• CORRECT
  – Checked and OK
• CURATED:
  – Checked and fixed
• UNABLE_CURATE
  – Internally inconsistent
  – cannot fix
• UNABLED_DET_VALIDITY
  – Not enough data:
    • No external reference found
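The four outcome labels above could be modelled as an enumeration so that downstream code can count them or route records to a person. A minimal sketch: the label names are copied from the slide, while the `summarize` helper and the choice of which outcomes need human review are assumptions.

```python
from collections import Counter
from enum import Enum
from typing import Dict, Iterable


class CurationStatus(Enum):
    CORRECT = "Checked and OK"
    CURATED = "Checked and fixed"
    UNABLE_CURATE = "Internally inconsistent, cannot fix"
    UNABLED_DET_VALIDITY = "Not enough data, no external reference found"


# Assumption: these two outcomes are the ones worth showing to a human curator.
NEEDS_REVIEW = {CurationStatus.UNABLE_CURATE, CurationStatus.UNABLED_DET_VALIDITY}


def summarize(statuses: Iterable[CurationStatus]) -> Dict[str, int]:
    """Count how many records ended up with each status label."""
    return dict(Counter(status.name for status in statuses))


def needs_review(status: CurationStatus) -> bool:
    """True when the workflow could not settle the record on its own."""
    return status in NEEDS_REVIEW
```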
12
… even closer: Spreadsheet Provenance
• Assertions made
  – sign changed coordinates are on the Earth's surface
  – Coordinates not inside country
  – transposed/sign changed coordinates to place inside country
  – Transposed/sign changed coordinates are near georeference of locality from Geolocate
• Sources used
  – Land data from Natural Earth
  – Country boundary data from GeoCommunity
  – GeoLocate
13
Date Validation
• Check:
  – Collector’s life span
  – .. vs. Date-Collected
• Possible outcomes:
  – Valid
  – Corrected
  – Unable to validate
    • Internal inconsistency
      – Contradicting dates
    • External inconsistency
      – Lack of date data
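The life-span check could look roughly like the following. This is a sketch under assumptions: the function name, the outcome strings and the reason texts are illustrative, and the "Corrected" outcome is left out because the slide does not say how a date would be repaired.

```python
from datetime import date
from typing import Optional, Tuple


def validate_collection_date(
    collected: Optional[date],
    born: Optional[date],
    died: Optional[date],
) -> Tuple[str, str]:
    """Compare the date collected with the collector's life span.

    Returns an (outcome, reason) pair.
    """
    # External inconsistency: not enough date data to decide anything.
    if collected is None or born is None or died is None:
        return ("UNABLE_TO_VALIDATE", "external inconsistency: lack of date data")
    # Internal inconsistency: the dates contradict each other.
    if born > died:
        return ("UNABLE_TO_VALIDATE", "internal inconsistency: birth after death")
    if not (born <= collected <= died):
        return ("UNABLE_TO_VALIDATE",
                "internal inconsistency: collected outside collector's life span")
    return ("VALID", "date collected falls within collector's life span")
```

For example, a specimen dated 1950 attributed to a collector who lived from 1800 to 1870 comes back as an internal inconsistency, while a missing birth or death date is reported as lack of data.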
14
The Logic Behind Each Step …
• Date Collected
  – … collector’s lifetime vs. date collected
• Georeference Validation
  – Lat/long valid (on Earth)
  – … within a country (shape file), point in polygon
  – If georef is “bad” then try
    • … transpositions, sign-swapping etc. of lat/long
    • If they match, fix it!
    • Make sure to record in provenance
    • Using the transposed (or sign-fixed) original data (not the Geolocate)
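The georeference logic above (range check, point in polygon, then trying transposed and sign-changed variants of the original coordinates) can be sketched as follows. Assumptions: the country boundary is a plain list of (lat, lon) vertices rather than a shape file, only four candidate repairs are tried, and all function names and outcome strings are illustrative.

```python
from typing import Iterator, List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


def on_earth(lat: float, lon: float) -> bool:
    """Basic range check for decimal degrees."""
    return -90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0


def point_in_polygon(lat: float, lon: float, polygon: List[Point]) -> bool:
    """Ray-casting test: count boundary crossings of a ray along the parallel."""
    inside = False
    n = len(polygon)
    for i in range(n):
        lat1, lon1 = polygon[i]
        lat2, lon2 = polygon[(i + 1) % n]
        if (lat1 > lat) != (lat2 > lat):  # edge straddles the point's latitude
            cross_lon = lon1 + (lat - lat1) * (lon2 - lon1) / (lat2 - lat1)
            if lon < cross_lon:
                inside = not inside
    return inside


def candidate_fixes(lat: float, lon: float) -> Iterator[Tuple[str, Point]]:
    """Variants derived only from the original values."""
    yield "transposed", (lon, lat)
    yield "latitude sign changed", (-lat, lon)
    yield "longitude sign changed", (lat, -lon)
    yield "both signs changed", (-lat, -lon)


def validate_georeference(lat: float, lon: float, country: List[Point]):
    """Return (status, (lat, lon), provenance note)."""
    if on_earth(lat, lon) and point_in_polygon(lat, lon, country):
        return ("CORRECT", (lat, lon), "coordinates inside country")
    for label, (clat, clon) in candidate_fixes(lat, lon):
        if on_earth(clat, clon) and point_in_polygon(clat, clon, country):
            return ("CURATED", (clat, clon), f"{label}: places point inside country")
    return ("UNABLE_CURATE", (lat, lon), "no candidate places point inside country")
```

The repaired coordinates always come from the original values, matching the note on the slide, and the third element of the result is what would be written to provenance.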
15
… Logic Behind Each Step (cont’d)
• Scientific Name Validation
  – Customer-dependent:
    • Collection Managers:
      – Nomenclature
    • Researchers:
      – Taxonomy (current names)
  – Several Remote services
    • IPNI, GNI, …
• …. <your logic here> …
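One way to picture name validation is a lookup against an authority list with a fuzzy fallback for spelling errors. The sketch below does not call IPNI, GNI or any other remote service: a small in-memory list stands in for the authority, Python's `difflib` stands in for whatever matching the services perform, and the function name, cutoff and outcome strings are assumptions.

```python
import difflib
from typing import List, Tuple


def validate_scientific_name(
    name: str, authority: List[str], cutoff: float = 0.85
) -> Tuple[str, str, str]:
    """Return (status, name to use, provenance note)."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in authority:
        return ("CORRECT", cleaned, "exact match in authority list")
    # Fuzzy fallback: accept the single closest name above the cutoff.
    close = difflib.get_close_matches(cleaned, authority, n=1, cutoff=cutoff)
    if close:
        return ("CURATED", close[0], f"spelling corrected from '{cleaned}'")
    return ("UNABLE_DETERMINE", cleaned, "no match in authority list")
```

A collection manager and a researcher would plug in different authority lists here (nomenclatural versus current taxonomic names), which is the customer dependence noted on the slide.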
16
Curation Workflow Challenges: Machine Cycles
• Scalability & Technology Issues:
  – Clean aggregated data at an FP Node
    • Headless
    • Use of Kepler/COMAD, pros & cons:
      – OK on human cycles, but NOT OK on machine cycles
• Akka
  – Parallelize remote service invocation: helps
  – Non-trivial programming
    • => add another layer on top of Akka
    • .. or … ?? <tell us about your technology!>
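The benefit of parallelizing remote service calls can be illustrated without Akka. The sketch below uses Python's standard `concurrent.futures` thread pool as a stand-in for Akka's actor-based execution; `lookup_name` is a hypothetical placeholder that sleeps briefly to imitate network latency rather than calling a real service.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List


def lookup_name(name: str) -> str:
    """Hypothetical stand-in for a remote name-service call."""
    time.sleep(0.05)  # pretend to wait on the network
    return name.strip().capitalize()


def validate_all(names: List[str], max_workers: int = 8) -> List[str]:
    """Issue the lookups concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lookup_name, names))
```

Because the work is dominated by waiting on remote services, overlapping the calls shortens the wall-clock time even though each individual call is no faster.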
17
Challenges: Human Cycles
• New Kurator project:
  – Enable tool makers
  – Make it easy to build
    • components (software “actors”, services)
    • workflows (gluing services together)
• Data Curation Workflows Interest Group !?
  – Service builders
  – Service & Workflow Registries
    • cf. myExperiment
  – Service aggregators
    • cf. BioVel, DwC validator, …
18
What is Kurator?
• NSF-DBI #1356751
  – Collaborative Research: ABI Development: Kurator: A Provenance-enabled Workflow Platform and Toolkit to Curate Biodiversity Data
  – Sept. 2014 – 2017
  – @Illinois:
    • B. Ludäscher, James Macklin, Tim McPhillips, …
  – @Harvard:
    • James Hanken, Paul Morris, Bob Morris, …
  – @TDWG community
    • <your name here>
19
Kurator Tenets
• Technology Agnostic
  – … to the extent we can …
  – … avoid reinventing the wheel
  – … one size probably doesn’t fit all
  => Deploy curation steps on different wf systems, platforms
• For Tool Makers
• Agile, Community-Driven Development
• Kurator just started, evolving
  – Get involved now!
  – Kick-off meeting November 17 & 18
    • @ NCSA (University of Illinois, Urbana-Champaign)
20
How we do it
• Build a library of curation services such that curation workflows can be run from various platforms
  – Scientific workflow systems
    • e.g. Restflow, Kepler, Taverna, Galaxy
  – Other platforms
    • e.g. Akka, Python-based, …
• … leveraging existing technologies
21
How we do it
• Open source, community-friendly approach
  – git repository (NCSA open source projects)
• Agile software development
  – NCSA support tools, e.g. JIRA, Bamboo
• Inspired by
  – Small bioinformatics tools manifesto (post-facto)
    • cf. Unix tenets (small tools, use filters, pipes, … KISS!)
  – Experience with other (sometimes not so agile) development projects
22
Agile Kurator Development
Interested in looking under the hood?
Kurator/Akka curation wf demo: Wed PM
Initial URL: opensource.ncsa.illinois.edu/projects/KURATOR
23
Related Research (Tianhong Song, UC Davis)
• Analyze linear workflow “story”
• Use patterns to discover wf design issues (e.g. use before update); then fix them
• Parallelize when possible
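A "use before update" check on a linear workflow might be sketched like this. It is one possible reading of the pattern, not the analysis from the research itself: each step declares the fields it reads and writes, and an issue is reported when an earlier step reads a field that a later step updates. The `StepSpec` type and the step names in the usage note are hypothetical.

```python
from typing import List, NamedTuple, Set, Tuple


class StepSpec(NamedTuple):
    name: str
    reads: Set[str]
    writes: Set[str]


def find_use_before_update(steps: List[StepSpec]) -> List[Tuple[str, str, str]]:
    """Return (earlier step, later step, field) for each suspicious pair."""
    issues = []
    for i, early in enumerate(steps):
        for late in steps[i + 1:]:
            # The earlier step worked from a value the later step will change.
            for field in sorted(early.reads & late.writes):
                issues.append((early.name, late.name, field))
    return issues
```

For instance, a workflow that checks the country against the coordinates before a later step repairs those coordinates would be flagged, and the fix would be to move the repair step first. Pairs of steps with no shared fields are candidates for running in parallel.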