iDigBio is funded by a grant from the National Science Foundation’s Advancing Digitization of Biodiversity Collections Program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

GUODA: A Unified Platform for Large-Scale Computational Research on Open-Access Biodiversity Data

Matthew Collins, Alexander Thompson, Jorrit Poelen, Jennifer Hammock

What is GUODA?

Global Unified Open Data Access

An informal collaboration between technologists from organizations like EOL, ePANDDA, and iDigBio, as well as independent biodiversity informaticists. We share data use cases, best practices, infrastructure, code, and ideas about the science that can be done by analyzing large open-access biodiversity datasets.

http://guoda.bio

What our members are interested in

Computation with biodiversity data
• Research at scale
• Lowering barriers to accessing computation
• Reproducibility

Matthew Collins, Technical Operations Manager - iDigBio

Jorrit Poelen, Independent

Alexander Thompson, Software Products Lead - iDigBio

Jennifer Hammock, Marine Theme Coordinator - EOL

Nathan Bird, Software Developer - iDigBio

An example use of GUODA

Does anyone use catalog numbers in remarks fields to document relationships between specimen records in iDigBio?

(We’re at TDWG so we’ve got to do something with identifiers, right?)

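A term-document index of iDigBio

import pyspark.sql.functions as sql  # assumed alias; udf_tokenize is defined elsewhere

idb_tf_df = (idb_df
    # keep the record id and its remarks text
    .select(idb_df["uuid"], idb_df["note"])
    .where(sql.column("note") != "")
    .withColumn("tokens", udf_tokenize(sql.column("note")))
    # one row per (uuid, token) so the grouping columns exist
    .select(sql.column("uuid"), sql.explode(sql.column("tokens")).alias("token"))
    .groupBy(sql.column("uuid"), sql.column("token"))
    .count())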
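The slide does not show `udf_tokenize`, so the following is only a minimal sketch of the same idea in plain Python, runnable without a cluster. The `tokenize` function, the `term_document_counts` helper, and the `rec-*` identifiers are all hypothetical; the real tokenizer may split and normalize text differently.

```python
from collections import Counter

# Hypothetical tokenizer: split on whitespace, trim surrounding punctuation,
# and lowercase. This is an assumption, not the actual udf_tokenize.
def tokenize(note):
    tokens = []
    for raw in note.split():
        token = raw.strip(".,;:()[]\"'").lower()
        if token:
            tokens.append(token)
    return tokens

def term_document_counts(records):
    """Map (uuid, token) -> number of times the token occurs in that note."""
    counts = Counter()
    for uuid, note in records:
        if note == "":
            continue  # same role as the .where(note != "") filter
        for token in tokenize(note):
            counts[(uuid, token)] += 1
    return counts

# Made-up records; only the first remark text is taken from the talk.
records = [
    ("rec-1", "Part of Collection at FH: barcode-00374180."),
    ("rec-2", ""),
    ("rec-3", "Same sheet as barcode-00374180; see barcode-00374180"),
]
index = term_document_counts(records)
```

Each key of `index` corresponds to one row of the Spark result: a record identifier, a token, and how often that token appears in the record's remarks.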

What terms match catalognumber?

joined = (idb_df_ids
    .join(idb_tf_df, on=idb_df_ids["idb_catalognumber"] == idb_tf_df["token"])
    .join(idb_df_notes, on=sql.column("uuid") == idb_df_notes["note_uuid"])
    .withColumn("catalognumber_len", sql.length(sql.column("idb_catalognumber"))))
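The first join is an equality join between tokens and catalog numbers. A rough plain-Python equivalent, written as a hash join over small in-memory dictionaries, might look like the sketch below. The function name, the input shapes, and the sample records are assumptions made for illustration only.

```python
from collections import defaultdict

def match_tokens_to_catalognumbers(catalognumbers, term_counts):
    """Equality join on token == catalog number, done as a hash join.

    catalognumbers: dict of uuid -> catalog number
    term_counts:    dict of (note_uuid, token) -> count
    Returns a list of (note_uuid, matched_uuid, catalognumber, length) tuples.
    """
    # Build side: index record uuids by their catalog number.
    by_catalognumber = defaultdict(list)
    for uuid, catalognumber in catalognumbers.items():
        by_catalognumber[catalognumber].append(uuid)

    # Probe side: look up every token from the term-document index.
    matches = []
    for (note_uuid, token), _count in term_counts.items():
        for matched_uuid in by_catalognumber.get(token, []):
            matches.append((note_uuid, matched_uuid, token, len(token)))
    return matches

# Made-up data: two records share the short catalog number "12".
catalognumbers = {"rec-a": "barcode-00374180", "rec-b": "12", "rec-c": "12"}
term_counts = {
    ("rec-x", "barcode-00374180"): 1,
    ("rec-x", "collection"): 1,
    ("rec-y", "12"): 2,
}
matches = match_tokens_to_catalognumbers(catalognumbers, term_counts)
```

In this toy data the token "12" matches two records at once, which hints at why short, non-unique catalog numbers can inflate the number of joined rows.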

What do we find?

A few things, like record bd347847…

It has the remark “Part of Collection at FH: barcode-00374180”, which matches record 826da57a...

Histogram of matching catalognumber length
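A histogram like this one can be built by counting matches per catalog number length. The sketch below uses made-up sample values and hypothetical helper names (`length_histogram`, `render`); it does not reproduce the figure from the talk.

```python
from collections import Counter

def length_histogram(matched_catalognumbers):
    """Count how many matched catalog numbers have each string length."""
    return Counter(len(c) for c in matched_catalognumbers)

def render(hist, width=40):
    """Draw a simple text histogram, one line per length."""
    if not hist:
        return ""
    peak = max(hist.values())
    lines = []
    for length in sorted(hist):
        bar = "#" * max(1, round(hist[length] / peak * width))
        lines.append(f"{length:>3} {bar} {hist[length]}")
    return "\n".join(lines)

# Made-up matches, not values from iDigBio.
sample = ["1", "7", "12", "12", "A-1", "barcode-00374180"]
hist = length_histogram(sample)
print(render(hist))
```

Grouping by length is a cheap way to separate short tokens that probably match by coincidence from long identifiers that are more likely to be deliberate cross-references.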

How long did that take to write?

< 200 lines of code (including whitespace and comments)

1 intermittent day of coding

https://github.com/iDigBio/idb-spark

How long did that take to run?

73.5 million records in iDigBio to 151 million document:term:counts

40 minutes

Joined back to iDigBio resulting in 2.9 billion terms found in the catalognumber field

3 hours 40 minutes
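To put those figures in perspective, a quick back-of-the-envelope calculation gives rough throughput numbers. It assumes that the 40 minutes covers only the index build and the 3 hours 40 minutes covers only the join; the variable names are illustrative.

```python
# Figures quoted in the talk.
records = 73.5e6            # iDigBio records read
term_counts = 151e6         # document:term:count rows produced
matches = 2.9e9             # rows after joining back to iDigBio

index_seconds = 40 * 60             # 40 minutes
join_seconds = (3 * 60 + 40) * 60   # 3 hours 40 minutes

records_per_second = records / index_seconds
rows_per_record = term_counts / records
matches_per_second = matches / join_seconds

print(f"index build: {records_per_second:,.0f} records/s")
print(f"index size:  {rows_per_record:.2f} term rows per record")
print(f"join:        {matches_per_second:,.0f} matched rows/s")
```

Under those assumptions the index build processes about 30,600 records per second and yields roughly two term rows per record, while the join emits on the order of 220,000 matched rows per second.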

Good tools in the hands of people with good ideas:

IDEAS → WORK → RESULTS

Servers!

• Mesos
• HDFS
• Spark
• Marathon
• Docker
• Cassandra

Infrastructure

Advanced Computing and Information Systems Lab

http://acis.ufl.edu

Data is half the tool

Copies of whole datasets
• Stored locally
• Refreshed automatically

Re-represent datasets in a useful structure for high-performance computing (Parquet on HDFS): https://github.com/bio-guoda/guoda-datasets
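As a sketch of what reading one of these datasets might look like from PySpark: the helper below takes an existing `SparkSession` and a path, and keeps only two columns. The function name, the column names, and the placeholder path are assumptions; the actual dataset locations and schemas are described in the repository above.

```python
def load_occurrences(spark, path):
    """Read a Parquet dataset and keep only the columns needed downstream.

    `spark` is expected to be a pyspark.sql.SparkSession. The column names
    "uuid" and "catalognumber" are hypothetical and should be adjusted to
    the schema of the dataset being read.
    """
    return spark.read.parquet(path).select("uuid", "catalognumber")

# Possible usage inside a notebook that already has a session named `spark`:
#
#   df = load_occurrences(spark, "hdfs:///path/to/dataset.parquet")  # placeholder path
#   df.printSchema()
```

Because Parquet is a columnar format, selecting only the needed columns lets Spark skip reading the rest of each file.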

Interfaces to GUODA

• Jupyter Notebooks for end-users

• Containers for API and web services

• Persistent storage for application state

• Hangouts calls every 2-4 weeks

The front door to GUODA

Notebooks

“Literate Programming”

Comments, code, and outputs all together in a readable document that describes what is being done

Here’s what it looks like

GUODA Jupyter notebook interface

What would you do with it?

Have a GitHub account and want to write code? This is an alpha-quality system.

http://jupyter.idigbio.org

Or talk to us if you want to host an application on our systems


idigbio.org/wiki

facebook.com/iDigBio

twitter.com/iDigBio

vimeo.com/iDigBio

idigbio.org/rss-feed.xml

idigbio.org/events-calendar/export.ics

Get involved!

