Transcript

Open Source Technologies at the National Agricultural Library

Ursula Pieper
IT Specialist – Web Team Lead

National Agricultural Library
Agricultural Research Service
United States Department of Agriculture

Feb 17, 2016


Ursula Pieper
[email protected]

301-504-7379

Acknowledgements:

Knowledge Services Division (Susan McCarthy)
Monica Poelchau and Chris Childers (i5K Workspace)
Peter Arbuckle and Ezra Kahn (LCA Commons)
Jeffrey Campbell (LTAR)
Cynthia Parr (Ag Data Commons)

Information Services Division (Vernon Chapman)
Chuck Schoppet, NAL (Fedora Commons/Islandora)

Why Open Source?

• Benefit from community contributions and support
• Security managed by community
• Cost – Vendor lock-in
• Can get customized locally
• Interoperability
• Re-use of skills

Available Expertise @ NAL

• PHP
• Drupal
• Python
• Grails
• Java
• Solr
• Django
• Subject Matter Experts

Open Source based Projects (Selection)

Technologies: Drupal, Python, Grails, Java, Solr, Django

• Ag Data Commons
– Scientific data catalog/repository
• LCA Commons
– Life Cycle Assessment repo and tools
• PubAg
– Catalog of agricultural scientific literature
• I5K@NAL Workspace
– Repository and workspace for Arthropod Genomes
• Long Term Agro-ecosystem Research
– Historical and future agricultural research data
• National Nutrient Database
• Dr. Duke's Phytochemical and Ethnobotanical Databases

Open Source based Projects (Selection)

Technologies: Drupal, Grails, Java Based

• Ag Data Commons: http://data.nal.usda.gov
• i5K@NAL Workspace: http://i5k.nal.usda.gov
• LCA Commons: http://lcacommons.gov
• PubAg – Data Management System: http://pubag.nal.usda.gov

• LCA Commons: http://lcacommons.gov
• National Nutrient Database: http://ndb.nal.usda.gov/ndb/
• Phytochem Database (Duke): http://phytochem.nal.usda.gov

• Long-term Agro-ecosystem Research: http://ltar.nal.usda.gov

Ag Data Commons

Requirements
• Public Access to USDA funded research results
• Support scientific research and evidence-based policy
• Re-use / re-analysis
• REE Action Plan: 2012 goals
• Journal submission requirements

Mandates
• America COMPETES Act
• OSTP Memorandum
• M-13-13, Open Data Policy


Ag Data Commons
A data catalog and repository based on the Drupal DKAN distribution


Summary of Required Capabilities

• Comprehensive catalog of research results
– Support for compliance reporting
– Feeds Data.gov
– Enhanced dataset description for discovery and reuse
• Flexibility to support distributed data repositories
– Some disciplines already have repositories (e.g. GenBank)
• Preservation of valuable data for long-term research
• Supportive infrastructure for small agencies & labs
• Link scholarly literature to its supporting data
• Sustainable business model


Ag Data Commons Pilot: Standard DKAN Features

• Drupal 7 Installation Profile
• Fulfills Project Open Data requirements
– Dataset content type: POD 1.1 metadata schema
– Unlimited number of resources can be uploaded
– data.json and RDF available
• Additional Features
– Social media links
– Some data analysis tools (map, graph through the Recline library)
– License display
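
To make the data.json point concrete, here is a minimal Python sketch of a completeness check on a POD 1.1 catalog. It is an illustration, not part of DKAN or of NAL's code: the function names and the sample catalog are hypothetical, and the field list reflects the fields the Project Open Data v1.1 schema marks as required for all datasets (federal agencies must additionally supply bureauCode and programCode, which this sketch does not check).

```python
import json

# Fields that the Project Open Data v1.1 schema marks as required for every
# dataset entry. (Assumption: bureauCode/programCode, required only for
# federal agencies, are deliberately left out of this sketch.)
REQUIRED_FIELDS = (
    "title", "description", "keyword", "modified",
    "publisher", "contactPoint", "identifier", "accessLevel",
)


def missing_pod_fields(dataset):
    """Return the required fields that are absent or empty in one dataset."""
    return [field for field in REQUIRED_FIELDS if not dataset.get(field)]


def check_catalog(catalog_text):
    """Map each incomplete dataset's identifier to its missing fields."""
    catalog = json.loads(catalog_text)
    problems = {}
    for position, dataset in enumerate(catalog.get("dataset", [])):
        missing = missing_pod_fields(dataset)
        if missing:
            # Fall back to the list position when no identifier is present.
            problems[dataset.get("identifier", f"#{position}")] = missing
    return problems


# Hypothetical two-entry catalog; a real one would be read from a site's
# /data.json endpoint.
SAMPLE = json.dumps({
    "conformsTo": "https://project-open-data.cio.gov/v1.1/schema",
    "dataset": [
        {
            "title": "Example soil survey",
            "description": "Illustrative record only.",
            "keyword": ["soil"],
            "modified": "2016-02-17",
            "publisher": {"name": "Example Lab"},
            "contactPoint": {"fn": "Jane Doe",
                             "hasEmail": "mailto:jane@example.org"},
            "identifier": "example-1",
            "accessLevel": "public",
        },
        {"title": "Incomplete record", "identifier": "example-2"},
    ],
})

if __name__ == "__main__":
    for identifier, missing in check_catalog(SAMPLE).items():
        print(identifier, "is missing:", ", ".join(missing))
```

A check like this only tests for presence; validating value formats (dates, access levels) would need the full JSON schema.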


Ag Data Commons Pilot: What’s missing from DKAN?

• DKAN’s main use case: Government and organizational documents and datasets
• General improvements
– Large file upload, virus checking, file size display
– Harvest Dashboard: for harvesting external POD datasets or data using other standards
– Solr search
– Versioning
– Data curation workflow
• Scientific data require additional functionality
– DOI assignments to datasets
– Identity management for authors (ORCID, etc.)
– Citation information (primary citation, methods citation, related publications)
– Collection of additional metadata
– Long-term archiving capabilities
– Funding source reference
– Embargo period
– Specialized taxonomies


Ag Data Commons Pilot: Lessons learned

• Keeping codebase compliant with standard DKAN
– All configuration changes need to get committed to code
– Codebase cannot clash with standard DKAN (which requires discipline when under time pressure)
– Significant pain merging NAL customizations with new DKAN releases
– Local programming and systems support is necessary (our model)
• Contributing back to DKAN and Drupal
– Many of NAL’s customizations are adopted (and then maintained) by standard DKAN
– General Drupal functionality:
  • Open data schema mapper
  • NALT Thesaurus
• Taking advantage of customizations by other organizations
– Workflow, Stories, Visualizations


Ag Data Commons Pilot https://data.nal.usda.gov


I5k Workspace@NAL

• Provides tools and resources for scientists working on insect genomes.
• Goal:
– to store insect genome sequences
– visualize them
– enable their curation
– make them accessible to scientists.
• Designed specifically to handle and support genomic data.
• Website: https://i5k.nal.usda.gov

Key open-source software used by the i5k Workspace

1. Main portal/website
– built with Drupal/Tripal
2. Key web application for genome visualization and feature annotation
– JBrowse/Apollo


I5K Workspace @ NAL 1. Drupal + Tripal

• Chado is a database schema for biological data
• Tripal allows Drupal to access data stored in the Chado database to populate web pages using Drupal functionality.

• Community: small and academic

I5K Workspace @ NAL 2. Apollo

• Apollo is a web application that allows interactive, instantaneous editing of genome features
• It is one of the key features of the i5k Workspace
• Community: small and academic

I5K Workspace @ NAL Customized Resources

• Registration module for Apollo application
– Completely built in house
– Integrates notifications, account creation, and captcha
• Visualizing custom data types: gene pages
– Hierarchical view to display gene/transcript relationships
• Search website (many thousands of nodes)
– Apache Solr search
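
As an illustration of the hierarchical gene/transcript view mentioned for the gene pages, the following Python sketch turns parent/child feature records into an indented outline. This is not the i5k Workspace's implementation: the record layout, feature ids, and function names are hypothetical, chosen only to resemble the kind of parent links a genome database stores.

```python
from collections import defaultdict

# Hypothetical feature records: (feature id, feature type, parent id or None).
FEATURES = [
    ("gene1", "gene", None),
    ("mRNA1", "mRNA", "gene1"),
    ("mRNA2", "mRNA", "gene1"),
    ("exon1", "exon", "mRNA1"),
]


def build_children(features):
    """Group features by parent id so each node's children are easy to find."""
    children = defaultdict(list)
    for feature_id, feature_type, parent_id in features:
        children[parent_id].append((feature_id, feature_type))
    return children


def render(children, parent=None, depth=0):
    """Return indented outline lines, visiting each feature depth-first."""
    lines = []
    for feature_id, feature_type in children.get(parent, []):
        lines.append("  " * depth + f"{feature_type}: {feature_id}")
        lines.extend(render(children, feature_id, depth + 1))
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_children(FEATURES))))
```

Top-level genes are the records whose parent is None; each transcript is listed under its gene, and each exon under its transcript.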

I5K Workspace @ NAL Tripal: Lessons learned

• Customization requires one full-time developer at the NAL
• Because our customizations are forked off the main repository, any updates in the main branch require more updates on our part
• Customizations are too specific to our website to be able to fully contribute back to/integrate with the main project

I5K Workspace @ NAL Apollo: Customized resources

• Instead of building customized resources, we contributed financially to the salary of the lead developer.
• Improvements were not specific to the NAL’s goals, but were aimed at improving the stability of the application
• Even without a financial contribution, bug reports and feature requests from the entire user community are usually addressed very quickly due to an active development team, and a lead developer solely focused on this project.

I5K Workspace @ NAL Apollo: Lessons learned

• How you interact with the development community of an OSS project depends on
– 1) the community itself
– 2) the specificity of the customization required

I5K Workspace @ NAL https://i5k.nal.usda.gov

Life Cycle Assessment (LCA) Commons

• LCA Commons is a repository that provides access to data and tools that support life cycle assessment of agricultural products.
• We collect, curate, and provide access to data edited and formatted explicitly for use in LCA
• The LCA Commons is designed specifically to handle and support unit process data for LCA.
• Website: www.lcacommons.gov

LCA Commons Technology Stack

• Three separate applications accessed through Drupal web content management system.
– Discovery and Editorial Applications
  • Groovy/Grails web implementation of domain specific openLCA data model/modeling tool
– LCA Collection on Ag Data Commons
  • DKAN catalog and datastore

[Diagram: lcacommons.gov technology stack]
• Application: Discovery Application, Editorial Application, LCA Collection on Ag Data Commons
• Technology: Groovy/Grails Framework, Solr Index, openLCA API, Activiti BPM, DKAN, Drupal, Drupal Custom User Mgt.
• Database: openLCA MySQL (×2), DKAN Datastore, DKAN Catalog

LCA Commons Customized Resources

• openLCA datastore not designed explicitly for data management beyond what is necessary for desktop modeling.
– has required developing custom “work-arounds” for data management
• Activiti BPM has required significant customization for editorial workflow for LCA data
• Will need to develop customized search capabilities that enable search across all three applications through Drupal

LCA Commons Lessons learned

• Technology selection based on clearly defined functional requirements is critical
– Using openLCA for an application for which it was not exactly designed has required custom development
– AND innovation in the field
• Spurred openLCA developer to build functionality that more closely meets our needs and pushed the domain forward in terms of data sharing and management

LCA Commons http://lcacommons.gov

PubAg Data Management System

• PubAg is the National Agricultural Library's search system for agricultural information.
• Content:
– Full-text articles relevant to the agricultural sciences
– Citations to peer-reviewed journal articles.
• Repository (Data Management):
– Fedora Commons/Islandora/Drupal
• Public Interface:
– Apache Solr and Java application layer
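
For readers unfamiliar with how an application layer talks to Solr, here is a minimal Python sketch that builds a request URL for Solr's standard select handler. The slides say the PubAg application layer is Java; Python is used here only for brevity. The host, the core name "pubag", and the field names "title" and "year" are assumptions made for illustration, not PubAg's actual configuration.

```python
from urllib.parse import urlencode


def solr_select_url(base_url, core, text, rows=10, filters=()):
    """Build a URL for Solr's /select handler.

    q    - the main query string
    fq   - zero or more filter queries that narrow the result set
    rows - maximum number of documents to return
    wt   - response format (JSON here)
    """
    params = [("q", text), ("rows", rows), ("wt", "json")]
    params.extend(("fq", filter_query) for filter_query in filters)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical local Solr instance, core, and field names.
    print(solr_select_url("http://localhost:8983/solr", "pubag",
                          "title:wheat", rows=5, filters=["year:2015"]))
```

Keeping restrictions such as the year in filter queries rather than in the main query lets Solr cache them independently, which is one of the simpler tuning steps for a large index.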

PubAg Data Management System

• From Islandora (https://wiki.duraspace.org/)

PubAg Data Management System Lessons learned

• Customization needed to accommodate NAL Quality Assurance and workflow

• Performance tuning is necessary and non-trivial for large repositories

PubAg Data Management System: Internal Access Only

Long-Term Agroecosystem Research Network

• Historical and future agricultural research data https://ltar.nal.usda.gov

• Aims to ensure sustained crop and livestock production and ecosystem services from agroecosystems.

• Aims to forecast and verify the effects of environmental trends, public policies, and emerging technologies.

Long-Term Agroecosystem Research Network

• Historical and future agricultural research data
• 18 sites across country
• Aim: 30 to 100+ years of data


Long-Term Agroecosystem Research Network Lessons learned

• The project is still in the initial stages
• Lesson learned: we still have a lot to learn

Long-Term Agroecosystem Research Network http://ltar.nal.usda.gov

Conclusion: What have we learned?

• Use of open source technology
– Allows us to test out technology in depth without a huge initial investment
– Gives us access to community development (avoids reinventing the wheel)
– Is mainly useful when customized

?

