iDigBio Technology, Cloud and Appliances

Post on 23-Feb-2016


DESCRIPTION

iDigBio Technology, Cloud and Appliances. Jose Fortes (on behalf of the iDigBio IT team). Paleocollections Workshop, Gainesville, Florida, April 27, 2012. Supported by NSF Award EF-1115210. iDigBio (idigbio.org).

TRANSCRIPT

iDigBio Technology, Cloud and Appliances
Jose Fortes (on behalf of the iDigBio IT team)

Paleocollections Workshop, Gainesville, Florida

April 27, 2012
Supported by NSF Award EF-1115210

Advanced Computing and Information Systems laboratory 2

iDigBio (idigbio.org)

Goal: making data and images for millions of biological specimens available in electronic format for the biological research community, agencies, students, educators, and public

Mission: leadership, coordination, and outreach in digitization of collections by implementing resources for communication, use of technology, access to data, research and education.

The “Hub” part of the NSF ADBC program, aggregating TCNs and PENs

A resource: permanent cloud computing infrastructure
• to link biological data from collections across the USA
• to use search and analytics tools to mine and reference data

iDigBio IT Vision

Cyberinfrastructure to enable
• the collaborative creation, integration and management of digitized biocollections,
• their use in scientific research, education and outreach

Visible as a collection of persistent Internet-accessible services, data and resources
• For biocollection “producers”
• For biocollection “consumers”
• For biocollection service providers
• For cyberinfrastructure providers
• For national/global data aggregators

CI Stakeholders

[Diagram: iDigBio and five stakeholder groups: Domain Data Producers, Infrastructure Providers, Domain Service Providers, Domain Data Consumers, and National/Global Data Aggregators. Entities shown: Museums, TCNs, Collectors, Government, Amazon WS, Google, Microsoft Azure, Amazon Turk, Georeferencing, Imaging services, Data quality, Mapping, Translation, OCR, Researchers, Teachers, Citizens, GBIF, ALA, EOL, DataONE, BISON, NESCent, Data Conservancy, iPlant.]

Stakeholders APIs

[Diagram: the same five stakeholder groups around iDigBio (Domain Data Producers, Infrastructure Providers, Domain Service Providers, Domain Data Consumers, National/Global Data Aggregators), with the same entities as the previous slide. Interface labels on the diagram: Domain data, BLOBs, Appliances; Updates, Notification, Query results, Customer Requests; Processed data; Domain-level data; Updates, Notification, Usage track.]

Interface Model for iDigBio and TCNs

[Diagram: iDigBio + Resources connected through standard interfaces to Infrastructure Providers, National/Global Data Aggregators, Domain Service Providers, and Domain Data Consumers.]

Standards and protocols shown: TDWG, XMPP, OCCI WG, REST WS, WS-I, TAPIR, HTTP, SQL, UTF-8, RDF, XML, X.509, OpenID, SAML, TCP, JPEG2000, ODBC

Resources shown: Virtual Appliances, Machines, Storage, Networking, Learning Modules, Archiving, Data Collections, Structured Data Services, Wiki, Workshop Resources, Workflow Engines, Taxonomic Validation, Data Conversion, Geographical Mapping, Collaboration Tools, Non-structured Data Services

External parties shown: TCNs, National History Museums, Google App Engine, XSEDE, Microsoft Live, Amazon EC2/S3, Applied Innovations, Microsoft Azure, Google Apps, BISON/Federal Collections, iPlant, NCBI, LifeMapper, ALA, EOL, NESCent, Academic Clouds, DataONE

Building the iDigBio Cloud

Cloud-based strategy
• Providing useful services/APIs (programmatic and web-based)
• Federated scalable object storage and information processing
• Digitization-oriented virtual appliances
• Reliance on standards, proven solutions and sustainable software

Continuous consultation with stakeholders
• Surveys, workgroups, summit/workshops, person-to-person …

Keeping our eyes on the ball

• Common/frequent needs: archival storage, server hosting, feedback on the data, data-intensive transformations …
• 10-year tsunami of requirements: from being on Facebook to multilingual search-and-compute across multiple data sets …

Evolution of iDigBio capabilities

[Timeline figure; time axis marked Q3/2012, Q3/2013, Q3/2014, Q3/2015. Capabilities shown along the axis:]
• Data ingestion
• Data access, provision and visualization
• Provide and enable data feedback
• Data linking and federation
• Process and visualize integrated data

Across the whole timeline:
• Increasing storage and server hosting in support of the above
• Increasing number of appliances in support of the above
• Web site for interaction with public, community, education and above

Near-term goals: ingest data

• Textual data
  o JSON document database
  o Data ingestion via DwC-a files
  o Get / Set API
• Image data
  o Internet-accessible object storage
  o Upload appliance
  o Limited access to low-level APIs

[Diagram labels: Internet access; API Gateway; Textual Data (RIAK); Image Data (SWIFT).]
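The textual-data path on this slide (DwC-a file in, JSON documents out, read back through a Get / Set API) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `DocumentStore` is an in-memory stand-in and not the Riak client API, the core file is assumed to be a tab-delimited `occurrence.txt` keyed by `occurrenceID`, and the sketch skips the `meta.xml` descriptor that a real Darwin Core Archive uses to declare its files and delimiters.

```python
import csv
import io
import json
import zipfile


class DocumentStore:
    """In-memory stand-in for a JSON document database (hypothetical, not the Riak API)."""

    def __init__(self):
        self._docs = {}

    def set(self, key, doc):
        # Keep each document as a JSON string, as a document database would.
        self._docs[key] = json.dumps(doc, sort_keys=True)

    def get(self, key):
        raw = self._docs.get(key)
        return None if raw is None else json.loads(raw)


def ingest_dwca(archive_bytes, store, core_name="occurrence.txt", id_field="occurrenceID"):
    """Load every row of the core data file into the store and return the row count."""
    count = 0
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as archive:
        with archive.open(core_name) as handle:
            text = io.TextIOWrapper(handle, encoding="utf-8", newline="")
            for row in csv.DictReader(text, delimiter="\t"):
                store.set(row[id_field], row)
                count += 1
    return count


def build_sample_archive():
    """Build a tiny two-record archive in memory (made-up demo records)."""
    core = (
        "occurrenceID\tscientificName\tcountry\n"
        "urn:demo:1\tQuercus virginiana\tUnited States\n"
        "urn:demo:2\tPinus palustris\tUnited States\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("occurrence.txt", core)
    return buffer.getvalue()


if __name__ == "__main__":
    store = DocumentStore()
    print(ingest_dwca(build_sample_archive(), store))
    print(store.get("urn:demo:1")["scientificName"])
```

Keying each document by its record identifier keeps the Get / Set API trivial; a production service would also need to validate terms and handle re-ingestion of updated records.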

Medium-term goals

• Textual Data
  o JSON document database
  o Data ingestion via DwC-a files
  o Rich RESTful API
• Image Data
  o Web-accessible object storage
  o Upload appliance
  o Fully abstracted storage
• Indexing and Search
  o Extract EXIF data from images
  o Limited but useful set of indexes
  o Intuitive search UI
  o Search available via API
• Portal
  o Consumes and interfaces text, image and search APIs (minimal server-side code)
  o Web-based mapping: client-side JavaScript limits usable record count to about 50k records at a time

[Diagram labels: Internet access; API Gateway; Textual Data (RIAK); Image Data (SWIFT); Filter Set Query interface; EXIF extraction; iDigBio Portal.]
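One possible reading of a "limited but useful set of indexes" behind a filter-set query interface is an inverted index over a few chosen fields, queried with field=value filters combined by AND. The sketch below is only that reading, not the iDigBio implementation; the class name `FieldIndex` and the field names `genus` and `state` are hypothetical.

```python
from collections import defaultdict


class FieldIndex:
    """Inverted index over a limited set of fields (names are illustrative)."""

    def __init__(self, indexed_fields):
        self.indexed_fields = tuple(indexed_fields)
        self._postings = defaultdict(set)  # (field, value) -> set of record ids

    def add(self, record_id, record):
        # Index only the configured fields; values are matched case-insensitively.
        for field in self.indexed_fields:
            value = record.get(field)
            if value is not None:
                self._postings[(field, str(value).lower())].add(record_id)

    def query(self, filters):
        """Return the ids matching every field=value filter (logical AND)."""
        result = None
        for field, value in filters.items():
            if field not in self.indexed_fields:
                raise KeyError(f"field not indexed: {field}")
            ids = self._postings.get((field, str(value).lower()), set())
            result = set(ids) if result is None else result & ids
        return set() if result is None else result


if __name__ == "__main__":
    index = FieldIndex(["genus", "state"])
    index.add("r1", {"genus": "Quercus", "state": "Florida"})
    index.add("r2", {"genus": "Quercus", "state": "Georgia"})
    index.add("r3", {"genus": "Pinus", "state": "Florida"})
    print(sorted(index.query({"genus": "quercus", "state": "Florida"})))
```

Rejecting filters on unindexed fields, rather than scanning every record, is what keeps the set of indexes "limited": each supported query stays a cheap set intersection.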

(Very) Long-term Goals


Virtual appliance cycle

[Cycle diagram labels: Domain expert, iDigBio; download; instantiate; Users at TCNs; Collections Community; Requirements, standards.]

Toolbox Workflow Example

[Workflow diagram. Numbered steps:]
(1) Download iDigBio appliance
(2) Data entry, improvement
(3) Data ingested into iDigBio
(4a) Data publishing
(4b) Replication Services
(5) Download analysis appliance
(6) Search
(7) Visualization

[Other diagram labels: Linux, MySQL, Specify, GEOlocate; TCN server; iDigBio Cloud; Cloud providers (Amazon, Azure…); Global Aggregators; Domain Data Consumer.]

Short term

Ingestion appliance

[Diagram labels: Images captured (e.g. HD/flash media) under /images: /1/100.tif, /1/101.tif, /2/200.tif, …; Web-based UI; Web server; File interface; Cloud client; Batch upload, Cloud APIs; iDigBio object storage cloud (Swift). File-to-identifier mapping shown: /1/100.tif → GUID1, /1/101.tif → GUID2.]

• Facilitate data ingestion, interface with iDigBio
• Tools identified by community in workshops/groups
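The file-to-GUID mapping on this slide (/1/100.tif paired with GUID1, and so on) suggests a two-step batch upload: first build a manifest that assigns an identifier to each captured image, then push the files to object storage. The following is a minimal sketch under stated assumptions: the function names are hypothetical, random UUIDs stand in for whatever GUID scheme the appliance actually uses, and `put_object` is a caller-supplied callable standing in for a real cloud client such as a Swift upload call.

```python
import uuid
from pathlib import Path


def plan_batch_upload(root, pattern="*.tif"):
    """Map each image path (relative to root) to a freshly generated GUID."""
    root = Path(root)
    manifest = {}
    for path in sorted(root.rglob(pattern)):
        if path.is_file():
            relative = "/" + path.relative_to(root).as_posix()
            manifest[relative] = str(uuid.uuid4())
    return manifest


def upload_all(root, manifest, put_object, retries=3):
    """Send every file through put_object(guid, data), retrying on OSError."""
    uploaded = []
    for relative, guid in manifest.items():
        data = (Path(root) / relative.lstrip("/")).read_bytes()
        for attempt in range(1, retries + 1):
            try:
                put_object(guid, data)
                uploaded.append(guid)
                break
            except OSError:
                # Give up only after the last allowed attempt.
                if attempt == retries:
                    raise
    return uploaded
```

A caller could try this with a plain dictionary as the object store, for example `upload_all(root, plan_batch_upload(root), stored.__setitem__)`. Building the manifest before any transfer means the GUIDs exist for later processing even if an upload has to be resumed.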

Medium-term – “Marketplace”

[Diagram labels: iDigBio Portal; Users/Developers; Community appliances; End users; iDigBio Personnel; iDigBio appliances; Proposals.]

Long-term – information processing

[Diagram labels: iDigBio Portal; Users/Developers; Community appliances; Download; End users; iDigBio Personnel; Deploy; Specimen Database; Workflows; Map/Reduce.]
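The slide names Map/Reduce workflows over a specimen database without giving detail. As a rough illustration of the pattern only, the sketch below counts specimen records per value of one field using explicit map, shuffle and reduce phases in a single process. The record layout and the `family` field are assumptions, and a real deployment would run these phases on a distributed framework.

```python
from collections import defaultdict


def map_phase(records, key_field):
    """Emit a (key, 1) pair for every record that has the key field."""
    for record in records:
        key = record.get(key_field)
        if key:
            yield key, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Sum the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


def count_by(records, key_field):
    return reduce_phase(shuffle(map_phase(records, key_field)))


if __name__ == "__main__":
    specimens = [
        {"id": "r1", "family": "Fagaceae"},
        {"id": "r2", "family": "Pinaceae"},
        {"id": "r3", "family": "Fagaceae"},
        {"id": "r4"},  # records without the field are skipped
    ]
    print(count_by(specimens, "family"))  # {'Fagaceae': 2, 'Pinaceae': 1}
```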

Summary

iDigBio cloud
• Service-oriented standards-based cyberinfrastructure focused on the ADBC community needs
• Scalable data management and information processing using standard interfaces, data formats, protocols, tools

Toolboxes as appliances
• Evolving collection of community-selected tools
• Built-in interfaces for effortless iDigBio integration
• Embedded best practices and standards in biocollections work

Software re-use when open-source, well maintained, manageable, sustainable and efficient to re-purpose

Feedback and suggestions welcome: fortes@ufl.edu and “Contacts” at idigbio.org

Acknowledgments

National Science Foundation
• Judith Skog and Anne Maglia

iDigBio team at University of Florida and Florida State University

Extras


Examples

Image ingestion appliances (short term)
• Batch upload of several images from a local storage device/file system to cloud storage
• Generate GUID/URLs for later processing
• Reliable transfers using cloud APIs (e.g. Swift/iDigBio)

Post-processing appliances
• OCR tools; end-user or for batch processing

Geo-referencing appliances
• Training/verification

Research workflow appliances
• Data-intensive/batch processing workflows; e.g. data mining, image processing

Now: appliance proposal process

By users/developers through the iDigBio Web portal
• Requirements: demonstrates usage/buy-in, software license, documentation, etc.

Queue of appliances for integration
• iDigBio will prioritize and work with developers

Leverage expertise in appliance development
• Focus on images that users can download and run on VMware, VirtualBox
• Application, in addition to appliance, if applicable/desirable

Virtual Appliances in iDigBio

Packaging of software and dependencies in virtual machines
• End user/desktop (e.g. VMware, VirtualBox)
• Infrastructure-as-a-Service clouds (e.g. OpenStack)
• Enhance user experience, facilitate integration with cloud

Image ingestion appliances (short term)
• Batch upload of images from a local storage to cloud
• Generate GUID/URLs for later processing
• Reliable transfers using cloud APIs (e.g. Swift/iDigBio)

Post-processing appliances (OCR tools; end-user or batch)
Geo-referencing appliances (Training/verification)
Research appliances (Data-intensive/batch workflows)

iDigBio Cloud Internal Architecture

[Diagram labels: Object store (SWIFT); Database (RIAK); Compute (NOVA); iDigBio Collections Management; Media Data/Metadata; Specimen-record objects; Specimen-image objects; Data Intensive Processing; API/XML Consumer; GBIF, Morphbank, …; Publish, Comment, Updates, Notifications; Domain Data Producers; National/Global Data Aggregators.]

Initial deployment on UF ACIS resources; partially replicated at FSU for reliability and performance

Archer cyber-infrastructure

[Diagram labels: Archer seed resources; Local resource pools: servers, clusters, desktop labs; User desktops; Voluntary resources; Self-configuring virtual appliances; Archer software and management; Deployment, support, configuration, troubleshooting; Web portal, documentation, tutorials; Community-contributed content: applications, datasets.]

www.archer-project.org

Unique UF+FSU IT resources

Excellent resources
• Computational
  o ACIS lab: 14 clusters, 700+ cores, 500 Terabytes
  o 3 HP centers: ~6000 cores, 300 Terabytes
• Networking to/from UF and FSU
  o 10 Gbit connectivity to UF Campus Research Network
  o 10 Gbit connections to Florida Lambda Rail, National Lambda Rail, and Internet2

Invasive Species

• Where have they been introduced, and how quickly are they spreading?
• What is the pattern of spread, and do they covary with other taxa?
• What is the effect of climate change on the spread of invasives?

Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change

Vascular Plant Diversity in Florida
• 2609 species (of 4200), all included in phylogeny
• 203 species endemic to Florida
• Ratio of endemics to all species
• ~200,000 location points; data from UF, FSU, USF, GBIF, FNAI

Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change

Vascular Plant Diversity in Florida
2609 species (of ~4200), all included in phylogeny

+

Phylogenetic tree, 2609 species
GenBank, new (1000 spp)

Florida Plant Phylogeny: Phylogenetic Diversity Under Climate Change

• Integrate distribution data, ecological data, climate models, phylogeny
• How does species diversity compare to phylogenetic diversity?
• How do species diversity and phylogenetic diversity change?
• How do invasive species respond?
• Integrate across clades
• Develop workflows to facilitate such studies

D. Soltis, G. Burleigh, C. Germain-Aubrey, J. Allen, L. Majure

Research & Scientific Outreach

• Foster, encourage, enhance, enable research using collections data
• Foster research in IT
• Integrate with various research communities
• Work with research communities to develop collections and research-related workshops and symposia at meetings
• Work with research communities to develop interfaces with data repositories, etc. to promote integrated research
• Coordinate these efforts with TCNs and PENs

Linking Collections to Ecology
Through collections from LTERs

Linking Collections to Ecology
Through NEON (National Ecological Observatory Network)
• Biological monitoring at sites across USA; collections
• Baseline for changes in species distribution and abundance over time

Linking Collections to Paleobiology
Paleobiology Database (http://paleodb.org/cgi-bin/bridge.pl)

Linking Collections to Genomics
National network of tissue and genetic resources

Linking Collections to Genomics
Extend HUB connections to genomics databases

Linking to Living Collections
Botanical gardens, zoos, culture collections

Interactions with Systematics Community and Beyond

• Facilitate digitization efforts
• Coordinate with other databasing efforts in systematics
• Connect to databases outside systematics: ecology to genomics (NEON to GenBank)

Interactions Fostered Through…

• Discussions at national meetings of professional societies (systematics, ecology, evolution, genomics)
• Workshops to engage members of systematics community
• Workshops to engage members of different communities

Unique UF+FSU record

Track record of building cyberinfrastructure
• PUNCH and In-VIGO Nanohub, Netcare, In-VIGOBlast …
• Morphbank
• AFRESH
• Telecenter
• Archer

Archer cyber-infrastructure

• Hundreds of distributed compute/router nodes
• 24/7 operation, 650+ cores
• Custom appliance image for computer architecture community
• Job scheduling across participating institutions

Research Questions

• How are species distributed in geographical and ecological space?
• What is the history of life on Earth?
• What factors lead to speciation, dispersal, and extinction?
• What are the impacts of climate change likely to be?
• What information is needed for effective conservation strategies?

Slide provided by Pam Soltis