An On-line Collaborative Data Management System
DESCRIPTION
This presentation, which I prepared, was given by Rob Simmonds at the Gateway Computing Environments 2010 Workshop in New Orleans on November 14, 2010. It gives an overview of a data management system developed for GeoChronos, an on-line collaborative platform for Earth observation scientists.
TRANSCRIPT
An On-line Collaborative Data Management System
Roger Curry (1), Cameron Kiddle (1), Rob Simmonds (1) and Gilberto Z. Pastorello Jr. (2)
(1) Grid Research Centre, University of Calgary
(2) Centre for Earth Observation Science, University of Alberta
Outline
  Data Challenges
  Related Work
  Data Management System
  Use Case: GeoChronos
  Summary and Future Work
GCE 2010 Nov. 14, 2010 2
Data Challenges - I
  Data Acquisition
    Much scientific data is stored on off-line media
      Cumbersome and time-consuming to access
    Making data available on-line is difficult
      Insufficient storage and bandwidth
  Sharing of Data
    Lack of willingness to share data
    Proprietary data - need for controlled access
Data Challenges - II
  Usability of Data
    Insufficient metadata to describe data
    Some domains have various metadata standards, but many domains lack them; many scientists use their own metadata format
  Finding Data
    Difficult to find the data you need
    Different data organized / stored differently
    Tools to browse, search and visualize data often lacking
Related Work - I
  Content Management Systems
    e.g., Drupal, Joomla!, Microsoft SharePoint, Plone, ...
    Offer a rich set of features but do not provide:
      Meaningful support for specific data formats
      Efficient association of metadata and ancillary files with data sets
      Access to a variety of data processing tools
      Uniform handling of outputs from processing tools
  Spectral Libraries
    e.g., USGS, ASTER, Vegetation Spectral Library (VSL)
    Are available on-line but lack:
      Ability to dynamically restructure metadata for browsing
      Collaboration features enabled by social networking
Related Work - II
  Spectral Library Tools
    e.g., DLR-DFD Spectral Archive, SPECCHIO
    Flexible in creating / handling metadata but:
      Have a fixed metadata schema – do not support new metadata needs
  Data repositories for other domains
    e.g., Astrophysics Data System, FLUXNET, European Bioinformatics (EBI) Databases
    Offer a wide range of functionality but:
      Primarily focus on data that is already validated and structured
      Do not handle preliminary, intermediate, untested data (i.e., research in progress)
  Digital Libraries
    e.g., Planetary Data Systems, NCore, SciPort
    Have flexible functionality but:
      Most focus on well-defined digital artefacts
      Limited in handling collaboration on evolving data, metadata and schemas
Data Management System - Overview
  Supports the following functionality:
    On-line access to data
    Enables scientists to share data while maintaining control of who sees it
    Ability to add and edit metadata while working with multiple schemas
    Collaboratively create new schemas to facilitate consistent/accurate recording of metadata
    Dynamically restructure the way data is browsed
Data Management System - Framework
  User & Data:
    User acquires data from sensor and uploads it to the portal
    Direct acquisition of data also possible
  Elgg Portal:
    Built on top of Elgg, an open source social networking platform
    Fine-grained access control
    Flexible data model
  Data Storage:
    Currently local NFS storage
    Working on distributed iRODS-based system
  Data Ingestion Service:
    Creates records, parses metadata, establishes ancillary relationships
    Deployed on cloud-based Condor pool
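The three ingestion steps named on this slide (create records, parse metadata, establish ancillary relationships) could be sketched roughly as follows. This is only an illustration, not the GeoChronos implementation: the `Record` class, the `key: value` header format, and the rule that files sharing a file stem are ancillary to each other are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from pathlib import PurePath


@dataclass
class Record:
    """One ingested data file plus what the ingestion step learned about it."""
    filename: str
    metadata: dict[str, str] = field(default_factory=dict)
    ancillary: list[str] = field(default_factory=list)


def parse_header(text: str) -> dict[str, str]:
    """Parse leading 'key: value' lines; stop at the first line without a colon.

    The header format is hypothetical; real parsers are format specific.
    """
    metadata = {}
    for line in text.splitlines():
        if ":" not in line:
            break
        key, value = line.split(":", 1)
        metadata[key.strip()] = value.strip()
    return metadata


def ingest(files: dict[str, str], data_suffix: str = ".txt") -> list[Record]:
    """Create one record per data file and link same-stem files as ancillary."""
    records = []
    for name in sorted(files):
        if PurePath(name).suffix != data_suffix:
            continue  # not a data file; it may still be linked as ancillary
        stem = PurePath(name).stem
        ancillary = [other for other in sorted(files)
                     if other != name and PurePath(other).stem == stem]
        records.append(Record(name, parse_header(files[name]), ancillary))
    return records


# Hypothetical upload: two spectra and one photo of the first sample.
uploads = {
    "leaf01.txt": "species: Populus tremuloides\nsensor: UNISPEC\n350 0.041\n351 0.043",
    "leaf01.jpg": "<binary photo>",
    "rock07.txt": "site: Example quarry\n400 0.12",
}
records = ingest(uploads)
```

Here `leaf01.jpg` ends up attached to the `leaf01.txt` record rather than becoming a record of its own, which is one plausible reading of "ancillary relationship".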
Data Management System – Data Model
  Figure: Elgg data model (Source: http://docs.Elgg.org/wiki/File:Elgg_data_model.png)
  Arbitrary metadata can be assigned to any entity
  Annotations allow users to comment on entities not owned by them
  Data management system adds three new types of ElggObjects:
    Schema
    Collection
    Record
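The data model described above can be illustrated with a minimal sketch. Elgg itself is a PHP platform, so this is not Elgg's API: the `Entity` class, its method names, and the rule that only the owner sets metadata are assumptions made to show the difference between metadata and annotations.

```python
from dataclasses import dataclass, field

# The three object subtypes the data management system adds.
SUBTYPES = ("schema", "collection", "record")


@dataclass
class Entity:
    """Hypothetical stand-in for an ElggObject with metadata and annotations."""
    owner: str
    subtype: str
    metadata: dict[str, str] = field(default_factory=dict)
    annotations: list[tuple[str, str]] = field(default_factory=list)

    def __post_init__(self):
        if self.subtype not in SUBTYPES:
            raise ValueError(f"unknown subtype: {self.subtype}")

    def set_metadata(self, user: str, key: str, value: str) -> None:
        # Assumption for this sketch: only the owner edits metadata.
        if user != self.owner:
            raise PermissionError(f"{user} does not own this entity")
        self.metadata[key] = value  # any key is allowed: metadata is arbitrary

    def annotate(self, user: str, comment: str) -> None:
        # Annotations may come from users who do not own the entity.
        self.annotations.append((user, comment))


record = Entity(owner="alice", subtype="record")
record.set_metadata("alice", "sensor", "ASD")
record.annotate("bob", "Was this measured under clear sky?")
```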
Data Management System - Schemas
  Create schemas
    Custom or standards-based (e.g., Dublin Core)
    Individually or as a collaborative team
  Schemas consist of
    Namespace
    Description
    Read/write access permissions
    Series of metadata keys
  Metadata keys consist of
    Name
    Description
    Type (text, latlong, ancillary)
    Optionality: required, recommended, optional
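As a rough sketch of the schema structure listed above, the following models a schema as a namespace, a description and a series of metadata keys, and reports which required or recommended keys a record is missing. The class and method names are hypothetical, access permissions are left out for brevity, and the choice of which Dublin Core elements are required is an assumption for the example (Dublin Core itself does not mandate any).

```python
from dataclasses import dataclass, field
from enum import Enum

KEY_TYPES = ("text", "latlong", "ancillary")


class Optionality(Enum):
    REQUIRED = "required"
    RECOMMENDED = "recommended"
    OPTIONAL = "optional"


@dataclass(frozen=True)
class MetadataKey:
    name: str
    description: str
    type: str = "text"
    optionality: Optionality = Optionality.OPTIONAL

    def __post_init__(self):
        if self.type not in KEY_TYPES:
            raise ValueError(f"unknown key type: {self.type}")


@dataclass
class Schema:
    namespace: str
    description: str
    keys: list[MetadataKey] = field(default_factory=list)

    def check(self, metadata: dict[str, str]) -> dict[str, list[str]]:
        """List the required and recommended keys absent from `metadata`."""
        missing = {"required": [], "recommended": []}
        for key in self.keys:
            if key.name in metadata:
                continue
            if key.optionality is Optionality.REQUIRED:
                missing["required"].append(key.name)
            elif key.optionality is Optionality.RECOMMENDED:
                missing["recommended"].append(key.name)
        return missing


# Hypothetical schema using three Dublin Core element names.
spectra_schema = Schema(
    namespace="dc",
    description="Subset of Dublin Core used for the example",
    keys=[
        MetadataKey("title", "Name of the sample", "text", Optionality.REQUIRED),
        MetadataKey("creator", "Who collected it", "text", Optionality.RECOMMENDED),
        MetadataKey("coverage", "Where it was collected", "latlong"),
    ],
)
report = spectra_schema.check({"creator": "A. Scientist"})
```

In this example `report` flags `title` as a missing required key, while the absent optional `coverage` key is not reported.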
Data Management System - Collections
  Group of related data
    e.g., spectral library, set of satellite data
  Collection consists of
    Name, description, read/write access permissions, metadata, records
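A collection with the parts listed above might be modelled as in the sketch below. The real system relies on Elgg's access control; representing permissions as sets of user names, and letting writers also read, are simplifying assumptions, and all names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Collection:
    name: str
    description: str
    owner: str
    readers: set[str] = field(default_factory=set)  # users granted read access
    writers: set[str] = field(default_factory=set)  # users granted write access
    metadata: dict[str, str] = field(default_factory=dict)
    records: list[str] = field(default_factory=list)

    def can_read(self, user: str) -> bool:
        # Assumption: write access implies read access.
        return user == self.owner or user in self.readers or user in self.writers

    def can_write(self, user: str) -> bool:
        return user == self.owner or user in self.writers

    def add_record(self, user: str, record_name: str) -> None:
        if not self.can_write(user):
            raise PermissionError(f"{user} may not write to {self.name}")
        self.records.append(record_name)


library = Collection(
    name="Lichen spectra",
    description="Field spectra of lichen samples",
    owner="alice",
    readers={"bob"},
)
library.add_record("alice", "lichen_001")
```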
Data Management System - Records
  Atomic unit of data management system
  Usually represents a single file, but does not need to be associated with a file
  Tabbed interface for viewing:
    Spectral plot, metadata, ancillary data, map, comments
    Custom tabs based on data type
Data Management System – Virtual Directory Structure
  Dynamic restructuring of data for browsing purposes
  Folders based on metadata keys/values
  User can customize the metadata keys used to establish the directory hierarchy
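The virtual directory idea can be sketched as a grouping of records by the values of a user-chosen list of metadata keys, one folder level per key. This is an illustration only: the function name, the nested-dict representation and the placeholder folder for records lacking a key are assumptions.

```python
MISSING = "(unspecified)"  # hypothetical folder name for records lacking a key


def build_tree(records: dict[str, dict[str, str]], keys: list[str]) -> dict:
    """Nest record names into folders, one level per metadata key in `keys`."""
    if not keys:
        raise ValueError("at least one metadata key is needed")
    tree: dict = {}
    for name, metadata in records.items():
        node = tree
        for key in keys[:-1]:
            # Each metadata value becomes a folder at this level.
            node = node.setdefault(metadata.get(key, MISSING), {})
        # The last key's folders hold the record names themselves.
        node.setdefault(metadata.get(keys[-1], MISSING), []).append(name)
    return tree


records = {
    "leaf01": {"species": "aspen", "sensor": "UNISPEC"},
    "leaf02": {"species": "aspen", "sensor": "ASD"},
    "rock07": {"sensor": "ASD"},
}

by_sensor_then_species = build_tree(records, ["sensor", "species"])
by_species = build_tree(records, ["species"])
```

The same three records appear under `sensor/species` folders in the first tree and under `species` folders in the second. Nothing is moved or copied, since the hierarchy is derived from metadata at browse time.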
Use Case - GeoChronos
(http://geochronos.org/)
GeoChronos - Overview
  An on-line platform
  For:
    Earth Observation Scientists
  Facilitating:
    Collaboration between scientists
    Data access, management and sharing
    Application access, management and sharing
  Leveraging:
    Web 2.0 and social networking technologies
    Cloud computing technologies
  Funded by:
    CANARIE - Network Enabled Platform (NEP-1) program
    Cybera
GeoChronos - Project Team
  Principal Investigators
    Dr. Arturo Sanchez-Azofeifa, University of Alberta
    Dr. John Gamon, University of Alberta
    Dr. Benoit Rivard, University of Alberta
    Dr. Rob Simmonds, University of Calgary
  Project Coordination
  Platform Development
  Domain Scientists
GeoChronos - Virtual Organization
GeoChronos – Spectral Libraries
  Libraries created
    Ingested some existing on-line libraries
      USGS, ASTER, Vegetation Spectral Library (VSL)
      Many enhanced features as part of GeoChronos Spectral Library module - improved browsing, dynamic plotting, mapping, annotations, ...
    Domain scientists have contributed libraries
      Rock samples, tar sand samples, lichen samples, vegetation samples, alfalfa/barley field samples
  Data formats / parsers supported
    ENVI, UNISPEC, ASD, several ASCII formats
  Schemas incorporated
    Library specific – USGS, ASTER, VSL, ...
    Sensor/Format specific – UNISPEC, ENVI, ...
    Other Standards – Dublin Core
  Currently hosting (including MODIS data)
    10+ schemas, 20+ collections (libraries), 20,000+ records
GeoChronos – MODIS Satellite Data
  Developed automated workflow service for mosaicing, subsetting, reprojecting and masking MODIS satellite data
  Significantly reduces the time scientists spend performing such workflows manually
  Data management system used to store raw MODIS satellite data and data products derived from the workflow
  Parsers/schemas specific to MODIS data have been added to the system
  User provided with the same powerful interface as Spectral Libraries for browsing, accessing and viewing data
GeoChronos – User Feedback
  Have developed the data management system in an interactive, iterative fashion
  Domain scientists on the project have provided much guidance, testing and feedback
  Have customized and enhanced the data management system based on feedback received
Summary
  Identified data-related challenges facing scientists
  Discussed some related efforts and shortcomings of these approaches
  Presented an on-line collaborative data management system addressing many data challenges
  Showed example usage of the data management system by GeoChronos
Next Steps
  Currently have a single local data repository
    Working on extending data management system to work with distributed data repositories using iRODS
  Currently have powerful browsing functionality
    Need to add search functionality across collections and based on metadata values
  Currently support custom metadata schemas
    Plan to make use of Semantic Web technologies to better relate data and provide ontological mapping between different metadata schemas / standards
  Currently work with spectral and MODIS satellite data
    Plan to incorporate other data such as carbon flux data, other satellite data, meteorological data, phenology tower data
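The planned search across collections based on metadata values is not specified in the slides. One simple form it might take is sketched below; the function name, the in-memory data layout and the exact-match semantics are all assumptions for illustration.

```python
def search(collections: dict[str, dict[str, dict[str, str]]],
           **criteria: str) -> list[tuple[str, str]]:
    """Return (collection, record) pairs whose metadata match every criterion."""
    hits = []
    for collection_name, records in collections.items():
        for record_name, metadata in records.items():
            if all(metadata.get(key) == value for key, value in criteria.items()):
                hits.append((collection_name, record_name))
    return sorted(hits)


# Hypothetical contents of two collections.
collections = {
    "Lichen spectra": {
        "lichen_001": {"sensor": "ASD", "site": "Site A"},
        "lichen_002": {"sensor": "UNISPEC", "site": "Site A"},
    },
    "Rock spectra": {
        "rock_007": {"sensor": "ASD", "site": "Site B"},
    },
}

asd_records = search(collections, sensor="ASD")
asd_at_site_a = search(collections, sensor="ASD", site="Site A")
```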
Contact Information
  http://geochronos.org/
  [email protected]
  http://grid.ucalgary.ca/
  http://ceos.ualberta.ca/
  http://www.cybera.ca/