CSIRO Marine Research Divisional Data Centre Current and Future Activities Tony Rees, Data Centre Manager April 2004


CSIRO Marine Research Divisional Data Centre

Current and Future Activities

Tony Rees, Data Centre Manager

April 2004


Talk outline

• General Divisional context – past and present

• Data Centre approaches and tools – including MarLIN, Data Warehouse & Trawler, CAAB, C-squares, and OBIS

• Data Centre services to CMR projects

• Cleveland-specific issues

Target audience and level of talk

• Introductory / overview level, some examples but not full detail

• Aimed at CMR staff in general, project managers, plus project metadata staff

• Database designers and application developers will find material of interest, but will need separate, more detailed information.

What are the Division’s chief assets?

• Our people (and their intellectual capabilities)

• Our hardware (collecting platforms etc.) and technologies

• Our data – newly collected, plus historic data

How do we manage our data assets?

• Mixture of good, moderately good, and not good at all

• “good” – well documented; details in searchable catalogue; appropriate/current formats; online access (to appropriate users); ongoing curation

• “moderately good” and “not good” depart from the above, to lesser or greater degree

• Data Centre curates selected datasets on behalf of the Division, others reside long-term in projects

• Data Centre also maintains “MarLIN” – the Division’s data catalogue (metadata system)

Overview of metadata, data systems – national context

[Diagram: the ASDD (Australian Spatial Data Directory), a national cross-agency metadata gateway, sits over the agency metadata systems of CMR, NOO, GA, AAD, AIMS, etc. (e.g. MarLIN, Neptune). Each metadata system describes / points to the corresponding agency data (CMR data, NOO data, GA data, AAD data, AIMS data, etc.), including 3rd party data held as a CMR copy.]

Example:

• search via ASDD – search across multiple agencies, basic functionality

• search via MarLIN – search only CMR holdings, but extra functionality (also view “CMR internal” records not visible to external users)


The Card Index ...


MarLIN

Marine Laboratories Information Network

Divisional Data Catalogue (metadata system)


What is in MarLIN?

• Descriptions of <2,000 Divisional datasets (including c.1000 held by the Data Centre)

• Individual MarLIN records are searchable by subject, keyword, CMR project, geographic region, time period, biological species, voyage reference, and more

• Contain metadata (“data about data”) in a common structure (ANZLIC format plus CMR-specific additional fields)

• Can contain links to images, related documents, data files, and other metadata records

• “Quick maps” (using c-squares data footprints, see later) can indicate the spatial extent of the data
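The multi-criteria searching described above can be sketched roughly as follows. This is an illustration only, not MarLIN's implementation: the `MetadataRecord` fields, the `search` function, and the sample records are all hypothetical, and a real ANZLIC-style record carries many more fields.

```python
from dataclasses import dataclass, field


@dataclass
class MetadataRecord:
    """Simplified, hypothetical stand-in for one catalogue record."""
    title: str
    project: str
    keywords: set = field(default_factory=set)
    start_year: int = 0
    end_year: int = 0


def search(records, keyword=None, project=None, year=None):
    """Return the records that match every criterion that was supplied."""
    hits = []
    for rec in records:
        if keyword is not None:
            # Keyword matching is case-insensitive in this sketch.
            if keyword.lower() not in {k.lower() for k in rec.keywords}:
                continue
        if project is not None and rec.project != project:
            continue
        if year is not None and not (rec.start_year <= year <= rec.end_year):
            continue
        hits.append(rec)
    return hits


# Invented sample records, for illustration only.
catalogue = [
    MetadataRecord("Voyage A CTD casts", "Project X", {"CTD", "temperature"}, 1995, 1996),
    MetadataRecord("Voyage B trawl catches", "Project Y", {"catch", "fish"}, 1998, 1999),
    MetadataRecord("Voyage C CTD casts", "Project Y", {"CTD", "salinity"}, 1999, 2001),
]

ctd_hits = search(catalogue, keyword="ctd")
combined_hits = search(catalogue, keyword="ctd", project="Project Y", year=2000)
print([r.title for r in ctd_hits])
print([r.title for r in combined_hits])
```

Because every record shares one structure, a single search routine can serve all projects, which is the practical argument for a common metadata format.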

Who creates MarLIN metadata records?

• Records are created/maintained by the data custodians, who best understand the data and associated useful resources, using an online metadata entry form (Data Centre staff can assist with this process)


revised Data Centre website (extract)

Sample MarLIN content

• Alphabetical dataset lists

• Indexes by keyword, etc.

• Brief dataset details


example search result ...

(etc.)


Viewing the full metadata record produces ...

(etc.) with clickable link to show dataset extent using c-squares:


(Quick look at the ASDD)


What’s in it for me / us?

• Allows CMR staff / others to know what data we have already, what we are collecting (or plan to collect), what we do not have (gap analysis) – facilitates data re-use, avoids duplicate acquisition, fosters collaborations

• Permits inspection of relevant data documentation in order to assess data usefulness / completeness / quality, inspect thumbnails of data coverage, etc.

• Gives a contact person and/or electronic access for the data, via a standard entry point

• Disseminates project scientific activities into a new “information space” – online searching via the ASDD, indexing by web search engines, and a possible future One-CSIRO system (though don’t hold your breath for the latter)

• Projects may find it feasible to use MarLIN to catalogue / access their own data, relying on MarLIN’s built-in search capability rather than re-inventing it.

Data Warehouse and Data Trawler

2000 onwards – databasing of “all” Data Centre holdings into a Divisional Data Warehouse, accessed by a custom “Data Trawler” application

• Historic holdings of Hydrology (bottle chemistry) and CTD data – 200,000 HYD analyses, 10,000 CTD casts, from hundreds of research voyages and coastal stations

• Underway data for 175 research voyages (10 million observations) – depth, position, time, meteorological variables, sea temperature, salinity, fluorescence

• Biological (catch composition) data from 85 voyages – 10,000 trawls, 240,000 individual species records (number or weight caught)

• Currents data from 548 moored current meters (3 million readings)

ADCP data and some old hydrology data are still in archives, awaiting migration to the on-line Warehouse system. Note also that c. 50% of Divisional catch data is not held by the Data Centre at this time (probably still with the original investigators).


example Data Trawler Screens

Current Warehouse content accessible via Data Trawler:

• HYD and CTD data – all years

• moorings data – all years

• catch data – all years

What’s in it for me / us?

• Provides access to centrally held data on a self-serve basis, via a standard web browser

• Allows queries to be constructed by data type, region, time period, species, voyage ...

• Contains the actual data, but not text information (the latter is in MarLIN)

• Permits retrieval of data across multiple projects, as integrated result set in a common format

• Provides preview / mapping of spatial extents of result sets generated (closer to true web GIS facility cf. MarLIN, which is more of a quick “thumbnail” facility)

• Data are provided in CSV / spreadsheet-compatible format, suitable for download to the user’s own machine for further manipulation.
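The self-serve query pattern listed above (filter by data type, region and time period, then return one integrated result set as CSV) might look roughly like this. It is a sketch under stated assumptions: the field names, the `query` and `to_csv` helpers, and the sample rows are hypothetical, and the real Data Trawler runs against the Divisional Oracle warehouse rather than in-memory lists.

```python
import csv
import io

# Invented rows standing in for observations drawn from several projects.
observations = [
    {"voyage": "V1", "data_type": "CTD", "year": 1996, "lat": -43.0, "lon": 147.5, "value": 12.1},
    {"voyage": "V2", "data_type": "CTD", "year": 2001, "lat": -20.0, "lon": 115.0, "value": 24.6},
    {"voyage": "V3", "data_type": "HYD", "year": 1996, "lat": -42.5, "lon": 148.0, "value": 34.9},
]


def query(rows, data_type=None, years=None, bbox=None):
    """Filter rows by data type, inclusive (first, last) year range,
    and a (south, north, west, east) bounding box."""
    result = []
    for row in rows:
        if data_type is not None and row["data_type"] != data_type:
            continue
        if years is not None and not (years[0] <= row["year"] <= years[1]):
            continue
        if bbox is not None:
            south, north, west, east = bbox
            if not (south <= row["lat"] <= north and west <= row["lon"] <= east):
                continue
        result.append(row)
    return result


def to_csv(rows):
    """Serialise a result set as CSV text with a single header row."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


tasmanian_ctd = query(
    observations, data_type="CTD", years=(1990, 2000), bbox=(-45.0, -40.0, 144.0, 150.0)
)
csv_text = to_csv(tasmanian_ctd)
print(csv_text)
```

Returning every data type in one common column layout is what lets results from multiple voyages and projects open directly in a spreadsheet.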

Systems considered thus far ...

[Diagram. Divisional Systems: “MarLIN” Data Catalogue; Divisional Data Warehouse; “Data Trawler” application; hyperlinked documents, graphics, etc.; project-based data holdings; off-line archived data. Remote Applications: Austr. Spatial Data Directory (ASDD).]

CAAB

Codes for Australian Aquatic Biota

master taxonomic database

1999-current – upgrading of “CAAB” master taxon management system for the Division

• CAAB (Codes for Australian Aquatic Biota) is a database of species names and codes, now covering >25,000 marine species in Australian waters

• codes are standardised species identifiers for use in Divisional databases (species names may change, codes are intended to be constant)

• “quick maps” of all catch data in the Warehouse (by species) have been associated with the relevant CAAB records; predicted species ranges are also available for c. 3,000 fish species

• individual maps form clickable interface(s) to retrieve corresponding data items (individual catch records) from the warehouse and display in a web page
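The reason for storing codes rather than names can be shown with a small sketch. The codes and species names below are invented placeholders, not real CAAB entries, and the helper functions are hypothetical; the point is only that a name change touches the lookup table and nothing else.

```python
# Hypothetical codes and names; real CAAB codes and taxa are not reproduced here.
taxon_names = {
    "00000001": "Exampleus primus",
    "00000002": "Exampleus secundus",
}

# Catch records store only the code, never the species name.
catch_records = [
    {"trawl": "T1", "code": "00000001", "count": 12},
    {"trawl": "T2", "code": "00000002", "count": 3},
    {"trawl": "T3", "code": "00000001", "count": 7},
]


def total_caught(records, code):
    """Sum the counts recorded against one taxon code."""
    return sum(r["count"] for r in records if r["code"] == code)


def codes_for_name(names, name):
    """Match a name back to its code(s), i.e. the reverse lookup."""
    return [code for code, current in names.items() if current == name]


before = total_caught(catch_records, "00000001")

# A taxonomic revision renames the species: only the lookup table changes.
taxon_names["00000001"] = "Revisedus primus"

after = total_caught(catch_records, "00000001")
print(before, after, codes_for_name(taxon_names, "Revisedus primus"))
```

The catch records are untouched by the rename, so queries keyed on the code return the same result before and after.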

web-accessible version of CAAB

What’s in it for me / us?

• Codes are a standard storage and interchange format for taxonomic information in CMR and other regional databases

• CAAB website and derived tables allow matching of codes to names, and vice versa

• Check correct spelling of species names, full citation, generate Australian species lists per genus / family / larger category

• Links to pictures and maps of CMR data distribution, where available

• “Quick maps” form clickable front end to Data Warehouse queries

• Also provides access to most recent predicted species range in many cases

• Potentially supports “what lives here” queries from predicted species ranges and specified depths (fishes only, at present time).
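A “what lives here” query of the kind mentioned in the last bullet could be sketched as below, assuming each predicted range is stored as a set of grid-square identifiers plus a depth interval. The `PredictedRange` structure, the square identifiers, the codes and the depths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictedRange:
    code: str            # hypothetical taxon code
    squares: frozenset   # identifiers of grid squares in the predicted range
    min_depth_m: float
    max_depth_m: float


# Invented ranges, for illustration only.
ranges = [
    PredictedRange("00000001", frozenset({"A1", "A2"}), 0, 200),
    PredictedRange("00000002", frozenset({"A2", "B1"}), 150, 1000),
    PredictedRange("00000003", frozenset({"B1"}), 0, 50),
]


def what_lives_here(all_ranges, square, depth_m):
    """Return codes whose predicted range covers the square and the depth."""
    return sorted(
        r.code
        for r in all_ranges
        if square in r.squares and r.min_depth_m <= depth_m <= r.max_depth_m
    )


shallow = what_lives_here(ranges, "A2", 100)
deep = what_lives_here(ranges, "A2", 500)
overlap = what_lives_here(ranges, "A2", 180)
print(shallow, deep, overlap)
```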


C-squares

Concise Spatial Query and Representation System

spatial indexing and mapping utility


“C-squares” mapping / spatial indexing utility

• Original Data Centre creation, 2001 onwards

• Mainly a developer’s tool

• Permits “lightweight” spatial indexing, queries, and web mapping from a standard text-based system (no GIS required)

• Currently used in 4 CMR and 3 international systems

(Tony Rees can supply more details if interested).
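To make the “text-based, no GIS required” idea concrete, here is a minimal sketch of encoding a position as a c-squares string and querying footprints by plain string matching. It follows the published c-squares notation as commonly described (a 10-degree code made of a quadrant digit, tens of latitude and two digits of tens of longitude, optionally followed by a three-digit 1-degree suffix); treat those details as assumptions to check against the specification. The `csquare` and `datasets_in_square` functions and the sample footprints are hypothetical, not Data Centre code.

```python
def csquare(lat, lon, resolution=10):
    """Encode a position as a 10-degree or 1-degree c-squares string."""
    if resolution not in (10, 1):
        raise ValueError("this sketch only handles 10- and 1-degree squares")
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        raise ValueError("position out of range")
    # Global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5
    # Work with absolute values; nudge the extreme edges into the last square.
    abs_lat = min(abs(lat), 89.999999)
    abs_lon = min(abs(lon), 179.999999)
    lat_tens, lon_tens = int(abs_lat // 10), int(abs_lon // 10)
    code = f"{quadrant}{lat_tens}{lon_tens:02d}"
    if resolution == 10:
        return code
    # 1-degree suffix: intermediate quadrant (1-4), then the units digits.
    lat_unit, lon_unit = int(abs_lat) % 10, int(abs_lon) % 10
    intermediate = 1 + 2 * (lat_unit >= 5) + (lon_unit >= 5)
    return f"{code}:{intermediate}{lat_unit}{lon_unit}"


def datasets_in_square(footprints, ten_degree_code):
    """Spatial query by prefix match: which footprints touch this square?"""
    return sorted(
        name
        for name, codes in footprints.items()
        if any(c.startswith(ten_degree_code) for c in codes)
    )


# Invented dataset footprints expressed as sets of 1-degree codes.
footprints = {
    "dataset A": {"3414:227", "3414:228"},
    "dataset B": {"3211:100"},
}

hobart_10 = csquare(-42.88, 147.32)
hobart_1 = csquare(-42.88, 147.32, resolution=1)
matches = datasets_in_square(footprints, hobart_10)
print(hobart_10, hobart_1, matches)
```

Because a footprint is just a list of strings, it can be stored in an ordinary text column and searched with standard string operations, which is what makes the approach lightweight.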


OBIS

Ocean Biogeographic Information System

www.iobis.org


OBIS – Ocean Biogeographic Information System

• Operated by an international consortium, including CMR representation

• Like a “super CAAB” for the world, but with names only (not codes)

• Can currently access point data for 20,000 marine species from c. 20 institutions worldwide (2 million records), plus lists of names awaiting data, and returns integrated result sets (like Data Trawler)

• Many aspects similar to CAAB, including “Quick maps”, click-on-map spatial queries, OBIS taxonomic groups, and more (Data Centre staff did the interface and query logic)

• CMR catch data to be visible via the system in due course.

Data Centre Services to CMR projects

Who are we?

• Tony Rees (Hobart) – Data Centre manager; MarLIN, CAAB, C-squares technical support & development; national & international connections; project-level advice (metadata)

• Pamela Brodie, Leanne Wilkes (Hobart) – Data Warehouse, Data Trawler support and data loading; project-level advice (databases)

• Miroslaw Ryba (Hobart) – Oracle support; ships biological data collection suite

• Terry Byrne (Hobart) – National Facility Data Librarian; data requests; data archiving

• Hiski Kippo (Floreat) – project-level liaison, DC representation (WA)

• Steven Edgar (Cleveland) – project-level liaison, DC representation (QLD)


“On the ground” DC services to CMR projects

• Advice and assistance to CMR project staff – metadata entry, database design, general data management issues

• Maintaining the Division’s Oracle systems, and provision of Oracle advice and web-based help

• Servicing/forwarding data requests as appropriate

• Migrating project data to the Data Warehouse, for integration with other relevant data holdings, and archiving data to offline media as required

• Looking at whole-of-Division issues such as data access and exchange policies, engagement with relevant national and international data operations, cross-CSIRO data access, etc.

• New Data Management officers in Floreat (2002) and Cleveland (2004)

• Developing interest in GIS data layers and systems e.g. ArcSDE, ArcIMS

• Continuing to advance existing DC systems on three fronts – tools, content, and connectivity (internally, nationally, internationally).


Cleveland-specific issues ...

• Steven Edgar has an advisory role for data management in projects at the Cleveland site (project personnel do the actual project-level management); he can assist with database design, etc., and with some or all Oracle administration needs

• Steve’s time (or portions of it) can be spent on migrating project data to our central warehouse/trawler system, also assisting project staff with metadata entry as needed

• Steve brings new expertise in GIS systems to the Data Centre; will take an interest in cross-project / cross-Divisional GIS issues and progress where possible

• Steve can act as conduit for technology/content/expertise transfer in 2 directions (DC Systems/tools > CMR projects and vice versa) – also the “eyes and ears” of the Data Centre in Cleveland to bring local issues to Hobart attention as needed

• Additional Hobart-based staff are only an email or phone call away if they can be of assistance.

Summary – an idealised “data life cycle” at CMR

[Diagram. Stages: Project starts; Project completed. Systems: PSS; “MarLIN” Data Catalogue; Divisional Data Warehouse; “Data Trawler” application; project-based data holdings; persistent project db’s; off-line archived data; project data repository. Content flowing between them: administrative details; project overview; interim documents, graphics, etc.; project data; published output.]


towards “best practice” data management at project level ...

• Projects should be recording the existence of their data in MarLIN – ideally sooner rather than at end of project

• Data should eventually be migrated off PCs into Divisional systems

• As much relevant data as possible should be in the Warehouse

• Effort should be made to produce definitive / final version of the data

• Data Centre can help with archiving for closed projects

• Data Warehouse table structure, and other Divisional databases, can provide starting points / examples for project level databases

• Taxonomic / survey data recording should employ CAAB codes as a Divisional standard

... refer to the Data Centre internal website and your local Data Centre person/s for additional information.


Some action items / ideas for discussion ...

• Upgrade MarLIN content to reflect the true data holdings of the Division (augmented with project descriptions as available)

• Look into migrating more “completed” project datasets into centralised (Data Centre) holdings / systems

• Locate as much as possible of the “missing” catch data, to add to present Warehouse content

• Obtain clearance as needed to make CMR catch data visible to the outside world (currently, it is all intranet-only) via Data Trawler and other linked systems (CAAB, OBIS, others)

• Assist project staff with pressing data management issues and work to ensure good technology transfer for database design, etc.

• Work with key project staff to progress the usefulness of the “new” web-enabled GIS systems across appropriate datasets, for the benefit of multiple users

• Identify needs to digitise important non-digital data holdings (notebooks, field log sheets etc.) and assist in seeking resources to digitise them.


Feedback / discussion time ...

Summary of core Data Centre components as at April 2004

[Diagram. Divisional Systems: “MarLIN” Data Catalogue; Divisional Data Warehouse; “Data Trawler” application; “CAAB” Taxonomic Database; “C-squares” Spatial Indexing / Mapping System; hyperlinked documents, graphics, etc.; project-based data holdings; off-line archived data. Remote Applications: Austr. Spatial Data Directory (ASDD); external c-squares users – FishBase, OBIS, others; Distributed AODC? OBIS? other?]

• www.marine.csiro.au/marlin/

• www.marine.csiro.au/caab/

• www.marine.csiro.au/csquares/

• www.marine.csiro.au/warehouse/jsp/loginpage.jsp

• asdd.ga.gov.au/asdd/

• (e.g.) www.iobis.org/