On-line biological data concepts at CSIRO Marine Research, Australia Tony Rees & Kim Finney Divisional Data Centre CSIRO Marine Research, Hobart, Australia http://www.marine.csiro.au/datacentre/


Page 1

On-line biological data concepts at CSIRO Marine Research, Australia

Tony Rees & Kim Finney

Divisional Data Centre

CSIRO Marine Research, Hobart, Australia

http://www.marine.csiro.au/datacentre/

Page 2

Our website: http://www.marine.csiro.au/datacentre/

Page 3

Pre-existing situation at CMR (before 1997)

• Data in a variety of databases and flat files

• No metadata or digital documentation

• No web access to any data or metadata

• CAAB (taxon coding system) in existence but coverage patchy and compliance variable

Page 4

Our implementation path

Stage 1 (1997-2000) ...

• Construct a searchable, web-accessible metadata system and start populating it with information - MarLIN v1

• Upgrade CAAB to form a comprehensive taxon dictionary for MarLIN (also accessible by SQuID)

• Build a pilot data store and visualisation system with a web-driven GUI (Java applet) - SQuID v1

Stage 2 (2000-) ...

• Build SQuID v2 (onwards) to become a comprehensive data store, with upgraded links to MarLIN and CAAB

• Implement linkage between MarLIN and Australia-wide, distributed metadata search system

Stage 3… ???

Page 5

Our system overview

Subsets of information shared with other metadata directory systems

Entry point to data

Display relevant metadata

Data directory (metadatabase) - holds info at “dataset” level (e.g. survey, species range)

Master data storage (includes index layer) - holds info at the atomic data level

Taxon dictionary

Page 6

Digression #1: Taxon matching

• Simplistic view:

– text match on one field (“scientific name”) or two (genus + species)

• More comprehensive approach:

– 10 or more fields required, e.g. in CAAB we define the following: Genus, Subgenus, Species, Qualifier, Subspecies, Variety, Original Author/s, Original Date, Revising Author/s, Revision Date, Authority Addendum

– also need to flag:

- Is botanical or zoological code applicable?

- Species name latin or informal (“sp. A”, etc.)?

- Has name changed from original? (even if no revising author/date stored)

Examples from our database:

• Chlamys (Belchlamys) aktinos (Petterd, 1886) … a scallop

• Ophiaster hydroideus (Lohmann) Lohmann, 1913 emend. Manton & Oates, 1983 … a coccolithophorid

• Heteroclinus sp. 1 [in Gomon et al, 1994] .. Kuiter's weedfish
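The field list above can be sketched as a record type. This is only an illustration: the class, attribute, and method names are hypothetical and do not reflect the real CAAB schema, and the authority formatting covers the simple zoological case only (parentheses when the name has changed from the original).

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TaxonName:
    """Hypothetical record; fields mirror the slide's list, not CAAB's schema."""
    genus: str
    species: str
    subgenus: Optional[str] = None
    qualifier: Optional[str] = None
    subspecies: Optional[str] = None
    variety: Optional[str] = None
    original_author: Optional[str] = None
    original_date: Optional[int] = None
    revising_author: Optional[str] = None
    revision_date: Optional[int] = None
    authority_addendum: Optional[str] = None
    botanical_code: bool = False    # flag: botanical or zoological code
    informal_species: bool = False  # flag: informal name such as "sp. A"
    name_changed: bool = False      # flag: name changed from original

    def simplistic_key(self) -> Tuple[str, str]:
        # The "simplistic view": match on genus + species text only.
        return (self.genus, self.species)

    def authority(self) -> str:
        # Zoological convention only; botanical authorities are not handled.
        if not self.original_author:
            return ""
        text = self.original_author
        if self.original_date is not None:
            text += f", {self.original_date}"
        return f"({text})" if self.name_changed else text

    def full_name(self) -> str:
        parts = [self.genus]
        if self.subgenus:
            parts.append(f"({self.subgenus})")
        parts.append(self.species)
        if self.authority():
            parts.append(self.authority())
        return " ".join(parts)


scallop = TaxonName(
    genus="Chlamys", subgenus="Belchlamys", species="aktinos",
    original_author="Petterd", original_date=1886, name_changed=True,
)
print(scallop.full_name())
print(scallop.simplistic_key())
```

The simplistic key drops the subgenus, authority and flags, which is exactly the information needed to tell apart names that share a genus and species string.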

Page 7

Taxon matching … continued

• We have standardised on taxon codes, rather than names, for data storage and matching … names are stored as an attribute of the code (and can be updated in the future as needed)

• Our “CAAB” coding system has evolved over 20+ years - earlier generations of codes are maintained on the system

• New web-based access facility for retrieving latest name for a code, searching for a taxon, etc.

• Same CAAB codes are also used by other marine science/fisheries agencies around Australia

• Facility newly implemented in CAAB to hold ITIS codes, for cross-reference to international systems in the future
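A minimal sketch of the code-based approach described above, assuming a simple in-memory dictionary: data rows store only codes, names are looked up at display time, and earlier-generation codes resolve to their current replacement. The class names and the legacy code "OLD-118" are invented for illustration; the code "37 118001" and its names are taken from the MarLIN example later in this talk.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class TaxonEntry:
    scientific_name: str
    common_name: Optional[str] = None
    superseded_by: Optional[str] = None  # newer code, if this one is obsolete


class TaxonDictionary:
    """Hypothetical code-to-name store; not the real CAAB implementation."""

    def __init__(self) -> None:
        self._entries: Dict[str, TaxonEntry] = {}

    def add(self, code: str, entry: TaxonEntry) -> None:
        self._entries[code] = entry

    def current_code(self, code: str) -> str:
        # Follow the chain of superseded codes to the latest one.
        seen = set()
        while self._entries[code].superseded_by is not None:
            if code in seen:
                raise ValueError(f"circular supersession at {code}")
            seen.add(code)
            code = self._entries[code].superseded_by
        return code

    def current_name(self, code: str) -> str:
        return self._entries[self.current_code(code)].scientific_name

    def rename(self, code: str, new_name: str) -> None:
        # A name is only an attribute of the code, so stored data are untouched.
        self._entries[code].scientific_name = new_name


caab = TaxonDictionary()
caab.add("37 118001", TaxonEntry("Saurida undosquamis", "brushtooth lizardfish"))
# "OLD-118" is a made-up earlier-generation code, for illustration only.
caab.add("OLD-118", TaxonEntry("Saurida undosquamis", superseded_by="37 118001"))

# Data rows hold (code, count); names are resolved only when reporting.
catch = [("OLD-118", 12), ("37 118001", 30)]
report = [(caab.current_code(c), caab.current_name(c), n) for c, n in catch]
print(report)
```

Because names are resolved at read time, a later taxonomic revision needs one update in the dictionary rather than an update to every stored record.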

Page 8

CAAB services available

CAAB user interface

User searches by scientific name, common name or taxon code (or portion thereof)

• Retrieve current sci. name, common name(s), taxon code, taxon report

• Initiate a MarLIN search, ITIS report, FishBase report

• List taxa by CAAB category or family

Application-level requests

• Generate scientific name, common name, current code (if applicable) for a given taxon code

• Call a CAAB taxon report

• List taxa matching query

• Translate an ITIS number to a CAAB code (or vice versa)
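Two of the requests above, matching on a portion of a name or code and translating between ITIS numbers and CAAB codes, might be sketched as follows. The function names are hypothetical, the taxa are copied from the MarLIN species listing later in this talk, and the ITIS numbers are placeholders rather than real ITIS values.

```python
from typing import Dict, List, NamedTuple, Optional


class Taxon(NamedTuple):
    caab_code: str
    scientific_name: str
    common_name: str


TAXA: List[Taxon] = [
    Taxon("23 636004", "Nototodarus gouldi", "Gould's squid"),
    Taxon("28 786002", "Metanephrops boschmai", "Boschma's scampi"),
    Taxon("28 786005", "Metanephrops velutinus", "velvet scampi"),
    Taxon("37 118001", "Saurida undosquamis", "brushtooth lizardfish"),
    Taxon("37 258002", "Beryx splendens", "alfonsino"),
]

# Placeholder numbers only: these are NOT real ITIS identifiers.
ITIS_TO_CAAB: Dict[int, str] = {900001: "37 118001", 900002: "37 258002"}
CAAB_TO_ITIS: Dict[str, int] = {caab: itis for itis, caab in ITIS_TO_CAAB.items()}


def normalise(text: str) -> str:
    # Lower-case and drop whitespace so "37118" matches "37 118001".
    return "".join(text.lower().split())


def match_taxa(query: str) -> List[Taxon]:
    """Return taxa whose code, scientific name or common name contains the query."""
    q = normalise(query)
    return [
        t for t in TAXA
        if q in normalise(t.caab_code)
        or q in normalise(t.scientific_name)
        or q in normalise(t.common_name)
    ]


def itis_to_caab(itis_number: int) -> Optional[str]:
    return ITIS_TO_CAAB.get(itis_number)


def caab_to_itis(caab_code: str) -> Optional[int]:
    return CAAB_TO_ITIS.get(caab_code)


for taxon in match_taxa("scampi"):
    print(taxon.caab_code, taxon.scientific_name)
```

Stripping whitespace before comparison is a shortcut that suits a sketch; a production matcher would more likely search each field with its own rules.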

Page 9

CAAB web interface (current version)

Page 10

Digression #2: taxonomy keywords

• CAAB uses “major categories” (mostly = phyla)

• MarLIN uses Australian “Blue Pages” keywords (c. 100 terms) - independent of CAAB codes (in current implementation)

• NASA GCMD keywords would be an OBIS option (maybe with additions to suit OBIS) - c. 50 currently relevant … could also cross-map to GEMET (EC) list (c. 200)

EARTH SCIENCE >> Biosphere >> Zoology >> Amphibians
EARTH SCIENCE >> Biosphere >> Zoology >> Anemones
EARTH SCIENCE >> Biosphere >> Zoology >> Arachnids
EARTH SCIENCE >> Biosphere >> Zoology >> Arthropods
EARTH SCIENCE >> Biosphere >> Zoology >> Birds
EARTH SCIENCE >> Biosphere >> Zoology >> Centipedes
EARTH SCIENCE >> Biosphere >> Zoology >> Corals
EARTH SCIENCE >> Biosphere >> Zoology >> Crustaceans
EARTH SCIENCE >> Biosphere >> Zoology >> Echinoderms
EARTH SCIENCE >> Biosphere >> Zoology >> Fish
EARTH SCIENCE >> Biosphere >> Zoology >> Flatworms
EARTH SCIENCE >> Biosphere >> Zoology >> Insects
EARTH SCIENCE >> Biosphere >> Zoology >> Invertebrates
EARTH SCIENCE >> Biosphere >> Zoology >> Jellyfish
EARTH SCIENCE >> Biosphere >> Zoology >> Mammals
EARTH SCIENCE >> Biosphere >> Zoology >> Millipedes
EARTH SCIENCE >> Biosphere >> Zoology >> Mollusks
EARTH SCIENCE >> Biosphere >> Zoology >> Reptiles
EARTH SCIENCE >> Biosphere >> Zoology >> Roundworms
EARTH SCIENCE >> Biosphere >> Zoology >> Segmented Worms
EARTH SCIENCE >> Biosphere >> Zoology >> Sponges
EARTH SCIENCE >> Biosphere >> Zoology >> Vertebrates
EARTH SCIENCE >> Biosphere >> Zoology >> Zooplankton

EARTH SCIENCE >> Biosphere >> Microbiota >> Amoebae
EARTH SCIENCE >> Biosphere >> Microbiota >> Bacteria
EARTH SCIENCE >> Biosphere >> Microbiota >> Blue-green Algae
EARTH SCIENCE >> Biosphere >> Microbiota >> Ciliates
EARTH SCIENCE >> Biosphere >> Microbiota >> Coccolithophore
EARTH SCIENCE >> Biosphere >> Microbiota >> Diatoms
EARTH SCIENCE >> Biosphere >> Microbiota >> Flagellates
EARTH SCIENCE >> Biosphere >> Microbiota >> Foraminifers
EARTH SCIENCE >> Biosphere >> Microbiota >> Microalgae
EARTH SCIENCE >> Biosphere >> Microbiota >> Microphyte
EARTH SCIENCE >> Biosphere >> Microbiota >> Phytoplankton
EARTH SCIENCE >> Biosphere >> Microbiota >> Plankton
EARTH SCIENCE >> Biosphere >> Microbiota >> Protist
EARTH SCIENCE >> Biosphere >> Microbiota >> Radiolarians
EARTH SCIENCE >> Biosphere >> Microbiota >> Zooplankton

EARTH SCIENCE >> Biosphere >> Vegetation >> Algae
EARTH SCIENCE >> Biosphere >> Vegetation >> Flowering Plants
EARTH SCIENCE >> Biosphere >> Vegetation >> Lichens
EARTH SCIENCE >> Biosphere >> Vegetation >> Macroalgae
EARTH SCIENCE >> Biosphere >> Vegetation >> Macrophyte
EARTH SCIENCE >> Biosphere >> Vegetation >> Phytoplankton

Page 11

Taxonomy keyword cross-mapping (examples)

GCMD list:

Invertebrates
Sponges
Jellyfish
Anemones
Corals
Flatworms
Roundworms
Segmented Worms
Mollusks
Arthropods
Insects
Arachnids
Echinoderms
Crustaceans
Vertebrates
Fish
Amphibians
Reptiles
Birds
Mammals

GEMET list:

invertebrate … S709
poriferan … S744
coelenterate … S737
coral … S738
nematode … S743
annelid … S711 ++
mollusc … S740
cephalopod … S741
gastropod … S742
arthropod … S713
insect … S719 ++
chelicerate … S714 ++
echinoderm … S739
crustacean … S717
vertebrate … S649
fish … S754
amphibian … S650 ++
reptile … S691 ++
bird … S654 ++
mammal … S664 ++

Page 12

MarLIN - used for data discovery

• MarLIN - based on an Oracle database containing dataset, project, and survey descriptions, plus on-line links to data and web resources

• Holds metadata according to regional (ANZLIC and “Blue Pages”) standards, with additional agency-constructed fields (“extended ANZLIC”)

• Web interface for searching and metadata contribution/update, using HTML, Oracle Web Server and custom PL/SQL application

• Produces lists of datasets, or dataset reports, as requested

• Includes links to pre-formatted data “packets” (now) and to SQuID (in future), for access to the data

NB: no data visualising capability, apart from “thumbnails” showing data extent

Page 13

MarLIN - behind the scenes

• Some 25+ tables, holding the following:

– text-based fields (e.g. title, abstract, contributors, references, etc.)

– keywords, handled as numeric IDs (including taxonomic keywords)

– species/species groups, handled as CAAB codes

– spatial extent, handled as bounding coordinates (max. and min. latitude and longitude)

– time extent, handled as earliest and latest collection date for items in the dataset

– originator organisation, present custodian, survey, contact person, etc., handled as numeric IDs

• Initial search set up by keyword/ID type, spatial coordinates, time period (if desired)

• Then search/browse by subject categories, keywords, taxon names, contributing project, vessel/voyage identifier, location of data, etc.

• Free text search also supported
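The initial search described above (spatial coordinates, time period, then a taxon filter) can be sketched as a bounding-box intersection plus a date-range overlap. This is a minimal in-memory illustration, not the PL/SQL used by MarLIN; the class and function names are hypothetical and the dataset records in the usage example are invented. Boxes that cross the 180° meridian are not handled.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Optional


@dataclass(frozen=True)
class BoundingBox:
    north: float
    south: float
    east: float
    west: float

    def intersects(self, other: "BoundingBox") -> bool:
        # Two boxes overlap if they overlap in both latitude and longitude.
        return (
            self.south <= other.north and other.south <= self.north
            and self.west <= other.east and other.west <= self.east
        )


@dataclass(frozen=True)
class DatasetRecord:
    title: str
    extent: BoundingBox
    start_year: int  # earliest collection date (year only, for brevity)
    end_year: int    # latest collection date
    caab_codes: FrozenSet[str] = frozenset()


def search(
    records: Iterable[DatasetRecord],
    region: Optional[BoundingBox] = None,
    start_year: Optional[int] = None,
    end_year: Optional[int] = None,
    caab_code: Optional[str] = None,
) -> List[DatasetRecord]:
    hits = []
    for rec in records:
        if region is not None and not rec.extent.intersects(region):
            continue
        if start_year is not None and rec.end_year < start_year:
            continue
        if end_year is not None and rec.start_year > end_year:
            continue
        if caab_code is not None and caab_code not in rec.caab_codes:
            continue
        hits.append(rec)
    return hits


# Region coordinates as quoted in the MarLIN example for the North West Shelf.
north_west_shelf = BoundingBox(north=-17, south=-24, east=122, west=114)

# Invented records, for illustration only.
records = [
    DatasetRecord("Dataset A", BoundingBox(-18, -21, 118, 115), 1990, 1990,
                  frozenset({"37 118001"})),
    DatasetRecord("Dataset B", BoundingBox(-40, -44, 149, 144), 1992, 1992,
                  frozenset({"37 118001"})),
    DatasetRecord("Dataset C", BoundingBox(-18, -21, 118, 115), 1998, 1999,
                  frozenset({"37 118001"})),
]
for rec in search(records, north_west_shelf, 1990, 1995, "37 118001"):
    print(rec.title)
```

Storing each extent as four numbers and two dates keeps the first-pass filter cheap, which is presumably why the slide lists bounding coordinates rather than full geometries.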

Page 14

MarLIN search interface

Page 15

Example MarLIN search result - by taxonomic group

subject categories | custodian organisations | vessels | voyages | projects | taxonomic groups | species | habitats | parameters | equipment

The following choices are presently available for MarLIN records in the selected region and/or time period:

Start year: 1990
End year: 1995
Selected region: Australian North West Shelf (stored coordinates used: North=-17, West=114, South=-24, East=122)

Click on any hyperlink to see the full listing for that item.

Invertebrates 4
. . . . Cephalopods 1
. . . . . . Squids 1
. . Crustaceans 2
. . . . Prawns & Shrimps 2
Fishes 4
. . Breams 1
. . Dories 1
. . Leatherjackets 1
. . Perches 3
. . Redfishes 1
. . Roughies 1
. . Snappers 4
. . Whales 1

Page 16

Example MarLIN search result - by species

subject categories | custodian organisations | vessels | voyages | projects | taxonomic groups | species | habitats | parameters | equipment

The following choices are presently available for MarLIN records in the selected region and/or time period:

Start year: 1990
End year: 1995
Selected region: Australian North West Shelf (stored coordinates used: North=-17, West=114, South=-24, East=122)

Click on any hyperlink to see the full listing for that item.

23 636004 Nototodarus gouldi .. Gould's squid 1

28 786002 Metanephrops boschmai .. Boschma's scampi 1

28 786005 Metanephrops velutinus .. velvet scampi 1

28 821001 Ibacus alticrenatus .. deepwater bug 1

28 821002 Ibacus pubescens .. [a shovel-nosed/slipper lobster] 1

37 118001 Saurida undosquamis .. brushtooth lizardfish 3

37 118016 Saurida sp. 2 [in Sainsbury et al, 1985] .. grey lizardfish 3

37 255004 Gephyroberyx darwinii .. Darwin's roughy 1

37 258002 Beryx splendens .. alfonsino 1

(etc.)
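The codes in the listing above are printed as a two-digit group followed by six digits, and taxa in the same genus share the first three of those six digits (e.g. 37 118001 and 37 118016). A small parser can make that structure explicit. Reading the two-digit group as the CAAB category and the next three digits as a family-level group is an assumption inferred from these examples, not something stated in the talk, and the function names are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Tuple


class CaabCode(NamedTuple):
    category: str  # two-digit group, assumed to be the CAAB category
    family: str    # next three digits, assumed to be a family-level group
    species: str   # last three digits


_PATTERN = re.compile(r"^(\d{2})\s?(\d{3})(\d{3})$")


def parse_caab(code: str) -> CaabCode:
    match = _PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not an 8-digit code: {code!r}")
    return CaabCode(*match.groups())


def group_by_family(codes: Iterable[str]) -> Dict[Tuple[str, str], List[str]]:
    """Group codes by their (category, family) prefix, keeping input order."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for code in codes:
        parsed = parse_caab(code)
        groups[(parsed.category, parsed.family)].append(code)
    return dict(groups)


listing = ["28 786002", "28 786005", "37 118001", "37 118016"]
print(group_by_family(listing))
```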

Page 17

Example MarLIN search result - dataset titles

You searched on the following criteria:

Start year: 1990
End year: 1995
Selected region: Australian North West Shelf
CAAB Species: 37 118001 - Saurida undosquamis

There are 3 datasets matching your criteria in MarLIN at this time. Click on the dataset title to view the metadata record for any dataset.

Southern Surveyor Voyage SS 02/90 - Biological Data Overview

Southern Surveyor Voyage SS 04/91 - Biological Data Overview

Southern Surveyor Voyage SS 08/95 - Biological Data Overview

------------------------------------------------------------------------------

Page 18

SQuID - data repository and visualisation tool

• Oracle relational database containing c. 45 tables (present version)

• Holds point, poly-line, and polygon based, geo-referenced data (also time and depth referenced)

• Client runs as Java applet, connects to Oracle data store by Remote Method Invocation (RMI) and JDBC

• Search by spatial coordinates, time period, data “stream” … can subset by survey if desired

• Retrieve atomic-level data for inspection or upload to user’s system

• Basic plotting routines provided, such as:

– geographic distribution of data (sampling points, vessel tracks)

– vertical plots (e.g. temperature, salinity, oxygen vs depth)

– time-based plots (e.g. water temperature measurement through a voyage)

– pie charts for catch composition by number or weight

– length-frequency data, aggregated or by sex of individual

• Taxon handling using CAAB codes (system includes legacy data with obsolete codes)

• Links to MarLIN to display relevant metadata
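Two of the plot types listed above rest on simple aggregations over atomic-level records: catch composition by number or weight, and length-frequency counts, either aggregated or for one sex. The sketch below shows only the aggregation step, not the plotting. The record layouts, function names and sample values are hypothetical and are not taken from the SQuID schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, NamedTuple, Optional


class CatchRecord(NamedTuple):
    caab_code: str
    count: int
    weight_kg: float


class LengthRecord(NamedTuple):
    caab_code: str
    length_cm: float
    sex: str  # e.g. "M", "F" or "U" (unknown)


def catch_composition(records: Iterable[CatchRecord], by: str = "number") -> Dict[str, float]:
    """Percentage share of each taxon code, by number or by weight."""
    if by not in ("number", "weight"):
        raise ValueError("by must be 'number' or 'weight'")
    totals: Dict[str, float] = defaultdict(float)
    for rec in records:
        totals[rec.caab_code] += rec.count if by == "number" else rec.weight_kg
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {code: 100.0 * value / grand_total for code, value in totals.items()}


def length_frequency(
    records: Iterable[LengthRecord],
    bin_cm: float = 1.0,
    sex: Optional[str] = None,
) -> Dict[float, int]:
    """Count individuals per length bin; pass sex to restrict to one sex."""
    counts: Counter = Counter()
    for rec in records:
        if sex is not None and rec.sex != sex:
            continue
        bin_start = (rec.length_cm // bin_cm) * bin_cm
        counts[bin_start] += 1
    return dict(sorted(counts.items()))


# Invented sample values, for illustration only.
catches = [CatchRecord("37 118001", 30, 6.0), CatchRecord("37 258002", 10, 14.0)]
lengths = [
    LengthRecord("37 118001", 21.4, "F"),
    LengthRecord("37 118001", 21.9, "M"),
    LengthRecord("37 118001", 23.0, "F"),
]
print(catch_composition(catches, by="number"))
print(catch_composition(catches, by="weight"))
print(length_frequency(lengths))
```

Keying both summaries on the taxon code rather than the name means legacy records can be folded in once their obsolete codes have been resolved to current ones.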

Page 19

SQuID user interface - version 1.0

Page 20

Example SQuID search result

Page 21

SQuID atomic level data - example

Page 22

Time series data in SQuID

Page 23

SQuID vs MarLIN / CAAB - two different approaches

SQuID - a data-rich browser environment

• Large files uploaded to the browser to allow interactive functions (zoomable maps, on-demand display of sample details, cursor tracking, browser-generated plots)

• Disadvantages: more complex applet to load, longer waits for queries to be serviced, performance on user’s machine may be limiting

MarLIN & CAAB - a minimal browser environment

• No reliance on Java version control, browser plugins, etc.; no load time at startup

• All processing takes place on the server (can maximise performance there) - less stringent requirements for users in hardware terms

• Disadvantage: less real-time interactivity provided (although some workarounds possible)

… May look at a hybrid solution for SQuID v2 - prioritise what level of interactivity/data upload is really needed, handle more at server level

Page 24

Some considerations for OBIS ...

• For agency-specific reasons, we have arrived at separate metadata/data systems. OBIS might want to integrate these two aspects more fully

• Automated generation/maintenance of metadata might be possible (at least in part) and is certainly desirable

• Where would OBIS metadata reside? (centrally or replicated or fully distributed?) - Australian “ASDD” is an example of a fully distributed system, NASA “GCMD” is a centralised one

• Need to decide on taxon handling for OBIS (names or codes), plus standard(s) for higher level searching

• OBIS software should aim to tolerate a diversity of agency-level systems, while encouraging/facilitating “best practice” data management

Page 25

The End

Page 26

CAAB web search