Architecture and Standards for Global Biodiversity Informatics A GBIF and TDWG Perspective Donald Hobern GBIF Program Officer for Data Access and Database Interoperability November 2004


Page 1

Architecture and Standards for Global Biodiversity Informatics

A GBIF and TDWG Perspective

Donald Hobern GBIF Program Officer for Data Access and Database Interoperability

November 2004

Page 2

TDWG and GBIF

TDWG – Taxonomic Databases Working Group
• Not-for-profit scientific and educational association
• Affiliated to the International Union of Biological Sciences
• Mission
  • To provide an international forum for biological data projects
  • To develop and promote the use of standards
  • To facilitate data exchange
• Products
  • Standards/guidelines for recording/exchanging data about organisms
  • Promotion of use of these standards
  • Forum for discussion (especially annual meeting)

GBIF – Global Biodiversity Information Facility
• Megascience activity involving 42 countries/economies and 28 international organisations
• Secretariat based in Copenhagen, Denmark
• Mission
  • Free and universal access to world’s biodiversity data via Internet
  • Sharing primary biodiversity data for society, science and a sustainable future
• Products
  • Registry of biodiversity data resources
  • Index of biodiversity data
  • Software tools
  • Web portals (http://www.gbif.net) and data services

Page 3

Primary biodiversity data

Taxonomic Names
Species: Ostrinia nubilalis (Hübner, 1796)
Synonym: Pyralis nubilalis Hübner, 1796
Genus: Ostrinia Hübner, 1825
Family: Pyralidae
Order: Lepidoptera
Class: Insecta
Vernacular (EN): European Corn-borer
Vernacular (FR): Pyrale du maïs
Vernacular (ES): Piral del maíz
Vernacular (DE): Maiszünsler

Taxonomic Descriptions
Diagnosis: Wingspan 26-30mm; sexually dimorphic; male: forewings ochreous to dark brown; female: forewings pale yellow; …

Specimens and Observations
Collection: DGH Lepidoptera
Record id: DGHEUR_003217
Country: France
Coordinates: 03.047˚E 48.730˚N
Date: 28 June 2003
Collector: Donald Hobern

Ecological Interactions
Foodplant: Zea mays L. 1753
Family: Gramineae

Sequence Data
Locus: AAL35331
Definition: acyl-CoA Z/E11 desaturase
1 mvpyattadg hpekdecfed...

Abiotic Data
Average Rainfall
Location: 48.82°N 2.29°E
Jan Feb Mar Apr ...
182.3 120.6 158.1 204.9 ...

Digital Literature and Web Resources
Pheromones of Ostrinia
http://www.nysaes.cornell.edu/fst/faculty/acree/pheronet/phlist/ostrinia.html

Page 4

Standardised structured data

<?xml version="1.0" encoding="UTF-8"?>
<response xmlns:darwin="http://digir.net/schema/conceptual/darwin/2003/1.0">
  <record>
    <darwin:DateLastModified>2003-06-08</darwin:DateLastModified>
    <darwin:InstitutionCode>DGH</darwin:InstitutionCode>
    <darwin:CollectionCode>DGH Lepidoptera</darwin:CollectionCode>
    <darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber>
    <darwin:ScientificName>Dichomeris marginella (Fabricius, 1781)</darwin:ScientificName>
    <darwin:BasisOfRecord>O</darwin:BasisOfRecord>
    <darwin:Kingdom>Animalia</darwin:Kingdom>
    <darwin:Order>Lepidoptera</darwin:Order>
    <darwin:Family>Gelechiidae</darwin:Family>
    <darwin:Genus>Dichomeris</darwin:Genus>
    <darwin:Species>marginella</darwin:Species>
    <darwin:ScientificNameAuthor>(Fabricius, 1781)</darwin:ScientificNameAuthor>
    <darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy>
    <darwin:Collector>Donald Hobern</darwin:Collector>
    <darwin:YearCollected>2003</darwin:YearCollected>
    <darwin:MonthCollected>06</darwin:MonthCollected>
    <darwin:DayCollected>08</darwin:DayCollected>
    <darwin:ContinentOcean>Europe</darwin:ContinentOcean>
    <darwin:Country>Denmark</darwin:Country>
    <darwin:County>Københavns Amt</darwin:County>
    <darwin:Locality>Merianvej, Hellerup</darwin:Locality>
    <darwin:Longitude>12.538</darwin:Longitude>
    <darwin:Latitude>55.737</darwin:Latitude>
    <darwin:CoordinatePrecision>100</darwin:CoordinatePrecision>
    <darwin:IndividualCount>1</darwin:IndividualCount>
    <darwin:Notes>1 in Skinner trap</darwin:Notes>
  </record>
</response>


Observation record formatted using the Darwin Core
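As an illustration of how a client might read such a record, here is a minimal Python sketch using only the standard library. The sample is an abbreviated copy of the observation record; the namespace URI is the one used in the DiGIR example later in this talk, and the helper name record_to_dict is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the DiGIR example elsewhere in this talk.
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Abbreviated copy of the observation record. The xmlns:darwin declaration
# lets a namespace-aware parser resolve the darwin: prefix.
SAMPLE = """<response xmlns:darwin="http://digir.net/schema/conceptual/darwin/2003/1.0">
  <record>
    <darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber>
    <darwin:ScientificName>Dichomeris marginella (Fabricius, 1781)</darwin:ScientificName>
    <darwin:Country>Denmark</darwin:Country>
    <darwin:Longitude>12.538</darwin:Longitude>
    <darwin:Latitude>55.737</darwin:Latitude>
  </record>
</response>"""


def record_to_dict(record):
    """Return {local element name: text} for the darwin: children of a record."""
    prefix = "{" + DARWIN_NS + "}"
    return {
        child.tag[len(prefix):]: (child.text or "").strip()
        for child in record
        if child.tag.startswith(prefix)
    }


root = ET.fromstring(SAMPLE)
records = [record_to_dict(rec) for rec in root.iter("record")]

for rec in records:
    print(rec["CatalogNumber"], rec["ScientificName"], rec["Latitude"], rec["Longitude"])
```

Because every Darwin Core attribute is a flat child element of the record, a plain dictionary is enough to hold it.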

Page 5

TDWG Data Standards

Darwin Core
• Simple XML data model to represent taxon occurrence records (only core attributes)
• Extensions to handle e.g. curation details, microbial data, image data

ABCD Schema – Access to Biological Collection Data
• More complex XML data model to represent collection or observation data
• Detailed document structure including features for different communities

DiGIR – Distributed Generic Information Retrieval
• XML protocol for searching remote data resources
• Suitable for use with a wide range of different data models

BioCASe Protocol
• XML protocol for searching remote data resources with more complex schema (e.g. ABCD)
• Derived from DiGIR – new unified DiGIR/BioCASe protocol being developed

Taxon Concept Schema
• XML data model currently under development for exchange of nomenclatural/taxonomic data
• First version to be used for implementation in 2005

SDD Schema – Structured Descriptive Data
• XML data model for descriptive data relating to taxa or specimens (highly generalised)
• Suitable for representation of character tables, diagnostic keys, etc.

Page 6

BioCASe-ABCD

<?xml version='1.0' encoding='UTF-8'?>
<response xmlns='http://www.biocase.org/schemas/protocol/1.3' xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' xsi:schemaLocation='http://www.biocase.org/schemas/protocol/1.3 http://www.bgbm.org/biodivinf/schema/protocol_1_3.xsd'>
  <header>
    <version software='Python Interpreter'>2.3 (#46, Jul 29 2003, 18:54:32) [MSC v.1200 32 bit (Intel)]</version>
    <sendTime>2004-10-10T22:22:40+02:00</sendTime>
    <source>192.168.1.12</source>
    <destination>132.181.101.155</destination>
    <type>search</type>
  </header>
  <content recordDropped='0' recordCount='1' recordStart='0' totalSearchHits='1'>
    <DataSets xmlns='http://www.tdwg.org/schemas/abcd/1.2'>
      <DataSet>
        <OriginalSource>
          <SourceInstitutionCode>BGBM</SourceInstitutionCode>
          <SourceName>Bridel Herbar</SourceName>
          <SourceLastUpdatedDate>2004-04-29</SourceLastUpdatedDate>
        </OriginalSource>
        <DatasetDerivations>
          <DatasetDerivation>
            <DateSupplied>2004-07-29</DateSupplied>
            <Supplier>
              <Organisation><OrganisationName>Botanic Garden and Botanical Museum Berlin-Dahlem</OrganisationName></Organisation>
              <Person><PersonName>Andrea Hahn</PersonName></Person>
              <TelephoneNumbers><TelephoneNumber><Number>+49 (0)30 838 50286</Number></TelephoneNumber></TelephoneNumbers>
              <URLs><URL>http://www.bgbm.org</URL></URLs>
            </Supplier>
            <Rights>
              <TermsOfUse>The use of the data is allowed only for non-profit scientific use and for non-profit nature conservation purpose.</TermsOfUse>
              <LegalOwner>
                <Organisation>
                  <OrganisationName>Botanic Garden and Botanical Museum Berlin-Dahlem</OrganisationName>
                  <OrganisationCodes><OrganisationCode>BGBM</OrganisationCode></OrganisationCodes>
                </Organisation>
              </LegalOwner>
              <CopyrightDeclaration>No part of this data base may be copied or reproduced without written permission from the legal owner.</CopyrightDeclaration>
              <IPRDeclaration>The Intellectual Property Rights are held by the legal owner or, in case of living persons, by the collector or determinator.</IPRDeclaration>
            </Rights>
            <Statements><Disclaimer>No responsibility is accepted for the accuracy of the information in this data base.</Disclaimer></Statements>
          </DatasetDerivation>
        </DatasetDerivations>
        <Units>
          <Unit>
            <UnitID>Bridel-1-362</UnitID>
            <Identifications>
              <Identification PreferredIdentificationFlag='0'>
                <TaxonIdentified>
                  <HigherTaxa>
                    <HigherTaxon TaxonRank='Family'>Pottiaceae</HigherTaxon>
                    <HigherTaxon TaxonRank='Kingdom'>Plantae</HigherTaxon>
                  </HigherTaxa>
                  <NameAuthorYearString>Leucophanes octoblepharioides Brid. 1827</NameAuthorYearString>
                  <ScientificNameString>Leucophanes octoblepharioides</ScientificNameString>
                  <AuthorString>Brid. 1827</AuthorString>
                </TaxonIdentified>
                <Identifier><IdentifierPersonName><PersonName>Allen, Noris Salazar</PersonName></IdentifierPersonName></Identifier>
                <IdentificationDate><ISODateTimeBegin>1986-07</ISODateTimeBegin></IdentificationDate>
              </Identification>
            </Identifications>
            <Gathering>
              <GatheringSite>
                <ContinentOrOcean>Asia</ContinentOrOcean>
                <Country><CountryName>NP</CountryName></Country>
                <AreaDetail>Nepal</AreaDetail>
              </GatheringSite>
            </Gathering>
          </Unit>
        </Units>
      </DataSet>
    </DataSets>
  </content>
  <diagnostics><diagnostic>OK</diagnostic></diagnostics>
</response>

[Labels marking sections of the XML: SPECIMEN; COLLECTION (INCLUDING METADATA); PROTOCOL; PROTOCOL]

Note that structure of record elements is part of the content schema (ABCD), not part of the protocol

Page 7

DiGIR-Darwin Core

<?xml version='1.0' encoding='utf-8' ?>
<response xmlns='http://digir.net/schema/protocol/2003/1.0'>
  <header>
    <version>$Revision: 1.14 $</version>
    <sendTime>2004-10-10T13:48:12-0700</sendTime>
    <source resource="martius_munchen_infocomp">http://digir.bebif.be/main/DiGIR.php</source>
    <destination>132.181.101.155</destination>
    <type>search</type>
  </header>
  <content xmlns:darwin='http://digir.net/schema/conceptual/darwin/2003/1.0' xmlns:xsd='http://www.w3.org/2001/XMLSchema'>
    <record>
      <darwin:DateLastModified>2004-05-12</darwin:DateLastModified>
      <darwin:InstitutionCode>Botanische Staatssammlung München</darwin:InstitutionCode>
      <darwin:CollectionCode>Infocomp</darwin:CollectionCode>
      <darwin:CatalogNumber>010702P1</darwin:CatalogNumber>
      <darwin:ScientificName>Wedelia longifolia Mart. ex Baker</darwin:ScientificName>
      <darwin:BasisOfRecord>Label</darwin:BasisOfRecord>
      <darwin:ScientificNameAuthor>Baker, J.G.</darwin:ScientificNameAuthor>
      <darwin:TypeStatus>Holotypus</darwin:TypeStatus>
      <darwin:CollectorNumber>s.n.</darwin:CollectorNumber>
      <darwin:Collector>Martius, C.F.P. von</darwin:Collector>
      <darwin:ContinentOcean>South America</darwin:ContinentOcean>
      <darwin:Country>Brazil</darwin:Country>
      <darwin:Locality>' ... in prov. S. Paulo, inter herbas locis irriguis ad Lorena ... ' (op. cit.)</darwin:Locality>
      <darwin:Notes>Tribe: Heliantheae, Reference protologue: Martius, C.F.P. von: Flora Brasiliensis 6(3): 182-183. 1884., </darwin:Notes>
    </record>
  </content>
  <diagnostics>
    <diagnostic code="STATUS_INTERVAL" severity="info">3600</diagnostic>
    <diagnostic code="STATUS_DATA" severity="info">79,5,2</diagnostic>
    <diagnostic code="MATCH_COUNT" severity="info">1</diagnostic>
    <diagnostic code="RECORD_COUNT" severity="info">1</diagnostic>
    <diagnostic code="END_OF_RECORDS" severity="info">true</diagnostic>
  </diagnostics>
</response>

[Labels marking sections of the XML: SPECIMEN; PROTOCOL; PROTOCOL]

Note that structure of record elements is part of the protocol – the content schema (Darwin Core) only defines the attributes describing the record
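The split between protocol and content schema can be made concrete with a short Python sketch. It assumes an abbreviated copy of the DiGIR response above; the function name split_response is hypothetical. The envelope elements (response, content, record, diagnostics) are looked up in the protocol namespace, and only the descriptors inside each record are looked up in the Darwin Core namespace.

```python
import xml.etree.ElementTree as ET

# Both namespace URIs are copied from the DiGIR response shown in this talk.
PROTOCOL_NS = "http://digir.net/schema/protocol/2003/1.0"
DARWIN_NS = "http://digir.net/schema/conceptual/darwin/2003/1.0"

# Abbreviated copy of the response, keeping two descriptors and two diagnostics.
SAMPLE = """<response xmlns='http://digir.net/schema/protocol/2003/1.0'>
  <header><type>search</type></header>
  <content xmlns:darwin='http://digir.net/schema/conceptual/darwin/2003/1.0'>
    <record>
      <darwin:CatalogNumber>010702P1</darwin:CatalogNumber>
      <darwin:Country>Brazil</darwin:Country>
    </record>
  </content>
  <diagnostics>
    <diagnostic code="MATCH_COUNT" severity="info">1</diagnostic>
    <diagnostic code="END_OF_RECORDS" severity="info">true</diagnostic>
  </diagnostics>
</response>"""


def split_response(xml_text):
    """Return (diagnostics, records) from a DiGIR-style response."""
    root = ET.fromstring(xml_text)
    ns = {"p": PROTOCOL_NS}

    # Protocol level: diagnostics keyed by their code attribute.
    diagnostics = {
        d.get("code"): d.text
        for d in root.findall("p:diagnostics/p:diagnostic", ns)
    }

    # Content level: Darwin Core descriptors inside each protocol <record>.
    darwin_prefix = "{" + DARWIN_NS + "}"
    records = []
    for rec in root.findall("p:content/p:record", ns):
        records.append({
            child.tag[len(darwin_prefix):]: child.text
            for child in rec
            if child.tag.startswith(darwin_prefix)
        })
    return diagnostics, records


diagnostics, records = split_response(SAMPLE)
print(diagnostics)
print(records)
```

A record could carry descriptors from other namespaces as well; this sketch simply ignores anything that is not Darwin Core.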

Page 8

BioCASe-ABCD compared to DiGIR-Darwin Core

BioCASe-ABCD model
• Document-based (response document includes metadata and records as a structured package)

• Strengths

• No problem with modelling complex nested structures and repeating elements

• Fits perfectly with UBIF proposal – ABCD DataSet elements and ABCD Metadata could readily be standardised with the DataSet/Metadata structures from other TDWG standards such as Structured Descriptive Data (SDD) and Taxon Concept Schema (TCS) – with rather little work. DataSets from all three of these could be combined to form a single document with cross-references between sections.

• Possible weaknesses

• Not simple for specialist networks to extend the structure with additional elements of their own (requires well-planned open extension points to be designed into the schema), especially if a provider wishes to be part of more than one such specialist network simultaneously.

• (At present) all elements in the ABCD schema are versioned together. Handling an updated version of the schema requires significant additional effort on the part of providers and users. For example, adding new elements to support plant genetic resource data – without changing the elements for museum/herbarium specimens – requires all users to handle a new version of the schema.

DiGIR-Darwin Core model
• Record-based (response returns a set of records which may contain descriptor elements from any schema)

• Strengths

• Massively flexible and extensible model allowing different networks to use a common protocol and shared core elements alongside their own network-specific extensions.

• (In integrated protocol version) could return ABCD elements as part of response records. If ABCD is treated as a library of elements, this fits even better.

• Model maps well to supporting a flexible object-oriented data model for biodiversity informatics.

• Possible weaknesses

• (In existing version) cannot readily handle complex data structures with nested repeating elements.

• Records have no intrinsic data type – currently relies on an implicit understanding between user and data provider.

Page 9

Exchange via web services

[Diagram labels: Heterogeneous Databases, Web Services, Standardised Structured Data, Internet, Users. Three exchanges are shown, each a <request> answered by a <response> containing <record> … elements.]

Page 10

GBIF network of biodiversity data nodes

[Diagram: four data nodes connected to the GBIF Network]

Museum A
• Specimens: Flowering Plants of Africa
• Specimens: Proteaceae of the World
• Taxon Names: Proteaceae of the World

Observer Network B
• Observations: Birds of Central America
• Observations: Butterflies of Belize
• Checklist: Birds of Belize

Museum C
• Specimens: Mammals of North Europe
• Taxon Names: Mammals of the World
• Further Links: Mammals

University D
• Specimens: Bacteria Cultures
• Taxon Names: Bacteria
• Further Links: Bacteria

Connections to the GBIF Network are labelled DiGIR-DarwinCore, BioCASe-ABCD and Taxon Concept Schema.

Page 11

Central GBIF registry of data nodes

Data Node            Type of data           Taxon              Region            Records
Museum A             Specimen/Observation   Flowering Plants   Africa            327000
                     Specimen/Observation   Proteaceae         World             23000
                     Taxonomic Names        Proteaceae         World             1500
Observer Network B   Specimen/Observation   Birds              Central America   68500
                     Specimen/Observation   Butterflies        Belize            4200
                     Name List              Birds              Belize            587
Museum C             Specimen/Observation   Mammals            North Europe      1800
                     Taxonomic Names        Mammals            World             8000
                     General Resources      Mammals            World             600
University D         Specimen/Observation   Bacteria           World             1200
                     Taxonomic Names        Bacteria           World             5000
                     General Resources      Bacteria           World             400
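A registry of this kind is essentially a list of resource descriptions that can be filtered before any data node is contacted. The Python sketch below holds the rows of the table in memory; the Resource class and the find_resources function are hypothetical names chosen for illustration, not part of any GBIF software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    node: str
    data_type: str
    taxon: str
    region: str
    records: int


# Rows copied from the registry table in this talk.
REGISTRY = [
    Resource("Museum A", "Specimen/Observation", "Flowering Plants", "Africa", 327000),
    Resource("Museum A", "Specimen/Observation", "Proteaceae", "World", 23000),
    Resource("Museum A", "Taxonomic Names", "Proteaceae", "World", 1500),
    Resource("Observer Network B", "Specimen/Observation", "Birds", "Central America", 68500),
    Resource("Observer Network B", "Specimen/Observation", "Butterflies", "Belize", 4200),
    Resource("Observer Network B", "Name List", "Birds", "Belize", 587),
    Resource("Museum C", "Specimen/Observation", "Mammals", "North Europe", 1800),
    Resource("Museum C", "Taxonomic Names", "Mammals", "World", 8000),
    Resource("Museum C", "General Resources", "Mammals", "World", 600),
    Resource("University D", "Specimen/Observation", "Bacteria", "World", 1200),
    Resource("University D", "Taxonomic Names", "Bacteria", "World", 5000),
    Resource("University D", "General Resources", "Bacteria", "World", 400),
]


def find_resources(registry, data_type=None, taxon=None, region=None):
    """Return the resources matching every criterion that is not None."""
    return [
        r for r in registry
        if (data_type is None or r.data_type == data_type)
        and (taxon is None or r.taxon == taxon)
        and (region is None or r.region == region)
    ]


mammal_resources = find_resources(REGISTRY, taxon="Mammals")
specimen_total = sum(
    r.records for r in find_resources(REGISTRY, data_type="Specimen/Observation")
)
print([r.data_type for r in mammal_resources], specimen_total)
```

Matching here is by exact string; a real registry would need to relate taxa and regions hierarchically (for example, Proteaceae within Flowering Plants), which this sketch does not attempt.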

Page 12

DiGIR-BioCASe Protocol and Nested Networks

User requests
• Get Darwin Core records where darwin:ScientificName equals Puma concolor from any provider.
• Get standard plant genetic resource Passport data for all crop types.
• Get full set of Soy Bean crop descriptors.
• Get complete ABCD documents from each BioCASe provider.
• Get DiGIR-style records each with a set of Darwin Core descriptors and a complete ABCD Unit.

Providers
• MaNIS Provider: Darwin Core, Curatorial, Taxon Occurrence
• OBIS Provider: Darwin Core, Marine, Taxon Occurrence
• IPGRI Banana Provider: Darwin Core, IPGRI Passport, Banana Descriptor, Taxon Occurrence
• IPGRI Soy Bean Provider: Darwin Core, IPGRI Passport, Soy Bean Descriptor, Taxon Occurrence
• BioCASe Provider: Darwin Core, ABCD, Taxon Occurrence
• BioCASe Provider: Darwin Core, ABCD, Taxon Occurrence
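One way to picture nested networks is as a lookup from the schemas a request needs to the providers that support all of them. The Python sketch below is an illustration under stated assumptions: the provider-to-schema mapping follows the diagram, the two BioCASe providers are numbered 1 and 2 only to tell them apart, and providers_supporting is a hypothetical helper, not part of the DiGIR or BioCASe protocols.

```python
# Schemas supported by each provider, following the diagram in this talk.
# The "Taxon Occurrence" label shown for every provider is omitted here.
PROVIDERS = {
    "MaNIS Provider": {"Darwin Core", "Curatorial"},
    "OBIS Provider": {"Darwin Core", "Marine"},
    "IPGRI Banana Provider": {"Darwin Core", "IPGRI Passport", "Banana Descriptor"},
    "IPGRI Soy Bean Provider": {"Darwin Core", "IPGRI Passport", "Soy Bean Descriptor"},
    "BioCASe Provider 1": {"Darwin Core", "ABCD"},  # numbering is an assumption
    "BioCASe Provider 2": {"Darwin Core", "ABCD"},
}


def providers_supporting(required_schemas):
    """Return, sorted by name, the providers that support every required schema."""
    required = set(required_schemas)
    return sorted(
        name for name, schemas in PROVIDERS.items() if required <= schemas
    )


# The requests from the slide, expressed as the schemas they rely on.
print(providers_supporting({"Darwin Core"}))
print(providers_supporting({"IPGRI Passport"}))
print(providers_supporting({"Soy Bean Descriptor"}))
print(providers_supporting({"Darwin Core", "ABCD"}))
```

The shared Darwin Core layer is what lets a single query reach every provider, while the network-specific schemas narrow the audience for more specialised requests.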

Page 13

GBIF index to biodiversity data

[Diagram: components of the GBIF index]
• User requests
• Biodiversity Data Access
• Biodiversity Data Index
• Taxonomic Name Service (ECAT)
• Catalogue of Life
• GBIF Data Nodes: Specimen Data, Observation Data, Name Lists, Links to other data
• Connections labelled DiGIR/BioCASe (twice) and Taxon Concept

Page 14

GBIF data index

Page 15

Central portal to biodiversity data

Show specimen records for Erinaceus europaeus

GBIF Portal

6 records
35 records
17 records
0 records

58 records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
4. Museum A Avignon
5. Museum A Avignon
6. Museum A Marseille
7. Observer B Norwich
8. Observer B Norwich
9. Observer B Southampton
. . .
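The portal's role in this example is to merge per-node results into one numbered list. The following Python sketch shows that step. The four counts come from the slide, but which node returned which count is an inference from the numbered list, and only the nine records actually shown are included; the function name combine is hypothetical.

```python
from typing import Dict, List, Tuple

# Counts from the slide; the assignment of counts to nodes is an assumption.
NODE_COUNTS = {"Museum A": 6, "Observer B": 35, "Museum C": 17, "University D": 0}

# Only the records listed on the slide (the remaining ones are elided there).
SHOWN_RESULTS: Dict[str, List[str]] = {
    "Museum A": ["Paris", "Nice", "Paris", "Avignon", "Avignon", "Marseille"],
    "Observer B": ["Norwich", "Norwich", "Southampton"],
}


def combine(node_results: Dict[str, List[str]]) -> List[Tuple[int, str, str]]:
    """Concatenate per-node result lists into one numbered list."""
    combined: List[Tuple[int, str, str]] = []
    for node, localities in node_results.items():
        for locality in localities:
            combined.append((len(combined) + 1, node, locality))
    return combined


total = sum(NODE_COUNTS.values())
print(f"{total} records:")
for number, node, locality in combine(SHOWN_RESULTS):
    print(f"{number}. {node} {locality}")
```

The total of the four node counts matches the 58 records reported by the portal.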

Page 16

GBIF Data Portal

Page 17

GBIF Data Portal

Page 18

Participant Nodes with tailored information

Show specimen records for Erinaceus europaeus from France

58 GBIF records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
4. Museum A Avignon
5. Museum A Avignon
6. Museum A Marseille
7. Observer B Norwich
8. Observer B Norwich
9. Observer B Southampton
. . .
58. Museum C Toulouse

GBIF Portal

Geographic Services

26 records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
4. Museum A Avignon
5. Museum A Avignon
6. Museum A Marseille
23. Observer B Calais
29. Observer B Paris
. . .
58. Museum C Toulouse

GBIF France

Show occurrence of Hérisson d’Europe

Page 19

Flexible applications

Provide key to identify reportable Curculionidae

GBIF

WANTED

List of names of reportable pest species

Descriptive data

1. Elytra brown .... 2
   Elytra not brown .... 5

2. Thorax black
   Thorax brown .... 3

3. Hind tibia black .... Non-pest
   Hind tibia brown .... 4

4. Hind femur brown
   Hind femur black .... Non-pest

. . .

A customs official discovers specimens of a possible pest species of weevil (Curculionidae) on a consignment of agricultural produce at a port of entry. The GBIF Network generates an identification key for pest weevil species so that the official can determine the appropriate response. This application requires access to data from a wide range of sources, including those GBIF participants that are organisations.
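A dichotomous key such as the fragment on this slide can be represented as a table of couplets and followed mechanically. The Python sketch below is only an illustration: the couplets are the four shown on the slide, the outcomes that the slide does not show are stored as None rather than guessed, and the function name run_key is hypothetical.

```python
# Each couplet maps to its leads: (statement, target). A target is the number
# of the next couplet, a result string, or None where the slide shows no result.
KEY = {
    1: [("Elytra brown", 2), ("Elytra not brown", 5)],
    2: [("Thorax black", None), ("Thorax brown", 3)],
    3: [("Hind tibia black", "Non-pest"), ("Hind tibia brown", 4)],
    4: [("Hind femur brown", None), ("Hind femur black", "Non-pest")],
}


def run_key(key, observed, start=1):
    """Follow the key using the set of statements that hold for a specimen.

    Returns (path, result). The result is "undetermined" when no lead matches
    or when the key continues beyond the couplets available here.
    """
    couplet = start
    path = []
    while couplet in key:
        for statement, target in key[couplet]:
            if statement in observed:
                path.append(statement)
                break
        else:
            return path, "undetermined"
        if isinstance(target, int):
            couplet = target
        else:
            return path, target
    return path, "undetermined"


specimen = {"Elytra brown", "Thorax brown", "Hind tibia black"}
print(run_key(KEY, specimen))
```

In the scenario described, the couplets would be generated from descriptive data and the list of reportable pest names rather than written by hand.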

Page 20

Monitoring of data usage

Show specimen records for Upupa epops

GBIF Portal

Data Usage Reports

81 records:
1. Museum A Paris
2. Museum A Nice
3. Museum A Paris
. . .

Show bird specimen records from Nice

126 records:
1. Museum A Upupa epops
2. Museum A Apus apus
3. Museum A Athene noctua
. . .

Data Usage Logs

GBIF Usage: Museum A
16 August 2003 Search: Upupa epops, 5 records returned
18 August 2003 Search: Birds from Nice, 16 records returned

GBIF Usage: Observer B
16 August 2003 Search: Upupa epops, 2 records returned
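Usage monitoring of this kind amounts to recording, for every search, how many records each provider contributed, and then reporting per provider. A minimal Python sketch, with a hypothetical UsageLog class and the figures from the usage reports on the slide:

```python
from collections import defaultdict


class UsageLog:
    """Collects per-provider usage entries for searches made via a portal."""

    def __init__(self):
        self._entries = defaultdict(list)

    def record_search(self, date, query, hits_by_provider):
        """Log one search; providers that returned nothing get no entry."""
        for provider, count in hits_by_provider.items():
            if count > 0:
                self._entries[provider].append((date, query, count))

    def report(self, provider):
        """Return the usage lines for one provider, in the order logged."""
        return [
            f"{date} Search: {query}, {count} records returned"
            for date, query, count in self._entries.get(provider, [])
        ]


log = UsageLog()
log.record_search("16 August 2003", "Upupa epops", {"Museum A": 5, "Observer B": 2})
log.record_search("18 August 2003", "Birds from Nice", {"Museum A": 16, "Observer B": 0})

for provider in ("Museum A", "Observer B"):
    print(f"GBIF Usage: {provider}")
    for line in log.report(provider):
        print(line)
```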

Page 21

Future activity

Globally unique identifiers
• TDWG-GBIF collaboration to develop models to allow data providers to attach persistent identifiers to their data records
• Allow software to detect multiple instances of the same record
• Allow users to save resolvable references to specimens, collections, taxon concepts, etc.

Schema repository
• Central library of information on data models
• Resource for discovering documentation or mappings between different schemas
• Better support for intelligent software applications

Data validation tools
• Framework for running sets of validation tests against XML data (content values, controlled vocabularies, relating georeference data to named localities, etc.)
• Support different uses (data providers to locate possible problems in data; users to assure themselves of suitability of data; GBIF to provide metadata on data completeness/coherence)
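A validation framework of this sort can be pictured as a list of independent checks run against each record. The Python sketch below is an assumption-laden illustration, not a description of any planned GBIF tool: the record is a dictionary of Darwin Core element names as used in the examples in this talk, the function names are hypothetical, and the BasisOfRecord vocabulary contains only the two values that happen to appear in those examples.

```python
from datetime import date

# Hypothetical vocabulary: just the values seen in this talk's examples.
ALLOWED_BASIS_OF_RECORD = {"O", "Label"}


def check_coordinates(record):
    """Latitude and Longitude must be numeric and within range."""
    try:
        lat = float(record["Latitude"])
        lon = float(record["Longitude"])
    except (KeyError, ValueError):
        return "Latitude/Longitude missing or not numeric"
    if not (-90 <= lat <= 90 and -180 <= lon <= 180):
        return "coordinates out of range"
    return None


def check_collection_date(record):
    """Year, month and day collected must form a real calendar date."""
    try:
        date(int(record["YearCollected"]),
             int(record["MonthCollected"]),
             int(record["DayCollected"]))
    except (KeyError, ValueError):
        return "collection date missing or invalid"
    return None


def check_basis_of_record(record):
    """BasisOfRecord must come from the controlled vocabulary."""
    if record.get("BasisOfRecord") not in ALLOWED_BASIS_OF_RECORD:
        return "BasisOfRecord not in controlled vocabulary"
    return None


CHECKS = [check_coordinates, check_collection_date, check_basis_of_record]


def validate(record):
    """Run every check and return the list of problems found."""
    problems = []
    for check in CHECKS:
        message = check(record)
        if message is not None:
            problems.append(message)
    return problems


good = {"Latitude": "55.737", "Longitude": "12.538", "YearCollected": "2003",
        "MonthCollected": "06", "DayCollected": "08", "BasisOfRecord": "O"}
bad = dict(good, Latitude="155.737", DayCollected="31")
print(validate(good))
print(validate(bad))
```

Because each check is a separate function, the same set could serve data providers, users and GBIF alike, with each party choosing which checks to run.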

Access to a wide range of taxonomic name data
• Taxonomic/nomenclatural authorities (nomenclators, global species databases, revisions, etc.)
• Lists used by different communities/organisations (red lists, pest species, regional checklists, etc.)

Customised portals
• Organised according to taxon lists used by each user
• Notifications of new data based on user profiles (taxonomic, geographic, etc.)

Page 22

Links

Taxonomic Databases Working Group
http://www.tdwg.org/ (including access to working groups)

Global Biodiversity Information Facility
http://www.gbif.org/ (Communications Portal)
http://www.gbif.net/ (Data Portal)
http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architecture (Architecture documents)