the gbif information architecture technological integration at the global level donald hobern gbif...
TRANSCRIPT
The GBIF Information ArchitectureTechnological integration at the global level
Donald Hobern GBIF Program Officer for Data Access and Database Interoperability
March 2005
GBIF and TDWG
GBIF – Global Biodiversity Information Facility• Megascience activity involving 43 countries/economies and 28 international organisations• Secretariat based in Copenhagen, Denmark• Mission
• Free and universal access to world’s biodiversity data via Internet • Sharing primary biodiversity data for society, science and a sustainable future
• Products• Registry of biodiversity data resources• Index of biodiversity data• Software tools• Web portals (http://www.gbif.net) and data services
TDWG – Taxonomic Databases Working Group• Not-for-profit scientific and educational association• Affiliated to the International Union of Biological Sciences• Mission
• To provide an international forum for biological data projects • To develop and promote the use of standards• To facilitate data exchange
• Products• Standards/guidelines for recording/exchanging data about organisms• Promotion of use of these standards• Forum for discussion (especially annual meeting)
Vernacular (FR): Pyrale du maïs
Vernacular (ES): Piral del maíz
Vernacular (DE): Maiszünsler
Diagnosis: Wingspan 26-30mm; sexually dimorphic;male: forewings ochreous to dark brown; female: forewings pale yellow; …
Foodplant: Zea mais L. 1753
Biodiversity data
Species: Ostrinia nubilalis (Hübner, 1796)
Family: Pyralidae
Order: Lepidoptera
Class: Insecta
Genus: Ostrinia Hübner, 1825
Vernacular (EN): European Corn-borer
Family: Gramineae
Taxonomic Names
Collection: DGH LepidopteraRecord id: DGHEUR_003217Country: FranceCoordinates: 03.047˚E 48.730˚NDate: 28 June 2003Collector: Donald Hobern
Specimens and Observations
Ecological Interactions
Locus: AAL35331Definition: acyl-CoA Z/E11 desaturase
1 mvpyattadg hpekdecfed...
Sequence Data
Average RainfallLocation: 48.82°N 2.29°E Jan Feb Mar Apr ...182.3 120.6 158.1 204.9 ...
Abiotic Data
Taxonomic Descriptions
Pheromones of Ostriniahttp://www.nysaes.cornell.edu/fst/faculty/acree/pheronet/phlist/ostrinia.html
Digital Literature and Web Resources
Synonym: Pyralis nubilalis Hübner, 1796
Resources vary in interoperability and scope for dynamic integration
• Images, audio, etc.• Require text (metadata) for interpretation• Very hard for software to interpret automatically• (Normally) link to resource as a unit (e.g. via URL in web page)
• Unstructured text documents (including e.g. PDF, Word, HTML)• Can be indexed for full-text search (Google)• Hard for software to interpret automatically• Normally link to document as a unit (e.g. URL in web page)
• Structured XML documents• Can be indexed for full-text search• Can be parsed and interpreted by software tools (for known schemas)• Can link to document as a unit or to fragments of a document• Can use structure to develop highly-specific searches• Normally process whole documents (not random-access)
• Databases• Can be exposed as structured XML documents• Can process any subset of the data (random-access selection of specific records)
Data resources
<?xml version="1.0" encoding="UTF-8"?><response>
<record><darwin:DateLastModified>2003-06-08</darwin:DateLastModified><darwin:InstitutionCode>DGH</darwin:InstitutionCode><darwin:CollectionCode>DGH Lepidoptera</darwin:CollectionCode><darwin:CatalogNumber>DGHEUR_0002976</darwin:CatalogNumber><darwin:ScientificName>Dichomeris marginella (Fabricius, 1781)</darwin:ScientificName><darwin:BasisOfRecord>O</darwin:BasisOfRecord><darwin:Kingdom>Animalia</darwin:Kingdom><darwin:Order>Lepidoptera</darwin:Order><darwin:Family>Gelechiidae</darwin:Family><darwin:Genus>Dichomeris</darwin:Genus><darwin:Species>marginella</darwin:Species><darwin:ScientificNameAuthor>(Fabricius, 1781)</darwin:ScientificNameAuthor><darwin:IdentifiedBy>Donald Hobern</darwin:IdentifiedBy><darwin:Collector>Donald Hobern</darwin:Collector><darwin:YearCollected>2003</darwin:YearCollected><darwin:MonthCollected>06</darwin:MonthCollected><darwin:DayCollected>08</darwin:DayCollected><darwin:ContinentOcean>Europe</darwin:ContinentOcean><darwin:Country>Denmark</darwin:Country><darwin:County>Københavns Amt</darwin:County><darwin:Locality>Merianvej, Hellerup</darwin:Locality><darwin:Longitude>12.538</darwin:Longitude><darwin:Latitude>55.737</darwin:Latitude><darwin:CoordinatePrecision>100</darwin:CoordinatePrecision><darwin:IndividualCount>1</darwin:IndividualCount><darwin:Notes>1 in Skinner trap</darwin:Notes>
</record></response>
June 2003 S M T W T F S 1 2 3 4 5 6 7 8 9 10 11 12 13 1415 16 17 18 19 20 2122 23 24 25 26 27 2829 30
Observation record formatted using the Darwin Core
Structured XML data
Some TDWG Data Standards
Darwin Core• Simple XML data model to represent taxon occurrence records (only core attributes)• Extensions to handle e.g. curation details, microbial data, image data
ABCD Schema – Access to Biological Collection Data• More complex XML data model to represent collection or observation data• Detailed document structure including features for different communities
DiGIR – Distributed Generic Information Retrieval• XML protocol for searching remote data resources • Suitable for use with a wide range of different data models
BioCASe Protocol• XML protocol for searching remote data resources with more complex schema (e.g. ABCD)• Derived from DiGIR – new unified DiGIR/BioCASe protocol being developed (”TAPIR”)
Taxon Concept Schema• XML data model currently under development for exchange of nomenclatural/taxonomic data• First version to be used for implementation in 2005
SDD Schema – Structured Descriptive Data• XML data model for descriptive data relating to taxa or specimens (highly generalised)• Suitable for representation of character tables, diagnostic keys, etc.
DiGIR Protocol <?xml version="1.0" encoding="UTF-8"?><request xmlns="http://digir.net/schema/protocol/2003/1.0" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:darwin="http://digir.net/schema/conceptual/darwin/2003/1.0"> <header> <version>1.0.0</version> <sendTime>2004-07-20T09:56:00-0500</sendTime> <source>127.0.0.1</source> <destination resource="HDS">http://any.net/digir/DiGIR.php</destination> <type>search</type> </header> <search> <filter> <and> <equals> <darwin:Collector>Thomas, J.</darwin:Collector> </equals> <equals> <darwin:Genus>Rubus</darwin:Genus> </equals> </and> </filter> <records limit="3" start="0"> <structure> <xsd:element name="record“ minOccurs="0“ maxOccurs="unbounded"> <xsd:complexType> <xsd:sequence> <xsd:element ref="darwin:InstitutionCode"/> <xsd:element ref="darwin:CollectionCode"/> <xsd:element ref="darwin:CatalogNumber"/> <xsd:element ref="darwin:ScientificName"/> <xsd:element ref="darwin:Latitude"/> <xsd:element ref="darwin:Longitude"/> </xsd:sequence> </xsd:complexType> </xsd:element> </structure> </records> <count>true</count> </search></request>
<?xml version="1.0" encoding="utf-8" ?><response xmlns="http://digir.net/schema/protocol/2003/1.0"> <header> <version>$Revision: 1.96 $</version> <sendTime>2004-07-20T09:56:03-0500</sendTime> <source resource="HDS">http://any.net/digir/DiGIR.php</source> <destination>127.0.0.1</destination> <type>search</type> </header> <content xmlns:darwin="http://digir.net/schema/conceptual/darwin/2003/1.0"> <record> <darwin:InstitutionCode>INS</darwin:InstitutionCode> <darwin:CollectionCode>COL</darwin:CollectionCode> <darwin:CatalogNumber>27694</darwin:CatalogNumber> <darwin:ScientificName>Rubus brasiliensis</darwin:ScientificName> <darwin:Latitude>-23.6625</darwin:Latitude> <darwin:Longitude>-46.7738</darwin:Longitude> </record> <record> <darwin:InstitutionCode>INS</darwin:InstitutionCode> <darwin:CollectionCode>COL</darwin:CollectionCode> <darwin:CatalogNumber>27907</darwin:CatalogNumber> <darwin:ScientificName>Rubus brasiliensis</darwin:ScientificName> <darwin:Latitude>-23.412</darwin:Latitude> <darwin:Longitude>-46.7899</darwin:Longitude> </record> <record> <darwin:InstitutionCode>INS</darwin:InstitutionCode> <darwin:CollectionCode>COL</darwin:CollectionCode> <darwin:CatalogNumber>27908</darwin:CatalogNumber> <darwin:ScientificName>Rubus urticaefolius</darwin:ScientificName> <darwin:Latitude>-23.412</darwin:Latitude> <darwin:Longitude>-46.7899</darwin:Longitude> </record> </content> <diagnostics> <diagnostic code="STATUS_INTERVAL" severity="info">3600</diagnostic> <diagnostic code="STATUS_DATA" severity="info">191,1,1</diagnostic> <diagnostic code="MATCH_COUNT" severity="info">26</diagnostic> <diagnostic code="RECORD_COUNT" severity="info">3</diagnostic> <diagnostic code="END_OF_RECORDS" severity="info">false</diagnostic> </diagnostics></response>
Request Response
DiGIR-BioCASe Protocol and Nested Networks
User
Get Darwin Core records where darwin:ScientificName equals Puma concolor from any provider.
MaNIS Provider
Darwin Core
Curatorial
Taxon Occurrence
OBIS Provider
Darwin Core
Marine
Taxon Occurrence
IPGRI Banana Provider
Darwin Core
IPGRI Passport
Banana Descriptor
Taxon Occurrence
IPGRI Soy Bean Provider
Darwin Core
IPGRI Passport
Soy Bean Descriptor
Taxon Occurrence
BioCASe Provider
Darwin Core
ABCD
Taxon Occurrence
BioCASe Provider
Darwin Core
ABCD
Taxon Occurrence
Get standard plant genetic resource Passport data for all crop types.
Get full set of Soy Bean crop descriptors.
Get complete ABCD documents from each BioCASe provider
Get DiGIR-style records each with a set of Darwin Core descriptors and a complete ABCD Unit
Data navigation
SpecimenCatalogue Number
SpecimenCatalogue Number
SpecimenCatalogue Number
Taxon ConceptName
Taxon ConceptName
Identified as
Identified as
Taxon ConceptName
PublicationTitle, Year
PublicationTitle, Year
PublicationTitle, Year
SynonymyRelationship
AuthorName
AuthorName
CollectorName
CollectorName
SequenceData
BarcodeData
BarcodeData
BarcodeData
Type material
Type material
Identified as
XML
XML
Referenced material
CharacterData
CharacterData
CharacterData
Type material
Integration strategiesDifferent approaches to integrating distributed information:
• Data Warehouseo Bring all data togethero Data owners can still maintain control over their datao Single agreed central data modelo All data in one place any complex query can be constructed
• Federated Networko Data remain with original ownerso Data owners can manage data as best suits themo Agreed data models for exchangeo Queries can become expensive as number of resources increases
• Indexed Network o Data remain with original ownerso Data owners can manage data as best suits themo Agreed data models for exchangeo Most frequently-issued queries can be handled in index
• Selection may be influenced by community culture
• Network may include warehouses/indices for specific areas
• Key driver should be user requirements (use cases)
Central services
GBIF plans to offer the following central services:
• Service Registry – discover data resources
• Data Index and Aggregated Data Services – find distributed data
• Schema Repository – interpret structured data
• Feedback Mechanisms – add value by connecting users to providers
• Data Validation Services – report potential inconsistencies within data
• Usage Reporting Mechanisms – recognise providers for sharing data
• Globally Unique Identifier Services – provide persistent references to data records
Networked data resources
Specimens:Flowering Plants of
Africa
Specimens:Proteaceae of
the World
Taxon Names:Proteaceae of
the World
Observations:Birds of Central
America
Observations:Butterflies of
Belize
Checklist:Birds of Belize
Specimens:Mammals of North Europe
Taxon Names:Mammals of
the World
Specimens:Bacteria Cultures
Taxon Names:Bacteria
Further Links:Bacteria
Further Links:Mammals
Museum A
Museum C University D
Observer Network B
GBIFNetwork
Service registry
Data Node Type of data Taxon Region Records
Museum A Specimen/Observation
Flowering Plants Africa 327000
Specimen/Observation
Proteaceae World 23000
Taxonomic Names Proteaceae World 1500
Observer Network B
Specimen/Observation
Birds Central America 68500
Specimen/Observation
Butterflies Belize 4200
Name List Birds Belize 587
Museum C Specimen/Observation
Mammals North Europe 1800
Taxonomic Names Mammals World 8000
General Resources Mammals World 600
University D Specimen/Observation
Bacteria World 1200
Taxonomic Names Bacteria World 5000
General Resources Bacteria World 400
Data index
Catalogue of Life
Biodiversity Data Access
Biodiversity DataIndex
Taxonomic Name
Service (ECAT)
Userrequests
GBIF Data Nodes
Specimen DataSpecimen DataLinks to other data
Specimen DataSpecimen DataName Lists
Specimen DataSpecimen DataObservation Data
Specimen DataSpecimen DataSpecimen Data
Data index
Distributed data access
GBIF Service Registry
GBIF Global Portal Regional PortalThematic Portal
Taxonomy / Nomenclature
PDFHTML web page
Struc-tured XML
Image
GBIF Data Index Entrez
Software Applications
Discovery
Indexing
Presentation &Analysis
ResourcesSpecimens
Descriptive Data
Sequences / Barcodes
OBIS
Network-specific
Registries
Structured Unstructured
Databases Documents
Access XML XMLTCS
XMLABCDDwC
XMLSDD
XMLTaXMLit
URL URL URL
CGIAR
Handling of unstructured data – a possible approach
• Develop set of keywords to classify web pages– Need standard classification of biodiversity information (probably with varying levels
of granularity)– Register pages under { taxonName, url, keywords, audience, language, ... }– Sufficient immediately to construct tables of links as part of taxon-specific web
pages
• Develop corresponding XML tags to mark up sections of web text (<description>, <habitat>, etc.)
– See TaXMLit from Biologia Centrali-Americana project for related approach– Possible to extract sections for re-use elsewhere– Example: a web site requires a description of common garden birds
• Find suitable text by audience and language for each taxon?
• Maximise benefits from effort expended in developing content– All information developed in any format can be indexed and discovered– All resources can contribute to the global data pool– Can develop plug-and-play taxonomic/regional/thematic portals based on the entire
network of registered data elements, structured and unstructured
Data validation
Administrator
Specimen
Researcher
Data Input
Database
Web Server
Internet
IndexingService
ValidationService
Validation service:
• Extensible framework for checking XML data
• Basic checks for XML validity (e.g. UTF-8)
• Validation against schema
• Checks that element content has expected structure or uses expected values
• Geospatial checks – coordinates in expected countries or on land / in sea
• Taxonomic checks – is name known? does it appear in referenced authority list? is higher taxonomy consistent?
Schema repository – version history
Schema A
1.0 1.1
Schema B
1.0 1.1 1.2
Structured store of data models for each schema
Modelling of version history for each schema
Single location from which to generate new machine-readable formats
Should such a tool also provide an environment in which to develop new data models?
.xsd
.dtd
.xsd
.???
.dtd
• Elements• Datatypes• Cardinality• Enumerated
values• Annotations
Schema repository – documentation
Automated generation of standardised HTML, PDF and other human readable documentation for each standard
Can easily generate documentation e.g. of inter-version differences
Schema A
1.0 1.1
Schema B
1.0 1.1 1.2
Schema repository – conceptual mappings
Schema A
1.0 1.1
Schema B
1.0 1.1 1.2
Identify relationships between elements in different versions of the same schema or different schemas (need interfaces to allow these linkages to be defined)
Generate software products to automate transformation between different versions and schemas
How should we handle different content models for related concepts?
Should mappings simply be between different imported schemas, or should they all be made against some central set of concepts?
.xslt
.pl?
.java?
Aggregator DiGIR provider
Museum B DiGIR provider
Museum A DiGIR provider
Problems to solve – data served from multiple locations
Different networks of providers will serve up the same records through different routes, and software tools will not be able to detect all duplicate records
Aggregator nodes may in many cases themselves be the sole provider for other important records
The same situation will also occur e.g. if a taxonomist serves data for all specimens from many institutions for a single genus, including some that may also have been digitised by the institutions themselves
No reliable mechanism to relate Agg:MusA/CollX to MusA:CollX or Agg:MusA/CollX:1001 to MusA:CollX:1001
Inst Coll CatNo
MusA CollX 1001
MusA CollX 1002
Inst Coll CatNo
MusB CollY 202-1
MusB CollY 465-2
Inst Coll CatNo
Agg MusA/CollX 1001
Agg MusA/CollX 1002
Agg MiscOther 1000
Agg MiscOther 1001
Global Provider Registry
Key Service Name
1 Museum A DiGIR
2 Museum B DiGIR
3 Aggregator DiGIR
Global Data Index
Service Inst Coll CatNo
1 MusA CollX 1001
1 MusA CollX 1002
2 MusB CollY 202-1
2 MusB CollY 465-2
3 Agg MusA/CollX 1001
3 Agg MusA/CollX 1002
3 Agg MiscOther 1000
3 Agg MiscOther 1001
Import by aggregator
Museum A DiGIR providerhttp://www.museumA.org/digir.php
Problems to solve – referring to data from outside network
Data records have no stable URLs on the network (even using constructed DiGIR queries)
Provider endpoints may change, and specimens may move to different institutions
There needs to be a reliable way to reference each data record from other web sites and in external media
The solution needs to handle situations in which data providers move to new locations
It should ideally also be possible to refer to a particular version of each record in case of modifications
Inst Coll CatNo
MusA CollX 1001
MusA CollX 1002
Global Data Indexhttp://www.gbif.net/data/digir.jsp
Service Inst Coll CatNo
1 MusA CollX 1001
1 MusA CollX 1002
Revision of the genus Aenetus
Material examined
Museum A, Collection X
1001, 1002
Revision of the genus Aenetus
Material examined
Museum A, Collection X
1001, 1002
Aenetus virescens
Distribution
Auckland (see 1001)
Wellington (see 1002)
?
?
Taxonomy Server B
Concept: Xus bus sensu J. Owen, 2002== Aus bus sensu P. Brown, 1949
Concept: Yus dus sensu J. Owen,2002== Aus dus sensu P. Brown, 1949
Taxonomy Server A
Concept: Aus bus sensu Smith, 2003== Aus bus sensu Jones, 1965== Aus cus bus sensu Black, 1926> Aus bus sensu Brown, 1949> Aus dus sensu Brown, 1949
Problems to solve – referring to taxon concepts
No standardised way to reference taxonomic publications or taxon concepts to allow reliable machine processing of the relationships between them (is ”P. Brown, 1949” the same as ”Brown, 1949”?)
Need mechanisms to allow data providers to refer to taxon concepts in a reliable and consistent way
User
What is the relationship between the concepts from Taxonomy Server A and Taxonomy Server B?
• Universal reusable mechanism which can be applied to any type of biodiversity data record (specimen, collection, institution, taxon concept, taxonomic publication, image, etc.)
• Identifiers must refer uniquely to a single data element
• Users must be able to resolve each identifier to locate the data to which it relates
• Identifiers must be resolvable even if data changes ownership or the server is moved to a new endpoint
• Identifiers must be suitable for use by the wider life science community and others
• It should be as simple as possible for researchers– to create new identifiers– to discover existing identifiers associated with data items– to resolve identifiers to retrieve the associated data
Globally unique identifiers – requirements
Potential model – Life Science Identifiers (LSIDs)
Specification for Uniform Resource Name model from EMBL, IBM, I3C and OMG
”A straightforward approach to naming and identifying data resources stored in multiple, distributed data stores in a manner that overcomes the limitations of naming schemes in use today”
All LSIDs are formed from 5 or 6 colon-delimited components: urn, lsid, the domain name of the authority for the identifier, a namespace for the identifier, the identifier for the object within the namespace, and an optional revision number
Format
urn:lsid:<domainName>:<namespace>:<objectId>[:<revisionId>]
Example: referencing a PubMed article
urn:lsid:ncbi.nlm.nih.gov:pubmed:12571434
Example: referencing first version of the 1AFT protein in the Protein Data Bank
urn:lsid:pdb.org:pdb:1AFT:1
Potential example: referencing specimen record in GBIF Network (identifiers assigned centrally)
urn:lsid:gbif.net:Specimen:2706712
Potential example: name record from IPNI
urn:lsid:ipni.org:TaxonName:82090-3:1.1
Web-based taxonomy?
GBIF Registry
GBIF Global Portal Thematic Portal
Taxonomy / Nomenclature
PDFHTML web page
Struc-tured XML
Image
GBIF Data Index
Discovery
Indexing
Presentation &Analysis
ResourcesSpecimens
Descriptive Data
Sequences / Barcodes
Taxonomic Collaboration Environment
Structured Unstructured
Databases Documents
Sequences / Barcodes
Taxonomy / Nomenclature
Specimens
Characters
Literature
Network resources
Group resources
Network data
Group data
Taxonomist input
Revision
Taxonomy / Nomenclature
PDFHTML web page
Struc-tured XML
Image
SpecimensDescriptive
DataSequences /
Barcodes
What is Species Bank?
GBIF Data Portal
GBIF Data Index
User Browser
/ Client Tool
Map Service
Taxon Portal
HTML web pages
Regional Portal
Species Bank ?
Species Bank ?
Links
Global Biodiversity Information Facility
http://www.gbif.org/Communications Portal
http://www.gbif.net/Data Portal
http://circa.gbif.net/Public/irc/gbif/dadi/library?l=/architectureArchitecture documents
Taxonomic Databases Working Group
http://www.tdwg.org/Including access to working groups