GLOBAL BIODIVERSITY INFORMATION FACILITY
Tim Robertson
Systems Architect
September 2009
WWW.GBIF.ORG
Technical Issues and Opportunities for Resource Discovery
Content
A look at the past, present and future of the GBIF registry and portals for biodiversity resources discovery.
Register existence
Associate metadata
Enable discovery through search
Registry: The past…
Universal Description Discovery and Integration (UDDI)
“…XML-based registry for businesses worldwide to list themselves on the Internet …”
UDDI                 GBIF
Businesses           Institutions
+ Services           + Collections
+ Service Bindings   + Endpoints (DiGIR etc.)
+ TModels            + Application Schemas (DwC etc.)
UDDI: Metadata
Limited by and large to:
  Contact information (emails, addresses etc.)
  Key-value pairs
    ISO country code
    Endorsing node
Allows for search by title, contact etc.
2 levels of credit
Data provenance is lost: lack of recognition!
Past: Search capabilities
Recognising that federated search was limited, GBIF built the Data Portal (http://data.gbif.org)
Harvesting of resources registered in the UDDI
  TAPIR, DiGIR, BioCASe
Rich search for individual records and resources by Darwin Core type terms (the what, where, when etc.), made possible by building indexes
Limited metadata search capabilities
  DiGIR, BioCASe, TAPIR etc. offer TECHNICAL metadata only
GBIF Network: The real scenario
Challenge #1:
Model the true nature of the network makeup.
A graph and not a tree
Multiple entity types
  Institutions, networks, collections, GBIF Nodes
Many relationship types
Benefits:
  Accurate data provenance
  Duplicate record detection
  Ability to model sub networks
Opportunity: Re-use of registry for your own purposes
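The graph model above can be illustrated with a minimal sketch. It is not the GBIF registry schema: the class `RegistryGraph`, the entity keys, and the relationship names (`hostedBy`, `memberOf`, `endorsedBy`) are all hypothetical. It only shows why a graph keeps provenance that a two-level tree loses: one institution can belong to several networks at once, and a traversal recovers every entity that deserves credit.

```python
from dataclasses import dataclass, field


@dataclass
class RegistryGraph:
    """Entities are nodes; typed relationships are directed edges (hypothetical model)."""
    kinds: dict = field(default_factory=dict)  # entity key -> entity type
    edges: dict = field(default_factory=dict)  # entity key -> [(relationship, target key)]

    def add_entity(self, key, kind):
        self.kinds[key] = kind
        self.edges.setdefault(key, [])

    def relate(self, source, relationship, target):
        self.edges.setdefault(source, []).append((relationship, target))

    def provenance(self, key):
        """Return every entity reachable from `key`, in discovery order (cycle-safe)."""
        visited, order, stack = {key}, [], [key]
        while stack:
            for _, target in self.edges.get(stack.pop(), []):
                if target not in visited:
                    visited.add(target)
                    order.append(target)
                    stack.append(target)
        return order


# Hypothetical entities: one collection, one institution, two networks, one GBIF Node.
graph = RegistryGraph()
for key, kind in [("frog-collection", "collection"), ("museum-x", "institution"),
                  ("network-a", "network"), ("network-b", "network"),
                  ("node-y", "node")]:
    graph.add_entity(key, kind)
graph.relate("frog-collection", "hostedBy", "museum-x")
graph.relate("museum-x", "memberOf", "network-a")
graph.relate("museum-x", "memberOf", "network-b")
graph.relate("museum-x", "endorsedBy", "node-y")

credited = graph.provenance("frog-collection")
```

Here `credited` lists the institution, both networks and the endorsing node, which a strict parent/child hierarchy could not express.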
Registry: A graph based model
Challenge #2: Scalable deployment supporting this reuse (99.9%, 24/7)
Authentication model
  Identity management?
  Cascading permissions?
  Wiki style?
Or perhaps copy the model of ?
“Institution X requests to be associated with you. Would you like to accept this association?”
Registry: A graph based model
Challenge #2 (cont.): Who should curate?
Private and community copies?
Single (scalable) instance or multiple masters?
Opportunity: Offering tagging (machine and human) lets people make use of the registry in ways we would not envision
myimagebank.org:containsTypesInTaxon = Leiopelmatidae
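A machine tag like the one above could be parsed as follows. This is a sketch only: the `namespace:predicate = value` syntax is inferred from the single example on the slide, and `MachineTag` and `parse_machine_tag` are hypothetical names, not part of any GBIF API.

```python
from typing import NamedTuple


class MachineTag(NamedTuple):
    namespace: str  # who defines the tag, e.g. "myimagebank.org"
    predicate: str  # what the tag asserts, e.g. "containsTypesInTaxon"
    value: str      # the asserted value, e.g. "Leiopelmatidae"


def parse_machine_tag(text):
    """Split 'namespace:predicate = value' into its three parts (assumed syntax)."""
    name, equals, value = text.partition("=")
    if not equals:
        raise ValueError(f"missing '=' in machine tag: {text!r}")
    # Split on the last colon so that the namespace itself may contain colons.
    namespace, colon, predicate = name.strip().rpartition(":")
    if not colon or not namespace or not predicate.strip():
        raise ValueError(f"expected 'namespace:predicate' before '=': {text!r}")
    return MachineTag(namespace.strip(), predicate.strip(), value.strip())


tag = parse_machine_tag("myimagebank.org:containsTypesInTaxon = Leiopelmatidae")
```

Because the namespace identifies who owns the tag, a third party can attach its own assertions to a registered resource without the registry having to understand them.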
Registry: A graph based model
Endpoint monitoring http://bioguid.info/status/ (Rod Page)
Provider monitoring
Enabling discoverability
Combination of human authored with machine generated metadata?
“…artificial intelligence is just that; ‘ARTIFICIAL intelligence’. For a system to feel smart to humans, you need human crafted metadata…”
Challenge #3:
If there is agreement to improve discoverability by associating automatically generated metadata with a registered entity:
How to uniquely identify resources within the registry?
  Preserve existing (multiple) identifiers
Where does one stop? (Inventory of Taxa for example?)
What services are required to enable this association?
  E.g. Find resource for “DwC:collectionCode”
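One way to picture such a lookup service is a small identifier index. This is an illustrative sketch, not a proposed design: `IdentifierIndex`, the registry-issued UUID key, and the sample identifier values are all assumptions. Each resource gets one registry key while keeping every identifier it already had, and a lookup by type and value returns a list because a value such as a collection code need not be unique.

```python
import uuid
from collections import defaultdict


class IdentifierIndex:
    """Maps preserved external identifiers to registry keys (hypothetical service)."""

    def __init__(self):
        self._by_identifier = defaultdict(set)  # (type, value) -> registry keys
        self._identifiers = defaultdict(set)    # registry key -> {(type, value)}

    def register(self, identifiers):
        """Issue a new registry key and attach all existing identifiers to it."""
        key = str(uuid.uuid4())
        for id_type, value in identifiers:
            self.add_identifier(key, id_type, value)
        return key

    def add_identifier(self, key, id_type, value):
        self._by_identifier[(id_type, value)].add(key)
        self._identifiers[key].add((id_type, value))

    def find(self, id_type, value):
        """Return all registry keys carrying this identifier (possibly several)."""
        return sorted(self._by_identifier.get((id_type, value), ()))

    def identifiers_of(self, key):
        return sorted(self._identifiers.get(key, ()))


index = IdentifierIndex()
# Sample values below are made up for illustration.
herp_key = index.register([("DwC:collectionCode", "HERP"),
                           ("LSID", "urn:lsid:example.org:collections:1")])
other_key = index.register([("DwC:collectionCode", "HERP")])
matches = index.find("DwC:collectionCode", "HERP")
```

In this example `matches` contains both keys, which is exactly the ambiguity a real service would have to resolve, for instance by also requiring an institution code.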
Associating data and metadata
Existing metadata stores
There are many existing resources…
Identification of the master copy is critical for success
Conflict resolution: how do we achieve this?
Complete copies or subset copies?
Wikipedia style, make copies available?
Service registration
To enable a service oriented architecture (SOA) workflow definition
Requires the definition of:
  Service endpoints
  Input formats
  Output formats
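The three items above are enough to decide whether two services can be chained in a workflow. A minimal sketch, assuming a hypothetical `ServiceRegistration` record, made-up `example.org` endpoints, and format names chosen only for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRegistration:
    """What a registry would need to know about a service (hypothetical record)."""
    endpoint: str
    input_formats: frozenset
    output_formats: frozenset


def can_chain(upstream, downstream):
    """True if at least one output format of `upstream` is accepted by `downstream`."""
    return bool(upstream.output_formats & downstream.input_formats)


# Hypothetical services; the URLs and format names are placeholders.
name_lookup = ServiceRegistration(
    endpoint="http://example.org/names",
    input_formats=frozenset({"text/plain"}),
    output_formats=frozenset({"DwC-XML"}),
)
map_renderer = ServiceRegistration(
    endpoint="http://example.org/maps",
    input_formats=frozenset({"DwC-XML", "CSV"}),
    output_formats=frozenset({"image/png"}),
)

forward_ok = can_chain(name_lookup, map_renderer)
backward_ok = can_chain(map_renderer, name_lookup)
```

Without registered input and output formats, a workflow tool has no way to make this check before calling the services.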
Remember:
GUID Resolution
Awaiting recommendation from the task group
Do we envisage GBIF running a generic resolver (multiple)?
  Act as a cache?
  Include endpoint monitoring and early warning system?
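To make the "act as a cache" question concrete, here is a sketch of a resolver that remembers answers for a fixed time. Nothing here reflects a task group recommendation: `CachingResolver`, the time-to-live policy, and the injected `resolve` function are assumptions for illustration.

```python
import time


class CachingResolver:
    """Wraps any GUID resolution function with a time-limited cache (hypothetical)."""

    def __init__(self, resolve, ttl_seconds=3600.0, clock=time.monotonic):
        self._resolve = resolve  # function: guid -> resolved metadata
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._cache = {}         # guid -> (expiry time, resolved value)

    def lookup(self, guid):
        now = self._clock()
        hit = self._cache.get(guid)
        if hit is not None and hit[0] > now:
            return hit[1]        # still fresh: answer from the cache
        value = self._resolve(guid)  # missing or expired: ask the authority
        self._cache[guid] = (now + self._ttl, value)
        return value


# Stand-in for a real remote resolver; it counts how often it is called.
calls = []


def fake_remote_resolve(guid):
    calls.append(guid)
    return {"guid": guid, "title": "resolved record"}


fake_time = [0.0]
resolver = CachingResolver(fake_remote_resolve, ttl_seconds=60.0,
                           clock=lambda: fake_time[0])
first = resolver.lookup("urn:lsid:example.org:specimen:42")
second = resolver.lookup("urn:lsid:example.org:specimen:42")
calls_while_fresh = len(calls)
fake_time[0] = 61.0  # move past the time-to-live
third = resolver.lookup("urn:lsid:example.org:specimen:42")
calls_after_expiry = len(calls)
```

A cache like this keeps identifiers resolvable while a provider endpoint is briefly down, which is where it would connect to endpoint monitoring.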
Vocabulary definitions
Requires consensus within the community that terms adequately describe the content.
Community site for authoring vocabularies?
The same applies for extensions to the Darwin Core
The GBIF Integrated Publishing Toolkit (IPT) uses the GBRDS as the source for extension definition and vocabulary definition.
Be smart with our limited resources
Contact
Web site: http://www.gbif.org
Data portal: http://data.gbif.org
GBIF Secretariat
Universitetsparken 15
2100 Copenhagen
Denmark
E-mail: [email protected]
Phone: +45 3532 1487