richard white biodiversity informatics projects. thoughts role of biodiversity data in...

Post on 11-Jan-2016

TRANSCRIPT

  • Richard White: Biodiversity Informatics Projects

  • Thoughts

    Role of biodiversity data in bioinformatics: assisting with organising and retrieving bioinformatic (molecular) data; a separate area with different users (taxonomy, ecology, conservation, resource management)

    Demand from users for taxonomic and species diversity information on the Web

    Pressure on the taxonomic community to deliver

    Demand for more sophisticated use of available data: interoperability = online analysis, not just browsing

  • Assembling biodiversity information sources

    Delivering species diversity information by assembling, merging & linking databases and publishing on the Web, with special emphasis on linking

  • Issues in assembling and linking biodiversity information sources

    Assembling a web-site (ERMS)

    Assembling databases by merging (ILDIS)

    Linking on-line databases through a gateway (Species 2000 and SPICE)

    Onward links to related information

    Checking the reliability of links (LITCHI)

    Intelligent linking

    Persistent identifiers

  • Assembling species databases

    First of all, before we start merging and linking databases, let's assemble a database from scratch: ERMS (European Register of Marine Species)

    Now at www.marbef.org/data/erms.php

  • ERMS

  • Incoming data

    Approximately 100 separate lists for different taxonomic groups

    Mostly compiled as spreadsheets

    Scientific names, synonyms, geography (at least Atlantic or Mediterranean)

    Some optional fields

    Objective: to create a book and a web-site, partially supported by a database

  • List conversion

    List conversion was carried out in several stages:

    Excel spreadsheets were exported to text files

    Tab-delimited text files were imported into a client-server database (MySQL)

    Database query results are passed through templates to generate either RTF (for the printed publication) or HTML (for the Web site)
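
    The three conversion stages can be sketched in a few lines. This is a minimal illustration, not the ERMS code: it uses Python's built-in sqlite3 in place of MySQL, and the table layout, the sample rows and the HTML template are all assumptions made for the example.

```python
import csv
import html
import io
import sqlite3
from string import Template

# Hypothetical tab-delimited export of one checklist spreadsheet (sample rows).
TSV = (
    "genus\tspecies\tauthority\tregion\n"
    "Dermochelys\tcoriacea\t(Vandelli, 1761)\tAtlantic\n"
    "Caretta\tcaretta\t(Linnaeus, 1758)\tAtlantic\n"
)

def load_checklist(conn, text):
    """Import tab-delimited rows into an assumed single-table schema."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS species "
        "(genus TEXT, species TEXT, authority TEXT, region TEXT)"
    )
    rows = list(csv.DictReader(io.StringIO(text), delimiter="\t"))
    conn.executemany(
        "INSERT INTO species VALUES (:genus, :species, :authority, :region)",
        rows,
    )

# One template per output format; only an HTML one is sketched here.
HTML_ROW = Template("<li><i>$genus $species</i> $authority ($region)</li>")

def render_html(conn):
    """Pass query results through the template to build a page fragment."""
    cursor = conn.execute(
        "SELECT genus, species, authority, region FROM species "
        "ORDER BY genus, species"
    )
    items = [
        HTML_ROW.substitute(
            genus=html.escape(g), species=html.escape(s),
            authority=html.escape(a), region=html.escape(r),
        )
        for g, s, a, r in cursor
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

conn = sqlite3.connect(":memory:")
load_checklist(conn, TSV)
page = render_html(conn)
print(page)
```

    An RTF output for the printed book would follow the same pattern with a second template applied to the same query results.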

  • Variations on a theme

    Fields may be combined or separated, e.g. genus, species, authority, date

    Higher taxa may be repeated in fields of the species record, or given once in separate preceding records, in various different formats

    Synonyms may be: in a separate field of the species record, or mixed with other remarks, with various delimiters and separators; in separate records, linked by code or by name or even abbreviated; implied, e.g. Genus1 specname (Smith as Genus2)

    Geographical information is often free text

  • ERMS book page

  • Osteichthyes: brief checklist

  • Reptilia: full details

  • Taxonomic hierarchy for Reptilia

  • Merging versus linking

    Merging databases to create a single larger database

    Linking databases to create a distributed information system

  • Merging species databases

    1. The original databases are physically copied into a new combined database.

    2. The user interacts with the new combined database.

  • Linking

    1. The user interacts with an access system which does not itself contain data.

    2. When the user requests data, it is fetched from the appropriate database.
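
    The difference between the two approaches can be shown with a toy sketch. The two source dictionaries, the `merge` function and the `Gateway` class are hypothetical stand-ins for real databases and a real access system.

```python
# Two hypothetical source databases, each mapping a name to a record.
LEGUMES = {"Abrus precatorius": {"family": "Leguminosae"}}
REPTILES = {"Caretta caretta": {"family": "Cheloniidae"}}

def merge(*sources):
    """Merging: physically copy every record into one new combined database."""
    combined = {}
    for source in sources:
        combined.update(source)
    return combined

class Gateway:
    """Linking: an access system that holds no records of its own."""

    def __init__(self, *sources):
        self.sources = sources

    def fetch(self, name):
        # Data is fetched from the appropriate database at request time.
        for source in self.sources:
            if name in source:
                return source[name]
        return None

merged = merge(LEGUMES, REPTILES)
gateway = Gateway(LEGUMES, REPTILES)

# A later update to a source is visible through the gateway,
# but not in the merged copy until the merge is repeated.
LEGUMES["Caragana arborescens"] = {"family": "Leguminosae"}
print("Caragana arborescens" in merged)          # False
print(gateway.fetch("Caragana arborescens"))     # {'family': 'Leguminosae'}
```

    The trade-off the sketch makes visible: a merged database is self-contained but goes stale, while a gateway is always current but depends on every source being reachable.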

  • Assembling databases by merging

    Now we have some databases, let's build a bigger one by merging: ILDIS (International Legume Database and Information Service)

  • ILDIS

    International Legume Database and Information Service

    International collaborative project: 10 Regional Centres; 30 Taxonomic Coordinators

    Its goals include: building, maintaining and enhancing the ILDIS World Database of Legumes; designing and providing services from it to users, including ILDIS LegumeWeb and access via Species 2000

  • ILDIS World Database of Legumes v. 7.00

    Taxa: Species 15,500; Subspecies 1,600; Varieties 2,400; total 19,500

    Names: Accepted names 19,500; Synonyms 19,000; total 39,500

  • ILDIS's data model: core data

    A core taxonomic checklist, assembled from regional data sets and nearing completion, provides a consensus taxonomy: a unified taxonomic treatment or backbone on which other data can be hung

    Various kinds of additional data may be attached to this backbone (see later)

  • Features of ILDIS LegumeWeb

    We'll look at examples of the use of LegumeWeb, to show a couple of features:

    Two-stage access with synonymic indexing

    A gateway to external information: onward links (direct species name links) to further sources of information

  • User access to LegumeWeb: Step 1

    The user types in a name, which may be incomplete (or wrong!)

    LegumeWeb responds by showing a list of the species names which fit the user's specification

  • User access to LegumeWeb: Step 2

    The user chooses one of the species names provided (which may be a synonym or an accepted name)

    In this example, the user chooses Abrus cyaneus (a synonym for Abrus precatorius)

    LegumeWeb responds by showing a standard set of information about the chosen species
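
    The two-stage access described in Steps 1 and 2 might be sketched as follows. This is not LegumeWeb's implementation: the in-memory name index is hypothetical, and the use of `difflib.get_close_matches` to tolerate misspelt names is an assumption about how "wrong" input could be handled.

```python
import difflib

# Hypothetical name index: every name, accepted or synonym,
# points to the accepted name of its taxon.
NAME_INDEX = {
    "Abrus precatorius": "Abrus precatorius",
    "Abrus cyaneus": "Abrus precatorius",
}
# Hypothetical store of standard information, keyed by accepted name.
RECORDS = {
    "Abrus precatorius": {
        "accepted_name": "Abrus precatorius",
        "synonyms": ["Abrus cyaneus"],
    },
}

def stage1(query):
    """Step 1: return the names that fit an incomplete or misspelt query."""
    q = query.strip().lower()
    by_prefix = sorted(n for n in NAME_INDEX if n.lower().startswith(q))
    if by_prefix:
        return by_prefix
    # No prefix match: fall back to approximate matching.
    return difflib.get_close_matches(query, list(NAME_INDEX), n=5, cutoff=0.6)

def stage2(chosen_name):
    """Step 2: show standard data for the taxon the chosen name belongs to."""
    return RECORDS[NAME_INDEX[chosen_name]]

candidates = stage1("Abrus")
record = stage2("Abrus cyaneus")
print(candidates)
print(record["accepted_name"])
```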

  • Synonymic indexing

    Automated synonymic indexing: synonym entered → accepted name found (name → taxon); taxon found → synonyms listed

    Types of synonyms: unambiguous; ambiguous (pro parte, homonyms, misapplied names)

    In these cases an explanation is offered to the user
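
    A synonymic index can be modelled as two lookups: name to taxon, then taxon to synonyms. The sketch below is an assumption about one possible structure, not the ILDIS data model; the names are taken from Example 2 later in this talk, and the status labels follow the types of synonyms listed on the slide.

```python
from collections import defaultdict

# (name, taxon the name is applied to, status of the name for that taxon)
NAME_RECORDS = [
    ("Caesalpinia crista L.", "Caesalpinia crista L.", "accepted"),
    ("Caesalpinia bonduc (L.) Roxb.", "Caesalpinia bonduc (L.) Roxb.", "accepted"),
    ("Caesalpinia crista L.", "Caesalpinia bonduc (L.) Roxb.", "pro parte"),
]

name_to_taxa = defaultdict(list)       # name  -> [(taxon, status), ...]
taxon_to_synonyms = defaultdict(list)  # taxon -> [synonym names]
for name, taxon, status in NAME_RECORDS:
    name_to_taxa[name].append((taxon, status))
    if status != "accepted":
        taxon_to_synonyms[taxon].append(name)

def lookup(name):
    """Name entered -> taxon found -> synonyms listed, with an
    explanation attached whenever the name is ambiguous."""
    hits = name_to_taxa.get(name, [])
    ambiguous = len(hits) > 1
    results = []
    for taxon, status in hits:
        explanation = ""
        if ambiguous:
            explanation = (
                f"'{name}' is applied to more than one taxon "
                f"(status here: {status})"
            )
        results.append({
            "taxon": taxon,
            "status": status,
            "synonyms": list(taxon_to_synonyms.get(taxon, [])),
            "explanation": explanation,
        })
    return results

for hit in lookup("Caesalpinia crista L."):
    print(hit["taxon"], "|", hit["status"], "|", hit["explanation"])
```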

  • Assembling databases by linking

    Now we have some biggish databases, let's build something even bigger by linking databases together: Species 2000, SPICE, Species 2000 Europa

  • The Catalogue of Life (Species 2000)

    An international collaborative project to provide access to an authoritative and up-to-date checklist of all the world's species

    A distributed array of Global Species Databases (GSDs) can be accessed through a Web gateway or Central Access System (CAS)

    The array of GSDs provides an index to a further range of information about each species, using onward links (see later)

    www.sp2000.org

  • Species 2000 organisation

  • Architecture of Species 2000

  • Species 2000's Common Access System

    Species 2000 gives users a single point of access to GSDs

    Access involves a two-stage search process similar to that used in LegumeWeb

    In the second stage, the user sees a screen of standard data about a species

  • The standard data

    This comprises the information about a species which Species 2000 wishes to provide:

    Accepted name (with references)

    Synonyms (with references)

    Common names (with references)

    Family or other higher taxon

    Geography

    Comment

    Scrutiny information

    URL or URLs linking to further data sources for this species
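
    One way to picture the standard data is as a typed record. The class and field names below are hypothetical, chosen only to mirror the items listed on the slide; they are not taken from any Species 2000 schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Referenced:
    """A piece of text together with the references that support it."""
    text: str
    references: List[str] = field(default_factory=list)

@dataclass
class StandardData:
    """One record of standard data, following the slide's list of items."""
    accepted_name: Referenced
    higher_taxon: str                      # family or other higher taxon
    synonyms: List[Referenced] = field(default_factory=list)
    common_names: List[Referenced] = field(default_factory=list)
    geography: str = ""
    comment: str = ""
    scrutiny: str = ""                     # scrutiny information
    urls: List[str] = field(default_factory=list)  # onward links

# Example record using the species named earlier in the talk.
record = StandardData(
    accepted_name=Referenced("Abrus precatorius"),
    higher_taxon="Leguminosae",
    synonyms=[Referenced("Abrus cyaneus")],
)
print(record.accepted_name.text, "has", len(record.synonyms), "synonym(s)")
```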

  • Need for communication

    Different people are building the various components of the system: GSDs, wrappers, CAS, user interface

    We need to ensure they all have a common understanding of the data to avoid mistakes

  • Common Data Model

    We use a Common Data Model (CDM)

    A definition of the information being passed to and fro

    Human-readable, not machine-readable

    Helps to manage complexity

    Used to create specific machine-readable implementations for Corba (IDL), CGI/XML (DTD, XML Schema), Web Services, etc.

  • What does the CDM look like?

    It defines the input (request) and output (response) for six fundamental operations which the system needs to be able to carry out

  • Request Types 0-5

    Type 0: Get CDM version supported by a GSD's wrapper

    Type 3: Get information about a GSD

    Type 1: Search for a name in a GSD

    Type 2: Fetch standard data about a chosen species

    Type 4: Move up the taxonomic hierarchy (towards the root of the tree)

    Type 5: Move down the taxonomic hierarchy (towards the species level)
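
    The six operations lend themselves to a simple dispatch table. The sketch below is illustrative only: the enum member names, the `ToyWrapper` class, its tiny hierarchy and the placeholder return values are all assumptions, and the real CDM defines much richer requests and responses.

```python
from enum import IntEnum

class RequestType(IntEnum):
    """The six operations, numbered as on the slide."""
    GET_CDM_VERSION = 0
    SEARCH_NAME = 1
    FETCH_STANDARD_DATA = 2
    GET_GSD_INFO = 3
    MOVE_UP = 4
    MOVE_DOWN = 5

class ToyWrapper:
    """A hypothetical wrapper around a tiny in-memory 'GSD'."""

    # child -> parent links of a three-node hierarchy
    PARENT = {"Abrus precatorius": "Abrus", "Abrus": "Leguminosae"}

    def handle(self, request_type, **params):
        handlers = {
            RequestType.GET_CDM_VERSION: lambda: "toy-version",
            RequestType.SEARCH_NAME: lambda: [
                n for n in self.PARENT if n.startswith(params["name"])
            ],
            RequestType.FETCH_STANDARD_DATA: lambda: {
                "accepted_name": params["name"]
            },
            RequestType.GET_GSD_INFO: lambda: {"gsd": "toy legume database"},
            RequestType.MOVE_UP: lambda: self.PARENT.get(params["taxon"]),
            RequestType.MOVE_DOWN: lambda: [
                child for child, parent in self.PARENT.items()
                if parent == params["taxon"]
            ],
        }
        return handlers[RequestType(request_type)]()

wrapper = ToyWrapper()
print(wrapper.handle(RequestType.MOVE_UP, taxon="Abrus precatorius"))
print(wrapper.handle(RequestType.MOVE_DOWN, taxon="Abrus"))
```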

  • Spice CAS in use

    Screen-shots of an old version of the Spice system in use:

  • Spice 1 CAS

  • Onward links to related external data

    Species databases such as ILDIS and federated systems such as Species 2000 envisage providing links from their data to external sources of related data, so-called onward links

    Example from ILDIS ...

  • Onward links

    The user may follow a hyperlink to some other data source for further information, not managed by ILDIS

    In this example, the user chooses to go to W3Tropicos at Missouri Botanical Garden to see more information

    In this way LegumeWeb acts as a gateway to other information about legume species

  • LegumeWeb page with onward links

  • Destination of an onward link

  • Further information obtained

  • Checking the reliability of links

    Whether in merging data sets to construct a species database like ILDIS, or in linking from one data set to another, it is necessary to ensure that the species concepts in the different databases do not conflict

  • Example 1

    Database A: Caragana arborescens Lam. [accepted name]; Caragana sibirica Medikus [synonym]

    Database B: Caragana sibirica Medikus [accepted name]; Caragana arborescens Lam. [synonym]

  • Example 2

    Database A: Caesalpinia crista L. [accepted name]

    Database B: Caesalpinia crista L. [accepted name]; Caesalpinia bonduc (L.) Roxb. [accepted name], with Caesalpinia crista L., p.p. [synonym]

  • LITCHI project

    We modelled the knowledge integrity rules in a taxonomic treatment

    The knowledge tested is implicit in the assemblage of scientific names and synonyms used to represent each taxon

    Practical uses include: helping a taxonomist to detect and resolve taxonomic conflicts when merging or linking two databases; helping a non-taxonomist user follow links from one database to another, in which the species may be differently classified
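
    The two examples above can be detected by one very simple rule: a name accepted in one checklist appears as a synonym of a different accepted name in the other. The sketch below implements only that single rule as an illustration. It is not the LITCHI rule set, and the checklist representation (accepted name mapped to a set of synonyms) and the handling of the "p.p." marker are assumptions.

```python
def normalise(name):
    """Drop the pro parte marker so that names can be compared."""
    return name.replace(", p.p.", "").strip()

def find_conflicts(checklist_a, checklist_b):
    """Return (name accepted in A, taxon in B that lists it as a synonym)."""
    conflicts = []
    for accepted_a in checklist_a:
        for accepted_b, synonyms_b in checklist_b.items():
            if accepted_a == accepted_b:
                continue
            if normalise(accepted_a) in {normalise(s) for s in synonyms_b}:
                conflicts.append((accepted_a, accepted_b))
    return conflicts

# Example 1: the two databases disagree on which name is accepted.
A1 = {"Caragana arborescens Lam.": {"Caragana sibirica Medikus"}}
B1 = {"Caragana sibirica Medikus": {"Caragana arborescens Lam."}}

# Example 2: database B treats part of A's species as another species.
A2 = {"Caesalpinia crista L.": set()}
B2 = {
    "Caesalpinia crista L.": set(),
    "Caesalpinia bonduc (L.) Roxb.": {"Caesalpinia crista L., p.p."},
}

print(find_conflicts(A1, B1))
print(find_conflicts(A2, B2))
```

    Each reported pair is a point where a taxonomist would need to decide how the two treatments relate before the databases are merged or linked.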

  • Conflict display

  • Outcome of LITCHI project

    A prototype tool for merging checklists & checking integrity of individual checklists was implemented

    In the Species 2000 Europa project, we are now creating a completely new second version with a view to allowing: dynamic linking (so-called taxonomically intelligent links); presentation of attached data to be organised, merged and used to support conflict resolution

  • Intelligent linking

    The Catalogue of Life (Species 2000) is not just a catalogue (which lists things): it is an index (which points to things)

    GSDs, and gateways to them such as the Catalogue of Life, can serve not only as catalogues of species but also as indexes giving access, potentially, to all species information on the Internet

  • Intelligent linking

    Species 2000 plans to provide links to take a user from a species entry (from a GSD) to further sources of information about that particular species (Species Information Sources or SISs)

  • Species 2000 organisation

  • Intelligent species links

    Given that it is possible to detect many cases of potential taxonomic conflict when linking species databases, how can such links be managed?

    There are a number of choices in the ways links may be made and handled

  • Cross-mapping

    So how can we make intelligent links work, especially in the difficult cases where a species in one database does not have an exact match in the other?

    One way is to create and maintain cross-maps which describe how one or more taxa in one resource (such as the Species 2000 index) relate to one or more taxa in another resource
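
    A cross-map can be pictured as a list of entries, each relating a set of taxa in one resource to a set of taxa in another. The sketch below is a hypothetical structure: the class name, the relation vocabulary ("equals", "overlaps") and the relations assigned to the two example entries are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet

@dataclass(frozen=True)
class CrossMapEntry:
    """How one or more taxa in a source relate to taxa in a target."""
    source_taxa: FrozenSet[str]
    target_taxa: FrozenSet[str]
    relation: str  # assumed vocabulary, e.g. "equals" or "overlaps"

CROSS_MAP = [
    # Same taxon under a different accepted name (as in Example 1).
    CrossMapEntry(
        frozenset({"Caragana arborescens Lam."}),
        frozenset({"Caragana sibirica Medikus"}),
        "equals",
    ),
    # One source taxon spread over two target taxa (as in Example 2).
    CrossMapEntry(
        frozenset({"Caesalpinia crista L."}),
        frozenset({"Caesalpinia crista L.", "Caesalpinia bonduc (L.) Roxb."}),
        "overlaps",
    ),
]

def follow_link(taxon):
    """Return the target taxa, and how they relate, for a source taxon."""
    return [
        (sorted(entry.target_taxa), entry.relation)
        for entry in CROSS_MAP
        if taxon in entry.source_taxa
    ]

print(follow_link("Caesalpinia crista L."))
```

    With such a map, a link from a species entry can warn the user when the destination covers more or less than the species they started from.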

  • A dream

    A system for managing intelligent species links would maximise the potential of the plethora of species-based catalogues, indexes and rich species resources currently being assembled all over the world

    Perhaps on the Web, as with the current Spice/Species 2000 prototypes

    Or ...

  • The Grid

    The Grid is often thought of as a new toy for particle physicists, with very high bandwidth and distributed computational resources

    But it also provides opportunities for more structured and reliable access to data and information sources, using improved protocols with metadata

    For example, access to such knowledge sources as these cross-maps

  • Using biodiversity information resources

    Helping Biodiversity Researchers to do their Work

    Collaborative e-Science and Virtual Organisations

  • Biodiversity analysis and modelling

    Scientists working with biodiversity information employ a wide variety of resources (data sources, statistical analysis and modelling tools, presentation or visualisation software), which may be available on various local and remote computer platforms.

  • Examples of biodiversity resources

    Data sources: Names (Species 2000 & ITIS Catalogue of Life); Data (GBIF, sequence databases); Geography (gazetteers); Collections and distributions (BioCASE, MaNIS)

    Analysis tools: statistical and multivariate analysis; modelling; visualisation

  • Use of resources together

    Scientists frequently need to use several of these resources in sequence to carry out their research.

    Much effort is currently expended in: initially acquiring resources; installing and sometimes adapting them to run on the user's own machine; converting and transporting data sets between stages of the analysis process

  • Biodiversity research

    Biologists are working to understand the adaptation of organisms to their environmental niche, and to predict their interactions with their environment, eventually by combining knowledge of all the levels of biological organisation:

    genome, transcription, proteome, metabolic pathways, cell, tissue, organ, individual whole organism, population, species, evolutionary pathways

  • Workflows

    Resources are called into use in an appropriate sequence from an interactive workflow. The facility for scientists to be able to create their own workflows, without the need for regular assistance from computer scientists, is an essential part of the BDWorld system. Accessible tools for resource discovery and for workflow design, enactment and re-use are therefore required.

  • For example

    Changes in distribution in response to climate changes brought about by global warming

  • CSM: Climate-space modelling

    Modelling and predicting changes in distribution in response to climate changes such as those brought about by global warming

    An unreasonably brief explanation:

    Get current distribution of a species (e.g. specimen records)

    Get current or recent climate data for those localities

    Calculate a model for the climate space the species can occupy

    Predict the distribution the species would have in any specified climate (may be different to the climate used above)

    Project back on world map
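
    The steps above can be illustrated with the simplest possible model, a rectangular climate envelope: the species is assumed to tolerate any value between the minimum and maximum observed where it currently occurs. This is a swapped-in toy technique, not the modelling method used in BDWorld, and the grid cells and climate values are invented for the example.

```python
# Hypothetical climate surfaces: grid cell -> climate variables.
CURRENT = {
    "A": {"temp": 10, "rain": 800},
    "B": {"temp": 14, "rain": 600},
    "C": {"temp": 22, "rain": 300},
    "D": {"temp": 6, "rain": 750},
}
# A second scenario: every cell 4 degrees warmer, rainfall unchanged.
WARMER = {
    cell: {"temp": v["temp"] + 4, "rain": v["rain"]}
    for cell, v in CURRENT.items()
}

# Step 1: current distribution (cells with specimen records, made up).
occupied = ["A", "B"]

def fit_envelope(occupied_cells, climate):
    """Steps 2-3: read the climate at each locality and take the
    min/max of every variable as the climate space of the species."""
    variables = climate[occupied_cells[0]].keys()
    return {
        var: (
            min(climate[c][var] for c in occupied_cells),
            max(climate[c][var] for c in occupied_cells),
        )
        for var in variables
    }

def predict(envelope, climate):
    """Steps 4-5: list the cells whose climate falls inside the envelope;
    these are the cells that would be shaded on the base map."""
    return sorted(
        cell for cell, values in climate.items()
        if all(low <= values[var] <= high
               for var, (low, high) in envelope.items())
    )

envelope = fit_envelope(occupied, CURRENT)
print(envelope)
print(predict(envelope, CURRENT))
print(predict(envelope, WARMER))
```

    In this toy run the suitable cells shift from A and B under the current climate to A and D under the warmer one.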

  • Example work-flow (Climate-space Modelling)

    SPICE: submit scientific name; retrieve accepted name & synonyms for species

    Localities: retrieve distribution maps for species of interest

    Climate: climate surfaces

    Climate Space Model: model of climatic conditions where species is currently found

    Climate: possibly different climate surfaces (e.g. predicted climate)

    Base Maps: world or regional maps

    Prediction: prediction of suitable regions for species of interest

    Projection: projection of predicted distribution on to base map

  • Triana screen-shots: 1. Creation (design, editing)

  • Triana screen-shots: 2. Execution (enactment, run-time)

  • And finally

  • Triana screen-shots

  • Elements of the BDWorld system

    What did the system have to do to make that example happen?

  • Role of the work-flow engine

    Create and edit a workflow

    Locate an appropriate resource

    Check interoperability

    Arrange any necessary transformations

    Record provenance of generated data sets

    Execute a workflow, passing data sets to and fro

    Create a log or lab book for user
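
    A few of these duties can be shown in a toy engine: checking that adjacent steps fit together when the workflow is edited, passing data from step to step when it is run, and keeping a log of what was produced. The `Workflow` class, the type labels and the two sample steps are hypothetical; this is not how Triana or BDWorld is implemented.

```python
from datetime import datetime, timezone

class Workflow:
    """A minimal linear workflow with an interoperability check and a log."""

    def __init__(self):
        self.steps = []
        self.log = []   # the user's "lab book"

    def add(self, name, func, consumes, produces):
        # Interoperability check at design time: the new step must
        # consume what the previous step produces.
        if self.steps and self.steps[-1]["produces"] != consumes:
            raise TypeError(
                f"{name!r} expects {consumes!r} but the previous step "
                f"produces {self.steps[-1]['produces']!r}"
            )
        self.steps.append(
            {"name": name, "func": func,
             "consumes": consumes, "produces": produces}
        )
        return self

    def run(self, data):
        for step in self.steps:
            data = step["func"](data)
            # Provenance: which step produced which kind of data, and when.
            self.log.append({
                "step": step["name"],
                "produced": step["produces"],
                "at": datetime.now(timezone.utc).isoformat(),
            })
        return data

# Toy resources standing in for a name service and a locality source.
ACCEPTED = {"Abrus cyaneus": "Abrus precatorius"}
LOCALITIES = {"Abrus precatorius": ["cell A", "cell B"]}  # made-up data

workflow = Workflow()
workflow.add("resolve name", lambda n: ACCEPTED.get(n, n),
             "scientific name", "accepted name")
workflow.add("fetch localities", lambda n: LOCALITIES.get(n, []),
             "accepted name", "locality list")

result = workflow.run("Abrus cyaneus")
print(result)
print([entry["step"] for entry in workflow.log])
```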

  • Difficulties with resources

    Finding the resources

    Knowing how to use these heterogeneous resources

    Originally constructed for various reasons, often with little attention to standards or interoperability

    Have to pass data sets from one to another

    Some involve user interaction

  • Role of metadata

    Metadata is needed to enable discovery of resources and to indicate how they are to be used.

    Properties to help locate appropriate resources

    Check interoperability, suggest transformations

    Provenance of data sets

    Log of work-flows executed
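
    As a rough sketch of how metadata could support discovery and interoperability checks, each resource below is described by a small record of properties. The resource names, the property keys and the transformation name are all hypothetical.

```python
# Hypothetical metadata records describing three resources.
RESOURCES = [
    {"name": "name_resolver", "kind": "data source",
     "consumes": "scientific name", "produces": "accepted name"},
    {"name": "locality_source", "kind": "data source",
     "consumes": "accepted name", "produces": "locality list"},
    {"name": "envelope_model", "kind": "analysis tool",
     "consumes": "grid cells", "produces": "climate model"},
]

# Known conversions between data types: (from, to) -> transformation.
TRANSFORMS = {("locality list", "grid cells"): "rasterise_localities"}

def discover(**properties):
    """Locate resources whose metadata matches every given property."""
    return [
        r["name"] for r in RESOURCES
        if all(r.get(key) == value for key, value in properties.items())
    ]

def check_link(producer, consumer):
    """Check whether one resource's output can feed another's input,
    suggesting a transformation where one is known."""
    by_name = {r["name"]: r for r in RESOURCES}
    produced = by_name[producer]["produces"]
    expected = by_name[consumer]["consumes"]
    if produced == expected:
        return "compatible"
    if (produced, expected) in TRANSFORMS:
        return "needs transform: " + TRANSFORMS[(produced, expected)]
    return "incompatible"

print(discover(kind="data source"))
print(check_link("name_resolver", "locality_source"))
print(check_link("locality_source", "envelope_model"))
```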

  • What is biodiversity informatics?

    The preceding project, among others, shows that the challenges facing biodiversity informatics include not only describing the diversity of life at all levels of organisation, so that biologists can understand, conserve and exploit it,

    but also inventing ways to describe the ever-increasing diversity of information resources and analysis tools available, so that users can find and use them

  • A challenge to link resources

    It is potentially very difficult to link all these resources together

    Much attention is currently being given to providing unique identifiers for data objects, which can return metadata about themselves, and which can be stitched together into a distributed collaborative information system: see the biodiversity informatics organisations TDWG and GBIF (later)

  • End

    The Biostatistics and Bioinformatics Unit benefits from the support of a combined group of Organisations. Working in partnership these organisations are driving the future direction of this vital initiative