the possibility and probability of establishing a global neuroscience information framework: lessons...
TRANSCRIPT
The possibility and probability of establishing a global neuroscience
information framework:lessons learned from practical experiences in
data integration for neuroscience
Maryann Martone, Ph. D.University of California, San Diego
“Neural Choreography”
“A grand challenge in neuroscience is to elucidate brain function in relation to its multiple layers of organization that operate at different spatial and temporal scales. Central to this effort is tackling “neural choreography” -- the integrated functioning of neurons into brain circuits--their spatial organization, local and long-distance connections, their temporal orchestration, and their dynamic features. Neural choreography cannot be understood via a purely reductionist approach. Rather, it entails the convergent use of analytical and synthetic tools to gather, analyze and mine information from each level of analysis, and capture the emergence of new layers of function (or dysfunction) as we move from studying genes and proteins, to cells, circuits, thought, and behavior....
However, the neuroscience community is not yet fully engaged in exploiting the rich array of data currently available, nor is it adequately poised to capitalize on the forthcoming data explosion. “
Akil et al., Science, Feb 11, 2011
On the other hand... In that same issue of Science
Asked peer reviewers from last year about the availability and use of data About half of those polled store their data only in their
laboratories—not an ideal long-term solution. Many bemoaned the lack of common metadata and
archives as a main impediment to using and storing data, and most of the respondents have no funding to support archiving
And even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.
“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )
We speak piously of taking measurements and making small studies that will add another brick to the temple
of science. Most such bricks just lie around the
brickyard.
Platt, J.R. (1964) Strong Inference. Science. 146:
347-353.
"We now have unprecedented ability to collect data about nature…but there is now a crisis developing in biology, in that completely unstructured information does not enhance understanding”
-Sidney Brenner
The Encyclopedia of Life
A…
Access to data has changed
over the years
Tim Berner-s Lee: Web of data
Wikipedia defines Linked Data as "a term used to describe a recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs and RDF.” http://linkeddata.org/
Genbank
PDB
Are we there yet?We’d like to be able to find: What is known****:
What is the average diameter of a Purkinje neuron
Is GRM1 expressed In cerebral cortex?
What are the projections of hippocampus
What genes have been found to be upregulated in chronic drug abuse in adults
What studies used my monoclonal mouse antibody against GAD in humans?
Find all instances of spines that contain membrane-bound organelles ****by combining data from
different sources and different groups
What is not known: Connections among data Gaps in knowledge
We’d like it to be really simple to implement and use:– Query interface– Search strategies– Data sources– Infrastructure– Results display
– Trust– Context
– Analysis tools
– Tools for translating existing content into linkable form
– Tools for creating new data ready to be linked
NIF is an initiative of the NIH Blueprint consortium of institutes What types of resources (data, tools, materials,
services) are available to the neuroscience community?
How many are there? What domains do they cover? What domains do
they not cover? Where are they?
Web sites Databases Literature Supplementary material
Who uses them? Who creates them? How can we find them? How can we make them better in the future?
http://neuinfo.org
A look into the brickyard
• PDF files
• Desk drawers
How many resources are there?
•NIF Registry: A catalog of neuroscience-relevant resources
• > 3500 currently described
• > 1700 databases
•Another 3000 awaiting curation•And we are finding more every day
But we have Google!
Current web is designed to share documents Documents are
unstructured data Much of the content
of digital resources is part of the “hidden web”
Wikipedia: The Deep Web (also called Deepnet, the invisible Web, DarkNet, Undernet or the hidden Web) refers to World Wide Web content that is not part of the Surface Web, which is indexed by standard search engines.
A tip of the “resourceome”
Microarray9, 535, 440
Model organisms246, 639
Connectivity
26, 443
Antibodies890, 571
Pathways43, 013
Brain Activation
Foci56, 591
65 databases
But we have Pub Med!
Bulk of neuroscience data is published as part of papers > 20,000,000
Structured vs unstructured information
“...it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.” Lead Science editorial (Science 11 February 2011: Vol. 331 no. 6018 p. 649 )
Author, year, journal, keywords
Content
The Neuroscience Information Framework: Discovery and utilization of web-based
resources for neuroscience
A portal for finding and using neuroscience resources
A consistent framework for describing resources
Provides simultaneous search of multiple types of information, organized by category
Supported by an expansive ontology for neuroscience
Utilizes advanced technologies to search the “hidden web”
http://neuinfo.org
UCSD, Yale, Cal Tech, George Mason, Washington Univ
Supported by NIH Blueprint
Literature
Database Federation
Registry
Neuroscience is unlikely to be served by a few large databases like the genomics and proteomics community
Whole brain data (20 um microscopic
MRI) Mosiac LM images (1
GB+)
Conventional LM images
Individual cell morphologies
EM volumes & reconstruction
s
Solved molecular structures
No single technology serves these all equally well.Multiple data types;
multiple scales; multiple databases
A data federation problem
NIF Data Federation Too many databases to visit
Registry not adequate for finding and using them Capturing content in a few keywords is difficult if not impossible
Access to deep content; currently searches over 30 million records from > 65 different databases Flexible tools for resource providers to make their content available as
easily and meaningfully as possible Organized according to level of nervous system and data type,
e.g., brain activation foci Link to host resource: these databases are independent!
Provides simplified and unified views to help users navigate very different resources Common vocabularies Common data models for basic neuroscience data Laying the foundations for data integration for neuroscience
Hippocampus OR “Cornu Ammonis” OR “Ammon’s horn” Query expansion:
Synonyms and related concepts
Boolean queriesData sources
categorized by “data type” and level of nervous
system
Simplified views of complex data
sources
Tutorials for using full
resource when getting there
from NIF
Link back to record in
original source
What are the connections of the hippocampus?
Connects to
Synapsed with
Synapsed by
Input region
innervates
Axon innervates
Projects to
Cellular contact
Subcellular contact
Source site
Target site
Each resource implements a different, though related model; systems are complex and difficult to learn, in many cases
NIF: Minimum requirements to use shared data
You (and the machine) have to be able to find it Accessible through the web Structured or semi-structured Annotations
You (and the machine) have to be able to use it Data type specified and in a usable form
You (and the machine) have to know what the data mean
Semantics Context: Experimental metadata
Reporting neuroscience data within a consistent framework helps enormously
Is GRM1 in cerebral cortex?
NIF system allows easy search over multiple sources of information But, we have difficulty finding data Well known difficulties in search
Inconsistent and sparse annotation of scientific data Many different names for the same thing The same name means many things “Hidden semantics”: 1 = male; 1 = present; 1=mouse
Allen Brain Atlas
MGD
Gensat
Cerebral CortexAtlas Children Parent
Genepaint Neocortex, Olfactory cortex (Olfactory bulb; piriform cortex), hippocampus
Telencephalon
Allen Brain Atlas Cortical plate, Olfactory areas, Hippocampal Formation
Cerebrum
MBAT (cortex) Hippocampus, Olfactory, Frontal, Perirhinal cortex, entorhinal cortex
Forebrain
GENSAT Not defined Telencephalon
BrainInfo frontal lobe, insula, temporal lobe, limbic lobe, occipital lobe
Telencephalon
BrainmapsEntorhinal, insular, 6, 8, 4, A SII 17, Prp, SI
Telencephalon
What is an ontology?
Brain
Cerebellum
Purkinje Cell Layer
Purkinje cell
neuron
has a
has a
has a
is a
Ontology: an explicit, formal representation of concepts relationships among them within a particular domain that expresses human knowledge in a machine readable form
Branch of philosophy: a theory of what is
e.g., Gene ontologies
What can ontology do for us?
Express neuroscience concepts in a way that is machine readable Synonyms, lexical variants Definitions
Provide means of disambiguation of strings Nucleus part of cell; nucleus part of brain; nucleus part of atom
Rules by which a class is defined, e.g., a GABAergic neuron is neuron that releases GABA as a neurotransmitter
Properties Provide universals for navigating across different data
sources Semantic “index” Perform reasoning Link data through relationships not just one-to-one
mappings Provide the basis for concept-based queries to probe
and mine data As a branch of philosophy, make us think about the
nature of the things we are trying to describe, e.g., synapse is a site
Linking datatypes to semantics: What is the
average diameter of a Purkinje neuron dendrite?
Branch structure not a tree, not a set of blood vessels, not a road map but a DENDRITE
Because anyone who uses Neurolucida uses the same concepts: axon, dendrite, cell body, dendritic spine, information systems can combine the data together in meaningful ways
Neurolucida doesn’t, however, tell you that dendrite belongs to a neuron of a particular type or whether this dendrite is a neural dendrite at all
( (Color Yellow) ; [10,1] (Dendrite) ( 5.04 -44.40 -89.00 1.32) ; Root ( 3.39 -44.40 -89.00 1.32) ; R, 1 ( ( 2.81 -45.10 -90.00 0.91) ; R-1, 1 ( 2.81 -45.18 -90.00 0.91) ; R-1, 2 ( 1.90 -46.01 -90.00 0.91) ; R-1, 3 ( 1.82 -46.09 -90.00 0.91) ; R-1, 4 ( 0.91 -46.59 -90.00 0.91) ; R-1, 5 ( 0.41 -46.83 -92.50 0.91) ; R-1, 6 ( ( -0.66 -46.92 -88.50 0.74) ; R-1-1, 1 ( -0.74 -46.92 -88.50 0.74) ; R-1-1, 2 ( -2.15 -47.25 -88.00 0.74) ; R-1-1, 3 ( -2.15 -47.33 -88.00 0.74) ; R-1-1, 4 ( -3.06 -47.00 -87.00 0.74) ; R-1-1, 5 ( -4.05 -46.92 -86.00 0.74) ; R-1-1, 6
Output of Neurolucida neuron trace
“A rose by any other name...”:
Identity: Entities are uniquely identifiable Name is a meaningless numerical identifier (URI: Uniform
resource identifier) Any number of human readable labels can be assigned to it
Definition: Genera: is a type of (cell, anatomical structure, cell part) Differentia: “has a” A set of properties that distinguish among
members of that class Can include necessary and sufficient conditions
Implementation: How is this definition expressed Depending on the nature of the concept or entity and the needs
of the information system, we can say more or fewer things Different languages; can express different things about the
concept that can be computed upon OWL W3C standard, RDF
Comprehensive Ontology
NIF covers multiple structural scales and domains of relevance to neuroscience
Aggregate of community ontologies with some extensions for neuroscience, e.g., Gene Ontology, Chebi, Protein Ontology
Simple, basic “is a : hierarchies that can be used “as is” or to form the building blocks for more complex representations
NIFSTD
Organism
NS FunctionMolecule InvestigationSubcellular structure
Macromolecule Gene
Molecule Descriptors
Techniques
Reagent Protocols
Cell
Resource Instrument
Dysfunction QualityAnatomical Structure
Query across resources: Snca and
striatum
NIF uses the NIFSTD ontologies to query across sources that use very different terminologies, symbolic notations and levels of granularity
Entity mapping
BIRNLex_435 Brodmann.3
Explicit mapping of database content helps disambiguate non-unique and custom terminology
Concept-based search: search by meaning
Search Google: GABAergic neuron Search NIF: GABAergic neuron
NIF automatically searches for types of GABAergic neurons
Types of GABAergic neurons
Data mining through interrogation
What genes are upregulated by drugs of abuse in the adult mouse?
MorphineIncreased expression
Adult Mouse
Integration of knowledge based on relationships
Looking for commonalities and distinctions among animal models and human conditions based on phenotypes
Sarah Maynard, Chris Mungall, Suzie Lewis
NINDS
Thalamus
Cellular inclusion
Midline nuclear group
Lewy Body
Paracentral nucleus
Cellular inclusion
And now, the literature
The scientific article remains the currency of science
Vast majority of neuroscience data is published in the literatureComputational biologists like to consume
dataNeuroscientists like to produce it
Two NIF projects: 1) Resource identification from the literature
Identifying antibodies used in scientific studies from text
2) Extracting data from tables and supplementary material
Neuroscience is fundamentally reliant on antibodies
Neuroscientists spend a lot of time searching for antibodies that will work in their system for the target of interest and troubleshooting experiments that didn’t work
The scientific literature is a major source of information on antibodies
Proposal Use text mining strategies to identify
antibodies, protocol type and subject organism from materials and methods section of J. Neuroscience
Problem: antibodies
Midfrontal cortex tissue samples from neurologically unimpaired subjects (n9) and from subjects with AD (n11) were obtained from the Rapid Autopsy Program
Immunoblot analysis and antibodies
The following antibodies were used for immunoblotting: -actin mAb (1:10,000 dilution, Sigma-Aldrich); -tubulin mAb (1:10,000, Abcam); T46 mAb (specific to tau 404–441, 1:1000, Invitrogen); Tau-5 mAb (human tau 218–225, 1:1000, BD Biosciences) (Porzig et al., 2007); AT8 mAb (phospho-tau Ser199, Ser202, and Thr205, 1:500, Innogenetics); PHF-1 mAb (phospho-tau Ser396 and Ser404, 1:250, gift from P. Davies); 12E8 mAb (phospho-tau Ser262 and Ser356, 1:1000, gift from P. Seubert); NMDA receptors 2A, 2B and 2D goat pAbs (C terminus, 1:1000, Santa Cruz Biotechnology)…
Semantic annotation: Entity mapping by humanSato et al., J. Neurosci. 2008 Subject is
Human
Antibody #7
"12E8" is a Monoclonal antibody birnlex_2027Antibody reagent has target human PHF tau
Waiting for Neurolex ID
Protein product ofAntibody reagent has provider Peter SeubertAntibody reagent has catalog #Antibody reagent
has source organism Mouse birnlex_167 NCBI Taxonomic ID: 10090
Antibody reagent has id "12E8"
Provider has locationElan Pharmaceuticals, South San Francisco, CA
Provider has url
Try this Watson!
•95 antibodies were identified in 8 articles•52 did not contain enough information to determine the antibody used• Some provided details in another paper
• And another paper, and another...• Failed to give species, clonality, vendor, or
catalog number• But, many provided the location of the vendor
because the instructions to authors said to do so
• no antibodies had lot numbers associated We never got to test the algorithms!
NIF along with several other large informatics projects recommends that all authors provide vendor and catalog # for all reagents use
But...vendors merge and sell each other’s antibodies, making it difficult to track down exactly which reagent was used in some cases
Catalog numbers get replaced; many variants on the same product, e.g., HRP-conjugated, 200 ul vs 500 ul
Clone names are not unique Universal antibody ID
Publishing for the 21st Century
NIF Antibody Registry• We have created an antibody
registry database• Assigns each antibody a
persistent identifier to both commercial and non-commercial antibodies
• ID will persist even if company goes out of business or the antibody is sold by multiple vendors
• The data model is being formalized into a rigorous ontology in collaboration with others:
• We negotiated with antibody aggregators to pull data for over 800,000 commercial antibodies, 200 vendors
• Can be used to register homegrown antibodies as well• http://
antibodyregistry.org
“Find studies that used a rabbit polyclonal antibody against GFAP that
recognizes human in immunocytochemisty”
Paz et al, J Neurosci, 2010
(AB_310775)
Demo 2: Extracting data from tables and supplementary material
Challenge: Extract data on gene expression in brain from studies relevant to drug abuse
Workflow:
Find articlesExtract
results from tables
Standardize results
Load into NIF
Current DB: 140 tables from 54 articles
Andrea Arnaud-Stagg, Anita Bandrowski
Gene for tyrosine hydroxylase has increased expression in locus coeruleus of mouse compared to control when given chronic morphineTranslations:
Upregulated p < 0.05 = increased expressionLC = locus coeruleusProbe ID = gene name
Extract data and meaning of data from tables
Challenges working with tables and supplemental data
Difficult data arrangements PDF, JPG, TXT, CSV, XLS Difficult styles: colors, symbols, data arrangements
(results combined into one column, multiple comparisons in one table, legends defining values, unclearly described data (eg., unclear significance)
Not clear what tables/values represent nothing in paper about the supplementary data file and table has no
heading Probe ID’s are given but not gene identifiers
No link from supplemental material back to article; lose provenanceResults are presented but values of significance unclear
Neither curator (nor machine) could distinguish between no difference and not reported
What affects SMN1 expression?
Researchers often report results in a way where curators cannot extract full information from a study
Common theme•We are not publishing data in a form that is easy to integrate
• What we mean isn’t clear to a search engine (or even to a human)
• We use many different data structures to say the same thing
• We don’t provide crucial information
•Searching and navigating across individual resources takes an inordinate amount of human effortTempus Pecunia Est Painting by
Richard Harpum
Collaboration, competition, coordination, cooperation
The diversity and dynamism of neuroscience will make data integration challenging alwaysNeural space is vast: No one group or individual can do everythingWe don’t have to solve everything to make it betterGlobal partnership with room for everyone:
Neuroscientists Curators Resource developers Funders Computational biologists Text miners Computer scientists Watson
Hopeful signs...
•Means for sharing data on the web becoming more routine
•With availability, growing recognition for a role of standards and curation
•For neuroscience, we now have organizations that can help coordinate
•NIF, NITRC (http://nitrc.org)•Neuroimaging Tools and Resource Clearinghouse
•International Neuroinformatics Coordinating Facility
•Educate neuroscientists on what is necessary•Bring together stakeholders to define what is necessary for interoperation•Implement structures and procedures for developing neuroscience resources within a framework
Click icon to add picture
http://incf.org
We don’t know everything but we do know some things
1. Register your resource with NIF!!!!
3: Be mindful
Resource providers: Mindfulness that your resource is contributing data to a global federation Link to shared ontology identifiers
where possible Stable and unique identifiers for data Explicit semantics Database, model, atlas
Researchers: Mindfulness when publishing data that it is to be consumed by machines and not just your colleagues Accession numbers for genes and
species Catalog numbers for reagents Provide supplemental data in a form
where it is is easy to re-use
2. Become involved with NIF and INCF
Many thanks to...Amarnath Gupta, UCSD, Co Investigator
Jeff Grethe, UCSD, Co Investigator
Anita Bandrowski, NIF Curator
Gordon Shepherd, Yale University
Perry Miller
Luis Marenco
David Van Essen, Washington University
Erin Reid
Paul Sternberg, Cal Tech
Arun Rangarajan
Hans Michael Muller
Giorgio Ascoli, George Mason University
Sridevi Polavarum
Fahim Imam, NIF Ontology Engineer
Karen Skinner, NIH, Program Officer
Mark Ellisman
Lee Hornbrook
Kara Lu
Vadim Astakhov
Xufei Qian
Chris Condit
Stephen Larson
Sarah Maynard
Bill Bug
How old is an adult squirrel?
Definitions can be quantitative
Arbitrary but defensible
Qualitative categories for quantitative attributes Best practice to
provide ages of subjects, but for query, need to translate into qualitative concepts
Jonathan Cachat, Anita Bandrowski