knowledge-based integration of neuroscience data sources
Post on 23-Jan-2016
Knowledge-Based Integration of Neuroscience Data Sources
Amarnath Gupta
Bertram Ludäscher
Maryann Martone
University of California San Diego
A Standard Information Mediation Framework
[Figure: a Client Query is posed against an Integrated XML View, which is defined at the Mediator by a View Definition; below the Mediator, Wrappers export an XML View of each Data Source (one of them an XML Data Source).]
A Neuroscience Question
Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?
How about other rodents?
[Figure: an Integrated View defined at the Mediator (View Definition) over wrapped sources for protein localization, morphometry, and neurotransmission, plus wrapped WWW sources (CaBP, Expasy).]
Integration Issues
• Structural Heterogeneity
  – Resolved by converting to a common semistructured data model
• Heterogeneity in Query Capabilities
  – Resolved by writing wrappers with binding patterns and other capability-definition languages
• Semantic Heterogeneity
  – Schema conflicts
    • Partially resolved by mapping rules in the mediator
  – Hidden Semantics?
Hidden Semantics: Protein Localization
<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>RyR</>
    ….
  </protein>
  <region h_grid_pos="1" v_grid_pos="A">
    <density>
      <structure fraction="0.8">
        <name>spine</>
        <amount name="RyR">0</>
      </>
      <structure fraction="0.2">
        <name>branchlet</>
        <amount name="RyR">30</>
      </>
[Figure labels: Molecular layer of Cerebellar Cortex; Purkinje Cell layer of Cerebellar Cortex; Fragment of dendrite]
Hidden Semantics: Morphometry
<neuron name="purkinje cell">
  <branch level="10">
    <shaft>
      …
    </shaft>
    <spine number="1">
      <attachment x="5.3" y="-3.2" z="8.7" />
      <length>12.348</>
      <min_section>1.93</>
      <max_section>4.47</>
      <surface_area>9.884</>
      <volume>7.930</>
      <head>
        <width>4.47</>
        <length>1.79</>
      </head>
    </spine>
    …
• Branch level beyond 4 is a branchlet
• Must be dendritic because Purkinje cells don’t have somatic spines
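The two annotations on the morphometry fragment are facts that the source never states. A minimal Python sketch of making them explicit might look like the following. The well-formed XML string, the function name annotate_spines, and the label "other branch" for levels up to 4 are illustrative assumptions, not part of the original sources.

```python
import xml.etree.ElementTree as ET

# Well-formed stand-in for the slide's fragment (the slide abbreviates end tags as </>).
MORPHOMETRY_XML = """
<neuron name="purkinje cell">
  <branch level="10">
    <spine number="1">
      <length>12.348</length>
    </spine>
  </branch>
</neuron>
"""

BRANCHLET_MIN_LEVEL = 4  # from the slide: branch level beyond 4 is a branchlet


def annotate_spines(xml_text):
    """Return one record per spine with the implicit anatomy made explicit."""
    neuron = ET.fromstring(xml_text)
    cell_type = neuron.get("name")
    records = []
    for branch in neuron.iter("branch"):
        level = int(branch.get("level"))
        # "other branch" is a placeholder label; the slide only names branchlets.
        structure = "branchlet" if level > BRANCHLET_MIN_LEVEL else "other branch"
        for spine in branch.iter("spine"):
            records.append({
                "cell": cell_type,
                "spine": spine.get("number"),
                "on": structure,
                # Purkinje cells have no somatic spines, so the spine is dendritic.
                "compartment": "dendrite" if cell_type == "purkinje cell" else "unknown",
            })
    return records


if __name__ == "__main__":
    print(annotate_spines(MORPHOMETRY_XML))
```

With this derived "branchlet" label, the spine record can be matched against the "branchlet" structure name used in the protein localization source, even though the morphometry source never uses that word.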
The Problem
• Multiple Worlds Integration
  – compatible terms not directly joinable
  – complex, indirect associations among schema elements
  – unstated integrity constraints
• Why not use ontologies?
  – typical ontologies associate terms along a limited number of dimensions
• What’s needed
  – a “theory” under which non-identical terms can be “semantically” joined
Our Approach
• Modify the standard Mediation Architecture
  – Wrapper
    • Extend to encode an object-version of the structure schema
  – Mediator
    • Redesign to incorporate auxiliary knowledge sources to
      – Correlate object schema of sources
      – Define additional objects not specified but derivable from sources
• At the Mediator
  – Use a logic engine to
    • Encode the mapping rules between sources
    • Define integrated views using a combination of exported objects from the sources and the auxiliary knowledge sources
    • Perform query decomposition
• We still use the Global-as-View form of mediation
The KIND Architecture
[Figure: the KIND architecture. Components shown: Integrated User View; View Definition Rules; Logic Engine with Integration Logic; Schema of Registered Sources; Materialized Views; Auxiliary Knowledge Source 1 and Auxiliary Knowledge Source 2; and, for each of Src 1 and Src 2, an Object Wrapper over a Structure Wrapper.]
The Knowledge-Base
• Situate every data object in its anatomical context
  – An illustration
  – New data is registered with the knowledge-base
  – Insertion of new data reconciles the current knowledge-base with the new information by:
    • Indexing the data with the source as part of registration
    • Extending the knowledge-base
    • Creating new views with complex rules to encode additional domain knowledge
F-Logic for the Mediation Engine
• Why F-Logic?
  – Provides the power of Datalog (with negation) and object creation through Skolem IDs
  – Correct amount of “notational sugar” and rules to provide object-oriented abstraction
  – Schema-level reasoning
  – Expressing variable arity
• F-Logic in KIND
  – Source schema wrapped into F-Logic schema
  – Knowledge-sources programmed in F-Logic
  – Definition of Integrated Views
Wrapping into Logic Objects
• Automated Part
    <!ELEMENT Studies (Study)*>
    <!ELEMENT Study (study_id, … animal, experiments, experimenters)>
    <!ELEMENT experiments (experiment)*>
    <!ELEMENT experiment (description, instrument, parameters)>

    studyDB[studies =>> study].
    study[study_id => string; … animal => animal; experiments =>> experiment; experimenters =>> string].
    …
• Non-automated Part
  – Subclasses
      mushroom_spine::spine
  – Rules
      S:mushroom_spine IF S:spine[head -> _; neck -> _].
  – Integrity Constraints
      ic1(S):alert[type -> "invalid spine"; object -> S] IF S:spine[undef ->> {head, neck}].
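The split between the automated and the non-automated part can be illustrated in Python. This is only a rough sketch under stated assumptions: the regular expression handles just the simple declaration shapes shown on the slide, spines are modelled as plain dictionaries, and the names dtd_to_signatures, classify_spine, and check_spine are hypothetical helpers rather than anything in KIND.

```python
import re

# Declarations copied from the slide (the elided Study declaration is left out).
DTD = """
<!ELEMENT Studies (Study)*>
<!ELEMENT experiments (experiment)*>
<!ELEMENT experiment (description, instrument, parameters)>
"""

ELEMENT_RE = re.compile(r"<!ELEMENT\s+(\w+)\s+\(([^)]*)\)(\*?)>")


def dtd_to_signatures(dtd_text):
    """Automated part: map each element declaration to (class, [(attribute, arrow)])."""
    signatures = []
    for name, body, star in ELEMENT_RE.findall(dtd_text):
        children = [c.strip() for c in body.split(",") if c.strip()]
        arrow = "=>>" if star else "=>"  # repeated content becomes set-valued
        signatures.append((name, [(child, arrow) for child in children]))
    return signatures


def classify_spine(spine):
    """Rule: a spine that has both a head and a neck is a mushroom spine."""
    if "head" in spine and "neck" in spine:
        return "mushroom_spine"
    return "spine"


def check_spine(spine):
    """Integrity constraint: alert when head and neck are both undefined."""
    if "head" not in spine and "neck" not in spine:
        return {"type": "invalid spine", "object": spine["id"]}
    return None


if __name__ == "__main__":
    for cls, attrs in dtd_to_signatures(DTD):
        print(cls, attrs)
    print(classify_spine({"id": "s1", "head": {}, "neck": {}}))
    print(check_spine({"id": "s3"}))
```

The first function is the kind of translation a wrapper generator can do mechanically from a DTD. The last two encode domain knowledge that has to be written by hand.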
Computing with Auxiliary Sources
• Creating Mediated Classes
  – union view
      animal[M => R] IF S:source, S.animal[M => R].
  – association rule
      animal[taxon => 'TAXON'.taxon].
      X[taxon -> T] IF X:'PROLAB'.animal[name -> N], words(N,[W1,W2|_]), T:'TAXON'.taxon[genus -> W1; species -> W2].
• Reasoning with Schema
  – Schema
      taxon[subspecies => string; species => string; genus => string; … phylum => string; kingdom => string; superkingdom => string].
  – At Mediator
      subspecies::species::genus:: … kingdom::superkingdom
  – Class creation by schema reasoning
      T:TR, TR::TR1 IF T:'TAXON'.taxon[Taxon_Rank -> TR, Taxon_Rank1 -> TR1], Taxon_Rank::Taxon_Rank1.
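To see what the association rule and the schema-reasoning rule compute, here is a small Python sketch. The records in TAXA and ANIMALS are made-up illustrations, the function names are hypothetical, and only the ranks named explicitly on the slide are used because the rest are elided there.

```python
# Illustrative records only; not taken from the TAXON or PROLAB sources.
TAXA = [
    {"id": "t1", "genus": "Rattus", "species": "norvegicus"},
    {"id": "t2", "genus": "Mus", "species": "musculus"},
]
ANIMALS = [
    {"id": "a1", "name": "Rattus norvegicus Sprague-Dawley"},
    {"id": "a2", "name": "unknown"},
]

# Ranks from low to high; the slide elides the ranks above genus.
RANKS = ["subspecies", "species", "genus"]


def link_taxa(animals, taxa):
    """Association rule: the first two words of an animal name select a taxon."""
    index = {(t["genus"], t["species"]): t["id"] for t in taxa}
    links = {}
    for animal in animals:
        words = animal["name"].split()
        if len(words) >= 2:
            taxon_id = index.get((words[0], words[1]))
            if taxon_id is not None:
                links[animal["id"]] = taxon_id
    return links


def derive_classes(taxon, ranks=RANKS):
    """Schema reasoning: rank values become classes, ordered like the ranks."""
    present = [taxon[r] for r in ranks if r in taxon]
    # The taxon is an instance of its lowest-rank class ...
    membership = (taxon["id"], present[0]) if present else None
    # ... and each rank value is a subclass of the next higher one.
    subclass_edges = list(zip(present, present[1:]))
    return membership, subclass_edges


if __name__ == "__main__":
    print(link_taxa(ANIMALS, TAXA))
    print(derive_classes(TAXA[0]))
```

Only adjacent ranks are linked here. In F-Logic the subclass relation is transitive, so the remaining pairs follow without being listed.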
Integrated View Definition
• Views are defined between sources and knowledge base
• Example: protein_distribution (a second integrated view)
  – given: organism, protein, brain_region
  – KB Anatom:
    • recursively traverse the has_a paths under brain_region and collect all anatomical_entities
  – Source PROLAB:
    • join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount” where “image.segments.features.feature.protein_name” = protein and “study_db.study.animal.name” = organism
  – Mediator:
    • aggregate over all parents up to brain_region
    • report distribution
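The three steps of the protein_distribution view (traverse has_a, join with the source, aggregate upward) can be sketched in Python as follows. The has_a table is a toy fragment, the amounts are invented for illustration, and the flat tuple layout stands in for the nested PROLAB attribute paths; none of it is actual KIND code.

```python
# Toy anatomical knowledge base: region -> parts (has_a).
HAS_A = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["molecular layer", "purkinje cell layer"],
}

# Invented source rows: (organism, structure, protein_name, protein_amount).
FEATURES = [
    ("rat", "molecular layer", "RyR", 30.0),
    ("rat", "purkinje cell layer", "RyR", 12.0),
    ("rat", "molecular layer", "other", 5.0),
    ("mouse", "molecular layer", "RyR", 7.0),
]


def protein_distribution(organism, protein, brain_region, has_a=HAS_A, features=FEATURES):
    """Return {anatomical entity: total amount} for everything under brain_region."""
    # Step 1 (knowledge base): traverse has_a paths and remember each entity's parent.
    parent = {}
    stack = [brain_region]
    while stack:
        node = stack.pop()
        for child in has_a.get(node, []):
            if child not in parent and child != brain_region:
                parent[child] = node
                stack.append(child)
    totals = {entity: 0.0 for entity in [brain_region, *parent]}

    # Step 2 (source): keep rows matching organism, protein, and a collected entity.
    for org, structure, name, amount in features:
        if org == organism and name == protein and structure in totals:
            # Step 3 (mediator): add the amount to the entity and all its parents.
            node = structure
            while node is not None:
                totals[node] += amount
                node = parent.get(node)
    return totals


if __name__ == "__main__":
    print(protein_distribution("rat", "RyR", "cerebellum"))
```

Summation is used as the aggregate here only as an assumption; the slide does not say which aggregate the mediator applies.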
Query Evaluation Example
• protein distribution of Human NCS-1 homologue
  – from wrapped CaBP website:
    • get the amino acid sequence for human NCS-1
  – from wrapped Expasy website:
    • submit amino acid sequence, get ranked homologues
  – at Mediator:
    • select homologues H found in rat, and homology > 0.70
  – at Mediator:
    • for each h in H
      – from previous view:
        » protein_distribution(rat, h, cerebellum, distribution)
    • Construct result
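The evaluation plan above is a pipeline over three sources, which can be sketched in Python with the sources passed in as functions. Everything below is hypothetical scaffolding: the stub functions stand in for the wrapped CaBP and Expasy sites and for the protein_distribution view, and the protein names, sequence string, and scores are placeholders.

```python
def ncs1_homologue_distribution(get_sequence, find_homologues, protein_distribution,
                                organism="rat", region="cerebellum", threshold=0.70):
    """Follow the plan: fetch sequence, rank homologues, filter, then call the view."""
    sequence = get_sequence("NCS-1", "human")          # wrapped CaBP site
    ranked = find_homologues(sequence)                  # wrapped Expasy site
    selected = [protein for protein, org, score in ranked
                if org == organism and score > threshold]   # at the mediator
    # One call to the previously defined view per selected homologue.
    return {h: protein_distribution(organism, h, region) for h in selected}


# Stub sources with placeholder data, for illustration only.
def stub_get_sequence(protein, organism):
    return "PLACEHOLDER-SEQUENCE"


def stub_find_homologues(sequence):
    return [
        ("protein_A", "rat", 0.95),
        ("protein_B", "rat", 0.60),    # below the 0.70 threshold
        ("protein_C", "mouse", 0.99),  # wrong organism
    ]


def stub_protein_distribution(organism, protein, region):
    return {region: 1.0}


if __name__ == "__main__":
    print(ncs1_homologue_distribution(stub_get_sequence,
                                      stub_find_homologues,
                                      stub_protein_distribution))
```

Passing the sources as parameters keeps the plan separate from how each wrapper actually talks to its site, which is the division of labour the mediator architecture relies on.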
Implementation
• System
  – Flora as F-Logic Engine
  – Communicate with ODBC databases through underlying XSB Prolog
  – XML wrapping and Web querying through XMAS, our XML query language, and custom-built wrappers
• Data
  – Human Brain Project sites
  – NPACI Neuroscience Thrust sites
Work in Progress
• Architecture
  – plug-in architecture for
    • domain knowledge sources
    • conceptual models from data sources
• Functionality
  – better handling of large data
  – operations
    • expressive query language
    • operators for domain knowledge manipulation
  – query evaluation
    • query optimization using domain knowledge
• Demonstration
  – at VLDB 2000