Knowledge-Based Integration of Neuroscience Data Sources

Amarnath Gupta

Bertram Ludäscher

Maryann Martone

University of California San Diego

A Standard Information Mediation Framework

[Diagram labels: Client Query; Integrated XML View; Mediator with View Definition; XML View (×3); Wrapper (×2); Data Source (×2); XML Data Source]

A Neuroscience Question

Cerebellar distribution of rat proteins with more than 70% homology with human NCS-1? Any structure specificity?

How about other rodents?

[Diagram labels: Integrated View; Mediator with View Definition; Wrapper (×4); sources: protein localization, morphometry, neurotransmission, CaBP, Expasy]

Integration Issues

• Structural Heterogeneity
  – Resolved by converting to common semistructured data

• Heterogeneity in Query Capabilities
  – Resolved by writing wrappers with binding patterns and other capability-definition languages

• Semantic Heterogeneity
  – Schema conflicts
    • Partially resolved by mapping rules in the mediator
  – Hidden Semantics?
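To illustrate the binding-pattern idea for sources with limited query capabilities, here is a minimal Python sketch. The class `BindingPattern`, the function `answerable`, and the attribute names of `HOMOLOGY_SOURCE` are hypothetical and are not taken from the system described in these slides. An adornment string marks each attribute as `b` (must be bound by the caller) or `f` (free, returned by the source).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BindingPattern:
    """One access path a wrapper supports ('b' = bound input, 'f' = free output)."""
    attributes: tuple
    adornment: str

    def __post_init__(self):
        if len(self.attributes) != len(self.adornment):
            raise ValueError("one adornment flag is needed per attribute")
        if set(self.adornment) - {"b", "f"}:
            raise ValueError("adornment flags must be 'b' or 'f'")

    def accepts(self, bound):
        """True if every attribute flagged 'b' is among the bound attributes."""
        return all(attr in bound
                   for attr, flag in zip(self.attributes, self.adornment)
                   if flag == "b")


def answerable(patterns, bound):
    """A source can answer a subquery if at least one of its patterns accepts it."""
    return any(p.accepts(bound) for p in patterns)


# Hypothetical web-form source: a sequence must be supplied,
# homologue names and scores come back.
HOMOLOGY_SOURCE = [BindingPattern(("sequence", "homologue", "score"), "bff")]

can_search_by_sequence = answerable(HOMOLOGY_SOURCE, {"sequence"})
can_search_by_score = answerable(HOMOLOGY_SOURCE, {"score"})
```

A mediator could use such a check during query decomposition to order subqueries so that required inputs are produced before the source that needs them is called.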

Hidden Semantics: Protein Localization

<protein_localization>
  <neuron type="purkinje cell" />
  <protein channel="red">
    <name>spine</>
    <amount name="RyR">0</>
  </>
  <structure fraction="0.2">
    <name>branchlet</>
    <amount name="RyR">30</>

Annotations:
  Molecular layer of Cerebellar Cortex
  Purkinje Cell layer of Cerebellar Cortex
  Fragment of dendrite

Hidden Semantics: Morphometry

<neuron name="purkinje cell">
  …
  </shaft>
  <spine number="1">
    <attachment x="5.3" y="-3.2" z="8.7" />
    <length>12.348</>
    <min_section>1.93</>
    <max_section>4.47</>
    <surface_area>9.884</>
    <volume>7.930</>
    <head>
      <width>4.47</>
  </spine>
  …

Annotations:
  Branch level beyond 4 is a branchlet
  Must be dendritic because Purkinje cells don't have somatic spines
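The two annotations above are domain rules that neither source states explicitly. A minimal Python sketch of making them explicit follows. The `Spine` record, its `branch_level` field, and the label `"branch"` for levels up to 4 are assumptions made for illustration. The slides only say that levels beyond 4 are branchlets and that Purkinje cell spines must be dendritic.

```python
from dataclasses import dataclass


@dataclass
class Spine:
    neuron: str        # e.g. "purkinje cell"
    branch_level: int  # hypothetical: branch order of the shaft carrying the spine


def segment_kind(branch_level: int) -> str:
    """Slide rule: a branch level beyond 4 is a branchlet.

    The name "branch" for lower levels is an assumption.
    """
    return "branchlet" if branch_level > 4 else "branch"


def spine_compartment(spine: Spine) -> str:
    """Slide rule: Purkinje cells have no somatic spines, so a spine is dendritic."""
    if spine.neuron == "purkinje cell":
        return "dendrite"
    return "unknown"  # no rule is given on the slides for other neuron types


example = Spine(neuron="purkinje cell", branch_level=5)
example_kind = segment_kind(example.branch_level)
example_compartment = spine_compartment(example)
```

With both rules applied, a spine measured in the morphometry source can be matched to the "branchlet" structure reported by the protein localization source, even though neither source uses the other's term.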

The Problem

• Multiple Worlds Integration
  – compatible terms not directly joinable
  – complex, indirect associations among schema elements
  – unstated integrity constraints

• Why not use ontologies?
  – typical ontologies associate terms along a limited number of dimensions

• What’s needed
  – a “theory” under which non-identical terms can be “semantically” joined

Our Approach

• Modify the standard Mediation Architecture
  – Wrapper
    • Extend to encode an object-version of the structure schema
  – Mediator
    • Redesign to incorporate auxiliary knowledge sources to
      – Correlate object schema of sources
      – Define additional objects not specified but derivable from sources

• At the Mediator
  – Use a logic engine to
    • Encode the mapping rules between sources
    • Define integrated views using a combination of exported objects from source and the auxiliary knowledge sources
    • Perform query decomposition

• We still use Global-as-View form of mediation

The KIND Architecture

[Architecture diagram labels: Integrated User View; View Definition Rules; Logic Engine; Integration Logic; Schema of Registered Sources; Materialized Views; Auxiliary Knowledge Source 1; Auxiliary Knowledge Source 2; Src 1 and Src 2, each with a Structure Wrapper and an Object Wrapper]

The Knowledge-Base

• Situate every data object in its anatomical context
  – An illustration
  – New data is registered with the knowledge-base
  – Insertion of new data reconciles the current knowledge-base with the new information by:
    • Indexing the data with the source as part of registration
    • Extending the knowledge-base
    • Creating new views with complex rules to encode additional domain knowledge
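The registration steps above can be pictured with a small Python sketch. The `KnowledgeBase` class, its method names, and the object identifiers are hypothetical. The anatomical terms are borrowed from the earlier slide annotations, and the real knowledge-base is described in the slides as being programmed in F-Logic, not Python.

```python
from collections import defaultdict


class KnowledgeBase:
    """Toy anatomical knowledge-base: has_a edges plus an index of registered data."""

    def __init__(self):
        self.has_a = defaultdict(set)   # parent term -> child terms
        self.index = defaultdict(list)  # term -> [(source, object_id), ...]

    def add_term(self, parent, child):
        """Extend the knowledge-base with a new has_a edge."""
        self.has_a[parent].add(child)
        self.has_a.setdefault(child, set())

    def register(self, source, object_id, term, parent=None):
        """Index a data object under its anatomical term, extending the KB if needed."""
        if term not in self.has_a:
            if parent is None:
                raise KeyError(f"unknown term {term!r}; a parent term is required")
            self.add_term(parent, term)
        self.index[term].append((source, object_id))


kb = KnowledgeBase()
kb.add_term("cerebellar cortex", "molecular layer")
kb.add_term("cerebellar cortex", "purkinje cell layer")

# Known term: only the index changes.
kb.register("PROLAB", "image-1", "molecular layer")
# Unknown term: the KB is extended first, then the object is indexed.
kb.register("PROLAB", "image-2", "granular layer", parent="cerebellar cortex")
```

The third reconciliation step, creating new views with rules, is not covered by this sketch.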

F-Logic for the Mediation Engine

• Why F-Logic?
  – Provides the power of Datalog (with negation) and object creation through Skolem IDs
  – Correct amount of “notational sugar” and rules to provide object-oriented abstraction
  – Schema-level reasoning
  – Expressing variable arity

• F-Logic in KIND
  – Source schema wrapped into F-Logic schema
  – Knowledge-sources programmed in F-Logic
  – Definition of Integrated Views

Wrapping into Logic Objects

• Automated Part

  <!ELEMENT Studies (Study)*>
  <!ELEMENT Study (study_id, … animal, experiments, experimenters)>
  <!ELEMENT experiments (experiment)*>
  <!ELEMENT experiment (description, instrument, parameters)>

  studyDB[studies study].
  study[study_id string; … animal animal; experiments experiment; experimenters string].
  …

• Non-automated Part
  • Subclasses
      mushroom_spine::spine
  • Rules
      S:mushroom_spine IF S:spine[head _; neck _].
  • Integrity Constraints
      ic1(S):alert[type "invalid spine"; object S] IF S:spine[undef {head, neck}].
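As a rough Python analogue of the rule and the integrity constraint above: the sketch reads the rule as "a spine with both a head and a neck is a mushroom spine" and the constraint as "a spine with neither a head nor a neck is invalid". That reading, the dictionary layout, and the function names are assumptions. The slides express both in F-Logic.

```python
def is_mushroom_spine(spine: dict) -> bool:
    """Analogue of the rule: S is a mushroom_spine if S is a spine with head and neck."""
    return spine.get("head") is not None and spine.get("neck") is not None


def integrity_alerts(spines: dict) -> list:
    """Analogue of ic1: raise an 'invalid spine' alert when head and neck are both undefined."""
    return [
        {"type": "invalid spine", "object": spine_id}
        for spine_id, spine in spines.items()
        if spine.get("head") is None and spine.get("neck") is None
    ]


# Hypothetical wrapped objects; only the 4.47 head width appears on the slides.
spines = {
    "s1": {"head": {"width": 4.47}, "neck": {"width": 1.0}},
    "s2": {"head": {"width": 2.0}},
    "s3": {},
}

mushroom_ids = [sid for sid, s in spines.items() if is_mushroom_spine(s)]
alerts = integrity_alerts(spines)
```

Note that `s2` is neither a mushroom spine nor invalid under this reading, since only one of the two parts is missing.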

Computing with Auxiliary Sources

• Creating Mediated Classes

  animal[MR] IF S:source, S.animal[MR].                  (union view)

  animal[taxon 'TAXON'.taxon].
  X[taxon T] IF X:'PROLAB'.animal[name N],
      words(N,[W1,W2|_]),
      T:'TAXON'.taxon[genus W1; species W2].             (association rule)

• Reasoning with Schema

  Schema:
    taxon[subspecies string; species string; genus string; … phylum string; kingdom string; superkingdom string].

  At Mediator:
    subspecies::species::genus:: … kingdom::superkingdom

    T:TR, TR::TR1 IF
        T:'TAXON'.taxon[Taxon_Rank TR, Taxon_Rank1 TR1],
        Taxon_Rank::Taxon_Rank1.                         (class creation by schema reasoning)
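A minimal Python sketch of the association rule and of class creation by schema reasoning, under stated assumptions. The taxon record, the function names, and the exact-match comparison are hypothetical. `RANKS` lists only the ranks named on the slide, which elides the ones between genus and phylum.

```python
# Ranks named on the slide, lowest first (intermediate ranks are elided there).
RANKS = ["subspecies", "species", "genus", "phylum", "kingdom", "superkingdom"]


def link_taxon(animal_name, taxa):
    """Association rule: match the first two words of the name to genus and species."""
    words = animal_name.split()
    if len(words) < 2:
        return None
    genus, species = words[0], words[1]
    for taxon in taxa:
        if taxon.get("genus") == genus and taxon.get("species") == species:
            return taxon
    return None


def class_edges(taxon):
    """Schema reasoning: each rank value becomes a class, and the value at a lower
    rank becomes a subclass of the value at every higher rank present."""
    present = [taxon[rank] for rank in RANKS if taxon.get(rank)]
    return [(low, high)
            for i, low in enumerate(present)
            for high in present[i + 1:]]


# Hypothetical taxonomy record.
taxa = [{"genus": "Rattus", "species": "norvegicus", "kingdom": "Animalia"}]

linked = link_taxon("Rattus norvegicus albino", taxa)
edges = class_edges(linked)
```

The join works even though the animal name in one source and the genus/species pair in the other are not directly comparable attributes, which is the point of the association rule.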

Integrated View Definition

• Views are defined between sources and knowledge base

• Example: protein_distribution
  – given: organism, protein, brain_region
  – KB Anatom:
    • recursively traverse the has_a paths under brain_region and collect all anatomical_entities
  – Source PROLAB:
    • join with anatomical structures and collect the value of attribute “image.segments.features.feature.protein_amount” where “image.segments.features.feature.protein_name” = protein and “study_db.study.animal.name” = organism
  – Mediator:
    • aggregate over all parents up to brain_region
    • report distribution

a second integrated view
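The three steps of the protein_distribution view can be sketched in Python as follows. This is an illustration only: the row layout, the use of a sum as the aggregate, and the sample amounts are assumptions, and the has_a structure is assumed to be a tree (acyclic, no shared children).

```python
from collections import defaultdict


def descendants(has_a, region):
    """KB step: recursively follow has_a edges below region."""
    found, stack = [], [region]
    while stack:
        node = stack.pop()
        for child in has_a.get(node, ()):
            if child not in found:
                found.append(child)
                stack.append(child)
    return found


def protein_distribution(organism, protein, brain_region, has_a, rows):
    entities = set(descendants(has_a, brain_region)) | {brain_region}

    # Source step: join rows with the anatomical entities and collect amounts.
    direct = defaultdict(float)
    for row in rows:
        if (row["organism"] == organism and row["protein_name"] == protein
                and row["structure"] in entities):
            direct[row["structure"]] += row["protein_amount"]

    # Mediator step: aggregate each structure's amount into all of its parents.
    def total(node):
        return direct.get(node, 0.0) + sum(total(c) for c in has_a.get(node, ()))

    return {entity: total(entity) for entity in entities}


# Hypothetical knowledge-base fragment and source rows.
has_a = {
    "cerebellum": ["cerebellar cortex"],
    "cerebellar cortex": ["molecular layer", "purkinje cell layer"],
}
rows = [
    {"organism": "rat", "protein_name": "RyR", "structure": "molecular layer", "protein_amount": 30.0},
    {"organism": "rat", "protein_name": "RyR", "structure": "purkinje cell layer", "protein_amount": 10.0},
    {"organism": "mouse", "protein_name": "RyR", "structure": "molecular layer", "protein_amount": 5.0},
]

distribution = protein_distribution("rat", "RyR", "cerebellum", has_a, rows)
```

In this toy data the cerebellum and the cerebellar cortex both report the summed amount of their two layers, while the mouse row is filtered out by the organism condition.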

Query Evaluation Example

• protein distribution of Human NCS-1 homologue
  – from wrapped CaBP website:
    • get the amino acid sequence for human NCS-1
  – from wrapped Expasy website:
    • submit amino acid sequence, get ranked homologues
  – at Mediator:
    • select homologues H found in rat, and homology > 0.70
  – at Mediator:
    • for each h in H
      – from previous view:
        » protein_distribution(rat, h, cerebellum, distribution)
    • Construct result
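The evaluation plan above can be written as a short Python function, assuming the two wrapped websites and the earlier view are available as callables. The parameter names, the tuple layout returned by `find_homologues`, and the stub data are hypothetical. No real CaBP or Expasy interface is implied.

```python
def ncs1_homologue_distribution(get_sequence, find_homologues, protein_distribution,
                                organism="rat", region="cerebellum", min_homology=0.70):
    """Chain the plan: CaBP wrapper, Expasy wrapper, selection, then the earlier view."""
    # From the wrapped CaBP source: sequence of human NCS-1.
    sequence = get_sequence("NCS-1", "human")
    # From the wrapped Expasy source: ranked (protein, organism, homology) tuples.
    ranked = find_homologues(sequence)
    # At the mediator: keep homologues found in the organism above the threshold.
    selected = [name for name, found_in, homology in ranked
                if found_in == organism and homology > min_homology]
    # At the mediator: evaluate the protein_distribution view for each homologue.
    return {name: protein_distribution(organism, name, region) for name in selected}


# Stub sources with made-up values, standing in for the wrapped websites.
def fake_get_sequence(protein, organism):
    return "PLACEHOLDER-SEQUENCE"


def fake_find_homologues(sequence):
    return [("protA", "rat", 0.90), ("protB", "rat", 0.60), ("protC", "mouse", 0.95)]


def fake_distribution(organism, protein, region):
    return {region: 1.0}


result = ncs1_homologue_distribution(fake_get_sequence, fake_find_homologues, fake_distribution)
```

Passing the sources in as callables keeps the plan itself independent of how each website is wrapped. Answering "How about other rodents?" would only change the `organism` argument.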

Implementation

• System
  – Flora as F-Logic Engine
  – Communicate with ODBC databases through underlying XSB Prolog
  – XML wrapping and Web querying through XMAS, our XML query language and custom-built wrappers

• Data
  – Human Brain Project sites
  – NPACI Neuroscience Thrust sites

Work in Progress

• Architecture
  – plug-in architecture for
    • domain knowledge sources
    • conceptual models from data sources

• Functionality
  – better handling of large data
  – operations
    • expressive query language
    • operators for domain knowledge manipulation
  – query evaluation
    • query optimization using domain knowledge

• Demonstration
  – at VLDB 2000
