
Quete: Ontology-Based Query System for Distributed Sources

Haridimos Kondylakis, Anastasia Analyti, Dimitris Plexousakis

{kondylak, analyti, dp}@ics.forth.gr

Computer Science Department, University of Crete & FORTH-ICS

Presentation Outline

1. Motivation
2. Current Integration Approaches
3. Quete Overview
4. Querying in Quete
5. Evaluation
6. Conclusions
7. Future Work


1. Motivation

[Figure: integration scenario. Components: Clinical IS, Genomic IS, mediator D.B., Query Engine, Normalization Tools, Visualization Tools, Regulatory Element Tools, and Statistical, Clustering, Classification Tools. Data exchanged: metadata, findings, sample name, normalized data.]


2. Current Approaches (1/2)

Warehouse Integration
  Data is downloaded, filtered, integrated and stored in a warehouse. Answers are taken from the warehouse.
  Example: GUS

Navigational Integration
  Explicit links between data
  Examples: SRS, Entrez

Mediator-Wrapper Approaches
  A global schema is defined over all data sources
  Examples: K2/BioKleisli, TAMBIS, BACIIS, DiscoveryLink


2. Current Approaches (2/2)

Mediator-Wrapper approach

GAV approach
  The global schema is defined in terms of the source terminologies

LAV approach
  The sources are defined in terms of the global schema

[Figure: a Query goes to the Mediator, which reaches Source 1 and Source 2 through one Wrapper each and returns the Results.]


3. Integration Architecture

[Figure: QUETE architecture. Components: Ontology, Java Application, Java DB Engine, and Source 1, Source 2, Source 3, each reached through a Jdbc-Odbc connection. Input: Query. Output: Result.]


3.1 The Reference Ontology

The ontology is organized as a graph of concepts (plus relationship concepts) related through IS-A and HAS-A links.

Concept               Attributes
TumorSample           TumorIdentifier : String, SurgeryDate : Date
RiskFactors           YearsOfSmoking : Int, Age : Int
Hybridization         HybridizationDate : Date
BreastCancerPatient   Name : String, City : String, SSN : String
GeneExpression        RatioValue : Decimal
Reporter              ReporterName : String, HGNCGeneSymbol : String
GOAnnotation          GOId : String, GOName : String
GOBiologicalProcess   (none shown)
GOMolecularFunction   (none shown)
GOCellularComponent   (none shown)

[Figure: ontology graph; each edge is labeled IS-A or HAS-A.]


3.2 Semantic Names

A semantic name (SN) captures the system-independent semantics of a schema element by combining one or more ontology terms:

Semantic_name = [CN1; …; CNm] AN

The semicolon between CNi and CNi+1 means that concept CNi is a generalization of concept CNi+1.

Type    Semantic Name                                  System Name
Table   [BreastCancerPatient]                          BreastCancerPatient
Field   [BreastCancerPatient] Name                     Name
Field   [BreastCancerPatient] City                     City
Table   [BreastCancerPatient;TumorSample]              SurgicalExcision
Field   [BreastCancerPatient;TumorSample] TumorId      TumorSampleId
Field   [BreastCancerPatient;TumorSample] SurgeryDate  SurgeryDate
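The `[CN1; …; CNm] AN` notation can be split mechanically into a concept path and an attribute name. The following is a minimal sketch, not Quete's own parser; the names `SemanticName` and `parse_semantic_name` are hypothetical, and treating a missing attribute as a table-level name is an assumption drawn from the table above.

```python
import re
from typing import NamedTuple, Optional, Tuple


class SemanticName(NamedTuple):
    concepts: Tuple[str, ...]   # CN1..CNm, most general concept first
    attribute: Optional[str]    # AN, or None for a table-level name (assumption)


# "[CN1;...;CNm]" followed by an optional attribute name.
_SN_PATTERN = re.compile(r"^\[([^\[\]]+)\]\s*(\w*)$")


def parse_semantic_name(text: str) -> SemanticName:
    """Split a semantic name into its concept path and attribute name."""
    match = _SN_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a semantic name: {text!r}")
    concepts = tuple(part.strip() for part in match.group(1).split(";"))
    if not all(concepts):
        raise ValueError(f"empty concept name in: {text!r}")
    return SemanticName(concepts, match.group(2) or None)


table_sn = parse_semantic_name("[BreastCancerPatient;TumorSample]")
field_sn = parse_semantic_name("[BreastCancerPatient;TumorSample] SurgeryDate")
```

Keeping the concept path as an ordered tuple preserves the generalization order that the semicolon expresses.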


3.3 Definitions

A semantic name $[CN_1; \ldots; CN_m]\ AN$ is subsumed by a semantic name $[CN'_1; \ldots; CN'_{m'}]\ AN'$ if
  $m' \le m$
  $CN_{m-m'+i}$ coincides with or is a specialization of $CN'_i$, for $i = 1, \ldots, m'$
  $AN = AN'$

Two semantic names are semantically overlapping if
  Their last i concept names are the same or related through the IS-A relationship
  They have the same attribute name AN
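Both definitions translate into short predicates. This is a sketch under stated assumptions: the `IS_A` edges below are guessed from the concept names in the ontology slide (the edges are not spelled out in the text), the function names are hypothetical, and for the overlap test `i` defaults to the length of the shorter concept path, which is one possible reading of "their last i concept names".

```python
from typing import Dict, Optional, Tuple

# A semantic name as (concept path, attribute name).
SemanticName = Tuple[Tuple[str, ...], Optional[str]]

# Assumed IS-A edges (child -> parent); illustrative only.
IS_A: Dict[str, str] = {
    "GOBiologicalProcess": "GOAnnotation",
    "GOMolecularFunction": "GOAnnotation",
    "GOCellularComponent": "GOAnnotation",
}


def is_same_or_specialization(concept: str, ancestor: str) -> bool:
    """True if concept equals ancestor or reaches it by following IS-A edges."""
    current: Optional[str] = concept
    while current is not None:
        if current == ancestor:
            return True
        current = IS_A.get(current)
    return False


def is_subsumed_by(specific: SemanticName, general: SemanticName) -> bool:
    """[CN1..CNm] AN is subsumed by [CN'1..CN'm'] AN' per the definition above."""
    (cn, an), (cn_g, an_g) = specific, general
    m, m_g = len(cn), len(cn_g)
    if m_g > m or an != an_g:
        return False
    tail = cn[m - m_g:]  # CN_{m-m'+1} .. CN_m
    return all(is_same_or_specialization(c, g) for c, g in zip(tail, cn_g))


def are_overlapping(a: SemanticName, b: SemanticName, i: Optional[int] = None) -> bool:
    """Same attribute name, and the last i concepts equal or IS-A related."""
    (cn_a, an_a), (cn_b, an_b) = a, b
    shortest = min(len(cn_a), len(cn_b))
    if i is None:
        i = shortest
    if an_a != an_b or i < 1 or i > shortest:
        return False
    return all(
        is_same_or_specialization(x, y) or is_same_or_specialization(y, x)
        for x, y in zip(cn_a[-i:], cn_b[-i:])
    )


query_sn = (("Reporter", "GOMolecularFunction"), "GOName")
general_sn = (("GOAnnotation",), "GOName")
```

Here `is_subsumed_by(query_sn, general_sn)` holds because the last concept of the longer path specializes `GOAnnotation` under the assumed edges.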


3.4 Integration Steps

Capture Process
  Captures the data to be integrated
  Performed independently in each source
    Use the Extractor tool to export the database schemata
    Choose the fields/tables of interest
    Use the ontology to annotate the schemata
  The database schemata are extracted and stored in X-Spec files that are sent to the central site

Integration Process
  Central integration of the various data sources
  A global view, called the Context View, is produced in memory


4.1 Query Formulation

Attribute-only version of SQL

    SELECT [BreastCancerPatient]Name, [Reporter]HGNCGeneSymbol, [GeneExpression]RatioValue
    WHERE [RiskFactors]YearsOfSmoking > 30
      AND [Hybridization]HybridizationDate = [TumorSample]SurgeryDate
      AND [Reporter;GOMolecularFunction]GOName = "cell adhesion"
    ORDER BY [BreastCancerPatient]Name

SELECT clause contains the concepts to be projected
WHERE clause specifies the selection criteria
FROM clause is absent, since the integration system automatically identifies the tables to be used
No need for explicit join declarations
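Because the query carries only semantic names, a first processing step can simply collect them per clause. The sketch below is an illustration with hypothetical helper names (`split_clauses`, `semantic_names`), not Quete's parser; it assumes the three keywords never appear inside string literals.

```python
import re

QUERY = (
    "SELECT [BreastCancerPatient]Name, [Reporter]HGNCGeneSymbol, "
    "[GeneExpression]RatioValue "
    "WHERE [RiskFactors]YearsOfSmoking > 30 "
    "AND [Hybridization]HybridizationDate = [TumorSample]SurgeryDate "
    'AND [Reporter;GOMolecularFunction]GOName = "cell adhesion" '
    "ORDER BY [BreastCancerPatient]Name"
)

_KEYWORDS = re.compile(r"\b(SELECT|WHERE|ORDER\s+BY)\b", re.IGNORECASE)
_SEMANTIC_NAME = re.compile(r"\[([^\[\]]+)\]\s*(\w+)")


def split_clauses(query):
    """Map each clause keyword to the text that follows it."""
    parts = _KEYWORDS.split(query)  # ['', 'SELECT', body, 'WHERE', body, ...]
    clauses = {}
    for keyword, body in zip(parts[1::2], parts[2::2]):
        clauses[" ".join(keyword.upper().split())] = body.strip()
    return clauses


def semantic_names(clause_text):
    """Return (concept path, attribute name) pairs found in a clause."""
    return [
        (tuple(c.strip() for c in path.split(";")), attribute)
        for path, attribute in _SEMANTIC_NAME.findall(clause_text)
    ]


clauses = split_clauses(QUERY)
projected = semantic_names(clauses["SELECT"])
filtered = semantic_names(clauses["WHERE"])
```

The resulting `clauses` dictionary has no `FROM` entry; choosing tables and joins is left to the integration system.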


4.2 Query Answering

The semantic query is decomposed into SQL subqueries
  When possible, all operations are pushed into the subqueries
  The subqueries are issued in parallel to the distinct data sources

When all results have been returned to the central site, the remaining operations are performed (joins, ordering, etc.)
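The decompose, push down, run in parallel, then finish centrally pattern can be sketched with two in-memory SQLite databases standing in for the remote sources. Everything here is illustrative: the table names, column names and rows are made up, and SQLite with a thread pool replaces the Jdbc-Odbc connections of the real system.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

# Two stand-in sources; the selection is pushed into the clinical subquery.
SOURCES = {
    "clinical": {
        "setup": (
            "CREATE TABLE Patient(PatientId INTEGER, Name TEXT, YearsOfSmoking INTEGER);"
            "INSERT INTO Patient VALUES (1, 'Ann', 35), (2, 'Bea', 10), (3, 'Cleo', 40);"
        ),
        "subquery": "SELECT PatientId, Name FROM Patient WHERE YearsOfSmoking > 30",
    },
    "genomic": {
        "setup": (
            "CREATE TABLE Expression(PatientId INTEGER, GeneSymbol TEXT, RatioValue REAL);"
            "INSERT INTO Expression VALUES (1, 'G1', 1.8), (2, 'G1', 0.7), (3, 'G2', 2.4);"
        ),
        "subquery": "SELECT PatientId, GeneSymbol, RatioValue FROM Expression",
    },
}


def run_subquery(source):
    """Open the source inside the worker thread and run its local subquery."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(source["setup"])
        return conn.execute(source["subquery"]).fetchall()
    finally:
        conn.close()


def answer(sources):
    """Issue all subqueries in parallel, then join and order centrally."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {name: pool.submit(run_subquery, src) for name, src in sources.items()}
        results = {name: future.result() for name, future in futures.items()}
    names = dict(results["clinical"])  # PatientId -> Name
    joined = [
        (names[pid], gene, ratio)
        for pid, gene, ratio in results["genomic"]
        if pid in names
    ]
    return sorted(joined)


rows = answer(SOURCES)
```

Only the cross-source join and the ordering remain for the central site, since the smoking filter already ran at the clinical source.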


4.3 Requirements in forming local subqueries

1. Identify the table attributes of interest to the user, i.e. those matching the semantic name [CNpath]AN (attributes with the same or more specific information, plus local join keys).
2. Since the FROM clause is missing, determine the tables that link the tables holding attributes of interest, together with their join conditions.
3. Determine the join attributes, called DB link attributes, that are needed to link the attributes of interest across sources.
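Requirement 2 amounts to finding a path of joins between the tables that hold the requested attributes. One simple way to illustrate this is a breadth-first search over a graph of join conditions; this is a swapped-in textbook technique, not the algorithm used by Quete, and the schema in `JOINS` and the function `join_path` are hypothetical.

```python
from collections import deque

# Hypothetical local schema: each entry is a join condition between two tables.
JOINS = {
    ("Patient", "SurgicalExcision"):
        "Patient.PatientId = SurgicalExcision.PatientId",
    ("SurgicalExcision", "Hybridization"):
        "SurgicalExcision.TumorSampleId = Hybridization.TumorSampleId",
    ("Hybridization", "Expression"):
        "Hybridization.HybridizationId = Expression.HybridizationId",
}


def join_path(start, goal):
    """Return (tables, join conditions) for a shortest join path, or None."""
    neighbours = {}
    for (left, right), condition in JOINS.items():
        neighbours.setdefault(left, []).append((right, condition))
        neighbours.setdefault(right, []).append((left, condition))
    queue = deque([(start, [start], [])])
    seen = {start}
    while queue:
        table, tables, conditions = queue.popleft()
        if table == goal:
            return tables, conditions
        for nxt, condition in neighbours.get(table, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, tables + [nxt], conditions + [condition]))
    return None


tables, conditions = join_path("Patient", "Hybridization")
```

In this example `SurgicalExcision` is a linking table: it holds no requested attribute but is needed to connect the other two.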


4.4 Forming the local sub-queries

Extension of Unity's algorithm that increases the system's recall with no sacrifice in precision

Our algorithm takes into account
  The user query
  The ontology
  The data source-to-ontology mappings

…and formulates a single subquery (SQ) for each data source


4.5 Algorithm: Result Composition

Input: (i) the user semantic query, (ii) the local SQs
Output: composition plan

1. Find all minimal subsets of SQs such that
   1. There is a join tree connecting all subqueries
   2. All the semantic query's fields exist
   3. In each SQ there is a projection attribute which does not overlap with the projection attribute of another SQ
2. Join the queries in each minimal subset
3. Project the common requested attributes
4. Union the results
5. Apply group and order operations
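Step 1 of the algorithm can be illustrated with a brute-force enumeration. This is a sketch under simplifying assumptions, not the authors' implementation: two SQs are taken to be joinable when they share a link attribute, "overlap" is reduced to plain name equality instead of semantic overlap, and the SQ names, fields and link attributes are made up.

```python
from itertools import combinations


def is_connected(subset, links):
    """True if the SQs form one component when SQs sharing a link attribute are adjacent."""
    names = list(subset)
    seen = {names[0]}
    stack = [names[0]]
    while stack:
        current = stack.pop()
        for other in names:
            if other not in seen and links[current] & links[other]:
                seen.add(other)
                stack.append(other)
    return len(seen) == len(names)


def minimal_subsets(required, fields, links):
    """Enumerate minimal SQ subsets meeting conditions 1.1 to 1.3 (simplified)."""
    found = []
    names = sorted(fields)
    for size in range(1, len(names) + 1):  # smallest subsets first
        for subset in combinations(names, size):
            if any(set(previous) <= set(subset) for previous in found):
                continue  # contains a smaller valid subset, so not minimal
            covered = set().union(*(fields[n] for n in subset))
            if not required <= covered:
                continue  # 1.2: some requested field is missing
            if not is_connected(subset, links):
                continue  # 1.1: no join tree connects the SQs
            contributes = all(
                fields[n] - set().union(*(fields[o] for o in subset if o != n))
                for n in subset
            )
            if contributes:  # 1.3: every SQ has a field of its own
                found.append(subset)
    return found


FIELDS = {
    "SQ1": {"Name", "YearsOfSmoking"},
    "SQ2": {"Name"},
    "SQ3": {"HGNCGeneSymbol", "RatioValue"},
}
LINKS = {"SQ1": {"PatientId"}, "SQ2": {"PatientId"}, "SQ3": {"PatientId"}}
REQUIRED = {"Name", "HGNCGeneSymbol", "RatioValue"}

plan = minimal_subsets(REQUIRED, FIELDS, LINKS)
```

Each subset in `plan` would then be joined (step 2), projected (step 3), and the per-subset results unioned (step 4). The enumeration is exponential in the number of SQs, which is acceptable only because there is one SQ per source.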


4.6 Results composition

Done with the help of a central DBMS
  For every subquery, create a temporary table in the central DB and store the returned results
  Build the global SQL query to be issued to the central DB according to the result composition plan
  Execute the global SQL query

Pros
  The first step is executed in parallel
  Uses DBMS technology to handle join, union, order and group operators efficiently
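The three steps map directly onto a handful of SQL statements. The sketch below uses an in-memory SQLite database as a stand-in for the central DBMS; the function `compose`, the temporary table names and the sample rows are hypothetical, and table and column names are interpolated into SQL on the assumption that they come from the system itself, never from user input.

```python
import sqlite3


def compose(results, global_sql):
    """Store each subquery result in a temp table, then run the global query."""
    central = sqlite3.connect(":memory:")
    try:
        for table, (columns, rows) in results.items():
            column_list = ", ".join(columns)
            placeholders = ", ".join("?" for _ in columns)
            central.execute(f"CREATE TEMP TABLE {table} ({column_list})")
            central.executemany(
                f"INSERT INTO {table} VALUES ({placeholders})", rows
            )
        return central.execute(global_sql).fetchall()
    finally:
        central.close()


# Made-up subquery results, one entry per source.
RESULTS = {
    "sq_clinical": (("PatientId", "Name"), [(3, "Cleo"), (1, "Ann")]),
    "sq_genomic": (
        ("PatientId", "GeneSymbol", "RatioValue"),
        [(1, "G1", 1.8), (2, "G1", 0.7), (3, "G2", 2.4)],
    ),
}

# Global query built from the composition plan: join, project, order.
GLOBAL_SQL = (
    "SELECT c.Name, g.GeneSymbol, g.RatioValue "
    "FROM sq_clinical AS c JOIN sq_genomic AS g ON c.PatientId = g.PatientId "
    "ORDER BY c.Name"
)

answer_rows = compose(RESULTS, GLOBAL_SQL)
```

Delegating the join and the ordering to the DBMS keeps the mediator code small, which matches the second advantage listed above.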


4.7 Novel features

Horizontal, vertical and hybrid fragmentation can be declared and used
  During the formation of local subqueries
  During the formation of the result composition plan
    The fragmented tables are rebuilt before going further down the composition plan

Advantages
  Eliminates unnecessary local subqueries
  Avoids joins that are certain to return empty results
  Increases the system's recall
  Improves performance
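A small illustration of how declared fragmentation could be exploited: a horizontal fragment whose declared value range cannot satisfy the query predicate needs no subquery, horizontal fragments are rebuilt by union, and vertical fragments by a join on the shared key. The declaration format (`HorizontalFragment` with an inclusive range) and all function names are assumptions made for this sketch; the slides do not show how Quete declares fragments.

```python
from typing import Dict, List, NamedTuple, Tuple


class HorizontalFragment(NamedTuple):
    source: str
    low: int    # smallest YearsOfSmoking value stored in this fragment
    high: int   # largest YearsOfSmoking value stored in this fragment


def sources_for_greater_than(fragments: List[HorizontalFragment], threshold: int) -> List[str]:
    """Keep only the sources that can hold rows with value > threshold."""
    return [f.source for f in fragments if f.high > threshold]


def rebuild_horizontal(parts: List[List[Tuple]]) -> List[Tuple]:
    """Horizontal fragments hold disjoint rows: rebuild the table by union."""
    return [row for part in parts for row in part]


def rebuild_vertical(left: Dict[int, Tuple], right: Dict[int, Tuple]) -> Dict[int, Tuple]:
    """Vertical fragments hold different columns: rebuild by joining on the key."""
    return {key: left[key] + right[key] for key in left if key in right}


FRAGMENTS = [
    HorizontalFragment("db1", 0, 20),
    HorizontalFragment("db2", 21, 40),
    HorizontalFragment("db3", 41, 80),
]

needed = sources_for_greater_than(FRAGMENTS, 30)
whole = rebuild_horizontal([[(1, "Ann")], [(3, "Cleo")]])
merged = rebuild_vertical({1: ("Ann",), 3: ("Cleo",)}, {1: (35,), 2: (10,)})
```

For the predicate `YearsOfSmoking > 30`, the sketch skips `db1` entirely, which corresponds to the first advantage listed above.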


Preliminary Evaluation

[Figure: query time in seconds (0 to 100) against number of rows (0 to 60000) for four configurations: 4DBs no fragment, 4DBs fragment, 3DBs no fragment, 3DBs fragment.]


Conclusions

Information integration is a difficult task
  Heterogeneity of sources
  Independent evolution
  Communication costs
  Complicated structures

Our system has good performance

A LAV system
  The global schema does not change as sources evolve or new sources are added
  But without LAV's complexity in processing

Trade-off between complexity and efficiency


Future Work

More query algorithms
In-memory database
Cycles
Non-relational data sources
Exploit systems for automatic schema matching
Web Service / Grid approach
Caching
Updates in sources

Thanks !!!
