http://credible.i3s.unice.fr MI CNRS – bilan 2013 – Paris, 23/1/2014 1
CrEDIBLE: federation of distributed data and knowledge in biomedical imaging (fédération de données et de ConnaissancEs Distribuées en Imagerie BiomédicaLE)
Data fusion, semantic alignment, distributed queries
Johan Montagnat, CNRS, I3S lab, Modalis team
on behalf of the CrEDIBLE consortium:
CNRS/UNS, I3S laboratory (UMR 7271), MODALIS team
INRIA/CNRS/UNS, I3S laboratory (UMR 7271), Wimmics team
INSERM U1099, LTSI laboratory, MediCIS team
U. Picardie, MIS laboratory, Connaissances team
CNRS/INSERM/INSA/U. Lyon 1, CREATIS laboratory (UMR 5220 / U1044)
Motivations
● Biomedical data
– High heterogeneity: images, clinical data, biomarkers, biology...
– Increasing amount / number of (open) sources: Big Data
● Large-scale medical studies (statistical medical studies, epidemiology...)
– Need for cross-factor analysis: Linked Data
● Data (re)analysis opportunities
● Translational research
● Centralized approaches encounter limitations
– Multiple kinds of data sources
– Large data volumes to transfer / archive / search
– Sensitive patient data / complex access control policies
– Need to adopt a uniform data model and format
● Data is de facto distributed over acquisition centers
Biomedical data mediation & federation
– Data federation through distributed querying and query rewriting
– Heterogeneous database schema mediation
– Medical data & metadata:
● raw data + models + processing results + models + provenance...
[Figure: a client gets query-based access to data through a federator (query decomposition, planning & results federation), which sends remote sub-queries to one mediator (ETL or query rewriting) per site (Site 1, Site 2).]
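The client / federator / mediator layering described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the CrEDIBLE implementation: the site records, the field names and the helpers `make_mediator` and `federate` are all hypothetical, and the mediation shown is the simplest ETL-style renaming of local fields to a shared vocabulary.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Mediator = Callable[[Record], List[Record]]

# Hypothetical site-local records: each site uses its own field names.
SITE_1 = [{"pid": "p1", "modality": "MR"}, {"pid": "p2", "modality": "CT"}]
SITE_2 = [{"patient": "p3", "img_type": "MR"}]


def make_mediator(rows: List[Record], mapping: Dict[str, str]) -> Mediator:
    """Return a mediator that exposes `rows` under the shared field names."""
    def mediator(query: Record) -> List[Record]:
        # ETL-style mediation: rename local fields to the shared ones,
        # then keep the records that satisfy every constraint of the sub-query.
        translated = [{mapping[k]: v for k, v in row.items()} for row in rows]
        return [r for r in translated
                if all(r.get(k) == v for k, v in query.items())]
    return mediator


def federate(query: Record, mediators: List[Mediator]) -> List[Record]:
    """Send the same sub-query to every site and merge the answers."""
    results: List[Record] = []
    for mediator in mediators:
        results.extend(mediator(query))
    return results


mediators = [
    make_mediator(SITE_1, {"pid": "subject", "modality": "modality"}),
    make_mediator(SITE_2, {"patient": "subject", "img_type": "modality"}),
]
mr_subjects = federate({"modality": "MR"}, mediators)
```

Here the client only sees the shared field names (`subject`, `modality`); a query-rewriting mediator would instead translate the sub-query into the local schema and leave the data untouched.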
Domain ontology-based federation
[Figure: a client gets query-based access to data through a federator, which sends remote sub-queries to the mediators of Site 1 and Site 2. A medical domain ontology (reference model) is used for data querying and for data alignment.]
Challenges and expertise
● Challenges
– Representation of data semantics for heterogeneous data sources → biomedical ontology building
– Data federation → distributed query engine
– Data mediation → RDB2RDF, ontology alignment
● Partnership
– Partners: I3S/Modalis, INRIA/I3S/Wimmics, LTSI/MediCIS, MIS/Connaissances, CREATIS
– Expertise areas: semantic reference, semantic representation, distributed query engine, medical applications
Scientific networking: annual workshop
● Objectives
– Multidisciplinary workshop gathering experts in biomedical data representation, semantics, distribution, federation, integration...
● October 2012: data integration, ontologies, data models, reasoning
– Semantic models are now widely accepted, although ontology resources are not sufficient
– Existing systems are mostly centralized, but there is a stringent need to support multi-centric studies
● October 2013: data reuse, ontologies, mediation, federation, graphs
– Both horizontal and vertical data partition schemes need to be addressed
– Data models in use are often constructed bottom-up
– Expressivity of the query language is important
Bibliography study
● Distributed (semantic) query processing
– Report CrEDIBLE-12-2, November 2012
– A. Gaignard PhD thesis, March 2013, U. Nice Sophia
● Relational data mediation
– Report I3S 2013-04-FR, November 2013
– Submitted to Journal of Web Semantics
● Ontology of scientific measures (data acquisition, metrology)
– Report CrEDIBLE-13-1, December 2013
Reference ontology
● 3-level structure: foundational (DOLCE), core, domain
● Domain-specific rules
– Inference abilities
● DataTop ontology
– Current focus on measurements
● Derived relational schema
[Figure: ontology overview. Concepts shown include Particular, Endurant, Perdurant, Action, Conceptualization, Expression, Inscription, Dataset acquisitions, Dataset acquisition equipment, Studies, Examinations, Dataset processings, Datasets, Medical image expressions, Medical image files, Medical image formats, Subjects, Investigators, Centres, Languages.]
Ontology modules
● Modularized ontology to improve reuse and keep modules lightweight
– ONL-MR-DA: MR Dataset Acquisition
– ONL-DP: Data Processing
– ONL-MSA: Mental State Assessment
– OntoVIP: Medical Image Simulation
● Wide diffusion
– http://bioportal.bioontology.org/ontologies
Data query and federation engine
● KGRAM (Knowledge Graph Abstract Machine) semantic query engine:
– Full support of SPARQL 1.1
– Generic interface for heterogeneous backends
– Flexible architecture facilitating different deployment scenarios
● Mediation interface to access relational data
– Federated relational schema derived from the ontology
Distributed Query Processing
● Query federator decoupled from data sources
● Asynchronous querying of multiple data sources
● Query planning and parallel querying
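As a rough illustration of asynchronous, parallel querying of several data sources, the sketch below uses Python's standard `asyncio` module. It is a sketch under assumptions and does not reflect how KGRAM is implemented: the in-memory `SOURCES`, the simulated latency and the functions `query_source` and `federated_query` are hypothetical.

```python
import asyncio
from typing import Dict, List

# Hypothetical in-memory data sources, keyed by name.
SOURCES: Dict[str, List[dict]] = {
    "source1": [{"name": "Bob Smith"}, {"name": "Alice Doe"}],
    "source2": [{"name": "Bobby Jones"}],
}


async def query_source(source: str, keyword: str) -> List[dict]:
    """Simulate a remote sub-query; the sleep stands in for network latency."""
    await asyncio.sleep(0.01)
    return [row for row in SOURCES[source] if keyword in row["name"]]


async def federated_query(keyword: str) -> List[dict]:
    # Launch all sub-queries concurrently; gather returns the partial
    # results in the same order as the sources were listed.
    partial = await asyncio.gather(
        *(query_source(source, keyword) for source in SOURCES)
    )
    # Federate the partial results into a single flat list.
    return [row for rows in partial for row in rows]


results = asyncio.run(federated_query("Bob"))
```

Because the federator only depends on the `query_source` coroutine, sources can be added or replaced without changing the federation logic, which is one way to read "decoupled from data sources".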
Distributed Query Processing
● KGRAM query processing
Example query Q:
SELECT ?name ?date WHERE {
  ?x foaf:name ?name .
  ?x dbpedia:birthDate ?date .
  FILTER (CONTAINS (?name, 'Bob'))
}
● Asynchronous execution
[Figure: the query Q, with parts labelled Q1 and Q2, is submitted to the KGRAM core, which accesses data source #1 and data source #2 through its interface. Successive animation steps add the sub-query labels Q1, Q2' and Q2''.]
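To make the decomposition of the example query more concrete, here is a minimal Python sketch that evaluates its two triple patterns against two separate sources and joins them on ?x. It is not KGRAM's algorithm: the data, the placement of names in one source and birth dates in the other, and the helpers `match` and `run_query` are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]
Binding = Dict[str, str]

# Hypothetical data: names live in source 1, birth dates in source 2.
SOURCE_1: List[Triple] = [
    ("ex:p1", "foaf:name", "Bob Martin"),
    ("ex:p2", "foaf:name", "Alice Durand"),
]
SOURCE_2: List[Triple] = [
    ("ex:p1", "dbpedia:birthDate", "1970-01-01"),
    ("ex:p2", "dbpedia:birthDate", "1980-01-01"),
]


def match(pattern: Triple, source: List[Triple], binding: Binding) -> List[Binding]:
    """Extend `binding` with every triple of `source` that matches `pattern`."""
    out: List[Binding] = []
    for triple in source:
        candidate: Optional[Binding] = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable must agree with any value it is already bound to.
                if candidate.setdefault(term, value) != value:
                    candidate = None
                    break
            elif term != value:
                # A constant term must be identical to the triple's term.
                candidate = None
                break
        if candidate is not None:
            out.append(candidate)
    return out


def run_query() -> List[Tuple[str, str]]:
    q1 = ("?x", "foaf:name", "?name")          # first triple pattern
    q2 = ("?x", "dbpedia:birthDate", "?date")  # second triple pattern
    rows = []
    for b1 in match(q1, SOURCE_1, {}):
        if "Bob" not in b1["?name"]:           # FILTER (CONTAINS (?name, 'Bob'))
            continue
        # Reuse the bindings from the first pattern, so only matching ?x survive.
        for b2 in match(q2, SOURCE_2, b1):
            rows.append((b2["?name"], b2["?date"]))
    return rows


rows = run_query()
```

Applying the filter before the second pattern reduces the number of bindings sent to the second source; a real engine would decide this kind of ordering during query planning.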
Performance results
● Querying of mixed (relational / semantic) stores
● FedBench standard benchmark
Exploitation in ANR GINSENG (epidemiology)
● Partnership with GINSENG consortium and Mnemotix SME
● Federation of heterogeneous epidemiology repositories
– Multiple epidemiology data acquisition networks
– Cross-correlation with “external” data (e.g. demographic: IGN)
● Mediation of the EHR (Electronic Health Record) data schema
Collaboration with Pitié-Salpêtrière
● CAC database
– Neurodegenerative diseases (esp. Alzheimer's)
– Mediation of relational CAC schema
● Comparative study of different relational-to-RDF mediation techniques
– R2RML mapping language
– Test case for various RDB2RDF conversion tools
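The idea behind relational-to-RDF mediation can be sketched in a few lines of Python with the standard `sqlite3` module. This is a hand-written, static (ETL-style) conversion, not R2RML and not one of the tools compared in the study. The `patient` table, its columns and the `http://example.org/` namespace are hypothetical and unrelated to the actual CAC schema.

```python
import sqlite3
from typing import List

BASE = "http://example.org/"  # hypothetical namespace


def rows_to_ntriples(conn: sqlite3.Connection) -> List[str]:
    """Map each row of the hypothetical `patient` table to N-Triples lines.

    Each row becomes one subject IRI built from its primary key, and each
    column becomes one predicate. Assumes values contain no quotes or
    backslashes, so no literal escaping is performed.
    """
    triples: List[str] = []
    query = "SELECT id, name, diagnosis FROM patient ORDER BY id"
    for pid, name, diagnosis in conn.execute(query):
        subject = f"<{BASE}patient/{pid}>"
        triples.append(f'{subject} <{BASE}name> "{name}" .')
        triples.append(f'{subject} <{BASE}diagnosis> "{diagnosis}" .')
    return triples


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE patient (id INTEGER PRIMARY KEY, name TEXT, diagnosis TEXT)"
)
conn.execute("INSERT INTO patient VALUES (1, 'subject-001', 'Alzheimer')")
triples = rows_to_ntriples(conn)
```

A declarative mapping language such as R2RML expresses the same table-to-subject and column-to-predicate correspondence as data rather than code, which makes it easier to compare conversion tools on the same mapping.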
Perspectives
● DataTop ontology development
– Data distribution (spatial and temporal) and data collections structure
– Alzheimer's disease
● Query engine architecture
– Code modularity for implementing alternative optimization strategies
● Query performance optimization
– Query plan improvement (especially when dealing with both horizontal and vertical partitioning)
● Mediation
– Dynamic mediation of data sources
– Integration of new medical data sources
Perspectives: collaborations
● Université Laval (Québec)
– Ontology design: Alzheimer's disease data
● ANR CONTINT 2013 BIOMIST
– Industrial system for large database management (Product Lifecycle Management - PLM)
– Application to medical databases in the area of brain function analysis
● France Life Imaging excellence infrastructure
– National-level infrastructure for medical data sharing and analysis
Publications
● Peer-reviewed
– Medical domain: iDash image informatics workshop, Sept. 2012; DCICTIA-MICCAI workshop, Oct. 2012
– Distributed Query Processing: Web Intelligence, Dec. 2012; Journal of Web Semantics (submitted)
– Mediation: Journal of Web Semantics (submitted)
● PhD theses
– A. Gaignard (March 2013, U. Nice Sophia Antipolis)
– N. Cerezo (December 2013, U. Nice Sophia Antipolis)
● Research reports
– Workshop summaries: #12-1, #14-1
– Annual reports: #12-3, #13-3
– Bibliography studies: #12-2, I3S/2013-04-FR
Conclusions
● Query-based data federation grounded on semantic web standards (SPARQL, RDF, RDFS)
– Emphasis on query language expressivity
– Support for both horizontal and vertical data partitioning
– Broad applicability (provided that ontologies are available)
● Ontology-based
– Reference model for data alignment and query terms
● Currently using Extract-Transform-Load
– Static conversion of data sources to RDF format
– Work on dynamic data mediation is ongoing
● Optimization work is ongoing
– Flexible software architecture
– Distributed query processing optimization