![Page 1: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/1.jpg)
GEM: The GAAIN Entity Mapper
Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga
USC Stevens Neuroimaging and Informatics Institute
Keck School of Medicine of USC
July 9th, 2015
At the 11th Data Integration in Life Sciences Conference (DILS) 2015, Marina del Rey
![Page 2: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/2.jpg)
Introduction: GAAIN
• GAAIN: Global Alzheimer’s Association Interactive Network
• Current
• Data integrated from 30+ sources
• Over 250,000 research subjects
• Access http://www.gaain.org
![Page 3: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/3.jpg)
Data Integration in GAAIN
• Data
  • Subject research data
  • Well structured
  • (Mostly) relational
• Data harmonization
  • Common data model
  • Map datasets to common model
• Data ownership sensitivity
![Page 4: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/4.jpg)
Data Mapping
![Page 5: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/5.jpg)
The Data Mapping Problem
• Resource intensive
• “On average, converting a database to the OMOP CDM, including mapping terminologies, required the equivalent of four full-time employees for 6 months and significant computational resources for each distributed research partner. Each partner utilized a number of people with a wide range of expertise and skills to complete the project, including project managers, medical informaticists, epidemiologists, database administrators, database developers, system analysts/programmers, research assistants, statisticians, and hardware technicians. Knowledge of clinical medicine was critical to correctly map data to the proper OMOP CDM tables.”
• Complexity of data harmonization
  • Several thousand data elements per dataset
  • Multiple datasets
• Data elements
  • Complex scientific concepts
  • Cryptic names
  • Domain expertise to interpret
![Page 6: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/6.jpg)
Observations
• Rich element information in documentation
  • Data dictionaries!
• Element information
  • Descriptions
  • Metadata
• Need better approaches to matching element names
  • MOMDEMYR1
  • PTGNDR
![Page 7: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/7.jpg)
Data Dictionaries
• Rich element details
![Page 8: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/8.jpg)
Approach
• Extract element description and metadata details from data dictionaries
• Determine element matches based on above
• Block improbable match candidates based on metadata
• Determine element similarity (and thus match likelihood) based on name and description similarity
• Initial version of system knowledge-driven, then added machine-learning classification
![Page 9: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/9.jpg)
GEM: A Software Assistant for Data Mapping
![Page 10: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/10.jpg)
GEM Architecture
![Page 11: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/11.jpg)
Element Extraction
• Extract and segregate element information
![Page 12: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/12.jpg)
Metadata Detail Extraction
• Element categories
  Four categories
  (i) Special
  (ii) Coded
    Binary
    Other coded
  (iii) Numerical
  (iv) Text
  Classifier
    Heuristic based
• Other metadata details
  Cardinality
  Range (min, max)
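A heuristic category classifier of this kind might look roughly like the sketch below. The rules, the function name `categorize`, and its inputs (a legend mapping codes to labels, plus a sample of raw values) are assumptions made for illustration, not GEM's actual heuristics. The "Special" category is left out because the slides do not define it.

```python
from typing import Mapping, Optional, Sequence


def is_number(value: str) -> bool:
    """Return True when the string parses as a float."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def categorize(legend: Optional[Mapping[str, str]],
               sample_values: Sequence[str]) -> str:
    """Assign a coarse category to an element (hypothetical rules)."""
    # Assumption: an element that ships with a legend (code -> label) is coded.
    if legend:
        return "coded/binary" if len(legend) == 2 else "coded/other"
    # Assumption: without a legend, all-numeric samples mean a numerical element.
    if sample_values and all(is_number(v) for v in sample_values):
        return "numerical"
    # Everything else is treated as free text.
    return "text"
```

Cardinality and range could then be read off the same inputs, for example the number of legend entries and the minimum and maximum of the numeric samples.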
![Page 13: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/13.jpg)
MDB: The Metadata Database
• Extracted detailed metadata per element
  Fields: Source, Name, Description, Legend, Cardinality, Range, Category
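One way to picture a single MDB row is as a small record type. The sketch below is only an illustration: the class name `ElementMetadata`, the field types, and the example values (including the description text and legend) are assumptions, since the slides list the column names but not their representation.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class ElementMetadata:
    """One metadata row per data element (field names follow the slide)."""
    source: str                      # dataset the element comes from
    name: str                        # element (column) name
    description: str                 # free-text description from the dictionary
    legend: Dict[str, str] = field(default_factory=dict)  # code -> label
    cardinality: Optional[int] = None                      # number of distinct codes
    value_range: Optional[Tuple[float, float]] = None      # (min, max)
    category: str = "text"           # special / coded / numerical / text


# Illustrative values only; not taken from a real data dictionary.
record = ElementMetadata(
    source="ADNI",
    name="PTGENDER",
    description="Participant gender",
    legend={"1": "Male", "2": "Female"},
    cardinality=2,
    category="coded",
)
```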
9/8/14
![Page 14: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/14.jpg)
Matching: Metadata Based “Blocking”
• Elimination of candidates
  Eliminate candidates from second source that are incompatible
• Incompatibility criteria
  - Category mismatch
  - Cardinality mismatch
    - For coded elements
    - Assume normal distribution with SD of 1
  - Range mismatch
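The blocking criteria above can be sketched as a simple filter. This is a hedged illustration, not GEM's code: the `Element` type, the `MAX_SD` tolerance of 2 standard deviations, and the reading of "range mismatch" as "the two ranges do not overlap" are all assumptions. Only the SD of 1 for cardinality comes from the slide.

```python
from typing import List, NamedTuple, Optional, Tuple


class Element(NamedTuple):
    name: str
    category: str
    cardinality: Optional[int] = None
    value_range: Optional[Tuple[float, float]] = None


CARDINALITY_SD = 1.0  # from the slide: normal distribution with SD of 1
MAX_SD = 2.0          # hypothetical tolerance, in standard deviations


def compatible(src: Element, cand: Element, max_sd: float = MAX_SD) -> bool:
    """Return False when the candidate is an improbable match for src."""
    # Category mismatch
    if src.category != cand.category:
        return False
    # Cardinality mismatch, checked for coded elements only
    if (src.category == "coded"
            and src.cardinality is not None
            and cand.cardinality is not None):
        z = abs(src.cardinality - cand.cardinality) / CARDINALITY_SD
        if z > max_sd:
            return False
    # Range mismatch (assumed here to mean non-overlapping ranges)
    if src.value_range is not None and cand.value_range is not None:
        low = max(src.value_range[0], cand.value_range[0])
        high = min(src.value_range[1], cand.value_range[1])
        if low > high:
            return False
    return True


def block(src: Element, candidates: List[Element]) -> List[Element]:
    """Keep only the candidates that survive all incompatibility checks."""
    return [c for c in candidates if compatible(src, c)]
```

Only the surviving candidates would then be passed on to the more expensive name and description matching.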
![Page 15: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/15.jpg)
Matching Text Descriptions
• Employ a regular Tfidf cosine distance on bag-of-words
• Based on unsupervised topic modeling (LDA)
  - Treat element descriptions as ‘documents’
  - Topic model over these documents
  - Each element (description) has a probability distribution over topics
  - Element similarity (or distance) based on similarity (or dissimilarity) of associated topic distributions
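Both text-matching options can be illustrated with a short, standard-library-only sketch. The first part computes TF-IDF cosine similarity over bag-of-words descriptions (the distance is one minus this value). The second part compares two topic distributions, which are assumed to have been inferred already by an LDA model. The slides do not say which measure GEM uses for topic distributions, so the Hellinger distance here is an assumption, as are the smoothed IDF formula and the whitespace tokenizer.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def tfidf_vectors(docs: Sequence[str]) -> List[Dict[str, float]]:
    """Build sparse TF-IDF vectors from whitespace-tokenized descriptions."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many descriptions each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed inverse document frequency (one common variant).
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hellinger(p: Sequence[float], q: Sequence[float]) -> float:
    """Hellinger distance between two topic distributions (0 = identical)."""
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total) / math.sqrt(2)
```

In practice the topic distributions would come from an LDA implementation trained on all element descriptions, with one distribution per element.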
![Page 16: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/16.jpg)
Element Name Matching
• Composite element names
  • PTGENDER
  • PATGNDR
  • MOMDEM
  • FHQDEMYR1
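To see why plain string equality fails on composite names like these, here is a small sketch using two generic techniques: a subsequence test for abbreviations and the similarity ratio from Python's `difflib`. Neither is claimed to be GEM's name-matching method, and the function names are hypothetical.

```python
from difflib import SequenceMatcher


def is_abbreviation(short: str, full: str) -> bool:
    """True when the letters of `short` appear in `full` in the same order."""
    remaining = iter(full.upper())
    # `ch in remaining` consumes the iterator up to and including the match,
    # so each letter must be found after the previous one.
    return all(ch in remaining for ch in short.upper())


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] based on matching blocks."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()
```

For example, `is_abbreviation("PTGNDR", "PTGENDER")` holds, while `PATGNDR` is not a strict subsequence of `PTGENDER` and needs the softer ratio to be recognized as a likely match.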
![Page 17: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/17.jpg)
Table Correspondence
$$\mathrm{TCS}(e_S, e_T) = \frac{\sum_{\text{all elements in } \mathrm{Tab}(e_S)} M(e_S, e_T)}{\min\left(O(\mathrm{Tab}(e_S)),\, O(\mathrm{Tab}(e_T))\right)}$$
• Elements generally do match across ‘corresponding’ tables
• Literal table names not scalable as a feature
• Determine table correspondence heuristically, based on knowledge driven match likelihood
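The table correspondence score can be sketched in a few lines, under stated assumptions: that O(Tab(e)) is the number of elements in the table containing e, and that the summed M term is, for each element of the source table, its best match likelihood against the target table. The slides do not spell out either point, so this is one plausible reading, and the function and parameter names are hypothetical.

```python
from typing import Callable, Sequence


def table_correspondence(source_table: Sequence[str],
                         target_table: Sequence[str],
                         match: Callable[[str, str], float]) -> float:
    """Sum of per-element match likelihoods, normalized by the smaller table."""
    if not source_table or not target_table:
        return 0.0
    # Assumption: each source element contributes its best likelihood
    # against any element of the target table.
    total = sum(max(match(s, t) for t in target_table) for s in source_table)
    # Assumption: O(Tab(.)) is the table size (number of elements).
    return total / min(len(source_table), len(target_table))
```

Here `match` stands in for the knowledge-driven match likelihood; any function returning a score for a pair of element names can be plugged in.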
![Page 18: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/18.jpg)
Experimental Results
• Setup
  • Various data dictionaries
    • ADNI, NACC, DIAN, LAADC, INDD
  • Mapping pairs
    • Pairs of datasets
      • ADNI-NACC, ADNI-INDD, ADNI-LAADC, …
    • Dataset to GAAIN Common Model (GCM)
      • ADNI-GCM, NACC-GCM, …
• Experiments
  • Mapping accuracy
  • Effectiveness of individual components
    • Topic Modeling (text description) match and Filtering
  • Comparison with related systems
  • System parameters
![Page 19: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/19.jpg)
Related Systems
1) Coma++ http://dbs.uni-leipzig.de/Research/coma.html
• More suited for ‘semantic’, ontology integration tasks
• Based on XML (nested structure) similarity
• No support for incorporating element descriptions
2) Harmony http://openii.sourceforge.net
• System targets exactly the same mapping problem as ours
• Utilizes element name similarity and also element descriptions in matching
![Page 20: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/20.jpg)
What Was Evaluated
• Mappings taken pairwise
• Dataset pairs
• ADNI-NACC, ADNI-INDD and ADNI-LAADC
• Goldsets: ~ 150 element pairs (created manually)
• To GAAIN Common Model
• ADNI-GAAIN Common Model
• 24 GAAIN Common Model elements
• Report
• Accuracy in terms of F-Measure (Precision and Recall)
• Against N – the size of result alternatives per match
• Matching algorithms
(i) Harmony
(ii) TFIDF
(iii) Topic Modeling for text match
(iv) Topic Modeling + Metadata Filtering
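Reporting accuracy against N can be made concrete with a small sketch. The definitions used here are assumptions, since the slides do not give them: a source element counts as correct when its gold-set target appears among the top N suggestions, precision is taken over elements that received any suggestion, and recall over the gold set. All names are hypothetical.

```python
from typing import Dict, List, Tuple


def precision_recall_f(ranked: Dict[str, List[str]],
                       gold: Dict[str, str],
                       n: int) -> Tuple[float, float, float]:
    """Precision, recall and F-measure when the top n suggestions are shown."""
    # Source elements for which the matcher returned at least one suggestion.
    answered = [src for src, suggestions in ranked.items() if suggestions[:n]]
    # A hit: the gold target is among the first n suggestions.
    correct = sum(1 for src in answered if gold.get(src) in ranked[src][:n])
    precision = correct / len(answered) if answered else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

Increasing N can only keep or raise the number of hits, which is why results of this kind are usually plotted as a curve over N.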
![Page 21: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/21.jpg)
Results
ADNI to NACC
![Page 22: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/22.jpg)
Results
ADNI to LAADC
![Page 23: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/23.jpg)
Results
ADNI to INDD
![Page 24: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/24.jpg)
Results
ADNI to GAAIN Common Model
![Page 25: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/25.jpg)
Training Topic Model
![Page 26: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/26.jpg)
Comparison
![Page 27: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/27.jpg)
Common Model Mapping
![Page 28: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/28.jpg)
Conclusions from Evaluation
• As a medical dataset mapping tool
  • High mapping accuracy (90% and above) possible for datasets in this domain
  • Significantly higher mapping accuracy compared to available schema mapping systems like Coma++ and Harmony
• From a matching approach perspective
  • No universally superior approach for text similarity matching
  • Topic modeling based text matching provides significantly higher mapping accuracy than TfIdf when the descriptions are not exactly the same
  • TfIdf outperforms topic modeling when descriptions are exactly the same
  • Metadata based blocking is beneficial
• Internal system
  • Mapping accuracy is sensitive to topic model parameters
    • Hyperparameters in the underlying “LDA” topic model
  • Filter first, then match – better than match, then eliminate
![Page 29: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/29.jpg)
Data Understanding: Model Discovery Using GEM
• Identifying data elements for a common data model over collection of multiple, disparate datasets
• Common data model design is a complex problem
• GEM helps significantly in the bottom up design of common data model
• For each column of source, corresponding matches from all destination sources given
![Page 30: GEM: The GAAIN Entity Mapper Naveen Ashish, Peehoo Dewan, Jose-Luis Ambite and Arthur W. Toga USC Stevens Neuroimaging and Informatics Institute Keck School](https://reader035.vdocuments.mx/reader035/viewer/2022081503/5697bfdb1a28abf838cb0982/html5/thumbnails/30.jpg)
Current Work
• Machine-learning classification
  • Text similarity, name similarity, table correspondence …
• Active-learning for training
• Data dictionary ingestion
Links
1) http://www.gaain.org
2) http://www-hsc.usc.edu/~ashish/ADT.htm
Thank you [email protected]