Treating heterogeneity and uncertainty in data integration: study on Brazilian healthcare databases
Marcos Barreto (1), Mauricio Barreto (2), Spiros Denaxas (3)
1. Computer Science Dept., Federal University of Bahia (UFBA), Salvador, Bahia, Brazil
2. Institute Gonçalo Moniz, Oswaldo Cruz Foundation (FIOCRUZ), Salvador, Bahia, Brazil
3. Farr Institute of Health Informatics Research, UCL, London
Outline
– Projects' scopes
– Platform under development: linkage methods / accuracy results
– Proposed approach: initial issues / preliminary results
– Current work
The 100 million cohort project
Aim: develop a platform to support a population-based cohort built from CadastroÚnico (socioeconomic database) and assess the impact of several social protection programmes on health, education, work etc.
Social Programmes using CadastroÚnico

| Databases | Coverage |
| --- | --- |
| CadastroÚnico | 2007 - 2015 |
| Bolsa Família (PBF) | 2007 - 2015 |
| SIH (hospitalization) | 1998 - 2012 |
| SINAN (notifiable diseases) | 2000 - 2012 |
| SIM (mortality) | 2000 - 2012 |
| SINASC (live births) | 2001 - 2012 |
[Figure: # of lines (millions); 114 million]
Long-term monitoring platform for Zika
Aims:
– Systematic and longitudinal monitoring of children born and registered in SINASC (live births) between July/2016 and July/2017
– Assess the impact of microcephaly and outcomes (mortality, hospitalization etc.) related to Zika virus.
– Assess outcomes in cognitive ability through school performance studies.
– Possible linkage with other databases (retroactively to 2001): ≃ 2,800,000 births / year
– Possible introduction of other outcomes (Dengue, Chikungunya)
Bahia
Notifications of Chikungunya: Jan/Dec 2015: 24,308; Jan/Jul 2016: 47,092
Aedes aegypti infestation index: 1.4% (WHO suggested threshold: 1%)
Notifications of Dengue + Zika + Chikungunya (1 January – 6 August): 161,883
Proposed platform
[Diagram: platform components]
– Users (scientists, government etc.)
– Web portal
– Linkage pipeline
– Original data sets and dedicated resources
– Developers (Computing, Statistics, Epidemiology)
– Anonymized data marts
– Metadata / indexing
– Cohort management
– Yemoja supercomputer (#2 in LatAm)
– Safe room + medium-scale clusters
– Dedicated fiber optics connection (2 km)
Record linkage pipeline
1. Data quality assessment: CadU baseline + SUS files; metrics for qualitative analysis; candidate attributes for linkage
2. Data conditioning: ETL-based routines (cleansing, standardization); anonymization (Bloom filter); blocking routines; comparison blocks
3. Record linkage: linkage parameters; linkage routines (deterministic and probabilistic); data marts
4. Accuracy assessment: assessment metrics (sensitivity, specificity, positive predictive value (PPV) etc.); controlled scenarios; accuracy results
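The blocking step of the data conditioning stage can be illustrated with a minimal sketch. It assumes a single blocking key (here a hypothetical `municipality` field) and toy records; the real pipeline's keys and routines are not given on the slide.

```python
from collections import defaultdict

def build_blocks(records, blocking_key):
    """Group records by the value of their blocking key."""
    blocks = defaultdict(list)
    for record in records:
        blocks[blocking_key(record)].append(record)
    return blocks

def candidate_pairs(left, right, blocking_key):
    """Yield only the pairs whose two records fall in the same block."""
    right_blocks = build_blocks(right, blocking_key)
    for record in left:
        for other in right_blocks.get(blocking_key(record), []):
            yield record, other

# Toy records; field names and values are invented for illustration.
cadu = [
    {"name": "maria silva", "municipality": "mun-1"},
    {"name": "joao souza", "municipality": "mun-1"},
    {"name": "ana lima", "municipality": "mun-2"},
]
sih = [
    {"name": "maria silva", "municipality": "mun-1"},
    {"name": "ana lima", "municipality": "mun-2"},
]

pairs = list(candidate_pairs(cadu, sih, lambda r: r["municipality"]))
print(len(pairs), "comparisons instead of", len(cadu) * len(sih))
```

Only records sharing a block are compared, which trades a smaller comparison space for the risk of missing matches whose blocking key differs.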
A Spark-based workflow for probabilistic record linkage of healthcare data. PITA, R.; PINTO, C.; MELO, P.; SILVA, M.; BARRETO, M.; RASELLA, D. (BeyondMR - EDBT/ICDT 2015)
[Diagram: ATYIMO linkage, deterministic and probabilistic]
– CadastroÚnico baseline
– Payments from Bolsa Família (PBF)
– SUS (National Unified Health System): SIH (hospitalization), SINASC (live births), SIM (mortality), SINAN (notifiable diseases)
Record linkage methods
Full probabilistic: Sørensen (Dice) index applied to Bloom filters.
$$D_{a,b} = \frac{2h}{|a| + |b|}, \qquad D_{a,b} \in [0, 1]$$
h = number of 1's at the same position in both Bloom filters
|a| = number of 1's in Bloom filter A
|b| = number of 1's in Bloom filter B
[Figure: Bloom filters A and B]
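A minimal sketch of the Dice index over Bloom filters follows. The bigram tokenization, the filter size, the number of hash functions and the use of SHA-256 are assumptions made for illustration; the slide does not state the parameters used in the actual pipeline.

```python
import hashlib

def bigrams(text):
    """Split a padded, lower-cased string into character bigrams."""
    padded = f"_{text.lower()}_"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def bloom_filter(text, size=1000, num_hashes=2):
    """Return the set of positions holding a 1 in the Bloom filter of `text`."""
    positions = set()
    for gram in bigrams(text):
        for seed in range(num_hashes):
            digest = hashlib.sha256(f"{seed}:{gram}".encode("utf-8")).hexdigest()
            positions.add(int(digest, 16) % size)
    return positions

def dice(a, b):
    """Dice index 2h / (|a| + |b|); h counts positions set to 1 in both filters."""
    if not a and not b:
        return 0.0  # assumption: two empty filters carry no evidence of a match
    h = len(a & b)
    return 2 * h / (len(a) + len(b))

same = dice(bloom_filter("maria silva"), bloom_filter("maria silva"))
similar = dice(bloom_filter("maria silva"), bloom_filter("mario silva"))
different = dice(bloom_filter("maria silva"), bloom_filter("joao pereira"))
print(same, similar, different)
```

Identical strings give 1.0, and strings sharing most bigrams score higher than unrelated ones, which is what makes a cut-off on the index usable for linkage.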
Hybrid approach: individual comparison of attributes based on different rules
Correlação probabilística de bases de dados governamentais. PINTO, C.; PITA, R.; MELO, P.; SENA, S.; BARRETO, M. (Brazilian Symposium on Databases – SBBD 2015)
Record linkage methods – accuracy results
Controlled scenario: 2 databases
4 simulated scenarios with different percentages of changes in records
Main metrics: sensitivity ('sensibilidade'), positive predictive value (PPV; 'VPP' in Portuguese)

| Databases | Number of records | True matches |
| --- | --- | --- |
| Rotavirus (diarrhea), positive exams | 686 | 486 |
| Other causes (children treated at outpatient clinics) | 9,678 | |

[Figure legend: Full prob., without blocking; Full prob., blocking; Hybrid prob., without blocking; Hybrid prob., blocking. Panels: Blocking; Without blocking]
Record linkage methods – accuracy results
Uncontrolled scenario: BCG vaccination X SIM (mortality), Manaus state
[Map label: MA]

| Databases | Linked pairs | True positives |
| --- | --- | --- |
| BCG vaccination (156,331 records) X SIM (16,260 records) | 2,247 | 2,169 (96.53%) |
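As a worked check of the figures above, the share of true positives among linked pairs is the positive predictive value. The sketch below uses the standard definitions of sensitivity and PPV; only the two counts come from the slide.

```python
def sensitivity(true_positives, false_negatives):
    """Share of real matches that the linkage found."""
    return true_positives / (true_positives + false_negatives)

def ppv(true_positives, false_positives):
    """Share of linked pairs that are real matches."""
    return true_positives / (true_positives + false_positives)

linked_pairs = 2247       # from the slide
true_positives = 2169     # from the slide
false_positives = linked_pairs - true_positives

ppv_value = ppv(true_positives, false_positives)
print(f"PPV = {ppv_value:.2%}")  # PPV = 96.53%
```

Sensitivity cannot be derived from these two counts alone, since the number of missed matches (false negatives) is not reported for this scenario.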
Record linkage methods – accuracy results
Uncontrolled scenario: CadastroÚnico (2011 extraction)
– Hospitalizations (SIH) by tuberculosis: Sergipe (SE), Santa Catarina (SC) and Rondônia (RO)
– Notifications (SINAN) from Santa Catarina (SC)
[Map labels: SC, SE, RO; 624]
[Figure: CadastroÚnico X SIH (SE)]
[Figure: CadastroÚnico X SIH (RO)]
[Figure: CadastroÚnico X SIH (SC)]
[Figure: CadastroÚnico X SINAN (SC)]
Approach being discussed
Heterogeneity treated inside the pipeline
How can we learn from our data sources and linkage/accuracy results to understand the probabilistic behaviour of our scenarios / projects?
Use of the 'possible worlds' (pw) abstraction to model these uncertain relationships and create reference ('gold') standards to assess and certify accuracy
Data conditioning: ETL-based routines (cleansing, standardization); anonymization (Bloom filter); blocking routines; comparison blocks
S = (pname, email-addr, home-addr, office-addr)
T = (name, mailing-addr)

| Possible mapping | Probability |
| --- | --- |
| {(pname, name), (home-addr, mailing-addr)} | 0.5 |
| {(pname, name), (office-addr, mailing-addr)} | 0.4 |
| {(pname, name), (email-addr, mailing-addr)} | 0.1 |

Source instance of S:

| pname | email-addr | home-addr | office-addr |
| --- | --- | --- | --- |
| Alice | alice@ | Mountain View | Sunnyvale |
| Bob | bob@ | Sunnyvale | Sunnyvale |

Possible world pw1, Pr(pw1) = 0.5:

| name | mailing-addr |
| --- | --- |
| Alice | Mountain View |
| Bob | Sunnyvale |

Possible world pw2, Pr(pw2) = 0.4:

| name | mailing-addr |
| --- | --- |
| Alice | Sunnyvale |
| Bob | Sunnyvale |

Possible world pw3, Pr(pw3) = 0.1:

| name | mailing-addr |
| --- | --- |
| Alice | alice@ |
| Bob | bob@ |

DOAN, Anhai et al. Principles of Data Integration, Morgan Kaufmann, 2012.
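The example can be sketched in code: each possible mapping is applied to the whole source table to produce one possible world, and the probability of a query answer is the sum of the probabilities of the worlds in which it appears. The function names and the sample query are illustrative assumptions; the data are those of the example.

```python
from collections import defaultdict

source = [
    {"pname": "Alice", "email-addr": "alice@",
     "home-addr": "Mountain View", "office-addr": "Sunnyvale"},
    {"pname": "Bob", "email-addr": "bob@",
     "home-addr": "Sunnyvale", "office-addr": "Sunnyvale"},
]

# Each entry: (source attribute -> target attribute, probability).
possible_mappings = [
    ({"pname": "name", "home-addr": "mailing-addr"}, 0.5),
    ({"pname": "name", "office-addr": "mailing-addr"}, 0.4),
    ({"pname": "name", "email-addr": "mailing-addr"}, 0.1),
]

def apply_mapping(rows, mapping):
    """Build one possible world: the target table under a single mapping."""
    return [{target: row[src] for src, target in mapping.items()} for row in rows]

def answer_probabilities(rows, mappings, predicate, project):
    """Sum, per answer, the probabilities of the worlds that return it."""
    probabilities = defaultdict(float)
    for mapping, probability in mappings:
        world = apply_mapping(rows, mapping)
        for answer in {project(t) for t in world if predicate(t)}:
            probabilities[answer] += probability
    return dict(probabilities)

# Hypothetical query: who has 'Sunnyvale' as mailing address?
result = answer_probabilities(
    source,
    possible_mappings,
    predicate=lambda t: t["mailing-addr"] == "Sunnyvale",
    project=lambda t: t["name"],
)
print(result)
```

Bob is returned in pw1 and pw2 (0.5 + 0.4) and Alice only in pw2 (0.4), so answers come ranked by how many plausible mappings support them.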
Learning from our data sources
IDB 2012 (indicators and basic data), TABNET, provided by DATASUS (http://datasus.saude.gov.br/)
Learning from our linkage results
Dice coefficients with good sensitivity and positive predictive value (PPV) vary significantly depending on the databases involved
Usage of supervised and unsupervised machine learning techniques to analyze the accuracy results and (try to) provide a way to eliminate manual review
– Supervised: ID3 and Naïve-Bayes
– Unsupervised: partitional (k-means, CLARA) and hierarchical (AGNES, DIANA)
– Spark MLlib, R cluster / clusteval
– Data mart:
  • CadastroÚnico (2011) x SINAN 2011 (tuberculosis): 4,910 records
– Cross-validation based on a sliding window from block #0 to block #9 as training data
[Figure: Block #0 – 491 records, Block #1 – 491 records, ..., Block #9 – 491 records]
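The block partitioning can be sketched as below. The slide does not spell out how the window slides, so this sketch assumes the common scheme in which each of the 10 blocks serves once as test data while the other nine are used for training; the function names are hypothetical.

```python
def split_into_blocks(records, num_blocks=10):
    """Cut the records into equally sized consecutive blocks."""
    block_size = len(records) // num_blocks
    return [records[i * block_size:(i + 1) * block_size] for i in range(num_blocks)]

def sliding_window_folds(blocks):
    """Yield (test block index, training records, test records) for each block."""
    for index, test_block in enumerate(blocks):
        training = [record
                    for other, block in enumerate(blocks) if other != index
                    for record in block]
        yield index, training, test_block

records = list(range(4910))  # stand-in for the 4,910 linked pairs of the data mart
blocks = split_into_blocks(records)
folds = list(sliding_window_folds(blocks))
print(len(blocks), len(blocks[0]), len(folds[0][1]))
```

With 4,910 records this gives 10 blocks of 491 records, matching the figure, and 4,419 training records per fold under the assumed scheme.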
Learning from our linkage results
Metrics:
– Dice coefficient (from linkage), edit distance of name (complete), given name and surname, equality on gender, municipality and on birth date (day, month and year)
  • Name (complete): low (0-2), medium (3-4), high (>=5)
  • Given name: low (0-2), high (>=3)
  • Surname: low (0-2), high (>=3)
  • Day, month, year, gender, municipality: equal (true), different (false)
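A sketch of how a linked pair could be turned into these discretized features is shown below. It assumes Levenshtein distance as the edit distance and hypothetical record field names; the thresholds are the ones listed above.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def bin_full_name(distance):
    """Complete name: low (0-2), medium (3-4), high (>=5)."""
    if distance <= 2:
        return "low"
    return "medium" if distance <= 4 else "high"

def bin_name_part(distance):
    """Given name or surname: low (0-2), high (>=3)."""
    return "low" if distance <= 2 else "high"

def pair_features(rec_a, rec_b, dice_score):
    """Feature vector for one linked pair (field names are assumptions)."""
    full_a = f"{rec_a['given_name']} {rec_a['surname']}"
    full_b = f"{rec_b['given_name']} {rec_b['surname']}"
    features = {
        "dice": dice_score,
        "name": bin_full_name(edit_distance(full_a, full_b)),
        "given_name": bin_name_part(edit_distance(rec_a["given_name"], rec_b["given_name"])),
        "surname": bin_name_part(edit_distance(rec_a["surname"], rec_b["surname"])),
    }
    for field in ("day", "month", "year", "gender", "municipality"):
        features[field] = rec_a[field] == rec_b[field]
    return features

# Toy pair, invented for illustration.
a = {"given_name": "maria", "surname": "silva", "day": 3, "month": 5,
     "year": 1990, "gender": "F", "municipality": "mun-1"}
b = {"given_name": "mario", "surname": "silva", "day": 3, "month": 5,
     "year": 1990, "gender": "M", "municipality": "mun-1"}
features = pair_features(a, b, dice_score=0.9)
print(features)
```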
Supervised methods: ID3
[Figure labels: Cross-validation ID3 (average – 10 executions); Partitioning (training/test); Expected (manual review)]
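For readers unfamiliar with ID3: the algorithm grows a decision tree by repeatedly splitting on the attribute with the highest information gain. The sketch below shows only that selection step on invented toy rows; it is not the Spark or R setup used in the study.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, attribute, label="match"):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    base = entropy([row[label] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[label] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

def best_attribute(rows, attributes, label="match"):
    """ID3 split choice: the attribute with the largest information gain."""
    return max(attributes, key=lambda attribute: information_gain(rows, attribute, label))

# Toy training rows (invented): 'match' is the manual-review outcome.
rows = [
    {"name": "low", "gender": True, "match": True},
    {"name": "low", "gender": False, "match": True},
    {"name": "high", "gender": True, "match": False},
    {"name": "high", "gender": False, "match": False},
]
print(best_attribute(rows, ["name", "gender"]))
```

In this toy set the name distance separates matches from non-matches perfectly, so it is chosen as the first split, while gender contributes no gain.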
Unsupervised methods
Current work
– Detailed study on models and metrics to deal with uncertainty in probabilistic data linkage scenarios
– Generation of new data marts from AtyImo v2 (full + hybrid approach) + accuracy assessment
– Generation of training/test data from these data marts
– New tests with DataFrame-based API in the spark.ml package.
Cohort setup / management
Longitudinal merge of CadastroÚnico based on NIS (social ID) attribute
[Figure: # of lines (millions); 114 million]
[Table columns: Table, Filesize, # of records, Version]
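A longitudinal merge keyed on NIS can be sketched as follows. The record layout (a `nis` field plus arbitrary attributes per yearly extraction) and the toy values are assumptions; the slide only states that the merge is based on the NIS attribute.

```python
def longitudinal_merge(extractions):
    """Merge yearly extractions into one history per NIS.

    `extractions` maps a year to a list of records, each carrying a 'nis' key.
    """
    cohort = {}
    for year in sorted(extractions):
        for record in extractions[year]:
            history = cohort.setdefault(record["nis"], {})
            history[year] = {key: value for key, value in record.items() if key != "nis"}
    return cohort

# Toy extractions, invented for illustration.
extractions = {
    2008: [{"nis": "001", "municipality": "mun-2"}],
    2007: [{"nis": "001", "municipality": "mun-1"},
           {"nis": "002", "municipality": "mun-1"}],
}
cohort = longitudinal_merge(extractions)
print(cohort["001"])
```

Each individual ends up with one entry per year in which they appear, which is the shape needed to follow exposures and outcomes over time.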
Metadata and indexing
[Diagram: cohort index]
– Individuals (cohort + other databases): 1 ... 114,000,000 ... N
– Exposure (payments received): 2007, 2008, 2009, 2010, 2011, ..., 2015
– Outcomes: SINAN, SIM, SINASC, SIH
Metadata and indexing
[Diagram: baseline and cohort profile]
– CadastroÚnico: 2007, 2008, …, 2015
– Bolsa Família (PBF): 2007, 2008, …, 2015
– Health data (SUS):
  • SINASC: 2007 … 2012
  • SIH: 2007 … 2012
  • SINAN: 2007 … 2012
  • SIM: 2007 … 2012