Health Risk Prediction via Mining Big Health Data. Vincent S. Tseng, Department of Computer Science, National Chiao Tung University, Taiwan. Taiwan-Italy bilateral workshop on Smart City, 2015


  • Health Risk Prediction via Mining Big Health Data

    Vincent S. Tseng

    Department of Computer Science, National Chiao Tung University

    Taiwan

    Taiwan-Italy bilateral workshop on Smart City, 2015

  • 2

    Emerging Needs in Smart City Development

  • 3

    Outline

    Brief Bio Sketch
    Introduction
    Traditional Health Risk Assessment
    Health Risk Mining and Prediction
    Some Recent Developments
    Large-Scale Population-based Health Data Mining
    Disease Risk Patterns Mining
    Health Risk Mining for Chronic Diseases
    Early Prediction & Monitoring for Disease Outbreak
    Concluding Remarks

  • 4

    Brief Bio Sketch: Dr. Vincent S. Tseng

    Professional Positions
    Professor, Dept. of Computer Science, National Chiao Tung University, Taiwan
    Chair, IEEE Computational Intelligence Society (CIS) Tainan Section
    Review board member for government units in Taiwan, including the Ministry of Science and Technology and the Ministry of Health and Welfare
    Editorial board: IEEE Transactions on Knowledge and Data Engineering (TKDE), IEEE Journal of Biomedical and Health Informatics (JBHI), ACM Transactions on Knowledge Discovery from Data (TKDD), etc.
    President, Taiwanese Association for Artificial Intelligence (2011-2012)
    Director, Institute of Medical Informatics, NCKU, Taiwan (2008/8-2011/7)
    Director, Medical Informatics Center, NCKU Hospital (2004-2007)

    Has published more than 300 papers in refereed journals/conferences related to data mining and intelligent computing; held/filed more than 15 patents in USA and ROC

    Has conducted 50+ academic/industrial research projects

  • 5

    Development of Intelligent and High-Performance Big Data Mining Techniques

    Applications in Emerging Domains

    Biomedical/Health Data Mining

    Mobile and Social Network Data Mining

    Multimedia Data Mining

    Manufacturing Data Mining

    Etc.

    Main research topics in my research group

  • 6

    Big Data Mining Platform

    [Architecture diagram: a big data mining platform in C++ on top of a cloud, GPU & stream computing platform. Input data enters through the Data Access C API (ODBC, direct, or custom access) and passes through data preparation components, mining engine components (association rules, sequential patterns, clustering, classification), rules retrieve components, prediction components, and an applications module. Outputs shown: clusters, associations, predictive models, reports, models/rules, interesting patterns. Stage labels: data access, data preparation, data modeling, presentation, deploy.]

  • 7

    Applications in Various Domains

  • 8

    A large-scale research initiative in 2012 aimed at:
    Innovations around smartphone-based research
    Collecting smartphone data in everyday life conditions
    Community-based evaluation of related mobile data analysis methodologies

    Data source: Lausanne Data Collection Campaign

  • 9

    User Profile/Behavior Modeling and Prediction

    Personal information: media files, calendar, applications
    Social information: call log, contacts, Bluetooth
    Device information: process, accelerometer, system information
    Location information: GSM, WLAN, sequence of place visits

  • 10

  • 11

    Biomedical/Health Data Mining

    Gene Expression Mining: association pattern analysis, clustering, time-series analysis
    Protein Expression Mining: mass spectrometry, LC/MS mining
    Biomarker & Health Risk Mining: vital sign analysis (ECG/EEG), disease biomarker analysis, health risk assessment, tele-care platform
    Patient Behavior Mining
    Gene Regulation Network Analysis
    Protein Structure Mining

    Data Mining Techniques: association rule mining, sequential pattern mining, clustering, classification, time-series analysis, etc.

  • 12


    Trends of Medicine & Healthcare

    Medicine Personalized Medicine Personalized treatment

    Preventive Medicine Early detection

    Preventive Healthcare Personalized Preventive Risk Assessment

  • 13

    Health Risk Assessment (HRA) - General health examination -

    [Diagram: the examinee takes a health examination; doctors assess and interpret the health examination report and give suggestions.]

    Is there some potential health risk?

    In a general health examination, health conditions are diagnosed and assessed simply as normal/abnormal from lab results.

    There is a lack of predictive assessment of health risk.

    0.88 Waist-Hip Ratio (male

  • 14

    Predictive HRA - Scenario -

    Historic health examination data for Mr. Chen (traditional health examination reports from the traditional health examination system), used as input to the HRA system:

    Examined Date | Weight | Height   | Blood Pressure
    2009/05/03    | 59 kg  | 165.0 cm | 120~70
    2008/04/04    | 65 kg  | 165.2 cm | 110~70
    2007/10/16    | 64 kg  | 165.1 cm | 120~70
    2005/09/30    | 61 kg  | 165.2 cm | 110~70

  • 15

    HRA System

    Traditional Health Report

    Integrated Report

    Doctor

    Mr. Chen

    Diabetes Prediction Model

    Pneumonia Prediction Model

    Heart Disease Prediction Model

    Apoplexy Prediction Model

    Malignancy Prediction Model

    Predictive HRA (cont.)- Scenario-

  • 16

    Disease Prediction Related Work

    [Huang et al., 2007]

    [Diagram labels: clinical examination; feature selection; feature weighting; physician selection; feature set; classification with Naive Bayes, C4.5, and IB1; outcomes Good / Bad.]

  • 17

    Intelligent Database Laboratory, CSIE, NCKU

    Disease Prediction Related Work (cont.)

  • 18

    Disease Prediction Related Work (cont.)

    [Palaniappan et al., 2008] Intelligent Heart Disease Prediction System (IHDPS)


  • 19

    Disease Prediction Related Work (cont.)

    [Patil et al., 2009] Heart Attack Prediction System

    [Process diagram labels: heart disease databases; preprocessing; clustered using K-means; frequent pattern mining; weighted.]

  • 20

    Predictive Risk Study of Hepatocellular Carcinoma (HCC) - [Chen et al. JAMA06]

    Prospective cohort study of 3653 participants (aged 30-65) in Taiwan

    Six main indicators:
    Sex
    Age
    Cigarette smoking
    Alcohol consumption
    Serostatus for the hepatitis B e antigen (HBeAg)
    Serum alanine aminotransferase level

    Elevated serum HBV DNA level (≥10,000 copies/mL) is a strong risk predictor of hepatocellular carcinoma, independent of HBeAg, serum alanine aminotransferase level, and liver cirrhosis.

  • 21

    Taiwan offers new models to predict Hepatitis C cancer risk

    Announced at the 2012 Conference of the Asian Pacific Association for the Study of the Liver by Dr. CJ Chen's team

    The model delivers prediction results with 80% accuracy, based on a large-scale population screening study in Taiwan

    Main indicators:
    Age
    Liver function indexes ALT and AST
    Hepatitis C virus RNA in serum
    Genotype of the virus
    Liver cirrhosis

    Free smartphone app provided for use

  • 22

    Health Risk Assessment - Disease Prediction -

    Traditional prediction mechanisms (statistics, neural network) were built on a static and simple view of health/medical records.

    Physical examination item          | Value
    Body height (cm)                   | 166
    Body weight (kg)                   | 62.8
    BMI (18.5~24)                      | 22.8
    Systolic blood pressure (110~140)  | 120
    Diastolic blood pressure (60~90)   | 85
    Waist circumference (cm)           | 79
    Hip circumference (cm)             | 89

    Waist-Hip Ratio (male
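The static, single-record view on this slide can be sketched as follows. This is a minimal illustration, assuming the reference ranges printed on the slide (BMI 18.5~24, systolic 110~140, diastolic 60~90); the function names and the dictionary layout are hypothetical and are not taken from the talk.

```python
# Reference ranges copied from the slide; everything else is illustrative.
REFERENCE_RANGES = {
    "bmi": (18.5, 24.0),
    "systolic_bp": (110, 140),
    "diastolic_bp": (60, 90),
}


def body_mass_index(weight_kg, height_cm):
    """BMI = weight (kg) / height (m) squared."""
    height_m = height_cm / 100.0
    return weight_kg / (height_m ** 2)


def waist_hip_ratio(waist_cm, hip_cm):
    """Waist circumference divided by hip circumference."""
    return waist_cm / hip_cm


def static_assessment(values):
    """Label each item normal/abnormal from one record, with no history."""
    report = {}
    for item, (low, high) in REFERENCE_RANGES.items():
        report[item] = "normal" if low <= values[item] <= high else "abnormal"
    return report


# Values from the slide's example record.
record = {
    "bmi": body_mass_index(62.8, 166),
    "systolic_bp": 120,
    "diastolic_bp": 85,
}
whr = waist_hip_ratio(79, 89)
report = static_assessment(record)
print(round(record["bmi"], 1), round(whr, 3), report)
```

Every item comes out as normal here, which is the limitation the slide points at: a snapshot check says nothing about where the values are heading.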

  • 23

    Health Risk Prediction - Divination for Health?

    What have we seen?
    A snapshot? (point)
    Temporal evolution? (segment)
    Full view? (full coverage)

  • 24

    Challenges

    Big data with heterogeneous data sources: volume, variety, velocity, etc.
    Data quality problem
    Data imbalance problem
    Post-processing of large analyzed/extracted results
    Need for deep incorporation of medical domain knowledge
    Privacy issue

  • Some Recent Developmentsin Taiwan

    25

  • Large-Scale Population-based Health Data Mining

    26

  • Health-related information is derived from various data

    Question: what's the probability of occurrence of adverse event B after taking treatment A?

    One hospital's data: through the electronic medical record (EMR) system we could calculate the incidence of event B in patients receiving treatment A. Limitation: patients with event B might be diagnosed in another hospital.

    Several hospitals' data: through an EMR exchange system we could have a more accurate estimate of the incidence of event B. Limitation: difficult to detect rare event B.

    All hospitals' population-based data: through National Health Insurance claims data we could have a very large patient sample size to detect rare event B.

    27
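To make the effect of data scope concrete, here is a minimal sketch that estimates the incidence of event B among patients who received treatment A, first from one hospital, then from several, then from all. The records, hospital names, and function name are hypothetical toy values, not data from the talk, and the sketch ignores the time order of treatment and event for brevity.

```python
# Hypothetical toy records: (patient_id, hospital, kind), where kind is
# "treatment_A" or "event_B". None of these values come from the talk.
RECORDS = [
    ("p1", "H1", "treatment_A"), ("p1", "H1", "event_B"),
    ("p2", "H1", "treatment_A"), ("p2", "H2", "event_B"),
    ("p3", "H1", "treatment_A"),
    ("p4", "H1", "treatment_A"), ("p4", "H3", "event_B"),
    ("p5", "H2", "treatment_A"),
]


def incidence_of_b_after_a(records, visible_hospitals):
    """Share of treated patients with event B, using only visible hospitals."""
    visible = [r for r in records if r[1] in visible_hospitals]
    treated = {pid for pid, _, kind in visible if kind == "treatment_A"}
    with_event = {pid for pid, _, kind in visible if kind == "event_B"}
    if not treated:
        return 0.0
    return len(treated & with_event) / len(treated)


one_hospital = incidence_of_b_after_a(RECORDS, {"H1"})
several_hospitals = incidence_of_b_after_a(RECORDS, {"H1", "H2"})
all_hospitals = incidence_of_b_after_a(RECORDS, {"H1", "H2", "H3"})
print(one_hospital, several_hospitals, all_hospitals)
```

In this toy data the single-hospital estimate is the lowest because events diagnosed elsewhere are invisible to it, which mirrors the limitation stated on the slide.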

  • Micro view through hospital-based data

    Conclusions from hospital A

    Conclusions from hospital B

    28

  • Macro view through bigger data

    29

  • Occurrence of febrile convulsion

    Frightened mother would ask doctor: what is the probability that my child will become a patient with epilepsy?

    Febrile convulsion as an example

    30

  • The probability varied greatly with different datasets

    Mainly due to referral bias

    31

  • Importance of using big health data to detect drug adverse effects

    Lancet 2005;365:475-481

    32

  • 33

    National Health Insurance Research Database (NHIRD) in Taiwan

    National Health Insurance (NHI)
    Established on March 1, 1995
    Serves 99.2% of the Taiwanese population (20M+)
    Covers 92.62% of medical institutions

    Longitudinal Health Insurance Database (LHID), sampled from NHIRD
    Includes the health records of 951,044 people
    1997 to now

    Strongly representative of Taiwan
    Every living region
    Long time interval: 15+ years

    Reference: National Health Insurance, http://www.nhi.gov.tw, 2012

  • Evolution of value-added analysis of health datasets in Taiwan

    [Timeline figure spanning 2000, 2005, 2010, and 2015, with generations labeled 1G (NHI), 2G (NHI, CR, COD, BR), and 3G (lab data & patient-centered outcome, cloud computing).]

    NHI: National Health Insurance; BR: Birth Registry; CR: Cancer Registry; COD: Cause of Death (mortality); Lab: Laboratory data

    34

  • Incorporation of more heterogeneous datasets

    Lab data & patient-reported outcome

    Cloud computing

    CR, NHI, COD, BR

    Sensor-based biomarker monitoring data

    Environmental monitoring data

    Smart Health Risk Alert

    35

  • 36

    Rich Topics for Explorations

    Taiwan's government units have launched large-scale projects for value-added analysis of the national health data:
    Disease marker discovery
    Disease progression models
    Adverse drug reactions (ADRs)
    Medication redundancy
    Public health issues
    Privacy issues
    etc.

  • 37

    Integrated Platform for Big Health Data Analytics

    [Architecture diagram: databases DB1, DB2, ..., DBn go through data loading and data migration into a cloud system that offers integrated database and data analytics services. DB service on the DB server: data integration, data processing, query interface. Data analytics service: cloud-based data mining components that produce mining results. The user side shows visualization, data output, and data download.]

  • 38

    Goal: Healthcare as a Service

    Data Cloud, Computing Cloud

  • Disease Risk Patterns Mining

    39

  • 40

    Goal

    To develop an effective framework for

    1. Mining disease risk patterns

    For further medical research

    2. Assessing disease risk

    Identify potential patients for health monitoring and diagnosis assistance

  • 41

    System Framework

  • 42

    System Main Frame

  • 43

    Pattern Annotation & Visualization Frame

  • 44

    Case Study: Chronic Kidney Disease (CKD)

    Leads to End-Stage Renal Disease (ESRD), dialysis, and a high cost of NT$30 billion per year

    Not easy to detect at an early stage

    Not a single specific disease

    Multiple and complex causes

    Reference: United States Renal Data System, 2013 Atlas of End-Stage Renal Disease

  • 45

    Example Risk Patterns Discovered

    Well-Known Related

    Risk Pattern  | ICD-9-CM Definition of Ancestors | Support | Confidence | PubMed Search
    {40190} → CKD | 40190: Essential hypertension, unspecified | 18.05% | 76.41% | 83
    {25000} → CKD | 25000: Diabetes mellitus without mention of complication, type II, or unspecified type, not stated as uncontrolled | 14.57% | 80.78% | 1313

    Potentially Surprising

    Risk Pattern  | ICD-9-CM Definition of Ancestors | Support | Confidence
    {53300} → CKD | 53300: Peptic ulcer, site unspecified | 14.42% | 65.56%
    {30000} → CKD | 30000: Anxiety state | 9.22% | 64.83%
    {52300} → CKD | 52300: Gingival and periodontal diseases | 37.71% | 57.63%
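The support and confidence columns on this slide can be illustrated with a short sketch. It assumes the standard association-rule definitions (support = share of all patients holding both the antecedent codes and the outcome; confidence = share of antecedent holders who also have the outcome); the talk does not state its exact definitions, and the patient sets below are hypothetical toy data, not the NHIRD figures.

```python
def rule_stats(patients, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    patients:   list of sets of diagnosis codes, one set per patient
    antecedent: set of codes that must all be present
    consequent: single outcome label, e.g. "CKD"
    """
    n = len(patients)
    with_antecedent = [p for p in patients if antecedent <= p]
    with_both = [p for p in with_antecedent if consequent in p]
    support = len(with_both) / n
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical toy cohort; codes reuse the slide's ICD-9-CM identifiers.
PATIENTS = [
    {"40190", "CKD"}, {"40190", "25000", "CKD"}, {"40190"}, {"25000", "CKD"},
    {"53300"}, {"40190", "CKD"}, {"30000"}, {"25000"}, {"40190", "53300"},
    {"52300"},
]

support, confidence = rule_stats(PATIENTS, {"40190"}, "CKD")
print(f"support={support:.2%} confidence={confidence:.2%}")
```

A pattern with a lower confidence can still be worth reporting when it is not already documented, which is presumably why the slide separates well-known patterns from potentially surprising ones.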

  • 46

    Health Risk Mining for Chronic Diseases

  • 47

    Mining of Potential Health Risk Trend

    Health warning: discover potential risk using trend analysis

    The current health examination report does not carry out trend analysis

    [Charts: triglyceride values and cholesterol values by examination date (2006/03, 2007/02, 2008/03, 2009/02), each with a 200 mg/dl reference line.]
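One simple way to turn a series of examination values into a trend warning is a least-squares line projected one step ahead. The sketch below assumes that approach; it is not necessarily the method used in the talk. The 200 mg/dl threshold is the reference line from the slide, while the yearly values are hypothetical and the four examinations are treated as evenly spaced for simplicity.

```python
def linear_trend(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def trend_warning(xs, ys, threshold, horizon=1.0):
    """Warn when all values are still below threshold but the projection is not."""
    slope, intercept = linear_trend(xs, ys)
    projected = slope * (xs[-1] + horizon) + intercept
    all_normal_now = all(y < threshold for y in ys)
    return all_normal_now and projected >= threshold, projected


# Hypothetical triglyceride values (mg/dl) for four yearly examinations.
years = [0, 1, 2, 3]
triglyceride = [130, 150, 170, 190]
warn, projected = trend_warning(years, triglyceride, threshold=200)
print(warn, projected)
```

Each individual value in this series would pass a normal/abnormal check, yet the projection crosses the threshold, which is the kind of potential risk a per-visit report cannot show.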

  • 48

    Health Risk Mining and Prediction- in Large-Scale and Dynamic View

    [Diagram: the examinee's records feed risk pattern mining and health risk prediction, whose results go to the doctors.]

    0.88 Waist-Hip Ratio (male

  • 49

    General System Framework

    [Framework diagram with three phases. Phase 1, Health Risk Pattern Mining: the profile, value, and report datasets form an integrated dataset; with parameter settings from the doctor, health risk patterns are mined. Phase 2, Model Construction: feature integration and disease predictor building produce a disease prediction model. Phase 3, Risk Prediction: a health check is passed to the model to give prediction results.]

  • 50

    Key Steps

    Feature selection
    Pattern mining: frequent pattern mining, surprising pattern mining, etc.
    Modeling: decision tree, SVM, neural network, etc.
    Prediction: ensemble, etc.
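As a rough illustration of the final "ensemble" step, the sketch below combines several base classifiers by majority vote. The three base classifiers and their cut-off values are hypothetical stand-ins chosen only to make the example runnable; they are not the models or thresholds from the talk and are not clinical guidance.

```python
from collections import Counter


# Hypothetical base classifiers; thresholds are illustrative only.
def by_hba1c(record):
    return "at_risk" if record["hba1c"] >= 6.0 else "normal"


def by_fpg(record):
    return "at_risk" if record["fpg"] >= 100 else "normal"


def by_waistline(record):
    return "at_risk" if record["waistline_cm"] >= 90 else "normal"


def ensemble_predict(record, classifiers):
    """Return the label predicted by the most base classifiers."""
    votes = Counter(clf(record) for clf in classifiers)
    label, _count = votes.most_common(1)[0]
    return label


CLASSIFIERS = [by_hba1c, by_fpg, by_waistline]
examinee = {"hba1c": 6.3, "fpg": 95, "waistline_cm": 95}
prediction = ensemble_predict(examinee, CLASSIFIERS)
print(prediction)
```

With an odd number of classifiers and two labels, the vote cannot tie; a real system would more likely combine trained models such as the decision tree, SVM, and neural network listed above.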

  • 51

    Experimental Evaluation

    Health examination data of a medical center in Taiwan
    Time period: February 1996 ~ August 2009
    Total number of subjects: 14,218
    Target disease: diabetes, on Fasting Plasma Glucose (FPG)

    Items: Triacylglycerol, Waistline, Systolic pressure (left hand), Diastolic pressure (right hand), HbA1c, HDL-C, Arm girth, Sphygmus (right hand), Diastolic pressure (left hand), Fasting plasma glucose, Total cholesterol, Weight, Sphygmus (left hand), Diastolic pressure (before stand), Height, Systolic pressure (right hand), Sphygmus (before stand), OGTT

  • 52

    Experimental Results (FPG)

    [Bar chart: accuracy, precision, sensitivity, specificity, and F-measure (0% to 100%) for all, female, and male subjects.]

  • 53

    Experimental Results (High-Density Lipoprotein)

    [Bar chart: accuracy, precision, sensitivity, specificity, and F-measure (0% to 100%) for all, female, and male subjects.]
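For reference, the five measures plotted in these charts follow from a binary confusion matrix. The sketch below uses the standard definitions; the counts are hypothetical and are not the results reported in the talk.

```python
def classification_metrics(tp, fp, tn, fn):
    """Standard binary classification measures from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    f_measure = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Hypothetical counts: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

Reporting sensitivity and specificity alongside accuracy matters for imbalanced health data, where a model can score high accuracy while missing most of the rare positive cases.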

  • 54

    Decision Tree

    [Decision tree figure. Node tests: Health Risk Pattern - HbA1c (F2) (A238); Health Risk Pattern - HbA1c (F2) (A211); Health Risk Pattern - Diastolic Pressure (Left Hand) (F17) (A87); Health Risk Pattern - HbA1c (F2) (A286).]

    >0: represents the existence of the health risk pattern

    100) 5.0

    The 5 records are classified into C3 by the rule

    (Note that they are correctly classified.)

  • 55

    Decision Tree (cont.)

    [Decision tree figure. Node tests: Health Risk Pattern - HbA1c (F2) (A286); Health Risk Pattern - HbA1c (F2) (A238); Health Risk Pattern - Diastolic Pressure (Left Hand) (F17) (A87); Health Risk Pattern - HbA1c (F2) (A211).]

    Target attribute: Fasting plasma glucose (FPG)

    C1 Represent unhealthy range (100)

  • 56

    Health Examination Historic Data

    Our System

    Assume five common chronic disease prediction models have been built via our health analysis system from historic health examination data.

    Outpatient Data

    Diabetes Prediction Model

    Pneumonia Prediction Model

    Heart Disease Prediction Model

    Apoplexy Prediction Model

    Malignancy Prediction Model

    Practical Application - Scenario-

  • Early Prediction & Monitoring for Disease Outbreak

    - Case Study on Asthma Care

    57

  • 58

    Asthma Care

    Asthma is a chronic disease

    Airways constrict
    Apply MDI when asthma attacks!

    Potential triggers: cold air; warm, moist air; allergens; stress; cold

  • 59

    Prediction of Asthma Outbreak

    Sliding window, size: 5

    [Diagram: the combined data for day1 to day5 are sent to the server, which produces the asthma outbreak prediction for day6.]
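The sliding-window setup on this slide (five days of combined data used to predict day six) can be sketched as below. The window size of 5 comes from the slide; the function name, the record layout, and the sample values are hypothetical.

```python
def sliding_windows(daily_records, labels, size=5):
    """Pair each run of `size` consecutive days with the next day's label."""
    samples = []
    for start in range(len(daily_records) - size):
        window = daily_records[start:start + size]
        target = labels[start + size]
        samples.append((window, target))
    return samples


# Hypothetical combined daily data: (PM10, temperature difference in deg C).
daily = [(40, 3), (55, 4), (80, 8), (120, 9), (90, 6), (60, 5), (45, 2), (70, 7)]
# Hypothetical next-day outcome labels aligned with `daily` by index.
outbreak = ["none", "none", "none", "potential", "potential", "none", "none", "potential"]

samples = sliding_windows(daily, outbreak, size=5)
first_window, first_target = samples[0]
print(len(samples), first_target)
```

Each (window, target) pair can then serve as one training example for whichever classifier sits behind the prediction server.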

  • 60

    Integrated Data Mining Mechanism

    [Diagram: user profiles, the environment dataset, and the bio-signal dataset of chronic disease patients feed data mining (association rule, sequential pattern, time series, classification). The results form an understandable pattern DB relating symptoms and factors, which drives the predictive alarm engine.]

  • 61

    Data Mining Workflow

    [Workflow diagram in two phases (Phase 1, Phase 2): user profiles, the environment dataset, and the asthma symptom dataset go through data pre-processing into an integrated dataset, split into a training dataset and a testing dataset. Mining steps shown: sequential pattern mining; classification rule mining (PBD); classification rule mining (PBC). The results feed the predictive alarm engine.]

    Example rules:
    {1st asthma symptom} → Potential Asthma
    {5th allergy symptom} → Potential Asthma
    {PM10 is low, Moderate temperature} → None Asthma
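Rules of the form {items} → label, like the three examples on this slide, can be applied to a day's observations with a subset test. The sketch below is one minimal way to do that; the first-match ordering, the default label, and the function name are assumptions, and the talk does not describe how its predictive alarm engine resolves conflicting rules.

```python
# The three example rules from the slide, as (antecedent items, label).
RULES = [
    (frozenset({"1st asthma symptom"}), "Potential Asthma"),
    (frozenset({"5th allergy symptom"}), "Potential Asthma"),
    (frozenset({"PM10 is low", "Moderate temperature"}), "None Asthma"),
]


def apply_rules(observed_items, rules, default="unknown"):
    """Return the label of the first rule whose items are all observed."""
    for antecedent, label in rules:
        if antecedent <= observed_items:
            return label
    return default


calm_day = {"PM10 is low", "Moderate temperature", "Low humidity"}
symptom_day = {"1st asthma symptom", "PM10 is low", "Moderate temperature"}

print(apply_rules(calm_day, RULES))
print(apply_rules(symptom_day, RULES))
```

Listing the symptom rules first means a reported symptom takes priority over favorable environmental conditions, which seems the safer default for an alarm.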

  • 62

    Integrated Data: Patient Symptoms

    Asthma Symptom Dataset

    Asthma symptoms
    Level | Fever symptom | Nighttime symptom | Daytime symptom
    0 | No  | Sleep well | No cough, exercise regularly
    1 | Yes | Sleep well, but with intermittent cough | Intermittent dry cough
    2 |     | Wake up coughing and can fall asleep after inhaling steroids | Cough with phlegm, cough when exercising
    3 |     | Serious cough and can hardly fall asleep | Wheezing, use of inhaled vasodilators
    4 |     | Short breath, incessant cough, need medicine and go to hospital immediately |

    Allergy symptoms
    Level | Nose symptom | Eye symptom | Skin symptom
    0 | Itchy and rubbing | Normal | Normal
    1 | Sneeze | Rubbing | Itchy, no swelling
    2 | Nasal congestion | Swelling and photophobia | Local swelling
    3 | Running nose | | More than 2 rash blocks

  • 63

    Integrated Data: Environment Data

    Weather information (source data: Central Weather Bureau, Taiwan):
    Temperature
    Humidity
    Highest temperature
    Lowest temperature
    Temperature difference

  • 64

    Integrated Data: Environment Data (cont.)

    Air pollutants (source data: Open Environmental Database, Taiwan)

    Air pollution attributes:

  • 65

    Example of Decision Tree Output

    [Decision tree figure. Test nodes: 2nd allergy symptom; PM10 is high; Low humidity; Catch a cold; 5th asthma symptom; 3rd allergy symptom; High temperature difference. Yes/No branches lead to the leaves "Potential Asthma" and "None Potential Asthma".]

  • 66

    Example Induction Rules

    , inhaled maintenance medicine and bronchodilator → High risk [sup: 14.16%, conf: 100%]

    , no need medicine → Normal [sup: 5.584%, conf: 100%]

    , no need medicine, difference in temperature: 2, the indicator of pollutants: O3 → Normal [sup: 5.076%, conf: 100%]

    , no need medicine, , no need medicine, humidity: pattern(up, flat) → High risk [sup: 0.761%, conf: 100%]

    , inhaled maintenance medicine or oral maintenance medicine day and night, , inhaled maintenance medicine and oral maintenance medicine day and night, the quality of air: good general → High risk [sup: 2.792%, conf: 100%]

    , inhaled maintenance medicine or oral maintenance medicine day and night, , inhaled maintenance medicine and oral maintenance medicine day and night, difference in temperature: 8 → High risk [sup: 3.553%, conf: 100%]

    , inhaled maintenance medicine or oral maintenance medicine day and night, the quality of air: general good bad, PM10: 50150 → High risk [sup: 7.614%, conf: 100%]

  • 67

    Performance of Classifiers

    [Bar chart "Average of 10 Experiments: PBC Summary": Precision (Out) and Recall (Out), on a 0% to 90% axis, for the Air pollution, Weather, Asthma, and Combined datasets.]

  • 68

    Intelligent Mobile Healthcare: Framework

    [Framework diagram: patient data are transmitted automatically over GPRS to the medical center, whose server and database host the data mining system and the patient monitoring system and take in environmental information. The medical station/clinician accesses and queries the system online over the Internet. Patients query by cell phone and are shown messages and urgent event notices.]

  • 69

    Location-Aware Asthma Alert

    Data Mining System

    Predictive Alert System

  • 70

    Concluding Remarks

    Points -> Segment -> Coverage

    Full view for health risk prediction!

  • 71

    Concluding Remarks (cont.)- A Highly Integrated Framework for Smart Healthcare via Big Data Mining

  • Thanks for your attention & look forward to collaborations

  • 73