A Hadoop Ecosystem to Advance Clinical Research and Practice

Introduction
Facebook, Twitter, LinkedIn and Yahoo share the same underlying infrastructure, Apache Hadoop. All four of these applications consume, process and store millions of records consisting of structured, unstructured, image and video data. Because healthcare data shares many of the characteristics of the data found in these applications, Hadoop should be an ideal environment for ingesting, storing and using healthcare data.

Methods
A virtual Apache Hadoop version 1.0 infrastructure consisting of a single NameNode server and four Task Node servers was set up within the UCI Medical Center data center. Ubuntu Linux running on VMware was the chosen OS. The Hadoop modules used were Hadoop Common, Hadoop Distributed File System (HDFS), MapReduce, Pig, Mahout and ZooKeeper. Java scripted routines processed the legacy data. A Mirth HL7 listener and a Java scripted routine processed the HL7 data.

Results
The legacy data of 1.2 million patients, contained in 9 million patient medical records, was successfully ingested into the Saritor Hadoop Distributed File System. For researchers, the drag-and-drop query and visualization tool allowed visualization of the legacy data. For clinicians in patient care, complete patient records were retrieved via a web browser. HL7 messages from all source systems, physiological monitoring data in one-minute intervals, ventilator data in one-minute intervals and EMR-generated data were ingested and stored. Algorithms for sepsis, hospital-acquired conditions and 30-day readmissions can be built into Mahout for real-time surveillance.

Discussion
Our initial findings demonstrated that the Hadoop ecosystem is well suited for the ingestion, storage and retrieval of both legacy EMR data and runtime EMR data. Minimal programming is required to process legacy data, and processing runtime EMR data requires the cloning of existing interfaces. Real-time clinical surveillance opens up a wide range of use cases. Hadoop is an ecosystem that is affordable, scalable and highly available, and it allows clinical research and clinical practice to coexist in the same system.

Saritor: A Hadoop Ecosystem to Advance Clinical Research and Practice
Charles Boicey, MS, RN-BC¹, Lisa Dahm, PhD¹, David Gonzalez¹, Mahesh Rangarajan², Rushipriya Panda², Jeff Markham³
¹University of California, Irvine, ²CMC Americas, ³Hortonworks
The Clinical and Translational Science Awards (CTSA) is a registered trademark of DHHS.

Feed-forward Learning
• Historical Data Sets
• Real-time Data Feeds
• Input Data: Attributes, Rules, Parameters
• Hypothesis / Algorithm Model (Core Engine with the Equations / Analysis)
• Statistical Techniques
• Output / Results (Actual)
• New Learning (Pattern Refinement)
• Publish new version to Repository

Create layers of knowledge that improve the understanding, one layer at a time. Training and test data sets are used for testing the model hypothesis. Use the new baseline for real-time analysis of the incoming feeds.

Modeling Possibilities:
• Linear equation (to start with)
• Regression models (linear / multivariate)
• Neural networks (layers of knowledge)

Algorithm Management
Data: Available Data Set, Training Data Set, Test Data Set, Diagnosis Patterns Repository

Training Data Set
• Input Data: Attributes, Rules, Parameters
• Hypothesis / Algorithm Model (Core Engine with the Equations / Analysis)
• Statistical Techniques
• Output / Results (Actual)
• Analyze Output for Model Behavior (Actual versus Desired)
• Not Satisfactory: Identify Improvements, Feedback and Refine the Model
• Satisfactory Result: Matches Expectation, Release for Testing the Model

Test Data Set
• Input Data: Attributes, Rules, Parameters
• Hypothesis / Algorithm Model (Core Engine with the Equations / Analysis)
• Statistical Techniques
• Output / Results (Actual)
• Analyze Output for Model Behavior (Actual versus Desired)
• Not Satisfactory: Identify Improvements, Feedback and Refine the Model
• Satisfactory Result: Matches Expectation, Baseline the Pattern, Publish new version to Repository
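The train, test and publish cycle outlined above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Saritor implementation: it uses the simplest of the listed modeling options (a linear equation), and the function names `fit_linear`, `mean_abs_error` and `evaluate_and_publish`, the `max_error` threshold and the list used as a "repository" are all hypothetical.

```python
from statistics import mean


def fit_linear(xs, ys):
    """Least-squares fit of y = slope * x + intercept (the 'linear equation' starting point)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; cannot fit a line")
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def mean_abs_error(model, xs, ys):
    """Compare actual model output with the desired values."""
    slope, intercept = model
    return mean([abs((slope * x + intercept) - y) for x, y in zip(xs, ys)])


def evaluate_and_publish(train, test, repository, max_error=1.0):
    """Fit on the training set, check on the test set, publish if satisfactory.

    `repository` is a plain list standing in for a patterns repository
    (hypothetical); `max_error` is an arbitrary illustrative threshold.
    """
    model = fit_linear(*train)
    error = mean_abs_error(model, *test)
    if error <= max_error:
        repository.append(model)  # satisfactory: baseline the pattern and publish
        return True
    return False  # not satisfactory: identify improvements and refine the model
```

A caller would pass `(xs, ys)` pairs for the training and test sets and keep refining the inputs or model until `evaluate_and_publish` returns `True`; the published model then serves as the baseline for incoming feeds.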
"Saritor Surround" Ecosystem

Data sources
• TDS (Legacy System)
  • 22 Years Patient Data
  • 1.2M Patients
  • 9M Records
  • Orders
  • Labs
  • Transcribed Results
  • Patient Record
• HL7 Feed
  • Lab Results
  • Physiological Monitors
  • Ventilators
  • Transcribed Reports
  • Radiology Results
  • Endoscopy Results
  • Orders
• EMR Generated Data
  • RN Documentation
  • Provider Documentation
• External Data
  • Home Monitoring
  • Personal Health Record
  • Social Media
    * Twitter
    * Foursquare
    * Yelp
    * RSS & Blog

Platform components
• Hadoop Distributed File System (HDFS)
• Hive
• User/Role Based Access Control
• Neo4j: Graph Database
• Mahout: Compute pattern
• MapReduce: Generate and filter raw data from HDFS
• MongoDB: Store data matrix for pattern recognition
• Query Language

Functions
• Cohort Discovery
• Legacy Data Visualization
• Algorithm Management
• Feed-forward Learning

Viewers
• Clinician Viewer
  • Events (Sepsis) / Chronic Disease Monitoring
  • Legacy Data Viewer
  • Predictive Analytics
• Research Viewer
  • Legacy + EMR Data
  • Cohort Discovery
  • Relationship / Graph Analysis
  • De-identified at presentation
• Quality/Operations Viewer
  • Patient Throughput (RTLS)
  • Quality Measures
  • Patient Engagement
  • Asset Utilization Metrics

Saritor Business Services: Request / Reply processing Engine (HTML 5 / RESTful Services / JSON driven)
External Interfaces
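As a rough illustration of what ingesting the HL7 feed involves, the sketch below flattens the observation (OBX) segments of an HL7 v2 message into simple records that could then be stored. It is an assumption-laden example, not the authors' routine and not the Mirth API: the function names `parse_hl7` and `observations`, the record layout, and the sample message (including its patient ID and local observation codes) are all made up. It relies only on standard HL7 v2 delimiters: carriage returns between segments, `|` between fields and `^` between components.

```python
def parse_hl7(message):
    """Split an HL7 v2 message into {segment_id: [field lists]}."""
    segments = {}
    for line in message.strip().split("\r"):
        if not line:
            continue
        fields = line.split("|")
        segments.setdefault(fields[0], []).append(fields)
    return segments


def field(fields, n):
    """Return field n of a non-MSH segment, or '' if the segment is too short."""
    return fields[n] if n < len(fields) else ""


def observations(message):
    """Flatten each OBX segment into one record tagged with the patient ID."""
    segments = parse_hl7(message)
    pid = segments["PID"][0]
    patient_id = field(pid, 3).split("^")[0]  # PID-3, first component
    records = []
    for obx in segments.get("OBX", []):
        code_parts = field(obx, 3).split("^")  # OBX-3: code^text^coding system
        records.append({
            "patient_id": patient_id,
            "code": code_parts[0],
            "name": code_parts[1] if len(code_parts) > 1 else "",
            "value": field(obx, 5),  # OBX-5: observation value
            "units": field(obx, 6),  # OBX-6: units
        })
    return records


# Made-up sample message for illustration only (not real patient data).
SAMPLE = "\r".join([
    "MSH|^~\\&|SENDER|FAC|RECEIVER|FAC|201301010830||ORU^R01|MSG0001|P|2.3",
    "PID|1||12345^^^HOSP||DOE^JANE",
    "OBX|1|NM|HR^Heart rate^L||72|/min",
    "OBX|2|NM|RR^Respiratory rate^L||16|/min",
])

records = observations(SAMPLE)
```

Each resulting record is a flat dictionary, so it can be serialized (for example as JSON) before being written to storage; a production feed would also need to handle repetitions, escape sequences and message acknowledgements, which this sketch ignores.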

Upload: amia

Post on 07-Nov-2014


DESCRIPTION

2013 Summit on Clinical Research Informatics



