Starting the Hadoop Journey at a Global Leader in Cancer Research
TRANSCRIPT
Vamshi Punugoti & Bryan Lari, MD Anderson Cancer Center
June 2016
HDP @ MD Anderson
Agenda
• About MD Anderson
• Big Data Program
• Our Hadoop Implementation
• Lessons Learned
• Next Steps
About MD Anderson
• Who we are
  – One of the world's largest centers devoted exclusively to cancer care
  – Created by the Texas legislature in 1941
  – Named one of the nation's top two hospitals for cancer care every year since the survey began in 1990
• Mission
  – MD Anderson's mission is to eliminate cancer in Texas, the nation, and the world through exceptional programs that integrate patient care, research, and prevention.
About MD Anderson cont.
• Patient Care
• Education
• Research
Moon Shots Program
• Launched in 2012 – to make a giant leap for patients• Accelerating the pace of converting scientific discoveries into
clinical advances that reduce cancer deaths• Transdisciplinary team-science approach • Transformative professional platforms
List of Moon Shots (12 total)
• B-cell Lymphoma
• Lung Cancer
• Breast Cancer
• Melanoma
• Colorectal Cancer
• Multiple Myeloma
• Glioblastoma
• Ovarian Cancer
• HPV-Related Cancers
• Pancreatic Cancer
• Leukemia (CLL, MDS, AML)
• Prostate Cancer
http://www.cancermoonshots.org
The four V's of big data: Volume, Variety, Velocity, Veracity
Gulf of Mexico Analogy (diagram)
Goals of Big Data Program
• Data-driven organization
• All "types" of data
• "Access" for all customers
  – Clinicians
  – Researchers
  – Administrative / Operational
• Enable discovery of "insights"
  – Improve patient care
  – Increase research discoveries
  – Improve operations
• Govern data like an asset
• Provide a platform / environment to enable all these things
Goal: to provide the right information to the right people at the right time with the right tools
Goal: data → insights
Make big data additive and build upon the foundation.
What are we doing today?
• FIRE Enterprise Data Warehouse
• Natural Language Processing (NLP)
• Data Governance
• Hadoop NoSQL
• Cognitive Computing
• Data Visualization
• Evolving our platform / architecture
• Identifying big data use cases
• Training & skills
FIRE Enterprise Data Warehouse
• Federated Institutional Reporting Environment
• Centralized data repository supporting analytics, decision making, and business intelligence
• Central repository for historical and operational data
• Breaks down data silos
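A toy sketch of the kind of cross-silo question a centralized repository makes easy to ask. This is purely illustrative: it uses `sqlite3` as a stand-in for FIRE, and the table and column names (`labs`, `visits`, `mrn`) are invented, not MD Anderson's actual schema.

```python
# Illustrative only: sqlite3 stands in for the FIRE repository, and the
# schema below is invented. The point is that once clinical and
# operational data share one repository, a single query joins what used
# to live in separate silos.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE labs (mrn TEXT, test TEXT, value REAL)")
cur.execute("CREATE TABLE visits (mrn TEXT, clinic TEXT)")
cur.execute("INSERT INTO labs VALUES ('001', 'WBC', 4.5), ('002', 'WBC', 11.2)")
cur.execute("INSERT INTO visits VALUES ('001', 'Leukemia'), ('002', 'Leukemia')")

# Cross-silo question: lab results for patients seen in a given clinic.
rows = cur.execute(
    "SELECT v.mrn, l.test, l.value FROM visits v "
    "JOIN labs l ON l.mrn = v.mrn WHERE v.clinic = 'Leukemia'"
).fetchall()
```

Before the repository existed, answering this would have meant exporting from two systems and reconciling patient identifiers by hand.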
FIRE Program (diagram)
Source Systems (Genomic, Radiology, Labs, Epic / Clarity, Legacy Systems) → Enterprise Repository → Analytics & Reporting (Dashboards, KPIs, Analytic Reports) → Discoveries, Improved Patient Care, Quality / Performance Improvements
NLP Pipeline – Overview
• Vast amounts of unstructured data are stored on MDACC servers.
• Conventional ETL tools are not designed to mine unstructured data.
• A suite of tools makes up the NLP Pipeline.
• Dictionaries were created to help the Epic go-live (Provider Friendly Terminology)
• Other examples:
  – Diagnosis from pathology reports
  – Comorbidities
  – Family cancer history
  – Cytogenetics
  – Obituary text
  – ICD-10 coding
  – Structured results feeding Moonshot TRA and OEA
  – Etc.
Flow (diagram): Unstructured Data Sources → IBM ECM NLP Engine → Post-NLP Database → HDWF (FIRE)
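A minimal sketch of the dictionary-driven extraction idea behind this pipeline. This is not the IBM ECM NLP engine; the dictionary entries and codes below are invented for illustration, and real pathology NLP also handles negation, abbreviations, and context.

```python
# Hedged sketch, not MD Anderson's actual pipeline: a "Provider Friendly
# Terminology"-style dictionary maps free-text terms to codes, and a
# simple scan pulls structured diagnoses out of a pathology report.
import re

# Invented dictionary for illustration: pattern -> diagnosis code
DIAGNOSIS_DICT = {
    r"invasive ductal carcinoma": "IDC",
    r"adenocarcinoma": "ADENO",
    r"glioblastoma": "GBM",
}

def extract_diagnoses(report_text: str):
    """Return codes for every dictionary term found in the free text."""
    found = []
    for pattern, code in DIAGNOSIS_DICT.items():
        if re.search(pattern, report_text, flags=re.IGNORECASE):
            found.append(code)
    return found

report = "FINAL DIAGNOSIS: Invasive ductal carcinoma, left breast."
codes = extract_diagnoses(report)   # ["IDC"]
```

The structured codes, rather than the raw narrative, are what flow downstream into FIRE and the Moonshot platforms.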
Big Data for Analytics & Cognitive Computing (diagram)
• Systems of Record
  – Enterprise Business: Peoplesoft, Kronos, Point of Sale, Volunteer Services, Rotary House, MyHR, UTPD, Facilities, Parking Garages, Pharmacy
  – Clinical: Clinic Station, Epic, Lab, GE IDX, Cerner, CARE, EPM
  – Research: LCDR, Melcore, Gemini, IPCT
• Systems of Reporting: Hyperion, Oracle Business Intelligence, Smart View, Web Analytics, FIRE, EIW, Business Objects, Crystal, Hyperion Interactive Reporting
• Systems of Insights (Big Data): UPS, Centers for Disease Control, The Weather Channel, YouTube, oracle.com, Yelp!, Reuters, U.S. Census, Medical Devices, Medical Equipment, Building Controls, Campus Video, Real-time Location Service, Wayfinding
• Presentation: Data Visualization, Ad Hoc, Cognitive Computing, Cohort Explorer
Data Governance (diagram)
• Data Stewardship
• Data Portal
• Data Profiling and Quality
• Data Standardization
• Compliance
• Metadata and Business Glossary
• Master Data Management
• Data Repository
Big Data – High Level (diagram)
Source Systems (Genomic, Radiology, Labs, Epic / Clarity, Legacy Systems) → Data Lake: Big Data (Structured and NoSQL), with Data Discovery / Profiling, Standards / Quality, and Data Mgt & Operations → Analytics & Informatics (Dashboards, KPIs, Analytic Reports, Insight Apps) → Discoveries, Improved Patient Care, Quality / Performance Improvements
Big Data Technical Architecture (diagram)
Our Hadoop Implementation
• Average number of messages per day: 1,556,688
• Estimated amount of storage increase per day: 5.7 GB
• Number of channels currently being used: 24
• Estimated daily message processing capacity: 4,320,000
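A quick back-of-the-envelope check on those figures (assuming decimal gigabytes, since the slide does not say):

```python
# Sanity-check arithmetic on the slide's figures. Assumes 5.7 GB means
# 5.7e9 bytes (decimal); the slide does not specify.
msgs_per_day = 1_556_688
storage_per_day_bytes = 5.7e9          # 5.7 GB/day
capacity_per_day = 4_320_000

avg_msg_bytes = storage_per_day_bytes / msgs_per_day   # ~3.7 KB per message
headroom = capacity_per_day / msgs_per_day             # ~2.8x current volume
capacity_per_sec = capacity_per_day / 86_400           # 50 messages/second
```

So each HL7 message costs a few kilobytes of storage, and the stated capacity leaves roughly 2.8x headroom over the current daily volume.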
Our Hadoop Implementation cont. – Medical Device Data Flow (diagram)
• Data Source: Medical Device → Capsule → Capsule DB
• Data Capture (Integration HUB): raw HL7 from Capsule; validated HL7 with Patient ID from Epic, via the Cloverleaf Engine
• Data Ingestion: TCP-based data listener (Flume) → processing channels → cleanse & transform (raw HL7 to validated HL7) → Data Loader → HBase
• MDA Big Data / Data Lake (FIRE / Big Data): HBase, Hive, Pig, Hunk, Sqoop
• Access Portals (Analytics / Visualization) for end users
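A small illustrative sketch of the raw-versus-validated distinction in this flow: an HL7 v2 message is "validated" once a patient identifier is present (PID-3), and can be routed accordingly. This is not MD Anderson's actual Flume/Cloverleaf code; the message content is invented.

```python
# Illustrative sketch only (not the production Flume/Cloverleaf logic):
# route an HL7 v2 device message to a "validated" or "raw" stream based
# on whether a patient identifier is present in the PID segment.

def segments(hl7_message: str):
    """Split an HL7 v2 message into segments (carriage-return delimited)."""
    return [s for s in hl7_message.replace("\r", "\n").split("\n") if s]

def patient_id(hl7_message: str):
    """Return the patient ID from PID-3, or None if it is missing."""
    for seg in segments(hl7_message):
        fields = seg.split("|")
        if fields[0] == "PID" and len(fields) > 3 and fields[3]:
            return fields[3].split("^")[0]   # first component of PID-3
    return None

def route(hl7_message: str) -> str:
    """Decide which HBase-bound stream a message belongs to."""
    return "validated" if patient_id(hl7_message) else "raw"

msg = ("MSH|^~\\&|CAPSULE|ICU|EPIC|MDA|202301010000||ORU^R01|1|P|2.3\r"
      "PID|1||12345^^^MDA||DOE^JANE\r"
      "OBX|1|NM|HR||72|bpm")
```

With a message like `msg`, `route` returns `"validated"`; a device message with no patient context falls back to the raw stream for later cleansing.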
Our Hadoop Implementation cont. – Development & Deployment Cycle (diagram)
• Development cycle: Developer Workstation / Sandbox ↔ SVN (source control server), with daily check-in / check-out
• Bamboo (build server): periodic integration & validation (build, unit test & notify on error)
• On dev lead approval: build, unit test, deploy to the HDP Dev Cluster & tag
• Deployment cycle: smoke test on the HDP QA Cluster before updating task status
• On successful UAT & release approval: deploy to the HDP Prod Cluster per the last successful build tag
Lessons Learned – what went well (process & people)
1. It’s complex
2. It’s a journey
3. Leverage existing strengths
4. Collaborate openly
5. Learn from experts
6. One cluster – multiple use cases
7. Follow best practices
Next Steps
1. Continue to expand / evolve our platform
2. Ingest more data and data types
3. Identify high-value use cases
4. Develop / train people with new skills:
   • Accessing data
   • Computing data
   • Visualizing data
   • Insights & cognitive computing