Starting the Hadoop Journey at a Global Leader in Cancer Research
TRANSCRIPT
Vamshi Punugoti & Bryan Lari, MD Anderson Cancer Center
June 2016
HDP @ MD Anderson
Agenda
• About MD Anderson
• Big Data Program
• Our Hadoop Implementation
• Lessons Learned
• Next Steps
About MD Anderson
• Who we are
  – One of the world's largest centers devoted exclusively to cancer care
  – Created by the Texas legislature in 1941
  – Named one of the nation's top two hospitals for cancer care every year since the survey began in 1990
• Mission
  – MD Anderson's mission is to eliminate cancer in Texas, the nation, and the world through exceptional programs that integrate patient care, research, and prevention.
About MD Anderson cont.
• Patient Care
• Education
• Research
Moon Shots Program
• Launched in 2012 – to make a giant leap for patients• Accelerating the pace of converting scientific discoveries into
clinical advances that reduce cancer deaths• Transdisciplinary team-science approach • Transformative professional platforms
List of Moon Shots (12 total)
• B-cell Lymphoma
• Lung Cancer
• Breast Cancer
• Melanoma
• Colorectal Cancer
• Multiple Myeloma
• Glioblastoma
• Ovarian Cancer
• HPV-Related Cancers
• Pancreatic Cancer
• Leukemia (CLL, MDS, AML)
• Prostate Cancer
http://www.cancermoonshots.org
The four V's of big data: Volume, Variety, Velocity, Veracity
Gulf of Mexico Analogy (diagram)
Goals of Big Data Program
• Data-driven organization
• All "types" of data
• "Access" for all customers
  – Clinicians
  – Researchers
  – Administrative / Operational
• Enable discovery of "insights"
  – Improve patient care
  – Increase research discoveries
  – Improve operations
• Govern data like an asset
• Provide a platform / environment to enable all these things
Goal: to provide the right information to the right people at the right time with the right tools
Goal: data → insights
Make big data additive and build upon the foundation.
What are we doing today?
• FIRE Enterprise Data Warehouse
• Natural Language Processing (NLP)
• Data Governance
• Hadoop NoSQL
• Cognitive Computing
• Data Visualization
• Evolving our platform / architecture
• Identifying big data use cases
• Training & skills
FIRE Enterprise Data Warehouse
• Federated Institutional Reporting Environment
• Centralized data repository supporting analytics, decision making, and business intelligence
• Central repository for historical and operational data
• Breaks down data silos
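A toy sketch of the kind of cross-silo question a centralized repository makes easy to ask. This is purely illustrative: it uses `sqlite3` as a stand-in for FIRE, and the table and column names (`labs`, `visits`, `mrn`) are invented, not MD Anderson's actual schema.

```python
# Illustrative only: sqlite3 stands in for the FIRE repository, and the
# schema below is invented. The point is that once clinical and
# operational data share one repository, a single query joins what used
# to live in separate silos.
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE labs (mrn TEXT, test TEXT, value REAL)")
cur.execute("CREATE TABLE visits (mrn TEXT, clinic TEXT)")
cur.execute("INSERT INTO labs VALUES ('001', 'WBC', 4.5), ('002', 'WBC', 11.2)")
cur.execute("INSERT INTO visits VALUES ('001', 'Leukemia'), ('002', 'Leukemia')")

# Cross-silo question: lab results for patients seen in a given clinic.
rows = cur.execute(
    "SELECT v.mrn, l.test, l.value FROM visits v "
    "JOIN labs l ON l.mrn = v.mrn WHERE v.clinic = 'Leukemia'"
).fetchall()
```

Before the repository existed, answering this would have meant exporting from two systems and reconciling patient identifiers by hand.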
FIRE Program (diagram)
Source Systems (Genomic, Radiology, Labs, Epic / Clarity, Legacy Systems) → Enterprise Repository → Analytics & Reporting (Dashboards, KPIs, Analytic Reports) → Discoveries, Improved Patient Care, Quality / Performance Improvements
NLP Pipeline – Overview
• Vast amounts of unstructured data are stored on MDACC servers.
• Conventional ETL tools are not designed to mine unstructured data.
• A suite of tools makes up the NLP Pipeline.
• Dictionaries were created to help the Epic go-live (Provider Friendly Terminology)
• Other examples:
  – Diagnosis from pathology reports
  – Comorbidities
  – Family cancer history
  – Cytogenetics
  – Obituary text
  – ICD-10 coding
  – Structured results feeding Moonshot TRA and OEA
  – Etc.
Flow (diagram): Unstructured Data Sources → IBM ECM NLP Engine → Post-NLP Database → HDWF (FIRE)
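A minimal sketch of the dictionary-driven extraction idea behind this pipeline. This is not the IBM ECM NLP engine; the dictionary entries and codes below are invented for illustration, and real pathology NLP also handles negation, abbreviations, and context.

```python
# Hedged sketch, not MD Anderson's actual pipeline: a "Provider Friendly
# Terminology"-style dictionary maps free-text terms to codes, and a
# simple scan pulls structured diagnoses out of a pathology report.
import re

# Invented dictionary for illustration: pattern -> diagnosis code
DIAGNOSIS_DICT = {
    r"invasive ductal carcinoma": "IDC",
    r"adenocarcinoma": "ADENO",
    r"glioblastoma": "GBM",
}

def extract_diagnoses(report_text: str):
    """Return codes for every dictionary term found in the free text."""
    found = []
    for pattern, code in DIAGNOSIS_DICT.items():
        if re.search(pattern, report_text, flags=re.IGNORECASE):
            found.append(code)
    return found

report = "FINAL DIAGNOSIS: Invasive ductal carcinoma, left breast."
codes = extract_diagnoses(report)   # ["IDC"]
```

The structured codes, rather than the raw narrative, are what flow downstream into FIRE and the Moonshot platforms.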
Big Data for Analytics & Cognitive Computing (diagram)
• Systems of Record
  – Enterprise Business: Peoplesoft, Kronos, Point of Sale, Volunteer Services, Rotary House, MyHR, UTPD, Facilities, Parking Garages, Pharmacy
  – Clinical: Clinic Station, Epic, Lab, GE IDX, Cerner, CARE, EPM
  – Research: LCDR, Melcore, Gemini, IPCT
• Systems of Reporting: Hyperion, Oracle Business Intelligence, Smart View, Web Analytics, FIRE, EIW, Business Objects, Crystal, Hyperion Interactive Reporting
• Systems of Insights (Big Data): UPS, Centers for Disease Control, The Weather Channel, YouTube, oracle.com, Yelp!, Reuters, U.S. Census, Medical Devices, Medical Equipment, Building Controls, Campus Video, Real-time Location Service, Wayfinding
• Presentation: Data Visualization, Ad Hoc, Cognitive Computing, Cohort Explorer
Data Governance (diagram)
• Data Stewardship
• Data Portal
• Data Profiling and Quality
• Data Standardization
• Compliance
• Metadata and Business Glossary
• Master Data Management
• Data Repository
Big Data – High Level (diagram)
Source Systems (Genomic, Radiology, Labs, Epic / Clarity, Legacy Systems) → Data Lake: Big Data (Structured and NoSQL), with Data Discovery / Profiling, Standards / Quality, and Data Mgt & Operations → Analytics & Informatics (Dashboards, KPIs, Analytic Reports, Insight Apps) → Discoveries, Improved Patient Care, Quality / Performance Improvements
Big Data Technical Architecture (diagram)
Our Hadoop Implementation
• Average number of messages per day: 1,556,688
• Estimated amount of storage increase per day: 5.7 GB
• Number of channels currently being used: 24
• Estimated daily message processing capacity: 4,320,000
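A quick back-of-the-envelope check on those figures (assuming decimal gigabytes, since the slide does not say):

```python
# Sanity-check arithmetic on the slide's figures. Assumes 5.7 GB means
# 5.7e9 bytes (decimal); the slide does not specify.
msgs_per_day = 1_556_688
storage_per_day_bytes = 5.7e9          # 5.7 GB/day
capacity_per_day = 4_320_000

avg_msg_bytes = storage_per_day_bytes / msgs_per_day   # ~3.7 KB per message
headroom = capacity_per_day / msgs_per_day             # ~2.8x current volume
capacity_per_sec = capacity_per_day / 86_400           # 50 messages/second
```

So each HL7 message costs a few kilobytes of storage, and the stated capacity leaves roughly 2.8x headroom over the current daily volume.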
Our Hadoop Implementation cont. – Medical Device Data Flow (diagram)
• Data Source: Medical Device → Capsule → Capsule DB
• Data Capture (Integration HUB): raw HL7 from Capsule; validated HL7 with Patient ID from Epic, via the Cloverleaf Engine
• Data Ingestion: TCP-based data listener (Flume) → processing channels → cleanse & transform (raw HL7 to validated HL7) → Data Loader → HBase
• MDA Big Data / Data Lake (FIRE / Big Data): HBase, Hive, Pig, Hunk, Sqoop
• Access Portals (Analytics / Visualization) for end users
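A small illustrative sketch of the raw-versus-validated distinction in this flow: an HL7 v2 message is "validated" once a patient identifier is present (PID-3), and can be routed accordingly. This is not MD Anderson's actual Flume/Cloverleaf code; the message content is invented.

```python
# Illustrative sketch only (not the production Flume/Cloverleaf logic):
# route an HL7 v2 device message to a "validated" or "raw" stream based
# on whether a patient identifier is present in the PID segment.

def segments(hl7_message: str):
    """Split an HL7 v2 message into segments (carriage-return delimited)."""
    return [s for s in hl7_message.replace("\r", "\n").split("\n") if s]

def patient_id(hl7_message: str):
    """Return the patient ID from PID-3, or None if it is missing."""
    for seg in segments(hl7_message):
        fields = seg.split("|")
        if fields[0] == "PID" and len(fields) > 3 and fields[3]:
            return fields[3].split("^")[0]   # first component of PID-3
    return None

def route(hl7_message: str) -> str:
    """Decide which HBase-bound stream a message belongs to."""
    return "validated" if patient_id(hl7_message) else "raw"

msg = ("MSH|^~\\&|CAPSULE|ICU|EPIC|MDA|202301010000||ORU^R01|1|P|2.3\r"
      "PID|1||12345^^^MDA||DOE^JANE\r"
      "OBX|1|NM|HR||72|bpm")
```

With a message like `msg`, `route` returns `"validated"`; a device message with no patient context falls back to the raw stream for later cleansing.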
Our Hadoop Implementation cont. – Development & Deployment Cycle (diagram)
• Development cycle: Developer Workstation / Sandbox ↔ SVN (source control server), with daily check-in / check-out
• Bamboo (build server): periodic integration & validation (build, unit test & notify on error)
• On dev lead approval: build, unit test, deploy to the HDP Dev Cluster & tag
• Deployment cycle: smoke test on the HDP QA Cluster before updating task status
• On successful UAT & release approval: deploy to the HDP Prod Cluster per the last successful build tag
Lessons Learned – what went well (process & people)
1. It’s complex
2. It’s a journey
3. Leverage existing strengths
4. Collaborate openly
5. Learn from experts
6. One cluster – multiple use cases
7. Follow best practices
Next Steps
1. Continue to expand / evolve our platform
2. Ingest more data and data types
3. Identify high-value use cases
4. Develop / train people with new skills:
   • Accessing data
   • Computing data
   • Visualizing data
   • Insights & cognitive computing