Genomics Medicine Solution on Power 8 Kathy Tzeng, PhD Worldwide Technical Lead Healthcare & Life Sciences IBM Systems Group November 13, 2015

Upload: margery-sanders

Post on 18-Jan-2016


Page 1:

Genomics Medicine Solution on Power 8

Kathy Tzeng, PhD
Worldwide Technical Lead, Healthcare & Life Sciences
IBM Systems Group
November 13, 2015

Page 2:

GENOMIC MEDICINE: from Sequencing to Personalized Healthcare

NHGRI, a branch of NIH, has defined 5 steps for genomic medicine. (source: E. Green et al., Nature 470, 204–213)

Next Generation Sequencing (or other ingestion)

The focus is on very large data generation, mainly from $1000 whole genome sequencing, and on data processing and reduction. It includes human, plant, animal, and microbiome genomics.

Translational Research/Early Discovery

The focus is on data integration, including genomic data, and on the analytics required to identify biomarkers, understand disease mechanisms, and identify new medical treatments.

Personalized Healthcare/Clinical Genomics

The focus is on delivering genomic medicine to patients to improve outcomes by associating patients with known genomic-specific treatments.

Page 3:

A Computationally Challenging Problem

Breakthroughs in Genomic Medicine require quantifying associations between known population traits, environmental factors, and biological responses

Computational Challenges

• Feature combinatorics
• Large file sizes
• Large population sizes
• Unstructured data types

Predictive Response Function

F(t) -> W(t) -> R(t)

• F(t), Known Traits or Environmental Features: quantities describing population traits or environmental factors at time t
• W(t), Predictive Response Function: model of associations between features and responses as a function of time t
• R(t), Measured Biological Response: quantities describing response events for an organism at time t
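The relationship between F(t), W(t), and R(t) can be made concrete with a toy example. The sketch below assumes, purely for illustration, that W(t) is a matrix of weights and that each response is a weighted sum of the features at a single time t. The slide does not specify the functional form, so the linear model, the function name `predict_response`, and all numeric values are hypothetical.

```python
from typing import List


def predict_response(weights: List[List[float]], features: List[float]) -> List[float]:
    """Apply an assumed linear association model W to a feature vector F, giving R = W F."""
    for row in weights:
        if len(row) != len(features):
            raise ValueError("each weight row must match the feature vector length")
    # One response value per row of W: the dot product of that row with F.
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


# Hypothetical snapshot at one time t: three features, two responses.
F_t = [1.0, 0.0, 2.0]
W_t = [
    [0.5, 1.0, 0.25],
    [0.0, 2.0, 1.0],
]
R_t = predict_response(W_t, F_t)
print(R_t)
```

In practice the challenge the slide points to is scale: the number of features and their combinations grows quickly, so W(t) becomes very large and must be estimated from large populations.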

Page 4:

Key Capabilities

Leading biomedical research organizations are asking for technology capabilities that will give them a low-cost solution to accelerate scientific discovery in Genomic Medicine

Flexible, scalable, and low-cost high-performance compute and storage solutions capable of efficiently processing rapidly growing quantities of genomic and other types of complex life science data

Seamless integration of complex life science data types on a common analytical platform

Rapid extraction and analysis of unstructured language content from very large volumes of clinical and scientific documents

Metadata collection capabilities providing detailed audit trails as source data are transformed into analytical results

Tools for scientific collaboration that enable data and workload sharing to cross organizations and geographic boundaries in a secure environment that ensures data privacy

Page 5:

A Foundation for Computational Science

IBM’s Reference Architecture for Genomics supports ‘big data’ computational research on a foundation of HPC compute, storage, and workload management capabilities

[Architecture diagram, top to bottom]

Researchers

Research Applications
• Genomic Analysis Pipelines
• Translational Research
• Text Analytics / NLP (Apache UIMA, IBM System T)
• Image Analysis

‘Big Data’ Foundation
• ‘Big’ Data Warehouse
• Resource & Workload Management: Workload Orchestration, Monitoring & Metadata Capture; Resource Management enabling Private Clouds
• Data Management: File System & Storage | ILM
• LAN
• Heterogeneous Compute Resources

Diagram callouts
• Performance optimization for open source and commercial analytics applications
• Text Analytics for the conversion of natural language concepts into structured data entities (IBM Research, IBM Watson, IBM Business Partners)
• Intelligent resource allocation, sharing, and monitoring across parallel HPC workloads (IBM Platform Computing, IBM Business Partners)
• Low-cost, easy-access storage & archiving of data and metadata across heterogeneous environments (IBM Spectrum Scale / Elastic Storage Server)
• Heterogeneous, flexible server infrastructure (IBM OpenPOWER servers)

Page 6:

IBM Systems Facilitate Scientific Collaboration

Data management and analytics tools can be accessed and shared across heterogeneous systems in on-premise and cloud environments

[Deployment diagram]

• Users: On-Premise Users, Private Cloud Users, Public Cloud Users
• Environments: Local Data Center, On-Premise Cluster, Virtual Private Clouds, External Collaborators (Heterogeneous Environments)
• Connectivity: HPC Network, 10GbE or InfiniBand, 1/10 GbE, WAN, Encrypted VPN, Workload Burst
• Software layers: Applications; Workload Orchestration with Metadata Capture; ‘Big’ Data Warehouse; Data Management: File System / Storage ILM

‘Big Data’ foundation enables data access, data management, and HPC workload orchestration across heterogeneous on-premise, private cloud, public cloud, and hybrid cloud environments

Page 7:

IBM Reference Architecture for Genomics

[Layered architecture diagram]

• Domains: Genomics, Translational, Personalized Healthcare
• Access: Application & Workflow, File & Database, Visualization, System & Log
• Platforms: AppCenter (PAC, Galaxy, DataBiology, Lab7); Orchestrator (ASC/EGO, LSF, Symphony, PPM); Datahub (Spectrum Scale, Zato, Nirvana)
• Compute: HPC Cluster, Big Data, Spark Cluster, Openstack, Docker
• Storage: SSD/Flash, FC/IB Attached, Low-cost Storage, HA/DR Storage, Cloud Storage

Page 8:

Genomic Analysis System Architecture

Page 9:

IBM Genomics Reference Architecture

The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers

Page 10:

IBM Genomics Reference Architecture

The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers

Page 11:

BioBuilds – Open Source Bioinformatics

• Turn-key: Pre-built binaries and complete build scripts enable easy deployment

• Optimized: POWER8 binaries provide the best performance for your hardware

• Ready for the Clinic: A single source for tools streamlining verification and audit

• Long Term Support: Community sponsorship and support contracts ensure ongoing support for tools

http://biobuilds.org/

Open Source bioinformatics tools for research, commercial, and regulated environments.

Page 12:

A Portfolio of Open Source Applications on Power

BioBuilds 2015.04
ALLPATHS-LG, Bedtools, Bfast, Bioconductor, BLAST (NCBI), Bowtie, Bowtie2, BWA, Cufflinks, FASTA, FastQC, HMMER, HTSeq, IGV, ISAAC, iRODS, Mothur, Numpy, PICARD, PLINK, Python, SAMTools, SHRiMP, SOAP3-DP, SOAPaligner, SOAPDenovo, SQLite, Tabix, TMAP, TopHat, Trinity, Velvet/Oases, R, RNAStar/STAR

BioBuilds Roadmap

2015.11 (SC15)
bamtools, BarraCUDA, BioPython, ClustalW, Conda, EMBOSS, GraphViz, Htslib, Pysam, RSEM, Sratoolkit, Variant_tools

2016
BioPerl, CEGMA, Celera Assembler, DIALIGN-TX, FASTX-Toolkit, GenomicConsensus, Infernal, T-Coffee, MUSCLE, Sailfish, SIFT, STAR-fusion, and more

Page 13:

Optimization of GATK from Broad Institute

IBM works with genomics leaders to improve performance of analytical workflows like GATK on IBM Power 8 Systems

https://www.broadinstitute.org/gatk/blog?id=4833

Page 14:

Broad Institute Best Practice Performance

Input Dataset: G15512.HCC1954.1, coverage: 65x

Both IBM and Intel solution: # of Machines = 1
# of cores/Machine: IBM: 16, Intel: 24

IBM Solution: IBM 3.32 GHz 8335-GTA with SMT=8, ESS GL4

IBM Solution performance highlights:
• 65X Whole Human Genome analysis done in about 21 hours
• 150X Whole Exome analysis done in 2.55 hours

Runtimes in hours

Steps                  Intel Runtime*   IBM Runtime
BWA                    7                4.26
Samtools               5                2.08
MarkDuplicates         11               6.86
RealignTargets         1                0.29
IndelRealigner         6.5              0.77
BaseRecalibrator       1.3              1.49
PrintReads+Index       12.3             2.55
Pre-processing Total   44               18.3
HaplotypeCaller                         2.64
Total                                   20.94

Note*: http://library.wolfram.com/infocenter/Conferences/9045/Intel_LifeSciences_Personalized_Medicine_Wolfram%202014_Paolo%20Narvaez.pdf
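The per-step figures on this slide can be turned into speed-up ratios with a few lines of arithmetic. The sketch below copies the runtimes as listed (the slide's "about 21 hours" total suggests the unit is hours) and divides the Intel figure by the IBM figure for each step. The dictionary name `RUNTIMES` and the helper `speedup` are illustrative names, not part of any tool mentioned in the deck.

```python
# Runtimes copied from the slide's table: step -> (Intel, IBM).
RUNTIMES = {
    "BWA": (7.0, 4.26),
    "Samtools": (5.0, 2.08),
    "MarkDuplicates": (11.0, 6.86),
    "RealignTargets": (1.0, 0.29),
    "IndelRealigner": (6.5, 0.77),
    "BaseRecalibrator": (1.3, 1.49),
    "PrintReads+Index": (12.3, 2.55),
}


def speedup(intel_runtime: float, ibm_runtime: float) -> float:
    """Ratio of Intel runtime to IBM runtime; values above 1 favour the IBM run."""
    return intel_runtime / ibm_runtime


intel_total = sum(intel for intel, _ in RUNTIMES.values())
ibm_total = sum(ibm for _, ibm in RUNTIMES.values())

for step, (intel, ibm) in RUNTIMES.items():
    print(f"{step:18s} {speedup(intel, ibm):5.2f}x")
print(f"{'Pre-processing':18s} {speedup(intel_total, ibm_total):5.2f}x")
```

Summing the steps reproduces the slide's pre-processing totals to within rounding (about 44 versus 18.3). The ratio is not uniform across steps: BaseRecalibrator comes out below 1x, while IndelRealigner shows the largest gain.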

Page 15:

Acceleration of genomics pipeline on POWER8

Significant Speed-up with Multi-threaded S/W and FPGA+ for NA12878 WGS chr20

The Same Results++

[Bar chart comparing "Ours" with Samtools/Picard for the Sort and Mark Duplicates steps]

Pre-processing Part of a Typical Workflow+++: Map to Reference -> Sort & Mark Duplicates -> Indel Realignment -> Base Recalibration

+ The execution times were measured using the following hardware and software configuration (the speed-up with FPGA was estimated based on the throughput of the compression accelerator):
• Hardware: POWER8 (10-core chip x 2, SMT8 @ 3.5 GHz, 510 GB memory) with GPFS, Ubuntu 14.04
• Samtools (http://www.htslib.org/): built using the source obtained from git on 2015-08-25. Our tool was prototyped by modifying the same version of Samtools.
• Picard (http://broadinstitute.github.io/picard/): version 1.138
• Java: java version "1.8.0", Java(TM) SE Runtime Environment SR1 FP10
• Input file: a SAM file of an unsorted and unmarked version of an NA12878 WGS chr20 file

++ The order of sorted reads and the reads marked as duplicates were compared between samtools/picard and ours.

+++ https://www.broadinstitute.org/gatk/guide/best-practices

Page 16:

IO Cache Library to Optimize Performance of Genomics Application

IBM uses a File Cache Library to improve I/O Performance and reduce workflow runtimes

Application: Illumina’s Casava V. 1.8 (BCL to FASTQ)
Data Set: 8 lanes of HiSeq data

• Without cache library: Elapsed Time = 1730 min
• With cache library: Elapsed Time = 107 min
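The two elapsed times quoted for Casava (1730 min without the cache library, 107 min with it) imply a speed-up factor and a fractional time saving. The sketch below works both out; the function name `speedup_and_saving` is a hypothetical helper introduced only for this illustration.

```python
from typing import Tuple


def speedup_and_saving(before_min: float, after_min: float) -> Tuple[float, float]:
    """Return (speed-up factor, fraction of elapsed time saved)."""
    if before_min <= 0 or after_min <= 0:
        raise ValueError("elapsed times must be positive")
    return before_min / after_min, (before_min - after_min) / before_min


# Elapsed times from the slide: without and with the cache library.
factor, saved = speedup_and_saving(1730, 107)
print(f"speed-up: {factor:.1f}x, time saved: {saved:.1%}")
```

With the slide's figures this works out to roughly a 16x speed-up, or about 94% less elapsed time.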

Page 17:

Genomic Workload Optimization

IBM scales genomic analysis from the desktop to the enterprise using IBM ESS/Spectrum Scale

Speed of the file system matters

[Bar chart: Analysis of 150x human genome WEX (NA12878). Y-axis: elapsed time (minutes), 0 to 35. X-axis: BWA, Samtools. Series: GPFS, Local, NFS. Data labels as extracted: 20, 14, 23, 16, 30, 20]

• Spectrum Scale scalable I/O performance significantly benefits various NGS workloads.

• Spectrum Scale also provides seamless capacity expansion, improved enterprise wide efficiency, commercial-grade reliability, business continuity and the flexibility of supporting a wide variety of platforms.

Page 18:

Data Compression

Compression algorithm: gzip on Power 8 with FPGA board (available now)
• Compression ratio (lossless): on average 1:3 for FASTQ files
• Speed/throughput: 2.5 GB/s on average (200 GB of FASTQ can be compressed in 80 seconds)

Compression algorithm: CRAM
• Compression ratio (lossless): 1:2 to 1:4 with respect to BAM files, depending on the sequencing depth and other factors (the ratio from FASTQ to compressed BAM is 16X)
• Speed/throughput: achieved more than a 10-fold speed-up using 12 cores (approximately 0.5 GB/min). FPGA acceleration is ongoing.

• Samtools sort: 1.19-1.25x faster
• Picard MarkDuplicate: 1.42-1.47x faster
• Picard AddOrReplaceReadGroups: 1.89-2.3x faster
• IBM is collaborating with Sanger Institute and EBI on improving compression for genomics data: Samtools, Picard, CRAM

Source: Baker M., Nature Methods 7, 495 - 499 (2010)
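The gzip throughput claim on this slide is easy to check by hand. The sketch below divides file size by throughput to get compression time, and applies the quoted average ratio to estimate output size. It assumes "1:3" means the compressed file is about one third of the original; the function names are hypothetical.

```python
def compress_seconds(size_gb: float, throughput_gb_per_s: float) -> float:
    """Time to compress a file of size_gb at a sustained throughput."""
    return size_gb / throughput_gb_per_s


def compressed_size_gb(size_gb: float, ratio: float) -> float:
    """Output size, assuming a ratio of 1:ratio (compressed : original)."""
    return size_gb / ratio


# Figures from the slide: 200 GB of FASTQ, 2.5 GB/s, average ratio 1:3.
seconds = compress_seconds(200, 2.5)
output_gb = compressed_size_gb(200, 3)
print(f"{seconds:.0f} s to compress, about {output_gb:.1f} GB on disk")
```

The result matches the 80 seconds quoted for a 200 GB FASTQ file.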

Page 19:

Noblis BioVelocity is Developed and Optimized on Power 8

Page 20:

L3 Bioinformatics BALSA on Power 8 with GPU

Power8 3.32 GHz, 2x K40 GPU and Spectrum Scale

Page 21:

Edico Genome

Analyze a Whole Human Genome at 30x coverage in under 30 minutes

BCL -> Map/Align -> Sort -> Dedup -> Variant Calling -> VCF/GVCF

Page 22:

Enterprise Data Management

IBM works with Lab7 to deliver data provenance with performance, reliability and security

IBM Power System Solution with Spectrum Scale and Platform LSF delivers:
• Superior compute infrastructure: superior performance, scalability & maximum throughput
• Outstanding enterprise-grade reliability and security:
  • Reliability, Availability and Serviceability (RAS) features help avoid unplanned downtime
  • IBM Power Security and Compliance (PowerSC™) enables security compliance automation and includes reporting for compliance measurement and audit (HIPAA)
• Total cost of ownership: very affordable compared to like-sized x86 systems

Lab7 ESP
• Comprehensive software platform: combines LIMS and informatics functionalities
• Data provenance: maintains continuous data provenance by:
  • Tracking the history of samples, analyses, and results
  • Providing detailed audit trails
• Sequencing platform flexibility: manages data generated from any sequencing platform

Page 23:

Data Provenance with Performance, Reliability and Security

Databiology for Enterprise: SaaS + customer specific instances
• Central hub to manage all ‘omics data and to orchestrate all activities
• Functionally rich and orientated on key steps in R&D life cycle
• Insight to Instrument with best in class applications
• Easy integration with existing environments
• Automatic data provenance and reporting
• Cost neutral deployment
• Gradual roll-out / Low risk

Databiology for Enterprise Functional Architecture

[Functional architecture diagram: 3 C’s (Configure, Command, Collaborate)]

• Interface: Portal, API, Custom Web Apps via API, DBE Multiprot, Email + WF Integration, Identity Management
• Information Management:
  • Scientific: Ontologies, Annotation, Samples, Lifecycle Management, Meta Information
  • Social: Comments + Attachments, Roles + Access, Shopping Basket
  • Project Management: Financial + Resource Mgmt, Task Management
• Orchestration:
  • Applications: Import, Analysis, Visualization
  • Infrastructure: Network, Storage, Compute
  • Configuration: Instruments; Compute and Storage (Softlayer, LSF, GPFS)
  • Transport: DBE Download Manager; S3, SCP, RSync, SFTP, FTP, HTTP
  • Logic: Version Control + Reproducible; Data Provenance; Everything as an app (Scripts, Binaries, Pipelines, Workflow Management, Virtual Machines)

Page 24:

IBM Genomics Reference Architecture

The IBM Reference Architecture is an ecosystem of data management and analytics tools developed by IBM and industry-leading commercial and open source software providers

Page 25:

tranSMART - Optimized on Power8 and ESS/Spectrum Scale

• tranSMART associates genotypic & phenotypic data for complex analytics
• Watson Explorer extracts insight from scientific literature and data records and provides enrichment to tranSMART’s analysis

https://www.dropbox.com/s/9qw2kr339cl0mie/wats_tran.mp4?dl=0

Page 26:

tranSMART performance on Power 8 and ESS

IBM HPC, hardware, software, and libraries enable faster time to insight in translational medicine (from hours to minutes): tranSMART

Loading ‘TCGA_OV’ into PostgreSQL (5,789,362 records), elapsed time in seconds

Data Type         File Size      Time
Clinical          Up to 0.5 MB   13s
Gene Expression   Up to 32 MB    227s (3min 47s)

tranSMART Analytics using ‘TCGA_OV’: elapsed time by analysis type, in seconds, for a single node

Type   Acquire Data   Run Analysis   Total
MAS    157s           55s            212s (3min 32s)
HCA    262s           83s            345s (5min 45s)
PCA    253s           558s           811s (13min 31s)

Complete loading and analysis in minutes instead of hours

Results are based on IBM internal testing and optimization of tranSMART using:
• IBM Power S822LC 16-core, 3.32 GHz 8335-GTA with SMT=2, Turbo Mode, 512 GB memory, IBM XL Compiler Suite, IBM ESSL
• IBM Elastic Storage Server Model GL4

Page 27:

Accelerate tranSMART ETL by Power8/Spectrum Scale

Dataset      No. Records
TCGA_OV      5,789,632
Simulation   40,774,968
GSE32583     942,724
GSE13168     1,203,282
GSE1456      3,600,555
GSE15258     4,702,050

Page 28:

tranSMART Power8 Deployment Architecture

[Deployment diagram on Power8]

• Components: Users; Application Browser; Web Server (Apache2); Application Server (Tomcat 7) running tranSMART; i2b2 Application Server; PostgreSQL tranSMART DB; Spectrum Scale; Solr Full Text index; R Analytics Tools; Gene Patterns; PLINK; Watson Analytics; Watson Analytics Server
• Connection labels: HTTP, JDBC, Quartz Job Call, command

Page 29:

Zato Data Federation Solution for Healthcare and Genomics Data

Spanning Data Centers in Parallel with a Single Pane of Glass for Clinical and Research Applications on Power 8 and ESS/Spectrum Scale

[Federation diagram: data sources linked over Internet, VPN, and LAN connections]

• NIH Data, CDC Data, NLM Data
• Lab Results
• Imaging Data
• Radiology Reports
• Microbiology Reports
• Nursing Home Records
• Claims Data
• Electronic Health Record Data
• Genomic Data
• Accepted Medical Knowledge

Page 30:

Page 31:

Genomic Solution Enablement Team

Mission:
• Porting and Optimization of Genomics/Translational applications on IBM solution
• Developing Solutions with Partners
• Making IBM SW/HW available to Software developers

Members:
• Independent Software Vendor (ISV) team
• Toronto Compiler Lab
• Boeblingen Development Lab
• Tokyo Research Lab
• Austin Research Lab

Page 32:

Page 33:

Workload Challenge #1: ‘Big Data’ Analytics

Variant information requires a computationally intensive analysis of raw sequence data across thousands of genomic samples

[Pipeline diagram]

• High-Throughput Sequencing: Raw Reads for a Whole Human Genome (3 billion DNA base pairs @ 30x coverage); file format FastQ
• Assembly & Alignment: file format BAM
  • De Novo Assembly: SOAPdenovo, Velvet, …
  • Reference-Based Mapping: BWA, Bowtie, SOAP, …
  • Reference Genomes: TCGA, GEO, dbSNP, …
• Variant Calling: file format VCF, 100 to 200 MB; Picard, GATK, SAMtools, SOAPsnp, …
• Variant Annotations: Annotation Tools: ANNOVAR, Gene Ontology, …
  • Sample: intergenic … SNP in IL23R associated with Crohn's disease …

Other file sizes shown on the diagram: ~ 150 GB (compressed), ~ 150 GB, 500 MB

Each human genome can have a few million variants

Processing time per genome: 1 to 100 hours* on 1 compute node

* Duration depends on selection of analytical tools and hardware
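Two numbers on this slide set the scale of the workload: 3 billion base pairs at 30x coverage, and 1 to 100 hours of processing per genome on one node. The sketch below multiplies them out. The cohort size of 1,000 samples is a hypothetical stand-in for the slide's "thousands of genomic samples", and the function names are illustrative.

```python
GENOME_BASE_PAIRS = 3_000_000_000  # from the slide
COVERAGE = 30                      # from the slide


def sequenced_bases(genome_bp: int, coverage: int) -> int:
    """Total bases read when every position is covered `coverage` times on average."""
    return genome_bp * coverage


def node_hours(samples: int, hours_per_genome: float) -> float:
    """Single-node compute time needed to process a cohort."""
    return samples * hours_per_genome


bases = sequenced_bases(GENOME_BASE_PAIRS, COVERAGE)
cohort = 1_000  # hypothetical cohort size, not from the slide
low, high = node_hours(cohort, 1), node_hours(cohort, 100)
print(f"{bases:,} bases per genome")
print(f"{low:,.0f} to {high:,.0f} node-hours for {cohort:,} genomes")
```

The hundred-fold spread in node-hours comes from the slide's own range, which depends on the choice of tools and hardware.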

Page 34:

Workload Challenge #2: Unstructured Information

Scientific data must be extracted from very large volumes of natural language content, biomedical images, and other unstructured data, and transformed into a structured format for analysis

• Phenotypic Data (e.g. Clinical Histories, Medical Images): "…was in good health until 2-3 months ago when she gradually developed fatigue and intermittent epigastric pain, …"
• Omics Data (Variant Databases):
  • exonic NOD2 16 … a frameshift … SNP…
  • exonic GJB2 13 … associated with hearing loss …
  • exonic CRYL1,GJB6 13 … a 342kb deletion
• Scientific Literature: Peer-Reviewed Articles, Clinical Guidelines, Textbooks, Patents

Information must be transformed into normalized structured data for statistical analysis and relationship visualization

Page 35:

Workload Challenge #3: ‘Big Data’ Integration

Discovery of genotype-phenotype associations requires an analysis of complex data types that must be integrated within a common analytical environment

Data sources:

1 Omics Data: Variant Calls & Annotations (VCF), for example:
  ##FORMAT=<ID=DP, …
  ##FORMAT=<ID=HQ, …
  #CHROM POS ID REF ALT …
  20 14370 rs6054257 G A …

2 Phenotypic Data: Clinical Features, Environmental Factors, Biological Responses

3 Knowledge Base: Electronic Text & Web Sites

‘Big’ Data Warehouse Environment (RDBMS and/or NoSQL)

Patient-Centric Logical Data Model
• Entities: Patient Population, Genotypic Data, Phenotypic Data, Knowledge Base
• Keys: Patient ID, Variant ID, Phenotype ID
• Details: Variant List, Detail on a Single Variant, Observation Detail, Observed Traits & Responses
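To show how a VCF excerpt like the one on this slide could feed a patient-centric model, the sketch below parses the five columns visible on the slide (CHROM, POS, ID, REF, ALT) and attaches a patient identifier to each variant. It is a simplification under stated assumptions: it splits on any whitespace, ignores the remaining VCF columns, and the patient ID `P001` and the function names are hypothetical.

```python
from typing import Dict, List

# Only the columns that are visible on the slide.
VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT"]


def parse_vcf_lines(lines: List[str]) -> List[Dict[str, object]]:
    """Parse VCF data lines, skipping '##' meta lines and the '#CHROM' header."""
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) < len(VCF_COLUMNS):
            raise ValueError(f"too few columns in VCF line: {line!r}")
        record: Dict[str, object] = dict(zip(VCF_COLUMNS, fields))
        record["POS"] = int(fields[1])  # position is a 1-based integer
        records.append(record)
    return records


def link_to_patient(patient_id: str, records: List[Dict[str, object]]) -> List[Dict[str, object]]:
    """Build (Patient ID, Variant ID) rows for a patient-centric variant list."""
    return [{"patient_id": patient_id, "variant_id": r["ID"], **r} for r in records]


SAMPLE = [
    "##FORMAT=<ID=DP, ...",   # meta line, truncated as on the slide
    "#CHROM POS ID REF ALT",  # header line
    "20 14370 rs6054257 G A",
]
variants = parse_vcf_lines(SAMPLE)
rows = link_to_patient("P001", variants)
print(rows)
```

Keying each row on both the patient and the variant is what lets genotypic records be joined with phenotypic observations that share the same Patient ID.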

Page 36:

A framework for NGS and HPC Systems Architecture

[Architecture diagram]

• Users, Devices
• HPC Management Suite
• Platform Software Stack
• Scale-out cluster, Scale-up SMP
• Spectrum Scale, ESS
• Spectrum Archive, TSM/LTFS

Page 37:

Solution: GenWEQ hardware compression

Three steps are tested:
• Samtools Sort
• Picard MarkDuplicate
• Picard AddOrReplaceReadGroups (a commonly used optional component)

• Easy to plug in and set up. Does not require any code changes or recompilation. Benefits multiple applications that use compression.
• Reduces compression time by up to 98%, with a compression ratio up to 72% for BAM files. Provides close to free compression.
• Significantly improved performance of important steps in the genomic pipeline with small sacrifices in compressed file size (~1.2x bigger):
  • Samtools sort: 1.19-1.25x faster
  • Picard MarkDuplicate: 1.42-1.47x faster
  • Picard AddOrReplaceReadGroups: 1.89-2.30x faster