VariantSpark: applying Spark-based machine learning methods to genomic information

Denis C. Bauer | Bioinformatics | @allPowerde | 20 Nov 2015 | CSIRO HEALTH & BIOSECURITY | VariantSpark: applying Spark-based machine learning methods to genomic information | By Tim Cooper


TRANSCRIPT

Page 1: VariantSpark: applying Spark-based machine learning methods to genomic information

Denis C. Bauer | Bioinformatics | @allPowerde | 20 Nov 2015

CSIRO HEALTH & BIOSECURITY

VariantSpark: applying Spark-based machine learning methods to genomic information

By Tim Cooper

Page 2: VariantSpark: applying Spark-based machine learning methods to genomic information

Talk Overview

VariantSpark | Denis C. Bauer @allPowerde | Page 2

• Background: Why is genomics so important for medicine

• VariantSpark: Overview

• Whole Genome Analysis: Clustering samples by ethnicity

Page 3: VariantSpark: applying Spark-based machine learning methods to genomic information

Genome sequencing improves diagnostics

Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed, and is on average 96% cheaper


Oncology: Identifying tumours by their genome-wide mutation profiles

[Figure label: Tandem duplications]

Rare genetic disorders: Identifying causative mutations by interrogating all abnormal variants

http://matt.might.net

Bauer et al. Trends Mol Med. 2014 PMID: 24801560

Page 4: VariantSpark: applying Spark-based machine learning methods to genomic information

Generating data from 1 Million Americans


Australia: ~$100 Million dedicated to clinical genomics

• $25 Million Australian Genomics Health Alliances (NHMRC Grant AI Denis)

• VIC and QLD Alliances ($25 Million each)

• NSW and ACT (undisclosed $$ to Garvan and JCSMR)

Page 5: VariantSpark: applying Spark-based machine learning methods to genomic information

Genomics projects are getting bigger

• Human genome: ~1 sample

• The HapMap Project: 270 samples (2002)

• 1000 Genomes Project: 1097 samples (2012)

• The Cancer Genome Atlas: 11,000 samples (2015)

• 100,000 Genomes Project: 70,000 individuals by 2017

• Project MinE: 15,000 people with ALS

• ASPREE: 4000 healthy 70+ year olds

Single samples are around 200GB in size

Page 6: VariantSpark: applying Spark-based machine learning methods to genomic information

Last Year


Page 7: VariantSpark: applying Spark-based machine learning methods to genomic information

Data analysis categories for genomics

• Basic Production Informatics: Produce raw sequence reads

• Advanced Production Informatics: Map to genome and generate raw genomic features (e.g. SNPs)

• Bioinformatics Research: Analyze the data; uncover the biological meaning

Page 8: VariantSpark: applying Spark-based machine learning methods to genomic information

VariantSpark

VariantSpark is the interface enabling Spark’s MLlib* machine learning algorithms to be applied to genomics data, e.g. grouping samples by genomic profile

[Figure: Input → Genomics Application → Result; labels: VCF, MLlib*, Large scale compute]

* VariantSpark also uses Spark.ML

Page 9: VariantSpark: applying Spark-based machine learning methods to genomic information

VariantSpark


Accepted BMC Genomics (IF=4)

Page 10: VariantSpark: applying Spark-based machine learning methods to genomic information

Cluster individuals into ethnic groups based on their genomic profiles

www.cloudaccess.eu

1000 individuals x 40 million variants matrix *

k-means

Predict super population

14 ethnic groups and 4 super populations

* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
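The clustering step on this slide can be illustrated at toy scale. The sketch below is a plain-Python stand-in for k-means (Lloyd's algorithm), not VariantSpark's own code or the Spark MLlib call it relies on. The function names, the farthest-point initialisation, the 0/1/2 genotype encoding and the six-sample matrix are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iterations=20):
    """Toy Lloyd's algorithm; returns one cluster index per point.

    Initialisation is deterministic (an assumption for this sketch):
    start from the first point, then repeatedly add the point that is
    farthest from all centroids chosen so far.
    """
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))

    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each sample goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment


# Hypothetical data: 6 samples x 5 variants, entries are the number of
# non-reference alleles (0, 1 or 2) carried by each sample.
genotypes = [
    [0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [2, 2, 1, 2, 2],
    [2, 1, 2, 2, 2],
    [2, 2, 2, 1, 2],
]
labels = kmeans(genotypes, k=2)
print(labels)
```

At real scale each row would have tens of millions of entries, which is why the distance computations are distributed rather than run in a single process as here.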

Page 11: VariantSpark: applying Spark-based machine learning methods to genomic information

Clustering result

• Adjusted Rand index (ARI) = 0.84, where 1 is a perfect match, values near 0 indicate agreement no better than independent (random) labeling, and negative values indicate less agreement than chance

• The majority of American (AMR) individuals are placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds.

• ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25

Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)
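For readers unfamiliar with the adjusted Rand index, the sketch below computes it from pair counts in the contingency table of the two labelings. It is a minimal standard-library implementation written for illustration, not the code used to produce the scores on this slide, and the label lists are hypothetical toy data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index of two labelings of the same samples."""
    n = len(labels_true)
    # Pairs of samples that share a cluster in both labelings.
    sum_cells = sum(comb(v, 2) for v in Counter(zip(labels_true, labels_pred)).values())
    # Pairs that share a cluster in each labeling separately.
    sum_true = sum(comb(v, 2) for v in Counter(labels_true).values())
    sum_pred = sum(comb(v, 2) for v in Counter(labels_pred).values())

    expected = sum_true * sum_pred / comb(n, 2)  # agreement expected by chance
    maximum = (sum_true + sum_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (sum_cells - expected) / (maximum - expected)


# Toy example 1: same grouping under different cluster names.
perfect = adjusted_rand_index(
    ["EUR", "EUR", "AFR", "AFR", "ASN", "ASN"], [1, 1, 0, 0, 2, 2]
)

# Toy example 2: two true groups merged into one predicted cluster.
merged = adjusted_rand_index(
    ["EUR", "EUR", "AMR", "AMR", "AFR", "AFR"], [0, 0, 0, 0, 1, 1]
)

# Toy example 3: a labeling that agrees less than chance would.
worse_than_chance = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])

print(perfect, merged, worse_than_chance)
```

The second toy case shows how merging two true groups into one predicted cluster lowers the score even when every other sample is placed correctly.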


Page 12: VariantSpark: applying Spark-based machine learning methods to genomic information

Comparison to other implementations

• Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors

• Clustering: k-means

• ADAM (BigData Genomics): Spark implementation with dense matrix

• Hadoop: MapReduce without in-memory caching

Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.

[Figure: runtime comparison; bar values 103, 75, 29, 28, 18 and 4 min]
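The preprocessing step, turning location-centric VCF rows into sample-centric numerical vectors, can be sketched as follows. This is an illustrative standard-library version, not VariantSpark's implementation. Encoding each genotype as the count of non-reference alleles, and storing only non-zero entries in a dict as a stand-in for a sparse vector, are assumptions, as are the function names and the three-sample toy VCF.

```python
def genotype_to_count(genotype_field):
    """Encode a genotype such as '0|1' or '1/1' as its number of non-reference alleles."""
    gt = genotype_field.split(":")[0]  # GT is the first FORMAT sub-field
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def vcf_to_sample_vectors(vcf_lines):
    """Transpose VCF rows (one per variant) into one sparse vector per sample.

    Returns ({sample: {variant_index: count}}, number_of_variants).
    Only non-zero counts are stored.
    """
    samples = []
    vectors = {}
    variant_index = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            vectors = {sample: {} for sample in samples}
            continue
        for sample, field in zip(samples, fields[9:]):
            count = genotype_to_count(field)
            if count:
                vectors[sample][variant_index] = count
        variant_index += 1
    return vectors, variant_index


# Hypothetical three-variant, three-sample VCF.
toy_vcf = [
    "##fileformat=VCFv4.1",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", "S1", "S2", "S3"]),
    "\t".join(["22", "100", ".", "A", "G", ".", "PASS", ".", "GT", "0|0", "0|1", "1|1"]),
    "\t".join(["22", "200", ".", "C", "T", ".", "PASS", ".", "GT", "0|1", "0|0", "0|0"]),
    "\t".join(["22", "300", ".", "G", "A", ".", "PASS", ".", "GT", "0|0", "0|0", "1|0"]),
]
vectors, n_variants = vcf_to_sample_vectors(toy_vcf)
print(vectors, n_variants)
```

Because most genotypes match the reference, a sparse layout stores far fewer entries than a dense samples-by-variants matrix.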


Page 13: VariantSpark: applying Spark-based machine learning methods to genomic information

Scaling VariantSpark to the whole genome

• Pre-processing: scales seamlessly as processes are independent

• Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids

• As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, which increased runtime.

CSIRO Spark Cluster: Whole genome; Hadoop 2.5.0, managed by Cloudera’s CDH 5. We use Spark 1.3.1. This 13-node cluster has a total of 416 cores and 1.22TB memory.


Page 14: VariantSpark: applying Spark-based machine learning methods to genomic information

Three things to remember

• VariantSpark is an interface bringing bigLearning tasks to genomics applications

• VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/Python/ADMIXTURE due to memory limits.

• VariantSpark outperforms ADAM (Big Data Genomics) and an equivalent Hadoop implementation by almost an order of magnitude.

https://github.com/BauerLab/VariantSpark
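A back-of-the-envelope calculation gives a feel for the memory limits mentioned above. The sketch assumes a fully dense samples-by-variants matrix and two hypothetical element sizes (8-byte floats and 1-byte integers). It does not model how R, Python, ADMIXTURE or VariantSpark actually store the data.

```python
def dense_matrix_bytes(n_samples, n_variants, bytes_per_entry):
    """Size in bytes of a dense n_samples x n_variants matrix."""
    return n_samples * n_variants * bytes_per_entry


n_samples = 3000           # individuals (from the slide)
n_variants = 80_000_000    # variants (from the slide)

# Assumed element sizes: 8-byte floating point and 1-byte integer.
float64_bytes = dense_matrix_bytes(n_samples, n_variants, 8)
int8_bytes = dense_matrix_bytes(n_samples, n_variants, 1)

print(f"dense float64: {float64_bytes / 1e12:.2f} TB")
print(f"dense int8:    {int8_bytes / 1e9:.0f} GB")
```

Under these assumptions the 8-byte layout needs 1.92 TB, more than the 1.22TB total memory of the 13-node cluster described earlier, and even one byte per genotype needs 240 GB, ten times the 24GB quoted on this slide.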


Page 15: VariantSpark: applying Spark-based machine learning methods to genomic information

HEALTH AND BIOSECURITY

Thank you

Health & Biosecurity
Denis C. Bauer
t +61 2 9123 4567
e [email protected]
aehrc.com/biomedical-informatics/transformational-bioinformatics/

More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde

Transformational Bioinformatics Team, CSIRO: Aidan O’Brien, Bill Wilson

Former members: Firoz Anwar, Neil Saunders

Rodney Scott, Newcastle University

Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO’s Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund

Buske et al., Bioinformatics Jan 2014

O’Brien et al., BMC Genomics Dec 2015

Dunne et al., in preparation

FullySIC: Epistatic Gene Network modelling

in preparation

Anwar et al., in preparation

Piotr Szul, Gi Guo, Robert Dunne; Data61 CSIRO, Australia

GOdistinct: GO Enrichment or genesets with distinctive function
