VariantSpark: applying Spark-based machine learning methods to genomic information

Denis C. Bauer | Bioinformatics | @allPowerde | 20 Nov 2015

CSIRO HEALTH & BIOSECURITY


By Tim Cooper

Talk Overview


• Background: Why is genomics so important for medicine

• VariantSpark: Overview

• Whole Genome Analysis: Clustering samples by ethnicity

Genome sequencing improves diagnostics

Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed, and is on average 96% cheaper


Oncology

Tandem duplications


Identifying tumours by their genome-wide mutation profiles

Rare genetic disorders

Identifying causative mutations by interrogating all abnormal variants

http://matt.might.net

Bauer et al. Trends Mol Med. 2014 PMID: 24801560

Generating data from 1 Million Americans


Australia: ~$100 Million dedicated to clinical genomics

• $25 Million Australian Genomics Health Alliances (NHMRC Grant AI Denis)

• VIC and QLD Alliances ($25 Million each)

• NSW and ACT (undisclosed $$ to Garvan and JCSMR)

100,000 Genomes Project: 70,000 individuals by 2017

The Cancer Genome Atlas: 11,000 samples, 2015

Genomics projects are getting bigger


The HapMap Project: 270 samples, 2002

Human genome: ~1 sample

1000 Genomes Project: 1097 samples, 2012

Project MinE: 15,000 people with ALS

ASPREE: 4000 healthy 70+ year olds

Single samples are around 200GB in size

Last Year


Data Analysis categories for genomics

Basic Production Informatics: Produce raw sequence reads

Advanced Production Informatics: Map to genome and generate raw genomic features (e.g. SNPs)

Bioinformatics Research: Analyze the data; uncover the biological meaning


VariantSpark

MLlib*

VCF

VariantSpark is the interface enabling Spark’s MLlib machine learning algorithms to be applied to genomics data

e.g. grouping samples by genomic profile

Input Genomics Application Result

Large scale compute


* VariantSpark also uses Spark.ML

VariantSpark


Accepted BMC Genomics (IF=4)

Cluster individuals into ethnic groups based on their genomic profiles

www.cloudaccess.eu

1000 individuals x 40 million variants matrix *

k-means

Predict super population

14 ethnic groups and 4 super populations


* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
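The clustering step can be illustrated with a minimal, pure-Python sketch of k-means (Lloyd's algorithm). This is not VariantSpark's code, which delegates to Spark's MLlib; the function names, the farthest-point initialisation, the toy data and the 0/1/2 genotype encoding are all assumptions made for illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_centroids(points, k):
    """Deterministic farthest-point initialisation (an illustrative choice)."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points, key=lambda p: min(squared_distance(p, c) for c in centroids)
        )
        centroids.append(list(farthest))
    return centroids


def kmeans(points, k, max_iterations=20):
    """Lloyd's algorithm: returns (cluster label per point, centroids)."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each sample joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no sample changed cluster
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: one row per individual, one column per variant,
# values are counts of non-reference alleles (0, 1 or 2).
genotypes = [
    [0, 0, 2, 2, 0, 1],
    [0, 1, 2, 2, 0, 0],
    [0, 0, 2, 1, 0, 0],
    [2, 2, 0, 0, 1, 2],
    [2, 1, 0, 0, 2, 2],
    [1, 2, 0, 0, 2, 2],
]
labels, centroids = kmeans(genotypes, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Each centroid is a dense vector with one entry per variant, so its size grows with the number of variants rather than the number of individuals.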

Clustering result

• Adjusted Rand index (ARI) = 0.84, where 1 is a perfect match and values near 0 (or negative) indicate agreement no better than chance

• The majority of American (AMR) individuals are placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds.

• ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25
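For readers unfamiliar with the metric, here is a minimal sketch of how an adjusted Rand index can be computed from two labelings, using the standard pair-counting formula. The function name and the toy labels are illustrative assumptions and are not taken from the 1000 Genomes analysis.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same samples."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_cells - expected) / (max_index - expected)


# Hypothetical toy example: the clustering merges AMR with EUR.
truth = ["EUR"] * 3 + ["AMR"] * 3 + ["AFR"] * 3
merged = [0, 0, 0, 0, 0, 0, 1, 1, 1]
perfect = [2, 2, 2, 0, 0, 0, 1, 1, 1]

print(adjusted_rand_index(truth, perfect))  # 1.0
print(adjusted_rand_index(truth, merged))   # 0.5
```

Cluster identifiers do not need to match the true label names, since only the co-membership of sample pairs is compared.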

Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)


Comparison to other implementations

• Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors

• Clustering: k-means

• ADAM (BigData Genomics): Spark implementation with dense matrix

• Hadoop: MapReduce without in-memory caching
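The preprocessing step listed above can be sketched as follows: a minimal, pure-Python illustration of turning location-centric VCF rows into one numerical vector per sample. The function names, the encoding (count of non-reference alleles) and the sparse dictionary representation are assumptions for illustration, not VariantSpark's actual implementation.

```python
def genotype_to_count(gt_field):
    """Count non-reference alleles in a genotype such as '0|1' or '1/1'."""
    alleles = gt_field.split(":")[0].replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def vcf_to_sample_vectors(vcf_lines):
    """Transpose variant rows into sparse per-sample vectors.

    Returns ({sample: {variant_index: allele_count}}, number_of_variants).
    Only non-zero entries are stored.
    """
    samples = []
    vectors = {}
    variant_index = 0
    for line in vcf_lines:
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            vectors = {sample: {} for sample in samples}
            continue
        for sample, gt_field in zip(samples, fields[9:]):
            count = genotype_to_count(gt_field)
            if count:
                vectors[sample][variant_index] = count
        variant_index += 1
    return vectors, variant_index


# Hypothetical two-variant, two-sample VCF fragment.
example_vcf = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "22\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1",
    "22\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|0",
]
vectors, n_variants = vcf_to_sample_vectors(example_vcf)
print(vectors)  # {'S1': {1: 2}, 'S2': {0: 1}}
```

Storing only non-zero entries is one way to keep per-sample vectors small when most genotypes match the reference, in contrast to a dense matrix.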

Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.

[Chart: runtime in minutes. Values shown: 103, 75, 29, 28, 18, 4]


Scaling VariantSpark to the whole genome

• Pre-processing: scales seamlessly as processes are independent

• Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids

• As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, increasing runtime.
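The linear scaling can be illustrated with back-of-the-envelope arithmetic, assuming each k-means centroid is held as a dense vector of 8-byte doubles with one entry per variant. The function name and k = 4 are hypothetical, and the sketch does not attempt to reproduce the 24GB figure, which also covers other working memory.

```python
BYTES_PER_DOUBLE = 8


def dense_centroid_bytes(n_variants, k):
    """Bytes needed to hold k dense centroids with one double per variant."""
    return n_variants * k * BYTES_PER_DOUBLE


# Variant counts from the slides; k = 4 is an illustrative assumption.
phase1_bytes = dense_centroid_bytes(40_000_000, k=4)
phase3_bytes = dense_centroid_bytes(80_000_000, k=4)

print(phase1_bytes / 1e9)  # 1.28 (GB)
print(phase3_bytes / 1e9)  # 2.56 (GB)
```

Doubling the number of variants doubles this footprint, while the number of individuals does not enter the estimate at all.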

CSIRO Spark cluster (whole genome): Hadoop 2.5.0, managed by Cloudera’s CDH 5, with Spark 1.3.1. This 13-node cluster has a total of 416 cores and 1.22TB memory.


Three things to remember

• VariantSpark is an interface bringing bigLearning tasks to genomics applications

• VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/Python/ADMIXTURE due to memory limits.

• VariantSpark outperforms ADAM (Big Data Genomics) and an equivalent Hadoop implementation by almost an order of magnitude.

https://github.com/BauerLab/VariantSpark



Thank you

Health & Biosecurity

Denis C. Bauer

t +61 2 9123 4567

e [email protected]

aehrc.com/biomedical-informatics/transformational-bioinformatics/

More talks online: http://www.slideshare.net/allPowerde

Twitter: @allPowerde

Transformational Bioinformatics Team, CSIRO: Aidan O’Brien, Bill Wilson

Former members: Firoz Anwar, Neil Saunders

Rodney Scott, Newcastle University

Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO’s Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund

Buske et al., Bioinformatics Jan 2014

O’Brien et al., BMC Genomics Dec 2015

Dunne et al., in preparation

FullySIC: Epistatic Gene Network modelling

in preparation

Anwar et al., in preparation

Piotr Szul, Gi Guo, Robert Dunne (Data61, CSIRO, Australia)

GOdistinct: GO Enrichment or genesets with distinctive function