Denis C. Bauer | Bioinformatics | @allPowerde | 20 Nov 2015
CSIRO HEALTH & BIOSECURITY
VariantSpark: applying Spark-based machine learning methods to genomic information
By Tim Cooper
Talk Overview
VariantSpark | Denis C. Bauer @allPowerde
• Background: Why is genomics so important for medicine
• VariantSpark: Overview
• Whole Genome Analysis: Clustering samples by ethnicity
Genome sequencing improves diagnostics
Genomic sequencing can lead to a successful diagnosis in up to 50% of cases where traditional genetic testing failed, and is on average 96% cheaper
Oncology
Tandem duplications
Identifying tumours by their genome-wide mutation profiles
Rare genetic disorders
Identifying causative mutations by interrogating all abnormal variants
http://matt.might.net
Bauer et al. Trends Mol Med. 2014 PMID: 24801560
Generating data from 1 Million Americans
Australia: ~100 Million dedicated to clinical genomics
• $25 Million Australian Genomics Health Alliance (NHMRC Grant, AI Denis)
• VIC and QLD Alliances ($25 Million each); NSW and ACT (undisclosed $$ to Garvan and JCSMR)
100,000 Genomes Project: 70,000 individuals by 2017
The Cancer Genome Atlas: 11,000 samples, 2015
Genomics projects are getting bigger
The HapMap Project: 270 samples, 2002
Human genome: ~1 sample
1000 Genomes Project: 1097 samples, 2012
Project MinE: 15,000 people with ALS
ASPREE: 4000 healthy 70+ year olds
Single samples are around 200GB in size
Last Year
Data Analysis categories for genomics
Basic Production Informatics: Produce raw sequence reads
Advanced Production Informatics: Map to genome and generate raw genomic features (e.g. SNPs)
Bioinformatics Research: Analyze the data; uncover the biological meaning
VariantSpark
MLlib*
VCF
VariantSpark is the interface enabling Spark’s MLlib machine learning algorithms to be applied to genomics data
e.g. grouping samples by genomic profile
Input Genomics Application Result
Large scale compute
* VariantSpark also uses Spark.ML
VariantSpark
Accepted BMC Genomics (IF=4)
Cluster individuals into ethnic groups based on their genomic profiles
www.cloudaccess.eu
1000 x 40 Million variants Matrix *
Kmeans
Predict super population
14 ethnic groups and 4 super populations
* VariantSpark can also process phase 3 data: 3000 individuals and 80 million variants
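The clustering step on this slide can be illustrated on toy data. The talk runs k-means through Spark's MLlib on a samples-by-variants matrix; the sketch below is not that implementation but a plain-Python stand-in (Lloyd's iteration) on a tiny matrix, and the names `squared_distance` and `kmeans` are hypothetical, chosen only for this example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20, seed=0):
    """Toy Lloyd's k-means: returns (assignment per point, centroids).

    Assumption: `points` is a list of equal-length numeric vectors, one
    per sample, with one entry per variant.
    """
    rng = random.Random(seed)
    # Start from k distinct samples chosen as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every sample to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Four samples, three variants each; two obvious groups.
samples = [[0, 0, 0], [0, 1, 0], [2, 2, 2], [2, 1, 2]]
groups, centres = kmeans(samples, k=2)
```

Each centroid is a vector with one entry per variant, so on real data the centroid storage alone grows with the number of variants.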
Clustering result
• Adjusted Rand index (ARI) = 0.84, where 1 is a perfect match and values around 0 indicate independent (chance-level) labeling
• The majority of American (AMR) individuals are placed in the same group as Europeans (EUR), likely reflecting their migrational backgrounds.
• ADMIXTURE (state-of-the-art tool for population structure determination) returns a low ARI of 0.25
Admixture: Alexander, D.H., Novembre, J., Lange, K.: Fast model-based estimation of ancestry in unrelated individuals. Genome Res 19(9), 1655–1664 (2009)
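For readers unfamiliar with the adjusted Rand index, a minimal sketch of how it can be computed from two labelings follows. It uses the standard pair-counting formula; the function name `adjusted_rand_index` and the toy labels are assumptions for illustration and are not taken from the VariantSpark code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting ARI between two labelings of the same samples."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    total_pairs = comb(len(labels_true), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as trivially identical

    # Pairs that share a group in both labelings, in the first, in the second.
    both = sum(comb(n, 2) for n in Counter(zip(labels_true, labels_pred)).values())
    in_true = sum(comb(n, 2) for n in Counter(labels_true).values())
    in_pred = sum(comb(n, 2) for n in Counter(labels_pred).values())

    expected = in_true * in_pred / total_pairs  # agreement expected by chance
    maximum = (in_true + in_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case, e.g. both labelings have one group
    return (both - expected) / (maximum - expected)


# Cluster ids are arbitrary, so a relabeled but identical grouping scores 1.
same = adjusted_rand_index(["EUR", "EUR", "AFR", "AFR"], [1, 1, 0, 0])
# A grouping that splits every true group scores below 0.
split = adjusted_rand_index(["EUR", "EUR", "AFR", "AFR"], [0, 1, 0, 1])
```

The score depends only on which samples share a group, not on the group names, which is why cluster ids from k-means can be compared directly against population labels.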
Comparison to other implementations
• Preprocessing: converting location-centric VCF genotypes into sample-centric numerical vectors
• Clustering: Kmeans
• ADAM (BigData Genomics): Spark implementation with dense matrix
• Hadoop: MapReduce without in-memory caching
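The preprocessing bullet above can be sketched as follows: a VCF file has one row per genomic location and one column per sample, and the clustering needs one numeric vector per sample. This is a simplified single-machine illustration, not the Spark implementation from the talk. The encoding (number of non-reference alleles per genotype) and the function names are assumptions made for this example.

```python
def genotype_to_count(gt):
    """Count non-reference alleles in a GT value such as '0|1' or '1/1'.

    Assumption: missing alleles ('.') are counted as reference.
    """
    alleles = gt.replace("|", "/").split("/")
    return sum(1 for allele in alleles if allele not in ("0", "."))


def vcf_to_sample_vectors(vcf_lines):
    """Turn location-centric VCF rows into one numeric vector per sample."""
    samples = []
    vectors = {}
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the 9 fixed columns
            vectors = {name: [] for name in samples}
            continue
        for name, column in zip(samples, fields[9:]):
            gt = column.split(":")[0]  # GT is the first sub-field when present
            vectors[name].append(genotype_to_count(gt))
    return vectors


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "22\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1",
    "22\t200\t.\tC\tT\t.\tPASS\t.\tGT\t1|1\t./.",
]
per_sample = vcf_to_sample_vectors(example)
```

Because each VCF row is handled independently of the others, this step is a natural fit for distributed processing.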
Chromosome 22; VM on Microsoft Azure with A7 Linux instance and 8 cores, 56GB memory running Ubuntu.
[Figure: runtime comparison of the implementations; values shown: 103, 75, 29, 28, 18, 4 min]
Scaling VariantSpark to the whole genome
• Pre-processing: scales seamlessly as processes are independent
• Clustering: memory consumption increases linearly with the number of variants (24GB) due to additional distance measurements between variants and k-means centroids
• As total memory was the limiting factor on our infrastructure, the number of simultaneously used nodes had to be reduced, increasing runtime.
CSIRO Spark Cluster: Whole genome; Hadoop 2.5.0, managed by Cloudera’s CDH 5. We use Spark 1.3.1. This 13-node cluster has a total of 416 cores and 1.22TB memory.
Three things to remember
• VariantSpark is an interface bringing bigLearning tasks to genomics applications
• VariantSpark can cluster 3000 individuals and 80 million variants in under 30 hours using minimal memory (24GB), a task that is not possible in R/python/ADMIXTURE due to memory limits.
• VariantSpark outperforms ADAM (Big Data Genomics) and an equivalent Hadoop implementation by almost an order of magnitude.
https://github.com/BauerLab/VariantSpark
HEALTH AND BIOSECURITY
Thank you
Health & Biosecurity
Denis C. Bauer
t +61 2 9123 4567
e [email protected]
aehrc.com/biomedical-informatics/transformational-bioinformatics/
More talks online: http://www.slideshare.net/allPowerde
Twitter: @allPowerde
Aidan O’Brien
Bill Wilson
Transformational Bioinformatics Team, CSIRO
Former members
Firoz Anwar
Neil Saunders
Rodney Scott, Newcastle University
Funding: National Health and Medical Research Council; National Breast Cancer Foundation; CSIRO’s Transformational Capability Platform; CSIRO’s IM&T; Science and Industry Endowment Fund
Buske et al., Bioinformatics Jan 2014
O’Brien et al., BMC Genomics Dec 2015
Dunne et al., in preparation
FullySIC: Epistatic Gene Network modelling
in preparation
Anwar et al., in preparation
Piotr Szul, Gi Guo, Robert Dunne
Data61 CSIRO, Australia
GOdistinct: GO Enrichment or genesets with distinctive function