GA4GH meeting at the Sanger Institute

ADAM: Fast, Scalable Genome Analysis Matt Massie Twitter: @matt_massie Email: [email protected] University of California, Berkeley http://amplab.cs.berkeley.edu http://bigdatagenomics.github.io

Upload: matt-massie

Post on 07-May-2015

TRANSCRIPT

Page 1

ADAM: Fast, Scalable Genome Analysis

Matt Massie
Twitter: @matt_massie
Email: [email protected]

University of California, Berkeley
http://amplab.cs.berkeley.edu

http://bigdatagenomics.github.io

Page 2

Design

• Create a platform with an easy programming environment for developers

• Provide both single and multi-sample methods that are fast and scalable for whole genome, high-coverage data

• Allow for multiple views of the same data, e.g. SQL/Table, Graph Analysis, Iterator on Records, Resilient Distributed Datasets

• Leverage existing open-source systems and plug into current “Big Data” ecosystems

• Deployable on an in-house cluster or any cloud vendor - Amazon EC2, Google Compute Engine or Microsoft Azure

• Everything is a file - bulk data transfer only requires standard tools like rsync, scp, distcp, S3sync, etc.

Page 3

Implementation

• Accelerated work began in September 2013

• Nine contributors from Mt. Sinai, GenomeBridge, The Broad Institute and others

• Built using Apache Spark execution engine and Apache Avro and Parquet for file formats

• 20K lines of Scala code

• Apache-licensed open-source

[Chart: Commits]

Page 4

Features

• ADAM

• Read pre-processing: sort, mark dups, BQSR

• Read comparison across multiple covariates

• Converters between legacy and ADAM formats

• Avocado - A variant caller, distributed

• SNP caller

• Local assembler

• Support for integrating aligners in M/R frameworks

• Fully configurable pipeline via a config file

[Pipeline diagram: Raw Reads -> Mapping -> Sorted Mapping -> Local Alignment -> Mark Duplicates -> Base Quality Score Recalibration -> Calling-Ready Reads. Side label: Read Pre-Processing]

Page 5

http://avro.apache.org/

Page 6

Avro

• Serialization system similar to Google Protobuf and Apache Thrift

• Data formats are fully described with a schema

• Bindings for Java, C, C++, C#, JavaScript, Python, Ruby, PHP and Perl (R in the works)

• Datafile format is self-descriptive and record-oriented

• Provides schema evolution, resolution and projection

• Numerous conversion utilities to print Avro as JSON, extract schema from JAXB, turn XSD/XML to Avro

Page 7
Page 8
Page 9
Page 10
Page 11

http://parquet.io/

Parquet

https://blog.twitter.com/2013/dremel-made-simple-with-parquet

Page 12

Parquet

• Based on Google Dremel design

• Created by Twitter and Cloudera with contributions from dozens of open-source developers

• Columnar File Format

• Limits I/O to only data that is needed

• Compresses very well - ADAM files are 5-25% smaller than BAM files without loss of data

• Fast scans - load only columns you need, e.g. scan a read flag on a whole genome, high-coverage file in less than a minute

• Integrates easily with Avro, Hadoop, Hive, Shark, Impala, Pig, Jackson/JSON, Scrooge and others

Page 13

Read Data Example

chrom20  TCGA   4M
chrom20  GAAT   4M1D
chrom20  CCGAT  5M

Row Oriented

chrom20 TCGA 4M | chrom20 GAAT 4M1D | chrom20 CCGAT 5M

Column Oriented

chrom20 chrom20 chrom20 | TCGA GAAT CCGAT | 4M 4M1D 5M

[Diagram labels: Predicate, Projection]
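The difference between the two layouts can be sketched with plain Scala collections. This is a minimal illustration only, not Parquet's actual on-disk format (which adds row groups, encodings and compression); the `Read` and `ReadColumns` types and all function names are hypothetical and do not come from ADAM.

```scala
// Hypothetical record type for the three reads on the slide.
final case class Read(contig: String, sequence: String, cigar: String)

// Column-oriented layout: one vector per field, aligned by position.
final case class ReadColumns(
    contig: Vector[String],
    sequence: Vector[String],
    cigar: Vector[String])

object ColumnarSketch {
  // Row-oriented layout: each record keeps all of its fields together.
  val rows: Vector[Read] = Vector(
    Read("chrom20", "TCGA", "4M"),
    Read("chrom20", "GAAT", "4M1D"),
    Read("chrom20", "CCGAT", "5M"))

  // Pivot the rows into columns.
  def toColumns(rs: Vector[Read]): ReadColumns =
    ReadColumns(rs.map(_.contig), rs.map(_.sequence), rs.map(_.cigar))

  // Projection: touch only the column that was asked for.
  def projectCigar(cols: ReadColumns): Vector[String] = cols.cigar

  // Predicate: scan one column to select positions, then fetch a
  // second column only at those positions.
  def sequencesWhereCigar(cols: ReadColumns)(p: String => Boolean): Vector[String] =
    cols.cigar.indices
      .filter(i => p(cols.cigar(i)))
      .map(i => cols.sequence(i))
      .toVector

  def main(args: Array[String]): Unit = {
    val cols = toColumns(rows)
    println(projectCigar(cols).mkString(" "))                          // 4M 4M1D 5M
    println(sequencesWhereCigar(cols)(_.contains("D")).mkString(" "))  // GAAT
  }
}
```

In the row layout, answering either query means walking every field of every record; in the column layout, the contig column is never read at all.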

Page 14

http://spark.apache.org/

Page 15

Apache Spark

• Grew out of Berkeley AMPLab research - now a top-level Apache project, commercially supported

• Ease of Use - Spark offers over 80 high-level operators that make it easy to build parallel apps using Scala, Java, Python or R

• Easy to test code in “local” mode

• Can use it interactively for ad-hoc analysis from the Scala, Python and R shells or using iPython notebook

• Speed - Spark has an advanced DAG execution engine that is 10-100x faster than Hadoop M/R

• Runs well on in-house clusters, Amazon EC2 and Google Compute Engine

Page 16

Performance as Proof

1000g NA12878 Whole Genome, 60x Coverage

[Bar chart, y-axis in hours (0 to 24). Groups: Sort, Mark Duplicates, BQSR. Series: Picard, ADAM Single Node, ADAM 100 EC2 Nodes. Bar labels as extracted, with their assignment to bars lost: 0.75, 0.47, 20.37, 0.33, 8.93, 17.73]

For comparison, Bina Technologies quotes 0.94 hours for BQSR at only 37x coverage

Page 17

Summary

• Schema-driven design allows developers to think at the logical layer

• Well-designed execution systems allow developers to focus on science and algorithms instead of implementation details

• Modern data formats enable distributed, fast computation and easier integration

• Moving computation to the data reduces transfers and improves performance

Page 18

Thank you

Page 19

Extra slides

Page 20

Rank variants by read depth and print the top 100

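import org.apache.spark.SparkContext._  // pair-RDD operations (reduceByKey, sortByKey) on older Spark versions
import org.apache.spark.rdd.RDD

// partitionAndJoin, sc, dict, variants and reads are not defined on the slide.
val join: RDD[(ADAMVariant, ADAMRecord)] = partitionAndJoin(sc, dict, variants, reads)

// Count the reads joined to each variant.
val readCounts = join.map(p => (p._1, 1)).reduceByKey(_ + _)

// Key by count and sort in descending order so the deepest variants come first.
val sorted = readCounts.map(p => (p._2, p._1)).sortByKey(ascending = false)

val top100 = sorted.take(100)
top100.foreach { case (count, variant) =>
  println("%d\t%s".format(count, variant.getId))
}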
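The same count, sort and take pattern can be tried without a cluster. The sketch below uses plain Scala collections as a stand-in for the RDD; the variant IDs, read names and the `topByDepth` helper are made up for illustration and are not part of ADAM or Spark.

```scala
object TopVariantsSketch {
  // Stand-in for the joined RDD: one (variantId, readName) pair per
  // read overlapping a variant. All values are invented.
  val joined: Seq[(String, String)] = Seq(
    "rs1" -> "readA", "rs1" -> "readB", "rs1" -> "readC",
    "rs2" -> "readA",
    "rs3" -> "readB", "rs3" -> "readD")

  // Count reads per variant, sort by descending count (ties broken
  // by variant ID), and keep the first n entries.
  def topByDepth(pairs: Seq[(String, String)], n: Int): Seq[(Int, String)] =
    pairs
      .groupBy(_._1)
      .toSeq
      .map { case (variant, hits) => (hits.size, variant) }
      .sortBy { case (count, variant) => (-count, variant) }
      .take(n)

  def main(args: Array[String]): Unit =
    topByDepth(joined, 2).foreach { case (count, variant) =>
      println("%d\t%s".format(count, variant))
    }
}
```

Sorting in descending order matters here: taking the first n elements of an ascending sort would return the shallowest variants rather than the deepest.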

Page 21

Flagstat

$ time adam flagstat NA12878.mapped.ILLUMINA.bwa.CEU.high_coverage_pcr_free.20130906.adam

757704193 + 0 in total (QC-passed reads + QC-failed reads)
8158052 + 0 primary duplicates
7594332 + 0 primary duplicates - both read and mate mapped
563720 + 0 primary duplicates - only read mapped
10344 + 0 primary duplicates - cross chromosome
10227903 + 0 secondary duplicates
10142158 + 0 secondary duplicates - both read and mate mapped
85745 + 0 secondary duplicates - only read mapped
4026853 + 0 secondary duplicates - cross chromosome
750027254 + 0 mapped (98.99%:0.00%)
757704193 + 0 paired in sequencing
377464374 + 0 read1
380239819 + 0 read2
724651663 + 0 properly paired (95.64%:0.00%)
745340038 + 0 with itself and mate mapped
4687216 + 0 singletons (0.62%:0.00%)
11135947 + 0 with mate mapped to a different chr
5557972 + 0 with mate mapped to a different chr (mapQ>=5)

real    1m58.688s
user    25m52.453s
sys     0m43.879s

This would take 40 minutes just to read from a single disk (assuming 100 MB/s)
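The figures on this slide can be related with a little arithmetic. The sketch below is an inference from the numbers shown, not something the slide states: it derives the data volume implied by the 40-minute estimate, the ratio of that estimate to the measured wall-clock time, and the average number of busy cores (user CPU time over wall-clock time). The object and value names are illustrative.

```scala
object FlagstatTiming {
  // Convert a "minutes + seconds" timing into seconds.
  def seconds(minutes: Int, secs: Double): Double = minutes * 60 + secs

  val real: Double = seconds(1, 58.688)   // wall-clock time from the slide
  val user: Double = seconds(25, 52.453)  // user CPU time from the slide

  // The slide's assumption: 100 MB/s from one disk for 40 minutes.
  val diskMBps: Double = 100.0
  val singleDiskSeconds: Double = seconds(40, 0)

  // Data volume implied by that estimate (inferred, not stated on the slide).
  val impliedGB: Double = singleDiskSeconds * diskMBps / 1000.0

  // Single-disk read estimate divided by the measured wall-clock time.
  val speedupOverDisk: Double = singleDiskSeconds / real

  // Average number of cores kept busy in user code during the run.
  val avgBusyCores: Double = user / real

  def main(args: Array[String]): Unit = {
    println(s"implied volume (GB): $impliedGB")
    println(s"estimate / measured wall-clock: $speedupOverDisk")
    println(s"average busy cores: $avgBusyCores")
  }
}
```

Under these assumptions the estimate corresponds to roughly 240 GB of data, and the run finished about 20 times faster than a full single-disk read would allow, which is consistent with reading only the needed columns and spreading the work across cores.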

Page 22

Concordance between ADAM and GATK BQSR

[Plot: ADAM (y-axis) against GATK (x-axis), both axes from 0 to 50]

RMSE: 1.48
Exact Matches: 50.06%
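For reference, the two summary statistics can be computed as follows. This is a minimal sketch assuming two aligned sequences of integer quality scores, one per tool; the slide does not say how its values were aggregated, and the sample scores below are made up.

```scala
object Concordance {
  // Root-mean-square error between two equally long score sequences.
  def rmse(a: Seq[Int], b: Seq[Int]): Double = {
    require(a.nonEmpty && a.length == b.length, "need two non-empty sequences of equal length")
    val sumSq = a.zip(b).map { case (x, y) =>
      val d = (x - y).toDouble
      d * d
    }.sum
    math.sqrt(sumSq / a.length)
  }

  // Fraction of positions where both sequences hold the same score.
  def exactMatchFraction(a: Seq[Int], b: Seq[Int]): Double = {
    require(a.nonEmpty && a.length == b.length, "need two non-empty sequences of equal length")
    a.zip(b).count { case (x, y) => x == y }.toDouble / a.length
  }

  def main(args: Array[String]): Unit = {
    // Invented scores for illustration only.
    val adam = Seq(30, 32, 40, 25)
    val gatk = Seq(30, 33, 38, 25)
    println(s"RMSE: ${rmse(adam, gatk)}")
    println(s"Exact matches: ${exactMatchFraction(adam, gatk)}")
  }
}
```

A low RMSE alongside a moderate exact-match rate, as on the slide, means that where the two tools disagree, the scores still tend to be close.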

Page 23

http://hadoop.apache.org/

Page 24

Hadoop Distributed File System (HDFS)

• Based on GoogleFS

• Single namespace across entire cluster

• Uses commodity hardware - JBOD

• Files are broken into blocks (e.g. 128MB)

• Blocks replicated for durability and performance

• Write-once, read-many access pattern
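The block and replication bullets translate into simple arithmetic. The sketch below assumes the 128 MB block size from the slide and a replication factor of 3, which the slide does not state; the object and function names are illustrative and not part of any Hadoop API.

```scala
object HdfsBlocks {
  val MB: Long = 1024L * 1024L

  // Number of blocks a file occupies: ceiling of fileBytes / blockBytes.
  def blockCount(fileBytes: Long, blockBytes: Long): Long = {
    require(fileBytes >= 0 && blockBytes > 0, "sizes must be non-negative and block size positive")
    (fileBytes + blockBytes - 1) / blockBytes
  }

  // Raw bytes stored across the cluster when every block is kept
  // `replication` times (the last block is not padded to full size).
  def rawBytes(fileBytes: Long, replication: Int): Long = {
    require(replication >= 1, "replication must be at least 1")
    fileBytes * replication
  }

  def main(args: Array[String]): Unit = {
    val file = 1000L * MB     // a hypothetical 1000 MB file
    val block = 128L * MB     // block size from the slide
    val replication = 3       // assumed replication factor
    println(s"blocks: ${blockCount(file, block)}")                  // blocks: 8
    println(s"raw MB stored: ${rawBytes(file, replication) / MB}")  // raw MB stored: 3000
  }
}
```

Each block can be read from any node holding one of its replicas, which is what lets a scan proceed in parallel across the cluster.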

Page 25
Page 26

$ adam

     e            888~-_              e                 e    e
    d8b           888   \            d8b               d8b  d8b
   /Y88b          888    |          /Y88b             d888bdY88b
  /  Y88b         888    |         /  Y88b           / Y88Y Y888b
 /____Y88b        888   /         /____Y88b         /   YY   Y888b
/      Y88b       888_-~         /      Y88b       /          Y888b

Choose one of the following commands:

        transform : Convert SAM/BAM to ADAM format and optionally perform read pre-processing transformations
       print_tags : Prints the values and counts of all tags in a set of records
         flagstat : Print statistics on reads in an ADAM file (similar to samtools flagstat)
        reads2ref : Convert an ADAM read-oriented file to an ADAM reference-oriented file
          mpileup : Output the samtool mpileup text from ADAM reference-oriented data
            print : Print an ADAM formatted file
aggregate_pileups : Aggregate pileups in an ADAM reference-oriented file
         listdict : Print the contents of an ADAM sequence dictionary
          compare : Compare two ADAM files based on read name
 compute_variants : Compute variant data from genotypes
         bam2adam : Single-node BAM to ADAM converter (Note: the 'transform' command can take SAM or BAM as input)
         adam2vcf : Convert an ADAM variant to the VCF ADAM format
         vcf2adam : Convert a VCF file to the corresponding ADAM format
        findreads : Find reads that match particular individual or comparative criteria
       fasta2adam : Converts a text FASTA sequence file into an ADAMNucleotideContig Parquet file which represents assembled sequences.
           plugin : Executes an AdamPlugin