petascale genomics (strata singapore 20151203)
TRANSCRIPT
1© Cloudera, Inc. All rights reserved.
Scaling Up Genomics with Hadoop and Spark
Uri Laserson | @laserson | 14 November 2015
Petascale Genomics
We come in peace.
Pioneer plaque
What is genomics?
Organism
Organism Cell
Organism Cell Genome
Reference chromosome
Location
“… decoding the Book of Life”
...atatggaaccaaaaaagagcccgcatcgccaaggcaatcctaagccaaaagaacaaagctggaggcatcacactacctgacttcaaactatactaca
agcctacagtaaccaaaacagcatggtactggtaccaaaacagagatatagatcaatggaacagaacagagccctcagaaataacgccgcatatctacaa
ctatctgatctttgacgaacctgagaaaaacaagcaatggggaaaggattccctatttaataaatggtgctgggaaaactggctagccatatgtagaaag
ctgaaactggatcccttccttacaccttatacaaaaatcaattcaagatggattaaagacttaaacgttagacctaaaaccataaaaaccctagaagaaa
acctaggcagtaccattcaggacataggcatgggcaaggacttcatgtccaaaacaccaaaagcaatggcaacaaaagacaaaattgacaaatgggatct
aattaaactaaagagcttctgcacagcaaaagaaactaccatcagagtgaacaggaaacctacaaaatgggagaaaattttcgcaacctactcatctgac
aaagggctaatatccagaatctacaatgaactcaaacaaatttacaagaaaaaaacaaacaaccccatcaaaaagtgggcaaaggacatgaacagacact
tctcaaatgaagacatttatgcagccaaaaaacacatgaaaaaatgctcatcatcactggccatcagagaaatgcaaatcaaaaccacaatgagatacca
tctcacaccagttagaatggcaatcattaaaaagtcaggaaacaacaggtgctggagaggatgtggagaaataggaacacttttacactgttggtgggac
tgtaaactagttcaaccattgtggaagtcagtgtggtgattcctcagggatctagaactagaaataccatttgacccagccatcccattactgggtatat
acccaaaggactataaatcatgctgctataaagacacatgcacacgtatgtttattgcggcattattcacaatagcaaagacttggaaccaacccaaatg
tccaacaatgataaactggattaagaaaatgtggcacatatacaccatggaatactctgcagccataaaaaaggatgagttcatgtcctttgtagggaca
tggatgaaattggaaatcatcattctcagtaaactatcgcaagaataaaaaaccaaacaccgcatattctcactcataggtgggaattgaacaatgagat
cacatggacacaggaagaggaatatcacactctggggactgtggtggggtggggggaggggggagggatagcattgggagatatacctaatgctagatga
cgagttagtgggtgcagcgcaccagcatggcacatgtatacatatgtaactaacctgcacattgtgcacatgtaccctaaaacttaaagtataataaaaa
aataaaaaaaataaagtgtgtgtgtgtatgactttaattaacttgatcacccacacacacacaaacactgaccaaaattaatatcaagtcaggtctgtct
gaatgtaaagccaacagcaaacatccctctctccaaatggaaaagaaacagggggttatgggcagctacactgctaaatgttaaaactttatttttaaat
gtggccataaaaatcactaaataaaattgataatatatgtttttgatgaataaattttatatatgtctacactggaaactatatagcaataaaaactaac
catgtacaactaaactcataaatttcataaacataataagtaaaagaagccagacaaaaagtagtgtatactgttaaattccatttatataaaagttcaa
aaaagccaaaaagaaactatgctgttaaaagtaaggattatagttactattcagggaagagagtagtggctggaaagaaacataaagggggtctctgaag
tggaataatgttctgttttttgatctgggtattagggtgtttaatttcggaaaattattttatctttatacttattgtattattgattttttgcttaaca
aattactcaaaacttagaggtttaaaaaaaattaattattgtattaatttctctgggccaggaattggagagagcttagctgggtagttctggttcaaaa
tttctcatgagattaccgtcaagctgttggagggggctgcatcatctgaaggcttgaccgaggctagaggatctactttcaagatggcccactcacatgg
ctgttggcaagaagtttcagtttctcactagcttctagcaggaggccataatttctcaccacatagatctctctatagggctactcgagtgtcctcacag
caaggtagctggctttcttcagagccaagtgactcaaaggcaaagaggaagtcactatgccatttatgacctagttttggaactcacactttgttccgaa
ttgaccttccatcactttctagtcattaggatttaagtcactaactctgatccatagtcaaggggagtaaaatttggctttattgttggaggatggagta
gcaaagaatttgttgacacattttaaaactaccatacttaaacagttcatttttctgaatatgcttcaattagaagttaaaatgatgcaattttaaaaca
ttgtttcaaatgaacactgttagggagagaagtgcttcttctccatatctaatgtttcttccatatttagggagttccattagtttaacactttaag...
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT
Bioinformatics!
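The reads above are in FASTA layout: a `>` line names a record and the following lines hold its sequence. A minimal Python sketch of reading that layout is given here; the function name `parse_fasta` and the GC-content summary are illustrative assumptions, not part of the talk.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Map each read name to its sequence; a '>' line starts a new record."""
    reads: Dict[str, str] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            reads[name] = ""
        elif name is not None:
            # A sequence may wrap over several lines, so append to it.
            reads[name] += line.upper()
    return reads


FASTA_TEXT = """\
>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
"""

reads = parse_fasta(FASTA_TEXT.splitlines())

# Fraction of G and C bases per read, a common first summary statistic.
gc_content = {
    name: (seq.count("G") + seq.count("C")) / len(seq)
    for name, seq in reads.items()
}
```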
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Pipelines!
##fileformat=VCFv4.1
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta
##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003
20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3
20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4
20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
Compressed text files (non-splittable)
Semi-structured
Poorly specified
Global sort order
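To make "semi-structured" concrete, here is a hedged Python sketch that splits one VCF data line into its fixed columns, its INFO map, and its per-sample fields. It assumes the tab-delimited layout of the VCF specification (the slide text shows the tabs as spaces); `parse_vcf_record` is a hypothetical helper, not a library function.

```python
from typing import Dict, List, Union

InfoValue = Union[str, bool]


def parse_vcf_record(line: str, sample_names: List[str]) -> Dict[str, object]:
    """Split one tab-delimited VCF data line into fixed and per-sample fields."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info, fmt = cols[:9]

    # INFO holds ';'-separated 'key=value' pairs; a bare key is a flag.
    info_map: Dict[str, InfoValue] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        info_map[key] = value if sep else True

    # FORMAT names the ':'-separated sub-fields of every sample column.
    keys = fmt.split(":")
    samples = {
        name: dict(zip(keys, col.split(":")))
        for name, col in zip(sample_names, cols[9:])
    }
    return {
        "chrom": chrom, "pos": int(pos), "id": vid, "ref": ref,
        "alt": alt.split(","), "qual": qual, "filter": flt,
        "info": info_map, "samples": samples,
    }


LINE = "\t".join([
    "20", "14370", "rs6054257", "G", "A", "29", "PASS",
    "NS=3;DP=14;AF=0.5;DB;H2", "GT:GQ:DP:HQ",
    "0|0:48:1:51,51", "1|0:48:8:51,51", "1/1:43:5:.,.",
])
record = parse_vcf_record(LINE, ["NA00001", "NA00002", "NA00003"])
```

Every value comes back as a string until the `##INFO` and `##FORMAT` header lines are consulted for its type, which is one reason the format is awkward to query directly.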
C | HPC (scheduler) | POSIX filesystem
Java | HPC (Queue) | POSIX filesystem
C++ | Single-node | SQLite
It’s file formats all the way down!
Dedup
/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);

    final SAMFileWriter out =
        new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);

    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
                rec.setDuplicateReadFlag(false);
Method
Code
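Stripped of file handling, the method behind the excerpt above can be sketched in a few lines of Python. This is a simplified illustration, not the implementation on the slide: the `Read` fields, the grouping key (contig, strand, 5' position), and the quality-sum tiebreak are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Read:
    name: str
    contig: str
    five_prime_pos: int   # position of the read's 5' end on the reference
    reverse_strand: bool
    quality_sum: int      # hypothetical score: sum of base qualities
    duplicate: bool = False


def mark_duplicates(reads: List[Read]) -> None:
    """Flag all but the best-scoring read in each (contig, strand, 5' end) group."""
    groups: Dict[Tuple[str, bool, int], List[Read]] = defaultdict(list)
    for read in reads:
        groups[(read.contig, read.reverse_strand, read.five_prime_pos)].append(read)

    for group in groups.values():
        best = max(group, key=lambda r: r.quality_sum)
        for read in group:
            read.duplicate = read is not best


reads = [
    Read("r1", "20", 14370, False, 700),
    Read("r2", "20", 14370, False, 650),   # same 5' end and strand as r1
    Read("r3", "20", 14370, True, 600),    # opposite strand, so its own group
    Read("r4", "20", 17330, False, 720),
]
mark_duplicates(reads)
flags = {r.name: r.duplicate for r in reads}
```

The method itself is a group-by followed by a per-group maximum; the sorting, spilling to disk, and re-reading in the Java excerpt are there to fit it onto a single machine.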
@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
Dedup
Method/Algo
Code
Platform
It’s pipelines all the way down!
Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Manually running pipelines on HPC
$ bsub -q shared_12h python split_genotypes.py
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv
$ bsub -q shared_12h python merge_maf.py
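The three job types above form a hand-rolled scatter/gather: split the input, aggregate each shard independently, merge the partial results. A minimal single-process Python sketch of that pattern follows; the contents of the real scripts are not shown in the talk, so the functions `split`, `count_alleles`, and `merge` and the toy genotype calls are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List

# Each genotype call is a string such as "0/1": 0 is the reference allele,
# 1 the alternate allele.
Calls = List[str]


def split(calls: Calls, n_shards: int) -> List[Calls]:
    """Scatter: deal the calls round-robin into n_shards lists."""
    return [calls[i::n_shards] for i in range(n_shards)]


def count_alleles(shard: Calls) -> Counter:
    """Per-shard aggregate: count how often each allele index occurs."""
    counts: Counter = Counter()
    for call in shard:
        counts.update(call.replace("|", "/").split("/"))
    return counts


def merge(partials: List[Counter]) -> Dict[str, float]:
    """Gather: add the shard counts, then convert them to allele frequencies."""
    total: Counter = Counter()
    for partial in partials:
        total += partial
    n = sum(total.values())
    return {allele: count / n for allele, count in total.items()}


calls = ["0/0", "0/1", "1/1", "0|1", "0/0", "0/0", "0/1", "0/0"]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(count_alleles, split(calls, 4)))
frequencies = merge(partials)

# With two alleles, the minor allele frequency is the smaller of the two.
minor_allele_frequency = min(frequencies.values())
```

Because counts add up regardless of how the calls are sharded, the merge step gives the same answer for any number of shards, which is what lets a framework take over the splitting and scheduling done manually above.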
Node 1
Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation
Node 2
Node 3
Alignment → Dedup → Recalibrate → QC/Filter
Alignment → Dedup → Recalibrate → QC/Filter
Node 4
Node 1
Alignment → Dedup → QC/Filter → Variant Calling → Variant Annotation
Node 2
Node 3
Alignment → Dedup → QC/Filter
Alignment → Dedup → QC/Filter
Node 4
Recalibrate
Why Are We Still Defining File Formats By Hand?
• Instead of defining custom file formats for each data type and access pattern…
• Parquet creates a compressed format for each Avro-defined data model
• Improvements over existing formats
  • ~20% for BAM
  • ~90% for VCF
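One reason a schema-driven columnar format can compress well is that values of the same field sit next to each other, so repeats collapse. The Python sketch below is a toy illustration of that idea only; it is not Parquet's actual on-disk encoding, and the three-field schema is a made-up stand-in for an Avro-defined model.

```python
from itertools import groupby
from typing import Dict, List, Tuple

# Hypothetical record schema: (contig, position, reference allele).
SCHEMA = ("contig", "pos", "ref")
rows = [
    ("20", 14370, "G"),
    ("20", 17330, "T"),
    ("20", 1110696, "A"),
    ("20", 1230237, "T"),
]


def to_columns(records: List[tuple]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per schema field."""
    return {name: [r[i] for r in records] for i, name in enumerate(SCHEMA)}


def run_length_encode(values: list) -> List[Tuple[object, int]]:
    """Collapse consecutive repeats into (value, run length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


columns = to_columns(rows)

# In a position-sorted file the contig column is one long run.
encoded_contig = run_length_encode(columns["contig"])
```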
ContEst Algorithm
YARN-managed Hadoop cluster
Spark executors (partial sums): $\sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
Driver (application code): $\sum_{i=1}^{N} \sum_{j=1}^{d_i} P(b_{ij} \mid e_{ij}, f_i)$
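The executor/driver split on this slide works because addition is associative: each partition can sum its own terms and the driver only adds the partial results. A small Python sketch of that decomposition follows; the probability values are invented placeholders standing in for P(b_ij | e_ij, f_i), and `partition` and `partial_sum` are hypothetical helper names.

```python
import math
from typing import List

# probabilities[i][j] is a placeholder for P(b_ij | e_ij, f_i):
# one inner list per site i, one entry per read j at that site.
probabilities: List[List[float]] = [
    [0.5, 0.25],
    [0.125, 0.5, 0.25],
    [0.75],
    [0.25, 0.25],
]


def partition(sites: List[List[float]], n_parts: int) -> List[List[List[float]]]:
    """Assign sites to n_parts partitions round-robin (stand-in for executors)."""
    return [sites[k::n_parts] for k in range(n_parts)]


def partial_sum(part: List[List[float]]) -> float:
    """Executor side: sum every term that lives in this partition."""
    return math.fsum(p for site in part for p in site)


# Driver side: collect one number per partition and add them up.
partials = [partial_sum(part) for part in partition(probabilities, 3)]
total = math.fsum(partials)
```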
Hadoop provides layered abstractions for data processing
HDFS (scalable, distributed storage)
YARN (resource management)
MapReduce Impala (SQL) Solr (search) Spark
ADAM, quince, guacamole, …
bdg-formats (Avro/Parquet)
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
Core Genomics Primitives: Spatial Join
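A spatial join pairs records whose genomic coordinates overlap rather than records whose keys are equal. The Python sketch below shows the single-machine version of the idea for point variants against half-open regions; it is an assumption-laden illustration (names, tuple layout, and data are made up), not ADAM's distributed implementation.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]   # (contig, start, end), half-open: start <= pos < end
Variant = Tuple[str, int]       # (contig, position)


def region_join(variants: List[Variant],
                regions: List[Region]) -> List[Tuple[Variant, Region]]:
    """Pair every variant with each region on the same contig that contains it."""
    grouped: Dict[str, List[Region]] = defaultdict(list)
    for region in regions:
        grouped[region[0]].append(region)

    # Per contig: regions sorted by start, plus the list of starts for bisecting.
    index: Dict[str, Tuple[List[int], List[Region]]] = {}
    for contig, items in grouped.items():
        items.sort(key=lambda r: r[1])
        index[contig] = ([r[1] for r in items], items)

    joined: List[Tuple[Variant, Region]] = []
    for variant in variants:
        contig, pos = variant
        starts, items = index.get(contig, ([], []))
        # Only regions that start at or before pos can contain the variant.
        for region in items[:bisect_right(starts, pos)]:
            if pos < region[2]:
                joined.append((variant, region))
    return joined


regions = [("20", 10000, 15000), ("20", 14000, 18000), ("21", 14000, 18000)]
variants = [("20", 14370), ("20", 17330), ("20", 1110696)]
pairs = region_join(variants, regions)
```

Grouping by contig first is also the natural partitioning key when the same join is spread across machines.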
Executing query in Hadoop: interactive Spark shell (ADAM)
// predicates
def inDbSnp(g: Genotype): Boolean = ???  // true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(g => !inDbSnp(g))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by, aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")
Executing query in Hadoop: distributed SQL
SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
FROM genotypes g
INNER JOIN samples s
ON g.sample = s.sample
INNER JOIN dnase d
ON g.chr = d.chr
AND g.pos >= d.start
AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
ON g.chr = p.chr
AND g.pos = p.pos
AND g.ref = p.ref
AND g.alt = p.alt
WHERE
s.study = "framingham" AND
p.pos IS NULL AND
g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
apply predicates
“load” and join data
group-by
aggregate (UDAF)
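The query shape above can be tried on toy data with SQLite from Python, which is used here only as a single-node stand-in for a distributed SQL engine. Everything in this sketch is an assumption for illustration: the table contents are invented, `start`/`end` are renamed to `start_pos`/`end_pos` because END is reserved in SQLite, and the `MAF` aggregate is a simplified biallelic version registered as a user-defined aggregate.

```python
import sqlite3


class MinorAlleleFrequency:
    """Toy UDAF: minor allele frequency over biallelic genotype calls."""

    def __init__(self) -> None:
        self.alt = 0
        self.total = 0

    def step(self, call: str) -> None:
        alleles = [a for a in call.replace("|", "/").split("/") if a != "."]
        self.alt += sum(1 for a in alleles if a != "0")
        self.total += len(alleles)

    def finalize(self) -> float:
        if self.total == 0:
            return 0.0
        freq = self.alt / self.total
        return min(freq, 1.0 - freq)


conn = sqlite3.connect(":memory:")
conn.create_aggregate("MAF", 1, MinorAlleleFrequency)
conn.executescript("""
    CREATE TABLE genotypes (sample TEXT, chr TEXT, pos INTEGER, ref TEXT,
                            alt TEXT, call TEXT, polyphen TEXT);
    CREATE TABLE samples (sample TEXT, pop TEXT, study TEXT);
    CREATE TABLE dnase (chr TEXT, start_pos INTEGER, end_pos INTEGER);
    CREATE TABLE dbsnp (chr TEXT, pos INTEGER, ref TEXT, alt TEXT);

    INSERT INTO samples VALUES ('s1', 'EUR', 'framingham'),
                               ('s2', 'EUR', 'framingham'),
                               ('s3', 'AFR', 'other');
    INSERT INTO dnase VALUES ('20', 14000, 18000);
    INSERT INTO dbsnp VALUES ('20', 17330, 'T', 'A');
    INSERT INTO genotypes VALUES
        ('s1', '20', 14370, 'G', 'A', '0/1', 'probably damaging'),
        ('s2', '20', 14370, 'G', 'A', '0/0', 'probably damaging'),
        ('s3', '20', 14370, 'G', 'A', '1/1', 'probably damaging'),
        ('s1', '20', 17330, 'T', 'A', '0/1', 'possibly damaging'),
        ('s2', '20', 15000, 'C', 'T', '1/1', 'benign');
""")

rows = conn.execute("""
    SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)
    FROM genotypes g
    INNER JOIN samples s ON g.sample = s.sample
    INNER JOIN dnase d
        ON g.chr = d.chr AND g.pos >= d.start_pos AND g.pos < d.end_pos
    LEFT OUTER JOIN dbsnp p
        ON g.chr = p.chr AND g.pos = p.pos AND g.ref = p.ref AND g.alt = p.alt
    WHERE s.study = 'framingham'
      AND p.pos IS NULL
      AND g.polyphen IN ('possibly damaging', 'probably damaging')
    GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop
""").fetchall()
conn.close()
```

The `LEFT OUTER JOIN ... WHERE p.pos IS NULL` pair is an anti-join: it keeps only genotypes with no matching dbSNP row, which is the SQL counterpart of the "not in dbSNP" predicate.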
ADAM preliminary performance
1. Somebody will build on your code
2. You should have assembled a team to build your software
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely
6. Your “stable URL” can exist forever
7. You should make your software “idiot proof”
8. You used the right programming language for the task.
Lior Pachter, “Myths of Bioinformatics Software”
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/
Acknowledgements
UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer
Tamr: Timothy Danford
MSSM: Jeff Hammerbacher, Ryan Williams
Cloudera: Tom White, Sandy Ryza