Petascale Genomics (Strata Singapore 20151203)



Page 1: Petascale Genomics (Strata Singapore 20151203)

1© Cloudera, Inc. All rights reserved.

Scaling Up Genomics with Hadoop and Spark

Uri Laserson | @laserson | 14 November 2015

Petascale Genomics

Page 2: Petascale Genomics (Strata Singapore 20151203)


We come in peace.

Pioneer plaque

Page 3: Petascale Genomics (Strata Singapore 20151203)


What is genomics?

Page 4: Petascale Genomics (Strata Singapore 20151203)


Organism

Page 5: Petascale Genomics (Strata Singapore 20151203)


Organism Cell

Page 6: Petascale Genomics (Strata Singapore 20151203)


Organism Cell Genome

Page 9: Petascale Genomics (Strata Singapore 20151203)


Reference chromosome

Page 10: Petascale Genomics (Strata Singapore 20151203)


Reference chromosome

Location

Page 11: Petascale Genomics (Strata Singapore 20151203)

“… decoding the Book of Life”

Page 12: Petascale Genomics (Strata Singapore 20151203)


...atatggaaccaaaaaagagcccgcatcgccaaggcaatcctaagccaaaagaacaaagctggaggcatcacactacctgacttcaaactatactaca

agcctacagtaaccaaaacagcatggtactggtaccaaaacagagatatagatcaatggaacagaacagagccctcagaaataacgccgcatatctacaa

ctatctgatctttgacgaacctgagaaaaacaagcaatggggaaaggattccctatttaataaatggtgctgggaaaactggctagccatatgtagaaag

ctgaaactggatcccttccttacaccttatacaaaaatcaattcaagatggattaaagacttaaacgttagacctaaaaccataaaaaccctagaagaaa

acctaggcagtaccattcaggacataggcatgggcaaggacttcatgtccaaaacaccaaaagcaatggcaacaaaagacaaaattgacaaatgggatct

aattaaactaaagagcttctgcacagcaaaagaaactaccatcagagtgaacaggaaacctacaaaatgggagaaaattttcgcaacctactcatctgac

aaagggctaatatccagaatctacaatgaactcaaacaaatttacaagaaaaaaacaaacaaccccatcaaaaagtgggcaaaggacatgaacagacact

tctcaaatgaagacatttatgcagccaaaaaacacatgaaaaaatgctcatcatcactggccatcagagaaatgcaaatcaaaaccacaatgagatacca

tctcacaccagttagaatggcaatcattaaaaagtcaggaaacaacaggtgctggagaggatgtggagaaataggaacacttttacactgttggtgggac

tgtaaactagttcaaccattgtggaagtcagtgtggtgattcctcagggatctagaactagaaataccatttgacccagccatcccattactgggtatat

acccaaaggactataaatcatgctgctataaagacacatgcacacgtatgtttattgcggcattattcacaatagcaaagacttggaaccaacccaaatg

tccaacaatgataaactggattaagaaaatgtggcacatatacaccatggaatactctgcagccataaaaaaggatgagttcatgtcctttgtagggaca

tggatgaaattggaaatcatcattctcagtaaactatcgcaagaataaaaaaccaaacaccgcatattctcactcataggtgggaattgaacaatgagat

cacatggacacaggaagaggaatatcacactctggggactgtggtggggtggggggaggggggagggatagcattgggagatatacctaatgctagatga

cgagttagtgggtgcagcgcaccagcatggcacatgtatacatatgtaactaacctgcacattgtgcacatgtaccctaaaacttaaagtataataaaaa

aataaaaaaaataaagtgtgtgtgtgtatgactttaattaacttgatcacccacacacacacaaacactgaccaaaattaatatcaagtcaggtctgtct

gaatgtaaagccaacagcaaacatccctctctccaaatggaaaagaaacagggggttatgggcagctacactgctaaatgttaaaactttatttttaaat

gtggccataaaaatcactaaataaaattgataatatatgtttttgatgaataaattttatatatgtctacactggaaactatatagcaataaaaactaac

catgtacaactaaactcataaatttcataaacataataagtaaaagaagccagacaaaaagtagtgtatactgttaaattccatttatataaaagttcaa

aaaagccaaaaagaaactatgctgttaaaagtaaggattatagttactattcagggaagagagtagtggctggaaagaaacataaagggggtctctgaag

tggaataatgttctgttttttgatctgggtattagggtgtttaatttcggaaaattattttatctttatacttattgtattattgattttttgcttaaca

aattactcaaaacttagaggtttaaaaaaaattaattattgtattaatttctctgggccaggaattggagagagcttagctgggtagttctggttcaaaa

tttctcatgagattaccgtcaagctgttggagggggctgcatcatctgaaggcttgaccgaggctagaggatctactttcaagatggcccactcacatgg

ctgttggcaagaagtttcagtttctcactagcttctagcaggaggccataatttctcaccacatagatctctctatagggctactcgagtgtcctcacag

caaggtagctggctttcttcagagccaagtgactcaaaggcaaagaggaagtcactatgccatttatgacctagttttggaactcacactttgttccgaa

ttgaccttccatcactttctagtcattaggatttaagtcactaactctgatccatagtcaaggggagtaaaatttggctttattgttggaggatggagta

gcaaagaatttgttgacacattttaaaactaccatacttaaacagttcatttttctgaatatgcttcaattagaagttaaaatgatgcaattttaaaaca

ttgtttcaaatgaacactgttagggagagaagtgcttcttctccatatctaatgtttcttccatatttagggagttccattagtttaacactttaag...

Page 18: Petascale Genomics (Strata Singapore 20151203)

>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT

Page 19: Petascale Genomics (Strata Singapore 20151203)

>read1
TTGGACATTTCGGGGTCTCAGATT
>read2
AATGTTGTTAGAGATCCGGGATTT
>read3
GGATTCCCCGCCGTTTGAGAGCCT
>read4
AGGTTGGTACCGCGAAAAGCGCAT

Bioinformatics!
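The reads on this slide are in FASTA layout: a `>` header line followed by sequence. A minimal Python sketch of reading such records is shown here; the function name `parse_fasta` is an illustrative assumption, not part of any tool named in the talk.

```python
from typing import Dict, Iterable, Optional


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Collect FASTA records into a {read_name: sequence} mapping."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record.
            name = line[1:]
            records[name] = ""
        elif name is not None:
            # Sequences may wrap across several lines; concatenate them.
            records[name] += line.upper()
    return records


reads = parse_fasta([
    ">read1", "TTGGACATTTCGGGGTCTCAGATT",
    ">read2", "AATGTTGTTAGAGATCCGGGATTT",
])
```

Each record ends up as a plain string keyed by read name, which is about as much structure as the format itself provides.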

Page 21: Petascale Genomics (Strata Singapore 20151203)

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Pipelines!

Page 22: Petascale Genomics (Strata Singapore 20151203)


##fileformat=VCFv4.1

##fileDate=20090805

##source=myImputationProgramV3.1

##reference=file:///seq/references/1000GenomesPilot-NCBI36.fasta

##contig=<ID=20,length=62435964,assembly=B36,md5=f126cdf8a6e0c7f379d618ff66beb2da,species="Homo sapiens",taxonomy=x>

##phasing=partial

##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">

##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">

##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">

##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">

##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">

##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">

##FILTER=<ID=q10,Description="Quality below 10">

##FILTER=<ID=s50,Description="Less than 50% of samples have data">

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">

##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">

##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">

#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.

20 17330 . T A 3 q10 NS=3;DP=11;AF=0.017 GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3 0/0:41:3

20 1110696 rs6040355 A G,T 67 PASS NS=2;DP=10;AF=0.333,0.667 GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2 2/2:35:4

20 1230237 . T . 47 PASS NS=3;DP=13;AA=T GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2

Compressed text files (non-splittable)

Semi-structured

Poorly specified
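To make the "semi-structured" point concrete, here is a rough Python sketch that parses one data line from the VCF example above. The helper names (`parse_info`, `parse_vcf_line`) are assumptions for illustration, and the whitespace split stands in for the tab delimiter a real VCF file uses.

```python
from typing import Any, Dict, List

VCF_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info: str) -> Dict[str, Any]:
    """Split 'NS=3;DP=14;DB' into a dict; bare keys are flags set to True."""
    out: Dict[str, Any] = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        out[key] = value if sep else True
    return out


def parse_vcf_line(line: str, samples: List[str]) -> Dict[str, Any]:
    fields = line.split()  # whitespace split; real VCF is tab-delimited
    record: Dict[str, Any] = dict(zip(VCF_COLUMNS, fields[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = record["ALT"].split(",")
    record["INFO"] = parse_info(record["INFO"])
    keys = record["FORMAT"].split(":")
    # Trailing per-sample fields may be omitted, so zip() stops early.
    record["samples"] = {
        name: dict(zip(keys, call.split(":")))
        for name, call in zip(samples, fields[9:])
    }
    return record


line = ("20 14370 rs6054257 G A 29 PASS NS=3;DP=14;AF=0.5;DB;H2 "
        "GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.")
rec = parse_vcf_line(line, ["NA00001", "NA00002", "NA00003"])
```

Even this small parser needs three nested mini-grammars (columns, `INFO` key/value pairs, per-sample `FORMAT` fields), and every value arrives as a string whose type is declared only in the header.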

Page 23: Petascale Genomics (Strata Singapore 20151203)

Compressed text files (non-splittable)

Semi-structured

Poorly specified

Global sort order

Page 24: Petascale Genomics (Strata Singapore 20151203)


C, HPC (scheduler), POSIX filesystem

Java, HPC (queue), POSIX filesystem

C++, single-node, SQLite

It’s file formats all the way down!

Page 25: Petascale Genomics (Strata Singapore 20151203)


Dedup

Page 26: Petascale Genomics (Strata Singapore 20151203)

/**
 * Main work method. Reads the BAM file once and collects sorted information about
 * the 5' ends of both ends of each read (or just one end in the case of pairs).
 * Then makes a pass through those determining duplicates before re-reading the
 * input file and writing it out with duplication flags set correctly.
 */
protected int doWork() {
    // build some data structures
    buildSortedReadEndLists(useBarcodes);
    generateDuplicateIndexes(useBarcodes);

    final SAMFileWriter out =
        new SAMFileWriterFactory().makeSAMOrBAMWriter(outputHeader, true, OUTPUT);

    final CloseableIterator<SAMRecord> iterator = headerAndIterator.iterator;
    while (iterator.hasNext()) {
        final SAMRecord rec = iterator.next();
        if (!rec.isSecondaryOrSupplementary()) {
            if (recordInFileIndex == nextDuplicateIndex) {
                rec.setDuplicateReadFlag(true);
                // Now try and figure out the next duplicate index
                if (this.duplicateIndexes.hasNext()) {
                    nextDuplicateIndex = this.duplicateIndexes.next();
                } else {
                    // Only happens once we've marked all the duplicates
                    nextDuplicateIndex = -1;
                }
            } else {
                rec.setDuplicateReadFlag(false);

Method

Code
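Stripped of file handling, the method described in the comment above groups reads by where their 5' ends fall and flags all but one read per group. The following Python sketch shows that idea only; the `Read` fields and the "keep the highest-quality read" tie-break are simplifying assumptions, not the Java tool's actual logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Read:
    name: str
    chrom: str
    five_prime_pos: int   # position of the read's 5' end
    reverse_strand: bool
    quality: int          # assumed score, e.g. sum of base qualities
    duplicate: bool = False


def mark_duplicates(reads: List[Read]) -> None:
    """Flag all but the best-scoring read at each (chrom, 5' end, strand)."""
    best: Dict[Tuple[str, int, bool], Read] = {}
    for read in reads:
        key = (read.chrom, read.five_prime_pos, read.reverse_strand)
        current = best.get(key)
        if current is None or read.quality > current.quality:
            best[key] = read
    for read in reads:
        key = (read.chrom, read.five_prime_pos, read.reverse_strand)
        read.duplicate = best[key] is not read


reads = [
    Read("r1", "20", 100, False, 30),
    Read("r2", "20", 100, False, 45),
    Read("r3", "20", 100, True, 20),
    Read("r4", "20", 250, False, 10),
]
mark_duplicates(reads)
```

The algorithm itself is a group-by on a key followed by a per-group selection, which is why it maps naturally onto a distributed shuffle rather than a hand-managed sort on local disk.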

Page 28: Petascale Genomics (Strata Singapore 20151203)

@Option(shortName = "MAX_FILE_HANDLES",
        doc = "Maximum number of file handles to keep open when spilling " +
              "read ends to disk. Set this number a little lower than the " +
              "per-process maximum number of file that may be open. This " +
              "number can be found by executing the 'ulimit -n' command on " +
              "a Unix system.")
public int MAX_FILE_HANDLES_FOR_READ_ENDS_MAP = 8000;
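The option above asks the user to look up a platform limit by hand. As an illustration of how platform-specific that is, this Python sketch reads the same limit that `ulimit -n` reports and derives a budget slightly below it. The function name `handle_budget`, the reserve of 200, and the fallback value are assumptions chosen for the example; the `resource` module is available only on Unix-like systems.

```python
import resource  # standard library, Unix-only


def handle_budget(soft_limit: int, reserve: int = 200, fallback: int = 8000) -> int:
    """Return a file-handle budget a little below the per-process limit."""
    if soft_limit == resource.RLIM_INFINITY:
        return fallback  # no limit reported; fall back to a fixed value
    return max(1, soft_limit - reserve)


# RLIMIT_NOFILE is the limit that `ulimit -n` reports.
soft_limit, _hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
budget = handle_budget(soft_limit)
```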

Page 29: Petascale Genomics (Strata Singapore 20151203)

Dedup

Method/Algo

Code

Platform

Page 30: Petascale Genomics (Strata Singapore 20151203)

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Page 31: Petascale Genomics (Strata Singapore 20151203)


It’s pipelines all the way down!

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Page 32: Petascale Genomics (Strata Singapore 20151203)


It’s pipelines all the way down!

Node 1: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Node 2: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Node 3: Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Page 33: Petascale Genomics (Strata Singapore 20151203)


Manually running pipelines on HPC

$ bsub -q shared_12h python split_genotypes.py

$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_1.vcf agg1.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_2.vcf agg2.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_3.vcf agg3.csv
$ bsub -q shared_12h -R mem=4g python query_agg.py genotypes_4.vcf agg4.csv

$ bsub -q shared_12h python merge_maf.py
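The three job submissions follow a scatter/gather pattern: split the input, run the same aggregation on each piece, then merge. The Python sketch below reproduces that shape in a single process; the record layout and the function names `split`, `aggregate`, and `merge` are hypothetical stand-ins for the scripts named on the slide, whose contents are not shown.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

Genotype = Tuple[str, int]  # (variant id, alternate-allele count); assumed layout


def split(genotypes: List[Genotype], n_chunks: int) -> List[List[Genotype]]:
    """Scatter: deal records round-robin into n_chunks lists."""
    return [genotypes[i::n_chunks] for i in range(n_chunks)]


def aggregate(chunk: List[Genotype]) -> Counter:
    """Per-chunk job: total alternate-allele count per variant."""
    counts: Counter = Counter()
    for variant, alt_count in chunk:
        counts[variant] += alt_count
    return counts


def merge(partials: List[Counter]) -> Counter:
    """Gather: add the per-chunk counters together."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


genotypes = [("rs1", 1), ("rs2", 0), ("rs1", 2), ("rs2", 1), ("rs1", 0)]
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(aggregate, split(genotypes, 4)))
totals = merge(partials)
```

On an HPC queue each of these steps is a separate submission with intermediate files in between, and the user has to track which jobs finished; a framework that owns the scatter and gather steps removes that bookkeeping.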

Page 35: Petascale Genomics (Strata Singapore 20151203)

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Alignment → Dedup → Recalibrate → QC/Filter

Alignment → Dedup → Recalibrate → QC/Filter

Page 36: Petascale Genomics (Strata Singapore 20151203)

Node 1

Alignment → Dedup → Recalibrate → QC/Filter → Variant Calling → Variant Annotation

Node 2

Node 3

Alignment → Dedup → Recalibrate → QC/Filter

Alignment → Dedup → Recalibrate → QC/Filter

Node 4

Page 37: Petascale Genomics (Strata Singapore 20151203)

Node 1

Alignment → Dedup → QC/Filter → Variant Calling → Variant Annotation

Node 2

Node 3

Alignment → Dedup → QC/Filter

Alignment → Dedup → QC/Filter

Node 4

Recalibrate

Page 38: Petascale Genomics (Strata Singapore 20151203)


Why Are We Still Defining File Formats By Hand?

• Instead of defining custom file formats for each data type and access pattern…

• Parquet creates a compressed format for each Avro-defined data model

• Improvements over existing formats
  • ~20% for BAM
  • ~90% for VCF
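One reason a columnar format can compress well is that values of the same field sit next to each other, so encodings such as run-length encoding (one of the encodings Parquet uses) become effective. The Python sketch below illustrates that idea with a toy record type; the `Variant` class and its values are invented for the example and are not the Avro schemas used by the project.

```python
from dataclasses import dataclass, fields
from itertools import groupby
from typing import Dict, List, Tuple


@dataclass
class Variant:          # hypothetical, simplified data model
    chrom: str
    pos: int
    ref: str
    alt: str


def to_columns(records: List[Variant]) -> Dict[str, list]:
    """Pivot row-oriented records into one list per field."""
    return {f.name: [getattr(r, f.name) for r in records] for f in fields(Variant)}


def run_length_encode(column: list) -> List[Tuple[object, int]]:
    """Collapse runs of equal adjacent values into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(column)]


records = [            # toy values for illustration only
    Variant("20", 14370, "G", "A"),
    Variant("20", 17330, "T", "A"),
    Variant("20", 1110696, "A", "G"),
    Variant("21", 9411239, "C", "T"),
]
columns = to_columns(records)
encoded_chrom = run_length_encode(columns["chrom"])
```

In sorted genomic data the chromosome column consists of a handful of very long runs, so it shrinks to a few pairs; a row-oriented text format repeats the value on every line.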

Page 39: Petascale Genomics (Strata Singapore 20151203)

ContEst Algorithm

YARN-managed Hadoop cluster

Spark executors: each evaluates $P(b_{ij} \mid e_{ij}, f_i)$ for $j = 1, \dots, d_i$ and returns a partial sum.

Driver (application code): combines the partial sums over $i = 1, \dots, N$ and $j = 1, \dots, d_i$.
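The executor/driver split in this diagram can be sketched in plain Python as a map step that produces partial sums and a reduce step that combines them. This is a sketch under stated assumptions: it sums log-probabilities, which the slide does not spell out, the probability values are invented, and the actual ContEst model is not reproduced.

```python
import math
from functools import reduce
from typing import List

# Each inner list stands for one partition of per-base probabilities
# P(b_ij | e_ij, f_i); the numbers are invented for illustration.
partitions: List[List[float]] = [
    [0.9, 0.8],
    [0.99, 0.5, 0.7],
    [0.6],
]


def partial_sum(partition: List[float]) -> float:
    """Executor-side work: sum log-probabilities within one partition."""
    return sum(math.log(p) for p in partition)


def combine(a: float, b: float) -> float:
    """Driver-side work: add two partial sums."""
    return a + b


partials = [partial_sum(p) for p in partitions]   # would run on executors
total = reduce(combine, partials, 0.0)            # would run on the driver
```

Because addition is associative and commutative, the partial sums can be computed in any order and on any number of executors, and only one number per partition travels back to the driver.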

Page 40: Petascale Genomics (Strata Singapore 20151203)


Hadoop provides layered abstractions for data processing

HDFS (scalable, distributed storage)

YARN (resource management)

MapReduce Impala (SQL) Solr (search) Spark

ADAM, quince, guacamole, …

bdg-formats (Avro/Parquet)

Page 41: Petascale Genomics (Strata Singapore 20151203)

Spark + Genomics = ADAM

• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats

Page 42: Petascale Genomics (Strata Singapore 20151203)


Core Genomics Primitives: Spatial Join
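A spatial (region) join pairs each genomic position with the intervals that contain it. The Python sketch below shows one common way to do this, by bucketing intervals into fixed-width bins so each site only checks nearby regions. It is an illustration of the primitive, not ADAM's implementation; `BIN_SIZE`, the tuple layouts, and the function name are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Region = Tuple[str, int, int]   # (chrom, start, end), half-open [start, end)
Site = Tuple[str, int]          # (chrom, pos)

BIN_SIZE = 1000  # arbitrary bin width chosen for this sketch


def region_join(sites: List[Site], regions: List[Region]) -> List[Tuple[Site, Region]]:
    """Pair every site with each region that contains it."""
    # Index regions by every fixed-width bin they touch.
    bins: Dict[Tuple[str, int], List[Region]] = defaultdict(list)
    for chrom, start, end in regions:
        for b in range(start // BIN_SIZE, (end - 1) // BIN_SIZE + 1):
            bins[(chrom, b)].append((chrom, start, end))
    # Look up each site's bin, then check exact containment.
    joined = []
    for chrom, pos in sites:
        for region in bins.get((chrom, pos // BIN_SIZE), []):
            if region[1] <= pos < region[2]:
                joined.append(((chrom, pos), region))
    return joined


regions = [("20", 500, 2500), ("20", 14000, 15000), ("21", 0, 100)]
sites = [("20", 14370), ("20", 2500), ("20", 999), ("21", 50), ("22", 50)]
pairs = region_join(sites, regions)
```

The bin key `(chrom, bin)` is also a natural partitioning key, which is what lets a join like this be distributed: records that can possibly overlap land in the same partition.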

Page 43: Petascale Genomics (Strata Singapore 20151203)


Executing query in Hadoop: interactive Spark shell (ADAM)

def inDbSnp(g: Genotype): Boolean = ???  // returns true or false
def isDeleterious(g: Genotype): Boolean = g.getPolyPhen

// load data
val samples = sc.textFile("path/to/samples").map(parseJson(_)).collect()
val dbsnp = sc.textFile("path/to/dbSNP").map(_.split(",")).collect()
val dnaseRDD = sc.adamBEDFeatureLoad("path/to/dnase")
val genotypesRDD = sc.adamLoad("path/to/genotypes")

// apply predicates
val filteredRDD = genotypesRDD
  .filter(!inDbSnp(_))
  .filter(isDeleterious(_))
  .filter(isFramingham(_))

// join data
val joinedRDD = RegionJoin.partitionAndJoin(sc, filteredRDD, dnaseRDD)

// group-by, then aggregate (MAF)
val maf = joinedRDD
  .keyBy(x => (x.getVariant, getPopulation(x)))
  .groupByKey()
  .map(computeMAF(_))

// persist data
maf.saveAsNewAPIHadoopFile("path/to/output")
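The group-by and aggregate step can be illustrated outside Spark. This Python sketch computes a minor allele frequency (MAF) per (variant, population) key, assuming diploid samples and an input of alternate-allele counts; the tuple layout, the population labels, and the function name `compute_maf` are hypothetical and do not reproduce the `computeMAF` helper referenced on the slide.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (variant, population, alt-allele count for one diploid sample: 0, 1 or 2)
Call = Tuple[str, str, int]


def compute_maf(calls: List[Call]) -> Dict[Tuple[str, str], float]:
    """Minor allele frequency per (variant, population) group."""
    alt_alleles: Dict[Tuple[str, str], int] = defaultdict(int)
    samples: Dict[Tuple[str, str], int] = defaultdict(int)
    for variant, population, alt_count in calls:
        alt_alleles[(variant, population)] += alt_count
        samples[(variant, population)] += 1
    maf = {}
    for key, n_samples in samples.items():
        alt_freq = alt_alleles[key] / (2 * n_samples)  # two alleles per sample
        maf[key] = min(alt_freq, 1.0 - alt_freq)
    return maf


calls = [
    ("20:14370:G>A", "EUR", 0),
    ("20:14370:G>A", "EUR", 1),
    ("20:14370:G>A", "ASN", 2),
    ("20:14370:G>A", "ASN", 2),
]
maf = compute_maf(calls)
```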

Page 44: Petascale Genomics (Strata Singapore 20151203)


Executing query in Hadoop: distributed SQL

SELECT g.chr, g.pos, g.ref, g.alt, s.pop, MAF(g.call)   -- aggregate (UDAF)
FROM genotypes g                                        -- "load" and join data
INNER JOIN samples s
    ON g.sample = s.sample
INNER JOIN dnase d
    ON g.chr = d.chr
    AND g.pos >= d.start
    AND g.pos < d.end
LEFT OUTER JOIN dbsnp p
    ON g.chr = p.chr
    AND g.pos = p.pos
    AND g.ref = p.ref
    AND g.alt = p.alt
WHERE                                                   -- apply predicates
    s.study = "framingham" AND
    p.pos IS NULL AND
    g.polyphen IN ( "possibly damaging", "probably damaging" )
GROUP BY g.chr, g.pos, g.ref, g.alt, s.pop              -- group-by
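The shape of this query, a range join plus a user-defined aggregate, can be tried on a laptop with Python's built-in `sqlite3` module. The sketch below is a scaled-down stand-in, not the distributed engine from the talk: the table contents are invented, only two of the tables are modelled, `call` is assumed to hold an alternate-allele count per diploid sample, and the `Maf` class is a hypothetical implementation of the `MAF` aggregate.

```python
import sqlite3
from typing import Optional


class Maf:
    """User-defined aggregate: minor allele frequency from alt-allele counts."""

    def __init__(self) -> None:
        self.alt = 0
        self.n = 0

    def step(self, alt_count: int) -> None:
        self.alt += alt_count
        self.n += 1

    def finalize(self) -> Optional[float]:
        if self.n == 0:
            return None
        freq = self.alt / (2 * self.n)  # two alleles per diploid sample
        return min(freq, 1 - freq)


con = sqlite3.connect(":memory:")
con.create_aggregate("MAF", 1, Maf)
con.executescript("""
    CREATE TABLE genotypes (chr TEXT, pos INT, sample TEXT, call INT);
    CREATE TABLE dnase (chr TEXT, start INT, "end" INT);
    INSERT INTO genotypes VALUES ('20', 14370, 's1', 1), ('20', 14370, 's2', 0),
                                 ('20', 99999, 's1', 2);
    INSERT INTO dnase VALUES ('20', 14000, 15000);
""")
rows = con.execute("""
    SELECT g.chr, g.pos, MAF(g.call)
    FROM genotypes g
    INNER JOIN dnase d
        ON g.chr = d.chr AND g.pos >= d.start AND g.pos < d."end"
    GROUP BY g.chr, g.pos
""").fetchall()
```

The genotype at position 99999 falls outside the only DNase region and drops out of the join, so a single group remains.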

Page 45: Petascale Genomics (Strata Singapore 20151203)


ADAM preliminary performance

Page 46: Petascale Genomics (Strata Singapore 20151203)

“Myths of Bioinformatics Software”

1. Somebody will build on your code.
2. You should have assembled a team to build your software.
3. If you choose the right license, more people will use and build on your software.
4. Making software free for commercial use shows you are not against companies.
5. You should maintain your software indefinitely.
6. Your “stable URL” can exist forever.
7. You should make your software “idiot proof”.
8. You used the right programming language for the task.

Lior Pachter
https://liorpachter.wordpress.com/2015/07/10/the-myths-of-bioinformatics-software/

Page 48: Petascale Genomics (Strata Singapore 20151203)


Acknowledgements

UC Berkeley: Matt Massie, Frank Nothaft, Michael Heuer

Tamr: Timothy Danford

MSSM: Jeff Hammerbacher, Ryan Williams

Cloudera: Tom White, Sandy Ryza

Page 49: Petascale Genomics (Strata Singapore 20151203)

Thank you

@laserson

[email protected]