![Page 1: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/1.jpg)
Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science
Jimmy Lin, Michael Schatz, and Ben Langmead
University of Maryland
Wednesday, June 10, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.
![Page 2: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/2.jpg)
Cloud Computing @ Maryland
Teaching
• Cloud computing course (version 1.0): Spring 2008. Part of the Google/IBM Academic Cloud Computing Initiative
• Cloud computing course (version 2.0): Fall 2008. Sponsored by Amazon Web Services through a teaching grant
Research
• Web-scale text processing
• Statistical machine translation
• Bioinformatics
![Page 3: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/3.jpg)
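Learning Translation Models
[Figure: word alignment between a Spanish and an English sentence. Spanish: Maria | no dio | una | bofetada | bruja | verde. English: Mary | did not | a | slap | witch | green. Extracted phrase pair: “green witch” ↔ “bruja verde”.]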
Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.
Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.
Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.
These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.
We built systems for “learning” translation models in Hadoop… sort of like the word count example, but with more math.
![Page 4: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/4.jpg)
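Translation as a “Tiling” Problem
Source sentence: Maria no dio una bofetada a la bruja verde
[Figure: candidate English phrases tiled over the source sentence. Word-for-word row: Mary not give a slap to the witch green. Other candidate tiles: did not; no; did not give; slap; a slap; to the; to; the; by; green witch; the witch.]
Example from Koehn (2006)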
![Page 5: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/5.jpg)
From Text to DNA Sequences
• Text processing: [0-9A-Za-z]+
• DNA sequence processing: [ATCG]+
Easier, right?
(Nope, not really)
Michael Schatz (Ph.D. student, Computer Science; Spring 2008)
Ben Langmead (M.S. student, Computer Science; Fall 2008)
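The two patterns on this slide can be tried directly. The sketch below is only an illustration (the helper names `text_tokens` and `dna_runs` are made up here, not from the talk): the text pattern splits a sentence into words, while the DNA pattern returns each read as one unbroken token, which hints at why the simpler alphabet does not make the problem easier.

```python
import re

# The two patterns shown on the slide.
TEXT_TOKEN = re.compile(r"[0-9A-Za-z]+")
DNA_TOKEN = re.compile(r"[ATCG]+")

def text_tokens(s):
    """Split ordinary text into alphanumeric tokens (words)."""
    return TEXT_TOKEN.findall(s)

def dna_runs(s):
    """Extract maximal runs of A/T/C/G; a whole read comes back as one token."""
    return DNA_TOKEN.findall(s.upper())

words = text_tokens("It was the best of times, it was the worst of times")
runs = dna_runs("GATGCTTACTATGCGGGCCCC\nCGGTCTAATGCTTACTATGC")

print(len(words), "word tokens:", words[:5], "...")
print(len(runs), "DNA tokens:", runs)
```

Text comes with natural units (words separated by spaces and punctuation); a DNA read has no such boundaries, so any "words" have to be imposed, for example as fixed-length substrings.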
![Page 6: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/6.jpg)
Analogy
(And two disclaimers)
![Page 7: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/7.jpg)
Strangely-Formatted Manuscript
• Dickens: A Tale of Two Cities
• Text written on a long spool
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
![Page 8: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/8.jpg)
… With Duplicates
• Dickens: A Tale of Two Cities
• “Backup” on four more copies
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
![Page 9: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/9.jpg)
Shredded Book Reconstruction
• Dickens accidentally shreds the manuscript
• How can he reconstruct the text?
• 5 copies × 138,656 words / 5 words per fragment = 138,656 (~138k) fragments
• The short fragments from every copy are mixed together
• Some fragments are identical
[Figure: the five copies of the opening sentence, each cut into five-word fragments at a different offset, for example “It was the best of”, “of times, it was the”, “times, it was the worst”. Identical fragments such as “of times, it was the” and “it was the age of” occur more than once.]
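The shredding step can be mimicked in a few lines. This is a toy sketch under stated assumptions: it uses only the opening sentence rather than the whole book, and the per-copy cut offset and the helper name `shred` are hypothetical choices for illustration. It also reproduces the fragment-count arithmetic from the slide.

```python
import random

def shred(words, fragment_len, offset):
    """Cut a word list into consecutive fragments.

    A shorter first piece of `offset` words models a copy whose cuts
    start at a different place (an assumption made for this sketch).
    """
    pieces = []
    if offset:
        pieces.append(words[:offset])
    for i in range(offset, len(words), fragment_len):
        pieces.append(words[i:i + fragment_len])
    return [" ".join(p) for p in pieces]

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness,")
words = text.split()
copies = 5

pile = []
for c in range(copies):
    pile.extend(shred(words, 5, offset=c))  # copy c is cut starting c words in
random.Random(0).shuffle(pile)              # fragments from all copies are mixed

# Fragment count for the whole book, as on the slide.
expected_fragments = 5 * 138_656 // 5
print(expected_fragments, "fragments expected for the full text")
print(len(pile), "fragments in the toy pile")
```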
![Page 10: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/10.jpg)
Overlaps
Generally prefer longer overlaps to shorter overlaps
In the presence of error, we might allow the overlapping fragments to differ by a small amount
Fragments:
It was the best of
of times, it was the
best of times, it was
times, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
of times, it was the
times, it was the age
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
the worst of times, it

Example overlaps:
• “It was the best of” followed by “was the best of times,”: 4 word overlap
• “It was the best of” followed by “of times, it was the”: 1 word overlap
• “It was the best of” followed by “of wisdom, it was the”: 1 word overlap
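The overlap counts on this slide can be checked with a small function. This is a minimal sketch that handles exact matches only; allowing fragments to differ slightly, as the slide suggests for noisy data, is left out. The function name `overlap` is chosen here for illustration.

```python
def overlap(a, b):
    """Largest n such that the last n words of `a` equal the first n words of `b`."""
    aw, bw = a.split(), b.split()
    for n in range(min(len(aw), len(bw)), 0, -1):  # try the longest overlap first
        if aw[-n:] == bw[:n]:
            return n
    return 0

first = "It was the best of"
for candidate in ("was the best of times,",
                  "of times, it was the",
                  "of wisdom, it was the"):
    print(f"{first!r} -> {candidate!r}: {overlap(first, candidate)} word overlap")
```

Trying the longest overlap first matches the slide's preference for longer overlaps over shorter ones.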
![Page 11: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/11.jpg)
Greedy Assembly
The repeated sequence makes the correct reconstruction ambiguous
Greedy path through the fragments (figure):
It was the best of
was the best of times,
the best of times, it
best of times, it was
of times, it was the
times, it was the worst
times, it was the age

After “of times, it was the”, both “times, it was the worst” and “times, it was the age” overlap equally well.

Remaining fragments:
it was the worst of
was the worst of times,
the worst of times, it
of times, it was the
it was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age of
was the age of foolishness,
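A greedy assembler in this spirit repeatedly merges the pair of fragments with the longest overlap. The sketch below is one simple way to write it, not necessarily what the figure implements; `overlap` and `greedy_assemble` are illustrative names. It also shows the tie that makes the repeat ambiguous.

```python
def overlap(a, b):
    """Longest suffix of word list `a` that equals a prefix of word list `b`."""
    for n in range(min(len(a), len(b)), 0, -1):
        if a[-n:] == b[:n]:
            return n
    return 0

def greedy_assemble(fragments):
    """Merge the best-overlapping pair until no overlaps remain."""
    frags = [f.split() for f in dict.fromkeys(fragments)]  # drop exact duplicates
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b)
                    if n > best_n:           # ties keep the first pair found
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break                            # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return [" ".join(f) for f in frags]

print(greedy_assemble(["It was the best of",
                       "was the best of times,",
                       "the best of times, it"]))

# The repeat: two different continuations overlap equally well.
repeat = "of times, it was the".split()
tie_worst = overlap(repeat, "times, it was the worst".split())
tie_age = overlap(repeat, "times, it was the age".split())
print(tie_worst, tie_age)
```

Because ties are broken arbitrarily (here, by whichever pair is found first), a greedy merge can join the repeated fragment to the wrong continuation.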
![Page 12: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/12.jpg)
The Real Problem
(The easier version)
![Page 13: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/13.jpg)
[Figure: Subject genome (unknown, shown as “?”) → Sequencer → Reads]
Reads:
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGC
AATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT
![Page 14: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/14.jpg)
DNA Sequencing
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
• The genome of an organism encodes genetic information in a long sequence of 4 DNA nucleotides: A, T, C, G
• Bacteria: ~5 million bp
• Humans: ~3 billion bp
• Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300 bp)
• Shorter reads, but much higher throughput
• Per-base error rate estimated at 1-2% (Simpson et al., 2009)
• Recent studies of entire human genomes have used 3.3 (Wang et al., 2008) and 4.0 (Bentley et al., 2008) billion 36 bp reads
• ~144 GB of compressed sequence data
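The read counts above translate into total sequence and coverage with simple arithmetic. The sketch below uses only the numbers on the slide; the function names are illustrative. Note that 4.0 billion reads × 36 bp is 144 billion bases, which appears to be where the ~144 figure comes from.

```python
HUMAN_GENOME_BP = 3_000_000_000   # "~3 billion bp" from the slide
READ_LENGTH_BP = 36

reads_per_study = {
    "Wang et al., 2008": 3_300_000_000,
    "Bentley et al., 2008": 4_000_000_000,
}

def total_bases(num_reads, read_length=READ_LENGTH_BP):
    """Total number of sequenced bases across all reads."""
    return num_reads * read_length

def coverage(num_reads, genome_bp=HUMAN_GENOME_BP, read_length=READ_LENGTH_BP):
    """Average number of reads covering each base of the genome."""
    return total_bases(num_reads, read_length) / genome_bp

for study, n in reads_per_study.items():
    print(f"{study}: {total_bases(n) / 1e9:.1f} Gbp, about {coverage(n):.1f}x coverage")
```

At the quoted 1-2 Gbp per day, 144 Gbp corresponds to roughly 72 to 144 machine-days of sequencing.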
![Page 15: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/15.jpg)
How do we put Humpty Dumpty back together?
![Page 16: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/16.jpg)
Human Genome
11 years, cost $3 billion… your tax dollars at work!
A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project
![Page 17: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/17.jpg)
Alignment
Reference sequence: CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Subject reads (reads covering the highlighted position carry T where the reference has G):
GCTTATCTAT
TTATCTATGC
ATCTATGCGG
ATCTATGCGG
GCTTATCTAT
TCTAGATGCT
CTATGCGGGC
CTAGATGCTT
ATCTATGCGG
CTATGCGGGC
ATCTATGCGG
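The alignment on this slide can be reproduced with a brute-force scan: slide each read along the reference and count mismatches. This is a naive sketch for illustration only (substitutions only, no insertions or deletions); `align` is a made-up name and this is not the method used by any particular tool.

```python
def align(read, reference, max_mismatches):
    """Return (position, mismatches) for every placement within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        mismatches = 0
        for r, g in zip(read, reference[pos:pos + len(read)]):
            if r != g:
                mismatches += 1
                if mismatches > max_mismatches:
                    break                     # too many differences, give up here
        else:
            hits.append((pos, mismatches))    # loop finished without a break
    return hits

reference = "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"
print(align("TCTAGATGCT", reference, 0))   # a read that matches exactly
print(align("GCTTATCTAT", reference, 1))   # a read carrying the T variant
```

The scan costs time proportional to reference length × read length per read, which is why indexing schemes are needed at genome scale.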
![Page 18: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/18.jpg)
Reference sequence: CGGTCTAGATGCTTATCTATGCGGGCCCCTT
Subject reads:
GCTTATCTAT
TTATCTATGC
ATCTATGCGG
ATCTATGCGG
GCTTATCTAT
TCTAGATGCT
CTATGCGGGC
CTAGATGCTT
ATGCGGGCCC
Partial reads at the left end: CGG, CGGTC, CGGTCT, CGGTCTAG
Partial reads at the right end: GGCCCCTT, GCCCCTT, CCTT, CTT
![Page 19: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/19.jpg)
Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT…
Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT…
Insertion, Deletion, Mutation
![Page 20: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/20.jpg)
CloudBurst
1. Map: Catalog K-mers
• Emit every k-mer in the genome and non-overlapping k-mers in the reads
• Non-overlapping k-mers sufficient to guarantee an alignment will be found
2. Shuffle: Coalesce Seeds
• Hadoop internal shuffle groups together k-mers shared by the reads and the reference
• Conceptually build a hash table of k-mers and their occurrences
3. Reduce: End-to-end alignment
• Locally extend alignment beyond seeds by computing “match distance”
• If read aligns end-to-end, record the alignment
[Figure: Human chromosome 1, Read 1 and Read 2 flow through Map → shuffle → Reduce]
Output:
Read 1, Chromosome 1, 12345-12365
Read 2, Chromosome 1, 12350-12370
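The three phases above can be imitated in memory with plain Python. This is a toy sketch, not CloudBurst itself: it runs in a single process with no Hadoop, handles mismatches only, and assumes all reads have the same length. The seed length is taken as read length divided by (mismatches + 1), so that by the pigeonhole principle at least one non-overlapping seed must match exactly. All function names are hypothetical.

```python
from collections import defaultdict

def map_reference(name, seq, s):
    """Map: emit every length-s k-mer of the reference with its position."""
    for i in range(len(seq) - s + 1):
        yield seq[i:i + s], ("ref", name, i)

def map_read(name, seq, s):
    """Map: emit non-overlapping length-s k-mers of a read with their offsets."""
    for off in range(0, len(seq) - s + 1, s):
        yield seq[off:off + s], ("read", name, off)

def shuffle(pairs):
    """Shuffle: group values by k-mer, like a hash table of seed occurrences."""
    groups = defaultdict(list)
    for kmer, value in pairs:
        groups[kmer].append(value)
    return groups

def reduce_seed(values, refs, reads, max_mismatches):
    """Reduce: extend each shared seed to an end-to-end alignment."""
    ref_hits = [(n, p) for kind, n, p in values if kind == "ref"]
    read_hits = [(n, o) for kind, n, o in values if kind == "read"]
    for read_name, off in read_hits:
        read = reads[read_name]
        for ref_name, pos in ref_hits:
            start = pos - off                  # where the whole read would begin
            if start < 0:
                continue
            window = refs[ref_name][start:start + len(read)]
            if len(window) != len(read):
                continue                       # read would run off the end
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                yield (read_name, ref_name, start, start + len(read) - 1, mismatches)

def cloudburst_sketch(refs, reads, max_mismatches):
    read_len = len(next(iter(reads.values())))
    s = read_len // (max_mismatches + 1)       # pigeonhole seed length
    pairs = []
    for name, seq in refs.items():
        pairs.extend(map_reference(name, seq, s))
    for name, seq in reads.items():
        pairs.extend(map_read(name, seq, s))
    alignments = set()                         # the same alignment may be found via two seeds
    for values in shuffle(pairs).values():
        alignments.update(reduce_seed(values, refs, reads, max_mismatches))
    return sorted(alignments)

refs = {"chr": "CGGTCTAGATGCTTAGCTATGCGGGCCCCTT"}
reads = {"read1": "TCTAGATGCT", "read2": "GCTTATCTAT"}
result = cloudburst_sketch(refs, reads, max_mismatches=1)
for row in result:
    print(row)
```

In a real MapReduce job the grouping in `shuffle` is done by the framework, and each reducer only sees the values for its own k-mers.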
![Page 21: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/21.jpg)
[Figure: Running Time vs Number of Reads on Chr 1. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 16,000); legend: 0, 1, 2, 3, 4]
[Figure: Running Time vs Number of Reads on Chr 22. x-axis: Millions of Reads (0 to 8); y-axis: Runtime (s) (0 to 3,000); legend: 0, 1, 2, 3, 4]
Results from a small, 24-core cluster, with different number of mismatches
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
![Page 22: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/22.jpg)
[Figure: Running Time on EC2 High-CPU Medium Instance Cluster. x-axis: Number of Cores (24, 48, 72, 96); y-axis: Running time (s) (0 to 1,800)]
CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
![Page 23: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/23.jpg)
What’s Next?
(Michael Schatz’s Ph.D. dissertation)
![Page 24: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/24.jpg)
Wait, no reference?
![Page 25: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/25.jpg)
de Bruijn Graph Construction
• Dk = (V, E)
• V = All length-k subfragments (k < l)
• E = Directed edges between consecutive subfragments
• Nodes overlap by k-1 words
• Locally constructed graph reveals the global sequence structure
• Overlaps implicitly computed
Original fragment: It was the best of
Directed edge: It was the best → was the best of
de Bruijn, 1946
Idury and Waterman, 1995
Pevzner, Tang, Waterman, 2001
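The construction can be sketched over word fragments as follows. This is a minimal illustration, assuming fragments are split on whitespace and nodes are tuples of k words; the function name `de_bruijn` is chosen here.

```python
from collections import defaultdict

def de_bruijn(fragments, k):
    """Build a word-level de Bruijn graph.

    Nodes are length-k word tuples; an edge joins each pair of consecutive
    subfragments within a fragment, so linked nodes overlap by k-1 words.
    """
    nodes = set()
    edges = defaultdict(set)
    for fragment in fragments:
        words = fragment.split()
        kmers = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
        nodes.update(kmers)
        for u, v in zip(kmers, kmers[1:]):
            edges[u].add(v)
    return nodes, edges

nodes, edges = de_bruijn(["It was the best of", "was the best of times,"], k=4)
for u, targets in edges.items():
    for v in targets:
        print(" ".join(u), "->", " ".join(v))
```

No pairwise comparison of fragments is needed: two fragments become connected simply because they produce the same node, which is what "overlaps implicitly computed" refers to.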
![Page 26: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/26.jpg)
de Bruijn Graph Assembly
Nodes (figure):
It was the best
was the best of
the best of times,
best of times, it
of times, it was
times, it was the
it was the worst
was the worst of
the worst of times,
worst of times, it
it was the age
was the age of
the age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
the age of foolishness
![Page 27: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/27.jpg)
Compressed de Bruijn Graph
• Unambiguous non-branching paths replaced by single nodes
• An Eulerian traversal of the graph spells a compatible reconstruction of the original text
• There may be many traversals of the graph
• Different sequences can have the same string graph, for example: It was the best of times, it was the worst of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Compressed nodes (figure):
It was the best of times, it
of times, it was the
it was the worst of times, it
it was the age of
the age of wisdom, it was the
the age of foolishness
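Path compression can be sketched on the same sentence. The code below is an illustrative single-machine version, not the Hadoop implementation the talk alludes to: it builds the k = 4 word graph, then merges every chain in which each link is the only way out of one node and the only way into the next. It does not handle a graph that is one pure cycle. The function names are hypothetical.

```python
from collections import defaultdict

def build_graph(words, k):
    """Word-level de Bruijn graph with successor and predecessor maps."""
    kmers = [tuple(words[i:i + k]) for i in range(len(words) - k + 1)]
    succ, pred = defaultdict(set), defaultdict(set)
    for u, v in zip(kmers, kmers[1:]):
        succ[u].add(v)
        pred[v].add(u)
    return set(kmers), succ, pred

def compress(nodes, succ, pred):
    """Return maximal non-branching paths, each as a list of nodes."""
    def is_chain_link(u, v):
        # u -> v is the only edge out of u and the only edge into v
        return len(succ[u]) == 1 and len(pred[v]) == 1

    paths = []
    for start in nodes:
        if len(pred[start]) == 1:
            (p,) = pred[start]
            if is_chain_link(p, start):
                continue              # start gets absorbed into an earlier path
        path, cur = [start], start
        while len(succ[cur]) == 1:
            (nxt,) = succ[cur]
            if not is_chain_link(cur, nxt):
                break                 # branch point reached
            path.append(nxt)
            cur = nxt
        paths.append(path)
    return paths

def spell(path):
    """Spell a path: the first node plus the last word of each later node."""
    return " ".join(path[0] + tuple(node[-1] for node in path[1:]))

text = ("It was the best of times, it was the worst of times, "
        "it was the age of wisdom, it was the age of foolishness")
nodes, succ, pred = build_graph(text.split(), k=4)
compressed = sorted(spell(p) for p in compress(nodes, succ, pred))
for node in compressed:
    print(node)
```

Reconstructing the text then amounts to choosing an Eulerian traversal of the compressed graph; this sketch stops at the compression step.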
![Page 28: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/28.jpg)
Hadoopification…
(Stay tuned!)
![Page 29: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/29.jpg)
Cloud worthy?
![Page 30: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/30.jpg)
How much data?
![Page 31: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/31.jpg)
Bottom Line: Bioinformatics
• Great use case of Hadoop
• Interesting computer science problems
• Help unravel life’s mysteries?
![Page 32: Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland](https://reader036.vdocuments.mx/reader036/viewer/2022062714/56649d355503460f94a0cd88/html5/thumbnails/32.jpg)
Questions? Comments?
Thanks to the organizations who support our work: