Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science
Jimmy Lin, Michael Schatz, and Ben LangmeadUniversity of Maryland
Wednesday, June 10, 2009
This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details
Cloud Computing @ Maryland Teaching
Cloud computing course (version 1.0): Spring 2008Part of the Google/IBM Academic Cloud Computing Initiative
Cloud computing course (version 2.0): Fall 2008Sponsored by Amazon Web Services through a teaching grant
Research Web-scale text processing Statistical machine translation Bioinformatics
Maria
no dio
una
bofetada
bruja
verde
Mary
did not
a
slap
witch
green
green witchbruja verde
Learning Translation Models
Prodi ha erigido hoy un verdadero muro contra esas acciones, espero que el Sr. Moscovici lo haya comprendido bien, y realmente también espero que esta tendencia se rompa en los Consejos de Biarritz y de Niza, y se rectifique.
Mr Prodi has put an emphatic stop to this kind of action, which has hopefully resonated with Mr Moscovici, and I truly hope that this trend can be broken and reversed at the Councils in Nice and Biarritz.
Esas negociaciones sabemos que son muy difíciles y hacen temer un fracaso o un acuerdo de mínimos en Niza, lo que sería aún más grave y usted ya lo ha dicho, señor Ministro.
These are, as we know, very tricky negotiations and raise fears of a setback or a watered-down agreement in Nice which, as you have already acknowledged, Mr Moscovici, would be even more serious.
We built systems for “learning” translation models in Hadoop…… sort of like the word count example, but with more math
Maria no dio una bofetada a la bruja verde
Mary not
did not
no
did not give
give a slap to the witch green
slap
a slap
to the
to
the
green witch
the witch
by
Example from Koehn (2006)
slap
Translation as a “Tiling” Problem
From Text to DNA Sequences Text processing: [0-9A-Za-z]+
DNA sequence processing: [ATCG]+
Easier, right?
(Nope, not really)
Michael Schatz (Ph.D. student, Computer Science; Spring 2008)Ben Langmead (M.S. student, Computer Science; Fall 2008)
Analogy(And two disclaimers)
Strangely-Formatted Manuscript Dickens: A Tale of Two Cities
Text written on a long spool
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
… With Duplicates Dickens: A Tale of Two Cities
“Backup” on four more copies
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
Shredded Book Reconstruction Dickens accidently shreds the manuscript
How can he reconstruct the text? 5 copies x 138,656 words / 5 words per fragment = 138k
fragments The short fragments from every copy are mixed together Some fragments are identical
It was the best of of times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it wasof times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age of foolishness, …
It was was the worst of times,the best of times, it it was the age of wisdom, it was the age of foolishness, …
It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of times, it was the worst of times, it was the age of wisdom, it was the age of foolishness, …
It was the best of of times, it was theof times, it was thetimes, it was the worst age of wisdom, it was the age of foolishness, …
It was the best worst of times, it wasof times, it was theof times, it was the the age of wisdom, it was the age of foolishness,
It was the the worst of times, it best of times, it was was the age of wisdom, it was the age ofit was the age of foolishness, …
It was was the worst of times,the best of times, it it was the age ofit was the age of wisdom, it was the age of foolishness, …
It it was the worst ofwas the best of times, times, it was the age of wisdom, it was the age of foolishness, …
Overlaps
Generally prefer longer overlaps to shorter overlaps
In the presence of error, we might allow the overlapping fragments to differ by a small amount
It was the best of
of times, it was theof times, it was the
best of times, it was
times, it was the worsttimes, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
of times, it was theof times, it was the
times, it was the age
it was the age ofit was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age ofit was the age of
was the age of foolishness,
the worst of times, it
It was the best of
was the best of times,4 word overlap
It was the best of
of times, it was the1 word overlap
It was the best of
of wisdom, it was the1 word overlap
Greedy Assembly
The repeated sequence makes the correct reconstruction ambiguous
It was the best of
of times, it was theof times, it was the
best of times, it was
times, it was the worsttimes, it was the worst
was the best of times,
the best of times, it
of times, it was theof times, it was the
times, it was the agetimes, it was the age
It was the best of
of times, it was theof times, it was the
best of times, it was
times, it was the worsttimes, it was the worst
was the best of times,
the best of times, it
it was the worst of
was the worst of times,
of times, it was theof times, it was the
times, it was the age
it was the age ofit was the age of
was the age of wisdom,
the age of wisdom, it
age of wisdom, it was
of wisdom, it was the
it was the age ofit was the age of
was the age of foolishness,
the worst of times, it
The Real Problem(The easier version)
GATGCTTACTATGCGGGCCCC
CGGTCTAATGCTTACTATGC
GCTTACTATGCGGGCCCCTTAATGCTTACTATGCGGGCCCCTT
TAATGCTTACTATGCAATGCTTAGCTATGCGGGC
AATGCTTACTATGCGGGCCCCTT
AATGCTTACTATGCGGGCCCCTT
CGGTCTAGATGCTTACTATGC
AATGCTTACTATGCGGGCCCCTT
CGGTCTAATGCTTAGCTATGC
ATGCTTACTATGCGGGCCCCTT
?
Subject genome
Sequencer
Reads
DNA Sequencing
ATCTGATAAGTCCCAGGACTTCAGT
GCAAGGCAAACCCGAGCCCAGTTT
TCCAGTTCTAGAGTTTCACATGATC
GGAGTTAGTAAAAGTCCACATTGAG
Genome of an organism encodes genetic information in long sequence of 4 DNA nucleotides: ATCG
Bacteria: ~5 million bp Humans: ~3 billion bp
Current DNA sequencing machines can generate 1-2 Gbp of sequence per day, in millions of short reads (25-300bp)
Shorter reads, but much higher throughput Per-base error rate estimated at 1-2% (Simpson, et al,
2009)
Recent studies of entire human genomes have used 3.3 (Wang, et al., 2008) & 4.0 (Bentley, et al., 2008) billion 36bp reads
~144 GB of compressed sequence data
How do we put humpty dumpty back together?
Human Genome
11 years, cost $3 billion… your tax dollars at work!
A complete human DNA sequence was published in 2003, marking the end of the Human Genome Project
CGGTCTAGATGCTTAGCTATGCGGGCCCCTT
Reference sequence
Alignment
GCTTA T CTAT
TTA T CTATGC
A T CTATGCGGA T CTATGCGG
GCTTA T CTAT
TCTAGATGCT
CTATGCGGGCCTAGATGCTT
A T CTATGCGGCTATGCGGGC
A T CTATGCGG
Subject reads
CGGTCTAGATGCTTATCTATGCGGGCCCCTT
GCTTATCTATTTATCTATGC
ATCTATGCGGATCTATGCGG
GCTTATCTAT GGCCCCTTGCCCCTT
CCTT
CGGCGGTCCGGTCTCGGTCTAG
TCTAGATGCTCTATGCGGGCCTAGATGCTT
CTT
ATGCGGGCCC
Reference sequence
Subject reads
Reference: ATGAACCACGAACACTTTTTTGGCAACGATTTAT…Query: ATGAACAAAGAACACTTTTTTGGCCACGATTTAT…
Insertion Deletion Mutation
1. Map: Catalog K-mers• Emit every k-mer in the genome and non-overlapping k-mers in the reads• Non-overlapping k-mers sufficient to guarantee an alignment will be found
CloudBurst
Human chromosome 1
Read 1
Read 2
Map
2. Shuffle: Coalesce Seeds• Hadoop internal shuffle groups together k-mers shared by the reads and the reference• Conceptually build a hash table of k-mers and their occurrences
shuffle
…
…
3. Reduce: End-to-end alignment• Locally extend alignment beyond seeds by computing “match distance”• If read aligns end-to-end, record the alignment
Reduce
Read 1, Chromosome 1, 12345-12365
Read 2, Chromosome 1, 12350-12370
0 2000000 4000000 6000000 80000000
2000
4000
6000
8000
10000
12000
14000
16000Running Time vs Number of Reads on Chr 1
01234
Millions of Reads
Ru
nti
me
(s)
0 100000020000003000000400000050000006000000700000080000000
500
1000
1500
2000
2500
3000
Running Time vs Number of Reads on Chr 22
0
1
2
3
4
Millions of Reads
Ru
nti
me
(s)
Results from a small, 24-core cluster, with different number of mismatches
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
24 48 72 960
200
400
600
800
1000
1200
1400
1600
1800
Running Time on EC2 High-CPU Medium Instance Cluster
Number of Cores
Ru
nn
ing
tim
e (s
)
CloudBurst running times for mapping 7M reads to human chromosome 22 with at most 4 mismatches on EC2
Michael Schatz. CloudBurst: Highly Sensitive Read Mapping with MapReduce. Bioinformatics, 2009, in press.
What’s Next?(Michael Schatz’s Ph.D. dissertation)
Wait, no reference?
de Bruijn Graph Construction Dk = (V,E)
V = All length-k subfragments (k > l) E = Directed edges between consecutive subfragments
Nodes overlap by k-1 words
Locally constructed graph reveals the global sequence structure
Overlaps implicitly computed
It was the best was the best ofIt was the best of
Original Fragment Directed Edge
de Bruijn, 1946Idury and Waterman, 1995Pevzner, Tang, Waterman, 2001
de Bruijn Graph Assembly
the age of foolishness
It was the best
best of times, it
was the best of
the best of times,
of times, it was
times, it was thetimes, it was the
it was the worst
was the worst of
worst of times, it
the worst of times,
it was the age
was the age ofthe age of wisdom,
age of wisdom, it
of wisdom, it was
wisdom, it was the
Compressed de Bruijn Graph
Unambiguous non-branching paths replaced by single nodes
An Eulerian traversal of the graph spells a compatible reconstruction of the original text There may be many traversals of the graph
Different sequences can have the same string graph It was the best of times, it was the worst of times, it was the worst of
times, it was the age of wisdom, it was the age of foolishness, …
of times, it was the of times, it was the
It was the best of times, it
it was the age ofit was the age ofthe age of wisdom, it was the
it was the worst of times, it
the age of foolishness
Hadoopification…(Stay tuned!)
Cloud worthy?
How much data?
Bottom Line: Bioinformatics Great use case of Hadoop
Interesting computer science problems
Help unravel life’s mysteries?
Questions?Comments?
Thanks to the organizations who support our work: