scaling ancestrydna using hadoop and hbase
DESCRIPTION
Learn more about how our team scales our AncestryDNA technology to match you within our database's DNA pool.TRANSCRIPT
1
Scaling AncestryDNAUsing Hadoop and HBase
April 10th, 2014Jeremy Pollack
Ancestry.com Mission
2
• Over 30,000 historical content collections
• 13 billion records and images
• Records dating back to 16th century
• 10 petabytes
We are the world's largest online family history resource
Discoveries are the Key
3
The “eureka” moment drives our business
Discoveries in Detail
4
Spit in a tube, pay $99, learn your past
Autosomal DNA testsOver 200,000+ DNA samples700,000 SNPs for each sample10,000,000+ cousin matches
DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism)
(http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)
MayJune
July
August
Septem
ber
October
November
December
January
Febru
aryMarc
hApril
May -
50,000
100,000
150,000 Genotyped Samples
Discoveries with DNA
5
Network Effect – Cousin Matches
6
2000 10053 21205 40201 60240 80405 1157560
500,000
1,000,000
1,500,000
2,000,000
2,500,000
3,000,000
3,500,000
Cous
in M
atch
es
Database Size
So how do we do it?
7
• GERMLINE is an algorithm that finds hidden relationships within a pool of DNA
• GERMLINE also refers to the reference implementation of that algorithm written in C++
• You can find it here :
http://www1.cs.columbia.edu/~gusev/germline/
Introducing … GERMLINE!
8
• GERMLINE (the implementation) was not meant to be used in an industrial setting
Stateless Single threaded Prone to swapping (heavy memory usage)
• GERMLINE performs poorly on large data sets
• Our metrics predicted exactly where the process would slow to a crawl
• Put simply: GERMLINE couldn't scale
So What’s the Problem?
9
GERMLINE Run Times (in hours)
2500
5000
7500
10000
12500
15000
17500
20000
22500
25000
27500
30000
32500
35000
37500
40000
42500
45000
47500
50000
52500
55000
57500
60000
0
5
10
15
20
25H
ours
Samples
10
Projected GERMLINE Run Times (in hours)
2500
7500
12500
17500
22500
27500
32500
37500
42500
47500
52500
57500
62500
67500
72500
77500
82500
87500
92500
97500
102500
107500
112500
117500
122500
127500
132500
0
100
200
300
400
500
600
700
GERMLINE run times
Projected GERMLINE run times
Samples
Hou
rs
11
The Mission : Create a Scalable Matching Engine
... and thus was born
(aka "Jermline with a J")
12
What is Hadoop?
• Hadoop is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion, using commodity hardware
• Hadoop specifies a distributed file system called HDFS
• Hadoop supports a processing methodology known as MapReduce
• Many tools are built on top of Hadoop, such as HBase, Hive, and Flume
13
What is HDFS?
What happens when HDFS loses a server
What is MapReduce?
16
What is HBase?
• HBase is an open-source NoSQL data store that runs on top of HDFS
• HBase is columnar; you can think of it as a weird amalgam of a hashtable and a spreadsheet
• HBase supports unlimited rows and columns
• HBase stores data sparsely; there is no penalty for empty cells
• HBase is gaining in popularity: Salesforce, Facebook, and Twitter have all invested heavily in the technology, as well as many others
17
Game of Thrones Characters, in an HBase Table
KEY gender hair_color family_name is_evilJoffrey male blonde Baratheon yesCersei female blonde Lannister kinda
18
Adding a Row to an HBase Table
KEY gender hair_color family_name is_evilJoffrey male blonde Baratheon yesCersei female blonde Lannister kindaSansa female red Stark no
19
Adding a Column to an HBase Table
KEY gender hair_color family_name is_evil titleJoffrey male blonde Baratheon yes kingCersei female blonde Lannister kindaSansa female red Stark no
20
Cersei : ACTGACCTAGTTGACJoffrey : TTAAGCCTAGTTGAC
The Input
Cersei Baratheon
• Former queen of Westeros
• Machiavellian manipulator
• Mostly evil, but occasionally sympathetic
Joffrey Baratheon
• Pretty much the human embodiment of evil
• Needlessly cruel • Kinda looks like
Justin Bieber
DNA Matching : How it Works
21
0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC
Separate into words
DNA Matching : How it Works
22
0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC
ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey
Build the hash table
DNA Matching : How it Works
23
Iterate through genome and find matches
Cersei and Joffrey match from position 1 to position 2
0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC
ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey
DNA Matching : How it Works
24
Does that mean they're related?
...maybe
?25
IBD to Relationship Estimation
• We use the total length of all shared segments to estimate the relationship between to genetic relatives
• This is basically a classification problem
26
Jaime : TTAAGCCTAGGGGCG
But Wait...What About Jaime?
Jaime Lannister
• Kind of a has-been• Killed the Mad King• Has the hots for his
sister, Cersei
27
Adding a new sample, the GERMLINE way
28
0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG
ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime
Step one: Rebuild the entire hash table from scratch, including the new sample
The GERMLINE Way
29
Cersei and Joffrey match from position 1 to position 2Joffrey and Jaime match from position 0 to position 1Cersei and Jaime match at position 1
Step two: Find everybody's matches all over again, including the new sample. (n x n comparisons)
0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG
ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime
The GERMLINE Way
30
Cersei and Joffrey match from position 1 to position 2Joffrey and Jaime match from position 0 to position 1Cersei and Jaime match at position 1
Step three : Now, throw away the evidence!
0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG
ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime
The GERMLINE Way
31
Step one: Update the hash tableCersei Joffrey
2_ACTGA_0 1
2_TTAAG_0 1
2_CCTAG_1 1 1
2_TTGAC_2 1 1
Already stored in HBase
Jaime : TTAAG CCTAG GGGCG New sample to add
Key : [CHROMOSOME]_[WORD]_[POSITION]Qualifier : [USER ID]Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome
The Way
32
Jaime and Joffrey match from position 0 to position 1Jaime and Cersei match at position 1
Already stored in HBase
2_Cersei 2_Joffrey
2_Cersei { (1, 2), ...}
2_Joffrey { (1, 2), ...}
New matches to add
Key : [CHROMOSOME]_[USER ID]Qualifier : [CHROMOSOME]_[USER ID]Cell value : A list of ranges where the two users match on a chromosome
Step two: Find matches, update the results table
The Way
33
Results Table2_Cersei 2_Joffrey 2_Jaime
2_Cersei { (1, 2), ...} { (1), ...}
2_Joffrey { (1, 2), ...} { (0,1), ...}
2_Jaime { (1), ...} { (0,1), ...}
Hash TableCersei Joffrey Jaime
2_ACTGA_0 1
2_TTAAG_0 1 1
2_CCTAG_1 1 1 1
2_TTGAC_2 1 1
2_GGGCG_2 1
The Way
34
But wait ... what about Daenerys, Tyrion, Arya, and Jon Snow?
35
Run them in parallel with Hadoop!
36
Parallelism with Hadoop
• Batches are usually about a thousand people
• Each mapper takes a single chromosome for a single person
• MapReduce Jobs :Job #1 : Match Words
o Updates the hash tableJob #2 : Match Segments
o Identifies areas where the samples match
37
How does perform?
A 1700% performance improvement over GERMLINE!
38
2500
7500
12500
17500
22500
27500
32500
37500
42500
47500
52500
57500
62500
67500
72500
77500
82500
87500
92500
97500
102500
107500
112500
117500
0
5
10
15
20
25
Samples
Hou
rsRun times for Matching (in hours)
39
25007500
1250017500
2250027500
3250037500
4250047500
5250057500
6250067500
7250077500
8250087500
9250097500
102500
107500
112500
1175000
20
40
60
80
100
120
140
160
180
GERMLINE run times
Projected GERMLINE run times
Jermline run times
Run times for Matching (in hours)
Samples
Hou
rs
40
Bottom line: Without Hadoop and HBase, this would have been expensive and difficult.
Dramatically Increased our Capacity
41
And now for everybody's favorite part ....
Lessons Learned
42
Lessons Learned
What went right?
43
Lessons Learned : What went right?
44
• This project would not have been possible without TDD• Two sets of test data : generated and public domain• 89% coverage
• Corrected a bug in the reference implementation
• Has never failed in production
Lessons Learned
What would we do differently?
45
Lessons Learned : What would we do differently?
• Front-load some performance tests HBase and Hadoop can have odd performance profiles HBase in particular has some strange behavior if you're not
familiar with its inner workings
• Allow a lot of time for live tests, dry runs, and deployment
These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"
46
Questions?
47