scaling ancestrydna using hadoop and hbase

1

Scaling AncestryDNAUsing Hadoop and HBase

April 10th, 2014Jeremy Pollack

Ancestry.com Mission

2

• Over 30,000 historical content collections

• 13 billion records and images

• Records dating back to 16th century

• 10 petabytes

We are the world's largest online family history resource

Discoveries are the Key

3

The “eureka” moment drives our business

Discoveries in Detail

4

Spit in a tube, pay $99, learn your past

Autosomal DNA testsOver 200,000+ DNA samples700,000 SNPs for each sample10,000,000+ cousin matches

DNA molecule 1 differs from DNA molecule 2 at a single base-pair location (a C/T polymorphism)

(http://en.wikipedia.org/wiki/Single-nucleiotide_polymorphism)

MayJune

July

August

Septem

ber

October

November

December

January

Febru

aryMarc

hApril

May -

50,000

100,000

150,000 Genotyped Samples

Discoveries with DNA

5

Network Effect – Cousin Matches

6

2000 10053 21205 40201 60240 80405 1157560

500,000

1,000,000

1,500,000

2,000,000

2,500,000

3,000,000

3,500,000

Cous

in M

atch

es

Database Size

So how do we do it?

7

• GERMLINE is an algorithm that finds hidden relationships within a pool of DNA

• GERMLINE also refers to the reference implementation of that algorithm written in C++

• You can find it here :

http://www1.cs.columbia.edu/~gusev/germline/

Introducing … GERMLINE!

8

• GERMLINE (the implementation) was not meant to be used in an industrial setting

Stateless Single threaded Prone to swapping (heavy memory usage)

• GERMLINE performs poorly on large data sets

• Our metrics predicted exactly where the process would slow to a crawl

• Put simply: GERMLINE couldn't scale

So What’s the Problem?

9

GERMLINE Run Times (in hours)

2500

5000

7500

10000

12500

15000

17500

20000

22500

25000

27500

30000

32500

35000

37500

40000

42500

45000

47500

50000

52500

55000

57500

60000

0

5

10

15

20

25H

ours

Samples

10

Projected GERMLINE Run Times (in hours)

2500

7500

12500

17500

22500

27500

32500

37500

42500

47500

52500

57500

62500

67500

72500

77500

82500

87500

92500

97500

102500

107500

112500

117500

122500

127500

132500

0

100

200

300

400

500

600

700

GERMLINE run times

Projected GERMLINE run times

Samples

Hou

rs

11

The Mission : Create a Scalable Matching Engine

... and thus was born

(aka "Jermline with a J")

12

What is Hadoop?

• Hadoop is an open-source platform for processing large amounts of data in a scalable, fault-tolerant, affordable fashion, using commodity hardware

• Hadoop specifies a distributed file system called HDFS

• Hadoop supports a processing methodology known as MapReduce

• Many tools are built on top of Hadoop, such as HBase, Hive, and Flume

13

What is HDFS?

What happens when HDFS loses a server

What is MapReduce?

16

What is HBase?

• HBase is an open-source NoSQL data store that runs on top of HDFS

• HBase is columnar; you can think of it as a weird amalgam of a hashtable and a spreadsheet

• HBase supports unlimited rows and columns

• HBase stores data sparsely; there is no penalty for empty cells

• HBase is gaining in popularity: Salesforce, Facebook, and Twitter have all invested heavily in the technology, as well as many others

17

Game of Thrones Characters, in an HBase Table

KEY gender hair_color family_name is_evilJoffrey male blonde Baratheon yesCersei female blonde Lannister kinda

18

Adding a Row to an HBase Table

KEY gender hair_color family_name is_evilJoffrey male blonde Baratheon yesCersei female blonde Lannister kindaSansa female red Stark no

19

Adding a Column to an HBase Table

KEY gender hair_color family_name is_evil titleJoffrey male blonde Baratheon yes kingCersei female blonde Lannister kindaSansa female red Stark no

20

Cersei : ACTGACCTAGTTGACJoffrey : TTAAGCCTAGTTGAC

The Input

Cersei Baratheon

• Former queen of Westeros

• Machiavellian manipulator

• Mostly evil, but occasionally sympathetic

Joffrey Baratheon

• Pretty much the human embodiment of evil

• Needlessly cruel • Kinda looks like

Justin Bieber

DNA Matching : How it Works

21

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGAC

Separate into words


22


ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey

Build the hash table


23

Iterate through genome and find matches

Cersei and Joffrey match from position 1 to position 2


ACTGA_0 : CerseiTTAAG_0 : JoffreyCCTAG_1 : Cersei, JoffreyTTGAC_2 : Cersei, Joffrey


24

Does that mean they're related?

...maybe

?25

IBD to Relationship Estimation

• We use the total length of all shared segments to estimate the relationship between to genetic relatives

• This is basically a classification problem

26

Jaime : TTAAGCCTAGGGGCG

But Wait...What About Jaime?

Jaime Lannister

• Kind of a has-been• Killed the Mad King• Has the hots for his

sister, Cersei

27

Adding a new sample, the GERMLINE way

28

0 1 2Cersei : ACTGA CCTAG TTGACJoffrey : TTAAG CCTAG TTGACJaime : TTAAG CCTAG GGGCG

ACTGA_0 : CerseiTTAAG_0 : Joffrey, JaimeCCTAG_1 : Cersei, Joffrey, JaimeTTGAC_2 : Cersei, JoffreyGGGCG_2 : Jaime

Step one: Rebuild the entire hash table from scratch, including the new sample

The GERMLINE Way

29

Cersei and Joffrey match from position 1 to position 2Joffrey and Jaime match from position 0 to position 1Cersei and Jaime match at position 1

Step two: Find everybody's matches all over again, including the new sample. (n x n comparisons)



The GERMLINE Way

30

Cersei and Joffrey match from position 1 to position 2Joffrey and Jaime match from position 0 to position 1Cersei and Jaime match at position 1

Step three : Now, throw away the evidence!



The GERMLINE Way

31

Step one: Update the hash tableCersei Joffrey

2_ACTGA_0 1

2_TTAAG_0 1

2_CCTAG_1 1 1

2_TTGAC_2 1 1

Already stored in HBase

Jaime : TTAAG CCTAG GGGCG New sample to add

Key : [CHROMOSOME]_[WORD]_[POSITION]Qualifier : [USER ID]Cell value : A byte set to 1, denoting that the user has that word at that position on that chromosome

The Way

32

Jaime and Joffrey match from position 0 to position 1Jaime and Cersei match at position 1

Already stored in HBase

2_Cersei 2_Joffrey

2_Cersei { (1, 2), ...}

2_Joffrey { (1, 2), ...}

New matches to add

Key : [CHROMOSOME]_[USER ID]Qualifier : [CHROMOSOME]_[USER ID]Cell value : A list of ranges where the two users match on a chromosome

Step two: Find matches, update the results table

The Way

33

Results Table2_Cersei 2_Joffrey 2_Jaime

2_Cersei { (1, 2), ...} { (1), ...}

2_Joffrey { (1, 2), ...} { (0,1), ...}

2_Jaime { (1), ...} { (0,1), ...}

Hash TableCersei Joffrey Jaime

2_ACTGA_0 1

2_TTAAG_0 1 1

2_CCTAG_1 1 1 1

2_TTGAC_2 1 1

2_GGGCG_2 1

The Way

34

But wait ... what about Daenerys, Tyrion, Arya, and Jon Snow?

35

Run them in parallel with Hadoop!

36

Parallelism with Hadoop

• Batches are usually about a thousand people

• Each mapper takes a single chromosome for a single person

• MapReduce Jobs :Job #1 : Match Words

o Updates the hash tableJob #2 : Match Segments

o Identifies areas where the samples match

37

How does perform?

A 1700% performance improvement over GERMLINE!

38

2500

7500

12500

17500

22500

27500

32500

37500

42500

47500

52500

57500

62500

67500

72500

77500

82500

87500

92500

97500

102500

107500

112500

117500

0

5

10

15

20

25

Samples

Hou

rsRun times for Matching (in hours)

39

25007500

1250017500

2250027500

3250037500

4250047500

5250057500

6250067500

7250077500

8250087500

9250097500

102500

107500

112500

1175000

20

40

60

80

100

120

140

160

180

GERMLINE run times

Projected GERMLINE run times

Jermline run times

Run times for Matching (in hours)

Samples

Hou

rs

40

Bottom line: Without Hadoop and HBase, this would have been expensive and difficult.

Dramatically Increased our Capacity

41

And now for everybody's favorite part ....

Lessons Learned

42

Lessons Learned

What went right?

43

Lessons Learned : What went right?

44

• This project would not have been possible without TDD• Two sets of test data : generated and public domain• 89% coverage

• Corrected a bug in the reference implementation

• Has never failed in production

Lessons Learned

What would we do differently?

45

Lessons Learned : What would we do differently?

• Front-load some performance tests HBase and Hadoop can have odd performance profiles HBase in particular has some strange behavior if you're not

familiar with its inner workings

• Allow a lot of time for live tests, dry runs, and deployment

These technologies are relatively new, and it isn't always possible to find experienced admins. Be prepared to "get your hands dirty"

46

Questions?

47

48

And yes, we are hiring!You can contact me at [email protected]

or just talk to me here!

mailto:[email protected]

mailto:[email protected]

scaling ancestrydna using hadoop and hbase

Technology