NEXTBIO 2008
Leveraging HBase for the World's Largest Curated Genomic Data Collection
Satnam Alag, Ph.D.
VP of [email protected]
© 2012 NextBio | All rights reserved | This information is proprietary and confidential.
Technology Generating Exponential Data
Genomic Big Data
Use Case 1: HBase to Store Variant Data
• Each genome has ~4 million variants
• Immutable: write once, never change, read many times
• Bloom filters are useful
• Batch import of data via HFile
• Data to be accessed together is collocated in a region
• Separate HBase cluster from Hadoop
• All the smarts are in the keys for the various tables
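Why Bloom filters help here: with immutable, read-many data, a point lookup can skip any store file whose filter says the key is definitely absent. The sketch below is a generic, minimal Bloom filter for illustration only; it is not HBase's implementation, and the class name, sizes, and hashing scheme are assumptions.

```python
import hashlib

class BloomFilter:
    """Minimal illustrative Bloom filter (not HBase's implementation)."""

    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key: bytes):
        # Derive several bit positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes) -> None:
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))

bf = BloomFilter()
bf.add(b"bioset-42/row-400")
```

A filter never reports a stored key as absent, so a negative answer is safe to act on without reading the file.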
In HBase:
1 Genome: 10 million rows
100 Genomes: 1 billion rows
100K Genomes: 1 trillion rows
100M Genomes: 1 quadrillion rows (1,000,000,000,000,000)
Fortunately, HBase cluster access can be partitioned by the application when required
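The row counts above follow from a single multiplication. A small sketch of the arithmetic, taking the slide's figure of about 10 million rows per genome as given (the function name is hypothetical):

```python
# Rows per genome, as stated on the slide (approximate).
ROWS_PER_GENOME = 10_000_000

def total_rows(genomes: int) -> int:
    """Estimated HBase row count for a given number of genomes."""
    return genomes * ROWS_PER_GENOME

for genomes in (1, 100, 100_000, 100_000_000):
    print(f"{genomes:>11,} genomes -> {total_rows(genomes):,} rows")
```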
Accessing Data with Pagination
Table 1: Key: Bioset Id + Display Order
Columns
Pagination Example: Page 5, Page Size = 100
Retrieve 100 rows from Display Order = 400-500
Number of rows = 1 per SNP, on the order of 4 million
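A minimal sketch of how a page request could map to a scan range on a Bioset Id + Display Order key. The byte layout (8-byte id, 4-byte order, big-endian) and the function names are assumptions, not the layout used in the talk; the stop key is treated as exclusive, as in an HBase scan.

```python
import struct

KEY_FORMAT = ">QI"  # assumed layout: 8-byte bioset id, 4-byte display order

def row_key(bioset_id: int, display_order: int) -> bytes:
    # Big-endian fixed-width integers keep byte order equal to numeric order.
    return struct.pack(KEY_FORMAT, bioset_id, display_order)

def page_bounds(bioset_id: int, page: int, page_size: int):
    """Return (start_key, stop_key) for a 1-based page; stop is exclusive."""
    start = (page - 1) * page_size
    return row_key(bioset_id, start), row_key(bioset_id, start + page_size)

# Page 5 with page size 100 covers display orders 400 up to (not including) 500.
start_key, stop_key = page_bounds(bioset_id=42, page=5, page_size=100)
```

Because the key is ordered by display order within a bioset, each page is one contiguous scan rather than a filter over millions of rows.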
Accessing Data with Keys
Table 1: Key: Bioset Id + Display Order
Keys returned by search index
Filtering Data with Pagination
Table 1: Key: Bioset Id + Display Order
Example: Gene: ESR1, Class: Missense, Page Size = 100
1. Retrieve rows from Table 2
2. Retrieve rows by keys from Table 1
Number of rows: on the order of 0.5 million per dataset (# genes x classes)
Table 2: Id + GeneId + MutationClass
Column: Counts, Keys to Table
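The two-step lookup can be sketched with in-memory dictionaries standing in for the two HBase tables. The data shapes here (Table 2 values as a list of display orders into Table 1) and all names are assumptions made for illustration.

```python
def filtered_page(table1, table2, bioset_id, gene_id, mutation_class,
                  page, page_size):
    """Return (total match count, rows for the requested 1-based page)."""
    # Step 1: one Table 2 lookup yields the display orders of all matches.
    display_orders = table2.get((bioset_id, gene_id, mutation_class), [])
    # Step 2: slice out the requested page and fetch those rows from Table 1.
    start = (page - 1) * page_size
    wanted = display_orders[start:start + page_size]
    return len(display_orders), [table1[(bioset_id, d)] for d in wanted]

# Toy stand-ins for the tables (hypothetical data).
table1 = {(7, d): {"snp": f"rs{d}"} for d in range(10)}
table2 = {(7, "ESR1", "Missense"): [1, 4, 8]}

count, rows = filtered_page(table1, table2, 7, "ESR1", "Missense",
                            page=1, page_size=2)
```

Keeping the count alongside the keys lets the application show the total number of matches without touching Table 1.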
Powering the Genome Browser
Table 1: Key: Bioset Id + Display Order
Example: Chr: 6, Specified Range
Retrieve all rows
1 Row per SNP ~ 4 million per dataset
Table 2: Id + GeneId + MutationClass
Table 3: Id + ChromosomeId + Range + DisplayOrder
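A sketch of a genome-browser range query against a Table 3 style key. The slides do not say how "Range" is encoded; this example assumes it is a genomic start position stored as a fixed-width integer, and it uses a sorted list with bisect to stand in for HBase's ordered scan. All names and the byte layout are hypothetical.

```python
import bisect
import struct

T3_FORMAT = ">QHII"  # assumed: bioset id, chromosome id, position, display order

def t3_key(bioset_id, chrom, position, display_order):
    return struct.pack(T3_FORMAT, bioset_id, chrom, position, display_order)

def scan_range(sorted_keys, bioset_id, chrom, start_pos, end_pos):
    """Return decoded keys on one chromosome with start_pos <= position <= end_pos."""
    lo = t3_key(bioset_id, chrom, start_pos, 0)
    hi = t3_key(bioset_id, chrom, end_pos + 1, 0)  # exclusive upper bound
    i = bisect.bisect_left(sorted_keys, lo)
    j = bisect.bisect_left(sorted_keys, hi)
    return [struct.unpack(T3_FORMAT, k) for k in sorted_keys[i:j]]

# Toy data: three SNPs on chromosome 6 and one on chromosome 7.
keys = sorted([
    t3_key(1, 6, 100, 0),
    t3_key(1, 6, 200, 1),
    t3_key(1, 6, 300, 2),
    t3_key(1, 7, 150, 3),
])
hits = scan_range(keys, bioset_id=1, chrom=6, start_pos=150, end_pos=300)
```

Placing chromosome and position ahead of display order in the key makes every browser window a single contiguous scan.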
Use Case 2: Correlation Data
Use Case 2
• Each correlation score stored as a row
• HFile created for new score
• Over 20 billion correlations
[Figure: bioset-by-bioset correlation matrix, rows and columns B1, B2, …, Bn, Bn+1]
T1: scorebioset (base table) key: biosetid_1 [+] biosetid_2
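A sketch of the score key and a prefix lookup for all correlations of one bioset. The 8-byte big-endian encoding, the function names, and the toy scores are assumptions; whether each pair is stored in one or both orders is not stated in the slides.

```python
import struct

PAIR_FORMAT = ">QQ"  # assumed: two 8-byte big-endian bioset ids

def score_key(bioset_id_1: int, bioset_id_2: int) -> bytes:
    return struct.pack(PAIR_FORMAT, bioset_id_1, bioset_id_2)

def prefix_bounds(bioset_id_1: int):
    """(start, stop) keys covering every row whose key begins with bioset_id_1."""
    return score_key(bioset_id_1, 0), score_key(bioset_id_1 + 1, 0)

# Toy stand-in for the scorebioset table (hypothetical scores).
scores = {
    score_key(1, 2): 0.9,
    score_key(1, 3): 0.4,
    score_key(2, 3): 0.7,
}

start, stop = prefix_bounds(1)
partners = [struct.unpack(PAIR_FORMAT, k)
            for k in sorted(scores) if start <= k < stop]
```

With the first bioset id leading the key, all of one bioset's correlations sit next to each other and can be read with a single range scan.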
Lessons Learnt
• HBase Works Well For
-- Immutable Data
-- Insertions Using HFiles
-- Billions of Rows
-- Intelligence in Key Definition
• Road to Production
-- Redundant Data in Database