Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping with Emerging Memory Technologies
Jeremie Kim, Damla Senol, Hongyi Xin, Donghyuk Lee, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu


TRANSCRIPT

Page 1: Genome Read In-Memory (GRIM) Filter (omutlu/pub/GRIM-genome-read-in-memory-filter_recomb-seq16...)

Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping with Emerging Memory Technologies

Jeremie Kim, Damla Senol, Hongyi Xin, Donghyuk Lee, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu

Page 2:

Introduction

- 3D-stacked memory: an emerging technology
  - Processing-in-Memory (PIM) allows embedding customized logic
  - Enables high bandwidth

- Read mapping can utilize this technology for major performance improvements because it is:
  - Compute intensive
  - Memory intensive

- Goal: We propose an implementation of read mapping using Processing-in-Memory (PIM) for acceleration


Page 3:

Hash Table Based Read Mappers

- Our work focuses on hash table based read mappers

- The filtering step in read mappers is now the bottleneck
  - Mappers align billions of reads; most are incorrect mappings
  - Filter purpose: quickly reject incorrect mappings before alignment, to reduce costly edit distance calculations
  - Costly because they are compute and memory intensive
  - The filter is called for every candidate mapping location
  - Filtering each location requires nontrivial compute and multiple memory accesses

- How can we alleviate the bottleneck?
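To see why the edit distance calculations the slides call costly are worth avoiding, a minimal dynamic-programming sketch helps. This is illustrative only (real mappers use banded or bit-parallel variants); it shows the quadratic work a filter saves for every rejected candidate location:

```python
def edit_distance(read, ref):
    """Classic O(len(read) * len(ref)) dynamic-programming edit distance.

    The quadratic work per candidate location is what makes a cheap
    pre-alignment filter valuable when billions of reads are mapped.
    """
    m, n = len(read), len(ref)
    prev = list(range(n + 1))          # distances for the empty read prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if read[i - 1] == ref[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # match or mismatch
        prev = curr
    return prev[n]
```

Each candidate location pays this full table; a filter that rejects a location with a handful of memory reads instead is the whole motivation.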


Page 4:

Problem

- Filters are generally either fast or accurate:
  - FastHASH [Xin+, BMC Genomics 2013]: fast but inaccurate under high error tolerance settings
  - Q-Gram [Rasmussen+, Journal of Computational Biology 2006]: slow but accurate

- We propose GRIM-Filter:
  - Faster than FastHASH, with the accuracy of Q-Gram
  - Accomplished by employing an emerging memory technology


Page 5:

Key Ideas

- GRIM-Filter: a PIM-friendly filtering algorithm that is both fast and accurate

- GRIM-Filter is built upon two key ideas:
  1. Modify q-gram string matching: enables concurrent checking of multiple locations
  2. Utilize a 3D-stacked DRAM architecture: alleviates the memory bandwidth issue and parallelizes most of the filter


Page 6:

Key Idea 1 – Q-gram Modification

Modify q-gram string matching to concurrently check multiple locations.


[Figure: a read checked against candidate locations in the reference genome]
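The q-gram modification above can be sketched in Python: precompute, for each bin of the reference genome, a bitvector with one bit per possible seed, set if the seed occurs anywhere in that bin. The seed length, bin size, and function names here are illustrative assumptions, not the paper's exact parameters:

```python
from itertools import product

def build_bin_bitvectors(genome, q=4, bin_size=200):
    """For each bin of the reference, build a bitvector with one bit per
    possible length-q seed (4**q bits total), set to 1 if that seed
    occurs in the bin. Checking a candidate location then reduces to
    reading a few precomputed bits instead of rescanning the sequence."""
    seeds = [''.join(p) for p in product('ACGT', repeat=q)]
    index = {s: i for i, s in enumerate(seeds)}
    bitvectors = []
    for start in range(0, len(genome), bin_size):
        # overlap adjacent bins by q-1 so seeds spanning a boundary are counted
        window = genome[start:start + bin_size + q - 1]
        bits = [0] * len(seeds)
        for j in range(len(window) - q + 1):
            bits[index[window[j:j + q]]] = 1
        bitvectors.append(bits)
    return bitvectors
```

Because every location check becomes independent bit lookups into these vectors, many locations can be filtered concurrently, which is what makes the scheme amenable to in-memory parallelism.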

Page 7:

Key Idea 2 – Utilize 3D-stacked Memory

- The 3D-stacked DRAM architecture provides extremely high bandwidth and can parallelize most of the filter

- Embed GRIM-Filter into the DRAM logic layer and appropriately distribute bitvectors throughout memory


[Figure: 3D-stacked memory with memory layers connected by TSVs to a logic layer]

Page 8:

Key Idea 2 – Utilize 3D-stacked Memory


http://images.anandtech.com/doci/9266/HBMCar_678x452.jpg

http://i1-news.softpedia-static.com/images/news2/Micron-and-Samsung-Join-Force-to-Create-Next-Gen-Hybrid-Memory-2.png

Page 9:

Modified Q-gram Filtering in 3D-stacked DRAM

- We employ both key ideas, shown in the following figure, to modify q-gram filtering and make it more amenable to processing-in-memory


3D-Memory Layout

[Figure (from Akin et al., ISCA 2015): overview of an HMC-like 3D-stacked DRAM, consisting of multiple DRAM layers divided into banks, with vertical stacks of banks forming vaults, connected by TSVs to a logic layer.]

In this paper, we introduce the first read mapper location filtering algorithm that attempts to accelerate read mapping by overcoming the memory bottleneck with processing-in-memory using 3D-stacked memory technology. We develop a new algorithm that accelerates genome read mapping by efficiently filtering out incorrect mappings using PIM. Our filter, named Genome Read In-Memory Mapper (GRIMM), can be used with any mapper, including BWT-FM seed generating mappers such as GEM [39] and BWA-MEM [31]. In this work, we focus primarily on the acceleration of hash table based read mappers (e.g., [47], [21], [56], [12], [24], [8], [32], [26]) as they provide high error tolerance and comprehensiveness at the expense of a slower run time. We aim to significantly reduce the slowdown and achieve both high performance and high accuracy.

Our goal is to implement a filter that rejects incorrect mappings before the alignment step to increase the performance of read mapping while maintaining high sensitivity and comprehensiveness. Due to the expensive computational complexity of alignment, we want to minimize the need for unnecessary alignment. Naturally, any filtered incorrect mapping would reduce the overall computation time as long as the filtering runtime is lower than the alignment runtime. Therefore, an ideal filter would have low runtime and a low false positive rate. Our newly designed filter achieves this goal by utilizing PIM to handle data-intensive computation. While conventional mappers are heavily bottlenecked by data movement latency and bandwidth due to the massive amounts of data processed, in-memory computation capabilities allow us to design memory-intensive mapper-augmenting algorithms that yield high performance.

Key Mechanism. GRIMM is based on q-gram filters [45] and the pigeonhole principle, for which we show a novel way to account for error tolerance. We precompute long boolean bitvectors representing the existence of all permutations of small fixed-sized seeds within short ranges of consecutive nucleotides across the entire reference genome. To determine the potential existence of a short read in a portion of the genome, we first get the bits of the bitvector that correspond to the substrings (of length equal to the small fixed-sized seeds) within the short read, at the range of nucleotides encompassing the expected location. We then compute the sum of these found bits. Last, we compare the sum against a threshold value that is determined by the read length, substring size, and error acceptance rate. If the value does not meet the threshold, we eliminate the need for alignment for that location.

Key Results. We evaluate GRIMM qualitatively and quantitatively against the most recent fastest hash table based read mapper, mrFAST with FastHASH [59]. Our results show that our novel algorithm yields a 6.27x smaller false positive rate and runs approximately 2.36x faster than the fastest read mapper when we use a 5% error acceptance rate.

Contributions. We make the following contributions in this paper:

• We note that alignment has traditionally been the bottleneck of read mapping. We observe that, due to immense efforts in accelerating alignment over the past several years, memory bandwidth has become the bottleneck in read mapping.

• We propose the first algorithm that accelerates read mapping by overcoming the memory bottleneck by utilizing 3D-stacked memory and its PIM capability. We observe the memory-intensive nature of our mechanism, which would only further strain the memory bottleneck given the already memory-bound nature of read mapping. We show that PIM enables our algorithm's success by providing very fast and massively parallel operations on very large amounts of data near memory. We exploit this knowledge and design our algorithm for a high memory footprint in exchange for an embarrassingly parallel filter. PIM also provides the benefit of freeing up memory bandwidth for the compute-bound portions of read mapping, which results in further performance improvements.

• We qualitatively and quantitatively compare our filtering algorithm's performance improvement over the state-of-the-art read mapping algorithm, mrFAST with FastHASH. We show that our filter is very simple to implement on a logic layer in memory and yields a 6.27x smaller false positive rate with approximately a 2.36x runtime speedup.


Akin, Berkin, Franz Franchetti, and James C. Hoe. "Data reorganization in memory using 3d-stacked dram." Computer Architecture (ISCA), 2015 ACM/IEEE 42nd Annual International Symposium on. IEEE, 2015.

[Figure: GRIM-Filter layout in 3D-stacked memory. Bitvectors 0 through N are distributed across the memory array; each row of a bank holds the bits for one seed (Row 0: AAAA, Row 1: AAAC, Row 2: AAAG, ..., Row N: TTTT). Customized logic beside each row buffer (an incrementer, accumulator, and comparator) produces the bucket existence bitvector.]

3D-stacked memory technologies enable:
1. In-memory processing
2. High bandwidth

Page 10:

Key Results


[Charts: runtime (x1000 seconds) and False Negative Rate (%) across data sets]

2.08x average performance benefit on real data sets

5.97x reduction in False Negative Rate on real data sets

Page 11:

Conclusions

- We propose an in-memory filter that can drastically speed up read mapping

- Compared to the previous best filter:
  - We observed 1.81x-3.65x speedup
  - We observed 5.59x-6.41x fewer false negatives

- GRIM-Filter is a universal filter that can be applied to any read mapper


Page 12:

Thank You! Poster #118


Page 13:

Genome Read In-Memory (GRIM) Filter: Fast Location Filtering in DNA Read Mapping with Emerging Memory Technologies

Jeremie Kim, Damla Senol, Hongyi Xin, Donghyuk Lee, Mohammed Alser, Hasan Hassan, Oguz Ergin, Can Alkan, and Onur Mutlu