A Framework for Accelerating Bottlenecks in GPU Execution with Assist Warps

Nandita Vijaykumar  Gennady Pekhimenko  Adwait Jog†  Saugata Ghose  Abhishek Bhowmick
Rachata Ausavarungnirun  Chita Das†  Mahmut Kandemir†  Todd C. Mowry  Onur Mutlu

Carnegie Mellon University   †Pennsylvania State University
{nandita,ghose,abhowmick,rachata,onur}@cmu.edu

{gpekhime,tcm}@cs.cmu.edu {adwait,das,kandemir}@cse.psu.edu

Abstract

Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive.

This work describes the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency.

CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps.

We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.

We believe that CABA is a flexible framework that enables the use of idle resources to improve application performance with different optimizations and perform other useful tasks. We discuss how CABA can be used, for example, for memoization, prefetching, handling interrupts, profiling, redundant multithreading, and speculative precomputation.

1. Introduction

Modern Graphics Processing Units (GPUs) play an important role in delivering high performance and energy efficiency for many classes of applications and different computational platforms. GPUs employ fine-grained multi-threading to hide the high memory access latencies with thousands of concurrently running threads [58]. GPUs are well provisioned with different resources (e.g., SIMD-like computational units, large register files) to support the execution of a large number of these hardware contexts. Ideally, if the demand for all types of resources is properly balanced, all these resources should be fully utilized by the application. Unfortunately, this balance is very difficult to achieve in practice.

As a result, bottlenecks in program execution, e.g., limitations in memory or computational bandwidth, lead to long stalls and idle periods in the shader pipelines of modern GPUs [51, 52, 79, 103]. Alleviating these bottlenecks with optimizations implemented in dedicated hardware requires significant engineering cost and effort. Fortunately, the resulting under-utilization of on-chip computational and memory resources from these imbalances in application requirements offers some new opportunities. For example, we can use these resources for efficient integration of hardware-generated threads that perform useful work to accelerate the execution of the primary threads. Similar helper threading ideas have been proposed in the context of general-purpose processors [22, 23, 27, 31, 32, 97, 122] to either extend the pipeline with more contexts or use spare hardware contexts to precompute useful information that aids main code execution (e.g., to aid branch prediction, prefetching, etc.).

We believe that the general idea of helper threading can lead to even more powerful optimizations and new opportunities in the context of modern GPUs than in CPUs because (1) the abundance of on-chip resources in a GPU obviates the need for idle hardware contexts [27, 28] or the addition of more storage (registers, rename tables, etc.) and compute units [22, 75] required to handle more contexts, and (2) the relative simplicity of the GPU pipeline avoids the complexities of handling register renaming, speculative execution, precise interrupts, etc. [23]. However, GPUs that execute and manage thousands of thread contexts at the same time pose new challenges for employing helper threading, which must be addressed carefully. First, the numerous regular program threads executing in parallel could require an equal or larger number of helper threads that need to be managed at low cost. Second, the compute and memory resources are dynamically partitioned between threads in GPUs, and resource allocation for helper threads should be cognizant of resource interference and overheads. Third, lock-step execution and complex scheduling—which are characteristic of GPU architectures—exacerbate the complexity of fine-grained management of helper threads.

In this work, we describe a new, flexible framework for bottleneck acceleration in GPUs via helper threading (called Core-Assisted Bottleneck Acceleration, or CABA), which exploits the aforementioned new opportunities while effectively handling the new challenges.

CABA performs acceleration by generating special warps—assist warps—that can execute code to speed up application execution and system tasks. To simplify the support of the numerous assist threads with CABA, we manage their execution at the granularity of a warp and use a centralized mechanism to track the progress of each assist warp throughout its execution. To reduce the overhead of providing and managing new contexts for each generated thread, as well as to simplify scheduling and data communication, an assist warp shares the same context as the regular warp it assists. Hence, the regular warps are overprovisioned with available registers to enable each of them to host its own assist warp.

Use of CABA for compression. We illustrate an important use case for the CABA framework: alleviating the memory bandwidth bottleneck by enabling flexible data compression in the memory hierarchy. The basic idea is to have assist warps that (1) compress cache blocks before they are written to memory, and (2) decompress cache blocks before they are placed into the cache.

CABA-based compression/decompression provides several benefits over a purely hardware-based implementation of data compression for memory. First, CABA primarily employs hardware that is already available on-chip but is otherwise underutilized. In contrast, hardware-only compression implementations require dedicated logic for specific algorithms. Each new algorithm (or a modification of an existing one) requires engineering effort and incurs hardware cost. Second, different applications tend to have distinct data patterns [87] that are more efficiently compressed with different compression algorithms. CABA offers versatility in algorithm choice, as we find that many existing hardware-based compression algorithms (e.g., Base-Delta-Immediate (BDI) compression [87], Frequent Pattern Compression (FPC) [4], and C-Pack [25]) can be implemented using different assist warps with the CABA framework. Third, not all applications benefit from data compression. Some applications are constrained by other bottlenecks (e.g., oversubscription of computational resources), or may operate on data that is not easily compressible. As a result, the benefits of compression may not outweigh the cost in terms of additional latency and energy spent on compressing and decompressing data. In these cases, compression can be easily disabled by CABA, and the CABA framework can be used in other ways to alleviate the current bottleneck.

Other uses of CABA. The generality of CABA enables its use in alleviating other bottlenecks with different optimizations. We discuss two examples: (1) using assist warps to perform memoization to eliminate redundant computations that have the same or similar inputs [13, 29, 106], by storing the results of frequently-performed computations in the main memory hierarchy (i.e., by converting the computational problem into a storage problem), and (2) using the idle memory pipeline to perform opportunistic prefetching to better overlap computation with memory access. Assist warps offer a hardware/software interface to implement hybrid prefetching algorithms [35] with varying degrees of complexity. We also briefly discuss other uses of CABA for (1) redundant multithreading, (2) speculative precomputation, (3) handling interrupts, and (4) profiling and instrumentation.

Contributions. This work makes the following contributions:

• It introduces the Core-Assisted Bottleneck Acceleration (CABA) framework, which can mitigate different bottlenecks in modern GPUs by using underutilized system resources for assist warp execution.

• It provides a detailed description of how our framework can be used to enable effective and flexible data compression in GPU memory hierarchies.

• It comprehensively evaluates the use of CABA for data compression to alleviate the memory bandwidth bottleneck. Our evaluations across a wide variety of applications from the Mars [44], CUDA [83], Lonestar [20], and Rodinia [24] benchmark suites show that CABA-based compression on average (1) reduces memory bandwidth consumption by 2.1X, (2) improves performance by 41.7%, and (3) reduces overall system energy by 22.2%.

• It discusses at least six other use cases of CABA that can improve application performance and system management, showing that CABA is a general framework for taking advantage of underutilized resources in modern GPU engines.

2. Background

A GPU consists of multiple simple cores, also called streaming multiprocessors (SMs) in NVIDIA terminology or compute units (CUs) in AMD terminology. Our example architecture (shown in Figure 1) consists of 15 cores, each with a SIMT width of 32, and 6 memory controllers. Each core is associated with a private L1 data cache and read-only texture and constant caches, along with a low-latency, programmer-managed shared memory. The cores and memory controllers are connected via a crossbar, and every memory controller is associated with a slice of the shared L2 cache. This architecture is similar to many modern GPU architectures, including NVIDIA Fermi [84] and AMD Radeon [10].

Figure 1: Baseline GPU architecture. Figure reproduced from [115].

A typical GPU application consists of many kernels. Each kernel is divided into groups of threads, called thread blocks (or cooperative thread arrays (CTAs)). After a kernel is launched and its necessary data is copied to the GPU memory, the thread-block scheduler schedules available CTAs onto all the available cores [17]. Once the CTAs are launched onto the cores, the warps associated with the CTAs are scheduled onto the cores' SIMT pipelines. Each core is capable of concurrently executing many warps, where a warp is typically defined as a group of threads that are executed in lockstep. In modern GPUs, a warp can contain 32 threads [84], for example. The maximum number of warps that can be launched on a core depends on the available core resources (e.g., available shared memory, register file size, etc.). For example, in a modern GPU, as many as 64 warps (i.e., 2048 threads) can be present on a single GPU core.

For more details on the internals of modern GPU architectures, we refer the reader to [53, 61].
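For readers less familiar with this hierarchy, the minimal CUDA example below launches a kernel as a grid of thread blocks; the 32-thread warp decomposition is performed implicitly by the hardware and is only surfaced here through a computed warp index. The kernel name and launch dimensions are illustrative, not taken from the paper.

```cuda
#include <cuda_runtime.h>

// Each thread records the warp it belongs to within its thread block.
// Thread blocks (CTAs) are scheduled onto cores (SMs); the hardware splits
// each block into 32-thread warps that execute in lockstep.
__global__ void identify_warps(int *warp_of_thread) {
    int tid  = blockIdx.x * blockDim.x + threadIdx.x;  // global thread ID
    int warp = threadIdx.x / warpSize;                 // warp index within the block
    warp_of_thread[tid] = warp;
}

int main() {
    const int threadsPerBlock = 256;                   // 8 warps per block
    const int numBlocks       = 15;                    // e.g., one block per core
    int *d_out;
    cudaMalloc(&d_out, numBlocks * threadsPerBlock * sizeof(int));
    identify_warps<<<numBlocks, threadsPerBlock>>>(d_out);
    cudaDeviceSynchronize();
    cudaFree(d_out);
    return 0;
}
```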

3. Motivation

We observe that different bottlenecks and imbalances during program execution leave resources unutilized within the GPU cores. We motivate our proposal, CABA, by examining these inefficiencies. CABA leverages these inefficiencies as an opportunity to perform useful work.

Unutilized Compute Resources. A GPU core employs fine-grained multithreading [105, 112] of warps, i.e., groups of threads executing the same instruction, to hide long memory and ALU operation latencies. If the number of available warps is insufficient to cover these long latencies, the core stalls or becomes idle. To understand the key sources of inefficiency in GPU cores, we conduct an experiment where we show the breakdown of the applications' execution time spent on either useful work (Active Cycles) or stalling due to one of four reasons: Compute Stalls, Memory Stalls, Data Dependence Stalls, and Idle Cycles. We also vary the amount of available off-chip memory bandwidth: (i) half (1/2xBW), (ii) equal to (1xBW), and (iii) double (2xBW) the peak memory bandwidth of our baseline GPU architecture. Section 6 details our baseline architecture and methodology.

Figure 2 shows the percentage of total issue cycles, divided into five components (as described above). The first two components—Memory and Compute Stalls—are attributed to the main memory and ALU-pipeline structural stalls. These stalls occur because of backed-up pipelines due to oversubscribed resources that prevent warps from being issued to the respective pipelines. The third component (Data Dependence Stalls) is due to data dependence stalls. These stalls prevent warps from issuing new instruction(s) when the previous instruction(s) from the same warp are stalled on long-latency operations (usually memory load operations). In some applications (e.g., dmr), special-function-unit (SFU) ALU operations that may take tens of cycles to finish are also a source of data dependence stalls. The fourth component, Idle Cycles, refers to idle cycles when either all the available warps are issued to the pipelines and not ready to execute their next instruction, or the instruction buffers are flushed due to a mispredicted branch. All these components are sources of inefficiency that cause the cores to be underutilized. The last component, Active Cycles, indicates the fraction of cycles during which at least one warp was successfully issued to the pipelines.

We make two observations from Figure 2. First, Compute, Memory, and Data Dependence Stalls are the major sources of underutilization in many GPU applications. We distinguish applications based on their primary bottleneck as either Memory or Compute Bound. We observe that a majority of the applications in our workload pool (17 out of 27 studied) are Memory Bound, and bottlenecked by the off-chip memory bandwidth.

Second, for the Memory Bound applications, we observe that the Memory and Data Dependence Stalls constitute a significant fraction (61%) of the total issue cycles on our baseline GPU architecture (1xBW). This fraction goes down to 51% when the peak memory bandwidth is doubled (2xBW), and increases significantly when the peak bandwidth is halved (1/2xBW), indicating that limited off-chip memory bandwidth is a critical performance bottleneck for Memory Bound applications. Some applications, e.g., BFS, are limited by the interconnect bandwidth. In contrast, the Compute Bound applications are primarily bottlenecked by stalls in the ALU pipelines. An increase or decrease in the off-chip bandwidth has little effect on the performance of these applications.

Unutilized On-chip Memory. The occupancy of any GPU Streaming Multiprocessor (SM), i.e., the number of threads running concurrently, is limited by a number of factors: (1) the available registers and shared memory, (2) the hard limit on the number of threads and thread blocks per core, and (3) the number of thread blocks in the application kernel. The most limiting of these resources leaves the other resources underutilized. This is because it is challenging, in practice, to achieve a perfect balance in utilization of all of the above factors for different workloads with varying characteristics. Very often, the factor determining the occupancy is the thread or thread block limit imposed by the architecture. In this case, there are many registers that are left unallocated to any thread block. Also, the number of available registers may not be a multiple of those required by each thread block. The remaining registers are not enough to schedule an entire extra thread block, which leaves a significant fraction of the register file and shared memory unallocated and unutilized by the thread blocks. Figure 3 shows the fraction of statically unallocated registers in a 128KB register file (per SM) with a 1536-thread, 8-thread-block occupancy limit, for different applications. We observe that, on average, 24% of the register file remains unallocated. This phenomenon has previously been observed and analyzed in detail in [3, 39, 40, 41, 63]. We observe a similar trend with the usage of shared memory (not graphed).
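To make this fragmentation concrete, the small host-side sketch below works through a hypothetical allocation: the 128KB register file (32K 4-byte registers) and the 8-block limit follow the text, while the per-block register count is an assumed value.

```cuda
#include <cstdio>
#include <algorithm>

int main() {
    const int regs_per_sm       = 32768;  // 128KB register file of 4-byte registers
    const int regs_per_block    = 9216;   // assumed: 256 threads x 36 registers each
    const int max_blocks_per_sm = 8;      // hard thread-block limit from the text

    // Register-limited occupancy: how many whole blocks fit in the register file.
    int blocks = std::min(max_blocks_per_sm, regs_per_sm / regs_per_block);
    int unused = regs_per_sm - blocks * regs_per_block;
    printf("resident blocks: %d, unallocated registers: %d (%.1f%%)\n",
           blocks, unused, 100.0 * unused / regs_per_sm);  // 3 blocks, ~15.6% unused
    return 0;
}
```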

Our Goal. We aim to exploit the underutilization of compute resources, registers, and on-chip shared memory as an opportunity to enable different optimizations to accelerate various bottlenecks in GPU program execution. To achieve this goal, we would like to enable efficient helper threading for GPUs to dynamically generate threads in hardware that use the available on-chip resources for various purposes. In the next section, we present the detailed design of our CABA framework that enables the generation and management of these threads.

Figure 2: Breakdown of total issue cycles for 27 representative CUDA applications (Memory Bound: BFS, CONS, JPEG, LPS, MUM, RAY, SCP, MM, PVC, PVR, SS, sc, bfs, bh, mst, sp, sssp; Compute Bound: NN, STO, bp, hs, dmr, NQU, SLA, lc, pt, mc), each evaluated at 1/2x, 1x, and 2x the baseline off-chip memory bandwidth. Each bar is broken down into Compute Stalls, Memory Stalls, Data Dependence Stalls, Idle Cycles, and Active Cycles. See Section 6 for methodology. Figure reproduced from [115].

Figure 3: Fraction of statically unallocated registers. Figure reproduced from [115].

4. The CABA Framework

In order to understand the major design choices behind the CABA framework, we first present our major design goals and describe the key challenges in applying helper threading to GPUs. We then show the detailed design, hardware changes, and operation of CABA. Finally, we briefly describe potential applications of our proposed framework. Section 5 goes into a detailed design of one application of the framework.

4.1. Goals and Challenges

The purpose of CABA is to leverage underutilized GPU resources for useful computation. To this end, we need to efficiently execute subroutines that perform optimizations to accelerate bottlenecks in application execution. The key difference between CABA's assisted execution and regular execution is that CABA must be low overhead and, therefore, helper threads need to be treated differently from regular threads. The low overhead goal imposes several key requirements in designing a framework to enable helper threading. First, we should be able to easily manage helper threads—to enable, trigger, and kill threads when required. Second, helper threads need to be flexible enough to adapt to the runtime behavior of the regular program. Third, a helper thread needs to be able to communicate with the original thread. Finally, we need a flexible interface to specify new subroutines, with the framework being generic enough to handle various optimizations.

With the above goals in mind, enabling helper threading in GPU architectures introduces several new challenges. First, execution on GPUs involves context switching between hundreds of threads. These threads are handled at different granularities in hardware and software. The programmer reasons about these threads at the granularity of a thread block. However, at any point in time, the hardware executes only a small subset of the thread block, i.e., a set of warps. Therefore, we need to define the abstraction levels for reasoning about and managing helper threads from the point of view of the programmer, the hardware, as well as the compiler/runtime. In addition, each of the thousands of executing threads could simultaneously invoke an associated helper thread subroutine. To keep the management overhead low, we need an efficient mechanism to handle helper threads at this magnitude.

Second, GPUs use fine-grained multithreading [105, 112] to time multiplex the fixed number of compute units among the hundreds of threads. Similarly, the on-chip memory resources (i.e., the register file and shared memory) are statically partitioned between the different threads at compile time. Helper threads require their own registers and compute cycles to execute. A straightforward approach would be to dedicate a few registers and compute units just for helper thread execution, but this option is both expensive and wasteful. In fact, our primary motivation is to utilize existing idle resources for helper thread execution. In order to do this, we aim to enable sharing of the existing resources between primary threads and helper threads at low cost, while minimizing the interference to primary thread execution. In the remainder of this section, we describe the design of our low-overhead CABA framework.

4.2. Design of the CABA Framework

We choose to implement CABA using a hardware/software co-design, as pure hardware or pure software approaches pose certain challenges that we describe below. There are two alternatives for a fully software-based approach to helper threads. The first alternative, treating each helper thread as independent kernel code, has high overhead, since we are now treating the helper threads as, essentially, regular threads. This would reduce the primary thread occupancy in each SM (there is a hard limit on the number of threads and blocks that an SM can support). It would also complicate the data communication between the primary and helper threads, since no simple interface exists for inter-kernel communication. The second alternative, embedding the helper thread code within the primary thread kernel itself, offers little flexibility in adapting to runtime requirements, since such helper threads cannot be triggered or squashed independently of the primary thread.

On the other hand, a pure hardware solution would make register allocation for the assist warps and the data communication between the helper threads and primary threads more difficult. Registers are allocated to each thread block by the compiler and are then mapped to the sections of the hardware register file at runtime. Mapping registers for helper threads and enabling data communication between those registers and the primary thread registers would be non-trivial. Furthermore, a fully hardware approach would make offering the programmer a flexible interface more challenging.

Hardware support enables simpler fine-grained management of helper threads, aware of microarchitectural events and runtime program behavior. Compiler/runtime support enables simpler context management for helper threads and more flexible programmer interfaces. Thus, to get the best of both worlds, we propose a hardware/software cooperative approach, where the hardware manages the scheduling and execution of helper thread subroutines, while the compiler performs the allocation of shared resources (e.g., register file and shared memory) for the helper threads, and the programmer or the microarchitect provides the helper threads themselves.

4.2.1. Hardware-based management of threads. To use the available on-chip resources the same way that thread blocks do during program execution, we dynamically insert sequences of instructions into the execution stream. We track and manage these instructions at the granularity of a warp, and refer to them as assist warps. An assist warp is a set of instructions issued into the core pipelines. Each instruction is executed in lock-step across all the SIMT lanes, just like any regular instruction, with an active mask to disable lanes as necessary. The assist warp does not own a separate context (e.g., registers, local memory), and instead shares both a context and a warp ID with the regular warp that invoked it. In other words, each assist warp is coupled with a parent warp. In this sense, it is different from a regular warp and does not reduce the number of threads that can be scheduled on a single SM. Data sharing between the two warps becomes simpler, since the assist warps share the register file with the parent warp. Ideally, an assist warp consumes resources and issue cycles that would otherwise be idle. We describe the structures required to support hardware-based management of assist warps in Section 4.3.

4.2.2. Register file/shared memory allocation. Each helper thread subroutine requires a different number of registers depending on the actions it performs. These registers have a short lifetime, with no values being preserved between different invocations of an assist warp. To limit the register requirements for assist warps, we impose the restriction that only one instance of each helper thread routine can be active for each thread. All instances of the same helper thread for each parent thread use the same registers, and the registers are allocated to the helper threads statically by the compiler. One of the factors that determines the runtime SM occupancy is the number of registers required by a thread block (i.e., the per-block register requirement). For each helper thread subroutine that is enabled, we add its register requirement to the per-block register requirement, to ensure the availability of registers for both the parent threads as well as every assist warp. The registers that remain unallocated after allocation among the parent thread blocks should suffice to support the assist warps. If not, register-heavy assist warps may limit the parent thread block occupancy in SMs or increase the number of register spills in the parent warps. Shared memory resources are partitioned in a similar manner and allocated to each assist warp as and if needed.

4.2.3. Programmer/developer interface. The assist warp subroutine can be written in two ways. First, it can be supplied and annotated by the programmer/developer using CUDA extensions with PTX instructions and then compiled with regular program code. Second, the assist warp subroutines can be written by the microarchitect in the internal GPU instruction format. These helper thread subroutines can then be enabled or disabled by the application programmer. This approach is similar to that proposed in prior work (e.g., [22]). It offers the advantage of potentially being highly optimized for energy and performance while having flexibility in implementing optimizations that are not trivial to map using existing GPU PTX instructions. The instructions for the helper thread subroutine are stored in an on-chip buffer (described in Section 4.3).

Along with the helper thread subroutines, the programmer also provides: (1) the priority of the assist warps to enable the warp scheduler to make informed decisions, (2) the trigger conditions for each assist warp, and (3) the live-in and live-out variables for data communication with the parent warps.

Assist warps can be scheduled with different priority levels in relation to parent warps by the warp scheduler. Some assist warps may perform a function that is required for correct execution of the program and are blocking. At this end of the spectrum, high priority assist warps are treated by the scheduler as always taking higher precedence over the parent warp execution. Assist warps should be given a high priority only when they are required for correctness. Low priority assist warps, on the other hand, are scheduled for execution only when computational resources are available, i.e., during idle cycles. There is no guarantee that these assist warps will execute or complete.

The programmer also provides the conditions or events that need to be satisfied for the deployment of the assist warp. This includes a specific point within the original program and/or a set of other microarchitectural events that could serve as a trigger for starting the execution of an assist warp.
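The paper does not define concrete syntax for this interface. Purely for illustration, the hypothetical host-side sketch below mirrors the three pieces of information the programmer supplies (priority, trigger condition, and live-in/live-out registers); none of these names are real CUDA or CABA APIs.

```cuda
#include <cstdint>

// Hypothetical trigger events and descriptor; names are invented for exposition.
enum class AssistTrigger { OnProgramPoint, OnCacheFill, OnL2Miss };

struct AssistWarpDesc {
    const uint32_t *subroutine;   // pre-assembled instructions for the AWS
    int             length;       // number of instructions
    int             priority;     // 0 = low (idle cycles only), 1 = high (blocking)
    AssistTrigger   trigger;      // event that deploys the assist warp
    int             live_in[2];   // parent-warp registers read at entry
    int             live_out[2];  // parent-warp registers written at exit
};

// Hypothetical registration call: a real runtime would preload the subroutine
// into the Assist Warp Store and program the Assist Warp Controller.
void caba_register_assist_warp(const AssistWarpDesc &) { /* placeholder */ }

int main() {
    static const uint32_t decompress_sr[] = { 0 /* ...encoded instructions... */ };
    AssistWarpDesc desc{decompress_sr, 1, /*priority=*/1,
                        AssistTrigger::OnCacheFill, {4, 5}, {6, 7}};
    caba_register_assist_warp(desc);
    return 0;
}
```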

4.3. Main Hardware Additions

Figure 4 shows a high-level block diagram of the GPU pipeline [43]. To support assist warp execution, we add three new components: (1) an Assist Warp Store to hold the assist warp code, (2) an Assist Warp Controller to perform the deployment, tracking, and management of assist warps, and (3) an Assist Warp Buffer to stage instructions from triggered assist warps for execution.

Figure 4: CABA framework flow within a typical GPU pipeline [43]. The shaded blocks (the Assist Warp Store, Assist Warp Controller, and Assist Warp Buffer) are the components introduced for the framework. Figure reproduced from [115].

Assist Warp Store (AWS). Different assist warp subroutines are possible based on the purpose of the optimization. These code sequences for different types of assist warps need to be stored on-chip. An on-chip storage structure called the Assist Warp Store is preloaded with these instructions before application execution. It is indexed using the subroutine index (SR.ID) along with the instruction ID (Inst.ID).

Assist Warp Controller (AWC). The AWC is responsible for the triggering, tracking, and management of assist warp execution. It stores a mapping between trigger events and a subroutine index in the AWS, as specified by the programmer. The AWC monitors for such events, and when they take place, triggers the fetch, decode, and execution of instructions from the AWS for the respective assist warp.

Deploying all the instructions within an assist warp, back-to-back, at the trigger point may require increased fetch/decode bandwidth and buffer space after decoding [23]. To avoid this, at each cycle, only a few instructions from an assist warp, at most equal to the available decode/issue bandwidth, are decoded and staged for execution. Within the AWC, we simply track the next instruction that needs to be executed for each assist warp, and this is stored in the Assist Warp Table (AWT), as depicted in Figure 5. The AWT also tracks additional metadata required for assist warp management, which is described in more detail in Section 4.4.

Assist Warp Buffer (AWB). Fetched and decoded instructions belonging to the assist warps that have been triggered need to be buffered until the assist warp can be selected for issue by the scheduler. These instructions are then staged in the Assist Warp Buffer along with their warp IDs. The AWB is contained within the instruction buffer (IB), which holds decoded instructions for the parent warps. The AWB makes use of the existing IB structures. The IB is typically partitioned among different warps executing in the SM. Since each assist warp is associated with a parent warp, the assist warp instructions are directly inserted into the same partition within the IB as that of the parent warp. This simplifies warp scheduling, as the assist warp instructions can now be issued as if they were parent warp instructions with the same warp ID. In addition, using the existing partitions avoids the cost of separate dedicated instruction buffering for assist warps.

Figure 5: Fetch logic: the Assist Warp Table (contained in the AWC) and the Assist Warp Store (AWS). Each AWT entry tracks the warp ID, SR.ID, Inst.ID, SR.End, priority, active mask, and live-in/live-out registers. Figure reproduced from [115].
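The AWT state listed in the caption can be summarized as a small struct. This is only an illustrative sketch of the bookkeeping the text describes; field names and widths are chosen by us rather than specified by the paper.

```cuda
#include <cstdint>

// Illustrative sketch of one Assist Warp Table (AWT) entry, based on the
// fields named in Figure 5 and the surrounding text. Field widths and the
// fixed-size live-in/live-out arrays are assumptions, not a specification.
struct AWTEntry {
    uint16_t warp_id;          // parent warp this assist warp is coupled with
    uint8_t  sr_id;            // subroutine index into the Assist Warp Store
    uint8_t  inst_id;          // next instruction to fetch within the subroutine
    uint8_t  sr_end;           // last instruction ID of the subroutine
    uint8_t  priority;         // high (blocking) vs. low (idle-cycle only)
    uint32_t active_mask;      // SIMT lanes that execute the assist warp
    uint8_t  live_in_regs[4];  // parent-warp registers holding live-in data
    uint8_t  live_out_regs[4]; // parent-warp registers receiving live-out data
};
```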

We do, however, provision a small additional partition with two entries within the IB, to hold non-blocking low priority assist warps that are scheduled only during idle cycles. This additional partition allows the scheduler to distinguish low priority assist warp instructions from the parent warp and high priority assist warp instructions, which are given precedence during scheduling, allowing them to make progress.

4.4. The Mechanism

Trigger and Deployment. An assist warp is triggered by the AWC based on a specific set of architectural events and/or a triggering instruction (e.g., a load instruction). When an assist warp is triggered, its specific instance is placed into the Assist Warp Table (AWT) within the AWC (Figure 5). Every cycle, the AWC selects an assist warp to deploy in a round-robin fashion. The AWS is indexed based on the subroutine ID (SR.ID)—which selects the instruction sequence to be executed by the assist warp—and the instruction ID (Inst.ID)—which is a pointer to the next instruction to be executed within the subroutine (Figure 5). The selected instruction is entered into the AWB and, at this point, the instruction enters the active pool with other active warps for scheduling. The Inst.ID for the assist warp is updated in the AWT to point to the next instruction in the subroutine. When the end of the subroutine is reached, the entry within the AWT is freed.
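Purely as an illustration of this trigger-and-deployment step, the host-side sketch below models one AWC cycle in software; the container types and names are ours, not a hardware specification.

```cuda
#include <cstdint>
#include <vector>

// Software model of the structures named in the text; widths are assumptions.
struct AWTEntry  { int warp_id, sr_id, inst_id, sr_end; };
struct StagedInst { int warp_id; uint32_t instr_bits; };
using AssistWarpStore = std::vector<std::vector<uint32_t>>;  // indexed by [SR.ID][Inst.ID]

// One AWC cycle: pick a triggered assist warp round-robin, fetch its next
// instruction from the AWS, stage it in the AWB, and advance (or free) the
// AWT entry when the end of the subroutine is reached.
void awc_deploy_one(std::vector<AWTEntry> &awt, const AssistWarpStore &aws,
                    std::vector<StagedInst> &awb, size_t &rr_cursor) {
    if (awt.empty()) return;
    rr_cursor = (rr_cursor + 1) % awt.size();                 // round-robin selection
    AWTEntry &e = awt[rr_cursor];
    awb.push_back({e.warp_id, aws[e.sr_id][e.inst_id]});      // stage into the AWB
    if (e.inst_id == e.sr_end)
        awt.erase(awt.begin() + rr_cursor);                   // subroutine finished: free entry
    else
        e.inst_id++;                                          // point to next instruction
}
```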

Execution. Assist warp instructions, when selected for issue by the scheduler, are executed in much the same way as any other instructions. The scoreboard tracks the dependencies between instructions within an assist warp in the same way as in any warp, and instructions from different assist warps are interleaved in execution in order to hide latencies. We also provide an active mask (stored as a part of the AWT), which allows for statically disabling/enabling different lanes within a warp. This is useful to provide flexibility in lock-step instruction execution when we do not need all threads within a warp to execute a specific assist warp subroutine.

Dynamic Feedback and Throttling. Assist warps, if not properly controlled, may stall application execution. This can happen due to several reasons. First, assist warps take up issue cycles, and only a limited number of instructions may be issued per clock cycle. Second, assist warps require structural resources: the ALU units and resources in the load-store pipelines (if the assist warps consist of computational and memory instructions, respectively). We may, hence, need to throttle assist warps to ensure that their performance benefits outweigh the overhead. This requires mechanisms to appropriately balance and manage the aggressiveness of assist warps at runtime.

The overheads associated with assist warps can be controlled in different ways. First, the programmer can statically specify the priority of the assist warp. Depending on the criticality of the assist warps in making forward progress, the assist warps can be issued either in idle cycles or with varying levels of priority in relation to the parent warps. For example, warps performing decompression are given a high priority, whereas warps performing compression are given a low priority. Low priority assist warps are inserted into the dedicated partition in the IB, and are scheduled only during idle cycles. This priority is statically defined by the programmer. Second, the AWC can control the number of times the assist warps are deployed into the AWB. The AWC monitors the utilization of the functional units and the idleness of the cores to decide when to throttle assist warp deployment.

Communication and Control. An assist warp may need to communicate data and status with its parent warp. For example, memory addresses from the parent warp need to be communicated to assist warps performing decompression or prefetching. The IDs of the registers containing the live-in data for each assist warp are saved in the AWT when an assist warp is triggered. Similarly, if an assist warp needs to report results to its parent warp (e.g., in the case of memoization), the register IDs are also stored in the AWT. When the assist warps execute, MOVE instructions are first executed to copy the live-in data from the parent warp registers to the assist warp registers. Live-out data is communicated to the parent warp in a similar fashion, at the end of assist warp execution.

Assist warps may need to be killed when they are not required (e.g., if the data does not require decompression) or when they are no longer beneficial. In this case, the entries in the AWT and AWB are simply flushed for the assist warp.

4.5. Applications of the CABA Framework

We envision multiple applications for the CABA framework, e.g., data compression [4, 25, 87, 118], memoization [13, 29, 106], and data prefetching [15, 38, 54, 86, 108]. In Section 5, we provide a detailed case study of enabling data compression with the framework, discussing various tradeoffs. We believe CABA can be useful for many other optimizations, and we discuss some of them briefly in Section 8.

5. A Case for CABA: Data Compression

Data compression is a technique that exploits the redundancy in applications' data to reduce capacity and bandwidth requirements for many modern systems by saving and transmitting data in a more compact form. Hardware-based data compression has been explored in the context of on-chip caches [4, 11, 25, 33, 49, 87, 89, 99, 118], the interconnect [30], and main memory [2, 37, 88, 90, 91, 104, 114] as a means to save storage capacity as well as memory bandwidth. In modern GPUs, memory bandwidth is a key limiter of system performance in many workloads (Section 3). As such, data compression is a promising technique to help alleviate this bottleneck. Compressing data enables less data to be transferred from/to DRAM and over the interconnect.

In bandwidth-constrained workloads, idle compute pipelines offer an opportunity to employ CABA to enable data compression in GPUs. We can use assist warps to (1) decompress data before loading it into the caches and registers, and (2) compress data before writing it back to memory. Since assist warps execute instructions, CABA offers some flexibility in the compression algorithms that can be employed. Compression algorithms that can be mapped to the general GPU execution model can be flexibly implemented with the CABA framework.

5.1. Mapping Compression Algorithms into Assist Warps

In order to employ CABA to enable data compression, we need to map compression algorithms into instructions that can be executed within the GPU cores. For a compression algorithm to be amenable for implementation with CABA, it ideally needs to be (1) reasonably parallelizable and (2) simple (for low latency). Decompressing data involves reading the encoding associated with each cache line that defines how to decompress it, and then triggering the corresponding decompression subroutine in CABA. Compressing data, on the other hand, involves testing different encodings and saving data in the compressed format.

We perform compression at the granularity of a cache line. The data needs to be decompressed before it is used by any program thread. In order to utilize the full SIMD width of the GPU pipeline, we would like to decompress/compress all the words in the cache line in parallel. With CABA, helper thread routines are managed at the warp granularity, enabling fine-grained triggering of assist warps to perform compression/decompression when required. However, the SIMT execution model in a GPU imposes some challenges: (1) threads within a warp operate in lock-step, and (2) threads operate as independent entities, i.e., they do not easily communicate with each other.

In this section, we discuss the architectural changes and algorithm adaptations required to address these challenges and provide a detailed implementation and evaluation of data compression within the CABA framework using the Base-Delta-Immediate compression algorithm [87]. Section 5.1.3 discusses implementing other compression algorithms.

5.1.1. Algorithm Overview. Base-Delta-Immediate compression (BDI) is a simple compression algorithm that was originally proposed in the context of caches [87]. It is based on the observation that many cache lines contain data with low dynamic range. BDI exploits this observation to represent a cache line with low dynamic range using a common base (or multiple bases) and an array of deltas (where a delta is the difference between each value within the cache line and the common base). Since the deltas require fewer bytes than the values themselves, the combined size after compression can be much smaller. Figure 6 shows the compression of an example 64-byte cache line from the PageViewCount (PVC) application using BDI.

As Figure 6 indicates, in this case, the cache line can be represented using two bases (an 8-byte base value, 0x8001D000, and an implicit zero value base) and an array of eight 1-byte differences from these bases. As a result, the entire cache line data can be represented using 17 bytes instead of 64 bytes (1-byte metadata, 8-byte base, and eight 1-byte deltas), saving 47 bytes of the originally used space.

Figure 6: Example 64-byte cache line from PVC compressed with BDI into a 17-byte compressed cache line (1-byte metadata, an 8-byte base of 0x8001D000, and eight 1-byte deltas), saving 47 bytes. Figure reproduced from [115].

Our example implementation of the BDI compression algorithm [87] views a cache line as a set of fixed-size values, i.e., eight 8-byte, sixteen 4-byte, or thirty-two 2-byte values for a 64-byte cache line. For the size of the deltas, it considers three options: 1, 2, and 4 bytes. The key characteristic of BDI, which makes it a desirable compression algorithm to use with the CABA framework, is its fast parallel decompression that can be efficiently mapped into instructions that can be executed on GPU hardware. Decompression is simply a masked vector addition of the deltas to the appropriate bases [87].

5.1.2. Mapping BDI to CABA. In order to implement BDI with the CABA framework, we need to map the BDI compression/decompression algorithms into GPU instruction subroutines (stored in the AWS and deployed as assist warps).

Decompression. To decompress the data compressed with BDI, we need a simple addition of deltas to the appropriate bases. The CABA decompression subroutine first loads the words within the compressed cache line into assist warp registers, and then performs the base-delta additions in parallel, employing the wide ALU pipeline.¹ The subroutine then writes back the uncompressed cache line to the cache. It skips the addition for the lanes with an implicit base of zero by updating the active lane mask based on the cache line encoding. We store a separate subroutine for each possible BDI encoding that loads the appropriate bytes in the cache line as the base and the deltas. The high-level algorithm for decompression is presented in Algorithm 1.

¹Multiple instructions are required if the number of deltas exceeds the width of the ALU pipeline. We use a 32-wide pipeline.

Algorithm 1 BDI: Decompression

1: load base, deltas
2: uncompressed_data = base + deltas
3: store uncompressed_data
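As a rough illustration of this subroutine, the CUDA sketch below decompresses one line for the 8-byte-base, 1-byte-delta encoding; the function name, argument layout, and the way the implicit-zero-base lanes are passed in as a bitmask are our assumptions rather than the paper's actual AWS subroutine (the paper handles those lanes through the active mask). Launched with a single 32-thread warp, lanes 0 through 7 each reconstruct one word.

```cuda
#include <cstdint>
#include <cstring>

// Decompress a 64-byte line stored as: 1-byte metadata, 8-byte base,
// eight 1-byte deltas. Lane i reconstructs word i; lanes flagged in
// zero_base_mask add their delta to the implicit zero base instead.
__global__ void bdi_decompress_8_1(const uint8_t *compressed,
                                   uint64_t *uncompressed,
                                   uint32_t zero_base_mask) {
    int lane = threadIdx.x & 31;              // SIMT lane within the warp
    if (lane >= 8) return;                    // only 8 words in this encoding

    uint64_t base;
    memcpy(&base, compressed + 1, sizeof(base));   // skip the metadata byte
    uint64_t delta = compressed[9 + lane];         // one 1-byte delta per word

    uint64_t effective_base = ((zero_base_mask >> lane) & 1u) ? 0 : base;
    uncompressed[lane] = effective_base + delta;   // masked vector addition
}
```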

Compression. To compress data, the CABA compression subroutine tests several possible encodings (each representing a different size of base and deltas) in order to achieve a high compression ratio. The first few bytes (2–8, depending on the encoding tested) of the cache line are always used as the base. Each possible encoding is tested to check whether the cache line can be successfully encoded with it. In order to perform compression at a warp granularity, we need to check whether all of the words at every SIMD lane were successfully compressed. In other words, if any one word cannot be compressed, that encoding cannot be used across the warp. We can perform this check by adding a global predicate register, which stores the logical AND of the per-lane predicate registers. We observe that applications with homogeneous data structures can typically use the same encoding for most of their cache lines [87]. We use this observation to reduce the number of encodings we test to just one in many cases. All necessary operations are done in parallel using the full width of the GPU SIMD pipeline. The high-level algorithm for compression is presented in Algorithm 2.

Algorithm 2 BDI: Compression

1: for each base_size do
2:   load base, values
3:   for each delta_size do
4:     deltas = abs(values - base)
5:     if size(deltas) <= delta_size then
6:       store base, deltas
7:       exit
8:     end if
9:   end for
10: end for
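A corresponding compression sketch is shown below for a single encoding: the warp-wide `__all_sync` vote stands in for the global predicate register described in the text, while the metadata byte value and the signed-byte delta test are assumptions made for illustration.

```cuda
#include <cstdint>
#include <cstring>

// Try to compress one 64-byte line (viewed as eight 8-byte words) with an
// 8-byte base and 1-byte deltas. The encoding is used only if every lane's
// word is compressible; otherwise the line is stored uncompressed.
__global__ void bdi_try_compress_8_1(const uint64_t *line,   // 8 input words
                                     uint8_t *compressed,    // >= 17 bytes
                                     int *compressed_size) {
    int lane = threadIdx.x & 31;
    uint64_t word  = (lane < 8) ? line[lane] : line[0];
    uint64_t base  = line[0];                        // first word is the base
    int64_t  delta = (int64_t)(word - base);

    // Per-lane predicate; inactive lanes vote "yes" so they do not block.
    bool fits = (lane >= 8) || (delta >= -128 && delta <= 127);
    if (__all_sync(0xffffffffu, fits)) {             // warp-wide logical AND
        if (lane == 0) {
            compressed[0] = 0x01;                    // assumed metadata code
            memcpy(compressed + 1, &base, 8);
            *compressed_size = 17;
        }
        if (lane < 8) compressed[9 + lane] = (uint8_t)delta;
    } else if (lane == 0) {
        *compressed_size = 64;                       // fall back: uncompressed
    }
}
```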

5.1.3. Implementing Other Algorithms. The BDI compression algorithm is naturally amenable to implementation using assist warps because of its data-parallel nature and simplicity. The CABA framework can also be used to realize other algorithms. The challenge in implementing algorithms like FPC [5] and C-Pack [25], which have variable-length compressed words, is primarily in the placement of compressed words within the compressed cache lines. In BDI, the compressed words are in fixed locations within the cache line and, for each encoding, all the compressed words are of the same size and can, therefore, be processed in parallel. In contrast, C-Pack may employ multiple dictionary values as opposed to just one base in BDI. In order to realize algorithms with variable-length words and dictionary values with assist warps, we leverage the coalescing/address generation logic [81, 85] already available in the GPU cores.

We make two minor modifications to these algorithms [5, 25] to adapt them for use with CABA. First, similar to prior works [5, 25, 37], we observe that a few encodings are sufficient to capture almost all the data redundancy. In addition, the impact of any loss in compressibility due to fewer encodings is minimal, as the benefits of bandwidth compression are realized only at multiples of a single DRAM burst (e.g., 32B for GDDR5 [47]). We exploit this to reduce the number of supported encodings. Second, we place all the metadata containing the compression encoding at the head of the cache line to be able to determine how to decompress the entire line upfront. In the case of C-Pack, we place the dictionary entries after the metadata.

We note that it can be challenging to implement complex algorithms efficiently with the simple computational logic available in GPU cores. Fortunately, there are already Special Function Units (SFUs) [21, 66] present in the GPU SMs, used to perform efficient computations of elementary mathematical functions. SFUs could potentially be extended to implement primitives that enable the fast iterative comparisons performed frequently in some compression algorithms. This would enable more efficient execution of the described algorithms, as well as implementation of more complex compression algorithms, using CABA. We leave the exploration of an SFU-based approach to future work.

We now present a detailed overview of mapping the FPC and C-Pack algorithms into assist warps.

5.1.4. Implementing the FPC (Frequent Pattern Compression) Algorithm. For FPC, the cache line is treated as a set of fixed-size words, and each word within the cache line is compressed into a simple prefix (encoding) plus a compressed word if it matches a set of frequent patterns, e.g., narrow values, zeros, or repeated bytes. The word is left uncompressed if it does not fit any pattern. We refer the reader to the original work [5] for a more detailed description of the original algorithm.

The challenge in mapping assist warps to the FPC decompression algorithm is the serial sequence in which each word within a cache line is decompressed. This is because, in the originally proposed version, each compressed word can have a different size. To determine the location of a specific compressed word, it is necessary to have decompressed the previous word. We make some modifications to the algorithm in order to parallelize the decompression across different lanes in the GPU cores. First, we move the word prefixes (metadata) for each word to the front of the cache line, so we know upfront how to decompress the rest of the cache line. Unlike with BDI, each word within the cache line has a different encoding and hence a different compressed word length and encoding pattern. This is problematic, as statically storing the sequence of decompression instructions for every combination of patterns for all the words in a cache line would require very large instruction storage. In order to mitigate this, we break each cache line into a number of segments. Each segment is compressed independently, and all the words within each segment are compressed using the same encoding, whereas different segments may have different encodings. This creates a trade-off between simplicity/parallelizability and compressibility. Consistent with previous works [5], we find that this does not significantly impact compressibility.

Decompression. The high-level algorithm we use for decompression is presented in Algorithm 3. Each segment within the compressed cache line is loaded in series. Each segment is decompressed in parallel; this is possible because all the compressed words within a segment have the same encoding. The decompressed segment is then stored before moving on to the next segment. The location of the next compressed segment is computed based on the size of the previous segment.

Algorithm 3 FPC: Decompression

1: for each segment do
2:   load compressed words
3:   pattern-specific decompression (sign extension / zero value)
4:   store decompressed words
5:   segment-base-address = segment-base-address + segment-size
6: end for
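As a host-side illustration of this segment-by-segment scheme, the sketch below decompresses a segment in which every word shares one FPC-style encoding, so all words can be expanded independently (in hardware, one per SIMD lane). The two encodings shown (zero value and sign-extended narrow value) and the segment layout are simplifying assumptions for illustration, not the full FPC pattern set.

#include <cstdint>
#include <cstring>
#include <vector>

// Simplified FPC-style encodings: every word in a segment uses the same one.
enum class Enc : uint8_t { Zero, Narrow8, Uncompressed };

struct Segment {
    Enc enc;
    std::vector<uint8_t> payload;  // compressed words, packed back to back
};

// Decompress one segment; each word is independent of the others, so an
// assist warp would expand one word per SIMD lane in parallel.
static std::vector<int32_t> decompress_segment(const Segment& s, size_t nwords) {
    std::vector<int32_t> out(nwords);
    for (size_t i = 0; i < nwords; ++i) {            // one lane per iteration
        switch (s.enc) {
            case Enc::Zero:
                out[i] = 0;                           // zero-value pattern
                break;
            case Enc::Narrow8:
                out[i] = static_cast<int8_t>(s.payload[i]);  // sign extend 1B -> 4B
                break;
            case Enc::Uncompressed: {
                int32_t w;
                std::memcpy(&w, &s.payload[i * 4], 4);
                out[i] = w;
                break;
            }
        }
    }
    return out;
}

int main() {
    Segment seg{Enc::Narrow8, {0xFF, 0x01, 0x80, 0x7F}};  // -1, 1, -128, 127
    auto words = decompress_segment(seg, 4);
    return words[0] == -1 ? 0 : 1;
}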

Compression. Similar to the BDI implementation, we loop through and test different encodings for each segment. We also compute the address offset for each segment at each iteration to store the compressed words in the appropriate location in the compressed cache line. Algorithm 4 presents the high-level FPC compression algorithm we use.

Algorithm 4 FPC: Compression

1: load words
2: for each segment do
3:   for each encoding do
4:     test encoding
5:     if compressible then
6:       segment-base-address = segment-base-address + segment-size
7:       store compressed words
8:       break
9:     end if
10:   end for
11: end for

Implementing the C-Pack Algorithm. C-Pack [25] is a dictionary-based compression algorithm where frequent "dictionary" values are saved at the beginning of the cache line. The rest of the cache line contains encodings for each word, which may indicate zero values, narrow values, full or partial matches into the dictionary, or an uncompressible word.

In our implementation, we reduce the number of possible encodings to partial match (only the last byte mismatches), full word match, zero value, and zero extend (only the last byte), and we limit the number of dictionary values to 4. This enables a fixed compressed word size within the cache line. A fixed compressed word size enables compression and decompression of different words within the cache line in parallel. If the number of required dictionary values or uncompressed words exceeds 4, the line is left uncompressed. This is, as in BDI and FPC, a trade-off between simplicity and compressibility. In our experiments, we find that it does not significantly impact the compression ratio, primarily due to the 32B minimum data size and granularity of compression.

Decompression. As described above, to enable parallel decompression, we place the encodings and dictionary values at the head of the line. We also limit the number of encodings to enable quick decompression. We implement C-Pack decompression as a series of instructions (one per encoding used) that load all the registers with the appropriate dictionary values. We define the active lane mask based on the encoding (similar to the mechanism used in BDI) for each load instruction to ensure that the correct word is loaded into each lane's register. Algorithm 5 provides the high-level algorithm for C-Pack decompression.

Compression. Compressing data with C-Pack involves determining the dictionary values that will be used to compress the rest of the line.


Algorithm 5 C-PACK: Decompression

1: add base-address + index-into-dictionary
2: load compressed words
3: for each encoding do
4:   pattern-specific decompression   ▷ mismatch byte load for zero extend or partial match
5: end for
6: Store uncompressed words

In our implementation, we serially add each word from the beginning of the cache line as a dictionary value if it was not already covered by a previous dictionary value. For each dictionary value, we test whether the rest of the words within the cache line are compressible. The next dictionary value is determined using the predicate register to find the next uncompressed word, as in BDI. After four iterations (dictionary values), if all the words within the line are not compressible, the cache line is left uncompressed. Similar to BDI, the global predicate register is used to determine the compressibility of all of the lanes after four or fewer iterations. Algorithm 6 provides the high-level algorithm for C-Pack compression.

Algorithm 6 C-PACK: Compression

1: load words
2: for each dictionary value (including zero) do   ▷ to a maximum of four
3:   test match/partial match
4:   if compressible then
5:     Store encoding and mismatching byte
6:     break
7:   end if
8: end for
9: if all lanes are compressible then
10:   Store compressed cache line
11: end if

5.2. Walkthrough of CABA-based Compression

We show the detailed operation of the CABA-based compression and decompression mechanisms in Figure 7. We assume a baseline GPU architecture with three levels in the memory hierarchy: two levels of caches (private L1s and a shared L2) and main memory. Different levels can potentially store compressed data. In this section and in our evaluations, we assume that only the L2 cache and main memory contain compressed data. Note that there is no capacity benefit in the baseline mechanism, as compressed cache lines still occupy the full uncompressed slot, i.e., we only evaluate the bandwidth-saving benefits of compression in GPUs.

5.2.1. The Decompression Mechanism. Load instructions that access global memory data in the compressed form trigger the appropriate assist warp to decompress the data before it is used. The subroutines to decompress data are stored in the Assist Warp Store (AWS). The AWS is indexed by the compression encoding at the head of the cache line and by a bit indicating whether the instruction is a load (decompression is required) or a store (compression is required).

Figure 7: Walkthrough of CABA-based Compression. Figure reproduced from [115].

Each decompression assist warp is given high priority and, hence, stalls the progress of its parent warp until it completes its execution. This ensures that the parent warp correctly gets the decompressed value.

L1 Access. We store data in the L1 cache in the uncompressed form. An L1 hit does not require an assist warp for decompression.

L2/Memory Access. Global memory data cached in L2/DRAM could potentially be compressed. A bit indicating whether the cache line is compressed is returned to the core along with the cache line (1). If the data is uncompressed, the line is inserted into the L1 cache and the writeback phase resumes normally. If the data is compressed, the compressed cache line is inserted into the L1 cache. The encoding of the compressed cache line and the warp ID are relayed to the Assist Warp Controller (AWC), which then triggers the AWS (2) to deploy the appropriate assist warp (3) to decompress the line. During regular execution, the load information for each thread is buffered in the coalescing/load-store unit [81, 85] until all the data is fetched. We continue to buffer this load information (4) until the line is decompressed.

After the CABA decompression subroutine ends execution, the original load that triggered decompression is resumed (4).

5.2.2. The Compression Mechanism. The assist warps that perform compression are triggered by store instructions. When data is written to a cache line (i.e., by a store), the cache line can be written back to main memory either in the compressed or the uncompressed form. Compression is off the critical path, and the warps that perform compression can be scheduled whenever the required resources are available.

Pending stores are buffered in a few dedicated sets within the L1 cache or in available shared memory (5). In the case of an overflow in this buffer space (5), the stores are released to the lower levels of the memory system in the uncompressed form (6). Upon detecting the availability of resources to perform the data compression, the AWC triggers the deployment of the assist warp that performs compression (2) into the AWB (3), with low priority. The scheduler is then free to schedule the instructions from the compression subroutine. Since compression is not on the critical path of execution, keeping such instructions at low priority ensures that the main program is not unnecessarily delayed.

L1 Access. On a hit in the L1 cache, the cache line is already available in the uncompressed form. Depending on the availability of resources, the cache line can be scheduled for compression or simply written back to the L2 and main memory uncompressed, when evicted.


L2/Memory Access. Data in memory is compressed at the granularity of a full cache line, but stores can be at granularities smaller than the size of the cache line. This poses some additional difficulty if the destination cache line for a store is already compressed in main memory. Partial writes into a compressed cache line would require the cache line to be decompressed first, then updated with the new data, and written back to main memory. The common case, where the cache line being written to is initially uncompressed, can be easily handled. However, in the worst case, the cache line being partially written to is already in the compressed form in memory. We now describe the mechanism that handles both these cases.

Initially, to reduce the store latency, we assume that the cache line is uncompressed, and issue a store to the lower levels of the memory hierarchy while buffering a copy in L1. If the cache line is found in L2/memory in the uncompressed form (1), the assumption was correct. The store then proceeds normally and the buffered stores are evicted from L1. If the assumption is incorrect, the cache line is retrieved (7) and decompressed before the store is retransmitted to the lower levels of the memory hierarchy.

5.3. Realizing Data Compression

Supporting data compression requires additional support from the main memory controller and the runtime system, as we describe below.

5.3.1. Initial Setup and Profiling. Data compression with CABA requires a one-time data setup before the data is transferred to the GPU. We assume initial software-based data preparation where the input data is stored in CPU memory in the compressed form with an appropriate compression algorithm before transferring the data to GPU memory. Transferring data in the compressed form can also reduce PCIe bandwidth usage.2

Memory-bandwidth-limited GPU applications are the best candidates for employing data compression using CABA. The compiler (or the runtime profiler) is required to identify those applications that are most likely to benefit from this framework. For applications where memory bandwidth is not a bottleneck, data compression is simply disabled (this can be done statically or dynamically within the framework).

5.3.2. Memory Controller Changes. Data compression reduces off-chip bandwidth requirements by transferring the same data in fewer DRAM bursts. The memory controller (MC) needs to know whether the cache line data is compressed and how many bursts (1–4 bursts in GDDR5 [47]) are needed to transfer the data from DRAM to the MC. Similar to prior work [88, 100], we require metadata information for every cache line that keeps track of how many bursts are needed to transfer the data. Similar to prior work [100], we simply reserve 8MB of GPU DRAM space for the metadata (~0.2% of all available memory).

2This requires changes to the DMA engine to recognize compressed lines.

Unfortunately, this simple design would require an additional access to the metadata for every access to DRAM, effectively doubling the required bandwidth. To avoid this, a simple metadata (MD) cache that keeps frequently accessed metadata on chip (near the MC) is required. Note that this metadata cache is similar to other metadata storage and caches proposed for various purposes in the memory controller, e.g., [42, 67, 74, 88, 94, 101]. Our experiments show that a small 8KB, 4-way associative MD cache is sufficient to provide a hit rate of 85% on average (more than 99% for many applications) across all applications in our workload pool.3 Hence, in the common case, a second access to DRAM to fetch compression-related metadata can be avoided.
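For intuition, the sketch below models a small set-associative metadata cache of the kind described above: each entry records how many DRAM bursts are needed for a cache line, and the 4-way organization mirrors the configuration we evaluate. The entry format, indexing, replacement policy, and exact sizing are illustrative assumptions rather than the precise hardware design.

#include <cstdint>
#include <array>
#include <optional>

// A toy 4-way set-associative metadata (MD) cache kept near the memory
// controller. Each entry maps a cache-line address to the number of DRAM
// bursts (1-4 for GDDR5) needed to transfer that line's compressed data.
class MDCache {
    struct Entry { uint64_t tag = 0; uint8_t bursts = 0; uint8_t lru = 0; bool valid = false; };
    static constexpr size_t kWays = 4;
    static constexpr size_t kSets = 512;   // 4 ways x 512 sets of compact entries (~8KB class)
    std::array<std::array<Entry, kWays>, kSets> sets_{};

public:
    // Returns the burst count on a hit; nullopt means a second DRAM access
    // is needed to fetch the metadata.
    std::optional<uint8_t> lookup(uint64_t line_addr) {
        auto& set = sets_[line_addr % kSets];
        for (auto& e : set) {
            if (e.valid && e.tag == line_addr) { e.lru = 0; return e.bursts; }
        }
        return std::nullopt;
    }

    void fill(uint64_t line_addr, uint8_t bursts) {
        auto& set = sets_[line_addr % kSets];
        Entry* victim = &set[0];
        for (auto& e : set) {              // pick an invalid way, else the oldest one
            if (!e.valid) { victim = &e; break; }
            if (e.lru > victim->lru) victim = &e;
            ++e.lru;                       // age the valid entries
        }
        *victim = Entry{line_addr, bursts, 0, true};
    }
};

int main() {
    MDCache md;
    md.fill(0x1000, 2);                    // line compresses into 2 bursts
    return md.lookup(0x1000).value_or(0) == 2 ? 0 : 1;
}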

6. Methodology

We model the CABA framework in GPGPU-Sim 3.2.1 [16]. Table 1 provides the major parameters of the simulated system. We use GPUWattch [65] to model GPU power and CACTI [113] to evaluate the power/energy overhead associated with the MD cache (Section 5.3.2) and the additional components (AWS and AWC) of the CABA framework. We implement BDI [87] using the Synopsys Design Compiler with a 65nm library (to evaluate the energy overhead of compression/decompression for the dedicated hardware design, for comparison to CABA), and then use ITRS projections [50] to scale our results to the 32nm technology node.

System Overview: 15 SMs, 32 threads/warp, 6 memory channels
Shader Core Config: 1.4GHz, GTO scheduler [96], 2 schedulers/SM
Resources / SM: 48 warps/SM, 32768 registers, 32KB shared memory
L1 Cache: 16KB, 4-way associative, LRU replacement policy
L2 Cache: 768KB, 16-way associative, LRU replacement policy
Interconnect: 1 crossbar/direction (15 SMs, 6 MCs), 1.4GHz
Memory Model: 177.4GB/s BW, 6 GDDR5 memory controllers (MCs), FR-FCFS scheduling, 16 banks/MC
GDDR5 Timing [47]: tCL = 12, tRP = 12, tRC = 40, tRAS = 28, tRCD = 12, tRRD = 6, tCDLR = 5, tWR = 12

Table 1: Major parameters of the simulated systems.

Evaluated Applications. We use a number of CUDA applications derived from the CUDA SDK [83] (BFS, CONS, JPEG, LPS, MUM, RAY, SLA, TRA), Rodinia [24] (hs, nw), Mars [44] (KM, MM, PVC, PVR, SS), and LonestarGPU [20] (bfs, bh, mst, sp, sssp) suites. We run all applications to completion or for 1 billion instructions (whichever comes first). CABA-based data compression is beneficial mainly for memory-bandwidth-limited applications. In computation-resource-limited applications, data compression is not only unrewarding, but it can also cause significant performance degradation due to the computational overheads associated with assist warps. We rely on static profiling to identify memory-bandwidth-limited applications and disable CABA-based compression for the others.

3For applications where the MD cache miss rate is low, we observe that MD cache misses are usually also TLB misses. Hence, most of the overhead of MD cache misses in these applications is outweighed by the cost of page table lookups.


In our evaluation (Section 7), we present detailed results for applications that exhibit some compressibility in memory bandwidth (at least 10%). Applications without compressible data (e.g., sc, SCP) do not gain any performance from the CABA framework, and we verified that these applications do not incur any performance degradation (because the assist warps are not triggered for them).

Evaluated Metrics. We present instructions per cycle (IPC) as the primary performance metric. We also use average bandwidth utilization, defined as the fraction of total DRAM cycles during which the DRAM data bus is busy, and compression ratio, defined as the ratio of the number of DRAM bursts required to transfer data in the compressed vs. uncompressed form. As reported in prior work [87], we use decompression/compression latencies of 1/5 cycles for the hardware implementation of BDI.
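As a small illustration of the burst-granularity effect mentioned earlier (bandwidth savings only materialize in whole 32B GDDR5 bursts), the sketch below rounds a compressed size up to burst granularity and compares burst counts; the 128B line size and 32B burst size are the values assumed throughout the paper, while the 40B example is hypothetical.

#include <cstdio>

// Number of 32B GDDR5 bursts needed to transfer `bytes` of data.
// Savings only appear when compression crosses a 32B burst boundary.
constexpr int kBurstBytes = 32;
constexpr int kLineBytes  = 128;

int bursts_needed(int bytes) {
    return (bytes + kBurstBytes - 1) / kBurstBytes;   // round up to whole bursts
}

int main() {
    // A 128B line that compresses to 40B still needs 2 bursts (not 1.25),
    // i.e., it occupies 64B worth of bus time instead of 128B.
    int uncompressed = bursts_needed(kLineBytes);     // 4 bursts
    int compressed   = bursts_needed(40);             // 2 bursts
    std::printf("bursts: %d -> %d (%.2fx fewer)\n",
                uncompressed, compressed,
                static_cast<double>(uncompressed) / compressed);
    return 0;
}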

7. Results

To evaluate the effectiveness of using CABA to employ data compression, we compare five different designs: (i) Base, the baseline system with no compression; (ii) HW-BDI-Mem, hardware-based memory bandwidth compression with dedicated logic (data is stored compressed in main memory but uncompressed in the last-level cache, similar to prior works [88, 100]); (iii) HW-BDI, hardware-based interconnect and memory bandwidth compression (data is stored uncompressed only in the L1 cache); (iv) CABA-BDI, the Core-Assisted Bottleneck Acceleration (CABA) framework (Section 4) with all associated overheads of performing compression (for both interconnect and memory bandwidth); and (v) Ideal-BDI, compression (for both interconnect and memory) with no latency/power overheads for compression or decompression. This section provides our major results and analyses.

7.1. Effect on Performance and Bandwidth Utilization

Figures 8 and 9 show, respectively, the normalized performance (vs. Base) and the memory bandwidth utilization of the five designs. We make three major observations.

Figure 8: Normalized performance. Figure reproduced from [115].

First, all compressed designs are effective in providing high performance improvement over the baseline. Our approach (CABA-BDI) provides a 41.7% average improvement, which is only 2.8% less than the ideal case (Ideal-BDI) with none of the overheads associated with CABA. CABA-BDI's performance is 9.9% better than the previous hardware-based memory bandwidth compression design [100] (HW-BDI-Mem), and only 1.6% worse than the purely hardware-based design (HW-BDI) that performs both interconnect and memory bandwidth compression. We conclude that our framework is effective at enabling the benefits of compression without requiring specialized hardware compression and decompression logic.

Figure 9: Memory bandwidth utilization. Figure reproduced from [115].

Second, performance benefits, in many workloads, correlate with the reduction in memory bandwidth utilization. For a fixed amount of data, compression reduces the bandwidth utilization and, thus, increases the effective available bandwidth. Figure 9 shows that CABA-based compression 1) reduces the average memory bandwidth utilization from 53.6% to 35.6% and 2) is effective at alleviating the memory bandwidth bottleneck in most workloads. In some applications (e.g., bfs and mst), the designs that compress both the on-chip interconnect and the memory bandwidth, i.e., CABA-BDI and HW-BDI, perform better than the design that compresses only the memory bandwidth (HW-BDI-Mem). Hence, CABA seamlessly enables the mitigation of the interconnect bandwidth bottleneck as well, since data compression/decompression is flexibly performed at the cores.

Third, for some applications, CABA-BDI performs slightly (within 3%) better than Ideal-BDI and HW-BDI. The reason for this counter-intuitive result is the effect of warp oversubscription [14, 56, 57, 96]. In these cases, too many warps execute in parallel, polluting the last-level cache. CABA-BDI sometimes reduces this pollution as a side effect of performing more computation in assist warps, which slows down the progress of the parent warps.

We conclude that the CABA framework can effectively enable data compression to reduce both on-chip interconnect and off-chip memory bandwidth utilization, thereby improving the performance of modern GPGPU applications.

7.2. Effect on Energy

Compression decreases energy consumption in two ways: 1) by reducing bus energy consumption, and 2) by reducing execution time. Figure 10 shows the normalized energy consumption of the five systems. We model the static and dynamic energy of the cores, caches, DRAM, and all buses (both on-chip and off-chip), as well as the energy overheads related to compression: the metadata (MD) cache and the compression/decompression logic. We make two major observations.

First, CABA-BDI reduces energy consumption by as much as 22.2% over the baseline. This is especially noticeable for memory-bandwidth-limited applications, e.g., PVC and mst.


Figure 10: Normalized energy consumption. Figure reproduced from [115].

This is a result of two factors: (i) the reduction in the amount of data transferred between the LLC and DRAM (as a result of which we observe a 29.5% average reduction in DRAM power) and (ii) the reduction in total execution time. This observation agrees with several prior works on bandwidth compression [88, 104]. We conclude that the CABA framework is capable of reducing the overall system energy, primarily by decreasing the off-chip memory traffic.

Second, CABA-BDI's energy consumption is only 3.6% more than that of the HW-BDI design, which uses dedicated logic for memory bandwidth compression. It is also only 4.0% more than that of the Ideal-BDI design, which has no compression-related overheads. CABA-BDI consumes more energy because it schedules and executes assist warps, utilizing on-chip register files, memory, and computation units, which is less energy-efficient than using dedicated logic for compression. However, as the results indicate, this additional energy cost is small compared to the performance gains of CABA (recall, 41.7% over Base), and may be amortized by using CABA for other purposes as well (see Section 8).

Power Consumption. CABA-BDI increases the system power consumption by 2.9% over the baseline (not graphed), mainly due to the additional hardware and the higher utilization of the compute pipelines. However, the power overhead enables energy savings by reducing bandwidth use and can be amortized across other uses of CABA (Section 8).

Energy-Delay Product. Figure 11 shows the product of the normalized energy consumption and the normalized execution time for the evaluated GPU workloads. This metric simultaneously captures two quantities of interest: energy dissipation and execution delay (the inverse of performance). An optimal feature would incur low energy overhead while also reducing the execution delay. This metric is useful in capturing the efficiency of different architectural designs and features, which may expend differing amounts of energy while producing the same performance speedup, or vice versa. Hence, a lower Energy-Delay product is more desirable. We observe that CABA-BDI has a 45% lower Energy-Delay product than the baseline. This reduction comes from energy savings due to reduced data transfers as well as from the lower execution time. On average, CABA-BDI is within only 4% of Ideal-BDI, which incurs none of the energy and performance overheads of the CABA framework.
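Stated explicitly (in our own notation, implied by the description above), the quantity plotted in Figure 11 for each design is

  EDP_norm = (E_design / E_Base) x (T_design / T_Base),

so a lower value indicates a better combined energy-performance outcome; CABA-BDI's 45% reduction corresponds to a normalized energy-delay product of 0.55 relative to Base.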

7.3. Effect of Enabling Different Compression Algorithms

The CABA framework is not limited to a single compression algorithm, and can be effectively used to employ other hardware-based compression algorithms (e.g., FPC [4] and C-Pack [25]). The effectiveness of other algorithms depends on two key factors: (i) how efficiently the algorithm maps to GPU instructions, and (ii) how compressible the data is with the algorithm. We map the FPC and C-Pack algorithms to the CABA framework and evaluate the framework's efficacy.

Figure 11: Energy-Delay product.

Figure 12 shows the normalized speedup with four versions of our design: CABA-FPC, CABA-BDI, and CABA-C-Pack, which use the FPC, BDI, and C-Pack compression algorithms, respectively, and CABA-BestOfAll. CABA-BestOfAll is an idealized design that selects and uses the best of all three algorithms, in terms of compression ratio, for each cache line, assuming no selection overhead. We make three major observations.

Figure 12: Speedup with different compression algorithms. Figure reproduced from [115].

First, CABA significantly improves performance with any compression algorithm (20.7% with FPC, 35.2% with C-Pack). Similar to CABA-BDI, the applications that benefit the most are those that are both memory-bandwidth-sensitive (Figure 9) and compressible (Figure 13). We conclude that our proposed framework, CABA, is general and flexible enough to successfully enable different compression algorithms.

Figure 13: Compression ratio of algorithms with CABA. Figure reproduced from [115].

Second, applications benefit differently from each algorithm. For example, LPS, JPEG, MUM, and nw have higher compression ratios with FPC or C-Pack, whereas MM, PVC, and PVR compress better with BDI. This motivates the necessity of having flexible data compression with different algorithms within the same system.


Implementing multiple compression algorithms completely in hardware is expensive, as it adds significant area overhead, whereas CABA can flexibly enable the use of different algorithms via its general assist warp framework.

Third, the design with the best of the three compression algorithms, CABA-BestOfAll, can sometimes improve performance more than each individual design with just one compression algorithm (e.g., for MUM and KM). This happens because, even within an application, different cache lines compress better with different algorithms. At the same time, the different compression-related overheads of different algorithms can cause one to have higher performance than another, even though the latter may have a higher compression ratio. For example, CABA-BDI provides higher performance on LPS than CABA-FPC, even though BDI has a lower compression ratio than FPC for LPS, because BDI's compression/decompression latencies are much lower than FPC's. Hence, a mechanism that selects the best compression algorithm based on both the compression ratio and the relative cost of compression/decompression is desirable to get the best of multiple compression algorithms. The CABA framework can flexibly enable the implementation of such a mechanism, whose design we leave for future work.

7.4. Sensitivity to Peak Main Memory Bandwidth

As described in Section 3, main memory (off-chip) bandwidth is a major bottleneck in GPU applications. In order to confirm that CABA works for different designs with varying amounts of available memory bandwidth, we conduct an experiment where CABA-BDI is used in three systems with 0.5X, 1X, and 2X the bandwidth of the baseline.

Figure 14 shows the results of this experiment. We observe that, as expected, each CABA design (*-CABA) significantly outperforms the corresponding baseline design with the same amount of bandwidth. The performance improvement with CABA is often equivalent to that of doubling the off-chip memory bandwidth. We conclude that CABA-based bandwidth compression, on average, offers almost all the performance benefit of doubling the available off-chip bandwidth with only modest complexity to support assist warps.

Figure 14: Sensitivity of CABA to memory bandwidth. Figure reproduced from [115].

7.5. Selective Cache Compression with CABA

In addition to reducing bandwidth consumption, data compression can also increase the effective capacity of on-chip caches. While compressed caches can be beneficial, as higher effective cache capacity leads to lower miss rates, supporting cache compression requires several changes in the cache design [4, 25, 87, 89, 99].

Figure 15 shows the effect of four cache compression designs using CABA-BDI (applied to both the L1 and L2 caches with 2x or 4x the number of tags of the baseline4) on performance. We make two major observations. First, several applications from our workload pool are not only bandwidth sensitive, but also cache capacity sensitive. For example, bfs and sssp significantly benefit from L1 cache compression, while TRA and KM benefit from L2 compression. Second, L1 cache compression can severely degrade the performance of some applications, e.g., hw and LPS. The reason for this is the overhead of decompression, which can be especially high for L1 caches as they are accessed very frequently. This overhead can be easily avoided by disabling compression at any level of the memory hierarchy.

Figure 15: Speedup of cache compression with CABA.

7.6. Other Optimizations

We also consider several other optimizations of the CABA framework for data compression: (i) avoiding the overhead of decompression in L2 by storing data in the uncompressed form, and (ii) optimized loads of only the useful data.

Uncompressed L2. The CABA framework allows us to store compressed data selectively at different levels of the memory hierarchy. We consider an optimization where we avoid the overhead of decompressing data in L2 by storing data in the uncompressed form. This provides another tradeoff between the savings in on-chip traffic (when data in L2 is compressed, the default option) and the savings in decompression latency (when data in L2 is uncompressed). Figure 16 depicts the performance benefits of this optimization. Several applications in our workload pool (e.g., RAY) benefit from storing data uncompressed, as these applications have high hit rates in the L2 cache. We conclude that offering the choice of enabling or disabling compression at different levels of the memory hierarchy can provide higher levels of the software stack (e.g., applications, compilers, runtime system, system software) with an additional performance knob.

Uncoalesced requests. Accesses by scalar threads from the same warp are coalesced into fewer memory transactions [82]. If the requests from different threads within a warp span two or more cache lines, multiple lines have to be retrieved and decompressed before the warp can proceed with its execution. Uncoalesced requests can significantly increase the number of assist warps that need to be executed.

4The number of tags limits the effective compressed cache size [4, 87].


An alternative to decompressing each cache line (when only a few bytes from each line may be required) is to enhance the coalescing unit to supply only the correct deltas from within each compressed cache line. The logic that maps bytes within a cache line to the appropriate registers would need to be enhanced to take into account the encoding of the compressed line to determine the size of the base and the deltas. As a result, we do not decompress entire cache lines and only extract the data that is needed. In this case, the cache line is not inserted into the L1D cache in the uncompressed form, and hence every line needs to be decompressed even if it is found in the L1D cache.5 Direct-Load in Figure 16 depicts the performance impact of this optimization. The overall performance improvement is 2.5% on average across all applications (as high as 4.6% for MM).
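To illustrate the direct-load idea for a BDI-style encoding, the sketch below extracts a single requested word from a compressed line by combining the base with just that word's delta, without expanding the rest of the line. The layout (one 4-byte base at the head followed by one signed 1-byte delta per word) is one of BDI's encodings and is assumed here for simplicity; a little-endian host is also assumed.

#include <cstdint>
#include <cstring>

// Extract word `idx` from a line compressed with a BDI base4-delta1 encoding:
// a 4-byte base followed by one signed 1-byte delta per word. Only the
// requested word is reconstructed; the rest of the line stays compressed.
uint32_t direct_load(const uint8_t* compressed_line, unsigned idx) {
    uint32_t base;
    std::memcpy(&base, compressed_line, sizeof(base));     // base at the head
    int8_t delta = static_cast<int8_t>(compressed_line[4 + idx]);
    return base + static_cast<int32_t>(delta);             // word = base + delta
}

int main() {
    // Little-endian base = 1000, deltas = {0, +3, -2, +7} -> {1000, 1003, 998, 1007}
    uint8_t line[8] = {0xE8, 0x03, 0x00, 0x00, 0, 3, 0xFE, 7};
    return direct_load(line, 2) == 998 ? 0 : 1;
}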

Figure 16: Effect of different optimizations (Uncompressed data in L2 and Direct-Load) on applications' performance.

8. Other Uses of the CABA Framework

The CABA framework can be employed in various ways to alleviate system bottlenecks and increase system performance and energy efficiency. In this section, we discuss several other potential applications of CABA, focusing on two: memoization and prefetching. We leave detailed evaluations and analyses of these use cases of CABA to future work.

8.1. Memoization

Hardware memoization is a technique used to avoid redundant computations by reusing the results of previous computations that have the same or similar inputs. Prior work [8, 13, 98] observed redundancy in the input data of GPU workloads. In applications limited by the available compute resources, memoization offers an opportunity to trade off computation for storage, thereby enabling potentially higher energy efficiency and performance. In order to realize memoization in hardware, a look-up table (LUT) is required to dynamically cache the results of computations as well as the corresponding inputs. The granularity of computational reuse can be at the level of fragments [13], basic blocks, functions [7, 9, 29, 45, 106], or long-latency instructions [26]. The CABA framework provides a natural way to implement such an optimization. The availability of on-chip memory lends itself to use as the LUT. In order to cache previous results in on-chip memory, look-up tags (similar to those proposed in [41]) are required to index the correct results.

5This optimization also benefits cache lines that might not have many uncoalesced accesses, but have poor data reuse in the L1D.

With applications tolerant of approximate results (e.g., image processing, machine learning, fragment rendering kernels), the computational inputs can be hashed to reduce the size of the LUT. Register values, texture/constant memory, or global memory sections that are not subject to change are potential inputs. An assist warp can be employed to perform memoization in the following way: (1) compute the hashed value for look-up at predefined trigger points, (2) use the load/store pipeline to save these inputs in available shared memory, and (3) eliminate redundant computations by loading the previously computed results in the case of a hit in the LUT.
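The sketch below illustrates the flavor of such an assist-warp memoization scheme in host code: inputs are hashed into a small table that caches previous results, and a hit skips the (expensive) computation. The hash, table size, and exact-match policy are illustrative choices; an approximate variant would tolerate hash collisions.

#include <cstdint>
#include <array>
#include <cmath>
#include <optional>

// A tiny direct-mapped memoization LUT: maps a hashed input to a cached result.
// In CABA, an assist warp would keep such a table in spare shared memory.
struct MemoLUT {
    struct Entry { uint64_t key = 0; float value = 0.0f; bool valid = false; };
    std::array<Entry, 256> table{};

    static uint64_t hash(uint64_t x) {                 // cheap mixing hash
        x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33;
        return x;
    }
    std::optional<float> lookup(uint64_t key) const {
        const Entry& e = table[hash(key) & 0xFF];
        return (e.valid && e.key == key) ? std::optional<float>(e.value) : std::nullopt;
    }
    void insert(uint64_t key, float value) {
        table[hash(key) & 0xFF] = Entry{key, value, true};
    }
};

// An "expensive" kernel body we would rather not recompute for repeated inputs.
float expensive(uint64_t x) { return std::sqrt(static_cast<float>(x)) * 3.0f; }

float memoized(MemoLUT& lut, uint64_t x) {
    if (auto hit = lut.lookup(x)) return *hit;          // reuse the cached result
    float r = expensive(x);                             // compute on a miss
    lut.insert(x, r);                                   // save input/result pair
    return r;
}

int main() {
    MemoLUT lut;
    float a = memoized(lut, 144);                       // miss: computes 36
    float b = memoized(lut, 144);                       // hit: reuses 36
    return (a == b) ? 0 : 1;
}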

8.2. Prefetching

Prefetching has been explored in the context of GPUs [12, 51, 52, 63, 64, 73, 102] with the goal of reducing effective memory latency. With memory-latency-bound applications, the load/store pipelines can be employed by the CABA framework to perform opportunistic prefetching into GPU caches. The CABA framework can potentially enable the effective use of prefetching in GPUs for several reasons: (1) Even simple prefetchers such as stream [54, 86, 109] or stride [15, 38] prefetchers are non-trivial to implement in GPUs, since access patterns need to be tracked and trained at the fine granularity of warps [64, 102]. CABA could enable fine-grained bookkeeping by using spare registers and assist warps to save metadata for each warp. The computational units could then be used to continuously compute strides in access patterns, both within and across warps. (2) It has been demonstrated that software prefetching and helper threads [1, 19, 28, 48, 63, 69, 77, 78, 111] are very effective in performing prefetching for irregular access patterns. Assist warps offer a hardware/software interface to implement application-specific prefetching algorithms with varying degrees of complexity, without the additional overheads of various hardware implementations. (3) In bandwidth-constrained GPU systems, uncontrolled prefetching could potentially flood the off-chip buses, delaying demand requests. CABA can enable flexible prefetch throttling (e.g., [34, 36, 109]) by scheduling assist warps that perform prefetching only when the memory pipelines are idle or underutilized. (4) Prefetching with CABA entails using load or prefetch instructions, which not only enables prefetching into the hardware-managed caches, but could also simplify the use of unutilized shared memory or register file space as prefetch buffers.
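As an illustration of point (1) above, the sketch below keeps a small per-warp table of the last address and detected stride, of the kind an assist warp could maintain in spare registers or shared memory, and emits a prefetch candidate once the same stride is seen twice. The table organization and confidence rule are illustrative assumptions.

#include <cstdint>
#include <optional>
#include <unordered_map>

// Per-warp stride bookkeeping that an assist warp could keep in spare
// registers/shared memory: remember the last address and stride per warp,
// and prefetch last_addr + stride once the same stride repeats.
class WarpStrideTable {
    struct State { uint64_t last_addr = 0; int64_t stride = 0; int confidence = 0; };
    std::unordered_map<uint32_t, State> warps_;   // keyed by warp id

public:
    // Record a demand access; returns a prefetch address when confident.
    std::optional<uint64_t> observe(uint32_t warp_id, uint64_t addr) {
        State& s = warps_[warp_id];
        int64_t stride = static_cast<int64_t>(addr) - static_cast<int64_t>(s.last_addr);
        if (s.confidence > 0 && stride == s.stride) {
            s.confidence++;                       // same stride seen again
        } else {
            s.stride = stride;
            s.confidence = 1;                     // start training on a new stride
        }
        s.last_addr = addr;
        if (s.confidence >= 2) return addr + static_cast<uint64_t>(s.stride);
        return std::nullopt;
    }
};

int main() {
    WarpStrideTable t;
    t.observe(/*warp_id=*/3, 0x1000);             // first access: no prediction
    t.observe(3, 0x1080);                         // stride 0x80 seen once
    auto p = t.observe(3, 0x1100);                // stride repeats: prefetch 0x1180
    return (p && *p == 0x1180) ? 0 : 1;
}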

8.3. Other Uses

Redundant Multithreading. Reliability of GPUs is a key concern, especially today when they are widely employed in many supercomputing systems. Ensuring hardware protection with dedicated resources can be expensive [71]. Redundant multithreading [76, 93, 116, 117] is an approach where redundant threads are used to replicate program execution. The results are compared at different points in execution to detect and potentially correct errors. The CABA framework can be extended to redundantly execute portions of the original program via the use of such approaches to increase the reliability of GPU architectures.


Speculative Precomputation. In CPUs, speculative multithreading [72, 92, 107] has been proposed to speculatively parallelize serial code and verify its correctness later. Assist warps can be employed in GPU architectures to speculatively pre-execute sections of code during idle cycles to further improve parallelism in program execution. Applications tolerant of approximate results could be particularly amenable to this optimization [119].

Handling Interrupts and Exceptions. Current GPUs do not implement support for interrupt handling, except for some support for timer interrupts used for application time-slicing [84]. CABA offers a natural mechanism for associating architectural events with subroutines to be executed in throughput-oriented architectures where thousands of threads could be active at any given time. Interrupts and exceptions can be handled by special assist warps, without requiring complex context switching or heavy-weight kernel support.

Profiling and Instrumentation. Profiling and binary instrumentation tools like Pin [70] and Valgrind [80] have proven to be very useful for development, performance analysis, and debugging on modern CPU systems. At the same time, there is a lack6 of tools with the same or similar capabilities for modern GPUs. This significantly limits software development and debugging for modern GPU systems. The CABA framework can potentially enable easy and efficient development of such tools, as it is flexible enough to invoke user-defined code on specific architectural events (e.g., cache misses, control divergence).

9. Related Work

To our knowledge, this paper is the first to (1) propose a flexible and general framework for employing idle GPU resources for useful computation that can aid regular program execution, and (2) use the general concept of helper threading to perform memory and interconnect bandwidth compression. We demonstrate the benefits of our new framework by using it to implement multiple compression algorithms on a throughput-oriented GPU architecture to alleviate the memory bandwidth bottleneck. In this section, we discuss related works in helper threading, memory bandwidth optimizations, and memory compression.

Helper Threading. Previous works [1, 19, 22, 23, 27, 28, 31, 32, 48, 55, 59, 68, 69, 77, 78, 111, 120, 121, 122] demonstrated the use of helper threads in the context of Simultaneous Multithreading (SMT), multi-core, and single-core processors, primarily to speed up single-thread execution by using idle SMT contexts, idle cores in CPUs, or idle cycles during which the main program is stalled on a single thread context. These works typically use helper threads (generated by the software, the hardware, or cooperatively) to pre-compute useful information that aids the execution of the primary thread (e.g., by prefetching, branch outcome pre-computation, and cache management). No previous work discussed the use of helper threads for memory/interconnect bandwidth compression or cache compression.

6With the exception of one recent work [110].

These works primarily use helper threads to capture data flow and pre-compute useful information to aid in the execution of the primary thread. In these prior works, helper threads are either generated with the help of the compiler [59, 68, 120] or completely in hardware [27, 77, 78]. Helper threads are used to perform prefetching in [1, 19, 22, 28, 48, 60, 69, 77, 78, 111], where they predict future load addresses by doing some computation and then prefetch the corresponding data. Simultaneous Subordinate Multithreading [22] employs hardware-generated helper threads to improve branch prediction accuracy and cache management. Speculative multithreading [62, 72, 92, 107] involves executing different sections of a serial program in parallel and then later verifying the run-time correctness. Assisted execution [31, 32] is an execution paradigm where tightly-coupled nanothreads are generated using nanotrap handlers and execute routines to enable optimizations like prefetching. In Slipstream Processors [48], one thread runs ahead of the program and executes a reduced version of the program. In runahead execution [77, 78], the main thread is executed speculatively, solely for prefetching purposes, when the program is stalled due to a cache miss.

While our work was inspired by these prior studies of helper threading in latency-oriented architectures (CPUs), developing a framework for helper threading (or assist warps) in throughput-oriented architectures (GPUs) enables new opportunities and poses new challenges, both due to the massive parallelism and resources present in a throughput-oriented architecture (as discussed in Section 1). Our CABA framework exploits these new opportunities and addresses these new challenges, including (1) low-cost management of a large number of assist warps that could be running concurrently with regular program warps, (2) means of state/context management and scheduling for assist warps to maximize effectiveness and minimize interference, and (3) different possible applications of the concept of assist warps in a throughput-oriented architecture.

In the GPU domain, CudaDMA [18] is a recent proposal that aims to ease programmability by decoupling execution and memory transfers with specialized DMA warps. This work does not provide a general and flexible hardware-based framework for using GPU cores to run warps that aid the main program.

Memory Latency and Bandwidth Optimizations in GPUs. A number of prior works focus on optimizing for memory bandwidth and memory latency in GPUs. Jog et al. [51] aim to improve memory latency tolerance by coordinating prefetching and warp scheduling policies. Lakshminarayana et al. [63] reduce effective latency in graph applications by using spare registers to store prefetched data. OWL [52] and [56] use intelligent scheduling to improve DRAM bank-level parallelism and bandwidth utilization, and Rhu et al. [95] propose a locality-aware memory hierarchy to improve memory throughput.


Kayiran et al. [56] propose GPU throttling techniques to reduce memory contention in heterogeneous systems. Ausavarungnirun et al. [14] leverage heterogeneity in warp behavior to design more intelligent policies at the cache and memory controller. These works do not consider data compression and are orthogonal to our proposed framework.

Compression. Several prior works [6, 11, 88, 89, 90, 91, 100, 104, 114] study memory and cache compression with several different compression algorithms [4, 11, 25, 49, 87, 118], in the context of CPUs or GPUs.

Alameldeen et al. [6] investigated the possibility of bandwidth compression with FPC [5], showing that a significant decrease in pin bandwidth demand can be achieved with an FPC-based bandwidth compression design. Sathish et al. [100] examine GPU-oriented memory link compression using the C-Pack [25] compression algorithm. The authors observe that GPU memory (GDDR3 [46]) indeed allows the transfer of data in small bursts and propose to store data in compressed form in memory, but without capacity benefits. Thuresson et al. [114] consider a CPU-oriented design where compressor/decompressor logic is located on both ends of the main memory link. Pekhimenko et al. [88] propose Linearly Compressed Pages (LCP) with the primary goal of compressing main memory to increase capacity.

Our work is the first to demonstrate how one can adapt some of these algorithms for use in a general helper threading framework for GPUs. As such, compression/decompression using our new framework is more flexible, since it does not require a specialized hardware implementation for any algorithm and instead utilizes the existing GPU core resources to perform compression and decompression. Finally, as discussed in Section 8, our CABA framework is applicable beyond compression and can be used for other purposes.

10. Conclusion

This paper makes a case for the Core-Assisted Bottleneck Acceleration (CABA) framework, which employs assist warps to alleviate different bottlenecks in GPU execution. CABA is based on the key observation that various imbalances and bottlenecks in GPU execution leave on-chip resources, i.e., computational units, register files, and on-chip memory, underutilized. CABA takes advantage of these idle resources and employs them to perform useful work that can aid the execution of the main program and the system.

We provide a detailed design and analysis of how CABA can be used to perform flexible data compression in GPUs to mitigate the memory bandwidth bottleneck. Our extensive evaluations across a variety of workloads and system configurations show that the use of CABA for memory compression significantly improves system performance (by 41.7% on average on a set of bandwidth-sensitive GPU applications) by reducing the memory bandwidth requirements of both the on-chip and off-chip buses.

We conclude that CABA is a general substrate that can alleviate the memory bandwidth bottleneck in modern GPU systems by enabling flexible implementations of data compression algorithms. We believe CABA is a general framework that can have a wide set of use cases to mitigate many different system bottlenecks in throughput-oriented architectures, and we hope that future work explores both new uses of CABA and more efficient implementations of it.

Acknowledgments

We thank the reviewers for their valuable suggestions. We thank the members of the SAFARI group for their feedback and the stimulating research environment they provide. Special thanks to Evgeny Bolotin and Kevin Hsieh for their feedback during various stages of this project. We acknowledge the support of our industrial partners: Facebook, Google, IBM, Intel, Microsoft, Nvidia, Qualcomm, VMware, and Samsung. This research was partially supported by NSF (grants 0953246, 1065112, 1205618, 1212962, 1213052, 1302225, 1302557, 1317560, 1320478, 1320531, 1409095, 1409723, 1423172, 1439021, 1439057), the Intel Science and Technology Center for Cloud Computing, and the Semiconductor Research Corporation. Gennady Pekhimenko is supported in part by a Microsoft Research Fellowship and an Nvidia Graduate Fellowship. Rachata Ausavarungnirun is supported in part by the Royal Thai Government scholarship. This article is a revised and extended version of our previous ISCA 2015 paper [115].


References

[1] T. M. Aamodt et al. Hardware support for prescient instruction

prefetch. In HPCA, 2004.[2] B. Abali et al. Memory Expansion Technology (MXT): Software Sup-

port and Performance. IBM J.R.D., 2001.[3] M. Abdel-Majeed et al. Warped register file: A power efficient register

file for gpgpus. In HPCA, 2013.[4] A. Alameldeen et al. Adaptive Cache Compression for High-

Performance Processors. In ISCA, 2004.[5] A. Alameldeen et al. Frequent Pattern Compression: A Significance-

Based Compression Scheme for L2 Caches. Technical report, U. Wis-consin, 2004.

[6] A. Alameldeen et al. Interactions between compression and prefetch-ing in chip multiprocessors. In HPCA, 2007.

[7] C. Alvarez et al. On the potential of tolerant region reuse for multime-dia applications. In ICS, 2001.

[8] C. Alvarez et al. Fuzzy memoization for floating-point multimedia

[9] C. Alvarez et al. Dynamic tolerance region computing for multimedia.IEEE Trans. Comput., 2012.

[10] AMD. Radeon GPUs. http://www.amd.com/us/products/desktop/graphics/amd-radeon-hd-6000/Pages/amd-radeon-hd-6000.aspx.

[11] A. Arelakis et al. SC2: A Statistical Compression Cache Scheme. InISCA, 2014.

[12] J. Arnau et al. Boosting mobile GPU performance with a decoupledaccess/execute fragment processor. In ISCA, 2012.

[13] J. Arnau et al. Eliminating Redundant Fragment Shader Executions ona Mobile GPU via Hardware Memoization. In ISCA, 2014.

[14] R. Ausavarangnirun et al. Exploiting Inter-Warp Heterogeneity toImprove GPGPU Performance. In PACT, 2014.

[15] J. Baer et al. Effective hardware-based data prefetching for high-

[16] A. Bakhoda et al. Analyzing CUDA Workloads Using a Detailed GPUSimulator. In ISPASS, 2009.

[17] A. Bakhoda, G.L. Yuan, W.W.L. Fung, H. Wong, and T.M. Aamodt.Analyzing CUDA Workloads Using a Detailed GPU Simulator. InISPASS, 2009.

[18] M. Bauer et al. CudaDMA: Optimizing GPU memory bandwidth viawarp specialization. In SC, 2011.

[19] J. A. Brown et al. Speculative precomputation on chip multiprocessors.In MTEAC, 2001.

[20] M. Burtscher et al. A quantitative study of irregular programs on gpus.In IISWC, 2012.

[21] D. De Caro et al. High-performance special function unit for pro-grammable 3-d graphics processors. Trans. Cir. Sys. Part I, 2009.

[22] R. S. Chappell et al. Simultaneous subordinate microthreading (SSMT).In ISCA, 1999.

[23] R. S. Chappell et al. Microarchitectural support for precomputationmicrothreads. In MICRO, 2002.

[24] S. Che et al. Rodinia: A Benchmark Suite for Heterogeneous Comput-ing. In IISWC, 2009.

[25] X. Chen et al. C-pack: A high-performance microprocessor cachecompression algorithm. In IEEE Trans. on VLSI Systems, 2010.

[26] D. Citron et al. Accelerating multi-media processing by implementingmemoing in multiplication and division units. In ASPLOS, 1998.

[27] J. D. Collins et al. Dynamic speculative precomputation. In MICRO,2001.

[28] J. D. Collins et al. Speculative Precomputation: Long-range Prefetchingof Delinquent Loads. ISCA, 2001.

[29] D. A. Connors et al. Compiler-directed dynamic computation reuse:rationale and initial results. In MICRO, 1999.

[30] R. Das et al. Performance and Power Optimization through DataCompression in Network-on-Chip Architectures. In HPCA, 2008.

[31] M. Dubois. Fighting the memory wall with assisted execution. In CF,2004.

[32] M. Dubois et al. Assisted execution. Technical report, USC, 1998.[33] J. Dusser et al. Zero-content augmented caches. In ICS, 2009.[34] E. Ebrahimi et al. Coordinated Control of Multiple Prefetchers in

Multi-core Systems. In MICRO, 2009.[35] E. Ebrahimi et al. Techniques for Bandwidth-efficient Prefetching of

Linked Data Structures in Hybrid Prefetching Systems. In HPCA, 2009.[36] E. Ebrahimi et al. Prefetch-aware shared resource management for

multi-core systems. ISCA, 2011.[37] M. Ekman et al. A Robust Main-Memory Compression Scheme. In

ISCA-32, 2005.[38] J. W. C. Fu et al. Stride directed prefetching in scalar processors. In

MICRO, 1992.[39] M. Gebhart et al. A compile-time managed multi-level register file

hierarchy. In MICRO, 2011.[40] M. Gebhart et al. Energy-efficient Mechanisms for Managing Thread

Context in Throughput Processors. In ISCA, 2011.[41] M. Gebhart et al. Unifying primary cache, scratch, and register file

memories in a throughput processor. In MICRO, 2012.

[42] M. Ghosh et al. Smart refresh: An enhanced memory controller designfor reducing energy in conventional and 3d die-stacked drams. MICRO,2007.

[43] GPGPU-Sim v3.2.1. GPGPU-Sim Manual.[44] B. He et al. Mars: A MapReduce Framework on Graphics Processors.

In PACT, 2008.[45] J. Huang et al. Exploiting basic block value locality with block reuse.

In HPCA, 1999.[46] Hynix. 512M (16mx32) GDDR3 SDRAM hy5rs123235fp.[47] Hynix. Hynix GDDR5 SGRAM Part H5GQ1H24AFR Revision 1.0.[48] K. Z. Ibrahim et al. Slipstream execution mode for cmp-based multi-

processors. In HPCA, 2003.[49] M. Islam et al. Zero-Value Caches: Cancelling Loads that Return Zero.

In PACT, 2009.[50] ITRS. International technology roadmap for semiconductors. 2011.[51] A. Jog et al. Orchestrated Scheduling and Prefetching for GPGPUs. In

ISCA, 2013.[52] A. Jog et al. OWL: Cooperative Thread Array Aware Scheduling

Techniques for Improving GPGPU Performance. In ASPLOS, 2013.[53] John L. Hennessy and David A. Patterson. Computer Architecture, A

Quantitative Approach. Morgan Kaufmann, 2010.[54] N. Jouppi. Improving direct-mapped cache performance by the addi-

tion of a small fully-associative cache and prefetch buffers. In ISCA,1990.

[55] M. Kamruzzaman et al. Inter-core Prefetching for Multicore ProcessorsUsing Migrating Helper Threads. In ASPLOS, 2011.

[56] O. Kayiran et al. Neither More Nor Less: Optimizing Thread-levelParallelism for GPGPUs. In PACT, 2013.

[57] O. Kayiran et al. Managing GPU Concurrency in HeterogeneousArchitectures. In MICRO, 2014.

[58] S. W. Keckler et al. GPUs and the future of parallel computing. IEEEMicro, 2011.

[59] D. Kim et al. Design and Evaluation of Compiler Algorithms forPre-execution. In ASPLOS, 2002.

[60] Dongkeun Kim et al. Physical experimentation with prefetching helperthreads on intel’s hyper-threaded processors. In CGO, 2004.

[61] David B. Kirk and W. Hwu. Programming massively parallel processors:a hands-on approach. Morgan Kaufmann, 2010.

[62] Venkata Krishnan and Josep Torrellas. A chip-multiprocessor archi-tecture with speculative multithreading. IEEE Trans. Comput., 1999.

[63] N. Lakshminarayana et al. Spare register aware prefetching for graphalgorithms on GPUs. In HPCA, 2014.

[64] J. Lee et al. Many-Thread Aware Prefetching Mechanisms for GPGPUApplications. In MICRO, 2010.

[65] J. Leng et al. GPUWattch: Enabling Energy Optimizations in GPGPUs.In ISCA, 2013.

[66] E. Lindholm et al. Nvidia tesla: A unified graphics and computing architecture. IEEE Micro, 2008.

[67] J. Liu et al. Raidr: Retention-aware intelligent dram refresh. ISCA,2012.

[68] J. Lu et al. Dynamic Helper Threaded Prefetching on the Sun Ultra-SPARC CMP Processor. In MICRO, 2005.

[69] C. Luk. Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. In ISCA, 2001.

[70] C. Luk et al. Pin: Building Customized Program Analysis Tools withDynamic Instrumentation. In PLDI, 2005.

[71] Y. Luo et al. Characterizing Application Memory Error Vulnerabilityto Optimize Data Center Cost via Heterogeneous-Reliability Memory.DSN, 2014.

[72] Pedro Marcuello et al. Speculative multithreaded processors. In ICS,1998.

[73] J. Meng et al. Dynamic warp subdivision for integrated branch andmemory divergence tolerance. In ISCA, 2010.

[74] J. Meza et al. Enabling efficient and scalable hybrid memories. IEEE CAL, 2012.

[75] A. Moshovos et al. Slice-processors: An implementation of operation-based prediction. In ICS, 2001.

[76] S. Mukherjee et al. Detailed design and evaluation of redundantmultithreading alternatives. ISCA.

[77] O. Mutlu et al. Runahead execution: An alternative to very largeinstruction windows for out-of-order processors. In HPCA, 2003.

[78] O. Mutlu et al. Techniques for efficient processing in runahead execu-tion engines. ISCA, 2005.

[79] V. Narasiman et al. Improving GPU performance via large warps andtwo-level warp scheduling. In MICRO, 2011.

[80] N. Nethercote et al. Valgrind: A Framework for Heavyweight DynamicBinary Instrumentation. In PLDI, 2007.

[81] B. S. Nordquist et al. Apparatus, system, and method for coalescingparallel memory requests, 2009. US Patent 7,492,368.

[82] NVIDIA. Programming Guide.[83] NVIDIA. CUDA C/C++ SDK Code Samples, 2011.[84] NVIDIA. Fermi: NVIDIA’s Next Generation CUDA Compute Archi-

tecture, 2011.[85] L. Nyland et al. Systems and methods for coalescing memory accesses


of parallel threads, 2011. US Patent 8,086,806.[86] S. Palacharla et al. Evaluating stream buffers as a secondary cache

replacement. In ISCA, 1994.[87] G. Pekhimenko et al. Base-Delta-Immediate Compression: Practical

Data Compression for On-Chip Caches. In PACT, 2012.[88] G. Pekhimenko et al. Linearly Compressed Pages: A Low Complexity,

Low Latency Main Memory Compression Framework. In MICRO,2013.

[89] G. Pekhimenko et al. Exploiting Compressed Block Size as an Indicatorof Future Reuse. In HPCA, 2015.

[90] G. Pekhimenko et al. Toggle-Aware Compression for GPUs. In IEEECAL, 2015.

[91] G. Pekhimenko et al. A Case for Toggle-Aware Compression in GPUs.In HPCA, 2016.

[92] Carlos García Quiñones et al. Mitosis Compiler: An Infrastructurefor Speculative Threading Based on Pre-computation Slices. In PLDI,2005.

[93] M. Qureshi et al. Microarchitecture-based introspection: A techniquefor transient-fault tolerance in microprocessors. DSN, 2005.

[94] M. Qureshi et al. Fundamental latency trade-off in architecting dram caches: Outperforming impractical sram-tags with a simple and prac-tical design. MICRO, 2012.

[95] Minsoo Rhu, Michael Sullivan, Jingwen Leng, and Mattan Erez. A Locality-Aware Memory Hierarchy for Energy-Efficient GPU Archi-tectures. In MICRO, 2013.

[96] T. G. Rogers et al. Cache-Conscious Wavefront Scheduling. In MICRO,2012.

[97] A. Roth et al. Speculative data-driven multithreading. In HPCA, 2001.[98] M. Samadi et al. Sage: Self-tuning approximation for graphics engines.

In MICRO, 2013.[99] S. Sardashti et al. Decoupled Compressed Cache: Exploiting Spatial

Locality for Energy-optimized Compressed Caching. In MICRO, 2013.[100] V. Sathish et al. Lossless and Lossy Memory I/O Link Compression for

Improving Performance of GPGPU Workloads. In PACT, 2012.[101] V. Seshadri et al. Page overlays: An enhanced virtual memory frame-

work to enable fine-grained memory management. ISCA, 2015.[102] A. Sethia et al. Apogee: adaptive prefetching on gpus for energy

efficiency. In PACT, 2013.[103] A. Sethia et al. Equalizer: Dynamic tuning of gpu resources for efficient

execution. In MICRO, 2014.[104] A. Shafiee et al. MemZip: Exploring Unconventional Benefits from

Memory Compression. In HPCA, 2014.[105] B. Smith. A pipelined, shared resource MIMD computer. Advance

Computer Architecture, 1986.[106] A. Sodani et al. Dynamic Instruction Reuse. In ISCA, 1997.[107] Sohi et al. Multiscalar Processors. In ISCA, 1995.[108] S. Srinath et al. Feedback Directed Prefetching: Improving the Perfor-

mance and Bandwidth-EXciency of Hardware Prefetchers. In HPCA,2007.

[109] S. Srinath et al. Feedback Directed Prefetching: Improving the Perfor-mance and Bandwidth-EXciency of Hardware Prefetchers. In HPCA,2007.

[110] M. Stephenson et al. Flexible software proVling of GPU architectures.In ISCA, 2015.

[111] K. Sundaramoorthy et al. Slipstream Processors: Improving BothPerformance and Fault Tolerance. In ASPLOS, 2000.

[112] J. E. Thornton. Parallel Operation in the Control Data 6600. Proceedingsof the AFIPS FJCC, 1964.

[113] S. Thoziyoor et al. CACTI 5.1. Technical Report HPL-2008-20, HPLaboratories, 2008.

[114] M. Thuresson et al. Memory-Link Compression Schemes: A ValueLocality Perspective. IEEE Trans. Comput., 2008.

[115] N. Vijaykumar et al. A Case for Core-Assisted Bottleneck Accelerationin GPUs: Enabling Flexible Data Compression with Assist Warps. InISCA, 2015.

[116] J. Wadden et al. Real-world design and evaluation of compiler-managed gpu redundant multithreading. ISCA ’14, 2014.

[117] Wang et al. Compiler-managed software-based redundant multi-threading for transient fault detection. CGO, 2007.

[118] J. Yang et al. Frequent Value Compression in Data Caches. In MICRO,2000.

[119] A. Yazdanbakhsh et al. Mitigating the memory bottleneck with ap-proximate load value prediction. IEEE Design & Test, 2016.

[120] W. Zhang et al. Accelerating and adapting precomputation threads for efficient prefetching. In HPCA, 2007.

[121] C. Zilles et al. The use of multithreading for exception handling.MICRO, 1999.

[122] C. Zilles et al. Execution-based Prediction Using Speculative Slices. InISCA, 2001.
