
    SD3: An Efficient Dynamic Data-Dependence Profiling Mechanism

    Minjang Kim, Nagesh B. Lakshminarayana, Hyesoon Kim, and Chi-Keung Luk

    Abstract—As multicore processors are deployed in mainstream computing, the need for software tools to help parallelize programs is increasing dramatically. Data-dependence profiling is an important program analysis technique to exploit parallelism in serial programs. More specifically, manual, semiautomatic, or automatic parallelization can use the outcomes of data-dependence profiling to guide where and how to parallelize in a program. However, state-of-the-art data-dependence profiling techniques consume extremely large resources as they suffer from two major issues when profiling large and long-running applications: 1) runtime overhead and 2) memory overhead. Existing data-dependence profilers are either unable to profile large-scale applications with a typical resource budget or only report very limited information. In this paper, we propose an efficient approach to data-dependence profiling that can address both runtime and memory overhead in a single framework. Our technique, called SD3, reduces the runtime overhead by parallelizing the dependence profiling step itself. To reduce the memory overhead, we compress memory accesses that exhibit stride patterns and compute data dependences directly in a compressed format. We demonstrate that SD3 reduces the runtime overhead when profiling SPEC 2006 by a factor of 4.1 and 9.7 on eight cores and 32 cores, respectively. For the memory overhead, we successfully profile 22 SPEC 2006 benchmarks with the reference input, while the previous approaches fail even with the train input. In some cases, we observe more than a 20× improvement in memory consumption and a 16× speedup in profiling time when 32 cores are used. We also demonstrate the usefulness of SD3 by showing manual parallelization followed by data-dependence profiling results.

    Index Terms—Profiling, data dependence, parallel programming, program analysis, compression, parallelization

    1 INTRODUCTION

    AS multicore processors are now ubiquitous in mainstream computing, parallelization has become the most important approach to improving application performance. However, specialized software support for parallel programming is still immature, although a vast amount of work has been done on supporting parallel programming. For example, automatic parallelization has been researched for decades, but it was successful in only limited domains. Parallelization is still a burden for programmers.

    Recently, several tools, including Intel Parallel Advisor [11], CriticalBlue Prism [5], and Vector Fabrics Pareon vfAnalyst [30], have been introduced to help the parallelization of legacy serial programs. These tools provide useful information on parallelization by analyzing serial code. A key component of such tools is dynamic data-dependence analysis, which indicates whether two tasks access the same memory location and at least one of them is a write. Two data-independent tasks can be safely executed in parallel without synchronization.

    Traditionally, data-dependence analysis has been done statically by compilers. Techniques such as the GCD test [23] and Banerjee's inequality test [15] were proposed in the past to analyze data dependences in array-based data accesses. This static analysis may not be effective in languages that allow pointers and dynamic allocation. For example, we observed that state-of-the-art production automatic-parallelizing compilers often fail to parallelize simple, embarrassingly parallel loops written in C/C++. The compilers also had limited success with irregular data structures due to pointer analysis and control flows.

    Rather than entirely relying on static analysis, dynamic analysis using data-dependence profiling is an alternative or complementary approach to address the limitations of the static-only approaches, since all memory addresses are resolved at runtime. Data-dependence profiling has already been used in parallelization efforts like speculative multithreading [3], [7], [19], [26], [33] and finding potential parallelism [8], [16], [24], [29], [31], [35]. It is also being employed in the commercial tools we mentioned. However, the current algorithm for data-dependence profiling incurs significant time and memory overhead. Surprisingly, although a number of research works have focused on using data-dependence profiling, almost no work exists on addressing the performance and overhead issues of data-dependence profilers.

    As a concrete example, Fig. 1 demonstrates the overhead issues of our baseline algorithm. We believe that this baseline algorithm is still state of the art and very similar to the current commercialized tools. Figs. 1a and 1b show the memory and time overhead, respectively, when profiling 17 SPEC 2006 C/C++ applications with the train input on a 12-GB machine.




    Among the 17 benchmarks, only four benchmarks were successfully analyzed; the rest of the benchmarks failed because of insufficient physical memory (they consumed more than 10 GB of memory). The runtime overhead is between an 80× and 270× slowdown for the four benchmarks that worked. While both time and memory overhead are severe, the latter will stop further analysis.

    The culprit is the dynamic data-dependence profiling algorithm used in these tools, which we call the pairwise method. The algorithms of previous work [3], [16] are similar to the pairwise method. This pairwise method needs to store all outstanding memory references to check dependences, resulting in huge memory bloat.

    Another example that clearly shows the memory overhead problem is a simple matrix addition program that allocates three N × N matrices for A = B + C. As shown in Fig. 1c, the current tools require significant additional memory as the matrix size increases, while our method, SD3, needs less than 10 MB of memory.

    In this paper, we address these memory and time overhead problems by proposing an efficient data-dependence profiling algorithm called SD3. Our algorithm has two components. First, we propose a new data-dependence profiling technique using a compressed data format to reduce the memory overhead. Second, we propose the use of parallelization to accelerate the data-dependence profiling process. More precisely, this work makes the following contributions to the topic of data-dependence profiling:

    1. Reducing memory overhead by stride detection and compression along with a new data-dependence calculation algorithm. We demonstrate that SD3 significantly reduces the memory consumption of data-dependence profiling. SD3 is not a simple compression technique; we had to address several issues to achieve memory-efficient profiling. The failed benchmarks in Fig. 1a are successfully profiled by SD3 on a 12-GB machine.

    2. Reducing runtime overhead with parallelization. We show that our memory-efficient data-dependence profiling itself can be effectively parallelized. We observe an average speedup of 4.1× on profiling SPEC 2006 using eight cores. For certain applications, the speedup can be as high as 16× with 32 cores.

    2 BACKGROUND

    Before describing SD3, we illustrate usage models of our dynamic data-dependence profiler, as shown in Fig. 2. A tool using the data-dependence profiler takes a program in either source code or binary and profiles it with a representative input. A raw result from our dependence profiler is a list of discovered data-dependence pairs. All or some of the following information is provided by our profiler (a sketch of a possible record for one such pair follows the list):

    . Sources and sinks of data dependences (in source code lines if possible; otherwise in program counters).

    . Types of data dependences: Flow (Read-After-Write, RAW), Anti (WAR), and Output (WAW) dependences.

    . Frequencies and distances of data dependences.

    . Whether a dependence is loop-carried or loop-independent, and data dependences carried by a particular loop in nested loops.

    . Data-dependence graphs in functions and loops.
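    For concreteness, one plausible way to represent a single reported dependence pair is sketched below. The structure and field names are our own illustration; the paper does not prescribe a particular output format.

    ```cpp
    #include <cstdint>

    // Hypothetical record for one discovered data-dependence pair; field names
    // are illustrative, not the profiler's actual output format.
    enum class DepType { Flow /*RAW*/, Anti /*WAR*/, Output /*WAW*/ };

    struct DependencePair {
        uint64_t source_pc;        // PC (or source line) of the earlier access
        uint64_t sink_pc;          // PC (or source line) of the later access
        DepType  type;             // flow, anti, or output dependence
        uint64_t frequency;        // how many times the dependence was observed
        uint64_t distance;         // dependence distance in loop iterations
        bool     loop_carried;     // loop-carried vs. loop-independent
        uint64_t carrier_loop_id;  // loop in the nest that carries it (if loop-carried)
    };
    ```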

    A raw result can be further analyzed to give programmers advice on parallelization models and the transformation of the serial code. The raw results can also be used by aggressive compiler optimizations and opportunistic automatic parallelization [29]. Among the three steps, the data-dependence profiler is obviously the bottleneck of the overhead problem, and we focus on it in this paper.

    3 THE BASELINE PAIRWISE METHOD

    We describe our baseline algorithm, the pairwise method, which is still the state-of-the-art algorithm for existing tools. SD3 is implemented on top of the pairwise method. At the end of this section, we summarize the problems of the pairwise method. We begin our description of the algorithm by focusing on data dependences within loop nests because loops are major parallelization targets.

    3.1 Checking Data Dependences in a Loop Nest

    In the big picture, to calculate data dependences in a loop, we find conflicts between the memory references of the current loop iteration and those of previous iterations. Our pairwise method temporarily buffers all memory references during the current iteration of a loop; we call these references pending references.


    Fig. 1. Overhead of our baseline algorithm and current tools. We profiled the top 20 hottest loops and their inner loops (if any) in (a) and (b).

    Fig. 2. Examples of data-dependence profiler applications.


    When an iteration ends, we compute data dependences by checking pending references against the history references, which are the memory references that appeared from the beginning of the loop invocation up to the previous iteration. These two types of references are stored in the pending table and the history table, respectively. Each loop has its own pending and history tables instead of having the tables globally. This is needed to compute data dependences correctly and efficiently while considering 1) nested loops and 2) loop-carried/independent dependences. We explained the pairwise algorithm with a loop example in [14].

    The pairwise algorithm handles a loop across function calls and recursion via the loop stack (see Section 4.3). It also easily finds dependences between functions. When a function starts, we assume that a loop that encompasses the whole function body with a zero trip count has been started, such as do {func_body();} while(0);. Then, loop-independent dependences in this imaginary loop will be dependences in the function.

    3.2 Handling Loop-Independent Dependences

    When reporting data dependences inside a loop, we must distinguish whether a dependence is loop-independent (i.e., a dependence within the same iteration) or loop-carried, because the implication is very different when judging the parallelizability of a loop. While loop-independent dependences do not prevent parallelizing a loop by DOALL, loop-carried flow dependences generally prohibit parallelization except for DOACROSS or pipelining.

    To handle loop-independent dependences, we introduce a killed address, which is very similar to the kill set in data-flow analysis. We mark an address as killed once the memory address is written in an iteration. Then, subsequent accesses to the killed address within the same iteration are no longer stored in the tables and are reported as loop-independent dependences. We provide an example from SPEC 179.art in [14].
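    To make Sections 3.1 and 3.2 concrete, a highly simplified sketch of the per-access bookkeeping in the pairwise method is shown below. The container choices and helper names are our assumptions; the real implementation also keeps PC lists, dependence types, and per-loop nested tables as described above.

    ```cpp
    #include <cstdint>
    #include <unordered_map>
    #include <unordered_set>

    // Simplified pairwise bookkeeping for a single loop, at byte granularity,
    // ignoring PC lists, dependence types, and nested loops.
    struct PairwiseLoopState {
        std::unordered_map<uint64_t, uint64_t> history;  // address -> last iteration seen
        std::unordered_set<uint64_t>           pending;  // addresses touched this iteration
        std::unordered_set<uint64_t>           killed;   // addresses written this iteration

        // Called for every load/store executed in the current iteration.
        void on_access(uint64_t addr, bool is_write) {
            if (killed.count(addr)) {
                // Same-iteration access to a killed address: report a
                // loop-independent dependence and do not record it further.
                return;
            }
            pending.insert(addr);
            if (is_write) killed.insert(addr);
        }

        // Called when the iteration ends: pending vs. history yields loop-carried
        // dependences; pending is then merged into history and flushed.
        void end_iteration(uint64_t iteration) {
            for (uint64_t addr : pending) {
                auto it = history.find(addr);
                if (it != history.end()) {
                    // Loop-carried dependence with distance (iteration - it->second).
                }
                history[addr] = iteration;
            }
            pending.clear();
            killed.clear();
        }
    };
    ```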

    3.3 Problems of the Pairwise Method

    The pairwise method needs to store all distinct memory references within a loop invocation. As a result, the memory requirement per loop obviously grows as the memory footprint grows. The memory requirement could be even worse because of nested loops. In the detailed description of the pairwise method [14], one of the important steps is propagation. For example, the history references of inner loops propagate to their upper loops, which is implemented as merging history and pending tables. Hence, only when the top-most loop finishes can all the history references within the loop nest be flushed. Many programs have fairly deep loop nests (e.g., the geometric mean of the maximum loop depth in SPEC 2006 FP is 12), and most of the execution time is spent in loops. In turn, all distinct memory references often need to be stored along with PC addresses throughout the program execution. In Section 4, we solve this problem using compression.

    Profiling time overhead is also critical since an extreme number of memory loads and stores may be traced. We attack this overhead by parallelizing the data-dependence profiling itself. We present our solution in Section 5.

    4 A MEMORY-EFFICIENT ALGORITHM OF SD3

    The basic idea for solving the memory overhead problem is to store memory references in a compressed format. Since many memory references show stride patterns, our profiler compresses such memory references into a stride format (e.g., addresses of the form base + stride × n). However, a simple compression technique is not enough to build an efficient data-dependence profiler. We need to address the following challenges for SD3:

    . how to detect stride patterns dynamically (4.1),

    . how to perform data-dependence checking with the compressed format without decompression (4.2),

    . how to check dependences efficiently with both stride and nonstride patterns (4.4), and

    . how to handle loop nests and loop-independent dependences with the compressed format (4.5, 4.6).

    4.1 Dynamic Detection of Strides

    We define an address stream as a stride as long as the stream can be expressed as base + stride_distance × n. SD3 dynamically discovers strides and directly checks data dependences with strides and nonstride references. To detect strides, when observing a memory access, the profiler trains a stride detector for each PC (or any location identifier of a memory instruction) and decides whether or not the access is part of a stride. Because the sources and sinks of dependences should be reported, we have a stride detector per PC. An address that cannot be represented as part of a stride is called a point in this paper.

    Fig. 3 illustrates the state transitions by which a stride is learned. When a newly observed memory address can be expressed by the learned stride, the FSM advances its state until it reaches the StrongStride state. The StrongStride state can tolerate a small number of stride-breaking accesses. For memory accesses like A[i][j], when the program traverses within the same row, we will see a stride. When the row changes, however, there could be an irregular jump in the memory address, breaking the learned stride. Having the Weak/StrongStride states tolerates a few nonstride accesses so that no point references are recorded in the table.

    If a newly observed memory access cannot be represented with the learned stride, the detector goes back to the FirstObserved state with the hope of seeing another stride behavior. Our stride detector does not always require strictly increasing or decreasing patterns. For example, a stream [10, 14, 18, 14, 18, 22, 18, 22, 26] is considered a stride 10 + 4n (0 ≤ n ≤ 4). Note that such nonstrict strides may cause slight errors when calculating the occurrence count of data dependences (see Section 4.7). We do not further combine multiple strides from a PC into a single stride even if these strides are generated by two- (or more) dimensional accesses.
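    A minimal sketch of such a per-PC stride detector is shown below. FirstObserved, WeakStride, and StrongStride come from the description above; the Start and StrideLearned states and the exact promotion/demotion rules are our simplifying assumptions and differ from the real FSM, which also tolerates nonstrictly monotonic streams.

    ```cpp
    #include <cstdint>

    // Hedged sketch of a per-PC stride-detector FSM (not the exact SD3 FSM).
    class StrideDetector {
        enum State { Start, FirstObserved, StrideLearned, WeakStride, StrongStride };
        State    state_ = Start;
        uint64_t last_addr_ = 0;
        int64_t  stride_ = 0;

    public:
        // Feed one memory address; returns true if the access is treated as part
        // of a stride (otherwise it would be recorded as a point).
        bool observe(uint64_t addr) {
            int64_t diff = static_cast<int64_t>(addr) - static_cast<int64_t>(last_addr_);
            switch (state_) {
            case Start:
                state_ = FirstObserved;
                break;
            case FirstObserved:
                stride_ = diff;              // learn a candidate stride distance
                state_  = StrideLearned;
                break;
            default:
                if (diff == stride_) {
                    // Promote toward StrongStride on stride-conforming accesses.
                    state_ = (state_ == StrideLearned) ? WeakStride : StrongStride;
                } else if (state_ == StrongStride) {
                    state_ = WeakStride;     // tolerate a few stride-breaking accesses
                } else {
                    state_ = FirstObserved;  // give up and wait for a new stride
                }
                break;
            }
            last_addr_ = addr;
            return state_ == WeakStride || state_ == StrongStride;
        }
    };
    ```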


    Fig. 3. Stride detection FSM. The current state is updated on every memory access with the following additional conditions: (1) the address can be represented with the learned stride (stride); and (2) the address cannot be represented with the current stride (point).


    4.2 Stride-Based Dependence Checking Algorithm

    Checking dependences is trivial in the pairwise method: we exploit a hash table keyed by memory addresses, which enables fast searching of whether or not a given memory address is involved in a dependence. Unfortunately, the stride-based algorithm cannot use such simple dependence checking because a stride represents an interval. We also need to efficiently find dependences among both strides and points. This section first introduces algorithms to find conflicts between two strides (or a stride and a point). Section 4.4 then discusses how SD3 implements efficient stride-based dependence checking.

    The key point in the new algorithm is to find conflicts between two strides. We attack the problem in two steps: 1) finding overlapping strides and points and 2) performing a new data-dependence test, DYNAMIC-GCD, to calculate the exact conflicts. For the first step, the overlapping test, we employ an interval tree, which is based on the Red-Black Tree [4]. The test finds all overlapping strides and points in a tree for a given input.1 Fig. 4 shows an example of an interval tree. Each node represents either a stride or a point. Through a query, a stride of [86, 96] overlaps with [92, 192] and [96, 196].

    The next step is an actual data-dependence test between two overlapping strides. We extend the well-known greatest common divisor (GCD) test to the DYNAMIC-GCD TEST in two directions: 1) we dynamically construct affine descriptors from address streams to use the GCD test, and 2) we count the exact number of dependence occurrences (many static-time dependence-test algorithms give only a may answer along with dependent and independent).

    To illustrate the algorithm, consider the contrived program in Fig. 5. We assume that array A is of type char[] and begins at address 10. Then, two strides will be created: 1) [20, 32] with a distance of 2 from line 2 and 2) [21, 39] with a distance of 3 from line 3. DYNAMIC-GCD returns the exact number of conflicting addresses in the two strides. The problem is reduced to solving a Diophantine equation:2

    $$2x + 20 = 3y + 21, \quad 0 \le x, y \le 6. \qquad (1)$$

    DYNAMIC-GCD, described in Algorithm 1, solves this equation. We detail the computation steps using Fig. 6:

    1. Sort and obtain the overlapped bounds and lengths. Let low1 ≤ low2; otherwise, swap the strides. In Fig. 6, the bounds are low = 21 and high = 30, and length = 10.

    2. Check the existence of the dependence by the GCD test. We only have the runtime stride information. To use the GCD test, we transform the strides as if we had a common array base, such as A[dist1·x + delta] and A[dist2·y], where delta is the distance between low and the immediately following accessed address in Stride1. Then, we can use the GCD test. In Fig. 6, delta is 1; the GCD of 2 and 3 is 1, which divides delta. Therefore, the strides may be dependent.

    3. Count the exact dependences within the bound. To do so, we first compute the smallest conflicting point in the bound (24 in Fig. 6) by using EXTENDED-EUCLID [4], which returns x and y in ax + by = gcd(a, b), where a is dist1 and b is dist2. offset is then defined as the distance between low and this smallest conflicting point, which is 3 in Fig. 6. Observe that the difference between two adjacent conflicting points is the least common multiple of dist1 and dist2 (6 in Fig. 6). Then, we can count the exact number of dependences: two addresses (24 and 30) are conflicting. The strides are dependent.

    Algorithm 1. DYNAMIC-GCD.
    Inputs: Two strides: (low1, high1, dist1) and (low2, high2, dist2)
    Output: Return the number of dependences of the two strides.
    Require: low1 ≤ low2; otherwise swap the strides.

    1. Calculate low, high, and length as shown in Fig. 6.
    2. delta ← (dist1 − ((low − low1) mod dist1)) mod dist1
    3. gcd ← the greatest common divisor of dist1 and dist2
    4. if delta mod gcd ≠ 0 then
    5.   return 0
    6. end if
    7. (x, y) ← EXTENDED-EUCLID(dist1, dist2)
    8. lcm ← the least common multiple of dist1 and dist2
    9. offset ← (dist2 · y · delta/gcd + lcm) mod lcm
    10. result ← ⌊(length − offset − 1 + lcm)/lcm⌋
    11. return max(0, result)
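    For readers who want to experiment, the following self-contained C++ sketch implements DYNAMIC-GCD as reconstructed above and reproduces the example of Figs. 5 and 6. The helper names are ours, and the bound computation in Step 1 is simplified.

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <numeric>   // std::gcd, std::lcm
    #include <utility>   // std::swap

    struct Stride { int64_t low, high, dist; };

    // Extended Euclid: returns g = gcd(a, b) and x, y with a*x + b*y = g.
    static int64_t ext_euclid(int64_t a, int64_t b, int64_t& x, int64_t& y) {
        if (b == 0) { x = 1; y = 0; return a; }
        int64_t x1, y1;
        int64_t g = ext_euclid(b, a % b, x1, y1);
        x = y1;
        y = x1 - (a / b) * y1;
        return g;
    }

    // Number of conflicting addresses between two strides (DYNAMIC-GCD sketch).
    int64_t dynamic_gcd(Stride s1, Stride s2) {
        if (s1.low > s2.low) std::swap(s1, s2);        // require low1 <= low2
        int64_t low    = s2.low;
        int64_t high   = std::min(s1.high, s2.high);
        int64_t length = high - low + 1;
        if (length <= 0) return 0;                     // strides do not overlap

        int64_t delta = (s1.dist - ((low - s1.low) % s1.dist)) % s1.dist;
        int64_t g = std::gcd(s1.dist, s2.dist);
        if (delta % g != 0) return 0;                  // GCD test: independent

        int64_t x, y;
        ext_euclid(s1.dist, s2.dist, x, y);
        (void)x;                                       // only y is needed below
        int64_t lcm = std::lcm(s1.dist, s2.dist);
        int64_t offset = ((s2.dist * y * (delta / g)) % lcm + lcm) % lcm;
        int64_t result = (length - offset - 1 + lcm) / lcm;
        return std::max<int64_t>(0, result);
    }

    int main() {
        // Fig. 5/6 example: strides [20, 32] with distance 2 and [21, 39] with distance 3.
        std::cout << dynamic_gcd({20, 32, 2}, {21, 39, 3}) << "\n";  // prints 2
    }
    ```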

    4.3 Overview of the Memory-Efficient SD3 Algorithm

    The first part of SD3, a memory-efficient algorithm, is presented in Algorithm 2. The algorithm augments the pairwise algorithm and will be parallelized to decrease time overhead. We still use the pairwise algorithm for memory references that do not have stride patterns.


    Fig. 4. Interval tree (based on a Red-Black Tree) for fast overlapping point/stride searching. Numbers are memory addresses. Black and white nodes represent Red-Black properties.

    Fig. 5. A simple example for DYNAMIC-GCD.

    Fig. 6. Two strides in Fig. 5. Lightly shaded boxes indicate accessed locations, and black boxes are conflicting locations. The terms (length, delta, low, high, and offset) are explained in the text.

    1. In the worst case, where all the nodes of an interval tree intersect an input, linear time is required. On average, an interval tree is still faster than a simple linear search. Section 4.4 introduces a further optimization.

    2. A Diophantine equation is an indeterminate polynomial equation in which only integer solutions are allowed. In our problem, we solve a linear Diophantine equation such as ax + by = 1.


    Algorithm 2 uses the following data structures, which are depicted in Fig. 7:

    . POINT: This represents a nonstride memory access from a PC. This structure has

    - the PC address (or location ID);
    - the number of total accesses from this PC;
    - the pointer to the next POINT (if any), which is needed because a memory address can be touched by different PCs; and
    - miscellaneous information including the read/write mode and the last accessed iteration number to calculate dependence distance. Section 6.2.2 introduces optimizations to reduce the overhead of handling the list structure.

    . STRIDE: This represents a compressed stream of memory addresses from a PC. This structure has

    - the lowest and highest addresses;
    - the stride distance;
    - the number of total accesses in this stride; and
    - the pointer to the next STRIDE, because a PC can create multiple strides that cannot be combined. Notice that miscellaneous fields such as the RW mode and the last accessed iteration number are stored along with the PC because these fields are common per PC. (A code sketch of POINT and STRIDE follows this list.)

    . LoopInstance: This represents a dynamic execution state of a loop, including statistics and data structures for data-dependence calculation (the pending and history tables).

    . Pending{Point|Stride}Table: These tables capture memory references in the current iteration of a loop. PendingPointTable is a hash table, where the key is the memory address for fast conflict checking, and the value is a POINT. PendingStrideTable associates a STRIDE with the originating PC address. Both pending tables have killed bits to handle loop-independent dependences. These tables employ the DAS-ID (dynamic allocation-site ID) optimization to minimize the search space in dependence checking. The necessary structure change is discussed in the following section.

    . History{Point|Stride}Table: This holds memory accesses in all executed iterations of a loop so far. The structure equals the pending tables except for the killed bits.

    . LoopStack: This keeps the history of loop execution like the call stack for function calls. A LoopInstance is pushed or popped as the corresponding loop is entered and terminated. It is needed to calculate data dependences in loop nests that may have function calls and recursion.
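    A rough C++ rendering of the POINT and STRIDE records is shown below; field names and types are illustrative, not the exact layout used by SD3.

    ```cpp
    #include <cstdint>

    // Illustrative POINT: one nonstride memory access record per PC.
    struct Point {
        uint64_t pc;              // PC address or location ID
        uint64_t access_count;    // number of total accesses from this PC
        Point*   next;            // next POINT: the same address may be touched by other PCs
        bool     is_write;        // read/write mode
        uint64_t last_iteration;  // last accessed iteration, for dependence distance
    };

    // Illustrative STRIDE: a compressed stream of addresses from one PC.
    struct Stride {
        uint64_t low_addr;        // lowest address covered by the stride
        uint64_t high_addr;       // highest address covered by the stride
        int64_t  distance;        // stride distance
        uint64_t access_count;    // number of total accesses in this stride
        Stride*  next;            // a PC may create multiple strides that cannot be combined
        // RW mode and last iteration are kept once per PC, not per stride.
    };
    ```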

    Although the big picture of the algorithm is now described, the stride-based algorithm also needs to address several challenges. The following four sections elaborate on these issues.

    Algorithm 2. THE MEMORY-EFFICIENT ALGORITHM.
    Note: New steps added on top of the pairwise method are underlined.

    1. When a loop, L, starts, the LoopInstance of L is pushed on LoopStack.

    2. On a memory reference, R, of L's ith iteration, check the killed bit of R. If killed, report a loop-independent dependence, and halt the following steps. Otherwise, store R in either PendingPointTable or PendingStrideTable based on the result of the stride detection (see Section 4.1). If R is a write, set its killed bit.

    3. At the end of the iteration, do the stride-based dependence checking (see Sections 4.2 and 4.4). Report any found dependences.

    4. After Step 3, merge the pending and history point tables. Also, merge the stride tables (see Section 4.5). Finally, the pending tables, including killed bits, are flushed.

    5. When L terminates, flush the history tables, and pop the LoopStack. To handle loop nests, we propagate the history tables of L to the parent of L, if it exists. Propagation is done by merging the history tables of L with the pending tables of the parent of L. Meanwhile, to handle loop-independent dependences, if a memory address in the history tables of L is killed by the parent of L (see Section 4.6), this killed history is not propagated.
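    The control flow of Algorithm 2, driven by loop and memory events, can be outlined as follows. The table types and the commented-out helpers are placeholders for the structures of Fig. 7; this is an illustrative outline, not the profiler's actual code.

    ```cpp
    #include <cstdint>
    #include <utility>
    #include <vector>

    struct PendingTables { /* point and stride tables with killed bits */ };
    struct HistoryTables { /* point and stride tables */ };

    struct LoopInstance {
        PendingTables pending;
        HistoryTables history;
        uint64_t      iteration = 0;
    };

    std::vector<LoopInstance> loop_stack;                                 // LoopStack

    void on_loop_begin() { loop_stack.push_back(LoopInstance{}); }        // Step 1

    void on_memory_access(uint64_t addr, uint64_t pc, bool is_write) {    // Step 2
        LoopInstance& L = loop_stack.back();
        // If addr is killed in L.pending: report a loop-independent dependence.
        // Otherwise store it as a POINT or STRIDE (per the stride detector) and
        // set its killed bit if it is a write.
        (void)L; (void)addr; (void)pc; (void)is_write;                    // outline only
    }

    void on_iteration_end() {                                             // Steps 3 and 4
        LoopInstance& L = loop_stack.back();
        // check_dependences(L.pending, L.history);   // stride-based checking
        // merge(L.history, L.pending);               // point and stride tables
        // flush(L.pending);                          // including killed bits
        ++L.iteration;
    }

    void on_loop_end() {                                                  // Step 5
        LoopInstance finished = std::move(loop_stack.back());
        loop_stack.pop_back();
        if (!loop_stack.empty()) {
            // Propagate finished.history into the parent's pending tables,
            // dropping entries killed by the parent (Section 4.6).
            (void)finished;
        }
    }
    ```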

    4.4 Optimizing Stride-Based Dependence Checking

    Fig. 7 summarizes the table structures and the stride-based dependence-checking algorithm. When the current iteration finishes, the two pending tables are checked against the two history tables. Because we have both point and stride tables, the dependence checking now requires four substeps for every pair of the tables, shown as the four large arrows in the figure.

    Arrow 1 is the case of the pairwise method. Arrows 2 and 3 show the case of checking a stride table against a point table. We take every address in the point table and check for conflicts against the stride table using the associated interval tree. Arrow 4 is the most complex step. Every stride in the pending stride table needs to be enumerated and checked against the history stride table. This enumeration and checking could take a long time, especially for a deeply nested loop. Therefore, to reduce this enumeration and the potentially huge search space, we introduce the dynamic allocation-site optimization.

    This optimization is based on the fact that a memory access on a variable or a structure must have an associated allocation site.


    Fig. 7. Structures of the point and stride tables and the four cases of dependence checking: this figure shows the moment when the current iteration finishes. A stride table has an associated interval tree. The tables support DAS-ID (dynamic allocation-site ID) to minimize the checking space. Our implementation uses an additional hash table, where the key is the DAS-ID and the value is either a point or stride subtable that only contains memory references from the same DAS-ID.


    Memory accesses from different allocation sites will never conflict in a correct program. Once allocation sites are known, we need to check dependences only among memory accesses within the same allocation site, reducing the search space significantly. In particular, we focus on heap accesses because they are the main target of the analysis. To obtain allocation-site IDs for heap accesses effectively, we dynamically track allocated heap regions and issue an ID for each heap region. The DAS-ID optimization is implemented as follows:

    1. Instrument heap functions (e.g., malloc and delete).

    2. On heap allocation, retrieve the allocated memory range, and issue a DAS-ID from a simple increasing counter. Store this pair of the allocated range and the DAS-ID in a global table. The table is an interval tree that allows a fast query of the associated DAS-ID for a given memory access.

    3. On heap deallocation, delete the corresponding node.

    4. On a load or store, fetch the address, and query the table to obtain the corresponding DAS-ID of the access. Store the memory reference in either a point or stride table reserved for this DAS-ID only.

    We finally update the point and stride table structure by adding an additional hash-table layer, where the key type is the DAS-ID and the value is either a point or a stride table. The point and stride tables stored as values now contain only memory references from the same DAS-ID.
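    A hedged sketch of this bookkeeping is given below, using an ordered map over allocated ranges as a stand-in for the global interval tree. Types and method names are our own.

    ```cpp
    #include <cstdint>
    #include <map>
    #include <optional>

    struct HeapRange { uint64_t end; uint32_t das_id; };   // [start, end) and its DAS-ID

    class DasIdTable {
        std::map<uint64_t, HeapRange> ranges_;   // key: start address of an allocation
        uint32_t next_id_ = 0;

    public:
        // Step 2: on heap allocation of [addr, addr + size), issue a new DAS-ID.
        uint32_t on_alloc(uint64_t addr, uint64_t size) {
            uint32_t id = next_id_++;
            ranges_[addr] = HeapRange{addr + size, id};
            return id;
        }

        // Step 3: on heap deallocation, remove the corresponding entry.
        void on_free(uint64_t addr) { ranges_.erase(addr); }

        // Step 4: on a load/store, find which allocation the address belongs to.
        std::optional<uint32_t> lookup(uint64_t addr) const {
            auto it = ranges_.upper_bound(addr);
            if (it == ranges_.begin()) return std::nullopt;   // below every allocation
            --it;                                             // candidate containing range
            if (addr < it->second.end) return it->second.das_id;
            return std::nullopt;                              // not a tracked heap access
        }
    };
    ```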

    4.5 Merging Stride Tables for Loop Nests

    In the pairwise method, we propagate the histories of inner loops to their upper loops to compute dependences in loop nests. Introducing strides makes this propagation difficult. Steps 4 and 5 in Algorithm 2 require a merge operation of a history table and a pending table. Without strides (i.e., with only points), merging tables is straightforward: we simply compute the union of the two point hash tables.3

    On the other hand, merging two stride tables is not trivial. A naive solution is to simply concatenate the two stride lists. If this is done, the number of strides could be bloated, resulting in increased memory consumption. Instead, we try to do stride-level merging rather than a simple stride-list concatenation. An example is illustrated in Fig. 8.

    Naive stride-level merging requires quadratic time complexity. Here, we again exploit the interval tree for fast overlap testing. Nonetheless, we observed that tree-based searching could still take a long time if there is no possibility of stride-level merging. To minimize such waste, the profiler caches the result of the merging test in history counters per PC. If a PC shows very little chance of having stride merges, SD3 skips the merging test and simply concatenates the lists.
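    As an illustration of the stride-level merge in Fig. 8, the following simplified predicate merges two strides from the same PC when they have the same distance and compatible, phase-aligned ranges. The real profiler additionally consults the interval tree and the per-PC history counters described above.

    ```cpp
    #include <algorithm>
    #include <cstdint>
    #include <optional>

    struct StrideRec {
        uint64_t low, high;   // address range covered by the stride
        uint64_t dist;        // stride distance (assumed positive here)
        uint64_t accesses;    // total accesses recorded in this stride
    };

    // Merge two strides from the same PC if possible; access counts are summed.
    std::optional<StrideRec> try_merge(const StrideRec& a, const StrideRec& b) {
        if (a.dist != b.dist) return std::nullopt;                 // different distances
        const StrideRec& lo = (a.low <= b.low) ? a : b;
        const StrideRec& hi = (a.low <= b.low) ? b : a;
        bool ranges_touch  = hi.low <= lo.high + lo.dist;          // overlapping or contiguous
        bool phase_aligned = ((hi.low - lo.low) % lo.dist) == 0;   // same residue class
        if (!ranges_touch || !phase_aligned) return std::nullopt;
        return StrideRec{lo.low, std::max(lo.high, hi.high), lo.dist,
                         a.accesses + b.accesses};
    }
    ```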

    4.6 Handling Killed Addresses in Strides

    We discussed that maintaining killed addresses is very important to distinguish loop-carried from loop-independent dependences. The pairwise method prevented killed addresses from being propagated to further steps (to the next iteration). This step becomes complicated with strides because strides could be killed by the parent loop's strides or points.

    Fig. 9 illustrates this case. A stride is generated from the instruction at line 6 when Loop_5 is being profiled. After finishing Loop_5, its HistoryStrideTable is merged into Loop_1's PendingStrideTable. At this point, Loop_1 knows the killed addresses from lines 2 and 4. Thus, the stride at line 6 can be killed by either 1) a random point write at line 2 or 2) a write stride at line 4. We detect such killed cases when the history strides are propagated to the outer loop. Detecting killed addresses is essentially identical to finding conflicts between strides and points. We use the same dependence-checking algorithm.

    Interestingly, after processing killed addresses, a stride could end up in one of three cases: 1) a shrunk stride (the range of stride addresses is reduced), 2) two separate strides, or 3) complete elimination. For instance, a stride [4, 8, 12, 16] can be shortened by the killed address 16. If the killed address is 8 instead, the stride is divided into two strides.
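    The three outcomes can be illustrated with a small helper that removes one killed address from a stride (assuming a positive stride distance and a single killed point; killed strides are handled analogously by the dependence-checking machinery described above).

    ```cpp
    #include <cstdint>
    #include <vector>

    struct StrideRange { uint64_t low, high; uint64_t dist; };   // dist assumed positive

    // Remove one killed address: returns zero, one, or two strides
    // (complete elimination, a shrunk stride, or a split stride).
    std::vector<StrideRange> kill_address(const StrideRange& s, uint64_t killed) {
        if (killed < s.low || killed > s.high || (killed - s.low) % s.dist != 0) {
            return {s};                                  // killed address not in the stride
        }
        std::vector<StrideRange> out;
        if (killed > s.low)  out.push_back({s.low, killed - s.dist, s.dist});   // left part
        if (killed < s.high) out.push_back({killed + s.dist, s.high, s.dist});  // right part
        return out;   // empty if the stride contained only the killed address
    }
    ```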

    4.7 Lossy Compression in Strides

    Our stride-based algorithm essentially uses compression, which can be either lossy or lossless. If we only consider strictly increasing or decreasing strides, SD3 guarantees the perfect correctness of data-dependence profiling, which means the SD3 results are identical to the pairwise-method results.

    Section 4.1 discussed that a stream like [10, 14, 18, 14, 18, 22, 18, 22, 26] is also considered a stride in our implementation. In this case, our stride format cannot perfectly record the original characteristics of the stream. We only remember two facts: 1) a stride of 10 + 4n (0 ≤ n ≤ 4) and 2) the total number of memory accesses in this stride, which is 9. The stride format cannot precisely remember the occurrence count of each memory address. Such lossy compression may cause slight errors in the counts that DYNAMIC-GCD calculates.

    Suppose that this stride has a conflict at address 26. Address 26 is accessed only one time, but this information has been lost. To compensate, we add a correction to the result of DYNAMIC-GCD by taking the average occurrence count of each reference: ⌈9/5⌉ = 2, the total number of accesses in the stride divided by the number of distinct addresses in the stride.


    Fig. 8. Two stride lists from the same PC are about to be merged. The stride ([10, 100], +10, 24) means a stride of 10, 20, ..., 100 and 24 total accesses in the stride. These two strides have the same stride distance. Thus, they can be merged, and the number of accesses is summed.

    Fig. 9. A stride at line 6 can be killed by either a point at line 2 or a stride at line 4.

    3. The point-table merging must perform the union of two PC-lists, one from the pending tables and one from the history tables. To make PC-list merging faster, we employ the PC-set optimization (Section 6.2.2).


    Nonetheless, such errors do not noticeably affect the usefulness of our approach because we still guarantee the correctness of the existence of data dependences.

    5 A TIME-EFFICIENT ALGORITHM OF SD3

    5.1 Overview of the Algorithm

    The time overhead of data-dependence profiling is very high. A typical method to reduce the time overhead would be to use sampling techniques. Unfortunately, simple sampling techniques are not desirable because they mostly trade accurate results (for a given input) for low overhead. For example, a dependence pair could be missed due to sampling, but this pair could prevent parallelization in the worst case. We instead attack the time overhead by parallelizing the data-dependence profiling itself. In particular, we discuss the following problems:

    . Which parallelization model is most efficient?

    . How do the stride algorithms work with parallelization?

    5.2 Parallelization Model of SD3: A Hybrid Approach

    We first survey parallelization models for the profiler that implements Algorithm 2. Before the discussion, we need to explain the structure of our profiler briefly. Our profiler, before parallelization, is composed of the following three steps:

    1. Fetching events from an instrumented program: Events include

    a. memory events: memory reference information such as the effective address and PC, and

    b. loop events: the beginning/iteration/termination of a loop, which is essential to implement Algorithm 2. Our profiler is an online tool; events are processed on-the-fly.

    2. Loop execution profiling and stride detection: We collect statistics of loop execution (e.g., trip count) and train the stride detector on every memory instruction.

    3. Data-dependence profiling: Algorithm 2 is executed.

    To find an optimal parallelization model for SD3, three parallelization strategies are considered: 1) task-parallel, 2) pipeline, and 3) data-parallel. Our approach uses a hybrid model of pipeline and data-level parallelism.

    With the task-parallel strategy, several approaches could be possible. For instance, the profiler may spawn concurrent tasks for each loop. During a profile run, before a loop is executed, the profiler forks a task that profiles the loop. This is similar to the shadow profiler [22]. This approach is not easily applicable to the data-dependence profiling algorithm because it requires severe synchronization between tasks due to nested loops. We do not take this approach.

    Pipelining enables each step to be executed on a different core in parallel. We have three steps, but the third step, the data-dependence profiling, is the most time-consuming step. Although the third step determines the overall speedup, we can still hide the computation latencies of the first (event fetch) and second (stride detection) steps through pipelining.

    Regarding the data-parallel method, first notice that SD3 itself is embarrassingly parallel. Checking data dependences for a particular address requires only information on this address; no information from the other addresses is needed. We take a single program, multiple data (SPMD) style to exploit this data-level parallelism. A set of tasks performs Algorithm 2 concurrently, but each task only processes a subset of the entire input. This data-parallel method is the most scalable one and does not require any synchronization except for the trivial final result-reduction step. We also use this model.

    From this survey, our solution is a hybrid model: we basically exploit pipelining, but the dependence-profiling step, which is the longest, is further parallelized in an SPMD style. Fig. 10 summarizes the parallelization model of SD3. To obtain even higher speedup, we may exploit multiple machines (see details in Section 6.2). However, there are several issues for an efficient parallelization, which we discuss now.

    5.3 Event Distribution for Parallel Processing

    The event distribution step is introduced in stage 1. For the SPMD parallelization at stage 2, we need to prepare inputs for each task. In particular, we divide the address space in an interleaved fashion for better speedup, as shown in Fig. 11.

    The entire address space is divided every 2^k bytes, and each subset is mapped to one of M tasks in an interleaved manner. Each task analyzes only the memory references from its own ranges. A thread scheduler then executes the M tasks on N cores.
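    The mapping sketched in Fig. 11 therefore amounts to the following computation; k = 7 (128-byte ranges) is the value used in the paper, and the function name is ours.

    ```cpp
    #include <cstdint>

    // Interleaved address-space partitioning: addresses are grouped into 2^k-byte
    // ranges, and ranges are assigned to the M tasks round-robin.
    constexpr unsigned k = 7;   // 128-byte address ranges

    inline unsigned task_for_address(uint64_t addr, unsigned num_tasks /* M */) {
        return static_cast<unsigned>((addr >> k) % num_tasks);
    }
    ```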

    The event distribution, as depicted in Fig. 12, is not a simple division of the entire input events: memory events are distributed by the interleaving. By contrast, loop events must be duplicated for the correctness of the parallelized SD3 (i.e., the result should be identical to the serial SD3) because the steps of Algorithm 2 are triggered on a loop event.


    Fig. 10. SD3 exploits both pipelining (two-stage) and data-level parallelism. Step 1 is augmented for the data-level parallelization.

    Fig. 11. Data-parallel model of SD3 with an address-range size of 2^k bytes, M tasks, and N cores: the address space is divided in an interleaved manner. The formula above is used to determine the corresponding task ID for a memory address. In our experimentation, the address-range size is 128 bytes (k = 7), and the number of tasks is the same as the number of cores (M = N).


    5.4 Strides in Parallelized SD3

    Our stride-detection algorithm and DYNAMIC-GCD need a revision in the parallelized SD3. Stride patterns are detected by observing a stream of memory addresses. In the parallelized SD3, however, each task can only observe memory addresses in its own address ranges. This problem is illustrated in Fig. 13.

    Here, the address space is divided among three tasks, with a range size of 4 bytes. Consider a stride A with a range of [10, 24] and a stride distance of 2. Task0 can only see addresses in the ranges of [10, 14) and [22, 26). Therefore, Task0 will conclude that there are two different strides at [10, 12] and [22, 24] instead of only one stride. These fragmented strides dramatically bloat the number of strides.

    To solve this problem, the stride detector of a task assumes that any memory access pattern is possible in the out-of-my-region ranges so that fragmented strides can be combined into a single stride. In this example, the stride detector of Task0 assumes that the following memory addresses are accessed: 14, 16, 18, and 20. Then, the detector will create a single stride. Even if the assumption is wrong, the correctness is not affected. To preserve the correctness, when performing DYNAMIC-GCD, SD3 excludes the number of conflicts that fall in out-of-my-region ranges.

    5.5 Details of the Data-Parallel Model

    5.5.1 Choosing a Good Address-Range Size

    A key point in designing the data-parallel model is to obtain higher speedup via good load balancing. However, the division of the address space inherently creates a load imbalance problem, as memory accesses often show nonuniform locality. Obviously, having too small or too large an address range would worsen this problem. Hence, we use an interleaved division as discussed and then need to find a reasonably balanced address-range size. According to our experiments (not shown in the paper), as long as the range size is not too small or too large, address-range sizes from 64 to 256 bytes yield a well-balanced workload distribution. In our implementation, we choose 128 bytes.

    5.5.2 Choosing an Optimal Number of Tasks

    Even when taking the interleaved approach, we cannot avoid the load imbalance problem. To address this issue, we attempted to create a large number of tasks and employ a work-stealing scheduler [2], that is, to exploit fine-granularity task parallelism. At a glance, this approach would yield better speedup, but our data negated our hypothesis (not shown in the paper). We observed that no speedup was gained using this approach.

    There are two reasons for this. First, even if the number of memory events per task is reduced, the number of strides may not be proportionally reduced. For example, in Fig. 13, despite the revised stride-detection algorithm, the total number of strides for all tasks is three; in a serial version of SD3, the number of strides would have been one. Hence, having more tasks may increase the overhead of storing and handling strides, eventually resulting in poor speedup. Second, the overhead of the event distribution becomes significant as the number of tasks increases. Recall again that loop events are duplicated, while memory events are distributed. This restriction makes the event distribution a complex and memory-intensive operation. On average, for SPEC 2006 with the train inputs, the ratio of the total size of loop events to the total size of memory events is 8 percent. Although the time overhead of processing a loop event is much lighter than that of a memory event, the overhead of transferring loop events could become serious as the number of tasks increases.

    Therefore, we let the number of tasks be identical to the number of cores. Although data-dependence profiling is embarrassingly parallel, the mentioned challenges, handling strides and distributing events, hinder an optimal workload distribution and an ideal speedup.

    6 IMPLEMENTATION

    Building a profiler that implements SD3 poses many implementation challenges. We build SD3 using both Pin [20], a dynamic binary-level instrumentation toolkit, and LLVM [18], a compiler framework. We discuss the motivation for using Pin and LLVM and several important issues.

    6.1 Basic Architecture

    Our profiler consists of a tracer and an analyzer, a typical producer-consumer architecture:

    . Tracer: This instruments a program, captures runtime execution traces (i.e., memory events and loop events), and transfers the traces to the analyzer via shared memory. A tracer is built on either Pin or LLVM.

    . Analyzer: This takes events from the tracer and performs the SD3 algorithm, which is orthogonal to the tracers.

    Our profiler must be an online tool. Because the majority of loads and stores could be instrumented, the generated traces could be extremely large, up to on the order of 10 TB. We cannot simply use an offline approach with such traces. An example of such an offline approach would be storing and compressing events (e.g., using bzip) and then decompressing and analyzing the events. This approach is not effective at all because compressing/decompressing the traces takes most of the time. One concern with this online approach would be the overhead of interprocess communication.


    Fig. 12. An example of the event distribution step with three tasks: loop events are duplicated for all tasks, while memory events are divided depending on the address-range size and the formula of Fig. 11.

    Fig. 13. A single stride can be broken by interleaved address ranges. Stride A will be seen as two separate strides in Task0 with the original stride-detection algorithm.


    We observed that the average event transfer rate between the two processes was approximately 1-3 GB/s, which can be sufficiently handled by modern computers. The size of an execution event is 12 bytes (on x86-64), and events are transferred without compression.

    This separation of tracer and analyzer results in two significant benefits. First, pipeline parallelism, explained in Section 5.2, is easily achieved. Second, the analyzer can be reused by different tracers. We separately implement tracers based on the instrumentation mechanisms. We also define an abstracted communication layer between the single analyzer and multiple tracers, regardless of the instrumentation toolkit.

    6.2 Implementation of Analyzer

    The analyzer first implements the data structures described in Section 4.3 and Algorithm 2, which we then parallelize using Intel Threading Building Blocks (TBB) [12]. To obtain even better parallelism, we extend our profiler to work on multiple machines, using an MPI-like execution model [9]. The same tracer, analyzer, and application run in parallel on multiple machines, but each machine has an equally divided workload. This is a simple extension of our data-parallel model, applied across different machines.

    In practice, many programming techniques are essential to significantly improve the performance of the pairwise method and SD3. For example, specialized data structures should be implemented rather than using the general data structures of the C++ STL. Customized memory allocation is also critical because the memory reference structures (POINT and STRIDE) are frequently allocated and removed.

    6.2.1 False-Positive and False-Negative Issues

    For the implementation of the analyzer, we should consider the false-positive (a dependence is reported, but it is a false alarm) and false-negative (no dependence is reported, but one exists) issues in data-dependence profiling.

    False negatives can occur when not all code is executed with a specific input. Section 7.4 discusses this problem. False positives can also occur if we use a larger granularity in the memory instruction instrumentation, such as 8-byte or cache-line granularity rather than byte granularity. Data-dependence profiling in the speculative multithreading domain can use a large granularity to minimize overhead, but this approach suffers from more false-positive dependences [3]. In our implementation, the stride-based approach does not suffer from false positives. We correctly handle the size of the memory access (e.g., whether char, int, or double) in the stride-based data structures and DYNAMIC-GCD.

    For the pairwise method, in which hash tables are keyed by addresses, we always use 1-byte granularity, which does not suffer from any false positives. False negatives, however, can occur in a very unusual case, shown in Fig. 14.

    Even if there is an 8-byte write at line 4, the 1-byte granularity policy records only the first byte of the access. The read at line 5 then results in a missing data dependence. We believe such a case is very unlikely to occur in well-written code. This problem can be resolved by our DAS-ID optimization. The heap accesses at lines 4 and 5 both have the same allocation site at line 1. We can then easily detect the aliased accesses by their different access types.

    6.2.2 PC-Set Optimization for the Pairwise Method

    To reduce both time and space overhead in the pairwise method, we introduce the PC-set optimization. Recall that SD3 still uses the pairwise method when a memory access does not show stride behavior. As discussed in Section 4.5, the pairwise method merges two point tables when the current iteration finishes and when an inner loop terminates.

    The same memory address may be accessed by multiple PC locations. Since we report PC-wise sources and sinks of dependences, each entry of a point table must have a PC list, which was implemented as a list of POINT. Having a PC list creates two challenges: 1) space overhead and 2) the overhead of merging two PC lists (computing the union of two lists). Fortunately, we observed that the number of distinct PC lists for SPEC 2006 was not significant: the geometric mean is only 6,242. We also learned that most of the union computations were repeated.

    Hence, we introduce a global PC-set table that remembers all observed distinct PC lists and a cache for the union computation. These two data structures, shown in Fig. 15, not only save total memory consumption but also avoid excessive computation time. The hit ratio of the union cache is 95 percent on average.

    This PC-set optimization causes another lossy compression, like the error discussed in Section 4.7. As illustrated in Fig. 15, introducing a PC set loses the occurrence count per PC; only the total occurrence count for the single PC set is saved. Hence, when reporting the frequency of a data dependence, the pairwise method may have a minor error. Again, no error exists in judging the existence of dependences.
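    A rough sketch of the two structures follows: a table that interns each distinct PC list as a small PC-set ID, plus a cache of previously computed unions. The concrete representation is our assumption for illustration.

    ```cpp
    #include <cstdint>
    #include <map>
    #include <set>
    #include <utility>
    #include <vector>

    using PcSet   = std::set<uint64_t>;   // a sorted set of PC addresses
    using PcSetId = uint32_t;

    class PcSetTable {
        std::vector<PcSet>       sets_;   // PC-set ID -> PC set
        std::map<PcSet, PcSetId> ids_;    // PC set -> PC-set ID (interning)
        std::map<std::pair<PcSetId, PcSetId>, PcSetId> union_cache_;

    public:
        // Remember a distinct PC list and return its (possibly existing) ID.
        PcSetId intern(const PcSet& s) {
            auto [it, inserted] = ids_.emplace(s, static_cast<PcSetId>(sets_.size()));
            if (inserted) sets_.push_back(s);
            return it->second;
        }

        // Union of two PC sets as an interned ID; repeated queries hit the cache.
        PcSetId unite(PcSetId a, PcSetId b) {
            if (a > b) std::swap(a, b);
            auto cached = union_cache_.find({a, b});
            if (cached != union_cache_.end()) return cached->second;
            PcSet u = sets_[a];
            u.insert(sets_[b].begin(), sets_[b].end());
            PcSetId id = intern(u);
            union_cache_[{a, b}] = id;
            return id;
        }
    };
    ```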

    6.3 Implementation of Tracers

    We first implemented SD3 on Pin [14]. We discuss the challenges for the Pin-based SD3 and the motivations for an LLVM-based SD3.

    6.3.1 Issues in a Pin-Based Tracer

    A Pin-based tracer enables dependence profiling at the dynamic and binary levels. This approach broadens the applicability of the tool compared to a compiler- and source-code-level approach.


    Fig. 14. A false-negative case with 1-byte granularity: both data1 and data2 are aliases of raw, but their access types are different.

    Fig. 15. PC-set optimization for fast PC-list merging. A PC list can be represented as a single PC-set ID, saving memory. PC-list merging can also be accelerated by a cache.


    Dynamic instrumentation does not require recompilation of the profiled program. This is a great benefit if the application does not have full source code or requires different and complex tool chains.

    The downside of the Pin-based approach is that additional binary-level static analysis is needed to recover control flow graphs and loop structures, which is difficult to implement. For example, recovering indirect branches and pinpointing the correct locations of loop entries and exits are challenges in binary-level analysis.

    Regarding the instrumentation of loads and stores, an x86 binary executable typically has many artifacts from stack push/pop operations and system function calls. Without eliminating such redundant loads and stores, the results of a Pin-based profiler would contain many dependences that are not useful as parallelization hints. Some loads and stores also do not need to be instrumented if their dependences can be identified at static time, notably inductions and reductions. Selectively filtering such loads and stores is also difficult. These challenges motivate the use of an LLVM-based SD3.

    6.3.2 Issues in an LLVM-Based Tracer

    Using compiler-based instrumentation such as LLVM may address the issues of the Pin-based tracer. LLVM provides a very rich static-analysis infrastructure, including correct control flows and loop structures. Furthermore, LLVM solves many challenges of binary-level instrumentation. For example, skipping inductions and reductions is relatively easy to implement since all the data-flow information is retained, unlike with binaries. Further static analysis may be performed before dynamic profiling to decrease the profiling overhead [6]. For example, all memory loads and stores whose data dependences can be identified at static time could be excluded as profiling candidates. Our implementation in this paper skips induction and reduction variables (both basic and derived ones) and some read-only accesses.

    The greatest downside of using an LLVM-based tracer is that it requires recompilation. Recompiling an application with instrumentation code is not always easy. It sometimes requires modifications to compiler tool chains and compiler driver code. The analyzer needs some information from the instrumentation phase, such as a list of instrumented loops and memory instructions. As the instrumentation phase is separated from the runtime profiling, such information should be transferred via a persistent medium like a file.

    7 EXPERIMENTAL RESULTS

    7.1 Experimentation Methodology

    We use 22 SPEC 2006 benchmarks to report runtime overhead by running the entire execution of the benchmarks with the reference input.4 We instrument all memory loads and stores except for certain types of stack operations and corner cases. Our profiler collects the details of data-dependence information enumerated in Section 2. We profile the 20 hottest loops (based on the number of executed instructions) and their inner loops. For comparing the overhead, we use the pairwise method. We also use seven OmpSCR benchmarks [1] to evaluate the input-sensitivity problem.

    Our experimental results were obtained on machines with Windows 7 (64-bit), eight cores with hyper-threading technology, and 16 GB of main memory. Memory overhead is measured in terms of the peak physical memory footprint. For results on multiple machines, our profiler runs in parallel on the machines, but each instance only profiles its distributed workload. We then take the slowest time for calculating speedup.

    7.2 Memory Overhead of SD3

    Fig. 16 shows the absolute memory overhead of SPEC 2006 with the reference inputs. The memory overhead includes everything: 1) the native memory consumption of a benchmark, 2) instrumentation overhead, and 3) dynamic profiling overhead. Among the 22 benchmarks, 21 benchmarks cannot be profiled by the pairwise method on a 16-GB memory system. Sixteen out of 22 benchmarks consumed more than 12 GB even with the train inputs.

    Fig. 18 shows the memory consumption of the pairwise method, sampled every second, for 433.milc, 434.zeusmp, 435.lbm, and 436.cactusADM. Within 500 seconds, these four benchmarks reached 10 GB of memory consumption. We do not even know how much memory would be needed to complete the profiling. We also tested 436.cactus and 470.lbm on a 24-GB machine, but they still failed. Simply doubling the memory size could not solve this problem. SD3 successfully profiled the 22 benchmarks on a 12-GB machine, although a couple of benchmarks needed more than 7 GB. For example, while both 416.gamess and 436.cactusADM demand more than 12 GB in the pairwise method, SD3 requires only 1.06 GB (just 1.26× the native memory usage) and 1.02 GB (a 1.58× overhead), respectively.


    4. Six SPEC benchmarks were not profiled successfully due to implementation issues. A few loops (typically from functions that parse input files) in the other 22 benchmarks needed to be excluded to avoid runtime errors due to binary-level issues. Currently, the LLVM implementation cannot instrument Fortran programs. The results in this paper are from the Pin-based profiler, but the LLVM-based profiler shows similar performance.

    Fig. 16. Absolute memory overhead for SPEC 2006 with the reference inputs: 21 out of 22 benchmarks (marked) need more than 12 GB in the pairwise method, which is still the state-of-the-art algorithm of current tools. The benchmarks natively consume 158 MB of memory on average.


    The geometric mean of the memory consumption of SD3 (1 task) is 2,113 MB, while the overhead of the native programs is 158 MB. Although 483.xalancbmk needed more than 7 GB, we can conclude that the stride-based compression is very effective.

    Parallelized SD3 naturally consumes more memory than the serial version of SD3: 2,814 MB (8 tasks) compared to 2,113 MB (1 task) on average. The main reason is that each task needs to maintain a copy of the information for all the loops to remove synchronization. Furthermore, the total number of strides is generally increased compared to the serial version since each task maintains its own strides. Nonetheless, SD3 still reduces memory consumption significantly compared to the pairwise method.

    7.3 Time Overhead of SD3

The time overhead results of SD3 are presented in Fig. 17. The time overhead includes both instrumentation-time analysis and runtime profiling overhead. The instrumentation-time overhead, such as recovering loops, is quite small: for SPEC 2006, it is only 1.3 seconds on average. The slowdowns are measured against the execution time of the native programs. As discussed in Section 5.5, the number of tasks is the same as the number of cores in each experiment.

As shown in Fig. 17, serial SD3 shows a 289× slowdown on average, which is not surprising given the amount of computation performed on every memory access and loop execution. The overhead could be improved by better static analysis that allows us to skip instrumenting loads and stores that provably cannot create any data dependences. As discussed in Sections 6.2 and 6.2.2, implementation techniques are critical to boost the baseline performance; otherwise, a profiler could easily show a 1,000× slowdown.

When using eight tasks on eight cores, parallelized SD3 shows a 70× slowdown on average, with 29× and 181× in the best and worst cases, respectively. We also measure the speedup with four eight-core machines (32 cores in total). With 32 tasks on 32 cores, the average slowdown is 29×, and the best and worst cases are 13× and 64×, respectively. Calculating the speedups over the serial SD3, we achieve 4.1× and 9.7× speedups on eight and 32 cores, respectively.

Although the data-dependence profiling stage is embarrassingly parallel, our speedup is lower than the ideal speedup (only a 4.1× speedup on eight cores). The first reason is an inherent load-imbalance problem. The number of tasks is equal to the number of cores to minimize redundant loop handling and event-distribution overhead. However, the address space is statically divided among the tasks, and there is no simple way to change this mapping dynamically. Second, with the stride-based approach, the processing time for handling strides is not necessarily decreased in the parallelized SD3.
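The static division can be pictured as a fixed hash from an address range to a task index; the minimal sketch below (hypothetical names and constants, using the 128-byte address-range size from our experiments) shows why the assignment cannot adapt when a few address regions are much hotter than others.

    #include <cstddef>
    #include <cstdint>

    // Hypothetical static partitioning of the address space among profiling tasks.
    // Every event on a given address range is always routed to the same task, so a
    // hot region keeps one task busy while the others wait (load imbalance).
    constexpr std::size_t kNumTasks   = 8;   // equals the number of cores
    constexpr unsigned    kRegionBits = 7;   // 128-byte address ranges (assumed)

    inline std::size_t TaskForAddress(std::uint64_t addr) {
        // Fixed mapping; it cannot be rebalanced at runtime.
        return static_cast<std::size_t>(addr >> kRegionBits) % kNumTasks;
    }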

We also estimate the slowdown with an infinite number of CPUs. In that case, each CPU observes conflicts on only a single memory address, so the dependence-checking work per CPU is extremely low, and the ideal slowdown would be very close to the runtime overhead without data-dependence profiling. Some benchmarks, such as 483.xalancbmk and 454.calculix, show 17× and 14× slowdowns even without the data-dependence profiling; this large loop-profiling overhead mainly comes from frequent loop starts and terminations and from deeply nested loops.

    7.4 Input Sensitivity of Data-Dependence Profiling

One of the concerns in using a data-dependence profiler as a programming-assistance tool is the input-sensitivity problem. We quantitatively measure the similarity of data-dependence profiling results from different inputs. A profiling result is a list of discovered dependence pairs (source and sink). We compare the discovered dependence pairs from a set of different inputs, considering only the top 20 hottest loops and ignoring the frequency of the data-dependence pairs. We define similarity as follows, where $R_i$ is the $i$th result (i.e., a set of data-dependence pairs):

    \mathrm{Similarity} = \frac{1}{N} \sum_{i=1}^{N} \frac{\left| R_i \cap \bigcap_{k=1}^{N} R_k \right|}{|R_i|}.
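To make the metric concrete, the sketch below computes it directly from the sets of discovered pairs; the pair representation (a source/sink identifier pair) is hypothetical, and the code assumes at least one nonempty result set.

    #include <algorithm>
    #include <iterator>
    #include <set>
    #include <string>
    #include <utility>
    #include <vector>

    // R_i: the set of dependence pairs (source, sink) discovered with input i.
    using DepPair = std::pair<std::string, std::string>;
    using Result  = std::set<DepPair>;

    double Similarity(const std::vector<Result>& results) {
        // Pairs discovered with every input: the intersection of all R_k.
        Result common = results.front();
        for (const Result& r : results) {
            Result tmp;
            std::set_intersection(common.begin(), common.end(), r.begin(), r.end(),
                                  std::inserter(tmp, tmp.begin()));
            common.swap(tmp);
        }
        // Average, over inputs, of the fraction of R_i that every input also found.
        // The common set is a subset of each R_i, so its intersection with R_i is itself.
        double sum = 0.0;
        for (const Result& r : results)
            sum += static_cast<double>(common.size()) / static_cast<double>(r.size());
        return sum / static_cast<double>(results.size());
    }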

A similarity of 1 means all sets of results are exactly the same (no differences in the existence of discovered data-dependence pairs; frequencies are not compared). We first tested eight benchmarks in the OmpSCR [1] suite. All of them are small numerical programs, including FFT, LUReduction, and Mandelbrot.


Fig. 17. Slowdowns (against the native run) for SPEC 2006 with the reference inputs. From left to right: 1) SD3 (1 task on 1 core), 2) SD3 (8 tasks on 8 cores), 3) SD3 (32 tasks on 32 cores), and 4) estimated slowdowns with infinite CPUs. For all experiments, the address-range size is 128 bytes. The geometric mean of the native runtime is 488 seconds on an Intel Core i7 at 3.0 GHz.

Fig. 18. Memory overhead of the pairwise method for four SPEC 2006 benchmarks.


We tested them with three different input sets, changing the input data size or iteration count; the input sets are long enough to execute the majority of the source code. We found that the profiling results of OmpSCR were not changed by the different input sets (i.e., Similarity = 1).

Fig. 19 shows the similarity results for SPEC 2006, obtained from the reference and train input sets. Our data show very high similarities (0.98 on average) in the discovered dependence pairs. Recall that we compare the similarity only for frequently executed loops. Some benchmarks show a few differences (similarity as low as 0.95), but we found that the differences were highly correlated with the executed code coverage. In this comparison, we minimize x86-64 artifacts such as stack operations in functions.

A related work also showed a similar result. Thies et al. [27] used a dynamic analysis tool to find pipeline parallelism in streaming applications with annotated code. Their results showed that memory dependences between pipeline stages are highly stable and predictable over different inputs.

7.5 Discussion: Sampling

Sampling is an obvious option for limiting the overhead of dependence profiling [3]. However, simple sampling techniques are inadequate for the purpose of our data-dependence profiler for two reasons: 1) any sampling technique may introduce incorrect results, either false negatives or false positives (see Section 6.2.1); and 2) sampling may not be effective at solving the overhead problem.

• Correctness. Some usage models of dependence profiling can tolerate inaccuracy, notably thread-level data speculation (TLDS) [3], [19], [26], because in TLDS incorrect speculations can be recovered by the rollback mechanism in hardware. However, when using dependence profiling to guide programmer-assisted parallelization for conventional multicore processors, we claim that the profile should be as correct as possible. In this context, correctness means that a profiling result is correct for a given input; we cannot prove correctness for all possible inputs using a dynamic approach.

Fig. 20 illustrates a case where simple sampling could make a wrong decision in predicting parallelizability. The loop at line 307 of the figure is mistakenly reported as potentially parallelizable. The reason is that the work function that generated the dependences was called just one time, at the end of the iteration. A simple sampling technique that randomly samples iterations of a loop may make this error.

• Overhead. A simple sampling technique also does not inherently solve the overhead problem. To demonstrate this, we compare the following three simple sampling techniques with SD3 for the Jacobi benchmark in OmpSCR:

- Burst sampling: Dependence profiling is triggered every N instructions and then lasts for M consecutive memory instructions. We fix N to be 5,000 and M to be 1,000 in this evaluation (a sketch of this gating logic follows the list);
- Early stopping: For any given loop, we profile up to N percent of its total invocation count and M percent of its total iteration count; and
- Major-loops only: We perform dependence profiling only on hot loops (at least 5 percent of the total execution time).
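As an illustration of the first policy only, a burst-sampling gate can be as simple as the following sketch (hypothetical code; for brevity the skip window is counted in memory accesses rather than in all retired instructions):

    #include <cstdint>

    constexpr std::uint64_t kSkipN  = 5000;   // N: accesses skipped between bursts
    constexpr std::uint64_t kBurstM = 1000;   // M: consecutive accesses profiled per burst

    // Returns true only while inside a burst window.
    inline bool ShouldProfileThisAccess() {
        static std::uint64_t counter = 0;
        const std::uint64_t phase = counter++ % (kSkipN + kBurstM);
        return phase >= kSkipN;               // skip N accesses, then profile the next M
    }

Even with such a gate, every access inside a burst still incurs the full pairwise bookkeeping, which helps explain why the memory overhead in Fig. 21 remains in the gigabyte range.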

Fig. 21 shows the results. When the matrix is 3,000 by 3,000, the pairwise method requires more than 12 GB of memory. Even when the sampling policies are used, the memory overhead is still on the order of gigabytes. In contrast, SD3 requires only 50 MB because it can compress most of the memory references.

Of course, sophisticated and adaptive sampling techniques could mitigate the overhead problem. To the best of our knowledge, however, no advanced sampling technique has been devised that reduces the overhead while maintaining correctness, and most sampling techniques used in TLDS are not suitable for our purpose.

8 RELATED WORK

    8.1 Dynamic Data-Dependence Analysis

One of the early works that used data-dependence profiling to help parallelization is Larus' parallelism analyzer pp [16]. Its profiling algorithm is similar to the evaluated pairwise method and therefore suffers from huge memory and time overheads.


Fig. 19. Similarity of the results from different inputs: 1.00 means all results were identical (not considering the frequencies of dependence pairs).

Fig. 20. 436.cactusADM, ScheduleTraverse.c.

Fig. 21. Memory overhead for Jacobi in OmpSCR with different profiling mechanisms and policies.


Tournavitis et al. [29] proposed a dependence-profiling mechanism to overcome the limitations of automatic parallelization. They also used a pairwise-like method and did not discuss the overhead problem explicitly.

    8.2 Data-Dependence Profiling for Speculation

As discussed in Section 7.5, the concept of dependence profiling has been used for speculative hardware-based optimizations. TLDS compilers speculatively parallelize code sections that do not have much data dependence. Several methods have been proposed [3], [7], [19], [26], [33], and many of them employ sampling or allow aliasing to reduce overhead. None of these approaches needs to give results as accurate as SD3's, since they assume speculative hardware.

    8.3 Reducing Overhead of Dynamic Analysis

Shadow Profiling [22], SuperPin [32], PiPA [36], and Ha et al. [10] employed parallelization techniques to reduce the time overhead of instrumentation-based dynamic analyses. Since all of them focus on a generalized framework, they only exploit task-level parallelism by separating instrumentation and dynamic analysis. SD3 further exploits data-level parallelism while reducing the memory overhead at the same time.

A number of techniques that compress dynamic profiling traces, as well as standard compression algorithms like bzip, have been proposed to save memory space [17], [21], [24], [34]. Their common approach is to use specialized data structures and algorithms to compress instructions and memory accesses that show specific patterns. METRIC [21], a tool for finding cache bottlenecks, also exploits stride behavior like SD3. While METRIC only presents the compression algorithm, SD3 introduces a number of algorithms that effectively calculate data dependences with the stride and non-stride formats.

9 APPLICATIONS OF THE SD3 ALGORITHM

In this section, we demonstrate the usefulness of SD3 with our profiling tool, Prospector [13]. Prospector is built on top of SD3; it performs loop and data-dependence profiling and provides guidelines on how programmers can manually parallelize their applications. We demonstrate an actual use of Prospector with SPEC 179.art, summarized in Fig. 22. Although 179.art could be easily parallelized by TLS techniques, production compilers cannot automatically parallelize it, and a typical programmer who does not know the details of the algorithm could easily spend a day or more on parallelization. Table 1 shows the profiling result with the train input. The detailed steps are as follows:

1. Prospector finds that the loop scan_recognize:5 in Fig. 22 is the hottest loop, with 79 percent execution coverage, only one invocation, and 20 iterations. Every iteration has almost an equal number of executed instructions (implying good balance). The loop could be a good candidate.

2. Fortunately, the loop has no loop-carried flow dependences except for a reduction variable. Strictly speaking, there are dependences on induction variables (i.e., loop counters) and reduction variables, but they can be filtered.

3. scan_recognize:5 has many loop-carried output dependences on global variables such as f1_layer and Y. These variables are not automatically privatized to threads, so we perform privatization, explicitly allocating thread-local private copies of these global variables (not shown in Fig. 22).

4. Temporary variables in the scope of the loop at :5 are also found, such as i at line 6. In this code, we need to add i to OpenMP's private list.

5. A potential reduction variable in the scope of the loop at :5, highest_confidence, is identified. The variable is intended to compute a maximum, which is a commutative operation. We manually modify the code so that each thread computes a local result and the final answer is combined afterward (not shown in Fig. 22).
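Putting the steps together, the parallelized loop has roughly the OpenMP structure sketched below. This is a schematic reconstruction from the steps above, not the actual 179.art source: work_on_pattern() is a placeholder for the per-pattern computation (which in the real code operates on the privatized copies of f1_layer and Y), and only names such as highest_confidence come from the description.

    #include <omp.h>
    #include <algorithm>

    double highest_confidence = 0.0;   // global maximum (step 5: reduction variable)

    // Placeholder for the per-pattern work; in 179.art this uses the thread-private
    // copies of globals such as f1_layer and Y (step 3: privatization).
    static double work_on_pattern(int pat) { return static_cast<double>(pat % 7); }

    void scan_recognize_parallel(int num_patterns) {
        double global_best = highest_confidence;
        #pragma omp parallel
        {
            // Step 4: temporaries declared inside the region are private to each thread.
            double local_best = global_best;                 // step 5: per-thread partial max
            #pragma omp for nowait
            for (int pat = 0; pat < num_patterns; ++pat)
                local_best = std::max(local_best, work_on_pattern(pat));
            #pragma omp critical
            global_best = std::max(global_best, local_best); // combine the partial maxima
        }
        highest_confidence = global_best;
    }

The final combine is written out by hand, matching step 5; OpenMP's built-in max reduction for C/C++ only became available in later OpenMP versions.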

Programmers do not need to know the fine details of 179.art. By interpreting Prospector's results, programmers can easily understand the loop and finally parallelize it. It is true that Prospector cannot prove parallelizability. However, when compilers cannot automatically parallelize a loop, programmers must prove its correctness by hand or verify it empirically using exhaustive tests. In such cases, we claim that Prospector and SD3 are very valuable for programmers. Recent commercial tools [5], [11], [30] and research that uses profiling to assist parallelization [8], [25], [27], [28] also support this claim.


TABLE 1
Profiling Result of scan_recognize:5 in 179.art

Fig. 22. Simplified parallelization steps for 179.art on a multicore, guided by Prospector's result: 1) privatize global variables; 2) insert OpenMP pragmas; and 3) add reduction code (not shown here).


10 CONCLUSIONS AND FUTURE WORK

This paper proposed a new, efficient data-dependence profiling technique called SD3. SD3 is the first solution that attacks both the memory overhead and the time overhead of data-dependence profiling at the same time. For the memory overhead, SD3 not only reduces the overhead by compressing memory references that show stride behavior, but also provides a new data-dependence checking algorithm that works directly on the stride format, along with several algorithms for handling the stride data structures. For the time overhead, SD3 parallelizes the data-dependence profiling itself while keeping the effectiveness of the stride compression. SD3 successfully profiles 22 SPEC 2006 benchmarks with the reference inputs. In future work, we will focus on how such an efficient data-dependence profiler can provide advice on parallelizing legacy code, as discussed in Section 9.

    ACKNOWLEDGMENTS

Special thanks to Puyan Lotfi and Hyojong Kim for their support in the implementation of LLVM-based SD3. The authors thank Geoffrey Lowney, Robert Cohn, and John Pieper for their feedback and support; Paul Stravers for providing an experimental result; Pranith D. Kumar and the HPArch members for their comments; and, finally, the anonymous reviewers for the many valuable comments that helped this paper. Minjang Kim, Nagesh B. Lakshminarayana, and Hyesoon Kim are supported by the US National Science Foundation (NSF) CCF-0903447, NSF/SRC task 1981, NSF CAREER award 1139083, Intel Corporation, and Microsoft Research. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of NSF, Microsoft, or Intel.

REFERENCES

[1] OmpSCR: OpenMP Source Code Repository, http://sourceforge.net/projects/ompscr/, 2013.
[2] R.D. Blumofe, C.F. Joerg, B.C. Kuszmaul, C.E. Leiserson, K.H. Randall, and Y. Zhou, Cilk: An Efficient Multithreaded Runtime System, Proc. Fifth ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '95), 1995.
[3] T. Chen, J. Lin, X. Dai, W.-C. Hsu, and P.-C. Yew, Data Dependence Profiling for Speculative Optimizations, Compiler Construction, vol. 2985, pp. 57-72, 2004.
[4] T.H. Cormen, C.E. Leiserson, R.L. Rivest, and C. Stein, Introduction to Algorithms, third ed. The MIT Press, 2009.
[5] CriticalBlue, Prism: An Analysis Exploration and Verification Environment for Software Implementation and Optimization on Multicore Architectures, http://www.criticalblue.com, 2013.
[6] D. Das and P. Wu, Experiences of Using a Dependence Profiler to Assist Parallelization for Multi-Cores, Proc. IEEE Int'l Symp. Parallel and Distributed Processing (IPDPS) Workshops, 2010.
[7] Z.-H. Du, C.-C. Lim, X.-F. Li, C. Yang, Q. Zhao, and T.-F. Ngai, A Cost-Driven Compilation Framework for Speculative Parallelization of Sequential Programs, Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '04), 2004.
[8] S. Garcia, D. Jeon, C.M. Louie, and M.B. Taylor, Kremlin: Rethinking and Rebooting Gprof for the Multicore Age, Proc. 32nd ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '11), 2011.
[9] W. Gropp, E. Lusk, and A. Skjellum, Using MPI: Portable Parallel Programming with the Message-Passing Interface. MIT Press, 1994.
[10] J. Ha, M. Arnold, S.M. Blackburn, and K.S. McKinley, A Concurrent Dynamic Analysis Framework for Multicore Hardware, Proc. 24th ACM SIGPLAN Conf. Object Oriented Programming Systems Languages and Applications (OOPSLA '09), 2009.
[11] Intel Corporation, Intel Parallel Advisor, http://software.intel.com/en-us/articles/intel-parallel-advisor/, 2013.
[12] Intel Corporation, Intel Threading Building Blocks, http://www.threadingbuildingblocks.org/, 2013.
[13] M. Kim, H. Kim, and C.-K. Luk, Prospector: Helping Parallel Programming by a Data-Dependence Profile, Proc. Second USENIX Conf. Hot Topics in Parallelism (HotPar '10), 2010.
[14] M. Kim, H. Kim, and C.-K. Luk, SD3: A Scalable Approach to Dynamic Data-Dependence Profiling, Proc. IEEE/ACM 43rd Ann. Int'l Symp. Microarchitecture (MICRO 43), 2010.
[15] X. Kong, D. Klappholz, and K. Psarris, The I Test: An Improved Dependence Test for Automatic Parallelization and Vectorization, IEEE Trans. Parallel and Distributed Systems, vol. 2, no. 3, pp. 342-349, July 1991.
[16] J.R. Larus, Loop-Level Parallelism in Numeric and Symbolic Programs, IEEE Trans. Parallel and Distributed Systems, vol. 4, no. 7, pp. 812-826, July 1993.
[17] J.R. Larus, Whole Program Paths, Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '99), 1999.
[18] C. Lattner and V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, Proc. Int'l Symp. Code Generation and Optimization (CGO '04), 2004.
[19] W. Liu, J. Tuck, L. Ceze, W. Ahn, K. Strauss, J. Renau, and J. Torrellas, POSH: A TLS Compiler That Exploits Program Structure, Proc. 11th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '06), 2006.
[20] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V.J. Reddi, and K. Hazelwood, Pin: Building Customized Program Analysis Tools with Dynamic Instrumentation, Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '05), 2005.
[21] J. Marathe, F. Mueller, T. Mohan, S.A. Mckee, B.R. De Supinski, and A. Yoo, METRIC: Memory Tracing via Dynamic Binary Rewriting to Identify Cache Inefficiencies, ACM Trans. Programming Languages and Systems, vol. 29, Apr. 2007.
[22] T. Moseley, A. Shye, V.J. Reddi, D. Grunwald, and R. Peri, Shadow Profiling: Hiding Instrumentation Costs with Parallelism, Proc. Int'l Symp. Code Generation and Optimization (CGO '07), 2007.
[23] S.S. Muchnick, Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers Inc., 1997.
[24] G.D. Price, J. Giacomoni, and M. Vachharajani, Visualizing Potential Parallelism in Sequential Programs, Proc. 17th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '08), 2008.
[25] S. Rul, H. Vandierendonck, and K. De Bosschere, A Profile-Based Tool for Finding Pipeline Parallelism in Sequential Programs, Parallel Computing, vol. 36, pp. 531-551, Sept. 2010.
[26] J.G. Steffan, C.B. Colohan, A. Zhai, and T.C. Mowry, A Scalable Approach to Thread-Level Speculation, Proc. 27th Ann. Int'l Symp. Computer Architecture (ISCA '00), 2000.
[27] W. Thies, V. Chandrasekhar, and S. Amarasinghe, A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs, Proc. IEEE/ACM 40th Ann. Int'l Symp. Microarchitecture, 2007.
[28] G. Tournavitis and B. Franke, Semi-Automatic Extraction and Exploitation of Hierarchical Pipeline Parallelism Using Profiling Information, Proc. 19th Int'l Conf. Parallel Architectures and Compilation Techniques (PACT '10), 2010.
[29] G. Tournavitis, Z. Wang, B. Franke, and M.F. O'Boyle, Towards a Holistic Approach to Auto-Parallelization: Integrating Profile-Driven Parallelism Detection and Machine-Learning Based Mapping, Proc. ACM SIGPLAN Conf. Programming Language Design and Implementation (PLDI '09), 2009.
[30] Vector Fabrics, Pareon: Optimize Applications for Multicore Phones, Tablets and x86, http://www.vectorfabrics.com/.
[31] C. von Praun, R. Bordawekar, and C. Cascaval, Modeling Optimistic Concurrency Using Quantitative Dependence Analysis, Proc. 13th ACM SIGPLAN Symp. Principles and Practice of Parallel Programming (PPoPP '08), 2008.
[32] S. Wallace and K. Hazelwood, SuperPin: Parallelizing Dynamic Instrumentation for Real-Time Performance, Proc. Int'l Symp. Code Generation and Optimization (CGO '07), 2007.
[33] P. Wu, A. Kejariwal, and C. Cascaval, Languages and Compilers for Parallel Computing, chapter: Compiler-Driven Dependence Profiling to Guide Program Parallelization, 2008.


[34] X. Zhang and R. Gupta, Whole Execution Traces, Proc. IEEE/ACM 37th Ann. Int'l Symp. Microarchitecture, 2004.
[35] X. Zhang, A. Navabi, and S. Jagannathan, Alchemist: A Transparent Dependence Distance Profiling Infrastructure, Proc. IEEE/ACM Seventh Ann. Int'l Symp. Code Generation and Optimization (CGO '09), 2009.
[36] Q. Zhao, I. Cutcutache, and W.-F. Wong, PiPA: Pipelined Profiling and Analysis on Multi-Core Systems, Proc. IEEE/ACM Sixth Ann. Int'l Symp. Code Generation and Optimization (CGO '08), 2008.

Minjang Kim received the BS degrees in computer science and in naval architecture and ocean engineering, both from Seoul National University, and the MS degree in computer science from Georgia Tech in 2008. He is working toward the PhD degree in the College of Computing at Georgia Tech. His research interests include parallel programming tools and compilers.

Nagesh B. Lakshminarayana received the bachelor's degree in computer science and engineering from RV College of Engineering, India, in 2003 and the master's degree in computer science from Georgia Tech in 2009. He is working toward the PhD degree in the College of Computing at Georgia Tech. His research interests include the field of computer architecture.

Hyesoon Kim received the BA degree in mechanical engineering from Korea Advanced Institute of Science and Technology, the MS degree in mechanical engineering from Seoul National University, and the MS and PhD degrees in computer engineering from The University of Texas at Austin. She is an assistant professor in the School of Computer Science at Georgia Institute of Technology. Her research interests include high-performance, energy-efficient heterogeneous architectures and programmer-compiler-microarchitecture interaction.

Chi-Keung Luk received the PhD degree in computer science from the University of Toronto. He is a senior staff engineer at Intel. His research interests include parallel programming tools and techniques, compilers, and virtualization.

