simulee: detecting cuda synchronization bugs via memory ...lxz144130/publications/icse2020b.pdf ·...

Simulee Detecting CUDA Synchronization Bugs viaMemory-Access Modeling

Mingyuan WuSouthern University of Science and

TechnologyShenzhen China

11849319mailsustecheducn

Yicheng OuyangSouthern University of Science and


11610313mailsustecheducn

Husheng ZhouUniversity of Texas at Dallas

Dallas USAhushengzhouutdallasedu

Lingming ZhangUniversity of Texas at Dallas

Dallas USAlingmingzhangutdallasedu

Cong LiuUniversity of Texas at Dallas

Dallas USAcongutdallasedu

Yuqun ZhangSouthern University of Science and


zhangyqsustecheducn

ABSTRACTWhile CUDA has become a mainstream parallel computing plat-form and programming model for general-purpose GPU computinghow to effectively and efficiently detect CUDA synchronizationbugs remains a challenging open problem In this paper we pro-pose the first lightweight CUDA synchronization bug detectionframework namely Simulee to model CUDA program executionby interpreting the corresponding LLVM bytecode and collectingthe memory-access information for automatically detecting gen-eral CUDA synchronization bugs To evaluate the effectiveness andefficiency of Simulee we construct a benchmark with 7 popularCUDA-related projects from GitHub upon which we conduct anextensive set of experiments The experimental results suggest thatSimulee can detect 21 out of the 24 manually identified bugs in ourpreliminary study and also 24 previously unknown bugs among allprojects 10 of which have already been confirmed by the develop-ers Furthermore Simulee significantly outperforms state-of-the-artapproaches for CUDA synchronization bug detection

ACM Reference FormatMingyuanWu YichengOuyang Husheng Zhou Lingming Zhang Cong Liuand Yuqun Zhang 2020 Simulee Detecting CUDA Synchronization Bugsvia Memory-Access Modeling In 42nd International Conference on SoftwareEngineering (ICSE rsquo20) May 23ndash29 2020 Seoul Republic of Korea ACM NewYork NY USA 12 pages httpsdoiorg10114533778113380358

Yuqun Zhang is the corresponding author

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page Copyrights for components of this work owned by others than ACMmust be honored Abstracting with credit is permitted To copy otherwise or republishto post on servers or to redistribute to lists requires prior specific permission andor afee Request permissions from permissionsacmorgICSE rsquo20 May 23ndash29 2020 Seoul Republic of Koreacopy 2020 Association for Computing MachineryACM ISBN 978-1-4503-7121-62005 $1500httpsdoiorg10114533778113380358

1 INTRODUCTIONCUDA [3] is a mainstream parallel computing platform and pro-grammingmodel that allows software developers to leverage general-purpose GPU (GPGPU) computing [4] CUDA is advanced in sim-plifying IO streams to memories and dividing computations intosub-computations since it parallelizes programs in terms of gridsand blocks In addition CUDA enables more flexible cache manage-ment that speeds up the floating point computation of CPUs CUDAis thus considered rather powerful for accelerating deep-neural-network-related applications where relevant matrix computationscan be efficiently loaded

Unlike traditional CPU multi-thread programs where the syn-chronization mechanism is built upon managing the shared re-source that can be accessed by multiple threads the typical syn-chronization mechanism of GPU programs is built upon synchro-nizing the instruction flows [1] In particular since GPU programsuse barriers rather than locks for synchronization and enable sim-plified barrier-based happens-before relations traditional lockset-based [13 38] and happens-before-based bug detection approaches[18 34] for CPUs become obsolete and expensive in detectingparallel-computing-related bugs for GPUs

Recently several approaches eg [6 9 20 30 31 37 50 51] havebeen proposed to detect synchronization bugs for CUDA kernelfunctions In general they can be categorized into two classes (1)the compiler-based approaches that leverage compiler instrumenta-tion withwithout static analysis to identify race-free locations fordetecting data-race bugs [6 20 37 50 51] and (2) the SMT-solver-based approaches that integrate static analysis and symbolic exe-cution with SMT solver to detect data race and barrier-divergencebugs [9 30 31] Although such approaches can detect CUDA syn-chronization bugs under their respectively defined environmentsthey have limited applicability and may rely on certain assumptionsIn particular the compiler-based approaches are limited due to re-lying on developer-provided test inputs and detecting data racesonly while the SMT-solver-based approaches can be heavyweightdue to triggering expensive static-analysis overhead and may alsoinvolve manually-provided test inputs

In this paper to address the aforementioned issues we develop asystematic lightweight bug detection framework namely Simulee [7]

937

2020 IEEEACM 42nd International Conference on Software Engineering (ICSE)

ICSE rsquo20 May 23ndash29 2020 Seoul Republic of Korea Mingyuan Wu Yicheng Ouyang Husheng Zhou Lingming Zhang Cong Liu and Yuqun Zhang

which automatically detects general synchronization bugs for CUDAkernel functions In particular Simulee generates a Memory-AccessModel that maintains thread-wise memory-access information in-cluding thread id visit order and action Accordingly Simulee uti-lizes the LLVM bytecode of CUDA kernel functions to initialize therunning environmental setups including arguments dimensionsand global memory if necessary based on theMemory-Access ModelMoreover Simulee applies Evolutionary Programming [21] to ap-proach error-inducing inputs for exposing the dangerous programexecution paths which may possibly induce synchronization bugsSubsequently Simulee collects the corresponding memory-accessinformation from such paths At last by combining CUDA specificssuch collected memory-access information are analyzed to findwhether they can potentially lead to synchronization bugs

Compared with other CUDA synchronization bug detection ap-proaches Simulee can detect multiple bug types including data raceredundant barrier function and barrier divergence fully automati-cally Moreover Simulee benefits from exploring the dangerous exe-cution paths guided by the lightweight evolutionary search withoutincurring large overhead (eg for constraint solving) such that it ismore efficient than existing state-of-the-art approaches [9 30 31]that usually rely on heavyweight techniques or manual inputs

To evaluate the effectiveness and efficiency of Simulee we firstcollect 7 popular CUDA-related projects from GitHub (with a totalof 17928 commits and 136 million LOC by Aug 2019) as our bench-mark suite Then as a preliminary study we manually identified24 real-world synchronization bugs from a subset of the studiedreal-world CUDA projects and the GKLEE [30] benchmark suiteWe conduct a set of experiments to explore (1) how many of thosemanually identified bugs can be detected by Simulee for the prelim-inary analysis (2) whether Simulee can detect previously-unknownbugs on those real-world CUDA projects and (3) how Simuleecompares with state-of-the-art approaches in CUDA bug detectionThe experimental results suggest that Simulee can successfully de-tect 21 of the 24 manually identified synchronization bugs in ourpreliminary study Furthermore it even detects 24 previously un-known bugs from all the 7 real-world CUDA projects 10 of whichhave already been confirmed by the corresponding developers Theexperimental results also demonstrate that Simulee can be muchmore effective than state-of-the-art approaches ie GKLEE [30]GKLEE-SESA [31] GPUVerify [9] and RaceChecker [6] in detectingsynchronization bugs eg Simulee can detect 2X more previouslyunknown bugs than any of the other studied state-of-the-art ap-proaches In addition Simulee can significantly outperform theother studied approaches in terms of the runtime overhead

In summary our paper makes the following contributions

bull Idea To the best of our knowledge we propose the first ideaof CUDA synchronization bug detection via evolutionarysearch guided by memory-access modelingbull ImplementationWe have implemented the proposed ideaas a lightweight fully automated and general-purpose de-tection framework for CUDA synchronization bugs namelySimulee that can automatically detect a wide range of syn-chronization bugs in CUDA programs which are hard to becaptured manually

Block( 0 0 )

Block( 1 0 )

Grid

Block( 0 1 )

Block( 0 2 )

Block( 1 1 )

Block( 1 2 )

Thread( 0 0 )

Thread( 1 0 )

Thread( 2 0 )

Thread( 0 1 )

Thread( 1 1 )

Thread( 2 1 )

Thread( 0 2 )

Thread( 1 2 )

Thread( 2 2 )

Block ( 1 0 )

Figure 1 CUDA Hierarchy

bull Study We evaluate Simulee under multiple experimentalsetups The results suggest that Simulee is able to detectmost of the manually identified synchronization bugs inthe benchmark In addition it even detects 24 previously-unknown bugs of the entire benchmark fully automaticallyand significantly outperforms state-of-the-art approaches

2 BACKGROUNDIn this section we give an overview on CUDA the CUDA parallelcomputing mechanism and typical CUDA synchronization bugsCUDAOverview CUDA is a parallel computing platform and pro-grammingmodel which allows developers to use GPU hardware forgeneral-purpose computing eg autonoumos driving [48 52 53]CUDA is composed of a runtime library and an extended version ofCC++ In particular CUDA programs are executed on GPU coresnamely ldquodevicerdquo while they also need to be allocated with resourceson CPUs namely ldquohostrdquo prior to execution As a result developersneed to retrieve allocated resources such as global memory afterCUDA program execution To conclude a complete CUDA programcontains 3 runtime stages (1) host resource preparation (2) kernelfunction execution and (3) host resource retrieveCUDA Parallel Computing Mechanism A CUDA kernel func-tion refers to the part of CUDA programs that runs on the deviceside Specifically thread is the kernel functionrsquos basic executionunit At the physical level 32 threads are bundled as a thread warpwherein all the threads execute the same statement at any timeexcept undergoing a branch divergence while at the logic level oneor more threads are contained in a block and one or more blocksare contained in a grid The execution of kernel functions is ini-tialized by setting runtime environments eg dimensions of gridsand blocks passing relevant arguments such that the computationcan be divided into sub-computations and each sub-computationcan be dispatched to different threads Eventually the results ofsub-computations can be merged as the final result of the overallcomputation through applying algorithms such as reduction [35]The hierarchy of the parallel computing mechanism of CUDA ker-nel functions is presented in Figure 1

To synchronize threads CUDA applies barriers at which all thethreads in one block must wait before any can proceed In CUDAkernel functions the barrier function is ldquo__syncthreads()rdquo whichsynchronizes threads from the same block When a thread reachesa barrier it is expected to proceed to next statement if and only ifall the threads from the same block have reached the same barrierOtherwise the program would be exposed to undefined behaviors

938

Simulee Detecting CUDA Synchronization Bugs via Memory-Access Modeling ICSE rsquo20 May 23ndash29 2020 Seoul Republic of Korea

1 tid = threadIdxx

2

3 if (y gt 0 ampamp a lt C)

4 f_val2reduce[tid] = f

5 else

6 f_val2reduce[tid] = INFINITY

+7 __syncthreads () fix by adding syncthreads

8 get_block_min will write data to f_val2reduce

9 int ip = get_block_min(f_val2reduce f_idx2reduce)

10 float up_value_p = f_val2reduce[ip]

Figure 2 An Example of Data Race

1 const unsigned tid = threadIdxx

2 s_median[tid] = FLT_MAX

3 s_idx[tid] = 0

-4 __syncthreads ()

5 if (i lt iterations)

6

7 s_idx[tid] = i

8 s_median[tid] = m

9

Figure 3 An Example of Redundant Barrier

CUDA Synchronization Bugs There are three major synchro-nization bugs in CUDA kernel functions data race barrier di-vergence [14] and redundant barrier function [1] Specificallydata race indicates that for accessing global or shared memoryCUDA cannot guarantee the visit order of ldquoreadampwriterdquo actions orldquowriteampwriterdquo actions from two or more threads For example Fig-ure 2 demonstrates the bug-fixing Revision no ldquofebf515a82rdquo in thefile ldquosmo-kernelcurdquo of the project ldquothundersvmrdquo [42] one of thehighly-rated GitHub projects It can be observed from Figure 2 thatthe ldquoifrdquo statement writes to the memory of ldquof_val2reducerdquo whileinside the device the function ldquoget_block_minrdquo writes to the samememory This ldquowriteampwriterdquo bug is fixed by adding ldquo__syncthreadsrdquowhich synchronizes actions among threads

A barrier function is considered redundant when there is no datarace after deleting it from source code A redundant barrier func-tion compromises the program performance in terms of time andmemory usage For instance Figure 3 demonstrates bug-fixing Re-vision no ldquo31761d27f01rdquo in file ldquokernelhomographyhpprdquo fromproject ldquoarrayfirerdquo [8] It can be observed that the block is one-dimensional from Line 1 the value of ldquotidrdquo is assigned only byldquothreadIdxxrdquo That indicates that the ldquotidrdquos are identical amongdifferent threads from the same block As a result ldquos_median[tid]rdquoand ldquos_idx[tid]rdquo can only be accessed by one thread leading toa redundant barrier function in Line 4 because there is no race inldquos_medianrdquo or ldquos_idxrdquo after deleting it

A barrier divergence takes place when some threads in a blockcomplete their tasks and leave the barrier while the others havenot reached the barrier yet Figure 4 demonstrates the bug-fixingRevision no ldquo0ed6cccc5ffrdquo in the file ldquonearest_neighbourhpprdquofrom the project ldquoarrayfirerdquo caused by barrier divergence It canbe indicated from Figure 4 that developers make sure all the threadsin the same block reach the same barrier in every execution of thekernel function by moving the statement of ldquo__syncthreads()rdquo

1 s_dist[sid] = dist

2 s_idx[sid] = s_idx[sid + i]

-3 __syncthreads ()

4

5 fix by moving the barrier out

+6 __syncthreads ()

7

Figure 4 An Example of Barrier Divergence

outside the given branch Otherwise they will have to handle unde-fined behaviors

3 FRAMEWORK OF SIMULEEIn this section we introduce Simulee a lightweight automatic anddevice-independent framework to detect real-world CUDA synchro-nization bugs Typically Simulee takes LLVM bytecode translatedfrom CUDA kernel function programs as input Then it automati-cally generates the associated error-inducing test inputs and yieldsMemory-Access Model to detect synchronization bugs SpecificallySimulee is composed of two componentsmdashldquoAutomatic Input Gener-ationrdquo and ldquoBug Detection via Memory-Access Modelrdquo ldquoAutomaticInput Generation is initialized by inputting the LLVM bytecodeof CUDA kernel function programs Next it slices the memory-access statements (eg read and write statements) and inputs themfor Evolutionary Programming [21] Subsequently EvolutionaryProgramming helps generate error-inducing environmental setupsby iteratively mutating and sorting dimensionsarguments andpasses the acceptable ones to ldquoBug Detection via Memory-AccessModelrdquo At last ldquoBug Detection via Memory-Access Modelrdquo tracesreal execution paths by using the error-inducing inputs and collectsthe memory-access information from the paths to detect whetherthere are synchronization bugs as it were ldquosimulatingrdquo runtimeenvironment The details can be found in Figure 5

LLVM bytecode of kernel functions

Slicing original

statements

Mutating arguments

Mutating dimensions

Initializing solutions

Sorting solutions

Top-ranksolution

acceptable

Iterating

Memory model

construction

Memory-model-based

detection

Results

Automatic input generation (Evolutionary Programming)

Memory-based synchronization bug detection

Figure 5 Framework of Simulee

31 Automatic Input GenerationGenerating potentially error-inducing inputs is essentially equiva-lent to generating the inputs that can lead to the memory-access

939


conflicts among threads to improve the possibility of CUDA syn-chronization bug occurrences However how to automatically gen-erate such error-inducing inputs remains challenging Intuitivesolutions eg random generation coverage-oriented generationcan be limited in effectiveness and efficiency because they are notspecially designed for triggering memory-access conflicts In thissection we introduce how Simulee automatically generates poten-tially error-inducing inputs for exposing the dangerous programexecution paths which could lead to synchronization bugs in aneffective and efficient manner

311 Intuition An effective and efficient automatic approach togenerate potentially error-inducing inputs for triggering CUDAsynchronization bugs implies generating as many memory-accessconflicts as possible within a short time limit Given the ith mem-ory address and the kernel function inputs ie grid and blockdimensions and arguments f (i) is defined as the number of threadsthat access the ith memory address while g(i) is a function thatreturns 1 when the ith memory address is accessed by any threadand returns 0 otherwise [start end] denotes the memory-accessrange An intuitive target function F (dimensionsarдuments) canbe presented in Equation 1 which denotes the ratio of the totalnumber of the accessed memory addresses to the total number ofthe memory-access threads

F (dimensions arдuments) =

endsumi=star t

д(i)

endsumi=star t

f (i)

(1)

It can be derived that the max value of F (dimensionsarдuments)is 1 which denotes that there is no memory-access conflict betweenany thread pair On the other hand the smaller F (dimensionsarдuments)is the higher chance the memory-access conflict takes place There-fore F (dimensionsarдuments) can be used for optimization to ob-tain error-inducing inputs that trigger CUDA synchronization bugsNote that since F (dimensionsarдuments) is discrete we chooseEvolutionary Programming [44] as our optimization approach

312 Algorithm The framework of ldquoAutomatic Input Generationrdquois presented in Algorithm 1 First Simulee randomly initializesarguments and dimensions to create and sort individual solutionsfor evolving (Lines 3 to 7) In each generation each solution ismutated to generate two children which are added to the wholepopulation set (Lines 8 to 14) Next the population winners survivefor the subsequent iterations (Lines 15 to 16) The iterations can beterminated once it finds an acceptable solution Otherwise aftercompleting the iterations it returns the optimal solution

Initial Solutions The initial dimensions and arguments are ran-domly generated and passed to fitness functions as initial solutionsfor future evolution Note that the dimensions can be extracted fromkernel functions For instance if a kernel function has ldquothreadIdxxrdquoand ldquothreadIdxyrdquo it means the block is two-dimensional

Fitness Function Equation 1 is chosen as the primary fitnessfunction for Evolutionary Programming Specifically the outputof F (dimensionsarдuments) is the fitness score for a solution ofdimensions and arguments in Evolutionary Programming How-ever it is difficult to derive an optimal solution of dimensions and

Algorithm 1 Framework for Automatic Input GenerationInput population generationOutput acceptable arguments and dimensions

1 function EVOLUTION_ALGORITHM2 populationLstlarr list()3 for i in population do4 singleSolutionlarr InitialSolution()5 singleScorelarr fitness(singleSolution)6 populationLstappend([singleSolution singleScore])7 sortByScore(populationLst)8 for i in generation do9 childLstlarr list()10 for solution in populationLst do11 childrenSolutionslarr mutation(solution)12 newScoreslarr fitness(childrenSolutions)13 childLstappend([childrenSolutions newScores])14 populationLstmerge(childLst)15 sortByScore(populationLst)16 populationLstlarr populationLst[population]17 if populationLst[0] acceptable then18 return populationLst19 return populationLst

arguments by only optimizing F (dimensionsarдuments) In partic-ular since F (dimensionsarдuments) is non-differentiable whenthe gradient does not exist it is hard to find an optimal solu-tion given the set of inferior solutions eg all the solutions ofF (dimensionsarдuments) are ldquo1rdquos To address such issues we de-sign a secondary fitness function such that they are sorted accordingto their possibility to be optimal R(start end) = end minus start Inparticular it indicates that a smaller memory-access range leadsto a higher possibility of memory-access conflict As a result wedefine fitness score of the primary fitness function as primary scoreand the fitness score of the secondary fitness function as secondaryscore During the population evaluation the primary score is sortedfirst if and only if the top-ranked primary score is 1 the secondaryscore is sorted to decide which solution is more likely to convergeto the minimum of F (dimensionsarдuments)

Mutation In Simulee solutions are generated by mutationwhere each solution generates two children in one generationSpecifically arguments and dimensions are independent from eachother during mutation with respective mutation strategies Themutate strategy for dimensions is trivial first Simulee randomlygenerates an integer vector ranging from -1 to 1 according to thedimension size next the childrsquos dimension is mutated by summingthe parentrsquos dimension and the generated integer vector

The details of the mutation strategy for arguments is presented inAlgorithm 2 Since the memory-access-relevant arguments are num-bers Simulee views them as floating numbers and converts themback to their actual types when executing f (i) Accordingly eachgeneration generates two children one adds a random number gen-

erated by standard Normal Distribution [5] (N (x) = 1radic2π

eminusx 2

2)

to the arguments inherited from the parent solution and the otheradds a random number generated by standard Cauchy Distribu-tion [2] (C(x) = 1

π (1+x 2)) to the arguments inherited from the par-

ent solution We define the search step length of the arguments as

940


Algorithm 2Mutating ArgumentsInput parentOutput mutation children solutions

1 function ARGUMENT_MUTATION2 normalSolutionlarr copy(parent)3 cauchySolutionlarr copy(parent)4 for argument in parent do5 normalSolution[argument]larr parent[argument] + normal()6 cauchySolution[argument]larr parent[argument] + cauchy()7 return normalSolution cauchySolution

the absolute value of the number generated from the two aforemen-tioned distributions with the expected values shown in Equations2 and 3

Enormal (x ) =int infin0

x1radic2π

eminusx 2

2dx = 0399 (2)

Ecauchy (x ) =int infin0

x1

π (1 + x 2)dx = +infin (3)

We next explain why we apply the above two distributions It canbe observed from Equations 2 and 3 that the step length generatedfrom standard normal distribution is expected to be small Thatindicates that if there is an optimal solution nearby the generatedchild is likely to approach it On the contrary the step length gener-ated from standard cauchy distribution is expected to be large Thatindicates that if there is an inferior solution nearby the generatedchild is likely to escape from it

Acceptable Function The acceptable function is used to termi-nate the whole process given an acceptable solution In this paperthe acceptable solution is defined as that primary score is smallerthan 03

In summary by applying Evolutionary Programming Simulee isexpected to deliver error-inducing grid and block dimensions andarguments that lead to memory-access conflicts and trigger CUDAsynchronization bugs

32 Memory-based Synchronization BugDetection

With the auto-generated error-inducing inputs the synchronizationbug detection of Simulee is established on building aMemory-AccessModel that depicts thread-wise memory-access instances Basedon the Memory-Access Model Simulee develops a set of criteria todetect synchronization bugs including data race redundant barrierfunctions and barrier divergence

321 Memory-Access Model The Memory-Access Model accessedby the kernel functions is defined to be composed of a set ofMemoryUnits where each Memory Unit corresponds to a memory addressand is composed of a set of Unit Tuples A Unit Tuple is defined asa three-dimensional vector space ltvisit_order thread_id actiongtwhere visit_order represents the visit order to the associated mem-ory address from different threads thread_id represents the indicesof such threads and action refers to the read or write action fromthose threads

An example of Memory Unit is demonstrated in Figure 6 withfour Unit Tuples lt0 (1 0 0) readgt lt0 (2 0 0) readgt lt0 (3 0 0)readgt and lt1 (3 0 0) readgtwhere threads (100) (200) and (300)

lt 0 (1 0 0) read gt




Memory Unit

Barrier Function

Figure 6Memory Unit Example

read the same memory address in the same visit_order since none ofthem have reached any barrier function before they read Assumeall the threads reach a barrier function later and thread (3 0 0) readsthe visit_order is then incremented from 0 to 1 for thread (3 0 0)and the other threads afterwards

322 Memory Accessing Model Construction SinceMemory-AccessModel is only associated with barrier functions and memory-accessstatements it is applicable to detect synchronization bugs by ob-taining such statements and then extractinganalyzing the memory-access information instead of executing the complete CUDA pro-grams on GPUs ie modeling the execution of CUDA kernel func-tion programs This modeling process is initiated by inputting theauto-generated block and grid dimensions and arguments passedto the kernel functions Next it constructs the Memory Unit foreach memory address by tracing back the execution path for eachthread

The overall Memory-Access Model construction is demonstratedin Algorithm 3 In particular the algorithm is launched to initializethe block and grid dimensions as well as the global and sharedmemory for each thread (Lines 2 to 5) Next for each block theshared memory (Line 7) and the thread-wise visit_order for eachglobal and shared memory address (Lines 8 to 9) are initializedIf there are still some unterminated threads for all of them theircorresponding Memory Units are derived based on the collectedparameters eg globalMem and visitOrderGlobal (Lines 10 to14) The construction of the thread-wise Memory Units for sharedmemory and global memory are completed if there is no runningthread left (Lines 15 to 16)

Algorithm 4 illustrates the details of Memory-Access Model con-struction for a single thread Specifically given a running threadand the parameters passed by Algorithm 3 (Lines 2 to 4) Algorithm4 is initialized by detecting whether the current statement is the endof file If so the thread would be terminated If there is any threadhalting afterwards we can confirm there is a ldquobarrier divergencerdquobug because that indicates at least a thread has not reached thebarrier function where the other threads of the same block all havecompleted their tasks and left (Lines 5 to 9)

If the current statement calls barrier function and all the otherthreads have reached the same barrier function the visit_order forboth global and shared memory would be incremented if they havebeen visited before (Lines 10 to 13) since it indicates that (1) allthe threads in one block have visited the current memory addressand (2) the subsequent visits in the same block would be maderigorously later than the previous visits On the other hand if the

941


Algorithm 3 Memory-Access Model constructionInput gridDim blockDim argumentsOutput Memory-Access Model

1 function CONSTRUCT_MEMORY_MODEL2 BLOCKSlarr generateFromDimension(gridDim)3 THREADSlarr generateFromDimension(blockDim)4 globalMemlarr [MemoryUnit() for i in range(globalSize)]5 sharedMemLstlarr list()6 for blk in BLOCKS do7 sharedMemlarr [MemoryUnit() for i in range(sharedSize)]8 visitOrderGloballarr [0 for i in range(globalSize)]9 visitOrderSharedlarr [0 for i in range(sharedSize)]10 while hasUnterminatedThread() do11 for t in THREADS do12 envlarr Environment(arguments)13 PROCESS_THREAD(t globalMem sharedMem14 visitOrderGlobal visitOrderShared env)15 sharedMemLstappend(sharedMem)16 return globalMem sharedMemLst

current statement does not call barrier function the correspondingvisit_order and the action of the associated thread is recorded toconstruct the Memory-Access Model (Lines 14 to 21)

323 Bug Detection via Memory-Access Model Mechanism The de-sign ofMemory-Access Model can be used in Simulee to detect CUDAsynchronization bugs ie data race redundant barrier functionand barrier divergence

Data Race In general parallel computing programs a possibledata race takes place when multiple threads access the identicalmemory address in the same visit order and at least one of themwrites Specifically in CUDA kernel functions besides the genericcircumstances a data race also takes place when (1) the threads arefrom different thread warps or (2) the threads from the same threadwarp underwent branch divergence or (3) the threads from thesame thread warp without undergoing branch divergence write tothe same memory address by the same statement By combining thedata race detection criteria above and the design of Memory-AccessModel Simulee can detect data race in CUDA kernel functions asdescribed in Theorem 31

Theorem 31 Given two Unit Tuplesψi andψj from the identicalMemory Unit a data race between them takes place if the conditionsbelow are met

bull ψi [visit_order] =ψj [visit_order]bull ψi [thread_id] =ψj [thread_id]bull ψi [action] = lsquowritersquo orψj [action] = lsquowritersquo

when the threads ofψi andψj are (1) from different thread warps or(2) executing the ldquowriterdquo action on the same statements in the samethread warp or (3) underwent branch divergence before the currentldquowriterdquo action

Redundant Barrier Function A redundant barrier functionindicates that no data race can be detected by removing that barrierfunction In CUDA kernel functions the visit_order is incrementedfor one Unit Tuple when at least one thread reaches a barrier func-tion In other words two Unit Tuples with adjacent visit_order inoneMemory Unit indicates the presence of a barrier function shown

in Figure 6 Therefore to detect whether a barrier function is redun-dant or not it is essential to collect all the associated Unit Tuplesand analyze whether they together would lead to data race Thebarrier function is defined to be redundant if no data race can bedetected among such Unit Tuples

The details of how to detect data race and redundant barrierfunction based onMemory-Access Model are presented in Algorithm5 For each Memory Unit to detect data race Simulee first groupsthe Unit Tuples with the same visit_order For all the Unit Tuples inone group Simulee checks whether any Unit Tuple has data racewith others according to Theorem 31 (Lines 4 to 16) To detectredundant barrier function of oneMemory Unit Simulee extracts itsvisit_order and groups all the Unit Tuples with adjacent visit_orderto find out whether any data race can take place (Lines 18 to 22)If there is no data race Simulee identifies the associated barrierfunction and increments its recorder by 1 (Lines 23 to 24) At last itchecks whether the total recorder number matches the total numberof the changing visit_order caused by that barrier function whichcan be obtained after constructing the Memory-Access Model Thisbarrier function is redundant if the two numbers are equivalent(Lines 25 to 28)

Barrier Divergence As mentioned in Section 322 barrier di-vergence can be detected during constructingMemory-Access Modelwhen there is any halting thread after the current execution is ter-minated because it indicates that there is at least one thread whichhas not reached the barrier function while the others have alreadyleft

To conclude Simulee first applies Evolutionary Programmingto generate error-inducing grid and block dimensions and argu-ments Next Simulee inputs such dimensions and arguments toconstruct Memory-Access Model that delivers thread-wise memory-access information Eventually such information along with theCUDA synchronization bug detectionmechanism are used to detectwhether there exists any CUDA synchronization bug

4 EVALUATIONIn this section we conduct an extensive experimental study toevaluate the effectiveness and efficiency of Simulee in detectingsynchronization bugs of CUDA kernel functions In particular wefirst perform a preliminary study on 24 manually identified CUDAsynchronization bugs to explore the efficacy of Simulee Next weexplore the capability of Simulee in detecting previously-unknownbugs from all the real-world CUDA projects in our benchmark suiteFurthermore we also compare Simulee with multiple existing state-of-the-art approaches to explore whether Simulee can outperformthem

41 Benchmark ConstructionTo conduct a preliminary study for evaluating Simulee it is essentialto establish a set of CUDA synchronization bugs as the ground truthTo this end we first consider an existing benchmark ie the GKLEEdataset [22] and select 4 synchronization bugs that can representall the basic synchronization bug patterns in the dataset

Furthermore we augment the bug dataset with more real-worldCUDA bugs In this paper we aim to collect important and in-fluential real-world CUDA benchmark projects for our evalua-tion by defining a set of policies for selecting open-source CUDA

942


Algorithm 4 Thread ProcessorInput thread globalMem sharedMem visitOrderGlobal

visitOrderShared envOutputNone or BARRIER_DIVERGENCE

1 function PROCESS_THREAD2 if shouldHalt() or isFinished() then3 return4 curStmtlarr envgetNextInstruction()5 if curStmtisEOF () then6 threadfinish()7 if hasHaltThreads() then8 return BARRIER_DIVERGENCE9 return10 if curStmtisSyncthreads() then11 if all threads reach same barrier then12 updateCurrentVisitOrder(visitOrderShared)13 updateCurrentVisitOrder(visitOrderGlobal)14 else15 isGlobal memIndexlarr simulateExecute(curStmt env)16 if isGlobal then17 indexlarr visitOrderGlobal[memIndex]18 updateMemoryModel(globalMem memIndex index)19 else20 indexlarr visitOrderShared[memIndex]21 updateMemoryModel(sharedMem memIndex index)22 return

Table 1 Subject StatisticsProjects Star Number Commit Number LoCkaldi 6860 8681 364Karrayfire 2791 5314 381Kthundersvm 1014 827 343Kcuda-cnn 368 135 12KcudaSift 123 247 24Kcudpp 268 302 58Kgunrock 550 2422 178K

projects in GitHub Specifically we initialize our CUDA projectcollection by searching the keyword ldquoCUDArdquo and collect morethan 12000 projects from GitHub in the first place Next we sortthese projects in terms of the star number and commit number Werandomly select 7 projects with large starcommit numbers As a re-sult we collect ldquokaldirdquo [29] ldquoarrayfirerdquo [8] ldquothundersvmrdquo [42]ldquocuda-cnnrdquo [54] ldquocudaSiftrdquo [11] ldquocudpprdquo [15] and ldquogunrockrdquo [24]for augmenting our benchmark suite as listed in Table 1

More specifically we randomly select projects ldquoarrayfirerdquo ldquokaldirdquoand ldquothundersvmrdquo from our real-world CUDA benchmark suiteto retrieve their historical synchronization bugs Note that wedo not consider all the projects from our benchmark suite sincethe manual bug retrieval process can be quite time-consumingThe synchronization bugs for those selected projects are identi-fied based on their commit messages and ldquogit diffrdquo results Thespecific operations are listed as follows Following prior study onother types of bugs [49] we first filter the commits and only keepthe commits with the messages that contain at least one keywordin the set ldquofixrdquo ldquoerrorrdquo ldquosyncrdquo to retain the commits that havehigher chances to contain synchronization bugs However the

Algorithm 5 Bug Detection via Memory-Access Model

Input memoryModel changingVisitOrderNumberOutputDATA_RACE REDUNDANT_BARRIERS

1 function EXAMINE_MEMORY_MODEL2 DATA_RACElarr False3 REDUNDANT_BARRIERS = dict()4 for memoryUnit in memoryModel do5 for visitOrder in memoryUnit do6 tupleslarr getTuplesByOrder(visitOrder)7 for thread performing write in tuples do8 otherTslarr getDifferentThreads(thread tuples)9 for t in otherTs do10 if inSameWarp(t thread) then11 if usingSameStmt(t thread) then12 DATA_RACElarr True13 if hasBranchDivergence(t thread) then14 DATA_RACElarr True15 else16 DATA_RACElarr True17 barrierDict = dict()18 for visitOrder in memoryUnit do19 nextOrderlarr visitOrder + 120 currentlarr getTuplesByOrder(visitOrder)21 targetlarr getTuplesByOrder(nextOrder)22 if canMergeWithoutRace(target current) then23 barrierlarr getSplitBarrier(nextOrder memoryUnit)24 barrierDict[barrier] ++25 for barrier in barrierDict do26 REDUNDANT_BARRIERS[barrier]larr27 isRedundant(barrierDict[barrier]28 changingVisitOrderNumber[barrier])29 return DATA_RACE REDUNDANT_BARRIERS

commit messages only with these keywords might not be rele-vant with CUDA bugs Therefore next among the filtered com-mit messages we further filter them according to whether theyhave at least a keyword in the set ldquo__global__rdquo ldquo__device__rdquo ormatch at least one regular expression in the set ldquocudaw+s[(]rdquoldquo[ˆlt]ltltlt[ˆlt]rdquo with its parent nodersquos ldquogit diffrdquo results Toillustrate ldquo__global__rdquo is the modifier of kernel functions andldquo__device__rdquo is the modifier of the device functions that can becalled by kernel functions ldquocudaw+s[(]rdquo is designed in accor-dance with the information that the resource is preparedreleasedin host side beforeafter executing kernel functions For instanceldquocudaMalloc((void ) amphost sizeof(int) 100)rdquo allocatesa global 400-byte memory for kernel functions before executionldquocudaFree(amphost)rdquo releases the allocated memory for kernel func-tions after execution ldquo[ˆlt]ltltlt[ˆlt]rdquo is designed in accordancewith the scenario that sets up the environment for kernel functionseg ldquofunctionltltltgrid_size block_sizegtgtgt(arguments)rdquoAll these regular expressions together deliver the complete life cy-cle of executing kernel functions such that all the bugs of the wholelife cycle can be covered We further manually review all the re-maining commits after the above two rounds of filtering to removeany potential false positive Due to the tedious and time-consumingmanual inspection all the selected CUDA projects are analyzedwithin the most recent 1000 commits or all of them if there are lessthan 1000 commits As a result we collected a total of 20 real-world

943


CUDA bugs By combining with the 4 synchronization bugs fromthe GKLEE benchmark suite we obtain a total of 24 CUDA bugs asthe ground truth for our preliminary study

42 Environment SetupsWe performed our evaluation on a desktop machine with Intel(R)Xeon(R) CPU E5-4610 and 320 GB memory The operating system isUbuntu 1604 For the Evolutionary Programming settings of ldquoAuto-matic Input Generationrdquo in Simulee the population and generationare both set to be 10 by default Note that the Simulee webpage [7]includes more experimental results under different settings for theEvolutionary Programming component

We select state-of-the-art CUDA synchronization bug detectionapproaches ie GPUVerify [9] GKLEE [30] GKLEE-SESA [31] andRaceChecker [6] for the performance comparison with SimuleeMore specifically RaceChecker is the NVIDIArsquos official tool andrepresents state-of-the-art compiler-based approach while the restapproaches represent state-of-the-art SMT-solver-based approachesNote that the timeout for all studied techniques are uniformly setto be 5 hours

43 Result Analysis431 Preliminary Study on Known Bugs We first present the ex-perimental results for detecting the 24 manually identified synchro-nization bugs in our preliminary study in Table 2 In the table DRBD and RB respectively denote data race barrier divergence andredundant barrier function ldquordquo ldquordquo and ldquoFrdquo respectively denotethe successful failed and false-positive bug-detection attempts TOdenotes that the associated bug detection attempt incurs timeoutwhileNA denotes that the buggy kernel function is out of the scopeof the bug-detection capability of the corresponding techniquesNote that each row represents one kernel function which couldinclude multiple bugs (indicated by Column ldquoBug Numrdquo) For suchcases we present all the reported bugs for each kernel functioneg ldquoFFFrdquo denotes that the corresponding technique reports fourbugs in total for the kernel function 3 of which are false positives

From the table we can observe that in terms of the overalleffectiveness Simulee can detect 21 (875) of the the 24 manu-ally identified synchronization bugs within seconds eg Simuleeat most costs 1073 seconds (when applied on kernel functionJacobiSVD) We further analyze the 3 cases for which Simulee failsto detect bugs Simulee fails to detect the data race in kernel func-tion hamming_matcher because this bug can only be exposed underoccasional branch coverage which is out of the scope of the inputgeneration component of Simulee that focuses on the memory-access conflict potentials Simulee fails to detect bugs for kernelfunction _softmax_reduce and _div_rows_vec because they arelargely reimplemented to reduce the usage of unnecessary barrierfunctions while the current version of Simulee is not designed forsuch challenging bugs requiring large code refactoring

We next analyze the comparison results between Simulee andstate-of-the-art GPUVerify GKLEE GKLEE-SESA and RaceCheckerfrom the table in details Note thatGPUVerify requires user-provideddimension settings and the other techniques require the overalluser-provided environmental settings We provide them all with

the ideal settings that can trigger the most possible bugs for faircomparison with SimuleeGPUVerify GPUVerify is designed to detect data-race and barrier-divergence bugs via integrating static analysis with SMT solversFrom the table we can observe that GPUVerify can successfullydetect 17 out of the 24 synchronization bugs

Meanwhile GPUVerify also reports 6 false positives We nextmanually check all such false positives and observe that GPUVer-ify tends to report false positives due to two reasons First thefalse positives are triggered by the bottleneck of the static anal-ysis optimization For instance in kernel functions deadlock_0and computeDescriptor GPUVerify reports false positives due tononexistent execution paths Second its adopted SMT solvers areused to detect data-race bugs caused by thread-wise access conflictshowever the thread warp mechanism adopted in CUDA kernelfunctions can resolve some of such bugs eg the inter-instructionldquoreadampwriterdquo race Shown in Figure 7 warpSize is equal to 32 inRevision 5cc9731af4f of function _trace_mat_mat_trans fromproject kaldi Suppose there are two threads (0 0 0) and (1 00) and the tid of thread (0 0 0) is 0 while the tid of thread (10 0) is 1 Moreover when the loop terminates shift is set to 1Meanwhile for thread (0 0 0) statement ssum[0] += ssum[0 +1] is executed for thread (1 0 0) statement ssum[1] += ssum[1+ 1] is executed In traditional CPU programs since thread (0 00) is reading data from thread ssum[1] while thread (1 0 0) is at-tempting to write data to ssum[1] it can incur a ldquoreadampwriterdquo datarace However in CUDA kernel functions since thread (0 0 0) andthread (1 0 0) are located in the same thread warp without branchdivergence they essentially are executing the same instruction atthe same time Since the statement ssum[tid] += ssum[tid +shift] can be compiled to the following two instructions 1 =load ssum[tid+shift] store 1 ssum[tid] the ldquoreadrdquo actionis executed strictly prior to the ldquowriterdquo action As a result it turns tobe a false-positive data race in CUDA kernel functions We issuedthis case to its corresponding developers who further verified ourfinding as follows

ldquoYou are correct that ssum[0] += ssum[0 + 1] and ssum[1]+= ssum[1 + 1] are executed at the same time But sincewe are now in a warp all the 32 threads (tid=031)are synchronized So reading data from ssum[1] andssum[2] always happens before writing data to ssum[0]and ssum[1] for thread tid=0 and tid=1rdquomdash kaldi

GKLEE and GKLEE-SESA GKLEE and GKLEE-SESA are designedto detect synchronization bugs via integrating concolic executionwith SMT solvers By launching the environmental setups theycollect ldquoreadrdquo and ldquowriterdquo statements into different sets and useSMT solvers to detect synchronization bugs accordingly From ourpreliminary study GKLEE and GKLEE-SESA can detect 16 and 4manually identified bugs respectively without any false positive In-terestinglyGKLEE-SESA incurs more timeouts and detects less bugsthan GKLEE The reason is that GKLEE-SESA leverages more staticanalysis techniques to reduce its dependency on the initial user-provided environmental setups while such techniques are rathertime-consuming Moreover similar to GPUVerify because they areboth SMT-solver-based approaches they are also likely to reportfalse positives on data race (further confirmed by our later study

944


Table 2 Detection Results of the Identified BugsProject Revision Kernel Function Bug Type Bug Num Simulee GPUVerify GKLEE GKLEE-SEAS RaceChecker

Effect Time(s) Effect Time(s) Effect Time(s) Effect Time(s) Effect Time(s)

arrayfire

a7a297ba814 scan_nonfinal_kernel DR 1 101 301 16 TO 078a7a297ba814 scan_dim_nonfinal_kernel DR 1 117 34 089 TO 1170c5a38182b7 hamming_matcher DR 1 656 2353 297 TO 0370c5a38182b7 hamming_matcher_unroll DR 1 401 2493 238 TO 057d7abcf2358e JacobiSVD DR 2 1073 F F F 805 TO TO 121c59116e3ec3 warp_reduce DR 1 134 234 069 TO NAa515b112076 scan_dim_kernel DR 1 206 106 034 TO 0281050816e422 hamming_matcher DR 1 138 12085 2758 TO 047dfbfca5fb77 select_matches BD 1 061 516 052 TO NA0e0c726d7d0 hamming_matcher_unroll BD 1 043 TO 065 TO NAee4d0bd77d7 computeDescriptor BD 1 226 F F 778 184 TO NA0d0d7d1285a warp_reduce BD 1 142 234 069 TO NA31761d27f01 computeMedian RB 1 047 NA NA NA NAfaefa30c3a0 harris_response RB 1 066 NA NA NA NA

GkleeTests

10eb6373d53 device_global DR 1 053 223 156 091 NA10eb6373d53 colonel DR 1 111 212 071 065 NA10eb6373d53 dldeadlock_0 BD 1 161 F 239 061 023 NA10eb6373d53 dldeadlock_2 BD 1 194 21 112 041 NA

kaldibc13196e7fe _add_diag_mat_mat BD 1 938 341 18157 TO NA42352b63e62 _softmax_reduce RB 1 NA NA NA TO NAbb589475b10 _div_rows_vec RB 1 NA NA NA TO NA

thundersvm febf515a826 nu_smo_solve_kernel DR 2 123 TO 193 TO 124Total Detection Result 21 3 0 F 17 7 6 F 16 8 0 F 4 20 0 F 10 14 0 F

Table 3 Detection Results of the Previously Unknown BugsProject Revision Kernel Function Bug Type Bug Num Simulee GPUVerify GKLEE GKLEE-SEAS RaceChecker

Effect Time(s) Effect Time(s) Effect Time(s) Effect Time(s) Effect Time(s)cuda-cnn c843bb2861e g_getCost_3 RB 1 139 NA NA NA NA

cudaSift

a2e57327ddc FindMaxCorr RB 1 F F F 093 NA NA NA NAa2e57327ddc FindMaxCorr1 RB 1 102 NA NA NA NAa2e57327ddc FindMaxCorr2 RB 1 097 NA NA NA NAa2e57327ddc FindMaxCorr3 RB 1 105 NA NA NA NA

cudpp9dc7357ee81 sparseMatrixVectorFetchAndMultiply RB 1 07 NA NA NA NA9dc7357ee81 sparseMatrixVectorSetFlags RB 1 061 NA NA NA NA9dc7357ee81 yGather RB 1 113 NA NA NA NA

gunrock 248a12107ef Join BD 1 121 TO 184 TO NA

kaldi

5cc9731af4f _add_diag_vec_mat DR 2 123 117 187 TO NA5cc9731af4f _copy_low_upp DR 1 27 F 321 F 0572 F 0649 NA5cc9731af4f _copy_upp_low DR 1 079 F 311 F 0598 F 0712 NA5cc9731af4f _splice DR 1 073 131 2361 NA NA5cc9731af4f _copy_from_tp DR 3 096 199 263 TO NA5cc9731af4f _copy_from_mat DR 1 069 211 2059 TO NA5cc9731af4f _sum_reduce DR RB 2 1 014 286 NA 106 NA TO NA 034 NA

thundersvm 05de37f83b6 c_smo_solve_kernel DR 3 311 TO 643 TO 138Total Detection Result 24 0 3 F 11 13 2 F 11 13 2 F 2 22 2 F 3 21 0 F

on detecting new bugs) and fail to detect any redundant-barrier-function bugs because their detecting mechanism is not designedfor such bugs We next discuss the failure cases for the betterGKLEEbecause GKLEE-SESA timed out on most cases GKLEE fails to de-tect barrier-divergence bugs in kernel functions select_matchesand _add_diag_mat_mat because GKLEE models the kernel func-tions over two parametric threads and tends to ignore importantexecution paths for detecting barrier-divergence bugsRaceChecker RaceChecker is a device-dependent tool designedonly for detecting CUDA data-race bugs Specifically it can onlydetect the data-race bugs incurred in shared memory In particu-lar RaceChecker can detect all the 10 data-race bugs that are rele-vant to shared memory but fails to detect other synchronizationbugs including the data-race bugs incurred in global memory fromthe manually identified bugs for our preliminary study Note thatsince RaceChecker does not involve SMT solver it does not reportany false-positive data-race bug as the other SMT-solver-basedapproaches

432 Further Study on Previously Unknown Bugs After our pre-liminary study we further apply Simulee and all the comparedtechniques to detect previously unknown synchronization bugs for

all the 7 projects with the results demonstrated in Table 3 whichfollows the same format as Table 2 From the table we can observethat Simulee detects 24 bugs and reports 3 false positives in totalwhere all the false positives are redundant-barrier-function bugs inkernel function FindMaxCorr Note that Simulee reports such falsepositives because it is possible that ldquoAutomatic Input Generationrdquomay miss some potential error-inducing inputs that can triggersynchronization bugs such that a barrier function can be reportedas a redundant barrier function Meanwhile even the most effectiveexisting technique ie GPUVerify can only detect 11 previouslyunknown bugs with 2 false positives

We have also reported all the detected bugs to the correspondingdevelopers and show their feedback statistics in Table 4 To datethey have confirmed 10 bugs in total (TT) including 1 data-racebugs (DR) 1 barrier-divergence bug (BD) and 8 redundant-barrier-function bugs (RB) To be specific the developers of cudpp andCudaSift responded as follows

ldquoI think yoursquore right There are considerably fasterways to do matrix multiply callsrdquo mdash cudpp

945


Table 4 Developer Feedback StatisticsProjects Detected Confirmed Under Discussion Nonresponse

TT DR RB BD TT DR RB BD TT DR RB BD TT DR RB BDkaldi 12 11 1 0 2 1 1 0 6 6 0 0 4 4 0 0thundersvm 3 3 0 0 0 0 0 0 3 3 0 0 0 0 0 0CudaSift 4 0 4 0 4 0 4 0 0 0 0 0 0 0 0 0CUDA-CNN 1 0 1 0 0 0 0 0 0 0 0 0 1 0 1 0cudpp 3 0 3 0 3 0 3 0 0 0 0 0 0 0 0 0gunrock 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0Total 24 14 9 1 10 1 8 1 9 9 0 0 5 4 1 0

1 const int32_cuda tid = threadIdxy blockDimx

+ threadIdxx

2

3 if (tid lt warpSize)

4 pragma unroll

5 for (int shift = warpSize shift gt 0 shift gtgt= 1)

6 ssum[tid] += ssum[tid + shift]

7

8

9

Figure 7 An Example of False Race

ldquoYes there is a bit of cleaning up to do there Sometimeswhen I detect oddities in the output I add an unneces-sary synchronization just in case In fact those thingsshould be all run on the same thread since it cannot beparallelized anyway Thank you for pointing it outrdquo mdashCudaSift

Since barrier divergence is an undefined behavior it may nothang on every situation The developers of gunrock responded asfollows and further fixed the bug in a later commit [25]

ldquoI do see what Stefanlyy eagleShanf mean for thedivergence issue and surprise the code didnrsquot hangrdquo

In addition 9 data-race bugs are still being actively discussedby developers We label such bugs ldquounder discussionrdquo as stated inTable 4

In summary it can be observed from Section 431 and Sec-tion 432 that Simulee can correctly detect most of the synchroniza-tion bugs while the other approaches are all limited in their detec-tion scopes failing to detect certain bugs that they are designedfor or triggering additional false positives We now summarize thereasons why Simulee can outperform the other approaches Simuleeapplies lightweight evolutionary test generation (guided by effec-tive memory-access modeling) and dynamic runtime monitoring intandem for powerful CUDA synchronization bug detection On theother hand the other approaches are usually bounded by heavy-weight techniques such as constraint solvers which prevent thetechniques from exploring all the possible cases

5 THREATS TO VALIDITYThe threats to external validity mainly lie in the subjects and faultsused in our benchmark Though the projects of our benchmarksuite may not represent the overall project distributions they shallbe selected in order to possibly maximize the overall features ofthe CUDA projects In this way our benchmark is derived basedon real-world programs from GitHub ie we select seven popularCUDA-related projects with a total of 17928 commits and 136million LOC

The threats to internal validity mainly lie in the potential bugs inour implementation due to the complicated mechanism of Simulee

To reduce the threats three graduate students closely mentored bythree SESystems supervisors have been carefully working for overone year We manually reviewed all our implementation code andalso included corresponding tests for verifying our implementationIn addition the effectiveness of ldquoAutomatic Input Generationrdquo canimpact on the performance synchronization bug detection It ispossible that the ldquoAutomatic Input Generationrdquo component maymiss certain inputs that can trigger synchronization bugs such thatSimulee would miss detecting the bug or report a false positive Toreduce this threat we set a large number of suitable parameters forthe evolutionary algorithm adopted in ldquoAutomatic Input Generationrdquoto reduce the probability of missing error-inducing inputs

To threats to construct validity mainly lie in the metrics used inthis work To reduce the threats we measure the number of bothpreviously known and the identified bugs detected by the studiedtechniques as well as their false-positive rate and correspondingtime cost

6 RELATEDWORKAs our work investigates the automatic bug detection techniques forCUDA programs the related work includes the following two partsempirical studies on CUDA programs and techniques of CUDA bugdetection Moreover since Simulee essentially is a search-basedbug-detection technique we also discuss such relevant work

Empirical studies for CUDA programs There are several ex-isting work that study bugs and other features on CUDA programsFor instance Yang et al [43] delivered the empirical study on thefeatures of the performance bugs on CUDA programs Burtscher etal [10] studied the control-flow irregularity and memory-access ir-regularity and found that both irregularities aremutually dependentand exist in most of kernels Che et al[12] examined the effective-ness of CUDA to express with different sets of performance charac-teristics Some researchers are keen on the comparisons betweenCUDA and OpenCL For instance Demidov et al [17] comparedsome C++ programs running on top of CUDA and OpenCL andfound that they work equally well for problems of large size Du etal [19] on the other side studied the discrepancies in the OpenCLand CUDA compilersrsquo optimization that affect the associated GPUcomputing performance In our previous work we also conductedempirical studies to explore CUDA program features For instancewe investigated the features and the distribution of multiple CUDAprogram bug types based on a collected GitHub dataset [41] More-over we developed an approach that can automatically repair CUDAsynchronization bugs via program transformation and validatedits performance via an experimental study based on real-worldbenchmarks [40]

CUDA bug detection Unlike traditional program bugs whichcan be deterministically tested [39 46] and debugged [23 32 47]CUDA synchronization bugs especially data race and barrier di-vergence can often result in undefined behaviors To detect suchbugs there are typically two types of approachesmdashcompiler-basedapproaches and static analysis (SMT solver)-based approaches Inparticular compiler-based approaches eg [37][20][6] usually linkthe detectors to the applications in the compiling stage and detectthe bugs in the runtime process of GPU programs They are limitedby not being ldquofully automaticrdquo because developers have to manually

946


provide test cases Moreover given inferior inputs the synchroniza-tion bugs might not be triggered and detected because such bugscould only occur under limited conditions They can also be expen-sive since such runtime detection demands compiling process andGPU computing environment To the best of our knowledge theseapproaches fail to detect barrier divergence and redundant barrierfunction because of their detection mechanisms On the other handvarious automatic synchronization bug detection approaches eg[30][9] depend on static analysis and SMT solver [16] which couldlead to poor runtime performance when handling complicated GPUprograms Besides such approaches tend to report false positivesor false negatives because it lacks runtime information Althoughdevelopers do not have to provide whole test inputs for them theystill need to provide heuristic settings in order to avoid path explo-sions eg dimension settings for GPUVerify main functions andinitial environments of kernel functions for GKLEE

Compared with these approaches Simulee can automate thedetection process to achieve superior detection performance byenabling the ldquoAutomatic Input Generationrdquo component and theldquoMemory-based Synchronization Bug Detectionrdquo component

Search-based Software Engineering The optimization ap-proaches such as Evolutionary Programming are widely used tosolve software engineering problem by modeling them into op-timization problems [26] Yu et al [45] proposed a metric PSetconstraint to detect CPU-based synchronization bugs Harman etal [27] applied evolving pareto front approximation to refactorsoftware systems Hierons et al [28] used Many-Objective Evolu-tionary Optimisation to optimize the process of software productselection McMinn et al [33] evolved coverage criteria to improvethe performance of bug detection Ouni et al [36] modeled the pro-cess of refactoring into a multiple-objective search-based problemfor generating refactor patches

7 CONCLUSIONSIn this paper we develop a fully automated approach namelySimulee that can successfully detect CUDA synchronization bugsefficiently based on accurate memory-access modeling More specif-ically Simulee consists of two different components the ldquoAutomaticInput Generationrdquo component that applies Evolutionary Computa-tion for automatically generating bug-inducing test inputs and theldquoBug Detection via Memory-Access Modelrdquo component that buildsan accurate memory model for deriving the underlying CUDA syn-chronization bugs To evaluate the efficacy of Simulee we constructa benchmark from real-world CUDA-related projects Our evalua-tion results suggest that Simulee can detect most of the manuallyidentified synchronization bugs out of the studied projects andsuccessfully detect 24 previously unknown bugs which have neverbeen reporteddetected before In addition Simulee can achievebetter effectiveness and efficiency than multiple state-of-the-artapproaches

8 ACKNOWLEDGEMENTThis work is partially supported by the National Natural ScienceFoundation of China (Grant No 61902169) Shenzhen Peacock Plan(Grant No KQTD2016112514355531) and Science and Technol-ogy Innovation Committee Foundation of Shenzhen (Grant No

JCYJ20170817110848086) This work is also partially supported byNational Science Foundation under Grant Nos CCF-1763906 andCCF-1942430 as well as Ant Financial Services Group

REFERENCES[1] 2014 Professional CUDA C Programming (1st ed) Wrox Press Ltd Birmingham

UK UK[2] 2019 Cauchy distribution httpsenwikipediaorgwikiCauchy_distribution[3] 2019 CUDA program introduction httpsenwikipediaorgwikiCUDAr[4] 2019 GPGPU introduction httpsenwikipediaorgwikiGeneral-purpose_

computing_on_graphics_processing_units[5] 2019 Normal distribution httpsenwikipediaorgwikiNormal_distribution[6] 2019 Racecheck Tool httpsdocsnvidiacomcudacuda-memcheckindex

htmlracecheck-tool[7] 2019 The Simulee project httpsgithubcomLebronmydxSimulee[8] arrayfire 2019 ArrayFire httpsgithubcomarrayfirearrayfire[9] Adam Betts Nathan Chong Alastair Donaldson Shaz Qadeer and Paul Thomson

2012 GPUVerify A Verifier for GPU Kernels SIGPLAN Not 47 10 (Oct 2012)113ndash132 httpsdoiorg10114523988572384625

[10] M Burtscher R Nasre and K Pingali 2012 A quantitative study of irregularprograms on GPUs In 2012 IEEE International Symposium on Workload Charac-terization (IISWC) 141ndash151 httpsdoiorg101109IISWC20126402918

[11] Celebrandil [nd] SIFT features with CUDA httpsgithubcomCelebrandilCudaSift

[12] Shuai Che Michael Boyer Jiayuan Meng David Tarjan Jeremy W Sheaffer andKevin Skadron 2008 A performance study of general-purpose applicationson graphics processors using CUDA J Parallel and Distrib Comput 68 10(2008) 1370 ndash 1380 httpsdoiorg101016jjpdc200805014 General-PurposeProcessing using Graphics Processing Units

[13] Jong-Deok Choi Keunwoo Lee Alexey Loginov Robert OrsquoCallahan Vivek Sarkarand Manu Sridharan 2002 Efficient and Precise Datarace Detection for Multi-threaded Object-oriented Programs SIGPLAN Not 37 5 (May 2002) 258ndash269httpsdoiorg101145543552512560

[14] Peter Collingbourne Alastair F Donaldson Jeroen Ketema and Shaz Qadeer2013 Interleaving and Lock-Step Semantics for Analysis and Verification of GPUKernels In Programming Languages and Systems Matthias Felleisen and PhilippaGardner (Eds) Springer Berlin Heidelberg Berlin Heidelberg 270ndash289

[15] cudpp [nd] cudpp httpsgithubcomcudppcudpp[16] Leonardo De Moura and Nikolaj Bjoslashrner 2008 Z3 An Efficient SMT Solver

(2008) 337ndash340 httpdlacmorgcitationcfmid=17927341792766[17] D Demidov K Ahnert K Rupp and P Gottschling 2013 Programming CUDA

and OpenCL A Case Study Using Modern C++ Libraries SIAM Journal onScientific Computing 35 5 (2013) C453ndashC472 httpsdoiorg101137120903683arXivhttpsdoiorg101137120903683

[18] A Dinning and E Schonberg 1990 An Empirical Comparison of MonitoringAlgorithms for Access Anomaly Detection In Proceedings of the Second ACMSIGPLAN Symposium on Principles ampAmp Practice of Parallel Programming (PPOPPrsquo90) ACM New York NY USA 1ndash10 httpsdoiorg1011459916399165

[19] Peng Du Rick Weber Piotr Luszczek Stanimire Tomov Gregory Peterson andJack Dongarra 2012 From CUDA to OpenCL Towards a performance-portablesolution for multi-platform GPU programming Parallel Comput 38 8 (2012) 391ndash 407 httpsdoiorg101016jparco201110002 APPLICATION ACCELERA-TORS IN HPC

[20] Ariel Eizenberg Yuanfeng Peng Toma Pigli WilliamMansky and Joseph Devietti2017 BARRACUDA Binary-level Analysis of Runtime RAces in CUDA ProgramsIn Proceedings of the 38th ACM SIGPLAN Conference on Programming LanguageDesign and Implementation (PLDI 2017) ACM New York NY USA 126ndash140httpsdoiorg10114530623413062342

[21] Lawrence J Fogel 1999 Intelligence Through Simulated Evolution Forty Years ofEvolutionary Programming John Wiley amp Sons Inc New York NY USA

[22] Geof23 2019 GkleeTests httpsgithubcomGeof23GkleeTests[23] Ali Ghanbari Samuel Benton and Lingming Zhang 2019 Practical program re-

pair via bytecode mutation In Proceedings of the 28th ACM SIGSOFT InternationalSymposium on Software Testing and Analysis 19ndash30

[24] gunrock [nd] Gunrock httpsgithubcomgunrockgunrock[25] gunrock [nd] Gunrock Commit httpsgithubcomgunrockgunrockcommit

16294438272cdae8bee4d2db0f4ff65e28bf4331[26] Mark Harman S Afshin Mansouri and Yuanyuan Zhang 2012 Search-based

Software Engineering Trends Techniques and Applications ACM Comput Surv45 1 Article 11 (Dec 2012) 61 pages httpsdoiorg10114523797762379787

[27] Mark Harman and Laurence Tratt 2007 Pareto Optimal Search Based Refactoringat the Design Level In Proceedings of the 9th Annual Conference on Genetic andEvolutionary Computation (GECCO rsquo07) ACM New York NY USA 1106ndash1113httpsdoiorg10114512769581277176

[28] Robert M Hierons Miqing Li Xiaohui Liu Sergio Segura and Wei Zheng 2016SIP Optimal Product Selection from Feature Models Using Many-Objective

947


Evolutionary Optimization ACM Trans Softw Eng Methodol 25 2 Article 17(April 2016) 39 pages httpsdoiorg1011452897760

[29] kaldi asr [nd] Kaldi httpsgithubcomkaldi-asrkaldi[30] Guodong Li Peng Li Geof Sawaya Ganesh Gopalakrishnan Indradeep Ghosh

and Sreeranga P Rajan 2012 GKLEE Concolic Verification and Test Generationfor GPUs SIGPLAN Not 47 8 (Feb 2012) 215ndash224 httpsdoiorg10114523700362145844

[31] Peng Li Guodong Li and Ganesh Gopalakrishnan 2014 Practical Symbolic RaceChecking of GPU Programs In Proceedings of the International Conference forHigh Performance Computing Networking Storage and Analysis (SC rsquo14) IEEEPress Piscataway NJ USA 179ndash190 httpsdoiorg101109SC201420

[32] Xia Li Wei Li Yuqun Zhang and Lingming Zhang 2019 DeepFL integratingmultiple fault diagnosis dimensions for deep fault localization In Proceedings ofthe 28th ACM SIGSOFT International Symposium on Software Testing and AnalysisISSTA 2019 Beijing China July 15-19 2019 169ndash180 httpsdoiorg10114532938823330574

[33] Phil McMinn Mark Harman Gordon Fraser and Gregory M Kapfhammer 2016Automated Search for Good Coverage Criteria Moving from Code Coverage toFault Coverage Through Search-based Software Engineering In Proceedings ofthe 9th International Workshop on Search-Based Software Testing (SBST rsquo16) ACMNew York NY USA 43ndash44 httpsdoiorg10114528970102897013

[34] Robert H B Netzer and Barton P Miller 1991 Improving the Accuracy of DataRace Detection SIGPLAN Not 26 7 (April 1991) 133ndash144 httpsdoiorg101145109626109640

[35] Nvidia 2019 Optimizing Parallel Reduction in CUDA httpdeveloperdownloadnvidiacomcomputecuda11-Betax86_websiteprojectsreductiondocreductionpdf

[36] Ali Ouni Marouane Kessentini Houari Sahraoui Katsuro Inoue and Kalyan-moy Deb 2016 Multi-Criteria Code Refactoring Using Search-Based SoftwareEngineering An Industrial Case Study ACM Trans Softw Eng Methodol 25 3Article 23 (June 2016) 53 pages httpsdoiorg1011452932631

[37] Yuanfeng Peng Vinod Grover and Joseph Devietti 2018 CURD A DynamicCUDA Race Detector In Proceedings of the 39th ACM SIGPLAN Conference onProgramming Language Design and Implementation (PLDI 2018) ACM New YorkNY USA 390ndash403 httpsdoiorg10114531923663192368

[38] Stefan Savage Michael Burrows Greg Nelson Patrick Sobalvarro and ThomasAnderson 1997 Eraser A Dynamic Data Race Detector for MultithreadedPrograms ACM Trans Comput Syst 15 4 (Nov 1997) 391ndash411 httpsdoiorg101145265924265927

[39] August Shi Milica Hadzi-Tanovic Lingming Zhang Darko Marinov and OwolabiLegunsen 2019 Reflection-aware static regression test selection PACMPL 3OOPSLA (2019) 1871ndash18729 httpsdoiorg1011453360613

[40] M Wu L Zhang C Liu S H Tan and Y Zhang 2019 Automating CUDASynchronization via Program Transformation In 2019 34th IEEEACM Interna-tional Conference on Automated Software Engineering (ASE) 748ndash759 httpsdoiorg101109ASE201900075

[41] Mingyuan Wu Husheng Zhou Lingming Zhang Cong Liu and Yuqun Zhang2019 Characterizing and Detecting CUDA Program Bugs CoRR abs190501833(2019) arXiv190501833 httparxivorgabs190501833

[42] Xtra-Computing [nd] THUNDERSVM httpsgithubcomXtra-Computingthundersvm

[43] Y Yang P Xiang M Mantor and H Zhou 2012 Fixing Performance Bugs AnEmpirical Study of Open-Source GPGPU Programs In 2012 41st InternationalConference on Parallel Processing 329ndash339 httpsdoiorg101109ICPP201230

[44] Xin Yao Yong Liu and Guangming Lin 1999 Evolutionary Programming MadeFaster Trans Evol Comp 3 2 (July 1999) 82ndash102 httpsdoiorg1011094235771163

[45] Jie Yu and Satish Narayanasamy 2009 A Case for an Interleaving ConstrainedShared-Memory Multi-Processor In Proceedings of the 36th Annual InternationalSymposium on Computer Architecture (ISCA rsquo09) Association for Computing Ma-chinery New York NY USA 325ndash336 httpsdoiorg10114515557541555796

[46] Lingming Zhang Tao Xie Lu Zhang Nikolai Tillmann Jonathan De Halleux andHong Mei 2010 Test generation via dynamic symbolic execution for mutationtesting In 2010 IEEE International Conference on Software Maintenance 1ndash10

[47] M Zhang Y Li X Li L Chen Y Zhang L Zhang and S Khurshid 2019 AnEmpirical Study of Boosting Spectrum-based Fault Localization via PageRankIEEE Transactions on Software Engineering (2019) 1ndash1 httpsdoiorg101109TSE20192911283

[48] Mengshi Zhang Yuqun Zhang Lingming Zhang Cong Liu and Sarfraz Khur-shid 2018 DeepRoad GAN-based metamorphic testing and input validationframework for autonomous driving systems In Proceedings of the 33rd ACMIEEEInternational Conference on Automated Software Engineering ASE 2018 MontpellierFrance September 3-7 2018 132ndash142 httpsdoiorg10114532381473238187

[49] Yuhao Zhang Yifan Chen Shing-Chi Cheung Yingfei Xiong and Lu Zhang 2018An Empirical Study on TensorFlow Program Bugs In Proceedings of the 27th ACMSIGSOFT International Symposium on Software Testing and Analysis (ISSTA 2018)ACM New York NY USA 129ndash140 httpsdoiorg10114532138463213866

[50] Mai Zheng Vignesh T Ravi Feng Qin and Gagan Agrawal 2011 GRace Alow-overhead mechanism for detecting data races in GPU programs In PPoPP

[51] M Zheng V T Ravi F Qin and G Agrawal 2014 GMRace Detecting Data Racesin GPU Programs via a Low-Overhead Scheme IEEE Transactions on Parallel andDistributed Systems 25 1 (Jan 2014) 104ndash115 httpsdoiorg101109TPDS201344

[52] Husheng ZhouWei Li Zelun Kong Junfeng Guo Yuqun Zhang Bei Yu LingmingZhang and Cong Liu 2020 DeepBillboard Systematic Physical-World Testing ofAutonomous Driving Systems In Proceedings of the 42nd International Conferenceon Software Engineering ICSE 2020 Seoul Korea May 23 - 29 2020

[53] Husheng Zhou Wei Li Yuankun Zhu Yuqun Zhang Bei Yu Lingming Zhangand Cong Liu 2018 DeepBillboard Systematic Physical-World Testing of Au-tonomous Driving Systems CoRR abs181210812 (2018) arXiv181210812httparxivorgabs181210812

[54] zhxfl 2019 CUDA-CNN httpsgithubcomzhxflCUDA-CNN

948


which automatically detects general synchronization bugs for CUDAkernel functions In particular Simulee generates a Memory-AccessModel that maintains thread-wise memory-access information in-cluding thread id visit order and action Accordingly Simulee uti-lizes the LLVM bytecode of CUDA kernel functions to initialize therunning environmental setups including arguments dimensionsand global memory if necessary based on theMemory-Access ModelMoreover Simulee applies Evolutionary Programming [21] to ap-proach error-inducing inputs for exposing the dangerous programexecution paths which may possibly induce synchronization bugsSubsequently Simulee collects the corresponding memory-accessinformation from such paths At last by combining CUDA specificssuch collected memory-access information are analyzed to findwhether they can potentially lead to synchronization bugs

Compared with other CUDA synchronization bug detection ap-proaches Simulee can detect multiple bug types including data raceredundant barrier function and barrier divergence fully automati-cally Moreover Simulee benefits from exploring the dangerous exe-cution paths guided by the lightweight evolutionary search withoutincurring large overhead (eg for constraint solving) such that it ismore efficient than existing state-of-the-art approaches [9 30 31]that usually rely on heavyweight techniques or manual inputs

To evaluate the effectiveness and efficiency of Simulee we firstcollect 7 popular CUDA-related projects from GitHub (with a totalof 17928 commits and 136 million LOC by Aug 2019) as our bench-mark suite Then as a preliminary study we manually identified24 real-world synchronization bugs from a subset of the studiedreal-world CUDA projects and the GKLEE [30] benchmark suiteWe conduct a set of experiments to explore (1) how many of thosemanually identified bugs can be detected by Simulee for the prelim-inary analysis (2) whether Simulee can detect previously-unknownbugs on those real-world CUDA projects and (3) how Simuleecompares with state-of-the-art approaches in CUDA bug detectionThe experimental results suggest that Simulee can successfully de-tect 21 of the 24 manually identified synchronization bugs in ourpreliminary study Furthermore it even detects 24 previously un-known bugs from all the 7 real-world CUDA projects 10 of whichhave already been confirmed by the corresponding developers Theexperimental results also demonstrate that Simulee can be muchmore effective than state-of-the-art approaches ie GKLEE [30]GKLEE-SESA [31] GPUVerify [9] and RaceChecker [6] in detectingsynchronization bugs eg Simulee can detect 2X more previouslyunknown bugs than any of the other studied state-of-the-art ap-proaches In addition Simulee can significantly outperform theother studied approaches in terms of the runtime overhead

In summary our paper makes the following contributions

bull Idea To the best of our knowledge we propose the first ideaof CUDA synchronization bug detection via evolutionarysearch guided by memory-access modelingbull ImplementationWe have implemented the proposed ideaas a lightweight fully automated and general-purpose de-tection framework for CUDA synchronization bugs namelySimulee that can automatically detect a wide range of syn-chronization bugs in CUDA programs which are hard to becaptured manually

Block( 0 0 )

Block( 1 0 )

Grid

Block( 0 1 )

Block( 0 2 )

Block( 1 1 )

Block( 1 2 )

Thread( 0 0 )

Thread( 1 0 )

Thread( 2 0 )

Thread( 0 1 )

Thread( 1 1 )

Thread( 2 1 )

Thread( 0 2 )

Thread( 1 2 )

Thread( 2 2 )

Block ( 1 0 )

Figure 1 CUDA Hierarchy

bull Study We evaluate Simulee under multiple experimentalsetups The results suggest that Simulee is able to detectmost of the manually identified synchronization bugs inthe benchmark In addition it even detects 24 previously-unknown bugs of the entire benchmark fully automaticallyand significantly outperforms state-of-the-art approaches

2 BACKGROUNDIn this section we give an overview on CUDA the CUDA parallelcomputing mechanism and typical CUDA synchronization bugsCUDAOverview CUDA is a parallel computing platform and pro-grammingmodel which allows developers to use GPU hardware forgeneral-purpose computing eg autonoumos driving [48 52 53]CUDA is composed of a runtime library and an extended version ofCC++ In particular CUDA programs are executed on GPU coresnamely ldquodevicerdquo while they also need to be allocated with resourceson CPUs namely ldquohostrdquo prior to execution As a result developersneed to retrieve allocated resources such as global memory afterCUDA program execution To conclude a complete CUDA programcontains 3 runtime stages (1) host resource preparation (2) kernelfunction execution and (3) host resource retrieveCUDA Parallel Computing Mechanism A CUDA kernel func-tion refers to the part of CUDA programs that runs on the deviceside Specifically thread is the kernel functionrsquos basic executionunit At the physical level 32 threads are bundled as a thread warpwherein all the threads execute the same statement at any timeexcept undergoing a branch divergence while at the logic level oneor more threads are contained in a block and one or more blocksare contained in a grid The execution of kernel functions is ini-tialized by setting runtime environments eg dimensions of gridsand blocks passing relevant arguments such that the computationcan be divided into sub-computations and each sub-computationcan be dispatched to different threads Eventually the results ofsub-computations can be merged as the final result of the overallcomputation through applying algorithms such as reduction [35]The hierarchy of the parallel computing mechanism of CUDA ker-nel functions is presented in Figure 1

To synchronize threads CUDA applies barriers at which all thethreads in one block must wait before any can proceed In CUDAkernel functions the barrier function is ldquo__syncthreads()rdquo whichsynchronizes threads from the same block When a thread reachesa barrier it is expected to proceed to next statement if and only ifall the threads from the same block have reached the same barrierOtherwise the program would be exposed to undefined behaviors

938


1 tid = threadIdxx

2



5 else









3 s_idx[tid] = 0

-4 __syncthreads ()


6

7 s_idx[tid] = i

8 s_median[tid] = m

9







-3 __syncthreads ()

4


+6 __syncthreads ()

7





Slicing original

statements

Mutating arguments

Mutating dimensions


Sorting solutions

Top-ranksolution

acceptable

Iterating

Memory model

construction

Memory-model-based

detection

Results





939





endsumi=star t

д(i)

endsumi=star t

f (i)

(1)











eminusx 2

2)




940






x1radic2π

eminusx 2

2dx = 0399 (2)


x1

π (1 + x 2)dx = +infin (3)












Memory Unit

Barrier Function







941


















942












943












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948


1 tid = threadIdxx

2



5 else









3 s_idx[tid] = 0

-4 __syncthreads ()


6

7 s_idx[tid] = i

8 s_median[tid] = m

9







-3 __syncthreads ()

4


+6 __syncthreads ()

7





Slicing original

statements

Mutating arguments

Mutating dimensions


Sorting solutions

Top-ranksolution

acceptable

Iterating

Memory model

construction

Memory-model-based

detection

Results





939





endsumi=star t

д(i)

endsumi=star t

f (i)

(1)











eminusx 2

2)




940






x1radic2π

eminusx 2

2dx = 0399 (2)


x1

π (1 + x 2)dx = +infin (3)












Memory Unit

Barrier Function







941


















942












943












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948





endsumi=star t

д(i)

endsumi=star t

f (i)

(1)











eminusx 2

2)




940






x1radic2π

eminusx 2

2dx = 0399 (2)


x1

π (1 + x 2)dx = +infin (3)












Memory Unit

Barrier Function







941


















942












943












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948






x1radic2π

eminusx 2

2dx = 0399 (2)


x1

π (1 + x 2)dx = +infin (3)












Memory Unit

Barrier Function







941


















942












943












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948


















942












943












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948












943












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948












944




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948




arrayfire


GkleeTests






cudaSift




kaldi








945





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948





+ threadIdxx

2


4 pragma unroll



7

8

9














946
































947





























948
































947





























948





























948

simulee: detecting cuda synchronization bugs via memory ...lxz144130/publications/icse2020b.pdf ·...

Documents