GKLEE: Concolic Verification and Test Generation for GPUs
Guodong Li 1,2, Peng Li 1, Geof Sawaya 1, Ganesh Gopalakrishnan 1, Indradeep Ghosh 2, Sreeranga P. Rajan 2
2 Fujitsu Labs of America
Feb. 2012
GPUs are widely used!
• About 40 of the top 500 machines are GPU based
• Personal supercomputers used for scientific research (biology, physics, …) increasingly based on GPUs
In such application domains, it is important that GPU computations yield correct answers and are bug-free.
Existing GPU Testing Methods are Inadequate
• Insufficient branch coverage and interleaving coverage, leading to
– Missed data races
Existing GPU Testing Methods are Inadequate
• Data races are a huge problem
– Testing is NEVER conclusive
– One has to infer a data race's ill effects indirectly through corrupted values
– Even instrumented race checking gives results only for a specific platform, and not for future validations
  • for example, under a different warp scheduling, e.g. the changeover from the old Tesla to the new Fermi
Existing GPU Testing Methods are Inadequate
• Insufficient branch coverage and interleaving coverage, leading to
– Missed data races
– Missed deadlocks
• Insufficient measurement of performance penalties due to
– Warp divergence
– Non-coalesced memory accesses
– Bank conflicts
Existing GPU Testing Methods are Inadequate
• CUDA GDB Debugger
– Manually debug the code and check for races and deadlocks
• CUDA Profiler
– Reported numbers are difficult to read
– Low coverage (i.e. not all possible inputs)
• GKLEE
– Better tool for verification and testing
– Can address all the previously mentioned points
– e.g. has found bugs in real SDK kernels previously thought to be bug-free
– Gives root causes of the bugs
Our Contributions
• GKLEE: a Symbolic Virtual GPU for Verification, Analysis, and Test-generation
• GKLEE reports Races, Deadlocks, Bank Conflicts, Non-Coalesced Accesses, Warp Divergences
• GKLEE generates Tests to Run on GPU Hardware
Architecture of GKLEE
[Figure: architecture diagram. Components: C++ GPU Program (with Sym. Inputs); LLVM GCC Compiler with CUDA Syntax Handler; LLVMcuda; GPU configuration; GKLEE (Executor, scheduler, checker, test generator); Statistics / Bugs; Test Cases; NVCC; Replay on Real GPU.]
Rest of the Talk
• Simple CUDA example
• Details of Symbolic Virtual GPU
• Analysis Details:
– Races, Deadlocks
– Degree of
  • Warp divergences, Bank Conflicts, Non-Coalesced Accesses
– Functional Correctness
• Automatic Test Generation
– Coverage-directed test-case reduction
CUDA
• A simple dialect of C++ with CUDA directives
• Thread blocks / teams -- SIMD “warps”
• Synchronization through barriers / atomics (GKLEE being extended to handle atomics)
Example: Increment Array Elements
Increment N-element array A by scalar b
__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) A[idx] = A[idx] + b;
}
[Figure: array A indexed by tid 0, 1, …; thread t0 computes A[0]+b, thread t1 computes A[1]+b, ...]
Illustration of Race
Increment N-element vector A by scalar b
__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  // idx + N - 1 keeps the index non-negative for idx == 0 (in C++, -1 % N is -1)
  if (idx < N) A[idx] = A[(idx + N - 1) % N] + b;
}
RACE! With threads tid 0, 1, ..., 63: t63 writes A[63] while t0 reads A[63].
Illustration of Deadlock
Increment N-element vector A by scalar b
__global__ void inc_gpu(int *A, int b, int N) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < N) {
    A[idx] = A[idx] + b;
    __syncthreads();
  }
}
DEADLOCK! Threads with idx < N wait at the barrier; threads with idx ≥ N never reach it.
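The deadlock condition can be sketched on the host as a simple count. This is an illustrative model only, not GKLEE's checker: `threads_at_barrier` and `barrier_diverges` are hypothetical names, and the model assumes a one-dimensional block.

```cpp
// Hypothetical host-side model of one thread block running the kernel above:
// count how many of its threads take the `idx < n` branch and therefore
// reach __syncthreads().
int threads_at_barrier(int block_idx, int block_dim, int n) {
    int count = 0;
    for (int t = 0; t < block_dim; ++t) {
        int idx = block_idx * block_dim + t;
        if (idx < n) ++count;
    }
    return count;
}

// The barrier is safe only if every thread of the block reaches it or none
// does; a partial arrival leaves the arrived threads waiting forever.
bool barrier_diverges(int block_idx, int block_dim, int n) {
    int arrived = threads_at_barrier(block_idx, block_dim, n);
    return arrived != 0 && arrived != block_dim;
}
```

For example, a block of 64 threads with N = 50 has 50 threads at the barrier and 14 that skip it, so `barrier_diverges(0, 64, 50)` is true, while N = 64 is safe.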
Example of a Race Found by GKLEE
__global__ void histogram64Kernel(unsigned *d_Result, unsigned *d_Data, int dataN) {
  const int threadPos = ((threadIdx.x & (~63)) >> 0)
                      | ((threadIdx.x & 15) << 2)
                      | ((threadIdx.x & 48) >> 4);
  ...
  __syncthreads();
  for (int pos = IMUL(blockIdx.x, blockDim.x) + threadIdx.x; pos < dataN;
       pos += IMUL(blockDim.x, gridDim.x)) {
    unsigned data4 = d_Data[pos];
    ...
    addData64(s_Hist, threadPos, (data4 >> 26) & 0x3FU);
  }
  __syncthreads();
  ...
}
__device__ inline void addData64(unsigned char *s_Hist, int threadPos, unsigned int data) {
  s_Hist[threadPos + IMUL(data, THREAD_N)]++;
}
“GKLEE: Is there a Race?”
GKLEE: “Threads 5 and 13 have a WW race when d_Data[5] = 0x04040404 and d_Data[13] = 0.”
Example of Test Coverage due to GKLEE
__shared__ unsigned shared[NUM];

__device__ inline void swap(unsigned& a, unsigned& b) {
  unsigned tmp = a; a = b; b = tmp;
}

__global__ void Bitonic_Sort(unsigned* values) {
  unsigned int tid = threadIdx.x;
  shared[tid] = values[tid];
  __syncthreads();
  for (unsigned k = 2; k <= blockDim.x; k *= 2)
    for (unsigned j = k / 2; j > 0; j /= 2) {
      unsigned ixj = tid ^ j;
      if (ixj > tid) {
        // braces make the else pair with the (tid & k) test
        if ((tid & k) == 0) {
          if (shared[tid] > shared[ixj]) swap(shared[tid], shared[ixj]);
        } else {
          if (shared[tid] < shared[ixj]) swap(shared[tid], shared[ixj]);
        }
      }
      __syncthreads();
    }
  values[tid] = shared[tid];
}
“How do we test this?”
Answer 1 : “Random + “
Answer 2 : Ask GKLEE:
Here are 5 tests with 100% source code coverage and 79% avg. thread + barrier interval coverage
GKLEE: Symbolic Virtual GPU
[Figure: CUDA programming model. The Host launches Kernel 1 and Kernel 2 on the Device as Grid 1 and Grid 2. Grid 1 contains Block(0, 0) to Block(2, 1); Block(1, 1) contains Thread(0, 0) to Thread(4, 2). GKLEE comprises a virtual CPU and a virtual GPU.]
• GKLEE models a GPU using software
– The virtual GPU represents the CUDA Programming Model (hence hides many hardware details)
– Similar to the CUDA emulator in this aspect, but with many unique features
– Can simulate CPU+GPU
Concolic Execution on the Virtual GPU
• The values can be CONCrete or symbOLIC (CONCOLIC) in GKLEE
– A value may be a complicated symbolic expression
– Symbolic expressions are handled by constraint solvers
  • Determine satisfiability
  • Give concrete values as evidence
– Constraint solving has become 1,000x faster over the last 10 years
Comparing Concrete and Symbolic Execution
Program:            State after the statement:
(input)             a = 10
b = a * 2;          a = 10, b = 20
c = a + b;          a = 10, b = 20, c = 30
if (c > 100)
  assert(0);        unreachable
All values are concrete
Comparing Concrete and Symbolic Execution
Program:            State after the statement:
(input)             a = x, with x ∈ (-∞, +∞)
b = a * 2;          a = x, b = 2x
c = a + b;          a = x, b = 2x, c = 3x
if (c > 100)
  assert(0);        reachable, e.g. x = 40
else
  …                 reachable, e.g. x = 30; now the path condition is 3x <= 100
The values can be concrete or symbolic
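A minimal sketch of the same idea, assuming a brute-force scan as a stand-in for the constraint solver (GKLEE hands the path condition to a real solver; `reaches_assert` and `find_witness` are hypothetical names introduced here):

```cpp
// The program from the slide with the input a left open.
// Returns true when execution would reach assert(0).
bool reaches_assert(int a) {
    int b = a * 2;
    int c = a + b;  // c == 3 * a
    return c > 100;
}

// Stand-in for a constraint solver (an assumption for illustration): scan a
// small input range for a value that drives execution down the requested
// branch. Returns false if no such input exists in [lo, hi].
bool find_witness(bool want_assert, int lo, int hi, int &witness) {
    for (int a = lo; a <= hi; ++a) {
        if (reaches_assert(a) == want_assert) {
            witness = a;
            return true;
        }
    }
    return false;
}
```

Both branches have witnesses in the range 0 to 100, which corresponds to the two "reachable" annotations above; the concrete run with a = 10 only ever exercises the else branch.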
GKLEE Works on LLVM Bytecode
• CUDA C++ programs are compiled to LLVM bytecode by LLVM-GCC with our CUDA syntax handler
• Our online technical report contains a detailed description
• GKLEE extends KLEE to handle CUDA features
LLVMcuda Syntax and Semantics
Thread Scheduling: In general, an Exp. Number of Schedules!
It is like shuffling decks of cards
> 13 trillion shuffles exist for 5 decks with 5 cards !!
> 13 trillion schedules exist for 5 threads with 5 instructions !!
More precisely, 25! / (5!)^5
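The count can be checked with a few lines of integer arithmetic. The sketch below (function names are illustrative) evaluates (threads × instrs)! / (instrs!)^threads as a product of binomial coefficients, which keeps every intermediate value exact and within 64 bits for this example; larger inputs would overflow.

```cpp
#include <cstdint>

// Binomial coefficient C(n, k). After step i the running value is
// C(n - k + i, i), so each division is exact.
std::uint64_t binomial(unsigned n, unsigned k) {
    std::uint64_t r = 1;
    for (unsigned i = 1; i <= k; ++i) r = r * (n - k + i) / i;
    return r;
}

// Number of interleavings of `threads` threads with `instrs` instructions
// each: choose the slots of the first thread, then of the second, and so on.
std::uint64_t count_schedules(unsigned threads, unsigned instrs) {
    std::uint64_t total = 1;
    unsigned remaining = threads * instrs;
    for (unsigned t = 0; t < threads; ++t) {
        total *= binomial(remaining, instrs);
        remaining -= instrs;
    }
    return total;
}
```

`count_schedules(5, 5)` evaluates to 623,360,743,125,120, roughly 6.2 × 10^14, which is consistent with the "> 13 trillion" bound on the slide.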
GKLEE Avoids Examining Exp. Schedules !!
Instead of considering all Schedules and All Potential Races…
Consider JUST THIS SINGLE CANONICAL SCHEDULE !!
Folk Theorem (proved in our paper): “We will find A RACE if there is ANY race” !!
Closer Look: canonical scheduling
Race-free operations can be exchanged
another valid schedule (e.g. canonical schedule):
t1:a1:read x
t2:a2:write y
t1:a3:write x
t2:a4:write y
t1:a5:read x
t2:a6:read y
a valid schedule:
t2:a2:write y
t1:a1:read x
t1:a3:write x
t2:a4:write y
t2:a6:read y
t1:a5:read x
The scheduler:
(1) Applies the canonical schedule;
(2) Checks races upon the barriers;
(3) If there is no race, continues; otherwise reports the race and terminates
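The three scheduler steps can be sketched with concrete addresses. This is a simplification under stated assumptions: GKLEE works on symbolic addresses and asks a solver whether two of them can be equal, whereas the code below compares plain integers. `Access`, `has_race` and `record_inc` are hypothetical names; `record_inc` models the two `inc_gpu` variants shown earlier, run one thread after another.

```cpp
#include <cstddef>
#include <vector>

// One memory access recorded while a thread runs in the canonical schedule.
struct Access {
    int thread;     // id of the issuing thread
    int address;    // element index, standing in for a memory address
    bool is_write;
};

// Two accesses conflict if different threads touch the same address and at
// least one of them writes.
bool conflicts(const Access &x, const Access &y) {
    return x.thread != y.thread && x.address == y.address &&
           (x.is_write || y.is_write);
}

// Step (2): at the barrier, check every pair recorded in the interval.
bool has_race(const std::vector<Access> &interval) {
    for (std::size_t i = 0; i < interval.size(); ++i)
        for (std::size_t j = i + 1; j < interval.size(); ++j)
            if (conflicts(interval[i], interval[j])) return true;
    return false;
}

// Step (1) for inc_gpu: run threads 0..n-1 sequentially and log each
// thread's read and write. With read_left_neighbour set, thread idx reads
// A[(idx + n - 1) % n]; otherwise it reads its own element A[idx].
std::vector<Access> record_inc(int n, bool read_left_neighbour) {
    std::vector<Access> log;
    for (int idx = 0; idx < n; ++idx) {
        int src = read_left_neighbour ? (idx + n - 1) % n : idx;
        Access rd = {idx, src, false};
        Access wr = {idx, idx, true};
        log.push_back(rd);
        log.push_back(wr);
    }
    return log;
}
```

With 64 threads, `has_race(record_inc(64, false))` is false for the plain increment and `has_race(record_inc(64, true))` is true for the neighbour-reading variant. The pairwise scan is quadratic in the number of recorded accesses, which is acceptable for a sketch.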
SIMD-aware Canonical Scheduling in GKLEE
SIMD/Barrier Aware Canonical scheduling within warp/block
[Figure: two warps, threads t1, t2, … t32 and threads t33, t34, … t64. Each warp executes Instr. 1 to Instr. 3 in BarrierInterval (BI1) and Instr. 4 to Instr. 6 in BarrierInterval (BI2).]
Record accesses in canonical schedule
Check whether the accesses conflict (e.g. have the same address)
SIMD-aware Race Checking in GKLEE
Check races on the fly (in the canonical schedule)
[Figure: the same two warps (t1 … t32 and t33 … t64) and barrier intervals BI1 and BI2, annotated with intra-warp races and with inter-warp and inter-block races.]
SDK Kernel Example: race checking
(histogram64Kernel and addData64 as shown earlier)
Threads t1 and t2 each execute:
  threadPos = …
  data = (data4 >> 26) & 0x3FU
  s_Hist[threadPos + data*THREAD_N]++;
RW set:
t1: writes s_Hist((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64), …
t2: writes s_Hist((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64), …
Query: ∃ t1, t2, d_Data: (t1 ≠ t2) ∧ ((((t1 & (~63)) >> 0) | ((t1 & 15) << 2) | ((t1 & 48) >> 4)) + ((d_Data[t1] >> 26) & 0x3FU) * 64) == ((((t2 & (~63)) >> 0) | ((t2 & 15) << 2) | ((t2 & 48) >> 4)) + ((d_Data[t2] >> 26) & 0x3FU) * 64) ?
GKLEE indicates that these two addresses are equal when t1 = 5, t2 = 13, d_Data[5] = 0x04040404, and d_Data[13] = 0, indicating a Write-Write race
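The reported witness can be replayed by evaluating the index expression for both threads. The sketch below takes the multiplier as a parameter because the slides do not state the value of THREAD_N; the value 32 used in the note that follows is an assumption, chosen because the results table lists #T = 32 for this kernel. `hist_index` is a hypothetical helper name.

```cpp
// Index into s_Hist written by thread `tid` for input word `data4`,
// following threadPos and addData64 from the kernel shown earlier.
// thread_n plays the role of THREAD_N.
unsigned hist_index(unsigned tid, unsigned data4, unsigned thread_n) {
    unsigned thread_pos = ((tid & ~63u) >> 0)
                        | ((tid & 15u) << 2)
                        | ((tid & 48u) >> 4);
    unsigned data = (data4 >> 26) & 0x3Fu;
    return thread_pos + data * thread_n;
}
```

With a multiplier of 32, thread 5 with d_Data[5] = 0x04040404 and thread 13 with d_Data[13] = 0 both produce index 52, so both increment the same s_Hist entry. With the multiplier 64 shown in the RW set, the same inputs give 84 and 52, so the witness appears to correspond to THREAD_N = 32.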
Experimental Results, Part I (check correctness and performance issues)
The results of running GKLEE on CUDA SDK 2.0 kernels. GKLEE checks (1) well synchronized barriers; (2) races; (3) functional correctness; (4) bank conflicts; (5) memory coalescing; (6) warp divergence; (7) required volatile keyword.

| Kernels | LOC | Race | Func. Correct. | #T | Bank Conflict (Perf.), 1.x | Bank Conflict (Perf.), 2.x | Coalesced Accesses (Perf.), ≤1.1 | Coalesced Accesses (Perf.), 2.x | Warp Diverg. (Perf.) | Volatile Needed |
|---|---|---|---|---|---|---|---|---|---|---|
| Bitonic Sort | 30 | | yes | 4 | 0% | 0% | 100% | 100% | 60% | no |
| Scalar Prod. | 30 | | yes | 64 | 0% | 0% | 11% | 100% | 100% | yes |
| Matrix Mult | 61 | | yes | 64 | 0% | 0% | 100% | 100% | 0% | no |
| Histogram64 | 69 | WW | unknown | 32 | 66% | 66% | 100% | 100% | 0% | yes |
| Reduction (7) | 231 | | yes | 16 | 0% | 0% | 100% | 100% | 16-83% | yes |
| Scan Best | 78 | | yes | 32 | 71% | 71% | 100% | 100% | 71% | no |
| Scan Naïve | 28 | | yes | 32 | 0% | 0% | 50% | 100% | 85% | yes |
| Scan Effi. | 60 | | yes | 32 | 83% | 16% | 0% | 0% | 83% | no |
| Scan Large | 196 | | yes | 32 | 71% | 71% | 100% | 100% | 71% | no |
| Radix Sort | 750 | WW | unknown | 16 | 3% | 0% | 0% | 100% | 5% | yes |
| Bisect Small | 1,000 | ben. | _ | 16 | 38% | 0% | 97% | 100% | 43% | yes |
| Bisect Large | 1,400 | ben. | _ | 16 | 15% | 0% | 99% | 100% | 53% | yes |
Automatic Test Generation
• GKLEE guarantees to explore all paths w.r.t. given inputs
• The path constraint at the end of each path is solved to generate concrete test cases
– GKLEE supports many heuristic reduction techniques
[Figure: execution trees for threads t1 and t2 and for their combination t1+t2, branching on the conditions c1 … c4 and their negations. A path through the combined tree yields a constraint such as c1 ∧ c2 ∧ c3 ∧ c4 or ¬c1 ∧ ¬c3 ∧ …; solve this constraint to give a concrete test.]
SDK Example: comprehensive testing
__global__ void BitonicKernel(unsigned* values) { … } (same body as the Bitonic_Sort kernel shown earlier)
Branches along the paths:
– shared[0] > shared[1] or shared[0] ≤ shared[1]
– shared[1] < shared[2] or shared[1] ≥ shared[2]
– shared[0] > shared[2] or shared[0] ≤ shared[2]
…
Unsat: shared[0] > shared[1] ∧ shared[1] ≥ shared[2] ∧ shared[0] ≤ shared[2]
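The feasibility of each branch combination can be illustrated with a bounded search. This is a sketch under stated assumptions: the exhaustive scan over a small value range stands in for the constraint solver, `find_test` is a hypothetical helper, and an empty result only shows that no test exists within the bound (for these three comparisons the ordering argument makes the result general).

```cpp
// Look for concrete values s[0], s[1], s[2] in [0, limit) that take one
// chosen outcome per comparison:
//   gt01: shared[0] > shared[1]   (false means shared[0] <= shared[1])
//   lt12: shared[1] < shared[2]   (false means shared[1] >= shared[2])
//   gt02: shared[0] > shared[2]   (false means shared[0] <= shared[2])
// Returns true and fills s[] when such a test input exists.
bool find_test(bool gt01, bool lt12, bool gt02, int limit, int s[3]) {
    for (int a = 0; a < limit; ++a)
        for (int b = 0; b < limit; ++b)
            for (int c = 0; c < limit; ++c) {
                bool ok = ((a > b) == gt01) && ((b < c) == lt12) &&
                          ((a > c) == gt02);
                if (ok) {
                    s[0] = a; s[1] = b; s[2] = c;
                    return true;
                }
            }
    return false;
}
```

`find_test(true, false, false, 4, s)` fails, matching the Unsat combination above, whereas `find_test(true, false, true, 4, s)` succeeds and leaves a concrete input in `s` that could serve as a test case for that path.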
SDK Example: comprehensive verification
Functional correctness: the output values are sorted: values[0] ≤ values[1] ≤ … ≤ values[n]
[Figure: path tree; each leaf carries the output for that path, values = …]
Experimental Results, Part II… (Automatic Test Generation)
Coverage information about the generated tests for some CUDA kernels.

| Kernels | src. code coverage | Avg. Covt | Max. Covt | Avg. CovBIt | Max. CovBIt | Exec. time |
|---|---|---|---|---|---|---|
| Bitonic Sort | 100%/100% | 78%/76% | 100%/94% | 79%/66% | 90%/76% | 1s |
| Merge Sort | 100%/100% | 88%/70% | 100%/85% | 93%/86% | 100%/100% | 1.6s |
| Word Search | 100%/100% | 100%/81% | 100%/85% | 100%/97% | 100%/100% | 0.1s |
| Suffix Tree Match | 100%/90% | 55%/49% | 98%/66% | 55%/49% | 98%/83% | 31s |
| Histogram64 | 100%/100% | 100%/75% | 100%/75% | 100%/100% | 100%/100% | 600s |

Covt and CovTBt measure bytecode coverage w.r.t. threads. No test reductions were used in generating this table. Exec. time is measured on a typical workstation.
Experimental Results, Part II (Coverage Directed Test Reduction)
Results after applying reduction Heuristics
RedTB and RedBI cut the paths according to the coverage information of Thread+Barrier and Barrier respectively. Basically a path is pruned if it is unlikely to contribute new coverage.
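| Kernels | No Reductions: #path | No Reductions: Avg. CovBIt | RedTB: #path | RedTB: Avg. CovBIt | RedBI: #path | RedBI: Avg. CovBIt |
|---|---|---|---|---|---|---|
| Bitonic Sort | 28 | 79%/66% | 5 | 79%/66% | 5 | 79%/65% |
| Merge Sort | 34 | 93%/86% | 4 | 92%/84% | 4 | 92%/84% |
| Word Search | 8 | 100%/97% | 2 | 100%/97% | 2 | 94%/85% |
| Suffix Tree Match | 31 | 55%/49% | 6 | 55%/49% | 6 | 55%/49% |
| Histogram64 | 13 | 100%/100% | 5 | 100%/100% | 5 | 100%/100% |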
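The pruning idea (drop a path if it is unlikely to contribute new coverage) can be sketched as a greedy filter. This is a simplified stand-in, not the RedTB or RedBI heuristic itself: it assumes each path's coverage is already known as a set of integer ids, whereas GKLEE cuts paths during exploration. `reduce_paths` is a hypothetical name.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Keep a path only if it covers at least one item that no previously kept
// path has covered. coverage[p] holds the ids covered by path p; the result
// lists the indices of the paths that survive.
std::vector<std::size_t> reduce_paths(
        const std::vector<std::set<int> > &coverage) {
    std::set<int> seen;
    std::vector<std::size_t> kept;
    for (std::size_t p = 0; p < coverage.size(); ++p) {
        std::size_t before = seen.size();
        seen.insert(coverage[p].begin(), coverage[p].end());
        if (seen.size() > before) kept.push_back(p);
    }
    return kept;
}
```

For four paths covering {1, 2}, {2}, {3} and {1, 3}, only the first and third are kept, and the union of covered items is unchanged. The result depends on the order in which paths are visited, which is one reason a real heuristic may behave differently.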
Additional GKLEE Features
• GKLEE employs an efficient memory organization
• Employs many expression evaluation optimizations
  • Simplify concolic expressions on the fly
  • Dynamically cache results
  • Apply dependency analysis before constraint solving
  • Use manually optimized C/C++ Libraries
• GKLEE also handles all of the C++ Syntax
• GKLEE never generates false alarms
Experimental Results, Part III (performance comparison of two tools)
Execution times (in seconds) of GKLEE and PUG [SIGSOFT FSE 2010] for functional correctness check.
#T is the number of threads. Time is reported in the format of GPU time(entire time); T.O means > 5 minutes.

| Kernels | #T = 4: PUG | #T = 4: GKLEE | #T = 16: PUG | #T = 16: GKLEE | #T = 64: GKLEE | #T = 256: GKLEE | #T = 1,024: GKLEE |
|---|---|---|---|---|---|---|---|
| Simple Reduct. | 2.8 | <0.1(<0.1) | T.O | <0.1(<0.1) | <0.1(<0.1) | 0.2(0.3) | 2.3(2.9) |
| Matrix Transp. | 1.9 | <0.1(<0.1) | T.O | <0.1(0.3) | <0.1(3.2) | <0.1(63) | 0.9(T.O) |
| Bitonic Sort | 3.7 | 0.9(1) | T.O | T.O | T.O | T.O | T.O |
| Scan Large | _ | <0.1(<0.1) | _ | <0.1(<0.1) | 0.1(0.2) | 1.6(3) | 22(51) |
Other Details
• Diverged warp scheduling, intra-warp, inter-warp/-block race checking, textually aligned barrier checking
• Checking performance issues
– warp divergence, bank conflicts, global memory coalescing
• Path/Test reduction techniques
• Volatile declaration checking
• Handling symbolic aliasing and pointers
• Drivers for the kernels and replaying on the real GPU
• Other results, e.g. on CUDA SDK 4.0 programs
• CUDA’s relaxed memory model and semantics
Summary
• GKLEE: symbolic virtual GPU
– Identify correctness and performance issues
– Produce concrete tests with high code coverage
– Enable symbolic parallel debugging for CUDA programs
– Good for other CUDA applications (e.g. compiler optimization verification, regression testing, etc.)
• The tool is open source and available at:
– www.cs.utah.edu/fv/GKLEE
– with tutorial, manual, tech. report, liveDVD, etc.
• Future Work
– Parameterized verification (e.g. equivalence checking)
– Support for floating point numbers
– Combination with runtime execution (on the real GPU)
Thank You!
Questions?
Obtain GKLEE from www.cs.utah.edu/fv/GKLEE