code gpu with cuda - applying optimization techniques

CODE GPU WITH CUDA: APPLYING OPTIMIZATION TECHNIQUES
Created by Marina Kolpakova for cuda.geek, Itseez

Upload: marina-kolpakova

Post on 11-Apr-2017


TRANSCRIPT

Page 1: Code GPU with CUDA - Applying optimization techniques

CODE GPU WITH CUDA
APPLYING OPTIMIZATION TECHNIQUES

Created by Marina Kolpakova for cuda.geek, Itseez


Page 2

OUTLINE

Threshold

Transpose

Optimizing control flow

Streaming kernels

Reduction

Page 3

STREAMING KERNELS

Page 4

STREAMING KERNEL
y = f(x)

template <typename Ptr2DIn, typename Ptr2DOut, typename Op>
__global__ void streaming(const Ptr2DIn src, Ptr2DOut dst)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;

    if (x < dst.cols && y < dst.rows)
        dst(y, x) = saturate_cast<typename Ptr2DOut::elem_type>(Op::apply(src(y, x)));
}

dim3 block(block_x, block_y);
dim3 grid(roundUp(dst.cols, block_x), roundUp(dst.rows, block_y));
streaming<<<grid, block>>>(src, dst);

General arithmetic and conversions, repack by map, resize, etc.

Page 5

THRESHOLD
PIXEL PER THREAD

y = max(x, τ)

__global__ void threshold_bw(const DPtrb src, DPtrb dst, int32s cols, int32s rows, int8u thr)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;

    if (x < cols && y < rows)
        dst.row(y)[x] = max(thr, src.row(y)[x]);
}

The code is available in the cumib microbenchmarks library.

Adjusting launch parameters for specific hardware

Page 6

THRESHOLD
PIXEL PER THREAD: RESULTS

block size   GK107*, μs   X-factor   GM107**, μs   X-factor
32x16        456.49       1.00       214.94        1.00
32x8         431.42       1.06       210.39        1.02
32x6         435.87       1.05       226.02        0.95
32x4         412.01       1.11       222.11        0.98
32x2         785.36       0.58       228.28        0.94
32x1         1516.19      0.30       419.94        0.51

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 7

THRESHOLD
KERNEL UNROLLING: IMAGE ROW PER WARP

__global__ void threshold_bw(const DPtrb src, DPtrb dst, int32s cols, int32s rows, int8u thr)
{
    int block_id = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x / warpSize);
    int y = (threadIdx.y * blockDim.x + threadIdx.x) / warpSize + block_id;

    if (y < rows)
        for (int x = lane(); x < cols; x += warpSize)
            dst.row(y)[x] = max(thr, src.row(y)[x]);
}

The code is available in the cumib microbenchmarks library.

block size       GK107*, μs   X-factor   GM107**, μs   X-factor
blockwise 32x4   412.01       1.00       226.02        1.00
warpwise 8       223.01       1.85       265.13        0.85
warpwise 4       222.53       1.85       248.04        0.91
warpwise 2       374.47       1.10       246.83        0.92

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 8

THRESHOLD
KERNEL UNROLLING: MORE INDEPENDENT ELEMENTS

// same as previous
unsigned char tmp[2];
if (y * 2 < rows - 2)
    for (int x = lane(); x < cols; x += warpSize)
    {
        tmp[0] = max(threshold, src.row(y * 2)[x]);
        tmp[1] = max(threshold, src.row(y * 2 + 1)[x]);

        dst.row(y * 2)[x] = tmp[0];
        dst.row(y * 2 + 1)[x] = tmp[1];
    }
else { /* compute tail */ }

The code is available in the cumib microbenchmarks library.

block size       GK107*, μs   X-factor   GM107**, μs   X-factor
blockwise 32x4   412.01       1.00       226.02        1.10
warpwise 4       222.53       1.85       248.04        0.91
warpwise 4 U2    185.28       2.22       269.75        0.83

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 9

THRESHOLD
KERNEL UNROLLING: USE WIDER TRANSACTIONS

unsigned char -> unsigned integer

template <typename T> __device__ __forceinline__ T vmax_u8(T, T);
template <> __device__ __forceinline__ unsigned vmax_u8(unsigned v, unsigned m)
{
    unsigned int res = 0;
    asm("vmax4.u32.u32.u32 %0, %1, %2, %3;" : "=r"(res) : "r"(v), "r"(m), "r"(0));
    return res;
}

// same as previous
if (y < rows)
    for (int x = lane(); x < cols / sizeof(int32u); x += warpSize)
    {
        int32u tmp = src.row<unsigned int>(y)[x];
        int32u res = vmax_u8(tmp, mask);
        dst.row<unsigned int>(y)[x] = res;
    }

The code is available in the cumib microbenchmarks library.

Page 10

THRESHOLD
WIDER TRANSACTIONS: RESULTS

block size       GK107*, μs   X-factor   GM107**, μs   X-factor
blockwise 32x4   412.01       1.00       226.02        1.10
warpwise 4       222.53       1.85       248.04        0.91
warpwise 4 U2    185.28       2.22       269.75        0.83
warpwise 2W      126.01       3.27       245.21        0.92
warpwise 2WU2    96.71        4.26       162.83        1.39
warpwise 4W      85.22        4.83       257.69        0.88
warpwise 4WU2    83.42        4.93       161.18        1.40

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 11

TRANSPOSE
D(i, j) = S(j, i)

template <typename T> __global__
void transposeNaive(const DPtr<T> idata, DPtr<T> odata, int cols, int rows)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    odata.row(xIndex)[yIndex] = idata.row(yIndex)[xIndex];
}

The code is available in the cumib microbenchmarks library.

Page 12

TRANSPOSE
COALESCE MEMORY ACCESS: SMEM USAGE

Split the input matrix into tiles, assigning one thread block per tile. Tile size (in elements) and block size (in threads) are not necessarily the same. Load a tile into smem in coalesced fashion -> read from smem by column -> write to the destination in coalesced fashion.

Page 13

COALESCE MEMORY ACCESS: SMEM USAGE CODE

template <typename T> __global__
void transposeCoalesced(const DPtr<T> idata, DPtr<T> odata, int cols, int rows)
{
    __shared__ float tile[TRANSPOSE_TILE_DIM][TRANSPOSE_TILE_DIM];

    int xIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata.row(yIndex + i)[xIndex];

    __syncthreads();

    xIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        odata.row(yIndex + i)[xIndex] = tile[threadIdx.x][threadIdx.y + i];
}

The code is available in the cumib microbenchmarks library.

Page 14

TRANSPOSE
SMEM ACCESSES: AVOID BANK CONFLICTS

template <typename T> __global__
void transposeCoalescedPlus1(const DPtr<T> idata, DPtr<T> odata, int cols, int rows)
{
    __shared__ float tile[TRANSPOSE_TILE_DIM][TRANSPOSE_TILE_DIM + 1];

    int xIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata.row(yIndex + i)[xIndex];

    __syncthreads();

    xIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        odata.row(yIndex + i)[xIndex] = tile[threadIdx.x][threadIdx.y + i];
}

The code is available in the cumib microbenchmarks library.

Page 15

TRANSPOSE
WARP SHUFFLE

Page 16

TRANSPOSE
TRANSPOSE SHUFFLE

Page 17

TRANSPOSE
TRANSPOSE SHUFFLE CODE

__global__ void transposeShuffle(const DPtr32 idata, DPtr32 odata, int cols, int rows)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    int yIndex1 = yIndex * SHUFFLE_ELEMENTS_VECTORS;

    yIndex *= SHUFFLE_ELEMENTS_PER_WARP;

    int4 reg0, reg1;

    reg0.x = idata.row(yIndex + 0)[xIndex];
    reg0.y = idata.row(yIndex + 1)[xIndex];
    reg0.z = idata.row(yIndex + 2)[xIndex];
    reg0.w = idata.row(yIndex + 3)[xIndex];

    reg1.x = idata.row(yIndex + 4)[xIndex];
    reg1.y = idata.row(yIndex + 5)[xIndex];
    reg1.z = idata.row(yIndex + 6)[xIndex];
    reg1.w = idata.row(yIndex + 7)[xIndex];

continued on the next slide...

Page 18

TRANSPOSE
TRANSPOSE SHUFFLE CODE (CONT.)

    unsigned int isEven = laneIsEven<unsigned int>();
    int4 target = isEven ? reg1 : reg0;

    target.x = __shfl_xor(target.x, 1);
    target.y = __shfl_xor(target.y, 1);
    target.z = __shfl_xor(target.z, 1);
    target.w = __shfl_xor(target.w, 1);

    const int oIndexY = blockIdx.x * blockDim.x + (threadIdx.x >> 1) * 2;
    const int oIndexX = yIndex1 + (isEven == 0);

    if (isEven) reg1 = target; else reg0 = target;

    odata(oIndexY + 0, oIndexX, reg0);
    odata(oIndexY + 1, oIndexX, reg1);
}

The code is available in the cumib microbenchmarks library.

Page 19

TRANSPOSE
RESULTS

approach / time, ms       GK107*   GK20A**   GM107***
Copy                      0.486    2.182     0.658
CopySharedMem             0.494    2.198     0.623
CopySharedMemPlus1        0.500    2.188     0.691
TransposeCoalescedPlus1   0.569    2.345     0.631
TransposeCoalesced        0.808    3.274     0.771
TransposeShuffle          1.253    2.352     0.689
TransposeNaive            1.470    5.338     1.614
TransposeNaiveBlock       1.735    5.477     1.451

* Time has been measured on 1080p input for GK107 with 2 SMX (1 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GK20A with 1 SMX (0.6 GHz)

*** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 20

WHEN TO USE KERNEL FUSION?

Batch of small kernels: a competitive alternative to kernel unrolling, since it improves the instruction-per-byte ratio.

Append one or more small kernels to a register-heavy kernel: might affect kernel unrolling factors or launch parameters.

Page 21

REDUCTION

Page 22

REDUCTION
S = ∑_{i=1}^{n} f(I_i)

__global__ void reduceNaive(const int* idata, int* odata, size_t N)
{
    int partial = 0;
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < N; i += blockDim.x * gridDim.x)
        partial += idata[i];

    atomicAdd(odata, partial);
}

A naive implementation with CUDA atomics is limited by the atomic throughput of the write queue, therefore hierarchical approaches are used:

Grid level: meta-reduction approach
Block level: block reduction approach

Page 23

META-REDUCTION APPROACHES

Tree: divide the array of N elements by factor b (block size). The problem size grows with N. Number of blocks used: (N - 1) / (b - 1)

2-level: use a constant number of blocks C. Each block processes N/C elements. The problem size is independent from N; C is a hardware-dependent heuristic equal to the block size b. Number of blocks used: (C + 1)

Constant & atomic: one level of reduction with a fixed number of blocks C. Each block performs a block-wide reduction and stores its result to gmem using an atomic write.

Page 24

BLOCK REDUCTION APPROACHES
RANKING

__global__ void reduceRanking(const int* idata, int* odata, size_t N)
{
    extern __shared__ int partials[];
    int thread_partial = thread_reduce(idata, N);
    partials[threadIdx.x] = thread_partial;
    __syncthreads();
    if (threadIdx.x < 512) partials[threadIdx.x] += partials[threadIdx.x + 512];
    __syncthreads();
    if (threadIdx.x < 256) partials[threadIdx.x] += partials[threadIdx.x + 256];
    __syncthreads();
    if (threadIdx.x < 128) partials[threadIdx.x] += partials[threadIdx.x + 128];
    __syncthreads();
    if (threadIdx.x < 64) partials[threadIdx.x] += partials[threadIdx.x + 64];
    __syncthreads();
    if (threadIdx.x < 32) warp_reduce(partials);
    if (!threadIdx.x) odata[blockIdx.x] = partials[0];
}

Page 25

BLOCK REDUCTION APPROACHES
WARP REDUCE

template <int SIZE>
__device__ __forceinline__ void warp_reduce(int val, volatile int* smem)
{
    #pragma unroll
    for (int offset = SIZE >> 1; offset >= 1; offset >>= 1)
        val += __shfl_xor(val, offset);

    int warpId = warp::id();
    int laneId = warp::lane();
    if (!laneId) smem[warpId] = val;
}

Page 26

BLOCK REDUCTION APPROACHES
WARP-CENTRIC REDUCTION

__global__ void reduceWarpCentric(const int* idata, int* odata, size_t N)
{
    __shared__ int partials[WARP_SIZE];
    int thread_partial = thread_reduce(idata, N);
    warp_reduce<WARP_SIZE>(thread_partial, partials);
    __syncthreads();

    if (!warp::id())
        warp_reduce<NUM_WARPS_IN_BLOCK>(partials[threadIdx.x], partials);
    if (!threadIdx.x) odata[blockIdx.x] = partials[0];
}

Page 27

BLOCK REDUCTION APPROACHES
RESULTS

approach       block size   transaction, bit   bandwidth*, GB/s
warp-centric   32           32                 32.58
warp-centric   128          32                 56.07
warp-centric   256          32                 56.32
warp-centric   32           128                44.22
warp-centric   128          128                56.10
warp-centric   256          128                56.74
ranking        32           32                 32.32
ranking        128          32                 55.16
ranking        256          32                 55.36
ranking        32           128                44.00
ranking        128          128                56.56
ranking        256          128                57.12

* Bandwidth has been measured on 1080p input for GK208 with 2 SMX (1 GHz), 64-bit GDDR5

Page 28

OPTIMIZING CONTROL FLOW

Page 29

SLIDING WINDOW DETECTOR

function isPedestrian(x)
    d ← 0
    for t = 1...T do
        d ← d + C(x)
        if d < r_t then
            return false
        end if
    end for
    return true
end function

Page 30

SLIDING WINDOW DETECTOR
THREAD PER WINDOW

Page 31

THREAD PER WINDOW: ANALYSIS

Coalesced access to gmem in the beginning
Sparse access in the latest stages
Unbalanced workload: time of block residence on an SM is T(b) = max{T(w_0), ..., T(w_block_size)}
A warp processes 32 sequential window positions, so it is likely to diverge

Page 32

SLIDING WINDOW DETECTOR
WARP PER WINDOW

Page 33

SLIDING WINDOW DETECTOR
WARP PER WINDOW: ANALYSIS

All lanes in a warp load different features; the access pattern is random
Textures are used to amortize the random pattern
Work is balanced within a warp. Warps in a block compute neighboring windows, so the likelihood that the block needs the same number of features is high
Use a warp-wide prefix sum for decision making

for (int offset = ExecutionPolicy::warp_pixels; offset < 32; offset *= 2)
{
    asm volatile(
        "{"
        "  .reg .f32 r0;"
        "  .reg .pred p;"
        "  shfl.up.b32 r0|p, %0, %1, 0x0;"
        "  @p add.f32 r0, r0, %0;"
        "  mov.f32 %0, r0;"
        "}"
        : "+f"(impact) : "r"(offset));
}

Page 34

SLIDING WINDOW DETECTOR
WARP PER N WINDOWS

Page 35

SLIDING WINDOW DETECTOR
WARP PER N WINDOWS: ANALYSIS, N = 4

__device__ __forceinline__ static int pixel() { return threadIdx.x & 3; }
__device__ __forceinline__ static int stage() { return threadIdx.x >> 2; }

Each warp loads 8 features for 4 windows. Each feature consists of 4 pixels. Total warp transactions: 8x4 of 16 bytes (instead of 32x4 of 4 bytes). A warp stays active while at least one of its window positions is active.

uint decision = (confidence + impact > trace[t]);
uint mask = __ballot(decision);
uint pattern = 0x11111111 << pixel;

if (active && (__popc(mask & pattern) != 8)) active = 0;
if (__all(!active)) break;

Page 36

SLIDING WINDOW DETECTOR
RESULTS

video sequence   thread/window, ms   warp/window, ms   warp/4 windows, ms   speedup w/w, X   speedup w/4w, X
seq06            169.13              98.29             27.13                1.72             6.23
seq07            166.92              100.12            36.52                1.66             4.57
seq08            172.89              98.12             38.87                1.76             4.44
seq09            175.82              102.54            34.18                1.76             4.45
seq10            144.13              96.87             32.40                1.71             5.14

Page 37

FINAL WORDS

Use warp-wise approaches for memory-bound kernels
Minimize transactions with global memory
Load only the bytes needed
Use hierarchical approaches for non-parallelizable code
Consider data dependencies while optimizing complicated algorithms

Page 38

THE END

BY CUDA.GEEK / 2013–2015