code gpu with cuda - applying optimization techniques

CODE GPU WITH CUDA: APPLYING OPTIMIZATION TECHNIQUES
Created by Marina Kolpakova for cuda.geek, Itseez

Upload: marina-kolpakova

Post on 11-Apr-2017


TRANSCRIPT

Page 1: Code GPU with CUDA - Applying optimization techniques

CODE GPU WITH CUDA
APPLYING OPTIMIZATION TECHNIQUES

Created by Marina Kolpakova for cuda.geek, Itseez


Page 2

OUTLINE

Threshold

Transpose

Optimizing control flow

Streaming kernels

Reduction

Page 3

STREAMING KERNELS

Page 4

STREAMING KERNEL
y = f(x)

template <typename Ptr2DIn, typename Ptr2DOut, typename Op>
__global__ void streaming(const Ptr2DIn src, Ptr2DOut dst)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;

    if (x < dst.cols && y < dst.rows)
        dst(y, x) = saturate_cast<typename Ptr2DOut::elem_type>(Op::apply(src(y, x)));
}

dim3 block(block_x, block_y);
dim3 grid(roundUp(dst.cols, block_x), roundUp(dst.rows, block_y));
streaming<<<grid, block>>>(src, dst);

General arithmetic and conversions, repack by map, resize, etc.

Page 5

THRESHOLD
PIXEL PER THREAD

y = max(x, τ)

__global__ void threshold_bw(const DPtrb src, DPtrb dst, int32s cols, int32s rows, int8u thr)
{
    const int x = blockDim.x * blockIdx.x + threadIdx.x;
    const int y = blockDim.y * blockIdx.y + threadIdx.y;

    if (x < cols && y < rows)
        dst.row(y)[x] = max(thr, src.row(y)[x]);
}

The code is available in the cumib microbenchmarks library.

Adjusting launch parameters for specific hardware

Page 6

THRESHOLD
PIXEL PER THREAD: RESULTS

block size   GK107*, μs   X-factor   GM107**, μs   X-factor
32x16        456.49       1.00       214.94        1.00
32x8         431.42       1.06       210.39        1.02
32x6         435.87       1.05       226.02        0.95
32x4         412.01       1.11       222.11        0.98
32x2         785.36       0.58       228.28        0.94
32x1         1516.19      0.30       419.94        0.51

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 7

THRESHOLD
KERNEL UNROLLING: IMAGE ROW PER WARP

__global__ void threshold_bw(const DPtrb src, DPtrb dst, int32s cols, int32s rows, int8u thr)
{
    int block_id = (blockIdx.y * gridDim.x + blockIdx.x) * (blockDim.x / warpSize);
    int y = (threadIdx.y * blockDim.x + threadIdx.x) / warpSize + block_id;

    if (y < rows)
        for (int x = lane(); x < cols; x += warpSize)
            dst.row(y)[x] = max(thr, src.row(y)[x]);
}

The code is available in the cumib microbenchmarks library.

block size       GK107*, μs   X-factor   GM107**, μs   X-factor
blockwise 32x4   412.01       1.00       226.02        1.00
warpwise 8       223.01       1.85       265.13        0.85
warpwise 4       222.53       1.85       248.04        0.91
warpwise 2       374.47       1.10       246.83        0.92

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 8

THRESHOLD
KERNEL UNROLLING: MORE INDEPENDENT ELEMENTS

// same as previous
unsigned char tmp[2];
if (y * 2 < rows - 2)
    for (int x = lane(); x < cols; x += warpSize)
    {
        tmp[0] = max(threshold, src.row(y * 2)[x]);
        tmp[1] = max(threshold, src.row(y * 2 + 1)[x]);

        dst.row(y * 2)[x] = tmp[0];
        dst.row(y * 2 + 1)[x] = tmp[1];
    }
else { /* compute tail */ }

The code is available in the cumib microbenchmarks library.

block size       GK107*, μs   X-factor   GM107**, μs   X-factor
blockwise 32x4   412.01       1.00       226.02        1.10
warpwise 4       222.53       1.85       248.04        0.91
warpwise 4 U2    185.28       2.22       269.75        0.83

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 9

THRESHOLD
KERNEL UNROLLING: USE WIDER TRANSACTIONS

unsigned char -> unsigned integer

template <typename T> __device__ __forceinline__ T vmax_u8(T, T);
template <> __device__ __forceinline__ unsigned vmax_u8(unsigned v, unsigned m)
{
    unsigned int res = 0;
    asm("vmax4.u32.u32.u32 %0, %1, %2, %3;" : "=r"(res) : "r"(v), "r"(m), "r"(0));
    return res;
}

// same as previous
if (y < rows)
    for (int x = lane(); x < cols / sizeof(int32u); x += warpSize)
    {
        int32u tmp = src.row<unsigned int>(y)[x];
        int32u res = vmax_u8(tmp, mask);
        dst.row<unsigned int>(y)[x] = res;
    }

The code is available in the cumib microbenchmarks library.

Page 10

THRESHOLD
WIDER TRANSACTIONS: RESULTS

block size       GK107*, μs   X-factor   GM107**, μs   X-factor
blockwise 32x4   412.01       1.00       226.02        1.10
warpwise 4       222.53       1.85       248.04        0.91
warpwise 4 U2    185.28       2.22       269.75        0.83
warpwise 2W      126.01       3.27       245.21        0.92
warpwise 2WU2    96.71        4.26       162.83        1.39
warpwise 4W      85.22        4.83       257.69        0.88
warpwise 4WU2    83.42        4.93       161.18        1.40

* Time has been measured on 1080p input for GK107 with 2 SMX (1.0 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 11

TRANSPOSE
D(i, j) = S(j, i)

template <typename T> __global__
void transposeNaive(const DPtr<T> idata, DPtr<T> odata, int cols, int rows)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;

    odata.row(xIndex)[yIndex] = idata.row(yIndex)[xIndex];
}

The code is available in the cumib microbenchmarks library.

Page 12

TRANSPOSE
COALESCE MEMORY ACCESS: SMEM USAGE

Split the input matrix into tiles, assigning one thread block per tile. Tile size (in elements) and block size (in threads) are not necessarily the same. Load a tile into smem in coalesced fashion -> read from smem by column -> write to the destination in coalesced fashion.

Page 13

COALESCE MEMORY ACCESS: SMEM USAGE CODE

template <typename T> __global__
void transposeCoalesced(const DPtr<T> idata, DPtr<T> odata, int cols, int rows)
{
    __shared__ float tile[TRANSPOSE_TILE_DIM][TRANSPOSE_TILE_DIM];

    int xIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata.row(yIndex + i)[xIndex];

    __syncthreads();

    xIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        odata.row(yIndex + i)[xIndex] = tile[threadIdx.x][threadIdx.y + i];
}

The code is available in the cumib microbenchmarks library.

Page 14

TRANSPOSE
SMEM ACCESSES: AVOID BANK CONFLICTS

template <typename T> __global__
void transposeCoalescedPlus1(const DPtr<T> idata, DPtr<T> odata, int cols, int rows)
{
    __shared__ float tile[TRANSPOSE_TILE_DIM][TRANSPOSE_TILE_DIM + 1];

    int xIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata.row(yIndex + i)[xIndex];

    __syncthreads();

    xIndex = blockIdx.y * TRANSPOSE_TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TRANSPOSE_TILE_DIM + threadIdx.y;
    for (int i = 0; i < TRANSPOSE_TILE_DIM; i += TRANSPOSE_BLOCK_ROWS)
        odata.row(yIndex + i)[xIndex] = tile[threadIdx.x][threadIdx.y + i];
}

The code is available in the cumib microbenchmarks library.

Page 15

TRANSPOSE
WARP SHUFFLE

Page 16

TRANSPOSE
TRANSPOSE SHUFFLE

Page 17

TRANSPOSE
TRANSPOSE SHUFFLE CODE

__global__ void transposeShuffle(const DPtr32 idata, DPtr32 odata, int cols, int rows)
{
    int xIndex = blockIdx.x * blockDim.x + threadIdx.x;
    int yIndex = blockIdx.y * blockDim.y + threadIdx.y;
    int yIndex1 = yIndex * SHUFFLE_ELEMENTS_VECTORS;

    yIndex *= SHUFFLE_ELEMENTS_PER_WARP;

    int4 reg0, reg1;

    reg0.x = idata.row(yIndex + 0)[xIndex];
    reg0.y = idata.row(yIndex + 1)[xIndex];
    reg0.z = idata.row(yIndex + 2)[xIndex];
    reg0.w = idata.row(yIndex + 3)[xIndex];

    reg1.x = idata.row(yIndex + 4)[xIndex];
    reg1.y = idata.row(yIndex + 5)[xIndex];
    reg1.z = idata.row(yIndex + 6)[xIndex];
    reg1.w = idata.row(yIndex + 7)[xIndex];

continued on the next slide...

Page 18

TRANSPOSE
TRANSPOSE SHUFFLE CODE (CONT.)

    unsigned int isEven = laneIsEven<unsigned int>();
    int4 target = isEven ? reg1 : reg0;

    target.x = __shfl_xor(target.x, 1);
    target.y = __shfl_xor(target.y, 1);
    target.z = __shfl_xor(target.z, 1);
    target.w = __shfl_xor(target.w, 1);

    const int oIndexY = blockIdx.x * blockDim.x + (threadIdx.x >> 1) * 2;
    const int oIndexX = yIndex1 + (isEven == 0);

    if (isEven) reg1 = target; else reg0 = target;

    odata(oIndexY + 0, oIndexX, reg0);
    odata(oIndexY + 1, oIndexX, reg1);
}

The code is available in the cumib microbenchmarks library.

Page 19

TRANSPOSE
RESULTS

approach / time, ms       GK107*   GK20A**   GM107***
Copy                      0.486    2.182     0.658
CopySharedMem             0.494    2.198     0.623
CopySharedMemPlus1        0.500    2.188     0.691
TransposeCoalescedPlus1   0.569    2.345     0.631
TransposeCoalesced        0.808    3.274     0.771
TransposeShuffle          1.253    2.352     0.689
TransposeNaive            1.470    5.338     1.614
TransposeNaiveBlock       1.735    5.477     1.451

* Time has been measured on 1080p input for GK107 with 2 SMX (1 GHz), 128-bit GDDR5
** Time has been measured on 1080p input for GK20A with 1 SMX (0.6 GHz)

*** Time has been measured on 1080p input for GM107 with 5 SMX (0.5 GHz), 128-bit GDDR5

Page 20

WHEN TO USE KERNEL FUSION?

Batch of small kernels: a competitive alternative to kernel unrolling, since it improves the instruction-per-byte ratio.

Append one or more small kernels to a register-heavy kernel: might affect kernel unrolling factors or launch parameters.

Page 21

REDUCTION

Page 22

REDUCTION
S = ∑_{i=1}^{n} f(I_i)

__global__ void reduceNaive(const int* idata, int* odata, size_t N)
{
    int partial = 0;
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    for (; i < N; i += blockDim.x * gridDim.x)
        partial += idata[i];

    atomicAdd(odata, partial);
}

A naive implementation with CUDA atomics is limited by the atomic throughput of the write queue, therefore hierarchical approaches are used:

Grid level: meta-reduction approach
Block level: block reduction approach

Page 23

META-REDUCTION APPROACHES

Tree: divide the array of N elements by factor b (block size). The problem size grows with N. Number of blocks used: (N - 1) / (b - 1)

2-level: use a constant number of blocks C. Each block processes N/C elements. The problem size is independent from N; C is a hardware-dependent heuristic equal to the block size b. Number of blocks used: (C + 1)

Constant & atomic: one level of reduction with a fixed number of blocks C. Each block performs a block-wide reduction and stores its result to gmem using an atomic write.

Page 24

BLOCK REDUCTION APPROACHES
RANKING

__global__ void reduceRanking(const int* idata, int* odata, size_t N)
{
    extern __shared__ int partials[];
    int thread_partial = thread_reduce(idata, N);
    partials[threadIdx.x] = thread_partial;
    __syncthreads();
    if (threadIdx.x < 512) partials[threadIdx.x] += partials[threadIdx.x + 512];
    __syncthreads();
    if (threadIdx.x < 256) partials[threadIdx.x] += partials[threadIdx.x + 256];
    __syncthreads();
    if (threadIdx.x < 128) partials[threadIdx.x] += partials[threadIdx.x + 128];
    __syncthreads();
    if (threadIdx.x < 64) partials[threadIdx.x] += partials[threadIdx.x + 64];
    __syncthreads();
    if (threadIdx.x < 32) warp_reduce(partials);
    if (!threadIdx.x) odata[blockIdx.x] = partials[0];
}

Page 25

BLOCK REDUCTION APPROACHES
WARP REDUCE

template <int SIZE>
__device__ __forceinline__ void warp_reduce(int val, volatile int* smem)
{
    #pragma unroll
    for (int offset = SIZE >> 1; offset >= 1; offset >>= 1)
        val += __shfl_xor(val, offset);

    int warpId = warp::id();
    int laneId = warp::lane();
    if (!laneId) smem[warpId] = val;
}

Page 26

BLOCK REDUCTION APPROACHES
WARP-CENTRIC REDUCTION

__global__ void reduceWarpCentric(const int* idata, int* odata, size_t N)
{
    __shared__ int partials[WARP_SIZE];
    int thread_partial = thread_reduce(idata, N);
    warp_reduce<WARP_SIZE>(thread_partial, partials);
    __syncthreads();

    if (!warp::id())
        warp_reduce<NUM_WARPS_IN_BLOCK>(partials[threadIdx.x], partials);
    if (!threadIdx.x) odata[blockIdx.x] = partials[0];
}

Page 27

BLOCK REDUCTION APPROACHES
RESULTS

approach       block size   transaction, bit   bandwidth*, GB/s
warp-centric   32           32                 32.58
warp-centric   128          32                 56.07
warp-centric   256          32                 56.32
warp-centric   32           128                44.22
warp-centric   128          128                56.10
warp-centric   256          128                56.74
ranking        32           32                 32.32
ranking        128          32                 55.16
ranking        256          32                 55.36
ranking        32           128                44.00
ranking        128          128                56.56
ranking        256          128                57.12

* Bandwidth has been measured on 1080p input for GK208 with 2 SMX (1 GHz), 64-bit GDDR5

Page 28

OPTIMIZING CONTROL FLOW

Page 29

SLIDING WINDOW DETECTOR

function isPedestrian(x)
    d ← 0
    for t = 1...T do
        d ← d + C(x)
        if d < r_t then
            return false
        end if
    end for
    return true
end function

Page 30

SLIDING WINDOW DETECTOR
THREAD PER WINDOW

Page 31

THREAD PER WINDOW: ANALYSIS

Coalesced access to gmem in the beginning
Sparse access in the latest stages
Unbalanced workload: time of block residence on an SM is T(b) = max{T(w_0), ..., T(w_block_size)}
A warp processes 32 sequential window positions, so it is likely to diverge

Page 32

SLIDING WINDOW DETECTOR
WARP PER WINDOW

Page 33

SLIDING WINDOW DETECTOR
WARP PER WINDOW: ANALYSIS

All lanes in a warp load different features; the access pattern is random
Textures are used to amortize the random pattern
Work is balanced within a warp. Warps in a block compute neighboring windows, so the likelihood that the block needs the same number of features is high
Use a warp-wide prefix sum for decision making

for (int offset = ExecutionPolicy::warp_pixels; offset < 32; offset *= 2)
{
    asm volatile(
        "{"
        "  .reg .f32 r0;"
        "  .reg .pred p;"
        "  shfl.up.b32 r0|p, %0, %1, 0x0;"
        "  @p add.f32 r0, r0, %0;"
        "  mov.f32 %0, r0;"
        "}"
        : "+f"(impact) : "r"(offset));
}

Page 34

SLIDING WINDOW DETECTOR
WARP PER N WINDOWS

Page 35

SLIDING WINDOW DETECTOR
WARP PER N WINDOWS: ANALYSIS, N = 4

__device__ __forceinline__ static int pixel() { return threadIdx.x & 3; }
__device__ __forceinline__ static int stage() { return threadIdx.x >> 2; }

Each warp loads 8 features for 4 windows. Each feature consists of 4 pixels. Total warp transactions: 8x4 of 16 bytes (instead of 32x4 of 4 bytes). A warp stays active while at least one of its window positions is active.

uint decision = (confidence + impact > trace[t]);
uint mask = __ballot(decision);
uint pattern = 0x11111111 << pixel;

if (active && (__popc(mask & pattern) != 8)) active = 0;
if (__all(!active)) break;

Page 36

SLIDING WINDOW DETECTOR
RESULTS

video sequence   thread/window, ms   warp/window, ms   warp/4 windows, ms   speedup w/w, X   speedup w/4w, X
seq06            169.13              98.29             27.13                1.72             6.23
seq07            166.92              100.12            36.52                1.66             4.57
seq08            172.89              98.12             38.87                1.76             4.44
seq09            175.82              102.54            34.18                1.76             4.45
seq10            144.13              96.87             32.40                1.71             5.14

Page 37

FINAL WORDS

Use warp-wise approaches for memory-bound kernels
Minimize transactions with global memory
Load only the bytes needed
Use hierarchical approaches for non-parallelizable code
Consider data dependencies while optimizing complicated algorithms

Page 38

THE END

BY CUDA.GEEK / 2013–2015