Debugging in Serial & Parallel
M. D. Jones, Ph.D.
Center for Computational Research
University at Buffalo
State University of New York
High Performance Computing I, 2012
M. D. Jones, Ph.D. (CCR/UB) Debugging in Serial & Parallel HPC-I Fall 2012 1 / 89
Part I
Basic (Serial) Debugging
Introduction
Software for Debugging
The most common method for debugging (by far) is the instrumentation method:
One instruments the code with print statements to check values and follow the execution of the program
Not exactly sophisticated: one can certainly debug code in this way, but wise use of software debugging tools can be more effective
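As a minimal sketch of the instrumentation method in C, print statements can be wrapped in a macro so that they are compiled in only on request. The names DEBUG_PRINT and sum_to are illustrative assumptions, not part of the course code.

```c
#include <stdio.h>

/* Hypothetical helper: prints to stderr only when compiled with -DDEBUG. */
#ifdef DEBUG
#define DEBUG_PRINT(...) fprintf(stderr, __VA_ARGS__)
#else
#define DEBUG_PRINT(...) ((void)0)
#endif

/* Illustrative function: sums the integers 1..n, tracing each step. */
int sum_to(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;
        DEBUG_PRINT("sum_to: i=%d total=%d\n", i, total);
    }
    return total;
}
```

Building with -DDEBUG turns the trace on; building without it leaves the function's result unchanged and removes the output.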
Debugging Tools
Debugging tools are abundant, but we will focus merely on some of the most common attributes to give you a bag of tricks that can be used when dealing with common problems.
Basic Capabilities
Common attributes:
Divided into command-line or graphical user interfaces
Usually you have to recompile your code (-g is almost a standard option to enable debugging) to utilize most debugger features
Invocation is by name of debugger and executable (e.g., gdb ./a.out [core])
Running Within
Inside a debugger (be it using a command-line interface (CLI) or graphical front-end), you have some very handy abilities:
Look at source code listing (very handy when isolating an IEEE exception)
Line-by-line execution
Insert stops or breakpoints at certain functional points (i.e., when critical values change)
Ability to monitor variable values
Look at stack trace (or backtrace) when code crashes
Command-line debugging example
Consider the following code example:
    #include <stdio.h>
    #include <stdlib.h>

    int indx;

    void initArray(int nelem_in_array, int *array);
    void printArray(int nelem_in_array, int *array);
    int squareArray(int nelem_in_array, int *array);

    int main(void) {
      const int nelem = 10;
      int *array1, *array2, *del;

      /* Allocate memory for each array */
      array1 = (int *) malloc(nelem * sizeof(int));
      array2 = (int *) malloc(nelem * sizeof(int));
      del = (int *) malloc(nelem * sizeof(int));

      /* Initialize array1 */
      initArray(nelem, array1);

      for (indx = 0; indx < nelem; indx++) {
        array1[indx] = indx + 2;
      }

      /* Print the elements of array1 */
      printf("array1 = ");
      printArray(nelem, array1);

      /* Copy array1 to array2 */
      array2 = array1;

      /* Pass array2 to the function squareArray() */
      squareArray(nelem, array2);

      /* Compute difference between elements of array2 and array1 */
      for (indx = 0; indx < nelem; indx++) {
        del[indx] = array2[indx] - array1[indx];
      }

      /* Print the computed differences */
      printf("The difference in the elements of array2 and array1 are: ");
      printArray(nelem, del);

      free(array1);
      free(array2);
      free(del);
      return 0;
    }

    void initArray(const int nelem_in_array, int *array) {
      for (indx = 0; indx < nelem_in_array; indx++) {
        array[indx] = indx + 1;
      }
    }

    int squareArray(const int nelem_in_array, int *array) {
      int indx;
      for (indx = 0; indx < nelem_in_array; indx++) {
        array[indx] *= array[indx];
      }
      return *array;
    }

    void printArray(const int nelem_in_array, int *array) {
      printf("\n( ");
      for (indx = 0; indx < nelem_in_array; indx++) {
        printf("%d ", array[indx]);
      }
      printf(")\n");
    }
OK, now let's compile and run this code:
    [bono:~/d_debug]$ gcc -g -o arrayex arrayex.c
    [bono:~/d_debug]$ ./arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )
    The difference in the elements of array2 and array1 are:
    ( 0 0 0 0 0 0 0 0 0 0 )
    *** glibc detected *** double free or corruption (fasttop): 0x0000000000501010 ***
    Aborted
Not exactly what we expect, is it? array2 should contain the squares of the values in array1, and therefore the difference should be i^2 - i for i = 2, ..., 11.
Now let us run the code from within gdb. Our goal is to set a breakpoint where the squared array elements are computed, then step through the code:
    [bono:~/d_debug]$ gdb arrayex
    (gdb) l 34
    31
    32        /* Copy array1 to array2 */
    33        array2 = array1;
    34
    35        /* Pass array2 to the function squareArray() */
    36        squareArray(nelem, array2);
    37
    38        /* Compute difference between elements of array2 and array1 */
    39        for (indx = 0; indx < nelem; indx++) {
    40          del[indx] = array2[indx] - array1[indx];
    (gdb) b 34
    Breakpoint 1 at 0x400604: file ex1.c, line 34.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )

    Breakpoint 1, main () at arrayex.c:34
    34        squareArray(nelem, array2);
    (gdb) s
    squareArray (nelem_in_array=10, array=0x501010) at arrayex.c:59
    59        for (indx = 0; indx < nelem_in_array; indx++) {
    (gdb) s
    60          array[indx] *= array[indx];
    (gdb) s
    59        for (indx = 0; indx < nelem_in_array; indx++) {
    (gdb) p indx
    $1 = 0
    (gdb) p array[indx]
    $2 = 4
    (gdb) display indx
    1: indx = 0
    (gdb) display array[indx]
    2: array[indx] = 4
    (gdb) s
    60          array[indx] *= array[indx];
    2: array[indx] = 3
    1: indx = 1
Ok, that is instructive, but no closer to finding the bug.
So, what have we learned so far about the command-line debugger:
Useful for peeking inside source code
(break) Breakpoints
(s) Stepping through execution
(p) Print values at selected points (can also use handy printf syntax as in C)
(display) Displaying values for monitoring while stepping through code
(bt) Backtrace, or stack trace: haven't used this yet, but certainly will
Digging Out the Bug
What we have learned is enough. Look more closely at the line where the differences between array1 and array2 are computed:
    (gdb) l 38
    33        /* Pass array2 to the function squareArray() */
    34        squareArray(nelem, array2);
    35
    36        /* Compute difference between elements of array2 and array1 */
    37        for (indx = 0; indx < nelem; indx++) {
    38          del[indx] = array2[indx] - array1[indx];
    39        }
    40
    41        /* Print the computed differences */
    42        printf("The difference in the elements of array2 and array1 are: ");
    (gdb) b 37
    Breakpoint 1 at 0x400611: file arrayex.c, line 37.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )

    Breakpoint 1, main () at arrayex.c:37
    37        for (indx = 0; indx < nelem; indx++) {
    (gdb) disp indx
    1: indx = 10
    (gdb) disp array1[indx]
    2: array1[indx] = 49
    (gdb) disp array2[indx]
    3: array2[indx] = 49
    (gdb) s
    38          del[indx] = array2[indx] - array1[indx];
    3: array2[indx] = 4
    2: array1[indx] = 4
    1: indx = 0
    (gdb) s
    37        for (indx = 0; indx < nelem; indx++) {
    3: array2[indx] = 4
    2: array1[indx] = 4
    1: indx = 0
    (gdb) s
    38          del[indx] = array2[indx] - array1[indx];
    3: array2[indx] = 9
    2: array1[indx] = 9
    1: indx = 1
Now that isn't right: array1 was not supposed to change. Let us go back and look more closely at the call to squareArray ...
    (gdb) l
    32
    33        /* Pass array2 to the function squareArray() */
    34        squareArray(nelem, array2);
    35
    36        /* Compute difference between elements of array2 and array1 */
    37        for (indx = 0; indx < nelem; indx++) {
    38          del[indx] = array2[indx] - array1[indx];
    39        }
    40
    41        /* Print the computed differences */
    (gdb) b 34
    Breakpoint 2 at 0x400605: file arrayex.c, line 34.
    (gdb) run
    The program being debugged has been started already.
    Start it from the beginning? (y or n) y

    Starting program: /san/user/jonesm/u2/d_debug/arrayex
    array1 =
    ( 2 3 4 5 6 7 8 9 10 11 )

    Breakpoint 2, main () at arrayex.c:34
    34        squareArray(nelem, array2);
    3: array2[indx] = 49
    2: array1[indx] = 49
    1: indx = 10
    (gdb) disp array2
    4: array2 = (int *) 0x501010
    (gdb) disp array1
    5: array1 = (int *) 0x501010
Yikes, array1 and array2 point to the same memory location! See, pointer errors like this don't happen too often in Fortran ... Now, of course, the bug is obvious. But aren't they all obvious after you find them?
The Fix Is In
Just as an afterthought, what we ought to have done in the first place was copy array1 into array2:
    /* Copy array1 to array2 */
    /* array2 = array1; */
    for (indx = 0; indx
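The corrected loop is cut off in this transcript. As a hedged sketch of the idea only (the helper names copyArray and squareInPlace are assumptions, not the author's code), copying element by element keeps the two pointers referring to separate blocks of memory:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in the original listing): copy the contents of
   src into dst, element by element, so that the two pointers keep
   referring to separate blocks of memory. */
void copyArray(int nelem_in_array, const int *src, int *dst) {
    assert(src != NULL && dst != NULL);
    for (int i = 0; i < nelem_in_array; i++) {
        dst[i] = src[i];
    }
}

/* Same squaring step as in the example: square each element in place. */
void squareInPlace(int nelem_in_array, int *array) {
    assert(array != NULL);
    for (int i = 0; i < nelem_in_array; i++) {
        array[i] *= array[i];
    }
}
```

With separate storage, squaring the copy leaves the source array untouched, and each allocated block is freed exactly once.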
Array Indexing Errors
Array indexing errors are one of the most common errors in both sequential and parallel codes, and it is not entirely surprising:
Different languages have different indexing defaults
Multi-dimensional arrays are pretty easy to reference out-of-bounds
Fortran in particular lets you use very complex indexing schemes (essentially arbitrary!)
Example: Indexing Error
    #include <stdio.h>
    #define N 10
    int main(int argc, char *argv[]) {
      int arr[N];
      int i, odd_sum, even_sum;

      for (i = 1; i
Now, try compiling with gcc and running the code:
    [bono:~/d_debug]$ gcc -O -g -o ex2 ex2.c
    [bono:~/d_debug]$ ./ex2
    odd_sum=5, even_sum=671173703
OK, that hardly seems reasonable (does it?). Now, let's run this example from within gdb and set a breakpoint to examine the accumulation of values to even_sum.
    (gdb) l 16
    11          arr[i] = (i*i)%5;
    12        }
    13      }
    14      odd_sum = 0;
    15      even_sum = 0;
    16      for (i = 0; i
So we see that our original example code missed initializing the first element of the array, and the results were rather erratic (in fact they will likely be compiler and flag dependent).
Initialization is just one aspect of things going wrong with array indexing. Let us examine another common problem ...
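As a rough sketch of the corrected pattern, the fill loop below starts at index 0 so that every element is written before it is read. The loop bodies of the original example are cut off in this transcript, so the function name sum_odd_even and the split into odd-indexed and even-indexed sums are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define N 10

/* Illustrative sketch: fill arr[0..N-1], then accumulate separate sums over
   the odd-indexed and even-indexed elements. The fill loop starts at
   i = 0, so arr[0] is initialized before it is read. */
void sum_odd_even(int *odd_sum, int *even_sum) {
    int arr[N];
    assert(odd_sum != NULL && even_sum != NULL);

    for (int i = 0; i < N; i++) {      /* i = 0, not i = 1 */
        arr[i] = (i * i) % 5;
    }

    *odd_sum = 0;
    *even_sum = 0;
    for (int i = 0; i < N; i++) {
        if (i % 2 == 0) {
            *even_sum += arr[i];
        } else {
            *odd_sum += arr[i];
        }
    }
}
```

Because no element is read uninitialized, the two sums no longer depend on the compiler or its flags.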
The (Infamous) Seg Fault
I borrowed this example from Norman Matloff (UC Davis), who has a nice article (well worth the time to read), "Guide to Faster, Less Frustrating Debugging", which you can find easily enough on the web:
http://heather.cs.ucdavis.edu/~matloff/unix.html
Main code: findprimes.c
    /*
       prime-number finding program; will (after bugs are fixed) report a list of
       all primes which are less than or equal to the user-supplied upper bound
    */
    #include <stdio.h>

    #define MaxPrimes 50
    int Prime[MaxPrimes],  /* Prime[I] will be 1 if I is prime, 0 otherwise */
        UpperBound;        /* we will check up through UpperBound for primeness */
    void CheckPrime(int K);  /* prototype for CheckPrime function */
    int main()
    {
      int N;

      printf("enter upper bound\n");
      scanf("%d", UpperBound);

      Prime[2] = 1;

      for (N = 3; N <= UpperBound; N += 2)
        CheckPrime(N);
        if (Prime[N]) printf("%d is a prime\n", N);
    }
Function CheckPrime:
    void CheckPrime(int K) {
      int J;

      /* the plan: see if J divides K, for all values J which are
         (a) themselves prime (no need to try J if it is nonprime), and
         (b) less than or equal to sqrt(K) (if K has a divisor larger
             than this square root, it must also have a smaller one,
             so no need to check for larger ones) */

      J = 2;
      while (1) {
        if (Prime[J] == 1)
          if (K % J == 0) {
            Prime[K] = 0;
            return;
          }
        J++;
      }

      /* if we get here, then there were no divisors of K, so it is
         prime */
      Prime[K] = 1;
    }
so now if we compile and run this code ...
    [bono:~/d_debug]$ gcc -g -o findprimes findprimes.c
    [bono:~/d_debug]$ ./findprimes
    enter upper bound
    20
    Segmentation fault
OK, let's fire up gdb and see where this code crashed:
    [bono:~/d_debug]$ gdb findprimes
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/findprimes
    enter upper bound
    20

    Program received signal SIGSEGV, Segmentation fault.
    0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    (gdb) bt
    #0  0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    #1  0x000000392815866a in scanf () from /lib64/tls/libc.so.6
    #2  0x0000000000400524 in main () at findprimes.c:16
    (gdb)
The backtrace leads to the scanf call on line 16 of findprimes.c, which passes UpperBound itself rather than its address. After changing the argument to &UpperBound, recompile and run under gdb again:
    [bono:~/d_debug]$ gcc -g -o findprimes findprimes.c
    [bono:~/d_debug]$ gdb findprimes
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/findprimes
    enter upper bound
    20

    Program received signal SIGSEGV, Segmentation fault.
    0x0000000000400586 in CheckPrime (K=3) at findprimes.c:37
    37          if (Prime[J] == 1)
    (gdb) bt
    #0  0x0000000000400586 in CheckPrime (K=3) at findprimes.c:37
    #1  0x0000000000400547 in main () at findprimes.c:21
    (gdb) l 37
    32             than this square root, it must also have a smaller one,
    33             so no need to check for larger ones) */
    34
    35        J = 2;
    36        while (1) {
    37          if (Prime[J] == 1)
    38            if (K % J == 0) {
    39              Prime[K] = 0;
    40              return;
    41            }
    (gdb)
Very often we get seg faults on trying to reference an array out-of-bounds, so have a look at the value of J:
    (gdb) p J
    $1 = 376
Oops! That is just a tad outside the bounds (50). Kind of forgot to put a cap on the value of J ...
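One way to cap J, following the plan stated in the function's own comment (only try divisors up to the square root of K), is sketched below. The exact loop bound used in the lecture is cut off in this transcript, so the condition J * J <= K and the added assert are assumptions.

```c
#include <assert.h>

#define MaxPrimes 50

/* Prime[I] is 1 if I is prime, 0 otherwise (same convention as the example). */
int Prime[MaxPrimes];

/* Sketch of a bounded CheckPrime: J stops once J*J exceeds K, so Prime[J]
   is never read past the end of the array. The loop bound is an
   assumption; the corrected line is cut off in the transcript. */
void CheckPrime(int K) {
    assert(K >= 2 && K < MaxPrimes);
    for (int J = 2; J * J <= K; J++) {
        if (Prime[J] == 1 && K % J == 0) {
            Prime[K] = 0;   /* found a prime divisor: K is composite */
            return;
        }
    }
    Prime[K] = 1;           /* no prime divisor found: K is prime */
}
```

As in the original program, this relies on Prime[2] being set to 1 and on candidates being checked in increasing order, so that smaller primes are already marked.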
Fixing the last bug:
    (gdb) l 40
    35        J = 2;
    36        /* while (1) { */
    37        for (J = 2; J*J
OK, so now we will set a couple of breakpoints: one at the call to CheckPrime and the second where a successful prime is to be output:
    (gdb) l
    16        scanf("%d", &UpperBound);
    17
    18        Prime[2] = 1;
    19
    20        for (N = 3; N
Another gotcha - misplaced (or no) braces. Fix that:
    (gdb) l
    16        scanf("%d", &UpperBound);
    17
    18        Prime[2] = 1;
    19
    20        for (N = 3; N
Debugging Life Itself
Well, OK, not exactly debugging life itself; rather the game of life. Mathematician John Horton Conway's game of life [1], to be exact. This example will be similar to the prior examples, but now we will work in Fortran, and debug some integer arithmetic errors.
And the context will be slightly more interesting.
[1] See, for example, Martin Gardner's article in Scientific American, 223, pp. 120-123 (1970).
Game of Life
The Game of Life is one of the better known examples of cellular automata (CA), namely a discrete model with a finite number of states, often used in theoretical biology, game theory, etc. The rules are actually pretty simple, and can lead to some rather surprising self-organizing behavior. The universe in the game of life:
Universe is an infinite 2D grid of cells, each of which is alive or dead
Cells interact only with nearest neighbors (including on the diagonals, which makes for eight neighbors)
Rules of Life
The rules in the game of life:
Any live cell with fewer than two neighbours dies, as if by loneliness
Any live cell with more than three neighbours dies, as if by overcrowding
Any live cell with two or three neighbours lives, unchanged, to the next generation
Any dead cell with exactly three neighbours comes to life
An initial pattern is evolved by simultaneously applying the above rules to the entire grid, and subsequently at each tick of the clock.
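The four rules collapse into a small decision on the neighbour count. A minimal sketch in C follows; the function name next_state is hypothetical, and the lecture's own implementation is the Fortran program in the next section.

```c
#include <assert.h>

/* Hypothetical helper: return the next state (1 = alive, 0 = dead) of a
   cell, given its current state and its number of live neighbours (0..8). */
int next_state(int alive, int live_neighbours) {
    assert(live_neighbours >= 0 && live_neighbours <= 8);
    if (live_neighbours == 3) {
        return 1;        /* birth, or survival with three neighbours */
    }
    if (live_neighbours == 2) {
        return alive;    /* unchanged: live stays alive, dead stays dead */
    }
    return 0;            /* loneliness (< 2) or overcrowding (> 3) */
}
```

Applying the rules "simultaneously" means every call reads the old grid and writes into a separate new grid, which is then copied back once per tick.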
Sample Code - Game of Life
    program life
    !
    ! Conway game of life (debugging example)
    !
      implicit none
      integer, parameter :: ni = 1000, nj = 1000, nsteps = 100
      integer :: i, j, n, im, ip, jm, jp, nsum, isum
      integer, dimension(0:ni,0:nj) :: old, new
      real :: arand, nim2, njm2
    !
    ! initialize elements of "old" to 0 or 1
    !
      do j = 1, nj
         do i = 1, ni
            CALL random_number(arand)
            old(i,j) = NINT(arand)
         enddo
      enddo
      nim2 = ni - 2
      njm2 = nj - 2
    !
    ! time iteration
    !
      time_iteration: do n = 1, nsteps
         do j = 1, nj
            do i = 1, ni
    !
    ! periodic boundaries,
    !
               im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
               ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
               jm = 1 + (j+njm2) - ((j+njm2)/nj)*nj   ! if j=1, nj
               jp = 1 + j - (j/nj)*nj                 ! if j=nj, 1
    !
    ! for each point, add surrounding values
    !
               nsum = old(im,jp) + old(i,jp) + old(ip,jp) &
                    + old(im,j )             + old(ip,j ) &
                    + old(im,jm) + old(i,jm) + old(ip,jm)
    !
    ! set new value based on number of "live" neighbors
    !
               select case (nsum)
               case (3)
                  new(i,j) = 1
               case (2)
                  new(i,j) = old(i,j)
               case default
                  new(i,j) = 0
               end select
            enddo
         enddo
    !
    ! copy new state into old state
    !
         old = new
         print*, 'Tick ', n, ' number of living: ', sum(new)
      enddo time_iteration
    !
    ! write number of live points
    !
      print*, 'number of live points = ', sum(new)
    end program life
Initial Run ...
    [bono:~/d_debug]$ ifort -g -o life life.f90
    [bono:~/d_debug]$ ./life
     Tick 1 number of living: 342946
     Tick 2 number of living: 334381
     Tick 3 number of living: 291022
     Tick 4 number of living: 263356
     Tick 5 number of living: 290940
     Tick 6 number of living: 322733
     :
     :
     Tick 99 number of living: 0
     Tick 100 number of living: 0
     number of live points = 0
Hmm, everybody dies! What kind of life is that? ... well, not a correct one, in this context, at least. Undoubtedly the problem lies within the neighbor calculation, so let us take a closer look at the execution ...
    (gdb) l 30
    25         do j = 1, nj
    26            do i = 1, ni
    27    !
    28    ! periodic boundaries
    29    !
    30               im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
    31               ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
    32               jm = 1 + (j+njm2) - ((j+njm2)/nj)*nj   ! if j=1, nj
    33               jp = 1 + j - (j/nj)*nj                 ! if j=nj, 1
    (gdb) b 25
    Breakpoint 1 at 0x402e23: file life.f90, line 25.
    (gdb) run
    Starting program: /san/user/jonesm/u2/d_debug/life

    Breakpoint 1, life () at life.f90:25
    25         do j = 1, nj
    Current language:  auto; currently fortran
    (gdb) s
    26            do i = 1, ni
    (gdb) s
    30               im = 1 + (i+nim2) - ((i+nim2)/ni)*ni   ! if i=1, ni
    (gdb) s
    31               ip = 1 + i - (i/ni)*ni                 ! if i=ni, 1
    (gdb) print im
    $1 = 1
    (gdb) print (i+nim2)/1000
    $2 = 0.999
OK, so therein lay the problem: nim2 and njm2 should be integers, not real values ... fix that:
    program life
    !
    ! Conway game of life (debugging example)
    !
      implicit none
      integer, parameter :: ni = 1000, nj = 1000, nsteps = 100
      integer :: i, j, n, im, ip, jm, jp, nsum, isum, nim2, njm2
      integer, dimension(0:ni,0:nj) :: old, new
      real :: arand
and things become a bit more reasonable:
    [bono:~/d_debug]$ ifort -g -o life life.f90
    [bono:~/d_debug]$ ./life
     Tick 1 number of living: 272990
     Tick 2 number of living: 253690
     :
     :
     Tick 99 number of living: 95073
     Tick 100 number of living: 94664
     number of live points = 94664
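To see why the type of nim2 matters, the periodic-boundary arithmetic can be restated in C. This is a sketch for illustration only: the function names are made up, and the small grid size in the usage note is chosen so that the floating-point arithmetic is exact.

```c
#include <assert.h>

/* Sketch of the periodic-boundary index arithmetic for 1-based indices
   i = 1..ni, written with C integers. The wrap relies on integer division
   truncating: (i + ni - 2) / ni is 0 when i = 1 and 1 for 2 <= i <= ni. */
int index_minus(int i, int ni) {
    assert(ni >= 2 && i >= 1 && i <= ni);
    int nim2 = ni - 2;
    return 1 + (i + nim2) - ((i + nim2) / ni) * ni;   /* i = 1 maps to ni */
}

int index_plus(int i, int ni) {
    assert(ni >= 2 && i >= 1 && i <= ni);
    return 1 + i - (i / ni) * ni;                     /* i = ni maps to 1 */
}

/* The same expression with a floating-point nim2, as in the buggy version:
   the division no longer truncates, so the wrap is lost. */
int index_minus_real(int i, int ni) {
    float nim2 = (float)(ni - 2);
    return (int)(1 + (i + nim2) - ((i + nim2) / ni) * ni);
}
```

For example, with ni = 4 the integer version maps i = 1 to 4, while the floating-point version returns 1, the same symptom as the im = 1 seen in the gdb session above.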
Diversion - Demo life
http://www.radicaleye.com/lifepage
http://en.wikipedia.org/wiki/Conway's_Game_of_Life
Interesting repository of Conway's life and cellular automata references.
Other Debugging Miscellany
Core Files
Core files can also be used to instantly analyze problems that caused a code failure bad enough to dump a core file. Often, however, the computer system has been set up in such a way that the default is not to output core files:
    [bono:~/d_debug]$ ulimit -c
    0
for bash syntax. In tcsh you would use the limit built-in command to set the coredumpsize value:
    [bono:~/d_debug]$ tcsh
    [jonesm@bono ~/d_debug]$ limit coredumpsize
    coredumpsize unlimited
Systems administrators set the core file size limit to zero by default for a good reason: these files generally contain the entire memory image of an application process when it dies, and that can be very large. End-users are also notoriously bad about leaving these files lying around ...
Having said that, we can up the limit, and produce a core file that can later be used for analysis.
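The limit can also be raised from inside a program. The following is a hedged sketch assuming a POSIX system, using the standard getrlimit and setrlimit calls with RLIMIT_CORE; the function name raise_core_limit is made up for illustration and is not part of the lecture.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit of the calling
   process to its hard limit. Returns 0 on success, -1 on failure. */
int raise_core_limit(void) {
    struct rlimit lim;
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return -1;
    }
    lim.rlim_cur = lim.rlim_max;   /* soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

Calling this early in main only helps if the hard limit itself is non-zero; an unprivileged process cannot raise the hard limit.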
Core File Example
OK, so now we can use one of our previous examples, and generate a core file:
    [bono:~/d_debug]$ gcc -g -o findprimes_orig findprimes_orig.c
    [bono:~/d_debug]$ ./findprimes_orig
    enter upper bound
    20
    Segmentation fault (core dumped)
    [bono:~/d_debug]$ ls -l core.7428
    -rw------- 1 jonesm ccrstaff 65536 Sep 27 12:15 core.7428
This particular core file is not at all large (it is a very simple code, though, with very little stored data; generally the core file size will reflect the size of the application in terms of its memory use when it crashed). Analyzing it is pretty much like we did when running this example live in gdb:
    [bono:~/d_debug]$ ls -l core.7428
    -rw------- 1 jonesm ccrstaff 65536 Sep 27 12:15 core.7428
    [bono:~/d_debug]$ gdb findprimes_orig core.7428
    GNU gdb Red Hat Linux (6.3.0.0-1.143.el4rh)
    ...
    Core was generated by `./findprimes_orig'.
    Program terminated with signal 11, Segmentation fault.
    Reading symbols from /lib64/tls/libc.so.6...done.
    Loaded symbols for /lib64/tls/libc.so.6
    Reading symbols from /lib64/ld-linux-x86-64.so.2...done.
    Loaded symbols for /lib64/ld-linux-x86-64.so.2
    #0  0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    (gdb) bt
    #0  0x000000392815062f in _IO_vfscanf_internal () from /lib64/tls/libc.so.6
    #1  0x000000392815866a in scanf () from /lib64/tls/libc.so.6
    #2  0x0000000000400524 in main () at findprimes_orig.c:16
    (gdb) l 16
    11      int main()
    12      {
    13        int N;
    14
    15        printf("enter upper bound\n");
    16        scanf("%d", UpperBound);
M. D. Jones, Ph.D. (CCR/UB) Debugging in Serial & Parallel HPC-I Fall 2012 53 / 89
Other Debugging Miscellany Core Files
Summary on Core Files
So why would you want to use a core file rather than interactively
debug?
Your bug may take quite a while to manifest itself
You have to debug inside a batch queuing system where interactive use is difficult or curtailed
You want to capture a picture of the code state when it crashes
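A core file is only written if the process's core size limit allows it, which matters most in the batch case listed above. A minimal sketch, assuming a POSIX system, of raising the soft limit from inside the program (the in-code analogue of the shell's ulimit -c). The helper name enable_core_dumps is hypothetical.

```c
#include <sys/resource.h>

/* Hypothetical helper: raise the soft core-file size limit to the hard
   limit so that a crash can leave a core file behind.
   Returns 0 on success, -1 on failure (errno is set by the system call). */
int enable_core_dumps(void)
{
    struct rlimit lim;

    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return -1;
    }
    lim.rlim_cur = lim.rlim_max;   /* soft limit may go up to the hard limit */
    return setrlimit(RLIMIT_CORE, &lim);
}
```

If the hard limit itself is zero, this call succeeds but still permits no core file; that setting has to be changed by the site or in the job environment.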
More Command-line Debugging Tools
We focused on gdb, but there are command-line debuggers that accompany just about every available compiler product:
pgdbg part of the PGI compiler suite, defaults to a GUI, but can be run as a command line interface (CLI) using the -text option
idb part of the Intel compiler suite, defaults to CLI (has a special option -gdb for using gdb command syntax)
Run-time Compiler Checks
Most compilers support run-time checks that can quickly catch common bugs. Here is a handy short-list (contributions welcome!):
For Intel Fortran, -check bounds -traceback -g will automate bounds checking, and enable extensive traceback analysis in case of a crash (leave out the bounds option to get a crash report on any IEEE exception, format mismatch, etc.)
For PGI compilers, -Mbounds -g will do bounds checking
For GNU compilers, -fbounds-check -g should also do bounds checking, but is only currently supported for Fortran and Java front-ends.
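Since the GNU option above does not cover C, bounds checks in C generally have to be written by hand. A minimal sketch, assuming a hypothetical helper checked_get (not part of any compiler or library):

```c
#include <stddef.h>

/* Hypothetical helper: fetch a[i] only if i lies inside [0, n).
   Returns 0 and stores the element in *out on success, -1 otherwise. */
int checked_get(const double *a, size_t n, size_t i, double *out)
{
    if (a == NULL || out == NULL || i >= n) {
        return -1;   /* invalid arguments or index out of bounds */
    }
    *out = a[i];
    return 0;
}
```

Like the compiler-generated checks, this costs a comparison on every access, so it is best confined to debugging builds.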
Run-time Compiler Checks (cont'd)
WARNING
Run-time error checking can greatly slow down a code's execution, so it is not something that you will want to use all of the time.
Serial Debugging GUIs
There is, of course, a matching set of GUIs for the various debuggers. A short list:
ddd a graphical front-end for the venerable gdb
pgdbg GUI for the PGI debugger
idb -gui GUI for the Intel compiler suite debugger
It is very much a matter of preference whether or not to use the GUI. I
find the GUI to be constraining, but it does make navigation easier.
DDD Example
Running one of our previous examples using ddd ...
Source Code Checking Tools
Now, in a completely different vein, there are tools designed to help identify errors before compilation by analyzing the source code itself.
splint is a tool for statically checking C programs:
http://www.splint.org
ftnchek is a tool that checks only (alas) FORTRAN 77 codes:
http://www.dsm.fordham.edu/~ftnchek/
I can't say that I have found these to be particularly helpful, though.
Memory Allocation Tools
Memory allocation problems are very common - there are some tools designed to help you catch such errors at run-time:
efence, or Electric Fence, tries to trap any out-of-bounds references (see man efence)
valgrind is a suite of tools for analyzing and profiling binaries (see man valgrind) - there is a user manual available at:
file:///usr/share/doc/valgrind-3.6.0/html/manual.html
I have seen valgrind used with good success, but not particularly in the HPC arena.
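As a concrete target for these tools, here is a minimal C sketch (the function name sum_first_n is hypothetical). The code as written stays in bounds; the comment marks the one-character slip that would write past the end of the heap block, which is the kind of error Electric Fence and valgrind are meant to trap.

```c
#include <stdlib.h>

/* Hypothetical example: allocate n ints, fill them with 1..n, and
   return their sum. Returns -1 if the allocation fails. */
long sum_first_n(size_t n)
{
    int *a = malloc(n * sizeof *a);
    long sum = 0;
    size_t i;

    if (a == NULL) {
        return -1;
    }
    /* Writing "i <= n" in this loop would store one element past the
       end of the block. Such an overrun often appears to run correctly,
       which is why a run-time memory checker is useful. */
    for (i = 0; i < n; i++) {
        a[i] = (int)(i + 1);
    }
    for (i = 0; i < n; i++) {
        sum += a[i];
    }
    free(a);
    return sum;
}
```

Built with -g, the faulty variant could be run under valgrind (valgrind ./a.out) or linked against Electric Fence (-lefence) to locate the offending line.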
Strace
strace is a powerful low-level tool that will allow you to trace all system calls made, and signals received, by a particular binary, whether or not you have source code. It can be attached to already running processes. You can learn a lot from it, but it is often a tool of last resort for user applications in HPC due to the copious quantity of extraneous information it outputs.
Strace Example
As an example of using strace, let's peek in on a running MPI process (part of a 32 task job on U2):

    [c06n15:~]$ ps -u jonesm -Lf
    UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
    jonesm   23964 16284 23964 92    2 14:34 ?        00:04:11 /util/nwchem/nwchem5.0/bin/
    jonesm   23964 16284 23965 99    2 14:34 ?        00:04:30 /util/nwchem/nwchem5.0/bin/
    jonesm   23987 23986 23987  0    1 14:37 pts/0    00:00:00 bash
    jonesm   24128 23987 24128  0    1 14:39 pts/0    00:00:00 ps -u jonesm -Lf
    [c06n15:~]$ strace -p 23965
    Process 23965 attached - interrupt to quit
    :
    lseek(45, 691535872, SEEK_SET) = 691535872
    read(45, "\0\0\0\0\0\0\0\0\2\273\f[\250\207V\276\376K&]\331\230d"..., 524288) = 524288
    gettimeofday({1161107631, 126604}, {240, 1161107631}) = 0
    gettimeofday({1161107631, 128553}, {240, 1161107631}) = 0
    :
    :
    select(47, [3 4 6 7 8 9 42 43 44 46], [4], NULL, NULL) = 2 (in [4], out [4])
    write(4, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 2932) = 2932
    writev(4, [{"\0\0\0\0\0\0\0\17\0\0\0\37\0\0\0\0\0\0\0,\0\0\0\0\0\0\0"..., 32},
               {"\1\0\0\0\0\0\0\0\37\0\0\0\17\0\0\0\37\0\0\0,\0\1\0000u"..., 44}], 2) = 76
Part II
Advanced (Parallel) Debugging
Whither Goest the GUI?
Using a GUI-based debugger gets considerably more difficult when dealing with debugging an MPI-based parallel code (not so much on the OpenMP side), due to the fact that you are now dealing with multiple processes scattered across different machines.
The TotalView debugger is the premier product in this arena (it has both CLI and GUI support), but it is very expensive, and not present in all environments. We will start out using our same toolbox as before, and see that we can accomplish much without spending a fortune. The methodologies will be equally applicable to the fancy commercial products.
Process Checking
First on the agenda: parallel processing involves multiple processes, threads, or both, and the first rule is to make sure that they are ending up where you think that they should be (needless to say, all too often they do not).
Use MPI_Get_processor_name to report back on where processes are running
Use ps to monitor processes as they run (useful flags: ps u -L),
even on remote nodes (rsh/ssh into them)
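A report of this kind can be sketched without an MPI installation. The block below is a stand-in, not MPI code: it uses POSIX gethostname where an MPI program would call MPI_Get_processor_name (which on common implementations returns the host name), and the rank and process count are plain parameters where an MPI program would obtain them from MPI_Comm_rank and MPI_Comm_size. The helper name format_location is hypothetical.

```c
#include <stdio.h>
#include <unistd.h>

/* Hypothetical helper: write "rank R of N on HOST (pid P)" into buf.
   Returns the number of characters written, or -1 on failure or if
   buf is too small. */
int format_location(char *buf, size_t buflen, int rank, int nprocs)
{
    char host[256];
    int n;

    if (gethostname(host, sizeof host) != 0) {
        return -1;
    }
    host[sizeof host - 1] = '\0';   /* gethostname may not terminate on truncation */

    n = snprintf(buf, buflen, "rank %d of %d on %s (pid %ld)",
                 rank, nprocs, host, (long)getpid());
    if (n < 0 || (size_t)n >= buflen) {
        return -1;
    }
    return n;
}
```

Printing the process ID alongside the host makes it easy to match each task against the ps output on that node.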
or you can script it (I called this script job_ps):
    #!/bin/sh
    #
    # Shell script to take a single argument (PBS job id) and launch a
    # ps command on each node
    #
    QST=`which qstat`
    if [ -z $QST ]; then
       echo "ERROR: no qstat in PATH: PATH="$PATH
       exit
    fi

    case $# in
    0) echo "single PBS_JOBID required."; exit ;;  # no args, exit
    1) jobid=$1 ;;
    *) echo "single PBS_JOBID required."; exit ;;  # too many args, exit
    esac
    # get node listing & trim out excess verbiage
    nlines=`$QST -an $jobid | wc -l`
    nlines=`echo "$nlines - 6" | bc`
    nodelist=`$QST -n $jobid | tail -$nlines | sed "s/\/[0-1]+/,/g" |
              sed "s/\/[0-1]//" | sed "s/ +//g" | sed "s/,/\ /g" |
              awk '{for (i=1; i ...
    MYPS = ps -u jonesm -L -o pid,pcpu,time,comm
    NODE = c23n31, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
    27130  0.0 00:00:00 bash
    27201  0.0 00:00:00 pbs_demux
    27235  0.0 00:00:00 bash
    27244  0.0 00:00:00 doqmtests.mpi
     1652  0.0 00:00:00 runtests.mpi.un
     1664  0.0 00:00:00 mpiexec
     1666  0.0 00:00:00 mpiexec
     1667 94.5 00:17:18 nwchem-intel-91
     1667 10.0 00:01:50 nwchem-intel-91
     1668 94.2 00:17:15 nwchem-intel-91
     3813  0.0 00:00:00 ps
    NODE = c23n30, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
     1975 96.2 00:17:36 nwchem-intel-91
     1975  6.6 00:01:13 nwchem-intel-91
     1976 96.0 00:17:34 nwchem-intel-91
     4033  0.0 00:00:00 ps
    NODE = c23n29, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
     2673 95.2 00:17:26 nwchem-intel-91
     2673  8.9 00:01:38 nwchem-intel-91
     2674 94.5 00:17:19 nwchem-intel-91
     4728  1.0 00:00:00 ps
    NODE = c23n28, my CPU/thread Usage:
      PID %CPU     TIME COMMAND
    19284 88.2 00:16:09 nwchem-intel-91
    19284 14.8 00:02:43 nwchem-intel-91
    19285 88.2 00:16:09 nwchem-intel-91
    21374  0.0 00:00:00 ps
Using Serial Debuggers in Parallel?
Yes, you can certainly run debuggers designed for use in sequential
codes in parallel. They are even quite effective. You may just have to
jump through a few extra hoops to do so ...
Attaching GDB to Running Processes
The simplest way to use a CLI-based debugger in parallel is to attach it to already running processes, namely:
Find the parallel processes using the ps command (may have to rsh/ssh into remote nodes if that is where they are running)
Invoke gdb on each process ID:
    [u2:~]$ ssh f10n32
    [u2:~]$ ps -u jonesm
      PID TTY          TIME CMD
      680 pts/0    00:00:00 bash
      682 pts/0    00:00:00 pbs_mom
      683 pts/0    00:00:00 pbs_demux
      784 ?        00:00:00 python
      789 pts/0    00:00:00 python
      790 pts/0    00:00:00 sh
      791 ?        00:00:00 python
      792 ?        00:00:14 pp.gdb
      797 ?        00:00:00 sshd
    [f10n32:~]$ gdb pp.gdb 792
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ....
    0x0000000000400dd5 in pp () at pp.f90:42
    42        do while (gdbWait /= 1)
Of course, unless you put an explicit waiting point inside your code, the processes are probably happily running along when you attach to them, and you will likely want to exert some control over that.
First, using our above example, I was running one MPI task on f10n32 and one on f10n24. Once gdb attached to each process, the process paused:

    [f10n32:~/d_hw/d_hw2/d_pp]$ gdb pp.gdb 923
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ...
    (gdb) where
    #0  0x00000031ee0dc0e8 in poll () from /lib64/libc.so.6
    #1  0x00002b4015ef9a1b in MPID_nem_tcp_connpoll (in_blocking_poll=1) at ../../socksm.c:2526
    #2  0x00002b4015ef8e82 in MPID_nem_tcp_poll (in_blocking_poll=1) at ../../socksm.c:2324
    #3  0x00002b4015e51a56 in MPID_nem_network_poll (in_blocking_progress=1) at ...
    #4  0x00002b4015cff4ce in MPIDI_CH3I_Progress (progress_state=0x7fff3afe2b90, ...
    #5  0x00002b4015eab822 in PMPI_Recv (buf=0x601e60, count=1, datatype=1275070495, ...
        at ../../recv.c:156
    #6  0x00002b40161ab000 in pmpi_recv__ () from /util/intel/2011.0/impi/4.0.1.007/...
    #7  0x0000000000401288 in pp () at pp.f90:80
    #8  0x000000000040176a in main ()
    #9  0x00000031ee01ecdd in __libc_start_main () from /lib64/libc.so.6
    #10 0x0000000000400c49 in _start ()
    (gdb) c
    Continuing.
and on f10n24:
    [u2:~]$ ssh f10n24
    [f10n24:~]$ ps -u jonesm
      PID TTY          TIME CMD
    22673 ?        00:00:00 python
    22684 ?        00:00:00 python
    22686 ?        00:02:54 pp.gdb
    22693 ?        00:00:00 sshd
    22694 pts/0    00:00:00 bash
    22733 pts/0    00:00:00 ps
    [f10n24:~]$ gdb pp.gdb 22686
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ...
    (gdb) where
    #0  0x000000309b2dc0e8 in poll () from /lib64/libc.so.6
    #1  0x00002b333112fa1b in MPID_nem_tcp_connpoll (in_blocking_poll=1) at ../../socksm.c:2526
    #2  0x00002b333112ee82 in MPID_nem_tcp_poll (in_blocking_poll=1) at ../../socksm.c:2324
    #3  0x00002b3331087a56 in MPID_nem_network_poll (in_blocking_progress=1) at ../../...
    #4  0x00002b3330f354ce in MPIDI_CH3I_Progress (progress_state=0x7fffd9dff6b0, ...
    #5  0x00002b33310e1822 in PMPI_Recv (buf=0x601e60, count=4, datatype=1275070495, ...
        at ../../recv.c:156
    #6  0x00002b33313e1000 in pmpi_recv__ () from /util/intel/2011.0/impi/4.0.1.007/...
    #7  0x00000000004013be in pp () at pp.f90:91
    #8  0x000000000040176a in main ()
    #9  0x000000309b21ecdd in __libc_start_main () from /lib64/libc.so.6
    #10 0x0000000000400c49 in _start ()
    (gdb) c
    Continuing.
and we used the continue (c) command to let the execution pick up again where we (temporarily) interrupted it.
Using a Waiting Point
You can insert a waiting point into your code to ensure that execution waits until you get a chance to attach a debugger:

    integer :: gdbWait=0
    :
    :
    CALL MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
    CALL MPI_COMM_SIZE(MPI_COMM_WORLD, Nprocs, ierr)
    ! dummy pause point for gdb insertion
    do while (gdbWait /= 1)
    end do
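The same waiting point can be written in C. This is a minimal sketch with hypothetical names (gdb_wait, wait_for_debugger); the volatile qualifier, the one-second sleep, and the optional timeout are my additions, not part of the Fortran example above. volatile keeps an optimizing compiler from hoisting the flag test out of the loop, sleeping avoids burning a core while waiting, and the timeout lets an unattended run proceed.

```c
#include <stdio.h>
#include <unistd.h>

/* Flag that the debugger flips to release the process. */
static volatile int gdb_wait = 0;

/* Hypothetical helper: pause until a debugger sets gdb_wait to 1, or
   until max_seconds have elapsed (a negative value waits forever).
   Returns 1 if released through the flag, 0 if the wait timed out. */
int wait_for_debugger(int max_seconds)
{
    int waited = 0;

    /* Announce the process ID so the right process can be attached. */
    fprintf(stderr, "pid %ld waiting for debugger\n", (long)getpid());

    while (gdb_wait != 1) {
        if (max_seconds >= 0 && waited >= max_seconds) {
            return 0;
        }
        sleep(1);
        waited++;
    }
    return 1;
}
```

Once attached, the process can be released with (gdb) set var gdb_wait = 1 followed by c.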
and then you will find the process waiting at that point when you attach gdb, and you can release it at your leisure:

    [f10n32:~/d_hw/d_hw2/d_pp]$ gdb pp.gdb 1003
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    ...
    0x0000000000400dd2 in pp () at pp.f90:42
    42        do while (gdbWait /= 1)
    (gdb) set gdbWait=1
    44        if (Nprocs /= 2) then
    (gdb) c
    Continuing.

    [f10n24:~/d_hw/d_hw2/d_pp]$ gdb pp.gdb 22777
    GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
    0x0000000000400dd2 in pp () at pp.f90:42
    42        do while (gdbWait /= 1)
    (gdb) set gdbWait=1
    44        if (Nprocs /= 2) then
    (gdb) c
    Continuing.
Using GDB Within MPI Task Launcher
Last, but not least, you can usually launch gdb through your MPI task launcher. For example, using the Intel MPI task launcher, mpiexec:

    [k07n14:~/d_hw/d_hw2/d_pp]$ mpiexec -gdb -np 2 ./pp.gdb
    0-1: (gdb) list 30
    0-1: 25
    0-1: 26        integer :: gdbWait=0
    0-1: 27        integer myid, Nprocs, ierr, mpi_procname_length
    0-1: 28        integer :: status(MPI_STATUS_SIZE)
    0-1: 29        character(len=MPI_MAX_PROCESSOR_NAME) :: mpi_procname
    0-1: 30
    0-1: 31        !
    0-1: 32        ! Initialize communicator, check that we are using only 2p
    0-1: 33        !
    0-1: 34        CALL MPI_INIT(ierr)
    0-1: (gdb) b 34
    0-1: Breakpoint 2 at 0x400d1f: file pp.f90, line 34.
    0-1: (gdb) run
    0-1: Continuing.
    0-1:
    0-1: Breakpoint 2, pp () at pp.f90:34
    0-1: 34        CALL MPI_INIT(ierr)
    0-1: (gdb) 0-1: (gdb) c
    0-1: Continuing.
    0: Hello from proc 0 of 2 k07n14.ccr.buffalo.edu
    1: Hello from proc 1 of 2 k07n14.ccr.buffalo.edu
    0: Number Averaged for Sigmas: 2
or, in batch mode:
    [u2:~]$ qsub -q debug -lnodes=2:GM:ppn=2,walltime=00:15:00 -I
    qsub: waiting for job 993514.d15n41.ccr.buffalo.edu to start
    qsub: job 993514.d15n41.ccr.buffalo.edu ready

    Job 993514.d15n41.ccr.buffalo.edu has requested 2 cores/processors per node.
    PBSTMPDIR is /scratch/993514.d15n41.ccr.buffalo.edu
    [f10n32:~]$ module load intel-mpi
    [f10n32:~]$ NNODES=`cat $PBS_NODEFILE | uniq | wc -l`
    [f10n32:~]$ mpdboot -n $NNODES -f $PBS_NODEFILE
    [f10n32:~]$ mpicc -g -o cpi.impi cpi.c
    [f10n32:~]$ mpiexec -gdb -np 4 ./cpi.impi
    0-3: (gdb) list 16,20
    0-3: 16        double startwtime = 0.0, endwtime;
    0-3: 17        int namelen;
    0-3: 18        char processor_name[MPI_MAX_PROCESSOR_NAME];
    0-3: 19
    0-3: 20        MPI_Init(&argc,&argv);
    0-3: (gdb) b 20
    0-3: Breakpoint 2 at 0x4009a0: file cpi.c, line 20.
    0-3: (gdb) run
    0-3: Continuing.
    0-3:
    0-3: Breakpoint 2, main (argc=1, argv=0x7fffffffd9b8) at cpi.c:20
    0-3: 20        MPI_Init(&argc,&argv);
    0-3: (gdb) 0-3: (gdb) c
    0-3: Continuing.
    0: Process 0 on f10n32.ccr.buffalo.edu
    1: Process 1 on f10n32.ccr.buffalo.edu
    2: Process 2 on f10n27.ccr.buffalo.edu
    3: Process 3 on f10n27.ccr.buffalo.edu
Using Serial Debuggers in Parallel
So you can certainly use serial debuggers in parallel - in fact it is a
pretty handy thing to do. Just keep in mind:
Don't forget to compile with debugging turned on
You can always attach to a running code (and you can instrument the code with that purpose in mind)
Beware that not all task launchers are equally friendly towards built-in support for serial debuggers
The TotalView Debugger
The premier parallel debugger, TotalView:
Sophisticated commercial product (think many $$ ...)
Designed especially for HPC, multi-process, multi-thread
Has both GUI and CLI
Supports C/C++, Fortran 77/90/95, mixtures thereof
The official debugger of DOE's Advanced Simulation and Computing (ASC) program
Using TotalView at CCR
Pretty simple to start using TotalView on the CCR systems:
1 Generally you want to load the latest version:
    [d16n03:~]$ module avail totalview
2 Make sure that your X DISPLAY environment is working if you are going to use the GUI.
3 The current CCR license supports 2 concurrent users up to 8 processors (precludes usage on nodes with more than 8 cores until/unless this license is upgraded).
The DDT Debugger
Allinea's commercial parallel debugger, DDT:
Sophisticated commercial product (think many $$ ...)
Designed especially for HPC, multi-process, multi-thread
Has both GUI and CLI
Supports C/C++, Fortran 77/90/95, mixtures thereof
CCR has a 32-token license for DDT (including CUDA support)
To find the latest installed version, module avail ddt
Current Recommendations
CCR has licenses for Allinea's DDT and TotalView (although the current TotalView license is very small and outdated and will be either upgraded or dropped in favor of DDT). Both are quite expensive, but stay tuned for further developments. Note that the open-source Eclipse project also has a parallel tools platform that can be used in combination with C/C++ and Fortran:
http://www.eclipse.org/ptp