tds bug 221

Upload: daniel-gomez-prado

Post on 14-Jun-2015

DESCRIPTION

Debugging reordering in TDS

TRANSCRIPT

1. Debugging TDS reordering bug v221 (fixed in v222)
By Daniel Gomez-Prado, 06/28/2012
Disclaimer: This is not a tutorial. This is a debugging session for anyone working on TDS.
http://www.dgomezpr.com/

2. Debugging TDS reordering (FFT)

Trigger command: reorder* --sift --node

Reordering is one of the most difficult algorithms in TDS because:
  • Nodes can have an arbitrary number of children and parents (2+ is normal)
  • Weights on edges propagate up and down the graph
  • Registers on edges represent a boundary for reordering

In particular, weight propagation can generate graph reductions through isomorphism. This characteristic is highly desirable in TEDs, as it reduces graph complexity by pruning replicated nodes.

In the present debugging session, I show the steps needed to isolate the bug and the steps required to fix it.

3. Debugging TDS reordering (FFT)

Trigger command: reorder* --sift --node (from file fft.scr)

    Tds 01> read fft.cdfg
    Tds 02> ntl2ted
    Tds 03> reorder* --sift --node

  • aborts at 4% of the reordering
  • the assertion is due to a dangling node while reordering
  • file debug_dangling_ted_node.txt is saved in the working directory

    Error: 04040. Stopping the reordering, the container might be unstable
    Warning 03036.
    Stopped execution at line "reorder* --sift --node"

By inspecting the dumped file, we can establish that the initial order of variables is:

    wr_1 wi_1 wr_2 wi_2 wr_3 wi_3 wr_4 wi_4 wr_5 wi_5 wr_6 wi_6 wr_7 wi_7 wr_8 wi_8 wr_9 wi_9
    wr_10 wi_10 wr_11 wi_11 wr_12 wi_12 wr_13 wi_13 wr_14 wi_14 wr_15 wi_15 wi_16 wr_17
    wi_17 wr_18 wi_18 wr_19 wi_19 wr_20 wi_20 wr_21 wi_21 wr_22 wi_22 wr_23 wi_23 wr_24
    wi_24 wr_25 wi_25 wr_26 wi_26 wr_27 wi_27 wr_28 wi_28 wr_29 wi_29 wr_30 wi_30 wr_31
    wi_31 wr_16 ar_0 ar_32 ai_0 ai_32 ar_1 ar_33 ai_1 ai_33 ar_2 ar_34 ai_2 ai_34 ar_3 ar_35 ai_3
    ai_35 ar_4 ar_36 ai_4 ai_36 ar_5 ar_37 ai_5 ai_37 ar_6 ar_38 ai_6 ai_38 ar_7 ar_39 ai_7 ai_39
    ar_8 ar_40 ai_8 ai_40 ar_9 ar_41 ai_9 ai_41 ar_10 ar_42 ai_10 ai_42 ar_11 ar_43 ai_11 ai_43
    ar_12 ar_44 ai_12 ai_44 ar_13 ar_45 ai_13 ai_45 ar_14 ar_46 ai_14 ai_46 ar_15 ar_47 ai_15
    ai_47 ar_16 ar_48 ai_16 ai_48 ar_17 ar_49 ai_17 ai_49 ar_18 ar_50 ai_18 ai_50 ar_19 ar_51
    ai_19 ai_51 ar_20 ar_52 ai_20 ai_52 ar_21 ar_53 ai_21 ai_53 ar_22 ar_54 ai_22 ai_54 ar_23
    ar_55 ai_23 ai_55 ar_24 ar_56 ai_24 ai_56 ar_25 ar_57 ai_25 ai_57 ar_26 ar_58 ai_26 ai_58
    ar_27 ar_59 ai_27 ai_59 ar_28 ar_60 ai_28 ai_60 ar_29 ar_61 ai_29 ai_61 ar_30 ar_62 ai_30
    ai_62 ar_31 ar_63 ai_31 ai_63

4. and that the order of variables before the crash is:

    wr_1 wi_1 wr_2 wi_2 wr_3 wi_3 wr_4 wi_4 wr_5 wi_5 wr_6 wi_6 wr_7 wi_7 wr_8
    wi_8 wr_9 wi_9 wr_10 wi_10 wr_11 wi_11 wr_12 wi_12 wr_13 wi_13 wr_14 wi_14
    wr_15 wi_15 wi_16 wr_17 wi_17 wr_18 wi_18 wr_19 wi_19 wr_20 wi_20 wr_21
    wi_21 wr_22 wi_22 wr_23 wi_23 wr_24 wi_24 wr_25 wi_25 wr_26 wi_26 wr_27
    wi_27 wr_28 wi_28 wr_29 wi_29 wr_30 wi_30 wr_31 wi_31 wr_16 ar_0 ar_32 ai_0
    ai_32 ar_1 ar_33 ai_1 ai_33 ar_2 ar_34 ai_2 ai_34 ar_3 ar_35 ai_3 ai_35 ar_4 ar_36
    ai_4 ai_12 ai_36 ar_5 ar_37 ai_5 ai_37 ar_6 ar_38 ai_6 ai_38 ar_7 ar_39 ai_7 ai_39
    ar_8 ar_40 ai_8 ai_40 ar_9 ar_41 ai_9 ai_41 ar_10 ar_42 ai_10 ai_42 ar_11 ar_43
    ai_43 ar_12 ar_44 ai_44 ar_13 ar_45 ai_13 ai_45 ar_14 ar_46 ai_14 ai_46 ar_15
    ar_47 ai_15 ai_47 ar_16 ar_48 ai_16 ai_48 ar_17 ar_49 ai_17 ai_49 ar_18 ar_50
    ai_18 ai_50 ar_19 ar_51 ai_19 ai_51 ar_20 ar_52 ai_20 ai_52 ar_21 ar_53 ai_21
    ai_53 ar_22 ar_54 ai_22 ai_54 ar_23 ar_55 ai_23 ai_55 ar_24 ar_56 ai_24 ai_56
    ar_25 ar_57 ai_25 ai_57 ar_26 ar_58 ai_26 ai_58 ar_27 ar_59 ai_27 ai_59 ar_28
    ar_60 ai_28 ai_60 ar_29 ar_61 ai_29 ai_61 ar_30 ar_62 ai_30 ai_62 ai_11 ar_31
    ar_63 ai_31 ai_63

That is, the variables that have been reordered so far are (***** and &&&&& mark the slots that ai_12 and ai_11 occupy in the other ordering):

    [INITIAL]      wr_1 ... ai_4 ***** ai_36 ... ar_43 ai_11 ai_43 ar_12 ar_44 ai_12 ai_44 ... ai_62 &&&&& ar_31 ar_63 ai_31 ai_63
    [BEFORE CRASH] wr_1 ... ai_4 ai_12 ai_36 ... ar_43 &&&&& ai_43 ar_12 ar_44 ***** ai_44 ... ai_62 ai_11 ar_31 ar_63 ai_31 ai_63

The crash (or error) occurs when node ai_12 is being moved up one position, that is, when variables ai_4 and ai_12 are being swapped.
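The comparison of the two orderings can also be done mechanically instead of by eye. The following is a minimal Python sketch, not part of TDS: the function name moved_variables is hypothetical, and it assumes both orderings list exactly the same variables. It keeps the longest run of variables whose relative order is unchanged and reports the rest as moved.

```python
from bisect import bisect_left


def moved_variables(initial, current):
    """Return the variables of `current` that left their `initial` relative order.

    Hypothetical helper (not a TDS command). Variables on the longest
    increasing subsequence of initial positions are considered unmoved.
    """
    if sorted(initial) != sorted(current):
        raise ValueError("both orders must contain the same variables")
    rank = {name: i for i, name in enumerate(initial)}
    seq = [rank[name] for name in current]

    tails = []      # tails[k]: index in seq ending the best subsequence of length k+1
    tail_vals = []  # tail_vals[k] == seq[tails[k]], kept sorted for bisect
    prev = [None] * len(seq)
    for i, value in enumerate(seq):
        k = bisect_left(tail_vals, value)
        if k == len(tails):
            tails.append(i)
            tail_vals.append(value)
        else:
            tails[k] = i
            tail_vals[k] = value
        prev[i] = tails[k - 1] if k > 0 else None

    # Walk the back-pointers to recover the unmoved positions.
    keep = set()
    i = tails[-1] if tails else None
    while i is not None:
        keep.add(i)
        i = prev[i]
    return [name for i, name in enumerate(current) if i not in keep]


# Excerpt of the two orderings around the affected variables.
initial = "ai_4 ai_36 ar_43 ai_11 ai_43 ar_12 ar_44 ai_12 ai_44 ai_62 ar_31".split()
before_crash = "ai_4 ai_12 ai_36 ar_43 ai_43 ar_12 ar_44 ai_44 ai_62 ai_11 ar_31".split()
print(moved_variables(initial, before_crash))  # ['ai_12', 'ai_11']
```

On the excerpt, the sketch reports ai_12 and ai_11, the same two variables identified above by inspecting the dump.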
5. So far, we have been able to reproduce the bug in a faster and more predictable manner:

    read fft.cdfg
    ntl2ted
    jumpAbove -p ai_11 ar_31
    jumpBelow -p ai_12 ai_4
    write fft_order_bug.ted

Now the bug can be reproduced by simply doing:

    read fft_order_bug.ted
    bblup ai_12

The aforementioned test contains 190 different variables and close to 7000 nodes. Furthermore, some variables, such as wi_16, have more than 432 nodes connected to 864 parents in 44 different levels, and have 1232 children spread over 30 different levels. All this information can be gathered with the tds t command. With this graph structure, the number of manual iterations needed to discover the bug might become burdensome, and therefore we need to refine our test case.

To reduce the test case, we develop a small method called print_cone. We can call this method from within the debugger to print the output cone of the nodes involved in the swap of ai_12 and ai_4. The goal is to prune all outputs that do not contain ai_12 and ai_4 nodes, and therefore do not excite the bug.

6. The method print_cone thus developed helps us identify a subset of the entire graph, corresponding to the following outputs, which can be extracted to reduce the test case:

    aai_0000005 aai_10000004 aai_12000004 aai_14000004 aai_2000004
    aai_4000004 aai_6000004 aai_8000004 aar_10000004 aar_12000004
    aar_14000004 aar_8000004 op1481 op1485 op1489 op1496 op1497 op1500
    op1501 op1504 op1505 op1508 op1509

In this newly reduced, more tractable test case, we can further check which output cones contain both the ai_12 and ai_4 variables. These output cones are candidates to cause the bug, and therefore we can disregard all outputs containing only one of those variables.
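The filtering idea can be illustrated with a small Python sketch. This is not the print_cone method itself, which lives inside TDS and is not shown here. The graph encoding (a children map plus a node-to-variable map), the function names, and the toy output names out_a, out_b and out_c are all assumptions made for illustration.

```python
def cone_variables(root, children, var_of):
    """Collect the variables of every node reachable from `root` (its cone)."""
    seen, found, stack = set(), set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        found.add(var_of[node])
        stack.extend(children.get(node, ()))
    return found


def outputs_containing(outputs, children, var_of, required):
    """Names of the outputs whose cone contains every variable in `required`."""
    required = set(required)
    return sorted(
        name
        for name, root in outputs.items()
        if required <= cone_variables(root, children, var_of)
    )


# Toy graph (hypothetical): node ids map to the variable they are labeled with.
var_of = {"n1": "wi_28", "n2": "wr_16", "n3": "ai_4",
          "n4": "ai_12", "n5": "ai_4", "n6": "ar_5"}
children = {"n1": ["n2"], "n2": ["n3"], "n3": ["n4"], "n5": ["n6"]}
outputs = {"out_a": "n1", "out_b": "n5", "out_c": "n4"}

print(outputs_containing(outputs, children, var_of, {"ai_4", "ai_12"}))  # ['out_a']
```

Only out_a survives: out_b reaches ai_4 but not ai_12, and out_c reaches ai_12 but not ai_4, so neither can excite the swap.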
Keeping only the outputs whose cones contain both ai_12 and ai_4 further reduces the test case to the following outputs:

    op1497 op1501 aai_10000004 aai_8000004 op1509 aai_12000004 op1505 aai_14000004
    op1500 aar_10000004 op1508 op1504 aar_12000004 aar_14000004 op1496 aar_8000004

In this watered-down version of the bug, we can (by visual inspection) reduce the test case even more. In the following figure, the red and yellow lines track the parents, while the blue line tracks whether the different parent nodes share common children. The green lines check the polarity of the weight that could lead toward an output propagation of the local ordering, and the pink lines drawn at the bottom of the graph are our first attempt to formulate a hypothesis of what could be causing a dangling node.

7. The embedded PDF can be retrieved at: http://www.dgomezpr.com/electrical-and-computer-engineering/code/tds/tds-debugging/55-bug-fft

8. The previous traversal in the figure allows us to reduce the test case to 6 primary outputs:

    aai_10000004 aai_8000004 aar_10000004 op1497 op1500 op1501

This time we can start focusing on our hypothesis by printing, from within the debugger, the paths involved in the reordering. We can limit the parent traversal of the graph to at most 4 parents through the debugging command: break at TedOrderProper.cc:370 to execute

    pathWith_X_Y.visualize(px,&parents,3,4);

In fact, we can visualize the data structure at the point where the dangling node was found (at the assertion) using the debugging command shown above. The following PDF file shows that there are not one but four different dangling nodes, which are highlighted in gold and have the fake "no parent" node as their parent. Upon careful examination we can see that those 4 nodes do have counterpart nodes (pointed to by purple lines) that are their equivalents. The nodes with no parent are the nodes that existed before the reordering in question; these nodes should have prevented new equivalent nodes from being created, and they should have been reused to avoid updating all the parents. Furthermore, we notice that among the new nodes, if the weight of -1 is propagated, we obtain the node linked through a pink line.

9. The embedded PDF can be retrieved at: http://www.dgomezpr.com/electrical-and-computer-engineering/code/tds/tds-debugging/55-bug-fft

10. At this stage, we have narrowed down our bug to the following condition: the troubling path is the node ai_4 connected to child ai_12, when the parents of ai_4 converge to the same parent node wi_28 through different paths, and these parent nodes differ only in the weight to be propagated through them. On the left-hand side of the figure below, we can observe the yellow and teal wi_28 parents; both are parents of wr_16 nodes, which are direct parents of the node ai_4 in question. The red line among these nodes indicates that if the weight of -1 were to be propagated (as it will be during the reordering), these nodes would have equivalents within the container.

The problem, therefore, in its simplest form is the following: when the upper nodes wi_28, visited during a recursive call, create temporary placeholders in the container no_touch in the forward_weight_up method, these placeholders contain the original wr_16 nodes as children. But when the wr_16 nodes are swapped as well (and new placeholders for them are created), the upper nodes wi_28 keep the old reference to the wr_16 nodes. They are therefore no longer equivalent to the new nodes, and they persist in the container despite the fact that they are no longer needed.

11. The embedded PDF can be retrieved at: http://www.dgomezpr.com/electrical-and-computer-engineering/code/tds/tds-debugging/55-bug-fft

12. Solution:

1. The most appropriate solution consists of two steps:
   1. Manage the nodes in the no_touch container (the nodes that are candidates to be stitched back) in a levelized manner. In this manner the wi_28 nodes will never be updated before the wr_16 nodes; because the wr_16 nodes are updated first, the references held by the wi_28 nodes are updated correctly, eliminating this problem.
   2. The bug above points to further recursive problems. In this case, we have only two 3 pairs of nodes that swap, one being the base parent. This same pattern could be replicated among more pairs if it traverses upwards. Check for dependent nodes in the no_touch container, that is, a node that appears at some point as data and at another point as a key.
2. The other solution would be to force the no-parent nodes to be dropped quietly. The main problem with this approach is that it is not a fix but a patch. As such, it will require future hacks when some code passes correctly through it and other code fails with it.

Solution taken: #1, with steps 1 and 2.
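
The two steps of solution #1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the TDS implementation: no_touch is modeled as a plain dictionary from an original node to its placeholder, each node has a single child, level_of grows toward the bottom of the graph, and the names rebuild_levelized, dependent_nodes and the *_new placeholders are hypothetical.

```python
def rebuild_levelized(no_touch, child_of, level_of):
    """Step 1: resolve placeholder children bottom-up (deepest level first).

    Returns {placeholder: child}. Because lower nodes are handled first,
    a parent placeholder picks up the already swapped child instead of
    keeping a stale reference to the original one.
    """
    resolved = {}  # original node -> placeholder, filled as we go
    links = {}
    ordered = sorted(no_touch.items(), key=lambda kv: level_of[kv[0]], reverse=True)
    for old, new in ordered:
        child = child_of.get(old)
        links[new] = resolved.get(child, child)
        resolved[old] = new
    return links


def dependent_nodes(no_touch):
    """Step 2: nodes that appear both as a key and as data in the container."""
    return sorted(set(no_touch) & set(no_touch.values()))


# Toy instance of the pattern in this bug: wi_28 sits above wr_16.
level_of = {"wi_28": 1, "wr_16": 2, "ai_4": 3}
child_of = {"wi_28": "wr_16", "wr_16": "ai_4"}
no_touch = {"wi_28": "wi_28_new", "wr_16": "wr_16_new"}

print(rebuild_levelized(no_touch, child_of, level_of))
# {'wr_16_new': 'ai_4', 'wi_28_new': 'wr_16_new'}
print(dependent_nodes({"a": "b", "b": "c"}))  # ['b']
```

If the wi_28 entry were processed first, its placeholder would point at the original wr_16 node, which is the stale reference described in slide 10. The second helper flags chains in which a placeholder is itself replaced later, the recursive pattern that step 2 asks to check for.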