Post on 17-Jan-2016
Debugging of #100019
P. Hristov, 04/03/2013
Introduction
• Difficult problem
– The behavior is “random” and depends on the “history”
– The debugger doesn’t show what actually happens
– The standard tools (e.g. valgrind) do not detect it
– All the tricks I tried did not help
• Important problem
– Many jobs crash with the same message
Binary search for problem identification
• Needs a fast multiprocessor machine for compilation (thanks to DAQ!)
• One has to repeat the same test several (N) times to be sure that a given revision works or crashes (I used N=5)
• In this way I identified rev. 59755: the fix for the “label==0” problem
• Remember that this revision also caused #99670, Increased virtual memory after the "Label 0 fix"
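The search above can be sketched as a bisection in which a revision counts as good only if all N repeated runs pass. This is a minimal sketch and not the script actually used: the names RevisionIsGood and FindFirstBadRevision, the runTest callback and the revision range are hypothetical. It assumes that the oldest revision in the range works, that the newest one crashes, and that N passing runs are enough to trust a revision.

```cpp
#include <cassert>
#include <functional>

// Returns true only if the (possibly flaky) test passes nRepeat times in a row.
// A single crash marks the revision as bad.
bool RevisionIsGood(int rev, int nRepeat,
                    const std::function<bool(int)>& runTest) {
  for (int i = 0; i < nRepeat; ++i) {
    if (!runTest(rev)) return false;
  }
  return true;
}

// Bisection over (good, bad]: 'good' is assumed to work, 'bad' to crash.
// Returns the first revision for which at least one of the repeated runs fails.
int FindFirstBadRevision(int good, int bad, int nRepeat,
                         const std::function<bool(int)>& runTest) {
  assert(good < bad);
  while (bad - good > 1) {
    const int mid = good + (bad - good) / 2;
    if (RevisionIsGood(mid, nRepeat, runTest)) {
      good = mid;  // all repeats passed: the problem was introduced later
    } else {
      bad = mid;   // at least one crash: the problem is at mid or earlier
    }
  }
  return bad;
}
```

With a crash probability p per run, a bad revision passes all N repeats with probability (1-p)^N, so a larger N lowers the risk of steering the bisection in the wrong direction at the cost of more compile-and-run cycles.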
Static initialization order fiasco
• See the details in
– http://www.parashift.com/c++-faq-lite/static-init-order.html
– http://www.parashift.com/c++-faq-lite/static-init-order-on-intrinsics.html
• Short description: in the implementation file
    #include "MyClass.h"
    …
    const int fkLookForTrouble = AnotherClass::GetValue();
    // Methods
    …
    void MyClass::DoMess() {
      // Use fkLookForTrouble …
    }
• We had a similar problem in 2007 with pointers initialized from a factory method; that case was easier since the crash was “less random”
• The fix is to use AnotherClass::GetValue() directly
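The file-scope constant is initialized before main(), and the order of such initializations across different implementation files is not defined, so AnotherClass::GetValue() may run before the static data it relies on is ready. The sketch below illustrates the fix of calling the getter at the point of use. It is an illustration under assumptions, not the AliRoot code: the bodies of AnotherClass and MyClass, the helper ComputeValue and the value 42 are hypothetical placeholders.

```cpp
#include <cassert>

// Hypothetical stand-in for AnotherClass. In the real case its static data
// lives in a different implementation file.
class AnotherClass {
public:
  static int GetValue() {
    // Function-local static: initialized on the first call, so it is
    // guaranteed to be ready whenever GetValue() is used.
    static const int value = ComputeValue();
    return value;
  }

private:
  static int ComputeValue() { return 42; }  // placeholder value
};

class MyClass {
public:
  // The value is fetched at call time instead of being cached in a
  // file-scope constant such as fkLookForTrouble.
  int DoMess() const { return AnotherClass::GetValue() + 1; }
};
```

For example, MyClass().DoMess() evaluates AnotherClass::GetValue() only when DoMess() is called, which happens after static initialization has finished as long as DoMess() itself is not called from another static initializer.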
Test after the fix
• Run 195566 (one of the worst runs)
– DONE: 2,686 of 4,648 (57.8%)
– ERROR_V: 1,227 (26.5%), the bug
– EXPIRED: 724 (15.6%), memory
• After the fix
– DONE: 4,089 of 4,648 (88.0%)
– ERROR_V: 415 (8.9%), the bug
– EXPIRED: 144 (3.1%), memory
To Do
• Check all similar places in AliRoot and provide a fix
• Fix some memory leaks found by Insure++ (an evaluation license was provided by Parasoft)
• Rerun the test on run 195566
Old slides
Introduction
• More than 15% of the jobs crash in one of the two streamers: AliTRDtrackV1::Streamer or AliTRDcluster::Streamer
• The problem is reproducible only on SLC5
– The same Root/AliRoot with the same raw files works on Ubuntu or MacOS
– If the code is compiled without optimization (-O0) it also works on SLC5
– If you start directly from the event that crashed, the job is OK => the crash depends on the “history”
– Sometimes the reconstruction doesn’t crash, and the probability that it is OK again in the next run is higher
• Hypothesis
– memory corruption
– problem in IO
Localization of problem
• Replace the AliEn OCDB with a local one
• Replace the AliEn raw file with a local one
• => xrootd is not involved in the crash
• Reduce the list of detectors:
– “minimal configuration” to reproduce the crash: ITS, TPC, TRD, PHOS, EMCAL, HLT
• Debug printout (gDebug): large and useless
“Simple” debugging with gdb
• Find the exact place of the crash
• Investigate the content
– Corrupted structure in CINT
– Try a watchpoint on the address with wrong content: this doesn’t work because the corrupted address changes
• Compile without optimization only the affected class
– The problem is reproducible, but almost no additional information came out
Debugging with test function
• The test function examines the content of the global list, where the corruption occurs
• Possibility to “bracket” the place of corruption (the closer to the actual place, the less reproducible)
• Localized to the reading of PHOS raw data
• Possibility to set a watchpoint (worked once or twice out of many attempts)
• Full calling chain: involves TBufferFile, TStreamerInfoReadBuffer, TStreamerInfoActions, TBranchElement, TBranch, TBranchRef, TRefTable, TRef, AliRawEquipmentV2
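The bracketing idea can be sketched as follows. This is an illustration under assumptions, not the test function actually used: the Entry type, the kGuard invariant, and the names ListIsIntact, Checkpoint and CorruptionBracket are hypothetical, and a plain std::vector stands in for the global list.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one element of the global list; fGuard is an
// assumed invariant that every healthy entry satisfies.
const int kGuard = 0x5AFE;
struct Entry {
  int fId;
  int fGuard;
};

// Records between which two checkpoints the corruption first appeared.
struct CorruptionBracket {
  std::string fLastGood;  // last checkpoint where the list was intact
  std::string fFirstBad;  // first checkpoint where corruption was seen
};

// The "test function": true if every entry still satisfies the invariant.
bool ListIsIntact(const std::vector<Entry>& list) {
  for (const Entry& e : list) {
    if (e.fGuard != kGuard) return false;
  }
  return true;
}

// Call at suspected places in the code; the bracket is updated only until
// the first corrupted state has been observed.
bool Checkpoint(const std::vector<Entry>& list, const std::string& where,
                CorruptionBracket& bracket) {
  const bool ok = ListIsIntact(list);
  if (!bracket.fFirstBad.empty()) return ok;  // bracket already closed
  if (ok) {
    bracket.fLastGood = where;
  } else {
    bracket.fFirstBad = where;
  }
  return ok;
}
```

Adding checkpoints between fLastGood and fFirstBad narrows the bracket step by step. As noted above, every added call also changes the “history” of the job, which is why the crash became less reproducible as the bracket tightened.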
Changes in Root/AliRoot
• Inspection of all modifications in the affected classes
– Nothing suspicious
• Test with old Root tag: works, but probably by chance
• PHOS raw data format/consistency: tested by the PHOS experts, no changes since 2011
• RAW data framework: no changes since 2011
Runs with Valgrind
• Memcheck
– it detects the use of corrupted CINT structures, but not the moment they are corrupted
– no errors when we only read RAW
• SGcheck
– one invalid write in string operations
– no errors when we only read RAW
• Latest version of Valgrind (3.8.1): no new problems detected
Other tools
• Free
– electric fence: puts 2 words for each allocated word, too “memory hungry” => cannot be easily used
– duma: clone of electric fence
– libcwd: not tried
• Commercial
– Insure++: problem with the license server, no reply from Sdt.Support@cern.ch
  • Parasoft contacted for an evaluation license
– Purify: the same problem + no experience
– TotalView: no version is available
• Coverity: check carefully the remaining defects
Status on 21/02/13
• Hypothesis: “second order” corruption
– Corruption in the IO caused by unknown code (in allocated memory, since Valgrind doesn’t detect it)
– Corruption in the CINT structures caused by the problem in IO
• Difficult to debug
Additional simplification
• rec.SetRunVertexFinderTracks(kFALSE);
• rec.SetRunMultFinder(kFALSE);
• rec.SetRunCascadeFinder(kFALSE);
• rec.SetFillTriggerESD(kFALSE);
• rec.SetWriteAlignmentData(kFALSE);
• rec.SetRunLocalReconstruction("ITS TPC TRD PHOS HLT");
• rec.SetRunTracking("ITS TPC TRD");
• rec.SetFillESD("HLT");
• When HLT is out of FillESD, everything works!
Investigation of FillESD
• Set “return” at different places to localize the code that causes the crash
• Since the success of an attempt is somehow correlated with the previous attempt, always rerun the “crashing” version first and then go to the changed one
• Calling chain
– AliHLTReconstructor::FillESD
– AliHLTSystem::ProcessHLTOUT
– AliHLTOUTHandler::ProcessData -> AliHLTTriggerAgent::ProcessData
– AliHLTOUT::GetDataObject
– AliHLTMessage::Extract
– AliHLTMessage::ReadObject (calls TBufferFile::ReadObject)
• The object that causes problems is AliHLTGlobalTriggerDecision
• Everything works with a default object (no IO)
AliHLTGlobalTriggerDecision
• This class has not changed for a long time (607 days)
• The object is written using Root v5-33-02b. It contains
– AliHLTDomainEntry (594 days since the last change)
– AliHLTComponentDataType (in AliHLTDataTypes.h, 374 days)
– AliHLTTriggerDomain (807 days)
– AliHLTTriggerDecision (861 days)
– AliHLTLogging (209 days)
– AliHLTCTPData (857 days)
– AliHLTReadoutList (502 days)
– AliHLTEventDDLV1 (in AliHLTDataTypes.h, 374 days)
Plans
• Try with realistic AliHLTGlobalTriggerDecision object without IO
• Investigate all changes between Root v5-33-02b and v5-34-02
• Check with different raw files