Debugging of #100019 P. Hristov 04/03/2013



Debugging of #100019

P. Hristov 04/03/2013


Introduction

• Difficult problem
  – The behavior is “random” and depends on the “history”
  – The debugger doesn’t show what actually happens
  – The standard tools (e.g. valgrind) do not detect it
  – All the tricks I tried did not help
• Important problem
  – Many jobs crash with the same message


Binary search for problem identification

• Needs a fast multiprocessor machine for compilation (thanks to DAQ!)
• One has to repeat the same test several (N) times to be sure that a given revision works or crashes (I used N=5)
• In this way I identified rev. 59755: the fix for the “label==0” problem
• Recall that this revision also caused #99670, Increased virtual memory after the "Label 0 fix"
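The procedure above can be sketched as a binary search in which every probe is repeated N times. This is only an illustration under stated assumptions: revisions are treated as plain integers, `good` is known to work and `bad` is known to crash, and `CrashTest` is a hypothetical callback that builds one revision, runs the test once, and reports whether that run crashed.

```cpp
#include <cassert>
#include <functional>

// Hypothetical callback (not part of AliRoot): builds revision `rev`,
// runs the test once, and returns true if that run crashed.
using CrashTest = std::function<bool(int rev)>;

// A revision counts as bad as soon as one of `nTrials` runs crashes;
// it counts as good only if all of them pass.
bool IsBadRevision(int rev, int nTrials, const CrashTest& crashes) {
  for (int i = 0; i < nTrials; ++i) {
    if (crashes(rev)) return true;
  }
  return false;
}

// Binary search for the first bad revision in (good, bad].
// Precondition: `good` works, `bad` crashes, good < bad.
int FirstBadRevision(int good, int bad, int nTrials, const CrashTest& crashes) {
  assert(good < bad);
  while (bad - good > 1) {
    const int mid = good + (bad - good) / 2;
    if (IsBadRevision(mid, nTrials, crashes)) {
      bad = mid;   // the regression is at mid or earlier
    } else {
      good = mid;  // mid looks fine, the regression came later
    }
  }
  return bad;
}
```

Because the crash is random, N clean runs do not prove that a revision is good: if a bad revision crashes with probability p per run, it is still misclassified with probability (1-p)^N, so a larger N trades compilation and run time for confidence.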


Static initialization order fiasco

• See the details in
  – http://www.parashift.com/c++-faq-lite/static-init-order.html
  – http://www.parashift.com/c++-faq-lite/static-init-order-on-intrinsics.html
• Short description: in the implementation file

    #include "MyClass.h"
    …
    const int fkLookForTrouble = AnotherClass::GetValue();
    // Methods
    …
    void MyClass::DoMess() {
      // Use fkLookForTrouble
      …
    }

• We had a similar problem in 2007 with pointers initialized from a factory method: easier since the crash is “less random”
• The fix is to use AnotherClass::GetValue() directly
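A minimal sketch of the fix, assuming nothing about the real classes: `MyClass`, `AnotherClass::GetValue()` and `DoMess()` are the placeholder names from the slide, while the returned value 42, the `int` return type of `DoMess()` and the function-local static are invented for illustration. The namespace-scope constant is left commented out; the value is requested at the point of use instead.

```cpp
#include <cassert>

// Stand-in for the class from the slide; the returned value is invented.
class AnotherClass {
 public:
  static int GetValue() {
    // Function-local static: initialized no later than the first call,
    // so it is ready no matter which source file asks first.
    static const int value = 42;
    return value;
  }
};

class MyClass {
 public:
  int DoMess() const;
};

// Problematic pattern from the slide (commented out on purpose):
// a namespace-scope constant whose initializer calls into another class.
// If GetValue() depended on a static object defined in another source
// file, the relative order of the two initializations would be unspecified.
//
// const int fkLookForTrouble = AnotherClass::GetValue();

int MyClass::DoMess() const {
  // Fix described on the slide: ask for the value at the point of use.
  return AnotherClass::GetValue() + 1;
}
```

Calling `MyClass().DoMess()` then reads the value only when the method runs, after all static initialization has finished. The function-local static inside `GetValue()` follows the "construct on first use" idea discussed in the FAQ linked above; whether the real `AnotherClass` needs it depends on what `GetValue()` reads.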


Test after the fix

• Run 195566 – one of the worst runs
  – DONE 2,686 from 4,648 (57.8%)
  – ERROR_V 1227 (26.5%) – bug
  – EXPIRED 724 (15.6%) – memory
• After the fix
  – DONE 4,089 from 4,648 (88.0%)
  – ERROR_V 415 (8.9%) – bug
  – EXPIRED 144 (3.1%) – memory


To Do

• Check all similar places in AliRoot and provide a fix

• Fix some memory leaks found by Insure++ (an evaluation license was provided by Parasoft)

• Rerun the test on run 195566


Old slides


Introduction

• More than 15% of the jobs crash in one of the two streamers: AliTRDtrackV1::Streamer or AliTRDcluster::Streamer
• The problem is reproducible only on SLC5
  – The same Root/AliRoot with the same raw files works on Ubuntu or MacOS
  – If the code is compiled without optimization (-O0) it also works on SLC5
  – If you start directly from the event that crashed, the job is OK => the crash depends on the “history”
  – Sometimes the reconstruction doesn’t crash, and the probability that it is OK again in the next run is higher
• Hypothesis
  – memory corruption
  – problem in IO


Localization of problem

• Replace AliEn OCDB with local one
• Replace the AliEn raw file with local one
• => xrootd is not involved in the crash
• Reduce the list of detectors:
  – “minimal configuration” to reproduce the crash: ITS, TPC, TRD, PHOS, EMCAL, HLT
• Debug printout (gDebug): large and useless


“Simple” debugging with gdb

• Find the exact place of crash
• Investigate the content
  – Corrupted structure in CINT
  – Try watchpoint on the address with wrong content: this doesn’t work because the corrupted address changes
• Compile without optimization only the affected class
  – The problem is reproducible, but almost no additional information came out


Debugging with test function

• The test function examines the content of the global list, where the corruption occurs
• Possibility to “bracket” the place of corruption (closer to the actual place, less reproducible)
• Localized to the reading of PHOS raw data
• Possibility to set watchpoint (worked once/twice out of many attempts)
• Full calling chain: involves TBufferFile, TStreamerInfoReadBuffer, TStreamerInfoActions, TBranchElement, TBranch, TBranchRef, TRefTable, TRef, AliRawEquipmentV2
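The bracketing idea can be sketched as follows. Everything here is hypothetical and much simpler than the real case: `gList` stands in for the corrupted global list, the invariant `check == ~id` is invented so that corruption can be detected at all, and `Checkpoint` is the "test function" that is called before and after each suspect step.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one element of the global list.
struct Entry {
  int id;
  int check;  // invariant chosen for this sketch: check == ~id
};

std::vector<Entry> gList;  // hypothetical global list

void AddEntry(int id) { gList.push_back({id, ~id}); }

// Test function: returns true while every entry satisfies the invariant.
bool ListIsSane() {
  for (const Entry& e : gList) {
    if (e.check != ~e.id) return false;
  }
  return true;
}

// Label of the first checkpoint at which the list was found corrupted;
// stays empty while no corruption has been seen.
std::string gFirstBadCheckpoint;

void Checkpoint(const std::string& label) {
  if (gFirstBadCheckpoint.empty() && !ListIsSane()) {
    gFirstBadCheckpoint = label;
  }
}
```

With calls such as `Checkpoint("before PHOS raw")` and `Checkpoint("after PHOS raw")` around a suspect step, the first label recorded gives the upper end of the bracket and the preceding clean checkpoint gives the lower end. As noted on the slide, moving the checkpoints closer to the actual place made the problem less reproducible, so the bracket could not be narrowed arbitrarily.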


Changes in Root/AliRoot

• Inspection of all modifications in the affected classes
  – Nothing suspicious

• Test with old Root tag: works, but probably by chance

• PHOS raw data format/consistency: tested by the PHOS experts, no changes since 2011

• RAW data framework: no changes since 2011


Runs with Valgrind

• Memcheck
  – it detects the use of corrupted CINT structures, but not the moment they are corrupted
  – no errors when we only read RAW
• SGcheck
  – one invalid write in string operations
  – no errors when we only read RAW

• Latest version of Valgrind (3.8.1): no new problems detected


Other tools

• Free
  – electric fence: put 2 words for each allocated word, too “memory hungry” => cannot be easily used
  – duma: clone of electric fence
  – libcwd: not tried
• Commercial
  – Insure++: problem with the license server, no reply from [email protected]
    • Parasoft contacted for evaluation license
  – Purify: the same problem + no experience
  – TotalView: no version is available

• Coverity: check carefully the remaining defects


Status on 21/02/13

• Hypothesis: “second order” corruption
  – Corruption in the IO caused by unknown code (in allocated memory since Valgrind doesn’t detect it)
  – Corruption in the CINT structures caused by the problem in IO
• Difficult to debug


Additional simplification

• rec.SetRunVertexFinderTracks(kFALSE);
• rec.SetRunMultFinder(kFALSE);
• rec.SetRunCascadeFinder(kFALSE);
• rec.SetFillTriggerESD(kFALSE);
• rec.SetWriteAlignmentData(kFALSE);
• rec.SetRunLocalReconstruction("ITS TPC TRD PHOS HLT");
• rec.SetRunTracking("ITS TPC TRD");
• rec.SetFillESD("HLT");
• When HLT is out of FillESD, everything works!


Investigation of FillESD

• Set “return” at different places to localize the code that causes crash
• Since somehow the success of the attempt is correlated with the previous attempt, always rerun the “crashing” version and then go to the changed one
• Calling chain
  – AliHLTReconstructor::FillESD
  – AliHLTSystem::ProcessHLTOUT
  – AliHLTOUTHandler::ProcessData -> AliHLTTriggerAgent::ProcessData
  – AliHLTOUT::GetDataObject
  – AliHLTMessage::Extract
  – AliHLTMessage::ReadObject (calls TBufferFile::ReadObject)
• The object that causes problems is AliHLTGlobalTriggerDecision
• Everything works with default object (no IO)
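The early-“return” technique can be illustrated with a small helper that runs a sequence of steps and stops before a chosen one. This is a sketch only: `RunUntil` and the list of steps are hypothetical and do not correspond to the actual structure of AliHLTReconstructor::FillESD, where the returns were placed by hand.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Runs the steps in order but stops before the step with index
// `stopBefore`, which plays the role of the temporary "return".
// Returns the number of steps actually executed.
int RunUntil(const std::vector<std::function<void()>>& steps, int stopBefore) {
  int executed = 0;
  for (const auto& step : steps) {
    if (executed >= stopBefore) break;  // the temporary early return
    step();
    ++executed;
  }
  return executed;
}
```

Moving `stopBefore` forward one step at a time, and rerunning the known crashing configuration before each changed one as the slide recommends, shows which step first brings the crash back.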


AliHLTGlobalTriggerDecision

• This class has not changed for a long time (607 days)
• The object is written using Root v5-33-02b. Contains
  – AliHLTDomainEntry (594 days after the last change)
  – AliHLTComponentDataType (in AliHLTDataTypes.h, 374 days)
  – AliHLTTriggerDomain (807 days)
  – AliHLTTriggerDecision (861 days)
  – AliHLTLogging (209 days)
  – AliHLTCTPData (857 days)
  – AliHLTReadoutList (502 days)
  – AliHLTEventDDLV1 (in AliHLTDataTypes.h, 374 days)


Plans

• Try with realistic AliHLTGlobalTriggerDecision object without IO

• Investigate all changes between Root v5-33-02b and v5-34-02

• Check with different raw files