Debugging Memory Leaks
TotalView Overview
Hybrid and Accelerated Applications
• What do we see
– NVIDIA Tesla GP-GPU computational accelerators
– Intel Xeon Phi Coprocessors
– Complex memory hierarchies (NUMA, device vs host, etc.)
– Custom languages such as CUDA and OpenCL
– Directive based programming such as OpenACC and OpenMP
– Core and thread counts going up
• A lot of complexity to deal with if you want performance
– C or Fortran with MPI starts to look “simple”
– Everything is Multiple Languages / Parallel Paradigms
– Up to 4 “kinds” of parallelism (cluster, thread, heterogeneous, vector)
– Data movement and load balancing
© 2014 Rogue Wave Software, Inc. All Rights Reserved
How does Rogue Wave help?
• Troubleshooting and analysis tool
– Visibility Into
– Control Over
• Scalability
• Usability
• Support for HPC platforms and languages
TotalView debugger
What is TotalView®?
Application Analysis and Debugging Tool: Code Confidently
• Debug and analyze C/C++ and Fortran on Linux™, Unix or Mac OS X
• Laptops to supercomputers
• Makes developing, maintaining, and supporting critical apps easier and less risky
Major Features
• Easy to learn graphical user interface with data visualization
• Parallel Debugging
– MPI, Pthreads, OpenMP™, GA, UPC
– CUDA™, OpenACC®, and Intel® Xeon Phi™ coprocessor
• Low tool overhead and resource usage
• Includes a Remote Display Client which frees you to work from anywhere
• Memory Debugging with MemoryScape™
• Deterministic Replay Capability Included on Linux/x86-64
• Non-interactive Batch Debugging with TVScript and the CLI
• TTF & C++View to transform user defined objects
TotalView for the NVIDIA® GPU Accelerator
• NVIDIA Tesla K40
• NVIDIA CUDA 5.0 and 5.5 (new in 8.13)
– With limited support for dynamic parallelism
• Cray CCE OpenACC
• Features and capabilities include
– Support for MPI based clusters and multi-card configurations
– Flexible Display and Navigation on the CUDA device
• Physical (device, SM, Warp, Lane)
• Logical (Grid, Block) tuples
– CUDA device window reveals what is running where
– Support for types and separate memory address spaces
– Leverages CUDA memcheck
HRL Laboratories
• “In the first full day of using TotalView, we were quickly able to solve the bug that had us stumped for weeks.” – Kirill Minkovich, HRL Laboratories
• Neural Network Simulation in the Defense Industry
• Need to adopt new technology and achieve massive scaling on a deadline
• Project went from failing to over-achieving
TotalView for the Intel® Xeon Phi™ coprocessor
Supports All Major Intel Xeon Phi Coprocessor Configurations
• Native Mode
– With or without MPI
• Offload Directives
– Incremental adoption, similar to GPU
• Symmetric Mode - New in 8.13
– Host and Coprocessor
• Multi-device, Multi-node
• Clusters
User Interface
• MPI Debugging Features
– Process Control, View Across, Shared Breakpoints
• Heterogeneous Debugging
– Debug Both Xeon and Intel Xeon Phi Processes
Memory Debugging - New in 8.13
• Both native and symmetric mode
Beacon Project
• 210 Tflops, 11,520 Xeon Phi Coprocessor Cores
• #1 on the Dec 2012 Green 500
• 9 scientific teams optimizing apps including: MHD, Plasma, Cosmology, Chemistry, QCD, Bio-informatics…
• Porting & optimization – Hybrid MPI + OpenMP
– Many more threads than previous paradigms
• Subtle issues might present themselves – And did
Debugging on Beacon with TotalView
• OpenMP Hybridization of Boltzmann BGK
– Correctness issues came up with the OpenMP code
– Troubleshooting with TotalView
• Native mode debugging on the Xeon Phi
• Thread level examination of the OpenMP region
• Comparison of data between threads
• … Clarified otherwise puzzling results
– Developers were able to resolve the correctness issue
– Ultimately obtained performance gains on the Xeon Phi
• And correct results
What Is MemoryScape®?
• Runtime Memory Analysis: Eliminate Memory Errors
– Detects memory leaks before they are a problem
– Explore heap memory usage with powerful analytical tools
– Use for validation as part of a quality software development process
• Major Features
– Included in TotalView, or Standalone
– Detects
• Malloc API misuse
• Memory leaks
• Buffer overflows
– Supports
• C, C++, Fortran
• Linux, Unix, and Mac OS X
• MPI, pthreads, OMP, and remote apps
• Now includes Intel® Xeon Phi™
– Low runtime overhead
– Easy to use
• Works with vendor libraries
• No recompilation or instrumentation
Dassault Systèmes
• “MemoryScape enabled us to … proactively fix potential errors prior to release.”
– Nick Monyatovsky, Software Engineer
• Computer Aided Engineering ISV for Aero/Auto/Industry
• Struggling with intermittent errors
• Continuous Integration – Better Product Quality
• “MemoryScape enabled us to identify memory issues, and by using its scripting interface, we were able to automate the evaluation process. Now, the system automatically uncovers any hidden latent errors in our code with every build, allowing our developers to proactively fix potential errors prior to release.” – Nick Monyatovsky, Software Engineer at SIMULIA
Deterministic Replay Debugging
• Reverse Debugging: Radically simplify your debugging
– Captures and Deterministically Replays Execution
• Not just “checkpoint and restart”
– Eliminate the Restart Cycle and Hard-to-Reproduce Bugs
– Step Back and Forward by Function, Line, or Instruction
• Specifications
– A feature included in TotalView on Linux x86 and x86-64
• No recompilation or instrumentation
• Explore data and state in the past just like in a live process, including C++View transformations
– Replay on Demand: enable it when you want it
– Supports MPI on Ethernet, Infiniband, Cray XE Gemini
– Supports Pthreads and OpenMP
Cambridge Study
• A survey conducted by the Judge Business School at Cambridge University concluded that reverse debuggers allow users, on average, to spend 13 percentage points less of their programming time debugging.
– Programming was 50% of the total work week on average
– Debugging was 50% of programming time without reverse debugging
– Debugging was 37% of programming time with reverse debugging
– That frees up 130 hours (>3 work weeks, 6.5% of total time) per developer per year for design and new feature development
• The survey looked at the total value (salaries & overhead) of debugging as a task and determined that this savings could, across the whole world economy, be worth $41 billion in increased productivity.
– The productivity improvement should be worth $2,500 per developer per year (salary only) or $5,000 per year with overhead.
• http://www.roguewave.com/company/news-events/press-releases/2013/university-of-cambridge-reverse-debugging-study.aspx
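The percentages on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 2000-hour working year is an assumption (the slide does not state it), chosen because it reproduces the quoted 130 hours, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Hours per developer per year freed by reverse debugging.
 * hours_per_year is an assumption: the slide does not give it, but
 * 2000 hours (50 weeks x 40 hours) reproduces the quoted 130 hours. */
double hours_saved(double hours_per_year,
                   double programming_share,
                   double debug_share_without,
                   double debug_share_with)
{
    assert(debug_share_without >= debug_share_with);
    return hours_per_year * programming_share *
           (debug_share_without - debug_share_with);
}

/* Prints the slide's figures using the assumed 2000-hour year. */
void print_savings(void)
{
    double saved = hours_saved(2000.0, 0.50, 0.50, 0.37);
    printf("hours saved per year: %.1f\n", saved);
    printf("share of total time:  %.1f%%\n", 100.0 * saved / 2000.0);
}
```

Half of the week is programming, and the debugging share of that half drops by 0.13, so the saving is 0.50 x 0.13 = 6.5% of total time.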
Memory Debugging
What is a Memory Bug?
• A Memory Bug is a mistake in the management of heap memory
• Failure to check for error conditions
• Leaking: Failure to free memory
• Dangling references: Failure to clear pointers
• Memory Corruption
• Writing to memory not allocated
• Overrunning array bounds
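Two of the bug classes listed above, leaking and dangling references, can be illustrated with a small C sketch. The function names are hypothetical and the code is not taken from the presentation; it only shows the kind of mistake a memory debugger is meant to report, next to one possible fix.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts vowels in a private copy of text. Returns -1 if malloc fails. */
static int vowels_in(const char *s)
{
    int n = 0;
    for (; *s != '\0'; ++s)
        if (strchr("aeiouAEIOU", *s) != NULL)
            ++n;
    return n;
}

/* Leaky version: the early return skips free(), so the copy is lost. */
int count_vowels_leaky(const char *text)
{
    char *copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        return -1;                 /* error condition is checked */
    strcpy(copy, text);
    if (copy[0] == '\0')
        return 0;                  /* BUG: copy is leaked on this path */
    int n = vowels_in(copy);
    free(copy);
    return n;
}

/* Fixed version: a single exit path frees whatever was allocated. */
int count_vowels(const char *text)
{
    char *copy = malloc(strlen(text) + 1);
    if (copy == NULL)
        return -1;
    strcpy(copy, text);
    int n = vowels_in(copy);
    free(copy);
    return n;
}

/* Frees *pp and clears it so no dangling reference is left behind. */
void release(char **pp)
{
    assert(pp != NULL);
    free(*pp);
    *pp = NULL;
}
```

The leaky path only triggers for empty input, which is the kind of rarely exercised branch where such bugs tend to hide.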
Why Are Memory Bugs Hard to Find?
• Memory problems can lurk
• For a given scale or platform or problem, they may not be fatal
• Libraries could be the source of the problem
• The fallout can occur at any subsequent memory access through a pointer
• The mistake is rarely fatal in and of itself
• The mistake and fallout can be widely separated
• Potentially 'racy'
• Memory allocation pattern non-local
• Even the fallout is not always fatal. It can result in data corruption which may or may not result in a subsequent crash
• May be caused by or cause of a 'classic' bug
The Agent and Interposition
[Diagram, two panels. First panel labels: Process, TotalView, Malloc API, User Code and Libraries. Second panel adds: Heap Interposition Agent (HIA), Allocation Table, Deallocation Table.]
TotalView HIA Technology
• Advantages of TotalView HIA Technology
• Use it with your existing builds
– No Source Code or Binary Instrumentation
• Programs run at nearly full speed
– Low performance overhead
• Low memory overhead
– Efficient memory usage
• Supports a wide range of platforms and compilers
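The idea of an agent that sits between user code and the malloc API and keeps tables of allocations can be sketched in a few lines of C. This is a minimal illustration under stated assumptions, not how the TotalView HIA is implemented: the wrapper names (`tracked_malloc`, `tracked_free`), the fixed table size, and the explicit wrapper calls are all hypothetical. A real interposition agent intercepts the ordinary `malloc` and `free` calls without source changes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_TRACKED 1024            /* hypothetical fixed table size */

struct alloc_record { void *addr; size_t size; };

static struct alloc_record alloc_table[MAX_TRACKED];  /* live blocks */
static size_t dealloc_count;                           /* frees seen */

/* Returns the slot holding addr, or MAX_TRACKED if there is none.
 * Passing NULL finds the first empty slot. */
static size_t find_slot(const void *addr)
{
    for (size_t i = 0; i < MAX_TRACKED; ++i)
        if (alloc_table[i].addr == addr)
            return i;
    return MAX_TRACKED;
}

/* Allocates through malloc and records the block in the table. */
void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p == NULL)
        return NULL;
    assert(find_slot(p) == MAX_TRACKED);  /* a live address is never reissued */
    size_t slot = find_slot(NULL);
    if (slot < MAX_TRACKED) {             /* if full, the block goes untracked */
        alloc_table[slot].addr = p;
        alloc_table[slot].size = size;
    }
    return p;
}

/* Frees a tracked block. Returns -1 without calling free() when the
 * address is not in the table (a "free not allocated" event). */
int tracked_free(void *p)
{
    if (p == NULL)
        return 0;
    size_t slot = find_slot(p);
    if (slot == MAX_TRACKED)
        return -1;
    alloc_table[slot].addr = NULL;
    alloc_table[slot].size = 0;
    ++dealloc_count;
    free(p);
    return 0;
}

/* Number of blocks currently recorded as live. */
size_t live_blocks(void)
{
    size_t n = 0;
    for (size_t i = 0; i < MAX_TRACKED; ++i)
        if (alloc_table[i].addr != NULL)
            ++n;
    return n;
}

/* Number of successful tracked frees so far. */
size_t freed_blocks(void)
{
    return dealloc_count;
}
```

Because every allocation and deallocation passes through one place, questions such as "which blocks are still live?" or "was this address ever allocated?" become table lookups.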
Memory Debugger Features
• Automatically detect allocation problems
• View the heap
• Leak detection
• Block painting
• Memory Hoarding
• Dangling pointer detection
• Deallocation/reallocation notification
• Memory Corruption Detection - Guard Blocks
• Memory Comparisons between processes
• Collaboration features
Enabling Memory Debugging
Memory Event Notification
Memory Event Details Window
Heap Graphical View
Leak Detection
• Leak Detection
– Based on Conservative Garbage Collection
– Can be performed at any point in runtime
• Helps localize leaks in time
• Multiple Reports
– Backtrace Report
– Source Code Structure
– Graphical Memory Location
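Leak detection "based on conservative garbage collection" means treating any word that looks like a pointer into a heap block as a reference to that block; blocks with no such reference are reported as leaked. The sketch below is a simplified illustration with hypothetical names, not MemoryScape's algorithm: it scans only a caller-supplied array of root words and does not follow pointers stored inside reachable blocks, which a full conservative scan would also do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct heap_block {
    uintptr_t base;       /* first address of the block */
    size_t    size;       /* block length in bytes */
    int       reachable;  /* set by the scan */
};

/* Returns the index of the block containing addr, or n if none does. */
static size_t block_containing(const struct heap_block *blocks, size_t n,
                               uintptr_t addr)
{
    for (size_t i = 0; i < n; ++i)
        if (addr >= blocks[i].base && addr - blocks[i].base < blocks[i].size)
            return i;
    return n;
}

/* Conservative scan: any root word whose value lands inside a block,
 * including interior addresses, is treated as a pointer to that block.
 * Returns the number of blocks that no root word refers to. */
size_t count_leaks(struct heap_block *blocks, size_t n,
                   const uintptr_t *roots, size_t nroots)
{
    assert(blocks != NULL || n == 0);
    assert(roots != NULL || nroots == 0);

    for (size_t i = 0; i < n; ++i)
        blocks[i].reachable = 0;

    for (size_t r = 0; r < nroots; ++r) {
        size_t i = block_containing(blocks, n, roots[r]);
        if (i < n)
            blocks[i].reachable = 1;
    }

    size_t leaks = 0;
    for (size_t i = 0; i < n; ++i)
        if (!blocks[i].reachable)
            ++leaks;
    return leaks;
}
```

The approach is conservative because an integer that happens to equal a heap address keeps a block marked reachable, so it can miss a leak but does not report a block that is still referenced.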
Dangling Pointer Detection
Memory Corruption Report
Memory Comparisons
• “Diff” live processes
• Compare processes across cluster
• Compare with baseline
• See changes between point A and point B
• Compare with saved session
• Provides memory usage change from last run
Memory Usage Statistics
Memory Reports
• Multiple Reports
– Memory Statistics
– Interactive Graphical Display
– Source Code Display
– Backtrace Display
• Allow the user to
• Monitor Program Memory Usage
• Discover Allocation Layout
• Look for Inefficient Allocation
• Look for Memory Leaks
Multi-Process and Multi-Thread
• Memory debug many processes at the same time
– MPI
– Client-Server
– Fork-Exec
– Compare two runs
• Remote applications
• Multi-threaded applications
Script Mode - MemScript
• Automation Support
– MemoryScape lets users run tests and check programs for memory leaks without having to be in front of the program
– Simple command line program called MemScript
• Doesn’t start up the GUI
• Can be run from within a script or test harness
• The user defines
– What configuration options are active
– What things to look for
– Actions MemoryScape should take for each type of event that may occur
Red Zones: instant array bounds detection for Linux
• Red Zones provide:
– Immediate detection of memory overruns.
– Detection of access violations both before and after the bounds of allocated memory.
– Detection of deallocated memory accesses.
• Red Zones events
– MemoryScape will stop your program’s execution and raise an event alerting you to the illegal access. You will be able to see exactly where your code overstepped the bounds.
• Note: Not available on the Blue Gene/Q
Memory Corruption Detection (Guard Blocks)
Configuring Guard Blocks
Guard allocated memory: When selected, the Memory Debugger writes guard blocks before and after a memory block that your program allocates.
Pre-Guard and Post-Guard Size: Sets the size in bytes of the block that the Memory Debugger places immediately before and after the memory block that your program allocates.
Pattern: Indicates the pattern that the Memory Debugger writes into guard blocks. The default values are 0x77777777 and 0x99999999.
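The guard block mechanism described here can be approximated in plain C. This is a minimal sketch, not the Memory Debugger's implementation: the function names and the 8-byte guard size are hypothetical, and it assumes that 0x77777777 is the pre-guard pattern and 0x99999999 the post-guard pattern, which the slide does not state explicitly.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define GUARD_BYTES  8u            /* hypothetical guard size, multiple of 4 */
#define PRE_PATTERN  0x77777777u   /* assumed to be the pre-guard default */
#define POST_PATTERN 0x99999999u   /* assumed to be the post-guard default */

/* Fills one guard region with a repeating 32-bit pattern. */
static void fill_guard(unsigned char *dst, uint32_t pattern)
{
    for (size_t i = 0; i < GUARD_BYTES; i += sizeof pattern)
        memcpy(dst + i, &pattern, sizeof pattern);
}

/* Returns 1 if the guard region still holds the pattern, else 0. */
static int guard_intact(const unsigned char *src, uint32_t pattern)
{
    for (size_t i = 0; i < GUARD_BYTES; i += sizeof pattern)
        if (memcmp(src + i, &pattern, sizeof pattern) != 0)
            return 0;
    return 1;
}

/* Allocates size user bytes with a guard before and after them.
 * The returned pointer is only GUARD_BYTES-aligned in this sketch. */
void *guarded_malloc(size_t size)
{
    unsigned char *raw = malloc(GUARD_BYTES + size + GUARD_BYTES);
    if (raw == NULL)
        return NULL;
    fill_guard(raw, PRE_PATTERN);
    fill_guard(raw + GUARD_BYTES + size, POST_PATTERN);
    return raw + GUARD_BYTES;
}

/* Returns 0 if both guards are intact, 1 if the pre-guard was
 * overwritten, 2 if the post-guard was overwritten. */
int check_guards(const void *user, size_t size)
{
    assert(user != NULL);
    const unsigned char *p = user;
    if (!guard_intact(p - GUARD_BYTES, PRE_PATTERN))
        return 1;
    if (!guard_intact(p + size, POST_PATTERN))
        return 2;
    return 0;
}

/* Releases a block obtained from guarded_malloc. */
void guarded_free(void *user)
{
    if (user != NULL)
        free((unsigned char *)user - GUARD_BYTES);
}
```

Unlike Red Zones, a check of this kind only notices corruption when the guards are inspected, not at the moment of the bad write.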
TVScript
tvscript and memscript
• A straightforward language for unattended and/or batch debugging with TotalView and/or MemoryScape
• Usable whenever jobs need to be submitted or batched
• Can be used for automation
• A more powerful version of printf, no recompilation necessary between runs
• Schedule automated debug runs with cron jobs
• Expand its capabilities using TCL
Output
• All of the following information is provided by default for each print
• Process id
• Thread id
• Rank
• Timestamp
• Event/Action description
• A single output file is written containing all of the information regardless of the number of processes/threads being debugged
Sample Output
• Simple interface to create an action point
-create_actionpoint "#85=>print foreign_addr"
• Sample output with all information
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
! Print
!
! Process:
!   ./TVscript_demo (Debugger Process ID: 5, System ID: [email protected])
! Thread:
!   Debugger ID: 5.1, System ID: 3077191888
! Rank:
!   0
! Time Stamp:
!   05-14-2012 17:11:24
! Triggered from event:
!   actionpoint
! Results:
!   err_detail = {
!     intervals = 0x0000000a (10)
!     almost_pi = 3.1424259850011
!     delta = 0.000833243988525023
!   }
!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Events
• General
– any_event
• Source code debugging events
– actionpoint
– error
• Memory events (just a few; all are listed in Chapter 4 of the TotalView Reference Guide)
– any_memory_event
– free_not_allocated
– guard_corruption
– rz_overrun, rz_underrun, rz_use_after_free
Actions
• Source code
• display_backtrace [-level num] [numlevels] [options]
• print [-slice {exp}] {variable | exp}
• Memory
• check_guard_blocks
• list_allocations
• list_leaks
• save_html_heap_status_source_view
• save_memory_debugging_file
• save_text_heap_status_source_view
Command syntax
• General syntax
• tvscript [options] [filename] -a [program_args]
• MPI Options
• -mpi starter (starter comes from the Parallel tab dropdown)
• -starter_args "args for starter program"
• -nodes
• -np or -procs or -tasks
Command syntax
• Action options
• -create_actionpoint "src_expr[=>action1[,action2] …]"
• Repeat on command line for each actionpoint
• -event_action "event_action_list"
• event1=action1,event2=action2 or event1=>action1,action2
• Can repeat on command line for multiple actions
• General options
• -display_specifiers "display_specifiers_list"
• -maxruntime "hh:mm:ss"
• -script_file scriptFile
• -script_log_filename logFilename
• -script_summary_log_filename summaryLogFilename
Command syntax
• Memory debugging options
• -memory_debugging (must use for debugging memory)
• -mem_detect_leaks
• -mem_detect_use_after_free
• -mem_guard_blocks
• -mem_hoard_freed_memory
• -mem_hoard_low_memory_threshold nnnn
• -mem_paint_all
• -mem_paint_on_alloc
• -mem_paint_on_dealloc
Command syntax
• Memory debugging red zone options
• -mem_red_zones_overruns
• -mem_red_zones_size_ranges min:max[,min:max]…
• Ranges can be
• min:max
• min:
• :max
• fixed
• -mem_red_zones_underruns
Script Files
• Instead of putting everything on the command line, you can also write and use script files
• Script files can also include TCL
• Logging functions
• tvscript_log msg – logs msg to the log file
• tvscript_slog msg – logs msg to the summary log file
• Property functions
• tvscript_get_process_property process_id property
• tvscript_get_thread_property thread_id property
Roadmap
What is new in 8.13
• TotalView 8.13
– CUDA 5.0 and 5.5
– MemoryScape for Xeon Phi Native and Symmetric
– Xeon Phi Symmetric
– Mac OS X Mavericks
– Early Access: Scalability Improvements
Next Release
• TotalView 8.14
– CUDA 6.0 Support
– Early Access Support for Save/Load of ReplayEngine recordings
– GCC C++ Type Transform Support for Unordered Collections
• Unordered Map
• Unordered Set
• Unordered Multimap
• Unordered Multiset
– Platform updates
Plans subject to change without notice
N+1 Release
• TotalView 8.15
– CUDA 6.5 Support
– Co-array Fortran support
– OpenACC support in PGI
– Scalability and Performance at Scale Improvements
– Platform updates
Plans subject to change without notice