first-fault data capture - jhuapl.edu


Page 1: First-Fault Data Capture - jhuapl.edu

FIRST-FAULT DATA CAPTURE

Steven A. Stolper, Software Consultant

[email protected]


Page 2: First-Fault Data Capture - jhuapl.edu

Agenda

• Logging and its problems

• First-Fault Data Capture (Tracing)

• FFDC Architectures

Page 3: First-Fault Data Capture - jhuapl.edu

Logging

• Traditionally used to provide information for debugging.

• Best for low-rate data

• “Can you send me the logs?”

Time-Stamp          Seq#    Thread ID     Severity  Module   Message ID  Text
:
2007-8/16:14:17:43  000003  (0xb7f3c6c0)  DEBUG     STARTUP  (00001)     test_startup: spawning thread 1
2007-8/16:14:17:43  000004  (0xb7f3c6c0)  DEBUG     STARTUP  (00002)     test_startup: spawned thread 1
:
:
2007-8/16:14:17:43  000028  (0xb7f3bbb0)  INFO      STARTUP  (08003)     system_WaitForStart: Waiting for system startup.
2007-8/16:14:17:43  000029  (0xb753abb0)  DEBUG     STARTUP  (00004)     Thread 1: Initializing!
2007-8/16:14:17:43  000030  (0xb753abb0)  DEBUG     STARTUP  (00005)     Thread 1: Waiting for start!

Page 4: First-Fault Data Capture - jhuapl.edu

The Problem with Logging

• The “immediate cause” of the failure occurs after the actual problem.

• Error messages are often missing or poorly coded.

• System state information is unavailable

• High-rate data is unavailable

• Many types of failures do not log errors

Relying on error messages to find “the smoking gun” is a matter of chance!

Page 5: First-Fault Data Capture - jhuapl.edu

Traditional Debugging

• Turn on “verbose” logging or enable “instrumented” code.

• Put an instrumented build on the spacecraft (if on the testbed or in ATLO)

Both approaches suffer from a key flaw:

The problem must occur again!

Q: What are the 5 words guaranteed to upset your QA Team?

A: “Can you reproduce the problem?”

Page 6: First-Fault Data Capture - jhuapl.edu

Trace Data Recorder

• Analogous to the “Black Box” flight recorder on aircraft

• Records information about the system's dynamic behavior

• After a crash, the information in the TDR helps to analyze the failure

• The data can provide valuable insight into:

  - complex behavioral problems

  - sporadic changes in system inputs

  - unexpected interactions between subsystems

  - bugs in low-level software.

• High-rate data “always on”

* Software architecture for troubleshooting high-availability systems

Page 7: First-Fault Data Capture - jhuapl.edu

[Figure: TDR architecture diagram. Labels shown: Process 1, Process 2, Process 3, each containing Thread 1 to Thread N and a TDRLib with TDR_BufferInit() and TDR_Store(); LOG Subsystem (“catastrophic” errors); TDR Subsystem with Dump Agent, Retrieval Agent, Billboard, and Non-volatile Storage; Uplink commands “Freeze” / “Dump” and “Retrieve”; “Raw” Dump to the Post-processing Tool in the Ground Data System. Legend: data path, command path, library/data structure, execution context.]

Page 8: First-Fault Data Capture - jhuapl.edu

Buffer Organization

[Figure: two buffer organization layouts, Option #1 and Option #2, each drawn for Process 1 and Process 2 and the libraries (Lib 1) they contain.]

* Option 3 for Q&D

Page 9: First-Fault Data Capture - jhuapl.edu

Example Library Interface

TDR_BufferInit(Application_ID, num_entries, process_str)
    Creates and initializes the buffer for a process.

TDR_ThreadAddIdentity(thread_str)
    Identifies a thread of execution using the buffer.

TDR_ThreadRemoveIdentity()
    Removes thread identification information from the buffer.

TDR_Store(module_ID, event_id, data_p)
    Stores trace information.

TDR_BufferDestroy()
    Destroys the trace buffer.

The “data_p” argument to TDR_Store() points to:

typedef union {
    unsigned char data_byte[TDR_BYTES_PER_ENTRY];
    unsigned int  data_word[TDR_WORDS_PER_ENTRY];
} TDR_data_t;
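As an illustration of how a caller might fill the union before calling TDR_Store(), here is a minimal sketch. The entry size of 16 bytes, the integer argument types, the void return type, and the helper trace_two_ints() are all assumptions made for the example; the slides do not specify them. TDR_Store() here is only a stand-in that remembers the last entry.

```c
#include <assert.h>
#include <string.h>

/* Assumed sizes: the slides do not give the real values. */
#define TDR_BYTES_PER_ENTRY 16
#define TDR_WORDS_PER_ENTRY (TDR_BYTES_PER_ENTRY / sizeof(unsigned int))

typedef union {
    unsigned char data_byte[TDR_BYTES_PER_ENTRY];
    unsigned int  data_word[TDR_WORDS_PER_ENTRY];
} TDR_data_t;

/* Both views of the payload should cover the same bytes. */
_Static_assert(sizeof(TDR_data_t) == TDR_BYTES_PER_ENTRY,
               "byte and word views must have the same size");

/* Stand-in for the real TDR_Store(): remembers only the last entry. */
static struct {
    int        module_id;
    int        event_id;
    TDR_data_t data;
} last_entry;

static void TDR_Store(int module_id, int event_id, const TDR_data_t *data_p)
{
    last_entry.module_id = module_id;
    last_entry.event_id  = event_id;
    last_entry.data      = *data_p;
}

/* Hypothetical call site: trace two integer values for one event. */
static void trace_two_ints(int module_id, int event_id,
                           unsigned int a, unsigned int b)
{
    TDR_data_t d;
    memset(&d, 0, sizeof d);      /* unused words read as zero in a dump */
    d.data_word[0] = a;
    d.data_word[1] = b;
    TDR_Store(module_id, event_id, &d);
}
```

The union lets the same fixed-size payload be written as words by the application and read back as raw bytes by a dump tool.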

Page 10: First-Fault Data Capture - jhuapl.edu

Trace Data Recorder

• Example assumes one buffer per heavy-weight process

• The implementation of the TDR_Store() function is critical:

  - “Application” software calls the function as part of normal execution, so the data is always available if a problem occurs. TDR_Store() is called many hundreds of times each second.

  - Ideally it should execute in fast, constant time and not make any system calls that can block execution of the caller.

  - It should avoid mutual exclusion primitives that can cause unrelated execution contexts to serialize as they contend for access to the capture buffer; an atomic operation such as atomic_inc_return() can be used instead.

  - If buffer sizes are restricted to a power of 2, the buffer can be rolled very swiftly.
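The constraints above can be sketched as a lock-free store into a power-of-2 ring. This is a minimal illustration, not the presented implementation: it swaps the kernel-style atomic_inc_return() for the C11 atomic_fetch_add_explicit(), and the names tdr_store_sketch, tdr_buf, the entry layout, and the sizes are assumptions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TDR_NUM_ENTRIES     8u                    /* assumed; power of 2 */
#define TDR_INDEX_MASK      (TDR_NUM_ENTRIES - 1u)
#define TDR_WORDS_PER_ENTRY 4                     /* assumed payload size */

_Static_assert((TDR_NUM_ENTRIES & TDR_INDEX_MASK) == 0u,
               "buffer size must be a power of 2");

typedef struct {
    uint32_t     module_id;
    uint32_t     event_id;
    uint32_t     seq;             /* stands in for a time stamp */
    unsigned int data_word[TDR_WORDS_PER_ENTRY];
} tdr_entry_t;

typedef struct {
    atomic_uint next;             /* total number of slots claimed so far */
    atomic_bool frozen;           /* set to preserve the buffer contents */
    tdr_entry_t entry[TDR_NUM_ENTRIES];
} tdr_buffer_t;

static tdr_buffer_t tdr_buf;      /* one buffer per process, zero-initialized */

/* Constant-time store: no locks, no system calls. */
static void tdr_store_sketch(uint32_t module_id, uint32_t event_id,
                             const unsigned int *words)
{
    if (atomic_load_explicit(&tdr_buf.frozen, memory_order_relaxed))
        return;                   /* frozen: keep the captured history */

    /* Each caller claims a unique ticket with one atomic increment. */
    unsigned int ticket = atomic_fetch_add_explicit(&tdr_buf.next, 1u,
                                                    memory_order_relaxed);

    /* Power-of-2 size: wrap with a mask instead of a modulo or a branch. */
    tdr_entry_t *e = &tdr_buf.entry[ticket & TDR_INDEX_MASK];
    e->module_id = module_id;
    e->event_id  = event_id;
    e->seq       = ticket;
    memcpy(e->data_word, words, sizeof e->data_word);
}
```

One trade-off of this sketch: a writer that is delayed for a full lap of the ring can overlap with a newer writer in the same slot, and a dump taken mid-write can see a partly written entry. A real recorder would need to decide whether that is acceptable for diagnostic data.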

Page 11: First-Fault Data Capture - jhuapl.edu

Data structures

[Figure: capture buffer data structures]

Capture Buffer Header (Lib only)

Capture Buffer Body (shared between Lib and Dump Agent)

Fields shown: PID, Module ID, Num Entries, Buffer Size, Instance String, Buffer Pointer, Optional Information, Freeze Flag, Next Available, Buffer Statistics, Thread Identification Information, Buffer Entries

Capture Entry: PID, Module ID, Entry ID, Time Stamp, Data
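One way these structures could be declared in C is sketched below. The field widths, the string length, the assignment of fields to header versus body, and the names tdr_capture_entry_t, tdr_capture_body_t, tdr_capture_header_t, tdr_body_size, and tdr_header_init are all assumptions for illustration; the slide gives only the field labels. Buffer statistics and thread identification are omitted for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define TDR_WORDS_PER_ENTRY  4    /* assumed payload size */
#define TDR_INSTANCE_STR_LEN 16   /* assumed string length */

/* Capture Entry: PID, Module ID, Entry ID, Time Stamp, Data */
typedef struct {
    uint32_t     pid;
    uint16_t     module_id;
    uint16_t     entry_id;
    uint64_t     time_stamp;
    unsigned int data_word[TDR_WORDS_PER_ENTRY];
} tdr_capture_entry_t;

/* Capture Buffer Body: the part a dump agent would also read. */
typedef struct {
    uint32_t            freeze_flag;
    uint32_t            next_available;
    tdr_capture_entry_t entries[];        /* flexible array member */
} tdr_capture_body_t;

/* Capture Buffer Header: bookkeeping private to the library. */
typedef struct {
    uint32_t            pid;
    uint16_t            module_id;
    uint32_t            num_entries;
    size_t              buffer_size;      /* bytes in the body */
    char                instance_str[TDR_INSTANCE_STR_LEN];
    tdr_capture_body_t *body;             /* buffer pointer */
} tdr_capture_header_t;

/* Bytes needed for a body that holds num_entries entries. */
static size_t tdr_body_size(uint32_t num_entries)
{
    return sizeof(tdr_capture_body_t)
         + (size_t)num_entries * sizeof(tdr_capture_entry_t);
}

/* Allocate a zeroed body and fill in the header; returns 0 on success. */
static int tdr_header_init(tdr_capture_header_t *h, uint32_t pid,
                           uint16_t module_id, uint32_t num_entries,
                           const char *instance)
{
    memset(h, 0, sizeof *h);
    h->body = calloc(1, tdr_body_size(num_entries));
    if (h->body == NULL)
        return -1;
    h->pid         = pid;
    h->module_id   = module_id;
    h->num_entries = num_entries;
    h->buffer_size = tdr_body_size(num_entries);
    /* Header was zeroed, so the copied string stays NUL-terminated. */
    strncpy(h->instance_str, instance, sizeof h->instance_str - 1);
    return 0;
}
```

Keeping the body free of pointers would let it be copied out raw and interpreted later by a ground tool, which is why the pointer lives in the header in this sketch.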

Page 12: First-Fault Data Capture - jhuapl.edu

Buffer and Dump Control

[Figure: Billboard layout]

Billboard: Current Dump Size, Max Allowed Dump Size, Global “Freeze” Flag, Buffer Pointer Table

Buffer Pointer Table entry: PID, Buffer Pointer, Buffer Size, Orphaned Time

Page 13: First-Fault Data Capture - jhuapl.edu

Compile-Time Interface

• Each entry placed into a capture buffer should have a unique ID.

• Ideally, when defining the ID, the developer could also provide information to help generate a “parser” to post-process the raw dump to produce human-readable output.

• Entry IDs are placed in their own header file for each module.

• A tool can be constructed to scan for duplicate IDs.

Page 14: First-Fault Data Capture - jhuapl.edu

Compile-Time Interface

(cont)

/* Define an example entry ID for a buffer entry containing two integers.
 * Note that there is no ";" following the definition.
 */
TDR_DEFINE_ENTRY(EXAMPLE_ENTRY_ID, 1, "hex value = 0x%x and decimal value = %d")

/* Definition of TDR_DEFINE_ENTRY that turns each entry into an enum constant: */
#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    enum { entry_ID_mnemonic = entry_number };

/* Alternative definition that turns each entry into a table initializer
 * (entry number, mnemonic string, format string): */
#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    { entry_number, #entry_ID_mnemonic, format_string },
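The two definitions of TDR_DEFINE_ENTRY can be applied to the same entry list in turn, once to produce enum constants for application code and once to produce a lookup table for a post-processing tool. The sketch below shows that pattern in a single file. The list macro EXAMPLE_MODULE_ENTRIES stands in for a per-module header, and EXAMPLE_START_ID, tdr_entry_info_t, tdr_lookup, tdr_find_duplicate, and tdr_format_two_ints are hypothetical names added for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for a module's entry-ID header file. */
#define EXAMPLE_MODULE_ENTRIES \
    TDR_DEFINE_ENTRY(EXAMPLE_ENTRY_ID, 1, "hex value = 0x%x and decimal value = %d") \
    TDR_DEFINE_ENTRY(EXAMPLE_START_ID, 2, "start code = %d")

/* Pass 1: each entry becomes an enum constant for application code. */
#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    enum { entry_ID_mnemonic = entry_number };
EXAMPLE_MODULE_ENTRIES
#undef TDR_DEFINE_ENTRY

/* Pass 2: each entry becomes a row in a table for the parser. */
typedef struct {
    int         number;
    const char *mnemonic;
    const char *format;
} tdr_entry_info_t;

#define TDR_DEFINE_ENTRY(entry_ID_mnemonic, entry_number, format_string) \
    { entry_number, #entry_ID_mnemonic, format_string },
static const tdr_entry_info_t tdr_entry_table[] = { EXAMPLE_MODULE_ENTRIES };
#undef TDR_DEFINE_ENTRY

#define TDR_TABLE_LEN (sizeof tdr_entry_table / sizeof tdr_entry_table[0])

/* Find the table row for an entry number, or NULL if unknown. */
static const tdr_entry_info_t *tdr_lookup(int number)
{
    for (size_t i = 0; i < TDR_TABLE_LEN; i++)
        if (tdr_entry_table[i].number == number)
            return &tdr_entry_table[i];
    return NULL;
}

/* Return the first entry number that appears twice, or -1 if all are unique. */
static int tdr_find_duplicate(const tdr_entry_info_t *t, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (t[i].number == t[j].number)
                return t[i].number;
    return -1;
}

/* Post-processing sketch: render a two-integer entry with its format string. */
static int tdr_format_two_ints(char *out, size_t len, int number,
                               unsigned int w0, int w1)
{
    const tdr_entry_info_t *info = tdr_lookup(number);
    if (info == NULL)
        return -1;
    return snprintf(out, len, info->format, w0, w1);
}
```

The compiler accepts two enum constants with the same value, so duplicate IDs are not caught in pass 1; a check like tdr_find_duplicate() over the generated table is one way to build the scanning tool mentioned earlier.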

Page 15: First-Fault Data Capture - jhuapl.edu

Recent Example:

[Figure: plot; axis label: Meters per second]

Page 16: First-Fault Data Capture - jhuapl.edu

Summary

• Logging as a debugging tool suffers from inherent problems

• Tracing gives state and high-rate data vital to troubleshooting problems in the field

Page 17: First-Fault Data Capture - jhuapl.edu

Additional Reading

“The Software Detective: First-fault Data Capture,” Embedded Systems Design Magazine, CMP Media LLC, August 2007