October 22, 2014 Sam Siewert SE420 Software Quality Assurance Lecture 9 Negative Testing, Defect Tracking and Root-Cause Analysis http://www.nasa.gov/pdf/65776main_noaa_np_mishap.pdf, http://en.wikipedia.org/wiki/NOAA-19


Page 1: SE420 Software Quality Assurancemercury.pr.erau.edu/~siewerts/se420/documents/Lectures/Fall-14/Le… · Data Driven CPU Loading Root-Cause on Pathfinder was a Combination of Issues

October 22, 2014 Sam Siewert

SE420 Software Quality Assurance

Lecture 9 – Negative Testing, Defect Tracking and Root-Cause Analysis

http://www.nasa.gov/pdf/65776main_noaa_np_mishap.pdf, http://en.wikipedia.org/wiki/NOAA-19


Reminders

Assignment #4 Due Saturday, 10/25

Remaining Assignments [Top Down / Bottom-Up]

– #5 – Design, Module Unit Tests and Regression Suite

– #6 – Complete Code, Refine and Run all V&V Tests and Deliver

Track Bugs with Bugzilla - http://prclab.pr.erau.edu/

Import your Project Code into GitHub - https://github.com/

Sam Siewert 2


Integration and Test

Integrate Software Modules [units] and Hardware Components into Sub-systems

Test Focus on Interfaces [Function, Message, Shared Memory, Hardware], Protocols, and Interoperability of Modules

Sam Siewert 3


Test Types – Goals Today

Positive Tests
– Functional Software Interface Tests
  Functions calling Functions – API
  Message Passing – Local Message Queues, Network, Client-Server
  Shared Memory – Synchronization, Buffers
– Hardware Interface Tests
  Drivers and Device Interfaces
  Firmware [ROM Code, Run out of Reset]

Negative Tests
– Software Interface Faults
– Hardware Interface Fault Injection

Bug Tracking, Defect Rate, How to Use for Project and SQA Management

Root-Cause-Analysis [RCA] Wrap-Up – JPL Mars Pathfinder Story

Diagnostics [Built-in Self-Test]

Unit Interoperability
– Sub-system Resource Testing – Memory, CPU, I/O, Storage, Power
– Protocols – Message Acknowledgement, Command/Response, Background Commands, Peer-to-Peer, etc.

Performance Tests – Profiles and Traces
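A negative software-interface test deliberately passes faulty inputs across an interface and checks that each fault is reported rather than silently accepted. The following is a minimal sketch in C. The message-queue API (`msg_queue_t`, `queue_put`, the `Q_*` codes) is hypothetical and exists only to have an interface to test; it is not taken from the course code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical message-queue interface used only for illustration. */
#define QUEUE_SLOTS   4
#define MSG_MAX_BYTES 16

enum { Q_OK = 0, Q_ERR_NULL = -1, Q_ERR_SIZE = -2, Q_ERR_FULL = -3 };

typedef struct {
    unsigned char slot[QUEUE_SLOTS][MSG_MAX_BYTES];
    size_t        len[QUEUE_SLOTS];
    size_t        count;
} msg_queue_t;

/* Copy one message into the queue, rejecting every invalid request. */
int queue_put(msg_queue_t *q, const void *msg, size_t len)
{
    if (q == NULL || msg == NULL)        return Q_ERR_NULL;
    if (len == 0 || len > MSG_MAX_BYTES) return Q_ERR_SIZE;
    if (q->count == QUEUE_SLOTS)         return Q_ERR_FULL;

    memcpy(q->slot[q->count], msg, len);
    q->len[q->count] = len;
    q->count++;
    return Q_OK;
}

/* Run positive and negative cases; return the number of failed checks. */
int run_queue_interface_tests(void)
{
    msg_queue_t q;
    unsigned char big[MSG_MAX_BYTES + 1] = {0};
    int failures = 0;
    size_t i;

    memset(&q, 0, sizeof q);

    /* Positive test: a valid message is accepted. */
    if (queue_put(&q, "ping", 5) != Q_OK)             failures++;

    /* Negative tests: each injected fault must be reported. */
    if (queue_put(NULL, "ping", 5) != Q_ERR_NULL)     failures++;
    if (queue_put(&q, NULL, 5) != Q_ERR_NULL)         failures++;
    if (queue_put(&q, big, 0) != Q_ERR_SIZE)          failures++;
    if (queue_put(&q, big, sizeof big) != Q_ERR_SIZE) failures++;

    /* Rejected requests must not have changed the queue state. */
    assert(q.count == 1);

    /* Negative test: overflow. Fill the remaining slots, then push once more. */
    for (i = q.count; i < QUEUE_SLOTS; i++)
        if (queue_put(&q, "fill", 5) != Q_OK)         failures++;
    if (queue_put(&q, "over", 5) != Q_ERR_FULL)       failures++;

    return failures;
}
```

A test driver would call `run_queue_interface_tests()` and treat any non-zero return as a reported bug to track.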

Sam Siewert 4


Outline for Every Integration Test

1. Check out Specific Source Code Test Configuration – CMVC Tools, Git
– Collection of Modules [Units] Tagged by Revision Control
– OR Current
2. Build and Link Modules (*.o) and Libraries (*.a) into Sub-system to Test
3. Load / Install Sub-system Code onto Test Hardware Platform of Known Configuration
– Record key hardware configuration parameters
– E.g. for I/O HW config - lspci, lsusb
– General config - hwinfo
– Linux OS kernel build config - uname -a
– Memory - cat /proc/meminfo
– CPU - cat /proc/cpuinfo
4. Run Integrated Test(s) [with Gcov, Lcov, Gprof]
5. Review Expected Syslogs and Output to Terminal for Each Feature
6. Review Performance Profiles
7. Track Bugs, Anomalies, and Disposition as Defects
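Step 3 asks for the platform configuration to be recorded with each test run. The same facts that `uname -a` and `/proc/meminfo` report can also be captured from inside a test program so they land in the test log automatically. This is a sketch that assumes a Linux test platform; the function names `record_kernel_config` and `record_mem_total_kb` are illustrative.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Write a one-line kernel/platform summary (similar to `uname -a`) into buf.
 * Returns 0 on success, -1 on failure. */
int record_kernel_config(char *buf, size_t buflen)
{
    struct utsname u;
    int n;

    if (buf == NULL || buflen == 0) return -1;
    if (uname(&u) != 0)             return -1;

    n = snprintf(buf, buflen, "%s %s %s %s %s",
                 u.sysname, u.nodename, u.release, u.version, u.machine);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}

/* Read MemTotal (in kB) from /proc/meminfo. Returns -1 if unavailable. */
long record_mem_total_kb(void)
{
    FILE *fp = fopen("/proc/meminfo", "r");
    char line[256];
    long kb = -1;

    if (fp == NULL) return -1;
    while (fgets(line, sizeof line, fp) != NULL) {
        /* Only the MemTotal line matches this format; others leave kb alone. */
        if (sscanf(line, "MemTotal: %ld kB", &kb) == 1)
            break;
    }
    fclose(fp);
    return kb;
}
```

A test harness could call both functions at start-up, for example `assert(record_kernel_config(buf, sizeof buf) == 0)`, and print the results ahead of the test output so every log names the configuration it ran on.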

Sam Siewert 5


Bug Open/Close Rates and Readiness

Controversy – Bug Counts, Closure and Prediction of Phase Transition Readiness
– E.g. Unit to I&T to System Test to Acceptance Test to Shipment
– Can Be Inaccurate due to Unsatisfactory Testing or Lack of Criteria
– Guideline for Project Management [Compared to Guessing!]
– Not all Reported Bugs Become Defects [Test Case Errors, Human Error]
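To make the open/close-rate idea concrete, the sketch below tracks the backlog of open bugs from weekly counts and applies one possible readiness guideline. The data layout and the guideline itself (close rate at least the open rate over a recent window, and backlog under a threshold) are assumptions chosen for illustration, not criteria given in the lecture.

```c
#include <assert.h>
#include <stddef.h>

/* Weekly counts from a bug tracker export (illustrative data layout). */
typedef struct {
    int opened;   /* bugs reported this week */
    int closed;   /* bugs closed this week (fixed, or rejected as non-defects) */
} week_counts_t;

/* Backlog of open bugs after the last week, given the starting backlog. */
int open_backlog(const week_counts_t *w, size_t weeks, int start_backlog)
{
    int backlog = start_backlog;
    size_t i;

    for (i = 0; i < weeks; i++)
        backlog += w[i].opened - w[i].closed;
    return backlog;
}

/* Hypothetical readiness guideline (an assumption, not a course rule):
 * ready when closes have met or exceeded opens in each of the last `window`
 * weeks AND the remaining backlog is at most `max_open`. */
int phase_ready(const week_counts_t *w, size_t weeks, int start_backlog,
                size_t window, int max_open)
{
    size_t i;

    if (w == NULL || window == 0 || weeks < window) return 0;
    for (i = weeks - window; i < weeks; i++)
        if (w[i].closed < w[i].opened) return 0;
    return open_backlog(w, weeks, start_backlog) <= max_open;
}
```

As the slide cautions, a result like this is only as good as the testing behind the counts; it is a guideline for the phase-transition decision, not a proof of readiness.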

Sam Siewert 6

http://www.testandverification.com/DVClub/24_Jan_2011/Greg_Smith.pdf

[Figure axis labels: Test Case Coverage [E.g. Code Path Coverage]; Bug Counts [Reported, Not Verified as Defect]]


Root-Cause Analysis

Field Issue - Anomaly, Reported Bug, Data Corruption
– Software Defect?
– Hardware Reliability
– User Error

Reproducibility
– Capture Conditions via Logging
– Recreate Scenario in SQA / QA Lab

Trace to Root-Cause
– Assert
– Analysis Triggers
– Propose Fixes
– Apply and Regression Test
– Release Maintenance Patch
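Capturing conditions via logging and asserting on assumptions can be combined: a check that logs the failed condition together with its context gives the lab what it needs to recreate the scenario. The sketch below is one way to do that with `syslog()`. The macro `RCA_CHECK` and the frame-validation function are hypothetical names introduced for this example.

```c
#include <assert.h>
#include <syslog.h>

/* Evaluate cond; on failure, log the condition text, source location, and
 * caller-supplied context, then yield 0. Yields 1 when the check passes.
 * Requires at least one context argument after fmt. */
#define RCA_CHECK(cond, fmt, ...)                                       \
    ((cond) ? 1                                                         \
            : (syslog(LOG_ERR, "RCA_CHECK failed: %s at %s:%d " fmt,    \
                      #cond, __FILE__, __LINE__, __VA_ARGS__), 0))

/* Hypothetical unit under analysis: validates a received frame header. */
int frame_is_valid(int frame_id, int length, int max_length)
{
    if (!RCA_CHECK(frame_id >= 0, "frame_id=%d", frame_id))
        return 0;
    if (!RCA_CHECK(length > 0 && length <= max_length,
                   "frame_id=%d length=%d max=%d",
                   frame_id, length, max_length))
        return 0;
    return 1;
}

/* In the SQA lab, a hard stop at the first violated assumption is often
 * more useful than continuing, so the same check can back an assert(). */
void lab_check_frame(int frame_id, int length, int max_length)
{
    assert(frame_is_valid(frame_id, length, max_length));
}
```

In a fielded build the logged line survives in syslog for later analysis, while the lab build stops at the violation so the state can be inspected in a debugger.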

Sam Siewert 7


Case Study – Mars Pathfinder Story

JPL Mission Flow to Mars, Landing on July 4th, 1997

Pathfinder Rolling Resets on Final Approach to Mars Capture Orbit

VxWorks RTOS Used

Reproduction of Anomaly on the Ground

Root-Cause Analysis

Proposed Fix

Sam Siewert 8


Note on Data Driven Algorithms and CPU Loading

Real-Time Algorithms Ideally have Fixed Computational Demands per Request
– Provides Predictable Response, Enables Accurate Rate-Monotonic Analysis
– Rate Monotonic Theory Requires Known C, T, D Inputs [CPU Time Required, Request Period, Deadline Relative to Request Time]

Computer Vision and Image Processing Depends on Data from Instrument Observation
– Parsing Scene for Linear Segments [Edges]
– Finding Elliptical or Circular Objects [Craters, Holes, etc.]
– Number of Features Found and Processed will Vary!
– Optical Navigation – Making an Impact: AI Group at JPL
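The C and T inputs above feed the classic rate-monotonic least upper bound (Liu and Layland): with D equal to T, a set of n services is guaranteed feasible if U = sum(C_i / T_i) <= n(2^(1/n) - 1). The sketch below applies that test and shows how a data-driven C can push a service set past the bound. The service values are made up for illustration, and the test is written in the equivalent form (1 + U/n)^n <= 2 to avoid a math-library call.

```c
#include <assert.h>
#include <stddef.h>

/* One periodic service: C = worst-case CPU time, T = request period.
 * Deadline D is assumed equal to T, as the RM least upper bound requires. */
typedef struct {
    double C;   /* execution time per request (ms) */
    double T;   /* request period (ms) */
} service_t;

/* Total CPU utilization U = sum(C_i / T_i). */
double rm_utilization(const service_t *s, size_t n)
{
    double u = 0.0;
    size_t i;

    for (i = 0; i < n; i++)
        u += s[i].C / s[i].T;
    return u;
}

/* Liu and Layland sufficient test: U <= n * (2^(1/n) - 1),
 * rearranged as (1 + U/n)^n <= 2.
 * Returns 1 if the set passes; 0 means "not proven feasible", because the
 * test is sufficient but not necessary. */
int rm_lub_feasible(const service_t *s, size_t n)
{
    double base, prod = 1.0;
    size_t i;

    if (s == NULL || n == 0) return 0;
    base = 1.0 + rm_utilization(s, n) / (double)n;
    for (i = 0; i < n; i++)
        prod *= base;
    return prod <= 2.0;
}
```

For example, three services with (C, T) of (1, 10), (2, 20), and (5, 50) ms give U = 0.3 and pass. If the third is an image-processing service whose C grows to 30 ms on a feature-rich scene, U becomes 0.8, which exceeds the three-service bound of about 0.78, so the set is no longer guaranteed to meet its deadlines.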

Sam Siewert 10

Hough Linear Example

Hough Circular Example


Discussion …

List of Theories for Root Cause [Good List, From OS, General Engineering Judgement]

Suggestions for Teamwork [Good Approaches – Brainstorm, Gather all Cognizant Engineers into One Room – JPL, Wind River, RAD6000]

Scenario and Anomaly [Rolling Reset on Approach] Reproduction on Ground System

Root-Cause: Software Re-Use and Lack of Default to Inversion-Safe MUTEX in POSIX Pipes, Triggered by Increased Meteorological CPU Loading for Landing Sites

Ground Verification and Uplink to Enable Inversion Safe Option for Hidden MUTEX

Mission Saved and Quite Successful!
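The fix described above was to enable the inversion-safe option on a mutex that did not have it by default. In POSIX threads the analogous setting is the priority-inheritance protocol on a mutex attribute. The sketch below shows that analog only; it is not the Pathfinder flight code or the VxWorks API, and the function name `init_inversion_safe_mutex` is illustrative.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request POSIX.1-2008 declarations */
#endif
#include <assert.h>
#include <pthread.h>

/* Initialize a mutex with the priority-inheritance protocol, so that a
 * low-priority holder is temporarily boosted while a higher-priority
 * thread waits on it. Returns 0 on success or a POSIX error number. */
int init_inversion_safe_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int rc;

    rc = pthread_mutexattr_init(&attr);
    if (rc != 0) return rc;

    /* The default protocol is typically PTHREAD_PRIO_NONE, which leaves the
     * mutex open to unbounded priority inversion. */
    rc = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (rc == 0)
        rc = pthread_mutex_init(m, &attr);

    pthread_mutexattr_destroy(&attr);
    return rc;
}
```

The lesson carries over directly: a mutex hidden inside a re-used service inherits whatever default the library chose, so the protocol is worth checking wherever tasks of different priority share a resource.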

Sam Siewert 11


Diagnostic Tests

Primarily Hardware Tests, Driven by Software

Could be OS test, E.g. During Boot of System
– CPU
– I/O
– Network
– Memory test
– File system test
– OS Services

Memory Test
– Simple – Walking 1’s, Address Bus Test, Pattern Tests all Read-after-Write to Address
– Advanced – ECC, SoC Drawer Paper
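The simple memory tests named above all follow the same read-after-write idea. Below is a minimal sketch of a walking 1's test at one address and a pattern test over a region. It runs against an ordinary array here; on real hardware the pointer would refer to the memory under test and caching would need to be accounted for, which this sketch does not attempt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Walking 1's data-bus test at a single address: write each single-bit
 * pattern and read it back. Returns 0 on pass, or the first failing pattern. */
uint32_t walking_ones_test(volatile uint32_t *addr)
{
    uint32_t pattern;

    for (pattern = 1u; pattern != 0u; pattern <<= 1) {
        *addr = pattern;          /* write */
        if (*addr != pattern)     /* read-after-write */
            return pattern;
    }
    return 0u;
}

/* Pattern test over a region: fill with an index-derived pattern, verify,
 * then repeat with the inverted pattern so every bit is seen as 0 and 1.
 * Returns the index of the first failing word, or nwords if all pass. */
size_t pattern_test(volatile uint32_t *base, size_t nwords)
{
    size_t i;
    int pass;

    for (pass = 0; pass < 2; pass++) {
        for (i = 0; i < nwords; i++)
            base[i] = pass ? ~(uint32_t)i : (uint32_t)i;
        for (i = 0; i < nwords; i++)
            if (base[i] != (pass ? ~(uint32_t)i : (uint32_t)i))
                return i;
    }
    return nwords;
}
```

Returning the failing pattern or index, rather than a bare pass/fail, gives root-cause analysis a starting point: a single stuck bit points at a data line, while a failing index pattern points at addressing.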

Sam Siewert 12

E.g. Linux Boot-up Process for CentOS 6.x


BIST – Built-in Self Tests

SW Driven and Controlled Diagnostics [Firmware] Key to Hardware Verification

Cooperative Hardware and Firmware Mode

Make Available for Root-Cause Analysis Post-Ship or During I&T and System Testing

E.g. Dell Laptops – LCD BIST

Disk Drive Test-Unit Ready – sg_turs, T10 TUR

Sam Siewert 13


Performance Tests

Profiling
– Gprof – Open source tool [similar to Gcov, but for Profiling]
– VTune – Commercial Tool from Intel
– Logic Analyzer and HP’s SPA (Statistical Performance Analysis)

Tracing – E.g. Timestamps output to syslog

Statistics
– top, htop
– iostat
– memstat

Workloads
– Iometer
– stress
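Tracing with timestamps output to syslog can be sketched as follows: take a monotonic timestamp before and after the code of interest and log the difference. The helper names `elapsed_usec`, `trace_call`, and `sample_workload` are hypothetical, and the workload is only a stand-in.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* request clock_gettime() declaration */
#endif
#include <assert.h>
#include <stdint.h>
#include <syslog.h>
#include <time.h>

/* Elapsed time between two timestamps in microseconds. */
int64_t elapsed_usec(const struct timespec *start, const struct timespec *stop)
{
    int64_t sec  = (int64_t)stop->tv_sec  - (int64_t)start->tv_sec;
    int64_t nsec = (int64_t)stop->tv_nsec - (int64_t)start->tv_nsec;

    return sec * 1000000 + nsec / 1000;
}

/* Time one call of work() with the monotonic clock and trace it to syslog.
 * Returns the elapsed microseconds, or -1 if the clock is unavailable. */
int64_t trace_call(const char *label, void (*work)(void))
{
    struct timespec t0, t1;
    int64_t usec;

    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0) return -1;
    work();
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0) return -1;

    usec = elapsed_usec(&t0, &t1);
    syslog(LOG_INFO, "TRACE %s: %lld usec", label, (long long)usec);
    return usec;
}

/* Stand-in workload for the usage example (hypothetical). */
void sample_workload(void)
{
    volatile unsigned long sum = 0;
    unsigned long i;

    for (i = 0; i < 100000UL; i++)
        sum += i;
}
```

A call such as `trace_call("sample", sample_workload)` leaves one timestamped line per invocation in syslog. `CLOCK_MONOTONIC` is used so the measurement is not disturbed by wall-clock adjustments during the run.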

Sam Siewert 14


Performance - Sysprof

What is Using CPU on my System Rather than Profile of an Application – Sub-System [Service]

Sam Siewert 15


Gprof

Simple -pg compile option

Run, then gprof on gmon.out to get analysis

Sam Siewert 16

%make
cc -O3 -Wall -pg -msse3 -malign-double -g -c raidtest.c
raidtest.c: In function 'main':
raidtest.c:99: warning: format '%d' expects type 'int', but argument 2 has type 'long unsigned int'
raidtest.c:68: warning: unused variable 'aveRate'
raidtest.c:68: warning: unused variable 'totalRate'
raidtest.c:66: warning: unused variable 'rc'
raidtest.c:212: warning: control reaches end of non-void function
cc -O3 -Wall -pg -msse3 -malign-double -g -c raidlib.c
cc -O3 -Wall -pg -msse3 -malign-double -g -o raidtest raidtest.o raidlib.o

%./raidtest
Will default to 1000 iterations
Architecture validation:
sizeof(unsigned long long)=8
RAID Operations Performance Test
Test Done in 453 microsecs for 1000 iterations
2207505.518764 RAID ops computed per second

%ls
Makefile    gmon.out   raidlib.h  raidlib64.c  raidtest    raidtest.o
Makefile64  raidlib.c  raidlib.o  raidlib64.h  raidtest.c  raidtest64

%gprof raidtest gmon.out > raidtest_analysis.txt


Gprof Analysis

1 million iterations of RAID test XOR and Rebuild

Sam Siewert 17

Flat profile:

Each sample counts as 0.01 seconds.

  %    cumulative    self                self     total
 time    seconds    seconds     calls   ns/call  ns/call  name
82.13       1.54       1.54                               main
15.47       1.83       0.29   2000001    145.38   145.38  xorLBA
 2.67       1.88       0.05   2000001     25.07    25.07  rebuildLBA

% time – the percentage of the total running time of the program used by this function.

cumulative seconds – a running sum of the number of seconds accounted for by this function and those listed above it.

self seconds – the number of seconds accounted for by this function alone. …

calls – the number of times this function was invoked, if this function is profiled, else blank.

self ms/call – the average number of milliseconds spent in this function per call, …

total ms/call – the average number of milliseconds spent in this function and its descendents per call, …

name – the name of the function. …

RAID Operations Performance Test

Test Done in 206417 microsecs for 1000000 iterations

4844562.221135 RAID ops computed per second
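The rates printed by the test and the per-call figures in the flat profile are simple ratios, which makes them easy to sanity-check. The sketch below recomputes them; the function names are illustrative and this is not the arithmetic from raidtest.c itself.

```c
#include <assert.h>

/* Operations per second from an iteration count and elapsed microseconds.
 * Returns 0.0 for a non-positive elapsed time to avoid dividing by zero. */
double ops_per_second(long iterations, long usec)
{
    if (usec <= 0) return 0.0;
    return (double)iterations * 1.0e6 / (double)usec;
}

/* Average nanoseconds per call from gprof "self seconds" and "calls". */
double ns_per_call(double self_seconds, long calls)
{
    if (calls <= 0) return 0.0;
    return self_seconds * 1.0e9 / (double)calls;
}
```

With the numbers shown, `ops_per_second(1000000, 206417)` gives about 4844562 ops per second and `ops_per_second(1000, 453)` gives about 2207506, matching the program output. `ns_per_call(0.29, 2000001)` gives roughly 145 ns, in line with the 145.38 ns/call reported for xorLBA; the small difference is expected because the table's self seconds column is rounded to two decimals.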


Call Graph Profile from Gprof

Sam Siewert 18

Call graph (explanation follows)

granularity: each sample hit covers 2 byte(s) for 0.53% of 1.88 seconds

index  % time    self  children    called             name
                                                      <spontaneous>
[1]     100.0    1.54    0.34                         main [1]
                 0.29    0.00   2000001/2000001       xorLBA [2]
                 0.05    0.00   2000001/2000001       rebuildLBA [3]
-----------------------------------------------
                 0.29    0.00   2000001/2000001       main [1]
[2]      15.4    0.29    0.00   2000001               xorLBA [2]
-----------------------------------------------
                 0.05    0.00   2000001/2000001       main [1]
[3]       2.7    0.05    0.00   2000001               rebuildLBA [3]
-----------------------------------------------

This table describes the call tree of the program, and was sorted by the total amount of time spent in each function and its children…

% time – This is the percentage of the `total' time that was spent in this function and its children…

self – This is the total amount of time spent in this function.

children – This is the total amount of time propagated into this function by its children.

called – This is the number of times the function was called…


Discussion and Q&A

I&T is to Verify and Validate Sub-systems from Integrated SW Units and HW Components, in a Configuration
– Unit Tests Precede
– Integrate and Configure
– Function/Feature Positive Tests
– Negative Testing [Fault Injection]
– Interoperability Testing
– Diagnostics, Root-Cause, and Bug Tracking Critical New Aspects
– Performance Testing [of Integrated and Configured Sub-systems]
– Determine Readiness for Final Integration and Entry to System Testing
– Provides Regression Test Cases for System Test

Precedes System Test, Where Sub-systems are …
– Fully Integrated
– Configured Similar to Deployment [Perhaps Not Exact – E.g. Spacecraft in Thermal-Vac Testing]
– Stimulated with Tests Replicating Operations

Sam Siewert 19