
1

An Approach to Software Testing of Machine Learning Applications

Chris Murphy, Gail Kaiser, Marta Arias

Columbia University

2

Introduction

• We are investigating the quality assurance of Machine Learning (ML) applications

• Currently we are concerned with a real-world application for potential future use in predicting electrical device failures

• Machine Learning applications fall into a class for which it can be said that there is “no reliable oracle”
– These are also known as “non-testable programs” and could fall into Davis and Weyuker’s class of “programs which were written in order to determine the answer in the first place. There would be no need to write such programs, if the correct answer were known.”

3

Introduction

• We have developed an approach to creating test cases for Machine Learning applications:

• Analyze the problem domain and real-world data sets

• Analyze the algorithm as it is defined

• Analyze an implementation’s runtime options

• Our approach was designed for MartiRank and then generalized to other ranking algorithms such as Support Vector Machines (SVM)

4

Overview

• Machine Learning Background
• Testing Approach and Framework
• Findings and Results
• Evaluation and Observations
• Future Work

5

Machine Learning Fundamentals

• Data sets consist of a number of examples, each of which has attributes and a label

• In the first phase (“training”), a model is generated that attempts to generalize how attributes relate to the label

• In the second phase, the model is applied to a previously-unseen data set (“testing” data) with unknown labels to produce a classification (or, in our case, a ranking)
– This can be used for validation or for prediction
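The two phases can be illustrated with a deliberately tiny sketch. It is not MartiRank or SVM: the `Example` class, the `train` and `rank` functions, and the "model" (a single attribute index) are all hypothetical names and simplifications chosen only to show examples, attributes, labels, training, and ranking of unseen data.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Example:
    attributes: List[float]
    label: Optional[int] = None  # None: label unknown (unseen "testing" data)


def train(examples: List[Example]) -> int:
    """Toy training phase: the 'model' is the index of the attribute whose
    mean is most strongly higher for label-1 examples than for label-0 ones."""
    positives = [e for e in examples if e.label == 1]
    negatives = [e for e in examples if e.label == 0]

    def gap(i: int) -> float:
        pos_mean = sum(e.attributes[i] for e in positives) / len(positives)
        neg_mean = sum(e.attributes[i] for e in negatives) / len(negatives)
        return pos_mean - neg_mean

    return max(range(len(examples[0].attributes)), key=gap)


def rank(model: int, examples: List[Example]) -> List[Example]:
    """Toy second phase: order unseen examples by the chosen attribute,
    highest value first."""
    return sorted(examples, key=lambda e: e.attributes[model], reverse=True)


training = [
    Example([0.9, 5.0], 1),
    Example([0.8, 1.0], 1),
    Example([0.1, 4.0], 0),
    Example([0.2, 2.0], 0),
]
unseen = [Example([0.3, 9.0]), Example([0.7, 0.0])]

model = train(training)       # attribute 0 separates the labels best here
ranked = rank(model, unseen)  # the example with attribute 0 = 0.7 comes first
```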

6

MartiRank and SVM

• MartiRank was specifically designed for the device failure application
– Seeks to find the combination of segmenting and sorting the data that produces the best result

• SVM is typically a classification algorithm
– Seeks to find a hyperplane that separates examples from different classes
– Different “kernels” use different approaches
– SVM-Light has a ranking mode based on the distance from the hyperplane
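Ranking by distance from a hyperplane can be sketched as follows. This assumes a linear kernel and a hyperplane already described by a weight vector `w` and offset `b`; both values, and the function names, are made up for illustration. It does not reproduce SVM-Light's actual ranking mode.

```python
import math
from typing import List, Sequence


def signed_distance(w: Sequence[float], b: float, x: Sequence[float]) -> float:
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


def rank_by_distance(w: Sequence[float], b: float,
                     examples: List[Sequence[float]]) -> List[Sequence[float]]:
    """Order examples from farthest on the positive side to farthest on the
    negative side of the hyperplane."""
    return sorted(examples, key=lambda x: signed_distance(w, b, x), reverse=True)


w, b = [3.0, 4.0], -5.0  # hypothetical hyperplane, not a trained model
points = [[0.0, 0.0], [3.0, 4.0], [1.0, 0.5]]
ranking = rank_by_distance(w, b, points)
```

Here `[3.0, 4.0]` lies 4 units on the positive side, `[1.0, 0.5]` lies on the hyperplane, and `[0.0, 0.0]` lies 1 unit on the negative side, so they are ranked in that order.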

7

Related Work

• There has been much research into applying Machine Learning techniques to software testing, but not the other way around

• Reusable real-world data sets and Machine Learning frameworks are available for checking how well a Machine Learning algorithm predicts, but not for testing its correctness

8

Analyzing the Problem Domain

• Consider properties of the real-world data sets
– Data set size: number of attributes and examples
– Range of values: attributes and labels
– Precision of floating-point numbers
– Categorical data: how alphanumeric attributes are addressed

• Also, repeating or missing data values
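A first pass over a real-world data set could collect exactly these properties. The sketch below assumes a hypothetical in-memory layout in which each row is an `(attributes, label)` pair, missing values are `None`, and categorical values are strings. The function name and the returned keys are illustrative only.

```python
from typing import Any, Dict, List, Tuple

Row = Tuple[List[Any], int]  # hypothetical layout: (attribute values, label)


def profile_dataset(rows: List[Row]) -> Dict[str, Any]:
    """Summarize the data set traits that drive test case selection."""
    values = [v for attrs, _ in rows for v in attrs]
    present = [v for v in values if v is not None]
    numeric = [v for v in present if isinstance(v, (int, float))]
    return {
        "n_examples": len(rows),
        "n_attributes": len(rows[0][0]) if rows else 0,
        "n_missing": len(values) - len(present),
        "n_categorical": len(present) - len(numeric),
        "has_repeats": len(set(present)) < len(present),
        "value_range": (min(numeric), max(numeric)) if numeric else None,
        "labels": sorted({label for _, label in rows}),
    }


sample = [([1.0, "A", None], 1), ([1.0, "B", 3.5], 0)]
profile = profile_dataset(sample)
```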

9

Analyzing the Algorithm

• Look for imprecisions in the specification, not necessarily bugs in the implementation
– How to handle missing attribute values
– How to handle negative labels

• Consider how to construct a data set that could cause a “predictable” ranking
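One way to build a “predictable” data set, offered as an assumption rather than the construction used in this work: make one attribute equal to the label and fill the others with noise. Any reasonable ranker should then place every label-1 example ahead of every label-0 example, which gives a checkable expectation even without an oracle. All names below are hypothetical.

```python
import random
from typing import List, Tuple

Row = Tuple[List[float], int]  # hypothetical layout: (attribute values, label)


def make_predictable_dataset(n_examples: int, n_attributes: int,
                             seed: int = 0) -> List[Row]:
    """Attribute 0 equals the label; the remaining attributes are noise."""
    rng = random.Random(seed)  # fixed seed keeps the data set reproducible
    rows = []
    for i in range(n_examples):
        label = i % 2
        attrs = [float(label)] + [rng.random() for _ in range(n_attributes - 1)]
        rows.append((attrs, label))
    rng.shuffle(rows)
    return rows


def is_perfect_ranking(ranked_labels: List[int]) -> bool:
    """True if no lower label appears ahead of a higher label."""
    return ranked_labels == sorted(ranked_labels, reverse=True)


rows = make_predictable_dataset(10, 3)
# Stand-in for the ranker under test: sort on the perfectly correlated attribute.
ranked = sorted(rows, key=lambda r: r[0][0], reverse=True)
ok = is_perfect_ranking([label for _, label in ranked])
```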

10

Analyzing the Runtime Options

• Determine how the implementation may manipulate the input data
– Permuting the input order
– Reading the input in “chunks”

• Consider configuration parameters
– For example, disabling anything probabilistic

• Need to ensure that results are deterministic and repeatable
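Repeatability and sensitivity to input order can both be checked mechanically. The sketch below treats the implementation under test as a black-box function `rank_fn`. The two toy rankers are hypothetical stand-ins, not MartiRank or SVM-Light.

```python
import random
from typing import Callable, List, Tuple

Row = Tuple[str, float]  # hypothetical (example id, score) pair
RankFn = Callable[[List[Row]], List[str]]


def is_repeatable(rank_fn: RankFn, rows: List[Row], runs: int = 3) -> bool:
    """Same input, same output, every time."""
    first = rank_fn(list(rows))
    return all(rank_fn(list(rows)) == first for _ in range(runs - 1))


def is_order_invariant(rank_fn: RankFn, rows: List[Row],
                       trials: int = 5, seed: int = 0) -> bool:
    """Output must not change when the input rows are permuted."""
    rng = random.Random(seed)
    baseline = rank_fn(list(rows))
    candidates = [list(reversed(rows))]  # one deterministic permutation
    for _ in range(trials):              # plus a few seeded random ones
        shuffled = list(rows)
        rng.shuffle(shuffled)
        candidates.append(shuffled)
    return all(rank_fn(c) == baseline for c in candidates)


def rank_by_score(rows: List[Row]) -> List[str]:
    # Ties keep their input order, so the result depends on that order.
    return [name for name, _ in sorted(rows, key=lambda r: r[1], reverse=True)]


def rank_with_tiebreak(rows: List[Row]) -> List[str]:
    # Ties are broken by id, so the result is independent of input order.
    return [name for name, _ in sorted(rows, key=lambda r: (-r[1], r[0]))]


rows = [("a", 1.0), ("b", 1.0), ("c", 0.5)]
```

Both rankers are repeatable, but only `rank_with_tiebreak` is order-invariant: reversing the input swaps the tied examples `a` and `b` in the output of `rank_by_score`.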

12

Equivalence Classes

• Data sizes of different orders of magnitude
• Repeating vs. non-repeating attribute values
• Missing vs. non-missing attribute values
• Categorical vs. non-categorical data
• 0/1 labels vs. non-negative integer labels
• Predictable vs. non-predictable data sets

• Used data set generator to parameterize test case selection criteria
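A generator of this kind might expose one parameter per equivalence class. The sketch below is an assumption about what such a tool could look like, not the generator used in this work. The function name and parameters are hypothetical, and the categorical and “predictable” classes are left out for brevity.

```python
import random
from typing import List, Optional, Tuple

Row = Tuple[List[Optional[float]], int]  # (attribute values, label)


def generate_dataset(n_examples: int, n_attributes: int, *,
                     repeat_values: bool = False,
                     missing_rate: float = 0.0,
                     max_label: int = 1,
                     seed: int = 0) -> List[Row]:
    """Generate a synthetic data set for one combination of equivalence classes.

    repeat_values: draw attributes from a few values so that repeats occur
    missing_rate:  probability that an attribute value is missing (None)
    max_label:     1 gives 0/1 labels; larger gives non-negative integer labels
    """
    rng = random.Random(seed)  # fixed seed keeps each test case reproducible
    rows = []
    for _ in range(n_examples):
        attrs: List[Optional[float]] = []
        for _ in range(n_attributes):
            if rng.random() < missing_rate:
                attrs.append(None)
            elif repeat_values:
                attrs.append(float(rng.randint(0, 3)))
            else:
                attrs.append(rng.random())
        rows.append((attrs, rng.randint(0, max_label)))
    return rows


small_clean = generate_dataset(100, 5)
all_missing = generate_dataset(10, 4, missing_rate=1.0)
multi_label = generate_dataset(50, 3, repeat_values=True, max_label=5)
```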

13

Testing MartiRank

• Produced a core dump on data sets with large number of attributes (over 200)

• Implementation does not correctly handle negative labels

• Does not use a “stable” sorting algorithm
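A stable sort keeps records with equal keys in their original relative order, and that property can be checked on any sorted output. The helper below is a hypothetical sketch that assumes distinct, hashable records. Python's built-in `sorted` is documented to be stable and serves as the reference here.

```python
from typing import Callable, Hashable, List, TypeVar

T = TypeVar("T", bound=Hashable)


def preserves_tie_order(original: List[T], result: List[T],
                        key: Callable[[T], object]) -> bool:
    """True if records with equal keys appear in `result` in the same
    relative order as in `original`. Assumes `result` is sorted by `key`
    and that all records are distinct."""
    position = {item: i for i, item in enumerate(original)}
    return all(
        position[a] < position[b]
        for a, b in zip(result, result[1:])
        if key(a) == key(b)
    )


records = [("a", 2), ("b", 1), ("c", 2), ("d", 1)]


def by_score(record):
    return record[1]


stable = sorted(records, key=by_score)  # [("b", 1), ("d", 1), ("a", 2), ("c", 2)]
unstable = [("d", 1), ("b", 1), ("a", 2), ("c", 2)]  # ties "b" and "d" swapped
```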

14

Regression Testing of MartiRank

• Creation of a suite of testing data allowed us to use it for regression testing

• Discovered that refactoring had introduced a bug into an important calculation
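Without an oracle, a regression check can still compare each new output against the output recorded from an earlier version. The sketch below is one possible arrangement, not the harness used in this work. The function name, the JSON file layout, and the directory argument are all assumptions.

```python
import json
from pathlib import Path
from typing import List, Union


def matches_baseline(name: str, ranking: List[Union[int, str]],
                     baseline_dir: Union[str, Path]) -> bool:
    """Compare a ranking with the stored baseline for test case `name`.

    The first run for a given name records the baseline and passes;
    later runs pass only if the ranking is unchanged."""
    path = Path(baseline_dir) / f"{name}.json"
    if not path.exists():
        path.write_text(json.dumps(ranking))
        return True
    return json.loads(path.read_text()) == ranking
```

A run after refactoring that returns `False` for a previously recorded data set signals that behavior changed. Whether the change is a bug or an intended fix still has to be decided by inspection.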

15

Testing Multiple Implementations of MartiRank

• We had three implementations developed by three different coders

• Can be used as “pseudo-oracles” for each other

• Used to discover a bug in the way one implementation was handling missing values
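Using implementations as pseudo-oracles amounts to running them on the same inputs and reporting every pair that disagrees. In the sketch below, the two toy rankers and their differing treatment of missing values are invented for illustration and are not the MartiRank implementations.

```python
from itertools import combinations
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, Optional[float]]  # hypothetical (example id, attribute value)
RankFn = Callable[[List[Row]], List[str]]


def find_disagreements(implementations: Dict[str, RankFn],
                       datasets: Dict[str, List[Row]]) -> List[Tuple[str, str, str]]:
    """Return (data set, implementation A, implementation B) for every pair
    of implementations whose outputs differ on that data set."""
    disagreements = []
    for ds_name, rows in datasets.items():
        outputs = {name: fn(list(rows)) for name, fn in implementations.items()}
        for a, b in combinations(sorted(outputs), 2):
            if outputs[a] != outputs[b]:
                disagreements.append((ds_name, a, b))
    return disagreements


def impl_missing_as_zero(rows: List[Row]) -> List[str]:
    # Treats a missing value as 0.0.
    ordered = sorted(rows, key=lambda r: 0.0 if r[1] is None else r[1], reverse=True)
    return [name for name, _ in ordered]


def impl_missing_last(rows: List[Row]) -> List[str]:
    # Ranks examples with a missing value after all others.
    ordered = sorted(rows,
                     key=lambda r: (r[1] is not None, 0.0 if r[1] is None else r[1]),
                     reverse=True)
    return [name for name, _ in ordered]


implementations = {"impl_a": impl_missing_as_zero, "impl_b": impl_missing_last}
datasets = {
    "complete": [("x", 0.5), ("y", -1.0), ("z", 2.0)],
    "with_missing": [("x", None), ("y", -1.0), ("z", 2.0)],
}
report = find_disagreements(implementations, datasets)
```

The two rankers agree on the complete data set and disagree only where a value is missing. A disagreement shows that at least one implementation (or the specification) needs attention, but not which one.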

16

Applying Approach to SVM-Light

• Permuting the input data led to different models
– Caused by “chunking” data for use by an approximating variant of the optimization algorithm

• Introduction of noise in a data set in some cases caused it not to find a “predictable” ranking

• Different kernels also caused different results with “predictable” rankings

17

Evaluation and Observations

• Testing approach revealed bugs and imprecision in the implementations, as well as discrepancies from the stated algorithms

• Inspection of the algorithms led to the creation of “predictable” data sets

• What is “predictable” for one algorithm may not lead to a “predictable” ranking in another

• Algorithm’s failure to address specific data set traits can lead to incorrect results (and/or inconsistent results across implementations)

• The approach can be generalized to other Machine Learning ranking algorithms, as well as classification

18

Limitations and Future Work

• Test suite adequacy for coverage not addressed

• Can also include mutation testing for effectiveness of data sets

• Should investigate creating large data sets that correlate to real-world data

• Could also consider non-deterministic Machine Learning algorithms
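The mutation testing idea can be sketched as follows: seed small faults (“mutants”) into a ranker and measure what fraction of them a collection of data sets can distinguish from the original. The ranker and mutants below are toy stand-ins chosen for illustration. No mutation testing was carried out in the work presented here.

```python
from typing import Callable, List

RankFn = Callable[[List[float]], List[float]]


def mutation_score(original: RankFn, mutants: List[RankFn],
                   datasets: List[List[float]]) -> float:
    """Fraction of mutants 'killed', i.e. producing output that differs
    from the original on at least one data set."""
    if not mutants:
        return 0.0
    killed = sum(
        1 for mutant in mutants
        if any(mutant(list(d)) != original(list(d)) for d in datasets)
    )
    return killed / len(mutants)


def rank_descending(values: List[float]) -> List[float]:
    return sorted(values, reverse=True)


def mutant_ascending(values: List[float]) -> List[float]:
    return sorted(values)  # seeded fault: wrong sort direction


def mutant_abs(values: List[float]) -> List[float]:
    return sorted(values, key=abs, reverse=True)  # seeded fault: ignores sign


mutants = [mutant_ascending, mutant_abs]
weak_suite = [[3.0, 1.0, 2.0]]               # only non-negative values
strong_suite = weak_suite + [[-5.0, 1.0]]    # adds a negative value

weak_score = mutation_score(rank_descending, mutants, weak_suite)
strong_score = mutation_score(rank_descending, mutants, strong_suite)
```

The suite with only non-negative values kills one of the two mutants (score 0.5). Adding a data set with a negative value kills both (score 1.0), which mirrors the earlier observation that negative values deserve their own test data.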

19

Questions?
