# Performance Evaluation of Machine Learning Algorithms

Posted on 23-Mar-2016


DESCRIPTION

Transcript of the tutorial "Performance Evaluation of Machine Learning Algorithms" by Mohak Shah (GE Software) and Nathalie Japkowicz (University of Ottawa), presented at ECML 2013, Prague.

## Slide 1: Title

Performance Evaluation of Machine Learning Algorithms
Mohak Shah (GE Software) and Nathalie Japkowicz (University of Ottawa)
ECML 2013, Prague
2013 GE / 2013 Mohak Shah, All Rights Reserved

## Slide 2: Motivation

"Evaluation is the key to making real progress in data mining." [Witten & Frank, 2005], p. 143

## Slide 3: Yet...

- Evaluation gives us a lot of, sometimes contradictory, information. How do we make sense of it all?
- Evaluation standards differ from person to person and discipline to discipline. How do we decide which standards are right for us?
- Evaluation gives us support for one theory over another, but rarely, if ever, certainty. So where does that leave us?

## Slide 4: Example I: Which classifier is better?

There are almost as many answers as there are performance measures! (e.g., UCI Breast Cancer)

| Algo  | Acc   | RMSE  | TPR | FPR | Prec | Rec | F   | AUC | Info S |
|-------|-------|-------|-----|-----|------|-----|-----|-----|--------|
| NB    | 71.7  | .4534 | .44 | .16 | .53  | .44 | .48 | .7  | 48.11  |
| C4.5  | 75.5  | .4324 | .27 | .04 | .74  | .27 | .4  | .59 | 34.28  |
| 3NN   | 72.4  | .5101 | .32 | .1  | .56  | .32 | .41 | .63 | 43.37  |
| Ripp  | 71    | .4494 | .37 | .14 | .52  | .37 | .43 | .6  | 22.34  |
| SVM   | 69.6  | .5515 | .33 | .15 | .48  | .33 | .39 | .59 | 54.89  |
| Bagg  | 67.8  | .4518 | .17 | .1  | .4   | .17 | .23 | .63 | 11.30  |
| Boost | 70.3  | .4329 | .42 | .18 | .5   | .42 | .46 | .7  | 34.48  |
| RanF  | 69.23 | .47   | .33 | .15 | .48  | .33 | .39 | .63 | 20.78  |

Some of these contradictions are acceptable; others are questionable.

## Slide 5: Example I: Which classifier is better?

Ranking the results (1 = best); again, some contradictions are acceptable, others questionable:

| Algo  | Acc | RMSE | TPR | FPR | Prec | Rec | F | AUC | Info S |
|-------|-----|------|-----|-----|------|-----|---|-----|--------|
| NB    | 3   | 5    | 1   | 7   | 3    | 1   | 1 | 1   | 2      |
| C4.5  | 1   | 1    | 7   | 1   | 1    | 7   | 5 | 7   | 5      |
| 3NN   | 2   | 7    | 6   | 2   | 2    | 6   | 4 | 3   | 3      |
| Ripp  | 4   | 3    | 3   | 4   | 4    | 3   | 3 | 6   | 6      |
| SVM   | 6   | 8    | 4   | 5   | 5    | 4   | 6 | 7   | 1      |
| Bagg  | 8   | 4    | 8   | 2   | 8    | 8   | 8 | 3   | 8      |
| Boost | 5   | 2    | 2   | 8   | 7    | 2   | 2 | 1   | 4      |
| RanF  | 7   | 6    | 4   | 5   | 5    | 4   | 7 | 3   | 7      |

## Slide 6: Example II: What's wrong with our results?

A new algorithm was designed based on data that had been provided by the National Institutes of Health (NIH) [Wang and Japkowicz, 2008, 2010].

- According to the standard evaluation practices in machine learning, the results were found to be significantly better than the state of the art.
- The conference version of the paper received a best paper award and a fair amount of attention in the ML community (35+ citations).
- Yet NIH would not consider the algorithm, because it was probably not significantly better (if at all) than others; they felt the evaluation wasn't performed according to their standards.

## Slide 7: Example III: What do our results mean?

In an experiment comparing the accuracy of eight classifiers on ten domains, we concluded that "One-Way Repeated-Measure ANOVA rejects the hypothesis that the eight classifiers perform similarly on the ten domains considered at the 95% significance level." [Japkowicz & Shah, 2011], p. 273

- Can we be sure of our results?
- What if we are in the 5% not vouched for by the statistical test?
- Was the ANOVA test truly applicable to our specific data? Is there any way to verify that?
- Can we have any 100% guarantee on our results?

## Slide 8: Purpose of the tutorial

- To give an appreciation for the complexity and uncertainty of evaluation.
- To discuss the different aspects of evaluation, present an overview of the issues that come up in their context, and suggest possible remedies.
- To present some of the newer, lesser-known results in classifier evaluation.
- Finally, to point to some available, easy-to-use resources that can help improve the robustness of evaluation in the field.
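The metric contradictions of Example I are easy to reproduce. The sketch below uses hypothetical counts (not the UCI Breast Cancer figures from the slide) for two classifiers on the same test set; which one "wins" flips depending on the measure chosen:

```python
def metrics(tp, fp, fn, tn):
    """A few confusion-matrix measures for one binary classifier."""
    p, n = tp + fn, fp + tn
    prec = tp / (tp + fp)
    rec = tp / p
    return {
        "Acc": (tp + tn) / (p + n),
        "Prec": prec,
        "Rec": rec,
        "F1": 2 * prec * rec / (prec + rec),
    }

# Hypothetical counts for two classifiers on the same 200-example test set
a = metrics(tp=44, fp=16, fn=56, tn=84)   # conservative: high precision
b = metrics(tp=74, fp=40, fn=26, tn=60)   # liberal: high recall
better_by_acc = "A" if a["Acc"] > b["Acc"] else "B"     # -> "B"
better_by_prec = "A" if a["Prec"] > b["Prec"] else "B"  # -> "A"
```

Accuracy prefers B, precision prefers A: exactly the kind of "acceptable contradiction" the ranking table above displays.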
## Slide 9: The main steps of evaluation

The Classifier Evaluation Framework:

- Choice of learning algorithm(s)
- Datasets selection
- Error-estimation / sampling method
- Performance measure of interest
- Statistical test
- Perform evaluation

In the framework diagram, an arrow 1 → 2 means that knowledge of step 1 is necessary for step 2; an arrow back from 2 to 1 means that feedback from step 1 should be used to adjust step 2.

## Slide 10: What these steps depend on

These steps depend on the purpose of the evaluation:

- Comparison of a new algorithm to other (possibly generic or application-specific) classifiers on a specific domain (e.g., when proposing a novel learning algorithm)
- Comparison of a new generic algorithm to other generic ones on a set of benchmark domains (e.g., to demonstrate the general effectiveness of the new approach against other approaches)
- Characterization of generic classifiers on benchmark domains (e.g., to study the algorithms' behavior on general domains for subsequent use)
- Comparison of multiple classifiers on a specific domain (e.g., to find the best algorithm for a given application task)

## Slide 11: Outline

- Performance measures
- Error estimation / resampling
- Statistical significance testing
- Data set selection and evaluation benchmark design
- Available resources

## Slide 12: Performance Measures

(Section title slide.)

## Slide 13: Performance measures: outline

- Ontology
- Confusion-matrix-based measures
- Graphical measures: ROC analysis and AUC, cost curves
- Recent developments on the ROC front: smooth ROC curves, the H measure, ROC cost curves
- Other curves: PR curves, lift, etc.
- Probabilistic measures
- Others: qualitative measures, combination frameworks, visualization, agreement statistics

## Slide 14: Overview of performance measures

- All measures
  - Confusion matrix (deterministic classifiers)
    - Multi-class focus
      - No chance correction: accuracy, error rate
      - Chance correction: Cohen's kappa, Fleiss' kappa
    - Single-class focus: TP/FP rate, precision/recall, sensitivity/specificity, F-measure, geometric mean, Dice
  - Additional info: classifier uncertainty, cost ratio, skew (scoring classifiers)
    - Graphical measures: ROC curves, PR curves, DET curves, lift charts, cost curves
    - Summary statistics: AUC, H measure, area under the ROC cost curve
  - Continuous and probabilistic classifiers (reliability metrics)
    - Distance/error measures: KL divergence, K&B IR, BIR, RMSE
    - Information-theoretic measures
  - Alternate information: interestingness, comprehensibility, multi-criteria

## Slide 15: Confusion-matrix-based performance measures

Confusion matrix:

| Hypothesized \ True | Pos | Neg |
|---------------------|-----|-----|
| Yes                 | TP  | FP  |
| No                  | FN  | TN  |
| Total               | P = TP + FN | N = FP + TN |

- Multi-class focus: Accuracy = (TP + TN) / (P + N)
- Single-class focus:
  - Precision = TP / (TP + FP)
  - Recall = TP / P
  - Fallout = FP / N
  - Sensitivity = TP / (TP + FN)
  - Specificity = TN / (FP + TN)

## Slide 16: Aliases and other measures

- Accuracy = 1 (or 100%) − Error rate
- Recall = TPR = Hit rate = Sensitivity
- Fallout = FPR = False-alarm rate
- Precision = Positive Predictive Value (PPV)
- Negative Predictive Value (NPV) = TN / (TN + FN)
- Likelihood ratios:
  - LR+ = Sensitivity / (1 − Specificity), the higher the better
  - LR− = (1 − Sensitivity) / Specificity, the lower the better

## Slide 17: Pairs of measures and compounded measures

- Precision / Recall
- Sensitivity / Specificity
- Likelihood ratios (LR+ and LR−)
- Positive / Negative predictive values
- F-measure: F_β = (1 + β²) × (Precision × Recall) / (β² × Precision + Recall), with β = 1, 2, 0.5
- G-mean: two-class version Sqrt(TPR × TNR), single-class version Sqrt(TPR × Precision)

(Speaker note: the F-measure is the weighted harmonic mean of precision and recall; F1 is the balanced F-measure, F2 weighs recall twice as much as precision, and F0.5 half as much.)

## Slide 18: Skew and cost considerations

- Skew-sensitive assessments, e.g., class ratios:
  - Ratio+ = (TP + FN) / (FP + TN)
  - Ratio− = (FP + TN) / (TP + FN)
  - TPR can be weighted by Ratio+ and TNR by Ratio−
- Asymmetric misclassification costs: if the costs are known, weigh each off-diagonal entry of the confusion matrix by the appropriate misclassification cost, then compute the weighted version of any previously discussed evaluation measure.

## Slide 19: Some issues with performance measures

Both classifiers below obtain 60% accuracy, yet they exhibit very different behaviours: on the left, a weak positive recognition rate and a strong negative recognition rate; on the right, the reverse.

| Left: Hyp \ True | Pos | Neg |   | Right: Hyp \ True | Pos | Neg |
|------------------|-----|-----|---|-------------------|-----|-----|
| Yes              | 200 | 100 |   | Yes               | 400 | 300 |
| No               | 300 | 400 |   | No                | 100 | 200 |
| Total            | P = 500 | N = 500 | | Total         | P = 500 | N = 500 |

## Slide 20: Some issues with performance measures (cont'd)

The classifier on the left obtains 99.01% accuracy, the one on the right 89.9%. Yet the right-hand classifier is much more sophisticated than the left-hand one, which simply labels everything as positive and misses all the negative examples.

| Left: Hyp \ True | Pos | Neg |   | Right: Hyp \ True | Pos | Neg |
|------------------|-----|-----|---|-------------------|-----|-----|
| Yes              | 500 | 5   |   | Yes               | 450 | 1   |
| No               | 0   | 0   |   | No                | 50  | 4   |
| Total            | P = 500 | N = 5 | | Total           | P = 500 | N = 5 |

## Slide 21: Some issues with performance measures (cont'd)

Both classifiers below obtain the same precision and recall values of 66.7% and 40% (note: the data sets are different). They exhibit very different behaviours: the same positive recognition rate, but extremely different negative recognition rates, strong on the left and nil on the right. Note: accuracy has no problem catching this!

| Left: Hyp \ True | Pos | Neg |   | Right: Hyp \ True | Pos | Neg |
|------------------|-----|-----|---|-------------------|-----|-----|
| Yes              | 200 | 100 |   | Yes               | 200 | 100 |
| No               | 300 | 400 |   | No                | 300 | 0   |
| Total            | P = 500 | N = 500 | | Total         | P = 500 | N = 100 |

(Speaker note: recall is the TPR, hence precision and recall do not focus on the negative class.)

## Slide 22: Observations

- This and similar analyses reveal that each performance measure conveys some information and hides other information; there is an information trade-off carried through the different metrics.
- A practitioner has to choose, quite carefully, the quantity he or she is interested in monitoring, while keeping in mind that other values matter as well.
- Classifiers may rank differently depending on the metrics along which they are compared.
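The confusion-matrix measures above can be computed directly. A minimal sketch, checked against the left-hand matrix of slide 19 (TP = 200, FP = 100, FN = 300, TN = 400), which indeed gives 60% accuracy despite the weak positive recognition rate:

```python
import math

def measures(tp, fp, fn, tn, beta=1.0):
    """Confusion-matrix measures from slides 15-17."""
    p, n = tp + fn, fp + tn
    acc = (tp + tn) / (p + n)          # multi-class focus
    prec = tp / (tp + fp)              # single-class focus
    rec = tp / p                       # = TPR = sensitivity
    spec = tn / n                      # = TNR = specificity
    fbeta = (1 + beta**2) * prec * rec / (beta**2 * prec + rec)
    gmean = math.sqrt(rec * spec)      # two-class G-mean
    return acc, prec, rec, spec, fbeta, gmean

# Left-hand matrix of slide 19: weak TPR (0.4), strong TNR (0.8)
acc, prec, rec, spec, f1, gmean = measures(200, 100, 300, 400)
```

Accuracy alone (0.6) hides the asymmetry that TPR = 0.4 vs. TNR = 0.8 makes obvious, which is the slide's point.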
## Slide 23: Graphical measures: ROC analysis, AUC, cost curves

Many researchers have now adopted the AUC (the area under the ROC curve) to evaluate their results. Its principal advantage is that it is more robust than accuracy in class-imbalanced situations: it takes the class distribution into consideration and therefore gives more weight to correct classification of the minority class, thus producing fairer results. We start by understanding ROC curves.

## Slide 24: ROC curves

- Performance measures for scoring classifiers; most classifiers are in fact scoring classifiers (not necessarily ranking classifiers).
- Scores need not lie in pre-defined intervals, or even be likelihoods or probabilities.
- ROC analysis has its origins in signal detection theory, where it was used to set an operating point for a desired signal detection rate; the signal is assumed to be corrupted by (normally distributed) noise.
- ROC maps FPR on the horizontal axis and TPR on the vertical axis; recall that FPR = 1 − Specificity and TPR = Sensitivity.

(Speaker note: "ranking classifiers" only in the sense of ranking positive instances generally higher than negative ones.)

## Slide 25: ROC space

*[Figure: the ROC space, FPR vs. TPR.]*

## Slide 26: ROC plot for discrete classifiers

*[Figure: ROC points for discrete classifiers.]*

## Slide 27: ROC plot for two hypothetical scoring classifiers

*[Figure: ROC curves for two hypothetical scoring classifiers.]*

## Slide 28: Skew considerations

- ROC graphs are insensitive to class imbalances, since they do not account for class distributions.
- They are based on the 2x2 confusion matrix (i.e., 3 degrees of freedom): the first two dimensions are TPR and FPR; the third dimension is performance-measure dependent, typically class imbalance (accuracy/risk are sensitive to class ratios).
- Classifier performance then corresponds to points in 3D space.
- Projections onto a single TPR x FPR slice give performance with respect to the same class distribution; hence the implicit assumption is that the 2D plot is independent of the empirical class distribution.

## Slide 29: Skew considerations

Class imbalances, asymmetric misclassification costs, credits for correct classification, etc. can be incorporated via a single metric, the skew ratio r_s, such that r_s < 1 if positive examples are deemed more important and r_s > 1 if vice versa. Performance measures can then be re-stated to take the skew ratio into account.

## Slide 30: Skew considerations

In ROC coordinates, the skew-sensitive versions are: Accuracy = (TPR + r_s (1 − FPR)) / (1 + r_s) and Precision = TPR / (TPR + r_s FPR). For symmetric costs, the skew represents the class ratio r_s = N/P.

## Slide 31: Isometrics

- Isometrics, or iso-performance lines (or curves): for any performance metric, the lines or curves denoting the same metric value for a given r_s.
- In 3D ROC graphs, isometrics form surfaces.

## Slide 32: Isoaccuracy lines

*[Figure: isoaccuracy lines in ROC space. Black lines: r_s = 1 (balanced); red lines: r_s = 0.5 (positives weighed more than negatives).]*

## Slide 33: Isoprecision lines

*[Figure: isoprecision lines in ROC space.]*

## Slide 34: Isometrics

Similarly, iso-performance lines can be drawn in ROC space connecting classifiers with the same expected cost. Classifiers f1 and f2 have the same performance if (TPR_2 − TPR_1) / (FPR_2 − FPR_1) = (p(−) C(FP)) / (p(+) C(FN)). Expected cost can be interpreted as a special case of skew.
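A minimal sketch of how a ROC curve is traced from classifier scores, in the setting of slide 24; the scores and labels below are made up for illustration:

```python
def roc_points(scores, labels):
    """Sweep a threshold down through the scores and record (FPR, TPR).
    Assumes binary labels in {0, 1}; tied scores would need grouping,
    which this sketch omits."""
    pairs = sorted(zip(scores, labels), key=lambda t: -t[0])
    p = sum(labels)
    n = len(labels) - p
    tp = fp = 0
    pts = [(0.0, 0.0)]          # threshold above every score
    for _, y in pairs:
        if y == 1:
            tp += 1
        else:
            fp += 1
        pts.append((fp / n, tp / p))
    return pts

scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]
curve = roc_points(scores, labels)
```

Each threshold yields one (FPR, TPR) point; connecting them gives the empirical ROC curve on which the isometrics of slides 31-34 can be overlaid.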
## Slide 35: ROC curve generation

*[Figure: generating a ROC curve from a scored sample.]*

## Slide 36: Curves from discrete points and sub-optimalities

*[Figure: interpolating curves from discrete ROC points; sub-optimal points.]*

## Slide 37: ROC convex hull

The set of points on the ROC curve that are not suboptimal forms the ROC convex hull.

## Slide 38: ROCCH and model selection

*[Figure: using the ROC convex hull for model selection.]*

## Slides 39-40: Selectively optimal classifiers

*[Figures: classifiers that are optimal only over part of the operating range.]*

## Slide 41: AUC

- ROC analysis allows a user to visualize the performance of classifiers over their operating ranges. However, it does not quantify this analysis, which would make comparisons between classifiers easier.
- The Area Under the ROC Curve (AUC) provides such a quantification: it represents the performance of the classifier averaged over all possible cost ratios.
- The AUC can also be interpreted as the probability that the classifier assigns a higher rank to a randomly chosen positive example than to a randomly chosen negative example.
- Using the second interpretation, the AUC can be estimated as AUC = (Σ_{x_i ∈ P} r_i − |P|(|P| + 1)/2) / (|P| |N|), where P and N are the sets of positive and negative examples in the test set, respectively, and r_i is the rank of the i-th positive example in the ranking given by the classifier under evaluation.

## Slide 42: Recent developments I: smooth ROC curves (smROC) [Klement et al., ECML 2011]

- ROC analysis visualizes the ranking performance of classifiers but ignores any kind of scoring information.
- Scores are important, as they tell us whether the difference between two rankings is large or small; but they are difficult to integrate because, unlike probabilities, each scoring system differs from one classifier to the next and their integration is not obvious.
- smROC extends the ROC curve to include normalized scores as smoothing weights added to its line segments.
- smROC was shown to capture the similarity between similar classification models better than ROC, and to capture differences between classification models that ROC is barely sensitive to, if at all.

## Slide 43: Recent developments II: the H measure (I)

A criticism of the AUC was given by [Hand, 2009]. The argument goes along the following lines:

- The misclassification cost distributions (and hence the skew-ratio distributions) used by the AUC are different for different classifiers. Therefore, we may be comparing apples and oranges, as the AUC may give more weight to misclassifying a point by classifier A than by classifier B.
- To address this problem, [Hand, 2009] proposed the H measure. In essence, the H measure allows the user to select a cost-weight function that is equal for all the classifiers under comparison, and thus allows for fairer comparisons.
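The rank-based AUC estimator of slide 41 can be sketched directly; average ranks are used for tied scores (a standard choice, not stated on the slide):

```python
def auc_from_ranks(scores, labels):
    """AUC via the rank-sum estimator of slide 41:
    (sum of positive ranks - |P|(|P|+1)/2) / (|P| |N|)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):                    # assign average 1-based ranks
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1                # mean rank over a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    pos = [ranks[i] for i in range(len(labels)) if labels[i] == 1]
    n_pos, n_neg = len(pos), len(labels) - len(pos)
    return (sum(pos) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

For scores [.9, .8, .7, .6, .5, .4] with labels [1, 1, 0, 1, 0, 0], eight of the nine positive-negative pairs are ranked correctly, so the estimate is 8/9, matching the probabilistic interpretation on the slide.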
## Slide 44: Recent developments II: the H measure (II)

- [Flach et al., ICML 2011] argued that [Hand, 2009]'s criticism holds when the AUC is interpreted as a classification measure, but not when it is interpreted as a pure ranking measure.
- When the criticism does hold, the problem occurs only if the thresholds used to construct the ROC curve are assumed to be optimal.
- When the ROC curve is constructed using all the points in the data set as thresholds, rather than only the optimal ones, the AUC can be shown to be coherent and the H measure unnecessary, at least from a theoretical point of view.
- The H measure is a linear transformation of the value that the area under the cost curve (discussed next) would produce.

## Slide 45: Cost curves

Cost curves tell us for what class probabilities one classifier is preferable over the other; ROC curves only tell us that sometimes one classifier is preferable over the other.

*[Figure: left, ROC curves for classifiers f1 and f2 (FPR vs. TPR), including the always-wrong classifier; right, the corresponding cost curves plotting error rate against P(+), the probability of an example being from the positive class, showing the positive and negative trivial classifiers, classifiers A and B at different thresholds (points A1 and B2), and the always-right classifier.]*

(Speaker notes: cost curves plot the relative expected misclassification cost as a function of the proportion of positive instances in the dataset. Cost curves are point-line duals of ROC curves: in cost space, each ROC point is represented by a line, and the convex hull of the ROC space corresponds to the lower envelope created by all the classifier lines.)

## Slide 46: Recent developments III: ROC cost curves

Hernandez-Orallo et al. (ArXiv 2011, JMLR 2012) introduce ROC cost curves:

- They explore the connection between ROC space and cost space via expected loss over a range of operating conditions.
- They show how ROC curves can be transferred to cost space by threshold-choice methods such that the proportion of positive predictions equals the operating condition (via cost or skew).
- Expected loss, measured by the area under the ROC cost curve, maps linearly to the AUC.
- They also explore the relationship between the AUC and the Brier score (MSE).

## Slides 47-48: ROC cost curves (going beyond the optimal cost curve)

*[Figures from Hernandez-Orallo et al. 2011 (Fig. 2).]*

## Slide 49: Other curves

Other research communities have produced and used graphical evaluation techniques similar to ROC curves:

- Lift charts plot the number of true positives versus the overall number of examples in the data set that was considered for that true positive count.
- PR curves plot precision as a function of recall.
- Detection Error Trade-off (DET) curves are like ROC curves, but plot the FNR rather than the TPR on the vertical axis; they are also typically log-scaled.
- Relative superiority graphs are more akin to cost curves, as they consider costs; they map the ratios of costs into the [0, 1] interval.
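The cost-curve construction described on slide 45 can be sketched for the symmetric-cost case. Each ROC point becomes a line in cost space, and the best achievable expected error at a given P(+) is the lower envelope over all classifier lines, including the two trivial classifiers; the ROC point below is made up for illustration:

```python
def cost_line(fpr, tpr):
    """Each ROC point (FPR, TPR) is a line in cost space: expected error
    rate as a function of p = P(+), assuming symmetric misclassification
    costs (the general case weighs the two terms by their costs)."""
    return lambda p: (1 - tpr) * p + fpr * (1 - p)

def lower_envelope(roc_pts, p):
    """Lowest expected cost at class probability p over all classifiers,
    including the trivial all-negative (0,0) and all-positive (1,1) ones."""
    pts = list(roc_pts) + [(0.0, 0.0), (1.0, 1.0)]
    return min(cost_line(fpr, tpr)(p) for fpr, tpr in pts)

# A single hypothetical classifier at ROC point (FPR=0.25, TPR=0.75)
envelope_mid = lower_envelope([(0.25, 0.75)], p=0.5)    # classifier wins
envelope_rare = lower_envelope([(0.25, 0.75)], p=0.05)  # trivial negative wins
```

At p = 0.5 the classifier is preferable (expected error 0.25), but at p = 0.05 the trivial all-negative classifier is better (0.05): exactly the operating-range information the slide says ROC curves alone do not convey.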
## Slide 50: Probabilistic measures I: RMSE

The Root-Mean-Squared Error (RMSE) is usually used for regression, but can also be used with probabilistic classifiers. The formula is RMSE(f) = sqrt((1/m) Σ_{i=1..m} (f(x_i) − y_i)²), where m is the number of test examples, f(x_i) the classifier's probabilistic output on x_i, and y_i the actual label.

| ID | f(x_i) | y_i | (f(x_i) − y_i)² |
|----|--------|-----|-----------------|
| 1  | .95    | 1   | .0025           |
| 2  | .6     | 0   | .36             |
| 3  | .8     | 1   | .04             |
| 4  | .75    | 0   | .5625           |
| 5  | .9     | 1   | .01             |

RMSE(f) = sqrt((.0025 + .36 + .04 + .5625 + .01) / 5) = sqrt(0.975 / 5) = 0.4416

## Slide 51: Probabilistic measures II: information score

Kononenko and Bratko's information score assumes a prior P(y) on the labels, which can be estimated by the class distribution of the training data. The output (posterior probability) of a probabilistic classifier is P(y|f), where f is the classifier; I(a) is the indicator function.

| x | P(y_i|f) | y_i | IS(x) |
|---|----------|-----|-------|
| 1 | .95      | 1   | .66   |
| 2 | .6       | 0   | 0     |
| 3 | .8       | 1   | .42   |
| 4 | .75      | 0   | .32   |
| 5 | .9       | 1   | .59   |

P(y=1) = 3/5 = 0.6, P(y=0) = 2/5 = 0.4.
I(x_1) = 1 × (−log(.6) + log(.95)) + 0 × (−log(.4) + log(.05)) = 0.66
Average information score: Ia = (0.66 + 0 + .42 + .32 + .59) / 5 = 0.40

## Slide 52: Other measures I: a multi-criteria measure, the efficiency method

- A framework for combining various measures, including qualitative metrics such as interestingness.
- It considers the positive metrics, for which higher values are desirable (e.g., accuracy), and the negative metrics, for which lower values are desirable (e.g., computational time).
- pm_i+ are the positive metrics and pm_j− the negative metrics. The weights w_i and w_j have to be determined, and a solution is proposed that uses linear programming.

## Slide 53: Other measures II

- Classifier evaluation can be viewed as a problem of analyzing high-dimensional data; the performance measures currently used are but one class of projections that can be applied to these data.
- [Japkowicz et al., ECML'08 / ISAIM'08] introduced visualization-based measures: classifier performance on domains is projected into a higher-dimensional space, and a distance measure in the projected space allows classifiers to be differentiated qualitatively. Note: this is dependent on the distance metric used.

## Slide 54: Other measures III: agreement statistics

- The measures considered so far assume the existence of ground-truth labels (against which, for instance, accuracy is measured).
- In the absence of ground truth, typically (multiple) experts label the instances; hence one needs to account for chance agreements while evaluating under imprecise labeling.
- Two approaches: either estimate the ground truth, or evaluate against the group of labelings.

## Slide 55: Multiple annotations

*[Figure: an instance labeled by multiple annotators.]*

## Slide 56: Two questions of interest

- Inter-expert agreement: the overall agreement of the group.
- Classifier agreement against the group.

## Slide 57: When such measurements are desired

Such measurements are also desired when:

- assessing a classifier in a skewed-class setting;
- assessing a classifier against expert labeling;
- there are reasons to believe that bias has been introduced in the labeling process (say, due to class imbalance, or asymmetric representativeness of classes in the data).

Agreement statistics aim to address this by accounting for chance agreements (coincidental concordances).
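The RMSE and information-score examples of slides 50-51 can be checked in code. The sketch below assumes base-2 logarithms and the standard two-branch form of the Kononenko-Bratko score (a penalty applies when the posterior of the true class falls below its prior); only the unambiguous table entries are reproduced:

```python
import math

def rmse(preds, labels):
    """RMSE of probabilistic outputs against 0/1 labels (slide 50)."""
    return math.sqrt(sum((f - y) ** 2 for f, y in zip(preds, labels)) / len(preds))

def info_score(prior_pos, p_pos, y):
    """Kononenko & Bratko information score for one instance, base-2 logs.
    prior_pos = P(y=1); p_pos = classifier posterior P(y=1|x); y in {0, 1}."""
    prior = prior_pos if y == 1 else 1 - prior_pos
    post = p_pos if y == 1 else 1 - p_pos
    if post >= prior:                       # useful prediction: info gained
        return -math.log2(prior) + math.log2(post)
    # misleading prediction: information lost (negative score)
    return -(-math.log2(1 - prior) + math.log2(1 - post))

preds = [0.95, 0.60, 0.80, 0.75, 0.90]
labels = [1, 0, 1, 0, 1]
prior_pos = sum(labels) / len(labels)       # 3/5 = 0.6, as on slide 51
```

rmse(preds, labels) reproduces the slide's 0.4416, and the per-instance scores for x1, x2, x3, and x5 match the table (0.66, 0, 0.42, about 0.59).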
## Slide 58: Coincidental concordances (natural propensities) need to be accounted for

*[Figure: label counts from six raters, illustrating each rater's natural propensity to label: R1: 1526, R2: 2484, R3: 1671, R4: 816, R5: 754, R6: 1891.]*

## Slides 59-60: General agreement statistic

Agreement measure = (observed agreement − chance agreement) / (maximum achievable agreement − chance agreement).

Examples: Cohen's kappa, Fleiss' kappa, Scott's pi, ICC.

## Slide 61: Two raters over two classes: Cohen's kappa

Say we wish to measure classifier accuracy against some true process. Cohen's kappa is defined as κ = (P0 − Pe) / (1 − Pe), where P0 is the probability of overall agreement over the label assignments between the classifier and the true process, and Pe is the chance agreement over the labels, defined as the sum over classes of the proportion of examples assigned to a class times the proportion of true labels of that class in the data set.

## Slide 62: Cohen's kappa measure: example

| Actual \ Predicted | A   | B   | C   | Total |
|--------------------|-----|-----|-----|-------|
| A                  | 60  | 50  | 10  | 120   |
| B                  | 10  | 100 | 40  | 150   |
| C                  | 30  | 10  | 90  | 130   |
| Total              | 100 | 160 | 140 | 400   |

P0 = Accuracy = (60 + 100 + 90) / 400 = 62.5%
Pe = 100/400 × 120/400 + 160/400 × 150/400 + 140/400 × 130/400 = 0.33875
κ = (0.625 − 0.33875) / (1 − 0.33875) = 43.29%

Accuracy is overly optimistic in this example! (This is one of many generalizations to the multiple-class case.)

## Slide 63: Multiple raters over multiple classes

- Various generalizations have been proposed to estimate inter-rater agreement in the general multi-class, multi-rater scenario. Well-known formulations include Fleiss' kappa (Fleiss, 1971), the ICC, Scott's pi, and Vanbelle and Albert (2009)'s measures.
- These formulations all rely on the marginalization argument (a fixed number of raters is assumed to be sampled from a pool of raters to label each instance), applicable, for instance, in many crowd-sourcing scenarios.
- For assessing agreement against a group of raters, the consensus argument is typically used: a deterministic label set is formed via majority vote of the raters in the group, and the classifier is evaluated against this label set.

## Slide 64: Multiple raters over multiple classes

Shah (ECML 2011) highlighted the issues with both the marginalization and the consensus arguments, and with related approaches such as the Dice coefficient, for machine learning evaluation when the group of raters is fixed. Two new measures were introduced:

- κ_S, to assess inter-rater agreement (generalizing Cohen's kappa), with variance upper-bounded by its marginalization counterpart;
- the S measure, to evaluate a classifier against a group of raters.

## Slide 65: Issues with the marginalization and consensus arguments: a quick review

(Section title slide.)

## Slide 66: Back to the agreement statistic

The pair-wise agreement measure has the same form as before; the differences arise in obtaining the expectation (the chance-agreement term).

## Slide 67: Traditional approaches: inter-rater agreement

The marginalization argument: consider a simple two-rater, two-class case.

| Inst # | Rater 1 | Rater 2 |
|--------|---------|---------|
| 1      | 1       | 1       |
| 2      | 1       | 2       |
| 3      | 1       | 1       |
| 4      | 2       | 1       |
| 5      | 2       | 2       |
| 6      | 1       | 2       |
| 7      | 2       | 2       |

Observed agreement: 4/7. Probability of chance agreement over label 1: Pr(Label = 1 | random rater)² = 7/14 × 7/14 = 0.25 (and likewise 0.25 for label 2, for a total of 0.5).
Agreement = (4/7 − 0.5) / (1 − 0.5) = 0.143

## Slide 68: Traditional approaches: inter-rater agreement

The marginalization argument: now consider another scenario.

| Inst # | Rater 1 | Rater 2 |
|--------|---------|---------|
| 1      | 1       | 2       |
| 2      | 1       | 2       |
| 3      | 1       | 2       |
| 4      | 1       | 2       |
| 5      | 2       | 1       |
| 6      | 2       | 1       |
| 7      | 2       | 1       |

Observed agreement: 0. Probability of chance agreement over label 1: Pr(Label = 1 | random rater)² = 7/14 × 7/14 = 0.25.
Agreement = (0 − 0.5) / (1 − 0.5) = −1
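The multi-class Cohen's kappa computation of slide 62 can be sketched directly from the confusion matrix:

```python
def cohens_kappa(cm):
    """Cohen's kappa from a square confusion matrix, where
    cm[i][j] = count of instances of true class i predicted as class j."""
    k = len(cm)
    total = sum(sum(row) for row in cm)
    p0 = sum(cm[i][i] for i in range(k)) / total          # observed agreement
    row = [sum(r) for r in cm]                            # actual-class totals
    col = [sum(cm[i][j] for i in range(k)) for j in range(k)]  # predicted totals
    pe = sum(row[i] * col[i] for i in range(k)) / total ** 2   # chance agreement
    return (p0 - pe) / (1 - pe)

# The 3-class matrix of slide 62 (rows = actual, columns = predicted)
cm = [[60, 50, 10],
      [10, 100, 40],
      [30, 10, 90]]
```

cohens_kappa(cm) reproduces the slide's 43.29%, well below the 62.5% raw accuracy, illustrating how much of the apparent agreement is attributable to chance.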
## Slide 69: Traditional approaches: inter-rater agreement

The marginalization argument: but this holds even when there is no evidence of a chance agreement.

| Inst # | Rater 1 | Rater 2 |
|--------|---------|---------|
| 1      | 1       | 2       |
| 2      | 1       | 2       |
| 3      | 1       | 2       |
| 4      | 1       | 2       |
| 5      | 1       | 2       |
| 6      | 1       | 2       |
| 7      | 1       | 2       |

Observed agreement: 0. Probability of chance agreement over label 1: Pr(Label = 1 | random rater)² = 7/14 × 7/14 = 0.25.
Agreement = (0 − 0.5) / (1 − 0.5) = −1

## Slide 70: Marginalization argument not applicable in the fixed-rater scenario

- Marginalization ignores rater correlation and rater asymmetry.
- It results in loose chance-agreement estimates through optimistic estimation, and hence an overly conservative agreement estimate.
- Shah's kappa corrects for these while retaining the convergence properties.

## Slide 71: Shah's kappa

Its variability is upper-bounded by that of Fleiss' kappa.

## Slide 72: II. Agreement of a classifier against a group

- Extension of the marginalization argument: recently appeared in the statistics literature (Vanbelle and Albert, 2009).
- Consensus-based (more traditional): almost universally used in the machine learning / data mining community, e.g., medical image segmentation, tissue classification, recommendation systems, expert-modeling scenarios (e.g., market analyst combination).

## Slides 73-74: Marginalization approach and issues in the fixed-experts setting

Observed agreement: the proportion of raters with which the classifier agrees. This ignores qualitative agreement, and may even ignore group dynamics.

| Inst # | Rater 1 | Rater 2 | Rater 3 | Classifier |
|--------|---------|---------|---------|------------|
| 1      | 1       | 2       | 3       | 1          |
| 2      | 1       | 2       | 3       | 3          |
| 3      | 1       | 2       | 3       | 1          |
| 4      | 1       | 2       | 3       | 2          |
| 5      | 1       | 2       | 3       | 3          |
| 6      | 1       | 2       | 3       | 2          |
| 7      | 1       | 2       | 3       | 2          |

Any random label assignment gives the same observed agreement.

## Slide 75: Marginalization approach and issues in the fixed-experts setting

Chance agreement: extending the marginalization argument is not informative when the raters are fixed, as it ignores rater-specific correlations. When fixed raters never agree, the chance agreement should be zero.

## Slide 76: Consensus approach and issues

Approach:

- Obtain a deterministic label for each instance if at least k raters agree.
- Treat this label set as ground truth and use the Dice coefficient against the classifier's labels.

## Slides 77-78: Problem with consensus: ties

*[Figure: the six raters' label counts again (R1: 1526, R2: 2484, R3: 1671, R4: 816, R5: 754, R6: 1891); ties arise when forming the consensus.]*

## Slide 79: Problem with consensus: ties

Proposed solution: remove one rater (a tie-breaker).
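The failure mode of slides 67-69 is easy to reproduce: with chance agreement obtained by pooling the two raters' labels into a single "random rater", the statistic returns −1 even for raters who, by construction, never agree and show no propensity toward either label. A minimal sketch:

```python
def marginal_kappa(r1, r2):
    """Two-rater chance-corrected agreement where the chance term follows
    the marginalization argument: a 'random rater' is formed by pooling
    both raters' label assignments."""
    n = len(r1)
    p0 = sum(a == b for a, b in zip(r1, r2)) / n          # observed agreement
    pooled = list(r1) + list(r2)
    pe = sum((pooled.count(c) / len(pooled)) ** 2 for c in set(pooled))
    return (p0 - pe) / (1 - pe)

# Slide 67: mixed agreement -> kappa = (4/7 - 0.5)/(1 - 0.5) = 0.143
r1 = [1, 1, 1, 2, 2, 1, 2]
r2 = [1, 2, 1, 1, 2, 2, 2]

# Slide 69: rater 1 always says 1, rater 2 always says 2 -> kappa = -1,
# even though there is no evidence of any chance agreement
never_agree = marginal_kappa([1] * 7, [2] * 7)
```

This degenerate −1 outcome, regardless of the raters' actual propensities, is the behavior that motivates Shah's correction for the fixed-rater scenario.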
Problem with consensus: ties — but remove which one?

Two consensus outputs, one classifier: C1 (liberal consensus) vs. C2 (conservative consensus). Chance correction is neither done nor meaningful here (even Cohen's kappa is not useful over a consensus, since a consensus is not a single classifier). Moreover, the notions of "liberal" and "conservative" are themselves not meaningful, because they are defined over a consensus that does not reflect the expert group.

Problem with consensus: the consensus is upper-bounded by the threshold-highest rater.
R1: 1526 (t4), R2: 2484, R3: 1671 (t3), R4: 816 (t5), R5: 754 (t6), R6: 1891 (t2)
A theoretical bound on the consensus: if ti is the number of voxels labeled as lesions by the i-th most liberal rater, then ti is an upper bound on the number of labeled examples in the consensus.

Consensus Approach and Issues
Approach:
- Obtain a deterministic label for each instance if at least k >= r/2 raters agree.
- Treat this label set as ground truth and use the dice coefficient against the classifier's labels.
Issues:
- Threshold sensitive, and establishing the threshold can be non-trivial.
- Tie-breaking is not clear.
- Treats estimates as deterministic.
- Ignores minority raters as well as rater correlation.
The S measure (Shah, ECML 2011) addresses these issues with marginalization and is independent of the consensus approach.

Error Estimation/Resampling — © 2013 GE, All Rights Reserved

So we have decided on a performance measure. How do we estimate it in an unbiased manner? What if we used all the data? Re-substitution is overly optimistic (the best performance is achieved with complete over-fitting). Which factors affect such estimation choices? Sample size and domain, and random variations in the training set, the test set, the learning algorithm, and class noise.

Hold-out approach
Set aside a separate test set T and estimate the empirical risk: the average loss (e.g., the misclassification rate) of the classifier over T.
Advantages:
- Independence from the training set.
- Generalization behavior can be characterized.
- Estimates can be obtained for any classifier.
Confidence intervals can be found too, based on the binomial approximation.

A tighter bound
Based on binomial inversion: more realistic than the asymptotic Gaussian assumption.

Binomial vs. Gaussian assumptions
Bl and Bu denote the binomial lower and upper bounds; CIl and CIu denote the lower and upper confidence intervals based on the Gaussian assumption.

Data set  Algorithm  RT     Bl     Bu     CIl     CIu
Bupa      SVM        0.352  0.235  0.376  -0.574  1.278
Bupa      Ada        0.291  0.225  0.364  -0.620  1.202
Bupa      DT         0.325  0.256  0.400  -0.614  1.264
Bupa      DL         0.325  0.256  0.400  -0.614  1.264
Bupa      NB         0.400  0.326  0.476  -0.582  1.382
Bupa      SCM        0.377  0.305  0.453  -0.595  1.349
Credit    SVM        0.183  0.141  0.231  -0.592  0.958
Credit    Ada        0.170  0.129  0.217  -0.582  0.922
Credit    DT         0.130  0.094  0.173  -0.543  0.803
Credit    DL         0.193  0.150  0.242  -0.598  0.984
Credit    NB         0.200  0.156  0.249  -0.603  1.003
Credit    SCM        0.190  0.147  0.239  -0.596  0.976

Hold-out sample size requirements
Inverting the bound gives the test-set size required for a desired precision and confidence; this sample size grows very quickly as the required precision and confidence increase.

The need for re-sampling
Too few training examples -> learning a poor classifier. Too few test examples -> erroneous error estimates. Hence, resampling: it allows accurate performance estimates while letting the algorithm train on most of the data examples, and it allows closer duplication of conditions.

A quick look at the relationship with bias-variance behavior
What implicitly guides re-sampling: bias-variance analysis helps understand the behavior of learning algorithms. In the classification case:
- Bias: the difference between the most frequent prediction of the algorithm and the optimal label of the examples.
- Variance: the degree of the algorithm's divergence from its most probable (average) prediction in response to variations in the training set.
Hence, bias is independent of the training set while variance is not.
Having too few examples in the training set affects the bias of the algorithm by making its average prediction unreliable; having too few examples in the test set results in high variance. The choice of resampling can therefore introduce different limitations on the bias and variance estimates of an algorithm, and hence on the reliability of the error estimates.

An ontology of error estimation techniques
- No re-sampling: re-substitution; hold-out.
- Simple re-sampling: cross-validation (stratified and non-stratified k-fold, leave-one-out); random sub-sampling.
- Multiple re-sampling: repeated k-fold cross-validation (5x2 CV, 10x10 CV); bootstrapping (the e0 and 0.632 bootstraps); randomization (permutation test).

Simple resampling: k-fold cross-validation
In cross-validation, the data set is divided into k folds; at each iteration, a different fold is reserved for testing while all the others are used for training the classifier.

Simple resampling: some variations of cross-validation
Stratified k-fold cross-validation maintains the class distribution of the original dataset when forming the k-fold CV subsets; this is useful when the class distribution of the data is imbalanced/skewed.
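Stratified splitting as just described can be sketched without any ML library; the round-robin assignment and the toy 80/20 labels below are illustrative choices, not from the tutorial:

```python
import random
from collections import defaultdict

def stratified_kfold(labels, k, seed=0):
    """Split indices into k folds, keeping each class's proportion
    roughly equal across folds (stratified k-fold CV)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idx_list in by_class.values():
        rng.shuffle(idx_list)
        for i, idx in enumerate(idx_list):
            folds[i % k].append(idx)   # deal each class out round-robin
    return folds

# Imbalanced toy labels: 80% class 0, 20% class 1.
labels = [0] * 80 + [1] * 20
for fold in stratified_kfold(labels, 5):
    test = [labels[i] for i in fold]
    print(len(test), test.count(1))    # every fold: 20 examples, 4 positives
```

Each held-out fold preserves the 80/20 class ratio, which is the point of stratification on skewed data.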
Leave-one-out is the extreme case of k-fold CV where each subset is of size 1. It is also known as the jackknife estimate.

Observations
k-fold CV is arguably the best known and most commonly used resampling technique. With k of reasonable size, it is less computationally intensive than leave-one-out, and it is easy to apply. In all the variants, the testing sets are independent of one another, as required by many statistical testing methods, but the training sets are highly overlapping. This can affect the bias of the error estimates (generally mitigated for large dataset sizes). It results in an estimate averaged over k different classifiers.

Leave-one-out can result in estimates with high variance, given that each testing fold contains only one example. On the other hand, the classifier is practically unbiased, since each training fold contains almost all the data. Leave-one-out can be quite effective for moderate dataset sizes: for small datasets the estimates have high variance, while for large datasets its application becomes too computationally intensive. Leave-one-out can also be beneficial, compared to k-fold CV, when there is wide dispersion of the data distribution or the data set contains extreme values.

Limitations of simple resampling
These methods do not estimate the risk of a classifier but rather the expected risk of the algorithm over partitions of the given size. While the stability of the estimates can give insight into the robustness of the algorithm, classifier comparisons in fact compare the averaged estimates (and not individual classifiers). Providing rigorous guarantees is quite problematic: no confidence intervals can be established (unlike, say, the hold-out method), and even under a normality assumption the standard deviation at best conveys the uncertainty in the estimates.

Multiple resampling: bootstrapping
Attempts to answer: what can be done when the data is too small for the application of k-fold CV or leave-one-out? It assumes that the available sample is representative, and creates a large number of new samples by drawing with replacement.

Multiple resampling: the e0 bootstrap
Given a dataset D of size m, create k bootstrap samples Bi, each of size m, by sampling with replacement. At each iteration i, Bi is the training set, while the test set contains a single copy of each example in D that is not present in Bi. Train a classifier on Bi, test it on the test set to obtain the error estimate e0i, and average the e0i to obtain the e0 estimate.
In every iteration, the probability of an example not being chosen after m samples is (1 - 1/m)^m ≈ e^-1 ≈ 0.368. Hence, the average number of distinct examples in the training set is m(1 - 0.368) = 0.632m. The e0 measure can therefore be pessimistic, since in each run the classifier is typically trained on only about 63.2% of the data.

Multiple resampling: the e632 bootstrap
The e632 estimate corrects for this pessimistic bias by taking into account the optimistic bias of the resubstitution error over the remaining 0.368 fraction: e632 = 0.632 e0 + 0.368 err_resub.

Discussion
Bootstrap can be useful when datasets are too small; in these cases, the estimates can also have low variance owing to the (artificially) increased data size. e0 is effective in cases of very high true error rate; it has lower variance than k-fold CV but is more biased. e632 (owing to the correction it makes) is effective over small true error rates. An interesting observation: these error estimates can be algorithm-dependent; e.g., bootstrap can be a poor estimator for algorithms such as nearest neighbor or FOIL that do not benefit from duplicate instances.

Multiple resampling: other approaches
Randomization over labels assesses the dependence of the algorithm on the actual label assignment (as opposed to obtaining a similar classifier under chance assignment); randomization over samples (permutation) assesses the stability of the estimates over different re-orderings of the data. Multiple trials of simple resampling give potentially higher replicability and more stable estimates; multiple runs (how many?) such as 5x2 CV and 10x10 CV are mainly motivated by the comparison of algorithms.

Recent developments: bootstrap and big data
Bootstrap has recently gained significant attention in the big-data setting, since it allows error bars to characterize uncertainty in predictions as well as in other evaluations. Kleiner et al. (ICML 2012) devised an extension of the bootstrap to large-scale problems, aimed at developing a database that returns answers with error bars to all queries; it combines the ideas of the classical bootstrap and subsampling.

Assessing the quality of inference (slides courtesy of Michael Jordan)
Observe data X1, ..., Xn. Form a parameter estimate θn = θ(X1, ..., Xn). We want to compute an assessment ξ of the quality of our estimate θn (e.g., a confidence region).
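The e0 out-of-bag estimate and its e632 correction described earlier can be sketched as follows. The majority-class "learner" and the synthetic labels are illustrative assumptions only, chosen to keep the sketch self-contained:

```python
import random

def e632_estimate(labels, train_and_error, k=50, seed=0):
    """0.632 bootstrap: average the out-of-bag error e0 over k bootstrap
    samples, then blend it with the resubstitution error:
        e632 = 0.632 * e0 + 0.368 * err_resub
    `train_and_error(train_idx, test_idx)` is a caller-supplied helper."""
    rng = random.Random(seed)
    m = len(labels)
    e0 = []
    for _ in range(k):
        boot = [rng.randrange(m) for _ in range(m)]        # sample with replacement
        oob = [i for i in range(m) if i not in set(boot)]  # out-of-bag test set,
        if oob:                                            # ~36.8% of the data
            e0.append(train_and_error(boot, oob))
    err_resub = train_and_error(list(range(m)), list(range(m)))
    return 0.632 * (sum(e0) / len(e0)) + 0.368 * err_resub

labels = [0] * 70 + [1] * 30   # synthetic, imbalanced toy labels

def majority_error(train_idx, test_idx):
    # Toy "learner": predict the majority label of the training sample.
    train_labels = [labels[i] for i in train_idx]
    pred = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] != pred for i in test_idx) / len(test_idx)

est = e632_estimate(labels, majority_error)
print(round(est, 2))   # close to the ~0.3 error of the majority rule
```

Note that (1 - 1/m)^m for m = 100 is about 0.366, so each bootstrap training set covers roughly 63% of the distinct examples, which is where the 0.632/0.368 weights come from.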
The unachievable frequentist ideal
Ideally, we would observe many independent datasets of size n, compute θn on each, and compute ξ based on these multiple realizations of θn. But we only observe one dataset of size n.

The bootstrap (Efron, 1979)
Use the observed data to simulate multiple datasets of size n: repeatedly resample n points with replacement from the original dataset of size n, compute θ*n on each resample, and compute ξ based on these multiple realizations of θ*n as our estimate of ξ for θn.

The bootstrap: computational issues
Seemingly a wonderful match for modern parallel and distributed computing platforms. But the expected number of distinct points in a bootstrap resample is ~0.632n; e.g., if the original dataset has size 1 TB, expect a resample of size ~632 GB. One cannot feasibly send resampled datasets of this size to distributed servers — and even if one could, one could not compute the estimate locally on datasets this large.

Subsampling (Politis, Romano & Wolf, 1999)
There are many subsets of size b < n. Choose a sample of them and apply the estimator to each. This yields fluctuations of the estimate, and thus error bars. But a key issue arises: the fact that b < n means that the error bars will be on the wrong scale (they will be too large), and they need to be corrected analytically.
Summary of the algorithm: repeatedly subsample b < n points without replacement from the original dataset of size n; compute θ*b on each subsample; compute ξ based on these multiple realizations of θ*b; analytically correct to produce the final estimate of ξ for θn. The need for analytical correction makes subsampling less automatic than the bootstrap; still, it has a much more favorable computational profile.

Bag of Little Bootstraps (BLB)
BLB combines the bootstrap and subsampling, and gets the best of both worlds: it works with small subsets of the data, like subsampling, and thus is appropriate for distributed computing platforms; but, like the bootstrap, it does not require analytical rescaling. And it has been shown to be successful in practice.
The idea: pretend the subsample is the population — and bootstrap the subsample. This means resampling n times with replacement, not b times as in subsampling. The subsample contains only b points, so the resulting empirical distribution has its support on b points; but we can (and should!) resample it with replacement n times, not b times. Doing this repeatedly for a given subsample gives bootstrap confidence intervals on the right scale — no analytical rescaling is necessary. Now do this (in parallel) for multiple subsamples and combine the results (e.g., by averaging).

BLB: theoretical results
BLB is asymptotically consistent and higher-order correct (like the bootstrap), under essentially the same conditions that have been used in prior analyses of the bootstrap.
Theorem (asymptotic consistency): under standard assumptions (particularly that θ is Hadamard differentiable and ξ is continuous), the output of BLB converges to the population value of ξ as n and b approach infinity.

What to watch for when selecting an error estimation method
The domain (e.g., data distribution, dispersion), the dataset size, the bias-variance dependence, the (expected) true error rate, the type of learning algorithms to be evaluated (e.g., robustness to a permutation test, or no benefit from bootstrapping), and the computational complexity.
"The relations between the evaluation methods, whether statistically significantly different or not, varies with data quality. Therefore, one cannot replace one test with another, and only evaluation methods appropriate for the context may be used." — Reich and Barai (1999, p. 11)

Statistical Significance Testing — © 2013 GE, All Rights Reserved

Error estimation techniques allow us to obtain an estimate of the desired performance metric(s) for different classifiers, and a difference is seen between the metric values of the classifiers. Questions: can the difference be attributed to real characteristics of the algorithms, or could the observed difference(s) be merely coincidental concordances?
Statistical significance testing can help ascertain whether the observed results are statistically significant (within certain constraints). We primarily focus on Null Hypothesis Significance Testing (NHST), typically w.r.t.: evaluating an algorithm A of interest against other algorithms on a specific problem; evaluating the generic capabilities of A against other algorithms on benchmark datasets; and evaluating the performance of multiple classifiers on benchmark datasets or on a specific problem of interest.

NHST
- State a null hypothesis — usually the opposite of what we wish to test (for example, "classifiers A and B perform equivalently").
- Choose a suitable statistical test and statistic that will be used to (possibly) reject the null hypothesis.
- Choose a critical region for the statistic to lie in that is extreme enough for the null hypothesis to be rejected.
- Calculate the test statistic from the data.
- If the test statistic lies in the critical region, reject the null hypothesis; if not, we fail to reject the null hypothesis, but do not accept it either.
Rejecting the null hypothesis gives us some confidence in the belief that our observations did not occur merely by chance.

Choosing a statistical test
There are several aspects to consider: what kind of problem is being handled, and whether we have enough information about the underlying distributions of the classifiers' results to apply a parametric test. The tests considered can be categorized by the task they address, i.e., comparing 2 algorithms on a single domain, 2 algorithms on several domains, or multiple algorithms on multiple domains.

Issues with hypothesis testing
NHST never constitutes a proof that our observation is valid. Some researchers (Drummond, 2006; Demsar, 2008) have pointed out that NHST overvalues the results and limits the search for novel ideas (owing to excessive, often unwarranted, confidence in the results). NHST outcomes are also prone to misinterpretation: rejecting the null hypothesis with confidence p does not signify that the performance difference holds with probability 1-p; 1-p in fact denotes the likelihood of the evidence from the experiment being correct if the difference indeed exists. And given enough data, a statistically significant difference can always be shown.
But NHST can still be helpful: it provides added concrete support to observations, the misinterpretation issues call for better sensitization rather than abandonment, and the impossibility of rejecting the null hypothesis while using a reasonable effect size is itself telling.

An overview of representative tests
- 2 algorithms, 1 domain: two matched-samples t-test (parametric); McNemar's test and the sign test (non-parametric).
- 2 algorithms, multiple domains: Wilcoxon's signed-rank test for matched pairs and the sign test (non-parametric).
- Multiple algorithms, multiple domains: repeated-measures one-way ANOVA with the Tukey post-hoc test (parametric); Friedman's test with the Nemenyi or Bonferroni-Dunn post-hoc tests (non-parametric).
Parametric tests make assumptions about the distribution of the population (i.e., of the error estimates) and are typically more powerful than their non-parametric counterparts.

Comparing 2 algorithms on a single domain: the t-test
Arguably one of the most widely used statistical tests, the t-test comes in various flavors addressing different experimental settings. In our case, the two matched-samples t-test compares the performances of two classifiers applied to the same dataset with matching randomizations and partitions. It measures whether the difference between two means (mean values of a performance measure, e.g., the error rate) is meaningful. Null hypothesis: the two samples (the performance measures of the two classifiers over the dataset) come from the same population.
The t-statistic is t = d-bar * sqrt(n) / sd, where di is the difference in the performance measures of the two classifiers at trial i, d-bar is the average of these differences, sd is their standard deviation, and n is the number of trials.

t-test: illustration
C4.5 and naive Bayes (NB) on the Labor dataset; 10 runs of 10-fold CV (maintaining the same training and testing folds across algorithms). The result of each 10-fold CV is considered a single result; hence, each run of 10-fold CV is a different trial.
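The matched-pairs t statistic can be computed directly. The per-trial error rates below are made-up illustrative numbers, not the actual Labor-dataset results:

```python
import math
import statistics

def paired_t(a, b):
    """Two matched-samples t statistic: mean of per-trial differences
    divided by its standard error; compare against the t-table with
    n - 1 degrees of freedom."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

# Hypothetical error rates of two classifiers over 10 matched trials
# (e.g., 10 runs of 10-fold CV with identical folds).
err_f1 = [0.21, 0.19, 0.23, 0.20, 0.22, 0.18, 0.24, 0.20, 0.21, 0.19]
err_f2 = [0.26, 0.24, 0.27, 0.25, 0.28, 0.23, 0.29, 0.26, 0.25, 0.24]
t = paired_t(err_f1, err_f2)
print(round(t, 2))  # |t| is compared against the critical value for 9 df
```

Because the trials are matched, only the per-trial differences enter the statistic; this is what distinguishes the matched-pairs flavor from an unpaired t-test.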
Consequently, n = 10. Computing the mean and standard deviation of the per-trial differences yields the t value. Referring to the t-test table, for n - 1 = 9 degrees of freedom, we can reject the null hypothesis at the 0.001 significance level (for which the observed t-value must be greater than 4.781).

Effect size
The t-test measured the effect (i.e., that the difference in the sample means is indeed statistically significant) but not the size of this effect (i.e., the extent to which the difference is practically important). Statistics offers many methods to determine this effect size, including Pearson's correlation coefficient, Hedges' g, etc. In the case of the t-test, we use Cohen's d statistic: d = d-bar / sp, where sp is the pooled standard deviation estimate.
A typical discrete scale proposed by Cohen for interpreting effect size: a value of about 0.2 or 0.3 is a small effect, but probably meaningful; 0.5 is a medium effect that is noticeable; 0.8 is a large effect. In the case of the Labor dataset example, we have:

t-test: so far so good — but what about the assumptions made by the t-test?
Assumptions: normality or pseudo-normality of the populations; randomness (representativeness) of the samples; equal variances of the populations. Do these hold for the Labor dataset?
- Normality: we may assume pseudo-normality, since each trial eventually tests the algorithms on each of the 57 instances (note, however, that we compare averaged estimates).
- Randomness: the data was collected in 1987-1988, and all the collective agreements were collected; there is no reason to assume it is non-representative.
However, the sample variances cannot be considered equal, and we also ignored the correlated measurement effect (due to overlapping training sets), which results in a higher Type I error. Were we warranted to use the t-test? Probably not (even though the t-test is quite robust to violations of some assumptions). Alternatives: other variations of the t-test (e.g., Welch's t-test), or McNemar's test (non-parametric).

Comparing 2 algorithms on a single domain: McNemar's test
A non-parametric counterpart to the t-test. It computes a contingency matrix over the test instances, where 0 denotes misclassification by the concerned classifier: c00 is the number of instances misclassified by both f1 and f2, and c10 is the number of instances correctly classified by f1 but not by f2 (similarly for c01 and c11).
Null hypothesis: c01 = c10. McNemar's statistic, chi2_MC = (|c01 - c10| - 1)^2 / (c01 + c10), is distributed approximately as chi-square with one degree of freedom if c01 + c10 is greater than 20. To reject the null hypothesis, chi2_MC should be greater than the chi-square critical value for the desired significance level alpha (typically 0.05).
Back to the Labor dataset: comparing C4.5 and NB, we obtain a statistic greater than the critical value of 3.841, allowing us to reject the null hypothesis at 95% confidence.
To keep in mind: the test is aimed at matched pairs (maintain the same training and test folds across learning algorithms); using resampling instead of independent test sets (over multiple runs) can introduce some bias; and the number of disagreements c01 + c10 should be at least 20, which was not true in this case. In such a case, the sign test (also applicable for comparing 2 algorithms on multiple domains) should be used instead.

Comparing 2 algorithms on multiple domains
There is no clear parametric way to compare two classifiers on multiple domains. The t-test is not a very good alternative because it is not clear that we have commensurability of the performance measures across domains, the normality assumption is difficult to establish, and the t-test is susceptible to outliers, which is more likely when many different domains are considered. We therefore describe two non-parametric alternatives: the sign test and Wilcoxon's signed-rank test.

Comparing 2 algorithms on multiple domains: the sign test
The sign test can be used to compare two classifiers on a single domain (using the results at each fold as a trial), but it is more commonly used to compare two classifiers on multiple domains. Calculate nf1, the number of times f1 outperforms f2, and nf2, the number of times f2 outperforms f1. The null hypothesis holds if the number of wins follows a binomial distribution; practically speaking, a classifier should perform better on at least w datasets (the critical value) to reject the null hypothesis at significance level alpha.
Illustration (on the dataset table below): NB vs. SVM: nf1 = 4.5, nf2 = 5.5, and w0.05 = 8, so we cannot reject the null hypothesis (1-tailed). Ada vs. RF: nf1 = 1 and nf2 = 8.5, so we can reject the null hypothesis at alpha = 0.05 (1-tailed) and conclude that RF is significantly better than Ada on these datasets at that significance level.
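Under the null hypothesis, the number of wins is Binomial(n, 1/2), so the sign-test p-value is a binomial tail sum. A minimal sketch, assuming ties are dropped from n (a common simplification; the illustration above instead splits ties as half-wins):

```python
from math import comb

def sign_test_p(wins, n):
    """One-tailed sign-test p-value: the probability of at least `wins`
    successes in n fair coin flips (null hypothesis: each algorithm is
    equally likely to win on any given dataset)."""
    return sum(comb(n, i) for i in range(wins, n + 1)) / 2 ** n

# E.g., one algorithm wins on 9 of 10 datasets (no ties):
p = sign_test_p(9, 10)
print(p)  # 11/1024 ≈ 0.0107, significant at alpha = 0.05
```

The critical value w for a given alpha is simply the smallest win count whose tail probability falls below alpha.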
157DatasetNBSVMAdaboostRand ForestAnneal96.4399.4483.6399.55Audiology73.4281.3446.4679.15Balance Scale72.3091.5172.3180.97Breast Cancer71.7066.1670.2869.99Contact Lenses71.6771.6771.6771.67Pima Diabetes74.3677.0874.3574.88Glass70.6362.2144.9179.87Hepatitis83.2180.6382.5484.58Hypothyroid98.2293.5893.2199.39Tic-Tac-Toe69.6299.9072.5493.94Performance Evaluation of Machine Learning Algorithms 2013 Mohak Shah All Rights ReservedComparing 2 algorithms on a multiple domains:Non-parametric, and typically more powerful than sign test (since, unlike sign test, it quantifies the differences in performances)Commonly used to compare two classifiers on multiple domainsMethodFor each domain, calculate the difference in performance of the two classifiersRank the absolute differences and graft the signs in front of these ranksCalculate the sum of positive ranks Ws1 and sum of negative ranks Ws2Calculate Twilcox = min (Ws1 , Ws2)Reject null hypothesis if Twilcox V (critical value)158Wilcoxons Signed-Rank testPerformance Evaluation of Machine Learning Algorithms 2013 Mohak Shah All Rights ReservedComparing 2 algorithms on a multiple domains:WS1 = 17 and WS2 = 28 TWilcox = min(17, 28) = 17For n= 10-1 degrees of freedom and = 0.05, V = 8 for the 1-sided test. Since 17 > 8. Hence, we cannot reject the null hypothesis159Wilcoxons Signed-Rank test: IllustrationDataNBSVMNB-SVM|NB-SVM|Ranks1.9643.9944-0.03010.03013-32.7342.8134-0.07920.07926-63.7230.9151-0.19210.19218-84.7170.6616+0.05540.05545+55.7167.716700RemoveRemove6.7436.7708-0.02720.02722-27.7063.6221+0.08420.08427+78.8321.8063+0.02580.02581+19.9822.9358+0.04640.04644+410.6962.9990-0.30280.30289-9Performance Evaluation of Machine Learning Algorithms 2013 Mohak Shah All Rights ReservedComparing multiple algorithms on a multiple domainsTwo alternativesParametric: (one-way repeated measure) ANOVA ;Non-parametric: Friedmans Test.Such multiple-hypothesis tests are called Omnibus tests. 
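As a sketch (not the tutorial's code), the three tests above can be run with SciPy. The McNemar discordant counts c01 and c10 below are hypothetical placeholders, since the Labor-dataset contingency values do not survive in the transcript; the sign-test tie is dropped rather than split, a common alternative to the tutorial's half-win convention.

```python
# Sketch of McNemar's test, the sign test, and Wilcoxon's signed-rank test.
# McNemar counts are hypothetical; the sign/Wilcoxon numbers reproduce the
# tutorial's NB-vs-SVM illustration on the 10 UCI datasets.
from scipy.stats import chi2, binomtest, wilcoxon

# --- McNemar's test (hypothetical discordant counts c01, c10) ---
c01, c10 = 15, 5
mcnemar = (abs(c01 - c10) - 1) ** 2 / (c01 + c10)   # continuity-corrected
reject = mcnemar > chi2.ppf(0.95, df=1)             # critical value 3.841

# --- Sign test: NB vs SVM accuracies (one tie dropped -> 9 trials) ---
nb  = [.9643, .7342, .7230, .7170, .7167, .7436, .7063, .8321, .9822, .6962]
svm = [.9944, .8134, .9151, .6616, .7167, .7708, .6221, .8063, .9358, .9990]
wins_nb = sum(a > b for a, b in zip(nb, svm))       # NB wins
trials  = sum(a != b for a, b in zip(nb, svm))      # non-tied datasets
sign_p  = binomtest(wins_nb, trials, 0.5).pvalue    # large p: cannot reject

# --- Wilcoxon signed-rank test, mirroring the tutorial's hand computation ---
diffs = [a - b for a, b in zip(nb, svm) if a != b]  # drop the zero difference
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
ws_pos = sum(r + 1 for r, i in enumerate(order) if diffs[i] > 0)   # Ws1
ws_neg = sum(r + 1 for r, i in enumerate(order) if diffs[i] < 0)   # Ws2
t_wilcox = min(ws_pos, ws_neg)                      # compare to V = 8
wilcoxon_p = wilcoxon(nb, svm).pvalue               # SciPy's exact p-value

print(mcnemar, reject, sign_p, (ws_pos, ws_neg, t_wilcox), wilcoxon_p)
```

With these numbers the hand computation gives Ws1 = 17 and Ws2 = 28, as in the illustration, and neither the sign test nor the Wilcoxon test rejects the null hypothesis at alpha = 0.05.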
Their null hypothesis is that all the classifiers perform similarly, and its rejection signifies that there exists at least one pair of classifiers with significantly different performances. In case of rejection of this null hypothesis, the omnibus test is followed by a post-hoc test to identify the significantly different pairs of classifiers. In this tutorial, we discuss Friedman's test (omnibus) and the Nemenyi test (post-hoc).

Friedman's Test
- Algorithms are ranked on each domain according to their performances (best performance = lowest rank).
- In case of a d-way tie at rank r, assign ((r+1)+(r+2)+...+(r+d))/d to each tied classifier.
- For each classifier j, compute R.j, the sum of its ranks on all domains.
- Calculate Friedman's statistic, over n domains and k classifiers:

    chi2_F = [12 / (n k (k+1))] * sum_j R.j^2  -  3 n (k+1)

Illustration of the Friedman test

Domain  Classifier fA  Classifier fB  Classifier fC
1       85.83          75.86          84.19
2       85.91          73.18          85.90
3       86.12          69.08          83.83
4       85.82          74.05          85.11
5       86.28          74.71          86.38
6       86.42          65.90          81.20
7       85.91          76.25          86.38
8       86.10          75.10          86.75
9       85.95          70.50          88.03
10      86.12          73.95          87.18

Domain  fA    fB   fC
1       1     3    2
2       1.5   3    1.5
3       1     3    2
4       1     3    2
5       2     3    1
6       1     3    2
7       2     3    1
8       2     3    1
9       2     3    1
10      2     3    1
R.j     15.5  30   14.5

Nemenyi Test
The omnibus test rejected the null hypothesis, so we follow up with the post-hoc Nemenyi test to identify classifier pairs with a significant performance difference. Let Rij be the rank of classifier fj on dataset Si, and R.j its rank sum over all datasets. Between classifiers fj1 and fj2, calculate

    q_{j1,j2} = (R.j1 - R.j2) / sqrt(k (k+1) / (6 n))

Nemenyi Test: illustration
Computing the q statistic of the Nemenyi test for the different classifier pairs, we obtain: qAB = -32.22, qBC = 34.44, qAC = 2.22. At alpha = 0.05, we have q_alpha = 2.55. Hence, we can show a statistically significantly different performance between classifiers A and B, and between classifiers B and C, but not between A and C.

Recent Developments
- Kernel-based tests: there have been considerable recent advances in employing kernels to devise statistical tests (e.g., Gretton et al., JMLR 2012; Fromont et al., COLT 2012).
- Employing f-divergences: generalizes information-theoretic tests such as KL divergence to the more general f-divergences by treating the distributions as class-conditional, thereby expressing the two distributions in terms of the Bayes risk w.r.t. a loss function (Reid and Williamson, JMLR 2011). Also extended to the multiclass case (Garcia-Garcia and Williamson, COLT 2012).

Data Set Selection and Evaluation Benchmark Design

Considerations to keep in mind while choosing an appropriate test bed:
- Wolpert's No Free Lunch theorems: if one algorithm tends to perform better than another on a given class of problems, then the reverse will be true on a different class of problems.
- LaLoudouana and Tarare (2003) showed that even mediocre learning approaches can be made to look competitive by selecting the test domains carefully.
- The purpose of data set selection should not be to demonstrate one algorithm's superiority over another in all cases, but rather to identify the areas of strength of various algorithms with respect to domain characteristics, or on specific domains of interest.

Where can we get our data from?
- Repository data: data repositories such as the UCI repository and the UCI KDD Archive have been extremely popular in machine learning research.
- Artificial data: the creation of artificial data sets representing the
particular characteristic an algorithm was designed to deal with is also a common practice in machine learning.
- Web-based exchanges: could we imagine a multi-disciplinary effort conducted on the Web, where researchers in need of data analysis would lend their data to machine learning researchers in exchange for an analysis?

Pros and Cons of Repository Data
Pros:
- Very easy to use: the data is available and already processed, and the user does not need any knowledge of the underlying field.
- The data is not artificial, since it was provided by labs and so on; in some sense the research is conducted in a real-world setting (albeit a limited one).
- Replication and comparisons are facilitated, since many researchers use the same data sets to validate their ideas.
Cons:
- The use of data repositories does not guarantee that the results will generalize to other domains.
- The data sets in the repository are not representative of the data-mining process, which involves many steps other than classification.
- Community experiment / multiplicity effect: since so many experiments are run on the same data set, by chance some will yield interesting (though meaningless) results.

Pros and Cons of Artificial Data
Pros:
- Data sets can be created to mimic the traits that are expected to be present in real data (which are unfortunately unavailable).
- The researcher has the freedom to explore various related situations that may not have been readily testable from accessible data.
- Since new data can be generated at will, the multiplicity effect is not a problem in this setting.
Cons:
- Real data is unpredictable and does not systematically behave according to a well-defined generation model.
- Classifiers that happen to model the same distribution as the one used in the data-generation process have an unfair advantage.

Pros and Cons of Web-Based Exchanges
Pros — such an exchange would be useful to three communities:
- Domain experts, who would get interesting analyses of their problems.
- Machine learning researchers, who would get their hands on interesting data and thus encounter new problems.
- Statisticians, who could study their techniques in an applied context.
Cons:
- It has not been organized. Is it truly feasible?
- How would the quality of the studies be controlled?

Available Resources

What help is available for conducting proper evaluation?
There is no need for researchers to program all the code necessary to conduct proper evaluation, nor to do the computations by hand. Resources are readily available for every step of the process, including:
- Evaluation metrics
- Statistical tests
- Re-sampling
- And, of course, data repositories
Pointers to these resources, and explanations of how to use them, are discussed in our book.

Where to look for evaluation metrics?
As most people know, Weka is a great source not only of classifiers, but also of computations of the results according to a variety of evaluation metrics:

=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances          42                73.6842 %
Incorrectly Classified Instances        15                26.3158 %
Kappa statistic                          0.4415
K&B Relative Info Score               1769.6451 %
K&B Information Score                   16.5588 bits       0.2905 bits/instance
Class complexity | order 0              53.3249 bits       0.9355 bits/instance
Class complexity | scheme             3267.2456 bits      57.3201 bits/instance
Complexity improvement (Sf)          -3213.9207 bits     -56.3846 bits/instance
Mean absolute error                      0.3192
Root mean squared error                  0.4669
Relative absolute error                 69.7715 %
Root relative squared error             97.7888 %
Total Number of Instances               57

=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.7      0.243    0.609      0.7     0.651      0.695     bad
               0.757    0.3      0.824      0.757   0.789      0.695     good
Weighted Avg.  0.737    0.28     0.748      0.737   0.74       0.695

=== Confusion Matrix ===
  a  b   <-- classified as
  (the matrix counts did not survive transcription)

The statistical tests described in this tutorial are directly available in R:

> c45 = c(.2456, .1754, .1754, .2632, .1579, .2456, .2105, .1404, .2632, .2982)
> nb = c(.0702, .0702, .0175, .0702, .0702, .0526, .1579, .0351, .0351, .0702)
> t.test(c45, nb, paired = TRUE)

        Paired t-test

data:  c45 and nb
t = 7.8645, df = 9, p-value = 2.536e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1087203 0.1965197
sample estimates:
mean of the differences
                0.15262

> c4510folds = c(3, 0, 2, 0, 2, 2, 2, 1, 1, 1)
> nb10folds = c(1, 0, 0, 0, 0, 1, 0, 2, 0, 0)
> wilcox.test(nb10folds, c4510folds, paired = TRUE)

        Wilcoxon signed rank test with continuity correction

data:  nb10folds and c4510folds
V = 2.5, p-value = 0.03125
alternative hypothesis: true location shift is not equal to 0

> classA = c(85.83, 85.91, 86.12, 85.82, 86.28, 86.42, 85.91, 86.10, 85.95, 86.12)
> classB = c(75.86, 73.18, 69.08, 74.05, 74.71, 65.90, 76.25, 75.10, 70.50, 73.95)
> classC = c(84.19, 85.90, 83.83, 85.11, 86.38, 81.20, 86.38, 86.75, 88.03, 87.18)
> t = matrix(c(classA, classB, classC), nrow = 10, byrow = FALSE)
> t
       [,1]  [,2]  [,3]
 [1,] 85.83 75.86 84.19
 [2,] 85.91 73.18 85.90
 [3,] 86.12 69.08 83.83
 [4,] 85.82 74.05 85.11
 [5,] 86.28 74.71 86.38
 [6,] 86.42 65.90 81.20
 [7,] 85.91 76.25 86.38
 [8,] 86.10 75.10 86.75
 [9,] 85.95 70.50 88.03
[10,] 86.12 73.95 87.18
> friedman.test(t)

        Friedman rank sum test

data:  t
Friedman chi-squared = 15, df = 2, p-value = 0.0005531

Where to look for re-sampling methods?
In our book, we have implemented all the re-sampling schemes described in this tutorial. The actual code is also available upon request.
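A Python counterpart to the friedman.test session above, with a Nemenyi-style follow-up sketched on top. The pairwise scaling mirrors the rank-sum form used in the tutorial's illustration; note that with the unrounded accuracies the rank sums come out 15/30/15 rather than the slide's 15.5/30/14.5 (domain 2 is no longer a tie), which leaves the conclusions unchanged.

```python
# Friedman omnibus test plus a Nemenyi-style post-hoc sketch on the
# classA/B/C accuracies from the R session above.
from math import sqrt
from scipy.stats import friedmanchisquare, rankdata

classA = [85.83, 85.91, 86.12, 85.82, 86.28, 86.42, 85.91, 86.10, 85.95, 86.12]
classB = [75.86, 73.18, 69.08, 74.05, 74.71, 65.90, 76.25, 75.10, 70.50, 73.95]
classC = [84.19, 85.90, 83.83, 85.11, 86.38, 81.20, 86.38, 86.75, 88.03, 87.18]

stat, p = friedmanchisquare(classA, classB, classC)  # matches R: chi2 = 15

# Rank sums per classifier (rank 1 = best, i.e., highest accuracy per domain)
n, k = len(classA), 3
rank_sums = [0.0] * k
for row in zip(classA, classB, classC):
    for j, r in enumerate(rankdata([-v for v in row])):
        rank_sums[j] += r

# Nemenyi-style pairwise q, scaled as in the tutorial's illustration;
# |q| beyond the tutorial's critical value 2.55 (alpha = 0.05) => significant
denom = sqrt(k * (k + 1) / (6.0 * n))
q = {(a, b): (rank_sums[a] - rank_sums[b]) / denom
     for a in range(k) for b in range(a + 1, k)}

print(stat, p, rank_sums, q)
```

Pairs (A, B) and (B, C) come out significant and (A, C) does not, in agreement with the illustration.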
(e-mail me at )

e0Boot = function(iter, dataSet, setSize, dimension, classifier1, classifier2) {
    # [function body truncated in the transcript]
}
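Since the e0Boot listing is cut off, here is a hypothetical Python analogue of what an e0-bootstrap routine computes: train on a bootstrap sample drawn with replacement, evaluate on the held-out (out-of-bag) instances, and average the error over iterations. The nearest-centroid classifier and the synthetic data are illustrative stand-ins, not the book's code.

```python
# Hypothetical sketch of the e0 bootstrap (out-of-bag error estimate).
import numpy as np

def nearest_centroid(train_X, train_y, test_X):
    """Toy stand-in for the 'classifier' arguments of e0Boot."""
    labels = np.unique(train_y)
    centroids = np.array([train_X[train_y == l].mean(axis=0) for l in labels])
    d = ((test_X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return labels[d.argmin(axis=1)]

def e0_boot(iters, X, y, classifier, rng=None):
    """Average out-of-bag error over `iters` bootstrap resamples."""
    rng = np.random.default_rng(rng)
    n = len(X)
    errors = []
    for _ in range(iters):
        idx = rng.integers(0, n, size=n)        # bootstrap sample (w/ replacement)
        oob = np.setdiff1d(np.arange(n), idx)   # out-of-bag test instances
        if len(oob) == 0:
            continue
        pred = classifier(X[idx], y[idx], X[oob])
        errors.append(np.mean(pred != y[oob]))
    return float(np.mean(errors))

# Two well-separated Gaussian blobs: the e0 error estimate should be near zero.
data_rng = np.random.default_rng(0)
X = np.vstack([data_rng.normal(0, 1, (50, 2)), data_rng.normal(6, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
err = e0_boot(30, X, y, nearest_centroid, rng=1)
print(err)
```

On average a bootstrap sample leaves out about 36.8% of the instances, which is what makes the out-of-bag set a usable test set at every iteration.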

