Introduction to Computation and Programming Using Python (Revised Edition), John V. Guttag


Chapter 15. Understanding Experimental Data

15.2.1 Coefficient of Determination

When we fit a curve to a set of data, we are finding a function that relates an independent variable (inches horizontally from the launch point in this example) to a predicted value of a dependent variable (inches above the launch point in this example). Asking about the goodness of a fit is equivalent to asking about the accuracy of these predictions. Recall that the fits were found by minimizing the mean square error. This suggests that one could evaluate the goodness of a fit by looking at the mean square error. The problem with that approach is that while there is a lower bound for the mean square error (zero), there is no upper bound. This means that while the mean square error is useful for comparing the relative goodness of two fits to the same data, it is not particularly useful for getting a sense of the absolute goodness of a fit.
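A minimal sketch of such a relative comparison (the helper meanSquareError, the numpy import, and the data below are illustrative assumptions, not part of the book's code):

    import numpy as np

    def meanSquareError(measured, predicted):
        # average squared difference between observation and prediction
        return ((predicted - measured)**2).sum()/len(measured)

    measured = np.array([1.0, 2.0, 3.0, 4.0])
    fit1 = np.array([1.1, 1.9, 3.2, 3.9])  # predictions from one model
    fit2 = np.array([1.5, 1.5, 3.5, 3.5])  # predictions from another model
    print(meanSquareError(measured, fit1))  # 0.0175
    print(meanSquareError(measured, fit2))  # 0.25

The smaller value identifies the better of the two fits, but neither number by itself tells us whether that fit is good in absolute terms.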

We can calculate the absolute goodness of a fit using the coefficient of determination, often written as R².[97] Let $y_i$ be the $i$-th observed value, $p_i$ be the corresponding value predicted by the model, and $\mu$ be the mean of the observed values. Then

$$R^2 = 1 - \frac{\sum_i (y_i - p_i)^2}{\sum_i (y_i - \mu)^2}$$

By comparing the estimation errors (the numerator) with the variability of the original values (the denominator), R² is intended to capture the proportion of variability in a data set that is accounted for by the statistical model provided by the fit. When the model being evaluated is produced by a linear regression, the value of R² always lies between 0 and 1. If R² = 1, the model explains all of the variability in the data. If R² = 0, there is no relationship between the values predicted by the model and the actual data.
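For a concrete (made-up) illustration: if the observed values are 1, 2, 3, so that $\mu = 2$, and the model predicts 1.1, 2.0, 2.9, then the numerator is $0.01 + 0 + 0.01 = 0.02$ and the denominator is $1 + 0 + 1 = 2$, giving $R^2 = 1 - 0.01 = 0.99$; the model accounts for 99% of the variability in the observations.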

The code in Figure 15.5 provides a straightforward implementation of this statistical measure. Its compactness stems from the expressiveness of the operations on arrays. The expression (predicted - measured)**2 subtracts the elements of one array from the elements of another, and then squares each element in the result. The expression (measured - meanOfMeasured)**2 subtracts the scalar value meanOfMeasured from each element of the array measured, and then squares each element of the result.
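For instance, with illustrative values (assuming numpy arrays):

    import numpy as np

    predicted = np.array([1.0, 2.0])
    measured = np.array([0.5, 2.5])
    print((predicted - measured)**2)  # [0.25 0.25] -- elementwise difference, squared
    meanOfMeasured = measured.sum()/len(measured)  # 1.5
    print((measured - meanOfMeasured)**2)  # [1. 1.] -- scalar broadcast over the array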

Figure 15.5 Computing R²

[97] There are several different definitions of the coefficient of determination. The definition supplied here is used to evaluate the quality of a fit produced by a linear regression.

    def rSquared(measured, predicted):
        """Assumes measured a one-dimensional array of measured values
             predicted a one-dimensional array of predicted values
           Returns coefficient of determination"""
        estimateError = ((predicted - measured)**2).sum()
        meanOfMeasured = measured.sum()/float(len(measured))
        variability = ((measured - meanOfMeasured)**2).sum()
        return 1 - estimateError/variability
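A usage sketch (the data and the use of numpy's polyfit and polyval here are illustrative assumptions, not part of the figure):

    import numpy as np
    # rSquared as defined in Figure 15.5 above

    xVals = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    measured = np.array([0.1, 2.0, 4.1, 5.9, 8.1])  # made-up, roughly linear observations
    model = np.polyfit(xVals, measured, 1)  # least-squares fit of a degree-1 polynomial
    predicted = np.polyval(model, xVals)  # the fit's predictions at the same points
    print(rSquared(measured, predicted))  # close to 1 for this nearly linear data

Because rSquared uses only elementwise array arithmetic and sum, it works unchanged on the arrays returned by polyval.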