assessment of data quality
![Page 1: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/1.jpg)
ASSESSMENT OF DATA QUALITY
Course: Introduction to RS & DIP
Mirza Muhammad Waqar
Contact: [email protected], +92-21-34650765-79 EXT: 2257
RG610
![Page 2: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/2.jpg)
Contents
Hard vs Soft Classification
Supervised Classification
Training Stage
Field Truthing
Inter class vs Intra Class Variability
Classification Stage
Minimum Distance to Mean Classifier
Parallelepiped Classifier
Maximum Likelihood Classifier
Output Stage
Supervised vs Unsupervised Classification
![Page 3: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/3.jpg)
Positional and Attribute Accuracies
Positional and attribute accuracies are the most critical factors in determining the quality of geographic data.
Both can be quantified by checking sample data (a portion of the whole data set) against reference data.
The concepts and methods of spatial data quality are applicable to both raster and vector data.
![Page 4: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/4.jpg)
Evaluation of Positional Accuracy
Positional accuracy is made up of two elements:
Planimetric accuracy: evaluated by comparing the coordinates (x and y) of sample points on maps to the coordinates (x and y) of corresponding reference points.
Height accuracy: evaluated by comparing the elevation values of sample and reference data points.
![Page 5: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/5.jpg)
Reference Data
To be used as a sample point, the point must be well defined, which means that it can be unambiguously identified both on the map and on the ground. Examples:
Survey monuments
Bench marks
Road intersections
Corners of buildings
Lampposts
Fire hydrants, etc.
![Page 6: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/6.jpg)
Reference Data
It is important for both the sample and reference data to be in the same map projection and based on the same datum.
The Accuracy Standards for Large-Scale Maps, however, specify that:
A minimum of 20 check points must be established throughout the area covered by the map.
These sample points should be spatially distributed so that at least 20% of the points are located in each quadrant of the map,
with individual points spaced at intervals equal to at least 10% of the diagonal of the map sheet.
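The three sampling rules above can be checked mechanically. The following is a minimal Python sketch, not part of the standard itself: the function name `check_sample_layout` is hypothetical, the quadrants are assumed to be split at the centre of the map sheet, and "spacing" is assumed to mean the smallest distance between any two check points.

```python
import math
from itertools import combinations

def check_sample_layout(points, xmin, ymin, xmax, ymax):
    """Check (x, y) check points against the three sampling rules."""
    n = len(points)
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2

    # Count points per quadrant; the index 0..3 is built from the
    # "east of centre" and "north of centre" flags.
    counts = [0, 0, 0, 0]
    for x, y in points:
        counts[int(x >= xmid) + 2 * int(y >= ymid)] += 1

    diagonal = math.hypot(xmax - xmin, ymax - ymin)
    # Assumption: spacing is read as the smallest pairwise distance.
    min_spacing = min(
        (math.hypot(p[0] - q[0], p[1] - q[1]) for p, q in combinations(points, 2)),
        default=math.inf,
    )
    return {
        "enough_points": n >= 20,
        # Integer form of "count >= 20% of n" to avoid float rounding.
        "quadrants_ok": all(5 * c >= n for c in counts),
        "spacing_ok": min_spacing >= 0.1 * diagonal,
    }
```

A layout that fails any of the three flags would need more or better distributed check points before the accuracy test is run.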
![Page 7: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/7.jpg)
Standard to take sample Points
![Page 8: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/8.jpg)
Root Mean Square Error
The discrepancies between the coordinate values of the sample points and their corresponding reference coordinate values are used to compute the overall accuracy of the map, as represented by the root mean square error (RMSE).
The RMSE is defined as the square root of the average of the squared discrepancies. The RMSEs for discrepancies in the X coordinate direction (rms_x), the Y coordinate direction (rms_y), and elevation (rms_e) are computed from:
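Written out from the definition above, the three expressions take the following form. This is a reconstruction under the stated definition; the elevation term is written rms_e here, and the sums run over all n check points:

$$
\mathrm{rms}_x = \sqrt{\frac{\sum d_x^{2}}{n}}, \qquad
\mathrm{rms}_y = \sqrt{\frac{\sum d_y^{2}}{n}}, \qquad
\mathrm{rms}_e = \sqrt{\frac{\sum e^{2}}{n}}
$$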
![Page 9: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/9.jpg)
RMS for discrepancies
Where
dx = discrepancies in X coordinate direction = X reference – X sample
![Page 10: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/10.jpg)
dy = discrepancies in Y coordinate direction = Y reference – Y sample
e = discrepancies in elevation = E reference – E sample
n = total number of points checked (sampled)
![Page 11: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/11.jpg)
From rms_x and rms_y, a single RMSE of planimetry (rms_p) can be computed as follows.
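The usual way to combine the two axes, assumed here to be the formula intended, is the square root of the sum of their squares:

$$
\mathrm{rms}_p = \sqrt{\mathrm{rms}_x^{2} + \mathrm{rms}_y^{2}}
$$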
![Page 12: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/12.jpg)
RMS as Overall Accuracy
The RMSEs of planimetry and elevation have now been generally accepted as the overall accuracy of the map.
RMSE is used as the index to check against specific standards to determine the fitness for use of the map.
The major drawback of the RMSE is that it provides information of only the overall accuracy. It does not give any indication of the spatial variation of the errors.
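The index itself takes only a few lines to compute. The sketch below is a minimal Python illustration, assuming check points are supplied as (x, y, elevation) tuples and that the planimetric RMSE is the square root of the sum of the squared X and Y RMSEs. The function names and the coordinates in the usage lines are illustrative, not taken from the lecture.

```python
import math

def rmse(discrepancies):
    """Square root of the mean of the squared discrepancies."""
    return math.sqrt(sum(d * d for d in discrepancies) / len(discrepancies))

def positional_rmse(reference, sample):
    """reference, sample: equal-length lists of (x, y, elevation) tuples."""
    # Discrepancy = reference value minus sample value, per check point.
    dx = [r[0] - s[0] for r, s in zip(reference, sample)]
    dy = [r[1] - s[1] for r, s in zip(reference, sample)]
    e = [r[2] - s[2] for r, s in zip(reference, sample)]
    rms_x, rms_y, rms_e = rmse(dx), rmse(dy), rmse(e)
    # Assumption: planimetric RMSE combines the two horizontal axes.
    rms_p = math.hypot(rms_x, rms_y)
    return {"rms_x": rms_x, "rms_y": rms_y, "rms_e": rms_e, "rms_p": rms_p}

# Illustrative values only (three hypothetical check points).
reference_points = [(100.0, 200.0, 50.0), (150.0, 260.0, 52.0), (300.0, 120.0, 47.5)]
sample_points = [(100.4, 199.7, 50.3), (149.8, 260.5, 51.6), (300.1, 120.2, 47.9)]
print(positional_rmse(reference_points, sample_points))
```

The returned values are single summary numbers, which is exactly why they say nothing about where on the map the larger errors occur.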
![Page 13: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/13.jpg)
For users who require such information, a map showing the positional discrepancies at the sample points can be generated.
Separate maps can be generated for discrepancies in easting and northing.
Alternatively, a map showing the vectors of discrepancies at each point can be plotted.
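The per-point values behind such maps can be sketched as follows in Python. The function name is hypothetical, and the bearing convention (degrees clockwise from north, taking +y as north) is an assumption made for the example.

```python
import math

def discrepancy_vectors(reference, sample):
    """Return one (dx, dy, length, bearing_deg) tuple per check point."""
    vectors = []
    for (xr, yr), (xs, ys) in zip(reference, sample):
        dx, dy = xr - xs, yr - ys          # easting and northing discrepancies
        length = math.hypot(dx, dy)        # magnitude of the error vector
        # Bearing measured clockwise from north (+y axis), in 0-360 degrees.
        bearing = math.degrees(math.atan2(dx, dy)) % 360
        vectors.append((dx, dy, length, bearing))
    return vectors
```

The `dx` and `dy` columns would feed the separate easting and northing maps, while `length` and `bearing` describe the arrow drawn at each sample point.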
![Page 14: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/14.jpg)
![Page 15: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/15.jpg)
![Page 16: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/16.jpg)
![Page 17: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/17.jpg)
![Page 18: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/18.jpg)
![Page 19: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/19.jpg)
![Page 20: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/20.jpg)
![Page 21: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/21.jpg)
Evaluation of Attribute Accuracy
Attribute accuracy is obtained by comparing values of sample spatial data units with reference data obtained either by field checks or from sources of data with a higher degree of accuracy.
These sample spatial units can be raster cells; raster image pixels; or sample points, lines, and polygons.
![Page 22: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/22.jpg)
Error Matrix
An error matrix is constructed to show the frequency of discrepancies between encoded values (i.e., data values on a map or in a database) and their corresponding actual or reference values for a sample of locations.
The error matrix has been widely used as a method for assessing the classification accuracy of remotely sensed images.
![Page 23: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/23.jpg)
Error/Confusion Matrix
An error matrix, also known as classification error matrix or confusion matrix, is a square array of values, which cross-tabulates the number of sample spatial data units assigned to a particular category relative to the actual category as verified by the reference data.
![Page 24: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/24.jpg)
Error Matrix
Conventionally, the rows of the error matrix represent the categories of the classification of the database, while the columns indicate the classification of the reference data.
In the error matrix, the element ij represents the frequency of spatial data units assigned to category i that actually belong to category j.
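The row and column convention can be illustrated with a short Python sketch that tallies paired labels into a matrix. The function name and the tiny label lists used to exercise it are hypothetical.

```python
def error_matrix(classified, reference, categories):
    """Cross-tabulate paired labels.

    Rows follow the classified (map or database) label and columns follow
    the reference label, so matrix[i][j] counts units assigned to category i
    that actually belong to category j.
    """
    index = {category: k for k, category in enumerate(categories)}
    matrix = [[0] * len(categories) for _ in categories]
    for mapped, actual in zip(classified, reference):
        matrix[index[mapped]][index[actual]] += 1
    return matrix

# Hypothetical labels for three sample locations.
print(error_matrix(["A", "A", "B"], ["A", "B", "B"], ["A", "B"]))
```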
![Page 25: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/25.jpg)
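An Error Matrix

| Sample Data (rows) / Reference Data (columns) | A | B | C | D | E | F | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 2 | 0 | 0 | 0 | 0 | 3 |
| B | 0 | 5 | 0 | 2 | 3 | 0 | 10 |
| C | 0 | 3 | 5 | 1 | 0 | 0 | 9 |
| D | 0 | 0 | 4 | 4 | 0 | 0 | 8 |
| E | 0 | 0 | 0 | 0 | 4 | 0 | 4 |
| F | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 1 | 10 | 9 | 7 | 7 | 1 | 35 |

A = Exposed soil, B = Cropland, C = Range, D = Sparse woodland, E = Forest, F = Water body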
![Page 26: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/26.jpg)
Error Matrix
The numbers along the diagonal of the error matrix (i.e., when i = j) indicate the frequencies of correctly classified spatial data units in each category, and the off-diagonal numbers (when i ≠ j) represent the frequencies of misclassification in the various categories.
![Page 27: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/27.jpg)
Error Matrix
The error matrix is an effective way to describe attribute accuracy of geographic data.
If, in a particular error matrix, all the nonzero entries lie on the diagonal, it indicates that no misclassification has occurred at the sample locations and an overall accuracy of 100% is obtained.
![Page 28: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/28.jpg)
Commission or Omission
When misclassification occurs, it can be identified either as an error of commission or an error of omission.
Any misclassification is simultaneously an error of commission and an error of omission.
![Page 29: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/29.jpg)
Error of Commission and Omission
Errors of commission, also known as errors of inclusion, are defined as wrongful inclusion of a sample location in a particular category due to misclassification.
When this happens, it means that the same sample location is omitted from another category in the reference data, which is an error of omission.
![Page 30: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/30.jpg)
Commission vs Omission
Errors of commission are identified by off-diagonal values across the rows.
Errors of omission, also known as errors of exclusion, are identified by the off-diagonal values down the columns.
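Reading the off-diagonal values by row and by column can be sketched in Python as below, using the six-category error matrix from the earlier slide. The function names are illustrative.

```python
# Error matrix from the lecture: rows = sample data, columns = reference data.
MATRIX = [
    [1, 2, 0, 0, 0, 0],  # A = Exposed soil
    [0, 5, 0, 2, 3, 0],  # B = Cropland
    [0, 3, 5, 1, 0, 0],  # C = Range
    [0, 0, 4, 4, 0, 0],  # D = Sparse woodland
    [0, 0, 0, 0, 4, 0],  # E = Forest
    [0, 0, 0, 0, 0, 1],  # F = Water body
]

def commission_counts(matrix):
    """Off-diagonal total of each row: units wrongly included in the category."""
    return [sum(row) - row[i] for i, row in enumerate(matrix)]

def omission_counts(matrix):
    """Off-diagonal total of each column: units wrongly excluded from the category."""
    size = len(matrix)
    return [sum(row[j] for row in matrix) - matrix[j][j] for j in range(size)]

print(commission_counts(MATRIX))
print(omission_counts(MATRIX))
```

Both lists add up to the same total, since every misclassified unit is counted once as a commission (in its row) and once as an omission (in its column).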
![Page 31: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/31.jpg)
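An Error Matrix

| Sample Data (rows) / Reference Data (columns) | A | B | C | D | E | F | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 2 | 0 | 0 | 0 | 0 | 3 |
| B | 0 | 5 | 0 | 2 | 3 | 0 | 10 |
| C | 0 | 3 | 5 | 1 | 0 | 0 | 9 |
| D | 0 | 0 | 4 | 4 | 0 | 0 | 8 |
| E | 0 | 0 | 0 | 0 | 4 | 0 | 4 |
| F | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 1 | 10 | 9 | 7 | 7 | 1 | 35 |

A = Exposed soil, B = Cropland, C = Range, D = Sparse woodland, E = Forest, F = Water body
Error of Commission (off-diagonal values across the rows)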
![Page 32: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/32.jpg)
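An Error Matrix

| Sample Data (rows) / Reference Data (columns) | A | B | C | D | E | F | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 2 | 0 | 0 | 0 | 0 | 3 |
| B | 0 | 5 | 0 | 2 | 3 | 0 | 10 |
| C | 0 | 3 | 5 | 1 | 0 | 0 | 9 |
| D | 0 | 0 | 4 | 4 | 0 | 0 | 8 |
| E | 0 | 0 | 0 | 0 | 4 | 0 | 4 |
| F | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 1 | 10 | 9 | 7 | 7 | 1 | 35 |

A = Exposed soil, B = Cropland, C = Range, D = Sparse woodland, E = Forest, F = Water body
Error of Omission (off-diagonal values down the columns)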
![Page 33: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/33.jpg)
Indices to check Accuracy
In addition to the interpretation of errors of commission and omission, the error matrix may also be used to compute a series of descriptive indices to quantify the attribute accuracy of the data.
These include:
Overall Accuracy
Producer's Accuracy
User's Accuracy
![Page 34: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/34.jpg)
Overall Accuracy
The PCC (Percent Correctly Classified) index represents the overall accuracy of the data.
In the case of simple random sampling, the PCC is defined as the trace of the error matrix (i.e., the sum of the diagonal values) divided by n, the total number of sample locations.
![Page 35: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/35.jpg)
Overall Accuracy
PCC = (Sd / n) * 100%
where
Sd = sum of values along the diagonal
n = total number of sample locations
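As a minimal Python sketch of this formula, applied to the six-category error matrix from the earlier slides (the function name `pcc` is illustrative):

```python
# Error matrix from the lecture: rows = sample data, columns = reference data.
MATRIX = [
    [1, 2, 0, 0, 0, 0],
    [0, 5, 0, 2, 3, 0],
    [0, 3, 5, 1, 0, 0],
    [0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0, 1],
]

def pcc(matrix):
    """Percent correctly classified: trace of the matrix over the sample size."""
    n = sum(sum(row) for row in matrix)                       # total sample locations
    diagonal_sum = sum(matrix[i][i] for i in range(len(matrix)))  # Sd
    return 100.0 * diagonal_sum / n

print(round(pcc(MATRIX), 1))
```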
![Page 36: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/36.jpg)
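PCC – Overall Accuracy

| Sample Data (rows) / Reference Data (columns) | A | B | C | D | E | F | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 2 | 0 | 0 | 0 | 0 | 3 |
| B | 0 | 5 | 0 | 2 | 3 | 0 | 10 |
| C | 0 | 3 | 5 | 1 | 0 | 0 | 9 |
| D | 0 | 0 | 4 | 4 | 0 | 0 | 8 |
| E | 0 | 0 | 0 | 0 | 4 | 0 | 4 |
| F | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 1 | 10 | 9 | 7 | 7 | 1 | 35 |

A = Exposed soil, B = Cropland, C = Range, D = Sparse woodland, E = Forest, F = Water body
PCC = (1 + 5 + 5 + 4 + 4 + 1) x 100 / 35 = 20 x 100 / 35 = 57.1%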
![Page 37: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/37.jpg)
Overall Accuracy
The maximum value of the PCC index is 100 when there is perfect agreement between the database and the reference data. The minimum value is 0, which indicates no agreement.
![Page 38: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/38.jpg)
Deficiencies in PCC index
In the first place, since the sample points are randomly selected, the index is sensitive to the structure of the error matrix. This means that if one category of data dominates the sample (this occurs when the category covers a much larger area than others), the PCC index can be quite high even if the other classes are poorly classified.
![Page 39: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/39.jpg)
Second, the computation of the PCC index does not take into account the chance agreements that might occur between sample and reference data. The index therefore always tends to overestimate the accuracy of the data.
Third, the PCC index does not differentiate between errors of omission and commission. Indices of these two types of errors are provided by the producer's accuracy and the user's accuracy.
![Page 40: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/40.jpg)
Producer’s Accuracy
This is the probability of a sample spatial data unit being correctly classified and is a measure of the error of omission for the particular category to which the sample data belong.
The producer's accuracy is so-called because it indicates how accurate the classification is at the time when the data are produced.
![Page 41: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/41.jpg)
Producer’s Accuracy
Producer’s accuracy is computed by:
Producer’s accuracy = (Ci / Ct) * 100
where
Ci = correctly classified sample locations in the column
Ct = total number of sample locations in the column
Error of omission = 100 – producer’s accuracy
![Page 42: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/42.jpg)
User’s Accuracy
This is the probability that a spatial data unit classified on the map or image actually represents that particular category on the ground.
This index of attribute accuracy, which is actually a measure of the error of commission, is of more interest to the user than the producer of the data.
![Page 43: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/43.jpg)
User’s Accuracy
User’s accuracy is computed by:
User’s accuracy = (Ri / Rt) * 100
where
Ri = correctly classified sample locations in the row
Rt = total number of sample locations in the row
Error of commission = 100 – user’s accuracy
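Producer's and user's accuracy differ only in whether the diagonal value is divided by its column total or its row total. A minimal Python sketch, using the error matrix from the earlier slides and assuming every row and column total is nonzero (function names are illustrative):

```python
# Error matrix from the lecture: rows = sample data, columns = reference data.
MATRIX = [
    [1, 2, 0, 0, 0, 0],  # A
    [0, 5, 0, 2, 3, 0],  # B
    [0, 3, 5, 1, 0, 0],  # C
    [0, 0, 4, 4, 0, 0],  # D
    [0, 0, 0, 0, 4, 0],  # E
    [0, 0, 0, 0, 0, 1],  # F
]

def producers_accuracy(matrix):
    """Diagonal value over its column (reference) total, in percent."""
    size = len(matrix)
    column_totals = [sum(row[j] for row in matrix) for j in range(size)]
    return [100.0 * matrix[j][j] / column_totals[j] for j in range(size)]

def users_accuracy(matrix):
    """Diagonal value over its row (classified) total, in percent."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]

# Errors of omission and commission are the complements of the two indices.
omission = [100.0 - p for p in producers_accuracy(MATRIX)]
commission = [100.0 - u for u in users_accuracy(MATRIX)]
```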
![Page 44: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/44.jpg)
An Error Matrix

| Sample Data (rows) / Reference Data (columns) | A | B | C | D | E | F | Total |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A | 1 | 2 | 0 | 0 | 0 | 0 | 3 |
| B | 0 | 5 | 0 | 2 | 3 | 0 | 10 |
| C | 0 | 3 | 5 | 1 | 0 | 0 | 9 |
| D | 0 | 0 | 4 | 4 | 0 | 0 | 8 |
| E | 0 | 0 | 0 | 0 | 4 | 0 | 4 |
| F | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| Total | 1 | 10 | 9 | 7 | 7 | 1 | 35 |

A = Exposed soil, B = Cropland, C = Range, D = Sparse woodland, E = Forest, F = Water body
PCC = (1 + 5 + 5 + 4 + 4 + 1) x 100 / 35 = 57.1%
Producer’s accuracy:
A = 1/1 = 100%
B = 5/10 = 50%
C = 5/9 = 55.6%
D = 4/7 = 57.1%
E = 4/7 = 57.1%
F = 1/1 = 100%
User’s accuracy:
A = 1/3 = 33.3%
B = 5/10 = 50%
C = 5/9 = 55.6%
D = 4/8 = 50%
E = 4/4 = 100%
F = 1/1 = 100%
![Page 45: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/45.jpg)
Kappa Coefficient (k)
Another useful analytical technique is the computation of the kappa coefficient, or Kappa Index of Agreement (KIA).
It is capable of controlling the tendency of the PCC index to overestimate by incorporating all the off-diagonal values in its computation.
The use of the off-diagonal values in the computation of the kappa coefficients also makes them useful for testing the statistical significance of the differences between error matrices.
![Page 46: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/46.jpg)
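The kappa coefficient (K) was first developed by Cohen (1960) for nominal scale data:
K = (Po – Pc) / (1 – Pc)
where
Po = proportion of agreement between the reference and sample data (the PCC expressed as a proportion)
Pc = proportion of agreement expected by chance
The kappa coefficient normally varies from a minimum of 0 (agreement no better than chance) to a maximum of 1 (perfect agreement).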
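A minimal Python sketch of the kappa computation for the error matrix from the earlier slides. The slides do not spell out how the chance term Pc is obtained; the sketch assumes the standard Cohen formulation, in which Pc is the sum over categories of (row total × column total) divided by n². The function name is illustrative.

```python
# Error matrix from the lecture: rows = sample data, columns = reference data.
MATRIX = [
    [1, 2, 0, 0, 0, 0],
    [0, 5, 0, 2, 3, 0],
    [0, 3, 5, 1, 0, 0],
    [0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0, 1],
]

def kappa(matrix):
    """Cohen's kappa: (Po - Pc) / (1 - Pc)."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    # Po: observed proportion of agreement (diagonal sum over n).
    p_o = sum(matrix[i][i] for i in range(size)) / n
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(size)]
    # Pc (assumed form): chance agreement from the row and column totals.
    p_c = sum(r * c for r, c in zip(row_totals, col_totals)) / (n * n)
    return (p_o - p_c) / (1 - p_c)

print(round(kappa(MATRIX), 3))
```

For this matrix the result is roughly 0.45, noticeably lower than the PCC of 57.1% expressed as a proportion, which reflects the correction for chance agreement.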
![Page 47: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/47.jpg)
Tau Coefficient
The kappa coefficient tends to overestimate chance agreement, and therefore to underestimate the agreement between data sets.
Foody (1992) described a modified kappa coefficient based on equal probability of group membership that resembles and is derived more properly from the tau coefficient.
![Page 48: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/48.jpg)
Tau Coefficient
τ = (Po – Pr) / (1 – Pr)
where Pr is the agreement expected from the a priori probabilities of group membership (1/M for M equally probable categories).
It was demonstrated that the tau coefficient, which is based on the a priori probabilities of group membership, provides an intuitive and relatively more precise quantitative measure of classification accuracy than the kappa coefficient, which is based on the a posteriori probabilities.
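A minimal Python sketch of the tau computation for the error matrix from the earlier slides. It assumes equal a priori probabilities, so that Pr = 1/M for M categories, unless explicit priors are passed in. The function name and the `priors` parameter are illustrative.

```python
# Error matrix from the lecture: rows = sample data, columns = reference data.
MATRIX = [
    [1, 2, 0, 0, 0, 0],
    [0, 5, 0, 2, 3, 0],
    [0, 3, 5, 1, 0, 0],
    [0, 0, 4, 4, 0, 0],
    [0, 0, 0, 0, 4, 0],
    [0, 0, 0, 0, 0, 1],
]

def tau(matrix, priors=None):
    """Tau coefficient: (Po - Pr) / (1 - Pr)."""
    size = len(matrix)
    n = sum(sum(row) for row in matrix)
    p_o = sum(matrix[i][i] for i in range(size)) / n
    if priors is None:
        # Assumption: equal a priori probability for every category.
        p_r = 1.0 / size
    else:
        # Assumption: unequal priors are weighted by the reference
        # (column) proportions of the sample.
        col_totals = [sum(row[j] for row in matrix) for j in range(size)]
        p_r = sum(p * c / n for p, c in zip(priors, col_totals))
    return (p_o - p_r) / (1 - p_r)

print(round(tau(MATRIX), 3))
```

With six equally probable categories the result is roughly 0.49 for this matrix.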
![Page 49: Assessment of data quality](https://reader036.vdocuments.mx/reader036/viewer/2022062301/5681623e550346895dd273de/html5/thumbnails/49.jpg)
Questions & Discussion