

Pattern Recognition Letters 18 (1997) 1495–1501

Measuring texture classification algorithms 1

Guy Smith *, Ian Burns
School of Information Technology, The University of Queensland, Brisbane, QLD 4072, Australia

Received 6 March 1997; revised 15 July 1997

Abstract

The texture analysis literature lacks a widely accepted method for comparing algorithms. This paper proposes a framework for comparing texture classification algorithms. The framework consists of several suites of texture classification problems, a standard functionality for algorithms, and a method for computing a score for each algorithm. We use the framework to demonstrate the peaking phenomenon in texture classification algorithms. The framework is publicly available on the Internet. © 1997 Elsevier Science B.V.

Keywords: Texture features; Classification; Comparison; Metric

1. Introduction

There are no widely used measures of the performance of texture analysis algorithms, as evidenced by the lack of quantitative comparisons in the literature. Papers describing texture segmentation algorithms usually demonstrate the algorithm on images included in the paper (Van Hulle and Tollenaere, 1993; Manjunath and Chellappa, 1993; Manjunath et al., 1990; Hsiao and Sawchuk, 1989). This has the advantage that images can be chosen which most clearly demonstrate the salient properties of the algorithm. Likewise, papers describing texture classification algorithms usually give percentage classification accuracy results on textures shown in the paper (Ohanian and Dubes, 1992; Van Gool et al., 1985).

* Corresponding author. E-mail: smith/[email protected].
1 Electronic annexes available. See http://www.elsevier.nl/locate/patrec.

Although these textures are often drawn from a widely used set of textures, such as the Brodatz textures, the selection of textures making up a classification problem varies from paper to paper. In both fields, texture segmentation and texture classification, this lack of a standard set of test problems makes it difficult to compare results from separate papers.

Haralick (1994) criticises the state of performance characterization in the general field of machine vision and has the following to say about the absence of an established performance characterization:

This is an awful state of affairs for the engineers whose job it is to design and build image analysis or machine vision systems.

This comment applies equally to the texture analysis subfield.

Some papers do describe rigorous comparisons of algorithms. Weszka et al. (1976) compare the accuracy of four classification algorithms on textures taken from images of terrain. They use the number of correctly classified test images as the measure of classification accuracy. Subsequently, Conners and Harlow (1980) theoretically analysed the algorithms studied by Weszka, Dyer and Rosenfeld. This theoretical study examined the information that could be captured in the texture features measured by the algorithms. In particular, Conners and Harlow described pairs of textures which gave identical feature values with one algorithm, but distinguishable feature values with another algorithm. In this way, Conners and Harlow were able to establish a hierarchy of algorithms, according to the power of the texture features measured by each algorithm. No such theoretical study has since been attempted; the increasing sophistication of algorithms, and the lack of a formal definition of texture, make such a study infeasible.

Van Gool et al. (1985) quote classification accuracy percentages for several algorithms in their survey of the texture analysis literature. However, these accuracies were not measured on a uniform set of texture problems; this limits the usefulness of these percentages in comparing algorithms.

Ohanian and Dubes (1992) compare four commonly used textural feature sets based on classification error. However, the image sets were limited to four examples of each of two artificial image sets (fractal and Gaussian Markov Random Fields) and two real image sets (leather and painted surfaces). The authors compare results using only 5 texture problems. These texture problems are now publicly available at the MeasTex web site.

Ojala et al. (1996) compare classification error rates for singleton features and pairs of features selected from several algorithms. They measure their features on a texture problem based on the Brodatz images (Brodatz, 1966) and on the five texture problems described by Ohanian and Dubes (1992).

In the related field of texture segmentation, Du Buf et al. (1990) describe a benchmark for comparing the accuracy of algorithms. Their benchmark measures the accuracy of the border found by the algorithm; it assumes that the classification of the texture in distinct regions is correct. Their benchmark is based entirely on a suite of artificial textures.

In this paper we propose a framework, MeasTex, for measuring the accuracy of texture classification algorithms. This framework provides a way of standardising the reporting practice for algorithm results. MeasTex addresses the aforementioned deficiencies by:
• incorporating a large set of texture classification problems;
• being publicly available;
• having supporting software;
• being easily extendible.
Section 2 describes the framework, discussing issues in the design of the framework, giving details of the texture problems, and demonstrating use of the quantitative results provided by the framework. Section 3 presents a revised measure of texture classification performance. Section 4 reviews issues in developing a measure of texture classification algorithms, and the degree to which MeasTex addresses these.

2. Standardised framework

This section describes a framework for measuring texture classification algorithms. Texture classification involves a statistical classification stage as well as a feature extraction stage. 2 As such, the characteristics of each must be matched for optimum performance. This framework can be used to benchmark the performance of any combination of features and classifier.

2 Occasionally both stages will be incorporated into the one structure (for example, Jain and Karu (1996)), but mostly they are distinct modules.
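To make this standard functionality concrete, the following is a minimal Python sketch of scoring one combination of feature extractor and classifier on a single test problem, using percentage correct classification. The data layout and method names here (problem['estimation'], fit, predict_proba) are hypothetical illustrations, not MeasTex's actual interface.

```python
import numpy as np

def benchmark(problem, extract_features, classifier):
    """Score one (feature extractor, classifier) pair on one test problem.
    The field and method names are hypothetical; MeasTex's actual
    interface is defined by its software, not by this sketch."""
    # Fit the classifier on labelled estimation subimages, class by class.
    train = {c: np.array([extract_features(img) for img in imgs])
             for c, imgs in problem['estimation'].items()}
    classifier.fit(train)
    # Classify the validation subimages and count correct nominations.
    correct = total = 0
    for c, imgs in problem['validation'].items():
        for img in imgs:
            probs = classifier.predict_proba(extract_features(img))
            correct += (max(probs, key=probs.get) == c)
            total += 1
    return 100.0 * correct / total   # percentage correct classification
```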

Section 2.1 describes the database of textures and suites of classification problems we have compiled. Section 2.2 describes an experiment using the framework.

The framework, including images, software and documentation, is available from http://www.cssip.elec.uq.edu.au/~guy/meastex/meastex.html.

2.1. Database of textures and problems

When designing a suite of test problems intended to characterize the performance of texture classification algorithms, it is important to incorporate a broad range of textures and classification problems. Although artificial images can be created to mimic almost any texture, the models used to create these images may unfairly advantage some algorithms. For example, Markov Random Fields are often used to synthesise textures; it is reasonable to expect classification algorithms based on Markov Random Fields to perform well on these textures. Also, the worth of texture classifiers is ultimately based on performing useful tasks on real images.

Our database is divided into test suites of two types: general and domain specific. The two general test suites are based on the Brodatz images and VisTex images.

The Brodatz images are scanned from an album of 112 textures photographed by Phil Brodatz (1966). The images are diverse, including grass, pebbles, Styrofoam, paper, cloth weave and clouds. These images have become a de facto standard for texture classification problems (Picard et al., 1993).

VisTex (1995) is gaining acceptance as a standard image database. The VisTex images offer a significant advantage over the Brodatz images. Both the Brodatz images and VisTex images are diverse; however, Brodatz tends not to have multiple images of similar scenes, and so very few pairs of Brodatz textures are difficult to distinguish. On the other hand, many of the VisTex images are of similar scenes; it is possible to find pairs of textures which are difficult to distinguish visually. This allows a more challenging set of texture problems to be compiled; the results in Section 2.2 show that the texture problems drawn from the VisTex images are more difficult than those drawn from the Brodatz images.

A further two test suites, Grass and Material, are domain specific. The Grass and Material suites contain various images collected specifically for the MeasTex framework. The images in the Grass suite were obtained from plots at the Brisbane Botanical Gardens and the University of Queensland campus. The Botanical Gardens images are labelled with their botanical and common names. This suite represents a challenging application in classifying like textures. The Material suite contains images of a wide range of building and landscaping materials including many grades of rock and many surfaces.

Each test suite contains several classification problems. The majority of the test problems distinguish between two classes which can be distinguished regardless of rotation and scale. These test problems vary considerably in the similarity of the textures being compared. The remainder of the test problems have three or more classes, or distinguish between textures which are identical except for rotation or scale. They are included to measure an algorithm's accuracy under varying conditions. For example, the Brodatz test suite contains a problem comparing pressed cork, beach sand, and pigskin (D4, D29, and D92) and a problem comparing straw and its 90° rotation (D15 and D15 rotated), among others.

Each texture in the database is defined by a single (256×256 pixels or 512×512 pixels) image of that texture. From each full image, 64 non-overlapping subimages are extracted; 32 images each for estimation and validation. A test problem is specified by a list of labelled estimation subimages for each class, and a list of unlabelled validation subimages for each class.
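For illustration, a sketch of the subimage extraction this implies, assuming the 64 subimages tile the full image in a regular 8×8 grid. The tiling order and the assignment of tiles to estimation and validation sets are assumptions here, not MeasTex's documented scheme.

```python
import numpy as np

def extract_subimages(image, n=64):
    """Tile a square texture image into n non-overlapping square
    subimages (a minimal sketch; the raster tiling order and the
    estimation/validation split below are assumptions)."""
    side = int(np.sqrt(n))                 # an 8 x 8 grid for n = 64
    h, w = image.shape
    sh, sw = h // side, w // side          # e.g. 64x64 tiles from 512x512
    tiles = [image[r*sh:(r+1)*sh, c*sw:(c+1)*sw]
             for r in range(side) for c in range(side)]
    # Hypothetical split: first 32 tiles for estimation, rest for validation.
    return tiles[:n // 2], tiles[n // 2:]

estimation, validation = extract_subimages(np.zeros((512, 512)))
```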

2.2. Example

To demonstrate the usefulness of the framework's quantitative measure, we implemented a multivariate Gaussian classifier (Gonzalez and Woods, 1992) with Gabor Energy features (Turner, 1986; Fogel and Sagi, 1989). The effect of feature set dimension on performance is investigated as the orientation resolution of the features is varied.
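The paper does not detail the classifier implementation; the sketch below is one conventional reading of a multivariate Gaussian classifier: fit one Gaussian per class on the estimation features, and classify by normalised class likelihoods. Equal class priors are assumed, which matches the equal-sized estimation sets; the paper's exact estimator (e.g. any covariance regularisation) is not specified.

```python
import numpy as np

class GaussianClassifier:
    """One multivariate Gaussian per class with Bayes' rule on top.
    A sketch under stated assumptions, not the authors' exact code."""

    def fit(self, features_by_class):
        # features_by_class: {class_label: (n_samples, n_features) array}
        self.params = {}
        for c, X in features_by_class.items():
            mu = X.mean(axis=0)
            cov = np.cov(X, rowvar=False)
            self.params[c] = (mu, np.linalg.inv(cov),
                              np.linalg.slogdet(cov)[1])

    def predict_proba(self, x):
        # Log-density of x under each class Gaussian (equal priors),
        # normalised into a per-class probability vector.
        logp = {}
        for c, (mu, icov, logdet) in self.params.items():
            d = x - mu
            logp[c] = -0.5 * (d @ icov @ d + logdet)
        m = max(logp.values())
        p = {c: np.exp(v - m) for c, v in logp.items()}
        z = sum(p.values())
        return {c: v / z for c, v in p.items()}
```

The predict_proba output is exactly the kind of per-class probability vector that the confidence-based measure of Section 3 rewards classifiers for writing out.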

This example does not compare the performance of competing algorithms – which might be considered the main purpose of the framework. This decision is deliberate: this paper is intended to describe the framework without bias towards or against particular algorithms. 3 Also, this example highlights the peaking phenomenon (Jain and Chandrasekaran, 1982), which we believe is an important and little discussed consideration in texture analysis algorithms.

3 Readers who wish to see comparisons of well-known algorithms are referred to the MeasTex web site.

The Gabor Energy is the sum, over the phases, of the squares of the convolution of the image with a Gabor mask. A Gabor mask consists of a Gaussian windowed sinusoidal waveform with a predetermined orientation, wavelength and phase shift. Phase shifts of 0° and 90° are used. A separate energy feature is thus constructed for every combination of wavelength and orientation. 4

4 The source code for this algorithm is available at the MeasTex web site.

In this experiment, we consider the wavelengths 2, 4 and 8 pixels with different numbers of mask orientations equally spaced in the range 0° to 180°. For example, the simulations with three angles used Gabor masks oriented at 0°, 60° and 120° from the vertical. The masks are 17×17 pixels. The Gaussian window is centred on each mask with a standard deviation proportional to the wavelength of the sinusoidal waveform.
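A sketch of these features follows, under stated assumptions: the ratio of the Gaussian standard deviation to the wavelength (here 0.5) and the pooling of the per-pixel energy image into a single scalar feature (here the mean) are not specified in the paper and are illustrative choices only.

```python
import numpy as np
from scipy.signal import fftconvolve

def gabor_mask(wavelength, angle, phase, size=17, sigma_ratio=0.5):
    """Gaussian-windowed sinusoid. The 17x17 size and sigma proportional
    to wavelength follow the paper; the ratio 0.5 is an assumption."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    rot = x * np.cos(angle) + y * np.sin(angle)   # coordinate along the wave
    sigma = sigma_ratio * wavelength
    window = np.exp(-(x**2 + y**2) / (2 * sigma**2))
    return window * np.cos(2 * np.pi * rot / wavelength + phase)

def gabor_energy_features(image, wavelengths=(2, 4, 8), n_orient=4):
    """One energy feature per (wavelength, orientation): sum over the two
    phases (0 and 90 degrees) of the squared convolution responses,
    pooled here by taking the mean over pixels (an assumption)."""
    feats = []
    for lam in wavelengths:
        for k in range(n_orient):
            theta = k * np.pi / n_orient    # equally spaced in [0, 180)
            energy = sum(
                (fftconvolve(image, gabor_mask(lam, theta, ph),
                             mode='valid') ** 2).mean()
                for ph in (0.0, np.pi / 2))
            feats.append(energy)
    return np.array(feats)
```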

This experiment has been designed to demonstrate the Curse of Dimensionality, or peaking phenomenon. There is some redundancy in the features measured by the Gabor Energy method described above. For example, one would expect the features corresponding to angles only 30° apart to be strongly correlated; such angles are measured by the simulation with six orientations. Given a fixed number of samples, increasing the number of features measured for each sample has two conflicting effects:
• The additional features provide more information. This improves classification accuracy.
• The sample of instances is less representative of the true distribution. This degrades classification accuracy on unseen instances.
Typically, as the number of features measured increases, classification accuracy on unseen instances rises to a peak, then degrades. This is known as the peaking phenomenon. This phenomenon is reviewed by Jain and Chandrasekaran (1982).

Fig. 1(a) shows the average percentage correct classification scores across all tests in each test suite. The number of equally spaced orientations has been varied from 2 to 8 whilst maintaining all other parameters. The general trend is an initial improvement in performance as the number of orientations increases. However, after a point, increasing the number of orientations has a detrimental effect. Maximum performance is achieved for these suites at between 3 and 5 orientations. Although performance for the Material suite drops throughout the entire range of orientations, the decrease is gradual at first and approaches the same rate as the other test suites at the higher numbers of orientations.

[Fig. 1. (a) Average percentage correct classification and (b) confidence-based classification scores for the Gabor Energy algorithm with 2 to 8 orientations.]

These results are consistent with the peaking phenomenon, and demonstrate that the MeasTex framework gives consistent results over a range of test suites, even where the absolute variations in accuracy are not large.

3. Performance measure

In Section 2, we measured the performance of an algorithm using percentage correct classification. This measure only requires the texture classification algorithm to nominate a class for each validation image. As such, the scores cannot reflect the confidence of the algorithm in making that choice.

Consider an algorithm classifying a validation image in a two class problem. The classifier might compute that the image is 90% likely to be of class A, or might compute that it is 60% likely to be of class A. In either case, if the classifier is required to just nominate a single class, it should nominate class A. However, we believe that the confidence of the classifier is important information. In particular, if a classifier is 90% confident, but wrong, it should not score as well as a classifier that is 60% confident but wrong.

Many statistical classifiers are able to provide further information than just the chosen class. They often internally compute probabilities for each class; that is, the probability that the validation image belongs to each of the classes. The relative probabilities reflect the confidence of the classifier's decision. Of course, a classifier will still give a value of 1.0 for the chosen class and 0.0 for all other classes if it is completely confident of its choice.

We propose a new measure of accuracy which awards a classifier the highest expected score (in the statistical sense) if it writes out the probabilities it computes for each class, rather than just nominating the most likely class.

Let $\vec{x}_i$ be the output of the classifier for the $i$th validation image of a problem suite and $x_{i,c}$ be the element of $\vec{x}_i$ corresponding to class $c$. Likewise, let $\vec{p}_i$ and $p_{i,c}$ be the algorithm's computed probability vector and its $c$th element. Let $x'_i$ be the element associated with the true class of the $i$th validation image. We will first describe how a score is computed for each validation image, and then describe how that score is aggregated to form per-test-problem and per-test-suite scores.

The new measure of a classifier's performance on a single validation image is given by

$$\frac{x'_i}{\|\vec{x}_i\|},$$

where $\|\cdot\|$ denotes the L2 norm. Given the algorithm's computed probabilities, this measure will have an expected value of

$$\frac{\vec{p}_i \cdot \vec{x}_i}{\|\vec{x}_i\|}.$$

The expected value for this measure is maximised when $\vec{p}_i$ and $\vec{x}_i$ are collinear; that is, when the $x_{i,c}$ correspond to the algorithm's computed probabilities. Thus, when an algorithm is not completely confident of its classification, its expected score (in the statistical sense) is larger if it writes out the probability associated with each class, rather than if it writes out only the most likely class. If an algorithm is completely confident of its classification, it will write out a probability of 1 for the most likely class and 0 for all other classes. In this case, the percentage correct and confidence-based scoring methods give the same result.

A score for a complete test problem is obtained by averaging the measure over each validation image as

$$\frac{1}{|C|} \sum_{c \in C} \sum_{i \in I_c} \frac{x'_i}{N_c \, \|\vec{x}_i\|},$$

where $C$ is the set of classes, $I_c$ is the set of validation images of class $c$, and $N_c$ is the number of validation images in class $c$.

The summary score for a suite is computed by averaging an algorithm's scores on all problems in the suite.
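A small numerical sketch of the per-problem score follows, computed directly from these definitions. The equal weighting of classes (the $1/|C|$ factor) follows the reconstruction of the averaging formula above.

```python
import numpy as np

def problem_score(outputs, true_classes):
    """Confidence-based score for one test problem.
    outputs[i] is the classifier's written-out vector x_i (one entry per
    class); true_classes[i] is the index of the true class of image i."""
    classes = sorted(set(true_classes))
    per_class = []
    for c in classes:
        idx = [i for i, t in enumerate(true_classes) if t == c]
        # Average of x'_i / ||x_i|| over the N_c validation images of class c.
        per_class.append(np.mean(
            [outputs[i][c] / np.linalg.norm(outputs[i]) for i in idx]))
    return np.mean(per_class)   # equal weight to each class

# A fully confident, correct classifier scores 1.0:
print(problem_score([np.array([1.0, 0.0]), np.array([0.0, 1.0])], [0, 1]))
```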

Fig. 1(b) graphs results for the same experiments as shown in Fig. 1(a), but using the confidence-based scores. Qualitatively, the two graphs are similar. Both show the effect of the peaking phenomenon. However, the confidence-based scores are, in general, higher than the corresponding percentage correct scores. This achieves our goal of encouraging classifiers to indicate their confidence in their classification.

4. Discussion

In this paper, we have described a framework for the comparison of texture classification algorithms. We have compiled several suites of texture problems, described a standard functionality for texture classification algorithms, presented a metric for quantitative scoring of performance, and demonstrated that the framework is useful for comparing variations of a texture classification algorithm. The web site also compares algorithms from disparate paradigms: Grey-level Co-occurrence Matrices, Gabor Energy, Gauss Markov Random Fields, and Fractal Dimension.

There is no widely accepted definition of image texture. Therefore, it is difficult to compare texture analysis algorithms on theoretical grounds. Conners and Harlow (1980) do make a theoretical comparison; they demonstrate that particular algorithms encode supersets or subsets of the information encoded by other algorithms. This is not a generally applicable technique. In a mature field such as modern texture analysis, two successful algorithms will have complementary strengths; neither will encode a superset of the information encoded by the other.

Consequently, comparisons between texture analysis algorithms must be made empirically. Empirical comparisons between algorithms must be based on a common set of texture problems. There are several issues involved in selecting texture problems. The most fundamental issue is whether to use real-world textures or synthetic textures. Real-world textures have one unarguable advantage: ultimately, vision systems must operate in a real-world environment, and so should be developed, so far as possible, on data with real-world properties. On the other hand, synthetic textures offer some advantages: an abundant supply of estimation and validation data can be generated, and the degree of difficulty of the texture problems can be precisely controlled. Synthetic textures also have a significant disadvantage: the algorithm for generating the texture may introduce artifacts which favour particular algorithms in ways which will not be reflected in real-world textures.

We favour the use of real-world textures. The problems of synthetic textures are inherent. The problems of real-world textures, such as a limited supply of data and little control over the difficulty of the texture problems, could be addressed by a thorough data collection effort. We have compiled four suites of texture problems based on real-world textures. Allowing for the greater difficulty of the VisTex and Material problems, results on the four suites are qualitatively similar. The success of these suites indicates that a sufficiently thorough data collection effort will not be impractically expensive.

There are many issues in texture classification which might be investigated with a quantitative benchmark. Such issues include the effect of normalising the first order intensity information in images, the effect of the number of estimation images available to the classifier, and the impact of boundary effects in small images.

The MeasTex framework, with the test suites provided at the web site, is a quantitative benchmark designed for general comparison of texture classification algorithms. However, the framework is modular and additional test suites can be incorporated without modification of the current structure. It has been our goal that researchers can develop test suites to investigate issues of particular interest, and that these test suites can be made available on the Internet.

In summary, there are several major paradigms of texture features described in the literature, but no widely accepted framework for comparing them. This letter describes a framework for the comparison of texture classification algorithms. We hope that this framework, possibly with additional suites of texture problems, will provide a broadly applicable technique for the replicable, quantitative comparison of texture classification algorithms.

References

Brodatz, P., 1966. Textures: A Photographic Album for Artists and Designers. Dover Publications, New York.
Conners, R., Harlow, C., 1980. A theoretical comparison of texture algorithms. IEEE Trans. Pattern Anal. Machine Intell. 2, 204–222.
Du Buf, J., Kardan, M., Spann, M., 1990. Texture feature performance for image segmentation. Pattern Recognition 23, 291–309.
Fogel, I., Sagi, D., 1989. Gabor filters as texture discriminator. J. Biological Cybernet. 61, 103–113.
Gonzalez, R., Woods, R. (Eds.), 1992. Digital Image Processing. Addison-Wesley, Reading, MA.
Haralick, R., 1994. Performance characterization in computer vision. CVGIP: Image Understanding 60, 245–249.
Hsiao, J., Sawchuk, A., 1989. Unsupervised textured image segmentation using feature smoothing and probabilistic relaxation techniques. Comput. Vision Graphics Image Process. 48, 1–21.
Jain, A., Chandrasekaran, B., 1982. Dimensionality and Sample Size Considerations in Pattern Recognition Practice, Vol. 2. North-Holland, Amsterdam, Chapter 39, pp. 835–855.
Jain, A., Karu, K., 1996. Learning texture discrimination masks. IEEE Trans. Pattern Anal. Machine Intell. 18, 195–205.
Manjunath, B., Chellappa, R., 1993. A unified approach to boundary perception: Edges, textures and illusory contours. IEEE Trans. Neural Networks 4 (1), 96–108.
Manjunath, B., Simchony, T., Chellappa, R., 1990. Stochastic and deterministic networks for texture segmentation. IEEE Trans. Acoust. Speech Signal Process. 38 (6), 1039–1049.
Ohanian, P., Dubes, R., 1992. Performance evaluation for four classes of textural features. Pattern Recognition 25, 819–833.
Ojala, T., Pietikainen, M., Harwood, D., 1996. A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 51–59.
Picard, R., Kabir, T., Liu, F., 1993. Real-time recognition with the entire Brodatz texture database. In: Proc. IEEE Conf. Computer Vision and Pattern Recognition, pp. 638–639.
Turner, M., 1986. Texture discrimination by Gabor functions. J. Biological Cybernet. 55, 71–82.
Van Gool, L., Dewaele, P., Oosterlinck, A., 1985. Texture analysis anno 1983. Comput. Vision Graphics Image Process. 29, 336–357.
Van Hulle, M., Tollenaere, T., 1993. A modular artificial neural network for texture processing. Neural Networks 6 (1), 7–32.
VisTex, 1995. VisTex Texture Database. Maintained by the Vision and Modeling group at the MIT Media Laboratory. http://www-white.media.mit.edu/vismod/imagery/VisionTexture/vistex.html.
Weszka, J., Dyer, C., Rosenfeld, A., 1976. A comparative study of texture measures for terrain classification. IEEE Trans. Systems Man Cybernet. 6, 269–285.