
Combinational Feature Optimization for Classification of Lung Tissue Images

Ravi K. Samala (a), Tatyana Zhukov (b), Jianying Zhang (c), Melvyn Tockman (b), Wei Qian* (a)

(a) Department of Electrical & Computer Engineering, (c) Department of Biological Sciences, University of Texas, 500 West University Avenue, El Paso, TX 79968; (b) H. Lee Moffitt Cancer Center and Research Institute, 12902 Magnolia Drive, Tampa, FL 33612

ABSTRACT

A novel approach to feature optimization for classification of lung carcinoma using tissue images is presented. The methodology combines three characteristics of computational features: the F-measure, which represents each feature's contribution towards classification; the inter-correlation between features; and pathology based information. The metadata provided by pathological parameters is used for mapping between computational features and biological information. Multiple regression analysis maps each category of features based on how pathology information is correlated with the size and location of the cancer. The computational features represented the tumor size relatively better than the location of the cancer. Based on the three criteria associated with the features, three feature subsets, each validated individually, are evaluated to select the optimum feature subset. Based on the results from the three stages, the knowledgebase produces the best subset of features. An improvement of 5.5% was observed for normal Vs all abnormal cases, with an Az value of 0.731 and 74/114 correctly classified. The best Az value of 0.804, with 66/84 correctly classified and an improvement of 21.6%, was observed for normal Vs adenocarcinoma.

Keywords: feature optimization, feature selection, lung tissue, classification, correlation, neural network, multiple regression analysis, F-measure

1. INTRODUCTION

The American Cancer Society estimated that in 2009 there would be 219,440 new cases and 159,390 deaths [1] associated with cancer of the lung and bronchus. Lung cancer is considered to be the most deadly cancer, accounting for 15% of new cases and 28% of cancer deaths. Three types of cancer are evaluated here: squamous cell carcinoma (SQC), adenocarcinoma (ADC) and bronchioalveolar carcinoma (BAC). The suggested feature optimization is an extension of previous work from our group [2-5], in which an end-to-end process for identification of cancer from lung tissue images was developed with an area under the receiver operating characteristics (AROC) value of 0.61 for normal vs. all three types of cancer. Fifteen features were used, based on the statistical values of the segmented nucleus and cytoplasm. The objective is to improve the performance of the end-to-end automated process using feature optimization.

Fig.1 Block diagram of the proposed method

*[email protected]; phone 1 915 747-8090; fax 1 915 747-7871; engineering.utep.edu/imaginginformatics

Medical Imaging 2010: Computer-Aided Diagnosis, edited by Nico Karssemeijer, Ronald M. Summers, Proc. of SPIE Vol. 7624, 76240Z · © 2010 SPIE · CCC code: 1605-7422/10/$18 · doi: 10.1117/12.844509

Proc. of SPIE Vol. 7624 76240Z-1

Downloaded from SPIE Digital Library on 18 Mar 2010 to 129.108.41.95. Terms of Use: http://spiedl.org/terms

Fig.2 Lung tissue images of (left) normal and (right) adenocarcinoma located at right upper lobe

The suggested feature optimization methodology uses a rather indirect approach to evaluate the weight of each feature from a collective combination of classification, correlation and pathology perspectives. The final constraint is to ensure that the learning algorithm performs adequately without over-generalizing or under-generalizing the high dimensional mapping between the predictor and the criterion variables. We previously implemented a similar methodology on thin section thoracic computed tomography imaging [3,4]. Briefly, the feature optimization is achieved by considering combinations of the F-measure (class representation) of each feature, the inter-correlation between features and the broad category into which each feature is classified.

Fig.3 Histogram of Nodule locations

Fig.4 Histogram of cancer stages (LUL - Left Upper Lobe, LLL - Left Lower Lobe, RUL - Right Upper Lobe, RLL - Right Lower Lobe, RML - Right Middle Lobe)

2. MATERIALS AND METHODS

The biopsy samples used in this study were provided by the Cancer Prevention and Control Division, H. Lee Moffitt Cancer Center. The dataset consists of 114 samples from 41 patients, with 58 normal, 26 ADC, 15 SQC and 15 BAC cases. The mean age of the sample population is 65.85 with a mean tumor size of 4.09mm. All five lobes of the lung and all cancer stages except IIA are considered here. The histograms of the nodule locations, cancer stages and tumor size are shown in Fig.3, Fig.4 and Fig.5 respectively. Metadata is the domain knowledge or pathology based information, which in this case is the category into which each feature is classified. There are four categories with a total of 15 features, as shown in Table 1. The main focus is to select an optimum feature set that results in a higher AROC value and also generalizes sufficiently to adjust to minor variations.

Table 1. Cellular features and categories

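| Category | Metric | Feature |
|----------|--------|---------|
| 1 | Cell size | 1 Average (Avg) Cytoplasm Area |
| 1 | Cell size | 2 Avg cell size |
| 1 | Cell size | 3 Avg Nucleus Area |
| 1 | Cell size | 4 Standard Deviation (SD) Nucleus Area |
| 2 | Nucleocytoplasmic Ratio | 5 Nucleocytoplasmic Ratio |
| 3 | Nuclear texture | 6 Avg Nucleus Avg Intensity |
| 3 | Nuclear texture | 7 SD of Metric 6 |
| 3 | Nuclear texture | 8 Avg Nucleus Intensity Distance |
| 3 | Nuclear texture | 9 SD of Metric 8 |
| 3 | Nuclear texture | 10 Avg Nucleus Intensity (hyperchromasia) |
| 3 | Nuclear texture | 11 SD Nucleus Intensity (texture) |
| 4 | Nuclear shape | 12 Avg Nucleus width |
| 4 | Nuclear shape | 13 SD of Metric 12 |
| 4 | Nuclear shape | 14 Avg Nucleus width Distance |
| 4 | Nuclear shape | 15 SD of Metric 14 |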

2.1 Multiple Regression Analysis

The multiple regression analysis gives the information content of the feature subsets [6] based on the four categories. Analyzing the mappings between each subset of features and the pathology based features is advantageous in two ways: firstly, it cross validates whether the computational features can collectively represent the pathology features; secondly, it generates standardized constraints on how to select the feature set. Two important features considered by pathologists for decision making are tumor size and location. The collective mapping between the computational features and the pathologist's features gives an insight into how the tumor diagnosis is achieved.
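As an illustration of this kind of mapping, the following is a minimal sketch that scores how well each feature category explains a numeric pathology variable (for example tumor size) using the coefficient of determination of an ordinary least-squares fit. The function names, the column layout and the use of NumPy are assumptions made for the example; the paper does not specify its regression software or how a categorical variable such as location was encoded.

```python
import numpy as np

# Hypothetical column layout: column k holds feature k+1 of Table 1.
CATEGORIES = {
    "cell size": [0, 1, 2, 3],
    "nucleocytoplasmic ratio": [4],
    "nuclear texture": [5, 6, 7, 8, 9, 10],
    "nuclear shape": [11, 12, 13, 14],
}


def r_squared(features, target):
    """Coefficient of determination of an ordinary least-squares fit."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(target, dtype=float)
    # Add an intercept column so the fit is not forced through the origin.
    X1 = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
    residual = y - X1 @ coef
    ss_res = float(residual @ residual)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def r_squared_by_category(feature_matrix, target):
    """R^2 of the target regressed on each category, and on all features."""
    X = np.asarray(feature_matrix, dtype=float)
    scores = {name: r_squared(X[:, cols], target)
              for name, cols in CATEGORIES.items()}
    scores["all features"] = r_squared(X, target)
    return scores
```

Comparing the per-category scores against the score for all features gives the kind of summary plotted in Fig.6; a low value for every category indicates that the cellular features explain little of the pathology variable.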


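Fig.5 Histogram of the tumor size (x-axis: tumor size; y-axis: number of cases)

Fig.6 Multiple regression analysis for mapping between computational features and pathology features (x-axis: R2 for tumor size and location; y-axis: all features and Metrics 1 to 4)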

2.2 Classification

A three-layer backpropagation neural network is used for F-measure evaluation of individual features as well as for validation of the proposed method. For the evaluation of the F-measure, only one hidden node is used for every feature, since it was observed that in general one node is enough to capture the nonlinearity of the given feature. For the final validation, the number of hidden nodes is varied from 1 to the maximum number of features and the value giving the best ROC Az is chosen. The number of iterations is fixed at 500 because, in general, the error per iteration stabilized before 500 iterations. A k-fold cross validation with k = 3 (a split of 66.66% + 33.33%) is considered ideal for the current dataset.
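The two evaluation ingredients named above can be sketched as follows. This is a minimal illustration, assuming that the F-measure is the usual harmonic mean of precision and recall computed per class (the paper does not give its formula) and that the 3-fold split is a plain interleaved partition without shuffling or stratification. The function names are hypothetical.

```python
def f_measure(y_true, y_pred, positive):
    """F-measure (harmonic mean of precision and recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def k_fold_indices(n_samples, k=3):
    """Yield (train, test) index lists; each fold holds out about 1/k of the data."""
    indices = list(range(n_samples))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test
```

With 114 samples and k = 3, each fold trains on 76 samples and tests on 38, which corresponds to the 66.66% + 33.33% split. Calling `f_measure` once with the normal label and once with the abnormal label gives the pair of values reported per feature subset.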

2.3 Combinational Knowledgebase

The feature set can be divided based on the broad categories: size, nucleocytoplasmic ratio, texture and shape. It can be further sub-grouped based on the inter-correlation between the features. Feature optimization is divided into three stages, based on F-measure, broad classification and inter-correlation, resulting in three optimum subsets of features. Validation is done at each stage to calculate the accuracy of the computer aided diagnostic (CAD) system. The focus of stage I is maximizing the F-measure of each class, which indirectly increases the area under the ROC curve (Az) [7]. Stage II checks whether all the feature categories are represented in the feature subset from Stage I. Observing the same data from different statistical points of view can reveal hidden information; thus all the views have to be represented in the optimum feature set. The multiple regression analysis indicates that the tumor size is represented by all the different categories of the feature set, hence every category was taken into consideration. Stage III deals with the inter-correlation analysis of the original feature set against the feature set from Stage II.
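To make the notion of inter-correlation concrete, the sketch below lists feature pairs whose absolute Pearson correlation exceeds a threshold. This is a simpler stand-in for illustration only: the paper uses principal component analysis for this purpose, and both the threshold of 0.9 and the function name are assumptions.

```python
import numpy as np


def correlated_pairs(feature_matrix, threshold=0.9):
    """Return (i, j, r) for column pairs with |Pearson r| >= threshold."""
    X = np.asarray(feature_matrix, dtype=float)
    corr = np.corrcoef(X, rowvar=False)  # rows are samples, columns are features
    n_features = corr.shape[0]
    pairs = []
    for i in range(n_features):
        for j in range(i + 1, n_features):
            if abs(corr[i, j]) >= threshold:
                pairs.append((i, j, float(corr[i, j])))
    return pairs
```

Features that appear together in such a pair carry largely redundant information, so only one of them needs to be considered when extending a feature subset.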

The methodology is explained in detail in the following steps:

(a) Obtain the F-measure of each feature individually using one hidden unit.
(b) Use principal component analysis (PCA) to identify the highly correlated features.
(c) Sort the F-measure values according to the less represented class (in all the cases, the abnormal class is the less represented one).
(d) Select the feature with the highest F-measure as the first optimum feature by default.

The combinational knowledgebase consists of applying the following constraints to the given features, which yields the three stages:


I. Loop through the sorted list, subtracting the (j-1)th value from the jth value, and accept a feature only if the criterion below is satisfied. This process selects the higher F values of the less represented class without compromising the more represented class.

$$\left(F_{sl}^{j} - F_{sl}^{j-1}\right) > \left(F_{su}^{j} - F_{su}^{j-1}\right)$$

where $F_{sl}$ and $F_{su}$ are the F-measures of the sorted lower and higher represented classes respectively.

II. For the stage I set, make sure that there is at least one feature from each metric; if a metric is missing, add one feature from it. This step ensures that all the broad feature categories are included in the optimum set, which is intended to increase the generalization capability for newer datasets.

III. Based on the correlation analysis, add to the stage II set the non-redundant features from PCA that are not already present in the optimum set, using this criterion: select the feature with high $F_{sl}$ and low $F_{su}$ values.
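Stages I and II can be sketched in a few lines. The sketch reads the stage I criterion as accepting the j-th sorted feature when the change in the F-measure of the less represented class exceeds the change for the more represented class, i.e. (F_sl[j] - F_sl[j-1]) > (F_su[j] - F_su[j-1]); this reading of the equation is an assumption. The rule used in stage II to choose which feature to add from a missing category (the one with the highest F-measure for the less represented class) is also an assumption, since the text only says that one feature is added. Stage III is not covered.

```python
# Feature numbers per category, as listed in Table 1.
CATEGORY_FEATURES = {
    1: [1, 2, 3, 4],
    2: [5],
    3: [6, 7, 8, 9, 10, 11],
    4: [12, 13, 14, 15],
}


def stage_one(f_lower, f_upper):
    """Select features by comparing consecutive F-measure differences.

    f_lower / f_upper map feature number -> F-measure of the less / more
    represented class.
    """
    # Sort features by the F-measure of the less represented class, highest first.
    order = sorted(f_lower, key=f_lower.get, reverse=True)
    selected = [order[0]]  # the top-ranked feature is accepted by default
    for prev, cur in zip(order, order[1:]):
        delta_lower = f_lower[cur] - f_lower[prev]
        delta_upper = f_upper[cur] - f_upper[prev]
        if delta_lower > delta_upper:
            selected.append(cur)
    return selected


def stage_two(selected, f_lower):
    """Ensure every category of Table 1 contributes at least one feature."""
    result = list(selected)
    for features in CATEGORY_FEATURES.values():
        if not any(f in result for f in features):
            # Assumption: add the category's feature with the highest F_sl.
            result.append(max(features, key=lambda f: f_lower[f]))
    return result
```

Both functions take plain dictionaries, so the F-measures obtained from the single-hidden-node networks can be passed in directly, and the output of `stage_one` feeds `stage_two`.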

Table 2. Feature optimization for Normal Vs ADC case

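| Stage | Feature in Metrics (Ref. Table 1) | Number of hidden nodes | F-measure (Normal, Abnormal) | ROC Az |
|-------|-----------------------------------|------------------------|------------------------------|--------|
| All features | [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15] | 8 | (0.82, 0.522) | 0.661 |
| I | [1,2,3,10,13,14,15] | 5 | (0.797, 0.52) | 0.761 |
| II | [1,2,3,10,13,14,15,5] | 3 | (0.847, 0.64) | 0.759 |
| III | [1,2,3,10,13,14,15,5,7,8,9] | 8 | (0.847, 0.64) | 0.804 |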

3. RESULTS

Feature optimization was performed in three sequential stages, resulting in three different sets of features with individual validation results. Five combinations of training and testing data, normal Vs (all, ADC, BAC, ADC+BAC, SQC), were used. Table 2 gives an example of how features are selected at the three stages for the normal Vs ADC case. Before feature optimization, all 15 features are used. From stage I to stage III, the number of features increases from 7 to 11. Each stage resulted in a higher Az than the original feature set.

Fig.7 Histogram of number of times each feature is selected in the 5 test cases

From Fig.7, it can be observed that no single feature was used in all five experimental setups. Average nucleus pixel intensity was never selected. The nucleocytoplasmic ratio was always added at stage II and never selected at stage I. For the Normal Vs All and Normal Vs ADC cases, the optimum set is obtained at Stage III. For these two cases the ratio of normal to abnormal samples is 58:56 and 58:26 respectively. For Normal Vs BAC and Normal Vs (ADC+BAC), Stage I is considered the best. For the Normal Vs SQC case, the classification accuracy is the worst relative to the other cases, which can be attributed to the poor representation of each class. Fig.8 gives the best ROC curves for all the experimental setups. Fig.9 compares the ROC curves of the optimized and original feature sets for the normal Vs (All and ADC) cases.
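For readers who want to reproduce an area-under-the-curve figure from classifier outputs, the sketch below computes the empirical area under the ROC curve as the probability that a randomly chosen abnormal sample scores higher than a randomly chosen normal one (ties count one half). This is an assumption about the estimator: the paper does not state how Az was computed, and Az often denotes the area under a fitted binormal curve, which would give slightly different values. The function name is hypothetical.

```python
def empirical_auc(scores_abnormal, scores_normal):
    """Empirical area under the ROC curve from two lists of classifier scores."""
    wins = 0.0
    for a in scores_abnormal:
        for n in scores_normal:
            if a > n:
                wins += 1.0   # abnormal sample ranked above normal sample
            elif a == n:
                wins += 0.5   # ties contribute half a win
    return wins / (len(scores_abnormal) * len(scores_normal))
```

A value of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation of the two classes.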


Fig.8 Best Az ROC curves after feature optimization (curves: Normal Vs All, Normal Vs ADC, Normal Vs BAC, Normal Vs ADC+BAC, Normal Vs SQC)

Table 3. Best Az for different combination of Normal Vs Carcinoma cases after feature optimization. The best Az values are highlighted for different stages

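| Case | Stage 1 | Stage 2 | Stage 3 | Best Az (not optimized) | Best Az (optimized) |
|------|---------|---------|---------|-------------------------|---------------------|
| Normal Vs All | +3.75% | +3.46% | +5.48% | 0.693 | 0.731 |
| Normal Vs ADC | +15.13% | +14.8% | +21.6% | 0.661 | 0.804 |
| Normal Vs BAC | +3.86% | +2.7% | +3.86% | 0.698 | 0.725 |
| Normal Vs ADC+BAC | +4.9% | +3.65% | +1.4% | 0.711 | 0.746 |
| Normal Vs SQC | -13.8% | +1.7% | -4.65% | 0.58 | 0.59 |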
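The percentages in Table 3 appear to be relative changes in Az with respect to the non-optimized feature set. The short check below, written under that assumption, recomputes the best improvement of each case from the two Az columns; the dictionary and function names are introduced only for this example.

```python
# (Az not optimized, best Az optimized) per case, copied from Table 3.
BEST_AZ = {
    "Normal Vs All": (0.693, 0.731),
    "Normal Vs ADC": (0.661, 0.804),
    "Normal Vs BAC": (0.698, 0.725),
    "Normal Vs ADC+BAC": (0.711, 0.746),
    "Normal Vs SQC": (0.58, 0.59),
}


def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return 100.0 * (after - before) / before


improvements = {case: relative_improvement(before, after)
                for case, (before, after) in BEST_AZ.items()}
```

Under this reading, the recomputed values agree with the largest stage entry of each row to within rounding, for example (0.731 - 0.693) / 0.693 for Normal Vs All.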


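Fig.9 Best Az ROC curves after and before feature optimization (curves: Normal Vs All, Normal Vs ADC, Normal Vs All (not optimized), Normal Vs ADC (not optimized))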

4. DISCUSSION

Several methods have been implemented so far for feature selection, but they rely on the properties of the features and the impact of the features on the final classification rather than on a biological or domain perspective. Multiple regression analysis (Fig.6) between computational features and pathological features resulted in low R2 coefficients, which can be explained by the low-level mapping from the cellular level to the anatomical level. Tumor size is relatively better represented than the location of the tumor. This suggests that more features representing the five lobes of the lung have to be included to improve the classification accuracy.

Our feature optimization method primarily uses the F-measure, which quantifies the effect a feature has on the final accuracy of the classification methodology. Inter-correlation between features and pathological information are also used to obtain the stage II and stage III feature subsets. Table 3 shows that, in every case examined, at least one of the three stages suggested here improved on the original feature set. Overall, the SQC class was under-represented by all the features, hence inclusion of more features beyond the 15 mentioned is recommended. The approach also indicates whether the computational features in use are sufficiently reflective of the pathological features, the relative representation of each category of features, and the presence of noise. The deviation of the results for the SQC and BAC cases is attributed to the small dataset.

The method described can be used for any computerized medical diagnostic process because it is independent of the type of data and of the classification algorithm used. Hence it can be easily customized for other classification methods such as support vector machines and linear discriminant analysis. The generated model can provide a common platform for pathologists and engineers to better communicate the manner in which pathologists and computerized algorithms reach the final diagnosis.

Finally, we conclude that feature selection based on the combination of F-measure, correlation and pathology results in an optimum set. The analysis generated from the suggested model thus yields an optimum set of features and also recommends steps to further increase the sensitivity and specificity. Future work will use a larger database and evaluate the bias a learning algorithm has on the accuracy.


REFERENCES

1. American Cancer Society, Cancer Facts and Figures 2009. Atlanta, American Cancer Society, 2009.

2. Land Jr., W.H., McKee, D.W., Zhukov, T., Song, D. and Qian, W., "A kernelised fuzzy-Support Vector Machine CAD system for the diagnosis of lung cancer from tissue images", Int. J. Functional Informatics and Personalised Medicine, 1(1), 26-52 (2008).

3. Samala, R., Moreno, W., You, Y., Qian W., "A Novel Approach to Nodule Feature Optimization on Thin Section Thoracic CT", Acad Radiol, 16(4), 418-427 (2009).

4. Samala, R., Moreno, W., Song, D., You, Y., Qian W., "Knowledge based optimum feature selection for lung nodule diagnosis on thin section thoracic CT", Proc. SPIE, 7260, 726036 (2009).

5. Qian, W., Zhukov, T. A., Song, D. S., Tockman M. S., "Computerized analysis of cellular features and biomarkers for cytologic diagnosis of early lung cancer", Anal Quant Cytol Histol, 29(2), 103-111 (2007).

6. Raicu, D. S., Varutbangkul, E., Cisneros, J. G., Furst, J. D., Channin, D. S., Armato III, S. G., "Semantics and image content integration for pulmonary nodule interpretation in thoracic computed tomography", Proc. SPIE, 6512, 65120S (2007).

7. Liu, Z., Tan, M., Jiang, F., "Regularized F-Measure Maximization for Feature Selection and Classification", J. Biomed. Biotechnol., 2009(617946), 8 (2009).
