
SN Computer Science (2021) 2:67 https://doi.org/10.1007/s42979-021-00465-3


ORIGINAL RESEARCH

Stacking‑Based Ensemble Framework and Feature Selection Technique for the Detection of Breast Cancer

Vikas Chaurasia¹ · Saurabh Pal¹

Received: 29 April 2020 / Accepted: 9 January 2021 / Published online: 2 February 2021 © The Author(s), under exclusive licence to Springer Nature Singapore Pte Ltd. part of Springer Nature 2021

Abstract

Breast cancer is the second most common cancer in women worldwide. The uncontrolled growth of breast cells is called breast cancer. The treatment of human breast cancer is a very critical process, and sometimes certain indicators may produce negative results. To avoid such misleading outcomes, a reliable and accurate breast cancer diagnosis system must be available. Machine learning (ML) is a modern and accurate technique that researchers have recently applied to predict and diagnose breast cancer. In this research article, we developed stacking-based ensemble techniques and feature selection methods to assess the overall performance of the algorithm and to compare results on breast cancer datasets with reduced attributes and with all attributes. We first take four ML algorithms, SVM, k-nearest neighbors, Naive Bayes and perceptron, as sub-models that are trained and used to generate predictions, and then combine them into a new model called blending (stacking). Finally, logistic regression is used to make predictions for the stacked model. It is important that the sub-models produce different, uncorrelated predictions; the stacking technique works best when all the sub-models are skillfully combined. This article uses five feature selection techniques, because feature choice affects the overall performance of the model: unrelated or only moderately related features may adversely affect its behavior. After applying the feature selection methods, we have a dataset with reduced features as well as one with all features. We implemented logistic regression on the dataset with all features and on the dataset with reduced features. Finally, we see that the dataset with reduced features achieves improved accuracy.

Keywords Breast cancer · KNN · Perceptron · SVM · Naïve Bayes · Stacking · Machine learning · Feature selection

Introduction

Among women, breast cancer is the most common malignancy. Every year, 2.1 million women are affected by the illness, and it also causes the greatest number of cancer-related deaths among women. Early detection and screening are the best ways to overcome this disease [1]. Machine learning provides an effective technique for creating intricate, automated, and objective methods to analyze high-dimensional and multimodal clinical data. This article's discussion centers on some state-of-the-art advances. A focus of the article is the development of a more in-depth understanding and theoretical examination of major issues related to algorithm design and learning theory. These include trade-offs for improving generalization performance, the use of practically realistic constraints, and the incorporation of prior knowledge and uncertainty. This study describes the most recent developments in machine learning, with attention to supervised and unsupervised learning, which significantly affect disease detection and diagnosis. In this research article, stacking procedures are used to evaluate algorithms. Stacking is a simple strategy, similar to a bucket-of-models method, in which two levels of classification are used. The training dataset is partitioned into two parts, A and B.

This article is part of the topical collection “Computational Biology and Biomedical Informatics” guest-edited by Dhruba Kr Bhattacharyya, Sushmita Mitra and Jugal Kr Kalita.

Supplementary Information The online version contains supplementary material available at https://doi.org/10.1007/s42979-021-00465-3.

* Saurabh Pal [email protected]

Vikas Chaurasia [email protected]

1 Department of Computer Applications, VBS Purvanchal University, Jaunpur 222001, India


Split A is used to train the first-level classifiers that form the members of the ensemble, and split B is used by the second-level classifier to combine the outputs of the different ensemble members obtained in the earlier stage [2]. These two phases of stacking are explained below:

a. Take training split A and train k different ensemble members on it. These k ensemble components could be bagging, boosting, diverse decision trees, or heterogeneous classifiers.

b. Apply each of these k classifiers to subset B. Now generate a new set of k features, which are the outputs of these classifiers: each point in training subset B is transformed into the k-dimensional vector of predictions of the k classifiers. Subset B is then ready to train the next-stage classifier of the ensemble.

The final arrangement thus contains k first-level models and a combiner classifier at the next stage. For the test set of the disease dataset, the first-stage models are used to produce a new k-dimensional representation, and the prediction for each test case is made by the next stage of the classifier. In many implementations of stacking, the original features of subset B are kept alongside the k new features for the second-level classifier. Rather than forecasting the disease directly, this step highlights the features. The method can be combined with multi-way cross-validation to prevent the loss of training records between the first-stage and second-stage models. In this way, a new set of features is produced for each training data point by using n-1 folds to train the first-stage classifiers and the remaining fold to obtain the predictions. The second-stage classifier then represents all the training data points in the newly created dataset. Finally, the first-stage classifiers are trained again on the full training dataset to enable a stronger feature transformation. Because it takes the ability of each ensemble component into account through its combiner, stacking can reduce both variance and bias. Different ensemble strategies can be viewed as special cases of stacking in which a data-independent combining algorithm is used. The flexibility of learning in the stacking strategy makes it superior to other ensemble procedures among all combiner approaches [3].

Feature selection is another important phase of the data mining process. These procedures improve the dependability of data mining algorithms. In the analysis of a classification method, feature selection is the first step. In the prediction of class labels, features vary in significance in real data [4]; for example, the sex of a patient is less responsible for determining a disease prediction than the age. Features that have little or no relation to the class are very harmful to the accuracy of the classification model. The aim of feature selection strategies is the extraction of the features that are most valuable relative to the class label [5]. In this research article, five different techniques are used for feature selection: univariate feature selection with K best and χ², extra tree classifier, correlation matrix with heat map, recursive feature elimination and random forest. Different feature selection strategies yield different features, so we applied an inspection technique for choosing the final set of features.

The idea driving this strategy is to identify the features that are least significant and do not influence the overall accuracy on the dataset. Keeping these considerations in mind, we first apply the classifier to the whole dataset, and afterwards remove the less significant features by applying the feature selection process. Thus, the resulting dataset has around five features. We then applied the same procedure to the dataset with reduced features. Consequently, regarding accuracy, it is possible to conduct comparative examinations on the two kinds of datasets.

Related Work

With regard to disease analysis, much work has been done in predicting heart disease, Parkinson's disease, tuberculosis, and chronic kidney disease. Many researchers have also worked on determining whether a tumor is malignant or benign in the case of breast cancer. There are many breast cancer datasets available in different repositories; researchers have mostly taken the Wisconsin breast cancer dataset for analysis purposes.

The Multisurface Method Tree and logistic regression with tenfold cross-validation were applied to a breast cancer dataset by Wolberg et al. [6], who found accuracies of 96.2 and 97.5%, respectively.

An artificial neural network (ANN) based on the Pareto differential evolution algorithm performs better with local search; using this approach, Abbass [7] found an accuracy of approximately 98%. Three methods were applied to identify breast cancer by Mu et al. [8], with performance measured by tenfold cross-validation: self-organizing maps, radial basis function networks and support vector machines were applied to obtain an average accuracy of 98% in this research.

A breast cancer dataset was taken from Srinagarind Hospital for the period 1990–2001. Thongkam et al. [9] proposed the RELIEFF attribute technique for feature selection and applied an AdaBoost algorithm with CART as the base learner.


In an experiment conducted by Liu Ya-Qin et al. [10], the bagging and C5 algorithms were applied to a breast cancer dataset to find the 5-year survival rate. They performed tenfold stratified cross-validation and evaluated the model using the ROC curve, accuracy, specificity, sensitivity and AUC.

Association rules and a neural network were used by Karabatak et al. [11] on the Wisconsin breast cancer dataset for the detection of breast cancer. To reduce the feature dimension, they used association rules to eliminate unnecessary data, and a neural network to classify the records with the remaining features. The final accuracy was around 97%.

The highest accuracy, approximately 99.5%, was obtained by Akay [12], who used a feature selection method and an SVM model to analyze the Wisconsin diagnostic dataset and compared the results with other studies on it.

To improve accuracy on the breast cancer dataset, CaiLing Dong et al. [13] proposed a medical data mining application development model based on KDD Cup 2008 Task 1. Their article analyzes some possible solutions to this problem and derives a modified boosted tree as the final solution.

By applying a multilayer perceptron with an artificial metaplasticity algorithm to the breast cancer dataset, Marcano-Cedeño et al. [14] found the accuracy to be approximately 99%.

Using three datasets, Wisconsin Breast Cancer, Breast Cancer Diagnostic and Breast Cancer Prognostic, Salama et al. [15] used multiple classifiers: decision tree, Naive Bayes, Sequential Minimal Optimization, multilayer perceptron, and k-nearest neighbor. They applied these algorithms and found the highest accuracies of 97.28, 97.72, and 77.32%, respectively, as the datasets have different features.

Using supervised learning algorithms, Chaurasia and Pal [16] conducted an experiment in which they used CART, decision tree, Naïve Bayes, RBF neural network and SVM with RBF kernel as classifiers to find the accuracy on the original Wisconsin dataset. They obtained an accuracy of approximately 97% with the SVM-RBF kernel.

An experiment was conducted by Chaurasia and Pal [17] on the Wisconsin breast cancer dataset taken from the UCI machine learning repository. The dataset has 683 rows and 10 columns, and they applied three classifiers, Sequential Minimal Optimization (SMO), IBK (k-nearest neighbors classifier) and BF Tree (best-first tree), for accurate prediction. They found that the SMO prediction rate, approximately 96%, is better than that of the other two.

Five classifiers were chosen to predict breast cancer as well as heart disease by Chaurasia and Pal [18]: decision tree, J48, simple logistic, RBF network and Naïve Bayes were used to find the accuracy along with TP rates, F-measure, precision, recall, area under the curve (AUC) and a set of errors.

Using an ensemble technique, Asri et al. [19] compared the performance of Naïve Bayes, k-NN, C4.5 and SVM, and found that SVM achieves the best accuracy on the breast cancer dataset.

Three popular algorithms (Naïve Bayes, RBF network and J48) were compared by Chaurasia V. [20] to identify the best among them. They used tenfold cross-validation for unbiased estimation of the classifiers and found accuracies of 97.36, 96.77 and 93.41%, respectively.

To demonstrate the accuracy of certain classifiers on a breast cancer dataset and compare them, Borah, Dhimal, and Sharma [22] conducted an experiment to determine which model is suitable for the given dataset.

Using four data mining classifiers, SVM, J48, k-NN and Naive Bayes, Shaikh and Ali [23] performed an experiment on two well-known breast cancer datasets, Wisconsin and Portuguese. They used the WEKA tool to conduct the experiment and applied MATLAB and WEKA to present their results. They obtained accuracy improvements from 92, 92, 96 and 97% to 97, 96, 97 and 97% on the Wisconsin dataset, and from 87, 80, 93 and 91% to 89, 90, 97 and 95% on the Portuguese breast cancer dataset.

Using WEKA, an algorithm analysis tool, Sri et al. [24] performed an experiment to find the accuracy of two classification techniques, decision tree and Bayesian algorithms, on the Wisconsin breast cancer dataset. The classification model is built on the training dataset and applied to the test set. On the basis of this model, they found that the decision tree achieved higher accuracy than the Bayesian algorithm.

A fuzzy inference-based classification system was used by Dutta et al. [25] to determine whether a tumor's growth converts into breast cancer, with the prediction based on the behavior of tumor genetics.

Methodology Used

Stacking

Figure 1 shows a basic block diagram of a stacked system for prediction. Stacked generalization was introduced by Wolpert [32].

• Stacking requires multiple base learning algorithms (L1, …, Ln) and runs on the dataset D (base-level data).

• It creates a sequence of models (m1, …, mn). It then creates a new dataset by replacing the instance description of each base-level instance with the predictions of the base-level models.


• In turn, this new meta-dataset is presented to a new learner Lmeta, which constructs the meta-model mmeta mapping the base learners' predictions to target classes.

• For final classification, a query instance p traverses all base learners to create a meta query instance p′, which is used as input to the meta-model.
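To make the two-level arrangement concrete, the following minimal sketch assembles it with scikit-learn's StackingClassifier, using the four base learners and the logistic-regression combiner adopted in this article. It is a sketch under assumptions, not the authors' exact code: scikit-learn's built-in diagnostic breast cancer data stands in for the original Wisconsin dataset, and all hyperparameters are library defaults.

```python
# Sketch of the stacking ensemble: four base learners whose predictions
# feed a logistic-regression combiner (meta-model).
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression, Perceptron
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)  # stand-in for the Wisconsin data

base_learners = [
    ("svm", make_pipeline(StandardScaler(), SVC())),
    ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier())),
    ("nb", GaussianNB()),
    ("perceptron", make_pipeline(StandardScaler(), Perceptron())),
]

# Base-learner predictions become the meta-features; logistic regression
# is the second-level combiner, trained on out-of-fold predictions.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(max_iter=1000),
                           cv=3)

scores = cross_val_score(stack, X, y, cv=3)  # threefold CV, as in this article
print("mean accuracy: %.3f" % scores.mean())
```

StackingClassifier trains the combiner on out-of-fold predictions of the base learners, which matches the cross-validated two-split scheme described in the introduction.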

Support Vector Machine

SVM, or support vector machine, is a linear model for classification and regression problems. It can handle linear and non-linear problems and works well for many practical tasks. The idea of SVM is simple: the algorithm creates a line or a hyperplane which separates the data into classes. Following the SVM approach, we find the points closest to the line from both classes; these points are called support vectors. We then compute the distance between the line and the support vectors, which is known as the margin. Our goal is to maximize the margin: the hyperplane for which the margin is greatest is the optimal hyperplane. Thus, SVM attempts to choose a decision boundary such that the separation between the two classes is as wide as reasonably possible.

K‑Nearest Neighbors

KNN is a simple supervised machine learning technique used for both classification and regression problems, though it is best known for classification. The k-nearest neighbors (KNN) algorithm relies on the principle that objects that are close in the input space are also close in the output space. It stores all cases and classifies new cases based on a majority vote of their k nearest neighbors, measured with a distance function. Common distance functions are Euclidean, Manhattan, Minkowski and Hamming, where Hamming is used for categorical variables and Euclidean, Minkowski and Manhattan are used for continuous variables. In the case of k = 1, the class of the single nearest neighbor is assigned. Choosing k can be difficult when implementing KNN. Generally, the Euclidean function is used in algorithms that work with different centroids, assigning every point to the group whose centroid is nearest [21].

Figure 2 shows how a new data point is assigned among the category labels of its neighbors.

The k-nearest neighbors algorithm uses the whole training dataset: training the model simply consists of retaining the training data. A prediction is made by finding the k most similar records in the training dataset and selecting the most common class value among them. For this purpose, the Euclidean distance is used to estimate the similarity between rows in the training dataset and new rows of data.
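A minimal sketch of this prediction rule, assuming NumPy arrays for the stored training data; the function name and the default k are illustrative choices:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=5):
    # Euclidean distance from the new row to every stored training row
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]      # indices of the k closest rows
    votes = Counter(y_train[nearest])    # majority vote among the neighbors
    return votes.most_common(1)[0][0]
```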

Naive Bayes

The Naive Bayes classifier is a linear classifier known for being simple yet effective. Its probabilistic model is based on Bayes' theorem, and the naive descriptor rests on the assumption that the features in the dataset are independent of one another. In practice, the independence assumption is often violated, yet the Naive Bayes classifier still tends to perform well under this unrealistic assumption. Especially for small sample sizes, it can beat more powerful alternatives. Relatively robust, easy to implement, fast

Fig. 1 Basic block diagram of the Stacking ensemble system


and accurate, Naive Bayes classifiers are used in a wide range of fields.

Perceptron

In 1957, Frank Rosenblatt introduced the concept of the perceptron. In his formulation, a perceptron is a unit of a neural network that performs a computation to detect features in the input data. The perceptron enables neurons to learn and process the elements of the training dataset; it is also known as the supervised learning of binary classifiers [26].

A perceptron receives multiple input signals, and whenever the weighted sum of these signals exceeds a certain threshold it produces an output signal; otherwise it produces no output. Using this framework, it is possible to predict the class of a sample (Fig. 3).

The perceptron model is a set of weights learned from the training data. To calculate the error value and train the weights, many predictions need to be made, so a function is needed for making predictions and another for training the model.
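A minimal sketch of these two functions, assuming the classic Rosenblatt update rule with a fixed learning rate; the names, learning rate, and epoch count are illustrative choices, not the authors' settings:

```python
import numpy as np

def predict(row, weights):
    # Weighted sum of inputs plus bias; fire (1) if at or above threshold 0
    activation = weights[0] + np.dot(row, weights[1:])
    return 1 if activation >= 0 else 0

def train_weights(X, y, lr=0.01, n_epochs=50):
    weights = np.zeros(X.shape[1] + 1)     # bias plus one weight per input
    for _ in range(n_epochs):
        for row, target in zip(X, y):
            error = target - predict(row, weights)
            weights[0] += lr * error           # update the bias from the error
            weights[1:] += lr * error * row    # update each input weight
    return weights
```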

Logistic Regression

LR is a machine learning classification technique that applies when the dependent variable is 0 or 1, with 1 for success and 0 for failure. The LR model predicts the probability P of Y = 1 as a function of X (Fig. 4). LR is a predictive analysis used to explain the relationship between a binary dependent variable and nominal, ordinal, interval, or ratio-level independent variables. It is the most important member of a class of models called generalized linear models [27].

Fig. 2 k-nearest neighbor classifier prediction on a test instance

Fig. 3 Perceptron rule


To represent the model, LR uses a set of weights known as coefficients. Similar to the perceptron technique, predictions on the training data are made and the coefficients are updated and learned iteratively.
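A minimal sketch of this iterative coefficient update, assuming plain stochastic gradient descent on the log-loss; as with the perceptron sketch above, the learning rate and epoch count are illustrative:

```python
import numpy as np

def predict_proba(row, coef):
    # Sigmoid of the weighted sum gives P(Y = 1 | X)
    return 1.0 / (1.0 + np.exp(-(coef[0] + np.dot(row, coef[1:]))))

def train_logistic(X, y, lr=0.1, n_epochs=100):
    coef = np.zeros(X.shape[1] + 1)
    for _ in range(n_epochs):
        for row, target in zip(X, y):
            p = predict_proba(row, coef)
            coef[0] += lr * (target - p)          # gradient step on the bias
            coef[1:] += lr * (target - p) * row   # and on each coefficient
    return coef
```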

Feature Selection

Feature selection is a way of improving model performance in machine learning. The performance of the model is influenced by the features of the data used to train it. In model design, data cleaning and feature selection are initial and important steps, because partially related or unrelated features may have a negative impact. Feature selection is the process of selecting those features that contribute the most to the output we are interested in. Uncorrelated feature data will reduce the accuracy of the model, since learning would otherwise be performed on uncorrelated features.

Univariate Selection

Univariate selection is used to choose the features that have the strongest relationship with the output variable. We use the χ² (chi-squared) test for feature selection to calculate χ² between each feature and the target, and select the desired number of features with the best χ² scores using the following formula:

χ² = Σᵢ₌₁ⁿ (Oᵢ − Eᵢ)² / Eᵢ,

where Oᵢ is the number of observations in class i and Eᵢ is the expected number of observations in class i if there were no relationship between the feature and the target.
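A minimal sketch of this selection step, assuming scikit-learn's SelectKBest with the chi2 score function; k = 5 matches the five attributes retained in this article, and the built-in diagnostic dataset again stands in for the original data:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_breast_cancer(return_X_y=True)  # non-negative features

# Score each feature against the target and keep the five with the
# highest chi-squared statistics, as done for Table 3.
selector = SelectKBest(score_func=chi2, k=5)
X_top5 = selector.fit_transform(X, y)
print(selector.scores_)  # per-feature chi-squared scores
```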

Extra Tree Classifier

The extra tree classifier extracts the important features of the dataset by using the model's feature importance, which provides a score for every feature; a higher score means the feature is more relevant to the output variable.

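A minimal sketch of this scoring step, assuming scikit-learn's ExtraTreesClassifier and its impurity-based feature_importances_; the estimator count and random seed are illustrative:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import ExtraTreesClassifier

X, y = load_breast_cancer(return_X_y=True)

# Fit the randomized tree ensemble and rank features by impurity-based
# importance; higher scores mean more relevant features (cf. Table 4).
model = ExtraTreesClassifier(n_estimators=100, random_state=0)
model.fit(X, y)
top5 = np.argsort(model.feature_importances_)[::-1][:5]
print(top5, model.feature_importances_[top5])
```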

Correlation Matrix with Heatmap

Correlation shows whether the features of the dataset are connected to the target variable. Correlation may be positive (increasing the value of one feature increases the value of the target) or negative (increasing the value of one feature decreases the value of the target). With a heat map, one can easily spot which features are most relevant to the target variable.
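A minimal sketch of this inspection, assuming pandas and seaborn; the color map is an arbitrary choice:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer(as_frame=True)
df = data.frame  # features plus the 'target' column

# Correlation of every attribute with every other, including the class;
# the target row/column shows which features track it most closely.
corr = df.corr()
sns.heatmap(corr, cmap="RdYlGn")
plt.show()
```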

Recursive Feature Elimination

As its name implies, recursive feature elimination (RFE) is the process of eliminating the weakest features until the specified number of features is reached. It is a feature selection technique built around fitting a model. In RFE it is not known initially how many features are valid, so a target number of features must be specified. To find the best possible number of features, cross-validation is applied with RFE to score different feature subsets and choose the top-scoring set of features.
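A minimal sketch, assuming scikit-learn's RFE with a logistic-regression estimator (the estimator choice is ours for illustration); support_ and ranking_ correspond to the "selected feature" and "feature ranking" columns reported in Table 6:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly fit the estimator and discard the weakest feature until
# only the requested number of features remains.
rfe = RFE(estimator=LogisticRegression(max_iter=5000),
          n_features_to_select=5)
rfe.fit(X, y)
print(rfe.support_)   # True marks the retained features
print(rfe.ranking_)   # rank 1 marks the selected features
```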

Random Forest

Random forest provides good predictive performance, easy interpretability and low overfitting. Owing to this interpretability, it is easy to compute the significance of each variable in the decision hierarchy and how many variables participate in a decision. Feature selection using random forest falls into the category of embedded methods, which combine the qualities of filter and wrapper techniques.
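A minimal sketch of this embedded selection, assuming scikit-learn's SelectFromModel wrapped around a random forest; capping max_features at five mirrors the five attributes kept in this article:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X, y = load_breast_cancer(return_X_y=True)

# Embedded selection: fit the forest once, then keep the five features
# with the highest impurity-based importances.
sfm = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=0),
    max_features=5, threshold=-np.inf)
X_reduced = sfm.fit_transform(X, y)
print(X_reduced.shape)  # five columns remain
```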

We can understand the overall approach of the methodol-ogy by looking at Fig. 5.

Datasets

Table 1 depicts the description of the Breast Cancer Wisconsin (Original) dataset, which contains 699 instances and ten attributes.

Results and Discussion

A k value of three is used for cross-validation, so each fold contains 699/3 = 233 records to be evaluated in each iteration, with all ten attributes included. Table 2 and Fig. 6 present the mean scores and the final scores.

Here, we used five feature selection techniques. Using these techniques, we reduced the dataset and selected five salient attributes from the ten attributes (see Table 7), which

Fig. 4 Logistic regression


is half of the total attributes, to see whether the accuracy of the stacked model is affected. We then apply the same stacked classifier to check whether its accuracy improves [28–30].

In Table 3, we used the χ² test, which requires non-negative features, to choose the five features most strongly dependent on the target from the dataset.

Fig. 5 Overall presentation of the methodology

Table 1 Description of Wisconsin breast cancer (original) dataset [31]

Input: Description

Clump thickness: Assesses if cells are mono- or multi-layered
Uniformity of cell size: Evaluates the consistency in size of the cells in the sample
Uniformity of cell shape: Estimates the equality of cell shapes and identifies marginal variances
Marginal adhesion: Quantifies how much cells on the outside of the epithelial tend to stick together
Single epithelial cell size: Relates to cell uniformity, determines if epithelial cells are significantly enlarged
Bare nuclei: Calculates the proportion of the number of cells not surrounded by cytoplasm to those that are
Bland chromatin: Rates the uniform "texture" of the nucleus in a range from fine to coarse
Normal nucleoli: Determines whether the nucleoli are small and barely visible or larger, more visible, and more plentiful
Mitoses: Describes the level of mitotic (cell reproduction) activity
Class: 2 for benign, 4 for malignant


We applied the extra tree (ET) classifier to estimate the top five features of the dataset (Table 4; Fig. 7).

In Fig. 8 and Table 5, we can easily identify the five most prominent features among the dataset attributes.

In Table 6, we use recursive feature elimination and select five features.

Figure 9 depicts the features ranked as important by the random forest algorithm.

Using five different feature selection techniques, we found that different features are selected by different techniques. We therefore have to select the five most prominent features overall. For this, we used an inspection method to decide the ranking across all features; this process is shown in Table 7.

The highest ranks were obtained by uniformity of cell shape, bare nuclei, uniformity of cell size, clump thickness and bland chromatin. We then applied the same stacking model to these five selected features of the dataset to obtain the improved accuracy of the classifiers. Figure 10 and Table 8 depict the result.

Confusion Matrix and Other Indicators

Confusion Matrix

Another indicator used to measure the performance of the classifier is the confusion matrix. When diagnosing breast cancer patients, each patient may be benign or malignant. In the confusion matrix, the columns represent predicted categories and the rows represent actual categories (see Table 9). We have the following four situations:

True positive (TP): Classifiers predict cases that are "malig-nant" and the cancer is actually malignant.

True negative (TN): Classifiers predict cases that are "benign" and the cancer is actually benign.

False positive (FP): The classifier predicts "malignant" but the cancer is actually a benign case.

False negative (FN): The classifier predicts "benign" but the cancer is actually malignant.

Accuracy

The accuracy can be obtained from the confusion matrix as the number of all correct predictions (TP + TN) divided by the total number of predictions (TP + TN + FP + FN):

Accuracy = (TN + TP) × 100∕(TP + TN + FP + FN).

Table 2 Performance analysis of the original dataset without feature selection using SVM, KNN, Naive Bayes, perceptron and stacked LR

% Accuracy of classifiers with 10 attributes

Accuracy of SVM: 98.800
Accuracy of KNN: 98.357
Accuracy of Naive Bayes: 96.985
Accuracy of perceptron: 98.049
Accuracy of stacked (logistic regression): 98.968


Fig. 6 % Accuracy of classifiers

Table 3 Univariate feature selection with 5 attributes

Attribute: Score

Bare nuclei: 1667.772372
Uniformity of cell size: 1387.106202
Uniformity of cell shape: 1289.040711
Normal nucleoli: 1151.664431
Marginal adhesion: 984.413887

Table 4 Extra tree feature selection with 5 attributes

Attribute: Score

Uniformity of cell size: 0.37438031
Bare nuclei: 0.26607926
Normal nucleoli: 0.12124661
Clump thickness: 0.08057467
Uniformity of cell shape: 0.05593402

Fig. 7 ET feature selection with 5 attributes


The accuracy computed from the confusion matrix is 97.14% with 10 attributes and 95.71% with 5 attributes.

Precision

This is the "accuracy" of the model and only the ability to return relevant instances. If the problem statement involves minimizing false positives, that is, in the current situation, if we do not want the model to mark malignant tumors as

Fig. 8 Heat map feature selection with 5 attributes

Table 5 Correlation (heat map) feature selection with 5 attributes

Attribute: Score

Uniformity of cell size: 0.82
Uniformity of cell shape: 0.82
Bare nuclei: 0.81
Bland chromatin: 0.76
Clump thickness: 0.72

Table 6 RFE feature selection with 5 attributes

Selected feature | Attribute name | Feature ranking

True | Clump thickness | 1
False | Uniformity of cell size | 4
True | Uniformity of cell shape | 1
True | Marginal adhesion | 1
False | Single epithelial cell size | 3
True | Bare nuclei | 1
True | Bland chromatin | 1

Fig. 9 Random forest feature selection with 5 attributes


malignant, then precision is required. This is obtained by using the following formula:

Precision = TP∕(TP + FP).

The precision values are 0.98 in both cases, i.e. with 10 attributes as well as with 5 attributes.

Recall

Recognizing all relevant instances is the model's ability, which in our dataset corresponds to minimizing the number of malignant tumors classified as benign. This is obtained by using the following formula:

Recall = TP∕(TP + FN).

We get a recall value of 0.93 with 10 attributes and 0.89 with 5 attributes.

F1 Measure

The harmonic mean of precision and recall is used to represent the balance between precision and recall with equal weight, and its range is 0–1. The F1 score reaches its best value at 1 and its worst value at 0. This is obtained by using the following formula:

F1 = (2 × Precision × Recall)∕(Precision + Recall).

Here, the F1 measure tends towards 1, as we get 0.95 with 10 attributes and 0.93 with 5 attributes.

Specificity

Specificity is also known as the "true negative rate": the higher the proportion of true negatives in the data, the higher the specificity. This is obtained by using the following formula:

Specificity = TN∕(TN + FP).

In both cases, specificity is 0.99.
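These formulas can be checked directly against the confusion-matrix counts reported in Table 9; a minimal sketch in Python (the counts are from the paper, the variable names are ours):

```python
# Counts from Table 9, dataset with 699 instances and 10 attributes
tn, fp, fn, tp = 94, 1, 3, 42

accuracy = (tp + tn) / (tp + tn + fp + fn)           # 136/140 = 0.9714
precision = tp / (tp + fp)                           # 42/43, about 0.98
recall = tp / (tp + fn)                              # 42/45, about 0.93
f1 = 2 * precision * recall / (precision + recall)   # about 0.95
specificity = tn / (tn + fp)                         # 94/95, about 0.99
print(accuracy, precision, recall, f1, specificity)
```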

ROC Curve

The area under the ROC curve measures the entire two-dimensional area under the curve. It measures the degree to which the parameters can distinguish the two diagnostic groups. Commonly used to assess the quality of classification models, the curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR). Table 10 reports the ROC results.


Table 7 Deciding the rank of each feature across the five feature selection techniques

Attribute | Univariate selection (KBest) | Extra tree classifier | Correlation matrix with heatmap | Recursive feature elimination | Random forest | Rank

Clump thickness | × | √ | √ | √ | × | 3
Uniformity of cell size | √ | √ | √ | × | √ | 4
Uniformity of cell shape | √ | √ | √ | √ | √ | 5
Marginal adhesion | √ | × | × | √ | × | 2
Single epithelial cell size | × | × | × | × | √ | 1
Bare nuclei | √ | √ | √ | √ | √ | 5
Bland chromatin | × | × | √ | √ | √ | 3
Normal nucleoli | √ | √ | × | × | × | 2
Mitoses | × | × | × | × | × | 0


Fig. 10 % Accuracy of classifiers

Table 8 Accuracy obtained with reduced 5 features

% Accuracy of classifiers with 5 attributes

Accuracy of SVM: 98.958
Accuracy of KNN: 98.768
Accuracy of Naive Bayes: 98.652
Accuracy of perceptron: 98.657
Accuracy of stacked (logistic regression): 99.968


Conclusion

In general, the research in this article is based on the accuracy of the stacked model on the breast cancer dataset and the accuracy obtained with the reduced features from different types of feature selection techniques. We have seen in this research article that the accuracies of SVM, KNN, Naive Bayes and perceptron with ten features are 98.8, 98.357, 96.985 and 98.049%, respectively, and the accuracy of LR (the stacked model) is 98.968%. The accuracy

Table 9 Metrics of confusion matrix

Metrics | Data set with 699 instances and 10 attributes | Data set with 699 instances and 5 attributes

Confusion matrix | True negatives: 94; False positives: 1; False negatives: 3; True positives: 42 | True negatives: 94; False positives: 1; False negatives: 5; True positives: 40

Table 10 ROC curve and its corresponding accuracy

Metrics | Data set with 699 instances and 10 attributes | Data set with 699 instances and 5 attributes

ROC curve | Logistic regression accuracy: 99.6% | Logistic regression accuracy: 99.4%


of each classifier obtained using the reduced features (that is, five features) is approximately 98.958, 98.768, 98.652 and 98.657% for SVM, KNN, Naive Bayes and perceptron, respectively, and the accuracy of the stacked model (LR) is 99.968%. Compared to using all ten features, this is an improvement. Compared with the other models described in the related work section, when the classifier is applied to the simplified dataset obtained by reducing features, we obtain improved accuracy of approximately 99.968%.

In clinical datasets, other indicators are also important, such as precision, recall, F1 score, and specificity (Tables 9 and 10). All of these indicators are critical to discovering the type of tumor in breast cancer patients. High precision of the model tells us the proportion of tumors classified as malignant that are actually malignant: the ratio of true positives (cases classified as malignant that are actually malignant) to all positives (all cases classified as malignant, regardless of whether their classification is correct). In cancer detection, recall should be high so that, even if we misclassify some non-cancerous cases as cancer, we correctly classify and identify the cancers, causing no significant harm. Sensitivity is the ability to identify all relevant instances. For specificity, the higher the true negative rate in the data, the higher the specificity. The ROC curve is a method to measure the performance of a binary classifier (LR); it is drawn by plotting the TPR, or recall, against the FPR. In Tables 9 and 10, we report accuracy (from the confusion matrix), precision of the classifiers, specificity, and the ROC curve (logistic regression accuracy). For the dataset with 699 instances and ten attributes, the values are 97%, 0.98, 0.99, and 99.6%, respectively, while for the dataset with 699 instances and five attributes they are 96%, 0.98, 0.99, and 99.4%. Therefore, we can conclude that datasets with fewer attributes do not adversely affect performance: we can remove those attributes that are not needed or lack information from the dataset.

Therefore, we can conclude that reducing the features of the dataset can improve the performance of the classifier. Using the above techniques, we can expand the scope of implementation to areas such as fraud detection in the banking sector and spam detection in email filtering datasets.

In future work, we will include more breast cancer datasets from different repositories. We know that stacking works better if the predictions of the base learners are weakly correlated, so we can measure the correlations between different sub-models. We can use different base learners and different aggregator algorithms to estimate the accuracy of the classifier and check whether it improves.

Funding No funding was received from any organization.

Compliance with Ethical Standards

Conflict of interest The authors declare that they have no conflict of interest.

Ethical approval This article does not contain any studies with human participants or animals performed by any of the authors.

References

1. World Health Organization. WHO PEN protocol 4.1: assessment and referral of women with suspected breast cancer at primary health care, 2010. Available at: http://www.who.int/entity/ncds/management/Protocol4_1_BreastCancerAssessment_and_referral.pdf?ua=1. Accessed 13 Sep 2017.

2. Aggarwal CC. Data mining: the textbook. Switzerland: Springer International Publishing; 2015. https://doi.org/10.1007/978-3-319-14142-8.

3. Aggarwal C. Outlier ensembles: position paper. ACM SIGKDD Explor Newsl. 2012;14(2):49–58.

4. Dash M, Choi K, Scheuermann P, Liu H. Feature selection for clustering - a filter solution. In: Proceedings of the 2002 IEEE international conference on data mining (ICDM 2002), Maebashi City, Japan. Los Alamitos: IEEE Computer Society Press; 2002. pp. 115–22.

5. Guyon I, Steve G, Masoud N, Zadeh LA. Feature extraction: foundations and applications, vol. 207. Springer; 2008. pp. 1–25.

6. Wolberg WH, Street WN, Mangasarian OL. Image analysis and machine learning applied to breast cancer diagnosis and prognosis. Anal Quant Cytol Histol. 1995;17(2):77–87.

7. Abbass H. An evolutionary artificial neural networks approach for breast cancer diagnosis. Artif Intell Med. 2002;25(3):265–81.

8. Mu T, Nandi AK. Breast cancer detection from FNA using SVM with different parameter tuning systems and SOM-RBF classifier. J Franklin Inst. 2007;344(3):285–311.

9. Thongkam J, Guandong X, Yanchun Z, Fuchun H. Breast cancer survivability via AdaBoost algorithms. In: Proceedings of the second Australasian workshop on health data and knowledge management, vol. 80; 2008. pp. 55–64.

10. Ya-Qin L, Wang C, Zhang L. Decision tree based predictive models for breast cancer survivability on imbalanced data. In: 2009 3rd international conference on bioinformatics and biomedical engineering. IEEE; 2009. pp. 1–4.

11. Karabatak M, Ince MC. An expert system for detection of breast cancer based on association rules and neural network. Expert Syst Appl. 2009;36(2):3465–9.

12. Akay MF. Support vector machines combined with feature selection for breast cancer diagnosis. Expert Syst Appl. 2009;36(2):3240–7.

13. Dong C, YiLong Y, XiuKun Y. Detecting malignant patients via modified boosted tree. Sci China Inf Sci. 2010;53(7):1369–78.

14. Marcano-Cedeño A, Quintanilla-Domnguez J, Andina D. WBCD breast cancer database classification applying artificial metaplasticity neural network. Expert Syst Appl. 2011;38(8):9573–9.

15. Salama GI, Abdelhalim M, Zeid MA. Breast cancer diagnosis on three different datasets using multi-classifiers. Breast Cancer (WDBC). 2012;32(569):2.


16. Chaurasia V, Pal S. Data mining techniques: to predict and resolve breast cancer survivability. Int J Comput Sci Mobile Comput. 2014;3:10–22.

17. Chaurasia V, Pal S. A novel approach for breast cancer detection using data mining techniques. Int J Innov Res Comput Commun Eng. 2014;2:2456–65.

18. Vikas C, Pal S. Performance analysis of data mining algorithms for diagnosis and prediction of heart and breast cancer disease. Rev Res. 2014;3:1–13.

19. Asri H, Mousannif H, Moatassime HA, Noel T. Using machine learning algorithms for breast cancer risk prediction and diagno-sis. Procedia Comput Sci. 2016;83:1064–9.

20. Chaurasia V, Pal S, Tiwari BB. Prediction of benign and malignant breast cancer using data mining techniques. J Algorithms Comput Technol. 2018;12(2):119–26. https://doi.org/10.1177/1748301818756225 (ISSN (Online): 1748-3026, UK).

21. Ramaswamy S, Rastogi R, Shim K. Efficient algorithms for mining outliers from large datasets. In: Proceedings of the 2000 ACM SIGMOD international conference on management of data, Dallas, USA; 2000. pp. 427–38.

22. Borah R, Dhimal S, Sharma K. Medical diagnostic models: an implementation of machine learning techniques for diagnosis in breast cancer patients. In: Advanced computational and communication paradigms. Singapore: Springer; 2018. pp. 395–405.

23. Shaikh TA, Ali R. Applying machine learning algorithms for early diagnosis and prediction of breast cancer risk. In: Proceedings of 2nd international conference on communication, computing and networking. Singapore: Springer; 2019.

24. Sri MN, Hari Priyanka JSVS, Sailaja D, Ramakrishna Murthy M. A comparative analysis of breast cancer data set using different classification methods. In: Smart intelligent computing and applications. Singapore: Springer; 2019. pp. 175–81.

25. Dutta S, Sujata G, Abhijit S, Rechik P, Rohit P, Rohit R. Cancer prediction based on fuzzy inference system. In: Smart innovations in communication and computational sciences. Singapore: Springer; 2019. pp. 127–36.

26. Morel D, Singh C, Levy WB. Linearization of excitatory synaptic integration at no extra cost. J Comput Neurosci. 2018;44(2):173–88. https://doi.org/10.1007/s10827-017-0673-5.

27. Hosmer D. Applied logistic regression. Hoboken, New Jersey: Wiley; 2013. (ISBN 978-0470582473).

28. Saghapour E, Saeed K, Mohammadreza S. A novel feature ranking method for prediction of cancer stages using proteomics data. PLoS One. 2017;12(9):e0184203.

29. Einicke GA. Maximum-entropy rate selection of features for classifying changes in knee and ankle dynamics during running. IEEE J Biomed Health Inf. 2018;28(4):1097–103.

30. Han K, Wang Y, Zhang C, Li C, Xu C. Autoencoder inspired unsupervised feature selection. In: IEEE international conference on acoustics, speech and signal processing (ICASSP); 2018.

31. Wolberg WH. Breast cancer Wisconsin (original) data set; 1992. Retrieved from http://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Original).

32. Vilalta R, Giraud-Carrier C, Brazdil P, Soares C. Using meta-learning to support data-mining. Intern J Comput Sci Appl. 2004;I(31):31–45.

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.